Reading from S3¶
Parquet files often live in cloud object storage rather than on local disk. The hardwood-s3 module lets you read them directly from Amazon S3 and S3-compatible services (Cloudflare R2, Google Cloud Storage via HMAC keys, MinIO) without downloading them first. Hardwood minimizes S3 requests by pre-fetching the file footer on open and coalescing column chunk reads within each row group; column projection and page-level predicate pushdown further reduce the amount of data transferred.
Read a file with static credentials:
import dev.hardwood.s3.S3Credentials;
import dev.hardwood.s3.S3Source;
import dev.hardwood.reader.ParquetFileReader;
import dev.hardwood.reader.RowReader;
S3Source source = S3Source.builder()
.region("us-east-1")
.credentials(S3Credentials.of("AKIA...", "secret"))
.build();
try (ParquetFileReader reader = ParquetFileReader.open(
source.inputFile("s3://my-bucket/data/trips.parquet"))) {
try (RowReader rows = reader.rowReader()) {
while (rows.hasNext()) {
rows.next();
long id = rows.getLong("id");
}
}
}
For dynamic or refreshable credentials, implement the S3CredentialsProvider functional interface:
S3Source source = S3Source.builder()
.region("us-east-1")
.credentials(() -> fetchCredentialsFromVault())
.build();
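If the upstream credentials are short-lived, the provider itself can cache and refresh them. A minimal sketch, assuming the provider returns an S3Credentials; the atomic holders, the 10-minute window, and fetchCredentialsFromVault() are illustrative, and the refresh is not strictly thread-safe:
import java.time.Duration;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;
// Illustrative cache of vault-issued credentials, refreshed every 10 minutes.
AtomicReference<S3Credentials> cached = new AtomicReference<>();
AtomicLong refreshAtMillis = new AtomicLong(0);
S3Source source = S3Source.builder()
    .region("us-east-1")
    .credentials(() -> {
        long now = System.currentTimeMillis();
        if (now >= refreshAtMillis.get()) {
            cached.set(fetchCredentialsFromVault()); // assumed helper
            refreshAtMillis.set(now + Duration.ofMinutes(10).toMillis());
        }
        return cached.get();
    })
    .build();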
For the full AWS credential chain (env vars, ~/.aws/credentials, EC2/ECS instance profile, SSO, web identity), add the optional hardwood-aws-auth module:
<dependency>
<groupId>dev.hardwood</groupId>
<artifactId>hardwood-aws-auth</artifactId>
</dependency>
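The Gradle equivalent (the version placeholder is illustrative; use the Hardwood version already in your build, or your build's dependency management):
implementation("dev.hardwood:hardwood-aws-auth:<version>")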
import dev.hardwood.aws.auth.SdkCredentialsProviders;
S3Source source = S3Source.builder()
.region("us-east-1")
.credentials(SdkCredentialsProviders.defaultChain())
.build();
Multiple Files¶
Read multiple files in the same bucket:
import dev.hardwood.Hardwood;
try (Hardwood hardwood = Hardwood.create();
ParquetFileReader parquet = hardwood.openAll(
source.inputFilesInBucket("my-bucket",
"data/part-001.parquet",
"data/part-002.parquet",
"data/part-003.parquet"));
RowReader reader = parquet.rowReader()) {
while (reader.hasNext()) {
reader.next();
// ...
}
}
Read multiple files across buckets:
hardwood.openAll(source.inputFiles(
"s3://bucket-a/events.parquet",
"s3://bucket-b/events.parquet"));
S3-Compatible Services¶
Set a custom endpoint for S3-compatible services:
// Cloudflare R2
S3Source source = S3Source.builder()
.endpoint("https://<account-id>.r2.cloudflarestorage.com")
.credentials(S3Credentials.of(accessKeyId, secretKey))
.build();
// GCP Cloud Storage (HMAC keys)
S3Source source = S3Source.builder()
.endpoint("https://storage.googleapis.com")
.credentials(S3Credentials.of(hmacAccessId, hmacSecret))
.build();
// MinIO (path-style)
S3Source source = S3Source.builder()
.endpoint("http://localhost:9000")
.pathStyle(true)
.credentials(S3Credentials.of(accessKeyId, secretKey))
.build();
When a custom endpoint is set, the region can be omitted. Use .pathStyle(true) for services that require path-style access (e.g. MinIO, SeaweedFS).
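Once the source is configured, reading works exactly as shown earlier. A short end-to-end sketch against a local MinIO instance (the bucket and object names are placeholders):
S3Source source = S3Source.builder()
    .endpoint("http://localhost:9000")
    .pathStyle(true)
    .credentials(S3Credentials.of(accessKeyId, secretKey))
    .build();
try (ParquetFileReader reader = ParquetFileReader.open(
        source.inputFile("s3://local-bucket/data/trips.parquet"))) {
    try (RowReader rows = reader.rowReader()) {
        while (rows.hasNext()) {
            rows.next();
            // read columns by name, e.g. rows.getLong("id")
        }
    }
}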
Configuration¶
Timeouts¶
The builder provides connect and request timeouts with sensible defaults:
S3Source source = S3Source.builder()
.region("us-east-1")
.credentials(S3Credentials.of("AKIA...", "secret"))
.connectTimeout(Duration.ofSeconds(5)) // default: 10s
.requestTimeout(Duration.ofSeconds(60)) // default: 30s
.build();
- Connect timeout — maximum time to establish a TCP connection (default 10 seconds).
- Request timeout — maximum time for an individual HTTP request to complete (default 30 seconds).
Retries¶
GET requests are automatically retried on transient failures (HTTP 500, 503, and network errors) with exponential backoff and jitter. The default maximum number of retries is 3:
S3Source source = S3Source.builder()
.region("us-east-1")
.credentials(S3Credentials.of("AKIA...", "secret"))
.maxRetries(5) // default: 3
.build();
Set maxRetries(0) to disable retries entirely.
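For example, a fail-fast configuration for tests where transient errors should surface immediately:
S3Source source = S3Source.builder()
    .region("us-east-1")
    .credentials(S3Credentials.of("AKIA...", "secret"))
    .maxRetries(0) // fail fast: no automatic retries
    .build();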
Custom HttpClient¶
For full control over connection pooling, proxy settings, or other transport configuration, pass a pre-configured HttpClient:
HttpClient client = HttpClient.newBuilder()
.connectTimeout(Duration.ofSeconds(5))
.build();
S3Source source = S3Source.builder()
.region("us-east-1")
.credentials(S3Credentials.of("AKIA...", "secret"))
.httpClient(client)
.build();
When a custom HttpClient is provided, the caller is responsible for closing it — S3Source.close() will not close the client. The connectTimeout builder option is ignored when a custom client is supplied.
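For example, routing all S3 traffic through an HTTP proxy only requires standard java.net.http configuration (the proxy host and port are placeholders):
import java.net.InetSocketAddress;
import java.net.ProxySelector;
import java.net.http.HttpClient;
// Route all requests through an internal proxy; the S3Source uses this client as-is.
HttpClient proxied = HttpClient.newBuilder()
    .proxy(ProxySelector.of(new InetSocketAddress("proxy.internal.example", 3128)))
    .connectTimeout(Duration.ofSeconds(5))
    .build();
S3Source source = S3Source.builder()
    .region("us-east-1")
    .credentials(S3Credentials.of("AKIA...", "secret"))
    .httpClient(proxied)
    .build();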
Column projection, row group filtering, and all other reader features work transparently with S3 files.
I/O Behavior¶
When reading from S3, Hardwood coalesces column chunk reads within each row group into as few HTTP requests as possible (typically 1-2 per row group). Column projection, page-level predicate pushdown, and maxRows all narrow the byte ranges before coalescing, reducing the amount of data transferred.
Note that without a maxRows limit, the reader fetches all projected (and filtered) column data for a row group as soon as it starts reading that row group. For large row groups (128 MB–1 GB is typical), this can transfer a significant amount of data even if the consumer stops reading early. To minimize transfer for partial reads, set maxRows on the reader; this truncates each column's fetch to only the pages covering the needed rows.
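As a rough illustration (assuming similarly sized columns), projecting 4 of 32 columns in a 512 MB row group still fetches on the order of 64 MB once the reader enters that row group; with a maxRows that covers only the first few pages of each column, the fetch shrinks to just those pages.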