Reading from S3

Parquet files often live in cloud object storage rather than on local disk. The hardwood-s3 module lets you read them directly from Amazon S3 and S3-compatible services (Cloudflare R2, GCP Cloud Storage via HMAC keys, MinIO) without downloading files first. Hardwood minimizes S3 requests by pre-fetching the file footer on open and coalescing column chunk reads within each row group. Column projection and page-level predicate pushdown further reduce the amount of data transferred.

<dependency>
    <groupId>dev.hardwood</groupId>
    <artifactId>hardwood-s3</artifactId>
</dependency>

Read a file with static credentials:

import dev.hardwood.s3.S3Credentials;
import dev.hardwood.s3.S3Source;
import dev.hardwood.reader.ParquetFileReader;
import dev.hardwood.reader.RowReader;

S3Source source = S3Source.builder()
        .region("us-east-1")
        .credentials(S3Credentials.of("AKIA...", "secret"))
        .build();

try (ParquetFileReader reader = ParquetFileReader.open(
        source.inputFile("s3://my-bucket/data/trips.parquet"))) {
    try (RowReader rows = reader.rowReader()) {
        while (rows.hasNext()) {
            rows.next();
            long id = rows.getLong("id");
        }
    }
}

For dynamic or refreshable credentials, implement the S3CredentialsProvider functional interface:

S3Source source = S3Source.builder()
        .region("us-east-1")
        .credentials(() -> fetchCredentialsFromVault())
        .build();
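
If fetching credentials is expensive (a round trip to a secrets store, say), it can help to cache the result for a short time instead of fetching on every invocation. The sketch below is a generic, hypothetical helper: CachedSupplier is not part of Hardwood, and it relies only on java.util.function.Supplier. How often the provider is invoked is not specified here, so treat the TTL as an assumption to tune.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

/** Hypothetical helper, not part of Hardwood: caches a value until a TTL expires. */
final class CachedSupplier<T> implements Supplier<T> {
    private final Supplier<T> loader;      // e.g. () -> fetchCredentialsFromVault()
    private final Duration ttl;            // how long a fetched value is reused
    private final Supplier<Instant> clock; // injectable time source, eases testing

    private T value;
    private Instant expiresAt = Instant.MIN; // forces a load on first use

    CachedSupplier(Supplier<T> loader, Duration ttl, Supplier<Instant> clock) {
        this.loader = loader;
        this.ttl = ttl;
        this.clock = clock;
    }

    @Override
    public synchronized T get() {
        Instant now = clock.get();
        if (!now.isBefore(expiresAt)) { // expired, or never loaded
            value = loader.get();
            expiresAt = now.plus(ttl);
        }
        return value;
    }
}
```

A provider could then delegate to an instance created with new CachedSupplier<>(() -> fetchCredentialsFromVault(), Duration.ofMinutes(5), Instant::now), for example via .credentials(cache::get). This assumes the provider's single method takes no arguments, as the lambda above suggests.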

For the full AWS credential chain (env vars, ~/.aws/credentials, EC2/ECS instance profile, SSO, web identity), add the optional hardwood-aws-auth module:

<dependency>
    <groupId>dev.hardwood</groupId>
    <artifactId>hardwood-aws-auth</artifactId>
</dependency>

import dev.hardwood.aws.auth.SdkCredentialsProviders;

S3Source source = S3Source.builder()
        .region("us-east-1")
        .credentials(SdkCredentialsProviders.defaultChain())
        .build();

Multiple Files

Read multiple files in the same bucket:

import dev.hardwood.Hardwood;

try (Hardwood hardwood = Hardwood.create();
     ParquetFileReader parquet = hardwood.openAll(
             source.inputFilesInBucket("my-bucket",
                     "data/part-001.parquet",
                     "data/part-002.parquet",
                     "data/part-003.parquet"));
     RowReader reader = parquet.rowReader()) {
    while (reader.hasNext()) {
        reader.next();
        // ...
    }
}

Read multiple files across buckets:

hardwood.openAll(source.inputFiles(
        "s3://bucket-a/events.parquet",
        "s3://bucket-b/events.parquet"));

S3-Compatible Services

Set a custom endpoint for S3-compatible services:

// Cloudflare R2
S3Source source = S3Source.builder()
        .endpoint("https://<account-id>.r2.cloudflarestorage.com")
        .credentials(S3Credentials.of(accessKeyId, secretKey))
        .build();

// GCP Cloud Storage (HMAC keys)
S3Source source = S3Source.builder()
        .endpoint("https://storage.googleapis.com")
        .credentials(S3Credentials.of(hmacAccessId, hmacSecret))
        .build();

// MinIO (path-style)
S3Source source = S3Source.builder()
        .endpoint("http://localhost:9000")
        .pathStyle(true)
        .credentials(S3Credentials.of(accessKeyId, secretKey))
        .build();

When a custom endpoint is set, region can be omitted. Use .pathStyle(true) for services that require path-style access (e.g. MinIO, SeaweedFS).
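
To make the difference between the two addressing styles concrete, the following sketch builds the object URL both ways. It is illustrative only and not Hardwood's implementation: S3AddressingSketch and objectUri are hypothetical names, and the code assumes the key needs no percent-encoding.

```java
import java.net.URI;

/** Illustrative only: how the two S3 addressing styles shape the request URL. */
final class S3AddressingSketch {
    static URI objectUri(URI endpoint, String bucket, String key, boolean pathStyle) {
        String host = endpoint.getHost();
        String port = endpoint.getPort() == -1 ? "" : ":" + endpoint.getPort();
        // Path-style keeps the endpoint host and puts the bucket first in the path;
        // virtual-hosted style moves the bucket into the host name instead.
        String authority = pathStyle ? host + port : bucket + "." + host + port;
        String path = pathStyle ? "/" + bucket + "/" + key : "/" + key;
        // Simplification: assumes the key contains no characters that need encoding.
        return URI.create(endpoint.getScheme() + "://" + authority + path);
    }
}
```

With a local endpoint such as http://localhost:9000, virtual-hosted addressing would need my-bucket.localhost to resolve, which is why services run this way usually want path-style access.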

Configuration

The builder exposes timeout, retry, and transport settings. The consolidated table of options and defaults is in the S3 Reference; the worked examples below cover the common cases.

Timeouts

The builder provides connect and request timeouts, which default to 10 seconds and 30 seconds respectively:

import java.time.Duration;

S3Source source = S3Source.builder()
        .region("us-east-1")
        .credentials(S3Credentials.of("AKIA...", "secret"))
        .connectTimeout(Duration.ofSeconds(5))   // default: 10s
        .requestTimeout(Duration.ofSeconds(60))   // default: 30s
        .build();

Retries

GET requests are automatically retried on transient failures (HTTP 500, 503, and network errors) with exponential backoff and jitter. The default maximum number of retries is 3:

S3Source source = S3Source.builder()
        .region("us-east-1")
        .credentials(S3Credentials.of("AKIA...", "secret"))
        .maxRetries(5)   // default: 3
        .build();

Set maxRetries(0) to disable retries entirely.

When the retry budget is exhausted, the last failure is re-thrown as an IOException to the caller. For HTTP errors the message has the form GET s3://<bucket>/<key> failed: HTTP <status>; for network errors the original IOException message is preserved.
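
For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of retrying with exponential backoff and full jitter. It is not Hardwood's code: RetrySketch, IoCall, and the baseDelay parameter are hypothetical, and for brevity it treats every IOException as transient rather than distinguishing HTTP 500/503 from other failures.

```java
import java.io.IOException;
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

/** Illustrative retry loop: exponential backoff with full jitter. */
final class RetrySketch {
    @FunctionalInterface
    interface IoCall<T> {
        T run() throws IOException;
    }

    static <T> T withRetries(IoCall<T> call, int maxRetries, Duration baseDelay)
            throws IOException, InterruptedException {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        IOException last = null;
        // One initial attempt plus up to maxRetries retries.
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return call.run();
            } catch (IOException e) {
                last = e;
                if (attempt == maxRetries) {
                    break; // retry budget exhausted
                }
                // The ceiling doubles each attempt: base, 2*base, 4*base, ...
                long ceilingMillis = baseDelay.toMillis() << attempt;
                // Full jitter: sleep a random time in [0, ceiling].
                Thread.sleep(ThreadLocalRandom.current().nextLong(ceilingMillis + 1));
            }
        }
        throw last; // re-throw the last failure to the caller
    }
}
```

With maxRetries set to 0 the loop body runs once and the first failure propagates, which mirrors the "disable retries" behavior described above.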

Custom HttpClient

For full control over connection pooling, proxy settings, or other transport configuration, pass a pre-configured HttpClient:

import java.net.http.HttpClient;
import java.time.Duration;

HttpClient client = HttpClient.newBuilder()
        .connectTimeout(Duration.ofSeconds(5))
        .build();

S3Source source = S3Source.builder()
        .region("us-east-1")
        .credentials(S3Credentials.of("AKIA...", "secret"))
        .httpClient(client)
        .build();

Caller-supplied HttpClient lifecycle

When a custom HttpClient is provided, the caller is responsible for closing it — S3Source.close() will not close the client. The connectTimeout builder option is also ignored when a custom client is supplied.
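
As an example of transport configuration, the sketch below builds a java.net.http.HttpClient that routes traffic through an HTTP proxy. Because the builder's connectTimeout is ignored for a caller-supplied client, the timeout is set on the client itself. ProxiedClientSketch and proxiedClient are hypothetical names; the proxy host and port are whatever your environment requires.

```java
import java.net.InetSocketAddress;
import java.net.ProxySelector;
import java.net.http.HttpClient;
import java.time.Duration;

/** Illustrative factory for an HttpClient that sends requests through a proxy. */
final class ProxiedClientSketch {
    static HttpClient proxiedClient(String proxyHost, int proxyPort) {
        return HttpClient.newBuilder()
                // Route every request through the given proxy.
                .proxy(ProxySelector.of(new InetSocketAddress(proxyHost, proxyPort)))
                // Set the connect timeout here, since S3Source will not apply its own.
                .connectTimeout(Duration.ofSeconds(5))
                .build();
    }
}
```

The returned client can be passed to .httpClient(...) as shown above; its lifecycle remains the caller's responsibility.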

Column projection, row group filtering, and all other reader features work transparently with S3 files.

I/O Behavior

When reading from S3, Hardwood coalesces column chunk reads within each row group into as few HTTP requests as possible (typically 1-2 per row group). Column projection, page-level predicate pushdown, and maxRows all narrow the byte ranges before coalescing, reducing the amount of data transferred.
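
The idea behind coalescing can be sketched in a few lines: sort the byte ranges to read, then merge neighbors whose gap is small enough that one larger request is cheaper than two. This is a generic illustration, not Hardwood's algorithm; RangeCoalescer, ByteRange, and the maxGap threshold are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative range coalescing: merge nearby byte ranges into fewer requests. */
final class RangeCoalescer {
    /** Half-open byte range [start, endExclusive). */
    record ByteRange(long start, long endExclusive) {}

    static List<ByteRange> coalesce(List<ByteRange> ranges, long maxGap) {
        List<ByteRange> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingLong(ByteRange::start));

        List<ByteRange> merged = new ArrayList<>();
        for (ByteRange next : sorted) {
            if (!merged.isEmpty()) {
                ByteRange last = merged.get(merged.size() - 1);
                // Overlapping ranges yield a negative gap and are merged as well.
                if (next.start() - last.endExclusive() <= maxGap) {
                    long end = Math.max(last.endExclusive(), next.endExclusive());
                    merged.set(merged.size() - 1, new ByteRange(last.start(), end));
                    continue;
                }
            }
            merged.add(next);
        }
        return merged;
    }
}
```

Each merged range corresponds to one ranged GET. Narrowing the input ranges first (through projection or predicate pushdown) shrinks what the merged requests have to cover.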

Without a maxRows limit, the reader fetches all projected (and filtered) column data for an entire row group as soon as reading of any column reaches that row group. For large row groups (128 MB–1 GB is typical), this can transfer significant data even if the consumer stops reading early. To minimize data transfer for partial reads, set maxRows on the reader: this truncates each column's fetch to only the pages covering the needed rows.
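
The effect of truncating a fetch to the pages that cover the needed rows can be illustrated with simple arithmetic. The sketch below is a simplified model, not Hardwood's implementation: MaxRowsSketch and bytesToFetch are hypothetical, and it assumes per-page row counts and sizes are known up front (for example from page index metadata).

```java
/** Illustrative model: bytes fetched for one column chunk under a row limit. */
final class MaxRowsSketch {
    /**
     * Sums the sizes of the leading pages until they cover maxRows rows.
     * pageRows[i] and pageBytes[i] describe page i of a single column chunk.
     */
    static long bytesToFetch(long[] pageRows, long[] pageBytes, long maxRows) {
        long rowsCovered = 0;
        long bytes = 0;
        for (int i = 0; i < pageRows.length && rowsCovered < maxRows; i++) {
            rowsCovered += pageRows[i]; // the whole page is fetched, even if partly needed
            bytes += pageBytes[i];
        }
        return bytes;
    }
}
```

For a chunk with three pages of 1,000 rows each, a limit of 1,500 rows covers the first two pages only, so the third page's bytes are never requested.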

Measuring Fetch Cost

Each S3InputFile tracks the network I/O it has performed, which is useful for confirming that column projection, predicate pushdown, and maxRows actually reduced transfer — and for estimating S3 cost (GET requests plus bytes transferred):

import dev.hardwood.s3.S3InputFile;

try (S3InputFile in = source.inputFile("s3://my-bucket/data/trips.parquet");
     ParquetFileReader reader = ParquetFileReader.open(in);
     RowReader rows = reader.rowReader()) {
    while (rows.hasNext()) {
        rows.next();
        // ...
    }
    System.out.printf("fetched %d bytes over %d requests%n",
            in.networkBytesFetched(), in.networkRequestCount());
}

Read the counters before the try-with-resources block exits, while the S3InputFile is still open. For the exact counting semantics of networkRequestCount() and networkBytesFetched(), see the S3 Reference.
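
To turn the two counters into a rough cost figure, multiply each by its unit price. The helper below is a hedged sketch: S3CostSketch and estimateUsd are hypothetical, prices are passed in by the caller because they vary by provider, region, and over time, and treating 1 GB as 2^30 bytes is an assumption of this example.

```java
/** Illustrative cost arithmetic: GET request charges plus data transfer charges. */
final class S3CostSketch {
    // Assumption for this sketch: 1 GB = 2^30 bytes.
    private static final double BYTES_PER_GB = 1L << 30;

    /**
     * @param requests       number of GET requests issued
     * @param bytes          number of bytes transferred
     * @param usdPer1000Gets caller-supplied price per 1,000 GET requests
     * @param usdPerGb       caller-supplied price per GB transferred (0 if transfer is free)
     */
    static double estimateUsd(long requests, long bytes, double usdPer1000Gets, double usdPerGb) {
        double requestCost = requests / 1000.0 * usdPer1000Gets;
        double transferCost = bytes / BYTES_PER_GB * usdPerGb;
        return requestCost + transferCost;
    }
}
```

It could be called with in.networkRequestCount() and in.networkBytesFetched() as the first two arguments, assuming both return integral values as the %d format specifiers above suggest.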

Not currently supported

Anonymous (unsigned) requests and requester-pays buckets are not supported — see the S3 Reference for details.