Reading from S3¶
Parquet files often live in cloud object storage rather than on local disk. The hardwood-s3 module lets you read them directly from Amazon S3 and S3-compatible services (Cloudflare R2, Google Cloud Storage via HMAC keys, MinIO) without downloading files first. Hardwood minimizes S3 requests by pre-fetching the file footer on open and coalescing column chunk reads within each row group. Column projection and page-level predicate pushdown further reduce the amount of data transferred.
Read a file with static credentials:
import dev.hardwood.s3.S3Credentials;
import dev.hardwood.s3.S3Source;
import dev.hardwood.reader.ParquetFileReader;
import dev.hardwood.reader.RowReader;
S3Source source = S3Source.builder()
.region("us-east-1")
.credentials(S3Credentials.of("AKIA...", "secret"))
.build();
try (ParquetFileReader reader = ParquetFileReader.open(
source.inputFile("s3://my-bucket/data/trips.parquet"))) {
try (RowReader rows = reader.createRowReader()) {
while (rows.hasNext()) {
rows.next();
long id = rows.getLong("id");
}
}
}
For dynamic or refreshable credentials, implement the S3CredentialsProvider functional interface:
S3Source source = S3Source.builder()
.region("us-east-1")
.credentials(() -> fetchCredentialsFromVault())
.build();
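A provider is invoked on demand, so an implementation that calls an external secrets store usually wants to cache the result rather than fetch on every invocation. The sketch below is illustrative only: it uses a stand-in functional interface and a TTL chosen for the example, not Hardwood's actual S3CredentialsProvider type.

```java
import java.time.Duration;
import java.time.Instant;

// Stand-in for a credentials provider functional interface (illustrative).
interface CredentialsSupplier {
    String[] get(); // access key id, secret key
}

// Wraps a slow supplier (e.g. a Vault call) and caches its result for a TTL.
class CachingSupplier implements CredentialsSupplier {
    private final CredentialsSupplier delegate;
    private final Duration ttl;
    private String[] cached;
    private Instant fetchedAt = Instant.MIN;
    int fetchCount = 0; // exposed so the demo can show caching in action

    CachingSupplier(CredentialsSupplier delegate, Duration ttl) {
        this.delegate = delegate;
        this.ttl = ttl;
    }

    @Override
    public synchronized String[] get() {
        // Refetch only when nothing is cached or the cached value has expired.
        if (cached == null || Instant.now().isAfter(fetchedAt.plus(ttl))) {
            cached = delegate.get();
            fetchedAt = Instant.now();
            fetchCount++;
        }
        return cached;
    }
}

public class CachingDemo {
    public static void main(String[] args) {
        CachingSupplier provider = new CachingSupplier(
                () -> new String[] {"AKIA...", "secret"}, Duration.ofMinutes(5));
        provider.get();
        provider.get(); // served from cache; the delegate is not called again
        System.out.println(provider.fetchCount); // prints 1
    }
}
```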
For the full AWS credential chain (env vars, ~/.aws/credentials, EC2/ECS instance profile, SSO, web identity), add the optional hardwood-aws-auth module:
<dependency>
<groupId>dev.hardwood</groupId>
<artifactId>hardwood-aws-auth</artifactId>
</dependency>
import dev.hardwood.aws.auth.SdkCredentialsProviders;
S3Source source = S3Source.builder()
.region("us-east-1")
.credentials(SdkCredentialsProviders.defaultChain())
.build();
Multiple Files¶
Read multiple files in the same bucket:
import dev.hardwood.Hardwood;
try (Hardwood hardwood = Hardwood.create();
MultiFileParquetReader parquet = hardwood.openAll(
source.inputFilesInBucket("my-bucket",
"data/part-001.parquet",
"data/part-002.parquet",
"data/part-003.parquet"));
MultiFileRowReader reader = parquet.createRowReader()) {
while (reader.hasNext()) {
reader.next();
// ...
}
}
Read multiple files across buckets:
hardwood.openAll(source.inputFiles(
"s3://bucket-a/events.parquet",
"s3://bucket-b/events.parquet"));
S3-Compatible Services¶
Set a custom endpoint for S3-compatible services:
// Cloudflare R2
S3Source source = S3Source.builder()
.endpoint("https://<account-id>.r2.cloudflarestorage.com")
.credentials(S3Credentials.of(accessKeyId, secretKey))
.build();
// GCP Cloud Storage (HMAC keys)
S3Source source = S3Source.builder()
.endpoint("https://storage.googleapis.com")
.credentials(S3Credentials.of(hmacAccessId, hmacSecret))
.build();
// MinIO (path-style)
S3Source source = S3Source.builder()
.endpoint("http://localhost:9000")
.pathStyle(true)
.credentials(S3Credentials.of(accessKeyId, secretKey))
.build();
When a custom endpoint is set, the region can be omitted. Use .pathStyle(true) for services that require path-style access (e.g. MinIO, SeaweedFS).
Configuration¶
Timeouts¶
The builder provides connect and request timeouts with sensible defaults:
S3Source source = S3Source.builder()
.region("us-east-1")
.credentials(S3Credentials.of("AKIA...", "secret"))
.connectTimeout(Duration.ofSeconds(5)) // default: 10s
.requestTimeout(Duration.ofSeconds(60)) // default: 30s
.build();
- Connect timeout — maximum time to establish a TCP connection (default 10 seconds).
- Request timeout — maximum time for an individual HTTP request to complete (default 30 seconds).
Retries¶
GET requests are automatically retried on transient failures (HTTP 500, 503, and network errors) with exponential backoff and jitter. The default maximum number of retries is 3:
S3Source source = S3Source.builder()
.region("us-east-1")
.credentials(S3Credentials.of("AKIA...", "secret"))
.maxRetries(5) // default: 3
.build();
Set maxRetries(0) to disable retries entirely.
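Hardwood's exact backoff schedule isn't specified here; the snippet below is a generic sketch of exponential backoff with full jitter — a base delay that doubles per attempt, capped, with the actual sleep drawn uniformly from [0, cap) so that concurrent clients don't retry in lockstep.

```java
import java.util.concurrent.ThreadLocalRandom;

public class Backoff {
    // Delay before retry attempt n (0-based): a random value in
    // [0, min(maxMillis, baseMillis * 2^n)). Drawing the whole delay at
    // random ("full jitter") spreads out retries from many clients.
    static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        long cap = Math.min(maxMillis, baseMillis << Math.min(attempt, 30));
        return ThreadLocalRandom.current().nextLong(cap);
    }

    public static void main(String[] args) {
        // With base 100 ms and a 30 s cap, the per-attempt ceilings grow
        // 100 ms, 200 ms, 400 ms, ...
        for (int attempt = 0; attempt < 3; attempt++) {
            System.out.printf("attempt %d: sleep up to %d ms%n",
                    attempt, Math.min(30_000, 100L << attempt));
        }
    }
}
```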
Custom HttpClient¶
For full control over connection pooling, proxy settings, or other transport configuration, pass a pre-configured HttpClient:
HttpClient client = HttpClient.newBuilder()
.connectTimeout(Duration.ofSeconds(5))
.build();
S3Source source = S3Source.builder()
.region("us-east-1")
.credentials(S3Credentials.of("AKIA...", "secret"))
.httpClient(client)
.build();
When a custom HttpClient is provided, the caller is responsible for closing it — S3Source.close() will not close the client. The connectTimeout builder option is ignored when a custom client is supplied.
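For example, to route all requests through an HTTP proxy (the proxy host and port below are placeholders), configure the standard java.net.http.HttpClient and hand it to the builder:

```java
import java.net.InetSocketAddress;
import java.net.ProxySelector;
import java.net.http.HttpClient;
import java.time.Duration;

public class ProxyClient {
    public static void main(String[] args) {
        // Route requests through a proxy; TLS and executor settings can be
        // configured on the same builder.
        HttpClient client = HttpClient.newBuilder()
                .proxy(ProxySelector.of(new InetSocketAddress("proxy.internal", 3128)))
                .connectTimeout(Duration.ofSeconds(5))
                .build();
        System.out.println(client.proxy().isPresent()); // prints true
        // Then: S3Source.builder()...httpClient(client).build();
    }
}
```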
Column projection, row group filtering, and all other reader features work transparently with S3 files.
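As an illustration, projecting a subset of columns on an S3-backed file means only those columns' chunks are requested from the object store. The Projection.of(...) call below is hypothetical shorthand, not a confirmed Hardwood API; consult the reader documentation for the actual projection mechanism.

```java
// Hypothetical projection call for illustration: with only "id" and "fare"
// projected, the other columns' chunks are never fetched from S3.
try (ParquetFileReader reader = ParquetFileReader.open(
        source.inputFile("s3://my-bucket/data/trips.parquet"));
     RowReader rows = reader.createRowReader(Projection.of("id", "fare"))) {
    while (rows.hasNext()) {
        rows.next();
        long id = rows.getLong("id");
    }
}
```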