Hardwood

A parser for the Apache Parquet file format, optimized for minimal dependencies and great performance. Available as a Java library and a command-line tool.

Goals of the project are:

  • Lightweight: Implement the Parquet file format without any third-party dependencies other than those for compression algorithms (e.g. Snappy)
  • Correct: Support all Parquet files which are supported by the canonical parquet-java library
  • Fast: Be as fast as or faster than parquet-java
  • Complete: Add a Parquet file writer (after 1.0)

Quick Example

import java.time.Instant;
import java.time.LocalDate;

import dev.hardwood.InputFile;
import dev.hardwood.reader.ParquetFileReader;
import dev.hardwood.reader.RowReader;

// Both readers are closed automatically by try-with-resources.
try (ParquetFileReader fileReader = ParquetFileReader.open(InputFile.of(path));
    RowReader rowReader = fileReader.createRowReader()) {

    while (rowReader.hasNext()) {
        // Advance to the next row before reading its column values.
        rowReader.next();

        long id = rowReader.getLong("id");
        String name = rowReader.getString("name");
        LocalDate birthDate = rowReader.getDate("birth_date");
        Instant createdAt = rowReader.getTimestamp("created_at");
    }
}

See Getting Started for installation and setup.

Status

This is alpha-quality software under active development.

Currently, individual Parquet files must be at most 2 GB. Larger datasets should be split across multiple files and read via MultiFileParquetReader.
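
As a rough illustration of working within this limit, the sketch below checks file sizes before handing files to a reader. It uses only the JDK and is not part of Hardwood: the class name ParquetSizeCheck and its methods are hypothetical, and the threshold assumes that "2 GB" means 2 GiB (2^31 bytes), which may not match the exact bound the library enforces.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pre-flight size check; not part of the Hardwood API. */
final class ParquetSizeCheck {

    // Assumption: "2 GB" is read as 2 GiB (2^31 bytes). Adjust if the real limit differs.
    public static final long MAX_FILE_BYTES = 2L * 1024 * 1024 * 1024;

    private ParquetSizeCheck() {
    }

    /** Returns true if a file of the given size is within the assumed limit. */
    public static boolean fitsLimit(long sizeInBytes) {
        return sizeInBytes >= 0 && sizeInBytes <= MAX_FILE_BYTES;
    }

    /** Returns the paths whose on-disk size exceeds the assumed limit. */
    public static List<Path> findOversized(List<Path> files) throws IOException {
        List<Path> oversized = new ArrayList<>();
        for (Path file : files) {
            if (!fitsLimit(Files.size(file))) {
                oversized.add(file);
            }
        }
        return oversized;
    }

    public static void main(String[] args) throws IOException {
        List<Path> files = new ArrayList<>();
        for (String arg : args) {
            files.add(Path.of(arg));
        }
        // Report files that would need to be split before reading.
        for (Path file : findOversized(files)) {
            System.out.println("Exceeds assumed limit, split before reading: " + file);
        }
    }
}
```

Files reported by such a check would need to be rewritten as several smaller Parquet files before being read together.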

Package Structure

Hardwood is organized into public API packages and internal implementation packages:

| Package | Visibility | Purpose |
|---|---|---|
| dev.hardwood | Public API | Entry point for creating readers and managing shared resources (thread pool, decompressor pool). |
| dev.hardwood.reader | Public API | Single-file and multi-file readers for row-oriented and column-oriented access. |
| dev.hardwood.metadata | Public API | Parquet file metadata: row groups, column chunks, physical/logical types, and compression codecs. |
| dev.hardwood.schema | Public API | Schema representation: file schema, column schemas, and column projection. |
| dev.hardwood.row | Public API | Value types for nested data access: structs, lists, and maps. |
| dev.hardwood.avro | Public API | Avro GenericRecord support: schema conversion and row materialization (hardwood-avro module). |
| dev.hardwood.s3 | Public API | S3 object storage support: S3Source, S3InputFile, S3Credentials, S3CredentialsProvider (hardwood-s3 module, zero external dependencies). |
| dev.hardwood.aws.auth | Public API | Bridges the AWS SDK credential chain to Hardwood's S3CredentialsProvider (hardwood-aws-auth module, optional). |
| dev.hardwood.jfr | Public API | JFR event types emitted during file reading, decoding, and pipeline operations. |
| dev.hardwood.internal.* | Internal | Implementation details; not part of the public API and may change without notice. |