Hardwood

A lightweight Java reader for the Apache Parquet file format. Available as a Java library and a command-line tool.

Why Hardwood

Hardwood gives applications Parquet read support without pulling in Hadoop, Avro, or the wider parquet-java dependency tree:

Lightweight: zero transitive dependencies beyond optional compression libraries (Snappy, ZSTD, LZ4, Brotli).
Compatible: reads every file that parquet-java reads, with documented divergences where Hardwood applies stricter semantics (e.g. SQL three-valued notEq).
Fast: matches or exceeds parquet-java's read throughput; competitive in native-image builds and short-lived JVMs.
Concurrent: multi-threaded at the core. Pages decode in parallel on a shared thread pool, with cross-file prefetching for multi-file reads.
Embeddable: usable from native CLIs, S3-only pipelines (without hadoop-aws), and Avro / Spark consumers via thin shim modules, including a drop-in parquet-java replacement.
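To make the "SQL three-valued notEq" divergence concrete, here is a minimal sketch in plain Java. It does not use Hardwood's filter API; the class `ThreeValuedNotEq`, the `Truth` enum, and both methods are hypothetical names introduced only for illustration. It shows the general SQL rule that comparing NULL with a literal yields UNKNOWN, and that a filter keeps a row only when the predicate is TRUE.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: hypothetical names, not part of Hardwood.
final class ThreeValuedNotEq {

    /** Possible outcomes of a predicate under SQL three-valued logic. */
    enum Truth { TRUE, FALSE, UNKNOWN }

    /** SQL-style "column <> literal": a NULL column value yields UNKNOWN, not TRUE. */
    static Truth notEq(Long columnValue, long literal) {
        if (columnValue == null) {
            return Truth.UNKNOWN;
        }
        return columnValue != literal ? Truth.TRUE : Truth.FALSE;
    }

    /** A filter keeps a row only when the predicate evaluates to TRUE. */
    static List<Long> filterNotEq(List<Long> column, long literal) {
        List<Long> kept = new ArrayList<>();
        for (Long value : column) {
            if (notEq(value, literal) == Truth.TRUE) {
                kept.add(value);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Long> ids = Arrays.asList(1L, null, 2L, 1L);
        // The null row is dropped because NULL <> 1 is UNKNOWN.
        System.out.println(filterNotEq(ids, 1L)); // prints [2]
    }
}
```

Under a two-valued evaluation, the null row would count as "not equal to 1" and be kept, which is the kind of difference the documented divergences describe.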

Quick Example

import java.time.Instant;
import java.time.LocalDate;

import dev.hardwood.InputFile;
import dev.hardwood.reader.ParquetFileReader;
import dev.hardwood.reader.RowReader;

// `path` locates the Parquet file and is defined elsewhere.
// try-with-resources closes both readers when the block exits.
try (ParquetFileReader fileReader = ParquetFileReader.open(InputFile.of(path));
    RowReader rowReader = fileReader.rowReader()) {

    while (rowReader.hasNext()) {
        // Advance to the next row before reading its columns.
        rowReader.next();

        long id = rowReader.getLong("id");
        String name = rowReader.getString("name");
        LocalDate birthDate = rowReader.getDate("birth_date");
        Instant createdAt = rowReader.getTimestamp("created_at");
    }
}

Ready? Install Hardwood, then read your first file end-to-end.

Status

Hardwood is beta-quality software under active development.

Reading from S3 or an in-memory ByteBuffer currently caps a file at 2 GB; split larger datasets across multiple files (local memory-mapped reads have no whole-file size limit). See Parquet file layout.
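A pre-flight size check can catch oversized files before they reach the S3 or ByteBuffer path. The sketch below uses only the JDK and is not part of Hardwood; `BufferSizeCheck` and its methods are hypothetical names. It also assumes that the 2 GB cap corresponds to `Integer.MAX_VALUE` bytes, the largest capacity a single `java.nio.ByteBuffer` can address. The text above states only "2 GB", so treat the exact threshold as an assumption.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustration only: hypothetical helper, not part of Hardwood.
final class BufferSizeCheck {

    // Assumption: the cap equals the maximum capacity of one ByteBuffer,
    // which is indexed by int (2^31 - 1 bytes).
    static final long MAX_BUFFER_BYTES = Integer.MAX_VALUE;

    /** Returns true if a file of the given size fits under the assumed cap. */
    static boolean fitsInSingleBuffer(long fileSizeBytes) {
        return fileSizeBytes >= 0 && fileSizeBytes <= MAX_BUFFER_BYTES;
    }

    /** Convenience overload that looks up the size of a local file. */
    static boolean fitsInSingleBuffer(Path file) throws IOException {
        return fitsInSingleBuffer(Files.size(file));
    }

    public static void main(String[] args) {
        long oneGiB = 1L << 30;
        System.out.println(fitsInSingleBuffer(oneGiB));     // prints true
        System.out.println(fitsInSingleBuffer(3 * oneGiB)); // prints false
    }
}
```

A pipeline could run such a check when writing datasets and roll over to a new file before the threshold is reached.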

Roadmap

Forward-looking items tracked for post-1.0. None are committed to a specific release.

Finalize ColumnReader API — stabilize the API for columnar access and move it out of "Experimental" state. (#522)
Writer support — write Parquet files in addition to reading; today Hardwood is reader-only. (#9)
Bloom filter predicate pushdown — use per-chunk bloom filters for equality-predicate skipping on high-cardinality columns, where min/max statistics can't help. (#105)
Parquet Modular Encryption — read files encrypted under the Parquet Modular Encryption spec: encrypted footer, per-column keys, AES-GCM and AES-GCM-CTR. (#128)
Apache Arrow interop — ColumnReader output as Arrow FieldVector / VectorSchemaRoot for zero-copy handoff to DuckDB, DataFusion, Pandas-via-JNI, and other Arrow-native consumers. (#153)
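To show why bloom filters help where min/max statistics cannot, here is a simplified sketch. It is not Hardwood code and not the Parquet on-disk format: the Parquet specification defines split-block bloom filters hashed with xxHash, whereas this example uses a plain bloom filter over `String.hashCode()`. `ChunkBloomFilter` and `minMaxMayContain` are hypothetical names.

```java
import java.util.BitSet;

// Illustration only: a plain bloom filter, not Parquet's split-block variant.
final class ChunkBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    ChunkBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /** Records a value that occurs in the column chunk. */
    void add(String value) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(value, i));
        }
    }

    /** False means the value is definitely absent, so the chunk can be skipped. */
    boolean mightContain(String value) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(value, i))) {
                return false;
            }
        }
        return true;
    }

    // Double hashing: derive the i-th bit position from two base hashes.
    private int index(String value, int i) {
        int h1 = value.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1;
        return Math.floorMod(h1 + i * h2, size);
    }

    /** Min/max statistics can only rule out values outside [min, max]. */
    static boolean minMaxMayContain(String min, String max, String probe) {
        return probe.compareTo(min) >= 0 && probe.compareTo(max) <= 0;
    }

    public static void main(String[] args) {
        ChunkBloomFilter filter = new ChunkBloomFilter(1024, 3);
        filter.add("apple");
        filter.add("zebra");

        // "mango" lies between min and max, so statistics cannot skip the chunk.
        System.out.println(minMaxMayContain("apple", "zebra", "mango")); // prints true
        // The bloom filter can usually rule it out; false positives remain possible.
        System.out.println(filter.mightContain("mango"));
        // Values that were added are never reported as absent.
        System.out.println(filter.mightContain("apple")); // prints true
    }
}
```

The key property for predicate pushdown is the absence of false negatives: a chunk is skipped only when the filter proves the value is not present.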

Getting help

Questions, ideas, design discussion — GitHub Discussions. The best first stop for "how do I…", "is X possible…", or "what's the right way to…".
Bug reports and feature requests — the GitHub issue tracker. Please check whether a similar issue already exists.

Talks & posts

Hardwood: A New Parser for Apache Parquet — project announcement.
Open Source Friday with Gunnar Morling — GitHub Open Source Friday.
Chasing Efficient Java Development: From 1BRC to Developing Hardwood AI Natively — InfoQ podcast on building Hardwood.