Hardwood
A lightweight Java reader for the Apache Parquet file format. Available as a Java library and a command-line tool.
Why Hardwood
Hardwood gives applications Parquet read support without pulling in Hadoop, Avro, or the wider parquet-java dependency tree:
- Light-weight: zero transitive dependencies beyond optional compression libraries (Snappy, ZSTD, LZ4, Brotli).
- Compatible: reads every file that parquet-java reads, with documented divergences where Hardwood applies stricter semantics (e.g. SQL three-valued notEq).
- Fast: matches or exceeds parquet-java's read throughput; competitive in native-image builds and short-lived JVMs.
- Concurrent: multi-threaded at the core, with pages decoding in parallel on a shared thread pool and cross-file prefetching for multi-file reads.
- Embeddable: usable from native CLIs, S3-only pipelines (without hadoop-aws), and Avro / Spark consumers via thin shim modules, including a drop-in parquet-java replacement.
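To make the "SQL three-valued notEq" divergence concrete, here is a minimal, self-contained sketch of the two semantics. It does not use Hardwood's or parquet-java's filter APIs; the class and method names (`NotEqSemantics`, `filterSql`, `filterNullInclusive`) are hypothetical and exist only for illustration. Under SQL three-valued logic, comparing NULL with a literal yields UNKNOWN, so a null row is not kept by a not-equals filter; the null-inclusive variant keeps it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative only: plain Java, no Hardwood or parquet-java classes involved.
final class NotEqSemantics {
    // SQL three-valued result: TRUE, FALSE, or UNKNOWN (a comparison involving NULL).
    enum Truth { TRUE, FALSE, UNKNOWN }

    // SQL-style "value <> literal": UNKNOWN when the value is null.
    static Truth sqlNotEq(Long value, long literal) {
        if (value == null) {
            return Truth.UNKNOWN;
        }
        return value != literal ? Truth.TRUE : Truth.FALSE;
    }

    // SQL semantics: a row passes the filter only when the predicate is TRUE.
    static List<Long> filterSql(List<Long> column, long literal) {
        return column.stream()
                .filter(v -> sqlNotEq(v, literal) == Truth.TRUE)
                .collect(Collectors.toList());
    }

    // Null-inclusive semantics: null is treated as "not equal" and passes.
    static List<Long> filterNullInclusive(List<Long> column, long literal) {
        return column.stream()
                .filter(v -> v == null || v != literal)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Long> column = Arrays.asList(5L, 7L, null);
        System.out.println(filterSql(column, 7L));           // prints [5]
        System.out.println(filterNullInclusive(column, 7L)); // prints [5, null]
    }
}
```

The practical consequence is that a not-equals predicate can return fewer rows under the stricter semantics whenever the column contains nulls; consult the documented divergences for the exact behavior Hardwood implements.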
Quick Example
import java.time.Instant;
import java.time.LocalDate;

import dev.hardwood.InputFile;
import dev.hardwood.reader.ParquetFileReader;
import dev.hardwood.reader.RowReader;

// `path` identifies the Parquet file to read.
// Both readers are closed automatically by try-with-resources.
try (ParquetFileReader fileReader = ParquetFileReader.open(InputFile.of(path));
     RowReader rowReader = fileReader.rowReader()) {
    while (rowReader.hasNext()) {
        // Advance to the next row, then read its columns by name.
        rowReader.next();
        long id = rowReader.getLong("id");
        String name = rowReader.getString("name");
        LocalDate birthDate = rowReader.getDate("birth_date");
        Instant createdAt = rowReader.getTimestamp("created_at");
    }
}
Ready? Install Hardwood, then read your first file end-to-end.
Status
Hardwood is beta-quality software under active development.
Reading from S3 or an in-memory ByteBuffer currently caps a file at 2 GB; split larger datasets across multiple files (local memory-mapped reads have no whole-file size limit). See Parquet file layout.
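As a rough sketch of how a caller might guard against the size cap before choosing an in-memory read, the check below compares a file's size to an assumed limit. The exact threshold Hardwood enforces is not stated here; the sketch assumes it matches `Integer.MAX_VALUE` bytes, the largest capacity a Java `ByteBuffer` can have. The class and method names (`SizeCapCheck`, `fitsUnderCap`) are hypothetical and not part of Hardwood.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative only: not part of the Hardwood API.
final class SizeCapCheck {
    // Assumption: the cap equals the maximum ByteBuffer capacity (int-indexed),
    // i.e. Integer.MAX_VALUE bytes, just under 2 GiB.
    static final long ASSUMED_CAP_BYTES = Integer.MAX_VALUE;

    // True when a file of this size fits under the assumed cap.
    static boolean fitsUnderCap(long sizeBytes) {
        return sizeBytes >= 0 && sizeBytes <= ASSUMED_CAP_BYTES;
    }

    // Convenience overload that looks up the size of a local file.
    static boolean fitsUnderCap(Path file) throws IOException {
        return fitsUnderCap(Files.size(file));
    }

    public static void main(String[] args) throws IOException {
        // An empty temporary file stands in for a real Parquet file.
        Path file = Files.createTempFile("example", ".parquet");
        try {
            if (fitsUnderCap(file)) {
                System.out.println("File fits under the assumed cap");
            } else {
                System.out.println("File exceeds the assumed cap; split it or read it memory-mapped");
            }
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

A check like this is only a pre-flight guard; for datasets above the cap, the options named above still apply: split the data across multiple files, or read locally via memory mapping.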
Roadmap
Forward-looking items tracked for post-1.0. None are committed to a specific release.
- Finalize ColumnReader API: stabilize the API for columnar access and move it out of "Experimental" state. (#522)
- Writer support: write Parquet files in addition to reading; today Hardwood is reader-only. (#9)
- Bloom filter predicate pushdown: use per-chunk bloom filters for equality-predicate skipping on high-cardinality columns, where min/max statistics can't help. (#105)
- Parquet Modular Encryption: read files encrypted under the Parquet Modular Encryption spec (encrypted footer, per-column keys, AES-GCM and AES-GCM-CTR). (#128)
- Apache Arrow interop: ColumnReader output as Arrow FieldVector / VectorSchemaRoot for zero-copy handoff to DuckDB, DataFusion, Pandas-via-JNI, and other Arrow-native consumers. (#153)
Getting help
- Questions, ideas, design discussion — GitHub Discussions. The best first stop for "how do I…", "is X possible…", or "what's the right way to…".
- Bug reports and feature requests — the GitHub issue tracker. Please check whether a similar issue already exists.
Talks & posts
- Hardwood: A New Parser for Apache Parquet — project announcement.
- Open Source Friday with Gunnar Morling — GitHub Open Source Friday.
- Chasing Efficient Java Development: From 1BRC to Developing Hardwood AI Natively — InfoQ podcast on building Hardwood.