Hardwood
A parser for the Apache Parquet file format, optimized for minimal dependencies and great performance. Available as a Java library and a command-line tool.
The project's goals are:
- Lightweight: Implement the Parquet file format without any third-party dependencies other than those for compression algorithms (e.g. Snappy)
- Correct: Support all Parquet files that the canonical parquet-java library supports
- Fast: Be as fast as or faster than parquet-java
- Complete: Add a Parquet file writer (after 1.0)
Quick Example
import java.time.Instant;
import java.time.LocalDate;

import dev.hardwood.InputFile;
import dev.hardwood.reader.ParquetFileReader;
import dev.hardwood.reader.RowReader;

// `path` refers to the Parquet file to read
try (ParquetFileReader fileReader = ParquetFileReader.open(InputFile.of(path));
     RowReader rowReader = fileReader.createRowReader()) {
    while (rowReader.hasNext()) {
        // Advance to the next row, then read its columns by name
        rowReader.next();
        long id = rowReader.getLong("id");
        String name = rowReader.getString("name");
        LocalDate birthDate = rowReader.getDate("birth_date");
        Instant createdAt = rowReader.getTimestamp("created_at");
    }
}
See Getting Started for installation and setup.
Status
This is alpha-quality software under active development.
Currently, individual Parquet files must be at most 2 GB.
Larger datasets should be split across multiple files and read via MultiFileParquetReader.
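As a rough illustration of the size limit above, the following sketch flags files that are too large to be read as a single file. It uses only the Java standard library and does not call any Hardwood API. The class name `ParquetSizeCheck`, its methods, and the reading of "2 GB" as 2 GiB are assumptions made for this example; the exact threshold Hardwood enforces is not stated here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class ParquetSizeCheck {

    // Assumption: "2 GB" is read as 2 GiB (2 * 1024^3 bytes); the exact
    // threshold enforced by Hardwood may differ.
    static final long MAX_FILE_BYTES = 2L * 1024 * 1024 * 1024;

    /** Returns true if a file of the given size is within the assumed limit. */
    static boolean fitsLimit(long sizeInBytes) {
        return sizeInBytes <= MAX_FILE_BYTES;
    }

    /** Returns the files whose on-disk size exceeds the assumed limit. */
    static List<Path> findOversized(List<Path> files) throws IOException {
        List<Path> oversized = new ArrayList<>();
        for (Path file : files) {
            if (!fitsLimit(Files.size(file))) {
                oversized.add(file);
            }
        }
        return oversized;
    }

    public static void main(String[] args) throws IOException {
        // Treat every command-line argument as the path of a Parquet file
        List<Path> files = new ArrayList<>();
        for (String arg : args) {
            files.add(Path.of(arg));
        }
        for (Path file : findOversized(files)) {
            System.out.println("Exceeds the assumed single-file limit, consider splitting: " + file);
        }
    }
}
```

Running such a check before handing files to a reader makes it possible to catch oversized inputs early and split them upstream, rather than discovering the problem at read time.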
Package Structure
Hardwood is organized into public API packages and internal implementation packages:
| Package | Visibility | Purpose |
|---|---|---|
| dev.hardwood | Public API | Entry point for creating readers and managing shared resources (thread pool, decompressor pool). |
| dev.hardwood.reader | Public API | Single-file and multi-file readers for row-oriented and column-oriented access. |
| dev.hardwood.metadata | Public API | Parquet file metadata: row groups, column chunks, physical/logical types, and compression codecs. |
| dev.hardwood.schema | Public API | Schema representation: file schema, column schemas, and column projection. |
| dev.hardwood.row | Public API | Value types for nested data access: structs, lists, and maps. |
| dev.hardwood.avro | Public API | Avro GenericRecord support: schema conversion and row materialization (hardwood-avro module). |
| dev.hardwood.s3 | Public API | S3 object storage support: S3Source, S3InputFile, S3Credentials, S3CredentialsProvider (hardwood-s3 module, zero external dependencies). |
| dev.hardwood.aws.auth | Public API | Bridges the AWS SDK credential chain to Hardwood's S3CredentialsProvider (hardwood-aws-auth module, optional). |
| dev.hardwood.jfr | Public API | JFR event types emitted during file reading, decoding, and pipeline operations. |
| dev.hardwood.internal.* | Internal | Implementation details; not part of the public API and may change without notice. |