Compatibility Philosophy¶
Hardwood aims to read every file that Apache parquet-java reads, while in a few specific places applying stricter semantics than parquet-java does. That combination is deliberate, and the principle behind it makes the individual divergences read as a consistent stance rather than a list of quirks.
For the drop-in API and its exact behavior, see parquet-java Compatibility; for the filter semantics, see Filter, Project, Limit, and Split.
The principle: liberal on input, strict on semantics¶
Two goals are in tension, and Hardwood resolves them on different axes:
- Read what's out there. Parquet files in the wild were written by many tools across many
years — parquet-mr, Arrow, Spark, Hive, PyArrow — using legacy encodings, deprecated types,
and annotations that newer specs dropped. Hardwood reads these transparently: legacy 2-level
list encodings, INT96 timestamps, the legacy
converted_typeannotations that predate the modern logical-type union,NULL-typed columns, FLOAT16. On the input side, Hardwood is liberal — if parquet-java can read it, Hardwood aims to. - Don't silently produce wrong results. On the semantic side, where a behavior is a matter of correctness rather than file compatibility, Hardwood chooses the well-defined answer even when that diverges from parquet-java's historical behavior. This follows the project's fail-early, no-silent-wrong-results stance: a surprising-but-correct result beats a familiar-but-wrong one.
Liberal on what bytes it accepts; strict on what those bytes are defined to mean.
Where strictness surfaces¶
SQL three-valued logic for comparison predicates¶
The clearest case is notEq and its siblings. Hardwood applies uniform SQL three-valued logic
to every comparison predicate: any comparison against a null column value yields UNKNOWN, and
UNKNOWN rows are not returned. So notEq("x", v) does not return rows where x is null —
just as eq, lt, and the rest don't.
parquet-java treats null <> v as true and so includes null rows in notEq, which breaks the
SQL identity not(gt(x, v)) ≡ ltEq(x, v) on null rows. Hardwood preserves that identity across
all operators, including under not(...). The cost is that a user porting a notEq filter from
parquet-java may see fewer rows; the benefit is that filter algebra behaves predictably and
matches what a SQL engine would do.
The looser behavior is one explicit composition away when you want it:
// parquet-java's notEq semantics: "not equal to v, OR null"
FilterPredicate.or(FilterPredicate.notEq("x", v), FilterPredicate.isNull("x"));
Making null-inclusion explicit, rather than hiding it inside one operator's special case, is the point — the reader can see in the code which rows the filter admits.
Schema validation across multiple files¶
When a reader spans multiple files, the first file's schema is the reference, and every
subsequent file is validated against it as it is opened: each projected column must exist with
a matching physical type, logical type, and repetition type, or a SchemaIncompatibleException
is thrown up front. A silently mismatched column that produced garbage values mid-stream would be
the worse outcome. Non-projected columns aren't checked, so files may carry extra columns — again
liberal on the parts that don't affect correctness, strict on the parts that do. See
Read Multiple Files as One Dataset.
The drop-in compat module¶
For callers who want parquet-java's API without rewriting code, the
hardwood-parquet-java-compat module implements the org.apache.parquet.* interfaces backed by
Hardwood. It is an API-compatibility shim, not a behavioral clone: filters routed through it
still evaluate under Hardwood's semantics. The module is mutually exclusive with parquet-java on
the classpath — see parquet-java Compatibility.
Further reading¶
- parquet-java Compatibility — the drop-in API surface and its constraints.
- Filter, Project, Limit, and Split — the full null semantics and the supported predicate surface.