Avro Support
If your application already works with Avro records, for instance in a Kafka or Spark pipeline, you can read Parquet files directly into GenericRecord instances instead of using Hardwood's own row API. The hardwood-avro module handles the schema conversion and record materialization, matching the behavior of parquet-java's AvroReadSupport. To use it, add hardwood-avro as a dependency alongside hardwood-core.
Read rows as GenericRecord:
import dev.hardwood.InputFile;
import dev.hardwood.avro.AvroReaders;
import dev.hardwood.avro.AvroRowReader;
import dev.hardwood.reader.ParquetFileReader;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;

import java.util.List;

// `path` points at the Parquet file to read
try (ParquetFileReader fileReader = ParquetFileReader.open(InputFile.of(path));
     AvroRowReader reader = AvroReaders.rowReader(fileReader)) {

    // Avro schema converted from the file's Parquet schema
    Schema avroSchema = reader.getSchema();

    while (reader.hasNext()) {
        GenericRecord record = reader.next();

        // Access fields by name
        long id = (Long) record.get("id");
        String name = (String) record.get("name");

        // Nested structs are nested GenericRecords
        GenericRecord address = (GenericRecord) record.get("address");
        if (address != null) {
            String city = (String) address.get("city");
        }

        // Lists and maps use standard Java collections
        @SuppressWarnings("unchecked")
        List<String> tags = (List<String>) record.get("tags");
    }
}
AvroReaders supports all reader options: column projection, predicate pushdown, and their combination:
// With filter
AvroRowReader filtered = AvroReaders.buildRowReader(fileReader)
    .filter(FilterPredicate.gt("id", 1000L))
    .build();

// With projection
AvroRowReader projected = AvroReaders.buildRowReader(fileReader)
    .projection(ColumnProjection.columns("id", "name"))
    .build();

// With both
AvroRowReader filteredAndProjected = AvroReaders.buildRowReader(fileReader)
    .projection(ColumnProjection.columns("id", "name"))
    .filter(FilterPredicate.gt("id", 1000L))
    .build();
Values are stored in Avro's standard representations: timestamps as Long (milliseconds or microseconds since the epoch, depending on the column's time unit), dates as Integer (days since the epoch), decimals as ByteBuffer, and binary data as ByteBuffer. This matches the behavior of parquet-java's AvroReadSupport.
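As a rough illustration of what these representations mean for calling code, the sketch below converts the raw values into java.time and BigDecimal types using only the JDK. The class and method names are hypothetical and not part of Hardwood. It assumes that decimal bytes hold the unscaled value in big-endian two's complement, as the Avro decimal logical type specifies, and that the caller takes the scale from the field's schema.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.nio.ByteBuffer;
import java.time.Instant;
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Hypothetical helper, not part of Hardwood: turns Avro's raw value
// representations into richer JDK types.
final class AvroValueDecoding {

    private AvroValueDecoding() {
    }

    // Dates arrive as Integer: days since 1970-01-01.
    static LocalDate toLocalDate(int daysSinceEpoch) {
        return LocalDate.ofEpochDay(daysSinceEpoch);
    }

    // Timestamps with millisecond precision arrive as Long millis since the epoch.
    static Instant fromEpochMillis(long millis) {
        return Instant.ofEpochMilli(millis);
    }

    // Timestamps with microsecond precision arrive as Long micros since the epoch.
    static Instant fromEpochMicros(long micros) {
        return Instant.EPOCH.plus(micros, ChronoUnit.MICROS);
    }

    // Decimals arrive as a ByteBuffer holding the unscaled value
    // (assumed big-endian two's complement); the scale comes from the schema.
    static BigDecimal toBigDecimal(ByteBuffer unscaled, int scale) {
        ByteBuffer copy = unscaled.duplicate(); // leave the caller's position untouched
        byte[] bytes = new byte[copy.remaining()];
        copy.get(bytes);
        return new BigDecimal(new BigInteger(bytes), scale);
    }
}
```

A caller would typically pick the conversion based on the logical type of the field in the schema returned by getSchema(), for example calling toLocalDate((Integer) record.get("birth_date")) for a date column (the field name here is only illustrative).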
A Parquet column annotated with the NULL logical type (e.g. PyArrow's pa.null() columns) maps to a bare Avro null field rather than the usual nullable union [null, T]. That union would be illegal when T is itself null, because an Avro union may not contain the same type twice. The same collapse applies inside lists and maps: a list<null> element or map<string, null> value position becomes a bare null in the corresponding Avro array or map schema.
Lifecycle
AvroRowReader does not take ownership of the ParquetFileReader it wraps. Closing the AvroRowReader releases the inner readers and column workers, but the caller must close the underlying ParquetFileReader separately. The first example above does this by declaring both readers as resources of a single try-with-resources statement, which closes them in reverse order of declaration.
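The ownership rule can be illustrated with a small JDK-only sketch. The two nested classes are hypothetical stand-ins for ParquetFileReader and AvroRowReader, not Hardwood's real implementations. The sketch only shows that a try-with-resources statement with two resources closes the wrapper first and the wrapped reader second, so neither needs to close the other.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins that record the order in which resources are closed.
final class LifecycleSketch {

    private LifecycleSketch() {
    }

    // Stand-in for ParquetFileReader: owns the underlying file.
    static final class FileReaderStandIn implements AutoCloseable {
        private final List<String> events;

        FileReaderStandIn(List<String> events) {
            this.events = events;
        }

        @Override
        public void close() {
            events.add("file reader closed");
        }
    }

    // Stand-in for AvroRowReader: wraps a file reader without owning it.
    static final class RowReaderStandIn implements AutoCloseable {
        private final FileReaderStandIn source;
        private final List<String> events;

        RowReaderStandIn(FileReaderStandIn source, List<String> events) {
            this.source = source;
            this.events = events;
        }

        @Override
        public void close() {
            // Releases only its own state; deliberately does not call source.close().
            events.add("row reader closed");
        }
    }

    // Opens both resources in one try-with-resources statement and
    // returns the sequence of events that was observed.
    static List<String> closeOrder() {
        List<String> events = new ArrayList<>();
        try (FileReaderStandIn file = new FileReaderStandIn(events);
             RowReaderStandIn rows = new RowReaderStandIn(file, events)) {
            events.add("reading");
        }
        return events;
    }
}
```

Because resources are closed in reverse order of declaration, the row reader is always released before the file reader it depends on.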
Schema overrides
Hardwood derives the Avro schema directly from the Parquet schema via AvroSchemaConverter. There is no equivalent of parquet-java's AvroReadSupport.setRequestedProjection(...) or setAvroReadSchema(...): supplying an explicit Avro reader schema (for schema-evolution promotions, renames, or alias resolution) is not supported. Column projection (ColumnProjection.columns(...)) is the only way to narrow what is read, and the schema returned by getSchema() is always the Avro conversion of the projected Parquet schema.