Avro Support

If your application already works with Avro records — for instance in a Kafka or Spark pipeline — you can read Parquet files directly into GenericRecord instances instead of using Hardwood's own row API. The hardwood-avro module handles the schema conversion and record materialization, matching the behavior of parquet-java's AvroReadSupport. Add it alongside hardwood-core:

<dependency>
    <groupId>dev.hardwood</groupId>
    <artifactId>hardwood-avro</artifactId>
</dependency>
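If you build with Gradle instead, the equivalent declaration uses the same coordinates (version omitted here as in the Maven snippet, on the assumption it is managed elsewhere, e.g. by a BOM):

```
dependencies {
    implementation("dev.hardwood:hardwood-avro")
}
```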

Read rows as GenericRecord:

import dev.hardwood.avro.AvroReaders;
import dev.hardwood.avro.AvroRowReader;
import dev.hardwood.reader.ParquetFileReader;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;

import java.util.List;

try (ParquetFileReader fileReader = ParquetFileReader.open(InputFile.of(path));
     AvroRowReader reader = AvroReaders.createRowReader(fileReader)) {

    Schema avroSchema = reader.getSchema();

    while (reader.hasNext()) {
        GenericRecord record = reader.next();

        // Access fields by name
        long id = (Long) record.get("id");
        String name = (String) record.get("name");

        // Nested structs are nested GenericRecords
        GenericRecord address = (GenericRecord) record.get("address");
        if (address != null) {
            String city = (String) address.get("city");
        }

        // Lists and maps use standard Java collections
        @SuppressWarnings("unchecked")
        List<String> tags = (List<String>) record.get("tags");
    }
}

AvroReaders accepts the same reader options as the core row API: column projection, predicate pushdown, or both together:

// With filter
AvroRowReader reader = AvroReaders.createRowReader(fileReader,
    FilterPredicate.gt("id", 1000L));

// With projection
AvroRowReader reader = AvroReaders.createRowReader(fileReader,
    ColumnProjection.columns("id", "name"));

// With both
AvroRowReader reader = AvroReaders.createRowReader(fileReader,
    ColumnProjection.columns("id", "name"),
    FilterPredicate.gt("id", 1000L));

Values use Avro's standard representations: timestamps as Long (milliseconds or microseconds since the epoch, depending on the column's unit), dates as Integer (days since the epoch), and decimals and other binary data as ByteBuffer. This matches parquet-java's AvroReadSupport.
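If your application wants richer types, these raw representations decode with the standard library alone. The sketch below shows one way to do it, assuming timestamp-millis, a known decimal scale for the column, and big-endian unscaled bytes in the decimal buffer (Avro's standard decimal encoding); the helper names are illustrative, not part of Hardwood's API:

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.nio.ByteBuffer;
import java.time.Instant;
import java.time.LocalDate;

public class LogicalTypeDecode {

    // timestamp-millis: epoch milliseconds -> Instant
    static Instant toInstant(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis);
    }

    // date: days since the epoch -> LocalDate
    static LocalDate toLocalDate(int epochDays) {
        return LocalDate.ofEpochDay(epochDays);
    }

    // decimal: big-endian two's-complement unscaled bytes + the column's scale
    static BigDecimal toBigDecimal(ByteBuffer buf, int scale) {
        byte[] bytes = new byte[buf.remaining()];
        buf.duplicate().get(bytes); // read via a duplicate so the caller's position is untouched
        return new BigDecimal(new BigInteger(bytes), scale);
    }

    public static void main(String[] args) {
        System.out.println(toInstant(0L));       // 1970-01-01T00:00:00Z
        System.out.println(toLocalDate(19723));  // 2024-01-01
        // 0x04D2 = 1234 unscaled, scale 2 -> 12.34
        System.out.println(toBigDecimal(ByteBuffer.wrap(new byte[] {0x04, (byte) 0xD2}), 2));
    }
}
```

For timestamp-micros columns, divide into `Instant.ofEpochSecond(micros / 1_000_000, (micros % 1_000_000) * 1_000)` instead; the scale for decimals comes from the column's schema, not the data.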