
Row-Oriented Reading

The RowReader provides a row-oriented interface for reading Parquet files, with typed accessor methods that give type-safe access to each field.

Try it yourself

Want to run it or explore the capabilities yourself? Nested Data walks structs, lists, and maps with the Row API, and Hello Hardwood covers the basics.

import dev.hardwood.InputFile;
import dev.hardwood.reader.ParquetFileReader;
import dev.hardwood.reader.RowReader;
import dev.hardwood.row.PqStruct;
import dev.hardwood.row.PqList;
import dev.hardwood.row.PqIntList;
import dev.hardwood.row.PqMap;
import java.time.Instant;
import java.time.LocalDate;
import java.time.LocalTime;
import java.math.BigDecimal;
import java.util.UUID;

try (ParquetFileReader fileReader = ParquetFileReader.open(InputFile.of(path));
    RowReader rowReader = fileReader.rowReader()) {

    while (rowReader.hasNext()) {
        rowReader.next();

        // Access columns by name with typed accessors
        long id = rowReader.getLong("id");
        String name = rowReader.getString("name");

        // Logical types are automatically converted
        LocalDate birthDate = rowReader.getDate("birth_date");
        Instant createdAt = rowReader.getTimestamp("created_at");
        LocalTime wakeTime = rowReader.getTime("wake_time");
        BigDecimal balance = rowReader.getDecimal("balance");
        UUID accountId = rowReader.getUuid("account_id");

        // Check for null values
        if (!rowReader.isNull("age")) {
            int age = rowReader.getInt("age");
            System.out.println("ID: " + id + ", Name: " + name + ", Age: " + age);
        }

        // Access nested structs
        PqStruct address = rowReader.getStruct("address");
        if (address != null) {
            String city = address.getString("city");
            int zip = address.getInt("zip");
        }

        // Access lists and iterate with typed accessors
        PqList tags = rowReader.getList("tags");
        if (tags != null) {
            for (String tag : tags.strings()) {
                System.out.println("Tag: " + tag);
            }
        }
    }
}
Advanced: nested lists, maps, and list-of-structs
        // Access list of structs
        PqList contacts = rowReader.getList("contacts");
        if (contacts != null) {
            for (PqStruct contact : contacts.structs()) {
                String contactName = contact.getString("name");
                String phone = contact.getString("phone");
            }
        }

        // Access nested lists (list<list<int>>) using primitive int lists
        PqList matrix = rowReader.getList("matrix");
        if (matrix != null) {
            for (PqList row : matrix.lists()) {
                PqIntList innerList = row.ints();
                for (var it = innerList.iterator(); it.hasNext(); ) {
                    int val = it.nextInt();
                    System.out.println("Value: " + val);
                }
            }
        }

        // Access maps (map<string, int>) — iterate all entries
        PqMap attributes = rowReader.getMap("attributes");
        if (attributes != null) {
            for (PqMap.Entry entry : attributes.getEntries()) {
                String key = entry.getStringKey();
                int value = entry.getIntValue();
                System.out.println(key + " = " + value);
            }
        }

        // Key-based lookup (no per-entry flyweight allocations)
        PqMap attrs = rowReader.getMap("attributes");
        if (attrs != null && attrs.containsKey("age")) {
            Integer age = (Integer) attrs.getValue("age");
        }

        // Access maps with struct values (map<string, struct>)
        PqMap people = rowReader.getMap("people");
        if (people != null) {
            PqStruct alice = (PqStruct) people.getValue("alice");
            if (alice != null) {
                String name = alice.getString("name");
                int age = alice.getInt("age");
            }
        }

PqMap.getValue(key) returns null for both an absent key and a present-but-null value; call containsKey(key) to tell the two apart. Key-based lookup is supported for String / int / long / byte[] keys; long-tail key types (DATE / TIME / TIMESTAMP / DECIMAL / UUID) are reachable through getEntries() + Entry.getKey(). When a key appears more than once, the lookup methods follow the Parquet spec's last-value-wins rule and surface the value of the last matching entry.
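The lookup rules above can be modelled in a few lines of plain Java. This is only a sketch of the semantics, not Hardwood's implementation: MapLookupModel and its Entry record are hypothetical stand-ins for a map<string, int> whose entries have already been decoded.

```java
import java.util.List;

/** Plain-Java model of the lookup semantics described above (hypothetical, not the library code). */
final class MapLookupModel {

    /** Hypothetical stand-in for one decoded map entry; the value may be null. */
    record Entry(String key, Integer value) {}

    /** True if any entry carries the key, even when that entry's value is null. */
    static boolean containsKey(List<Entry> entries, String key) {
        for (Entry e : entries) {
            if (e.key().equals(key)) {
                return true;
            }
        }
        return false;
    }

    /** Value of the last matching entry; null if the key is absent OR the value is null. */
    static Integer getValue(List<Entry> entries, String key) {
        Integer result = null;
        for (Entry e : entries) {
            if (e.key().equals(key)) {
                result = e.value(); // a later duplicate overwrites an earlier one
            }
        }
        return result;
    }
}
```

Because getValue alone cannot distinguish "no such key" from "key mapped to null", a caller that cares about the difference checks containsKey first, exactly as the lookup snippet above does.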

Typed Accessor Methods

All accessor methods are available in two forms:

  • Name-based (e.g., getInt("column_name")) — convenient for ad-hoc access
  • Index-based (e.g., getInt(columnIndex)) — faster for performance-critical loops

The common scalar and nested types:

Method                          Java Type
getBoolean                      boolean
getInt                          int
getLong                         long
getFloat                        float
getDouble                       double
getString                       String
getDate                         LocalDate
getTime                         LocalTime
getTimestamp                    Instant
getLocalTimestamp               LocalDateTime
getDecimal                      BigDecimal
getUuid                         UUID
getStruct / getList / getMap    PqStruct / PqList / PqMap

For the complete correspondence — physical and logical types, the getBinary / getInterval / getVariant accessors, FLOAT16, BSON, INT96, and the legacy converted_type columns — plus the null- and type-mismatch contracts, see Typed Accessors.

Null and type-mismatch handling

Primitive accessors (getInt, getLong, getFloat, getDouble, getBoolean) throw NullPointerException on a null field — always check isNull() first; object accessors return null. Requesting the wrong type for a column fails at runtime with an unchecked exception. The full rules — including the split getTimestamp / getLocalTimestamp pair — are in Typed Accessors; the reasoning behind the timestamp split is in Timestamp Semantics.
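One way to make the isNull() check hard to forget is to wrap it in a small helper that returns an OptionalInt. The sketch below is an illustration only: IntFieldSource is a hypothetical interface that mirrors just the isNull(name) and getInt(name) calls, so the example compiles without the library on the classpath.

```java
import java.util.OptionalInt;

/** Null-safe wrapper around a primitive accessor (illustrative sketch). */
final class NullSafeInts {

    /** Hypothetical stand-in for the two row accessor calls used here. */
    interface IntFieldSource {
        boolean isNull(String name);
        int getInt(String name);
    }

    /** Returns an empty OptionalInt for a null field instead of letting getInt throw. */
    static OptionalInt readInt(IntFieldSource row, String name) {
        return row.isNull(name)
                ? OptionalInt.empty()
                : OptionalInt.of(row.getInt(name));
    }
}
```

In a real reader loop the same two calls would be made on the RowReader directly; the helper just keeps the check and the read together.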

Index-based access

For hot loops, look up column indices once outside the loop and pass them to the accessors instead of names:

// Get column indices once (before the loop)
int idIndex = fileReader.getFileSchema().getColumn("id").columnIndex();
int nameIndex = fileReader.getFileSchema().getColumn("name").columnIndex();

while (rowReader.hasNext()) {
    rowReader.next();
    if (!rowReader.isNull(idIndex)) {
        long id = rowReader.getLong(idIndex);      // No name lookup per row
        String name = rowReader.getString(nameIndex);
    }
}

INTERVAL columns

PqInterval is a plain record with three long properties — months(), days(), and milliseconds(). Each holds an unsigned 32-bit value in the range [0, 4_294_967_295], so no additional conversion is needed. The components are independent and not normalized. INTERVAL is one of the legacy converted_type annotations handled transparently — see Legacy converted-type annotations.
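To see why the components are long rather than int, it helps to look at the on-disk form. In the Parquet format an INTERVAL is a 12-byte FIXED_LEN_BYTE_ARRAY holding three little-endian unsigned 32-bit integers (months, days, milliseconds). The sketch below decodes such a value with the JDK alone; the Parts record is a hypothetical stand-in for PqInterval, and the library's own decoding may differ in detail.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Decodes a raw 12-byte Parquet INTERVAL into three unsigned components (illustrative sketch). */
final class IntervalDecoding {

    /** Hypothetical stand-in mirroring the three long components described above. */
    record Parts(long months, long days, long milliseconds) {}

    static Parts decode(byte[] raw) {
        if (raw.length != 12) {
            throw new IllegalArgumentException("INTERVAL must be 12 bytes, got " + raw.length);
        }
        ByteBuffer buf = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
        // Each getInt() reads a signed int; toUnsignedLong widens it to [0, 4_294_967_295].
        long months = Integer.toUnsignedLong(buf.getInt());
        long days = Integer.toUnsignedLong(buf.getInt());
        long millis = Integer.toUnsignedLong(buf.getInt());
        return new Parts(months, days, millis); // components stay independent, no normalization
    }
}
```

A value such as 0xFFFFFFFF would read as -1 in a Java int; widening to long is what lets the accessor return the full unsigned range without further conversion.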

NULL columns

A column annotated with the NULL logical type (e.g. PyArrow's pa.null()) holds a null value in every row. column.logicalType() returns LogicalType.NullType; isNull(name) is always true, and the typed and generic accessors return null accordingly. No separate accessor is needed.

Bare BYTE_ARRAY columns

BYTE_ARRAY columns without a STRING logical type annotation may hold arbitrary binary payloads (Protobuf, WKB, custom encodings). Generic accessors such as PqList.get and PqList.iterator surface these as byte[] rather than silently UTF-8 decoding them — invalid byte sequences would otherwise be replaced with U+FFFD. Call getString explicitly when the column is known to contain UTF-8 text from an older writer that omitted the STRING annotation.
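The data loss that motivates this behaviour is easy to reproduce with the JDK alone. The sketch below uses no Hardwood API; the class and method names are illustrative. It contrasts lenient decoding, which substitutes U+FFFD, with a strict decoder that reports malformed input.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

/** Shows why arbitrary binary payloads should not be decoded as UTF-8 by default. */
final class BinaryVsText {

    /** Lenient decoding: malformed byte sequences are silently replaced with U+FFFD. */
    static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Strict decoding: returns empty rather than substituting replacement characters. */
    static Optional<String> decodeStrict(byte[] bytes) {
        try {
            return Optional.of(StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString());
        } catch (CharacterCodingException e) {
            return Optional.empty();
        }
    }
}
```

Once a payload has gone through the lenient path, re-encoding the resulting string no longer yields the original bytes, which is why keeping byte[] as the default for un-annotated columns is the safer choice.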

Typed accessors on PqList and PqMap.Entry

Both interfaces mirror the RowReader's typed accessor surface — strings() / dates() / times() / timestamps() / decimals() / uuids() / intervals() / floats() / booleans() on PqList (each returning List<T>); the matching getStringValue() / getDateValue() / getIntervalValue() / etc. on PqMap.Entry. Use these in preference to the generic getValue() when iterating over a list / map of a known logical type to avoid the boxed Object return.

PqMap.Entry's typed key accessor surface is intentionally narrower: getStringKey() / getIntKey() / getLongKey() / getBinaryKey() cover the four high-frequency map key types. Long-tail key types (DATE / TIME / TIMESTAMP / DECIMAL / UUID) fall through to getKey() (decoded) and getRawKey() (raw).

PqList.ints() / longs() / doubles() return the specialized PqIntList / PqLongList / PqDoubleList types instead. These expose PrimitiveIterator.OfInt / OfLong / OfDouble together with primitive element access (int get(int) and int[] toArray() on PqIntList), so primitive list iteration allocates no boxed wrappers. For nested list<list<int>> (or <long> / <double>), iterate the outer list via lists() and call ints() / longs() / doubles() on each inner PqList; primitive element access stays unboxed at any nesting depth.
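The unboxed iteration pattern relies only on the JDK's PrimitiveIterator.OfInt, so it can be sketched without the library. In the example below, int[] arrays and Arrays.stream(...).iterator() stand in for PqIntList and its iterator; those stand-ins are assumptions made so the code is self-contained.

```java
import java.util.Arrays;
import java.util.PrimitiveIterator;

/** Unboxed iteration over primitive int sequences (illustrative sketch). */
final class PrimitiveIteration {

    /** Sums values via nextInt(), which returns a primitive and creates no Integer objects. */
    static long sum(PrimitiveIterator.OfInt it) {
        long total = 0;
        while (it.hasNext()) {
            total += it.nextInt();
        }
        return total;
    }

    /** Nested case: walk the outer rows, then iterate each inner row unboxed. */
    static long sumMatrix(int[][] rows) {
        long total = 0;
        for (int[] row : rows) {
            total += sum(Arrays.stream(row).iterator());
        }
        return total;
    }
}
```

Calling next() instead of nextInt() on the same iterator would box every element into an Integer, which is the overhead the specialized list types are meant to avoid.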

Reading the physical value

When you want the raw physical value rather than the decoded logical-type representation — e.g. the INT64 micros backing a TIMESTAMP, the INT32 days backing a DATE, or the unscaled INT32 / INT64 / byte[] backing a DECIMAL — call the typed primitive accessor that matches the column's physical type:

// TIMESTAMP column backed by INT64 micros
long micros = rowReader.getLong("created_at");

// DATE column backed by INT32 days since epoch
int daysSinceEpoch = rowReader.getInt("birth_date");

// DECIMAL(precision, scale) column backed by INT64
long unscaled = rowReader.getLong("amount");

getInt / getLong / getFloat / getDouble / getBoolean / getBinary accept any column whose physical type matches, regardless of the logical-type annotation — they read the underlying value directly. Use this whenever you already know the column's physical encoding and want to skip logical-type decoding.
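If a raw value later needs to become its logical form, the JDK conversions are short. The sketch below assumes the encodings used in the snippet above (a TIMESTAMP in microseconds, a DATE in days since the epoch, and a DECIMAL whose scale is known from the schema); the class and method names are illustrative and not part of the library.

```java
import java.math.BigDecimal;
import java.time.Instant;
import java.time.LocalDate;

/** Converts raw physical values to their logical Java types (illustrative sketch). */
final class PhysicalValues {

    /** INT64 microseconds since the epoch to Instant; floorDiv/floorMod keep pre-1970 values correct. */
    static Instant instantFromMicros(long micros) {
        long seconds = Math.floorDiv(micros, 1_000_000L);
        long microsOfSecond = Math.floorMod(micros, 1_000_000L);
        return Instant.ofEpochSecond(seconds, microsOfSecond * 1_000L);
    }

    /** INT32 days since 1970-01-01 to LocalDate. */
    static LocalDate dateFromEpochDays(int days) {
        return LocalDate.ofEpochDay(days);
    }

    /** Unscaled INT64 plus the column's declared scale to BigDecimal. */
    static BigDecimal decimalFromUnscaled(long unscaled, int scale) {
        return BigDecimal.valueOf(unscaled, scale);
    }
}
```

The timestamp unit and the decimal scale are properties of the column's logical type, so a caller reading raw values has to take them from the schema rather than from the value itself.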

Decoded generic access

When the column type isn't known ahead of time — e.g. generic projection-driven readers, dump tools, schema-introspecting frameworks — the generic fallback accessors return values decoded to their logical-type representation:

  • RowReader.getValue(name) / getValue(index) returns Integer / Long / String / LocalDate / LocalTime / Instant / BigDecimal / UUID / PqInterval / PqVariant / nested PqStruct / PqList / PqMap, with byte[] for un-annotated BYTE_ARRAY / FIXED_LEN_BYTE_ARRAY columns.
  • PqStruct.getValue(name) — same decoded mapping for nested struct fields.
  • PqMap.Entry.getKey() / getValue() — same decoded mapping for map keys and values.
  • PqList.get(index) / PqList.values() — same decoded mapping for list elements.

A parallel getRawValue family (RowReader.getRawValue, PqStruct.getRawValue, PqMap.Entry.getRawKey / getRawValue, PqList.getRaw / rawValues) returns the boxed physical value when even the physical type isn't known statically. In hot loops, prefer the typed primitive accessor described above — it avoids the boxing and the dispatch overhead.

Nested groups (struct / list / map / variant) have no distinct "raw" form and are returned through their typed flyweight (PqStruct / PqList / PqMap / PqVariant) in both modes.
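A dump tool built on the generic accessors typically dispatches on the runtime type of each returned Object. The sketch below shows one possible formatter using only JDK types; ValueFormatter is a hypothetical helper, and the nested flyweights are simply passed to toString() here because their string form is not described in this section.

```java
/** Formats a decoded generic value for display (illustrative sketch). */
final class ValueFormatter {

    static String format(Object value) {
        if (value == null) {
            return "null";
        }
        if (value instanceof byte[] bytes) {
            // Un-annotated binary columns arrive as byte[]; print them as hex, not as text.
            StringBuilder sb = new StringBuilder("0x");
            for (byte b : bytes) {
                sb.append(String.format("%02x", b & 0xFF));
            }
            return sb.toString();
        }
        if (value instanceof CharSequence text) {
            return "\"" + text + "\"";
        }
        // Integer, Long, LocalDate, Instant, BigDecimal, UUID and so on have usable toString() forms.
        return value.toString();
    }
}
```

In a reader loop this would be called as format(rowReader.getValue(name)) for each projected column; typed accessors remain the better fit wherever the column type is known up front.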