Skip to content

Row-Oriented Reading

The RowReader provides a convenient row-oriented interface for reading Parquet files with typed accessor methods for type-safe field access.

import dev.hardwood.InputFile;
import dev.hardwood.reader.ParquetFileReader;
import dev.hardwood.reader.RowReader;
import dev.hardwood.row.PqStruct;
import dev.hardwood.row.PqList;
import dev.hardwood.row.PqIntList;
import dev.hardwood.row.PqMap;
import java.time.Instant;
import java.time.LocalDate;
import java.time.LocalTime;
import java.math.BigDecimal;
import java.util.UUID;

try (ParquetFileReader fileReader = ParquetFileReader.open(InputFile.of(path));
    RowReader rowReader = fileReader.rowReader()) {

    while (rowReader.hasNext()) {
        rowReader.next();

        // Access columns by name with typed accessors
        long id = rowReader.getLong("id");
        String name = rowReader.getString("name");

        // Logical types are automatically converted
        LocalDate birthDate = rowReader.getDate("birth_date");
        Instant createdAt = rowReader.getTimestamp("created_at");
        LocalTime wakeTime = rowReader.getTime("wake_time");
        BigDecimal balance = rowReader.getDecimal("balance");
        UUID accountId = rowReader.getUuid("account_id");

        // Check for null values
        if (!rowReader.isNull("age")) {
            int age = rowReader.getInt("age");
            System.out.println("ID: " + id + ", Name: " + name + ", Age: " + age);
        }

        // Access nested structs
        PqStruct address = rowReader.getStruct("address");
        if (address != null) {
            String city = address.getString("city");
            int zip = address.getInt("zip");
        }

        // Access lists and iterate with typed accessors
        PqList tags = rowReader.getList("tags");
        if (tags != null) {
            for (String tag : tags.strings()) {
                System.out.println("Tag: " + tag);
            }
        }
    }
}
Advanced: nested lists, maps, and list-of-structs
        // Access list of structs
        PqList contacts = rowReader.getList("contacts");
        if (contacts != null) {
            for (PqStruct contact : contacts.structs()) {
                String contactName = contact.getString("name");
                String phone = contact.getString("phone");
            }
        }

        // Access nested lists (list<list<int>>) using primitive int lists
        PqList matrix = rowReader.getList("matrix");
        if (matrix != null) {
            for (PqList row : matrix.lists()) {
                PqIntList innerList = row.ints();
                for (var it = innerList.iterator(); it.hasNext(); ) {
                    int val = it.nextInt();
                    System.out.println("Value: " + val);
                }
            }
        }

        // Access maps (map<string, int>) — iterate all entries
        PqMap attributes = rowReader.getMap("attributes");
        if (attributes != null) {
            for (PqMap.Entry entry : attributes.getEntries()) {
                String key = entry.getStringKey();
                int value = entry.getIntValue();
                System.out.println(key + " = " + value);
            }
        }

        // Key-based lookup (no per-entry flyweight allocations)
        PqMap attrs = rowReader.getMap("attributes");
        if (attrs != null && attrs.containsKey("age")) {
            Integer age = (Integer) attrs.getValue("age");
        }

        // Access maps with struct values (map<string, struct>)
        PqMap people = rowReader.getMap("people");
        if (people != null) {
            PqStruct alice = (PqStruct) people.getValue("alice");
            if (alice != null) {
                String name = alice.getString("name");
                int age = alice.getInt("age");
            }
        }

PqMap.getValue(key) returns null for both an absent key and a present-but-null value — call containsKey(key) to disambiguate. Lookup is supported by String / int / long / byte[] keys; long-tail key types (DATE / TIMESTAMP / DECIMAL / UUID) are reachable through getEntries() + Entry.getKey(). When a key appears more than once, the lookup methods follow the Parquet spec's last-value-wins rule and surface the value of the last matching entry.

Typed Accessor Methods

All accessor methods are available in two forms:

  • Name-based (e.g., getInt("column_name")) — convenient for ad-hoc access
  • Index-based (e.g., getInt(columnIndex)) — faster for performance-critical loops

The common scalar and nested types:

Method Java Type
getBoolean boolean
getInt int
getLong long
getFloat float
getDouble double
getString String
getDate LocalDate
getTime LocalTime
getTimestamp Instant
getLocalTimestamp LocalDateTime
getDecimal BigDecimal
getUuid UUID
getStruct / getList / getMap PqStruct / PqList / PqMap

For the complete correspondence — physical and logical types, the getBinary / getInterval / getVariant accessors, FLOAT16, BSON, INT96, and the legacy converted_type columns — plus the null- and type-mismatch contracts, see Typed Accessors.

Null and type-mismatch handling

Primitive accessors (getInt, getLong, getFloat, getDouble, getBoolean) throw NullPointerException on a null field — always check isNull() first; object accessors return null. Requesting the wrong type for a column fails at runtime with an unchecked exception. The full rules — including the split getTimestamp / getLocalTimestamp pair — are in Typed Accessors; the reasoning behind the timestamp split is in Timestamp Semantics.

Index-based access

For hot loops, look up column indices once outside the loop and pass them to the accessors instead of names:

// Get column indices once (before the loop)
int idIndex = fileReader.getFileSchema().getColumn("id").columnIndex();
int nameIndex = fileReader.getFileSchema().getColumn("name").columnIndex();

while (rowReader.hasNext()) {
    rowReader.next();
    if (!rowReader.isNull(idIndex)) {
        long id = rowReader.getLong(idIndex);      // No name lookup per row
        String name = rowReader.getString(nameIndex);
    }
}

INTERVAL columns

PqInterval is a plain record with three long properties — months(), days(), and milliseconds(). Each holds an unsigned 32-bit value in the range [0, 4_294_967_295], so no additional conversion is needed. The components are independent and not normalized. INTERVAL is one of the legacy converted_type annotations handled transparently — see Legacy converted-type annotations.

NULL columns

A column annotated with the NULL logical type (e.g. PyArrow's pa.null()) holds a null value at every row. column.logicalType() returns LogicalType.NullType; isNull(name) is always true and the typed/generic accessors return null accordingly. No separate accessor is needed.

Bare BYTE_ARRAY columns

BYTE_ARRAY columns without a STRING logical type annotation may hold arbitrary binary payloads (Protobuf, WKB, custom encodings). Generic accessors such as PqList.get and PqList.iterator surface these as byte[] rather than silently UTF-8 decoding them — invalid byte sequences would otherwise be replaced with U+FFFD. Call getString explicitly when the column is known to contain UTF-8 text from an older writer that omitted the STRING annotation.

Typed accessors on PqList and PqMap.Entry

Both interfaces mirror the RowReader's typed accessor surface — strings() / dates() / times() / timestamps() / decimals() / uuids() / intervals() / floats() / booleans() on PqList (each returning List<T>); the matching getStringValue() / getDateValue() / getIntervalValue() / etc. on PqMap.Entry. Use these in preference to the generic getValue() when iterating over a list / map of a known logical type to avoid the boxed Object return.

PqMap.Entry's typed key accessor surface is intentionally narrower: getStringKey() / getIntKey() / getLongKey() / getBinaryKey() cover the four high-frequency map key types. Long-tail key types (DATE / TIME / TIMESTAMP / DECIMAL / UUID) fall through to getKey() (decoded) and getRawKey() (raw).

PqList.ints() / longs() / doubles() return the specialized PqIntList / PqLongList / PqDoubleList types instead — these expose PrimitiveIterator.OfInt / OfLong / OfDouble, int get(int), and int[] toArray() so primitive list iteration allocates no boxed wrappers. For nested list<list<int>> (or <long> / <double>), iterate the outer list via lists() and call ints() / longs() / doubles() on each inner PqList; primitive element access stays unboxed at any nesting depth.

Reading the physical value

When you want the raw physical value rather than the decoded logical-type representation — e.g. the INT64 micros backing a TIMESTAMP, the INT32 days backing a DATE, or the unscaled INT32 / INT64 / byte[] backing a DECIMAL — call the typed primitive accessor that matches the column's physical type:

// TIMESTAMP column backed by INT64 micros
long micros = rowReader.getLong("created_at");

// DATE column backed by INT32 days since epoch
int daysSinceEpoch = rowReader.getInt("birth_date");

// DECIMAL(precision, scale) column backed by INT64
long unscaled = rowReader.getLong("amount");

getInt / getLong / getFloat / getDouble / getBoolean / getBinary accept any column whose physical type matches, regardless of the logical-type annotation — they read the underlying value directly. Use this whenever you already know the column's physical encoding and want to skip logical-type decoding.

Decoded generic access

When the column type isn't known ahead of time — e.g. generic projection-driven readers, dump tools, schema-introspecting frameworks — the generic fallback accessors return values decoded to their logical-type representation:

  • RowReader.getValue(name) / getValue(index)Integer / Long / String / LocalDate / LocalTime / Instant / BigDecimal / UUID / PqInterval / PqVariant / nested PqStruct / PqList / PqMap, with byte[] for un-annotated BYTE_ARRAY / FIXED_LEN_BYTE_ARRAY columns.
  • PqStruct.getValue(name) — same decoded mapping for nested struct fields.
  • PqMap.Entry.getKey() / getValue() — same decoded mapping for map keys and values.
  • PqList.get(index) / PqList.values() — same decoded mapping for list elements.

A parallel getRawValue family (RowReader.getRawValue, PqStruct.getRawValue, PqMap.Entry.getRawKey / getRawValue, PqList.getRaw / rawValues) returns the boxed physical value when even the physical type isn't known statically. In hot loops, prefer the typed primitive accessor described above — it avoids the boxing and the dispatch overhead.

Nested groups (struct / list / map / variant) have no distinct "raw" form and are returned through their typed flyweight (PqStruct / PqList / PqMap / PqVariant) in both modes.