Row-Oriented Reading
The RowReader provides a row-oriented interface for reading Parquet files, with typed accessor methods for type-safe field access.
```java
import dev.hardwood.InputFile;
import dev.hardwood.reader.ParquetFileReader;
import dev.hardwood.reader.RowReader;
import dev.hardwood.row.PqStruct;
import dev.hardwood.row.PqList;
import dev.hardwood.row.PqIntList;
import dev.hardwood.row.PqMap;
import java.time.Instant;
import java.time.LocalDate;
import java.time.LocalTime;
import java.math.BigDecimal;
import java.util.UUID;

try (ParquetFileReader fileReader = ParquetFileReader.open(InputFile.of(path));
     RowReader rowReader = fileReader.rowReader()) {

    while (rowReader.hasNext()) {
        rowReader.next();

        // Access columns by name with typed accessors
        long id = rowReader.getLong("id");
        String name = rowReader.getString("name");

        // Logical types are automatically converted
        LocalDate birthDate = rowReader.getDate("birth_date");
        Instant createdAt = rowReader.getTimestamp("created_at");
        LocalTime wakeTime = rowReader.getTime("wake_time");
        BigDecimal balance = rowReader.getDecimal("balance");
        UUID accountId = rowReader.getUuid("account_id");

        // Check for null values
        if (!rowReader.isNull("age")) {
            int age = rowReader.getInt("age");
            System.out.println("ID: " + id + ", Name: " + name + ", Age: " + age);
        }

        // Access nested structs
        PqStruct address = rowReader.getStruct("address");
        if (address != null) {
            String city = address.getString("city");
            int zip = address.getInt("zip");
        }

        // Access lists and iterate with typed accessors
        PqList tags = rowReader.getList("tags");
        if (tags != null) {
            for (String tag : tags.strings()) {
                System.out.println("Tag: " + tag);
            }
        }
    }
}
```
Advanced: nested lists, maps, and list-of-structs
```java
// Access list of structs
PqList contacts = rowReader.getList("contacts");
if (contacts != null) {
    for (PqStruct contact : contacts.structs()) {
        String contactName = contact.getString("name");
        String phone = contact.getString("phone");
    }
}

// Access nested lists (list<list<int>>) using primitive int lists
PqList matrix = rowReader.getList("matrix");
if (matrix != null) {
    for (PqList row : matrix.lists()) {
        PqIntList innerList = row.ints();
        for (var it = innerList.iterator(); it.hasNext(); ) {
            int val = it.nextInt();
            System.out.println("Value: " + val);
        }
    }
}

// Access maps (map<string, int>): iterate all entries
PqMap attributes = rowReader.getMap("attributes");
if (attributes != null) {
    for (PqMap.Entry entry : attributes.getEntries()) {
        String key = entry.getStringKey();
        int value = entry.getIntValue();
        System.out.println(key + " = " + value);
    }
}

// Key-based lookup (no per-entry flyweight allocations)
PqMap attrs = rowReader.getMap("attributes");
if (attrs != null && attrs.containsKey("age")) {
    Integer age = (Integer) attrs.getValue("age");
}

// Access maps with struct values (map<string, struct>)
PqMap people = rowReader.getMap("people");
if (people != null) {
    PqStruct alice = (PqStruct) people.getValue("alice");
    if (alice != null) {
        String name = alice.getString("name");
        int age = alice.getInt("age");
    }
}
```
PqMap.getValue(key) returns null both for an absent key and for a key
that is present with a null value; call containsKey(key) to disambiguate.
Key-based lookup is supported for String / int / long / byte[] keys;
long-tail key types (DATE / TIMESTAMP / DECIMAL / UUID) are reachable
through getEntries() + Entry.getKey(). When a key appears more than
once, the lookup methods follow the Parquet spec's last-value-wins rule
and surface the value of the last matching entry.
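The lookup rules in this note can be illustrated with a small, self-contained sketch. It does not use hardwood: MapLookupSketch and its Entry record are hypothetical stand-ins built on a plain list, assuming only the behaviour described above (last matching entry wins, and null is returned for both an absent key and a null value).

```java
import java.util.List;

class MapLookupSketch {
    // Hypothetical stand-in for one decoded map entry; not a hardwood type.
    record Entry(String key, Integer value) {}

    // Sample map<string, int> with a duplicate key and a null value.
    static List<Entry> sample() {
        return List.of(
            new Entry("age", 30),
            new Entry("height", null),
            new Entry("age", 31));
    }

    // Scans all entries so that a later duplicate overrides an earlier one.
    static Integer getValue(List<Entry> entries, String key) {
        Integer result = null;
        for (Entry e : entries) {
            if (e.key().equals(key)) {
                result = e.value(); // last matching entry wins
            }
        }
        return result;
    }

    // True if the key occurs at all, even when its value is null.
    static boolean containsKey(List<Entry> entries, String key) {
        for (Entry e : entries) {
            if (e.key().equals(key)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Entry> entries = sample();
        System.out.println(getValue(entries, "age"));       // 31, the last "age" entry
        System.out.println(getValue(entries, "height"));    // null, key is present
        System.out.println(containsKey(entries, "height")); // true
        System.out.println(getValue(entries, "weight"));    // null, key is absent
        System.out.println(containsKey(entries, "weight")); // false
    }
}
```

The "height" and "weight" lookups both return null, which is why a containsKey check is the only way to tell the two cases apart.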
Typed Accessor Methods
All accessor methods are available in two forms:
- Name-based (e.g., getInt("column_name")): convenient for ad-hoc access
- Index-based (e.g., getInt(columnIndex)): faster for performance-critical loops
The common scalar and nested types:
| Method | Java Type |
|---|---|
| getBoolean | boolean |
| getInt | int |
| getLong | long |
| getFloat | float |
| getDouble | double |
| getString | String |
| getDate | LocalDate |
| getTime | LocalTime |
| getTimestamp | Instant |
| getLocalTimestamp | LocalDateTime |
| getDecimal | BigDecimal |
| getUuid | UUID |
| getStruct / getList / getMap | PqStruct / PqList / PqMap |
For the complete correspondence — physical and logical types, the getBinary / getInterval /
getVariant accessors, FLOAT16, BSON, INT96, and the legacy converted_type columns — plus the
null- and type-mismatch contracts, see Typed Accessors.
Null and type-mismatch handling
Primitive accessors (getInt, getLong, getFloat, getDouble, getBoolean) throw
NullPointerException on a null field — always check isNull() first; object accessors return
null. Requesting the wrong type for a column fails at runtime with an unchecked exception. The
full rules — including the split getTimestamp / getLocalTimestamp pair — are in
Typed Accessors; the reasoning behind the timestamp split is in
Timestamp Semantics.
Index-based access
For hot loops, look up column indices once outside the loop and pass them to the accessors instead of names:
```java
// Get column indices once (before the loop)
int idIndex = fileReader.getFileSchema().getColumn("id").columnIndex();
int nameIndex = fileReader.getFileSchema().getColumn("name").columnIndex();

while (rowReader.hasNext()) {
    rowReader.next();
    if (!rowReader.isNull(idIndex)) {
        long id = rowReader.getLong(idIndex); // No name lookup per row
        String name = rowReader.getString(nameIndex);
    }
}
```
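To make the cost difference concrete, here is a rough, library-independent sketch of the same idea. The schema map and Object[] rows are hypothetical stand-ins, not hardwood's internal structures; the sketch only assumes that a name-based accessor has to resolve the name to a position on every call.

```java
import java.util.List;
import java.util.Map;

class IndexLookupSketch {
    // Hypothetical schema (column name -> position); not hardwood's internal structure.
    static final Map<String, Integer> SCHEMA = Map.of("id", 0, "name", 1);

    // Name-based access: the name is resolved again for every row.
    static long sumIdsByName(List<Object[]> rows) {
        long sum = 0;
        for (Object[] row : rows) {
            int idIndex = SCHEMA.get("id"); // one hash lookup per iteration
            sum += (Long) row[idIndex];
        }
        return sum;
    }

    // Index-based access: resolve the position once, then index directly.
    static long sumIdsByIndex(List<Object[]> rows) {
        int idIndex = SCHEMA.get("id"); // resolved once, before the loop
        long sum = 0;
        for (Object[] row : rows) {
            sum += (Long) row[idIndex];
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Object[]> rows = List.of(new Object[] {1L, "a"}, new Object[] {2L, "b"});
        System.out.println(sumIdsByName(rows) == sumIdsByIndex(rows)); // true
    }
}
```

Both methods compute the same result; the second simply moves the name resolution out of the loop, which is the pattern the index-based accessors enable.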
INTERVAL columns
PqInterval is a plain record with three long properties — months(), days(), and milliseconds(). Each holds an unsigned 32-bit value in the range [0, 4_294_967_295], so no additional conversion is needed. The components are independent and not normalized. INTERVAL is one of the legacy converted_type annotations handled transparently — see Legacy converted-type annotations.
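For context, the Parquet format stores an INTERVAL as a 12-byte FIXED_LEN_BYTE_ARRAY holding three little-endian unsigned 32-bit integers (months, days, milliseconds). The sketch below shows how such a value maps onto three long components. IntervalSketch and its Parts record are hypothetical stand-ins that mirror the shape described above; they are not hardwood's PqInterval or its decoding code.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class IntervalSketch {
    // Hypothetical stand-in mirroring the three-long shape described above.
    record Parts(long months, long days, long milliseconds) {}

    // Decodes the 12-byte Parquet INTERVAL layout into unsigned components.
    static Parts decode(byte[] raw) {
        if (raw.length != 12) {
            throw new IllegalArgumentException("INTERVAL must be exactly 12 bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
        // Widening to long keeps values above Integer.MAX_VALUE positive.
        long months = Integer.toUnsignedLong(buf.getInt());
        long days = Integer.toUnsignedLong(buf.getInt());
        long millis = Integer.toUnsignedLong(buf.getInt());
        return new Parts(months, days, millis);
    }

    public static void main(String[] args) {
        byte[] raw = {
            (byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF, // months = 4294967295
            45, 0, 0, 0,                                        // days = 45
            (byte) 0xE8, 3, 0, 0                                // milliseconds = 1000
        };
        System.out.println(decode(raw));
    }
}
```

Note that the 45 days are kept as they are rather than being folded into months, matching the statement that the components are independent and not normalized.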
NULL columns
A column annotated with the NULL logical type (e.g. PyArrow's pa.null()) holds a null value at every row. column.logicalType() returns LogicalType.NullType; isNull(name) is always true and the typed/generic accessors return null accordingly. No separate accessor is needed.
Bare BYTE_ARRAY columns
BYTE_ARRAY columns without a STRING logical type annotation may hold arbitrary binary payloads (Protobuf, WKB, custom encodings). Generic accessors such as PqList.get and PqList.iterator surface these as byte[] rather than silently UTF-8 decoding them — invalid byte sequences would otherwise be replaced with U+FFFD. Call getString explicitly when the column is known to contain UTF-8 text from an older writer that omitted the STRING annotation.
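The U+FFFD substitution mentioned above is standard JDK behaviour and can be reproduced without any Parquet file. The following sketch uses only java.nio.charset; Utf8Sketch and its method names are illustrative and not part of hardwood. It contrasts lenient decoding with a strict decoder that rejects bytes that are not valid UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Sketch {
    // Lenient decoding: malformed bytes are silently replaced with U+FFFD.
    static String lenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Strict decoding: returns null when the bytes are not valid UTF-8.
    static String strictOrNull(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] binary = {(byte) 0xFF, 0x00, (byte) 0xFE}; // not valid UTF-8
        System.out.println(lenient(binary).contains("\uFFFD")); // true
        System.out.println(strictOrNull(binary));               // null
    }
}
```

A strict check of this kind is one way to decide whether an un-annotated column is safe to read as text before switching to getString.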
Typed accessors on PqList and PqMap.Entry
Both interfaces mirror the RowReader's typed accessor surface. PqList offers strings() / dates() / times() / timestamps() / decimals() / uuids() / intervals() / floats() / booleans(), each returning a List<T>; PqMap.Entry offers the matching getStringValue() / getDateValue() / getIntervalValue() / etc. When iterating over a list or map of a known logical type, prefer these over the generic getValue(), which returns a boxed Object.
PqMap.Entry's typed key accessor surface is intentionally narrower: getStringKey() / getIntKey() / getLongKey() / getBinaryKey() cover the four high-frequency map key types. Long-tail key types (DATE / TIME / TIMESTAMP / DECIMAL / UUID) fall through to getKey() (decoded) and getRawKey() (raw).
PqList.ints() / longs() / doubles() return the specialized PqIntList / PqLongList / PqDoubleList types instead — these expose PrimitiveIterator.OfInt / OfLong / OfDouble, int get(int), and int[] toArray() so primitive list iteration allocates no boxed wrappers. For nested list<list<int>> (or <long> / <double>), iterate the outer list via lists() and call ints() / longs() / doubles() on each inner PqList; primitive element access stays unboxed at any nesting depth.
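The unboxed iteration pattern relies on the JDK's PrimitiveIterator.OfInt, whose nextInt() returns a primitive int instead of an Integer. The sketch below shows the same nested loop shape over plain int arrays; the arrays are only a stand-in for the outer PqList and inner PqIntList, and PrimitiveIterationSketch is a hypothetical name.

```java
import java.util.Arrays;
import java.util.PrimitiveIterator;

class PrimitiveIterationSketch {
    // Sums a nested int structure without creating any Integer objects.
    static long sum(int[][] matrix) {
        long total = 0;
        for (int[] row : matrix) {
            // Arrays.stream(int[]) yields an IntStream with a primitive iterator.
            PrimitiveIterator.OfInt it = Arrays.stream(row).iterator();
            while (it.hasNext()) {
                total += it.nextInt(); // primitive int, no boxing
            }
        }
        return total;
    }

    public static void main(String[] args) {
        int[][] matrix = {{1, 2}, {3}, {}};
        System.out.println(sum(matrix)); // 6
    }
}
```

Calling next() instead of nextInt() on the same iterator would box each element, so the primitive method is the one to use in hot loops.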
Reading the physical value
When you want the raw physical value rather than the decoded logical-type representation — e.g. the INT64 micros backing a TIMESTAMP, the INT32 days backing a DATE, or the unscaled INT32 / INT64 / byte[] backing a DECIMAL — call the typed primitive accessor that matches the column's physical type:
```java
// TIMESTAMP column backed by INT64 micros
long micros = rowReader.getLong("created_at");

// DATE column backed by INT32 days since epoch
int daysSinceEpoch = rowReader.getInt("birth_date");

// DECIMAL(precision, scale) column backed by INT64
long unscaled = rowReader.getLong("amount");
```
getInt / getLong / getFloat / getDouble / getBoolean / getBinary accept any column whose physical type matches, regardless of the logical-type annotation — they read the underlying value directly. Use this whenever you already know the column's physical encoding and want to skip logical-type decoding.
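If the raw values later need to be turned into their logical form, the conversions can be done with the JDK alone. The sketch below is a minimal illustration under stated assumptions: the timestamp is in microseconds since the Unix epoch (Parquet also allows milliseconds and nanoseconds), and the decimal scale is known from the column's schema. PhysicalValueSketch and its method names are hypothetical helpers, not hardwood API.

```java
import java.math.BigDecimal;
import java.time.Instant;
import java.time.LocalDate;

class PhysicalValueSketch {
    // INT64 microseconds since the epoch -> Instant (assumes MICROS unit).
    static Instant instantFromMicros(long micros) {
        // floorDiv / floorMod keep pre-1970 (negative) values correct.
        long seconds = Math.floorDiv(micros, 1_000_000L);
        long nanos = Math.floorMod(micros, 1_000_000L) * 1_000L;
        return Instant.ofEpochSecond(seconds, nanos);
    }

    // INT32 days since the epoch -> LocalDate.
    static LocalDate dateFromEpochDays(int days) {
        return LocalDate.ofEpochDay(days);
    }

    // Unscaled INT64 plus the column's declared scale -> BigDecimal.
    static BigDecimal decimalFromUnscaled(long unscaled, int scale) {
        return BigDecimal.valueOf(unscaled, scale);
    }

    public static void main(String[] args) {
        System.out.println(instantFromMicros(1_500_000L));  // 1970-01-01T00:00:01.500Z
        System.out.println(dateFromEpochDays(0));           // 1970-01-01
        System.out.println(decimalFromUnscaled(12345L, 2)); // 123.45
    }
}
```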
Decoded generic access
When the column type isn't known ahead of time — e.g. generic projection-driven readers, dump tools, schema-introspecting frameworks — the generic fallback accessors return values decoded to their logical-type representation:
- RowReader.getValue(name) / getValue(index): Integer / Long / String / LocalDate / LocalTime / Instant / BigDecimal / UUID / PqInterval / PqVariant / nested PqStruct / PqList / PqMap, with byte[] for un-annotated BYTE_ARRAY / FIXED_LEN_BYTE_ARRAY columns.
- PqStruct.getValue(name): same decoded mapping for nested struct fields.
- PqMap.Entry.getKey() / getValue(): same decoded mapping for map keys and values.
- PqList.get(index) / PqList.values(): same decoded mapping for list elements.
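A dump tool built on these accessors typically dispatches on the runtime type of the returned Object. The sketch below shows one possible shape for that dispatch using only JDK types. DumpSketch.render is a hypothetical helper, and the values passed in main stand in for what a generic accessor might return; nested hardwood types would fall through to toString() here.

```java
import java.time.LocalDate;

class DumpSketch {
    // Renders a decoded value for display; byte[] gets a hex form because
    // its default toString() prints only an identity hash.
    static String render(Object value) {
        if (value == null) {
            return "null";
        }
        if (value instanceof byte[] bytes) {
            StringBuilder hex = new StringBuilder("0x");
            for (byte b : bytes) {
                hex.append(String.format("%02x", b & 0xFF));
            }
            return hex.toString();
        }
        if (value instanceof CharSequence text) {
            return "\"" + text + "\"";
        }
        // Integer, Long, LocalDate, Instant, BigDecimal, UUID, ... print well as-is.
        return value.toString();
    }

    public static void main(String[] args) {
        System.out.println(render(42));                            // 42
        System.out.println(render("Alice"));                       // "Alice"
        System.out.println(render(LocalDate.of(2024, 1, 1)));      // 2024-01-01
        System.out.println(render(new byte[] {0x0A, (byte) 0xFF})); // 0x0aff
        System.out.println(render(null));                          // null
    }
}
```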
A parallel getRawValue family (RowReader.getRawValue, PqStruct.getRawValue, PqMap.Entry.getRawKey / getRawValue, PqList.getRaw / rawValues) returns the boxed physical value when even the physical type isn't known statically. In hot loops, prefer the typed primitive accessor described above — it avoids the boxing and the dispatch overhead.
Nested groups (struct / list / map / variant) have no distinct "raw" form and are returned through their typed flyweight (PqStruct / PqList / PqMap / PqVariant) in both modes.