The Layer Model¶
ColumnReader hands you a column as flat, typed primitive arrays. For a nested column — a struct
field, a list, a map, or any combination — those flat arrays need extra structure to say which
leaf values belong to which record, and where the nulls and empty containers sit. The layer
model is how ColumnReader expresses that structure without boxing or per-row objects. This page
explains the model; for worked code against it, see
Column-Oriented Reading.
Try it yourself
Want to run it or explore the capabilities yourself? Layer Model counts list and map items from offsets without materializing rows, and Nested Data reads the same shapes through the Row API.
A note on lineage. The layer model — flat value arrays, offset buffers, and set-bit-present validity bitmaps — takes inspiration from Apache Arrow's columnar representation, so the shapes feel familiar if you come from an Arrow-based engine. The resemblance is conceptual only: Hardwood implements no part of the Arrow specification, and the buffers are plain Java arrays rather than an Arrow-bit-compatible layout.
Layers¶
ColumnReader exposes a column's schema chain as a sequence of layers. Each non-leaf node
along the chain contributes zero or one layer:
| Schema node | Contributes layer? | Layer kind |
|---|---|---|
REQUIRED group |
no | — |
OPTIONAL group (struct) |
yes | STRUCT |
LIST / MAP-annotated group |
yes — exactly one | REPEATED |
synthetic repeated group inside a LIST / MAP |
no | — |
Layers follow a column's logical structure rather than its physical schema nodes, so node count
and layer count diverge wherever a LIST or MAP appears: its stack of group nodes contributes a
single REPEATED layer. How schema nodes map to layers works this
through on a concrete schema.
Layers are numbered 0..getLayerCount() - 1 outermost-to-innermost, and the leaf is queried
separately. A flat column reports getLayerCount() == 0.
For each layer, two buffers describe the items at that layer:
getLayerValidity(k)— a Validity indexed by item position;isNull(i)/isNotNull(i)answer the per-item question,hasNulls()is the O(1) fast-path gate. Returns the sharedValidity.NO_NULLSsingleton when no item at layerkis null in this batch.getLayerOffsets(k)— sentinel-suffixed offsets of lengthcount(k) + 1into the next inner layer's items (or, for the innermost layer, into the leaf-value array). Only valid whengetLayerKind(k) == REPEATED; throws on aSTRUCTlayer.
The leaf has its own getLeafValidity() (also a Validity).
Empty versus null containers¶
The four states of a LIST / MAP container fall out cleanly from offsets and validity:
| Logical value | getLayerValidity |
offsets[r+1] - offsets[r] |
|---|---|---|
null |
isNull(r) |
0 |
[] |
isNotNull(r) |
0 |
[null] |
isNotNull(r) |
1, leaf validity says null |
[v] |
isNotNull(r) |
1, leaf validity says not null |
Empty-vs-null is the offsets diff; the validity bit picks out null. No empty-marker bitmap is needed.
STRUCT keeps cardinality, REPEATED expands it¶
Two rules govern how item counts flow down the chain:
- STRUCT keeps cardinality. Items at layer
k+1equal items at layerk. STRUCT layers carry validity, no offsets. - REPEATED expands cardinality. Items at layer
k+1equalgetLayerOffsets(k)[count(k)]. REPEATED layers carry both validity and offsets.
The leaf array and getLayerOffsets carry real items only — phantom slots from null/empty
parents at any REPEATED layer are excluded. getValueCount() returns the real leaf count.
STRUCT layers do not expand or contract the item stream; only REPEATED layers add cardinality,
via their offsets.
Those two rules generate the layer shape of any chain:
| Schema chain | getLayerCount() |
Kinds (outer → inner) |
|---|---|---|
optional double x |
0 | — |
optional struct { ... int x } |
1 | STRUCT |
list<int> |
1 | REPEATED |
map<string, int> |
1 | REPEATED |
list<list<int>> |
2 | REPEATED, REPEATED |
optional struct { list<int> } |
2 | STRUCT, REPEATED |
list<optional struct { ... }> |
2 | REPEATED, STRUCT |
optional struct { map<string, int> } |
2 | STRUCT, REPEATED |
Maps report as REPEATED — the layer enum does not distinguish map-shape from list-shape; consult
getColumnSchema() if you need that distinction.
How schema nodes map to layers¶
Because layers follow logical structure, the physical group nodes of a schema do not map one-to-one
onto layers: a LIST or MAP is encoded as an annotated outer group wrapping a synthetic repeated
group, and that pair contributes one REPEATED layer rather than one layer per node. The rules
above resolve any schema unambiguously. Take a contacts column whose schema prints as:
optional group contacts (LIST) {
repeated group list {
optional group element {
required byte_array name (STRING);
optional byte_array phoneNumber (STRING);
}
}
}
Walking the chain for contacts.list.element.name, the two LIST nodes fuse into one layer:
optional group contacts (LIST) ──┐
repeated group list ──┴─► REPEATED (layer 0)
optional group element ────► STRUCT (layer 1)
required byte_array name ────► leaf
getLayerCount() is 2, with kinds [REPEATED, STRUCT]. The list's own nullability is not a
separate layer: a null contacts is recorded in getLayerValidity(0) of the REPEATED layer — the
same layer shape a required list would have, differing only in that validity. Layer 1 is the
element struct, which carries validity but no offsets, so getLayerOffsets(1) throws Layer 1 is
STRUCT, not REPEATED.
Logical element versus physical encoding¶
The same logical-over-physical split governs schema navigation, not just layers.
SchemaNode.GroupNode.getListElement() returns a list's logical element — what the element
is — independent of whether the list is physically 2-level or 3-level. For list<list<int>> the
element is the inner list<int>; for list<int> it is the int. The encoding changes only where
that element sits in the node tree, never what getListElement() resolves it to.
The returned node keeps all its physical attributes. In a standard 3-level list the element is the
non-repeated child below the synthetic repeated group; in a legacy 2-level list the repeated field
is the element, so the returned node is itself REPEATED. Counted structurally, a list's nesting
depth is the number of repeated nodes on the path (equivalently, the leaf's maximum repetition
level), not the number of LIST annotations. To walk to the leaf logically, recurse
getListElement() while the result isList() — uniform across both encodings. To inspect the raw
physical tree instead, walk children().
Counts at each layer¶
Every per-layer buffer is sized to count(k), defined recursively:
count(0) == getRecordCount()- For
k > 0:count(k)equalscount(k-1)if layerk-1isSTRUCT, orgetLayerOffsets(k-1)[count(k-1)](the trailing sentinel) if layerk-1isREPEATED.
The leaf array itself follows the same rule one step past the innermost layer, so
getValueCount() matches count(layerCount). Deeper nestings extend the same chain: at depth N you
walk getLayerOffsets(0) through getLayerOffsets(N - 1), checking getLayerValidity(k) (and, for
REPEATED layers, the zero-length offsets diff that flags an empty container) at each step before
descending.