Class ColumnReader
- All Implemented Interfaces:
AutoCloseable
Batch-oriented column reader for reading a single column across all row groups.
Exposes a column's batch as typed leaf values plus a layer-model view of
the schema chain between root and leaf. Each non-leaf node along the chain
contributes zero or one LayerKind layer:
- OPTIONAL group → LayerKind.STRUCT
- LIST/MAP-annotated group → LayerKind.REPEATED
- REQUIRED group / synthetic LIST scaffolding → no layer
Layers are numbered 0..getLayerCount() - 1 outermost-to-innermost. A flat
column (no enclosing nullable groups, no repetition) reports
getLayerCount() == 0 and is queried solely through getLeafValidity()
plus the typed value accessors.
Polarity: validity bitmaps carry set bit = present semantics. A
null return is the sparse representation of "every item at that scope
is present in the current batch."
Real items only. Layer offsets and the leaf array are sized to
real-items-only counts. Phantom positions for null/empty parents are
excluded. At REPEATED layers,
getLayerOffsets(k)[i+1] - getLayerOffsets(k)[i] == 0 marks an empty list
or map, while a null one is flagged by that layer's validity (validity
carries the null bit).
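The null-versus-empty rule above can be sketched with plain arrays. This is a minimal illustration, not the library's implementation: `LayerStateSketch`, `ListState`, and `classify` are hypothetical names, and a `java.util.BitSet` stands in for the validity bitmap (set bit = present, with a `null` bitmap read as "everything present", following the polarity paragraph above).

```java
import java.util.BitSet;

public class LayerStateSketch {
    enum ListState { NULL, EMPTY, NON_EMPTY }

    // Classifies item i of a REPEATED layer.
    // ASSUMPTION: validity is modelled as a BitSet; a null BitSet means "all present".
    static ListState classify(int[] offsets, BitSet validity, int i) {
        if (validity != null && !validity.get(i)) {
            return ListState.NULL;            // unset bit: the list itself is null
        }
        int length = offsets[i + 1] - offsets[i];
        return length == 0 ? ListState.EMPTY : ListState.NON_EMPTY;
    }

    public static void main(String[] args) {
        // Four lists: [10, 20], null, [], [30]. Only three real leaf items exist.
        int[] offsets = {0, 2, 2, 2, 3};
        BitSet validity = new BitSet(4);
        validity.set(0);
        validity.set(2);
        validity.set(3);                      // bit 1 stays unset: list 1 is null
        for (int i = 0; i < 4; i++) {
            System.out.println(i + " -> " + classify(offsets, validity, i));
        }
    }
}
```

Checking validity before the offset span keeps the two cases separate even though neither a null nor an empty list contributes leaf items.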
This API is Experimental: the shape of the batch accessors and
layer representation may change in future releases without prior
deprecation.
-
Method Summary
| Modifier and Type | Method | Description |
| --- | --- | --- |
| void | close() | |
| byte[][] | getBinaries() | Materialises one byte[] per leaf value, copying out of the binary buffer. |
| int[] | getBinaryOffsets() | Sentinel-suffixed offsets into getBinaryValues(). |
| byte[] | getBinaryValues() | Backing byte buffer for a varlength leaf. |
| boolean[] | getBooleans() | |
| | getColumnSchema() | |
| int[] | getDefinitionLevels() | Raw definition levels for the current batch. |
| double[] | getDoubles() | |
| float[] | getFloats() | |
| int[] | getInts() | |
| int | getLayerCount() | Number of layers in this column's schema chain. 0 for a flat column. |
| | getLayerKind(int layer) | Layer kind at layer. |
| int[] | getLayerOffsets(int layer) | Offsets at layer. |
| | getLayerValidity(int layer) | Validity at layer. |
| | getLeafValidity() | Validity over the leaf-value array, indexed 0..getValueCount(). |
| long[] | getLongs() | |
| int | getRecordCount() | Number of top-level records in the current batch. |
| int[] | getRepetitionLevels() | Raw repetition levels for the current batch. |
| String[] | getStrings() | Convenience: materialises one String per leaf value by UTF-8 decoding the slice of getBinaryValues() for each entry. |
| int | getValueCount() | Total number of leaf values in the current batch, sized to real items only (phantom slots from null/empty parents are excluded). |
| boolean | nextBatch() | Advance to the next batch. |
-
Method Details
-
nextBatch
public boolean nextBatch()
Advance to the next batch.
Multi-column alignment. Every ColumnReader over the same file produces batches at the same row boundaries: call nextBatch() on each in turn and they will report identical getRecordCount() values for the matching batch. This holds because the per-column drain workers all use the same internal batch capacity and every reader observes the same total row count per row group.
Consumers reading multiple columns in lockstep should generally prefer ColumnReaders (obtained from ParquetFileReader.buildColumnReaders(dev.hardwood.schema.ColumnProjection)), which shares a single RowGroupIterator across all columns and exposes a single coordinated ColumnReaders.nextBatch() that drives every reader and validates alignment in one call.
- Returns:
- true if a batch is available, false if exhausted
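The lockstep pattern described above might look roughly like the following when driven by hand. This is only a sketch under stated assumptions: `BatchSource` is a hypothetical stand-in interface exposing just `nextBatch()` and `getRecordCount()`, and `FixedBatches` is a fake source used so the example runs without a Parquet file. Neither is part of the library, and ColumnReaders performs this coordination for real readers.

```java
import java.util.List;

public class LockstepSketch {
    /** Hypothetical stand-in exposing only the two calls the loop needs. */
    interface BatchSource {
        boolean nextBatch();
        int getRecordCount();
    }

    /** Fake source serving fixed batch sizes; for illustration only. */
    static final class FixedBatches implements BatchSource {
        private final int[] sizes;
        private int index = -1;
        FixedBatches(int... sizes) { this.sizes = sizes; }
        public boolean nextBatch() { return ++index < sizes.length; }
        public int getRecordCount() { return sizes[index]; }
    }

    /** Advances every source together; returns total records, throws on misalignment. */
    static long drainInLockstep(List<? extends BatchSource> sources) {
        long total = 0;
        while (true) {
            int advanced = 0;
            for (BatchSource s : sources) {
                if (s.nextBatch()) advanced++;
            }
            if (advanced == 0) return total;              // every source is exhausted
            if (advanced != sources.size()) {
                throw new IllegalStateException("sources exhausted at different batches");
            }
            int expected = sources.get(0).getRecordCount();
            for (BatchSource s : sources) {
                if (s.getRecordCount() != expected) {
                    throw new IllegalStateException("record counts differ within a batch");
                }
            }
            total += expected;                            // process the aligned batch here
        }
    }

    public static void main(String[] args) {
        long n = drainInLockstep(List.of(new FixedBatches(3, 2), new FixedBatches(3, 2)));
        System.out.println(n); // 5
    }
}
```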
-
getRecordCount
public int getRecordCount()
Number of top-level records in the current batch.
-
getValueCount
public int getValueCount()
Total number of leaf values in the current batch, sized to real items only (phantom slots from null/empty parents are excluded). For flat columns this equals getRecordCount().
-
getLayerCount
public int getLayerCount()
Number of layers in this column's schema chain. 0 for a flat column. Stable for the lifetime of this reader and safe to call before the first nextBatch(), which is useful for sizing consumer-side buffers.
-
getLayerKind
Layer kind at layer. Stable for the lifetime of this reader and safe to call before the first nextBatch().
-
getLayerValidity
Validity at layer. Returns Validity.NO_NULLS when no item at that layer is null in the current batch (the sparse fast path); otherwise returns a wrapper over the per-item null bitmap.
-
getLayerOffsets
public int[] getLayerOffsets(int layer)
Offsets at layer. Length == count(layer) + 1, with offsets[count(layer)] equal to count(layer + 1) (or to getValueCount() for the innermost layer). offsets[i+1] - offsets[i] == 0 denotes an empty list/map.
- Throws:
IllegalStateException - if layer is outside [0, getLayerCount()) or if the layer is not LayerKind.REPEATED
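As an illustration of the offsets contract, the sketch below walks a single REPEATED layer over an int leaf array and sums each list. It assumes a list<int> column with exactly one layer, so the offsets index directly into the leaf values. `OffsetsWalkSketch` and `sumPerList` are hypothetical names, and null handling through the layer validity is left out for brevity.

```java
import java.util.Arrays;

public class OffsetsWalkSketch {
    // offsets has length listCount + 1; offsets[listCount] equals the leaf count.
    static long[] sumPerList(int[] offsets, int[] leaves) {
        int listCount = offsets.length - 1;
        long[] sums = new long[listCount];
        for (int i = 0; i < listCount; i++) {
            // Half-open slice [offsets[i], offsets[i+1]) holds the items of list i.
            for (int j = offsets[i]; j < offsets[i + 1]; j++) {
                sums[i] += leaves[j];
            }
        }
        return sums;
    }

    public static void main(String[] args) {
        // Three records: [1, 2], [], [3, 4, 5]. Five real leaf values in total.
        int[] offsets = {0, 2, 2, 5};
        int[] leaves = {1, 2, 3, 4, 5};
        System.out.println(Arrays.toString(sumPerList(offsets, leaves))); // [3, 0, 12]
    }
}
```

With deeper nesting, the outer layer's offsets would index into the next layer's items rather than into the leaves, so the same slice walk would be applied once per layer.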
-
getLeafValidity
Validity over the leaf-value array, indexed 0..getValueCount(). Returns Validity.NO_NULLS when no leaf in the current batch is null.
-
getInts
public int[] getInts()
-
getLongs
public long[] getLongs()
-
getFloats
public float[] getFloats()
-
getDoubles
public double[] getDoubles()
-
getBooleans
public boolean[] getBooleans()
-
getBinaryValues
public byte[] getBinaryValues()
Backing byte buffer for a varlength leaf. Capacity-sized: only bytes in the half-open range [0, getBinaryOffsets()[getValueCount()]) are valid; bytes beyond that position are unspecified.
- Throws:
IllegalStateException - for non-byte-array leaves
-
getBinaryOffsets
public int[] getBinaryOffsets()
Sentinel-suffixed offsets into getBinaryValues(). Length == getValueCount() + 1; the byte length of value i is offsets[i+1] - offsets[i]. For FIXED_LEN_BYTE_ARRAY columns the offsets are trivially i * width.
- Throws:
IllegalStateException - for non-byte-array leaves
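The buffer-plus-offsets layout can be read without allocating one byte[] per value, roughly as sketched below. Plain arrays stand in for the results of getBinaryValues() and getBinaryOffsets(). `BinarySliceSketch`, `decodeUtf8`, and `fixedWidthOffsets` are hypothetical helper names, and the null check against getLeafValidity() is omitted.

```java
import java.nio.charset.StandardCharsets;

public class BinarySliceSketch {
    // Decodes value i straight from the shared buffer; no intermediate byte[] copy.
    static String decodeUtf8(byte[] values, int[] offsets, int i) {
        int start = offsets[i];
        int length = offsets[i + 1] - start;
        return new String(values, start, length, StandardCharsets.UTF_8);
    }

    // Offsets for a fixed-width column: entry i is i * width, plus the trailing sentinel.
    static int[] fixedWidthOffsets(int valueCount, int width) {
        int[] offsets = new int[valueCount + 1];
        for (int i = 0; i <= valueCount; i++) {
            offsets[i] = i * width;
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Capacity-sized buffer: only the first offsets[valueCount] = 5 bytes are valid.
        byte[] values = new byte[16];
        byte[] payload = "abcde".getBytes(StandardCharsets.UTF_8);
        System.arraycopy(payload, 0, values, 0, payload.length);
        int[] offsets = {0, 2, 2, 5};   // three values: "ab", "", "cde"
        for (int i = 0; i < offsets.length - 1; i++) {
            System.out.println("[" + decodeUtf8(values, offsets, i) + "]");
        }
    }
}
```

Reading only up to the sentinel offset matters because the bytes past it in the capacity-sized buffer are unspecified.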
-
getBinaries
public byte[][] getBinaries()
Materialises one byte[] per leaf value, copying out of the binary buffer. Returns null at indexes where getLeafValidity() is unset. Allocates one byte array per leaf; hot loops should consult getBinaryValues() + getBinaryOffsets() directly.
The returned array has length getValueCount(), i.e. the real leaf count, not getRecordCount(). For a flat column the two coincide; for list<binary> and similar nested chains they differ, and lookups must go through the appropriate layer offsets rather than indexing by record.
-
getStrings
Convenience: materialises one String per leaf value by UTF-8 decoding the slice of getBinaryValues() for each entry. Returns null at indexes where getLeafValidity() is unset. BSON columns are not string-decoded; use getBinaries() / getBinaryValues() for those.
The returned array has length getValueCount(), i.e. the real leaf count, not getRecordCount(). For a flat column the two coincide; for list<string> and similar nested chains they differ, and lookups must go through the appropriate layer offsets rather than indexing by record.
-
getDefinitionLevels
public int[] getDefinitionLevels()
Raw definition levels for the current batch. Returns null for flat columns; their validity is fully captured by getLeafValidity().
-
getRepetitionLevels
public int[] getRepetitionLevels()
Raw repetition levels for the current batch. Returns null for columns whose maxRepetitionLevel == 0.
-
getColumnSchema
-
close
public void close()
- Specified by:
close in interface AutoCloseable
-