Class ColumnReader

java.lang.Object
dev.hardwood.reader.ColumnReader
All Implemented Interfaces:
AutoCloseable

@Experimental public class ColumnReader extends Object implements AutoCloseable

Batch-oriented column reader for reading a single column across all row groups.

Exposes a column's batch as typed leaf values plus a layer-model view of the schema chain between root and leaf. Each non-leaf node along the chain contributes zero or one layer, identified by its LayerKind.

Layers are numbered 0..getLayerCount() - 1 outermost-to-innermost. A flat column (no enclosing nullable groups, no repetition) reports getLayerCount() == 0 and is queried solely through getLeafValidity() plus the typed value accessors.

Polarity: validity bitmaps carry set bit = present semantics. Validity.NO_NULLS is the sparse representation of "every item at that scope is present in the current batch."

Real items only. Layer offsets and the leaf array are sized to real-items-only counts; phantom positions for null/empty parents are excluded. At REPEATED layers a null item and an empty item therefore both have getLayerOffsets(k)[i+1] - getLayerOffsets(k)[i] == 0, and the layer validity tells them apart (an unset bit means null).
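
The real-items-only layout can be illustrated with plain arrays standing in for what getLayerOffsets(k) and getLayerValidity(k) expose. This is a minimal sketch under stated assumptions, not the library's implementation: the class LayerItems and its method describe are hypothetical, and java.util.BitSet is used in place of Validity because the Validity accessors are not described on this page. A null BitSet plays the role of "every item present".

```java
import java.util.BitSet;

// Hypothetical helper: classifies each item of one REPEATED layer.
final class LayerItems {
    // offsets: count + 1 entries, shaped like the result of getLayerOffsets(k).
    // validity: set bit = present; null is a stand-in for "every item present".
    static String[] describe(int[] offsets, BitSet validity) {
        int count = offsets.length - 1;
        String[] out = new String[count];
        for (int i = 0; i < count; i++) {
            int span = offsets[i + 1] - offsets[i];
            boolean present = validity == null || validity.get(i);
            if (!present) {
                out[i] = "null";          // validity carries the null bit
            } else if (span == 0) {
                out[i] = "empty";         // present, but no children
            } else {
                out[i] = "size=" + span;  // present with span children
            }
        }
        return out;
    }
}
```

With offsets {0, 2, 2, 2} and validity bits 0 and 2 set, the three items come out as a two-element list, a null list, and an empty list.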

This API is Experimental: the shape of the batch accessors and layer representation may change in future releases without prior deprecation.

  • Method Details

    • nextBatch

      public boolean nextBatch()

      Advance to the next batch.

      Multi-column alignment. Every ColumnReader over the same file produces batches at the same row boundaries: call nextBatch() on each in turn and they will report identical getRecordCount() values for the matching batch. This holds because the per-column drain workers all use the same internal batch capacity and every reader observes the same total row count per row group.

      Consumers reading multiple columns in lockstep should generally prefer ColumnReaders (obtained from ParquetFileReader.buildColumnReaders(dev.hardwood.schema.ColumnProjection)), which shares a single RowGroupIterator across all columns and exposes a single coordinated ColumnReaders.nextBatch() that drives every reader and validates alignment in one call.

      Returns:
      true if a batch is available, false if exhausted
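
The batch protocol amounts to a plain drain loop: call nextBatch() until it returns false, and read the accessors in between. The sketch below shows that shape for a flat int column with no nulls. Because it has to run without the library, it uses a hypothetical stand-in interface IntBatchSource and an in-memory fake; only the method names nextBatch(), getRecordCount(), getInts() and close() are taken from this page.

```java
// Hypothetical stand-in exposing only the accessors the loop needs;
// it is not the library's ColumnReader.
interface IntBatchSource extends AutoCloseable {
    boolean nextBatch();
    int getRecordCount();
    int[] getInts();
    @Override void close();
}

final class DrainLoop {
    // Sums a flat, null-free int column batch by batch, then closes the source.
    static long sumAll(IntBatchSource source) {
        long total = 0;
        try (IntBatchSource s = source) {
            while (s.nextBatch()) {
                int n = s.getRecordCount();  // flat column: one value per record
                int[] values = s.getInts();
                for (int i = 0; i < n; i++) {
                    total += values[i];
                }
            }
        }
        return total;
    }

    // In-memory fake that serves fixed batches, for demonstration only.
    static IntBatchSource fake(int[][] batches) {
        return new IntBatchSource() {
            private int next = 0;
            private int[] current = new int[0];
            public boolean nextBatch() {
                if (next >= batches.length) return false;
                current = batches[next++];
                return true;
            }
            public int getRecordCount() { return current.length; }
            public int[] getInts() { return current; }
            public void close() { }
        };
    }
}
```

The loop bounds each pass by getRecordCount() rather than by the array length, so it does not depend on how the returned array is sized.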
    • getRecordCount

      public int getRecordCount()
      Number of top-level records in the current batch.
    • getValueCount

      public int getValueCount()
      Total number of leaf values in the current batch — sized to real items only (phantom slots from null/empty parents are excluded). For flat columns this equals getRecordCount().
    • getLayerCount

      public int getLayerCount()
      Number of layers in this column's schema chain. 0 for a flat column. Stable for the lifetime of this reader and safe to call before the first nextBatch() — useful for sizing consumer-side buffers.
    • getLayerKind

      public LayerKind getLayerKind(int layer)
      Layer kind at layer. Stable for the lifetime of this reader and safe to call before the first nextBatch().
    • getLayerValidity

      public Validity getLayerValidity(int layer)
      Validity at layer. Returns Validity.NO_NULLS when no item at that layer is null in the current batch (the sparse fast path); otherwise returns a wrapper over the per-item null bitmap.
    • getLayerOffsets

      public int[] getLayerOffsets(int layer)
      Offsets at layer. Length == count(layer) + 1, where count(k) is the number of real items at layer k in the current batch, with offsets[count(layer)] equal to count(layer + 1) (or to getValueCount() for the innermost layer). offsets[i+1] - offsets[i] == 0 denotes an empty list/map.
      Throws:
      IllegalStateException - if layer is outside [0, getLayerCount()) or if the layer is not LayerKind.REPEATED
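
The chaining rule between adjacent offset arrays can be checked and used with plain arrays. The following is a sketch for a list<list<int>> shape, assuming both layers are REPEATED and every leaf is present; the class NestedOffsets and its methods are hypothetical, and the arrays stand in for getLayerOffsets(0), getLayerOffsets(1) and getInts().

```java
// Hypothetical helper operating on arrays shaped like two adjacent
// REPEATED layers and their leaf values.
final class NestedOffsets {
    // Checks the chaining rule: the last offset of the outer layer equals the
    // item count of the inner layer, whose last offset equals the leaf count.
    static boolean chained(int[] outer, int[] inner, int valueCount) {
        int innerCount = inner.length - 1;
        return outer[outer.length - 1] == innerCount
                && inner[innerCount] == valueCount;
    }

    // Sums the leaf values reachable from each outer item.
    static long[] sumPerOuterItem(int[] outer, int[] inner, int[] leaves) {
        long[] sums = new long[outer.length - 1];
        for (int r = 0; r < sums.length; r++) {
            // outer[r]..outer[r+1] selects inner items of outer item r
            for (int j = outer[r]; j < outer[r + 1]; j++) {
                // inner[j]..inner[j+1] selects leaves of inner item j
                for (int v = inner[j]; v < inner[j + 1]; v++) {
                    sums[r] += leaves[v];
                }
            }
        }
        return sums;
    }
}
```

For the three records [[1, 2], [3]], [] and [[4]], the outer offsets would be {0, 2, 2, 3}, the inner offsets {0, 2, 3, 4}, and the leaves {1, 2, 3, 4}.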
    • getLeafValidity

      public Validity getLeafValidity()
      Validity over the leaf-value array, indexed 0..getValueCount() - 1. Returns Validity.NO_NULLS when no leaf in the current batch is null.
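
A consumer typically branches once on the no-nulls case and otherwise consults the bitmap per leaf. The sketch below shows that pattern with set bit = present polarity. It is illustrative only: java.util.BitSet stands in for Validity (whose accessor methods are not described on this page), a null BitSet stands in for Validity.NO_NULLS, and the class LeafNulls is hypothetical.

```java
import java.util.BitSet;

// Hypothetical helper; BitSet is a stand-in for the library's Validity type.
final class LeafNulls {
    // Counts null leaves among the first valueCount positions.
    static int countNulls(BitSet validity, int valueCount) {
        if (validity == null) {
            return 0;  // sparse fast path: every leaf is present
        }
        return valueCount - validity.get(0, valueCount).cardinality();
    }

    // Sums only the leaves whose validity bit is set.
    static long sumPresent(int[] values, BitSet validity, int valueCount) {
        long sum = 0;
        for (int i = 0; i < valueCount; i++) {
            if (validity == null || validity.get(i)) {
                sum += values[i];
            }
        }
        return sum;
    }
}
```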
    • getInts

      public int[] getInts()
    • getLongs

      public long[] getLongs()
    • getFloats

      public float[] getFloats()
    • getDoubles

      public double[] getDoubles()
    • getBooleans

      public boolean[] getBooleans()
    • getBinaryValues

      public byte[] getBinaryValues()
      Backing byte buffer for a varlength leaf. Capacity-sized: only bytes in the half-open range [0, getBinaryOffsets()[getValueCount()]) are valid; bytes beyond that position are unspecified.
      Throws:
      IllegalStateException - for non-byte-array leaves
    • getBinaryOffsets

      public int[] getBinaryOffsets()
      Sentinel-suffixed offsets into getBinaryValues(). Length == getValueCount() + 1; the byte length of value i is offsets[i+1] - offsets[i]. For FIXED_LEN_BYTE_ARRAY columns the offsets are trivially i * width.
      Throws:
      IllegalStateException - for non-byte-array leaves
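
Reading value i from the shared buffer is a matter of slicing between two adjacent offsets. A minimal sketch, assuming the arrays have the layout described for getBinaryValues() and getBinaryOffsets(); the class BinarySlices and its methods are hypothetical, and callers would still check leaf validity before interpreting a slice.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper working on a values buffer plus sentinel-suffixed offsets.
final class BinarySlices {
    // Decodes value i as UTF-8 straight from the shared buffer,
    // without an intermediate byte[] copy.
    static String utf8At(byte[] values, int[] offsets, int i) {
        int start = offsets[i];
        int length = offsets[i + 1] - start;
        return new String(values, start, length, StandardCharsets.UTF_8);
    }

    // Offsets for a fixed-width layout: entry i is i * width,
    // with one trailing sentinel entry.
    static int[] fixedWidthOffsets(int valueCount, int width) {
        int[] offsets = new int[valueCount + 1];
        for (int i = 0; i <= valueCount; i++) {
            offsets[i] = i * width;
        }
        return offsets;
    }
}
```

Bytes past offsets[valueCount] are never touched, which matters because the buffer is capacity-sized.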
    • getBinaries

      public byte[][] getBinaries()

      Materialises one byte[] per leaf value, copying out of the binary buffer. Entries are null at indexes where getLeafValidity() is unset. Because this allocates one byte array per leaf, hot loops should consult getBinaryValues() + getBinaryOffsets() directly.

      The returned array has length getValueCount() — i.e. the real leaf count, not getRecordCount(). For a flat column the two coincide; for list<binary> and similar nested chains they differ, and lookups must go through the appropriate layer offsets rather than indexing by record.

    • getStrings

      public String[] getStrings()

      Convenience: materialises one String per leaf value by UTF-8 decoding the slice of getBinaryValues() for each entry. Entries are null at indexes where getLeafValidity() is unset. BSON columns are not string-decoded; use getBinaries() / getBinaryValues() for those.

      The returned array has length getValueCount() — i.e. the real leaf count, not getRecordCount(). For a flat column the two coincide; for list<string> and similar nested chains they differ, and lookups must go through the appropriate layer offsets rather than indexing by record.
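
For a list<string> column, going "through the layer offsets" means slicing the leaf-indexed string array per record. The sketch below assumes a single REPEATED layer and ignores layer validity, so a null list and an empty list both come out as an empty group; the class StringsPerRecord is hypothetical, and the arrays stand in for getStrings() and getLayerOffsets(0).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: regroups leaf-indexed strings into per-record lists.
final class StringsPerRecord {
    // strings: one entry per leaf value; entries may be null.
    // offsets: record-level offsets of the single REPEATED layer.
    static List<List<String>> group(String[] strings, int[] offsets) {
        List<String> leaves = Arrays.asList(strings);
        List<List<String>> records = new ArrayList<>();
        for (int r = 0; r + 1 < offsets.length; r++) {
            // Copy the slice so each record owns its own list.
            records.add(new ArrayList<>(leaves.subList(offsets[r], offsets[r + 1])));
        }
        return records;
    }
}
```

Indexing the string array by record number instead would read the wrong leaves as soon as any record holds zero or several strings.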

    • getDefinitionLevels

      public int[] getDefinitionLevels()
      Raw definition levels for the current batch. Returns null for flat columns; their validity is fully captured by getLeafValidity().
    • getRepetitionLevels

      public int[] getRepetitionLevels()
      Raw repetition levels for the current batch. Returns null for columns whose maxRepetitionLevel == 0.
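
Consumers that work from the raw levels can rely on two properties of Parquet's level encoding: a repetition level of 0 starts a new record, and a leaf value is stored only where the definition level reaches the column's maximum. A minimal sketch of both scans follows, assuming the arrays passed in hold exactly the level entries of one batch (this page does not say how the returned arrays are sized); the class LevelScan is hypothetical.

```java
// Hypothetical helper scanning raw Parquet levels for one batch.
final class LevelScan {
    // A repetition level of 0 marks the first entry of a new record.
    static int countRecords(int[] repetitionLevels) {
        int records = 0;
        for (int level : repetitionLevels) {
            if (level == 0) {
                records++;
            }
        }
        return records;
    }

    // Entries whose definition level equals the maximum carry a stored value;
    // lower levels mean something along the chain is null or empty.
    static int countStoredValues(int[] definitionLevels, int maxDefinitionLevel) {
        int stored = 0;
        for (int level : definitionLevels) {
            if (level == maxDefinitionLevel) {
                stored++;
            }
        }
        return stored;
    }
}
```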
    • getColumnSchema

      public ColumnSchema getColumnSchema()
    • close

      public void close()
      Specified by:
      close in interface AutoCloseable