Variant Columns
A Parquet column annotated with the VARIANT logical type carries semi-structured, JSON-like data in a self-describing binary encoding. Physically it is a group of two required BYTE_ARRAY children, metadata and value, whose bytes together define a Variant value with its own type tag (object, array, string, int, etc.). getVariant reads both children and surfaces them through the PqVariant API.
Try it yourself
Want to run it or explore the capabilities yourself? The Variant Columns example renders every Variant type through one recursive method, including nested objects and arrays.
try (RowReader rowReader = fileReader.rowReader()) {
    while (rowReader.hasNext()) {
        rowReader.next();
        PqVariant v = rowReader.getVariant("event");
        if (v == null) {
            continue; // SQL NULL
        }

        // Type introspection
        VariantType tag = v.type(); // OBJECT, ARRAY, STRING, INT32, ...
        if (tag == VariantType.OBJECT) {
            PqVariantObject obj = v.asObject();
            String userId = obj.getString("user_id");
            int age = obj.getInt("age");
            Instant ts = obj.getTimestamp("ts");

            // Nested Variant OBJECT / ARRAY: same vocabulary all the way down
            PqVariantObject addr = obj.getObject("address");
            PqVariantArray tags = obj.getArray("tags");
        }

        // Raw canonical bytes (for round-tripping or hashing)
        byte[] metadata = v.metadata();
        byte[] value = v.value();
    }
}
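The raw bytes make it possible to fingerprint a Variant without decoding it. The following is a minimal sketch, assuming only that metadata() and value() return the two byte arrays shown above. The VariantFingerprint class and its length-prefix framing are illustrative choices, not part of the library.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of the library.
final class VariantFingerprint {

    // Hashes metadata and value together. Each array is prefixed with its
    // length so that ("ab", "c") and ("a", "bc") produce different digests.
    static String sha256Hex(byte[] metadata, byte[] value) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            digest.update(ByteBuffer.allocate(Integer.BYTES).putInt(metadata.length).array());
            digest.update(metadata);
            digest.update(ByteBuffer.allocate(Integer.BYTES).putInt(value.length).array());
            digest.update(value);

            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is mandatory for every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder bytes for illustration; these are not a valid Variant encoding.
        byte[] metadata = "ab".getBytes(StandardCharsets.US_ASCII);
        byte[] value = "c".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sha256Hex(metadata, value));
    }
}
```

In the read loop this would be called as VariantFingerprint.sha256Hex(v.metadata(), v.value()). Equal digests indicate equal bytes; whether equal logical values always produce equal bytes depends on the canonicalization the reader performs, which this sketch does not check.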
The PqVariantObject view exposes the same primitive getters as a Parquet struct (getInt, getString, getTimestamp, …), but its complex navigation uses getObject and getArray (Variant-spec terminology) rather than getStruct / getList / getMap. A PqVariantArray is iterable and indexed; elements are heterogeneous PqVariants — inspect each element's type() and unwrap appropriately.
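The "inspect the tag, then unwrap" pattern for heterogeneous elements can be sketched as a single recursive method. To stay runnable without the library, the example below uses stand-in types (Tag, Node) rather than PqVariant, PqVariantObject, and PqVariantArray; with the real API the same shape would presumably switch on type() and recurse through asObject() and the array's elements.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in model, NOT the library's classes: it only mimics the
// "inspect the tag, then unwrap" shape described above.
final class VariantWalkSketch {

    enum Tag { OBJECT, ARRAY, STRING, INT32, NULL }

    static final class Node {
        final Tag tag;
        final Object payload; // Map<String, Node>, List<Node>, String, Integer, or null

        Node(Tag tag, Object payload) {
            this.tag = tag;
            this.payload = payload;
        }
    }

    // One recursive method handles every tag; containers recurse into their children.
    @SuppressWarnings("unchecked")
    static String render(Node node) {
        switch (node.tag) {
            case OBJECT: {
                List<String> fields = new ArrayList<>();
                for (Map.Entry<String, Node> e : ((Map<String, Node>) node.payload).entrySet()) {
                    fields.add(e.getKey() + ": " + render(e.getValue()));
                }
                return "{" + String.join(", ", fields) + "}";
            }
            case ARRAY: {
                List<String> elements = new ArrayList<>();
                for (Node element : (List<Node>) node.payload) {
                    elements.add(render(element));
                }
                return "[" + String.join(", ", elements) + "]";
            }
            case STRING:
                return "\"" + node.payload + "\"";
            case INT32:
                return String.valueOf(node.payload);
            default:
                return "null";
        }
    }

    // Builds a mixed-type array: a string, an int, and a nested object.
    static Node sample() {
        Map<String, Node> fields = new LinkedHashMap<>();
        fields.put("k", new Node(Tag.NULL, null));
        return new Node(Tag.ARRAY, Arrays.asList(
                new Node(Tag.STRING, "a"),
                new Node(Tag.INT32, 1),
                new Node(Tag.OBJECT, fields)));
    }

    public static void main(String[] args) {
        System.out.println(render(sample())); // prints ["a", 1, {k: null}]
    }
}
```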
Primitive extraction on PqVariant: When you already hold a PqVariant (e.g. an array element) use the as*() methods — asInt, asString, asTimestamp, and so on. Each throws VariantTypeException if the variant's type tag doesn't match.
Timestamp tags: The Variant binary format carries four timestamp tags, split along the same isAdjustedToUTC boundary as the Parquet TIMESTAMP logical type. asTimestamp returns Instant and accepts the UTC-adjusted tags TIMESTAMP / TIMESTAMP_NANOS; asLocalTimestamp returns LocalDateTime and accepts the wall-clock tags TIMESTAMP_NTZ / TIMESTAMP_NTZ_NANOS. PqVariantObject.getTimestamp / getLocalTimestamp follow the same split.
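The difference between the two families is in interpretation, not storage. The sketch below assumes, following the Variant binary encoding, that a microsecond-precision timestamp is stored as a signed 64-bit count of microseconds since 1970-01-01T00:00:00. It uses only java.time; the class and method names are hypothetical and do not reflect how the library decodes values internally.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.temporal.ChronoUnit;

// Hypothetical helper showing how one raw count maps to the two Java types.
final class VariantTimestampSketch {

    // UTC-adjusted tag (TIMESTAMP): the count identifies a point on the global timeline.
    static Instant instantFromMicros(long microsSinceEpoch) {
        return Instant.EPOCH.plus(microsSinceEpoch, ChronoUnit.MICROS);
    }

    // Wall-clock tag (TIMESTAMP_NTZ): the count describes calendar fields only.
    // ZoneOffset.UTC is used purely to turn the count into fields; no zone is attached.
    static LocalDateTime localFromMicros(long micros) {
        long seconds = Math.floorDiv(micros, 1_000_000L);
        int nanos = (int) Math.floorMod(micros, 1_000_000L) * 1_000;
        return LocalDateTime.ofEpochSecond(seconds, nanos, ZoneOffset.UTC);
    }

    public static void main(String[] args) {
        long raw = 1_700_000_000_123_456L;
        System.out.println(instantFromMicros(raw)); // prints 2023-11-14T22:13:20.123456Z
        System.out.println(localFromMicros(raw));   // prints 2023-11-14T22:13:20.123456
    }
}
```

The same count yields an Instant with a trailing Z in one case and a zone-less LocalDateTime in the other, which is why the two accessor families are kept separate rather than converting silently.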
Shredded Variants: Some writers store part of the payload in a typed sibling column (typed_value) alongside value for better compression and pushdown. Reassembly is transparent: metadata() and value() return canonical bytes regardless of whether the file was shredded, so PqVariant consumers see a single consistent representation.
Current limitations
- No Variant-aware predicate pushdown. Filter predicates against a Variant sub-path (e.g. WHERE v.age > 30) aren't yet understood by the pushdown pipeline. Filtering still works against the file's physical shredded columns if you know the layout: a FilterPredicate.gt("v.typed_value.age", 30) gets row-group and page skipping via ordinary column statistics. That approach ties query code to the writer's shredding strategy, however, and misses any rows where the payload sits in the opaque value blob instead. Tracked as #309.
- No path projection optimization. Reading only v.age from a Variant column still reassembles the whole Variant for each row rather than reading just the shredded typed_value.age column. Requires the same variant-aware planning as #309. Tracked as #700.
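The gap described in the first limitation can be illustrated with a small model. The Row class below is a stand-in for one row of a shredded Variant column, not the library's row model: typedAge plays the role of v.typed_value.age and blobAge the role of an age that only exists inside the opaque value blob.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in model of a shredded column; none of these types come from the library.
final class ShreddedFilterGapSketch {

    static final class Row {
        final String id;
        final Integer typedAge; // v.typed_value.age, null when this row was not shredded
        final Integer blobAge;  // age held only in the value blob, null when shredded

        Row(String id, Integer typedAge, Integer blobAge) {
            this.id = id;
            this.typedAge = typedAge;
            this.blobAge = blobAge;
        }
    }

    // What a predicate on the physical typed column can see: typedAge only.
    static List<String> physicalFilter(List<Row> rows, int threshold) {
        List<String> ids = new ArrayList<>();
        for (Row row : rows) {
            if (row.typedAge != null && row.typedAge > threshold) {
                ids.add(row.id);
            }
        }
        return ids;
    }

    // What a Variant-aware predicate would need to do: consult whichever location holds the field.
    static List<String> logicalFilter(List<Row> rows, int threshold) {
        List<String> ids = new ArrayList<>();
        for (Row row : rows) {
            Integer age = row.typedAge != null ? row.typedAge : row.blobAge;
            if (age != null && age > threshold) {
                ids.add(row.id);
            }
        }
        return ids;
    }

    static List<Row> sampleRows() {
        return Arrays.asList(
                new Row("a", 42, null),  // shredded, matches
                new Row("b", null, 55),  // matches, but only visible in the blob
                new Row("c", 20, null)); // shredded, does not match
    }

    public static void main(String[] args) {
        System.out.println(physicalFilter(sampleRows(), 30)); // prints [a]
        System.out.println(logicalFilter(sampleRows(), 30));  // prints [a, b]
    }
}
```

Row b is the case the limitation warns about: it satisfies age > 30 but is dropped by a filter that only inspects the typed column.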