
ORC Dataset

The ORC dataset plugin in CMEM BUILD reads from and writes to Apache ORC files. It is a Spark-optimized dataset, designed to leverage Spark's in-memory, parallel execution model and to take advantage of ORC-specific optimizations.

ORC supports optimizations such as column pruning and predicate pushdown, which allow queries to efficiently read only the needed data. This makes ORC particularly suitable for workflows where only a subset of columns or filtered rows are required, improving performance for large-scale transformations.

Key Features

  • Columnar storage format: ORC stores data in a column-oriented manner, enabling efficient compression and retrieval of individual columns.
  • Partitioning: Optional support for partitioned outputs, allowing Spark to parallelize data processing effectively.
  • Compression: Supports multiple compression algorithms, e.g., Snappy or Zlib, reducing storage footprint and improving I/O performance.
  • Schema handling: Supports explicit or inferred schemas, with the schema applied to the dataset as a whole, not individual rows.

Example

An ORC dataset in CMEM BUILD might look conceptually like this:

| transaction_id | customer_id | amount | timestamp |
|---|---|---|---|
| 1 | 1001 | 150.00 | 2025-11-01 08:15 |
| 2 | 1002 | 200.50 | 2025-11-01 09:30 |
| 3 | 1003 | 75.25 | 2025-11-01 11:20 |
This dataset could be stored in a file transactions.orc and used in workflows where only certain columns or filtered rows are needed. Thanks to ORC's column pruning and predicate pushdown optimizations, Spark reads only the relevant columns, or only the rows matching a condition, keeping transformations efficient even at large scale.

Reference

For more information on the ORC format and its optimizations, see the Apache ORC project page.


Comparison of Spark-optimized datasets

The following table summarizes the key differences and typical use cases of the main Spark-optimized datasets supported in CMEM BUILD. It provides a quick reference for understanding the optimizations, storage formats, and workflow suitability for ORC, Parquet, and Avro datasets.

| Aspect | ORC | Parquet | Avro |
|---|---|---|---|
| Spark optimization | Yes – in-memory, columnar, leverages column pruning & predicate pushdown | Yes – in-memory, columnar, leverages column pruning & predicate pushdown | Yes – row-based, less efficient for selective column access, better for streaming or row-oriented data |
| Storage format | Columnar | Columnar | Row-oriented |
| Column pruning | Supported | Supported | Not supported efficiently |
| Predicate pushdown | Supported | Supported | Limited / not native |
| Partitioning support | Optional | Optional | Optional, typically less impactful |
| Compression | Snappy, Zlib, etc. | Snappy, Gzip, etc. | Snappy, Deflate, etc. |
| Schema handling | Explicit or inferred, applied to dataset | Explicit or inferred, applied to dataset | Schema required, applied to rows |
| Typical usage | Workflows needing efficient column-based access and filtered rows | Similar to ORC, widely used in Hadoop ecosystems, columnar analytics | Workflows needing row-wise access, streaming ingestion, or evolving schema |
| Best use case in CMEM workflows | Workflows where only subsets of columns or filtered rows are processed frequently; large-scale transformations | General-purpose analytics and ETL tasks where columnar processing improves performance; medium-to-large datasets | Ingesting external row-oriented sources, streaming integration, or datasets with frequently evolving schema |

Parameters

File

Path of the file, either relative (e.g. ‘path/filename.orc’) or absolute (e.g. ‘hdfs:///path/filename.orc’).

  • ID: file
  • Datatype: resource
  • Default Value: None

URI pattern

A pattern used to construct the entity URI. If not provided, the prefix plus the line number is used. An example of such a pattern is ‘urn:zyx:{id}’, where id is the name of a property.

  • ID: uriPattern
  • Datatype: string
  • Default Value: None
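A minimal sketch of how such a pattern might be applied to a row. The helper name, the fallback prefix ‘urn:instance:orc#’, and the substitution logic are assumptions for illustration, not the plugin's actual implementation.

```python
from urllib.parse import quote


def build_entity_uri(row, line_number, uri_pattern=None,
                     prefix="urn:instance:orc#"):
    """Hypothetical helper: derive an entity URI the way the 'uriPattern'
    parameter is described. Without a pattern, fall back to an assumed
    prefix plus the line number."""
    if uri_pattern is None:
        return f"{prefix}{line_number}"
    # Substitute each {property} placeholder with the row's value,
    # percent-encoding it so the result stays a valid URI.
    result = uri_pattern
    for key, value in row.items():
        result = result.replace("{" + key + "}", quote(str(value)))
    return result


row = {"id": "1001", "amount": "150.00"}
print(build_entity_uri(row, 1, "urn:zyx:{id}"))  # urn:zyx:1001
print(build_entity_uri(row, 1))                  # urn:instance:orc#1
```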

Properties

Comma-separated list of URL-encoded properties. If not provided, the list of properties is read from the first line.

  • ID: properties
  • Datatype: string
  • Default Value: None
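A sketch of how the parameter value might be interpreted; the helper is hypothetical and only illustrates the comma-separated, URL-encoded convention described above.

```python
from urllib.parse import unquote


def parse_properties(properties_param=None, first_line=None):
    # Hypothetical helper: decode the comma-separated, URL-encoded
    # 'properties' value; if absent, fall back to names taken from
    # the first line of the data.
    if properties_param:
        return [unquote(name) for name in properties_param.split(",")]
    return list(first_line or [])


print(parse_properties("transaction%20id,amount"))
# ['transaction id', 'amount']
print(parse_properties(None, ["transaction_id", "customer_id"]))
# ['transaction_id', 'customer_id']
```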

Partition

Optional specification of the attribute used for output partitioning.

  • ID: partition
  • Datatype: string
  • Default Value: None

Compression

Optional compression algorithm (e.g. snappy, zlib).

  • ID: compression
  • Datatype: string
  • Default Value: snappy

Charset

The file encoding, e.g., UTF-8 or ISO-8859-1.

  • ID: charset
  • Datatype: string
  • Default Value: UTF-8

Advanced Parameters

None
