Parquet
Parquet Dataset
The Parquet dataset plugin in CMEM Build reads from and writes to Apache Parquet files. It is a Spark-optimized dataset, designed to leverage Spark’s in-memory, parallel execution model and take advantage of Parquet-specific optimizations.
Parquet supports optimizations such as column pruning and predicate pushdown, which enable Spark to read only the required columns or rows. This makes Parquet particularly effective for workflows where selective access to data is needed, improving performance for large-scale transformations.
Key Features
- Columnar storage format: Parquet stores data in a column-oriented manner, enabling efficient compression and retrieval of individual columns.
- Partitioning: Optional support for partitioned outputs, allowing Spark to parallelize data processing effectively.
- Compression: Supports multiple compression algorithms, e.g., Snappy, Gzip, or LZO, reducing storage footprint and improving I/O performance.
- Schema handling: Supports explicit or inferred schemas, with the schema applied to the dataset as a whole, not individual rows.
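The benefit of column-oriented storage for compression can be illustrated with a small, pure-Python sketch. It is not Parquet's actual encoder (Parquet combines dictionary encoding with a run-length/bit-packing hybrid and then applies the chosen compression codec); it only shows why grouping the values of one column together tends to produce long, compressible runs. The rows and the helper names `to_columns` and `run_length_encode` are hypothetical.

```python
from itertools import groupby


def to_columns(rows, names):
    """Pivot row tuples into one list of values per column name."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


# Hypothetical rows: (country, status)
rows = [
    ("DE", "paid"),
    ("DE", "paid"),
    ("DE", "open"),
    ("FR", "open"),
    ("FR", "open"),
]

columns = to_columns(rows, ["country", "status"])
print(run_length_encode(columns["country"]))  # [('DE', 3), ('FR', 2)]
print(run_length_encode(columns["status"]))   # [('paid', 2), ('open', 3)]
```

In a row-oriented layout the same values are interleaved ("DE", "paid", "DE", "paid", ...), so runs of equal values are much shorter and this kind of encoding gains little.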
Example
A Parquet dataset in CMEM Build might look conceptually like this:
| transaction_id | customer_id | amount | timestamp |
|---|---|---|---|
| 1 | 1001 | 150.00 | 2025-11-01 08:15 |
| 2 | 1002 | 200.50 | 2025-11-01 09:30 |
| 3 | 1003 | 75.25 | 2025-11-01 11:20 |
This dataset could be stored in a file transactions.parquet and used in workflows where only certain columns or filtered rows are needed. Spark can read only the relevant columns or rows matching a condition, thanks to Parquet’s column pruning and predicate pushdown optimizations, making transformations efficient even at large scale.
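The two optimizations can be made concrete with a minimal, pure-Python sketch. It is a toy model, not Spark's or Parquet's implementation: it assumes the three example rows are split into two row groups (in practice such a small table would sit in a single row group), each carrying min/max statistics for the amount column, similar in spirit to the statistics Parquet keeps in its file metadata. The names `ROW_GROUPS` and `scan` are hypothetical.

```python
# Toy columnar file: each row group stores its columns separately,
# plus (min, max) statistics for the "amount" column.
ROW_GROUPS = [
    {
        "columns": {
            "transaction_id": [1, 2],
            "customer_id": [1001, 1002],
            "amount": [150.00, 200.50],
            "timestamp": ["2025-11-01 08:15", "2025-11-01 09:30"],
        },
        "stats": {"amount": (150.00, 200.50)},
    },
    {
        "columns": {
            "transaction_id": [3],
            "customer_id": [1003],
            "amount": [75.25],
            "timestamp": ["2025-11-01 11:20"],
        },
        "stats": {"amount": (75.25, 75.25)},
    },
]


def scan(row_groups, columns, min_amount):
    """Return rows with amount >= min_amount, reading only `columns`."""
    rows, groups_read = [], 0
    for group in row_groups:
        _, max_amount = group["stats"]["amount"]
        # Predicate pushdown: skip the whole group if no row can match.
        if max_amount < min_amount:
            continue
        groups_read += 1
        amounts = group["columns"]["amount"]
        # Column pruning: only the requested columns are materialized.
        selected = [group["columns"][name] for name in columns]
        for i, amount in enumerate(amounts):
            if amount >= min_amount:
                rows.append(tuple(col[i] for col in selected))
    return rows, groups_read


rows, groups_read = scan(ROW_GROUPS, ["transaction_id", "amount"], min_amount=100.0)
print(rows)         # [(1, 150.0), (2, 200.5)]
print(groups_read)  # 1
```

The second row group is never opened because its maximum amount (75.25) cannot satisfy the filter, and the customer_id and timestamp columns are never touched.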
Reference
For more information on the Parquet format and its optimizations, see the Apache Parquet project page.
Comparison of Spark-optimized datasets
The following table summarizes the key differences and typical use cases of the main Spark-optimized datasets supported in CMEM Build. It provides a quick reference for understanding the optimizations, storage formats, and workflow suitability of ORC, Parquet, and Avro datasets.
| Aspect | ORC | Parquet | Avro |
|---|---|---|---|
| Spark optimization | Yes – in-memory, columnar, leverages column pruning & predicate pushdown | Yes – in-memory, columnar, leverages column pruning & predicate pushdown | Yes – row-based, less efficient for selective column access, better for streaming or row-oriented data |
| Storage format | Columnar | Columnar | Row-oriented |
| Column pruning | Supported | Supported | Not supported efficiently |
| Predicate pushdown | Supported | Supported | Limited / not native |
| Partitioning support | Optional | Optional | Optional, typically less impactful |
| Compression | Snappy, Zlib, etc. | Snappy, Gzip, etc. | Snappy, Deflate, etc. |
| Schema handling | Explicit or inferred, applied to dataset | Explicit or inferred, applied to dataset | Schema required, applied to rows |
| Typical usage | Workflows needing efficient column-based access and filtered rows | Similar to ORC, widely used in Hadoop ecosystems, columnar analytics | Workflows needing row-wise access, streaming ingestion, or evolving schema |
| Best use case in CMEM workflows | Workflows where only subsets of columns or filtered rows are processed frequently, large-scale transformations | General-purpose analytics and ETL tasks where columnar processing improves performance, medium-to-large datasets | Ingesting external row-oriented sources, streaming integration, or datasets with frequently evolving schema |
Parameter
File
Path, either relative (e.g. ‘path/filename.parquet’) or absolute (e.g. ‘hdfs:///path/filename.parquet’).
- ID: file
- Datatype: resource
- Default Value: None
Uri pattern
A pattern used to construct the entity URI. If not provided, the prefix followed by the line number is used. An example of such a pattern is ‘urn:zyx:{id}’, where id is the name of a property.
- ID: uriPattern
- Datatype: string
- Default Value: None
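How such a pattern might be expanded can be sketched in a few lines of Python. This is an illustration of the idea only, not the plugin's code: the function name `build_entity_uri`, the default prefix `urn:example:`, and the choice to URL-encode inserted values are all assumptions made for this sketch.

```python
import re
from urllib.parse import quote

# Matches placeholders such as {id}
PLACEHOLDER = re.compile(r"\{([^{}]+)\}")


def build_entity_uri(pattern, row, line_number, prefix="urn:example:"):
    """Expand {property} placeholders from `row`; fall back to prefix + line number."""
    if not pattern:
        return f"{prefix}{line_number}"
    # URL-encoding the value is an assumption made to keep the URI well-formed.
    return PLACEHOLDER.sub(
        lambda match: quote(str(row[match.group(1)]), safe=""), pattern
    )


row = {"id": 1001, "name": "Jane Doe"}
print(build_entity_uri("urn:zyx:{id}", row, line_number=1))    # urn:zyx:1001
print(build_entity_uri("urn:zyx:{name}", row, line_number=1))  # urn:zyx:Jane%20Doe
print(build_entity_uri(None, row, line_number=7))              # urn:example:7
```

A pattern built on a property with unique values (such as an identifier) yields stable URIs across runs, whereas the line-number fallback changes whenever the row order changes.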
Properties
Comma-separated list of URL-encoded properties. If not provided, the list of properties is read from the first line.
- ID: properties
- Datatype: string
- Default Value: None
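URL-encoding matters here because a property name may itself contain a comma or a space, which would otherwise be confused with the list separator. The following sketch, using only Python's standard library, shows one way to build and read such a value; the helper names `format_properties` and `parse_properties` and the sample property names are hypothetical, and the plugin's own parsing may differ in detail.

```python
from urllib.parse import quote, unquote


def format_properties(names):
    """URL-encode each property name and join the results with commas."""
    return ",".join(quote(name, safe="") for name in names)


def parse_properties(value):
    """Split on commas, then URL-decode each non-empty entry."""
    return [unquote(part.strip()) for part in value.split(",") if part.strip()]


encoded = format_properties(["transaction_id", "customer id", "amount, net"])
print(encoded)                    # transaction_id,customer%20id,amount%2C%20net
print(parse_properties(encoded))  # ['transaction_id', 'customer id', 'amount, net']
```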
Partition
Optional specification of the attribute for output partitioning.
- ID: partition
- Datatype: string
- Default Value: None
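Spark's partitioned writes place rows in one directory per distinct value of the partition attribute, named `attribute=value`. Assuming the plugin hands this parameter to Spark's partitioned output (which this page does not state explicitly), the resulting layout can be sketched in plain Python as follows. The function `partition_rows`, the base path, and the sample rows are hypothetical.

```python
from collections import defaultdict


def partition_rows(rows, attribute, base_path="transactions.parquet"):
    """Group rows into `attribute=value` directories under `base_path`."""
    partitions = defaultdict(list)
    for row in rows:
        directory = f"{base_path}/{attribute}={row[attribute]}"
        # As in Spark's partitioned output, the partition value lives in the
        # directory name, so it is dropped from the stored row.
        stored = {key: value for key, value in row.items() if key != attribute}
        partitions[directory].append(stored)
    return dict(partitions)


# Hypothetical rows; customer 1001 appears twice.
rows = [
    {"transaction_id": 1, "customer_id": 1001, "amount": 150.00},
    {"transaction_id": 2, "customer_id": 1002, "amount": 200.50},
    {"transaction_id": 3, "customer_id": 1001, "amount": 75.25},
]

layout = partition_rows(rows, "customer_id")
for directory in sorted(layout):
    print(directory, len(layout[directory]))
# transactions.parquet/customer_id=1001 2
# transactions.parquet/customer_id=1002 1
```

An attribute with few distinct values (a date or a region, for example) usually partitions better than a near-unique one, which would produce many tiny directories.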
Compression
Optional compression algorithm (e.g. snappy, gzip).
- ID: compression
- Datatype: string
- Default Value: None
Charset
The file encoding, e.g. UTF-8 or ISO-8859-1.
- ID: charset
- Datatype: string
- Default Value: UTF-8
Advanced Parameter
None