Avro
Avro Dataset
The Avro dataset plugin in CMEM Build reads from and writes to Apache Avro files. It is a Spark-optimized dataset, designed to leverage Spark’s in-memory, parallel execution model and take advantage of Avro-specific optimizations.
Avro supports schema evolution and efficient serialization, so Spark can process structured data reliably even as workflows change. The columnar optimizations available to formats such as Parquet or ORC do not apply, but Avro provides fast row-based access and compact storage. This makes it suitable for workloads where the data structure may change over time or where full-row access is common.
Key Features
- Row-based storage format: Avro stores data row-wise, enabling efficient serialization and deserialization for distributed processing.
- Schema support: Explicit schemas are stored with the dataset, ensuring compatibility and supporting schema evolution.
- Compression: Supports multiple compression algorithms, e.g., Snappy or Deflate, reducing storage footprint and improving I/O performance.
- Interoperability: Avro is widely supported across big data tools, making it easy to exchange data between systems.
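To make the schema-evolution point more concrete, the following is a minimal sketch in plain Python of how a reader schema can be applied to a record written under an older schema: fields the reader does not know are dropped, and fields the writer did not have are filled from the reader's defaults. The function `resolve_record`, the `NO_DEFAULT` sentinel, and the `currency` field are hypothetical names used for illustration only; this is not the API of an Avro library or of the plugin.

```python
NO_DEFAULT = object()  # sentinel: the reader schema declares no default


def resolve_record(writer_record: dict, reader_fields: dict) -> dict:
    """Project a record written with an older schema onto a reader schema.

    reader_fields maps each reader field name to its default value,
    or to NO_DEFAULT when the field has no default.
    """
    resolved = {}
    for name, default in reader_fields.items():
        if name in writer_record:
            # Field exists in both schemas: keep the written value.
            resolved[name] = writer_record[name]
        elif default is not NO_DEFAULT:
            # Field was added later: fall back to the reader's default.
            resolved[name] = default
        else:
            raise ValueError(f"field {name!r} is missing and has no default")
    # Fields present only in writer_record are ignored.
    return resolved


# Record written before the (hypothetical) "currency" field existed.
old_record = {"transaction_id": 1, "customer_id": 1001, "amount": 150.0}

reader_fields = {
    "transaction_id": NO_DEFAULT,
    "amount": NO_DEFAULT,
    "currency": "EUR",  # added later, with a default
}

new_view = resolve_record(old_record, reader_fields)
print(new_view)
```

A field that is added without a default cannot be resolved against older records, which is why the sketch raises an error in that case.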
Example
An Avro dataset in CMEM Build might look conceptually like this:
| transaction_id | customer_id | amount | timestamp |
|---|---|---|---|
| 1 | 1001 | 150.00 | 2025-11-01 08:15 |
| 2 | 1002 | 200.50 | 2025-11-01 09:30 |
| 3 | 1003 | 75.25 | 2025-11-01 11:20 |
This dataset could be stored in a file transactions.avro and used in workflows where row-oriented access or schema evolution is needed. Spark reads the data efficiently, and the explicit schema ensures correct mapping of fields across transformations.
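As an illustration of what an explicit schema for this table could look like, the sketch below builds an Avro record schema as a Python dictionary and serializes it to JSON, which is the notation Avro schemas are written in. The record name `Transaction` and the chosen field types (`long`, `double`, `string`) are assumptions; the actual types depend on how the file was written.

```python
import json

# Assumed Avro types for the example columns; the table itself does not state them.
transaction_schema = {
    "type": "record",
    "name": "Transaction",  # hypothetical record name
    "fields": [
        {"name": "transaction_id", "type": "long"},
        {"name": "customer_id", "type": "long"},
        {"name": "amount", "type": "double"},
        {"name": "timestamp", "type": "string"},
    ],
}

# Avro schemas are expressed as JSON documents.
schema_json = json.dumps(transaction_schema, indent=2)
print(schema_json)
```

Storing the timestamp as a plain string keeps the sketch simple; a real schema might use one of Avro's timestamp logical types instead.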
Reference
For more information on the Avro format and its optimizations, see the Apache Avro project page.
Comparison of Spark-optimized datasets
The following table summarizes the key differences and typical use cases of the main Spark-optimized datasets supported in CMEM Build. It is a quick reference for the optimizations, storage formats, and workflow suitability of ORC, Parquet, and Avro datasets.
| Aspect | ORC | Parquet | Avro |
|---|---|---|---|
| Spark optimization | Yes – in-memory, columnar, leverages column pruning & predicate pushdown | Yes – in-memory, columnar, leverages column pruning & predicate pushdown | Yes – row-based, less efficient for selective column access, better for streaming or row-oriented data |
| Storage format | Columnar | Columnar | Row-oriented |
| Column pruning | Supported | Supported | Not supported efficiently |
| Predicate pushdown | Supported | Supported | Limited / not native |
| Partitioning support | Optional | Optional | Optional, typically less impactful |
| Compression | Snappy, Zlib, etc. | Snappy, Gzip, etc. | Snappy, Deflate, etc. |
| Schema handling | Explicit or inferred, applied to dataset | Explicit or inferred, applied to dataset | Schema required, applied to rows |
| Typical usage | Workflows needing efficient column-based access and filtered rows | Similar to ORC, widely used in Hadoop ecosystems, columnar analytics | Workflows needing row-wise access, streaming ingestion, or evolving schema |
| Best use case in CMEM workflows | Workflows where only subsets of columns or filtered rows are processed frequently, large-scale transformations | General-purpose analytics and ETL tasks where columnar processing improves performance, medium-to-large datasets | Ingesting external row-oriented sources, streaming integration, or datasets with frequently evolving schema |
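The difference between row-oriented and columnar storage in the table can be illustrated with a small, purely conceptual sketch. It counts how many field values are touched when summing a single column under each layout. The helper names are hypothetical, and the in-memory lists only stand in for the on-disk layouts; real readers work on encoded blocks rather than Python objects.

```python
rows = [
    {"transaction_id": 1, "customer_id": 1001, "amount": 150.00, "timestamp": "2025-11-01 08:15"},
    {"transaction_id": 2, "customer_id": 1002, "amount": 200.50, "timestamp": "2025-11-01 09:30"},
    {"transaction_id": 3, "customer_id": 1003, "amount": 75.25, "timestamp": "2025-11-01 11:20"},
]


def sum_amount_row_oriented(records):
    """Row layout: each full row is read even though one field is needed."""
    total = 0.0
    fields_touched = 0
    for record in records:
        fields_touched += len(record)  # the whole row is deserialized
        total += record["amount"]
    return total, fields_touched


def to_columns(records):
    """Regroup the same data column by column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_amount_columnar(columns):
    """Columnar layout: only the requested column is read (column pruning)."""
    amounts = columns["amount"]
    return sum(amounts), len(amounts)


row_total, row_touched = sum_amount_row_oriented(rows)
col_total, col_touched = sum_amount_columnar(to_columns(rows))
print(row_total, row_touched)  # same total, more values touched
print(col_total, col_touched)
```

Both layouts produce the same total, but the row-oriented pass touches every field of every row. When a workflow needs all fields of each row anyway, that overhead disappears, which is the case where a row-based format fits well.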
Parameter
File
Path to the Avro file, either relative (e.g. path/filename.avro) or absolute (e.g. hdfs:///path/filename.avro).
- ID: file
- Datatype: resource
- Default Value: None
Uri pattern
A pattern used to construct the entity URI. If not provided, the prefix plus the line number is used. An example of such a pattern is urn:zyx:{id}, where *id* is the name of a property.
- ID: uriPattern
- Datatype: string
- Default Value: None
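The behaviour described for the URI pattern can be sketched as follows: each `{property}` placeholder is replaced by the value of that property, and without a pattern the URI falls back to a prefix plus the line number. This is an illustrative sketch only, not the plugin's implementation. The function name `build_entity_uri`, the default prefix `urn:row:`, and the percent-encoding of substituted values are assumptions.

```python
import re
from urllib.parse import quote

# Matches placeholders such as {id} in a URI pattern.
PLACEHOLDER = re.compile(r"\{([^{}]+)\}")


def build_entity_uri(entity, line_number, uri_pattern=None, prefix="urn:row:"):
    """Build an entity URI from a pattern, or from prefix + line number.

    The default prefix is a made-up value for this sketch.
    """
    if not uri_pattern:
        return f"{prefix}{line_number}"

    def substitute(match):
        value = entity[match.group(1)]  # raises KeyError for unknown properties
        return quote(str(value), safe="")  # assumption: values are percent-encoded

    return PLACEHOLDER.sub(substitute, uri_pattern)


entity = {"id": 42, "name": "Jane Doe"}
with_pattern = build_entity_uri(entity, line_number=7, uri_pattern="urn:zyx:{id}")
without_pattern = build_entity_uri(entity, line_number=7)
print(with_pattern)
print(without_pattern)
```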
Properties
Comma-separated list of URL-encoded properties. If not provided, the list of properties is read from the first line.
- ID: properties
- Datatype: string
- Default Value: None
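A value for this parameter could be decoded roughly as shown below. The sketch splits on commas first and URL-decodes each entry afterwards, so a comma that is part of a property name survives as long as it is encoded as `%2C`. The function name `parse_properties` and the sample property names are hypothetical, and the plugin's actual parsing may differ in detail.

```python
from urllib.parse import unquote


def parse_properties(value):
    """Split a comma-separated, URL-encoded property list into names."""
    if not value or not value.strip():
        return []  # nothing configured
    # Split before decoding so that encoded commas (%2C) stay inside a name.
    return [unquote(part.strip()) for part in value.split(",") if part.strip()]


properties = parse_properties("transaction_id,customer%20name,amount%2Cgross")
print(properties)
```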
Charset
The file encoding, e.g., UTF-8 or ISO-8859-1.
- ID: charset
- Datatype: string
- Default Value: UTF-8
Advanced Parameter
None