Skip to content

Hive database¤

Hive Dataset¤

Within BUILD, several Spark-aware datasets exist — Avro, Parquet, ORC, HDFS, and Hive — each optimized to leverage Spark’s parallel, in-memory execution. Among these, Hive datasets provide structured table-based storage, complementing file-based formats with queryable relational tables. Hive tables enable BUILD to query and process large datasets while taking full advantage of Spark’s distributed, in-memory execution model.

The Hive dataset plugin in BUILD allows workflows to read from and write to Hive tables seamlessly. Conceptually, Hive represents the table-oriented, relational side of CMEM’s Spark-aware ecosystem, bridging raw storage formats and semantic integration into the Knowledge Graph. It is table-based, meaning that workflows can apply SQL filtering and projection, leverage partitioning, and construct entity URIs for downstream knowledge-graph alignment — all without manual management of Spark execution details. Hive thus complements other Spark-aware datasets, which are typically file-based (columnar or row-oriented), by providing structured, queryable, relational entities.

Key Features¤

  • Database and table mapping – Workflows target specific Hive databases and tables, with explicit schema definitions.
  • SQL query support – Optional queries allow pre-filtering of rows and columns before processing.
  • Entity URI construction – Configurable patterns enable semantic alignment with the Knowledge Graph.
  • Property handling – Properties can be specified explicitly or inferred from table schema.
  • Partitioning – Native Hive partitions are recognized and used by Spark for parallelized processing.
  • Encoding support – Handles various source encodings for table content.
  • Parallelized, memory-efficient execution – Spark manages intermediate data in-memory across partitions, while maintaining fault tolerance.

Conceptual Usage in Workflows¤

Hive datasets are best understood in the context of CMEM’s Spark-aware dataset ecosystem:

  • They allow structured ingestion and transformation of relational datasets.
  • Filters or projections via SQL minimize unnecessary data movement.
  • Entity URIs and property mapping ensure semantic continuity into the Knowledge Graph.
  • Hive tables complement file-based datasets by offering queryable, relational views, which can be combined in workflows with Parquet, ORC, Avro, or HDFS sources.

In short, Hive tables are the relational, table-oriented pillar of Spark-aware datasets within BUILD. While Spark mechanics like lazy evaluation, DAG planning, and RDD lineage underpin all datasets, they do not need to be repeated here — Hive’s value lies in its integration, structured storage, and semantic alignment.

Example: Ingesting a Hive Table into BUILD¤

A typical Hive dataset configuration might reference a structured table exposed in a Hive warehouse. For instance:

  • Schema: sales_data
  • Table: monthly_transactions
  • Query (optional): SELECT * FROM monthly_transactions WHERE year = 2024
  • URI pattern: urn:transaction:{id}
  • Properties: (optional) auto-detected if not provided
  • Charset: UTF-8

This configuration allows BUILD to load the table as a Spark DataFrame, apply transformations or entity extraction workflows, and integrate the resulting entities into the Knowledge Graph. Filtering via an optional SQL query supports focused processing without materializing the entire table.

Comparison to Other Spark-Aware Datasets in CMEM¤

Aspect Hive Other Spark-Aware Datasets (Avro, Parquet, ORC, HDFS)
Storage model Table-based, relational File-based: columnar (Parquet, ORC) or row-oriented (Avro, HDFS)
Spark optimization In-memory, partition-parallel execution In-memory for columnar; row-oriented formats support streaming or batch processing
Schema Explicit from Hive table Columnar: explicit/inferred; row-oriented: schema required
Filtering / Projection SQL queries, table partition pruning Columnar: column pruning/predicate pushdown; row-based: limited
Partitioning Native Hive partitions, used for parallelism Optional file-level partitions or directory-based sharding
Compression Configurable at table level Columnar: Snappy/Gzip; row: Snappy/Deflate
Typical usage Structured, large-scale tables integrated into CMEM pipelines File-based analytics, ETL, streaming ingestion, or intermediate workflow steps
Semantic integration Direct mapping to entity URIs and Knowledge Graph properties Generally less semantic, focus on raw transformation or analytics
Best CMEM use case Large, structured datasets requiring filtering, projections, and semantic KG alignment Flexible ingestion and processing, format-specific performance optimization

Parameter¤

Schema¤

Name of the hive schema or namespace.

  • ID: schema
  • Datatype: string
  • Default Value: None

Table¤

Name of the hive table.

  • ID: table
  • Datatype: string
  • Default Value: None

Query¤

Optional query for projection and selection, e.g. “SELECT * FROM table WHERE x = true”.

  • ID: query
  • Datatype: string
  • Default Value: None

Uri pattern¤

A pattern used to construct the entity URI. If not provided the prefix + the line number is used. An example of such a pattern is ‘urn:zyx:{id}’ where id is a name of a property.

  • ID: uriPattern
  • Datatype: string
  • Default Value: None

Properties¤

Comma-separated list of URL-encoded properties. If not provided, the list of properties is read from the first line.

  • ID: properties
  • Datatype: string
  • Default Value: None

Charset¤

The source internal encoding, e.g., UTF8, ISO-8859-1

  • ID: charset
  • Datatype: string
  • Default Value: UTF-8

Advanced Parameter¤

None

Comments