Introduction

The Dataset catalog provides options to manage, map, explore and visualize datasets. A dataset consists of a resource and the metadata about the resource in a specific distribution. Use the Datasets module to map and link several datasets or to create dataset workflows.

To open the Datasets module, click DATASETS in the Module bar. The Dataset catalog is displayed, showing all datasets that are accessible to you.

The Dataset catalog shows all datasets contained in a graph and provides the following tabs:

  • DATASETS: lists all available datasets in a table overview (including some of their metadata)
  • ATTRIBUTES: lists the specific content (data) of all datasets available in the Dataset catalog (including some metrics)

When, for example, a dataset in the Dataset catalog contains an Excel file with 3 spreadsheets, the ATTRIBUTES tab shows the following information:

  • Dataset name (Dataset)
  • Spreadsheet name (Type)
  • Column titles of the spreadsheet (Attribute)
  • Datatype contained in this column (Datatype)
  • Sparkline of the column (Sparkline)
  • Number of unique values in the column (Unique count)
  • Unique count of values divided by the total number of values (Selectivity)
  • Position of the column in the spreadsheet (Source order)

Note: Depending on the kind of content a dataset contains, the schema types as well as the information shown in the ATTRIBUTES tab may differ.
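
The Unique count and Selectivity metrics listed above can be illustrated with a short sketch. The function name column_metrics and the sample values are hypothetical and not part of Corporate Memory. How the product treats empty cells is not described here, so the sketch simply counts every value it is given.

```python
def column_metrics(values):
    """Return (unique_count, selectivity) for one column of values."""
    total = len(values)
    unique_count = len(set(values))
    # Selectivity = unique count divided by the total number of values.
    selectivity = unique_count / total if total else 0.0
    return unique_count, selectivity


# Hypothetical column with five values, three of them distinct.
cities = ["Berlin", "Leipzig", "Berlin", "Dresden", "Leipzig"]
unique_count, selectivity = column_metrics(cities)
print(unique_count, selectivity)
```

A selectivity of 1.0 means that every value in the column is distinct, as in an identifier column. A value close to 0 indicates a column with only a few distinct values.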

Use the Search field to search for a specific keyword in the metadata of all datasets or all attributes. The icons next to each column title in the Dataset catalog provide sorting options.

To see more metadata or edit metadata of a dataset, click .

Use  to edit existing metadata. In the editing mode, use (plus) to add more metadata information on dataset level.

Dataset management

The dataset management provides options to create and manage datasets as well as to discover, map and link datasets.

Registering new datasets

In order to add a new dataset to the catalog, you need to register the dataset:

  • Open the menu  in the upper right corner of the Dataset catalog window.
  • Click Register new dataset.

A dialog box appears. Enter the following metadata for the new dataset:

  • A name for the new dataset
  • A short description of the dataset content

Click REGISTER.

The new dataset is added to the catalog. The details of the new dataset appear showing your entries. After creation the new dataset is empty and contains no data. You can use it as an empty target dataset in a workflow operation or you can add data. See section Dataset details to learn how to add data and which options are available for dataset management.

Dataset details

The Dataset details provide access to the metadata of datasets as well as options for adding data resources and for dataset management.

To open the details of a dataset, click the name of the dataset on the DATASETS tab in the Dataset catalog. Options and tabs provided in the Dataset details vary depending on whether the dataset contains data or not.

Note: When you register a new dataset, you are automatically forwarded to the Dataset details after clicking REGISTER (see section Registering new datasets).

The OVERVIEW tab appears showing metadata of the dataset and providing options to add or manage data.

Use  to edit the existing metadata of the dataset. In the editing mode, use  to add more metadata information on dataset level.

When a dataset contains no data, only the following options are available:

  • ADD DATA to upload data from a file or connect an endpoint to the dataset
  • REMOVE DATASET to remove the dataset and all related tasks

After adding data, more tabs, options and functionalities are available for a dataset. The OVERVIEW tab then provides the following options depending on what kind of data has been added to the dataset:

  • CHANGE DATA to change the data added to the dataset
  • DOWNLOAD DATA to download the data of a dataset when a file-based dataset type has been selected
  • RE-PROFILE DATA to refresh data profiling and preview after a new data upload
  • EDIT MAPPING to map the data to an installed vocabulary
  • EXECUTE WORKFLOW to execute a workflow in case a dataset is defined as a target dataset
  • START/STOP JDBC to start or stop an internal JDBC endpoint (SparkSQL)

Adding data

To use a dataset for mapping and linking tasks you need to add data to the dataset.

  • In the OVERVIEW tab of the Dataset details, click ADD DATA.
  • Follow the steps in the dialog box as required.
  • To finish the registration process, click DONE.

The data source can be either a file-based dataset type (e.g. .csv or .orc files) or an endpoint connection to a database (e.g. SparkSQL or Hive). If you choose a file-based type you can select one of the following options:

  • Create an empty file
  • Select an already existing file
  • Upload a new file

Depending on the specific configuration of Corporate Memory, the following dataset types are available:

  • SPARQL endpoint (remote): Retrieves all entities from a SPARQL endpoint.
  • Alignment file: Writes the alignment format specified at http://alignapi.gforge.inria.fr/format.html.
  • CSV file: Retrieves all entities from a CSV file.
  • XML file: Retrieves all entities from an XML file.
  • JSON file: Retrieves all entities from a JSON file.
  • RDF graph: The RDF graph that is used for storing internal data.
  • ORC file: Retrieves data from various source systems (file, hdfs) in ORC format and converts them to Spark Data Frames and RDDs.
  • JDBC endpoint: JDBC URL (base URL without database name or parameters, e.g. jdbc:mysql://localhost:port).
  • Hive endpoint: Retrieves data from Hive and converts them to Spark Data Frames and RDDs or stores data as a Hive table when used as an output.
  • SparkSQL view (virtual): As an output of a workflow, this dataset generates a cached or uncached view that can be queried over JDBC.
  • Excel file: Reads and writes entities from/to an Excel Workbook in Open XML format (XLSX).
  • Multi CSV ZIP file: CSV dataset that holds multiple CSV files inside a ZIP file. Sub folders and files not ending in .csv are ignored.
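
The selection rule for the Multi CSV ZIP file type (sub folders and files not ending in .csv are ignored) can be sketched as follows. This is only an illustration built on the Python standard library, not the implementation used by Corporate Memory. The function name top_level_csv_names and the file names are hypothetical, and the case-sensitive match on .csv is an assumption.

```python
import io
import zipfile


def top_level_csv_names(zip_bytes):
    """List the CSV files at the top level of a ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return [
            name
            for name in archive.namelist()
            # Skip anything inside a sub folder and anything not ending in .csv
            # (case-sensitive match is an assumption of this sketch).
            if "/" not in name and name.endswith(".csv")
        ]


# Build a small in-memory archive for demonstration (hypothetical file names).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("persons.csv", "id,name\n1,Ada\n")
    archive.writestr("notes.txt", "not a CSV file")
    archive.writestr("archive/old.csv", "id\n9\n")

print(top_level_csv_names(buffer.getvalue()))
```

Checking an archive this way before uploading can help to spot files that would be skipped.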

To execute workflows (a collection of mapping and linking tasks) you need a target dataset for storing the workflow results. You can choose the option Create empty file when you want to use an empty dataset as target dataset. However, it is also possible to use any registered dataset as a target dataset.

If you choose an endpoint connection for your dataset, you can access cached data, or access data without materialization by executing all workflow operations at query time.

When you select SparkSQL view as dataset type and use the dataset as a target dataset for a workflow, a JDBC (Java Database Connectivity) URL is displayed in the OVERVIEW tab. This URL can be used to query the result with any JDBC client. Further, the option to start or stop the endpoint that provides access to a ‘SparkSQL view’ dataset is available by using the buttons START JDBC or STOP JDBC in the Dataset details.

Note: Some of the options mentioned above are only available with specific configuration settings. Refer to the system manual of eccenca DataIntegration or the Spark system manual.

ATTRIBUTES

After you add data from an existing or an uploaded file to a dataset, the ATTRIBUTES tab provides an overview of the specific data content. Depending on the structure of the data added, you can examine schema types, attributes and some example data (PREVIEW tab). The ATTRIBUTES tab displays the same information as described for the Dataset catalog in the Introduction, but restricted to the selected dataset.

To see more metadata or edit metadata of types and attributes shown in the ATTRIBUTES tab, click  at the front of the type or attribute (Source identifier).

Use  to edit existing metadata. In the editing mode, use  to add more metadata information on type or attribute level.

Mapping of datasets

To bring different datasets into a consistent form, or to link different datasets with each other, you need to map them to a vocabulary. A mapping is a set of transformations that assign data elements in a source dataset to elements of a vocabulary.

There are two options to create and edit mappings. One option is to use EDIT MAPPING in the Dataset details view which is described in the following subsections. Another option is available in the Discovery tab which provides a dataset visualization. Refer to section Discovery to learn more about this option.

Creating mappings

To create a mapping for a dataset that is opened in the Dataset details view:

  • Click EDIT MAPPING.
  • Select one or more vocabularies you want to use in your mapping operation.
  • Click CREATE MAPPING.

The mapping editor appears and you can edit your mapping rules as described in the following section.

Editing mappings

With EDIT MAPPING in the Dataset details view you can start a new or change an existing mapping in an editor where you can create and edit hierarchical mappings for the selected dataset. You can either edit mapping rules manually as described in the following section, or automatically generate mapping suggestions based on algorithms as described in section Suggest mappings.

On the left side of the editor is a navigation tree showing the root mapping element as well as all its subordinate mappings. The number in brackets displayed next to the element name indicates how many mapping rules exist for the selected element. On the right side of the editor the details and mapping rules of the selected element are displayed.

The upper row shows the element currently selected and provides a menu () offering convenient options such as to hide the tree navigation as well as to expand or reduce all elements.

Below the upper row the element as well as its mapping rules are listed.

In front of each mapping rule is an icon  which allows you to move the selected mapping rule to top, up, down or bottom. Click and hold the mouse button on a mapping rule to drag and drop the rule to any position in the list.

Use  on the right side of an element to show more information. This expanded view provides further options to edit or remove an element.

Corporate Memory supports two types of mapping rules:

  • Object mappings, indicated by 
  • Value mappings, indicated by 

An object mapping rule is a mapping that transforms a data element in a source dataset to a data element in a target dataset. If the target dataset is an RDF graph, object mappings result in triples with resource objects and are thus suitable to map object properties. Object mappings create a new subordinate level in the hierarchical mapping view.

An object mapping rule consists of the following editable elements:

  • Target property (e.g. OWL object property or column or XML tags, but only if it is not the root mapping rule)
  • Target entity type(s) (0 or 1 or n; e.g. RDF/OWL class or table)
  • Value path (an abstract notation of schema hierarchy, but only if it is not the root mapping)
  • URI pattern or URI formula
  • Description
  • A number of submappings (object or value mapping rules)

The URI pattern is a template that specifies how object URIs are created. You can either specify a simple URI pattern or a complex URI formula. If no specific URI pattern is set, Corporate Memory uses a default pattern for the URI creation.
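
How a template turns source values into object URIs can be sketched as follows. The {name} placeholder syntax, the function name build_uri and the example.org URIs are assumptions made for this illustration. The pattern syntax actually accepted by Corporate Memory is defined by the product and may differ.

```python
import re
from urllib.parse import quote


def build_uri(pattern, record):
    """Fill each {placeholder} in the pattern with the URL-encoded record value."""
    def fill(match):
        value = str(record[match.group(1)])
        # Encode characters that are not safe inside a URI path segment.
        return quote(value, safe="")
    return re.sub(r"\{([^{}]+)\}", fill, pattern)


# Hypothetical pattern and source record.
pattern = "http://example.org/person/{id}"
record = {"id": "A 42", "name": "Ada"}
print(build_uri(pattern, record))
```

Encoding the inserted values keeps the generated URI valid even when a source value contains spaces or other reserved characters.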

To define a simple URI pattern, click EDIT, enter the pattern and click SAVE.

To define a complex URI formula, click  and create your URI formula in the editor.

To remove an existing URI formula, click .

A value mapping rule is a mapping that transforms a data element value (like a string, number or date) in a source dataset to a data element value in a target dataset. If the target dataset is an RDF graph, value mappings result in triples with literal objects and are suitable to map datatype properties.
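
For an RDF target, the difference between the two rule types shows in the object position of the resulting triple. The following sketch prints one triple of each kind in N-Triples syntax. The helper names and all example.org URIs are hypothetical, and the sketch does not reproduce how Corporate Memory generates triples internally.

```python
def object_triple(subject, prop, obj_uri):
    """N-Triples line whose object is a resource (URI), as in an object mapping."""
    return f"<{subject}> <{prop}> <{obj_uri}> ."


def value_triple(subject, prop, value):
    """N-Triples line whose object is a plain string literal, as in a value mapping."""
    # Simplified escaping: only backslashes and double quotes are handled here.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{subject}> <{prop}> "{escaped}" .'


person = "http://example.org/person/1"
print(object_triple(person, "http://example.org/vocab/worksFor",
                    "http://example.org/company/7"))
print(value_triple(person, "http://example.org/vocab/name", "Ada Lovelace"))
```

The first line links two resources, which is what an object property expresses. The second line attaches a literal value to a resource, which is what a datatype property expresses.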

A value mapping rule consists of the following editable elements:

  • Target property (mandatory, e.g. OWL Datatype property or column or XML tags or attributes, etc.)
  • Data type
  • Value path or value formula (optional)
  • Description

A value path is a path expression that selects the value of the data element in the source dataset specified by that path.

To create or edit a complex mapping rule in a value mapping, use the Edit icon shown next to the value path. In the editor that opens, you can define a complex rule specification for the target value generation that uses multiple value paths and calculations. Drag and drop source paths and transformation tasks from the left to the main working space to create a complex rule.

Adding new mapping rules

In order to add a new value or object mapping rule to the element selected, use  in the lower right corner and choose the required mapping option.

Suggest mappings

Instead of adding mappings manually, you can use the function Suggest mappings to support your mapping process. This function automatically generates possible mappings of the source dataset to the selected vocabulary. A matching algorithm identifies properties of the vocabulary that probably match attributes of the source dataset. You can confirm or decline the suggested mappings in an overview and thus accelerate the mapping process.

To use the function Suggest mappings, click  in the lower right corner of the mapping editor and select Suggest mappings.

A table overview appears showing the suggested mapping rules. Each suggestion consists of the following three elements:

  • Value path: Denotes the attribute in the source dataset.
  • Target property: Shows the suggested matching property of the vocabulary. The value (default mapping) indicates that the algorithm could not find a suitable property in the vocabulary. When you confirm such a suggestion, a new property name is generated.
  • Mapping type: You can define whether a value mapping or an object mapping is generated.

Select all mapping suggestions you want to confirm and click SAVE.

The suggested mappings are added as mapping rules in the mapping editor. Based on the identified data types, value normalizations are applied automatically. For example, dates in US format (7/21/1977) are converted to normalized values according to ISO 8601 (1977-07-21). You can further edit the mapping rules as described in section Editing mappings.
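
The date normalization mentioned above corresponds to the following conversion, shown here as a minimal sketch with the Python standard library. The function name normalize_us_date is hypothetical, and the sketch only illustrates the result of the normalization, not how Corporate Memory performs it.

```python
from datetime import datetime


def normalize_us_date(text):
    """Convert a US-style date (month/day/year) to ISO 8601 (YYYY-MM-DD)."""
    return datetime.strptime(text, "%m/%d/%Y").date().isoformat()


print(normalize_us_date("7/21/1977"))  # prints 1977-07-21
```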

Discovery

The Discovery tab in the Dataset details view lets you explore your datasets and manage relations between them. Corporate Memory provides indicators and statistical analytics based on the profiling information that has been derived for your data to suggest similar datasets that can be linked.

On the left side you see a graph visualization of the datasets and their relations.

Note: The Discovery view shows datasets only when at least two datasets are registered.

The table on the right side lists datasets similar to the currently selected dataset. Blue indicates that a dataset has not been mapped yet, whereas green indicates that a mapping already exists.

  • Use ,  and  to change the order of the table rows.
  • Use  to expand the details of a dataset.
  • Use  to adjust the dataset table.
  • Click an expanded dataset in the list to see more metadata of the dataset.

The graph presents the listed datasets in relation with the currently selected dataset. The selected dataset points to the related datasets. The similarity between two datasets is indicated by a percentage shown in the edge line.

  • Use  to jump to a neighbor dataset.

Create mappings

You can create a mapping specification for blue colored datasets that have not been mapped yet:

  • In the visualization window on the left, select a blue colored dataset.
  • Click  Create mapping specification .
  • Select one or more vocabularies.
  • Click CREATE MAPPING.
  • Add your mapping rules in the editor.

The mapped dataset now appears in green.

For more information on how to edit mapping rules, refer to section Editing mappings.

In the visualization window, click a green dataset to edit () or remove () the mapping specification for this dataset. Use  to explore the related neighborhood of the dataset.

Create linking specification

Note: The option Create Linking Specification is only available between two mapped datasets (green color).

A blue dashed edge (line) between two mapped datasets indicates that there is no linking specification yet. A green edge between two datasets indicates that both datasets are already linked.

Select an edge between two datasets to:

  •  create,
  •  edit, or
  •  delete

a linking specification between the adjacent datasets.

Workflows

The Workflows tab in the Dataset details provides options to manage the data workflows in which the selected dataset is involved. You can define new, edit existing and re-run your workflows here.

A workflow describes a specific set of operations and rules that are executed by eccenca DataIntegration to generate a new merged dataset. To create and execute a workflow, you need a target dataset for storing the workflow results. This target dataset can be any dataset, either one that already contains data or an empty one.

On the left side, you see the active dataset together with all datasets that are linked to it. On the right side, you see all workflows that exist for the active dataset.

  • Use Search… to find a workflow by name and other metadata.
  • Use ,  and  to change the order of the table rows.
  • Use Add Workflow to create a new workflow.
  • Use  to adjust the workflow table.
  • Use  to expand the details of a workflow.

On an expanded workflow:

  • Use  Edit Workflow to adjust the workflow.
  • Use  Delete Workflow to delete the workflow.

Provenance

The Provenance tab lists recorded activities of the dataset. The latest activity records are listed in the table.

  • Use  to change the order of the activities in the table.
  • Click an activity to access details for the entry.