.. comments - headings # with overline, for parts * with overline, for chapters = for sections - for subsections ^ for subsubsections " for paragraphs * for H5 + for H6 .. index:: pair: Importing; Spreadsheets .. _importing_spreadsheets_target: Importing Spreadsheets ^^^^^^^^^^^^^^^^^^^^^^ Import Spreadsheet using Template """"""""""""""""""""""""""""""""" **Import Spreadsheet using Template** is available under the *Import* tab for all collection types except Crosswalks, Content Tagsets and Corpora. This importer lets the user select a spreadsheet and a "template" that will be used to convert and store the spreadsheet data. The "template" is created using the mapping process explained in the *Import Spreadsheet using Pattern* importer. The imported spreadsheet must have exactly the same structure as the spreadsheet used to develop the template. The names and order of the columns must be exactly the same. If multiple worksheets are used, the order and structure of each worksheet (even for worksheets that are not imported) must be the same. The mapping can also be created or edited using TopBraid Composer if the mapping capabilities are insufficient and a more complex transformations is required (e.g. concatenation of values). TopBraid Composer’s SPINMAP tool provides a drag-and-drop interface that makes it especially easy to create more complex mappings. Templates developed with TopBraid Composer must be stored in files with “.tablemap.” in their name (for example, *myMapping.tablemap.ttl*) and be uploaded to the EDG server to be available to EDG users. Import Spreadsheet using Pattern """""""""""""""""""""""""""""""" **Import Spreadsheet using Pattern** is available under the *Import* tab for all collection types except Crosswalks, Content Tagsets and Corpora. To use this importer, the worksheet must have a header row with the names of columns. EDG will find the first row with data and will assume that it is the header column. The **Import > Import Spreadsheet Using Pattern** link shows the following screen: .. figure:: _images/edg_import_spreadsheet_using_pattern.png :alt: TopBraid EDG Import Spreadsheet using Pattern :align: center :class: edg-figure-m **TopBraid EDG Import Spreadsheet using Pattern** Use **Browse** to select the spreadsheet file to import. Supported file types are an Excel file (*.xls* or *.xlsx*), a tab-separated value (*.tsv*) file, or a comma-separated value (*.csv*) file. The file should have the expected extension. Because an Excel file may have more than one sheet of data, there is an option to specify a **sheet index** value to identify which sheet to import. The default is 1, for the first sheet. The sheet index counts all sheets in an Excel workbook, including hidden ones. For example, if you enter a 3 here and EDG seems to import the second sheet, there may be a hidden one between the first and second sheet that made the third one look like it was the second one when Excel was displaying the workbook. The Excel online help explains how to check for the existence of hidden sheets. The **Entity type for the imported data** identifies the *class* of the assets you will be importing. Each row in the spreadsheet will be imported as an instance of the selected *class* and the spreadsheet columns can be mapped to the declared properties of the *class*. Select **Next** to continue and specify the mapping. .. note:: All imported assets are given the same type. To import assets of a different type, either import the same spreadsheet multiple times, with different mappings, or bulk edit the data after import. Select Spreadsheet Type """"""""""""""""""""""" The *Select Spreadsheet Type* view enumerates five possible (column-wise) spreadsheet layout patterns, showing an example of each pattern. The **1. No Hierarchy** layout is the most common and simplest to use. We recommend users become familiar with it first, before using other patterns. .. note:: The **No Hierarchy** import pattern can create relationships in the imported data, including hierarchical relationships. However, **referenced asset must exist prior to the import** when using with this pattern. In the mapping, the column used to find existing assets to reference (e.g. hierarchical parents) can be indicated. This can be addressed, for example, by importing the same spreadsheet multiple times using different mappings - perhaps once to create assets and then again to add relationships between assets. For data explicitly structured as a hierarchy, like a taxonomy, there are four layouts from which to select. The main difference between these four layouts and **No Hierarchy** is that children and parents can be imported and connected in a single import. EDG will create and relate both the children and their parents in the same import, even if they did not exist prior to the import. In the hierarchical layouts, each row also indicates its hierarchical path, either explicitly (absolute path, #2, #3, #4) or implicitly (recursive path #5); note that lighter text in the layout patterns indicates optional data. These are complex import patterns that behave differently depending on the URI construction method set for the asset collection. Depending on the URI construction method for the collection, they may not be able to be used. .. figure:: _images/edg_select_spreadsheet_type.png :alt: TopBraid EDG Select Spreadsheet Type Page :align: center :class: edg-figure-l **TopBraid EDG Select Spreadsheet Type Page** .. note:: Note the *header row* of column labels in every layout. The imported spreadsheet must always have a header row. Below the five layout options, the view will show a sample of the spreadsheet’s actual data. The following image shows a spreadsheet of airport codes. .. figure:: _images/edg_source_spreadsheet.png :alt: TopBraid EDG Source Spreadsheet Page :align: center :class: edg-figure-l **TopBraid EDG Source Spreadsheet Page** Select the **layout link** that most closely corresponds to the structure of the spreadsheet. Import No Hierarchy Spreadsheet """"""""""""""""""""""""""""""" After selecting **1. No Hierarchy** the *Import Spreadsheet* page appears where the user defines the data-mapping rules from the spreadsheet columns into the target properties of the class. The page is broken into sections: * Column Mappings * Default Concept Scheme, if a Taxonomies collections * Unique Identifiers * various selectable options * prompt to enter the mapping template name if saving the pattern for future use * Your Source Spreadsheet, data read from the spreadsheet When importing into a Taxonomies collection and not selecting a concept scheme, EDG creates a new scheme using the name of a spreadsheet. All concepts that do not have a broader parent nor are defined as top concepts of some scheme, will be made into top concepts of the selected or automatically created scheme. Column Mapping Section ********************** In the *Column Mapping* section, the user specifies the target property to which each spreadsheet column corresponds. The target properties are taken from target entity type (class or asset type) identified on the previous page. When column names and property names are similar, EDG automatically proposes the mapping. When mapping a relationship, the user can indicate inverse relationships to the target entities. Unmapped columns and their data are ignored during the import. The following example shows an example Column Mapping. .. figure:: _images/edg_import_spreadsheet.png :alt: TopBraid EDG Import Spreadsheet Page :align: center :class: edg-figure-l **TopBraid EDG Import Spreadsheet Page** For the mapping to be successful, the datatype of the spreadsheet cell values should match that of the target property into which the column is being mapped. Do not use the string "abc" in a cell value being mapped into an target property with an integer datatype, for example. In cases where a string-valued target property supports language tags, the user is can optionally set the **Language** value for the import to set. If an imported row will result in a new instance, rather than adding data to an existing instance, then one of columns should be mapped to the *label* property or th *preferred label* if a Taxonomies collection. Those are used as the EDG display name, for auto-complete, etc. When importing relationships EDG needs to find the existing asset to which to add a reference. If the asset does not yet exist, EDG may be able to create a URI as the value of the reference. After selecting a relationship property from the dropdown, the following methods for directing EDG on how to build relationship values are possible: 1. If the selected class has a designated :ref:`primary_key_constraint`, no additional information is needed. EDG uses the value in each row of the mapped column to form the URI of the new instance according to the primary key definition. This option is demonstrated in the above screenshot – *airport country* is a relationship from airports to countries and the ontology defines a primary key for the class *Country*. * Therefore, the values in the mapped column must the exact value of the property used as the primary key meaning it must be unique for all instances of the class. * In the example, the Country primary key values are 2 character country codes and the Country Code column with those values is mapped to the “airport country” relationship. * Note that in this case, imported rows will *always* construct a reference with that URI structure regardless of whether it exists in the collection or not. 2. If the values in the mapped column are actually valid URLs, then they can be used “as-is” to be the URI of the referenced asset, as indicated by the associated **Use values as URIs** label. 3. If neither option 1 nor option 2 apply, the user selects a property of the related class on which to to match. The property will be used to find assets at the end of the relationship and will only create the relationship if the asset already exists in the collection. * For example, if Country did not have a defined primary key, the user could map the Country Code column and select “ISO 3166-2 alphabetic country code” as the property to match. * Values of the matching property must be unique across all instances of the class. If duplicate values are found, then the related resource will be assigned arbitrarily. Option 3 is demonstrated in the screenshot below – after removing the primary key from the class Country. .. figure:: _images/edg_import_spreadsheet_country_code.png :alt: TopBraid EDG Import Spreadsheet - Country Code Mapping :align: center :class: edg-figure-l **TopBraid EDG Import Spreadsheet - Country Code Mapping** For *inverse* relationships, the spreadsheet column represents links from instances of some other class to the instances the import creates. Similar to forward relationships, if an inverse relationship is the chosen mapping, then there is a further choice of which referencing-class property to use to identify the referencing instances. .. note:: * As explained under *Other Parameters* below, if the target of a relationship has the same entity type as the entity type chosen for the import AND Option 3 is being used then the **Override existing values** option must be unchecked for the relationship to be created. * When importing into a *Reference Datasets* collection, one of spreadsheet columns must map to the primary-key property of the main entity (class) for the dataset. For example, the screen image above identifies this field as the IATA code. * When importing into a *Taxonomies* collection, the user can select a concept scheme to contain the imported concepts. Otherwise, EDG will create a new concept scheme and make all concepts that have no parents in the spreadsheet its top concepts. Unique Identifiers ****************** This section explains the logic EDG uses to generate URIs for the imported data. If the import target class has a :ref:`primary_key_constraint`, no selections are required . Instead, use the **Column Mapping** section to map one of the spreadsheet columns to the primary key property. Otherwise, some selections are required. The available selections depend on the **URI Construction Rules** configured for the asset collection. If the **URI Construct Method** is *label*, specify the column(s) to be used to generate the URI of each imported row as shown below. .. figure:: _images/edg_unique_identifiers_labels.png :alt: TopBraid EDG Unique Identifiers Based on Labels :align: center :class: edg-figure-l **TopBraid EDG Unique Identifiers Based on Labels** Typically, the spreadsheet column containing the *label* value will map to the **Id column #1**, leaving the rest of mappings empty. However, it is possible to select a different column to generate URIs or even a combination of columns. When multiple columns are selected, their values are concatenated to form the URI. The **Start of URIs** option is also available to modify the default namespace of the collection to be used as the basis of the generated URI for the import. For a successful import adding new information to existing assets, make URI choices that will match the URIs of those assets. If the **URI Construct Method** of the asset collection is either the *counter* or *uuid*, a different set of options are needed, as shown in the following figure. .. figure:: _images/edg_unique_identifiers_uuid.png :alt: TopBraid EDG Unique Identifiers Based on Counter or UUID :align: center :class: edg-figure-l **TopBraid EDG Unique Identifiers Based on Counter or UUID** If the import is creating new instances, leave the selection empty and EDG will generate URIs according to the currently configured URI Construction Method for the collection, using the default namespace for the URIs. If the import is adding information to existing assets: * use a spreadsheet containing a column with values that are the URIs of existing assets and select that column as the URI column; * match on a property to find the existing assets. Values of this property must be unique for the entity type. If duplicate values are found, assignments will happen arbitrarily. Other Parameters **************** Other parameters are located directly below the Unique Identifiers section. Selecting **Overwrite existing values** will *delete* an existing value for a mapped property before adding its new (different) value; otherwise, new values will be added to existing ones. If the imported rows are adding new data values to existing instances and/or adding new instances, it is best to make sure that the *Override existing values* option is **unchecked**. Checking this option has the following consequences: * If an instance already exists and has a value for any of the mapped columns, the value will be replaced with the spreadsheet data. * Relationships between instances of the same type that rely on matching of values will not be created (because these values may be overridden as part of the processing). * When working with a Taxonomies collection, a combination of checked *Override existing values* and the *No Hierarchy* pattern will always make imported instances top concepts of a new Concept Scheme, even if they already exist in the Taxonomy and have parent concepts. Selecting **Record each new triple in change history** (warning: not recommended for large files) prevents EDG from recording the addition of each new triple in the change history. .. note:: This option will not let you build relationships to assets that are members of the same class as the one you are importing – because these values can be replaced as part of the import. A **Preview** button on the Import Spreadsheet form shows the RDF triples that would be generated with the currently configured settings. The browser’s *Back* button returns to the form. **Make this a reusable mapping template** is optional and saves all of the settings on this form for later reuse. Reusable mappings are selectable using **Import Spreadsheet using Template** on the Import tab instead of **Import Spreadsheet using Pattern**. When used, a drop-down list of the saved template names appears for selection. When satisfied with the sample data shown on the preview, click the **Finish** button. EDG will start the import, running it in the background. Import Using Hierarchical Spreadsheet Patterns """""""""""""""""""""""""""""""""""""""""""""" After selecting one of the hierarchical patterns, there are three sections: * Column Mappings * Hierarchy * Unique Identifiers and, as on the previous page, for convenience an example of the source spreadsheet data is shown. There are also URI column selections below the Unique Identifiers section, unless the URI Construct Method is *counter* or *uuid*, where no option to specify columns to use as URIs appears. EDG will always generate the URIs following the chosen method. When the asset collection uses the *counter* or *uuid* methods, these importers **CANNOT** be used to overwrite existing hierarchies – they can only create new assets: * If the need is to add information to the previously imported hierarchies, use **No Hierarchy** import. * If the need is to add a child tree consisting of new resources to an existing hierarchy, these importers can be used. However, the top of the new tree will not be connected to an already existing parent, that connection must be added after the import. Most of the options on the hierarchical import pages are the same as those described for the No Hierarchy pattern. Those are not described here again and this section of the guide focuses on the unique aspects of the hierarchy mapping. For all the hierarchical patterns, select a **Hierarchy Property** (e.g. "has broader”) to connect items in the hierarchy. .. note:: * All hierarchical levels will be connected using the same relationship. * To create different relationships between levels, use the **No Hierarchy** pattern. The **Generate in inverse direction** checkbox will reverse the direction of how the property specified in **Hierarchy property** is applied. When working with Taxonomies collections, there is an option to select an existing **Concept Scheme**. If not specified, the importer creates a new concept scheme using the name of a spreadsheet. All concepts that do not have hierarchical parents will be made top concepts of the scheme. Path with Separator Pattern *************************** This pattern works **ONLY** if the URI Construct Method is *label*. For **Path with Separator** spreadsheets, in which a spreadsheet entry such as “World > Europe > France” indicates the hierarchical structure above the term “France”, the **Hierarchy** mapping section works as follows: * Select a column containing the path and type a separator e.g., “>”. * Identify the Column containing the *last node of each path string* * In the **Column Mapping** section, to generate a name for imported resources, make sure to assign some column as the preferred label (in case of Taxonomies) or as the label (for all other asset collections) – to generate labels. This will typically be the same column as the one you selected in the **Hierarchy** section as containing a last node of the path. * Map this column again in the **Unique Identifiers** section. If not specified, EDG will use row numbers to generate URIs. Alternatively, use other column(s) to generate URIs. Column-based Tree Pattern ************************* When using this pattern and when URI Construct Method is *counter* or *uuid*, the spreadsheet needs a single column containing the label for each asset. When URI Construct Method is *label*, there is not need for such a column, EDG will assume that the hierarchy columns contain the labels. For **Column-based Trees** spreadsheets, the **Hierarchy** mapping section works as follows: * Specify the top and bottom levels of the hierarchy by picking the first and last columns containing hierarchical levels. All hierarchical columns must be located together and sequentially in the spreadsheet. * If a column is mapped in the **Hierarchy** section, **DO NOT** map it in the **Column Mapping** section nor the **Unique Identifiers** section. These sections are used **ONLY** for mapping columns that do not specify the column based tree. * If the URI Construction Method is *label*, EDG will assume that the hierarchy columns contain labels of respective resources. If not *label* then, as mentioned above, a separate column (outside of the hierarchy) needs to contains the label for each resource * If the URI Construction Method is *label*, leave the **Unique Identifiers** section empty and EDG will use values in the hierarchical columns to generate URIs. Only make mappings in this section to override the Label-based approach and use some other values for the URIs. * Carefully examine the **Column-based Trees** sample layout on the **Select Spreadsheet Type** screen. It is important that each item in the hierarchy has a row of its own. See below for correct and incorrect options. .. figure:: _images/column_based_tree_pattern.png :alt: Column-based Tree Pattern Examples :align: center :class: edg-figure-l **Column-based Tree Pattern Examples** It is important to remember that, as with all spreadsheet import options, all resources will be imported as members of the same class as selected as the start of the import mapping definition. An import cannot support Level 3 representing countries and Level 2 representing continents. Path with Fixed-length Segments Pattern *************************************** This pattern requires a path column, where values are such that removing a string of a fixed length from a value identifies a parent for a resource on that row. For example, if using 2 character segments and "Australia" has a path column value of "010201", its parent would be on a row with a path column value "0102" and its parent’s parent would be on a row with path column value "01". EDG finds a parent by removing the exact number of characters specified in the segment length from the child’s path column value. The top most items could have a path column value that is different from the segment length e.g., 1 instead of 01. For **Path with fixed-length Segments** spreadsheets, the **Hierarchy** mapping section works as follows: * Specify the column with the path values. * Specify the length of the segments to use to calculate the parent row. * In the **Column Mapping** section, to generate a name for imported resources, make sure to assign some column as the preferred label (in case of Taxonomies) or as the label (for all other asset collections). Otherwise, labels will not be generated. * If the URI Construction Method is *label*, map the same column again in the **Unique Identifiers** section. If not specified, EDG will use row numbers to generate URIs. Alternatively, use other column(s) to generate URIs. Self-join Pattern ***************** For spreadsheets following the **Self-Join** pattern, the **Hierarchy** mapping section works as follows: * Specify the **Column containing the parent ids** – this column will not necessarily be used to generate URI, it is simply a way to match children and parents * Specify the **Column containing the child ids** – this column will not necessarily be used to generate URI, it is simply a way to match children and parents * In the **Column Mapping** section, to generate a name for imported resources, make sure to assign some column as the preferred label (in case of Taxonomies) or as the label (for all other asset collections). Otherwise, labels will not be generated. Typically, but not necessarily, this will be the column you used as a *Column containing the child ids*. * If the URI Construction Method is *label*, map the same column again in the **Unique Identifiers** section. If not specified, EDG will use row numbers to generate URIs. Alternatively, use other column(s) to generate URIs. Import Data Set from Spreadsheet """""""""""""""""""""""""""""""" **Import Data Set from Spreadsheet** is available under the *Import* tab and is available only for **Data Assets** collections. It reads the input spreadsheet and creates an EDG *Spreadsheets Workbook* instance, and a *Spreadsheet DataSet* instance that is part of the workbook and with related *DataSet Element* instances for each spreadsheet column. The import also includes data profiling for each imported column as shown in the following figure. .. figure:: ./_images/edg_data_profile.jpeg :alt: TopBraid EDG Data Profile Page :align: center :class: edg-figure-l **TopBraid EDG Data Profile Page** .. figure:: _images/edg_frequencies.jpeg :alt: TopBraid EDG Frequencies Page :align: center :class: edg-figure-l **TopBraid EDG Frequencies Page** Import Crosswalk from Spreadsheet """"""""""""""""""""""""""""""""" **Import Crosswalk from Spreadsheet** is available under the *Import* tab and is available only for **Crosswalk** asset collections. The input spreadsheet must contain two columns: * the first column must contain the primary key used to build URIs of resources in the *From* asset collection; * the second column must contain primary key used to build URIs of resources in the *To* asset collection. Import Property Definitions (Schema) from a Spreadsheet """"""""""""""""""""""""""""""""""""""""""""""""""""""" A spreadsheet can be used to to create property definitions for a class in an **Ontologies** collection. For information on how to do this, see the guide on :ref:`creating_ps_from_ss_columns`.