In part 2 of this blog series we will explore another frequently asked question:
How can we capture and maintain relationships between reference data?
A related question may also be “Why do we need to do this?”. Reference datasets don’t exist in a complete isolation from one another. Quite often they are connected. For example, country codes are broadly referred to by other codes such as country subdivisions, currency codes, languages and market exchanges, to name just a few.
One approach is just to repeat the related data. Following this approach one could, for example, create the following table for currencies:
An obvious downside of this approach is that it embeds country names with the currency codes. Now, if the official country name changes, it needs to be updated in multiple places. A much better approach is to define a relationship between the currency and a country. Through this relationship users looking for currency codes can also get all the corresponding country reference data – names, codes and any other information the organization maintains about countries.
The TopBraid RDM screenshot below displays a similar table except, although it looks like text, the column showing countries that use a currency is not a text field, but a relationship. The country name displayed there is not stored with the reference data for currencies, but taken directly from the reference data for the related country. Even in this small sample we see that the country names are slightly different from the ones in the first table. For example, note that the official English name for the part of Saint Martin that uses Netherlands Antillean guilders is Sint Maarten (Dutch Part) not Sint Maarten (SX). If we follow a relationship link to get more information about this location we will see that SX is Sint Maarten’s two-character alpha code.
Using relationships significantly reduces maintenance of related reference datasets and ensures consistency.
There can be more than one relationship between two entities. For example, looking at the detailed information about Argentina in the screenshot above, you will notice that in addition to “country using currency” we have decided to also capture the “issuing country” relationship. Often, these are one and the same, but sometimes a country using a currency doesn’t have the authority to issue it. For example, US Dollar can only be issued by the US, but it is an official currency for countries such as Bermuda, Panama and El Salvador. If we were using text field, we would now have to repeat the name of the country again creating yet another maintenance issue.
What happens when a country code changes its status? For example, becomes ‘conditionally reserved’? This must have an impact on currencies. Since the datasets are connected, we can easily query to examine whether there is any impact of the changes on the related entities and act accordingly.
Relationships also allow us to find currencies based on the other information about countries and their related entities. For example, we may need to know what currency in used in Abkhazia. It is not a country but a subdivision of a country Georgia. Since currencies are connected to countries and countries are connected to their subdivisions, as shown in the screenshot below we can easily find out that Abkhazia is using Lari as its currency.
A relationship is just like any other property or a field associated with the entity that we manage reference data for. It has a name and an associated description. Detailed descriptions of properties are key to understanding the meaning of reference data. We can see in the screenshot below that currency reference data has a number of properties:
The difference between the properties with literal values (text strings, numbers, etc.) and relationships is that the range of values for relationships are other entities – such as other code sets. In this example, the range of the relationship is ISO Countries.
So far, we have focused on relationships between datasets that complement and extend each other like countries and currencies. Another area where relationships matter is one of connecting codes used as alternatives. For example, the ISO 3166-2 standard includes codes for subdivisions of countries such as US states. Some data sources may use these. Alternatively, the FIPS standard provides numeric codes for the US states. It may be used by other data sources. When data from multiple places comes together, it is important that the aggregated data is categorized or coded correctly and consistently even when the original sources were using alternative codes.
Relationships between the alternative codes are often called ‘mappings’. Mappings can be one-to-one as is the case between ISO 3166-2 and FIPS. They can also be one-to-many and many-to-many. Here is an example of one-to-many mappings between an ISCO-08 occupational code and several 2010 SOC occupational codes. ISCO-8 is maintained by the International Labor Organization, while SOC is maintained by the US Bureau of Labor Statistics.
Another example of mappings between these two standards for occupational codes, that is many-to-many, is shown below
Is there any difference in connecting complementary codes (such as Countries and Currencies) versus alternative codes? Technically, in both cases we are using a relationship link. The main differences are:
- When connecting complementary reference data, we want to use a variety of properties with different meanings. For example, ‘countries using a currency’, ‘countries issuing a currency’, ‘subdivision of a country’, ‘located in’ and so on.
- When mapping alternative codes; however, we can generically use a single relationship across any datasets we are mapping. The meaning doesn’t change; this is always about matching the alternatives.
- The role of mappings is different from other relationships. They are used solely for translation from one code system to another.
- In case of many-to-many mapping relationships, it is often important to document the specifics of each mapping. Fairly complex inclusions, exclusions and exception conditions that differ in each case are common and may require explanation notes.
Finally, you may need to capture relationships between codes within the same dataset. For example, you may want to know which countries border each other. Another common use of relationships is to form hierarchies of codes within a given dataset. You may also want to connect codes that were in place at one time, and have been retired, with codes that replaced them. This was covered in the previous blog Deep dives into Reference Data Management with TopBraid RDM – part 1 where we discussed the life cycle of reference data.
The reference data examples used in this blog come from external “standard” sources. Organizations typically adopt some of the standards as their codes and they also create reference data that is unique to their organization. In our next blog, we will tackle questions about the unique requirements and processes for managing external and internal reference data.
Other blogs in this series:
Part 1: Addressing change in reference data
Part 3: Supporting external and internal reference data
Part 4: Enriching reference data with new information fields