In the last few months we have been collecting the most commonly asked questions and concerns about reference data management – those posed by our recent webinar attendees, those raised during the conferences we attend, speak or exhibit at (such as Enterprise Data World) and brought up in many conversations with our customers, prospects and industry analysts.
In response, we are starting a series of blogs dedicated to sharing our thoughts and perspectives on these important topics. Today, I will address a multi-part question on the lifecycle of codes. Here it is along with details of the challenges involved:
What happens when a code becomes obsolete and/or replaced by another code or codes?
- One recent example is a change of the ISO 3166 2 character country code for Myanmar (Burma) to MM from BU.
- Can the old code be deleted?
What to do with the historical data that is using a code that is no longer in effect?
- Should data be migrated?
- How can we capture the fact that one code was replaced by another one?
How can tools help in this process?
Reference data changes relatively slowly, but it does change. And since there is a lot of reference data and all enterprise applications and data sources use it, people responsible for managing reference data end up addressing code changes rather frequently.
The best practice is to never delete a code once an organization has started to use it. This is because pre-change transactional data (unless migrated, which is often unpractical) will be using the old code. Thus, in order to produce accurate reports and analysis, we need to know that the code existed and what it meant when it was in use. For example, an organization may want to aggregate or compare all its trade exports to Myanmar for the periods that include the time before and after the country code change.
TopBraid RDM will help you enforce this best practice. By default, once the status of a reference dataset is changed to “approved for use”, TopBraid RDM will no longer let you physically delete the codes in the dataset. Additionaly, if you want to use your own dataset statuses, TopBraid RDM is fully configurable. It will let your organization define its own statuses and configure which of the statuses don’t allow physical deletes of codes.
Instead of physical deletes, logical deletes are about changing the status of the code itself and, possibly, capturing its effective dates. TopBraid RDM provides a small status vocabulary that users can include in a reference dataset to keep track of code statuses and associated dates. Alternatively, users can define their own concepts and fields for keeping this information.
In cases where an organization wants to migrate some historical data to using new codes, TopBraid RDM can help facilitate this process by capturing the mapping relationships between old and new codes. Knowing these relationships is also important for reporting.
A new code may have a one-to-one relationship to the previous code, as when Myanmar’s code changed to MM from BU. One code can also be replaced by multiple new codes. For example the Czechoslovakia, country code CS was divided into Czech Republic (CZ) and Slovakia (SK). Finally, multiple codes can be merged into one. As you change the status of a code, you can also create one or more relationships between the old and the new. TopBraid status vocabulary has a pre-built relationship ‘replaced by’. And, of course, you can also create your own relationship or set of relationships if you need to differentiate between different replace operations. Web services and query access to TopBraid RDM let applications access this information as well as people.
In splitting and merging, some of the previous codes may be kept, effectively changing their original meaning. For example, when East and West Germany were re-unified, East Germany, country code DD, was merged into Germany which kept the code DE that was previously associated with West Germany. Such operations add more complexity because they change the meaning of the code. So, while the code DE remains in effect, when using it we may need to know that a change happened to it at some point in the past. In TopBraid RDM, relationships can be accessed from either direction. Thus, when looking at or querying for DE, you will be able to see that DD was folded into it and when this happened.
Every change to a reference dataset should be logged. Some reference data changes on the “as needed” basis, while other datasets have a predetermined schedule for publishing changes such as monthly or quarterly. You may want to capture the frequency and timing of updates so that reference data users know in advance when to expect changes. Similarly, your data governance organization needs to know how they will learn about the changes and from whom. For example, the maintenance agency for ISO 3166 used to publish newsletters with the changes. Today, they provide online annual subscription service for download of the codes that notifies you when changes are made. TopBraid RDM allows you to capture this and other important information about each reference dataset.
In part 2 of this blog series we will look into another frequent question: how to capture and maintain relationships between reference data. This is broader than new and old codes. It encompasses any relationship – for example, between a country and a language or a currency code.
Other blogs in this series:
Part 2: Creating and maintaining relationships between reference data
Part 3: Supporting external and internal reference data
Part 4: Enriching reference data with new information fields