Traditionally, organizations exchanged large amounts of information in the form of PDF and other text documents. Increasingly, text is being replaced by data, creating new data governance challenges. A case in point is the pharmaceutical industry. With IDMP, new EU (and, soon, worldwide) directives are forcing pharma companies to move from text-based to data-based submissions to obtain market authorization for medicines. This involves creating data structures with an estimated 1,700 data points, using more than 75 reference datasets. These reference datasets ensure that every submission uses the same terminology to refer to countries, currencies, substances, diseases, adverse effects and more.

Getting a medicine to market in any country requires going through an extensive procedure in which a well-documented request is submitted to the regulatory authorities. Before the early 2000s, submissions were done on paper, but today companies submit electronic documents via the internet. While this is a step forward, the information exchanged between the market authorization holder (that is, the company trying to obtain permission to sell the medicine) and the health authorities is still unstructured text, essentially digitized paper documents. Analyzing, integrating and otherwise processing such submissions is slow and error-prone.

The European Commission, in cooperation with the US Secretary of Health and Human Services, has decided that a significant number of the key characteristics describing an authorization application must be submitted in the form of raw data. In line with this policy, the European health authority, the European Medicines Agency (EMA), has started an initiative to standardize reference datasets for expressing substances, products, organizations and so-called “referentials.” These standards have been formalized and published by ISO. Together, they are referred to as IDMP, the Identification of Medicinal Products.

The sheer amount of raw data contained in an IDMP-compliant submission is daunting. The number of defined relations and attributes exceeds 400. Many object classes will recur multiple times in a submission: a given medicine may involve multiple substances, each substance may be produced by multiple suppliers, and each of these objects will have specific attribute and relation values. Some estimate that the number of data points in a typical submission will exceed 1,700.
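To make that arithmetic concrete, consider a minimal sketch of how data points accumulate as object classes recur. The classes and attributes below are invented for illustration and are not the actual IDMP object model:

```python
# Hypothetical sketch, not the real IDMP model: three invented object
# classes showing how data points multiply as objects recur.
from dataclasses import dataclass, field

@dataclass
class Supplier:
    organization_id: str   # would reference an organizations dataset
    country_code: str      # would reference a country-code dataset

@dataclass
class Substance:
    substance_code: str    # would reference a substances dataset
    strength_value: float
    strength_unit: str     # would reference a units-of-measure dataset
    suppliers: list[Supplier] = field(default_factory=list)

@dataclass
class MedicinalProduct:
    product_name: str
    substances: list[Substance] = field(default_factory=list)

def count_data_points(product: MedicinalProduct) -> int:
    """Count attribute values in this toy model; real submissions
    accumulate data points the same way, object by object."""
    points = 1  # product_name
    for substance in product.substances:
        points += 3  # substance_code, strength_value, strength_unit
        points += 2 * len(substance.suppliers)  # two attributes per supplier
    return points

# One product, one substance and two suppliers already yield 8 data points.
product = MedicinalProduct("Example 200mg", substances=[
    Substance("SUB-001", 200.0, "MG",
              suppliers=[Supplier("ORG-1", "DE"), Supplier("ORG-2", "FR")])])
print(count_data_points(product))  # -> 8
```

With hundreds of defined attributes and relations instead of the six above, and dozens of recurring objects instead of three, an estimate of 1,700 data points per submission is easy to believe.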

This has far-reaching impact, and not only on European submissions. The US has stated that it will put a similar directive into effect, making the same IDMP format obligatory in the US two years after the EU. Other countries will follow sooner or later.

The objective behind IDMP is to improve the quality of life for all of us. Ensuring that only high-quality, well-vetted drugs get approved requires thorough analysis of large volumes of diverse information. Computers can do that more effectively than humans, but they require data, not text.

As an example, consider a situation where health authorities discover that there is a serious problem with a specific substance produced by a specific supplier. This substance may be used in many medicines marketed by many different companies. Now, the question is, “Which medicines are affected and need to be taken off the market immediately?”

The answer to this question is currently buried in a gargantuan volume of PDF documents. No search engine will ever provide a reliable answer. The only way to resolve this structurally is to express the information as raw data and let a computer process it. It is precisely this set of data that the IDMP directive will produce.
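A hedged sketch of what the recall question looks like once submissions are data: the records, field names and codes below are invented for illustration, and real IDMP submissions are far richer, but the query is equally mechanical.

```python
# Illustrative records only; not actual IDMP structures or codes.
submissions = [
    {"product": "PainAway 200mg", "holder": "Acme Pharma",
     "substances": [{"code": "SUB-001", "supplier": "ORG-42"},
                    {"code": "SUB-007", "supplier": "ORG-13"}]},
    {"product": "CoughStop Syrup", "holder": "Beta Labs",
     "substances": [{"code": "SUB-007", "supplier": "ORG-42"}]},
]

def affected_products(substance_code: str, supplier_id: str) -> list[str]:
    """List products containing the given substance from the given supplier."""
    return [s["product"] for s in submissions
            if any(sub["code"] == substance_code and sub["supplier"] == supplier_id
                   for sub in s["substances"])]

print(affected_products("SUB-007", "ORG-42"))  # -> ['CoughStop Syrup']
```

Because every submission draws its substance and supplier identifiers from the same reference datasets, the filter is an exact match rather than a fuzzy text search.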

Text is good at expressing subtle nuances and capturing a train of thought, but because text requires a person to interpret it, it does not scale to decisions that must be based on large amounts of information. This creates pressure to express information as data so that computers can take over the work. This trend, in turn, necessitates a new approach to managing data quality, in which reference data management is a key ingredient.

Reference data is very commonly used. It goes by many names: code lists, value lists, controlled vocabularies, business vocabularies, look-up tables, taxonomies, thesauri or coding systems. Reference data defines the permissible values of certain data fields, thus providing the information needed to make other data meaningful and unambiguously interpretable.

For example, say a product can be safely stored until three days after opening. To make sure everyone involved has the same understanding of which unit of measure (in this case, days) is meant, the options for specifying it are taken from a shared list. This list is the reference dataset for units of measure and will contain such items as days, months, years, etc. Each entry in the list contains a code to be used in records (“D”) and a descriptive label associated with it, perhaps in different languages (“Days” in English, “Jours” in French), along with other information that helps users understand the meaning of the entry.
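As a sketch, such a reference dataset might look like the following; the codes, labels and definitions are illustrative and not taken from any official list:

```python
# Illustrative reference dataset for units of measure; entries are
# invented, not drawn from any official code list.
units_of_measure = {
    "D": {"label": {"en": "Days", "fr": "Jours"},
          "definition": "A period of 24 hours"},
    "M": {"label": {"en": "Months", "fr": "Mois"},
          "definition": "A calendar month"},
    "A": {"label": {"en": "Years", "fr": "Ans"},
          "definition": "A calendar year"},
}

# A record stores only the code; the reference dataset supplies the meaning.
shelf_life_after_opening = {"value": 3, "unit": "D"}
entry = units_of_measure[shelf_life_after_opening["unit"]]
print(shelf_life_after_opening["value"], entry["label"]["en"])  # -> 3 Days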

Consistent use of well-governed reference data enables information exchange and is also a prerequisite for effective recordkeeping over time. If you find in your organization’s records that, many years ago, products were sold to a customer in a country with country code GDR, it is the reference dataset that lets you retrace this to the German Democratic Republic, even though that country no longer exists. Without managed reference data, historical records soon become incomprehensible.
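One common way to keep such history resolvable, sketched below with a simplified structure and illustrative dates, is to retire entries by marking validity periods rather than deleting them:

```python
# Illustrative only: retired codes stay in the dataset with validity
# dates, so decades-old records can still be interpreted.
country_codes = {
    "DE":  {"label": "Germany", "valid_to": None},
    "GDR": {"label": "German Democratic Republic", "valid_to": "1990-10-02"},
}

def resolve(code: str) -> str:
    entry = country_codes[code]
    retired = "" if entry["valid_to"] is None else f" (retired {entry['valid_to']})"
    return entry["label"] + retired

print(resolve("GDR"))  # -> German Democratic Republic (retired 1990-10-02)
```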

Since the purpose of reference data is to enable shared understanding, it stands to reason that some reference datasets are defined by an external party, often a standards body such as ISO. In addition, however, every organization has reference data defined internally: the list of cost centers and journal headings, the list of function profiles, the lists of production locations, customer types, product lines, et cetera.

To manage reference data properly, a data governance unit must be formed. A central role within this group is the data steward, who is operationally responsible for the required management and administration tasks. In Taxonic’s white paper “Reference Data as a Service: How Emerging Technologies Support the Next Level of Data Governance,” Taxonic CEO Jan Voskuil delves deeper into these administration tasks. The white paper, which Voskuil wrote with input from TopQuadrant’s cofounders, Irene Polikoff and Ralph Hodgson, also discusses in further detail what reference data management is, why it is important, which processes it involves and how this practice can be optimally supported using emerging technologies. Read the white paper to learn more.