Many promising initiatives are underway that seek to combine rich clinical data from electronic health record systems at provider sites across the country into large patient cohorts, and then to combine that data with genetic sequences created from samples provided by each patient in the cohort.
The sponsors of these initiatives span industry, private foundations and the federal government. While the ambitious goals are commendable and the potential for discovery is worthy of the effort, there are data quality and semantic interoperability requirements that must be met before the clinical data can be combined.
Without these standards, consolidation and harmonization of rich clinical data will yield only a portion of the expected value. Identifying genetic determinants of disease from this large population, for instance, requires a clinical data set based on standards such that large subpopulations can be compared using common clinical vocabularies.
The standards to be addressed fall into two categories: 1) existing coding systems that are widely used or incentivized and 2) value lists that still need to be defined.
Category 1 includes coding systems such as: ICD-10 diagnosis codes (and historically ICD-9), CPT and ICD-10 procedure codes, LOINC codes for lab orders and results and RxNorm codes for drugs.
The ICD and CPT coding systems have been used in EMRs and billing systems for decades. LOINC and RxNorm are relatively recent additions, thanks to Meaningful Use Stage 2 incentives. But given that Stage 2 achievement across the industry is below 100 percent, and that large volumes of historical data remain uncoded, it cannot be assumed that all lab orders and results residing in EHRs and clinical data warehouses have been associated with valid LOINC codes. The same holds for orderable drugs and RxNorm.
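One practical way to size the gap is to measure how many lab records in a contributing site's extract carry a LOINC code at all. The sketch below is only an illustration: the row layout and the "loinc" field name are assumptions, and the pattern checks only that a value is shaped like a LOINC code (digits, a hyphen, a check digit). It does not confirm that the code exists in a LOINC release.

```python
import re

# Assumption: LOINC codes look like "<digits>-<check digit>", e.g. "2345-7".
# This is a shape check only, not a lookup against an official LOINC table.
LOINC_SHAPE = re.compile(r"^\d{1,7}-\d$")


def loinc_coverage(lab_rows):
    """Summarize how many lab rows carry a LOINC-shaped code.

    lab_rows is a list of dicts; the "loinc" key is a hypothetical
    field name for this sketch.
    """
    total = len(lab_rows)
    coded = sum(
        1
        for row in lab_rows
        if LOINC_SHAPE.match((row.get("loinc") or "").strip())
    )
    return {
        "total": total,
        "coded": coded,
        "uncoded": total - coded,
        "coverage": coded / total if total else 0.0,
    }
```

A sponsor could ask each site to report this kind of summary before any data is pooled, and run the same kind of check for drug orders against RxNorm.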
Category 2 data include critical items used for cohort subdivision such as: vital signs (height, weight, blood pressure, etc.); allergies; smoking history; demographics such as race, ethnicity, address, age, socio-economics; and encounter types, notably inpatient, outpatient, home health, and so on.
The second category is the more challenging of the two, given the lack of common national standards in these areas. This means that the organizations sponsoring the creation of the cohort must define standards in advance and provide the funding to the contributing providers to map their local data to those standards.
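To make the mapping work concrete, here is a minimal sketch of a sponsor-defined value list for encounter types and one site's local-to-standard map. Every value in it is hypothetical: the standard list, the site's local codes and the function names are invented for illustration and are not drawn from any existing standard.

```python
# Hypothetical sponsor-defined value list for encounter type.
STANDARD_ENCOUNTER_TYPES = {"inpatient", "outpatient", "home_health"}

# Hypothetical local codes used by one contributing site.
SITE_A_MAP = {
    "IP": "inpatient",
    "INPT": "inpatient",
    "OP": "outpatient",
    "CLINIC": "outpatient",
    "HH": "home_health",
}


def map_encounter_types(local_values, site_map):
    """Map local codes to the standard list and collect what fails to map."""
    mapped, unmapped = [], []
    for value in local_values:
        key = (value or "").strip().upper()
        standard = site_map.get(key)
        if standard in STANDARD_ENCOUNTER_TYPES:
            mapped.append(standard)
        else:
            # Keep the original value so the site can review it.
            unmapped.append(value)
    return mapped, unmapped
```

The list of unmapped values is the useful output here: it tells the site and the sponsor where the agreed value list or the local map still has holes.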
In addition, each sponsor should define a common probabilistic patient-matching algorithm to ensure that data for the same patient from multiple contributing sites is linked to the same human being in the cohort.
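The article does not prescribe a particular matching method. As one illustration, the sketch below uses Fellegi-Sunter style scoring, a common approach to probabilistic record linkage: each field adds a positive weight when two records agree and a negative weight when they disagree. The field names, the m and u probabilities and the thresholds are all assumed values for this example; a real sponsor would estimate them from data.

```python
import math

# Assumed parameters per field: (m, u), where
#   m = probability the field agrees when two records are the same person
#   u = probability the field agrees when they are different people
FIELD_PARAMS = {
    "last_name": (0.95, 0.01),
    "birth_date": (0.97, 0.003),
    "zip": (0.90, 0.05),
    "sex": (0.98, 0.5),
}


def match_score(rec_a, rec_b, params=FIELD_PARAMS):
    """Sum log2 agreement/disagreement weights over the compared fields."""
    score = 0.0
    for field, (m, u) in params.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue  # a missing value contributes no evidence either way
        if str(a).strip().lower() == str(b).strip().lower():
            score += math.log2(m / u)
        else:
            score += math.log2((1 - m) / (1 - u))
    return score


def classify(score, upper=8.0, lower=0.0):
    """Assumed thresholds: link, reject, or send to manual review."""
    if score >= upper:
        return "match"
    if score < lower:
        return "non-match"
    return "review"
```

Whatever method is chosen, the point is that every contributing site applies the same fields, weights and thresholds, so that a link made at one site means the same thing as a link made at another.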
Lastly, unstructured text extraction software and models should be defined by the sponsoring organizations such that all exam notes, pathology reports, radiology reports, etc. can be processed to extract or derive clinical facts from the rich collections of unstructured text in every contributing EHR.
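As a very small illustration of deriving a structured fact from note text, the sketch below pulls blood pressure readings out of free text with a regular expression. It is a rule-based toy under stated assumptions: the pattern, the function name and the output fields are invented for this example, the mmHg unit is assumed, and real clinical text extraction needs far more, including handling of negation, context and document sections.

```python
import re

# Assumed pattern: "BP" or "blood pressure", an optional connector,
# then systolic/diastolic as two- or three-digit numbers.
BP_PATTERN = re.compile(
    r"\b(?:BP|blood pressure)\s*(?:is|of|:)?\s*(\d{2,3})\s*/\s*(\d{2,3})\b",
    re.IGNORECASE,
)


def extract_blood_pressures(note_text):
    """Return every blood pressure reading found in a note, in order."""
    return [
        {"systolic": int(sys_val), "diastolic": int(dia_val), "unit": "mmHg"}
        for sys_val, dia_val in BP_PATTERN.findall(note_text)
    ]
```

If each site chose its own rules or models for this step, the same note could yield different facts at different sites, which is why the extraction approach belongs in the sponsor's standards.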
While the above efforts are daunting, they are critical to the success of any attempt to join clinical data from multiple providers into a common repository. Discoveries from large populations of patients’ genetic variant data cannot be made without common clinical vocabularies and ontologies that enable consistent profiling of millions of patients.
Brian Wells is associate vice president of health technology and academic computing at Penn Medicine.