Chapter 1. Introduction to Healthcare Data
Healthcare data is an exciting vertical for data science, and there are many opportunities to have real impact, whether from a clinical or technical perspective. For patients and clinicians, there is the alluring promise of truly personalized care where patients get the right treatment at the right time, tailored to their genetics, environment, beliefs, and lifestyle—each requiring effective integration, harmonization, and analysis of highly complex data. For data scientists and computer scientists, there are many open problems for natural language processing, graphs, semantic web, and databases, among many others.
Additionally, there are “frontier” problems that arise given the specific combination of a specific technology and the nuances and complexities of healthcare. For example, there is nothing about healthcare data itself nor data science that requires “regulatory-grade” reproducibility. Data scientists know how to use version control tools such as Git, and IT people know how to create database snapshots and use Docker containers. However, with regulatory bodies such as the US Food and Drug Administration (FDA) or the European Medicines Agency (EMA), there are specific requirements to track and store metadata and other artifacts to “prove” the results of the analysis, including reproducibility. Similarly, there is increasing desire and pressure to ensure reproducibility of studies or the sharing of negative results among academics. How we can address these challenges at scale is still unsolved.
Despite the excitement for working with various types of healthcare data, there are still many misconceptions. Those with extensive experience working in enterprise environments tend to underestimate the complexity, often comparing real-world healthcare data projects to enterprise integrations. This is not to say that a typical enterprise data project is simple or easy. One of the major differences is the relation of how and why the data was captured relative to the actual work being done.
In nearly every industry, the use of data today is a function of engineered systems. In other words, most data is generated by software systems rather than collected and entered by a human. For example, in advertising/marketing analytics, the data is generated by websites that track clicks and impressions. In healthcare, by contrast, much of the most valuable data is still collected and entered by humans (clinicians, billing specialists, and patients) within workflows designed for care delivery rather than analysis.
In this chapter, we will walk through some of the nuances and complexities of healthcare data. Much of this complexity is a reflection of the delivery of healthcare itself—it is just really complicated!
For those with a traditional IT background or who have worked in large companies dealing with complex data issues, we will start with a little discussion of the enterprise mindset and how you might frame healthcare data. After this, we will dive into a broader view of the complexities of healthcare data. Once this foundation has been set, you will get a broad overview of common sources of healthcare data.
The Enterprise Mindset
The data science industry has had many successes—both from companies applying data science and from the creation of new data science methods. When leveraging data science, most organizations have the benefit of following the traditional enterprise mindset. Information and data architects within the organization can sit down together, discuss the various sources of data and intended use cases, and then craft an overarching information model and architecture.
Part of this process typically involves getting various stakeholders together into a single room to agree on how best to define individual nuggets of data or information. Until recently, this has been the approach that most companies have taken when trying to build data warehouses. The challenge in healthcare is that the sources of data operate in disconnected silos. When a patient enters the healthcare system, they typically do so via their primary care physician, urgent care, or the emergency department.
Naturally, one might say we should start here in order to create the information model that will be used to represent healthcare data. After all, nearly everything else flows downstream from the moment a patient makes an appointment or shows up at urgent care. Insurance companies or governments will need to reimburse hospitals for providing care; physicians will prescribe medications and companion diagnostics from the biopharma industry.
So, information architects can start by defining the idea of a patient and all of the associated data elements, such as demographics, medical history, and medication prescription history. However, as we start to look at the healthcare industry overall, there are already potential issues even when defining the “simple” idea of a patient. How does an insurance company think of patients? At least in the United States, insurance companies typically think of people as covered lives, not patients. While some may be quick to say that insurance companies are dehumanizing people and thinking only of statistics, it largely comes down to the operational aspects of tracking benefits.
While the person who seeks care is the patient, they may be a beneficiary of someone else who is not the patient. For example, if a child goes to urgent care for some stitches after falling at the playground, they are the patient as far as the clinic is concerned. However, to the insurance company, there are two people who need to be tracked—the child as well as the parent or guardian who is the insurance policy holder. The insurance company must track these two individuals as different people for the claim, though they are obviously related.
The Potential of Graphs
The previous example highlights how graphs can be particularly useful when dealing with complex data. A single claim can be created as a node that is connected to two other nodes, one each for the child and their parent/guardian.
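As a rough illustration (not the author's implementation), the following is a minimal sketch using the networkx Python library; the node identifiers and attributes are hypothetical:

```python
import networkx as nx

# Hypothetical identifiers and attributes; real values would come from the claims system.
g = nx.MultiDiGraph()
g.add_node("claim:1001", type="claim", amount=250.00)
g.add_node("person:child", type="person")
g.add_node("person:parent", type="person")

# Edges capture the role each person plays on this particular claim.
g.add_edge("claim:1001", "person:child", relation="patient")
g.add_edge("claim:1001", "person:parent", relation="policyholder")
# The family relationship exists independently of any single claim.
g.add_edge("person:parent", "person:child", relation="guardian_of")

for source, target, attrs in g.edges(data=True):
    print(f"{source} -[{attrs['relation']}]-> {target}")
```

The same structure maps naturally onto a property graph database, where the claim and the two people become nodes and their roles become labeled relationships.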
Even this relatively “simple” example can get complicated quickly, and we will discuss similar examples throughout this book, highlighting the hidden complexities of data.
Those who are veterans in the data space will likely see the parallels between the previous discussion and challenges in setting up data warehouses. While an organization may be united in its overall product, each business unit or department within the organization may have very different views of a customer or user. Anecdotally, it can take upward of 18 months to get all of the stakeholders to agree on a common understanding of concepts that make up an organization’s enterprise information architecture. It is important for us to find ways to narrow this window, though it is less about getting everyone to agree and more about getting the right people, process, and tools in place to help us iterate more effectively.
That said, we must keep in mind that there are inherent complexities when delivering healthcare, and those complexities creep into the data that is captured. Instead of chasing the new shiny object (whether that’s a new database, orchestration framework, or other technology), we need to consider the core challenge and how to best and most cost-effectively address it.
To start us on our journey, we will spend a bit of time talking about the inherent complexity of healthcare data, particularly real-world data (RWD).
Real-World Data
The term real-world data is a bit odd since it implies there is some sort of data that does not come from the real world. In this book and throughout pharma and other parts of healthcare, we are referring to data collected about patients during the delivery of healthcare.
If you work for a hospital, it’s just “data” (though there is also a need for academic medical centers and health systems to distinguish data collected as part of clinical research).
The Complexity of Healthcare Data
Data meshes are one example of the changing landscape when it comes to data—there have already been evolutions from databases to data warehouses to data lakes, and the evolution will continue. This shifting thinking around data highlights a key aspect of healthcare data.
Organizations and their leadership are starting to realize that there is a lot of potentially useful data that lives outside of traditional systems such as the databases and applications that support sales or manufacturing. As organizations struggle to keep up with an increasingly heterogeneous data landscape, ideas such as the data mesh emerge to address the shortcomings of existing approaches.
In healthcare, however, these complexities have always been a rate-limiting factor. Hospital administrators are not just now starting to see value in data and, as a result, trying to find scalable and repeatable ways to link data from the various service lines within a hospital. Nor are pharmaceutical companies only now realizing that there is a tremendous amount of value in claims data from a medical affairs perspective.
Consequently, healthcare has been struggling to find ways to bring disparate data sources together, while still balancing critical issues such as security and privacy, governance, and (most importantly from the perspective of data scientists) normalization and harmonization of the data. Chapter 3 will go into more detail, but data scientists often think of normalization in the context of statistics or machine learning algorithms (e.g., min-max scaling, zero mean and unit variance, etc.). While that definition is certainly true and applicable to healthcare data, there is an additional element of normalization, often referred to as harmonization.
Harmonization requires some domain knowledge and is not purely a function of the data itself. For example, we may need to find all of the ways the dataset might represent the idea of “Tylenol” or “Acetaminophen” and replace them with a single representation; the dataset may even have an internal, hospital-specific code of “M458239.” Whether we are attempting descriptive statistics, machine learning, or any other type of analysis, we must first harmonize the data to reduce this noise.
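As a minimal sketch of what this looks like in practice, assume a hypothetical list of raw medication strings; in a real pipeline the mapping would be driven by a terminology such as RxNorm rather than a hand-maintained dictionary:

```python
# Hypothetical raw values as they might appear in an EHR medication field.
raw_medications = ["Tylenol", "ACETAMINOPHEN 500 MG", "acetaminophen", "M458239"]

# Illustrative synonym map; the hospital-specific code is the one from the text.
SYNONYMS = {
    "tylenol": "acetaminophen",
    "acetaminophen": "acetaminophen",
    "acetaminophen 500 mg": "acetaminophen",
    "m458239": "acetaminophen",  # internal, hospital-specific code
}

def harmonize(value: str) -> str:
    """Map a raw medication string to a single canonical representation."""
    return SYNONYMS.get(value.strip().lower(), f"UNMAPPED:{value}")

print([harmonize(m) for m in raw_medications])
# ['acetaminophen', 'acetaminophen', 'acetaminophen', 'acetaminophen']
```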
Epidemiologists, biostatisticians, clinical researchers, and medical informaticists are well aware of these issues, though I have seen a variety of approaches. Most often, I see people embed any necessary harmonization directly into their code, particularly using SQL at query time. From an engineering standpoint, this makes the code quite difficult to maintain over time since you would need to alter the query should you need to change anything. Of course, this means that your dataframes would also change, which may affect your downstream code. As an alternative, many scientists will extract the data with as little processing as possible in SQL and do any harmonization in R, Python, SAS, or another environment. While this improves the maintainability of the code, it is not easily reusable. As we will discuss throughout this book, there are other ways to handle this that improve the overall engineering. But, as with any technology decision, there will be trade-offs!
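To make the trade-off concrete, here is a hedged sketch of the two approaches; the table and column names (medication_orders, medication_name) are hypothetical:

```python
import pandas as pd

# Approach 1: harmonize at query time. The mapping is buried inside the SQL,
# so any change means editing and re-validating the query itself.
QUERY_TIME_SQL = """
SELECT patient_id,
       CASE LOWER(medication_name)
            WHEN 'tylenol'  THEN 'acetaminophen'
            WHEN 'm458239'  THEN 'acetaminophen'
            ELSE LOWER(medication_name)
       END AS medication
FROM medication_orders;
"""

# Approach 2: extract with as little processing as possible, then harmonize in Python.
raw = pd.DataFrame(
    {"patient_id": [1, 2], "medication_name": ["Tylenol", "M458239"]}
)  # stand-in for pd.read_sql("SELECT patient_id, medication_name FROM medication_orders", conn)

mapping = {"tylenol": "acetaminophen", "m458239": "acetaminophen"}
raw["medication"] = (
    raw["medication_name"].str.lower().map(mapping).fillna(raw["medication_name"])
)
print(raw)
```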
Now that we have some high-level exposure to the complexity of healthcare data, let’s discuss some specific types and sources of healthcare data.
Sources of Healthcare Data
There are many different sources and types of healthcare data. This book focuses on real-world data, data that are collected during the course of delivering healthcare. This can come in many different forms, though the most common is data from electronic health records (EHRs).
By definition and given the constant innovation, nearly anything can be considered RWD if it’s used to help improve the quality and efficiency of care or to help improve patient outcomes. We saw this during the COVID-19 pandemic, when all sorts of data were used to help combat the spread of the virus. This might include something as common as your GPS/location data or something more specific such as data from contact tracing apps.
We will review a few common sources of healthcare data so that you have a sense of the landscape. This is by no means an exhaustive list and is simply to help orient you for the rest of this book.
Electronic Health Records
When people think about healthcare data, one of the most common sources that comes to mind is the EHR. You have probably heard of Epic or Cerner, two of the most common EHRs used in the United States. There are many other commercial EHR vendors as well as open source projects, some focused on the United States and others more internationally. While I will list a few common ones, the main focus of this section is to provide you with an introduction to the data typically captured in an EHR and how we need to approach it as data engineers and data scientists.
Electronic health record versus electronic medical record
You may be wondering—what is the difference between an EHR and an electronic medical record (EMR)? NextGen, an ambulatory EHR,1 differentiates an EMR as a patient’s record in a single institution and an EHR as a comprehensive record across multiple sources. The Office of the National Coordinator for Health IT (ONC) provides a slightly more specific definition: an EHR is a comprehensive record from all the clinicians involved in a patient’s care.
Competing Definitions
You may find various definitions of EHRs and EMRs out there. One common distinction is that an EMR is primarily used by clinicians as a replacement for the paper chart within a single institution; an EHR is a more comprehensive record of a patient and may include data from multiple sources.
In practice, the industry often uses these terms interchangeably, so be sure to clarify what someone means when they use the term EHR versus EMR.
Despite this distinction, it has been my experience that the terms are used interchangeably. As I will continue to mention over and over (especially in Chapter 3), it is important to make sure that everyone is using the same definition. In this case, when you are discussing EHRs or EMRs, be sure to clarify whether you are discussing records from a single institution or from multiple institutions. Throughout this book, I will use the term EHR to mean a patient’s clinical record from one or more sources but will also highlight situations where more specificity is necessary.
EHRs and data harmonization
If I had to choose a single theme to describe this book, it would be this idea of data harmonization. Data harmonization and interoperability are closely related and often used interchangeably. In this book, I use interoperability to describe the sharing of healthcare data from a transactional perspective. In other words, how can we share data about a patient as part of the patient journey or care process? For example, when a patient is admitted to the emergency department, how do we transfer information about their visit to their primary care physician? Or, if the patient is referred to a specialist, how is their data sent to their new physician?
On the other hand, I use data harmonization to refer to the process of integrating data from multiple sources in such a way that as much of the underlying context is preserved and the meaning of the data is normalized across different datasets. For example, as a pharmaceutical company, I may want to combine a claims dataset from a payer with a dataset from a hospital EHR. How can I ensure that the medications in both datasets are normalized such that a simple query such as “I want all patients who received a platinum-based chemotherapy” returns the correct patients from the EHR dataset and the correct claims from the payer dataset?
Both interoperability and data harmonization are big challenges in healthcare. There is also a lot of overlap in the underlying issues and associated solutions as highlighted in Figure 1-1. For example, whether one is transmitting a list of a patient’s medication history from one hospital to the next or trying to combine two datasets, the solution may be to link the medications to a standard coding system such as RxNorm or National Drug Codes.
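As a small, hypothetical sketch of why this matters for the platinum-based chemotherapy question above: once both datasets use the same canonical drug names (for example, via RxNorm), a class-level question becomes a simple membership test. The dataframes and column names here are invented for illustration:

```python
import pandas as pd

# Hypothetical, already-harmonized extracts: one from an EHR, one from payer claims.
ehr = pd.DataFrame({"patient_id": [1, 2], "medication": ["carboplatin", "pembrolizumab"]})
claims = pd.DataFrame({"member_id": [2, 3], "medication": ["cisplatin", "oxaliplatin"]})

# Platinum-based chemotherapy agents; the class membership is the domain knowledge.
PLATINUM_AGENTS = {"cisplatin", "carboplatin", "oxaliplatin"}

print(ehr[ehr["medication"].isin(PLATINUM_AGENTS)])        # matching EHR patients
print(claims[claims["medication"].isin(PLATINUM_AGENTS)])  # matching claims
```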
For the purpose of this book, the key issue is that how you interpret the data largely depends on why the data was collected, who collected the data, and any workflow or user experience limitations during the collection process—what I collectively refer to as context. As data engineers and data scientists, we do not always have access to all of the necessary context, and we oftentimes need to make a best guess. However, it is important that we at least ask ourselves how the data might have been affected.
One of the most commonly highlighted examples of the challenges with data harmonization of EHR data is the use of International Classification of Diseases (ICD) codes within a patient’s clinical record. ICD codes are a common method for tracking diagnoses within a patient’s clinical record—at least this is the high-level intention and assumption.
However, to dig a little deeper, we need to ask a few questions:
- How are the ICD codes actually used?
- Why ICD codes (versus any other type of code or coding system)?
- Who assigns the codes to a patient’s clinical record?
These may seem obvious, but remember that not every organization approaches data the same way. In the United States, ICD codes in an EHR are typically used as part of the billing process and entered into the system by medical billing specialists, not physicians or nurses. This impacts our analyses because we need to keep in mind that the codes may be used to justify other parts of an insurance claim or to satisfy internal reporting requirements, and not intended at all to accurately document a patient’s medical history. Let’s walk through an example that dives into some of the potential complexities when dealing with something as seemingly simple as diagnosis codes.
A patient may go to their primary care physician for a routine checkup. During the encounter, a physician suspects that their patient may have diabetes and decides to order a hemoglobin A1c test. When the billing specialist reviews this encounter and prepares a claim for submission to the insurance company, they add the ICD-10 code of Z13.1, “Encounter for screening for diabetes mellitus.” At the same time, there is an internal policy to also code such patients with the code R73.03, “Prediabetes,” which is used to feed a dashboard for subsequent follow-up with educational and other patient engagement materials.
As data scientists, we come along months or years later and are attempting to generate a cohort of patients who are prediabetic (but not yet diabetic) and subsequently diagnosed as diabetic. We know that there is the perfect ICD-10 code (R73.03) and proceed to write a SQL query to find all patients who have that code. However, what if our patient had an A1c result of 5.5% (completely within normal limits) and the physician was just being overly cautious? In this scenario, our SQL query would erroneously return this patient as part of our prediabetic cohort.
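To make this failure mode concrete, here is a minimal sketch with hypothetical table, column, and patient values; the naive code-only query happily returns a patient whose A1c is completely normal:

```python
import pandas as pd

# The "obvious" cohort query, relying on the diagnosis code alone (hypothetical schema).
NAIVE_SQL = """
SELECT DISTINCT patient_id
FROM diagnosis_codes
WHERE icd10_code = 'R73.03';
"""

# Stand-in data: patient 42 was coded R73.03 for billing/reporting reasons,
# but their A1c of 5.5% is within normal limits.
diagnoses = pd.DataFrame({"patient_id": [42], "icd10_code": ["R73.03"]})
labs = pd.DataFrame({"patient_id": [42], "test": ["HbA1c"], "result_pct": [5.5]})

cohort = diagnoses.loc[diagnoses["icd10_code"] == "R73.03", "patient_id"]
print(labs[labs["patient_id"].isin(cohort)])  # included despite a normal A1c
```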
To further highlight the potential complexity, let’s say our clinic is currently participating in a local program with the county’s department of public health on a diabetes screening campaign. The physicians in our clinic are now ordering prediabetes screenings at a much higher rate than physicians in other counties in the state. If we continue to rely on this “perfect” ICD-10 code of R73.03, it will appear as if this particular county has an extremely high rate of prediabetics who are never subsequently diagnosed with diabetes. This could be interpreted as evidence that the campaign is extremely successful in preventing diabetes.
Given this example, what is the solution? Typically, in this sort of a situation where we have an objective definition of “prediabetes,” we can just add a threshold to our SQL query, triggering off the A1c result. For example, the threshold for prediabetes may be 5.7–6.4%, so we can update the WHERE clause of our SQL query accordingly.
However, the threshold at which a person is diagnosed as diabetic is not uniform across all hospitals, clinics, or health systems. For example, while our clinic uses 6.4%, another clinic in the county or state may use a threshold of 6.7%. As data scientists and data engineers, we must then query for the raw lab result and apply additional filtering later in our process.
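A minimal sketch of that downstream filtering, assuming hypothetical per-site cutoffs and column names (the 6.4% and 6.7% values come from the example above):

```python
import pandas as pd

# Raw lab extraction with minimal SQL-side filtering; thresholds are applied
# downstream because sites use different diabetes cutoffs.
labs = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "site":       ["clinic_a", "clinic_b", "clinic_a"],
    "hba1c_pct":  [6.5, 6.5, 6.0],
})
DIABETES_CUTOFF = {"clinic_a": 6.4, "clinic_b": 6.7}  # per-site thresholds

labs["above_site_diabetes_cutoff"] = labs.apply(
    lambda row: row["hba1c_pct"] > DIABETES_CUTOFF[row["site"]], axis=1
)
print(labs)
```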
While the previous example may seem unnecessarily complex, it is actually simpler than many other situations involving diagnoses, especially those in specialties such as oncology or neurology where the diseases themselves are not as well understood. Things also get quite complicated when you start factoring in complexities of specific disease areas. For example, “line of therapy” is quite nuanced and a challenge for researchers and data scientists.2 I will continue to share similar examples and show how the use of graph databases can make our job as data engineers and data scientists easier and more efficient, particularly through the lens of reproducibility.
That concludes our overview of electronic health record data, but we will spend a bit more time with EHR data in Chapter 4. While EHR data is usually one of the first data sources to come to mind when someone mentions real-world data, another very common type of data is reimbursement claims, which we will cover in the next section.
Claims Data
Claims data captures the financial side of healthcare delivery. While this is commonly associated with the US healthcare system given our reliance on private insurers, countries with national health systems also track similar data (essentially treating the government as the payer). You may hear the term payer used to describe insurance companies, and I will also use that term throughout this book.
On the surface, we might assume that the data contained in a claim (e.g., diagnoses, medications, procedures, etc.) corresponds to the data contained in the patient’s record in an electronic health record. For example, if you go to your doctor’s office for a routine checkup and they do a blood draw and order a panel of lab tests, we should expect the records of these in both the EHR and the claim to correspond to one another.
For simple situations, this may be true. However, there are often significant discrepancies between claims and EHR data. You may have heard of the term upcoding, which is often used to describe the process by which clinics and hospitals (collectively referred to as providers) submit codes fraudulently to increase payments.
For example, a provider may have seen a patient for a short visit where only routine screening services were performed, which should have been coded using 99212 (the evaluation of an established patient with at least two of the following: a problem-focused history, a problem-focused examination, and straightforward medical decision-making). However, the clinic knows that it would get reimbursed more if it submits code 99215 (an evaluation requiring at least two of the following: a comprehensive history, a detailed examination, and medical decision-making of high complexity). While this is a very clear example of fraudulent upcoding, not all situations where codes appear to be inflated are fraudulent.
Another example may be when a provider adds a particular diagnosis code that appears to increase the complexity of a patient (as discussed in our diabetes example in the previous section). In this situation, a code was attached to the claim to justify associated lab tests. From the perspective of retrospective analysis, this may initially appear to be upcoding since the patient was not actually diabetic.
Either way, this highlights that claims data may be inaccurate in a clinical sense given the processes behind claims adjudication and reimbursement or may even be fraudulent upcoding. The key takeaway is that we, as data engineers and data scientists, must really understand the nuances underlying the claims data and not assume that it is accurate for our particular use case.
Self-Insured Employers
Most larger companies in the United States are self-insured or self-funded—this means they take on the financial risk instead of an insurance company. They typically contract with third-party administrators (e.g., United HealthCare, Anthem, etc.) to handle the claims processing. In these situations, there is yet another source of influence on how/why data is collected, which then directly impacts any downstream analyses.
Despite the challenges with data quality in both EHR and claims data, they are still the most popular sources of data from the real world. Since the data are collected during the delivery of care, the data collection process itself is not the main focus. This results in many of the data quality issues we see. One way to mitigate this is to set up a registry where data is collected, abstracted, and curated to answer specific research or scientific questions. So, let’s quickly talk about clinical registries and disease registries.
Clinical/Disease Registries
Clinical and disease registries are typically used to collect data prospectively given a specific set of criteria (often referred to as a study protocol). Many of the same data harmonization challenges exist in registries as they do with EHR data. However, one of the key differences is that the primary intention behind a disease registry is to collect data for later analysis (e.g., clinical research, population health, public health surveillance).
We discuss this notion of intention in a bit of detail in Chapters 3 and 4, but this highlights the biggest difference between EHR/claims data and most other data collected in healthcare—whether the data was collected for the purpose of later analysis or not. EHR and claims data are collected primarily to transact the business of healthcare. When a physician captures data in the EHR, they are trying to document so that they (or another clinician) can provide the patient with the appropriate care. When a medical coding specialist assigns ICD-10 codes, they are helping the clinic submit reimbursement claims. The intention is not to collect data for data science.
Registry data, on the other hand, is collected so that analysts and scientists can use it to derive insights about populations of patients. Instead of needing to do a deep dive into a particular hospital’s workflow to understand the context and nuance of the data, you would refer to the study protocol instead. This becomes particularly beneficial when working with data from multiple institutions or data collection points. In a registry, all data collection sites use the same study protocol and attempt to collect data as uniformly as possible. In contrast, the local influences on EHR and claims data will vary among clinics, insurance companies, and even employers, as discussed previously.
The process of setting up a registry and then executing the data collection, abstraction, and curation is itself a specialty within the healthcare industry. For example, the National Cancer Registrars Association provides a Certified Tumor Registrar credential for those working on cancer registries.
So, EHR, claims, and registry data make up the “big three” of RWD. There are pros and cons to each, and it would be foolish to think that there is a “best” source for RWD. It ultimately comes down to your intended use cases and finding the most cost-effective source of data to help you extract the insights you need. In the next section, we give a brief nod to clinical trials data. While the pharma industry puts clinical trials data into an entirely separate bucket, I am including it here because there is increasing interest across the entire industry to bring all of these sources of data together—EHR, claims, registries, and clinical trials.
Clinical Trials Data
Clinical trials data is likely the “cleanest” of all healthcare data since there are significantly more financial and regulatory incentives in place. The success of a clinical trial and approval by regulatory authorities hinges on having clean data and robust analyses. As a result, pharmaceutical companies and clinical/contract research organizations (CROs) dedicate significant resources to the data collection, cleaning, processing, and analysis.
Additionally, there are clearly defined standards (e.g., CDISC) around clinical trials data since regulatory agencies have clearly defined submission requirements. While such standards help decrease some of the challenges when harmonizing trials data, they do not solve all of the challenges.
Now that we’ve added clinical trials data to the list, we have covered the biggest sources of data about patients to date. This is already starting to change as we see wearables, mobile devices, and apps becoming part of the fabric of healthcare. Oftentimes, these solutions all get lumped under the term digital health, and it is certainly something we should keep our eyes on!
We will now switch gears a little and focus on the data collection process itself, regardless of whether it is for the electronic health record or a clinical registry. Understanding how and why the data was collected is critical to effectively extracting actionable insights from real-world data.
Data Collection and How That Affects Data Scientists
As data scientists, we love to get our hands on data and start playing with it. In most companies and with most data, we know exactly why and how the data was collected. It was collected to facilitate the operation of the business, and analytics are also in service of the business. In other words, why the data was collected and how it will be used are closely aligned. For example, in a streaming service, you might see data collected on which songs someone is listening to, how often they skip, how long they might spend on the song’s information page, or other songs they are listening to. In retail, you track which items a customer views and revisits, their purchase history, and what they ultimately purchase. The analytics are also closely aligned—recommending new content to consume or items to purchase.
In healthcare, there are many analytics projects that have a close alignment between why the data were collected and the goals of the analysis project. For example, clinical trials collect and analyze data for the purpose of measuring a treatment’s safety and efficacy; clinical research studies collect and analyze data to answer specific research questions.
However, when working with real-world data in healthcare, there is usually a big disconnect between why the data was collected and how it is analyzed. Data from EHRs is collected to facilitate the care of patients; clinical research data was collected by following a specific study protocol; claims data was collected to facilitate reimbursement; and the list goes on. Then, we come along and want to use the data to identify digital biomarkers, create risk prediction machine learning models, do population-level analytics, etc.
This forces us to reconcile our current use case and data needs with the data we actually have. This section walks you through different ways data is collected in healthcare. While these will help you begin to think about RWD differently, they are not the only considerations. We still need to think about other contexts such as the type of medical center, disease area or indication, and the platform used for data collection, among others.
Our first stop involves looking at two common distinctions in healthcare studies. We will start with the distinction between prospective and retrospective studies, from the perspective of the types of analysis typically performed. Alongside this, we will also touch on the closely related notions of primary and secondary data. Most RWD use cases I have come across involve retrospective analysis of secondary data.
Prospective studies
The term prospective is adapted from descriptions of clinical research studies—highlighting the relationship between when the study starts and when the final outcome is measured. In prospective studies, a study protocol is put in place, and data is collected. Data continues to be collected per the study protocol until the end of the study. Analysis of the data may start immediately (even while the study is ongoing) or may start after the data has been locked and no additional data is collected.
One of the key points to consider with prospective studies is that the criteria for data collection and how the data is collected are explicit and influenced by the purpose of the study. For example, take a study that seeks to identify clinical signs associated with impending death in patients with advanced cancer. This was set up as a prospective study, and the protocol dictated that 52 physical signs were documented every 12 hours, starting with the patient’s admission.
In this study, and as with most prospective studies, decisions about the underlying format/data types and the meaning (often referred to as semantics) of each data element are determined up front. Those involved with the collection, management, and analysis of the data can all refer to the study protocol for the intended semantics.
The concepts of prospective studies and primary data are often conflated given their frequent association. Data collected in the context of prospective studies is typically considered primary data because it is collected for the purpose of the study. However, though data may have been collected in a prospective study, it could be used as secondary data in a follow-up study. Continuing with the cancer study referenced earlier, if researchers took that data and wanted to look for correlations between various bedside clinical signs and various medications, this would be considered secondary use of the data. That is, the data is being used and analyzed for reasons other than why they were originally collected.
While data from prospective studies is usually considered primary data, it may also be considered secondary data for other analyses—this distinction between primary and secondary use depends entirely on the question(s) being asked of the data, relative to how and why the data was initially collected.
So, prospective studies are those looking forward. Let’s look at those studies that look backward.
Retrospective studies
Historically, prospective studies were the major mechanism for gathering healthcare data for analysis, often in the form of clinical research. However, as with most industries, data is being collected more and more frequently—in a variety of forms, whether through electronic health records, digital health tools, or clinical and disease registries. Consequently, people are looking to these data to find insights and are essentially conducting retrospective studies.
Retrospective studies are those where the outcome is already known (e.g., we already know the overall survival of all patients in the dataset) and data is collected from existing sources or memory. As a result, retrospective studies typically involve secondary use of data. This is where things can get confusing between prospective studies and retrospective studies, and primary data and secondary data.
One common example of a retrospective study involving secondary use of data is the extraction of data from electronic health records—a researcher may want to look at the relative overall survival of cancer patients on a particular medication (e.g., bevacizumab) relative to standard chemotherapy alone. Instead of constructing a prospective study, the researcher decides they will extract data on a subset of patients who match the inclusion/exclusion criteria. Though the data has already been collected in the EHR, the researcher is retrospectively analyzing previously collected data for their study.
The previous example also highlights secondary use of data. The data was originally collected in the EHR for the purposes of patient care or billing but is now being used to compare the efficacy of traditional chemotherapy regimens and those that include the addition of bevacizumab.
Generally speaking, from the perspective of the data, whether a study is prospective or retrospective is less important than whether it is a primary or secondary use (and collection) of data. As data engineers or data scientists, we must consider how and why the data were collected since this directly impacts the wrangling (cleaning, processing, normalization, and harmonization) of data.
It is usually easier to wrangle data that have been collected for the specific study being conducted. Data types and formats have been decided; there is an established common understanding of the data elements and how they are supposed to be collected. This does not ensure that the data is in fact clean, but it does decrease the data wrangling challenges.
In the case of secondary use of data, the data was collected for a variety of different (and sometimes conflicting) reasons, so it is not always clear how the data should be cleaned and processed. For example, one common misconception is that a list of ICD-10 codes within a patient’s record in the EHR is a good source of identifying patients with a particular diagnosis. While ICD-10 codes are commonly used to track diagnoses in a variety of datasets, it is important to understand the context of the use of ICD-10 codes in many (though not all) EHRs.
Take a patient who comes into their primary care provider’s office for a routine checkup and the physician orders a hemoglobin A1c (HbA1c) test to rule out diabetes. That is, the physician feels their patient may have diabetes and is attempting to validate this hypothesis. They will put in the order for the test and continue with the visit. However, somewhere behind the scenes, someone responsible for medical billing also tags the patient’s record with the ICD-10 code of E13, indicating “other specified diabetes mellitus.”
Why did they do this? Perhaps this allows the hospital administration to track why particular tests are being ordered, or this allows insurance companies to identify erroneous test orders. The insurance company may have a policy that says, “HbA1c tests are approved only for patients having or suspected of having diabetes.” To validate incoming claims, the insurance company has pushed the burden onto hospitals. Existing diabetic patients will already have a corresponding ICD-10 code and will pass the validation. However, a patient who has not been diagnosed with diabetes will fail this test, and the claim will be kicked back to the hospital. So, to pass the validation, the hospital codes the patient as having diabetes.
In this example, is the patient diabetic? Perhaps. Perhaps not. Until the result of the HbA1c test is examined, there is no way for a data scientist to know if this is a diabetic patient (and whether to include this patient in the cohort).
Conclusion
We spent most of this chapter talking about healthcare data at a high level. Though we mentioned studies such as clinical trials and clinical research, the majority of the focus of this book will be on what many refer to as real-world data. Of course, if you are a medical center simply trying to make sense of your electronic health record data, it’s simply data. In the biotech and pharma industry, we use RWD to differentiate from interventional and noninterventional study data—basically, data collected for specific study purposes following strict protocols versus data collected during the delivery of healthcare without any clear standards.
As we started the journey on RWD, we discussed some of the inherent complexities between why the data was collected in contrast to the types of analysis we want to perform. Much of this book is dedicated to some of the approaches we use to mitigate this complexity.
I have spent the better part of my professional life dedicated to working with healthcare RWD, and yet there is still so much to learn. This book won’t make you an expert overnight, but the goal is to give you the right vocabulary while highlighting many lessons learned so that you can hit the ground running. While we don’t often have a choice in what type of data we get, having a basic understanding of the type of data and how/why it was collected makes us better data engineers and scientists. We can begin to factor these into our overall data pipelines and analyses.
If there is one thing to keep in mind as you read this book (or any other book, blog post, library readme/documentation, research paper, etc.), it would be that healthcare is inherently complex, the data is complex, and there is no silver bullet to address all of these complexities. The delivery of healthcare captures and represents the diversity of the human condition—seeking a one-size-fits-all solution is simply impossible.
As a medical informaticist, I always have one foot firmly planted in healthcare data and another in technology and software. This chapter provided us with a basic introduction to healthcare data. In the next chapter, we focus on technology, particularly databases. While our discussion will be in the context of healthcare data, the focus of the chapter is to help level the field for those with a less technical background.
1 Ambulatory is being used to mean outpatient, distinguishing this type of EHR from an inpatient or hospital EHR.
2 Kamal S. Saini and Chris Twelves, “Determining Lines of Therapy in Patients with Solid Cancers: A Proposed New Systematic and Comprehensive Framework,” British Journal of Cancer 125 (2021): 155–163, https://doi.org/10.1038/s41416-021-01319-8.