AI-Ready Data Is Crucial to Advancing Precision Health

Health data producers must adopt AI-ready data practices

Precision health holds the potential to bring forth game-changing breakthroughs in medical research and practice, but it's being held back by a dearth of AI-ready data.

As a discipline, precision health aims to improve individual and public health and wellness by precisely tailoring interventions, treatments, and care plans to the unique genetic, environmental, and lifestyle circumstances of each individual patient. Precision health is advanced through research, clinical, and public health practices that synthesize knowledge and meaning from a wide variety of high-volume data sources, including electronic health records, insurance claims, social determinants of health, wearables, environmental exposures, and omics data such as genomics, metabolomics, and epigenomics.

Due to the vast quantities of data involved, emerging technologies like artificial intelligence (AI) and machine learning (ML) have quickly become central to applying and advancing precision health, and to achieving meaningful progress in responding to key public health threats and treating some of the world's most burdensome chronic health conditions.

Unfortunately, a shortage of AI-ready data is hindering precision health's development by slowing AI/ML adoption in the field and limiting the technology's ability to integrate data points and uncover beneficial health insights. A key first step for incorporating AI-ready data practices into any discipline is developing a clear, widely accepted definition of this relatively new term. At Booz Allen, we consider AI-ready health data to be that which conforms to the FAIR principles (i.e., findable, accessible, interoperable, and reusable) in addition to being equitable, protected, machine readable, and well-defined.

The factors limiting the availability of AI-ready data are many. They include lax data provenance practices, insufficient data governance, siloed data, non-standardized data (e.g., data that is not mapped to common ontologies or terminologies), and inconsistency in the collection and management of sample and patient identifiers. To harness the transformative potential of precision health, the data needs to be AI-ready.

Seeing that it will take large quantities of broadly accessible AI-ready data to realize the full transformative potential of precision health, U.S. Government agencies like the Department of Veterans Affairs (VA), the Department of Defense (DOD), and the National Institutes of Health (NIH) are prioritizing its generation and availability. For example, NIH is engaged in two initiatives that seek to advance biomedical research in part through the development of flagship AI-ready data.

It's time for all biomedical data producers and managers to recognize this imperative and follow suit.

Implementing AI-Ready Data Practices to Promote Equitable, Protected, Machine Readable, and Well-Defined Precision Health Data

We offer the following four approaches for ensuring that precision health data is AI-ready. Several of these approaches were informed by practices from data-centric AI, a movement focused on improving the data practices (e.g., data engineering) used in AI development.

1. Build Well-Defined Datasets to Reduce Domain Knowledge Barriers

Due to the complexity of precision health data types, it's often necessary to possess domain-specific knowledge when analyzing them. Adopting effective data documentation protocols greatly reduces domain knowledge barriers. For instance, creating datasheets that detail the motivation, collection process, maintenance, intended use (e.g., how an individual's data is used and shared), and distribution plan of a dataset can help ensure its appropriate use. Data generators and AI/ML modelers should also leverage automated anomaly detection tools and statistical techniques (e.g., Random Cut Forest) to validate data quality by identifying anomalous data points. Lastly, manual data inspection, through methods like basic distributional statistics, should be used to identify potential data quality issues and to supplement automated tools with an additional dimension of contextual understanding.
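As a rough illustration of the documentation and inspection steps described above, the following Python sketch pairs a minimal datasheet record with basic distributional statistics and a simple outlier check. It is a sketch under stated assumptions: the `Datasheet` field names, the sample heart-rate readings, and the helper functions are hypothetical, and the outlier check uses Tukey's interquartile-range rule as a lightweight stand-in, not Random Cut Forest.

```python
from dataclasses import dataclass, asdict
import statistics


@dataclass
class Datasheet:
    """Minimal dataset documentation record (field names are illustrative)."""
    motivation: str
    collection_process: str
    maintenance: str
    intended_use: str
    distribution_plan: str

    def missing_sections(self):
        """Return the names of sections that were left blank."""
        return [name for name, text in asdict(self).items() if not text.strip()]


def summarize(values):
    """Basic distributional statistics to support manual inspection."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "n": len(values),
        "mean": statistics.fmean(values),
        "stdev": statistics.stdev(values),
        "q1": q1,
        "median": median,
        "q3": q3,
    }


def flag_outliers(values, k=1.5):
    """Flag points outside [q1 - k*IQR, q3 + k*IQR] (Tukey's rule).

    This is a simple statistical stand-in, not Random Cut Forest.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical resting heart-rate readings (beats per minute).
# The value 640 represents a plausible data-entry error.
heart_rates = [62, 71, 68, 75, 66, 70, 640, 73, 69, 64]

# Hypothetical datasheet with the maintenance section left blank.
sheet = Datasheet(
    motivation="Study resting heart rate in an adult cohort.",
    collection_process="Wearable devices, one reading per participant.",
    maintenance="",
    intended_use="Research only; no re-identification.",
    distribution_plan="Controlled access through a data repository.",
)

print("Blank datasheet sections:", sheet.missing_sections())
print("Summary:", summarize(heart_rates))
print("Flagged values:", flag_outliers(heart_rates))
```

Flagged values are candidates for review, not automatic deletions; a domain expert still decides whether a reading is an error or a genuine extreme.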

2. Create and Apply Data Protection and Privacy Principles

Patients are more likely to provide reliable data when they trust that those collecting and using their sensitive health information will protect and handle it appropriately. Appropriate handling includes adhering to the Health Insurance Portability and Accountability Act of 1996 (HIPAA) Privacy and Security Rules governing the protection of sensitive individual and health information. Organizations should also consider creating new data privacy principles or adopting existing ones, such as those outlined by the General Data Protection Regulation (GDPR) (e.g., purpose limitation, data minimization, accuracy, security). There is currently no U.S. federal law equivalent to GDPR, but some states have adopted similar legislation at the state level. For example, the California Consumer Privacy Act and the subsequent California Privacy Rights Act enable consumers to know and control how their personal information is used by the businesses that collect it.
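To make purpose limitation and data minimization more concrete, the sketch below keeps only the fields approved for a stated purpose and replaces the patient identifier with a keyed hash. Everything named here is an assumption for illustration: the `ALLOWED_FIELDS` mapping, the `readmission_model` purpose, the record layout, and the key. It is not a HIPAA de-identification procedure and does not by itself satisfy any regulation.

```python
import hashlib
import hmac

# Hypothetical purpose-to-fields mapping; a real one would come from
# an organization's data governance review.
ALLOWED_FIELDS = {
    "readmission_model": {"age_band", "diagnosis_code", "length_of_stay"},
}


def pseudonymize(patient_id: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256).

    The same ID and key always give the same pseudonym, so records
    can still be linked without exposing the original identifier.
    """
    return hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


def minimize(record: dict, purpose: str, secret_key: bytes) -> dict:
    """Keep only fields approved for the stated purpose, plus a pseudonym."""
    allowed = ALLOWED_FIELDS[purpose]
    reduced = {key: value for key, value in record.items() if key in allowed}
    reduced["pseudo_id"] = pseudonymize(record["patient_id"], secret_key)
    return reduced


# Hypothetical record and key (a real key would live in a secrets manager).
record = {
    "patient_id": "MRN-000123",
    "name": "Jane Example",
    "age_band": "50-59",
    "diagnosis_code": "I50.9",
    "length_of_stay": 4,
    "home_address": "1 Example Street",
}
key = b"example-key-not-for-production"

shared = minimize(record, "readmission_model", key)
print(sorted(shared.keys()))
```

A keyed hash is used here instead of a plain hash so that someone without the key cannot rebuild the mapping by hashing guessed identifiers.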

Recent advances in privacy-enhancing technologies and synthetic data generation promise to better enable AI/ML model development using protected sensitive data. Emerging privacy-enhancing technology solutions, such as federated learning and differential privacy, protect personal data by minimizing unnecessary data sharing, encrypting or anonymizing data, and ensuring confidentiality in aggregate data. Federated learning is a method of AI/ML model training in which multiple models are iteratively trained on independent datasets and combined, avoiding the explicit exchange of training data. Research has shown that synthetic data generation, enabled by generative adversarial networks (GANs), can produce realistic synthetic image and tabular (numerical, text) data, enabling AI/ML model development while allowing sensitive health data to remain protected.
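The federated learning loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a production approach: the three sites and their data are hypothetical, the model is a one-variable linear regression, and the combination step is a sample-size-weighted average of model parameters (commonly called federated averaging). Only the parameters `(w, b)` leave each site; the raw data points do not.

```python
def local_update(weights, data, lr=0.05, epochs=5):
    """Run a few epochs of gradient descent on one site's private data.

    The model is y = w * x + b, trained on mean squared error.
    """
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


def federated_average(site_weights, site_sizes):
    """Combine site models, weighting each by its number of samples."""
    total = sum(site_sizes)
    w = sum(wt[0] * n for wt, n in zip(site_weights, site_sizes)) / total
    b = sum(wt[1] * n for wt, n in zip(site_weights, site_sizes)) / total
    return (w, b)


def train(sites, rounds=100):
    """Alternate local training and central averaging for several rounds."""
    global_weights = (0.0, 0.0)
    for _ in range(rounds):
        # Each site starts from the current global model and trains locally.
        updates = [local_update(global_weights, data) for data in sites]
        # Only the updated parameters are sent back and averaged.
        global_weights = federated_average(updates, [len(d) for d in sites])
    return global_weights


# Hypothetical sites whose private data all follow y = 2x + 1.
site_a = [(x, 2 * x + 1) for x in (0.0, 1.0, 2.0)]
site_b = [(x, 2 * x + 1) for x in (3.0, 4.0)]
site_c = [(x, 2 * x + 1) for x in (-1.0, 0.5, 1.5, 2.5)]

w, b = train([site_a, site_b, site_c])
print(f"learned w={w:.4f}, b={b:.4f}")
```

Sharing parameters instead of records reduces data movement, but model updates can still leak information, which is one reason federated learning is often paired with techniques such as differential privacy.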

3. Test the Machine Readability of Your Data

Properly preparing data, including ensuring that it can be processed by a computer, is a critical and often time-consuming prerequisite to advanced analytics. To expedite AI/ML modeling, AI-ready data should be distributed in file formats and structures that ease ingestion into coding environments. Data repositories should provide random representative subsets of full datasets to enable quick data readability and suitability checks. AI/ML developers can use these samples to easily ingest data into coding notebook environments, such as Jupyter and RStudio, and use libraries such as pandas and dplyr to gain a basic understanding of a dataset, including descriptive statistics, features and data types, and the presence of null or missing values. AI/ML developers should also consider using AutoML tools, which automate many of the steps of the machine-learning process (e.g., feature and model selection), to accelerate suitability and exploratory analysis and inform future modeling efforts.
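A quick readability check of the kind described above might look like the following sketch. It uses only the Python standard library so that it runs anywhere; the CSV content, column names, and missing-value markers are hypothetical. In a notebook, pandas offers the same checks more compactly through `DataFrame.describe()`, `DataFrame.dtypes`, and `DataFrame.isna().sum()`.

```python
import csv
import io
import random
import statistics

# Hypothetical sample file contents; a repository would supply a real subset.
CSV_TEXT = """patient_id,age,systolic_bp,smoker
p01,54,128,no
p02,61,,yes
p03,47,119,no
p04,NA,135,yes
p05,39,122,
"""

MISSING_MARKERS = ("", "NA", None)  # assumed encodings of missing values


def sample_rows(rows, k, seed=0):
    """Draw a simple random sample of k rows (all rows if fewer than k)."""
    rows = list(rows)
    rng = random.Random(seed)
    return rows if len(rows) <= k else rng.sample(rows, k)


def profile(rows):
    """Report missing counts, an inferred type, and basic stats per column."""
    report = {}
    for column in rows[0].keys():
        raw = [row[column] for row in rows]
        present = [v for v in raw if v not in MISSING_MARKERS]
        entry = {"missing": len(raw) - len(present)}
        try:
            numbers = [float(v) for v in present]
            entry["type"] = "numeric"
            if numbers:
                entry["mean"] = statistics.fmean(numbers)
                entry["min"] = min(numbers)
                entry["max"] = max(numbers)
        except ValueError:
            # At least one value is not a number, so treat the column as text.
            entry["type"] = "text"
            entry["distinct"] = len(set(present))
        report[column] = entry
    return report


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
report = profile(rows)
for column, entry in report.items():
    print(column, entry)
```

If a file cannot be parsed this way at all, or if most columns come back with unexpected types or high missing counts, that is an early signal the dataset needs more preparation before modeling.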

Through the application of these four AI-ready data practices, organizations can accelerate AI/ML research, discovery, and utilization to drive better precision health outcomes.
