AI-Ready Open Data
Artificial intelligence and machine learning (AI/ML) have the potential to create applications that tackle societal challenges from human health to climate change. These applications, however, require data to power AI model development and implementation. The government’s vast amount of open data can fill this gap: McKinsey estimates that open data can help unlock $3 trillion to $5 trillion in economic value annually across seven sectors. But for open data to fuel innovations in academia and the private sector, the data must be easy both to find and to use. While Data.gov makes it simpler to find the federal government’s open data, researchers still spend up to 80% of their time preparing data into a usable, AI-ready format. As Intel warns, “You’re not AI-ready until your data is.”
In this explainer, the Bipartisan Policy Center provides an overview of existing efforts across the federal government to improve the AI readiness of its open data. We answer the following questions:
- What is AI-ready data?
- Why is AI-ready data important to the federal government’s AI agenda?
- Where is AI-ready data being applied across federal agencies?
- How could AI-ready data become the federal standard?
What is AI-ready data?
The federal government is generating and collecting data from sensors, systems, and contractors at a rapid pace, but collecting raw data is only the first step. The next step is to clean and process this data into a usable format that can power an AI application—we call this being “AI-ready.”
If data is AI-ready, it is less time-consuming to analyze. On a basic level, preparing data means cleaning and parsing the information into a structured format (e.g., non-proprietary CSV/TSV formats) with unique column labels. On a more complex level, making data AI-ready means providing documentation, in the form of sample code and visualizations, as an on-ramp for researchers to start working with unfamiliar data. It is most efficient to do this work during front-end data collection, because the data provider has context that program offices and later researchers lack.
Whether an agency uses satellites to monitor hurricanes or an individual researcher puts a GoPro on a surfboard to observe fish behavior, the result is raw data that must be cleaned and processed.
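As a rough illustration of the basic preparation step described above, the sketch below parses delimited text into rows with unique, normalized column labels. The function name `clean_table`, the sample station data, and the labeling rules (lowercase, underscores, numeric suffix for repeats) are assumptions made for this example, not a federal or ESIP convention.

```python
import csv
import io

def clean_table(raw_text):
    """Parse delimited text into rows with unique, normalized column labels."""
    reader = csv.reader(io.StringIO(raw_text))
    header = next(reader)
    labels, seen = [], {}
    for name in header:
        # Normalize labels: trim whitespace, lowercase, use underscores.
        base = name.strip().lower().replace(" ", "_") or "unnamed"
        seen[base] = seen.get(base, 0) + 1
        # Add a numeric suffix to repeated labels so every column is unique.
        labels.append(base if seen[base] == 1 else f"{base}_{seen[base]}")
    rows = []
    for record in reader:
        if not any(cell.strip() for cell in record):
            continue  # skip blank lines
        rows.append({label: cell.strip() for label, cell in zip(labels, record)})
    return labels, rows

# Hypothetical raw input: inconsistent spacing, a repeated label, a blank line.
raw = "Station ID, Temp ,Temp\nA1, 21.5 ,70.7\n\nB2, 19.0 ,66.2\n"
labels, rows = clean_table(raw)
print(labels)
print(rows[0])
```

A data provider who knows that the two "Temp" columns hold Celsius and Fahrenheit readings could rename them meaningfully at this stage, which is one reason the cleanup is cheapest at collection time.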
Why act to implement AI-ready data now?
Strategic and ethical reasons underlie the move toward AI-ready data. Strategically, migrating data to the cloud provides a timely opportunity to refresh existing data management and governance practices. Data migration also presents an opportunity for agencies to develop best practices in data management. For example, the National Oceanic and Atmospheric Administration (NOAA) is developing its first Enterprise Data Management Handbook, documenting best practice standards for the agency to follow.
At the same time, ethical challenges arise when AI-ready systems are being implemented. The Pentagon’s controversial surveillance video program, Project Maven, highlighted the need for government to be more transparent about its efforts to develop AI. In 2020, the Defense Innovation Unit responded to this need by publishing Responsible AI Guidelines. Other agencies are using this opportunity to ensure their data sets are ethically sourced and do not perpetuate inequities that often occur during data collection and analysis.
How does AI-ready data fit into the federal government’s AI and data agenda?
- The President’s Management Agenda created a cross-agency goal to 1) “leverage data as a strategic asset” and 2) implement the first governmentwide data strategy, known as the Federal Data Strategy.
- Congress passed the Evidence Act, which included guidance to make data open and machine-readable by default. It also required agencies to create C-level data administrators, ensuring the infrastructure is in place to develop data applications.
- President Donald Trump issued an Executive Order on “Maintaining American Leadership in Artificial Intelligence.” The order requested agencies to “improve data and model inventory documentation” and “prioritize improvements to access and quality” based on the “AI research community’s user feedback” (emphasis added).
- The White House National Science and Technology Council released a National Artificial Intelligence Research and Development Strategic Plan, which included three relevant strategies for achieving AI-ready data:
  - Make long-term investments in AI research;
  - Develop shared public data sets and environments for AI training and testing;
  - Measure and evaluate AI technologies through standards and benchmarks.
- The White House Office of Science and Technology Policy (OSTP) Subcommittee on Open Science released a four-tier, pilot AI-readiness matrix that agencies could use to benchmark data quality.
- Congress passed the National AI Initiative Act, leading to the creation of a National AI Research Resource (NAIRR) Task Force. NAIRR’s goal was to become a shared research environment with access to compute resources and centralized data sets.
- The Federal Data Strategy 2021 Action Plan included three agency actions relevant to AI-ready data:
  - Mature Data Governance;
  - Data and Infrastructure Maturity;
  - Artificial Intelligence and Automation.
- The National Defense Authorization Act (NDAA) authorized the creation of a national AI resource, with discussions on how to use the resource as a home for AI-ready data sets that can be leveraged by academics, researchers, and others. For example, Title II, Section 232 authorized a Pilot Program to create Defense Department data repositories, available to public and private entities that facilitate the development of AI capabilities. These data repositories would contain training quality, unclassified data sets.
- OSTP published a Request for Information seeking feedback for updating the National Artificial Intelligence Research and Development Strategic Plan.
- OSTP published a “Blueprint for an AI Bill of Rights,” emphasizing a set of five principles and practices to help guide the design, use, and deployment of AI systems.
- The NAIRR Task Force finalized an implementation plan—“Strengthening and Democratizing the U.S. Artificial Intelligence Innovation Ecosystem”—for how the federal government will achieve a democratic AI cyberinfrastructure. The plan includes a section on “Data and Datasets” and calls for “analysis-ready” data sets to be defined using existing community-driven principles and standards.
Where is AI-ready data being applied?
There are at least three significant efforts within federal agencies to improve the AI-readiness of open data.
- Defense: In 2020, the Air Force partnered with the Massachusetts Institute of Technology (MIT) on an AI Accelerator to help the Air Force become AI-ready. Co-chaired by MIT, the public-private partnership tackled a range of unclassified technical and humanitarian challenges. When MIT’s Computer Science & Artificial Intelligence Laboratory called for potential projects to help the Air Force, MIT researchers submitted 180 responses. The collaborative process led to the largest response that MIT had ever recorded.
- Health: In September 2022, the National Institutes of Health (NIH) announced a four-year, $130 million investment in “Bridge2AI,” a program that promises to accelerate the widespread use of AI in biomedical and behavioral research. Bridge2AI was launched as a “grand challenge” for biomedical research and emerged from NIH’s prioritization of AI research and development under the National AI Initiative. Its goal is to eliminate bias in data sets, models, and interpretations of model predictions, while preparing ethically sourced data for AI/ML models.
- Earth Science: The NOAA Center for AI is developing a community standard for AI-ready open environmental data. Earth Science Information Partners (ESIP), a nonprofit that convened representatives from NASA, the U.S. Geological Survey, and other agencies, is supporting this undertaking. Together, they developed an AI-Ready Checklist for data managers seeking to prepare data for new use. ESIP’s Data Readiness Cluster hosts monthly video calls on positive use cases of AI-ready data and, in July 2022, hosted a “Data-A-Thon” for open climate data.
AI-ready data is not yet a governmentwide practice. The efforts listed above are exciting, but they represent only the early adopters, a “coalition of the willing.” In the absence of regulation, incentives, or federal standards, each agency decides for itself how and when to embrace the movement toward AI-ready data.
What are positive-use cases of AI-ready data?
AI-ready data can advance the public good; three examples include:
- Defense: The Biden administration called transnational organized crime a billion-dollar problem affecting millions of lives through drug overdose, violence, firearm deaths, and human trafficking. In response, the Defense Department, Defense Innovation Unit, and AI company Quantifind launched the Countering Malign Influence project. The project uses open-source data to identify, track, and counter transnational criminal groups attempting to mask their identities and activities. The data arrives too quickly and in too great a volume for human analysts alone to process.
- Health: Data collection for health research typically lacks participants with diverse backgrounds. NIH’s Bridge2AI program builds participant diversity into the design of all funded projects. For example, Bridge2AI recently funded a University of Washington-led coalition to create a flagship, ethically sourced data set to uncover how human health is restored after disease. The effort, which is using Type 2 diabetes as a case study, will recruit an equal number of Black, Hispanic/Latinx, Asian, and white participants while engaging with tribal communities to address barriers to participation.
- Earth Science: Rip currents cause hundreds of drownings and require tens of thousands of rescues annually. In response, NOAA launched the first national rip current forecast model to inform coastal communities and visitors about the risk of rip currents. The model uses AI to generate the probability of a rip current based on NOAA Coastal Observation Networks.
A future goal of the National AI Research Resource is to use AI-ready data as a bridge across the physical and social sciences. For example, an AI application could address both environmental and equity issues, exploring the economic impact of climate change and natural disasters on vulnerable communities. Although social science data carries greater privacy risks, a “federated” data repository—whereby different sets of data are connected in a secure and easily accessible way—will be critical to bringing environmental and social science knowledge together.
What are the requirements for AI-ready data?
ESIP’s Data Readiness Cluster asked a cross-sector group of more than 100 AI/ML researchers about the factors that make open data easier or more time-consuming to use. It identified four overarching factors:
- Quality: Whether data is consistently formatted from one time step or data file to the next. Once a researcher has written code to handle one file, the same code should work for the entire data set. Quality also includes the lack of implicit bias and a quantitative measure of representativeness to real-world conditions.
- Documentation: The extent to which there is support and context. This support can include metadata (information about the data), a data dictionary (definitions of each parameter), and a description of the original sources that went into creating the data set. Documentation also includes example software, code, or visualizations to consult as an on-ramp to using unfamiliar data.
- Access: The extent to which the data set is available in a variety of formats and delivery options. For example, a data set might be available in both a cloud-optimized and text format, or as both an application programming interface (API) and bulk download to a local machine. Variety is essential to ease the transition to accessing data in the cloud. Access also includes usage rights and security protections.
- Preparation: The extent to which the data set went through certain preprocessing steps, such as filling null values and gaps; if the data came from a single source or multiple sources; or how the data were prepared for a domain-specific need. Decisions taken during preparation can become metadata labels to aid researchers in data discovery.
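Three of the four factors above lend themselves to simple automated checks. The sketch below is illustrative only: `check_readiness`, the file names, and the column names are hypothetical, and the checks are a simplified reading of the Quality, Documentation, and Preparation factors rather than ESIP's checklist. Access (formats, delivery options, usage rights) is left out because it describes how data is published rather than what a file contains.

```python
def check_readiness(files, data_dictionary):
    """Return simple per-factor findings for a data set split across files.

    files: mapping of file name -> list of row dicts (already parsed).
    data_dictionary: mapping of column name -> plain-language definition.
    """
    findings = {}

    # Quality: every file should expose the same columns, so code written
    # for one file works on the rest.
    schemas = {
        name: tuple(sorted(rows[0].keys()))
        for name, rows in files.items()
        if rows
    }
    findings["consistent_schema"] = len(set(schemas.values())) <= 1

    # Documentation: every column should have a data dictionary entry.
    columns = {col for schema in schemas.values() for col in schema}
    findings["undocumented_columns"] = sorted(columns - set(data_dictionary))

    # Preparation: count empty values per column so that gaps can be
    # recorded as metadata instead of being discovered later.
    missing = {}
    for rows in files.values():
        for row in rows:
            for col, value in row.items():
                if value in ("", None):
                    missing[col] = missing.get(col, 0) + 1
    findings["missing_values"] = missing
    return findings

# Hypothetical data set: the second file switches units and column name.
files = {
    "2021.csv": [
        {"station_id": "A1", "temp_c": "21.5"},
        {"station_id": "B2", "temp_c": ""},
    ],
    "2022.csv": [{"station_id": "A1", "temp_f": "70.7"}],
}
data_dictionary = {
    "station_id": "Unique station identifier",
    "temp_c": "Air temperature in degrees Celsius",
}
report = check_readiness(files, data_dictionary)
print(report)
```

Here the report flags an inconsistent schema, one undocumented column, and one missing value, which are the kinds of surprises that otherwise consume a researcher's preparation time.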
What would a readiness matrix to benchmark data access look like? The NOAA Center for AI proposed the following example:
In addition to a general standard, domain-specific requirements are also possible. For example, the sensors used for satellite imagery have a relatively short lifespan. A domain-specific requirement for satellite imagery would therefore need a more nuanced definition of quality, one that covers sensor calibration and the consistency of readings across multiple sensors over time. In this context, it is important that observed changes come from changes to natural and physical Earth systems, not from changes in the sensors themselves.
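To make the sensor-consistency concern concrete, the sketch below estimates the average offset between two sensors over the period when both were recording, then removes it from the newer record. It assumes a constant offset and uses made-up readings; operational calibration is far more involved, and `overlap_offset` is a hypothetical helper, not an agency method.

```python
from statistics import mean

def overlap_offset(old_sensor, new_sensor):
    """Estimate the mean offset between two sensors over shared timestamps.

    Each argument maps a timestamp to a reading.
    Returns None when the two records do not overlap.
    """
    shared = sorted(set(old_sensor) & set(new_sensor))
    if not shared:
        return None
    return mean(new_sensor[t] - old_sensor[t] for t in shared)

# Hypothetical monthly readings; the two sensors overlap for two months.
old_sensor = {"2020-01": 10.0, "2020-02": 12.0, "2020-03": 14.0}
new_sensor = {"2020-02": 12.5, "2020-03": 14.5, "2020-04": 16.5}

offset = overlap_offset(old_sensor, new_sensor)
# Subtract the offset so the new record lines up with the old one.
adjusted = {t: value - offset for t, value in new_sensor.items()}
print(offset, adjusted)
```

Without a correction of this kind, the half-unit jump between the two instruments could be mistaken for a change in the Earth system being observed.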
What steps can agencies take to reduce bias of AI models?
Bias can give data sets a “systematic tilt” that makes the predictions of AI models less trustworthy. Ensuring diverse researchers, methodologies, and cohorts of participants can reduce bias.
NIH’s Bridge2AI program collects data with large sample sizes and ensures the data are gathered from participants representing the true diversity of the U.S. population. The program also focuses on the ethics of data generation, use, and reuse—consent forms must concretely explain to participants how their data will be used, and later users of the data must respect participants’ original consent. Other agencies can similarly build these processes into their preparation, collection, and analysis of data sets.
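One simple way to quantify whether a cohort reflects a target population is to compare each group's share of the sample with a reference share. The sketch below does only that. The group labels, counts, and reference shares are invented for illustration, and `representation_gaps` is a hypothetical helper rather than a Bridge2AI method.

```python
from collections import Counter

def representation_gaps(sample_groups, reference_shares):
    """Compare each group's share of the sample with a reference share.

    Returns group -> (sample share - reference share). Negative values
    mean the group is underrepresented in the sample.
    """
    counts = Counter(sample_groups)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample is empty")
    return {
        group: counts.get(group, 0) / total - share
        for group, share in reference_shares.items()
    }

# Hypothetical cohort of ten participants across three groups.
sample_groups = ["A"] * 6 + ["B"] * 3 + ["C"] * 1
# Hypothetical shares of each group in the target population.
reference_shares = {"A": 0.5, "B": 0.25, "C": 0.25}

gaps = representation_gaps(sample_groups, reference_shares)
for group, gap in gaps.items():
    print(f"{group}: {gap:+.2f}")
```

A check like this only surfaces gaps in who was sampled. It says nothing about consent or about how the data may be reused, which the program handles through its ethics requirements.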
How should policymakers move the AI-ready data agenda forward?
There are at least three ways the federal government should improve the AI-readiness of its open data to power responsible public and private applications.
- The National Institute of Standards and Technology (NIST) should establish a general U.S. government standard for AI-ready data. Following the success of the NIST Cybersecurity Framework, which many in the industry embrace as the gold standard, an AI-ready data standard could look like a “nutrition label,” building on existing projects such as the Data Nutrition Project, Datasheets for Datasets, and AI-Ready Checklist. Agencies could publish domain-specific “add-on” requirements to the NIST baseline (e.g., NOAA could publish geospatial requirements through the Open Geospatial Consortium). OSTP could support this work by encouraging agencies to openly release training data sets, establish a data search portal on AI.gov with compliant data sets, and collaborate with international organizations to promote consistent AI-ready data standards. Further, just as agencies undertook a “privacy review step” to ensure that data sets released to Data.gov did not unintentionally reveal personally identifiable information, a “nutrition label” could ensure data sets are transparent about inherent biases. Because standards-setting is so time intensive, the challenge is to make standards flexible enough to withstand evolving technology. Moreover, as data that agencies collect are often global and agencies use data from international partners, international collaboration is critical to ensure that any standard is interoperable across countries.
- The federal government should launch “Data Challenges” to spur collaboration across academia, industry, and government using open data sets. Following the success of the DARPA Prize Challenges that encouraged self-driving cars, the contest could drive data science breakthroughs. Industry and academia would co-define problems with government, while researchers could publish papers and receive awards for responsible AI applications. Model data challenges include the SEVIR Dataset Challenge (weather nowcasting), MIT Datacenter Challenge (multisensor systems), RF Challenge at MIT (radio frequency signals), Maneuver Identification Challenge (pilot flying training), and DIU xView3 (dark vessel detection).
- The federal government should embed the principles of AI-ready data into its contracting process whenever it expects contractors or grantees to produce data that will be posted on Data.gov. It should also update the Federal Data Strategy to include requirements for AI-ready data sets. For example, organizations entering into a contractual agreement with the federal government could be required to provide access to AI-ready training data with government-purpose rights (i.e., the contractor can still use the IP commercially while the government maintains usage rights). The Evidence Act requires making public data assets available as open government data assets under an open license. Agencies could supply sample contract clauses, or Congress could insert a forcing mechanism as peripheral legislation. NOAA inserted its beta AI-Ready Checklist as a recommendation for grantees to follow when producing agency deliverables. The checklist is not yet agency-wide but has early adopters: NOAA’s Climate Program Office included the checklist as a recommendation for potential grantees, and the NOAA Center for AI made the checklist a requirement for three pilot projects. These steps help ensure that the federal government receives machine-readable data that is accessible and readily usable from its contractors and grantees.
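The “nutrition label” idea in the first recommendation can be pictured as a small machine-readable record published alongside each data set. The sketch below is one hypothetical shape for such a record. The class name, field names, and sample values are assumptions for illustration and are not drawn from the Data Nutrition Project, Datasheets for Datasets, or NOAA's AI-Ready Checklist.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class DatasetLabel:
    """Hypothetical 'nutrition label' summarizing a data set for reuse."""
    title: str
    source: str
    license: str
    formats: list
    known_gaps: list = field(default_factory=list)
    known_biases: list = field(default_factory=list)

# Placeholder values; a real label would be filled in by the data provider.
label = DatasetLabel(
    title="Example coastal observations",
    source="Hypothetical agency program",
    license="Open license (placeholder)",
    formats=["CSV", "API"],
    known_biases=["Stations cluster near populated coastlines"],
)

# Serialize the label so it can be published next to the data files.
label_json = json.dumps(asdict(label), indent=2)
print(label_json)
```

A contract clause or grant requirement could then ask for a completed label as a deliverable, with domain-specific fields added on top of a shared baseline.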
Many capabilities would be unimaginable without AI tools: from helping cities map spots at higher risk of extreme heat waves, to identifying “digital twins” of the Earth for climate mitigation, to identifying humpback whales based on underwater acoustic recordings. A few enterprising agencies, notably in defense, health, and Earth science, make up a “coalition of the willing” working to make AI-ready data the standard across the U.S. government. Yet despite the progress, this work is not an embedded practice. As federal agencies migrate data to the cloud and face calls to be more responsible about their data and algorithms, agencies and lawmakers would be well served to make AI-ready data the federal standard.
We appreciate the assistance of the following groups for their thoughts and feedback on this report: the Department of the Air Force Artificial Intelligence Accelerator; Earth Science Information Partners; the NOAA Center for Artificial Intelligence and Chief Data Officer staff; and the National Institutes of Health Common Fund’s Bridge2AI program.