As part of an ongoing series of blog posts on AI case studies, in which we introduce real-world machine learning applications and their complexities, we discuss the foundational tools needed for machine learning models where information is limited or restricted. In a data-hungry world, augmented or synthetic data is essential for producing or validating predictive models, training AI, and correcting errors or bias in datasets. However, synthetic data lacks the granularity of real data. Its use has both advantages and disadvantages, and data users and policymakers must be aware of the associated tradeoffs.
As mentioned in our AI and Ethics report, research is needed to address the ethical components of AI, including privacy, bias, and fairness, all areas where synthetic data could be of value. More research is also needed to understand how effective synthetic data is across different applications and what that implies for the ethical use of AI. In this post, we outline common use cases of synthetic data and consider potential risks and advantages.
There are a few ways to generate ‘anonymized’ data from ‘real’ data. Using a running example of a dataset including people’s age, sex, car model, and license plate number:
- Dropout of records or features. Features may be removed from all records. For example, license plate numbers, which could allow a third party to identify individuals, could be removed from the data. Similarly, records with extreme outlier values (for example, a 110-year-old man driving a school bus) may be removed, since such records can reasonably be associated with only a small number of people.
- Aggregation. Data may be aggregated such that individuals are no longer present in the data. For example, vehicle ownership data could be summarized by model of car, reporting only average age and proportion of male to female drivers.
- Generalization. Certain continuous values may be reduced in precision and ‘binned’. For example, instead of reporting exact age, the example data may report age ranges (16-26, 27-36, 37-46, etc.).
- Synthesis. Using the original ‘real’ data to train a machine learning generator, data can be generated that have many of the same statistical characteristics as the ‘real’ data (similar means, correlations between features, etc.) while not being directly associated with any one individual.
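The first three techniques above are simple enough to sketch in a few lines of Python. The example below is illustrative only: the records are invented, the function names (`drop_out`, `aggregate_by_model`, `generalize_age`) and the outlier cutoff of 100 years are our own assumptions, and the age ranges beyond 37-46 are an assumed continuation of the pattern in the text.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records for the running example (all values are invented).
records = [
    {"age": 34, "sex": "F", "car_model": "Civic", "plate": "ABC123"},
    {"age": 41, "sex": "M", "car_model": "Civic", "plate": "DEF456"},
    {"age": 19, "sex": "F", "car_model": "F-150", "plate": "GHI789"},
    {"age": 58, "sex": "M", "car_model": "F-150", "plate": "JKL012"},
    {"age": 110, "sex": "M", "car_model": "School Bus", "plate": "MNO345"},
]

# Age ranges from the text; the last two ranges are an assumed continuation.
AGE_BINS = [(16, 26), (27, 36), (37, 46), (47, 56), (57, 66)]


def drop_out(rows, drop_features=("plate",), max_age=100):
    """Dropout: remove identifying features and extreme-outlier records."""
    return [
        {k: v for k, v in row.items() if k not in drop_features}
        for row in rows
        if row["age"] <= max_age  # assumed outlier rule, for illustration
    ]


def aggregate_by_model(rows):
    """Aggregation: report only average age and share of male drivers."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["car_model"]].append(row)
    return {
        model: {
            "avg_age": mean(r["age"] for r in group),
            "share_male": sum(r["sex"] == "M" for r in group) / len(group),
        }
        for model, group in groups.items()
    }


def generalize_age(age):
    """Generalization: replace an exact age with a coarse range."""
    for low, high in AGE_BINS:
        if low <= age <= high:
            return f"{low}-{high}"
    return "67+" if age > 66 else "under 16"


cleaned = drop_out(records)
summary = aggregate_by_model(cleaned)
binned = [{**row, "age": generalize_age(row["age"])} for row in cleaned]
```

Each step trades detail for privacy in a different way: `cleaned` keeps individual rows but loses the license plate and the outlier, `summary` has no individual rows at all, and `binned` keeps rows but blurs the exact age.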
Synthetic data falls into two main categories. Fully synthetic data is entirely computer-generated from a model trained on ‘real’ data, and all real data are withheld. Partially synthetic data uses computer-generated records to balance a dataset with respect to some sensitive characteristic (age or sex, for example), so that underrepresented groups are represented by a combination of real and synthetic records. Synthetic data is often used in place of real data to:
- Provide training data for machine learning algorithms;
- Overcome data usage restrictions;
- Minimize the costs of collecting real data;
- Obtain access to centralized datasets; or
- Produce situations that have not yet happened (e.g., testing a new product before it is released, or scientific research).
Donald B. Rubin, a Harvard statistics professor, came up with the idea of using replacement values to fill in for missing data or undercounted populations. Many argue the Census Bureau’s use of synthetic data is more important today, as survey response rates decline and models depend on data that reflect an increasingly diverse country. The American Community Survey (ACS) uses synthetic data to improve estimates that rest on small sample sizes, for example for undercounted communities such as Alaska Natives, young children, or people experiencing poverty. Synthetic data can also be a cheaper alternative to the rising cost of surveying thousands of individuals.
As part of its ongoing effort to strengthen protection of citizens’ privacy and comply with strict confidentiality requirements, with penalties for non-compliance that range from fines to jail time, the Bureau has increased its use of anonymized, or synthetic, data. Along with mimicking real-world scenarios, synthetic data should maintain the statistical properties of the real data, especially when used in research that makes causal inferences about human behavior, the economy, or climate change. Recently, reliance on synthetic data in such inferential settings has been criticized for sacrificing accuracy to improve privacy. The stakes are considerable: in fiscal year 2017, 316 programs used 2010 Census Bureau data to allocate more than $1.504 trillion in federal funds, helping communities, businesses, and the public access needed resources.
Some researchers worry that the Census Bureau’s use of synthetic data distorts important information used for economic and demographic research, and that same data can also be used to determine the distribution of federal funding. However, synthetic data also gives researchers greater visibility into data at small geographical levels, even census blocks, because of the anonymity and privacy it creates. Census data is estimated to aid in the drafting of around 12,000 research papers per year, and poor synthetic data could lead to unreliable conclusions.
For applications that can tolerate some inaccuracy (including many machine learning classifiers), the trade-offs made for privacy are unlikely to cause issues. For example, in a 2017 study, MIT scientists compared the effectiveness of machine learning models built on synthetic data with that of models built on real data. They enlisted outside data scientists to develop predictive models, testing the reliability of the synthesized data against the real data. They found that, 70% of the time, the synthetic data solved the predictive modeling problems with no significant difference from real data. While far from perfect, synthetic data that exhibit nearly identical aggregate statistical qualities to their original datasets compare well with other privacy-enhancing technologies and anonymization methods, producing accurate results for many common predictive tasks.
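To make the idea of comparing models built on synthetic and real data concrete, here is a toy sketch of a ‘train on synthetic, test on real’ check. It is not the method used in the MIT study: the data are invented, the synthesizer is a deliberately naive one (a single Gaussian fitted per label), and the classifier is a simple nearest-centroid rule. All function names are our own.

```python
import random
from statistics import mean, stdev

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def make_real(n):
    """Invented 'real' data: (age, label), where label 1 skews older."""
    rows = []
    for _ in range(n):
        label = int(rng.random() < 0.5)
        age = rng.gauss(55 if label else 30, 5)
        rows.append((age, label))
    return rows


def synthesize(rows, n):
    """Naive synthesizer: fit one Gaussian per label, then sample from it."""
    by_label = {0: [], 1: []}
    for age, label in rows:
        by_label[label].append(age)
    params = {lab: (mean(v), stdev(v)) for lab, v in by_label.items()}
    share_1 = len(by_label[1]) / len(rows)
    out = []
    for _ in range(n):
        label = int(rng.random() < share_1)
        mu, sigma = params[label]
        out.append((rng.gauss(mu, sigma), label))
    return out


def fit_centroids(rows):
    """'Train' a nearest-centroid classifier: one mean age per label."""
    return {lab: mean(a for a, l in rows if l == lab) for lab in (0, 1)}


def accuracy(centroids, rows):
    """Share of rows whose nearest centroid matches the true label."""
    hits = sum(
        min(centroids, key=lambda lab: abs(age - centroids[lab])) == label
        for age, label in rows
    )
    return hits / len(rows)


real_train, real_test = make_real(500), make_real(500)
synthetic_train = synthesize(real_train, 500)

# Both models are scored on the same held-out real data.
acc_real = accuracy(fit_centroids(real_train), real_test)
acc_synth = accuracy(fit_centroids(synthetic_train), real_test)
print(f"trained on real: {acc_real:.3f}, trained on synthetic: {acc_synth:.3f}")
```

In this easy, well-separated example the two accuracies should be close, which is the kind of result the paragraph above describes. The sketch says nothing about harder cases such as rare sub-populations, where a synthesizer that only preserves aggregate statistics can do much worse.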
However, for social science research focused on low-level geographies and/or sub-populations within ACS data, the small inaccuracies introduced by data synthesis may reduce researchers’ abilities to establish causal links that lead to important insights. Further, data external to the ACS synthetic anonymization process are inherently not validated for ‘realism’ along with the ACS data itself. For these reasons, the Bureau is exploring a new synthetic data product to produce more accurate data.
To date, one of the most effective methods for generating synthetic data is the ‘generative adversarial network’ (GAN), a deep learning tool that pits two distinct neural networks against one another: one with the goal of generating realistic data (the generator), and one with the goal of determining whether a given dataset is real or synthetic (the discriminator).
The generator is trained using the features of the real dataset, with random noise as a ‘seed’ value, to create realistic synthetic observations. The discriminator alternates between observing the synthetic data and the real data, with the objective of correctly determining whether the data is real or synthetic. Fundamentally, one model succeeds only when the other fails. When the generator fools the discriminator, the generator ‘reinforces’ the internal weights and biases (the values that ultimately determine the output of the model) that led to this realistic dataset, while the discriminator weakens the internal weights and biases that led it to the wrong conclusion. When the discriminator succeeds, the roles are reversed: the generator weakens its network while the discriminator strengthens its own. This competition leads the generator to eventually create observations that are consistently realistic to the discriminator, at which point the performance gain from further training will plateau and, in theory, the synthetic data will be sufficient for its eventual use in social science or industry.
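The alternating updates described above can be sketched with a deliberately tiny, one-dimensional example. This is a toy under stated assumptions, not a production GAN: the ‘real’ data are invented draws from a normal distribution with mean 4, the generator is a single linear function of the noise, the discriminator is a single logistic unit, and the gradients are written out by hand instead of using a deep learning library. Real GANs use multi-layer networks, and even this toy is not guaranteed to settle down, since adversarial training can oscillate.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def train_toy_gan(steps=2000, batch=32, lr=0.01, seed=0):
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator: x_fake = a * z + b, with z random noise
    w, c = 0.0, 0.0  # discriminator: D(x) = sigmoid(w * x + c)

    for _ in range(steps):
        real = [rng.gauss(4.0, 1.0) for _ in range(batch)]  # invented data
        noise = [rng.gauss(0.0, 1.0) for _ in range(batch)]  # the 'seed'
        fake = [a * z + b for z in noise]

        # Discriminator step: gradient ascent on
        # log D(real) + log(1 - D(fake)), i.e. get better at telling
        # real from synthetic.
        gw = gc = 0.0
        for x in real:
            d = sigmoid(w * x + c)
            gw += (1.0 - d) * x
            gc += 1.0 - d
        for x in fake:
            d = sigmoid(w * x + c)
            gw -= d * x
            gc -= d
        w += lr * gw / batch
        c += lr * gc / batch

        # Generator step: gradient ascent on log D(fake), i.e. move the
        # synthetic data toward what the discriminator calls 'real'.
        ga = gb = 0.0
        for z in noise:
            d = sigmoid(w * (a * z + b) + c)
            ga += (1.0 - d) * w * z
            gb += (1.0 - d) * w
        a += lr * ga / batch
        b += lr * gb / batch

    return a, b, w, c


a, b, w, c = train_toy_gan()
print(f"generator: x = {a:.2f} * z + {b:.2f}; discriminator: w={w:.2f}, c={c:.2f}")
```

Because the discriminator here is a single logistic unit, it can only judge whether values tend to be high or low, so the generator has little incentive to match anything beyond the rough location of the real data. Capturing spread and correlations between features is what the larger networks in a real GAN are for.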
As the technology develops and its applications increase, synthetic data could provide greater privacy in the future and a means to advance our AI capabilities. As its ability to anonymize individuals’ identifiable information improves, privacy policies should take the use of synthetic data into consideration. But there are many accounting problems to acknowledge and answer before the issues with synthetic data are resolved.