A seductive solution with hidden dangers
Synthetic data is often presented as a clever fix for three persistent challenges in machine learning: data scarcity, biased training distributions and privacy restrictions. At the same time, some argue it could democratise AI development by reducing dependence on large proprietary datasets held by a few dominant companies.
But as synthetic data becomes more common, its limitations become clearer. It can produce datasets that look statistically valid while introducing subtle distortions that are hard to detect. These distortions do not simply reduce accuracy; they erode the foundation of trust we place in AI systems.
The core issue: synthetic worlds are not the real world
Synthetic data generation rests on the assumption that the complexity of real-life distributions can be faithfully reproduced. Yet rare events, unusual patterns and edge cases (the very elements that determine whether a model performs reliably) are usually absent.
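To see how quickly the tails vanish, consider a minimal sketch (a toy illustration, not from the original article: the distributions, sample sizes and threshold are all assumptions). A generator that simply fits a Gaussian to heavy-tailed "real" data reproduces the bulk of the distribution while almost never producing its rare extremes:

```python
import numpy as np

rng = np.random.default_rng(0)

# "Real" data: heavy-tailed (Student-t, 3 degrees of freedom), standing in
# for a population with genuine rare events.
real = rng.standard_t(df=3, size=100_000)

# A naive generator fits a Gaussian to the real data and samples from it,
# the simplest possible synthetic-data pipeline.
synthetic = rng.normal(real.mean(), real.std(), size=100_000)

# Compare how often extreme events (beyond five standard deviations)
# appear in each dataset.
threshold = 5 * real.std()
print("real tail rate:     ", np.mean(np.abs(real) > threshold))
print("synthetic tail rate:", np.mean(np.abs(synthetic) > threshold))
# The synthetic tail rate is typically orders of magnitude smaller: the
# generator captures the bulk of the distribution but not its edges.
```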
When synthetic datasets enter training pipelines, they create feedback loops in which models learn from artificial patterns rather than real human behaviour. This can lead to model collapse, a degradation in quality where outputs become increasingly distorted or meaningless.
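The dynamic is easy to reproduce in miniature. The sketch below (again a toy illustration, not the article's example) refits a simple categorical model on its own samples, generation after generation, with no fresh real data entering the loop; rare categories that draw zero samples in any round disappear permanently, a minimal form of model collapse:

```python
import numpy as np

rng = np.random.default_rng(42)

# A "real" categorical distribution with a few common and several rare values.
real_probs = np.array([0.4, 0.3, 0.2, 0.05, 0.03, 0.02])
probs = real_probs.copy()

# Each generation: sample a small dataset from the current model, then refit
# the model to those samples alone.
for generation in range(50):
    sample = rng.choice(len(probs), size=100, p=probs)
    counts = np.bincount(sample, minlength=len(probs))
    probs = counts / counts.sum()  # maximum-likelihood refit

print("original distribution:", real_probs)
print("after 50 generations: ", probs.round(3))
# Rare categories typically collapse to exactly zero: once a value draws no
# samples in some generation, the refitted model can never produce it again.
```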
Fairness cannot be engineered without real representation
One popular promise is that synthetic data can improve fairness by simulating underrepresented groups. But how can an algorithm faithfully reproduce a group it has rarely or never encountered in real data?
This approach risks creating the illusion of fairness while quietly reproducing the very inequalities it claims to address. Worse, developers are effectively asked to become arbiters of social values, making ethical decisions that should not rest solely on technical judgement.
The illusion of privacy protection
A common argument is that synthetic data protects privacy by removing direct identifiers. However, the closer synthetic data gets to real-world patterns, the greater the risk of re-identification. A rare combination of traits in a small population can be enough to reveal sensitive information.
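One way to make this concrete is a uniqueness check over quasi-identifiers. The snippet below is hypothetical (the fields and records are invented for illustration): any combination of traits held by exactly one person acts as an identifier, so a synthetic dataset that reproduces it faithfully exposes that person despite the absence of names:

```python
from collections import Counter

# A hypothetical "anonymised" table: no direct identifiers, but
# quasi-identifiers remain.
records = [
    {"age_band": "30-39", "postcode_prefix": "EC1", "occupation": "nurse"},
    {"age_band": "30-39", "postcode_prefix": "EC1", "occupation": "teacher"},
    {"age_band": "70-79", "postcode_prefix": "IV27", "occupation": "surgeon"},
    {"age_band": "30-39", "postcode_prefix": "EC1", "occupation": "nurse"},
]

# Count how many records share each combination of quasi-identifiers.
counts = Counter(tuple(sorted(r.items())) for r in records)

# Any combination held by a single record is effectively identifying.
for combo, n in counts.items():
    if n == 1:
        print("unique, re-identifiable combination:", dict(combo))
```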
Meanwhile, synthetic data generation systems obscure many of the design choices embedded within them. This makes auditing and accountability significantly more challenging.
Why open, documented and high-quality data are essential
The real debate should not be synthetic versus real, but opaque versus transparent, and poor-quality versus high-quality. AI systems require datasets grounded in verifiable reality, not statistical approximations detached from lived experience.
High-quality open datasets, such as those produced through glossapi, provide precisely what synthetic data cannot:
• transparent provenance and full documentation,
• alignment with real linguistic, cultural and social contexts,
• public accountability and reproducibility,
• independence from closed ecosystems dominated by the United States and China.
Open data enables meaningful scrutiny. It allows researchers, journalists, public bodies and citizens to evaluate how models are trained, identify biases and contribute to improvements. Synthetic data, by contrast, hides the very mechanisms that shape its output.
The simulation-to-reality gap will not disappear
Synthetic data creates what researchers call a simulation-to-reality gap, a disconnect between artificially generated patterns and real-world behaviour. This gap can be useful for stress testing or controlled experimentation, but it cannot form the basis of trustworthy AI intended to support real decisions.
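Where synthetic data is used at all, that gap should at least be measured rather than assumed away. A minimal sketch (illustrative; the distributions are stand-ins) uses a two-sample Kolmogorov-Smirnov test to quantify how far a synthetic sample drifts from the real one before it enters a pipeline:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)

# Stand-ins: the "real" measurements are skewed, while the generator
# wrongly assumed they were normally distributed.
real = rng.lognormal(mean=0.0, sigma=0.5, size=5_000)
synthetic = rng.normal(real.mean(), real.std(), size=5_000)

# A two-sample Kolmogorov-Smirnov test quantifies the distributional gap.
stat, p_value = ks_2samp(real, synthetic)
print(f"KS statistic: {stat:.3f} (p = {p_value:.1e})")
# A large statistic with a tiny p-value flags a measurable gap, useful for
# judging whether a synthetic dataset is fit for a given purpose.
```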
As reliance on synthetic data grows, so does the risk of widespread data contamination across entire industries. The more AI models learn from synthetic approximations rather than real evidence, the more fragile and error-prone they become.
Conclusion: AI must be grounded in reality, not simulations
Synthetic data has valid uses, but it cannot replace the role of real, open, high-quality datasets. If AI is to serve the public good, it must be trained on data that reflects genuine human experiences and verifiable contexts.
Open datasets, especially those that are meticulously curated, publicly documented and culturally relevant, are the only solid foundation for reliable, fair and transparent AI. Anything less builds our technological future on synthetic foundations that cannot bear the weight of real-world expectations.
—
Source of this article: https://www.adalovelaceinstitute.org/blog/synthetic-data-real-harm/