Why LLM Training Data Must Become a Global Public Good

AI Innovation Starts with Shared, Trusted Data


As large language models (LLMs) increasingly shape critical systems in governance, education, public health, and disaster response, the quality and governance of the data that train them have become central to public-interest innovation. Artificial intelligence is only as fair, transparent, and representative as the information it learns from. Yet much of the world’s data remains locked away, inaccessible, or fragmented across institutions. The concept of data commons offers a transformative alternative: shared, community-governed data resources that treat training datasets for AI not as proprietary assets, but as public goods.

The New Commons Challenge and a Global Shift in Data Governance


At an event held in New York City on September 25, 2025, the Open Data Policy Lab, in collaboration with Microsoft and partners including CrisisReady and Harvard’s Institutional Data Initiative, and with UNESCO as an observer, presented the winners of the New Commons Challenge. The initiative recognized leaders who are building data commons that enable trustworthy AI for disaster response and local decision-making. The variety of global actors involved reflected a broader realization: meaningful AI governance requires meaningful data governance.

The winning projects illustrate how data commons can directly address gaps in representation and accessibility. The CERTI Amazônia Institute built an AI-ready environmental index for the Amazon basin, while the NYU Peace Research and Education Program launched the Malawi Voice Data Commons to generate multilingual datasets for early-warning systems. These projects are not simply producing data; they are defining how training data for LLMs should be created: ethically, inclusively, and with community participation.

Treating LLM Training Data as a Public Good


A core practice of today’s AI ecosystem, in which companies privately curate vast datasets for LLM training, has proven both insufficient and inequitable. Many languages, cultures, and experiences are nearly absent from existing training corpora. This leads to biased outputs and widens the AI divide. Data commons provide a structural remedy: they allow communities to co-create their own datasets in transparent ways that reflect local realities.

The projects showcased in the New Commons Challenge demonstrate that when data governance is participatory, the resulting datasets are more accurate, more ethically sourced, and far more valuable for training AI systems that reflect the diversity of human experience. Treating LLM training data as a public good is not only a matter of fairness; it is a prerequisite for building trustworthy AI.

Data Commons as Foundational Digital Infrastructure


Speakers at the event repeatedly referred to data commons as the “missing infrastructure” for responsible AI. Without shared, interoperable, and well-governed datasets, LLMs will continue to mirror existing social and geographic inequalities. The UN’s Global Digital Compact reinforces this view, explicitly highlighting data commons as tools to reduce digital inequality and prevent the emergence of a permanent AI divide.

The fireside discussions also highlighted that data governance and AI governance are inseparable. Around the world, policymakers face pressure to regulate AI, but without robust systems for data access and oversight, regulation becomes symbolic rather than substantive. Data commons provide the necessary scaffolding for transparency, accountability, and equitable participation.

Building a Democratic Data Economy for the AI Era


The initiatives celebrated at the New Commons Challenge represent more than technical achievements; they represent a philosophical shift. Around the world, governments, researchers, and communities are beginning to recognize that high-quality datasets must be treated as shared digital infrastructure. For AI to advance the public good, its training data must be accurate, inclusive, and governed through democratic principles. By elevating data commons as a central pillar of digital governance, we can build LLMs that genuinely serve humanity rather than a select few.

Source of this article: opendatapolicylab.org