May 30, 2025

GlossAPI: Developing the Greek Data Set for Large Language Model Training

The rapid expansion of Large Language Models (LLMs) has created an unprecedented need for large-scale, high-quality, and linguistically coherent datasets. For Greek, a language rich in history, structure, and semantic nuance, this need is even more urgent because the language is underrepresented in mainstream AI development.

In response, GlossAPI was launched in 2023 as an initiative aimed at building a comprehensive Greek Data Set, suitable for training new Greek-centric LLMs as well as for fine-tuning existing ones.

The Challenge of Data Preparation

Building a reliable Greek Data Set is not simply a matter of collecting large quantities of text. It requires a structured and sophisticated workflow involving:

systematic data collection and acquisition from diverse sources,
high-precision data cleaning to remove noise, inconsistencies, and duplicates,
linguistic, semantic, and contextual annotation,
classification across conceptual, functional, and domain categories,
continuous evaluation of linguistic accuracy and conceptual consistency.

These processes are not purely technical. They rely heavily on linguistic expertise, historical context, and the synergy between human and machine intelligence.
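The cleaning step in the workflow above can be illustrated with a minimal Python sketch. It is not GlossAPI's actual pipeline: the function names (normalize, dedup_key, clean_corpus) are hypothetical, and the choice to treat texts that differ only in case or accent marks as duplicates is an assumption made for illustration. Only the Python standard library is used.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """Normalize Unicode to NFC and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def dedup_key(text: str) -> str:
    """Hash a case-folded, accent-stripped form of the text.

    Used only to detect duplicates; the stored text keeps its accents.
    """
    decomposed = unicodedata.normalize("NFD", text.casefold())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return hashlib.sha256(stripped.encode("utf-8")).hexdigest()


def clean_corpus(docs):
    """Return normalized documents, dropping empty ones and duplicates."""
    seen = set()
    cleaned = []
    for doc in docs:
        text = normalize(doc)
        if not text:
            continue  # skip documents that are empty after normalization
        key = dedup_key(text)
        if key in seen:
            continue  # skip texts already seen, ignoring case and accents
        seen.add(key)
        cleaned.append(text)
    return cleaned
```

Calling clean_corpus(["Καλημέρα  κόσμε", "καλημερα κοσμε", "", "Γλώσσα"]) keeps the first and last entries: the second differs from the first only in case and accents, and the third is empty. A production pipeline would likely add near-duplicate detection and language filtering, which this sketch leaves out.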

Architectural Solution: MindsDB as an Integration Platform

A major technical challenge is storing, managing, and serving the massive datasets required for LLM training. The study proposes adopting MindsDB, an emerging AI-native database capable of embedding AI models directly within the data layer.

This enables:

AI-assisted data ingestion, powered by existing LLMs and ML models,
in-database preprocessing, annotation, and labeling,
automated classification,
efficient data delivery to researchers and training pipelines.

MindsDB thus acts as a unified environment connecting the Greek Data Set with intelligent AI models, making the entire pipeline more scalable and effective.
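The idea of running annotation inside the data layer can be sketched in Python. The example below does not use MindsDB or its API. It uses the standard-library sqlite3 module as a stand-in, and a toy keyword matcher in place of a real model. The table names, the classify_domain function, and the keyword lists are all hypothetical.

```python
import sqlite3

# Hypothetical keyword lists; a real system would call a trained model here.
DOMAIN_KEYWORDS = {
    "law": ("νόμος", "δικαστήριο"),
    "medicine": ("ασθενής", "θεραπεία"),
}


def classify_domain(text):
    """Toy stand-in for a model: return the first domain whose keyword appears."""
    lowered = text.lower()
    for domain, keywords in DOMAIN_KEYWORDS.items():
        if any(word in lowered for word in keywords):
            return domain
    return "other"


def build_annotated_db(docs):
    """Load documents into an in-memory database and label them via SQL."""
    conn = sqlite3.connect(":memory:")
    # Expose the classifier as a SQL function so labeling runs inside the query.
    conn.create_function("classify_domain", 1, classify_domain)
    conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany(
        "INSERT INTO documents (body) VALUES (?)", [(d,) for d in docs]
    )
    # The annotation step is a single SQL statement over the stored rows.
    conn.execute(
        "CREATE TABLE annotated AS "
        "SELECT id, body, classify_domain(body) AS domain FROM documents"
    )
    return conn
```

The design point is that the labeling logic is invoked from the query itself, so downstream consumers can read the annotated table without running a separate preprocessing job. How MindsDB exposes models to queries is not shown here and should be taken from its own documentation.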

Towards a Complete and Trusted Greek Data Set

Building a national-scale Greek Data Set requires:

unified technical standards,
collaboration across linguistic, technical, and research communities,
AI-powered workflows,
secure and high-performance storage and distribution infrastructure.

GlossAPI aims to serve as the backbone of this effort by providing:

secure access to Greek-language datasets,
advanced tools for cleaning, annotation, and classification,
a development environment for Greek-specialized AI models.

Conclusion

Developing a comprehensive Greek Data Set is not merely a technical prerequisite for training Greek-focused LLMs. It is a strategic investment that strengthens:

national digital sovereignty,
scientific research in artificial intelligence,
and the emergence of a thriving Greek-language AI ecosystem.

—

Source of this article: acm.org