Building a Fully Open Greek LLM: A Three-Millennia Language Model Powered by Open Data Infrastructure

The rapid evolution of fully open large language models represents a transformative moment for countries with rich linguistic and cultural heritage. Over the past two years, the global AI community has shown that high-performance LLMs can be built openly, with transparent pipelines, published datasets and weights, and licenses that permit both research and commercial use. Apertus, OLMo, and BLOOM are landmark examples. Greece, with its uniquely long linguistic continuum, now faces a historic opportunity: to develop a fully open Greek LLM covering Ancient Greek, Medieval Greek, Modern Greek, Katharevousa, and regional dialects.

Global open LLMs provide a roadmap

Apertus, developed by ETH Zurich, EPFL, and the Swiss National Supercomputing Centre (CSCS), exemplifies radical openness. It offers 8B and 70B parameter models trained on more than 15 trillion tokens spanning over 1,000 languages. Everything is released: datasets, code, weights, and training configurations. OLMo, from the Allen Institute for AI, provides an equally transparent model ecosystem, including the Dolma pretraining dataset. BLOOM, created by the BigScience community, was the first large LLM trained collaboratively with openly licensed data, pipelines, and full documentation.

These efforts demonstrate that open-source AI is not only viable but competitive. They establish the foundation on which Greece can build its own open model.

Why Greece needs a fully open Apertus-style Greek LLM

Greek is not merely a language. It is a three-thousand-year continuum with unparalleled internal diversity. No existing model captures the full historical, semantic, and orthographic evolution of the Greek language:

  • Ancient Greek
  • Hellenistic Koine
  • Medieval and Byzantine Greek
  • Katharevousa
  • Modern Greek
  • Cypriot, Pontic, Cretan, and other dialects
  • Specialized corpora: legal, ecclesiastical, philosophical, scientific, journalistic

A fully open Greek LLM would deliver strategic advantages:

  • digital sovereignty and reduced dependence on closed foreign systems
  • open tools for culture, education, research, and the humanities
  • support for public administration, law, health, and citizen services
  • creation of a national AI innovation ecosystem
  • preservation and revitalization of linguistic heritage

The Glossapi Open Infrastructure as a foundational enabler

The cornerstone of any such initiative is the availability of high-quality, open, AI-ready Greek datasets. The “Glossapi Open Infrastructure for Greek AI-Ready Data” is precisely this missing piece. It provides:

  • clean, standardized, and openly licensed Greek language datasets
  • unified pipelines for ingestion, transformation, annotation, and validation (a minimal stage is sketched after this list)
  • coverage across historical periods and dialectal variations
  • compliance with international open data standards
  • a common foundation for training fully open Greek LLMs
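
Glossapi's actual interfaces are not reproduced here, but a hypothetical Python sketch can illustrate what a single ingestion-to-validation stage of such a pipeline might look like. The function names, the Unicode-block heuristic, and the 50% threshold are illustrative assumptions, not part of Glossapi:

    import re
    import unicodedata

    def normalize_greek(text: str) -> str:
        """Normalize raw Greek text into a canonical form.

        NFC composition matters for polytonic Ancient Greek, where base
        letters and diacritics may arrive decomposed; whitespace is then
        collapsed without altering any characters.
        """
        text = unicodedata.normalize("NFC", text)
        return re.sub(r"\s+", " ", text).strip()

    def is_mostly_greek(text: str, threshold: float = 0.5) -> bool:
        """Keep a record only if most of its letters are Greek.

        Checks the Greek and Coptic (U+0370-U+03FF) and Greek Extended
        (U+1F00-U+1FFF) Unicode blocks; the threshold is illustrative.
        """
        letters = [c for c in text if c.isalpha()]
        if not letters:
            return False
        greek = sum("\u0370" <= c <= "\u03ff" or "\u1f00" <= c <= "\u1fff"
                    for c in letters)
        return greek / len(letters) >= threshold

    # Minimal ingestion -> transformation -> validation flow.
    raw_records = [
        {"id": "byz-001", "text": "  Ἐν ἀρχῇ ἦν ὁ λόγος  "},
        {"id": "noise-01", "text": "lorem ipsum dolor sit amet"},
    ]
    clean = [{**r, "text": normalize_greek(r["text"])}
             for r in raw_records if is_mostly_greek(r["text"])]
    print(clean)  # keeps only the Greek record, normalized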

Without such infrastructure, any attempt to build a Greek LLM would be fragmented and incomplete. Glossapi transforms the landscape by offering the same type of systematic data foundation that projects like Apertus rely on at the international level.

Technical best practices for fine-tuning and evaluation

Drawing on international open LLM practice:

Apertus fine-tuning

Apertus supports both full fine-tuning and LoRA adapters. Recommended settings include a learning rate (LR) around 5e-5, batch sizes of 64–128, the AdEMAMix optimizer, and curated instruction datasets. Evaluation uses MMLU, ARC, HellaSwag, and safety checks.
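
As a concrete illustration, the sketch below shows a LoRA fine-tuning run with the Hugging Face transformers and peft libraries, using the hyperparameters above. The model identifier and the one-sentence dataset are placeholders, and the LoRA rank and target modules are illustrative choices rather than Apertus recommendations:

    import torch
    from datasets import Dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              Trainer, TrainingArguments)

    MODEL = "swiss-ai/Apertus-8B"  # placeholder model identifier

    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL,
                                                 torch_dtype=torch.bfloat16)

    # Wrap the base model with low-rank adapters; only these are trained.
    lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()

    # Toy stand-in for a curated Greek instruction dataset.
    def tokenize(example):
        out = tokenizer(example["text"], truncation=True, max_length=512)
        out["labels"] = out["input_ids"].copy()
        return out

    train_dataset = Dataset.from_dict(
        {"text": ["Η Ελλάδα χρειάζεται ανοιχτά γλωσσικά μοντέλα."]}
    ).map(tokenize, remove_columns=["text"])

    args = TrainingArguments(
        output_dir="apertus-greek-lora",
        learning_rate=5e-5,              # recommended LR
        per_device_train_batch_size=8,
        gradient_accumulation_steps=8,   # effective batch size 64
        num_train_epochs=2,
        bf16=True,
        logging_steps=50,
    )
    Trainer(model=model, args=args, train_dataset=train_dataset).train()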

OLMo fine-tuning

OLMo supports either full-parameter tuning or LoRA adapters, with learning rates of 5e-5 to 1e-4 and batch sizes of 32–64. Dolma-based datasets enable multi-domain training. Evaluation covers reasoning, code tasks, and checks for catastrophic forgetting.
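
The catastrophic-forgetting check in particular lends itself to a simple sketch: measure perplexity on a small general-domain probe set before and after fine-tuning, and flag large regressions. The model identifiers, checkpoint path, and probe sentences below are placeholders:

    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def perplexity(model, tokenizer, texts):
        """Approximate corpus perplexity from token-weighted mean losses."""
        model.eval()
        total_loss, total_tokens = 0.0, 0
        with torch.no_grad():
            for text in texts:
                enc = tokenizer(text, return_tensors="pt")
                out = model(**enc, labels=enc["input_ids"])
                n = enc["input_ids"].numel()
                total_loss += out.loss.item() * n
                total_tokens += n
        return math.exp(total_loss / total_tokens)

    probe = ["The capital of France is Paris.",
             "Water boils at 100 degrees Celsius at sea level."]

    BASE = "allenai/OLMo-2-1124-7B"      # placeholder base model
    tok = AutoTokenizer.from_pretrained(BASE)
    base = AutoModelForCausalLM.from_pretrained(BASE)
    tuned = AutoModelForCausalLM.from_pretrained("./olmo-greek-tuned")

    # A sharp rise in perplexity on the general probe set after Greek
    # fine-tuning is a symptom of catastrophic forgetting.
    print("base  ppl:", perplexity(base, tok, probe))
    print("tuned ppl:", perplexity(tuned, tok, probe))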

BLOOM fine-tuning

BLOOM fine-tuning typically uses learning rates of 3e-5 to 1e-4, batch sizes of 64–128, and early stopping on validation loss. It also requires multilingual benchmarks, bias evaluation, and monitoring of token overflow, i.e. documents that exceed the model's context window.
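
Both safeguards can be expressed with standard transformers tooling; the sketch below uses illustrative paths and thresholds, and the small bloom-560m checkpoint stands in for a full-size model:

    from transformers import (AutoTokenizer, EarlyStoppingCallback,
                              TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

    def overflow_rate(texts, max_length=2048):
        """Fraction of documents longer than the context window."""
        too_long = sum(
            len(tokenizer(t, truncation=False)["input_ids"]) > max_length
            for t in texts
        )
        return too_long / len(texts)

    corpus = ["Σύντομο ελληνικό κείμενο.", "A short English sentence."]
    print(f"token overflow rate: {overflow_rate(corpus):.1%}")

    # Early stopping: halt when eval loss stops improving for 3 evals.
    args = TrainingArguments(
        output_dir="bloom-greek",
        learning_rate=3e-5,
        per_device_train_batch_size=64,
        eval_strategy="steps",    # "evaluation_strategy" in older versions
        eval_steps=500,
        save_strategy="steps",
        save_steps=500,
        load_best_model_at_end=True,      # required for early stopping
        metric_for_best_model="eval_loss",
    )
    callbacks = [EarlyStoppingCallback(early_stopping_patience=3)]
    # Pass `args` and `callbacks` to a Trainer with train/eval datasets.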

Shared evaluation standards

All three projects converge on MMLU, ARC, and HellaSwag, complemented by domain-specific benchmarks, bias assessment, and held-out validation splits.
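
One way to operationalize this shared standard is EleutherAI's lm-evaluation-harness (the lm-eval package); in this sketch the checkpoint path is a placeholder, and the task list mirrors the benchmarks above:

    import lm_eval

    # Run the shared benchmark suite against a local checkpoint.
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=./greek-llm-checkpoint",  # placeholder
        tasks=["mmlu", "arc_challenge", "hellaswag"],
        num_fewshot=0,
    )
    for task, metrics in results["results"].items():
        print(task, metrics)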

A national-scale open AI infrastructure

Building a fully open Greek LLM is not only a scientific challenge; it is a national strategic investment. By combining Glossapi’s open data infrastructure with academic research and open-source AI development, Greece can create an LLM ecosystem that is sustainable, transparent, and internationally competitive.

A fully open Greek LLM would become a public digital asset, enabling innovation across sectors and placing Greece at the forefront of open AI development worldwide.

Primary source of this article: 10 Best Open-Source LLM Models – huggingface.co