GlossAPI: Open Infrastructure for Greek AI-Ready Data

Toward a Transparent and Participatory Natural Language Processing Ecosystem

The Open Technologies Alliance (GFOSS – ΕΕΛΛΑΚ) is one of the most active organizations in Greece in the field of open technology and digital governance. As a non-profit entity, GFOSS promotes openness, transparency, and collaborative innovation, and fosters the adoption of open standards across public, academic, and research institutions.

Within this mission, GFOSS developed GlossAPI — an initiative that aims to strengthen the Greek language in the age of Artificial Intelligence (AI) and establish it as an equal player among major European and international languages in the Natural Language Processing (NLP) ecosystem.

From the Need for Language Models… to the Need for Data

GlossAPI was born out of the need to develop Greek language models. During the initial exploration phase, it quickly became evident that the main obstacle was not the lack of technology, but the lack of high-quality, well-documented, and openly accessible Greek data.

The Greek language remains underrepresented in global AI training datasets. For example, while the English Wikipedia exceeds 80 GB, the Greek version barely reaches 1 GB. Beyond volume, most Greek datasets suffer from limited accessibility, poor documentation, and low reproducibility.

Through GlossAPI, GFOSS seeks to address this foundational problem by building a sustainable infrastructure for producing, curating, and publishing open linguistic datasets — ready to be used by researchers, institutions, and communities.

What Is GlossAPI

GlossAPI is both an open-source Python library and a technical infrastructure for creating, processing, and publishing AI-ready Greek datasets for Natural Language Processing and other AI applications.

The library ingests text from various file formats (PDF, DOCX, HTML, etc.), cleans, standardizes, and annotates it, and exports it in AI-ready formats such as Parquet or Markdown. All datasets are fully documented, released under open licenses (Creative Commons, EUPL-1.2, etc.), and published following widely used conventions such as those of the Hugging Face Datasets Hub.
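To give a feel for what "cleans and standardizes" can mean for Greek text, here is a minimal sketch using only the Python standard library. It is an illustration, not GlossAPI's actual cleaning code: the function name `normalize_greek_text` and the specific rules (Unicode NFC composition, whitespace collapsing) are assumptions chosen for the example.

```python
import re
import unicodedata


def normalize_greek_text(raw: str) -> str:
    """Illustrative cleanup only; NOT GlossAPI's actual implementation."""
    # Compose base letters and combining accents into single code points (NFC),
    # so that alpha + combining acute becomes one accented letter.
    text = unicodedata.normalize("NFC", raw)
    # Turn non-breaking spaces into plain spaces, then collapse runs of spaces/tabs.
    text = text.replace("\u00a0", " ")
    text = re.sub(r"[ \t]+", " ", text)
    # Trim every line, then allow at most one empty line between paragraphs.
    text = "\n".join(line.strip() for line in text.splitlines())
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


# Hypothetical input with decomposed accents, a non-breaking space, and extra blank lines.
sample = "Γλω\u0301σσα \u00a0 και   δεδομε\u0301να\n\n\n\nανοιχτα\u0301"
print(normalize_greek_text(sample))
```

Normalizing to a single Unicode form matters for Greek because the same accented letter can be encoded in more than one way, which otherwise splits identical words into different tokens.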

GlossAPI serves both as an automation tool and as a framework of transparency and documentation in the data production process.

Produced Outcomes

To date, GlossAPI has produced and published 15 high-quality datasets, covering a wide range of thematic domains, including:

Public consultations and political discourse
Encyclopedic and educational texts
Academic theses and scientific works
Classical and modern Greek literature

These datasets are freely available at huggingface.co/glossAPI, accompanied by complete documentation and metadata, ready for use in research projects, language models, and educational programs.

Technical and Organizational Approach

The development of GlossAPI is based on interdisciplinary collaboration between software engineers, linguists, and open-technology experts. The team combines NLP techniques with advanced methods for data cleaning, normalization, and enrichment, ensuring both technical robustness and linguistic quality.

GFOSS also coordinates the participation of students, research labs, and universities, promoting collaborative data production and the culture of open science. This approach aims to make the development of Greek NLP tools a collective, reproducible, and democratic process.

How the GlossAPI Pipeline Works

At its core, GlossAPI features a text-processing pipeline designed to transform heterogeneous Greek documents into clean, normalized, and well-documented datasets ready for AI applications.

The system is based on the Corpus class and follows a four-stage sequential workflow:

Data Retrieval – Using corpus.download(), the system fetches documents from URLs or existing metadata files (metadata.parquet), maintaining full reference to the original source.
Text Extraction – The corpus.extract() method isolates text from PDF, DOCX, or HTML files and converts it to Markdown for maximum readability and minimal structural loss.
Analysis and Segmentation – Through corpus.section(), the system detects sections (tables of contents, introductions, bibliographies, main body, appendices) and creates sections_for_annotation.parquet files for further processing.
Section Recognition and Annotation – The corpus.annotate() method applies classification models that identify section types (e.g., “p” for table of contents, “b” for bibliography, “e.s.” for introduction, “k” for main text, “a” for appendix, or “other”), producing final files such as classified_sections.parquet and fully_annotated_sections.parquet.
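The section codes listed in the last stage are easier to work with once mapped to readable names. The sketch below tallies section types for one document. It uses the codes exactly as given above; the names `SECTION_LABELS` and `summarize_labels` and the sample labels are hypothetical, and nothing here reads GlossAPI's Parquet output directly.

```python
from collections import Counter

# Label codes as listed in the text above, mapped to readable names.
SECTION_LABELS = {
    "p": "table of contents",
    "b": "bibliography",
    "e.s.": "introduction",
    "k": "main text",
    "a": "appendix",
    "other": "other",
}


def summarize_labels(labels):
    """Count sections per readable type; unknown codes are counted as 'other'."""
    readable = (SECTION_LABELS.get(code, "other") for code in labels)
    return Counter(readable)


# Hypothetical predicted labels for the sections of a single document.
predicted = ["p", "e.s.", "k", "k", "k", "b", "a"]
for name, count in summarize_labels(predicted).most_common():
    print(f"{name}: {count}")
```

In practice the label column would come from one of the annotated Parquet files; the column name depends on the library version, so check the README before relying on it.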

Output folder structure:

downloads/ – input files
markdown/ – extracted texts
sections/ – segmented data
download_results/ – intermediate outputs and metadata
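A quick way to see how far a run has progressed is to check which of the folders listed above exist and how many files each contains. This is a small sketch with the standard library only; `report_output_dirs` is a hypothetical helper and `"output"` is a placeholder path, not a GlossAPI default.

```python
from pathlib import Path

# Folder names and descriptions as listed above.
EXPECTED_DIRS = {
    "downloads": "input files",
    "markdown": "extracted texts",
    "sections": "segmented data",
    "download_results": "intermediate outputs and metadata",
}


def report_output_dirs(output_dir):
    """Map each expected folder to its file count, or None if the folder is missing."""
    root = Path(output_dir)
    report = {}
    for name in EXPECTED_DIRS:
        folder = root / name
        if folder.is_dir():
            # Count regular files at any depth inside the folder.
            report[name] = sum(1 for p in folder.rglob("*") if p.is_file())
        else:
            report[name] = None
    return report


# "output" is a placeholder; point it at your own output directory.
for name, count in report_output_dirs("output").items():
    status = "missing" if count is None else f"{count} file(s)"
    print(f"{name}/ ({EXPECTED_DIRS[name]}): {status}")
```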

The pipeline is modular: processing can start from any stage, and Rust-based quality and noise indicators are integrated into the workflow. This design makes GlossAPI a systematic and transparent tool for large-scale Greek text processing.
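Starting from any stage can be pictured with a small helper that calls the four stage methods in order, beginning wherever you choose. This is a minimal sketch and not part of GlossAPI: `run_pipeline`, `STAGES`, and `RecordingCorpus` are hypothetical names, and calling each method without arguments is an assumption (the real methods may accept parameters; see the README).

```python
# Stage method names as described in the workflow above.
STAGES = ("download", "extract", "section", "annotate")


def run_pipeline(corpus, start="download"):
    """Call the stage methods on `corpus` in order, beginning at `start`."""
    if start not in STAGES:
        raise ValueError(f"unknown stage {start!r}; expected one of {STAGES}")
    executed = []
    for stage in STAGES[STAGES.index(start):]:
        # Assumption: every stage can be invoked with no arguments.
        getattr(corpus, stage)()
        executed.append(stage)
    return executed


class RecordingCorpus:
    """Stand-in used only for this demo; it is not the real Corpus class."""

    def __init__(self):
        self.calls = []

    def download(self):
        self.calls.append("download")

    def extract(self):
        self.calls.append("extract")

    def section(self):
        self.calls.append("section")

    def annotate(self):
        self.calls.append("annotate")


demo = RecordingCorpus()
print(run_pipeline(demo, start="section"))  # prints ['section', 'annotate']
```

With the real library, you would pass a Corpus instance instead of the stand-in. For example, starting at "extract" fits the case where the documents are already on disk and nothing needs downloading.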

You can visit the project repository and read the full README at: github.com/eellak/glossAPI.

Applications and Projects

The GlossAPI infrastructure has already been utilized in the European project AI4Deliberation, where automated summarization and thematic analysis tools were developed for Greece’s OpenGov.gr public consultations.

These tools enabled the automatic summarization of legislative drafts in plain language and the mapping of citizens’ comments into thematic clusters — enhancing transparency and understanding of democratic processes.

This application demonstrated the potential of AI as a catalyst for accessibility and accountability, confirming the public value of open, Greek-language datasets.

Values and Principles

GlossAPI embodies GFOSS’s philosophy of open, transparent, and ethically aligned technology. Its foundational principles include:

Transparency: Full documentation and traceability of data and tools.
Participation: Opportunities for contribution by students, researchers, small businesses, and institutions.
Open standards and access: Compatibility with global models of open science and interoperability.
Ethical AI: Respect for information rights and the democratic oversight of algorithmic systems.

Through these principles, GlossAPI seeks to turn openness from a technical goal into a social and cultural value.

Vision and Next Steps

GlossAPI’s long-term vision is to create a comprehensive ecosystem of Greek language technology founded on open data, transparency, and collaboration.

Next steps include:

Expanding the dataset repository
Training Greek language models
Supporting institutions and organizations to produce their own AI-ready datasets under common open standards

In this way, GFOSS aims to strengthen the digital presence of the Greek language in AI and Large Language Models (LLMs), contributing to a sustainable, collaborative, and publicly beneficial language technology ecosystem.

Links and Contact:

🌐 Website: https://glossapi.gr

💠 Blog: https://blog.glossapi.gr/

🤗 Dataset Repository: https://huggingface.co/glossAPI

💻 Code & Documentation: https://github.com/eellak/glossAPI

📧 Contact: glossapi.team@eellak.gr

🖼️ Join our team: https://blog.glossapi.gr/en/become-part-of-glossapi/