Building an Open Greek Book Corpus from Openbook.gr

From Web Scraping to a Clean, Research-Ready Dataset

At GlossAPI, our mission is to build open, high-quality Greek language datasets that support the Greek AI ecosystem. In this article, we present our latest dataset: a large-scale collection of Greek books sourced from Openbook.gr, processed and cleaned using the GlossAPI pipeline, and prepared for publication through the Mozilla Data Collective.

This work demonstrates how open cultural content can be transformed into structured, machine-readable resources that benefit the wider research and open-source community.


Data Source: Openbook.gr

Openbook.gr is a Greek digital library hosting thousands of freely accessible books across many categories, including:

  • Literature
  • History
  • Philosophy
  • Education
  • Cookbooks and practical manuals

Most of the content is available as scanned PDFs, meaning the text is not natively machine-readable. To unlock its value for computational use, OCR (Optical Character Recognition) is required.

Data Collection

We developed a custom web scraping pipeline to:

  • Crawl Openbook.gr
  • Identify book pages and metadata
  • Download all available PDF files
  • Organize them into a structured dataset

In total, we collected 4,063 Greek books for processing.
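
For illustration, the download step of such a pipeline might look like the sketch below, using requests and BeautifulSoup. The selector and page structure are assumptions for demonstration, not our production crawler:

```python
import requests
from bs4 import BeautifulSoup
from pathlib import Path
from urllib.parse import urljoin

def download_pdfs(page_url: str, out_dir: Path) -> None:
    """Fetch a book page and download every PDF it links to."""
    out_dir.mkdir(parents=True, exist_ok=True)
    html = requests.get(page_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.select("a[href$='.pdf']"):  # assumes direct PDF links
        pdf_url = urljoin(page_url, link["href"])
        name = pdf_url.rsplit("/", 1)[-1]
        (out_dir / name).write_bytes(requests.get(pdf_url, timeout=60).content)
```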

OCR Processing with GlossAPI

To extract text from scanned PDFs, we used the GlossAPI OCR pipeline, which:

  • Converts PDF pages into images
  • Applies OCR models optimized for Greek
  • Produces raw text files for each book

You can explore the full pipeline architecture here:
👉 https://github.com/eellak/glossAPI
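
As a generic illustration of the PDF → image → OCR step (a sketch, not the GlossAPI implementation), one could combine pdf2image with pytesseract and Tesseract's Greek model:

```python
from pdf2image import convert_from_path  # requires the poppler utilities
import pytesseract

def ocr_pdf(pdf_path: str, dpi: int = 300) -> str:
    """Render each page of a scanned PDF to an image and OCR it in Greek."""
    pages = convert_from_path(pdf_path, dpi=dpi)
    # "ell" selects Tesseract's modern Greek model; GlossAPI's engine may differ
    return "\n".join(pytesseract.image_to_string(page, lang="ell") for page in pages)
```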

While OCR enables large-scale digitization, it also introduces systematic errors. This made post-processing and cleaning a critical step.

Text Cleaning & Normalization

The raw OCR output contained thousands of recognition and encoding errors. We designed an automated cleanup workflow to improve text quality while preserving linguistic integrity.

Dataset Statistics

  • Total books processed: 4,063
  • Final clean books: 3,719
  • Files removed due to severe corruption: 344
  • Total corrected sigma errors: 224,000+
  • Files affected by sigma corrections: ~1,900
  • Encoding errors corrected (Σ misencoded as ΢): ~39,000
  • PUA placeholders removed: ~97,800
  • Files affected by placeholder cleanup: 635
  • Standardized temperature references: 129
  • Common OCR typos corrected: dozens of recurring patterns

These figures highlight the scale of noise introduced by OCR and the importance of systematic post-processing for producing reliable language resources.

Issues Identified & Resolved

1. Character Misrecognition

OCR frequently confused visually similar characters:

  • Latin “o” → Greek “ο”
  • Latin “a” → Greek “α”

We automatically detected and corrected these cases using contextual analysis:
if a Latin character appeared inside a Greek word, it was replaced with the correct Greek letter.
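
A minimal sketch of this contextual replacement, assuming Python's re module and only the two mappings named above (the real rule set is larger):

```python
import re

# Visually confusable Latin -> Greek mappings (illustrative subset)
LATIN_TO_GREEK = str.maketrans({"o": "ο", "a": "α"})

# Any word that contains at least one Greek letter
GREEK_WORD = re.compile(r"\b\w*[\u0370-\u03FF\u1F00-\u1FFF]\w*\b")

def fix_confusables(text: str) -> str:
    """Transliterate Latin look-alikes only inside words containing Greek letters."""
    return GREEK_WORD.sub(lambda m: m.group(0).translate(LATIN_TO_GREEK), text)
```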

2. Greek Sigma Correction

Greek uses two sigma forms:

  • σ → inside words
  • ς → at word endings

OCR often mixed them up.
We corrected 224,000+ sigma errors across ~1,900 files.
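
The positional rule lends itself to two regular expressions, sketched here under the assumption that a sigma is final exactly when no Greek letter follows it:

```python
import re

GREEK = r"[\u0370-\u03FF\u1F00-\u1FFF]"

def fix_sigma(text: str) -> str:
    """Normalize sigma forms: ς only at word endings, σ everywhere else."""
    text = re.sub(rf"ς(?={GREEK})", "σ", text)   # final form mid-word -> medial
    text = re.sub(rf"σ(?!{GREEK})", "ς", text)   # medial form at word end -> final
    return text
```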

3. Temperature Formatting (Cookbooks)

Many books contained recipes. OCR broke temperature values:

  • 1800C instead of 180°C
  • Values split across lines

We detected and standardized 129 temperature references to the correct format.
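
A sketch of the single-line case (values split across lines would first need a line-joining pass; the pattern below is an assumption about the observed errors):

```python
import re

def fix_temperatures(text: str) -> str:
    """Rewrite OCR-mangled Celsius values such as '1800C' or '180oC' as '180°C'."""
    return re.sub(r"\b(\d{2,3})\s*[0oO]\s*C\b", r"\1°C", text)
```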

4. Special Character Encoding

Some Greek characters were replaced with rare Unicode symbols.
Most notably:

  • Σ (Sigma) misencoded as ΢ (U+03A2, a reserved code point)

We corrected ~39,000 encoding errors across the dataset.
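
Because the mapping is one-to-one, the fix itself reduces to a single substitution:

```python
def fix_invalid_sigma(text: str) -> str:
    """Replace U+03A2 (a reserved code point emitted by OCR) with capital Sigma."""
    return text.replace("\u03a2", "\u03a3")  # ΢ -> Σ
```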

5. Problematic Non-Text Characters

OCR embedded invisible junk characters:

a) Control Characters

Legacy printer commands that break text processing.

b) Private Use Area (PUA) Placeholders

Characters in the Unicode Private Use Area (U+E000–U+F8FF) that OCR engines emit for unrecognized glyphs, typically displayed as ▯ or ?.

We removed ~97,800 placeholders from 635 files.
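
Both categories can be stripped with one character class; the sketch below preserves tabs and newlines and is an assumption about which control codes are safe to drop:

```python
import re

# BMP Private Use Area plus control characters (tab and newline preserved)
JUNK = re.compile(r"[\uE000-\uF8FF\x00-\x08\x0B-\x1F\x7F]")

def strip_junk(text: str) -> str:
    """Drop PUA placeholders and control characters from OCR output."""
    return JUNK.sub("", text)
```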

6. Common OCR Typos

We identified systematic word errors:

  • Πάουνετρ → Πάουντερ
  • δημιοργήσετε → δημιουργήσετε

We built a verified correction dictionary and applied it globally.
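
Conceptually, the dictionary pass reduces to a compiled alternation; the entries below are the two examples above, not the full verified list:

```python
import re

# error -> verified correct form (illustrative excerpt)
CORRECTIONS = {
    "Πάουνετρ": "Πάουντερ",
    "δημιοργήσετε": "δημιουργήσετε",
}

PATTERN = re.compile("|".join(map(re.escape, CORRECTIONS)))

def apply_corrections(text: str) -> str:
    """Replace known OCR typos with their verified corrections."""
    return PATTERN.sub(lambda m: CORRECTIONS[m.group(0)], text)
```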

7. Formatting Artifacts

We removed:

  • Page counters (e.g., 1 / 9)
  • Blank page markers
  • PDF layout artifacts

These were not part of the original book content.
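
A sketch of the page-counter case, assuming counters sit alone on their own line:

```python
import re

# Lines consisting only of a counter such as '1 / 9'
PAGE_COUNTER = re.compile(r"^\s*\d+\s*/\s*\d+\s*$", re.MULTILINE)

def strip_page_counters(text: str) -> str:
    """Remove page-counter lines left behind by PDF extraction."""
    return PAGE_COUNTER.sub("", text)
```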

8. Quality Control

Some files were beyond repair:

  • Entire pages unreadable
  • Massive character corruption

We removed 344 files where automated cleaning could not ensure quality.
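
Our exact acceptance criteria are not reproduced here; one plausible heuristic of this kind scores each file by the share of expected characters and rejects those below a threshold (both the character class and the 0.95 cutoff are hypothetical):

```python
import re

EXPECTED = re.compile(r"[\u0370-\u03FF\u1F00-\u1FFFA-Za-z0-9\s.,;:·()'\"«»%°-]")

def is_salvageable(text: str, threshold: float = 0.95) -> bool:
    """Keep a file only if nearly all of its characters look like normal text."""
    if not text:
        return False
    return len(EXPECTED.findall(text)) / len(text) >= threshold
```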

Final Dataset

After processing:

  • 3,719 high-quality Greek books
  • Clean UTF-8 text
  • Normalized characters
  • Research-ready format

This dataset is suitable for:

  • NLP model training
  • Language modeling
  • Digital humanities
  • Cultural analytics
  • Search & retrieval systems

Publication: Mozilla Data Collective

The dataset will be published via the Mozilla Data Collective, ensuring:

  • Open access
  • Transparent licensing
  • Long-term availability
  • Community reuse

This aligns with our commitment to open knowledge infrastructure.

👉 datacollective.mozillafoundation.org/datasets

Licensing & Reuse

All texts included in this dataset originate from Openbook.gr, which hosts books that are legally available for free public access. We respect the original licensing terms of each work and provide:

  • Full source attribution
  • Original download links
  • Original license information where available

The processed dataset will be released under an open data license via Mozilla Data Collective, enabling:

  • Research use
  • Non-commercial reuse
  • Model training
  • Educational applications

Users are responsible for complying with the original licensing terms of individual books. Our role is to provide clean, machine-readable access to already open content, not to alter ownership or rights.

Why This Matters

Greek remains a low-resource language in NLP.
By releasing high-quality corpora:

  • Researchers gain better training data
  • Developers can build stronger models
  • Cultural heritage becomes computationally accessible

This project contributes to Greek digital sovereignty and open AI ecosystems.

Tools & Infrastructure

  • GlossAPI pipeline
  • OCR processing
  • Custom Python cleaning scripts
  • Automated QA checks
  • Open publication standards

👉 https://github.com/eellak/glossAPI

Lessons Learned

Working with large-scale OCR-derived text revealed several important insights:

  • OCR quality varies dramatically depending on scan resolution, typography, and page layout. Even books from the same source can differ significantly in text quality.
  • Greek presents unique challenges for OCR systems, especially with characters that visually resemble Latin letters (ο/o, α/a) and the dual sigma forms (σ/ς). Language-aware post-processing is essential.
  • Automation is powerful but not enough on its own. While most errors were corrected programmatically, some files were beyond repair and required strict quality filtering.
  • Systematic errors repeat across thousands of files. Once detected, building rule-based corrections produced massive improvements at scale.
  • Private Use Area placeholders are a major hidden problem. They silently break downstream NLP pipelines and must be aggressively removed.
  • Domain-specific content matters. Cookbooks introduced unique formatting issues (temperatures, measurements) that required custom logic.
  • Open datasets require governance. Even when content is openly available, careful documentation, licensing checks, and transparency are necessary for responsible reuse.

These lessons will directly inform future OCR projects and improvements to the GlossAPI pipeline.

Closing

This project demonstrates how open libraries + open infrastructure can create powerful public datasets. We invite researchers, developers, and institutions to reuse, improve, and build upon this corpus.