
From Web Scraping to a Clean, Research-Ready Dataset
At GlossAPI, our mission is to build open, high-quality Greek language datasets that support the Greek AI ecosystem. In this article, we present our latest dataset: a large-scale collection of Greek books sourced from Openbook.gr, processed and cleaned using the GlossAPI pipeline, and prepared for publication through the Mozilla Data Collective.
This work demonstrates how open cultural content can be transformed into structured, machine-readable resources that benefit the wider research and open-source community.
Data Source: Openbook.gr
Openbook.gr is a Greek digital library hosting thousands of freely accessible books across many categories, including:
- Literature
- History
- Philosophy
- Education
- Cookbooks and practical manuals
Most of the content is available as scanned PDFs, meaning the text is not natively machine-readable. To unlock its value for computational use, OCR (Optical Character Recognition) is required.
Data Collection
We developed a custom web scraping pipeline to:
- Crawl Openbook.gr
- Identify book pages and metadata
- Download all available PDF files
- Organize them into a structured dataset
In total, we collected 4,063 Greek books for processing.
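For illustration, here is a minimal sketch of the collection step. This is not the production crawler: it assumes requests and BeautifulSoup, uses a hypothetical listing-page URL, and omits the metadata extraction, pagination, and retry handling the real pipeline performs.

```python
import pathlib

import requests
from bs4 import BeautifulSoup

LISTING_URL = "https://www.openbook.gr/category/literature/"  # hypothetical listing page
OUT_DIR = pathlib.Path("openbook_pdfs")
OUT_DIR.mkdir(exist_ok=True)

def download_pdfs(listing_url: str) -> None:
    """Fetch one listing page and download every linked PDF."""
    html = requests.get(listing_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.select("a[href$='.pdf']"):
        pdf_url = requests.compat.urljoin(listing_url, link["href"])
        target = OUT_DIR / pdf_url.rsplit("/", 1)[-1]
        if not target.exists():  # skip files we already fetched
            target.write_bytes(requests.get(pdf_url, timeout=60).content)
```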
OCR Processing with GlossAPI
To extract text from scanned PDFs, we used the GlossAPI OCR pipeline, which:
- Converts PDF pages into images
- Applies OCR models optimized for Greek
- Produces raw text files for each book
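As a rough illustration, an OCR loop of this shape can be written with pdf2image and pytesseract (assuming poppler and Tesseract's Greek "ell" language data are installed); this is a generic sketch, not the actual GlossAPI code, which is linked below.

```python
import pathlib

import pytesseract                        # needs Tesseract plus the Greek "ell" traineddata
from pdf2image import convert_from_path   # needs poppler installed

def ocr_book(pdf_path: str, out_path: str) -> None:
    """Render each page to an image, OCR it in Greek, and save the raw text."""
    pages = convert_from_path(pdf_path, dpi=300)
    text = "\n".join(pytesseract.image_to_string(page, lang="ell") for page in pages)
    pathlib.Path(out_path).write_text(text, encoding="utf-8")
```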
You can explore the full pipeline architecture here:
👉 https://github.com/eellak/glossAPI
While OCR enables large-scale digitization, it also introduces systematic errors. This made post-processing and cleaning a critical step.
Text Cleaning & Normalization
The raw OCR output contained thousands of recognition and encoding errors. We designed an automated cleanup workflow to improve text quality while preserving linguistic integrity.
Dataset Statistics
- Total books processed: 4,063
- Final clean books: 3,719
- Files removed due to severe corruption: 344
- Total corrected sigma errors: 224,000+
- Files affected by sigma corrections: ~1,900
- Encoding issues corrected (Σ misrendered as an obsolete variant): ~39,000
- PUA placeholders removed: ~97,800
- Files affected by placeholder cleanup: 635
- Standardized temperature references: 129
- Common OCR typos corrected: dozens of recurring patterns
These figures highlight the scale of noise introduced by OCR and the importance of systematic post-processing for producing reliable language resources.
Issues Identified & Resolved
1. Character Misrecognition
OCR frequently confused visually similar characters:
- Latin “o” → Greek “ο”
- Latin “a” → Greek “α”
We automatically detected and corrected these cases using contextual analysis: when a Latin character appeared inside an otherwise Greek word, we replaced it with the corresponding Greek letter.
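A minimal sketch of this rule, assuming a hypothetical lookalike table (the production mapping covers more characters):

```python
import re

# Hypothetical lookalike table; the full mapping is larger.
LATIN_TO_GREEK = str.maketrans({"o": "ο", "O": "Ο", "a": "α", "A": "Α", "v": "ν"})
GREEK_LETTER = re.compile(r"[\u0370-\u03FF\u1F00-\u1FFF]")

def fix_lookalikes(token: str) -> str:
    """Replace Latin lookalikes only inside tokens that are mostly Greek."""
    greek = len(GREEK_LETTER.findall(token))
    if greek <= len(token) // 2:  # probably a genuine Latin word; leave it alone
        return token
    return token.translate(LATIN_TO_GREEK)
```

The majority check matters: genuinely Latin words (names, URLs, citations) must pass through untouched.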
2. Greek Sigma Correction
Greek uses two sigma forms:
- σ inside words
- ς at word endings
OCR often mixed them up.
We corrected 224,000+ sigma errors across ~1,900 files.
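The rule is mechanical enough to express as two regular expressions. A sketch, assuming Greek letters fall in the U+0370–U+03FF and U+1F00–U+1FFF blocks:

```python
import re

GREEK = r"\u0370-\u03FF\u1F00-\u1FFF"

def fix_sigmas(text: str) -> str:
    # A σ not followed by another Greek letter is word-final: it should be ς.
    text = re.sub(rf"σ(?![{GREEK}])", "ς", text)
    # A ς followed by a Greek letter is word-internal: it should be σ.
    text = re.sub(rf"ς(?=[{GREEK}])", "σ", text)
    return text

# Example: fix_sigmas("θάλαςσα") -> "θάλασσα"
```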
3. Temperature Formatting (Cookbooks)
Many books contained recipes. OCR broke temperature values:
- “1800C” where “180°C” was intended (the degree sign read as a zero)
- Values split across lines
We detected and standardized 129 temperature references to the correct format.
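A sketch of the kind of pattern involved, assuming the degree sign was misread as a zero; the real logic also handled values split across lines and other unit variants.

```python
import re

# Hypothetical pattern: 2-3 digits, a stray "0" standing in for "°", then "C".
TEMP = re.compile(r"(\d{2,3})\s*0\s*C\b")

def fix_temperatures(text: str) -> str:
    return TEMP.sub(r"\1°C", text)  # "1800C" -> "180°C"
```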
4. Special Character Encoding
Some Greek characters were replaced with rare Unicode symbols.
Most notably:
- Σ (Sigma) rendered as an obsolete variant character
We corrected ~39,000 encoding errors across the dataset.
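Reversing such substitutions is a simple character translation. In the sketch below, the lunate sigma codepoints (U+03F9 / U+03F2) are an illustrative assumption, not necessarily the variants found in the corpus:

```python
# Illustrative mapping only; the actual variant codepoints may differ.
VARIANTS = str.maketrans({
    "\u03f9": "Σ",  # Ϲ capital lunate sigma -> Σ
    "\u03f2": "σ",  # ϲ small lunate sigma -> σ
})

def fix_variants(text: str) -> str:
    return text.translate(VARIANTS)
```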
5. Problematic Non-Text Characters
OCR embedded invisible junk characters:
a) Control Characters
Legacy printer commands that break text processing.
b) Private Use Area (PUA) Placeholders
Unicode code points U+E000–U+F8FF that OCR engines emit as “unknown glyph” placeholders, typically displayed as ▯ or ?
We removed ~97,800 placeholders from 635 files.
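Both categories can be removed with a single character-class pass. A sketch that keeps tabs and newlines but strips other control characters and the BMP Private Use Area:

```python
import re

# Control characters (excluding tab \x09 and newlines \x0a/\x0d)
# plus the Basic Multilingual Plane Private Use Area.
JUNK = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f\ue000-\uf8ff]")

def strip_junk(text: str) -> str:
    return JUNK.sub("", text)
```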
6. Common OCR Typos
We identified systematic word errors:
- Πάουνετρ → Πάουντερ
- δημιοργήσετε → δημιουργήσετε
We built a verified correction dictionary and applied it globally.
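Applying such a dictionary globally is a one-pass substitution. A sketch using the two examples above as a hypothetical excerpt of the full dictionary:

```python
import re

# Excerpt of the verified dictionary (OCR typo -> correct form).
CORRECTIONS = {
    "Πάουνετρ": "Πάουντερ",
    "δημιοργήσετε": "δημιουργήσετε",
}
PATTERN = re.compile("|".join(map(re.escape, CORRECTIONS)))

def apply_corrections(text: str) -> str:
    return PATTERN.sub(lambda m: CORRECTIONS[m.group(0)], text)
```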
7. Formatting Artifacts
We removed:
- Page counters (e.g., “1 / 9”)
- Blank page markers
- PDF layout artifacts
These were not part of the original book content.
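As one example of this kind of filter, a line consisting only of a “current / total” page counter can be dropped; the pattern below is a hypothetical simplification.

```python
import re

# A line that contains nothing but a page counter such as "1 / 9".
PAGE_COUNTER = re.compile(r"^\s*\d+\s*/\s*\d+\s*$", re.MULTILINE)

def drop_page_counters(text: str) -> str:
    return PAGE_COUNTER.sub("", text)
```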
8. Quality Control
Some files were beyond repair:
- Entire pages unreadable
- Massive character corruption
We removed 344 files where automated cleaning could not ensure quality.
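A sketch of one possible quality gate, with a hypothetical threshold; the real filter combined several such signals before a file was discarded.

```python
import re

# Suspicious characters: the Unicode replacement character and any
# leftover Private Use Area placeholders that survived cleanup.
SUSPECT = re.compile(r"[\ufffd\ue000-\uf8ff]")

def passes_quality_gate(text: str, max_ratio: float = 0.01) -> bool:
    """Reject a file when suspicious characters exceed max_ratio of its length."""
    return bool(text) and len(SUSPECT.findall(text)) / len(text) <= max_ratio
```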
Final Dataset
After processing:
- 3,719 high-quality Greek books
- Clean UTF-8 text
- Normalized characters
- Research-ready format
This dataset is suitable for:
- NLP model training
- Language modeling
- Digital humanities
- Cultural analytics
- Search & retrieval systems
Publication: Mozilla Data Collective
The dataset will be published via the Mozilla Data Collective, ensuring:
- Open access
- Transparent licensing
- Long-term availability
- Community reuse
This aligns with our commitment to open knowledge infrastructure.
👉 datacollective.mozillafoundation.org/datasets
Licensing & Reuse
All texts included in this dataset originate from Openbook.gr, which hosts books that are legally available for free public access. We respect the original licensing terms of each work and provide:
- Full source attribution
- Original download links
- Original license information where available
The processed dataset will be released under an open data license via the Mozilla Data Collective, enabling:
- Research use
- Non-commercial reuse
- Model training
- Educational applications
Users are responsible for complying with the original licensing terms of individual books. Our role is to provide clean, machine-readable access to already open content, not to alter ownership or rights.
Why This Matters
Greek remains a low-resource language in NLP.
By releasing high-quality corpora:
- Researchers gain better training data
- Developers can build stronger models
- Cultural heritage becomes computationally accessible
This project contributes to Greek digital sovereignty and open AI ecosystems.
Tools & Infrastructure
- GlossAPI pipeline
- OCR processing
- Custom Python cleaning scripts
- Automated QA checks
- Open publication standards
👉 https://github.com/eellak/glossAPI
Lessons Learned
Working with large-scale OCR-derived text revealed several important insights:
- OCR quality varies dramatically depending on scan resolution, typography, and page layout. Even books from the same source can differ significantly in text quality.
- Greek presents unique challenges for OCR systems, especially with characters that visually resemble Latin letters (ο/o, α/a) and the dual sigma forms (σ/ς). Language-aware post-processing is essential.
- Automation is powerful but not enough on its own. While most errors were corrected programmatically, some files were beyond repair and required strict quality filtering.
- Systematic errors repeat across thousands of files. Once detected, building rule-based corrections produced massive improvements at scale.
- Private Use Area placeholders are a major hidden problem. They silently break downstream NLP pipelines and must be aggressively removed.
- Domain-specific content matters. Cookbooks introduced unique formatting issues (temperatures, measurements) that required custom logic.
- Open datasets require governance. Even when content is openly available, careful documentation, licensing checks, and transparency are necessary for responsible reuse.
These lessons will directly inform future OCR projects and improvements to the GlossAPI pipeline.
Closing
This project demonstrates how open libraries + open infrastructure can create powerful public datasets. We invite researchers, developers, and institutions to reuse, improve, and build upon this corpus.