The Creation of the Academic Knowledge Corpus

Abstract

The present dataset constitutes a high-quality text corpus derived from Greek doctoral dissertations, accompanied by their respective metadata. It includes 55,423 records covering the period 1975–2025, representing the largest unified corpus of Greek academic writing constructed to date for Natural Language Processing (NLP) purposes. The collection and processing pipeline involved a multi-layered procedure comprising scraping, text extraction, modern OCR methods, conversion to markdown, and large-scale GPU-accelerated cleaning.
The paper documents the methodology, pipeline architecture, computational infrastructure, and quality assurance procedures, while also discussing the limitations and potential applications of the corpus in contemporary large language model (LLM) development.

1. Introduction

The creation of large, clean, and well-documented text corpora is a critical prerequisite for the development of high-performance language models, particularly for low-resource languages such as Greek. Despite the significant academic output in Greece, the corresponding texts remain scattered across repositories with heterogeneous structures and inconsistent metadata quality.

In this context, the glossAPI team undertook the construction of a unified, technically homogenized and fully processed corpus of doctoral dissertations from Greek universities, with the aim of:

  • ensuring high collection coverage and metadata validity,
  • providing clean text suitable for NLP workflows and LLM training,
  • creating a structurally consistent markdown corpus,
  • ensuring transparency throughout the processing pipeline.

2. Dataset Description

2.1 Core Information

The final dataset contains 55,423 dissertations spanning 1975–2025, along with 26 structured metadata fields. It also includes a 200 MB parquet file storing all metadata and pointers, as well as two derivative corpora of 6.5B and 12B tokens, generated using the nlpaueb/bert-base-greek-uncased-v1 tokenizer.
The corpus forms a coherent body of scientific writing originating from a wide range of Greek academic institutions and exhibits substantial thematic diversity.
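As an illustration, the token counts reported above can be reproduced per document with the named tokenizer. The sketch below is hypothetical in its file path (the corpus layout is not specified here); only the tokenizer name comes from the dataset description.

```python
# Minimal sketch: count tokens in one corpus document using the tokenizer
# named above. The markdown filename is a hypothetical placeholder.
from pathlib import Path
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")

def count_tokens(md_path: Path) -> int:
    """Return the number of tokens the corpus tokenizer assigns to one document."""
    text = md_path.read_text(encoding="utf-8")
    return len(tokenizer.encode(text, add_special_tokens=False))

print(count_tokens(Path("thesis_0001.md")))  # hypothetical filename
```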


2.2 Data Sources

The primary data source is OpenArchives.gr, supplemented by several smaller datasets.
OpenArchives.gr is a major repository of Greek academic and scientific texts from universities, research institutes, and libraries in Greece and Cyprus. It provides cross-repository search capabilities for theses, dissertations, articles, studies, and other scholarly materials, supporting open access to Greek scientific knowledge.

2.3 Metadata Structure and Content

Each dataset entry contains a rich set of metadata describing various aspects of the dissertation:

  • Identification fields: handle, internal identifier, DOI, and repository URL.
  • Bibliographic fields: titles in Greek and English, author names, publication year, and document language.
  • Academic documentation: university, school/department, supervisor, examination committee members, and date of approval.
  • Subject classification: a three-level scientific taxonomy and associated keywords.
  • Content-related fields: abstracts in Greek and English, dissertation length, number of references, PDF filename, and link to the original repository.
  • Licensing information: the licenses displayed in Open Archives.
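For concreteness, the metadata parquet can be queried with standard tooling. In the sketch below, the filename and column names (`year`, `language`, `title`, `university`) are hypothetical stand-ins, since the exact schema is not listed verbatim here.

```python
# Hypothetical sketch: filter dissertations by year and language from the
# metadata parquet. Filename and column names are assumptions, not the schema.
import pandas as pd

df = pd.read_parquet("metadata.parquet")  # hypothetical filename

recent_greek = df[(df["year"] >= 2015) & (df["language"] == "el")]
print(recent_greek[["title", "university", "year"]].head())
```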

3. Data Collection and Processing Methodology

3.1 Dataset Construction Workflow

The construction of the dataset was organized into three major technical domains:
source identification and scraping, text extraction from PDFs, and markdown conversion with OCR and cleaning. Source identification and scraping were conducted outside AWS. Following collection, all PDFs were processed through embedded text extraction, with OCR applied where the content consisted of scanned images or otherwise lacked a readable text layer.
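The scraping component itself is not described in detail here. As a hedged illustration, repositories of the kind aggregated by OpenArchives.gr commonly expose the OAI-PMH protocol, so a harvester along the following lines could collect records; the endpoint URL is a placeholder, not the actual one used.

```python
# Illustrative OAI-PMH harvester: streams Dublin Core records from a repository
# endpoint, following resumption tokens. The endpoint URL is a placeholder.
import requests
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def harvest(base_url: str):
    """Yield (title, identifiers) pairs from an OAI-PMH ListRecords stream."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    while True:
        resp = requests.get(base_url, params=params, timeout=60)
        resp.raise_for_status()
        root = ET.fromstring(resp.content)
        for rec in root.iterfind(".//oai:record", NS):
            title = rec.findtext(".//dc:title", default="", namespaces=NS)
            ids = [i.text for i in rec.iterfind(".//dc:identifier", NS)]
            yield title, ids
        token = root.find(".//oai:resumptionToken", NS)
        if token is None or not (token.text or "").strip():
            break
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

for title, ids in harvest("https://repository.example.gr/oai/request"):
    print(title, ids)
```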

3.2 Text Extraction from PDF

Depending on the nature of each PDF, two approaches were used:
embedded text extraction for text-based PDFs, or Optical Character Recognition (OCR) for image-based PDFs.
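The routing decision can be made with a simple heuristic: if the first few pages already yield a substantial embedded text layer, the PDF is treated as text-based; otherwise it is sent to OCR. The sketch below uses pypdf for this check; the thresholds are illustrative assumptions, and the pipeline's actual criterion is not documented here.

```python
# Heuristic triage between embedded-text extraction and OCR, using pypdf.
# sample_pages and min_chars are illustrative values, not the pipeline's own.
from pypdf import PdfReader

def has_text_layer(pdf_path: str, sample_pages: int = 5, min_chars: int = 200) -> bool:
    """Return True if the first pages expose enough embedded text to skip OCR."""
    reader = PdfReader(pdf_path)
    chars = 0
    for page in reader.pages[:sample_pages]:
        chars += len((page.extract_text() or "").strip())
        if chars >= min_chars:
            return True
    return False
```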

Initial OCR was performed using Tesseract; however, the introduction of DeepSeek OCR resulted in significantly improved accuracy, especially for scientific symbols, polytonic Greek text, and complex layouts. Consequently, all PDFs were reprocessed from scratch using DeepSeek OCR.

3.3 PDF-to-Markdown Conversion with Docling

For text-based PDFs, Docling was employed to generate consistent markdown output, reduce noise, preserve core document structure (headings, subsections), and provide homogenized results across large batches.
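Based on Docling's documented Python interface, the conversion step for a single text-based PDF can be sketched roughly as follows; the batching and configuration used in the actual pipeline are not specified here.

```python
# Sketch of the Docling conversion step, following its documented basic usage.
from pathlib import Path
from docling.document_converter import DocumentConverter

converter = DocumentConverter()

def pdf_to_markdown(pdf_path: Path, out_dir: Path) -> Path:
    """Convert one text-based PDF to markdown and write the result to disk."""
    result = converter.convert(str(pdf_path))
    out_path = out_dir / (pdf_path.stem + ".md")
    out_path.write_text(result.document.export_to_markdown(), encoding="utf-8")
    return out_path
```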

In cases requiring OCR, markdown conversion occurred in the second stage of the processing pipeline.

3.4 OCR, Cleaning, and Homogenization via glossAPI

The final stages of processing were performed using a customized glossAPI pipeline designed to run in a highly parallelized GPU environment, drastically reducing total processing time.
All processing except scraping was executed on an AWS g5.12xlarge instance, whose four NVIDIA A10G GPUs supported concurrent OCR, markdown conversion, and data cleaning.
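A common way to exploit the four GPUs is to shard the input list across one worker process per GPU, pinning each worker to its device via CUDA_VISIBLE_DEVICES. The sketch below illustrates that pattern only; `ocr_batch` and its module are hypothetical stand-ins, not actual glossAPI functions.

```python
# Illustrative GPU sharding: one worker process per A10G, pinned via
# CUDA_VISIBLE_DEVICES. ocr_batch is a hypothetical placeholder function.
import os
from concurrent.futures import ProcessPoolExecutor

NUM_GPUS = 4  # a g5.12xlarge provides four NVIDIA A10G GPUs

def worker(args):
    gpu_id, pdf_batch = args
    # Pin this process to one GPU before any CUDA-backed library initializes.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    from ocr_step import ocr_batch  # hypothetical module wrapping the OCR step
    return ocr_batch(pdf_batch)

def run_parallel(pdf_paths):
    """Shard PDFs round-robin across GPUs and process the shards concurrently."""
    shards = [pdf_paths[i::NUM_GPUS] for i in range(NUM_GPUS)]
    with ProcessPoolExecutor(max_workers=NUM_GPUS) as pool:
        results = list(pool.map(worker, enumerate(shards)))
    return [item for shard in results for item in shard]
```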

3.5 Quality Assurance

Quality assurance procedures included sample-based OCR accuracy measurement, detection and correction of encoding errors, metadata consistency checks, tokenization stability evaluation, and verification of markdown uniformity.
The combined use of Docling and DeepSeek OCR led to significantly improved overall quality.
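One concrete form of sample-based OCR accuracy measurement is the character error rate (CER): the edit distance between OCR output and a manually verified reference transcript, normalized by the reference length. A self-contained sketch follows; the actual QA tooling is not specified in this description.

```python
# Character error rate: Levenshtein distance between OCR output and a
# hand-checked reference transcript, normalized by reference length.
def char_error_rate(reference: str, hypothesis: str) -> float:
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return prev[n] / max(m, 1)

assert char_error_rate("θεσσαλονίκη", "θεσαλονίκη") < 0.1  # one deletion ≈ 9% CER
```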

4. Repository Hosting the Academic Knowledge Corpus

The glossAPI team has recently initiated a collaboration with the Mozilla Foundation to support the hosting and distribution of Greek AI-ready datasets. As part of this collaboration, the Mozilla Data Collective will host a dedicated subset of the dataset, specifically the PhD Theses Corpus (PTC).

5. Limitations and Future Research

This first attempt to construct the Academic Knowledge Corpus was subject to certain limitations. Some entries lacked complete licensing information, restricting the full public release of the corpus. In addition, several older PDFs suffered from quality degradation, affecting OCR performance. Finally, the thematic distribution naturally reflects the academic output of Greek universities and is therefore not uniformly balanced.

5.1 Future Research

The current effort to assemble a unified, homogenized, and technically refined corpus of Greek doctoral dissertations lays the groundwork for a continuously evolving data ecosystem that can expand in multiple directions.

Future developments may include enriching the corpus with additional types of academic texts. Incorporating further scholarly material could significantly increase coverage of Greek academic output and generate data volumes suitable for training large-scale language models.

The corpus already serves as a critical resource for LLM training. Future research may explore optimal tokenization strategies for scientific discourse, as well as investigate how different data subsets influence model performance across tasks such as summarization, entity extraction, and question answering (QA).