LLM-Based Metadata Extraction from NATO Scanned Documents

Historical archives contain valuable evidence, but scanned documents are difficult to search when their metadata is incomplete or inconsistent. At the C4DHI Anniversary Workshop, I presented a workflow that uses large language models to extract structured metadata from scanned NATO archival documents. The talk focused on noisy OCR, multilingual records and the need to preserve evidence for human review.

Scanned archival documents are rarely ready for direct computational analysis. Layout, OCR errors, several languages and inconsistent cataloguing all stand between the image of a page and structured data that historians can search and compare.

In the workflow, large language models help extract titles, dates, institutions, archive codes, correspondence information and thematic tags from scanned NATO documents. The presentation covered both model accuracy and the surrounding software workflow needed to turn experimental extraction into a repeatable research tool.
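To make the idea of evidence-preserving extraction more concrete, here is a minimal Python sketch of what a structured record could look like. It is an illustration under assumptions, not the code shown in the talk: the names `ExtractedField`, `DocumentMetadata` and `parse_model_output`, the JSON shape expected from the model, and the verbatim-match rule for evidence are all hypothetical. It covers only a subset of the fields listed above (correspondence information is left out for brevity).

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical subset of the metadata fields named in the text.
FIELD_NAMES = ("title", "date", "institution", "archive_code")


@dataclass
class ExtractedField:
    """One metadata value plus the OCR snippet that supports it."""
    value: Optional[str]
    evidence: Optional[str] = None  # verbatim quote from the OCR text


@dataclass
class DocumentMetadata:
    title: ExtractedField
    date: ExtractedField
    institution: ExtractedField
    archive_code: ExtractedField
    thematic_tags: List[str] = field(default_factory=list)
    needs_review: List[str] = field(default_factory=list)


def parse_model_output(raw_json: str, ocr_text: str) -> DocumentMetadata:
    """Turn a model's JSON answer into a record and flag weak evidence.

    Assumes the model was asked to return, for each field, an object
    of the form {"value": ..., "evidence": ...}.
    """
    data = json.loads(raw_json)
    fields = {}
    needs_review = []
    for name in FIELD_NAMES:
        entry = data.get(name)
        if not isinstance(entry, dict):
            entry = {}  # missing or malformed field: treat as not extracted
        value = entry.get("value")
        evidence = entry.get("evidence")
        fields[name] = ExtractedField(value=value, evidence=evidence)
        # A value whose evidence is absent, or cannot be found verbatim
        # in the OCR text, is queued for a human to check.
        if value is not None and (not evidence or evidence not in ocr_text):
            needs_review.append(name)
    tags = [str(tag) for tag in data.get("thematic_tags", [])]
    return DocumentMetadata(**fields, thematic_tags=tags, needs_review=needs_review)
```

Keeping the supporting snippet next to each value means a reviewer can check a field against the page without rereading the whole document. The exact-substring check is deliberately strict, so with noisy OCR a fuzzy match might be a more forgiving choice.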

Figure: First page of the programme of the C4DHI Anniversary Workshop in Prague.

Abstract

The presentation focuses on transforming scanned NATO archival documents into structured, research-ready data. It shows how large language models can support extraction of metadata such as document titles, dates, institutions, archive codes, correspondence information and thematic tags. It also addresses practical challenges of noisy historical documents, multilingual archival material and efficient software design with Speckit.

