KB National Library

AI-based metadata extraction for the National Library collection

Summary

Turning physical books into structured digital records: an AI-powered tool that photographs, extracts, and validates metadata for tens of thousands of uncataloged titles.

The KB, the National Library of the Netherlands, collects every book, newspaper, and magazine published in the country. But gaps remain: a retro-collection of 40,000+ uncataloged titles, acquired through donations and second-hand purchases, needed to be checked, described, registered, and de-duplicated. We built a solution that automates metadata extraction using OCR and computer vision and matches the extracted metadata against the library's massive catalog system.

Industry

Government & Public Sector

Services

AI Experience Architectures, Products & Services

The Challenge

Tens of thousands of uncataloged books. A manual process that couldn't scale.

The KB's retro-collection contained tens of thousands of physical titles sitting in storage without proper digital records. For each book, staff needed to manually check whether it already existed in the catalog system, and if not, create a new entry with complete metadata: title, author, publisher, year, ISBN, and more. This process was entirely manual and extremely time-consuming.

Problem 01

Check if retro-collection books exist in the catalog

▪ Automated matching against the KB's massive catalog system
▪ Confidence scoring for each match
▪ Support for both modern and historical typography
▪ Integration with external book databases

Problem 02

Generate title descriptions and collect metadata automatically

▪ OCR-based text extraction from title pages
▪ Structured metadata capture (title, author, publisher, year)
▪ Handling of diverse designs, layouts, and typefaces across centuries
▪ De-duplication against the existing collection
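
To make "structured metadata capture" concrete, here is a minimal sketch of pulling two fields out of raw OCR text. It is an illustration only: the project combined OCR with form recognition models, whereas this example uses simple rules. The names `TitlePageFields`, `extract_fields`, and `is_valid_isbn13` are hypothetical, and the "first four-digit year" rule is an assumption that would misfire on many real title pages. The ISBN-13 check-digit rule itself is standard.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class TitlePageFields:
    """Hypothetical container for fields recovered from one page of OCR text."""
    year: Optional[int]
    isbn13: Optional[str]


def is_valid_isbn13(digits: str) -> bool:
    """Validate the ISBN-13 check digit (weights alternate 1 and 3)."""
    if len(digits) != 13 or not digits.isdigit():
        return False
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0


def extract_fields(ocr_text: str) -> TitlePageFields:
    # Assumption: the first plausible four-digit year (1500-2099) is the
    # publication year. Real title pages need a more careful rule.
    year_match = re.search(r"\b(1[5-9]\d{2}|20\d{2})\b", ocr_text)
    year = int(year_match.group(1)) if year_match else None

    # An ISBN-13 starts with 978 or 979, followed by ten more digits that
    # may be separated by hyphens or spaces.
    isbn = None
    for candidate in re.findall(r"97[89](?:[- ]?\d){10}", ocr_text):
        digits = re.sub(r"[- ]", "", candidate)
        if is_valid_isbn13(digits):
            isbn = digits
            break

    return TitlePageFields(year=year, isbn13=isbn)
```

Validating the check digit is a cheap way to reject OCR misreads before a value is used for matching. Older titles in a retro-collection will often have no ISBN at all, in which case the field simply stays empty.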

The Solution

We designed and built a three-component solution: a mobile data capture app, an OCR and computer vision extraction pipeline, and a results dashboard integrated with the library's catalog system.

Mobile data capture

A React-based mobile web application allows staff to photograph title pages and colophons directly from their phones. Photos are automatically uploaded to the backend for processing, with no manual file handling required.

OCR and computer vision extraction

Using OCR and computer vision, the system extracts key metadata fields from both modern and historical typography. Multiple recognition models were benchmarked and combined with form recognition AI to optimize accuracy across diverse book formats.
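
One common way to benchmark recognition models is character error rate (CER): the edit distance between a model's output and a hand-checked transcription, divided by the length of that transcription. The sketch below assumes CER as the metric, which is our illustration rather than a statement of the metric used in the project. The function names and model labels are hypothetical.

```python
from typing import Dict, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def rank_models(references: List[str],
                outputs_by_model: Dict[str, List[str]]) -> List[Tuple[str, float]]:
    """Return (model, mean CER) pairs, lowest error first."""
    scores = {}
    for model, outputs in outputs_by_model.items():
        cers = [character_error_rate(r, h) for r, h in zip(references, outputs)]
        scores[model] = sum(cers) / len(cers)
    return sorted(scores.items(), key=lambda item: item[1])
```

Running `rank_models` separately on modern and historical title pages would show whether one model handles both, or whether different models are better suited to each type of typography.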

Catalog matching and enrichment

Extracted metadata is cross-referenced against the library’s catalog system and external book databases to check for duplicates, assign confidence scores, and enrich records with additional data.
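
Confidence-scored matching can be sketched as a weighted string similarity across fields. This is a simplified illustration using Python's standard `difflib`, not the production matching logic. The field weights, the 0.85 threshold, and the function names are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

# Assumed weights: title and author carry most of the signal.
FIELD_WEIGHTS = {"title": 0.5, "author": 0.3, "publisher": 0.1, "year": 0.1}


def normalize(value) -> str:
    """Lowercase and collapse whitespace so OCR spacing noise matters less."""
    return " ".join(str(value).lower().split())


def field_similarity(a, b) -> float:
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_confidence(extracted: dict, record: dict) -> float:
    """Weighted similarity (0 to 1) over the fields present in both records."""
    total, weight_sum = 0.0, 0.0
    for field, weight in FIELD_WEIGHTS.items():
        if extracted.get(field) and record.get(field):
            total += weight * field_similarity(extracted[field], record[field])
            weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


def best_match(extracted: dict, catalog: List[dict],
               threshold: float = 0.85) -> Optional[Tuple[dict, float]]:
    """Return the best catalog record and its score, or None below the threshold."""
    scored = [(match_confidence(extracted, record), record) for record in catalog]
    if not scored:
        return None
    score, record = max(scored, key=lambda pair: pair[0])
    return (record, score) if score >= threshold else None
```

A result above the threshold would be shown to the cataloger as a likely duplicate, while `None` suggests the title needs a new entry. In both cases the score is a suggestion for an expert to confirm, not an automatic decision.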

Impact

From a manual, time-consuming cataloging process to a near-instant, AI-assisted workflow, preserving expert control while dramatically increasing throughput.

Cataloging Speed

What previously required manual research and data entry for each book is now handled in seconds through OCR and automated matching, speeding up the most time-consuming step of cataloging.

Expert Control Preserved

The tool augments catalogers rather than replacing them. All AI-generated suggestions are validated by experts before being committed to the catalog.

Enriched Metadata

Cross-referencing with public book databases improves confidence levels for title, author, and publisher data, with near-instant duplicate detection through confidence-scored matching.

Methods

A practical AI engineering approach combining mobile capture, OCR optimization, and intelligent catalog matching.

OCR

Computer vision

Form recognition

Model benchmarking

React

Mobile web app

Photo capture

Auto-upload

Catalog matching

Confidence scoring

External database enrichment

Solution architecture

Engineering

Project management
