
KB National Library
AI-based metadata extraction for the National Library collection
Summary
Turning physical books into structured digital records: an AI-powered tool that photographs, extracts, and validates metadata for tens of thousands of uncataloged titles.
The KB, the National Library of the Netherlands, collects every book, newspaper, and magazine published in the country. But gaps remain: a retro-collection of 40,000+ uncataloged titles acquired through donations and second-hand purchases needed to be checked, described, registered, and de-duplicated. We built a solution that automates metadata extraction using OCR and computer vision and matches the extracted metadata against the library's massive catalog system.
The Challenge
Tens of thousands of uncataloged books. A manual process that couldn't scale.
The KB's retro-collection contained tens of thousands of physical titles sitting in storage without proper digital records. For each book, staff needed to manually check whether it already existed in the catalog system, and if not, create a new entry with complete metadata: title, author, publisher, year, ISBN, and more. This process was entirely manual and extremely time-consuming.
Problem 01
Check if retro-collection books exist in the catalog
- Automated matching against the KB's massive catalog system
- Confidence scoring for each match
- Support for both modern and historical typography
- Integration with external book databases
Problem 02
Generate title descriptions and collect metadata automatically
- OCR-based text extraction from title pages
- Structured metadata capture (title, author, publisher, year)
- Handling of diverse designs, layouts, and typefaces across centuries
- De-duplication against the existing collection
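To make the idea of structured metadata capture concrete, here is a minimal sketch of how an ISBN might be pulled out of raw OCR text and checked before it is stored. The `TitleRecord` shape, the function names, and the regular expression are illustrative assumptions, not the project's actual code; only the ISBN-10 and ISBN-13 checksum rules are standard.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class TitleRecord:
    """Hypothetical record shape; the field names are illustrative."""
    title: str
    author: Optional[str] = None
    publisher: Optional[str] = None
    year: Optional[int] = None
    isbn: Optional[str] = None


def is_valid_isbn10(digits: str) -> bool:
    # ISBN-10: weights 10..1, sum divisible by 11; 'X' means 10 and may only come last.
    if not re.fullmatch(r"\d{9}[\dX]", digits):
        return False
    total = sum((10 - i) * (10 if ch == "X" else int(ch)) for i, ch in enumerate(digits))
    return total % 11 == 0


def is_valid_isbn13(digits: str) -> bool:
    # ISBN-13: alternating weights 1 and 3, sum divisible by 10.
    if not re.fullmatch(r"\d{13}", digits):
        return False
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(digits))
    return total % 10 == 0


def find_isbn(ocr_text: str) -> Optional[str]:
    """Return the first checksum-valid ISBN in the OCR text, without separators."""
    # Assumed pattern: runs of digits, hyphens, and spaces that could hold an ISBN.
    for candidate in re.findall(r"[\dXx][\dXx\- ]{8,16}[\dXx]", ocr_text):
        digits = re.sub(r"[\- ]", "", candidate).upper()
        if is_valid_isbn13(digits) or is_valid_isbn10(digits):
            return digits
    return None


def build_record(title: str, ocr_text: str) -> TitleRecord:
    """Create a record with the ISBN filled in only when the checksum passes."""
    return TitleRecord(title=title, isbn=find_isbn(ocr_text))
```

Validating the checksum is a cheap way to catch OCR misreads: a single wrong digit makes the number fail, so the field stays empty for a cataloger to fill in rather than storing a bad identifier.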
The Solution
We designed and built a three-component solution: a mobile data capture app, an OCR and computer vision extraction pipeline, and a results dashboard integrated with the library's catalog system.
01
Mobile data capture
A React-based mobile web application allows staff to photograph title pages and colophons directly from their phones. Photos are automatically uploaded to the backend for processing, with no manual file handling required.
02
OCR and computer vision extraction
Using OCR and computer vision, the system extracts key metadata fields from pages set in both modern and historical typefaces. Multiple recognition models were benchmarked and combined with form recognition AI to optimize accuracy across diverse book formats.
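One common way to benchmark recognition models is character error rate (CER): the edit distance between a hand-checked transcription and the model output, divided by the length of the transcription. The sketch below assumes that metric; the case study does not say which measure was used, and the engine names in the usage example are placeholders.

```python
from typing import Dict, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


def rank_engines(
    references: List[str], outputs_by_engine: Dict[str, List[str]]
) -> List[Tuple[str, float]]:
    """Return (engine, mean CER) pairs, lowest error first.

    Each engine's output list is assumed to be aligned with `references`.
    """
    scores = []
    for engine, outputs in outputs_by_engine.items():
        cers = [character_error_rate(r, h) for r, h in zip(references, outputs)]
        scores.append((engine, sum(cers) / len(cers)))
    return sorted(scores, key=lambda s: s[1])


# Usage with placeholder engine names and one hand-checked title page.
ranking = rank_engines(
    ["De ontdekking van de hemel"],
    {
        "engine_a": ["De ontdekking van de hemel"],
        "engine_b": ["De ontdekkin van de hemeI"],
    },
)
```

Running the same comparison separately on modern and historical samples would show whether one model is better across the board or whether different models suit different typefaces.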
03
Catalog matching and enrichment
Extracted metadata is cross-referenced against the library’s catalog system and external book databases to check for duplicates, assign confidence scores, and enrich records with additional data.
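As an illustration of confidence-scored matching, the sketch below compares an extracted record with catalog candidates field by field and combines the similarities into one score. The weights, field names, and the use of Python's standard `difflib.SequenceMatcher` are assumptions made for this example; the matching logic used in the project is not described in the case study.

```python
import unicodedata
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

# Illustrative weights: title and author are assumed to carry most of the signal.
WEIGHTS = {"title": 0.5, "author": 0.3, "publisher": 0.1, "year": 0.1}


def normalize(text: str) -> str:
    """Lowercase, strip diacritics and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    kept = "".join(ch if ch.isalnum() else " " for ch in stripped.lower())
    return " ".join(kept.split())


def field_similarity(a, b) -> float:
    """Similarity in [0, 1] between two field values after normalization."""
    return SequenceMatcher(None, normalize(str(a)), normalize(str(b))).ratio()


def match_confidence(extracted: dict, candidate: dict) -> float:
    """Weighted similarity over the fields that both records actually contain."""
    total, weight_sum = 0.0, 0.0
    for field, weight in WEIGHTS.items():
        if extracted.get(field) and candidate.get(field):
            total += weight * field_similarity(extracted[field], candidate[field])
            weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


def best_match(extracted: dict, catalog: List[dict]) -> Tuple[float, Optional[dict]]:
    """Return (confidence, record) for the closest catalog entry, or (0.0, None)."""
    scored = [(match_confidence(extracted, c), c) for c in catalog]
    return max(scored, key=lambda s: s[0], default=(0.0, None))
```

Fields missing from either record are skipped rather than counted as mismatches, so a title page without a printed year can still match strongly on title and author.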
Impact
From a manual, time-consuming cataloging process to a near-instant, AI-assisted workflow, preserving expert control while dramatically increasing throughput.
Cataloging Speed
What previously required manual research and data entry for each book is now handled in seconds through OCR and intelligent matching, speeding up the most time-consuming step of cataloging.
Expert Control Preserved
The tool augments catalogers rather than replacing them. All AI-generated suggestions are validated by experts before being committed to the catalog.
Enriched Metadata
Cross-referencing with public book databases raises confidence in title, author, and publisher data, while confidence-scored matching detects duplicates almost instantly.
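One simple way to picture how cross-referencing can raise confidence in a field is a consensus check: the more independent sources agree on a value, the more it can be trusted. The sketch below is a hypothetical illustration with made-up source names; it is not a description of how the project computes its confidence levels.

```python
from collections import Counter
from typing import Dict, Optional, Tuple


def _key(value: str) -> str:
    """Comparison key that ignores case and extra whitespace."""
    return " ".join(value.casefold().split())


def field_consensus(values_by_source: Dict[str, Optional[str]]) -> Tuple[Optional[str], float]:
    """Return (most common value, share of sources that agree) for one field.

    Sources with no value for the field are ignored.
    """
    present = {src: v for src, v in values_by_source.items() if v}
    if not present:
        return None, 0.0
    counts = Counter(_key(v) for v in present.values())
    winner_key, votes = counts.most_common(1)[0]
    # Report the winner in the spelling of the first source that supplied it.
    winner = next(v for v in present.values() if _key(v) == winner_key)
    return winner, votes / len(present)


# Usage with placeholder source names.
author, agreement = field_consensus(
    {"ocr": "Harry Mulisch", "database_a": "harry  mulisch", "database_b": "H. Mulisch"}
)
```

A low agreement share is a useful signal in its own right: it flags the fields a cataloger should look at first during validation.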
Methods
A practical AI engineering approach combining mobile capture, OCR optimization, and intelligent catalog matching.
OCR
Computer vision
Form recognition
Model benchmarking
React
Mobile web app
Photo capture
Auto-upload
Catalog matching
Confidence scoring
External database enrichment
Solution architecture
Engineering
Project management