The Challenge

The National Library of the Netherlands (KB) is tasked with comprehensively collecting all publications related to the Netherlands, from medieval works to today’s.

Preserving the nation’s memory means handling a continuous influx of materials that are published or discovered on a daily basis. The result is a long, labour-intensive manual process of checking and updating archival (meta)data.

Our Solution

Using an Optical Character Recognition (OCR) engine, we built a web interface that simplifies and speeds up the classification task KB employees face on a daily basis. All they have to do is take a picture of the book, and our engine checks whether the book already exists in the collection. If it does not, our proof of concept (PoC) extracts the required metadata, such as the author, title, subtitle, and year of publication, and prepares it to be added to the library system.
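As an illustration of the "already in the collection?" check, here is a minimal sketch that compares an OCR-extracted author and title against catalogue records using fuzzy string matching, so that small OCR errors do not cause a miss. The function names, the record layout, and the 0.85 threshold are all assumptions made for this example; the PoC's actual lookup may work differently.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    kept = "".join(ch if ch.isalnum() else " " for ch in no_accents.lower())
    return " ".join(kept.split())


def find_in_catalogue(title, author, catalogue, threshold=0.85):
    """Return the best-matching record, or None if nothing is close enough.

    `catalogue` is a hypothetical list of dicts with 'author' and 'title' keys;
    `threshold` is an illustrative cut-off, not a tuned value.
    """
    query = normalize(f"{author} {title}")
    best_record, best_score = None, 0.0
    for record in catalogue:
        candidate = normalize(f"{record['author']} {record['title']}")
        score = SequenceMatcher(None, query, candidate).ratio()
        if score > best_score:
            best_record, best_score = record, score
    return best_record if best_score >= threshold else None


catalogue = [
    {"author": "Multatuli", "title": "Max Havelaar"},
    {"author": "Anne Frank", "title": "Het Achterhuis"},
]
# An OCR misread ("1" instead of "l") still matches the existing record.
match = find_in_catalogue("Max Have1aar", "MULTATULI", catalogue)
missing = find_in_catalogue("De Avonden", "Gerard Reve", catalogue)
```

A linear scan like this only suits a small demonstration; a real catalogue would presumably need an indexed search.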

Highlights

The materials KB deals with span multiple centuries. To account for the differences between time periods, we are currently benchmarking the performance of three different approaches:

OCR

Running OCR over the image and applying our custom rules to the recognised text.

OCR + LLM

First running OCR over the image, then asking ChatGPT which of the recognised text fragments is the author, the title, and so on.

A computer vision model

Using the image as the input to the model and outputting the fields directly, combining OCR and field selection into a single step.
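To make the first approach more concrete, the sketch below shows what rule-based field selection over OCR output could look like. It assumes the OCR engine returns text lines with their pixel height. The three rules (a four-digit number is the year, a line starting with "door" or "by" names the author, the tallest remaining line is the title) are hypothetical examples, not the custom rules used in the PoC.

```python
import re

# Illustrative rules only; real title pages need period-specific handling.
YEAR_PATTERN = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")
AUTHOR_PREFIX = re.compile(r"^(door|by)\s+", re.IGNORECASE)


def extract_fields(ocr_lines):
    """Pick title, author and year from OCR lines.

    `ocr_lines` is a list of (text, height_in_pixels) tuples in reading order,
    an assumed shape for the OCR output.
    """
    fields = {"title": None, "author": None, "year": None}
    remaining = []
    for text, height in ocr_lines:
        text = text.strip()
        if not text:
            continue
        year_match = YEAR_PATTERN.search(text)
        if year_match and fields["year"] is None:
            fields["year"] = int(year_match.group(1))
            continue
        if AUTHOR_PREFIX.match(text) and fields["author"] is None:
            fields["author"] = AUTHOR_PREFIX.sub("", text)
            continue
        remaining.append((text, height))
    if remaining:
        # The largest type on the page is assumed to be the title.
        fields["title"] = max(remaining, key=lambda item: item[1])[0]
    return fields


page = [("MAX HAVELAAR", 48), ("door Multatuli", 20), ("Amsterdam, 1860", 14)]
result = extract_fields(page)
```

Rules like these are brittle. A title that itself contains a year, for example, would be misread as the publication year, which is one reason to benchmark this approach against the LLM and computer vision alternatives.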

Conclusion

Using character recognition and a set of ML models, we significantly reduced the time it takes KB’s employees to check whether a book already exists in their archives. This enabled the National Library to keep up with the vast influx of information published or discovered on a daily basis and to keep the nation’s memory up to date.
