Human-in-the-loop tabular data extraction methods for historical climate data rescue
By: Bas Vercruysse, Julie M. Birkholz, Krishna Kumar Thirukokaranam Chandrasekar, Derrick Muheki, Wim Thiery, Hans Verbeeck, Koen Hufkens, Kim Jacobsen & Christophe Verbruggen

Historical weather records from the Congo Basin (1907–1960) are being rescued from fragile paper archives into digital data for climate research. These logs are handwritten tables of daily weather observations, such as temperature, humidity, and precipitation, often in fading ink and with varying layouts. The goal is to digitize these complex tables accurately and efficiently using a human-in-the-loop (HIL) workflow, which combines automatic algorithms with expert human corrections. This HIL approach recognizes that a one-size-fits-all Optical Character Recognition (OCR) solution won’t work for such diverse old handwriting, so it leverages both AI and human expertise to transcribe the data.

Computer Vision Approach
To tackle this task, we built a custom OCR validation dataset, CoBeCo, by semi-automatically annotating ten sample pages of varying quality drawn from the Congo Basin climate logbooks. This semi-automatic workflow greatly sped up the creation of ground-truth data for training and testing.
We evaluated a mix of open-source and commercial text recognition tools on the validation dataset:
- Open-Source OCR/HTR Engines: Tesseract (OCR for printed text) and PyLaia (an open-source handwriting recognition model). Both were fine-tuned on the new dataset, yielding a specialized “Tesseract-CoBeCo” model and a dedicated PyLaia model with dramatically improved accuracy.
- Vision-Language Model: Qwen-2-VL-7B, a 7-billion-parameter open vision-language model from Alibaba, was tested as an OCR engine. We used prompt engineering to instruct the model to extract the numbers from each table cell and evaluated how these prompts influenced the results (see the prompting sketch after this list).
- Commercial OCR Services: Amazon AWS Textract, Microsoft Azure AI Vision, and Google Document AI were evaluated as “all-in-one” solutions that detect table structure and text together (see the Textract sketch after this list). Additionally, Transkribus, a platform tailored to historical documents, was evaluated for comparison. These services were accessed via their cloud APIs and offer state-of-the-art handwriting recognition without local training.
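
As an illustration of the prompt-driven setup, here is a minimal sketch of querying Qwen-2-VL-7B through the Hugging Face transformers library. The checkpoint name is the public one; the prompt wording and image path are placeholders, not the exact prompts used in the study.

```python
# Minimal sketch: prompting Qwen-2-VL-7B to transcribe one table-cell crop.
# The prompt text and image path are illustrative placeholders.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "cell_crop.png"},  # one cropped table cell
        {"type": "text", "text": "Transcribe the handwritten number in this "
                                 "cell. Reply with digits only."},
    ],
}]

# Build the chat-formatted prompt and the pixel inputs the model expects.
text = processor.apply_chat_template(messages, tokenize=False,
                                     add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

# Greedy decoding keeps the output deterministic across evaluation runs.
output_ids = model.generate(**inputs, max_new_tokens=16)
answer = processor.batch_decode(output_ids[:, inputs["input_ids"].shape[1]:],
                                skip_special_tokens=True)[0]
print(answer)
```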
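For the commercial route, a single API call returns both the detected table structure and the text. Below is a minimal boto3 sketch against AWS Textract; the region, file name, and credential setup are assumptions, not the study’s exact configuration.

```python
# Minimal sketch: sending one scanned page to AWS Textract with table
# detection enabled. Region and file name are placeholders; credentials
# come from the usual AWS environment/configuration.
import boto3

client = boto3.client("textract", region_name="eu-west-1")

with open("logbook_page.png", "rb") as f:
    response = client.analyze_document(
        Document={"Bytes": f.read()},
        FeatureTypes=["TABLES"],  # request table structure, not just raw text
    )

# Textract returns a flat list of blocks (PAGE, TABLE, CELL, WORD, ...);
# each CELL block carries its row/column position in the detected table.
for block in response["Blocks"]:
    if block["BlockType"] == "CELL":
        print(block["RowIndex"], block["ColumnIndex"], block.get("Confidence"))
```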
Throughout the process, HIL methods were used for more than just data labeling. For instance, after the initial OCR stage, we applied domain knowledge as constraints, such as the fact that temperatures in the Congo cannot exceed certain limits, to flag and correct unlikely values. We also leveraged statistical checks, comparing the sum of individual observations against the totals reported in the source, to identify and rectify inconsistencies. This keeps the human expert in the loop at critical points, from preparing training annotations to validating the final extracted numbers.
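As a sketch of what such plausibility checks can look like in practice (the thresholds, column names, and tolerance below are illustrative assumptions, not the study’s exact rules):

```python
# Minimal sketch of rule-based plausibility checks; thresholds and column
# names are illustrative assumptions, not the study's exact constraints.
import pandas as pd

TEMP_MIN_C, TEMP_MAX_C = 5.0, 45.0  # plausible air-temperature range (assumed)

def flag_suspect_rows(df: pd.DataFrame, tol: float = 0.5) -> pd.DataFrame:
    """Flag rows for human review rather than silently 'correcting' them."""
    # Domain-knowledge check: value must fall inside the plausible range.
    out_of_range = ~df["temp_c"].between(TEMP_MIN_C, TEMP_MAX_C)
    # Statistical check: daily observations should add up to the total
    # written in the source table for that month.
    total_mismatch = (df.groupby("month")["temp_c"].transform("sum")
                      - df["reported_monthly_total"]).abs() > tol
    df = df.copy()
    df["needs_review"] = out_of_range | total_mismatch
    return df
```

Rows flagged this way are routed to a human expert, which is exactly where the HIL loop closes: automation does the bulk transcription, people handle the anomalies.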

Results
Our findings show a significant boost in accuracy when using advanced models and HIL strategies:
Newer vision-language models had a clear advantage: Qwen-2-VL far outperformed traditional OCR on these handwritten tables, achieving a significantly lower character error rate (CER) than either Tesseract or PyLaia. Different prompt styles led to up to 11% variation in CER, underscoring the HIL principle that humans can steer AI effectively without retraining it.
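For reference, CER is the character-level edit distance (substitutions + deletions + insertions) divided by the length of the ground truth. A minimal computation with the jiwer library, on toy strings:

```python
# CER = (substitutions + deletions + insertions) / reference length.
# The reference and hypothesis strings are toy examples.
from jiwer import cer

reference  = "23.4 21.7 25.1"   # ground-truth transcription of a row
hypothesis = "28.4 21.7 25.1"   # model output with one wrong digit

print(cer(reference, hypothesis))  # 1 edit / 14 chars ≈ 0.07
```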
Among the all-in-one services, AWS Textract was the top performer for table-cell detection, with Microsoft’s Azure AI Vision close behind and Google’s system lagging. For reading the numbers themselves, all three were comparable, with Google slightly ahead in accuracy once it actually found a number. Transkribus produced results comparable to these state-of-the-art models.
Finally, involving humans at key points lowers error rates beyond what automation alone can achieve, without the full manual labor of transcribing everything by hand. For instance, when a model misread a value, a human in the loop could catch it by checking consistency (for example, recalculating a daily mean temperature or comparing a dubious digit against its neighbors). These HIL post-processing steps, combining common sense, simple rules, and targeted manual edits, raise data quality significantly and strike a balance between efficiency and accuracy.
In short, the best outcomes came from AI and human cooperation, not AI alone!
Read the full publication in Springer Nature here
🔍 Your Research Matters — Let’s Share It!
Have you used VSC’s computing power in your research? Did our infrastructure support your simulations, data analysis, or workflow?
We’d love to hear about it!
Take part in our #ShareYourSuccess campaign and show how VSC helped move your research forward. Whether it’s a publication, a project highlight, or a visual from your work, your story can inspire others.
🖥️ Be featured on our website and social media. Show the impact of your work. Help grow our research community.
📬 Submit your story: https://www.vscentrum.be/sys