A systematic evaluation of Dutch large language models’ surprisal estimates in sentence, paragraph and book reading
- Rafal Tekreeti
- Sep 22
By: Sam Boeve & Louisa Bogaerts

Everyday life involves a lot of reading. Almost 10% of all the words we encounter reach us through print, which comes down to a daily diet of almost 100,000 printed words (Bohn & Short, 2009). Of those 100,000, some are much harder to read, not because they are uncommon or complex, but simply because they are unexpected in their context. This is exactly what you experience when reading a sentence such as “The young nervous paratrooper jumped out of the chair”, where the word chair is unpredictable compared to, for example, plane.
Psycholinguistic work has converged on three main characteristics that predict how long a reader will look at a word before moving on. The big three are word length, frequency and predictability. For a long time, predictability remained the most elusive of the three, likely because a word’s predictability is very difficult to determine. Early work used human raters; other studies ran statistical analyses on the co-occurrence counts of words (i.e., N-gram models). Later came the neural network models: recurrent neural networks, long short-term memory networks, etc. These methods worked well but had several shortcomings that limited their use in research. The real breakthrough came with the introduction of transformer models in 2017 (Vaswani et al., 2017). This powerful class of language models could capture linguistic patterns with unprecedented accuracy, making them extremely good at predicting the next word and thus also at estimating a word’s predictability. Soon, they were deployed in psycholinguistic studies on the word predictability effect, and with great success.
These models were shown to be better than older methods at modelling the processing difficulty caused by a word’s (un)predictability (de Varda et al., 2023; Merkx & Frank, 2021; Shain et al., 2024). In these studies, a few trends became apparent. First, as the models grew larger, their ability to predict reading times didn’t scale proportionally: the smaller transformer models (in number of parameters) worked better. Second, the effect of predictability turned out to be logarithmic. For very unpredictable words, a further decrease in predictability has a much larger effect on reading times than the same decrease for highly predictable words. Most of these studies focused on English, and the few that did address other languages used multilingual models, a special class trained on more than one language simultaneously. Here, we wanted to expand the scope of predictability research and evaluate Dutch language models in terms of their psycholinguistic predictive power. Put differently, we asked which Dutch language models’ predictability estimates align best with human reading times. We visualized the surprisal values (lower surprisal = higher predictability) on sample texts for each corpus, available here: https://wordpredictabilityvisualized.vercel.app/
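Surprisal, as used in this line of research, is simply the negative log of a word’s conditional probability given its preceding context, so that unlikely words get high values. A minimal sketch in pure Python, with hypothetical probabilities standing in for the values a language model would assign:

```python
import math

def surprisal(prob: float) -> float:
    """Surprisal in bits: -log2 of the word's conditional probability.
    Lower surprisal = higher predictability."""
    return -math.log2(prob)

# Hypothetical probabilities for the paratrooper example:
# P("plane" | context) is high, P("chair" | context) is low.
print(surprisal(0.5))    # predictable continuation: 1 bit
print(surprisal(0.001))  # unpredictable continuation: ~10 bits
```

In the actual studies these probabilities come from a language model’s next-word distribution; the numbers above are only illustrative.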
Key findings
The inverse scaling trend (i.e., smaller models predicting reading times better) generalizes to Dutch models.
The amount of context has an impact on this effect. When predicting the reading times of participants reading an entire book (GECO corpus), the larger models do better.
In general, language-specific models perform better than their multilingual counterparts.
Using Dutch models, the effect of predictability on reading times is also shown to be logarithmic.
Overall, when modelling reading times using an open-source language model, gpt2-small-dutch (de Vries & Nissim, 2020) is a great option.
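The logarithmic effect in the findings above means reading time grows linearly with surprisal rather than with raw probability. A small sketch (the slope of 20 ms per bit is a made-up value for illustration) shows the consequence: halving a word’s probability adds the same predicted slowdown whether the word was likely or unlikely to begin with, so the same absolute drop in probability hurts far more at the low end:

```python
import math

def predicted_slowdown(prob: float, slope_ms_per_bit: float = 20.0) -> float:
    """Under a logarithmic linking function, predicted reading-time cost
    is linear in surprisal (-log2 p). Slope is hypothetical."""
    return slope_ms_per_bit * -math.log2(prob)

# Equal probability *ratios* give equal reading-time increments:
print(predicted_slowdown(0.4) - predicted_slowdown(0.8))      # 20 ms
print(predicted_slowdown(0.002) - predicted_slowdown(0.004))  # 20 ms
```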

How did VSC contribute to our work?
Although the models deployed in this work are relatively small by current standards (< 10 billion parameters), downloading and running inference on them requires computing power not available on a local machine. The VSC made it possible to use these models to generate predictability estimates for large amounts of text.
Reference
Boeve, S., & Bogaerts, L. (2025). A systematic evaluation of Dutch large language models’ surprisal estimates in sentence, paragraph and book reading. Behavior Research Methods, 57(9), 266. https://doi.org/10.3758/s13428-025-02774-4
🔍 Your Research Matters — Let’s Share It!
Have you used VSC’s computing power in your research? Did our infrastructure support your simulations, data analysis, or workflow?
We’d love to hear about it!
Take part in our #ShareYourSuccess campaign and show how VSC helped move your research forward. Whether it’s a publication, a project highlight, or a visual from your work, your story can inspire others.
🖥️ Be featured on our website and social media. Show the impact of your work. Help grow our research community.
📬 Submit your story: https://www.vscentrum.be/sys