Blog

Keynote: How Linguistics can help to eliminate unnecessary complexity in modern Natural Language Processing

Yesterday I had the pleasure of delivering a keynote talk at the Interdisciplinary Perspectives: Bridging Sociological Studies in the Digital Age conference, organised by the amazing Digital Humanities PhD students at King’s College London.

In the talk I argued that performance in a Natural Language Processing task, namely authorship analysis, can be significantly increased if insight and knowledge from (Cognitive) Linguistics is taken into consideration. Even though the field of NLP is now at a stage of wondering to what extent Linguistics is still needed, I make a case that the often-quoted statement that “Every time I fire a linguist, the performance of the system goes up” tends to apply to certain strands of Linguistics that do not engage with the study of real-life language usage. Instead, Cognitive Linguistic Usage-Based approaches to the study of language are very compatible with modern advances in NLP. As shown in this case study, the application of Cognitive Linguistics theoretical frameworks can lead to better trade-offs between computational complexity and performance, reducing the number of computational steps, processing time, and model parameters.

The slides of my talk are available here: https://doi.org/10.5281/zenodo.11583694.

Invited talk at the Corpus Linguistics Symposium: Style and Authorship

Tomorrow afternoon I’ll give a keynote talk at the Corpus Linguistics Symposium: Style and Authorship at the University of Leeds. My talk will be on how our new authorship verification method, LambdaG, can visualise the features that identify an author more transparently than the state of the art, using Dickens as a case study. The event is hybrid and the talk is going to be recorded. For more information, the webpage of the event is: https://www.latl.leeds.ac.uk/research-satellites/corpus-linguistics/clstyleauthorship/.

Authorship Verification based on the Likelihood Ratio of Grammar Models

I’m extremely excited to announce the pre-print of our new paper: “Authorship Verification based on the Likelihood Ratio of Grammar Models”

https://arxiv.org/abs/2403.08462v1
with Oren Halvani, Lukas Graner, Valerio Gherardi and Shunichi Ishihara.

This is a new authorship verification method based on the use of a type of language model we call a “Grammar Model”. A Grammar Model is an n-gram language model that only uses information about grammatical tokens in a text. The method is fundamentally based on calculating a score that we call LambdaG: the ratio of the likelihoods of the document under two Grammar Models, one for the author and one for the population, given by a reference corpus. Despite its simplicity, LambdaG outperforms other state-of-the-art algorithms, like the Impostors Method and a BERT Siamese neural network! This is after testing on 12 different corpora designed to reflect various kinds of authorship analysis challenges.
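To make the idea concrete, here is a minimal sketch of the likelihood-ratio score, not the paper’s actual implementation. It assumes documents have already been reduced to sequences of grammatical tokens (here, toy POS tags), and it uses a simple add-one-smoothed bigram model in place of the paper’s Grammar Models; the class and function names are illustrative only. The score is computed in log space, so positive values favour the candidate author and negative values favour the reference population.

```python
import math
from collections import Counter


def bigrams(tokens):
    """Return the list of adjacent token pairs in a sequence."""
    return [tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)]


class GrammarModel:
    """Toy stand-in for a Grammar Model: an add-one-smoothed bigram
    model over grammatical tokens (e.g. POS tags)."""

    def __init__(self, sequences, vocab):
        self.vocab = set(vocab)
        self.bigram_counts = Counter()
        self.context_counts = Counter()
        for seq in sequences:
            self.bigram_counts.update(bigrams(seq))
            self.context_counts.update(seq[:-1])  # each token that starts a bigram

    def log_likelihood(self, seq):
        """Log-probability of a token sequence under the model."""
        ll = 0.0
        for a, b in bigrams(seq):
            numerator = self.bigram_counts[(a, b)] + 1           # add-one smoothing
            denominator = self.context_counts[a] + len(self.vocab)
            ll += math.log(numerator / denominator)
        return ll


def lambda_g(doc, author_model, reference_model):
    """Log likelihood ratio of the document: author model vs. reference model."""
    return author_model.log_likelihood(doc) - reference_model.log_likelihood(doc)


# Toy example: the candidate author favours DET-NOUN-VERB patterns,
# the reference population favours VERB-DET-NOUN patterns.
vocab = {"DET", "NOUN", "VERB"}
author_corpus = [["DET", "NOUN", "VERB", "DET", "NOUN"]] * 3
reference_corpus = [["VERB", "DET", "NOUN", "VERB", "DET"]] * 3

author_model = GrammarModel(author_corpus, vocab)
reference_model = GrammarModel(reference_corpus, vocab)

questioned_doc = ["DET", "NOUN", "VERB", "DET", "NOUN"]
score = lambda_g(questioned_doc, author_model, reference_model)
print(score)  # positive: the document fits the author model better
```

Because everything is counts plus a log-ratio, each bigram’s contribution to the score can be inspected directly, which is the kind of manual exploration of features the method allows.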

This new method has several attractive properties: 1) it does not need a significant amount of data for training; 2) it is robust to topic differences and genre variation in the reference corpus; 3) its results can be manually explored; 4) it is grounded in Cognitive Linguistics.

This paper also demonstrates that in the age of Large Language Models, Small Language Models still have a role to play, especially when they are theoretically driven. Collaboration between NLP and Linguistics CAN lead to paradigm shifts, rather than just adding more data/parameters!