Authorship Verification based on the Likelihood Ratio of Grammar Models

I’m extremely excited to announce the pre-print of our new paper: “Authorship Verification based on the Likelihood Ratio of Grammar Models”

https://arxiv.org/abs/2403.08462v1
with Oren Halvani, Lukas Graner, Valerio Gherardi and Shunichi Ishihara.

This is a new authorship verification method based on a type of language model we call a “Grammar Model”. A Grammar Model is an n-gram language model that only uses information about the grammatical tokens in a text. The method is fundamentally based on calculating a score that we call LambdaG: the ratio of the likelihoods that two Grammar Models assign to the document, one model for the author and one for the population, as represented by a reference corpus. Despite its simplicity, LambdaG outperforms other state-of-the-art algorithms, like the Impostors Method and a BERT Siamese neural network! This result holds across tests on 12 different corpora designed to reflect various kinds of authorship analysis challenges.
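To make the idea of a likelihood ratio between two Grammar Models more concrete, here is a minimal Python sketch. It is not the implementation from the paper: the function-word list, the `mask` helper, the `BigramModel` class, the `lambda_g` function and the add-one smoothing are all illustrative assumptions, chosen only to keep the example short. It masks content words, fits one bigram model on the author's text and one on a reference sample, and sums the log-likelihood ratio over the sentences of the questioned document.

```python
import math
from collections import Counter

# Toy list of grammatical tokens (an assumption for illustration only).
FUNCTION_WORDS = {"the", "of", "and", "a", "to", "in", "that", "it", "is", "was"}


def mask(sentence):
    """Keep function words; replace every other word with a placeholder."""
    return [w if w in FUNCTION_WORDS else "*" for w in sentence.lower().split()]


def bigrams(tokens):
    """Return adjacent token pairs, padded with sentence boundary markers."""
    padded = ["<s>"] + list(tokens) + ["</s>"]
    return list(zip(padded, padded[1:]))


class BigramModel:
    """Bigram model with add-one smoothing (a stand-in, not the paper's choice)."""

    def __init__(self, sentences):
        # Vocabulary: function words, the placeholder, and two boundary markers.
        self.vocab_size = len(FUNCTION_WORDS) + 3
        self.pair_counts = Counter()
        self.context_counts = Counter()
        for sent in sentences:
            for prev, cur in bigrams(sent):
                self.pair_counts[(prev, cur)] += 1
                self.context_counts[prev] += 1

    def log_prob(self, tokens):
        total = 0.0
        for prev, cur in bigrams(tokens):
            num = self.pair_counts[(prev, cur)] + 1
            den = self.context_counts[prev] + self.vocab_size
            total += math.log(num / den)
        return total


def lambda_g(doc_sentences, author_model, reference_model):
    """Sum of per-sentence log-likelihood ratios (author vs. reference)."""
    return sum(
        author_model.log_prob(s) - reference_model.log_prob(s)
        for s in doc_sentences
    )


author_model = BigramModel(
    [mask("it is the end of the day"), mask("it was the start of the week")]
)
reference_model = BigramModel(
    [mask("a dog ran to a park in town"), mask("that cat sat in a box and slept")]
)
questioned = [mask("it is the middle of the night")]

score = lambda_g(questioned, author_model, reference_model)
print(score)  # a positive score favours the candidate author in this toy setup
```

In this sketch a positive score means the author's model explains the grammatical patterns of the questioned document better than the reference model does, and a negative score means the opposite. A real application would need far more text, a proper tokenizer, and a better smoothing scheme than add-one.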

This new method has several attractive properties: 1) it does not need a significant amount of data for training; 2) it is robust to topic differences and genre variations in the reference corpus; 3) its results can be manually explored; 4) it is grounded in Cognitive Linguistics.

This paper also demonstrates that, in the age of Large Language Models, Small Language Models still have a role to play, especially if they are theoretically driven. Collaboration between NLP and Linguistics CAN lead to paradigm shifts, rather than just adding more data/parameters!