My latest paper with Hugo Bowles and Claire Wood examines a Dickens mystery: did he author the recently decoded story “The Two Brothers”? The answer is complicated. The paper showcases our new method, LambdaG (forthcoming!). The published paper can be found here, and the free accepted version is here.
Category: New paper
Authorship Verification based on the Likelihood Ratio of Grammar Models
I’m extremely excited to announce the pre-print of our new paper: “Authorship Verification based on the Likelihood Ratio of Grammar Models”
https://arxiv.org/abs/2403.08462v1
with Oren Halvani, Lukas Graner, Valerio Gherardi and Shunichi Ishihara.
This is a new authorship verification method based on a type of language model we call a “Grammar Model”. A Grammar Model is an n-gram language model that only uses information about the grammatical tokens in a text. The method is fundamentally based on calculating a score that we call LambdaG: the ratio of the likelihoods assigned to the document by two Grammar Models, one for the author and one for the population, as represented by a reference corpus. Despite its simplicity, LambdaG outperforms other state-of-the-art algorithms, like the Impostors Method and a BERT Siamese neural network! This result comes from testing on 12 different corpora designed to reflect various kinds of authorship analysis challenges.
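To make the idea of a likelihood ratio between two Grammar Models more concrete, here is a toy sketch in Python. It is not the implementation from the paper: the function-word list, the crude tokenizer, the choice of bigrams, and the add-one smoothing are all simplifying assumptions made for illustration, and the names `FUNCTION_WORDS`, `grammar_tokens`, `BigramGrammarModel` and `lambda_g` are hypothetical. The sketch masks every non-grammatical token with a placeholder, trains one bigram model on the author's texts and one on reference texts, and returns the log of the ratio of the two likelihoods for a questioned document.

```python
import math
from collections import Counter

# Assumption: a tiny hand-picked set of "grammatical" tokens. A real system
# would use a much richer inventory than this illustrative list.
FUNCTION_WORDS = {"the", "a", "of", "to", "and", "in", "that", "it",
                  "is", "was", "i", "but", "not", ".", ","}
MASK = "_"      # placeholder for every content (non-grammatical) token
START = "<s>"   # start-of-text marker used as the first context


def grammar_tokens(text):
    """Crude tokenizer: keep grammatical tokens, mask everything else."""
    tokens = text.lower().replace(".", " . ").replace(",", " , ").split()
    return [t if t in FUNCTION_WORDS else MASK for t in tokens]


class BigramGrammarModel:
    """Bigram model over grammatical tokens with add-one smoothing."""

    def __init__(self, texts):
        self.bigrams = Counter()
        self.contexts = Counter()
        # Tokens that can be predicted: function words plus the mask.
        self.vocab_size = len(FUNCTION_WORDS) + 1
        for text in texts:
            toks = [START] + grammar_tokens(text)
            for prev, cur in zip(toks, toks[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def log_prob(self, text):
        """Log-likelihood of the text under this model."""
        toks = [START] + grammar_tokens(text)
        total = 0.0
        for prev, cur in zip(toks, toks[1:]):
            num = self.bigrams[(prev, cur)] + 1
            den = self.contexts[prev] + self.vocab_size
            total += math.log(num / den)
        return total


def lambda_g(document, author_texts, reference_texts):
    """Log-likelihood ratio: author Grammar Model vs. reference Grammar Model."""
    author = BigramGrammarModel(author_texts)
    reference = BigramGrammarModel(reference_texts)
    return author.log_prob(document) - reference.log_prob(document)


if __name__ == "__main__":
    known = ["It was not the cat, but the dog that ran."]
    population = ["Dogs run and cats sleep in a house."]
    questioned = "It was not the bird, but the fish that swam."
    print(lambda_g(questioned, known, population))
```

In this sketch a positive score means the author's model explains the grammatical patterns of the questioned document better than the reference model does, and a negative score means the opposite. The questioned text above shares no content words with the known text, yet the two have the same grammatical skeleton, which is the kind of signal a Grammar Model is meant to capture.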
This new method has several attractive properties: 1) it does not need a significant amount of data for training; 2) it is stable to topic differences and genre variations in the reference corpus; 3) its results can be manually explored; 4) it is based on Cognitive Linguistics.
This paper also demonstrates that in the age of Large Language Models, Small Language Models still have a role to play, especially when they are theoretically driven. Collaboration between NLP and Linguistics CAN lead to paradigm shifts, rather than just adding more data/parameters!