This is the second blog (out of four) in the series “academic exploits in times of Corona”. Let me tell you about a fascinating seminar I did in Amsterdam, at the UvA. My presence was virtual, which was a pity coz my dear friend Teja lives just around the corner – just think of those clinking glasses and gastronomic goodies we had to deny ourselves. Hopefully we can make up for it at some point in the near(er) future.
Anyway, the seminar was not what I had expected. At all. I thought I was going to learn something about the technical construction of language. Well, I suppose I did learn about a specific kind of construction. The kind that helps you identify who really wrote a particular document. Or even, when there is more than one author, to identify who has written which part. For instance, some real detective work has been done on the writings of Hildegard of Bingen. Do you know about her? If not, look her up. She was an abbess, a famous composer of sacred music, mystic, philosopher, scientist, writer – and born in 1098! She must have been quite a woman. Anyway, she was a bit of a control freak. Her clerks were not allowed to change anything in her texts without her approval. But it is thought that she put a lot of trust in her last clerk, to the extent that he may have completed or even written some of her texts.
Curious? Have a look at this documentary. I thought it was wonderful, so exciting to find out what must have happened. Of course these researchers did not just look at the writings. They already knew a lot about her, her life, the general context, other authors, so they had an idea where to look. Still, a remarkable discovery (if you have not started the documentary by now, do it soon, or I will give up on you).
Great stuff, right? But interesting as it was, I was wondering how to relate it to my research topic. And then I thought about the authorship of rules and regulations in security (my daytime job, for those of you who do not know). I described the problem once to my supervising professor, at the start of my back-to-academia project. You will find it tucked away in my original problem definition, in this post. The problem with anonymous texts, or texts that have been written by a “body” of people, is that it is almost impossible to get textual clarification. As I put it in that other piece:
“There is no one to ask. There is no author to ask for clarification, nor is there an easily accessible expert group. An additional problem is that reaching out to the publisher of the regulation or standard in question, must be done through proper channels, i.e. not something just any employee can do. Usually, the best that may be achieved is to send in a formal request for clarification – which may or may not be processed during a future maintenance window”.
The discipline that does this kind of investigation is called stylometry. Basically, stylometry analyses measurable textual features: word and sentence length, various frequencies (of words, word lengths, word forms, etc.), vocabulary richness, use of punctuation, use of certain expressions and preferences for certain spelling variants. You can imagine that the more texts you have by one particular author, the more you get to know about his or her particular style. Such analyses also allow you to pick out texts that seem odd, i.e. do not have the characteristic features commonly found in texts by that particular author.
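To give an idea of what “measurable textual features” means in practice, here is a minimal sketch in Python. It is not the tooling I used for the seminar, just an illustration: the function name `stylometric_features` and the choice of features are my own assumptions, and real stylometry packages measure far more than this.

```python
import re
from collections import Counter


def stylometric_features(text):
    """Compute a few simple style features for one text (illustrative only)."""
    # Naive sentence split on ., ! and ? (abbreviations will confuse it).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of letters; this also keeps accented letters.
    words = re.findall(r"[^\W\d_]+", text.lower())

    n_words = len(words) or 1          # guard against empty input
    n_sentences = len(sentences) or 1

    return {
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "avg_sentence_length": len(words) / n_sentences,
        # Vocabulary richness: distinct words divided by total words.
        "type_token_ratio": len(set(words)) / n_words,
        # Punctuation habit: commas per 100 words.
        "commas_per_100_words": 100 * text.count(",") / n_words,
        # How often each word length occurs.
        "word_length_counts": Counter(len(w) for w in words),
    }


features = stylometric_features("The cat sat. The dog sat, too!")
print(features["avg_sentence_length"], features["type_token_ratio"])
```

Run this on a pile of texts by one author and you get a rough numerical profile; a text whose numbers fall far outside that profile is one of the “odd” ones worth a closer look.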
What I find fascinating is that these kinds of features are the ones we are not aware of: our use of little words like a/an/the, for instance. So disguising your handwriting or attempting to stay anonymous will not stop the literary detective from finding out who you are!
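Those little words can be counted too. The sketch below is a simplified illustration, not the method from the seminar or from any particular tool: the word list `FUNCTION_WORDS` and the helper names are hypothetical, and the distance is a plain average of absolute differences. Established measures such as Burrows' Delta first standardise the frequencies across a whole corpus, which this sketch skips.

```python
import re

# A tiny, illustrative list; real studies use hundreds of frequent words.
FUNCTION_WORDS = ["a", "an", "the", "and", "of", "to", "in", "but"]


def function_word_profile(text, vocabulary=FUNCTION_WORDS):
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    total = len(words) or 1  # guard against empty input
    return [words.count(w) / total for w in vocabulary]


def profile_distance(p, q):
    """Mean absolute difference between two profiles (0 means identical)."""
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)


known = function_word_profile("The abbess wrote to the bishop and to the pope.")
disputed = function_word_profile("A clerk wrote a letter in a hurry, but an odd one.")
print(profile_distance(known, disputed))
```

The smaller the distance, the more alike two texts are in their use of these words. On texts this short the number means nothing, of course; you need a lot of text per author before it starts to say anything.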
So I thought I’d study official parliamentary publications by and about the Dutch Tax office. Which turned out to be a lot more work than I thought, because it is not possible to get the documents from one particular dossier in one go. But since I only wanted about 50, that was doable, so I started collecting. I found some really interesting things. For instance, the official “functional” authorship which was stated on the documents (minister of finance, secretary of state, audit chamber, Dutch tax office, ministry of finance) rarely matched the author or group of authors that – according to tools and theory – actually wrote those documents. The most amazing was a set of two letters, one by the prime minister and one by the head of the audit chamber, which appear to have been written by the same person. Which is weird, considering the audit chamber is supposed to keep a check on the government.
At this point the seminar’s professor said that I must be very very careful interpreting these results, and perhaps I would like to do a further study and involve a data scientist. 🙂 Yes yes, I understand. This must be the nth time that I have written something which might be a little explosive to publish. Like my paper on “naive normativity in animals”, or the piece about “artificial intelligence and profiling”. I suppose in this case – unlike the other two – there really is more work to do. After all, it was the very first time I played with these tools, and I am still not sure about their limitations.
If you want to read my paper, it is here. It is a bit dry, because it is basically analysis, but you will get the idea. Some pretty graphs included. I also did an analysis on trustworthiness. This was of particular interest because some of the documents in recent debates were said to contain falsehoods. I ran tests to find out if any signs of untruthfulness could be found. And I found? The opposite. All of these texts breathed a 1000% “you can trust me”. Which probably means that in texts, trustworthiness cannot be measured, or maybe that trustworthiness is a style which can easily be faked. Or perhaps our society is not so interested in truth anymore.
My next blog will be about Frege. Yes, the one that sort of incidentally provided the mathematical foundation for the whole analytical philosophy of language approach which I found so very boring when I was first at university. I dared to go back into hell, and I will tell you the story. Next.