Epistemic status:

Theory of change and mental models of world: Weak to moderate

Importance of stylometrics research if above theories are correct: Moderate to strong

Basically I just wanted to write about stylometrics, but when I tried to argue for its importance, I realised that accepting its importance requires accepting a much larger set of assumptions. I won't fully defend all of these assumptions here, but I think they're important to think about and ideally deserve separate posts.

Key hypothesis

 - Surveillance tech is becoming more powerful and more widespread. AI is enabling automation of the entire pipeline from data collection to analysis to its usage in offensive actions such as blackmail. Automation allows such operations to scale up to large sets of targets.

 - Surveillance tech stabilises totalitarian regimes and destabilises democratic ones.

 - Advocates of decentralised technology and privacy-preserving technology have ambitious plans of building an unsurveillable society.

 - This dream is sufficiently ambitious and requires so many different things to go right that it may be worth analysing through the lens of impossibility results. This means trying to identify the most difficult challenges such technology faces, and proving results on things it cannot do no matter how hard it tries.

 - AI for stylometrics could provide one such impossibility result. Stylometrics is the study of linguistic style; in this post I use it primarily in the context of doxxing anonymous individuals.

Surveillance tech is becoming more powerful and more widespread

Here's a shortened version of some of the claims from The Global Expansion of AI Surveillance by Steven Feldstein, Carnegie Endowment:

 - AI surveillance technology is spreading at a faster rate to a wider range of countries than experts have commonly understood. At least seventy-five out of 176 countries globally are actively using AI technologies for surveillance purposes.

 - China is a major driver of AI surveillance worldwide. Huawei alone is responsible for providing AI surveillance technology to at least fifty countries worldwide.

 - Chinese product pitches are often accompanied by soft loans. This raises troubling questions about the extent to which the Chinese government is subsidizing the purchase of advanced repressive technology.

 - But China is not the only country supplying advanced surveillance tech worldwide. U.S. companies are also active in this space. AI surveillance technology supplied by U.S. firms is present in thirty-two countries. The most significant U.S. companies are IBM (eleven countries), Palantir (nine countries), and Cisco (six countries).

 - The index shows that 51 percent of advanced democracies deploy AI surveillance systems.

 - There is a strong relationship between a country’s military expenditures and a government’s use of AI surveillance systems.

AI is enabling automation of the process of gaining meaningful insights about individual and collective lives, so that such insights can be obtained at large scale.

Example targets include high-profile government officials, high-profile individuals and corporates, human rights activists, journalists, dissenters, students, and so on.

For instance, Pegasus, spyware created by the Israeli NSO Group, was discovered in 2016. A leaked list of 50,000 potential targets, including heads of state, was later published. Pegasus has been sold to the UAE and other Gulf states for surveillance of anti-regime activists and political rivals, and has been used for similar purposes by governments in Armenia, Bahrain, India, Morocco, Palestine and others, and by drug cartels in Mexico.

While not strictly surveillance, the US government suffered its largest data breach in 2014 (the Office of Personnel Management breach), with nearly 22.1 million personnel records stolen, including names, dates of birth, addresses and family connections.

There exist various means to act on such information to pursue strategic outcomes; a full analysis is out of scope for this post. Information can be used to deliver threats via blackmail material or economic or political disincentives. It can be used to shift people's loyalty more efficiently. It can also be paired with physical offensive acts or threats, such as those from drones, lethal autonomous weapons (LAWs), or simply military and police personnel. It is unknown to what extent such acts can be automated and executed at large scale, but it is likely that AI advances help with some steps of the process.

China has used mass surveillance to effectively detain Uyghurs and monitor detention camps.

Surveillance tech stabilises totalitarian regimes and destabilises democratic ones

It is possible (albeit not proven) that dismantling the Chinese autocracy from the inside is an order of magnitude harder than it was before the invention of surveillance tech. Xi Jinping has further centralised the decision-making power of the Chinese state in his own hands, rather than the Politburo's.

Until now I have discussed how surveillance tech has a stabilising effect on authoritarian regimes, by making the suppression of dissent much more effective. But there is also a destabilising effect on democracies.

Due to principal-agent problems, surveillance tech and the data and insights obtained from it must often be controlled by a small group of actors. It is harder to keep these actors accountable to larger government structures and the public.

This in turn allows such actors to pursue their own independent agendas more aggressively than the public would permit. Intelligence agencies of countries such as the US and the UK have frequently been accused of exactly this.

Functioning militaries require the live, maintained allegiance of millions of people, but nukes require the allegiance of only a few hundred people to control. It's likely that surveillance tech, too, needs to be controlled by such small groups.

Unlike nukes however, surveillance can be more effectively used for a totalitarian power grab inside of the State, because it can enable individual targeted threats which are a lot more credible and reliable than threats of mass destruction. Such power grabs often happen under the pretext of war or conflict. The odds of this are low, but may still be comparable to the odds of a nuclear exchange playing out, and hence deserve attention. This highlights Importance as per the ITN framework.

Tech for unsurveillable societies

Advocates of privacy and decentralised tech believe that technological solutions can protect individuals from mass surveillance. This however requires solutions at multiple levels, including but not limited to encryption (symmetric and asymmetric key exchange, the double ratchet, etc.), transport privacy (Tor, mixnets), p2p routing systems (devp2p, libp2p), decentralised DoS protection, privacy-preserving computation (such as homomorphic encryption), domain name resolution (blockchain-based), payment channels (such as ZK rollups) and so on.

Analysing the effects and weaknesses of each technology is a major undertaking. It may help to take a high-level view and start with proofs of things that cannot be done, no matter how advanced the technology becomes, or with inherent tradeoffs that cannot be averted. Finding such results could be Tractable, and is discussed further below.

Surveillance, as well as defences against it, is not entirely Neglected in general; what may be neglected are high-leverage paths to meaningfully change the landscape. More specifically, these may be a bit neglected in EA, in my weakly held opinion, as are other risks of stable totalitarianism or bad value lock-in. A lot of these concerns are however mentioned in FHI's AI governance agenda, which makes me hopeful they will be seriously analysed in the near future. (Even if only to conclude which concerns are not worth focussing on.)


I'm considering one specific domain in which unsurveillability seems hard to achieve: online communication intended for public readership, such as public forums, articles, blogs, and public social media profiles. Anything in the public domain can be accessed by anyone with sufficient bandwidth and compute (barring some trivial concerns around IP blocking and DoS protection); it does not require access to the private data collections of either governments or corporates.

I am also not considering data collection outside of online environments, such as closed-circuit cameras, in-person collection and surveillance satellites. These make unsurveillability even harder, and I'll ignore them for now.

It is an open question whether one can maintain anonymous identities on the public internet and disseminate content through them. Such identities are most likely used today by activists, journalists, crypto investors and anyone who is ideologically hardline regarding privacy. In the unsurveillable-society ideal, perhaps everyone could maintain such online identities.

Gwern has a fascinating post on the numerous footprints we leave in both hardware and software. There is also of course the possibility of bugs - both intentional and unintentional, in both hardware and software - that are exploited for surveillance. I am going to assume an ideal world where no bugs exist and all digital footprints can be cleverly randomised to match background noise, so that no unique information (entropy) can be extracted about individuals from them.

This still leaves open the question of stylometrics, namely analysing linguistic style. As long as content is written by humans in natural language, patterns will likely be left in the writing. These could include common sentence structures, word choices, punctuation and typing patterns, concepts, grammar and so on. If AI-based stylometrics alone is sufficient to map a person's anonymous identity to their public one, the task of creating an unsurveillable society becomes that much harder. Note that this mapping, if possible, could be done by almost anyone, not just governments and corporates.
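To make the idea concrete, here is a minimal toy sketch of stylometric attribution, assuming nothing from the papers cited below: represent each text as a frequency profile of character trigrams (one classic stylometric feature) and attribute an anonymous text to the candidate author whose known writing has the most similar profile. All function names and example texts are invented for illustration; real systems use far richer feature sets and far larger candidate pools.

```python
from collections import Counter
from math import sqrt

def char_ngrams(text: str, n: int = 3) -> Counter:
    """Frequency profile of character n-grams, a classic stylometric feature."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse frequency profiles (0.0 to 1.0)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def attribute(anonymous_text: str, candidates: dict[str, str]) -> str:
    """Return the candidate author whose known writing is most similar in style."""
    profile = char_ngrams(anonymous_text)
    return max(candidates, key=lambda name: cosine(profile, char_ngrams(candidates[name])))
```

Even this crude approach distinguishes strongly contrasting styles; the hard open question is whether vastly more sophisticated versions can reliably separate millions of authors writing on the open internet.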

On the Feasibility of Internet-Scale Author Identification is the best paper I have found that attempts to explicitly answer whether this is possible; however, their method identifies the author in only about 30% of cases. Other papers study related problems (1, 2) but do not target this problem directly. Obfuscating document stylometry to preserve anonymity attempts to build both deanonymisation techniques and obfuscation techniques to defend against them. There is also older stylometry literature that does not use AI, and hence seems less relevant (though some of its ideas might be reusable by AI engineers).

Research Proposal: Advance AI-based stylometrics research

A better funded initiative could make a more concerted effort to develop AI-based stylometric deanonymisation tools, to see how far they can be developed.

Note that this essentially means developing offensive tools that we wish would not be used in practice, hence there is an infohazard. If it turns out stylometric deanonymisation is very effective with very little investment (under $1M of funding and no particularly brilliant insights), then it can make sense to open-source the code and publicise this information. The assumption here is that if this tech is easy to obtain, it has likely already been obtained by various intelligence agencies; everyone using anonymous identities online today can adapt, and those building anonymity-preserving tech will be aware of this knowledge. If it turns out stylometric deanonymisation will always be ineffective for structural reasons, that too might make sense to open-source.

The infohazard exists if deanonymisation turns out neither easy nor impossible. At that point it might make sense to destroy the results instead of risking anyone else building upon them.

This proposal combines a lot of thoughts I've had on related topics, and I'm aware I haven't defended all of them perfectly. I'm still keen to hear thoughts on any of it.


4 comments

Somewhat related: a while ago I wrote a benefit-risk analysis of research on improving device detection. I concluded the risks likely outweigh the benefits.

Thanks, this is interesting.

(One point that is not mentioned is the counterfactual world where you didn't develop the tech, would someone else have developed the same tech with the same funding instead?)

[This comment is no longer endorsed by its author]

Thanks, this was surprisingly interesting. I agree stylometrics is a relevant field, and this problem might require more awareness than it has got.

On the other hand... my epistemic status here is very low, but I guess that, in the long run, the offense-vs-defense asymmetry here would likely play against surveillance. It doesn't seem hard to use writing-assistance software to change your "stylometric signature" - even something very amateurish, like translating a text back and forth, yields very different results.
I see an analogy here with cheating-detection software, which has reportedly been used by essay mills to ensure their products won't be detected. The paper you cite (from 2006) suggests this is not particularly hard technically - if I understood it correctly.

What I think might be problematic is that we have conflicting desiderata: if you're an anonymous blogger, you actually want a recognizable and definite style. You could use obfuscating techniques in your other interactions (your academic essays, or personal blog), but here's the catch: if you're the only one doing this, you're basically signalling to everyone else that you have a "secret identity". But I guess this applies to other matters regarding surveillance as well.

So my point is: though I recognise your concern, I am still far more worried about other forms of surveillance in general, and I think the "crux" here is mainly a social issue rather than a tech one.

Interesting, but I do think it's an impossibly large cognitive load to be modifying every word you write publicly online (at least from your public identity). It's more than research papers - it's any blog you want to comment on, any reddit shitpost, any hobby communities you may be part of, and so on. You'd also have to modulate how you speak in person if you fear someone is recording you or plans to quote you verbatim on the internet later.

P.S. Such defences will probably be even more helpless against state and corporate actors, for instance if they can read your private one-to-one chats, get data from robocalls you pick up, or record you when you visit a store, etc.