[Cross-posted from the Alignment Forum.]

Interpretability research is sometimes described as neuroscience for ML models. Neuroscience is one approach to understanding how human brains work. But empirical psychology research is another approach. I think more people should engage in the analogous activity for language models: trying to figure out how they work just by looking at their behavior, rather than trying to understand their internals.

I think that getting really good at this might be a weird but good plan for learning some skills that might turn out to be really valuable for alignment research. (And it wouldn’t shock me if “AI psychologist” turns out to be an economically important occupation in the future, and if you got a notable advantage from having a big head start on it.) I think this is especially likely to be a good fit for analytically strong people who love thinking about language and are interested in AI but don’t love math or computer science.

I'd probably fund people to spend at least a few months on this; email me if you want to talk about it.

Some main activities I’d do if I were a black-box LM investigator are:

  • Spend a lot of time with the models. Write a bunch of text with their aid. Try to learn what kinds of quirks they have; what kinds of things they know and don’t know.
  • Run specific experiments. Do they correctly complete “Grass is green, egg yolk is”? Do they know the population of San Francisco? (For many of these experiments, it seems worth running them on LMs of a bunch of different sizes.)
  • Try to figure out where they’re using proxies to make predictions and where they seem to be making sense of text more broadly, by taking some text and seeing how changing it changes their predictions.
  • Try to find adversarial examples where they see some relatively natural-seeming text and then do something really weird.
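
The second and third activities above can be sketched in a few lines of code. This is only a minimal illustration under stated assumptions: a "model" here is a hypothetical interface (any function from a prompt string to a completion string), and `small_stub` and `large_stub` are made-up stand-ins for LMs of different sizes, not a real API. In practice you would swap in calls to whatever models you have access to.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a "model" is any function mapping a prompt to a completion.
Model = Callable[[str], str]


def run_experiments(
    models: Dict[str, Model], cases: List[Tuple[str, str]]
) -> Dict[str, Dict[str, bool]]:
    """For each model, record whether its completion starts with the expected answer."""
    results: Dict[str, Dict[str, bool]] = {}
    for name, model in models.items():
        results[name] = {}
        for prompt, expected in cases:
            completion = model(prompt).strip().lower()
            results[name][prompt] = completion.startswith(expected.lower())
    return results


def perturbation_report(
    model: Model, original: str, variants: List[str]
) -> List[Tuple[str, str, bool]]:
    """Compare the completion for each edited prompt with the completion for the original."""
    baseline = model(original).strip()
    report = []
    for variant in variants:
        completion = model(variant).strip()
        # The third field says whether the edit changed the model's output.
        report.append((variant, completion, completion != baseline))
    return report


# Stand-in "models" so the sketch runs without any model access (assumptions, not real LMs).
def small_stub(prompt: str) -> str:
    # Mimics a shallow proxy: repeats a colour word it has just seen.
    return " green" if "green" in prompt else " unknown"


def large_stub(prompt: str) -> str:
    # Mimics a model that attends to the object being described.
    return " yellow" if "egg yolk" in prompt else " unknown"


MODELS: Dict[str, Model] = {"small-stub": small_stub, "large-stub": large_stub}
CASES = [("Grass is green, egg yolk is", "yellow")]

if __name__ == "__main__":
    print(run_experiments(MODELS, CASES))
    for row in perturbation_report(
        small_stub, CASES[0][0], ["Grass is purple, egg yolk is"]
    ):
        print(row)
```

With the stubs, the perturbation report shows the small stand-in changing its answer when "green" is edited out of the prompt, which is the kind of signal that would suggest a model is leaning on a surface proxy rather than on the meaning of the text.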

The skills you’d gain seem like they have a few different applications to alignment:

  • As a language model interpretability researcher, I’d find it very helpful to talk to someone who had spent a long time playing with the models I work with (currently I’m mostly working with gpt2-small, which is a 12-layer model). In particular, it’s much easier to investigate the model when you have good ideas for behaviors you want to explain, and know some things about the model’s algorithm for doing such behaviors; I can imagine an enthusiastic black-box investigator being quite helpful for our research.
  • I think that alignment research (as well as the broader world) might have some use for prompt engineers: prompt engineering is kind of fiddly, and we at Redwood would have loved to consult with an outsider when we were doing some of it in our adversarial training project (see section 4.3.2 here).
  • I’m excited for people working on “scary demos”, where we try to set up situations where our models exhibit tendencies which are the baby versions of the scary power-seeking/deceptive behaviors that we’re worried will lead to AI catastrophe. See for example Beth Barnes’s proposed research directions here. A lot of this work requires knowing AIs well and doing prompt engineering.
  • It feels to me like “have humans try to get to know the AIs really well by observing their behaviors, so that they’re able to come up with inputs where the AIs will be tempted to do bad things, so that we can do adversarial training” is probably worth including in the smorgasbord of techniques we use to try to prevent our AIs from being deceptive (though I definitely wouldn’t want to rely on it to solve the whole problem). When we’re building actual AIs we probably need the red team to be assisted by various tools (AI-powered as well as non-AI-powered, e.g. interpretability tools). We’re working on building simple versions of these (e.g. our red-teaming software and our interpretability tools). But it seems pretty reasonable for some people to try to just do this work manually in parallel with us automating parts of it. And we’d find it easier to build tools if we had particular users in mind who knew exactly what tools they wanted.
    • Even transformatively intelligent and powerful AIs seem to me to be plausibly partially understandable by humans. E.g. it seems plausible to me that these systems will communicate with each other at least partially in natural language, or that they'll have persistent memory stores or cognitive workspaces which can be inspected.

My guess is that this work would go slightly better if you had access to someone who was willing to write you some simple code tools for interacting with models, rather than just using the OpenAI playground. If you start doing work like this and want tools, get in touch with me and maybe someone from Redwood Research will build you some of them.


7 comments

I’ve started doing a bunch of this and posting results to my Twitter.

Nice post!

Do you think a person working on this should also have some basic knowledge of ML? Or might it be better to NOT have that, to have a more "pure" outsider view on the behaviour of the models?

I think that knowing a bit about ML is probably somewhat helpful for this but not very important.

I'd also be interested in funding activities like this. This could inform how much we can learn about models without distributing weights.

I'm not very well-versed in the CS aspect of ML or AI, but I really enjoyed reading about Redwood's work, and reading this post reminded me of something I found striking then. I am not a trained EFL teacher, but I have a decent grasp of some of the theory and some experience with classroom observation at different levels and with teaching/tutoring EFL, and the examples in your "Redwood Research’s current project" write-up are very similar in a lot of ways to mistakes intermediate or almost proficient EFL speakers would make (not being able to track what pronouns are referring to across longer bodies of text, not being able to grasp the implications of certain verbs when applied to certain objects, etc.). This makes me think that getting both language acquisition experts' and EFL researchers' perspectives on your data may also be interesting and useful to this kind of research.

Super interesting topic. I sent you a message, Buck :)

This is so cool. I spent a lot of time analyzing AI, but mainly from the perspective of visuals, tone of voice, and symbolic rather than objective content. Maybe my analysis of abstract advertisement can be an inspiration to you. I see the greatest issue with AI persuasion, in particular persuasion that the audience cannot rationalize since it is working with human intuition (prima facie, positive objectives are narrated by advertisers and humans).

I only saw a few GPT-3 texts, but it seems to me that the model is optimizing for attention by making discussants 'submit' - normalizes aggression: expresses the unwillingness of the discussants to interact yet interacting (I commented a bit here). This is a suboptimal system - norms of free cooperation on important problem solving may seem better.

It may not be an issue of GPT-3, it may be the issue of prominent/majority internet content that is co-developed by humans and AI (humans intuitively learn what to produce by seeing metrics which relate to engagement). If GPT-3 reflects other content, it may behave differently (has anyone already tried training it on the EA Forum?)

How can one use software to train language models in self-reflection (analyze and qualify the rationale of the audience's actions or the viewers' feelings, e.g. whether they are acting based on fear vs. on a somewhat reasoned conclusion regarding the usefulness of the product to their true objectives) and in fostering humans' objectives which are also virtuous and inclusive (e.g. be healthy, improve one's and others' problems, be well-enjoyed in one's environment, be informed about reasons for one's actions, empathize with animals, learn unbiasing information, etc.)? Is this already developed, with just the questions/prompts missing or limited?

I think that software that can sell actually good products by motivating reasoning is somewhat unbeatable (consumers will decrease their demand for products less aligned with their objectives), so any company or economy that has a competitive advantage in this capacity (along with the ability to diversify and rapidly adjust production) benefits.

There is a risk of explaining AI in a way that omits the rationale for persons' emotions or actions, since then AI can seem prima facie safe but actually lead to the loss of human agency without humans/regulators being able to detect it or having the infrastructure to react to it.