Steven Byrnes

Research Fellow @ Astera
1232 karma · Joined Nov 2019 · Working (6-15 years) · Boston, MA, USA


Hi I'm Steve Byrnes, an AGI safety researcher in Boston, MA, USA, with a particular focus on brain algorithms—see



That might be true in the very short term but I don’t believe it in general. For example, how many reporters were on the Ukraine beat before Russia invaded in February 2022? And how many reporters were on the Ukraine beat after Russia invaded? Probably a lot more, right?

Thanks for the comment!

I think we should imagine two scenarios, one where I see the demonic possession people as being “on my team” and the other where I see them as being “against my team”.

To elaborate, here’s yet another example: Concerned Climate Scientist Alice responding to statements by environmentalists of the Gaia / naturalness / hippy-type tradition. Alice probably thinks that a lot of their beliefs are utterly nuts. But it’s pretty plausible that she sees them as kinda “on her side” from a vibes perspective. (Hmm, actually, also imagine this is 20 years ago; I think there’s been something of a tribal split between pro-tech environmentalists and anti-tech environmentalists since then.) So Alice would probably make somewhat diplomatic statements, emphasizing areas of agreement, etc. Maybe she would say “I think they have the right idea about deforestation and many other things, although I come at it from a more scientific perspective. I don’t think we should take the Gaia idea too literally. But anyway, everyone agrees that there’s an environmental crisis here…” or something like that.

In your demon example, imagine someone saying “I think it’s really great to see so many people questioning the narrative that the police are always perfect. I don’t think demonic possession is the problem, but y’know why so many people keep talking about demonic possession? It’s because they can see there’s a problem, and they’re angry, and they have every right to be angry because there is in fact a problem. And that problem is police corruption…”.

So finally back to the AI example, I claim there’s a strong undercurrent of “The people talking about AI x-risk, they suck, those people are not on my team.” And if there wasn’t that undercurrent, I think most of the x-risk-doesn’t-exist people would have at worst mixed feelings about the x-risk discourse. Maybe they would be vaguely happy that there are all these new anti-AI vibes going around, and they would try to redirect those vibes in the directions that they believe to be actually productive, as in the above examples: “I think it’s really great to see people across society questioning the narrative that AI is always a force for good and tech companies are always a force for good. They’re absolutely right to question that narrative; that narrative is wrong and dangerous! Now, on this specific question, I don’t think future AI x-risk is anything to worry about, but let’s talk about AI companies stomping on copyright law…”

Very different vibe, right? Much less aggressive trashing of AI x-risk than what we actually see from some people.

To be clear, in a perfect world, people would ignore vibes and stay on-topic and at the object level, and Alice would just straightforwardly say “My opinion is that Gaia is pseudoscientific nonsense” instead of sanewashing it and immediately changing the subject, and ditto with the demon person and the other imaginary people above. I’m just saying what often happens in practice.

Back to your example, I think it’s far from obvious IMO that the number of articles about police corruption is going to go down in absolute terms, although it obviously goes down as a fraction of police articles. It’s also far from obvious IMO that this situation will make it harder rather than easier to get anti-corruption laws passed, or to fundraise.

  • I suggest spending a few minutes pondering what to do if crazy people (perhaps just walking by) decide to "join" the protest. Y'know, SF gonna SF.

  • FYI at a firm I used to work at, once there was a group protesting us out front. Management sent an email that day suggesting that people leave out a side door. So I did. I wasn't thinking too hard about it, and I don't know how many people at the firm overall did the same.

(I have no personal experience with protests, feel free to ignore.)

In your hypothetical, if Meta says “OK you win, you're right, we'll henceforth take steps to actually cure cancer”, onlookers would assume that this is a sensible response, i.e. that Meta is responding appropriately to the complaint. If the protester then gets back on the news the following week and says “no no no this is making things even worse”, I think onlookers would be very confused and say “what the heck is wrong with that protester?”

I don’t think “mouldability” is a synonym of “white-boxiness”. In fact, I think they’re hardly related at all:

  • There can be a black box with lots of knobs on the outside that change the box’s behavior. It’s still a black box.
  • Conversely, consider an old-fashioned bimetallic strip thermostat with a broken dial. It’s not mouldable at all: it can do one and only one thing, i.e. actuate a switch at a certain fixed temperature. (Well, I guess you can use it as a doorstop!) But a bimetallic strip thermostat is still very white-boxy (after I spend 30 seconds telling you how it works).

You wrote “They’re just a special type of computer program, and we can analyze and manipulate computer programs however we want at essentially no cost.” I feel like I keep pressing you on this, and you keep motte-and-bailey'ing into some other claim that does not align with a common-sense reading of what you originally wrote:

  • “Well, the cost of analysis could theoretically be even higher—like, if you had to drill into skulls…” OK sure but that’s not the same as “essentially no cost”.
  • “Well, the cost of analysis may be astronomically high, but there’s a theorem proving that it’s not theoretically impossible…” OK sure but that’s not the same as “essentially no cost”.
  • “Well, I can list out some specific analysis and manipulation tasks that we can do at essentially no cost: we can do X, and Y, and Z, …” OK sure but that’s not the same as “we can analyze and manipulate however we want at essentially no cost”.

Do you see what I mean?

If you want to say "it's a black box, but the box has a 'gradient' output channel in addition to the 'next-token-probability-distribution' output channel", then I have no objection.

If you want to say "...and those two output channels are sufficient for safe & beneficial AGI", then you can say that too, although I happen to disagree.

If you want to say "we also have interpretability techniques on top of those, and they work well enough to ensure alignment for both current and future AIs", then I'm open-minded and interested in details.

If you want to say "we can't understand how a trained model does what it does in any detail, but if we had to drill into a skull and only measure a few neurons at a time etc. then things sure would be even worse!!", then yeah duh.

But your OP said "They’re just a special type of computer program, and we can analyze and manipulate computer programs however we want at essentially no cost", and used the term "white box". That's the part that strikes me as crazy. To be charitable, I don't think those words are communicating the message that you had intended to communicate.

For example, find a random software engineer on the street, and ask them: "If I give you a 1-terabyte compiled executable binary, and you can do whatever you want with that file on your home computer, would you describe it as closer to 'white box' or 'black box'?" I predict most people would say "closer to black box", even though they can look at all the bits and step through the execution and run decompilation tools etc. if they want. Likewise you can ask them whether it's possible to "analyze" that binary "at essentially no cost". I predict most people would say "no".

I was reading it as a kinda disjunctive argument. If Nora says that a pause is bad because of A and B, either of which is sufficient on its own from her perspective, then you could say "A isn't cruxy for her" (because B is sufficient) or you could say "B isn't cruxy for her" (because A is sufficient). Really, neither of those claims is accurate.

Oh well, whatever, I agree with you that the OP could have been clearer.

If you desperately wish we had more time to work on alignment, but also think a pause won’t make that happen or would have larger countervailing costs, then that would lead to an attitude like: “If only we had more time! But alas, a pause would only make things worse. Let’s talk about other ideas…” For my part, I definitely say things like that (see here).

However, Nora has sections claiming “alignment is doing pretty well” and “alignment optimism”, so I think it’s self-consistent for her to not express that kind of mood.

I have a vague impression—I forget from where and it may well be false—that Nora has read some of my AI alignment research, and that she thinks of it as not entirely pointless. If so, then when I say “pre-2020 MIRI (esp. Abram & Eliezer) deserve some share of the credit for my thinking”, then that’s meaningful, because there is in fact some nonzero credit to be given. Conversely, if you (or anyone) don’t know anything about my AI alignment research, or think it’s dumb, then you should ignore that part of my comment, it’s not offering any evidence, it would just be saying that useless research can sometimes lead to further useless research, which is obvious! :)

I probably think less of current “empirical” research than you, because I don’t think AGI will look and act and be built just like today’s LLMs but better / larger. I expect highly-alignment-relevant differences between here and there, including (among other things) reinforcement learning being involved in a much more central way than it is today (i.e. RLHF fine-tuning). This is a big topic where I think reasonable people disagree and maybe this comment section isn’t a great place to hash it out. ¯\_(ツ)_/¯

My own research doesn’t involve LLMs and could have been done in 2017, but I’m not sure I would call it “purely conceptual”, since it involves a lot of stuff like scrutinizing data tables in experimental neuroscience papers. The ELK research project led by Paul Christiano also could have been done in 2017, as far as I can tell, but lots of people seem to think it’s worthwhile; do you? (Paul is a coinventor of RLHF.)

By contrast, AIs implemented using artificial neural networks (ANN) are white boxes in the sense that we have full read-write access to their internals. They’re just a special type of computer program, and we can analyze and manipulate computer programs however we want at essentially no cost. 

Suppose you walk down a street, and unbeknownst to you, you’re walking by a dumpster that has a suitcase full of millions of dollars. There’s a sense in which you “can”, “at essentially no cost”, walk over and take the money. But you don’t know that you should, so you don’t. All the value is in the knowledge.

A trained model is like a computer program with a billion unlabeled parameters and no documentation. Being able to view the code is helpful but doesn’t make it “white box”. Saying it’s “essentially no cost” to “analyze” a trained model is just crazy. I’m pretty sure you have met people doing mechanistic interpretability, right? It’s not trivial. They spend months on their projects. The thing you said is just so crazy that I have to assume I’m misunderstanding you. Can you clarify?
