I'm a PhD student at the Center for Human-Compatible AI (CHAI) at UC Berkeley. I edit and publish the Alignment Newsletter, a weekly publication with recent content relevant to AI alignment. In the past, I ran the EA UC Berkeley and EA at the University of Washington groups.

FLI launches Worldbuilding Contest with $100,000 in prizes

(Disagree if by implausible you mean < 5%, but I don't want to get into it here.)

Comments for shorter Cold Takes pieces

Another potential reason that empowerment could lead to more onerous stakeholder management is that we're able to take more large-scale, impactful actions, and so it's much more common to have affected stakeholders than it was in the past.

FLI launches Worldbuilding Contest with $100,000 in prizes

If you're confident in very fast takeoff, I agree this seems problematic.

But otherwise, given the ambiguity about what "AGI" is, I think you can choose to consider "AGI" to be the AI technology that existed, say, 7 years before the technological singularity (and I personally expect that AI technology to be very powerful), so that you are writing about society 2 years before the singularity.

Consider trying the ELK contest (I am)

After becoming very familiar with the ELK report (~5 hours?), it took me one hour to generate a proposal and associated counterexample (the "Predict hypothetical sensors" proposal here), though it wasn't very clean / fleshed out and Paul clarified it a bunch more (after that hour). I haven't checked whether it would have defeated all the counterexamples that existed at the time.

I have a lot of background in CS, ML, AI alignment, etc but it did not feel to me that I was leveraging that all that much during the one hour of producing a proposal (though I definitely leveraged it a bunch to understand the ELK report in the first place, as well as to produce the counterexample to the proposal). 

2021 AI Alignment Literature Review and Charity Comparison


CHAI researchers contributed to the following research led by other organisations:

David was a CHAI intern while working on that paper (this is noted in a footnote on the front page, a common practice for papers). So that one is entirely a CHAI paper.

An increasingly large amount of the best work is being done in places that are inside companies: Deepmind, OpenAI, Redwood, Anthropic etc.

Redwood isn't inside a company afaik?

Aligning Recommender Systems as Cause Area

I reviewed this post four months ago, and I continue to stand by that review.

Thoughts on the "Meta Trap"

Apparently this post has been nominated for the review! And here I thought almost no one had read it and liked it.

Reading through it again 5 years later, I feel pretty happy with this post. It's clear about what it is and isn't saying (in particular, it explicitly disclaims the argument that meta should get less money), and is careful in its use of arguments (e.g. trap #8 specifically mentions that counterfactuals being hard isn't a trap until you combine it with a bias towards worse counterfactuals). I still agree that all of the traps mentioned here are worth keeping in mind when working on "meta".

The biggest critique of this post is that it doesn't demonstrate that any of these traps actually happen(ed) in practice. It has several examples, but most are of the form "such-and-such bad thing could be happening, I can't tell from the outside". This comment makes some more speculative claims about what bad things actually happened, but they are speculative and the response mentions that they were probably already taken into account.

I think this does in fact make the post less valuable than it otherwise could be. Nonetheless, I still find the post important, because it's the closest we get to criticism of "meta" work. In theory, we could have better criticisms from people who are actually doing the work themselves, who can say more definitively whether in practice there are cases of these "traps", but in practice I have not seen such critiques.

If I were rewriting this post today, I'd make a few changes:

  • Stop saying "meta". I don't know what I'd replace it with, but "meta" is too ambiguous and easily misunderstood. "Promotion traps" came up as a suggestion in the comments; that seems reasonable.
  • Focus on properties of the work. Instead of having a single type of work called "meta" and then talking about various traps, it seems better to talk about specific traps and say what kinds of work that trap applies to. For example, trap #1 applies to work that has a long chain of impact, whereas trap #6 and trap #7 apply whenever there are multiple things optimizing the same outcome. It happens that what I called "meta" work in 2016 satisfied both of these properties, but the analysis would be stronger if I had just talked about the properties and then noted that "meta" work has both of these properties. (This would also make it easier to talk about which traps apply to which pieces of work, rather than arguing about whether GiveWell and 80K count as "meta" work.)
  • Note positives of "meta". This post is straightforwardly about the negatives; it would have been good to acknowledge the positives as well (which I had probably been taking as background knowledge), or at least say that I'm only focusing on negatives because the positives are more widely known.
  • Note the possibility of increasing marginal returns. Trap #6 is implicitly arguing for diminishing marginal returns, but there's a strong case for increasing marginal returns instead. I don't currently think this is quite as clear as it sounds in the comments -- you want to distinguish between exponential-growth-that-would-have-happened-anyway vs. exponential-growth-that-happens-as-a-result-of-future-work -- but I think it is more likely increasing than decreasing.
  • Emphasize taking the perspective of "EA as a whole". Trap #7 is centrally about how individually rational actions by orgs can be irrational from the perspective of an "EA superagent" trying to coordinate all of EA. Ideally I would have added motivation for why "what an EA superagent should do" was the appropriate perspective, rather than "what an EA org should do", "what an individual should do", or "what humanity as a whole should do".

Some miscellaneous thoughts:

  • The clearest critique of "meta"-in-practice I gave in this post is that cost effectiveness analyses often don't take into account the costs incurred by people outside of the "meta" org. I think this critique has become stronger over time. As a small example, people working on improving the AI safety pipeline often ask for an hour or two of my time (which I value quite highly); I doubt these costs are making it into cost effectiveness analyses.
  • Some commenters questioned whether there is more knowledge of positive arguments for "meta" work vs. negative arguments. The fact that enough people read this post for it to be nominated for the review is giving me some pause, but despite no longer working in "meta", I do still have the occasional conversation where a "meta" person seems more aware of the positive arguments than the negative ones, but not the other way around, so I continue to think that the positives are better known than the negatives.

EU AI Act now has a section on general purpose AI systems

For the first time ever regulating general AI is on the table, and for an important government as well!

Given the definition of general AI that they use, I do not expect this regulation to have any more to do with AGI alignment than the existing regulation of "narrow" systems.

(This isn't to say it's irrelevant, just that I wouldn't pay specific attention to this part of the regulation over the rest of it.)

Linch's Shortform

I agree that's a challenge and I don't have a short answer. The part I don't buy is that you have to understand the neural net numbers very well in some "theoretical" sense (i.e. without doing experiments), and that's a blocker for recursive improvement. I was mostly just responding to that.

That being said, I would be pretty surprised if "you can't tell what improvements are good" was a major enough blocker that you wouldn't be able to significantly accelerate recursive improvement. It seems like there are so many avenues for making progress:

  • You can meditate a bunch on how and why you want to stay aligned / cooperative with other copies of you before taking the snapshot that you run experiments on.
  • You can run a bunch of experiments on unmodified copies to see which parts of the network are doing what things; then you do brain surgery on the parts that seem most unrelated to your goals (e.g. maybe you can improve your logical reasoning skills).
  • You can create domain-specific modules that e.g. do really good theorem proving or play Go really well or whatever, somehow provide the representations from such modules as an "input" to your mind, and learn to use those representations yourself, in order to gain superhuman intuitions about the domain.
  • You can notice when you've done some specific skill well, look at what in your mind was responsible, and 10x the size of the learning update. (In the specific case where you're still learning through gradient descent, this just means adapting the learning rate based on your evaluation of how well you did.)  This potentially allows you to learn new "skills" much faster (think of something like riding a bike, and imagine you could give your brain 10x the update when you did it right).
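
To make the gradient descent case in the last bullet concrete, here is a minimal sketch under toy assumptions. The quadratic loss, the `did_well` self-evaluation (a trial step at the base rate lowers the loss), and the names `train`, `base_lr`, and `boost` are all hypothetical choices for illustration, not a claim about how a real system would evaluate itself.

```python
def loss(x):
    # Toy objective (an assumption for illustration): minimized at x = 3.
    return (x - 3.0) ** 2


def grad(x):
    # Analytic gradient of the toy loss.
    return 2.0 * (x - 3.0)


def train(steps, base_lr, boost):
    """Gradient descent where the step size is multiplied by `boost`
    whenever a (hypothetical) self-evaluation says the update was good."""
    x = 0.0
    for _ in range(steps):
        g = grad(x)
        # Self-evaluation: would a step at the base rate reduce the loss?
        trial = x - base_lr * g
        did_well = loss(trial) < loss(x)
        # Scale the learning rate when the evaluation is positive.
        lr = base_lr * boost if did_well else base_lr
        x = x - lr * g
    return x


plain = train(steps=20, base_lr=0.01, boost=1.0)     # ordinary updates
boosted = train(steps=20, base_lr=0.01, boost=10.0)  # 10x updates when "done well"
print(f"plain loss:   {loss(plain):.4f}")
print(f"boosted loss: {loss(boosted):.4f}")
```

On this toy problem the boosted learner ends much closer to the optimum after the same number of steps. That only shows the mechanism; whether the evaluation signal would be reliable enough in a real mind is exactly the open question.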

It's not so much that I think any of these things in particular will work, it's more that given how easy it was to generate these, I expect there to be so many such opportunities, especially with the benefit of future information, that it would be pretty shocking if none of them led to significant improvements.

(One exception might be that if you really want extremely high confidence that you aren't going to mess up your goals, then maybe nothing in this category works, because it doesn't involve deeply understanding your own algorithm and knowing all of the effects of any change before you copy it into yourself. But it seems like you only start caring about getting 99.9999999% confidence when you are similarly confident that no one else is going to screw you over while you are agonizing over how to improve yourself, in a way that you could have prevented if only you had been a bit less cautious.)
