Seth Herd

10 karma · Joined Jul 2023


I personally think LLMs will plateau around human level, but that they will be made agentic and self-teaching, and therefore self-aware (in sum, "sapient") and truly dangerous, by scaffolding them into language model agents or language model cognitive architectures. See Capabilities and alignment of LLM cognitive for my logic in expecting that.

That would be a good outcome. We'd have agents with their own goals, capable enough to do useful and dangerous things, but probably not quite capable enough to self-exfiltrate, and probably initially under the control of relatively sane people. That would scare the pants off of the world, and we'd see some real efforts to align the things. That is uniquely doable, since they'd take top-level goals in natural language and be readily interpretable by default (with plenty of real concerns remaining, including Waluigi effects and their utterances not reliably reflecting their real underlying cognition).

I think the general consensus, which I share, is that neither mind uploading nor BCI good enough to allow brain extensions is likely to happen before AGI. I wish I had citations ready to hand.

I haven't heard as much discussion of the biological superbrains approach. I think it's probably feasible to increase intelligence through genetic engineering, but if you took the route of altering embryos, that would probably also take too long to help with alignment before AGI happens. Altering adults would be tougher and more limited. And it would hit the same legal problems.

I think that neuromorphic AGI is a possibility, which is why some of my alignment work addresses it. I think the best and most prominent work on that topic is Steve Byrnes' Intro to Brain-Like-AGI Safety.

I think that's quite a pessimistic take. I take Altman seriously on caring about x-risk, although I'm not sure he takes it quite seriously enough. This is based on public comments to that effect around 2013, before he started running OpenAI. And Sutskever definitely seems properly concerned.

I agree that those teams aren't completely trustworthy, and in an ideal world, we should be making this decision by including everyone on earth. But with a partial pause, do you expect to have better or worse teams in the lead for achieving AGI? That was my point.

I'm not sure which is the better place to have this discussion, so I'm trying both. Copied from my comment on Less Wrong:

That all makes sense. To expand a little more on some of the logic:

It seems like the outcome of a partial pause rests in part on whether that would tend to put people in the lead of the AGI race who are more or less safety-concerned.

I think it's nontrivial that we currently have three teams in the lead who all appear to honestly take the risks very seriously, and changing that might be a very bad idea.

On the other hand, the argument for alignment risks is quite strong, and we might expect more people to take the risks more seriously as those arguments diffuse. This might not happen if polarization becomes a large factor in beliefs on AGI risk. The evidence for climate change was also pretty strong, but we saw half of America believe in it less, not more, as evidence mounted. The lines of polarization would be different in this case, but I'm afraid it could happen. I outlined that case a little in AI scares and changing public beliefs.

In that case, I think a partial pause would have a negative expected value, as the current lead decays and people who take the risks less seriously get into the lead by circumventing the pause.

This makes me highly unsure if a pause would be net-positive. Having alignment solutions won't help if they're not implemented because the taxes are too high.

The creation of compute overhang is another reason to worry about a pause. It's highly uncertain how far we are from making adequate compute for AGI affordable to individuals. Algorithms and compute will keep getting better during a pause. So will theory of AGI, along with theory of alignment.

This puts me, and I think the alignment community at large, in a very uncomfortable position of not knowing whether a realistic pause would be helpful.

It does seem clear that creating mechanisms and political will for a pause is a good idea.

Advocating for more safety work also seems clear cut.

To this end, I think it's true that you create more political capital by successfully pushing for policy.

A pause now would create even more capital, but it's also less likely to be a win, and it could wind up creating polarization and so costing rather than creating capital. It's harder to argue for a pause now when even most alignment folks think we're years from AGI.

So perhaps the low-hanging fruit is pushing for voluntary RSPs, and government funding for safety work. These are clear improvements, and likely to be wins that create capital for a pause as we get closer to AGI.

There's a lot of uncertainty here, and that's uncomfortable. More discussion like this should help resolve that uncertainty, and thereby help clarify and unify the collective will of the safety community.

I agree with you that humans have mismatched goals among ourselves, so some amount of goal mismatch is just a fact we have to deal with. I think the ideal is that we get an AGI that makes its goal the overlap in human goals; see Empowerment is (almost) All We Need and others on preference maximization.

I also agree with your intuition that having a non-maximizer improves the odds of an AGI not seeking power or doing other dangerous things. But I think we need to go far beyond the intuition; we don't want to play odds with the future of humanity. To that end, I have more thoughts on where this will and won't happen.

I'm saying "the problem" with optimization is actually mismatched goals, not optimization/maximization. In more depth, and hopefully more usefully: I think unbounded goals are the problem with optimization (not the only problem, but a very big one). 

If an AGI had a bounded goal like "make one billion paperclips", it wouldn't be nearly as dangerous; it might still decide to eliminate humanity to make the odds of getting to a billion as good as possible (I can't remember where I saw this important point; I think maybe Nate Soares made it). But it might decide that its best odds would just be making some improvements to the paperclip business, in which case it wouldn't cause problems.

Mismatched goals is the problem. The logic of instrumental convergence applies to any goal, not just maximization goals.

This is a start, but just a start. Optimization/maximization isn't actually the problem. Any highly competent agent with goals that don't match ours is the problem.

A world that's 10% paperclips, with the rest composed of other stuff we don't care about, is no better than the world a true optimizer would produce.

The idea "just don't optimize" has a surprising amount of support in AGI safety, including quantilizers and satisficing. But they seem like only a bare start on taking the points off of the tiger's teeth to me. The tiger will still gnaw you to death if it wants to even a little.
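To make the comparison concrete, here is a toy sketch of the three decision rules. It is an illustration under stated assumptions, not anyone's published implementation: the function names are hypothetical, the action set is finite, and the quantilizer's base distribution is assumed to be uniform over actions. Note that all three rules still rank actions by the same utility function, so if that utility is mismatched with ours, softening the selection rule only changes how hard the agent pushes, not what it pushes toward.

```python
import random


def maximize(actions, utility):
    """Pick the single highest-utility action."""
    return max(actions, key=utility)


def satisfice(actions, utility, threshold):
    """Pick the first action that is 'good enough', or None if none qualifies."""
    for action in actions:
        if utility(action) >= threshold:
            return action
    return None


def quantilize(actions, utility, q, rng):
    """Sample uniformly from the top q fraction of actions ranked by utility.

    Assumes a uniform base distribution over a finite action list.
    """
    ranked = sorted(actions, key=utility, reverse=True)
    top_k = max(1, int(len(ranked) * q))
    return rng.choice(ranked[:top_k])


if __name__ == "__main__":
    # Hypothetical action space: action i yields i paperclips.
    actions = list(range(100))
    paperclips = lambda a: a
    rng = random.Random(0)
    print("maximizer:  ", maximize(actions, paperclips))
    print("satisficer: ", satisfice(actions, paperclips, threshold=50))
    print("quantilizer:", quantilize(actions, paperclips, q=0.25, rng=rng))
```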

It means humans are highly imperfect maximizers of some imperfectly defined and ever-changing thing: your estimated future rewards according to your current reward function.

It doesn't matter that you're not exactly maximizing one certain thing; you're working toward some set of things, and if you're really good at that, it's really bad for anyone who doesn't like that set of things.

Optimization/maximization is a red herring. Highly competent agents with goals different from yours are the core problem.

From a neuroscience/psychology perspective, I'd say that you are maximizing your future reward. And while that's not a well-defined thing, it doesn't matter; if you were highly competent, you'd make a lot of changes to the world according to what tickles you, and those might or might not be good for others, depending on your preferences (reward function). The slight difference between turning the world into one well-defined thing and a bunch of things you like isn't that important to anyone who doesn't like what you like.

This is a broader and more intuitive form of the argument Miles is trying to make precise.

If you can be Dutch-booked without limit, well, you're just not competent enough to be a threat; but you're not going to let that happen, let alone a superintelligent version of you.
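For readers unfamiliar with the term, here is a minimal sketch of the closely related "money pump": an agent with cyclic preferences pays a small fee for every swap to an item it prefers, and ends up holding what it started with but poorer. The function name, the fee, and the three-item preference cycle are all assumptions made for illustration.

```python
def money_pump(prefers, start_item, offers, fee, wealth):
    """Offer the agent a sequence of swaps.

    `prefers` is a set of (better, worse) pairs. The agent accepts any offered
    item it prefers to its current one, paying `fee` each time it can afford to.
    Returns the final (item, wealth).
    """
    item = start_item
    for offered in offers:
        if wealth >= fee and (offered, item) in prefers:
            item = offered
            wealth -= fee
    return item, wealth


if __name__ == "__main__":
    # Cyclic preferences: A over B, B over C, C over A.
    cyclic = {("A", "B"), ("B", "C"), ("C", "A")}
    # Transitive preferences: A over B over C.
    transitive = {("A", "B"), ("B", "C"), ("A", "C")}
    offers = ["C", "B", "A"] * 3  # three trips around the cycle

    print(money_pump(cyclic, "A", offers, fee=1, wealth=10))
    print(money_pump(transitive, "A", offers, fee=1, wealth=10))
```

The cyclic agent accepts every offer and loses a fee on each one, while the transitive agent declines them all. A competent agent notices the pattern and stops trading, which is the point above: being exploitable without limit is a sign of incompetence, not a safety property.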
