
Are you saying AIs trained this way won’t be agents?

Not especially. If I had to state it simply: a large space of instrumental goals isn't useful for capabilities today, and plausibly won't be in the future, so we have at least some reason not to worry about AI misalignment risk as much as we currently do.

In particular, it means we shouldn't assume instrumental goals will appear by default, and we should avoid over-relying on non-empirical approaches like intuition or imagination. We have to take things on a case-by-case basis rather than making broad judgements.

Note that instrumental convergence isn't binary but a spectrum: the more space for instrumental goals that is useful for capabilities, the worse things are, rather than instrumental goals being sharply either active or inactive.

My claim is that the evidence we have counts against much of that space being useful for capabilities, and I expect this trend to continue, at least partially, as AI progresses.

Yet I suspect this isn't getting at your true worry, which I want to address now. I suspect your true worry is the quote below:

And regardless of whatever else you’re saying, how can you feel safe that the next training regime won’t lead to instrumental convergence?

And while I can't answer that question totally, I'd like to suggest going on a walk, drinking water, or in the worst case getting mental help from a professional. But try to stop the loop of never feeling safe around something.

The reason I'm suggesting this is that acting on the need to feel safe has the following problems:

  1. If adopted, this would leave us vulnerable to arbitrarily high demands for safety, possibly crippling AI use cases. As a general policy, I'm not a fan of actions that permit arbitrarily high demands for anything, at least without heavy scrutiny, and I would require far, far more evidence than a feeling.

  2. We have no reason to assume that people's feelings of safety or unsafety are actually connected to the real evidence about whether AI is safe, or whether AI misalignment risk is a big problem. Your feelings are real, but I don't trust your feeling of unsafety about AI to tell me anything other than how you feel about something. That's fine, to the extent it isn't harming you materially, but it's important to note here.

Kaj Sotala made a similar post about why you should mostly feel safe. It's a different discussion from my comment, but the post below may be useful:


EDIT 1: I deeply hope you can feel better, no matter what happens in the AI space.

EDIT 2: One thing to keep in mind in general: when evidence makes something look more or less likely, the shift is usually smooth, rather than a jump to all or nothing. So in this case I'm claiming that AI is less dangerous, probably a lot less dangerous, but that doesn't mean the danger is totally erased; it means things are more safe, and have gotten smoothly better, based on our evidence to date.

Yeah, at least several comments have much more severe issues than tone or stylistic choices, like rewording ~every claim by Ben, Chloe and Alice, and then assuming the transformed claims had the same truth value as the originals.

I'm in a position very similar to Yarrow here: while Kat Woods has mostly convinced me that the most incendiary claims are likely false, and I'm sympathetic to the case for suing Ben and Habryka, there were dangerous red flags in the responses, so much so that I'd stop funding Nonlinear entirely, and I think it's quite bad that Kat Woods responded the way they did.

I unendorsed primarily because, apparently, the board didn't fire Sam Altman because of safety concerns, though I'm not sure this is accurate.

It seems like the board did not fire Sam Altman for safety reasons, but for other reasons. Utterly confusing, and IMO it demolishes my previous theory, though a lot of other theories also lost out.

Sources below, with their archive versions included:





While I generally agree that they almost certainly have more information on what happened, which is why I'm not really certain of this theory, my main reason is that AI safety as a cause mostly got away with incredibly weak standards of evidence for a long time, until the deep learning era from 2019 on, especially with all the evolution analogies. Even now it still tends to have very low standards (though I do believe it's slowly improving). This probably influenced a lot of EA safetyists like Ilya, who almost certainly absorbed the norms of the AI safety field, one of which is that a very low standard of evidence suffices to claim big things, and that's going to conflict with corporate/legal standards of evidence.

But I don't think most people who hold influential positions within EA (or EA-minded people who hold influential positions in the world at large, for that matter) are likely to be that superficial in their analysis of things. (In particular, I'm strongly disagreeing with the idea that it's likely that the board "basically had no evidence except speculation from the EA/LW forum". I think one thing EA is unusually good at – or maybe I should say "some/many parts of EA are unusually good at" – is hiring people for important roles who think for themselves and have generally good takes about things and acknowledge the possibility of being wrong about stuff. [Not to say that there isn't any groupthink among EAs. Also, "unusually good" isn't necessarily that high of a bar.])

I agree with this weakly, in the sense that being high up in EA is at least a slight update towards them actually thinking things through and being able to make real cases. My disagreement is that this effect is probably not strong enough to wash away the cultural effects of operating in a cause area where people don't need to meet any standard of evidence beyond long-winded blog posts, and get rewarded anyway.

Also, the board second-guessed its decision, which is evidence for the theory that they couldn't make a case meeting the standard of evidence of a corporate/legal setting.

If it were any other cause, say GiveWell or some other EA causes, I would trust much more that they had good reason. But AI safety has relied so heavily on very low to non-existent standards of evidence and epistemics that they probably couldn't explain themselves in a way that would abide by the strictness of a corporate/legal standard of evidence.

Edit: The firing wasn't because of safety related concerns.

If Ilya can say "we're pushing capabilities down a path that is imminently highly dangerous, potentially existentially, and Sam couldn't be trusted to manage this safely" with proof that might work - but then why not say that?

I suspect this is because, quite frankly, the concerns they had about Sam Altman being unsafe on AI had basically no evidence behind them except speculation from the EA/LW forums, which is not nearly enough evidence in the corporate/legal world. And to be frank, the EA/LW standard of evidence for treating AI risk as a big enough deal to investigate is very low, sometimes non-existent, and that simply does not work once you have to deal with companies or the legal system.

More generally, EA/LW is shockingly loose, sometimes non-existent in its standards of evidence for AI risk, which doesn't play well with the corporate/legal system.

This is admittedly a less charitable take than say, Lukas Gloor's take.

My general thoughts on this can be stated as: I'm mostly of the opinion that EA will survive this, bar something massively wrong like the board members willfully lying or massive fraud from EAs, primarily because most of the criticism is directed to the AI safety wing, and EA is more than AI safety, after all.

Nevertheless, I do think that this could be true for the AI safety wing, and they may have just hit a key limit to their power. In particular, depending on how this goes, I could foresee a reduction in AI safety power and influence, and IMO this was completely avoidable.

Yeah, this is one of the few times where I believe the EAs on the board likely overreached, because they probably didn't give enough evidence to justify their excoriating statement that Sam Altman was dishonest, and now he might be coming back to lead the company.

I'm not sure how to react to all of this, though.

Edit: My reaction is just WTF happened, and why did they completely play themselves? Though honestly, I just believe that they were inexperienced.

The Bay Area rationalist scene is a hive of techno-optimistic libertarians.[1] These people have a negative view of state/government effectiveness at a philosophical and ideological level, so their default perspective is that the government doesn't know what it's doing and won't do anything. [edit: Re-reading this paragraph it comes off as perhaps mean as well as harsh, which I apologise for]

Yeah, I kind of have to agree with this. I think the Bay Area rationalist scene underrates government competence, though even I was surprised at how little politicking happened, and how little it ended up being politicized.

Similarly, 'Politics is the Mind-Killer' might be the rationalist idea that has aged worst - especially for its influences on EA. EA is a political project - for example, the conclusions of Famine, Affluence, and Morality are fundamentally political.

I think AI was a surprisingly good exception to the rule that politicizing something makes it harder to achieve, mostly due to the popularity of AI regulation. I will say, though, that there's clear evidence that, at least for now, AI safety is in a privileged position and the heuristic no longer applies.

Overly-short timelines and FOOM. If you think takeoff is going to be so fast that we get no fire alarms, then what governments do doesn't matter. I think that's quite a load-bearing assumption that isn't holding up too well.

Not just that, though: I also think excessive pessimism around AI safety contributed, as a lot of people's mental health was almost certainly not great at best, leading them to catastrophize the situation and become ineffective.

This is a real issue in the climate change movement, and I expect that AI safety's embrace of pessimism was not good at all for thinking clearly.

Thinking of AI x-risk as only a technical problem to solve, and undervaluing AI Governance. Some of that might be comparative advantage (I'll do the coding and leave political co-ordination to those better suited). But it'd be interesting to see x-risk estimates include effectiveness of governance and attention of politicians/the public to this issue as input parameters.

I agree with this for the general problem of AI governance, though I disagree when it comes to AI alignment specifically. I do agree that rationalists underestimate the governance work required to achieve a flourishing future.

Okay, my crux is that the simplicity/Kolmogorov/Solomonoff prior is probably not very malign, assuming we could run it, and in general I find the prior not to be malign except for specific situations.

This is basically because the argument relies on the IMO dubious assumption that the halting oracle can only be used once; notably, once we use the halting/Solomonoff oracle more than once, it loses its malign properties.

More generally, if the Solomonoff Oracle is duplicatable, as modern AIs generally are, then there's a known solution to mitigate the malignancy of the Solomonoff prior: Duplicate it, and let multiple people run the Solomonoff inductor in parallel to increase the complexity of manipulation. The goal is essentially to remove the uniqueness of 1 Solomonoff inductor, and make an arbitrary number of such oracles to drive up the complexity of manipulation.

So under a weak assumption, the malignancy of the Solomonoff prior goes away. This is described well in the link below; the important part is that we need either a use-once condition or a uniqueness assumption. If neither holds, as is likely to be the case, then the Solomonoff/Kolmogorov prior isn't malign.


And that's if it's actually malign, which it might not be, at least in the large-data limit:


More specifically, it's this part of John Wentworth's comment:

In Solomonoff Model, Sufficiently Large Data Rules Out Malignness

There is a major outside-view reason to expect that the Solomonoff-is-malign argument must be doing something fishy: Solomonoff Induction (SI) comes with performance guarantees. In the limit of large data, SI performs as well as the best-predicting program, in every computably-generated world. The post mentions that:

A simple application of the no free lunch theorem shows that there is no way of making predictions that is better than the Solomonoff prior across all possible distributions over all possible strings. Thus, agents that are influencing the Solomonoff prior cannot be good at predicting, and thus gain influence, in all possible worlds.

... but in the large-data limit, SI's guarantees are stronger than just that. In the large-data limit, there is no computable way of making better predictions than the Solomonoff prior in any world. Thus, agents that are influencing the Solomonoff prior cannot gain long-term influence in any computable world; they have zero degrees of freedom to use for influence. It does not matter if they specialize in influencing worlds in which they have short strings; they still cannot use any degrees of freedom for influence without losing all their influence in the large-data limit.

Takeaway of this argument: as long as we throw enough data at our Solomonoff inductor before asking it for any outputs, the malign agent problem must go away. (Though note that we never know exactly how much data that is; all we have is a big-O argument with an uncomputable constant.)
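The large-data intuition in the argument above can be illustrated with a toy, fully computable stand-in for Solomonoff induction (the real thing is uncomputable): a Bayesian mixture over a small hand-picked hypothesis class, weighted by a 2^-length simplicity prior. The hypothesis names and "lengths" here are made up for illustration, not anything from the linked posts.

```python
# Toy stand-in for Solomonoff induction: a Bayesian mixture over a tiny,
# hand-picked hypothesis class with a 2^-length "simplicity" prior.
# As data grows, posterior mass concentrates on hypotheses that keep
# predicting correctly, mirroring the large-data guarantee discussed above.

from fractions import Fraction

# Each hypothesis deterministically predicts the next bit from the history.
# "length" plays the role of program length in the Solomonoff prior.
hypotheses = {
    "all_zeros": {"length": 1, "predict": lambda hist: 0},
    "all_ones":  {"length": 1, "predict": lambda hist: 1},
    "alternate": {"length": 2, "predict": lambda hist: len(hist) % 2},
    "copy_prev": {"length": 3, "predict": lambda hist: hist[-1] if hist else 0},
}

def posterior(data):
    """Posterior over hypotheses after observing `data` (a list of bits)."""
    weights = {}
    for name, h in hypotheses.items():
        w = Fraction(1, 2 ** h["length"])  # simplicity prior weight
        for i, bit in enumerate(data):
            if h["predict"](data[:i]) != bit:
                w = Fraction(0)  # a deterministic hypothesis that mispredicts
                break            # once is falsified and loses all weight
        weights[name] = w
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()} if total else weights

data = [0, 1, 0, 1, 0, 1]   # an alternating sequence
post = posterior(data)
best = max(post, key=post.get)
print(best, post[best])      # all surviving mass lands on "alternate"
```

The point of the toy model: a would-be manipulative hypothesis can only gain influence by predicting well, and any hypothesis that ever sacrifices prediction accuracy for influence is driven to zero posterior weight, which is the finite analogue of having "zero degrees of freedom to use for influence" in the large-data limit.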

As far as the actual practical question goes, there is a very important limitation on inner-misaligned agents under SGD: gradient hacking is very difficult to do, which is an underappreciated limit on misalignment, since SGD has powerful tools to remove inner-misaligned circuits/TMs/agents, as described in the link below:

