Aleksi Maunu

87 karmaJoined Jun 2019Otaniemi, 02150 Espoo, Finland


Naively I would trade a lot of clearly-safe stuff being delayed or temporarily prohibited for even a minor decrease in chance of safe-seeming-but-actually-dangerous stuff going through, which pushes me towards favoring a more expansive scope of regulation.

(in my mind the potential loss of decades of life improvements currently pale vs potential non-existence of all lives in the longterm future)

Don't know how to think about it when accounting for public opinion though, I expect a larger scope will gather more opposition to regulation, which could be detrimental in various ways, the most obvious being decreased likelihood of such regulation being passed/upheld/disseminated to other places.

But the difficulty of alignment doesn't seem to imply much about whether slowing is good or bad, or about its priority relative to other goals.

At the extremes, if alignment-to-"good"-values by default was 100% likely I presume slowing down would be net-negative, and racing ahead would look great. It's unclear to me where the tipping point is, what kind of distribution over different alignment difficulty levels one would need to have to tip from wanting to speed up vs wanting to slow down AI progress.

Seems to me like the more longtermist one is, the more slowing down looks good even when one is very optimistic about alignment. Then again there are some considerations that push against this: risk of totalitarianism, risk of pause that never ends, risk of value-agnostic alignment being solved and the first AGI being aligned to "worse" values than the default outcome. 

(I realize I'm using two different definitions of alignment in this comment, would like to know if there's standardized terminology to differentiate between them)

How is the "secretly is planning to murder all humans" improving the models scores on a benchmark?

(I personally don't find this likely, so this might accidentally be a strawman)

For example: planning and gaining knowledge are incentivized on many benchmarks -> instrumental convergence makes model instrumentally value power among other things -> a very advanced system that is great at long-term planning might conclude that "murdering all humans" is useful for power or other instrumentally convergent goals


You could prove this. Make a psychopathic model designed to "betray" in a game like world and then see how many rounds of training on a new dataset clear the ability for the model to kill when it improves score.

I think with our current interpretability techniques we wouldn't be able to robustly distinguish between a model that generalized to behave well in any reasonable environment vs a model that learned to behave well in that specific environment but would turn back to betray in many other environments

GPT-4 doesn't have the internal bits which make inner alignment a relevant concern.

Is this commonly agreed upon even after fine-tuning with RLHF? I assumed it's an open empirical question. The way I understand is is that there's a reward signal (human feedback) that's shaping different parts of the neural network that determines GPT-4's ouputs, and we don't have good enough interpretability techniques to know whether some parts of the neural network are representations of "goals", and even less so what specific goals they are.

I would've thought it's an open question whether even base models have internal representations of "goals", either always active or only active in some specific context. For example if we buy the simulacra (predictors?) frame, a goal could be active only when a certain simulacrum is active.

(would love to be corrected :D) 

Anyone else not able to join the group through the link? 🤔 It just redirects me to the dashboard without adding me in


In those cases I would interpret agree votes as "I'm also thankful" or "this has also given me a lot to think about"

I think the stated reasoning there by OP is that it's important to influence OpenAI's leadership's stance and OpenAI's work on AI existential safety. Do you think this is unreasonable?

To be fair I do think it makes a lot of sense to invoke nepotism here. I would be highly suspicious of the grant if I didn't happen to place a lot of trust in Holden Karnofsky and OP.

(feel free to not respond, I'm just curious)

I think if I was issuing grants, I would use misleading language in such a letter to make it less likely that the grantee organization can't get registered for some bureaucracy reasons. It's possible to mention that to the grantee in an email or call too to not cause any confusion. My guess would be that that's what happened here but that's just my 2 cents. I have no relevant expertise.

Thanks for the comment! I feel funny saying this without being the author, but feel like the rest of my comment is a bit cold in tone, so thought it's appropriate to add this :) 


I lean more moral anti-realist but I struggle to see how the concept of "value alignment" and "decision-making quality" are not similarly orthogonal from a moral realist view than an anti-realist view. 

Moral realist frame: "The more the institution is intending to do things according to the 'true moral view', the more it's value-aligned."

"The better the institutions's decision making process is at predictably leading to what they value, the better their 'decision-making quality' is."

I don't see why these couldn't be orthogonal in at least some cases. For example, a terrorist organization could  be outstandingly good at producing outstandingly bad outcomes.


Still, it's true that the "value-aligned" term might not be the best, since some people seem to interpret it as a dog-whistle for "not following EA dogma enough" [link] (I don't, although might be mistaken). "Altruism" and "Effectiveness"as the x and y axes would suffer from the  problem mentioned in the post that it could alienate people coming to work on IIDM from outside the EA community. For the y-axis, ideally I'd like some terms that make it easy to differentiate between beliefs common in EA that are uncontroversial ("let's value people's lives the same regardless of where they live"), and beliefs that are more controversial ("x-risk is the key moral priority of our times").


About the problematicness of " value-neutral": I thought the post gave enough space for the belief that institutions might be worse than neutral on average, marking statements implying the opposite as uncertain. For example crux (a) exists in this image to point out that if you disagree with it, you would come to a different conclusion about the effectiveness of (A).


(I'm testing out writing more comments on the EA forum, feel free to say if it was helpful or not! I want to learn to spend less time on these. This took about 30 minutes.)

(not the author)


4. When I hear "(1) IIDM can improve our intellectual and political environment", I'm imagining something like if the concept of steelmanning becomes common in public discourse, we might expect that to indirectly lead to better decisions by key institutions. 

Load more