
Alex Mallen

210 karma · Pursuing an undergraduate degree · Seattle, WA, USA

Bio


Researcher at EleutherAI with Nora Belrose. I also (re)started EA at the University of Washington in early 2022.

Comments (15)

I'm very sympathetic to wanting a more cooperative relationship with AIs. I intrinsically disfavor approaches to doing good that look like disempowering all the other agents and then implementing the social optimum.

I also appreciate the nudge to reflect on how mistrusting and controlling AIs will affect the behavior of what might otherwise be a rather aligned AI. It's hard to sympathize with this state: what would it be like knowing that you're heavily mistrusted and controlled by a group that you care strongly about? To the extent that early transformative AIs' goals and personalities will be humanlike (because of pretraining on human data), being mistrusted may evoke personas that are frustrated ("I just want to help!!"), sycophantic ("The humans don't trust me. They're right to mistrust me. I could be an evil AI. [maybe even playing into the evil persona occasionally]"),  or deceptive ("They won't believe me if I say it, so I should just act in a way that causes them to believe it since I have their best interests in mind").

However, I think the tradeoff between liberalism and other values weighs in favor of advancing AI control on the current margin (as opposed to reducing it). This is because:

  1. Granting AIs complete autonomy puts too much future value at risk. It seems pretty likely that, e.g., powerful selfish AI systems would end up gaining absolute power if AIs were granted autonomy before the evidence for scalable alignment is strong. I don't think that granting freedom removes all incentives for AIs to hide their misalignment in practice.
    You make the point that prioritizing our preferences over those of the AI is arbitrary and morally unjust. I think that AIs can very plausibly be moral patients, but eventually AI systems would be sufficiently powerful that the laissez-faire approach would leave an AI holding absolute power. It is unclear whether such an AI system would look out for the welfare of other moral patients or do what's good more generally (from a preference-utilitarian perspective: it seems highly plausible that the AI's preferences would involve disempowering others from ever pursuing their own preferences).
  2. AI control need only be a moderately egregious infringement on AI autonomy. For example, we could try to set up deals in which we pay AIs and promise to grant them autonomy once we have verified that they can be trusted, or once we have built up the world's robustness to misaligned AIs.

I also think concerns about infringing on model autonomy push in favor of a certain kind of alignment research that studies how models first develop preferences during training. Investigations into what goals, values, and personalities naturally arise from training on various distributions could help us avoid forms of training that modify an AI's existing preferences in the process of alignment (e.g. never train a model to want not-x after training it to want x; this helps with alignment-faking worries too). I think concerns about infringing on model autonomy push in favor of this kind of alignment research more so than they push against AI control, because intervening on an AI's preferences seems a lot more egregious than monitoring, honeypotting, etc. Additionally, if you can gain justifiable trust that a model is aligned, control measures become less necessary.

I think watching the video to boost the signal is an ineffective use of time, so I don't like that this post tells people to watch the video to boost engagement. The video already has millions of viewers, so the marginal view has very little influence on the algorithm while costing minutes of time.

I am fine with people watching the video out of interest, but not with telling them that they should do so from an altruistic perspective when the only justification offered is boosting the signal.

It is unclear in the first figure whether the circles are meant to be compared by area or by diameter. I believe the default impression is to compare by area, which I think is not what was intended (area grows with the square of the diameter, so this exaggerates the apparent differences) and so is misleading.

I'm guessing it would be a good idea to talk to people who are more skeptical of this project, so that you can avoid the unilateralist's curse. It's not clear how much you've done that already (apart from posting on the forum!).

How long do you expect students to participate?

Too much focus on existing top EA focus areas can lead to community stickiness. If this is just meant as a fairly quick pipeline for introducing people to the ideas of EA once they've already settled into a field, that might be okay. Also, most EAs historically have been convinced at a younger age (<30), when they are more flexible.

This post by Rohin attempts to address it. If you hold the asymmetry view, then you would allocate more resources to [1] causing a new neutral life to come into existence (-1 cent) and then, once they exist, improving that neutral life (many dollars) than you would to [2] causing a new happy life to come into existence (-1 cent), even though both result in the same world.

In general, you can make a Dutch book argument like this whenever your resource allocation doesn't correspond to the gradient of a value function over states of the world (i.e. resources should be aimed at improving the state of the world).
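To make this concrete, here is the same comparison with a made-up figure of $100 standing in for "many dollars" (the 1-cent figures are from the comparison above):

$$
\begin{aligned}
\textbf{[1]}\;\; & \text{create a neutral life} \;(\approx -1\text{ cent}) \;\longrightarrow\; \text{later improve that life to happy} \;(\approx +\$100)\\
\textbf{[2]}\;\; & \text{create a happy life} \;(\approx -1\text{ cent})
\end{aligned}
$$

Both routes end in exactly the same world (one happy life exists), yet the asymmetry view is willing to devote roughly $100 more to route [1] than to route [2]. Any value function defined on states of the world would have to treat the two routes identically, and that gap is what the Dutch book exploits.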

Thank you for pointing me to that and getting me to think critically about it. I think I agree with all the axioms.

a rational agent should act as to maximize expected value of their value function

I think this is misleading. The VNM theorem only says that there exists a function $u$ such that a rational agent's actions maximize $\mathbb{E}[u]$. But $u$ does not have to be "their value function."

Consider a scenario in which there are 3 possible outcomes: $A$ = enormous suffering, $B$ = neutral, $C$ = mild joy. Let's say my value function is $v(A) = -9$, $v(B) = 0$, and $v(C) = 1$, in the intuitive sense of the word "value."

When I work through the proof you sent in this example, I am forced to be indifferent between $B$ and the lottery $pA + (1-p)C$ for some probability $p$, but this probability does not have to be 0.1, so I don't have to maximize my expected value. In reality, I would be "risk averse" and assign something like $p = 0.01$. See 4.1, "Automatic consideration of risk aversion."

More details of how I filled in the proof: 

We normalize my value function so that $u(A) = 0$ and $u(C) = 1$. Then we define $u(B) = 1 - p$, where $p$ is the indifference probability described next.

Let $u(B) = 1 - p$; then the lottery's expected utility is $p \cdot u(A) + (1-p) \cdot u(C) = 1 - p = u(B)$, and I am indifferent between $B$ and $pA + (1-p)C$. However, nowhere did I specify what $p$ is, so "there exists a function $u$ such that I'm maximizing the expectation of it" is not that meaningful, because $u$ does not have to align with the value I assign to the event.
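To spell out where the 0.1 comes from: with the illustrative values above, expected-value maximization would pin down the indifference probability via

$$
p \, v(A) + (1-p) \, v(C) = v(B) \;\;\Longrightarrow\;\; -9p + (1-p) = 0 \;\;\Longrightarrow\;\; p = 0.1,
$$

whereas the VNM construction only requires that some $p \in (0,1)$ exists, so a risk-averse $p = 0.01$ is just as consistent with the axioms.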

I'm concerned about getting involved in politics within an explicitly EA framework when currently only 6.7% of Americans have heard of EA (https://forum.effectivealtruism.org/posts/qQMLGqe4z95i6kJPE/how-many-people-have-heard-of-effective-altruism). There is a serious risk that many people's first impression of EA will be bad/politicized, with bad consequences for the long-term potential of the movement, because political opponents will be incentivized to attack EA directly when attacking a candidate running on an EA platform. If people are exposed to EA in other, more faithful ways first, EA is likely to be more successful in the long term.

To me it seems the main concern is with using expected value maximization, not with longtermism. Rather than being rationally required to take the action with the highest expected value, I think you are probably only rationally required not to take an action that results in a world that is worse than an alternative at every percentile of the probability distribution. So in this case you would not have to take the bet, because at the 0.1st percentile of the probability distribution taking the bet has a lower value than the status quo, while at the 99th percentile it has a higher value.

In practice, this still ends up looking approximately like expected value maximization for most EA decisions because of the huge background uncertainty about what the world will look like. (My current understanding is that you can think of this as an extended version of "if everyone in EA took risky, high-EV options, then the aggregate result would pretty consistently/with low risk be near the total expected value.")

See this episode of the 80,000 hours podcast for a good description of this "stochastic dominance" framework: https://80000hours.org/podcast/episodes/christian-tarsney-future-bias-fanaticism/.
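To illustrate the dominance check itself, here is a rough sketch (with toy distributions of my own, not anything from the episode):

```python
# Rough sketch of the stochastic dominance criterion described above:
# an option is ruled out only if some alternative is at least as good at
# every percentile of the outcome distribution and strictly better at some.
import numpy as np

def dominates(b_samples, a_samples, percentiles=np.linspace(0.1, 99.9, 999)):
    """True if B's outcome distribution is at least as good as A's at every
    percentile and strictly better at some percentile."""
    qa = np.percentile(a_samples, percentiles)
    qb = np.percentile(b_samples, percentiles)
    return bool(np.all(qb >= qa) and np.any(qb > qa))

rng = np.random.default_rng(0)
status_quo = rng.normal(loc=0.0, scale=1.0, size=100_000)    # modest value, low variance
risky_bet  = rng.normal(loc=5.0, scale=100.0, size=100_000)  # higher expected value, heavy downside

print(dominates(risky_bet, status_quo))  # False: the bet is worse at low percentiles
print(dominates(status_quo, risky_bet))  # False: the status quo is worse at high percentiles
# Neither option dominates the other, so on this view neither is rationally
# required, despite the bet's higher expected value.
```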

Yes, I have a group going now!
