Send me anonymous feedback: https://docs.google.com/forms/d/1qDWHI0ARJAJMGqhxc9FHgzHyEFp-1xneyl9hxSMzJP0/viewform

Any type of feedback is welcome, including arguments that a post/comment I wrote is net negative.

Some quick info about me:

I have a background in computer science (BSc+MSc; my MSc thesis was in NLP and ML, though not in deep learning).

You can also find me on the AI Alignment Forum and LessWrong.

Feel free to reach out by sending me a PM here or on my website.

When and how should an online community space (e.g., Slack workspace) for a particular type/group of people be created?

As an aside: EA groups on FB generally seem like a coordination failure (FB is probably an addictive platform that optimizes for something like users' time spent). So for every active FB group, it seems beneficial to create a non-FB group that could replace it.

We're Redwood Research, we do applied alignment research, AMA

What type of legal entity is Redwood Research operating as/under? Is it plausible that at some point the project will be funded by investors and that shareholders will be able to financially profit?

Why does (any particular) AI safety work reduce s-risks more than it increases them?
Answer by ofer · Oct 04, 2021

I think that's an important question. Here are some thoughts (though I think this topic deserves a much more rigorous treatment):

Creating an AGI with an arbitrary goal system (that is potentially much less satiable than humans') and arbitrary game theoretical mechanisms—via an ML process that can involve an arbitrary amount of ~suffering/disutility—generally seems very dangerous. Some of the relevant considerations are weird and non-obvious. For example, creating such an arbitrary AGI may constitute wronging some set of agents across the multiverse (due to the goal system & game theoretical mechanisms of that AGI).

I think there's also the general argument that, due to cluelessness, trying to achieve some form of a vigilant Long Reflection process is the best option on the table, including by the lights of suffering-focused ethics (e.g. due to weird ways in which resources could be used to reduce suffering across the multiverse via acausal trading). Interventions that mitigate x-risks (including AI-related x-risks) seem to increase the probability that humanity will achieve such a Long Reflection process.

Finally, a meta point that seems important: people in EA who have spent a lot of time on AI safety (including myself), or even made it their career, probably have a motivated-reasoning bias toward believing that working on AI safety tends to be net positive.

The problem of artificial suffering

Though implausible (at least to me), you could imagine an Astronomical Waste-style argument about delaying or preventing the instantiation of artificial consciousness with positively valenced states. In this case, a moratorium would be a seriously ethically negative event.

I think the astronomical waste paper argues that the EV loss from a "small" delay seems negligible; quoting from it:

For example, a single percentage point of reduction of existential risks would be worth (from a utilitarian expected utility point-of-view) a delay of over 10 million years.

A mesa-optimization perspective on AI valence and moral patienthood

(I don't know/remember the details of AlphaGo, but if the setup involves a value network that is trained to predict the outcome of MCTS-guided gameplay, that seems to make it more likely that the value network does some sort of search during inference.)
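To make the setup I have in mind concrete, here is a toy sketch (under my own assumptions, not AlphaGo's actual architecture or code): a "value network" is fitted to the outcomes of games that were, hypothetically, guided by search, and at inference time the network is a single forward pass — so any search-like computation would have to be implicit in its learned weights.

```python
import numpy as np

# Toy sketch (my assumptions, not AlphaGo's actual code): fit a linear
# "value network" to the outcomes of (pretend) search-guided self-play
# games. At inference the network is a single matrix product -- any
# search-like computation would have to be implicit in its weights.

rng = np.random.default_rng(0)
STATE_DIM = 16

# Stand-in for (state, final outcome in {-1, +1}) pairs that would be
# produced by MCTS-guided self-play.
true_w = rng.standard_normal(STATE_DIM)
states = rng.standard_normal((1000, STATE_DIM))
outcomes = np.sign(states @ true_w)

# Least-squares fit, standing in for SGD training of the value network.
w, *_ = np.linalg.lstsq(states, outcomes, rcond=None)

# Inference: one forward pass per state, no explicit tree search.
preds = np.sign(states @ w)
accuracy = float((preds == outcomes).mean())
```

On this toy data the fitted "network" recovers the outcomes almost perfectly; the question raised in the comment is whether, at scale, this kind of training pressure pushes the network toward internally re-implementing something search-like.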

A mesa-optimization perspective on AI valence and moral patienthood

Essentially the more the setup factors out valence-relevant computation (e.g. by separating out a module, or by accessing an oracle as in your example) the less likely it is for valenced processing to happen within the agent.

I think the analogy to humans suggests otherwise. Suppose a human feels pain in their hand due to touching something hot. We can regard all the relevant mechanisms in their body outside the brain—those that cause the brain to receive the relevant signal—as mechanisms that have been "factored out from the brain". And yet those mechanisms are involved in morally relevant pain. In contrast, suppose a human touches a radioactive material until they realize it's dangerous. Here there are no relevant mechanisms that have been "factored out from the brain" (the brain needs to use ~general reasoning); and there is no morally relevant pain in this scenario.

Though generally if "factoring out stuff" means that smaller/less-capable neural networks are used, then maybe it can reduce morally relevant valence risks.

A mesa-optimization perspective on AI valence and moral patienthood

GPT-3 is of that form, but AlphaGo/MuZero isn't (I would argue).

I don't see why. The NNs in AlphaGo and MuZero were trained using some SGD variant (right?), and SGD variants can theoretically yield mesa-optimizers.

A mesa-optimization perspective on AI valence and moral patienthood

Re comment 1: Yes, sorry, this was just meant to point at a potential parallel, not to work out the parallel in detail. I think it'd be valuable to work out the potential parallel between the DM agent's predicate predictor module (Fig. 12, p. 14) and my factored-noxiousness-object-detector idea. I just took a brief look at the paper to refresh my memory, but if I'm understanding it correctly, it seems to me that this module predicts which parts of the state prevent goal realization.

I guess what I don't understand is how the "predicate predictor" thing can make the setup less likely to yield models that support morally relevant valence (if you indeed think that). Suppose the environment is modified such that the observation the agent gets at each time step includes the value of every predicate in the reward specification. That would make the "predicate predictor" useless (I think; judging from a quick look at the paper). Would that new setup be more likely than the original to yield models that have morally relevant valence?
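A minimal sketch of this thought experiment (a hypothetical setup, not the paper's actual interface): if the environment appends the ground-truth value of every reward predicate to the observation, a learned "predicate predictor" reduces to slicing those entries back out.

```python
import numpy as np

# Hypothetical sketch of the modified setup described above (not the
# paper's actual interface): the environment appends the ground-truth
# value of every reward predicate to the observation, so a learned
# "predicate predictor" becomes trivial.

rng = np.random.default_rng(0)
N_PREDICATES = 3

def env_observation(raw_obs, predicates):
    """Modified setup: observation = raw observation + true predicate values."""
    return np.concatenate([raw_obs, predicates.astype(float)])

def predicate_predictor(obs):
    """In this setup, 'predicting' the predicates is just slicing."""
    return obs[-N_PREDICATES:]

raw = rng.standard_normal(5)
predicates = rng.integers(0, 2, size=N_PREDICATES)
obs = env_observation(raw, predicates)
recovered = predicate_predictor(obs)
```

Since the predictor does no real work here, the question is whether removing its workload in this way changes anything valence-relevant about the rest of the model.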

A mesa-optimization perspective on AI valence and moral patienthood

This topic seems to me both extremely important and neglected. (Maybe it's neglected because it ~requires some combination of ML and philosophy backgrounds that people rarely have.)

My interpretation of the core hypothesis in this post is roughly the following: a mesa-optimizer may receive evaluative signals that are computed by some subnetwork within the model (a subnetwork that the base optimizer shaped to give "useful" evaluative signals w.r.t. the base objective). Those evaluative signals can constitute morally relevant valenced experience. This hypothesis seems plausible to me.
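A toy sketch of the structure I have in mind (a hypothetical architecture, not anything from the post or a real system): a single model contains an "evaluator" subnetwork whose scalar output feeds the downstream "policy" subnetwork.

```python
import numpy as np

# Toy sketch of the hypothesized structure (hypothetical architecture,
# not from the post or any real system): one model contains an
# "evaluator" subnetwork whose scalar output feeds the downstream
# "policy" subnetwork. The base optimizer would shape the evaluator's
# weights so its signal is "useful" w.r.t. the base objective; the
# hypothesis is that such an internal signal could play a valence-like
# functional role.

rng = np.random.default_rng(0)
OBS_DIM = 4

W_eval = rng.standard_normal(OBS_DIM)          # evaluator: obs -> scalar
W_pol = rng.standard_normal((OBS_DIM + 1, 2))  # policy: (obs, signal) -> logits

def forward(obs):
    """Return (evaluative_signal, action_logits) for one observation."""
    signal = float(np.tanh(obs @ W_eval))        # internal evaluative signal
    policy_in = np.concatenate([obs, [signal]])  # the signal feeds the policy
    logits = policy_in @ W_pol
    return signal, logits

signal, logits = forward(rng.standard_normal(OBS_DIM))
```

The point of the sketch is only the wiring: the evaluative signal is computed inside the model and consumed by the model, rather than arriving as an external reward.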

Some further comments:

  1. Re:

    For instance, the DeepMind agent discussed in section 4 pre-processes an evaluation of its goal achievement. This evaluative signal is factored/non-integrated in a sense, and so it may not be interacting in the right way with the downstream, abstract processes to reach conscious processing.

    I don't follow. I'm not closely familiar with the Open-Ended Learning paper, but from a quick look my impression is that it's basically standard RL in multi-agent environments, with more diversity in the training environments than in most prior work. I don't understand what you mean when you say that the agent "pre-processes an evaluation of its goal achievement" (and why the analogy to humans & evolution is less salient here, if you think that).

  2. Re:

    Returning to Tomasik’s assumption, “RL operations are relevant to an agent’s welfare”, the functionalist must disagree. At best we can say that RL operations can be instrumentally valuable by (positively) modifying the valence system.

    (I assume that an "RL operation" refers to things like an update of the weights of a policy network.) I'm not sure what you mean by "positively" here. An update to the weights of the policy network can negatively affect an evaluative signal.

[EDIT: Also, re: "Compare now to mesa-optimizers in which the reward signal is definitionally internal to the system". I don't think that the definition of mesa-optimizers involves a reward signal. (It's possible that a mesa-optimizer will never receive any evidence about "how well it's doing".)]

Investigating how technology-focused academic fields become self-sustaining

We define a self-sustaining field as “an academic research field that is capable of attracting the necessary funds and expertise for future work without reliance on a particular small group of people or funding sources” (see the next subsection for more on this definition).

I think that another aspect that is important to consider here is the goals of the funders. For example, an academic field may get huge amounts of funding from several industry actors that try to influence/bias researchers in certain ways (e.g. for the purpose of avoiding regulation). Such an academic field may satisfy the criterion above for being "self-sustaining" while having lower EV than what it would have if it had only a small group of EA-aligned funders.
