Ok, seems like this might have been more a terminological misunderstanding on my end. I think I agree with what you say here, 'What if the “Inner As AGI” criterion does not apply? Then the outer algorithm is an essential part of the AGI’s operating algorithm'.
Ok, interesting. I suspect the programmers will not be able to easily inspect the inner algorithm, because the inner/outer distinction will not be as clear cut as in the human case. The programmers may avoid sitting around by fiddling with more observable inefficiencies e.g. coming up with batch-norm v10.
Good clarification. Determining which kinds of factoring are the ones which reduce valence is more subtle than I had thought. I agree with you that the DeepMind set-up seems more analogous to neural nociception (e.g. high heat detection). My proposed set-up (Figure 5) seems significantly different from the DM/nociception case, because it factors the step where nociceptive signals affect decision making and motivation. I'll edit my post to clarify.
Your new setup seems less likely to have morally relevant valence. Essentially the more the setup factors out valence-relevant computation (e.g. by separating out a module, or by accessing an oracle as in your example) the less likely it is for valenced processing to happen within the agent.
Just to be explicit here, I'm assuming estimates of goal achievement are valence-relevant. How generally this is true is not clear to me.
Thanks for the link. I’ll have to do a thorough read through your post in the future. From scanning it, I do disagree with much of it, many of those points of disagreement were laid out by previous commenters. One point I didn’t see brought up: IIRC the biological anchors paper suggests we will have enough compute to do evolution-type optimization before the end of the century. So even if we grant your claim that learning to learn is much harder to directly optimize for, I think it’s still a feasible path to AGI. Or perhaps you think evolution like optimization takes more compute than the biological anchors paper claims?
Certainly valenced processing could emerge outside of this mesa-optimization context. I agree that for "hand-crafted" (i.e. no base-optimizer) systems this terminology isn't helpful. To try to make sure I understand your point, let me try to describe such a scenario in more detail: Imagine a human programmer who is working with a bunch of DL modules and interpretability tools and programming heuristics which feed into these modules in different ways -- in a sense the opposite end of the spectrum from monolithic language models. This person might program some noxiousness heuristics that input into a language module. Those might correspond to a Phenumb-like phenomenology. This person might program some other noxiousness heuristics that input into all modules as scalars. Those might end up being valenced or might not, hard to say. Without having thought about this in detail, my mesa-optimization framing doesn't seem very helpful for understanding this scenario.
Ideally we'd want a method for identifying valence which is more mechanistic that mine. In the sense that it lets you identify valence in a system just by looking inside the system without looking at how it was made. All that said, most contemporary progress on AI happens by running base-optimizers which could support mesa-optimization, so I think it's quite useful to develop criterion which apply to this context.
Hopefully this answers your question and the broader concern, but if I'm misunderstanding let me know.
Your interpretation is a good summary!
Re comment 1: Yes, sorry this was just meant to point at a potential parallel not to work out the parallel in detail. I think it'd be valuable to work out the potential parallel between the DM agent's predicate predictor module (Fig12/pg14) with my factored-noxiousness-object-detector idea. I just took a brief look at the paper to refresh my memory, but if I'm understanding this correctly, it seems to me that this module predicts which parts of the state prevent goal realization.
Re comment 2: Yes, this should read "(positive/negatively)". Thanks for pointing this out.
Re EDIT: Mesa-optimizers may or may not represent a reward signal -- perhaps there's a connection here with Demski's distinction between search and control. But for the purposes of my point in the text, I don't think this much matters. All I'm trying to say is that VPG-type-optimizers have external reward signals, whereas mesa-optimizers can have internal reward signals.
Ah great, I have pledged. Is this new this year? Or maybe I didn't fill out the pledge last year; I don't remember.
Would it make sense for the Giving Tuesday organization to send out an annual reminder email? I have re-categorized all of my EA newsletters, and so they don't go to my main inbox. Maybe most people have calendar events, or the like, set up. Maybe though for people who almost forgot about Giving Tuesday (like me) a reminder email could be useful!
The question of how to aggregate over time may even have important consequences for population ethics paradoxes. You might be interested in reading Vanessa Kosoy's theory here in which she sums an individual's utility over time with an increasing penalty over life-span. Although I'm not clear on the justification for these choices, the consequences may be appealing to many: Vanessa, herself, emphasizes the consequences on evaluating astronomical waste and factory farming.