I feel like you're baking a lot into this clause:
With AI delegates, they would presumably be verifiable and would be programmed to tell the truth and keep to deals
I think that aiming for an equilibrium where that's true would be good, but I'm not certain that's the starting point (and if insisting on it would otherwise scupper getting this off the ground, it probably shouldn't be the starting point).
So if one person adopts the AI delegate and another doesn't, then the human can exaggerate their preferences, withhold information, and even defect on the deal (without blatantly lying), but a verifiable AI delegate presumably wouldn't be able to do that?
I see no reason why an AI delegate shouldn't be able to withhold information. I agree that people might want delegates that could do the other things too, but I think that it might be better for the human principal if their delegate couldn't -- it can develop a reputation as trustworthy (in a way that's hard for an individual human, because others don't see enough of a track record).
I agree that there are significant concerns here! FWIW I'm more concerned about the adversarially-manipulated layer (at least as something needing attention now). I think that a lot of these applications could work with systems that aren't much stronger than what we have today; but that getting effective misaligned scheming would require a significant step up in capabilities. (You might have weaker forms of misalignment, but I think that those are pretty similar to "the systems just aren't really good enough yet".)
We didn't, although two of us were involved in running the AI for Human Reasoning fellowship, and some of the fellows on that did.
I think the reasons we didn't go deeper on this are basically a mix of:
Could you comment on the sense of "should" you have in mind in this post?
I think your core thesis is something like "it would be more socially efficient for AI systems to have prosocial drives". (I lean agree.)
But then sometimes you write as though the implication is "AI companies should unilaterally implement more prosocial drives in their systems". And this feels much less obvious to me.
If the purchasers of AI services prefer them to not have prosocial drives, then this could be imposing values on the consumers (which might ultimately have the effect of driving people to other AI providers). You might think that it's worthwhile to have that kind of imposition for a period, and that socially-minded AI companies should do it -- but if that's the heart of the claim I really think you should be explicit about it (and AI companies would be justified in being less willing to listen to you if you're not).
Another angle might be: if we ultimately want AI systems to have prosocial drives, we might think about how that should be incentivized -- whether by trying to shape consumer preference, by legal mandate, or by economic gradients (e.g. differing tax rates according to the degree of prosociality).
Anyway possible I'm missing something here! Would love to hear how you're thinking about this question.
Thanks, I agree with your mathematics and think this framework is helpful for letting us zoom in to possible disagreements.
There are two places where I find myself sceptical of the framing in your comment:
Maybe there's a common theme here: I have the impression that I'm more imagining a default world where we get these upgrades to strategic capacity in a timely fashion, and then considering deviations from that; and you're more saying "well maybe things look like that, but maybe they look quite different", and less privileging the hypothesis.
I guess I do just think it's appropriate to privilege this hypothesis. We've written about how even current or near-term AI could serve to power tools which advance our strategic understanding. I think that this is a sufficiently obvious set of things to build, and there will be sufficient appetite to build them, that it's fair to think it will likely be getting in gear (in some form or another) before most radically transformative impacts hit. I wouldn't want to bet everything on this hypothesis, but I do think it's worth exploring what betting on it properly would look like, and then committing a chunk of our portfolio to that (if it's not actively bad on other perspectives).
You discuss the idea of clauses that allow for later escape from poorly-conceived deals as a guardrail. This feels like a powerful possibility which might add a significant amount of robustness.
But I'm wondering if the idea might be more broadly applicable than that. If we have the kind of machinery that allows us to add that kind of clause, maybe we could use it for the whole essence of the deal? Rather than specify up front what you wish to exchange, just specify the general principles of exchange -- and trust the smarter and wiser actors of the future to interpret it in a fair and benevolent manner.
In general, reading this article I'm finding that I have some sympathy for the central claim that there could be useful deals to strike early (that it isn't possible to strike later); however I find myself feeling quite sceptical of the frameworks for thinking about different types of deals etc. -- I don't see why we should think that we have done more here than scratch the surface of the universe of possibilities, and my best guess is that actually-wise deals would look quite different from anything you're outlining. Curious what you make of this -- does this feel too radically sceptical or something?
I am sympathetic to this.
Like many people, I’ve been following this thread with dismay. I think that Frances’s experiences sound terrible, and seem very unnecessary.
I have hesitated to weigh in on this thread. But I agree that the answers can’t just be at the policy level; and I’m keen to see further discussion about cultural dynamics which may contribute to the issues[1]. At this point I’ve given this question a good amount of thought (though I could definitely still be wrong), so I wanted to highlight a couple of things people might want to consider:
Focus on intent
I’m glad Frances calls this out as a problem, as I think it’s underappreciated as a contributing factor to problematic dynamics. I actually think it has more issues beyond what she lists.
A focus on intent:
Distrust of moral intuitions
(caveat: not sure I’m naming the truest version of this; but I’m pretty sure there’s something in this vicinity)
I think EA teaches people that it's important to think through the implications of our actions, rather than relying on unconsidered moral intuitions. Which is correct! But I worry that sometimes people can take this lesson too far, and start not paying attention to their own moral intuitions when they don't have explicit arguments for them[2].
A friend put it to me as “I think sometimes EA accidentally encourages a lack of groundedness”.
Anyway it’s pure speculation on my part to imagine this at play in CEA’s (in)actions. But rather than imagine that the people reading Riley’s document didn’t feel any discomfort, I find it easier to imagine them feeling a little uncomfortable about it but not trusting the discomfort, or orienting in a locally-consequentialist way and guessing that it would ultimately create more costs and be worse (possibly including worse-for-Frances) to escalate it rather than leave it be.
TBC, I don’t think that the right amount of focus on intent or distrust of our own moral intuitions is zero! And I absolutely think that it’s possible to do these in ways that are healthy. But if I'm right, then I kind of want people to be tracking the potential vulnerabilities from going too far in these directions; so wanted to share. I'll default to not posting more on this thread.
For the removal of any ambiguity, I'm not trying to disclaim personal responsibility for my own past mistakes! But when things go wrong to the degree of causing harm, I think they've often gone wrong at several levels at once; it's useful to look at all of these.
Or further: discount their own sense of right and wrong in order to defer to people who’ve thought about things more.
I disagree that 5 barely matters and is beside the point. I think doing 5 in an earnest way (as especially Holden's post is doing) is a move towards having the company act with integrity in a forward-looking way. Maybe that move won't stick, but it really does feel meaningfully better to me to be finding somewhere solid to stand now rather than trying to paper things over.
And it makes sense that people want to discuss 1-4 (I'm not entirely endorsing your descriptions here, but I don't think that's important); I just think it's better for everyone if it's clear that the thing they're upset about is 1-4 rather than 5.
I feel better about Anthropic as a result of this change, although I understand if people feel worse. But I think that the proper target of their upset should be past-Anthropic declaring that it would hold to kind of confused/dubious standards (which I worry may have been corrosive for people's ability to think clearly about what is needed), rather than current Anthropic correcting that.
(I previously felt that the RSP commitments were kind of "off" somehow, and reading the new things feels like fresh air, people taking a more serious look and engaging with the world for real. I don't think I should get any credit for this feeling! Indeed despite feeling that they were "off", I didn't super engage or even manage to get to the bottom of why they felt off. I'm just expressing my feelings as this reaction seemed like a missing mood in the conversation.)
Plausible, yes. For one thing, you can run versions of the coordination tech in parallel with old cheap models, and flag and dig into discrepancies. This could make it harder for misalignment to strongly bite.
Of course if there are big misalignment issues and we're not seriously tracking that there could be big misalignment issues, that's gonna be a problem.
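To make the parallel-checking idea a bit more concrete, here's a minimal sketch of what I have in mind -- the model names, the exact-match check, and the helper names are all illustrative assumptions, not anything from an actual system:

```python
# Hypothetical sketch: cross-check the output of a newer "coordination tech"
# system against an older, cheaper, better-understood baseline model, and
# flag discrepancies for human review.

from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    query: str
    new_answer: str
    baseline_answer: str
    flagged: bool


def cross_check(
    query: str,
    new_model: Callable[[str], str],       # the system being relied on
    baseline_model: Callable[[str], str],  # old, cheap, well-understood model
    agree: Callable[[str, str], bool],     # similarity/consistency check
) -> CheckResult:
    new_answer = new_model(query)
    baseline_answer = baseline_model(query)
    # Flag for human attention whenever the two systems meaningfully disagree;
    # structured, persistent disagreement is roughly where misalignment in the
    # newer system would have to show up.
    flagged = not agree(new_answer, baseline_answer)
    return CheckResult(query, new_answer, baseline_answer, flagged)


# Illustrative usage with stand-in models (real systems would be API calls):
result = cross_check(
    "Propose a fair split of the surplus from this trade",
    new_model=lambda q: "60/40 split, with an escrow clause",
    baseline_model=lambda q: "50/50 split",
    agree=lambda a, b: a == b,  # a real check would be semantic, not exact-match
)
print(result.flagged)  # True -> route to human review
```

The point of the design is that the baseline doesn't need to be as capable as the newer system; it just needs to be trusted enough that large, systematic deviations from it are worth a human looking at.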