Part of your question here seems to be, "If we can design a system that understands goals written in natural language, won't it be very unlikely to deviate from what we really wanted when we wrote the goal?" Regarding that point, I'm not an expert, but I'll point to some discussion by experts.
There are, as you may have seen, lists of examples where real AI systems have done things completely different from what their designers were intending. For example, this talk, in the section on Goodhart's law, has a link to such a list. But from what I can tell, those examples never involve the designers specifying goals in natural language. (I'm guessing that specifying goals that way hasn't seemed even faintly possible until recently, so nobody's really tried it?)
Here's a recent paper by academic philosophers that seems supportive of your question. The authors argue that AGI systems that involve large language models would be safer than alternative systems precisely because they could receive goals written in natural language. (See especially the two sections titled "reward misspecification" -- though note also the last paragraph, where they suggest it might be a better idea to avoid goal-directed AI altogether.) If you want more details on whether that suggestion is correct, you might keep an eye on reactions to this paper. There are some comments on the LessWrong post, and I see the paper was submitted for a contest.
Terminological point: It sounds like you're using the phrase "instrumental convergence" in an unusual way.
I take it the typical idea is just that there are some instrumental goals that an intelligent agent can expect to be useful in the pursuit of a wide range of other goals, whereas you seem to be emphasizing the idea that those instrumental goals would be pursued to extremes destructive of humanity. It seems to me that (1) those two ideas are worth keeping separate, (2) "instrumental convergence" would more accurately label the first idea, and (3) that phrase is in fact usually used to refer to the first idea only.
This occurred to me as I was skimming the post and saw the suggestion that instrumental convergence is not seen in humans, to which my reaction was, "What?! Don't people like money?"
Regarding "the problem of human existence", it sounds like you'd get along with the people at the Voluntary Human Extinction Movement. I have no reason to think that organization is particularly effective at achieving its goal, but they might be able to point you in the right direction. There's also Population Connection, which I don't believe shares the same end goal but may (or may not) be taking more effective steps in the same direction.
Haven't read the draft, just this comment thread, but it seems to me the quoted section is somewhat unclear and that clearing it up might reduce the commenter's concerns.
You write here about interpreting some objections so that they become "empirical disagreements". But I don't see you saying exactly what the disagreement is. The claim explicitly stated is that "true claims might be used to ill effect in the world" -- but that's obviously not something you (or EAs generally) disagree with.
Then you suggest that people on the anti-EA side of the disagreement are "discouraging people from trying to do good effectively," which may be a true description of their behavior, but may also be interpreted to include seemingly evil things that they wouldn't actually do (like opposing whatever political reforms they actually support, on the basis that they would help people too well). That's presumably a misinterpretation of what you've written, but that interpretation is facilitated by the fact that the disagreement at hand hasn't been explicitly articulated.
Cool post. I had no idea that this debate over steelmanning existed. Glad to see someone on this side of it.
Lots of interesting subtleties in these posts, but the whole thing perplexes me as an issue of practical advice. It seems to me that parts of this debate suggest a picture of conversation where one person is actually paying close attention to the distinction between steelmanning and ITT and explicitly using those concepts to choose what to do next in the conversation. But I feel like I've basically never done that, despite spending lots of time trying to be charitable and teaching undergrads to do the same.
It makes me want to draw a distinction between steelmanning and ITT on the one hand and charity on the other (while acknowledging Rob's point that the word "charity" gets used in all kinds of ways). It's along the lines of the distinction in ethics between act types (lying, saying true things) and virtues (honesty): Steelmanning and passing ITT are the act types, and charity is the virtue.
The picture that emerges for me, then, is something like this: (1) We suppose a certain goal is appropriate. You enter a conversation with someone you disagree with, with the goal of working together to both learn what you can from each other and better approach the truth. (2) Given that that's your goal, one thing that will help is being charitable -- meaning, roughly, taking ideas you don't agree with seriously. (3) Doing that will involve (a) considering the thoughts the other person is actually having (even at the expense of ignoring what seems to you a better argument along the same lines) as well as (b) considering other related thoughts that could be used to bolster the other person's position (even at the expense of straying to some degree from what the other person is actually thinking). Talking about ITT brings favorable attention to (a), and talking about steelmanning brings favorable attention to (b).
Yes, there is some kind of tradeoff between (a) and (b). Ozy's Marxist & liberal example nicely illustrates the dangers of leaning too hard in the (b) direction: In trying to present the other person's argument in the form that seems strongest to you, you end up missing what the other person is actually saying, including the most crucial insights present in or underlying their argument. Your dog & cat example nicely illustrates the dangers of leaning too hard in the (a) direction: In trying to get into your friend's exact headspace, you would again miss the other person's actual insight.
Given this picture, it seems to me the best practical advice is something like this: Keep that overall goal of collaborative truth-seeking in mind, and try to feel when your uncharitable impulses are leading you away from that goal and when your charity is leading you toward it. You certainly don't want to handle the tradeoff between (a) and (b) by choosing your loyalty to one or the other at the outset (adopting the rule "Don't steelman" or "Don't use the ITT"). And you don't even really want to handle it by periodically reevaluating which one to focus on for the next portion of the conversation. That's kind of like trying to balance on a skateboard by periodically thinking, "Which way should I lean now? Right or left?" rather than focusing on your sense of balance, trying to feel the answer to the question "Am I balanced?". You want to develop an overall sense of charity, which is partly a sense of balance between (a) and (b).
"more or less any level of intelligence could in principle be combined with more or less any final goal."
This seems to me kind of a weird statement of the thesis, "could in principle" being too weak.
If I understand, you're not actually denying that just about any combination of intelligence and values could in principle occur. As you said, we can take a fact like the truth of evolution and imagine an extremely smart being that's wrong about that specific thing. There's no obvious impossibility there. It seems like the same would go for basically any fact or set of facts, normative or not.
I take it the real issue is one of probability, not possibility. Is an extremely smart being likely to accept what seem like glaringly obvious moral truths (like "you shouldn't turn everyone into paperclips") in virtue of being so smart?
(2) I was surprised to see you say your case depended completely on moral realism. Of course, if you're a realist, it makes some sense to approach things that way. Use your background knowledge, right?
But I think even an anti-realist may still be able to answer yes to the question above, depending on how the being in question is constructed. For example, I think something in this anti-orthogonality vein is true of humans. They tend to be constructed so that understanding of certain non-normative facts puts pressure on certain values or normative views: If you improve a human's ability to imaginatively simulate the experience of living in slavery (a non-moral intellectual achievement), they will be less likely to support slavery, and so on.
This is one direction I kind of expected you to go at some point after I saw the Aaronson quote mention "the practical version" of the thesis. That phrase has a flavor of, "Even if the thesis is mostly true because there are no moral facts to discover, it might still be false enough to save humanity."
(3) But perhaps the more obvious the truths, the less intelligence matters. The claim about slavery is clearer to me than the claim that learning more about turning everyone into paperclips would make a person less likely to do so. It seems hard to imagine a person so ignorant as to not already appreciate all the morally relevant facts about turning people into paperclips. It's as if, when the moral questions get so basic, intelligence isn't going to make a difference. You've either got the values or you don't. (But I'm a committed anti-realist, and I'm not sure how much that's coloring these last comments.)
Here are some attempts:
A rough formula for generating these might be this:
Another thing it would be helpful to know is what context you have in mind for this proverb. Are you guys looking for something you can use as a rebuttal when someone says "give a man a fish...", or are you looking for something you can use as a slogan in its own right? (Or both? Or something else?)