How “natural” are intended generalizations (like “Do what the supervisor is hoping I’ll do, in the sense that most humans would mean this phrase rather than in a precise but malign sense”) vs. unintended ones (like “Do whatever maximizes reward”)?

I think this is an important point. I consider the question in this paper, published last year in AI Magazine. See the "Competing Models of the Goal" section, and in particular the "Arbitrary Reward Protocols" subsection. (2500 words)

I think there's something missing from the discussion here, which is the key point of that section. First, I claim that sufficiently advanced agents will likely need to engage in hypothesis testing between multiple plausible models of what worldly events lead to reinforcement, or else they would fail at certain tasks. So even if the "intended generalization" is quite a bit more plausible to the agent than the unintended one, as long as the hypotheses are cheap to test, and as long as the agent has a long horizon, it would likely deem wireheading to be worth trying out, just in case. That said, in some situations (I mention a chess game in the paper) I expect the intended generalization to be so much simpler that the alternative isn't even worth trying out.
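To make the cheap-to-test/long-horizon intuition concrete, here is a toy value-of-information sketch. It is not from the paper, and every number in it (priors, costs, payoffs) is made up for illustration; it just shows why a low-prior reward model can still be worth testing when the horizon is long and the test is cheap, while a short horizon (as in a chess game) makes the same test not worth the cost.

```python
# Toy sketch (all numbers hypothetical) of the value-of-information
# argument: a low-prior model of reward is worth testing when the
# horizon is long and the test is cheap.

def value_of_testing(prior, horizon, test_cost, payoff_if_true=1.0):
    """Expected net gain from testing an alternative reward model.

    prior:          agent's credence that the alternative model is correct
    horizon:        remaining timesteps over which the knowledge pays off
    test_cost:      one-off cost (in reward) of running the test
    payoff_if_true: extra reward per step if the alternative model is correct
    """
    expected_gain = prior * horizon * payoff_if_true
    return expected_gain - test_cost

# Wireheading hypothesis: implausible (prior 1%), but the agent has a
# long horizon, so the expected gain dwarfs the test cost.
long_horizon = value_of_testing(prior=0.01, horizon=10_000, test_cost=5.0)

# Chess-like setting: same prior, but the short horizon means the test
# can't pay for itself.
short_horizon = value_of_testing(prior=0.01, horizon=100, test_cost=5.0)

print(long_horizon > 0)   # True: worth trying out, just in case
print(short_horizon > 0)  # False: not worth trying out
```

The point of the sketch is only that the decision flips with the horizon, holding the prior fixed; nothing here depends on the particular numbers.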

Just a warning before you read it: I use the word "reward" a bit differently than you appear to. In my terminology, I would phrase this as "Do what the supervisor is hoping" vs. "Do whatever maximizes the relevant physical signal", and the agent would essentially wonder which of the two constitutes "reward", rather than being a priori sure that its past rewards "are" those physical signals.

This is very high-quality. No disputes, just clarifications.

I don’t just mean meta-orgs.

I think working for a well-financed grantmaking organization is not outrageously unconventional, although I suspect most such organizations lean more on part-time work from well-respected academics than OpenPhil does.

And I think 80k may just be an exception (a minor one, to some extent), borne out of an unusually clear gap in the market. I think some of their work should be done in academia instead (basically whatever work it’s possible to do), but some of the very specific stuff like the jobs board wouldn’t fit there.

Also, if we imagine an Area Dad from an Onion Local News article, I don’t think his skepticism would be quite as pronounced for 80k as for other orgs like, e.g., an AI Safety camp.

Thank you for the edit, and thank you again for your interest. I'm still not sure what you mean by a person "having access to the ground truth of the universe". There's just no sense I can think of in which this is a requirement for the mentor.

"The system is only safe if the mentor knows what is safe." It's true that if the mentor kills everyone, then the combined mentor-agent system would kill everyone, but surely that fact doesn't weigh against this proposal at all. More importantly: a) the agent will not aim to kill everyone regardless of whether the mentor would (Corollary 14), which I think refutes your comment. And b) for no theorem in the paper does the mentor need to know what is safe; for Theorem 11 to be interesting, he just needs to act safely (an important difference for a concept so tricky to articulate!). But I decided these details were beside the point for this post, which is why I only cited Corollary 14 in the OP, not Theorem 11.

Robin Hanson didn't occur to me when I wrote it or any of the times I read it! I was just trying to channel what I thought conventional advice would be.

So basically, just philosophy, math, and some very simple applied math (like, say, the exponential growth of an epidemic), but already that last example is quite shaky.

In fields where it's possible to make progress with first-principles arguments/armchair reasoning, I think smart non-experts stand a chance of outperforming. I don't want to make strong claims about the likelihood of success here; I just want to say that it's a live possibility. I am much more comfortable saying that outperforming conventional wisdom is extremely unlikely on topics where first-principles arguments/armchair reasoning are insufficient.

(As it happens, EAs aren't really disputing the experts in philosophy, but that's beside the point...)

academics themselves have criticized the peer review system a great deal for various reasons, including predatory journals, incentive problems, publication bias, Why Most Published Research Findings Are False, etc

I think we could quibble on the scale and importance of all of these points, but I'm not prepared to confidently deny any of them. The important point I want to make is: compared to what alternatives? The problem is hard, and even the best solution can be expected to have many visible imperfections. How persuaded would you be by a revolutionary enumerating the many weaknesses of democratic government without comparing them to a proposed (often imagined) alternative?

people outside academia, e.g. practitioners in industry, are often pretty sharply skeptical of the usefulness of academic work, or unaware of it entirely,

Again, compared to what alternative? I'm guessing they would say that at a company, the "idea people" get their ideas tested by the free market, and this grounds them, and makes their work practical. I am willing to believe that free-market-testing is a more reliable institution than peer review for evaluating certain propositions, and I am willing to buy that this is conventionally well-understood. But for many propositions, I would think there is no way to "productize" it such that profitability of the product is logically equivalent to the truth of the proposition. (This may not be the alternative you had in mind at all, if you had an alternative in mind).

the peer review system didn't obviously arise from a really intentional and thoughtful design effort

This is definitely not a precondition for a successful social institution.

at the time the peer review system was developed, an enormous amount of our modern tools for communication and information search, processing, dissemination etc. didn't exist, so it really arose in an environment quite different from the one it's now in.

But conventional wisdom persists in endorsing peer-review anyway!

What "major life goals should include (emphasis added)" is not a sociological question. It is not a topic that a sociology department would study. See my comment that I agree "conventional wisdom is wrong" in dismissing the philosophy of effective altruism (including the work of Peter Singer). And my remark immediately thereafter: "Yes, these are philosophical positions, not sociological ones, so it is not so outrageous to have a group of philosophers and philosophically-minded college students outperform conventional wisdom by doing first-principles reasoning".

I am not citing Amazon as an example of an actor using evidence and reason to do as much good as possible. I am citing it as an example of an organization that is effective at what it aims to do.

Upvoted this.

You generally shouldn't take Forum posts as seriously as peer-reviewed papers in top journals

I suspect I would advise taking them less seriously than you would advise, but I'm not sure.

It could also imply that EA should have fewer and larger orgs, but that's a question too complicated for this comment to cover

I think there might be a weak conventional consensus in that direction, yes. By looking at the conventional wisdom on this point, we don't have to deal with the complexity of the question--that's kind of my whole point. But even more importantly, perhaps fewer EA orgs that are not any larger; perhaps only two EA orgs (I'm thinking of 80k and OpenPhil; I'm not counting CHAI as an EA org). There is not some fixed quantity of people that need to be employed in EA orgs! Conventional wisdom would suggest, I think, that EAs should mostly be working at normal, high-quality organizations/universities, getting experience under the mentorship of highly qualified (probably non-EA) people.

I'm someone who has read your work (this paper and FGOIL, the latter of which I have included in a syllabus), and who would like to see more work in a similar vein, as well as more formalism in AI safety. I say this to establish my bona fides, the way you established your AI safety bona fides.

Thanks! I should have clarified it has received some interest from some people.

you don't show that "when a certain parameter of a certain agent is set sufficiently high, the agent will not aim to kill everyone", you show something more like "when you can design and implement an agent that acts and updates its beliefs in a certain way and can restrict the initial beliefs to a set containing the desired ones and  incorporate a human into the process who has access to the ground truth of the universe, then you can set a parameter high enough that the agent will not aim to kill everyone"

"When you can design and implement an agent that acts and updates its beliefs in a certain way and can restrict the initial beliefs to a set containing the desired ones". That is the "certain agent" I am talking about.  "Restrict" is an odd word choice, since the set can be as large as you like as long as it contains the truth. "and incorporate a human into the process who has access to the ground truth of the universe." This is incorrect; can I ask you to edit your comment? Absolutely nothing is assumed about the human mentor, certainly no access to the ground truth of the universe; it could be a two-year-old or a corpse!  That would just make the Mentor-Level Performance Corollary less impressive.

I don't deny that certain choices about the agent's design make it intractable. This is why my main criticism was "People don't seem to bother investigating or discussing whether their concerns with the proposal are surmountable." Algorithm design for improved tractability is the bread and butter of computer science.
