# 15

This term I took a reinforcement learning course at my university, hoping to learn something useful for the research directions that I'm considering entering (one of which is AI safety; the others are speculative, so I'm not listing them).

I was about to start coding my first toy model when I suddenly recalled something I had read earlier: Brian Tomasik's "Which Computations Do I Care About?" and "Ethical Issues in Artificial Reinforcement Learning". So I re-read the two essays, and despite disagreeing with many of the points Brian made, I did become convinced that RL agents (and some other algorithms too), in expectation, deserve a tiny yet non-zero moral weight, and that this weight can accumulate over the many episodes of the training process to become significant.

This problem seems very counter-intuitive to me, but as a rational person, I have to admit that it's a legitimate implication under the expected value framework, so I recognised the problem and started thinking about solutions.

The solution turns out to be obvious, but even more counter-intuitive. I only need to add an insanely large number (say, ) to every reward value that the agent receives, and then, assuming that the agent can feel happiness, there should be a small yet non-negligible probability that its happiness will increase linearly with the number added.

• One could object that utility should be scale-invariant and depend only on the temporal difference of expectations (i.e. how much the expectation of future reward has risen or fallen), as suggested by some relevant studies. My response is that 1. this problem is far from settled and I'm only arguing for a non-negligible probability of linear correlation, and 2. I don't think the results of psychological studies imply scale-invariance of utility for all rewards (instead they only imply scale-invariance of utility for monetary returns). Think about it: how on earth can extreme pain be neutralized simply by adjusting one's expectations?

And once I accept this conclusion, the most counter-intuitive conclusion of them all follows. As the computing power devoted to training these utility-improved agents increases, the utility produced grows exponentially (because more computing power means more digits to store the rewards, and each extra digit multiplies the largest storable reward). On the other hand, the impact of all other attempts to improve the world (e.g. by improving our knowledge of artificial sentience so we can more efficiently promote their welfare) grows at only a polynomial rate with the amount of resources devoted to these attempts. Therefore, running these training runs is the single most impactful thing that any rational altruist should do.
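The growth comparison in this argument can be sketched numerically. This is only an illustration of the claim as stated: `max_reward` assumes rewards are stored as unsigned binary integers, and `polynomial_impact` with its default degree of 3 is a hypothetical stand-in for "all other attempts", not something taken from the essays.

```python
def max_reward(bits: int) -> int:
    """Largest unsigned integer that fits in `bits` binary digits."""
    return 2 ** bits - 1


def polynomial_impact(resources: int, degree: int = 3) -> int:
    """Hypothetical polynomial return on resources (the degree is an assumption)."""
    return resources ** degree


# Compare the two growth rates as the number of bits (a proxy for resources) rises.
for bits in (8, 16, 32, 64):
    print(bits, max_reward(bits), polynomial_impact(bits))
```

For small budgets the polynomial term can still be larger; the exponential term only dominates once the number of bits grows.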

Apparently, we're in a Pascal's Mugging situation.

Quite a few hypothetical Pascal's Mugging scenarios have already been proposed, but this one strikes me the most. It seems to me to be the first such scenario that has practical implications in real life, and one that I cannot dismiss using simple arguments like "the opposite outcome is equally likely to happen, which makes the net expected impact zero".

• One thing to note: GiveWell's article "Why we can't take expected value estimates literally" uses a Bayesian prior as the remedy for Pascal's Mugging, but here, when estimating the probability of linear correlation (between utility and the number added to the reward), we have already taken our prior into account, so such reasoning does not work.

# 3 Answers sorted by Top

Brian_Tomasik

### Oct 09, 2021

11

Similar to what Carl said, my main response to questions like those you raise is that we'll have to defer a lot of this thinking to future generations. One can generate an almost indefinite stream of plausible Pascalian wagers like this. On this particular issue, the intervention of "improving our knowledge of artificial sentience so we can more efficiently promote their welfare" actually seems like it would help, because then more people could apply their minds to questions like those you raise.

In addition to just trying to store larger and larger binary integers in a computer, you could try to develop other representations that would express large numbers more compactly. One obvious way to do that would be to use a floating-point number instead of an integer, since then you can have a large exponent. Maybe instead of the exponent representing a power of 10, it could signify a power of 1000000, or a power of 3^^^3. In Python, you can represent infinity as float("inf"), and that could be the reward.
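A minimal sketch of how these representations behave in standard Python, assuming IEEE 754 double-precision floats (which is what CPython uses on common platforms). The helper name `shift_reward` is hypothetical and only illustrates adding an offset to a reward.

```python
import math
import sys

# Largest finite double, roughly 1.8e308.
largest_float = sys.float_info.max

# Python integers are arbitrary precision, so they can exceed the float range.
big_int = 10 ** 400

# Infinity as a reward value, as suggested in the text.
infinite_reward = float("inf")


def shift_reward(reward: float, offset: float) -> float:
    """Hypothetical helper: add a constant offset to a reward."""
    return reward + offset


# Multiplying past the largest finite double overflows to infinity.
overflowed = largest_float * 10

# With infinite rewards, every shifted reward is the same, and the
# difference between two rewards is undefined (NaN).
same = shift_reward(1.0, infinite_reward) == shift_reward(-1.0, infinite_reward)
difference = shift_reward(1.0, infinite_reward) - shift_reward(-1.0, infinite_reward)

# Large finite floats also lose small differences to rounding.
rounded_away = shift_reward(1.0, 1e17) == 1e17
```

One caveat this makes visible: once rewards are infinite, or merely so large that rounding swallows the original values, the agent can no longer tell good outcomes from bad ones by comparing rewards.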

My own view is that the absolute scale of numbers doesn't matter if it doesn't affect the functional behavior of the agent. Of course, as you say, there's some chance that utility does increase with the absolute scale of reward, but is that factual uncertainty or moral uncertainty? If it's moral uncertainty (as I think it is), then one view plausibly shouldn't be able to dominate others just by having higher stakes, in a similar way as deontology shouldn't dominate utilitarianism just because deontology may regard murder as infinitely wrong while utilitarianism regards it as only finitely wrong.
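The idea that absolute scale need not affect functional behavior can be illustrated with a small example. The sketch below is an assumption-laden toy, not anything from the essays: `NEXT_STATE` and `REWARD` define a made-up deterministic MDP with three states, two actions, and no terminal state, solved by plain value iteration with discount `GAMMA`. In this continuing, discounted setting, adding a constant to every reward shifts every state value by the same amount, so the greedy policy is unchanged. This does not hold in general for episodic tasks with variable episode length, where a constant shift can change which behavior is optimal.

```python
GAMMA = 0.9

# Hypothetical toy MDP: NEXT_STATE[s][a] is the successor state,
# REWARD[s][a] is the reward for taking action a in state s.
NEXT_STATE = [[0, 1], [2, 0], [2, 1]]
REWARD = [[0.0, 1.0], [2.0, 0.0], [0.5, -1.0]]


def solve(offset: float, sweeps: int = 500):
    """Run value iteration with `offset` added to every reward.

    Returns the greedy policy and the state values.
    """
    n_states = len(NEXT_STATE)
    n_actions = len(NEXT_STATE[0])

    def action_value(values, s, a):
        return REWARD[s][a] + offset + GAMMA * values[NEXT_STATE[s][a]]

    values = [0.0] * n_states
    for _ in range(sweeps):
        values = [
            max(action_value(values, s, a) for a in range(n_actions))
            for s in range(n_states)
        ]
    policy = [
        max(range(n_actions), key=lambda a: action_value(values, s, a))
        for s in range(n_states)
    ]
    return policy, values


base_policy, base_values = solve(offset=0.0)
shifted_policy, shifted_values = solve(offset=1000.0)

# Every value moves by offset / (1 - GAMMA); the chosen actions do not.
print(base_policy, shifted_policy)
```

With extremely large offsets, floating-point rounding would eventually erase the gaps between action values, at which point the behavior would change for numerical rather than conceptual reasons.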

By the way, I tend to assume that RL computations at the scale you'd run for a course would have pretty negligible moral (dis)value, because the agents are so barebones. Good luck with the course. :)

CarlShulman

### Oct 04, 2021

11

> And once I accept this conclusion, the most absurd-seeming conclusion of them all follows. By increasing the computing power devoted to the training of these utility-improved agents, the utility produced grows exponentially (as more computing power means more digits to store the rewards). On the other hand, the impact of all other attempts to improve the world (e.g. by improving our knowledge of artificial sentience so we can more efficiently promote their welfare) grows at only a polynomial rate with the amount of resource devoted into these attempts. Therefore, running these trainings is the single most impactful thing that any rational altruist should do. Q.E.D.

If you believed in wildly superexponential impacts from more compute, you'd be correspondingly uninterested in what could be done with the limited computational resources of our day, since a Jupiter Brain playing with big numbers, instead of being 10^40 times as big a deal as an ordinary life today, could be 2^(10^40) times as big a deal. The same goes for influencing more computation-rich worlds that are simulating us.

The biggest upshot (beyond ordinary 'big future' arguments) of superexponential-with-resources utility functions is a greater willingness to take risks and care about tail scenarios with extreme resources, although that's bounded by 'leaks' in the framework (e.g. the aforementioned influence on simulators with hypercomputation), and a greater valuation of futures per unit of computation (e.g. it makes welfare in sims like ours, conditional on the simulation hypothesis, less important).

I'd say that ideas of this sort, like infinite ethics, are a reason to develop a much more sophisticated, stable, and well-intentioned society (one that can more sensibly address complex issues affecting an important future) that can handle them well, but they don't make the naive action you describe desirable, even given certainty in a superexponential model of value.
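To get a feel for the gap between these two figures, one can compare them on a logarithmic scale, since 2^(10^40) is far too large to construct directly. This is only a rough arithmetic sketch; the variable names are illustrative.

```python
import math

# log10 of the linear-in-resources figure, 10^40.
log10_linear = 40

# log10 of 2^(10^40) equals 10^40 * log10(2): the number of decimal digits
# is itself on the order of 10^39.
log10_superexponential = (10 ** 40) * math.log10(2)

print(log10_linear, log10_superexponential)
```

The first figure has 41 decimal digits; the second has more decimal digits than the first figure's own value divided by ten.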

Timothy Chan

### Oct 05, 2021

5

Assuming the parts about sentience work, someone who is both rational and altruistic (a rational altruist, as you say) might still have normative reasons not to run these trainings.

Some (non-exhaustive) reasons I can think of that are based on, or compatible with, an expected value framework are listed below. Some of them assume that at least some suffering results from running the trainings (e.g. through suffering subroutines, which may result in incidental s-risks), or that there are forgone opportunities to reduce suffering or increase happiness elsewhere:

1. Asymmetry ideas in population ethics (e.g., 'making people happy, not making happy people').
2. A position of diminishing returns (or zero returns after a certain point) on the value of happiness.
3. Ideas objecting to intrapersonal or interpersonal tradeoffs of creating happiness at the price of creating suffering (in this case you might have multiple expected (dis)values to work with).
4. Value lexicality that says that some bads are always worse than goods in some or any amount (some conceive this as listing and comparing expected (dis)values on vectors).
5. Different forms of negative utilitarianism. (It is worth emphasizing that the expected (dis)value of happiness and suffering depends on both our subjective valuation of the experiences we imagine to occur and the subjective probability that these experiences actually occur.)

These could be motivated by the thought that the (dis)values of suffering and happiness are orthogonal and don't 'cancel out' each other. I think Magnus Vinding's book on SFE presents these insights more clearly than I can, so checking it out could be useful if you'd like to learn more.