This is a special post for quick takes by quila. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
'What is malevolence? On the nature, measurement, and distribution of dark traits' was posted two weeks ago (and i recommend it). that post discusses a questionnaire which tries to measure the levels of 'dark traits' in the respondent.
i'm curious about the results[1] of EAs[2] on that questionnaire, if anyone wants to volunteer theirs. there are short and long versions (16 and 70 questions).
(or responses to the questions themselves)
i also posted the same quick take to LessWrong, asking about rationalists
Are your values about the world, or the effects of your actions on the world?
An agent who values the world will want to affect the world, of course. These two kinds of values lead to the same choices if they're both linear, but if they're concave...
Then there is a difference.[1]
If an agent has a concave value function which they use to pick each individual action, say √L where L is the number of lives saved by the action, then that agent would prefer a 90% chance of saving 1 life (for √1 × 0.9 = 0.9 expected utility) over a 50% chance of saving 3 lives (for √3 × 0.5 ≈ 0.87 expected utility). The agent would have this preference each time they were offered the choice.
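As a quick check of the arithmetic above, here is a minimal sketch in Python. The function name `per_action_value` is made up for illustration; the only inputs are the probabilities and life counts from the example.

```python
import math

def per_action_value(p_success: float, lives: float) -> float:
    """Expected value of one action under the per-action function sqrt(L)."""
    return p_success * math.sqrt(lives)

option_a = per_action_value(0.9, 1)  # 90% chance of saving 1 life
option_b = per_action_value(0.5, 3)  # 50% chance of saving 3 lives

print(f"per-action value  A: {option_a:.3f}  B: {option_b:.3f}")
# Expected lives saved point the other way round.
print(f"expected lives    A: {0.9 * 1:.1f}  B: {0.5 * 3:.1f}")
```

The per-action score favors the first option even though the second saves more lives in expectation.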
This would be odd to me, partly because it would imply that if they were presented this choice enough times, they would appear to overall prefer an x% chance of saving n lives to an x% chance of saving more than n lives. (Or rather, the probability-distribution version of that statement, instead of the discrete version.)
For example, after taking the first option 10 times, the probability distribution over the number of lives saved looks like this (on the left side). If they had instead taken the second option 10 times, it would look like this (right side).
(Note: Claude 3.5 Sonnet wrote the code to display this and to calculate the expected utility, so I'm not certain it's correct. Calculation output and code are in footnote[2].)
Now if we prompted the agent to choose between these two probability distributions, they would assign an expected utility of 3.00 to the one on the left, and 3.82 to the one on the right, which from the outside looks like it contradicts their earlier sequence of choices.[3]
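The two figures can be rechecked with a short sketch. This is not the code referenced in the footnote, just a small independent calculation, assuming the 10 tries are independent and that √ is applied to the total number of lives saved. The helper names are hypothetical.

```python
import math

def outcome_distribution(n, p, lives_per_success):
    """Map total lives saved -> probability after n independent tries."""
    return {
        k * lives_per_success: math.comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n + 1)
    }

def expected_sqrt_utility(dist):
    """Expected value of sqrt(total lives saved) under the distribution."""
    return sum(prob * math.sqrt(total) for total, prob in dist.items())

left = outcome_distribution(10, 0.9, 1)   # first option taken 10 times
right = outcome_distribution(10, 0.5, 3)  # second option taken 10 times

eu_left = expected_sqrt_utility(left)
eu_right = expected_sqrt_utility(right)
print(f"left: {eu_left:.2f}  right: {eu_right:.2f}")  # left: 3.00  right: 3.82
```

Scored over totals, the second option comes out ahead, which is the reverse of the per-action ranking.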
We can generalize beyond this example to say that, in situations like this, the agent's best action is to precommit to taking the second option repeatedly.[4]
We can also generalize further and say that for an agent with a concave function used to pick individual actions, the initial action which scores the highest would be to self-modify into (or commit to taking the actions of) an agent with a concave utility function over the contents of the world proper.[5]
I wrote this after having a discussion (starts ~here at the second quote) with someone who seemed to endorse following concave utility functions over the possible effects of individual actions.[6] I think they were drawn to this as a formalization of 'risk aversion', though, so I'd guess that if they find the content of this text true, they'd want to continue acting in a risk-averse-feeling way, but may search for a different formalization.
My motive for writing this though was mostly intrigue. I wasn't expecting someone to have a value function like that, and I wanted to see if others would too. I wondered if I might have just been mind-projecting this whole time, and if actually this might be common in others, and if that might help explain certain kinds of 'risk averse' behavior that I would consider suboptimal at fulfilling one's actual values[7] (this is discussed more extensively in my linked comment).
Image from 'All About Concave and Convex Agents'.
For discussion of the actual values of some humans, I recommend 'Value Theory'
code:
(of course, if it happened it wouldn't really be a contradiction, it would just be a program being run according to what it says)
(Though, if one accepts that, I have a nascent intuition that the same logic forces one to accept what I was writing about Kelly betting in the discussion this came from.)
Recall that actions are picked only individually, not according to the utility the current function would assign to future choices made under the new utility function.
(That would instead have its own exploits, namely looping between many small positive actions and one big negative 'undoing' action whose negative utility is square-rooted)
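One way to picture that exploit, as a rough sketch: it assumes per-action scores are simply summed, and that a negative effect of size x is scored as -√x. Both are my guesses at the setup rather than something stated above, and `signed_sqrt` is a hypothetical helper.

```python
import math

def signed_sqrt(x):
    # Assumption: a negative effect is scored as -sqrt(|x|).
    return math.copysign(math.sqrt(abs(x)), x)

n = 100
# Save one life n times, then take one action that undoes all n at once.
actions = [1] * n + [-n]

net_world_change = sum(actions)
summed_score = sum(signed_sqrt(a) for a in actions)
print(net_world_change, summed_score)  # 0 90.0
```

The world ends up unchanged, yet the summed per-action score is positive, so the loop looks worth repeating.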
(I initially thought they meant over the total effects of all their actions throughout their past and future, rather than per action.)
I'll claim that if one doesn't reflectively endorse optimally fulfilling some values, then those are not their actual values, but maybe are a simplified version of them.