Simple comparison polling to create utility functions

NunoSempere

tl;dr:

The Utility Function Extractor is a small web application.
It guides individuals through a process of comparing different pairs of items, by asking them, “How much more valuable is the first option compared to the second one?”
This result is turned into a simple utility function that can be used to compare between all of the items in the set.
You can try the application here.
This technique can be used for comparing moral goods, research outputs, policy initiatives, and more.

Introduction

Say you have a list of items that you want to compare the value of. You just want to know how valuable every item is, in comparison to every other item. Perhaps these items are research documents, or the lives of people at different ages, or personal life improvements.

This meme exists because of Kat Woods' Why boring writing is unethical.

You could try to come up with one standard unit (QALY, "adjusted research paper", "adjusted life improvement"), but this can be messy.

Instead, all you really need to do is to compare them in pairs. In each case, you say, roughly, "The first item is X% as valuable as the second". If you do this enough times [1], you can produce a simple utility function over all of the items. This lets you calculate the relative value of any two items in the set, even items you haven't directly compared. [2]
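To make the "items you haven't directly compared" step concrete, here is a minimal sketch in Python. It is not the app's actual code: the function name `relative_values`, the `(a, b, ratio)` tuple format, and the example items are assumptions made for illustration. It treats comparisons as edges in a graph and multiplies ratios along a path from a reference item.

```python
from collections import deque


def relative_values(comparisons, reference):
    """Propagate pairwise ratios outward from a reference item.

    comparisons: list of (a, b, ratio), read as
        "a is `ratio` times as valuable as b" (hypothetical format).
    Returns a dict mapping each reachable item to its value,
    measured in units of `reference`.
    """
    # Store every comparison in both directions so the graph can be
    # walked either way.
    graph = {}
    for a, b, ratio in comparisons:
        graph.setdefault(a, []).append((b, 1.0 / ratio))  # value(b) = value(a) / ratio
        graph.setdefault(b, []).append((a, ratio))        # value(a) = value(b) * ratio

    values = {reference: 1.0}
    queue = deque([reference])
    while queue:
        current = queue.popleft()
        for neighbour, factor in graph.get(current, []):
            if neighbour not in values:  # keep the first (shortest-path) estimate
                values[neighbour] = values[current] * factor
                queue.append(neighbour)
    return values


# Hypothetical inputs: "paper" and "comment" are never compared directly.
example = [("paper", "post", 10.0), ("post", "comment", 5.0)]
print(relative_values(example, "comment"))
```

Note that this sketch keeps only the first path it finds to each item, so it silently ignores any inconsistency between different paths.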

One problem with a naive approach is that your preferences might be inconsistent. After valuing A vs. B, B vs. C, and A vs. C, it might be the case that your numbers don't completely match each other. However, you could at least see these discrepancies, and make manual changes upon reflection. You might also want to see the inconsistencies, rather than use some other method that smooths them out from the beginning (like denominating all items in the same unit).

I've developed a simple web tool to help people walk through the above steps. I've experimented with this on a few small lists of comparison items. This project is an early exploration of this idea. It would be great to later see much more sophisticated versions.

I originally built it while trying to be less confused about the value of research and about how to elicit information about it. But I recently realized that it also has a more immediate and less speculative application: the EA community could use it to elicit and aggregate GiveWell's moral weights by allowing users to input their own comparisons.

By default, the site is asking about nine OpenPhil AI safety grants. If you'd instead like to compare your own set of elements, you can do so under "Advanced Options", or you can reach out, and I'm happy to create a page specifically for a given project. Note that I'm saving the user's inputs because I think it might be interesting to aggregate them later.

Potential use cases

Extract comparisons between different moral goods to use as an input to the giant GiveWell spreadsheet.
Poll associates about the value of projects one is considering doing.
Construct a utility function over different types of research outputs.
Construct a utility function over different types of policy initiatives.
Compare the moral weight of different types of sentient beings.
Allocate a prize fund proportionally to the value of entries.
...

Example

As an example, let's show the app working on comparing various research documents. Right now, we assume that all research is net-positive, and we assume that users feel comfortable estimating how valuable each document is versus each other document.

We're not trying to have a precise definition of "value". Instead, the app leaves this up to the user.

At the start, the user is presented with the following screen:

The user then selects, say, that The Mathematical Theory of Communication is worth ~50x as much as Thinking Fast and Slow—meaning that the user would be indifferent between ~50 books as valuable as Thinking Fast and Slow or one paper as fundamental as The Mathematical Theory of Communication [3].

After entering several such comparisons, the user arrives at a result like the following:

This produces a scale where Categorizing Variants of Goodhart's Law is worth 1 unit. The lines between two nodes indicate how much more valuable the node to the right is than the node to the left. For instance, in this graph node 1 (“Extinguishing...”) is ~50x as valuable [3] as node 0 (“A comment...”), and node 2 (“Shallow evaluations...”) is 100x as valuable as node 1. But node 2 is ~1000x as valuable as node 0, which is not ~50x * 100x = ~5000x. So the comparisons are inconsistent by a factor of ~5x.

In contrast, here is a screenshot taken from a previous version of this program:

At the time, I was being lazy and hadn’t gotten around to programming sensible rounding, so the value of the first element shows up as ~0. Notice that the comparisons are fairly different from those in the previous image, even though the items are the same. To me, this suggests that this kind of elicitation could or should be repeated at different points in time, even by the same user, to find out how noisy users’ estimates are. Notice also that the comparisons between items are more consistent than in the first screenshot.

FAQ

Why ask for binary comparisons rather than references to a set point?

Binary comparisons make inconsistencies apparent, and allow them to be aggregated. For example, one might think that:

A is 2x as valuable as B
A is 100x as valuable as C
B is 10x as valuable as C

This is inconsistent. But we can still be pretty sure that A > B > C. We can aggregate the different estimates to come to a final guess of how valuable A is in comparison to C—20x to 100x as valuable.
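The 20x to 100x range can be reproduced with a short Python sketch. This is an illustration under stated assumptions, not the app's method: the function name `path_estimates` and the tuple format are hypothetical, and the geometric mean at the end is just one possible way to collapse the range into a single guess.

```python
from statistics import geometric_mean


def path_estimates(comparisons, start, end):
    """Return one estimate of value(start) / value(end) per simple path.

    comparisons: list of (a, b, ratio), read as
        "a is `ratio` times as valuable as b" (hypothetical format).
    """
    graph = {}
    for a, b, ratio in comparisons:
        graph.setdefault(a, []).append((b, ratio))        # value(a) / value(b)
        graph.setdefault(b, []).append((a, 1.0 / ratio))  # value(b) / value(a)

    estimates = []

    def walk(node, product, visited):
        # `product` is value(start) / value(node) along the current path.
        if node == end:
            estimates.append(product)
            return
        for neighbour, ratio in graph.get(node, []):
            if neighbour not in visited:
                walk(neighbour, product * ratio, visited | {neighbour})

    walk(start, 1.0, {start})
    return estimates


comparisons = [("A", "B", 2.0), ("A", "C", 100.0), ("B", "C", 10.0)]
estimates = path_estimates(comparisons, "A", "C")
print(sorted(estimates))          # the indirect (via B) and direct estimates
print(geometric_mean(estimates))  # one possible point estimate, roughly 44.7
```

The indirect path through B gives 2 * 10 = 20, while the direct comparison gives 100, which is where the range in the text comes from.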

If we had only asked about comparisons to a set-point, we might get different answers depending on which set-point we used, and the uncertainty and inconsistencies would be hidden from sight.

That said, comparison against a set point is operationally much simpler, and there is also something to be said for reducing noise from the beginning.

In addition, if the app asked about the distribution of the ratio of the impacts—rather than about the ratio of the expected impacts—the app could then pick up that some elements are correlated or easier to compare than others.

Why not uncertain comparisons?

Uncertain comparisons (e.g., being able to enter an estimate like "0.1 to 10x") would be good and/or interesting, but would make programming more difficult. If/when Squiggle becomes more developed, I imagine this would be much easier. I could also just use the foretold/cdf library, but I'm hesitant to get too much into the weeds without a specific need.

How could one aggregate estimates from different people?

The website outputs a utility function in which a reference element is given a value of “1” and serves as the reference unit (for instance, GiveWell uses “doublings of consumption” as its reference unit). The naïve way to aggregate different estimates might be to just average the ratios with respect to the reference unit. For instance, if A, B and C think that a life saved from some disease in some specific conditions is respectively L, M, N times as valuable as a doubling of consumption, one could take (L+M+N)/3 as the aggregate estimate.

However, this could get pretty messy, and might be very sensitive to what the reference point is. I imagine that one might do something more principled and sophisticated if one had access to distributions, or even without that one might still detect clusters within which values are similar.
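The sensitivity to the reference point can be seen in a small Python sketch. The three ratios below are made-up numbers, and `aggregate_both_ways` is a hypothetical helper. It aggregates the same three opinions twice: once with "doubling of consumption" as the unit, and once with "life saved" as the unit (inverting the result afterwards so the two are comparable). The geometric mean is included only as a contrast, not as a recommendation from the post.

```python
from statistics import geometric_mean, mean

# Hypothetical values of L, M, N: how many doublings of consumption
# each of three people thinks one life saved is worth.
ratios = [10.0, 50.0, 100.0]


def aggregate_both_ways(ratios, aggregator):
    """Aggregate with either item as the reference unit.

    Returns (direct, via_inverse), both expressed as
    "life saved, in doublings of consumption".
    """
    direct = aggregator(ratios)
    # Same opinions, but expressed as doublings-per-life before aggregating.
    via_inverse = 1.0 / aggregator([1.0 / r for r in ratios])
    return direct, via_inverse


print(aggregate_both_ways(ratios, mean))            # roughly (53.3, 23.1)
print(aggregate_both_ways(ratios, geometric_mean))  # both roughly 36.8
```

With the arithmetic mean, the aggregate changes by more than a factor of two depending on which item is the unit. The geometric mean gives the same answer either way, because averaging logarithms commutes with inverting ratios.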

How does the underlying algorithm work?

I'm using merge sort to select the ranking. This means that directionally inconsistent comparisons are dealt with elegantly (loops of the form A > B, B > C, C > A don't happen). The worst-case number of comparisons for n inputs is given by the sorting sequence, which grows as n*log(n). To create the graph at the end, I'm choosing a value as a reference point, and I generate paths of increasing length from that value to values further and further away.
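As a rough illustration of the ranking step, here is a generic merge sort in Python driven by an `ask(a, b)` callback that stands in for the user. This is a sketch under assumptions, not the site's source code: the `ask` interface, the item names, and the `true_values` used to simulate a perfectly consistent user are all hypothetical.

```python
def merge_sort(items, ask):
    """Sort items from least to most valuable.

    ask(a, b) returns how many times as valuable a is compared to b;
    each call corresponds to one question put to the user.
    """
    if len(items) <= 1:
        return list(items)
    middle = len(items) // 2
    left = merge_sort(items[:middle], ask)
    right = merge_sort(items[middle:], ask)

    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if ask(left[i], right[j]) <= 1.0:  # left item is worth no more
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


# Simulated user with hidden, perfectly consistent values (hypothetical).
true_values = {"comment": 1.0, "blog post": 20.0, "paper": 400.0, "book": 50.0}
asked = []  # every comparison the "user" was asked, with the stated ratio


def simulated_user(a, b):
    ratio = true_values[a] / true_values[b]
    asked.append((a, b, ratio))
    return ratio


ranking = merge_sort(["paper", "comment", "book", "blog post"], simulated_user)
print(ranking)
print(len(asked), "comparisons asked, out of 6 possible pairs")
```

Because each pair is asked about at most once and the merge step only ever extends a single ordered list, the output is always a linear ranking, which is why directional loops cannot appear. The recorded `asked` list is the kind of data one would then use to draw the graph of ratios.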

What are some possible directions for future work?

Some directions for future work might be

Make the user experience nicer
1. Think about how to elicit inputs that can range many orders of magnitude
2. A slider seems nicer to use than a text box, but its bounds might anchor users
Make comparisons in terms of distributions, not of expected values
Apply this method to specific elicitations, like:
1. Elicitations of GiveWell’s moral weights
2. Elicitations of the moral weights given to different animals
3. Elicitations of the value of different projects which one is considering
4. Elicitations of the value of the research produced by different AI alignment organizations. In particular, I’m curious to apply this method to Larks’ yearly AI Alignment Literature Review and Charity Comparison. By dividing by the number of FTEs, or the amount of funding, one can then get an estimate of the research productivity of different organizations.
5. Etc.
Use the results of elicitations to construct a reference scale for how valuable comments are with respect to posts, posts are with respect to academic papers, and various forms of research outputs with respect to each other.

Note that the code for this website is open source, and pull requests are very welcome.

Conclusion

Although some organizations, like the Global Priorities Institute, are working on the fundamentals of ethical theory, there doesn’t seem to be much tooling for estimating utility in practice.

Tooling around estimating utility is important to EAs because doing the most good one can do seems like it would require good estimates of how much good different options produce. This is particularly important for longtermism, because longtermism is beset with uncertainty about how valuable different options are. Better ways of estimating the value of longtermist options might help with evaluation, prioritization, clarity, etc.

The Utility Function Extractor is relatively rough, but I hope it can help other people become less confused about the shape of their utility functions, and ideally aid in EA and longtermism’s evaluation efforts.

Thanks to Ozzie Gooen and Kelsey Rodríguez for comments and editing suggestions.

Footnotes

[1]: And in particular, O(n*log(n)) times, not O(n^2) times. See the “How does the underlying algorithm work” section.

[2]: More technically: By the Von Neumann–Morgenstern utility theorem, one can build, extract or deduce a utility function merely by considering many binary comparisons. But if we want to evaluate or rank something fuzzy or uncertain—like longtermist research or the value of different development interventions—this isn't of much practical help. On the one hand, this might be because quantifying and reasoning about what we care about is difficult. But on the other hand, practical tooling is just missing.

[3]: Why 51? Well, I'm using a log scale, but this means that inputs don't clearly map to round decimal numbers. This could be improved.


Comments (13)


Michael St Jules 🔸4y7

I only took a quick look, but this looks pretty cool.

Here are two small things that might be worth doing:

1. Allow for negative values
2. Allow users to input the multiplier by typing (or at least mark some places on the scale with values), since if you have a number in mind, it's a bit of work to find it.

NunoSempere4y6

2. Is now implemented.

1. is a bit tricky because the "is x times as valuable as" relation is kind of weird for negative inputs.

NunoSempere4y2

Cheers, I've added both suggestions as Github issues to remember.

David Johnston4y5

I would be interested in this same concept but framed so as to compare personal utility instead of impersonal utility, because I feel like I'm trying to estimate other people's values for personal utility and aggregate them in order to get an idea of impersonal utility. It seems tricky, though:

- How many {50} year old {friends/family members/strangers} would you save vs {5} year old {friends/family members/strangers}?

This seems straightforward, except maybe it's necessary to add "considering only your own benefit" if we want personal utilities that we can aggregate instead of a mixture of personal and impersonal utilities.

- How many 50 year old yourselves would you save vs 5 year old yourselves?

This one doesn't make much sense to me, and if I try to frame it differently, e.g.

"imagine a group of 50-74 year olds and a group of <5 year olds. There's a treatment that saves {X} 50 year olds and {Y} 5 year olds, and the <5 year olds dictate who gets it. What is the minimum X:Y for there to be a 50% chance of choosing the 50-74 year olds?"

My first thought is there's no way to sensibly answer this question because 3 year olds are incredibly stubborn and also won't understand.

Anyway, don't know if this is very helpful, but that was my first response to the app and the result of my first few minutes thinking about it.

skluug4y5

I like this! UI suggestion: instead of "The first option is 5x as valuable as the second option", I would insert the sentence between them in the middle: "...is 5x as valuable as...". Or if you're willing to mess up marginal/total utility, you could format it as "One [X] is worth as much as five [Y]", which I think would help it be more concrete to most people.

NunoSempere4y5

Done!

skluug4y1

Cool!!

NunoSempere4y2

Hey, this is a good idea, but it turns out it's slightly tricky to program. I'll get around to it eventually, though.

Ozzie Gooen4y5

I just wanted to give my take on some of this:

The web app is neat to experiment with the ideas and help us build intuitions.
That said, I think the key ideas (not the web app in particular), are the main insight here.
The current implementation is a solid first step, but I think we’re still a ways from having something that’s fun to use. My guess is that it will require some sophisticated UX / UI work to do a job that’s good enough for this to be useful in production. (If anyone reading this wants to try, let one of us know!)
I also think it’s important to figure out how to allow for negative values. This is annoying, but so it goes.

One thing I learned over the course of this is that we probably don’t actually want big tables of utility estimates. Or, more specifically, we want functions that we can query with “how does X compare to Y?”, and that give us the correct amount. These can trivially be converted to tables, but are subtly better. The reason for this is that they’ll handle correlations between items.

10 apples might be exactly 10 times as good as 1 apple; 10 oranges, 10x as good as 1 orange. We want a query of “how much better is 10 apples compared to 1 apple” to return exactly “10x”, and similar for oranges. If we tried putting them all into a common unit, like “pear equivalent”, then we wouldn’t get this property.

I’m not sure what the best format is to store this sort of data. Maybe some cluster analysis or something. There must be some clever mathematics for this somewhere, but it’s not clear to either of us.

mako yass4y1

I've been thinking about this sort of preference aggregation problem for a few years. I think the best way to do it, that we have right now, is to form a graph with edges weighted by the comparison strength, then do pagerank, or something like it. Rank entries by their pagerank scores.

But I've been working towards something more precise, and this might be novel: Parallel and serial reducer functions (and another one, a "crosslink" reducer, I believe), sort of like the reducer functions you'd use to make judgements about electronic circuit graphs, which, given two nodes in the graph, work together to reduce the edges of the graph to a single edge between them, and then you know how those edges compare.

I have a really firm and consistent and unambiguous sense of how the reducers should behave, but I'd need to spend some time with a mathematician and a whiteboard to come up with formalizations. I'm pretty confident we'd produce something legit, if that were set up, though!

In case you're wondering how it handles cycles: If the cycle is even it resolves that each option in it is equal. If the cycle is a little bit lopsided then it creates a ranking, but with high controversy. If it's extremely lopsided then it creates a ranking with low controversy.
If you put a gun to its head, it'll always be able to tell you which nodes are at the top of the ranking, but it can also tell you when there's a lot of uncertainty between them.

NunoSempere4y2

My sense is that the mathematized version would be much more valuable (for instance, I could incorporate it into my tooling), but also harder to obtain than you might realize.

gwern4y3

I dunno if it's that hard. Comparisons are an old and very well-developed area of statistics, if only for use in tournaments, and you can find a ton of papers and code for pairwise comparisons. I have some & an R utility in a similar spirit on my Resorter page. Compared (ahem) to many problems, it's pretty easy to get started with some Elo or Bradley-Terry-esque system and then work on nailing down your ordinal rankings into more cardinal stuff. This is something where the hard part is the UX/UI and tailoring to use-cases, and too much attention to the statistics may be wankery.

NunoSempere4y2

Comparisons are an old and very well-developed area of statistics

Yeah, but it's not clear to me that discrete choice is a good fit for the kind of thing that I'm trying to do (though I've downloaded a few textbooks, and I'll find out). I agree that UX is important.