# 26

TLDR

Submit ideas of “interesting evaluations” in the comments. The best one by December 5th will get $50. All of them will be highly appreciated.

Motivation

A few of us (myself, Nuño Sempere, and Ozzie Gooen), have been working recently to better understand how to implement meaningful evaluation systems for EA/rationalist research and projects. This is important both for short-term use (so we can better understand how valuable EA/rationalist research is) and for long-term use (as in, setting up scalable forecasting systems on qualitative parameters). In order to understand this problem, we've been investigating evaluations specific to research and evaluations in a much broader sense.

We expect work in this area to be useful for a wide variety of purposes. For instance, even if Certificates of Impact eventually get used as the primary mode of project evaluation, purchasers of certificates will need strategies to actually do the estimation.

Existing writing on “evaluations” seems to be fairly domain-specific (only focused on Education or Nonprofits), one-sided (yay evaluations or boo evaluations), or both. This often isn’t particularly useful when trying to understand the potential gains and dangers of setting up new evaluation systems.

I’m now investigating a neutral history of evaluations, with the goal of identifying trends in what aids or hinders an evaluation system in achieving its goals. The ideal output of this stage would be an absolutely comprehensive list that will be posted to LessWrong. While this is probably impractical, hopefully, we could make one comprehensive enough, especially with your help.

Task

Suggest an interesting example (or examples) of an evaluation system. For these purposes, evaluation means "a systematic determination of a subject's merit, worth and significance, using criteria governed by a set of standards", but if you think of something that doesn't seem to fit, err on the side of inclusion.

Prize

The prize is $50 for the top submission.

Rules

To enter, submit a comment suggesting an interesting example below, before the 5th of December. This post is both on LessWrong and the EA Forum, so comments on either count.

Rubric

To hold true to the spirit of the project, we have a rubric evaluation system to score this competition. Entries will be evaluated using the following criteria:

• Usefulness/uniqueness of lesson from the example
• Novelty or surprise of the entry itself, for Elizabeth
• Novelty of the lessons learned from the entry, for Elizabeth.

Accepted Submission Types

I care about finding interesting things more than proper structure. Here are some types of entries that would be appreciated:

• A single example in one of the categories already mentioned
• Four paragraphs on an unusual exam and its interesting impacts
• A babbled list of 104 things that vaguely sound like evaluations

Examples of Interesting Evaluations

We have a full list here, but below is a subset to not anchor you too much. Don't worry about submitting duplicates: I’d rather risk a duplicate than miss an example.

1. Chinese Imperial Examination
2. Westminster Dog Show
3. Turing Test
4. Consumer Reports Product Evaluations
6. Art or Jewelry Appraisal
7. ESGs/Socially Responsible Investing Company Scores
8. “Is this porn?”
    1. Legally?
    2. For purposes of posting on Facebook?
9. Charity Cost-Effectiveness Evaluations
10. Judged Sports (e.g. Gymnastics)

Motivating Research

These are some of our previous related posts:


### Nov 28, 2020

16

Babble!

1. Psychological evaluation
2. Job interview
3. Debate competition judge
4. Using emotion recognition (say by image recognition) to find out consumers' preferences
5. Measuring Pavlov's dog saliva
6. Debate in ancient Greece
7. Factored cognition
8. forecasting
9. karma on the forum / reddit
10. democratic voting
11. Stock prices as an evaluation of a company's value
12. Bibliometrics. Impact factor.
13. Using written recommendations to evaluate candidates.
14. Measuring truth-telling using a polygraph
15. Justice system evaluation of how bad crimes are based on previous cases
16. Justice system use of a Jury
17. Lottery - random evaluation
18. Measuring dopamine signals as a proxy to a fly's brain valence (which is an evaluation of its situation)
19. throw stuff into a neural net
20. python
21. Discrete Valuation Rings
22. Signaling value using jewels.
23. Evaluation based on social class
24. fight to the death
25. torturing people until they confess
26. market price
27. A mathematical exam
28. A high-school history exam
30. Stress testing a phone by putting it in extreme situations
31. checking if a car is safe by using a crash dummy and checking impact force
32. Software testing
33. Open source as a signal of "someone had looked into me and I'm still fine"
34. Colonoscopy
36. running polls
37. running stuff by experts
38. asking god what she thinks of it
39. The choice of a pope
40. Public consensus, 50 years down the line
41. RCT
43. Nobel prize committee
44. Testing purity of chemical ingredients
45. Testing problems in chip manufacturing
46. Reproduce a study/project and see if the results replicate
47. Set quantitative criteria in advance, and check the results after the fact
48. Ask people in advance what they think the results will be, and ask people to evaluate the results afterward. Focus the evaluation on the parameters that the people asked before the test did not consider
50. a flagging system for moderators
51. New York Times Best Seller list
52. subjective evaluation
53. subjective evaluation when on drugs
54. subjective evaluation by psychopaths (which are also perfect utilitarians!)
55. subjective evaluation by a color choosing octopus
56. Managerial decisions (a 15-minute powerpoint presentation and then an arbitrary decision)
57. Share-holder reports
58. bottleneck/limiting-factor analysis
59. Crucial considerations
60. Theory of Change model
61. Taking a set amount of time to critically analyze the subject, focusing on trying to find as many downsides.
62. Using weights and a two-sided scale to measure goods.
63. Setting a benchmark that one only evaluates against.
64. A referee evaluating a Boxing match
65. Using score for football
66. Buying a car - getting the information from the seller and assessing their truthfulness
67. Looking at a fancy report and judging based on length, images, and businessy words
68.
1. peer review in science
2. citations count
3. journal status
4. grant making - assessing requests, say by scoring according to a fixed scoring template
5. evaluating a scoring template by comparing similarity of different people's scoring of the same text
6. Code review
7. Fact-Checking
8. Editor going through a text
9. 360 peer feedback - sociometry
10. gut intuition after long relationship/experience
11. Amazon Reviews
12. ELO
13. Chess engine position assessment
14. Theoretical assessment of a chess position - experts explaining what is good or bad about the position
15. Running a tournament starting with this position, evaluating based on success percentage
16. multiple choice exam
17. political lobbying for or against something
18. the grandma test

It was fun! Hope that something here might be helpful :)

alexrjl

### Nov 29, 2020

9

Difficulty ratings in outdoor rock-climbing
Common across all types of climbing are the following features of grades:

• A subjective difficulty assessment of the climb, by the first person to climb it, is used for them to "propose" a grade.
• Other people who manage the same climb may suggest a different grade. Often the grade of a climb will not be agreed upon in the community until several ascents have been made.
• Climbing guidebooks publish grades, typically based on the authors' opinion of current consensus, though some online platforms where people can vote on grades exist.
• Grades can change even after a consensus has appeared stable. This might be due to a hold breaking, but it may also be due to a new sequence being discovered.
• Grades tend to approach a single stable point, even though body shape and size (particularly height and arm span) can make a large difference to difficulty.

There are many different grading systems for different types of climb; a good overview is here. Some differences of interest:

• While most systems grade the overall difficulty of the entire climb, British trad climbs have two grades, neither of which purely maps to overall difficulty. The first describes a combination of overall difficulty and safety (so an unsafe but easy climb may have a higher rating than a safe one); the second describes the difficulty only of the hardest move or short sequence (which can be very different from the overall difficulty, as endurance is a factor).
• Aid climbs, which allow climbers to use ropes to aid their movement rather than only for protection, are graded separately. However, other technology is not considered "aid". In particular, climbing grades have steadily increased over time, at least in part due to the development of better shoe technology. More recently, the development of rubberised kneepads has led to several notable downgrades of hard boulders and routes, as the kneepads make much longer rests possible.

I think climbing grading is interesting because the grades emerge out of a complex set of social interactions. Despite most climbers frequently saying things like "grades are subjective" and "grades don't really matter", grades in general remain remarkably stable, and important to many climbers.

Misha_Yagudin

### Dec 12, 2020

7

Correlating subjective metrics with objective outcomes to provide better intuitions about what an additional point on a scale might mean. The resulting intuitions still suffer from "correlation ≠ causation" and all the curses of self-reported data (which, in my opinion, make such measurements close to useless), but they are a step forward.
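As a rough illustration of the idea above, here is a minimal sketch of how one might translate "one more point on the scale" into units of an objective outcome. The data are made up, and the function name `pearson_and_slope` is hypothetical; the slope is a plain least-squares fit and carries all the caveats about correlation and self-reporting mentioned above.

```python
from statistics import mean

def pearson_and_slope(xs, ys):
    """Return (Pearson r, least-squares slope of ys on xs)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    r = sxy / (sxx * syy) ** 0.5   # strength of the linear association
    slope = sxy / sxx              # outcome units per extra scale point
    return r, slope

# Hypothetical data: self-reported scores on a 1-5 scale and some
# objectively measured outcome for the same five respondents.
self_reported = [1, 2, 3, 4, 5]
objective = [2, 4, 5, 4, 5]

r, slope = pearson_and_slope(self_reported, objective)
print(f"r = {r:.2f}, one extra point ~ {slope:.2f} outcome units")
```

The slope is the number that gives the intuition ("one more point goes with about this much more of the outcome"), while r indicates how noisy that relationship is.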

Huh! The thread I linked to and David Manheim's winning comment cite the same paper :)

Tetraspace Grouping

### Dec 06, 2020

7

Simple linear models, including improper ones(!!). In Chapter 21 of Thinking, Fast and Slow, Kahneman writes about Meehl's book Clinical vs. Statistical Prediction: A Theoretical Analysis and a Review, which finds that simple algorithms, made by taking some factors related to the final judgement and weighting them, give surprisingly good results.

The number of studies reporting comparisons of clinical and statistical predictions has increased to roughly two hundred, but the score in the contest between humans and algorithms has not changed. About 60% of the studies have shown significantly better accuracy for the algorithms. The other comparisons scored a draw in accuracy [...]

If they are weighted optimally to predict the training set, they're called proper linear models, and otherwise they're called improper linear models. Kahneman says about Dawes' The Robust Beauty of Improper Linear Models in Decision Making that

A formula that combines these predictors with equal weights is likely to be just as accurate in predicting new cases as the multiple-regression formula that was optimal in the original sample. More recent research went further: formulas that assign equal weights to all the predictors are often superior, because they are not affected by accidents of sampling.

That is to say: to evaluate something, you can get very far just by coming up with a set of criteria that positively correlate with the overall result and with each other and then literally just adding them together.
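A minimal sketch of what "literally just adding them together" could look like in practice. Everything here is an assumption for illustration: the criteria and numbers are made up, the helper names are hypothetical, and each criterion is standardized first so that no single scale dominates the sum (a common convention for equal-weight models, not something the quoted passages spell out).

```python
from statistics import mean, pstdev

def zscores(values):
    """Standardize numbers to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # a constant criterion carries no signal
    return [(v - mu) / sigma for v in values]

def unit_weight_scores(rows):
    """Improper linear model: sum standardized criteria with equal weights.

    rows: one list of criterion values per item, all the same length.
    Assumes every criterion is oriented so that higher means better.
    """
    columns = list(zip(*rows))
    standardized = [zscores(col) for col in columns]
    return [sum(col[i] for col in standardized) for i in range(len(rows))]

# Hypothetical data: three made-up criteria for four made-up candidates.
candidates = [
    [3, 7, 2],
    [5, 9, 4],
    [1, 4, 1],
    [4, 6, 5],
]

scores = unit_weight_scores(candidates)
# Indices of candidates, best first.
ranking = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

No weights are fitted anywhere, which is the point: there is nothing to overfit to "accidents of sampling".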

# Winner

Last week we announced a prize for the best example of an evaluation. The winner of the evaluations prize is David Manheim, for his detailed suggestions on quantitative measures in psychology.  I selected this answer because, although IAT was already on my list, David provided novel information about multiple tests that saved me a lot of work in evaluating them. David has had involvement with QURI (which funded this work) in the past and may again in the future, so this feels a little awkward, but ultimately it was the best suggestion so it didn’t feel right to take the prize away from him.

Honorable mentions to Orborde on financial stress tests, which was a very relevant suggestion that I was unfortunately already familiar with, and alexrjl on rock climbing route grades, which I would never have thought of in a million years but has less transferability to the kinds of things we want to evaluate.

# Post-Mortem

How useful was this prize? I think running the contest was more useful than $50 of my time; however, it was not as useful as it could have been because the target moved after we announced the contest. I went from writing about evaluations as a whole to specifically evaluations that worked, and I’m sure if I’d asked for examples of that they would have been provided. So possibly I should have waited to refine my question before asking for examples. On the other hand, the project was refined in part by looking at a wide array of examples (generated here and elsewhere), and it might have taken longer to home in on a specific facet without the contest.

Thanks - I'm happy to see that this was useful, and strongly encourage prize-based crowdsourcing like this in the future, as it seems to work well.

That said, given my association with QURI, I elected to have the prize money donated to GiveWell.