An experiment to evaluate the value of one researcher's work

NunoSempere

Ozzie Gooen

I have a bunch of thoughts on this, and would like to spend time thinking of more. Here are a few:

---

I’ve been advising this effort and gave feedback on it (some of which Nuño explicitly included in the post in the “Caveats and warnings” section). Correspondingly, I think it’s a good early attempt, but definitely feel like things are still fairly early. Doing deep evaluation of some of this, along with getting more empirical data (for instance, surveying people to see which ones might have taken this advice, or having probing conversations with Guesstimate users), seems necessary to get a decent picture. However, it is a lot of work. This effort was much more a matter of Nuño intuitively estimating all the parameters, which can get you kind of far, but shouldn’t be understood to be substantially more than that. Rubrics like these can be seen as much more authoritative than they actually are.

Reasons to expect these estimates to be over-positive

I tried my best to encourage Nuño to be fair and unbiased, but I’m sure he felt incentivized to give positive grades. I don’t believe I gave feedback to encourage him to change the scores favorably, but I did request that he make the uncertainty more clear in this post. This wasn’t because I thought I did poorly in the rankings; it was more because I thought that this was just a rather small amount of work for the claims being made. I imagine this will be an issue going forward with evaluation, especially if the people being evaluated might be seen as possibly holding grudges or similar later on. It is not enough for them to not retaliate; the problem is that, from an evaluator’s perspective, there’s a chance that they might retaliate.

Also, I imagine there is some selection pressure towards a positive outcome. One of the reasons why I have been advising his efforts is that they are very related to my interests, so it would make sense that he might be more positive towards my previous efforts than others with different interests would be. This is one challenging thing about evaluation: typically the people who best understand the work have the advantage of better understanding its quality, but the disadvantage of typically being biased about how good this type of work is.

Note that none of the projects wound up with a negative score, for example. I’m sure that at least one really should have if we were clairvoyant, although it’s not obvious to me which one at this point.

Reasons to expect these estimates to be over-negative

I personally care a whole lot more about being able to be neutral, and also about seeming neutral, than I do about my projects being evaluated favorably at this stage. I imagine this could have been the case for Nuño as well. So it’s possible there was some over-compensation here, but my guess is that you should expect things to be biased on the positive side regardless.

Tooling

I think this work brings to light how valuable improved tooling (better software solutions) could be. A huge spreadsheet can be kind of a mess, and things get more complicated if multiple users (like myself) try to make rankings. I’ve been inspecting no-code options and would like to do some iteration here.

One change that seems obvious would be for reviews to be posted on the same page as the corresponding blog post. This could be done in the comments or in the post itself, like a GitHub status icon.

Decision Relevance

I’m hesitant to update much due to the rather low weight I place on this. I was very uncertain about the usefulness of my projects before this, and I’m still uncertain about it afterwards. I agree that most of it is probably not going to be valuable at all unless I specifically, or much less likely someone else, continue this work into a more accessible or valuable form.

If it’s true that Guesstimate is actually far more important than anything else I’ve worked on, it would probably update me to focus a lot more on software. Recently I’ve been more focused on writing and mentorship than on technical development, but I’m considering changing back.

I think I would have paid around $1,000 or so for a report like this for my own usefulness. Looking back, the main value perhaps would come from talking through the thoughts with the people doing the rating. We haven’t done this yet, but might going forward. I’d probably pay at least $10,000 or so if I was sure that it was “fairly correct”.

The value of research in neglected areas

I think one big challenge with research is that you either focus on an active area or a neglected one. In active areas, marginal contributions may be less valuable because others are much more likely to come up with them. There’s one model where there is basically a bunch of free prestige lying around, and if you get there first you are taking zero-sum gains directly from someone else. In the EA community in particular I don’t want to play zero-sum games with other people. However, for neglected work, it seems very possible that almost no one will continue with it. My read is that neglected work is generally a fair bit more risky. There are instances where it goes well and this could actually encourage a whole field to emerge (though this takes a while). There are other instances where no one happens to be interested in continuing this kind of research, and it dies before being able to be useful at all.

I think of my main research as being in areas I feel are very neglected. This can be exciting, but has the obvious challenge that it is difficult for the work to be adopted by others, and so far this has been the case.

MichaelA🔸

Doing deep evaluation of some of this without getting more empirical data (for instance, surveying people to see which ones might have taken this advice, or having probing conversations with Guesstimate users) seems necessary to get a decent picture.

(I assume you mean something like "and getting more empirical data", not "without"?)

I think it'd indeed be interesting and useful to combine the sort of intuitive estimation approach Nuño adopted here with some gathering of empirical data. Nuño (or whoever) could perhaps randomly select a subset of posts/outputs to gather empirical data on, to reduce how time-consuming/costly the data collection is.

Two potential methods of data collection that come to mind are:

Surveys
- E.g., Rethink Priorities' "impact survey", my survey which was inspired by Rethink's, 80k's annual survey, and a recent Happier Lives Institute survey
- Some extra discussion here
Interviews
- E.g., Rethink do "structured interviews with key decision-makers and leaders at EA organizations[, and seek] interviewees’ feedback on the general importance of our work for them and for the community, what they have and have not found helpful in what we’ve done, what we can do in the future that would be useful for them, and ways we can improve."

In fact, one possibility would be to use the intuitive estimation approach on the work of one of the orgs/people who already have a bunch of this sort of data relevant to that work (after checking that the org/people are happy to have their work used for this process), and then look at the empirical data, and see how they compare.

(I recently started working for Rethink, but the views in this comment are just my personal views.)

Ozzie Gooen

That's quite useful, thanks

In fact, one possibility would be to use the intuitive estimation approach on the work of one of the orgs/people who already have a bunch of this sort of data relevant to that work (after checking that the org/people are happy to have their work used for this process), and then look at the empirical data, and see how they compare.

This seems like a neat idea to me. We'll investigate it.

NunoSempere

Note that none of the projects wound up with a negative score, for example. I’m sure that at least one really should have if we were clairvoyant, although it’s not obvious to me which one at this point.

"Expected Error, or how wrong you expect to be" ended up with a -1, because of the negative comments.

Ozzie Gooen

Good catch

EdoArad🔸

Regarding tooling, it may be very helpful to input subjective distributions - uncertainty seems to me to be very important here, mostly if we expect this kind of tool to be used by a low number of people
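A minimal sketch of what inputting subjective distributions could look like, assuming each factor is entered as a 90% interval and modelled as an independent lognormal. The intervals below are made-up illustrative numbers, not values from the post's spreadsheet, and the function names are hypothetical:

```python
import math
import random
import statistics

Z90 = 1.6448536269514722  # z-score bounding the central 90% of a normal


def lognormal_from_90ci(low, high):
    """Return (mu, sigma) of a lognormal whose 5th/95th percentiles are low/high."""
    mu = (math.log(low) + math.log(high)) / 2
    sigma = (math.log(high) - math.log(low)) / (2 * Z90)
    return mu, sigma


def sample_product(intervals, n=10_000, seed=0):
    """Monte Carlo samples of the product of independent lognormal factors."""
    rng = random.Random(seed)
    params = [lognormal_from_90ci(lo, hi) for lo, hi in intervals]
    samples = []
    for _ in range(n):
        value = 1.0
        for mu, sigma in params:
            value *= rng.lognormvariate(mu, sigma)
        samples.append(value)
    return samples


# Hypothetical 90% intervals for two factors (illustrative numbers only).
intervals = [(10, 1000), (0.01, 0.5)]
samples = sample_product(intervals)
cuts = statistics.quantiles(samples, n=20)  # cut points at 5%, 10%, ..., 95%
summary = {"p5": cuts[0], "median": statistics.median(samples), "p95": cuts[-1]}
print(summary)
```

Reporting the 5th and 95th percentiles alongside the median keeps the uncertainty visible in the final score instead of collapsing it to a point estimate.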

Ozzie Gooen

Yea, I'd love to see things like this, but it's all a lot of work. The existing tooling is quite bad, and it will probably be a while before we could rig it up with Foretold/Guesstimate/Squiggle.

EdoArad🔸

Another idea - to be able to use units in cells such that the end result will depend on them. Say, for scale one can write "20000 QALYs" or "400 BCLs" (Broiler Chicken Lives) or "2% XpC" (X-risk per Century)
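A rough sketch of unit-aware cells, assuming each cell holds a number, an optional percent sign, and a unit label. The `Quantity` class and `parse_cell` helper are hypothetical names for illustration, not part of any existing tool:

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    value: float
    unit: str

    def __add__(self, other):
        # Refuse to combine different units rather than silently mixing them.
        if self.unit != other.unit:
            raise ValueError(f"cannot add {self.unit} to {other.unit}")
        return Quantity(self.value + other.value, self.unit)


# number, optional "%", then an alphabetic unit label
CELL = re.compile(r"^\s*([0-9]*\.?[0-9]+)\s*(%?)\s*([A-Za-z]+)\s*$")


def parse_cell(text):
    """Parse a cell such as '20000 QALYs' or '2% XpC' into a Quantity."""
    match = CELL.match(text)
    if match is None:
        raise ValueError(f"unrecognised cell: {text!r}")
    number, percent, unit = match.groups()
    value = float(number) / 100 if percent else float(number)
    return Quantity(value, unit)


total = parse_cell("20000 QALYs") + parse_cell("5000 QALYs")
print(total)
```

Adding "20000 QALYs" to "400 BCLs" raises an error here; a fuller version would need explicit, and contestable, conversion rates between units.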

Ozzie Gooen

Yep, I think this is quite useful/obvious. (If I understand it correctly). Work though :)

Joey🔸

Happy to have my posts used for this. One thing I would love to see integrated would be a willingness to pay metric as we have been experimenting with this a bit in our research process and have found it quite useful.

Ozzie Gooen

One challenge with willingness to pay is that we need to be clear who the money would be coming from. For instance, I would pay less for things if the money were coming from the budget of EA Funds than I would if it came from Open Phil, and less again than if it came from the US Government. This seems doable to me, but is tricky. Ideally we could find a measure that wouldn't vary dramatically over time. For instance, the EA Funds budget might be desperate for cash some years and have too much in others, changing the value of the marginal dollar dramatically.

NunoSempere

Thanks! A willingness to pay is an interesting proxy; will keep in mind. In particular, I imagine that it consolidates some intuitions, or makes them more apparent, though it probably won't help if your intuitions are just wrong.

Davidmanheim

This sounds great, and happy if you want to use my posts for this.

I also am super-happy that the Goodhart paper was used as an example of a "fairly valuable" paper! I should look at my other non-forum-post output and consider the score-to-time-and-effort ratio, to see if I can maximize the ratio by doing more of specific types of work, or emphasizing different projects.

Ben_West🔸

This is a cool idea. Thanks Nuno for doing this evaluation, and thanks Ozzie for being willing to participate!

MichaelA🔸

Thanks for this interesting post.

I'm also happy to have my posts used for this process. (Though some were written with or on behalf of people/orgs who I can't speak for, so if you would be interested in using my posts for this and making the results public, just let me know so I can check with those people/orgs first.)

If you do use my posts for this, you could perhaps also compare the results from your process to the results of my survey of the quality and impact of things I've written (as I sort-of suggest in another comment).

NunoSempere

Though some were written with or on behalf of people/orgs who I can't speak for

Are there any posts for which this is the case but where this isn't stated in the post?

MichaelA🔸

No - for all posts where this is the case, the post will say so near the top or bottom (usually in italics).

Also, btw, I think if it's just that I acknowledge a person for helpful comments and discussions, there'd be no need to check with them. I just feel it'd be worth checking in cases where a person is acknowledged as a coauthor (or as having written an earlier draft that my post is based on), or where I say a post is "written for [organisation]".

NunoSempere

Cheers, thanks

Alexander Herwix 🔸

Thanks for the interesting post.

One consideration that comes to my mind is whether this type of evaluation further reinforces a "success to the successful" feedback loop, which is inherently sensitive to initial conditions. That is, people might be able to produce great work given the right support and conditions, but don't have them in the beginning. Someone else is luckier and gets picked up, then gets more support, which then reinforces further success.

Thus, it seems generally pretty hard to use something like this kind of system to achieve "optimal" outcomes or, rather, let's say you have to be careful about how you implement such rating systems and be aware of such feedback loops.

What do you think about this?

NunoSempere

Yeah, I agree that for forecasting setups self-fulfilling prophecies/feedback loops can be a problem, but it seems likely that they can be mitigated with a certain amount of exploration (e.g., occasionally try things you'd expect to fail in order to test your system.)

It's also not clear that this type of evaluation is worse than the alternative informal evaluation system. For example, with a formal evaluation system you'd be able to pick up high quality outputs even if they come from otherwise low status people (and then give them further support.)

Alexander Herwix 🔸

Thanks for the thoughtful answer. I agree that it's not clear that it is worse than other alternatives; in my comment I didn't give a reference solution to compare it to, after all.

I just wanted to highlight the potential for problems that ought to be looked at while designing such solutions. So, if you consider working more on this in the future, it might be fruitful to think about how it would influence such feedback loops.

EdoArad🔸

Do you mean that people who had past successes (even if by external support) would be more likely to score higher here?

I intuitively think that this kind of rubric would actually be more robust against such feedback loops. Most of the factors are not specific to the person, so I think that there would be less bias there.

Alexander Herwix 🔸

In essence, I think that the act of adding quantitative measures may lend a veil of "objectivity" to assessments of people's work, which is intrinsically vulnerable to the "success to the successful" feedback loop.

Based on your comment, I had another look at the specific criteria of the rubric and agree that it seems possible that it could help to counteract something like the dynamic I outlined above. However, it would still have to be applied with care, recognizing the possibility of such dynamics.

The main problem I wanted to highlight is that something like this might obscure those dynamics and might be employed for political purposes such as justifying existing status hierarchies which might be simply circumstantial and not based on merit.

What I perceive	What I care about	How it’s aggregated
log(Scale)	Scale	Total value ~ $\text{Scale} \cdot \text{Tractability} \cdot \text{Neglectedness} \cdot \dots$
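
One way to read this table, offered as an interpretation and not something stated in the thread: if each factor is perceived on a logarithmic scale, then adding the perceived ratings corresponds to multiplying the underlying quantities, since taking logs of the aggregation rule gives

$$\log(\text{Total value}) \approx \log(\text{Scale}) + \log(\text{Tractability}) + \log(\text{Neglectedness}) + \dots$$

Under this reading, a one-point difference in a single perceived rating implies a constant multiplicative factor in total value, whose size depends on the base of the logarithm.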
