Using Points to Rate Different Kinds of Evidence

Ozzie Gooen

Comments 6

Could you explain this a bit more? It's written in a very shorthand way, and it's hard to know what you are doing and what you are asking of us. I think I have a lot of comments, but for most of them I worry I might be missing the point.

E.g., a math proof is something different from scientific evidence, and it generally applies in different domains. If I have confidence in the proof itself (i.e., that the proof is not in error), that would make any other evidence moot. However, in most relevant cases the 'math proof' is a proof of something that is only a very simplified model of the question at hand.

Aryeh Englander

This is great! Just wanted to mention that this kind of weighting approach works very well with the recent post A Model-based Approach to AI Existential Risk, by Sammy Martin, Lonnie Chrisman, and myself, particularly the section on Combining inside and outside view arguments. Excited to see more work in this area!

Ozzie Gooen

Quick flag/reminder that I'd be interested in comments from others here - try giving your own scores, or flag things that you think are wrong about mine (or an LLM's).

titotal

I think this is a reasonable exercise in the abstract, and could help people more easily communicate how they approach different forms of evidence.

However, if actually implemented in practice, I think it would be too easily gamed to be of any use. Using your system as an example, if person A has a mathematical proof of X (20 points), but person B makes 11 clever tweets suggesting not X (2*11 = 22 points), then person B "wins" the argument.

The other problem I see is that there's no modifier here for "actually being correct". If person A presents a correct mathematical proof for X, and person B presents a mathematical proof for not X that is actually false, do they both get 20 points?

Chris Kerr

The other problem I see is that there's no modifier here for "actually being correct". If person A presents a correct mathematical proof for X, and person B presents a mathematical proof for not X that is actually false, do they both get 20 points?

If you check the proofs yourself and you can see that one is obviously wrong and the other is not obviously (to you) wrong then you only give the not-obviously-wrong one 20 points. If you can't tell which is wrong then they cancel out. If a professor then comes along and says "that proof is wrong, because [reason that you can't understand], but the other one is OK" then epistemically it boils down to "tenured academic in field - 6 points" for the proof that the professor says is OK.
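The rule described above can be written down as a small decision procedure. This is only a rough sketch: the function name, its arguments, and the choice to express the result as net points for X are assumptions made for illustration, and the 20 and 6 point values are the ones quoted in this thread.

```python
PROOF_POINTS = 20            # "mathematical proof", as quoted in the thread
TENURED_ACADEMIC_POINTS = 6  # "tenured academic in field", as quoted above

def net_points_for_x(proof_x_looks_wrong, proof_not_x_looks_wrong, expert_backs=None):
    """Net points in favor of X; a negative result favors not-X.

    The two booleans say whether each proof looks obviously wrong to you.
    expert_backs is "X", "not X", or None (hypothetical encoding).
    """
    if proof_x_looks_wrong != proof_not_x_looks_wrong:
        # You can see for yourself which proof fails, so only the
        # not-obviously-wrong proof keeps its 20 points.
        return -PROOF_POINTS if proof_x_looks_wrong else PROOF_POINTS
    # You can't tell the proofs apart, so they cancel out. Treating
    # "both look wrong" the same way is an assumption of this sketch.
    if expert_backs == "X":
        return TENURED_ACADEMIC_POINTS
    if expert_backs == "not X":
        return -TENURED_ACADEMIC_POINTS
    return 0
```

For example, `net_points_for_x(False, False, expert_backs="X")` covers the professor case: neither proof looks wrong to you, so the only remaining evidence is the expert's word.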

Ozzie Gooen

This equation was definitely meant as a rough initial guide. I think it's still usable as a heuristic - i.e. most of the time, you pay more attention to higher-point evidence than to lower-point evidence. It's meant to be better than other heuristics, not a complete solution.

if person A has a mathematical proof of X (20 points), but person B makes 11 clever tweets suggesting not x (2*11 = 22 points), then person B "wins" the argument.

I didn't get into adding evidence, for this reason. I think it's very clear that things are not linearly additive like that. I think that an aggregation function would take into account the similarity of individual pieces of content (e.g., two tweets that are clever but near-identical), but also the similarity of the types of content (it's better to have a diverse set of different kinds of content, like a meta-study and "businesses commonly use it"). There would be a quick leveling off, so that 50 tweets would have an evidence strength of something like 2 to 5 or so.
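To make the leveling-off idea concrete, here is one possible aggregation function. It is only a sketch, not a worked-out proposal: the geometric discount, the 0.5 factor, the plain sum across different kinds of evidence, and the `aggregate` name are all assumptions picked for illustration. It groups evidence by kind and counts each additional item of the same kind at half the weight of the previous one.

```python
from collections import defaultdict

def aggregate(evidence, same_kind_discount=0.5):
    """Combine (kind, points) pairs with diminishing returns within a kind.

    Within one kind, the strongest item counts fully and the k-th next
    item is weighted by same_kind_discount ** k. Different kinds are
    simply summed (an assumption; diversity could be modeled in other ways).
    """
    by_kind = defaultdict(list)
    for kind, points in evidence:
        by_kind[kind].append(points)

    total = 0.0
    for points in by_kind.values():
        for k, p in enumerate(sorted(points, reverse=True)):
            total += p * same_kind_discount ** k
    return total

tweets = [("clever tweet", 2)] * 50
proof = [("mathematical proof", 20)]

tweet_score = aggregate(tweets)  # stays below 4 however many tweets are added
proof_score = aggregate(proof)   # a single item keeps its full 20 points
```

With these particular numbers, 50 clever tweets land inside the 2 to 5 range and no longer outweigh a single proof. Near-identical content within a kind is not modeled separately here; a fuller version would presumably discount it even more steeply.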

The other problem I see is that there's no modifier here for "actually being correct".

I thought this was fairly obvious to add. Again, I think this would need a lot more complexity, depending on how much you actually rely on it.
