
Charlie_Guthmann

1077 karma

Bio

pre-doc at Data Innovation & AI Lab
previously worked in options HFT and tried building a social media startup
founder of Northwestern EA club

Comments (281)

Hi Mamhud, welcome to the forum :)

This is a complicated project (depending on the scope)! I'm not a doctor, but I'll try to walk you through some of my thinking if it's of any interest.

First of all, I believe there are some medical benchmarks, e.g. https://openai.com/index/healthbench/ 
https://crfm.stanford.edu/helm/medhelm/latest/
https://bench.arise-ai.org/

I'm not very familiar with any of them; maybe they suck. It's also a massive field, and these are surely just the tip of a much larger iceberg on the way to something reliable.


Also, a quick high-level note on what a "METR of x" would mean to the community.

Benchmark/eval = tests AI for something

METR = a specific benchmark that measures the human-time-equivalent duration of tasks that AI can complete with x% reliability.

It's not clear to me exactly what the scope of your benchmarking is, but e.g. demographic name-swapping would be more analogous to the small but existing literature on LLM biases than to METR. Of course there could be something like a "health METR", but that would mean something specific, and I'm not sure that's what you mean.
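To make the METR definition concrete, here's a minimal sketch (my understanding of the metric, not METR's actual code): fit a logistic curve of success probability against log human-task-duration, then read off the duration at which the model succeeds 50% of the time. The task durations and pass/fail data below are made up.

```python
# Sketch of a METR-style "50% time horizon" metric. Data is hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

tasks_minutes = np.array([2, 5, 15, 30, 60, 120, 240, 480])  # human baseline time per task
model_passed  = np.array([1, 1, 1,  1,  0,  1,   0,   0])    # did the model succeed?

X = np.log2(tasks_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, model_passed)

# Time horizon at 50% reliability: solve p = 0.5  ->  w*x + b = 0  ->  x = -b/w.
x50 = -clf.intercept_[0] / clf.coef_[0][0]
print(f"50% time horizon ≈ {2 ** x50:.0f} human-minutes")
```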


To understand how to benchmark LLMs it helps to have a model of what an LLM is (or can be). 

The "brain" (and mouth and ears)

Level 1 — Pre-training. The raw model, trained on internet-scale data. This helps it understand language and the world.

Level 2 — Post-training. RLHF, RLVR, etc. turn it into a friendly bot and make it better at math (or maybe medicine).

How much the brain thinks

Level 3 — Inference scaling. How much compute you throw at the model at runtime. Thinking tokens, chain-of-thought, best-of-N sampling.

What digital actions the brain can take

Level 4 — Agentic harnesses. The scaffolding around the model: Claude Code, Codex, SWE-Agent, Pi, Devin. The digital robot armor for the AI brain.

The house the robot lives in

Level 5 — Context engineering. The prompt, the skill files, the retrieved context, the evolutionary algorithms that search prompt space. Everything that determines what the model sees when it starts working.

The world the robot lives in

Level 6 — The built environment. APIs designed for agent consumption, verification infrastructure, data markets, workflows rewritten to be machine-readable. The world reshaping itself around AI.
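If it helps, here's a toy way to operationalize the six levels when reporting an eval: pin each one down explicitly in the run spec. All field values below are hypothetical examples (this isn't a field standard, just an illustration of the framing).

```python
# Toy "eval spec" pinning down all six levels for a benchmark run.
from dataclasses import dataclass

@dataclass
class EvalSpec:
    base_model: str          # Level 1: pre-trained model family/size
    post_training: str       # Level 2: RLHF / domain fine-tune, if any
    inference_budget: str    # Level 3: thinking tokens, best-of-N, etc.
    harness: str             # Level 4: agent scaffold, or "single-turn"
    context: str             # Level 5: prompt / retrieval setup
    environment: str         # Level 6: tools/APIs available at runtime

run = EvalSpec(
    base_model="gpt-4o (2024-05)",
    post_training="vendor default only",
    inference_budget="single sample, no forced chain-of-thought",
    harness="single-turn prompt",
    context="fixed clinical vignette, no retrieval",
    environment="offline, no tool access",
)
print(run)
```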

 

I couldn't read your lab's Nature papers because they are paywalled (lol), but from a quick skim of the ScienceDirect one, the framework would look like:
 

| Framework Level | Addressed in Paper? | Notes |
|---|---|---|
| 1. Pre-training | Yes | Specified model families/sizes |
| 2. Post-training | Yes | Some med fine-tunes, but no recent models with heavy RL |
| 3. Inference Scaling | Partially | Mentions reasoning models but not specific compute |
| 4. Agentic Harnesses | Yes | N/A (single-turn prompt only) |
| 5. Context Engineering | Yes | Prompt-only context engineering |
| 6. Built Environment | Yes | N/A (offline/lab setting) |

Your models (GPT-4o level or lower) are about 1.5 years off the frontier "brains", and there are many other innovations people believe are useful that you aren't using. So in thinking about the results, just understand that the "ceiling" could be much higher. As a general rule of thumb: if you want to prove AI capabilities, use older open-source models; if you want to disprove them, use the newest/best models and tech stack. Proof moves up the capabilities stack, disproof moves backwards (a heuristic, not a law). That's not to say it isn't useful to see what GPT-4o might do when pushed in a certain direction; after all, tons of people will end up using less-than-frontier LLMs in suboptimal ways. It's just worth having a clear model, and I think this framing will make it easier to quickly communicate to an audience what you are testing (though this is not a field standard, just something I made up).


Now getting back to some of the clinical side of this: a wise man once told me "garbage in, garbage out". My understanding is that we do not have anything close to a good answer to "should this person get a mental health referral", or to 75%+ of medical questions.

It might help to walk through a really reductive version of how an effective altruist might think about this triage:

  1. What is the benefit of this intervention?
  2. What is the cost?

Benefit might be measured in QALYs (or many other outputs) and cost in dollars. The correct answer would choose the most cost-effective treatments (again, really simplistic and reductive). While some parts of the American medical system look something like this, most medical decisions look a little different. So one must ask what the right answer to a medical benchmark looks like, unless the goal is just to calcify the industry's priors.
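As a toy illustration of that two-question model (with entirely made-up numbers), ranking interventions by cost per QALY looks like:

```python
# Reductive cost-effectiveness ranking; all figures are fabricated for illustration.
interventions = {
    "treatment_a": {"qalys": 2.0, "cost_usd": 10_000},
    "treatment_b": {"qalys": 0.5, "cost_usd": 1_000},
    "treatment_c": {"qalys": 5.0, "cost_usd": 100_000},
}

for name, x in sorted(interventions.items(),
                      key=lambda kv: kv[1]["cost_usd"] / kv[1]["qalys"]):
    print(f"{name}: ${x['cost_usd'] / x['qalys']:,.0f} per QALY")
# -> treatment_b ($2,000/QALY), treatment_a ($5,000/QALY), treatment_c ($20,000/QALY)
```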

Even if we do agree on the right answer looking something like the model above, we must figure out (1) and (2), and enter the world of evidence-based medicine.

 

(See: "Finding the Evidence," Evidence-Based Medicine research guide, Stony Brook University.)

 

We simply don't have enough RCTs, or accurate enough models between them, to have confident answers to most medical questions for specific people with specific DNA and specific life experiences. We don't have all the answers, or anything close, I think. Setting aside the need for better theoretical models and more RCTs, here are some of the current ontological problems with making clinical medical decisions:

 

The clinical-research ontology gap for diseases - medical billing usually uses ICD or similar, while research is done at the MeSH level, which is often more granular and focused on causes rather than symptoms (a toy sketch of this gap follows this list). I mean really, what is a "disease"? Is a disease the symptoms or the cause? And since we have these different coding systems, plus medical data is always tricky, we don't even necessarily know the incidence or prevalence of most things. This would be a fundamental building block in doing any sort of hallucination-free Bayesian analysis, I would think.

What constitutes "evidence" - hopefully there is a Cochrane review or similar, but if not, and we start moving down the pyramid, how do we incorporate evidence into a clinical decision? I'm not sure the medical field has a unified, systematic take here, so again it's hard to see how you judge an LLM.

Unknown drug/treatment prices - both doctors and patients might not know the cost to society, the hospital, or the patient via insurance, because of the current healthcare setup.

Fraud, p-hacking, poor statistics, etc. - lots of issues with the evidence itself.

Bad/incomplete meta-studies and systematic reviews.
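To make the first (ontology-gap) problem concrete, here's a toy sketch: one coarse ICD-10 billing code fans out to multiple, more cause-oriented MeSH research descriptors. The mapping is heavily simplified and the code IDs are from memory, so treat them as illustrative.

```python
# Toy illustration of the billing-vs-research ontology gap.
ICD_TO_MESH = {
    "E11": [            # ICD-10: Type 2 diabetes mellitus (billing/symptom view)
        "D003924",      # MeSH: Diabetes Mellitus, Type 2
        "D007333",      # MeSH: Insulin Resistance (a cause-level concept)
    ],
}

def research_concepts(icd_code: str) -> list[str]:
    # Any incidence/prevalence estimate depends on which side of this
    # mapping the underlying data was coded in.
    return ICD_TO_MESH.get(icd_code, [])

print(research_concepts("E11"))
```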

 

Again, this isn't all to say you shouldn't benchmark LLMs, but it is worth being wary that you are trying to test them on fundamentally shaky and uncertain ground (scientifically, economically, politically). I have a lot more thoughts on text parsing, meta-studies, and clinical information compression, sorting, and maintenance, but I've already written too much. Good luck!
 

I don't really understand this perspective. Let me try to make sure I'm understanding you. 

(1) Anthropic wrote a company policy/governance document that claimed something
(2) This document was the foundation of much of the community's and the company's perspective on how to think about and interact with AI safety, including major donations and career choices. There are large irreversible path dependencies here.
(3) The document always felt quite dubious to you, to the point where it felt like it wouldn't hold the whole time, whether purposely or due to a lack of clarity on Anthropic's part (I agree completely!)
(4) While this wasn't all 100% predictable right when the RSP was written, it has surely become increasingly obvious to Anthropic leadership for months at this point. Nothing that has happened in the last 6 months is all that surprising; in fact it's basically right on trend, and Dario has said this himself many times. Yet Anthropic continued to wait, taking in significantly more funding and increasingly roping in huge swaths of this community, and only when they were literally about to violate their own document (or already had) did they change it.
(5) This makes you feel better than if they had kept lying/deceiving/whatever more charitable word could be used here.

Is this approximately your perspective? Obviously I'm throwing my own biasing perspective in here, and apologies if I'm misinterpreting.

I mean, sure, in a trivial sense I feel better about them doing (5). Taking a step back, though, it really barely matters and is beside the point. Them admitting (5) is just a natural segue for us to discuss (1)-(4). Nothing they say about their own commitments really matters anymore. Incentives matter.

FWIW, though, I am still highly confused about whether Anthropic is net positive or negative, and quite open, despite all of this, to thinking we should still throw our weight completely behind them.

What someone's #1 focus should be is a really complicated question that involves values, interests, etc. For the movement's part, there is no official list.

That being said, it's reasonable to argue against democracy preservation as a good use of EA's (or specific people's) time, but neglectedness alone would only be a part of that story.

Yes, I think you mostly captured it, and quite well. But I think there is something a little more to it, which is that the EA meme actually is more epistemically humble than you think. There is EA the meme and EA the group. The EA meme has leaked into much of mainstream policy and economics; it's in the water. The EA group has not.

Let's say (referring to your other comment here) you do get a rich funder to fund work on applying alternate moral systems, in a ratio such that we (meaning the current people and groups who you think compose EA; who is that?), in tandem with this new funding, are riding the perfect part of the curve where the marginal efficiency of exploration and exploitation (of our moral values) is equal.

Taking a specific example, let's say this funder funds an EA of biodiversity. Based on some (evolving) metric of biodiversity, this new group finds the best interventions for preserving biodiversity. Let's say their best cause areas, after all of this debate, are saving the coral reefs and preserving indigenous languages and culture.

In what sense are they any longer part of EA? Would you expect this subgroup to then post to the EA Forum and go to EAG? More likely they just become their own thing, or the people get absorbed into the existing biodiversity or climate movements.

So then, are we still properly exploring/exploiting? Or do we now need a new group? Again, who is "we"?

"We" is some effort-, status-, capital-, and talent-weighted aggregation of all the people who care to engage in the spaces and networks of other people who would self-describe as EA. It's a very ephemeral thing, driven by subliminal status games and hidden incentives.

I'm definitely not sure this is futile. I still try to push towards your vision, and others have too.

However, the question isn't whether it can be done, but whether it is the best path. I now lean in the direction that it is better to just start a new movement. I have tried to flesh out parts of generative visions.

Spot on in your analysis, but I don't know if it's fixable. I have suggested many times on this forum that we need to bake moral anti-realism into the core of the movement (which, as you state, probably does nothing). Ironically, I think one of the core (but maybe not so novel) lessons of uppercase EA is that decentralization breeds fanaticism in a social movement if it financially exists within a larger, extremely unequal society (even if the members are insanely thoughtful and Bayesian). Some form of centralization is required to conform evolutionary value drift into something closer to ideal reflection.

There are many paths, but unfortunately all of them require state capacity and culture. We would need some sort of political system to enforce the financial regulations that stop the gravity of the wealth-weighted dominant aesthetics from consuming the meta-idea of ea (lowercase ea). And probably a bunch of other things. But this is hard; there are 3 main camps of resistance:

(1) the pure

Those who believe counting is not politics but math. 

(2) the pragmatic

Those who believe decentralization is good for the movement

(3) the de jure

Those who believe decentralization is good for their career, usually because it continues the default status quo of who currently has power.

Together this coalition is sizable. I'm not sure exactly how large (maybe it's a vocal minority), but I'd reckon at least 30%. Let's assume the rest of the movement is at least weakly in favor of centralization. But I think that 30% is more like 50-70 percent in the hubs of Oxford, DC, and SF (just speculating here). These parts of the movement have not just money but better organization as well. The remaining 70% are spread throughout the world, and it's not clear how they could currently coordinate to force some sort of constitution.

Your functional path 5s are good ideas, but again, who exactly is doing or paying for them? Maybe you can convince someone rich right now, or maybe you can go build these projects, but nothing is legally or politically enforced, and the Egregore will eat it up all the same. Anything short of a real, politically binding set of laws and a delineation between members and non-members seems like window dressing to me. But increasingly I think that even if this did get passed, I wonder whether the EA infra is best left as-is while new young people just start a more functionally agnostic version of the movement. That's at least some of the essence of the post-rats, though they never meant for that to be a big-tent idea.

I feel mixed about AI-writing detection for a few reasons. I have very few issues with someone putting the bullet points of their argument into an AI, reading/editing/discussing the response a few times, and letting the AI write it. I also think there is value in just putting your messy thoughts out as you have them and not having everything polished, but it depends on the situation.

Also, separately, I'm worried AI-writing-detector proliferation will just speed up "immunity". I don't think there is something deep and fundamental that stops an AI from writing, e.g., exactly what I have written to this point. You can already download all your writing, ask an AI to summarize it, make a text file that precisely describes your style, and then ask the AI to write something in your voice. I've done this, and yes, the results still have a bit of that vanilla-LLM feel, but if there is actual market demand for solutions this doesn't seem like an insurmountable problem.
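For the skeptical, here's roughly what that two-step style-cloning flow looks like with the OpenAI Python client. The model name, prompts, and corpus filename are just examples; any chat-completion API works the same way.

```python
# Rough sketch of the style-cloning flow described above. Hypothetical
# filenames/prompts; assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()
my_writing = open("my_forum_comments.txt").read()  # hypothetical corpus file

# Step 1: distill a reusable style profile from your past writing.
style = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": "Describe this author's writing style precisely, "
                          "as instructions for imitating it:\n\n" + my_writing}],
).choices[0].message.content

# Step 2: generate new text conditioned on that profile.
draft = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "system", "content": "Write in this style:\n" + style},
              {"role": "user", "content": "Draft a comment about AI-writing detection."}],
).choices[0].message.content
print(draft)
```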

I think people should say when they used AI and to what degree, and there should be an expectation that, just because polished writing is cheaper than it used to be, you will not pollute the forum with things you have not thought about an appropriate amount.

FWIW, I went to the best (or second-best, lol) high school in Chicago, Northside, and tbh the kids at these top city high schools are of comparable talent to the kids at Northwestern, with a higher tail as well. Moreover, everyone has way more time and can actually chew on the ideas of EA. There was a Jewish org that sent an adult once a week with food, and I pretty much went to all of those, even though I would barely self-identify as Jewish, because of the free food and somewhere to sit and chat about random stuff while I waited for basketball practice.

So yes, I think it would be highly successful. But I think you would need actual adult staff to come at least every other week (as Brian mentioned), and as far as I can tell EA is currently struggling pretty hard with organizing capacity, and it seems to be getting worse (in part because, as I have said many times, we don't celebrate organizers enough and we focus the movement too much on intellectualism rather than coordination and organizing). So I kind of doubt there is a ton of capacity for this. But if there is, it's a good idea. I'm happy to help you understand how you could implement this at CPS selective-enrollment schools if you want to do it yourself.

Thank you for doing this; love to see some data.

I don't have high familiarity with METR, but I think it is probably not great data for this type of analysis. A few issues or clarifications would be needed (anyone who understands METR better, bear with me or correct my mistakes, plz).
 
1. How does METR handle context windows? Are we doing a rolling window? Compaction? Something else? How much of this inverse quadratic relationship is just caused by longer tasks having a larger used context window for the back half of the run? How much is caused by the lack of a persistent, default information-management system? (A minimal sketch of one possible context policy follows this list.)

2. What exact harness(es) is METR using?

Harness/environment engineering and information management might control more of the cost of long-running SWE projects than IQ does (past a point).

3. Does METR allow repo forking? Routing?
In the future, no 180-IQ AI is building the ORM and buttons for a CRUD app. It's either forking a boilerplate or routing the task to a cheaper model.

etc. 
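On question 1, here's a minimal sketch of one possible context policy: a rolling window that keeps the system prompt plus the most recent turns until a token budget is hit. Whether METR does anything like this is exactly what I'm asking; the function and its inputs are hypothetical.

```python
# Hypothetical rolling-window context policy: keep the system prompt plus
# the newest turns that fit in the budget; older turns silently fall out.
def rolling_window(messages, budget_tokens, count_tokens):
    system, rest = messages[0], messages[1:]
    kept, used = [], count_tokens(system)
    for msg in reversed(rest):              # walk newest-first
        cost = count_tokens(msg)
        if used + cost > budget_tokens:
            break                           # everything older is dropped
        kept.append(msg)
        used += cost
    return [system] + list(reversed(kept))
```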

It is said that the current iteration of models suffers from anterograde amnesia. Whether or not this will get bitter-lesson-pilled is a separate question, but for this class of memento models, version control, information management, context management, and the meta-process of improving and routing through the best versions of these combos for a specific task is not some side quest but in fact the main route to making long tasks cheaper. Even as we enter the next paradigm of models that don't have such profound short-term memory loss, a huge part of cost reduction will come from the orchestrator meta-planning how much to explore the space of options and build out the software factory vs. actually starting the work.
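And a toy version of the routing idea from question 3: send boilerplate subtasks to a cheap model, hard ones to the frontier model. The tiers and the keyword "classifier" are hypothetical stand-ins; a real router would learn this from outcome data.

```python
# Toy task router. Model tiers and keyword heuristic are hypothetical.
def route(task: str) -> str:
    BOILERPLATE = ("crud endpoint", "orm model", "ui button", "config")
    if any(k in task.lower() for k in BOILERPLATE):
        return "cheap-small-model"   # or: fork a boilerplate repo instead
    return "frontier-model"

for t in ["Add a CRUD endpoint for users", "Design the caching strategy"]:
    print(t, "->", route(t))
```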

I’m not denying the core question OP is raising — costs could plausibly be rising and could matter a lot. I’m just not convinced this specific curve cleanly isolates “AI economics” from “how expensive a particular scaffold/set of arbitrary constraints makes long-context work.”

Yeah, doesn't the ARC leaderboard show somewhat opposing trends? https://arcprize.org/leaderboard

Might want to check this out (only indirectly related, but maybe useful):

https://forum.effectivealtruism.org/posts/zuQeTaqrjveSiSMYo/a-proposed-hierarchy-of-longtermist-concepts
 

Personally I don't mind "o-risk"; I think it has some utility, but s-risk somewhat seems like it still works here. Isn't an o-risk just a smaller-scale s-risk?
 
