
Thomas Kwa 🔹
Researcher @ METR
Berkeley, CA, USA

Bio: AI safety researcher

Comments

My guess is something like: Many organizations have quarterly caps on the number of false claims published. Their employees often want to make false claims, but towards the end of the quarter they're at the cap, so they delay the post to the first day of the next quarter when space is available.

Okay, but why only April 1? Well, on Jan 1 everyone is on holiday, and on July 1 everyone is out enjoying the good weather. Oct 1 coincides with national holidays in populous countries like China, Nigeria, and the US, and people are also hung over from fiscal New Year's Eve. So we only really see the effect on April 1.

I would strongly predict that a false claims spike also happens in places with bad weather on July 1. Unfortunately, most places are in the Northern Hemisphere where it's warm, and Australia has good weather all year, so I think this is only testable when it snows in New Zealand.

I'd love to sign up, but due to adverse selection concerns I'd prefer to be matched with an EA picked uniformly at random (whether they signed up or not). Is this possible?

What prompt did you use?

On a global scale I agree. My point is more that due to the salary standards in the industry, Eliezer isn't necessarily out of line in drawing $600k, and it's probably not much more than he could earn elsewhere; therefore the financial incentive is fairly weak compared to that of Mechanize or other AI capabilities companies.

Being really good at your job is a good way to achieve impact in general, because your "impact above replacement" is what counts. If a replacement level employee who is barely worth hiring has productivity 100, and the average productivity is 150, the average employee will get 50 impact above replacement. If you do your job 1.67x better than average (250 productivity), you earn 150 impact above replacement, which is triple the average.
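The arithmetic above can be sketched as follows; the productivity numbers (100, 150, 250) are the comment's own illustrative values, and the function name is just for this sketch:

```python
# "Impact above replacement": impact is credited relative to the baseline
# a replacement-level hire (barely worth hiring) would provide.

REPLACEMENT_LEVEL = 100  # productivity of a barely-hireable employee

def impact_above_replacement(productivity: float) -> float:
    """Impact credited beyond what a replacement-level hire would deliver."""
    return productivity - REPLACEMENT_LEVEL

average_impact = impact_above_replacement(150)  # average employee: 150 - 100 = 50
strong_impact = impact_above_replacement(250)   # 1.67x average productivity: 250 - 100 = 150

print(strong_impact / average_impact)  # prints 3.0: triple the average impact
```

The point of the sketch is that a 1.67x multiplier on *productivity* becomes a 3x multiplier on *impact*, because the replacement-level baseline is subtracted before comparing.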

I strongly disagree with a couple of claims:

MIRI's business model relies on the opposite narrative. MIRI pays Eliezer Yudkowsky $600,000 a year. It pays Nate Soares $235,000 a year. If they suddenly said that the risk of human extinction from AGI or superintelligence is extremely low, in all likelihood that money would dry up and Yudkowsky and Soares would be out of a job.

[...] The kind of work MIRI is doing and the kind of experience Yudkowsky and Soares have isn't really transferable to anything else.

  • $235K is not very much money [edit: in the context of the AI industry]. I made close to Nate's salary as basically an unproductive intern at MIRI. $600K is also not much money. A Preparedness researcher at OpenAI has a starting salary of $310K – $460K plus probably another $500K in equity. As for nonprofit salaries, METR's salary range goes up to $450K just for a "senior" level RE/RS, and I think it's reasonable for nonprofits to pay someone with 20 years of experience, who might be more like a principal RS, $600K or more.
    • In contrast, if Mechanize succeeds, Matthew Barnett will probably be a billionaire.
  • If Yudkowsky said extinction risks were low and wanted to focus on some finer aspect of alignment, e.g. ensuring that AIs respect human rights a million years from now, donors who shared their worldview would probably keep donating. Indeed, this might increase donations to MIRI because it would be closer to mainstream beliefs.
  • MIRI's work seems very transferable to other risks from AI, which governments and companies both have an interest in preventing. Yudkowsky and Soares have a somewhat unusual skillset, and while I disagree with some of their research style, it's plausible to me that they could still work productively in a mathy theoretical role in either capabilities or safety.

However, here are things I agree with:

If the Mechanize co-founders wanted to focus on safety rather than capabilities, they could.

the Mechanize co-founders decided to start the company after forming their views on AI safety.

The Yudkowsky/Soares/MIRI argument about AI alignment is specifically that an AGI's goals and motivations are highly likely to be completely alien from human goals and motivations in a way that's highly existentially dangerous.

Is there a formula for the pledge somewhere? I couldn't find one.

See the gpt-5 report. "Working lower bound" is maybe too strong; it's more accurate to describe it as an initial guess at a warning threshold for rogue replication and 10x uplift (if we can even measure time horizons that long). I don't know the exact reasoning behind 40 hours, but one relevant fact is that humans can't really start viable companies using plans that only take a ~week of work. IMO if AIs could do the equivalent with only a 40-human-hour time horizon and continuously evade detection, they'd need to exploit their own advantages and to have made up for many of their current disadvantages relative to humans (like being bad at adversarial and multi-agent settings).

What scale is the METR benchmark on? I see a line that "Scores are normalized such that 100% represents a 50% success rate on tasks requiring 8 human-expert hours.", but is the 0% point on the scale 0 hours?

METR does not think that 8 human hours is sufficient autonomy for takeover; in fact 40 hours is our working lower bound.

What if we decide that the Amazon rainforest has a negative WAW sign? Would you be in favor of completely replacing it with a parking lot, if doing so could be done without undue suffering of the animals that already exist there?

Definitely not completely replacing, because biodiversity has diminishing returns to land. If we pave the whole Amazon we'll probably drive entire families extinct (not to mention probably causing ecological crises elsewhere and disrupting ecosystem services, etc.), whereas on the margin we'd only drive extinct the species endemic to the deforested regions.

If the research on WAW comes out super negative, I could imagine it being OK to replace half the Amazon with higher-welfare ecosystems now, and work on replacing the rest when some crazy AI tech allows all changes to be fully reversible. But the moral parliament would probably still not be happy about this. E.g. killing is probably bad, and there is no feasible way to destroy half the Amazon in the near term without killing most of the animals in it.
