Visiting Researcher @ MIT Media Lab, Nucleic Acid Observatory
Pursuing a doctoral degree (e.g. PhD)


Physician and visiting researcher at Kevin Esvelt's research group Sculpting Evolution. Thinking about ways to safeguard the world from biological risks. Apart from that, I have an assortment of more random EA interests, e.g., how to create better universities, how to make Germany a greater force for technological and social progress, or how to increase high-skilled immigration to the US.


I'm excited that more people are looking into this area!

Flagging that I only read the intro and the conclusion, which might mean I missed something. 

High-skilled immigration

From my current understanding, high-skilled immigration reform seems promising not so much because of the effects on the migrants (though they are positive) but mostly due to the effect on the destination country's GDP and technological progress. The latter has sizeable positive spillover effects (that also accrue to poorer countries).

Advocacy for high-skilled immigration is less controversial and thus easier, which could make interventions in this area more valuable when compared to general immigration reform.

Then again, for the reasons above, more people are likely already working on high-skilled immigration reform.


Also, have you chatted with Johannes Haushofer? He knows EA and recently started Malengo, which wants to facilitate educational migration from low-income countries. I'd assume he has thought about these topics a bunch.

Comment by Paul Christiano on LessWrong:


"RLHF and Fine-Tuning have not worked well so far. Models are often unhelpful, untruthful, inconsistent, in many ways that had been theorized in the past. We also witness goal misspecification, misalignment, etc. Worse than this, as models become more powerful, we expect more egregious instances of misalignment, as more optimization will push for more and more extreme edge cases and pseudo-adversarial examples."

These three links are:

  • The first is Mysteries of mode collapse, which claims that RLHF (as well as OpenAI's supervised fine-tuning on highly-rated responses) decreases entropy. This doesn't seem particularly related to any of the claims in this paragraph, and I haven't seen it explained why this is a bad thing. I asked on the post but did not get a response.
  • The second is Discovering language model behaviors with model-written evaluations and shows that Anthropic's models trained with RLHF have systematically different personalities than the pre-trained model.  I'm not exactly sure what claims you are citing, but I think you are making some really wild leaps.
  • The third is Compendium of problems with RLHF, which primarily links to the previous 2 failures and then discusses theoretical limitations.

I think these are bad citations for the claim that methods are "not working well" or that current evidence points towards trouble.

The current problems you list---"unhelpful, untruthful, and inconsistent"---don't seem like good examples to illustrate your point. These are mostly caused by models failing to correctly predict which responses a human would rate highly. That happens because models have limited capabilities and is rapidly improving as models get smarter. These are not the problems that most people in the community are worried about, and I think it's misleading to say this is what was "theorized" in the past.

I think RLHF is obviously inadequate for aligning really powerful models, both because you cannot effectively constrain a deceptively aligned model and because human evaluators will eventually not be able to understand the consequences of proposed actions. And I think it is very plausible that large language models will pose serious catastrophic risks from misalignment before they are transformative (it seems very hard to tell). But I feel like this post isn't engaging with the substance of those concerns or sensitive to the actual state of evidence about how severe the problem looks like it will be or how well existing mitigations might work.


This post reads like it wants to convince its readers that AGI is near/will spell doom, picking and spelling out arguments in a biased way. 

Just because many people on the Forum and LW (including myself) believe that AI Safety is very important and isn't given enough attention by important actors, I don't want to lower our standards for good arguments in favor of more AI Safety.

Some parts of the post that I find lacking:

 "We don’t have any obstacle left in mind that we don’t expect to get overcome in more than 6 months after efforts are invested to take it down."

I don't think more than 1/3 of ML researchers or engineers at DeepMind, OpenAI, or Anthropic would sign this statement.

"No one knows how to predict AI capabilities."

Many people are trying though (Ajeya Cotra, EpochAI), and I think these efforts aren't worthless. Maybe a different statement could be: "New AI capabilities appear discontinuously, and we have a hard time predicting such jumps. Given this larger uncertainty, we should worry more about unexpected and potentially dangerous capability increases".

"RLHF and Fine-Tuning have not worked well so far."

Setting aside whether RLHF scales (as linked, Jan Leike of OpenAI doesn't think so) and whether it leads to deception: from my cursory reading and experience, ChatGPT shows substantially better behavior than Bing, which might be due to the latter not using RLHF.

Overall, I do agree with the article and think that recent developments have been worrying. Still, if the goal of the article is to get independently thinking individuals to consider working on AI Safety, I'd prefer less extremized arguments.

Thanks for writing this up. I just wanted to note that the OWID graph that appears when hovering over a hyperlink is neat! @JP Addison or whoever created that, cool work.


Flagging that I'm only about 1/3 in.

Regarding this paragraph:

"An epistemically healthy community seems to be created by acquiring maximally-rational, intelligent, and knowledgeable individuals, with social considerations given second place. Unfortunately, the science does not bear this out. The quality of an epistemic community does not boil down to the de-biasing and training of individuals;[3] more important factors appear to be the community’s composition, its socio-economic structure, and its cultural norms.[4]"

When saying that the science doesn't bear this out, you go on to cite footnotes in your original article. If you want to make the case for this, it might be better to either i) point to specific ways in which the current qualities of EA lead to flawed conclusions, or ii) point to research that makes a similar claim.

Appreciated this post! Have you considered crossposting this to LessWrong? Seems like an important audience for this.

I just wanted to note that I appreciated this post and the subsequent discussion, as it quickly allowed me to get a better model of the value of antivirals. Publicly visible discussions around biosecurity interventions are rare, making it hard to understand other people's models. 

I appreciate that there are infohazard considerations here, but I feel they make it too hard for people to scrutinize the views of others.

Appreciated the 5-minute summary; I think more reports of this length should have two summaries, one TL;DR, the other similar to your 5 min summary.


Let's phrase it even more explicitly: You trust EVF to always make the right calls, even in 10 years from now.


The quote above (emphasis mine) reads like a strawman; I don't think Michael would say that they always make the right call. My personal view is that individuals steering GWWC will mostly make the right decisions and downside risks are small enough not to warrant costly governance interventions.
