Visiting Researcher @ MIT Media Lab, Nucleic Acid Observatory
438 karma · Joined Mar 2020 · Pursuing a doctoral degree (e.g. PhD) · Basel, Switzerland


Physician and visiting researcher at Kevin Esvelt's research group Sculpting Evolution. Thinking about ways to safeguard the world from bio. Apart from that, I have an assortment of more random EA interests, e.g., how to create better universities, how to make Germany a greater force for technological and social progress, or how to increase high-skilled immigration in the US.


That's a good pointer, thanks! I'll drop the reference to Diggans and Leproust for now.


Thanks for the write-up. Just adding a note on how this distinction has practical implications for designing the databases of hazardous sequences that gene synthesis screening systems require.

With gene synthesis screening, companies want to stop bad actors from getting access to the physical DNA or RNA of potential pandemic pathogens. Now, let's say researchers find the sequence of a novel pathogen that would likely spark a pandemic if released. Most would want this sequence to be added to synthesis screening databases. But some also want this database to be public. The information hazards involved in making such information publicly available could be large, especially if there is attached discussion of how exactly these sequences are dangerous.
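To make the mechanism concrete, here is a minimal sketch of how an order could be checked against a database of sequences of concern. This is a toy illustration only, not how any real screening system (or its database format) works: the k-mer window size, function names, and sequences are all invented for the example.

```python
# Toy sketch of synthesis screening: an order is flagged if it shares
# any k-length subsequence (k-mer) with a database of hazardous
# sequences. K = 30 is an arbitrary illustrative window size.
K = 30

def kmers(seq: str, k: int = K) -> set[str]:
    """Return all k-length substrings of a DNA sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_hazard_index(hazard_seqs: list[str]) -> set[str]:
    """Precompute the set of k-mers appearing in any hazardous sequence."""
    index: set[str] = set()
    for s in hazard_seqs:
        index |= kmers(s)
    return index

def screen_order(order_seq: str, hazard_index: set[str]) -> bool:
    """Return True if the ordered sequence shares a k-mer with the database."""
    return not kmers(order_seq).isdisjoint(hazard_index)
```

Note that with this design, whoever runs the screening holds the hazard k-mers in the clear, which is exactly why it matters whether the underlying database is public: the same index that enables screening also concentrates the information hazard.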

I skimmed it, and it looks good to me. Thanks for the work! A separate post on this would be cool.

I set a reminder! Also, let me know if you do end up updating it.

Is there an updated version of this? E.g., GDP numbers have changed.

Flagging that I approve of this post; I do believe that the relevant biosecurity actors within EA are thinking about this (though I'd love a more public write-up of this topic). Get in touch if you are thinking about this!

I'm excited that more people are looking into this area!

Flagging that I only read the intro and the conclusion, which might mean I missed something. 

High-skilled immigration

From my current understanding, high-skilled immigration reform seems promising not so much because of the effects on the migrants (though they are positive) but mostly due to the effect on the destination country's GDP and technological progress. The latter has sizeable positive spillover effects (that also accrue to poorer countries).

Advocacy for high-skilled immigration is less controversial and thus easier, which could make interventions in this area more valuable when compared to general immigration reform.

Then again, for the reasons above, more individuals are likely already working on improved high-skilled immigration. 


Also, have you chatted with Johannes Haushofer? He knows EA and recently started Malengo, which wants to facilitate educational migration from low-income countries. I'd assume he has thought about these topics a bunch.

Comment by Paul Christiano on LessWrong:


""RLHF and Fine-Tuning have not worked well so far. Models are often unhelpful, untruthful, inconsistent, in many ways that had been theorized in the past. We also witness goal misspecification, misalignment, etc. Worse than this, as models become more powerful, we expect more egregious instances of misalignment, as more optimization will push for more and more extreme edge cases and pseudo-adversarial examples.""

These three links are:

  • The first is Mysteries of mode collapse, which claims that RLHF (as well as OpenAI's supervised fine-tuning on highly-rated responses) decreases entropy. This doesn't seem particularly related to any of the claims in this paragraph, and I haven't seen it explained why this is a bad thing. I asked on the post but did not get a response.
  • The second is Discovering language model behaviors with model-written evaluations and shows that Anthropic's models trained with RLHF have systematically different personalities than the pre-trained model. I'm not exactly sure what claims you are citing, but I think you are making some really wild leaps.
  • The third is Compendium of problems with RLHF, which primarily links to the previous 2 failures and then discusses theoretical limitations.

I think these are bad citations for the claim that methods are "not working well" or that current evidence points towards trouble.

The current problems you list---"unhelpful, untruthful, and inconsistent"---don't seem like good examples to illustrate your point. These are mostly caused by models failing to correctly predict which responses a human would rate highly. That happens because models have limited capabilities and is rapidly improving as models get smarter. These are not the problems that most people in the community are worried about, and I think it's misleading to say this is what was "theorized" in the past.

I think RLHF is obviously inadequate for aligning really powerful models, both because you cannot effectively constrain a deceptively aligned model and because human evaluators will eventually not be able to understand the consequences of proposed actions. And I think it is very plausible that large language models will pose serious catastrophic risks from misalignment before they are transformative (it seems very hard to tell). But I feel like this post isn't engaging with the substance of those concerns or sensitive to the actual state of evidence about how severe the problem looks like it will be or how well existing mitigations might work.


This post reads like it wants to convince its readers that AGI is near/will spell doom, picking and spelling out arguments in a biased way. 

Many people on the Forum and LW (including myself) believe that AI Safety is very important and isn't given enough attention by important actors, but that's no reason to lower our standards for good arguments in favor of more AI Safety.

Some parts of the post that I find lacking:

 "We don’t have any obstacle left in mind that we don’t expect to get overcome in more than 6 months after efforts are invested to take it down."

I don't think more than 1/3 of ML researchers or engineers at DeepMind, OpenAI, or Anthropic would sign this statement.

"No one knows how to predict AI capabilities."

Many people are trying though (Ajeya Cotra, EpochAI), and I think these efforts aren't worthless. Maybe a different statement could be: "New AI capabilities appear discontinuously, and we have a hard time predicting such jumps. Given this larger uncertainty, we should worry more about unexpected and potentially dangerous capability increases".

"RLHF and Fine-Tuning have not worked well so far."

Setting aside whether RLHF scales (as linked, Jan Leike of OpenAI doesn't think so) and whether RLHF leads to deception: from my cursory reading and experience, ChatGPT shows substantially better behavior than Bing, which might be due to the latter not using RLHF.

Overall I do agree with the article and think that recent developments have been worrying. Still, if the goal of the article is to get independently-thinking individuals to work on AI Safety, I'd prefer less extremized arguments.

Thanks for writing this up. I just wanted to note that the OWID graph that appears when hovering over a hyperlink is neat! @JP Addison or whoever created that, cool work.
