Joseph Bloom



As an author on this post, I think this is a surprisingly good summary. Some notes:

  • While all of the features are fictional, the more realistic ones are not far from reality. We've seen scripture features of various kinds in real models. A scripture-intersect-Monty-Python feature just wouldn't be that surprising.
  • Some of the other features were more about tying in interesting structure in reality than anything else (e.g. the criticism-of-criticism feature).
  • In terms of the absurdities of feature interpretation, I think the idea was to highlight awareness of possible flaws, like buying into the overly complicated stories we could tell if we work too hard to explain our results. We're not sure what we're doing yet in this pre-paradigmatic science, so having a healthy dose of self-awareness is important!

Project Proposal: Working on understanding how AIs work by watching them think as they play video games. Needs Python developers, possibly C++.

I'd like to extend my current technical alignment project stack in one of a few non-trivial ways and would love help from more experienced software engineers to do it.
- Post: https://www.lesswrong.com/posts/bBuBDJBYHt39Q5zZy/decision-transformer-interpretability
- GitHub: https://github.com/jbloomAus/DecisionTransformerInterpretability

I'm not sure what the spread of technical proficiency is or how interested people are in assisting with my research agenda, but I've made a list of what I think are solid engineering challenges that I would love to get help with. Items 1 and 2 are things I can do/manage myself; item 3 is something I would need assistance with from someone with more experience.

1. Re-implementing bespoke grid worlds, such as the AI Safety Gridworlds, proper mazes, or novel environments, in currently maintained/compatible packages (Gymnasium and/or MiniGrid) to study alignment-relevant phenomena in RL agents/agent simulators.

2. Implementing methods for optimizing inputs (feature visualization) for PyTorch models/MiniGrid environments.

3. Developing a real-time mechanistic interpretability app for procgen games (i.e. extending https://distill.pub/2020/understanding-rl-vision/#feature-visualization to game-time, interactive play with pausing). I have a Streamlit app that does this for gridworlds, which I can demo.

Further Details:

1. The AI Safety Gridworlds suite (https://github.com/deepmind/ai-safety-gridworlds) is more than 5 years old and is implemented in DeepMind's pycolab engine (https://github.com/deepmind/pycolab). I'd love to study these environments with the current mechanistic interpretability techniques implemented in TransformerLens and the Decision Transformer Interpretability codebase; however, getting this all working will take time, so it would be great if people were interested in smashing that out. Having proper mazes for agents to solve in MiniGrid would also be interesting, in order to test our ability to reverse-engineer algorithms from models using current techniques.
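To gesture at the shape of this work, here's a minimal sketch of a custom environment using the maintained minigrid/gymnasium packages. `SimpleCorridorEnv` and its layout are my hypothetical illustration, not one of the actual AI Safety Gridworlds:

```python
# Minimal sketch of a custom MiniGrid environment (hypothetical example).
from minigrid.core.grid import Grid
from minigrid.core.mission import MissionSpace
from minigrid.core.world_object import Goal
from minigrid.minigrid_env import MiniGridEnv


class SimpleCorridorEnv(MiniGridEnv):
    """Agent starts in the top-left and must reach a goal in the bottom-right."""

    def __init__(self, size=8, **kwargs):
        super().__init__(
            mission_space=MissionSpace(mission_func=self._gen_mission),
            grid_size=size,
            max_steps=4 * size * size,
            **kwargs,
        )

    @staticmethod
    def _gen_mission():
        return "reach the goal"

    def _gen_grid(self, width, height):
        # Build an empty grid surrounded by walls, with a goal in the corner.
        self.grid = Grid(width, height)
        self.grid.wall_rect(0, 0, width, height)
        self.put_obj(Goal(), width - 2, height - 2)
        self.agent_pos = (1, 1)
        self.agent_dir = 0
        self.mission = self._gen_mission()
```

Environments written this way drop straight into the standard Gymnasium reset/step loop, so trajectories can be collected for decision transformer training without bespoke glue code.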

2. Feature visualization techniques aren't new, but they have previously been applied to continuous input spaces, such as the image inputs of CNNs. However, recent work by Jessica Rumbelow (SolidGoldMagikarp post: https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation#Prompt_generation) has shown that it's possible to perform this technique on discrete spaces such as word embeddings. Extending this to the discrete environments we have been studying or might study (see 1) may provide valuable insights. Lucent (Lucid for PyTorch) may also be useful for this.
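As a rough illustration of the idea (my sketch, not code from the linked post), one can gradient-ascend a continuous "soft" input to maximize some target activation, then project back to the nearest legal discrete input; `activation_fn` and the shapes here are hypothetical stand-ins:

```python
# Sketch: feature visualization over a discrete input space (hypothetical).
import torch

def optimize_discrete_input(embedding_matrix, activation_fn, steps=200, lr=0.1):
    """Gradient-ascend a soft embedding, then snap it to the nearest row of
    `embedding_matrix` (i.e. the nearest real token/observation embedding)."""
    d = embedding_matrix.shape[1]
    x = torch.randn(1, d, requires_grad=True)
    optimizer = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = -activation_fn(x)  # minimize the negative = maximize activation
        loss.backward()
        optimizer.step()
    # Project the optimized point back onto the discrete space.
    distances = torch.cdist(x.detach(), embedding_matrix)
    return distances.argmin(dim=1)  # index of the closest discrete input
```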

3. The current interactive analysis app for Decision Transformer Interpretability is written in Streamlit and so runs very slowly. This is fine for grid-world-type environments but won't work for continuous procedurally generated environments like procgen (https://github.com/openai/procgen). Writing a procgen/Python wrapper that provides live model analysis (with the ability to pause mid-game) will be crucial to further work.
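One plausible shape for this (a sketch under my own assumptions, not the existing app's code) is an environment wrapper that caches a chosen layer's activations via a PyTorch forward hook on every step, so a front-end can pause the game and render them; `policy` and `layer` are hypothetical stand-ins:

```python
# Sketch: live activation capture around a procgen-style env (hypothetical).
import gym  # procgen exposes old-style gym environments

class LiveAnalysisWrapper(gym.Wrapper):
    def __init__(self, env, policy, layer):
        super().__init__(env)
        self.policy = policy
        self.latest_activation = None
        # The forward hook fires whenever the policy runs, caching the
        # chosen layer's output for the UI to display.
        layer.register_forward_hook(self._save_activation)

    def _save_activation(self, module, inputs, output):
        self.latest_activation = output.detach()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # Expose the cached activation, e.g. for rendering while paused.
        info["activation"] = self.latest_activation
        return obs, reward, done, info
```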

Feel free to ask questions here!

Hi all, I'm Joseph. I have a double degree in computational biology/statistics, have RA'd in protein engineering (structure/dynamics), and have worked in proteomics (LC-MS). I'm currently on an FTX regrant to find ways to help with biosecurity/AI. I'm focusing on upskilling in AI, but keen to keep discussing biosecurity.

Thanks so much! Was hoping someone would do this soon!


It's great that you are trying to make these kinds of decisions with impact in mind!

I have a comp bio background, but more in proteomics, and have spent time this year looking at different ways to have a large impact, although my own focus was much more on pandemic preparedness/x-risk.

This problem is probably underspecified; it's just very hard for anyone at this level of abstraction to make the decision for you. Details like your relationship with the supervisor or the other lab members, for example, could be critical. It does sound like you prefer the first option, but I'd encourage you to test your hypothesis thoroughly (or at least proportionally to the subsequent time investment).

However, some guiding principles may help:

  • Speak to current lab members/students of either lab. If you feel very confident that they are sending out good cultural and intellectual vibes, then your time is a much safer bet there.
  • Field-wise, metagenomics seems likely to be very useful in pandemic preparedness (see SecureDNA), so if your work has a higher inner product with (more in common with) those kinds of projects, I'd see that as a concretely safer bet.
  • Given your other interests, I'd definitely go speak to more bio experts. Book an appointment with the EA "consult a bio expert" service (if it's still open), or look for EAs you could chat with and contact them.

EffectiveThesis might have some useful content too. https://effectivethesis.org/

Good luck and all the best!

Thanks Agustin, 

I appreciate the clarification and this kind of detail ("people with experience working on climate change research, activism or public policy" as opposed to others). 

Based on this thread, I think we'd be looking for a document that meets the following criteria:

  •  Extends/Summarises current EA material on climate change so that it's clear that EA has made serious attempts to assess it. 
  • A nuanced explanation of the ITN framework, explaining how much of the work on climate change is not neglected, and which observations might justify working on climate change over other cause areas.
  • Some description of other EA cause areas and links to similar reasoning which may explain why they are prioritised by some EAs. 

Such a document should also be simple enough to be linked as introductory material for someone not familiar with EA. It would also be valuable to test such a document/set of arguments on some climate activists, or even iterate based on their feedback, in order to make it more effective.

I'm definitely not the person to write this, but I could ask around a few places to see if anyone is keen to work on it. It sounds like our prior is that this is likely enough to be valuable, and simple enough to attempt, that it's worth a shot. 

That's fair. I'll keep thinking about it but this was helpful, thanks.

My general sense of the 80k handbook is that it is very careful to emphasise uncertainty and leaves room for people to project existing beliefs without updating. 

For example:

> Working on this issue seems to be among the best ways of improving the long-term future we know of, but all else equal, we think it’s less pressing than our highest priority areas.

I value the integrity that 80k has here, but I think something shorter, with more direct comparisons to other cause areas, might be more effective. 

Thanks for the answer. Does this idea of looking at it in that hypothetical world framing have a related post somewhere?
