Project Proposal: Working on understanding how AIs work by watching them think as they play video games. Needs python developers/possibly c++.I'd like to extend my current technical alignment project stack in one of a few non-trivial ways and would love help from more experienced software engineers to do it.- Post: https://www.lesswrong.com/posts/bBuBDJBYHt39Q5zZy/decision-transformer-interpretability- GitHub: https://github.com/jbloomAus/DecisionTransformerInterpretabilityI'm not sure what the spread of technical proficiency is or how interested people are in assisting with my research agenda, but I've made a list of what I think are solid engineering challenges that I would love to get help with. 1/2 is stuff I can do/manage, and 3 is something I would need assistance with from someone with more experience.1. Re-implementing bespoke grid worlds such as AI safety grid worlds, proper mazes or novel environments in currently maintained/compatible packages (gymnasium and/or <inigrid) to study alignment-relevant phenomena in RL agents/agent simulators.2. Implementing methods for optimizing inputs (feature visualization) for pytorch models/MiniGrid environments.3. Develop an real-time mechanistic interpretability app for procgen games (ie: extend https://distill.pub/2020/understanding-rl-vision/#feature-visualization to game-time, interactive play with pausing). I have a streamlit app that does this for gridworlds which I can demo.Further Details:1. The AI Safety GridWorlds (https://github.com/deepmind/ai-safety-gridworlds) is more than 5 years old and implemented in DeepMind’s pycolab engine (https://github.com/deepmind/pycolab). I’d love to study them with the current mechanistic interpretability techniques implemented in TransformerLens and the Decision Transformer Interpretability codebase, however, getting this all working will take time so it would be cool if people were interested in smashing that out. Having proper mazes for agents to solve in Minigrid would also be interesting in order to test our ability to reverse engineer algorithms from models using current techniques.2. Feature Visualization Techniques aren’t new but have been previously used on continuous input spaces like CNNs. However, recent work by Jessica Rumbelow (SolidGoldMagicarp post: https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation#Prompt_generation ) has shown that it’s possible to perform this technique on discrete spaces such as word embeddings. Extending this to the discrete environments we have been studying or might study (see 1) may provide valuable insights. Lucent (lucid for pytorch) may also be useful for this.3. The current interactive analysis app for Decision Transformer Interpretability is written in Streamlit and so runs very slowly. This is fine for grid world-type environments but won’t work for continuous procedurally generated environments like procgen (https://github.com/openai/procgen). Writing a procgen/python wrapper that provides live model analysis (with the ability to pause mid-game) will be crucial to further work.Feel free to ask questions here!
I'm Joseph. I have a double degree in computational biology/statistics, have RA'd in protein engineering (structure/dynamics) and worked in proteomics (LCMS). Currently on an FTX regrant to find ways to help with Biosecurity/AI. I'm focusing on upskilling in AI, but keen to keep discussing biosecurity.
Thanks so much! Was hoping someone would do this soon!
It's great that you are trying to make these kinds of decisions with impact in mind!
I have a comp bio background but more in proteomics and have spent time this year looking at different ways to have a large impact, although my own focus was much more on pandemics preparedness / x-risk.
Probably this problem is underspecified. It's just very hard for anyone at this level of abstraction to make the decision for you. Details like your relationship with the supervisor, other lab members for example could be critical. It does sound like you like the first option but I'd encourage you to test your hypothesis thoroughly (or proportionally to the subsequent time investment).
However, some guiding principles may help:
EffectiveThesis might have some useful content too. https://effectivethesis.org/
Good luck and all the best!
I appreciate the clarification and this kind of detail ("people with experience working on climate change research, activism or public policy" as opposed to others).
Based on this thread, I think we'd be looking for a document that meets the following criteria:
Such a document should also be simple enough to be linked as introductory material to someone not familiar with EA. It would also be valuable to test such a document/set of arguments on some climate activists or even iterate based on their feedback in order to be more effective.
I'm definitely not the person to write this, but I could ask around a few places to see if anyone is keen to work on it. It sounds like our prior is that this is likely enough to be valuable, and simple enough to attempt, that it's worth a shot.
That's fair. I'll keep thinking about it but this was helpful, thanks.
My general sense of the 80k handbook is that it is very careful to emphasise uncertainty and leaves room for people to project existing beliefs without updating.
Working on this issue seems to be among the best ways of improving the long-term future we know of, but all else equal, we think it’s less pressing than our highest priority areas.
I value the integrity that 80k has here, but I think something shorter, with more direct comparisons to other cause areas, might be more effective.
Thanks for the answer. Does this idea of looking at it in that hypothetical word framing have a related post somewhere?
This is fantastic, thank you! Is there a summary of the main insights/common threads from the interviews?