- Who: anyone! Software engineers will be the primary contributors, of course, but we will offer optional introductory sessions for curious or aspiring developers. You do not have to have attended EAG Bay Area to attend the Hackathon.
- Where: Momentum office at 3004 16th St, just off the 16th St Mission BART Station
- When: Mon, 2/27 from 10am - 7pm
- What: work independently or with collaborators on an EA-aligned project of your choosing
If you would like to share your Hackathon project idea, please leave a comment!
Agenda:
- 10am-10:15 — participants arrive and get set up
- 10:15-10:20 — welcome and logistics talk by Nicole Janeway Bills of EA Software Engineers
- 10:20-10:30 — opening talk by Austin Chen of Manifold Markets on expectations and ways of working for the event
- 10:30-10:45 — project pitches — people with ideas can share them with the group
- 10:45 — start of work and learning sessions
- 12pm — lunch — vegan and nonvegan options
- 6pm — dinner and project presentations
- 6:45-7pm — prize announcements and wrap up
Learning Sessions:
- 10:45 — setting up your development environment
- 11:30 — basics of git
- 1pm — intro to frontend development
- 2pm — open source contributions in AI safety (presentation link to be added later)
Looking forward to seeing you at the event! Add your photos here.
Project Proposal: Understanding how AIs work by watching them think as they play video games. Needs Python developers, possibly C++.
I'd like to extend my current technical alignment project stack in one of a few non-trivial ways and would love help from more experienced software engineers to do it.
- Post: https://www.lesswrong.com/posts/bBuBDJBYHt39Q5zZy/decision-transformer-interpretability
- GitHub: https://github.com/jbloomAus/DecisionTransformerInterpretability
I'm not sure what the spread of technical proficiency is or how interested people will be in assisting with my research agenda, but I've made a list of what I think are solid engineering challenges that I would love help with. Items 1 and 2 are things I can do or manage myself; item 3 is something I'd need assistance with from someone with more experience.
1. Re-implementing bespoke grid worlds, such as the AI Safety Gridworlds, proper mazes, or novel environments, in currently maintained/compatible packages (gymnasium and/or minigrid) to study alignment-relevant phenomena in RL agents/agent simulators.
2. Implementing methods for optimizing inputs (feature visualization) for PyTorch models/MiniGrid environments.
3. Developing a real-time mechanistic interpretability app for procgen games (i.e., extending https://distill.pub/2020/understanding-rl-vision/#feature-visualization to game-time, interactive play with pausing). I have a Streamlit app that does this for grid worlds, which I can demo.
Further Details:
1. The AI Safety Gridworlds suite (https://github.com/deepmind/ai-safety-gridworlds) is more than five years old and is implemented in DeepMind’s pycolab engine (https://github.com/deepmind/pycolab). I’d love to study these environments with the current mechanistic interpretability techniques implemented in TransformerLens and the Decision Transformer Interpretability codebase; however, getting this all working will take time, so it would be great if people were interested in smashing that out. Having proper mazes for agents to solve in MiniGrid would also be interesting, in order to test our ability to reverse engineer algorithms from models using current techniques.
2. Feature visualization techniques aren’t new, but they have previously been applied to continuous input spaces such as the image inputs of CNNs. However, recent work by Jessica Rumbelow (SolidGoldMagikarp post: https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation#Prompt_generation ) has shown that it’s possible to perform this technique on discrete spaces such as word embeddings. Extending this to the discrete environments we have been studying or might study (see item 1) may provide valuable insights. Lucent (Lucid for PyTorch) may also be useful here.
3. The current interactive analysis app for Decision Transformer Interpretability is written in Streamlit and so runs very slowly. This is fine for grid-world-type environments but won’t work for continuous procedurally generated environments like procgen (https://github.com/openai/procgen). Writing a procgen/Python wrapper that provides live model analysis (with the ability to pause mid-game) will be crucial to further work.
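To make item 1 above concrete, here is a dependency-free sketch of a bespoke grid world exposing the gymnasium-style reset/step interface (observation, reward, terminated, truncated, info). All names here are hypothetical; a real port would subclass gymnasium.Env or minigrid's environment base class rather than duck-typing the API:

```python
# Minimal sketch of a bespoke grid world following the gymnasium-style
# reset/step API. Written dependency-free for illustration; a real
# re-implementation would subclass gymnasium.Env or minigrid's MiniGridEnv.

class ToyGridWorld:
    """Agent starts at (0, 0) and must reach the goal in the far corner."""

    # action id -> (row delta, col delta): right, left, down, up
    ACTIONS = {0: (0, 1), 1: (0, -1), 2: (1, 0), 3: (-1, 0)}

    def __init__(self, size=4, max_steps=50):
        self.size = size
        self.max_steps = max_steps
        self.goal = (size - 1, size - 1)

    def reset(self, seed=None):
        self.pos = (0, 0)
        self.steps = 0
        return self.pos, {}  # observation, info

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        # Clamp the move so the agent stays on the grid.
        r = min(max(self.pos[0] + dr, 0), self.size - 1)
        c = min(max(self.pos[1] + dc, 0), self.size - 1)
        self.pos = (r, c)
        self.steps += 1
        terminated = self.pos == self.goal
        truncated = self.steps >= self.max_steps
        reward = 1.0 if terminated else 0.0
        return self.pos, reward, terminated, truncated, {}
```

The same reset/step shape is what TransformerLens-friendly rollouts would consume, so porting pycolab environments mostly means re-expressing their level layouts and reward logic in this interface.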
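For item 2, here is a toy sketch of feature visualization as input optimization: gradient ascent on the input to maximize a unit's activation. The quadratic "neuron" and its hand-derived gradient are invented purely for illustration; in practice you would optimize through a real PyTorch model with autograd (or a library like Lucent), and the discrete-space case would need tricks like those in the SolidGoldMagikarp post:

```python
# Toy illustration of feature visualization as input optimization:
# climb the gradient of a unit's activation with respect to the input.
# The "neuron" below is a made-up quadratic with an analytic gradient;
# a real version would backprop through a PyTorch model instead.

def activation(x, w):
    # Hypothetical unit that fires most strongly when x matches pattern w.
    return -sum((xi - wi) ** 2 for xi, wi in zip(x, w))

def grad_activation(x, w):
    # d/dx_i of the activation above: -2 * (x_i - w_i).
    return [-2.0 * (xi - wi) for xi, wi in zip(x, w)]

def visualize_feature(w, steps=200, lr=0.1):
    """Start from a zero input and take gradient-ascent steps."""
    x = [0.0] * len(w)
    for _ in range(steps):
        g = grad_activation(x, w)
        x = [xi + lr * gi for xi, gi in zip(x, g)]
    return x
```

The recovered input converges to the pattern the unit "prefers", which is exactly the object feature visualization tries to produce for real network units.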
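And for item 3, a minimal sketch of the game loop a live-analysis wrapper needs: step the environment, log the model's activations each step, and support pausing for inspection. The env/agent interfaces here are stubbed assumptions following the gymnasium step shape; in the real app the activations would come from PyTorch forward hooks on the decision transformer, and pausing would be wired to the UI rather than simply breaking the loop:

```python
# Sketch of a live mechanistic-interpretability loop: run the game,
# record per-step activations, allow pausing mid-game. Env and agent
# are assumed interfaces; real activations would be captured with
# PyTorch forward hooks on the model.

class LiveAnalysisLoop:
    def __init__(self, env, agent):
        self.env = env
        self.agent = agent
        self.paused = False
        self.activation_log = []  # one record per step, inspected on pause

    def toggle_pause(self):
        self.paused = not self.paused

    def run(self, max_steps=100):
        obs, _ = self.env.reset()
        for _ in range(max_steps):
            if self.paused:  # a real UI would block here for inspection
                break
            # Assumed agent API: returns both the action and the
            # activations captured during the forward pass.
            action, acts = self.agent.act(obs)
            self.activation_log.append(acts)
            obs, reward, terminated, truncated, _ = self.env.step(action)
            if terminated or truncated:
                break
        return self.activation_log
```

Because the log is built per step, pausing at any point leaves a complete activation history up to the current frame, which is what the interactive procgen analysis needs.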
Feel free to ask questions here!
Thanks for your detailed project idea!!