William_S

264 karma

Bio

I worked at OpenAI for three years, from 2021 to 2024, on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight, as part of the team developing critiques as a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language models to generate explanations for features in language models. I was then promoted to managing a team of 4 people that worked on trying to understand language model features in context, leading to the release of an open source "transformer debugger" tool.
I resigned from OpenAI on February 15, 2024.

Posts
1


Comments
51

Fair, I'm grumpy about Leopold's position but my above comment wasn't careful to target the real problems and doesn't give a good general rule here.

Also, unless one understands the Chinese situation, one should avoid moves that risk escalating a race, like making loud and confident predictions that a race is the only way.

One tool that I think would be quite useful is having some kind of website where you gather:

  1. Situations: descriptions of decisions that people are facing, and their options
  2. Outcomes: the option that they took, and how they felt about it after the fact

Then you could get a description of a decision that someone new is facing and automatically assemble a reference class for them of people with the most similar decisions and how they turned out. Could work without any ML, but language modelling to cluster similar situations would help.
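The no-ML version of this lookup might be sketched roughly as below. This is only an illustration under stated assumptions: the `DecisionRecord` fields, the `reference_class` function, and the sample records are all hypothetical, and plain word overlap (Jaccard similarity) stands in for the language-model clustering mentioned above.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class DecisionRecord:
    """One entry on the hypothetical site (field names are assumptions)."""
    situation: str   # description of the decision and the options considered
    chosen: str      # the option the person took
    reflection: str  # how they felt about it after the fact


def tokens(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for a learned text embedding."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Word overlap between two token sets, from 0.0 to 1.0."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def reference_class(
    query: str, records: list[DecisionRecord], k: int = 3
) -> list[DecisionRecord]:
    """Return the k records whose situations share the most words with the query."""
    q = tokens(query)
    ranked = sorted(
        records, key=lambda r: jaccard(q, tokens(r.situation)), reverse=True
    )
    return ranked[:k]


# Made-up records, for illustration only.
records = [
    DecisionRecord(
        "Leave a stable job to start a company, or stay",
        "started company",
        "glad, but it was stressful",
    ),
    DecisionRecord(
        "Move to a new city for a job offer, or stay near family",
        "moved",
        "missed family, liked the job",
    ),
    DecisionRecord(
        "Start a PhD or take an industry job",
        "industry job",
        "no regrets",
    ),
]

for record in reference_class("Should I leave my job to start a company?", records, k=2):
    print(f"{record.situation} -> {record.chosen} ({record.reflection})")
```

Swapping `jaccard` for a similarity over language-model embeddings would be the natural upgrade; the rest of the lookup would stay the same.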

Kind of similar information to a review site, but hopefully it can aggregate by situation instead of by product used, and cover decisions that are not in the category of "pick a product to buy".

Appreciate the point that they are competing for time (I was only thinking of monopolies over content).

If the reason it isn't used is that users don't "trust that the system will give what they want given a single short description", then part of the research agenda for aligned recommender systems is not just producing systems that are aligned, but systems where their users have a greater degree of justified trust that they are aligned (placing more emphasis on the user's experience of interacting with the system). Some of this research could potentially take place with existing classification-based filters.

While fully understanding a user's preferences and values requires more research, it seems like there are simpler things that existing recommender systems could do that would be a win for users, e.g. Facebook having a "turn off inflammatory political news" switch (or a list of 5-10 similar switches), where current knowledge would suffice to train a classification system.
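As a rough sketch of what such switches could look like in code: each switch maps to a classifier, and a post is hidden if any switched-on classifier flags it. Everything here is an assumption made for illustration, including the switch names, the `filter_feed` function, the threshold, and the toy keyword scorer, which stands in for a trained classification system.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable

# A classifier maps post text to a score in [0, 1].
Classifier = Callable[[str], float]


def keyword_score(keywords: set[str]) -> Classifier:
    """Toy classifier: fraction of words in the text that match a keyword."""
    def score(text: str) -> float:
        words = text.lower().split()
        if not words:
            return 0.0
        hits = sum(word.strip(".,!?") in keywords for word in words)
        return hits / len(words)
    return score


# Hypothetical switches; a real system would use trained models here.
CLASSIFIERS: dict[str, Classifier] = {
    "inflammatory_politics": keyword_score({"outrage", "scandal", "traitor"}),
    "celebrity_gossip": keyword_score({"celebrity", "rumor", "breakup"}),
}


@dataclass
class UserSettings:
    """Names of the content categories this user has switched off."""
    hidden: set[str] = field(default_factory=set)


def filter_feed(
    posts: list[str], settings: UserSettings, threshold: float = 0.1
) -> list[str]:
    """Keep only posts that no switched-off category's classifier flags."""
    kept = []
    for post in posts:
        flagged = any(
            CLASSIFIERS[name](post) >= threshold for name in settings.hidden
        )
        if not flagged:
            kept.append(post)
    return kept


posts = [
    "Outrage as senator is called a traitor!",
    "Local bakery wins regional bread award",
    "Celebrity breakup rumor spreads online",
]
settings = UserSettings(hidden={"inflammatory_politics"})
for post in filter_feed(posts, settings):
    print(post)
```

The point of the design is that the user, not the ranking objective, decides which classifiers apply to their feed.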

It could be the case that this is bottlenecked by the incentives of current companies, in that there isn't a good revenue model for recommender systems other than advertising, and advertising creates the perverse incentive to keep users on your system as long as possible. Or it might be the case that most recommender systems are effectively monopolies on their respective content, and users will choose an aligned system over an unaligned one if options are available, but otherwise a monopoly faces no pressure to align their system.

In these cases, the bottleneck might be "start and scale one or more new organizations that do aligned recommender systems using current knowledge" rather than "do more research on how to produce more aligned recommender systems".

If we want to maximize flow-through effects to AI Alignment, we might want to deliberately steer the approach adopted for aligned recommender systems to one that is also designed to scale to more difficult problems/more advanced AI systems (like Iterated Amplification). Having an idea become standard in the world of recommender systems could significantly increase the amount of non-safety researcher effort put towards that idea. Solving the problem a bit earlier with a less scalable approach could close off this opportunity.

I wonder how much of the interview/work stuff is duplicated between positions - if there's a lot of overlap, then maybe it would be useful for someone to create the EA equivalent of TripleByte - run initial interviews/work projects with a third party organization to evaluate quality, pass along to most relevant EA jobs.

I agree with this. It seems like the world where Moral Circle Expansion is useful is the world where:

  1. The creators of AI are philosophically sophisticated (or persuadable) enough to expand their moral circle if they are exposed to the right arguments or work is put into persuading them.
  2. They are not philosophically sophisticated enough to realize the arguments for expanding the moral circle on their own (seems plausible).
  3. They are not philosophically sophisticated enough to realize that they might want to consider the distribution of arguments that they could have faced, and that could have persuaded them about what is morally right, and to design AI with this in mind (i.e. CEV), or with the goal of achieving a period of reflection where they can sort out which arguments they would want to consider.

I think I'd prefer pushing on point 3, as it also encompasses a bunch of other potential philosophical mistakes that AI creators could make.