227 karmaJoined


Another potential windfall I just thought of: the kind of AI scientist system discussed by Bengio in this talk (older writeup). The idea is to build a non-agentic system that uses foundation models and amortized Bayesian inference to create and do inference on compositional and interpretable world models. One way this would be used is for high-quality estimates of p(harm|action) in the context of online monitoring of AI systems, but if it could work it would likely have other profitable use cases as well.

A concrete version of this I've been wondering about the last few days: To what extent are the negative results on Debate (single-turn, two-turn) intrinsic to small-context supervision vs. a function of relatively contingent design choices about how people get to interact with the models?

I agree that misuse is a concern. Unlike alignment, I think it's relatively tractable because it's more similar to problems people are encountering in the world right now.

To address it, we can monitor and restrict usage as needed. The same tools that Elicit provides for reasoning can also be used to reason about whether a use case constitutes misuse.

This isn't to say that we might not need to invest a lot of resources eventually, and it's interestingly related to alignment ("misuse" is relative to some values), but it feels a bit less open-ended.

Elicit is using using the Semantic Scholar Academic Graph dataset. We're working on expanding to other sources. If there are particular ones that would be helpful, message me?

Have you listened to the 80k episode with Nova DasSarma from Anthropic? They might have cybersecurity roles. The closest we have right now is devops—which, btw, if anyone is reading this comment, we are really bottlenecked on and would love intros to great people.

No, it's that our case for alignment doesn't rest on "the system is only giving advice" as a step. I sketched the actual case in this comment.

Oh, forgot to mention Jonathan Uesato at Deepmind who's also very interested in advancing the ML side of factored cognition.

The things that make submodels easier to align that we’re aiming for:

  • (Inner alignment) Smaller models, making it less likely that there’s scheming happening that we’re not aware of; making the bottom-up interpretability problem easier
  • (Outer alignment) More well-specified tasks, making it easier to generate a lot of in-distribution feedback data; making it easier to do targetted red-teaming

For AGI there isn't much of a distinction between giving advice and taking actions, so this isn't part of our argument for safety in the long run. But in the time between here and AGI it's better to focus on supporting reasoning to help us figure out how to manage this precarious situation.

Load more