stuhlmueller

206Joined Mar 2019

Posts
2

Sorted by New

Comments
21

A concrete version of this I've been wondering about the last few days: To what extent are the negative results on Debate (single-turn, two-turn) intrinsic to small-context supervision vs. a function of relatively contingent design choices about how people get to interact with the models?

I agree that misuse is a concern. Unlike alignment, I think it's relatively tractable because it's more similar to problems people are encountering in the world right now.

To address it, we can monitor and restrict usage as needed. The same tools that Elicit provides for reasoning can also be used to reason about whether a use case constitutes misuse.

This isn't to say that we might not need to invest a lot of resources eventually, and it's interestingly related to alignment ("misuse" is relative to some values), but it feels a bit less open-ended.

Elicit is using using the Semantic Scholar Academic Graph dataset. We're working on expanding to other sources. If there are particular ones that would be helpful, message me?

Have you listened to the 80k episode with Nova DasSarma from Anthropic? They might have cybersecurity roles. The closest we have right now is devops—which, btw, if anyone is reading this comment, we are really bottlenecked on and would love intros to great people.

No, it's that our case for alignment doesn't rest on "the system is only giving advice" as a step. I sketched the actual case in this comment.

Oh, forgot to mention Jonathan Uesato at Deepmind who's also very interested in advancing the ML side of factored cognition.

The things that make submodels easier to align that we’re aiming for:

  • (Inner alignment) Smaller models, making it less likely that there’s scheming happening that we’re not aware of; making the bottom-up interpretability problem easier
  • (Outer alignment) More well-specified tasks, making it easier to generate a lot of in-distribution feedback data; making it easier to do targetted red-teaming

For AGI there isn't much of a distinction between giving advice and taking actions, so this isn't part of our argument for safety in the long run. But in the time between here and AGI it's better to focus on supporting reasoning to help us figure out how to manage this precarious situation.

To clarify, here’s how I’m interpreting your question:

“Most technical alignment work today looks like writing papers or theoretical blog posts and addressing problems we’d expect to see with more powerful AI. It mostly doesn’t try to be useful today. Ought claims to take a product-driven approach to alignment research, simultaneously building Elicit to inform and implement its alignment work.  Why did Ought choose this approach instead of the former?”

First, I think it’s good for the community to take a portfolio approach and for different teams to pursue different approaches. I don’t think there is a single best approach, and a lot of it comes down to the specific problems you’re tackling and team fit.

For Ought, there’s an unusually good fit between our agenda and Elicit the product—our whole approach is built around human-endorsed reasoning steps, and it’s hard to do that without humans who care about good reasoning and actually want to apply it to solve hard problems. If we were working on ELK I doubt we’d be working on a product.

Second, as a team we just like building things. We have better feedback loops this way and the nearer-term impacts of Elicit on improving quality of reasoning in research and beyond provide concrete motivation in addition to the longer-term impacts.

Some other considerations in favor of taking a product-driven approach are:

  • Deployment plans help us choose tasks. We did “pure alignment research” when we ran our initial factored cognition experiments. At the time, choosing the right task felt about as hard as choosing the right mechanism or implementing it correctly. For example, we want to study factored cognition - should we factor reasoning about essays? SAT questions? Movie reviews? Forecasting questions? When experiments failed, it was hard to know whether we could have stripped down the task more to better test the mechanism, or whether the mechanism in fact didn’t solve the problem at hand. Our findings seemed brittle and highly dependent on assumptions about the task, unlikely to hold up in future deployment scenarios. Now that we have a much clearer incremental deployment story in mind we can better think about what research is more or less likely to be useful. FWIW I suspect this challenge of task specification is a pretty underrated obstacle for many alignment researchers.
  • Eventually we’ll have to cross the theory-practice gap. At some point alignment research will likely have to cover the theory-practice gap. There are different ways to do this - we could first develop theoretical foundations, then basic algorithms, then implement them in the real-world, or co-evolve the different parts and propagate constraints between them as we go. I think both ways have pros and cons, but it seems important that some groups pursue the latter, especially in a world with short AI timelines.

Risks with trying to do both are:

  • Balancing multiple stakeholders. Sometimes an impressive research demonstration isn’t actually useful in the short term, or the most useful things for users don’t teach us anything new about alignment. Models are barely capable enough to be acceptable stand-ins for crowd workers, which limits what we can learn; conversely, the best way to solve some product problems could just be to scale up the models. Overall, our product-research dynamic has been net positive and creates virtuous cycles where the product grounds the research and the research improves the product. But it’s a fine line to tread and a tension we have to actively manage. I can easily imagine other teams or agendas where this would be net negative. I imagine we’d also have a harder time making good choices here if we were a for-profit.
  • Being too solution-driven. From a product angle, I sometimes worry that we might over-apply the decomposition hammer. But an important part of our research goal is understanding where decomposition is useful / competitive / necessary so it’s probably fine as long as we course-correct quickly.
Load More