A lot of people want to form inside views on AI Safety, and this is a post about concrete advice on how to get started. I have a lot of hot takes on inside views and ways I think people misframe them, so I'll begin with some thoughts on this - if you just want the concrete advice, I recommend skipping ahead. This post is aimed at people who already have some context on AI Safety + Inside Views and want help making progress; it's not pitched as a totally introductory resource.

Meta Thoughts on Inside Views

Note: Rohin Shah wrote an excellent comment pushing back on this section - go read that too!

This is mostly a compressed version of a previous post of mine: How I Formed My Own Views About AI Safety (and ways that trying to do this was pretty stressful and counterproductive) - go read that post if you want something more in-depth! 

Inside Views are Overrated

First point - I think people often wildly overrate inside views. I think they're important, and worth trying to cultivate (else I wouldn't write this post), but I think less so than many people (especially in the Bay Area) often think. Why?

  • The obvious reason to form inside views is to form truer beliefs - AI X-risk is a weird, controversial and confusing thing, and it's important to have good beliefs on it.
    • I think this is true to an extent, but that in practice inside views tend to feel true a lot more than they are true. When I have an inside view, that just feels like how the world is, it feels deeply true and compelling. But empirically, a lot of people have inside views that are mutually contradictory, and all find their own views compelling. They can't all be right!
    • An alternate framing: There's a bunch of people in the world who are smarter than me and have spent longer thinking about AI Alignment than me. Two possible baselines for an outside view: 
      • Make a list of my top 5 'smart & high-status alignment researchers', and for any question, take a majority vote of what they all think
      • Randomly pick one of the top 5 and believe everything they believe
    • Neither of these baselines is great. But I would predict that either of them actually tracks truth better than the vast majority of inside views - having correct beliefs about controversial and confusing topics is just really hard!
  • Relatedly, it's much more important to understand other people's views than to evaluate them - if I can repeat a full, gears-level model of someone's view back to them in a way that they endorse, that's a lot more valuable than figuring out how much I agree or disagree with their various beliefs and conclusions.
    • I'm a lot more excited about someone who has a good gears-level model of what their top 5 alignment researchers believe and why, than I am about someone who confidently has their own beliefs but a fuzzy model - having several models lets you compare and contrast them, figure out novel predictions, better engage with technical questions, do much better research, etc
  • Forming a "true" inside view - one where you fully understand something from first principles with zero deferring - is wildly impractical. 
    • For example, let's take the question of AI Timelines. Really understanding this requires a deep engagement with diverse topics like economics, AI hardware, international relations, tech financing, deep learning, politics, etc. I'd guess that no one in the world is remotely close to an expert in all of these.
  • People often orient to inside views pretty unhealthily. Some themes I've noticed (especially in myself!):
    • I should get incredibly stressed about this
    • There's one true perspective on AI Alignment and I can find it if I just try hard enough
    • Everyone around me seems confident that AI Safety matters, so they must all have great inside views, so this must be easy and I'm just not trying hard enough
    • It's terrible and a failure to ever defer to anyone on anything
    • I am not allowed to do safety work until I have a crisp and clear inside view, else I'm a terrible person with poor epistemics
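
To make the two outside-view baselines from earlier in this list concrete (a majority vote over a top 5, or a random pick from it), here is a minimal Python sketch. The researcher names and their yes/no answers are invented placeholders, not anyone's real views, and the function names are hypothetical.

```python
import random
from collections import Counter

# Hypothetical placeholder data: each "researcher" gives a yes/no answer
# to a single question. Invented purely for illustration.
views = {
    "researcher_a": True,
    "researcher_b": True,
    "researcher_c": False,
    "researcher_d": True,
    "researcher_e": False,
}


def majority_vote(answers):
    """Baseline 1: believe whatever most of the top 5 believe."""
    counts = Counter(answers.values())
    return counts.most_common(1)[0][0]


def random_pick(answers, rng):
    """Baseline 2: pick one researcher at random and adopt their answer."""
    name = rng.choice(sorted(answers))
    return name, answers[name]


consensus = majority_vote(views)
picked_name, picked_answer = random_pick(views, random.Random(0))
print("majority vote:", consensus)
print("random pick:", picked_name, picked_answer)
```

Neither function involves any understanding of the question, which is the point: they are baselines to compare an inside view against, not a recommended way to form beliefs.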

Thoughts on how to orient to inside views

With all that said, I do think there's a bunch of reasons to care about inside views! You probably can track truth slightly better, it can be very motivating, and it's hard to do good research without a gears-level model in your head about why what you're doing matters. Some general advice for orienting to it healthily:

  • You're allowed to not form an inside view! I know some people who do great safety relevant work, despite not having an inside view. This is great! Focus on your comparative advantage - if you find forming inside views really hard and unmotivating, that's a sign that it's not yours
    • In particular, if you've been trying to form an inside view, and have been getting incredibly stressed about it, I give you full permission to just forget about it and do something else! Doing things that make you really stressed is rarely healthy
  • Forming inside views will happen naturally, and will happen much better alongside actually trying to do things and contribute to safety - you don't form them by locking yourself in your room for months and meditating on safety! 
    • Also, expect this to take a while! One model of PhDs is that they're designed to help someone have an expert-level inside view in some tiny niche of research, and be able to contribute original research. These take 3-6 years! And I think true inside views on AI Safety are much harder
  • Inside views lie on a spectrum. You will never form a "true" inside view, but conversely, not having a true inside view doesn't mean you're failing, or that you shouldn't even try. You want to aim to get closer to having an inside view! And making progress here is great and worthy
  • Aim for domain specific inside views. As an interpretability researcher, it's much more important to me to have an inside view re how to make interpretability progress and how this might interact with AI X-risk, than it is for me to have an inside view on timelines, the worth of conceptual alignment work, etc.

Inside Views = Zooming In

Now for concrete advice! Firstly, I want to present a concrete model for how to think about inside views, and how to make progress on them - inside views are about zooming in. Concretely, in this framework, inside views look like starting with some high-level confusing claim, and then breaking it down into sub-claims, breaking those down into sub-claims, etc.

Example high-level claim: It is valuable to work on reducing AGI x-risk

Breakdown:

  • AGI will happen in the next 50 years (>50% prob)
  • If AGI is created, by default it will likely want to cause x-risk
  • If AGI exists and wants to cause x-risk it will likely succeed
  • There are actions we can take today that will make AGI x-risk less likely

Key features: 

  • We started with a black box, and expanded it into several smaller black boxes. This isn't anything close to a "true" inside view. But it is meaningful progress!
  • We can further make progress by repeatedly expanding subclaims as we learn and think more - there's always a natural next step to learn and think more, and a natural metric for progress
  • These claims are probabilistic - I'm not saying that any are definitely true, just that they are plausible conditioned on previous ones
    • Some are explicitly probabilistic, some implicitly (eg, using 'likely')
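
One way to picture this is as a tree data structure. The sketch below is only an illustration under my own assumptions: the `Claim` class and its method names are hypothetical, and the two sub-claims added in the "zoom in" step are labelled placeholders rather than a real decomposition.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Claim:
    """One node in the tree: a claim plus the sub-claims it expands into."""
    text: str
    sub_claims: List["Claim"] = field(default_factory=list)

    def black_boxes(self):
        """Return the unexpanded leaf claims, i.e. the remaining black boxes."""
        if not self.sub_claims:
            return [self]
        leaves = []
        for sub in self.sub_claims:
            leaves.extend(sub.black_boxes())
        return leaves

    def size(self):
        """Count every claim in the tree, as a crude metric for progress."""
        return 1 + sum(sub.size() for sub in self.sub_claims)


# The example breakdown from the post.
root = Claim(
    "It is valuable to work on reducing AGI x-risk",
    [
        Claim("AGI will happen in the next 50 years (>50% prob)"),
        Claim("If AGI is created, by default it will likely want to cause x-risk"),
        Claim("If AGI exists and wants to cause x-risk it will likely succeed"),
        Claim("There are actions we can take today that will make AGI x-risk less likely"),
    ],
)
print(root.size(), len(root.black_boxes()))  # 5 claims, 4 black boxes

# Zooming in on one black box. These two sub-claims are placeholders only.
root.sub_claims[0].sub_claims.extend(
    [Claim("(placeholder) sub-claim A"), Claim("(placeholder) sub-claim B")]
)
print(root.size(), len(root.black_boxes()))  # 7 claims, 5 black boxes
```

Expanding a claim replaces one black box with several smaller ones, so the tree growing is the "natural metric for progress", even though the number of open black boxes goes up.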

Exercise 1: Practice Zooming In!

Set a 5 minute timer. Then pick a high-level question that feels important to you, and practice expanding it into sub-claims, expanding those, etc. See how far you can get!

(Note: 5 minutes is just a minimum time to ensure you get momentum - feel free to keep going!)

Example Qs:

  • It is valuable to work on reducing AGI X-risk
  • A misaligned AGI could/couldn't cause x-risk
  • We will/won't get AGI by 2070
  • Deep Learning is/isn't sufficient to get AGI without further breakthroughs
  • Inner alignment is/isn't a big deal
  • Reducing AI x-risk is/isn't tractable
  • The world could/couldn't coordinate to not build AGI

Separate Understanding & Evaluation

The second key part of this framework is to separate understanding and evaluating. Separate the act of understanding a belief/model, and the act of deeply engaging with it and figuring out what you believe about it. Both are important, but it's extremely hard to do both at the same time - it's easy to nod along to something that feels obvious and intuitive without really getting it or noticing subtle flaws, and easy to reject a weird idea out of hand without engaging enough to see its merits. But if you first understand, holding aside your personal views, your evaluation will be far more grounded.

Understanding

The goal of understanding is to have a coherent, gears-level model in your head of a perspective or set of beliefs. This can mean a bunch of things! Modelling what a specific researcher believes, trying to model the theory of change for an agenda, steelmanning an argument, understanding a friend's beliefs, etc. The output of understanding will normally look like a tree of arguments, as in my model of inside views, but where the goal is to have the claims in the tree follow from sub-claims and prior claims, not to be fully grounded or be airtight arguments.

You want to be as concrete and gears-level as possible - probing into flaws, noticing errors, or bits that don't quite work, but ideally holding off on properly evaluating it. This is a somewhat subtle distinction - you can't notice and try to fix errors without somewhat evaluating it. But this tends to involve mental moves like:

  • Taking certain assumptions as given and seeing how the model follows from that, even if you don't buy those assumptions. 
  • Trying to steelman an argument and figure out a way to be charitable and make it as strong as you can, rather than trying to disprove the conclusion. 
  • If you find something that seems kinda dodgy, just add 'this claim is actually valid' as an implicit assumption

Ideally you want to be validating this opinion as much as you can - getting feedback from the world on how good your model is. 

Concrete ways of doing this

  • Read something, and then summarise it. There are a lot of things people have written about alignment! Trying to grok the models behind these can be a great start.
    • Summarising is a really key step here, don't skip it! You probably get 10x the value from taking 2x the time. It's so easy to skim through something without a coherent model
  • Talk to someone and paraphrase back to them - ie repeat back your understanding of what they said
    • This is a great thing to do when eg having a career mentoring chat with someone or asking about their research at EAG
    • If you find doing things in the moment hard, try writing up notes after the call! Often if you send them to the person, they'll both be flattered and able to give you good feedback
  • Harder: Pick a claim, sit down with a blank google doc, and try to steelman it - imagine you're in a world where the claim is true, and try to generate the arguments that explain how that world is coherent. 
    • I find using my Inner Simulator is often helpful here
    • I recommend doing this for at least 15 minutes before giving up, even if you feel stuck. It's hard!

Exercise 2

Practice reading and summarising! Take a piece you're interested in, read it while taking rough notes as you go, and then at the end try to distill these down to a coherent narrative. I am a big fan of doing this as a bullet point structure, and trying to distill the post down to a series of key points. Set a 5 minute timer after reading, and spend at least this long summarising (it's fine to take much longer!)

Tips:

  • Track underlying assumptions that drive the post, or underlying models of the world that are used in the reasoning (eg, a mathematician may often assume that AGI is bottlenecked by clever ideas)
  • Post this as a comment on the post - I think this is generally a public service, and you sometimes get good feedback!
  • Try to make your summary as short as you can without losing key information
    • Try saying it aloud to yourself/a rubber duck - this forces me to be briefer
    • Note - short does not mean easy to write! I find shorter things take notably more time and effort (for example, I challenged myself to write this post in an hour - it's not great for brevity!)
  • Try explaining it to a friend and asking for feedback. Or pair up with a friend, get them to read it too, and compare your models
  • Forming a great summary can be hard! It's totally normal to have some parts of the post that feel fuzzy and confusing
    • Sometimes thinking harder can be helpful, sometimes this is a genuine flaw in the post - I recommend leaving a comment about it!

Reading recommendations:

Evaluation

The second stage is to try to actually integrate these models into your inside view, or to figure out that you disagree with them (and importantly, why and how!). It's totally fine if you try evaluating and just feel really confused - these are genuinely hard and confusing questions! But it's important to note that confusion to yourself, and explicitly have it as a black box in your model, rather than lying to yourself and pretending that you really understand it.

Some techniques:

  • Keep zooming in: Taking the tree of arguments (either in your inside view, or in a model you've been trying to understand), pick a claim that feels confusing or particularly interesting to you, and try to expand it - basically doing exactly the same thing as exercise 1 but at a finer level of detail
  • Generate counter-arguments: Pick a sub-claim that feels dodgy to you, or that you disagree with. Set a 5 minute timer, and try to generate as many counter-arguments as you can
    • Aim for quantity over quality - find as many possible counters as you can
    • Then filter for counter-arguments that actually seem legit
    • This can also be great for claims that don't feel dodgy! It's a solid exercise.
  • Resonate: When you notice yourself intuitively disagreeing or flinching from an argument, before you allow yourself to express disagreement, first think through what you agree with in there. Find at least one point you resonate with, at least one aspect of common ground, before you push back
    • Sometimes you fail to find anything, and the perspective just sucks! Sometimes your common ground might be something super vague, like 'we both agree that dying from AI would be bad'
    • But often when I try this, I genuinely get traction on absorbing some novel views, and get more empathy with them than I had before. 
    • This is generally a great move to do in disagreements/debates outside of AI Safety too!

Exercise 3

Set a 5 minute timer. Pick an interesting sub-claim within your inside view tree from exercise 1 or model from exercise 2, and practice zooming in on it.

Exercise 4

Set a 5 minute timer. Pick an interesting sub-claim within your inside view tree from exercise 1 or model from exercise 2, and practice generating counter-arguments for it.

4 comments

(Copied from the EA Forum)

Lots of thoughts on this post:

Value of inside views

Inside Views are Overrated [...]

The obvious reason to form inside views is to form truer beliefs

No? The reason to form inside views is that it enables better research, and I'm surprised this mostly doesn't feature in your post. Quoting past-you:

  • Research quality - Doing good research involves having good intuitions and research taste, sometimes called an inside view, about why the research matters and what’s really going on. This conceptual framework guides the many small decisions and trade-offs you make on a daily basis as a researcher
    • I think this is really important, but it’s worth distinguishing this from ‘is this research agenda ultimately useful’. This is still important in eg pure maths research just for doing good research, and there are areas of AI Safety where you can do ‘good research’ without actually reducing the probability of x-risk.

Quoting myself:

There’s a longstanding debate about whether one should defer to some aggregation of experts (an “outside view”), or try to understand the arguments and come to your own conclusion (an “inside view”). This debate mostly focuses on which method tends to arrive at correct conclusions. I am not taking a stance on this debate; I think it’s mostly irrelevant to the problem of doing good research. Research is typically meant to advance the frontiers of human knowledge; this is not the same goal as arriving at correct conclusions. If you want to advance human knowledge, you’re going to need a detailed inside view.

Let’s say that Alice is an expert in AI alignment, and Bob wants to get into the field, and trusts Alice’s judgment. Bob asks Alice what she thinks is most valuable to work on, and she replies, “probably robustness of neural networks”. What might have happened in Alice’s head?

Alice (hopefully) has a detailed internal model of risks from failures of AI alignment, and a sketch of potential solutions that could help avert those risks. Perhaps one cluster of solutions seems particularly valuable to work on. Then, when Bob asks her what work would be valuable, she has to condense all of the information about her solution sketch into a single word or phrase. While “robustness” might be the closest match, it’s certainly not going to convey all of Alice’s information.

What happens if Bob dives straight into a concrete project to improve robustness? I’d expect the project will improve robustness along some axis that is different from what Alice meant, ultimately rendering the improvement useless for alignment. There are just too many constraints and considerations that Alice is using in rendering her final judgment, that Bob is not aware of.

I think Bob should instead spend some time thinking about how a solution to robustness would mean that AI risk has been meaningfully reduced. Once he has a satisfying answer to that, it makes more sense to start a concrete project on improving robustness. In other words, when doing research, use senior researchers as a tool for deciding what to think about, rather than what to believe.

It’s possible that after all this reflection, Bob concludes that impact regularization is more valuable than robustness. The outside view suggests that Alice is more likely to be correct than Bob, given that she has more experience. If Bob had to bet which of them was correct, he should probably bet on Alice. But that’s not the decision he faces: he has to decide what to work on. His options probably look like:

  1. Work on a concrete project in robustness, which has perhaps a 1% chance of making valuable progress on robustness. The probability of valuable work is low since he does not share Alice’s models about how robustness can help with AI alignment.
  2. Work on a concrete project in impact regularization, which has perhaps a 50% chance of making valuable progress on impact regularization.

It’s probably not the case that progress in robustness is 50x more valuable than progress in impact regularization, and so Bob should go with (2). Hence the advice: build a gearsy, inside-view model of AI risk, and think about that model to find solutions.
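
The arithmetic behind that 50x figure can be spelled out with a short expected-value sketch. The two success probabilities are the ones given in the options above; the 10x value ratio in the second half is a made-up number used only to show the comparison, and the function name is hypothetical.

```python
from fractions import Fraction

# Success probabilities from the two options above.
p_robustness = Fraction(1, 100)  # option 1: "perhaps a 1% chance"
p_impact_reg = Fraction(1, 2)    # option 2: "perhaps a 50% chance"


def expected_value(p_success, value_if_success):
    """Expected value of a project: chance of success times value of success."""
    return p_success * value_if_success


# Option 1 only wins if robustness progress is worth more than this many
# times as much as impact regularization progress.
break_even_ratio = p_impact_reg / p_robustness
print(break_even_ratio)  # 50

# Assumption for illustration: robustness progress is 10x as valuable.
value_impact_reg = 1
value_robustness = 10
ev_option_1 = expected_value(p_robustness, value_robustness)
ev_option_2 = expected_value(p_impact_reg, value_impact_reg)
print(ev_option_1, ev_option_2)  # 1/10 1/2
```

Under these numbers, even a 10x difference in the value of success leaves option 2 ahead by a factor of five.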

(Though I should probably edit that section to also mention that Bob could execute on Alice's research agenda, if Alice is around to mentor him; and that would probably be more directly impactful than either of the other two options.)

Other meta thoughts on inside views

  • Relatedly, it's much more important to understand other people's views than to evaluate them - if I can repeat a full, gears-level model of someone's view back to them in a way that they endorse, that's a lot more valuable than figuring out how much I agree or disagree with their various beliefs and conclusions.
    • [...] having several models lets you compare and contrast them, figure out novel predictions, better engage with technical questions, do much better research, etc


 

I'm having trouble actually visualizing a scenario where Alice understands Bob's views (well enough to make novel predictions that Bob endorses, and say how Bob would update upon seeing various bits of evidence), but Alice is unable to evaluate Bob's view. Do you think this actually happens? Any concrete examples that I can try to visualize?

(Based on later parts of the post maybe you are mostly saying "don't reject an expert's view before you've tried really hard to understand it and make it something that does work", which I roughly agree with.)

Forming a "true" inside view - one where you fully understand something from first principles with zero deferring - is wildly impractical.

Yes, clearly true. I don't think anyone is advocating for this. I would say I have an inside view on bio anchors as a way to predict timelines, but I haven't looked into the data for Moore's Law myself and am deferring to others on that.

People often orient to inside views pretty unhealthily.

:'(

What fraction of people who are trying to build inside views do you think have these problems? (Relevant since I often encourage people to do it)

I know some people who do great safety relevant work, despite not having an inside view.

Hmm, I kind of agree in that there are people without inside views who are working on projects that other people with inside views are mentoring them on. I'm not immediately thinking of examples of people without inside views doing independent research that I would call "great safety relevant work".

(Unless perhaps you're counting e.g. people who do work on forecasting AGI, without having an inside view on how AGI leads to x-risk? I would say they have a domain-specific inside view on forecasting AGI.)

Forming inside views will happen naturally, and will happen much better alongside actually trying to do things and contribute to safety - you don't form them by locking yourself in your room for months and meditating on safety!

Idk, I feel like I formed my inside views by locking myself in my room for months and meditating on safety. This did involve reading things other people wrote, and talking with other junior grad students at CHAI who were also orienting to the problem. But I think it did not involve trying to do things and contributing to safety (I did do some of that but I think that was mostly irrelevant to me developing an inside view).

I do agree that if you work on topic X, you will naturally form an inside view on topic X as you get more experience with it. But in AI safety that would look more like "developing a domain-specific inside view on (say) learning from human feedback and its challenges" rather than an overall view on AI x-risk and how to address it. (In fact it seems like the way to get experience with an overall view on AI x-risk and how to address it is to meditate on it, because you can't just run experiments on AGI.)

Inside views lie on a spectrum. You will never form a "true" inside view, but conversely, not having a true inside view doesn't mean you're failing, or that you shouldn't even try. You want to aim to get closer to having an inside view! And making progress here is great and worthy

Strong +1

Aim for domain specific inside views. As an interpretability researcher, it's much more important to me to have an inside view re how to make interpretability progress and how this might interact with AI X-risk, than it is for me to have an inside view on timelines, the worth of conceptual alignment work, etc.

Yes, once you've decided that you're going to be an interpretability researcher, then you should focus on an interpretability-specific inside view. But "what should I work on" is also an important decision, and benefits from a broader inside view on a variety of topics. (I do agree though that it is a pretty reasonable strategy to just pick a domain based on deference and then only build a domain-specific inside view.)

Concrete advice

inside views are about zooming in. Concretely, in this framework, inside views look like starting with some high-level confusing claim, and then breaking it down into sub-claims, breaking those down into sub-claims, etc.

I agree that this is a decent way to measure your inside view -- like, "how big can you make this zooming-in tree before you hit a claim where you have to defer" is a good metric for "how detailed your inside view is".

I'm less clear on whether this is a good way to build an inside view, because a major source of difficulty for this strategy is in coming up with the right decomposition into sub-claims. Especially in the earlier stages of building an inside view, even your first and second levels of decomposition are going to be bad and will change over time. (For example, even for something like "why work on AI safety", Buck and I have different decompositions.) It does seem more useful once you've got a relatively fleshed out inside view, as a way to extend it further -- at this point I can in fact write out a tree of claims and expect that they will stay mostly the same (at the higher levels) after a few years, and so the leaves that I get to probably are good things to investigate.

Exercises

These seem great and I'd strongly recommend people try them out :)

Thanks for the thoughts, and sorry for dropping the ball on responding to this!

I appreciate the pushback, and broadly agree with most of your points.

In particular, I strongly agree that if you're trying to form the ability to be a research lead in alignment (and less strongly, be an RE/otherwise notably contribute to research) that forming an inside view is important, totally independently from how well it tracks the truth, and agree that I undersold that in my post. 

In part, I think the audience I had in mind is different from you? I see this as partially aimed at proto-alignment researchers, but also a lot of people who are just trying to figure out whether to work on it/how to get into the field, including in less technical roles (policy, ops, community building), where I also have often seen a strong push for inside views. I strongly agree that if someone is actively trying to be an alignment researcher that forming inside views is useful. Though it seems pretty fine to do this on the job/after starting a PhD program, and in parallel to trying to do research under a mentor.

don't reject an expert's view before you've tried really hard to understand it and make it something that does work

I'm pretty happy with this paraphrase of what I mean. Most of what I'm pointing to is using the mental motion of trying to understand things rather than the mental motion of trying to evaluate things. I agree that being literally unable to evaluate would be pretty surprising.

One way that I think it's importantly different is that it feels more comfortable to maintain black boxes when trying to understand something than when trying to evaluate something. Eg, I want to understand why people in the field have short timelines. I get to the point where I see how if I bought scaling laws continuing then everything follows. I am not sure why people believe this, and personally feel pretty confused, but expect other people to be much more informed than me. This feels like an instance where I understand why they hold their view fairly well, and maybe feel comfortable deferring to them, but don't feel like I can really evaluate their view?

What fraction of people who are trying to build inside views do you think have these problems? (Relevant since I often encourage people to do it)

Honestly I'm not sure - I definitely did, and have had some anecdata of people telling me they found my posts/claims extremely useful, or that they found these things pretty stressful, but obviously there's major selection bias. This is also just an objectively hard thing that I think many people find overwhelming (especially when tied to their social identity, status, career plans, etc). I'd guess maybe 40%? I expect framing matters a lot, and that eg pointing people to my posts may help?

I'm not immediately thinking of examples of people without inside views doing independent research that I would call "great safety relevant work".

Agreed, I'd have pretty different advice for people actively trying to do impactful independent research.

Idk, I feel like I formed my inside views by locking myself in my room for months and meditating on safety.

Interesting, thanks for the data point! That's very different from the kinds of things that work well for me (possibly just because I find locking myself in my room for a long time hard and exhausting), and suggests my advice may not generalise that well. Idk, people should do what works for them. I've found that spending time in the field resulted in me being exposed to a lot of different perspectives and research agendas, forming clearer views on how to do research, flaws in different approaches, etc. And all of this has helped me figure out my own views on things. Though I would like to have much better and clearer views than I currently do.

also a lot of people who are just trying to figure out whether to work on it/how to get into the field, including in less technical roles (policy, ops, community building), where I also have often seen a strong push for inside views.

Oh wild. I assumed this must be directed at researchers since obviously they're the ones who most need to form inside views. Might be worth adding a note at the top saying who your audience is.

For that audience I'd endorse something like "they should understand the arguments well enough that they can respond sensibly to novel questions".

One proxy that I've considered previously is "can they describe an experiment (in enough detail that a programmer could go implement it today) that would mechanistically demonstrate a goal-directed agent pursuing some convergent instrumental subgoal".

I think people often call this level of understanding an "inside view", and so I feel like I still endorse what-people-actually-mean, even though it's quantitatively much less understanding than you'd want to actively do research.

(Though it also wouldn't shock me if people were saying "everyone in the less technical roles needs to have a detailed take on exactly which agendas are most promising and why and this take should be robust to criticism from senior AI safety people". I would disagree with that.)

This feels like an instance where I understand why they hold their view fairly well, and maybe feel comfortable deferring to them, but don't feel like I can really evaluate their view?

I would have said you don't understand an aspect of their view, and that's exactly the aspect you can't evaluate. (And then if you try to make a decision, the uncertainty from that aspect propagates into uncertainty about the decision.) But this is mostly semantics.

I'd guess maybe 40%? I expect framing matters a lot, and that eg pointing people to my posts may help?

Thanks, I'll keep that in mind.

I've found that spending time in the field resulted in me being exposed to a lot of different perspectives and research agendas, forming clearer views on how to do research, flaws in different approaches, etc.

Tbc I did all of this too -- by reading a lot of papers and blog posts and thinking about them.

(The main exception is "how to do research", which I think I learned from just practicing doing research + advice from my advisors.)

To give my own take despite it not being much different from Rohin's: The point of an inside view is to generalize, and the flaw of just copying people you respect is that it fails to generalize.

So for all the parts that don't need to generalize - that don't need to be producing thoughts that nobody has ever thought before - deferring to people you respect works fine. For this part I'm totally on board with you - I too think the inside view is overrated.

But I think its overratedness is circumscribed. It's overrated when you're duplicating other people's cognitive work. But it's definitely not overrated when you're doing your own thinking!

The advice that I'd like new people to try (and give me feedback on) is to not worry about being able to do first-principles reasoning about AI safety in its entirety. Pick people you respect and try to go where they're pointing. But once you're there, try to learn what's going on for yourself - generate an inside view centered on solving a particular piece of the problem, with tendrils extending along other directions of your interest.

Such an inside view would look like a combination of technical understanding of the details (e.g. if you want to interpret model-based RL agents, you should understand them and understand the approaches to interpreting them and what algorithms would be used), along with intuitions about the course of history in this sub-field.