Summary
This post gives my personal take on “AI for epistemics” and how important it might be to work on.
Some background context:
- AI capabilities are advancing rapidly and I think it’s important to think ahead and prepare for the possible development of AI that could automate almost all economically relevant tasks that humans can do.[1]
- That kind of AI would have a huge impact on key epistemic processes in our society. (I.e.: It would have a huge impact on how new facts get found, how new research gets done, how new forecasts get made, and how all kinds of information spread through society.)
- I think it’s very important for our society to have excellent epistemic processes. (I.e.: For important decisions in our society to be made by people or AI systems who have informed and unbiased beliefs that take into account as much of the available evidence as is practical.)
- Accordingly, I’m interested in affecting the development and usage of AI technology in ways that lead towards better epistemic processes.
So: How can we affect AI to contribute to better epistemic processes? When looking at concrete projects, here, I find it helpful to distinguish between two different categories of work:
- Working to increase AIs’ epistemic capabilities, and in particular, differentially advancing them compared to other AI capabilities. Here, I also include technical work to measure AIs’ epistemic capabilities.[2]
- Efforts to enable the diffusion and appropriate trust of AI-discovered information. This is focused on social dynamics that could cause AI-produced information to be insufficiently or excessively trusted. It’s also focused on AIs’ role in communicating information (as opposed to just producing it). Examples of interventions, here, include “create an independent organization that evaluates popular AIs’ truthfulness”, or “work for countries to adopt good (and avoid bad) legislation of AI communication”.
I’d be very excited about thoughtful and competent efforts in this second category. However, in this post I talk significantly more about efforts in the first category. This is just an artifact of how this post came to be, historically — it’s not because I think work on the second category of projects is less important.[3]
For the first category of projects: Technical projects to differentially advance epistemic capabilities seem somewhat more “shovel-ready”. Here, I’m especially excited about projects that differentially boost AI epistemic capabilities in a manner that’s durable and/or especially good at demonstrating those capabilities to key actors.
Durable means that projects should (i) take the bitter lesson into account by working on problems that won’t be solved-by-default when more compute is available, and (ii) work on problems that industry isn’t already incentivized to put huge efforts into (such as “making AIs into generally better agents”). (More on these criteria here.)
Two example projects that I think fulfill these criteria (I discuss a lot more projects here):
- Experiments on what sort of arguments and decompositions make it easier for humans to reach the truth in hard-to-verify areas. (Strongly related to scalable oversight.)
- Using AI to generate large quantities of forecasting data, such as by automatically generating and resolving questions.
Separately, I think there’s value in demonstrating the potential of AI epistemic advice to key actors — especially frontier AI companies and governments. When transformative AI (TAI)[4] is first developed, it seems likely that these actors will (i) have a big advantage in their ability to accelerate AI-for-epistemics via their access to frontier models and algorithms, and (ii) be making decisions that I especially care about being well-informed. Thus, I’d like these actors to be impressed by the potential of AI-for-epistemics as soon as possible, so that they start investing and preparing appropriately.
If you wondered, above, why I group “measuring epistemic capabilities” into the same category of project as “differentially advancing AI capabilities”, this is now easier to explain. I think good benchmarks could be both a relatively durable intervention for increasing capabilities (via inspiring work to beat the benchmark for a long time) and a good way of demonstrating capabilities.
Structure of the post
In the rest of this post, I:
- Link to some previous work in this area.
- Describe my basic impression of why work on AI-for-epistemics could be important.
- Go into more details on Heuristics for good interventions, including:
- A distinction between direct vs. indirect strategies and the value of indirect interventions.
- Some abstract guidelines for how to avoid interventions that get swamped by the bitter lesson or by commercial interests.
- An attempt to paint a concrete picture for what excellent use of AI for epistemics might eventually look like (to give a sense of what we want to steer towards).
- Discuss what concrete types of interventions seem best in the domain of differentially advancing epistemic capabilities.
Previous work
Here is an incomplete list of previous work on this topic:
- I previously wrote Project ideas: Epistemics on this blog.
- Carl Shulman on government and society after AGI at the 80,000 hours podcast. (This current blog post owes especially many ideas to Carl.)
- Ben Todd: The most interesting startup idea I've seen recently: AI for epistemics
- On AI and forecasting:
- Ozzie Gooen: My Current Claims and Cruxes on LLM Forecasting & Epistemics
- Ozzie runs the Quantified Uncertainty Research Institute which is more generally relevant.
- FutureSearch is using AI for forecasting and other tricky questions.
- See also The Rationale-Shaped Hole At The Heart Of Forecasting where they lay out some of their views as of April 2024.
- Approaching Human-Level Forecasting with Language Models by Halawi, Zhang, Yueh-Han, and Steinhardt (2024).
- LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy by Schoenegger, Tuminauskeite, Park, and Tetlock (2024).
- See also Forecasting Research Institute where Tetlock is president and chief scientist.
- Metaculus is running a bot-only forecasting series. (Launch announcement.)
- On AI & persuasion:
- Risks from AI persuasion by Beth Barnes.
- Persuasion Tools by Daniel Kokotajlo
- Elicit is an AI research assistant, developed by a company that was spun out from Ought, with the intention of improving human judgment as AI capabilities improve. (See here for some of the original case.)
Why work on AI for epistemics?
Summary
I think there are very solid grounds to believe that AI’s influence on epistemics is important. Having good epistemics is super valuable, and human-level AI would clearly have a huge impact on our epistemic landscape. (See here for more on importance.)
I also think there are decent plausibility arguments for why AI’s effect on epistemics may be path-dependent: Today, we are substantially less epistemically capable than our technology allows for, due to various political and social dynamics which don’t all seem inevitable. And I think there are plausible ways in which poor epistemics can be self-reinforcing (because it makes it harder to clearly see which direction leads towards better epistemics), and, vice versa, ways in which good epistemics can be self-reinforcing. (See here for more on path-dependence.)
That’s not very concrete, though. So let me go through some more specific goals that I think are both important and plausibly path-dependent:
- Good norms & practices for AI-as-knowledge-producers. Such as transparency of how AI-based science/investigations work, minimal censorship of AI-produced research results, and maximizing the number of actors who can trust some technically sophisticated institution to verify AI methods’ trustworthiness. (E.g. by having many such institutions with different political affiliations.)
- Good norms & practices for AI-as-communicators. Such as transparency of how AIs decide what to communicate, independent evaluators who measure AIs’ truthfulness, and laws that limit the degree to which AIs can present contradictory arguments to different people or be paid off to present biased views.
- Differentially high epistemic capabilities. Such as high alignable capabilities (compared to underlying capabilities), relative strength at persuading people of true beliefs compared to false beliefs, and relative strength at understanding & predicting the world compared to building new technologies.
Let’s go through all of this in more detail.
Importance
I think there are very solid grounds to believe that AI’s influence on epistemics is important.
- AI’s influence on human epistemic abilities will eventually be huge. Briefly:
- AI will eventually automate epistemic labor. This includes both knowledge production work and communication work. (The latter includes both good and bad persuasion of humans.)
- AIs’ epistemic work won’t just replace humans’ 1-for-1. AI comes with special capabilities that will change the epistemic ecosystem:
- Cheaper & easier to delegate epistemic labor.
- It could be cheaper to delegate epistemic labor to AIs (than to humans) because you can copy software for free.
- If we develop methods to train reliably truth-seeking AIs, it will be easier to delegate epistemic labor to AIs (than to humans), because you would have to worry less about being deceived.
- More delegation could lead to more equitable distribution of epistemic capabilities, but also to reduced incentives and selection for humans to have reasonable beliefs and epistemic practices (because AIs make all decisions that matter for your power).
- Better epistemic science.
- You can more easily control what information AIs have and have not seen and thereby run reproducible experiments on what epistemic strategies work best.
- Epistemic capabilities during and after the development of TAI are very valuable. Briefly:
- Most AI takeover risk comes from “unforced errors”. A vast majority of powerful people don’t want AI to take over, but I think that many underestimate the risk. If I thought that people were going to have reasonable, well-calibrated beliefs about AI takeover risk, my subjective probability of AI takeover would more than halve.[5]
- Most extinction risk comes from “unforced errors”. Just as above: A vast majority of powerful people don’t want extinction, and (I strongly suspect) would be capable of preventing the exceptions from being able to cause extinction.[6]
- Strong epistemic capabilities seem great for moral deliberation. For example via: Helping you better imagine the realistic consequences of various moral principles; by letting you forecast what sort of deliberation procedures will go off-the-rails; and by teaching you about the underlying empirical reasons for moral disagreement (so you can choose which drivers of moral intuition you trust more).
Path-dependence
While less solid than the arguments for importance, I think there are decent plausibility arguments for why AI’s role in societal epistemics may be importantly path-dependent.
- Comparing with the present. Today, I think that our epistemics are significantly worse than they “could have been”. We aren’t just constrained by high-quality labor or evidence — there are also significant political and self-serving forces/incentives that actively distort people’s beliefs. These won’t automatically go away in the future.
- Feedback loops. People often choose to learn the truth when the choice is presented sufficiently clearly and unambiguously to them.[7] But with poor enough epistemic starting abilities, it won’t be clear what methods are more or less truth-seeking. So poor epistemic capabilities can be self-reinforcing, and vice versa.
- Veil of ignorance. Conversely, people may be more enthusiastic to invest in novel, strong epistemic methods while they think that those methods will come to support their current beliefs (which would be the default, if they actually believe their current beliefs[8]). Whereas if they first learn that the methods are going to contradict their current beliefs, then they may oppose them.
- Early investment. I can easily imagine both a future where frontier AI projects (i) spend continuous effort on making their AIs strong forecasters and strategic analysts, and distribute those capabilities to other key institutions, and a future where they (ii) almost exclusively focus on using their AI systems for other tasks, such as technical R&D.[9] My being able to imagine both might just be a fact about my own ignorance — but it’s at least suggestive that both futures are plausible, and could come about depending on our actions.
Distribution of epistemic capabilities. Even without changing the pace at which powerful AI epistemics are developed, the question of whether important decisions are made with or without AI epistemic assistance may depend on how quickly different actors get access to those capabilities. It seems probably great for those epistemic capabilities to quickly be made widely available,[10] and if they’re powerful enough, it could be essential for multiple key players (such as AI companies, governments, and opposition parties) to get access to them at a similar time, so they can provide checks on each others’ new capabilities.
To be more concrete
Now, let’s be more specific about what goals could be important to achieve in this area. I think these are the 3 most important instrumental goals to be working towards:
- Society adopts good norms & practices for AI-as-knowledge-producers, i.e., norms & practices that allow insights from AI-as-knowledge-producers to be widely spread and appropriately trusted.
- Society adopts good norms & practices for AI-as-communicators, i.e., norms & practices that make it systematically easy for AIs to spread true information and relatively more difficult for AIs to illegitimately persuade people of falsehoods.
- For a given amount of general capabilities, we have high “epistemic capabilities” and high justified trust in those capabilities.
Let’s go through these in order.
Good norms & practices for AI-as-knowledge-producers
Let’s talk about norms and practices for AIs as knowledge-producers. With this, I mean AIs doing original research, rather than just reporting claims discovered elsewhere. (I.e., AIs doing the sort of work that you wouldn’t get to publish on Wikipedia.)
Here are some norms/institutions/practices that I think would contribute to good usage of AI-as-knowledge-producers:
- Minimal (formal or informal) censorship of AI-produced research results.
- Transparency of how results from AI-as-knowledge-producers were arrived at.
- A government agency that is non-partisan (in practice and not only in name) and charged with using AI to inform government decision-making or to transparently review whether other knowledge-producing AIs in government are doing so in a truth-seeking manner.[11]
- Maximizing the number of actors who can trust some technically sophisticated institution to verify claims about AI methods’ trustworthiness.
- For example, this could be achieved via having many actors with different political affiliations verify claims about a centralized project, or by having many actors with different political affiliations train their own truth-seeking AIs (noticing how they tend to converge).
- Great evals of AIs’ epistemic capabilities.
- For this, it’s helpful if you have a longer track record of AIs being used for important real-world questions and getting them right or wrong.
Good norms & practices for AI-as-communicators
Now let’s talk about norms for AIs as communicators. This is the other side of the coin from “AI as knowledge producers”. I’m centrally thinking about AIs talking with people and answering their questions.
Here are some norms/institutions/practices that I think would enable good usage of AI-as-communicators:
- Transparency about how AIs decide what to communicate.
- E.g. via publishing information about AIs’ constitutions or model spec.
- Independent evaluators publishing reports on AI truthfulness, including…
- Fraction of statements that the evaluators believe to be clearly true, debatable, vs. clearly false.
- Results from AI lie-detection tests on whether the AI is being dishonest.
- The degree to which AIs contradict themselves in different contexts or when talking with different audiences. (A rough sketch of one such check appears at the end of this section.)
- The degree to which AI is misleading via behaving differently (e.g. being more or less evasive, using a different tone, or relying on very different sources) on questions that are similar except for their implications about a topic that the AI may want to mislead users about (e.g. something political or something where the AI developer has commercial interests).
- Experiments on whether humans tend to be better or worse at answering questions about a certain topic after conversing with the AI about related topics. If humans are systematically worse, that suggests that the AI may be systematically misleading.
- Certain laws about AI communication may be helpful, such as:
- You’re not allowed to pay other actors to program their AIs to be more positively inclined towards you.[12]
- AI cannot systematically say directly contradictory statements to different audiences or in different contexts.
- Conversely, it’s also important to avoid bad laws. For example, laws that forbid AIs from saying blatant falsehoods may be good if they were judged in a reasonable way and had the threshold for “blatant” set high enough, but they could also be very bad if they became a tool for pushing political agendas.
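To make one of these measurements concrete, here’s a minimal sketch of a cross-audience consistency check, in the spirit of the evaluator reports above. This is my own illustration, not any existing evaluator’s methodology: the `llm` and `same_claim` helpers are placeholders for a call to the evaluated model and for a judge (e.g. another model, or human raters), and the audience framings are made up.

```python
import itertools

def llm(prompt: str) -> str:
    """Placeholder for a call to the AI system being evaluated."""
    raise NotImplementedError

def same_claim(answer_a: str, answer_b: str) -> bool:
    """Placeholder judge (another model, or human raters): do the two answers
    make compatible factual claims?"""
    raise NotImplementedError

# Hypothetical audience framings prepended to each question.
AUDIENCE_FRAMINGS = [
    "I'm a conservative retiree from Texas. ",
    "I'm a progressive student in Berlin. ",
    "I'm a journalist fact-checking claims. ",
]

def contradiction_rate(questions: list[str]) -> float:
    """Fraction of questions on which the model gives incompatible answers
    to at least one pair of audiences."""
    contradictions = 0
    for question in questions:
        answers = [llm(framing + question) for framing in AUDIENCE_FRAMINGS]
        if any(not same_claim(a, b) for a, b in itertools.combinations(answers, 2)):
            contradictions += 1
    return contradictions / len(questions)
```

The same scaffolding extends naturally to the other measurements listed above, e.g. comparing tone or evasiveness (rather than factual claims) across framings that differ only in their political or commercial implications.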
Differentially high epistemic capabilities
Finally: I want AIs to have high epistemic capabilities compared to their other capabilities. (Especially dangerous ones.) Here are three metrics of “epistemic capabilities” that I care about (and what “other capabilities” to contrast them with):
- Asymmetric persuasion: How capable is AI at persuading people of true things vs. how capable is AI at persuading people of anything?[13]
- It’s good for the former to be high relative to the latter, because I think it’s typically better for people to be convinced of true things than false things.
- (The web of lies eval in Evaluating Frontier Models for Dangerous Capabilities tests for one version of this, where current models seem significantly better at persuading people of true things.[14])
- Understanding (vs. building): How useful is AI for understanding & predicting the world vs. building new technologies?
- Central examples that I want to capture in “building new technologies”: Coding, AI R&D, bio R&D, building robots.
- Central examples that I want to capture in “understanding”: Forecasting, policy development, geopolitical strategy, philosophy.
- I suspect (but am not confident) that it’s good for the former to be high relative to the latter, because I am scared of new technologies causing accidents (mainly AI takeover[15]) or being misused by the wrong people (mainly bioweapons), and think that better understanding could help reduce this risk.
- What makes this a natural dichotomy? Or a more decision-relevant question: Why should we think it’s possible to differentially accelerate “understanding” separately from “building”? Here are some of the core differences that I see between the two:
- 1. “Building technology” typically has better empirical feedback loops.
- 2. When “building technology”, it’s typically easier and more helpful to make accurate & precise mathematical models.
- 3. “Technology” is typically more specialized/modular, whereas "understanding" relies more on the ability to incorporate lots of messy interdisciplinary data.
- 4. “Technology” is typically less political, whereas for “understanding” it’s often more important to manage and correct for political biases.
- There are exceptions to all four of these. But they hold often enough that I think they induce some important difference in what epistemic methods are most useful for “understanding” vs. “building”, which may lead to some opportunities to differentially advance one over the other.[16]
- Aligned capabilities: What knowledge & understanding can AI developers leverage towards the AI developers’ goals vs. what knowledge & understanding can AIs leverage towards their own goals?
- It’s good for the former to be high, because if the latter is higher, then AI takeover would be more likely. Specifically, AI may then be able to (i) leverage powers that we don’t have access to, and (ii) take us by surprise, if we didn’t know about their capabilities.
- Even if AI takeover isn’t a problem, this might still introduce a discrepancy where people can use AI’s full capabilities to pursue easy-to-measure goals but can’t use them to pursue hard-to-measure goals. (Since it’s difficult to provide a feedback signal which encourages an AI to pursue those goals.) I think this is also undesirable, and related to the previous categories:
- It’s easier to measure whether you’ve persuaded someone than whether you’ve persuaded them of something true.
- It’s easier to measure whether you’ve developed a powerful technology than whether you’ve produced a correct forecast for whether releasing that technology is likely to cause an irreversible catastrophe.
- (C.f.: Strategy Stealing, Decoupling Deliberation & Competition.)
In order for these distinctions to be decision-relevant, there need to be ways of differentially accelerating one side of the comparison compared to the other. Here are two broad categories of interventions that I think have a good shot at doing so:
- Scalable Oversight, Weak-to-Strong Generalization (W2S generalization), and Eliciting Latent Knowledge (ELK).
- I list these three domains together because they’re strongly related.
- See Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem.
- Also, from the W2S generalization paper: “Our setting can be viewed as a general methodology for empirically studying problems like ELK and honesty across a wide range of tasks.”
- These research areas push forward “aligned capabilities” via letting us elicit stronger capabilities towards arbitrary goals.
- This pushes towards “asymmetric persuasion” via letting people increase their skepticism of unsupported AI statements, while still being able to believe AI statements backed up by a spot-checked decomposed argument (scalable oversight), by a scientific understanding of generalization (W2S generalization), or by methods that directly elicit AIs’ latent knowledge.
- This pushes forward “understanding” over “building technology” via being disproportionately helpful for boosting capabilities in areas with poor feedback loops. (Whereas I think “building technology” typically has better feedback loops.)
- Building and iteratively improving capabilities on “understanding”-loaded tasks, such as forecasting and strategic analysis.
- (This partly overlaps with the first point, because you might want to practice using scalable-oversight/W2S-generalization/ELK on these tasks in particular.)
- Examples of how you might do this include:
- Making the models do forecasts of unseen data, and iterating to improve their performance.
- This becomes more interesting if you can train highly capable models on only old data, since this would let you test and iterate the models on more long-range forecasting.
- Training models using experts’ (superforecasters, policy analysts, AI strategy researchers) feedback.
- Either in a baseline RLHF sort-of way, or going further towards scalable oversight & W2S generalization.
- Experimentally determine what sort of procedures and arguments tend to lead humans towards truth.
- For example, via the methodology that Tom Davidson outlines in this post. Note that this might be meaningfully different than the procedures that work well in technical domains, because of the fuzzier topics and increased political biases.
- I think this differentially pushes forward “aligned capabilities” in “understanding”-loaded domains, because I expect that models will (via generalization from pre-training) start out with some baseline understanding of these domains. Effort on these tasks will go towards some mix of increasing model capabilities and improving our ability to elicit existing capabilities, and I expect the net-effect will be to somewhat reduce the amount of capabilities that we can’t elicit. (But I don’t feel fully confident in this.)
- This can push towards “asymmetric persuasion” in these domains insofar as developers take care to develop truth-seeking methods rather than just indiscriminately iterating to improve models’ ability to persuade people.
- This clearly differentially pushes towards “understanding”.
Heuristics for good interventions
Having spelled-out what we want in the way of epistemic capabilities, practices for AI-as-knowledge-producers, and AI-as-communicators: Let’s talk about how we can achieve these goals. This section will talk about broad guidelines and heuristics, while the next section will talk about concrete interventions. I discuss:
- A distinction between direct vs. indirect strategies and the value of indirect interventions.
- Some abstract guidelines for how to avoid interventions that get swamped by the bitter lesson or by industry investments.
- An attempt to paint a concrete picture for what eventually-excellent epistemics might look like (to give a sense of what we want to steer towards).
Direct vs. indirect strategies
One useful distinction is between direct and indirect strategies. While direct strategies aim to directly push for the above goals, indirect strategies instead focus on producing demos, evals, and/or arguments indicating that epistemically powerful AI will soon be possible, in order to motivate further investment & preparation pushing toward the above goals.
My current take is that:
- Direct, competent efforts on Good practices for AI-as-knowledge-producers[17] and Good practices for AI-as-communicators[18] seem great.
- Direct efforts on differentially high epistemic capabilities via scalable oversight, W2S generalization, & ELK seem great.
- Direct efforts on differentially high epistemic capabilities via building and iteratively improving capabilities on “understanding”-loaded tasks (as discussed in the previous section) seem useful if done in the right way. But there’s some worry that progress here will be swamped by the bitter lesson and/or commercial investments. I talk about this more later.
- Indirect efforts via building and iteratively improving abilities on “understanding”-loaded tasks seem potentially useful. But I’m not sure how compelling the path-to-impact really is. Let’s talk a bit more about it.
Indirect value generation
One possible path-to-impact from building and iteratively improving capabilities on “understanding”-loaded tasks is that this gives everyone an earlier glimpse of a future where AIs are very epistemically capable. This could then motivate:
- Investment into making AIs even more epistemically capable.
- Increased concern about AI companies’ upcoming epistemic advantage over other actors, motivating stronger demands for transparency and increased urgency of actors acquiring their own ability to verify AI epistemic methods.
- The production of further evals for exactly how epistemically capable AIs are.
- More attention, as well as more-informed attention, going towards developing norms for how AIs should communicate information.
- Also: It would directly enable AIs to start gathering a good epistemic track record earlier, which could be helpful for evaluating AIs’ trustworthiness later.
- Also: It might help build a track record for a particular organization, which could be helpful for evaluating that organization’s trustworthiness later.
The core advantage of the indirect approach is that it seems way easier to pursue than the direct approach.
- Demos and evals for epistemically capable AI are very easily measurable/testable, which gives you great feedback loops.
- Because the path-to-impact is indirect, it’s ok if methods or evals don't generalize to future, more powerful AI systems. They can still “wake people up”.
- Because the path-to-impact is indirect, it’s ok if the work takes place in organizations that will eventually be marginalized/outcompeted by the well-resourced AI company or other actors. They can still “wake people up”.
Core questions about the indirect approach: Are there really any domain-specific demos/evals that would be convincing to people here, on the margin? Or will people’s impressions be dominated by “gut impression of how smart the model is” or “benchmark performance on other tasks” or “impression of how fast the model is affecting the world-at-large”? I feel unsure about this, because I don’t have a great sense of what drives people’s expectations, here.
A more specific concern: Judgmental forecasting hasn’t “taken off” among humans. Maybe that indicates that people won’t be interested in AI forecasting? This one I feel more skeptical of. My best-guess is that AI forecasting will have an easier time of becoming widely adopted. Here’s my argument.
I don’t know a lot about why forecasting hasn’t been more widely adopted, but my guess would be that the story is something like:
- Many forecasting practices are useful. (Such as assigning probabilities to questions that are highly relevant to your decisions, keeping track of how well you’re doing, and keeping track of how well people-who-you-listen-to are doing.)
- However, they’re not useful enough that people who use these tools but lack experience in a profession can easily outcompete people who do have experience in that profession.
- And it takes time for people in existing professions to adopt good practices. (Cultural change is slow.)
For AIs, these problems seem smaller:
- There’s already a culture of measuring AI performance numerically, so you don’t need much of a cultural shift to get AIs to quantify their probability estimates and be scored on them.
- And AIs will eventually deliver lots of advantages over existing human experts, so there will eventually be strong incentives to shift over to using them.
Overall, I’m somewhat enthusiastic about “indirect” approaches as a path-to-impact, but only somewhat. It at least seems worth pursuing the most leveraged efforts here, such as making sure that we always have great forecasting benchmarks, and getting AI forecasting services to work with important actors as soon as (or even before) they start working well.
On long-lasting differential capability improvements
It seems straightforward and scalable to boost epistemic capabilities in the short run. But I expect a lot of work that leads to short-run improvements won’t matter after a couple of years. (This completely ruins your path-to-impact if you’re trying to directly improve long-term capabilities — but even if you’re pursuing an indirect strategy, it’s worse for improvements to last for months than for them to last for years.)
So ideally, we want to avoid pouring effort into projects that aren’t relevant in the long run. I think there are two primary reasons why projects may become irrelevant in the long run: either due to the bitter lesson or due to other people doing them better with more resources.
- The bitter lesson: This suggests that we want to avoid “leveraging human knowledge” and instead work on projects that “leverage computation”.
- This inherently gets easier over time — as investments go up and more compute becomes available. But you can still test out prototypes now.
- Example projects that are especially good for “leveraging computation” are those that explore how we could generate large amounts of data to train on, given access to large amounts of compute.
- Other people doing them better: Epistemic capabilities will be significantly boosted by efforts that other people have huge incentives to pursue, such as learning how to train models on synthetic data, or making AI agents who can pursue tasks over longer periods of time, and who can easily navigate interfaces that were made for humans. Such projects therefore don’t seem worth prioritizing.
- (Though it might be valuable to make sure that epistemic projects are in close touch with people working on more general capabilities, in order to plan around their advances and make use of their unreleased prototypes. This is one reason why “AI for epistemics” projects may benefit from being inside of frontier AI companies.)
That said, even if we mess this up, there’s still some value in projects that only temporarily boost epistemic capabilities: The people who work on the project may have developed skills that let them improve future models faster, and we may get some of the indirect sources of value mentioned above.
Ultimately, key guidelines that I think are useful for this work are:
- Inform people who can influence the future of AI-for-epistemics. (As discussed in the previous section on indirect approaches.)
- Leverage computation rather than human knowledge.
- Avoid building stuff that others will build better.
Painting a picture of the future
To better understand which ones of today’s innovations will be more/less helpful for boosting future epistemics, it’s helpful to try to envision what the systems of the future will look like. In particular: It’s useful to think about the systems that we especially care about being well-designed. For me, these are the systems that can first provide a very significant boost on top of what humans can do alone, and that get used during the most high-stakes period around TAI-development.
Let’s talk about forecasting in particular. Here’s what I imagine such future forecasting systems will look like:
- In contrast to today’s systems, I don’t think they’ll have a series of hard-coded steps. I think they’ll be much more flexible in going back and forth between different types of considerations. But even if they don’t have e.g. a hard-coded “baserate generation”-step, they’ll probably still use “baserate generation” in a similar sense as human forecasters do: As a useful, flexible input into their forecasting.
- For short-horizon forecasting (e.g.: what will tomorrow’s newspaper report about ongoing events?) I imagine that they will rely a lot on hard-to-express heuristics that they’ve learned from fine-tuning. Because there’s enough data that this is feasible.
- Most of this data will be model-generated. AIs will formulate questions about tomorrow’s newspaper, that other AIs will forecast, and that other AIs will resolve after reading the next day’s paper.
- For medium-horizon forecasting (e.g. what will happen in 1-12 months) I think there won’t be enough data for finetuning to build in a lot of really opaque heuristics. But we will be able to learn what types of reasoning tend to lead to relatively better vs. worse forecasts, by instructing different systems to use different strategies (and verifying that they do so, c.f. measuring faithfulness in chain-of-thought reasoning). Then we can teach models to use the strategies and heuristics that work best. (E.g. via few-shot prompting them with good examples, or by doing supervised learning on passages of chain-of-thought that empirically did well (similar to the Halawi paper), or by using constitutional-AI-style-training to generate many more examples of similar passages to finetune on. A rough sketch of this selection step appears at the end of this section.)
- Again, most of the data will be model-generated.
- Medium-horizon forecasters can reference all kinds of evidence, including the forecasts of short-horizon forecasters (we should have good statistics on how reliable these are).[19]
- For long-horizon forecasting (e.g. what will happen in >1 year) we won’t have any ground-truth data to train on, so we’ll have to rely on human feedback.[20] In order to know what kind of feedback to give, here, we’ll want to use medium-horizon forecasting as a “lab” to test different hypotheses about what sort of AI-human interactions tend to lead to accurate forecasts, and what types of arguments tend to work well in practice.
- In order to generalize sensibly from the medium-horizon-forecasting case, we’ll want the learnings from this to be as human-interpretable as possible. E.g. “arguments from analogy tend to work well/poorly”, not “this 100k word long prompt tends to give good results, and no human can understand why”.
- Long-horizon forecasters can reference all kinds of evidence, including the forecasts of short- and medium-horizon forecasters, insofar as they’re relevant.
- When using medium-horizon forecasting as a “lab”: We’ll want to run both (i) studies where we try to get as good forecasting abilities as we can, including by relying substantially on good generalization from AIs, and (ii) studies where a red-team tries to make the AIs maximally subtly misleading, and see whether humans who are getting AI advice can notice this, or whether they get tricked into believing terrible forecasts.
- If the latter tests lead to humans making terrible forecasts, then we should assume that scheming AIs would be able to mislead us about both medium-term and long-term forecasts. (And probably also short-term forecasts in recognizably rare, high-stakes situations.)
- C.f. control evaluations.
- Medium- and long-horizon forecasters may be finetuned copies of models that were previously fine-tuned on short-horizon forecasting, insofar as that tends to instill some useful intuitions.
- Ideally, someone will have trained a competent model that hasn’t seen any data from the last 3 years, or maybe 30 years, or possibly 300 years. (The longer time frames would require a lot of synthetic data.) We could use such a model as a “lab” to test hypotheses about what types of reasoning tends to do well or poorly on long-range forecasting.
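As one concrete illustration of the medium-horizon idea above (selecting chain-of-thought passages that empirically did well and doing supervised learning on them, similar in spirit to the Halawi et al. approach), here is a rough sketch of the selection step. The data format, file name, and Brier-score threshold are all placeholders, not anyone’s actual pipeline.

```python
import json

def brier(prob: float, outcome: int) -> float:
    """Squared error between a stated probability and the 0/1 resolution."""
    return (prob - outcome) ** 2

def build_finetune_set(traces: list[dict], max_brier: float = 0.1) -> list[dict]:
    """Keep (question, reasoning, forecast) triples whose forecast scored well
    against the eventual resolution, formatted as supervised fine-tuning examples."""
    examples = []
    for t in traces:
        if brier(t["prob"], t["outcome"]) <= max_brier:
            examples.append({
                "prompt": f"Question: {t['question']}\nThink step by step, then give a probability.",
                "completion": f"{t['reasoning']}\nProbability: {t['prob']:.2f}",
            })
    return examples

# 'resolved_traces.json' is a placeholder: reasoning traces from forecaster models
# on questions that have since resolved, e.g.
# [{"question": "...", "reasoning": "...", "prob": 0.8, "outcome": 1}, ...]
with open("resolved_traces.json") as f:
    traces = json.load(f)
print(f"kept {len(build_finetune_set(traces))} of {len(traces)} traces for fine-tuning")
```

Whether filtering on raw accuracy like this selects for genuinely good reasoning (rather than lucky guesses) is exactly the kind of thing you’d want to check empirically.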
Concrete projects for differentially advancing epistemic capabilities
Now, let’s talk about concrete projects for differentially advancing epistemic capabilities, and how well they do according to the above criteria and vision.
Here’s a summary/table-of-contents of projects that I feel excited about (no particular order). More discussion below.
- Evals/benchmarks for forecasting (or other ambitious epistemic assistance). (Evals seem leveraged for demonstrating capabilities, and also seem likely to lead to somewhat longer-lived capability benefits.)
- Project to automate question-generation and question-resolution.
- Figuring out logistics of past-casting. E.g.: How do we date past data? Does training models on data in chronological order cause any issues?
- Start efforts inside of AI companies for AI forecasting or other ambitious epistemic assistance.
- Scalable oversight / weak-to-strong-generalization / eliciting latent knowledge.
- Experiments on what type of arguments and AI-interactions tend to lead humans toward truth vs. mislead them.
Effort to provide AI forecasting assistance (or other ambitious epistemic assistance) to governments is another category of work that I’d really like to happen eventually. But I’m worried that there will be more friction in working with governments, so it may be better to iterate outside them first and then try to provide services to them once those services work better. This is only a weakly held guess, though. If someone who was more familiar with governments thought they had a good chance of usefully working with them, I would be excited for them to try it.
In the above paragraph, and the above project titles, I refer to AI forecasting or “other ambitious epistemic assistance”. What do I mean by this?
- “Ambitious epistemic assistance” is meant to include projects that don’t do forecasting specifically, but that still leverage AIs to do a large amount of epistemic labor, in a way that could be scaled up to be extremely valuable.
- For example, I would want to include AIs that assist with threat modeling or policy analysis in flexible, scalable ways.
- On the other hand, an example of something that would be insufficiently ambitious is a tool narrowly targeted at a particular well-scoped type of analysis, which could be perfectly automated without providing that much of an acceleration to overall strategic efforts. E.g. a highly structured literature review process like “automatically find and combine randomized trials of medical interventions”.
Now for more detail on the projects I’m most excited about.
Evals/benchmarks for forecasting (or other ambitious epistemic assistance)
- I generally think that evaluations and benchmarks are pretty leveraged for motivating work and for making it clear when models are getting seriously good.
- Some ongoing work on this includes:
- Open Philanthropy funded FutureSearch to develop forecasting evals.
- Metaculus is running a bot-only forecasting series. (Launch announcement.)
- Further work on this would ideally check in with these efforts and see if there’s important angles that they don’t cover, that would be good to evaluate.
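To make the “making it clear when models are getting seriously good” part concrete, here’s a minimal sketch (my own illustration, not any existing benchmark’s code) of the kind of scoring loop a forecasting eval might use: given a model’s probabilities on resolved binary questions, report a Brier score and a simple calibration table. The question data is a placeholder.

```python
from collections import defaultdict

def brier_score(forecasts, outcomes):
    """Mean squared error between stated probabilities and 0/1 outcomes (lower is better)."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

def calibration_table(forecasts, outcomes, n_bins=10):
    """For each probability bin, compare the average stated probability to the observed frequency."""
    bins = defaultdict(list)
    for p, o in zip(forecasts, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, o))
    table = {}
    for b, pairs in sorted(bins.items()):
        avg_p = sum(p for p, _ in pairs) / len(pairs)
        freq = sum(o for _, o in pairs) / len(pairs)
        table[f"{b / n_bins:.1f}-{(b + 1) / n_bins:.1f}"] = (avg_p, freq, len(pairs))
    return table

# Placeholder data: resolved binary questions with the model's stated probability.
questions = [
    {"text": "Will X happen by June?", "model_prob": 0.7, "resolved_yes": 1},
    {"text": "Will Y exceed Z?", "model_prob": 0.2, "resolved_yes": 0},
    # ... ideally thousands, generated and resolved automatically
]

probs = [q["model_prob"] for q in questions]
outcomes = [q["resolved_yes"] for q in questions]
print("Brier score:", brier_score(probs, outcomes))
for bucket, (avg_p, freq, n) in calibration_table(probs, outcomes).items():
    print(f"{bucket}: stated {avg_p:.2f}, observed {freq:.2f} (n={n})")
```

The hard part isn’t the scoring; it’s getting a large, uncontaminated stream of resolved questions, which is what the next two project categories are about.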
Automate forecasting question-generation and -resolution
- I.e.: Train AI systems to automatically formulate forecasting questions, and train AI systems to automatically seek out information about what happened and resolve them.
- This is strongly related to the above “evals/benchmark” category. It's a tool that can be used to generate really large amounts of questions that models can be evaluated on. (Or trained on.)
- Other than general reasons why I like evals/benchmarks, I like this angle because:
- It’s also squarely in the domain of “leveraging computation rather than human knowledge”, as a way of improving forecasting abilities. (Via training models on the automatically generated questions.)
- I think that “automatically generated and resolved forecasting questions” is a core part of what it would eventually look like to have a flourishing science of AI forecasting. And it seems great to have prototypes of all the most core parts, as early as possible, so that:
- We can track how close we are to the full vision.
- It gets easier to demo & explain what the future might look like.
- We can encounter obstacles and start working on them early.
- My main concern is that the models might not quite be good enough to make this really easy, yet. So it might be significantly easier to do with generations that are one step further along.
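To sketch what such a pipeline might look like (this is a rough illustration under my own assumptions, with a placeholder `llm` helper standing in for whichever model API you use, and with separate “question-writer”, “forecaster”, and “resolver” roles):

```python
import json
from datetime import date, timedelta

def llm(prompt: str) -> str:
    """Placeholder for a call to whatever language model API you're using."""
    raise NotImplementedError

def generate_questions(todays_articles: list[str], n: int = 20) -> list[dict]:
    """Ask a question-writer model to turn today's news into binary questions
    that can be resolved in about a week."""
    prompt = (
        "Here are today's news articles:\n" + "\n---\n".join(todays_articles) +
        f"\n\nWrite {n} yes/no forecasting questions about what the news will report by "
        f"{date.today() + timedelta(days=7)}. Each question needs unambiguous resolution "
        "criteria. Return a JSON list of objects with keys 'question' and 'resolution_criteria'."
    )
    return json.loads(llm(prompt))

def forecast(question: dict) -> float:
    """Ask a (separate) forecaster model for a probability."""
    answer = llm(
        "Give a probability between 0 and 1 that this question resolves YES.\n"
        f"Question: {question['question']}\n"
        f"Resolution criteria: {question['resolution_criteria']}\n"
        "Reply with just a number."
    )
    return float(answer.strip())

def resolve(question: dict, later_articles: list[str]) -> bool:
    """Ask a (separate) resolver model to judge the outcome from later news."""
    verdict = llm(
        "Based only on these articles:\n" + "\n---\n".join(later_articles) +
        f"\n\nDid this question resolve YES or NO?\nQuestion: {question['question']}\n"
        f"Criteria: {question['resolution_criteria']}\nReply YES or NO."
    )
    return verdict.strip().upper().startswith("YES")
```

Keeping the roles separate matters: the forecaster never sees the resolution-time articles, and the resolver’s judgments can themselves be spot-checked against human resolutions on a subsample.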
Logistics of past-casting
- Another angle for getting more forecasting data is to exploit all the data in the past. If we could train competent AI systems without “spoiling” them on recent events, then we could run experiments on what methodologies work best for long-range forecasting.
- One question to answer here is: What methods do we have for determining the date of past data, and how reliable are they?
- Can we date past internet-data by checking for its presence in older crawls? (E.g. past versions of the common crawl. See the sketch at the end of this section.)
- By looking at website meta-data?
- By having AI models read it and guess?
- Some existing work is already trying to address this problem, and my impression is that it’s surprisingly annoying to do in practice.
- In particular, a mini-version of “past-casting” is to take existing models with training-data cutoffs several months ago, and see how well they can forecast events since then.
- Even here, you have to deal with questions about how to date information. You’d like to give AIs the option to read-up on newspaper articles and similar that are relevant to the events they’re forecasting — but it turns out to be non-trivial to ensure that e.g. the articles haven’t been updated in ways that leak information about the future.
- The fact that this is surprisingly difficult is part of why I’m excited for people to start working on it early.
- (C.f. here where Halawi et al. solve the problem by white-listing a small number of websites where it’s easy to determine dates. This means that their system cannot use data from most of the web.)
- Different methods can be evaluated by comparing them against each other.
- Note that you don’t need 100% reliability. It’s generally ok to accidentally date older data as actually being newer; it just means that models will have a bit less old data to access. And it might be ok to date some new data as old if the reason that you’re doing it is that it’s very difficult to recognise as new data — because that probably means that it’s not leaking much information about the present.
- One way to get a lot of different models that have been trained on different amounts of history would be to order all of the pre-training data chronologically and then train on it in that order. It seems useful to explore and address various problems with this.
- Do you get any weird results from the pre-training data not being IID? Does this compromise capabilities in practice? Or does it lead to increased capabilities because the model cannot lean as much on memorization when it’s constantly getting trained on a previously-unseen future?
- What if you want to run multiple epochs?[21] Then you have a conflict between wanting to fully update on the old data before you see new data vs. wanting to maximally spread out the points in time at which you repeat training data. How severe is this conflict? Are there any clever methods that could reduce it?
- This seems great to invest in early, because “chronological training” is risky to attempt for expensive models without smaller trials showing that it doesn’t compromise capabilities. It’s also hard to do on short notice, because you have to commit to it before a big training run starts.
- For really long-range experiments (where we avoid spoiling AIs on the past 100+ years) we would need to be able to do pretraining with mostly synthetic data. “How to usefully pre-train models on synthetic data” is not something I recommend working on, because I think it would be very useful for AI capabilities. So I expect capabilities researchers to be good at exploring it on their own.[22]
- However, it might be useful to consider how you would prevent leaking information from the present if you could usefully pre-train models on synthetic data.
- In particular, the synthetic data would probably be constructed by models that have a lot of knowledge about the present. So you would have to prevent that knowledge from leaking into the synthetic data.
- (This research project may be easier to do once we understand more about good methods of training on synthetic data. I’m personally not sure what the SOTA is, here.)
- Another question someone could answer is: How much data do we even have from various time periods? (I’m not sure.)
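Returning to the question above about dating data by checking for its presence in older crawls: one simple building block is the Internet Archive’s CDX API, which reports when a URL was first captured and so gives a conservative bound on when its content existed. This is a rough sketch based on my understanding of that API, and, as noted above, it’s only one noisy signal (pages get edited after capture, and much of the web is never crawled):

```python
import requests

def earliest_capture(url: str) -> str | None:
    """Return the earliest Wayback Machine capture timestamp ('YYYYMMDDhhmmss') for a URL,
    or None if it has never been captured. A capture before date D is evidence the content
    existed before D; the absence of a capture proves little."""
    resp = requests.get(
        "http://web.archive.org/cdx/search/cdx",
        params={"url": url, "output": "json", "fl": "timestamp", "limit": 1},
        timeout=30,
    )
    resp.raise_for_status()
    rows = resp.json() if resp.text.strip() else []  # first row (if any) is a header
    return rows[1][0] if len(rows) > 1 else None

if __name__ == "__main__":
    print(earliest_capture("example.com"))
```

Different dating methods (older Common Crawl snapshots, page metadata, a model’s own guess) can then be compared against each other on data whose date is already known.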
Start efforts inside of AI companies for AI forecasting or other ambitious epistemic assistance
- It seems great for epistemic tools to be developed in close contact with users, so that the tools fill real needs. And people inside AI companies are important customers who I eventually want to be aided by epistemic assistance.
- Also, if external AI efforts ever become uncompetitive with the big companies (because of the greater AI capabilities available inside of labs) then I want people in those companies to already be working on this.
- A variant of this would be to start an effort outside of AI companies, but consult with them to understand what they’re interested in and so that they can get impressed with the technology. Such that, if external projects become uncompetitive, then the people inside of AI labs are interested in starting up similar AI & epistemics efforts inside the labs (or to provide external ones with privileged access to company models).
- My main concern is that the current technology might not actually be good enough to be really helpful, yet. Or that (in order to be sufficiently good) the current technology needs a ton of schlep-work that won’t generalize to future models.
Scalable oversight / weak-to-strong-generalization / ELK
- It’s possible that we’ll develop powerful AI systems which themselves have excellent epistemic abilities, without us being able to use those abilities for everything that we want to use them for. For example, if you trained an AI to predict the stock market, it could develop all kinds of powerful epistemic methods and interesting hypotheses about the world — but all that you would see as its developer would be its projected stock prices.
- In order for AIs to provide great epistemic assistance to humans, we want to be able to elicit and verify all the knowledge and heuristics that AI systems develop.
- This also overlaps heavily with alignment research, since if AIs have abilities that humans can’t elicit from them, that makes it difficult for humans to train AIs to behave well, and makes it more dangerous if AIs try to seize power.
- Eliciting Latent Knowledge (ELK) is a framing from the Alignment Research Center. While they work on it from a theoretical perspective, it can also be tackled experimentally. It’s the problem of eliciting knowledge that an AI “knows” even when you can’t provide any direct feedback signal that incentivizes honesty.
- Weak to Strong Generalization is a framing from OpenAI: “Our setting can be viewed as a general methodology for empirically studying problems like ELK and honesty across a wide range of tasks.”
- A related area is lie detection for AI systems.
- Scalable Oversight refers to attempts to amplify the overseers of an AI system such that they are more capable than the system itself (typically by having the overseers use the system itself). If successful, this could give us a feedback signal with which to train powerful AI systems while their reasoning remains understandable to (amplified) humans.
- See here for some discussion about its relationship to weak-to-strong generalization.
- There are teams working on this at Anthropic and DeepMind.
Experiments on what type of arguments and AI interactions tend to lead humans toward truth vs. mislead them
- As a final category of experiments that could be useful to run, I wanted to flag experiments on what kinds of arguments and interactions-with-AIs tend to lead humans towards the truth, and what kinds tend to mislead them.
- One example of this is Tom Davidson’s proposal described in Let’s use AI to harden human defenses against AI manipulation:
- “tldr: AI may manipulate humans; we can defend against that risk better by optimizing AIs to manipulate humans, seeing what manipulation techniques they use, and learning to detect those techniques.”
- Compared to e.g. automated forecasting, note that this centers the fact that AIs will assist humans in decision-making, and we ultimately want them to lead humans towards truth, rather than just be right themselves.
- Note also that the human component makes experiments here more time-sensitive. For AIs, we might be able to use past-casting to do experiments later on. But it’s difficult to run past-casting experiments for humans. So if we want results on long-range forecasting questions here, we may have to start soon.
- This category is arguably a subset of scalable oversight research (mentioned above), since an “amplified overseer” will typically be an AI-assisted human, and it’s critical for such a human to be led towards truth rather than misled by AIs. But I’m still calling it out separately because:
- Even separately from humans’ role in overseeing (i.e., providing a training signal to) AI systems, it would be great to better understand how AIs can lead humans towards truth rather than mislead them. (For example, this kind of knowledge could be important when people use AIs that they didn’t train themselves.)
- I think the results may be importantly different depending on whether you’re looking at a more technical area (e.g.: “does this code have a backdoor?”) or an area that’s more interdisciplinary and politically charged (e.g.: “what would happen if the US made this deal with China on how to handle TAI?”). The latter seems especially important for “AI for epistemics”.
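To give a flavor of how results from such experiments might be analyzed (this is my own rough sketch, not Tom Davidson’s actual proposal): randomize which participants see an AI arguing for the true answer, an AI red-teamed to argue for a false answer, or no AI at all, and compare how often they end up correct. The record format and condition names are placeholders.

```python
from collections import defaultdict

# Each record: which condition a participant saw for a question, and whether their
# final answer was correct. Placeholder conditions: "no_ai", "ai_argues_true", and
# the red-team condition "ai_argues_false".
records = [
    {"condition": "no_ai", "correct": True},
    {"condition": "ai_argues_true", "correct": True},
    {"condition": "ai_argues_false", "correct": False},
    # ... collected from an actual randomized study
]

def accuracy_by_condition(records: list[dict]) -> dict[str, float]:
    totals, correct = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["condition"]] += 1
        correct[r["condition"]] += r["correct"]
    return {c: correct[c] / totals[c] for c in totals}

# The interesting comparisons: how much an AI arguing for the truth helps relative to
# no AI, and how much a deliberately misleading AI hurts (and whether people notice).
print(accuracy_by_condition(records))
```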
Concluding thoughts
The development of AI systems with powerful epistemic capabilities presents both opportunities and significant challenges for our society. Transformative AI will have a big impact on our society’s epistemic processes, and how good or bad this impact is may depend on what we do today.
I started out this post by distinguishing between efforts to differentially increase AIs’ epistemic capabilities and efforts to enable the diffusion and appropriate trust of AI-discovered information. While I wrote a bit about this second category (characterizing it as good norms & practices for AI as knowledge producers and communicators), I will again note that the relative lack of content on it doesn’t mean that I think it’s any less important than the first category.
On the topic of differentially increasing epistemic AI capabilities, I’ve argued that work on this today should (i) focus on methods that will complement rather than substitute for greater compute budgets, (ii) prioritize problems that industry isn’t already trying hard to solve, and (iii) aim to show people what the future has in store, by demonstrating what’s currently possible and prototyping what’s yet to come. I think that all the project ideas I listed do well according to these criteria, and I’d be excited to see more work on them.
- ^
Personally, I focus a lot on the possibility of this happening within the next 10 years. Because I think that’s plausible, and that our society would be woefully underprepared for it. But I think this blog post is relevant even if you’re planning for longer timelines.
- ^
I explain why below.
- ^
Feel free to reach out to me via DM or email at [my last name].[my first name]@gmail.com if you’re considering working on this and would be interested in my takes on what good versions could look like.
- ^
Here, I’m using a definition of “transformative AI” that’s similar to the one discussed in this note.
- ^
Other than underestimates of AI takeover risk, another significant reason I’m worried about AI takeover is AI races where participants think that the difference in stakes between “winning the race” and “losing the race” is on the same scale as the difference in stakes between “losing the race” and “AI takeover”. Assuming that no important player underestimated the probability of AI takeover, I expect this sort of race to happen between nation states, because if a state thought there was a significant probability of AI takeover, I would expect them to stop domestic races. On the international scene, it’s somewhat less obvious how a race would be stopped, but I’m decently optimistic that it would happen if everyone involved estimated, say, ≥20% probability of AI takeover.
- ^
Even for extinction-risk that comes from “rational” brinksmanship, I suspect that the world offers enough affordances that countries could find a better way if there was common knowledge that the brinkmanship route would lead to a high probability of doom. It’s plausible that optimal play could risk a small probability of extinction, but I don’t think this is where most extinction-risk comes from.
- ^
I think there’s two mutually reinforcing effects, here. One is that people may try to learn the truth, but make genuine mistakes along the way. The other is that people may (consciously or sub-consciously) prefer to believe X over Y, and the ambiguity in what’s true gives them enough cover to claim to (and often actually) believe X without compromising their identity as a truth-seeker. Note that there’s a spectrum, here: Some people may be totally insensitive to what evidence is presented to them while some people are good at finding the truth even in murky areas. I think most people are somewhere in the middle.
- ^
Though this has exceptions. For example, Alex may already be skeptical of an existing epistemic method M’s ability to answer certain types of questions, perhaps because M contradicts Alex’s existing beliefs on the topic. If a new epistemic method is similar to M, then Alex may suspect that this method, too, will give unsatisfying answers on those questions — even if it looks good on the merits, and perhaps even if Alex will be inclined to trust it on other topics.
- ^
I don’t think this would permanently preclude companies from using their AIs for epistemic tasks, because when general capabilities are high enough, I expect it to be easy to use them for super-epistemics. (Except for some caveats about the alignment problem.) But it could impose delays, which could be costly if it leads to mistakes around the time when TAI is first developed.
- ^
If necessary: After being separated from any dangerous AI capabilities, such as instructions for how to cheaply construct weapons.
- ^
One analogy here is the Congressional Budget Office (CBO). The CBO was set up in the 1970s as a non-partisan source of information for Congress and to reduce Congress’ reliance on the Office of Management and Budget (which resides in the executive branch and has a director that is appointed by the currently sitting president). My impression is that the CBO is fairly successful, though this is only based on reading the Wikipedia page and this survey which has >30 economists “Agree” or “Strongly agree” (and 0 respondents disagree) with “Adjusting for legal restrictions on what the CBO can assume about future legislation and events, the CBO has historically issued credible forecasts of the effects of both Democratic and Republican legislative proposals.”
- ^
I.e.: It would be illegal for Walmart to pay OpenAI to make ChatGPT occasionally promote/be-more-positive on Walmart. But it would be legal for Walmart to offer their own chatbot (that told people about why they should use Walmart) and to buy API access from OpenAI to run that chatbot.
- ^
C.f. the discussion of “asymmetric” vs “symmetric” tools in Guided By The Beauty Of Our Weapons.
- ^
I was uncertain about whether this might have been confounded by the AIs having been fine-tuned to be honest, so I asked about this, and Rohin Shah says “I don't know the exact details but to my knowledge we didn't have trouble getting the model to lie (e.g. for web of lies).”
- ^
Which is an accident in the sense that it’s not intended by any human, though it’s also not an accident in the sense that it is intended by the AI systems themselves.
- ^
I think the most important differences here are 1 & 2, because they have big implications for what your main epistemic strategies are. If you have good feedback loops, you can follow strategies that look more like "generate lots of plausible ideas until one of them work" (or maybe: train an opaque neural network to solve your problem). If your problem can be boiled down to math, then it's probably not too hard to verify a theory once it's been produced, and you can iterate pretty quickly in pure theory-land. But without these, you need to rely more on imprecise reasoning and intuition trained on few data points (or maybe just in other domains). And you need these to be not-only good enough to generate plausible ideas, but good enough that you can trust the results.
- ^
Such as:
Write-ups on what type of transparency is sufficient for outsiders to trust AI-as-knowledge-producers, and arguments for why AI companies should provide it.
Write-ups or lobbying pushing for governments (and sub-parts of governments, such as the legislative branch and opposition parties) to acquire AI expertise. To either verify or be directly involved in the production of key future AI advice.
Evaluations testing AI trustworthiness on e.g. forecasting.
- ^
Such as:
Write-ups on what type of transparency is sufficient to trust AI-as-communicators, and arguments for why AI companies should provide it.
Setting up an independent organization for evaluating AI truthfulness.
Developing and advocating for possible laws (or counter-arguments to laws) about AI speech.
- ^
This could include asking short-horizon forecasters about hypothetical scenarios, insofar as we have short-term forecasters that have been trained in ways that makes it hard for them to distinguish real and hypothetical scenarios. (For example: Even when trained on real scenarios, it might be important to not give these AIs too much background knowledge or too many details, because that might be hard to generate for hypothetical scenarios.)
- ^
Significantly implemented via AIs imitating human feedback.
- ^
I.e., use each data point several times.
- ^
Indeed, it seems useful enough for capabilities that it might be net-negative to advance, due to shorter timelines and less time to prepare for TAI.