Note: The research underlying this report was conducted in June 2024 and was originally commissioned by the Future of Life Foundation (FLF). The views and opinions expressed in this document are those of the authors and do not represent the positions of FLF.

This report investigates Training Data Attribution (TDA) and its potential importance to, and tractability for, reducing extreme risks from AI. TDA techniques aim to identify training data points that are especially influential on the behavior of specific model outputs. They are motivated by the question: how would the model's behavior change if one or more data points were removed from or added to the training dataset?
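To make that counterfactual question concrete, here is a minimal, illustrative sketch (ours, not the report's) of the gradient-based family of TDA methods discussed below: a first-order, TracIn-style influence score that approximates how much a single training example pushed the model toward a given output by taking the dot product of the two loss gradients. The model, loss function, and example tensors are placeholders.

```python
# Minimal sketch of a first-order, gradient-based influence score (TracIn-style).
# Illustrative only: `model`, `loss_fn`, and the example tensors are placeholders,
# and practical TDA systems add curvature corrections, checkpoint averaging, and
# heavy approximations on top of this basic idea.
import torch

def flat_grad(model, loss):
    """Flatten the gradient of `loss` w.r.t. all trainable parameters into one vector."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def influence_score(model, loss_fn, train_example, test_example):
    """Approximate influence of `train_example` on the model's behavior on
    `test_example`: positive means the training example pushed the model
    toward that behavior, negative means it pushed away."""
    x_tr, y_tr = train_example
    x_te, y_te = test_example
    g_train = flat_grad(model, loss_fn(model(x_tr), y_tr))
    g_test = flat_grad(model, loss_fn(model(x_te), y_te))
    return torch.dot(g_train, g_test).item()
```

In use, a score like this would be computed for every training example (or a compressed proxy of it; see the sketch after the key takeaways) and the most positive and most negative examples surfaced as the data most responsible for the output in question.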

Report structure:

  • First, we discuss how plausible it is, and how much effort it would take, to bring existing TDA research from its current state to an efficient and accurate TDA inference tool that can be run on frontier-scale LLMs. Next, we discuss the numerous research benefits AI labs could expect to see from using such TDA tooling.
  • Then, we discuss a key outstanding bottleneck that would keep such TDA tooling from being publicly accessible: AI labs’ willingness to disclose their training data. We suggest ways AI labs may work around these limitations, and discuss the willingness of governments to mandate such access.
  • Assuming that AI labs willingly provide access to TDA inference, we then discuss the high-level societal benefits that might follow. We list and discuss a series of policies and systems that TDA may enable. Finally, we present an evaluation of TDA’s potential impact on mitigating large-scale risks from AI systems.

Key takeaways from our report:

  • Modern TDA techniques can be categorized into three main groups: retraining-based, representation-based (or input-similarity-based), and gradient-based. Recent research has found that gradient-based methods (using influence functions) are the most likely path to practical TDA.
  • The most efficient approach to TDA using influence functions today has training costs on par with pre-training an LLM, significantly higher (but still feasible) storage costs than the LLM itself, and somewhat higher per-inference costs.
  • Based on these estimates, running TDA on frontier LLMs no longer appears infeasible given enterprise levels of compute and storage. However, these techniques have not been tested on larger models, and the accuracy of these optimized TDA techniques at that scale is unclear.
  • Compressed-gradient TDA is already plausible for fine-tuned models, which involve orders of magnitude fewer training examples and parameters (on the order of millions or billions rather than hundreds of billions). A rough sketch of the compressed-gradient approach appears after these takeaways.
  • Achieving efficient and accurate TDA on frontier models will likely take 2-5 years, depending largely on specific incremental research results and the amount of funding and number of researchers allocated to the space.
  • Efficient TDA techniques will likely have a substantial positive impact on AI research and LLM development, including the following effects:
    • Mitigating the prevalence of hallucinations and false claims
    • Identifying training data that produces poor results (bias, misinformation, toxicity) and improving data filtering / selection
    • Shrinking overall model size / improving efficiency
    • Improving interpretability and alignment
    • Improving model customization and editing
  • AI labs are likely already well-incentivized to invest in TDA research efforts because of the benefits to AI research.
  • Public access to TDA tooling on frontier AI models is limited primarily by the unwillingness / inability of AI labs to publicly share their training data.
    • AI labs currently have strong incentives to keep their training data private, as publishing such data would have negative outcomes such as:
      • Reduced competitive advantages from data curation
      • Increased exposure to legal liabilities from data collection
      • Violating privacy or proprietary data requirements
    • AI labs may be able to avoid these outcomes by selectively permitting TDA inference on certain training examples, or returning sources rather than the exact training data.
    • Governments are highly unlikely to mandate public access to training data.
  • If AI labs willingly provided public access to TDA, we could expect the following benefits, among others:
    • Preventing unauthorized use of copyrighted data
    • Improved fact checking / content moderation
    • Impacts on public trust and confidence in LLMs
    • Accelerated research by external parties
    • Increased accountability for AI labs
  • AI labs appear largely disincentivized to provide access to TDA inference, as many of the public benefits are disadvantageous for them.
    • Governments are highly unlikely to mandate public access to TDA.
    • It seems plausible that certain AI labs may expose TDA as a feature, but that the majority would prefer to use it privately to improve their models.
  • Several systems that could be enabled by efficient TDA include:
    • Providing royalties to data providers / creators
    • Automated response improvement / fact-checking
    • Tooling for improving external audits of training data
    • Content attribution tooling for LLMs, though it is unlikely to replace systems reliant on RAG
  • We believe that the most promising benefit of TDA for AI risk mitigation is its potential to improve the technical safety of LLMs via interpretability.
    • There are some societal / systemic benefits from TDA, and these may be a small contributing factor in reducing some sources of risk, but we don’t think they move the needle significantly on large-scale AI risks.
    • TDA may meaningfully improve AI capabilities research, which might actually increase large-scale risk.
    • TDA may eventually be highly impactful in technical AI safety and alignment efforts. We’d consider TDA’s potential impact on technical AI safety to be in a similar category to supporting mechanistic interpretability research. 
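As a rough illustration of the compressed-gradient approach and its storage cost mentioned in the takeaways above (our own sketch under stated assumptions, not the report's or any specific paper's implementation): each training example's gradient is projected once into a small random subspace and stored, and influence scores for a query are then dot products in that compressed space. The projection dimension and the explicit projection matrix below are illustrative; real systems would apply the projection implicitly or blockwise rather than materializing it.

```python
# Illustrative compressed-gradient TDA sketch (assumptions, not the report's method).
# Store one PROJ_DIM-dimensional vector per training example instead of a full
# per-parameter gradient, then score queries by dot products in the compressed space.
import torch

PROJ_DIM = 4096  # compressed dimension; a tunable assumption

def make_projection(n_params: int, seed: int = 0) -> torch.Tensor:
    """Random Gaussian projection (n_params x PROJ_DIM). Materialized here only
    for clarity; at LLM scale it would be applied implicitly or blockwise."""
    gen = torch.Generator().manual_seed(seed)
    return torch.randn(n_params, PROJ_DIM, generator=gen) / PROJ_DIM ** 0.5

def compress(grad_vec: torch.Tensor, proj: torch.Tensor) -> torch.Tensor:
    """Project a flattened per-example gradient down to PROJ_DIM floats."""
    return grad_vec @ proj

def score_query(query_grad: torch.Tensor, stored: torch.Tensor, proj: torch.Tensor) -> torch.Tensor:
    """Influence scores between one query gradient and every stored training
    example (`stored` has shape [num_examples, PROJ_DIM])."""
    return stored @ compress(query_grad, proj)
```

For a sense of scale: at PROJ_DIM = 4096 with 32-bit floats, each stored example costs about 16 KB, so a billion stored training sequences would take roughly 16 TB, which is large but in line with the "significantly higher but feasible" storage cost described above.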
Comments



Executive summary: Training Data Attribution (TDA) is a promising but underdeveloped tool for improving AI interpretability, safety, and efficiency, though its public adoption faces significant barriers due to AI labs' reluctance to share training data.

Key points:

  1. TDA identifies influential training data points to understand their impact on model behavior, with gradient-based methods currently the most practical approach.
  2. Running TDA on large-scale models is now feasible but remains untested on frontier models, with efficiency improvements expected within 2-5 years.
  3. Key benefits of TDA for AI research include mitigating hallucinations, improving data selection, enhancing interpretability, and reducing model size.
  4. Public access to TDA tooling is hindered by AI labs’ desire to protect proprietary training data, avoid legal liabilities, and maintain competitive advantages.
  5. Governments are unlikely to mandate public access to training data, but selective TDA inference or alternative data-sharing mechanisms might mitigate privacy concerns.
  6. TDA’s greatest potential lies in improving AI technical safety and alignment, though it may also accelerate capabilities research, potentially increasing large-scale risks.

 

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.
