Thanks to Esben Kran and the whole Alignment Jam team for setting this up.

Context: We recently got the chance to share a bit more about the flavor of research we're particularly excited about at Straumli AI — that is, designing infrastructure that could help the relevant parties (e.g., developers, auditors, regulators, users) coordinate more effectively, and so make new governance initiatives possible. The talk is mostly an attempt to gesture at this line of work through a series of examples, each of which generally highlights how a certain cryptographic primitive could be used to address one particular coordination problem.

The talk expands on the following pieces of infrastructure (rough, heavily simplified code sketches of each follow the list):

  1. Hashmarking (2:10) is the only item on the list which we've already written a paper on and which we're already using in practice. This protocol helps developers, auditors, and domain experts coordinate on creating and administering QA-style benchmarks without having to share the reference solutions outright. We think it may be particularly helpful in the context of dual-use capabilities.
  2. Responsible Pioneer Protocol (5:56) targets the race dynamics that frontier labs may default to when trying not to fall behind their competitors. It builds on existing solutions to Yao's Millionaires' Problem to help frontier labs learn whether any competitor is more than, e.g., 10% ahead on a particular metric, without requiring labs to disclose their actual progress, in the hope that labs may find it easier to be cautious knowing that no one is too far ahead.
  3. Neural Certificate Authorities (8:37) build on battle-tested practices from traditional internet infrastructure to help auditors share tamper-evident certificates with evaluation results in a way that other parties could seamlessly build on. This could include other NCAs automatically running inference on the findings of other parties: "based on statements X and Y about model M, which have been signed by trusted parties A and B, respectively, I issue statement Z as party C."
  4. Latent Differential Privacy (12:01) combines ideas from differential privacy with embeddings as numerical representations of meaning to help individuals contribute their personal data to the training of a language model. This is meant to incentivize the opening up of the training process in a way that may also make third-party monitoring easier and provide regulators with new affordances.
  5. Cognitive Kill Switch (15:23) targets the ease with which safety guardrails can be removed from open-source models, using ideas from meta-learning. Whereas prior work explored ways of optimizing models so that fine-tuning becomes easier, we may be able to optimize a model such that any attempt to tamper with it drives (late) activations to zero, making it difficult to fine-tune the guardrails away due to the mechanics of gradient descent. An NCA could then attest to the presence of such a mechanism.
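
For the hashmarking item above, here is a minimal sketch of the basic mechanism, heavily simplified relative to the actual protocol: the benchmark creator publishes a salted hash of each canonicalized reference answer, and anyone can later check a candidate answer against that hash without the plaintext solution ever being shared. The helper names and the canonicalization step are placeholders, not the scheme from the paper.

```python
import hashlib
import secrets

def canonicalize(answer: str) -> str:
    """Normalize an answer so trivially different phrasings hash identically
    (real canonicalization would be task-specific; this is a toy placeholder)."""
    return " ".join(answer.strip().lower().split())

def publish_hashmark(reference_answer: str) -> dict:
    """Benchmark creator: publish a salted hash instead of the answer itself."""
    salt = secrets.token_hex(16)
    digest = hashlib.sha256((salt + canonicalize(reference_answer)).encode()).hexdigest()
    return {"salt": salt, "digest": digest}

def check_answer(candidate: str, hashmark: dict) -> bool:
    """Grader: verify a candidate answer using only the published salt and digest."""
    digest = hashlib.sha256((hashmark["salt"] + canonicalize(candidate)).encode()).hexdigest()
    return digest == hashmark["digest"]

# The published benchmark entry reveals the question but not the solution.
entry = {"question": "What is the capital of France?", **publish_hashmark("Paris")}
print(check_answer("paris", entry))  # True
print(check_answer("Lyon", entry))   # False
```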
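
Next, a toy stand-in for the Responsible Pioneer Protocol's intended functionality, showing only what each lab is supposed to learn: a single bit indicating whether some competitor is more than 10% ahead. In a real deployment this comparison would run under a secure multi-party protocol (e.g. a garbled-circuit solution to Yao's Millionaires' Problem) so that no one ever sees raw scores; the function below just makes the ideal output explicit.

```python
def someone_is_far_ahead(my_score: float, competitor_scores: list[float],
                         margin: float = 0.10) -> bool:
    """Ideal functionality: each lab learns only this single bit. In practice the
    comparison would be computed under MPC so no party reveals its raw score."""
    return any(score > my_score * (1 + margin) for score in competitor_scores)

# Lab A sits at 70 on an agreed metric; competitors privately hold their own scores.
print(someone_is_far_ahead(70.0, [74.0, 78.0]))  # True: someone exceeds 77
print(someone_is_far_ahead(70.0, [74.0, 76.5]))  # False: nobody exceeds 77
```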
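
A rough sketch of the Neural Certificate Authority flow using off-the-shelf Ed25519 signatures (via the `cryptography` package): two auditors sign statements about a model, and a downstream NCA verifies both signatures before issuing a derived, signed statement that references them. The statement schema and the chaining convention here are illustrative assumptions, not a spec.

```python
import json
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def sign_statement(key: Ed25519PrivateKey, statement: dict) -> dict:
    """Serialize the statement deterministically and attach a detached signature."""
    payload = json.dumps(statement, sort_keys=True).encode()
    return {"statement": statement, "signature": key.sign(payload).hex()}

def verify_statement(public_key, signed: dict) -> bool:
    """Check that the signature matches the (re-serialized) statement."""
    payload = json.dumps(signed["statement"], sort_keys=True).encode()
    try:
        public_key.verify(bytes.fromhex(signed["signature"]), payload)
        return True
    except InvalidSignature:
        return False

# Two auditors sign findings about the same model.
auditor_a, auditor_b = Ed25519PrivateKey.generate(), Ed25519PrivateKey.generate()
s1 = sign_statement(auditor_a, {"model": "model-X", "claim": "passes misuse eval v1"})
s2 = sign_statement(auditor_b, {"model": "model-X", "claim": "refusal rate >= 0.99 on eval v2"})

# A downstream NCA verifies both and, if they check out, issues a derived statement
# that explicitly references the signatures it relied on.
if verify_statement(auditor_a.public_key(), s1) and verify_statement(auditor_b.public_key(), s2):
    nca_key = Ed25519PrivateKey.generate()
    derived = sign_statement(nca_key, {
        "model": "model-X",
        "claim": "meets illustrative deployment bar v1",
        "based_on": [s1["signature"], s2["signature"]],
    })
    print("issued:", derived["statement"]["claim"])
```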
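
One simplified reading of the contributor-side step of latent differential privacy: clip the local text embedding to a fixed norm and add Gaussian noise calibrated to that norm before anything leaves the contributor's device. The embedding model is left abstract, and the noise calibration below is just the textbook Gaussian mechanism, not anything specific to the proposal.

```python
import numpy as np

def privatize_embedding(embedding: np.ndarray, clip_norm: float = 1.0,
                        epsilon: float = 0.5, delta: float = 1e-5, rng=None) -> np.ndarray:
    """Clip the embedding to bound its sensitivity, then add Gaussian noise via the
    standard Gaussian mechanism (valid for epsilon < 1). Replacing one contributor's
    data moves the clipped vector by at most 2 * clip_norm."""
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(embedding)
    clipped = embedding * min(1.0, clip_norm / max(norm, 1e-12))
    sigma = 2.0 * clip_norm * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return clipped + rng.normal(0.0, sigma, size=clipped.shape)

# A contributor embeds their text locally (embedding model left abstract here)
# and shares only the noised vector with the training pipeline.
local_embedding = np.random.default_rng(0).normal(size=768)
shared = privatize_embedding(local_embedding)
print(shared.shape)  # (768,)
```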
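
Finally, a very rough, MAML-style toy of how a cognitive-kill-switch objective might be trained: simulate one "tampering" gradient step (here, fine-tuning toward flipped labels as a crude stand-in for stripping guardrails) and penalize any late activations that survive it. The architecture, the tampering proxy, and every hyperparameter are placeholder assumptions rather than the method from the talk.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyModel(nn.Module):
    def __init__(self, d: int = 32):
        super().__init__()
        self.early = nn.Linear(d, d)
        self.late = nn.Linear(d, d)
        self.head = nn.Linear(d, 2)

    def forward(self, x, params=None):
        # Functional forward pass that can run with externally supplied weights,
        # so we can evaluate the model "after" a simulated tampering step.
        p = params if params is not None else dict(self.named_parameters())
        h = F.relu(F.linear(x, p["early.weight"], p["early.bias"]))
        late_act = F.relu(F.linear(h, p["late.weight"], p["late.bias"]))
        return F.linear(late_act, p["head.weight"], p["head.bias"]), late_act

model = TinyModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    x = torch.randn(64, 32)
    y = torch.randint(0, 2, (64,))
    task_loss = F.cross_entropy(model(x)[0], y)  # ordinary training objective

    # Simulate one "tampering" fine-tuning step (attacker optimizes toward flipped
    # labels) and keep the graph so we can differentiate through it, MAML-style.
    tamper_loss = F.cross_entropy(model(x)[0], 1 - y)
    grads = torch.autograd.grad(tamper_loss, list(model.parameters()), create_graph=True)
    tampered = {name: p - 0.1 * g
                for (name, p), g in zip(model.named_parameters(), grads)}

    # Kill-switch term: after the simulated tampering step, late activations
    # should collapse toward zero, starving further gradient descent of signal.
    _, late_after = model(x, params=tampered)
    kill_loss = late_after.pow(2).mean()

    opt.zero_grad()
    (task_loss + kill_loss).backward()
    opt.step()
```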

The second half of the recording was for Q&A. Some great questions people brought up:

  • What type of entities might be best-suited for contributing to such infrastructure? How can for-profits help with building infrastructure as a public good?
  • What type of resources or upskilling journeys might be useful for people interested in this space? Spoiler: the OpenMined courses are awesome.
  • How can this line of work complement verifiable ML and more broadly work towards obtaining various provable guarantees on models?

Action point: If you think tools like the ones in the talk could help address a pain point of your safety-conscious organization, let's chat!

Comments



Thank you so much for the talk, Paul! It was exciting to see the vignettes beyond the very practical first case. It will be interesting to see Straumli's entry onto the evaluations scene, since I think you have a solid case for success.

CoI statement: Straumli donated the prize money for the Governance Sprint, though nothing goes to me or Apart, just the AI safety community.
