[Cross-post] Change my mind: we should define and measure the effectiveness of advanced AI

David Johnston

Epistemic status: trying to bait you into responding (original).

Artificial Intelligence safety researchers are concerned with the behaviour of “highly capable” AI systems. The problem is, no-one knows what they mean when they say “highly capable”.

There are two assumptions underpinning the concern with “highly capable” systems:

People want highly capable AI systems, and will spend lots of time and money on trying to get them

Highly capable AI systems can be more destructive than less capable systems

This sounds reasonable, but if no-one knows what they mean by capability then how do we know if it’s true? Shane Legg and Marcus Hutter have had a go at addressing this problem: they collected many different definitions of “intelligence” and proposed their own definition, which they informally summarise as

Intelligence measures an agent’s ability to achieve goals in a wide range of environments

(they also offer a more precise definition in their paper)
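
For reference, the more precise definition is usually written as the "universal intelligence" of an agent. The notation below follows the standard statement of the Legg-Hutter measure rather than anything defined in this post: E is the set of computable environments with bounded reward, K is Kolmogorov complexity, and V is the expected total reward that agent π obtains in environment μ.

```latex
\Upsilon(\pi) \;:=\; \sum_{\mu \in E} 2^{-K(\mu)} \, V_{\mu}^{\pi}
```

Simpler environments (those with small K) get more weight, which is one way of making "a wide range of environments" precise.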

This does a good job of explaining the second assumption. A system that is very good at achieving goals in a very wide range of environments could in particular be very good at achieving destructive goals in a wide range of environments; on the other hand, a system that is bad at achieving goals can try to do destructive things, but it’s probably going to fail.

It’s not quite so good at explaining the first assumption. Sure, a system that can achieve many different goals can, presumably, do a better job of achieving our goals than a less intelligent system. But there’s a gap in that argument: we have to assume that a system that’s good at achieving some kind of goal is likely to achieve our goals. That is: a more intelligent AI system is more useful assuming it can be controlled.

AI safety researchers are trying to come up with proposals for highly capable AI systems that are both safe and competitive with unsafe systems. To do this, they need to understand the dimension on which they must compete. They don’t need to compete on intelligence alone: they need to compete on effectiveness, which can be roughly understood as “controllable intelligence”.

AI effectiveness

Highly effective AI systems are systems that people can use to achieve their goals. More specifically, I propose effectiveness is the degree to which a system augments a person’s existing capabilities:

An AI system X is at least as effective as a system Y if a person in possession of X can do everything the same person could do if they had Y but not X, in the same time or less. X is strictly more effective than Y if a person in possession of X can also do at least one thing in less time than the same person in possession of Y.
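
One way to make this definition concrete is as a comparison of task-completion times. The sketch below is purely illustrative: representing "what a person can do with a system" as a dictionary from task to time taken, the function names, and the example tasks and numbers are all assumptions made for the example, not part of the definition.

```python
from math import inf

def at_least_as_effective(times_x, times_y):
    """True if every task doable with Y (but not X) is doable with X
    in the same time or less. A task missing from a dict is treated
    as impossible, i.e. taking infinite time."""
    return all(times_x.get(task, inf) <= t for task, t in times_y.items())

def strictly_more_effective(times_x, times_y):
    """At least as effective, and strictly faster on at least one task
    (a task possible only with X counts as strictly faster)."""
    if not at_least_as_effective(times_x, times_y):
        return False
    tasks = set(times_x) | set(times_y)
    return any(times_x.get(t, inf) < times_y.get(t, inf) for t in tasks)

# Hypothetical hours taken by the same person with each system.
with_x = {"write report": 2.0, "classify images": 1.0}
with_y = {"write report": 3.0}

# Two systems that each win on a different task are incomparable.
with_a = {"task 1": 1.0, "task 2": 5.0}
with_b = {"task 1": 2.0, "task 2": 3.0}

print(strictly_more_effective(with_x, with_y))
print(at_least_as_effective(with_a, with_b), at_least_as_effective(with_b, with_a))
```

Note that under this reading effectiveness is only a partial ordering: for the pair `with_a` and `with_b`, neither system is at least as effective as the other.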

Effectiveness has a more direct bearing on what kind of systems people might want than intelligence does. Someone with a more effective AI system can do more than someone with a less effective system, and so (we might presume), the more effective system is more desirable.

Decomposing effectiveness into intelligence and controllability

In the situation where the AI does all of the hard work, we can decompose effectiveness into “intelligence” and “controllability”. If a person has the ability to perfectly transmit their goals to an AI system, and they rely on it entirely to achieve the goal, then whether or not the goal is achieved depends entirely on the AI system’s ability to achieve goals, which is intelligence in the Legg-Hutter sense. Therefore, any difference between “effectiveness” and “intelligence” must come from the inability of a person to perfectly transmit their goals to the system.

On the other hand, we are interested in how much the person and the system can achieve together, and the person may perform important steps that the system cannot do. It is not so obvious to me if or how we could decompose the capability of this hybrid system into “intelligence” and “controllability”. One option would be to consider second-order control - can the person A + AI system successfully achieve a third party’s (“person B’s”) goals? The problem here is: if person A cannot perfectly align the AI system with their own goals, then what is the “goal” of the person A + AI system? We need some way of determining this if we want to apply Legg and Hutter’s definition of intelligence.

How can we assess effectiveness?

One way we could try to assess effectiveness is to examine AI systems as they are being used in organisations or by individuals, and form judgements about:

What the organisation is achieving
How much time it takes to achieve this
What we imagine the organisation might achieve without the system in question

In practice, all of these assessments are at least a bit woolly. It seems hard, though not impossible, to answer “what is Google achieving?” The answer, I think, would be a long list. Similarly, it seems hard but perhaps not impossible to say how long it has taken them to achieve any particular item on that list. It seems particularly hard to identify an AI system that Google is using and to guess what they would instead be achieving if they could not use this system.

However, I think that with substantial effort, some plausible if not particularly reliable answers to these questions could be offered. Having answers to these questions seems like it could be a useful benchmark of AI progress.

I think a particular difficulty with this approach is that we have to rely on counterfactual speculation about what cannot be achieved without a particular system. Controlled tests could help to address this: we pair people or teams with AI systems and give them a battery of tasks to achieve in a given amount of time. For reasons of practicality, the amount of time available probably isn’t going to be very long, and so the tasks assessed would have to be somewhat simpler.
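
A controlled test of this kind could be scored very simply. The following is a minimal sketch under stated assumptions: the scoring rule (difference in the fraction of tasks finished within the time limit), the function names, and the example tasks and timings are all hypothetical, and a real study would also need many participants and some randomisation.

```python
def completion_rate(results, time_limit):
    """Fraction of tasks finished within time_limit.
    `results` maps task -> minutes taken, or None if never finished."""
    if not results:
        raise ValueError("need at least one task")
    done = [t is not None and t <= time_limit for t in results.values()]
    return sum(done) / len(done)

def uplift(with_ai, without_ai, time_limit):
    """Difference in completion rate for the same person or team
    on the same battery of tasks, with and without the AI system."""
    return completion_rate(with_ai, time_limit) - completion_rate(without_ai, time_limit)

# Hypothetical results for one participant and a 60-minute limit.
with_ai = {"essay": 30, "program": 50, "forecast": 55, "diagnosis": None}
without_ai = {"essay": 45, "program": None, "forecast": 70, "diagnosis": None}

print(uplift(with_ai, without_ai, time_limit=60))
```

A single number like this discards a lot of information (which tasks improved, and by how much), so it is best read as a rough summary rather than a measurement of effectiveness itself.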

There are many batteries of automated tasks that AIs are routinely tested on. These test how “effective” AI systems are on particular problems, but these problems are often quite specialised, and must be specified in very stringent ways. “Effectiveness” in general encompasses a wider variety of problems, where human operators might use an AI system for intermediate steps, or employ some creativity in repurposing it for a given task.

Do interesting batteries of tasks for human-AI pairs exist? Given that interesting batteries of tasks for humans exist, and interesting batteries of tasks for AIs also exist, I think that surely there’s something that’s interesting for the combination (Kaggle comes to mind as an almost proof-of-concept).

I think it is particularly easy to think of interesting tasks for humans paired with high-powered language models. Here there are many things that are somewhat interesting for people to achieve that can plausibly (or perhaps not so plausibly) be augmented by the AI:

Sitting a standardised test
Writing a decently received blog, fan fiction or comic
Writing a program to a specification
Diagnosing illness
Participating in a prediction tournament

On the other hand, test batteries for AI systems more broadly defined seem to be tricky. For example, I don’t see an easy way to run trials of a human + “hardware + framework + the idea of the transformer architecture”, even if high-powered language models are arguably a special case of this.^[1]

It would be interesting to know whether there are tasks that remain interesting for many years. I can imagine that some tasks initially need a little human prompting to point the AI in the right direction and then quickly become entirely automated, but perhaps not everything will go this way.

Another interesting question is whether there is a discernible trend towards generalisability. For example, in 2022, it might be the case that language models help people a lot with “language-heavy” tasks like writing, but not much with something like diagnosing an illness, while in 2027 perhaps the systems that provide the biggest boost to writing capability also help a lot with diagnosis of illness.

A final approach to assessing effectiveness is to theorise about it. I imagine this looks something like further developing theories of “AI controllability” and “AI intelligence” and how they combine to yield effectiveness. This avenue could be particularly useful for predicting the effectiveness of safe vs unsafe AI systems, which might otherwise be hard to test.

Safety implications of effectiveness

If we know only that an AI system is highly effective (or ineffective), can we draw any conclusions regarding its safety?

We might hope that an ineffective AI system is safe, because no-one can figure out how to make it do anything useful. Thus, perhaps, it also can’t do anything destructive. I don’t think this really holds - consider an AI system that is highly intelligent and always pursues a single goal. It will not be very effective because whatever you want it to do, it will always do the same thing. However, if its single goal is a destructive one, it could pursue it very efficiently.

We can say that, all else equal, if the primary safety concern is unexpected behaviour, then increasing a system’s controllability is good for safety and good for effectiveness. This doesn’t mean that highly effective systems are highly controllable, though.

If the human operator is one of the primary safety concerns, then we probably can conclude that highly effective systems are unsafe. If a human + AI pair can do many different things that the human in isolation cannot, this probably includes some destructive things, which is a safety concern.

Though the conclusions that can be drawn from effectiveness alone are not particularly compelling, it is possible that together with additional assumptions there are interesting things to be said about the relationship between safety and effectiveness.

Bugs to iron out

One question is, if we want to know how someone will perform “without system X”, what are they allowed to use instead? Someone in possession of AlexNet in 2012 could perhaps classify extremely large sets of images more accurately than someone without it (that is, sets of images too large for a person to manually classify). Today, however, there are plenty of free tools that substantially outperform AlexNet, so a human with AlexNet no longer has a capability advantage over the same human without, unless we also rule out these other tools.

The problem is even worse with knowledge. Suppose AI system X makes a scientific discovery that everyone then learns about. We can hardly ask someone to pretend they’ve never learned about this discovery, but as long as they make use of it then they’re not really operating without system X.

One way to try to address this problem is by benchmarking measures to particular points in time. Something like

An AI system X is at least as 2022-effective as a system Y if a person in possession of X can do everything the same person could do if they had Y and other technology + knowledge available in the year 2022, but not X. (etc.)
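
The effect of fixing a baseline year can be illustrated with the AlexNet example. This is a sketch under assumptions: it supposes the baseline tools are available to the person on both sides of the comparison, that a person always uses the fastest tool they have, and all task names and timings are made up for illustration.

```python
from math import inf

def best_time(task, toolkits):
    """Fastest time for `task` across all toolkits; inf if none can do it."""
    return min((kit.get(task, inf) for kit in toolkits), default=inf)

def compare_with_baseline(times_x, times_y, baseline):
    """Return (at_least_as_effective, strictly_more_effective) for X vs Y,
    where `baseline` is a list of toolkits (dicts of task -> time) that
    the person may use in both cases."""
    with_x = [times_x, *baseline]
    without_x = [times_y, *baseline]
    tasks = set().union(*with_x, *without_x)
    tx = {t: best_time(t, with_x) for t in tasks}
    ty = {t: best_time(t, without_x) for t in tasks}
    at_least = all(tx[t] <= ty[t] for t in tasks)
    strictly = at_least and any(tx[t] < ty[t] for t in tasks)
    return at_least, strictly

# Hypothetical hours to classify an image set too large to label by hand.
alexnet = {"classify large image set": 10.0}
no_system = {}
baseline_2012 = []  # assumption: no alternative tool could do this task
baseline_today = [{"classify large image set": 5.0}]  # a free, faster tool

print(compare_with_baseline(alexnet, no_system, baseline_2012))
print(compare_with_baseline(alexnet, no_system, baseline_today))
```

Against the 2012 baseline, having AlexNet is a strict improvement over having nothing; against today's baseline it adds no advantage, which is the ambiguity the time-indexed definition is meant to pin down.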

Conclusion

If we want to think about which AI systems will be competitive, we need to think about their intelligence and their controllability
While there is a lot of theoretical work on “intelligence” in the form of things like regret bounds, as far as I know there’s a lot less work on formalising notions of controllability
“Effectiveness” could be a reasonable synthesis of intelligence and controllability
We might be able to roughly estimate system effectiveness, either by surveying deployed systems or by running controlled tests, and this could provide useful information about trends and levels

^[1] My feeling is that training high-powered language models is closer to the kind of activity that raises safety concerns than utilising them. "Controlling" the training process is difficult and involves choosing loss functions, optimizers, datasets and so forth so as to produce something that does useful things. On the other hand, "controlling" the deployed model is more of an ad-hoc kind of activity and doesn't seem as likely to result in some runaway optimization. So the controllability of the trained model doesn't seem to be the main question of interest, and therefore perhaps this kind of testing is of limited use to safety research.

Effective Altruism Forum