How much work would it be to make something like this as a python library, and how much would that reduce its usefulness? I think this is really cool and have been looking for something like this, but I am multiple times more likely to use something if it's a python library as opposed to a brand new language, and I assume others think similarly.
Misaligned AGI will turn us off before we can turn it off.
"As AI systems get more powerful, we exit the regime where models fail to understand what we want, and enter the regime where they know exactly what we want and yet pursue their own goals anyways, while tricking us into thinking they aren't until it's too late." (source: https://t.co/s3fbTdv29V)
Historical investigation on the relation between incremental improvements and paradigm shifts
One major question that heavily influences the choice of alignment research directions is the degree to which incremental improvements are necessary for major paradigm shifts. Because the field of alignment is largely preparadigmatic, there is a significant chance that substantial progress toward aligning superhuman AI systems will require a paradigm shift, rather than merely incremental improvements. The answer to this question determines whether the best approach to alignment is to choose metrics and make incremental progress on alignment research questions, to mostly fund long shots, or something else entirely. Research in this direction would entail combing through historical materials in the field of AI, and in other scientific domains more broadly, to better understand the contexts in which past paradigm shifts occurred, and putting together a report summarizing the findings.
Some possible ways-the-world-could-be include:
Creating materials for alignment onboarding
At present, the pipeline from AI capabilities researcher to AI alignment researcher is not very user-friendly. While there are a few people like Rob Miles and Richard Ngo who have produced excellent onboarding materials, this niche is still fairly underserved compared to onboarding in many other fields. Creating more materials for a field has a compounding advantage: because different people find different formats helpful, having more materials increases the likelihood that something works for any given person. While there are many possible angles for onboarding, several potential avenues stand out as promising due to successes in other fields:
Getting former hiring managers from quant firms to help with alignment hiring
Despite having lots of funding, alignment seems not to have been very successful at attracting top talent to date. Quant firms, on the other hand, have become known for very successfully acquiring talent and putting it to work on difficult conceptual and engineering problems. Although the need for buy-in to alignment before one can contribute is often cited as a reason for this difficulty, buy-in is, if anything, even more of a problem for quant firms, since very few people are inherently interested in quant trading as an end in itself. As such, importing some of this know-how could substantially improve alignment hiring and onboarding efficiency.
AI alignment: Evaluate the extent to which large language models have natural abstractions
The natural abstraction hypothesis is the hypothesis that neural networks will learn abstractions very similar to human concepts, because these concepts are a better decomposition of reality than the alternatives. If it holds in practice, it would imply that large NNs (and large LMs in particular, due to being trained on natural language) learn faithful models of human values, and it would bound the difficulty of translating between the model and human ontologies in ELK, avoiding the hard case of ELK in practice. If the natural abstraction hypothesis turns out to be true at relevant scales, this would allow us to sidestep a large part of the alignment problem; if it is false, we would know to avoid a class of approaches that would be doomed to fail.
We'd like to see work towards gathering evidence on whether the natural abstraction hypothesis holds in practice and how this scales with model size, with a focus on interpretability of model latents, and experiments in toy environments that test whether human simulators are favored in practice. Work towards modifying model architectures to encourage natural abstractions would also be helpful towards this end.