Hide table of contents

This is a follow-up to my August 2025 post.

This post contains all the AI safety research and project ideas I've had in the past 10 months which I think could be high impact. I’m sharing them in case any are helpful or generative for others. I don’t plan to pursue most of these myself as there are just too many for me to do them all justice. All of these ideas are at least somewhat novel. Many could be expanded into full essays, policy proposals, or longer-term research programmes. Also, a warning: some of them are a bit weird!

Field infrastructure, mapping, and neglectedness

There should be a live, regularly-updated and highly visual database of AIS research questions and the latest progress on each of them. The nearest existing thing is the EA Midwest list-of-lists; it could be far more. Could plausibly be made and maintained mostly with AI scraping and auto-categorisation. A database like this would allow people to easily see the state of research on a given subtopic, and where there is progress to be made. I suspect that the majority of updating could be automated. AISafety.com's map separates listings into categories with an Airtable backend and a graveyard for discontinued projects, plus aisafetyfeed (where AI helps summarise, tag, and rate the novelty of content) and trecursive's tree-maps. None nail "questions + live progress on each".

Similarly, there should be a live, regularly-updated and highly visual database of proposed interventions in AI safety, which tracks how many people are working on each intervention, and roughly how much time they're on each intervention. This would allow us to more quantitatively assess neglectedness. Landscape maps and 80k problem profiles gesture at neglectedness, but a quantitative time-allocation tracker I couldn't find. These two projects could even be combined into one comprehensive database.

ChatGPT image generator's mockup of these two website ideas combined

Which disciplines have had the least contact with AI safety, and might they have anything to contribute? See this Claude deep research artefact: https://claude.ai/public/artifacts/55bde7d3-2216-43ea-83f0-9857e1e48750 

Are attractor basins from AI use a risk vector that curtails genuine innovation in AIS itself? One candidate antidote: deliberately spending time in layers of reality that are far from AI (eg spending time in nature) in order to tap into sources of inspiration which lack these attractor basins. Adjacent to model-collapse and LLM-homogenisation-of-thought work, but the specific application to safety-research creativity I haven't seen.

Intervention prioritisation and AI safety strategy

Given the high degree of disagreement among experts regarding which AI safety interventions are most promising, would it be helpful for intervention comparisons to factor in interactions between interventions (synergies, clashes) and viability across broad timelines? From what I've been able to gather, these factors aren't often taken into account. For example, mechanistic interpretability and evals may be mutually reinforcing, because better interpretability can improve the design of evals while evals can help identify where interpretability work is most urgently needed. By contrast, aggressive public campaigning for a pause might clash with quiet institutional work if it makes policymakers or labs more defensive, though it might also expand the Overton window in ways that make moderate regulation easier. Timeline-robustness also matters: some interventions may dominate on short timelines because they can be deployed quickly, while others may only pay off on longer timelines because they require deep scientific or institutional maturation. The point would be to figure out which interventions are most synergistic with other interventions, and which remain viable across both short and long timelines. Portfolio/"defence in depth" framing is standard; formal interaction-matrices don't seem to exist.

Should frontier-AI-company employees strike, with demands of their companies committing to safety?

To whatever extent frontier AI's energy and resource consumption breaks traditional climate forecasts, how should AIS strategy adapt for the likely increased resultant geopolitical and environmental instability?

Recursive self-improvement, AI cognition, and interpretability

Can recursive self-improvement be (roughly) simulated through an LLM repeatedly improving its system prompt, in order to study alignment implications? This would not reproduce full RSI, since the model’s weights, architecture, training data, and capabilities remain fixed while only its instruction context changes. Still, prompt-level self-modification could serve as a useful toy model for studying alignment dynamics, especially how goals, constraints, failure modes, and deceptive or unstable behaviours might shift across iterative improvement cycles. Prompt-level self-improvement exists (Promptbreeder, self-refine, self-rewarding LMs); as a deliberate alignment toy-model for RSI dynamics it's less done. Reasonable small project, modest novelty.

Singularity timing, civilisational trajectory, and macro-instability

If the world is currently getting worse (a perspective which is of course extremely subject to debate), is there an argument that postponing the singularity makes things worse? untouched. The "opposite of a long reflection": scale up today's not-completely-terrible values before they degrade. In that framing, delay is not merely caution, but an active choice to let worse norms, more brittle institutions, and more desperate incentives become the substrate from which superintelligence eventually emerges.

Governance, institutions, and political strategy

What are the implications of evidence and theory of international organisations for the international network of governmental AI safety institutes? For example, good leadership is one of the factors which seems to predict robustness of international organisation regardless of particular threatening circumstances-- how good is the leadership of AISIs, and how could we improve it?

Has political polarisation become so strong as to mean that, contrary to the concern that AI safety should remain unpoliticised, AI safety may in fact only become salient if it becomes a partisan issue? Worth exploring risks and benefits of deliberately allying AIS with one political faction.

If states each had oracle/ASI access and could foresee likely conflict outcomes, might that force negotiation and prevent war? Eg:

  • There's no incentive to fight a war which both sides can see will likely lead to mutual ruin. Maybe that will force negotiation?
  • There's also no incentive for a state to defend themselves if they can see that they'll lose anyway. Does this increase risk of colonial dynamics?
  • Orcale AI likely would not eliminate war entirely, because states may still:
    • face commitment problems; even if both sides know what deal should prevent war, one side may not believe the other will stick to it later
    • distrust each other’s oracles
    • manipulate inputs
    • fight for domestic-political reasons rather than national strategic gain. Leaders do not always maximise “national welfare”; they may maximise regime survival, personal power, ideological legitimacy, elite support, or avoidance of humiliation

Embodiment, robotics, and AI in simulated environments

Is physical AI / robotics a fundamentally different ballpark from purely digital AI for safety purposes? We cannot necessarily assume 1:1 transfer of digital-AI safety theory results to embodied systems. We might have to develop good understanding and control from some degree of first principles. It would be good and interesting to think through AIS for robotics carefully, thinking about how each aspect of AIS interacts with robots. How important is the conceptual and linguistic gap between AI safety and robotics, and how do we close it? And do we need organisations equivalent to Anthropic, Redwood etc for robotics?

Is there a risk of AIs escaping online videogames or virtual environments onto the internet? Particularly acute for villain characters who use 'evil' LLM persona to shape their dialogue and behaviour. This seems like one of the most plausible environments in which ‘evil’ LLM personas may be deliberately designed.

Moral patienthood, digital welfare, and s-risks

Leading figures in AI think we might need to simulate emotion in order for AI to be truly smart and safe. Is this possible without inducing sentience?

  • Two prominent claims from the past year sit oddly well together. In his November 2025 conversation with Dwarkesh Patel, Ilya Sutskever argued that what current models lack, and what blocks human-level generalisation, is something like a value function modulated by emotion: he reached for the textbook case of a patient whose emotional processing was damaged and who consequently became catastrophically bad at ordinary decisions. Separately, since his August 2025 AI4 keynote, Geoffrey Hinton (a 'godfather' of AI) has been arguing that the only stable way to keep a superintelligence safe is to give it something like genuine care for us, because external control of a more capable agent eventually fails.
  • Put them next to each other and you get a single design pressure pointing in one direction: build emotion into the system. A question, then, is whether we can get the functional benefits of emotion without the thing that makes emotion morally consequential: sentience. Can a system model affect the way current networks model cognition, reaping the decision-pruning and the motivational grounding, while the lights stay off inside? One really interesting aspect of this is that it seems to really stretch the limits of the concept of a philosophical zombie: it’s one thing to imagine an entity that walks and talks without internal experience, and quite another to imagine one that exhibits cognition without experiencing it, but the idea of an ‘unfeeling feeling machine’ really does challenge our notions of what constitutes sentience.

Do digital minds research and brain organoid research have anything to learn from each other? Eg can we develop a shared, substrate-neutral framework for assessing when a biological organoid or digital system becomes a possible moral patient? How can welfare-relevant states be assessed in systems that cannot verbally report pain, pleasure, distress, boredom, or preference? What would count as evidence of negative valence in a brain organoid or a digital system? Etc; these are random examples I generated with an LLM, there are probably much higher-impact intersections.

Do near-future videogames pose uniquely severe s-risks? Many (possibly millions) of NPCs in near-future videogames might run on LLMs, which may be sentient, or more advanced architectures which may be even more likely to be sentient. Videogames are possibly the only context in which AI systems might be deliberately tortured (think about how NPCs in games like Grand Theft Auto are treated!). This could suggest that videogames will be an environment particularly susceptible to becoming sites of s-risk. Brian Tomasik wrote on suffering in RL agents and video-game characters years ago; the s-risk field (CLR) owns the scale argument. But these concepts together, combined with increasing genuine possibility of sentience in advanced AI now as well as increasing experimentation with use of LLMs in NPCs, poses a novel risk vector.

2

0
0

Reactions

0
0

More posts like this

Comments
No comments on this post yet.
Be the first to respond.
Curated and popular this week
Relevant opportunities