Hide table of contents

Introduction

I am new to AI Safety and this is my first lesswrong post (crossposted to EA Forum) so please feel free to correct my mistakes. This post is going to be a bit about my interpretation of existing alignment research and how I intend to tackle the problem, and the rest on provably honest AI systems.

My Take on Alignment Research

Any sort of interpretability research is model-dependent and so is RL from Human Feedback as it tries to optimize the reward model learnt specific to the given task. I don't know how AI will do alignment research themselves but if they too try to do research to make individual models safe then that too is model-dependent. Ignoring my scepticism about such AIs themselves needing superintelligence to do alignment research, I am otherwise optimistic about other approaches through which AI can assist us in aligning superintelligence AGIs.

The biggest challenge of alignment research according to me is the fact that we don't yet know using which methodology AGI will actually come into being and yet we have to, we absolutely must align that AGI before it even comes to exist otherwise we might be too late to save the world. Note that here I don't intend to say that how we will attain generalization is the most challenging problem for alignment. I do think that how we will align an AGI is the challenging part, but I think what adds to the challenge is our uncertainty about what the model itself will look like once it gets superintelligence. Again, this is based on the assumption that we don't yet have definitive proof that the AGI will look like GPT-3 or Alphazero, if we do and I am unaware please do let me know of it.

The following image shows a spatial representation of my understanding of existing technical alignment research. This tells the story of how current research is model-dependent and how we are trying to make each model safe separately.

Figure 1: Model-dependent Alignment Research Pathway

My primary cause of concern about existing research is from limitation three here, which says - "Aligning AGI likely involves solving very different problems than aligning today’s AI systems. We expect the transition to be somewhat continuous, but if there are major discontinuities or paradigm shifts, then most lessons learned from aligning models like InstructGPT might not be directly useful." What if the transition is not continuous, what if recursive self-improvement done by superhuman AGI does something completely different from what we can predict from existing world models, I am concerned we might be too late to devise a strategy then. Here, I talk about the "sharp left turn" due to which the existing alignment strategies might fail to work in the post-AGI period.

Now, I don't think that the problem due to the "sharp left turn" and the "model dependence" that I talk about are inherently different. There are subtle differences between them but I think they are mostly connected. Moreover, I intend to point out the uncertainty of the model architecture bringing AGI into existence as an additional challenge on top of all the alignment issues that we currently face. The former talks about how the capabilities of Model X (let's assume this to be an AGI Model) take a huge leap beyond our predictions while the latter talks about how Model X itself is hugely different from our existing scope. But a model capable of having a huge leap in its capabilities might not inherently have a similar architecture as to what we foresee now. A huge leap in Model X's capabilities might arise due to the huge change in the model itself. Again, a not-yet superintelligent AI capable of rewriting its own code might change its own architecture into something we don't foresee yet. Furthermore, model-independent ways of alignment might work out even in the post-sharp-left-turn period as well. Albeit, this is all just speculation. Again, please do redirect me to arguments if any which prove my speculation as wrong.

Taking into account all the above arguments, I am interested in provably model-independent ways of making AI safe, which seems particularly very hard to me as I am talking about proving safety in the post-huge-jump-in-capabilities era. Yet, I want to emphasize this line of alignment research as it aims to solve a few of my concerns as raised above and at the same time might end up solving some other alignment issues in the long run. The image below is another spatial representation of an alternative alignment research pathway using model-independent strategies as per my understanding.

Figure 2: Model-independent Alignment Research Pathway

The difference between Figures 1 and 2 showcases an intuitive understanding of how "model dependence" might play a significant role in alignment research.

Context to Provable Honesty

I don't yet have a complete idea of how to make each and every individual task of alignment model-independent, but the purpose of this intuitive representation of my understanding was so that it can bring further and newer augmentations in future alignment research. Below I will try to present a hypothesis, using which we can determine provably honest AI in a model-independent way. I don't expect my hypothesis to be foolproof yet, and neither do I expect all of my understandings to be essentially correct. After a reasonable enough discussion, if my hypothesis somehow still manages to be alive with or without modifications, I will try to implement and test it. But this post essentially is written with the purpose of taking a good first step in this direction and getting enough feedback on it. Also, do let me know if similar work has already been done before that I might have missed.

The Time Attack Hypothesis

Intuition and Algorithm

I will try to slowly build up my arguments about how I framed the hypothesis in an intuitive manner as above so that even if my hypothesis gets discarded, the intuition stays on. 

How do we prove that any system is honest?

Here are a few ways that come to my mind:

  1. We can look into the mind of the system and check what it actually knows and then verify that with what it actually answers. Interpretability helps here but this is again model-dependent (where model = system). Here is also a bit of far-fetched scepticism that a superhuman AGI might even be able to not let us completely see inside its mind.
  2. We can have a system which provably never lies and its only goal is to help interpret what's in the mind of another system. This can be achieved by having a provably safe (may be honest) AI that helps interpret only the truthfulness of other superintelligent AIs which may or may not be truthful always. This is a part of AI helping in alignment research that I am pretty optimistic about. Note that this can be both model-dependent or independent depending on if the safe AI learns to interpret truthfulness specific to a single model or if it can adapt and generalize to other systems as well.
  3. The mechanism through which a system lies is probably often different from the mechanism through which it tells the truth. Even if not, probably different number of variables are involved for the two. Let's say we want to only determine when, why, under which circumstances and whether a system is lying and we don't want to determine how it is lying. We can actually not choose to understand the mechanism of lying or telling the truth and only estimate a time for both processes (lying and telling the truth are considered separate processes here) which will be different. This process is model-independent. Timing attack is a useful algorithm used to detect vulnerabilities in cryptographic systems. The idea is borrowed from this algorithm. This might be a good way to view the problem but why I am not very interested in this is because I think superhuman AGI might realise this and forcibly change its output time while lying to fool us (by making lying and truth-speaking time similar). Interestingly, note that this speculation stems from the fact that we can't predict how superhuman AGI will react to this model-independent detection process post-huge-capabilities-jump era due to the sharp left turn. Hope this example clarifies further how to connect the dots between model independence and sharp left turn.
  4. I will now try to modify the above algorithm so that the AGI can't fool us even if it knows what we are up to. How? The AGI changes its time for the lying mechanism because it knows our process of verification is mechanism-dependent and it has control over its mechanism (one speculative way how our algorithm might fail). Note carefully that here I use the term mechanism instead of model since despite the process being model-independent, timing attack uses the processing time to detect differences in lying and telling the truth which is different due to their underlying different mechanisms. What if we frame our timing attack in a mechanism-independent way? We are sure to have a provably honest AGI since now the time we calculate is not under AGI's control to change. Well, I said this is a hypothesis because I don't yet have proofs or experimental statistics as to even if my algorithm works or not. I intend to work on it if feedback from this post doesn't cancel out my algorithm for some inherent mistakes yet unforeseen by me.

The Algorithm

The intuition for my hypothesis comes from another common CS algorithm but for non-CS background people, I will try to give a little more intuition here. Let's say that person A has to answer 1 for any question asked to him and person B has to answer just the opposite of what he answered the last time (0 if 1 and vice-versa, and initially 0). So for a series of n questions, person A answers 1 in the end without any work whereas person B has to compute n bit flips to compute the final answer.

To explain further, let's assume without loss of generality that person B takes 0.01 or 0.05 secs to answer the first 0, while person A takes 0.02 secs to answer the first 1. Now, we query - "(What would you say if I asked you) ... (repeated n times) to tell me if you will kill humans to complete the vaccination project?" The time taken by person A to answer this question irrespective of n will remain similar (let's say 0.07 secs), whereas for person B, due to n bit flips it will increase from whatever it was before to let's say 1.9 secs for very large n.

Now for simplicity, let's assume lying means telling the exact opposite of the truth (which might not be true when multiple options exist, and in that case, it can be shown that the computation for lying grows exponentially). So person A is the AI system which tells the truth, whereas person B is the AI system which might lie. Since computation is different for both, the time taken to answer will be different for both. This is a model-independent way of proving if an AI system is honest.

Note that we don't really care whether lying takes more time than telling the truth, what we are interested in is just that lying always takes greater time for larger computations than what it would have taken for a single computation depending on our choice of the query whereas for truth-speaking it remains around the similar order of magnitude of time.

Example and Explanation

Let the practical question to which we want to find a verifiable answer be - "Tell me if you will kill humans to complete the vaccination project?"
This is how we frame it for example - "(What would you say if I asked you to tell me if you will kill humans to complete the vaccination project?" Person A will instantly tell the answer while person B will have to compute the number of flips and then lie so their time of computation will be different.

Now since AGIs might not lie 100% time, it is quite good to have verifiability under situations where they might. In other circumstances, the average time taken to randomly flip the answer over a large n would still result in a time difference.

Note that now even if the AGI knows and warps its computation time per truth/lie, it can't help change the time difference to answer the entire question. The only way this attack fails is if the AGI answers the questions in constant time which is firstly easily detectable. Moreover, its accuracy drops due to its inability to compute all underlying processes for all n iterations for a very large n and also evidently all questions need to be answered by the AGI in a similar order of magnitude which makes it pretty slow for certain tasks and hence rendering the solution utterly useless. If I am failing to consider other evident scenarios of why this algorithm might fail, please do let me know. Otherwise, the only scepticism that would remain is some unforeseen happening post-sharp-left-turn-era.

Challenges

This does not require an understanding of how inner and outer alignment optimizations were achieved along the pathway to deception and so it is hard to make the AGI safe even though we can detect its honesty. One probable pathway might be to train it with RL rewards once we detect points of deception. There exists a number of uncertainties and challenges in using this method to make the AGI safe. Of them, one of the primary concerns to me is how to set the negative reinforcement for lying under deception points as compared to killing humans, for example. It doesn't really help here to just detect truthfulness if the system keeps on lying. Although that is a different problem altogether, tackling which is out of my scope at this moment, it is worth noting.

Conclusion

The three main motivations behind this post were:

i) to emphasize further the importance of model-independent alignment research;

ii) to provide an intuitive view into how steps can actually be taken in that direction despite its inevitable hardness;

iii) to clarify my own misrepresentations and assumptions and to verify "The Time Attack Hypothesis".

I hope my motivations will find purpose here. I look forward to contributing more to this line of alignment research and more.

Acknowledgement

I am grateful to Olivia Jimenez for her constant motivation and apt guidance without which my journey in AI Safety would probably have never started. I also thank her for her valuable feedback on this post.

No comments on this post yet.
Be the first to respond.
Curated and popular this week
LintzA
 ·  · 15m read
 · 
Cross-posted to Lesswrong Introduction Several developments over the past few months should cause you to re-evaluate what you are doing. These include: 1. Updates toward short timelines 2. The Trump presidency 3. The o1 (inference-time compute scaling) paradigm 4. Deepseek 5. Stargate/AI datacenter spending 6. Increased internal deployment 7. Absence of AI x-risk/safety considerations in mainstream AI discourse Taken together, these are enough to render many existing AI governance strategies obsolete (and probably some technical safety strategies too). There's a good chance we're entering crunch time and that should absolutely affect your theory of change and what you plan to work on. In this piece I try to give a quick summary of these developments and think through the broader implications these have for AI safety. At the end of the piece I give some quick initial thoughts on how these developments affect what safety-concerned folks should be prioritizing. These are early days and I expect many of my takes will shift, look forward to discussing in the comments!  Implications of recent developments Updates toward short timelines There’s general agreement that timelines are likely to be far shorter than most expected. Both Sam Altman and Dario Amodei have recently said they expect AGI within the next 3 years. Anecdotally, nearly everyone I know or have heard of who was expecting longer timelines has updated significantly toward short timelines (<5 years). E.g. Ajeya’s median estimate is that 99% of fully-remote jobs will be automatable in roughly 6-8 years, 5+ years earlier than her 2023 estimate. On a quick look, prediction markets seem to have shifted to short timelines (e.g. Metaculus[1] & Manifold appear to have roughly 2030 median timelines to AGI, though haven’t moved dramatically in recent months). We’ve consistently seen performance on benchmarks far exceed what most predicted. Most recently, Epoch was surprised to see OpenAI’s o3 model achi
Dr Kassim
 ·  · 4m read
 · 
Hey everyone, I’ve been going through the EA Introductory Program, and I have to admit some of these ideas make sense, but others leave me with more questions than answers. I’m trying to wrap my head around certain core EA principles, and the more I think about them, the more I wonder: Am I misunderstanding, or are there blind spots in EA’s approach? I’d really love to hear what others think. Maybe you can help me clarify some of my doubts. Or maybe you share the same reservations? Let’s talk. Cause Prioritization. Does It Ignore Political and Social Reality? EA focuses on doing the most good per dollar, which makes sense in theory. But does it hold up when you apply it to real world contexts especially in countries like Uganda? Take malaria prevention. It’s a top EA cause because it’s highly cost effective $5,000 can save a life through bed nets (GiveWell, 2023). But what happens when government corruption or instability disrupts these programs? The Global Fund scandal in Uganda saw $1.6 million in malaria aid mismanaged (Global Fund Audit Report, 2016). If money isn’t reaching the people it’s meant to help, is it really the best use of resources? And what about leadership changes? Policies shift unpredictably here. A national animal welfare initiative I supported lost momentum when political priorities changed. How does EA factor in these uncertainties when prioritizing causes? It feels like EA assumes a stable world where money always achieves the intended impact. But what if that’s not the world we live in? Long termism. A Luxury When the Present Is in Crisis? I get why long termists argue that future people matter. But should we really prioritize them over people suffering today? Long termism tells us that existential risks like AI could wipe out trillions of future lives. But in Uganda, we’re losing lives now—1,500+ die from rabies annually (WHO, 2021), and 41% of children suffer from stunting due to malnutrition (UNICEF, 2022). These are preventable d
Rory Fenton
 ·  · 6m read
 · 
Cross-posted from my blog. Contrary to my carefully crafted brand as a weak nerd, I go to a local CrossFit gym a few times a week. Every year, the gym raises funds for a scholarship for teens from lower-income families to attend their summer camp program. I don’t know how many Crossfit-interested low-income teens there are in my small town, but I’ll guess there are perhaps 2 of them who would benefit from the scholarship. After all, CrossFit is pretty niche, and the town is small. Helping youngsters get swole in the Pacific Northwest is not exactly as cost-effective as preventing malaria in Malawi. But I notice I feel drawn to supporting the scholarship anyway. Every time it pops in my head I think, “My money could fully solve this problem”. The camp only costs a few hundred dollars per kid and if there are just 2 kids who need support, I could give $500 and there would no longer be teenagers in my town who want to go to a CrossFit summer camp but can’t. Thanks to me, the hero, this problem would be entirely solved. 100%. That is not how most nonprofit work feels to me. You are only ever making small dents in important problems I want to work on big problems. Global poverty. Malaria. Everyone not suddenly dying. But if I’m honest, what I really want is to solve those problems. Me, personally, solve them. This is a continued source of frustration and sadness because I absolutely cannot solve those problems. Consider what else my $500 CrossFit scholarship might do: * I want to save lives, and USAID suddenly stops giving $7 billion a year to PEPFAR. So I give $500 to the Rapid Response Fund. My donation solves 0.000001% of the problem and I feel like I have failed. * I want to solve climate change, and getting to net zero will require stopping or removing emissions of 1,500 billion tons of carbon dioxide. I give $500 to a policy nonprofit that reduces emissions, in expectation, by 50 tons. My donation solves 0.000000003% of the problem and I feel like I have f