
tobycrisford 🔸

tobycrisford.github.io/

Comments

That makes a lot of sense, thanks.

I'm sorry you've said you regret your engagement, since I've found your comments helpful (the link to AISLE's OpenSSL zero days has shifted my view on this a fair bit).

I guess this whole discussion does just feel like a classic example of "All debates are bravery debates".

Thanks for the detailed reply. I think I understand your point clearly now!

But $20,000 for *all* of the OpenBSD bugs (not just the published ones) doesn't sound like that much to spend on inference compute to me. If AISLE could have spent the same and made an equally impressive announcement, unearthing enough bugs at once that government ministers around the world started issuing statements about it, then shouldn't they have been able to find investors to fund that? That would have been incredible publicity for them.

The crux for me seems to be whether they have made equally impressive announcements, as you suggest they might have done. Maybe they're just worse at marketing. I don't know enough to evaluate that claim properly, but that does seem the relevant question here: have Anthropic been able to use Mythos to go significantly beyond what the best harnesses could already achieve with existing models for the same inference spend? I thought the answer was a clear yes, and I didn't find the original linked AISLE writeup very convincing at all. Your comment has made me more uncertain, but has still not convinced me, and I'd be really interested to read something more in depth on that question. (Maybe we also would disagree about what the word 'significantly' means here, since I guess you are acknowledging it probably represents some improvement).

(Also, I'd push back a bit on your characterization of AI progress. I agree the scaffolding is extremely important, but in my experience the "paradigm shifts" in capability over the last two and a half years I've been working with them have come from the models)

(And extra comment: the fact that cybersecurity capabilities might not imply imminent superintelligence takeoff seems an entirely independent point that I don't necessarily disagree with)

On the take by AISLE, maybe I'm missing something here, but if their headline claim was correct (that the harness is more important than the model), shouldn't they have been able to find the vulnerabilities that Anthropic hasn't published? Or find hundreds more similarly impactful ones?

Re-discovering the ones Anthropic had already published seems much less impressive, because there are lots of ways to cheat, and from their write-up it sounded to me like they were essentially admitting that they had cheated.

Of course Anthropic could be lying about the existence or significance of the vulnerabilities they haven't published. But they have committed in advance to what those vulnerabilities are (I think they have already made some kind of cryptographic commitment to their unpublished write-ups..?), which seems impressive to me.

Either they have used the new model to find significant vulnerabilities in every major OS and browser that are too dangerous to be released, or they haven't. If they have, it seems genuinely scary and impressive (not just marketing hype), because I'm not aware that people working on fancy harnessing have had similar results (or have they?). And if they haven't, then it's a very weird marketing ploy, because they're going to get found out very quickly!

I think this misunderstands what people mean when they compare arguments about the importance of AI safety to a Pascal's wager.

Pascal's wager refers to situations where a tiny probability of enormous value seemingly leads to ridiculous conclusions if you try to do naive expected value calculations with it. When people say that strong longtermism is a Pascal's wager, the "small probability" they are talking about is not the probability of extinction, which, as you point out, is significant. The small probability is the probability that the future will contain "septillions of future sapients". And that probability gets even smaller if the probability of extinction soon is high! So a large probability of extinction this century makes the Pascal's wager comparison more relevant as a critique of strong longtermism, not less. It is multiplying this small probability by the value of those septillions of potential "sapients" that gives you the astronomical value that says existential risk reduction should almost automatically dominate our concerns.
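To put toy numbers on the structure of that argument (these figures are purely illustrative, not anyone's actual estimates): even a one-in-a-million credence in a future containing $10^{24}$ sapients contributes an expected-value term of roughly

$$10^{-6} \times 10^{24} = 10^{18}$$

expected future lives, which would swamp any plausible near-term figure in a naive expected value calculation. That multiplication is what is doing the work in the Pascal's wager comparison.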

I think you're completely right to point out that people should care a lot about things which might carry a 10% chance of causing human extinction, even ignoring their stance on longtermism. But some people believe that reducing existential risk matters astronomically more than just the impact it would have on the next few generations, and that therefore tiny changes in the probability of existential risk almost automatically trump any other concern, however small those changes are. When people talk about Pascal's wager in the context of strong longtermism or AI safety, I think it is this claim that they are challenging, not the claim that we should care about extinction at all. And that criticism is just as valid, actually more valid, if the probability of extinction from AI is high (though I of course agree that if there are people who use the Pascal's wager argument to dismiss all work on AI risk then they are making a serious mistake).

I agree with your title, but I don't think negative utilitarianism is the answer. I like Toby Ord's essay on this, "Why I'm Not a Negative Utilitarian": https://www.amirrorclear.net/academic/ideas/negative-utilitarianism/ 

On your argument about tradeoffs, people make choices all the time where they accept some very small risk of some very severe suffering in order to increase their happiness by a modest amount. For example: cycling along a busy road to visit their friend. If you say that no amount of happiness can make up for the trauma of being involved in a serious accident, then it seems like you are forced to say that this choice is wrong. That seems like a strange conclusion to me.

It's really cool that you've done this and released the code!

Am I understanding right that the GiveWell baseline you're trying to beat used GPT, while your approach uses Claude? How can you be sure that the improvements aren't due to the model choice, rather than the architecture?

Sorry for the very delayed reply to this. I meant to reply at the time and then it slipped my mind!

Yes, you've summarised my position perfectly, I like those diagrams!

I guess my deeper point was that I wasn't sure there was any meaningful way to say something like "X is twice as painful as Y" without defining it via choices among gambles or durations. You say for humans it seems real, but does it? I can definitely introspect and discover that X is more painful than Y, but I'm not sure I can introspect and discover that it is N times as painful. Where does that number come from?

Although as I was thinking more about how to justify this, I started thinking about other sensory experiences, like sound. Is it meaningful to say that "X feels twice as loud as Y", in a sense that doesn't have to line up with the intensity of the physical sound wave? And then I remembered my physics lessons from way back, and realised the answer might be yes. I was definitely taught that the reason we measure sound volume on a log scale (decibels) is that it lines up better with our sensory perception of it (you have to square the intensity of the sound wave in order to double the perceived intensity). But if this is true then it means there is some sense in which we can introspect and say "X sounds twice as loud as Y", even though the underlying sound wave might not be twice as intense. And if that is the case then maybe this should also be true for pain.
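(To spell out the arithmetic behind that recollection, assuming, as an idealisation, that perceived loudness is simply proportional to the decibel level: if

$$L(I) = k \log\left(\frac{I}{I_0}\right),$$

then asking for $L(I') = 2L(I)$ gives $I'/I_0 = (I/I_0)^2$, i.e. in units where the reference intensity is 1, you have to square the physical intensity to double the perceived loudness. Whether perception really tracks the decibel scale that cleanly is exactly the bit I'm unsure about.)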

I'm still very uncertain about this though. If I listened to different sounds and tried to place them on a numerical scale, I'm not really sure what it is that I'd actually be doing.

Thank you for your reply and clarification!

If the claim is that the gap between 'Disabling' and 'Excruciating' should be larger than the gap between 'Annoying' and 'Hurtful', then that makes sense to me, and seems interesting.

But it sounds like this wasn't a numerical scale to begin with? So this again just feels like a claim about how we should go about assigning numbers to those categories (if we need numbers), rather than a claim that pain unpleasantness is 'superlinear' in some objective sense?

Defining what a numerical score for pain means seems like a hard problem. From my perspective, it seems like it should be defined so that the being concerned would be indifferent between a day of 2*x and 2 days of x. I think this is the notion you are referring to as 'unpleasantness'. The question then for any other pain metric is just: "how well does it measure this?". I'm still not sure it makes sense to ask "How does pain intensity scale with unpleasantness?", since then we would first have to define a numerical scale for pain intensity in some different way, and I'm still not sure how we begin to do that?

I suppose there is another interesting complication here, which is that you could also try to define your pain scale in terms of preferences among gambles. For example, the pain scale should be defined so that a rational being is indifferent between 100% chance of x and a 50% chance of 2*x. And then you're confronted with the question of whether this should give you the same answer as defining it in terms of preferences among durations. My feeling is that it should be the same (something about personal identity not being a 'further fact' and applying the standard utilitarian aggregation approach to person-moments rather than persons..?) but it would be interesting to explore points of view where those two potential scale definitions are different. That doesn't feel quite the same as 'intensity' vs 'unpleasantness' though. More like two different definitions of 'unpleasantness'.
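(A minimal way to formalise the two candidate definitions, assuming disvalue adds up over time and that choices under uncertainty follow expected value: the duration definition, indifference between 1 day at level $2x$ and 2 days at level $x$, requires

$$s(2x) \cdot 1 = s(x) \cdot 2,$$

while the gamble definition, indifference between $x$ for certain and a 50% chance of $2x$, requires

$$s(2x) \cdot 0.5 = s(x) \cdot 1.$$

Both pin down the same ratio, $s(2x) = 2\,s(x)$, which is why I suspect the two definitions collapse into one once you aggregate over person-moments, though I'd want to spell out the assumptions behind that more carefully.)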

I'm confused about what "superlinearity" is even supposed to mean here.

In the intro you distinguish "unpleasantness" and "intensity", and say that one grows superlinearly with the other, but how are these two things even defined to begin with? And what is the difference between them? Defining one scale for measuring pain is hard enough, but before we can evaluate this "superlinear" claim we first need to define two!

In the examples with humans, I can see what the claim is. There are at least two ways you could try to define a pain scale: (i) self-report on a scale of 1-10, and (ii) something that more consistently tracked actual preferences with respect to gambles or experiences of different duration, and in this example the claim is that (ii) grows super-linearly with (i).

But this just seems like a claim about the limitations of the self-report 1-10 scale, which is only relevant for humans (I think I'm probably agreeing with the summary of Bob Fischer's take here).

In the case of non-humans, it's not that I disagree, but I don't even understand what claim is being made?

If I understand right, the claim you're making here is that if I give £10 to a GiveWell charity, I cause Dustin Moskovitz to give £10 less to that GiveWell charity, and do something else with it instead. What else does he do with it?

  • Donate it to a different global health charity - Ok, doesn't seem like too big a deal, my counterfactual impact is still to move money to a highly effective global health charity
  • Spend it on himself - Seems unlikely..?
  • Donate it to a different cause area, e.g. AI safety - so while I think I have supported global health, the counterfactual impact is actually to move more money to AI safety.

The second two possibilities seem surprising and important if true, and I'd be interested to hear more justification for this! Is there some evidence that this is really what happens?
