I think this is a very good point, and it definitely gives me some pause—and probably my original statement there was too strong. Certainly I agree that you need to do evaluations using the best possible scaffolding that you have, but overall my sense is that this problem is not that bad. Some reasons to think that:
Cross-posted from LessWrong.
One reason I'm critical of the Anthropic RSP is that it does not make it clear under what conditions it would actually pause, or for how long, or under what safeguards it would determine it's OK to keep going.
It's hard to take anything else you're saying seriously when you say things like this; it seems clear that you just haven't read Anthropic's RSP. I think that the current conditions and resulting safeguards are insufficient to prevent AI existential risk, but to say that it doesn't make them clear is just patently false...
I think calling a take "lazy", which could indeed be considered "mean", is not a very helpful approach; you could have made your point without that kind of derision. There are going to be a lot of misunderstandings and hot takes around RSPs, and I think AI company employees especially should err heavily on the side of patience and kind understanding if they want to avoid people becoming more adversarial towards them.
Live by the sword, die by the sword.
Akash said...
"that it does not make it clear under what conditions it would actually pause, or for how long...
Cross-posted from LessWrong.
I found this post very frustrating, because it's almost all dedicated to whether current RSPs are sufficient or not (I agree that they are insufficient), but that's not my crux and I don't think it's anyone else's crux either. And for what I think is probably the actual crux here, you only have one small throwaway paragraph:
...Which brings us to the question: “what’s the effect of RSPs on policy and would it be good if governments implemented those”. My answer to that is: to an extremely ambitious version, yes; to the misleading version
Have the resumption condition be a global consensus on an x-safety solution or a global democratic mandate for restarting (and remember there are more components of x-safety than just alignment - also misuse and multi-agent coordination).
This seems basically unachievable, and even if it were achievable it doesn't even seem like the right thing to do—I don't actually trust the global median voter to judge whether additional scaling is safe or not. I'd much rather have rigorous technical standards than nebulous democratic standards.
...I think it's pushing it
This is soon enough to be pushing as hard as we can for a pause right now!
I mean, yes, obviously we should be doing everything we can right now. I just think that an RSP-gated pause is the right way to do a pause. I'm not even sure what it would mean to do a pause without an RSP-like resumption condition.
Why try and take it right down to the wire with RSPs?
Because it's more likely to succeed. RSPs provide very clear and legible risk-based criteria that are much more plausibly the sort of thing you could actually get a government to agree to.
...The tradeoff
But what about dangerous capabilities that have more to do with AI takeover (e.g., a company develops a system that shows signs of autonomous replication, manipulation, power-seeking, deception) or scientific capabilities (e.g., the ability to develop better AI systems)?
Supposing that 3-10 other companies are within a few months of these systems, do you think at this point we need a coordinated pause, or would it be fine to just force company 1 to pause?
What should happen there is that the leading lab is forced to stop and try to demonstrate that e.g. t...
Perhaps the crux is related to how dangerous you think current models are? I'm quite confident that we have at least a couple additional orders of magnitude of scaling before the world ends, so I'm not too worried about stopping training of current models, or even next-generation models. But I do start to get worried with next-next-generation models.
So, in my view, the key is to make sure that we have a well-enforced Responsible Scaling Policy (RSP) regime that is capable of preventing scaling unless hard safety metrics are met (I favor understanding-based...
What happens when a dangerous capability eval goes off—does the government have the ability to implement a national pause?
I think presumably the pause would just be for that company's scaling—presumably other organizations that were still in compliance would still be fine.
...If the narrative is "hey, we agree that the government should force everyone to scale responsibly, and this means that the government would have the ability to tell people that they have to stop scaling if the government decides it's too risky", then I'd still probably prefer stoppi
I guess I'm not really sure what your objection is to Responsible Scaling Policies? I see that there's a bunch of links, but I don't really see a consistent position being staked out by the various sources you've linked to. Do you want to describe what your objection is?
I guess the closest there is "the danger is already apparent enough" which, while true, doesn't really seem like an objection. I agree that the danger is apparent, but I don't think that advocating for a pause is a very good way to address that danger.
I tend to put P(doom) around 80%, so I think I'm on the pessimistic side, and I tend to think short timelines are at least a real and serious possibility that we should be planning for. Nevertheless, I disagree with a global stop or a pause being the "only reasonable hope"—global stops and pauses seem basically unworkable to me. I'm much more excited about governmentally enforced Responsible Scaling Policies, which seem like the "better option" that you're missing here.
In any case, I don't see any reason to think the neural net prior is malign, or particularly biased toward deceptive, misaligned generalization. If anything the simplicity prior seems like good news for alignment.
I definitely disagree with this—especially the last sentence; essentially all of my hope for neural net inductive biases comes from them not being like an actual simplicity prior. The primary literature I'd reference here would be "How likely is deceptive alignment?" for the practical question regarding concrete neural net inductive biases and ...
Public debates strengthen society and public discourse. They spread truth by testing ideas and filtering out weaker arguments.
I think this is extremely not true, and am pretty disappointed with this sort of "debate me" communications policy. In my opinion, public debates very rarely converge towards truth. Lots of things sound good in a debate but break down under careful analysis, and the pressure to say things that look good to a public audience pushes strongly against actual truth-seeking.
I understand and agree with the import...
At a sufficiently sophisticated technological level, vacuum decay actually becomes worthwhile, as it increases the total amount of available free energy. The problem is ensuring any sort of civilizational continuity before and after the vacuum decay—though, like any other physical process, vacuum decay shouldn't destroy information, so theoretically if you understood the mechanics well enough you should be able to engineer whatever outcome you wanted on the other side.
In my opinion, the best solution here is incentivizing people to voluntarily have more children—e.g. child tax credits, maternity/paternity leave, etc. If you don't think fetuses are moral patients, then the pro-natalist, longtermist, total utilitarian view doesn't distinguish between having an abortion and just choosing not to have a child, so I don't really see the reason to focus on abortion specifically in that case.
I don't really know what other purpose the "we must be very clear" here serves besides trying to indicate that you think it's very important that EA projects a unified front here.
I am absolutely intending to communicate that I think it would be good for people to say that they think fraud is bad. But that doesn't mean that I think we should condemn people who disagree regarding whether saying that is good or not. Rather, I think discussion about whether it's a good idea for people to condemn fraud seems great to me, and my post was an attempt to provide my (short, abbreviated) take on that question.
in a form that implies strong moral censure to anyone who argues the opposite
I don't think this and didn't say it. If you have any quotes from the post that you think say this, I'd be happy to edit it to be more clear, but from my perspective it feels like you're inventing a straw man to be mad at rather than actually engaging with what I said.
...You also said that we should do so independently of the facts of the FTX case, which feels weird to me, because I sure think the details of the case are very relevant to what ethical lines I want to draw in the
I think what we owe the world is both reflection about where our actual lines are (and how the ones that we did indeed have might have contributed to this situation), as well as honest and precise statements about what kinds of things we might actually consider doing in the future.
I actually state in the post that I agree with this. From my post:
In that spirit, I think it's worth us carefully confronting the moral question here: is fraud in the service of raising money for effective causes wrong?
Perhaps that is not as clear as you would like, but like...
I mean, indeed the combination of "fraud is a vague, poorly defined category" together with a strong condemnation of said "fraud", without much explicit guidance on what kind of thing you are talking about, is what I am objecting to in your post.
I guess I don't really think this is a problem. We're perfectly comfortable with statements like “murder is wrong” while also understanding that “but killing Hitler would be okay.” I don't mean to say that talking about the edge cases isn't ever helpful—in fact, I think it can be quite useful to try to be clear ...
That sounds like a fully generalized defense against all counterarguments, and I don't think that is how discourse usually works.
It's clearly not fully general because it only applies to excluding edge cases that don't satisfy the reasons I explicitly state in the post.
...If you say "proposition A is true about category B, for reasons X, Y, Z" and someone else is like "but here is an argument C for why proposition A is not true about category B", then of course you don't get to be like, "oh, well, I of course meant the subset of category B where argument C do
Adding on to my other reply: from my perspective, I think that if I say “category A is bad because X, Y, Z” and you're like “but edge case B!” and edge case B doesn't satisfy X, Y, or Z, then clearly I'm not including it in category A.
I think you're wrong about how most people would interpret the post. I predict that if readers were polled on whether or not the post agreed with “lying to Nazis is wrong” the results would be heavily in favor of “no, the post does not agree with that.” If you actually had a poll that showed the opposite I would definitely update.
I think the Nazi example is too loaded for various reasons (and triggers people's "well, this is clearly some kind of thought experiment" sensors).
I think there are a number of other examples that I have listed in the comments to this post that I think would show this. E.g. something in the space of "a Jewish person lies about their religious affiliation in order to escape discrimination that's unfair to them for something like scholarship money, of which they then donate a portion (partially because they do want to offset the harm that came from being disho...
My guess is most readers are more interested in the condemnation part though, given the overwhelming support that posts like this have received, which have basically no content besides condemnation (and IMO with even bigger problems on being inaccurate about where to draw ethical lines).
I think my post is quite clear about what sort of fraud I am talking about. If you look at the reasons that I give in my post for why fraud is wrong, they clearly don't apply to any of the examples of justifiable lying that you've provided here (lying to Nazis, doing the lea...
The portion you quote is included at the very end as an additional point about how even if you don't buy my primary arguments that fraud in general is bad, in this case it was empirically bad. It is not my primary reason for thinking fraud is bad here, and I think the post is quite clear about that.
I agree with this post from a moral perspective, though one thing it does not touch on is the legal question. My guess is that, in the same way a court probably wouldn't try to claw back money from a utility company/janitor/etc., FTXFF beneficiaries are also probably safe, but IANAL, so maybe somebody who knows more there could comment.
Jason has made comments on this issue with a number of points worth considering; I found this thread particularly eye-opening. I came away feeling that the risk of clawbacks shouldn't be ignored.
Geoffrey Miller also made the important point that "if there are any legal 'clawbacks' of money in the future, that would have to be done through official legal channels -- and they might not care that we've already sent money back somewhere for allegedly honorable reasons. So we might end up returning a bunch of money, and then being legally obligated to r...
That's a pretty wild misreading of my post. The main thesis of the post is that we should unequivocally condemn fraud. I do not think that the reason that fraud is bad is because of PR reasons, nor do I say that in the post—if you read what I wrote about why I think it's wrong to commit fraud at the end, what I say is that you should have a general policy against ever committing fraud, regardless of the PR consequences one way or another.
Anecdotally it seems like many of the world's most successful companies do try to make frugality part of their culture, e.g. it's one of Amazon's leadership principles.
Google, by contrast, is notoriously the opposite—for example emphasizing just trying lots of crazy, big, ambitious, expensive bets (e.g. their "10x" philosophy). Also see how Google talked about frugality in 2011.
One thing that bugged me when I first got involved with EA was the extent to which the community seemed hesitant to spend lots of money on stuff like retreats, student groups, dinners, compensation, etc., despite the cost-benefit analysis seeming to favor doing so pretty strongly. From my perspective, I felt like this was some evidence that many EAs didn't take their stated ideals as seriously as I had hoped—e.g. that many people might just be trying to act in the way that they think an altruistic person should rather than really carefully thin...
My anecdotal experience hiring is that I get many more prospective candidates saying something like "if this is so important why isn't your salary way above market rates?" than "if you really care about impact, why are you offering so much money?" (Though both sometimes happen.)
Precisely. Also, the frugality of past EA creates a selection effect, so probably there is a larger fraction of anti-frugal people outside the community (and among people who might be interested) than we would expect from looking inside it.
Great point! I think each spending strategy has its pitfalls related to signalling.
I think this correlates somewhat with people's knowledge of and engagement with economics, and with political lean. "Frugal altruism" will probably attract more left-leaning people, while "spending altruism" probably attracts more right-leaning people.
I agree that it’s possible to be unthinkingly frugal. It’s also possible to be unthinkingly spendy. Both seem bad, because they are unthinking. A solution would be to encourage EA groups to practice good thinking together, and to showcase careful thinking on these topics.
I like the idea of having early EA intro materials and university groups that teach BOTECs, cost-benefit analysis, and grappling carefully with spending decisions.
This kind of training, however, trades off against time spent learning about eg. AI safety and biosecurity.
Academic projects are definitely the sort of thing we fund all the time. I don't know if the sort of research you're doing is longtermist-related, but if you have an explanation of why you think your research would be valuable from a longtermist perspective, we'd love to hear it.
Since it was brought up to me, I also want to clarify that EA Funds can fund essentially anyone, including:
I'm one of the grant evaluators for the LTFF and I don't think I would have any qualms with funding a project 6-12 months in advance.
To be clear, I agree with a lot of the points that you're making—the point of sketching out that model was just to show the sort of thing I'm doing; I wasn't actually trying to argue for a specific conclusion. The actual correct strategy for figuring out the right policy here, in my opinion, is to carefully weigh all the different considerations like the ones you're mentioning, which—at the risk of crossing object and meta levels—I suspect to be difficult to do in a low-bandwidth online setting like this.
Maybe it'll still be helpful to just give my take us...
I think you're imagining that I'm doing something much more exotic here than I am. I'm basically just advocating for cooperating on what I see as a prisoner's-dilemma-style game (I'm sure you can also cast it as a stag hunt or make some really complex game-theoretic model to capture all the nuances—I'm not trying to do that there; my point here is just to explain the sort of thing that I'm doing).
Consider:
A and B can each choose:
And they each have utility function...
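To make the shape of the game concrete, here is a minimal payoff-matrix sketch of the kind of prisoner's-dilemma-style situation being described. The specific payoff numbers and choice labels are hypothetical illustrations, not from the original comment:

```python
# Hypothetical 2x2 payoff matrix for players A and B.
# Each chooses "cooperate" or "defect";
# payoffs[(a_choice, b_choice)] = (A's utility, B's utility).
payoffs = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"):    (0, 4),
    ("defect",    "cooperate"): (4, 0),
    ("defect",    "defect"):    (1, 1),
}

def best_response(player, other_choice):
    """Return the choice maximizing `player`'s payoff, holding the other's choice fixed."""
    idx = 0 if player == "A" else 1
    def util(choice):
        key = (choice, other_choice) if player == "A" else (other_choice, choice)
        return payoffs[key][idx]
    return max(["cooperate", "defect"], key=util)

# Defecting is each player's best response to anything the other does...
assert best_response("A", "cooperate") == "defect"
assert best_response("A", "defect") == "defect"
# ...yet mutual cooperation Pareto-dominates mutual defection,
# which is why a shared policy of cooperating can be worth committing to.
assert payoffs[("cooperate", "cooperate")] > payoffs[("defect", "defect")]
```

The point of committing to cooperate in games of this shape is exactly that case-by-case best-response reasoning lands both players in the worse (1, 1) outcome.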
(It seems that you're switching the topic from what your policy is exactly, which I'm still unclear on, to the model/motivation underlying your policy, which perhaps makes sense, as if I understood your model/motivation better perhaps I could regenerate the policy myself.)
I think I may just outright disagree with your model here, since it seems that you're not taking into account the significant positive externalities that a public argument can generate for the audience (in the form of more accurate beliefs, about the organizations involved and EA topics i...
For example would you really not have thought worse of MIRI (Singularity Institute at the time) if it had labeled Holden Karnofsky's public criticism "hostile" and refused to respond to it, citing that its time could be better spent elsewhere?
To be clear, I think that ACE calling the OP “hostile” is a pretty reasonable thing to judge them for. My objection is only to judging them for the part where they don't want to respond any further. So as for the example, I definitely would have thought worse of MIRI if they had labeled Holden's criticisms as “host...
Still pretty unclear about your policy. Why is ACE calling the OP "hostile" not considered "meta-level" and hence not updateable (according to your policy)? What if the org in question gave a more reasonable explanation of why they're not responding, but doesn't address the object-level criticism? Would you count that in their favor, compared to total silence, or compared to an unreasonable explanation? Are you making any subjective judgments here as to what to update on and what not to, or is there a mechanical policy you can write down (that anyone can f...
I disagree, obviously, though I suspect that little will be gained by hashing it out further here. To be clear, I have certainly thought about this sort of issue in great detail as well.
I would be curious to read more about your approach, perhaps in another venue. Some questions I have:
It clearly is actual, boring, normal, bayesian evidence that they don't have a good response. It's not overwhelming evidence, but someone declining to respond sure is screening off the worlds where they had a great low-inferential distance reply that was cheap to shoot off that addressed all the concerns. Of course I am going to update on that.
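The Bayesian update being described can be sketched numerically. All the priors and likelihoods below are made-up illustrative numbers, not claims about ACE or any particular org:

```python
# Hypothetical Bayes update: how much does silence lower
# P(the org has a good, cheap response available)?
prior_good = 0.5            # prior P(good response exists)
p_silence_given_good = 0.3  # a cheap, good reply would usually get posted
p_silence_given_bad = 0.8   # with no good reply, silence is the easy option

# Law of total probability, then Bayes' rule.
p_silence = (prior_good * p_silence_given_good
             + (1 - prior_good) * p_silence_given_bad)
posterior_good = prior_good * p_silence_given_good / p_silence

# Silence is real (though not overwhelming) evidence against a good response.
assert posterior_good < prior_good
print(round(posterior_good, 3))  # prints 0.273
```

So long as silence is more likely in worlds without a good reply than in worlds with one, the posterior has to move down; the counter-argument in the next comment is about whether one should *pre-commit* not to act on that update, not about the arithmetic.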
I think that you need to be quite careful with this sort of naive-CDT-style reasoning. Pre-commitments/norms against updating on certain types of evidence can be quite valuable—it is just not the case that you sho...
To be clear, I think it's perfectly reasonable for you to want ACE to respond if you expect that information to be valuable. The question is what you do when they don't respond. The response in that situation that I'm advocating for is something like “they chose not to respond, so I'll stick with my previous best guess” rather than “they chose not to respond, therefore that says bad things about them, so I'll update negatively.” I think that the latter response is not only corrosive in terms of pushing all discussion into the public sphere even when that makes it much worse, but it also hurts people's ability to feel comfortably holding onto non-public information.
“they chose not to respond, therefore that says bad things about them, so I'll update negatively.” I think that the latter response is not only corrosive in terms of pushing all discussion into the public sphere even when that makes it much worse, but it also hurts people's ability to feel comfortably holding onto non-public information.
This feels wrong from two perspectives:
Yeah—I mostly agree with this.
I think it's pretty important for people to make themselves available for communication.
Are you sure that they're not available for communication? I know approximately nothing about ACE, but I'd be surprised if they wouldn't be willing to talk to you after e.g. sending them an email.
Yeah, I am really not sure. I will consider sending them an email. My guess is they are not interested in talking to me in a way that would later on allow me to write up what they said publicly, which would reduce the value of their response quite drastically to me. If they are happy to chat and allow me to write things up, then I might be able to make the time, but ...
I also think there's a strong tendency for goalpost-moving with this sort of objection—are you sure that, if they had said more things along those lines, you wouldn't still have objected?
I do think I would have still found it pretty sad for them to not respond, because I do really care about our public discourse and this issue feels important to me, but I do think I would feel substantially less bad about it, and probably would only have mild-downvoted the comment instead of strong-downvoted it.
...What I have a problem with is the notion that we should
Why was this response downvoted so heavily? (This is not a rhetorical question—I'm genuinely curious what the specific reasons were.)
As Jakub has mentioned above, we have reviewed the points in his comment and fully support Anima International’s wish to share their perspective in this thread. However, Anima’s description of the events above does not align with our understanding of the events that took place, primarily within points 1, 5, and 6.
This is relevant, useful information.
...The most time-consuming part of our commitment to Representation, Equity
I didn't downvote (because as you say it's providing relevant information), but I did have a negative reaction to the comment. I think the generator of that negative reaction is roughly: the vibe of the comment seems more like a political attempt to close down the conversation than an attempt to cooperatively engage. I'm reminded of "missing moods"; it seems like there's a legitimate position of "it would be great to have time to hash this out but unfortunately we find it super time consuming so we're not going to", but it would naturally come with a...
I downvoted because it called the communication hostile without any justification for that claim. The comment it is replying to doesn't seem at all hostile to me, and asserting it is, feels like it's violating some pretty important norms about not escalating conflict and engaging with people charitably.
I also think I disagree that orgs should never be punished for not wanting to engage in any sort of online discussion. We have shared resources to coordinate, and as a social network without clear boundaries, it is unclear how to make progress on many of the...
I'd personally love to get more Alignment Forum content cross-posted to the EA Forum. Maybe some sort of automatic link-posting? Though that could pollute the EA Forum with a lot of link posts that probably should be organized separately somehow. I'd certainly be willing to start cross-posting my research to the EA Forum if that would be helpful.
Instinctively, I wish that discussion on these posts could all happen on the Alignment Forum, but since who can join is limited, having discussion here as well could be nice.
I don't know whether every single post should be posted here, but it would be nice to at least have occasional posts summarizing the best recent AF content. This might look like just crossposting every new issue of the Alignment Newsletter, which is something I may start doing soon.
Glad you enjoyed it!
So, I think what you're describing in terms of a model with a pseudo-aligned objective pretending to have the correct objective is a good description of specifically deceptive alignment, though the inner alignment problem is a more general term that encompasses any way in which a model might be running an optimization process for a different objective than the one it was trained on.
In terms of empirical examples, there definitely aren't good empirical examples of deceptive alignment right now for the reason you mentioned, though whether
...This thread on LessWrong has a bunch of information about precautions that might be worth taking.
I won't repeat my full LessWrong comment here in detail; instead I'd just recommend heading over there and reading it and the associated comment chain. The bottom-line summary is that, in trying to cover some heavy information theory regarding how to reason about simplicity priors and counting arguments without actually engaging with the proper underlying formalism, this post commits a subtle but basic mathematical mistake that makes the whole argument fall apart.