All of Paul_Christiano's Comments + Replies

ARC returned this money to the FTX bankruptcy estate in November 2023.

I edited the original comment to say "goal-directed"; each of the options has some baggage and isn't quite right, but on balance I think "goal-directed" is better. I'm not very systematic about this choice; it's just a reflection of my mood that day.

Quantitatively how large do you think the non-response bias might be? Do you have some experience or evidence in this area that would help estimate the effect size? I don't have much to go on, so I'd definitely welcome pointers.

Let's consider the 40% of people who put a 10% probability on extinction or similarly bad outcomes (which seems like what you are focusing on). Perhaps you are worried about something like: researchers concerned about risk might be 3x more likely to answer the survey than those who aren't concerned about risk, and so in fact only 20... (read more)
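
(A minimal sketch of that adjustment in code, treating the 3x response-rate ratio as the purely hypothetical figure it is above:)

```python
# Sketch: how an assumed response-rate ratio deflates the observed 40% figure.
# If "concerned" researchers answer the survey k times more often than others,
# and a fraction p of the underlying population is concerned, the survey shows:
#   observed = k*p / (k*p + (1 - p))
# Solving for p given observed = 0.40 and k = 3:

def true_fraction(observed: float, k: float) -> float:
    """Invert the response-bias model to recover the population fraction."""
    return observed / (k - observed * (k - 1))

print(true_fraction(0.40, 3))  # ~0.18, i.e. roughly 20% instead of 40%
```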

I really appreciate your and @Katja_Grace's thoughtful responses, and wish more of this discussion had made it into the manuscript. (This is a minor thing, but I also didn't love that the response rate/related concerns were introduced on page 20 [right?], since it's standard practice—at least in my area—to include a response rate up front, if not in the abstract.) I wish I had more time to respond to the many reasonable points you've raised, and will try to come back to this in the next few days if I do have time, but I've written up a few thoughts here.

Yes, I'd bet the effects are even smaller than what this study found. This study gives a small amount of evidence of an effect > 0.05 SD. But without a clear mechanism I think an effect of < 0.05 SD is significantly more likely. One of the main reasons we were expecting an effect here was a prior literature that is now looking pretty bad.

That said, this was definitely some evidence for a positive effect, and the prior literature is still some evidence for a positive effect even if it's not looking good. And the upside is pretty large here since creatine supplementation is cheap. So I think this is good enough grounds for me to be willing to fund a larger study.

My understanding of the results: for the preregistered tasks you measured effects of 1 IQ point (for RAPM) and 2.5 IQ points (for BDS), with a standard error of ~2 IQ points. This gives weak evidence in favor of a small effect, and strong evidence against a large effect.
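
(For reference, a quick unit conversion linking the 0.05 SD threshold above to these IQ-point figures, using the standard convention that 1 SD of IQ is 15 points:)

$$0.05\ \text{SD} \times 15\ \tfrac{\text{IQ points}}{\text{SD}} = 0.75\ \text{IQ points},$$

so the measured effects of 1 and 2.5 IQ points, with ~2-point standard errors, are being compared against a threshold of under one IQ point.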

You weren't able to measure a difference between vegetarians and omnivores. For the exploratory cognitive tasks you found no effect. (I don't know if you'd expect those tests to be sensitive enough to notice such a small effect.)

At this point it seems a bit unlikely to me that there is a cl... (read more)

It seems quite likely to me that all the results on creatine and cognition are bogus, maybe I'd bet at 4:1 against there being a real effect >0.05 SD.

Unless I'm misunderstanding, does this mean you'd bet that the effects are even smaller than what this study found on its preregistered tasks? If so, do you mind sharing why?

I think the "alignment difficulty" premise was given higher probability by superforecasters, not lower probability.

Agree that it's easier to talk about (change)/(time) rather than (time)/(change). As you say, (change)/(time) adds better. And agree that % growth rates are terrible for a bunch of reasons once you are talking about rates >50%.

I'd weakly advocate for "doublings per year:" (i) 1 doubling / year is more like a natural unit; that's already a pretty high rate of growth, and it's easier to talk about multiple doublings per year than a fraction of an OOM per year, (ii) there is a word for "doubling" and no word for "increased by an OOM," (iii) I think the ari... (read more)
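
(A small sketch of the unit conversions under discussion; the example growth rates are arbitrary:)

```python
import math

def doublings_per_year(growth_factor_per_year: float) -> float:
    """Convert an annual multiplicative growth factor into doublings per year."""
    return math.log2(growth_factor_per_year)

def ooms_per_year(growth_factor_per_year: float) -> float:
    """Convert an annual multiplicative growth factor into orders of magnitude per year."""
    return math.log10(growth_factor_per_year)

# 1 doubling/year is a 2x (i.e. 100%) annual growth rate, or ~0.30 OOM/year.
print(doublings_per_year(2.0), ooms_per_year(2.0))    # 1.0, ~0.301
# 1 OOM/year is 10x annual growth, or ~3.32 doublings/year.
print(doublings_per_year(10.0), ooms_per_year(10.0))  # ~3.32, 1.0
```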

Yes, I'm not entirely certain Impossible meat is equivalent in taste to animal-based ground beef. However, I do find the evidence I cite in the second paragraph of this section somewhat compelling.

Are you referring to the blind taste test? It seems like that's the only direct evidence on this question.

It doesn't look like the preparations are necessarily analogous. At a minimum the plant burger had 6x more salt. All burgers were served with a "pinch" of salt but it's hard to know what that means, and in any case the plant burger probably ended up at least ... (read more)

1
Jacob_Peacock
7mo
Yes. The Sogari blind taste test is indeed affected by saltiness; it also includes an informed taste test similarly affected (but again finding Impossible and animal-based meat tied for first). There is a second blind taste test cited immediately thereafter (Chicken and Burger Alternatives, 2018), although salt levels were not reported.

No, I haven't. I agree, food is varied and such comparisons are hard—that's part of why I argue we should do more taste tests!

Can you clarify what you mean by an N of 1 study? Usually this refers to a study with a single participant, but Sogari indeed had many participants. If you're suggesting comparison against multiple burgers, this gets a bit tricky, since one has to decide which burger you actually want to be equivalent to, if that's your goal.

Can you clarify what specifically you disagree with here? I don't think I especially disagree with anything you wrote that follows from here. Instead, I think it's indeed perception of taste that matters for the impact of PBM, and we can likely best measure that perception with informed, rather than blind, taste tests.

Overall, as I write, I think actually operationalizing a taste test to identify whether "taste competitiveness" is obtained is non-trivial. The literature so far neglects such operationalizations. What do you have in mind as an ideal experiment to conduct to measure taste competitiveness?

The linked LW post points out that nuclear power was cheaper in the past than it is today, and that today the cost varies considerably between different jurisdictions. Both of these seem to suggest that costs would be much lower if there was a lower regulatory burden. The post also claims that nuclear safety is extremely high, much higher than we expect in other domains and much higher than would be needed to make nuclear preferable to alternative technologies. So from that post I would be inclined to believe that overregulation is the main reason for a hi... (read more)

3
Ulrik Horn
8mo
Hi Paul, thanks for taking the time to respond. My main concern is the combination of saying contrarian and seemingly uninvestigated things publicly. Therefore, I am only instrumentally, and at a lower priority, interested in whether or not nuclear power is overregulated (I think nuclear energy is probably less important than poverty, AI or bio). I do not think I need to spend a lot of time understanding whether nuclear power is overregulated in order to say that I think it is over-confident to claim that it is in fact overregulated. In other words, and in general, I think a smaller burden of proof is required to say that something is highly uncertain than to say with any significant amount of certainty that something is like this or like that.

Instead of going down a potential nuclear energy rabbit hole, I would advise people looking for examples of over-regulation to either pick a less controversial example, or, if such examples are not easily found, to avoid making these analogies altogether, as I do not think they are important for the arguments in which these examples have been used (i.e. the 2 podcasts).

Still, while I feel confident about my stance above and the purpose of my original post, I want to take the time to respond as carefully as I can to your questions, making it clear that in doing so we are shifting the conversation away from my originally stated purpose of improving optics/epistemics and towards digging into what is actually the case with nuclear (a potential rabbit hole!). I also, unfortunately, do not have too much time to look more deeply into this (but I did on three occasions assess nuclear in terms of costs for powering a civilizational shelter; it did not look promising, and I am happy to share my findings). So I will attempt to answer your questions, but in order to stick with the main recommendation that I make (to be epistemically humble) my answers are probably not very informative and I th

I'm confused about your analysis of the field experiment. It seems like the three options are {Veggie, Impossible, Steak}. But wouldn't Impossible be a comparison for ground beef, not for steak? Am I misunderstanding something here?

Beyond that, while I think Impossible meat is great, I don't think it's really equivalent on taste. I eat both beef and Impossible meat fairly often (>1x / week for both) and I would describe the taste difference as pretty significant when they are similarly prepared.

If I'm understanding you correctly then 22% of the people p... (read more)

3
Jacob_Peacock
7mo
Hi Paul, thanks for checking the analysis so closely! (And apologies for the slow reply; I've been gathering some more information.)

This is a good point, and I've now confirmed with the authors that the steak was cubed, rather than minced or ground, so indeed not likely directly comparable to Impossible ground beef. I'll be making some updates to the paper accordingly. Thank you! The build-your-own-entree bar offers shredded beef, which, while also not the same, might be a more similar comparison. Unfortunately, I wasn't able to get more granular data at this time to test whether that was more readily displaced. Overall, despite these caveats on taste, lots of plant-based meat was still sold, so it was "good enough" in some sense, but there was still potentially little resulting displacement of beef (although maybe somewhat more of chicken).

Yes, I'm not entirely certain Impossible meat is equivalent in taste to animal-based ground beef. However, I do find the evidence I cite in the second paragraph of this section somewhat compelling.

I'm not sure where you're getting this exact figure, but I don't put much credence in it. Instead, I'd refer to estimates in Fig 3, which range from 0.3 to 4.0 percentage points of beef displacement, after accounting for behavior at the control sites and/or spillover effects. That is compared to a 5.0 or 11.4 pp increase in Impossible meal sales, respectively. Furthermore, it's important to keep in mind "the study employed several co-interventions designed to reduce meat consumption (Malan, 2020). These included environmental education, low carbon footprint labels on menus, and an advertising campaign to promote the new product, all of which have some evidence demonstrating their effectiveness." So the effect is likely not entirely attributable to the Impossible meat.

I agree and discuss this issue some in the Taste section. In short, this is part of why I think informed taste tests would be more relevant than blind: in naturali

I didn't mean to imply that human-level AGI could do human-level physical labor with existing robotics technology; I was using "powerful" to refer to a higher level of competence. I was using "intermediate levels" to refer to human-level AGI, and assuming it would need cheap human-like bodies.

Though mostly this seems like a digression. As you mention elsewhere, the bigger crux is that it seems to me like automating R&D would radically shorten timelines to AGI and be amongst the most important considerations in forecasting AGI.

(For this reason I don't o... (read more)

My point in asking "Are you assigning probabilities to a war making AGI impossible?" was to emphasize that I don't understand what 70% is a probability of, or why you are multiplying these numbers. I'm sorry if the rhetorical question caused confusion.

My current understanding is that 0.7 is basically just the ratio (Probability of AGI before thinking explicitly about the prospect of war) / (Probability of AGI after thinking explicitly about prospect of war). This isn't really a separate event from the others in the list, it's just a consideration that leng... (read more)

That's fair, this was some inference that is probably not justified.

To spell it out: you think brains are as effective as 1e20-1e21 flops. I claimed that humans use more than 1% of their brain when driving (e.g. our visual system is large, and driving seems like a typical task that engages the full capacity of the visual system during the high-stakes situations that dominate performance), but you didn't say this. I concluded (but you certainly didn't say) that a human-level algorithm for driving would not have much chance of succeeding using 1e14 flops.
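
(Spelling out the arithmetic behind that inference, using the estimates above:)

$$1\% \times 10^{20\text{--}21}\ \text{flops} = 10^{18\text{--}19}\ \text{flops} \gg 10^{14}\ \text{flops},$$

a shortfall of roughly four to five orders of magnitude.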

7
Ted Sanders
11mo
I think you make a good argument and I'm open to changing my mind. I'm certainly no expert on visual processing in the human brain. Let me flesh out some of my thoughts here.

On whether this framework would have yielded bad forecasts for self-driving: When we guess that brains use 1e20-1e21 FLOPS, and therefore that early AGIs might need 1e16-1e25, we're not making a claim about AGIs in general, or the most efficient AGI possible, but AGIs by 2043. We expect early AGIs to be horribly inefficient by later standards, and AGIs to get rapidly more efficient over time. AGI in 2035 will be less efficient than AGI in 2042, which will be less efficient than AGI in 2080.

With that clarification, let's try to apply our logic to self-driving to see whether it bears weight. Supposing that self-driving needs 1% of human brainpower, or 1e18-1e19 FLOPS, and then similarly widening our uncertainty to 1e14-1e23 FLOPS, the framework might say: yes, we'd be surprised but not stunned at 1e14 FLOPS being enough to drive (10% -> 100%).

But, and I know my reasoning is motivated here, that actually seems kind of reasonable? Like, for the first decade and change of trying, 1e14 FLOPS actually was not enough to drive. Even now, it's beginning to be enough to drive, but still is wildly less sample efficient than human drivers and wildly worse at generalizing than human drivers. So it feels like if in 2010 we had predicted self-driving would take 1e14-1e23 FLOPS, and then a time traveler from the future told us that actually it was 1e14 FLOPS, but it would take 13 years to get there, and actually would still be subhuman, then honestly that doesn't feel too shocking. It was the low end of the range, took many years, and still didn't quite match human performance. No doubt with more time and more training 1e14 FLOPS will become more and more capable. Just as we have little doubt that with more time AGIs will require fewer and fewer FLOPS to achieve human performance.

So as I reflect on this framework applied

Incidentally, I'm puzzled by your comment and others that suggest we might already have algorithms for AGI in 2023. Perhaps we're making different implicit assumptions of realistic compute vs infinite compute, or something else. To me, it feels clear that we don't have the algorithms and data for AGI at present.

I would guess that more or less anything done by current ML can be done by ML from 2013 but with much more compute and fiddling. So it's not at all clear to me whether existing algorithms are sufficient for AGI given enough compute, just as it wasn't clea... (read more)

2
Ted Sanders
11mo
Yep. We're using the main definition supplied by Open Philanthropy, which I'll paraphrase as "nearly all human work at human cost or less by 2043." If the definition was more liberal, e.g., AGI as smart as humans, or AI causing world GDP to rise by >100%, we would have forecasted higher probabilities. We expect AI to get wildly more powerful over the next decades and wildly change the face of human life and work. The public is absolutely unprepared. We are very bullish on AI progress, and we think AI safety is an important, tractable, and neglected problem. Creating new entities with the potential to be more powerful than humanity is a scary, scary thing.
1
Ted Sanders
11mo
Bingo. We didn't take the time to articulate it fully, but yeah you got it. We think it makes it easier to forecast these things separately rather than invisibly smushing them together into a smaller set of factors.  We are multiplying out factors. Not sure I follow you here.
1
Ted Sanders
11mo
Agree 100%. Our essay does exactly this, forecasting over a wide range of potential compute needs, before taking an expected value to arrive at a single summary likelihood. Sounds like you think we should have ascribed more probability to lower ranges, which is a totally fair disagreement.
2
Ted Sanders
11mo
Interesting - this is perhaps another good crux between us. My impression is that existing robot bodies are not good enough to do most human jobs, even if we had human-level AGI today. Human bodies self-repair, need infrequent maintenance, last decades, have multi-modal high bandwidth sensors built in, and are incredibly energy efficient. One piece of evidence for this is how rare tele-operated robots are. There are plenty of generally intelligent humans around the world who would be happy to control robots for $1/hr, and yet they are not being employed to do so.

Are you saying that e.g. a war between China and Taiwan makes it impossible to build AGI? Or that serial time requirements make AGI impossible? Or that scaling chips means AGI is impossible?

C'mon Paul - please extend some principle of charity here. :)

You have repeatedly ascribed silly, impossible beliefs to us and I don't know why (to be fair, in this particular case you're just asking, not ascribing). Genuinely, man, I feel bad that our writing has either (a) given the impression that we believe such things or (b) given the impression that we're the type ... (read more)

3
Ted Sanders
11mo
I don't follow you here. Why is a floating point operation 1e5 bit erasures today? Why does an fp16 operation necessitate 16 bit erasures? As an example, if we have two 16-bit registers (A, B) and we do a multiplication to get (A, A*B), where are the 16 bits of information loss? (In any case, no real need to reply to this. As someone who has spent a lot of time thinking about the Landauer limit, my main takeaway is that it's more irrelevant than often supposed, and I suspect getting to the bottom of this rabbit hole is not going to yield much for us in terms of TAGI timelines.)
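
(For readers unfamiliar with the reference, Landauer's principle bounds the energy dissipated per bit of information erased at temperature T:)

$$E_{\min} = k_B T \ln 2 \approx 2.9 \times 10^{-21}\ \text{J at } T \approx 300\ \text{K};$$

the disagreement above is about how many logical bit erasures a floating point operation actually entails, not about the bound itself.
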
1
Ted Sanders
11mo
Pretty fair summary. 1e6, though, not 1e7. And honestly I could be pretty easily persuaded to go a bit lower by arguments such as:
  • Max firing rate of 100 Hz is not the informational content of the channel (that buys maybe 1 OOM)
  • Maybe a smaller DNN could be found, but wasn't
  • It might take a lot of computational neurons to simulate the I/O of a single synapse, but it also probably takes a lot of synapses to simulate the I/O of a single computational neuron
Dropping our estimate by 1-2 OOMs would increase step 3 by 10-20 percentage points (absolute). It wouldn't have much effect on later estimates, as they are already conditional on success in step 3.
1
Ted Sanders
11mo
Maybe, but maybe not, which is why we forecast a number below 100%. For example, it is very, very rare to see a CEO hired with <2 years of experience, even if they are very intelligent and have read a lot of books and watched a lot of interviews. Some reasons might be irrational or irrelevant, but surely some of it is real. A CEO job requires a large constellation of skills practiced and refined over many years, e.g., relationship building with customers, suppliers, shareholders, and employees. For an AGI to be installed as CEO of a corporation in under two years, human-level learning would not be enough - it would need to be superhuman in its ability to learn. Such superhuman learning could come from simulation (e.g., modeling and simulating how a potential human partner would react to various communication styles), from parallelization (e.g., being installed as a manager in 1,000 companies and then compiling and sharing learnings across copies), or from something else. I agree that skills learned from reading or thinking or simulating could happen very fast. Skills requiring real-world feedback that is expensive, rare, or long-delayed would progress more slowly.
1
Ted Sanders
11mo
What's an algorithm from 2013 that you think could yield AGI, if given enough compute? What would its inputs, outputs, and training look like? You're more informed than me here and I would be happy to learn more.

You start off saying that existing algorithms are not good enough to yield AGI (and you point to the hardness of self-driving cars as evidence) and fairly likely won't be good enough for 20 years. And you also claim that existing levels of compute would be way too low to learn to drive even if we had human-level algorithms. Doesn't each of those factors on its own explain the difficulty of self-driving? How are you also using the difficulty of self-driving to independently argue for a third conjunctive source of difficulty?

Maybe another related question:... (read more)

3
Ted Sanders
11mo
This is not a claim we've made.

Maybe another related question: can you make a forecast about human-level self-driving (e.g. similar accident rates vs speed tradeoffs to a tourist driving in a random US city) and explain its correlation with your forecast about human-level AI overall?

Here are my forecasts of self-driving from 2018: https://www.tedsanders.com/on-self-driving-cars/

Five years later, I'm pretty happy with how my forecasts are looking. I predicted:

  • 100% that self-driving is solvable (looks correct)
  • 90% that self-driving cars will not be available for sale by 2025 (looks correct)
... (read more)

I don't think I understand the structure of this estimate, or else I might understand and just be skeptical of it. Here are some quick questions and points of skepticism.

Starting from the top, you say:

We estimate optimistically that there is a 60% chance that all the fundamental algorithmic improvements needed for AGI will be developed on a suitable timeline.

This section appears to be an estimate of all-things-considered feasibility of transformative AI, and draws extensively on evidence about how lots of things go wrong in practice when implementing compl... (read more)

Am I really the only person who thinks it's a bit crazy that we use this blobby comment thread as if it's the best way we have to organize disagreement/argumentation for audiences? I feel like we could almost certainly improve by using, e.g., a horizontal flow as is relatively standard in debate.[1]

With a generic example below:

To be clear, the commentary could still incorporate non-block/prose text.

Alternatively, people could use something like Kialo.com. But surely there has to be something better than this comment thread, in terms of 1) ease of determini... (read more)

Excellent comment; thank you for engaging in such detail. I'll respond piece by piece. I'll also try to highlight the things you think we believe but don't actually believe.

Section 1: Likelihood of AGI algorithms

"Can you say what exactly you are assigning a 60% probability to, and why it's getting multiplied with ten other factors? Are you saying that there is a 40% chance that by 2043 AI algorithms couldn't yield AGI no matter how much serial time and compute they had available? (It seems surprising to claim that even by 2023!) Presumably not that, but wh

... (read more)

There was a related GiveWell post from 12 years ago, including a similar example where higher "unbiased" estimates correspond to lower posterior expectations.

That post is mostly focused on practical issues about being a human, and much less amusing, but it speaks directly to your question #2.

(Of course, I'm most interested in question #3!)
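
(A minimal illustration of that phenomenon with made-up numbers: under a shared prior, a higher but noisier unbiased estimate can have a lower posterior mean than a lower, more precise one.)

```python
def posterior_mean(estimate: float, noise_var: float,
                   prior_mean: float = 1.0, prior_var: float = 1.0) -> float:
    """Posterior mean for a normal prior and a normal, unbiased noisy estimate."""
    precision = 1 / prior_var + 1 / noise_var
    return (prior_mean / prior_var + estimate / noise_var) / precision

# A modest estimate with low noise...
print(posterior_mean(estimate=5.0, noise_var=1.0))     # 3.0
# ...beats a much larger estimate with high noise.
print(posterior_mean(estimate=10.0, noise_var=100.0))  # ~1.09
```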

3
Davidmanheim
1y
Also see Why the Tails Come Apart and regressional Goodhart.

I agree. When I give numbers I usually say "We should keep the risk of AI takeover beneath 1%" (though I haven't thought about it very much and mostly the numbers seem less important than the qualitative standard of evidence).

I think that 10% is obviously too high. I think that a society making reasonable tradeoffs could end up with 1% risk, but that it's not something a government should allow AI developers to do without broader public input (and I suspect that our society would not choose to take this level of risk).

2
Habryka
1y
Cool, makes sense. Seems like we are mostly on the same page on this subpoint.

Yeah, the sentence cut off. I was saying: obviously a 10% risk is socially unacceptable. Trying to convince someone it's not in their interest is not the right approach, because doing so requires you to argue that P(doom) is much greater than 10% (at least with some audiences who care a lot about winning a race). Whereas trying to convince policy makers and the public that they shouldn't tolerate the risk requires meeting a radically lower bar, probably even 1% is good enough.

2
Greg_Colbourn
1y
I think arguing P(doom|AGI) >>10% is a decent strategy. So far I haven't had anyone give good enough reasons for me to update in the other direction. I think the CEOs in the vanguard of AGI development need to really think about this. If they have good reasons for thinking that P(doom|AGI) ≤ 10%, I want to hear them! To give a worrying example: LeCun is, frankly, sounding like he has no idea of what the problem even is. OpenAI might think they can solve alignment, but their progress on alignment to date isn't encouraging (this is so far away from the 100% watertight, 0 failure modes that we need). And Google Deepmind are throwing caution to the wind (despite safetywashing their statement with 7 mentions of the word "responsible"/"responsibly"). The above also has the effect of shifting the public framing toward the burden being on the AI companies to prove their products are safe (in terms of not causing global catastrophe). I'm unsure as to whether the public at large would tolerate a 1% risk. Maybe they would (given the potential upside). But we are not in that world. The risk is at least 50%, probably closer to 99% imo.

I mean that 90% or 99% seem like clearly reasonable asks, and 100% is a clearly unreasonable ask.

I'm just saying that the argument "this is a suicide race" is really not the way we should go. We should say the risk is >10% and that's obviously unacceptable, because that's an argument we can actually win.

9
Habryka
1y
Hmm, just to be clear, I think saying that "this deployment has a 1% chance of causing an existential risk, so you can't deploy it" seems like a pretty reasonable ask to me.  I agree that I would like to focus on the >10% case first, but I also don't want to set wrong expectations that I think it's reasonable at 1% or below. 

I'm pushing back against the framing: "this is a suicide race with no benefit from winning."

If there is a 10% chance of AI takeover, then there is a real and potentially huge benefit from winning the race. But we still should not be OK with someone unilaterally taking that risk.

I agree that AI developers should have to prove that the systems they build are reasonably safe. I don't think 100% is a reasonable ask, but 90% or 99% seem pretty safe (i.e. robustly reasonable asks).

(Edited to complete cutoff sentence and clarify "safe.")

2
Greg_Colbourn
1y
Paul, you are saying "50/50 chance of doom" here (on the Bankless podcast). Surely that is enough to be using the suicide race argument!? I mean "it's not suicide, it's a coin flip; heads utopia, tails you're doomed" seems like quibbling at this point. Or at least: you should be explicit when talking to CEOs that you think it's 50/50 that AGI dooms us!
9
Habryka
1y
Sorry, just to clarify, what do we mean by "safe" here? Clearly a 90% chance of the system not disempowering all of humanity is not sufficient (and neither would a 99% chance, though that's maybe a bit more debatable), so presumably you mean something else here. 
4
Greg_Colbourn
1y
90% or 99% safe is still gambling the lives of 80M-800M humans in expectation (in the limit of scaling to superintelligence). I don't think it's acceptable for AI companies, with no democratic mandate, to be unilaterally making that decision! Or did you mean to say something to that effect with this truncated sentence?

I've seen this a few times but I'm skeptical about taking this rhetorical approach.

I think a large fraction of AI risk comes from worlds where the ex ante probability of catastrophe is more like 50% than 100%. And in many of those worlds, the counterfactual impact of individual developers moving faster is several times smaller (since someone else is likely to kill us all in the bad 50% of worlds). On top of that, reasonable people might disagree about probabilities and think 10% in a case where I think 50%.

So putting that together they may conclude that raci... (read more)

4
Greg_Colbourn
1y
Have you had an in-depth discussion with Roman Yampolskiy? (If not, I think you should!) I think the Overton Window is really shifting on the issue of AGI x-risk now, with it going mainstream. The burden of proof should be on the developers of AGI to prove that it is 100% safe (as opposed to the previous era, where it was on the x-risk worriers to prove it was dangerous). Do you have a good post-GPT-4+plugins/AutoGPT (new 2023 era) answer to this question (a mechanistic explanation for why we get an ok outcome, given AGI)?

As an analogy: consider a variant of rock paper scissors where you get to see your opponent's move in advance---but it's encrypted with RSA. In some sense this game is much harder than proving Fermat's last theorem, since playing optimally requires breaking the encryption scheme. But if you train a policy and find that it wins 33% of the time at encrypted rock paper scissors, it's not super meaningful or interesting to say that the task is super hard, and in the relevant intuitive sense it's an easier task than proving Fermat's last theorem.

Smaller notes:

  • The conditional GAN task (given some text, complete it in a way that looks human-like) is just even harder than the autoregressive task, so I'm not sure I'd stick with that analogy.
  • I think that >50% of the time when people talk about "imitation" they mean autoregressive models; GANs and IRL are still less common than behavioral cloning. (Not sure about that.)
  • I agree that "figure out who to simulate, then simulate them" is probably a bad description of the cognition GPT does, even if a lot of its cognitive ability comes from copying human cognitive processes.

I agree that it's best to think of GPT as a predictor, to expect it to think in ways very unlike humans, and to expect it to become much smarter than a human in the limit.

That said, there's an important further question that isn't determined by the loss function alone---does the model do its most useful cognition in order to predict what a human would say, or via predicting what a human would say?

To illustrate, we can imagine asking the model to either (i) predict the outcome of a news story, (ii) predict a human thinking step-by-step about what will happe... (read more)

5
Anthony DiGiovanni
1y
I don't understand this claim. Why would the difficulty of the task not be super meaningful when training to performance that isn't near the upper limit?

For what it's worth, I think Eliezer's post was primarily directed at people who have spent a lot less time thinking about this stuff than you, and that this sentence:

"Getting perfect loss on the task of being GPT-4 is obviously much harder than being a human, and so gradient descent on its loss could produce wildly superhuman systems."

Is the whole point of his post, and is not at all obvious to even very smart people who haven't spent much time thinking about the problem. I've had a few conversations with e.g. skilled Google engineers who have said things... (read more)

The better reference class is adversarially mined examples for text models. Meta and other researchers were working on similar projects before Redwood started doing that line of research. https://github.com/facebookresearch/anli is an example

I agree that's a good reference class. I don't think Redwood's project had identical goals, and would strongly disagree with someone saying it's duplicative. But other work is certainly also relevant, and ex post I would agree that other work in the reference class is comparably helpful for alignment

Reader: evaluate

... (read more)

I don't think Redwood's project had identical goals, and would strongly disagree with someone saying it's duplicative.

I agree it is not duplicative. It's been a while, but if I recall correctly the main difference seemed to be that they chose a task which gave them an extra nine of reliability (started with an initially easier task) and pursued it more thoroughly.

I think I'm comparably skeptical of all of the evidence on offer for claims of the form "doing research on X leads to differential progress on Y,"

I think if we find that improvement of X leads ... (read more)

I've argued about this point with Evan a few times but still don't quite understand his take. I'd be interested in more back and forth. My most basic objection is that the fine-tuning objective is also extremely simple---produce actions that will be rated highly, or even just produce outputs that get a low loss. If you have a picture of the training process, then all of these are just very simple things to specify, trivial compared to other differences in complexity between deceptive alignment and proxy alignment. (And if you don't yet have such a picture, then deceptive alignment also won't yield good performance.)

3
David Johnston
1y
My intuition for why "actions that have effects in the real world" might promote deception is that maybe the "no causation without manipulation" idea is roughly correct. In this case, a self-supervised learner won't develop the right kind of model of its training process, but the fine-tuned learner might. I think "no causation without manipulation" must be substantially wrong. If it was entirely correct, I think one would have to say that pretraining ought not to help achieve high performance on a standard RLHF objective, which is obviously false. It still seems plausible to me that a) the self-supervised learner learns a lot about the world it's predicting, including a lot of "causal" stuff and b) there are still some gaps in its model regarding its own role in this world, which can be filled in with the right kind of fine-tuning. Maybe this falls apart if I try to make it all more precise - these are initial thoughts, not the outcomes of trying to build a clear theory of the situation.

Yes, I think that's how people have used the terms historically. I think it's also generally good usage---the specific thing you talk about in the post is important and needs its own name.

 Unfortunately I think it is extremely often misinterpreted and there is some chance we should switch to a term like "instrumental alignment" instead to avoid the general confusion with deception more broadly.

6
DavidW
1y
Thanks! I've updated both posts to reflect this. 

Because they lead to good performance on the pre-training objective (via deceptive alignment). I think a similarly big leap is needed to develop deceptive alignment during fine-tuning (rather than optimizing directly for the loss). In both cases the deceptively aligned behavior is not cognitively similar to the intended behavior, but is plausibly simpler (with similar simplicity gaps in each case).

For the sake of argument, suppose we have a model in pre-training that has a misaligned proxy goal and relevant situational awareness. But so far, it does not have a long-term goal. I'm picking these parameters because they seem most likely to create a long-term goal from scratch in the way you describe. 

In order to be deceptively aligned, the model has to have a long enough goal horizon so it can value its total goal achievement after escaping oversight more than its total goal achievement before escaping oversight. But pre-training processes are inc... (read more)

I don't know how common each view is. My guess would be that in the old days this was the more common view, but there's been a lot more discussion of deceptive alignment recently on LW.

I don't find the argument about "take actions with effects in the real world" --> "deceptive alignment" compelling, and my current guess is that most people would also back off from that style of argument if they thought about the issues more thoroughly. Mostly though it seems like this will just get settled by the empirics.

I don't know how common each view is either, but I want to note that @evhub has stated that he doesn't think pre-training is likely to create deception:

The biggest reason to think that pre-trained language models won’t be deceptive is just that their objective is extremely simple—just predict the world. That means that there’s less of a tricky path where stochastic gradient descent (SGD) has to spend a bunch of resources making their proxies just right, since it might just be able to very easily give it the very simple proxy of prediction. But th

... (read more)

Models that are only pre-trained almost certainly don’t have consequentialist goals beyond the trivial next token prediction.

If a model is deceptively aligned after fine-tuning, it seems most likely to me that it's because it was deceptively aligned during pre-training.

"Predict tokens well" and "Predict fine-tuning tokens well" seem like very similar inner objectives, so if you get the first one it seems like it will move quickly to the second one. Moving to the instrumental reasoning to do well at fine-tuning time seems radically harder. And generally it'... (read more)

8
Alexander Turner
2d
Paul, I think deceptive alignment (or other spontaneous, stable-across-situations goal pursuit) after just pretraining is very unlikely. I am happy to take bets if you're interested. If so, email me (alex@turntrout.com), since I don't check this very much.  I agree, and the actual published arguments for deceptive alignment I've seen don't depend on any difference between pretraining and finetuning, so they can't only apply to one. (People have tried to claim to me, unsurprisingly, that the arguments haven't historically focused on pretraining.)
9
DavidW
1y
How would the model develop situational awareness in pre-training when:
  1. Unlike in fine-tuning, the vast majority of internet text prompts do not contain information relevant for the model to figure out that it is an ML model. The model can't infer context from the prompt in the vast majority of pre-training inputs.
  2. Because predicting internet text for the next token is all that predicts reward, why would situational awareness help with reward unless the model were already deceptively aligned?
  3. Situational awareness only produces deceptive alignment if the model already has long-term goals, and vice versa. Gradient descent is based on partial derivatives, so assuming that long-term goals and situational awareness are represented by different parameters:
    1. If the model doesn't already have long enough goal horizons for deceptive alignment, then marginally more situational awareness doesn't increase deceptive alignment.
    2. If the model doesn't already have the kind of situational awareness necessary for deceptive alignment, then a marginally longer-term goal doesn't increase deceptive alignment.
    3. Therefore, the partial derivatives shouldn't point toward either property unless the model already has one or the other.
8
David Johnston
1y
How common do you think this view is? My impression is that most AI safety researchers think the opposite, and I’d like to know if that’s wrong. I’m agnostic; pretraining usually involves a lot more training, but also fine tuning might involve more optimisation towards “take actions with effects in the real world”.

Doesn't deceptive alignment require long-term goals? Why would a model develop long-term goals in pre-training? 

There is another form of deceptive alignment in which agents become more manipulative over time due to problems with training data and eventually optimize for reward, or something similar, directly.

I think "deceptive alignment" refers only  to situations where the model gets a high reward at training for instrumental reasons. This is a source of a lot of confusion (and should perhaps be called "instrumental alignment") but worth trying to be clear about.

I might be misunderstanding what you are saying here. I think the post you link doesn't use th... (read more)

6
DavidW
1y
My goal was just to clarify that I'm referring to the specific deceptive alignment story and not models being manipulative and dishonest in general. However, it sounds like what I thought of as 'deceptive alignment' is actually 'playing the training game', and what I described as a specific type of deceptive alignment is the only thing referred to as deceptive alignment. Is that right? Thanks for clarifying this!

Rather, I don't think that GPUs performing parallel searches through a probabilistic word space by themselves are likely to support consciousness.

This seems like the crux. It feels like a big neural network run on a GPU, trained to predict the next word, could definitely be conscious. So to me this is just a question about the particular weights of large language models, not something that can be established a priori based on architecture.

It seems reasonable to guess that modern language models aren't conscious in any morally relevant sense. But it seems odd to use that as the basis for a reductio of arguments about consciousness, given that we know nothing about the consciousness of language models.

Put differently: if a line of reasoning would suggest that language models are conscious, then I feel like the main update should be about consciousness of language models rather than about the validity of the line of reasoning. If you think that e.g. fish are conscious based on analysis of thei... (read more)

2
splinter
1y
Thanks for this response. It seems like we are coming at this topic from very different starting assumptions. If I'm understanding you correctly, you're saying that we have no idea whether LLMs are conscious, so it doesn't make sense to draw any inferences from them to other minds. That's fair enough, but I'm starting from the premise that LLMs in their current form are almost certainly not conscious. Of course, I can't prove this. It's my belief based on my understanding of their architecture. I'm very much not saying they lack consciousness because they aren't instantiated in a biological brain. Rather, I don't think that GPUs performing parallel searches through a probabilistic word space by themselves are likely to support consciousness. Stepping back a bit: I can't know if any animal other than myself is conscious, even fellow humans. I can only reason through induction that consciousness is a feature of my brain, so other animals that have brains similar in construction to mine may also have consciousness. And I can use the observed output of those brains -- behavior -- as an external proxy for internal function. This makes me highly confident that, for example,  primates are conscious, with my uncertainty growing with greater evolutionary distance. Now along come LLMs to throw a wrench in that inductive chain. LLMs are -- in my view -- zombies that can do things previously only humans were capable of.  And the truth is, a mosquito's brain doesn't really have all that much in common with a human's. So now I'm even more uncertain -- is complex behavior really a sign for interiority? Does having a brain made of neurons really put lower animals on a continuum with humans? I'm not sure anymore. 

I don't think those objections to offsetting really apply to demand offsetting. If I paid someone for a high-welfare egg, I shouldn't think about my action as bringing an unhappy hen into existence and then "offsetting" that by making it better off. And that would be true even if I paid someone for a high-welfare egg, but then swapped my egg with someone else's normal egg. And by the same token if I pay someone to sell a high-welfare egg on the market labeled as a normal egg, and then buy a normal egg from the market, I haven't increased the net demand for normal eggs at all and so am not causally responsible for any additional factory-farmed hens.

1
Tyler Johnston
1y
I agree on this — what you bring up is more about the immediate logic of demand offsetting, and less about the optics or longer-term implications of demand offsetting. My first objection was that this doesn’t scale well to the broader public as OP mentioned (because to them you are voluntarily purchasing and eating a product from an animal you think was mistreated, while also sparing a totally separate animal, or two). So I don’t think it avoids the bad optics that things like murder offsets would carry. But it’s not that it doesn’t make a certain sense within the consequentialist framework (which I think it does, though I hesitate on account of the other objections I mentioned — how this would impact someone’s psychology long-term and the lack of some signaling effects in abstaining from low-welfare products).

The idea is: if I eat an egg and buy a certificate, I don't increase demand for factory farmed eggs, and if I buy 2 certificates per egg I actually decrease demand. So I'm not offsetting and causing harm to hens, I'm directly decreasing the amount of harm to hens. I think this is OK according to most moral perspectives, though people might find it uncompelling. (Not sure which particular objections you have in mind.)
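
(A toy version of that bookkeeping, with made-up quantities:)

```python
# Toy bookkeeping for demand offsetting: each certificate corresponds to one
# high-welfare egg sold into the ordinary market in place of a conventional egg.
eggs_eaten = 10           # conventional-market eggs I buy and eat
certificates_bought = 20  # high-welfare eggs paid for and sold as ordinary eggs

net_conventional_demand = eggs_eaten - certificates_bought
print(net_conventional_demand)  # -10: net demand for factory-farmed eggs goes down
```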

1
Tyler Johnston
1y
Right! I appreciated reading your post about this. I think the objection I find most relevant is that moral offsetting only seems intuitive to a subset of consequentialist-leaning people (who may be overrepresented on this forum), but strikes many as morally abhorrent, at least when it comes to harming living creatures. I guess carbon offsetting is more popular, but I don't think an offset for beating your dog would be widely admired, so I'm not sure what people would make of an offset for the treatment of farmed animals. But I think people who believe caged eggs are wrong, yet offset them so they can keep eating them, might not be granted much moral credibility by the wider public. I also think the other objections raised in the forum post are interesting: that it might be psychologically complicated to both eat animals raised under poor conditions and still aim to better their lot, and that the signaling effects of being vegan (or abstaining from particularly bad animal products, in your case) are probably underrated.

I'm particularly interested in demand offsetting for factory farming. I think this form of offsetting makes about as much sense as directly purchasing higher-welfare animal products.

2
Tyler Johnston
1y
Interesting — if you’d ever be interested in expanding on your post, I’d be curious to hear your response to the objections I bring up, or that are mentioned in the comments here.

The picture I was suggesting is: FTX leaves customer assets in customer accounts, FTX owes and is owed a ton of $ as part of its usual operation as a broker + clearinghouse, Alameda in particular owes them a huge amount of $ and is insolvent (requiring negligence on both risk management and accounting), FTX ends up owing customers $ that FTX can't pay, FTX starts grabbing customer assets to try to meet withdrawals.

I'm still at maybe 25-50% overall chance that their conduct is similar to a conventional broker+exchange+clearinghouse except for the two points... (read more)

7
RyanCarey
1y
One bit of clarifying info is that according to Sam, FTX wasn't just grabbing customer $ after Alameda became insolvent, but lent ~1/3 or more of the customer funds held to Alameda. And this happened whenever users deposited funds through Alameda, something we know was already happening years ago - from the early stages of FTX. The same gist comes across in interviewing from Coffeezilla: https://t.co/rMljwAqhDq

None of this is decision-relevant for me and waiting a few months or years makes complete sense, but alas "interesting" != "decision-relevant."

Another issue is the fiat that FTX owed customers. I think FTX did not say that the $ were being held as deposits (e.g. I think the terms of service explicitly say they aren't). Those numbers in accounts just represent the $ that FTX owes its customers. And if the main problem was a -$8B fiat balance, then it's less clear any of the expectations about assets and lending matters.

The statement "we never invest customer funds" is still deeply misleading but it's more clear how SBF could spin this as not being an outright lie: he could ask (i) "How much cash a... (read more)

3
RyanCarey
1y
The question is whether FTX's leadership knowingly misled for financial gain, right? We know that they said they weren't borrowing (T&C) or investing (public statements) users' digital assets, and that Alameda played by the same rules as other FTX users. They then took $8B of user-loaded money (most of which users did not ask to be lent) and knowingly lent it to Alameda, accepting FTT (approximately the company's own stock) as collateral, to make big crypto bets. Seems like just based on that, they might have defrauded the users in 4+ different ways. I think "we were net-lenders, and seemed to have enough collateral" might (assuming it is true) be a (partially) mitigating factor for the fourth point, but not the others?

One important accounting complexity is that there might be a lot of futures contracts that effectively represent borrow/lend without involving physical borrowing or lending. My impression is that FTX offered a lot of futures so this could be significant.

For example, a bunch of people might have had long BTC futures contracts, with Alameda having the short end of that contract. If Alameda is insolvent then it can't pay its side of the contract; usually FTX (which is both broker and clearinghouse) would make a margin call to ensure that Alameda has enough co... (read more)

It seems significant if only $3B of customers were earning interest on cash or assets, and the other customers had the option to opt in to lending but explicitly did not. I think all the other customers would have an extremely reasonable expectation of their assets just sitting there.  I'm not super convinced by the close reading of the terms of service but it seems like the common-sense case is strong.

I'm interested in understanding that $3B number and any relevant subtleties in the accounting.  I feel like if that number had been $15B then this... (read more)

Yeah, I also don't have anything else better. 

I am piecing some of this together from this alleged report of someone who worked at FTX in the two days before the end, from which I inferred that Caroline and Sam knew that they took an unprecedented step when they loaned out that customer money that nobody knew about, which seems like it just wouldn't really be the case if they had only used money from the people who had lending enabled: https://twitter.com/libevm/status/1592383746551910400 

I am definitely very interested in a better estimate of how many c... (read more)

Doesn't FTX pay interest on deposits and prominently offer margin loans? Do you have a citation for the claim that the terms of service excluded the prospect of lending? (All I've seen are some out-of-context screenshots.)

Why do you say "Alameda (FTX trading)"? Aren't these just separate entities?

3
RyanCarey
1y
You're right - fixed.

It's reported here: https://www.axios.com/2022/11/12/ftx-terms-service-trading-customer-funds

I've heard (unverified) that customer deposits were $16B and voluntary customer lending <$4B. It would make sense to me that a significant majority of customer funds were not voluntarily lent, based on the fact that returns from lending crypto were minimal, and lending was opt-in, and not pushed hard on the website.

It seems like FTX offered borrowing and lending to its clients, and this was prominently marketed.  I don't think you can call FTX offering margin loans to Alameda a "straightforward case of fraud" if they publicly offer margin loans to all of their clients. (There may be other ways in which their behavior was straightforwardly fraudulent, especially as they were falling apart, but I don't know.)

In general brokers can get wiped out by risk management failures. I agree this isn't just an "ordinary bank run," the bank run was just an exacerbating featur... (read more)

It seems like FTX offers borrowing and lending to its clients, and this was prominently marketed.  I don't think you can call FTX offering margin loans to Alameda a "straightforward case of fraud" if they publicly offer margin loans to all of their clients. (There may be other ways in which their behavior was straightforwardly fraudulent, especially as they were falling apart, but I don't know.)

I think Matt Levine mentioned something similar, but I don't think I understand this, so let me try to get some more clarity on this by thinking through it ste... (read more)

Per the screenshot at tweet 17 of this thread, it seems like 2.8B of customer assets were opted-in to lending. Not nearly enough to explain the amount that went to Alameda.

If someone is strongly considering donating to a charitable fund, I think they should usually instead participate in a donor lottery up to  say 5-10% of the annual money moved by that fund. If they win, they can spend more time deciding how to give (whether that means giving to the fund that they were considering, giving to a different fund, changing cause areas, supporting a charity directly,  participating in a larger lottery, saving in a donor-advised fund, or doing something altogether different).

I'm curious how you feel about that advice. Ob... (read more)
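
(To spell out the expected-value logic behind that recommendation, with made-up numbers: the expected dollars you direct are unchanged, while the research effort is concentrated on a single winner.)

```python
# Toy donor lottery: a $5k ticket buys a 5% chance of directing the full $100k pot.
my_donation = 5_000
pot_size = 100_000

win_probability = my_donation / pot_size
expected_dollars_directed = win_probability * pot_size
print(win_probability, expected_dollars_directed)  # 0.05, 5000.0 -- same EV as donating directly
```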

Thanks for the thoughtful comment.

I think there’s a strong theoretical case in favour of donation lotteries — Giving What We Can just announced our 2022/2023 lottery is open!

I see the case in favour of donation lotteries as relying on some premises that are often, but not always true:

  • Spending more time researching a donation opportunity increases the expected value of a donation.
  • Spending time researching a donation opportunity is costly, and a donation lottery allows you to only need to spend this time if you win.
  • Therefore, all else equal, it’s more impact
... (read more)

I would prefer more people give through donor lotteries rather than deferring to EA funds or some vague EA vibe. Right now I think EA funds do like $10M / year vs $1M / year through lotteries, and probably in my ideal world the lottery number would be at least 10x higher.

I think that EAs are consistently underrating the value of this kind of decentralization. With all due respect to EA funds I don't think it's reasonable to say "thinking more about how to donate wouldn't help because obviously I should just donate  100% to EA funds." (That said, I don... (read more)

3
Jason
1y
At present, the Funds evaluate hundreds of grant applications a year, generally seeking high four to low six figures in funding (median seems to be mid-five). In a world where the Funds were largely replaced by lotteries, where would those applicants go? Do we predict that the lottery winners would largely take over the funding niche currently filled by the Funds? If so, how would this change affect the likelihood of funding for smaller grant applicants?

Personally the FTX regrantor system felt like a nice middle ground between EA Funds and donor lotteries in terms of (de)centralization. I'd be excited to donate to something less centralized than EA Funds but more centralized than a donor lottery.

2
plex
1y
Strong agree, and note that it's not obvious after browsing both GWWC and EAF how to donate to lotteries currently, or when they'll be running next. I'd like to see them run more regularly and placed prominently on the websites.
2
John_Maxwell
1y
Is there somewhere we can see how the winners of donor lotteries have been donating their winnings?

Following up on this. Since May 2020 when these comments were written:

  • VT is up 13.5%/year.
  • VWO (emerging markets ETF) is up 3.6%/year.
  • VMOT is up 7%/year.

Comparing to your predictions:

  • You expected VMOT to outperform by about 6%, and suggested cutting that in half to 3% to be conservative, but it underperformed by 6.5%. So that's another 1.5-3 standard deviations of underperformance (rough arithmetic sketched below).
  • You expected VWO to significantly outperform, but it underperformed by about 10%/year.
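A rough sketch of the standard-deviation arithmetic behind that range, under assumptions that are mine rather than from the original comments (a VMOT-vs-VT tracking-error volatility of roughly 10-15%/year and a horizon of about 2.5 years):

```python
# Rough arithmetic behind "1.5-3 standard deviations of underperformance".
# Assumptions (mine, not from the comments): annual tracking-error volatility
# of VMOT versus VT of roughly 10-15%, over roughly 2.5 years since May 2020.
years = 2.5
expected_outperformance = 3.0    # % per year (the conservative prediction)
actual_outperformance = -6.5     # % per year (7% for VMOT vs 13.5% for VT)
gap = expected_outperformance - actual_outperformance  # 9.5 %/year shortfall

for tracking_error in (10.0, 15.0):
    stderr_of_annualized_mean = tracking_error / years ** 0.5
    print(f"vol {tracking_error}%: shortfall = {gap / stderr_of_annualized_mean:.1f} standard errors")

# With these particular assumptions the shortfall comes out around 1.0-1.5
# standard errors (closer to 2 if you use the original 6%/year prediction);
# different volatility and horizon assumptions move this around, which is
# presumably where the 1.5-3 range above comes from.
```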
2
MichaelDickens
1y
Your recent comment got me thinking more about this. Basically, I didn't think the last few years of underperformance was good evidence against factor investing, but I wasn't sure how to explain why. After thinking for a while, I think I have a better handle on it. You're (probably) smarter than me and you're better at statistics than I am, so I was kind of bothered at myself for not being able to convince you earlier, and I think your argument was better than mine. I want to try to come up with something more coherent.

Most of this comment is about estimating future expected returns, but one thing I will say first: you are right that my original estimate of standard deviation, where I took the historical standard deviation, was too optimistic. Originally I looked at vol over the last ~25 years and assumed it would stay the same. But I just looked at the performance of value and momentum back to 1926 and I noticed the last few decades have had a standard deviation 2–3 percentage points lower than through most of history. Higher standard deviation decreases optimal leverage. You were correct that I was relying too much on a short sample. (The return for a value/momentum/trend strategy back to 1926 is about the same as the return 1991–2017.) (I believe in my head I was thinking, "historical volatility is a good predictor of future volatility, so I can just take this sample and use that to predict future volatility." Which is somewhat true, but not good enough because volatility isn't that stable over time.)

Regarding future returns: between "nothing beats the market" on one side and "it's reasonable to expect a median +X% return over the market for this particular fund or similar funds" on the other side, I can see a few intermediate steps:

  1. Do value and momentum work at all?
  2. Will value and momentum work close to as well as they did historically (to within 50%)?
  3. Am I right about how well value and momentum worked historically?
  4. Does VMOT in particular
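On the point that higher standard deviation decreases optimal leverage, here is a minimal sketch using a Kelly/Merton-style rule (the return and volatility numbers are illustrative assumptions, not figures from the comment above): optimal leverage is roughly expected excess return divided by variance, so an extra few percentage points of volatility cuts the optimal leverage noticeably.

```python
def kelly_leverage(expected_excess_return, volatility):
    """Merton/Kelly-style optimal leverage: mu / sigma^2 (both as decimals)."""
    return expected_excess_return / volatility ** 2

# Illustrative assumptions: 5% expected excess return, volatility of 12% vs 15%.
print(kelly_leverage(0.05, 0.12))  # ~3.5x
print(kelly_leverage(0.05, 0.15))  # ~2.2x
```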

I still broadly endorse this post. Here are some ways my views have changed over the last 6 years:

  • At the time I wrote the OP I considered consequentialist evaluation the only rubric for judging principles like this, and the only reason we needed anything else was because of the intractability of consequentialist reasoning or moral uncertainty. I’m now more sympathetic to other moral intuitions and norms, and think my previous attempts to shoehorn them into a consequentialist justification involve some motivated cognition and philosophical error.
  • That said I
... (read more)
1
interstice
1y
What arguments/evidence caused you to be more hesitant about retaliation?

Earlier this year ARC received a grant for $1.25M from the FTX foundation. We now believe that this money morally (if not legally) belongs to FTX customers or creditors, so we intend to return $1.25M to them.

It may not be clear how to do this responsibly for some time depending on how bankruptcy proceedings evolve, and if unexpected revelations change the situation (e.g. if customers and creditors are unexpectedly made whole) then we may change our decision. We'll post an update here when we have a more concrete picture; in the meantime we will set aside t... (read more)

4
Paul_Christiano
19d
ARC returned this money to the FTX bankruptcy estate in November 2023.

I think the point of most non-profit boards is to ensure that donor funds are used effectively to advance the organization's charitable mission. If that's the case, then having donor representation on the board seems appropriate. Why would this represent a conflict of interest? My impression is that this is quite common amongst non-profits and is not considered problematic. (Note that Holden is on ARC's board.)

I'm also not sure this what the NYT author is objecting to. I think they would be equally unhappy with SBF claiming to have donated a lot, but it se... (read more)

I think the point of most non-profit boards is to ensure that donor funds are used effectively to advance the organization's charitable mission. If that's the case, then having donor representation on the board seems appropriate.

I don't see how this follows.

It is indeed very normal to have one or more donors on the board of a nonprofit. But FTX the for-profit organization did in fact have different interests than the FTX Foundation. For example, it was in the FTX Foundation's interest to not make promises to grantees that it could not honor. It was also... (read more)

7
Aaron C
1y
My original comment mentions 'can lead to hazards' because I didn't know the details of what went down and am relatively new (a couple of months) to the EA space. However, subsequent comments about individual regrantors needing to 'reinvent the wheel' themselves around COI policies (btw, I think Linch came to a good equilibrium for his particular situation) and other concerns raised in the past confirm, I think, the hunch that this is an area for improvement. Yes, that is one function of boards. Funders also ideally function objectively when choosing who to fund.

Not-so-hypothetical example: a foundation can fund grantee A or B. The foundation employee making the decision is a board member/trustee of B with fiduciary duties to B. Should the foundation employee fund A or B? Conflict. The employee should probably recuse themselves from the decision-making process and document it. If the employee were a board observer at B with no decision-making power, it might be OK, but it might still be a good idea to recuse themselves from the fund distribution process if they want to be safe / doubly above board.

Most folks who have been required to go through COI learning modules for work can attest to how fun they are <sarcasm>, but I guess I'm seeing the importance now. There are general good practices because of problems that arise with 'self-serving' behavior, and I think the policies of well-known foundations speak for themselves and are instructive if anyone wants to dive deeper. I included COI policies from the MacArthur Foundation and the Bill and Melinda Gates Foundation, and they are similar to policies in the government contracting and healthcare spaces, which are more regulated. For reference, MacArthur states that "a grant is material to an entity when the amount of the grant is in excess of five percent (5%) of the revenue of the entity." I suspect that is the case for several EA orgs, so it may be good to at least revisit their COI policies in light of recent events.

One key thing is a person's degree of decision making