All of David Johnston's Comments + Replies

I would take the proposal to be AI->growth->climate change or other negative growth side effects

I can see how this gets you E(Xi) for each item, but not a joint distribution. One of the advantages Ozzie raises is the possibility to keep track of correlations in value estimates, which requires more than the marginal expectations.

1
Jonas Moss
10mo
I'm not sure what you mean. I'm thinking about pairwise comparisons in the following way. (a) Every pair of items i,j has a true ratio of expectations E(Xi)/E(Xj) = μij. I hope this is uncontroversial. (b) We observe the variables Rij according to log Rij ∼ log μij + ϵij for some normally distributed ϵij. Error terms might be dependent, but that complicates the analysis. (And is most likely not worth it.) This step could be more controversial, as there are other possible models to use. Note that you will get a distribution over every E(Xi) too with this approach, but that would be in the Bayesian sense, i.e., p(E(Xi) ∣ comparisons), when we have a prior over E(Xi).

So constructing a value ratio table means estimating a joint distribution of values from a subset of pairwise comparisons, then sampling from the distribution to fill out the table?

In that case, I think estimating the distribution is the hard part. Your example is straightforward because it features independent estimates, or simple functional relationships.

1
Jonas Moss
10mo
Estimation is actually pretty easy (using linear regression), and has essentially been a solved problem since 1952: Scheffé, H. (1952). An Analysis of Variance for Paired Comparisons. Journal of the American Statistical Association, 47(259), 381–400. https://doi.org/10.1080/01621459.1952.10501179 I wrote about the methodology (before finding Scheffé's paper) here.
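
To make the estimation step concrete, here is a minimal sketch of recovering value ratios from noisy pairwise comparisons by least squares on log-ratios, in the spirit of the model above. The items, noise level, and comparison set are invented, and one item is pinned as the reference, since ratios only identify values up to a common scale.

```python
# Minimal sketch: recover value ratios from noisy pairwise comparisons via
# ordinary least squares on log-ratios (the model log R_ij ~ log mu_i - log mu_j + eps).
# All numbers here are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)

true_log_values = np.array([0.0, 1.2, 2.5, 3.1])   # log E(X_i) for 4 hypothetical items
n_items = len(true_log_values)
pairs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]    # the comparisons we happened to elicit
noise_sd = 0.3

# Observed log-ratios: log R_ij = log mu_i - log mu_j + noise
obs = np.array([true_log_values[i] - true_log_values[j] + rng.normal(0, noise_sd)
                for i, j in pairs])

# Design matrix: each row has +1 for item i and -1 for item j.
# Fix item 0's log-value to 0 (ratios only identify values up to a common scale),
# so we only estimate the remaining columns.
X = np.zeros((len(pairs), n_items))
for row, (i, j) in enumerate(pairs):
    X[row, i] += 1.0
    X[row, j] -= 1.0
X = X[:, 1:]                                        # drop the reference item's column

est, *_ = np.linalg.lstsq(X, obs, rcond=None)
est_log_values = np.concatenate([[0.0], est])       # reference item pinned at log-value 0

print("estimated ratios to item 0:", np.exp(est_log_values))
print("true ratios to item 0:     ", np.exp(true_log_values - true_log_values[0]))
```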

The only piece of literature I had in mind was von Neumann and Morgenstern’s representation theorem. It says: if you have a set of probability distributions over a set of outcomes, and for each pair of distributions you have a preference (one is better than the other, or you are indifferent), and if this relation satisfies the additional requirements of transitivity, continuity, and independence, then you can represent the preferences with a utility function that is unique up to positive affine transformation.

Given that this is a foundational result for expected ... (read more)

2
Ozzie Gooen
10mo
The value ratio table, as shown, is a presentation/visualization of the utility function (assuming you have joint distributions). The key question is how to store the information within the utility function. It's really messy to try to store meaningful joint distributions in regular ways, especially if you want to approximate said distributions using multiple pieces. It's especially hard to do this with multiple people, because then they would need to coordinate to ensure they are using the right scales. The value ratio functions are basically one specific way to store/organize and think about this information. I think this is feasible to work with, in order to approximate large utility functions without too many trade-offs. "Joint distributions on values where the scales are arbitrary" seem difficult to intuit/understand, so I think that typically representing them as ratios is a useful practice.
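
As a rough illustration of "a ratio table as a view on a joint distribution": the sketch below samples correlated values for three invented items and reads off medians of the sampled ratios. The distributions, correlations, and use of medians are all assumptions made purely for display.

```python
# Sketch: a value ratio table as a read-out of a joint distribution over item values.
# Items, distributions and correlations are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# A correlated joint distribution over the (log) values of three hypothetical items.
mean = np.array([0.0, 1.0, 3.0])
cov = np.array([[1.0, 0.8, 0.2],
                [0.8, 1.0, 0.2],
                [0.2, 0.2, 1.0]])
values = np.exp(rng.multivariate_normal(mean, cov, size=n))   # samples of V(item_k)

items = ["item A", "item B", "item C"]
print("median of V(row) / V(col):")
for i, name in enumerate(items):
    row = [np.median(values[:, i] / values[:, j]) for j in range(len(items))]
    print(f"{name:>7}: " + "  ".join(f"{r:8.2f}" for r in row))
```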

Because we are more likely to see no big changes than to see another big change.

if the risk is usually quite low (e.g. 0.001 % per century), but sometimes jumps to a high value (e.g. 1 % per century), the cumulative risk (over all time) may still be significantly below 100 % (e.g. 90 %) if the magnitude of the jumps decreases quickly, and risk does not stay high for long.

I would call this model “transient deviation” rather than “random walk” or “regular oscillation”

We can still get H4 if the amplitude of the oscillation or random walk decreases over time, right?

The average needs to fall, not the amplitude. If we're looking at risk in percentage points (rather than, say, logits, which might be a better parametrisation), small average implies small amplitude, but small amplitude does not imply small average.

Only if the sudden change has a sufficiently large magnitude, right?

The large magnitude is an observation - we have seen risk go from quite low to quite high over a short period of time. If we expect such large magnitude changes to be rare, then we might expect the present conditions to persist.

2
Vasco Grilo
10mo
Thanks for the clarifications! Agreed. I meant that, if the risk is usually quite low (e.g. 0.001 % per century), but sometimes jumps to a high value (e.g. 1 % per century), the cumulative risk (over all time) may still be significantly below 100 % (e.g. 90 %) if the magnitude of the jumps decreases quickly, and risk does not stay high for long. Why should we expect the present conditions to persist if we expect large magnitude changes to be rare?
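
To make the cumulative-risk point above concrete, here is a toy calculation using the 0.001 %/century baseline and 1 %/century jumps from the comment, with an invented schedule in which each later jump is half as large and lasts a single century; over a long but finite window the cumulative risk stays well below 100 %.

```python
# Sketch: cumulative existential risk when per-century risk is usually tiny but
# occasionally jumps, with the jump magnitude decaying over time. The 0.001% and 1%
# figures come from the comment; the jump schedule, decay rate, and horizon are invented.
import numpy as np

baseline = 0.00001                        # 0.001% risk per century
first_jump = 0.01                         # 1% risk per century during a jump
decay = 0.5                               # each later jump is half as risky as the previous one
horizon = 10_000                          # centuries considered (finite window)
jump_centuries = range(0, horizon, 500)   # a jump every 500 centuries (arbitrary)

risk = np.full(horizon, baseline)
for k, t in enumerate(jump_centuries):
    risk[t] = first_jump * decay**k       # jumps last one century here ("not high for long")

cumulative = 1 - np.prod(1 - risk)
print(f"cumulative risk over {horizon:,} centuries: {cumulative:.1%}")
```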

FWIW I think the general kind of model underlying what I’ve written is a joint distribution that models value something like

Thought about this some more. This isn't a summary of your work, it's an attempt to understand it in my terms. Here's how I see it right now: we can use pairwise comparisons of outcomes to elicit preferences, and people often do, but they typically choose to insist that each outcome has a value representable as a single number and use the pairwise comparisons to decide which number to assign each outcome. Insisting that each outcome has a value is a constraint on preferences that can allow us to compute which outcome is preferred between two outcomes for w... (read more)

1
Jonas Moss
10mo
I don't understand your notion of context here. I'm understanding pairwise comparisons as standard decision theory - you are comparing the expected values of two lotteries, nothing more. Is the context about psychology somehow? If so, that might be interesting, but it adds a layer of complexity this sort of methodology cannot be expected to handle. Players may have different utility functions, but that might be reasonable to ignore when modelling all of this. In any case, every intervention A_i will have its own, unique, expected utility from each player p, hence x^p_ij = E[U_p(A_i)] / E[U_p(A_j)] = 1 / x^p_ji. (This is ignoring noise in the estimates, but that is pretty easy to handle.)
2
Ozzie Gooen
10mo
There's a lot here, and it will take me some time to think about. It seems like you're coming at this from the lens of the pairwise comparison literature. I was coming at this from the lens of (what I think is) simpler expected value maximization foundations. I've spent some time trying to understand the pairwise comparison literature, but haven't gotten very far. What I've seen has been focused very much on (what seems to me) like narrow elicitation procedures. As you stated, I'm more focused on representation. "Tables of value ratios" are meant to be a natural extension of "big lists of expected values". You could definitely understand a "list of expected value estimates" to be a function that helps convey certain preferences, but it's a bit of an unusual bridge, outside the pairwise comparison literature. On contexts: You spend a while expressing the importance of clear contexts. I agree that precise contexts are important. It's possible that the $1 example I used was a bit misleading - the point I was trying to make is that many value ratios will be less sensitive to changes in context than absolute values (the typical alternative in expected value theory) would be. Valuing V($5)/V($1) should give fairly precise results for people of many different income levels. This wouldn't be the case if you tried converting dollars to a common unit of QALYs or something first, before dividing. Now, I could definitely see people from the discrete choice literature saying, "of course you shouldn't first convert to QALYs, instead you should use better mathematical abstractions to represent direct preferences". In that case I'd agree; there's just a somewhat pragmatic set of choices about which abstractions give a good fit of practicality and specificity. I would be very curious if people from this background would suggest other approaches to large-scale, collaborative estimation, as I'm trying to achieve here. I would expect that with Relative Value estimation,
1
David Johnston
10mo
FWIW I think the general kind of model underlying what I’ve written is a joint distribution that models value something like P(c|outcome)P(V|c,outcome)

I don't think it's all you are doing, that's why I wrote the rest of my comment (sorry to be flippant).

The point of  bringing up binary comparisons is that a table of binary comparisons is a more general representation than a single utility function.

If all we are doing is binary comparisons between a set of items, it seems to me that it would be sufficient to represent relative values as a binary - i.e., is item1 better, or item2? Or perhaps you want a ternary function - you could also say they're equal.

Using a ratio instead of a binary indicator for relative values suggests that you want to use the function to extrapolate. I'm not sure that this approach helps much with that, though. For example,

costOfp001DeathChance = ss(10 to 10k) // Cost of a 0.001% chance of death, in dol

... (read more)
2
Ozzie Gooen
10mo
Why do you think this is all we're doing? We often want to know how much better some items are than others - relative values estimate this information.  You can think of relative values a lot like "advanced and scalable expected value calculations". There are many reasons to actually know the expected value of something. If you want to do extrapolation ("The EV of one person going blind is ~0.3 QALYs/year, so the EV of 20 people going blind is probably..."), it's often not too hard to ballpark it. Related, businesses often use dollar approximations of the costs of very different things. This is basically a set of estimates of the value of the cost. 

AFAIK the official MIRI solution to AI risk is to win the race to AGI but do it aligned.

Part of the MIRI theory is that winning the AGI race will give you the power to stop anyone else from building AGI. If you believe that, then it’s easy to believe that there is a race, and that you sure don’t want to lose.

It cannot both be controllable because it’s weak and also uncontrollable.

That said, I expect more advanced techniques will be needed for more advanced AI; I just think control techniques probably keep up without sudden changes in control requirements.

Also LLMs are more controllable than weaker older designs (compare GPT4 vs Tay).

5
Greg_Colbourn
1y
Yes. This is no comfort for me in terms of p(doom|AGI). There will be sudden changes in control requirements, judging by the big leaps of capability between GPT generations. More controllable is one thing, but it doesn't really matter much for reducing x-risk when the numbers being talked about are "29%".

I’d love to hear from people who don’t “have adhd”. I have a diagnosis myself but I have trouble believing I’m all that unusual. I tried medication for a while, but I didn’t find it that helpful with regard to the bottom line outcome of getting things done, and I felt uncomfortable with the idea of taking stimulants regularly for many years. I’d certainly benefit from being more able to finish projects, though!

People will continue to prefer controllable to uncontrollable AI  and continue to make at least a commonsense level of investment in controllability; that is, they invest as much as naively warranted by recent experience and short term expectations, which is less than warranted by a sophisticated assessment of uncertainty about misalignment, though the two may converge as “recent experience” involves more and more capable AIs. I think this minimal level of investment in control is very likely (99%+).

Next, the proposed sudden/surprising phase transitio... (read more)

3
Greg_Colbourn
1y
Thanks for answering! The only reason AI is currently controllable is that it is weaker than us. All the GPT-4 jailbreaks show how high the uncontrollability potential is, so I don't think a phase transition is necessary, as we are still far from AI being controllable in the first place.

I'm writing quickly because I think this is a tricky issue and I'm trying not to spend too long on it. If I don't make sense, I might have misspoken or made a reasoning error.

One way I thought about the problem (quite different to yours, very rough): variation in existential risk rate depends mostly on technology. At a wide enough interval (say, 100 years of tech development at current rates), change in existential risk with change in technology is hard to predict, though following Aschenbrenner and Xu's observations it's plausible that it tends to some eq... (read more)

9
Linch
1y
Appreciate your comments! (As an aside, it might not make a difference mathematically, but numerically one possible difference between us is that I think of the underlying unit as ~logarithmic rather than linear.) Agreed, an important part of my model is something like nontrivial credence in a) the technological completion conjecture and b) there aren't "that many" technologies lying around to be discovered. So I zoom in and think about technological risks; a lot of my (proposed) model is thinking about a) the underlying distribution of scary vs worldsaving technologies, b) whether/how much the world is prepared for each scary technology as it appears, and c) how sharp the dropoff in lethality from each new scary technology is, conditional upon survival in the previous timestep. I think I probably didn't make the point well enough, but roughly speaking, you only care about worlds where you survive, so my guess is that you'll systematically overestimate longterm risk if your mixture model doesn't update on survival at each time step to be evidence that survival is more likely on future time steps. But you do have to be careful here. Yeah I think this is true. A friend brought up this point: roughly, the important part of your risk reduction comes from temporarily vulnerable worlds. But if you're not careful, you might "borrow" your risk-reduction from permanently vulnerable worlds (giving yourself credit for many microextinctions averted), and also "borrow" your EV_of_future from permanently invulnerable worlds (giving yourself credit for a share of an overwhelmingly large future). But to the extent those are different and anti-correlated worlds (which accords with David's original point, just a bit more nuanced), then your actual EV can be a noticeably smaller slice.
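
A toy version of the "update on survival" point: in a two-component mixture over a fragile and a robust world (all numbers invented), the next-period hazard implied by the posterior falls as surviving periods accumulate, while a model that never conditions on survival keeps the higher prior hazard.

```python
# Sketch: why conditioning on survival matters in a mixture model of existential risk.
# Two hypothetical worlds: "fragile" (high per-period risk) and "robust" (low risk),
# with a 50/50 prior. All numbers are invented for illustration.
import numpy as np

r_fragile, r_robust = 0.20, 0.001
prior = np.array([0.5, 0.5])
risks = np.array([r_fragile, r_robust])

for t in [0, 5, 10, 20]:
    # Posterior over worlds after observing survival through t periods.
    likelihood = (1 - risks) ** t
    posterior = prior * likelihood
    posterior /= posterior.sum()

    hazard_updated = posterior @ risks   # next-period risk, conditioned on survival so far
    hazard_naive = prior @ risks         # next-period risk if you never update on survival
    print(f"after surviving {t:>2} periods: "
          f"updated hazard = {hazard_updated:.3f}, un-updated hazard = {hazard_naive:.3f}")
```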

I don't see this. First, David's claim is that a short time of perils with low risk thereafter seems unlikely - which is only a fraction of hypothesis 4, so I can easily see how you could get H3+H4_bad:H4_good >> 10:1

I don't even see why it's so implausible that H3 is strongly preferred to H4. There are many hypotheses we could make about time varying risk:

 - Monotonic trend (many varieties)

 - Oscillation (many varieties)

 - Random walk (many varieties)

 - ...

If we aren't trying to carefully consider technological change (and ignori... (read more)

7
Vasco Grilo
10mo
Hi David, We can still get H4 if the amplitude of the oscillation or random walk decreases over time, right? Only if the sudden change has a sufficiently large magnitude, right?

Fair overall. I talked to some other people, and I think I missed the oscillation model when writing my original comment, which in retrospect is a pretty large mistake. I still don't think you can buy that many 9s on priors alone, but sure, if I think about it more maybe you can buy 1-3 9s. :/ 

First, David's claim is that a short time of perils with low risk thereafter seems unlikely.

Suppose you were put to cryogenic sleep. You wake up in the 41st century. Before learning anything about this new world, is your prior really[1] that the 41st centur... (read more)

When I read your scripts and Rob is interviewing, I like to read Rob’s questions at twice the speed of the interviewees’ responses. Can you accommodate that with your audio version?

Thanks for the suggestion David! We're discussing adding this as a premium feature — perhaps activated only for Giving What We Can members.

Now I want to see how much I like honey-drenched fat

1
vaniver
1y
I think the 'traditional fine dining' experience that comes closest to this is Peking Duck. Most of my experience has been with either salt-drenched cooked fat or honey-dusted cooked fat; I'll have to try smoking something and then applying honey to the fat cap before I eat it. My experience is that it is really good but also quickly becomes unbalanced / no longer good; some people, on their first bite, already consider it too unbalanced to enjoy. So I do think there's something interesting here where there is a somewhat subtle taste mechanism (not just optimizing for 'more' but somehow tracking a balance) that ice cream seems to have found a weird hole in. [edit: for my first attempt at this, I don't think the honey improved it at all? I'll try it again tho.]

I have children, and I would precommit to enduring the pain without hesitation, but I don’t know what I would do in the middle of experiencing the pain. If pain is sufficiently intense, “I” am not in charge any more, and whatever part of me is in charge, I don’t know very well how it would act.

1
LGS
1y
Oh, I didn't mean for you to make the decision in the middle of pain! The scenario is: first, you experience 5 minutes of pain. Then take a 1 hour break. Then decide: 1 hour pain, or dead child. No changing your mind once you've decided. The possibility that pain may twist your brain into taking actions you do not endorse when not under duress is interesting, but not particularly morally relevant. We usually care about informed decisions not made under duress.

I have the complete opposite intuition: equal levels of pain are harder to endure for equal time if you have the option to make them stop. Obviously I don’t disagree that pain for a long time is worse than pain for a short time.

This intuition is driven by experiences like: the same level of exercise fatigue is a lot easier to endure if giving up would cause me to lose face. In general, exercise fatigue is more distracting than pain from injuries (my reference points being a broken finger and a cup of boiling water in my crotch - the latter being about as d... (read more)

2
Matt Goodman
1y
Your comment made me realise I'm actually talking about two different things: * When you can choose to end the pain at any point e.g.  exercise, the hand-in-cold-water experiment. * When you can't choose to end the pain, but you know that it will end soon with some degree of certainty. e.g. "medics will be here with morphine in 10 minutes", or "we can see the head, the baby's almost out". I agree with you that having some kind of peer pressure or social credit for 'doing well' can help a person withstand pain. I'd imagine this has an effect on the hand-in-cold-water experiment, if you're doing it on your own vs as part of a trial with onlookers.

Conditional on AGI being developed by 2070, what is the probability that humanity will suffer an existential catastrophe due to loss of control over an AGI system?

Requesting a few clarifications:

  • I think of existential catastrophes as things like near-term extinction rather than things like "the future is substantially worse than it could have been". Alternatively, I tend to think that existential catastrophe means a future that's much worse than technological stagnation, rather than one that's much worse than it would have been with more aligned AI. What d
... (read more)
4
Jason Schukraft
1y
Hi David, Thanks for your questions. We're interested in a wide range of considerations. It's debatable whether human-originating civilization failing to make good use of its "cosmic endowment" constitutes an existential catastrophe. If you want to focus on more recognizable catastrophes (such as extinction, unrecoverable civilizational collapse, or dystopia) that would be fine. In a similar vein, if you think there is an important scenario in which humanity suffers an existential catastrophe by collectively losing control over an ecosystem of AGIs, that would also be an acceptable topic. Let me know if you have any other questions!

 I think journalists are often imprecise and I wouldn't read too much into the particular synonym of "said" that was chosen.

Does it make more sense to think about all probability distributions that offer a probability of 50% for rain tomorrow? If we say this represents our epistemic state, then we're saying something like "the probability of rain tomorrow is 50%, and we withhold judgement about rain on any other day".

2
Harrison Durland
1y
It feels more natural, but I’m unclear what this example is trying to prove. It still reads to me like “if we think rain is 50% likely tomorrow then it makes sense to say rain is 50% likely tomorrow” (which I realize is presumably not what is meant, but it’s how it feels).

I think this question - whether it's better to take 1/n probabilities (or maximum entropy distributions or whatever) or to adopt some "deep uncertainty" strategy - does not have an obvious answer

3
Harrison Durland
1y
I actually think it probably (pending further objections) does have a somewhat straightforward answer with regards to the rather narrow, theoretical cases that I have in mind, which relate to the confusion I had which started this comment chain. It’s hard to accurately convey the full degree of my caveats/specifications, but one simple example is something like “Suppose you are forced to choose whether to do X or nothing (Y). You are purely uncertain whether X will lead to outcome Great (Q), Good (P), or Bad (W), and there is guaranteed to be no way to get further information on this. However, you can safely assume that outcome Q is guaranteed to lead to +1,000 utils, P is guaranteed to lead to +500 utils, and W is guaranteed to lead to -500 utils. Doing nothing is guaranteed to lead to 0 utils. What should you do, assuming utils do not have non-linear effects?” In this scenario, it seems very clear to me that a strategy of “do nothing” is inferior to doing X: even though you don’t know what the actual probabilities of Q, P, and W are, I don’t understand how the 1/n default will fail to work (across a sufficiently large number of 1/n cases). And when taking the 1/n estimate as a default, the expected utility is positive. Of course, outside of barebones theoretical examples (I.e., in the real world) I don’t think there is a simple, straightforward algorithm for deciding when to pursue more information vs. act on limited information with significant uncertainty.
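
Spelling out the arithmetic this example relies on, under the stated assumption of a 1/3 probability on each outcome:

```python
# With pure 1/n ignorance over the three outcomes, doing X still beats doing nothing
# in expectation (payoffs taken directly from the example above).
payoffs = {"Q (great)": 1000, "P (good)": 500, "W (bad)": -500}
ev_x = sum(payoffs.values()) / len(payoffs)   # 1/n probability on each outcome
print(f"EV of doing X under 1/3-1/3-1/3: {ev_x:+.0f} utils (vs 0 for doing nothing)")
```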

Perhaps I’m just unclear what it would even mean to be in a situation where you “can’t” put a probability estimate on things that does as good as or better than pure 1/n ignorance.

Suppose you think you might come up with new hypotheses in the future which will cause you to reevaluate how the existing evidence supports your current hypotheses. In this case probabilistically modelling the phenomenon doesn’t necessarily get you the right “value of further investigation” (because you’re not modelling hypothesis X), but you might still be well advised to hol... (read more)

3
Harrison Durland
1y
I basically agree (although it might provide a decent amount of information to this end), but this does not reject the idea that you can make a probability estimate equally or more accurate than pure 1/n uncertainty. Ultimately, if you want to focus on “what is the expected value of doing further analyses to improve my probability estimates,” I say go for it. You often shouldn’t default to accepting pure 1/n ignorance. But I still can’t imagine a situation that truly matches “Level 4 or Level 5 Uncertainty,” where there is nothing as good as or better than pure 1/n ignorance. If you truly know absolutely and purely nothing about a probability distribution—which almost never happens—then it seems 1/n estimates will be the default optimal distribution, because anything else would require being able to offer supposedly-nonexistent information to justify that conclusion. Ultimately, a better framing (to me) would seem like “if you find yourself at 1/n ignorance, you should be careful not to accept that as a legitimate probability estimate unless you are really rock solid confident it won’t improve.” No?

Fair enough, she mentioned Yudkowsky before making this claim and I had him in mind when evaluating it (incidentally, I wouldn't mind picking a better name for the group of people who do a lot of advocacy about AI X-risk if you have any suggestions)

I skimmed from 37:00 to the end. It wasn't anything groundbreaking. There was one incorrect claim ("AI safetyists encourage work at AGI companies"), I think her apparent moral framework that puts disproportionate weight on negative impacts on marginalised groups is not good, and overall she comes across as someone who has just begun thinking about AGI x-risk and so seems a bit naive on some issues. However, "bad on purpose to make you click" is very unfair.

But also: she says that hyping AGI encourages races to build AGI. I think this is true! Large languag... (read more)

5
Sarah Levin
1y
"AI safetyists" absolutely do encourage work at AGI companies. To take one of many examples, 80,000 Hours are "AI safetyists", and their job board currently encourages work at OpenAI, Deepmind, and Anthropic, which are AGI companies. (I haven't watched the video.)

I think it's quite sensible that people hoping to have a positive impact in biosecurity should become well-informed first. However, I don't think this necessarily means that radical positions that would ban a lot of research are necessarily wrong, even if they are more often supported by people with less detailed knowledge of the field. I'm not accusing you of saying this, I just want to separate the two issues.

 Many professionals in this space are scared and stressed. Adding to that isn’t necessarily building trust and needed allies. The professional

... (read more)
3
Elika
1y
Thanks! I do broadly agree with your points. I linked  reference 6 as an example of the benefits and nuances of dual-use research, but don't / shouldn't comment on COVID-19 origins and their views expressed on it.

I do worry about it. Some additional worries I have are 1) if AI is transformative and confers strong first mover advantages, then a private company leading the AGI race could quickly become similarly powerful to a totalitarian government and 2) if the owners of AI depend far less on support from people for their power than today’s powerful organisations, they might be generally less benevolent than today’s powerful organisations

I think they do? Nate at least says he’s optimistic about finding a solution given more time

I'm not sold on how well calibrated their predictions of catastrophe are, but I think they have contributed a large number of novel & important ideas to the field.

The main point I took from the video was that Abigail is kinda asking the question: "How can a movement that wants to change the world be so apolitical?" This is also a criticism I have of many EA structures and people.

I think it's surprising that EA is so apolitical, but I'm not convinced it's wrong to make some effort to avoid issues that are politically hot. Three reasons to avoid such things: 1) they're often not the areas where the most impact can be had, even ignoring constraints imposed by them being hot political topics 2) being hot political topics ma... (read more)

3
FJehn
1y
It seems to me that we are talking about different definitions of what political means. I agree that in some situations it can make sense not to chip in on political discussions, to not get pushed to one side. I also see that there are some political issues where EA has taken a stance, like animal welfare. However, when I say political I mean: what are the reasons for us doing things, and how do we convince other people of them? In EA there are often arguments that something is not political because there has been an "objective" calculation of value. However, there is almost never a justification for why something was deemed important, even though, when you want to change the world in a different way, this is the important part. Or, on a more practical level, why are QALYs seen as the best way to measure outcomes in many cases? Using this and not another measure is a choice which has to be justified.

Is the reason you don’t go back and forth about whether ELK will work in the narrow sense Paul is aiming for a) you’re seeking areas of disagreement, and you both agree it is difficult or b) you both agree it is likely to work in that sense?

My intuition for why "actions that have effects in the real world" might promote deception is that maybe the "no causation without manipulation" idea is roughly correct. In this case, a self-supervised learner won't develop the right kind of model of its training process, but the fine-tuned learner might.

I think "no causation without manipulation" must be substantially wrong. If it was entirely correct, I think one would have to say that pretraining ought not to help achieve high performance on a standard RLHF objective, which is obviously false. It still ... (read more)

I think your first priority is promising and seemingly neglected (though I'm not familiar with a lot of work done by governance folk, so I could be wrong here). I also get the impression that MIRI folk believe they have an unusually clear understanding of risks, would like to see risky development slow down and are pessimistic about their near-term prospects for solving technical problems of aligning very capable intelligent systems and generally don't see any clearly good next steps. It appears to me that this combination of skills and views positions the... (read more)

2
Davidmanheim
1y
I don't think they claim to have better longer-term prospects, though.
0
Guy Raveh
1y
"Believe" being the operative word here. I really don't think they do.

If a model is deceptively aligned after fine-tuning, it seems most likely to me that it's because it was deceptively aligned during pre-training.

How common do you think this view is? My impression is that most AI safety researchers think the opposite, and I’d like to know if that’s wrong.

I’m agnostic; pretraining usually involves a lot more training, but also fine tuning might involve more optimisation towards “take actions with effects in the real world”.

6
Paul_Christiano
1y
I don't know how common each view is. My guess would be that in the old days this was the more common view, but there's been a lot more discussion of deceptive alignment recently on LW. I don't find the argument about "take actions with effects in the real world" --> "deceptive alignment" convincing, and my current guess is that most people would also back off from that style of argument if they thought about the issues more thoroughly. Mostly though it seems like this will just get settled by the empirics.

All of these comments are focused on my third core argument. What do you think of the other two? They all need to be wrong for deceptive alignment to be a likely outcome. 

Yeah, this is just partial feedback for now.

Recall that in this scenario, the model is not situationally aware yet, so it can't be deceptive. Why would making the goal long-term increase immediate-term reward? If the model is trying to maximize immediate reward, making the goal longer-term would create a competing priority.

I think I don't accept your initial premise. Maybe a mod... (read more)

6
DavidW
1y
Excellent, I look forward to hearing what you think of the rest of it! Are you talking about the model gaining situational awareness from the prompt rather than gradients? I discussed this in the second two paragraphs of the section we’re discussing. What do you think of my arguments there? My point is that a model will understand the base goal before being situational aware, not that it can’t become situationally aware at all.  My central argument is about processes through which a model gains capabilities necessary for deception. If you assume it can be deceptive, then I agree that it can be deceptive, but that’s a trivial result. Also, if the goal isn’t long-term, then the model can’t be deceptively aligned. The original post argument we’re discussing is about how situational awareness could emerge. Again, if you assume that it has situational awareness, then I agree it has situational awareness. I’m talking about how a pre-situationally aware model could become situationally aware.  Also, if the model is situationally aware, do you agree that its expectations about the effect of the gradient updates are what matters, rather than the gradient updates themselves? It might be able to make predictions that are significantly better than random, but very specific predictions about the effects of updates, including the size of the effect, would be hard, for many of the same reasons that interpretability is hard.  Are you arguing that an aligned model could become deceptively aligned to boost training performance? Or are you saying making the goal longer-term boosts performance? I’d be interested to hear what you think of the first post when you get a chance. Thanks for engaging with my ideas!  

Gradient descent can only update the model in the direction that improves performance hyper-locally. Therefore, building the effects of future gradient updates into the decision making of the current model would have to be advantageous on the current training batch for it to emerge from gradient descent.

I think the standard argument here would be that you've got the causality slightly wrong. In particular: pursuing long term goals is, by hypothesis, beneficial for immediate-term reward, but pursuing long term goals also entails considering the effects of f... (read more)

6
DavidW
1y
All of these comments are focused on my third core argument. What do you think of the other two? They all need to be wrong for deceptive alignment to be a likely outcome.  Recall that in this scenario, the model is not situationally aware yet, so it can't be deceptive. Why would making the goal long-term increase immediate-term reward? If the model is trying to maximize immediate reward, making the goal longer-term would create a competing priority.  This isn't necessarily true. Humans frequently plan for their future without thinking about how their own values will be affected and how that will affect their long-term goals. Why wouldn't a model do the same thing? It seems very plausible that a model could have crude long-term planning without yet modeling gradient descent updates.  The relevant factor here is actually how much the model expects its future behavior to change from a gradient update, because the model doesn't yet know the effect of the upcoming gradient update. Models won't necessarily be good at anticipating their own gradients or their own internal calculations. The effect sizes of gradient updates are hard to predict, so I would expect the model's expectation to be much more continuous than the actual gradients. Do you agree? The difficulty of gradient prediction should also make it harder for the model to factor in the effects of gradient updates.  Agreed, but I still expect that to have a limited impact if you're looking over a relatively short-term period. It's not guaranteed, but it's a reasonable expectation.  It seems to me like speed of changing goals depends more on the number of differential adversarial examples and how different the reward is for them. Gradient descent can update in every direction at once. If updating its proxies helps performance, I see no reason why gradient descent wouldn't update the proxies.  If it did, that would be great! Understanding the base objective (the researchers' training goal) early on is an import

I think your title might be causing some unnecessary consternation.  "You don't need to maximise utility to avoid domination" or something like that might have avoided a bit of confusion.

and I would urge the author to create an actual concrete situation that doesn't seem very dumb in which a highly intelligent, powerful and economically useful system has non-complete preferences

I'd be surprised if you couldn't come up with situations where completeness isn't worth the cost - e.g. something like: to close some preference gaps you'd have to think for 100x as long, but if you close them all arbitrarily then you end up with intransitivity.

6
Michael_PJ
1y
This seems like a great point. Completeness requires closing all preference gaps, but if you do that inconsistently and violate transitivity then suddenly you are vulnerable to money-pumping.

I wonder if it is possible to derive expected utility maximisation type results from assumptions of "fitness" (as in, evolutionary fitness). This seems more relevant to the AI safety agenda - after all, we care about which kinds of AI are successful, not whether they can be said to be "rational".  It might also be a pathway to the kind of result AI safety people implicitly use - not that agents maximise some expected utility, but that they maximise utilities which force a good deal of instrumental convergence (i.e. describing them as expected utility ... (read more)

3
Dan H
1y
I agree fitness is a more useful concept than rationality (and more useful than an individual agent's power), so here's a document I wrote about it: https://drive.google.com/file/d/1p4ZAuEYHL_21tqstJOGsMiG4xaRBtVcj/view

Fixing the “I pit my evidence against itself” problem is easy enough once I’ve recognized that I’m doing this (or so my visualizer suggests); the tricky part is recognizing that I’m doing it.

One obvious exercise for me to do here is to mull on the difference between uncertainty that feels like it comes from lack of knowledge, and uncertainty that feels like it comes from tension/conflict in the evidence. I think there’s a subjective difference, that I just missed in this case, and that I can perhaps become much better at detecting, in the wake of this hars

... (read more)

That being said, polyamory/kink is very often used as a tool of social pressure by predators to force women into a bad choice of either a situation they would not have otherwise agreed to or being called “close minded” and potentially withheld social/career opportunities.

Are such threats believable? Is there a broader culture where people feel that they’re constantly under evaluation such that personal decisions like this are plausibly taken into account for some career opportunities, or is this something that arises mainly where the career opportunities are within someone’s personal fiefdom?

What you're saying here resonates with me, but I wonder if there are people who  might be more inclined to assume they're missing something and consequently have a different feeling about what's going on when they're in the situation you're trying to describe. In particular, I'm thinking about people prone to imposter syndrome. I don't know what their feeling in this situation would be - I'm not prone to imposter syndrome - but I think it might be different.

I would have thought that "all conjectures" is a pretty natural reference class for this problem, and Laplace is typically used when we don't have such prior information - though if the resolution rate diverges substantially from the Laplace rule prediction I think it would still be  interesting.

I think, because we expect the resolution rate of different conjectures to be correlated, this experiment is a bit like a single draw from a distribution over  annual resolution probabilities rather than many draws from such a distribution ( if you can forgive a little frequentism).

2
NunoSempere
1y
I agree, but then you'd have to come up with a dataset of conjectures. Yep! I think that my thinking here is: * We could model the chance of a conjecture being resolved with reference to internal details. For instance, we could look at the increasing number of mathematicians, at how hard a given conjecture seems, etc. * However, that modelling is tricky, and in some cases the assumptions could be ambiguous. * But we could also use Laplace's rule of succession. This has the disadvantage that it doesn't capture the inner structure of the model, but it has the advantage that it is simple, and perhaps more robust. The question is, does it really work? And then I was looking at one particular case which I thought could be somewhat informative. * I think I used to like Laplace's law a bit more in the past, for some of those reasons. But I now like it a bit less, because maybe it fails to capture the inner structure of what it is predicting. I agree. On the other hand, I kind of expect it to be informative nonetheless.
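
For reference, a small sketch of the Laplace's-rule baseline being discussed: each year a conjecture stays open is treated as a failed trial under a uniform prior on the resolution rate. The "100 years open" figure is just an illustrative input, not a claim about any particular conjecture.

```python
# Sketch: Laplace's rule of succession applied to conjecture resolution.
# Treat each year open as a failed trial with a uniform (Beta(1,1)) prior on the rate.
def p_resolved_next_year(years_open: int) -> float:
    # Rule of succession with 0 successes in `years_open` trials: (0 + 1) / (n + 2).
    return 1 / (years_open + 2)

def p_resolved_within(years_open: int, horizon: int) -> float:
    # P(at least one success in the next `horizon` trials), integrating over the
    # Beta(1, years_open + 1) posterior: 1 - (n + 1) / (n + horizon + 1).
    return 1 - (years_open + 1) / (years_open + horizon + 1)

print(f"open 100 years -> P(resolved next year)      = {p_resolved_next_year(100):.3%}")
print(f"open 100 years -> P(resolved within 50 years) = {p_resolved_within(100, 50):.1%}")
```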

I think to properly model Ord’s risk estimates, you have to account for the fact that they incorporate uncertainty over the transition rate. Otherwise I think you’ll overestimate the rate at which risk compounds over time, conditional on no catastrophe so far.

I think Gary Marcus seems to play the role of an “anti-AI-doom” figurehead much more than Timnit Gebru. I don’t even know what his views on doom are, but he has established himself as a prominent critic of “AI is improving fast” views and seemingly gets lots of engagement from the safety community.

I also think Marcus’ criticisms aren’t very compelling, and so the discourse they generate isn’t terribly valuable. I think similarly of Gebru’s criticism (I think it’s worse than Marcus’, actually), but I just don’t think it has as much impact on the safety community.

Some quick thoughts: A crude version of the vulnerable world hypothesis is “developing new technology is existentially dangerous, full stop”, in which case advanced AI that increases the rate of new technology development is existentially dangerous, full stop.

One of Bostrom’s solutions is totalitarianism. This seems to imply something like “new technology is dangerous, but this might be offset by reducing freedom proportionally”. Accepting this hypothesis seems to say that either advanced AI is existentially dangerous, or it accelerates a political transition to totalitarianism, which seems to be its own kind of risk.

1
Jordan Arel
1y
Yes, I agree this is somewhat what Bostrom is arguing. As I mentioned in the post, I think there may be solutions which don’t require totalitarianism, i.e. massive universal moral progress. I know this sounds intractable; I might address why I think this may be mistaken in a future post, but it is a moot point if a vulnerable world induced X-risk scenario is unlikely, hence why I am wondering if there has been any work on this.

What sort of substantial value would you expect to be added? It sounds like we either have a different belief about the value-add, or a different belief about the costs.

I'd be very surprised if the actual amount of big-picture strategic thinking at either organisation was "very little". I'd be less surprised if they didn't have a consensus view about big-picture strategy, or a clearly written document spelling it out. If I'm right, I think the current content is misleading-ish. If I'm wrong and actually little thinking has been done - there's some chance t... (read more)

4
RobBensinger
1y
I agree with a lot of what you say! I still want to move EA in the direction of "people just say what's on their mind on the EA Forum, without trying to dot every i and cross every t; and then others say what's on their mind in response; and we have an actual back-and-forth that isn't carefully choreographed or extremely polished, but is more like a real conversation between peers at an academic conference". (Another way to achieve many of the same goals is to encourage more EAs who disagree with each other to regularly talk to each other in private, where candor is easier. But this scales a lot more poorly, so it would be nice if some real conversation were happening in public.) A lot of my micro-decisions in making posts like this are connected to my model of "what kind of culture and norms are likely to result in EA solving the alignment problem (or making a lot of progress)?", since I think that's the likeliest way that EA could make a big positive difference for the future. In that context, I think building conversations about heavily polished, "final" (rather than in-process) cognition, tends to be insufficient for fast and reliable intellectual progress: * Highly polished content tends to obscure the real reasons and causes behind people's views, in favor of reasons that are more legible, respectable, impressive, etc. (See Beware defensibility.) * AGI alignment is a pre-paradigmatic proto-field where making good decisions will probably depend heavily on people having good technical intuitions, intuiting patterns before they know how to verbalize those patterns, and generally becoming adept at noticing what their gut says about a topic and putting their gut in contact with useful feedback loops so it can update and learn. * In that context, I'm pretty worried about an EA where everyone is hyper-cautious about saying anything that sounds subjective, "feelings-ish", hard-to-immediately-transmit-to-others, etc. That might work if EA's path to improving