Harrison Durland

1805 karma · Joined Sep 2020





Strange, unless the original comment from Gerald has been edited since I responded I think I must have misread most of the comment, as I thought it was making a different point (i.e., "could someone explain how misalignment could happen"). I was tired and distracted when I read it, so it wouldn't be surprising. However, the final paragraph in the comment (which I originally thought was reflected in the rest of the comment) still seems out of place and arrogant.

This is a test regarding comment edit history. This comment has been edited post-publication.

This really isn’t the right post for most of those issues/questions, and most of what you mentioned are things you should be able to find via searches on the forum, searches via Google, or maybe even just asking ChatGPT to explain it to you (maybe!). TBH your comment also just comes across quite abrasive and arrogant (especially the last paragraph), without actually appearing to be that insightful/thoughtful. But I’m not going to get into an argument on these issues.

[This comment is no longer endorsed by its author]

I wish! I’ve been recommending this for a while but nobody bites, and usually (always?) without explanation. I often don’t take seriously many of these attempts at “debate series” if they’re not going to address some of the basic failure modes that competitive debate addresses, e.g., recording notes in a legible/explorable way to avoid the problem of arguments getting lost under layers of argument branches.

Hi Oisín, no worries, and thanks for clarifying! I appreciate your coverage of this topic, I just wanted to make sure there aren't misinterpretations.

In policy spaces, this is known as the Brussels Effect; that is, when a regulation adopted in one jurisdiction ends up setting a standard followed by many others.

I am not clear how the Brussels Effect applies here, especially since we're not talking about manufacturing a product where running different production lines for different jurisdictions is costly. I recognize there may be some argument/step that I'm missing, but I also can't dismiss the possibility that the author doesn't actually understand what the Brussels Effect is or how it normally operates, and is using it as a buzzword. Could you please elaborate a bit more?

I’m curious whether people (e.g., David, MIRI folk) think that LLMs now or in the near future would be able to substantially speed up this kind of theoretical safety work?

I was not a huge fan of the instrumental convergence paper, although I didn't have time to thoroughly review it. In short, it was too slow in making its reasoning and conclusion clear, and once (I think?) I understood what it was saying, it felt quite nitpicky (or a borderline motte-and-bailey). Ultimately, I'm still unclear whether/how it responds to the real-world applications of the reasoning (e.g., explaining why a system with a seemingly simple goal like calculating digits of pi would want to cause the extinction of humanity).

The summary in this forum post seems to help, but I really feel like the caveat identified in this post ("this paper simply argues that this would not be true of agents with randomly-initialized goals") is not made clear in the abstract.[1]

  1. ^

    The abstract mentions "I find that, even if intrinsic desires are randomly selected [...]" but this does not at all read like a caveat, especially due to the use of "even if" (rather than just "if").

Sorry about the delayed reply, I saw this and accidentally removed the notification (and I guess didn't receive an email notification, contrary to my expectations) but forgot to reply. Responding to some of your points/questions:

One can note that AIXR is definitely falsifiable, the hard part is falsifying it and staying alive.

I mostly agree with the sentiment that "if someone predicts AIXR and is right then they may not be alive," although I do now think it's entirely plausible that we could survive long enough during a hypothetical AI takeover to say "ah yeah, we're almost certainly headed for extinction"—it's just too late to do anything about it. The problem is how to define "falsify": if you can't 100% prove anything, you can't 100% falsify anything. Can the last person alive say with 100% confidence "yep, we're about to go extinct"? No, but I think most people would say that this outcome basically "falsifies" the claim "there is no AIXR," even prior to the final person being killed.


Knightian uncertainty makes more sense in some restricted scenarios especially related to self-confirming/self-denying predictions.

This is interesting; I had not previously considered the interaction between self-affecting predictions and (Knightian) "uncertainty." I'll have to think more about this, but as you say I do still think Knightian uncertainty (as I was taught it) does not make much sense.  


This can apply as well to different people: If I believe that X has a very good reasoning process based on observations of X's past reasoning, I might not want to/have to follow X's entire train of thought before raising my probability of their conclusion.

Yes, this is the point I'm trying to get at with forecast legibility, although I'm a bit confused about how it builds on the previous sentence.


Some people have talked about probability distributions on probability distributions, in the case of a binary forecast that would be a function , which is…weird. Do I need to tack on the resilience to the distribution? Do I compute it out of the probability distribution on probability distributions? Perhaps the people talking about imprecise probabilities/infrabayesianism are onto something when they talk about convex sets of probability distributions as the correct objects instead of probability distributions per se.

Unfortunately I'm not sure I understand this paragraph (including the mathematical portion). Thus, I'm not sure how to explain my view of resilience better than what I've already written and the summary illustration: someone who says "my best estimate is currently 50%, but within 30 minutes I think there is a 50% chance that my best estimate will become 75% and a 50% chance that my best estimate becomes 25%" has a less-resilient belief compared to someone who says "my best estimate is currently 50%, and I do not think that will change within 30 minutes." I don't know how to calculate/quantify the level of resilience between the two, but we can obviously see there is a difference.
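To make the comparison concrete, here is a minimal sketch of one possible (non-canonical, hypothetical) way to quantify the difference between the two forecasters in my illustration: measure the expected absolute movement of the best estimate over the stated time window. The function name and the metric itself are my own illustrative choices, not an established definition of resilience.

```python
def expected_movement(current, future_outcomes):
    """Expected absolute change of a best estimate.

    future_outcomes: list of (probability, future_estimate) pairs
    describing how the forecaster expects their estimate to move.
    """
    return sum(p * abs(est - current) for p, est in future_outcomes)

# Forecaster A: 50% now; expects 75% or 25% with equal chance in 30 minutes.
a = expected_movement(0.50, [(0.5, 0.75), (0.5, 0.25)])

# Forecaster B: 50% now; expects no change within 30 minutes.
b = expected_movement(0.50, [(1.0, 0.50)])

print(a)  # 0.25
print(b)  # 0.0 -> B's belief is more resilient (smaller expected movement)
```

Under this toy metric, a lower expected movement corresponds to a more resilient belief, matching the intuition in the example even though it is surely not the only (or best) way to formalize it.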

Epistemic status: I feel fairly confident about this but recognize I’m not putting in much effort to defend it and it can be easily misinterpreted.

I would probably just recommend not using the concept of neglectedness in this case, to be honest. The ITN framework is a nice heuristic (e.g., usually more neglected things benefit more from additional marginal contributions) but it is ultimately not very rigorous/logical except when contorted into a definitional equation (as many previous posts have explained). Importantly, in this case I think that focusing on neglectedness is likely to lead people astray, given that a change in neglectedness could equate to an increase in tractability.
