Harrison Durland

I've been advocating for something like this for a while (more recently, here and here), but have only ever received lukewarm feedback at best. I'd still be excited to see this take off, and would probably like to hear what other work is happening in this space!

I spent way too much time organizing my thoughts on AI loss-of-control ("x-risk") debates without any feedback today, so I'm publishing perhaps one of my favorite snippets/threads:

A lot of debates seem to boil down to under-acknowledged and poorly-framed disagreements about questions like “who bears the burden of proof.” For example, some skeptics say “extraordinary claims require extraordinary evidence” when dismissing claims that the risk is merely “above 1%”, whereas safetyists argue that having >99% confidence that things won’t go wrong is the “extraordinary claim that requires extraordinary evidence.” 

I think that talking about “burdens” might be unproductive. Instead, it may be better to frame the question more like “what should we assume by default, in the absence of definitive ‘evidence’ or arguments, and why?” “Burden” language is super fuzzy (and seems a bit morally charged), whereas this framing at least forces people to acknowledge that some default assumptions are being made and consider why. 

To address that framing, I think it’s better to ask and answer questions like “What reference class does ‘building AGI’ belong to, and what are the base rates of danger for that reference class?” This framing at least pushes people to make explicit claims about which reference class building AGI belongs to, which should make it clearer that it may not belong in a generic “all technologies ever” reference class.

In my view, the “default” estimate should not be “roughly zero until proven otherwise,” especially given that there isn’t consensus among experts and given the overarching narrative that intelligence proved really powerful in humans, that misalignment is quite common even among humans (and is already often observed in existing models), and that we often don’t get technologies right on the first few tries.

I definitely think “beware” is too strong. I would recommend “discount,” “be skeptical,” or something similar.

Venus is an extreme example of an Earth-like planet with a very different climate. There is nothing in physics or chemistry that says Earth's temperature could not one day exceed 100 C. 
[...]
[Regarding ice melting -- ] That will take time, but very little time on a cosmic scale, maybe a couple of thousand years.

I'll be blunt: remarks like these undermine your credibility. Regardless, I just don't have any experience or contributions to offer on climate change, other than re-emphasizing my general impression, as a person who cares a lot about existential risk and has talked to various other people who also care a lot about existential risk, that the scientific evidence strongly suggests extinction is unlikely.

Everything is going more or less as the scientists predicted, if anything, it's worse.

I'm not that focused on climate science, but my understanding is that this is a bit misleading in your context—that there were some scientists in the (90s/2000s?) who forecasted doom or at least major disaster within a few decades due to feedback loops or other dynamics which never materialized. More broadly, my understanding is that forecasting climate has proven very difficult, even if some broad conclusions (e.g., "the climate is changing," "humans contribute to climate change") have held up. Additionally, it seems that many engineers/scientists underestimated the pace of alternative energy technology (e.g., solar).

That aside, I would be excited to see someone work on this project, and I still have not discovered any such database.

I don't find this response to be a compelling defense of what you actually wrote:

since AIs would "get old" too [...] they could also have reason to not expropriate the wealth of vulnerable old agents because they too will be in such a vulnerable position one day

It's one thing if the argument is "there will be effective enforcement mechanisms which prevent theft," but the original statement still just seems to imagine that norms will be a non-trivial reason to avoid theft, which seems quite unlikely for a moderately rational agent.

Ultimately, perhaps much of your scenario was trying to convey a different idea from what I see as the straightforward interpretation, but that makes it hard for me to productively engage with it, as it feels like engaging with a motte-and-bailey.

Apologies for being blunt, but the scenario you lay out is full of claims that just seem to completely ignore very facially obvious rebuttals. This would be less bad if you didn’t seem so confident, but as written the perspective strikes me as naive and I would really like an explanation/defense.

Take for example:

Furthermore, since AIs would "get old" too, in the sense of becoming obsolete in the face of new generations of improved AIs, they could also have reason to not expropriate the wealth of vulnerable old agents because they too will be in such a vulnerable position one day, and thus would prefer not to establish a norm of expropriating the type of agent they may one day become.

Setting aside the debatable assumptions about AIs getting “old,” this just seems to completely ignore the literature on collective action problems. If any one AI agent can expect to get away with defecting (expropriating from older agents), and breaking the norm requires crossing a non-trivial threshold of such actions, then a rational agent will recognize that its own defection has minimal impact on what the collective does, so it may as well defect before others do.
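To make the collective-action logic concrete, here is a toy threshold game of my own construction (not from the original post; all names and payoff numbers are arbitrary assumptions). The norm collapses only once defections exceed a threshold, so an individual agent's defection is almost never pivotal, and defecting weakly dominates except in the knife-edge pivotal case:

```python
# Toy threshold public-goods/norm game (illustrative numbers only).
# N agents decide whether to "defect" (expropriate). The norm against
# expropriation collapses only if more than THRESHOLD agents defect.

N = 1_000           # hypothetical number of AI agents
THRESHOLD = 100     # defections needed before the norm breaks down
GAIN = 10.0         # private gain from expropriating now
NORM_COST = 50.0    # cost to each agent if the norm collapses

def expected_payoff(defect: bool, others_defecting: int) -> float:
    """Payoff to one agent given how many OTHER agents defect."""
    total = others_defecting + (1 if defect else 0)
    collapse = total > THRESHOLD
    return (GAIN if defect else 0.0) - (NORM_COST if collapse else 0.0)

# Unless the agent is exactly pivotal (others_defecting == THRESHOLD),
# defecting yields a higher payoff than cooperating:
for others in (0, 50, THRESHOLD, 500):
    d = expected_payoff(True, others)
    c = expected_payoff(False, others)
    print(f"others={others:>3}: defect={d:+.1f}, cooperate={c:+.1f}")
```

With these (arbitrary) payoffs, defection beats cooperation both when the norm is safely intact and when it has already collapsed; only an agent who believes it is exactly pivotal has an incentive to hold back, which is the standard free-rider structure.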

There are multiple other problems in your post, but I don’t think it’s worth the time going through them all. I just felt compelled to comment because I was baffled by the karma on this post, unless it was just people liking it because they agreed with the beginning portion…?

Sure! (I just realized the point about the MNIST dataset problems wasn't fully explained in my shared memo, but I've fixed that now)

Per the assessment section, some of the problems with assuming that FRVT demonstrates NIST's capabilities for evaluation of LLMs/etc. include:

  1. Facial recognition is a relatively "objective" test—i.e., the answers can be linked to some form of "definitive" answer or correctness metric (e.g., name/identity labels). In contrast, many of the potential metrics of interest with language models (e.g., persuasiveness, knowledge about dangerous capabilities) may not have a "definitive" evaluation method, where following X procedure reliably evaluates a response (and does so in a way that onlookers would look silly to dispute).
  2. The government arguably had some comparative advantage in specific types of facial image data, due to collecting millions of these images with labels. The government doesn't have a comparative advantage in, e.g., text data.
  3. The government has not at all kept pace with private/academic benchmarks for most other ML capabilities, such as non-face image recognition (e.g., Common Objects in Context) and LLMs (e.g., SuperGLUE).
  4. It's honestly not even clear to me whether FRVT's technical quality truly is the "gold standard" in comparison to the other public training/test datasets for facial recognition (e.g., MegaFace); it seems plausible that the value of FRVT is largely just that people can't easily cheat on it (unlike datasets where the test set is publicly available) because of how the government administers it.

For the MNIST case, I now have the following in my memo:

Even NIST’s efforts with handwriting recognition were of debatable quality: Yann LeCun's widely-used MNIST is a modification of NIST's datasets, in part because NIST's approach used Census Bureau employees’ handwriting for the training set and high school students’ handwriting for the test set.[1]


[1] Some may argue this assumption was justified at the time because it required that models could “generalize” beyond the training set. However, popular usage appears to have favored MNIST’s approach. Additionally, it is unclear that one could effectively generalize from the handwriting of a narrow and potentially unrepresentative segment of society (professional bureaucrats) to high schoolers’, and the assumption that this would be necessary (e.g., due to the inability to get more representative data) seems unrealistic.
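The train/test mismatch described in that footnote can be illustrated with a toy sketch of my own (not from the memo; the "feature," populations, and numbers are all made up): a simple nearest-centroid classifier fit on one "writer population" loses accuracy when tested on a shifted population, analogous to training on bureaucrats' handwriting and testing on high schoolers':

```python
import random

random.seed(0)

# Toy 1-D "handwriting feature" for two classes, drawn from two writer
# populations. Population B (say, high schoolers) is shifted relative to
# population A (say, Census Bureau employees) -- a hypothetical stand-in
# for NIST's original train/test split.
def sample(n, class_mean, pop_shift):
    return [random.gauss(class_mean + pop_shift, 1.0) for _ in range(n)]

SHIFT_A, SHIFT_B = 0.0, 1.5

# Fit a nearest-centroid classifier on population A only.
train0 = sample(500, -2.0, SHIFT_A)
train1 = sample(500, +2.0, SHIFT_A)
centroid0 = sum(train0) / len(train0)
centroid1 = sum(train1) / len(train1)

def accuracy(pop_shift):
    """Test accuracy on a population shifted by pop_shift."""
    correct = total = 0
    for label, mean in ((0, -2.0), (1, +2.0)):
        for x in sample(2000, mean, pop_shift):
            pred = 0 if abs(x - centroid0) < abs(x - centroid1) else 1
            correct += (pred == label)
            total += 1
    return correct / total

print(f"same-population test accuracy:    {accuracy(SHIFT_A):.3f}")
print(f"shifted-population test accuracy: {accuracy(SHIFT_B):.3f}")
```

Accuracy drops by several points under the shift even in this trivially easy setup, which is the basic reason mismatched train/test populations make a benchmark's scores hard to interpret.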

Seeing the drama with the NIST AI Safety Institute and Paul Christiano's appointment and this article about the difficulty of rigorously/objectively measuring characteristics of generative AI, I figured I'd post my class memo from last October/November.

The main point I make is that NIST may not be well suited to creating measurements for complex, multi-dimensional characteristics of language models—and that some people may be overestimating NIST's capabilities because they don't recognize how incomparable the Facial Recognition Vendor Test is to this situation of subjective metrics for GenAI, and they don't realize NIST arguably even botched the datasets behind MNIST (which was actually produced by Yann LeCun by recompiling NIST's data). Moreover, government is slow, while AI is fast. Instead, I argue we should consider alternative models such as federal funding for private/academic benchmark development (e.g., prize competitions).

I wasn't sure if this warranted a full post, especially since it feels a bit late; LMK if you think otherwise!

I probably should have been clearer: my true "final" paper actually didn't focus on this aspect of the model. The offense-defense balance was the original motivation/purpose of my cyber model, but I eventually became far more interested in using the model to test how large language models could improve agent-based modeling by controlling actors in the simulation. I have a final model writeup which explains some of the modeling choices and discusses the original offense/defense purpose in more detail.

(I could also provide the model code which is written in Python and, last I checked, runs fine, but I don't expect people would find it to be that valuable unless they really want to dig into this further, especially given that it might have bugs.)
