All of Dan H's Comments + Replies

A brief overview of the contents, page by page.

1: most important century and hinge of history

2: wisdom needs to keep up with technological power or else self-destruction / the world is fragile / Cuban Missile Crisis

3: unilateralist's curse

4: bio x-risk

5: malicious actors intentionally building power-seeking AIs / anti-human accelerationism is common in tech

6: persuasive AIs and eroded epistemics

7: value lock-in and entrenched totalitarianism

8: story about bioterrorism

9: practical malicious use suggestions


10: LAWs as an on-ramp to AI x-risk

11: automated c...

Would love to identify and fund a well-regarded economist to develop AI risk models, if funding for it were available.

I think the adversarial mining thing was hot in 2019. IIRC, HellaSwag and others did it; I'd venture maybe 100 papers did it before RR, but I still think it was underexplored at the time and I'm happy RR investigated it.

I don't think Redwood's project had identical goals, and would strongly disagree with someone saying it's duplicative.

I agree it is not duplicative. It's been a while, but if I recall correctly the main difference seemed to be that they chose a task that gave them an extra nine of reliability (they started with an initially easier task) and pursued it more thoroughly.

think I'm comparably skeptical of all of the evidence on offer for claims of the form "doing research on X leads to differential progress on Y,"

I think if we find that improvement of X leads ...

The failure of Redwood's adversarial training project is unfortunately wholly unsurprising given almost a decade of similarly failed attempts at defenses to adversarial examples from hundreds or even thousands of ML researchers. For example, the RobustBench benchmark shows the best known robust accuracy on ImageNet is still below 50% for attacks with a barely perceptible perturbation.
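
For readers who haven't seen this literature, here is a minimal sketch (my addition, not part of the original comment) of the simplest such attack, a single-step FGSM perturbation in PyTorch. Benchmarks like RobustBench evaluate against much stronger multi-step attacks such as AutoAttack, but the basic idea of a small, loss-increasing perturbation is the same; the model, image, and label below are placeholders.

```python
# Minimal FGSM sketch (illustrative only; RobustBench uses far stronger attacks).
# The model, image, and label here are placeholders, not anything from the comment.
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
x = torch.rand(1, 3, 224, 224)   # stand-in for a real image scaled to [0, 1]
y = torch.tensor([0])            # stand-in label

epsilon = 4 / 255                # "barely perceptible" perturbation budget

x_adv = x.clone().requires_grad_(True)
loss = F.cross_entropy(model(x_adv), y)
loss.backward()

# Step in the direction that increases the loss, then clip to the valid pixel range.
x_adv = (x_adv + epsilon * x_adv.grad.sign()).clamp(0, 1).detach()

print("clean prediction:", model(x).argmax(dim=1).item())
print("adversarial prediction:", model(x_adv).argmax(dim=1).item())
```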

The better reference class is adversarially mined examples for text models. Meta and other researchers were working on similar projects before Redwood started doing that line of research. https://github.com/facebookresearch/anli is an example

In my understanding, there was another important difference between Redwood's project and the standard adversarial robustness literature: they were looking to eliminate only 'competent' failures (i.e. cases where the model probably 'knows' what the correct classification is), and would have counted it a success even if failures remained, so long as those failures were due to a lack of competence on the model's part (e.g. 'his mitochondria were liberated' -> implies harm, but only if you know enough biology)

I think in practice in their exact project this didn't end up be...

The better reference class is adversarially mined examples for text models. Meta and other researchers were working on similar projects before Redwood started doing that line of research. https://github.com/facebookresearch/anli is an example

I agree that's a good reference class. I don't think Redwood's project had identical goals, and would strongly disagree with someone saying it's duplicative. But other work is certainly also relevant, and ex post I would agree that other work in the reference class is comparably helpful for alignment

Reader: evaluate

...

Thanks for the comment Dan. I agree that the adversarially mined examples literature is the right reference class, of which the two that you mention (Meta’s Dynabench and ANLI) were the main examples (maybe the only examples? I forget) while we were working on this project. 

I’ll note that Meta’s Dynabench sentiment model (the only model of theirs that I interacted with) seemed substantially less robust than Redwood’s classifier (e.g. I was able to defeat it manually in about 10 minutes of messing around, whereas I needed the tools we made to defeat the Redwood model).
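
For concreteness (my addition, not part of the exchange above), the ANLI rounds referenced in the linked repo are also mirrored on the Hugging Face Hub, so the adversarially mined examples are easy to inspect. A minimal sketch, assuming the `datasets` library:

```python
# Minimal sketch (my addition): inspect ANLI's adversarially mined NLI examples.
# Assumes the Hugging Face `datasets` library; "anli" is the public mirror of the
# data released via facebookresearch/anli.
from datasets import load_dataset

anli = load_dataset("anli")        # rounds 1-3 of human-vs-model mined examples
example = anli["dev_r1"][0]

print(example["premise"])
print(example["hypothesis"])
print(example["label"])            # 0 = entailment, 1 = neutral, 2 = contradiction
print(example["reason"])           # annotator's note on why the model was fooled, where provided
```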

I agree fitness is a more useful concept than rationality (and more useful than an individual agent's power), so here's a document I wrote about it: https://drive.google.com/file/d/1p4ZAuEYHL_21tqstJOGsMiG4xaRBtVcj/view

"AI safety" refers to ensuring that the consequences of misalignment are not majorly harmful

That's saying that AI safety is about protective mechanisms and that alignment is about preventative mechanisms. I haven't heard the distinction drawn that way, and I think that's an unusual way to draw it.

Context: 

Preventative Barrier: prevent initiating hazardous event (decrease probability(event))

Protective Barrier: minimize hazardous event consequences (decrease impact(event))
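
To make the distinction concrete (my illustration, not from the original comment): both barrier types reduce expected harm, roughly probability(event) × impact(event), but they act on different factors.

```python
# Toy numbers, purely for exposition: expected harm = probability * impact,
# and how each barrier type reduces it by acting on a different factor.
p_event, impact = 0.10, 1000.0                # baseline hazard

baseline     = p_event * impact               # 100.0
preventative = (p_event * 0.5) * impact       # halve the probability  -> 50.0
protective   = p_event * (impact * 0.5)       # halve the consequences -> 50.0

print(baseline, preventative, protective)
```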

Broader videos about safety engineering distinctions in AI safety: [1], [2].

titotal:
So partially what I'm saying is that the definitions of "AI alignment" and "AI safety" are confusing, and people are using them to refer to different things in a way that can mislead. For example, if you declare that your AI is "safe" while it's killing people on the daily (because you were referring to extinction), people will rightly feel misled and angry. Similarly for "misalignment": an image generator giving you a hand with the wrong number of fingers is misaligned in the sense that you care about the correct number of fingers and it doesn't know this, but it doesn't cause any real harm the way a malfunction in a healthcare diagnoser would. Your point about safety wanting to prevent as well as protect is a good one. I think "AI safety" should refer to both.

lasting effect on their beliefs

take new action(s) at work

These could mean a lot of things. Are there more specific results?

Vael Gates:
You can read more here! 
vincentweisser:
Thanks, great rec, donated!

Even in deep learning, proofs have by and large been a failure. Proofs would be important, and there are many people trying to find angles of attack for useful proofs about deep learning systems, so it is hard to say the area is neglected. Unfortunately, useful proofs are rarely tractable for systems as complex as deep learning models. Compared to other interventions, I would not bet a substantial amount on proofs for deep learning systems given their importance, neglectedness, and tractability.

For concrete research directions in safety and several dozen project ideas, please see our paper Unsolved Problems in ML Safety: https://arxiv.org/abs/2109.13916

Note that some directions are less concretized than others. For example, it is easier to do work on Honest AI and Proxy Gaming than it is to do work on, say, Value Clarification.

Since this paper is dense for newcomers, I'm finishing up a course that will expand on these safety problems.

PabloAMC:
Thanks Dan!

I spent some time improving the design. Here is the current design in three color scheme options.

SVGs are here: https://drive.google.com/drive/folders/1ZxtJ_gf9T_H4AIZnpmCzPngp5h0XPojf?usp=sharing

Here are past logos by other people:

A reminder that the competition ends this month!

My main concern is that the format disproportionately encourages submissions from amateurs

We also crosspost on Reddit to attract people who know how to design logos.

The claim that "symbolism is important" is not substantiated

I would need evidence against the claim that imagery is basically worthless. Even in academic ML research, it's a fatal mistake not to spend at least a day thinking about how to visualize the paper's concepts. This mistake is nonetheless common.

Erich_Grunewald:
I think you are moving the goalposts a bit when arguing against the view that "imagery is worthless". Peter (and you in the original post) wrote about symbolism specifically, and in this context symbolism in flags. I also think there is probably a significant difference between the kind of plots, graphs, and other visualisations you see in a research paper, which are aimed at explaining particular results and theories, and flags, which are meant more to associate with concepts, groups, movements, and so on. It's like the difference between a paragraph of prose and a slogan: a difference of fidelity, I suppose.

There isn't an official body for utilitarianism, so no decisions are official. A community competition brings in more submissions and voices, and it is a less arbitrary process. I'll try to have many utilitarian-minded people vote on the insignia rather than having just one person decide.

I think the "heart in a lightbulb" insignia for EA is a great design choice and excellent for outreach, but there is no such communicable symbol for utilitarianism. Companies know to spend much on design for outreach since visualization is not superfluous. I do not think the optimal spending is $0, as is currently the case. A point of the competition is finding a visual way of communicating a salient idea about utilitarianism suitable for broader outreach. I do not know what part is best to communicate or how best to communicate it--that's part of the reason for the competition.