Currently pursuing a PhD at the "Mathematics for Our Future Climate" CDT at Reading University.
Previously MSc in applied mathematics/theoretical ML.
Not really active here - racism, Rationality and weirdness in the movement are so bad they made me give up on it.
How can you "solve every possible jailbreak"? And is it worth crippling large-scale research into safeguarding against future AI because of fears about what the current models might be capable of?
(My own answer is "maybe". It depends on how bad you think current models are for society - pretty bad, in my opinion - vs. how likely you think it is that an existentially threatening AI will actually be born out of the current efforts.)
I still maintain that publicly releasing models is the correct way to get any chance of good alignment research - you can't possibly believe that the researchers at Anthropic alone are enough to tackle the problem. It's a global problem and should have the opportunity for the global population to solve it.
No offense, Linch, but aren't these questions for jurists, historians and philosophers? Why should you develop the answers from first principles, so to speak? I'd understand writing a blog post about a journey through such sources and what their theories are, but I think trying to answer such questions ourselves is not very robust.
This is not a criticism of you personally - developing ideas that require domain expertise from first principles is an approach I often see in EA, and I think it's the wrong one.
Seems like a good compromise. The examples at the end are also helpful.
About this, however:
The laissez-faire option is flawed because LLM-generated writing is increasingly difficult to detect. There are posts (I've seen a lot of these) which have the form of a good quality post which is worth reading, but on closer analysis turn out not to contain any ideas, or just to contain a couple of bullet points' worth of ideas, surrounded by a lot of fluff and repetition. This leads to quite a large waste of time for the reader.
While this is true, and indeed happens a lot everywhere nowadays, let's not forget the possibility of actual malice - manipulation through posts that look good or convincing but are actually written to persuade you to serve someone's interests. This can be done by anyone, from individuals to companies to industry lobbies to state governments.
Allowing LLM-generated content not only leaves the door open to heaps of slop, but also allows all of this. So some sort of defence is definitely warranted.
We don't know how to align a possible AGI yet. The best we can hope for is that current models are close enough to whatever AGI is going to be that trying to align them will teach us about aligning an AGI. This task of trying to align them shouldn't just be left to researchers in AI companies.