Recently, many people have given increasingly short timelines for AGI. What unites many of these statements is a thorough lack of evidence. Instead, many resort to narrative arguments, ad-hoc predictive models, and evidence mismatches. In this post, I will cover these different types of non-evidence-based arguments.
I have described my views on AI risk previously in this post, which I think is still relevant. I have also laid down a basic argument against AI risk interventions in this comment where I argue that AI risk is neither important, neglected nor tractable.
Clarification (29.1.2026): In this post, I use the term "evidence" in the context of "evidence-based policy making". A major tenet of the EA movement, I believe, is that interventions should be based on evidence as part of rational decision making. In this context, evidence is contrasted with opinions, intuitions, anecdotes, etc. I have elaborated further about what the bar for this kind of evidence is in the comments – in particular, while I do not believe that pre-publication peer-review is always necessary, results should withstand wider scrutiny and replication.
Narrative arguments
A narrative argument presents an argument in the form of a story, parable, or extended metaphor. These arguments present a chain of events as probable or even inevitable due to their narrative coherence: if the "logical continuation" of the story follows a pattern, that pattern is then postulated to exist in the real world as well. This depends on the "story-logic" matching real-world causal mechanisms.
Narrative arguments can be a powerful way to demonstrate or explain a phenomenon in an understandable way, but they need to be supported by actual evidence. The existence of the narrative is not evidence in itself. Narratives are easily manipulated, and it is always possible to find another narrative (or story) that has a different ending, thus demonstrating a different conclusion.
In recent discussions of AI risk, narrative arguments have become increasingly common. They are often accompanied either by faulty evidence or by no evidence at all. Perhaps the most prominent example is AI 2027, which features a science fiction story supplemented by predictive models. These models have been heavily criticized as faulty in many ways. They are also somewhat difficult for an average person to understand, and I doubt most readers of AI 2027 have taken a good look at them. Therefore, AI 2027 rests almost entirely on the narrative argument, unsubstantiated by any evidence other than the faulty models.
Another example is the book If Anyone Builds It, Everyone Dies by Yudkowsky and Soares. This book is entirely based on narrative arguments, featuring a full multi-chapter science fiction story very similar to that of AI 2027, alongside many short parables that provide little to no value since they are not accompanied by evidence.
One parable struck me as especially memorable because it appeared to argue for something other than what it actually did. The parable featured a kingdom where alchemists try to synthesize gold from lead. Despite the king threatening to execute anyone who fails, along with their entire village, many alchemists attempt this impossible task. I thought the task of creating gold was akin to the task of creating a god-like AI – impossible. The king's threat was an analogue for the bankruptcy of OpenAI and other companies that failed to deliver on this promise. But no! Yudkowsky and Soares actually meant gold to symbolize alignment and the king's threat to symbolize the AI killing everyone.
Narratives can be twisted to fit any purpose, and even the same narrative can be used to justify opposite conclusions. Narratives are not evidence; at best, they can help us understand evidence. Never trust narratives alone.
Ad-hoc predictive models
The second type of argument I've seen (most prominently in the case of AI 2027) is the ad-hoc predictive model.
There is a saying in my city that a skilled user of Excel can justify any decision. Whenever a funding decision is made (a railway project, for example), the city council must propose a budget with predicted costs and revenues, along with an assessment of benefits and harms. Typically, the politicians proposing the project make their own "optimistic" version of these predictions, and the opposing politicians make a "pessimistic" version. By adjusting the factors that go into these predictions – how many people switch from cars to trains, how many new jobs are created near the railway stations, etc. – it is possible to mangle the prediction into supporting any outcome.
This is called "policy-based evidence making", and it is common in politics. Often, these biased models are accompanied by a narrative argument painting whatever brighter future the politician in question believes their policy will bring about. Spotting the faults in these ad-hoc models is very difficult for regular citizens who are not experts in, e.g., traffic modelling. For this reason, the best strategy for a layperson is to meet any use of predictive models not supported by academic consensus with strong skepticism.
I believe that this bears a lot of parallels to the predictive models presented in AI 2027. While, unlike in the case of my local politicians, I do not believe that the authors necessarily had any malicious intent, the model is similarly unjustified by academic consensus. It has many hidden, likely biased assumptions and ad-hoc modelling decisions. Since all of these assumptions are more or less based on intuition instead of evidence, they should be considered to have high uncertainty.
Rationalists sometimes use the phrase "If it's worth doing, it's worth doing with made-up statistics." While I'm sympathetic to the idea that people should reveal their hidden assumptions in numerical form, this advice is often taken to mean that these made-up numbers somehow become good evidence once put into a predictive model. No, the numbers are still made-up, and the results of the model are also made-up. Garbage in, garbage out.
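To make this concrete, here is a toy sketch (all numbers are invented, purely for illustration) of the kind of railway cost-benefit "model" described above: the same model structure yields opposite conclusions depending on which made-up assumptions it is fed.

```python
# Toy cost-benefit "model" with invented numbers, purely for illustration:
# the same structure supports opposite conclusions depending on which
# made-up assumptions are plugged in. Garbage in, garbage out.

def railway_net_benefit(riders_per_day, value_per_trip_eur, new_jobs,
                        value_per_job_eur, annual_cost_eur, years=30):
    """Net benefit over the appraisal period under the given assumptions."""
    annual_benefit = riders_per_day * 365 * value_per_trip_eur + new_jobs * value_per_job_eur
    return years * (annual_benefit - annual_cost_eur)

# "Optimistic" assumptions (the proposing politician's version)
optimistic = railway_net_benefit(riders_per_day=20_000, value_per_trip_eur=4.0,
                                 new_jobs=3_000, value_per_job_eur=8_000,
                                 annual_cost_eur=40_000_000)

# "Pessimistic" assumptions (the opposing politician's version)
pessimistic = railway_net_benefit(riders_per_day=6_000, value_per_trip_eur=2.5,
                                  new_jobs=500, value_per_job_eur=8_000,
                                  annual_cost_eur=55_000_000)

print(f"Optimistic net benefit:  {optimistic / 1e6:+.0f} M EUR")   # strongly positive
print(f"Pessimistic net benefit: {pessimistic / 1e6:+.0f} M EUR")  # strongly negative
```

Neither run tells us anything the assumptions did not already contain; the model merely launders intuitions into official-looking numbers.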
Evidence mismatch
Another common style of argumentation is to use evidence for one claim as evidence for a different claim.
For example, Aschenbrenner's Situational Awareness basically argues that because AI systems are on the level of a "smart high-schooler" in some tasks, in the future they will be on the PhD level in all tasks. Aschenbrenner justified this by claiming that the current models are "hobbled" and unhobbling them would expand their capabilities to these other tasks. I wrote about the essay in an earlier post.
Similarly, many people (including the authors of AI 2027) base their argumentation on METR's Long Tasks benchmark, despite it covering only a limited domain of tasks, many of which are unrealistic[1] compared to real-world programming tasks. Coding is one of the domains in which it is easy to automatically verify AI responses without human feedback, and it has seen a lot of progress in recent years, which explains the exponential progress seen in that domain. However, this is in no way evidence of progress in other domains, or even in real-world programming tasks.
Benchmarks, overall, are a very poor source of evidence, due to Goodhart's law. It is easy to cheat a benchmark by training the model specifically on similar tasks (even if the model is not trained on the test set per se). If a model is optimized, "benchmaxxed", towards a benchmark, progress on that benchmark cannot be taken as evidence of progress on other tasks. Unfortunately, this issue plagues most benchmarks.
Sometimes evidence mismatch can be caused by mismatch in definitions. One example is the definition of AGI used by Metaculus, which I have criticized in this comment. Since the Metaculus definition doesn't actually define AGI, but instead uses bad indicators for it that arguably could be passed using a non-AGI system, it is entirely unclear what the Metaculus question measures. It cannot be used as evidence for AGI if it does not forecast AGI.
Conclusions
In this post, I've criticized non-evidence-based arguments, a criticism that hangs on the idea that evidence is inherently required. Yet it has become commonplace to claim the opposite. One example of this claim is presented in the International AI Safety Report[2], in which the authors argue that AI poses an "evidence dilemma" to policymakers:
Given sometimes rapid and unexpected advancements, policymakers will often have to weigh potential benefits and risks of imminent AI advancements without having a large body of scientific evidence available. In doing so, they face a dilemma. On the one hand, pre-emptive risk mitigation measures based on limited evidence might turn out to be ineffective or unnecessary. On the other hand, waiting for stronger evidence of impending risk could leave society unprepared or even make mitigation impossible – for instance if sudden leaps in AI capabilities, and their associated risks, occur.
This text uses emotive phrases such as "imminent AI advancements" and "impending risk" despite acknowledging that there is only limited evidence. This kind of rhetoric goes against the tenet of rational, evidence-based policy making, and it is alarming that high-profile expert panels use it.
I believe that states should ensure that the critical functions of society will continue to exist in unexpected, unknown crisis situations. In my country, Finland, this is called huoltovarmuus (security of supply), and it is a central government policy. In past years, this concept has been expanded to include digital infrastructure and digital threats, including things like social media election interference and unspecified cyberthreats. I think it is fine to prepare for hypothetical AI threats together with other, more tangible threats as part of general preparedness. This work has low cost but huge impact in a crisis situation.
But calling for drastic actions specifically for AI is something that requires more evidence. Existential risk from AI, in particular, is an extraordinary claim requiring extraordinary evidence.

Rather than go through this paragraph-by-paragraph, let me pick one particular thing.
Overall I disagree and am also downvoting this post as not a helpful contribution.
Thank you for your criticism.
The point of this post is not to address specific issues with AI 2027, but narrative arguments and ad-hoc models in general. AI 2027 contains both, and thus exemplifies them well. By choosing a model not based on reference literature, and thus on established consensus, the authors risk incorporating their own biases and assumptions into the model. This risk is present in all ad-hoc models, not just AI 2027, which is why all ad-hoc models should be met with strong skepticism until supported by wider consensus.
You make a good observation that the criticisms of AI 2027 do not form an "academic consensus" either. This is because AI 2027 itself is not an academic publication, nor has it been a topic of any major academic discussion. It is possible for non-academic works to contain valuable contributions – as I say in a comment above, peer-review is not magic. Furthermore, even new and original models that were "ad-hoc" when first published can be good. However, the lack of wider adoption of this model by scientists suggests it is not viewed as a solid foundation to build on. This lack does not, of course, explain why that is the case, so I have included links to commentary by other people in the EA community that describes the concrete issues in the model. Again, these issues are not the main point of my post, and are only provided for the reader's convenience.
A narrative argument presents the argument in the form of a story, like AI 2027's science fiction scenario or the parables in Yudkowsky and Soares's book. I'm not sure what part of my text you characterize as a story – could you elaborate on that?
I share the feeling that advocates of short timelines often overestimated the reliability of their methods, and have said so here.
At the same time, when I see skeptics of AI progress say that these arguments lack "evidence", it's unclear to me what the neutral criteria are for what counts as "evidence". I agree that lots of the dynamics you describe exist, but they don't seem at all unique to discussions of AI timelines to me. I think they stem from the fact that interpreting evidence is much messier than an academic, ivory tower view of "evidence" would make it seem.
As an example, a common critique I have noted among AI skeptics is that arguments for short timelines or AI risk don't follow traditional academic processes such as pre-publication peer review. The implication is that this suggests they lack rigor and in some sense shouldn't count as "real" evidence. But it seems to me that by this standard peer review itself lacks evidence to support its practice. See the various reproducibility projects with underwhelming results and the replication crisis in general. Yes, things like METR's work have limitations and don't precisely replicate the ideal experiment, but nor do animal models or cell lines precisely replicate human disease as seen in the clinic. Benchmarks can be gamed, but the idea of "benchmarks" comes from the machine learning literature itself; it isn't something cooked up specifically to argue for short timelines.
What are the standards that you think an argument should meet to count as "evidence-based"?
The purpose of peer-review is to make sure that the publication has no obvious errors and meets some basic standards of publication. I have been a peer-reviewer myself, and what I have seen is that the general quality of stuff sent to computer science conferences is low. Peer-review removes the most blatantly bad papers. To a layperson who doesn't know the field and who cannot judge the quality of studies, it is safest to stick to peer-reviewed papers.
But it has never been suggested that peer-review somehow magically separates good evidence from bad evidence. In my work, I often refer to arXiv papers that are not peer-reviewed, but which I believe are methodologically sound and present valuable contributions. On the other hand, I know that conferences and journals often publish papers even with grave methodological errors or lack of statistical understanding.
Ultimately, the real test of a study is the criticism it receives after its publication, not peer-review. If researchers in the field think that the study is good and build their research on it, it is much more credible evidence than a study that is disproved by studies that come after it. One should never rely on a single study alone.
In the case of METR's study, their methodological errors do not preclude their conclusions being correct. I think what they are trying to do is interesting and worthy of research. I'd love to see other researchers attempt to replicate the study while improving on the methodology; if they obtained similar results, that would provide evidence for METR's conclusions. So far, we haven't seen this (or at least I am not aware of it). But even in that case, the problem of evidence mismatch remains, and we should be careful not to stretch those conclusions too far.
This seems reasonable to me, but I don't think it's necessarily entirely consistent with the OP. I think a lot of the reason why AI is such a talked-about topic compared to 5 years ago is that people have seen the work that has gone on in the field and are building on and reacting to it. In other words, they perceive existing results to be evidence of significant progress and opportunities. They could be overreacting to or overhyping those results, but to me it doesn't seem fair to say that the belief in short timelines is entirely "non-evidence-based". Things like METR's work, scaling laws, and benchmarks: these are evidence even if they aren't necessarily strong or definitive evidence.
I think it is reasonable to disagree with the conclusions that people draw based on these things, but I don't entirely understand the argument that these things are "non-evidence-based". I think it is worthwhile to distinguish between a disagreement over methodology, evidence strength, or interpretation, and the case where an argument is literally free of any evidence or substantiation whatsoever. In my view, arguments for short timelines contain evidence, but that doesn't mean that their conclusions are correct.
In my post, I referred to the concept of "evidence-based policy making". In this context, evidence refers specifically to rigorous, scientific evidence, as opposed to intuitions, unsubstantiated beliefs and anecdotes. By scientific evidence I mean, as I said, high-quality studies corroborated by other studies. And, as I emphasized in the section on evidence mismatch, using a study that concludes one thing as evidence for something else is a fallacy.
The idea that current progress in AI can be taken as evidence for AGI, which in some sense is the most extreme progress in AI imaginable and incomparable to current progress, is an extraordinary claim that requires extraordinary evidence. People arguing for it are mostly basing their argument on intuition and guesses, yet they often demand drastic actions based on their beliefs. We, as the EA community, should make decisions based on evidence. Currently, people are providing substantial funding to the "AI cause" based on arguments that do not meet the bar of evidence-based policy, and I think that is something that should and must be criticized.
It seems like the core of your argument is saying that there is a high burden of proof that hasn't been met. I agree that arguments for short timelines haven't met a high burden of proof but I don't believe that there is such a burden. I will try to explain my reasoning, although I'm not sure if I can do the argument justice in a comment, perhaps I will try to write a post about the issue.
When it comes to policy, I think the goal should be to make good decisions. You don't get any style points for how good your arguments or evidence are if the consequences of your decisions are bad. That doesn't mean we shouldn't use evidence to make decisions – we certainly should. But the reason is that using evidence will improve the quality of the decision, not that it earns "style points", so to speak.
Doing nothing and sticking with the status quo is also a decision that can have important consequences. We can't just magically have more rigorous evidence; we have to make decisions and allocate resources in order to get that evidence. When we make those decisions, we have to live with the uncertainty that we face and make the best decision given that uncertainty. If we don't have solid scientific evidence, we still have to make some decision. It isn't optional. Sticking with the status quo is still making a decision, and if we lack scientific evidence, then that policy decision won't be evidence-based even if we do nothing. I think we should make the best decision we can given what information we have instead of defaulting to an informal burden of proof. If there is a formal burden of proof, like a burden on one party in a court case or a procedure for how an administrative or legislative body should decide, then in my view that formal procedure establishes what the burden of proof is.
Although I believe there should be policy action/changes in response to the risk from AI, I personally don't see the case for this as hinging on the achievement of "AGI". I've described my position as being more concerned about "powerful" AI than "intelligent" AI. I think focusing on "AGI" or how "intelligent" an AI system is or will be often leads to unproductive rabbit holes or definition debates. On the other hand, obviously lots of AI risk advocates do focus on AGI, so I acknowledge it is completely fair game for skeptics to critique this.
Do you think you would be more open to some types of AI policy if the case for those policies didn't rely on the emergence of "AGI"?
No one has ever claimed that evidence should be collected for "style points".
Fortunately, AI research has plenty of funding right now (without any EA money), so in principle getting evidence should not be an issue. I am not against research; I am a proponent of it.
Sticking with the status quo is often the best decision. When deciding how to use funds efficiently, you have to consider the opportunity cost of not using those funds for something that has a certain positive benefit. And that alternative action is evidence-based. Thus, the dichotomy between "acting on AI without evidence" and "doing nothing without evidence" is false; the options are actually "acting on AI without evidence" and "acting on another cause area with evidence".
If the estimated value of using the money for AI is below the benefit of the alternative, we should not use it for AI and instead stick to the status quo on that matter. Most AI interventions are not tractable, and due to this their actual utility might even be negative.
Yes, there are several types of AI policy I support. However, I don't think they are important cause areas for EA.
AI certainly has a lot of resources available, but I don't think those resources are primarily being used to understand how AI will impact society. I think policy could push more in this direction. For example, requiring AI companies who train/are training models above a certain compute budget to undergo third-party audits of their training process and models would push towards clarifying some of these issues in my view.
The conclusion that cause A is preferable to cause B involves the uncertainty about both causes. Even if cause A has more rigorous evidence than cause B, that doesn't mean the conclusion that benefits(A) > benefits(B) is similarly rigorous.
Let's take AI and global health and development (GHD) as an example. I think it would be reasonable to say that the evidence for GHD is much more rigorous and scientific than the evidence for AI. Yet that doesn't mean that the evidence conclusively shows benefits(GHD) > benefits(AI). Let's say that someone believes that the evidence for GHD is scientific and the evidence for AI is not (or at least much less so), but that their overall, all-things-considered best estimate of benefits(AI) is greater than their best estimate of benefits(GHD). I think many people in the EA community in fact have this view. Do you think those people should still prefer GHD because AI is off limits due to not being "scientific"? I would consider this to be "for style points", and I disagree with this approach.
I will caveat this by saying that in my opinion it makes sense for estimation purposes to discount or shrink estimates of highly uncertain quantities, which I think many advocates of AI as a cause fail to do and can be fairly criticized for. But the issue is a quantitative one, and so it can come out either way. I think there is a difference between saying that we should heavily shrink estimates related to AI due to their uncertainty and lower-quality evidence, versus saying that they lack any evidence whatsoever.
I agree, but it doesn't follow from one cause being "scientific" while the other isn't that the "scientific" cause area has higher benefits.
I actually agree that tractability is (ironically) a strongly neglected factor and many proponents of AI as a cause area ignore or vastly overestimate the tractability of AI interventions, including the very real possibility that they are counterproductive/net-negative. I still think there are worthwhile opportunities but I agree that this is an underappreciated downside of AI as a cause area.
Can I ask why? Do you think AI won't be a "big deal" in the reasonably near future?
It seems you have an issue with the word "scientific" and are constructing a straw-man argument around it. This has nothing to do with "style points". As I have already explained, by scientific I only refer to high-quality studies that withstand scrutiny. If a study doesn't, then its value as evidence is heavily discounted, since the probability of the study's conclusions being right despite methodological errors, failures to replicate, etc. is lower than if the study did not have these issues. If a study hasn't been scrutinized at all, it is likely bad, because the amount of bad research is greater than the amount of good research (consider, for example, the rejection rates of journals and conferences), and the lack of scrutiny implies a lack of credibility, as researchers do not take the study seriously enough to scrutinize it.
Yet E[benefits(A)] > E[benefits(B)] is a rigorous conclusion, because the uncertainty can be factored into the expected value.
The International AI Safety Report lists many realistic threats (the first one of those is deepfakes, to give an example). Studying and regulating these things is nice, but they are not effective interventions in terms of lives saved etc.
I'm really at a loss here. If your argument is taken literally, I can convince you to fund anything, since I can give you highly uncertain arguments for almost anything. I cannot believe this is really your stance. You must agree with me that uncertainty affects decision making. It only seems that the word "scientific" bothers you for some reason, which I cannot really understand either. Do you believe that methodological errors are not important? That statistical significance is not required? That replicability does not matter? To object to the idea that these issues cause uncertainty is absurd.
The "scientific" phrasing frustrates me because I feel like it is often used to suggest high rigor without actually demonstrating that such rigor actually applies to a give situation, and because I feel like it is used to exclude certain categories of evidence when those categories are relevant, even if they are less strong compared to other kinds of evidence. I think we should weigh all relevant evidence, not exclude cetain pieces because they aren't scientific enough.
Yes, but in doing so the uncertainty in both A and B matters, and showing that A has lower variance than B doesn't show that E[benefits(A)] > E[benefits(B)]. Even if benefits(B) are highly uncertain and we know benefits(A) extremely precisely, it can still be the case that benefits(B) are larger in expectation.
In my comment that you are responding to, I say:
I also say:
What about these statements makes you think that I don't believe uncertainty affects decision making? It seems like I say that it does affect decision making in my comment.
If stock A very likely has a return in the range of 1-2%, and stock B very likely has a return in the range of 0-10%, do you think stock A must have a better expected return because it has lower uncertainty?
Yes uncertainty matters but it is more complicated than saying that the least uncertain option is always better. Sometimes the option that has less rigorous support is still better in an all-things-considered analysis.
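To make the arithmetic concrete, here is a minimal sketch (assuming, purely for illustration, uniform distributions over the stated ranges) showing that the more uncertain option can still have the higher expected return.

```python
# Toy illustration, assuming uniform distributions over the stated ranges.
# Stock A: return very likely in 1-2%  -> E[A] ~ 1.5%
# Stock B: return very likely in 0-10% -> E[B] ~ 5.0%
# B is far more uncertain, yet its expected return is higher.

def expected_uniform(low, high):
    """Expected value of a uniform distribution on [low, high]."""
    return (low + high) / 2

e_a = expected_uniform(0.01, 0.02)   # 0.015
e_b = expected_uniform(0.00, 0.10)   # 0.050

print(f"E[A] = {e_a:.3f}, E[B] = {e_b:.3f}, B better in expectation: {e_b > e_a}")
```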
I don't think my argument leads to this conclusion. I'm just saying that AI risk has some evidence behind it, even if it isn't the most rigorous evidence! That's why I'm being such a stickler about this! If it were true that AI risk has actually zero evidence then of course I wouldn't buy it! But I don't think there actually is zero evidence even if AI risk advocates sometimes overestimate the strength of the evidence.
Again, you are attacking me over the word "scientific" instead of attacking my arguments. As I have said many, many times, studies should be weighted based on their content and the scrutiny they receive. To oppose the word "science" just because of the word itself is silly. Your idea that works are arbitrarily sorted into "scientific" and "non-scientific" based on "style points" instead of an assessment of their merits is just wrong and a straw-man argument.
Where have I ever claimed that there is no evidence worth considering? In the start of my post, I write:
There are some studies that are rigorously conducted that provide some meager evidence. Not really enough to justify any EA intervention. But instead of referring to these studies, people use stuff like narrative arguments and ad-hoc models, which have approximately zero evidential value. That is the point of my post.
If you believe this, I don't understand where you disagree with me, other than your weird opposition to the word "scientific".
In your OP, you write:
You then quote the following:
Your summary of the quoted text is inaccurate. You claim that this is an argument that evidence is not something that is inherently required, but the quote says no such thing. Instead, it references "a large body of scientific evidence" and "stronger evidence" versus "limited evidence". This quote essentially makes the same argument I do above. How can we square the differences in these interpretations?
In response to me, you write:
You have also added as a clarification to your OP:
So, as used in your post, "evidence" means "rigorous, scientific evidence, as opposed to intuitions, unsubstantiated beliefs and anecdotes". This is why I find your reference to "scientific evidence" frustrating. You draw a distinction between two categories of evidence and claim policy should be based on only one. I disagree: I think policy should be based on all available evidence, including intuition and anecdote ("unsubstantiated belief" obviously seems definitionally not evidence). I also think your argument relies heavily on contrasting with a hypothetical highly rigorous body of evidence that isn't often achieved, which is why I have pointed out what I see as the "messiness" of lots of published scientific research.
The distinction you draw and how you define "evidence" result in an equivocation. Your characterization of the quote above only makes sense if you are claiming that AI risk can only claim to be "evidence-based" if it is backed by "high-quality studies that withstand scrutiny". In other words, as I said in one of my comments:
So, where do we disagree? As I say immediately after:
I believe that we should compare E[benefits(AI)] with E[benefits(GHD)] and any other possible alternative cause areas, with no area having any specific burden of proof. The quality of the evidence plays out in taking those expectations. Different people may disagree on the results based on their interpretations of the evidence. People might weigh different sources of evidence differently. But there is no specific burden to have "high-quality studies that withstand scrutiny", although this obviously weighs in favor of a cause that does have those studies. I don't think having high quality studies amounts to "style points". What I think would amount to "style points" is if someone concluded that E[benefits(AI)] > E[benefits(GHD)] but went with GHD anyway because they think AI is off limits due to the lack of "high-quality studies that withstand scrutiny" (i.e. if there is a burden of proof where "high-quality studies that withstand scrutiny" are required).
If you believe that evidence that does not withstand scrutiny (that is, evidence that does not meet basic quality standards, contains major methodological errors, is statistically insignificant, is based on fallacious reasoning, or fails scrutiny for any other reason) is evidence that we should use, then you are advocating for pseudoscience. The expected value of benefits based on such evidence is near zero.
I'm sorry if criticizing pseudoscience is frustrating, but that kind of thinking has no place in rational decision-making.
The quoted text implies that the evidence would not be sufficient under normal circumstances, hence the "evidence dilemma". If the amount of evidence were sufficient, there would be no question about the correct course of action. While the text washes its hands of making the actual decision to rely on insufficient evidence, it clearly considers this a serious possibility, which is not something that I believe anyone should advocate.
You are splitting hairs about the difference between "no evidence" and "limited evidence". The report considers a multitude of different AI risks, some of which have more evidence and some of which have less. What is important is that they bring up the idea that policy should be made without proper evidence.
People who have radical anti-institutionalist views often take reasonable criticisms of institutions and use them to argue for their preferred radical alternative. There are many reasonable criticisms of liberal democracy; these are eagerly seized on by Marxist-Leninists, anarchists, and right-wing authoritarians to insist that their preferred political system must be better. But of course this conclusion does not necessarily follow from those criticisms, even if the criticisms are sound. The task for the challenger is to support the claim that their preferred system is robustly superior, not simply that liberal democracy is flawed.
The same is true for radical anti-institutionalist views on institutional science (which the LessWrong community often espouses, or at least whenever it suits them). Pointing out legitimate failures in institutional science does not necessarily support the radical anti-institutionalists' conclusion that peer-reviewed journals, universities, and government science agencies should be abandoned in favour of blogs, forums, tweets, and self-published reports or pre-prints. On what basis can the anti-institutionalists claim that this is a robustly superior alternative and not a vastly inferior one?
To be clear, I interpret you as making a moderate anti-institutionalist argument, not a radical one. But the problem with the reasoning is the same in either case — which is why I'm using the radical arguments for illustration. The guardrails in academic publishing sometimes fail, as in the case of research misconduct or of well-intentioned, earnestly conducted research that doesn't replicate, as you mentioned. But is this an argument for kicking down all guardrails? Shouldn't it be the opposite? Doesn't this just show us that deeply flawed research can slip under the radar? Shouldn't this underscore the importance of savvy experts doing close, critical readings of research to find flaws? Shouldn't the replication crisis remind us of the importance of replication (which has always been a cornerstone of institutional science)? Why should the replication crisis be taken as license to give up on institutions and processes that attempt to enforce academic rigour, including replication?
In the case of both AI 2027 and the METR graph, half of the problem is the underlying substance — the methodology, the modelling choices, the data. The other half of the problem is the presentation. Both have been used to make bold, sweeping, confident claims. Academic journals referee both the substance and the presentation of submitted research; they push back on authors trying to use their data or modelling to make conclusions that are insufficiently supported.
In this vein, one of the strongest critiques of AI 2027 is that it is an exercise in judgmental forecasting, in which the authors make intuitive, subjective guesses about the future trajectory of AI research and technology development. There's nothing inherently wrong with a judgmental forecasting exercise, but I don't think the presentation of AI 2027 is clear enough that AI 2027 is nothing more than that. (80,000 Hours' video on AI 2027, which is 34 minutes long and was carefully written and produced at a cost of $160,000, doesn't even mention this.)
If AI 2027 had been submitted to a reputable peer-reviewed journal, besides hopefully catching the modelling errors, the reviewers probably would have insisted the authors make it clear from the outset what data the conclusions are based on (i.e. the authors' judgmental forecasts) and where that data came from. They would probably also have insisted the conclusions are appropriately moderated and caveated in light of that. But, overall, I think AI 2027 would probably just be unpublishable.
I don't think my argument is even that anti-institutionalist. I have issues with how academic publishing works but I still think peer reviewed research is an extremely important and valuable source of information. I just think it has flaws and is much messier than discussions around the topic sometimes make it seem.
My point isn't to say that we should throw out traditional academic institutions; it is to say that I feel like the claim that the arguments for short timelines are "non-evidence-based" is critiquing the same messiness that is also present in peer-reviewed research. If I read a study whose conclusions I disagree with, I think it would be wrong to say "field X has a replication crisis, therefore we can't really consider this study to be evidence". I feel like a similar thing is going on when people say the arguments for short timelines are "non-evidence-based". To me, things like METR's work definitely are evidence, even if they aren't necessarily strong or definitive evidence or if that evidence is open to contested interpretations. I don't think something needs to be peer reviewed to count as "evidence"; that is essentially the point I was trying to make.
Generally, the scientific community does not go around arguing that drastic measures should be taken based on single novel studies. Mainly, what a single novel study produces is a wave of new studies on the same subject, to ensure that the results are valid and that the assumptions used hold up to scrutiny. Hence why that claimed room-temperature superconductor was so quickly debunked.
I do not see similar efforts in the AI safety community. The studies by METR are great first forays into difficult subjects, but I see barely any scrutiny or follow-up by other researchers. And people accept much worse scholarship, like AI 2027, at face value for seemingly no reason.
I have experience in both academia and EA now, and I believe that the scholarship and skeptical standards in EA are substantially worse.
I agree that on average the scientific community does a great job of this, but I think the process is much, much messier in practice than a general description of it makes it seem. For example, you have the Alzheimer's research that got huge pick-up and massive funding from major scientific institutions even though the original research included doctored images. You have power posing getting viral attention in science-adjacent media. You have priming, where Kahneman wrote in his book that even if it seems wild you have to believe in it, largely for reasons similar to what is being suggested here, I think: multiple rigorous scientific studies demonstrated the phenomenon. And yet when the replication crisis came around, priming looked a lot shakier than it seemed when Kahneman wrote that.
None of this means that we should throw out the existing scientific community or declare that most published research is false (although ironically there is a peer reviewed publication with this title!). Instead, my argument is that we should understand that this process is often messy and complicated. Imperfect research still has value and in my view is still "evidence" even if it is imperfect.
The research and arguments around AI risk are not anywhere near as rigorous as a lot of scientific research (and I linked a comment above where I myself criticize AI risk advocates for overestimating the rigor of their arguments). At the same time, this doesn't mean that these arguments do not contain any evidence or value. There is a huge amount of uncertainty about what will happen with AI. People worried about the risks from AI are trying to muddle through these issues, just like the scientific community has to muddle through figuring things out as well. I think it is completely valid to point out flaws in arguments, lack of rigor, or overconfidence (as I have also done). But evidence or argument doesn't have to appear in a journal or conference to count as "evidence".
My view is that we have to live with the uncertainty and make decisions based on the information we have, while also trying to get better information. Doing nothing and going with the status quo is itself a decision that can have important consequences. We should use the best evidence we have to make the best decision given uncertainty, not just default to the status quo when we lack ideal, rigorous evidence.
I agree. EA has a cost-effectiveness problem that conflicts with its truth-seeking attempts. EA's main driving force is cost-effectiveness, above all else - even above truth itself.
I really don't know how you'd fix this. I don't think research into catastrophic risks should be conducted on a shoestring budget and by a pseudoreligion/citizen science community. I think it should be government funded and probably sit within the wider defense and security portfolio.
However I'll give EA some grace for essentially being a citizen science community, for the same reason I don't waste effort grumping about the statistical errors made by participants in the Big Garden Birdwatch.