Inspired by this post (and the one-year retrospective Rasool linked), I built a continuously updated version, a live scorecard that grades each prediction as evidence comes in: https://agiscorecard.com
Current tally: 3 on track (capability trajectory, scaling pace, capex), 1 graded wrong (open-source fading; DeepSeek V4 and Qwen are now ~3–6 months behind the frontier), and 2 still open (AGI-by-2027, The Project). It also puts his 2027 forecast side by side with Metaculus, Samotsvety, Hassabis, and the academic survey median.
From reading the thread, the open-source verdict seems to be the most contested: huw and JoshYou make points that cut both ways. I'd genuinely welcome pushback on any of the grades; the whole thing only works if the verdicts survive scrutiny.
This thread is exactly the kind of scrutiny I was hoping for. I graded "open-source fades" as his clearest miss on a live scorecard I built (https://agiscorecard.com), but your point about distillation muddying the picture makes me wonder whether "wrong" is too strong compared with "right mechanism, wrong conclusion." Curious where you'd land.