The Validation Crisis

I build algorithmic trading systems. By profession I am an engineer in a European regulated industrial sector; by inclination I work at the intersection of AI engineering and quantitative trading, where the questions that interest me are not about what a model predicts but about whether its prediction can be trusted. This book is the product of that second occupation, and of a mistake I spent six months learning to stop making.

For a long time I could not defeat overfitting. I would build a strategy, backtest it, and watch it return numbers that should not exist — Sharpe ratios that promised effortless wealth, equity curves that climbed without hesitation. Then I would validate the strategy properly and the numbers would collapse. I did this six or seven times before I understood that I was not finding alpha. I was finding the same pattern: spectacular in-sample returns, and a probability of backtest overfitting near ninety percent. The model was not discovering structure in the market. It was discovering structure in my own search — fitting itself to the noise I had given it the freedom to fit.

What broke the cycle was not a better strategy. It was a different discipline. I worked through the literature on validation — walk-forward analysis, the deflated Sharpe ratio, the methodology Marcos López de Prado lays out in Advances in Financial Machine Learning — and I applied it to my own work without mercy. The deflation was brutal. Strategies I had been proud of did not survive it. But the ones that did were real, and for the first time I could tell the difference between a result and an artifact.

I write here in a personal capacity. Nothing in this book reflects the position, knowledge, or proprietary work of any employer, past or present.

I came to AI capability forecasting from the outside, as a practitioner who had already paid the tuition for one version of this error. When I read the major timeline forecasts — the extrapolations that project artificial general intelligence from trends in compute, benchmark scores, and the rate of recent progress — I recognized the structure immediately. It was the structure of my own failed backtests. A quantity is measured over a window. A curve is fit to that window. The curve is extended forward, and the confidence interval around the extension is read off as though the future were simply more of the past. This is in-sample extrapolation. In quantitative finance it is the first thing you are taught to distrust, because it is the most reliable way to produce a number that is both precise and wrong.

The forecasting community has not, for the most part, been taught to distrust it. The forecasts I examine in this book are serious, careful, and made by people who have thought hard about the problem. That is precisely why their methodological vulnerability matters. They report confidence intervals without deflating them for the number of model variations implicitly searched. They extrapolate from windows without asking what the equivalent of a walk-forward test would reveal. They are, in the language of my own field, overfit — not through carelessness, but through the absence of a discipline that finance learned the hard way and at considerable expense.

This book imports that discipline. Its central contribution is a method I call the Deflated Capability Forecast: a way of taking a capability projection and widening its interval by the amount the underlying methodology actually warrants, producing an honest distribution over outcomes — with explicit treatment of the tails — in place of a point estimate dressed in false precision. The full derivation is in Chapter 14. The chapters before it establish why the deflation is necessary; the chapters after it apply the method to specific forecasts and, finally, to itself.

I want to be precise about what this book does and does not offer. It does not tell you when artificial general intelligence will arrive. It does not contain a better timeline. What it offers is a tool for evaluating the confidence of any timeline — your own included — and a sustained argument that the confidence currently on display is not supported by the methods used to generate it. If you are looking for a forecast, this is the wrong book. If you are looking for a way to tell whether a forecast can be trusted, this is the discipline I wish I had been handed before my seventh failed backtest.

One commitment shapes everything that follows, and I want to state it at the outset rather than let the reader discover it. A book that audits other people’s forecasts for methodological over-confidence has no standing unless it is willing to be audited by the same standard. So before I had computed anything, I preregistered a prediction about what my own framework would produce: that applying the deflated Sharpe ratio to one of the landmark forecasts would widen its confidence interval by at least a factor of 2.3. That prediction failed.

I report that failure in full, in Chapter 16, without softening it. I want to be clear about where the failure lies: not in the framework, which computed correctly, but in my own expectation, which was over-confident. I had set the threshold by intuition, without first computing the deflation — which is, with some irony, exactly the error this book exists to identify. The framework worked. My prior about the framework did not. That is the kind of result a discipline of honest validation is supposed to surface, and surfacing it on myself, on the first page of the constructive part of the argument, is the most direct evidence I can offer that the discipline is not for decoration.

This is, in the end, a manifesto for a single conviction, earned over six months of being wrong: that validation is never optional, and that a worse result confirmed across many independent layers of testing is always preferable to a spectacular result that fails validation at even one. An overfit of twenty percent is still an overfit. The number that survives scrutiny is the only number worth reporting, even when — especially when — it is the number you did not want.

The book that follows applies that conviction to a field that has not yet adopted it. I make no claim to have the last word. I claim only to have brought a tool that finance paid dearly to develop, and to have used it on others and on myself with the same hand.

2 Introduction

The forecasts that shape how billions of dollars and a generation of careers are allocated toward artificial general intelligence share a methodological foundation that would not survive scrutiny in any mature quantitative discipline. This is not a claim about whether those forecasts are right. It is a claim about how they are made. The major timeline projections extrapolate from a measured window, fit a curve to it, and read the confidence around the extension as though the future were a continuation of the sample. Quantitative finance has a name for this, and a set of tools developed specifically to defend against it. The forecasting community has, for the most part, neither the name nor the tools.

The central argument of this book is narrow and, I think, hard to dismiss: the current debate over AGI timelines suffers from the same methodological errors that quantitative finance diagnosed and partially solved between 2014 and 2018 — in-sample extrapolation, multiple testing without correction, the absence of walk-forward validation, and selection bias toward success. These are not exotic failures. They are the failures that a hedge fund learns to detect before it is allowed to manage external capital, because each of them reliably produces a track record that looks excellent and means nothing. The tests that exposed overfitting in financial backtests can be applied, with care, to capability projections.

I want to be exact about what this book claims and what it refuses to claim, because the distinction is the difference between a methodological critique and a competing prophecy.

It claims four things. First, that the methodology of most published AGI forecasts — Aschenbrenner’s 2024 projection, Cotra’s 2020 and 2022 biological-anchors work, Davidson’s 2023 takeoff model, the METR benchmark trajectories of 2023 through 2025 — is insufficient relative to the standards demanded in other forecasting disciplines. Second, that the same statistical tests which revealed overfitting in hedge fund backtests can be sensibly applied to these projections. Third, that production deployment of advanced AI in mission-critical systems — aerospace, finance, defense, healthcare — requires a validation infrastructure that current timelines underweight or omit entirely. Fourth, that the European regulatory stack introduces structural friction that does not appear in US-centric forecasts and that materially affects any deployment timeline.

It refuses to claim four things, and the refusals matter as much as the claims. This book does not argue that AGI will not arrive. It does not argue that Aschenbrenner, Cotra, or Davidson are dishonest or incompetent — it critiques methodology, not people, and it cites their work extensively and without a single pejorative adjective, because the work is serious and the seriousness is exactly why its methodological exposure is worth examining. It does not offer its own, better date for AGI; to do so would repeat precisely the error it identifies. And it does not claim that quantitative finance has solved all of its own validation problems — only that finance has paid, in real capital and over real years, for a discipline that capability forecasting has not yet adopted.

2.1 The error, and where it comes from

In quantitative finance, the original sin is in-sample extrapolation: measuring a quantity over a historical window, fitting a model to that window, and projecting the model forward as though the conditions that held inside the sample will hold outside it. The danger is not that extrapolation is always wrong. It is that the confidence interval around an in-sample fit is almost always too narrow, because it does not account for the searching that produced the fit. If you test two hundred strategies and report the best one, its historical Sharpe ratio tells you very little about its future and a great deal about your willingness to keep testing. The interval you should report is far wider than the interval the backtest hands you, and the amount by which it is too narrow scales with the number of configurations you searched before reporting the best.
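The two-hundred-strategy case can be made concrete with a short simulation. The sketch below is illustrative only: it assumes every strategy has zero true edge, that daily returns are independent Gaussian noise, and that the sample covers five years of 252 trading days. The function name `best_of_n_sharpe` and all parameter values are hypothetical choices, not taken from any study cited here.

```python
import numpy as np

def best_of_n_sharpe(n_strategies=200, n_days=1260, seed=0):
    """Annualized Sharpe of the best of n zero-edge strategies, plus the average."""
    rng = np.random.default_rng(seed)
    # Each row is one strategy's daily returns: pure noise, true Sharpe = 0.
    returns = rng.normal(loc=0.0, scale=0.01, size=(n_strategies, n_days))
    daily_sharpe = returns.mean(axis=1) / returns.std(axis=1, ddof=1)
    annualized = daily_sharpe * np.sqrt(252)
    # Reporting only the maximum is the selection step that inflates the result.
    return annualized.max(), annualized.mean()

best, average = best_of_n_sharpe()
print(f"best of 200: {best:.2f}, average across all 200: {average:.2f}")
```

Under these assumptions the average sits near zero, as it should, while the best-of-two-hundred figure is clearly positive even though no strategy has any edge. The gap between the two is produced entirely by the search.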

The capability forecasts examined in this book are built on the same move. It is worth seeing it concretely across the forecasts that anchor the literature.

Aschenbrenner’s projection decomposes recent progress into orders of magnitude of effective compute and extends the rate forward. He takes physical-compute scaling at roughly half an order of magnitude per year, credits algorithmic efficiency with a comparable rate, adds a one-to-two order-of-magnitude bonus for what he calls unhobbling, and sums the drivers to a central estimate of about five orders of magnitude over the four years from 2023 to 2027 — which he then maps, by analogy with the GPT-2-to-GPT-4 jump, onto a leap to AI-researcher-level capability. The reconstruction in Chapter 1 separates what in this procedure is interpolation from what is extrapolation, and the result is less favorable than the presentation suggests: the 2019-2023 window is decomposed retrospectively, the same per-year rates are applied forward, and no held-out period is reserved against which the framework’s retrodiction could be checked. Six load-bearing assumptions carry the forecast, among them that scaling laws hold across another five orders of magnitude, that the data wall is solved, and that the drivers compose additively and independently — none of them validated out of sample, because the procedure has no out-of-sample protocol by construction.

Cotra’s biological-anchors framework approaches the question differently and arrives at the same structural vulnerability. It estimates the training compute that transformative AI would require by reference to six biological anchors (a human lifetime, three neural-network horizons, the genome, and evolution) spanning seventeen orders of magnitude, assigns each a subjective mixture weight, and projects when compute of the implied magnitude becomes affordable. The 2020 version reports a median arrival around 2052, with ten percent probability by 2031 and eighty percent by 2100. The 2022 update raises the hardware baseline roughly tenfold, shifts the weights toward shorter horizons, redefines the target, and compresses the median to about 2040, with fifteen percent by 2030 and sixty percent by 2050. That twelve-year revision in two years is the most informative event in the framework’s history, and Chapter 1 reads it for what it is: a discretionary update with no preregistered rule governing it. It satisfies one criterion of an out-of-sample test in spirit while failing another, because the weight changes were chosen after the intervening evidence had been seen, not committed to before it.

Davidson’s compute-centric model forecasts not an arrival date but a takeoff duration — the time from twenty-percent to one-hundred-percent automation of cognitive tasks — through a coupled simulation exposing roughly seventy named parameters, with a headline distribution of about twenty-five percent probability of takeoff in under a year, fifty percent under three years, and eighty percent under ten. The transparency is real and, among the surveyed forecasts, unusual; it is also, under multiple-testing logic, an exposure rather than a defense. Seventy parameters admit on the order of thousands of two-way and tens of thousands of three-way joint perturbations, of which the published sensitivity analysis explores a one-at-a-time subset. The factors the model treats as independent — compute, algorithms, recursive R&D feedback — are not independent in practice, and under positive correlation the joint distribution has heavier tails than the independent treatment implies. Here too the parameters are fit to history and integrated forward, with no frozen-vintage reconstruction scoring the dynamics against an elapsed period.
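The size of the joint-perturbation space follows from simple counting. The sketch below takes the parameter count of seventy from the text and counts unordered subsets only; it assumes one perturbation per subset, so allowing several levels per parameter would make the numbers larger still.

```python
from math import comb

n_params = 70  # approximate parameter count quoted in the text

one_at_a_time = comb(n_params, 1)  # what a one-at-a-time sweep covers
two_way = comb(n_params, 2)        # unordered pairs of parameters
three_way = comb(n_params, 3)      # unordered triples of parameters

print(one_at_a_time, two_way, three_way)  # 70 2415 54740
```

A one-at-a-time sweep therefore touches 70 of the roughly 57,000 perturbations of order three or lower.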

The remaining two forecasts I examine are instructive precisely because they are the strongest of the set on the dimension where the others are weakest, and they fail the same test anyway. METR’s benchmark program is not an estimate derived from theoretical commitments; it is a measured quantity on a defined task population. It computes, for each model, the human-equivalent task length at which the model succeeds half the time, and tracks how that time horizon grows across models, reporting a doubling time of roughly seven months over 2019-2025. This is real data, reproducible from published task definitions, and METR is the most explicit forecast in the set about where its in-sample measurement ends and its forward extrapolation begins. It even supplies the closest thing the literature offers to a genuine out-of-sample check: when the benchmark was expanded by a third and the same models re-evaluated, the headline doubling time held and most per-model intervals tightened. But that check validates the measurement layer, not the forecast layer. The projection to week-equivalent and month-equivalent task automation, two to five years beyond the data, remains an extrapolation of the chosen window’s slope. The slope itself is window-dependent: fitting only the most recent data rather than the full period shortens the arrival estimate by years.
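The window-dependence of the extrapolation is easy to see in arithmetic. The sketch below assumes pure exponential growth in the time horizon. The seven-month doubling figure is the one quoted above; the one-hour starting horizon, the forty-hour definition of a week-equivalent task, and the four-month alternative are hypothetical values chosen for illustration, not METR figures.

```python
import math

def months_to_reach(target_hours, current_hours, doubling_months):
    """Months until an exponentially growing horizon reaches target_hours."""
    doublings_needed = math.log2(target_hours / current_hours)
    return doublings_needed * doubling_months

CURRENT_HOURS = 1.0      # hypothetical starting horizon, not a METR figure
WORK_WEEK_HOURS = 40.0   # assumed definition of a week-equivalent task

# 7 months is the figure quoted in the text; 4 months is a hypothetical
# steeper slope of the kind a recent-data-only fit might produce.
for doubling_months in (7.0, 4.0):
    months = months_to_reach(WORK_WEEK_HOURS, CURRENT_HOURS, doubling_months)
    print(f"doubling every {doubling_months:.0f} months: {months / 12:.1f} years")
```

With these inputs the arrival estimate moves from about three years to under two purely by changing the fitted slope, with no change in the underlying measurements.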

Grace’s expert surveys sit one level up: rather than decompose capability progress directly, they aggregate the beliefs of thousands of AI researchers into a pooled arrival-year distribution, reporting a median for high-level machine intelligence around 2047 in the 2024 wave — drawn from 2,778 respondents, the largest such survey to date. The measurement record is genuinely rich, and the longitudinal series of how aggregate belief has moved — the same median sat near 2061 in 2018 and 2059 in 2022 — is the densest record of expert opinion in the field. Yet richness on the input layer does not transfer to the forecast layer. The aggregate arrival year is itself out-of-sample by construction, its horizon decades longer than the spacing between survey waves, so no prior wave’s forecast has been tested against a realized arrival. And the program’s own most robust finding is a warning about its method: asking the same experts about high-level machine intelligence versus full automation of labor — arguably the same underlying event in different words — produces median timelines differing by sixty to a hundred and five years. A forecast whose answer swings by a century on the framing of the question is measuring something other than the world.

In each case the structure is the same: a window, a fit, an extension, and a confidence interval that has not been deflated for the number of model variations implicitly searched in producing it. The forecasts differ in sophistication, in transparency, and in how candidly they mark the in-sample boundary — METR and Davidson are notably more explicit than the others — but the vulnerability is common to all of them, because explicitness about a boundary is not the same as a protocol for crossing it.

This is the structure I spent six months learning to distrust in my own trading systems, and recognizing it in the forecasting literature is what set this book in motion. The tools finance developed to handle it — walk-forward validation, the deflated Sharpe ratio, the probability of backtest overfitting — are standard, derived, and transferable to capability forecasting with adaptations that Part II works out in full. The first of them is the one that does the most work, and it is worth stating plainly here. Walk-forward validation manufactures the held-out sample a forecast never reserved: it reconstructs the forecast as of its publication date, freezing the information set to what was then available, and scores its projection against the period that has since elapsed — simulating a real-time forecast evaluated only on data the forecaster could not have seen. Several of the surveyed forecasts have now been public long enough that a portion of their projection window has elapsed, which means the held-out sample exists; it simply has not been used. Part II uses it.
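A minimal sketch of the walk-forward idea follows. It is not the protocol developed in Part II: it assumes the forecast can be summarized as a log-linear trend, uses an ordinary least-squares fit as a stand-in for the forecaster's method, and runs on a made-up series. The function name `walk_forward_score` and the data are hypothetical.

```python
import numpy as np

def walk_forward_score(years, log_values, publication_year):
    """Fit a log-linear trend on data up to publication, score it on later data."""
    years = np.asarray(years, dtype=float)
    log_values = np.asarray(log_values, dtype=float)
    frozen = years <= publication_year   # information set at publication
    held_out = ~frozen                   # period the forecaster could not see
    if frozen.sum() < 2 or held_out.sum() == 0:
        raise ValueError("need at least two frozen points and one held-out point")
    # Center on the publication year to keep the fit well conditioned.
    x = years - publication_year
    slope, intercept = np.polyfit(x[frozen], log_values[frozen], deg=1)
    predicted = slope * x[held_out] + intercept
    errors = log_values[held_out] - predicted  # negative: progress fell short
    return slope, errors

# Hypothetical series in log10 units: 0.5 per year to 2022, slower afterwards.
years = [2018, 2019, 2020, 2021, 2022, 2023, 2024]
log_values = [0.0, 0.5, 1.0, 1.5, 2.0, 2.3, 2.6]
slope, errors = walk_forward_score(years, log_values, publication_year=2022)
print(f"frozen-vintage slope: {slope:.2f}, held-out errors: {np.round(errors, 2)}")
```

The essential design choice is the mask: nothing dated after the publication year is allowed to influence the fit, so the held-out errors measure what the forecast would have earned in real time.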

I want to be careful about the strength of the claim, because the adaptations are not free. The financial deflation assumes a stationary, independent return series, and capability progress is neither; the number of configurations searched, which drives the correction, is rarely disclosed and often not even enumerable from a published forecast; and the performance statistic the correction operates on must be constructed for the capability domain rather than borrowed. Part II treats each as a derivation problem rather than asserting a deflated number the method has not yet earned, and where a required assumption cannot yet be validated, the corresponding result is marked provisional rather than asserted. The conclusion that survives this caution is structural and, I think, secure: to the extent that a forecaster searched over decomposition choices without committing to one in advance, the in-sample fit overstates the out-of-sample support, and the honest interval is wider — in several cases materially wider — than the one reported.

2.2 What the book offers in their place

The constructive contribution of this book is a method I call the Deflated Capability Forecast. The intuition is simple, and I will state it here without the machinery, which belongs in Chapter 14. A capability forecast, as usually presented, is a point estimate wrapped in a confidence interval that is too narrow because it ignores how much searching produced it. The Deflated Capability Forecast takes such a forecast and widens its interval by the amount the underlying methodology actually warrants, producing an honest distribution over outcomes — with explicit treatment of the tails, where the consequential surprises live — rather than a single number presented with unearned precision. It does not tell you the forecast is wrong. It tells you how much less certain the forecast is than it appears.

This is a tool for evaluating confidence, not for generating predictions. Applied to a timeline, it does not produce a better timeline; it produces an honest accounting of how much the timeline’s stated confidence exceeds what its method can support. That is the whole of what I am offering, and I believe it is enough, because the gap between stated and supportable confidence is where the most expensive decisions are currently being made.

2.3 How the argument is built

Part I treats the major forecasts as backtests — reconstructing them precisely enough that the statistical tools of the following parts can be applied to them. Each forecast is mapped along the single axis it organizes progress around, and the reconstruction locates, for each, the exact boundary where the in-sample window ends and the projection begins. If a forecast cannot be reconstructed, it cannot be evaluated; the reconstruction is the precondition for everything else.

Part II is the statistical crisis. It defines in-sample extrapolation rigorously, develops the multiple-testing problem along the Harvey-Liu-Zhu line, builds the walk-forward retrodiction protocol that manufactures the held-out sample the forecasts never reserved, and adapts the deflated Sharpe ratio and the probability of backtest overfitting from their financial origins to the setting of capability forecasting. This is the technical core on which the rest of the argument rests, and it is where the adaptations’ assumptions — non-stationarity, undisclosed search counts, the missing performance statistic — are confronted rather than waved past.

Part III is the production reality gap. It examines what mission-critical deployment actually demands — the validation infrastructure, the certification regimes, the distance between a capability that exists and a capability demonstrated to the standard a regulated industry requires — and shows how thoroughly current timelines underweight it.

Part IV is the European stack. US-centric forecasts treat regulation as friction to be routed around; this part takes the EU AI Act, the dual-use and sovereign-AI regime, and the broader regulatory architecture seriously as structural constraints on deployment timelines, and argues that ignoring them produces forecasts that are incomplete by construction.

Part V is the better framework. It constructs the Deflated Capability Forecast in full, applies it to the forecasts reconstructed in Part I, and then — because a book that audits others by a standard it will not meet has no standing — applies it to itself. The self-application is not a flourish. It is the test of whether the discipline is real.

A note on epistemic discipline runs through every chapter. Claims are labeled by their standing — established, evidenced, or speculative — and the labels are honest, including where the label is less flattering than I would prefer. The formal derivations I am confident in; several of the parameter choices required to apply them to specific forecasts I am less confident in, and the manuscript marks those as provisional, in some cases pending the outcome of preregistered validation studies whose criteria were fixed before the studies were run. The point of the book is not to be certain. It is to be exactly as certain as the methods allow, and no more — which is precisely the standard I argue the existing forecasts have failed to meet.

That standard cuts both ways, and Part V turns it on this book. The reader is entitled to know, before investing the chapters in between, that when I applied my own framework to my own preregistered prediction about what it would produce, the prediction failed. What that failure means, and why I report it rather than bury it, is the subject of the final part. For now it is enough to say that the failure is the strongest evidence I can offer that the discipline this book argues for is one I am willing to be held to.

Part I — The Forecast Landscape

3 Chapter 1.1 — Mapping the Major Forecasts: Aschenbrenner’s Counting OOMs Method

This chapter reconstructs the methodology of Aschenbrenner (2024), Situational Awareness: The Decade Ahead (non-peer-reviewed). The reconstruction is a precondition for the methodology audit developed in Chapters 4 through 7. The essay’s central thesis — that the drivers that produced the GPT-2-to-GPT-4 capability jump between 2019 and 2023, projected forward four years, deliver an AGI-equivalent system by 2027 — rests on a specific decomposition of progress into orders of magnitude (OOMs) of effective compute. That decomposition is the object here.

The aim is mapping, not adjudication. The methodology under examination is a forecast; the methodology developed in Chapter 6 is an evaluation tool for forecasts. Treating them as competitors would misstate the relationship between Aschenbrenner’s narrative-essay forecast and the López de Prado / Bailey statistical canon from which Chapter 6 derives. Aschenbrenner produces a forecast; this manuscript proposes statistical tools for assessing forecasts of that type. Chapter 1.1 belongs to the first half of that relationship.

3.1 §1 — What Aschenbrenner Counts

Aschenbrenner organizes capability progress along a single axis, which the preregistered taxonomy of this manuscript names the OOM-axis (effective-compute decomposition) (preregistration v1, §7-axis decomposition taxonomy, 2026-05-18). The axis sums contributions in log-space across four drivers. The first three are quantified in the essay; the fourth is described qualitatively and converted to OOM-equivalents by analogy.

Driver 1: physical-compute scaling. Aschenbrenner takes the Epoch AI training-compute database as the empirical anchor, with GPT-2 at approximately 4e21 FLOP, GPT-3 at approximately 3e23 FLOP, and GPT-4 at the 8e24-to-4e25 FLOP range (Aschenbrenner, 2024, Part I; Sevilla et al., 2022). The historical rate cited is approximately 0.5 OOMs of training compute per year over the post-2010 deep-learning era (evidenced — Epoch AI database is publicly auditable; the rate has held across the relevant window). Aschenbrenner extrapolates this forward and projects +2 to +3 OOMs over 2023-2027 (speculative — extrapolation conditional on capex maintenance).
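The FLOP figures above convert to orders of magnitude with a base-10 logarithm of their ratio. The sketch below uses only the three compute estimates quoted in this paragraph; the helper name `ooms` is hypothetical.

```python
import math

def ooms(from_flop, to_flop):
    """Orders of magnitude (base-10 log ratio) between two compute figures."""
    return math.log10(to_flop / from_flop)

# Training-compute estimates as quoted in the text.
GPT2_FLOP = 4e21
GPT3_FLOP = 3e23
GPT4_FLOP_LOW, GPT4_FLOP_HIGH = 8e24, 4e25

print(f"GPT-2 -> GPT-3: {ooms(GPT2_FLOP, GPT3_FLOP):.1f} OOMs")
print(f"GPT-2 -> GPT-4: {ooms(GPT2_FLOP, GPT4_FLOP_LOW):.1f} to "
      f"{ooms(GPT2_FLOP, GPT4_FLOP_HIGH):.1f} OOMs")
```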

Driver 2: algorithmic efficiency. The essay credits algorithmic-efficiency gains of approximately 0.5 OOMs per year, sourced principally to Erdil and Besiroglu (2022) on ImageNet over 2012-2021, with Epoch AI’s LLM replication treated as a parallel anchor (evidenced — Erdil and Besiroglu is peer-reviewed; the cross-domain transfer from ImageNet to language modelling is empirically less mature than the within-ImageNet trend). Aschenbrenner extrapolates +1 to +3 OOMs over 2023-2027 (speculative).

Driver 3: unhobbling (post-training and inference-time techniques). Aschenbrenner credits step-changes from RLHF (Ouyang et al., 2022; a small RLHF-tuned model matched in human-preference ratings a non-RLHF model more than 100x larger), from chain-of-thought prompting (Wei et al., 2022; roughly an order-of-magnitude effective-compute equivalent on math and reasoning), and from scaffolding (single-OOM-and-larger multipliers on HumanEval and SWE-Bench). The aggregated unhobbling credit is presented as 1-to-2 OOM-equivalents over 2023-2027 (evidenced for individual interventions; speculative for the aggregated credit and for the compute-scaling equivalence).

Driver 4: hobbling removal. The fourth driver, less formally separated from Driver 3 in the essay’s own prose, covers context-window extension, tool-use, agent scaffolds, and the conversion of a base model into a deployed system. Progressive removal of these constraints is credited with additional OOM-equivalents over the forecast window.

Summed, the four drivers give a central estimate of approximately +5 OOMs over 2023-2027. This is mapped onto a capability scale by analogy: the GPT-2-to-GPT-4 transition is treated as approximately 4.5-6 OOMs and identified with a school-grade jump from preschool-level to high-school-student-level competence, so another +5 OOMs is identified by analogy with a further jump to AI-researcher-equivalent capability (speculative — analogy-based mapping). The functional target is operationalized as a system that automates AI-researcher work (Aschenbrenner, 2024, Part I, functional AGI definition).

The first driver’s empirical anchor is not produced by Aschenbrenner directly: it is taken from Epoch AI’s compute-estimation methodology, which decomposes training FLOP from architecture papers, GPU specifications, training-time disclosures, and infrastructure announcements (Sevilla et al., 2022). Where Epoch’s per-model estimates revise, Aschenbrenner’s first OOM driver shifts mechanically. This upstream dependency is preregistered as Pattern G (Epoch-specific decomposition-axis distinctness; preregistration v1, §Pattern framework) and is a structural feature of the surveyed forecast set, not a defect of Aschenbrenner’s essay: any compute-decomposition forecast inherits Epoch’s measurement choices.

3.2 §2 — The Methodology Decomposition

The OOM framework can be stated as a function. Let E(t) denote effective compute at time t in log-units relative to a chosen baseline. Aschenbrenner’s procedure constructs E(t) as

\[ E(t) = E(t_0) + \sum_{i} r_i \,(t - t_0) \]

where rᵢ are the per-year contributions of the three (or four) drivers, treated as approximately constant over the extrapolation window, and the sum is interpreted as additive in log-space. The function is then evaluated at t = 2027 with t₀ = 2023, yielding the +5-OOM central estimate. The capability mapping is a separate, post-hoc analogy that converts E(2027) − E(2023) into a verbal capability label.
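The additive construction described here, constant per-year rates summed in log-space and multiplied by the elapsed time, can be written out directly. In the sketch below the first two rates are the 0.5 OOMs per year quoted earlier in this chapter; the third spreads the lower end of the 1-to-2-OOM unhobbling credit evenly over four years, which is an illustrative assumption and not the essay's own per-year figure. The function name is hypothetical.

```python
def effective_compute_gain(rates_per_year, t, t0):
    """E(t) - E(t0) in OOMs, assuming constant rates that add in log-space."""
    return sum(rates_per_year.values()) * (t - t0)

# Per-year rates in OOMs. The unhobbling rate is an assumed even spread of
# a 1-OOM credit over four years, chosen only for illustration.
rates = {
    "physical_compute": 0.5,
    "algorithmic_efficiency": 0.5,
    "unhobbling": 0.25,
}

gain = effective_compute_gain(rates, t=2027, t0=2023)
print(f"E(2027) - E(2023) = {gain:.1f} OOMs")  # E(2027) - E(2023) = 5.0 OOMs
```

Under these inputs the sum reproduces the +5-OOM central estimate. Because the function is linear in every rate, any error in a single rate carries through to the total in proportion to the length of the window.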

Inputs. The empirical base is narrow and explicit. Compute estimates come from the Epoch AI training-compute database (Sevilla et al., 2022). Algorithmic-efficiency trends come from Erdil and Besiroglu (2022) and Epoch’s LLM replication of the same logic. Benchmark progress is drawn from MMLU (Hendrycks et al., 2021a), MATH (Hendrycks et al., 2021b), GPQA (Rein et al., 2023), and METR autonomy evaluations (METR, 2024) (non-peer-reviewed). Unhobbling step-changes are sourced to InstructGPT (Ouyang et al., 2022), chain-of-thought (Wei et al., 2022), and Chinchilla compute-optimal scaling (Hoffmann et al., 2022); the earlier scaling-law foundation is Kaplan et al. (2020). API pricing comparisons (Gemini 1.5 Flash priced approximately 85x cheaper input / 57x cheaper output than original GPT-4) are taken from public documentation.

Extrapolation and assumptions. Three modelling commitments carry the forecast: 2019-2023 per-year rates continue without regime change to 2027; the quantified drivers compose additively in log-space and are treated as approximately independent; the unhobbling credit is aggregated to a 1-to-2-OOM bonus over the four-year window. Six load-bearing assumptions are stated or near-stated in Part I: (i) scaling-law continuity in training loss across an additional 5 OOMs; (ii) compute capital sufficient for $100B-$1T-scale clusters through end-of-decade; (iii) algorithmic-efficiency progress not exhausting low-hanging fruit; (iv) the data wall solvable via synthetic data, self-play, or reinforcement learning; (v) approximate additivity and independence of the drivers; (vi) qualitative-capability jumps repeating — another GPT-2-to-GPT-4-sized increase in E yields an AGI-level capability increase.

In-sample / out-of-sample boundary. This is the load-bearing observation of the chapter. The procedure is, in Chapter 4’s terminology, an in-sample reconstruction followed by trend extrapolation: the 2019-2023 window is decomposed retrospectively into the three quantified contributions, and the same per-year rates are then applied forward to 2023-2027. No out-of-sample protocol is used. No walk-forward analysis is presented. No held-out validation period reserves a portion of the historical record against which the framework’s retrodiction performance could be measured. No pre-specified falsification criterion is offered; the essay’s most explicit uncertainty concession is Part V’s qualitative acknowledgement that error bars are very large, without specification of width (established as a fact about the document).

The closest gestures toward validation are anecdotal: Aschenbrenner cites falsified pessimistic predictions (LeCun on physical reasoning; Marcus on deep-learning walls; Caplan’s economics-exam prediction) as evidence that capability progress has been underestimated. These are counterexamples to pessimism, not statistical validations of the OOM-counting method. Counterexamples to one direction of forecast error do not validate a forecasting procedure; they show only that the opposing camp erred, and say nothing about the size or direction of the procedure’s own error. Chapter 4 returns to the asymmetry.

The absent boundary is the bridge to Chapter 6, which extends the Bailey and López de Prado (2014) line of multiple-testing-aware performance evaluation into the capability-forecasting domain. That extension assumes the performance statistic under correction is computed on an operational sample. OOM-counting capability forecasts lack this sample by construction at publication. Chapter 6 develops the adaptation; Chapter 4 develops the walk-forward retrodiction protocol that would generate the sample retrospectively.
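
For readers unfamiliar with the term, a walk-forward retrodiction can be sketched in a few lines. This is a generic expanding-window version on a synthetic series, assumed here for illustration; it is not the protocol Chapter 4 specifies, and the names fit_line and walk_forward_errors are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def walk_forward_errors(years, log_values, min_train=3, horizon=1):
    """Refit on an expanding window and score each forecast on data the fit never saw."""
    errors = []
    for end in range(min_train, len(years) - horizon + 1):
        slope, intercept = fit_line(years[:end], log_values[:end])
        target = end + horizon - 1
        predicted = slope * years[target] + intercept
        errors.append((years[target], log_values[target] - predicted))
    return errors


# Synthetic series (not real data): 0.5 OOMs/year up to 2019, then 0.2 OOMs/year.
years = list(range(2015, 2024))
log_values = [0.5 * (y - 2015) if y <= 2019 else 2.0 + 0.2 * (y - 2019)
              for y in years]

oos_errors = walk_forward_errors(years, log_values)
for year, error in oos_errors:
    print(year, round(error, 3))
```

On this synthetic series the out-of-sample errors are zero while the trend holds and turn negative once it slows, which is the signal a pure in-sample fit cannot produce.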

3.3 §3 — What the Method Achieves

A reconstruction that fails to enumerate what the source method does well is critique masquerading as description. Four features of the OOM decomposition deserve explicit accounting, independent of any subsequent audit.

Decomposition into named drivers. Aschenbrenner’s principal methodological move is to separate capability progress into named, individually-quantifiable contributions: physical compute, algorithmic efficiency, and unhobbling-plus-hobbling-removal. A reader can disagree with the rate assigned to any one driver without disagreeing with the framework’s structure. Several other surveyed forecasts (Cotra’s anchor mixture; Karnofsky’s qualitative-probability framing) bundle multiple contributions into single weighted aggregates that are harder to interrogate at the level of individual modelling choices (established — verifiable in research_notes/forecasts_database.json). Openness to component-level audit is the reason this chapter can be written at all.

Empirical grounding of the compute driver. The first OOM driver is anchored on the Epoch AI training-compute database, the most rigorous public methodology in the surveyed-set domain (Sevilla et al., 2022; preregistration v1 §Distribution analysis). The 0.5-OOMs-per-year rate is computed from a per-model dataset with explicit version history and bootstrapped confidence intervals on log-linear fits. At the level of the input series, this driver carries the strongest data anchor of any quantitative claim in Situational Awareness. The forward extrapolation is (speculative); the input series is (evidenced).

Reach to a policy-relevant audience. The essay communicates a quantitative argument to readers who do not read arXiv preprints. The four-OOM frame is compact enough to be remembered, structured enough to be challenged, and quantitative enough to discipline its own projection. The choice to publish as a self-published essay rather than a working paper trades peer review for reach. Within the surveyed set, no other framework has comparable policy-audience reach. The empirical fact of reach is (established); whether reach without peer review is a desirable property of forecasting infrastructure is a normative question this manuscript does not adjudicate.

Honesty about confidence at the synthesis level. Part V contains a roughly twenty-word acknowledgement that important parts of the story are likely wrong and that error bars are very large (Aschenbrenner, 2024, Part V; paraphrased). The concession is undirected (it acknowledges uncertainty without specifying direction or which components are most exposed) but it is on the record. The Part I summary uses the phrase “strikingly plausible” rather than “will happen”. Both choices place the essay closer to the academic-extrapolation register than to the marketing-confident register that adjacent technology-forecasting genres often inhabit.

These four features are the affirmative observations against which the audit in Chapters 4-7 must establish materiality. Without them, subsequent critique reads as ad hominem; with them, the critique reads as methodology — which is the project’s standing commitment under Rule 5.

3.4 §4 — What the Method Cannot Tell Us

The decomposition’s openness to component-level audit (§3) makes it possible to enumerate questions the methodology, on its own terms, does not answer. Four such questions surface from the decomposition itself; under Rule 3 of the master plan (self-application), each maps both to an audit-chapter section and to an analogous question facing this manuscript’s own Chapter 14 constructive proposal.

Question 1: hyperparameter-search count. The decomposition selects three drivers and assigns specific per-year rates and OOM windows. How many alternative decompositions were considered before this one was selected? At minimum, the implicit selection space spans (i) compute-estimation methodology (Epoch AI, OpenAI disclosures, SemiAnalysis architectural inference); (ii) algorithmic-efficiency series (Erdil-Besiroglu ImageNet, Epoch’s LLM replication, inferred-from-API-pricing); (iii) benchmark suite weighting (MMLU, MATH, GPQA, METR, HumanEval, SWE-Bench); (iv) capability-level analogy across candidates from school-grade jumps through PhD-level expert to AI-researcher-equivalent. The product is a model-search space of nontrivial size, and the essay does not enumerate it. Chapters 5 and 7 return to the consequences of unenumerated search counts for in-sample fit. (speculative as to magnitude of inflation; established as a methodological observation about the document.)

Self-application observation. The Chapter 14 construct (preregistration v2 §Adaptation 4 — Effective-N estimation; locked 2026-05-19) faces the strict analog. Its formula requires an effective-N for the capability hyperparameter-search space, and the manuscript pre-commits to a composite range estimator (Algorithms A through D with geometric-mean headline and bracket-ratio thresholds 1.5x and 100x) precisely because the search count is not directly observable. The dimension on which Chapter 14 faces an analogous question is identical: how many forecast-construction choices populate the implicit menu, and how is the implied multiple-testing penalty calibrated? The manuscript’s response is to publish the bracket and the geometric-mean headline; the analogous critique applies to the manuscript itself.
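
The arithmetic of a geometric-mean headline with a bracket ratio can be illustrated as follows. The four estimates are invented, the function name composite_range is hypothetical, and the sketch only reports where the bracket ratio falls relative to 1.5x and 100x; what the preregistration does at each threshold is not restated here and is not assumed.

```python
import math


def composite_range(estimates):
    """Geometric-mean headline plus the max/min bracket of several effective-N estimates."""
    values = list(estimates.values())
    headline = math.exp(sum(math.log(v) for v in values) / len(values))
    lowest, highest = min(values), max(values)
    ratio = highest / lowest
    return {
        "headline": headline,
        "bracket": (lowest, highest),
        "bracket_ratio": ratio,
        "ratio_below_1_5x": ratio < 1.5,
        "ratio_above_100x": ratio > 100.0,
    }


# Invented effective-N estimates from four hypothetical algorithms.
summary = composite_range({"A": 10.0, "B": 40.0, "C": 160.0, "D": 640.0})
print(summary)
```

The geometric mean is the natural headline when the candidate estimates differ by multiplicative factors, because it is the arithmetic mean in log-space.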

Question 2: additivity-vs-substitution between drivers. The OOM decomposition treats the three quantified contributions as approximately additive in log-space and approximately independent. Two sub-questions follow. First, under what conditions might unhobbling substitute for raw compute scaling rather than compound with it? If RLHF, chain-of-thought, and scaffolding extract performance that scaling alone would not, but the same gains are partially priced into the measured algorithmic-efficiency series, then additivity double-counts them. Second, the drivers’ funding sources are concentrated at the same frontier labs, and capital is budget-substitutable between compute purchases and algorithmic-research headcount, so the drivers are unlikely to be independent. Under positive correlation, the summed log-space contribution has a wider spread than the independent treatment implies. Neither case is examined in the essay. (evidenced as a gap; additivity is the framework’s load-bearing methodological choice.)
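
The effect of correlation on the summed contribution follows from the variance of a sum. The sketch below assumes a common pairwise correlation and per-driver standard deviations of 0.5 OOMs; both are illustrative values chosen here, not estimates from the essay or from this manuscript.

```python
import math


def sd_of_sum(sds, rho):
    """Standard deviation of a sum of terms sharing one pairwise correlation rho."""
    variance = sum(s * s for s in sds)
    for i in range(len(sds)):
        for j in range(i + 1, len(sds)):
            variance += 2.0 * rho * sds[i] * sds[j]
    return math.sqrt(variance)


driver_sds = [0.5, 0.5, 0.5]  # illustrative uncertainty per driver, in OOMs

independent = sd_of_sum(driver_sds, rho=0.0)
correlated = sd_of_sum(driver_sds, rho=0.5)
print(f"independent: {independent:.3f} OOMs, correlated: {correlated:.3f} OOMs")
```

With these illustrative inputs the spread of the total grows by roughly 40 percent when the drivers move together, without any change to the central estimate.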

Self-application observation. The Chapter 14 construct addresses this on the dimension of the non-IID variance estimator (preregistration v2 §Adaptation 1; LOCKED at TIER_2_ADAPTED_DERIVATION; default (speculative) until MC Study 1 elevation per n_critical(ρ) > 60 criterion). The correction computed there depends on the correlation structure of the underlying capability series: if constituent series are positively correlated (same labs producing similar gains across benchmarks), the effective sample size is smaller than the nominal count and the correction is more severe. The manuscript pre-commits to block-bootstrap on detrended residuals. The analogous question — what is the dependence structure, and does the procedure handle it correctly? — applies to the manuscript itself, with elevation deferred to MC Study 1 outcomes.
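
A generic moving-block bootstrap on detrended residuals shows why dependence matters for the size of the correction. This is a textbook-style sketch on a synthetic autocorrelated series, assumed here for illustration; it is not the locked estimator of the preregistration, and the block length, series, and function names are all hypothetical.

```python
import random
import statistics


def detrend(values):
    """Residuals from a least-squares line fitted against the time index."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sum((x - mean_x) * (v - mean_y) for x, v in enumerate(values)) / sxx
    return [v - (mean_y + slope * (x - mean_x)) for x, v in enumerate(values)]


def moving_block_bootstrap_se(residuals, block_len, n_boot=2000, seed=0):
    """Bootstrap standard error of the mean, resampling contiguous blocks."""
    rng = random.Random(seed)
    n = len(residuals)
    starts = range(n - block_len + 1)
    means = []
    for _ in range(n_boot):
        sample = []
        while len(sample) < n:
            start = rng.choice(starts)
            sample.extend(residuals[start:start + block_len])
        means.append(statistics.fmean(sample[:n]))
    return statistics.stdev(means)


# Synthetic series (not real data): linear trend plus AR(1) noise with coefficient 0.8.
rng = random.Random(1)
series, shock = [], 0.0
for t in range(200):
    shock = 0.8 * shock + rng.gauss(0.0, 1.0)
    series.append(0.05 * t + shock)

residuals = detrend(series)
se_iid = moving_block_bootstrap_se(residuals, block_len=1)
se_block = moving_block_bootstrap_se(residuals, block_len=10)
print(f"SE ignoring dependence: {se_iid:.3f}; SE with blocks of 10: {se_block:.3f}")
```

Resampling single points (block length 1) treats the residuals as independent and understates the standard error; resampling blocks preserves short-range dependence and widens it.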

Question 3: trend-continuation horizon. The empirical window is 2019-2023; the forecast window is 2023-2027. The in-sample-to-extrapolation ratio is approximately 1.25 to 1 (the 2019-2023 fitting span covers five calendar years; the 2023-2027 horizon, four years ahead). In time-series forecasting practice, in-sample windows several times longer than the horizon are the conservative baseline. A 1.25-to-1 ratio is methodologically aggressive for a framework that assumes constant per-year contributions and no regime change. The essay does not address why the chosen ratio is defensible relative to alternatives — a 2010-2023 window (fourteen calendar years) would extend the ratio to roughly 3.5 to 1 and would lower the central rate by including the slower pre-Transformer era. (evidenced as a gap.)
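
The two ratios quoted above depend on a counting convention: the fitting span is counted in inclusive calendar years and the horizon in years ahead. A small sketch makes the convention explicit; the function names are hypothetical.

```python
def calendar_years(start, end):
    """Number of calendar years in an inclusive span, e.g. 2019-2023 is five."""
    return end - start + 1


def window_ratio(fit_start, fit_end, horizon_years):
    """In-sample span (inclusive calendar years) divided by years forecast ahead."""
    return calendar_years(fit_start, fit_end) / horizon_years


print(window_ratio(2019, 2023, horizon_years=4))  # 1.25
print(window_ratio(2010, 2023, horizon_years=4))  # 3.5
```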

Self-application observation. The Chapter 14 construct faces the strict analog on the dimension of the capability-forecasting performance statistic and the window over which it is computed (preregistration v2 §Adaptation 3; LOCKED, TIER_2 for Brier / log-loss / calibration / interval-coverage; TIER_3 for the capability-Sharpe candidate). For parametric extrapolation forecasts the manuscript pre-commits to a capability-Sharpe primary statistic with a multi-statistic robustness panel. The analogous critique — over what window is the statistic computed, and how is the in-sample / out-of-sample partition chosen — applies in the same form. The manuscript’s response is the robustness panel; the OOM framework’s response is silence.

Question 4: regime-change blindness. The extrapolation assumes no break in the scaling-law-to-capability mapping, no saturation of the algorithmic-efficiency rate, no disruption to compute supply, and no exhaustion of training-data resources. Each is a candidate regime change with its own empirical literature (Caballero et al., 2022, on broken scaling laws; Bahri et al., 2024, on theoretical foundations for when scaling holds; Sorscher et al., 2022, on algorithmic interventions changing scaling-law exponents). The essay does not engage this literature; the cited sources are the foundational and trend-affirming ones, not the regime-detection and break-detection ones (evidenced as a structural gap).

Self-application observation. The Chapter 14 construct faces this on the dimension of the information-set partition (preregistration v2 §Adaptation 2; TIER_3_NEW_DERIVATION; default (speculative) until MC Study 2 elevation per Type-I_C > 2·PBO OR Power_C < 0.5·PBO criterion). The partition assumes a structure over historical forecast performance that respects sequential-causal information ordering. If the underlying capability series exhibits regime changes — the same concern raised here about Aschenbrenner — the partition’s properties depend on whether the regime change is detected and accommodated. The MC Study 2 criterion was pre-committed precisely to expose this dependency to empirical falsification. The manuscript’s response is the MC Study 2 commitment; the OOM framework’s response is the absence of any regime-detection apparatus.

3.5 §5 — Forward Reference

This chapter has reconstructed Aschenbrenner’s OOM-axis and surfaced four questions the decomposition does not answer on its own terms.

Chapter 2 identifies the structural pattern unifying the surveyed forecast set. Aschenbrenner’s OOM-axis is one of seven preregistered decomposition axes (preregistration v1 §7-axis decomposition taxonomy); the other six are Cotra’s anchor-axis, Davidson’s takeoff-duration axis, METR’s time-cost-of-task axis, Karnofsky’s probability-mass / civilizational-significance axis, Grace’s expert-survey-aggregation axis, and Epoch AI’s compute-scaling-empirical axis. The seven share one characteristic: none deploys an out-of-sample validation protocol at publication.

Chapter 4 develops a walk-forward retrodiction protocol. Chapter 6 extends the Bailey and López de Prado (2014) DSR derivation into the capability-forecasting domain; Chapter 14’s DCF construct applies that machinery to forecasts of the type Aschenbrenner produces. Chapter 1.2 turns next to Cotra’s biological-anchors framework.

3.6 References cited in this chapter

Aschenbrenner, L. (2024, June). Situational awareness: The decade ahead. Self-published essay. https://situational-awareness.ai/ (non-peer-reviewed)

Bahri, Y., Dyer, E., Kaplan, J., Lee, J., & Sharma, U. (2024). Explaining neural scaling laws. Proceedings of the National Academy of Sciences.

Bailey, D. H., & López de Prado, M. (2014). The deflated Sharpe ratio: Correcting for selection bias, backtest overfitting, and non-normality. Journal of Portfolio Management, 40(5), 94-107.

Caballero, E., Gupta, K., Rish, I., & Krueger, D. (2022). Broken neural scaling laws. arXiv preprint arXiv:2210.14891.

Erdil, E., & Besiroglu, T. (2022). Algorithmic progress in image classification. arXiv preprint arXiv:2212.05153.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2021a). Measuring massive multitask language understanding [MMLU]. International Conference on Learning Representations.

Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., & Steinhardt, J. (2021b). Measuring mathematical problem solving with the MATH dataset. NeurIPS Datasets and Benchmarks.

Hoffmann, J., Borgeaud, S., Mensch, A., et al. (2022). Training compute-optimal large language models [Chinchilla]. NeurIPS.

Kaplan, J., McCandlish, S., Henighan, T., et al. (2020). Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.

Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training language models to follow instructions with human feedback [InstructGPT]. NeurIPS.

Rein, D., Hou, B. L., Stickland, A. C., et al. (2023). GPQA: A graduate-level Google-proof Q&A benchmark. arXiv preprint arXiv:2311.12022.

Sevilla, J., Heim, L., Ho, A., Besiroglu, T., Hobbhahn, M., & Villalobos, P. (2022). Compute trends across three eras of machine learning. International Joint Conference on Neural Networks (IJCNN).

Sorscher, B., Geirhos, R., Shekhar, S., Ganguli, S., & Morcos, A. S. (2022). Beyond neural scaling laws: Beating power law scaling via data pruning. NeurIPS.

Wei, J., Wang, X., Schuurmans, D., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. NeurIPS.

4 Chapter 1.2 — Mapping the Major Forecasts: Cotra’s Biological Anchors Method (with 2022 Self-Revision)

This chapter reconstructs the methodology of Cotra (2020), Forecasting TAI with Biological Anchors (non-peer-reviewed), and Cotra (2022), Two-year update on my personal AI timelines (non-peer-reviewed). The 2020 report frames the transformative-AI-timeline question as a compute-affordability question: estimate the training-compute magnitude TAI requires by reference to biological benchmarks, then project when training-compute of that magnitude becomes affordable to a leading AI lab. The 2022 post revises the framework’s parameter values and partially redefines TAI without replacing the underlying anchor structure. Both publications are the object of this chapter.

The reconstruction is mapping, not adjudication. Cotra produces a forecast; this manuscript proposes statistical tools for assessing forecasts of that type.

4.1 §1 — What Cotra Anchors

Cotra organizes the TAI-arrival question along a single axis, which the preregistered taxonomy of this manuscript names the anchor-axis (biological-benchmark mixture across short-horizon, medium-horizon, long-horizon, genome, lifetime, and evolution anchors with associated weights) (preregistration v1, §7-axis decomposition taxonomy, 2026-05-18). The axis decomposes the TAI compute question into six parallel theoretical answers, each grounded in a biological reference point and each carrying a per-anchor compute estimate plus a subjective mixture weight (Cotra, 2020, Sections 1-4).

Anchor 1: lifetime (10^24 FLOPs). Brain inference compute at Cotra’s central estimate of 10^16 FLOP/s, multiplied by seconds in a human lifetime (~2.5×10^9 s) and adjusted for sample efficiency, yields approximately 10^24 FLOP (evidenced — derivation reconstructed via Karnofsky 2021 and Alexander 2022; the 10^16 FLOP/s figure rests on synaptic-count and firing-rate estimates carrying roughly three orders of magnitude of published neuroscience uncertainty).

Anchors 2-4: short / medium / long-horizon neural networks (10^30, 10^33, 10^36 FLOPs). Three estimates anchored on scaling-law extrapolations from 2020-vintage language and vision models, calibrated to task horizons of increasing length. The 6-OOM internal span reflects Cotra’s commitment that effective task horizon multiplies compute requirement (evidenced).

Anchor 5: genome (10^33 FLOPs). Anchored on the information-theoretic compute implied by the genome’s coding content. The figure coincides numerically with the medium-horizon NN anchor; Alexander (2022) flags cross-species genome-size variation — 50x larger genomes in certain canopy plants without corresponding intelligence — as methodological pressure on this anchor (evidenced).

Anchor 6: evolution (10^41 FLOPs). Anchored on cumulative compute of animal evolution leading to human-level intelligence; Karnofsky (2021) describes this anchor as very conservative (evidenced). Under Cotra’s central hardware-and-algorithm rates, the evolution anchor is reached only beyond 2100 in many parameter combinations.

Taken together, the six anchors span 17 orders of magnitude of candidate TAI compute requirement (10^24 to 10^41 FLOPs). Cotra mixes the per-anchor distributions under subjective weights and reports a final distribution: median ~2052, 10% by 2031, ~80% by 2100 (evidenced via Alexander 2022 and Epoch AI literature review, mutually consistent at headline). The 2022 update inherits the same six anchors, revises the mixture weights, raises the hardware FLOP/$ baseline by approximately 10x, and partially redefines TAI from automating all scientific tasks to automating AI R&D. The headline median compresses to approximately 2040; the updated distribution reports 15% by 2030, 60% by 2050, 97% by 2100 (Cotra, 2022; Epoch AI literature review).

Cotra’s per-anchor compute estimates inherit from upstream sources: the lifetime and evolution anchors depend on Sandberg and Bostrom (2008) and the Moravec (1988) brain-compute estimate range; the neural-network anchors depend on Kaplan et al. (2020) scaling-law exponents and Hernandez and Brown (2020) algorithmic-efficiency measurements. Revisions in those upstream estimates propagate directly to anchor-compute requirements. This upstream dependency is the Cotra-domain instance of Pattern G (preregistration v1, §Pattern framework) and is a structural feature of the surveyed forecast set, not a defect: any biological-anchors forecast inherits the brain-compute literature’s residual uncertainty.

4.2 §2 — The Methodology Decomposition

The biological-anchors framework can be stated as a function. Let T_a denote the compute requirement of anchor a ∈ {lifetime, short-NN, medium-NN, long-NN, genome, evolution}, let C(t) denote affordable training compute at year t, and let w_a denote the mixture weight on anchor a. Cotra’s procedure constructs the probability that TAI is feasible by year t as

\[ P(\text{TAI feasible by } t) = \sum_{a} w_a \, \Pr\left[\, C(t) \ge T_a \, \varepsilon_a \,\right] \]

where ε_a denotes the per-anchor distributional dispersion (approximately log-normal with 1-2 OOM width per anchor, per secondary coverage) and C(t) is built multiplicatively from hardware FLOP/$, algorithmic efficiency, and willingness-to-spend trends. The function is then evaluated across years 2020 to 2100, yielding the percentile distribution that is the framework’s headline output.
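
The mixture can be sketched numerically as a weighted sum of per-anchor exceedance probabilities. The anchor magnitudes are those listed in §1; the individual weights are hypothetical values chosen to sit inside the headline ranges reported via Alexander (2022), and the single 1.5-OOM log-normal width with median-one dispersion is an assumption made for illustration, not Cotra’s published parameterization.

```python
import math

# log10 of per-anchor compute requirements (FLOPs), as listed in the text.
ANCHOR_LOG10_FLOP = {
    "lifetime": 24, "short_nn": 30, "medium_nn": 33,
    "long_nn": 36, "genome": 33, "evolution": 41,
}

# Hypothetical weights; only the grouped headline ranges are on the record.
WEIGHTS = {
    "lifetime": 0.05, "short_nn": 0.25, "medium_nn": 0.25,
    "long_nn": 0.15, "genome": 0.20, "evolution": 0.10,
}


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def p_feasible(log10_compute, sigma_oom=1.5):
    """Weighted probability that affordable compute exceeds each anchor's requirement.

    With log-normal dispersion of median one, Pr[C >= T_a * eps_a] equals
    Phi((log10 C - log10 T_a) / sigma).
    """
    return sum(
        weight * normal_cdf((log10_compute - ANCHOR_LOG10_FLOP[name]) / sigma_oom)
        for name, weight in WEIGHTS.items()
    )


for log10_c in (24, 30, 33, 36, 41):
    print(f"log10 C = {log10_c}: P(feasible) = {p_feasible(log10_c):.2f}")
```

Re-running the sketch with different weights or widths shows how directly the headline percentiles depend on choices that the mixture treats as inputs.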

Inputs. The empirical base is layered across four families. Biological: brain inference compute at 10^16 FLOP/s (Cotra’s central estimate; Moravec’s 10^13 FLOP/s treated as a sensitivity dimension), genome size at approximately 10^9 base pairs, evolutionary compute estimates aggregated across animal lineages. Scaling-law: Kaplan et al. (2020) exponents for the neural-net anchors, with Hoffmann et al. (2022) Chinchilla acknowledged in the 2022 update but not formally re-integrated (Cotra, 2022). Cost-trend: a hardware FLOP/$ trend at 0.2-0.3 OOMs/year; an algorithmic-efficiency trend at 0.4-0.5 OOMs/year, building on the Hernandez and Brown (2020) measurements. Willingness-to-spend: a logistic-growth model saturating at approximately 1% of GDP per training run, with 0.1%-10% as the documented sensitivity range.
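
The multiplicative construction of C(t) becomes a sum in log-space. In the sketch below the two rates are midpoints of the ranges quoted above and the 1% cap is the saturation level quoted above; the 2020 baselines (FLOP per dollar, initial spending share, GDP), the logistic growth constant, and the function names are hypothetical placeholders, not values from the report.

```python
import math


def spend_share(year, base_year=2020, base_share=1e-6, cap_share=0.01, growth=0.3):
    """Logistic share of GDP spent on one training run, saturating at cap_share."""
    odds = (cap_share - base_share) / base_share
    return cap_share / (1.0 + odds * math.exp(-growth * (year - base_year)))


def log10_affordable_compute(year, base_year=2020, base_log10_flop_per_dollar=17.0,
                             hardware_rate=0.25, algo_rate=0.45, gdp_dollars=1e14):
    """log10 of effective affordable training compute: hardware + algorithms + spend.

    GDP is held constant for simplicity, which is an assumption of this sketch.
    """
    elapsed = year - base_year
    log10_hardware = base_log10_flop_per_dollar + hardware_rate * elapsed
    log10_algorithms = algo_rate * elapsed
    log10_spend = math.log10(spend_share(year, base_year) * gdp_dollars)
    return log10_hardware + log10_algorithms + log10_spend


for year in (2020, 2040, 2060, 2100):
    print(year, round(log10_affordable_compute(year), 1))
```

Once spending saturates, growth in C(t) is carried entirely by the hardware and algorithmic rates, which is why the long-horizon percentiles are so sensitive to those two numbers.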

Extrapolation and assumptions. Five modelling commitments carry the forecast: (i) per-anchor compute estimates are stable across the projection horizon, with sensitivity analysis on brain inference compute as the principal robustness dimension; (ii) the multiplicative composition of C(t) — hardware × algorithmic efficiency × spending capacity — is approximately independent across factors; (iii) mixture weights are stable in time, with no formalized update rule; (iv) willingness-to-spend saturates within the projection horizon; (v) no paradigm shift between vintage and TAI year invalidates the scaling-law-based extrapolation (Alexander 2022 flags this as the Victorian-ship-size-to-spaceships assumption). The 2022 update revises (i) (hardware baseline up 10x; rates unchanged), (iii) (weights shift directionally toward shorter-horizon NN anchors, without published per-anchor magnitudes), and the TAI definition; (ii), (iv), (v) are preserved.

In-sample / out-of-sample boundary. This is the load-bearing observation of the chapter. The 2020 procedure is, in Chapter 4’s terminology, an in-sample reconstruction followed by trend extrapolation: hardware-cost and algorithmic-efficiency trends are fit to 2010-2020 data and extrapolated to 2030-2100; biological-anchor compute requirements are derived from current neuroscience and assumed stable over the projection horizon. No formal out-of-sample protocol is used. No walk-forward analysis is presented. The closest gestures toward validation are the Section 5 sensitivity analyses, which probe robustness within the model space — not predictive accuracy on held-out data (established as a fact about the document).

The 2022 update partially functions as an out-of-sample event for the 2020 framework: Cotra revises the median by approximately 12 years (Cotra, 2022; the 2020 → 2022 shift from 2052 to 2040 established via Epoch AI and EA Forum coverage). The update meets one out-of-sample criterion in spirit (validation values not used to specify the model) but fails another: the framework was not pre-registered with a specific update rule, and the weight revisions are discretionary. Chapters 4 and 6 return to this asymmetry.

4.3 §3 — What the Method Achieves

Decomposition into named anchors. Cotra’s principal methodological move is to separate the TAI compute question into six explicit anchors, each with a stated theoretical basis and a numerical compute estimate. A reader can disagree with the weight assigned to any one anchor without disagreeing with the framework’s structure. The decomposition is denser than Aschenbrenner’s three-driver OOM partition, and the per-anchor compute estimates are reported as point estimates plus distributional widths rather than point estimates alone. Openness to component-level audit is more developed here than in any other forecast in the surveyed set (established — verifiable in research_notes/forecasts_database.json).

Explicit weight disclosure. Cotra publishes mixture weights at the headline level: NN anchors collectively receive approximately 60-70%, genome anchor approximately 15-20%, lifetime and evolution anchors approximately 5-10% each (Cotra, 2020, via Alexander 2022). The headline allocation is on the record. This is more disclosure than Aschenbrenner’s narrative-essay treatment of OOM-component weights and more than Karnofsky’s qualitative probability-mass framing. Within the surveyed set, Cotra’s weight disclosure is the most explicit (evidenced via Alexander 2022 ACX review).

Public methodology revision. In August 2022, Cotra published a personal-capacity post compressing her TAI median from approximately 2052 to approximately 2040 over a two-year window, citing (a) a TAI redefinition from automating all scientific tasks to automating AI R&D; (b) acknowledgment of Hoffmann et al. (2022) Chinchilla scaling, directionally lengthening timelines; (c) a one-time hardware-baseline upward adjustment of approximately 10x; (d) directional reweighting toward shorter-horizon NN anchors, without published per-anchor magnitudes (Cotra, 2022). The update is the single most data-rich event in the surveyed forecaster set: a published forecaster revising a previously-defended timeline in public, with named drivers. No other framework in this manuscript’s database carries a comparable author-led methodology revision on the record. This is a substantive methodological feature, not a chronological footnote: it places one entry in the surveyed set under conditions approaching a self-administered audit (established — verifiable across research_notes/forecasts_database.json entries for Aschenbrenner, Davidson, Karnofsky, Grace, Epoch AI).

Decomposed uncertainty across anchors. The six-anchor structure is itself an explicit acknowledgement of epistemic uncertainty about the underlying TAI compute question. Where Aschenbrenner’s OOM decomposition uses point ranges per driver, Cotra’s framework reports a full probability distribution over candidate compute requirements via the mixture. The Section 5 sensitivity analysis additionally varies brain inference compute, anchor weights, hardware FLOP/$ rate, and algorithmic-efficiency rate, with reported median shifts of 5-15 years on brain-compute swap and 20+ years on weight perturbation. The framework surfaces the size of its own residual uncertainty (evidenced).

4.4 §4 — What the Method Cannot Tell Us

Four questions surface from the decomposition itself; under Rule 3 (self-application), each maps to an audit-chapter section and to an analogous question facing this manuscript’s own Chapter 14 construct.

Question 1: anchor-weight search space. The 2020 framework publishes one weight allocation; the 2022 update revises directionally without publishing per-anchor magnitudes (Cotra, 2022). Even taking the 2020 disclosure as a methodological strength (§3), the disclosed allocation reflects one trajectory through an implicit weight-search space. Across the six-dimensional simplex, plausible alternative allocations span outside-view uniform weighting, inside-view scaling-DL concentration (the 2022 direction), pessimistic evolution-concentration, and Moravec-aligned lifetime-concentration. Section 5 sensitivity analysis explores part of this space but does not enumerate the implicit search count (speculative as to magnitude of search-induced inflation; established as a methodological observation about the document).

Self-application observation. Chapter 14’s effective-N estimator (preregistration v2 §Adaptation 4; LOCKED, TIER_3_NEW_DERIVATION for Algorithms A/B/C and composite, default (speculative)) faces the strict analog: the implicit hyperparameter-search count is not directly observable. The manuscript pre-commits to a composite range with geometric-mean headline and bracket-ratio thresholds 1.5x and 100x; the analogous critique applies to the manuscript on the identical dimension.

Question 2: biological-analogue validity. Each anchor presupposes that a specific biological reference point — a human lifetime, the neural-net analog of biological learning, the genome’s coding content, cumulative evolution — bounds the TAI compute requirement. Under what conditions do these analogues transfer, and what evidence would falsify the transfer? The 2020 report grounds each anchor’s biological-to-artificial mapping in theoretical motivation but does not specify a capability observation that would refute the mapping. Alexander’s (2022) canopy-plant critique of the genome anchor does not appear in the framework’s sensitivity analysis (evidenced as a gap; the biological-to-artificial mapping is the framework’s load-bearing methodological choice).

Self-application observation. Chapter 14’s information-set partition (preregistration v2 §Adaptation 2; TIER_3_NEW_DERIVATION; default (speculative) until MC Study 2 elevation per Type-I_C > 2·PBO OR Power_C < 0.5·PBO) presupposes a sequential-causal ordering over historical forecast performance — itself an analogue-validity commitment. The MC Study 2 criterion was pre-committed to expose this dependency to falsification; the analogous critique applies to the manuscript itself.

Question 3: 2022 self-revision sufficiency. The 2022 update is the framework’s most informative event (§3) and a discretionary intervention. The TAI redefinition mechanically compresses the implied compute target; the 10x hardware-baseline adjustment is a one-time correction without rate revision; the directional weight reallocation toward shorter-horizon NN is not quantified per-anchor (Cotra, 2022). Does the revision genuinely address the 2020 framework’s calibration error, or does it shift the questions without resolving them? The framework provides no internal mechanism for distinguishing between these readings (evidenced as a gap).

Self-application observation. Chapter 14’s non-IID variance estimator (preregistration v2 §Adaptation 1; TIER_2_ADAPTED_DERIVATION; default (speculative) until MC Study 1 elevation per n_critical(ρ) > 60) faces the analog: a forecast revision must be distinguishable from a structural shift under the framework’s own dependence structure. Block-bootstrap on detrended residuals is the manuscript’s pre-committed procedure; the analogous critique applies to the manuscript itself.

Question 4: anchor-mixture aggregation. The mixture is a weighted sum of single-factor distributions, not a joint estimate. An alternative aggregation — treating TAI compute as a function of multiple factors simultaneously — would yield a different final distribution. The mixture’s headline statistics are sensitive to both anchor weights and per-anchor widths; the framework does not formalize the joint distribution. Whether the final percentiles reflect a real underlying epistemic distribution or impose a convenient meta-uncertainty is left open (evidenced as a gap).

Self-application observation. Chapter 14’s capability-forecasting performance statistic (preregistration v2 §Adaptation 3; LOCKED, TIER_2 for Brier/log-loss/calibration/interval-coverage, TIER_3 for capability-Sharpe) faces the analog: no single aggregation rule is privileged. The manuscript pre-commits to a capability-Sharpe primary plus multi-statistic robustness panel; the analogous critique — under what aggregation rule is the headline computed — applies to the manuscript itself.

4.5 §5 — Forward Reference

Chapter 2 identifies the structural pattern unifying the surveyed set: the anchor-axis is one of seven preregistered decomposition axes, none deploying an out-of-sample validation protocol at publication. Chapter 4 develops a walk-forward retrodiction protocol. Chapter 6 extends the Bailey and López de Prado (2014) DSR derivation into capability forecasting (addressing Questions 1 and 3 via Adaptations 4 and 1). Chapter 14’s constructive DCF apparatus addresses Questions 2 and 4 via Adaptations 2 and 3.

Chapter 1.3 turns next to Davidson’s compute-centric takeoff-speeds framework, which decomposes the same forecasting question along a different axis.

4.6 References cited in this chapter

Alexander, S. (2022, March). Biological anchors: a trick that might or might not work. Astral Codex Ten (non-peer-reviewed).

Bailey, D. H., & López de Prado, M. (2014). The deflated Sharpe ratio: Correcting for selection bias, backtest overfitting, and non-normality. Journal of Portfolio Management, 40(5), 94-107.

Cotra, A. (2020). Forecasting TAI with biological anchors. Open Philanthropy (non-peer-reviewed).

Cotra, A. (2022, August). Two-year update on my personal AI timelines. AI Alignment Forum / LessWrong (non-peer-reviewed).

Hernandez, D., & Brown, T. B. (2020). Measuring the algorithmic efficiency of neural networks. arXiv preprint arXiv:2005.04305.

Hoffmann, J., Borgeaud, S., Mensch, A., et al. (2022). Training compute-optimal large language models [Chinchilla]. NeurIPS.

Kaplan, J., McCandlish, S., Henighan, T., et al. (2020). Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.

Karnofsky, H. (2021, August). Forecasting transformative AI: the biological anchors method in a nutshell. Cold Takes blog (non-peer-reviewed).

Moravec, H. (1988). Mind children: The future of robot and human intelligence. Harvard University Press.

Sandberg, A., & Bostrom, N. (2008). Whole brain emulation: A roadmap (Technical Report 2008-3). Future of Humanity Institute, Oxford University.

5 Chapter 1.3 — Mapping the Major Forecasts: Davidson’s Compute-Centric Takeoff Model

This chapter reconstructs the methodology of Davidson (2023), What a compute-centric framework says about AI takeoff speeds (non-peer-reviewed). The report frames the AI-takeoff-speeds question as a coupled-trajectory simulation problem: combine compute-affordability extrapolations, algorithmic-progress rates, and recursive AI-R&D-feedback dynamics into a single time-varying model; report the implied probability distribution over the duration from 20%-automation to 100%-automation of cognitive tasks. The companion artifact takeoffspeeds.com, co-built with Epoch AI, exposes approximately seventy named parameters.

5.1 §1 — What Davidson Models

Davidson organizes the takeoff-speeds question along a single axis, which the preregistered taxonomy of this manuscript names the takeoff-duration axis (probability mass distributed across the duration from 20%-automation to 100%-automation of cognitive tasks, with multi-scenario decomposition over slow, fast, and discontinuous regimes) (preregistration v1, §7-axis decomposition taxonomy, 2026-05-18). The axis is distinct from Aschenbrenner’s OOM-axis and from Cotra’s anchor-axis: Davidson does not forecast a specific arrival year and does not partition the compute requirement across biological anchors. He forecasts a duration distribution, conditional on AI capability reaching the 20%-automation threshold, over how long the transition to 100%-automation takes.

The framework is structurally three-factor. Capability progress is decomposed multiplicatively into (1) the training-compute requirement for AGI under current algorithms, treated as a probability distribution rather than a point (~1e36 FLOP central under 2020-vintage scaling); (2) the algorithmic-progress rate, drawn from Kaplan et al. (2020), Hoffmann et al. (2022), Hernandez and Brown (2020), and Erdil and Besiroglu (2022); (3) the hardware-scaling trajectory, drawn from the Epoch AI compute database (Sevilla et al., 2022). The product gives effective FLOP available for training as a function of time; takeoff begins when effective FLOP crosses the lower threshold of the requirement distribution and concludes when the trajectory crosses the upper threshold scaled by the effective-FLOP-gap parameter (Davidson, 2023, §1).

Headline outputs. The framework reports a three-point probability distribution over takeoff duration: approximately 25% probability under one year, 50% probability under three years, 80% probability under ten years (Davidson, 2023, §2) (evidenced — reproducible via takeoffspeeds.com under central parameters). A separate claim is that the AGI-to-superintelligence interval is under one year absent deliberate slowdown coordination (speculative — counterfactual condition; model-internal output without out-of-sample anchor). The wake-up threshold at approximately 6% economy-wide automation and the central 10,000x effective-FLOP gap (1-to-8-OOM range) are the framework’s two most quantitatively-specific endogenous-dynamics parameters (evidenced as model parameters; specific values stipulated).

Davidson’s per-component inputs inherit from upstream methodologies: the AGI training-requirement distribution from Cotra’s neural-network anchor cluster (10^30 to 10^36 FLOP, with the 1e36 central estimate at the upper end); the algorithmic-progress and hardware-scaling rates from the same scaling-laws and Epoch AI literature Aschenbrenner’s OOM-axis draws on; the semi-endogenous-growth machinery from Romer (1990), Aghion, Jones, and Jones (2017), and Trammell and Korinek (2023). This upstream dependency is the Davidson-domain instance of Pattern G (preregistration v1, §Pattern framework) and is a structural feature of the surveyed forecast set, not a defect: any compute-centric takeoff-speeds model inherits the scaling-laws literature’s residual uncertainty, the biological-anchor methodology’s residual uncertainty, and the growth-theory literature’s domain-transfer uncertainty.

5.2 §2 — The Methodology Decomposition

The framework can be stated as a coupled differential-equations system. Let E(t) denote effective compute at time t, A(t) the automation level (fraction of cognitive tasks at which leading systems are competitive), R(t) the AI-R&D contribution multiplier, and S(t) the investor wake-up multiplier. Davidson’s procedure constructs the system as

where r_hw, r_alg, r_spend are the per-factor base growth rates, R(A) increases with automation (AI-R&D acceleration), S(A) steps upward at the 6% wake-up threshold, G is the effective-FLOP gap, and ρ is the CES labor-substitution parameter. The system is integrated forward from a starting condition matched to 2023 actuals. The Monte Carlo over parameter distributions produces the takeoff-duration distribution that is the framework’s headline output (Davidson, 2023, §§1-4).
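The integrate-then-sample procedure can be sketched as a toy simulation. This is not Davidson's model: the linear map from orders of magnitude to automation, the linear form of R(A), the parameter ranges, and the function names are all assumptions made for illustration, and the CES parameter ρ is omitted entirely. Only the 6% wake-up step and the 1-to-8-OOM gap range come from the text.

```python
import random

def takeoff_years(r_hw, r_alg, r_spend, gap_ooms, rd_boost, wake_mult,
                  dt=0.01, max_years=200.0):
    """Years for effective compute to climb gap_ooms orders of magnitude.

    x is log10 effective compute above the 20%-automation threshold.
    Rates are in OOM per year. Every functional form is an assumption.
    """
    x, t = 0.0, 0.0
    while x < gap_ooms and t < max_years:
        a = 0.2 + 0.8 * x / gap_ooms                # assumed linear map to automation A
        r_mult = 1.0 + rd_boost * a                 # assumed R(A): rises with automation
        s_mult = wake_mult if a >= 0.06 else 1.0    # S(A): step at the 6% wake-up level
        x += (r_mult * (r_hw + r_alg) + s_mult * r_spend) * dt  # forward Euler step
        t += dt
    return t

def duration_cdf_points(n_draws=500, seed=0):
    """Monte Carlo over placeholder parameter ranges; returns P(duration < y)."""
    rng = random.Random(seed)
    draws = [
        takeoff_years(
            r_hw=rng.uniform(0.1, 0.3),       # placeholder range
            r_alg=rng.uniform(0.2, 0.6),      # placeholder range
            r_spend=rng.uniform(0.1, 0.4),    # placeholder range
            gap_ooms=rng.uniform(1.0, 8.0),   # 1-to-8-OOM range quoted in the text
            rd_boost=rng.uniform(0.0, 5.0),   # placeholder range
            wake_mult=rng.uniform(1.0, 3.0),  # placeholder range
            dt=0.02,
        )
        for _ in range(n_draws)
    ]
    return {y: sum(d < y for d in draws) / n_draws for y in (1, 3, 10)}

print(duration_cdf_points())
```

Because the simulated window starts at 20% automation, the wake-up step is already active throughout; it is kept only to show where S(A) would enter. The output has the same shape as the headline three-point distribution, but its values reflect the placeholder ranges and nothing else.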

Inputs. The empirical base divides into three families. Compute and requirements: AGI training-requirement distribution centered near 1e36 FLOP under 2020 algorithms (inherited from Cotra, 2020); runtime requirement ~1.6667e16 FLOP per inference for a 1e36-class model. Progress rates: hardware FLOP-per-dollar and training-compute scaling from the Epoch AI compute database (Sevilla et al., 2022); algorithmic-efficiency rates from Hernandez and Brown (2020), Erdil and Besiroglu (2022), and the Chinchilla compute-optimal correction (Hoffmann et al., 2022). Feedback and economic structure: AI-R&D acceleration (20% economy-wide to 40-50% R&D-specific); the 6% wake-up threshold; CES parameter ρ in the range of approximately 0.5 to 0.7; Cotra-style willingness-to-spend saturation.

Extrapolation and assumptions. Six modelling commitments carry the forecast: (i) current algorithms suffice for AGI given sufficient compute — no paradigm shift between 2020 and AGI year; (ii) the three-factor decomposition (compute × algorithms × hardware) composes multiplicatively with approximately independent contributions in log-space; (iii) the Jones-Romer semi-endogenous-growth elasticity applies to AI R&D, with elasticity exposed as a free simulator parameter; (iv) the 6% wake-up threshold is structurally correct, with the post-wake-up multiplier exposed as a tunable; (v) compute-to-capability scaling is monotonic across the 20%-to-100%-automation gap — neither Caballero et al. (2022) broken-scaling effects nor Sorscher et al. (2022) exponent-shifting interventions are integrated; (vi) the simulator’s approximately seventy named parameters are exhaustive of the modeling choices Davidson considers material.

In-sample / out-of-sample boundary. This is the load-bearing observation of the chapter. The framework is, in Chapter 4’s terminology, a coupled in-sample simulation: parameter values are fit to 2010-2023 history and the dynamics are integrated forward. No formal out-of-sample protocol is used. No walk-forward analysis — freezing the framework at an earlier vintage and asking whether it retrodicts later-observed dynamics — is presented. No pre-specified falsification criterion is offered for the takeoff-duration distribution; the 6%-wake-up threshold is the framework’s most directly-falsifiable single claim and has no historical analog with which to validate it (established as a fact about the document).

Two features come closest to validation. The interactive simulator at takeoffspeeds.com exposes parameter uncertainty rather than hiding it, which is robustness exposure, not predictive validation. The Davidson 80,000 Hours interview reports a discretionary probability update: AI-takeover-by-2070 probability rose from approximately 10% to approximately 20% in roughly one year (stated by Davidson). The update parallels Cotra’s 2020-to-2022 framework revision in structure but is reported informally on a podcast. Chapter 6 returns to forecaster-self-update events as a small validation series across the surveyed set.

5.3 §3 — What the Method Achieves

Formal coupled-system structure with multi-scenario probabilistic output. Davidson’s principal methodological move is to express takeoff dynamics as a coupled-trajectory model whose output is a probability distribution over duration, not a point estimate. The three-point distribution (25% / 50% / 80% under 1 / 3 / 10 years) lets a reader read conditional probabilities directly without inferring tails from a median. Among the surveyed forecast set this is the only framework producing an explicit probability distribution over takeoff duration; Aschenbrenner’s OOM-counting produces a point AGI-arrival year (~2027), Cotra’s biological-anchors a distribution over arrival year but not over post-arrival dynamics, Karnofsky’s probability-mass framing remains qualitative on takeoff (established — verifiable in research_notes/forecasts_database.json entries for the six other surveyed forecasts).

Explicit parameter-sensitivity disclosure. Davidson exposes approximately seventy named parameters in the takeoffspeeds.com simulator across two tiers, and the report identifies the four most-sensitive parameters as the AGI training-requirement distribution, the effective-FLOP gap, the wake-up multiplier, and the CES labor-substitution parameter ρ (Davidson, 2023, §§2-5). Sensitivity is reported primarily one-at-a-time; joint perturbations are simulator-permitted but not exhaustively published. Compared with Aschenbrenner’s narrative-essay treatment of OOM-component rates and Cotra’s six-anchor analysis confined to brain-compute and mixture weights, Davidson’s parameter-set transparency is structurally larger and the per-parameter sensitivity ranking is on the record (evidenced). Transparency at this scale also creates analytical exposure under multiple-testing logic — a point Chapter 7 returns to — but the transparency is the prerequisite for that exposure.

Integration of upstream methodologies rather than isolated derivation. Davidson explicitly imports the compute-requirement anchor from Cotra, scaling and algorithmic-efficiency rates from the Kaplan-Hoffmann-Hernandez-Brown-Erdil-Besiroglu line, the hardware trajectory from Epoch AI’s compute database, and growth-theory machinery from the Jones-Romer / Aghion-Jones-Jones lineage. The framework is positioned as downstream of these sources, not as an alternative. This is structurally more honest than presenting a takeoff model as a standalone derivation: framework accuracy is bounded by input accuracy, and the inputs are named (established — verifiable in research_notes/forecasts_database.json davidson_2023_compute primary_sources). Davidson extends the import set to growth theory and economic structure, which neither Aschenbrenner nor Cotra formalize.

Interactive simulator as published reproduction artifact. The companion takeoffspeeds.com simulator (co-built with Epoch AI) is methodologically a published reproduction package: any reader can vary the approximately seventy parameters and observe how the takeoff-duration distribution shifts. Aschenbrenner’s framework has no public reproduction artifact; Cotra (2020) reports sensitivity bounds without a parameter interface. Within the surveyed forecast set, Davidson is the only entry meeting, at framework level, the standard that every quantitative claim has a reproducible code artifact (established). The property raises the bar for this manuscript’s own Chapter 14 construct: a constructive proposal failing to publish a simulator at parity with Davidson would be one reproducibility tier below the framework it audits.

5.4 §4 — What the Method Cannot Tell Us

Question 1: parameter-search-space observability. The simulator exposes approximately seventy named parameters; the published central scenario represents one trajectory through the joint space. Sensitivity is concentrated on a handful — the four most-sensitive parameters absorb most output variation — but the rest live in the search space. How many parameter combinations were evaluated before the central scenario was selected? How many alternative central scenarios — different training-requirement distribution widths, wake-up multiplier values, CES ρ ranges — were considered and discarded? Methodological transparency at simulator level (publishing the parameters) is not equivalent to pre-specification (committing to values before observing outputs); under multiple-testing logic the distinction is load-bearing. The framework does not enumerate the implicit search count (speculative as to magnitude of inflation; established as a methodological observation about the document).
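Why the unobserved search count matters can be shown with the expected-maximum approximation that Bailey and López de Prado (2014) use for the best of N independent null trials. The sketch below treats each candidate configuration as a standard-normal score with no true signal. Independence across configurations is an assumption, and nothing here estimates the actual search count behind Davidson's central scenario.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def expected_max_z(n_trials):
    """Approximate E[max] of n_trials independent standard-normal scores."""
    if n_trials < 2:
        raise ValueError("approximation needs at least two trials")
    inv = NormalDist().inv_cdf
    return ((1.0 - EULER_GAMMA) * inv(1.0 - 1.0 / n_trials)
            + EULER_GAMMA * inv(1.0 - 1.0 / (n_trials * math.e)))

# The best-looking of N pure-noise configurations improves as N grows.
for n in (10, 100, 1000):
    print(n, round(expected_max_z(n), 2))
```

The same selected result is therefore less surprising the larger the menu it was chosen from, which is why publishing parameters is not a substitute for committing to them in advance.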

Self-application observation. Chapter 14’s effective-N estimator (preregistration v2 §Adaptation 4 — Effective-N estimation; LOCKED, TIER_3_NEW_DERIVATION for Algorithms A/B/C and composite, default (speculative)) faces the strict analog on the largest hyperparameter space in the surveyed set: implicit search count over kernel-choice and null-distribution shape is not directly observable. The manuscript pre-commits to composite range with geometric-mean headline and bracket-ratio thresholds 1.5x and 100x (EX POST IMMUTABLE at v2 hash). The same critique applies to the manuscript itself.

Question 2: upstream-methodology validity inheritance. Davidson imports the AGI training-requirement distribution from Cotra, the scaling and algorithmic-efficiency rates from Kaplan-Hoffmann-Hernandez-Brown-Erdil-Besiroglu, the compute trajectory from Epoch AI (Sevilla et al., 2022), and the semi-endogenous-growth elasticity from Jones-Romer. Errors or revisions in any upstream source propagate mechanically to the central-scenario output. The Davidson-specific contribution cannot be isolated from upstream contamination: if the Cotra anchor cluster is recalibrated by an order of magnitude, the takeoff-duration distribution shifts; if Caballero et al. (2022) broken-scaling intervenes, the monotonic-scaling assumption fails and the model cannot recover via parameter revision (evidenced as a gap; analogue-transfer is the framework’s load-bearing methodological commitment).

Self-application observation. Chapter 14’s information-set partition (preregistration v2 §Adaptation 2 — Information-set partition; LOCKED, TIER_3_NEW_DERIVATION; default (speculative) until MC Study 2 elevation per Type-I_C > 2·PBO OR Power_C < 0.5·PBO) presupposes that the Bailey-López de Prado quantitative-finance lineage transfers to capability forecasting. The transfer is itself an analogue-validity commitment of the same shape as Davidson’s import of Cotra and Jones-Romer. The MC Study 2 criterion exposes this dependency; the analogous critique applies to the manuscript itself.

Question 3: takeoff-dynamics model-tractability. The coupled system assumes that compute, algorithms, hardware, and recursive AI-R&D feedback compose multiplicatively in log-space with approximately independent contributions. Under what conditions does this composition produce a reliable distribution rather than a numerically stable artifact of the chosen functional forms? The factors are not empirically independent — frontier labs allocate capital across compute purchases and algorithmic-research headcount jointly, the same researchers contribute to algorithmic progress and to AI-R&D-feedback engagement, and the wake-up multiplier conditionally amplifies all three factors at the 6% threshold. Under positive correlation among the multiplicatively-compounding factors the joint distribution has heavier tails than independent treatment implies. Neither the correlation structure nor a tail-dependence analysis appears in the framework (evidenced as a gap).
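The effect of correlation on a multiplicative composition can be made concrete in log-space, where the product of factors becomes a sum. The sketch assumes jointly normal log-factors with one common pairwise correlation; the standard deviations are hypothetical placeholders, not estimates from Davidson's report.

```python
import math
from statistics import NormalDist

def log_product_sd(sigmas, corr):
    """SD of a sum of jointly normal log-factors with common pairwise correlation."""
    var = sum(s * s for s in sigmas)
    for i in range(len(sigmas)):
        for j in range(i + 1, len(sigmas)):
            var += 2.0 * corr * sigmas[i] * sigmas[j]  # covariance terms
    return math.sqrt(var)

sigmas = [0.5, 0.5, 0.5]  # hypothetical log10 SDs for three factors
z90 = NormalDist().inv_cdf(0.90)

for corr in (0.0, 0.5):
    sd = log_product_sd(sigmas, corr)
    ratio = 10 ** (z90 * sd)  # 90th percentile of the product relative to its median
    print(f"corr={corr}: sd={sd:.3f} OOM, 90th percentile is {ratio:.1f}x the median")
```

Under these assumptions a pairwise correlation of 0.5 doubles the log-space variance relative to the independent case, so an interval computed under independence understates the spread of the product.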

Self-application observation. Chapter 14’s non-IID variance estimator (preregistration v2 §Adaptation 1 — Non-IID variance estimator; TIER_2_ADAPTED_DERIVATION; default (speculative) until MC Study 1 elevation per n_critical(ρ) > 60) faces the analog: effective sample size depends on correlation structure across multiplicatively-compounding components, and the manuscript pre-commits to block-bootstrap on detrended residuals. The dependence-structure critique applies to the manuscript itself in the same form, with elevation deferred to MC Study 1.
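A moving-block bootstrap on detrended residuals can be sketched as follows. This is a generic illustration of the technique, assuming a linear trend and a fixed block length; the manuscript's pre-committed detrending rule and block-length choice are specified in the preregistration and are not reproduced here. Function names are invented for the example.

```python
import random

def ols_line(ys):
    """Least-squares intercept and slope of ys against 0, 1, 2, ..."""
    n = len(ys)
    xbar = (n - 1) / 2.0
    ybar = sum(ys) / n
    sxx = sum((i - xbar) ** 2 for i in range(n))
    slope = sum((i - xbar) * (y - ybar) for i, y in enumerate(ys)) / sxx
    return ybar - slope * xbar, slope

def block_bootstrap_slope_se(ys, block_len, n_boot=500, seed=0):
    """Bootstrap standard error of the trend slope, resampling residual blocks."""
    rng = random.Random(seed)
    n = len(ys)
    intercept, slope = ols_line(ys)
    trend = [intercept + slope * i for i in range(n)]
    resid = [y - f for y, f in zip(ys, trend)]      # detrended residuals
    starts = range(n - block_len + 1)               # admissible block start indices
    slopes = []
    for _ in range(n_boot):
        sample = []
        while len(sample) < n:                      # stitch blocks until long enough
            s = rng.choice(starts)
            sample.extend(resid[s:s + block_len])   # contiguous block keeps local dependence
        boot = [f + e for f, e in zip(trend, sample[:n])]
        slopes.append(ols_line(boot)[1])
    mean = sum(slopes) / n_boot
    return (sum((b - mean) ** 2 for b in slopes) / (n_boot - 1)) ** 0.5

series = [0.1 * i + ((i * 7) % 5 - 2) for i in range(40)]  # toy series, not real data
print(round(block_bootstrap_slope_se(series, block_len=5), 4))
```

Resampling contiguous blocks rather than single points preserves short-range dependence within each block, which is the reason the method is preferred over an IID bootstrap when residuals are autocorrelated.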

Question 4: sensitivity-disclosure completeness. Davidson reports parameter sensitivity primarily as one-at-a-time perturbation; the simulator permits joint exploration but does not publish the joint-perturbation grid. Across approximately seventy parameters, two-parameter joint perturbations number on the order of 2,500; three-parameter on the order of 50,000. Even sampled exploration is publishable; the framework does not undertake this. Does the one-at-a-time ranking capture the meaningful uncertainty space or a convenient subset? Under aggregation, joint perturbations can produce qualitatively different outputs from any single-parameter perturbation. The published sensitivity ranking is the one-at-a-time ranking; the joint-perturbation ranking is unpublished (evidenced as a gap).
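The orders of magnitude quoted above follow from counting unordered parameter subsets. The short check below assumes exactly seventy parameters, whereas the simulator count is stated only as approximate.

```python
import math

n_params = 70  # assumed exact; the text says "approximately seventy"

pairs = math.comb(n_params, 2)    # distinct two-parameter combinations
triples = math.comb(n_params, 3)  # distinct three-parameter combinations

print(pairs, triples)  # 2415 54740
```

These are counts of which parameters move together, before choosing how many values each takes; a grid with k levels per parameter multiplies the pair count by k squared and the triple count by k cubed.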

Self-application observation. Chapter 14’s capability-forecasting performance statistic (preregistration v2 §Adaptation 3 — Capability-forecasting performance statistic; LOCKED, TIER_2 for Brier / log-loss / calibration / interval-coverage; TIER_3 for capability-Sharpe) faces the analog: no single aggregation rule over the multi-statistic panel is privileged, and the manuscript pre-commits to capability-Sharpe primary plus multi-statistic robustness panel. The analogous aggregation-rule critique applies to the manuscript itself.

5.5 §5 — Forward Reference

Chapter 2 identifies the structural pattern unifying the surveyed set: the takeoff-duration axis is one of seven preregistered decomposition axes, none deploying an out-of-sample validation protocol at publication. Chapter 4 develops the walk-forward retrodiction protocol. Chapter 6 extends the Bailey and López de Prado (2014) DSR derivation into capability forecasting (addressing Questions 1 and 3 via Adaptations 4 and 1). Chapter 14’s constructive DCF apparatus addresses Questions 2 and 4 via Adaptations 2 and 3.

Chapter 1.4 turns next to METR’s task-time-equivalent autonomy benchmark, which decomposes the same forecasting question along a different axis.

5.6 References cited in this chapter

Aghion, P., Jones, B. F., & Jones, C. I. (2017). Artificial intelligence and economic growth. NBER Working Paper No. 23928.

Bailey, D. H., & López de Prado, M. (2014). The deflated Sharpe ratio: Correcting for selection bias, backtest overfitting, and non-normality. Journal of Portfolio Management, 40(5), 94-107.

Caballero, E., Gupta, K., Rish, I., & Krueger, D. (2022). Broken neural scaling laws. arXiv preprint arXiv:2210.14891.

Cotra, A. (2020). Forecasting TAI with biological anchors. Open Philanthropy (non-peer-reviewed).

Davidson, T. (2023). What a compute-centric framework says about AI takeoff speeds. Open Philanthropy (now Coefficient Giving) (non-peer-reviewed). Companion interactive simulator: https://takeoffspeeds.com.

Erdil, E., & Besiroglu, T. (2022). Algorithmic progress in language models. arXiv preprint arXiv:2212.05153.

Hernandez, D., & Brown, T. B. (2020). Measuring the algorithmic efficiency of neural networks. arXiv preprint arXiv:2005.04305.

Hoffmann, J., Borgeaud, S., Mensch, A., et al. (2022). Training compute-optimal large language models [Chinchilla]. NeurIPS.

Kaplan, J., McCandlish, S., Henighan, T., et al. (2020). Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.

Romer, P. M. (1990). Endogenous technological change. Journal of Political Economy, 98(5), S71-S102.

Sevilla, J., Heim, L., Ho, A., Besiroglu, T., Hobbhahn, M., & Villalobos, P. (2022). Compute trends across three eras of machine learning. International Joint Conference on Neural Networks (IJCNN).

Sorscher, B., Geirhos, R., Shekhar, S., Ganguli, S., & Morcos, A. S. (2022). Beyond neural scaling laws: Beating power law scaling via data pruning. NeurIPS.

Trammell, P., & Korinek, A. (2023). Economic growth under transformative AI. NBER Working Paper.

6 Chapter 1.4 — Mapping the Major Forecasts: METR’s Empirical Autonomy Benchmarks

This chapter reconstructs the methodology of Kwa et al. (2025), Measuring AI Ability to Complete Long Software Tasks, together with the METR Time Horizon 1.1 update (METR, 2026) and the live leaderboard (METR, ongoing); the v1.1 update and leaderboard are non-peer-reviewed, while the arXiv paper was accepted at NeurIPS 2025. METR’s framework measures how long a human professional would need to complete tasks at which a frontier AI model achieves 50% success, fits a doubling-rate trend across model release dates, and extrapolates the rate forward. Where Aschenbrenner, Cotra, and Davidson produce forecasts grounded in modelled aggregates, METR produces a measurement, a trend, and a single forward extrapolation step on top.

6.1 §1 — What METR Measures

METR organizes capability progress along a single axis, which the preregistered taxonomy of this manuscript names the time-cost-of-task axis (human-equivalent labor time at which a model achieves 50% success probability, measured per task across a benchmark suite, with the longitudinal series of per-model time horizons producing a doubling-rate trend) (preregistration v1, §7-axis decomposition taxonomy, 2026-05-18). The axis differs structurally from the OOM-axis and the anchor-axis: it is a measured quantity on a defined task population, not an estimate derived from theoretical commitments. The headline output is a per-model time-horizon number in minutes or hours, plus a doubling rate over the 2019-2025 window of approximately 7 months (196.5 days in the v1.1 update) (Kwa et al., 2025; METR, 2026).

The 50%-time-horizon metric. For each evaluated model and each task in the benchmark suite, METR computes the model’s empirical success rate from multiple independent runs. The set of (task, success-rate) pairs for a given model is fit with a logistic curve against the human-equivalent completion time of the tasks; the abscissa at which the fitted logistic crosses the 50% probability ordinate is the model’s time horizon. The metric is reproducible from published task definitions, scoring conventions, and the open-source Inspect evaluation framework (METR, 2026) (established — directly attested in arXiv 2503.14499 and metr.org/time-horizons/).
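The first-stage fit can be sketched with a two-parameter logistic regression. The sketch assumes the regressor is log2 of human completion time and fits by Newton's method on the binomial likelihood; it is not METR's code, the function name is invented, and the task data in the usage example are made up for illustration.

```python
import math

def fit_time_horizon(tasks, iters=50):
    """tasks: list of (human_minutes, successes, runs).

    Fits P(success) = sigmoid(a + b * log2(minutes)) and returns the
    task length in minutes at which the fitted curve crosses 50%.
    Assumes the data are not perfectly separable.
    """
    a, b = 0.0, 0.0
    for _ in range(iters):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for minutes, k, n in tasks:
            x = math.log2(minutes)
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            g_a += k - n * p                 # score with respect to a
            g_b += (k - n * p) * x           # score with respect to b
            w = n * p * (1.0 - p)            # binomial variance weight
            h_aa += w
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        a += (h_bb * g_a - h_ab * g_b) / det  # Newton step: solve 2x2 system
        b += (h_aa * g_b - h_ab * g_a) / det
    return 2.0 ** (-a / b)                    # a + b * log2(h) = 0 at the 50% point

# Hypothetical results for one model: (human minutes, successes, runs).
example = [(2, 9, 10), (8, 8, 10), (30, 5, 10), (120, 2, 10), (480, 1, 10)]
print(round(fit_time_horizon(example), 1), "minutes")
```

The horizon is the abscissa where the linear predictor equals zero, so it depends on both fitted coefficients; noise in either one moves the reported number.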

The longitudinal series. Plotting per-model time horizons against model release dates yields a log-linear trend; the slope corresponds to a doubling rate. METR reports three rates: full-period (2019-2025) at 196.5 days; since-2023 at 130.8 days with confidence interval [107, 161] in v1.1; since-2024 at 89 days (METR, 2026). An alternative fit using only 2024-2025 data shortens METR’s month-task arrival estimate by approximately 2.5 years relative to the full-period fit (established for the rates on the data used; the forward extrapolation is a separate step).

Pattern G — upstream benchmark precursors. METR’s task suites are not constructed from scratch. RE-Bench (METR-internal) and HCAST (METR-internal) extend the benchmark-construction tradition exemplified by HumanEval (Chen et al., 2021), MMLU (Hendrycks et al., 2021a), MATH (Hendrycks et al., 2021b), GPQA (Rein et al., 2023), and especially SWE-Bench (Jimenez et al., 2024), the verified subset of which METR includes as the only external task suite among four. Where SWE-Bench Verified’s per-task human-time estimates revise, METR’s external-anchor rate (under 3 months, contrasted with the overall 7-month rate) shifts. This upstream dependency is the METR-domain instance of Pattern G (preregistration v1, §Pattern framework) and is a structural feature of the surveyed forecast set, not a defect of METR’s framework: any time-horizon framework on a software-task suite inherits the benchmark-construction literature’s task-selection conventions.

6.2 §2 — The Methodology Decomposition

The METR framework can be stated as a two-stage function. Let H_m denote the 50%-time-horizon of model m (minutes), t_m the model’s release date, and τ the doubling-rate parameter (days). The first stage fits H_m per model from logistic regression on task-level success rates. The second stage fits the log-linear trend

\[ \log_2 H_m = \alpha + \frac{t_m}{\tau} + \varepsilon_m \]

over the population of (t_m, log H_m) pairs in the chosen window, with t_m in days, α the intercept, and ε_m the residual; the base-2 logarithm makes τ read directly as days per doubling. The forward extrapolation evaluates this trend at the calendar date t implied by a target threshold H* (e.g., week-equivalent ~40 hours, month-equivalent ~160 hours), yielding the headline arrival-time estimates.
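The second stage and the extrapolation step can be sketched with ordinary least squares on base-2 log horizons, so that the reciprocal of the slope is the doubling time in days. The function names and the four data points are illustrative assumptions, not METR's series.

```python
import math

def fit_doubling_trend(points):
    """points: list of (release_day, horizon_minutes). Returns (alpha, tau_days)."""
    xs = [t for t, _ in points]
    ys = [math.log2(h) for _, h in points]
    n = len(points)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    return ybar - slope * xbar, 1.0 / slope   # slope is doublings per day

def arrival_day(target_minutes, alpha, tau_days):
    """Day at which the fitted trend reaches the target horizon."""
    return (math.log2(target_minutes) - alpha) * tau_days

# Toy series that doubles exactly every 200 days.
toy = [(0, 1.0), (200, 2.0), (400, 4.0), (600, 8.0)]
alpha, tau = fit_doubling_trend(toy)
print(round(tau, 1), round(arrival_day(16.0, alpha, tau), 1))
```

The fit itself only summarizes observed points; `arrival_day` is where the stationarity assumption enters, because it applies the in-window slope to dates beyond the last observation.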

Inputs. The empirical base is layered across four task suites and three data classes. Tasks: RE-Bench (METR-internal, AI-R&D-relevant); HCAST (METR-internal, agent-skills, with 73 additions, 15 removals, and 53 updates between v1.0 and v1.1); SWE-Bench Verified (external, Jimenez et al., 2024, filtered by independently-collected human-time estimates); and 66 novel METR-developed short tasks anchoring the ≤4-minute lower tail. Total task count expanded from 170 (v1.0) to 228 (v1.1); long tasks (8+ hours) more than doubled from 14 to 31 (METR, 2026). Model success rates: each model is evaluated under a standardized harness (Vivaria in v1.0; open-source Inspect in v1.1) with multiple independent runs per task and task-specific completion criteria. Human baselines: professionals with 5+ years of experience working on self-contained, well-specified tasks; measurements above 16 hours are flagged as unreliable with the current suite (METR, ongoing). The headline parameter is computed jointly across all four task suites, with SWE-Bench Verified additionally reported as a separate cross-suite anchor.

Extrapolation and assumptions. Six modelling commitments carry the forecast: (i) the 50%-success threshold is the meaningful capability indicator (METR publishes 80% time horizons on the leaderboard for cross-threshold comparison); (ii) the logistic curve is the correct functional form for success-rate-vs-task-time; (iii) the four-suite composition is approximately representative of the capability domain (software-task autonomy specifically; generalization to non-software tasks not claimed); (iv) the doubling rate is approximately stationary within each chosen window, with the multi-window decomposition (full-period / since-2023 / since-2024) admitting local rather than global stationarity; (v) the forward extrapolation preserves the chosen window’s rate over 2-5 calendar years beyond the measurement endpoint, with no structural break; (vi) the evaluation harness migration (Vivaria → Inspect) preserves measurement equivalence on the baseline set.

In-sample / out-of-sample boundary. This is the load-bearing observation of the chapter. The 2019-2025 trend fit is in-sample by construction; the forward extrapolation to week-equivalent (2-4 years out) and month-equivalent (5 years out) task automation is out-of-sample by construction (Kwa et al., 2025). METR’s framework is the most explicit in the surveyed set about this boundary: the rates are reported as fits on observed data; the forward projections are framed explicitly as extrapolations rather than as probabilistic forecasts over capability dynamics. The v1.0 → v1.1 transition supplies the closest analog to a walk-forward validation in the surveyed set: the same models re-evaluated under an expanded suite (+34% tasks; long tasks more than doubled), with the headline doubling rate reproducing (196.5 days unchanged) and per-model confidence intervals mostly tightening (Claude Opus 4.5 upper bound shifted from 4.4× to 2.3× point estimate). One notable revision (GPT-4 1106 from 8.5 to 3.6 minutes, with the v1.0 point estimate of 8.5 falling just above the v1.1 CI of [1.6, 7.5]) is the most diagnostic single number in the surveyed set for in-sample-to-out-of-sample stress (evidenced as the closest walk-forward analog the surveyed set offers; the formal out-of-sample property holds only for the v1.0 → v1.1 same-model re-test, not for the forward extrapolation itself).

6.3 §3 — What the Method Achieves

Empirically observable trend on a defined axis. METR measures already-observed time horizons on frontier models and fits a trend over the population of those measurements. Headline outputs are statements about the past and present, checkable today on the published task suites; only the forward extrapolation step is prospective. Aschenbrenner produces a forecast; Cotra produces a forward distribution; Davidson produces a forward distribution; METR produces a measurement plus a trend, with extrapolation appended. The framework’s headline 7-month doubling rate is reproducible from primary data via the open-source Inspect framework, the published task definitions, and the per-task scoring conventions (established — verifiable in research_notes/forecasts_database.json and against arXiv 2503.14499). This is the surveyed set’s first forecast whose principal outputs are verifiable today rather than only at the forecast horizon.

Structured sensitivity analysis with explicit perturbation count. METR publishes a 10,000-perturbation Monte Carlo sensitivity analysis over methodological choices, with display of the 1-month-horizon arrival year distribution and percentile bands at the 25th-75th and 10th-90th intervals (Kwa et al., 2025). Within the surveyed set, no other framework matches this discipline at this structural level: Aschenbrenner reports no published sensitivity analysis; Cotra reports one-at-a-time parameter perturbations without explicit count; Davidson exposes parameter sensitivity via the interactive takeoffspeeds.com simulator but does not publish a structured perturbation table in the report itself (established — verifiable across research_notes/forecasts_database.json entries for Aschenbrenner, Cotra, Davidson). METR meets the methodology-canon standard of sensitivity analysis with explicit perturbation count and confidence intervals within the publication itself.
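The shape of such an analysis can be sketched as follows. This is not METR's perturbation set: it assumes the three window rates quoted earlier are equally likely, applies an assumed lognormal perturbation to a 320-minute starting horizon, and targets the 160-hour month-equivalent threshold. The noise scale and the function names are placeholders.

```python
import math
import random

def arrival_years(h0_minutes, target_minutes, tau_days):
    """Years until the horizon grows from h0 to target at doubling time tau."""
    return tau_days * math.log2(target_minutes / h0_minutes) / 365.25

def percentile(sorted_vals, q):
    """Nearest-rank percentile of an already sorted list, q in [0, 100]."""
    rank = math.ceil(q / 100.0 * len(sorted_vals))
    return sorted_vals[min(len(sorted_vals) - 1, max(0, rank - 1))]

def perturbation_bands(n=10_000, seed=0):
    rng = random.Random(seed)
    taus = [196.5, 130.8, 89.0]   # full-period, since-2023, since-2024 (days)
    out = []
    for _ in range(n):
        tau = rng.choice(taus)                        # assumed: windows equally weighted
        h0 = 320.0 * math.exp(rng.gauss(0.0, 0.4))    # assumed noise on the starting horizon
        out.append(arrival_years(h0, 160 * 60.0, tau))
    out.sort()
    return {q: percentile(out, q) for q in (10, 25, 50, 75, 90)}

for q, years in perturbation_bands().items():
    print(f"p{q}: {years:.2f} years from the measurement endpoint")
```

Reporting the perturbation count alongside the bands is what lets a reader judge how much of the methodological choice space the bands actually cover.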

Versioned releases with explicit before-after comparison. The v1.0 → v1.1 transition (METR, 2026) is the surveyed set’s first forecast update accompanied by a structured side-by-side comparison: same models re-evaluated under the expanded suite, with point estimates, confidence intervals, and the methodological change (task-suite expansion; infrastructure migration) named alongside the numerical revisions. Doubling rates are reported with v1.0 and v1.1 values together (overall 196.5 days unchanged; since-2023 165 → 130.8; since-2024 109 → 89); per-model time horizons (Claude Opus 4.5 289 → 320, +11%; GPT-5 138 → 214, +55%; GPT-4 1106 8.5 → 3.6, −57%) with v1.1 confidence intervals. The Cotra 2020 → 2022 update revised mixture weights without an explicit before-after comparison table; the Davidson framework has not undergone a formal versioned update. METR’s transition is the closest published practice in the surveyed set to the audit-trail standard (established).

Explicit operational-regime boundary. METR flags time-horizon measurements above 16 hours as unreliable with the current task suite (METR, ongoing). The flag appears in the primary-source live leaderboard text. The framework therefore has a defined operational regime (time horizons up to 16 hours; tasks longer than 16 hours not reliably measured by the current suite). No other surveyed forecast characterizes the operational regime of its numerical claims at this level of specificity (established — verifiable across research_notes/forecasts_database.json). This is methodologically commendable in the López de Prado sense: an honest forecast characterizes not only the forecast but the regime in which the forecast is meaningful.

6.4 §4 — What the Method Cannot Tell Us

Four questions surface from the framework’s structure; under Rule 3 (self-application), each maps to an audit-chapter section and to an analogous question facing this manuscript’s own Chapter 14 construct.

Question 1: task-suite hyperparameter-search count. The framework selects four task suites and 228 tasks (v1.1) from a much larger space of plausible benchmark-construction choices. Adjacent benchmarks (LiveCodeBench, AgentBench, WebArena, GAIA, ARC-AGI) are not included; alternative threshold choices (50% vs 80% vs 95%), alternative fit-window choices (full-period / since-2023 / since-2024), alternative logistic-curve specifications (single logistic vs piecewise vs mixture), and alternative human-baseline elicitation protocols (5+ year vs differently-selected cohorts) span an implicit hyperparameter space. METR enumerates two of these dimensions explicitly (fit-window; threshold) and partially exposes the task-subset dimension via the SWE-Bench Verified separate-rate calculation; the remaining dimensions are not enumerated as alternatives (speculative as to magnitude of search-induced inflation; established as a methodological observation about the framework).

Self-application observation. Chapter 14’s effective-N estimator (preregistration v2 §Adaptation 4; LOCKED, TIER_3_NEW_DERIVATION for Algorithms A/B/C and composite, default (speculative)) faces the strict analog: the implicit hyperparameter-search count is not directly observable. The manuscript pre-commits to a composite range with geometric-mean headline and bracket-ratio thresholds 1.5x and 100x. The analogous critique — how many forecast-construction choices populate the implicit menu — applies to the manuscript on the identical dimension.
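
A minimal Python sketch of the composite-range bookkeeping named above: it takes a set of positive effective-N estimates, reports their geometric mean as the headline, and compares the max/min bracket ratio with the 1.5x and 100x thresholds. The input values and the band labels are hypothetical. The sketch does not reproduce Algorithms A/B/C and does not say what the manuscript does in each band.

```python
import math

LOWER_RATIO, UPPER_RATIO = 1.5, 100.0   # bracket-ratio thresholds quoted in the text

def composite_range(estimates):
    """Return (low, geometric_mean, high, bracket_ratio) for positive estimates."""
    if not estimates or any(e <= 0 for e in estimates):
        raise ValueError("estimates must be a non-empty list of positive numbers")
    low, high = min(estimates), max(estimates)
    geo_mean = math.exp(sum(math.log(e) for e in estimates) / len(estimates))
    return low, geo_mean, high, high / low

def bracket_band(ratio):
    """Locate the bracket ratio relative to the two thresholds (labels are hypothetical)."""
    if ratio < LOWER_RATIO:
        return "below-1.5x"
    if ratio > UPPER_RATIO:
        return "above-100x"
    return "between"

# Made-up outputs of three estimators, for illustration only.
low, geo_mean, high, ratio = composite_range([4.0, 16.0, 64.0])
print(low, round(geo_mean, 3), high, ratio, bracket_band(ratio))
```

The geometric mean is the natural headline when estimates differ by multiplicative factors, because it treats a 4x underestimate and a 4x overestimate symmetrically.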

Question 2: doubling-rate stationarity over the projection horizon. The forward extrapolation to week-equivalent (2-4 years) and month-equivalent (5 years) task automation assumes the chosen window’s doubling rate continues without structural break. Within the observed window, the three reported rates (196.5 / 130.8 / 89 days for full-period / since-2023 / since-2024) already exhibit non-stationarity sufficient to shift the month-task arrival estimate by approximately 2.5 years depending on window choice (METR, 2026). The framework reports the alternative windows transparently but does not formalize a non-stationarity test, and the 10,000-perturbation sensitivity analysis varies methodology choices rather than the underlying trend dynamics (evidenced as a gap; the multi-window reporting acknowledges the issue without resolving it).
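
The window dependence can be made concrete with a short Python sketch. It converts each reported doubling time into an implied annual growth multiple and shows how far apart the windows place the same number of future doublings. The doubling times are those quoted above. The number of doublings passed in is a free illustrative input rather than a METR figure, and the sketch assumes a constant doubling time within each window.

```python
WINDOWS = {"full-period": 196.5, "since-2023": 130.8, "since-2024": 89.0}  # days

def annual_multiple(doubling_days):
    """Growth factor per year implied by a constant doubling time in days."""
    return 2 ** (365.25 / doubling_days)

def years_for_doublings(n_doublings, doubling_days):
    """Calendar years needed to complete n_doublings at a constant doubling time."""
    return n_doublings * doubling_days / 365.25

def window_spread_years(n_doublings):
    """Gap in years between the slowest and fastest window for the same target."""
    times = [years_for_doublings(n_doublings, d) for d in WINDOWS.values()]
    return max(times) - min(times)

for name, days in WINDOWS.items():
    print(f"{name}: {annual_multiple(days):.1f}x per year")
print(f"spread for 5 doublings: {window_spread_years(5):.2f} years")
```

Under these assumptions the three windows imply roughly 3.6x, 6.9x, and 17x growth per year, so the choice of window acts as a first-order input to any arrival estimate rather than as a robustness footnote.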

Self-application observation. Chapter 14’s non-IID variance estimator (preregistration v2 §Adaptation 1; TIER_2_ADAPTED_DERIVATION; default (speculative) until MC Study 1 elevation per n_critical(ρ) > 60) faces the analog: the correction depends on dependence structure across the longitudinal series of per-model time horizons. If models within a release cluster are positively correlated (same labs producing similar gains; same task-suite composition across model generations), the effective sample size shrinks. Block-bootstrap on detrended residuals is the manuscript’s pre-committed procedure; the analogous critique applies to the manuscript itself.
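
A block bootstrap on detrended residuals can be sketched in standard-library Python under stated assumptions: a linear trend in log2 time horizon, a moving-block scheme with a fixed block length of three, and a synthetic AR(1) series standing in for the per-model data. None of these settings is taken from the preregistration, and the function names are illustrative.

```python
import random

def ols_fit(t, y):
    """Return (slope, intercept) of the least-squares line y = slope * t + intercept."""
    n = len(t)
    t_bar, y_bar = sum(t) / n, sum(y) / n
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    sxy = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
    slope = sxy / sxx
    return slope, y_bar - slope * t_bar

def block_bootstrap_slope(t, y, block_len=3, n_boot=2000, seed=0):
    """Moving-block bootstrap of the trend slope using detrended residuals."""
    rng = random.Random(seed)
    n = len(y)
    slope, intercept = ols_fit(t, y)
    fitted = [slope * ti + intercept for ti in t]
    resid = [yi - fi for yi, fi in zip(y, fitted)]        # detrended residuals
    n_blocks = -(-n // block_len)                          # ceiling division
    slopes = []
    for _ in range(n_boot):
        resampled = []
        for _ in range(n_blocks):
            start = rng.randrange(0, n - block_len + 1)    # any valid block start
            resampled.extend(resid[start:start + block_len])
        y_star = [f + r for f, r in zip(fitted, resampled[:n])]
        slopes.append(ols_fit(t, y_star)[0])
    slopes.sort()
    return slope, slopes[int(0.025 * n_boot)], slopes[int(0.975 * n_boot) - 1]

# Synthetic series (made up): t in years, y = log2(time horizon), with AR(1) noise.
rng = random.Random(1)
t = [0.25 * i for i in range(25)]
noise, prev = [], 0.0
for _ in t:
    prev = 0.6 * prev + rng.gauss(0.0, 0.3)
    noise.append(prev)
y = [1.0 + 1.8 * ti + e for ti, e in zip(t, noise)]
slope, lo, hi = block_bootstrap_slope(t, y)
print(f"slope={slope:.2f} doublings/yr, 95% interval=({lo:.2f}, {hi:.2f})")
```

Resampling residuals in contiguous blocks keeps short-range dependence inside each block, which is why the resulting interval is typically wider than one from an IID bootstrap on the same series.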

Question 3: 50%-threshold and logistic-fit performance-metric specification. The framework selects task-completion time at 50% success probability as the headline capability metric and the logistic curve as the success-probability functional form. Both are conventional choices, both methodologically defensible, and both are choices within a space of alternatives. The 80%-threshold time horizon is published on the live leaderboard for cross-threshold comparison; piecewise and mixture specifications are not. The 50%-threshold choice misses qualitative dimensions the time-cost-of-task axis does not capture: reliability under distribution shift, creativity in task framing, alignment behavior under adversarial conditions. The framework’s headline metric is one statistic in a panel that does not exist (evidenced as a gap; the alternative threshold publication mitigates without resolving).
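
How the threshold choice moves the headline number can be shown with a small Python sketch. It assumes a logistic success curve in log2 task length, P(success) = 1 / (1 + exp(-(a - b·log2 t))), and inverts it for the task length at a chosen success probability. The parameterization and the coefficients a and b are hypothetical illustrations, not fitted METR values.

```python
import math

def success_probability(task_minutes, a, b):
    """Logistic success probability as a function of log2 task length."""
    z = a - b * math.log2(task_minutes)
    return 1.0 / (1.0 + math.exp(-z))

def horizon_minutes(p, a, b):
    """Task length at which the modelled success probability equals p."""
    logit = math.log(p / (1.0 - p))
    return 2 ** ((a - logit) / b)

A, B = 4.0, 0.5                      # hypothetical fitted coefficients
h50 = horizon_minutes(0.50, A, B)    # logit(0.5) = 0, so this is 2 ** (A / B)
h80 = horizon_minutes(0.80, A, B)
print(f"50% horizon: {h50:.0f} min, 80% horizon: {h80:.0f} min")
```

With these made-up coefficients the 50% horizon is 256 minutes and the 80% horizon is about 37 minutes. The gap between the two depends only on the slope b, which is one reason a single threshold cannot stand in for the whole curve.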

Self-application observation. Chapter 14’s capability-forecasting performance statistic (preregistration v2 §Adaptation 3; LOCKED, TIER_2 for Brier/log-loss/calibration/interval-coverage, TIER_3 for capability-Sharpe) faces the analog: no single statistic is privileged. The manuscript pre-commits to a capability-Sharpe primary with multi-statistic robustness panel. The analogous critique — under what statistic is the headline computed — applies to the manuscript on the identical dimension.
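
Three of the panel statistics named above have standard definitions that fit in a few lines of Python. The sketch below is a generic illustration on made-up binary forecasts and intervals. It does not implement the capability-Sharpe statistic, and the function names are illustrative.

```python
import math

def brier_score(probs, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def log_loss(probs, outcomes, eps=1e-12):
    """Mean negative log-likelihood of 0/1 outcomes, with probabilities clipped."""
    total = 0.0
    for p, o in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total -= o * math.log(p) + (1 - o) * math.log(1.0 - p)
    return total / len(probs)

def interval_coverage(intervals, realized):
    """Share of realized values that fall inside their forecast interval."""
    hits = sum(lo <= x <= hi for (lo, hi), x in zip(intervals, realized))
    return hits / len(realized)

probs, outcomes = [0.9, 0.6, 0.2], [1, 0, 0]     # made-up forecasts and outcomes
print(round(brier_score(probs, outcomes), 4), round(log_loss(probs, outcomes), 4))
print(interval_coverage([(0, 2), (5, 6)], [1, 7]))
```

The statistics can rank the same forecasts differently: log-loss punishes a confident miss far more heavily than the Brier score does, which is the practical reason for reporting a panel rather than one number.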

Question 4: effective sample size across correlated task-by-model observations. The framework’s longitudinal series consists of approximately 228 tasks × (number of evaluated models) observations of (model, task, success-rate) triples, but the effective sample size is materially smaller. Tasks within a suite are correlated by construction (shared scoring conventions, shared difficulty distributions); models within a lab share architecture and training data; the 2023-2025 frontier models share much of the 2019-2022 progress. METR reports within-fit confidence intervals computed via bootstrap and the 10,000-perturbation methodology-choice analysis, but does not publish an effective-N adjustment that accounts for the cross-task, cross-model, and cross-generation correlation structure (evidenced as a gap; the cross-suite divergence between SWE-Bench Verified and the overall rate is one indicator of the correlation structure’s load-bearing role).
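
The size of the effect can be gauged with the textbook equicorrelation formula, n_eff = n / (1 + (n - 1)·ρ), sketched below in Python. A single common correlation ρ is a simplifying assumption made here for illustration. It is neither METR’s correlation structure nor the Chapter 14 estimator, and the ρ values are made up.

```python
def effective_n(n, rho):
    """Effective sample size of n equally correlated observations (0 <= rho <= 1)."""
    if n < 1 or not 0.0 <= rho <= 1.0:
        raise ValueError("need n >= 1 and 0 <= rho <= 1")
    return n / (1.0 + (n - 1) * rho)

for rho in (0.0, 0.01, 0.1, 0.5):
    print(f"rho={rho:<4} n_eff={effective_n(228, rho):.1f}")
```

For 228 tasks, a common correlation of only 0.1 already reduces the effective count to about 9.6, so even modest dependence can dominate the nominal sample size.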

Self-application observation. Chapter 14’s effective-N composite range estimator (preregistration v2 §Adaptation 4; LOCKED, TIER_3_NEW_DERIVATION) faces this on the dimension of correlation-adjusted N rather than search-space count. The manuscript pre-commits to a composite of four algorithms with geometric-mean headline, treating effective-N as a range rather than a point estimate. Question 1 and Question 4 both map naturally to Adaptation 4 because Adaptation 4 carries both load-bearings — search-space dimensionality (Question 1) and correlation-adjusted sample size (Question 4) — under a single composite range estimator. The two sub-uses are substantively distinct: Q1 governs the count of considered framework alternatives; Q4 governs the count of independent capability observations within a single framework instance. The manuscript’s response is the composite range with bracket thresholds 1.5x and 100x; the analogous critique on both sub-dimensions applies to the manuscript itself.

6.5 §5 — Forward Reference

Chapter 2 identifies the structural pattern unifying the surveyed set: the time-cost-of-task axis is one of seven preregistered decomposition axes, and METR’s empirical-measurement structure makes it the surveyed set’s most data-rich longitudinal series — yet the forward-extrapolation step still lacks an out-of-sample validation protocol. Chapter 4 develops a walk-forward retrodiction protocol; the METR v1.0 → v1.1 transition is the principal worked example. Chapter 6 extends the Bailey and López de Prado (2014) DSR derivation into capability forecasting, addressing Questions 2 and 3 via Adaptations 1 and 3. Chapter 14’s constructive DCF apparatus addresses Questions 1 and 4 via Adaptation 4 — the composite range estimator carries both the search-space dimensionality and the correlation-adjusted effective-N load-bearings under a single LOCKED commitment.

Chapter 1.5 turns next to the Karnofsky / Grace / Epoch AI composite, which decomposes the same forecasting question along three further axes.

6.6 References cited in this chapter

Bailey, D. H., & López de Prado, M. (2014). The deflated Sharpe ratio: Correcting for selection bias, backtest overfitting, and non-normality. Journal of Portfolio Management, 40(5), 94-107.

Chen, M., Tworek, J., Jun, H., et al. (2021). Evaluating large language models trained on code [HumanEval]. arXiv preprint arXiv:2107.03374.

Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., & Narasimhan, K. (2024). SWE-bench: Can language models resolve real-world GitHub issues? International Conference on Learning Representations.

Kwa, T., West, B., Becker, J., et al. (2025). Measuring AI ability to complete long software tasks (arXiv:2503.14499). NeurIPS.

METR. (2026, January 29). Time horizon 1.1 update. metr.org/blog (non-peer-reviewed).

METR. (ongoing). Time horizons leaderboard. metr.org/time-horizons/ (non-peer-reviewed).

Rein, D., Hou, B. L., Stickland, A. C., et al. (2023). GPQA: A graduate-level Google-proof Q&A benchmark. arXiv preprint arXiv:2311.12022.

7 Chapter 1.5 — Mapping the Major Forecasts: Karnofsky’s Most Important Century

This chapter reconstructs the methodology of Karnofsky (2021-2022), The Most Important Century (non-peer-reviewed), an eleven-post Cold Takes blog series. The series argues that the present century carries unusually high probability mass over civilization-transforming outcomes from transformative AI, and that all possible views about humanity’s future are wild given the cosmological reference class. The series does not produce a point-estimate AGI-arrival year. The probability-mass framing is the object here.

7.1 §1 — What Karnofsky Concentrates

Karnofsky organizes the most-important-century question along a single axis, which the preregistered taxonomy of this manuscript names the probability-mass / civilizational-significance axis (probability mass distributed across civilizational-outcome categories with per-category scoring on five civilizational-significance dimensions) (preregistration v1, §7-axis decomposition taxonomy, 2026-05-18). The axis operates on probability mass over outcome categories rather than on calendar years, training compute, or task-completion time, distinct from the four prior surveyed-set axes. Karnofsky integrates four argument pillars across the canonical series.

Pillar 1: power transition via PASTA mechanism. Karnofsky proposes that AI capable of automating scientific and technological advancement — abbreviated PASTA in the series, for Process for Automating Scientific and Technological Advancement — produces a resources-to-ideas feedback loop accelerating productivity growth toward civilization-transforming outcomes (Karnofsky, 2021, In a Nutshell; non-peer-reviewed). The mechanism is inherited rather than derived: the dynamical structure follows Davidson’s takeoff-speeds simulator and the broader semi-endogenous-growth literature (established as an observation about framework architecture; inheritance verifiable in the canonical series).

Pillar 2: AI capability emergence anchored on Cotra’s biological anchors. Karnofsky cites Cotra’s 2020 biological-anchors framework as the principal quantitative input supporting the “more likely than not this century” judgment on PASTA arrival, supplemented by the Grace et al. (2018) and Stein-Perlman et al. (2022) AI-researcher-survey aggregates. Karnofsky does not independently re-derive Cotra’s anchor mixture; the inheritance is direct (evidenced; inheritance creates structural propagation of Cotra’s residual uncertainty).

Pillar 3: long-termist trade-off considerations grounded in Bostrom and Ord. Karnofsky frames civilizational-significance implications using Bostrom (2014) cosmic-endowment and existential-risk taxonomy, with Ord (2020) supplying the per-outcome significance scoring (extinction, persistent totalitarianism, recoverable-collapse, suboptimal-trajectory). The five civilizational-significance dimensions Karnofsky names (timeline acceleration, future population scale, environmental control, structural stability, directional uncertainty) operate at a level of analysis the four prior surveyed-set frameworks do not engage (established as an enumeration verifiable in the In a Nutshell summary).

Pillar 4: historical-analogy reasoning via This Can’t Go On. Karnofsky’s economic-growth-impossibility argument (Karnofsky, 2021, This Can’t Go On; non-peer-reviewed) derives that 2% annual growth projected 8,200 years yields an implied economy of approximately 3×10^70 in present-day units, exceeding a galactic atom count of approximately 10^70. An energy-based alternative anchor — 2.3% energy growth exhausts the stellar energy budget within approximately 2,500 years — supplies a shorter-horizon parallel argument (established as a direct mathematical consequence of the 2% rate over the stated horizon).
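
The arithmetic behind the growth anchor is easy to check. The Python sketch below works in log10 space to avoid overflow and recovers the order of magnitude of 2% compound growth over 8,200 years, together with the doubling time that a 2% rate implies. Only the rate and the horizon come from the text. The function names are illustrative.

```python
import math

def log10_growth_factor(rate, years):
    """Base-10 logarithm of (1 + rate) ** years."""
    return years * math.log10(1.0 + rate)

def doubling_time_years(rate):
    """Years needed to double at a constant annual growth rate."""
    return math.log(2.0) / math.log(1.0 + rate)

exponent = log10_growth_factor(0.02, 8200)
mantissa = 10 ** (exponent - math.floor(exponent))
print(f"1.02 ** 8200 is about {mantissa:.1f} x 10^{math.floor(exponent)}")
print(f"doubling time at 2%: {doubling_time_years(0.02):.1f} years")
```

The result is a growth factor a little above 3 × 10^70 and a doubling time of about 35 years, which confirms that the claim is a direct consequence of compounding and involves no forecast.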

Aggregated across the four pillars, the framework reports probability-mass concentration across three outcome categories. The first (galaxy-spanning civilization beginning this century, conditional on PASTA arrival) is framed qualitatively as “more likely than not,” without a derived percentage. The second (delayed galaxy expansion 500 to 100,000 years hence) is framed as the conservative alternative that still qualifies as wild on Karnofsky’s three wildness dimensions. The third (no expansion, including extinction and persistent stagnation) is framed as more than 50% probable yet very weird to claim as overwhelmingly likely (Karnofsky, 2021, All Possible Views Are Wild; paraphrased). Probability-mass concentration is asserted rather than derived from a structured Bayesian update (established as a fact about the document).

Karnofsky’s pillars inherit from upstream sources: Pillar 1 from Davidson’s takeoff-speeds work and the semi-endogenous-growth literature; Pillar 2 from Cotra (2020) biological anchors and the Grace et al. expert-survey series; Pillar 3 from Bostrom (2014) and Ord (2020); Pillar 4 from long-run growth modeling, including possibly Roodman (2020). This upstream dependency is the Karnofsky-domain instance of Pattern G (preregistration v1, §Pattern framework) and is a structural feature of the surveyed forecast set, not a defect of Karnofsky’s framework: any narrative-philosophical synthesis inherits its constituents’ residual uncertainty.

7.2 §2 — The Methodology Decomposition

The probability-mass framework can be stated as a function. Let O ∈ {galaxy-spanning expansion this century; delayed expansion 500-100,000 years; no expansion / extinction / stagnation} denote outcome category, d_i for i ∈ {timeline acceleration, population scale, environmental control, structural stability, directional uncertainty} denote civilizational-significance dimensions, and P(O) denote probability mass on O. The framework reports

O ↦ (P(O), {d_i(O)}_i)

with P(galaxy-spanning this century) qualitatively asserted to exceed 0.5 and d_i(O) enumerated narratively rather than derived from a model. The integration across four argument pillars is narrative aggregation rather than algorithmic aggregation.

Inputs. The empirical base is heterogeneous across three families. Capability-forecast: Cotra (2020) biological-anchors framework supplies the median-2052 quantitative anchor; Davidson takeoff-dynamics work supplies PASTA-mechanism dynamical parameterization; Grace et al. (2018) and Stein-Perlman et al. (2022) AI-researcher surveys supply expert-aggregate grounding. Threat-model: Bostrom (2014) supplies cosmic-endowment scaffolding and existential-risk taxonomy; Ord (2020) supplies refined extinction-vs-stagnation-vs-suboptimal categorization; Hanson (2016) supplies the digital-people-explosion predecessor framing. Historical-analogy: 2% historical growth rate from the industrial-era reference class (approximately 1700-2000); galactic atom count ~10^70 from astrophysical reference works; cosmic-history anchors (stellar lifetime ~10^17 years; intelligent-life-window ~10^9 years) supporting the wildness-via-temporal-uniqueness argument.

Extrapolation and assumptions. Six modelling commitments carry the framework: (i) civilization-scale reasoning is tractable, in that probability mass can be coherently allocated across outcome categories despite within-horizon unfalsifiability; (ii) threat models capture meaningful catastrophic risks, in that the Bostrom-Ord taxonomy is approximately exhaustive of catastrophic-outcome categories; (iii) historical analogies inform civilizational-significance reasoning, in that the 2%-growth anchor and the stellar-energy anchor inform forecasts of future civilizational dynamics, not merely the mathematical impossibility of indefinite continuation; (iv) probability mass is allocated narratively rather than via structured search-space accounting; (v) inheritance from Cotra, Davidson, Bostrom, Hanson, and Grace propagates without independent re-derivation; (vi) the foundational “most important century” claim is meaningful despite unfalsifiability within the prediction horizon.

In-sample / out-of-sample boundary. This is the load-bearing observation of the chapter. Karnofsky’s framework is the surveyed set’s clearest limit case of a framework lacking an in-sample / out-of-sample boundary. Pillar 4 (This Can’t Go On) is in-sample only on its mathematical-consequence layer: 2% historical growth is observed historical data; the 8,200-year horizon and the 3×10^70 implied economy are direct algebraic consequences; the contradiction with the atom count is verifiable astrophysics. The probability-mass concentrations themselves are out-of-sample by construction: no historical-precedent reference class permits validation of the probability that this century is the most important. No formal walk-forward analysis is presented. No held-out validation period. No pre-specified falsification criterion at the framework level. Karnofsky has not published a 2024-2025 self-revision of the 2021-2022 series; the framework is a single snapshot rather than a multi-snapshot series (established as a fact about the document).

The foundational claim’s unfalsifiability within the prediction horizon is the structural feature distinguishing Karnofsky’s framework from the four prior surveyed-set entries. Aschenbrenner’s 2027 AGI claim admits in-principle within-horizon falsification by 2028 observation; Cotra’s median-2052 distribution admits falsification by year-2100 observation against the framework’s interval; Davidson’s takeoff-duration distribution admits within-decade-of-takeoff falsification; METR’s 50%-task-completion-time series admits year-on-year falsification against the framework’s projection. Karnofsky’s claim that this is the most important century cannot be tested within the century itself. The framework’s response is to surface sub-claim falsifiability handles (the 8,200-year mathematical argument; the PASTA mechanism; the digital-people-explosion claim) rather than to claim within-horizon foundational-claim falsifiability.

The absent boundary is the bridge to Chapters 6 and 7. Chapter 6 assumes the performance statistic under correction is computed on an operational sample. Karnofsky’s framework lacks this sample by construction at publication. Chapter 7 must accommodate Karnofsky-style teleological claims as a limit case: zero falsification handles within the prediction horizon is methodologically equivalent to infinite implicit-test-multiplicity within the horizon, in the precise sense that no observation within the horizon distinguishes the claim from its alternatives. Chapter 4 develops the walk-forward retrodiction protocol that would generate the operational sample retrospectively; Karnofsky’s framework is the test case for the protocol’s behaviour at the limit.

7.3 §3 — What the Method Achieves

Probability-mass framing as starting epistemological commitment. Karnofsky’s principal methodological move is to refuse a point-estimate AGI-arrival year and to concentrate probability mass over civilizational-outcome categories. The framing is structurally closer to a forecasts-as-distributions discipline than to the point-estimate framings produced by Aschenbrenner (2027) or by Cotra’s median 2052 / 2040: Karnofsky publishes no median year; the asymmetric-confidence framing is unilateral. The framing as an epistemic starting commitment is methodologically distinct from the four prior surveyed-set entries and represents the surveyed set’s first probability-mass-over-outcomes input (established — verifiable across research_notes/forecasts_database.json entries for Aschenbrenner, Cotra, Davidson, METR).

Synthesizing role integrating three previously-distinct literatures. Karnofsky’s framework is the surveyed set’s first attempt to integrate three literatures that prior forecasting work treated separately: AI capability forecasting (Cotra, Davidson, Grace et al.); long-termist philosophy (Parfit, MacAskill contemporaneous); existential-risk taxonomy (Bostrom, Ord). The integration is one-directional — capability forecasts feed Karnofsky as inputs but Karnofsky does not feed back as a constraint — and the inheritance structure is named-citation explicit rather than embedded implicitly. The integration surfaces structural inheritance for downstream audit: a probability claim about the most-important-century inherits assumptions from each contributing literature, and integrated confidence cannot exceed joint-confidence of the contributing literatures (evidenced; inheritance structure verifiable from cited_argument_sources in research_notes/forecasts_database.json).

Self-aware about framework limitations within its own framing. Karnofsky’s series explicitly characterizes the probability assignments as not derived from a single quantitative model. The In a Nutshell summary attributes the “more likely than not this century” judgment on PASTA arrival to AI-researcher surveys rather than to Karnofsky’s own quantitative model. The All Possible Views Are Wild post characterizes all positions on humanity’s future, including Karnofsky’s own, as wild on at least one of temporal uniqueness, improbability, or Fermi-Paradox resolution. Under Rule 9 (no hedging cowardice), the framework states positions while explicitly framing them as probability-mass concentrations rather than point predictions (established; multiply attested across In a Nutshell and the 80,000 Hours podcast).

Falsifiability handles surface at the sub-claim level despite foundational unfalsifiability. The framework’s foundational claim is teleological (the claim that this is the most important century cannot be tested within the century itself), but Karnofsky surfaces sub-claim falsifiability handles in a manner structurally distinct from genuinely vague claims. The 8,200-year mathematical-impossibility argument is falsifiable by mathematical-error demonstration. The PASTA-mechanism claim is falsifiable by demonstrating that AI-driven automation does not produce the claimed feedback loop (e.g., sub-linear scaling). The digital-people-explosion claim is falsifiable by demonstrating that infinite duplication encounters coordination costs or paradigm exhaustion at scale. The sub-claim falsifiability profile is the framework’s response to foundational unfalsifiability and is a methodologically distinct move from frameworks that simply assert untestable claims (evidenced).

7.4 §4 — What the Method Cannot Tell Us

Question 1: pillar-selection hyperparameter-search count. The framework selects four argument pillars and assigns each a narrative role in the probability-mass concentration. How many alternative pillar-combinations were considered? At minimum, the implicit selection space spans (i) capability-forecast input choice (Cotra anchors / Davidson takeoff / Grace surveys / Aschenbrenner-style OOM-counting); (ii) threat-model taxonomy (Bostrom / Ord / Hanson / Sandberg-Bostrom / mixed); (iii) historical-analogy anchor (2% growth / Roodman super-exponential / industrial-revolution case-study / agricultural-revolution analogy); (iv) outcome-category schema (three / Bostrom four / Ord five / continuous). The product is a pillar-search space of nontrivial size, and the series does not enumerate it (speculative as to magnitude of search-induced inflation; established as a methodological observation about the document).

Self-application observation. Chapter 14’s effective-N estimator (preregistration v2 §Adaptation 4, LOCKED, TIER_3_NEW_DERIVATION; default (speculative)) faces the strict analog: the manuscript’s hyperparameter-search count is not directly observable, and the pre-commitment is a composite range with geometric-mean headline and EX POST IMMUTABLE bracket-ratio thresholds 1.5x and 100x. The analogous critique on observability of an implicit search count applies to the manuscript itself.

Question 2: upstream-methodology composition propagates rather than corrects. Karnofsky’s framework draws Pillar 1 from Davidson, Pillar 2 from Cotra and Grace, Pillar 3 from Bostrom and Ord, and Pillar 4 possibly from Roodman. The inheritance is one-directional: Karnofsky cites upstream outputs and treats them as inputs. The framework does not add an error-correction layer at the inheritance points. Under positive correlation between upstream sources (Cotra, Davidson, and Grace share many assumptions about scaling laws, compute trajectories, and AI-researcher cohort selection), the integrated confidence cannot exceed the contributing literatures’ joint confidence, and the integrated framework may compound rather than diversify residual uncertainty. Whether the analogue-validity-of-inheritance holds is not examined in the series (evidenced as a structural gap).
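
The ceiling on integrated confidence can be illustrated with the Fréchet bounds on a conjunction. The Python sketch below assumes that the integrated claim requires every pillar to hold and uses made-up per-pillar confidences. Karnofsky assigns no such numbers, so the values exist only to show the shape of the constraint.

```python
def conjunction_bounds(confidences):
    """Frechet bounds on P(all claims hold) given only the marginal confidences."""
    n = len(confidences)
    lower = max(0.0, sum(confidences) - (n - 1))
    upper = min(confidences)
    return lower, upper

def independent_product(confidences):
    """P(all claims hold) if the claims were statistically independent."""
    prod = 1.0
    for c in confidences:
        prod *= c
    return prod

pillars = [0.8, 0.7, 0.9, 0.95]      # hypothetical confidences for Pillars 1-4
lower, upper = conjunction_bounds(pillars)
product = independent_product(pillars)
print(f"bounds=({lower:.2f}, {upper:.2f}), independent={product:.4f}")
```

Whatever the dependence between pillars, the joint confidence cannot exceed the weakest pillar. Where it falls inside the bounds depends on a dependence structure that the series does not specify.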

Self-application observation. Chapter 14’s information-set partition (preregistration v2 §Adaptation 2, TIER_3_NEW_DERIVATION; default (speculative) until MC Study 2 elevation per Type-I_C > 2·PBO OR Power_C < 0.5·PBO) faces the analog: the partition presupposes a sequential-causal ordering, itself an analogue-validity commitment. The MC Study 2 criterion is pre-committed to expose this dependency to falsification; the analogous inheritance-validity critique applies to the manuscript itself.

Question 3: historical-analogy reasoning validity. The This Can’t Go On argument uses a 2% historical growth rate, a 35-year doubling period, and an 8,200-year projection horizon. The mathematical layer is robust. The inference from “exponential growth cannot continue indefinitely” to “this century is the transition century” requires probability-mass concentration on the current-century portion of an unspecified-timing transition. The 8,200-year horizon does not constrain the transition to be in the current century: the transition could be in century 50 or century 80 of the projection and the mathematical argument would be equally satisfied. Karnofsky’s concentration on the current century is the load-bearing narrative leap from mathematical fact to forecast. Under what alternative anchors (Roodman 2020 super-exponential; agricultural-revolution regime change) does the conclusion shift? The series does not engage these alternatives (evidenced as a structural gap).

Self-application observation. Chapter 14’s non-IID variance estimator (preregistration v2 §Adaptation 1, TIER_2_ADAPTED_DERIVATION; default (speculative) until MC Study 1 elevation per n_critical(ρ) > 60) faces the analog: analogue-validity of an extrapolation across growth regimes is a dependence-structure question over the underlying series. The manuscript’s pre-committed response is block-bootstrap on detrended residuals; the analogous dependence-structure critique applies to the manuscript itself.

Question 4: probability-mass framing rule. Karnofsky asserts that galaxy-spanning civilization beginning this century is more likely than not, at a qualitatively-asserted 0.5 threshold. The framing aggregates probability over a 100-year window: P(galaxy-spanning beginning within the century) is structurally larger than P(beginning in any single year). The aggregation rule is the framework’s load-bearing presentational commitment. Aschenbrenner’s 2027 framing concentrates on a 4-year window; Cotra’s 2020 median-2052 concentrates on a year with reported intervals; Karnofsky’s “this century” framing aggregates over 100 years. Does aggregation over a century-long horizon inflate apparent confidence relative to year-of-TAI distributions, or does it surface honest uncertainty that point-estimate framings hide? The series does not adjudicate; the framing is asserted, not derived (evidenced as a methodological observation).
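
The effect of window length on a headline probability can be seen in a short Python sketch. It assumes a constant annual hazard, which is a simplification introduced here and not a claim made in the series, and converts between a within-window probability and the per-year probability it implies.

```python
def prob_within_window(annual_hazard, years):
    """Probability of at least one arrival within the window at a constant hazard."""
    return 1.0 - (1.0 - annual_hazard) ** years

def implied_annual_hazard(prob_within, years):
    """Constant annual hazard that yields prob_within over the given window."""
    return 1.0 - (1.0 - prob_within) ** (1.0 / years)

h = implied_annual_hazard(0.5, 100)
print(f"annual hazard implied by 0.5 over 100 years: {h:.4f}")
print(f"same hazard over a 4-year window: {prob_within_window(h, 4):.4f}")
```

Under that assumption, a 0.5 probability over a century corresponds to roughly 0.7% per year and to under 3% over a 4-year window. The same underlying belief therefore reads as bold or as cautious depending on the window over which it is reported.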

Self-application observation. Chapter 14’s capability-forecasting performance statistic (preregistration v2 §Adaptation 3, LOCKED, TIER_2 for Brier / log-loss / calibration / interval-coverage; TIER_3 for capability-Sharpe) faces the analog: no single aggregation rule is privileged for capability forecasts. The manuscript pre-commits to a per-framework primary statistic plus multi-statistic robustness panel; the analogous aggregation-rule critique applies to the manuscript itself.

7.5 §5 — Forward Reference

Chapter 2 identifies the structural pattern: the probability-mass / civilizational-significance axis is one of seven preregistered axes, none deploying an out-of-sample validation protocol at publication. Chapter 4 develops the walk-forward retrodiction protocol; Karnofsky’s framework is its limit case for teleological foundational claims. Chapter 6 extends the Bailey and López de Prado (2014) DSR derivation into capability forecasting (addressing Question 1 via Adaptation 4 and Question 3 via Adaptation 1). Chapter 14’s constructive DCF apparatus addresses Question 2 via Adaptation 2 and Question 4 via Adaptation 3.

Chapter 1.6 turns next to Grace’s expert-survey-aggregation framework, which decomposes the same forecasting question along a different axis.

7.6 References cited in this chapter

Bailey, D. H., & López de Prado, M. (2014). The deflated Sharpe ratio: Correcting for selection bias, backtest overfitting, and non-normality. Journal of Portfolio Management, 40(5), 94-107.

Bostrom, N. (2014). Superintelligence: Paths, dangers, strategies. Oxford University Press.

Cotra, A. (2020). Forecasting TAI with biological anchors. Open Philanthropy (non-peer-reviewed).

Davidson, T. (2023). What a compute-centric framework says about takeoff speeds. Open Philanthropy (non-peer-reviewed).

Grace, K., Salvatier, J., Dafoe, A., Zhang, B., & Evans, O. (2018). When will AI exceed human performance? Evidence from AI experts. Journal of Artificial Intelligence Research, 62, 729-754.

Hanson, R. (2016). The age of Em: Work, love, and life when robots rule the Earth. Oxford University Press.

Karnofsky, H. (2021). All possible views about humanity’s future are wild. Cold Takes blog (non-peer-reviewed).

Karnofsky, H. (2021). The most important century (blog series). Cold Takes (non-peer-reviewed).

Karnofsky, H. (2021). This can’t go on. Cold Takes blog (non-peer-reviewed).

Ord, T. (2020). The precipice: Existential risk and the future of humanity. Bloomsbury Publishing.

Roodman, D. (2020). Modeling the human trajectory. Open Philanthropy (non-peer-reviewed).

Stein-Perlman, Z., Weinstein-Raun, B., & Grace, K. (2022). 2022 expert survey on progress in AI. AI Impacts (non-peer-reviewed).

8 Chapter 1.6 — Mapping the Major Forecasts: Grace AI Impacts Expert Survey Aggregation Method

This chapter reconstructs the methodology of the Grace / AI Impacts expert-survey program (Grace et al., 2018; Stein-Perlman et al., 2022; Grace et al., 2024), a four-wave program spanning 2014 through 2024 (the 2022-wave blog and December 2024 reanalysis are non-peer-reviewed; the 2018 and 2024 publications are JAIR peer-reviewed). The program solicits AI researchers’ individual probability distributions over capability-arrival years, aggregates them, and reports medians, percentiles, and within-survey effects. That aggregation methodology is the object here.

8.1 §1 — What Grace Surveys

The Grace program organizes the AI-timeline question along a single axis, which the preregistered taxonomy of this manuscript names the expert-survey-aggregation axis (per-respondent probability distributions over capability-arrival years, pooled across a defined expert population, reported as median plus percentile bands plus within-survey-effect findings) (preregistration v1, §7-axis decomposition taxonomy, 2026-05-18). The axis is structurally distinct from the five prior surveyed-set axes: it is a meta-level methodology aggregating the beliefs of researchers who may themselves use OOM-counting, anchor-mixture, or other reasoning, rather than a content-axis decomposing capability progress directly. The headline output is a pooled arrival-year distribution per question — most prominently the 2024-wave HLMI 50%-probability median of 2047 (Grace et al., 2024).

Axis distinctness from Karnofsky. Both this axis and Karnofsky’s probability-mass / civilizational-significance axis (Chapter 1.5) involve probability, but they are structurally distinct. Grace aggregates many experts’ point-distributions over a year-of-arrival quantity — each respondent supplies a distribution, and the program pools them into a population aggregate. Karnofsky’s framework is a single analyst placing probability mass over civilizational-outcome categories — one allocation, no population to pool. The two share the word probability and nothing of their object (established — verifiable across research_notes/forecasts_database.json).

The two anchor questions. Two questions are preserved across the Grace waves and carry the cross-wave comparison: HLMI (High-Level Machine Intelligence — machine intelligence performing all tasks better and cheaper than humans) and FAOL (Full Automation of Labor — automation of all occupations). Each wave adds milestone, intelligence-explosion, safety-prioritization, and extinction-risk questions; the HLMI and FAOL anchors are the principal cross-wave-comparable measurements (Grace et al., 2018; Grace et al., 2024) (established — attested in each publication’s question-text appendix).

The aggregation function. Each respondent supplies a probability distribution per question (or fixed-probability/fixed-year points from which one is fitted). The program pools these, reporting the median as primary headline, the mean as secondary, and the 10th/25th/75th/90th percentiles. All respondents receive equal weight; no per-expert performance weighting is applied. The aggregation-method choice (median versus trimmed-mean versus geometric-mean, plus outlier-handling) is itself documented as a methodology variable, and the December 2024 reanalysis codebase enables independent re-aggregation (Grace et al., 2024) (established — aggregation choices documented in each publication).

Pattern G — upstream expert-elicitation lineage. The program inherits the structured-expert-elicitation tradition formalized in Cooke (1991), which established probabilistic expert judgment as a discipline with explicit performance scoring, and the prediction-aggregation literature in which Atanasov et al. (2017) show aggregation-mechanism choice to be empirical rather than theoretical. It also inherits its own predecessors — Müller and Bostrom (2014) and the earlier Grace waves, which supply the anchor-question text and longitudinal baseline. This upstream dependency is the Grace-domain instance of Pattern G (preregistration v1, §Pattern framework) and a structural feature of the surveyed set, not a defect: any expert-survey framework inherits the elicitation literature’s design and scoring conventions.

8.2 §2 — The Methodology Decomposition

The Grace aggregation can be stated as a function. Let R index the respondent population for a given wave, F_r(t) denote respondent r’s cumulative probability that the queried capability arrives by year t, and w_r denote the aggregation weight on respondent r. The program constructs the pooled distribution

\[ \bar{F}(t) = \sum_{r \in R} w_r \, F_r(t), \qquad \sum_{r \in R} w_r = 1, \]

and reports the percentile years {t : F̄(t) = p} for p ∈ {0.10, 0.25, 0.50, 0.75, 0.90}, with the 50%-probability year as the headline. The within-survey effects (framing, question-order) are measured as differences between F̄ computed on distinct question-framings or distinct question-orderings of the same population.
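The pooling step can be sketched in a few lines of Python. This is a minimal illustration, assuming the pooled distribution is the weighted linear pool of respondent CDFs with equal weights; the three ramp-shaped respondents and the helper names (`pooled_cdf`, `percentile_year`, `ramp_cdf`) are hypothetical and are not taken from the program's codebase.

```python
def pooled_cdf(cdfs, weights=None):
    """Linear opinion pool: F_bar(t) = sum_r w_r * F_r(t)."""
    n = len(cdfs)
    if weights is None:
        weights = [1.0 / n] * n  # equal weighting across respondents
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return lambda t: sum(w * f(t) for w, f in zip(weights, cdfs))


def percentile_year(cdf, p, years):
    """First year on the grid where cdf(year) >= p, or None if never reached."""
    for t in years:
        if cdf(t) >= p:
            return t
    return None


def ramp_cdf(start, end):
    """Hypothetical respondent: probability rises linearly from 0 to 1."""
    return lambda t: min(1.0, max(0.0, (t - start) / (end - start)))


# Three invented respondents: one short-timeline, one medium, one long-tailed.
respondents = [ramp_cdf(2025, 2045), ramp_cdf(2030, 2090), ramp_cdf(2040, 2240)]
pool = pooled_cdf(respondents)
years = range(2024, 2300)
summary = {p: percentile_year(pool, p, years)
           for p in (0.10, 0.25, 0.50, 0.75, 0.90)}
```

A within-survey effect in this notation would be the difference between two such `summary` tables, one per framing or question ordering, computed on the same population.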

Inputs. The empirical base is the survey-response record across four waves. Müller-Bostrom 2014 sampled approximately 170 respondents; the 2018 wave sampled NeurIPS/ICML 2015 authors (352 responses, 21% rate); the 2022 wave sampled NeurIPS/ICML 2021 authors (738 responses, 17% rate); the 2024 wave sampled six-venue authors (2,778 responses — the largest AI expert survey to date). Each respondent supplies probability points or distributions over the anchor and milestone questions; question texts are published in appendices, and the 2022 and 2024 waves use a split-survey design (randomized question subsets per respondent) (Grace et al., 2018; Stein-Perlman et al., 2022; Grace et al., 2024). A substantive framing finding recurs across waves: HLMI and FAOL, asked of the same respondents, produce timelines differing by approximately 60-105 years (established — per-wave figures and framing-effect direct in each publication).

Extrapolation and assumptions. Six modelling commitments carry the aggregate forecast: (i) NeurIPS/ICML authorship (and the 2024 six-venue frame) is a defensible proxy for the AI-research community; (ii) response rates of 17-21% produce representative aggregates despite non-response; (iii) equal weighting across respondents is appropriate — no per-expert calibration weighting; (iv) the median is the appropriate primary aggregation for log-distributed timeline responses; (v) anchor-question text is stable enough across waves to support comparison; (vi) the within-respondent framing effect is a cognitive finding about how experts reason rather than a question-design artifact.

In-sample / out-of-sample boundary. This is the load-bearing observation of the chapter and the crux of the Hypothesis-A risk the program carries. The program is empirically rich on its measurement layer: survey responses are observed data, the framing-effect and question-order findings are measured on that data, and the four-wave longitudinal series of aggregates is the surveyed set’s densest record of AI-researcher belief evolution, all (established). But survey-data richness is not forecast validation. The aggregate arrival-year distribution (HLMI 50% by 2047, FAOL 50% by 2116) is out-of-sample by construction: a forward forecast whose horizon (23 years for HLMI, 92 for FAOL from the 2024 publication) far exceeds the inter-wave spacing, so no prior wave’s forecast has been tested against realized arrival. The conflation to guard against is the one Chapter 1.4 surfaced for METR: rich observational data on the input layer does not transfer its evidential status to the forecast layer. The cross-wave shift in the HLMI 50% year, from approximately 2061 (2018) to 2059 (2022) to 2047 (2024), is an (established) measurement of how aggregate belief moved; whether that belief is correct is the out-of-sample question, and any answer to it remains (speculative).

The absent forecast-validation boundary is the bridge to Chapters 6 and 7. Chapter 6 assumes the performance statistic under correction is computed on an operational sample; the program supplies a rich within-sample measurement record but no validated forecast sample at publication. Chapter 7 must accommodate the program’s question-design multiplicity — approximately 30-60 distinct questions across the three waves, with framing variants — as a multiple-testing context. Chapter 4 develops the walk-forward retrodiction protocol that would generate the operational sample retrospectively; the four-wave longitudinal series is the closest substrate the surveyed set offers.

8.3 §3 — What the Method Achieves

Explicit distribution and confidence-band disclosure. The program reports not only aggregate medians but the full percentile distribution (10th/25th/75th/90th) per question, and where feasible the individual-respondent distributions underlying the aggregate. This preserves uncertainty information that headline-median publications discard: median 2047 with 10th percentile 2027 and 90th percentile 2150 conveys materially different information than the same median with a tight 2040-2060 band. Within the surveyed set this is a distinctive discipline — Aschenbrenner reports a single-point 2027 with no quantitative interval, and Cotra’s anchor-mixture reports a derived distribution rather than the spread of underlying beliefs; the Grace program shows the belief distribution directly (Grace et al., 2024) (established — percentile reporting direct in each publication).

Expert-pool diversity against single-author forecasts. The program aggregates hundreds-to-thousands of independent researcher responses (352 in 2018, 738 in 2022, 2,778 in 2024) drawn from an explicit, reproducible sampling frame. Four of the five prior surveyed forecasts are single-author or single-organization constructions; the Grace program substitutes a defined population for a single forecaster’s judgment, and the explicit frame makes the population’s boundaries auditable — a reader can assess whether NeurIPS/ICML authorship matches the population they care about (established — sample sizes and frame definitions direct across waves). The 2,778-respondent 2024 wave is the largest to date, tightening the aggregate’s confidence interval and supporting sub-group analyses smaller surveys cannot.

Reproducible, repeatable, longitudinally comparable methodology. The program publishes question-text appendices, releases response datasets, and (December 2024) released an open-source reanalysis codebase enabling independent re-aggregation under alternative method choices. The successive waves preserve the HLMI and FAOL anchor questions, producing the surveyed set’s densest longitudinal series — four waves across roughly a decade, against METR’s six-month transition, Cotra’s single two-year update, and Davidson’s single verbal update (established — appendices, datasets, and codebase release direct in the publication record). This is the surveyed set’s clearest analog to the reproducibility discipline this manuscript imposes on itself.

Honest disclosure of within-survey inconsistency. Rather than reconciling the HLMI-FAOL framing gap or the milestone-versus-HLMI inconsistency through aggregation choices, the program reports both estimates and characterizes the discrepancy explicitly. The framing effect — same respondents, logically close questions, timelines differing by 60-105 years — is presented as a finding, and the 2024 wave documents that respondents assigning ≥50% to specific 2028 milestones are not coherently those assigning 50% to HLMI 2047 (Grace et al., 2024) (evidenced — inconsistency-reporting discipline direct; its relative strength across waves is calibration-dependent). This inherits the Cooke (1991) discipline of treating expert inconsistency as measurable information rather than noise to be smoothed away.

8.4 §4 — What the Method Cannot Tell Us

Four questions surface from the program’s structure; under Rule 3 (self-application), each maps to an audit-chapter section and to an analogous question facing this manuscript’s own Chapter 14 construct.

Question 1: expert-selection hyperparameter-search count. The program defines its expert pool via NeurIPS/ICML authorship (2018, 2022) and a six-venue frame (2024). How many alternative sampling frames were considered? The implicit selection space spans venue choice (NeurIPS/ICML versus six venues versus citation-based frames versus industry researchers who do not publish externally), contact-frame construction, response-threshold handling, and sub-field weighting. The 2024 venue expansion is itself a frame change confounded with the measured belief shift, and the program does not enumerate the frame-selection space as a search (speculative as to magnitude of selection-induced inflation; established as a methodological observation about the program).

Self-application observation. Chapter 14’s effective-N estimator (preregistration v2 §Adaptation 4; LOCKED, TIER_3_NEW_DERIVATION; default (speculative)) faces the strict analog: the manuscript’s own which-forecasts-to-survey frame is an implicit search count, not directly observable. The pre-commitment is a composite range with geometric-mean headline and EX POST IMMUTABLE bracket-ratio thresholds 1.5x and 100x. The analogous observability critique applies to the manuscript itself.

Question 2: aggregation-function validity. The program reports the median-of-distributions as headline, with equal weighting. Does the median of a pooled belief distribution reflect the true epistemic distribution, or impose a convenient meta-uncertainty structure that the documented aggregation-method sensitivity (median versus trimmed-mean versus geometric-mean) shows is choice-dependent? The headline 2047 is one number in a small family of defensible aggregates; the program documents — but does not resolve — the headline’s sensitivity to the choice (evidenced as a gap; the sensitivity reporting acknowledges the issue without privileging a rule).
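The choice-dependence of the headline can be made concrete with a small Python sketch. It is a simplification: it pools one point horizon per respondent rather than full distributions, and the ten horizons are invented for illustration, not drawn from any survey wave.

```python
import statistics


def trimmed_mean(values, trim_fraction):
    """Mean after dropping the lowest and highest trim_fraction of values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    return statistics.fmean(ordered[k:len(ordered) - k])


# Hypothetical per-respondent median horizons, in years after the survey date.
# The long right tail mimics the skew typical of timeline responses.
horizons = [5, 8, 10, 15, 20, 25, 30, 50, 100, 400]

headlines = {
    "median": statistics.median(horizons),
    "trimmed_mean_10pct": trimmed_mean(horizons, 0.10),
    "geometric_mean": statistics.geometric_mean(horizons),
    "mean": statistics.fmean(horizons),
}

for rule, value in headlines.items():
    print(f"{rule:>20}: {value:6.1f} years ahead")
```

On this invented sample the four rules order themselves median, geometric mean, trimmed mean, arithmetic mean, and the largest is roughly three times the smallest. The sketch shows only that the spread across rules can be large on skewed data; it says nothing about its size in the actual survey responses.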

Self-application observation. Chapter 14’s capability-forecasting performance statistic (preregistration v2 §Adaptation 3; LOCKED, TIER_2 for Brier/log-loss/calibration/interval-coverage; TIER_3 for capability-Sharpe) faces the analog: no single aggregation rule is privileged for capability forecasts. The manuscript pre-commits to a per-framework primary statistic plus a multi-statistic robustness panel. The analogous aggregation-rule critique — under which rule the headline is computed — applies to the manuscript itself.

Question 3: expert-calibration absence. The structured-elicitation literature the program inherits (Cooke, 1991) weights experts by performance on seed variables with known answers, and this performance weighting outperforms equal weighting in roughly 80% of validation studies (Cooke & Goossens, 2008). The Grace experts have no track record on AGI predictions — no seed variables exist for capability-arrival questions — so Cooke-style weights cannot be computed, and equal weighting is imposed by necessity rather than chosen as optimal. The IPCC AR6 ice-sheet elicitation (Bamber et al., 2019) could apply Cooke weighting to long-horizon forecasts because seed-variable calibration was available; the Grace setting forecloses that route (evidenced as a structural constraint; the equal-weighting necessity is a documented consequence of the no-seed-variable setting).
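To show what the Grace setting lacks, the following Python sketch computes a calibration score in the style of Cooke's classical model for experts who gave 5%, 50%, and 95% quantiles on seed variables with known answers. It is a partial sketch under stated assumptions: the information score and the weight cutoff of the full classical model are omitted, the seed-variable counts are invented, and the closed-form chi-square tail applies only to the four-bin, three-degrees-of-freedom case used here.

```python
import math

# Probability that a realization falls in each inter-quantile bin
# (below 5%, 5-50%, 50-95%, above 95%) for a perfectly calibrated expert.
BIN_PROBS = (0.05, 0.45, 0.45, 0.05)


def chi2_sf_3df(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def calibration_score(bin_counts):
    """p-value that the observed bin frequencies are consistent with BIN_PROBS."""
    n = sum(bin_counts)
    relative_information = sum(
        (c / n) * math.log((c / n) / p)
        for c, p in zip(bin_counts, BIN_PROBS)
        if c > 0
    )
    return chi2_sf_3df(2 * n * relative_information)


def performance_weights(scores):
    """Normalize calibration scores into aggregation weights."""
    total = sum(scores)
    if total == 0:
        raise ValueError("all scores are zero; no expert can be weighted")
    return [s / total for s in scores]


# Hypothetical experts, each scored on 20 seed variables.
seed_results = {
    "well_calibrated": (1, 9, 9, 1),
    "slightly_overconfident": (2, 8, 8, 2),
    "badly_overconfident": (6, 4, 4, 6),
}
scores = {name: calibration_score(c) for name, c in seed_results.items()}
weights = dict(zip(scores, performance_weights(list(scores.values()))))
```

The input the sketch requires, `seed_results`, is exactly what capability-arrival questions do not supply: without realized answers there are no bin counts, so no score and no weight can be computed.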

Self-application observation. Chapter 14’s information-set partition (preregistration v2 §Adaptation 2; TIER_3_NEW_DERIVATION; default (speculative) until MC Study 2 elevation per Type-I_C > 2·PBO OR Power_C < 0.5·PBO) faces the analog: the partition presupposes a performance-attestation ordering over the historical record — the same presupposition that fails when no seed variable calibrates the forecaster. The MC Study 2 criterion is pre-committed to expose this dependency; the analogous calibration-attestation critique applies to the manuscript itself.

Question 4: effective sample size under correlated respondents. The 2024 wave reports 2,778 respondents, but the effective number of independent opinions is materially smaller. Respondents share training pipelines, read overlapping literature, cite the same upstream forecasts (Cotra, Davidson, Aschenbrenner sit in the field’s shared priors), and are drawn from a small venue set. Positive correlation shrinks the effective sample size below nominal, widening the true confidence interval on the 2047 median beyond what nominal n implies. The program reports bootstrapped intervals over the population but publishes no effective-N adjustment for the cross-respondent correlation structure (evidenced as a gap; shared-prior correlation is a structural feature of a single-community sampling frame).
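The size of the shrinkage can be illustrated with the standard equicorrelation formula n_eff = n / (1 + (n - 1)ρ). The sketch below is illustrative only: the correlation values are hypothetical, since the program publishes no estimate of ρ, and the formula is exact for the variance of a mean and only an approximation for a median.

```python
import math


def effective_n(n, rho):
    """Effective number of independent observations when every pair of
    respondents shares the same correlation rho."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    return n / (1.0 + (n - 1) * rho)


def interval_widening(n, rho):
    """Factor by which a standard error grows relative to the independent case."""
    return math.sqrt(n / effective_n(n, rho))


nominal = 2778  # respondents in the 2024 wave
table = {
    rho: (effective_n(nominal, rho), interval_widening(nominal, rho))
    for rho in (0.0, 0.01, 0.05, 0.20)  # hypothetical correlation levels
}

for rho, (n_eff, widening) in table.items():
    print(f"rho={rho:4.2f}  n_eff={n_eff:8.1f}  interval x{widening:5.2f}")
```

Even a pairwise correlation of 0.01 reduces 2,778 nominal respondents to fewer than 100 effective ones under this model, which is why the absence of a published adjustment matters.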

Self-application observation. Chapter 14’s non-IID variance estimator (preregistration v2 §Adaptation 1; TIER_2_ADAPTED_DERIVATION; default (speculative) until MC Study 1 elevation per n_critical(ρ) > 60) faces the strict analog: the correction depends on dependence structure across the pooled observations. If experts share community priors — exactly as benchmark series share lab provenance — the effective sample size shrinks below nominal and the correction is more severe. The manuscript pre-commits to block-bootstrap on detrended residuals; the analogous dependence-structure critique applies to the manuscript itself.

8.5 §5 — Forward Reference

Chapter 2 identifies the structural pattern: the expert-survey-aggregation axis is one of seven preregistered decomposition axes, and the program’s survey-layer richness — structurally similar to METR’s — does not substitute for out-of-sample validation of the forward forecast. Chapter 4 develops the walk-forward retrodiction protocol; the four-wave longitudinal series is the densest substrate the surveyed set offers. Chapter 6 extends the Bailey and López de Prado (2014) DSR derivation, addressing Question 2 via Adaptation 3 and Question 4 via Adaptation 1. Chapter 7 develops the question-design-multiplicity treatment (Question 3 via Adaptation 2). Chapter 14’s constructive DCF apparatus integrates all four adaptations and addresses Question 1 via Adaptation 4.

Chapter 1.7 turns next to Epoch AI’s compute-scaling-empirical framework, the seventh and final surveyed-set axis.

8.6 References cited in this chapter

Atanasov, P., Rescober, P., Stone, E., Swift, S. A., Servan-Schreiber, E., Tetlock, P., Ungar, L., & Mellers, B. (2017). Distilling the wisdom of crowds: Prediction markets vs. prediction polls. Management Science, 63(3), 691-706.

Bailey, D. H., & López de Prado, M. (2014). The deflated Sharpe ratio: Correcting for selection bias, backtest overfitting, and non-normality. Journal of Portfolio Management, 40(5), 94-107.

Bamber, J. L., Oppenheimer, M., Kopp, R. E., Aspinall, W. P., & Cooke, R. M. (2019). Ice sheet contributions to future sea-level rise from structured expert judgment. Proceedings of the National Academy of Sciences, 116(23), 11195-11200.

Cooke, R. M. (1991). Experts in uncertainty: Opinion and subjective probability in science. Oxford University Press.

Cooke, R. M., & Goossens, L. H. J. (2008). TU Delft expert judgment data base. Reliability Engineering & System Safety, 93(5), 657-674.

Grace, K., Salvatier, J., Dafoe, A., Zhang, B., & Evans, O. (2018). When will AI exceed human performance? Evidence from AI experts. Journal of Artificial Intelligence Research, 62, 729-754.

Grace, K., Stewart, H., Sandkühler, J. F., Thomas, S., Weinstein-Raun, B., Brauner, J., & Korzekwa, R. C. (2024). Thousands of AI authors on the future of AI (arXiv:2401.02843). AI Impacts; subsequently Journal of Artificial Intelligence Research, 84:9 (2025).

Müller, V. C., & Bostrom, N. (2014/2016). Future progress in artificial intelligence: A survey of expert opinion. In V. C. Müller (Ed.), Fundamental issues of artificial intelligence (pp. 553-571). Springer.

Stein-Perlman, Z., Weinstein-Raun, B., & Grace, K. (2022). 2022 expert survey on progress in AI. AI Impacts (non-peer-reviewed).

9 Chapter 1.7 — Mapping the Major Forecasts: Epoch AI’s Compute-Scaling-Empirical Methodology

This chapter reconstructs the methodology of Epoch AI’s compute-trend program (Sevilla et al., 2022; Epoch AI, 2024 (non-peer-reviewed); the ongoing public model database (non-peer-reviewed)), the seventh and final framework in the surveyed set. Epoch does not produce an AGI-arrival forecast: it estimates training compute per landmark AI system from public information, fits log-linear trends across the historical record, and reports doubling rates with bootstrapped confidence intervals. That compute-estimation methodology is the object here.

9.1 §1 — What Epoch Measures

Epoch organizes the compute side of the AGI-timeline question along a single axis, which the preregistered taxonomy of this manuscript names the compute-scaling-empirical axis (per-model training-FLOP estimates derived from public information, aggregated into log-linear trends per identified era, reported as doubling rates with bootstrapped confidence intervals) (preregistration v1, §7-axis decomposition taxonomy, 2026-05-18). The axis is structurally distinct from the six prior surveyed-set axes in one decisive respect: it is measurement infrastructure rather than forecast production. Aschenbrenner, Cotra, Davidson, METR, Karnofsky, and Grace each decompose the AGI-arrival question directly; Epoch decomposes the upstream empirical question those frameworks consume: how much compute did each landmark system actually use, and at what rate has that quantity grown? The headline empirical claim is frontier-compute growth of approximately 4-5x per year over 2010-2024 (Epoch AI, 2024), equivalent to a doubling time of roughly five to six months.

The dual structural role. Epoch occupies two positions at once, and the distinction is load-bearing for the rest of this manuscript. First, Epoch is a standalone framework analyzable on its own terms, with its own estimation procedure, assumptions, and in-sample / out-of-sample boundary — the subject of §2 through §4. Second, Epoch is the upstream substrate the three compute-driven forecasts already surveyed depend on: Aschenbrenner’s first OOM driver takes Epoch’s GPT-2/3/4 estimates as its empirical anchor (Chapter 1.1, §1, Driver 1); Davidson’s takeoff-speeds simulator takes Epoch’s compute trajectories as parameter inputs (Chapter 1.3); Cotra’s hardware-cost extrapolation uses Epoch-style methodology and the connected algorithmic-efficiency rates (Chapter 1.2). The two roles do not collapse: Epoch’s contribution is the methodology, and the downstream forecasts depend on Epoch rather than the reverse.

Pattern G is reflexive here. Each of the three compute-driven chapters acknowledged an upstream-methodology dependency under the manuscript’s Pattern G (preregistration v1, §Pattern framework). Epoch is that upstream methodology. Chapter 1.7’s Pattern G is therefore not an external-dependency acknowledgement but a reflexive one: the substrate the prior chapters named as the thing they depend on is now the framework under reconstruction. This is a structural feature of the surveyed set — a single public compute-estimation methodology underlies three of its seven entries — and not a defect of Epoch’s program. The reflexivity is uniquely Epoch-specific, as Cotra’s 2022 self-revision was uniquely Cotra-specific (established — the citation network is one-directional and verifiable in research_notes/forecasts_database.json).

9.2 §2 — The Methodology Decomposition

The Epoch methodology can be stated as a two-stage function. Let C_m denote the estimated training compute of model m in FLOP, constructed at the per-model stage as

\[ C_m = f_m \cdot u_m \cdot p_m \cdot d_m, \]

the product of per-processor throughput f_m (FLOP/s), utilization fraction u_m, processor count p_m, and training duration d_m (in seconds). The second stage aggregates the per-model estimates by release date and fits a log-linear trend per identified era, log C(t) = log C(t₀) + (t − t₀)·log 2 / τ, with the era-specific doubling time τ estimated by regression and reported with a bootstrapped confidence interval.
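Both stages can be sketched in Python. This is a minimal illustration, not Epoch's code: the hardware figures and the five-system era are hypothetical, the era is constructed to grow at exactly 4x per year so the fit can be checked by eye, and ordinary least squares on log2(C) stands in for whatever regression the program actually runs.

```python
import math


def training_flop(peak_flop_per_s, utilization, processors, seconds):
    """Stage 1: four-factor estimate C_m = f_m * u_m * p_m * d_m, in FLOP."""
    return peak_flop_per_s * utilization * processors * seconds


def fit_doubling_time(release_years, flop_estimates):
    """Stage 2: least-squares fit of log2(C) on release date.
    The slope is doublings per year, so its reciprocal is tau in years."""
    ys = [math.log2(c) for c in flop_estimates]
    n = len(ys)
    mean_t = sum(release_years) / n
    mean_y = sum(ys) / n
    sxx = sum((t - mean_t) ** 2 for t in release_years)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(release_years, ys))
    return sxx / sxy


# Hypothetical system: 1,000 processors at 312 TFLOP/s peak,
# 40% utilization, trained for 30 days.
example_flop = training_flop(312e12, 0.40, 1_000, 30 * 86_400)

# Hypothetical era growing exactly 4x per year from 1e20 FLOP in 2016.
years = [2016, 2018, 2020, 2022, 2024]
flops = [1e20 * 4.0 ** (t - 2016) for t in years]

tau = fit_doubling_time(years, flops)  # doubling time in years
annual_growth = 2.0 ** (1.0 / tau)     # growth factor per year
```

On this constructed era the fit recovers a doubling time of half a year, which is the same statement as 4x annual growth: the two reporting conventions are related by growth = 2^(1/τ).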

Inputs. The empirical base is public information across four input families per model: training papers (architectures, hyperparameters, training-procedure disclosures); hardware specifications (manufacturer peak FLOP/s, utilization-fraction adjusted); training-time disclosures (hours, days, weeks); and infrastructure announcements (cluster size, GPU count). Models lacking sufficient public information are flagged as non-disclosed rather than imputed. The public model database tracks approximately 3,200 systems from roughly 1950 to the present; the foundational paper used approximately 100-200, and the 2024 trend refinement used a 333-model frontier subset (Sevilla et al., 2022; Epoch AI, 2024) (established — methodology multiply attested across the canonical surface set).

Extrapolation and assumptions. Six modelling commitments carry the methodology: (i) the four-factor product, with a 30-50% utilization fraction, approximates actual training compute; (ii) public training-time disclosures are accurate within an order of magnitude; (iii) infrastructure announcements correspond to actual rather than aspirational capacity; (iv) the log-linear functional form is appropriate, against super-exponential, broken-power-law, or regime-switching alternatives; (v) the two cut-offs at 2010 and 2015, which separate the three eras, are empirically robust; (vi) the database’s selection criteria render it sufficiently representative of the underlying system population.

In-sample / out-of-sample boundary. This is the load-bearing observation of the chapter, and it differs from the six prior chapters in form. Epoch’s historical compute estimates and the trends fit on them are in-sample by construction — and unusually clean: the 2022 → 2024 → 2026 dataset expansions function as a partial out-of-sample test of the methodology, with the 4-5x/year trend reproducing across the 333-model expansion (Epoch AI, 2024). But the forward step — projecting the post-2015 rate beyond the observed window — is out-of-sample in exactly the sense Chapter 4 formalizes, and Epoch’s own three-era decomposition is direct evidence that the rate is not stationary across regimes. The labeling distinction is therefore sharp and must be stated explicitly: the historical estimates are (established) within the methodology; the forward extrapolation to future systems is (speculative) per the calibration applied to outputs beyond the observed range. The absent forward-validation boundary is the bridge to Chapters 6 and 7: Chapter 6 assumes the performance statistic under correction is computed on an operational sample, which the forward extrapolation lacks at publication, and Chapter 4 develops the walk-forward retrodiction protocol — for which Epoch’s dataset-version expansions are the surveyed set’s clearest measurement-layer analog.

9.3 §3 — What the Method Achieves

The most rigorous public compute-estimation methodology in the surveyed-set domain. The four-factor procedure, triangulated across training papers, hardware specifications, training-time disclosures, and infrastructure announcements, beats both academic-paper-only methods (which depend on inconsistent per-paper FLOP reporting) and closed-vendor estimates (which depend on vendor disclosure decisions). It is documented at organizational level, reproducible from public information, and operationalized at scale across the 3,200-model database. The predecessor (Amodei & Hernandez, 2018 (non-peer-reviewed)) used roughly 30-50 systems and a single-rate model; Sevilla et al. (2022) expanded the sample by an order of magnitude, introduced the three-era decomposition, and submitted it to peer review (established — multiply attested across Sevilla et al., 2022 and Epoch’s methodology documentation).

Per-estimate-grounded uncertainty rather than stipulated error bars. Each estimate carries an explicit uncertainty bound where public information is sufficient and a non-disclosure flag where it is not; the trend-level bootstrapped confidence intervals aggregate those per-estimate uncertainties rather than imposing a distribution on the whole trend. This non-parametric discipline accommodates the non-Gaussian shape of compute-estimate error and is more honest than the qualitative-error-bar framing of Aschenbrenner or the implicit per-anchor uncertainty of Cotra’s mixture weights (evidenced — bootstrapped-CI methodology is established applied-statistics practice).
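One way such an interval could be built is sketched below in Python. It is an assumption-laden illustration rather than Epoch's procedure: models are resampled with replacement, each resampled estimate is redrawn log-uniformly between its lower and upper bound (a hypothetical choice of within-bound distribution), and the percentile interval is taken on the slope before converting to a doubling time. The nine models and their factor-of-two bounds are invented.

```python
import math
import random


def slope_log2(points):
    """Least-squares slope of log2(compute) on release date (doublings per year)."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(math.log2(c) for _, c in points) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    sxy = sum((t - mean_t) * (math.log2(c) - mean_y) for t, c in points)
    return sxy / sxx


def bootstrap_doubling_time(models, draws=2000, level=0.90, seed=0):
    """Percentile interval for the doubling time, in years.
    `models` holds (release_year, lower_flop, upper_flop) tuples."""
    rng = random.Random(seed)
    slopes = []
    while len(slopes) < draws:
        sample = []
        for _ in range(len(models)):
            t, low, high = rng.choice(models)  # resample models
            flop = math.exp(rng.uniform(math.log(low), math.log(high)))
            sample.append((t, flop))           # redraw within the bounds
        if len({t for t, _ in sample}) > 1:    # slope undefined on one date
            slopes.append(slope_log2(sample))
    slopes.sort()
    low_slope = slopes[int(draws * (1 - level) / 2)]
    high_slope = slopes[int(draws * (1 + level) / 2) - 1]
    if low_slope <= 0:
        raise ValueError("interval includes non-positive growth")
    return 1.0 / high_slope, 1.0 / low_slope   # faster growth, shorter tau


# Hypothetical models on a 4x-per-year trend, each known to within a factor of 2.
models = [(t, 1e20 * 4.0 ** (t - 2016) / 2, 1e20 * 4.0 ** (t - 2016) * 2)
          for t in range(2016, 2025)]
interval = bootstrap_doubling_time(models)
```

Because the per-estimate bounds enter every draw, wider bounds on individual models widen the trend-level interval directly, with no distributional form stipulated for the trend itself.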

Public dataset releases with version history. Successive releases carry explicit version increments — approximately 100-200 models in 2022, a 333-model frontier subset in 2024, 3,200+ by 2026 — so an audit at any future point can reproduce a prior trend analysis from the published version. Within the surveyed set only METR’s v1.0 → v1.1 transition approaches this discipline; the other four quantitative frameworks lack dataset version control at a comparable level (established — version-history trajectory directly attestable from the organizational repository). This is the surveyed set’s clearest analog to the reproducibility discipline this manuscript imposes on itself.

The upstream substrate is a methodological public good. Epoch enables three of the six other surveyed forecasts by supplying a shared compute foundation: without it, Aschenbrenner, Davidson, and Cotra would each require independent compute estimation, and the divergence between three private methodologies would be unauditable. A single public, versioned, peer-reviewed substrate makes their compute inputs inspectable on common ground. This is a substantive merit and not a sign that Epoch is derivative — the dependency runs downstream from Epoch, and the substrate pre-exists and independently supports the frameworks that consume it (established — one-directional citation network verifiable in research_notes/forecasts_database.json).

9.4 §4 — What the Method Cannot Tell Us

Four questions surface from the methodology’s structure; under Rule 3 (self-application), each maps to an audit-chapter section and to an analogous question facing this manuscript’s own Chapter 14 construct.

Question 1: landmark-system-selection search count. The database admits a system on a four-criteria disjunction (citation count above 1,000; state-of-the-art performance; historical significance; substantial product deployment) and the trend analyses run on subsets — 333 frontier models here, segment cuts there. How many alternative selection-and-subset choices were available before these were fixed? The implicit space spans inclusion thresholds, frontier definitions, era cut-off years, and segment partitions, and the program reports the chosen subset without enumerating the selection space as a search (speculative as to magnitude of selection-induced inflation; established as a methodological observation about the program).

Self-application observation. Chapter 14’s effective-N estimator (preregistration v2 §Adaptation 4; LOCKED, TIER_3_NEW_DERIVATION; default (speculative)) faces the strict analog: the manuscript’s own which-forecasts-to-survey-and-which-subset frame is an implicit search count, not directly observable. The pre-commitment is a composite range with geometric-mean headline and EX POST IMMUTABLE bracket-ratio thresholds 1.5x and 100x. The analogous observability critique applies to the manuscript itself.

Question 2: published-report reliability. The methodology presumes that public training-compute reports reflect actual compute consumed. For academic and open-source models this is defensible; for closed-vendor frontier systems, where competitive pressure may bias what is disclosed, estimated, marketed, or omitted, the input itself is an attested quantity of uncertain provenance. The methodology flags non-disclosure but does not certify that disclosed figures are free of marketing-driven distortion; the disclosure-asymmetry between academic and commercial models is a structural feature, not a measurement error that the four-factor product resolves (evidenced as a gap; disclosure-asymmetry is the load-bearing exposure of any public-information-only estimation discipline).

Self-application observation. Chapter 14’s information-set partition (preregistration v2 §Adaptation 2; TIER_3_NEW_DERIVATION; default (speculative) until MC Study 2 elevation per Type-I_C > 2·PBO OR Power_C < 0.5·PBO) faces the analog: the partition presupposes a performance-attestation structure over the historical inputs — the same presupposition that strains when a disclosed input cannot be certified as the actual quantity. The MC Study 2 criterion is pre-committed to expose this dependency; the analogous critique applies to the manuscript itself.

Question 3: scaling-law extrapolation horizon. The post-2015 doubling rate is fit in-sample; the operationally relevant downstream use projects it forward. How long can the current fit extrapolate before a regime change — compute-supply disruption, algorithmic-efficiency saturation, a hardware S-curve? The methodology’s own two empirical cut-offs (2010, 2015) and the acknowledged post-2018 frontier rate of 4.2x/year are direct evidence that the rate shifts; the program reports these transparently but does not formalize a horizon over which forward extrapolation is defensible (evidenced as a gap; the multi-segment reporting acknowledges the issue without resolving the horizon).
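
The horizon question can be made concrete with compounding arithmetic. The sketch below is illustrative only and is not part of Epoch’s methodology: it takes the two ends of the 4-5x/year aggregate range quoted in this chapter and shows how far apart the resulting projections drift as the horizon lengthens. The function names and the choice of one-, three-, and five-year horizons are assumptions made for illustration.

```python
import math


def doubling_time_months(growth_per_year: float) -> float:
    """Months needed to double at a constant annual growth multiple."""
    if growth_per_year <= 1.0:
        raise ValueError("growth_per_year must exceed 1")
    return 12.0 * math.log(2.0) / math.log(growth_per_year)


def projection_spread(low_rate: float, high_rate: float, years: float) -> float:
    """Ratio of the high-rate projection to the low-rate projection after `years`."""
    return (high_rate / low_rate) ** years


if __name__ == "__main__":
    # Hypothetical horizons; the 4x and 5x rates are the ends of the quoted range.
    for horizon in (1, 3, 5):
        spread = projection_spread(4.0, 5.0, horizon)
        print(f"{horizon}-year horizon: projections differ by {spread:.2f}x")
```

Even with no regime change, the two ends of the quoted range differ by about a factor of three at a five-year horizon, so the uncertainty in the fitted rate alone grows geometrically with the projection length.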

Self-application observation. Chapter 14’s capability-forecasting performance statistic (preregistration v2 §Adaptation 3; LOCKED, TIER_2 for Brier / log-loss / calibration / interval-coverage; TIER_3 for capability-Sharpe) faces the analog: the window over which a statistic is computed and projected is itself a choice. The manuscript pre-commits to a primary statistic plus a multi-statistic robustness panel rather than a single horizon; the analogous extrapolation-window critique applies to the manuscript itself.

Question 4: effective sample size under correlated observations. The database holds thousands of per-model estimates, but the count of independent compute observations is materially smaller. Organizations cluster (the per-vendor rates already span 4.9x to 7.1x/year); methodology drifts across the 1950-2026 span; and publication strategies are correlated, since the same labs disclose under similar competitive incentives. Positive correlation shrinks the effective sample size below nominal and widens the true confidence interval on the 4-5x/year aggregate beyond what the nominal count implies. The program reports bootstrapped intervals but publishes no effective-N adjustment for this cross-observation correlation structure (evidenced as a gap; shared-provenance correlation is a structural feature of a small-organization frontier).
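
The shrinkage can be sketched under one simplifying assumption that the program itself does not make: that every pair of observations shares the same positive correlation rho (equicorrelation). Under that assumption the variance of the mean is sigma^2 / n multiplied by (1 + (n - 1) * rho), which gives the effective sample size below. The function names and the example values of n and rho are hypothetical.

```python
def effective_n(n: int, rho: float) -> float:
    """Effective sample size of n equally correlated observations.

    Assumption: every pair has the same correlation rho in [0, 1], so
    Var(mean) = sigma^2 / n * (1 + (n - 1) * rho).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("this sketch covers non-negative correlation only")
    return n / (1.0 + (n - 1) * rho)


def interval_inflation(n: int, rho: float) -> float:
    """Factor by which a confidence-interval half-width grows versus the IID case."""
    return (n / effective_n(n, rho)) ** 0.5


if __name__ == "__main__":
    # Hypothetical inputs: 1,000 nominal observations, pairwise correlation 0.05.
    print(effective_n(1000, 0.05), interval_inflation(1000, 0.05))
```

Real dependence in a compute database is clustered by organization rather than uniform, so this formula is a rough bound on intuition, not an estimator for the database.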

Self-application observation. Chapter 14’s non-IID variance estimator (preregistration v2 §Adaptation 1; TIER_2_ADAPTED_DERIVATION; default (speculative) until MC Study 1 elevation per n_critical(ρ) > 60) faces the strict analog: the correction depends on dependence structure across pooled observations. If models share lab provenance — exactly as the per-vendor rate spread indicates — the effective sample size shrinks below nominal and the correction is more severe. The manuscript pre-commits to block-bootstrap on detrended residuals; the analogous dependence-structure critique applies to the manuscript itself.
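
A minimal sketch of the block-bootstrap idea follows. It is not the manuscript’s pre-committed estimator: a synthetic AR(1) series stands in for detrended residuals, and the block length, seeds, replication count, and function names are all assumptions chosen for illustration. Setting the block length to 1 recovers the ordinary IID bootstrap, which makes the effect of ignoring dependence directly comparable.

```python
import random
import statistics


def ar1_series(n, phi, seed):
    """Synthetic AR(1) series, a hypothetical stand-in for detrended residuals."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out


def moving_block_bootstrap_se(series, block_len, n_boot, seed):
    """Bootstrap standard error of the mean, resampling overlapping blocks."""
    n = len(series)
    if not 1 <= block_len <= n:
        raise ValueError("block_len must be between 1 and len(series)")
    rng = random.Random(seed)
    starts = range(n - block_len + 1)      # admissible block start positions
    n_blocks = -(-n // block_len)          # ceiling division
    means = []
    for _ in range(n_boot):
        sample = []
        for _ in range(n_blocks):
            s = rng.choice(starts)
            sample.extend(series[s:s + block_len])
        means.append(statistics.fmean(sample[:n]))  # trim to original length
    return statistics.stdev(means)


if __name__ == "__main__":
    residuals = ar1_series(300, phi=0.8, seed=1)
    se_block = moving_block_bootstrap_se(residuals, block_len=20, n_boot=500, seed=2)
    se_iid = moving_block_bootstrap_se(residuals, block_len=1, n_boot=500, seed=2)
    print(f"block SE = {se_block:.3f}, IID SE = {se_iid:.3f}")
```

On positively autocorrelated input the block version reports a wider standard error than the IID version, which is the direction of correction the paragraph above describes.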

9.5 §5 — Forward Reference

Chapter 2 identifies the structural pattern: the compute-scaling-empirical axis is the seventh of seven preregistered axes, none deploying an out-of-sample forward-validation protocol at publication. Chapter 2 additionally integrates Epoch’s distinctive role as the shared empirical substrate underlying three of the six prior forecasts (Aschenbrenner, Cotra, Davidson) — so the Pattern G reflexivity surfaced in §1 is a document-architecture feature. Chapter 4 develops the walk-forward retrodiction protocol; Epoch’s dataset-version expansions are the densest measurement-layer substrate the surveyed set offers. Chapter 6 extends the Bailey and López de Prado (2014) DSR derivation, addressing Question 4 via Adaptation 1 and Question 3 via Adaptation 3. Chapter 7 addresses Question 2 via Adaptation 2. Chapter 14’s constructive DCF apparatus integrates all four adaptations and addresses Question 1 via Adaptation 4.

With Epoch mapped, the surveyed set is complete. Chapter 2 turns to the common methodological pattern across all seven.

9.6 References cited in this chapter

Amodei, D., & Hernandez, D. (2018, May). AI and compute. OpenAI (non-peer-reviewed). https://openai.com/research/ai-and-compute

Bailey, D. H., & López de Prado, M. (2014). The deflated Sharpe ratio: Correcting for selection bias, backtest overfitting, and non-normality. Journal of Portfolio Management, 40(5), 94-107.

Epoch AI. (2024). Training compute of frontier AI models grows by 4-5x per year (non-peer-reviewed). https://epoch.ai/blog/training-compute-of-frontier-ai-models-grows-by-4-5x-per-year

Sevilla, J., Heim, L., Ho, A., Besiroglu, T., Hobbhahn, M., & Villalobos, P. (2022). Compute trends across three eras of machine learning. International Joint Conference on Neural Networks (IJCNN). https://doi.org/10.1109/IJCNN55064.2022.9891914

10 Chapter 2 — The Common Methodological Pattern

Part I reconstructed seven AGI-timeline forecasts one at a time. This chapter changes the unit of analysis: where Chapters 1.1 through 1.7 each held a single methodology under reconstruction, Chapter 2 holds the set under synthesis. It asks what the seven share, what their differences reveal about a single classificatory taxonomy, and what one structural feature propagates into the framework derivations of Chapters 3 through 16.

The surveyed forecasts are the object of analysis, not an adversary. This manuscript extends the methodology canon of Bailey, López de Prado, and Harvey into capability forecasting; it does not demolish the forecasts it surveys, several of which exceed the canon’s own reproducibility standards (§3).

10.1 §1 — What the Seven Forecasts Share

Across the seven reconstructions, the methodologies diverge structurally to a degree that resists casual grouping. Aschenbrenner sums orders of magnitude of effective compute in log-space (Aschenbrenner, 2024; Chapter 1.1). Cotra mixes six biological-anchor compute distributions under subjective weights (Cotra, 2020, 2022; Chapter 1.2). Davidson integrates a coupled differential-equations system over roughly seventy parameters (Davidson, 2023; Chapter 1.3). METR fits a logistic-then-log-linear trend to measured task-completion times (Kwa et al., 2025; Chapter 1.4). Karnofsky allocates probability mass narratively across civilizational-outcome categories (Karnofsky, 2021; Chapter 1.5). Grace pools thousands of expert probability distributions into a population aggregate (Grace et al., 2018, 2024; Chapter 1.6). Epoch estimates per-model training compute from public information and fits doubling-rate trends (Sevilla et al., 2022; Chapter 1.7). Seven objects of measurement — log-compute, biological-anchor compute, takeoff duration, task-completion time, outcome-category mass, pooled arrival-year belief, frontier-compute trajectory — with no two sharing units, temporal status, or epistemic primitive (established — the dimensional orthogonality is verifiable across research_notes/forecasts_database.json and was preregistered as a falsifiable claim; preregistration v1, §7-axis decomposition taxonomy).

Despite that divergence, one methodological feature is common to all seven. Each forecasts a future capability by decomposing the question into components, fitting those components to historical data, and extrapolating the fitted decomposition forward — with no out-of-sample forward-validation protocol deployed at publication. This is the load-bearing observation of the chapter, and it is structural rather than evaluative: it describes what the seven methodologies do, not whether what they do is correct. Each forecast’s §2 in Part I located the same boundary independently. Aschenbrenner’s 2019–2023 decomposition extrapolated to 2027 reserves no held-out period (Chapter 1.1, §2). Cotra’s 2010–2020 cost trends extrapolate to 2030–2100 without a walk-forward protocol (Chapter 1.2, §2). Davidson’s coupled system integrates parameters fit to 2010–2023 history forward, with no frozen-vintage retrodiction (Chapter 1.3, §2). METR’s 2019–2025 trend fit is in-sample by construction and its forward extrapolation out-of-sample by construction (Chapter 1.4, §2). Karnofsky’s probability-mass concentrations admit no historical reference class permitting validation (Chapter 1.5, §2). Grace’s aggregate arrival-year distribution is forward by construction, its horizon far exceeding the inter-wave spacing (Chapter 1.6, §2). Epoch’s historical estimates are in-sample and its scaling-law extrapolation out-of-sample (Chapter 1.7, §2).

The unifying claim is therefore not that the seven forecasts are wrong. It is that none provides out-of-sample forward-validation of its extrapolation step at publication (established; each instance is attested in the corresponding Part I §2 and in audit reports reviews/audit/phase-1-chapter-1.1 through phase-1-chapter-1.7). One problem, seven manifestations. The remainder of this chapter establishes the taxonomy that makes them comparable (§2), shows that the taxonomy classifies real structural features (§3), names the shared extrapolation step as a single signature (§4), and bridges to Chapter 3 (§5).

10.2 §2 — The Seven-Axis Decomposition Taxonomy

The comparison in §1 presupposes a taxonomy. This manuscript preregistered one before Part I drafting began: seven decomposition axes, each naming a forecast’s central methodological move with explicit dimensional semantics (preregistration v1, §7-axis decomposition taxonomy; hash a2f7e52c…, locked 2026-05-18). The axes are reproduced here as the document’s classificatory scaffold, each with its canonical exemplar and, where one applies, its operational Pattern letter.

The OOM-axis (Aschenbrenner) decomposes effective compute into additive orders-of-magnitude components — physical compute, algorithmic efficiency, and unhobbling — summed in log-space (Chapter 1.1). The anchor-axis (Cotra) decomposes the transformative-AI compute requirement into a mixture of six biological benchmarks under subjective weights (Chapter 1.2). The takeoff-duration axis (Davidson) parameterizes the duration from partial to full automation across slow, fast, and discontinuous regimes (Chapter 1.3). The time-cost-of-task axis (METR) measures the human-equivalent labor time at which a model achieves fifty-percent task success and fits a doubling-rate trend; it carries Pattern D, the manuscript’s commitment that this measured quantity is dimensionally orthogonal to Davidson’s takeoff-duration parameter despite the shared vocabulary of time (Chapter 1.4). The probability-mass / civilizational-significance axis (Karnofsky) allocates probability mass over civilizational-outcome categories; it carries Pattern E, the commitment that probability-mass-over-a-binary-event and the qualitative civilizational-significance framing are orthogonal axes (Chapter 1.5). The expert-survey-aggregation axis (Grace) pools per-respondent arrival-year distributions across a defined expert population; it carries Pattern F, the commitment that statistical pooling and the per-respondent individual distribution are orthogonal (Chapter 1.6). The compute-scaling-empirical axis (Epoch) estimates per-model frontier compute and aggregates it into log-linear trends; it carries Pattern G (Chapter 1.7).

These seven exhaust the surveyed set. The taxonomy also reserves an eighth, constructive axis — the manuscript’s own Chapter 14 hyperparameter-search axis, the constructive framework’s own search space subject to multiple-testing correction (preregistration v1, §7-axis decomposition taxonomy, 8th axis). It is the constructive complement to the seven surveyed axes: where the seven name decomposition moves the manuscript audits, the eighth names the decomposition move the manuscript itself will perform, and is therefore subject to the same orthogonality and validation discipline. It is defined for completeness and is not yet operative; it becomes operative only when Chapter 14 constructs the framework whose search space it names. Naming it here discharges the self-application commitment of Rule 3: the document’s own forecasting apparatus is a forecast subject to the same axis discipline applied to the seven.

Pattern G is reflexive, and uniquely so. The OOM-axis, the anchor-axis, and the takeoff-duration axis each consume Epoch’s compute estimates as upstream inputs: Aschenbrenner’s first OOM driver, Cotra’s hardware-cost extrapolation, and Davidson’s compute-trajectory parameters all anchor on the Epoch database (Chapters 1.1, 1.2, 1.3; Sevilla et al., 2022). The compute-scaling-empirical axis is therefore not merely the seventh framework but the shared empirical substrate beneath three of the other six. This is a document-architecture feature: a single public compute-estimation methodology underlies three of the other six surveyed entries, and the dependency runs one-directionally downstream from Epoch (established; the citation network is one-directional and verifiable in research_notes/forecasts_database.json). The reflexivity is bounded to Epoch. It is a property of this particular substrate’s position in the surveyed set, in which a single axis’s empirical outputs feed three downstream axes, and not a relational structure shared by the other six. None of the OOM-axis, anchor-axis, takeoff-duration, time-cost-of-task, probability-mass, or expert-survey axes supplies inputs to the others; where one of them depends on another surveyed entry at all, the dependency runs to Epoch, never among themselves. Reflexivity is therefore a one-axis attribute, as Cotra’s 2022 self-revision was a property unique to Cotra and not a general feature of the anchor-axis. Over-generalizing reflexivity across all seven axes would misstate the architecture; the manuscript names it precisely once, here, as Epoch-specific (per Chapter 1.7 audit, carry-forward caveat).

Pattern H sits outside the taxonomy entirely. The methodology canon — Bailey, López de Prado, Borwein, Harvey, Tetlock, Flyvbjerg — does not participate in the seven-axis decomposition (Bailey & López de Prado, 2014; Bailey et al., 2014; López de Prado, 2018; Harvey et al., 2016; Tetlock & Gardner, 2015; Flyvbjerg, 2014). These works produce statistical machinery for evaluating performance and forecast claims under multiple testing; they are the methodology applied to the axes, not a forecast framework among them. Pattern H is therefore named without an axis assignment, by construction. The distinction is preregistered as falsifiable: if any canon entry were found to function as a decomposable forecast framework under the seven-axis taxonomy, the Pattern H structural distinction would require revision (preregistration v1, FC-3). Across Phase 0.2 and 0.3 no such instance was found (evidenced — Pattern H confirmed across the methodology-canon entries; research_notes/methodology_canon.json). The relation is directional: the canon is the extension target the document inherits its evaluative machinery from, not an object the seven-axis taxonomy classifies. The seven-axis taxonomy and the canon thus occupy disjoint roles: the axes are what is audited, the canon is what audits.

10.3 §3 — The Data-Richness / Attestation-Richness Distinction

The taxonomy earns its place only if it classifies meaningful structural features rather than arbitrary author groupings. A substantive insight emerging from the Part I audits supplies the evidence that it does.

Before Part I drafting, this manuscript preregistered a distribution prior: forecasts anchored on empirical-data axes would produce a higher (established) share of decomposed claims than forecast-philosophy frameworks relying on forward extrapolation (preregistration v1, §Distribution analysis). The prior systematically over-predicted the (established) share for forecasts that borrow richness from a measurement layer distinct from their attestation layer. Three confirmed instances, each subjected to an independent architekt–audytor label census, establish the pattern (research_notes/forecasts_database.json, project_level_methodology_notes.hypothesis_a_confirmed_pattern).

METR’s prior predicted a seventy-three-percent (established) share; the realized census returned 46.2 percent — a deviation of −26.8 percentage points, roughly −2.18σ at the thirteen-claim census, with 13/13 independent agreement on the individual labels (Chapter 1.4 audit) (established — independent census documented). Grace’s prior predicted 82.5 percent, the highest in the surveyed set; the realized share fell to roughly 48–59 percent — a deviation of −24 to −35 percentage points, roughly −2.8 to −4.0σ, with 16/16 census agreement, the largest absolute deviation of the three and the cleanest demonstration (Chapter 1.6 audit) (established). Karnofsky is the adjacent case: the (established) share landed within variance of its 38-percent prior, but the framework-architecture and citation-network claims that were prior-classified (speculative) proved (evidenced) on inspection (+7.3pp evidenced, −10.5pp speculative, 17/17 census) — the same conflation mechanism operating on a different tier (Chapter 1.5 audit) (evidenced).
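
The METR deviation can be reproduced under one assumption that the paragraph above does not state explicitly: that σ is the binomial standard error of the prior share at the census size, sqrt(p(1 - p) / n). The sketch below applies that assumed definition to the METR figures only; the function name is hypothetical, and the same formula is not claimed to be the one used for the other censuses.

```python
import math


def share_deviation_in_sigma(prior_share: float, realized_share: float, n_claims: int) -> float:
    """Deviation of a realized share from a prior share, in binomial standard errors.

    Assumption (not confirmed by the source): sigma = sqrt(p * (1 - p) / n),
    with p the prior share and n the number of claims in the census.
    """
    sigma = math.sqrt(prior_share * (1.0 - prior_share) / n_claims)
    return (realized_share - prior_share) / sigma


if __name__ == "__main__":
    # METR figures quoted above: 73 percent prior, 46.2 percent realized, 13 claims.
    z = share_deviation_in_sigma(0.73, 0.462, 13)
    print(round(z, 2))  # -2.18
```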

The conflation mechanism is worth unfolding, because it is the same in all three instances and explains why the prior was systematically wrong rather than randomly so. Measurement-layer activities produce immediate, verifiable observations — METR’s task-completion times, Grace’s verbatim survey responses, Karnofsky’s framework-architecture and citation-network claims — that naturally read as (established)-eligible, because they are directly attested. The prior inherited that attestation upward: it read the richness of the measurement layer as if it were richness of the forecast tier. But each forecast’s load-bearing claim sits on the methodology layer — the extrapolation from observed measurements to a year-of-arrival or capability projection — and that step is out-of-sample regardless of how richly the underlying measurements are attested. The same mechanism takes different surface forms across the three: METR moves from benchmark observations to a roughly seven-month doubling extrapolation; Karnofsky moves from framework-architecture observations to a civilizational-scale probability distribution; Grace moves from per-respondent survey distributions to an aggregated year-of-arrival distribution. Different objects, one conflation.

Epoch is the boundary-confirming case, and it confirms the boundary precisely because the conflation has nothing to act on. Its measurement layer is its attestation layer: the per-model FLOP estimates and the scaling-law findings are themselves the methodology-attestation claims, not richness borrowed from an external measurement layer and re-attributed to a forecast. There is therefore no separate rich measurement layer whose richness can be mis-attributed. Its realized distribution reconciled to roughly 60.0 / 30.9 / 9.1 percent across the (established), (evidenced), and (speculative) tiers, the lowest (speculative) share in the set, with 17/17 manifest-label census agreement (Chapter 1.7 audit) (established). The boundary holds because the conflation that drives the three confirmed instances cannot occur where the two layers coincide.

The mechanism is the distinction the section names: data-richness is not methodology-attestation-richness. METR’s completed-benchmark observations, Grace’s completed-survey responses, and Karnofsky’s citation-network architecture are genuinely rich on the measurement layer; the load-bearing extrapolation step that turns each into a forecast is out-of-sample on the methodology layer. The prior conflated the two. The classificatory consequence is precise: axes producing measurement-layer richness — Pattern D, Pattern E, Pattern F — are differentially susceptible to the conflation, while the Pattern G reflexive substrate is exempt, because its measurement layer and attestation layer coincide. The taxonomy thus predicts which forecasts will be susceptible, and the prediction holds across four independent censuses (13/13, 17/17, 16/16, 17/17) (established — 100-percent architekt–audytor agreement in all four censuses). This is substantive evidence the seven-axis taxonomy classifies real structural features.

The document’s own behavior on this point is load-bearing under Rule 3. The architekt did not adjust the realized labels toward the miscalibrated prior. Auto-adjusting would have been in-sample fitting to a prior known to be miscalibrated — the precise failure mode this manuscript audits in the surveyed set. Non-adjustment preserves the integrity of the critique: the prior over-fit the (established) share for measurement-rich forecasts, and that over-fit is itself an in-sample-fitting event at the meta-methodology level, inherited by Chapter 5’s treatment of multiple testing (project_level_methodology_notes, implications_for_chapter_4_5).

10.4 §4 — The Shared Out-of-Sample Extrapolation Step

The load-bearing observation of §1 can now be stated as a single structural signature rather than seven separate findings. Every forecast in the surveyed set executes the same two-stage move: fit a decomposition to a historical window, then extrapolate the fitted decomposition past the window’s end. The in-sample / out-of-sample boundary falls at the window’s end in all seven cases. What differs is only the object on each side of the boundary.

Naming the boundary precisely for each axis is the foundation Chapter 4’s formal in-sample-extrapolation critique inherits, so each is given its in-sample side, its out-of-sample side, and the transform that maps one to the other. Aschenbrenner holds a 2019–2023 effective-compute trend in-sample and projects a 2023–2027 forecast out-of-sample, the transform a per-driver OOM-rate extension at a fitting-to-horizon ratio of roughly five to four. Cotra holds 2010–2020 hardware-cost and algorithmic-efficiency trends in-sample and projects a year-of-TAI distribution out to 2100, the transform a six-anchor evaluation under subjective weights. Davidson holds a 2010–2023 compute trajectory in-sample and integrates a fast/slow takeoff-duration distribution out-of-sample, the transform a coupled-system forward integration. METR holds observed task durations in-sample and projects a year-of-AGI estimate out-of-sample, the transform a roughly seven-month doubling fit. Karnofsky holds civilizational analogues in-sample and reasons to a century-scale probability distribution out-of-sample — a transform whose horizon renders it unfalsifiable within its own term. Grace holds 2018, 2022, and 2024 survey waves in-sample and projects forward arrival-year point distributions out-of-sample, the transform a cross-wave aggregation. Epoch holds historical compute estimates in-sample and projects the scaling law forward out-of-sample, the transform a log-linear extrapolation. Seven objects fitted, seven objects extrapolated, one boundary.

The shared signature can now be stated as one pattern rather than seven readings. All seven perform an extrapolation transform from in-sample observations to an out-of-sample projection; all seven lack a pre-specified validation criterion at publication — no held-out period, no falsification protocol, no pre-committed threshold against which the projection would later be scored; all seven thereby face the same in-sample-fitting risk. The signature is the boundary’s universality, not the diversity of what crosses it.

The diversity matters for one reason only: it rules out the explanation that the shared gap is an artifact of a shared method. Two forecasts using the same procedure might omit out-of-sample validation for the same incidental reason. Seven structurally orthogonal procedures omitting it independently is evidence that the omission is a property of the forecasting task — extrapolating capability past an observed window — rather than of any one methodology’s design choices. The §2 dimensional orthogonality and the §4 shared signature are the same observation viewed from two angles: the axes differ on everything except the structural location of the boundary, and they share the boundary precisely because the task is the same.

This signature is named here without a prescribed solution. The discipline is the Pattern A boundary inherited from Part I: the structural commonality is identified, the corrective machinery is not yet introduced. The names of the corrective methodologies, and of the constructive framework that integrates them, belong to later chapters and are not deployed in this section. What §4 establishes is only that the seven forecasts share an out-of-sample extrapolation step that none validates at publication — the empirical premise on which the framework chapters build. That shared step is also the structural precondition the next chapter requires: the in-sample window and the out-of-sample projection are, formally, a train/test partition, with the boundary at the window’s end functioning as the cutoff between fitted and held-out data. A forecast that fits in-sample and projects forward without a held-out criterion has, in this structural sense, declared a partition and then declined to use its test side. That correspondence — extrapolation boundary as train/test partition — is what makes the next chapter’s analysis available; the analysis itself is reserved for it. Chapter 3 reframes the step as an analogy to a quantitative-finance object built to be evaluated under exactly this gap. Chapter 4 formalizes the in-sample-extrapolation critique and develops the walk-forward retrodiction protocol that would generate the operational sample the seven forecasts lack. Chapters 5 through 7 and 14 deliver the formal extension.

10.5 §5 — Forward Bridge to Chapter 3

This chapter established a taxonomy of seven decomposition axes, identified one structural feature common to all seven — an in-sample-fitted extrapolation with no out-of-sample forward-validation at publication — and showed, through the data-richness / attestation-richness distinction, that the taxonomy classifies real structural features rather than arbitrary author groupings. The shared extrapolation step is the empirical premise the rest of the manuscript engages.

Chapter 3 introduces the backtest analogy. A capability forecast that fits a decomposition in-sample and projects it forward without held-out validation is structurally a backtest reported without out-of-sample testing — the object the quantitative-finance methodology canon was built to evaluate. Chapter 3 develops that analogy and bridges the Part I observation to the Part II framework derivation. Chapter 4 turns the analogy into a formal critique of in-sample extrapolation and a walk-forward protocol; Chapter 5 develops the multiple-testing treatment, inheriting from §3 the implication that a miscalibrated baseline prior is itself a meta-level in-sample-fitting failure. Chapters 6 and 7 then derive the formal extension methodologies — the Deflated Sharpe Ratio (DSR) adaptation and the Probability of Forecast Overfitting (PFO) adaptation — and Chapter 14 integrates them into the synthetic Deflated Capability Forecast (DCF) framework.

These are an extension of the methodology canon, not a competitor to the forecasts they will assess. Bailey, López de Prado, Borwein, Harvey, Tetlock, and Flyvbjerg produced statistical machinery for evaluating performance claims under multiple testing and selection bias (Bailey & López de Prado, 2014; Bailey et al., 2014; López de Prado, 2018; Harvey et al., 2016; Tetlock & Gardner, 2015; Flyvbjerg, 2014); this manuscript carries that machinery into the capability-forecasting domain it was not initially built for. The seven surveyed forecasts are the object of that extension, named throughout as the analyzed set rather than the adversary. The DSR, PFO, and DCF terms are named here for forward reference; they are applied beginning in Chapter 6.

10.6 References

Aschenbrenner, L. (2024, June). Situational awareness: The decade ahead. Self-published essay (non-peer-reviewed). https://situational-awareness.ai/

Bailey, D. H., Borwein, J. M., López de Prado, M., & Zhu, Q. J. (2014). Pseudo-mathematics and financial charlatanism: The effects of backtest overfitting on out-of-sample performance. Notices of the American Mathematical Society, 61(5), 458-471.

Bailey, D. H., & López de Prado, M. (2014). The deflated Sharpe ratio: Correcting for selection bias, backtest overfitting, and non-normality. Journal of Portfolio Management, 40(5), 94-107.

Cotra, A. (2020). Forecasting TAI with biological anchors. Open Philanthropy (non-peer-reviewed).

Cotra, A. (2022, August). Two-year update on my personal AI timelines. AI Alignment Forum / LessWrong (non-peer-reviewed).

Davidson, T. (2023). What a compute-centric framework says about AI takeoff speeds. Open Philanthropy (non-peer-reviewed).

Flyvbjerg, B. (2014). What you should know about megaprojects and why: An overview. Project Management Journal, 45(2), 6-19.

Grace, K., Salvatier, J., Dafoe, A., Zhang, B., & Evans, O. (2018). When will AI exceed human performance? Evidence from AI experts. Journal of Artificial Intelligence Research, 62, 729-754.

Harvey, C. R., Liu, Y., & Zhu, H. (2016). … and the cross-section of expected returns. Review of Financial Studies, 29(1), 5-68.

Karnofsky, H. (2021). The most important century (blog series). Cold Takes (non-peer-reviewed).

Kwa, T., West, B., Becker, J., et al. (2025). Measuring AI ability to complete long software tasks (arXiv:2503.14499). NeurIPS.

López de Prado, M. (2018). Advances in financial machine learning. John Wiley & Sons.

Sevilla, J., Heim, L., Ho, A., Besiroglu, T., Hobbhahn, M., & Villalobos, P. (2022). Compute trends across three eras of machine learning. International Joint Conference on Neural Networks (IJCNN). https://doi.org/10.1109/IJCNN55064.2022.9891914

Tetlock, P. E., & Gardner, D. (2015). Superforecasting: The art and science of prediction. Crown Publishers.

11 Chapter 3 — Forecasts as Backtests

Chapter 2 established that the seven surveyed forecasts share one structural feature: each fits a decomposition to a historical window and projects it forward without a pre-specified out-of-sample validation criterion at publication. This chapter supplies the lens that makes the corrective available. It does not introduce new statistical machinery; it argues that a structure already studied in quantitative finance — a backtest reported without out-of-sample testing — is the structure the seven forecasts instantiate.

Individual forecasts appear as illustrations of a structural feature. The position is preserved: this manuscript extends the methodology canon of Bailey, López de Prado, and Harvey into the capability-forecasting domain; it competes with neither the surveyed forecasts nor the canon.

11.1 §1 — The Backtest Analogy Introduced

A backtest is a procedure with a fixed shape. A researcher fits a strategy to a historical price record, evaluates its performance on that record, and reports a performance statistic as evidence the strategy will perform out-of-sample. The methodology canon’s central concern is the gap between in-sample fit and out-of-sample performance — the gap a backtest reported without held-out testing leaves unmeasured (Bailey & López de Prado, 2014; Bailey, Borwein, López de Prado, & Zhu, 2014). A capability forecast, viewed as a backtest, shares that shape. The forecaster fits a decomposition to a historical window, evaluates the decomposition’s fit on that window, and reports a projection as evidence about a future capability. The two procedures differ in domain and in vocabulary; they coincide in structure.

The correspondence is specific, not loose, and it runs along three components. The training set corresponds to the in-sample window — the historical period the forecaster fit the decomposition to. The test set corresponds to the out-of-sample region — the forward period over which the projection must validate. The performance metric corresponds to capability-prediction accuracy — whatever statistic would later score the projection against realized capability. Chapter 2 §4 already located the boundary that separates these components for each of the seven: the in-sample window ends, and the projection begins, at one cutoff per forecast. A forecast that declares such a partition and then reports only the in-sample side has, in the backtest vocabulary, declined to use its test set.
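
The three-part correspondence can be sketched in a few lines. The example is a hypothetical illustration, not any surveyed forecast’s procedure: it fits a log-linear trend to an in-sample window (the training set), then scores the fitted trend on the held-out years (the test set) with a mean absolute log error (the performance metric). The synthetic series, the 2020 cutoff, and the function names are assumptions.

```python
import math


def fit_log_linear(years, values):
    """Ordinary least squares fit of log(value) = a + b * year; returns (a, b)."""
    logs = [math.log(v) for v in values]
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def holdout_log_error(years, values, cutoff_year):
    """Fit on years <= cutoff_year; return mean absolute log error on later years."""
    train = [(x, v) for x, v in zip(years, values) if x <= cutoff_year]
    test = [(x, v) for x, v in zip(years, values) if x > cutoff_year]
    if len(train) < 2 or not test:
        raise ValueError("need at least two training points and one test point")
    a, b = fit_log_linear([x for x, _ in train], [v for _, v in train])
    errors = [abs(math.log(v) - (a + b * x)) for x, v in test]
    return sum(errors) / len(errors)


# Synthetic data (hypothetical): one series keeps growing 4x per year, the
# other slows to 2x per year after the cutoff.
years = list(range(2012, 2025))
steady = [4.0 ** (y - 2012) for y in years]
slowed = [4.0 ** (min(y, 2020) - 2012) * 2.0 ** max(0, y - 2020) for y in years]

if __name__ == "__main__":
    print(holdout_log_error(years, steady, 2020))
    print(holdout_log_error(years, slowed, 2020))
```

Both series fit the in-sample window identically; only the held-out side distinguishes them. That is the information a forecast forgoes when it reports the in-sample fit alone.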

The seven instantiate the shape across all seven decomposition axes, with the in-sample window and the projection differing only in their objects. The OOM-axis fits a 2019–2023 effective-compute trend and projects to 2027; the anchor-axis fits 2010–2020 cost trends and projects a distribution to 2100; the takeoff-duration axis integrates a 2010–2023 trajectory forward; the time-cost-of-task axis fits observed task durations and extrapolates a doubling rate; the probability-mass axis reasons from civilizational analogues to a century-scale distribution; the expert-survey axis aggregates serial waves into a forward arrival-year distribution; the compute-scaling axis fits historical estimates and extrapolates the scaling law (Aschenbrenner, 2024; Cotra, 2020; Davidson, 2023; Kwa et al., 2025; Karnofsky, 2021; Grace et al., 2024; Sevilla et al., 2022; Chapters 1.1–1.7). Seven training sets, seven test regions, one backtest shape (established as a structural correspondence; each per-axis boundary is attested in the corresponding Part I §2 and synthesized in Chapter 2 §4). The remainder of the chapter names the canon it borrows the lens from (§2), develops what this shape achieves (§3), and bounds what the analogy cannot tell us (§4).

11.2 §2 — The Methodology Canon the Analogy Borrows From

The backtest analogy is worth developing only because a mature literature already studied the structure it identifies. By the early 2010s, the systematic-trading industry was producing backtested track records at industrial scale — thousands of strategy configurations searched per analyst, over the same few decades of price history (Bailey, Borwein, et al., 2014; research notes, Phase 0.2 SUB-B). The arithmetic of false discovery had become unavoidable: search enough configurations and the best in-sample performer will, with high probability, exceed its own true expected performance by a margin that depends on the number searched. The canon is the discipline’s formal response to that arithmetic, and it is the lens this chapter borrows.
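
The arithmetic can be made concrete with a minimal simulation. The sketch below assumes every configuration has a true performance of exactly zero and that each in-sample estimate is an independent standard-normal draw; both are simplifying assumptions of mine, not properties of any real search. It compares the simulated average of the best-of-N statistic with a standard extreme-value approximation to the expected maximum of N independent standard normals, of the kind the canon relies on. The function names are illustrative.

```python
import math
import random
from statistics import NormalDist, mean

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def expected_max_approx(n_trials: int) -> float:
    """Approximate E[max] of n_trials independent standard normals (n_trials >= 2)."""
    z = NormalDist()
    return ((1 - EULER_GAMMA) * z.inv_cdf(1 - 1 / n_trials)
            + EULER_GAMMA * z.inv_cdf(1 - 1 / (n_trials * math.e)))


def simulated_best_of(n_trials: int, n_repeats: int = 1000, seed: int = 0) -> float:
    """Average best in-sample statistic when every configuration is truly worthless."""
    rng = random.Random(seed)
    return mean(
        max(rng.gauss(0.0, 1.0) for _ in range(n_trials))
        for _ in range(n_repeats)
    )


if __name__ == "__main__":
    # The true value of every configuration is 0, so anything above 0 is pure selection.
    for n in (10, 100, 1000):
        print(f"trials={n:5d}  "
              f"simulated best={simulated_best_of(n):.2f}  "
              f"approximation={expected_max_approx(n):.2f}")
```

The inflation grows with the number of configurations searched even though nothing was there to find, which is why the count of trials has to be disclosed before the winning statistic can be read as evidence.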

Three contributions define the lens at the level the analogy requires. Bailey and López de Prado (2014) addressed the inflation of the best-of-many performance statistic: the observed Sharpe ratio reported from a search is the maximum over the configurations searched, and the sampling distribution of a maximum is not the sampling distribution of a single trial (research notes, Phase 0.2 SUB-B, §2). Bailey, Borwein, López de Prado, and Zhu (2014) addressed a distinct question — across the configurations a researcher actually searched, what fraction would underperform out-of-sample given their in-sample rank — and operationalized it as a probability computed across symmetric in-sample/out-of-sample partitions of the record (research notes, Phase 0.2 SUB-B, §3). López de Prado (2018) consolidated both into a textbook treatment that a reader can reproduce from the code listings, establishing the integrated discipline that subsequent practice adopted as standard (research notes, Phase 0.2 SUB-A, §1). The factor-zoo literature in asset pricing applied the same multiple-testing logic to published return predictors, finding that conventional significance thresholds were inadequate once the number of trials was admitted (Harvey, Liu, & Zhu, 2016).
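
The second contribution is easier to hold in mind once the partitioning is written out. What follows is a minimal sketch in the spirit of combinatorially symmetric cross-validation, run on returns in the canon's own domain; nothing here is applied to the seven forecasts. The block count, the per-period Sharpe definition, and the names `sharpe` and `overfit_probability` are my assumptions for illustration, not the published procedure's exact specification.

```python
import math
import random
from itertools import combinations
from statistics import mean, pstdev


def sharpe(returns):
    """Per-period Sharpe ratio (mean over standard deviation); 0 if the series is flat."""
    sd = pstdev(returns)
    return mean(returns) / sd if sd > 0 else 0.0


def overfit_probability(returns_by_config, n_blocks=8):
    """Share of symmetric splits in which the in-sample winner ranks at or
    below the out-of-sample median of all searched configurations."""
    n_cfg = len(returns_by_config)
    block_len = len(returns_by_config[0]) // n_blocks
    blocks = [range(b * block_len, (b + 1) * block_len) for b in range(n_blocks)]
    logits = []
    for is_ids in combinations(range(n_blocks), n_blocks // 2):
        is_idx = [t for b in is_ids for t in blocks[b]]
        oos_idx = [t for b in range(n_blocks) if b not in is_ids for t in blocks[b]]
        is_sr = [sharpe([r[t] for t in is_idx]) for r in returns_by_config]
        oos_sr = [sharpe([r[t] for t in oos_idx]) for r in returns_by_config]
        best = max(range(n_cfg), key=is_sr.__getitem__)
        # Relative out-of-sample rank of the in-sample winner, kept strictly in (0, 1).
        omega = sum(s <= oos_sr[best] for s in oos_sr) / (n_cfg + 1)
        logits.append(math.log(omega / (1 - omega)))
    return sum(value <= 0 for value in logits) / len(logits)


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical search: 20 configurations of pure noise over 240 periods.
    noise = [[rng.gauss(0.0, 0.01) for _ in range(240)] for _ in range(20)]
    print(f"overfit probability on pure noise: {overfit_probability(noise):.2f}")
```

The design point is that the statistic is a property of the whole searched set, not of the winner alone: every configuration is re-ranked on every split, which is what lets the procedure speak about the selection rather than the selected.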

What unifies these contributions, at the level the analogy needs, is one methodological discipline: the boundary between in-sample fit and out-of-sample performance must be drawn at the level of the selection procedure, not the individual result. If many configurations are searched and the best reported, the question of whether that single result generalizes is misposed; the methodologically posed question is whether the search procedure produces results that generalize, and what correction the search multiplicity demands (research notes, Phase 0.2 SUB-A, §1(viii)). The canon’s manifesto stated the disciplinary consequence in plain terms: a performance metric reported without disclosure of the number of trials is not interpretable as statistical evidence (Bailey, Borwein, et al., 2014; research notes, Phase 0.2 SUB-B, §4) (established within the finance domain; the cross-domain transfer of the framing is treated as (evidenced) in §3 and the per-assumption calibration is reserved for Chapters 6 and 7).

This chapter names the canon and its documented failure mode; it does not yet apply the corrective statistics. The formal performance-statistic correction and the formal overfitting-probability computation are the subject of Chapters 6 and 7, where the cross-domain assumption modifications the canon requires are derived (research notes, Phase 0.2 SUB-A, §5; SUB-B, §5). The relationship is one of extension. The canon produced the machinery for evaluating in-sample-fitted claims under search multiplicity in a domain it was built for; this manuscript carries the lens into a domain it was not built for. The seven surveyed forecasts are the object of that extension, not its adversary, and several of them — the time-cost-of-task and compute-scaling axes in particular — already meet reproducibility standards the canon itself holds as commendable (Chapters 1.4, 1.7).

11.3 §3 — What the Backtest Analogy Achieves

The analogy earns its place only if structural features transfer mechanically rather than suggestively. Four do, each grounded in a feature present in both domains.

The first is the in-sample/out-of-sample partition itself — the boundary between fitted and held-out record in a backtest, between the window fit to and the forward region projected to in a capability forecast. Chapter 2 §4 established that all seven declare such a partition by construction and that none scores the projection against a held-out criterion at publication. The partition is one object in both domains: a cutoff after which performance is unmeasured at reporting.

The second is specification freedom. A backtest’s in-sample fit can be inflated by the researcher’s freedom to choose among specifications — which configurations to search, which to report — termed researcher degrees of freedom, with the correction operating on the count searched (Bailey & López de Prado, 2014). The analog is the freedom to choose the decomposition’s structure: drivers, anchors, parameters, weights. The time-cost-of-task axis exposes part of its specification space through a ten-thousand-perturbation sensitivity analysis (Kwa et al., 2025; Chapter 1.4); the OOM-axis and the probability-mass axis leave that space unenumerated (Chapters 1.1, 1.5). The degree differs across the seven, but the kind of freedom is the backtest’s kind (established as a structural correspondence; per-forecast specification-space disclosure is attested in each Part I §3).

The third is pre-commitment value. The canon’s discipline is most effective when the selection procedure is declared before the data are seen rather than reconstructed afterward. Preregistration is the analog, and its value is symmetric: a forecast that pre-specifies its decomposition, falsification criterion, and update rule forecloses the after-the-fact specification search that inflates in-sample fit. None of the seven preregistered in this sense; Chapter 2 §4 named the absent pre-committed threshold as part of the shared signature.

The fourth is the performance-metric validation requirement. A backtest’s in-sample statistic is evidence about out-of-sample performance only once computed on a held-out sample under a correction for the search that produced it. The analog: a decomposition’s in-sample fit is evidence about its projection only once the projection is scored against realized capability — requiring both a realized sample and a scoring rule. This is the requirement the seven do not meet, and the one Chapter 4 exists to manufacture retrospectively.
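
The first and fourth correspondences can be sketched together: declare a cutoff, fit only on the in-sample side, and score the projection against whatever has been realized beyond it. This is a toy illustration under my own assumptions (a log-linear trend, ordinary least squares, log-scale error as the score, and a synthetic series); it is not the protocol of Chapter 4, and the function names are hypothetical.

```python
import math
from statistics import mean


def fit_log_linear(years, values):
    """Ordinary least squares of log(value) on year; returns (intercept, slope)."""
    logs = [math.log(v) for v in values]
    y_bar, l_bar = mean(years), mean(logs)
    slope = (sum((y - y_bar) * (l - l_bar) for y, l in zip(years, logs))
             / sum((y - y_bar) ** 2 for y in years))
    return l_bar - slope * y_bar, slope


def holdout_log_errors(series, cutoff):
    """Fit on years <= cutoff, project beyond it, and return the signed
    log-scale error (projected minus realized) for each held-out year."""
    train = [(y, v) for y, v in series if y <= cutoff]
    test = [(y, v) for y, v in series if y > cutoff]
    intercept, slope = fit_log_linear([y for y, _ in train], [v for _, v in train])
    return {y: (intercept + slope * y) - math.log(v) for y, v in test}


if __name__ == "__main__":
    # Synthetic series (not real data): doubling each year through 2020,
    # then growth slowing to 1.5x per year in the held-out region.
    series = [(2015 + i, 2.0 ** i) for i in range(6)]
    series += [(2021, 32 * 1.5), (2022, 32 * 1.5 ** 2)]
    for year, err in holdout_log_errors(series, cutoff=2020).items():
        print(f"{year}: projection overshoots by a factor of {math.exp(err):.2f}")
```

The in-sample fit here is perfect, and it says nothing about the overshoot. Only the held-out years and a stated scoring rule reveal it.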

These four correspondences make Chapter 2 §3’s data-richness observation translatable. There, the document’s own Phase 0.1 baseline prior over-predicted the (established)-tier share for measurement-rich forecasts, conflating richness of a measurement layer with attestation of the forecast tier; non-adjustment of the realized labels was the methodologically required choice (Chapter 2 §3; research_notes/forecasts_database.json, project_level_methodology_notes). In backtest terms that conflation is in-sample fitting at the meta-methodology level — the document’s own prior fit to a sample it then treated as held-out — and the refusal to adjust it is the pre-commitment discipline applied to the manuscript under Rule 3. The same lens the analogy turns on the seven turns on the document that wields it (evidenced — the meta-level conflation is documented and the non-adjustment recorded in the Chapter 1.4 through 1.7 censuses).

The analogy works structurally despite a disanalogy §4 develops in full: a backtest’s out-of-sample performance is realized, whereas a capability forecast’s lies partly in the future. The four correspondences hold regardless, because each concerns the structure of the procedure at the moment of reporting — the declared partition, the specification freedom exercised, the pre-commitment forgone, the validation requirement unmet — none depending on whether the test region has elapsed. What the analogy cannot transfer is the realized test sample, the limit §4 bounds and the gap Chapter 4 addresses.

11.4 §4 — What the Analogy Cannot Tell Us

A structural correspondence is not an identity. Three disanalogies separate a capability forecast from a financial backtest, each genuinely limiting what the analogy licenses and each bounded substantively below.

Disanalogy 1 — the realization asymmetry. A financial backtest’s out-of-sample region is realized: returns over the held-out period exist, so performance can be verified. A capability forecast’s out-of-sample region is partly unrealized, because some target years have not yet occurred, so its performance is partly unverifiable at publication. This is the load-bearing disanalogy, and it is not uniform across the seven.

Two are fully unrealized, so their validation is forward-only. The probability-mass axis makes a century-scale claim that admits no within-horizon falsification and carries no published self-revision (Karnofsky, 2021; Chapter 1.5 §2). The takeoff-duration axis is unrealized at the level of its central object, since post-AGI takeoff cannot be measured before AGI; its only realized fragment is an informal verbal self-update (Davidson, 2023; Chapter 1.3 §2).

Five are partially realized, with an elapsed period now available to score. The compute-scaling axis has a 2010–2024 trend that keeps accruing post-publication observations across dataset versions 2022 → 2024 → 2026 (Sevilla et al., 2022; Chapter 1.7 §3). The time-cost-of-task axis is the clearest of the set: its v1.0 → v1.1 same-model re-test is the closest analog to a realized out-of-sample check, with the doubling rate reproducing and the intervals mostly tightening (Kwa et al., 2025; Chapter 1.4 §2). The OOM-axis projects over 2023–2027, of which only 2027 remains unrealized (Aschenbrenner, 2024). The anchor-axis has a 2022 self-revision, a roughly twelve-year median shift two years on, that is a documented quasi-out-of-sample event on the early period (Cotra, 2020; Chapter 1.2 §2). The expert-survey axis has four serial waves that form a realized cross-wave series observable now (Grace et al., 2024; Chapter 1.6 §2) (evidenced: each classification rests on the forecast’s publication date and horizon in research_notes/forecasts_database.json and the corresponding Part I §2; Davidson and Karnofsky are the two whose central object remains fully forward).

The consequence is precise: where a backtest’s correction computes on a realized sample, five of the seven offer a partial one and two offer essentially none. That is why Chapter 4’s walk-forward protocol exists: to manufacture the sample the analogy alone cannot supply.
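
A crude way to see the asymmetry is to ask how much of each projection window has elapsed on the calendar. The sketch below uses only the two windows stated earlier in this chapter (2023 to 2027 for the OOM-axis, 2020 to 2100 for the anchor-axis) and assumes an as-of year of 2026, which I take from the preregistration lock date. Calendar elapse is not the classification used above, which also weighs whether the central object is measurable at all; the class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProjectionWindow:
    label: str
    start_year: int  # end of the in-sample window, where the projection begins
    end_year: int    # last target year of the projection

    def realized_fraction(self, as_of_year: int) -> float:
        """Share of the projection window that has elapsed by as_of_year, in [0, 1]."""
        span = self.end_year - self.start_year
        elapsed = min(max(as_of_year - self.start_year, 0), span)
        return elapsed / span


if __name__ == "__main__":
    AS_OF = 2026  # assumption: the year of the preregistration lock date
    windows = [
        ProjectionWindow("OOM-axis", 2023, 2027),
        ProjectionWindow("anchor-axis", 2020, 2100),
    ]
    for w in windows:
        print(f"{w.label}: {w.realized_fraction(AS_OF):.3f} of the window elapsed")
```

A fraction near one means most of the test region is available to score; a fraction near zero means the forecast can only be validated forward, or retrodictively on a manufactured sample.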

Disanalogy 2 — performance-metric ambiguity. A backtest scores against returns, which are observable and expressed in one agreed unit. A capability forecast scores against capability, which is not a primitive observable but the output of a measurement methodology that requires its own validation: human-equivalent task time on one axis, training FLOP on another, pooled belief on a third, and nothing operationally observable within the horizon on the probability-mass axis (Chapters 1.4, 1.7, 1.6, 1.5). There is no single capability unit across the seven, and the choice of scoring rule (calibration, a proper score, interval coverage) is itself a specification subject to the multiplicity the analogy identifies. The metric must be constructed and justified before any out-of-sample correction is meaningful. That construction is the explicit subject of Chapter 6, preregistered as a multi-statistic panel rather than a single privileged rule (preregistration v2, §Adaptation 3; hash a88856d9…, locked 2026-05-19).
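
To show why the scoring rule is itself a choice, here is a minimal sketch that reports several statistics side by side for the same set of interval forecasts. It is not the preregistered panel of Chapter 6. The quantile levels (10% and 90%), the dictionary keys, and the toy numbers are all assumptions made for illustration.

```python
def interval_coverage(forecasts):
    """Share of realized values that fall inside the stated [lower, upper] interval."""
    hits = sum(f["lower"] <= f["realized"] <= f["upper"] for f in forecasts)
    return hits / len(forecasts)


def pinball_loss(quantile_value, realized, q):
    """Quantile (pinball) loss for a stated q-quantile; lower is better."""
    diff = realized - quantile_value
    return q * diff if diff >= 0 else (q - 1) * diff


def score_panel(forecasts, lower_q=0.1, upper_q=0.9):
    """Report several statistics together rather than one privileged rule."""
    n = len(forecasts)
    return {
        "interval_coverage": interval_coverage(forecasts),
        "nominal_coverage": upper_q - lower_q,
        "mean_pinball_lower": sum(
            pinball_loss(f["lower"], f["realized"], lower_q) for f in forecasts) / n,
        "mean_pinball_upper": sum(
            pinball_loss(f["upper"], f["realized"], upper_q) for f in forecasts) / n,
        "mean_abs_error_median": sum(
            abs(f["median"] - f["realized"]) for f in forecasts) / n,
    }


if __name__ == "__main__":
    # Toy forecasts in an arbitrary unit (not real data).
    toy = [
        {"lower": 1.0, "median": 2.0, "upper": 4.0, "realized": 3.0},
        {"lower": 2.0, "median": 3.0, "upper": 5.0, "realized": 6.0},
        {"lower": 0.5, "median": 1.0, "upper": 2.0, "realized": 1.5},
        {"lower": 3.0, "median": 4.0, "upper": 6.0, "realized": 2.5},
    ]
    for name, value in score_panel(toy).items():
        print(f"{name}: {value:.3f}")
```

The statistics can disagree about which forecaster did well, so picking one after seeing the results is another degree of freedom. Fixing the panel in advance closes it.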

Disanalogy 3 — distribution-regime assumptions. A backtest’s corrections assume a data-generating process that is approximately stationary, or whose known regime shifts have been accommodated; the canon’s variance terms inherit an IID assumption that capability progress does not satisfy (research notes, Phase 0.2 SUB-B, §5.2). Candidate regime changes (compute-supply disruption, algorithmic-efficiency saturation, the hardware S-curve) each carry their own empirical literature, and none is anticipated by a stationary correction (Chapter 1.1 §4). A capability-forecasting adaptation therefore cannot inherit the IID variance term unaltered; it must derive a sampling distribution corrected for non-stationarity from first principles. That non-IID correction is the explicit subject of Chapter 6’s variance derivation, held (speculative) until its Monte Carlo validation criterion is met (preregistration v2, §Adaptation 1; locked 2026-05-19). The analogy identifies the structure; it does not certify that the canon’s distributional assumptions survive the crossing, which is why §3’s correspondences are confined to procedure-structure features independent of those assumptions.
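
The cost of the IID assumption can be illustrated with a small Monte Carlo check. The sketch below uses a first-order autoregressive series as a stand-in for serially dependent observations; that choice is mine, the series is still stationary, and so it illustrates only the dependence half of the problem, not regime change. It is not the derivation of Chapter 6. It compares the simulated variance of a sample mean with the IID value of 1/n.

```python
import random
from statistics import mean, pvariance


def ar1_series(n, rho, rng):
    """Stationary AR(1) series with unit marginal variance and lag-1 correlation rho."""
    innovation_sd = (1 - rho ** 2) ** 0.5
    x = rng.gauss(0.0, 1.0)
    out = []
    for _ in range(n):
        out.append(x)
        x = rho * x + innovation_sd * rng.gauss(0.0, 1.0)
    return out


def variance_of_mean(n, rho, n_repeats=2000, seed=0):
    """Monte Carlo variance of the sample mean of an AR(1) series of length n."""
    rng = random.Random(seed)
    return pvariance([mean(ar1_series(n, rho, rng)) for _ in range(n_repeats)])


if __name__ == "__main__":
    n = 100
    iid_value = 1.0 / n  # what the IID variance term would report
    for rho in (0.0, 0.5, 0.8):
        simulated = variance_of_mean(n, rho)
        print(f"rho={rho:.1f}  simulated={simulated:.4f}  "
              f"IID term={iid_value:.4f}  ratio={simulated / iid_value:.1f}")
```

With positive serial dependence the IID term understates the variance of the mean several times over, so any interval or significance statement built on it is too narrow before non-stationarity is even considered.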

These three disanalogies are the boundary conditions Chapter 4 inherits. They do not weaken the structural correspondence of §3; they delimit the domain over which the formal extension of Chapters 6 and 7 must be derived rather than imported.

11.5 §5 — Forward Reference to Chapters 4, 6, 7, and 14

This chapter argued that a capability forecast fitting a decomposition in-sample and projecting it forward without held-out validation is structurally a backtest reported without out-of-sample testing — the object the methodology canon was built to evaluate — and bounded the three disanalogies that delimit the correspondence. The analogy is the bridge from the Part I observation to the Part II framework derivation.

Chapter 4 turns the analogy into a formal critique of in-sample extrapolation and develops the walk-forward retrodiction protocol that would generate the realized out-of-sample record the seven lack, inheriting directly the realization-asymmetry bounding of §4. Chapter 5 develops the multiple-testing treatment along the Harvey, Liu, and Zhu (2016) line, inheriting from Chapter 2 §3 the implication that a miscalibrated baseline prior is itself a meta-level in-sample-fitting failure. Chapters 6 and 7 then derive the formal extension: Chapter 6 adapts the Deflated Sharpe Ratio to capability projections, resolving the performance-metric construction of Disanalogy 2 and the non-IID variance correction of Disanalogy 3 (preregistration v2, §Chapter 6/7/14 derivation roadmap, Adaptations 1, 3, 4); Chapter 7 adapts the Probability of Forecast Overfitting via an information-set partition and effective-trial-count estimation (Adaptations 2, 3, 4). Chapter 14 integrates both into the synthetic Deflated Capability Forecast framework (all adaptations). The DSR, PFO, and DCF terms are named here for forward reference only; the formal application of any of them begins in Chapter 6.

11.6 References