21 Sub-Module 2.3-A: Decision-Ranking Preservation, Validation Protocol
21.1 SM-2.3-A: Decision-Ranking Preservation, Validation Protocol
Formal definition. Let \(\hat{f}(\cdot)\) denote a trained surrogate of a full model \(f(\cdot)\), let \(\Omega_{\text{test}}\) denote a held-out test set of futures, and let \(\mathcal{A}\) denote the set of candidate alternatives. The surrogate satisfies the ranking condition of decision-ranking preservation if and only if, for every future \(\omega \in \Omega_{\text{test}}\) and every pair of alternatives \((a_1, a_2) \in \mathcal{A} \times \mathcal{A}\):
\[ \text{sign}\left(Z(a_1, \omega) - Z(a_2, \omega)\right) = \text{sign}\left(\hat{Z}(a_1, \omega) - \hat{Z}(a_2, \omega)\right) \]
where \(Z\) and \(\hat{Z}\) denote the consequences generated by the full model and the surrogate respectively. In plain terms: the surrogate must rank the alternatives exactly as the full model does in every test future. Full decision-ranking preservation imposes two further conditions: the maximum regret of each alternative must agree with the full model’s assessment within a declared tolerance, and the threshold-violation pattern (the set of futures in which each alternative fails a declared performance threshold) must agree within a declared discordance rate.
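With consequences held as arrays indexed by alternative and future, the sign condition can be checked directly. The following is a minimal sketch; the array layout and the function name `ranking_preserved` are illustrative, not part of the protocol:

```python
import numpy as np

def ranking_preserved(Z_full, Z_surr):
    """Check the pairwise sign condition of decision-ranking preservation.

    Z_full, Z_surr: arrays of shape (n_alternatives, n_futures) holding the
    consequences Z(a, omega) from the full model and the surrogate.
    Returns True iff sign(Z(a1, w) - Z(a2, w)) agrees between the two models
    for every pair of alternatives in every test future.
    """
    n_alt = Z_full.shape[0]
    for i in range(n_alt):
        for j in range(i + 1, n_alt):
            if not np.array_equal(np.sign(Z_full[i] - Z_full[j]),
                                  np.sign(Z_surr[i] - Z_surr[j])):
                return False
    return True
```

Note that the check is exact on signs, so ties (zero differences) in one model must also be ties in the other.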
Validation workflow. The validation process has five stages.
The first stage is training set design. The surrogate is trained on a sample drawn from the full input feature space using Latin hypercube sampling across continuous uncertain drivers and stratified sampling across discrete ones. The training sample should oversample the regions of the input space where the full model output changes most rapidly, identified by a preliminary screening run. For the Southland regional electricity module, the training sample should oversample low-headroom, high-demand conditions, because this is the region where the PyPSA output changes qualitatively from feasible-without-upgrade to exceedance.
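A plain-numpy sketch of the training-sample design is given below. The Latin hypercube is built by cutting each continuous driver's range into equal strata, drawing one point per stratum, and shuffling strata independently per dimension; the driver names and bounds are hypothetical, and a production analysis would more likely use a library sampler (e.g. `scipy.stats.qmc`):

```python
import numpy as np

def lhs_sample(n, bounds, rng):
    """Latin hypercube sample of n points over continuous driver bounds.

    bounds: list of (low, high) per driver. Each range is cut into n equal
    strata, one point is drawn per stratum, and the strata are shuffled
    independently per dimension so the marginals stay stratified.
    """
    d = len(bounds)
    u = (rng.random((n, d)) + np.arange(n)[:, None]) / n  # one point per stratum
    for k in range(d):
        rng.shuffle(u[:, k])                              # decouple dimensions
    lows = np.array([b[0] for b in bounds])
    highs = np.array([b[1] for b in bounds])
    return lows + u * (highs - lows)

rng = np.random.default_rng(0)
# Hypothetical continuous drivers: demand growth (%/yr) and a fuel price index.
X = lhs_sample(200, [(0.0, 4.0), (0.5, 2.0)], rng)
# Stratified draw over a hypothetical discrete driver: equal blocks per regime.
regimes = np.repeat(np.array([0, 1, 2, 3]), 50)
```

Oversampling the rapid-change region identified by screening would then add extra points with tightened bounds around that region, rather than reweighting this base sample.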
The second stage is held-out test set selection. The test set must include the decision-critical futures identified by the scenario discovery analysis of §2.4 and SM-2.4-A. A random held-out set is insufficient because the futures that matter most for decision-ranking preservation are those where alternatives are most evenly matched, not those that are merely representative of the full input distribution. The test set should include at least the 21 anchor futures that span the most analytically important conditions.
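One way to operationalise "most evenly matched" is to rank futures by the gap between the best and second-best alternative in the full-model runs. A minimal sketch, assuming a lower-is-better consequence (e.g. regret); the function name is illustrative:

```python
import numpy as np

def evenly_matched_futures(Z_full, k):
    """Pick the k futures where the top two alternatives are closest.

    Z_full: (n_alternatives, n_futures) full-model consequences, lower is
    better. Returns indices of the k futures with the smallest margin
    between the best and second-best alternative.
    """
    Zs = np.sort(Z_full, axis=0)   # sort alternatives within each future
    gap = Zs[1] - Zs[0]            # best vs second-best margin per future
    return np.argsort(gap)[:k]
```

These close-call futures would then be combined with the anchor futures from §2.4 rather than replace them.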
The third stage is performance evaluation against the three acceptance criteria: decision-ranking concordance across all test futures, maximum regret agreement within tolerance, and threshold-violation discordance below the declared rate. A surrogate that achieves high average prediction accuracy while failing any of these three criteria does not satisfy decision-ranking preservation and must be retrained, refined, or supplemented with truth model evaluations in the failing region.
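The first criterion is the pairwise sign test from the formal definition; the other two can be sketched as below. This assumes a lower-is-better consequence with regret defined against the per-future best, nonzero full-model maximum regret, and illustrative function names; the acceptance thresholds come from Table SM-2.3-A:

```python
import numpy as np

def max_regret_agreement(Z_full, Z_surr):
    """Worst-case relative disagreement in per-alternative maximum regret.

    Regret of alternative a in future w is Z(a, w) minus the best Z in w
    (lower Z is better). Assumes full-model max regrets are nonzero.
    """
    def max_regret(Z):
        return (Z - Z.min(axis=0)).max(axis=1)  # per-alternative max regret
    r_full, r_surr = max_regret(Z_full), max_regret(Z_surr)
    return np.max(np.abs(r_surr - r_full) / r_full)

def threshold_discordance(Z_full, Z_surr, threshold):
    """Fraction of (alternative, future) cells where the two models
    disagree on whether the declared threshold is violated."""
    return np.mean((Z_full > threshold) != (Z_surr > threshold))

# Acceptance per Table SM-2.3-A: agreement < 5 %, discordance < 2 %.
```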
The fourth stage is confidence scoring. The validated surrogate is equipped with a confidence scoring mechanism, typically based on distance from the training distribution in feature space, that assigns each prediction a confidence score. Predictions with confidence scores below a declared threshold trigger a truth model verification: the full model is run for that future, and the result replaces the surrogate’s prediction. This fallback logic is the operational expression of the progressive-refinement philosophy applied to surrogate deployment.
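The fallback logic can be sketched as follows. The exponential distance kernel, the length scale, and the 0.5 threshold are all illustrative assumptions; any monotone mapping from distance-to-training-set to a score would serve:

```python
import numpy as np

def confidence(x, X_train, scale):
    """Confidence score in (0, 1] from nearest-neighbour distance to the
    training set in feature space; 1 means x coincides with a training point."""
    d = np.min(np.linalg.norm(X_train - x, axis=1))
    return float(np.exp(-d / scale))

def predict_with_fallback(x, surrogate, truth_model, X_train,
                          scale=1.0, conf_threshold=0.5):
    """Return the surrogate prediction unless the confidence score falls
    below the declared threshold, in which case run the full model and
    use its result instead (the truth-model verification)."""
    if confidence(x, X_train, scale) >= conf_threshold:
        return surrogate(x), "surrogate"
    return truth_model(x), "truth_model"
```

Probabilistic surrogates (e.g. Gaussian processes) supply a predictive variance that can play the same role without a separate distance computation.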
The fifth stage is active learning. As the surrogate is used in ensemble evaluations, the futures that trigger low confidence scores accumulate as candidates for additional training data. After a declared number of low-confidence evaluations, a new training batch is commissioned from the full model targeting those regions, the surrogate is retrained, and its decision-ranking preservation is re-evaluated. This active learning loop progressively improves the surrogate’s accuracy where it matters most.
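The loop above can be sketched as follows. The batch size, the confidence threshold, and the callback names are hypothetical; the retraining step stands in for commissioning a new full-model batch, retraining, and re-running the SM-2.3-A validation:

```python
def active_learning_loop(candidates, surrogate_predict, confidence_fn,
                         run_full_model, retrain, batch_size=25,
                         conf_threshold=0.5):
    """Evaluate candidate futures, falling back to the full model on low
    confidence; once a declared number of low-confidence evaluations has
    accumulated, hand them to retrain() as a new training batch."""
    low_conf = []   # (future, full-model result) pairs awaiting retraining
    results = []
    for x in candidates:
        if confidence_fn(x) < conf_threshold:
            y = run_full_model(x)       # truth-model verification
            low_conf.append((x, y))
            results.append(y)
        else:
            results.append(surrogate_predict(x))
        if len(low_conf) >= batch_size:
            retrain(low_conf)           # retrain, then re-validate per SM-2.3-A
            low_conf = []
    return results
```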
Table SM-2.3-A: Validation metrics for decision-ranking preservation
| Metric | Definition | Acceptance threshold | What failure indicates |
|---|---|---|---|
| Ranking concordance | Fraction of test futures where surrogate and full model agree on the preferred alternative | 1.00 (all test futures) | Surrogate would lead to different pathway choice in some futures |
| Max regret agreement | Absolute difference in maximum regret between surrogate and full model, normalised by full model value | Less than 5 percent | Surrogate materially understates or overstates tail exposure |
| Threshold discordance | Fraction of test futures where surrogate and full model disagree on whether a threshold is violated | Less than 2 percent | Surrogate misrepresents feasibility in some futures |
| Confidence coverage | Fraction of ensemble futures where confidence score exceeds declared threshold | Target 90 percent or above | Surrogate requires many truth model verifications; consider expanding training set |