21  Sub-Module 2.3-A

Decision-Ranking Preservation, Validation Protocol

NoteNode Declaration — SM-2.3-A: Decision-Ranking Preservation, Validation Protocol

Field | Content
Tier | Sub-Module
Status | ✓ Complete
Assumes | §2.3
Contributes | Formal definition of the decision-ranking preservation criterion, the validation workflow, confidence scoring logic, and the active learning protocol for surrogate development
Skip condition | Skip if the surrogate validation approach is already known; process in full when implementing a surrogate module
Passes to | §2.4, SM-6.6-E
Sub-Modules here | None

21.1 SM-2.3-A: Decision-Ranking Preservation, Validation Protocol

Formal definition. Let \(\hat{f}(\cdot)\) denote a trained surrogate of a full model \(f(\cdot)\), and let \(\Omega_{\text{test}}\) denote a held-out test set of futures. Write \(Z(a, \omega)\) and \(\hat{Z}(a, \omega)\) for the consequences of alternative \(a\) in future \(\omega\) generated by the full model and the surrogate respectively. The surrogate satisfies decision-ranking preservation if and only if, for every future \(\omega \in \Omega_{\text{test}}\) and every pair of distinct alternatives \((a_1, a_2) \in \mathcal{A} \times \mathcal{A}\):

\[ \text{sign}\left(Z(a_1, \omega) - Z(a_2, \omega)\right) = \text{sign}\left(\hat{Z}(a_1, \omega) - \hat{Z}(a_2, \omega)\right) \]

In plain terms: the surrogate must produce the same ranking of alternatives as the full model in every test future. Two further conditions must also hold: the maximum regret of each alternative must agree with the full model’s assessment within a declared tolerance, and threshold-violation patterns (the set of futures in which each alternative fails a declared performance threshold) must agree within a declared discordance rate.
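
The criterion translates directly into a test harness. The sketch below checks pairwise sign concordance over a held-out test set; the array names Z_full and Z_hat and the (alternative, future) layout are illustrative assumptions, not part of the protocol.

```python
import numpy as np
from itertools import combinations

def ranking_concordant(Z_full, Z_hat):
    """Check decision-ranking preservation over a held-out test set.

    Z_full, Z_hat : arrays of shape (n_alternatives, n_futures) holding the
    consequences from the full model and the surrogate respectively.
    Returns True only if every pairwise sign agrees in every test future.
    """
    n_alt = Z_full.shape[0]
    for a1, a2 in combinations(range(n_alt), 2):
        if not np.all(np.sign(Z_full[a1] - Z_full[a2])
                      == np.sign(Z_hat[a1] - Z_hat[a2])):
            return False
    return True
```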

Validation workflow. The validation process has five stages.

The first stage is training set design. The surrogate is trained on a sample drawn from the full input feature space using Latin hypercube sampling across continuous uncertain drivers and stratified sampling across discrete ones. The training sample should oversample the regions of the input space where the full model output changes most rapidly, identified by a preliminary screening run. For the Southland regional electricity module, the training sample should oversample low-headroom, high-demand conditions, because this is the region where the PyPSA output changes qualitatively from feasible-without-upgrade to exceedance.
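
A minimal sketch of the first stage follows, assuming scipy is available and two illustrative continuous drivers; the bounds, sample sizes, and the low-headroom, high-demand corner used for oversampling are placeholders, not the Southland module’s declared values.

```python
import numpy as np
from scipy.stats import qmc

# Two illustrative continuous drivers, e.g. headroom (GW) and a demand
# multiplier; bounds are placeholders, not the module's declared ranges.
lower, upper = np.array([0.0, 0.8]), np.array([2.0, 1.6])

# Base Latin hypercube sample over the full input space.
base = qmc.scale(qmc.LatinHypercube(d=2, seed=1).random(n=400), lower, upper)

# Oversample the region flagged by the preliminary screening run:
# here the low-headroom, high-demand corner (thresholds illustrative).
extra = qmc.scale(qmc.LatinHypercube(d=2, seed=2).random(n=200),
                  np.array([0.0, 1.3]), np.array([0.5, 1.6]))

training_inputs = np.vstack([base, extra])
# Discrete drivers are handled separately by stratification: replicate
# this continuous design within each discrete stratum.
```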

The second stage is held-out test set selection. The test set must include the decision-critical futures identified by the scenario discovery analysis of §2.4 and SM-2.4-A. A random held-out set is insufficient because the futures that matter most for decision-ranking preservation are those where alternatives are most evenly matched, not those that are merely representative of the full input distribution. The test set should include at least the 21 anchor futures that span the most analytically important conditions.
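
One way to operationalise this selection is sketched below, under the assumptions that a screening run provides consequences for every candidate future and that lower consequence values are better (e.g. regret); select_test_futures, anchor_idx, and n_close are illustrative names.

```python
import numpy as np

def select_test_futures(Z_screen, anchor_idx, n_close=30):
    """Assemble a decision-critical test set.

    Z_screen   : (n_alternatives, n_futures) consequences from a screening run.
    anchor_idx : indices of the declared anchor futures (always included).
    Also adds the futures with the smallest margin between the best and
    second-best alternative, where rankings are most fragile.
    """
    sorted_Z = np.sort(Z_screen, axis=0)        # ascending: best (lowest) first
    margin = sorted_Z[1] - sorted_Z[0]          # gap between the top two alternatives
    close_idx = np.argsort(margin)[:n_close]    # most evenly matched futures
    return np.union1d(np.asarray(anchor_idx), close_idx)
```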

The third stage is performance evaluation against the three acceptance criteria: decision-ranking concordance across all test futures, maximum regret agreement within tolerance, and threshold-violation discordance below the declared rate. A surrogate that achieves high average prediction accuracy while failing any of these three criteria does not satisfy decision-ranking preservation and must be retrained, refined, or supplemented with truth model evaluations in the failing region.
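
The second and third criteria reduce to a few lines of array arithmetic. The sketch below assumes a minimisation objective and mirrors the tolerances declared in Table SM-2.3-A.

```python
import numpy as np

def regret_agreement(Z_full, Z_hat):
    """Relative error in each alternative's maximum regret over test futures."""
    best_full = Z_full.min(axis=0)                    # per-future best outcome
    best_hat = Z_hat.min(axis=0)
    max_regret_full = (Z_full - best_full).max(axis=1)
    max_regret_hat = (Z_hat - best_hat).max(axis=1)
    return np.abs(max_regret_hat - max_regret_full) / max_regret_full

def threshold_discordance(Z_full, Z_hat, threshold):
    """Fraction of (alternative, future) cells where violation status disagrees."""
    return np.mean((Z_full > threshold) != (Z_hat > threshold))

# Acceptance per Table SM-2.3-A: regret agreement below 5 percent for every
# alternative, threshold discordance below 2 percent.
# passed = (regret_agreement(Zf, Zh) < 0.05).all() and \
#          threshold_discordance(Zf, Zh, t) < 0.02
```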

The fourth stage is confidence scoring. The validated surrogate is equipped with a confidence scoring mechanism, typically based on distance from the training distribution in feature space, that assigns each prediction a confidence score. Predictions with confidence scores below a declared threshold trigger a truth model verification: the full model is run for that future, and the result replaces the surrogate’s prediction. This fallback logic is the operational expression of the progressive-refinement philosophy applied to surrogate deployment.
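
One common realisation of this mechanism, sketched below, scores confidence by k-nearest-neighbour distance to the training inputs; the protocol requires only some measure of distance from the training distribution, so the exponential kernel, the choice of k, and the threshold tau are all assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

class ConfidenceScorer:
    """Distance-based confidence: predictions far from the training
    distribution in feature space receive low confidence scores."""

    def __init__(self, X_train, k=5):
        self.nn = NearestNeighbors(n_neighbors=k).fit(X_train)
        # Calibrate against the typical k-NN distance within the training
        # set itself (each point's self-match is included in the query).
        d, _ = self.nn.kneighbors(X_train)
        self.scale = np.median(d[:, -1])

    def score(self, X):
        d, _ = self.nn.kneighbors(X)
        return np.exp(-d[:, -1] / self.scale)   # near 1 inside training data

def predict_with_fallback(x, surrogate, full_model, scorer, tau=0.5):
    """Use the surrogate unless confidence falls below the declared
    threshold, in which case run the full model and keep its result."""
    if scorer.score(x.reshape(1, -1))[0] < tau:
        return full_model(x)                     # truth model verification
    return surrogate(x)
```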

The fifth stage is active learning. As the surrogate is used in ensemble evaluations, the futures that trigger low confidence scores accumulate as candidates for additional training data. After a declared number of low-confidence evaluations, a new training batch is commissioned from the full model targeting those regions, the surrogate is retrained, and its decision-ranking preservation is re-evaluated. This active learning loop progressively improves the surrogate’s accuracy where it matters most.
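
The loop can be expressed compactly, as in the sketch below; the names surrogate_fit, validate, and batch_size are illustrative placeholders for the declared retraining machinery.

```python
import numpy as np

def active_learning_step(low_conf_futures, X_train, y_train,
                         full_model, surrogate_fit, validate, batch_size=50):
    """Retrain once enough low-confidence futures have accumulated.

    low_conf_futures : input vectors that triggered truth model fallbacks.
    batch_size       : declared number of low-confidence evaluations that
                       triggers a new training batch.
    surrogate_fit    : callable training a surrogate on (X, y).
    validate         : callable re-running the decision-ranking checks.
    """
    if len(low_conf_futures) < batch_size:
        return None                                   # not yet due for retraining
    X_new = np.asarray(low_conf_futures)
    y_new = np.array([full_model(x) for x in X_new])  # commissioned full-model runs
    X = np.vstack([X_train, X_new])                   # append to existing data
    y = np.concatenate([y_train, y_new])
    surrogate = surrogate_fit(X, y)
    if not validate(surrogate):                       # re-check ranking preservation
        raise RuntimeError("retrained surrogate fails decision-ranking preservation")
    return surrogate
```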

Table SM-2.3-A: Validation metrics for decision-ranking preservation

Metric | Definition | Acceptance threshold | What failure indicates
Ranking concordance | Fraction of test futures in which the surrogate and the full model agree on the preferred alternative | 1.00 (all test futures) | Surrogate would lead to a different pathway choice in some futures
Max regret agreement | Absolute difference in maximum regret between surrogate and full model, normalised by the full model value | Less than 5 percent | Surrogate materially understates or overstates tail exposure
Threshold discordance | Fraction of test futures in which the surrogate and the full model disagree on whether a threshold is violated | Less than 2 percent | Surrogate misrepresents feasibility in some futures
Confidence coverage | Fraction of ensemble futures in which the confidence score exceeds the declared threshold | 90 percent or above (target) | Surrogate requires many truth model verifications; consider expanding the training set