Test a new model version against your recorded run history. Each run's genesis context and retrieval set is re-submitted to the new model, the output is verified, and the verdict is compared against the original. Provider non-determinism means results are approximate — labeled accordingly.