BLACKBOX — AI Flight Recorder

Upgrade Test

APPROXIMATE RE-EXECUTION

Test a new model version against your recorded run history. Each run's genesis context and retrieval set is re-submitted to the new model, the output is verified, and the verdict is compared against the original. Provider non-determinism means results are approximate — labeled accordingly.

QUICK START

TARGET MODEL

RUN IDs (comma-separated · leave blank for all runs)