From R²=0.8725 to R²=0.07: What We Learned About FNO Overfitting

The headline number we cited

The Fourier Neural Operator (FNO) yield surrogate in PROV 9 was trained on 10,000 hybrid bonding configurations. We held out 20% of the dataset for validation and the trained model scored R² = 0.8725 on the holdout set. We cited that number in over 170 places in the codebase and in customer-facing materials. It was the headline ML result.

The number was not fabricated. The model genuinely did achieve R² = 0.8725 on the holdout set. The problem was not in the measurement. It was in the dataset.

The distribution problem we did not see

The 10,000-sample training dataset was generated by sweeping process parameters across what we thought was a wide distribution of operating points. In reality, the parameter ranges were centered tightly around a single nominal recipe. The 80/20 train/holdout split kept all samples inside the same narrow distribution. The model only had to interpolate between similar operating points, never extrapolate beyond them.

In ML terms, we had built a strong cross-validation pipeline on top of a weak data distribution. The CV protocol was correct. The dataset it was applied to was not representative of the production process window. The model learned to memorize the narrow operating regime and could not generalize.

We did not catch this because the validation R² looked great. When the validation set has the same distributional limits as the training set, an overfit model still scores well. This is the textbook failure mode for ML in physics — you get a number that is mathematically correct but operationally meaningless.
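
The effect is easy to reproduce on a toy problem. The sketch below is not our FNO pipeline: it assumes a hypothetical one-dimensional response (a sine curve) and a straight-line model, and every name in it is illustrative. It fits on a narrow sweep, scores an 80/20 holdout drawn from that same sweep, then scores the same fitted model on a wide sweep.

```python
import math
import random

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a * x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sxy / sxx
    return a, my - a * mx

def response(x):
    # Hypothetical stand-in for the true process response (assumption).
    return math.sin(x)

rng = random.Random(0)

# Narrow sweep: every sample sits close to one nominal recipe at x = 0.
narrow_x = [rng.uniform(-0.3, 0.3) for _ in range(1000)]
train_x, holdout_x = narrow_x[:800], narrow_x[800:]  # 80/20, same distribution
a, b = fit_line(train_x, [response(x) for x in train_x])

# Wide sweep: the full operating window, corners included.
wide_x = [rng.uniform(-3.0, 3.0) for _ in range(1000)]

def evaluate(xs):
    """R² of the fitted line against the true response on the given inputs."""
    return r2_score([response(x) for x in xs], [a * x + b for x in xs])

r2_holdout = evaluate(holdout_x)
r2_wide = evaluate(wide_x)
print(f"narrow holdout R2 = {r2_holdout:.4f}, wide R2 = {r2_wide:.4f}")
```

The holdout score comes out close to 1 because a sine is nearly linear around zero. The wide score comes out negative because the line keeps climbing where the true response turns over. Nothing about the split was wrong; the sweep was.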

The 100K realistic test set

During the 10x audit we generated a new test dataset: 100,000 hybrid bonding configurations sampled from the full operational process window, including the corners that the original training set had ignored. We re-ran the v2 FNO model on this wider dataset without any retraining, just to see what would happen.

The results were brutal:

Metric	FNO v2 (narrow holdout)	FNO v2 (100K realistic)	Ridge baseline (100K realistic)
Pixel R²	0.8725	0.073	0.500
Image-max R²	0.8104	-1.56	0.625

On the wider distribution the FNO did worse than a Ridge regression baseline. Pixel R² fell to 0.07. Image-max R² went negative, which means the model was actively making predictions worse than just predicting the mean. The full numbers are in benchmarks/fno_v2_crossval_100k.json.
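
For readers who have not run into a negative R² before, here is a minimal sketch of the arithmetic. The two metric definitions are assumptions made for illustration, not taken from the benchmark code: pixel R² is treated as R² pooled over every pixel of every image, and image-max R² as R² over one value per image, its maximum pixel. The toy images are invented.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def pixel_r2(true_images, pred_images):
    """ASSUMED definition: R² over all pixels of all images, pooled."""
    flat_true = [v for img in true_images for v in img]
    flat_pred = [v for img in pred_images for v in img]
    return r2_score(flat_true, flat_pred)

def image_max_r2(true_images, pred_images):
    """ASSUMED definition: R² over one number per image, its maximum pixel."""
    return r2_score([max(img) for img in true_images],
                    [max(img) for img in pred_images])

y = [1.0, 2.0, 3.0, 4.0]
print(r2_score(y, [2.5] * 4))             # 0.0: always predicting the mean
print(r2_score(y, [4.0, 3.0, 2.0, 1.0]))  # -3.0: worse than predicting the mean

# Toy flattened images: the background is right, the peaks are badly wrong.
true_images = [[0.0, 0.0, 1.0], [0.0, 0.0, 2.0], [0.0, 0.0, 3.0]]
pred_images = [[0.0, 0.0, 3.0], [0.0, 0.0, 2.0], [0.0, 0.0, 1.0]]
print(pixel_r2(true_images, pred_images), image_max_r2(true_images, pred_images))
```

R² is zero for a constant prediction at the mean of the test targets and drops below zero as soon as the squared error exceeds that. In the toy images the pooled pixel score stays positive because the easy background pixels dilute the error, while the per-image maximum score is negative.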

What we do now

We retracted the 0.8725 number publicly. We updated all 170+ references in the codebase and the docs to the honest wide-distribution numbers, which match the Ridge baseline column in the table above: pixel R² of 0.50 and image-max R² of 0.625. The model is still useful: at R² = 0.50 it is a physics-constrained screener, not a replacement for full simulation. But the number is honest and the boundaries of applicability are documented.

The bigger fix was structural. We rewrote the data generation pipeline to sample from the realistic operational distribution by default. Cross-validation now happens against the wide distribution, not the narrow training distribution. New surrogate models have to clear the Ridge baseline before they can ship under their own name.
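
A minimal sketch of what such a ship gate could look like follows. The class and function names are hypothetical and the real check may differ. The only inputs are the wide-distribution scores from the table above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WideEval:
    """R² scores measured on the wide-distribution test set (hypothetical container)."""
    pixel_r2: float
    image_max_r2: float

def clears_baseline(candidate: WideEval, baseline: WideEval, margin: float = 0.0) -> bool:
    """Return True only if the candidate beats the baseline on every metric."""
    return (candidate.pixel_r2 > baseline.pixel_r2 + margin
            and candidate.image_max_r2 > baseline.image_max_r2 + margin)

# Scores taken from the 100K realistic columns of the table.
ridge = WideEval(pixel_r2=0.500, image_max_r2=0.625)
fno_v2 = WideEval(pixel_r2=0.073, image_max_r2=-1.56)

print(clears_baseline(fno_v2, ridge))  # False: v2 would not have shipped
```

Requiring a win on every metric, rather than on an average, is a deliberate choice in this sketch: a model that improves pixel R² while getting the per-image peak wrong should still fail the gate.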

The lesson

Distribution width is everything. An ML surrogate is only as honest as the data distribution it cross-validates against. If you train on a narrow regime and validate on the same narrow regime, you will get a great R² that lies. If your customer ever pushes the model into the corners of the operating window, the lie surfaces immediately.

The fix is to define the operational range first, sample wide, and validate against the wide distribution. The number you ship should be the number on the wider distribution, not the easier one. Less impressive — but actually true.
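
As a sketch of "define the range first, sample wide": the snippet below declares a process window up front, always includes every corner of it, and fills the interior with uniform random draws. The parameter names and ranges are placeholders, not our real recipe parameters, and uniform sampling is an assumption; a space-filling design would serve the same purpose.

```python
import itertools
import random

# Placeholder process window: names and (low, high) ranges are illustrative only.
PROCESS_WINDOW = {
    "param_a": (0.0, 1.0),
    "param_b": (10.0, 50.0),
    "param_c": (-5.0, 5.0),
}

def corner_points(window):
    """Every combination of low/high bounds: 2**k corners for k parameters."""
    names = list(window)
    bounds = [window[name] for name in names]
    return [dict(zip(names, combo)) for combo in itertools.product(*bounds)]

def sample_window(window, n_random, seed=0):
    """All corners of the window plus n_random uniform draws from its interior."""
    rng = random.Random(seed)
    samples = corner_points(window)
    for _ in range(n_random):
        samples.append({name: rng.uniform(lo, hi) for name, (lo, hi) in window.items()})
    return samples

configs = sample_window(PROCESS_WINDOW, n_random=100)
print(len(configs))  # 108: 8 corners + 100 interior samples
```

Putting the corners in explicitly means the test set cannot quietly miss them, however the random draws fall.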

Honest R² = 0.50 is what we ship. The retraction registry has the full root cause document at RETRACTION_REGISTRY.md, and the surviving model is documented on the benchmarks page.