I wonder how TuringBot applies cross-validation:
My understanding of cross-validation is that the "training data" is used to optimize models, after which the remaining "test data" is used to evaluate, rank, and select models. This splitting reduces the risk of overfitting. In TuringBot, there is a tick box in the top right-hand corner that toggles between showing the error metrics for the training data and for the test data. All good so far...
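For reference, here is a minimal sketch of the split and the two error readouts I am describing (plain Python/NumPy, nothing to do with TuringBot's internals; the toy data, the 80/20 ratio, and RMSE are just the settings I use):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data standing in for the real data set.
x = np.linspace(0, 10, 500)
y = np.sin(x) + 0.1 * rng.normal(size=x.size)

# 80/20 split into "training" and "test" rows.
idx = rng.permutation(x.size)
n_train = int(0.8 * x.size)
train, test = idx[:n_train], idx[n_train:]

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# A candidate model is optimized on the training rows only...
model = np.poly1d(np.polyfit(x[train], y[train], deg=5))

# ...and then reported with separate training and test errors,
# which is what the tick box appears to toggle between.
print("training RMSE:", rmse(y[train], model(x[train])))
print("test RMSE:    ", rmse(y[test], model(x[test])))
```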
Question: How is cross-validation actually implemented in TuringBot? Does TuringBot select which equations to keep in the next iteration/generation based on the Pareto front (size vs. error metric) of the "training" data only, or of all the data combined ("training" + "testing")? I ask because TuringBot clearly doesn't use the "testing" Pareto front to select equations for the next iteration/generation. See the screen dump below, which shows that candidate equations of size 11 and 24 have a worse error metric under "testing" than equations of smaller size. Note that in the case shown I use an 80/20% split and the same RMSE error metric for both Regression and Analysis.
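To make the question concrete, this is the selection rule I am asking about, written out in Python. The candidate list is made up purely for illustration (not TuringBot output), and the numbers mimic the screen dump in that the size-11 and size-24 equations look good on training but poor on testing:

```python
# Hypothetical candidates as (size, train_error, test_error) tuples.
candidates = [
    (5,  0.90, 0.95),
    (7,  0.60, 0.70),
    (11, 0.40, 0.85),   # good on training, poor on testing
    (13, 0.35, 0.50),
    (24, 0.20, 0.90),   # good on training, poor on testing
]

def pareto_front(items, err_idx):
    """Keep an equation only if no other equation is at most as large
    and strictly better on the chosen error column."""
    front = []
    for item in items:
        size, err = item[0], item[err_idx]
        dominated = any(other[0] <= size and other[err_idx] < err
                        for other in items if other is not item)
        if not dominated:
            front.append(item)
    return front

# My question is whether the next generation is seeded from the front
# computed on the training error (err_idx=1) alone, or from a front that
# also accounts for the test error. The screen dump suggests the test
# error (err_idx=2) is not what drives the selection.
print("training front:", pareto_front(candidates, err_idx=1))
print("testing front: ", pareto_front(candidates, err_idx=2))
```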
Further thoughts: I understand that k-fold is the industry standard for cross-validation, but it is computationally demanding to implement in symbolic regression. A compromise could be to split the data three ways ("training" + "validation" + "testing"), use the "training" data only to optimize the scalar equation parameters, and then automatically select equations for the next iteration/generation from the Pareto front of the error metric on the combined "training" + "validation" data, or on the "validation" data alone. In this approach the "testing" data would be hold-out data, never submitted to the symbolic regression algorithm, and intended instead for final evaluation by the user after terminating the regression (e.g. separately in a spreadsheet), to subjectively select the best parsimonious equation. How does TuringBot's method of cross-validation fit into this approach; is the "testing" data actually used for "validation"?
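Here is a rough, self-contained sketch of that three-way workflow, again in plain Python/NumPy, with polynomials of increasing degree standing in for equations of increasing size. The 60/20/20 fractions, the toy data, and the polynomial "candidates" are all assumptions for illustration, not anything TuringBot does:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data standing in for the real data set.
x = np.linspace(0, 10, 1000)
y = np.sin(x) + 0.1 * rng.normal(size=x.size)

# Three-way split; the 60/20/20 fractions are arbitrary placeholders.
idx = rng.permutation(x.size)
n_tr, n_va = int(0.6 * x.size), int(0.2 * x.size)
tr, va, te = idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Polynomials of increasing degree stand in for equations of increasing size.
candidates = []
for degree in range(1, 8):
    coeffs = np.polyfit(x[tr], y[tr], degree)   # scalar parameters: training only
    model = np.poly1d(coeffs)
    candidates.append((degree, rmse(y[va], model(x[va])), model))

# Pareto front over (size, validation error): keep a model only if no smaller
# model beats it on the validation data. This is the selection step I am
# suggesting should drive the next iteration/generation.
front = [(d, err, m) for d, err, m in candidates
         if not any(d2 < d and e2 < err for d2, e2, _ in candidates)]

# The testing rows are held out from the selection entirely; they are only
# used at the very end, by the user, to pick the final parsimonious equation.
for d, val_err, m in front:
    print(f"size {d}: validation RMSE {val_err:.4f}, "
          f"test RMSE {rmse(y[te], m(x[te])):.4f}")
```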
Ref.: Wikipedia article on training, validation, and test data sets.
