Official Forum for TuringBot Software

General Category => Symbolic Regression General => Topic started by: xntang88 on November 24, 2021, 11:42:27 am

Title: General guide on the cross-validation settings (train/test split)
Post by: xntang88 on November 24, 2021, 11:42:27 am
Is there any guide for choosing an appropriate train/test split?
I ran a test on a data set (about 100 points) using a defined function with the cross-validation settings enabled;
I get a different R-squared value each time I stop the search and run it again.

I would like to see the following features included in the next version:
1) R-squared values included when a solution is exported as text.
2) A plot of observed data against the predictions, which would be very helpful for judging the fitted function.
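
Until such features exist, both requests can be approximated outside the program. The following is a minimal sketch, assuming the standard coefficient of determination, R^2 = 1 - SS_res / SS_tot; whether TuringBot computes R-squared in exactly this way is an assumption. The function name r_squared and the formula y = 2*x + 1 are hypothetical stand-ins for an exported solution.

```python
def r_squared(observed, predicted):
    """Standard coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Hypothetical data and a hypothetical exported formula, y = 2*x + 1.
x_values = [0.0, 1.0, 2.0, 3.0, 4.0]
observed = [1.1, 2.9, 5.2, 6.8, 9.0]
predicted = [2 * x + 1 for x in x_values]

# (observed, predicted) pairs are what an observed-vs-predicted plot shows;
# the closer the points lie to the diagonal, the better the fit.
pairs = list(zip(observed, predicted))
score = r_squared(observed, predicted)
```

Passing the pairs to any plotting tool, with the observed values on one axis and the predictions on the other, gives the requested diagnostic plot.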
Title: Re: General guide on the cross-validation settings (train/test split)
Post by: admin on November 26, 2021, 12:49:43 pm
About the different errors appearing each time: if you leave the "Test sample" setting at its default value, "Chosen randomly", the program generates a new random split each time you start a new optimization, which results in different errors each time.

To get consistent errors, you can switch this option to "The last points", so that a sequential split is used instead of a random one.
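
The difference between the two options can be illustrated with a small sketch. This is not TuringBot's implementation; the function names are hypothetical, and treating the last rows as the test sample is an assumption based on the option's name.

```python
import random


def split_random(rows, n_test, seed=None):
    """Hold out n_test rows picked at random (differs per run if seed is None)."""
    rng = random.Random(seed)
    test_idx = set(rng.sample(range(len(rows)), n_test))
    train = [r for i, r in enumerate(rows) if i not in test_idx]
    test = [r for i, r in enumerate(rows) if i in test_idx]
    return train, test


def split_last_points(rows, n_test):
    """Hold out the last n_test rows; the result is identical on every run."""
    cut = len(rows) - n_test
    return rows[:cut], rows[cut:]


rows = list(range(100))  # stand-in for a 100-row data set

# Sequential split: always the same rows, so the reported error is stable.
train_seq, test_seq = split_last_points(rows, 20)

# Random split: the held-out rows depend on the random state, so the
# error measured on them can change from one run to the next.
train_rnd, test_rnd = split_random(rows, 20)
```

With a data set of only about 100 rows, the held-out sample is small, so which rows end up in it can change the reported error noticeably.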

Your suggestions of exporting R-squared values and adding an observed/predicted plot are both great. I'll try to add them in the next release.

As a rule of thumb, the 100/1000/10000 points settings are useful to speed up the optimization if you are using a very large dataset, for instance, one with millions of rows. Otherwise, the other options are more appropriate.
Title: Re: General guide on the cross-validation settings (train/test split)
Post by: xntang88 on November 27, 2021, 03:35:24 am
Many thanks. Looking forward to the next release with a plot of observed vs. predicted data.

By the way, when a train/test split is chosen, is the R-squared value computed for the whole data set or only for the training data?

Could the train/test split setting also offer a user-defined number of rows or a custom percentage split as an option?

Also, is it possible to carry out a search for a defined function with a combined variable? For example, y = f(x1+f(), x2), where x1 and x2 are independent variables.
Title: Re: General guide on the cross-validation settings (train/test split)
Post by: xntang88 on December 01, 2021, 03:38:05 am
One more point for clarification:
If a fixed number of rows (e.g. 100 rows) is selected together with the last points option, does this mean that the first 100 rows of the data are used for training?