Official Forum for TuringBot Software
General Category => Symbolic Regression General => Topic started by: xntang88 on November 24, 2021, 11:42:27 am

I wonder whether there is any guide for choosing an appropriate train/test split.
I ran a test on a data set (about 100 points) using a defined function with the cross-validation settings, and
I get a different R-squared value each time I stop the search and run it again.
I would like to see the following additional information/functions included in the next version:
1) R-squared values included when a solution is exported as text.
2) In the outcome plot, it would be very helpful to be able to judge the fitted function via a plot of observed data against the predictions.
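
Until something like this is built in, both items can be approximated outside the program. The following is a minimal sketch in Python, assuming the observed values and the model's predictions have been copied into two equal-length lists. The function name `r_squared` and the sample numbers are illustrative only and are not part of TuringBot.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: how far the predictions are from the data.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: how far the data are from their own mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Made-up numbers, only to show the call.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print("R-squared:", r_squared(observed, predicted))

# Pairs for an observed-vs-predicted scatter plot; points close to the
# line y = x indicate a good fit.
for o, p in zip(observed, predicted):
    print(o, p)
```

Computing the value separately on the training rows and on the held-out rows shows how much of the fit carries over to data the search did not see.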

About different errors appearing each time: if you leave the "Test sample" setting at its default value, "Chosen randomly", the program generates a new random split each time you start a new optimization, so the errors differ from run to run.
To get consistent errors, you can switch this option to "The last points", so that a sequential split is used instead of a random one.
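
The difference between the two options can be illustrated with a small sketch in Python. This is a generic illustration, not TuringBot's actual implementation: the function names, the `test_fraction` parameter, and the assumption that the test sample is taken from the end of the data are all hypothetical.

```python
import random


def split_random(rows, test_fraction, seed=None):
    """Hold out a randomly chosen subset of the rows as the test sample."""
    rng = random.Random(seed)  # seed=None gives a different split every call
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


def split_sequential(rows, test_fraction):
    """Hold out the last rows as the test sample (assumed meaning of a sequential split)."""
    n_test = int(round(len(rows) * test_fraction))
    cut = len(rows) - n_test
    return rows[:cut], rows[cut:]


rows = list(range(100))

# Two unseeded random splits will almost always hold out different rows,
# so any error measured on the test sample changes between runs.
_, test_a = split_random(rows, 0.2)
_, test_b = split_random(rows, 0.2)
print("random splits identical:", test_a == test_b)

# The sequential split is deterministic: the same rows are held out every time.
_, test_c = split_sequential(rows, 0.2)
_, test_d = split_sequential(rows, 0.2)
print("sequential splits identical:", test_c == test_d)
```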
Your suggestions of exporting R-squared values and adding an observed/predicted plot are both great. I'll try to add them in the next release.
As a rule of thumb, the 100/1000/10000 points settings are useful to speed up the optimization if you are using a very large dataset, for instance, one with millions of rows. Otherwise, the other options are more appropriate.

Many thanks. I am looking forward to the next release having a plot of observed vs. predicted data.
By the way, when a train/test split is chosen, is the R-squared value computed for the whole data set or only for the training data?
For the train/test split, I wonder whether a user-defined number of rows or a custom percentage split could be included as an option.
I also wonder whether it is possible to search for a defined function with a combined variable. For example, y = f(x1+f(), x2), where x1 and x2 are independent variables.

One more thing, for clarification:
If a fixed number of rows (e.g. 100 rows) is selected together with "The last points", does this mean that the first 100 rows of the data are used for training?