General guide on the cross-validation settings (train/test split)


xntang88

Is there a guide for choosing an appropriate train/test split?
I ran a test on a data set of about 100 points, using a defined function under the cross-validation settings, and I get a different R-squared value each time I stop the run and start it again.

I would like to see the following additions in the next version:
1) R-squared values included as text in the exported solution.
2) A plot of observed data against predictions in the outcome plot, which would be very helpful for judging the fitted function.
« Last Edit: November 25, 2021, 12:00:25 pm by xntang88 »


admin

Re: General guide on the cross-validation settings (train/test split)
« Reply #1 on: November 26, 2021, 12:49:43 pm »
About different errors appearing each time: if you leave the Test sample setting as the default value, Chosen randomly, the program will generate a new random split each time you start a new optimization, resulting in different errors each time.

To get consistent errors, you can switch this option to The last points, so that a sequential split is used instead of a random one.
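To illustrate the difference outside the program, here is a minimal Python sketch. It is not the program's own code: the synthetic data, the straight-line fit, the 20-point test size, and every function name are assumptions made for the example, and it assumes that "The last points" means the final rows are held out as the test sample. It shows that a random split gives a different held-out R-squared per run, while a sequential split gives the same value every time.

```python
import random


def make_data(n=100, seed=0):
    """Synthetic data (illustrative only): y = 2x + 1 plus Gaussian noise."""
    rng = random.Random(seed)
    xs = [i / 10 for i in range(n)]
    ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination of the line on the given points."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def random_split(n, test_size, rng):
    """Hold out test_size randomly chosen row indices."""
    test_idx = sorted(rng.sample(range(n), test_size))
    held_out = set(test_idx)
    train_idx = [i for i in range(n) if i not in held_out]
    return train_idx, test_idx


def last_points_split(n, test_size):
    """Hold out the final test_size rows (assumed meaning of 'The last points')."""
    cut = n - test_size
    return list(range(cut)), list(range(cut, n))


def held_out_r2(xs, ys, train_idx, test_idx):
    """Fit on the training rows, then score on the held-out rows."""
    slope, intercept = fit_line([xs[i] for i in train_idx],
                                [ys[i] for i in train_idx])
    return r_squared([xs[i] for i in test_idx],
                     [ys[i] for i in test_idx], slope, intercept)


xs, ys = make_data()

# A new random split per run: the held-out R-squared changes with the split.
random_scores = [
    held_out_r2(xs, ys, *random_split(len(xs), 20, random.Random(seed)))
    for seed in (1, 2, 3)
]

# A sequential split: the same rows are held out on every run.
sequential_scores = [
    held_out_r2(xs, ys, *last_points_split(len(xs), 20)) for _ in range(3)
]

print("random split:    ", random_scores)
print("last-points split:", sequential_scores)
```

Fixing the random seed would also make a random split repeatable in a script like this; whether the program exposes such a setting is not stated in this thread.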

Your suggestions of exporting R-squared values and adding an observed/predicted plot are both great; I'll try to add them in the next release.

As a rule of thumb, the 100/1000/10000 points settings are useful to speed up the optimization if you are using a very large dataset, for instance, one with millions of rows. Otherwise, the other options are more appropriate.
« Last Edit: November 26, 2021, 06:57:10 pm by admin »


xntang88

Re: General guide on the cross-validation settings (train/test split)
« Reply #2 on: November 27, 2021, 03:35:24 am »
Many thanks. I am looking forward to the next release having a plot of observed vs. predicted data.

By the way, when a train/test split is chosen, is the R-squared value for the whole set of data or only training data?
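To make the distinction in this question concrete, here is a small Python sketch that computes R-squared separately on the training rows, the test rows, and the whole data set. Everything in it is hypothetical: the ten data points, the stand-in `model` function, and the 7/3 split are assumptions for illustration, and the sketch does not say which of the three values the program actually reports.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


def model(x):
    """Hypothetical fitted function standing in for a found solution."""
    return 2.0 * x + 1.0


# Made-up observations for illustration only.
xs = list(range(10))
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 11.0, 12.7, 15.3, 16.9, 19.2]

# Sequential split: first 7 rows for training, last 3 rows for testing.
cut = 7
subsets = {
    "train": (xs[:cut], ys[:cut]),
    "test": (xs[cut:], ys[cut:]),
    "all": (xs, ys),
}

# The same model generally scores differently on each subset.
scores = {
    name: r_squared(sub_y, [model(x) for x in sub_x])
    for name, (sub_x, sub_y) in subsets.items()
}

for name, value in scores.items():
    print(f"R-squared on {name} rows: {value:.4f}")
```

Because the three values generally differ, it matters which one is shown next to a solution.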

In the train/test split, I wonder whether it can include a user-defined number of rows or a customized percentage split as an option.

Meanwhile, I wonder whether it is possible to carry out a search for a defined function with a combined variable. For example, y = f(x1+f(), x2), where x1 and x2 are independent variables.
« Last Edit: November 27, 2021, 06:54:04 am by xntang88 »


xntang88

Re: General guide on the cross-validation settings (train/test split)
« Reply #3 on: December 01, 2021, 03:38:05 am »
One more point for clarification:
If a fixed number of rows (e.g. 100 rows) is selected together with the last points option, does this mean that the first 100 rows of the data are used for training?