Data input

Input files are selected in the interface by clicking on the "Input file" button. After loading the file, you can define the target variable and which other variables should be used as input, as shown below. You also have the option of using the row number (1, 2, 3...) as an input variable, which is useful for time series data.

The input file name must end in .txt or .csv, and the file must contain columns representing different variables, separated by spaces, commas or semicolons. These values can be integers, floats or floats in exponential notation (%d, %f or %e), with the decimal part separated by a dot (1.61803, not 1,61803).

Optionally, a header containing the variable names may be present in the first line of the file; in this case, those names will be used in the formulas instead of the default names col1, col2, col3, etc.

For example, the following is a valid input file:

x y z
0.01231 0.99992 0.99985
0.23180 0.97325 0.94723
0.45128 0.89989 0.80980
0.67077 0.78334 0.61363
0.89026 0.62921 0.39591
1.00000 0.54030 0.29193
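
Although the program reads such files directly, the format is easy to check outside the interface. Here is a minimal sketch, assuming Python with the pandas library (the file name data.txt is a placeholder):

import pandas as pd

# A regular-expression separator accepts the three documented delimiters:
# spaces, commas and semicolons. Regex separators require the Python engine.
df = pd.read_csv("data.txt", sep=r"[,;\s]+", engine="python")

# For a file without a header line, pass header=None and assign the
# default names col1, col2, col3, ... mentioned above:
# df = pd.read_csv("data.txt", sep=r"[,;\s]+", engine="python", header=None)
# df.columns = [f"col{i+1}" for i in range(df.shape[1])]

print(df.head())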

Search options

Before running the regression, you can select which error metric should be used in the optimization and also which base functions should be used. The available error metrics are listed below, with a short computational sketch after the list:

  • RMS error: root mean square error.
  • Mean relative error: average of (absolute value of the error) / (value of the target variable). Useful for forcing the convergence to be in terms of relative error instead of absolute error.
  • Classification accuracy: (correct predictions) / (number of data points). Only useful for integer target values and classification problems.
  • Mean error: average of (absolute value of the error). Similar to the RMS error, but puts less emphasis on outliers.
  • F1 score: 2*(precision*recall)/(precision+recall). Useful for classification problems on highly imbalanced datasets, where the target variable is 0 for the majority of inputs and a positive integer for the few cases that need to be identified. If the classification is binary, the categories can be 1 (relevant cases) and 0 (all other cases).
  • Correlation coefficient: the Pearson correlation coefficient. Useful for quickly getting the overall shape of the output right without much attention to scales.
  • Hybrid (CC+RMS): the geometric mean between the correlation coefficient and the RMS error. A compromise between the attention to scale of the RMS metric and the speed of the CC metric.
  • Maximum error: the maximum absolute difference between the predictions of the model and the target variable.
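
For reference, most of these metrics can be written down in a few lines. The following is a sketch in Python with NumPy, not the program's internal implementation; in particular, rounding the prediction to the nearest integer in the classification metrics is an assumption:

import numpy as np

def rms_error(y, p):                 # root mean square error
    return np.sqrt(np.mean((y - p) ** 2))

def mean_relative_error(y, p):       # average of |error| / |target|
    return np.mean(np.abs(y - p) / np.abs(y))

def mean_error(y, p):                # average absolute error
    return np.mean(np.abs(y - p))

def maximum_error(y, p):             # worst-case absolute error
    return np.max(np.abs(y - p))

def classification_accuracy(y, p):   # fraction of correct predictions
    return np.mean(np.round(p) == y)

def correlation_coefficient(y, p):   # Pearson correlation
    return np.corrcoef(y, p)[0, 1]

def f1_score(y, p):                  # binary case: classes 0 and 1
    p = np.round(p)
    tp = np.sum((p == 1) & (y == 1))
    precision = tp / max(np.sum(p == 1), 1)
    recall = tp / max(np.sum(y == 1), 1)
    return 2 * precision * recall / max(precision + recall, 1e-12)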

The function names follow the convention of the C math library, except for the logical functions (logical_and(x, y), greater(x, y), etc.). You can consult this page for more information.
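
To make the naming convention concrete, the sketch below contrasts a few C-style names with hypothetical definitions of the logical functions. The exact return convention of the logical functions is not specified above; the assumption here is that they return 1 or 0 so they can be composed with arithmetic:

import math

def greater(x, y):
    # Hypothetical: 1 if x > y, else 0.
    return 1.0 if x > y else 0.0

def logical_and(x, y):
    # Hypothetical: 1 if both arguments are nonzero, else 0.
    return 1.0 if (x != 0 and y != 0) else 0.0

# C math library names: sqrt, exp, log, cos, tanh, floor, ceil, ...
print(math.sqrt(2.0), greater(3.0, 1.0), logical_and(1.0, 0.0))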

Cross validation

In the same box where the search options are selected, the cross validation settings can also be defined. Using cross validation is recommended, since it provides a straightforward way to discard overfit models that are more complex than necessary.

The first available option is the train/test split, where common choices for the split can be selected. The default value, "No cross validation", disables cross validation altogether. The second option is the choice of test sample, which can be either a random subsample of the input dataset or its last points (a sequential train/test split). The latter may be useful for time series datasets.
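
The two kinds of split can be sketched as follows; this illustrates the idea in Python with NumPy and is not the program's internal code:

import numpy as np

def random_split(n, test_fraction=0.25, seed=0):
    # Random subsample: shuffle the row indices and hold part of them out.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_test = int(n * test_fraction)
    return idx[n_test:], idx[:n_test]            # train, test

def sequential_split(n, test_fraction=0.25):
    # Sequential split: the last points form the test sample,
    # preserving temporal order for time series data.
    n_test = int(n * test_fraction)
    return np.arange(n - n_test), np.arange(n - n_test, n)

train, test = sequential_split(100)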

During the optimization, you can alternate between showing the errors for the train sample and the test sample by clicking on the "Show cross validation error" box on the upper right of the interface. With this, overfit solutions can be spotted in real time.

The solutions box

The regression is started by clicking on the play button at the top of the interface. After that, the best solutions found so far will be shown in the "Solutions" box, as shown below:

Each row corresponds to the best solution of a given size encountered so far. Clicking on a solution shows it in the plot. Larger solutions are only shown if they provide a better fit than all smaller solutions.

We define the size of a solution as the sum of the sizes of the base functions that constitute it (see the sketch after this list). The sizes of the base functions are:

  • Size 1: an input variable, addition, subtraction and multiplication.
  • Size 2: division.
  • Size 3: ceil(x), floor(x) and round(x).
  • Size 4: all other functions.
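
A minimal sketch of this bookkeeping, with hypothetical helper names (whether constants also count toward the size is not specified above and is left out):

OPERATOR_SIZES = {"+": 1, "-": 1, "*": 1, "/": 2}
FUNCTION_SIZES = {"ceil": 3, "floor": 3, "round": 3}   # all others: size 4

def solution_size(variables, operators, functions):
    # Each occurrence of an input variable counts as size 1.
    size = len(variables)
    size += sum(OPERATOR_SIZES[op] for op in operators)
    size += sum(FUNCTION_SIZES.get(f, 4) for f in functions)
    return size

# Example: cos(x) + x*y contains the variables x, x, y, the operators
# + and *, and the function cos, giving 3 + 2 + 4 = 9.
print(solution_size(["x", "x", "y"], ["+", "*"], ["cos"]))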

In the solutions box, you have the option of sorting the solutions by a balance between size and accuracy: clicking on the "Function" header sorts them by (error)^2 * (size). By default, the solutions are sorted by size.
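
The effect of this sort key can be seen on a hypothetical list of solutions (the error values and formulas below are made up for illustration):

# (error, size, formula) tuples; hypothetical values.
solutions = [
    (0.20,  3, "x*y"),
    (0.05,  9, "cos(x) + x*y"),
    (0.04, 14, "cos(x) + x*y + tanh(z)"),
]

by_size = sorted(solutions, key=lambda s: s[1])              # default order
by_balance = sorted(solutions, key=lambda s: s[0]**2 * s[1]) # error^2 * size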

Plot settings

In the "Other options" part of the interface, you can at any time change the value shown in the x axis of the plot (which is the row number by default), alternate the x or y scales between regular and log, and change the variable shown in the y axis.

You also have the option of plotting the residual error instead of the solution itself on the y axis, which is useful for diagnosing solutions: the errors of a good solution will be randomly distributed around zero, while those of a worse solution will show some systematic pattern.
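
The same diagnostic can be reproduced outside the interface. A small sketch with made-up data, assuming Python with NumPy and matplotlib:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: y is the target, prediction is a deliberately
# imperfect model (a truncated series for cos).
x = np.linspace(0, 1, 100)
y = np.cos(x)
prediction = 1 - x**2 / 2
residual = y - prediction

# Random scatter around zero indicates a good fit; the smooth curve
# produced here signals a structural misfit.
plt.axhline(0, color="gray")
plt.scatter(x, residual)
plt.xlabel("x")
plt.ylabel("residual")
plt.show()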

Log

In the "Log" tab of the interface, information is shown about how many formulas were generated in the current optimization, how many cycles of Simulated Annealing were run so far, and for how long the optimization has been running. A log message is also shown every time a new solution is encountered, so that you can keep track of the progress.