Overview

TuringBot, named after the great mathematician and computation pioneer Alan Turing, is a Symbolic Regression program. It uses a novel algorithm based on simulated annealing to efficiently find mathematical formulas that describe your data.

Data is read from TXT or CSV files, and the target and input columns can be selected through the interface. Different search metrics are available, allowing the program to find formulas that solve both regression and classification problems.

Its main features are the following:

  • Pareto optimization: the software simultaneously tries to find the best formulas of all possible sizes. It will give you not only a single formula as output, but a set of formulas of increasing complexity to choose from.
  • Built-in cross-validation: allows you to easily rule out overfit solutions.
  • Export solutions as Python, C/C++, or plain text (see the sketch after this list).
  • Multiprocessing.
  • Written in a low-level programming language (no Python). Extremely fast.
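
As an illustration of the export feature, a solution exported as Python code can be called as a small self-contained function from your own scripts. The sketch below is purely hypothetical: the function name, formula, and constants are not taken from the program, and the actual output depends on your data.

import math

# Purely hypothetical example of what an exported Python solution might look
# like; the real formula and constants are produced for your specific dataset.
def solution(x, y):
    return 0.5 * math.cos(x) + 1.2 * x * y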

Data input

Input files are selected through the interface by clicking on the "Input file" button, as shown below:

The input file name must end in .txt or .csv, and the file must contain columns representing different variables, separated by spaces, commas, or semicolons. The values can be integers, floats, or floats in exponential notation (%d, %f, or %e), with the decimal part separated by a dot (1.61803, not 1,61803).

Optionally, a header containing the variable names may be present in the first line of the file — in this case, those names will be used in the formulas instead of the default names, which are col1, col2, col3, etc. You also have the option of using the row number (1, 2, 3...) as an input variable, which is useful for time series data.

For example, the following is a valid input file:

x y z
0.01231 0.99992 0.99985
0.23180 0.97325 0.94723
0.45128 0.89989 0.80980
0.67077 0.78334 0.61363
0.89026 0.62921 0.39591
1.00000 0.54030 0.29193
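
If your data lives in a Python environment, a file in this format can be generated with a few lines of code. Below is a minimal sketch, assuming pandas is available; the column names and values are placeholders for your own data.

import pandas as pd

# Placeholder data; replace with your own columns and values.
df = pd.DataFrame({
    "x": [0.01231, 0.23180, 0.45128],
    "y": [0.99992, 0.97325, 0.89989],
    "z": [0.99985, 0.94723, 0.80980],
})

# Write a space-separated .txt file with the variable names in the header row,
# using a dot as the decimal separator (pandas' default).
df.to_csv("input.txt", sep=" ", index=False)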

Search options

Before starting the search, you can select which error metric should be used in the optimization and also which base functions should be used. The available error metrics are:

  • RMS error: Root mean square error.
  • Mean relative error: Average of (absolute value of the error) / (absolute value of the target variable). Makes the convergence be in terms of relative error instead of absolute error.
  • Classification accuracy: (correct predictions) / (number of data points). Only useful for integer target values and classification problems.
  • Mean error: Average of (absolute value of the error). Similar to the RMS error, but puts less emphasis on outliers.
  • F1 score: 2*(precision*recall)/(precision+recall), where precision and recall are the usual classification quantities. This metric is useful for classification problems on highly imbalanced datasets, where the target variable is 0 for the majority of inputs and a positive integer for the few cases that need to be identified. If the classification is binary, the categories can be 1 (relevant cases) and 0 (all other cases).
  • Correlation coefficient: Corresponds to the Pearson correlation coefficient. Useful for quickly getting the overall shape of the output right without much attention to scales.
  • Hybrid (CC+RMS): The geometric mean between the correlation coefficient and the RMS error. This is a compromise between the attention to scale of the RMS metric and the speed of the CC metric.
  • Maximum error: The maximum absolute difference between the predictions of the model and the target variable.
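
For reference, the sketch below shows how some of these metrics could be computed with NumPy from arrays of predictions and targets. It only illustrates the formulas above and is not the program's internal implementation; in particular, rounding the predictions for the classification accuracy is an assumption made for the example.

import numpy as np

def error_metrics(pred, target):
    """Illustrative implementations of some of the metrics listed above."""
    err = pred - target
    return {
        "RMS error": np.sqrt(np.mean(err ** 2)),
        "Mean relative error": np.mean(np.abs(err) / np.abs(target)),
        "Mean error": np.mean(np.abs(err)),
        # Assumes predictions are rounded to the nearest integer class.
        "Classification accuracy": np.mean(np.round(pred) == target),
        "Correlation coefficient": np.corrcoef(pred, target)[0, 1],
        "Maximum error": np.max(np.abs(err)),
    }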

The function names follow the conventions of the C math library, except for the logical functions (logical_and(x, y), greater(x, y), etc.). You can consult this page for more information.
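
As an illustration, a formula exported as plain text with these conventions can be evaluated in Python by supplying equivalents for the C-style and logical function names. The 1.0/0.0 return convention for the logical functions and the example formula below are assumptions made only for the sake of the sketch.

import math

# Assumed semantics: logical functions return 1.0 for true and 0.0 for false.
def greater(a, b):
    return 1.0 if a > b else 0.0

def logical_and(a, b):
    return 1.0 if (a != 0.0 and b != 0.0) else 0.0

# Hypothetical exported formula using C math library names (pow, sqrt, fabs).
def formula(x, y):
    return greater(x, 0.5) * math.pow(y, 2) + math.sqrt(math.fabs(x))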

Cross-validation

The cross-validation settings can be selected in the same box as the search options. Using cross-validation is recommended, since it provides a straightforward way to discard overfit models that are more complex than necessary.

The size of the train/test split can be selected from the menu. The default value, "No cross-validation", disables cross-validation altogether. It is also possible to select how the train sample should be generated: whether it should consist of randomly selected rows or of the first rows of the file in sequential order.
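
To make the two options concrete, the sketch below reproduces the difference between a random and a sequential train split on a generic dataset; it only illustrates the concept and is not the program's internal code.

import numpy as np

def split_rows(n_rows, train_fraction=0.5, random=True, seed=0):
    """Return (train, test) row indices for either split strategy."""
    n_train = int(n_rows * train_fraction)
    if random:
        # Random rows: shuffle the row indices before splitting.
        idx = np.random.default_rng(seed).permutation(n_rows)
    else:
        # Sequential: the first rows of the file are used for training.
        idx = np.arange(n_rows)
    return idx[:n_train], idx[n_train:]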

During the optimization, you can alternate between showing the errors for the train sample and the test sample by clicking on the "Show cross-validation error" box on the upper right of the interface. With this, overfit solutions can be spotted in real time.

The solutions box

The regression is started by clicking on the play button at the top of the interface. After that, the best solutions found so far will be shown in the "Solutions" box, as shown below:

Each row corresponds to the best solution of a given size encountered so far. Clicking on a solution shows it in the plot and displays its stats in the "Solution info" box. Larger solutions are only shown if they provide a better fit than all smaller solutions.

We define the size of a solution as the sum of the sizes of the base functions that constitute it. The sizes of the base functions are:

  • Size 1: an input variable, sum, subtraction, and multiplication.
  • Size 2: division.
  • Size 3: ceil(x), floor(x) and round(x).
  • Size 4: all other functions.

In the Solutions box, you have the option of sorting the solutions by a balance between size and accuracy by clicking on the "Function" header, which sorts them by (error)^2 * (size). By default, the solutions are sorted by size.
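
For example, under these rules a formula such as x*y + cos(z) has size 9: three input variables (1 each), one multiplication (1), one sum (1), and one cos (4). The sketch below illustrates the "Function" ordering on a few hypothetical solutions; the ascending order is an assumption made for the example.

# Hypothetical (size, error, formula) entries; the real values come from the
# Solutions box during a search.
solutions = [
    (1, 0.41, "x"),
    (5, 0.12, "x*y + z"),
    (9, 0.03, "x*y + cos(z)"),
]

# Clicking the "Function" header sorts by (error)^2 * (size); here the best
# (smallest) value is placed first.
ranked = sorted(solutions, key=lambda s: s[1] ** 2 * s[0])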

Plot settings

In the "Other options" part of the interface, you can at any time change the value shown in the x-axis of the plot (which is the row number by default), alternate the x or y scales between regular and log, and change the variable shown in the y-axis.

You also have the option of plotting the residual error instead of the solution itself on the y-axis of the plot, which is useful for diagnosing the solutions: the errors of a good solution should ideally be randomly distributed around zero.
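
If you prefer to run the same check outside the interface, the sketch below plots residuals with matplotlib. The arrays are placeholders taken from the example input file earlier on this page, where y is approximately cos(x); in practice you would use your own target column and the predictions of an exported solution.

import numpy as np
import matplotlib.pyplot as plt

# Placeholder data based on the example input file above: target is the y
# column and the prediction comes from a candidate formula cos(x).
x = np.array([0.01231, 0.23180, 0.45128, 0.67077, 0.89026, 1.00000])
target = np.array([0.99992, 0.97325, 0.89989, 0.78334, 0.62921, 0.54030])
pred = np.cos(x)

# The residuals of a good solution should scatter randomly around zero.
residuals = pred - target
plt.scatter(np.arange(len(residuals)), residuals, s=10)
plt.axhline(0.0, color="red")
plt.xlabel("row number")
plt.ylabel("residual error")
plt.show()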

Log

In the "Log" tab of the interface, information is shown about how many formulas were generated in the current optimization, how many cycles of Simulated Annealing were run so far, and for how long the optimization has been running. A log message is also shown every time a new solution is encountered so that you can keep track of the progress.