Symbolic Regression

The core algorithm of TuringBot is Symbolic Regression, a technique that searches for mathematical formulas that best approximate a target variable as a function of a set of input variables, while keeping those formulas as simple as possible.
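To make the trade-off between accuracy and simplicity concrete, here is a minimal Python sketch. It is not TuringBot's code: the toy data, the candidate formulas, and the node-count complexity measure are all assumptions chosen for illustration. It keeps only the formulas that are more accurate than every simpler one.

```python
import math

# Toy data generated from y = x**2 + 1 (an assumption made for this example).
DATA = [(x, x * x + 1.0) for x in [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]]

# Hypothetical candidates: (formula text, callable, complexity as a node count).
CANDIDATES = [
    ("3", lambda x: 3.0, 1),
    ("x + 2", lambda x: x + 2.0, 3),
    ("x*x + 1", lambda x: x * x + 1.0, 5),
    ("x*x + 1 + 0*cos(x)", lambda x: x * x + 1.0 + 0.0 * math.cos(x), 9),
]

def rms_error(f, data):
    """Root-mean-square error of formula f over (x, y) pairs."""
    return math.sqrt(sum((f(x) - y) ** 2 for x, y in data) / len(data))

def pareto_front(candidates, data):
    """Keep formulas that are more accurate than every simpler formula."""
    front = []
    best_error = math.inf
    for text, f, complexity in sorted(candidates, key=lambda c: c[2]):
        error = rms_error(f, data)
        if error < best_error:
            front.append((text, complexity, error))
            best_error = error
    return front

front = pareto_front(CANDIDATES, DATA)
for text, complexity, error in front:
    print(f"complexity={complexity}  error={error:.3f}  y = {text}")
```

The longest candidate is dropped because it is no more accurate than the shorter "x*x + 1", which is the kind of needless complexity a symbolic regression search tries to avoid.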

The power of this technique is that, unlike mainstream machine learning methods such as neural networks and random forests, the generated models need no complicated data structures to represent them, which makes them much easier to interpret and much more portable.

The set of all possible mathematical formulas is very large, which makes it challenging to find the relevant ones efficiently. TuringBot accomplishes this by applying an algorithm called Simulated Annealing to the search, coupled with several heuristic optimizations that have been statistically verified to lead to faster convergence.
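The following Python sketch shows the general idea of simulated annealing applied to a formula search. It is a toy under stated assumptions, not TuringBot's implementation: the tuple-based formula representation, the three operators, the mutation rules, the size penalty, and the cooling schedule are all hypothetical choices, and none of TuringBot's heuristic optimizations are included.

```python
import math
import random

# Hypothetical toy grammar: a formula is "x", a float constant,
# or a nested tuple such as ("add", "x", 2.0).
BINARY_OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}

def evaluate(expr, x):
    if expr == "x":
        return x
    if isinstance(expr, float):
        return expr
    op, left, right = expr
    return BINARY_OPS[op](evaluate(left, x), evaluate(right, x))

def size(expr):
    """Number of nodes in the formula, used as a complexity measure."""
    if isinstance(expr, tuple):
        return 1 + size(expr[1]) + size(expr[2])
    return 1

def cost(expr, data, penalty=0.01):
    """Mean squared error plus a small penalty per node."""
    total = 0.0
    for x, y in data:
        diff = evaluate(expr, x) - y
        total += diff * diff
    return total / len(data) + penalty * size(expr)

def random_leaf(rng):
    return "x" if rng.random() < 0.5 else float(rng.randint(1, 5))

def mutate(expr, rng):
    """Return a slightly modified copy of expr."""
    if isinstance(expr, tuple) and rng.random() < 0.7:
        op, left, right = expr
        choice = rng.random()
        if choice < 0.4:
            return (op, mutate(left, rng), right)
        if choice < 0.8:
            return (op, left, mutate(right, rng))
        return (rng.choice(list(BINARY_OPS)), left, right)
    # Replace this node with a leaf or with a new two-leaf subtree.
    if rng.random() < 0.5:
        return random_leaf(rng)
    return (rng.choice(list(BINARY_OPS)), random_leaf(rng), random_leaf(rng))

def anneal(data, steps=5000, t_start=1.0, t_end=0.001, max_size=15, seed=0):
    rng = random.Random(seed)
    current = random_leaf(rng)
    current_cost = cost(current, data)
    best, best_cost = current, current_cost
    for step in range(steps):
        # Geometric cooling from t_start towards t_end.
        t = t_start * (t_end / t_start) ** (step / steps)
        candidate = mutate(current, rng)
        if size(candidate) > max_size:
            continue
        candidate_cost = cost(candidate, data)
        delta = candidate_cost - current_cost
        # Metropolis rule: always accept improvements, and accept a
        # worse formula with a probability that shrinks as t falls.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
    return best, best_cost

# Toy target (an assumption for this example): y = x*x + 2*x.
DATA = [(float(x), x * x + 2.0 * x) for x in range(-3, 4)]
best, best_cost = anneal(DATA)
print(best, best_cost)
```

Accepting some worse formulas early on lets the search escape poor local optima, while the falling temperature makes it increasingly selective towards the end.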

Input data format

The software takes text or CSV files as input, in which each column represents a different variable. After loading an input file, you can select through the interface which column should be the target and which others should be used as input variables, and then simply start the search.
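As a rough picture of this column-per-variable layout, the Python sketch below reads a small CSV and separates a target column from the input columns. The column names, the presence of a header row, and the `load_columns` helper are assumptions made for this example; the Documentation page describes the formats TuringBot actually accepts.

```python
import csv
import io

# Hypothetical file contents: one column per variable, header row assumed.
SAMPLE = (
    "x1,x2,y\n"
    "1.0,2.0,5.0\n"
    "2.0,0.5,4.5\n"
    "3.0,1.0,10.0\n"
)

def load_columns(text, target, inputs=None):
    """Split CSV text into input rows X and a target column y."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    if inputs is None:
        # By default, every column other than the target is an input.
        inputs = [name for name in reader.fieldnames if name != target]
    X = [[float(row[name]) for name in inputs] for row in rows]
    y = [float(row[target]) for row in rows]
    return inputs, X, y

names, X, y = load_columns(SAMPLE, target="y")
print(names, X, y)
```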

For more information about the input file formats and search settings, please visit our Documentation page.

Features

  • Export solutions as Python or C/C++
  • Built-in cross validation
  • Multiprocessing
  • Extremely fast, written in a low-level programming language
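The idea behind validating a formula on data it has not seen can be sketched in a few lines of Python. This is only an illustration: how TuringBot splits the data is not described here, and the split fraction, the toy data, and both formulas are assumptions.

```python
import math

def split_rows(rows, train_fraction=0.75):
    """Use the first part of the rows for training, the rest for testing."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

def rms_error(f, rows):
    return math.sqrt(sum((f(x) - y) ** 2 for x, y in rows) / len(rows))

# Toy data from y = 2*x + 1 (an assumption made for this example).
ROWS = [(float(x), 2.0 * x + 1.0) for x in range(8)]
train, test = split_rows(ROWS)

def simple(x):
    return 2.0 * x + 1.0

def overfit(x):
    # Matches the six training points exactly, but bends away beyond them.
    bump = x * (x - 1) * (x - 2) * (x - 3) * (x - 4) * (x - 5)
    return 2.0 * x + 1.0 + 0.01 * bump

for name, f in [("simple", simple), ("overfit", overfit)]:
    print(name, rms_error(f, train), rms_error(f, test))
```

Both formulas look perfect on the training rows; only the held-out rows reveal that the longer one does not generalize.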

Applications

TuringBot can be used to solve both regression and classification problems. The latter is done by representing categorical variables as integer numbers and optimizing with one of the available classification metrics.
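A minimal Python sketch of the classification case follows. The label names, the formula, the rounding of its output to the nearest class, and the use of accuracy as the metric are all assumptions for illustration, not a description of TuringBot's internals.

```python
# Category names are mapped to the integers 0 and 1 by their position.
LABELS = ["no", "yes"]

# Hypothetical rows: (input value, category).
ROWS = [
    (0.5, "no"), (1.0, "no"), (1.4, "yes"),
    (1.8, "yes"), (2.5, "yes"), (3.0, "yes"),
]

def encode(label):
    return LABELS.index(label)

def formula(x):
    # Hypothetical formula of the kind a search might return.
    return 0.5 * x - 0.3

def predict(x):
    """Round the formula output to the nearest valid class index."""
    raw = round(formula(x))
    return min(max(raw, 0), len(LABELS) - 1)

def accuracy(rows):
    hits = sum(1 for x, label in rows if predict(x) == encode(label))
    return hits / len(rows)

print(accuracy(ROWS))
```

Here the formula misclassifies only the row with input 1.4, so five of the six rows are predicted correctly.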

The range of applications is wide. To name a few, one can:

  • Discover hidden relationships between variables in multidimensional datasets.
  • Find an optimal formula that describes a time series, and use it to forecast future values.
  • Create simple and fast interpolations to tabulated data.

All of this comes with the advantage that the generated models are not black boxes but explicit formulas, which offer insight into the data being modeled.