How to create an equation for data points

Finding an equation from a list of values calls for a technique called symbolic regression. The idea is to search the space of possible mathematical formulas for the most accurate ones, while keeping those formulas as simple as possible.

In this tutorial, we are going to show how to find formulas using the desktop symbolic regression software TuringBot, which is very easy to use.

How symbolic regression works

Symbolic regression starts from a set of base functions to be used in the search, such as addition, multiplication, sin(x), exp(x), etc., and then tries to combine those functions in many different ways, with the goal of finding a model that predicts a target variable as accurately as possible.

The simplicity of a formula is as important as its accuracy. A huge formula can predict the data points with perfect accuracy, but if the model has as many free parameters as there are points, it is not really informative. For this reason, a symbolic regression optimization will discard a larger formula if it finds a smaller one that performs just as well.
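The trade-off between accuracy and simplicity can be illustrated with a toy brute-force search. This is only a minimal sketch under stated assumptions, not how TuringBot works internally: the data, the tiny set of base functions, the node-count definition of complexity, and all function names are hypothetical choices made for illustration.

```python
import itertools
import math

# Hypothetical toy data: y = 2*x + 1 sampled at a few points.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

# Expressions are nested tuples; leaves are "x" or a numeric constant.
LEAVES = ["x", 1.0, 2.0]
BINARY = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}
UNARY = {"sin": math.sin, "exp": math.exp}

def evaluate(expr, x):
    """Evaluate an expression tree at the point x."""
    if expr == "x":
        return x
    if isinstance(expr, float):
        return expr
    op, *args = expr
    if op in UNARY:
        return UNARY[op](evaluate(args[0], x))
    return BINARY[op](evaluate(args[0], x), evaluate(args[1], x))

def complexity(expr):
    """Assumed complexity measure: number of nodes in the tree."""
    if not isinstance(expr, tuple):
        return 1
    return 1 + sum(complexity(a) for a in expr[1:])

def rmse(expr):
    """Root-mean-square error of the expression on the toy data."""
    squared = sum((evaluate(expr, x) - y) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(squared / len(xs))

def candidates(depth):
    """Yield every expression tree up to the given depth."""
    if depth == 0:
        yield from LEAVES
        return
    smaller = list(candidates(depth - 1))
    yield from smaller
    for op in UNARY:
        for a in smaller:
            yield (op, a)
    for op in BINARY:
        for a, b in itertools.product(smaller, repeat=2):
            yield (op, a, b)

def pareto_front(depth=2):
    """Keep the best formula per complexity, then drop any formula
    that is not strictly more accurate than every simpler one."""
    best = {}
    for expr in candidates(depth):
        c, e = complexity(expr), rmse(expr)
        if c not in best or e < best[c][0]:
            best[c] = (e, expr)
    front, lowest = [], math.inf
    for c in sorted(best):
        e, expr = best[c]
        if e < lowest:
            front.append((c, e, expr))
            lowest = e
    return front

if __name__ == "__main__":
    for c, e, expr in pareto_front():
        print(c, round(e, 4), expr)
```

Exhaustive enumeration only works for very small formulas, since the number of candidates grows quickly with depth. Practical tools replace it with a heuristic search.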

Finding a formula with TuringBot

Finding equations from data points with TuringBot is a simple process. The first step is selecting the input file with the data through the interface. This input file should be in TXT or CSV format. After it has been loaded, the target variable can be selected (by default it will be the last column in the file), and the search can be started.
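As a rough illustration of the input format, the following sketch writes a small CSV file with the target variable in the last column. The file name, the column names `x` and `y`, the sine-shaped data, and the presence of a header row are all assumptions made for this example.

```python
import csv
import math

def write_dataset(path, n=50):
    """Write n rows of (x, y) with the target y in the last column."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        # Header row with column names (assumed to be accepted as input).
        writer.writerow(["x", "y"])
        for i in range(n):
            x = i * 0.1
            writer.writerow([x, math.sin(x)])

if __name__ == "__main__":
    write_dataset("data.csv")  # hypothetical file name
```

Placing the target in the last column means the default target selection described above should apply without further configuration.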

Several options are available in the menus on the left, such as setting a test/train split to detect overfit solutions, selecting which base functions should be used, and selecting the search metric. The default metric is root-mean-square error, but it can also be set to classification accuracy, mean relative error and others. For this example, we are going to keep it simple and just use the defaults.
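For readers who want to see what these options compute, here is a minimal sketch of two of the metrics and of a test/train split. The formulas are the standard textbook definitions. Whether TuringBot uses exactly these definitions, or shuffles the rows before splitting, is an assumption, and the function names are hypothetical.

```python
import math
import random

def rmse(y_true, y_pred):
    """Root-mean-square error between targets and predictions."""
    squared = sum((p - t) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))

def mean_relative_error(y_true, y_pred):
    """Mean of |prediction - target| / |target|.
    Assumes that no target value is zero."""
    total = sum(abs(p - t) / abs(t) for t, p in zip(y_true, y_pred))
    return total / len(y_true)

def train_test_split(rows, train_fraction=0.5, seed=0):
    """Shuffle the rows and split them into a train and a test part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

A formula is fitted on the train part only. If its error on the test part is much larger than on the train part, it is probably overfit.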

The optimization is started by clicking on the play button at the top of the interface. The best formulas found so far are shown in the solutions box, ordered by complexity.

The software allows the solutions to be exported to common programming languages from the menu, and also to simply be exported as text. Here are the formulas in the example above exported in text format:

```
Complexity   Error      Function
1            1.91399    -0.0967549
3            1.46283    0.384409*x
4            1.362      atan(x)
5            1.18186    0.546317*x-1.00748
6            1.11019    asinh(x)-0.881587
9            1.0365     ceil(asinh(x))-1.4131
13           0.985787   round(tan(floor(0.277692*x)))
15           0.319857   cos(x)*(1.96036-x)*tan(x)
19           0.311375   cos(x)*(1.98862-1.02261*x)*tan(1.00118*x)
```
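An exported formula can be evaluated directly in code. The sketch below transcribes the complexity-15 formula from the table by hand into Python (the function names are hypothetical, and this is not the software's own export output). It also shows a simplification that is easy to miss: since cos(x)*tan(x) equals sin(x) wherever cos(x) is nonzero, the formula is equivalent to sin(x)*(1.96036-x).

```python
import math

def formula_15(x):
    """Complexity-15 formula from the table, transcribed by hand."""
    return math.cos(x) * (1.96036 - x) * math.tan(x)

def simplified_15(x):
    """Same formula after replacing cos(x)*tan(x) with sin(x)."""
    return math.sin(x) * (1.96036 - x)

if __name__ == "__main__":
    for x in (0.3, 1.0, 2.5):
        print(x, formula_15(x), simplified_15(x))
```

Simplifying a result by hand in this way can make the pattern found by the search easier to interpret.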

Conclusion

In this tutorial, we have seen how symbolic regression can be used to find formulas from values. Symbolic regression is very different from regular curve-fitting methods, since no assumption is made about what the shape of the formulas should be. This allows patterns to be found in datasets with an arbitrary number of dimensions, making symbolic regression a general purpose machine learning technique.

An alternative to the Eureqa software

Eureqa is a symbolic regression program based on genetic programming. Here we will talk about an alternative to it called TuringBot.

Eureqa used to be developed by a company called Nutonian. A few years ago this company was acquired by DataRobot, and Eureqa was removed from the market afterwards.

The program gained popularity due to its ease of use. Finding mathematical formulas from data using its graphical interface was very convenient and required no coding.

The alternative: TuringBot

An alternative to Eureqa exists and is called TuringBot. It uses a completely different approach to solve symbolic regression problems, based on a simulated annealing algorithm. It can be downloaded for free from the official website.
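To give an idea of what simulated annealing does, here is a minimal generic sketch that tunes two numeric parameters. It is not TuringBot's implementation, which is not described in this article: the cooling schedule, step size, function names and toy data are all assumptions. The key point is that a worse candidate is sometimes accepted, with a probability that shrinks as the temperature falls, which helps the search escape local minima.

```python
import math
import random

def simulated_annealing(cost, start, steps=5000, t_start=1.0, t_end=1e-3, seed=0):
    """Minimize cost(params) by random perturbation with cooling."""
    rng = random.Random(seed)
    current, current_cost = list(start), cost(start)
    best, best_cost = list(current), current_cost
    for i in range(steps):
        # Geometric cooling from t_start down towards t_end.
        t = t_start * (t_end / t_start) ** (i / steps)
        candidate = [p + rng.gauss(0.0, 0.1) for p in current]
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        # Always accept improvements; accept worse moves with
        # probability exp(-delta / t).
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
    return best, best_cost

# Hypothetical toy problem: recover a and b in y = a*x + b.
XS = [0.0, 1.0, 2.0, 3.0, 4.0]
YS = [0.5 * x - 1.0 for x in XS]

def line_rmse(params):
    a, b = params
    squared = sum((a * x + b - y) ** 2 for x, y in zip(XS, YS))
    return math.sqrt(squared / len(XS))

if __name__ == "__main__":
    print(simulated_annealing(line_rmse, [0.0, 0.0]))
```

In symbolic regression the perturbation step would modify the formula itself rather than only its constants, but the accept-or-reject logic follows the same idea.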

It features a variety of search metrics, allowing many different kinds of machine learning problems to be solved. Those include the basic RMS error and mean error regression metrics, but also classification accuracy, F1 score (for rare event classification) and correlation coefficient.

The software allows overfit solutions to be easily ruled out with its cross-validation feature. A test/train split can be enabled through the interface, and the out-of-sample error shown in the solutions box can be used to select the formula with the best trade-off between size and accuracy.
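One possible way to make this selection systematic is sketched below. The rule used here, picking the simplest formula whose out-of-sample error is within a tolerance of the best one, is just one reasonable heuristic and an assumption of this example, not a feature of the software. The candidate numbers are made up for illustration.

```python
def pick_formula(candidates, tolerance=0.10):
    """candidates: list of (complexity, out_of_sample_error) pairs.
    Returns the simplest candidate whose error is at most
    (1 + tolerance) times the lowest out-of-sample error."""
    lowest = min(error for _, error in candidates)
    threshold = lowest * (1.0 + tolerance)
    acceptable = [c for c in candidates if c[1] <= threshold]
    return min(acceptable, key=lambda c: c[0])

if __name__ == "__main__":
    # Hypothetical (complexity, out-of-sample error) pairs.
    made_up = [(1, 1.9), (5, 1.2), (15, 0.32), (19, 0.45)]
    print(pick_formula(made_up))
```

In this made-up example the largest formula has a higher out-of-sample error than a smaller one, which is the typical sign of overfitting.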

Compared to Eureqa, the symbolic regression implementation of TuringBot seems to yield better results in many cases. Eureqa overly restricts itself to simpler and less recursive formulas, and often results in polynomial fits to the data that diverge and lose usefulness outside the training domain. We also find that TuringBot is noticeably faster than Eureqa.