## How to create an equation for data points

To find an equation from a list of values, you can use a technique called symbolic regression. The idea is to search over the space of all possible mathematical formulas for the ones with the greatest accuracy, while trying to keep those formulas as simple as possible.

In this tutorial, we are going to show how to find formulas using the desktop symbolic regression software TuringBot, which is very easy to use.

### How symbolic regression works

Symbolic regression starts from a set of base functions to be used in the search, such as addition, multiplication, sin(x), and exp(x). It then tries to combine those functions in all possible ways, with the goal of finding a model that predicts a target variable as accurately as possible. Some examples of base functions used by TuringBot are the following:

The simplicity of a formula is as important as its accuracy. A huge formula can predict the data points with perfect accuracy, but if the number of free parameters in the model is the same as the number of points, then the model is not really informative. For this reason, a symbolic regression optimization will discard a larger formula if it finds a smaller one that performs just as well.
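To make the idea concrete, here is a minimal sketch of a brute-force search in Python. It is not how TuringBot works internally: the set of base functions, the node-count complexity measure, and the names `BASE`, `candidates` and `best_formula` are all assumptions made for this illustration.

```python
import itertools
import math

# Hypothetical set of unary base functions, chosen for illustration only.
BASE = {"sin": math.sin, "cos": math.cos, "sqrt": math.sqrt, "atan": math.atan}

def candidates(max_depth=2):
    """Yield (complexity, text, function) for x, f(x), f(g(x))."""
    yield 1, "x", lambda x: x
    for depth in range(1, max_depth + 1):
        for names in itertools.product(BASE, repeat=depth):
            def f(x, names=names):
                for name in reversed(names):  # apply the innermost function first
                    x = BASE[name](x)
                return x
            text = "(".join(names) + "(x" + ")" * depth
            yield 1 + depth, text, f  # complexity = number of nodes (an assumption)

def rmse(f, xs, ys):
    try:
        return math.sqrt(sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs))
    except ValueError:  # e.g. sqrt of a negative number
        return math.inf

def best_formula(xs, ys):
    # Lowest error wins; among equal errors, the smaller formula wins.
    scored = ((rmse(f, xs, ys), complexity, text) for complexity, text, f in candidates())
    return min(scored)

xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [math.sqrt(x) for x in xs]
error, complexity, text = best_formula(xs, ys)
print(text, complexity, error)
```

Exhaustive enumeration like this only works for tiny search spaces, which is why practical tools rely on smarter search strategies.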

### Finding a formula with TuringBot

Finding equations from data points with TuringBot is a simple process. The first step is selecting the input file with the data through the interface. This input file should be in TXT or CSV format. After it has been loaded, the target variable can be selected (by default it will be the last column in the file), and the search can be started. This is what the interface looks like:

Several options are available in the menus on the left, such as setting a test/train split to detect overfit solutions, selecting which base functions should be used, and selecting the search metric. The metric is root-mean-square error by default, but it can also be set to classification accuracy, mean relative error, and others. For this example, we are going to keep it simple and just use the defaults.
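For reference, two of the metrics mentioned above can be written in a few lines of Python. These are the textbook definitions; whether TuringBot computes them in exactly this way is an assumption, and the function names are ours.

```python
import math

def rms_error(predicted, actual):
    """Root-mean-square error: sqrt(mean((p - a)^2))."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def mean_relative_error(predicted, actual):
    """Mean of |p - a| / |a|; assumes no actual value is zero."""
    return sum(abs(p - a) / abs(a) for p, a in zip(predicted, actual)) / len(actual)

actual = [1.0, 2.0, 4.0]
predicted = [1.0, 2.0, 2.0]
print(rms_error(predicted, actual))
print(mean_relative_error(predicted, actual))
```

RMS error penalizes large absolute misses, while mean relative error weighs each miss against the size of the true value, so the two can rank formulas differently.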

The optimization is started by clicking on the play button at the top of the interface. The best formulas found so far will be shown in the solutions box, ordered by complexity:

The software allows the solutions to be exported to common programming languages from the menu, and also to simply be exported as text. Here are the formulas in the example above exported in text format:

```
Complexity   Error      Function
1            1.91399    -0.0967549
3            1.46283    0.384409*x
4            1.362      atan(x)
5            1.18186    0.546317*x-1.00748
6            1.11019    asinh(x)-0.881587
9            1.0365     ceil(asinh(x))-1.4131
13           0.985787   round(tan(floor(0.277692*x)))
15           0.319857   cos(x)*(1.96036-x)*tan(x)
19           0.311375   cos(x)*(1.98862-1.02261*x)*tan(1.00118*x)
```
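As a quick check, some of the exported formulas above can be pasted into Python and evaluated. This is a sketch that assumes the text export is valid Python once the math functions are imported; the function names `f5`, `f6`, `f15` and `f15_simplified` are ours.

```python
import math
from math import asinh, cos, sin, tan

def f5(x):
    return 0.546317*x-1.00748

def f6(x):
    return asinh(x)-0.881587

def f15(x):
    return cos(x)*(1.96036-x)*tan(x)

def f15_simplified(x):
    # cos(x)*tan(x) equals sin(x) wherever cos(x) is nonzero,
    # so the complexity-15 formula is (1.96036 - x)*sin(x) in disguise.
    return (1.96036 - x) * sin(x)

for x in (0.5, 1.0, 2.0):
    print(x, f5(x), f6(x), f15(x), f15_simplified(x))
```

The large drop in error between complexity 13 and 15 corresponds to the point where the search found this (constant - x)*sin(x) shape.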

### Conclusion

In this tutorial, we have seen how symbolic regression can be used to find formulas from values. Symbolic regression is very different from regular curve-fitting methods, since no assumption is made about what the shape of the formulas should be. This allows patterns to be found in datasets with an arbitrary number of dimensions, making symbolic regression a general-purpose machine learning technique.

## Symbolic regression tutorial with TuringBot

In this tutorial, we are going to show how you can find a formula from your data using the symbolic regression software TuringBot. It is desktop software that runs on both Windows and Linux, and as you will see, it is very simple to use.

### Preparing the data

TuringBot takes input files in .txt or CSV format containing one variable per column. The first row may contain the names of the variables; otherwise, they will be labelled col1, col2, col3, etc.

For instance, the following is a valid input file:

```
x y z w classification
5.20 2.70 3.90 1.40 1
6.50 2.80 4.60 1.50 1
7.70 2.80 6.70 2.00 2
5.90 3.20 4.80 1.80 1
5.00 3.50 1.60 0.60 0
5.10 3.50 1.40 0.20 0
4.60 3.10 1.50 0.20 0
6.90 3.20 5.70 2.30 2
```
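If you want to sanity-check a file before loading it, the layout described above is easy to parse in Python. The sketch below is our own helper (`load_table` is a hypothetical name, not part of TuringBot) and only handles whitespace-separated values; a comma-separated file would need the `csv` module instead.

```python
import io

SAMPLE = """x y z w classification
5.20 2.70 3.90 1.40 1
6.50 2.80 4.60 1.50 1
7.70 2.80 6.70 2.00 2
5.90 3.20 4.80 1.80 1
5.00 3.50 1.60 0.60 0
5.10 3.50 1.40 0.20 0
4.60 3.10 1.50 0.20 0
6.90 3.20 5.70 2.30 2
"""

def load_table(handle):
    """Return {column name: list of floats} from a whitespace-separated table."""
    rows = [line.split() for line in handle if line.strip()]
    first = rows[0]
    try:
        [float(value) for value in first]
        # The first row is numeric, so there is no header: use col1, col2, ...
        names = [f"col{i + 1}" for i in range(len(first))]
    except ValueError:
        names, rows = first, rows[1:]
    return {name: [float(row[i]) for row in rows] for i, name in enumerate(names)}

columns = load_table(io.StringIO(SAMPLE))
print(list(columns))
print(len(columns["x"]), "rows")
```

The last column, `classification`, is the one that would be picked as the target variable by default.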

This is what the program looks like when you open it:

By clicking on the “Input file” button on the upper left, you can select your input file and load it. Different search metrics are available, including for instance classification accuracy. A cross-validation feature can also be enabled in the “Search options” box: if enabled, it will automatically create a test/train split and allow you to see the out-of-sample error as the optimization goes on. But in this example we are going to keep things simple and just use the defaults.

### Finding the formulas

After loading the data, you can click on the play button at the top of the interface to start the optimization. The best formulas found so far will be shown in the “Solutions” box, in ascending order of complexity. A formula is only shown if its accuracy is greater than that of all simpler alternatives. In symbolic regression, the goal is not simply to find a formula, but to find the simplest ones possible.
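The rule for which formulas get displayed can be sketched as a small filter in Python. The candidate list below is made up for illustration, and `pareto_front` is our own name for the helper; TuringBot's internal bookkeeping may differ.

```python
def pareto_front(candidates):
    """Keep only formulas that are more accurate than every simpler one.

    candidates: iterable of (complexity, error, formula) tuples.
    """
    front = []
    best_error = float("inf")
    for complexity, error, formula in sorted(candidates):
        if error < best_error:
            front.append((complexity, error, formula))
            best_error = error
    return front

# Hypothetical search results, in no particular order.
found = [
    (3, 1.46, "0.38*x"),
    (1, 1.91, "-0.10"),
    (5, 1.50, "0.40*x+0.01"),  # larger than 0.38*x but less accurate: dropped
    (5, 1.18, "0.55*x-1.01"),
]

for complexity, error, formula in pareto_front(found):
    print(complexity, error, formula)
```

Each row that survives buys some extra accuracy in exchange for extra complexity, so the user can choose where to stop.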

Here are the formulas it found for an example dataset:

The formulas are all written in a format that is compatible out of the box with Python and C. Indeed, the menu on the upper right allows you to export the solutions to these languages:

In this example, the true formula turned out to be sqrt(x), which was recovered in a few seconds. The methodology would be the same for a real-world dataset with many input variables and an unknown dependency between them.

### How to get TuringBot

If you have liked this tutorial, we encourage you to download TuringBot for free from the official website. As we have shown, it is very simple to use, and its powerful mathematical modelling capabilities allow you to find very subtle numerical patterns in your data, much like a scientist would do from empirical observations, but in an automatic way and millions of times faster.

## An alternative to the Eureqa software

Eureqa is symbolic regression software based on genetic programming. Here we will talk about an alternative to that software called TuringBot.

Eureqa used to be developed by a company called Nutonian. A few years ago, this company was acquired by the machine learning company DataRobot, and Eureqa was removed from the market after that.

The program gained popularity due to its ease of use. Finding mathematical formulas from data using its graphical interface was very convenient and required no coding.

### The alternative: TuringBot

An alternative to Eureqa exists and is called TuringBot. It uses a completely different approach to solve symbolic regression problems, based on a simulated annealing algorithm. It can be downloaded for free from the official website.
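For readers unfamiliar with simulated annealing, here is a generic sketch in Python. It is not TuringBot's algorithm, which is not described here: this toy version only tunes the two constants of a fixed formula `a*x + b`, and the cooling schedule, step size and function names are all assumptions.

```python
import math
import random

def rmse(params, xs, ys):
    a, b = params
    return math.sqrt(sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

def anneal(xs, ys, steps=5000, start_temp=1.0, seed=0):
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    current = (0.0, 0.0)
    current_err = rmse(current, xs, ys)
    best, best_err = current, current_err
    for step in range(steps):
        temp = start_temp * (1 - step / steps) + 1e-9  # linear cooling
        candidate = tuple(p + rng.gauss(0, 0.1) for p in current)
        err = rmse(candidate, xs, ys)
        # Always accept improvements; accept a worse candidate with
        # probability exp(-(increase in error) / temperature).
        if err < current_err or rng.random() < math.exp((current_err - err) / temp):
            current, current_err = candidate, err
            if err < best_err:
                best, best_err = candidate, err
    return best, best_err

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]
best, best_err = anneal(xs, ys)
print(best, best_err)
```

Accepting some worse candidates early on lets the search escape local minima, and lowering the temperature gradually turns it into a pure downhill search.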

Here is what its interface looks like:

It features a variety of search metrics, allowing many different kinds of machine learning problems to be solved. Those include the basic RMS and mean error regression metrics, but also classification accuracy, F1 score (for rare event classification), and correlation coefficient.
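To see why F1 score is the better choice for rare events, consider the following Python sketch. It uses the standard definition of F1 as the harmonic mean of precision and recall; the function names and the toy data are ours.

```python
def accuracy(predicted, actual):
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def f1_score(predicted, actual, positive=1):
    pairs = list(zip(predicted, actual))
    tp = sum(p == positive and a == positive for p, a in pairs)
    fp = sum(p == positive and a != positive for p, a in pairs)
    fn = sum(p != positive and a == positive for p, a in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy dataset in which only 2 out of 100 rows are positive.
actual = [0] * 98 + [1] * 2
always_negative = [0] * 100  # a useless model that never predicts the rare event

print(accuracy(always_negative, actual))
print(f1_score(always_negative, actual))
```

The useless model scores very high on accuracy but gets an F1 score of zero, so optimizing for F1 pushes the search toward formulas that actually detect the rare class.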

The software allows overfit solutions to be easily ruled out with its convenient cross-validation feature. A test/train split can be enabled through the interface, and the out-of-sample error shown in the solutions box can be used to select the formula with the best trade-off between size and accuracy.
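The value of the out-of-sample error can be illustrated with a short Python sketch. The split function, the 50/50 ratio and the two toy models are assumptions made for this example and do not reflect how TuringBot performs its split.

```python
import math
import random

def split(rows, train_fraction=0.5, seed=0):
    """Shuffle the rows and cut them into a training and a test part."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def rmse(model, rows):
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in rows) / len(rows))

rows = [(float(x), math.sqrt(x)) for x in range(1, 21)]
train, test = split(rows)

memorized = dict(train)

def lookup(x):
    # Overfit model: perfect on the training rows, useless everywhere else.
    return memorized.get(x, 0.0)

def simple(x):
    return math.sqrt(x)

for name, model in (("lookup", lookup), ("simple", simple)):
    print(name, rmse(model, train), rmse(model, test))
```

Both models have zero error on the training rows, and only the error on the held-out rows reveals that the lookup table has learned nothing general.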

Compared to Eureqa, the symbolic regression implementation of TuringBot seems to yield better results in many cases. Eureqa overly restricts itself to simpler and less recursive formulas, and often results in polynomial fits to the data that diverge and lose usefulness outside the training domain. We also find that TuringBot is noticeably faster than Eureqa.