Finding mathematical formulas from data is an extremely useful machine learning task. A formula is the most compact representation of a table: it condenses a large amount of data into a simple expression while also making explicit the relationships between the variables.

In this tutorial, we are going to generate a dataset and try to recover the original formula using the symbolic regression software TuringBot, without any previous knowledge of what that formula was.

### What symbolic regression is

Symbolic regression is a machine learning technique that tries to find explicit mathematical formulas connecting variables. The technique starts from a set of base functions to be used in the search, for instance addition, multiplication, sin(x), exp(x), etc, and then tries to combine those functions in such a way that the target variable is accurately predicted.
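The core idea can be illustrated with a deliberately tiny toy sketch (this is not TuringBot's actual search algorithm, which explores a vastly larger space of formula combinations): build candidate expressions from base functions and keep whichever one best predicts the target.

```python
import numpy as np

# Toy illustration of the symbolic regression idea: score a handful of
# hand-written candidate formulas against the data and keep the best one.
x = np.linspace(0, 2, 50)
y = np.sin(x) + x  # pretend this relationship is unknown

candidates = {
    "x**2":       lambda x: x**2,
    "exp(x)":     lambda x: np.exp(x),
    "sin(x) + x": lambda x: np.sin(x) + x,
    "cos(x) * x": lambda x: np.cos(x) * x,
}

# Rank candidates by mean squared error against the target
best = min(candidates, key=lambda name: np.mean((candidates[name](x) - y)**2))
print(best)  # "sin(x) + x"
```

A real symbolic regression engine does not enumerate a fixed list like this; it assembles and mutates expression trees automatically, but the scoring step is the same in spirit.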

Simplicity is as important as accuracy in a symbolic regression model. Every dataset can be represented with perfect accuracy by a polynomial, but that is uninformative, since the number of free parameters in the model is the same as the number of training data points. For this reason, a symbolic regression optimization penalizes large formulas, favoring simpler ones that perform just as well.
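The polynomial claim is easy to verify: a polynomial of degree n-1 has n free coefficients and can pass through any n data points exactly, as this short check with NumPy shows.

```python
import numpy as np

# Five arbitrary data points
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, -1.0, 0.5, 3.0, -2.0])

# A degree-4 polynomial (5 free parameters) interpolates 5 points exactly
coeffs = np.polyfit(x, y, deg=len(x) - 1)
residual = np.abs(np.polyval(coeffs, x) - y).max()
print(residual)  # essentially zero, up to floating-point error
```

Such a fit has zero training error but tells us nothing about the underlying relationship, which is exactly why symbolic regression trades accuracy off against formula size.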

### Generating an example dataset

Let’s give an explicit example of how symbolic regression can be used to find a formula from data. We will generate a dataset from the formula x*cos(10*x) + 2, add noise to it, and then see if we can recover this formula using symbolic regression.

The following Python script generates the input data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Sample the formula on [0, 1] and add uniform noise of amplitude 0.1
x = np.linspace(0, 1, 100)
y = np.cos(10*x)*x + 2 + np.random.random(len(x))*0.1

plt.scatter(x, y, s=10)
plt.show()
```

And this is what the result looks like:

Now we are going to try to find a formula for this data and see what happens.

### Finding a formula using TuringBot

The usage of TuringBot is very simple. All we have to do is load the input data using its interface and start the search. First we save the data to an input file:

```python
arr = np.column_stack((x, y))
np.savetxt('input.txt', arr, fmt='%f')
```

After loading input.txt into TuringBot, starting the search and letting it work for a minute, these were the formulas that it found, ordered by complexity:

It can be seen that it has successfully found our original formula!

### Conclusion

Here we have seen how symbolic regression can be used to automatically find mathematical formulas from data values. The example that we have given was a simple one, but the same procedure would also work for a real-world dataset in which the dependencies between the variables were not known beforehand, and in which more than one input variable was present.

If you are interested in trying to find formulas from your own dataset, you can download TuringBot for free from the official website.

Hello,

This is a very fascinating and unique approach to machine learning!

Just a few questions:

How do you decide which functions/mathematical operations to include in the search?

Are there particular functions that are critical for good performance across most problems?

Are the boolean/logical functions necessary for good performance?

Also, I notice your implementation uses simulated annealing rather than genetic/evolutionary methods for search. Do you find that SA outperforms GA for this task?

Thanks.

Hi Entil,

I am glad that you think so! I am also very much fascinated by symbolic regression, and use it for everything. I barely touched scikit-learn after developing TuringBot.

1- A good rule of thumb is to just use all the functions that are available, and only disable the ones that you have a reason to disable. For instance, if you expect some time series data to not have any periodicity and all solutions end up having cosines in them, you can try disabling the trigonometric functions to find a better set of solutions.

2- I do not think that any single function is of particular importance; the solutions are very diverse across different datasets. The exception is classification problems, where the rounding functions (round, floor, ceil, etc) are essential and always appear.
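To make the classification point concrete, here is an assumed toy example (not output from TuringBot) of how a rounding function turns a continuous expression into discrete class labels:

```python
import numpy as np

# A continuous feature; round() acts as a threshold classifier at 0.5,
# mapping the continuous values to discrete class labels 0 and 1.
x = np.array([0.1, 0.4, 0.6, 0.9])
labels = np.round(x).astype(int)
print(labels.tolist())  # [0, 0, 1, 1]
```

Inside a discovered formula, the argument of round() would itself be some expression of the input variables, so the rounding step is what makes the output discrete.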

3- No, actually the first version of TuringBot didn’t even have logical functions. What I see is that there is some redundancy: the program ends up finding creative ways of representing logical operations if they are necessary but not included in the search. Sometimes it can be nice to disable those functions to find prettier formulas, since they are not of the same type as, say, a sine or an exponential.

4- Yes, my finding is that simulated annealing gives much better performance. An old symbolic regression tool called Eureqa used a genetic algorithm for its search, and most symbolic regression implementations found in free packages also do, but in our experiments convergence was far slower with that approach. It seems to me that this is not restricted to symbolic regression; simulated annealing seems to be a superior algorithm overall, but unfortunately it is not as well known.
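For readers unfamiliar with simulated annealing, here is a minimal sketch of its accept/reject mechanism, applied to a drastically simplified version of the tutorial's problem: fitting the single frequency parameter a in cos(a*x)*x + 2. This is only an illustration of the technique, not how TuringBot implements its search over full formulas.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noise-free data from the tutorial's formula, cos(10*x)*x + 2
x = np.linspace(0, 1, 100)
y = np.cos(10*x)*x + 2

def cost(a):
    """Mean squared error of the candidate formula cos(a*x)*x + 2."""
    return np.mean((np.cos(a*x)*x + 2 - y)**2)

# Simulated annealing over the single free parameter a
a, best, temperature = 0.0, 0.0, 1.0
for step in range(20000):
    candidate = a + rng.normal(scale=0.5)
    delta = cost(candidate) - cost(a)
    # Always accept improvements; accept worse moves with
    # probability exp(-delta / T), so barriers can be crossed early on
    if delta < 0 or rng.random() < np.exp(-delta / temperature):
        a = candidate
    if cost(a) < cost(best):
        best = a  # keep the best parameter seen so far
    temperature *= 0.9995  # cool down gradually

print(round(best, 2))  # typically lands near +-10, recovering the frequency
```

The willingness to accept occasional uphill moves at high temperature is what lets the search escape local minima, which is the property being contrasted with genetic algorithms above.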

Thank you for your explanations.

I’m still learning about this area which is completely new to me. I had thought neural networks were the only way forward in ML/DS, but now I see there’s another, perhaps even better way!

Many thanks,

Entil