Finding mathematical formulas from data is an extremely useful machine-learning task. A formula is a highly compact representation of a table: it reduces a large amount of data to a simple expression and makes the relationship between the variables explicit.
In this tutorial, we are going to generate a dataset and try to recover the original formula using the symbolic regression software TuringBot, without any previous knowledge of what that formula was.
What symbolic regression is
Symbolic regression is a machine-learning technique that searches for explicit mathematical formulas connecting variables. The search starts from a set of base functions, for instance addition, multiplication, sin(x), exp(x), etc., and combines them in such a way that the target variable is accurately predicted.
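To make the idea of combining base functions concrete, here is a minimal sketch that simply enumerates every small formula that can be built from a tiny base set and keeps the one with the lowest error. The base set, the helper names (`candidates`, `rmse`, `search`), and the exhaustive strategy are assumptions made for illustration only; this is not how TuringBot is implemented, and real tools use far more efficient search methods.

```python
import itertools
import math

# Hypothetical base set for this sketch: unary and binary building blocks.
UNARY = {"cos": math.cos, "sin": math.sin, "exp": math.exp}
BINARY = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}


def candidates():
    """Yield (text, function) pairs for small formulas built from the base set."""
    # Level 0: the input variable and two constants.
    level1 = [("x", lambda x: x), ("1", lambda x: 1.0), ("2", lambda x: 2.0)]
    # Level 1: each unary function applied to x.
    for name, f in UNARY.items():
        level1.append((f"{name}(x)", lambda x, f=f: f(x)))
    yield from level1
    # Level 2: each binary operator combining two of the expressions above.
    for (ta, fa), (tb, fb) in itertools.product(level1, repeat=2):
        for op, g in BINARY.items():
            yield (f"({ta} {op} {tb})",
                   lambda x, fa=fa, fb=fb, g=g: g(fa(x), fb(x)))


def rmse(f, xs, ys):
    """Root-mean-square error of formula f on the points (xs, ys)."""
    return math.sqrt(sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


def search(xs, ys):
    """Return the candidate with the lowest error (first one wins on ties)."""
    return min(candidates(), key=lambda c: rmse(c[1], xs, ys))


# Noise-free toy target, chosen so that it lies inside the search space.
xs = [i / 10 for i in range(11)]
ys = [x * math.cos(x) for x in xs]

best_text, best_fn = search(xs, ys)
best_error = rmse(best_fn, xs, ys)
print(best_text, best_error)
```

Exhaustive enumeration only works for very small formulas, because the number of candidates grows quickly with every extra level of nesting. That growth is why practical symbolic regression relies on guided search rather than brute force.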
Simplicity is as important as accuracy in a symbolic regression model. Any dataset with distinct input values can be fitted exactly by a polynomial of high enough degree, but such a fit is uninformative because the model has as many free parameters as there are training data points. For this reason, a symbolic regression optimization penalizes large formulas, favoring simpler ones that perform just as well.
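The trade-off can be illustrated with a small sketch. Below, an interpolating polynomial reproduces eight noisy points exactly, while a short formula leaves a small residual; once each free parameter carries a cost, the short formula scores better. The `score` function and its `weight` value are hypothetical choices for this example and are not the metric used by TuringBot or any specific tool.

```python
import math
import random


def lagrange(xs, ys):
    """Return the polynomial of degree < len(xs) that passes through every point."""
    def p(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return p


def rmse(f, xs, ys):
    """Root-mean-square error of model f on the points (xs, ys)."""
    return math.sqrt(sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


def score(error, n_params, weight=0.05):
    """Hypothetical penalized score: error plus a fixed cost per free parameter."""
    return error + weight * n_params


random.seed(0)  # seeded only so that the sketch is repeatable
xs = [i / 7 for i in range(8)]
ys = [x * math.cos(10 * x) + 2 + 0.1 * random.random() for x in xs]

poly = lagrange(xs, ys)                       # 8 coefficients for 8 points


def simple(x):
    return x * math.cos(10 * x) + 2           # 2 constants: 10 and 2


poly_error = rmse(poly, xs, ys)               # zero up to rounding
simple_error = rmse(simple, xs, ys)           # small, caused by the noise

poly_score = score(poly_error, n_params=len(xs))
simple_score = score(simple_error, n_params=2)
print(poly_error, simple_error, poly_score, simple_score)
```

The polynomial wins on raw error but only because it memorizes the noise, one coefficient per data point. With the penalty included, the two-constant formula is preferred.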
Generating an example dataset
Let’s give an explicit example of how symbolic regression can be used to find a formula from data. We will generate a dataset from the formula x*cos(10*x) + 2, add noise to it, and then see whether we can recover the formula using symbolic regression.
The following Python script generates the input data:
import numpy as np
import matplotlib.pyplot as plt
# Generate an array of 100 linearly spaced values between 0 and 1
x = np.linspace(0, 1, 100)
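# Calculate y = x*cos(10*x) + 2, adding uniform noise in the range [0, 0.1)
y = np.cos(10 * x) * x + 2 + np.random.random(len(x)) * 0.1
# Plot the noisy data
plt.scatter(x, y)
plt.xlabel('x')
plt.ylabel('y')
plt.show()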
And this is what the result looks like:
Now we are going to try to find a formula for this data and see what happens.
Finding a formula using TuringBot
Using TuringBot is simple: we load the input data through its interface and start the search. First, we save the data to an input file:
# Stack the x and y arrays column-wise to create a 2D array
arr = np.column_stack((x, y))
# Save the array to a text file in floating-point format
np.savetxt('input.txt', arr, fmt='%f')
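Before handing the file to another program, it can be worth checking what was actually written. The sketch below regenerates the data, saves it with the same `fmt='%f'` setting, and reads it back with `np.loadtxt`. The temporary directory and the seeded generator are assumptions made so the example is repeatable and does not overwrite an existing input.txt.

```python
import os
import tempfile

import numpy as np

# Seeded generator: an assumption for repeatability, not part of the tutorial script.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
y = np.cos(10 * x) * x + 2 + rng.random(len(x)) * 0.1
arr = np.column_stack((x, y))

# Write to a temporary location instead of the working directory.
path = os.path.join(tempfile.mkdtemp(), "input.txt")
np.savetxt(path, arr, fmt="%f")

# Read the file back and compare it with the in-memory array.
loaded = np.loadtxt(path)
max_diff = float(np.abs(loaded - arr).max())
print(loaded.shape, max_diff)
```

The `%f` format writes six digits after the decimal point, so each value is rounded by at most about 5e-7. That is far below the noise level of this dataset, but a format such as `%.10f` would be a safer choice for data that needs more precision.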
After loading input.txt into TuringBot, starting the search, and letting it work for a minute, these were the formulas that it found, ordered by complexity:
It can be seen that it has successfully found our original formula!
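A candidate formula can also be checked numerically against the data. The sketch below is an independent check written for this tutorial, not output from TuringBot: it compares the original formula with a variant whose constant is shifted to 2.05. The shift is considered because `np.random.random` draws uniform noise from [0, 0.1), which has a mean of 0.05 rather than 0, so a fit to the noisy data may settle on a constant slightly above 2. The seeded generator and the `rmse` helper are assumptions for illustration.

```python
import numpy as np

# Seeded generator: an assumption for repeatability, not part of the tutorial script.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
y = np.cos(10 * x) * x + 2 + rng.random(len(x)) * 0.1


def rmse(pred, target):
    """Root-mean-square error between predictions and targets."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))


# The formula used to generate the data, before noise was added.
original = x * np.cos(10 * x) + 2
# The same formula with the constant shifted by the mean of the noise.
shifted = x * np.cos(10 * x) + 2.05

err_original = rmse(original, y)
err_shifted = rmse(shifted, y)
print(err_original, err_shifted)
```

Both errors are bounded by the noise amplitude, and the shifted version fits the noisy points more closely. A recovered constant near 2.05 is therefore still consistent with the formula we started from.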
Conclusion
Here we have seen how symbolic regression can be used to automatically find mathematical formulas from data. The example was a simple one, but the same procedure can also be applied to a real-world dataset in which the dependencies between the variables are not known beforehand and more than one input variable is present.
If you are interested in finding formulas from your own data, you can download TuringBot for free from the official website.