How to find formulas from values?

Finding mathematical formulas from data is an extremely useful machine-learning task. A formula is the most compact representation of a table: it condenses a large amount of data into a single expression and makes the relationship between the variables explicit.

In this tutorial, we will generate a dataset from a known formula and then try to recover that formula using the symbolic regression software TuringBot, without giving the software any prior knowledge of it.

What symbolic regression is

Symbolic regression is a machine learning technique that tries to find explicit mathematical formulas that connect variables. The technique starts from a set of base functions to be used in the search (for instance, addition, multiplication, sin(x), exp(x), etc.) and then tries to combine those functions in such a way that the target variable is accurately predicted.
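
To make the idea concrete, here is a minimal sketch of such a search. It is a toy brute-force enumeration, not a description of how TuringBot works internally. The set of base functions, the target formula x*cos(x), and the `search` helper are all assumptions chosen for illustration.

```python
import itertools
import numpy as np

# Toy data from a known formula (chosen for illustration): y = x * cos(x)
x = np.linspace(0.1, 3.0, 50)
y = x * np.cos(x)

# Base building blocks: a few unary functions and binary operators
unary = {
    "x": lambda v: v,
    "sin(x)": np.sin,
    "cos(x)": np.cos,
    "exp(x)": np.exp,
}
binary = {
    "+": np.add,
    "*": np.multiply,
}

def search(x, y):
    """Try every 'f(x) op g(x)' combination and return the one with the lowest RMSE."""
    best = None
    for (f_name, f), (op_name, op), (g_name, g) in itertools.product(
        unary.items(), binary.items(), unary.items()
    ):
        pred = op(f(x), g(x))
        rmse = float(np.sqrt(np.mean((pred - y) ** 2)))
        if best is None or rmse < best[0]:
            best = (rmse, f"{f_name} {op_name} {g_name}")
    return best

best_rmse, best_formula = search(x, y)
print(best_formula, best_rmse)
```

A real symbolic regression engine explores a far larger space of expressions, including numeric constants and nested functions, so it cannot simply enumerate every candidate as this sketch does.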

Simplicity is as important as accuracy in a symbolic regression model. Any dataset of n points with distinct input values can be fit exactly by a polynomial of degree n - 1, but such a fit is uninformative, since the model has as many free parameters as there are training data points. For this reason, a symbolic regression optimization penalizes large formulas, favoring simpler ones that perform just as well.
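
The trade-off can be illustrated with a small sketch. The data, the `penalty` weight, and the scoring rule (error plus a cost per coefficient) are hypothetical choices made for this example. They are not the criterion used by TuringBot or by any particular tool.

```python
import numpy as np

rng = np.random.default_rng(0)

# Eight noisy samples of a simple underlying relationship, y = 2x + 1
x = np.linspace(0.0, 1.0, 8)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.05, size=x.size)

def rmse(pred, target):
    return float(np.sqrt(np.mean((pred - target) ** 2)))

# A degree-7 polynomial has 8 coefficients, one per data point,
# so it passes through every point (up to rounding error)
poly_coeffs = np.polyfit(x, y, deg=x.size - 1)
poly_rmse = rmse(np.polyval(poly_coeffs, x), y)

# A straight line has only 2 coefficients and leaves the noise unexplained
line_coeffs = np.polyfit(x, y, deg=1)
line_rmse = rmse(np.polyval(line_coeffs, x), y)

# Hypothetical complexity-penalized score: error plus a cost per coefficient
penalty = 0.05
poly_score = poly_rmse + penalty * poly_coeffs.size
line_score = line_rmse + penalty * line_coeffs.size

print("polynomial:", poly_rmse, poly_score)
print("line:      ", line_rmse, line_score)
```

The polynomial wins on raw error but loses once its size is taken into account, which is the behavior the penalty is meant to produce.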

Generating an example dataset

Let’s give an explicit example of how symbolic regression can be used to find a formula from data. We will generate a dataset from the formula x*cos(10*x) + 2, add noise to it, and then see if we can recover the formula using symbolic regression.

The following Python script generates the input data:

import numpy as np
import matplotlib.pyplot as plt

# Generate an array of 100 linearly spaced values between 0 and 1
x = np.linspace(0, 1, 100)

# Calculate y = x*cos(10*x) + 2, adding uniform noise drawn from [0, 0.1)
y = np.cos(10 * x) * x + 2 + np.random.random(len(x)) * 0.1

# Plot the noisy data points
plt.scatter(x, y)
plt.xlabel('x')
plt.ylabel('y')
plt.show()

And this is what the result looks like:

The input data that we have generated.

Now we are going to try to find a formula for this data and see what happens.

Finding a formula using TuringBot

Using TuringBot is simple: all we have to do is load the input data through its interface and start the search. First, we save the data to an input file:

# Stack the x and y arrays column-wise to create a 2D array
arr = np.column_stack((x, y))

# Save the array to a text file, one row per sample, with six decimal places ('%f')
np.savetxt('input.txt', arr, fmt='%f')
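
Before loading the file into any tool, it can be worth reading it back to confirm its shape and to see how much precision the '%f' format keeps. The sketch below regenerates the data so that it runs on its own. The `max_rounding_error` name is introduced here for illustration.

```python
import numpy as np

# Recreate the dataset and save it exactly as in the tutorial
x = np.linspace(0, 1, 100)
y = np.cos(10 * x) * x + 2 + np.random.random(len(x)) * 0.1
arr = np.column_stack((x, y))
np.savetxt('input.txt', arr, fmt='%f')

# Read the file back: one row per sample, columns x and y
loaded = np.loadtxt('input.txt')
print(loaded.shape)

# '%f' writes six decimal places, so values are rounded to the nearest 1e-6
max_rounding_error = float(np.abs(loaded - arr).max())
print(max_rounding_error)
```

The rounding is far smaller than the noise added to y, so it should not affect the search in this example.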

After loading input.txt into TuringBot, starting the search, and letting it work for a minute, these were the formulas that it found, ordered by complexity:

The formulas found by TuringBot for our input dataset.

As the list shows, TuringBot has successfully recovered our original formula!
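
A candidate formula can also be checked numerically against the data. The sketch below does not use TuringBot's actual output. It compares the original formula with a variant whose constant is shifted to 2.05, a hypothetical candidate chosen because the noise in the generating script is uniform on [0, 0.1) and therefore has a mean of 0.05 rather than zero. The seed and the `rmse` helper are assumptions added for reproducibility.

```python
import numpy as np

# Regenerate the dataset with a fixed seed (an assumption, for reproducibility)
rng = np.random.default_rng(42)
x = np.linspace(0, 1, 100)
y = np.cos(10 * x) * x + 2 + rng.random(len(x)) * 0.1

def rmse(pred, target):
    """Root-mean-square error between predictions and targets."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))

# The formula used to generate the data
original = x * np.cos(10 * x) + 2

# Hypothetical candidate: same shape, constant shifted by the mean of the noise
shifted = x * np.cos(10 * x) + 2.05

rmse_original = rmse(original, y)
rmse_shifted = rmse(shifted, y)
print(rmse_original, rmse_shifted)
```

Because the noise is not centered on zero, a search that minimizes error may report a constant slightly above 2. That would reflect the noise model rather than a failure to recover the formula.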

Conclusion

Here we have seen how symbolic regression can be used to automatically find mathematical formulas from data values. The example that we have given was a simple one, but the same procedure also works for a real-world dataset in which the dependencies between the variables are not known beforehand and in which more than one input variable is present.

If you are interested in finding formulas in your own dataset, you can download TuringBot for free from the official website.
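
For the multi-variable case, the input file can be prepared in the same way as before, with one extra column per variable. The sketch below is illustrative only: the variables `x1` and `x2`, the target formula, the file name `input_2d.txt`, and the layout with the target in the last column are all assumptions, and the layout that a given tool expects should be checked in its documentation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical input variables, sampled uniformly
x1 = rng.uniform(0.0, 1.0, 200)
x2 = rng.uniform(0.0, 2.0, 200)

# Hypothetical target that depends on both inputs, plus a little noise
y = x1 * np.cos(10 * x2) + 2 + rng.normal(0.0, 0.01, 200)

# One row per sample: inputs first, target last (an assumed layout)
arr = np.column_stack((x1, x2, y))
np.savetxt('input_2d.txt', arr, fmt='%f')
print(arr.shape)
```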
