Symbolic Regression in Python with TuringBot

In this tutorial, we are going to show a very easy way to do symbolic regression in Python.

For that, we are going to use the symbolic regression software TuringBot. This program runs on both Windows and Linux, and it comes with a handy Python library. You can download it for free from the official website.

Importing TuringBot

The first step in running our symbolic regression optimization in Python is importing TuringBot. To do that, add its installation directory to Python's module search path (sys.path) and import it, like so:

Windows

import sys 
sys.path.insert(1, r'C:\Users\user\AppData\Local\Programs\TuringBot') 

import turingbot as tb

Linux

import sys 
sys.path.insert(1, '/usr/share/turingbot') 

import turingbot as tb 

Running the optimization

The turingbot library implements a simulation object that can be used to start, stop and get the current status of a symbolic regression optimization.

This is how it works:

Windows

path = r'C:\Users\user\AppData\Local\Programs\TuringBot\TuringBot.exe' 
input_file = r'C:\Users\user\Desktop\input.txt' 
config_file = r'C:\Users\user\Desktop\settings.cfg' 

sim = tb.simulation() 
sim.start_process(path, input_file, threads=4, config=config_file)

Linux

path = r'/usr/bin/turingbot' 
input_file = r'/home/user/input.txt' 
config_file = r'/home/user/settings.cfg' 

sim = tb.simulation() 
sim.start_process(path, input_file, threads=4, config=config_file) 

The start_process method starts the optimization in the background. It takes as input the paths to the TuringBot executable and to your input file. Optionally, you can also set the number of threads that the program should use and the path to the configuration file (more on that below).

After running the commands above, nothing will be printed, because the optimization runs in the background. To retrieve and print the current best formulas, you should use:

sim.refresh_functions() 
print(*sim.functions, sep='\n') 
print(sim.info) 

To stop the optimization and kill the TuringBot process, you should use the terminate_process method:

sim.terminate_process()
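
Since the search keeps improving in the background, it can be handy to refresh the formulas several times before stopping. The helper below is only a sketch: poll_formulas is a hypothetical name, not part of the turingbot library, and it assumes nothing about the simulation object beyond the refresh_functions method and the functions attribute used above.

```python
import time


def poll_formulas(sim, rounds=3, interval=5.0):
    """Collect the best formulas a few times while the search runs.

    `sim` is assumed to expose refresh_functions() and a `functions`
    attribute, as the simulation object does in the commands above.
    """
    snapshots = []
    for _ in range(rounds):
        time.sleep(interval)       # give the background search time to improve
        sim.refresh_functions()    # pull the current best formulas
        snapshots.append(list(sim.functions))
    return snapshots
```

With a running simulation, calling poll_formulas(sim, rounds=6, interval=10) and then sim.terminate_process() would record six snapshots over roughly one minute before the process is stopped.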

Using a configuration file

We have seen above that the start_process method may take the path to a configuration file as an optional input parameter. This is what the file should look like:

4 # Search metric. 1: Mean relative error, 2: Classification accuracy, 3: Mean error, 4: RMS error, 5: F1 score, 6: Correlation coefficient, 7: Hybrid (CC+RMS), 8: Maximum error, 9: Maximum relative error
-1 # Train/test split. -1: No cross validation. Valid options are: 50, 60, 70, 75, 80
1 # Test sample. 1: Chosen randomly, 2: The last points
0 # Integer constants only. 0: Disabled, 1: Enabled
0 # Bound search mode. 0: Deactivated, 1: Lower bound search, 2: Upper bound search
60 # Maximum formula complexity.
+ * / pow fmod sin cos tan asin acos atan exp log log2 sqrt sinh cosh tanh asinh acosh atanh abs floor ceil round tgamma lgamma erf # Allowed functions.

The comments after the # characters are for your convenience and are ignored. To change the search settings, all you have to do is change the numbers in each line. To change the base functions for the search, just add or delete their names from the last line.

To customize your search, save the contents of the file above to a settings.cfg file and pass the path of this file to the start_process method through its config parameter.
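
If you prefer to generate the configuration file from Python, something along these lines could work. It is a sketch that assumes the file format is exactly the seven lines shown above, in that order; write_settings and its parameter names are hypothetical and not part of the turingbot library.

```python
from pathlib import Path

# Base functions copied from the last line of the example file above.
ALLOWED_FUNCTIONS = (
    "+ * / pow fmod sin cos tan asin acos atan exp log log2 sqrt sinh cosh tanh "
    "asinh acosh atanh abs floor ceil round tgamma lgamma erf"
)


def write_settings(path, metric=4, train_test_split=-1, test_sample=1,
                   integer_constants=0, bound_search=0, max_complexity=60,
                   functions=ALLOWED_FUNCTIONS):
    """Write a settings file with one option per line, in the order shown above."""
    lines = [
        f"{metric} # Search metric",
        f"{train_test_split} # Train/test split",
        f"{test_sample} # Test sample",
        f"{integer_constants} # Integer constants only",
        f"{bound_search} # Bound search mode",
        f"{max_complexity} # Maximum formula complexity",
        f"{functions} # Allowed functions",
    ]
    Path(path).write_text("\n".join(lines) + "\n")
    return lines
```

For instance, write_settings('settings.cfg', metric=6, max_complexity=30) would select the correlation coefficient metric and a lower complexity limit while keeping the other values from the example.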

Full example

Here is the full source code of the examples provided above. Note that you have to replace user in the paths with your local username, and that you have to create an input file (txt or csv format, with one variable per column) to use with the program.

Windows

import sys 
sys.path.insert(1, r'C:\Users\user\AppData\Local\Programs\TuringBot') 

import turingbot as tb 
import time

path = r'C:\Users\user\AppData\Local\Programs\TuringBot\TuringBot.exe' 
input_file = r'C:\Users\user\Desktop\input.txt' 
config_file = r'C:\Users\user\Desktop\settings.cfg' 

sim = tb.simulation() 
sim.start_process(path, input_file, threads=4, config=config_file) 

time.sleep(10)

sim.refresh_functions()
print(*sim.functions, sep='\n')
print(sim.info)

sim.terminate_process()

Linux

import sys 
sys.path.insert(1, '/usr/share/turingbot') 

import turingbot as tb 
import time 

path = r'/usr/bin/turingbot' 
input_file = r'/home/user/input.txt' 
config_file = r'/home/user/settings.cfg' 

sim = tb.simulation() 
sim.start_process(path, input_file, threads=4, config=config_file) 

time.sleep(10) 

sim.refresh_functions() 
print(*sim.functions, sep='\n') 
print(sim.info) 

sim.terminate_process()
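
The two scripts differ only in their paths, so a single script could pick the right ones at runtime. The sketch below assumes the default install locations used in the examples above; turingbot_paths is a hypothetical helper, and user is still a placeholder for your username.

```python
import sys

# Paths copied from the Windows and Linux examples above ("user" is a placeholder).
PATHS = {
    "windows": {
        "library_dir": r"C:\Users\user\AppData\Local\Programs\TuringBot",
        "executable": r"C:\Users\user\AppData\Local\Programs\TuringBot\TuringBot.exe",
    },
    "linux": {
        "library_dir": "/usr/share/turingbot",
        "executable": "/usr/bin/turingbot",
    },
}


def turingbot_paths(platform=None):
    """Return the library directory and executable path for the given platform."""
    platform = sys.platform if platform is None else platform
    if platform.startswith("win"):
        return PATHS["windows"]
    if platform.startswith("linux"):
        return PATHS["linux"]
    raise RuntimeError(f"No default TuringBot paths known for platform {platform!r}")
```

The library directory would then go into sys.path.insert(1, turingbot_paths()['library_dir']) before importing turingbot, and the executable entry would be passed as the first argument of start_process.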

How to find formulas from values?

Finding mathematical formulas from data is an extremely useful machine learning task. A formula is the most compressed representation of a table, allowing large amounts of data to be compressed into something simple, while also making explicit the relationship that exists between the different variables.

In this tutorial, we are going to generate a dataset and try to recover the original formula using the symbolic regression software TuringBot, without any previous knowledge of what that formula was.

What symbolic regression is

Symbolic regression is a machine learning technique that tries to find explicit mathematical formulas that connect variables. The technique starts from a set of base functions to be used in the search (for instance addition, multiplication, sin(x), exp(x), etc.) and then tries to combine those functions in such a way that the target variable is accurately predicted.

Simplicity is as important as accuracy in a symbolic regression model. Every dataset can be represented with perfect accuracy by a polynomial, but that is uninformative, since the number of free parameters in the model is the same as the number of training data points. For this reason, a symbolic regression optimization penalizes large formulas, favoring simpler ones that perform just as well.

Generating an example dataset

Let’s give an explicit example of how symbolic regression can be used to find a formula from data. We will generate a dataset from the formula x*cos(10*x) + 2, add noise to this data, and then see if we can recover this formula using symbolic regression.

The following Python script generates the input data:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 100)
# Uniform noise in [0, 0.1) added on top of the target formula
y = np.cos(10*x)*x + 2 + np.random.random(len(x))*0.1

plt.scatter(x, y)
plt.show()

And this is what the result looks like:

The input data that we have generated.
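
One detail worth keeping in mind is that np.random.random draws from the interval [0, 1), so the noise added here is never negative and shifts the data upwards by about 0.05 on average. The sketch below shows this with a seeded generator, so that the dataset is reproducible; the seed value and the variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so that the dataset is reproducible

x = np.linspace(0, 1, 100)
noise = rng.random(len(x)) * 0.1     # uniform in [0, 0.1), mean close to 0.05
y = np.cos(10 * x) * x + 2 + noise

# Average vertical shift of the data with respect to the noiseless formula
offset = float(np.mean(noise))
print(f"mean offset introduced by the noise: {offset:.3f}")
```

Because of this shift, a formula that fits the noisy data well may end up with a constant term slightly above 2.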

Now we are going to try to find a formula for this data and see what happens.

Finding a formula using TuringBot

The usage of TuringBot is very simple. All we have to do is load the input data using its interface and start the search. First, we save the data to an input file:

arr = np.column_stack((x, y))
np.savetxt('input.txt', arr, fmt='%f')

After loading input.txt into TuringBot, starting the search, and letting it work for a minute, these were the formulas that it found, ordered by complexity:

The formulas found by TuringBot for our input dataset.

It can be seen that it has successfully found our original formula!
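
A quick way to double-check a candidate formula is to evaluate it on the data and compute its error. The sketch below uses the RMS error and compares the original formula against a trivial constant model; the rms_error helper, the seed and the constant baseline are illustrative choices, not something produced by TuringBot.

```python
import numpy as np


def rms_error(y_true, y_pred):
    """Root mean square difference between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Regenerate a dataset like the one above (seeded for reproducibility)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
y = np.cos(10 * x) * x + 2 + rng.random(len(x)) * 0.1

candidates = {
    "x*cos(10*x) + 2": x * np.cos(10 * x) + 2,
    "constant (mean of y)": np.full_like(x, y.mean()),
}
errors = {name: rms_error(y, pred) for name, pred in candidates.items()}
for name, err in errors.items():
    print(f"{name}: RMS error = {err:.4f}")
```

The original formula should come out with a much smaller error than the constant, with the remainder being due to the added noise.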

Conclusion

Here we have seen how symbolic regression can be used to automatically find mathematical formulas from data values. The example that we have given was a simple one, but the procedure that we have used would also work for a real-world dataset in which the dependencies between the variables are not known beforehand, and in which more than one input variable is present.

If you are interested in trying to find formulas from your own dataset, you can download TuringBot for free from the official website.

Symbolic regression example with Python visualization

Symbolic regression is a machine learning technique capable of generating models that are explicit and easy to understand.

In this tutorial, we are going to generate our first symbolic regression model. For that we are going to use the TuringBot software. After generating the model, we are going to visualize the results using a Python library (Matplotlib).

In order to make things more interesting, we are going to try to find a mathematical formula for the N-th prime number (A000040 in the OEIS).

Symbolic regression setup

The symbolic regression software that we are going to use is called TuringBot. It is a desktop application that runs on both Windows and Linux. The usage is straightforward: you load your input file in .txt or .csv format, select which column should be predicted and which columns should be used as input, and then start the search.

Several search metrics are available, including RMS error, mean error, correlation coefficient and others. Since we are interested in predicting the exact values of the prime numbers, we are going to use the “classification accuracy” metric.
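
To make the metric concrete, here is a minimal sketch that assumes classification accuracy means the fraction of rows whose prediction matches the target exactly. TuringBot's own definition may differ in its details, and the numbers below are made up for illustration.

```python
def classification_accuracy(targets, predictions):
    """Fraction of predictions that are exactly equal to their targets."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    hits = sum(1 for t, p in zip(targets, predictions) if t == p)
    return hits / len(targets)


targets = [2, 3, 5, 7, 11]
predictions = [2, 3, 5, 8, 11]   # one wrong value out of five

accuracy = classification_accuracy(targets, predictions)
print(accuracy)  # 0.8
```

Under this definition a prediction that is off by one counts the same as one that is off by a thousand, which is the behavior we want when only exact prime values are acceptable.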

This is what the interface looks like after loading the input file containing prime numbers as a function of N, which we have truncated to the first 20 rows:

The TuringBot interface.
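
An input file of this kind can be generated with a few lines of Python. The sketch below assumes a plain two-column layout (N, then the N-th prime, separated by a space, with no header); the function names and the primes.txt filename are illustrative choices.

```python
def first_primes(count):
    """Return the first `count` prime numbers using trial division."""
    primes = []
    candidate = 2
    while len(primes) < count:
        # Only primes up to sqrt(candidate) need to be tested as divisors
        if all(candidate % p != 0 for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes


def write_primes_file(path, count=20):
    """Write one 'N prime' pair per line, with N starting at 1."""
    with open(path, "w") as f:
        for n, p in enumerate(first_primes(count), start=1):
            f.write(f"{n} {p}\n")


write_primes_file("primes.txt", count=20)
```

A file written this way can be read back with np.loadtxt('primes.txt'), giving N in the first column and the prime in the second.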

With the input file loaded and the search metric selected, the search can be started by clicking on the play button at the top of the interface.

The formulas that were found

After letting TuringBot work for a few minutes, these were the formulas that it ended up finding:

The results of our symbolic regression optimization.

The best one has an error of 0.20, that is, a classification accuracy of 80%, which is quite impressive considering how short the formula is. Of course, we could have obtained 100% accuracy with a huge polynomial, but that would not really compress the data in any way, since the number of free parameters in the resulting model would be the same as the number of data points.
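
The point about polynomials can be illustrated with a small sketch. It uses Lagrange interpolation (a standard technique, not what TuringBot does) to build the unique degree-4 polynomial through the first five primes; it reproduces them perfectly but carries no information about the next one.

```python
def lagrange_polynomial(xs, ys):
    """Return the polynomial of degree len(xs) - 1 that passes through all points."""
    def p(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return p


ns = [1.0, 2.0, 3.0, 4.0, 5.0]
primes = [2.0, 3.0, 5.0, 7.0, 11.0]

p = lagrange_polynomial(ns, primes)

# Perfect fit on the five points used to build the polynomial...
train_error = max(abs(p(n) - q) for n, q in zip(ns, primes))
print("largest error on the fitted points:", train_error)

# ...but a poor guess for the sixth prime, which is 13
print("polynomial value at N = 6:", p(6.0))
```

The polynomial evaluates to 22 at N = 6 (up to floating-point rounding), far from the actual value of 13, which is why a perfect fit obtained with as many parameters as data points is not a useful model.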

Visualizing with Python

Now we can finally visualize the symbolic model using Python. Luckily the formula works out of the box as long as we import the math library (TuringBot follows the same naming convention). This is what the script looks like:

from math import *
import numpy as np
import matplotlib.pyplot as plt

def prime(x):
    return floor(1.98623*ceil(0.0987775+cosh(log2(x)-0.049869))-(1/x))

data = np.loadtxt('primes.txt')
plt.scatter(data[:,0], data[:,1], label='Data')
plt.plot(data[:,0], [prime(x) for x in data[:,0]], label='Model')
plt.xlabel('N')
plt.title('Prime numbers')
plt.legend()
plt.show()

And this is the resulting plot:

Plot of our model vs the original data.

Conclusion

In this tutorial, we have seen how to generate a symbolic regression model. The example given was a very simple one, with only one input variable and a small number of data points, but the methodology would work just as well with a large real-world dataset with multiple dimensions, allowing a variety of machine learning problems of practical interest to be solved.