Symbolic regression example with Python visualization

Symbolic regression is a machine learning technique capable of generating models that are explicit and easy to understand.

In this tutorial, we are going to generate our first symbolic regression model using the TuringBot software, and then visualize the results with a Python library (Matplotlib).

To make things more interesting, we are going to try to find a mathematical formula for the N-th prime number (A000040 in the OEIS).

Symbolic regression setup

The symbolic regression software that we are going to use is called TuringBot. It is a desktop application that runs on both Windows and Linux. The usage is straightforward: you load your input file in .txt or .csv format, select which column should be predicted and which columns should be used as input, and then start the search.

Several search metrics are available, including RMS error, mean error, correlation coefficient, and others. Since we are interested in predicting the exact values of the prime numbers, we are going to use the “classification accuracy” metric.
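To see why the choice of metric matters for this problem, the sketch below compares a few metrics on made-up predictions. The function names and definitions are our own assumptions: they are standard textbook formulas, with "mean error" taken to mean the mean absolute error, and TuringBot's internal definitions may differ.

```python
import numpy as np


def rms_error(y_true, y_pred):
    # Root mean square of the residuals.
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))


def mean_abs_error(y_true, y_pred):
    # Assumed meaning of "mean error": the mean absolute residual.
    return float(np.mean(np.abs(y_pred - y_true)))


def correlation(y_true, y_pred):
    # Pearson correlation coefficient between targets and predictions.
    return float(np.corrcoef(y_true, y_pred)[0, 1])


def accuracy(y_true, y_pred):
    # Fraction of predictions that match the target exactly.
    return float(np.mean(y_pred == y_true))


# Made-up data for illustration only.
y_true = np.array([2, 3, 5, 7, 11], dtype=float)
one_miss = np.array([2, 3, 5, 8, 11], dtype=float)  # exact except at one point
always_off = y_true + 0.4                            # close, but never exact

for name, y_pred in [("one_miss", one_miss), ("always_off", always_off)]:
    print(
        name,
        rms_error(y_true, y_pred),
        mean_abs_error(y_true, y_pred),
        correlation(y_true, y_pred),
        accuracy(y_true, y_pred),
    )
```

In this toy case the always-off predictions have the lower RMS error (0.4 versus about 0.45) and yet never hit an exact value, which is why an accuracy-style metric is the natural choice when the target is an integer sequence.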

This is what the interface looks like after loading the input file, which lists the prime numbers as a function of N and has been truncated to the first 20 rows:

The TuringBot interface.
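An input file like the one described above could be generated with a few lines of Python. This is only a sketch: the helper names are hypothetical, and the layout (two space-separated columns, N and the N-th prime, with no header row) is an assumption chosen so that the file can be read back with np.loadtxt.

```python
def first_primes(count):
    """Return the first `count` primes by trial division (fine for small counts)."""
    primes = []
    candidate = 2
    while len(primes) < count:
        # A candidate is prime if no smaller prime divides it.
        if all(candidate % p != 0 for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def write_primes_file(path, count=20):
    """Write one 'N prime' pair per line, separated by a space, with no header."""
    with open(path, "w") as handle:
        for n, p in enumerate(first_primes(count), start=1):
            handle.write(f"{n} {p}\n")
```

Calling write_primes_file('primes.txt') would then create a 20-row file whose first line is "1 2" and whose last line is "20 71".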

With the input file loaded and the search metric selected, the search can be started by clicking on the play button at the top of the interface.

The formulas that were found

After letting TuringBot work for a few minutes, these were the formulas that it ended up finding:

The results of our symbolic regression optimization.

The best one has an error of 0.20, that is, a classification accuracy of 80%, which is quite impressive considering how short the formula is. Of course, we could have obtained 100% accuracy with a huge polynomial, but that would not compress the data in any way, since the number of free parameters in the resulting model would be the same as the number of data points.
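The remark about the polynomial can be made concrete. The sketch below is our own illustration, not something produced by TuringBot: it builds the interpolating polynomial through the first 20 primes in Newton form, using exact rational arithmetic to avoid rounding issues. The helper names are hypothetical.

```python
from fractions import Fraction

# First 20 primes (A000040), indexed by N = 1..20.
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29,
          31, 37, 41, 43, 47, 53, 59, 61, 67, 71]
NS = list(range(1, len(PRIMES) + 1))


def newton_coefficients(xs, ys):
    # In-place divided differences: one coefficient per data point.
    coeffs = [Fraction(y) for y in ys]
    for level in range(1, len(xs)):
        for i in range(len(xs) - 1, level - 1, -1):
            coeffs[i] = (coeffs[i] - coeffs[i - 1]) / (xs[i] - xs[i - level])
    return coeffs


def newton_eval(xs, coeffs, x):
    # Horner-style evaluation of the Newton form.
    result = Fraction(0)
    for c, node in zip(reversed(coeffs), reversed(xs)):
        result = result * (x - node) + c
    return result


coeffs = newton_coefficients(NS, PRIMES)
exact = all(newton_eval(NS, coeffs, n) == p for n, p in zip(NS, PRIMES))
print("coefficients:", len(coeffs), "data points:", len(PRIMES), "exact fit:", exact)
```

The fit is exact on all 20 points, but it needs 20 coefficients to get there, one per data point, so nothing has been compressed.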

Visualizing with Python

Now we can finally visualize the symbolic model using Python. Luckily, the formula works out of the box as long as we import the functions it uses from the math library (TuringBot follows the same naming convention). This is what the script looks like:

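from math import ceil, cosh, floor, log2

import matplotlib.pyplot as plt
import numpy as np


def prime(x):
    # Formula found by TuringBot; its function names match Python's math module.
    return floor(1.98623*ceil(0.0987775+cosh(log2(x)-0.049869))-(1/x))


# Column 0 holds N, column 1 holds the N-th prime number.
data = np.loadtxt('primes.txt')

# Scatter the original data and overlay the model's prediction for each N.
plt.scatter(data[:,0], data[:,1], label='Data')
plt.plot(data[:,0], [prime(x) for x in data[:,0]], label='Model')
plt.xlabel('N')
plt.ylabel('N-th prime')
plt.title('Prime numbers')
plt.legend()
plt.show()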

And this is the resulting plot:

Plot of our model vs the original data.
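To complement the plot, the agreement between model and data can also be checked numerically. The sketch below is our own addition: it hardcodes the first 20 primes so that it does not depend on primes.txt, and it simply counts exact matches, which is an assumption about how the accuracy is measured. The printed fraction can be compared with the accuracy reported by TuringBot.

```python
from math import ceil, cosh, floor, log2

# First 20 primes (A000040), hardcoded so the sketch does not need primes.txt.
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29,
          31, 37, 41, 43, 47, 53, 59, 61, 67, 71]


def prime(x):
    # Same formula as in the plotting script above.
    return floor(1.98623*ceil(0.0987775+cosh(log2(x)-0.049869))-(1/x))


predictions = [prime(n) for n in range(1, len(PRIMES) + 1)]

# Collect every N where the model and the data disagree.
mismatches = [
    (n, expected, predicted)
    for n, (expected, predicted) in enumerate(zip(PRIMES, predictions), start=1)
    if expected != predicted
]
accuracy = 1 - len(mismatches) / len(PRIMES)

for n, expected, predicted in mismatches:
    print(f"N={n}: data={expected}, model={predicted}")
print(f"exact matches: {accuracy:.0%}")
```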

Conclusion

In this tutorial, we have seen how to generate a symbolic regression model. The example given was a very simple one, with only one input variable and a small number of data points, but the methodology would work just as well with a large real-world dataset with multiple dimensions, allowing a variety of machine learning problems of practical interest to be solved.

About TuringBot

TuringBot is a desktop application for symbolic regression. By feeding your data in .TXT or .CSV format into the program, you can immediately start searching for mathematical formulas that connect the variables. If you want to learn more about what TuringBot can offer you, please visit our homepage.