Symbolic regression is a machine learning technique capable of generating models that are explicit and easy to understand.
In this tutorial, we are going to generate our first symbolic regression model. For that we are going to use the TuringBot software. After generating the model, we are going to visualize the results using a Python library (Matplotlib).
In order to make things more interesting, we are going to try to find a mathematical formula for the N-th prime number.
Symbolic regression setup
The symbolic regression software that we are going to use is called TuringBot. It is a desktop application that runs on both Windows and Linux. The usage is straightforward: you load your input file in .txt or .csv format, select which column should be predicted and which columns should be used as input, and then start the search.
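If you do not have an input file at hand, the following is a minimal sketch of how one could be produced. The file name primes.txt matches the one read by the plotting script later in this tutorial. The layout (two whitespace-separated columns, N and the N-th prime, with no header row) is an assumption, chosen because np.loadtxt reads it with its default settings. The helper names first_primes and format_rows are made up for this example.

```python
def first_primes(count):
    """Return the first `count` primes using simple trial division."""
    primes = []
    candidate = 2
    while len(primes) < count:
        # Only primes up to sqrt(candidate) need to be tested as divisors.
        if all(candidate % p != 0 for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes


def format_rows(primes):
    """Build one 'N prime' line per entry, with N starting at 1."""
    return [f"{n} {p}" for n, p in enumerate(primes, start=1)]


if __name__ == "__main__":
    # 20 rows, matching the truncated dataset used in this tutorial.
    rows = format_rows(first_primes(20))
    with open("primes.txt", "w") as handle:
        handle.write("\n".join(rows) + "\n")
```

Running the script writes primes.txt to the current directory. If your version of TuringBot expects a header row or a different delimiter, adjust format_rows accordingly.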
Several search metrics are available, including RMS error, mean error, correlation coefficient and others. Since we are interested in predicting the exact values of the prime numbers, we are going to use the “classification accuracy” metric.
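To make the difference between these metrics concrete, here is a small sketch of how they are commonly defined, using NumPy. These are textbook definitions and only an assumption about what TuringBot computes internally; in particular, "mean error" is taken here to mean the mean absolute error, and "classification accuracy" the fraction of predictions that match the target exactly.

```python
import numpy as np


def rms_error(y_true, y_pred):
    """Root mean square of the residuals."""
    diff = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def mean_abs_error(y_true, y_pred):
    """Mean of the absolute residuals (assumed meaning of 'mean error')."""
    diff = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.mean(np.abs(diff)))


def correlation(y_true, y_pred):
    """Pearson correlation coefficient between targets and predictions."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])


def exact_match_accuracy(y_true, y_pred):
    """Fraction of predictions that equal the target exactly."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


# Toy example: three exact hits and one prediction that is off by 2.
targets = [2, 3, 5, 7]
predictions = [2, 3, 5, 9]
print(rms_error(targets, predictions))             # 1.0
print(mean_abs_error(targets, predictions))        # 0.5
print(exact_match_accuracy(targets, predictions))  # 0.75
```

The toy example shows why the choice matters here: a prediction that is off by 2 is simply counted as a miss by the accuracy metric, whereas RMS error would reward a model for being merely close to each prime.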
This is what the interface looks like after loading the input file containing prime numbers as a function of N, which we have truncated to the first 20 rows:
With the input file loaded and the search metric selected, the search can be started by clicking on the play button at the top of the interface.
The formulas that were found
After letting TuringBot work for a few minutes, these were the formulas that it ended up finding:
The best one has an error of 0.20, that is, a classification accuracy of 80%, which is quite impressive considering how short the formula is. Of course, we could have obtained 100% accuracy with a huge polynomial, but that would not really compress the data in any way, since the number of free parameters in the resulting model would be the same as the number of data points.
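The reported accuracy can be checked independently of TuringBot. The sketch below evaluates the best formula (the same expression used in the plotting script in the next section) on the first 20 primes and lists the rows where it misses. The helper names prime_model and first_primes are made up for this example, and accuracy is assumed to mean the fraction of exact matches.

```python
from math import ceil, cosh, floor, log2


def prime_model(n):
    """Best formula reported in this tutorial, for the N-th prime."""
    return floor(1.98623 * ceil(0.0987775 + cosh(log2(n) - 0.049869)) - (1 / n))


def first_primes(count):
    """Return the first `count` primes using simple trial division."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p != 0 for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes


primes = first_primes(20)

# Collect (N, true prime, predicted value) for every row the model gets wrong.
misses = [
    (n, p, prime_model(n))
    for n, p in enumerate(primes, start=1)
    if prime_model(n) != p
]
accuracy = (len(primes) - len(misses)) / len(primes)

print(f"accuracy: {accuracy:.2f}")
for n, actual, predicted in misses:
    print(f"N={n}: actual {actual}, predicted {predicted}")
```

On the 20-row dataset this should report 16 exact matches out of 20, consistent with the 0.20 error above, with the misses at N = 10, 12, 13 and 17.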
Visualizing with Python
Now we can finally visualize the symbolic model using Python. Luckily the formula works out of the box as long as we import the math library (TuringBot follows the same naming convention). This is what the script looks like:
from math import *
import numpy as np
import matplotlib.pyplot as plt

def prime(x):
    return floor(1.98623*ceil(0.0987775+cosh(log2(x)-0.049869))-(1/x))

data = np.loadtxt('primes.txt')

plt.scatter(data[:,0], data[:,1], label='Data')
plt.plot(data[:,0], [prime(x) for x in data[:,0]], label='Model')
plt.xlabel('N')
plt.title('Prime numbers')
plt.legend()
plt.show()
And this is the resulting plot:
In this tutorial we have seen how to generate a symbolic regression model. The example given was a very simple one, with only one input variable and a small number of data points, but the methodology would work just as well with a large real-world dataset with multiple input variables, allowing a variety of machine learning problems of practical interest to be solved.