Machine learning with symbolic regression

Many machine learning methods are presently available, including for instance neural networks, random forests and support vector machines. In this article, we will talk about a relatively unexplored technique called symbolic regression, and show how it can be used to solve machine learning problems in a transparent and explicit way.

What is machine learning

Machine learning concerns algorithms capable of predicting numerical values (regression) and assigning categories (classification), among other tasks. The real world is messy and randomness appears everywhere, so a major challenge these algorithms face is discerning meaningful signals from the noise contained in the training datasets.

What most machine learning methods have in common is that they are very implicit and resemble black boxes: numbers are fed into the model, and it spits out a result after performing a series of complex computations under the hood. This way of processing information is strongly connected to the notion of “artificial intelligence”, since the inner workings of the human brain are also very hard to describe, even though it is capable of learning and recognizing patterns across a very wide range of domains.

Symbolic regression

Symbolic regression is a technique that searches for mathematical formulas that predict a target variable from one or more input variables. A symbolic model is thus nothing more than an algebraic formula that can be written on a piece of paper.

A simple example of a symbolic model is a polynomial. Any dataset can be represented with perfect accuracy by a polynomial, but that is not very interesting, because polynomials quickly diverge outside the training domain, and because they contain as many free parameters as the training dataset itself. So they do not really compress information in any way.
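This divergence is easy to demonstrate with a quick sketch (the function and noise level here are arbitrary illustrative choices):

```python
import numpy as np

# Fit 10 noisy samples of a smooth function with a degree-9 polynomial,
# which has 10 free parameters -- one per data point.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(3 * x) + rng.normal(0, 0.05, len(x))

coeffs = np.polyfit(x, y, 9)

# Inside the training domain the polynomial reproduces every sample...
print(np.abs(np.polyval(coeffs, x) - y).max())   # essentially zero

# ...but just outside the training domain it diverges wildly.
print(np.polyval(coeffs, 1.5))
```

The fit is perfect on the training points, yet a short step past x = 1 the prediction bears no relation to the underlying function.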

More interesting models are found by combining a set of base functions and trying to find the simplest combinations that predict some target variable. Examples of base functions are trigonometric functions, exponentials, sum, multiplication, division, etc.

For instance, these are some of the base functions used by the symbolic regression software TuringBot:

Base functions used by TuringBot.

After the base functions are defined, the task is to combine them in such a way that the target variable is successfully predicted from the input variables. There is more than one way to carry out the optimization: one might be interested in maximizing classification accuracy, or in recovering the overall shape of a curve without much regard for outliers. For this reason, TuringBot allows many different search metrics to be used:

The search metrics available in TuringBot.

Some examples of problems that can be solved with symbolic regression include:

Clearly the method is very general, and can be creatively used to solve a variety of problems.

Conclusion

In this article, we have seen how symbolic regression is an alternative machine learning method capable of generating explicit models and solving various classes of problems in an elegant way. If you are interested in generating symbolic models from your own data and seeing what patterns it can find, you can download the symbolic regression software TuringBot, which works on both Windows and Linux, for free.


How to find formulas from values

Finding mathematical formulas from data is an extremely useful machine learning task. A formula is the most compressed representation of a table, allowing large amounts of data to be compressed into something simple, while also making explicit the relationship that exists between the different variables.

In this tutorial, we are going to generate a dataset and try to recover the original formula using the symbolic regression software TuringBot, without any previous knowledge of what that formula was.

What symbolic regression is

Symbolic regression is a machine learning technique which tries to find explicit mathematical formulas that connect variables. The technique starts from a set of base functions to be used in the search, for instance addition, multiplication, sin(x) and exp(x), and then tries to combine those functions in such a way that the target variable is accurately predicted.

Simplicity is as important as accuracy in a symbolic regression model. Every dataset can be represented with perfect accuracy by a polynomial, but that is uninformative, since the number of free parameters in the model is the same as the number of training data points. For this reason, a symbolic regression optimization penalizes large formulas, favoring simpler ones that perform just as well.
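A common way to implement this penalty is to add a size-dependent term to each candidate's error. The sketch below illustrates the idea; the scoring function and the numbers in it are illustrative assumptions, not TuringBot's actual internal criterion:

```python
# Illustrative parsimony score: penalize a candidate's error by its size
# (number of nodes in the formula). Lower scores are better.
def score(error, n_nodes, penalty=0.01):
    """Trade raw accuracy off against formula complexity."""
    return error + penalty * n_nodes

# Two hypothetical candidates with similar raw accuracy:
small_formula = score(error=0.12, n_nodes=5)    # a short, simple formula
large_formula = score(error=0.11, n_nodes=40)   # a sprawling expression

# The simpler formula wins despite its slightly higher raw error.
print(small_formula < large_formula)
```

With a penalty like this, a huge polynomial that memorizes the data scores worse than a compact formula that captures the actual pattern.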

Generating an example dataset

Let’s walk through an explicit example of how symbolic regression can be used to find a formula from data. We will generate a dataset from the formula x*cos(10*x) + 2, add noise to this data, and then see if we can recover the formula using symbolic regression.

The following Python script generates the input data:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 100)
y = np.cos(10*x)*x + 2 + np.random.random(len(x))*0.1

# Visualize the noisy data
plt.scatter(x, y)
plt.show()

And this is what the result looks like:

The input data that we have generated.

Now we are going to try to find a formula for this data and see what happens.

Finding a formula using TuringBot

The usage of TuringBot is very simple. All we have to do is load the input data using its interface and start the search. First we save the data to an input file:

arr = np.column_stack((x, y))
np.savetxt('input.txt', arr, fmt='%f')

After loading input.txt into TuringBot, starting the search and letting it work for a minute, these were the formulas that it found, ordered by complexity:

The formulas found by TuringBot for our input dataset.

It can be seen that it has successfully found our original formula!

Conclusion

Here we have seen how symbolic regression can be used to automatically find mathematical formulas from data values. The example that we have given was a simple one, but the procedure that we have used would also work for a real-world dataset in which the dependencies between the variables were not known beforehand, and in which more than one input variable was present.

If you are interested in trying to find formulas from your own dataset, you can download TuringBot for free from the official website.


Deep learning with symbolic regression

Symbolic regression is an innovative machine learning technique that is capable of generating results similar to those of neural networks, but with a completely different approach. Here we will talk about its basic characteristics, and show how it can be used to solve deep learning problems.

What is deep learning?

The concept of deep learning has emerged in the context of artificial neural networks. A neural network which contains hidden layers is capable of pre-processing the input information and extracting non-trivial features prior to combining that input into an output value. The term “deep learning” comes from the presence of those multiple layers.

More recently, it has become common to call deep learning any machine learning technique that is capable of extracting non-trivial information from an input and using that to predict target variables in a way that is not possible for classical statistical methods.

How symbolic regression works

Despite their popularity, neural networks are not the only way to extract non-trivial patterns from input data. An alternative technique, capable of solving the same tasks as neural networks, is called symbolic regression.

The idea of symbolic regression is to find explicit mathematical formulas that predict a target variable from a set of input variables. Sophisticated algorithms have to be employed to efficiently search the space of all mathematical formulas, which is very large. The most common approach is to use genetic algorithms for this search, but TuringBot shows that a simulated annealing optimization also gives excellent results.
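To make the idea concrete, here is a bare-bones simulated annealing search over a toy formula space. This is an illustration of the general approach only, not TuringBot's actual algorithm; the base functions, mutation scheme and cooling schedule are all arbitrary choices:

```python
import math
import random

random.seed(0)
X = [i / 10 for i in range(1, 21)]
Y = [3 * x + math.sin(x) for x in X]          # hidden target formula

def random_tree(depth=0):
    """Build a random expression tree from a few base functions."""
    if depth > 2 or random.random() < 0.3:
        return random.choice(['x', round(random.uniform(-5, 5), 2)])
    op = random.choice(['+', '*', 'sin'])
    if op == 'sin':
        return ('sin', random_tree(depth + 1))
    return (op, random_tree(depth + 1), random_tree(depth + 1))

def evaluate(node, x):
    if node == 'x':
        return x
    if isinstance(node, float):
        return node
    if node[0] == 'sin':
        return math.sin(evaluate(node[1], x))
    a, b = evaluate(node[1], x), evaluate(node[2], x)
    return a + b if node[0] == '+' else a * b

def rms_error(tree):
    return math.sqrt(sum((evaluate(tree, x) - y) ** 2
                         for x, y in zip(X, Y)) / len(X))

# Metropolis loop: always accept improvements, and accept regressions
# with a probability that shrinks as the temperature cools down.
current, temp = random_tree(), 1.0
best, best_err = current, rms_error(current)
for step in range(20000):
    candidate = random_tree()     # crude "mutation": resample a new tree
    delta = rms_error(candidate) - rms_error(current)
    if delta < 0 or random.random() < math.exp(-delta / temp):
        current = candidate
        err = rms_error(current)
        if err < best_err:
            best, best_err = current, err
    temp *= 0.9995                # geometric cooling schedule

print(best, best_err)
```

A production implementation would mutate subtrees locally, manage constants with numerical optimization, and penalize formula size, but the accept/reject skeleton is the same.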

The biggest difference between symbolic regression and neural networks is that the models produced by the former are explicit. A neural network may require thousands or even millions of weights to be represented, whereas a symbolic model might be a mathematical formula that fits on a single line. In this sense, symbolic regression can be said to be an alternative to neural networks that does not involve black boxes.

Deep learning with symbolic regression

So what does solving a traditional deep learning task with symbolic regression look like? As an example, let’s use it to classify the famous Iris dataset, in which four features of flowers are given and the goal is to predict the species of each flower. You can find the raw dataset here: iris.txt.

After loading this dataset in the symbolic regression software TuringBot, selecting “classification accuracy” as the search metric and setting a 50/50 test/train split for the training, these were the formulas that it ended up finding, ordered by complexity in ascending order:

The results of a symbolic regression procedure applied to the Iris dataset.

The error shown is the out-of-sample error. It can be seen that the best formula turned out to be one of intermediate size: not so small that it cannot find any pattern, but also not so large that it overfits the data. Its classification accuracy on the test set was 98%.
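As a sketch of how such a classifier is evaluated, the formula's numerical output can be rounded to the nearest class label and compared against the ground truth. The formula and the rows below are hypothetical stand-ins, not the actual model found by TuringBot:

```python
def model(sepal_len, sepal_wid, petal_len, petal_wid):
    # hypothetical formula mapping petal measurements to a class score
    return petal_wid + 0.1 * petal_len - 0.5

rows = [  # (features..., class): 0=setosa, 1=versicolor, 2=virginica
    (5.1, 3.5, 1.4, 0.2, 0),
    (6.0, 2.7, 4.5, 1.5, 1),
    (6.9, 3.1, 5.4, 2.1, 2),
]

# Classification accuracy: fraction of rows where the rounded
# formula output equals the true class label.
correct = sum(round(model(*row[:4])) == row[4] for row in rows)
print(correct / len(rows))
```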

If you found this example interesting, you might want to download TuringBot for free and give it a try with your own data. It can be used to solve regression and classification problems in general.

Conclusion

In this article, we have seen how symbolic regression can be used to solve problems in which a non-linear relationship between the input variables exists. Despite the ubiquity of neural networks, this alternative approach is capable of finding models that perform similarly, but with the advantage of being simple and explainable.


Symbolic regression example with Python visualization

Symbolic regression is a machine learning technique capable of generating models that are explicit and easy to understand.

In this tutorial, we are going to generate our first symbolic regression model. For that we are going to use the TuringBot software. After generating the model, we are going to visualize the results using a Python library (Matplotlib).

In order to make things more interesting, we are going to try to find a mathematical formula for the N-th prime number (A000040 in the OEIS).

Symbolic regression setup

The symbolic regression software that we are going to use is called TuringBot. It is a desktop application that runs on both Windows and Linux. The usage is straightforward: you load your input file in .txt or .csv format, select which column should be predicted and which columns should be used as input, and then start the search.

Several search metrics are available, including RMS error, mean error, correlation coefficient and others. Since we are interested in predicting the exact values of the prime numbers, we are going to use the “classification accuracy” metric.
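These metrics have standard textbook definitions, illustrated below on a toy prediction (TuringBot's exact implementations may differ):

```python
import math

y_true = [2, 3, 5, 7, 11]
y_pred = [2, 3, 4, 7, 10]

# Root-mean-square error and mean absolute error measure how far
# off the predictions are on average.
rms = math.sqrt(sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true))
mean_err = sum(abs(p - t) for p, t in zip(y_pred, y_true)) / len(y_true)

# "Classification accuracy" counts only exact matches, which is why it
# suits a target that must be hit exactly, like the N-th prime.
accuracy = sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)

print(rms, mean_err, accuracy)
```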

This is what the interface looks like after loading the input file containing prime numbers as a function of N, which we have truncated to the first 20 rows:

The TuringBot interface.

With the input file loaded and the search metric selected, the search can be started by clicking on the play button at the top of the interface.

The formulas that were found

After letting TuringBot work for a few minutes, these were the formulas that it ended up finding:

The results of our symbolic regression optimization.

The best one has an error of 0.20, that is, a classification accuracy of 80%, which is quite impressive considering how short the formula is. Of course, we could have obtained 100% accuracy with a huge polynomial, but that would not really compress the data in any way, since the number of free parameters in the resulting model would be the same as the number of data points.

Visualizing with Python

Now we can finally visualize the symbolic model using Python. Luckily the formula works out of the box as long as we import the math library (TuringBot follows the same naming convention). This is what the script looks like:

from math import *
import numpy as np
import matplotlib.pyplot as plt

def prime(x):
    return floor(1.98623*ceil(0.0987775+cosh(log2(x)-0.049869))-(1/x))

data = np.loadtxt('primes.txt')
plt.scatter(data[:,0], data[:,1], label='Data')
plt.plot(data[:,0], [prime(x) for x in data[:,0]], label='Model')
plt.xlabel('N')
plt.title('Prime numbers')
plt.legend()
plt.show()

And this is the resulting plot:

Plot of our model vs the original data.

Conclusion

In this tutorial we have seen how to generate a symbolic regression model. The example given was a very simple one, with only one input variable and a small number of data points, but the methodology would work just as well on a large real-world dataset with multiple dimensions, allowing a variety of machine learning problems of practical interest to be solved.


Using R to visualize a Symbolic Regression model

In this article, we are going to show how a symbolic regression model can be visualized using the R programming language. The model will be generated using the TuringBot symbolic regression software, and we are going to use the ggplot2 library [1] for the visualization.

The dataset that we are going to use consists of the closing prices for the S&P 500 index in the last year, downloaded from Yahoo Finance [2]. The CSV file, which also contains additional columns like open, high, low and volume, can be found here: spx.csv

Symbolic regression modelling

After opening TuringBot and selecting this file from the menu on the upper left of the interface, we select “Row number” as the input variable and “Close” as the target variable. This way, our model will find the close price as a function of the index of the trading day (1, 2, 3, etc). We will also use a randomly selected 50:50 train/test split to make our model more robust, and “mean relative error” as the optimization metric because we are more interested in the shape of the model than in specific values.
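For reference, this is the standard definition of mean relative error (an assumption about TuringBot's exact formula): the average of |error|/|true value|, which weights every point by its scale and so tracks the overall shape of the curve.

```python
def mean_relative_error(pred, true):
    """Average relative deviation between predictions and true values."""
    return sum(abs(p - t) / abs(t) for p, t in zip(pred, true)) / len(true)

# Two predictions that are each off by 100 points out of 3000:
print(mean_relative_error([2900, 3100], [3000, 3000]))
```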

This is what the interface will look like:

Clicking on the play button at the top, the optimization is started, using all the CPU cores in the computer for greater performance. The models encountered so far are seen in the “Solutions” box.

Selecting the best formula

After letting the optimization run for a few minutes, we can click on the “Show cross validation error” box on the upper right of the interface to see the out-of-sample performance of each model, and use this information to select the best one, which in this case turned out to be a combination of cosines and multiplications:

Visualizing with R and ggplot2

Now that we have the model, we are going to visualize it using ggplot2. The following script loads the input CSV file and plots it along with the model that we just selected:

library(ggplot2)

data <- read.csv("spx.csv")
data$idx <- as.numeric(row.names(data))

# The model selected above, as a function of the trading day index
eq <- function(row) {
  2966.96 + (2.98602*(-55.4604+row)*cos(0.0397268*(row+8.34129*cos(-0.0819996*row))-1.16301*cos(-0.0358919*row)))
}

p <- ggplot(data, aes(x=idx, y=Close)) + geom_point()
p + stat_function(fun=eq, color='blue')

And this is the final result:

Symbolic regression R model.

This demonstrates the power and simplicity of symbolic regression models: we have managed to readily implement and visualize a model generated using TuringBot in R, something that would be much harder if the model were a black box like a neural network or a random forest.

References

[1] ggplot2: https://ggplot2.tidyverse.org/

[2] Yahoo Finance quotes for the S&P 500: https://finance.yahoo.com/quote/%5EGSPC?p=^GSPC&.tsrc=fin-srch


Using Symbolic Regression to predict rare events

The formula above predicts credit card frauds in a real world dataset with 87% precision.

Rare events classification

Predicting rare events is a machine learning problem of great practical importance, and also a very difficult one. Models of this kind need to be trained on highly imbalanced datasets, and are used, among other things, for spotting fraudulent online transactions and detecting anomalies in medical images.

In this article, we show how such problems can be modeled using Symbolic Regression, a technique which attempts to find mathematical formulas that predict a desired variable from a set of input variables. Symbolic models, contrary to more mainstream ones like neural networks and random forests, are not black boxes, since they clearly show which variables are being used and how. They are also very fast and easy to implement, since no complex data structures are involved in the calculations.

In order to provide a real world example, we will try to model the credit card fraud dataset available on Kaggle using our Symbolic Regression software TuringBot. The dataset consists of a CSV file containing 284,807 transactions, one per row, out of which 492 are frauds. The first 28 columns represent anonymized features, and the last one contains “0” for legitimate transactions and “1” for fraudulent ones.

Prior to the regression, we remove all quotation mark characters from the file, so that those two categories are recognized as numbers by the software.
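This cleanup can be done in any text editor, or with a few lines of Python (the file names are assumptions; the Kaggle download is commonly named creditcard.csv):

```python
def strip_quotes(line):
    """Remove quotation marks so that numeric fields parse as numbers."""
    return line.replace('"', '')

# Apply to the dataset, writing a cleaned copy:
# with open('creditcard.csv') as src, open('clean.csv', 'w') as dst:
#     for line in src:
#         dst.write(strip_quotes(line))

print(strip_quotes('"0","-1.359807","0"'))
```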

Symbolic regression

Generating symbolic models using TuringBot is a straightforward process, which requires no data science skills. The first step is to open the program and load the input file by clicking on the “Input” button, shown below. After loading, the program will automatically define the column “Class” as the target variable and all other ones as input variables, which is what we want.

Then, we select the error metric for the search as “F1 score”, which is the appropriate one for binary classification problems on highly imbalanced datasets like this one. This metric is the harmonic mean of the precision and the recall of the model. A very illustrative image that explains what precision and recall are can be found on the Wikipedia page for F1 score.
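Concretely, the F1 score can be computed from raw prediction counts; the counts below are hypothetical:

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from prediction counts."""
    precision = tp / (tp + fp)   # of flagged transactions, how many were frauds
    recall = tp / (tp + fn)      # of actual frauds, how many were flagged
    return 2 * precision * recall / (precision + recall)

# e.g. 400 frauds caught, 60 false alarms, 92 frauds missed:
print(f1_score(tp=400, fp=60, fn=92))
```

Because it is a harmonic mean, F1 stays high only when precision and recall are both high, which is exactly what a rare-event classifier needs.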

That’s it! After those two steps, the search is ready to start. Just click on the “play” button at the top of the interface. The best solutions that the program has encountered so far will be shown in the “Solutions” box in real time.

Bear in mind that this is a relatively large dataset, and it may seem like not much is going on during the first minutes of the optimization. Ideally, you should leave the program running until at least a few million formulas have been tested (you can see this number in the Log tab). On a modest i7-3770 CPU with 8 threads, this took us about 6 hours; a more powerful CPU would take less time.

The resulting formula

The models that were encountered by the program after this time were the following:

The error for the best one is 0.17, meaning its F1 score is 1 – 0.17 = 0.83. This implies that both the recall and the precision of the model are close to 83%. In a verification using Python, we have found that they are 80% and 87% respectively.

So what does this mean? That the following mathematical formula found by our program is capable of detecting 80% of all frauds in the dataset, and that it is right 87% of the time when it claims that a fraud is taking place! This is a result consistent with the best machine learning methods available.

Conclusion

In this article, we have demonstrated that our Symbolic Regression software TuringBot is able to generate models that classify credit card frauds in a real world dataset with high precision and high recall. We believe that this kind of modeling capability, combined with the transparency and efficiency of the generated models, is very useful for those interested in developing machine learning models for the classification and prediction of rare events.
