A regression model example and how to generate it

Regression models are perhaps the most important class of machine learning models. In this tutorial, we will show how to easily generate a regression model from data values.

What is regression

The goal of a regression model is to be able to predict a target variable taking as input one or more input variables. The simplest case is that of a linear relationship between the variables, in which case basic methods such as least squares regression can be used.

In real-world datasets, the relationship between the variables is often highly non-linear. This motivates the use of more sophisticated machine learning techniques to solve the regression problems, including for instance neural networks and random forests.

A regression problem example is to predict the value of a house from its characteristics (location, number of bedrooms, total area, etc), using for that information from other houses which are not identical to it but for which the prices are known.

Regression model example

To give a concrete example, let’s consider the following dataset of house prices: house_prices.txt. It contains the following columns:

Index;
Local selling prices, in hundreds of dollars;
Number of bathrooms;
Area of the site in thousands of square feet;
Size of the living space in thousands of square feet;
Number of garages;
Number of rooms;
Number of bedrooms;
Age in years;
Construction type (1=brick, 2=brick/wood, 3=aluminum/wood, 4=wood);
Number of fire places;
Selling price.

The goal is to predict the last column, the selling price, as a function of all the other variables. In order to do that, we are going to use a technique called symbolic regression, which attempts to find explicit mathematical formulas that connect the input variables to the target variable.

We will use the desktop software TuringBot, which can be downloaded for free, to find that regression model. The usage is quite straightforward: you load the input file through the interface, select which variable is the target and which variables should be used as input, and then start the search. This is what its interface looks like with the data loaded in:

The TuringBot interface.

We have also enabled the cross validation feature with a 50/50 test/train split (see the “Search options” menu in the image above). This will allow us to easily discard overfit formulas.

After running the optimization for a few minutes, the formulas found by the program and their corresponding out-of-sample errors were the following:

The regression models found for the house prices.

The highlighted one turned out to be the best — more complex solutions did not offer increased out-of-sample accuracy. Its mean relative error in the test dataset was of roughly 8%. Here is that formula:

price = fire_place+15.5668+(1.66153+bathrooms)*local_pric

The variables that are present in it are only three: the number of bathrooms, the number of fire places and the local price. It is a completely non-trivial fact that the house price should only depend on these three parameters, but the symbolic regression optimization made this fact evident.

Conclusion

In this tutorial, we have seen an example of generating a regression model. The technique that we used was symbolic regression, implemented in the desktop software TuringBot. The model that was found had a good out-of-sample accuracy in predicting the prices of houses based on their characteristics, and it allowed us to clearly see the most relevant variables in estimating that price.

Share this with your network:

How to find a formula for the nth term of a sequence

Given a sequence of numbers, finding an explicit mathematical formula that computes the nth term of the sequence can be challenging, except in very special cases like arithmetic and geometric sequences.

In the general case, this task involves searching over the space of all mathematical formulas for the most appropriate one. A special technique exists that does just that: symbolic regression. Here we will introduce how it works, and use it to find a formula for the nth term in the Fibonacci sequence (A000045 in the OEIS) as an example.

What symbolic regression is

Regression is the task of establishing a relationship between an output variable and one or more input variables. Symbolic regression solves this task by searching over the space of all possible mathematical formulas for the ones with the greatest accuracy, while trying to keep those formulas as simple as possible.

The technique starts from a set of base functions — for instance, sin(x), exp(x), addition, multiplication, etc. Then it tries to combine those base functions in various ways using an optimization algorithm, keeping track of the most accurate ones found so far.

An important point in symbolic regression is simplicity. It is easy to find a polynomial that will fit any sequence of numbers with perfect accuracy, but that does not really tell you anything since the number of free parameters in the model is the same as the number of input variables. For this reason, a symbolic regression procedure will discard a larger formula if it finds a smaller one that performs just as well.

Finding the nth Fibonacci term

Now let’s show how symbolic regression can be used in practice by trying to find a formula for the Fibonacci sequence using the desktop symbolic regression software TuringBot. The first two terms of the sequence are 1 and 1, and every next term is defined as the sum of the previous two terms. Its first terms are the following, where the first column is the index:

1 1
2 1
3 2
4 3
5 5
6 8
7 13
8 21
9 34
10 55

A list of the first 30 terms can be found on this file: fibonacci.txt.

TuringBot takes as input TXT or CSV files with one variable per column and efficiently finds formulas that connect those variables. This is how it looks like after we load fibonacci.txt and run the optimization:

Finding a formula for the nth Fibonacci term with TuringBot.

The software finds not only a single formula, but the best formulas of all possible complexities. A larger formula is only shown if it performs better than all simpler alternatives. In this case, the last formula turned out to predict with perfect accuracy every single one of the first 30 Fibonacci terms. The formula is the following:

f(x) = floor(cosh(-0.111572+0.481212*x))

Clearly a very elegant solution. The same procedure can be used to find a formula for the nth term of any other sequence (if it exists).

Conclusion

In this tutorial, we have seen how the symbolic regression software TuringBot can be used to find a closed-form expression for the nth term in a sequence of numbers. We found a very short formula for the Fibonacci sequence by simply writing it into a text file with one number per row and loading this file into the software.

If you are interested in trying TuringBot your own data, you can download it from the official website. It is available for both Windows and Linux.

Share this with your network:

A machine learning software for data science

Data science is becoming more and more widespread, pushed by companies that are finding that very valuable and actionable information can be extracted from their databases.

It can be challenging to develop useful models from raw data. Here we will introduce a tool that makes it very easy to develop state of the art models from any dataset.

What is TuringBot

TuringBot is a desktop machine learning software. It runs on both Windows and Linux, and what it does is generate models that predict some target variable taking as input one or more input variables. It does that through a technique called symbolic regression. This is what its interface looks like:

TuringBot’s interface.

The idea of symbolic regression is to search over the space of all possible mathematical formulas for the ones that best connect the input variables to the target variable, while trying to keep those formulas as simple as possible. The target variable can be anything: for instance, it can represent different categorical variables as different integer numbers, allowing the program to solve classification problems, or it can be a regular continuous variable.

Machine learning with TuringBot

The usage of TuringBot is very straightforward. All you have to do is save your data in CSV or TXT format, with one variable per column, and load this input file through the program’s interface.

Once the data is loaded, you can select the target variable and which variables should be used as input, as well as the search metric, and then start the search. Several search metrics are available, including RMS error, mean error and classification accuracy. A list of formulas encountered so far will be shown in real time, ordered by complexity. Those formulas can be easily exported as Python, C or text from the interface:

Some solutions found by TuringBot. They can readily be exported to common programming languages.

Most machine learning methods are black boxes, which carry out complex computations under the hood before giving a result. This is how neural networks and random forests work, for instance. A great advantage of TuringBot over these methods is that the models that it generates are very explicit, allowing some understanding to be gained into the data. This turns data science into something much more similar to natural science and its search for mathematical laws that explain the world.

How to get the software

If you are interested in trying TuringBot on your own data, you can download it for free from the official website. There you can also find the official documentation, with detailed information about all the features and parameters of the software. Many engineers and data scientists are already making use of the software to find hidden patterns in their data.

Share this with your network:

Symbolic regression tutorial with TuringBot

In this tutorial, we are going to show how you can find a formula from your data using the symbolic regression software TuringBot. It is a desktop software that runs on both Windows and Linux, and as you will see the usage is very simple.

Preparing the data

TuringBot takes as input files in .txt or CSV format containing one variable per column. The first row may contain the names of the variables, otherwise they will be labelled col1, col2, col3, etc.

For instance, the following is a valid input file:

x y z w classification
5.20 2.70 3.90 1.40 1
6.50 2.80 4.60 1.50 1
7.70 2.80 6.70 2.00 2
5.90 3.20 4.80 1.80 1
5.00 3.50 1.60 0.60 0
5.10 3.50 1.40 0.20 0
4.60 3.10 1.50 0.20 0
6.90 3.20 5.70 2.30 2

Loading the data into TuringBot

This is what the program looks like when you open it:

The TuringBot interface.

By clicking on the “Input file” button on the upper left, you can select your input file and load it. Different search metrics are available, including for instance classification accuracy, and a handy cross validation feature can also be enabled in the “Search options” box — if enabled, it will automatically create a test/train split and allow you to see the out-of-sample error as the optimization goes on. But in this example we are going to keep things simple and just use the defaults.

Finding the formulas

After loading the data, you can click on the play button at the top of the interface to start the optimization. The best formulas found so far will be shown in the “Solutions” box, in ascending order of complexity. A formula is only shown if its accuracy is greater than that of all simpler alternatives — in symbolic regression, the goal is not simply to find a formula, but to find the simplest ones possible.

Here are the formulas it found for an example dataset:

Finding formulas with TuringBot.

The formulas are all written in a format that is compatible out of the box with Python and C. Indeed, the menu on the upper right allows you to export the solutions to these languages:

Exporting solutions to different languages.

In this example, the true formula turned out to be sqrt(x), which was recovered in a few seconds. The methodology would be the same for a real-world dataset with many input variables and an unknown dependency between them.

How to get TuringBot

If you have liked this tutorial, we encourage you to download TuringBot for free from the official website. As we have shown, it is very simple to use, and its powerful mathematical modelling capabilities allow you to find very subtle numerical patterns in your data. Much like a scientist would do from empirical observations, but in an automatic way and millions of times faster.

Share this with your network:

How to find formulas from values

Finding mathematical formulas from data is an extremely useful machine learning task. A formula is the most compressed representation of a table, allowing large amounts of data to be compressed into something simple, while also making explicit the relationship that exists between the different variables.

In this tutorial, we are going to generate a dataset and try to recover the original formula using the symbolic regression software TuringBot, without any previous knowledge of what that formula was.

What symbolic regression is

Symbolic regression is a machine learning technique that tries to find explicit mathematical formulas that connect variables. The technique starts from a set of base functions to be used in the search, for instance, addition, multiplication, sin(x), exp(x), etc, and then tries to combine those functions in such a way that the target variable is accurately predicted.

Simplicity is as important as accuracy in a symbolic regression model. Every dataset can be represented with perfect accuracy by a polynomial, but that is uninformative since the number of free parameters in the model the same as the number of training data points. For this reason, a symbolic regression optimization penalizes large formulas, favoring simpler ones that perform just as well.

Generating an example dataset

Let’s give an explicit example of how symbolic regression can be used to find a formula from data. We will generate a dataset that consists of the formula x*cos(10*x) + 2, add noise to this data, and then see if we can recover this formula using symbolic regression.

The following Python script generates the input data:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 100)
y = np.cos(10*x)*x + 2 + np.random.random(len(x))*0.1

And this is what the result looks like:

The input data that we have generated.

Now we are going to try to find a formula for this data and see what happens.

Finding a formula using TuringBot

The usage of TuringBot is very simple. All we have to do is load the input data using its interface and start the search. First, we save the data to an input file:

arr = np.column_stack((x, y))
np.savetxt('input.txt', arr, fmt='%f')

After loading input.txt into TuringBot, starting the search, and letting it work for a minute, these were the formulas that it found, ordered by complexity:

The formulas found by TuringBot for our input dataset.

It can be seen that it has successfully found our original formula!

Conclusion

Here we have seen how symbolic regression can be used to automatically find mathematical formulas from data values. The example that we have given was a simple one, but the procedure that we have used would also work for a real-world dataset in which the dependencies between the variables was not known beforehand, and in which more than one input variable was present.

If you are interested in trying to find formulas from your own dataset, you can download TuringBot for free from the official website.

Share this with your network:

Deep learning with symbolic regression

Symbolic regression is an innovative machine learning technique that is capable of generating results similar to those of neural networks, but with a completely different approach. Here we will talk about its basic characteristics, and show how it can be used to solve deep learning problems.

What is deep learning?

The concept of deep learning has emerged in the context of artificial neural networks. A neural network which contains hidden layers is capable of pre-processing the input information and extracting non-trivial features prior to combining that input into an output value. The term “deep learning” comes from the presence of those multiple layers.

More recently, it has become common to call deep learning any machine learning technique that is capable of extracting non-trivial information from an input and using that to predict target variables in a way that is not possible for classical statistical methods.

How symbolic regression works

Despite being so common, neural networks are not the only way to extract non-trivial patterns from input data. An alternative technique, which is capable of solving the same tasks as neural networks, is called symbolic regression.

The idea of symbolic regression is to find explicit mathematical formulas that predict a target variable taking as input a set of input variables. Sophisticated algorithms have to be employed to efficiently search over the space of all mathematical formulas, which is very large. The most common approach is to use genetic algorithms for this search, but TuringBot shows that a simulated annealing optimization also gives excellent results.

The biggest difference between symbolic regression and neural networks is that the models that result from the former are explicit. Neural networks often require hundreds of weights to be represented, whereas a symbolic model might be a mathematical formula that fits on a single line. This way, symbolic regression can be said to be an alternative to neural networks that does not involve black boxes.

Deep learning with symbolic regression

So how does it work to solve a traditional deep learning task with symbolic regression? To give an example, let’s try to use it to classify the famous Iris dataset, in which four features of flowers are given and the goal is to classify the species of those flowers using this data. You can find the raw dataset here: iris.txt.

After loading this dataset in the symbolic regression software TuringBot, selecting “classification accuracy” as the search metric and setting a 50/50 test/train split for the training, these were the formulas that it ended up finding, ordered by complexity in ascending order:

The results of a symbolic regression procedure applied to the Iris dataset.

The error shown is the out-of-sample error. It can be seen that the best formula turned out to be one of intermediate size, not so small that it cannot find any pattern, but also not so large that it overfit the data. Its classification accuracy in the test domain was 98%.

If you found this example interesting, you might want to download TuringBot for free and give it a try with your own data. It can be used to solve regression and classification problems in general.

Conclusion

In this article, we have seen how symbolic regression can be used to solve problems where a non-linear relationship between the input variables exist. Despite neural networks being so common, this alternative approach is capable of finding models that perform similarly, but with the advantage of being simple and explainable.

Share this with your network:

Symbolic regression example with Python visualization

Symbolic regression is a machine learning technique capable of generating models that are explicit and easy to understand.

In this tutorial, we are going to generate our first symbolic regression model. For that we are going to use the TuringBot software. After generating the model, we are going to visualize the results using a Python library (Matplotlib).

In order to make things more interesting, we are going to try to find a mathematical formula for the N-th prime number (A000040 in the OEIS).

Symbolic regression setup

The symbolic regression software that we are going to use is called TuringBot. It is a desktop application that runs on both Windows and Linux. The usage is straightforward: you load your input file in .txt or .csv file, select which column should be predicted and which columns should be used as input, and then start the search.

Several search metrics are available, including RMS error, mean error, correlation coefficient and others. Since we are interested in predicting the exact values of the prime numbers, we are going to use the “classification accuracy” metric.

This is what the interface looks like after loading the input file containing prime numbers as a function of N, which we have truncated to the first 20 rows:

The TuringBot interface.

With the input file loaded and the search metric selected, the search can be started by clicking on the play button at the top of the interface.

The formulas that were found

After letting TuringBot work for a few minutes, these were the formulas that it ended up finding:

The results of our symbolic regression optimization.

The best one has an error of 0.20, that is, a classification accuracy of 80%. Which is quite impressive considering how short the formula is. Of course we could have obtained a 100% accuracy with a huge polynomial, but that would not really compress the data in any way, since the number of free parameters in the resulting model would be the same as the number of data points.

Visualizing with Python

Now we can finally visualize the symbolic model using Python. Luckily the formula works out of the box as long as we import the math library (TuringBot follows the same naming convention). This is what the script looks like:

from math import *
import numpy as np
import matplotlib.pyplot as plt

def prime(x):
    return floor(1.98623*ceil(0.0987775+cosh(log2(x)-0.049869))-(1/x))

data = np.loadtxt('primes.txt')
plt.scatter(data[:,0], data[:,1], label='Data')
plt.plot(data[:,0], [prime(x) for x in data[:,0]], label='Model')
plt.xlabel('N')
plt.title('Prime numbers')
plt.legend()
plt.show()

And this is the resulting plot:

Symbolic regression Python model.
Plot of our model vs the original data.

Conclusion

In this tutorial we have seen how to generate a symbolic regression model. The example given was a very simple one, with only one input variable and a small number of data points, but the methodology would work just as fine with a real-world large dataset with multiple dimensions, allowing a variety of machine learning problems of practical interest to be solved.

Share this with your network:

An alternative to the Eureqa software

Eureqa is a symbolic regression software based on genetic programming. Here we will talk about an alternative to that software called TuringBot.

About Eureqa

Eureqa used to be developed by a company called Nutonian. A few years ago this company was acquired by a consulting company called Data Robot, and Eureqa has been removed from the market after that.

The program gained popularity due to its ease of use. Finding mathematical formulas from data using its graphical interface was very convenient and required no coding.

The alternative: TuringBot

An alternative to Eureqa exists and is called TuringBot. It uses a completely different approach to solve symbolic regression problems, based on a simulated annealing algorithm. It can be downloaded for free from the official website.

Here is what its interface looks like:

The interface of the TuringBot symbolic regression software.

It features a variety of search metrics, allowing many different kinds of machine learning models to be solved. Those include the basic RMS and mean error regression metrics, but also classification accuracy, F1 score (for rare event classification) and correlation coefficient.

The code allows overfit solutions to be easily ruled out with its convenient cross validation feature. A test/train split can be enabled through the interface, and the out-of-sample error shown in the solutions box can be used to select the formula with the best trade-off between size and accuracy.

Compared to Eureqa, the symbolic regression implementation of TuringBot seems to yield better results in many cases. Eureqa overly restricts itself to simpler and less recursive formulas, and often results in polynomial fits to the data that diverge and lose usefulness outside the training domain. We also find that TuringBot is noticeably faster than Eureqa.

More information

If the concept of TuringBot sounds interesting to you, you can learn more about it from the official website, and also from the posts on this blog. We suggest the following to get started:

For an introduction to Symbolic Regression, you can also check out the Wikipedia article on the topic.

Share this with your network:

Using R to visualize a Symbolic Regression model

In this article, we are going to show how a symbolic regression model can be visualized using the R programming language. The model will be generated using the TuringBot symbolic regression software, and we are going to use the ggplot2 library [1] for the visualization.

The dataset that we are going to use consists of the closing prices for the S&P 500 index in the last year, downloaded from Yahoo Finance [2]. The CSV file, which also contains additional columns like open, high, low and volume, can be found here: spx.csv

Symbolic regression modelling

After opening TuringBot and selecting this file from the menu on the upper left of the interface, we select “Row number” as the input variable and “Close” as the target variable. This way, our model will find the close price as a function of the index of the trading day (1, 2, 3, etc). We will also use a randomly selected 50:50 train/test split to make our model more robust, and “mean relative error” as the optimization metric because we are more interested in the shape of the model than in specific values.

This is what the interface will look like:

Clicking on the play button at the top, the optimization is started, using all the CPU cores in the computer for greater performance. The models encountered so far are seen in the “Solutions” box.

Selecting the best formula

After letting the optimization run for a few minutes, we can click on the “Show cross validation error” box on the upper right of the interface to see the out-of-sample performance of each model, and use this information to select the best one, which in this case turned out to be a combination of cosines and multiplications:

Visualizing with R and ggplot2

Now that we have the model, we are going to visualize it using ggplot2. The following script loads the input CSV file and plots it along with the model that we just selected:

library(ggplot2)

data <- read.csv("spx.csv")
data$idx <- as.numeric(row.names(data))
print(data)

eq = function(row){2966.96+(2.98602*(-55.4604+row)*cos(0.0397268*(row+8.34129*cos(-0.0819996*row))-1.16301*cos(-0.0358919*row)))}

p <- ggplot(data, aes(x=idx, y=Close)) + geom_point()
#p <- ggplot() + geom_line(aes(x=idx, y=Close), data=data)

p + stat_function(fun=eq, color='blue')

#png("test.png")
#print(p)

And this is the final result:

Symbolic regression R model.

This demonstrates the power and simplicity of symbolic regression models: we have managed to readily implement and visualize a deep learning model generated using TuringBot into R, something that would be much harder if the model was a black-box like a neural network or a random forest.

References

[1] ggplot2: https://ggplot2.tidyverse.org/

[2] Yahoo Finance quotes for the S&P 500: https://finance.yahoo.com/quote/%5EGSPC?p=^GSPC&.tsrc=fin-srch

Share this with your network:

Using Symbolic Regression to predict rare events

The formula above predicts credit card frauds in a real world dataset with 87% precision.

Rare events classification

Predicting rare events is a machine learning problem of great practical importance, and also a very difficult one. Models of this kind need to be trained on highly imbalanced datasets, and are used, among other things, for spotting fraudulent online transactions and detecting anomalies in medical images.

In this article, we show how such problems can be modeled using Symbolic Regression, a technique which attempts to find mathematical formulas that predict a desired variable from a set of input variables. Symbolic models, contrary to more mainstream ones like neural networks and random forests, are not black boxes, since they clearly show which variables are being used and how. They are also very fast and easy to implement, since no complex data structures are involved in the calculations.

In order to provide a real world example, we will try to model the credit card fraud dataset available on Kaggle using our Symbolic Regression software TuringBot. The dataset consists of a CSV file containing 284,807 transactions, one per row, out of which 492 are frauds. The first 28 columns represent anonymized features, and the last one contains “0” for legitimate transactions and “1” for fraudulent ones.

Prior to the regression, we remove all quotation mark characters from the file, so that those two categories are recognized as numbers by the software.

Symbolic regression

Generating symbolic models using TuringBot is a straightforward process, which requires no data science skills. The first step is to open the program and load the input file by clicking on the “Input” button, shown below. After loading, the code will automatically define the column “Class” as the target variable and all other ones as input variables, which is what we want.

Then, we select the error metric for the search as “F1 score”, which is the appropriate one for binary classification problems on highly imbalanced datasets like this one. This metric corresponds to a geometric mean of precision and the recall of the model. A very illustrative image that explains what precision and recall are can be found on the Wikipedia page for F1 score.

That’s it! After those two steps, the search is ready to start. Just click on the “play” button at the top of the interface. The best solutions that the program has encountered so far will be shown in the “Solutions” box in real time.

Bear in mind that this is a relatively large dataset, and that it may seem like not much is going on in the first minutes of the optimization. Ideally, you should leave the program running until at least a few million formulas have been tested (you can see the number so far in the Log tab). In a modest i7-3770 CPU with 8 threads, this took us about 6 hours. A more powerful CPU would take less time.

The resulting formula

The models that were encountered by the program after this time were the following:

The error for the best one is 0.17, meaning its F1 score is 1 – 0.17 = 0.83. This implies that both the recall and the precision of the model are close to 83%. In a verification using Python, we have found that they are 80% and 87% respectively.

So what does this mean? That the following mathematical formula found by our program is capable of detecting 80% of all frauds in the dataset, and that it is right 87% of the time when it claims that a fraud is taking place! This is a result consistent with the best machine learning methods available.

Conclusion

In this article, we have demonstrated that our Symbolic Regression software TuringBot is able to generate models that classify credit card frauds in a real world dataset with high precision and high recall. We believe that this kind of modeling capability, combined with the transparency and efficiency of the generated models, is very useful for those interested in developing machine learning models for the classification and prediction of rare events.

Share this with your network: