Symbolic Regression in Python with TuringBot

In this tutorial, we are going to show a very easy way to do symbolic regression in Python.

For that, we are going to use the symbolic regression software TuringBot. This program runs on both Windows and Linux, and it comes with a handy Python library. You can download it for free from the official website.

Importing TuringBot

The first step in running our symbolic regression optimization in Python is importing TuringBot. To do that, all you have to do is add its installation directory to your Python path and import it, like so:

Windows

import sys 
sys.path.insert(1, r'C:\Users\user\AppData\Local\Programs\TuringBot') 

import turingbot as tb 
Linux

import sys 
sys.path.insert(1, '/usr/share/turingbot') 

import turingbot as tb 

Running the optimization

The turingbot library implements a simulation object that can be used to start, stop and get the current status of a symbolic regression optimization.

This is how it works:

Windows

path = r'C:\Users\user\AppData\Local\Programs\TuringBot\TuringBot.exe' 
input_file = r'C:\Users\user\Desktop\input.txt' 
config_file = r'C:\Users\user\Desktop\settings.cfg' 

sim = tb.simulation() 
sim.start_process(path, input_file, threads=4, config=config_file) 
Linux

path = r'/usr/bin/turingbot' 
input_file = r'/home/user/input.txt' 
config_file = r'/home/user/settings.cfg' 

sim = tb.simulation() 
sim.start_process(path, input_file, threads=4, config=config_file) 

The start_process method starts the optimization in the background. It takes as input the paths to the TuringBot executable and to your input file. Optionally, you can also set the number of threads that the program should use and the path to the configuration file (more on that below).

After running the commands above, nothing will seem to happen, since the optimization runs in the background. To retrieve and print the current best formulas, you should use:

sim.refresh_functions() 
print(*sim.functions, sep='\n') 
print(sim.info) 

To stop the optimization and kill the TuringBot process, you should use the terminate_process method:

sim.terminate_process()

Using a configuration file

We have seen above that the start_process method may take the path to a configuration file as an optional input parameter. This is what the file should look like:

4 # Search metric. 1: Mean relative error, 2: Classification accuracy, 3: Mean error, 4: RMS error, 5: F1 score, 6: Correlation coefficient, 7: Hybrid (CC+RMS), 8: Maximum error, 9: Maximum relative error
-1 # Train/test split. -1: No cross validation. Valid options are: 50, 60, 70, 75, 80
1 # Test sample. 1: Chosen randomly, 2: The last points
0 # Integer constants only. 0: Disabled, 1: Enabled
0 # Bound search mode. 0: Deactivated, 1: Lower bound search, 2: Upper bound search
60 # Maximum formula complexity.
+ * / pow fmod sin cos tan asin acos atan exp log log2 sqrt sinh cosh tanh asinh acosh atanh abs floor ceil round tgamma lgamma erf # Allowed functions.

The comments after the # characters are for your convenience and are ignored. To change the search settings, all you have to do is change the numbers in each line. To change the base functions for the search, just add or delete their names from the last line.

To customize your search, save the contents above to a settings.cfg file and pass the path of this file to the start_process method when calling it.
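If you prefer to generate the configuration file from a script instead of writing it by hand, a few lines of Python will do it. This is just a sketch mirroring the example above, with an abbreviated list of allowed functions:

```python
# Generate the example settings.cfg shown above from Python.
# The option values mirror the sample configuration; the list of
# allowed functions is abbreviated here for brevity.
config_lines = [
    "4 # Search metric. 4: RMS error",
    "-1 # Train/test split. -1: No cross validation",
    "1 # Test sample. 1: Chosen randomly",
    "0 # Integer constants only. 0: Disabled",
    "0 # Bound search mode. 0: Deactivated",
    "60 # Maximum formula complexity.",
    "+ * / pow sin cos tan exp log sqrt abs floor ceil round # Allowed functions.",
]

with open("settings.cfg", "w") as f:
    f.write("\n".join(config_lines) + "\n")
```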

Full example

Here is the full source code for the examples provided above. Note that you have to replace user in the paths with your local username, and that you have to create an input file (TXT or CSV format, one number per column) to use with the program.

Windows

import sys 
sys.path.insert(1, r'C:\Users\user\AppData\Local\Programs\TuringBot') 

import turingbot as tb 
import time

path = r'C:\Users\user\AppData\Local\Programs\TuringBot\TuringBot.exe' 
input_file = r'C:\Users\user\Desktop\input.txt' 
config_file = r'C:\Users\user\Desktop\settings.cfg' 

sim = tb.simulation() 
sim.start_process(path, input_file, threads=4, config=config_file) 

time.sleep(10)

sim.refresh_functions()
print(*sim.functions, sep='\n')
print(sim.info)

sim.terminate_process()

Linux

import sys 
sys.path.insert(1, '/usr/share/turingbot') 

import turingbot as tb 
import time 

path = r'/usr/bin/turingbot' 
input_file = r'/home/user/input.txt' 
config_file = r'/home/user/settings.cfg' 

sim = tb.simulation() 
sim.start_process(path, input_file, threads=4, config=config_file) 

time.sleep(10) 

sim.refresh_functions() 
print(*sim.functions, sep='\n') 
print(sim.info) 

sim.terminate_process()

10 creative applications of symbolic regression

Symbolic regression is a method that discovers mathematical formulas from data without assumptions on what those formulas should look like. Given a set of input variables x1, x2, x3, etc, and a target variable y, it will use trial and error to find f such that y = f(x1, x2, x3, …).

The method is very general, given that the target variable y can be anything, and given that a variety of error metrics can be chosen for the search. Here we want to enumerate a few creative applications to give the reader some ideas.

All of these problems can be modeled out of the box with the TuringBot symbolic regression software.

1. Forecast the next values of a time series

Say you have a sequence of numbers and you want to predict the next one. This could be the monthly revenue of a company or the daily prices of a stock, for instance.

In special cases, this kind of problem can be solved by simply fitting a line to the data and extrapolating to the next point, a task that can be easily accomplished with numpy.polyfit. While this will work just fine in many cases, it will not be useful if the time series evolves in a nonlinear way.
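For the linear case, the numpy.polyfit approach mentioned above looks like this (with a made-up revenue series for illustration):

```python
import numpy as np

# A short monthly revenue series that happens to grow linearly.
y = np.array([10.0, 12.1, 13.9, 16.0, 18.1, 20.0])
index = np.arange(1, len(y) + 1)

# Fit a straight line y = a*index + b and extrapolate to the next point.
a, b = np.polyfit(index, y, 1)
next_value = a * (len(y) + 1) + b
print(round(next_value, 1))  # → 22.0
```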

Symbolic regression offers a more general alternative. One can look for formulas for y = f(index), where y are the values of the series and index = 1, 2, 3, etc. A prediction can then be made by evaluating the resulting formulas at a future index.

This is not a mainstream way to go about this kind of problem, but the simplicity of the resulting models can make them much more informative than mainstream forecasting methods like Monte Carlo simulations, used for instance by Facebook’s Prophet library.

2. Predict binary outcomes

A machine learning problem of great practical importance is to predict whether something will happen or not. This is a central problem in options trading, gambling, and finance (“will a recession happen?”).

Numerically, this problem translates to predicting 0 or 1 based on a set of input features.

Symbolic regression allows binary problems to be solved by using classification accuracy as the error metric for the search. In order to minimize the error, the optimization will converge without supervision towards formulas that only output 0 or 1, usually involving floor/ceil/round of some bounded function like tanh(x) or cos(x).
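To illustrate the shape of such solutions, here is a hypothetical formula of that kind (not an actual TuringBot output) that maps any real input to 0 or 1:

```python
from math import tanh, ceil

def classify(x):
    # ceil(tanh(x)) is 1 for any x > 0 and 0 for any x <= 0,
    # so the output is always binary.
    return ceil(tanh(x))

# Evaluate over a range of inputs: only 0 and 1 ever come out.
outputs = {classify(x * 0.1) for x in range(-50, 51)}
print(outputs)  # → {0, 1}
```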

3. Predict continuous outcomes

A generalization of the problem of making a binary prediction is the problem of predicting a continuous quantity in the future.

For instance, in agriculture one could be interested in predicting the time for a crop to mature given parameters known at the time of sowing, such as soil composition, the month of the year, temperature, etc.

Usually, few data points will be available to train the model in this kind of scenario, but since symbolic models are simple, they are the least likely to overfit the data. The problem can be modeled by running the optimization with a standard error metric like root-mean-square error or mean error.

4. Solve classification problems

Classification problems, in general, can be solved by symbolic regression with a simple trick: representing different categorical variables as different integer numbers.

If your data points have 10 possible labels that should be predicted based on a set of input features, you can use symbolic regression to find formulas that output integers from 1 to 10 based on these features.

This may sound like asking too much — a formula capable of that is highly specific. But a good symbolic regression engine will be thorough in its search over the space of all mathematical formulas and will eventually find appropriate solutions.

5. Classify rare events

Another interesting case of classification problem is that of a highly imbalanced dataset, in which only a handful of rows contain the relevant label and the rest are negatives. This could be medical diagnostic images or fraudulent credit card transactions.

For this kind of problem, the usual classification accuracy search metric is not appropriate, since f(x1, x2, x3, …) = 0 will have a very high accuracy while being a useless function.

Special search metrics exist for this kind of problem, the most popular of which is the F1 score, the harmonic mean of precision and recall. This search metric is available in TuringBot, allowing this kind of problem to be easily modeled.
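To see why this matters, here is the F1 score computed by hand for a classifier that always predicts 0 on an imbalanced dataset. The accuracy would be 98%, but the F1 score exposes the model as useless:

```python
def f1_score(y_true, y_pred):
    # F1 is the harmonic mean of precision and recall.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 98 negatives, 2 positives: always predicting 0 is 98% accurate...
y_true = [0] * 98 + [1] * 2
always_zero = [0] * 100
print(f1_score(y_true, always_zero))  # ...but scores an F1 of 0.0
```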

6. Compress data

A mathematical formula is perhaps the shortest possible representation of a dataset. If the target variable features some kind of regularity, symbolic regression can turn gigabytes of data into something that can be equivalently expressed in one line.

An example target variable could be the RGB colors of an image as a function of its (x, y) pixel coordinates. We have tried finding a formula for the Mona Lisa, but unfortunately, nothing simple could be found in this case.

7. Interpolate data

Say you have a table of numbers and you want to compute the target variable for intermediate values not present in the table itself.

One way to go about this is to generate a spline interpolation from the table, which is a somewhat cumbersome and non-portable solution.

With symbolic regression, one can turn the entire table into a mathematical expression, and then proceed to do the interpolation without the need for specialized libraries or data structures, and also without the need to store the table itself anywhere.
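As a hypothetical example, if symbolic regression recovered sqrt(x) from a table of square roots, interpolating between the rows would reduce to evaluating the formula:

```python
from math import sqrt

# A small lookup table: square roots at integer points only.
table = {x: sqrt(x) for x in range(1, 11)}

# Suppose symbolic regression recovered the formula f(x) = sqrt(x)
# from that table (hypothetical result).
def f(x):
    return sqrt(x)

# The formula interpolates between the rows with no spline machinery,
# and the table itself no longer needs to be stored.
value = f(2.5)  # 2.5 is not present in the table
print(value)    # ≈ 1.5811
```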

8. Discover upper or lower bounds for a function

In problems of engineering and applied mathematics, one is often interested not in the particular value of a variable but in how fast this variable grows or how large it can be given an input. In this case, it is more informative to obtain an upper bound for the function than an approximation for the function itself.

With symbolic regression, this can be accomplished by discarding formulas that are not always larger or always smaller than the target variable. This kind of search is available out of the box in TuringBot with its “Bound search mode” option.
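Outside of TuringBot, the acceptance criterion itself is easy to express: a candidate formula only qualifies as an upper bound if it is at least as large as the target at every data point. A minimal sketch:

```python
def is_upper_bound(f, xs, ys):
    # Accept f only if f(x) >= y holds on every data point.
    return all(f(x) >= y for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]
ys = [1.0, 3.9, 9.2, 15.8, 24.7]  # roughly x**2

print(is_upper_bound(lambda x: x**2 + 1, xs, ys))  # valid upper bound
print(is_upper_bound(lambda x: x**2 - 1, xs, ys))  # dips below the data
```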

9. Discover the most predictive variables

When creating a machine learning model, it is extremely useful to know which input variables are the most relevant in predicting the target variable.

With black-box methods like neural networks, answering this kind of question is nontrivial because all variables are used at once indiscriminately.

But with symbolic regression the situation is different: since the formulas are kept as short as possible, variables that are not predictive end up not appearing, making it trivial to spot which variables are actually predictive and relevant.

10. Explore the limits of computability

Few people are aware of this, but the notion of computability was first formalized by Alan Turing in his famous paper “On Computable Numbers, with an Application to the Entscheidungsproblem”.

Some things are easy to compute, for instance the function f(x) = x or common functions like sin(x) and exp(x) that can be converted into simple series expansions. But other things are much harder to compute, for instance, the N-th prime number.

With symbolic regression, one can try to derandomize tables of numbers and discover highly nonlinear patterns connecting variables. Since this is done in a very free way, even absurd solutions like tan(tan(tan(tan(x)))) end up being a possibility. This makes the method operate on the edge of computability.

Interested in symbolic regression? Download TuringBot and get started today.


Eureqa vs TuringBot for symbolic regression

Introduced in 2009, the Eureqa software gained great popularity with the promise that it could potentially be used to derive new physical laws from empirical data in an automatic way. Details of this reasoning can be found in the original paper, called Distilling Free-Form Natural Laws from Experimental Data.

In 2017 this software was acquired by a global consulting company called DataRobot and left the market. The promise of revolutionizing physics was never quite fulfilled, but the project had a major impact in raising awareness about symbolic regression.

Here we want to compare Eureqa to a more recent symbolic regression software called TuringBot.

About TuringBot

Similarly to Eureqa, TuringBot is a symbolic regression software. It has a simple graphical interface that allows the user to load a dataset and then try to find formulas that predict a target column taking as input the remaining columns:

The TuringBot interface.

This software was introduced in 2020, and unlike Eureqa it does not use a genetic algorithm to search for formulas, but rather a novel algorithm based on simulated annealing. While most references to symbolic regression in the literature involve genetic algorithms, our finding was that simulated annealing yields results much faster when implemented the right way.

Simulated annealing is inspired by a metallurgic process in which a metal is heated to a high temperature and then slowly cooled to attain better physical properties. The algorithm starts at first very “hot”, with worse solutions being accepted very often, and over time it cools down and becomes more strict about the solutions that it passes by. This allows the algorithm to overcome local maxima and discover the global maximum in a stochastic way.
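To make the idea concrete, here is a toy simulated annealing loop minimizing a bumpy one-dimensional function. The real algorithm proposes changes to formulas rather than to a number, but the accept/cool structure is the same:

```python
import math
import random

def anneal(f, x0, temp=5.0, cooling=0.999, steps=20000, seed=0):
    """Toy simulated annealing: minimize f starting from x0."""
    rng = random.Random(seed)
    x, best = x0, x0
    for _ in range(steps):
        candidate = x + rng.gauss(0, 1)       # random perturbation
        delta = f(candidate) - f(x)
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature cools.
        if delta < 0 or rng.random() < math.exp(-delta / temp):
            x = candidate
        if f(x) < f(best):
            best = x
        temp *= cooling                       # cool down
    return best

# A bumpy function with local minima and a global minimum near x = -0.3.
f = lambda x: x**2 + 3 * math.sin(5 * x)
print(round(anneal(f, x0=8.0), 2))
```

Because worse moves are sometimes accepted early on, the walk can escape the local dips of the sine term instead of getting stuck in the first one it finds.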

Pareto optimization

Both TuringBot and Eureqa implement the idea of searching for the best formulas of each possible size, rather than just a single optimal formula. This is the essence of Pareto optimization, and it results in a list of formulas of increasing complexity and accuracy to choose from.

A list of formulas of increasing complexity discovered by TuringBot.
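The Pareto filtering itself is simple to state: a formula of a given complexity is kept only if no formula of equal or lower complexity achieves an error as low. A sketch with hypothetical (complexity, error) pairs:

```python
def pareto_front(candidates):
    """Keep only candidates not dominated by a simpler, more accurate
    alternative. Input: list of (complexity, error) pairs."""
    front = []
    best_error = float("inf")
    for complexity, error in sorted(candidates):
        if error < best_error:   # strictly better than all simpler ones
            front.append((complexity, error))
            best_error = error
    return front

# Hypothetical formulas of increasing size and their errors.
candidates = [(1, 3.2), (3, 1.5), (5, 1.6), (7, 0.9), (9, 0.9)]
print(pareto_front(candidates))  # → [(1, 3.2), (3, 1.5), (7, 0.9)]
```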

A handy feature offered by TuringBot is to create a train/test split for the optimization and see in real-time the test error for the solutions discovered so far. This allows overfit solutions to be spotted very easily.

Availability

TuringBot is available for both Windows and Linux. It can be downloaded for free, but it also has a paid plan with more functionalities.

The software is already being used by many researchers and engineers around the world to study topics including turbine design, materials science and zoology, and also by business owners to come up with pricing models and other applications.

You might also like our article on Symbolic Regression featured on Towards Data Science: Symbolic Regression: The Forgotten Machine Learning Method.


Decision boundary discovery with symbolic regression

An interesting classification problem is trying to find a decision boundary that separates two categories of points. For instance, consider the following cloud of points:

Clearly, we could hand draw a line that separates the two colors. But can this problem be solved in an automatic way?

Several machine learning methods could be used for this, including for instance a Support Vector Machine or AdaBoost. What all of these methods have in common is that they perform complex calculations under the hood and spit out some number; that is, they are black boxes. An interesting comparison of several of these methods can be found here.

A simpler and more elegant alternative is to try to find an explicit mathematical formula that separates the two categories. Not only would this be easier to compute, but it would also offer some insight into the data. This is where symbolic regression comes in.

Symbolic regression

The way to solve this problem with symbolic regression is to look for a formula that returns 0 for points of one category and 1 for points of another. That is, a formula for classification = f(x, y).

We can look for that formula by generating a CSV file with our points and loading it into TuringBot. Then we can run the optimization with classification accuracy as the search metric.

If we do that, the program ends up finding a simple formula with an accuracy of 100%:

classification = ceil(-1*tanh(round(x*y-cos((-2)*(y-x)))))

To visualize the decision boundary associated with this formula, we can generate some random points and keep track of the ones classified as orange. Then we can find the alpha shape that encompasses those points, which will be the decision boundary:

import alphashape
from descartes import PolygonPatch
import matplotlib.pyplot as plt
import numpy as np
from math import *

def f(x, y):
    return ceil(-1*tanh(round(x*y-cos((-2)*(y-x)))))

# Sample random points in [-1, 1] x [-1, 1] and keep the ones
# that the formula classifies as 1 (orange).
pts = []
for i in range(10000):
    x = np.random.random()*2-1
    y = np.random.random()*2-1
    if f(x, y) == 1:
        pts.append([x, y])
pts = np.array(pts)

# The alpha shape of these points traces the decision boundary.
alpha_shape = alphashape.alphashape(pts, 2.)

fig, ax = plt.subplots()
ax.add_patch(PolygonPatch(alpha_shape, alpha=0.2, fc='#ddd', zorder=100))
ax.set_xlim(-1, 1)
ax.set_ylim(-1, 1)
plt.show()

And this is the result:

It is worth noting that even though this was a 2D problem, the same procedure could have been carried out for a classification problem in any number of dimensions.


A regression model example and how to generate it

Regression models are perhaps the most important class of machine learning models. In this tutorial, we will show how to easily generate a regression model from data values.

What is regression

The goal of a regression model is to be able to predict a target variable taking as input one or more input variables. The simplest case is that of a linear relationship between the variables, in which case basic methods such as least squares regression can be used.

In real-world datasets, the relationship between the variables is often highly non-linear. This motivates the use of more sophisticated machine learning techniques to solve the regression problems, including for instance neural networks and random forests.

A regression problem example is to predict the value of a house from its characteristics (location, number of bedrooms, total area, etc), using for that information from other houses which are not identical to it but for which the prices are known.

Regression model example

To give a concrete example, let’s consider the following public dataset of house prices: x26.txt. This file contains a long and uncommented header; a stripped-down version that is compatible with TuringBot can be found here: house_prices.txt. The columns present are the following:

Index;
Local selling prices, in hundreds of dollars;
Number of bathrooms;
Area of the site in thousands of square feet;
Size of the living space in thousands of square feet;
Number of garages;
Number of rooms;
Number of bedrooms;
Age in years;
Construction type (1=brick, 2=brick/wood, 3=aluminum/wood, 4=wood);
Number of fire places;
Selling price.

The goal is to predict the last column, the selling price, as a function of all the other variables. In order to do that, we are going to use a technique called symbolic regression, which attempts to find explicit mathematical formulas that connect the input variables to the target variable.

We will use the desktop software TuringBot, which can be downloaded for free, to find that regression model. The usage is quite straightforward: you load the input file through the interface, select which variable is the target and which variables should be used as input, and then start the search. This is what its interface looks like with the data loaded in:

The TuringBot interface.

We have also enabled the cross validation feature with a 50/50 test/train split (see the “Search options” menu in the image above). This will allow us to easily discard overfit formulas.

After running the optimization for a few minutes, the formulas found by the program and their corresponding out-of-sample errors were the following:

The regression models found for the house prices.

The highlighted one turned out to be the best — more complex solutions did not offer increased out-of-sample accuracy. Its mean relative error on the test dataset was roughly 8%. Here is that formula:

price = fire_place+15.5668+(1.66153+bathrooms)*local_pric

The variables that are present in it are only three: the number of bathrooms, the number of fire places and the local price. It is a completely non-trivial fact that the house price should only depend on these three parameters, but the symbolic regression optimization made this fact evident.
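For reference, the model above translates directly into a Python function. Note that the name local_price is our expansion of the truncated local_pric shown in the formula display, and the input values below are made up for illustration:

```python
def predicted_price(fire_places, bathrooms, local_price):
    # The regression model found above. "local_price" spells out the
    # truncated "local_pric" in the formula display (our assumption).
    return fire_places + 15.5668 + (1.66153 + bathrooms) * local_price

# Hypothetical house: 1 fire place, 2 bathrooms, local price index 45.
print(round(predicted_price(1, 2, 45), 2))  # → 181.34
```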

Conclusion

In this tutorial, we have seen an example of generating a regression model. The technique that we used was symbolic regression, implemented in the desktop software TuringBot. The model that was found had a good out-of-sample accuracy in predicting the prices of houses based on their characteristics, and it allowed us to clearly see the most relevant variables in estimating that price.


How to find a formula for the nth term of a sequence

Given a sequence of numbers, finding an explicit mathematical formula that computes the nth term of the sequence can be challenging, except in very special cases like arithmetic and geometric sequences.

In the general case, this task involves searching over the space of all mathematical formulas for the most appropriate one. A special technique exists that does just that: symbolic regression. Here we will introduce how it works, and use it to find a formula for the nth term in the Fibonacci sequence (A000045 in the OEIS) as an example.

What symbolic regression is

Regression is the task of establishing a relationship between an output variable and one or more input variables. Symbolic regression solves this task by searching over the space of all possible mathematical formulas for the ones with the greatest accuracy, while trying to keep those formulas as simple as possible.

The technique starts from a set of base functions — for instance, sin(x), exp(x), addition, multiplication, etc. Then it tries to combine those base functions in various ways using an optimization algorithm, keeping track of the most accurate ones found so far.

An important point in symbolic regression is simplicity. It is easy to find a polynomial that will fit any sequence of numbers with perfect accuracy, but that does not really tell you anything, since the number of free parameters in the model is the same as the number of data points. For this reason, a symbolic regression procedure will discard a larger formula if it finds a smaller one that performs just as well.
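This pitfall is easy to demonstrate with numpy: a degree-9 polynomial has 10 free parameters, so it passes through any 10 points, even pure noise:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = rng.random(10)               # pure noise, no underlying pattern

# A degree-9 polynomial has 10 free parameters, so it passes
# through all 10 points despite there being nothing to learn.
coeffs = np.polyfit(x, y, 9)
fitted = np.polyval(coeffs, x)

print(np.max(np.abs(fitted - y)))  # tiny residual: a "perfect" fit
```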

Finding the nth Fibonacci term

Now let’s show how symbolic regression can be used in practice by trying to find a formula for the Fibonacci sequence using the desktop symbolic regression software TuringBot. The first two terms of the sequence are 1 and 1, and every next term is defined as the sum of the previous two terms. Its first terms are the following, where the first column is the index:

1 1
2 1
3 2
4 3
5 5
6 8
7 13
8 21
9 34
10 55

A list of the first 30 terms can be found in this file: fibonacci.txt.

TuringBot takes as input TXT or CSV files with one variable per column and efficiently finds formulas that connect those variables. This is what it looks like after we load fibonacci.txt and run the optimization:

Finding a formula for the nth Fibonacci term with TuringBot.

The software finds not only a single formula, but the best formulas of all possible complexities. A larger formula is only shown if it performs better than all simpler alternatives. In this case, the last formula turned out to predict with perfect accuracy every single one of the first 30 Fibonacci terms. The formula is the following:

f(x) = floor(cosh(-0.111572+0.481212*x))

Clearly a very elegant solution. The same procedure can be used to find a formula for the nth term of any other sequence (if it exists).
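We can verify the formula against the recurrence directly. The constants are displayed with six decimal digits, so we check the first 25 terms here; the full-precision constants found by the program reproduce all 30, as noted above:

```python
from math import cosh, floor

def fib_formula(n):
    # The closed-form solution found above, with the constants as
    # displayed (six decimal digits).
    return floor(cosh(-0.111572 + 0.481212 * n))

# Check against the recurrence: with the truncated constants the
# match is exact for the first 25 terms.
a, b = 1, 1
for n in range(1, 26):
    assert fib_formula(n) == a
    a, b = b, a + b

print(fib_formula(10))  # → 55
```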

Conclusion

In this tutorial, we have seen how the symbolic regression software TuringBot can be used to find a closed-form expression for the nth term in a sequence of numbers. We found a very short formula for the Fibonacci sequence by simply writing it into a text file with one number per row and loading this file into the software.

If you are interested in trying TuringBot on your own data, you can download it from the official website. It is available for both Windows and Linux.


A machine learning software for data science

Data science is becoming more and more widespread, pushed by companies that are finding that very valuable and actionable information can be extracted from their databases.

It can be challenging to develop useful models from raw data. Here we will introduce a tool that makes it very easy to develop state of the art models from any dataset.

What is TuringBot

TuringBot is a desktop machine learning software. It runs on both Windows and Linux, and what it does is generate models that predict some target variable taking as input one or more input variables. It does that through a technique called symbolic regression. This is what its interface looks like:

TuringBot’s interface.

The idea of symbolic regression is to search over the space of all possible mathematical formulas for the ones that best connect the input variables to the target variable, while trying to keep those formulas as simple as possible. The target variable can be anything: for instance, it can represent different categorical variables as different integer numbers, allowing the program to solve classification problems, or it can be a regular continuous variable.

Machine learning with TuringBot

The usage of TuringBot is very straightforward. All you have to do is save your data in CSV or TXT format, with one variable per column, and load this input file through the program’s interface.

Once the data is loaded, you can select the target variable and which variables should be used as input, as well as the search metric, and then start the search. Several search metrics are available, including RMS error, mean error and classification accuracy. A list of formulas encountered so far will be shown in real time, ordered by complexity. Those formulas can be easily exported as Python, C or text from the interface:

Some solutions found by TuringBot. They can readily be exported to common programming languages.

Most machine learning methods are black boxes, which carry out complex computations under the hood before giving a result. This is how neural networks and random forests work, for instance. A great advantage of TuringBot over these methods is that the models that it generates are very explicit, allowing some understanding to be gained into the data. This turns data science into something much more similar to natural science and its search for mathematical laws that explain the world.

How to get the software

If you are interested in trying TuringBot on your own data, you can download it for free from the official website. There you can also find the official documentation, with detailed information about all the features and parameters of the software. Many engineers and data scientists are already making use of the software to find hidden patterns in their data.


Symbolic regression tutorial with TuringBot

In this tutorial, we are going to show how you can find a formula from your data using the symbolic regression software TuringBot. It is a desktop software that runs on both Windows and Linux, and as you will see the usage is very simple.

Preparing the data

TuringBot takes as input files in .txt or CSV format containing one variable per column. The first row may contain the names of the variables; otherwise, they will be labelled col1, col2, col3, etc.

For instance, the following is a valid input file:

x y z w classification
5.20 2.70 3.90 1.40 1
6.50 2.80 4.60 1.50 1
7.70 2.80 6.70 2.00 2
5.90 3.20 4.80 1.80 1
5.00 3.50 1.60 0.60 0
5.10 3.50 1.40 0.20 0
4.60 3.10 1.50 0.20 0
6.90 3.20 5.70 2.30 2

Loading the data into TuringBot

This is what the program looks like when you open it:

The TuringBot interface.

By clicking on the “Input file” button on the upper left, you can select your input file and load it. Different search metrics are available, including for instance classification accuracy, and a handy cross validation feature can also be enabled in the “Search options” box. If enabled, it will automatically create a test/train split and let you track the out-of-sample error as the optimization goes on. In this example, however, we are going to keep things simple and just use the defaults.

Finding the formulas

After loading the data, you can click on the play button at the top of the interface to start the optimization. The best formulas found so far will be shown in the “Solutions” box, in ascending order of complexity. A formula is only shown if its accuracy is greater than that of all simpler alternatives — in symbolic regression, the goal is not simply to find a formula, but to find the simplest ones possible.

Here are the formulas it found for an example dataset:

Finding formulas with TuringBot.

The formulas are all written in a format that is compatible out of the box with Python and C. Indeed, the menu on the upper right allows you to export the solutions to these languages:

Exporting solutions to different languages.

In this example, the true formula turned out to be sqrt(x), which was recovered in a few seconds. The methodology would be the same for a real-world dataset with many input variables and an unknown dependency between them.

How to get TuringBot

If you have liked this tutorial, we encourage you to download TuringBot for free from the official website. As we have shown, it is very simple to use, and its powerful mathematical modelling capabilities allow you to find very subtle numerical patterns in your data, much like a scientist would from empirical observations, but automatically and millions of times faster.


How to find formulas from values

Finding mathematical formulas from data is an extremely useful machine learning task. A formula is the most compressed representation of a table, allowing large amounts of data to be compressed into something simple, while also making explicit the relationship that exists between the different variables.

In this tutorial, we are going to generate a dataset and try to recover the original formula using the symbolic regression software TuringBot, without any previous knowledge of what that formula was.

What symbolic regression is

Symbolic regression is a machine learning technique that tries to find explicit mathematical formulas that connect variables. The technique starts from a set of base functions to be used in the search, for instance, addition, multiplication, sin(x), exp(x), etc, and then tries to combine those functions in such a way that the target variable is accurately predicted.

Simplicity is as important as accuracy in a symbolic regression model. Every dataset can be represented with perfect accuracy by a polynomial, but that is uninformative, since the number of free parameters in the model is the same as the number of training data points. For this reason, a symbolic regression optimization penalizes large formulas, favoring simpler ones that perform just as well.

Generating an example dataset

Let’s give an explicit example of how symbolic regression can be used to find a formula from data. We will generate a dataset that consists of the formula x*cos(10*x) + 2, add noise to this data, and then see if we can recover this formula using symbolic regression.

The following Python script generates the input data:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 100)
y = np.cos(10*x)*x + 2 + np.random.random(len(x))*0.1

plt.scatter(x, y, s=10)
plt.show()

And this is what the result looks like:

The input data that we have generated.

Now we are going to try to find a formula for this data and see what happens.

Finding a formula using TuringBot

The usage of TuringBot is very simple. All we have to do is load the input data using its interface and start the search. First, we save the data to an input file:

arr = np.column_stack((x, y))
np.savetxt('input.txt', arr, fmt='%f')

After loading input.txt into TuringBot, starting the search, and letting it work for a minute, these were the formulas that it found, ordered by complexity:

The formulas found by TuringBot for our input dataset.

It can be seen that TuringBot has successfully recovered our original formula, x*cos(10*x) + 2, despite the noise we added.
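As a quick sanity check, we can evaluate the recovered formula on the same grid and compare it with the noisy data. Since the noise we added was uniform in [0, 0.1), the largest deviation between the formula and the data should stay below 0.1:

```python
import numpy as np

# Regenerate the noisy dataset as before
x = np.linspace(0, 1, 100)
y = np.cos(10*x)*x + 2 + np.random.random(len(x))*0.1

# Evaluate the recovered formula and measure the worst-case deviation;
# with real TuringBot output you would paste the formula it reports here
recovered = x*np.cos(10*x) + 2
residual = float(np.max(np.abs(recovered - y)))
print(residual)
```

A residual bounded by the noise amplitude confirms that the formula explains everything in the data except the noise itself.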

Conclusion

Here we have seen how symbolic regression can be used to automatically find mathematical formulas from data values. The example that we have given was a simple one, but the procedure that we have used would also work for a real-world dataset in which the dependencies between the variables are not known beforehand, and in which more than one input variable is present.

If you are interested in trying to find formulas from your own dataset, you can download TuringBot for free from the official website.


Deep learning with symbolic regression

Symbolic regression is an innovative machine learning technique that is capable of generating results similar to those of neural networks, but with a completely different approach. Here we will talk about its basic characteristics, and show how it can be used to solve deep learning problems.

What is deep learning?

The concept of deep learning has emerged in the context of artificial neural networks. A neural network which contains hidden layers is capable of pre-processing the input information and extracting non-trivial features prior to combining that input into an output value. The term “deep learning” comes from the presence of those multiple layers.

More recently, it has become common to call deep learning any machine learning technique that is capable of extracting non-trivial information from an input and using that to predict target variables in a way that is not possible for classical statistical methods.

How symbolic regression works

Despite being so common, neural networks are not the only way to extract non-trivial patterns from input data. An alternative technique, which is capable of solving the same tasks as neural networks, is called symbolic regression.

The idea of symbolic regression is to find explicit mathematical formulas that predict a target variable taking as input a set of input variables. Sophisticated algorithms have to be employed to efficiently search over the space of all mathematical formulas, which is very large. The most common approach is to use genetic algorithms for this search, but TuringBot shows that a simulated annealing optimization also gives excellent results.
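To make the simulated annealing idea concrete, here is a deliberately tiny sketch: candidates are drawn from a handful of formula strings, improvements are always accepted, and worse candidates are accepted with a probability that shrinks as the temperature cools. This is only an illustration of the acceptance rule, not TuringBot's actual algorithm, and the candidate list and schedule are made up.

```python
import math
import random

random.seed(0)

# Data we want to model: y = sin(3x) sampled on [0, 1]
xs = [i / 20 for i in range(21)]
target = [math.sin(3 * x) for x in xs]

# A toy, hand-picked formula space (a real search mutates expression trees)
CANDIDATES = ["x", "sin(3*x)", "x*x", "cos(x)", "2*x"]

def evaluate(expr, x):
    return eval(expr, {"sin": math.sin, "cos": math.cos, "x": x})

def error(expr):
    return sum((evaluate(expr, x) - t) ** 2 for x, t in zip(xs, target))

current = "x"
current_err = error(current)
temperature = 1.0
for step in range(200):
    candidate = random.choice(CANDIDATES)  # propose a random candidate
    cand_err = error(candidate)
    # Always accept improvements; accept worse formulas with probability
    # exp(-delta/T), which vanishes as the temperature cools
    if cand_err < current_err or \
            random.random() < math.exp((current_err - cand_err) / temperature):
        current, current_err = candidate, cand_err
    temperature *= 0.98  # cooling schedule

print(current, current_err)
```

The early high-temperature phase lets the search escape mediocre formulas, while the late low-temperature phase locks in the best one found, here the exact generating formula.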

The biggest difference between symbolic regression and neural networks is that the former produces explicit models. Representing a neural network often requires hundreds of weights, whereas a symbolic model might be a mathematical formula that fits on a single line. In this sense, symbolic regression can be said to be an alternative to neural networks that does not involve black boxes.

Deep learning with symbolic regression

So how can a traditional deep learning task be solved with symbolic regression? To give an example, let's try to use it to classify the famous Iris dataset, in which four features of flowers are given and the goal is to classify the species of those flowers using this data. You can find the raw dataset here: iris.txt.

After loading this dataset in the symbolic regression software TuringBot, selecting “classification accuracy” as the search metric and setting a 50/50 test/train split for the training, these were the formulas that it ended up finding, ordered by complexity in ascending order:

The results of a symbolic regression procedure applied to the Iris dataset.

The error shown is the out-of-sample error. It can be seen that the best formula turned out to be one of intermediate size, not so small that it cannot find any pattern, but also not so large that it overfits the data. Its classification accuracy in the test domain was 98%.
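For intuition on what the "classification accuracy" search metric scores, one common convention is to round a candidate formula's numeric output to the nearest integer class label and count matches. The mini-dataset and candidate formula below are made up for illustration; they are not TuringBot's actual output for Iris.

```python
# A hypothetical mini-dataset: (petal_length, petal_width, class),
# with the three species encoded as classes 0, 1, 2
rows = [
    (1.4, 0.2, 0), (1.5, 0.2, 0),
    (4.5, 1.5, 1), (4.1, 1.3, 1),
    (5.9, 2.1, 2), (5.6, 2.2, 2),
]

def formula(petal_length, petal_width):
    # Hypothetical candidate formula producing one real number per flower
    return 0.3 * petal_length + 0.4 * petal_width - 0.5

def accuracy(data):
    # Round the formula's output to the nearest class and count matches
    hits = sum(round(formula(pl, pw)) == cls for pl, pw, cls in data)
    return hits / len(data)

print(accuracy(rows))
```

Under a metric like this, the search is rewarded for formulas whose rounded output lands on the correct class for as many flowers as possible, evaluated here on the held-out test half of the data.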

If you found this example interesting, you might want to download TuringBot for free and give it a try with your own data. It can be used to solve regression and classification problems in general.

Conclusion

In this article, we have seen how symbolic regression can be used to solve problems where a non-linear relationship between the input variables exists. Despite neural networks being so common, this alternative approach is capable of finding models that perform similarly, but with the advantage of being simple and explainable.
