## 10 creative applications of symbolic regression

Symbolic regression is a method that discovers mathematical formulas from data without prior assumptions about what those formulas should look like. Given a set of input variables x1, x2, x3, etc, and a target variable y, it uses trial and error to find f such that y = f(x1, x2, x3, …).

The method is very general: the target variable y can be anything, and a variety of error metrics can be chosen for the search. Here we enumerate a few creative applications to give the reader some ideas.

All of these problems can be modeled out of the box with the TuringBot symbolic regression software.

### 1. Forecast the next values of a time series

Say you have a sequence of numbers and you want to predict the next one. This could be the monthly revenue of a company or the daily prices of a stock, for instance.

In special cases, this kind of problem can be solved by simply fitting a line to the data and extrapolating to the next point, a task that can be easily accomplished with numpy.polyfit. While this will work just fine in many cases, it will not be useful if the time series evolves in a nonlinear way.
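As a quick sketch of that baseline approach, a line can be fitted and extrapolated with numpy.polyfit like this (the revenue numbers are made up for illustration):

```python
import numpy as np

# Hypothetical monthly revenue figures
y = np.array([10.0, 12.1, 13.9, 16.2, 18.0, 20.1])
t = np.arange(1, len(y) + 1)

# Fit a degree-1 polynomial (a straight line) to the series
coeffs = np.polyfit(t, y, deg=1)

# Extrapolate to the next index to get a naive forecast
next_value = np.polyval(coeffs, len(y) + 1)
print(next_value)
```

This works only while the trend stays roughly linear, which is exactly the limitation discussed above.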

Symbolic regression offers a more general alternative. One can look for formulas of the form y = f(index), where y is the value of the series and index = 1, 2, 3, etc. A prediction can then be made by evaluating the resulting formulas at a future index.

This is not a mainstream way to go about this kind of problem, but the simplicity of the resulting models can make them much more informative than mainstream forecasting methods like Monte Carlo simulations, used for instance by Facebook’s Prophet library.

### 2. Predict binary outcomes

A machine learning problem of great practical importance is to predict whether something will happen or not. This is a central problem in options trading, gambling, and finance (“will a recession happen?”).

Numerically, this problem translates to predicting 0 or 1 based on a set of input features.

Symbolic regression allows binary problems to be solved by using classification accuracy as the error metric for the search. In order to minimize the error, the optimization will converge without supervision towards formulas that only output 0 or 1, usually involving floor/ceil/round of some bounded function like tanh(x) or cos(x).
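To illustrate with a toy dataset (the labels are hypothetical, and tanh is picked arbitrarily as the bounded function), this is how such a formula scores under classification accuracy:

```python
import numpy as np

# Hypothetical feature and binary labels: positive class when x > 0
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
labels = np.array([0, 0, 0, 1, 1, 1])

# A bounded function pushed to 0/1 with round, the kind of construct
# the optimization tends to converge to
predictions = np.round((np.tanh(3 * x) + 1) / 2)

accuracy = np.mean(predictions == labels)
print(accuracy)  # prints 1.0
```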

### 3. Predict continuous outcomes

A generalization of the problem of making a binary prediction is the problem of predicting a continuous quantity in the future.

For instance, in agriculture one could be interested in predicting the time for a crop to mature given parameters known at the time of sowing, such as soil composition, the month of the year, temperature, etc.

Usually, few data points will be available to train the model in this kind of scenario, but since symbolic models are simple, they are less prone to overfitting the data than heavily parameterized models. The problem can be modeled by running the optimization with a standard error metric like root-mean-square error or mean error.
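The search metric simply compares a candidate formula's outputs against the observed targets. A minimal root-mean-square error computation, with made-up crop-maturation times as an example:

```python
import numpy as np

def rmse(predicted, observed):
    """Root-mean-square error, the default search metric."""
    return np.sqrt(np.mean((np.asarray(predicted) - np.asarray(observed)) ** 2))

# Hypothetical crop-maturation times in days vs. a candidate formula's output
observed = [95.0, 102.0, 110.0]
predicted = [94.0, 104.0, 109.0]
print(rmse(predicted, observed))
```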

### 4. Solve classification problems

Classification problems, in general, can be solved by symbolic regression with a simple trick: representing different categorical variables as different integer numbers.

If your data points have 10 possible labels that should be predicted based on a set of input features, you can use symbolic regression to find formulas that output integers from 1 to 10 based on these features.
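The encoding step itself is trivial; a sketch with hypothetical category names:

```python
# Map categorical labels to integer codes for the symbolic regression input
labels = ["cat", "dog", "bird"]  # hypothetical categories
encoding = {name: code for code, name in enumerate(labels, start=1)}

data = ["dog", "cat", "bird", "dog"]
encoded = [encoding[d] for d in data]
print(encoded)  # prints [2, 1, 3, 2]
```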

This may sound like asking too much — a formula capable of that is highly specific. But a good symbolic regression engine will be thorough in its search over the space of all mathematical formulas and will eventually find appropriate solutions.

### 5. Classify rare events

Another interesting case of a classification problem is that of a highly imbalanced dataset, in which only a handful of rows contain the relevant label and the rest are negatives. Examples include medical diagnostic images and fraudulent credit card transactions.

For this kind of problem, the usual classification accuracy search metric is not appropriate, since f(x1, x2, x3, …) = 0 will have a very high accuracy while being a useless function.

Special search metrics exist for this kind of problem, the most popular of which is the F1 score, the harmonic mean of precision and recall. This search metric is available in TuringBot, allowing this kind of problem to be easily modeled.
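Spelled out in code, with made-up confusion-matrix counts:

```python
def f1_score(tp, fp, fn):
    """F1 score: the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives
print(f1_score(8, 2, 4))
```

Unlike plain accuracy, this score stays near zero for the useless all-negatives function, since that function has zero recall.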

### 6. Compress data

A mathematical formula is perhaps the shortest possible representation of a dataset. If the target variable features some kind of regularity, symbolic regression can turn gigabytes of data into something that can be equivalently expressed in one line.

An example target variable could be the RGB colors of an image as a function of its (x, y) pixel coordinates. We have tried finding a formula for the Mona Lisa, but unfortunately, nothing simple could be found in this case.

### 7. Interpolate data

Say you have a table of numbers and you want to compute the target variable for intermediate values not present in the table itself.

One way to go about this is to generate a spline interpolation from the table, which is a somewhat cumbersome and non-portable solution.

With symbolic regression, one can turn the entire table into a mathematical expression, and then proceed to do the interpolation without the need for specialized libraries or data structures, and also without the need to store the table itself anywhere.
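For instance, suppose symbolic regression has condensed a lookup table into the following (hypothetical) closed-form expression. Interpolating between tabulated points then becomes a plain function call, with no table or spline object to store:

```python
from math import exp

# Hypothetical formula found from a table of (x, y) values
def f(x):
    return 2.0 * exp(-0.5 * x)

# Evaluate at an intermediate x not present in the original table
print(f(1.5))
```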

### 8. Discover upper or lower bounds for a function

In problems of engineering and applied mathematics, one is often interested not in the particular value of a variable but in how fast this variable grows or how large it can be given an input. In this case, it is more informative to obtain an upper bound for the function than an approximation for the function itself.

With symbolic regression, this can be accomplished by discarding formulas that are not always larger or always smaller than the target variable. This kind of search is available out of the box in TuringBot with its “Bound search mode” option.
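The acceptance test behind such a search can be sketched in a few lines: a candidate formula survives only if it never dips below the target variable (the target and candidate here are illustrative):

```python
import numpy as np

x = np.linspace(0, 5, 100)
y = np.sin(x)  # target variable (illustrative)

def candidate(v):
    # Hypothetical candidate formula proposed by the search
    return v + 1.0

# Keep the candidate only if it is an upper bound everywhere in the data
is_upper_bound = np.all(candidate(x) >= y)
print(is_upper_bound)  # prints True
```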

### 9. Discover the most predictive variables

When creating a machine learning model, it is extremely useful to know which input variables are the most relevant in predicting the target variable.

With black-box methods like neural networks, answering this kind of question is nontrivial because all variables are used at once indiscriminately.

But with symbolic regression the situation is different: since the formulas are kept as short as possible, variables that are not predictive end up not appearing, making it trivial to spot which variables are actually predictive and relevant.

### 10. Explore the limits of computability

Few people are aware of this, but the notion of computability was first introduced by Alan Turing himself in his famous paper “On Computable Numbers, with an Application to the Entscheidungsproblem”.

Some things are easy to compute, for instance the function f(x) = x or common functions like sin(x) and exp(x) that can be converted into simple series expansions. But other things are much harder to compute, for instance, the N-th prime number.

With symbolic regression, one can try to derandomize tables of numbers and discover highly nonlinear patterns connecting variables. Since this is done in a very free way, even absurd solutions like tan(tan(tan(tan(x)))) end up being a possibility. This makes the method operate on the edge of computability.

## Neural networks are overrated

When it comes to AI, neural networks are the first method that comes to mind. Despite their impressive performance on a number of applications, we want to argue that they are not necessarily a good general-purpose machine learning method.

### Neural network basics

Neural networks are powerful computation devices. Their basic design is the following:

Each circle is a perceptron, and the perceptrons are organized in layers. What a perceptron does is:

1. Take a number as input.
2. Add a bias to it (a fixed number).
3. Apply an activation function to the result (for instance tanh or sigmoid).
4. Send the result either to the perceptrons of the next layer, or to the output if that was the last layer.

When a perceptron takes as input several numbers from a past layer, each of those numbers is multiplied by a weight (usually between -1 and 1) which characterizes the strength of the connection between those two perceptrons. The numbers are then added together and go through the steps 1-4 outlined above.

To sum up in a visual way, this is what a perceptron does:
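In code, the same computation can be sketched as follows, for a single perceptron with two inputs and a tanh activation:

```python
import math

def perceptron(inputs, weights, bias):
    """Weighted sum of the inputs, plus a bias, through a tanh activation."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return math.tanh(total)

# Two inputs with example weights and bias
print(perceptron([0.5, -0.2], [0.8, 0.3], 0.1))
```

A full network just wires many of these together, layer by layer, feeding each layer's outputs into the next.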

### How many hidden layers?

A natural question when it comes to creating a neural network model is: how many hidden layers should be used, and with how many perceptrons each?

A common rule of thumb is that the number of perceptrons in a layer should be between the number in the previous layer and the number in the next one.

Regarding the number of hidden layers, author Jeff Heaton writes in his book Introduction to Neural Networks for Java:

> It can be seen that, with just a single hidden layer, any continuous function on a bounded interval can be represented, and that with two hidden layers any map can be represented. So it is never really necessary to have 3 or more hidden layers.

### An embarrassing example

With everything that we have seen so far, neural networks seem like a very elegant method with promising computation capabilities. But how well do they perform?

To give a concrete example, consider the following function on the interval [0, 1]:

This function was hand drawn and converted to numbers using WebPlotDigitizer, so it is a simple but nontrivial example.

What happens if we try to fit this function with a neural network regressor?

The following script trains a neural network with one hidden layer containing 100 perceptrons using the scikit-learn Python library:

```
import numpy as np
import pandas as pd
from sklearn.neural_network import MLPRegressor

# Load the digitized (x, y) points; 'data.csv' is a placeholder filename
df = pd.read_csv('data.csv')

X = np.array(df['x']).reshape(-1, 1)
y = df['y']

nn = MLPRegressor(random_state=1, max_iter=500, hidden_layer_sizes=(100,)).fit(X, y)

prediction = nn.predict(X)
```

And this is what the resulting model looks like:

Clearly, this is not a good fit.

What if we add one more hidden layer? For instance, (100, 50) instead of just one (100,) hidden layer like we did before:

`nn = MLPRegressor(random_state=1, max_iter=500, hidden_layer_sizes=(100, 50)).fit(X, y)`

This is the result:

Not much improvement. Bear in mind that the model visualized above has tens of thousands of free parameters (weights and biases), but it still performed poorly.

### Alternatives to neural networks

Now you might think that we have just picked a particularly hard example that will not be properly represented by any typical machine learning method. To show that this is not the case, consider the following alternative model, obtained through symbolic regression using the desktop software TuringBot:

This model consists of the following simple formula:

```
from math import log

def f(x):
    return log(0.0192917 + x) * 2.88451 * x * x + 0.797118
```

Despite being simple and not containing tens of thousands of parameters, this model managed to represent our nontrivial function with great accuracy.

### Conclusion

Our goal in this article was to question the notion that neural networks are “the” machine learning method, and that they possess some kind of magical machine learning capability that allows them to find hidden patterns everywhere.

It may well be that, for most typical applications of machine learning, neural networks actually underperform simpler alternative methods.

## How to create an equation for data points?

In order to find an equation from a list of values, a special technique called symbolic regression must be used. The idea is to search over the space of all possible mathematical formulas for the ones with the greatest accuracy, while trying to keep those formulas as simple as possible.

In this tutorial, we are going to show how to find formulas using the desktop symbolic regression software TuringBot, which is very easy to use.

### How symbolic regression works

Symbolic regression starts from a set of base functions to be used in the search, such as addition, multiplication, sin(x), exp(x), etc, and then tries to combine those functions in all possible ways with the goal of finding a model that will be as accurate as possible in predicting a target variable. Some examples of base functions used by TuringBot are the following:

As important as the accuracy of a formula is its simplicity. A huge formula can predict with perfect accuracy the data points, but if the number of free parameters in the model is the same as the number of points then this model is not really informative. For this reason, a symbolic regression optimization will discard a larger formula if it finds a smaller one that performs just as well.
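This accuracy-versus-simplicity criterion can be sketched as keeping, for each complexity level, only formulas that no simpler formula matches or beats. A toy version with made-up (complexity, error) pairs:

```python
# Candidate formulas as (complexity, error) pairs (illustrative numbers)
candidates = [(3, 1.46), (5, 1.18), (9, 1.21), (13, 0.99)]

# Keep only candidates that improve on every simpler formula
pareto = []
best_error = float("inf")
for complexity, error in sorted(candidates):
    if error < best_error:
        pareto.append((complexity, error))
        best_error = error

print(pareto)  # prints [(3, 1.46), (5, 1.18), (13, 0.99)]
```

The formula of complexity 9 is dropped because the simpler formula of complexity 5 already performs better.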

### Finding a formula with TuringBot

Finding equations from data points with TuringBot is a simple process. The first step is selecting the input file with the data through the interface. This input file should be in TXT or CSV format. After it has been loaded, the target variable can be selected (by default it will be the last column in the file), and the search can be started. This is what the interface looks like:

Several options are available on the menus on the left, such as setting a test/train split to be able to detect overfit solutions, selecting which base functions should be used, and selecting the search metric, which by default is root-mean-square error, but it can also be set to classification accuracy, mean relative error, and others. For this example, we are going to keep it simple and just use the defaults.

The optimization is started by clicking on the play button at the top of the interface. The best formulas found so far will be shown in the solutions box, ordered by complexity:

The software allows the solutions to be exported to common programming languages from the menu, and also to simply be exported as text. Here are the formulas in the example above exported in text format:

```
Complexity   Error      Function
1            1.91399    -0.0967549
3            1.46283    0.384409*x
4            1.362      atan(x)
5            1.18186    0.546317*x-1.00748
6            1.11019    asinh(x)-0.881587
9            1.0365     ceil(asinh(x))-1.4131
13           0.985787   round(tan(floor(0.277692*x)))
15           0.319857   cos(x)*(1.96036-x)*tan(x)
19           0.311375   cos(x)*(1.98862-1.02261*x)*tan(1.00118*x)
```
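These exported expressions can be dropped straight into code. For instance, the complexity-15 solution from the table above, written as a Python function:

```python
from math import cos, tan

def f(x):
    # Complexity-15 solution from the exported table
    return cos(x) * (1.96036 - x) * tan(x)

print(f(1.0))
```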

### Conclusion

In this tutorial, we have seen how symbolic regression can be used to find formulas from values. Symbolic regression is very different from regular curve-fitting methods, since no assumption is made about what the shape of the formulas should be. This allows patterns to be found in datasets with an arbitrary number of dimensions, making symbolic regression a general purpose machine learning technique.

## A regression model example and how to generate it

Regression models are perhaps the most important class of machine learning models. In this tutorial, we will show how to easily generate a regression model from data values.

### What regression is

The goal of a regression model is to be able to predict a target variable taking as input one or more input variables. The simplest case is that of a linear relationship between the variables, in which case basic methods such as least squares regression can be used.

In real-world datasets, the relationship between the variables is often highly non-linear. This motivates the use of more sophisticated machine learning techniques to solve regression problems, including for instance neural networks and random forests.

A regression problem example is to predict the value of a house from its characteristics (location, number of bedrooms, total area, etc), using for that information from other houses which are not identical to it but for which the prices are known.

### Regression model example

To give a concrete example, let’s consider the following public dataset of house prices: x26.txt. This file contains a long, uncommented header; a stripped-down version that is compatible with TuringBot can be found here: house_prices.txt. The columns present in it are the following:

```
Index;
Local selling prices, in hundreds of dollars;
Number of bathrooms;
Area of the site in thousands of square feet;
Size of the living space in thousands of square feet;
Number of garages;
Number of rooms;
Number of bedrooms;
Age in years;
Construction type (1=brick, 2=brick/wood, 3=aluminum/wood, 4=wood);
Number of fire places;
Selling price.
```

The goal is to predict the last column, the selling price, as a function of all the other variables. In order to do that, we are going to use a technique called symbolic regression, which attempts to find explicit mathematical formulas that connect the input variables to the target variable.

We will use the desktop software TuringBot, which can be downloaded for free, to find that regression model. The usage is quite straightforward: you load the input file through the interface, select which variable is the target and which variables should be used as input, and then start the search. This is what its interface looks like with the data loaded in:

We have also enabled the cross validation feature with a 50/50 test/train split (see the “Search options” menu in the image above). This will allow us to easily discard overfit formulas.

After running the optimization for a few minutes, the formulas found by the program and their corresponding out-of-sample errors were the following:

The highlighted one turned out to be the best: more complex solutions did not offer increased out-of-sample accuracy. Its mean relative error in the test dataset was roughly 8%. Here is that formula:

`price = fire_place+15.5668+(1.66153+bathrooms)*local_pric`

Only three variables are present in it: the number of bathrooms, the number of fire places, and the local price. It is a completely non-trivial fact that the house price should depend only on these three parameters, but the symbolic regression optimization made this fact evident.
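As a sketch, the formula can be used directly as a pricing function. The variable names below are illustrative, following the dataset's column descriptions, and the inputs use the dataset's units (prices in hundreds of dollars):

```python
def predicted_price(fire_places, bathrooms, local_price):
    # The formula found by the symbolic regression run above
    return fire_places + 15.5668 + (1.66153 + bathrooms) * local_price

# Hypothetical house: 1 fire place, 2 bathrooms, local price of 40
print(predicted_price(1, 2, 40.0))
```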

### Conclusion

In this tutorial, we have seen an example of generating a regression model. The technique that we used was symbolic regression, implemented in the desktop software TuringBot. The model that was found had a good out-of-sample accuracy in predicting the prices of houses based on their characteristics, and it allowed us to clearly see the most relevant variables in estimating that price.