10 creative applications of symbolic regression

Symbolic regression is a method that discovers mathematical formulas from data without prior assumptions about what those formulas should look like. Given a set of input variables x1, x2, x3, etc, and a target variable y, it will use trial and error to find f such that y = f(x1, x2, x3, …).

The method is very general: the target variable y can be anything, and a variety of error metrics can be chosen for the search. Here we enumerate a few creative applications to give the reader some ideas.

All of these problems can be modeled out of the box with the TuringBot symbolic regression software.

1. Forecast the next values of a time series

Say you have a sequence of numbers and you want to predict the next one. This could be the monthly revenue of a company or the daily prices of a stock, for instance.

In simple cases, this kind of problem can be solved by fitting a line to the data and extrapolating to the next point, a task that can be easily accomplished with numpy.polyfit. While this works well when the trend is roughly linear, it will not be useful if the time series evolves in a nonlinear way.
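As a reference point, here is a minimal sketch of that linear baseline in Python, using made-up numbers for the series:

```python
import numpy as np

# Hypothetical monthly revenue figures, used only for illustration
y = np.array([10.2, 11.1, 11.9, 13.0, 13.8, 15.1])
index = np.arange(1, len(y) + 1)

# Fit a straight line (degree-1 polynomial) and extrapolate one step ahead
slope, intercept = np.polyfit(index, y, 1)
next_value = slope * (len(y) + 1) + intercept
print(next_value)
```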

Symbolic regression offers a more general alternative. One can look for formulas y = f(index), where y is the value of the series and index = 1, 2, 3, etc. A prediction can then be made by evaluating the resulting formulas at a future index.

This is not a mainstream way to approach this kind of problem, but the simplicity of the resulting models can make them much more informative than the output of dedicated forecasting tools such as Facebook’s Prophet library.

2. Predict binary outcomes

A machine learning problem of great practical importance is to predict whether something will happen or not. This is a central problem in options trading, gambling, and finance (“will a recession happen?”).

Numerically, this problem translates to predicting 0 or 1 based on a set of input features.

Symbolic regression allows binary problems to be solved by using classification accuracy as the error metric for the search. In order to minimize the error, the optimization will converge without supervision towards formulas that only output 0 or 1, usually involving floor/ceil/round of some bounded function like tanh(x) or cos(x).
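As an illustration, a formula of this shape (with made-up coefficients, not the output of an actual search) could look like the following:

```python
import math

def predict(x1, x2):
    # A bounded function squashed into [0, 1] and then rounded,
    # so the output is always exactly 0 or 1 (coefficients are made up)
    return round(0.5 + 0.5 * math.tanh(3.2 * x1 - 1.5 * x2))

print(predict(1.0, 0.2), predict(-1.0, 0.8))
```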

3. Predict continuous outcomes

A generalization of the problem of making a binary prediction is the problem of predicting a continuous quantity in the future.

For instance, in agriculture one could be interested in predicting the time for a crop to mature given parameters known at the time of sowing, such as soil composition, the month of the year, temperature, etc.

Usually, few data points will be available to train the model in this kind of scenario, but since symbolic models are simple, they are less prone to overfitting the data than more complex models. The problem can be modeled by running the optimization with a standard error metric like root-mean-square error or mean error.
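For reference, here is a minimal sketch of those two metrics, assuming y_true holds the observed values and y_pred the formula’s predictions, with made-up numbers:

```python
import numpy as np

# Hypothetical observed maturation times and a formula's predictions
y_true = np.array([92.0, 105.0, 88.0, 110.0])
y_pred = np.array([95.0, 101.0, 90.0, 108.0])

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))    # root-mean-square error
mean_abs_error = np.mean(np.abs(y_true - y_pred))  # mean (absolute) error
print(rmse, mean_abs_error)
```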

4. Solve classification problems

Classification problems, in general, can be solved by symbolic regression with a simple trick: representing each category as a different integer number.

If your data points have 10 possible labels that should be predicted based on a set of input features, you can use symbolic regression to find formulas that output integers from 1 to 10 based on these features.

This may sound like asking too much — a formula capable of that is highly specific. But a good symbolic regression engine will be thorough in its search over the space of all mathematical formulas and will eventually find appropriate solutions.
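A minimal sketch of this encoding, with hypothetical label names, could look like this:

```python
# Hypothetical label names, each mapped to a different integer
labels = ["setosa", "versicolor", "virginica"]
label_to_int = {name: i + 1 for i, name in enumerate(labels)}

def formula_to_label(raw_output):
    # Round the formula's continuous output and clip it to the valid range 1..len(labels)
    return min(max(round(raw_output), 1), len(labels))

print(label_to_int, formula_to_label(2.7))
```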

5. Classify rare events

Another interesting type of classification problem involves highly imbalanced datasets, in which only a handful of rows contain the relevant label and the rest are negatives. Examples include rare findings in medical diagnostic images or fraudulent credit card transactions.

For this kind of problem, the usual classification accuracy search metric is not appropriate, since f(x1, x2, x3, …) = 0 will have a very high accuracy while being a useless function.

Special search metrics exist for this kind of problem, the most popular of which is the F1 score, the harmonic mean of precision and recall. This search metric is available in TuringBot, allowing this kind of problem to be easily modeled.
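For reference, here is a minimal sketch of how the F1 score can be computed from 0/1 arrays of true and predicted labels:

```python
import numpy as np

def f1_score(y_true, y_pred):
    # y_true and y_pred are 0/1 arrays, with 1 marking the rare class
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0

print(f1_score(np.array([0, 0, 1, 1, 0]), np.array([0, 1, 1, 0, 0])))
```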

6. Compress data

A mathematical formula is perhaps the shortest possible representation of a dataset. If the target variable features some kind of regularity, symbolic regression can turn gigabytes of data into something that can be equivalently expressed in one line.

An example of target variable could be the RGB colors of an image as a function of its (x, y) pixel coordinates. We have tried finding a formula for the Mona Lisa, but unfortunately, nothing simple could be found in this case.
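A minimal sketch of how such a dataset can be prepared, with a random array standing in for a real image:

```python
import numpy as np

# A small random array stands in for a real grayscale image
image = np.random.rand(32, 32)

# One row per pixel: (x, y) as inputs, the pixel value as target
rows = [(x, y, image[y, x]) for y in range(image.shape[0]) for x in range(image.shape[1])]
np.savetxt("pixels.csv", rows, delimiter=",", header="x,y,value", comments="")
```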

7. Interpolate data

Say you have a table of numbers and you want to compute the target variable for intermediate values not present in the table itself.

One way to go about this is to generate a spline interpolation from the table, which is a somewhat cumbersome and non-portable solution.

With symbolic regression, one can turn the entire table into a mathematical expression, and then proceed to do the interpolation without the need for specialized libraries or data structures, and also without the need to store the table itself anywhere.
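Here is a minimal sketch contrasting the two approaches on a small table, assuming scipy is available for the spline and assuming x**2 is the formula found by the search:

```python
import numpy as np
from scipy.interpolate import CubicSpline

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 4.0, 9.0, 16.0, 25.0])  # a table that happens to follow x**2

spline = CubicSpline(x, y)  # requires scipy and keeping the table around
print(spline(2.5))

def formula(x):
    # The closed-form expression found by the search: portable, no library needed
    return x ** 2

print(formula(2.5))
```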

8. Discover upper or lower bounds for a function

In problems of engineering and applied mathematics, one is often interested not in the particular value of a variable but in how fast this variable grows or how large it can be given an input. In this case, it is more informative to obtain an upper bound for the function than an approximation for the function itself.

With symbolic regression, this can be accomplished by discarding formulas that are not always larger or always smaller than the target variable. This kind of search is available out of the box in TuringBot with its “Bound search mode” option.
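Conceptually, the rejection rule for an upper bound amounts to something like this sketch, where candidate and target are arrays of values:

```python
import numpy as np

def is_upper_bound(candidate, target):
    # candidate and target hold the formula's values and the data's values;
    # a candidate is kept as an upper bound only if it is never below the target
    return bool(np.all(candidate >= target))

print(is_upper_bound(np.array([3.0, 5.0, 7.0]), np.array([2.5, 4.0, 7.0])))
```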

9. Discover the most predictive variables

When creating a machine learning model, it is extremely useful to know which input variables are the most relevant in predicting the target variable.

With black-box methods like neural networks, answering this kind of question is nontrivial because all variables are used at once indiscriminately.

But with symbolic regression the situation is different: since the formulas are kept as short as possible, variables that are not predictive end up not appearing, making it trivial to spot which variables are actually predictive and relevant.

10. Explore the limits of computability

Few people are aware of this, but the notion of computability was first formalized by Alan Turing in his famous paper “On Computable Numbers, with an Application to the Entscheidungsproblem”.

Some things are easy to compute, for instance the function f(x) = x or common functions like sin(x) and exp(x) that can be converted into simple series expansions. But other things are much harder to compute, for instance, the N-th prime number.

With symbolic regression, one can try to de-randomize tables of numbers, that is, to discover highly nonlinear patterns connecting variables. Since this search is carried out in a very free way, even absurd solutions like tan(tan(tan(tan(x)))) end up being a possibility. This makes the method operate on the edge of computability.

Interested in symbolic regression? Download TuringBot and get started today.

Neural networks: what are the alternatives?


In this article, we will see some alternatives to neural networks that can be used to solve the same types of machine learning tasks that they do.

What neural networks are

Neural networks are by far the most popular machine learning method. They are capable of automatically learning hidden features from input data prior to computing an output value, and established algorithms exist for finding the optimal internal parameters (weights and biases) based on a training dataset.

The basic architecture is the following. The building blocks are perceptrons, which take values as input, calculate a weighted sum of those values, and apply a non-linear activation function to the result. The result is then either fed into the perceptrons of the next layer, or it becomes the network’s output if that was the last layer.

The basic architecture of a neural network. Blue circles are perceptrons.
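A minimal numpy sketch of this forward pass, with made-up layer sizes and random weights, could look like the following:

```python
import numpy as np

def layer(inputs, weights, biases):
    # Weighted sum of the inputs plus a bias, followed by a non-linear activation
    return np.tanh(weights @ inputs + biases)

x = np.array([0.5, -1.2, 3.0])                       # input values
w1, b1 = np.random.randn(4, 3), np.random.randn(4)   # hidden layer: 3 inputs -> 4 perceptrons
w2, b2 = np.random.randn(1, 4), np.random.randn(1)   # output layer: 4 inputs -> 1 output
print(layer(layer(x, w1, b1), w2, b2))
```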

This architecture is directly inspired by the workings of the human brain. Combined with a neural network’s ability to learn from data, a strong association between this machine learning method and the notion of artificial intelligence can be drawn.

Alternatives to neural networks

Despite being so popular, neural networks are not the only machine learning method available. Several alternatives exist, and in many contexts these alternatives may provide better performance.

Some noteworthy alternatives are the following:

  • Random forests, which consist of an ensemble of decision trees, each trained on a random subset of the training dataset. This method corrects a single decision tree’s tendency to overfit the input data (a minimal usage sketch is shown after this list).
  • Support vector machines, which attempt to map the input data into a space where it is linearly separable into different categories.
  • k-nearest neighbors algorithm (KNN), which looks for the values in the training dataset that are closest to a new input, and combines the target variables associated with those nearest neighbors into a new prediction.
  • Symbolic regression, a technique that tries to find explicit mathematical formulas that connect the input variables to the target variable.
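As a quick illustration, here is how two of these alternatives could be tried with scikit-learn, using its built-in Iris dataset purely as an example:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train a random forest and a k-nearest neighbors classifier on the same data
for model in (RandomForestClassifier(n_estimators=100), KNeighborsClassifier(n_neighbors=5)):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))
```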

A noteworthy alternative

Among the alternatives above, all but symbolic regression involve implicit computations under the hood that cannot be easily interpreted. With symbolic regression, the model is an explicit mathematical formula that can be written on a sheet of paper, making this technique a particularly interesting alternative to neural networks.

Here is how it works: given a set of base functions, for instance sin(x), exp(x), addition, multiplication, etc, a training algorithm tries to find the combinations of those functions that best predict the target variable from the input variables. It is important that the formulas encountered are the simplest ones possible, so the algorithm will automatically discard a formula if it finds a simpler one that performs just as well.
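The simplicity rule can be illustrated with a toy sketch: among a set of hypothetical candidate formulas, only those that no simpler candidate matches or beats in error are kept.

```python
# Hypothetical candidates with a size (complexity) and an error on the data
candidates = [
    {"formula": "x1 + 2.0",          "size": 3, "error": 0.90},
    {"formula": "sin(x1) + x2",      "size": 4, "error": 0.40},
    {"formula": "sin(x1) + exp(x2)", "size": 5, "error": 0.40},  # no better than a simpler one
]

# Keep a candidate only if no strictly simpler candidate performs at least as well
kept = [c for c in candidates
        if not any(o["size"] < c["size"] and o["error"] <= c["error"] for o in candidates)]
print([c["formula"] for c in kept])
```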

Here is an example of the output of a symbolic regression optimization, in which formulas of increasing complexity were found that describe the input dataset. The symbolic regression package used is called TuringBot, a desktop application that can be downloaded for free.

Formulas found with a symbolic regression optimization.

This method very much resembles a scientist looking for mathematical laws that explain data, as Kepler did with data on the positions of planets in the sky to find his laws of planetary motion.

Conclusion

In this article, we have seen some alternatives to neural networks based on completely different ideas, including symbolic regression, which generates models that are explicit and more explainable than a neural network. Exploring different models is very valuable, because they may perform differently depending on the context.

TuringBot is a great alternative to neural networks. Check it out!

A free AI software for PC

If you are interested in solving AI problems and would like an easy-to-use desktop software that yields state-of-the-art results, you might like TuringBot. In this article, we will show you how it can be used to easily solve classification and regression problems, and explain the methodology that it uses, which is called symbolic regression.

The software

TuringBot is a desktop application that runs on both Windows and Linux, and that can be downloaded for free from the official website. This is what its interface looks like:

The interface of TuringBot.

The usage is simple: you load your data in CSV or TXT format through the interface, select which column should be predicted and which columns should be used as input, and start the search. The program will look for explicit mathematical formulas that predict this target variable, and show the results in the Solutions box.

Symbolic regression

The name of this technique, which looks for explicit formulas that solve AI problems, is symbolic regression. It is capable of solving the same problems as neural networks, but in an explicit way that does not involve black box computations.

Think of what Kepler did when he extracted his laws of planetary motion from observations. He looked for algebraic equations that could explain this data, and found timeless patterns that are taught to this day in schools. What TuringBot does is something similar to that, but millions of times faster than a human could ever do.

An important point in symbolic regression is that it is not sufficient for a model to be accurate — it also has to be simple. This is why TuringBot’s algorithm tries to find the best formulas of all possible sizes simultaneously, discarding larger formulas that do not perform better than simpler alternatives.

The problems that it can solve

Some examples of problems that can be solved by the program are the following:

  • Regression problems, in which a continuous target variable should be predicted. See here a tutorial in which we use the program to recover a mathematical formula without previous knowledge of what that formula was.
  • Classification problems, in which the goal is to classify inputs into two or more different categories. The rationale of solving this kind of problem using symbolic regression is to represent different categories as different integer numbers, and run the optimization with “classification accuracy” as the search metric (this can easily be selected through the interface). In this article, we teach how to use the program to classify the Iris dataset.
  • Classification of rare events, in which a classification task must be solved on highly imbalanced datasets. The logic is similar to that of a regular classification problem, but in this case a special metric called F1 score should be used (also available in TuringBot). In this article, we found a formula that successfully classified credit card frauds on a real-world dataset that is highly imbalanced.

Getting TuringBot

If you liked the concept of TuringBot, you can download it for free from the official website. There you can also find the official documentation, with more information about the search metrics that are available, the input file formats and the various features that the program offers.

A machine learning software for data science

Data science is becoming more and more widespread, pushed by companies that are finding that very valuable and actionable information can be extracted from their databases.

It can be challenging to develop useful models from raw data. Here we will introduce a tool that makes it very easy to develop state of the art models from any dataset.

What is TuringBot

TuringBot is a desktop machine learning software. It runs on both Windows and Linux, and it generates models that predict a target variable taking one or more input variables as input. It does that through a technique called symbolic regression. This is what its interface looks like:

TuringBot’s interface.

The idea of symbolic regression is to search over the space of all possible mathematical formulas for the ones that best connect the input variables to the target variable, while trying to keep those formulas as simple as possible. The target variable can be anything: it can be a regular continuous variable, or it can represent different categories as different integer numbers, allowing the program to solve classification problems.

Machine learning with TuringBot

The usage of TuringBot is very straightforward. All you have to do is save your data in CSV or TXT format, with one variable per column, and load this input file through the program’s interface.

Once the data is loaded, you can select the target variable and which variables should be used as input, as well as the search metric, and then start the search. Several search metrics are available, including RMS error, mean error and classification accuracy. A list of formulas encountered so far will be shown in real time, ordered by complexity. Those formulas can be easily exported as Python, C or text from the interface:

Some solutions found by TuringBot. They can readily be exported to common programming languages.
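As a hypothetical illustration (the expression below is made up, not an actual TuringBot result), a formula exported to Python is simply a plain function that can be dropped into any codebase:

```python
import math

def predicted_y(x1, x2):
    # A made-up expression standing in for a formula found by the search
    return 0.73 * math.sin(x1) + math.exp(-0.21 * x2)

print(predicted_y(1.0, 2.0))
```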

Most machine learning methods are black boxes, which carry out complex computations under the hood before giving a result. This is how neural networks and random forests work, for instance. A great advantage of TuringBot over these methods is that the models that it generates are very explicit, allowing some understanding to be gained into the data. This turns data science into something much more similar to natural science and its search for mathematical laws that explain the world.

How to get the software

If you are interested in trying TuringBot on your own data, you can download it for free from the official website. There you can also find the official documentation, with detailed information about all the features and parameters of the software. Many engineers and data scientists are already making use of the software to find hidden patterns in their data.