Machine learning with symbolic regression

Many machine learning methods are available today, including neural networks, random forests, and support vector machines. In this article, we look at a comparatively little-known technique called symbolic regression and show how it can be used to solve machine learning problems in a transparent and explicit way.

What is machine learning

Machine learning concerns algorithms that learn from data to predict numerical values (regression) or assign categories (classification), among other tasks. The real world is messy and randomness appears everywhere, so a major challenge for these algorithms is separating meaningful signal from the noise in the training data.

What most machine learning methods have in common is that they are implicit and resemble black boxes: numbers are fed into the model, and it returns a result after a series of complex computations under the hood. This kind of information processing is closely tied to the notion of “artificial intelligence”: the inner workings of the human brain are also very hard to describe, yet it can learn and recognize patterns across a wide range of domains.

Symbolic regression

Symbolic regression is a technique that searches for mathematical formulas that predict a target variable from one or more input variables. A symbolic model is therefore nothing more than an algebraic formula that could be written on a piece of paper.

A simple case of a symbolic model is a polynomial. For a single input variable, any dataset of n points with distinct input values can be fitted exactly by a polynomial of degree n - 1, but that is not very interesting: such polynomials tend to diverge quickly outside the training domain, and they contain as many free parameters as there are data points, so they do not compress the information in any way.
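The polynomial argument can be checked numerically. The sketch below assumes a hypothetical dataset of eight samples of sin(x) and uses Lagrange interpolation, chosen here only because it needs no external library. It shows that the interpolating polynomial reproduces the training points while drifting far from the true function outside the sampled interval.

```python
import math

def lagrange_interpolate(xs, ys, x):
    """Evaluate at x the unique polynomial of degree len(xs) - 1
    that passes through all points (xs[i], ys[i])."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Hypothetical training data: 8 samples of sin(x) at x = 0, 1, ..., 7.
xs = [float(k) for k in range(8)]
ys = [math.sin(v) for v in xs]

# Largest error on the training points themselves.
max_train_error = max(
    abs(lagrange_interpolate(xs, ys, v) - y) for v, y in zip(xs, ys)
)

# Error at a point outside the training domain [0, 7].
x_outside = 12.0
error_outside = abs(lagrange_interpolate(xs, ys, x_outside) - math.sin(x_outside))

print("max error on training points:", max_train_error)
print("error at x = 12:", error_outside)
```

The polynomial has eight coefficients for eight data points, so it stores the data rather than summarizing it.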

More interesting models are found by combining a set of base functions and searching for the simplest combinations that predict the target variable. Examples of base functions are trigonometric functions, exponentials, addition, multiplication, and division.
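To make the idea concrete, here is a toy sketch that enumerates combinations of a few base functions from simplest to more complex and keeps the first one that fits the data. Brute-force enumeration is an assumption made for illustration; it is not a description of how TuringBot or any other symbolic regression tool searches. The names `TERMINALS`, `OPERATORS`, `candidates`, and `search` are hypothetical.

```python
import itertools
import math

# Building blocks (a small, hypothetical set of base functions).
TERMINALS = {
    "x": lambda x: x,
    "sin(x)": math.sin,
    "cos(x)": math.cos,
    "exp(x)": math.exp,
}
OPERATORS = {
    "+": lambda a, b: a + b,
    "*": lambda a, b: a * b,
}

def candidates():
    """Yield (formula_text, function) pairs, simplest formulas first."""
    for name, f in TERMINALS.items():
        yield name, f
    pairs = itertools.combinations_with_replacement(TERMINALS.items(), 2)
    for (n1, f1), (n2, f2) in pairs:
        for op, g in OPERATORS.items():
            # Default arguments freeze the current f1, f2, g in the lambda.
            yield f"{n1} {op} {n2}", (lambda x, f1=f1, f2=f2, g=g: g(f1(x), f2(x)))

def rmse(f, xs, ys):
    return math.sqrt(sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

def search(xs, ys, tolerance=1e-9):
    """Return the best formula found, stopping early at the first
    candidate whose error is within the tolerance."""
    best_name, best_error = None, math.inf
    for name, f in candidates():
        error = rmse(f, xs, ys)
        if error < best_error:
            best_name, best_error = name, error
        if error <= tolerance:
            break
    return best_name, best_error

# Hypothetical dataset generated from y = x * cos(x).
xs = [0.1 * k for k in range(50)]
ys = [x * math.cos(x) for x in xs]

best_formula, best_rmse = search(xs, ys)
print(best_formula, best_rmse)
```

Because candidates are visited in order of increasing size, the first formula that fits is also the simplest one in this small search space. Real tools have to explore far larger spaces and cannot rely on exhaustive enumeration.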

For instance, these are some of the base functions used by the symbolic regression software TuringBot:

[Figure: Base functions used by TuringBot.]

After the base functions are defined, the task is to combine them so that the target variable is predicted from the input variables. There is more than one way to define success in this optimization: one might want to maximize classification accuracy, or to recover the overall shape of a curve without much regard for outliers. For this reason, TuringBot allows many different search metrics to be used:
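The choice of metric can change which formula counts as the best one. The sketch below compares two hypothetical candidate formulas on a hypothetical dataset that contains a single outlier, using root mean squared error and mean absolute error. These two metrics are generic examples and are not taken from TuringBot's list.

```python
import math

def rmse(predictions, targets):
    """Root mean squared error: penalizes large deviations heavily."""
    n = len(targets)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n)

def mean_abs_error(predictions, targets):
    """Mean absolute error: less sensitive to a few large deviations."""
    n = len(targets)
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / n

# Hypothetical data: y = 2x everywhere, except one outlier at the last point.
xs = list(range(10))
ys = [2.0 * x for x in xs]
ys[-1] += 50.0

# Two hypothetical candidate formulas.
models = {
    "2*x": lambda x: 2.0 * x,            # follows the bulk of the data
    "2*x + 5": lambda x: 2.0 * x + 5.0,  # shifted toward the outlier
}

def best_model(metric):
    """Name of the candidate with the lowest value of the given metric."""
    return min(models, key=lambda name: metric([models[name](x) for x in xs], ys))

best_by_rmse = best_model(rmse)
best_by_mae = best_model(mean_abs_error)
print("best by RMSE:", best_by_rmse)
print("best by mean absolute error:", best_by_mae)
```

On this data the squared-error metric favors the formula shifted toward the outlier, while the absolute-error metric favors the one that follows the other nine points.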

[Figure: The search metrics available in TuringBot.]

Some examples of problems that can be solved with symbolic regression include:

Regression problems, the most basic use of the technique, for example recovering a mathematical formula with TuringBot without prior knowledge of what the formula was.
Classification problems.
Rare event classification on highly imbalanced datasets, using the F1 score as the search metric.
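The sketch below illustrates why the F1 score is a natural metric for rare events. It assumes a hypothetical dataset with 10 positives among 1000 samples and compares a trivial model that never predicts the rare class with a simple thresholded formula. All names and numbers are illustrative.

```python
def accuracy(predictions, labels):
    correct = sum(1 for p, t in zip(predictions, labels) if p == t)
    return correct / len(labels)

def f1_score(predictions, labels):
    """F1 = 2*TP / (2*TP + FP + FN), defined as 0 when there are no
    true positives, false positives, or false negatives to count."""
    tp = sum(1 for p, t in zip(predictions, labels) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(predictions, labels) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(predictions, labels) if p == 0 and t == 1)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

# Hypothetical imbalanced dataset: only the last 10 of 1000 samples are positive.
xs = list(range(1000))
labels = [1 if x >= 990 else 0 for x in xs]

# Trivial model: always predicts the majority class.
trivial = [0 for _ in xs]
# Symbolic classifier: a thresholded formula that catches every rare
# event at the cost of 15 false alarms.
formula = [1 if x >= 975 else 0 for x in xs]

acc_trivial, acc_formula = accuracy(trivial, labels), accuracy(formula, labels)
f1_trivial, f1_formula = f1_score(trivial, labels), f1_score(formula, labels)
print("accuracy:", acc_trivial, acc_formula)
print("F1 score:", f1_trivial, f1_formula)
```

Accuracy ranks the trivial model higher even though it never detects a rare event, whereas the F1 score ranks the thresholded formula higher.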

The method is very general and can be creatively used to solve a variety of problems.

Conclusion

In this article, we have seen that symbolic regression is an alternative machine learning method that generates explicit models and can solve several classes of problems in an elegant way. If you are interested in generating symbolic models from your data and seeing what patterns they reveal, you can download the symbolic regression software TuringBot for free; it runs on both Windows and Linux.