Many machine learning methods are presently available, including for instance neural networks, random forests and support vector machines. In this article, we will talk about a very unexplored algorithm called symbolic regression, and will show how it can be used to solve machine learning problems in a very transparent and explicit way.
What is machine learning
Machine learning concerns algorithms capable of predicting numerical values (regression) and creating classifications, among other tasks. The real world is messy and randomness appears everywhere, so a major challenge that these algorithms face is being able to discern meaningful signals from the underlying noise contained in the training datasets.
What most machine learning methods have in common is that they are very implicit and resemble black boxes: numbers are fed into the model, and it spits out a result after performing a series of complex computations under the hood. This kind of processing of information is strongly connected to the notion of “artificial intelligence”, since the inner workings of the human brain are also very hard to describe, while it is capable of learning and recognizing patterns across a very wide range of domains.
Symbolic regression is a technique that looks for mathematical formulas that predict some target variable taking as input one or more input variables. Thus, a symbolic model is nothing more than an algebraic formula that can be written on a piece of paper.
A simple case of symbolic model is a polynomial. Any dataset can be represented with perfect accuracy by a polynomial, but that is not very interesting because polynomials quickly diverge outside the train domain, and because they contain as many free parameters as the training dataset itself. So they do not really compress information in any way.
More interesting models are found by combining a set of base functions and trying to find the simplest combinations that predict some target variable. Examples of base functions are trigonometric functions, exponentials, sum, multiplication, division, etc.
For instance, these are some of the base functions used by the symbolic regression software TuringBot:
After the base functions are defined, the task is then to combine them in such way that a target variable is successfully predicted from the input variables. There is more than one way to carry out the optimization — one might be interested in maximizing the classification accuracy, or in recovering the overall shape of a curve without much regard for outliers, etc. For this reason, TuringBot allows many different search metrics to be used:
Some examples of problems that can be solved with symbolic regression include:
- Regression problems, which consist of the most basic kind of usage of the technique. See classifying the flower species in the Iris dataset.
- symbolic regression software TuringBot, which works on both Windows and Linux, for free.