How to create an equation for data points

In order to find an equation from a list of values, a special technique called symbolic regression must be used. The idea is to search over the space of all possible mathematical formulas for the ones with the greatest accuracy, while trying to keep those formulas as simple as possible.

In this tutorial, we are going to show how to find formulas using the desktop symbolic regression software TuringBot, which is very easy to use.

How symbolic regression works

Symbolic regression starts from a set of base functions to be used in the search, such as addition, multiplication, sin(x), exp(x), etc, and then tries to combine those functions in all possible ways with the goal of finding a model that will be as accurate as possible in predicting a target variable. Some examples of base functions used by TuringBot are the following:

Some base functions that TuringBot uses for symbolic regression.

As important as the accuracy of a formula is its simplicity. A huge formula can predict with perfect accuracy the data points, but if the number of free parameters in the model is the same as the number of points then this model is not really informative. For this reason, a symbolic regression optimization will discard a larger formula if it finds a smaller one that performs just as well.

Finding a formula with TuringBot

Finding equations from data points with TuringBot is a simple process. The first step is selecting the input file with the data through the interface. This input file should be in TXT or CSV format. After it has been loaded, the target variable can be selected (by default it will be the last column in the file), and the search can be started. This is what the interface looks like:

The interface of the TuringBot symbolic regression software.

Several options are available on the menus on the left, such as setting a test/train split to be able to detect overfit solutions, selecting which base functions should be used, and selecting the search metric, which by default is root-mean-square error, but that can also be set to classification accuracy, mean relative error and others. For this example, we are going to keep it simple and just use the defaults.

The optimization is started by clicking on the play button at the top of the interface. The best formulas found so far will be shown in the solutions box, ordered by complexity:

The formulas found by TuringBot for an example dataset.

The software allows the solutions to be exported to common programming languages from the menu, and also to simply be exported as text. Here are the formulas in the example above exported in text format:

Complexity   Error      Function
1            1.91399    -0.0967549
3            1.46283    0.384409*x
4            1.362      atan(x)
5            1.18186    0.546317*x-1.00748
6            1.11019    asinh(x)-0.881587
9            1.0365     ceil(asinh(x))-1.4131
13           0.985787   round(tan(floor(0.277692*x)))
15           0.319857   cos(x)*(1.96036-x)*tan(x)
19           0.311375   cos(x)*(1.98862-1.02261*x)*tan(1.00118*x)

Conclusion

In this tutorial, we have seen how symbolic regression can be used to find formulas from values. Symbolic regression is very different from regular curve-fitting methods, since no assumption is made about what the shape of the formulas should be. This allows patterns to be found in datasets with an arbitrary number of dimensions, making symbolic regression a general purpose machine learning technique.

A regression model example and how to generate it

Regression models are perhaps the most important class of machine learning models. In this tutorial, we will show how to easily generate a regression model from data values.

What is regression

The goal of a regression model is to be able to predict a target variable taking as input one or more input variables. The simplest case is that of a linear relationship between the variables, in which case basic methods such as least squares regression can be used.

In real-world datasets, the relationship between the variables is often highly non-linear. This motivates the use of more sophisticated machine learning techniques to solve the regression problems, including for instance neural networks and random forests.

A regression problem example is to predict the value of a house from its characteristics (location, number of bedrooms, total area, etc), using for that information from other houses which are not identical to it but for which the prices are known.

Regression model example

To give a concrete example, let’s consider the following dataset of house prices: house_prices.txt. It contains the following columns:

Index;
Local selling prices, in hundreds of dollars;
Number of bathrooms;
Area of the site in thousands of square feet;
Size of the living space in thousands of square feet;
Number of garages;
Number of rooms;
Number of bedrooms;
Age in years;
Construction type (1=brick, 2=brick/wood, 3=aluminum/wood, 4=wood);
Number of fire places;
Selling price.

The goal is to predict the last column, the selling price, as a function of all the other variables. In order to do that, we are going to use a technique called symbolic regression, which attempts to find explicit mathematical formulas that connect the input variables to the target variable.

We will use the desktop software TuringBot, which can be downloaded for free, to find that regression model. The usage is quite straightforward: you load the input file through the interface, select which variable is the target and which variables should be used as input, and then start the search. This is what its interface looks like with the data loaded in:

The TuringBot interface.

We have also enabled the cross validation feature with a 50/50 test/train split (see the “Search options” menu in the image above). This will allow us to easily discard overfit formulas.

After running the optimization for a few minutes, the formulas found by the program and their corresponding out-of-sample errors were the following:

The regression models found for the house prices.

The highlighted one turned out to be the best — more complex solutions did not offer increased out-of-sample accuracy. Its mean relative error in the test dataset was of roughly 8%. Here is that formula:

price = fire_place+15.5668+(1.66153+bathrooms)*local_pric

The variables that are present in it are only three: the number of bathrooms, the number of fire places and the local price. It is a completely non-trivial fact that the house price should only depend on these three parameters, but the symbolic regression optimization made this fact evident.

Conclusion

In this tutorial, we have seen an example of generating a regression model. The technique that we used was symbolic regression, implemented in the desktop software TuringBot. The model that was found had a good out-of-sample accuracy in predicting the prices of houses based on their characteristics, and it allowed us to clearly see the most relevant variables in estimating that price.