10 creative applications of symbolic regression

Symbolic regression is a method that discovers mathematical formulas from data without making assumptions about what those formulas should look like. Given a set of input variables x1, x2, x3, etc., and a target variable y, it uses trial and error to find a function f such that y = f(x1, x2, x3, …).

The method is very general, since the target variable y can be anything and a variety of error metrics can be chosen for the search. Here we enumerate a few creative applications to give the reader some ideas.

All of these problems can be modeled out of the box with the TuringBot symbolic regression software.

1. Forecast the next values of a time series

Say you have a sequence of numbers and you want to predict the next one. This could be the monthly revenue of a company or the daily prices of a stock, for instance.

In special cases, this kind of problem can be solved by simply fitting a line to the data and extrapolating to the next point, a task that can be easily accomplished with numpy.polyfit. While this will work just fine in many cases, it will not be useful if the time series evolves in a nonlinear way.
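As a concrete baseline, the linear case can be sketched in a few lines (the data here is illustrative):

```python
import numpy as np

# Illustrative monthly values of a perfectly linear time series
y = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
t = np.arange(1, len(y) + 1)  # indices 1, 2, 3, ...

# Fit a straight line y = a*t + b and extrapolate to the next index
a, b = np.polyfit(t, y, deg=1)
next_value = a * (len(y) + 1) + b
print(next_value)  # 20.0 for this linear series
```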

Symbolic regression offers a more general alternative. One can look for formulas for y = f(index), where y are the values of the series and index = 1, 2, 3, etc. A prediction can then be made by evaluating the resulting formulas at a future index.

This is not a mainstream way to approach this kind of problem, but the simplicity of the resulting models can make them much more informative than mainstream forecasting methods such as those implemented in Facebook’s Prophet library.

2. Predict binary outcomes

A machine learning problem of great practical importance is to predict whether something will happen or not. This is a central problem in options trading, gambling, and finance (“will a recession happen?”).

Numerically, this problem translates to predicting 0 or 1 based on a set of input features.

Symbolic regression allows binary problems to be solved by using classification accuracy as the error metric for the search. In order to minimize the error, the optimization will converge without supervision towards formulas that only output 0 or 1, usually involving floor/ceil/round of some bounded function like tanh(x) or cos(x).
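To illustrate the kind of solution the search converges to, here is a hypothetical formula of that shape evaluated on a toy dataset (the rule "label is 1 when x1 > x2" is invented for the example):

```python
import numpy as np

# Toy binary dataset: the label is 1 whenever x1 > x2
rng = np.random.default_rng(0)
x1, x2 = rng.random(100), rng.random(100)
y = (x1 > x2).astype(int)

# A formula of the shape the optimization tends to find:
# a bounded function (tanh) squashed into [0, 1] and rounded
pred = np.round((np.tanh(10 * (x1 - x2)) + 1) / 2)

accuracy = np.mean(pred == y)
print(accuracy)
```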

3. Predict continuous outcomes

A generalization of the problem of making a binary prediction is the problem of predicting a continuous quantity in the future.

For instance, in agriculture one could be interested in predicting the time for a crop to mature given parameters known at the time of sowing, such as soil composition, the month of the year, temperature, etc.

Usually, few data points will be available to train the model in this kind of scenario, but since symbolic models are simple, they are less prone to overfitting the data than more complex models. The problem can be modeled by running the optimization with a standard error metric such as root-mean-square error or mean error.
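The error metric itself is simple to state; for a candidate formula, the root-mean-square error over the training rows can be sketched as follows (the numbers are made up):

```python
import numpy as np

# Observed crop-maturation times (days) and a candidate formula's predictions
y_true = np.array([100.0, 95.0, 110.0, 105.0])
y_pred = np.array([98.0, 97.0, 108.0, 106.0])

# Root-mean-square error, the quantity minimized during the search
rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
print(rmse)
```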

4. Solve classification problems

Classification problems, in general, can be solved by symbolic regression with a simple trick: representing different categorical variables as different integer numbers.

If your data points have 10 possible labels that should be predicted based on a set of input features, you can use symbolic regression to find formulas that output integers from 1 to 10 based on these features.

This may sound like asking too much — a formula capable of that is highly specific. But a good symbolic regression engine will be thorough in its search over the space of all mathematical formulas and will eventually find appropriate solutions.
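The encoding step itself is trivial; a minimal sketch (the label names are invented for the example):

```python
# Map categorical labels to the integers 1..K before running the search
labels = ["cat", "dog", "bird", "dog", "cat"]
mapping = {lab: i + 1 for i, lab in enumerate(sorted(set(labels)))}
encoded = [mapping[lab] for lab in labels]
print(mapping, encoded)
```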

5. Classify rare events

Another interesting classification scenario is that of a highly imbalanced dataset, in which only a handful of rows contain the relevant label and the rest are negatives. Examples include medical diagnostic images and fraudulent credit card transactions.

For this kind of problem, the usual classification accuracy search metric is not appropriate, since f(x1, x2, x3, …) = 0 will have a very high accuracy while being a useless function.

Special search metrics exist for this kind of problem, the most popular of which is the F1 score, the harmonic mean of precision and recall. This search metric is available in TuringBot, allowing this kind of problem to be easily modeled.
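A small numeric sketch makes the point: on an imbalanced toy dataset, the useless always-negative model scores high accuracy but zero F1:

```python
import numpy as np

y_true = np.array([0] * 95 + [1] * 5)   # only 5 positives out of 100
y_pred = np.zeros(100, dtype=int)       # the useless f(...) = 0 model

accuracy = np.mean(y_pred == y_true)    # high despite predicting nothing

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(accuracy, f1)  # 0.95 0.0
```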

6. Compress data

A mathematical formula is perhaps the shortest possible representation of a dataset. If the target variable features some kind of regularity, symbolic regression can turn gigabytes of data into something that can be equivalently expressed in one line.

Examples of target variables could be the RGB colors of an image as a function of its (x, y) pixel coordinates. We have tried finding a formula for the Mona Lisa, but unfortunately nothing simple could be found in this case.

7. Interpolate data

Say you have a table of numbers and you want to compute the target variable for intermediate values not present in the table itself.

One way to go about this is to generate a spline interpolation from the table, which is a somewhat cumbersome and non-portable solution.

With symbolic regression, one can turn the entire table into a mathematical expression, and then proceed to do the interpolation without the need for specialized libraries or data structures, and also without the need to store the table itself anywhere.
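For instance, if the search returns a closed-form expression for a table, interpolation reduces to plain evaluation (the table and the recovered formula below are invented for illustration):

```python
# Table of y = x**2 sampled at integer x (toy stand-in for real data)
table = {1: 1, 2: 4, 3: 9, 4: 16, 5: 25}

# Suppose symbolic regression recovered f(x) = x**2 from the table;
# values between the rows are then obtained by simple evaluation
f = lambda x: x ** 2
print(f(2.5))  # 6.25, no spline or lookup structure needed
```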

8. Discover upper or lower bounds for a function

In problems of engineering and applied mathematics, one is often interested not in the particular value of a variable but in how fast this variable grows or how large it can be given an input. In this case, it is more informative to obtain an upper bound for the function than an approximation for the function itself.

With symbolic regression, this can be accomplished by discarding formulas that are not always larger or always smaller than the target variable. This kind of search is available out of the box in TuringBot with its “Bound search mode” option.
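The acceptance criterion can be sketched as follows: in an upper-bound search, a candidate survives only if it never dips below the target (the target and candidate functions here are illustrative):

```python
import numpy as np

x = np.linspace(0.1, 5.0, 200)
y = np.sin(x) / x               # target variable

candidate = lambda x: 1.0 / x   # candidate upper bound, since sin(x) <= 1

# Formulas that are not always >= the target get discarded
is_upper_bound = np.all(candidate(x) >= y)
print(is_upper_bound)  # True
```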

9. Discover the most predictive variables

When creating a machine learning model, it is extremely useful to know which input variables are the most relevant in predicting the target variable.

With black-box methods like neural networks, answering this kind of question is nontrivial because all variables are used at once indiscriminately.

But with symbolic regression the situation is different: since the formulas are kept as short as possible, variables that are not predictive end up not appearing, making it trivial to spot which variables are actually predictive and relevant.

10. Explore the limits of computability

Few people are aware of this, but the notion of computability was first introduced by Alan Turing himself in his famous paper "On Computable Numbers, with an Application to the Entscheidungsproblem".

Some things are easy to compute, for instance the function f(x) = x or common functions like sin(x) and exp(x) that can be converted into simple series expansions. But other things are much harder to compute, for instance, the N-th prime number.

With symbolic regression, one can try to derandomize tables of numbers and discover highly nonlinear patterns connecting variables. Since this is done in a very free way, even absurd solutions like tan(tan(tan(tan(x)))) end up being a possibility. This makes the method operate on the edge of computability.

Interested in symbolic regression? Download TuringBot and get started today.

TuringBot and 2D-3D Animation: Jumping Jack

By Giovanni Di Maria, creator of EDM Electronics Design Master

A person’s movements can be represented using mathematical formulas. With the help of TuringBot, it is possible to translate into numbers all the movements that a person makes. If the movements are simple, the formulas will also be simple; conversely, complex movements are described by longer and more complex formulas and equations. Thanks to this ability to find mathematical formulas for any type of action, robotic systems, simulations, 3D prototyping, video games and much more can be implemented.

Turning real life into numbers and formulas

To describe the movement of a person (or an animal, plant or object) on the computer, it is necessary to follow several phases and mathematically represent the set of actions it performs. Some steps are quite complicated. This article is dedicated to the numerical representation of the movements of a man performing the “Jumping Jack” physical exercise. The steps to follow are as follows:

  • video capture;
  • cutting the video and keeping only the useful frames;
  • identification of the parts of the body in motion;
  • determination of the coordinates of the various segments;
  • symbolic regression and curve fitting of data;
  • application of the formulas and creation of a 3D model;
  • creation of final animation and comparison with the original video.

Video capture and frame selection

For this step, you need to create a video of the person in motion. It need not be very long: 3 or 4 seconds is enough, even at 15 FPS. The important thing is to acquire the sequence of useful frames, in which the athlete moves through a complete period, i.e., from when the human body assumes a certain position to when it returns to the same position.

You will probably need to use a variety of software. When creating the video, the important thing is that the athlete is well lit and visible and, above all, that the camera angle does not change. You then need to trim the video to keep only the useful sequence. For this example, the video consists of 29 frames. Each tracked element, therefore, is represented by as many positions in the two-dimensional spatial domain.

Identification of the moving parts of the body

The aim of our work is to simulate on the computer the movement of a graphic object composed simply of some segments: we are interested in the mathematical aspect of the problem, not the graphic one. The digital athlete is built from segments that represent the arms, legs, etc. Below we can see the athlete together with his virtual “avatar”, made up of the segments that will subsequently be processed mathematically. For convenience, the forearm and hand have been treated as a single segment, since the video shows them both as a single line.

The virtual athlete is composed of the following elements:

  • a circle “H” for the head, with position X, Y and radius R;
  • two segments “Arm1” and “Arm2” for the arms;
  • two segments “Forearm1” and “Forearm2” for the forearms;
  • one segment “Body” for the body;
  • two segments “Thigh1” and “Thigh2” for the thighs;
  • two segments “Leg1” and “Leg2” for the legs;
  • two segments “Foot1” and “Foot2” for the feet.

Determination of the coordinates of the various segments

Determining all the coordinates of the points of the subject is the most important and most laborious phase of the whole operation. A good result depends on the precision with which the point values are taken. For each frame, you need to:

  • determine the X and Y coordinates of the start of each segment;
  • determine the X and Y coordinates of the end of the same segment.

The head is drawn as a circle, so its center (and its position) is represented by a single point with X and Y coordinates. To evaluate the position of each point we can use any graphics program, for example ImageJ, which shows the mouse position (in pixels) on the status bar in real time and has many excellent automated selection functions that simplify the work considerably. We use this information to obtain the coordinates of the various segments.

It is a rather long job, requiring maximum precision and a lot of patience. At the end, the results can be entered into a spreadsheet. Remember that these coordinates are relative; to transform them into absolute coordinates, it is sufficient to subtract a fixed offset.

With all this data available it is possible, for example, to draw displacement graphs of some elements of the body and discover any anomalies. The following graph shows the trend of the X and Y positions during the entire movement, across all frames.

And now, TuringBot in action

All these coordinates must now be transformed into mathematical formulas. The simplest and safest approach would be to store them in an array and process the movements within a loop, but the adoption of equations makes the operation much more elegant and faster, allowing shorter and more compact source code with less memory consumption. Of course, if the athlete’s movements change, the formulas must change accordingly. It is therefore time to represent the trends of human movements with mathematical formulas. The various joints of the human body are represented by many points, for each of which the two coordinates X and Y must be stored, so a considerable number of formulas must be generated. Some points are shared between segments, so they use the same formulas.

For each coordinate of each point, it is necessary to create a text file for TuringBot containing two columns. The data is, of course, taken from the giant spreadsheet. Below is the example for the X coordinate of the head; the other files follow the same rule.

x y
1 190
2 190
3 190
4 189
5 189
6 188
7 188
8 187
9 186
10 186
11 185
12 184
13 184
14 184
15 184
16 185
17 185
18 186
19 186
20 187
21 188
22 188
23 188
24 189
25 190
26 190
27 191
28 192
29 192
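Producing these per-coordinate files from the spreadsheet is easy to script; a minimal sketch (the file name and the truncated data are illustrative):

```python
# Write a two-column TuringBot input file for one tracked coordinate
head_x = [190, 190, 190, 189, 189, 188]  # first frames of the head's X data

with open("head_x.txt", "w") as f:
    f.write("x y\n")
    for frame, value in enumerate(head_x, start=1):
        f.write(f"{frame} {value}\n")
```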

The TuringBot setting is as follows:

  • Search metric: RMS error;
  • Train / test split: No cross validation;
  • Test sample: Chosen randomly;
  • Integer constants only: Disabled;
  • Bound search mode: Deactivated;
  • Maximum formula complexity: 60;
  • Allowed functions: + * / sin cos exp log sqrt abs floor ceil round

For some systems, care must be taken with trigonometric functions: they work in radians in TuringBot, while other graphics software may work in degrees, complicating the conversion process. In our specific case, the following conversion functions were created in OpenSCAD, whose built-in trigonometric functions take degrees:

// evaluate a radian argument with OpenSCAD's degree-based trig functions
function cos_deg (x) = cos (x * 180 / 3.1415);
function sin_deg (x) = sin (x * 180 / 3.1415);

A relatively short search time is sufficient for each formula: 5-10 minutes or less is more than enough. It is very interesting to watch TuringBot during the search, as its formulas get closer and closer to the final goal, as can be seen in the screen below.

Below are all the formulas that describe the positions of all the coordinates of the segments that make up the virtual athlete.

Head
y01=round((2.6318+(0.0554792*((0.624893-cos(5.98948*x))*x)))*cos(0.22113*x)+187.659);
y02=131.133+round((22.2134-cos(-0.236286+1.0202*x))*cos((-0.0920179)*(((-6.27235+cos(1.1767+x))/x)+200.387*x))-cos(0.665542+0.98184*x)+(-0.312138*x)+4.11456);
Arm 1
y03=145.872-((-0.162146)*3.12725*round(cos(0.869371-round(5.96336*x))*(0.362323*x+1.77951)-(cos(-2.63665+round(-0.806851*x))/0.61255)+x));
y04=191.306+((round(cos(0.100135*(-8.17967-1.70143*round(cos(x))+x))*(1.27238*x-cos(-1.86063+x))-(cos(389.994*(x+1.88321)+0.0989649)/0.0417673))+x)/(-1.2269));
y05=round((-22.0364+ceil(-0.00500284*ceil(x)*x))*(-1.65113-cos(19.3301*(x-0.0605927))))+((30.7797/(0.161422+x))+(68.1958-cos(51.0009*x))-cos(x));
y06=exp(3.04946+cos(2.45808*(-0.0870771*x)+0.138404)+1.14937)+(98.0591+round(sin(round(0.980629*x)))+((6.65119-cos(exp(0.0876617*x)))*cos(0.695886*(2.86001+x))));
Forearm 1
y07=y05;
y08=y06;
y09=(82.9424-5.26613*cos(0.586505*x))*(cos((-0.373595)*(x+exp(round(0.702127*x)-(0.498511+round(0.81141*x)))+1.08672))+1.09296)-cos(x+1.33121);
y10=24.2181-floor((6.34302+(-0.112064*x))*round(cos(0.553976+x)))+abs(12.7211+x-((0.650495*x-11.0444)*round(3.27817*((cos((-0.730758)*(0.273942+x))/0.755489)+(x-12.8353)))));
Body 1
y11=2.89195*cos(1.94668*0.118136*floor(0.631336-(0.929462*x+cos(exp(0.91244*x))))+(0.00305925/cos(x)))+186.109;
y12=round(168.264-(ceil(23.9806*cos((-2.32409+(-0.00596887+x))/0.00939977))+1.57128*cos(-1.18468*x))-(cos(x)+(1.91115*(cos(7.08496*(-0.288919+x))+0.109288*x))));
y13=round(185.474-0.0754086*x)+round(cos((-0.991109+0.0450943*x)*(0.928695*x+log(x))));
y14=302.57-round(0.656922*x)-(3.72319*cos(0.92024*(0.811155+x))+round((0.357329-x*cos(0.448989*(5.09698+x)))*(-0.0148642*x+((-0.214389*cos(61*x)*(4.88954-x)-25.1065)/x))));
Thigh 1
y15=(3.73138+((-2.81755)/x)-1.5051*sin(round(0.797607*x+0.247816)+3.30139))*(-0.514093+cos(0.205523*(cos(round((-4.32537)*(x+((-1.0676)/x))))+x)))+164.175;
y16=round(291.396-((0.865647*x+13.9971)*cos(x+568-0.562408*x)))-3.78295*cos(0.889973*(1.67606+x))-((-68.1261)/(1.44242*cos(x-0.362494*x)+x));
y17=18.9883*cos(-12.8305*floor(-2.33066+x))+(0.0747105*(7.00189-(0.0741629/cos(x))+x)*cos(509.575*x))+(148.027+round(cos(x+0.658366))+(-0.209201*round(x)));
y18=390.962+(round((-0.602733)*(cos(0.747127*x-0.621592)+0.177481)*(-13.466-cos(round((-0.957067)/cos(x)))+cos(x*x)+x))+27.0249*cos(0.457091*(-0.671057+(2.28055-x))));
Leg 1
y19=y17;
y20=y18;
y21=((4.54964+(-0.104918*x)+0.86746)*cos(0.908546*x+2.08515)+22.7637)*cos(-0.111712*cos(0.935385*x)+6.70845*0.0314056*1*x)+(155.963-(x/3.08672));
y22=(13.6737-round((x-1.93818)*(0.289077+cos(-0.619872+x))))*(-0.0369057+round(3.95993/(x+1.89954)))+478.649+round(cos(0.116577*(-3.69209*x))*round(0.549631*(2.80209+x)));
Foot 1
y23=y21;
y24=y22;
y25=139.272+round(41.4532+(5.65241*cos(1.93845+0.927451*x)+(-0.313033*x)))*cos(0.233247*(cos(0.93121*(0.497331+x))+(1.53881-x)));
y26=round(0.608219/x)+(round(531.435-exp(sin((-0.579902)*(-1.04537+x))-round(cos((-0.437731)*(-1.0422+x)))))-1.74977*round(-0.0192443*x));
Arm 2
y27=215.61+(round(-6.43715+((0.0859084*x-cos(0.942479-x))*cos(x-1.00562)))*cos((0.294155+(-0.00453909*cos(x)))*(4.12657+x))+round(16.3707/(1.00701+0.237835*x)));
y28=11.6855*cos(0.807271*x-x)+round((cos((-0.423063)*(x-0.93908))/(0.0525671+(-0.000435858*x)))-log(x)*cos(0.888721*(2.08408+x)))+178.346;
y29=268.152-cos(x)-round(cos(13.3713*x))-round((2.85905+(cos(x/0.00529412)+(30.6583/x)))*(cos((-0.503562)*(3.04348+x))/(-0.441047+0.0116418*x)));
y30=round(190.29-((0.274471*(x-1.03236)-cos(x)+4.1627)*cos(0.498517*(x+3.46653))))+((77.1759+8.48446*cos(-0.610676*x))*cos(-0.201617*x));
Forearm 2
y31=y29;
y32=y30;
y33=round(-0.249366*x)+291.228+(-12.6244*cos((-0.820511)*(-0.186609+x)))-(round(78.0757*cos((-0.404411)*((2.17112/(x+0.071869))+x-0.610615)))-ceil(4.08193*cos((-1.17325)*(x-0.515673))));
y34=abs(164.762+(224*cos(0.198515*x)-15.9158*cos((-0.738632)*(1.55249+x)))-8.63263*cos(-1.01382*x*cos(0.211738*x)+0.224974))-3.18017+6.85659*cos(x);
Thigh 2
y35=0.172415+round(210.278+1.65102*round(cos(5.80056*(0.885518+x)))+(-0.0412545*x+cos(6.98431*x)))+floor(((-0.0356155)/sin(x-18.9299))+0.158463);
y36=round(round(3.61529*cos(round(0.194003*x))+(307.838-3.51777*cos((-0.897347)*(1.61609+x)))-0.729957*x)-cos(0.471091*(5.10132+0.942463*x))*round(21.5251+0.421583*round(x)));
y37=221.338-19.8608*cos(0.208955*x)+3.4534*cos(0.449626*(cos(2.00243*x*cos(x))+(-1.68947+x)));
y38=(0.247935/(-0.632871+x-3.60529))+(395.163+((28.4837+cos(exp(x+0.642375)))*cos(0.430087*(x-0.908733)))-ceil(-0.0921263*x))-round(2.63053*cos(0.886319*(2.07703+x)));
Leg 2
y39=y37;
y40=y38;
y41=221.068-round(cos((-0.229646)*(1.00585-x))*(0.512421*x+(17.4901+((3.61754-3.39247*cos(round(0.966905*x)))*cos(-0.599782*x-(-970.82-x))))));
y42=round(((7.16092*cos(x)+2.62982)/x)+ceil(11.2742*cos(0.46229*round(2.61198-1.01162*x))))+round(1.6828*cos(23.8215*(0.0290402+x)))+481.601;
Foot 2
y43=y41;
y44=y42;
y45=233.767+((-30.4935+cos(1.96625*x))*cos((-0.253376)*(2.19983-(0.927354*x-2.38703*cos(0.421426*x-1.58061)))));
y46=530.782+((cos(0.663909*x)+cos(2.00314*x)+2.30646)*cos(0.509108*(3-x))+round(cos(0.471645-x)));
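As a sanity check, the formulas can be evaluated directly; for example, y01 (the X coordinate of the head) reproduces the first rows of its input file:

```python
from math import cos

# y01: the fitted formula for the head's X coordinate, copied verbatim
def y01(x):
    return round((2.6318 + (0.0554792 * ((0.624893 - cos(5.98948 * x)) * x)))
                 * cos(0.22113 * x) + 187.659)

print(y01(1))  # 190, the value measured in frame 1
```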

Implementation of the model

After the creation of the formulas by TuringBot, it is finally possible to implement the athlete’s mathematical model. The figure below shows the use of OpenSCAD with the created equations.

The following short list shows all software used in the project. Obviously, the user can use other software, according to their needs.

  • video recorder of a smartphone or camera;
  • Avidemux;
  • ImageMagick (convert.exe);
  • ImageJ;
  • Calc (LibreOffice);
  • and of course… TuringBot.

It is very interesting to observe the animation of the real and virtual subject at the same time.

Conclusions

With symbolic regression it is possible to convert facts and actions into mathematical formulas. These techniques are used in movies, video games, and 2D and 3D simulations. Behind a very short film, even one lasting just a second, there are dozens of hours of work. The adoption of mathematical formulas for the description of the points in 2D and 3D space gives a great touch of elegance to the project. Depending on the software used, it may be necessary to flip the coordinates of the subject. The whole procedure described was very exciting and fun, even if very tiring, and the satisfaction was immense when, after all the calculations and the implementation of the code, the digital figure acquired a life of its own and started jumping. Remember once again: the athlete’s movement is not produced by coordinates stored in an array, but is generated by mathematical formulas. All this processing would not have been possible without TuringBot. This is math!

The representation of the board in a Chess Engine with TuringBot

By Giovanni Di Maria, creator of EDM Electronics Design Master

A chess engine is a computer program that receives a move as input and returns a countermove as an answer. Here we will not delve into the theory underlying the functioning of chess engines; instead, we will focus our attention on the representation of the chessboard in the initial position of a game, exploring a mathematical method to get there using the TuringBot program.

The starting position

Those familiar with the game of chess know that the pieces are placed on the board in a very particular way, as shown in the figure below.

From left to right there are:

  • the rook;
  • the knight;
  • the bishop;
  • the queen;
  • the king;
  • the bishop;
  • the knight;
  • the rook.

The black pieces are arranged in a similar way, and each queen stands on the square of her own color: the white queen on a white square and the black queen on a black square.

How to represent the chessboard in memory

There are many methods for representing a chessboard in memory. Here we focus on reproducing the starting position using an 8×8 matrix; this is the basic idea, and the programmer can vary the method according to his needs. An in-memory representation is necessary because the program must know the game situation, especially at the start of the game. A possible solution, extremely clear and simple, perhaps even banal, consists in storing the initials of the chess pieces in the matrix, distinguishing white from black with upper and lower case. This solution immediately provides a clear reading of the source code.

void reset_chessboard()
{
    int x, y;
    /* assumes a global declaration such as: char chessboard[9][9];
       (indices 1..8 are used; row/column 0 is unused) */
    // -------Empty chessboard---------
    for (x = 1; x <= 8; x++)
        for (y = 1; y <= 8; y++)
            chessboard[x][y] = '-';
    // ----Pawns-------
    for (x = 1; x <= 8; x++)
    {
        chessboard[x][2] = 'P';
        chessboard[x][7] = 'p';
    }
    // ------knight------
    chessboard[2][1] = 'N';
    chessboard[7][1] = 'N';
    chessboard[2][8] = 'n';
    chessboard[7][8] = 'n';
    // ------Bishop--------
    chessboard[3][1] = 'B';
    chessboard[6][1] = 'B';
    chessboard[3][8] = 'b';
    chessboard[6][8] = 'b';
    // ----Rook---------
    chessboard[1][1] = 'R';
    chessboard[8][1] = 'R';
    chessboard[1][8] = 'r';
    chessboard[8][8] = 'r';
    // -----Queen-----
    chessboard[4][1] = 'Q';
    chessboard[4][8] = 'q';
    // -----King-------
    chessboard[5][1] = 'K';
    chessboard[5][8] = 'k';
    return;
}

The mathematical approach with TuringBot

A very elegant solution is to perform the processing and storage of the pieces’ initial characters with two simple nested loops, within which a formula found by TuringBot calculates the ASCII code to be inserted in each cell of the matrix. The basic idea follows the diagrams illustrated below.

As can be seen in the matrix on the right (completely equivalent to the one on the left), each element of the square matrix holds the ASCII numeric code of the character that identifies each piece. An empty square is represented by a space (ASCII code 32). Recall that each cell of a matrix is identified by two values, a row and a column. These values can be saved in a text file consisting of 64 lines and 3 columns; it will be the TuringBot input file.

1 1 82
1 2 78
1 3 66
1 4 81
1 5 75
1 6 66
1 7 78
1 8 82
2 1 80
2 2 80
2 3 80
2 4 80
2 5 80
2 6 80
2 7 80
2 8 80
3 1 32
3 2 32
3 3 32
3 4 32
3 5 32
3 6 32
3 7 32
3 8 32
4 1 32
4 2 32
4 3 32
4 4 32
4 5 32
4 6 32
4 7 32
4 8 32
5 1 32
5 2 32
5 3 32
5 4 32
5 5 32
5 6 32
5 7 32
5 8 32
6 1 32
6 2 32
6 3 32
6 4 32
6 5 32
6 6 32
6 7 32
6 8 32
7 1 112
7 2 112
7 3 112
7 4 112
7 5 112
7 6 112
7 7 112
7 8 112
8 1 114
8 2 110
8 3 98
8 4 113
8 5 107
8 6 98
8 7 110
8 8 114
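The 64 lines above can also be generated programmatically from the piece matrix; a sketch:

```python
# Build the TuringBot input file contents from the starting position,
# storing the ASCII code of each piece's initial (space = empty square)
rows = [
    "RNBQKBNR",   # row 1: white back rank
    "PPPPPPPP",   # row 2: white pawns
    "        ",   # rows 3-6: empty squares
    "        ",
    "        ",
    "        ",
    "pppppppp",   # row 7: black pawns
    "rnbqkbnr",   # row 8: black back rank
]

lines = [f"{r} {c} {ord(piece)}"
         for r, row in enumerate(rows, start=1)
         for c, piece in enumerate(row, start=1)]

print(lines[0], "...", lines[-1])  # 1 1 82 ... 8 8 114
```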

Processing and search with TuringBot

After opening the input file in TuringBot, you can configure some search parameters to optimize the operation, since some functions are not necessary. In our case, the following search parameters were entered:

  • Basic functions
    • Addition
    • Multiplication
    • Division
  • Trigonometric functions
    • cos (x)
  • Other functions
    • floor (x)

Of course, you can try other functions. The search procedure is quite long, as the final formula is very complex. The screenshot below shows some initial setup steps. After setting the values, you can start the search by pressing the appropriate button.

The search begins and the program generates many formulas, with solutions ever closer to the final goal. A continuously updated window shows the formulas found, ordered by complexity and error. Normally, the longer formulas provide the lower errors.

Each formula is also characterized, of course, by a graph that should follow the trend of the original data. A perfect formula passes exactly through all the input points.

Final results

The search space is extremely vast and the final result is not always perfect, i.e., with an error of zero. In this case, we were fortunate: the system found an excellent formula that perfectly describes the initial arrangement of the pieces on the board. The search took about 4 hours using 4 threads; this timing obviously depends on the computer used and the number of threads dedicated to the operation. Below are the different formulas found by TuringBot:

Complexity  Error      Function
1           33.1747    63
5           29.5434    37.5+col1*col1
7           27.7461    col1*(-4.42857+col1)+57.4285
8           18.4378    37.8919*sin(col1)+55.6912
9           17.1364    128.854-(-5.76178*col1*(col1-8.20656))
10          15.3544    7.64726*col1*sin(col1)+57.17
12          13.7837    31.3877+(5.355*(7.2754*sin(col1)+col1))
13          11.8137    32+(62*greater(sin(col1),0.555342))
14          10.6011    7.09756*abs(col1+10.0746*sin(col1))
15          7.121      (13.9153/cos(floor(0.732407*col1)))+58.555
17          3.87582    32+(greater(sin(col1),0.415717)*(5.18918*col1+38.6487))
18          3.5793     abs(30.9689-(-6.10071*sin(1.05488*col1)*(7.47296+col1)))
20          3.11301    31.4802+pow(11.7402*(col1+5.71344),cos(1.04974*col1-1.56687))
27          3.02812    31.4787+pow((11.6058+(0.68396/(col2*col2)))*(col1+5.71681),cos(1.04968*col1-1.56654))
28          2.48017    31.4839+pow((11.6357-cos(4.18524*col2))*(col1+5.69307),cos(1.04935*col1-1.56473))
30          2.42225    31.4835+pow((11.6154-cos(4.26395*(-0.110504+col2)))*(col1+5.69428),cos(1.04932*col1-1.56459))
33          2.35706    32-((38.2449+((cos(col1)+4.90525)*(col1-cos((-2.01944)*(0.23205+col2)))))*floor(cos(-58.3057-col1)))
35          2.2992     32.0001-((38.2972+((1.28923*cos(col1)+4.83299)*(col1-cos(2.02027*(0.230258+col2)))))*floor(cos(-58.3057-col1)))
38          2.02593    32.0005-(floor(cos(col1+1.83754))*(5.12705*(8.04121-((0.818858-cos(-34.9995*col1))*(cos(2.09691*col2)+0.6415))+col1)))
40          1.75589    32.0028-(floor(cos(col1+1.78588))*(5.18838*(8.15112-((0.78022-cos(34.8496*col1+0.883278))*(cos(2.09681*col2)+0.945564))+col1)))
41          1.52332    32-(floor(cos(col1+1.78119))*(4.96845*(8.04125-floor((0.939273-cos(34.9916*col1))*(cos(-2.0859*col2)+0.624216))+col1)))
43          1.11312    32-(floor(cos(col1+1.8017))*(5.12004*(cos(col1)+7.83221-((0.78464-cos(-34.9183*col1))*(cos(2.09684*col2)+0.591668))+col1)))
45          0.760652   32.0002-(floor(cos(col1+1.73726))*(5.12047*(cos(col1)+7.831-((0.783564-cos(34.9177*col1))*(cos((-2.00884)*(col2+0.259462))+0.612137))+col1)))
48          0.728869   floor(32-(floor(cos(col1+1.73299))*(5.12841*(col1+7.95556-((0.779654-cos(34.9151*col1))*(cos(1.97953*(col2+0.346471))+0.636367))+cos(col1)))))
53          0.413448   32.0002-(floor(cos(col1+1.81624))*(5.11899*(col1+7.82369-((0.754511-cos(-34.913*col1))*(cos((-2.01269)*(col2+(0.0352432/cos(col2))+0.268925))+0.648806))+col1)))
55          0.12666    32.0005-(floor(cos(col1+1.78631))*(5.12029*(cos(col1)+7.8269-((0.771208-cos(34.9154*col1))*(cos((-1.98154)*(col2+0.178865*cos(-5.37724+col2)+0.351792))+0.63671))+col1)))
57          0.0580828  32-(floor(cos(col1+1.81682))*(5.11347*(col1+7.83564-((0.769821-cos(-34.9072*col1))*(cos((-1.98027)*(col2+0.180734*cos(-5.38204+col2)+0.35485))+0.644024))+1.07536*cos(col1))))
58          0          floor(32-(floor(cos(col1+1.73299))*(5.12816*(col1+7.89449-((0.770326-cos(34.9151*col1))*(cos(1.97953*(col2+0.181536*cos(-5.37362+col2)+0.35012))+0.636367))+cos(col1)))))

The final formula, which perfectly calculates the ASCII code of the piece on a square given an “x” and “y” position, is therefore the following:

floor(32-(floor(cos(x+1.73299))*(5.12816*(x+7.89449-((0.770326-cos(34.9151*x))*(cos(1.97953*(y+0.181536*cos(-5.37362+y)+0.35012))+0.636367))+cos(x)))))

The formula can be implemented in any programming language. The following example was created for the PARI/GP program:

{
   for(x=1,8,
      for(y=1,8,
         z=floor(32-(floor(cos(x+1.73299))*(5.12816*(x+7.89449-((0.770326-cos(34.9151*x))*(cos(1.97953*(y+0.181536*cos(-5.37362+y)+0.35012))+0.636367))+cos(x)))));
         print(x,"    ",y,"      ",z);
      );
   );
}
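The same formula can be ported to other languages just as easily; here is a Python version that rebuilds the whole board from it:

```python
from math import cos, floor

# The final TuringBot formula, transcribed from the listing above
def piece(x, y):
    return floor(32 - (floor(cos(x + 1.73299))
        * (5.12816 * (x + 7.89449
            - ((0.770326 - cos(34.9151 * x))
               * (cos(1.97953 * (y + 0.181536 * cos(-5.37362 + y) + 0.35012))
                  + 0.636367))
            + cos(x)))))

# Decode every cell back into a character and print the board
board = ["".join(chr(piece(x, y)) for y in range(1, 9)) for x in range(1, 9)]
print("\n".join(board))
```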

Executing this simple listing, which contains the long but powerful mathematical formula, generates the following list of values, corresponding perfectly to the previous one.

1 1 82
1 2 78
1 3 66
1 4 81
1 5 75
1 6 66
1 7 78
1 8 82
2 1 80
2 2 80
2 3 80
2 4 80
2 5 80
2 6 80
2 7 80
2 8 80
3 1 32
3 2 32
3 3 32
3 4 32
3 5 32
3 6 32
3 7 32
3 8 32
4 1 32
4 2 32
4 3 32
4 4 32
4 5 32
4 6 32
4 7 32
4 8 32
5 1 32
5 2 32
5 3 32
5 4 32
5 5 32
5 6 32
5 7 32
5 8 32
6 1 32
6 2 32
6 3 32
6 4 32
6 5 32
6 6 32
6 7 32
6 8 32
7 1 112
7 2 112
7 3 112
7 4 112
7 5 112
7 6 112
7 7 112
7 8 112
8 1 114
8 2 110
8 3 98
8 4 113
8 5 107
8 6 98
8 7 110
8 8 114

When the curve fits the input data points perfectly, with zero error, it means that the final formula has been found!

Symbolic Regression in Python with TuringBot

In this tutorial, we are going to show a very easy way to do symbolic regression in Python.

For that, we are going to use the symbolic regression software TuringBot. This program runs on both Windows and Linux, and it comes with a handy Python library. You can download it for free from the official website.

Importing TuringBot

The first step in running our symbolic regression optimization in Python is importing TuringBot. For that, all you have to do is add its installation directory to your Python path and import it, like so:

Windows

import sys 
sys.path.insert(1, r'C:\Users\user\AppData\Local\Programs\TuringBot') 

import turingbot as tb 
Linux

import sys 
sys.path.insert(1, '/usr/share/turingbot') 

import turingbot as tb 

Running the optimization

The turingbot library implements a simulation object that can be used to start, stop and get the current status of a symbolic regression optimization.

This is how it works:

Windows

path = r'C:\Users\user\AppData\Local\Programs\TuringBot\TuringBot.exe' 
input_file = r'C:\Users\user\Desktop\input.txt' 
config_file = r'C:\Users\user\Desktop\settings.cfg' 

sim = tb.simulation() 
sim.start_process(path, input_file, threads=4, config=config_file) 
Linux

path = r'/usr/bin/turingbot' 
input_file = r'/home/user/input.txt' 
config_file = r'/home/user/settings.cfg' 

sim = tb.simulation() 
sim.start_process(path, input_file, threads=4, config=config_file) 

The start_process method starts the optimization in the background. It takes as input the paths to the TuringBot executable and to your input file. Optionally, you can also set the number of threads that the program should use and the path to the configuration file (more on that below).

After running the commands above, nothing visible will happen, since the optimization runs in the background. To retrieve and print the current best formulas, use:

sim.refresh_functions() 
print(*sim.functions, sep='\n') 
print(sim.info) 

To stop the optimization and kill the TuringBot process, you should use the terminate_process method:

sim.terminate_process()

Using a configuration file

We have seen above that the start_process method may take the path to a configuration file as an optional input parameter. This is what the file should look like:

4 # Search metric. 1: Mean relative error, 2: Classification accuracy, 3: Mean error, 4: RMS error, 5: F1 score, 6: Correlation coefficient, 7: Hybrid (CC+RMS), 8: Maximum error, 9: Maximum relative error
-1 # Train/test split. -1: No cross validation. Valid options are: 50, 60, 70, 75, 80
1 # Test sample. 1: Chosen randomly, 2: The last points
0 # Integer constants only. 0: Disabled, 1: Enabled
0 # Bound search mode. 0: Deactivated, 1: Lower bound search, 2: Upper bound search
60 # Maximum formula complexity.
+ * / pow fmod sin cos tan asin acos atan exp log log2 sqrt sinh cosh tanh asinh acosh atanh abs floor ceil round tgamma lgamma erf # Allowed functions.

The comments after the # characters are for your convenience and are ignored. To change the search settings, all you have to do is change the numbers in each line. To change the base functions for the search, just add or delete their names from the last line.

Save the contents of the file above to a settings.cfg file and add the path of this file to the start_process method before calling it if you want to customize your search.
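Since the configuration is plain text, it can also be generated from Python itself. A minimal sketch that writes the settings above to disk (the metric numbers follow the line order documented above):

```python
# Write a TuringBot settings.cfg file from Python.
# Each line follows the order documented above.
config = """\
4 # Search metric: RMS error
-1 # Train/test split: no cross validation
1 # Test sample: chosen randomly
0 # Integer constants only: disabled
0 # Bound search mode: deactivated
60 # Maximum formula complexity
+ * / pow sin cos tan exp log sqrt abs floor ceil round # Allowed functions
"""

with open('settings.cfg', 'w') as f:
    f.write(config)
```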

Full example

Here is the full source code for the examples provided above. Note that you have to replace user in the paths with your local username, and that you need to create an input file (TXT or CSV format, one number per column) to use with the program.

Windows

import sys 
sys.path.insert(1, r'C:\Users\user\AppData\Local\Programs\TuringBot') 

import turingbot as tb 
import time

path = r'C:\Users\user\AppData\Local\Programs\TuringBot\TuringBot.exe' 
input_file = r'C:\Users\user\Desktop\input.txt' 
config_file = r'C:\Users\user\Desktop\settings.cfg' 

sim = tb.simulation() 
sim.start_process(path, input_file, threads=4, config=config_file) 

time.sleep(10)

sim.refresh_functions()
print(*sim.functions, sep='\n')
print(sim.info)

sim.terminate_process()

Linux

import sys 
sys.path.insert(1, '/usr/share/turingbot') 

import turingbot as tb 
import time 

path = r'/usr/bin/turingbot' 
input_file = r'/home/user/input.txt' 
config_file = r'/home/user/settings.cfg' 

sim = tb.simulation() 
sim.start_process(path, input_file, threads=4, config=config_file) 

time.sleep(10) 

sim.refresh_functions() 
print(*sim.functions, sep='\n') 
print(sim.info) 

sim.terminate_process()

Eureqa vs TuringBot for symbolic regression

Introduced in 2009, the Eureqa software gained great popularity with the promise that it could potentially be used to derive new physical laws from empirical data in an automatic way. Details of this reasoning can be found in the original paper, called Distilling Free-Form Natural Laws from Experimental Data.

In 2017 this software was acquired by a global consulting company called DataRobot and left the market. The promise of revolutionizing physics was never quite fulfilled, but the project had a major impact in raising awareness about symbolic regression.

Here we want to compare Eureqa to a more recent symbolic regression software called TuringBot.

About TuringBot

Similarly to Eureqa, TuringBot is a symbolic regression software. It has a simple graphical interface that allows the user to load a dataset and then try to find formulas that predict a target column taking as input the remaining columns:

The TuringBot interface.

This software was introduced in 2020, and unlike Eureqa it does not use a genetic algorithm to search for formulas, but a novel algorithm based on simulated annealing. While most references to symbolic regression in the literature involve genetic algorithms, our finding was that simulated annealing yields results much faster when implemented the right way.

Simulated annealing is inspired by a metallurgical process in which a metal is heated to a high temperature and then slowly cooled to attain better physical properties. The algorithm starts out “hot”, with worse solutions accepted very often, and over time it cools down and becomes stricter about the solutions it accepts. This allows it to escape local optima and find the global optimum in a stochastic way.
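To make the idea concrete, here is a minimal simulated annealing loop for a toy one-dimensional minimization. This is a generic sketch of the technique, not TuringBot's actual implementation:

```python
import math
import random

def simulated_annealing(f, x0, n_steps=10000, t0=1.0):
    """Minimize f starting from x0 with a geometric cooling schedule."""
    x, fx = x0, f(x0)
    best, fbest = x, fx
    for step in range(n_steps):
        t = t0 * 0.999 ** step           # the system slowly "cools down"
        y = x + random.gauss(0, 1.0)     # propose a random neighbor
        fy = f(y)
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature drops
        if fy < fx or random.random() < math.exp((fx - fy) / max(t, 1e-12)):
            x, fx = y, fy
            if fx < fbest:
                best, fbest = x, fx
    return best, fbest

random.seed(0)

# A bumpy objective with many local minima; the global minimum is at x = 0
bumpy = lambda x: x**2 + 10 * (1 - math.cos(2 * x))
x_min, f_min = simulated_annealing(bumpy, x0=8.0)
```

The thermal acceptance of worse moves early on is what lets the search hop between the local minima instead of getting stuck in the first one it finds.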

Pareto optimization

Both TuringBot and Eureqa implement the idea of searching for the best formula of each possible size, rather than a single optimal formula. This is the essence of Pareto optimization, and it results in a list of formulas of increasing complexity and accuracy to choose from.

A list of formulas of increasing complexity discovered by TuringBot.
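The filtering criterion itself is simple: a formula is kept only if no simpler formula achieves an equal or lower error. A small sketch with made-up models:

```python
def pareto_front(models):
    """Keep only the models not dominated by a simpler,
    at-least-as-accurate one. Each model is (complexity, error, formula)."""
    front = []
    for complexity, error, formula in sorted(models):
        # Walking in order of increasing complexity, a model joins the
        # front only if it improves on the best error seen so far
        if not front or error < front[-1][1]:
            front.append((complexity, error, formula))
    return front

# Hypothetical solutions, as (complexity, error, formula)
models = [
    (1, 1.91, "-0.0968"),
    (3, 1.46, "0.384*x"),
    (4, 1.36, "atan(x)"),
    (6, 1.50, "exp(x)-2"),   # dominated: a simpler model does better
    (15, 0.32, "cos(x)*(1.96-x)*tan(x)"),
]
```

Applied to this list, the dominated size-6 model is discarded and the other four survive.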

A handy feature offered by TuringBot is to create a train/test split for the optimization and see in real-time the test error for the solutions discovered so far. This allows overfit solutions to be spotted very easily.

Availability

TuringBot is available for both Windows and Linux. It can be downloaded for free, but it also has a paid plan with more functionalities.

The software is already being used by many researchers and engineers around the world to study topics including turbine design, materials science and zoology, and also by business owners to come up with pricing models and other applications.

You might also like our article on Symbolic Regression featured on Towards Data Science: Symbolic Regression: The Forgotten Machine Learning Method.

Decision boundary discovery with symbolic regression

An interesting classification problem is trying to find a decision boundary that separates two categories of points. For instance, consider the following cloud of points:

Clearly, we could hand draw a line that separates the two colors. But can this problem be solved in an automatic way?

Several machine learning methods could be used for this, including for instance a Support Vector Machine or AdaBoost. What all of these methods have in common is that they perform complex calculations under the hood and output a number; that is, they are black boxes. An interesting comparison of several of these methods can be found here.

A simpler and more elegant alternative is to try to find an explicit mathematical formula that separates the two categories. Not only would this be easier to compute, but it would also offer some insight into the data. This is where symbolic regression comes in.

Symbolic regression

The way to solve this problem with symbolic regression is to look for a formula that returns 0 for points of one category and 1 for points of another. That is, a formula for classification = f(x, y).

We can look for that formula by generating a CSV file with our points and loading it into TuringBot. Then we can run the optimization with classification accuracy as the search metric.

If we do that, the program ends up finding a simple formula with an accuracy of 100%:

classification = ceil(-1*tanh(round(x*y-cos((-2)*(y-x)))))

To visualize the decision boundary associated with this formula, we can generate some random points and keep track of the ones classified as orange. Then we can find the alpha shape that encompasses those points, which will be the decision boundary:

import alphashape
from descartes import PolygonPatch
import matplotlib.pyplot as plt
import numpy as np
from math import *

def f(x, y):
    return ceil(-1*tanh(round(x*y-cos((-2)*(y-x)))))

# Sample random points in [-1, 1] x [-1, 1] and keep the ones
# that the formula classifies as orange (f = 1)
pts = []
for i in range(10000):
    x = np.random.random()*2-1
    y = np.random.random()*2-1
    if f(x, y) == 1:
        pts.append([x, y])
pts = np.array(pts)

# The alpha shape of those points traces the decision boundary
alpha_shape = alphashape.alphashape(pts, 2.)

fig, ax = plt.subplots()
ax.add_patch(PolygonPatch(alpha_shape, alpha=0.2, fc='#ddd', zorder=100))
plt.show()

And this is the result:

It is worth noting that even though this was a 2D problem, the same procedure could have been carried out for a classification problem in any number of dimensions.

How to create an AI trading system

Predicting whether the price of a stock will rise or fall is perhaps one of the most difficult machine learning tasks. Signals must be found in datasets that are dominated by noise, and in a robust way that does not overfit the training data.

In this tutorial, we are going to show how an AI trading system can be created using a technique called symbolic regression. The idea will be to try to find a formula that classifies whether the price of a stock will rise or fall in the following day based on its price candles (open, high, low, close) in the last 14 days.

AI trading system concept

Our AI trading system will be a classification algorithm: it will take past data as input, and output 0 if the stock is likely to fall in the following day and 1 if it is likely to rise. The first step in generating this model is to prepare a training dataset in which each row contains all the relevant past data and also a 0 or 1 label based on what happened in the following day.

We can be very creative about what past data to use as input while generating the model. For instance, we could include technical indicators such as RSI and MACD, sentiment data, etc. But for the sake of this example, all we are going to use are the OHLC prices of the last 14 candles.

Our training dataset should then contain the following columns:

 open_1,high_1,low_1,close_1,...,open_14,high_14,low_14,close_14,label

Here the index 1 denotes the last trading day, the index 2 the trading day prior to that, etc.

Generating the training dataset

To make things interesting, we are going to train our model on data for the S&P 500 index over the last year, as retrieved from Yahoo Finance. The raw dataset can be found here: S&P 500.csv.

To process this CSV file into the format that we need for the training, we have created the following Python script which uses the Pandas library:

import pandas as pd

df = pd.read_csv('S&P 500.csv')

training_data = []

for i,row in df.iterrows():
    if i < 13 or i+1 >= len(df):
        continue

    features = []
    for j in range(i, i-14, -1):
        features.append(df.iloc[j]['Open'])
        features.append(df.iloc[j]['High'])
        features.append(df.iloc[j]['Low'])
        features.append(df.iloc[j]['Close'])
    if df.iloc[i+1]['Close'] > row['Close']:
        features.append(1)
    else:
        features.append(0)
    
    training_data.append(features)
    
columns = []
for i in range(1, 15):
    columns.append('open_%d' % i)
    columns.append('high_%d' % i)
    columns.append('low_%d' % i)
    columns.append('close_%d' % i)
columns.append('label')

training_data = pd.DataFrame(training_data, columns=columns)

training_data.to_csv('training.csv', index=False)

All this script does is iterate through the rows in the Yahoo Finance data and generate rows with the OHLC prices of the last 14 candles, and an additional ‘label’ column based on what happened in the following day. The result can be found here: training.csv.

Creating a model with symbolic regression

Now that we have the training dataset, we are going to try to find formulas that predict what will happen to the S&P 500 in the following day. For that, we are going to use the desktop symbolic regression software TuringBot. This is what the interface of the program looks like:

The interface of the TuringBot symbolic regression software.

The input file is selected from the menu on the upper left. We also select the following settings:

  • Search metric: classification accuracy.
  • Test/train split: 50/50. This will allow us to easily discard overfit models.
  • Test sample: the last points. The other option is “chosen randomly”, which would make it easier to overfit the data due to autocorrelation.

With these settings in place, we can start the search by clicking on the play button at the top of the interface. The best solutions found so far will be shown in real time, ordered by complexity, and their out-of-sample errors can be seen by toggling the “show cross validation” button on the upper right.

After letting the optimization run for a few minutes, these were the models that were encountered:

Symbolic models found for predicting S&P 500 returns.

The one with the best out-of-sample accuracy turned out to be the one with size 23. Its win rate in the test domain was 60.5%. This is the model:

label = 1-floor((open_5-high_4+open_12+tan(-0.541879*low_1-high_1))/high_13)

It can be seen that the model depends on the low and high of the most recent trading day, and on a few open and high prices from earlier days.
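For reference, the exported model is trivial to evaluate in Python (the parameter names follow the column layout defined earlier):

```python
from math import floor, tan

def predict(open_5, high_4, open_12, low_1, high_1, high_13):
    """Evaluate the size-23 model found above (label 1 = rise expected)."""
    return 1 - floor((open_5 - high_4 + open_12
                      + tan(-0.541879 * low_1 - high_1)) / high_13)
```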

Conclusion

In this tutorial, we have generated an AI trading signal using symbolic regression. This model had good out-of-sample accuracy in predicting what the S&P 500 would do the next day, using for that nothing but the OHLC prices of the last 14 trading days. Even better models could probably be obtained if more interesting past data was used for the training, such as technical indicators (RSI, MACD, etc).

You can generate your own models by downloading TuringBot for free from the official website. We encourage you to experiment with different stocks and timeframes to see what you can find.

How to create an equation for data points?

In order to find an equation from a list of values, a special technique called symbolic regression must be used. The idea is to search over the space of all possible mathematical formulas for the ones with the greatest accuracy, while trying to keep those formulas as simple as possible.

In this tutorial, we are going to show how to find formulas using the desktop symbolic regression software TuringBot, which is very easy to use.

How symbolic regression works

Symbolic regression starts from a set of base functions to be used in the search, such as addition, multiplication, sin(x), exp(x), etc, and then tries to combine those functions in all possible ways with the goal of finding a model that will be as accurate as possible in predicting a target variable. Some examples of base functions used by TuringBot are the following:

Some base functions that TuringBot uses for symbolic regression.

As important as the accuracy of a formula is its simplicity. A huge formula can predict the data points with perfect accuracy, but if the number of free parameters in the model equals the number of points, the model is not really informative. For this reason, a symbolic regression optimization will discard a larger formula if it finds a smaller one that performs just as well.

Finding a formula with TuringBot

Finding equations from data points with TuringBot is a simple process. The first step is selecting the input file with the data through the interface. This input file should be in TXT or CSV format. After it has been loaded, the target variable can be selected (by default it will be the last column in the file), and the search can be started. This is what the interface looks like:

The interface of the TuringBot symbolic regression software.

Several options are available on the menus on the left, such as setting a test/train split to be able to detect overfit solutions, selecting which base functions should be used, and selecting the search metric, which by default is root-mean-square error, but that can also be set to classification accuracy, mean relative error and others. For this example, we are going to keep it simple and just use the defaults.

The optimization is started by clicking on the play button at the top of the interface. The best formulas found so far will be shown in the solutions box, ordered by complexity:

The formulas found by TuringBot for an example dataset.

The software allows the solutions to be exported to common programming languages from the menu, and also to simply be exported as text. Here are the formulas in the example above exported in text format:

Complexity   Error      Function
1            1.91399    -0.0967549
3            1.46283    0.384409*x
4            1.362      atan(x)
5            1.18186    0.546317*x-1.00748
6            1.11019    asinh(x)-0.881587
9            1.0365     ceil(asinh(x))-1.4131
13           0.985787   round(tan(floor(0.277692*x)))
15           0.319857   cos(x)*(1.96036-x)*tan(x)
19           0.311375   cos(x)*(1.98862-1.02261*x)*tan(1.00118*x)
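Because the solutions are plain mathematical expressions, any of them can be dropped directly into code. For instance, the size-15 formula translated verbatim to Python:

```python
from math import cos, tan

# The size-15 formula from the table above, translated verbatim
def f(x):
    return cos(x) * (1.96036 - x) * tan(x)
```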

Conclusion

In this tutorial, we have seen how symbolic regression can be used to find formulas from values. Symbolic regression is very different from regular curve-fitting methods, since no assumption is made about what the shape of the formulas should be. This allows patterns to be found in datasets with an arbitrary number of dimensions, making symbolic regression a general purpose machine learning technique.

Machine learning black box models: some alternatives

In this article, we will discuss a very basic question regarding machine learning: is every model a black box? Certainly most methods seem to be, but as we will see, there are very interesting exceptions to this.

What is a black box method?

A method is said to be a black box when it performs complicated computations under the hood that cannot be clearly explained and understood. Data is fed into the model, internal transformations are performed on this data and an output is given, but these transformations are such that basic questions cannot be answered in a straightforward way:

  • Which of the input variables contributed the most to generating the output?
  • Exactly what features did the model derive from the input data?
  • How does the output change as a function of one of the variables?

Not only are black box models hard to understand, they are also hard to port: since complicated data structures are needed for the relevant computations, they cannot be readily translated to other programming languages.

Can there be machine learning without black boxes?

The answer to that question is yes. In the simplest case, a machine learning model can be a linear regression and consist of a line defined by an explicit algebraic equation. This is not a black box method, since it is clear how the variables are being used to compute an output.

But linear models are quite limited and cannot perform the same kinds of tasks that neural networks do, for example. So a more interesting question is: is there a machine learning method capable of finding nonlinear patterns in an explicit and understandable way?

It turns out that such a method exists, and it is called symbolic regression.

Symbolic regression as an alternative

The idea of symbolic regression is to find explicit mathematical formulas that connect input variables to an output, while trying to keep those formulas as simple as possible. The resulting models end up being explicit equations that can be written on a sheet of paper, making it apparent how the input variables are being used despite the presence of nonlinear computations.

To give a clearer picture, consider some models found by TuringBot, a symbolic regression software for PC:

Symbolic models found by the TuringBot symbolic regression software.

In the “Solutions” box above, a typical result of a symbolic regression optimization can be seen. A set of formulas of increasing complexity was found, with more complex formulas only being shown if they perform better than all simpler alternatives. A nonlinearity in the input dataset was successfully recovered through the use of nonlinear base functions like cos(x), atan(x) and multiplication.

Symbolic regression is a very general technique: although the most obvious use case is to solve regression problems, it can also be used to solve classification problems by representing categorical variables as different integer numbers, and running the optimization with classification accuracy as the search metric instead of RMS error. Both of these options are available in TuringBot.
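As an illustration of that classification workflow, the categorical encoding and the accuracy metric might look like this (a generic sketch, not TuringBot code):

```python
# Encode categorical labels as integers, then score a candidate model
# by classification accuracy instead of RMS error.
labels = ['cat', 'dog', 'cat', 'dog', 'dog']
encoding = {name: i for i, name in enumerate(sorted(set(labels)))}
y_true = [encoding[l] for l in labels]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match after rounding to integers."""
    hits = sum(round(p) == t for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

y_pred = [0.1, 0.9, 0.4, 1.2, 0.8]   # raw outputs of some candidate formula
```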

Conclusion

In this article, we have seen that although most machine learning methods are indeed black boxes, not all of them are. A simple counterexample is the linear model, which is explicit and hence not a black box. More interestingly, we have seen that symbolic regression is capable of solving machine learning tasks where nonlinear patterns are present, generating models that are mathematical equations that can be analyzed and interpreted.

A regression model example and how to generate it

Regression models are perhaps the most important class of machine learning models. In this tutorial, we will show how to easily generate a regression model from data values.

What regression is

The goal of a regression model is to be able to predict a target variable taking as input one or more input variables. The simplest case is that of a linear relationship between the variables, in which case basic methods such as least squares regression can be used.
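For instance, the linear case can be solved in a couple of lines with NumPy's least-squares polynomial fit:

```python
import numpy as np

# Fit y = a*x + b by least squares
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0           # exactly linear data for illustration

a, b = np.polyfit(x, y, 1)  # recovers the slope and the intercept
```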

In real-world datasets, the relationship between the variables is often highly nonlinear. This motivates the use of more sophisticated machine learning techniques to solve regression problems, including for instance neural networks and random forests.

A regression problem example is to predict the value of a house from its characteristics (location, number of bedrooms, total area, etc), using for that information from other houses which are not identical to it but for which the prices are known.

Regression model example

To give a concrete example, let's consider the following public dataset of house prices: x26.txt. This file contains a long uncommented header; a stripped-down version that is compatible with TuringBot can be found here: house_prices.txt. The columns present are the following:

Index;
Local selling prices, in hundreds of dollars;
Number of bathrooms;
Area of the site in thousands of square feet;
Size of the living space in thousands of square feet;
Number of garages;
Number of rooms;
Number of bedrooms;
Age in years;
Construction type (1=brick, 2=brick/wood, 3=aluminum/wood, 4=wood);
Number of fire places;
Selling price.

The goal is to predict the last column, the selling price, as a function of all the other variables. In order to do that, we are going to use a technique called symbolic regression, which attempts to find explicit mathematical formulas that connect the input variables to the target variable.

We will use the desktop software TuringBot, which can be downloaded for free, to find that regression model. The usage is quite straightforward: you load the input file through the interface, select which variable is the target and which variables should be used as input, and then start the search. This is what its interface looks like with the data loaded in:

The TuringBot interface.

We have also enabled the cross validation feature with a 50/50 test/train split (see the “Search options” menu in the image above). This will allow us to easily discard overfit formulas.

After running the optimization for a few minutes, the formulas found by the program and their corresponding out-of-sample errors were the following:

The regression models found for the house prices.

The highlighted one turned out to be the best: more complex solutions did not offer increased out-of-sample accuracy. Its mean relative error on the test dataset was roughly 8%. Here is that formula:

price = fire_place+15.5668+(1.66153+bathrooms)*local_price

The variables that are present in it are only three: the number of bathrooms, the number of fire places and the local price. It is a completely non-trivial fact that the house price should only depend on these three parameters, but the symbolic regression optimization made this fact evident.
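For convenience, the model can be written as a small Python function (the parameter names here are chosen to match the column descriptions above):

```python
def predict_price(fire_places, bathrooms, local_price):
    """Evaluate the regression model found above.

    local_price is the local selling price in hundreds of dollars,
    matching the units of the input dataset."""
    return fire_places + 15.5668 + (1.66153 + bathrooms) * local_price
```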

Conclusion

In this tutorial, we have seen an example of generating a regression model. The technique that we used was symbolic regression, implemented in the desktop software TuringBot. The model that was found had a good out-of-sample accuracy in predicting the prices of houses based on their characteristics, and it allowed us to clearly see the most relevant variables in estimating that price.