In this tutorial, we show how to find a formula from your data using the symbolic regression software TuringBot. It is a desktop application that runs on both Windows and Linux, and as you will see, it is very simple to use.
Preparing the data
TuringBot takes as input files in .txt or CSV format containing one variable per column. The first row may contain the names of the variables; otherwise, they will be labeled col1, col2, col3, etc.
For instance, the following is a valid input file:
x y z w classification
5.20 2.70 3.90 1.40 1
6.50 2.80 4.60 1.50 1
7.70 2.80 6.70 2.00 2
5.90 3.20 4.80 1.80 1
5.00 3.50 1.60 0.60 0
5.10 3.50 1.40 0.20 0
4.60 3.10 1.50 0.20 0
6.90 3.20 5.70 2.30 2
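Before loading a file, it can help to check that it follows the layout described above. The following is a minimal Python sketch, not part of TuringBot: the function name load_table is hypothetical, the whitespace delimiter is taken from the example file, and treating a first row with any non-numeric field as the header is an assumption about how such a file might be interpreted.

```python
def load_table(text):
    """Parse whitespace-separated text into (names, rows).

    Assumption: the first row is a header if any of its fields is not a number.
    """
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty input")

    def is_number(token):
        try:
            float(token)
            return True
        except ValueError:
            return False

    if all(is_number(t) for t in lines[0]):
        # No header row: fall back to the default labels col1, col2, ...
        names = [f"col{i}" for i in range(1, len(lines[0]) + 1)]
        data = lines
    else:
        names, data = lines[0], lines[1:]

    rows = []
    for n, fields in enumerate(data, start=1):
        # Every row must provide one value per variable.
        if len(fields) != len(names):
            raise ValueError(
                f"data row {n} has {len(fields)} fields, expected {len(names)}"
            )
        rows.append([float(t) for t in fields])
    return names, rows


sample = """x y z w classification
5.20 2.70 3.90 1.40 1
6.50 2.80 4.60 1.50 1
7.70 2.80 6.70 2.00 2
"""
names, rows = load_table(sample)
print(names)
print(len(rows), "rows")
```

A file that passes this check has one variable per column and a consistent number of values in every row.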
Loading the data into TuringBot
This is what the program looks like when you open it:
By clicking on the “Input file” button on the upper left, you can select your input file and load it. Different search metrics are available, including, for instance, classification accuracy. A handy cross-validation feature can also be enabled in the “Search options” box: if enabled, it will automatically create a test/train split and allow you to see the out-of-sample error as the optimization goes on. In this example, we are going to keep things simple and just use the defaults.
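To make the idea of a test/train split and an out-of-sample error concrete, here is a small Python sketch. It is only an illustration of the general technique: how TuringBot builds its split internally is not described here, and the names train_test_split, rms_error and candidate, the 50% split, and the synthetic data are all assumptions made for this example.

```python
import random


def train_test_split(rows, train_fraction=0.5, seed=0):
    """Shuffle the rows reproducibly and cut them into train and test parts."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def rms_error(model, rows):
    """Root mean square error of model(x) against y over (x, y) pairs."""
    return (sum((model(x) - y) ** 2 for x, y in rows) / len(rows)) ** 0.5


# Synthetic (x, y) pairs generated from y = 2x + 1, for illustration only.
data = [(float(x), 2.0 * x + 1.0) for x in range(10)]
train, test = train_test_split(data, train_fraction=0.5, seed=42)


def candidate(x):
    # A candidate formula, as if it had been fitted on the train rows only.
    return 2.0 * x + 1.0


print("train size:", len(train), "test size:", len(test))
print("out-of-sample RMS error:", rms_error(candidate, test))
```

The out-of-sample error is computed only on rows the search never saw, which is what makes it useful for spotting formulas that overfit.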
Finding the formulas
After loading the data, you can click on the play button at the top of the interface to start the optimization. The best formulas found so far will be shown in the “Solutions” box, in ascending order of complexity. A formula is only shown if its accuracy is greater than that of all simpler alternatives, because in symbolic regression the goal is not simply to find a formula, but to find the simplest one possible.
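The rule for which formulas get listed can be sketched in a few lines of Python. This is an illustration of the filtering idea only, not TuringBot's implementation: the function name simplest_best is hypothetical, accuracy is represented here as an error value (lower is better), and the complexity and error numbers are made up for the example.

```python
def simplest_best(candidates):
    """Keep candidates whose error beats every simpler candidate.

    candidates: iterable of (complexity, error, formula) tuples.
    Returns the kept tuples in ascending order of complexity.
    """
    shown = []
    best_error = float("inf")
    # Sort by complexity, then by error, so the best formula at each
    # complexity level is considered first.
    for complexity, error, formula in sorted(candidates, key=lambda c: (c[0], c[1])):
        if error < best_error:
            shown.append((complexity, error, formula))
            best_error = error
    return shown


# Made-up candidates: (complexity, error, formula).
candidates = [
    (4, 0.40, "x/(x+1)"),
    (1, 1.20, "2.1"),
    (5, 0.00, "sqrt(x)"),
    (3, 0.35, "0.3*x+1.1"),
]

for complexity, error, formula in simplest_best(candidates):
    print(complexity, error, formula)
```

In this made-up list, x/(x+1) is dropped because a simpler candidate already achieves a lower error.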
Here are the formulas it found for an example dataset:
The formulas are all written in a format that is compatible out of the box with Python and C. Indeed, the menu on the upper right allows you to export the solutions to these languages:
In this example, the true formula turned out to be sqrt(x), which was recovered in a few seconds. The methodology would be the same for a real-world dataset with many input variables and an unknown dependency between them.
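Once a formula has been found, it can be checked outside the program. The sketch below assumes the formula is the plain expression sqrt(x) mentioned above and wraps it in a Python function. The function name formula and the check points are hypothetical, and the points were generated from the known square-root relationship rather than taken from a real dataset.

```python
from math import sqrt


def formula(x):
    # The recovered expression, pasted in as-is.
    return sqrt(x)


# Synthetic (x, y) check points that follow y = sqrt(x) exactly.
points = [(0.0, 0.0), (1.0, 1.0), (4.0, 2.0), (9.0, 3.0), (16.0, 4.0)]

max_abs_error = max(abs(formula(x) - y) for x, y in points)
print("largest absolute error:", max_abs_error)
```

For a real-world dataset, the same check would be run against held-out rows of the input file instead of synthetic points.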
How to get TuringBot
If you liked this tutorial, we encourage you to download TuringBot for free from the official website. As we have shown, it is very simple to use, and its powerful mathematical modeling capabilities allow you to find very subtle numerical patterns in your data, much like a scientist would do from empirical observations, but in an automatic way and millions of times faster.