To find an equation from a list of values, a technique called symbolic regression can be used. The idea is to search the space of mathematical formulas for the ones that fit the data most accurately, while keeping those formulas as simple as possible.
In this tutorial, we are going to show how to find formulas using the desktop symbolic regression software TuringBot, which is very easy to use.
How symbolic regression works
Symbolic regression starts from a set of base functions to be used in the search, such as addition, multiplication, sin(x), and exp(x). It then combines those functions in many different ways, looking for a model that predicts a target variable as accurately as possible. Base functions used by TuringBot include, for example, cos(x), tan(x), atan(x), asinh(x), floor(x), ceil(x), and round(x), all of which appear in the solutions listed later in this tutorial.
As important as the accuracy of a formula is its simplicity. A huge formula can predict the data points with perfect accuracy, but if the model has as many free parameters as there are data points, it is not informative: it has only memorized the data. For this reason, a symbolic regression optimization will discard a larger formula if it finds a smaller one that performs just as well.
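To make the idea concrete, here is a minimal Python sketch of a brute-force symbolic regression. It is not TuringBot's algorithm, which is not described here. The base functions, the nested-tuple expression format, and the node-count complexity measure are all assumptions chosen for illustration. The sketch enumerates small formulas, scores each by root-mean-square error, and keeps a larger formula only if it beats every smaller one.

```python
import math

# Illustrative base functions only; these choices are assumptions of the sketch.
UNARY = {"sin": math.sin, "cos": math.cos}
BINARY = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}

def evaluate(expr, x):
    """Evaluate a nested-tuple expression such as ("*", "x", ("sin", "x"))."""
    if expr == "x":
        return x
    if len(expr) == 2:
        return UNARY[expr[0]](evaluate(expr[1], x))
    return BINARY[expr[0]](evaluate(expr[1], x), evaluate(expr[2], x))

def complexity(expr):
    """Count the nodes in the expression tree (a hypothetical complexity measure)."""
    if expr == "x":
        return 1
    return 1 + sum(complexity(child) for child in expr[1:])

def enumerate_expressions(depth):
    """Build every expression reachable by nesting base functions `depth` times."""
    exprs = ["x"]
    for _ in range(depth):
        grown = [(name, e) for name in UNARY for e in exprs]
        grown += [(name, a, b) for name in BINARY for a in exprs for b in exprs]
        exprs = list(dict.fromkeys(exprs + grown))  # drop duplicates, keep order
    return exprs

def rmse(expr, xs, ys):
    squared = sum((evaluate(expr, x) - y) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(squared / len(xs))

def search(xs, ys, depth=2):
    """Return (complexity, error, expression) tuples, simplest first."""
    best = {}  # complexity -> (error, expression)
    for expr in enumerate_expressions(depth):
        err, size = rmse(expr, xs, ys), complexity(expr)
        if size not in best or err < best[size][0]:
            best[size] = (err, expr)
    # Discard a larger formula unless it is more accurate than all smaller ones.
    front, lowest = [], math.inf
    for size in sorted(best):
        err, expr = best[size]
        if err < lowest:
            front.append((size, err, expr))
            lowest = err
    return front

# Synthetic data generated from y = x * sin(x)
xs = [0.5 * i for i in range(1, 9)]
ys = [x * math.sin(x) for x in xs]
front = search(xs, ys)
for size, err, expr in front:
    print(size, round(err, 6), expr)
```

Exhaustive enumeration like this grows explosively with depth, so it only works for toy problems. Practical tools rely on heuristic search instead.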
Finding a formula with TuringBot
Finding equations from data points with TuringBot is a simple process. The first step is selecting the input file with the data through the interface. This input file should be in TXT or CSV format. After it has been loaded, the target variable can be selected (by default it will be the last column in the file), and the search can be started. This is what the interface looks like:
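Before loading a file, it can help to check how its columns will be interpreted. The following Python sketch is independent of TuringBot. It assumes a comma-separated file with a header row, and the helper names `load_table` and `split_target` are hypothetical. It mirrors the default described above by treating the last column as the target unless another one is named.

```python
import csv
import io

def load_table(text):
    """Parse CSV text with a header row into (header, rows of floats)."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    header = rows[0]
    data = [[float(value) for value in row] for row in rows[1:]]
    return header, data

def split_target(header, data, target=None):
    """Separate the target column from the inputs; defaults to the last column."""
    index = header.index(target) if target is not None else len(header) - 1
    input_names = [name for j, name in enumerate(header) if j != index]
    inputs = [[v for j, v in enumerate(row) if j != index] for row in data]
    targets = [row[index] for row in data]
    return input_names, inputs, header[index], targets

# Made-up sample data, for illustration only
sample = "x1,x2,y\n1.0,2.0,3.0\n4.0,5.0,9.0\n"
header, data = load_table(sample)
input_names, inputs, target_name, targets = split_target(header, data)
print(target_name, targets)
```

Passing `target="x1"` to `split_target` would pick the first column as the target instead.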
Several options are available in the menus on the left. You can set a train/test split to detect overfit solutions, select which base functions should be used, and choose the search metric. The default metric is the root-mean-square error, but it can also be set to classification accuracy, mean relative error, and others. For this example, we are going to keep it simple and use the defaults.
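As a rough illustration of two of these metrics and of why a train/test split exposes overfitting, here is a short Python sketch. The formulas are the textbook definitions, and TuringBot's exact definitions may differ. The simple "first rows for training" split and the function names are assumptions made for this example.

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error: penalizes large absolute deviations."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))

def mean_relative_error(predicted, actual):
    """Mean of |error| / |actual|; assumes no actual value is zero."""
    ratios = [abs(p - a) / abs(a) for p, a in zip(predicted, actual)]
    return sum(ratios) / len(ratios)

def train_test_split(rows, train_fraction=0.75):
    """Assumed split rule: the first rows train, the remaining rows test."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

# Made-up (x, y) pairs that roughly follow y = 2x
rows = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.0), (5, 9.8), (6, 12.1), (7, 14.0), (8, 16.2)]
train, test = train_test_split(rows)

def model(x):
    return 2 * x  # a candidate formula to score

train_error = rmse([model(x) for x, _ in train], [y for _, y in train])
test_error = rmse([model(x) for x, _ in test], [y for _, y in test])
print(train_error, test_error)
```

A formula whose test error is much larger than its training error has probably fitted noise in the training rows rather than the underlying pattern.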
The optimization is started by clicking on the play button at the top of the interface. The best formulas found so far will be shown in the solutions box, ordered by complexity:
The software allows the solutions to be exported to common programming languages from the menu, and also to be exported as text. Here are the formulas in the example above exported in text format:
Complexity  Error     Function
1           1.91399   -0.0967549
3           1.46283   0.384409*x
4           1.362     atan(x)
5           1.18186   0.546317*x-1.00748
6           1.11019   asinh(x)-0.881587
9           1.0365    ceil(asinh(x))-1.4131
13          0.985787  round(tan(floor(0.277692*x)))
15          0.319857  cos(x)*(1.96036-x)*tan(x)
19          0.311375  cos(x)*(1.98862-1.02261*x)*tan(1.00118*x)
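A formula exported as text can be evaluated directly in code. The Python sketch below is a hand translation of the complexity-15 solution, not output generated by TuringBot, and the function names are hypothetical. It also checks a simplification: since cos(x)*tan(x) equals sin(x) wherever cos(x) is nonzero, the formula is equivalent to (1.96036-x)*sin(x).

```python
import math

def exported_formula(x):
    # Hand-translated from the complexity-15 row: cos(x)*(1.96036-x)*tan(x)
    return math.cos(x) * (1.96036 - x) * math.tan(x)

def simplified_formula(x):
    # Uses cos(x)*tan(x) = sin(x), valid wherever cos(x) != 0
    return (1.96036 - x) * math.sin(x)

# Arbitrary sample points, chosen away from the zeros of cos(x)
sample_points = [-2.0, -1.0, 0.0, 0.5, 1.0, 2.0, 3.0]
largest_gap = max(
    abs(exported_formula(x) - simplified_formula(x)) for x in sample_points
)
print(largest_gap)
```

The two versions agree up to floating-point rounding at these points, which suggests the search found a sine wave with a linearly varying amplitude.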
Conclusion
In this tutorial, we have seen how symbolic regression can be used to find formulas from values. Symbolic regression is very different from regular curve-fitting methods since no assumption is made about what the shape of the formulas should be. This allows patterns to be found in datasets with an arbitrary number of dimensions, making symbolic regression a general-purpose machine learning technique.