Many people on the internet are interested in an alternative to the Eureqa symbolic regression software, which used to be developed by a company called Nutonian and is no longer available. Such an alternative exists, and it is called TuringBot!
You can download the program on this website and try it for free for 14 days, no credit card required. It uses a simulated annealing algorithm to find formulas that connect variables, allowing regression and classification problems to be solved in a very powerful way. Many engineers, researchers and data scientists from around the world are already making use of it with great results.
We believe that our symbolic regression implementation yields better results than the genetic algorithm used in Eureqa, which seems to restrict itself too much to simple, less recursive formulas. Eureqa often produces polynomial fits that diverge and lose usefulness outside the domain of the training dataset. We also find that TuringBot is noticeably faster than Eureqa.
If you need help getting started with TuringBot, we have some tutorials that you might find useful:
In this article, we are going to show how a symbolic regression model can be visualized using the R programming language. The model will be generated using the TuringBot symbolic regression software, and we are going to use the ggplot2 library for the visualization.
The dataset that we are going to use consists of the closing prices for the S&P 500 index in the last year, downloaded from Yahoo Finance. The CSV file, which also contains additional columns like open, high, low and volume, can be found here: spx.csv
After opening TuringBot and selecting this file from the menu on the upper left of the interface, we select “Row number” as the input variable and “Close” as the target variable. This way, our model will find the close price as a function of the index of the trading day (1, 2, 3, etc.). We will also use a randomly selected 50:50 train/test split, so that each model can be checked against data it was not fitted to, and “mean relative error” as the optimization metric, because we are more interested in the shape of the model than in specific values.
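As a rough illustration of these two settings, the Python sketch below splits the rows at random into two halves and scores a candidate model by mean relative error. It assumes the metric is the average of |prediction - actual| / |actual|, which is the usual definition and may differ in detail from what TuringBot computes. The function names, the sample prices, and the placeholder model are all hypothetical.

```python
import random

def mean_relative_error(actual, predicted):
    """Average of |predicted - actual| / |actual| (assumed definition)."""
    return sum(abs(p - a) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

def split_half(rows, seed=0):
    """Randomly assign half of the rows to training and the rest to testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]

# Hypothetical (row number, close price) pairs, not taken from spx.csv.
rows = [(1, 100.0), (2, 102.0), (3, 101.0), (4, 105.0), (5, 104.0), (6, 108.0)]
train, test = split_half(rows)

def model(x):
    # Placeholder candidate formula, for illustration only.
    return 99.0 + 1.5 * x

for name, subset in (("train", train), ("test", test)):
    actual = [close for _, close in subset]
    predicted = [model(day) for day, _ in subset]
    print(name, round(mean_relative_error(actual, predicted), 4))
```

Because the error is divided by the actual value, a miss of 10 points counts the same whether the index is high or low in relative terms, which is why this metric favors getting the overall shape right.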
This is what the interface will look like:
Clicking on the play button at the top starts the optimization, which uses all the CPU cores in the computer for greater performance. The models encountered so far are shown in the “Solutions” box.
After letting the optimization run for a few minutes, we can click on the “Show cross validation error” box on the upper right of the interface to see the out-of-sample performance of each model, and use this information to select the best one, which in this case turned out to be a combination of cosines and multiplications:
Now that we have the model, we are going to visualize it using ggplot2. The following script loads the input CSV file and plots it along with the model that we just selected:
This demonstrates the power and simplicity of symbolic regression models: we have managed to readily implement and visualize in R a symbolic model generated using TuringBot, something that would be much harder if the model were a black box like a neural network or a random forest.
Predicting rare events is a machine learning problem of great practical importance, and also a very difficult one. Models of this kind need to be trained on highly imbalanced datasets, and are used, among other things, for spotting fraudulent online transactions and detecting anomalies in medical images.
In this article, we show how such problems can be modeled using Symbolic Regression, a technique which attempts to find mathematical formulas that predict a desired variable from a set of input variables. Symbolic models, contrary to more mainstream ones like neural networks and random forests, are not black boxes, since they clearly show which variables are being used and how. They are also very fast and easy to implement, since no complex data structures are involved in the calculations.
In order to provide a real world example, we will try to model the credit card fraud dataset available on Kaggle using our Symbolic Regression software TuringBot. The dataset consists of a CSV file containing 284,807 transactions, one per row, out of which 492 are frauds. The first 28 columns represent anonymized features, and the last one contains “0” for legitimate transactions and “1” for fraudulent ones.
Prior to the regression, we remove all quotation mark characters from the file, so that those two categories are recognized as numbers by the software.
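One way to do this preprocessing is sketched below in Python. The file names `creditcard.csv` and `creditcard_clean.csv` are assumptions, and the helper names are hypothetical. The approach simply deletes every double-quote character, which is safe only when no field contains a quoted comma.

```python
def strip_quotes(text):
    """Remove every double-quote character from a block of CSV text."""
    return text.replace('"', "")

def clean_file(source_path, target_path):
    """Copy a CSV file line by line, dropping quotation marks."""
    with open(source_path, encoding="utf-8") as source, \
         open(target_path, "w", encoding="utf-8") as target:
        for line in source:
            target.write(strip_quotes(line))

# Example on an in-memory row with made-up values.
print(strip_quotes('0.5,-1.2,"1"'))

# To process the real file (assumed names), uncomment:
# clean_file("creditcard.csv", "creditcard_clean.csv")
```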
Creating the model
Generating symbolic models using TuringBot is a straightforward process, which requires no data science skills. The first step is to open the program and load the input file by clicking on the “Input” button, shown below. After loading, the program will automatically define the column “Class” as the target variable and all other ones as input variables, which is what we want.
Then, we select “F1 score” as the error metric for the search, which is the appropriate choice for binary classification problems on highly imbalanced datasets like this one. This metric corresponds to the harmonic mean of the precision and the recall of the model. A very illustrative image that explains what precision and recall are can be found on the Wikipedia page for F1 score.
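To see why plain accuracy is not enough here, note that a model that labels every transaction as legitimate would be correct for about 99.8% of the rows (492 frauds out of 284,807) while detecting nothing. The sketch below computes precision, recall, and F1 = 2 * precision * recall / (precision + recall) from binary labels. The function name and the labels are made up for illustration and are not part of TuringBot.

```python
def precision_recall_f1(actual, predicted):
    """Return (precision, recall, F1) for binary labels where 1 marks a fraud."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == 1 and p == 1)
    false_pos = sum(1 for a, p in pairs if a == 0 and p == 1)
    false_neg = sum(1 for a, p in pairs if a == 1 and p == 0)
    # Guard against division by zero when nothing is flagged or nothing is fraud.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up labels: 3 real frauds among 10 transactions.
actual    = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
predicted = [0, 0, 1, 1, 0, 1, 0, 0, 0, 0]
print(precision_recall_f1(actual, predicted))
```

In this toy case two of the three flagged transactions are real frauds and two of the three real frauds are flagged, so precision, recall, and F1 all come out at two thirds.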
That’s it! After those two steps, the search is ready to start. Just click on the “play” button at the top of the interface. The best solutions that the program has encountered so far will be shown in the “Solutions” box in real time.
Bear in mind that this is a relatively large dataset, and that it may seem like not much is going on in the first minutes of the optimization. Ideally, you should leave the program running until at least a few million formulas have been tested (you can see the number so far in the Log tab). On a modest i7-3770 CPU with 8 threads, this took us about 6 hours. A more powerful CPU would take less time.
The models that were encountered by the program after this time were the following:
The error for the best one is 0.17, meaning its F1 score is 1 – 0.17 = 0.83. Since the F1 score always lies between the precision and the recall, this suggests that both are in the vicinity of 83%. In a verification using Python, we have found that the recall is 80% and the precision is 87%.
So what does this mean? It means that the mathematical formula found by our program is capable of detecting 80% of all frauds in the dataset, and that it is right 87% of the time when it claims that a fraud is taking place! This result is consistent with the best machine learning methods available.
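As a quick consistency check, the reported precision and recall can be plugged back into the F1 formula. This is only an arithmetic sketch, not the verification script used for the article; the function name is hypothetical.

```python
def f1_from(precision, recall):
    """F1 score from precision and recall: 2PR / (P + R)."""
    return 2 * precision * recall / (precision + recall)

precision, recall = 0.87, 0.80   # values reported in the text
f1 = f1_from(precision, recall)

print(round(f1, 2))       # 0.83
print(round(1 - f1, 2))   # 0.17, the error shown by the program
```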
In this article, we have demonstrated that our Symbolic Regression software TuringBot is able to generate models that classify credit card frauds in a real world dataset with high precision and high recall. We believe that this kind of modeling capability, combined with the transparency and efficiency of the generated models, is very useful for those interested in developing machine learning models for the classification and prediction of rare events.