An interesting classification problem is trying to find a decision boundary that separates two categories of points. For instance, consider the following cloud of points:

Clearly, we could hand draw a line that separates the two colors. But can this problem be solved in an automatic way?

Several machine learning methods could be used for this, including for instance a Support Vector Machine or AdaBoost. What all of these methods have in common is that they perform complex calculations under the hood and spill out some number, that is, they are black boxes. An interesting comparison of several of these methods can be found here.

A simpler and more elegant alternative is to try to find an explicit mathematical formula that separates the two categories. Not only would this be easier to compute, but it would also offer some insight into the data. This is where symbolic regression comes in.

Symbolic regression

The way to solve this problem with symbolic regression is to look for a formula that returns 0 for points of one category and 1 for points of another. That is, a formula for classification = f(x, y).

We can look for that formula by generating a CSV file with our points and loading it into TuringBot. Then we can run the optimization with classification accuracy as the search metric.

If we do that, the program ends up finding a simple formula with an accuracy of 100%:

To visualize the decision boundary associated with this formula, we can generate some random points and keep track of the ones classified as orange. Then we can find the alpha shape that encompasses those points, which will be the decision boundary:

import alphashape
from descartes import PolygonPatch
import numpy as np
from math import *
def f(x, y):
return ceil(-1*tanh(round(x*y-cos((-2)*(y-x)))))
pts = []
for i in range(10000):
x = np.random.random()*2-1
y = np.random.random()*2-1
if f(x, y) == 1:
pts.append([x, y])
pts = np.array(pts)
alpha_shape = alphashape.alphashape(pred, 2.)
fig, ax = plt.subplots()
ax.add_patch(PolygonPatch(alpha_shape, alpha=0.2, fc='#ddd', zorder=100))

And this is the result:

It is worth noting that even though this was a 2D problem, the same procedure could have been carried out for a classification problem in any number of dimensions.

Predicting whether the price of a stock will rise or fall is perhaps one of the most difficult machine learning tasks. Signals must be found on datasets which are dominated by noise, and in a robust way that will not overfit the training data.

In this tutorial, we are going to show how an AI trading system can be created using a technique called symbolic regression. The idea will be to try to find a formula that classifies whether the price of a stock will rise or fall in the following day based on its price candles (open, high, low, close) in the last 14 days.

AI trading system concept

Our AI trading system will be a classification algorithm: it will take past data as input, and output 0 if the stock is likely to fall in the following day and 1 if it is likely to rise. The first step in generating this model is to prepare a training dataset in which each row contains all the relevant past data and also a 0 or 1 label based on what happened in the following day.

We can be very creative about what past data to use as input while generating the model. For instance, we could include technical indicators such as RSI and MACD, sentiment data, etc. But for the sake of this example, all we are going to use are the OHLC prices of the last 14 candles.

Our training dataset should then contain the following columns:

Here the index 1 denotes the last trading day, the index 2 the trading day prior to that, etc.

Generating the training dataset

To make things interesting, we are going to train our model on data for the S&P 500 index over the last year, as retrieved from Yahoo Finance. The raw dataset can be found here: S&P 500.csv.

To process this CSV file into the format that we need for the training, we have created the following Python script which uses the Pandas library:

import pandas as pd
df = pd.read_csv('S&P 500.csv')
training_data = []
for i,row in df.iterrows():
if i < 13 or i+1 >= len(df):
continue
features = []
for j in range(i, i-14, -1):
features.append(df.iloc[j]['Open'])
features.append(df.iloc[j]['High'])
features.append(df.iloc[j]['Low'])
features.append(df.iloc[j]['Close'])
if df.iloc[i+1]['Close'] > row['Close']:
features.append(1)
else:
features.append(0)
training_data.append(features)
columns = []
for i in range(1, 15):
columns.append('open_%d' % i)
columns.append('high_%d' % i)
columns.append('low_%d' % i)
columns.append('close_%d' % i)
columns.append('label')
training_data = pd.DataFrame(training_data, columns=columns)
training_data.to_csv('training.csv', index=False)

All this script does is iterate through the rows in the Yahoo Finance data and generate rows with the OHLC prices of the last 14 candles, and an additional ‘label’ column based on what happened in the following day. The result can be found here: training.csv.

Creating a model with symbolic regression

Now that we have the training dataset, we are going to try to find formulas that predict what will happen to the S&P 500 in the following day. For that, we are going to use the desktop symbolic regression software TuringBot. This is what the interface of the program looks like:

The input file is selected from the menu on the upper left. We also select the following settings:

Search metric: classification accuracy.

Test/train split: 50/50. This will allow us to easily discard overfit models.

Test sample: the last points. The other option is “chosen randomly”, which would make it easier to overfit the data due to autocorrelation.

With these settings in place, we can start the search by clicking on the play button at the top of the interface. The best solutions found so far will be shown in real time, ordered by complexity, and their out-of-sample errors can be seen by toggling the “show cross validation” button on the upper right.

After letting the optimization run for a few minutes, these were the models that were encountered:

The one with the best ouf-of-sample accuracy turned out to be the one with size 23. Its win rate in the test domain was 60.5%. This is the model:

It can be seen that it depends on the low and high of the current day, and also on a few key parameters of previous days.

Conclusion

In this tutorial, we have generated an AI trading signal using symbolic regression. This model had good out-of-sample accuracy in predicting what the S&P 500 would do the next day, using for that nothing but the OHLC prices of the last 14 trading days. Even better models could probably be obtained if more interesting past data was used for the training, such as technical indicators (RSI, MACD, etc).

You can generate your own models by downloading TuringBot for free from the official website. We encourage you to experiment with different stocks and timeframes to see what you can find.

If you are interested in solving AI problems and would like an easy to use desktop software that yields state of the art results, you might like TuringBot. In this article, we will show you how it can be used to easily solve classification and regression problems, and explain the methodology that it uses, which is called symbolic regression.

The software

TuringBot is a desktop application that runs on both Windows and Linux, and that can be downloaded for free from the official website. This is what its interface looks like:

The usage is simple: you load your data in CSV or TXT format through the interface, select which column should be predicted and which columns should be used as input, and start the search. The program will look for explicit mathematical formulas that predict this target variable, and show the results in the Solutions box.

Symbolic regression

The name of this technique, which looks for explicit formulas that solve AI problems, is symbolic regression. It is capable of solving the same problems as neural networks, but in an explicit way that does not involve black box computations.

Think of what Kepler did when he extracted his laws of planetary motion from observations. He looked for algebraic equations that could explain this data, and found timeless patterns that are taught to this day in schools. What TuringBot does is something similar to that, but millions of times faster than a human could ever do.

An important point in symbolic regression is that it is not sufficient for a model to be accurate — it also has to be simple. This is why TuringBot’s algorithm tries to find the best formulas of all possible sizes simultaneously, discarding larger formulas that do not perform better than simpler alternatives.

The problems that it can solve

Some examples of problems that can be solved by the program are the following:

Regression problems, in which a continuous target variable should be predicted. See here a tutorial in which we use the program to recover a mathematical formula without previous knowledge of what that formula was.

Classification problems, in which the goal is to classify inputs into two or more different categories. The rationale of solving this kind of problem using symbolic regression is to represent different categorical variables as different integer numbers, and run the optimization with “classification accuracy” as the search metric (this can easily be selected through the interface). In this article, we teach how to use the program to classify the Iris dataset.

Classification of rare events, in which a classification task must be solved on highly imbalanced datasets. The logic is similar to that of a regular classification problem, but in this case a special metric called F1 score should be used (also available in TuringBot). In this article, we found a formula that successfully classified credit card frauds on a real-world dataset that is highly imbalanced.

Getting TuringBot

If you liked the concept of TuringBot, you can download it for free from the official website. There you can also find the official documentation, with more information about the search metrics that are available, the input file formats and the various features that the program offers.

Predicting rare events is a machine learning problem of great practical importance, and also a very difficult one. Models of this kind need to be trained on highly imbalanced datasets and are used, among other things, for spotting fraudulent online transactions and detecting anomalies in medical images.

In this article, we show how such problems can be modeled using Symbolic Regression, a technique that discovers mathematical formulas that predict the desired variable from a set of input variables. Symbolic models, contrary to more mainstream ones like neural networks and random forests, are not black boxes, since they clearly show which variables are being used and how. They are also very fast and easy to implement, since no complex data structures are involved in the calculations.

In order to provide a real-world example, we will try to model the credit card fraud dataset available on Kaggle using our Symbolic Regression software TuringBot. The dataset consists of a CSV file containing 284,807 transactions, one per row, out of which 492 are frauds. The first 28 columns represent anonymized features, and the last one contains “0” for legitimate transactions and “1” for fraudulent ones.

Prior to the regression, we remove all quotation mark characters from the file, so that those two categories are recognized as numbers by the software.

Symbolic regression

Generating symbolic models using TuringBot is a straightforward process, which requires no data science skills. The first step is to open the program and load the input file by clicking on the “Input” button, shown below. After loading, the code will automatically define the column “Class” as the target variable and all other ones as input variables, which is what we want.

Then, we select the error metric for the search as “F1 score”, which is the appropriate one for binary classification problems on highly imbalanced datasets like this one. This metric corresponds to a geometric mean of precision and the recall of the model. A very illustrative image that explains what precision and recall are can be found on the Wikipedia page for the F1 score.

That’s it! After those two steps, the search is ready to start. Just click on the “play” button at the top of the interface. The best solutions that the program has encountered so far will be shown in the “Solutions” box in real-time.

Bear in mind that this is a relatively large dataset and that it may seem like not much is going on in the first minutes of the optimization. Ideally, you should leave the program running until at least a few million formulas have been tested (you can see the number so far in the Log tab). In a modest i7-3770 CPU with 8 threads, this took us about 6 hours. A more powerful CPU would take less time.

The resulting formula

The models that were encountered by the program after this time were the following:

The error for the best one is 0.17, meaning its F1 score is 1 – 0.17 = 0.83. This implies that both the recall and the precision of the model are close to 83%. In a verification using Python, we have found that they are 80% and 87% respectively.

So what does this mean? That the following mathematical formula found by our program is capable of detecting 80% of all frauds in the dataset, and that it is right 87% of the time when it claims that fraud is taking place! This is a result consistent with the best machine learning methods available.

Conclusion

In this article, we have demonstrated that our Symbolic Regression software TuringBot is able to generate models that classify credit card frauds in a real-world dataset with high precision and high recall. We believe that this kind of modeling capability, combined with the transparency and efficiency of the generated models, is very useful for those interested in developing machine learning models for the classification and prediction of rare events.