Predicting whether the price of a stock will rise or fall is perhaps one of the most difficult machine learning tasks. Signals must be found in datasets that are dominated by noise, and in a robust way that does not overfit the training data.
In this tutorial, we are going to show how an AI trading system can be created using a technique called symbolic regression. The idea is to find a formula that classifies whether the price of a stock will rise or fall on the following day based on its price candles (open, high, low, close) over the last 14 days.
AI trading system concept
Our AI trading system will be a classification algorithm: it will take past data as input, and output 0 if the stock is likely to fall on the following day and 1 if it is likely to rise. The first step in generating this model is to prepare a training dataset in which each row contains all the relevant past data and also a 0 or 1 label based on what happened on the following day.
We can be very creative about what past data to use as input while generating the model. For instance, we could include technical indicators such as RSI and MACD, sentiment data, etc. But for the sake of this example, all we are going to use are the OHLC prices of the last 14 candles.
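Our training dataset should then contain the following columns: open_1, high_1, low_1, close_1, open_2, high_2, low_2, close_2, and so on up to open_14, high_14, low_14, close_14, followed by a final label column.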
Here the index 1 denotes the last trading day, the index 2 the trading day prior to that, etc.
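As a small illustration of this convention, the sketch below uses a toy series of closing prices (made-up values, not market data) to show how the label and the backward-looking window are derived. The helper name `window_closes` is hypothetical and only serves this example.

```python
import pandas as pd

# Toy closing prices, oldest first (made-up values for illustration only)
closes = pd.Series([100.0, 101.5, 101.0, 102.2, 102.2])

# The label for a day is 1 if the next day's close is strictly higher, else 0.
# The last day has no following day, so it is dropped.
labels = (closes.shift(-1) > closes).astype(int).iloc[:-1]
print(labels.tolist())  # [1, 0, 1, 0]

def window_closes(closes, i, n):
    """Closes for day i and the n-1 days before it, most recent first.

    Position 0 of the result corresponds to index 1 (the last trading day),
    position 1 to index 2 (the day before that), and so on.
    """
    return [float(closes.iloc[j]) for j in range(i, i - n, -1)]

print(window_closes(closes, 3, 3))  # [102.2, 101.0, 101.5]
```

Note that an unchanged close counts as 0 here, the same strict comparison used when the training dataset is generated.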
Generating the training dataset
To make things interesting, we are going to train our model on data for the S&P 500 index over the last year, as retrieved from Yahoo Finance. The raw dataset can be found here: S&P 500.csv.
To process this CSV file into the format that we need for the training, we have created the following Python script which uses the Pandas library:
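    import pandas as pd

    # Daily OHLC data downloaded from Yahoo Finance, oldest row first
    df = pd.read_csv('S&P 500.csv')

    training_data = []

    for i, row in df.iterrows():
        # Skip rows without 14 candles of history or without a following day
        if i < 13 or i + 1 >= len(df):
            continue

        features = []

        # Walk backwards from the current day (index 1) to 13 days earlier (index 14)
        for j in range(i, i - 14, -1):
            features.append(df.iloc[j]['Open'])
            features.append(df.iloc[j]['High'])
            features.append(df.iloc[j]['Low'])
            features.append(df.iloc[j]['Close'])

        # Label: 1 if the next day's close is higher than today's close, else 0
        if df.iloc[i + 1]['Close'] > row['Close']:
            features.append(1)
        else:
            features.append(0)

        training_data.append(features)

    # Column names follow the same order in which the features were appended
    columns = []
    for i in range(1, 15):
        columns.append('open_%d' % i)
        columns.append('high_%d' % i)
        columns.append('low_%d' % i)
        columns.append('close_%d' % i)
    columns.append('label')

    training_data = pd.DataFrame(training_data, columns=columns)
    training_data.to_csv('training.csv', index=False)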
All this script does is iterate through the rows in the Yahoo Finance data and generate rows with the OHLC prices of the last 14 candles, plus an additional 'label' column based on what happened on the following day. The result can be found here: training.csv.
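The same table can also be built without an explicit row loop. The sketch below is an alternative formulation, not the script used for this tutorial: the function name `build_training_table` is hypothetical, and it runs on a small synthetic DataFrame (random values) so that it does not depend on the CSV file. It only assumes that the input has `Open`, `High`, `Low` and `Close` columns in chronological order.

```python
import numpy as np
import pandas as pd

def build_training_table(df, n_days=14):
    """Build lagged OHLC features plus a next-day label using shift()."""
    cols = {}
    for k in range(1, n_days + 1):
        for name in ('Open', 'High', 'Low', 'Close'):
            # shift(k - 1) brings the value from k-1 rows earlier onto each row
            cols['%s_%d' % (name.lower(), k)] = df[name].shift(k - 1)
    out = pd.DataFrame(cols)
    # 1 if the next close is strictly higher than the current close, else 0
    out['label'] = (df['Close'].shift(-1) > df['Close']).astype(int)
    # Keep only rows with a full window and a following day
    out = out.iloc[n_days - 1:len(df) - 1]
    return out.reset_index(drop=True)

# Synthetic random-walk prices, for illustration only
rng = np.random.default_rng(0)
close = 100 + rng.normal(0, 1, 30).cumsum()
demo = pd.DataFrame({
    'Open': close - 0.1,
    'High': close + 0.5,
    'Low': close - 0.5,
    'Close': close,
})

table = build_training_table(demo)
print(table.shape)  # (16, 57)
```

With 30 input rows, 13 are lost to the initial window and 1 to the missing following day, leaving 16 rows of 56 features plus the label.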
Creating a model with symbolic regression
Now that we have the training dataset, we are going to try to find formulas that predict what will happen to the S&P 500 on the following day. For that, we are going to use the desktop symbolic regression software TuringBot. This is what the interface of the program looks like:
The input file is selected from the menu on the upper left. We also select the following settings:
- Search metric: classification accuracy.
- Test/train split: 50/50. This will allow us to easily discard overfit models.
- Test sample: the last points. The other option is “chosen randomly”, which would make it easier to overfit the data due to autocorrelation.
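For readers who want to reproduce this kind of split outside TuringBot, here is a rough sketch. It is not TuringBot's internal logic: the function name `chronological_split` is hypothetical, and it simply assumes the rows are in chronological order and holds out the last half.

```python
import pandas as pd

def chronological_split(data, train_fraction=0.5):
    """Use the first part of the rows for training and the rest for testing."""
    cut = int(len(data) * train_fraction)
    return data.iloc[:cut], data.iloc[cut:]

# Tiny made-up table standing in for the training dataset
demo = pd.DataFrame({'x': range(10), 'label': [0, 1] * 5})
train, test = chronological_split(demo)
print(len(train), len(test))  # 5 5
```

Holding out the last rows matters here because consecutive rows of the training dataset share 13 of their 14 candles. A random split would place near-duplicates of test rows in the training half.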
With these settings in place, we can start the search by clicking on the play button at the top of the interface. The best solutions found so far will be shown in real time, ordered by complexity, and their out-of-sample errors can be seen by toggling the “show cross validation” button on the upper right.
After letting the optimization run for a few minutes, these were the models that were encountered:
The one with the best out-of-sample accuracy turned out to be the one with size 23. Its win rate in the test domain was 60.5%. This is the model:
label = 1-floor((open_5-high_4+open_12+tan(-0.541879*low_1-high_1))/high_13)
It depends on the low and high of the current day, and on the open or high of four earlier days (open_5, high_4, open_12 and high_13).
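To experiment with this formula outside the program, it can be transcribed into Python. The sketch below is a direct transcription under stated assumptions: the function names `predict` and `accuracy` are hypothetical, the example row uses made-up prices, and a prediction is counted as correct only if it equals the label exactly. How TuringBot scores predictions internally is not covered here.

```python
import math

def predict(row):
    """Evaluate the size-23 formula on one row (a dict or pandas row)."""
    value = (
        row['open_5'] - row['high_4'] + row['open_12']
        + math.tan(-0.541879 * row['low_1'] - row['high_1'])
    ) / row['high_13']
    # The raw result can fall outside {0, 1} when the tangent term is large
    return 1 - math.floor(value)

def accuracy(rows):
    """Fraction of rows whose prediction matches the label exactly."""
    hits = sum(1 for r in rows if predict(r) == r['label'])
    return hits / len(rows)

# Made-up row for illustration; only the columns used by the formula are needed
example = {
    'low_1': 4010.0, 'high_1': 4055.0, 'high_4': 4040.0,
    'open_5': 4020.0, 'open_12': 3990.0, 'high_13': 4005.0,
    'label': 1,
}
print(predict(example))
```

With `training.csv` loaded through `pd.read_csv`, passing the last half of the rows as `df.iloc[len(df) // 2:].to_dict('records')` to `accuracy` gives a figure that can be compared against the win rate reported above.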
In this tutorial, we have generated an AI trading signal using symbolic regression. This model reached an out-of-sample accuracy of 60.5% in predicting what the S&P 500 would do on the next day, using nothing but the OHLC prices of the last 14 trading days. Even better models could probably be obtained if more informative past data were used for the training, such as technical indicators (RSI, MACD, etc.).
You can generate your own models by downloading TuringBot for free from the official website. We encourage you to experiment with different stocks and timeframes to see what you can find.