How to find a formula for the nth term of a sequence

Given a sequence of numbers, finding an explicit mathematical formula that computes the nth term of the sequence can be challenging, except in very special cases like arithmetic and geometric sequences.

In the general case, this task involves searching over the space of all mathematical formulas for the most appropriate one. A special technique exists that does just that: symbolic regression. Here we will introduce how it works, and use it to find a formula for the nth term in the Fibonacci sequence (A000045 in the OEIS) as an example.

What symbolic regression is

Regression is the task of establishing a relationship between an output variable and one or more input variables. Symbolic regression solves this task by searching over the space of all possible mathematical formulas for the ones with the greatest accuracy, while trying to keep those formulas as simple as possible.

The technique starts from a set of base functions — for instance, sin(x), exp(x), addition, multiplication, etc. Then it tries to combine those base functions in various ways using an optimization algorithm, keeping track of the most accurate ones found so far.
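As a toy illustration of this idea, the following sketch randomly combines base functions and keeps the combination with the lowest error on some data. Real symbolic regression tools use far more sophisticated search strategies than this random sampling; the sketch only shows the principle.

```python
import math
import random

# Toy symbolic regression: the hidden relationship is y = sin(x) + x.
# We randomly combine base functions as u1(x) + u2(x) and keep the best.
random.seed(0)
xs = [i / 10 for i in range(1, 21)]
ys = [math.sin(x) + x for x in xs]

BASE = [("sin", math.sin), ("cos", math.cos), ("exp", math.exp),
        ("identity", lambda v: v)]

def error(pair):
    # Sum of squared residuals of u1(x) + u2(x) against the data
    (_, f1), (_, f2) = pair
    return sum((f1(x) + f2(x) - y) ** 2 for x, y in zip(xs, ys))

candidates = [(random.choice(BASE), random.choice(BASE)) for _ in range(200)]
best = min(candidates, key=error)
print(sorted(name for name, _ in best))  # ['identity', 'sin']
```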

An important point in symbolic regression is simplicity. It is easy to find a polynomial that will fit any sequence of numbers with perfect accuracy, but that does not really tell you anything, since the number of free parameters in the model is the same as the number of data points. For this reason, a symbolic regression procedure will discard a larger formula if it finds a smaller one that performs just as well.

Finding the nth Fibonacci term

Now let’s show how symbolic regression can be used in practice by trying to find a formula for the Fibonacci sequence using the desktop symbolic regression software TuringBot. The first two terms of the sequence are 1 and 1, and each subsequent term is the sum of the two preceding ones. Its first terms are the following, where the first column is the index:

1 1
2 1
3 2
4 3
5 5
6 8
7 13
8 21
9 34
10 55

A list of the first 30 terms can be found in this file: fibonacci.txt.

TuringBot takes as input TXT or CSV files with one variable per column and efficiently finds formulas that connect those variables. This is what it looks like after we load fibonacci.txt and run the optimization:

Finding a formula for the nth Fibonacci term with TuringBot.

The software finds not only a single formula, but the best formulas of all possible complexities. A larger formula is only shown if it performs better than all simpler alternatives. In this case, the last formula turned out to predict with perfect accuracy every single one of the first 30 Fibonacci terms. The formula is the following:

f(x) = floor(cosh(-0.111572+0.481212*x))

Clearly a very elegant solution. The same procedure can be used to find a formula for the nth term of any other sequence (if it exists).
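We can verify this formula with a few lines of Python, using the constants exactly as reported above. The formula is, in effect, a rediscovery of Binet's approximation: 0.481212 is very close to ln of the golden ratio, and cosh(x) approaches exp(x)/2 for large x.

```python
import math

def f(n):
    # TuringBot's formula for the nth Fibonacci term
    return math.floor(math.cosh(-0.111572 + 0.481212 * n))

# Build the first 10 Fibonacci numbers directly for comparison
fib = [1, 1]
while len(fib) < 10:
    fib.append(fib[-1] + fib[-2])

print([f(n) for n in range(1, 11)] == fib)  # True
```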


In this tutorial, we have seen how the symbolic regression software TuringBot can be used to find a closed-form expression for the nth term in a sequence of numbers. We found a very short formula for the Fibonacci sequence by simply writing it into a text file with one number per row and loading this file into the software.

If you are interested in trying TuringBot on your own data, you can download it from the official website. It is available for both Windows and Linux.

A machine learning software for data science

Data science is becoming more and more widespread, driven by companies discovering that valuable, actionable information can be extracted from their databases.

It can be challenging to develop useful models from raw data. Here we will introduce a tool that makes it very easy to develop state-of-the-art models from any dataset.

What is TuringBot

TuringBot is a desktop machine learning software. It runs on both Windows and Linux, and it generates models that predict a target variable from one or more input variables. It does that through a technique called symbolic regression. This is what its interface looks like:

TuringBot’s interface.

The idea of symbolic regression is to search over the space of all possible mathematical formulas for the ones that best connect the input variables to the target variable, while trying to keep those formulas as simple as possible. The target variable can be anything: for instance, it can represent different categorical variables as different integer numbers, allowing the program to solve classification problems, or it can be a regular continuous variable.
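For instance, a classification problem can be cast in this form by encoding each category as an integer before loading the data. A minimal sketch, with purely illustrative label names:

```python
# Map category names to integers so a formula can predict them
categories = ["legitimate", "fraud"]
encoding = {name: i for i, name in enumerate(categories)}

rows = ["legitimate", "fraud", "legitimate"]
print([encoding[r] for r in rows])  # [0, 1, 0]
```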

Machine learning with TuringBot

The usage of TuringBot is very straightforward. All you have to do is save your data in CSV or TXT format, with one variable per column, and load this input file through the program’s interface.

Once the data is loaded, you can select the target variable and which variables should be used as input, as well as the search metric, and then start the search. Several search metrics are available, including RMS error, mean error and classification accuracy. A list of formulas encountered so far will be shown in real time, ordered by complexity. Those formulas can be easily exported as Python, C or text from the interface:

Some solutions found by TuringBot. They can readily be exported to common programming languages.
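A solution exported to Python is just an ordinary function of the input variables. The expression below is hypothetical, not one produced by the software, but it illustrates the general format of an exported model:

```python
from math import cos, exp

# Hypothetical exported formula; the actual expression depends on your data
def solution(x1, x2):
    return 0.5 * cos(1.2 * x1) + exp(-0.3 * x2)

print(solution(0.0, 0.0))  # 1.5
```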

Most machine learning methods are black boxes, which carry out complex computations under the hood before giving a result. This is how neural networks and random forests work, for instance. A great advantage of TuringBot over these methods is that the models that it generates are very explicit, allowing some understanding to be gained into the data. This turns data science into something much more similar to natural science and its search for mathematical laws that explain the world.

How to get the software

If you are interested in trying TuringBot on your own data, you can download it for free from the official website. There you can also find the official documentation, with detailed information about all the features and parameters of the software. Many engineers and data scientists are already making use of the software to find hidden patterns in their data.

Symbolic regression tutorial with TuringBot

In this tutorial, we are going to show how you can find a formula from your data using the symbolic regression software TuringBot. It is a desktop application that runs on both Windows and Linux, and as you will see, the usage is very simple.

Preparing the data

TuringBot takes as input files in .txt or CSV format containing one variable per column. The first row may contain the names of the variables, otherwise they will be labelled col1, col2, col3, etc.

For instance, the following is a valid input file:

x y z w classification
5.20 2.70 3.90 1.40 1
6.50 2.80 4.60 1.50 1
7.70 2.80 6.70 2.00 2
5.90 3.20 4.80 1.80 1
5.00 3.50 1.60 0.60 0
5.10 3.50 1.40 0.20 0
4.60 3.10 1.50 0.20 0
6.90 3.20 5.70 2.30 2
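A file in this format can be written from any array-like data, for instance with NumPy's savetxt (the file name here is arbitrary):

```python
import numpy as np

# One variable per column; optional header row with the variable names
data = np.array([
    [5.20, 2.70, 3.90, 1.40, 1],
    [6.50, 2.80, 4.60, 1.50, 1],
    [5.00, 3.50, 1.60, 0.60, 0],
])
np.savetxt("input.txt", data, fmt=["%.2f"] * 4 + ["%d"],
           header="x y z w classification", comments="")
```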

Loading the data into TuringBot

This is what the program looks like when you open it:

The TuringBot interface.

By clicking on the “Input file” button on the upper left, you can select and load your input file. Different search metrics are available, including for instance classification accuracy. A handy cross validation feature can also be enabled in the “Search options” box; if enabled, it will automatically create a train/test split and allow you to see the out-of-sample error as the optimization goes on. In this example, however, we are going to keep things simple and just use the defaults.

Finding the formulas

After loading the data, you can click on the play button at the top of the interface to start the optimization. The best formulas found so far will be shown in the “Solutions” box, in ascending order of complexity. A formula is only shown if its accuracy is greater than that of all simpler alternatives — in symbolic regression, the goal is not simply to find a formula, but to find the simplest ones possible.

Here are the formulas it found for an example dataset:

Finding formulas with TuringBot.

The formulas are all written in a format that is compatible out of the box with Python and C. Indeed, the menu on the upper right allows you to export the solutions to these languages:

Exporting solutions to different languages.

In this example, the true formula turned out to be sqrt(x), which was recovered in a few seconds. The methodology would be the same for a real-world dataset with many input variables and an unknown dependency between them.

How to get TuringBot

If you liked this tutorial, we encourage you to download TuringBot for free from the official website. As we have shown, it is very simple to use, and its powerful mathematical modelling capabilities allow you to find very subtle numerical patterns in your data, much like a scientist would from empirical observations, but automatically and millions of times faster.

Machine learning with symbolic regression

Many machine learning methods are presently available, including for instance neural networks, random forests and support vector machines. In this article, we will talk about a relatively unexplored technique called symbolic regression, and show how it can be used to solve machine learning problems in a very transparent and explicit way.

What is machine learning

Machine learning concerns algorithms capable of predicting numerical values (regression) and creating classifications, among other tasks. The real world is messy and randomness appears everywhere, so a major challenge that these algorithms face is being able to discern meaningful signals from the underlying noise contained in the training datasets.

What most machine learning methods have in common is that they are very implicit and resemble black boxes: numbers are fed into the model, and it spits out a result after performing a series of complex computations under the hood. This kind of processing of information is strongly connected to the notion of “artificial intelligence”, since the inner workings of the human brain are also very hard to describe, while it is capable of learning and recognizing patterns across a very wide range of domains.

Symbolic regression

Symbolic regression is a technique that looks for mathematical formulas that predict some target variable taking as input one or more input variables. Thus, a symbolic model is nothing more than an algebraic formula that can be written on a piece of paper.

A simple example of a symbolic model is a polynomial. Any dataset can be represented with perfect accuracy by a polynomial, but that is not very interesting, because polynomials quickly diverge outside the training domain, and because they contain as many free parameters as the training dataset itself. So they do not really compress information in any way.
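This point is easy to demonstrate numerically: an interpolating polynomial passes through every training point, yet blows up outside the training domain. A sketch with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 8)
y = np.sin(2 * x) + rng.normal(0, 0.05, len(x))

# Degree n-1 polynomial through n points: as many parameters as data points
coeffs = np.polyfit(x, y, len(x) - 1)

# It reproduces the training points essentially exactly...
print(np.max(np.abs(np.polyval(coeffs, x) - y)) < 1e-6)  # True

# ...but extrapolates wildly outside the training domain [0, 1]
print(abs(np.polyval(coeffs, 3.0)))
```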

More interesting models are found by combining a set of base functions and trying to find the simplest combinations that predict some target variable. Examples of base functions are trigonometric functions, exponentials, sum, multiplication, division, etc.

For instance, these are some of the base functions used by the symbolic regression software TuringBot:

Base functions used by TuringBot.

After the base functions are defined, the task is then to combine them in such a way that a target variable is successfully predicted from the input variables. There is more than one way to carry out the optimization — one might be interested in maximizing the classification accuracy, or in recovering the overall shape of a curve without much regard for outliers, etc. For this reason, TuringBot allows many different search metrics to be used:

The search metrics available in TuringBot.

Some examples of problems that can be solved with symbolic regression include:

- Predicting a continuous target variable (regression)
- Classification, by representing different categories as different integer numbers
- Finding a closed-form formula for the nth term of a number sequence
- Detecting rare events such as fraudulent transactions

Clearly the method is very general, and can be creatively used to solve a variety of problems.


In this article, we have seen how symbolic regression is an alternative machine learning method capable of generating explicit models and solving various classes of problems in an elegant way. If you are interested in generating symbolic models from your own data and seeing what patterns the software can find, you can download the symbolic regression software TuringBot, which works on both Windows and Linux, for free.

How to find formulas from values

Finding mathematical formulas from data is an extremely useful machine learning task. A formula is the most compressed representation of a table, allowing large amounts of data to be compressed into something simple, while also making explicit the relationship that exists between the different variables.

In this tutorial, we are going to generate a dataset and try to recover the original formula using the symbolic regression software TuringBot, without any previous knowledge of what that formula was.

What symbolic regression is

Symbolic regression is a machine learning technique that tries to find explicit mathematical formulas that connect variables. The technique starts from a set of base functions to be used in the search, for instance, addition, multiplication, sin(x), exp(x), etc, and then tries to combine those functions in such a way that the target variable is accurately predicted.

Simplicity is as important as accuracy in a symbolic regression model. Every dataset can be represented with perfect accuracy by a polynomial, but that is uninformative, since the number of free parameters in the model is the same as the number of training data points. For this reason, a symbolic regression optimization penalizes large formulas, favoring simpler ones that perform just as well.

Generating an example dataset

Let’s give an explicit example of how symbolic regression can be used to find a formula from data. We will generate a dataset that consists of the formula x*cos(10*x) + 2, add noise to this data, and then see if we can recover this formula using symbolic regression.

The following Python script generates the input data:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 100)
y = np.cos(10*x)*x + 2 + np.random.random(len(x))*0.1

plt.scatter(x, y)
plt.show()

And this is what the result looks like:

The input data that we have generated.

Now we are going to try to find a formula for this data and see what happens.

Finding a formula using TuringBot

The usage of TuringBot is very simple. All we have to do is load the input data using its interface and start the search. First, we save the data to an input file:

arr = np.column_stack((x, y))
np.savetxt('input.txt', arr, fmt='%f')

After loading input.txt into TuringBot, starting the search, and letting it work for a minute, these were the formulas that it found, ordered by complexity:

The formulas found by TuringBot for our input dataset.

It can be seen that it has successfully found our original formula!
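As a sanity check, we can recreate the dataset (seeded this time, for reproducibility) and compare it against the recovered expression. The constant 2.05 below absorbs the mean of the uniform noise on [0, 0.1]; the constant that TuringBot reports may differ slightly.

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 1, 100)
y = np.cos(10 * x) * x + 2 + rng.random(len(x)) * 0.1

# Recovered formula (the offset 2.05 soaks up the noise mean of 0.05)
model = np.cos(10 * x) * x + 2.05
rms = np.sqrt(np.mean((y - model) ** 2))
print(rms < 0.05)  # True: residual at the level of the injected noise
```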


Here we have seen how symbolic regression can be used to automatically find mathematical formulas from data values. The example that we have given was a simple one, but the procedure would also work for a real-world dataset in which the dependencies between the variables were not known beforehand, and in which more than one input variable was present.

If you are interested in trying to find formulas from your own dataset, you can download TuringBot for free from the official website.

Symbolic regression example with Python visualization

Symbolic regression is a machine learning technique capable of generating models that are explicit and easy to understand.

In this tutorial, we are going to generate our first symbolic regression model. For that we are going to use the TuringBot software. After generating the model, we are going to visualize the results using a Python library (Matplotlib).

In order to make things more interesting, we are going to try to find a mathematical formula for the N-th prime number (A000040 in the OEIS).

Symbolic regression setup

The symbolic regression software that we are going to use is called TuringBot. It is a desktop application that runs on both Windows and Linux. The usage is straightforward: you load your input file in .txt or .csv format, select which column should be predicted and which columns should be used as input, and then start the search.

Several search metrics are available, including RMS error, mean error, correlation coefficient and others. Since we are interested in predicting the exact values of the prime numbers, we are going to use the “classification accuracy” metric.

This is what the interface looks like after loading the input file containing prime numbers as a function of N, which we have truncated to the first 20 rows:

The TuringBot interface.

With the input file loaded and the search metric selected, the search can be started by clicking on the play button at the top of the interface.

The formulas that were found

After letting TuringBot work for a few minutes, these were the formulas that it ended up finding:

The results of our symbolic regression optimization.

The best one has an error of 0.20, that is, a classification accuracy of 80%, which is quite impressive considering how short the formula is. Of course, we could have obtained 100% accuracy with a huge polynomial, but that would not really compress the data in any way, since the number of free parameters in the resulting model would be the same as the number of data points.

Visualizing with Python

Now we can finally visualize the symbolic model using Python. Luckily the formula works out of the box as long as we import the math library (TuringBot follows the same naming convention). This is what the script looks like:

from math import *
import numpy as np
import matplotlib.pyplot as plt

def prime(x):
    return floor(1.98623*ceil(0.0987775+cosh(log2(x)-0.049869))-(1/x))

data = np.loadtxt('primes.txt')
plt.scatter(data[:,0], data[:,1], label='Data')
plt.plot(data[:,0], [prime(x) for x in data[:,0]], label='Model')
plt.title('Prime numbers')
plt.legend()
plt.show()

And this is the resulting plot:

Symbolic regression Python model.
Plot of our model vs the original data.


In this tutorial we have seen how to generate a symbolic regression model. The example given was a very simple one, with only one input variable and a small number of data points, but the methodology would work just as fine with a real-world large dataset with multiple dimensions, allowing a variety of machine learning problems of practical interest to be solved.

Using R to visualize a Symbolic Regression model

In this article, we are going to show how a symbolic regression model can be visualized using the R programming language. The model will be generated using the TuringBot symbolic regression software, and we are going to use the ggplot2 library [1] for the visualization.

The dataset that we are going to use consists of the closing prices for the S&P 500 index in the last year, downloaded from Yahoo Finance [2]. The CSV file, which also contains additional columns like open, high, low, and volume, can be found here: spx.csv

Symbolic regression modelling

After opening TuringBot and selecting this file from the menu on the upper left of the interface, we select “Row number” as the input variable and “Close” as the target variable. This way, our model will find the close price as a function of the index of the trading day (1, 2, 3, etc). We will also use a randomly selected 50:50 train/test split to make our model more robust, and “mean relative error” as the optimization metric because we are more interested in the shape of the model than in specific values.

This is what the interface will look like:

Clicking on the play button at the top, the optimization is started, using all the CPU cores in the computer for greater performance. The models encountered so far are seen in the “Solutions” box.

Selecting the best formula

After letting the optimization run for a few minutes, we can click on the “Show cross validation error” box on the upper right of the interface to see the out-of-sample performance of each model. Using this information, we can select the best one, which in this case turned out to be a combination of cosines and multiplications:

Visualizing with R and ggplot2

Now that we have the model, we are going to visualize it using ggplot2. The following script loads the input CSV file and plots it along with the model that we just selected:


library(ggplot2)

data <- read.csv("spx.csv")
data$idx <- as.numeric(row.names(data))

eq = function(row){2966.96+(2.98602*(-55.4604+row)*cos(0.0397268*(row+8.34129*cos(-0.0819996*row))-1.16301*cos(-0.0358919*row)))}

p <- ggplot(data, aes(x=idx, y=Close)) + geom_point()

p + stat_function(fun=eq, color='blue')


And this is the final result:

Symbolic regression R model.

This demonstrates the power and simplicity of symbolic regression models: we have managed to readily implement and visualize in R a model generated by TuringBot, something that would be much harder if the model were a black box like a neural network or a random forest.


[1] ggplot2:

[2] Yahoo Finance quotes for the S&P 500 (^GSPC)

Using Symbolic Regression to predict rare events

The formula found in this article predicts credit card frauds in a real-world dataset with 87% precision.

Rare events classification

Predicting rare events is a machine learning problem of great practical importance, and also a very difficult one. Models of this kind need to be trained on highly imbalanced datasets and are used, among other things, for spotting fraudulent online transactions and detecting anomalies in medical images.

In this article, we show how such problems can be modeled using Symbolic Regression, a technique that discovers mathematical formulas that predict the desired variable from a set of input variables. Symbolic models, contrary to more mainstream ones like neural networks and random forests, are not black boxes, since they clearly show which variables are being used and how. They are also very fast and easy to implement, since no complex data structures are involved in the calculations.

In order to provide a real-world example, we will try to model the credit card fraud dataset available on Kaggle using our Symbolic Regression software TuringBot. The dataset consists of a CSV file containing 284,807 transactions, one per row, out of which 492 are frauds. The first 28 columns represent anonymized features, and the last one contains “0” for legitimate transactions and “1” for fraudulent ones.
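To put the imbalance in perspective, frauds make up well under 1% of the rows:

```python
# Fraction of fraudulent transactions in the dataset, as a percentage
n_total, n_fraud = 284807, 492
print(round(100 * n_fraud / n_total, 2))  # 0.17
```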

Prior to the regression, we remove all quotation mark characters from the file, so that those two categories are recognized as numbers by the software.

Symbolic regression

Generating symbolic models using TuringBot is a straightforward process, which requires no data science skills. The first step is to open the program and load the input file by clicking on the “Input” button, shown below. After loading, the program will automatically define the column “Class” as the target variable and all other ones as input variables, which is what we want.

Then, we select “F1 score” as the error metric for the search, which is the appropriate one for binary classification problems on highly imbalanced datasets like this one. This metric is the harmonic mean of the precision and the recall of the model. A very illustrative image that explains what precision and recall are can be found on the Wikipedia page for the F1 score.

That’s it! After those two steps, the search is ready to start. Just click on the “play” button at the top of the interface. The best solutions that the program has encountered so far will be shown in the “Solutions” box in real-time.

Bear in mind that this is a relatively large dataset, and it may seem like not much is going on in the first minutes of the optimization. Ideally, you should leave the program running until at least a few million formulas have been tested (you can see the number so far in the Log tab). On a modest i7-3770 CPU with 8 threads, this took us about 6 hours. A more powerful CPU would take less time.

The resulting formula

The models that were encountered by the program after this time were the following:

The error for the best one is 0.17, meaning its F1 score is 1 - 0.17 = 0.83. This implies that both the recall and the precision of the model are close to 83%. In a verification using Python, we have found that they are 80% and 87%, respectively.

So what does this mean? That the following mathematical formula found by our program is capable of detecting 80% of all frauds in the dataset, and that it is right 87% of the time when it claims that fraud is taking place! This is a result consistent with the best machine learning methods available.
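As a quick sanity check, these numbers are mutually consistent: the harmonic mean of an 87% precision and an 80% recall gives back the reported F1 score of 0.83.

```python
def f1(precision, recall):
    # F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.87, 0.80), 2))  # 0.83
```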


In this article, we have demonstrated that our Symbolic Regression software TuringBot is able to generate models that classify credit card frauds in a real-world dataset with high precision and high recall. We believe that this kind of modeling capability, combined with the transparency and efficiency of the generated models, is very useful for those interested in developing machine learning models for the classification and prediction of rare events.