## Introduction

TuringBot, named after the great mathematician and computing pioneer Alan Turing, is desktop software for symbolic regression. It uses a novel algorithm based on simulated annealing to discover mathematical formulas from numerical data with unprecedented efficiency.

Data is read from TXT or CSV files, and the target and input columns can be selected from the interface. A variety of error metrics are available for the search, allowing the program to find formulas that solve both regression and classification problems.

The main features of TuringBot are the following:

- Pareto optimization: the software simultaneously tries to find the best formulas of all possible sizes. It gives you not only a single formula as output, but a set of formulas of increasing complexity to choose from.
- Built-in cross-validation: allows you to easily rule out overfit solutions.
- Export solutions as Python, C/C++, LaTeX, or plain text.
- Multiprocessing.
- Written in a low-level programming language (no Python). Extremely fast.
- Has a command-line mode that allows you to automate the program.

## Input Data Format

Input files are selected in the interface by clicking on the "Input file" button.

The input file must have a .txt or .csv extension, and the file must contain columns representing different variables separated by spaces, commas, semicolons, or tabs. The values can be integers (%d), floating-point numbers (%f), or floating-point numbers in exponential notation (%e). Decimal points must be indicated with a dot (e.g., 1.61803), not a comma (e.g., 1,61803).

Optionally, a header containing the variable names may be present in the first line. If a header is included, these names will be used in formulas instead of default names such as col1, col2, col3, and so on. Variable names in the header should be formatted without spaces; instead, use underscores (_).

You can also use row numbers (1, 2, 3...) as an input variable, which is useful for time series data.

Below is an example of a valid input file:

    x y z
    0.01231 0.99992 0.99985
    0.23180 0.97325 0.94723
    0.45128 0.89989 0.80980
    0.67077 0.78334 0.61363
    0.89026 0.62921 0.39591
    1.00000 0.54030 0.29193

### Important Notes

The following elements must not be present in the input file:

- Text or string data.
- Date values (e.g., 2023-09-15 or 09/15/2023).
- Currency symbols (e.g., $).
- Percentage symbols (e.g., %).
- Commas in numeric values (e.g., 6,000.00 should be 6000.00).
- Special characters such as slashes (/).
- Spaces in header names (use underscores instead: variable_name).

If any of the above elements are present, the program will fail to parse the file correctly.
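The rules above can be checked before a file is loaded. The following is a minimal sketch, not TuringBot's own parser: the function name `check_input_lines` is hypothetical, and it assumes that a first line with no numeric tokens is a header. Because commas and spaces act as delimiters, a value like 6,000.00 or a header name with a space shows up as an unexpected column count.

```python
import re

# Delimiters named in the text: spaces, commas, semicolons, or tabs.
DELIMITERS = re.compile(r"[ \t,;]+")
# Integers, floating-point numbers, and exponential notation with a dot as decimal point.
NUMBER = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?")


def check_input_lines(lines):
    """Return a list of (line_number, message) tuples for lines that look invalid."""
    problems = []
    width = None
    first = True
    for number, line in enumerate(lines, start=1):
        tokens = [t for t in DELIMITERS.split(line.strip()) if t]
        if not tokens:
            continue  # ignore blank lines
        numeric = [NUMBER.fullmatch(t) is not None for t in tokens]
        if first:
            first = False
            width = len(tokens)
            if not any(numeric):
                continue  # assumption: a fully non-numeric first line is the header
        if len(tokens) != width:
            problems.append((number, f"expected {width} columns, found {len(tokens)}"))
        bad = [t for t, ok in zip(tokens, numeric) if not ok]
        if bad:
            problems.append((number, f"non-numeric values: {bad}"))
    return problems
```

To check a file on disk, pass the open file object, for example `check_input_lines(open("input.txt"))`.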

## Search options

Before starting the search, you can select a variety of search settings, including the error metric and the base functions that should be used in the optimization.

#### Error metrics

The available error metrics are:

| Metric | Description |
| --- | --- |
| RMS error | Root mean square error. |
| Mean relative error | Average of (absolute value of the error) / (absolute value of the target variable). With that, the convergence is in terms of relative error instead of absolute error. |
| Classification accuracy | (correct predictions) / (number of data points). Only useful for integer target values and classification problems. |
| Mean error | Average of (absolute value of the error). Similar to RMS error, but it puts less emphasis on outliers. |
| F-score | 2*(precision*recall)/(precision+recall) when "F-score beta parameter" is set to 1. See here an image that explains what those two quantities are. This metric is useful for classification problems on highly imbalanced datasets, where the target variable is 0 for the majority of inputs and a positive integer for a few cases that need to be identified. If the classification is binary, then the categories can be 1 (relevant cases) and 0 (all other cases). |
| Correlation coefficient | Corresponds to the Pearson correlation coefficient. Useful for quickly getting the overall shape of the output right without attention to scales. |
| Hybrid (CC+RMS) | The geometric mean between the correlation coefficient and the RMS error. This is a compromise between the attention to scale of the RMS metric and the speed of the CC metric. |
| Maximum error | The maximum absolute difference between the predictions of the model and the target variable. |
| Maximum relative error | The maximum of (absolute value of the error) / (absolute value of the target variable). |
| Nash-Sutcliffe efficiency | A metric that resembles the correlation coefficient and is commonly used in hydrological applications. See the definition in this paper. |
| Binary cross-entropy | Used for solving binary classification problems in terms of probabilities. To use this metric, your target variable must contain two (and only two) classes represented by the numbers 0 and 1. |
| Matthews correlation coefficient | A classification metric that takes into account true positives, true negatives, false positives, and false negatives and can be used even if the categories have imbalanced numbers of elements. To use this metric, your "negative" target variable must be represented by the number 0, and your "positive" variables must be represented by one or more positive integer numbers. See this paper for details. |
| Residual sum of squares (RSS) | The sum of (predicted - observed)^2. Very similar to the RMS metric but without taking the average and the square root. |
| Root mean squared log error (RMSLE) | The square root of the average of (log(1 + predicted) - log(1 + observed))^2. Like mean relative error, this metric can be applied to target variables that span multiple orders of magnitude, but it penalizes large errors less aggressively, and is thus less sensitive to outliers. It requires the target variable to be strictly positive. |
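Several of the regression metrics described above can be written out directly from their definitions. The sketch below is illustrative only: the function names are hypothetical, and it follows the formulas as worded in this section rather than TuringBot's internal implementation.

```python
import math


def rms_error(predicted, observed):
    """Square root of the average squared error."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed))


def mean_error(predicted, observed):
    """Average of the absolute error."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


def mean_relative_error(predicted, observed):
    """Average of |error| / |target|; the target must not contain zeros."""
    return sum(abs(p - o) / abs(o) for p, o in zip(predicted, observed)) / len(observed)


def maximum_error(predicted, observed):
    """Largest absolute difference between prediction and target."""
    return max(abs(p - o) for p, o in zip(predicted, observed))


def residual_sum_of_squares(predicted, observed):
    """Sum of squared errors, without the average or the square root."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))


def rmsle(predicted, observed):
    """Square root of the average of (log(1 + predicted) - log(1 + observed))^2."""
    squared = [(math.log1p(p) - math.log1p(o)) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(observed))
```

With predictions (1, 2, 4) and targets (1, 2, 2), the single error of 2 gives an RSS of 4 and a maximum error of 2, while the mean error is 2/3 and the RMS error is the square root of 4/3, which shows how the RMS error weights the one large deviation more heavily.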

#### Base functions

The function names follow the conventions of the C math library. You can find their definitions on this page.

The exceptions are the logical functions (logical_and(x, y), greater(x, y), etc.), the history functions (delay and moving average), and sign (the sign function), which are defined internally in TuringBot.

The **moving average** of a variable is defined as its average value in the N rows before the present one. For instance, if the successive values for variable x are (1, 2, 3, 4, 5), then at x = 5, the value for moving_average(x,3) will be (2+3+4)/3. In some systems, this may be considered a moving average with a lag of 1.
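The definition above can be reproduced in a few lines. This is a sketch under stated assumptions: the helper name `moving_average_before` is hypothetical, and returning `None` for rows that do not yet have N previous values is a choice made here for illustration.

```python
def moving_average_before(values, n):
    """Average of the n values that precede each row (the current row is excluded)."""
    result = []
    for i in range(len(values)):
        if i < n:
            result.append(None)  # assumption: not enough history for this row
        else:
            result.append(sum(values[i - n:i]) / n)
    return result
```

For the sequence (1, 2, 3, 4, 5) and N = 3, the last entry is (2+3+4)/3 = 3.0, matching the example in the text.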

The sizes of the base functions are defined as:

- Size 1: an input variable, sum, subtraction, and multiplication.
- Size 2: division.
- Size 3: abs(x), ceil(x), floor(x), and round(x).
- Size 4: all other functions.
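As a rough illustration of how these sizes add up, the sketch below walks a formula with Python's `ast` module. It is not TuringBot's own code: the function name `formula_size` is hypothetical, and two points are assumptions not stated in the list above, namely that a numeric constant counts as size 1 and that a leading minus sign is not counted. TuringBot's internal representation may count some formulas differently.

```python
import ast

SIZE_3_FUNCTIONS = {"abs", "ceil", "floor", "round"}


def formula_size(formula):
    """Return the size of a formula by summing the sizes of its parts."""

    def size(node):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            own = 3 if node.func.id in SIZE_3_FUNCTIONS else 4
            return own + sum(size(arg) for arg in node.args)
        if isinstance(node, ast.BinOp):
            if isinstance(node.op, (ast.Add, ast.Sub, ast.Mult)):
                own = 1
            elif isinstance(node.op, ast.Div):
                own = 2
            else:
                raise ValueError("unsupported operator")
            return own + size(node.left) + size(node.right)
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, (ast.USub, ast.UAdd)):
            return size(node.operand)  # assumption: a sign is not counted
        if isinstance(node, (ast.Name, ast.Constant)):
            return 1  # input variable, or numeric constant (assumption)
        raise ValueError("unsupported syntax")

    return size(ast.parse(formula, mode="eval").body)
```

For example, `formula_size("lgamma(1.22706776648686*x)")` adds 4 for lgamma, 1 for the multiplication, 1 for the constant, and 1 for x.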

#### Cross-validation

In the same box where the search options are selected, you can also choose the cross-validation settings. Using cross-validation is recommended because it makes it straightforward to discard overfit models that are more complex than necessary.

The size of the train/test split can be selected from the menu. The default value, "No cross-validation", disables the cross-validation altogether. Three kinds of options are available:

- Percentages like 50/50 and 80/20.
- Fixed training dataset sizes like "100 rows" and "1000 rows".
- A "Custom rows" option where you can specify the exact number of rows for the training set.

It is also possible to select how the training sample should be generated: either as a random selection of rows or as the first rows of your dataset in sequential order.
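The two ways of generating the training sample can be sketched as follows. This is an illustration only: the function `split_rows` and its parameters are hypothetical, and the seed convention (a value >= 0 gives a reproducible split, -1 gives a different split each time) mirrors the random seed option described in the Advanced section.

```python
import random


def split_rows(rows, n_train, randomize=True, seed=-1):
    """Split rows into (train, test), keeping the original row order in both parts."""
    if randomize:
        rng = random.Random(seed if seed >= 0 else None)
        train_indices = set(rng.sample(range(len(rows)), n_train))
    else:
        train_indices = set(range(n_train))  # the first rows, in sequential order
    train = [row for i, row in enumerate(rows) if i in train_indices]
    test = [row for i, row in enumerate(rows) if i not in train_indices]
    return train, test
```

For an 80/20 split of a 1000-row dataset, `n_train` would be 800.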

During the optimization, you can alternate between showing the errors for the training sample and the testing sample by clicking on the "Show cross-validation error" box on the upper right of the interface. With this, overfit solutions can be spotted in real time.

#### Custom search

By default, the program is configured to find formulas such that y = f(x1,x2,x3,...). But, in some cases, you may be interested in specific functional forms. Some examples could be:

- y = f()*x+f() (a line).
- y = f()*x*x + f()*x + f() (a parabola).
- y = f(x1,x2)*x1 + exp(f(x1)/4) + 2 (a formula with terms that should depend on specific variables).

This kind of search is possible with TuringBot's Advanced mode. To enable it, click the "Advanced" button and type your desired equation in the input box that will appear.

The left side of the equation should be the desired variable, and the right side should be the formula that you are trying to find, with unknown terms denoted by f([variables]).

In this custom equation, you can use any base function offered by the program, as well as numerical constants, which must be in integer, floating-point, or exponential notation (like 2, 3.14, or 2.35e-3).

The following conventions must be followed for the unknown terms:

- Unknown terms must be denoted f([variables]). For a constant, use f(). For a function of x, use f(x). For a function of x and z, use f(x,z), and so on.
- For a function of all variables except one, you can use the "~" operator, which excludes variables. y = f(~y) will use all variables as input except for y. y = f(~y,~row) will use all variables except y and row.

During a custom search, the plot will show the left side of the equation as points and the right side of the equation as a line.

## The solutions box

The regression is started by clicking on the play button at the top of the interface. After that, the best solutions found so far will be shown in the "Solutions" box.

Each row corresponds to the best solution of a given size encountered so far. By clicking on a solution, it will be shown in the plot and its stats will be shown in the "Solution info" box. Larger solutions are only shown if they provide a better fit than all smaller solutions.

The size of a solution is defined as the sum of the sizes of the base functions that constitute it (see above).

In the Solutions box, you have the option of sorting the solutions by a balance between size and accuracy by clicking on the "Function" header, which will sort the solutions by (error)^2 * (size). By default, the solutions are sorted by size.
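The balance between size and accuracy can be reproduced outside the program. The sketch below takes four rows from the sample terminal output in the "Output" section and ranks them by (error)^2 * (size). The names `balance_score` and `ranked` are hypothetical, and the ascending order assumes a metric where a lower error is better.

```python
# (size, error, formula) rows copied from the sample terminal output.
solutions = [
    (3, 7890.39, "11.75045370574789*x"),
    (7, 980.126, "lgamma(1.22706776648686*x)"),
    (9, 240.168, "1.062116924609507*acosh(x)*x"),
    (11, 147.484, "1.063188751909768*acosh(x)*(-34.51291396937853+x)"),
]


def balance_score(size, error):
    """Score used for the size/accuracy trade-off: (error)^2 * (size)."""
    return error ** 2 * size


# Lowest score first (assumption: a smaller error is better for the chosen metric).
ranked = sorted(solutions, key=lambda row: balance_score(row[0], row[1]))
for size, error, formula in ranked:
    print(size, error, formula)
```

In this sample the largest formula ranks first, because its error is low enough to outweigh its extra size.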

## Exporting/loading formulas

In the menu, you can find an option called "Set periodic output". There you can choose to enable two options for periodic output:

- Solutions: export formulas in the same format generated by "Export solutions as text".
- Predictions: export your original dataset along with the model predictions. The prediction columns will be named solution_N, where N is the complexity of that solution. If using cross-validation, the training dataset rows will be written to the output file first, followed by the testing dataset rows.

The periodic output files are only saved if new solutions have been found, to avoid saving the same files over and over again in long runs.

#### Loading formulas

The solutions file generated by the periodic output option above can be loaded back into the program, allowing you to restart an optimization from a checkpoint. For that, start a new optimization and then choose the option "Load formulas from file" in the menu.

It's also possible to input your custom formulas into the program using this option. For that, generate a text file with one formula per row, making sure that the formulas do not contain any space characters, and load this file into the program.
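A file of custom formulas in this layout can be generated with a short script. This is a minimal sketch: the function name `write_formulas_file` and the example formulas are hypothetical, and it simply strips all whitespace from each formula before writing one formula per row.

```python
def write_formulas_file(path, formulas):
    """Write one formula per row, with all space characters removed."""
    with open(path, "w", encoding="utf-8") as handle:
        for formula in formulas:
            handle.write("".join(formula.split()) + "\n")
```

For example, `write_formulas_file("seed_formulas.txt", ["2.5 * x + 1", "cos(x)"])` produces a file with the rows `2.5*x+1` and `cos(x)`.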

## Plot settings

In the "Other options" part of the main tab, you can change your plot settings at any time.

For the y-axis, you can choose to plot your target variable, the residual error (difference between a solution and the target data), and the residual error as a percentage of the target data.

For the x-axis, you can choose the row number (1, 2, 3, ...) corresponding to that point, or any of your input variables. Additionally, you can also select the **Observed** option to see an **Observed vs predicted** plot. In this case, the plot also shows a gray line representing a perfect fit for visual reference.

The plot scales can be adjusted in the "Plot scale" menu, where you can choose from regular scale, symlog x, symlog y, symlog x and y, log x, log y, or log x and y. The "symlog" scale allows negative numbers to be visualized on a logarithmic scale, while "log" uses the regular base 10 logarithm.

## Advanced

In the "Advanced" tab of the interface, you can find information about how many formulas have been generated in the current optimization, how long the optimization has been running, and how many formulas are being tried per second. A log message is also generated every time a new solution is encountered so that you can keep track of progress.

A few specialized search settings are also available on this tab:

- Maximum formula size: by default, the complexity of formulas is prevented from becoming larger than 60. With this option, you can allow the program to generate larger formulas, which makes the optimization slower but makes longer formulas possible.
- Maximum history size: only used if one of the history functions is enabled. Sets the maximum length of those functions. Note that if this parameter is set to 20, then the first 20 rows of your dataset are not used for the search, and are only used to calculate the history functions starting from row 21.
- F-score beta parameter: when left at the default value of 1, the F-score metric corresponds to the F1-metric. Values of beta lower than 1 favor precision over recall.
- Random seed for train/test split generation: if a value >= 0 is set, this value will be used as a seed, resulting in the same split every time a new search is started with the same dataset. When the parameter is set to -1, a different random split will be generated each time.
- Normalize the dataset: for each variable, subtract the average and divide by the standard deviation before starting the search. This can speed up the search a lot if your input variables are large. Note that the "sample standard deviation" is used, where the denominator is N-1 instead of N for smaller bias.
- Target variable in history functions: allows you to choose whether your target variable can be used in the history functions.
- Force solutions to include all variables: allows you to discard solutions that do not feature all input variables.
- Bound search mode: this advanced search mode allows you to discover formulas that are upper or lower bounds for the target variable.
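To make the effect of the F-score beta parameter concrete, here is a sketch of the standard F-beta definition for a binary target (1 for relevant cases, 0 for all others). The function name `f_beta` is hypothetical, and how TuringBot treats targets with several positive integer classes is not covered here.

```python
def f_beta(targets, predictions, beta=1.0):
    """Standard F-beta score for binary labels, where 1 marks the relevant class."""
    pairs = list(zip(targets, predictions))
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)
    if true_pos == 0:
        return 0.0  # no relevant case was identified
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

With beta = 1 this reduces to 2*(precision*recall)/(precision+recall). When precision is higher than recall, lowering beta below 1 raises the score, which is what favoring precision over recall means in practice.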

## Command-line usage

TuringBot is also a console application that can be executed in a fully automated and customizable way. The general usage is the following:

    TuringBot - Symbolic Regression Software

    usage: turingbot [--help] INPUT_FILE [SETTINGS_FILE] [--outfile FILENAME]
                     [--predictions-file FILENAME] [--formulas-file FILENAME]
                     [--threads N]

    required arguments:
      INPUT_FILE                   the full path to your input file:
                                   use /foo/bar/file.txt not ./file.txt or file.txt

    optional arguments:
      SETTINGS_FILE                the full path to the settings file to use for this optimization
      --help                       show this help message
      --outfile FILENAME           write the best formulas found so far to this file
      --predictions-file FILENAME  write the predictions obtained from the best formulas
                                   found so far to this file
      --formulas-file FILENAME     load seed formulas from this file. the file generated by
                                   --outfile can be later used as input here
      --threads N                  use this number of threads; the default is the total
                                   number available in your system

If no configuration file is provided, the program will use the last column in the input file as the target variable and all other columns as input variables.

The best formulas found so far will be written to the terminal every second. If you set an output file with the --outfile option, those formulas will also be regularly saved to the output file.

Note that to run the command above on Windows you have to first cd to the installation directory and then run with .\TuringBot.exe:

```
cd "C:\Program Files (x86)\TuringBot"
.\TuringBot.exe INPUT_FILE
```
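For scripted runs, the command line can be assembled programmatically. The sketch below only builds the argument list from the options shown in the usage text; the helper name `build_command` is hypothetical and nothing is executed by it.

```python
def build_command(executable, input_file, settings_file=None, outfile=None, threads=None):
    """Assemble a TuringBot command line as a list of arguments."""
    command = [executable, input_file]
    if settings_file is not None:
        command.append(settings_file)  # optional positional SETTINGS_FILE
    if outfile is not None:
        command += ["--outfile", outfile]
    if threads is not None:
        command += ["--threads", str(threads)]
    return command
```

The resulting list can be passed to `subprocess.Popen` from the standard library. Keeping the arguments in a list avoids quoting problems with paths that contain spaces, such as the Windows installation directory.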

#### Settings file

The search can be fully customized by providing the program with a settings file. Here is an example:

```
search_metric = 4 # Search metric. 1: Mean relative error, 2: Classification accuracy, 3: Mean error, 4: RMS error, 5: F-score, 6: Correlation coefficient, 7: Hybrid (CC+RMS), 8: Maximum error, 9: Maximum relative error, 10: Nash-Sutcliffe efficiency, 11: Binary cross-entropy, 12: Matthews correlation coefficient (MCC), 13: Residual sum of squares (RSS), 14: Root mean squared log error (RMSLE)
train_test_split = -1 # Train/test split. Options are as follows: -1 for no cross-validation, 50, 60, 70, 75, or 80 for percentages, and 100, 1000, 10000, or 100000 for predefined row counts. Use negative numbers for a custom number of rows (e.g., set -200 to use 200 rows for training).
test_sample = 1 # Test sample. 1: Chosen randomly, 2: The last points
train_test_seed = -1 # Random seed for train/test split generation when the test sample is chosen randomly.
integer_constants = 0 # Integer constants only. 0: Disabled, 1: Enabled
bound_search_mode = 0 # Bound search mode. 0: Deactivated, 1: Lower bound search, 2: Upper bound search
maximum_formula_complexity = 60 # Maximum formula complexity.
history_size = 20 # History size.
fscore_beta = 1 # F-score beta parameter.
normalize_dataset = 0 # Normalize the dataset before starting the optimization? 0: No, 1: Yes
allow_target_delay = 0 # Allow the target variable in the lag functions? 0: No, 1: Yes
force_all_variables = 0 # Force solutions to include all input variables? 0: No, 1: Yes
custom_formula = # Custom formula for the search. If empty, the program will try to find the last column as a function of the remaining ones.
allowed_functions = + * / pow fmod sin cos tan asin acos atan exp log log2 log10 sqrt sinh cosh tanh asinh acosh atanh abs floor ceil round sign tgamma lgamma erf # Allowed functions.
```

Settings are changed by modifying the values after the = characters. The comments after # characters are ignored. The allowed functions are set by directly providing their names to the allowed_functions variable, separated by spaces.

The order of the parameters inside the settings file does not matter.
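Because the format is a plain list of `name = value` lines with optional `#` comments, a settings file is easy to read or generate from a script. The following sketch assumes only the rules stated above; the function names `parse_settings` and `render_settings` are hypothetical, and all values are kept as strings.

```python
def parse_settings(text):
    """Parse 'name = value  # comment' lines into a dictionary of strings."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the comment part
        if "=" not in line:
            continue  # blank or comment-only line
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip()
    return settings


def render_settings(settings):
    """Write the dictionary back as 'name = value' lines."""
    return "".join(f"{name} = {value}\n" for name, value in settings.items())
```

A typical use is to parse a file exported from the interface, change one value such as `settings["search_metric"] = "3"`, and write the result to a new file for a batch of command-line runs.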

A convenient way of generating a settings file is to set things as you like them in the graphical interface, and then simply export the settings from the menu using the "Save settings" option.

#### Output

A typical terminal output of TuringBot is the following:

    Formulas generated: 1135108
    Size  Error    Function
    1     177813   186275.6979035278
    3     7890.39  11.75045370574789*x
    5     6895.25  11.93943363494786*(-472.8408126495318+x)
    7     980.126  lgamma(1.22706776648686*x)
    8     674.279  x*(0.6868942922296394+asinh(x))
    9     240.168  1.062116924609507*acosh(x)*x
    11    147.484  1.063188751909768*acosh(x)*(-34.51291396937853+x)

The first line reports how many formulas have been attempted so far.

The next lines contain the formulas as well as their corresponding sizes and errors.
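If you capture this output from a script, it can be turned into structured data. The sketch below is one possible approach, assuming the report is laid out with one formula per line and that formulas contain no spaces; the function name `parse_report` is hypothetical.

```python
def parse_report(text):
    """Return (formulas_generated, rows), where rows are (size, error, formula) tuples."""
    generated = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Formulas generated:"):
            generated = int(line.split(":", 1)[1])
            continue
        parts = line.split(None, 2)  # size, error, formula
        if len(parts) == 3 and parts[0].isdigit():
            rows.append((int(parts[0]), float(parts[1]), parts[2]))
    return generated, rows
```

The header row is skipped automatically because its first field is not a number.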

## Running TuringBot from Python

When you install TuringBot, you also receive a small Python library designed to make it very easy to call the software from within Python.

Below we provide examples of usage for each OS, but the basic idea is that this library provides a simulation class:

    sim = tb.simulation()

This class has a start_process method that starts TuringBot in the background:

    sim.start_process(path, input_file, threads=4, config=config_file)

The parameters that you see are:

- path (required): the path to the TuringBot executable.
- input_file (required): the path to the input file. By default, the last column will be set as the target variable, but you can fully customize the search by using a configuration file.
- threads=4 (optional): the number of threads that you want the program to use.
- config=config_file (optional): the path to the configuration file.

Once a simulation is started, you can refresh the current formulas with:

    sim.refresh_functions()

and then access the formulas in the form of a list with:

    sim.functions

You can also find general information about the number of formulas tried so far as well as error messages with:

    sim.info

To finish a simulation and kill the TuringBot process, call:

    sim.terminate_process()
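The calls above can be combined into a small helper that guarantees the background process is stopped even if something goes wrong in between. This is a sketch, not part of the bundled library: the function name `collect_formulas` and its `duration` and `interval` parameters are hypothetical, and it relies only on the methods and attributes described in this section.

```python
import time


def collect_formulas(sim, path, input_file, duration=10, interval=2, **options):
    """Run a search for about `duration` seconds and return (formulas, info)."""
    sim.start_process(path, input_file, **options)
    try:
        deadline = time.monotonic() + duration
        while time.monotonic() < deadline:
            time.sleep(interval)
            sim.refresh_functions()  # update sim.functions and sim.info
        return list(sim.functions), sim.info
    finally:
        sim.terminate_process()  # always stop the TuringBot process
```

It would be called as `collect_formulas(tb.simulation(), path, input_file, threads=4, config=config_file)`, with the paths set as in the per-OS examples.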

#### Windows

An example of the usage of TuringBot's Python library on Windows is the following.

```
import sys
import time

# Make the Python library bundled with TuringBot importable.
sys.path.insert(1, r'C:\Program Files (x86)\TuringBot\resources')
import turingbot as tb

path = r'C:\Program Files (x86)\TuringBot\TuringBot.exe'  # TuringBot executable
input_file = r'C:\Users\YourUsername\Desktop\input.txt'
config_file = r'C:\Users\YourUsername\Desktop\settings.cfg'

sim = tb.simulation()
sim.start_process(path, input_file, threads=4, config=config_file)

time.sleep(10)  # let the search run for a while

sim.refresh_functions()  # fetch the current best formulas
print(*sim.functions, sep='\n')
print(sim.info)

sim.terminate_process()  # stop the background TuringBot process
```

#### Linux

An example of usage of TuringBot's Python library on Linux is the following:

```
import sys
import time

# Make the Python library bundled with TuringBot importable.
sys.path.insert(1, r'/usr/lib/turingbot/resources')
import turingbot as tb

path = r'/usr/lib/turingbot/TuringBot'  # TuringBot executable
input_file = r'/home/user/input.txt'
config_file = r'/home/user/settings.cfg'

sim = tb.simulation()
sim.start_process(path, input_file, threads=4, config=config_file)

time.sleep(10)  # let the search run for a while

sim.refresh_functions()  # fetch the current best formulas
print(*sim.functions, sep='\n')
print(sim.info)

sim.terminate_process()  # stop the background TuringBot process
```

#### macOS

An example of usage of TuringBot's Python library on macOS is the following:

```
import sys
import time

# Make the Python library bundled with TuringBot importable.
sys.path.insert(1, r'/Applications/TuringBot.app/Contents/Resources')
import turingbot as tb

path = r'/Applications/TuringBot.app/Contents/MacOS/TuringBot'  # TuringBot executable
input_file = r'/Users/YourUsername/Desktop/input.txt'
config_file = r'/Users/YourUsername/Desktop/settings.cfg'

sim = tb.simulation()
sim.start_process(path, input_file, threads=4, config=config_file)

time.sleep(10)  # let the search run for a while

sim.refresh_functions()  # fetch the current best formulas
print(*sim.functions, sep='\n')
print(sim.info)

sim.terminate_process()  # stop the background TuringBot process
```