Introduction

TuringBot, named after the great mathematician and computation pioneer Alan Turing, is desktop software for symbolic regression. It uses a novel algorithm based on simulated annealing to discover mathematical formulas from data with high efficiency.

Data is read from TXT or CSV files, and the target and input columns can be selected from the interface. A variety of error metrics are available for the search, allowing the program to find formulas that solve both regression and classification problems.

The main features of TuringBot are the following:

  • Pareto optimization: the software simultaneously tries to find the best formulas of all possible sizes. It gives you not only a single formula as output, but a set of formulas of increasing complexity to choose from.
  • Built-in cross-validation: allows you to easily rule out overfit solutions.
  • Export solutions as Python, C/C++, LaTeX, or plain text.
  • Multiprocessing.
  • Written in a low-level programming language (no Python). Extremely fast.
  • Has a command-line mode that allows you to automate the program.

Input Data Format

Input files are selected in the interface by clicking on the "Input file" button.

The input file name must end in .txt or .csv, and the file must contain columns representing different variables separated by spaces, commas, semicolons, or tabs. Those values can be integers, floats, or floats in exponential notation (%d, %f, or %e), with decimal parts separated by a dot (1.61803 and not 1,61803).

Optionally, a header containing the variable names may be present in the first line of the file. In this case, those names will be used in the formulas instead of the default names, which are col1, col2, col3, etc. You also have the option of using the row number (1, 2, 3...) as an input variable, which may be useful for time series data.

For example, the following is a valid input file:

x y z
0.01231 0.99992 0.99985
0.23180 0.97325 0.94723
0.45128 0.89989 0.80980
0.67077 0.78334 0.61363
0.89026 0.62921 0.39591
1.00000 0.54030 0.29193

Important Notes

The following elements must not be present in the input file:

  • Text or string data (except in the header).
  • Date values (e.g., 2024-09-15 or 09/15/2024).
  • Currency symbols (e.g., $).
  • Percentage symbols (e.g., %).
  • Commas in numeric values (e.g., 6,000.00 should be 6000.00).
  • Parentheses around numbers (e.g., (100)).
  • Spaces in header names (use underscores instead: variable_name).

If any of the above elements are present, the program will fail to parse the file correctly.
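The rules above can be checked before loading a file into the program. The following is a minimal sketch of such a pre-flight check, not part of TuringBot itself: the names NUMBER, SEPARATORS, and check_rows are hypothetical, and the sketch assumes that the first line is a header whenever it contains any non-numeric value.

```python
import re

# Integers, floats, or exponential notation with a dot as decimal separator.
NUMBER = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?")
# Columns may be separated by spaces, tabs, commas, or semicolons.
SEPARATORS = re.compile(r"[ \t,;]+")


def check_rows(lines):
    """Return a list of problems found in the lines of an input file."""
    rows = [SEPARATORS.split(line.strip()) for line in lines if line.strip()]
    if not rows:
        return ["file is empty"]

    # Assumption: a first line with any non-numeric value is a header.
    has_header = not all(NUMBER.fullmatch(tok) for tok in rows[0])
    width = len(rows[0])
    start = 1 if has_header else 0

    problems = []
    for number, row in enumerate(rows[start:], start=start + 1):
        # Thousands separators or spaces in header names change the column count.
        if len(row) != width:
            problems.append(
                f"row {number}: expected {width} columns, found {len(row)}"
            )
        # Dates, currency, percentages, and text all fail the numeric pattern.
        for tok in row:
            if not NUMBER.fullmatch(tok):
                problems.append(f"row {number}: non-numeric value {tok!r}")
    return problems


if __name__ == "__main__":
    sample = ["x y z", "0.01231 0.99992 0.99985", "0.23180 $0.97 0.94723"]
    for problem in check_rows(sample):
        print(problem)
```

To check a real file, pass its lines, for example check_rows(open(path).read().splitlines()).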

Search options

Before starting the search, you can select a variety of search settings, including the error metric and the base functions that should be used in the optimization.

Error metrics

The available error metrics are:

  • RMS error: Root mean square error.
  • Mean relative error: Average of (absolute value of the error) / (absolute value of the target variable). With that, the convergence is in terms of relative error instead of absolute error.
  • Classification accuracy: (correct predictions) / (number of data points). Only useful for integer target values and classification problems.
  • Mean error: Average of (absolute value of the error). Similar to RMS error, but it puts less emphasis on outliers.
  • F-score: 2*(precision*recall)/(precision+recall) when "F-score beta parameter" is set to 1. See here an image that explains what those two quantities are. This metric is useful for classification problems on highly imbalanced datasets, where the target variable is 0 for the majority of inputs and a positive integer for a few cases that need to be identified. If the classification is binary, then the categories can be 1 (relevant cases) and 0 (all other cases).
  • Correlation coefficient: Corresponds to the Pearson correlation coefficient. Useful for quickly getting the overall shape of the output right without attention to scales.
  • Hybrid (CC+RMS): The geometric mean between the correlation coefficient and the RMS error. This is a compromise between the attention to scale of the RMS metric and the speed of the CC metric.
  • Maximum error: The maximum absolute difference between the predictions of the model and the target variable.
  • Maximum relative error: The maximum of (absolute value of the error) / (absolute value of the target variable).
  • Nash-Sutcliffe efficiency: A metric that resembles the correlation coefficient and is commonly used in hydrological applications. See the definition in this paper.
  • Binary cross-entropy: Used for solving binary classification problems in terms of probabilities. To use this metric, your target variable must contain two (and only two) classes represented by the numbers 0 and 1.
  • Matthews correlation coefficient: A classification metric that takes into account true positives, true negatives, false positives, and false negatives and can be used even if the categories have imbalanced numbers of elements. To use this metric, your "negative" target variable must be represented by the number 0, and your "positive" variables must be represented by one or more positive integer numbers. See this paper for details.
  • Residual sum of squares (RSS): The sum of (predicted - observed)^2. Very similar to the RMS metric but without taking the average and the square root.
  • Root mean squared log error (RMSLE): The square root of the average of (log(1 + predicted) - log(1 + observed))^2. Like mean relative error, this metric can be applied to target variables that span multiple orders of magnitude, but it penalizes large errors less aggressively, and is thus less sensitive to outliers. It requires the target variable to be strictly positive.
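To make some of the definitions above concrete, here is a minimal Python sketch of the regression metrics, written directly from the descriptions in this section. The function names are hypothetical, and this is not TuringBot's internal code; the program's own implementation may differ in details such as how zero target values are handled.

```python
import math


def rms_error(predicted, observed):
    """Root mean square error."""
    total = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(total / len(observed))


def mean_error(predicted, observed):
    """Average of the absolute value of the error."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


def mean_relative_error(predicted, observed):
    """Average of |error| / |target|. Fails if a target value is zero."""
    total = sum(abs(p - o) / abs(o) for p, o in zip(predicted, observed))
    return total / len(observed)


def maximum_error(predicted, observed):
    """Largest absolute difference between prediction and target."""
    return max(abs(p - o) for p, o in zip(predicted, observed))


def residual_sum_of_squares(predicted, observed):
    """Sum of (predicted - observed)^2, without average or square root."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))


def rmsle(predicted, observed):
    """Square root of the average of (log(1 + predicted) - log(1 + observed))^2."""
    total = sum(
        (math.log1p(p) - math.log1p(o)) ** 2 for p, o in zip(predicted, observed)
    )
    return math.sqrt(total / len(observed))


if __name__ == "__main__":
    observed = [1.0, 2.0, 4.0]
    predicted = [1.0, 2.0, 2.0]
    print("RMS error:", rms_error(predicted, observed))
    print("Mean error:", mean_error(predicted, observed))
    print("Mean relative error:", mean_relative_error(predicted, observed))
```

In this example a single large miss dominates the RMS error more than the mean error, which matches the remark that the mean error puts less emphasis on outliers.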

Base functions

The function names follow the conventions of the C math library. You can find their definitions on this page.

The exceptions are the logical functions (logical_and(x, y), greater(x, y), etc.), the history functions (delay and moving average), and sign (the sign function), which are defined internally in TuringBot.

The moving average of a variable is defined as its average value in the N rows before the present one. For instance, if the successive values for variable x are (1, 2, 3, 4, 5), then at x = 5, the value for moving_average(x,3) will be (2+3+4)/3. In some systems, this may be considered a moving average with a lag of 1.
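The definition above can be reproduced with a few lines of Python. This is only an illustrative sketch of the lagged average described in the text, not TuringBot's own code; returning None for rows without enough history is an assumption made here for clarity.

```python
def moving_average(values, n):
    """Average of the n rows before each row (the current row is excluded)."""
    result = []
    for i in range(len(values)):
        if i < n:
            # Not enough previous rows to fill the window.
            result.append(None)
        else:
            window = values[i - n:i]  # the n rows before row i
            result.append(sum(window) / n)
    return result


if __name__ == "__main__":
    # For the row where x = 5, the window is (2, 3, 4), so the value is 3.0.
    print(moving_average([1, 2, 3, 4, 5], 3))  # [None, None, None, 2.0, 3.0]
```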

The sizes of the base functions are defined as:

  • Size 1: an input variable, sum, subtraction, and multiplication.
  • Size 2: division.
  • Size 3: abs(x), ceil(x), floor(x), and round(x).
  • Size 4: all other functions.
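As an illustration of how these sizes add up, the sketch below computes the size of a formula written in the plain-text syntax shown in this document. It relies on Python's ast module to parse the expression. Two points are assumptions rather than documented behavior: numerical constants are counted as size 1 (the sample terminal output later in this document lists a lone constant with size 1), and a negative literal such as -34.5 is counted as a single constant. The names formula_size and SIZE_3_FUNCTIONS are hypothetical, and the program's internal accounting may differ.

```python
import ast

SIZE_3_FUNCTIONS = {"abs", "ceil", "floor", "round"}


def formula_size(formula):
    """Sum the sizes of the base functions that make up a formula string."""
    return _node_size(ast.parse(formula, mode="eval").body)


def _node_size(node):
    # Input variables are size 1; constants are assumed to be size 1 as well.
    if isinstance(node, (ast.Name, ast.Constant)):
        return 1
    # Assumption: a negative literal counts as one constant.
    if (
        isinstance(node, ast.UnaryOp)
        and isinstance(node.op, ast.USub)
        and isinstance(node.operand, ast.Constant)
    ):
        return 1
    if isinstance(node, ast.BinOp):
        if isinstance(node.op, (ast.Add, ast.Sub, ast.Mult)):
            own = 1
        elif isinstance(node.op, ast.Div):
            own = 2
        else:
            raise ValueError("unsupported operator in formula")
        return own + _node_size(node.left) + _node_size(node.right)
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
        own = 3 if node.func.id in SIZE_3_FUNCTIONS else 4
        return own + sum(_node_size(arg) for arg in node.args)
    raise ValueError("unsupported element in formula")


if __name__ == "__main__":
    print(formula_size("11.75045370574789*x"))          # 3
    print(formula_size("lgamma(1.22706776648686*x)"))   # 7
```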

Cross-validation

The cross-validation settings are selected in the same box as the search options. Using cross-validation is recommended, since it makes it straightforward to discard overfit models that are more complex than necessary.

The size of the train/test split can be selected from the menu. The default value, "No cross-validation", disables the cross-validation altogether. Three kinds of options are available:

  • Percentages like 50/50 and 80/20.
  • Fixed training dataset sizes like "100 rows" and "1000 rows".
  • A "Custom rows" option where you can specify the exact number of rows for the training set.

It is also possible to select how the training sample should be generated: either as a random selection of rows or as the first rows of your dataset in sequential order.
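The two ways of generating the training sample can be pictured with the following Python sketch. It only illustrates the idea of a percentage split with random or sequential selection; the function name split_rows and its parameters are hypothetical, and the rows that TuringBot actually selects for a given seed will not match the ones produced here.

```python
import random


def split_rows(rows, train_fraction=0.8, randomize=True, seed=-1):
    """Split rows into (train, test) by fraction, randomly or sequentially."""
    n_train = int(len(rows) * train_fraction)
    indices = list(range(len(rows)))
    if randomize:
        # A negative seed means "different split each time" in this sketch.
        rng = random.Random(None if seed < 0 else seed)
        rng.shuffle(indices)
    # Keep the original row order inside each subset.
    train_idx = sorted(indices[:n_train])
    test_idx = sorted(indices[n_train:])
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


if __name__ == "__main__":
    data = list(range(10))
    # Sequential 80/20 split: the first 8 rows train, the last 2 test.
    print(split_rows(data, 0.8, randomize=False))
    # Random split with a fixed seed: the same split on every call.
    print(split_rows(data, 0.8, randomize=True, seed=3))
```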

During the optimization, you can alternate between showing the errors for the training sample and the testing sample by clicking on the "Show cross-validation error" box on the upper right of the interface. With this, overfit solutions can be spotted in real time.

By default, the program is configured to find formulas such that y = f(x1,x2,x3,...). But, in some cases, you may be interested in specific functional forms. Some examples could be:

  • y = f()*x+f() (a line).
  • y = f()*x*x + f()*x + f() (a parabola).
  • y = f(x1,x2)*x1 + exp(f(x1)/4) + 2 (a formula with terms that should depend on specific variables).

This kind of search is possible with TuringBot's Advanced mode. To enable it, click the "Advanced" button and type your desired equation in the input box that will appear.

The left side of the equation should be the desired variable, and the right side should be the formula that you are trying to find, with unknown terms denoted by f([variables]).

In this custom equation, you can use any base function offered by the program, as well as numerical constants, which must be in integer, floating-point, or exponential notation (like 2, 3.14, or 2.35e-3).

The following conventions must be followed for the unknown terms:

  • Unknown terms must be denoted f([variables]). For a constant, use f(). For a function of x, use f(x). For a function of x and z, use f(x,z). Etc.
  • For a function of all variables except one, you can use the "~" operator, which excludes variables. y = f(~y) will use all variables as input except for y. y = f(~y,~row) will use all variables except y and row.
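The conventions above can be illustrated with a small Python sketch that lists which variables an unknown term is allowed to use. This is purely an illustration of the notation, not how TuringBot parses equations. The function name unknown_term_inputs is hypothetical, the example assumes a dataset with columns x, y, and z plus the row number, and the sketch rejects terms that mix plain and "~" arguments because the text does not say how those are treated.

```python
def unknown_term_inputs(variables, term):
    """Return the variables that a term like f(x,z) or f(~y) may depend on."""
    term = term.strip()
    if not (term.startswith("f(") and term.endswith(")")):
        raise ValueError("unknown terms must be written as f([variables])")
    args = [a.strip() for a in term[2:-1].split(",") if a.strip()]
    excluded = [a[1:] for a in args if a.startswith("~")]
    included = [a for a in args if not a.startswith("~")]
    if excluded and included:
        # Not covered by the conventions described in the text.
        raise ValueError("mixing plain and ~ arguments is not handled here")
    if excluded:
        # "~" means: every variable except the ones listed.
        return [v for v in variables if v not in excluded]
    # f() yields an empty list, which denotes a constant.
    return included


if __name__ == "__main__":
    variables = ["x", "y", "z", "row"]
    print(unknown_term_inputs(variables, "f()"))          # []
    print(unknown_term_inputs(variables, "f(x,z)"))       # ['x', 'z']
    print(unknown_term_inputs(variables, "f(~y)"))        # ['x', 'z', 'row']
    print(unknown_term_inputs(variables, "f(~y,~row)"))   # ['x', 'z']
```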

During a custom search, the plot will show the left side of the equation as points and the right side of the equation as a line.

The solutions box

The regression is started by clicking on the play button at the top of the interface. After that, the best solutions found so far will be shown in the "Solutions" box.

Each row corresponds to the best solution of a given size encountered so far. Clicking on a solution shows it in the plot and shows its stats in the "Solution info" box. Larger solutions are only shown if they provide a better fit than all smaller solutions.

The size of a solution is defined as the sum of the sizes of the base functions that constitute it (see above).

In the Solutions box, you have the option of sorting the solutions by a balance between size and accuracy by clicking on the "Function" header, which will sort the solutions by (error)^2 * (size). By default, the solutions are sorted by size.
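The balance score can be reproduced outside the program. The sketch below applies (error)^2 * (size) to the sizes and errors from the sample terminal output in the Output section of this document. Sorting in ascending order, so that the best trade-off comes first, is an assumption, and the name balance_score is hypothetical.

```python
def balance_score(size, error):
    """Balance between size and accuracy: (error)^2 * (size)."""
    return error ** 2 * size


# (size, error) pairs taken from the sample terminal output.
solutions = [
    (1, 177813.0),
    (3, 7890.39),
    (5, 6895.25),
    (7, 980.126),
    (8, 674.279),
    (9, 240.168),
    (11, 147.484),
]

# Assumption: lower scores are listed first.
ranked = sorted(solutions, key=lambda s: balance_score(s[0], s[1]))

if __name__ == "__main__":
    for size, error in ranked:
        print(size, error, balance_score(size, error))
```

With these numbers, the size 5 solution ranks below the size 3 one: its slightly lower error does not make up for the two extra units of size.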

Exporting/loading formulas

In the menu, you can find an option called "Set periodic output". There you can choose to enable two options for periodic output:

  • Solutions: export formulas in the same format generated by "Export solutions as text".
  • Predictions: export your original dataset along with the model predictions. The prediction columns will be named solution_N, where N is the complexity of that solution. If using cross-validation, the training dataset rows will be written to the output file first, followed by the testing dataset rows.

The periodic output files are only saved if new solutions have been found, to avoid saving the same files over and over again in long runs.

Loading formulas

The solutions file generated by the periodic output option above can be loaded back into the program, allowing you to restart an optimization from a checkpoint. For that, start a new optimization and then choose the option "Load formulas from file" in the menu.

It's also possible to input your custom formulas into the program using this option. For that, generate a text file with one formula per row, making sure that the formulas do not contain any space characters, and load this file into the program.

Plot settings

In the "Other options" part of the main tab, you can change your plot settings at any time.

For the y-axis, you can choose to plot your target variable, the residual error (difference between a solution and the target data), or the residual error as a percentage of the target data.

For the x-axis, you can choose the row number (1, 2, 3, ...) corresponding to that point, or any of your input variables. You can also select the Observed option to see an observed vs. predicted plot. In this case, the plot also shows a gray line representing a perfect fit for visual reference.

The plot scales can be adjusted in the "Plot scale" menu, where you can choose from regular scale, symlog x, symlog y, symlog x and y, log x, log y, or log x and y. The "symlog" scale allows negative numbers to be visualized on a logarithmic scale, while "log" uses the regular base 10 logarithm.

Advanced

In the "Advanced" tab of the interface, you can find information about how many formulas have been generated in the current optimization, how long the optimization has been running, and how many formulas are being tried per second. A log message is also generated every time a new solution is encountered so that you can keep track of progress.

A few specialized search settings are also available on this tab:

  • Maximum formula size: by default, the complexity of formulas is prevented from becoming larger than 60. With this option, you can allow the program to generate larger formulas, which makes the optimization slower but makes longer formulas possible.
  • Maximum history size: only used if one of the history functions is enabled. Sets the maximum length of those functions. Note that if this parameter is set to 20, then the first 20 rows of your dataset are not used for the search, and are only used to calculate the history functions starting from row 21.
  • F-score beta parameter: when left at the default value of 1, the F-score metric corresponds to the F1-metric. Values of beta lower than 1 favor precision over recall.
  • Random seed for train/test split generation: if a value >= 0 is set, this value will be used as a seed, resulting in the same split every time a new search is started with the same dataset. When the parameter is set to -1, a different random split will be generated each time.
  • Normalize the dataset: for each variable, subtract the average and divide by the standard deviation before starting the search. This can speed up the search a lot if your input variables are large. Note that the "sample standard deviation" is used, where the denominator is N-1 instead of N for smaller bias.
  • Target variable in history functions: allows you to choose whether your target variable can be used in the history functions.
  • Force solutions to include all variables: allows you to discard solutions that do not feature all input variables.
  • Bound search mode: this advanced search mode allows you to discover formulas that are upper or lower bounds for the target variable.
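The transformation described under "Normalize the dataset" can be sketched in Python as follows. This is an illustration of the formula only, not TuringBot's code; the function name normalize is hypothetical. Python's statistics.stdev computes the sample standard deviation with the N-1 denominator mentioned above.

```python
import statistics


def normalize(values):
    """Subtract the average and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # denominator N-1
    return [(v - mean) / std for v in values]


if __name__ == "__main__":
    # Mean is 2.0 and sample standard deviation is 1.0.
    print(normalize([1.0, 2.0, 3.0]))  # [-1.0, 0.0, 1.0]
```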

Command-line usage

TuringBot is also a console application that can be executed in a fully automated and customizable way. The general usage is the following:

TuringBot - Symbolic Regression Software

Usage: turingbot [--help] INPUT_FILE [SETTINGS_FILE]
                 [--outfile FILENAME]
                 [--predictions-file FILENAME]
                 [--formulas-file FILENAME]
                 [--threads N]
                 [--search-metric VALUE]
                 [--train-test-split VALUE]
                 [--test-sample VALUE]
                 [--train-test-seed VALUE]
                 [--bound-search-mode VALUE]
                 [--maximum-formula-complexity VALUE]
                 [--history-size VALUE]
                 [--fscore-beta VALUE]
                 [--integer-constants]
                 [--normalize-dataset]
                 [--allow-target-delay]
                 [--force-all-variables]
                 [--custom-formula STRING]
                 [--allowed-functions STRING]

Required arguments:
  INPUT_FILE                    The full path to your input file.
                                Use /foo/bar/file.txt,
                                not ./file.txt or file.txt.

Optional arguments:
  SETTINGS_FILE                 The full path to the settings file to use
                                for this optimization.
  --help                        Show this help message.
  --outfile FILENAME            Write the best formulas found so far
                                to this file.
  --predictions-file FILENAME   Write the predictions obtained from the best
                                formulas found so far to this file.
  --formulas-file FILENAME      Load seed formulas from this file.
                                The file generated by --outfile can be
                                later used as input here.
  --threads N                   Use this number of threads. The default is
                                the total number available in your system.
  --search-metric VALUE         Specify the search metric to use:
                                1: Mean relative error
                                2: Classification accuracy
                                3: Mean error
                                4: RMS error (default)
                                5: F-score
                                6: Correlation coefficient
                                7: Hybrid (CC + RMS)
                                8: Maximum error
                                9: Maximum relative error
                                10: Nash-Sutcliffe efficiency
                                11: Binary cross-entropy
                                12: Matthews correlation coefficient (MCC)
                                13: Residual sum of squares (RSS)
                                14: Root mean squared log error (RMSLE)
  --train-test-split VALUE      Set the train-test split ratio:
                                -1: No cross-validation (default)
                                50, 60, 70, 75, 80: Percentages
                                100, 1000, 10000, 100000: Predefined row counts
                                Negative numbers: Custom row count (e.g., -200 for 200 rows)
  --test-sample VALUE           Specify the test sample selection method:
                                1: Chosen randomly (default)
                                2: The last points
  --train-test-seed VALUE       Set the seed for train-test split (-1 for random).
  --bound-search-mode VALUE     Set the bound search mode:
                                0: Deactivated (default)
                                1: Lower bound search
                                2: Upper bound search
  --maximum-formula-complexity VALUE
                                Set the maximum formula complexity (default: 60).
  --history-size VALUE          Specify the history size (default: 20).
  --fscore-beta VALUE           Set the F-score beta value (default: 1).
  --integer-constants           Force all numerical constants to be integers.
  --normalize-dataset           Normalize the dataset before optimization.
  --allow-target-delay          Allow the target variable in lag functions.
  --force-all-variables         Force solutions to include all input variables.
  --custom-formula STRING       Provide a custom formula for the search.
                                If empty, the program will try to find the last
                                column as a function of the remaining ones.
  --allowed-functions STRING    Specify allowed functions (default: "+ * / pow
                                fmod sin cos tan asin acos atan exp log log2
                                log10 sqrt sinh cosh tanh asinh acosh atanh
                                abs floor ceil round sign tgamma lgamma erf").

If no configuration file is provided, the program will use the last column in the input file as the target variable and all other columns as input variables.

The best formulas found so far will be written to the terminal every second. If you set an output file with the --outfile option, those formulas will also be regularly saved to the output file.

Note that to run the command above on Windows you have to first cd to the installation directory and then run with .\TuringBot.exe. The directory path contains spaces and parentheses, so it should be quoted:

cd "C:\Program Files (x86)\TuringBot"
.\TuringBot.exe INPUT_FILE

Examples


Windows (PowerShell, where the backtick continues a line):

cd "C:\Program Files (x86)\TuringBot"
.\TuringBot.exe "C:\Users\YourName\Documents\climate_research\temperature_data.txt" `
   --threads 8 `
   --search-metric 7 `
   --train-test-split 80 `
   --custom-formula "avg_temp = f(solar_radiation, humidity, wind_speed)" `
   --allowed-functions "+ * / pow sin cos exp log sqrt"

macOS:

/Applications/TuringBot.app/Contents/MacOS/TuringBot /Users/researcher/climate_data.txt \
   --threads 8 \
   --search-metric 7 \
   --train-test-split 80 \
   --custom-formula "avg_temp = f(solar_radiation, humidity, wind_speed)" \
   --allowed-functions "+ * / pow sin cos exp log sqrt"

Linux:

turingbot /home/researcher/climate_data.txt \
   --threads 8 \
   --search-metric 7 \
   --train-test-split 80 \
   --custom-formula "avg_temp = f(solar_radiation, humidity, wind_speed)" \
   --allowed-functions "+ * / pow sin cos exp log sqrt"

Output

A typical terminal output of TuringBot is the following:

Formulas generated: 1135108 
Size   Error       Function 
1      177813      186275.6979035278 
3      7890.39     11.75045370574789*x 
5      6895.25     11.93943363494786*(-472.8408126495318+x) 
7      980.126     lgamma(1.22706776648686*x) 
8      674.279     x*(0.6868942922296394+asinh(x)) 
9      240.168     1.062116924609507*acosh(x)*x 
11     147.484     1.063188751909768*acosh(x)*(-34.51291396937853+x) 

The first line reports how many formulas have been attempted so far.

The next lines contain the formulas as well as their corresponding sizes and errors.
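If you capture this terminal output in a script, the table can be turned into structured data. The following Python sketch assumes the layout shown above (size, error, and formula separated by whitespace, with formulas that contain no spaces). The function name parse_solutions is hypothetical and not part of TuringBot's Python library.

```python
def parse_solutions(text):
    """Extract (size, error, formula) tuples from TuringBot terminal output."""
    solutions = []
    for line in text.splitlines():
        parts = line.split(None, 2)
        # Solution rows start with an integer size; other lines are skipped.
        if len(parts) == 3 and parts[0].isdigit():
            size, error, formula = parts
            solutions.append((int(size), float(error), formula.strip()))
    return solutions


if __name__ == "__main__":
    sample = """Formulas generated: 1135108
Size   Error       Function
1      177813      186275.6979035278
3      7890.39     11.75045370574789*x
7      980.126     lgamma(1.22706776648686*x)
"""
    for size, error, formula in parse_solutions(sample):
        print(size, error, formula)
```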

Settings file

The search can also be customized by providing the program with a settings file. Here is an example:

search_metric = 4  # Search metric. 1: Mean relative error, 2: Classification accuracy, 3: Mean error, 4: RMS error, 5: F-score, 6: Correlation coefficient, 7: Hybrid (CC + RMS), 8: Maximum error, 9: Maximum relative error, 10: Nash-Sutcliffe efficiency, 11: Binary cross-entropy, 12: Matthews correlation coefficient (MCC), 13: Residual sum of squares (RSS), 14: Root mean squared log error (RMSLE)
train_test_split = -1  # Train/test split. Options are as follows: -1 for no cross-validation, 50, 60, 70, 75, or 80 for percentages, and 100, 1000, 10000, or 100000 for predefined row counts. Use negative numbers for a custom number of rows (e.g., set -200 to use 200 rows for training).
test_sample = 1  # Test sample. 1: Chosen randomly, 2: The last points
train_test_seed = -1  # Random seed for train/test split generation when the test sample is chosen randomly.
integer_constants = 0  # Integer constants only. 0: Disabled, 1: Enabled
bound_search_mode = 0  # Bound search mode. 0: Deactivated, 1: Lower bound search, 2: Upper bound search
maximum_formula_complexity = 60  # Maximum formula complexity.
history_size = 20  # History size.
fscore_beta = 1  # F-score beta parameter.
normalize_dataset = 0  # Normalize the dataset before starting the optimization? 0: No, 1: Yes
allow_target_delay = 0  # Allow the target variable in the lag functions? 0: No, 1: Yes
force_all_variables = 0  # Force solutions to include all input variables? 0: No, 1: Yes
custom_formula =   # Custom formula for the search. If empty, the program will try to find the last column as a function of the remaining ones.
allowed_functions = + * / pow fmod sin cos tan asin acos atan exp log log2 log10 sqrt sinh cosh tanh asinh acosh atanh abs floor ceil round sign tgamma lgamma erf # Allowed functions.

Settings are changed by modifying the values after the = characters. The comments after # characters are ignored. The allowed functions are set by directly providing their names to the allowed_functions variable, separated by spaces.

The order of the parameters inside the settings file does not matter.
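Because the format is a list of "name = value" lines with optional # comments, a settings file is easy to inspect from a script. The Python sketch below reads such text into a dictionary. It is an illustrative helper, not part of TuringBot: the name parse_settings is hypothetical, and all values are kept as strings.

```python
def parse_settings(text):
    """Read 'name = value  # comment' lines into a dict of strings."""
    settings = {}
    for line in text.splitlines():
        # Drop the comment, which starts at the first # character.
        line = line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        # Split on the first = only, so custom formulas keep their own =.
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


if __name__ == "__main__":
    example = """search_metric = 4  # Search metric.
custom_formula =   # Custom formula for the search.
allowed_functions = + * / pow sin cos  # Allowed functions.
"""
    print(parse_settings(example))
```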

A convenient way of generating a settings file is to set things as you like them in the graphical interface, and then simply export the settings from the menu using the "Save settings" option.

Running TuringBot from Python

When you install TuringBot, you also receive a small Python library designed to make it very easy to call the software from within Python.

Below we provide examples of usage for each OS, but the basic idea is that this library provides a simulation class:

sim = tb.simulation()

This class has a start_process method that starts TuringBot in the background:

start_process(
    path: str,
    input_file: str,
    config: str = None,
    threads: int = None,
    outfile: str = None,
    predictions_file: str = None,
    formulas_file: str = None,
    search_metric: int = None,
    train_test_split: int = None,
    test_sample: int = None,
    train_test_seed: int = None,
    bound_search_mode: int = None,
    maximum_formula_complexity: int = None,
    history_size: int = None,
    fscore_beta: float = None,
    integer_constants: bool = False,
    normalize_dataset: bool = False,
    allow_target_delay: bool = False,
    force_all_variables: bool = False,
    custom_formula: str = None,
    allowed_functions: str = None)

    method of turingbot.simulation instance
    Start the process with the specified configuration and dataset.
    
    Parameters:
    -----------
    path : str
        The path where the process is executed.
    input_file : str
        The input dataset file path.
    config : str, optional
        Path to the configuration file.
    threads : int, optional
        Number of threads to use.
    outfile : str, optional
        Output file path.
    predictions_file : str, optional
        File to store predictions.
    formulas_file : str, optional
        File to store generated formulas.
    search_metric : int, optional
        Search metric to use. Default is 4 (RMS error).
        Options:
        1: Mean relative error
        2: Classification accuracy
        3: Mean error
        4: RMS error (default)
        5: F-score
        6: Correlation coefficient
        7: Hybrid (CC + RMS)
        8: Maximum error
        9: Maximum relative error
        10: Nash-Sutcliffe efficiency
        11: Binary cross-entropy
        12: Matthews correlation coefficient (MCC)
        13: Residual sum of squares (RSS)
        14: Root mean squared log error (RMSLE)
    train_test_split : int, optional
        Train/test split. Default is -1 (no cross-validation).
        Options:
        -1: No cross-validation (default)
        50, 60, 70, 75, 80: Percentage split for training data
        100, 1000, 10000: Predefined row counts for training
        Negative values (e.g., -200): Use 200 rows for training
    test_sample : int, optional
        How to select test samples. Default is 1 (random).
        Options:
        1: Chosen randomly (default)
        2: The last points
    train_test_seed : int, optional
        Random seed for train/test split generation. Default is -1 (no specific seed).
    bound_search_mode : int, optional
        Whether to use bound search mode. Default is 0 (deactivated).
        Options:
        0: Deactivated (default)
        1: Lower bound search
        2: Upper bound search
    maximum_formula_complexity : int, optional
        Maximum formula complexity. Default is 60.
    history_size : int, optional
        History size for the optimization process. Default is 20.
    fscore_beta : float, optional
        Beta parameter for F-score. Default is 1.
    integer_constants : bool, optional
        Whether to use integer constants only. Default is False (disabled).
    normalize_dataset : bool, optional
        Whether to normalize the dataset before optimization. Default is False (no normalization).
    allow_target_delay : bool, optional
        Whether to allow the target variable in lag functions. Default is False (not allowed).
    force_all_variables : bool, optional
        Whether to force the solution to include all input variables. Default is False (not forced).
    custom_formula : str, optional
        Custom formula for the search. If not provided, the last column will be treated as the target variable.
    allowed_functions : str, optional
        Allowed functions for the formula search. Default: "+ * / pow fmod sin cos tan asin acos atan exp log log2 log10 sqrt sinh cosh tanh asinh acosh atanh abs floor ceil round sign tgamma lgamma erf"

Once a simulation is started, you can refresh the current formulas with:

sim.refresh_functions()

and then access the formulas in the form of a list with:

sim.functions

You can also find general information about the number of formulas tried so far as well as error messages with:

sim.info

To finish a simulation and kill the TuringBot process, you should call:

sim.terminate_process()

Windows

An example of usage of TuringBot's Python library on Windows is the following:

import sys
sys.path.insert(1, r'C:\Program Files (x86)\TuringBot\resources')  
path = r'C:\Program Files (x86)\TuringBot\TuringBot.exe'  # Executable path

import time
import turingbot as tb

# Initialize the simulation
sim = tb.simulation()

# Start the process with specified parameters
sim.start_process(
    path=path,
    input_file=r'C:\Users\YourUsername\Desktop\input.txt',
    search_metric=4  # RMS error
)

# Allow some time for the process to generate results
time.sleep(10)

# Fetch and display generated functions and simulation information
sim.refresh_functions()
print("\nGenerated Functions:")
print(*sim.functions, sep='\n')

print("\nSimulation Info:")
print(sim.info)

# Terminate the simulation process
sim.terminate_process()

Linux

An example of usage of TuringBot's Python library on Linux is the following:

import sys
sys.path.insert(1, r'/usr/lib/turingbot/resources')
path = r'/usr/lib/turingbot/TuringBot'  # Executable path

import time
import turingbot as tb

# Initialize the simulation
sim = tb.simulation()

# Start the process with specified parameters
sim.start_process(
    path=path,
    input_file=r'/home/user/input.txt',
    search_metric=4  # RMS error
)

# Allow some time for the process to generate results
time.sleep(10)

# Fetch and display generated functions and simulation information
sim.refresh_functions()
print("\nGenerated Functions:")
print(*sim.functions, sep='\n')

print("\nSimulation Info:")
print(sim.info)

# Terminate the simulation process
sim.terminate_process()

macOS

An example of usage of TuringBot's Python library on macOS is the following:

import sys
sys.path.insert(1, r'/Applications/TuringBot.app/Contents/Resources')
path = r'/Applications/TuringBot.app/Contents/MacOS/TuringBot'  # Executable path

import time
import turingbot as tb

# Initialize the simulation
sim = tb.simulation()

# Start the process with specified parameters
sim.start_process(
    path=path,
    input_file=r'/Users/YourUsername/Desktop/input.txt',
    search_metric=4  # RMS error
)

# Allow some time for the process to generate results
time.sleep(10)

# Fetch and display generated functions and simulation information
sim.refresh_functions()
print("\nGenerated Functions:")
print(*sim.functions, sep='\n')

print("\nSimulation Info:")
print(sim.info)

# Terminate the simulation process
sim.terminate_process()