Looking for a Symbolic Regression library for Python that will allow you to turn your data into nice mathematical formulas? TuringBot is by far the easiest to use. Here we will show how to use it.
Step #1: Download TuringBot
Contrary to most Python libraries, which are distributed through PyPI, TuringBot is distributed as a standalone application. Go ahead and download it from the website. It has versions for both Windows and Linux.
The program also has a nice user interface, but in this case, we are not going to use it, just the Python library that comes with the program.
Step #2: Import TuringBot
Once you have the program installed, import it in Python with the following syntax, making sure to replace “user” with your local username:
import sys
sys.path.insert(1, r'C:\Program Files (x86)\TuringBot\resources')
import turingbot as tb
If you are in Linux, you can equivalently use:
import sys
sys.path.insert(1, r'/usr/lib/turingbot/resources')
import turingbot as tb
After that, TuringBot will be imported and ready to go.
Step #3: Start the Symbolic Regression search
The optimization is started like this:
sim = tb.simulation()
sim.start_process(path, input_file, threads=4, config=config_file)
The 4 parameters that you see are:
- path: the path to the TuringBot executable.
- input_file: the path to your input file, which must contain one variable per column.
- threads (optional): the number of threads that the program should use.
- config (optional): the path to the configuration file.
For instance, if you are on Windows, the paths would look something like this:
path = r'C:\Program Files (x86)\TuringBot\TuringBot.exe'
input_file = r'C:\Users\user\Desktop\input.txt'
config_file = r'C:\Users\user\Desktop\settings.cfg'
And on Linux:
path = r'/usr/lib/turingbot/TuringBot'
input_file = r'/home/user/input.txt'
config_file = r'/home/user/settings.cfg'
Once you run the start_process() method, the optimization will start in the background. You can refresh the current functions in real-time with sim.refresh_functions():
sim.refresh_functions()
print(*sim.functions, sep='\n')
print(sim.info)
The output will look something like this, with the size of the solution in the first column, the error in the second one, and finally the solution itself:
[1, 177813.0, '186276']
[3, 7890.39, '11.7503*x']
[5, 6895.25, '11.9394*(-472.889+x)']
[7, 1769.0, '(10.4154+3.9908e-05*x)*x']
[11, 1666.42, '(9.10666+3.26179e-05*x)*(1.156*(-93.3986+x))']
[21, 1224.31, '-1624.3+((9.18774*sign(x-10.1264)+3.13847e-05*x)*(1.1586*(-158.606+x)))']
Tip: Customizing your search
By default, the last column of your input file will be the target variable, and all other columns will be used as input variables.
But you can change that as well as several other options by providing the program with a configuration file, that looks like this:
search_metric = 4 # Search metric. 1: Mean relative error, 2: Classification accuracy, 3: Mean error, 4: RMS error, 5:, F-score, 6: Correlation coefficient, 7: Hybrid (CC+RMS), 8: Maximum error, 9: Maximum relative error, 10: Nash-Sutcliffe efficiency, 11: Binary cross-entropy, 12: Matthews correlation coefficient (MCC)
train_test_split = -1 # Train/test split. -1: No cross-validation. Valid options are: 50, 60, 70, 75, 80, 100, 1000, 10000, 100000
test_sample = 1 # Test sample. 1: Chosen randomly, 2: The last points
train_test_seed = -1 # Random seed for train/test split generation when the test sample is chosen randomly.
integer_constants = 0 # Integer constants only. 0: Disabled, 1: Enabled
bound_search_mode = 0 # Bound search mode. 0: Deactivated, 1: Lower bound search, 2: Upper bound search
maximum_formula_complexity = 60 # Maximum formula complexity.
history_size = 20 # History size.
fscore_beta = 1 # F-score beta parameter.
normalize_dataset = 0 # Normalize the dataset before starting the optimization? 0: No, 1: Yes
allow_target_delay = 0 # Allow the target variable in the lag functions? 0: No, 1: Yes
force_all_variables = 0 # Force solutions to include all input variables? 0: No, 1: Yes
custom_formula = # Custom formula for the search. If empty, the program will try to find the last column as a function of the remaining ones.
allowed_functions = + * / pow fmod sin cos tan asin acos atan exp log log2 log10 sqrt sinh cosh tanh asinh acosh atanh abs floor ceil round sign tgamma lgamma erf # Allowed functions.
The definitions of those settings can be consulted on the Official Documentation.