Looking for a Symbolic Regression library for Python that turns your data into compact mathematical formulas? TuringBot is an easy-to-use option. This tutorial shows how to call it from Python.
Step #1: Download TuringBot
Unlike most Python libraries, which are distributed through PyPI, TuringBot is distributed as a standalone application. Download it from the website; versions are available for both Windows and Linux.
The program also has a graphical user interface, but here we will only use the Python library that ships with it.
Step #2: Import TuringBot
Once you have the program installed, import it in Python with the following syntax, making sure to replace “user” with your local username:
import sys
sys.path.insert(1, r'C:\Users\user\AppData\Local\Programs\TuringBot')
import turingbot as tb
On Linux, the equivalent is:
import sys
sys.path.insert(1, '/usr/share/turingbot')
import turingbot as tb
After that, TuringBot will be imported and ready to go.
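The two snippets above differ only in the install directory, so a script that has to run on both systems can pick the directory at runtime. The following is a minimal sketch: the helper name turingbot_library_dir is hypothetical, and the directories are the defaults shown above, which may differ on your machine.

```python
import getpass
import sys


def turingbot_library_dir(platform=None, user=None):
    """Return the default TuringBot install directory for a platform.

    Hypothetical helper: the paths are the defaults quoted in this tutorial.
    """
    platform = sys.platform if platform is None else platform
    if platform.startswith("win"):
        # The Windows install lives under the user's profile directory.
        user = getpass.getuser() if user is None else user
        return r"C:\Users\{}\AppData\Local\Programs\TuringBot".format(user)
    if platform.startswith("linux"):
        return "/usr/share/turingbot"
    raise RuntimeError("No TuringBot path known for platform %r" % platform)
```

With this in place, sys.path.insert(1, turingbot_library_dir()) followed by import turingbot as tb would replace the platform-specific lines.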
Step #3: Start the Symbolic Regression search
The optimization is started like this:
sim = tb.simulation()
sim.start_process(path, input_file, threads=4, config=config_file)
The four parameters are:
- path: the path to the TuringBot executable.
- input_file: the path to your input file, which must contain one variable per column.
- threads (optional): the number of threads that the program should use.
- config (optional): the path to the configuration file.
For instance, if you are on Windows, the paths would look something like this:
path = r'C:\Users\user\AppData\Local\Programs\TuringBot\TuringBot.exe'
input_file = r'C:\Users\user\Desktop\input.txt'
config_file = r'C:\Users\user\Desktop\settings.cfg'
And on Linux:
path = r'/usr/bin/turingbot'
input_file = r'/home/user/input.txt'
config_file = r'/home/user/settings.cfg'
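Since the input file must contain one variable per column, a small dataset can be written from Python before starting the search. This sketch assumes that a space-separated text file with a header row is accepted; check the official documentation for the exact formats supported. The function name write_input_file and the column names are illustrative.

```python
import os
import tempfile


def write_input_file(path, header, rows):
    """Write rows to a text file with one variable per column."""
    with open(path, "w") as handle:
        handle.write(" ".join(header) + "\n")
        for row in rows:
            handle.write(" ".join(repr(float(value)) for value in row) + "\n")


# Toy data following y = 2*x + 1: x is the input, y (last column) the target.
rows = [(x, 2 * x + 1) for x in range(5)]
input_file = os.path.join(tempfile.gettempdir(), "input.txt")
write_input_file(input_file, ["x", "y"], rows)
```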
Calling start_process() launches the optimization in the background. You can fetch the current best functions at any time with sim.refresh_functions():
sim.refresh_functions()
print(*sim.functions, sep='\n')
print(sim.info)
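Because the search keeps running in the background, a script will usually poll it for a while rather than refresh once. The loop below is a sketch that relies only on refresh_functions() and the functions attribute shown above; the helper name, the polling interval, and the number of rounds are arbitrary choices, and it assumes each entry starts with the size and the error and ends with the formula.

```python
import time


def poll_search(sim, rounds=5, interval=2.0):
    """Refresh the search `rounds` times and return the last list of functions."""
    for _ in range(rounds):
        time.sleep(interval)
        sim.refresh_functions()
        if sim.functions:
            # Report the entry with the lowest error found so far.
            best = min(sim.functions, key=lambda entry: entry[1])
            print("lowest error so far:", best[1], "with", best[-1])
    return sim.functions
```

Calling poll_search(sim) after start_process() would print one progress line per round.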
The output will look something like this, with the size of the solution in the first column, the error in the second one, and finally the solution itself:
[1, 177813.0, '186276']
[3, 7890.39, '11.7503*x']
[5, 6895.25, '11.9394*(-472.889+x)']
[7, 1769.0, '(10.4154+3.9908e-05*x)*x']
[11, 1666.42, '(9.10666+3.26179e-05*x)*(1.156*(-93.3986+x))']
[21, 1224.31, '-1624.3+((9.18774*sign(x-10.1264)+3.13847e-05*x)*(1.1586*(-158.606+x)))']
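Each entry pairs a size with an error, so choosing a formula is a trade-off between simplicity and accuracy. As one possible approach, the sketch below picks the most accurate formula under a size budget, using part of the sample output above as data. The helper name and the budget value are illustrative.

```python
def best_within_size(functions, max_size):
    """Return the lowest-error [size, error, formula] entry with size <= max_size."""
    candidates = [entry for entry in functions if entry[0] <= max_size]
    if not candidates:
        return None
    return min(candidates, key=lambda entry: entry[1])


functions = [
    [1, 177813.0, '186276'],
    [3, 7890.39, '11.7503*x'],
    [5, 6895.25, '11.9394*(-472.889+x)'],
    [7, 1769.0, '(10.4154+3.9908e-05*x)*x'],
    [11, 1666.42, '(9.10666+3.26179e-05*x)*(1.156*(-93.3986+x))'],
]

choice = best_within_size(functions, max_size=7)
print(choice)  # [7, 1769.0, '(10.4154+3.9908e-05*x)*x']
```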
Tip: Customizing your search
By default, the last column of your input file is the target variable, and all other columns are used as input variables.
You can change this, along with several other options, by passing the program a configuration file like this one:
search_metric = 4 # Search metric. 1: Mean relative error, 2: Classification accuracy, 3: Mean error, 4: RMS error, 5: F1 score, 6: Correlation coefficient, 7: Hybrid (CC+RMS), 8: Maximum error, 9: Maximum relative error, 10: Nash-Sutcliffe efficiency
train_test_split = -1 # Train/test split. -1: No cross validation. Valid options are: 50, 60, 70, 75, 80
test_sample = 1 # Test sample. 1: Chosen randomly, 2: The last points
integer_constants = 0 # Integer constants only. 0: Disabled, 1: Enabled
bound_search_mode = 0 # Bound search mode. 0: Deactivated, 1: Lower bound search, 2: Upper bound search
maximum_formula_complexity = 60 # Maximum formula complexity.
history_size = 20 # History size.
allow_target_delay = 1 # Allow the target variable in the history functions? 0: No, 1: Yes
custom_formula = # Custom formula for the search. If empty, the program will try to find the last column as a function of the remaining ones.
allowed_functions = + * / pow fmod sin cos tan asin acos atan exp log log2 sqrt sinh cosh tanh asinh acosh atanh abs floor ceil round tgamma lgamma erf # Allowed functions.
The definitions of these settings can be found in the official documentation.
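A configuration file in this key = value format can also be generated from Python, which is handy when running several searches with different settings. This is a sketch: the helper name write_config is hypothetical, only keys shown above are used, and whether TuringBot accepts a file that lists just a subset of the settings is an assumption to verify against the official documentation.

```python
import os
import tempfile


def write_config(path, settings):
    """Write settings as `key = value` lines, one setting per line."""
    with open(path, "w") as handle:
        for key, value in settings.items():
            handle.write("{} = {}\n".format(key, value))


settings = {
    "search_metric": 4,  # RMS error
    "train_test_split": 80,  # one of the valid split options listed above
    "maximum_formula_complexity": 60,
}
config_file = os.path.join(tempfile.gettempdir(), "settings.cfg")
write_config(config_file, settings)
```

The resulting path can then be passed as the config argument of start_process().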