What is Symbolic Regression and How Does it Work?

Symbolic Regression is a method for discovering hidden relationships between variables. It does so by turning data into explicit mathematical formulas.

What is Symbolic Regression?

The purpose of Symbolic Regression is to find intrinsic relationships between two or more variables. In general, the relationships are nonlinear.

The steps of a Symbolic Regression optimization are:

Take a dataset with two or more columns;
Propose formulas for one of the columns as a function of the other ones;
Evaluate the error of each formula against the data, and keep track of the best formulas found so far.

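The search is inherently stochastic: candidate formulas are generated at random and tried one after another. Because the space of possible formulas is far too large to enumerate, a clever search algorithm is needed to decide which formulas are worth trying.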
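The steps above can be sketched as a plain random search. This is only a minimal illustration, not the algorithm of any particular tool: the tuple-based expression trees, the operator set and the helper names (random_formula, evaluate, random_search) are all assumptions made for this example, and practical implementations use far smarter search strategies.

```python
import math
import random

# Hypothetical building blocks; a real tool would offer a richer set.
BINARY_OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}

def random_formula(n_vars, rng, depth=2):
    """Build a random expression tree as nested tuples."""
    if depth == 0 or rng.random() < 0.3:
        # Leaf: either a variable index or a small constant.
        if rng.random() < 0.7:
            return ("x", rng.randrange(n_vars))
        return ("c", float(rng.randint(1, 3)))
    op = rng.choice(sorted(BINARY_OPS))
    left = random_formula(n_vars, rng, depth - 1)
    right = random_formula(n_vars, rng, depth - 1)
    return (op, left, right)

def evaluate(formula, row):
    """Evaluate an expression tree on one row of input values."""
    kind = formula[0]
    if kind == "x":
        return row[formula[1]]
    if kind == "c":
        return formula[1]
    return BINARY_OPS[kind](evaluate(formula[1], row), evaluate(formula[2], row))

def mean_squared_error(formula, rows, targets):
    return sum((evaluate(formula, r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def random_search(rows, targets, n_trials=5000, seed=0):
    """Propose random formulas and keep the record-holder."""
    rng = random.Random(seed)
    n_vars = len(rows[0])
    best_formula, best_error = None, math.inf
    for _ in range(n_trials):
        candidate = random_formula(n_vars, rng)
        error = mean_squared_error(candidate, rows, targets)
        if error < best_error:
            best_formula, best_error = candidate, error
    return best_formula, best_error

# Toy dataset with three columns: x0, x1 and a target equal to x0 * x1 + x0.
rows = [(float(a), float(b)) for a in range(1, 5) for b in range(1, 5)]
targets = [a * b + a for a, b in rows]
best_formula, best_error = random_search(rows, targets)
```

With enough trials the search may land on an expression equivalent to x0 * x1 + x0, but nothing guarantees it. Blind sampling like this scales poorly as formulas grow, which is why a more clever search is needed in practice.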

Symbolic Regression is a powerful tool for understanding the underlying dynamics of an observed phenomenon.

How Symbolic Regression Works

Instead of fitting numerical coefficients to a presumed model, as classic regression does, Symbolic Regression optimizes the functional form of the relationship between the variables itself.

This can often help you discover nonlinear relationships between your variables that a regression model with a fixed form cannot capture.
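The contrast can be made concrete with a small sketch. Here the data follow a hidden quadratic law; a classic fit presumes the form y = a * x and only tunes a, while the second part treats the form itself as something to choose. The list of candidate forms and the helper fit_scale are assumptions made for this illustration; a real Symbolic Regression run explores a far larger space of forms.

```python
import math

# Toy data generated from a hidden law, y = 3 * x^2.
xs = [0.5 * i for i in range(1, 11)]
ys = [3.0 * x * x for x in xs]

def fit_scale(feature, xs, ys):
    """Least-squares coefficient a for the model y = a * feature(x)."""
    fs = [feature(x) for x in xs]
    a = sum(f * y for f, y in zip(fs, ys)) / sum(f * f for f in fs)
    mse = sum((a * f - y) ** 2 for f, y in zip(fs, ys)) / len(xs)
    return a, mse

# Classic regression: the form y = a * x is presumed, only a is fitted.
a_linear, mse_linear = fit_scale(lambda x: x, xs, ys)

# Symbolic-regression flavour: the functional form is part of the search.
candidate_forms = {
    "a * x": lambda x: x,
    "a * x * x": lambda x: x * x,
    "a * sin(x)": math.sin,
    "a * exp(x)": math.exp,
}
results = {name: fit_scale(f, xs, ys) for name, f in candidate_forms.items()}
best_form = min(results, key=lambda name: results[name][1])
```

The linear fit finds the best possible a, yet it is left with a large error because the presumed form is wrong. Choosing among forms selects "a * x * x", with a coefficient close to 3.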

However, care must be taken to avoid overfitting, where spurious correlations in the data are given too much weight, leading to formulas that fit the data well but have little predictive value. To reduce this risk, candidate formulas should be validated on data that was not used to find them, for example with cross-validation.
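A minimal sketch of k-fold cross-validation shows why it helps. Two models are compared on toy data: a simple formula y = a * x, and a deliberately overfit stand-in that just memorizes the training points. The memorizer is not a symbolic formula; it is an assumption used here to mimic a model that fits perfectly yet predicts poorly. All function names and the toy data are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k interleaved validation folds."""
    return [list(range(start, n_samples, k)) for start in range(k)]

def fit_linear(train_x, train_y):
    """Simple model: y = a * x, with a fitted by least squares."""
    a = sum(x * y for x, y in zip(train_x, train_y)) / sum(x * x for x in train_x)
    return lambda x: a * x

def fit_memorizer(train_x, train_y):
    """Overfit stand-in: returns the y of the nearest training x."""
    pairs = list(zip(train_x, train_y))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cross_validated_error(fit, xs, ys, k=4):
    """Average error on each fold of a model fitted on the other folds."""
    fold_errors = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)
        fold_errors.append(mse(model, [xs[i] for i in fold], [ys[i] for i in fold]))
    return sum(fold_errors) / k

# Toy data: y = 2 * x plus a small deterministic wiggle standing in for noise.
xs = [float(i) for i in range(1, 13)]
ys = [2.0 * x + (0.1 if i % 3 == 0 else -0.05) for i, x in enumerate(xs)]

# Error of the memorizer on the very data it was fitted on.
train_error_memorizer = mse(fit_memorizer(xs, ys), xs, ys)
# Errors on held-out data only.
cv_error_memorizer = cross_validated_error(fit_memorizer, xs, ys)
cv_error_linear = cross_validated_error(fit_linear, xs, ys)
```

The memorizer reproduces its training data exactly, so judging by training error alone it would look like the better model. On held-out folds its error is much larger than that of the simple formula, which is the signal cross-validation is meant to provide.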

The Benefits of Symbolic Regression

The power of Symbolic Regression comes from the fact that the only thing it needs is something you already have: your data. The method takes care of turning the distributions and correlations in that data into meaningful models.

For instance, suppose you have measured the airflow around a shape for several variations of the parameters that define this shape. You can feed those parameters, together with the measured airflow, directly into a Symbolic Regression.

By looking at the variables that do not appear in the resulting models, you can discover which parameters have little or no influence on the observed phenomenon.
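Checking which variables are missing from a formula can be automated. The sketch below assumes the resulting model is available as a string in Python expression syntax; the parameter names (width, angle, thickness) and the formula itself are invented for illustration and do not come from any real airflow study.

```python
import ast

def variables_in_formula(formula):
    """Return the names used as variables in a Python-style expression.

    Names that are called as functions (such as sin) are excluded.
    """
    tree = ast.parse(formula, mode="eval")
    called = {
        node.func.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }
    names = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return names - called

# Hypothetical shape parameters and a hypothetical resulting model.
parameters = ["width", "angle", "thickness"]
formula = "2.0 * width ** 2 + sin(angle)"

used = variables_in_formula(formula)
unused = [p for p in parameters if p not in used]
```

Here unused contains only "thickness", suggesting that this parameter did not help explain the measurements. A variable's absence from one model is a hint rather than proof, so it is worth checking whether it is also absent from the other well-performing models.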