[1] arXiv:2006.06813 [pdf][2] arXiv:2006.05561 [pdf][3] arXiv:2005.11212 [pdf][4] arXiv:2005.07021 [pdf][5] arXiv:2005.04151 [pdf][6] arXiv:2004.12762 [pdf][7] arXiv:2004.12750 [pdf][8] arXiv:2004.11947 [pdf][9] arXiv:2004.11170 [pdf][10] arXiv:2003.01943 [pdf][11] arXiv:2002.06095 [pdf][12] arXiv:2001.10056 [pdf][13] arXiv:2001.00624 [pdf][14] arXiv:1912.04871 [pdf][15] arXiv:1912.04825 [pdf][16] arXiv:1911.05254 [pdf][17] arXiv:1911.00644 [pdf][18] arXiv:1910.10065 [pdf][19] arXiv:1910.08892 [pdf][20] arXiv:1909.05862 [pdf][21] arXiv:1908.10673 [pdf][22] arXiv:1908.06778 [pdf][23] arXiv:1908.06754 [pdf][24] arXiv:1906.12332 [pdf][25] arXiv:1906.07848 [pdf][26] arXiv:1906.03959 [pdf][27] arXiv:1905.13266 [pdf][28] arXiv:1905.11481 [pdf][29] arXiv:1905.07510 [pdf][30] arXiv:1904.04314 [pdf][31] arXiv:1904.03368 [pdf][32] arXiv:1904.02050 [pdf][33] arXiv:1904.01095 [pdf][34] arXiv:1903.11483 [pdf][35] arXiv:1903.09688 [pdf][36] arXiv:1903.08011 [pdf][37] arXiv:1902.03983 [pdf][38] arXiv:1902.00882 [pdf][39] arXiv:1901.07714 [pdf][40] arXiv:1901.04136 [pdf][41] arXiv:1810.00835 [pdf][42] arXiv:1808.10394 [pdf][43] arXiv:1807.01019 [pdf][44] arXiv:1806.02502 [pdf][45] arXiv:1805.10365 [pdf][46] arXiv:1804.09331 [pdf][47] arXiv:1804.06808 [pdf][48] arXiv:1803.07369 [pdf][49] arXiv:1803.06226 [pdf][50] arXiv:1803.06127 [pdf][51] arXiv:1802.02423 [pdf][52] arXiv:1801.01807 [pdf][53] arXiv:1712.04170 [pdf][54] arXiv:1712.01104 [pdf][55] arXiv:1710.10720 [pdf][56] arXiv:1709.05394 [pdf][57] arXiv:1707.03744 [pdf][58] arXiv:1705.08061 [pdf][59] arXiv:1705.07877 [pdf][60] arXiv:1704.07313 [pdf][61] arXiv:1704.05134 [pdf][62] arXiv:1704.04998 [pdf][63] arXiv:1704.00828 [pdf][64] arXiv:1703.01925 [pdf][65] arXiv:1701.03641 [pdf][66] arXiv:1612.05276 [pdf][67] arXiv:1611.04766 [pdf][68] arXiv:1602.04648 [pdf][69] arXiv:1507.01687 [pdf][70] arXiv:1412.4690 [pdf][71] arXiv:1410.0532 [pdf][72] arXiv:1409.2390 [pdf][73] arXiv:1403.0623 [pdf][74] arXiv:1312.6122 [pdf][75] arXiv:1309.5931 [pdf][76] arXiv:1308.4145 [pdf][77] arXiv:1304.3610 [pdf][78] arXiv:1212.2044 [pdf][79] arXiv:1203.1023 [pdf][80] arXiv:1203.0388 [pdf][81] arXiv:1202.5683 [pdf][82] arXiv:1109.1922 [pdf][83] arXiv:1106.3834 [pdf][84] arXiv:1006.4998 [pdf][85] arXiv:0405415 [pdf][86] arXiv:0102027 [pdf]

Symbolic Regression using Mixed-Integer Nonlinear Optimization

The Symbolic Regression (SR) problem, where the goal is to find a regression function that does not have a pre-specified form but is any function that can be composed of a list of operators, is a hard problem in machine learning, both theoretically and computationally. Genetic programming based methods, that heuristically search over a very large space of functions, are the most commonly used methods to tackle SR problems. An alternative mathematical programming approach, proposed in the last decade, is to express the optimal symbolic expression as the solution of a system of nonlinear equations over continuous and discrete variables that minimizes a certain objective, and to solve this system via a global solver for mixed-integer nonlinear programming problems. Algorithms based on the latter approach are often very slow. We propose a hybrid algorithm that combines mixed-integer nonlinear optimization with explicit enumeration and incorporates constraints from dimensional analysis. We show that our algorithm is competitive, for some synthetic data sets, with a state-of-the-art SR software and a recent physics-inspired method called AI Feynman.

Learning Functions to Study the Benefit of Multitask Learning

We study and quantify the generalization patterns of multitask learning (MTL) models for sequence labeling tasks. MTL models are trained to optimize a set of related tasks jointly. Although multitask learning has achieved improved performance in some problems, there are also tasks that lose performance when trained together. These mixed results motivate us to study the factors that impact the performance of MTL models. We note that theoretical bounds and convergence rates for MTL models exist, but they rely on strong assumptions such as task relatedness and the use of balanced datasets. To remedy these limitations, we propose the creation of a task simulator and the use of Symbolic Regression to learn expressions relating model performance to possible factors of influence. For MTL, we study the model performance against the number of tasks (T), the number of samples per task (n) and the task relatedness measured by the adjusted mutual information (AMI). In our experiments, we could empirically find formulas relating model performance with factors of sqrt(n), sqrt(T), which are equivalent to sound mathematical proofs in Maurer[2016], and we went beyond by discovering that performance relates to a factor of sqrt(AMI).

Symbolic Pregression Discovering Physical Laws from Raw Distorted Video

We present a method for unsupervised learning of equations of motion for objects in raw and optionally distorted unlabeled video. We first train an autoencoder that maps each video frame into a low-dimensional latent space where the laws of motion are as simple as possible, by minimizing a combination of non-linearity, acceleration and prediction error. Differential equations describing the motion are then discovered using Pareto-optimal symbolic regression. We find that our pre-regression ("pregression") step is able to rediscover Cartesian coordinates of unlabeled moving objects even when the video is distorted by a generalized lens. Using intuition from multidimensional knot-theory, we find that the pregression step is facilitated by first adding extra latent space dimensions to avoid topological problems during training and then removing these extra dimensions via principal component analysis.

Review of new flow friction equations Constructing Colebrook explicit correlations accurately

Using only a limited number of computationally expensive functions, we show a way how to construct accurate and computationally efficient approximations of the Colebrook equation for flow friction. The presented approximations are based on the asymptotic series expansion of the Wright Omega-function and symbolic regression. The results are verified with 8 million of Quasi-Monte Carlo points covering the domain of interest for engineers. In comparison with the built-in wrightOmega feature of Matlab R2016a, the herein introduced related approximations of the Wright Omega-function significantly accelerate the computation. With only two logarithms and several basic arithmetic operations used, the presented approximations are not only computationally efficient but also extremely accurate. The maximal relative error of the most promising approximation which is given in the form suitable for engineers use is limited to 0.0012%, while for a little bit more complex variant is limited to 0.000024%.

Swarm Programming Using Moth-Flame Optimization and Whale Optimization Algorithms

Automatic programming (AP) is an important area of Machine Learning (ML) where computer programs are generated automatically. Swarm Programming (SP), a newly emerging research area in AP, automatically generates the computer programs using Swarm Intelligence (SI) algorithms. This paper presents two grammar-based SP methods named as Grammatical Moth-Flame Optimizer (GMFO) and Grammatical Whale Optimizer (GWO). The Moth-Flame Optimizer and Whale Optimization algorithm are used as search engines or learning algorithms in GMFO and GWO respectively. The proposed methods are tested on Santa Fe Ant Trail, quartic symbolic regression, and 3-input multiplexer problems. The results are compared with Grammatical Bee Colony (GBC) and Grammatical Fireworks algorithm (GFWA). The experimental results demonstrate that the proposed SP methods can be used in automatic computer program generation.

Fitness Landscape Analysis of Dimensionally-Aware Genetic Programming Featuring Feynman Equations

Genetic programming is an often-used technique for symbolic regression finding symbolic expressions that match data from an unknown function. To make the symbolic regression more efficient, one can also use dimensionally-aware genetic programming that constrains the physical units of the equation. Nevertheless, there is no formal analysis of how much dimensionality awareness helps in the regression process. In this paper, we conduct a fitness landscape analysis of dimensionallyaware genetic programming search spaces on a subset of equations from Richard Feynmans well-known lectures. We define an initialisation procedure and an accompanying set of neighbourhood operators for conducting the local search within the physical unit constraints. Our experiments show that the added information about the variable dimensionality can efficiently guide the search algorithm. Still, further analysis of the differences between the dimensionally-aware and standard genetic programming landscapes is needed to help in the design of efficient evolutionary operators to be used in a dimensionally-aware regression.

MATE A Model-based Algorithm Tuning Engine

In this paper, we introduce a Model-based Algorithm Turning Engine, namely MATE, where the parameters of an algorithm are represented as expressions of the features of a target optimisation problem. In contrast to most static (feature-independent) algorithm tuning engines such as irace and SPOT, our approach aims to derive the best parameter configuration of a given algorithm for a specific problem, exploiting the relationships between the algorithm parameters and the features of the problem. We formulate the problem of finding the relationships between the parameters and the problem features as a symbolic regression problem and we use genetic programming to extract these expressions. For the evaluation, we apply our approach to configuration of the (1+1) EA and RLS algorithms for the OneMax, LeadingOnes, BinValue and Jump optimisation problems, where the theoretically optimal algorithm parameters to the problems are available as functions of the features of the problems. Our study shows that the found relationships typically comply with known theoretical results, thus demonstrating a new opportunity to consider model-based parameter tuning as an effective alternative to the static algorithm tuning engines.

Symbolic Regression Driven by Training Data and Prior Knowledge

In symbolic regression, the search for analytic models is typically driven purely by the prediction error observed on the training data samples. However, when the data samples do not sufficiently cover the input space, the prediction error does not provide sufficient guidance toward desired models. Standard symbolic regression techniques then yield models that are partially incorrect, for instance, in terms of their steady-state characteristics or local behavior. If these properties were considered already during the search process, more accurate and relevant models could be produced. We propose a multi-objective symbolic regression approach that is driven by both the training data and the prior knowledge of the properties the desired model should manifest. The properties given in the form of formal constraints are internally represented by a set of discrete data samples on which candidate models are exactly checked. The proposed approach was experimentally evaluated on three test problems with results clearly demonstrating its capability to evolve realistic models that fit the training data well while complying with the prior knowledge of the desired model characteristics at the same time. It outperforms standard symbolic regression by several orders of magnitude in terms of the mean squared deviation from a reference model.

Learning a Formula of Interpretability to Learn Interpretable Formulas

Many risk-sensitive applications require Machine Learning (ML) models to be interpretable. Attempts to obtain interpretable models typically rely on tuning, by trial-and-error, hyper-parameters of model complexity that are only loosely related to interpretability. We show that it is instead possible to take a meta-learning approach an ML model of non-trivial Proxies of Human Interpretability (PHIs) can be learned from human feedback, then this model can be incorporated within an ML training process to directly optimize for interpretability. We show this for evolutionary symbolic regression. We first design and distribute a survey finalized at finding a link between features of mathematical formulas and two established PHIs, simulatability and decomposability. Next, we use the resulting dataset to learn an ML model of interpretability. Lastly, we query this model to estimate the interpretability of evolving solutions within bi-objective genetic programming. We perform experiments on five synthetic and eight real-world symbolic regression problems, comparing to the traditional use of solution size minimization. The results show that the use of our model leads to formulas that are, for a same level of accuracy-interpretability trade-off, either significantly more or equally accurate. Moreover, the formulas are also arguably more interpretable. Given the very positive results, we believe that our approach represents an important stepping stone for the design of next-generation interpretable (evolutionary) ML algorithms.

Machine Learning of Mechanical Properties of Steels

The mechanical properties are essential for structural materials. The analyzed 360 data on four mechanical properties of steels, viz. fatigue strength, tensile strength, fracture strength, and hardness, are selected from the NIMS database, including carbon steels, and low-alloy steels. Five machine learning algorithms were applied on the 360 data to predict the mechanical properties and random forest regression illustrates the best performance. The feature selection was conducted by random forest and symbolic regressions, leading to the four most important features of tempering temperature, and alloying elements of carbon, chromium, and molybdenum to the mechanical properties of steels. Besides, mathematic expressions were generated via symbolic regression, and the expressions explicitly predict how each of the four mechanical properties varies quantitatively with the four most important features. The present work demonstrates the great potential of symbolic regression in the discovery of novel advanced materials.

Traffic Modelling and Prediction via Symbolic Regression on Road Sensor Data

The continuous expansion of the urban traffic sensing infrastructure has led to a surge in the volume of widely available road related data. Consequently, increasing effort is being dedicated to the creation of intelligent transportation systems, where decisions on issues ranging from city-wide road maintenance planning to improving the commuting experience are informed by computational models of urban traffic instead of being left entirely to humans. The automation of traffic management has received substantial attention from the research community, however, most approaches target highways, produce predictions valid for a limited time window or require expensive retraining of available models in order to accurately forecast traffic at a new location. In this article, we propose a novel and accurate traffic flow prediction method based on symbolic regression enhanced with a lag operator. Our approach produces robust models suitable for the intricacies of urban roads, much more difficult to predict than highways. Additionally, there is no need to retrain the model for a period of up to 9 weeks. Furthermore, the proposed method generates models that are transferable to other segments of the road network, similar to, yet geographically distinct from the ones they were initially trained on. We demonstrate the achievement of these claims by conducting extensive experiments on data collected from the Darmstadt urban infrastructure.

Explainable Machine Learning Control -- robust control and stability analysis

Recently, the term explainable AI became known as an approach to produce models from artificial intelligence which allow interpretation. Since a long time, there are models of symbolic regression in use that are perfectly explainable and mathematically tractable in this contribution we demonstrate how to use symbolic regression methods to infer the optimal control of a dynamical system given one or several optimization criteria, or cost functions. In previous publications, network control was achieved by automatized machine learning control using genetic programming. Here, we focus on the subsequent analysis of the analytical expressions which result from the machine learning. In particular, we use AUTO to analyze the stability properties of the controlled oscillator system which served as our model. As a result, we show that there is a considerable advantage of explainable models over less accessible neural networks.

Analytic Continued Fractions for Regression A Memetic Algorithm Approach

We present an approach for regression problems that employs analytic continued fractions as a novel representation. Comparative computational results using a memetic algorithm are reported in this work. Our experiments included fifteen other different machine learning approaches including five genetic programming methods for symbolic regression and ten machine learning methods. The comparison on training and test generalization was performed using 94 datasets of the Penn State Machine Learning Benchmark. The statistical tests showed that the generalization results using analytic continued fractions provides a powerful and interesting new alternative in the quest for compact and interpretable mathematical models for artificial intelligence.

Deep symbolic regression Recovering mathematical expressions from data via risk-seeking policy gradients

Discovering the underlying mathematical expressions describing a dataset is a core challenge for artificial intelligence. This is the problem of $\textit{symbolic}$ $\textit{regression.}$ Despite recent advances in training neural networks to solve complex tasks, deep learning approaches to symbolic regression are underexplored. We propose a framework that combines deep learning with symbolic regression via a simple idea use a large model to search the space of small models. More specifically, we use a recurrent neural network to emit a distribution over tractable mathematical expressions, and employ reinforcement learning to train the network to generate better-fitting expressions. Our algorithm significantly outperforms standard genetic programming-based symbolic regression in its ability to exactly recover symbolic expressions on a series of benchmark problems, both with and without added noise. More broadly, our contributions include a framework that can be applied to optimize hierarchical, variable-length objects under a black-box performance metric, with the ability to incorporate a priori constraints in situ, and a risk-seeking policy gradient formulation that optimizes for best-case performance instead of expected performance.

Integration of Neural Network-Based Symbolic Regression in Deep Learning for Scientific Discovery

Symbolic regression is a powerful technique that can discover analytical equations that describe data, which can lead to explainable models and generalizability outside of the training data set. In contrast, neural networks have achieved amazing levels of accuracy on image recognition and natural language processing tasks, but are often seen as black-box models that are difficult to interpret and typically extrapolate poorly. Here we use a neural network-based architecture for symbolic regression that we call the Sequential Equation Learner (SEQL) network and integrate it with other deep learning architectures such that the whole system can be trained end-to-end through backpropagation. To demonstrate the power of such systems, we study their performance on several substantially different tasks. First, we show that the neural network can perform symbolic regression and learn the form of several functions. Next, we present an MNIST arithmetic task where a separate part of the neural network extracts the digits. Finally, we demonstrate prediction of dynamical systems where an unknown parameter is extracted through an encoder. We find that the EQL-based architecture can extrapolate quite well outside of the training data set compared to a standard neural network-based architecture, paving the way for deep learning to be applied in scientific exploration and discovery.

Feature engineering and symbolic regression methods for detecting hidden physics from sparse sensors

In this study we put forth a modular approach for distilling hidden flow physics in discrete and sparse observations. To address functional expressiblity, a key limitation of the black-box machine learning methods, we have exploited the use of symbolic regression as a principle for identifying relations and operators that are related to the underlying processes. This approach combines evolutionary computation with feature engineering to provide a tool to discover hidden parameterizations embedded in the trajectory of fluid flows in the Eulerian frame of reference. Our approach in this study mainly involves gene expression programming (GEP) and sequential threshold ridge regression (STRidge) algorithms. We demonstrate our results in three different applications (i) equation discovery, (ii) truncation error analysis, and (iii) hidden physics discovery, for which we include both predicting unknown source terms from a set of sparse observations and discovering subgrid scale closure models. We illustrate that both GEP and STRidge algorithms are able to distill the Smagorinsky model from an array of tailored features in solving the Kraichnan turbulence problem. Our results demonstrate the huge potential of these techniques in complex physics problems, and reveal the importance of feature selection and feature engineering in model discovery approaches.

On ab initio closed-form expressions for gravitational waves

We introduce an approach for finding ab initio, high accuracy, closed-form expressions for the gravitational waves emitted by binary systems. Our expressions are built from numerical surrogate models based on numerical relativity simulations, which have been shown to be essentially indistinguishable from each other, with the advantage that our expressions can be written explicitly in a few lines. The key new ingredient in this approach is symbolic regression through genetic programming. The minimum overlap obtained in the proof of concept here presented, compared to ground truth solutions, is 99%.

Enhanced Evolutionary Symbolic Regression Via Genetic Programming for PV Power Forecasting

Solar power becomes one of the most promising renewable energy sources over the years leading up. Nevertheless, the weather is causing periodicity and volatility to photovoltaic (PV) energy production. Thus, Forecasting the PV power is crucial for maintaining sustainability and reliably to grid-connected systems. Anticipating the energy harnessed with prediction models is required to prevent the grid from any damage coming from every slight disturbance. In this direction, various architectures were suggested to predict the ambiguous behavior of meteorological data. Within this vein. Genetic algorithm (GA) presents a robust solution for nonlinear problems. The success of GA presents a source of motivation to scientists and engineers to develop a variety of sub-models that imitate the same Darwinian type-survival of the fittest strategy approach from GA propriety. However, during the training process, the later face an issue with missing the optimal solutions due to the existence of a local minimum. Following that regard, this paper provides an accurate PV power forecasting one month of PV power using a hybrid model combining symbolic regressor via Genetic programming and artificial neural network. The features inputs used in the process are only the solar irradiation and the historical solar power data. The application of the said model on an Australian PV plant of 200 kW offers a low mean absolute error equal to 3.30 and outperforms the state of art models.

Bayesian Symbolic Regression

Interpretability is crucial for machine learning in many scenarios such as quantitative finance, banking, healthcare, etc. Symbolic regression (SR) is a classic interpretable machine learning method by bridging X and Y using mathematical expressions composed of some basic functions. However, the search space of all possible expressions grows exponentially with the length of the expression, making it infeasible for enumeration. Genetic programming (GP) has been traditionally and commonly used in SR to search for the optimal solution, but it suffers from several limitations, e.g. the difficulty in incorporating prior knowledge; overly-complicated output expression and reduced interpretability etc. To address these issues, we propose a new method to fit SR under a Bayesian framework. Firstly, Bayesian model can naturally incorporate prior knowledge (e.g., preference of basis functions, operators and raw features) to improve the efficiency of fitting SR. Secondly, to improve interpretability of expressions in SR, we aim to capture concise but informative signals. To this end, we assume the expected signal has an additive structure, i.e., a linear combination of several concise expressions, whose complexity is controlled by a well-designed prior distribution. In our setup, each expression is characterized by a symbolic tree, and the proposed SR model could be solved by sampling symbolic trees from the posterior distribution using an efficient Markov chain Monte Carlo (MCMC) algorithm. Finally, compared with GP, the proposed BSR(Bayesian Symbolic Regression) method saves computer memory with no need to keep an updated 'genome pool'. Numerical experiments show that, compared with GP, the solutions of BSR are closer to the ground truth and the expressions are more concise. Meanwhile we find the solution of BSR is robust to hyper-parameter specifications such as the number of trees.

Learning Symbolic Physics with Graph Networks

We introduce an approach for imposing physically motivated inductive biases on graph networks to learn interpretable representations and improved zero-shot generalization. Our experiments show that our graph network models, which implement this inductive bias, can learn message representations equivalent to the true force vector when trained on n-body gravitational and spring-like simulations. We use symbolic regression to fit explicit algebraic equations to our trained model's message function and recover the symbolic form of Newton's law of gravitation without prior knowledge. We also show that our model generalizes better at inference time to systems with more bodies than had been experienced during training. Our approach is extensible, in principle, to any unknown interaction law learned by a graph network, and offers a valuable technique for interpreting and inferring explicit causal theories about the world from implicit knowledge captured by deep learning.

A Search for the Underlying Equation Governing Similar Systems

We show a data-driven approach to discover the underlying structural form of the mathematical equation governing the dynamics of multiple but similar systems induced by the same mechanisms. This approach hinges on theories that we lay out involving arguments based on the nature of physical systems. In the same vein, we also introduce a metric to search for the best candidate equation using the datasets generated from the systems. This approach involves symbolic regression by means of genetic programming and regressions to compute the strength of the interplay between the extrinsic parameters in a candidate equation. We relate these extrinsic parameters to the hidden properties of the data-generating systems. The behavior of a new similar system can be predicted easily by utilizing the discovered structural form of the general equation. As illustrations, we apply the approach to identify candidate structural forms of the underlying equation governing two cases the changes in a sensor measurement of degrading engines; and the search for the governing equation of systems with known variations of an intrinsic parameter.

Symbolic Regression Discovery of New Perovskite Catalysts with High Oxygen Evolution Reaction Activity

Symbolic regression (SR) is an emerging method for building analytical formulas to find models that best fit data sets. Here, SR was used to guide the design of new oxide perovskite catalysts with improved oxygen evolution reaction (OER) activities. An unprecedentedly simple descriptor, {\mu}/t, where {\mu} and t are the octahedral and tolerance factors, respectively, was identified, which accelerated the discovery of a series of new oxide perovskite catalysts with improved OER activity. We successfully synthesized five new oxide perovskites and characterized their OER activities. Remarkably, four of them, Cs0.4La0.6Mn0.25Co0.75O3, Cs0.3La0.7NiO3, SrNi0.75Co0.25O3, and Sr0.25Ba0.75NiO3, outperform the current state-of-the-art oxide perovskite catalyst, Ba0.5Sr0.5Co0.8Fe0.2O3 (BSCF). Our results demonstrate the potential of SR for accelerating data-driven design and discovery of new materials with improved properties.

A New Deterministic Technique for Symbolic Regression

This paper describes a new method for Symbolic Regression that allows to find mathematical expressions from a dataset. This method has a strong mathematical basis. As opposed to other methods such as Genetic Programming, this method is deterministic, and does not involve the creation of a population of initial solutions. Instead of it, a simple expression is being grown until it fits the data. The experiments performed show that the results are as good as other Machine Learning methods, in a very low computational time. Another advantage of this technique is that the complexity of the expressions can be limited, so the system can return mathematical expressions that can be easily analysed by the user, in opposition to other techniques like GSGP.

Automatic Discovery of Families of Network Generative Processes

Designing plausible network models typically requires scholars to form a priori intuitions on the key drivers of network formation. Oftentimes, these intuitions are supported by the statistical estimation of a selection of network evolution processes which will form the basis of the model to be developed. Machine learning techniques have lately been introduced to assist the automatic discovery of generative models. These approaches may more broadly be described as "symbolic regression", where fundamental network dynamic functions, rather than just parameters, are evolved through genetic programming. This chapter first aims at reviewing the principles, efforts and the emerging literature in this direction, which is very much aligned with the idea of creating artificial scientists. Our contribution then aims more specifically at building upon an approach recently developed by us [Menezes \& Roth, 2014] in order to demonstrate the existence of families of networks that may be described by similar generative processes. In other words, symbolic regression may be used to group networks according to their inferred genotype (in terms of generative processes) rather than their observed phenotype (in terms of statistical/topological features). Our empirical case is based on an original data set of 238 anonymized ego-centered networks of Facebook friends, further yielding insights on the formation of sociability networks.

Symbolic regression by uniform random global search

Symbolic regression (SR) is a data analysis problem where we search for the mathematical expression that best fits a numerical dataset. It is a global optimization problem. The most popular approach to SR is by genetic programming (SRGP). It is a common paradigm to compare an algorithm's performance to that of random search, but the data comparing SRGP to random search is lacking. We describe a novel algorithm for SR, namely SR by uniform random global search (SRURGS), also known as pure random search. We conduct experiments comparing SRURGS with SRGP using 100 randomly generated equations. Our results suggest that a SRGP is faster than SRURGS in producing equations with good R^2 for simple problems. However, our experiments suggest that SRURGS is more robust than SRGP, able to produce good output in more challenging problems. As SRURGS is arguably the simplest global search algorithm, we believe it should serve as a control algorithm against which other symbolic regression algorithms are compared. SRURGS has only one tuning parameter, and is conceptually very simple, making it a useful tool in solving SR problems. The method produces random equations, which is useful for the generation of symbolic regression benchmark problems. We have released well documented and open-source python code, currently under formal peer-review, so that interested researchers can deploy the tool in practice.

Exploration and Exploitation in Symbolic Regression using Quality-Diversity and Evolutionary Strategies Algorithms

By combining Genetic Programming, MAP-Elites and Covariance Matrix Adaptation Evolution Strategy, we demonstrate very high success rates in Symbolic Regression problems. MAP-Elites is used to improve exploration while preserving diversity and avoiding premature convergence and bloat. Then, a Covariance Matrix Adaptation-Evolution Strategy is used to evaluate free scalars through a non-gradient-based black-box optimizer. Although this evaluation approach is not computationally scalable to high dimensional problems, our algorithm is able to find exactly most of the $31$ targets extracted from the literature on which we evaluate it.

Epsilon-Lexicase Selection for Regression

Lexicase selection is a parent selection method that considers test cases separately, rather than in aggregate, when performing parent selection. It performs well in discrete error spaces but not on the continuous-valued problems that compose most system identification tasks. In this paper, we develop a new form of lexicase selection for symbolic regression, named epsilon-lexicase selection, that redefines the pass condition for individuals on each test case in a more effective way. We run a series of experiments on real-world and synthetic problems with several treatments of epsilon and quantify how epsilon affects parent selection and model performance. epsilon-lexicase selection is shown to be effective for regression, producing better fit models compared to other techniques such as tournament selection and age-fitness Pareto optimization. We demonstrate that epsilon can be adapted automatically for individual test cases based on the population performance distribution. Our experiments show that epsilon-lexicase selection with automatic epsilon produces the most accurate models across tested problems with negligible computational overhead. We show that behavioral diversity is exceptionally high in lexicase selection treatments, and that epsilon-lexicase selection makes use of more fitness cases when selecting parents than lexicase selection, which helps explain the performance improvement.

AI Feynman a Physics-Inspired Method for Symbolic Regression

A core challenge for both physics and artificial intellicence (AI) is symbolic regression finding a symbolic expression that matches data from an unknown function. Although this problem is likely to be NP-hard in principle, functions of practical interest often exhibit symmetries, separability, compositionality and other simplifying properties. In this spirit, we develop a recursive multidimensional symbolic regression algorithm that combines neural network fitting with a suite of physics-inspired techniques. We apply it to 100 equations from the Feynman Lectures on Physics, and it discovers all of them, while previous publicly available software cracks only 71; for a more difficult test set, we improve the state of the art success rate from 15% to 90%.

Discovery of Algebraic Reynolds-Stress Models Using Sparse Symbolic Regression

** This article is published (open-access). ** A novel deterministic symbolic regression method SpaRTA (Sparse Regression of Turbulent Stress Anisotropy) is introduced to infer algebraic stress models for the closure of RANS equations directly from high-fidelity LES or DNS data. The models are written as tensor polynomials and are built from a library of candidate functions. The machine-learning method is based on elastic net regularisation which promotes sparsity of the inferred models. By being data-driven the method relaxes assumptions commonly made in the process of model development. Model-discovery and cross-validation is performed for three cases of separating flows, i.e. periodic hills ($Re$=10595), converging-diverging channel ($Re$=12600) and curved backward-facing step ($Re$=13700). The predictions of the discovered models are significantly improved over the $k$-$\omega$ SST also for a true prediction of the flow over periodic hills at $Re$=37000. This study shows a systematic assessment of SpaRTA for rapid machine-learning of robust corrections for standard RANS turbulence models.

Data-driven discovery of partial differential equation models with latent variables

In spatially extended systems, it is common to find latent variables that are hard, or even impossible, to measure with acceptable precision, but are crucially important for the proper description of the dynamics. This substantially complicates construction of an accurate model for such systems using data-driven approaches. The present paper illustrates how physical constraints can be employed to overcome this limitation using the example of a weakly turbulent quasi-two-dimensional Kolmogorov flow driven by a steady Lorenz force with an unknown spatial profile. Specifically, the terms involving latent variables in the partial differential equations governing the dynamics can be eliminated at the expense of raising the order of that equation. We show that local polynomial interpolation combined with symbolic regression can handle sparse data on grids that are representative of typical experimental measurement techniques such as particle image velocimetry. However, we also find that the reconstructed model is sensitive to measurement noise and trace this sensitivity to the presence of high order spatial and/or temporal derivatives.

A Novel Continuous Representation of Genetic Programmings using Recurrent Neural Networks for Symbolic Regression

Neuro-encoded expression programming that aims to offer a novel continuous representation of combinatorial encoding for genetic programming methods is proposed in this paper. Genetic programming with linear representation uses nature-inspired operators to tune expressions and finally search out the best explicit function to simulate data. The encoding mechanism is essential for genetic programmings to find a desirable solution efficiently. However, the linear representation methods manipulate the expression tree in discrete solution space, where a small change of the input can cause a large change of the output. The unsmooth landscapes destroy the local information and make difficulty in searching. The neuro-encoded expression programming constructs the gene string with recurrent neural network (RNN) and the weights of the network are optimized by powerful continuous evolutionary algorithms. The neural network mappings smoothen the sharp fitness landscape and provide rich neighborhood information to find the best expression. The experiments indicate that the novel approach improves test accuracy and efficiency on several well-known symbolic regression problems.

Improving Model-based Genetic Programming for Symbolic Regression of Small Expressions

The Gene-pool Optimal Mixing Evolutionary Algorithm (GOMEA) is a model-based EA framework that has been shown to perform well in several domains, including Genetic Programming (GP). Differently from traditional EAs where variation acts blindly, GOMEA learns a model of interdependencies within the genotype, i.e., the linkage, to estimate what patterns to propagate. In this article, we study the role of Linkage Learning (LL) performed by GOMEA in Symbolic Regression (SR). We show that the non-uniformity in the distribution of the genotype in GP populations negatively biases LL, and propose a method to correct for this. We also propose approaches to improve LL when ephemeral random constants are used. Furthermore, we adapt a scheme of interleaving runs to alleviate the burden of tuning the population size, a crucial parameter for LL, to SR. We run experiments on 10 real-world datasets, enforcing a strict limitation on solution size, to enable interpretability. We find that the new LL method outperforms the standard one, and that GOMEA outperforms both traditional and semantic GP. We also find that the small solutions evolved by GOMEA are competitive with tuned decision trees, making GOMEA a promising new approach to SR.

Fast, accurate, and transferable many-body interatomic potentials by symbolic regression

The length and time scales of atomistic simulations are limited by the computational cost of the methods used to predict material properties. In recent years there has been great progress in the use of machine learning algorithms to develop fast and accurate interatomic potential models, but it remains a challenge to develop models that generalize well and are fast enough to be used at extreme time and length scales. To address this challenge, we have developed a machine learning algorithm based on symbolic regression in the form of genetic programming that is capable of discovering accurate, computationally efficient manybody potential models. The key to our approach is to explore a hypothesis space of models based on fundamental physical principles and select models within this hypothesis space based on their accuracy, speed, and simplicity. The focus on simplicity reduces the risk of overfitting the training data and increases the chances of discovering a model that generalizes well. Our algorithm was validated by rediscovering an exact Lennard-Jones potential and a Sutton Chen embedded atom method potential from training data generated using these models. By using training data generated from density functional theory calculations, we found potential models for elemental copper that are simple, as fast as embedded atom models, and capable of accurately predicting properties outside of their training set. Our approach requires relatively small sets of training data, making it possible to generate training data using highly accurate methods at a reasonable computational cost. We present our approach, the forms of the discovered models, and assessments of their transferability, accuracy and speed.

Symbolic Regression for Constructing Analytic Models in Reinforcement Learning

Reinforcement learning (RL) is a widely used approach for controlling systems with unknown or time-varying dynamics. Even though RL does not require a model of the system, it is known to be faster and safer when using models learned online. We propose to employ symbolic regression (SR) to construct parsimonious process models described by analytic equations for real-time RL control. We have tested our method with two different state-of-the-art SR algorithms which automatically search for equations that fit the measured data. In addition to the standard problem formulation in the state-space domain, we show how the method can also be applied to input-output models of the NARX (nonlinear autoregressive with exogenous input) type. We present the approach on three simulated examples with up to 14-dimensional state space an inverted pendulum, a mobile robot, and a biped walking robot. A comparison with deep neural networks and local linear regression shows that SR in most cases outperforms these commonly used alternative methods. We demonstrate on a real pendulum system that the analytic model found enables RL to successfully perform the swing-up task, based on a model constructed from only 100 data samples.

Symbolic Regression Methods for Reinforcement Learning

Reinforcement learning algorithms can be used to optimally solve dynamic decision-making and control problems. With continuous-valued state and input variables, reinforcement learning algorithms must rely on function approximators to represent the value function and policy mappings. Commonly used numerical approximators, such as neural networks or basis function expansions, have two main drawbacks they are black-box models offering no insight in the mappings learned, and they require significant trial and error tuning of their meta-parameters. In this paper, we propose a new approach to constructing smooth value functions by means of symbolic regression. We introduce three off-line methods for finding value functions based on a state transition model symbolic value iteration, symbolic policy iteration, and a direct solution of the Bellman equation. The methods are illustrated on four nonlinear control problems velocity control under friction, one-link and two-link pendulum swing-up, and magnetic manipulation. The results show that the value functions not only yield well-performing policies, but also are compact, human-readable and mathematically tractable. This makes them potentially suitable for further analysis of the closed-loop system. A comparison with alternative approaches using neural networks shows that our method constructs well-performing value functions with substantially fewer parameters.

Data-driven PDE discovery with evolutionary approach

The data-driven models allow one to define the model structure in cases when a priori information is not sufficient to build other types of models. The possible way to obtain physical interpretation is the data-driven differential equation discovery techniques. The existing methods of PDE (partial derivative equations) discovery are bound with the sparse regression. However, sparse regression is restricting the resulting model form, since the terms for PDE are defined before regression. The evolutionary approach described in the article has a symbolic regression as the background instead and thus has fewer restrictions on the PDE form. The evolutionary method of PDE discovery (EPDE) is described and tested on several canonical PDEs. The question of robustness is examined on a noised data example.

Interaction-Transformation Evolutionary Algorithm for Symbolic Regression

The Interaction-Transformation (IT) is a new representation for Symbolic Regression that restricts the search space into simpler, but expressive, function forms. This representation has the advantage of creating a smoother search space unlike the space generated by Expression Trees, the common representation used in Genetic Programming. This paper introduces an Evolutionary Algorithm capable of evolving a population of IT expressions supported only by the mutation operator. The results show that this representation is capable of finding better approximations to real-world data sets when compared to traditional approaches and a state-of-the-art Genetic Programming algorithm.

Online Diversity Control in Symbolic Regression via a Fast Hash-based Tree Similarity Measure

Diversity represents an important aspect of genetic programming, being directly correlated with search performance. When considered at the genotype level, diversity often requires expensive tree distance measures which have a negative impact on the algorithm's runtime performance. In this work we introduce a fast, hash-based tree distance measure to massively speed-up the calculation of population diversity during the algorithmic run. We combine this measure with the standard GA and the NSGA-II genetic algorithms to steer the search towards higher diversity. We validate the approach on a collection of benchmark problems for symbolic regression where our method consistently outperforms the standard GA as well as NSGA-II configurations with different secondary objectives.

Neural-Guided Symbolic Regression with Asymptotic Constraints

Symbolic regression is a type of discrete optimization problem that involves searching expressions that fit given data points. In many cases, other mathematical constraints about the unknown expression not only provide more information beyond just values at some inputs, but also effectively constrain the search space. We identify the asymptotic constraints of leading polynomial powers as the function approaches zero and infinity as useful constraints and create a system to use them for symbolic regression. The first part of the system is a conditional production rule generating neural network which preferentially generates production rules to construct expressions with the desired leading powers, producing novel expressions outside the training domain. The second part, which we call Neural-Guided Monte Carlo Tree Search, uses the network during a search to find an expression that conforms to a set of data points and desired leading powers. Lastly, we provide an extensive experimental validation on thousands of target expressions showing the efficacy of our system compared to exiting methods for finding unknown functions outside of the training set.

Symbolic Regression in Materials Science

We showcase the potential of symbolic regression as an analytic method for use in materials research. First, we briefly describe the current state-of-the-art method, genetic programming-based symbolic regression (GPSR), and recent advances in symbolic regression techniques. Next, we discuss industrial applications of symbolic regression and its potential applications in materials science. We then present two GPSR use-cases formulating a transformation kinetics law and showing the learning scheme discovers the well-known Johnson-Mehl-Avrami-Kolmogorov (JMAK) form, and learning the Landau free energy functional form for the displacive tilt transition in perovskite LaNiO$_3$. Finally, we propose that symbolic regression techniques should be considered by materials scientists as an alternative to other machine-learning-based regression models for learning from data.

Automated Discovery of Jet Substructure Analyses

The study of the substructure of collimated particles from quarks and gluons, or jets, has the promise to reveal the details how color charges interact with the QCD plasma medium created in colliders such as RHIC and the LHC. Traditional jet substructure observables have been constructed using expert knowledge, and are largely transplanted, unmodified, from the high-energy physics, where the goal is primarily the study of boosted hadronic decays. A novel neural network architecture is described that is capable of examining theoretical models, and constructs, on its own, an analysis procedure that is sensitive to the internal model features. This architecture, in combination with symbolic regression, further allows the extraction of closed-form algebraic expressions from the learned result -- enabling the automatically constructed jet substructure analysis to be subsequently understood and reproduced by humans. This system is then tasked to construct an analysis that infers the plasma temperature from observing jets, which is demonstrated using both JEWEL and the Linearized Boltzmann Transport model, and at the presence of a realistic remnant of the plasma, or underlying event, that the measurement has to overcome. In a demonstration how algorithms can produce original research in direct competition to human experts, the resulting jet substructure variables and analyses are capable of determining the initial temperature of the plasma medium from analyzing 1200--2500 jets, a performance not seen in existing, manually designed analyses. Comparison of an incidentally discovered observable with the existing literature further indicates that the system described is capable of examining the model phase spaces to a detail at least comparable to the current field of human experts.

Symbolic regression based genetic approximations of the Colebrook equation for flow friction

Widely used in hydraulics, the Colebrook equation for flow friction relates implicitly to the input parameters; the Reynolds number, and the relative roughness of inner pipe surface, with the output unknown parameter; the flow friction factor. In this paper, a few explicit approximations to the Colebrook equation are generated using the ability of artificial intelligence to make inner patterns to connect input and output parameters in explicit way not knowing their nature or the physical law that connects them, but only knowing raw numbers. The fact that the used genetic programming tool does not know the structure of the Colebrook equation which is based on computationally expensive logarithmic law, is used to obtain better structure of the approximations which is less demanding for calculation but also enough accurate. All generated approximations are with low computational cost because they contain a limited number of logarithmic forms used although for normalization of input parameters or for acceleration, but they are also sufficiently accurate. The relative error regarding the friction factor in best case is up to 0.13% with only two logarithmic forms used. As the second logarithm can be accurately approximated by the Pade approximation, practically the same error is obtained also using only one logarithm.

Linear Combination of Distance Measures for Surrogate Models in Genetic Programming

Surrogate models are a well established approach to reduce the number of expensive function evaluations in continuous optimization. In the context of genetic programming, surrogate modeling still poses a challenge, due to the complex genotype-phenotype relationships. We investigate how different genotypic and phenotypic distance measures can be used to learn Kriging models as surrogates. We compare the measures and suggest to use their linear combination in a kernel. We test the resulting model in an optimization framework, using symbolic regression problem instances as a benchmark. Our experiments show that the model provides valuable information. Firstly, the model enables an improved optimization performance compared to a model-free algorithm. Furthermore, the model provides information on the contribution of different distance measures. The data indicates that a phenotypic distance measure is important during the early stages of an optimization run when less data is available. In contrast, genotypic measures, such as the tree edit distance, contribute more during the later stages.

GP-RVM Genetic Programing-based Symbolic Regression Using Relevance Vector Machine

This paper proposes a hybrid basis function construction method (GP-RVM) for Symbolic Regression problem, which combines an extended version of Genetic Programming called Kaizen Programming and Relevance Vector Machine to evolve an optimal set of basis functions. Different from traditional evolutionary algorithms where a single individual is a complete solution, our method proposes a solution based on linear combination of basis functions built from individuals during the evolving process. RVM which is a sparse Bayesian kernel method selects suitable functions to constitute the basis. RVM determines the posterior weight of a function by evaluating its quality and sparsity. The solution produced by GP-RVM is a sparse Bayesian linear model of the coefficients of many non-linear functions. Our hybrid approach is focused on nonlinear white-box models selecting the right combination of functions to build robust predictions without prior knowledge about data. Experimental results show that GP-RVM outperforms conventional methods, which suggest that it is an efficient and accurate technique for solving SR. The computational complexity of GP-RVM scales in $O( M^{3})$, where $M$ is the number of functions in the basis set and is typically much smaller than the number $N$ of training patterns.

Analysing Symbolic Regression Benchmarks under a Meta-Learning Approach

The definition of a concise and effective testbed for Genetic Programming (GP) is a recurrent matter in the research community. This paper takes a new step in this direction, proposing a different approach to measure the quality of the symbolic regression benchmarks quantitatively. The proposed approach is based on meta-learning and uses a set of dataset meta-features---such as the number of examples or output skewness---to describe the datasets. Our idea is to correlate these meta-features with the errors obtained by a GP method. These meta-features define a space of benchmarks that should, ideally, have datasets (points) covering different regions of the space. An initial analysis of 63 datasets showed that current benchmarks are concentrated in a small region of this benchmark space. We also found out that number of instances and output skewness are the most relevant meta-features to GP output error. Both conclusions can help define which datasets should compose an effective testbed for symbolic regression methods.

Where are we now? A large benchmark study of recent symbolic regression methods

In this paper we provide a broad benchmarking of recent genetic programming approaches to symbolic regression in the context of state of the art machine learning approaches. We use a set of nearly 100 regression benchmark problems culled from open source repositories across the web. We conduct a rigorous benchmarking of four recent symbolic regression approaches as well as nine machine learning approaches from scikit-learn. The results suggest that symbolic regression performs strongly compared to state-of-the-art gradient boosting algorithms, although in terms of running times is among the slowest of the available methodologies. We discuss the results in detail and point to future research directions that may allow symbolic regression to gain wider adoption in the machine learning community.

Solving the Exponential Growth of Symbolic Regression Trees in Geometric Semantic Genetic Programming

Advances in Geometric Semantic Genetic Programming (GSGP) have shown that this variant of Genetic Programming (GP) reaches better results than its predecessor for supervised machine learning problems, particularly in the task of symbolic regression. However, by construction, the geometric semantic crossover operator generates individuals that grow exponentially with the number of generations, resulting in solutions with limited use. This paper presents a new method for individual simplification named GSGP with Reduced trees (GSGP-Red). GSGP-Red works by expanding the functions generated by the geometric semantic operators. The resulting expanded function is guaranteed to be a linear combination that, in a second step, has its repeated structures and respective coefficients aggregated. Experiments in 12 real-world datasets show that it is not only possible to create smaller and completely equivalent individuals in competitive computational time, but also to reduce the number of nodes composing them by 58 orders of magnitude, on average.

Optimal Symbolic Controllers Determinization for BDD storage

Controller synthesis techniques based on symbolic abstractions appeal by producing correct-by-design controllers, under intricate behavioural constraints. Yet, being relations between abstract states and inputs, such controllers are immense in size, which makes them futile for em- bedded platforms. Control-synthesis tools such as PESSOA, SCOTS, and CoSyMA tackle the problem by storing controllers as binary decision di- agrams (BDDs). However, due to redundantly keeping multiple inputs per-state, the resulting controllers are still too large. In this work, we first show that choosing an optimal controller determinization is an NP- complete problem. Further, we consider the previously known controller determinization technique and discuss its weaknesses. We suggest several new approaches to the problem, based on greedy algorithms, symbolic regression, and (muli-terminal) BDDs. Finally, we empirically compare the techniques and show that some of the new algorithms can produce up to 85% smaller controllers than those obtained with the previous technique.

Glyph Symbolic Regression Tools

We present Glyph - a Python package for genetic programming based symbolic regression. Glyph is designed for usage let by numerical simulations let by real world experiments. For experimentalists, glyph-remote provides a separation of tasks a ZeroMQ interface splits the genetic programming optimization task from the evaluation of an experimental (or numerical) run. Glyph can be accessed at this http URL . Domain experts are be able to employ symbolic regression in their experiments with ease, even if they are not expert programmers. The reuse potential is kept high by a generic interface design. Glyph is available on PyPI and Github.

Towards Advanced Phenotypic Mutations in Cartesian Genetic Programming

Cartesian Genetic Programming is often used with a point mutation as the sole genetic operator. In this paper, we propose two phenotypic mutation techniques and take a step towards advanced phenotypic mutations in Cartesian Genetic Programming. The functionality of the proposed mutations is inspired by biological evolution which mutates DNA sequences by inserting and deleting nucleotides. Experiments with symbolic regression and boolean functions problems show a better search performance when the proposed mutations are in use. The results of our experiments indicate that the use of phenotypic mutations could be beneficial for the use of Cartesian Genetic Programming.

On the Generalizability of Linear and Non-Linear Region of Interest-Based Multivariate Regression Models for fMRI Data

In contrast to conventional, univariate analysis, various types of multivariate analysis have been applied to functional magnetic resonance imaging (fMRI) data. In this paper, we compare two contemporary approaches for multivariate regression on task-based fMRI data linear regression with ridge regularization and non-linear symbolic regression using genetic programming. The data for this project is representative of a contemporary fMRI experimental design for visual stimuli. Linear and non-linear models were generated for 10 subjects, with another 4 withheld for validation. Model quality is evaluated by comparing $R$ scores (Pearson product-moment correlation) in various contexts, including single run self-fit, within-subject generalization, and between-subject generalization. Propensity for modelling strategies to overfit is estimated using a separate resting state scan. Results suggest that neither method is objectively or inherently better than the other.

A Greedy Search Tree Heuristic for Symbolic Regression

Symbolic Regression tries to find a mathematical expression that describes the relationship of a set of explanatory variables to a measured variable. The main objective is to find a model that minimizes the error and, optionally, that also minimizes the expression size. A smaller expression can be seen as an interpretable model considered a reliable decision model. This is often performed with Genetic Programming which represents their solution as expression trees. The shortcoming of this algorithm lies on this representation that defines a rugged search space and contains expressions of any size and difficulty. These pose as a challenge to find the optimal solution under computational constraints. This paper introduces a new data structure, called Interaction-Transformation (IT), that constrains the search space in order to exclude a region of larger and more complicated expressions. In order to test this data structure, it was also introduced an heuristic called SymTree. The obtained results show evidence that SymTree are capable of obtaining the optimal solution whenever the target function is within the search space of the IT data structure and competitive results when it is not. Overall, the algorithm found a good compromise between accuracy and simplicity for all the generated models.

Interpretable Policies for Reinforcement Learning by Genetic Programming

The search for interpretable reinforcement learning policies is of high academic and industrial interest. Especially for industrial systems, domain experts are more likely to deploy autonomously learned controllers if they are understandable and convenient to evaluate. Basic algebraic equations are supposed to meet these requirements, as long as they are restricted to an adequate complexity. Here we introduce the genetic programming for reinforcement learning (GPRL) approach based on model-based batch reinforcement learning and genetic programming, which autonomously learns policy equations from pre-existing default state-action trajectory samples. GPRL is compared to a straight-forward method which utilizes genetic programming for symbolic regression, yielding policies imitating an existing well-performing, but non-interpretable policy. Experiments on three reinforcement learning benchmarks, i.e., mountain car, cart-pole balancing, and industrial benchmark, demonstrate the superiority of our GPRL approach compared to the symbolic regression method. GPRL is capable of producing well-performing interpretable reinforcement learning policies from pre-existing default trajectory data.

Calculation of secondary electron emission yields from low-energy electron deposition in tungsten surfaces

We present calculations of secondary electron emission (SEE) yields in tungsten as a function of primary electron energies between 50 eV and 1 keV and incidence angles between 0 and 90°. We conduct a review of the established Monte Carlo methods to simulate multiple electron scattering in solids and select the best suited to study SEE in high-Z metals. We generate secondary electron yield and emission energy functions of the incident energy and angle and fit them to bivariate fitting functions using symbolic regression. We compare the numerical results with experimental data, with good agreement found. Our calculations are the first step towards studying SEE in nanoarchitected surfaces for electric propulsion chamber walls.

Globally Optimal Symbolic Regression

In this study we introduce a new technique for symbolic regression that guarantees global optimality. This is achieved by formulating a mixed integer non-linear program (MINLP) whose solution is a symbolic mathematical expression of minimum complexity that explains the observations. We demonstrate our approach by rediscovering Kepler's law on planetary motion using exoplanet data and Galileo's pendulum periodicity equation using experimental data.

A probabilistic and multi-objective analysis of lexicase selection and epsilon-lexicase selection

Lexicase selection is a parent selection method that considers training cases individually, rather than in aggregate, when performing parent selection. Whereas previous work has demonstrated the ability of lexicase selection to solve difficult problems in program synthesis and symbolic regression, the central goal of this paper is to develop the theoretical underpinnings that explain its performance. To this end, we derive an analytical formula that gives the expected probabilities of selection under lexicase selection, given a population and its behavior. In addition, we expand upon the relation of lexicase selection to many-objective optimization methods to describe the behavior of lexicase selection, which is to select individuals on the boundaries of Pareto fronts in high-dimensional space. We show analytically why lexicase selection performs more poorly for certain sizes of population and training cases, and show why it has been shown to perform more poorly in continuous error spaces. To address this last concern, we propose new variants of epsilon-lexicase selection, a method that modifies the pass condition in lexicase selection to allow near-elite individuals to pass cases, thereby improving selection performance with continuous errors. We show that epsilon-lexicase outperforms several diversity-maintenance strategies on a number of real-world and synthetic regression problems.

P-Tree Programming

We propose a novel method for automatic program synthesis. P-Tree Programming represents the program search space through a single probabilistic prototype tree. From this prototype tree we form program instances which we evaluate on a given problem. The error values from the evaluations are propagated through the prototype tree. We use them to update the probability distributions that determine the symbol choices of further instances. The iterative method is applied to several symbolic regression benchmarks from the literature. It outperforms standard Genetic Programming to a large extend. Furthermore, it relies on a concise set of parameters which are held constant for all problems. The algorithm can be employed for most of the typical computational intelligence tasks such as classification, automatic program induction, and symbolic regression.

A divide and conquer method for symbolic regression

Symbolic regression aims to find a function that best explains the relationship between independent variables and the objective value based on a given set of sample data. Genetic programming (GP) is usually considered as an appropriate method for the problem since it can optimize functional structure and coefficients simultaneously. However, the convergence speed of GP might be too slow for large scale problems that involve a large number of variables. Fortunately, in many applications, the target function is separable or partially separable. This feature motivated us to develop a new method, divide and conquer (D&C), for symbolic regression, in which the target function is divided into a number of sub-functions and the sub-functions are then determined by any of a GP algorithm. The separability is probed by a new proposed technique, Bi-Correlation test (BiCT). D&C powered GP has been tested on some real-world applications, and the study shows that D&C can help GP to get the target function much more rapidly.

Block building programming for symbolic regression

Symbolic regression that aims to detect underlying data-driven models has become increasingly important for industrial data analysis. For most existing algorithms such as genetic programming (GP), the convergence speed might be too slow for large-scale problems with a large number of variables. This situation may become even worse with increasing problem size. The aforementioned difficulty makes symbolic regression limited in practical applications. Fortunately, in many engineering problems, the independent variables in target models are separable or partially separable. This feature inspires us to develop a new approach, block building programming (BBP). BBP divides the original target function into several blocks, and further into factors. The factors are then modeled by an optimization engine (e.g. GP). Under such circumstances, BBP can make large reductions to the search space. The partition of separability is based on a special method, block and factor detection. Two different optimization engines are applied to test the performance of BBP on a set of symbolic regression problems. Numerical results show that BBP has a good capability of structure and coefficient optimization with high computational efficiency.

Elite Bases Regression A Real-time Algorithm for Symbolic Regression

Symbolic regression is an important but challenging research topic in data mining. It can detect the underlying mathematical models. Genetic programming (GP) is one of the most popular methods for symbolic regression. However, its convergence speed might be too slow for large scale problems with a large number of variables. This drawback has become a bottleneck in practical applications. In this paper, a new non-evolutionary real-time algorithm for symbolic regression, Elite Bases Regression (EBR), is proposed. EBR generates a set of candidate basis functions coded with parse-matrix in specific mapping rules. Meanwhile, a certain number of elite bases are preserved and updated iteratively according to the correlation coefficients with respect to the target model. The regression model is then spanned by the elite bases. A comparative study between EBR and a recent proposed machine learning method for symbolic regression, Fast Function eXtraction (FFX), are conducted. Numerical results indicate that EBR can solve symbolic regression problems more effectively.

Learning Linear Feature Space Transformations in Symbolic Regression

We propose a new type of leaf node for use in Symbolic Regression (SR) that performs linear combinations of feature variables (LCF). These nodes can be handled in three different modes -- an unsynchronized mode, where all LCFs are free to change on their own, a synchronized mode, where LCFs are sorted into groups in which they are forced to be identical throughout the whole individual, and a globally synchronized mode, which is similar to the previous mode but the grouping is done across the whole population. We also present two methods of evolving the weights of the LCFs -- a purely stochastic way via mutation and a gradient-based way based on the backpropagation algorithm known from neural networks -- and also a combination of both. We experimentally evaluate all configurations of LCFs in Multi-Gene Genetic Programming (MGGP), which was chosen as baseline, on a number of benchmarks. According to the results, we identified two configurations which increase the performance of the algorithm.

Interval Arithmetic and Interval-Aware Operators for Genetic Programming

Symbolic regression via genetic programming is a flexible approach to machine learning that does not require up-front specification of model structure. However, traditional approaches to symbolic regression require the use of protected operators, which can lead to perverse model characteristics and poor generalisation. In this paper, we revisit interval arithmetic as one possible solution to allow genetic programming to perform regression using unprotected operators. Using standard benchmarks, we show that using interval arithmetic within model evaluation does not prevent invalid solutions from entering the population, meaning that search performance remains compromised. We extend the basic interval arithmetic concept with `safe' search operators that integrate interval information into their process, thereby greatly reducing the number of invalid solutions produced during search. The resulting algorithms are able to more effectively identify good models that generalise well to unseen data. We conclude with an analysis of the sensitivity of interval arithmetic-based operators with respect to the accuracy of the supplied input feature intervals.

A Probabilistic Linear Genetic Programming with Stochastic Context-Free Grammar for solving Symbolic Regression problems

Traditional Linear Genetic Programming (LGP) algorithms are based only on the selection mechanism to guide the search. Genetic operators combine or mutate random portions of the individuals, without knowing if the result will lead to a fitter individual. Probabilistic Model Building Genetic Programming (PMB-GP) methods were proposed to overcome this issue through a probability model that captures the structure of the fit individuals and use it to sample new individuals. This work proposes the use of LGP with a Stochastic Context-Free Grammar (SCFG), that has a probability distribution that is updated according to selected individuals. We proposed a method for adapting the grammar into the linear representation of LGP. Tests performed with the proposed probabilistic method, and with two hybrid approaches, on several symbolic regression benchmark problems show that the results are statistically better than the obtained by the traditional LGP.

Grammar Variational Autoencoder

Deep generative models have been wildly successful at learning coherent latent representations for continuous data such as video and audio. However, generative modeling of discrete data such as arithmetic expressions and molecular structures still poses significant challenges. Crucially, state-of-the-art methods often produce outputs that are not valid. We make the key observation that frequently, discrete data can be represented as a parse tree from a context-free grammar. We propose a variational autoencoder which encodes and decodes directly to and from these parse trees, ensuring the generated outputs are always valid. Surprisingly, we show that not only does our model more often generate valid outputs, it also learns a more coherent latent space in which nearby points decode to similar discrete outputs. We demonstrate the effectiveness of our learned models by showing their improved performance in Bayesian optimization for symbolic regression and molecular synthesis.

Symbolic Regression Algorithms with Built-in Linear Regression

Recently, several algorithms for symbolic regression (SR) emerged which employ a form of multiple linear regression (LR) to produce generalized linear models. The use of LR allows the algorithms to create models with relatively small error right from the beginning of the search; such algorithms are thus claimed to be (sometimes by orders of magnitude) faster than SR algorithms based on vanilla genetic programming. However, a systematic comparison of these algorithms on a common set of problems is still missing. In this paper we conceptually and experimentally compare several representatives of such algorithms (GPTIPS, FFX, and EFS). They are applied as off-the-shelf, ready-to-use techniques, mostly using their default settings. The methods are compared on several synthetic and real-world SR benchmark problems. Their performance is also related to the performance of three conventional machine learning algorithms --- multiple regression, random forests and support vector regression.

Learning Optimal Control of Synchronization in Networks of Coupled Oscillators using Genetic Programming-based Symbolic Regression

Networks of coupled dynamical systems provide a powerful way to model systems with enormously complex dynamics, such as the human brain. Control of synchronization in such networked systems has far reaching applications in many domains, including engineering and medicine. In this paper, we formulate the synchronization control in dynamical systems as an optimization problem and present a multi-objective genetic programming-based approach to infer optimal control functions that drive the system from a synchronized to a non-synchronized state and vice-versa. The genetic programming-based controller allows learning optimal control functions in an interpretable symbolic form. The effectiveness of the proposed approach is demonstrated in controlling synchronization in coupled oscillator systems linked in networks of increasing order complexity, ranging from a simple coupled oscillator system to a hierarchical network of coupled oscillators. The results show that the proposed method can learn highly-effective and interpretable control functions for such systems.

Differentiable Genetic Programming

We introduce the use of high order automatic differentiation, implemented via the algebra of truncated Taylor polynomials, in genetic programming. Using the Cartesian Genetic Programming encoding we obtain a high-order Taylor representation of the program output that is then used to back-propagate errors during learning. The resulting machine learning framework is called differentiable Cartesian Genetic Programming (dCGP). In the context of symbolic regression, dCGP offers a new approach to the long unsolved problem of constant representation in GP expressions. On several problems of increasing complexity we find that dCGP is able to find the exact form of the symbolic expression as well as the constants values. We also demonstrate the use of dCGP to solve a large class of differential equations and to find prime integrals of dynamical systems, presenting, in both cases, results that confirm the efficacy of our approach.

Prediction of Dynamical Systems by Symbolic Regression

We study the modeling and prediction of dynamical systems based on conventional models derived from measurements. Such algorithms are highly desirable in situations where the underlying dynamics are hard to model from physical principles or simplified models need to be found. We focus on symbolic regression methods as a part of machine learning. These algorithms are capable of learning an analytically tractable model from data, a highly valuable property. Symbolic regression methods can be considered as generalized regression methods. We investigate two particular algorithms, the so-called fast function extraction which is a generalized linear regression algorithm, and genetic programming which is a very general method. Both are able to combine functions in a certain way such that a good model for the prediction of the temporal evolution of a dynamical system can be identified. We illustrate the algorithms by finding a prediction for the evolution of a harmonic oscillator based on measurements, by detecting an arriving front in an excitable system, and as a real-world application, the prediction of solar power production based on energy production observations at a given site together with the weather forecast.

Developing Postfix-GP Framework for Symbolic Regression Problems

This paper describes Postfix-GP system, postfix notation based Genetic Programming (GP), for solving symbolic regression problems. It presents an object-oriented architecture of Postfix-GP framework. It assists the user in understanding of the implementation details of various components of Postfix-GP. Postfix-GP provides graphical user interface which allows user to configure the experiment, to visualize evolved solutions, to analyze GP run, and to perform out-of-sample predictions. The use of Postfix-GP is demonstrated by solving the benchmark symbolic regression problem. Finally, features of Postfix-GP framework are compared with that of other GP systems.

GPTIPS 2 an open-source software platform for symbolic data mining

GPTIPS is a free, open source MATLAB based software platform for symbolic data mining (SDM). It uses a multigene variant of the biologically inspired machine learning method of genetic programming (MGGP) as the engine that drives the automatic model discovery process. Symbolic data mining is the process of extracting hidden, meaningful relationships from data in the form of symbolic equations. In contrast to other data-mining methods, the structural transparency of the generated predictive equations can give new insights into the physical systems or processes that generated the data. Furthermore, this transparency makes the models very easy to deploy outside of MATLAB. The rationale behind GPTIPS is to reduce the technical barriers to using, understanding, visualising and deploying GP based symbolic models of data, whilst at the same time remaining highly customisable and delivering robust numerical performance for power users. In this chapter, notable new features of the latest version of the software are discussed with these aims in mind. Additionally, a simplified variant of the MGGP high level gene crossover mechanism is proposed. It is demonstrated that the new functionality of GPTIPS 2 (a) facilitates the discovery of compact symbolic relationships from data using multiple approaches, e.g. using novel gene-centric visualisation analysis to mitigate horizontal bloat and reduce complexity in multigene symbolic regression models (b) provides numerous methods for visualising the properties of symbolic models (c) emphasises the generation of graphically navigable libraries of models that are optimal in terms of the Pareto trade off surface of model performance and complexity and (d) expedites real world applications by the simple, rapid and robust deployment of symbolic models outside the software environment they were developed in.

Automated conjecturing of Frobenius numbers via grammatical evolution

Conjecturing formulas and other symbolic relations occurs frequently in number theory and combinatorics. If we could automate conjecturing, we could benefit not only from speeding up, but also from finding conjectures previously out of our grasp. Grammatical evolution, a genetic programming technique, can be used for automated conjecturing in mathematics. Concretely, this work describes how one can interpret the Frobenius problem as a symbolic regression problem, and then apply grammatical evolution to it. In this manner, a few formulas for Frobenius numbers of specific quadruples were found automatically. The sketch of the proof for one conjectured formula, using lattice point enumeration method, is provided as well. Same method can easily be used on other problems to speed up and enhance the research process.

Symbolic regression of generative network models

Networks are a powerful abstraction with applicability to a variety of scientific fields. Models explaining their morphology and growth processes permit a wide range of phenomena to be more systematically analysed and understood. At the same time, creating such models is often challenging and requires insights that may be counter-intuitive. Yet there currently exists no general method to arrive at better models. We have developed an approach to automatically detect realistic decentralised network growth models from empirical data, employing a machine learning technique inspired by natural selection and defining a unified formalism to describe such models as computer programs. As the proposed method is completely general and does not assume any pre-existing models, it can be applied "out of the box" to any given network. To validate our approach empirically, we systematically rediscover pre-defined growth laws underlying several canonical network generation models and credible laws for diverse real-world networks. We were able to find programs that are simple enough to lead to an actual understanding of the mechanisms proposed, namely for a simple brain and a social network.

Global solar irradiation prediction using a multi-gene genetic programming approach

In this paper, a nonlinear symbolic regression technique using an evolutionary algorithm known as multi-gene genetic programming (MGGP) is applied for a data-driven modelling between the dependent and the independent variables. The technique is applied for modelling the measured global solar irradiation and validated through numerical simulations. The proposed modelling technique shows improved results over the fuzzy logic and artificial neural network (ANN) based approaches as attempted by contemporary researchers. The method proposed here results in nonlinear analytical expressions, unlike those with neural networks which is essentially a black box modelling approach. This additional flexibility is an advantage from the modelling perspective and helps to discern the important variables which affect the prediction. Due to the evolutionary nature of the algorithm, it is able to get out of local minima and converge to a global optimum unlike the back-propagation (BP) algorithm used for training neural networks. This results in a better percentage fit than the ones obtained using neural networks by contemporary researchers. Also a hold-out cross validation is done on the obtained genetic programming (GP) results which show that the results generalize well to new data and do not over-fit the training samples. The multi-gene GP results are compared with those, obtained using its single-gene version and also the same with four classical regression models in order to show the effectiveness of the adopted approach.

Shadow networks Discovering hidden nodes with models of information flow

Complex, dynamic networks underlie many systems, and understanding these networks is the concern of a great span of important scientific and engineering problems. Quantitative description is crucial for this understanding yet, due to a range of measurement problems, many real network datasets are incomplete. Here we explore how accidentally missing or deliberately hidden nodes may be detected in networks by the effect of their absence on predictions of the speed with which information flows through the network. We use Symbolic Regression (SR) to learn models relating information flow to network topology. These models show localized, systematic, and non-random discrepancies when applied to test networks with intentionally masked nodes, demonstrating the ability to detect the presence of missing nodes and where in the network those nodes are likely to reside.

Data Mining using Unguided Symbolic Regression on a Blast Furnace Dataset

In this paper a data mining approach for variable selection and knowledge extraction from datasets is presented. The approach is based on unguided symbolic regression (every variable present in the dataset is treated as the target variable in multiple regression runs) and a novel variable relevance metric for genetic programming. The relevance of each input variable is calculated and a model approximating the target variable is created. The genetic programming configurations with different target variables are executed multiple times to reduce stochastic effects and the aggregated results are displayed as a variable interaction network. This interaction network highlights important system components and implicit relations between the variables. The whole approach is tested on a blast furnace dataset, because of the complexity of the blast furnace and the many interrelations between the variables. Finally the achieved results are discussed with respect to existing knowledge about the blast furnace process.

The first analytical expression to estimate photometric redshifts suggested by a machine

We report the first analytical expression purely constructed by a machine to determine photometric redshifts ($z_{\rm phot}$) of galaxies. A simple and reliable functional form is derived using $41,214$ galaxies from the Sloan Digital Sky Survey Data Release 10 (SDSS-DR10) spectroscopic sample. The method automatically dropped the $u$ and $z$ bands, relying only on $g$, $r$ and $i$ for the final solution. Applying this expression to other $1,417,181$ SDSS-DR10 galaxies, with measured spectroscopic redshifts ($z_{\rm spec}$), we achieved a mean $\langle (z_{\rm phot} - z_{\rm spec})/(1+z_{\rm spec})\rangle\lesssim 0.0086$ and a scatter $\sigma_{(z_{\rm phot} - z_{\rm spec})/(1+z_{\rm spec})}\lesssim 0.045$ when averaged up to $z \lesssim 1.0$. The method was also applied to the PHAT0 dataset, confirming the competitiveness of our results when faced with other methods from the literature. This is the first use of symbolic regression in cosmology, representing a leap forward in astronomy-data-mining connection.

Modified Soft Brood Crossover in Genetic Programming

Premature convergence is one of the important issues while using Genetic Programming for data modeling. It can be avoided by improving population diversity. Intelligent genetic operators can help to improve the population diversity. Crossover is an important operator in Genetic Programming. So, we have analyzed number of intelligent crossover operators and proposed an algorithm with the modification of soft brood crossover operator. It will help to improve the population diversity and reduce the premature convergence. We have performed experiments on three different symbolic regression problems. Then we made the performance comparison of our proposed crossover (Modified Soft Brood Crossover) with the existing soft brood crossover and subtree crossover operators.

Macro-Economic Time Series Modeling and Interaction Networks

Macro-economic models describe the dynamics of economic quantities. The estimations and forecasts produced by such models play a substantial role for financial and political decisions. In this contribution we describe an approach based on genetic programming and symbolic regression to identify variable interactions in large datasets. In the proposed approach multiple symbolic regression runs are executed for each variable of the dataset to find potentially interesting models. The result is a variable interaction network that describes which variables are most relevant for the approximation of each variable of the dataset. This approach is applied to a macro-economic dataset with monthly observations of important economic indicators in order to identify potentially interesting dependencies of these indicators. The resulting interaction network of macro-economic indicators is briefly discussed and two of the identified models are presented in detail. The two models approximate the help wanted index and the CPI inflation in the US.

Can the Eureqa symbolic regression program, computer algebra and numerical analysis help each other?

The Eureqa symbolic regression program has recently received extensive press praise. A representative quote is "There are very clever 'thinking machines' in existence today, such as Watson, the IBM computer that conquered Jeopardy! last year. But next to Eureqa, Watson is merely a glorified search engine." The program was designed to work with noisy experimental data. However, if the data is generated from an expression for which there exists more concise equivalent expressions, sometimes some of the Eureqa results are one or more of those more concise equivalents. If not, perhaps one or more of the returned Eureqa results might be a sufficiently accurate approximation that is more concise than the given expression. Moreover, when there is no known closed form expression, the data points can be generated by numerical methods, enabling Eureqa to find expressions that concisely fit those data points with sufficient accuracy. In contrast to typical regression software, the user does not have to explicitly or implicitly provide a specific expression or class of expressions containiing unknown constants for the software to determine. Is Eureqa useful enough in these regards to provide an additional tool for experimental mathematics, computer algebra users and numerical analysis? Yes if used carefully. Can computer algebra and numerical methods help Eureqa? Definitely.

System adjustment for targeted performance combining symbolic regression and set inversion

One presents methodology and algorithms to prepare a causal system in order to achieve desired performances if only input-output data are known and when no other informations are available. This can be done with mean of evolutionnary programming and set inversion methods, such as PSI-algorithm or SIvIA.

Improved Model Reduction and Tuning of Fractional Order PIλDμ Controllers for Analytical Rule Extraction with Genetic Programming

Genetic Algorithm (GA) has been used in this paper for a new approach of sub-optimal model reduction in the Nyquist plane and optimal time domain tuning of PID and fractional order (FO) PI{\lambda}D{\mu} controllers. Simulation studies show that the Nyquist based new model reduction technique outperforms the conventional H2 norm based reduced parameter modeling technique. With the tuned controller parameters and reduced order model parameter data-set, optimum tuning rules have been developed with a test-bench of higher order processes via Genetic Programming (GP). The GP performs a symbolic regression on the reduced process parameters to evolve a tuning rule which provides the best analytical expression to map the data. The tuning rules are developed for a minimum time domain integral performance index described by weighted sum of error index and controller effort. From the reported Pareto optimal front of GP based optimal rule extraction technique a trade-off can be made between the complexity of the tuning formulae and the control performance. The efficacy of the single-gene and multi-gene GP based tuning rules has been compared with original GA based control performance for the PID and PI{\lambda}D{\mu} controllers, handling four different class of representative higher order processes. These rules are very useful for process control engineers as they inherit the power of the GA based tuning methodology, but can be easily calculated without the requirement for running the computationally intensive GA every time. Three dimensional plots of the required variation in PID/FOPID controller parameters with reduced process parameters have been shown as a guideline for the operator. Parametric robustness of the reported GP based tuning rules has also been shown with credible simulation examples.

Predicting the Energy Output of Wind Farms Based on Weather Data Important Variables and their Correlation

Wind energy plays an increasing role in the supply of energy world-wide. The energy output of a wind farm is highly dependent on the weather condition present at the wind farm. If the output can be predicted more accurately, energy suppliers can coordinate the collaborative production of different energy sources more efficiently to avoid costly overproductions. With this paper, we take a computer science perspective on energy prediction based on weather data and analyze the important parameters as well as their correlation on the energy output. To deal with the interaction of the different parameters we use symbolic regression based on the genetic programming tool DataModeler. Our studies are carried out on publicly available weather and energy data for a wind farm in Australia. We reveal the correlation of the different variables for the energy output. The model obtained for energy prediction gives a very reliable prediction of the energy output for newly given weather data.

Dimensionally Constrained Symbolic Regression

We describe dimensionally constrained symbolic regression which has been developed for mass measurement in certain classes of events in high-energy physics (HEP). With symbolic regression, we can derive equations that are well known in HEP. However, in problems with large number of variables, we find that by constraining the terms allowed in the symbolic regression, convergence behavior is improved. Dimensionally constrained symbolic regression (DCSR) finds solutions with much better fitness than is normally possible with symbolic regression. In some cases, novel solutions are found.

Construction of a Kinematic Variable Sensitive to the Mass of the Standard Model Higgs Boson in H->WW*->l l nu nu-bar using Symbolic Regression

We derive a kinematic variable that is sensitive to the mass of the Standard Model Higgs boson (M_H) in the H->WW*->l l nu nu-bar channel using symbolic regression method. Explicit mass reconstruction is not possible in this channel due to the presence of two neutrinos which escape detection. Mass determination problem is that of finding a mass-sensitive function that depends on the measured observables. We use symbolic regression, which is an analytical approach to the problem of non-linear regression, to derive an analytic formula sensitive to M_H from the two lepton momenta and the missing transverse momentum. Using the newly-derived mass-sensitive variable, we expect Higgs mass resolutions between 1 to 4 GeV for M_H between 130 and 190 GeV at the LHC with 10 fb^-1 of data. This is the first time symbolic regression method has been applied to a particle physics problem.

Genetic Programming for Multi-Timescale Modeling

A bottleneck for multi-timescale dynamics is the computation of the potential energy surface (PES). We explore the use of genetic programming (GP) to symbolically regress a mapping of the saddle-point barriers from only a few calculated points via molecular dynamics, thereby avoiding explicit calculation of all the barriers. The GP-regressed barrier function enables use of kinetic Monte Carlo (KMC) to simulate real-time kinetics (seconds to hours) using realistic interactions. To illustrate, we apply a GP regression to vacancy-assisted migration on a surface of a binary alloy and predict the diffusion barriers within 0.1--1% error using 3% (or less) of the barriers, and discuss the significant reduction in CPU time.

Gene Expression Programming a New Adaptive Algorithm for Solving Problems

Gene expression programming, a genotype/phenotype genetic algorithm (linear and ramified), is presented here for the first time as a new technique for the creation of computer programs. Gene expression programming uses character linear chromosomes composed of genes structurally organized in a head and a tail. The chromosomes function as a genome and are subjected to modification by means of mutation, transposition, root transposition, gene transposition, gene recombination, and one- and two-point recombination. The chromosomes encode expression trees which are the object of selection. The creation of these separate entities (genome and expression tree) with distinct functions allows the algorithm to perform with high efficiency that greatly surpasses existing adaptive techniques. The suite of problems chosen to illustrate the power and versatility of gene expression programming includes symbolic regression, sequence induction with and without constant creation, block stacking, cellular automata rules for the density-classification problem, and two problems of boolean concept learning the 11-multiplexer and the GP rule problem.

You might be interested to know that TuringBot is an alternative to the Eureqa symbolic regression software. You can read more about it here.

You can also browse papers in other categories.