Modeling, Control and Industry
One of the methods used in the wastewater treatment is the adsorption process, a type of physicochemical treatment that commonly uses activated carbon as the absorbent. In this particular application, the removal of Phenol and Nitrophenol is required, and thus a prediction of adsorption removal (W%) is modeled based on a contaminant, pH and an initial concentration (Ci) for a given contact time. Given experimental data, a symbolic regression model is built with the premise of predicting reasonable well unseen data based on a well known family of techniques from the Evolutionary Computation (EC) domain,a branch of a broader Artificial Intelligence (AI) research area. Particularly, a novel framework is proposed called Memetic Genetic Programming (MGP), which essentially uses the advantages of explorative search,present on the majority of population based algorithms, and the explotative procedure using a deterministic method to build altogether a model. This model exhibit a pair of benefits: first, the interpretability is high which could be explained by an expert on account of its symbolic nature, and second, the accuracy is high enough to compete with other state-of-art techniques used in model extraction.
The MGP is designed by incorporating numerical Local Search (LS) optimization into a Canonical or Standard Genetic Programming (SGP). LS is explained in the context of the search space where the solutions are being explored, which can be in different layers of the algorithm, like syntaxis, fitness, output,etc. In our case, the LS space matches the solution output space given by SGP.
In SGP, a common representation for the solution is given by syntactic trees, which can be also viewed as mathematical expressions built during the evolutionary process by elementary units called functions. Each of these expressions or potential solutions represents an individual model in our problem, and are ranked by a fitness function that measures the error with respect to a desired output (the adsorption removal rate in our case). The evolutionary process mimics an optimization process where this fitness function is minimized to reach an acceptable value. However, this process usually takes computational resources to produce accurate solutions. This is because SGP is more eager to improve solutions by extending syntactic trees. Nonetheless, this takes more resources and there is no guarantee that significant improved solutions are ever going to be found.
The LS process is designed by optimizing a specific solution toward the target. To be able to do this,firstly, syntaxis trees are transformed by incorporating a parameter for each node. Second, a linear tree of the expression 1+2K(x)is inserted at the root node of each tree, where is the parameter set and K(x)is the tree before transformation.
The optimization is performed by a well known technique, the Trust Region, a variant of Levenberg-Marquardt, where an approximate model is found in a neighborhood during a given iteration of the algorithm. The Trust Region method excels by simultaneously finding the step size and direction toward the optimum, which could be local or global.
Not all solutions are picked up in the population to be optimized. This reduces resources and at the same time produces sufficient improvement in the overall population fitness. An heuristic method was used to choose the solutions, which is based on a ratio of the average population size and individual size. The size is expressed as the number of nodes in a given tree, an it has a direct relationship with the length of the mathematical expression. Small expressions are preferred to be optimized by this heuristic. The fitness function integrates an error measure given by the Mean Squared Error (MSE).
Data was randomly partitioned in training and testing subsets during the experimental procedure, with a 75% of total data for the former and the rest for the latter. 30 independent runs where performed for the experiment. The MGP setup was the following: 250 generations, 200 individuals, 0.9 probability of crossover, 0.1 probability of mutation, approximated 50% of population is optimized, 500 iterations for the Trust Region method, 17 maximum depth level, ramped half-and-half initialization and keep best as elitism.
Experimental statistical results shows good performance for unseen data in terms of Pearson correlation and MSE. The required correlation beforehand was to overpass= 0:96, which was successfully accomplished.In the other hand, in terms of algorithm performance, there is no evident presence of over-fitting which gives a rough idea of the model generalization capabilities. Preliminary results are shown in figures 1(a)and 1(b) for training and testing data respectively.
 Emigdio Z-Flores, Leonardo Trujillo, Oliver Schütze, and Pierrick Legrand. Evaluating the effects of local search in genetic programming. InEVOLVE, volume 288 of Advances in Intelligent Systems and Computing, pages 213–228. Springer International Publishing, 2014.
 Emigdio Z.-Flores, Leonardo Trujillo, Oliver Schütze, and Pierrick Legrand. A local search approach to genetic programming for binary classification. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO 2015, Madrid, Spain, July 11-15, 2015, pages 1151–1158, 2015.
 M. Abatal and M.T. Olguin. Comparative adsorption behavior between phenol and p-nitrophenol byna- and hdtma-clinoptilolite-rich tuff.Environmental Earth Sciences, 69(8):2691–2698, 2013.
- Emigdio Z.Flores