Semantic Genetic Programming for Sentiment Analysis

10:20 am - 10:45 am 24 Thursday


Genetic Programming

In last years, the production of textual documents in social media has increased exponentially. This ever-growing amount of available information promotes the research and business activities around opinion mining and sentiment analysis areas. In social media, people share comments about many disparate topics. i.e., events, persons, organization, etc. The main result is that social media has become a source of human opinion. For this reason, the text mining has received a lot of attention from many companies and governments.

The automatic sentiment analysis of given text is one of the most important tasks in text mining. This is a classification task that consists in determining whether a particular document has a positive, negative or neutral opinion. There also exists variations considering intermediate levels of the sentiments. Determining whether a text document has a positive or negative opinion is turning on an essential tool for both public and private companies [7]. This tool is useful to know “What people think”; so, it could be a significant help to any decision-making process (for any level of government, marketing, etc.) [6].

On the other hand, Genetic Programming (GP) is an evolutionary algorithm that has received a lot of attention due to its success in solving hard real-world problems [8]. Surprisingly, for the best of our knowledge, GP has not been used to tackle the problem of sentiment analysis. In fact, the use of GP in the field of text analysis is scarce, being one of these exceptions our previous research work (see [3]). In that paper, GP was used to optimize the weights in a vector space model for text classification, or the works in Automatic text summarization presented in [9, 10].

Sentiment analysis poses a number of challenges where semantic GP might be a feasible option. Some of these problems come from its high-dimensional representation and the considerable training set size. Just to fix ideas, a typical real-world dataset for text-mining is represented using tens to hundred of thousands coordinates and tens of thousands examples. These characteristics make traditional GP to suffer from finding an optimal solution in reasonable time. The interested reader in how a document collection is processed to obtain a vector representation is referenced to the specialized literature [1].

The fast convergence rate of novel semantic GP indicates that they may provide a feasible solution for text mining problems. The semantic genetic operators that seem to have the highest convergence rate are the ones proposed by [5, 2]. These both techniques were inspired by the geometric crossover and both use a more aggressive approach than the original one. The key idea for these new approaches consists in creating the best offspring that can be produced by a linear combination of the parents. However, these method suffer from overfitting. On the other hand, in [4], we applied the idea of a linear combination at the level of the individual, i.e., an individual is composed of a set of expressions which are linearly combined to produce the final output. This latter work presents a better trade off between learning and generalization.

In this contribution, we propose to combine both strategies, i.e., orchestrate the crossover and the local optimization of individuals, to tackle the problem of semantic analysis.


[1] Ricardo A. Baeza-Yates and Berthier A. Ribeiro-Neto.Modern Information Retrieval. Addison-Wesley, 2ndedition, 2011.

[2] Mauro Castelli, Leonardo Trujillo, Leonardo Vanneschi, Sara Silva, Emigdio Z-Flores, and Pierrick Legrand.Geometric Semantic Genetic Programming with Local Search. In Proceedings of the 2015 on Genetic and Evolutionary Computation Conference, GECCO ’15, pages 999–1006, New York, NY, USA, 2015. ACM. 00000.

[3] Hugo Jair Escalante, Mauricio A. Garcia-Limon, Alicia Morales-Reyes, Mario Graff, Manuel Montes-y Gomez,Eduardo F. Morales, and Jose Martinez-Carranza. Term-weighting learning via genetic programming for text classification.Knowledge-Based Systems, 2015. 00000.

[4] Mario Graff, Eric S. Tellez Hugo Jair Escalante, and Jose Ortiz-Bejar. Memetic genetic programming based on orthogonal projections in the phenotype space. IEEE International Autumn Meeting on Power, Electronics and Computing (ROPEC2015) (under review).

[5] Mario Graff, Eric Sadit Tellez, Elio Villasenor, and Sabino Miranda-Jiménez. Semantic genetic programming operators based on projections in the phenotype space.Research in Computing Science, 94:73–85, 2015.

[6] Bo Pang and Lillian Lee. Opinion mining and sentiment analysis.Foundations and Trends in Information Retrieval, 2(1-2):1–135, 2008.

[7] Tao Peng, Wanli Zuo, and Fengling He. Svm based adaptive learning method for text classification from positive and unlabeled documents.Knowledge and Information Systems, 16(3):281–301, 2008.

[8] Riccardo Poli, William B. Langdon, and Nicholas Freitag McPhee.A Field Guide to Genetic Programming.Lulu Enterprises, UK Ltd, March 2008. 00724.

[9] Carlos Nascimento Silla Jr., Gisele L. Pappa, Alex Alves Freitas, and Celso A. A. Kaestner. Automatic text summarization with genetic algorithm-based attribute selection. In Christian Lemaître, Carlos A. Reyes,and Jesús A. González, editors,Advances in Artificial Intelligence - IBERAMIA 2004, Proceedings 9th Ibero-American Conference on AI, volume 3315 of Lecture Notes in Computer Science, pages 305–314, Puebla,Mexico, November 22-26 2004. Springer.

[10] Nguyen Quang Uy, Pham Tuan Anh, Truong Cong Doan, and Nguyen Xuan Hoai. A study on the use of genetic programming for automatic text summarization. In Hung Dang-Van and Jeff Sanders, editors,The Fourth International Conference on Knowledge and Systems Engineering, KSE 2012, pages 93–98, Danang,Vietnam, 17-19 August 2012.

Mario Graff