TY - GEN
T1 - How to effectively use topic models for software engineering tasks? An approach based on Genetic Algorithms
AU - Panichella, Annibale
AU - Dit, Bogdan
AU - Oliveto, Rocco
AU - Di Penta, Massimiliano
AU - Poshyvanyk, Denys
AU - De Lucia, Andrea
PY - 2013
Y1 - 2013
AB - Information Retrieval (IR) methods, and in particular topic models, have recently been used to support essential software engineering (SE) tasks, by enabling software textual retrieval and analysis. In all these approaches, topic models have been used on software artifacts in a similar manner as they were used on natural language documents (e.g., using the same settings and parameters) because the underlying assumption was that source code and natural language documents are similar. However, applying topic models on software data using the same settings as for natural language text did not always produce the expected results. Recent research investigated this assumption and showed that source code is much more repetitive and predictable as compared to the natural language text. Our paper builds on this new fundamental finding and proposes a novel solution to adapt, configure and effectively use a topic modeling technique, namely Latent Dirichlet Allocation (LDA), to achieve better (acceptable) performance across various SE tasks. Our paper introduces a novel solution called LDA-GA, which uses Genetic Algorithms (GA) to determine a near-optimal configuration for LDA in the context of three different SE tasks: (1) traceability link recovery, (2) feature location, and (3) software artifact labeling. The results of our empirical studies demonstrate that LDA-GA is able to identify robust LDA configurations, which lead to a higher accuracy on all the datasets for these SE tasks as compared to previously published results, heuristics, and the results of a combinatorial search.
KW - Genetic Algorithms
KW - Latent Dirichlet Allocation
KW - Textual Analysis in Software Engineering
UR - http://www.scopus.com/inward/record.url?scp=84883710034&partnerID=8YFLogxK
U2 - 10.1109/ICSE.2013.6606598
DO - 10.1109/ICSE.2013.6606598
M3 - Conference contribution
AN - SCOPUS:84883710034
SN - 9781467330763
T3 - Proceedings - International Conference on Software Engineering
SP - 522
EP - 531
BT - 2013 35th International Conference on Software Engineering, ICSE 2013 - Proceedings
T2 - 2013 35th International Conference on Software Engineering, ICSE 2013
Y2 - 18 May 2013 through 26 May 2013
ER -