Chapter 6 - Latent Dirichlet Allocation: Extracting Topics from Software Engineering Data

https://doi.org/10.1016/B978-0-12-411519-4.00006-9Get rights and content

Abstract

Topic analysis is a powerful tool that extracts “topics” from document collections. Unlike manual tagging, which is effort intensive and requires expertise in the documents’ subject matter, topic analysis (in its simplest form) is an automated process. Relying on the assumption that each document in a collection refers to a small number of topics, it extracts bags of words attributable to these topics. These topics can be used to support document retrieval or to relate documents to each other through their associated topics. Given the variety and amount of textual information included in software repositories, in issue reports, in commit and source-code comments, and in other forms of documentation, this method has found many applications in the software-engineering field of mining software repositories.

This chapter provides an overview of the theory underlying latent Dirichlet allocation (LDA), the most popular topic-analysis method today. Next, it illustrates, with a brief tutorial introduction, how to employ LDA on a textual data set. Third, it reviews the software-engineering literature for uses of LDA for analyzing textual software-development assets, in order to support developers’ activities. Finally, we discuss the interpretability of the automatically extracted topics, and their correlation with tags provided by subject-matter experts.

References (0)

Cited by (150)

View all citing articles on Scopus
View full text