Chapter 6 - Latent Dirichlet Allocation: Extracting Topics from Software Engineering Data
Abstract
Topic analysis is a powerful tool that extracts “topics” from document collections. Unlike manual tagging, which is effort intensive and requires expertise in the documents’ subject matter, topic analysis (in its simplest form) is an automated process. Relying on the assumption that each document in a collection refers to a small number of topics, it extracts bags of words attributable to these topics. These topics can be used to support document retrieval or to relate documents to each other through their associated topics. Given the variety and amount of textual information included in software repositories, in issue reports, in commit and source-code comments, and in other forms of documentation, this method has found many applications in the software-engineering field of mining software repositories.
This chapter first provides an overview of the theory underlying latent Dirichlet allocation (LDA), the most popular topic-analysis method today. Second, it illustrates, with a brief tutorial introduction, how to apply LDA to a textual data set. Third, it reviews the software-engineering literature for uses of LDA in analyzing textual software-development assets to support developers’ activities. Finally, it discusses the interpretability of the automatically extracted topics and their correlation with tags provided by subject-matter experts.
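As a rough illustration of the idea summarized in the abstract (and not the chapter's own tutorial), the following minimal Python sketch fits LDA by collapsed Gibbs sampling, one common inference method. The function name `fit_lda`, the hyperparameter values, and the four toy commit-style messages are all assumptions made for this example.

```python
import random


def fit_lda(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Fit LDA with collapsed Gibbs sampling; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    # Count tables: topic counts per document, word counts per topic, totals per topic.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * V for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Random initial topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Unnormalized probability of each topic given all other assignments.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta) / (topic_total[k] + V * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1

    # Smoothed document-topic mixtures and the five most frequent words per topic.
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    top_words = [
        [vocab[v] for v in sorted(range(V), key=lambda v: -row[v])[:5]]
        for row in topic_word
    ]
    return theta, top_words


# Toy corpus (hypothetical): two bug-fix messages and two documentation messages.
docs = [
    "fix null pointer crash in parser".split(),
    "parser crash on empty input fix".split(),
    "update user guide and install docs".split(),
    "docs typo in install guide".split(),
]
theta, top_words = fit_lda(docs, num_topics=2)
for k, words in enumerate(top_words):
    print(f"topic {k}: {' '.join(words)}")
```

Each row of `theta` is one document's estimated mixture over topics, and each entry of `top_words` is the bag of words most strongly associated with a topic. On a corpus this small the result is only indicative; real repositories would need preprocessing (stop-word removal, identifier splitting) and a tuned number of topics.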
Cited by (150)
An exploratory study of software artifacts on GitHub from the lens of documentation
2024, Information and Software Technology
The abundance of software artifacts in open-source repositories has been analyzed by researchers from many perspectives, to address challenges in downstream tasks such as bug localization, code clone detection and so on. However, there is limited exploration of artifacts such as pull-requests and issues from a documentation perspective.
We aim to explore the presence of information useful for documentation in different sources within the software projects. We present an exploratory analysis of 1.38M artifacts extracted from 950 GitHub repositories that analyses the content present in multiple software artifacts from a documentation perspective.
We arrive at a list of documentation types and sources through card-sorting and a developer survey. We apply topic modeling on the data extracted from 1.38M software artifacts based on these lists and study the extent of documentation-related information present in the software artifacts. The exploratory analysis of the artifacts listed is consolidated into the ‘DocMine’ dataset that comprises 50.63M textual sentences spanning across repositories written in four different programming languages.
We observe that about 28.1% of content extracted from the artifacts contains information related to features and modifications of the project at a higher level, and that pull-requests and issues comprise 18.26% and 17.85% of the extracted information.
The presence of information about the projects in pull-requests and issues indicates immense scope in analyzing and processing multiple software artifacts for the purposes of generating software documentation and beyond. We envision that this study could open up a new line of research in software documentation.
Application of question answering systems for intelligent agriculture production and sustainable management: A review
2024, Resources, Conservation and Recycling
The increasing application of artificial intelligence in agriculture production and management has generated a large amount of data, leading to a demand for processing this data. This review focuses on the knowledge storage approaches in agricultural question answering systems, namely corpora, knowledge graphs, and large language models. These systems are built on massive amounts of data and aim to process and retrieve information effectively in the context of sustainable agriculture. Corpora refer to large collections of diverse documents that serve as foundational resources for training and fine-tuning question answering systems. Knowledge graphs capture structured and interconnected knowledge by representing entities, relationships, and attributes, enabling efficient organization and querying of information. Large language models, such as GPT-4, enhance the capacity of question answering systems to provide accurate and relevant responses. By exploring these three prominent knowledge storage approaches, this review analyses the methodology and impact of agricultural question answering systems, highlighting their applications in the production process. The findings provide important implications for future research in agriculture, and potential directions for further exploration.
Research topic identification and trend prediction of China's energy policy: A combined LDA-ARIMA approach
2024, Renewable Energy
Identifying the research topics of China's energy transition policy and predicting future research trends are crucial for policymakers to make informed decisions and advance the energy transition. This study applied Latent Dirichlet Allocation (LDA) to extract 26 major research topics from the 13,976 abstracts of Chinese policy research papers in Web of Science (WoS), conceptualized the five most popular topics, and used the ARIMA model to predict trends over the next 36 months. The results show that, except for Economic and Environmental Impact of Renewable Energy Consumption (Topic13), which will first experience unanticipated fluctuations and then show strong growth capacity, all other topics will maintain a positive growth trend. Research and Development (Topic3) is driven by the urgency of China's energy transformation. Economic and Environmental Impact of Renewable Energy Consumption (Topic13) is driven by the potential for the development of renewable energy. Economic Impact of Climate Change (Topic26) is driven by the climate change debate. Economic Growth, Urbanization, and Energy Consumption (Topic7) is driven by a discussion of their relationship and the mechanisms influencing them. Regional Efficiency and Productivity Analysis (Topic19) is driven by regional differences in China's energy policies.
Constructing tomorrow: A multifaceted exploration of Industry 4.0 scientific, patents, and market trend
2023, Automation in Construction
This article investigates the evolution of Industry 4.0 in construction from scientific, technological, and commercial perspectives. Initially, it was determined which technologies could be considered 4.0, in order to analyze construction and infrastructure results from scientific literature, patent publications, and market offerings. This study shows a panorama of the development stages of Industry 4.0 technologies in the construction and infrastructure sector based on 3155 articles from Web of Science, 14,360 patents from Derwent, and 2346 companies. According to the technologies analyzed in this research, the technologies of Construction Industry 4.0 are at different levels of development. The main conclusion of this study was that there is notable scientific and technological activity related to 4.0 technologies in construction. Moreover, the transition from technology to market has been particularly pronounced in Building Information Modeling and artificial intelligence.
Integrating Ergonomics and sustainability: A framework with LDA methodology and implementation roadmap
2023, Technology in Society
Sustainability has received increased interest from academics in various domains. However, the role of ergonomics/human factors in sustainability-related literature has not received the same level of interest, especially in the context of future research directions. For advancement in the field of ergonomics/human factors, sustainability is critical to fulfilling stakeholders' needs, putting eco-friendly and cost-effective arbitration into practice, and providing ergonomic solutions sensitive to human well-being. This paper identifies the critical research areas in which ergonomics/human factors can amalgamate with sustainability using the unsupervised machine learning algorithm ‘Latent Dirichlet allocation’. The results show that the text mining technique effectively analyzes the ergonomics/human factors discipline in sustainability. Furthermore, a conceptual framework is developed that identifies three potential research themes, which academicians and practitioners can apply as a good starting point for future research endeavors. In addition, this study proposes a manufacturing system model for measuring performance based on the Plan-Do-Check-Act (PDCA) cycle. The selection of lean, sustainability, and ergonomics/human factors tools is the key point for consideration when measuring performance on the proposed manufacturing system model stages.
AGRONER: An unsupervised agriculture named entity recognition using weighted distributional semantic model
2023, Expert Systems with Applications
In this work, we propose a novel weighted distributional semantic model for unsupervised Named Entity Recognition (NER) in domain-specific texts, specifically focusing on the agricultural domain. Developing accurate agriculture NER models requires overcoming several challenges, including the lack of annotated data, domain-specific vocabulary, entity ambiguity, and contextual variation. The proposed approach is completely unsupervised and utilizes an extended BERT model with LDA topic modeling for NER. The proposed Agricultural Named Entity Recognition (AGRONER) model focuses on identifying six major entities: disease, soil, pathogen, pesticide, crops, and place. The existing four entities are recognized using the proposed algorithm, while we utilize the AGROVOC dictionary for crops and Geocoding APIs for place entities. Due to the absence of a benchmark dataset in the agriculture domain, we created a corpus of 30,000 sentences extracted from recognized agriculture sites. For the evaluation, we used a test corpus with 700 sentences that include 1690 entity names. The labeled entities were then manually checked to evaluate the prediction accuracy. The proposed approach presents a macro average F-measure of 80.43%, which is quite promising for unsupervised domain-specific entity labeling. We performed ablation studies, where the proposed model exhibited relative F-measure improvements of 31.56% and 26.11% when compared to BERT without LDA and extended BERT without LDA models, respectively. Experimental results show the efficacy of the proposed approach in labeling the named entities in an unsupervised set-up for the agricultural domain. Further, the approach can be easily extended to recognize more domain-specific entities.