1 Introduction

Information technology (IT) has become an indispensable part of global society. One of the essential requirements for developing IT tools is computer programming, and the importance of programming education is attracting global attention [1]. Programming languages, curricula, teaching and learning methods, and platforms have become the subject of representative basic research on programming education [2,3,4]. As a result, a considerable amount of code is generated and accumulated daily by learners at different levels on platforms such as online judge (OJ) systems [5]. These large code archives can serve as a reference for problem solving and for searching problems and answers, as well as for educational research and analysis [6]. In the context of education, identifying the algorithm in the code can be useful for advanced code analysis, including code evaluation [7, 8], plagiarism checking, and problem evaluation (or difficulty estimation) [9, 10]. Furthermore, educational data mining (EDM) using large-scale programming data from repositories enables various empirical analyses, which reveal correlations between academic achievement and programming skills and support user assessment and learning-path recommendations that facilitate programming learning [1, 11, 12].

In software engineering (SE), algorithms are implemented at the functional level of the code. Solution codes can be reused for various purposes in SE in the form of libraries, open-source projects, components, and APIs. Code reuse, the practice of building new functions or programs from existing code snippets, is an important aspect of faster coding [13], and it requires understanding the reused code and the algorithms it implements. The identification of algorithms is also important for development environments (IDEs, editors, etc.) and related intelligent software tools that provide feedback and support functions. In a development environment, services for various types of searches over a set of program codes are indispensable. Identifying algorithms in code can be useful for advanced code analysis, including code cloning, refactoring, function prediction, debugging, code evaluation, and software metrics. As intelligent software tools, various ML models have been specifically designed for generating, evaluating, modifying, supplementing, and improving source code. The accuracy and efficiency of many specialized ML models for these operations, as well as for augmentation and retrieval tasks, depend strongly on identifying the program code [14, 15]. Therefore, the algorithm implemented in the code can be a useful feature for ML models.

Due to the vast amount of accumulated code, manually searching for codes using keywords, comments/documents, tags, names, and other metadata is a challenging task. The unavailability, non-uniformity, and inadequacy of metadata are also major obstacles to code retrieval. Many keywords are freely defined by programmers, which is the main reason for non-uniformity, and such keywords may not be suitable for accurate code classification. Moreover, metadata alone is not sufficient to find codes that implement similar algorithms for reference purposes. Therefore, artificial intelligence (AI) can be a core technology for solving this problem. In recent years, advanced deep neural network (DNN) models, such as recurrent neural networks (RNN), feed-forward neural networks (FNN), long short-term memory (LSTM) [16], bidirectional long short-term memory (BiLSTM) [17], and convolutional neural networks (CNN) [18], have been used effectively for tasks as diverse as computer vision [19,20,21,22], travel and Internet-of-Things time-series analysis [23, 24], fault diagnosis of chemical data [25], and autonomous transportation systems [26]. Meanwhile, DNN models are also considered an effective method in the context of programming activities.

In recent times, DNN models have achieved significant results for program code classification, recommendation, error detection, prediction, and code assessment [7, 12, 27,28,29,30]. Moreover, DNN models are used for various programming tasks (e.g., code completion, evaluation, repair, generation, and summarization) [31,32,33]. To make DNN models more effective in programming-related tasks, real-world programming data resources can be advantageous, and one such source is OJ data. The OJ system is an effective platform for programming exercises and competitions, allowing programmers to practice throughout the year [34, 35]. OJ systems can effectively provide autonomous learning opportunities through code evaluation and detailed feedback on program errors [9, 10, 12]. Let \(P = \{p_1, p_2, p_3, \ldots, p_n\}\) be the set of problems related to various algorithms and \(V = \{v_1, v_2, v_3, \ldots, v_m\}\) be the set of verdicts. For each problem in P, there are many solutions \(S = \{s_1, s_2, s_3, \ldots, s_w\}\), and each solution receives a verdict in V together with evaluation values such as CPU time. Typically, OJ systems provide decisions or verdicts depending on the errors in and acceptance of the codes. Each error decision gives a specific reason for an error in the code. For example, error decisions such as memory limit exceeded (MLE), time limit exceeded (TLE), and runtime error (RE) are made when the performance of the algorithm is not sufficient for solving the corresponding problem. In contrast, the wrong answer (WA) decision is made when the code contains logical errors. Thus, large collections of real-world OJ data (solution codes with verdicts and performance logs for problem sets) can be a valuable resource for AI-for-coding tasks [36,37,38].

Despite the remarkable results of DNN models in programming tasks, the structural (or algorithmic) features of the code have not been adequately discussed. However, knowing the algorithms used in a program code is important from both an educational and a software development perspective to better understand the code. Therefore, the classification of program code based on its structural features remains an open problem. To address this research gap, we propose a CNN-based program code classification model that can be applied to both programming education and software development. The proposed model classifies program codes by identifying the algorithms contained in the codes. In addition, this study presents a new data preprocessing approach for program codes. The preprocessing involves several steps: (i) user-defined properties/tokens of the program codes, such as functions, classes, keywords, and variables, are filtered out; (ii) structural features (SFs) such as if, else, loops, mathematical operators, bitwise operators, and assignment operators are retained; and (iii) the SFs of each program code are converted into a one-hot binary matrix (OBM). We have collected two different datasets of real-world program codes covering various algorithms for model training, validation, and evaluation. Three CNN models with different structures and hyperparameters are developed, trained, and evaluated to select the best model for program code classification, and the best CNN model is applied to the classification task based on the experimental results. The contributions of this research are as follows:

  • The proposed CNN model can identify the algorithm used in the program code and classify the code based on the identified algorithm.

  • We present a novel strategy for program code preprocessing. SFs are extracted from program codes and converted into OBMs for model training. SFs help the model better understand the algorithmic properties of the codes.

  • The average precision, recall, and F-measure values of the proposed model are 95.65%, 95.85% and 95.70%, respectively, which outperform the values obtained by other referenced models.

  • The proposed classification model and its novel data preprocessing approach can be useful for various educational and industrial applications.

The remainder of this paper is structured as follows. Section 2 presents the background and related works. Section 3 describes the proposed approach, and Section 4 presents the experimental results and evaluations. Section 5 discusses the results in detail, and finally, Section 6 concludes this study with suggestions for future work.

2 Background and related works

This section presents prior studies related to programming education and its challenges, ML in software development practices and its challenges, code evaluation and repair, and code classification.

2.1 Programming education and its challenges

Research in programming education has gained momentum worldwide, and learning programming in higher education has been recognized as significantly important for the sustainable development of IT infrastructure [39]. A data-driven study [1] has shown that better programming skills have a positive impact on students’ academic performance. In [40], EDM has been performed to support programming learning based on programming data. Sun et al. [41] proposed a model to evaluate students’ programming skills in terms of programming and test performance. Based on object-oriented programming tasks, the model observed the improvement of students’ programming skills, and the experimental results showed that test performance was positively correlated with programming performance. Qian et al. [42] conducted a comprehensive study to identify students’ misconceptions and difficulties in introductory programming courses. Students most often struggle with misconceptions related to conceptual, syntactic, and strategic knowledge. The challenges faced by students depend on many factors, including unfamiliar language syntax, programming environments, incorrect concepts and strategies, and instructor competence. Medeiros et al. [2] categorized the challenges in introductory programming and the essential issues for learning programming and teaching in higher education. In addition, the study [43] identified significant challenges such as writing, debugging, conceptualizing, and tracing code, and presented pedagogical teaching/learning techniques and useful learning tools to overcome these challenges. Meanwhile, due to rapid social and technological changes, many interesting and convenient tools are available, which sometimes have a negative impact on programming learning and students’ motivation [39].

2.2 Machine learning in software development practices and its challenges

Recently, ML has been gaining attention as a method for developing various software systems, such as speech recognition, computer vision, natural language processing (NLP), robot control, and other application domains. ML capabilities can be integrated into a software system in many ways, including ML components, tools, libraries (covering ML functionalities), and frameworks [14]. A widespread trend has emerged in which ML-enabled systems are developed and deployed quickly and inexpensively, but their long-term maintenance is not cost-effective [44]. Wan and collaborators investigated the differences in software development practices between ML and non-ML systems [14]. Moreover, common practices and workflows for building large-scale ML applications, systems, and platforms at Microsoft, Amazon, and Google have been presented in [6, 45,46,47]. Additionally, various tools have been proposed to test and debug ML-based applications and systems [48,49,50,51]. Despite these efforts, the standardization and operationalization of reliable ML systems remain necessary. Based on real-world ML-enabled software development practices [6], 11 challenges have been identified, ranging from data collection to model evolution, evaluation, and deployment. In this context, our proposed classification model can be a supporting component in building large-scale ML-based applications and systems that deal with SFs.

2.3 Program code evaluation and repairing

Recently, researchers have made continuous efforts to achieve significant results in this area. Programming languages differ considerably from natural languages because program codes contain a large amount of complex structural information, which makes conventional NLP models inadequate for them. Therefore, a tree-based CNN model for programming code processing was proposed in [52]. Rahman et al. [8] presented a model for source code evaluation using LSTM neural networks combined with an attention mechanism to capture the complex context of the code. During code evaluation, the model identified errors, including logic and syntax errors, with high accuracy.

In [53], a multi-modal attention network (MMAN) has been proposed to properly represent the SFs of source codes and improve the reasoning about which features have the most impact on the final results. The MMAN can represent both structured and non-structured features of source codes, using a tree-LSTM for the abstract syntax tree (AST) and a gated graph neural network (GNN) for the control flow graph. In another study [7], an LSTM model has been developed to identify source code errors in C programs. In this model, characters, variables, keywords, tokens, numbers, functions, and classes are encoded with predefined IDs, and the model detected errors in faulty solution codes with high accuracy. Terada et al. [29] presented a model, built with an LSTM network, that predicts the next code sequence to complete partially written code. The model effectively predicted the correct words to complete the code and can help novice programmers who have difficulty writing complete code from scratch. In addition, code evaluation, completion, and repair tasks have been performed using an LSTM neural network at different levels of programming learning [31, 32].

2.4 Program code classification

A program code classification model is essential for a better understanding of the code, and researchers have proposed various approaches for program code classification. In the early stages of code classification and prediction, NLP models were applied to source code to perform various prediction tasks [54,55,56]. A GNN model [57] was proposed for classifying students’ program codes that integrates the AST and data flow to improve performance; it classifies student program code with an accuracy of 97%. Fan et al. [28] proposed a method for classifying defective source codes using RNNs with attention mechanisms. Two evaluation indicators, the area under the curve (AUC) and the F1-measure, were used, and the method improved them by about 7% and 14%, respectively, compared with other benchmark models.

Furthermore, many models have been proposed for classifying program codes by programming language. Ugurel et al. [58] performed two types of classification using SVM: first, classification of programming languages and, second, classification of different categories of programs (e.g., databases, multimedia, and graphics). Tian et al. [59] used a Latent Dirichlet mapping method to identify the programming language of source code based on its words. Alreshedy et al. [60] presented an ML-based model that classifies source code snippets by programming language using a multinomial Naive Bayes (MNB) classifier, with Stack Overflow posts as experimental data; this classification method used features such as comments, variables, and functions instead of syntactic information. Reyes et al. [61] presented an LSTM model that classifies archived source codes by programming language; empirical results show that it performed better than Naive Bayes and linguistic classifiers. Gilda [62] used a CNN model to identify programming languages from source code snippets.

In [63], classification based on code tags has been performed using three classification methods: SVM, random forest, and AdaBoost. In [64], a decision tree-based classification method has been used to classify source codes related to sorting algorithms. LeClair et al. [65] mentioned that source code can be classified into six categories: games, admin, network, words, science, and usage. Xu et al. [66] used LSTM and CNN to identify vulnerabilities in source code. In addition, a CNN-based classification model was used to classify code based on the algorithms used.

In brief, numerous promising methods have been proposed and experimented with in various studies. Researchers have used traditional unsupervised and supervised classifiers, and CNN and LSTM have been employed as language models for source code-related research and applications. However, the relative merits of these methods are difficult to compare directly. The proposed code classification model differs from other models in its novel data preprocessing and model selection approach for the CNN. In this study, three CNN models with different structures and hyperparameters are trained, validated, and evaluated, and the best CNN model is selected for the classification task based on the results.

3 Proposed approach

Programmers implement algorithms to write efficient code; however, implementing algorithms in code is not a trivial task. This research aims to identify the algorithm contained in a program code and classify the code based on the identified algorithm. We have used real-world solution codes for different algorithms from programming competitions and academic courses. A crucial step is data preprocessing for model training and evaluation, in which SFs are extracted from the codes and all user-defined elements (e.g., variables, classes, and functions) are excluded. These SFs help the DNN model better understand the algorithm’s flow. CNN-based classification models with various structures and hyperparameters are developed for classifying the codes. Although CNN models are widely used in computer vision research, they have recently achieved significant success in various programming-related tasks (classification, error detection, prediction, and language modeling) [67, 68]. The proposed approach includes several phases, from data acquisition to model training and evaluation: (i) data acquisition and categorization, (ii) data preprocessing, (iii) CNN model training, and (iv) program code classification with the optimal CNN model. The basic framework of our proposed approach is shown in Fig. 1 and is explained in detail in the following sections.

Fig. 1
figure 1

Framework of the proposed approach

3.1 Data collection and categorization

Selecting relevant datasets from a real-world data repository is essential in research. In this study, real-world program codes are collected from the Aizu Online Judge (AOJ) system [69, 70]. All program codes are written in the C++ programming language. AOJ is a platform that hosts various academic programming activities and programming competitions. As of February 2022, AOJ has over 3,000 programming problems and 100,000 users, and it organizes programming problems by category and algorithm. The AOJ system has archived more than 6 million solution codes and submission logs, creating research opportunities for SE and programming education. For example, IBM and MIT have used solution codes from AOJ for their CodeNet project [36, 71].

In this study, all program codes are divided into two separate datasets: A and B. For Dataset A, we considered categories that cover a large number of algorithms in computer science and engineering, such as computational geometry problems (CGP), number theory problems (NTP), flow network problems (FNP), shortest path problems (SPP), query for data structures problems (QDSP), and combinatorial optimization problems (COP), as shown in Table 1. These categories include basic algorithms from graph theory, geometry, numerical analysis, puzzles, numbers, search, computational theory, networks, advanced mathematics, and advanced data structures and algorithms. All program codes of each category in Dataset A are collected from the problems of programming competitions in AOJ.

Table 1 Algorithms and number of codes for dataset A

As shown in Table 2, all program codes related to sorting, such as counting sort, bubble sort, insertion sort, merge sort, selection sort, shell sort, and quick sort, are contained in Dataset B. In addition, essential features of the sorting algorithms, such as their complexity and method, are presented.

Table 2 Feature attributes and number of program codes for dataset B

3.2 Data preprocessing

To achieve better results from DNN models, effective input representations play a vital role, and it is essential to create a suitable input shape that represents the actual features of the original data. Program code has a far more complex representation than natural language. Therefore, we extracted suitable features from the program codes so that the model can be trained effectively. The workflow of program code preprocessing is shown in Fig. 2.

Fig. 2
figure 2

Workflow of program code preprocessing

Only structural properties are extracted from the code for tokenization during program code transformation. Program code usually consists of operators, operands, loops, branches, keywords, methods, and classes; therefore, these key attributes of the program code are extracted. In contrast, user-defined elements with little impact, such as comments, variables, classes, and functions, are not considered. A list of feature tokens (T) and their corresponding IDs is shown in Table 3. Initially, SFs are extracted from the program codes according to Algorithm 1. The steps of program code preprocessing are described in the following subsections.

Algorithm 1
figure e

Extraction of structural features from code.

Table 3 List of feature tokens and their IDs

3.2.1 Comments deletion

All comments in the program code are identified and removed with the removeComments() function, because comments are not significant for classification, as shown in Fig. 3.
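The paper gives removeComments() only as a step in Fig. 3, so the following is a minimal sketch of what such a function could look like for C++ sources, assuming a regex-based approach that does not special-case comment markers inside string literals:

```python
import re

def remove_comments(code: str) -> str:
    """Strip // line comments and /* ... */ block comments from C++ source.

    Simplified sketch: comment markers inside string literals are not
    handled, which the authors' removeComments() may treat differently.
    """
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.DOTALL)  # block comments
    code = re.sub(r"//[^\n]*", "", code)                    # line comments
    return code

print(remove_comments("int n = 0; // counter\n/* temp */ n++;"))
```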

Fig. 3
figure 3

Identification and removal of unnecessary comments from the code

3.2.2 Extraction of feature tokens

After removing comments from the code, feature tokens such as if, else, loops, mathematical operators, bitwise operators, assignment operators, compound assignment operators, comparison operators, braces, parentheses, and square brackets are selected. Typically, in C++ programming, parentheses are used for function calls and declarations, conditional and loop statements (if, while, do), and operator precedence. Braces delimit the bodies of functions, classes, structs, if statements, and loops, and square brackets are used to access arrays. With this definition, all the feature tokens in the program code are selected for extraction using the extractSelectedFeatures() function, as shown in Fig. 4.

Fig. 4
figure 4

Extraction of selected feature tokens

In addition, irrelevant tokens such as function and variable names are identified and removed from the code, i.e., all variables and functions arbitrarily defined by the programmer. The names of variables and functions vary depending on the programmer’s definitions, so a single program can contain many different variable and function names. Also, C++ is a statically typed programming language in which variable types must be explicitly specified in the code, whereas Ruby and Python are dynamically typed languages. Therefore, all user-defined variable and function names are removed from the code so that the DNN model can better understand the context of the code, as illustrated by the sketch below.
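As an illustration of this step, the sketch below keeps only structural tokens (keywords, operators, brackets) and drops identifiers; the regular expression and the exact token set are our own simplification of Algorithm 1 and Table 3:

```python
import re

# Structural feature tokens retained by the preprocessing (Section 3.2.2).
# The token set is a simplified approximation of Table 3.
KEYWORDS = {"if", "else", "for", "while", "do", "switch", "case"}
OPERATORS = [
    "<<=", ">>=", "==", "!=", "<=", ">=", "&&", "||", "+=", "-=",
    "*=", "/=", "%=", "&=", "|=", "^=", "++", "--", "<<", ">>",
    "+", "-", "*", "/", "%", "&", "|", "^", "=", "<", ">",
    "(", ")", "{", "}", "[", "]",
]
_TOKEN_RE = re.compile(
    "|".join([r"\b(?:%s)\b" % "|".join(KEYWORDS)]
             + [re.escape(op) for op in OPERATORS]))

def extract_selected_features(code: str) -> list:
    """Return only the structural tokens of a comment-free C++ code."""
    return _TOKEN_RE.findall(code)

print(extract_selected_features("for (int i = 0; i < n; i++) s += a[i];"))
# ['for', '(', '=', '<', '++', ')', '+=', '[', ']']
```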

3.2.3 Tokenization of the features

All the feature tokens are extracted from the code, as shown in Fig. 5(a). Next, the extracted feature tokens are converted into token IDs according to Table 3. This process is called tokenization or encoding. In this research, the tokenization/encoding process represents each SF of the code as a token, and all these tokens are mapped to numeric values to feed the DNN models. When training DNN models, a sequence of tokens is converted into a sequence of numerical vectors, which are then processed by the neural network. DNN models neither know SFs such as { + & = [ ] } nor understand the semantic or algorithmic features of the code; therefore, tokenization/encoding is an essential step for learning the neural network model from scratch. For example, we defined token IDs from 0 to 16 for the different features of the code (Table 3). When features such as { + & = [ ] } are extracted from the code, they are converted to the numeric sequence 13 1 2 0 15 16 14 after tokenization/encoding. Once the tokenization process is completed, the code contains only a list of IDs, as shown in Fig. 5(b).
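A minimal sketch of this mapping is shown below. Only the IDs that appear in the example above ('=' → 0, '+' → 1, '&' → 2, '{' → 13, '}' → 14, '[' → 15, ']' → 16) are taken from the text; the remaining entries are illustrative placeholders for the other rows of Table 3:

```python
# Partial token-to-ID map; entries marked as placeholders are assumptions.
TOKEN_IDS = {
    "=": 0, "+": 1, "&": 2,                      # from the example in the text
    "if": 3, "else": 4, "for": 5, "while": 6,    # placeholders
    "<": 7, ">": 8, "==": 9, "+=": 10,           # placeholders
    "(": 11, ")": 12,                            # placeholders
    "{": 13, "}": 14, "[": 15, "]": 16,          # from the example in the text
}

def tokenize(features):
    """Convert a sequence of structural feature tokens into token IDs."""
    return [TOKEN_IDS[tok] for tok in features if tok in TOKEN_IDS]

print(tokenize(["{", "+", "&", "=", "[", "]", "}"]))  # [13, 1, 2, 0, 15, 16, 14]
```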

Fig. 5
figure 5

(a) A program code with featured tokens, and (b) Tokenization with IDs

3.2.4 One-hot binary matrix conversion from token IDs

After the tokenization process, the IDs are assigned to the corresponding feature tokens. A sequence of token IDs is converted into a \(\mathcal{P} \times \mathcal{Q}\) matrix, where \(\mathcal{P}\) is the number of token IDs and \(\mathcal{Q}\) is the highest token ID value plus one. According to the definition of the token IDs, the maximum value of a token is 16, so \(\mathcal{Q}\) (the number of matrix columns) is 17. Finally, the token IDs are converted into an OBM of \(\mathcal{P}\) rows according to Algorithm 2, which uses (1) to construct the OBM. The conversion of token IDs to an OBM is shown in Fig. 6.

$$ \mathcal{M}_{\mathcal{P}, \mathcal{Q}} = \left\{\begin{array}{ll} 1, & \text{if } \mathcal{Q}=\mathcal{S}_{\mathcal{P}} + 1\\ 0, & \text{otherwise} \end{array}\right. $$
(1)

where \(\mathcal{S}_{\mathcal{P}}\) is the token ID at the \(\mathcal{P}\)-th iteration and \(\mathcal{Q}\) is the column index.

Fig. 6
figure 6

OBM conversion process

In Algorithm 2, line 5 first takes the entire tokenized solution code (e.g., Fig. 5(b)), line 6 then processes the individual tokens for OBM conversion, and finally lines 7-14 are applied repeatedly to construct the OBM until the tokens of the code are exhausted.

Algorithm 2
figure f

OBM conversion from tokenized structural feature (TF).
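A compact NumPy sketch of the conversion in Algorithm 2 and (1) is shown below; the only assumption is that the 1-based column index of (1) is shifted to a 0-based index:

```python
import numpy as np

NUM_COLUMNS = 17  # highest token ID (16) + 1, per Section 3.2.4

def to_one_hot_matrix(token_ids):
    """Convert a token-ID sequence into a P x Q one-hot binary matrix (OBM).

    Row p holds a 1 in column token_ids[p] and 0 elsewhere, i.e.,
    equation (1) with 0-based column indices.
    """
    obm = np.zeros((len(token_ids), NUM_COLUMNS), dtype=np.float32)
    obm[np.arange(len(token_ids)), token_ids] = 1.0
    return obm

print(to_one_hot_matrix([13, 1, 2, 0, 15, 16, 14]).shape)  # (7, 17)
```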

3.2.5 Padding

The final step of program code preprocessing is padding, which is essential for training a DNN model with batches. To train a DNN model, all input sequences in a batch must have the same length. Therefore, random tokens are added to the beginning (pre-padding) and end (post-padding) of the input sequences to make them the same length. Using random rather than constant tokens for padding also helps to avoid overfitting.
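The sketch below pads a batch of OBMs with random one-hot rows at both ends; the split between pre- and post-padding and the random-row construction are our assumptions, since the paper only states that random tokens are added at both ends:

```python
import numpy as np

def pad_batch(matrices, rng=None):
    """Pad a batch of OBMs to a common length with random one-hot rows."""
    rng = np.random.default_rng(0) if rng is None else rng
    target = max(m.shape[0] for m in matrices)
    q = matrices[0].shape[1]
    padded = []
    for m in matrices:
        deficit = target - m.shape[0]
        pre = deficit // 2                      # pre-padding (assumed split)
        rand = np.eye(q, dtype=np.float32)[rng.integers(0, q, deficit)]
        padded.append(np.concatenate([rand[:pre], m, rand[pre:]], axis=0))
    return np.stack(padded)

batch = pad_batch([np.eye(17, dtype=np.float32)[:5],
                   np.eye(17, dtype=np.float32)[:8]])
print(batch.shape)  # (2, 8, 17)
```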

3.3 Architecture of the CNN model

CNNs have become an effective deep learning technique for solving complex tasks in various domains in recent years, and their use has increased significantly in many fields of computer science and engineering [22, 68, 72]. The architecture of a CNN model is illustrated in Fig. 7. The architecture includes convolutional layers (CLs) of different sizes, activation functions (AFs), max-pooling layers, fully connected layers (FCLs), dropout layers, and a softmax function for the classification task. The OBM is fed to the different-sized CLs via a dropout layer. Each CL learns code features from the input sequences, and the output of each CL is passed to an AF (e.g., ReLU/LeakyReLU). The ReLU and LeakyReLU AFs are expressed by (2) and (3), respectively.

$$ \text{ReLU: } f(z)=\max(0,z) $$
(2)
$$ \text{LeakyReLU: } f(z)=\max(\alpha z, z) $$
(3)

where z is the input and \(\alpha\) is a small slope value (e.g., 0.01).

Fig. 7
figure 7

A sample network architecture of the two-parallel convolutional layered CNN model

A max-pooling layer is added after each CL. The max-pooling layer extracts the maximum value from each activation/feature map generated by the convolutional filters/kernels; in this manner, important information is preserved while the size of the feature map is reduced. The outputs of the different max-pooling layers are then concatenated, and the pooled result is passed to an FCL via a dropout layer. The FCL learns combinations of filter responses that are highly correlated with each algorithm category. Finally, the output of the FCL is converted into probabilities via the softmax layer according to (4), where the probability \(Y_k\) is calculated from \(a_k\), the output of the FCL for category k. The loss function \(\mathcal{L}\) is calculated by (5) using the predicted value \(Y_k\) and the actual value \(t_k\).

$$ Y_{k} =\frac{\exp(a_{k})}{\sum_{j=1}^{N} \exp(a_{j})} $$
(4)
$$ \mathcal{L}=-\sum_{k} t_{k} \log(Y_{k}) $$
(5)

Dropout layers are placed before the CLs and FCLs to avoid overfitting. The initial dropout layer randomly zeroes out entire columns, while the other dropout layers randomly zero out individual inputs. Finally, the softmax layer classifies program codes based on probability: for a given code, it outputs a probability for each category, the probabilities of all categories sum to 1, and the category with the highest probability is selected.
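For concreteness, the following PyTorch sketch mirrors the two-parallel architecture of Fig. 7. The channel counts, hidden size, and global max-pooling are our assumptions, since the text does not fully specify them; the softmax of (4) and the loss of (5) are applied together through CrossEntropyLoss:

```python
import torch
import torch.nn as nn

class ParallelConvCodeClassifier(nn.Module):
    """Sketch of the two-parallel-convolution model of Fig. 7 (assumptions:
    full-width kernels of heights 32 and 64, 64 channels each, and global
    max-pooling over the sequence dimension)."""

    def __init__(self, num_tokens=17, num_classes=6,
                 kernel_heights=(32, 64), channels=64, hidden=128):
        super().__init__()
        self.input_dropout = nn.Dropout(0.1)   # 10% dropout on the input OBM
        self.convs = nn.ModuleList(
            [nn.Conv2d(1, channels, kernel_size=(k, num_tokens))
             for k in kernel_heights])
        self.act = nn.LeakyReLU(0.01)
        self.fc = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(channels * len(kernel_heights), hidden),
            nn.LeakyReLU(0.01),
            nn.Dropout(0.5),
            nn.Linear(hidden, num_classes))     # softmax applied by the loss

    def forward(self, obm):
        # obm: (batch, seq_len, num_tokens) padded one-hot binary matrices
        x = self.input_dropout(obm).unsqueeze(1)              # (B, 1, P, Q)
        pooled = [self.act(conv(x)).amax(dim=2).squeeze(-1)   # (B, channels)
                  for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))

model = ParallelConvCodeClassifier()
logits = model(torch.zeros(4, 256, 17))                       # 4 padded OBMs
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1, 2, 3]))
```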

3.4 Hyperparameters

Different architectures and hyperparameters of the CNN model are tuned to select the best/optimal model for program code classification. We used filters/kernels of different sizes, namely 16 × 17, 32 × 17, and 64 × 17, in the CLs, and batch sizes (\(\mathcal{BS}\)) of 16, 32, and 64 for the input sequences. The horizontal length of the convolutional filters and the OBM is always equal, i.e., 17. The output length of all convolutional layers is 64, and thus the length of the training sequence is also 64. The large convolutional filter lengths allow the CLs to learn the characteristics of entire code blocks. The hyperparameters are listed in Table 4.
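As a rough illustration of the search space, the {AF, BS, LR} combinations explored per architecture can be enumerated as follows; Table 4 lists additional hyperparameters not reproduced in this sketch:

```python
from itertools import product

activations = ["ReLU", "LeakyReLU"]
batch_sizes = [16, 32, 64]
learning_rates = [0.01, 0.001, 0.0001]

configs = list(product(activations, batch_sizes, learning_rates))
print(len(configs))  # 18 {AF, BS, LR} combinations per CNN architecture
```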

Table 4 List of hyperparameters

4 Experimental results

4.1 Overview

In this section, we present the target models and experimental steps, dataset preparation, evaluation metrics, and experimental environment. In this paper, we conducted the experiments in two phases. In the first phase, experiments are conducted using different architectures of CNN models. Based on the performance of the CNN models, the best model is selected for further experiments. In the second phase, experiments are conducted with the best CNN model and two other baseline models (i.e., LSTM and BiLSTM). An overview of the experimental phases is shown in Fig. 8.

Fig. 8
figure 8

Overview of the experimental phases

4.1.1 Target models and experimental flow

In this paper, CNN, LSTM, and BiLSTM models are used for training, validation, and evaluation. To this end, comprehensive experiments are performed with the following main steps: (i) experiments with different CNN architectures are performed, (ii) training accuracy, training time, evaluation accuracy, and comparisons are reported, (iii) the best CNN model is selected, (iv) code classification with the best CNN model is performed, (v) code classification with the baseline models LSTM and BiLSTM is performed, and (vi) the overall classification performance is compared with the baseline models and other related studies.

4.1.2 Data preparation for experiments

The details of our datasets and their preprocessing procedures are presented in Sections 3.1 and 3.2. We have two datasets, A and B. Dataset A covers a wide variety of algorithms, including combinatorial, geometric, graph, and numerical algorithms, whereas Dataset B consists of codes related to sorting algorithms. In the experiments, 45,398 and 16,216 program codes are used for datasets A and B, respectively, and about 10% of the program codes of each dataset are randomly selected for evaluation. All program codes are written in the C++ programming language and have been accepted by the AOJ, which means that all codes are “correct” and efficient enough. Since dataset A has more program codes and more diversity than dataset B, dataset A is used for training and evaluation in the first phase of the experiments. Next, the best CNN model is selected based on its performance, and both datasets A and B are used for evaluation. In the second phase, experiments are performed on dataset A and the best CNN, LSTM, and BiLSTM models are compared.

4.1.3 Evaluation metrics

To evaluate model performance, precision (\(\mathcal{P}_{o}\)), recall (\(\mathcal{R}_{o}\)), F-measure (\(\mathcal{F}_{o}\)), and accuracy (\(\mathcal{A}_{o}\)) are calculated as defined in (6), (7), (8), and (9). A larger \(\mathcal{P}_{o}\) value indicates higher credibility of the classification results for a particular category; in other words, \(\mathcal{P}_{o}\) indicates the accuracy of the classification predictions. \(\mathcal{R}_{o}\) measures the proportion of program codes of a category that are correctly classified into that category, and \(\mathcal{F}_{o}\) is the harmonic mean of \(\mathcal{R}_{o}\) and \(\mathcal{P}_{o}\).

$$ \mathcal{P}_{o}=\frac{TP}{TP+FP} $$
(6)
$$ \mathcal{R}_{o}=\frac{TP}{TP+FN} $$
(7)
$$ \mathcal{F}_{o}=2 \times \frac{\mathcal{P}_{o} \times \mathcal{R}_{o}}{\mathcal{P}_{o} + \mathcal{R}_{o}} $$
(8)
$$ \mathcal{A}_{o}=\frac{G}{N} $$
(9)

where TP, FP, and FN denote true positives, false positives, and false negatives, respectively; G is the number of correctly classified program codes; and N is the total number of program codes.
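These per-category scores and their averages can be reproduced with scikit-learn (assumed here purely for illustration); the labels below are hypothetical:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical true and predicted categories for a handful of codes.
y_true = ["CGP", "NTP", "FNP", "SPP", "QDSP", "COP", "CGP", "NTP"]
y_pred = ["CGP", "NTP", "FNP", "SPP", "COP",  "COP", "CGP", "SPP"]

p, r, f, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)  # (6), (7), (8)
acc = accuracy_score(y_true, y_pred)                   # (9): G / N
print(f"P={p:.3f} R={r:.3f} F={f:.3f} A={acc:.3f}")
```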

4.1.4 Implementation details

All experiments are executed with the PyTorch framework on two NVIDIA GeForce GTX 1080 GPUs with 32 GB of memory. The hyperparameters used in the experiments are detailed in Section 3.4.

4.2 Performance of different CNN models

Instead of using a basic CNN model, we conducted experiments by varying the network architectures and hyperparameters of the CNN to investigate the performance. The optimal CNN model is selected based on the learning accuracy and classification accuracy. For this purpose, three different architectures are developed and used in the experiments: (i) single convolutional layer CNN (CNN-Arch-I) model, (ii) two parallel convolutional layer CNN (CNN-Arch-II) model, and (iii) three parallel convolutional layer CNN (CNN-Arch-III) model. In each CL, three different sizes of filters (e.g., 16, 32, and 64) and two AFs (e.g., ReLU and LeakyReLU) are used separately. In addition, three different BSs (e.g., 16, 32, and 64) and LRs (e.g., 0.01, 0.001, and 0.0001) are used. The network architectures of all three models are shown below.

  (i) CNN-Arch-I Model: Input (OBM) → dropout (10%) → single Conv. Layer [Filter/Kernel Size (64) + ReLU/LeakyReLU + MaxPool] → dropout (50%) → FCL → dropout (50%) → FCL → Softmax Layer

  (ii) CNN-Arch-II Model: Input (OBM) → dropout (10%) → two-parallel Conv. Layers [Filter/Kernel Sizes (32, 64) + ReLU/LeakyReLU + MaxPool] → dropout (50%) → FCL → dropout (50%) → FCL → Softmax Layer

  (iii) CNN-Arch-III Model: Input (OBM) → dropout (10%) → three-parallel Conv. Layers [Filter/Kernel Sizes (16, 32, 64) + ReLU/LeakyReLU + MaxPool] → dropout (50%) → FCL → dropout (50%) → FCL → Softmax Layer

Experimental results of training accuracy and time, classification accuracy, comparisons between training, validation, and evaluation scores, and accuracy with 10-fold cross-validation of the models are presented below.

4.2.1 Training accuracy and time of the CNN models

To investigate the training accuracy, Dataset A is used first for model training because it contains program codes from a wider variety of algorithms than dataset B. Training accuracy is calculated at the 100th epoch of the 1st round of 10-fold cross-validation. Figures 9, 10, and 11 show the training accuracy of the CNN-Arch-I, CNN-Arch-II, and CNN-Arch-III models, respectively. The results reveal that (i) the CNN-Arch-II and CNN-Arch-III models achieved training accuracies of more than 96% when the AF was ReLU/LeakyReLU, the \(\mathcal{LR}\) was 0.001, and the \(\mathcal{BS}\) was 16, 32, or 64, as shown in Figs. 10 and 11; (ii) in contrast, the CNN-Arch-I model achieved a training accuracy of approximately 92%, as shown in Fig. 9, which is comparatively lower than that of the other two models.

Fig. 9
figure 9

Training accuracy of CNN-Arch-I model

Fig. 10
figure 10

Training accuracy of CNN-Arch-II model

Fig. 11
figure 11

Training accuracy of CNN-Arch-III model

The training time of each model with different hyperparameter combinations, i.e., {\(\mathcal{AF}\), \(\mathcal{BS}\), \(\mathcal{LR}\)}, is also compared. Figures 12, 13, and 14 compare the training times of all models, where the X-axis represents \(\mathcal{BS}\) and the Y-axis represents the time (in seconds) required to train the model. Several observations can be derived from Figs. 12, 13, and 14: (i) the CNN-Arch-I model required an average of 26172.77 s (7.27 h) of training per hyperparameter combination, which is less than that required by the other two models, whereas the CNN-Arch-II and CNN-Arch-III models required an average of 34772.77 s (9.66 h) and 37595.05 s (10.44 h) per hyperparameter combination, respectively; (ii) the training time increased linearly with the number of CLs, regardless of \(\mathcal{BS}\) and \(\mathcal{LR}\); (iii) all models required relatively more training time when the LeakyReLU AF was used instead of ReLU. This is because ReLU zeroes all negative neuron outputs, which causes many neurons to become inactive and output only 0, a phenomenon known as the dying ReLU/dead neuron problem [73]; LeakyReLU alleviates this problem by using a small slope value (e.g., α = 0.01).

Fig. 12
figure 12

Training time of CNN-Arch-I model

Fig. 13
figure 13

Training time of CNN-Arch-II model

Fig. 14
figure 14

Training time of CNN-Arch-III model

4.2.2 Classification accuracy of the models

In this part of the experiment, we present the initial classification accuracies of all three models with different hyperparameters. All results are calculated on dataset A at the 100th epoch of the 1st round of 10-fold cross-validation. The top-3 models and their corresponding hyperparameters are selected for further experiments based on the classification results. Tables 5, 6, and 7 show the classification results of the CNN-Arch-I, CNN-Arch-II, and CNN-Arch-III models, respectively. For clarity, a hyperparameter set is a combination of values of \(\mathcal{AF}\), \(\mathcal{BS}\), and \(\mathcal{LR}\), e.g., {ReLU, 16, 0.001}. The CNN-Arch-I model achieved its highest \(\mathcal{F}_{o}\) score of 93.30% and \(\mathcal{A}_{o}\) of 93.10% with the hyperparameter set {LeakyReLU, 32, 0.0001}. Similarly, the CNN-Arch-II model obtained its highest \(\mathcal{F}_{o}\) score of 94.20% and \(\mathcal{A}_{o}\) of 93.90% with {LeakyReLU, 16, 0.0001}, and the CNN-Arch-III model achieved its highest \(\mathcal{F}_{o}\) score of 94.1% and \(\mathcal{A}_{o}\) of 93.9% with {LeakyReLU, 64, 0.0001}.

Table 5 Classification results of CNN-Arch-I model
Table 6 Classification results of CNN-Arch-II model
Table 7 Classification results of CNN-Arch-III model

In addition, we performed experiments with deeper models of four, five, and six convolutional layers using the hyperparameter set {LeakyReLU, 16, 0.0001}. The CNN model with four convolutional layers (8, 16, 32, 64) achieved \(\mathcal{P}_{o}\), \(\mathcal{R}_{o}\), and \(\mathcal{F}_{o}\) scores of 93.21%, 92.80%, and 92.65%, respectively, and an \(\mathcal{A}_{o}\) score of 92.42%; it required approximately 11 hours and 18 minutes to train. Similarly, the CNN model with five convolutional layers (8, 16, 32, 64, 128) achieved \(\mathcal{P}_{o}\), \(\mathcal{R}_{o}\), and \(\mathcal{F}_{o}\) scores of 94.03%, 92.10%, and 92.60%, respectively, and an \(\mathcal{A}_{o}\) score of 92.17%, requiring approximately 21 hours and 14 minutes to train. Furthermore, the CNN model with six convolutional layers (8, 16, 32, 64, 128, 256) achieved \(\mathcal{P}_{o}\), \(\mathcal{R}_{o}\), and \(\mathcal{F}_{o}\) scores of 93.31%, 90.93%, and 91.61%, respectively, and an \(\mathcal{A}_{o}\) score of 90.90%. These three deeper CNN models failed to achieve better classification scores and required much more training time than the CNN-Arch-III model. Depending on the combination of hyperparameters, the models yielded different accuracies, reaching classification accuracies of up to about 94%. Therefore, the classification scores help to identify the optimal hyperparameters for each model. Based on the classification \(\mathcal{A}_{o}\) and \(\mathcal{F}_{o}\) scores, the top-3 results are presented in Table 8.

Table 8 List of the top-3 results achieved with all three models and hyperparameters

4.2.3 Comparison between training, validation, and evaluation of the top-3 models

To evaluate the performance of the top-3 models, the training, validation, and evaluation curves, generated over 100,000 iterations, are compared in Fig. 15(a), (b), and (c), respectively. From these figures, the following observations can be made: (i) all models achieved a training accuracy of approximately 96% and validation and evaluation accuracies of approximately 94%; (ii) during the first 55,000 iterations, all three models showed more overfitting; (iii) all accuracies increase almost linearly up to 80,000 iterations and then become more stable.

Fig. 15
figure 15

Comparison between training, validation and evaluation accuracy of the top-3 models

As shown in Table 8, the top-3 models achieved similar \(\mathcal{F}_{o}\) and \(\mathcal{A}_{o}\) scores. Therefore, 10-fold cross-validation is performed with the top-3 models and their corresponding hyperparameters to select the best/optimal model. In each cross-validation round, different sets of training, validation, and test data are randomly selected to verify the effectiveness of the models. The accuracy comparison between the top-3 models for each validation step is shown in Fig. 16. In addition, the average cross-validation accuracy (ACV) is calculated for each model using (10).

$$ ACV=\frac{{\sum}_{i=1}^{H} A_{o_{i}}}{H} $$
(10)

where H is the number of cross-validation folds and \(A_{o_{i}}\) is the accuracy of the i-th fold.
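A minimal sketch of this procedure is given below; train_and_evaluate stands in for one full training and evaluation run of the selected model and is not part of the paper's code:

```python
import numpy as np
from sklearn.model_selection import KFold

def average_cross_validation_accuracy(dataset, train_and_evaluate, folds=10):
    """Equation (10): mean accuracy over H = `folds` cross-validation rounds."""
    accuracies = [train_and_evaluate(train_idx, test_idx)
                  for train_idx, test_idx in
                  KFold(n_splits=folds, shuffle=True).split(dataset)]
    return float(np.mean(accuracies))
```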

Fig. 16
figure 16

10-fold cross-validation accuracy of top-3 models

Figure 16 demonstrates that (i) the CNN-Arch-II and CNN-Arch-III models achieved higher accuracy than the CNN-Arch-I model in the 10-fold cross-validation, except in the 10th round; (ii) the CNN-Arch-III model achieved an ACV of 92.76%, which is higher than that of the other two models; and (iii) the ACV values of the CNN-Arch-I and CNN-Arch-II models are 91.56% and 92.69%, respectively. Considering the training, validation, evaluation, and classification results of all models, the CNN-Arch-III model achieved the best results. To further validate the superiority of the CNN-Arch-III model, we also performed additional experiments with deeper models of four, five, and six convolutional layers (see Section 4.2.2), whose results could not exceed the performance of the CNN-Arch-III model. Henceforth, all experiments are performed with the CNN-Arch-III model.

4.2.4 Effects of tuning hyperparameters

Instead of using a simple CNN architecture, we investigated the performance of CNN models with different architectures and hyperparameters on our dataset. Different sets of hyperparameters, namely \(\mathcal{CL}=\{\text{single}, \text{two-parallel}, \text{three-parallel}\}\), filter/kernel size \(\mathcal{KS}=\{16, 32, 64\}\), \(\mathcal{AF}=\{\text{ReLU}, \text{LeakyReLU}\}\), \(\mathcal{BS}=\{16, 32, 64\}\), and \(\mathcal{LR}=\{0.01, 0.001, 0.0001\}\), were used for the three CNN models in different combinations. The performance of each model strongly depends on the hyperparameter settings. For example, CNN-Arch-I has very low accuracy regardless of \(\mathcal{AF}\) or \(\mathcal{BS}\) when \(\mathcal{LR}\) is high (i.e., 0.01). Similarly, the performance of CNN-Arch-II and CNN-Arch-III varies with the hyperparameters. As shown in Figs. 9, 10, and 11, the CNN-Arch-III model achieved the highest training accuracy of approximately 96.7% at the 100th epoch when \(\mathcal{AF}\), \(\mathcal{BS}\), and \(\mathcal{LR}\) were LeakyReLU, 64, and 0.001, respectively. As shown in Figs. 12, 13, and 14, the training time increased for all models, regardless of \(\mathcal{LR}\) or \(\mathcal{AF}\), when \(\mathcal{BS}\) was set to 16. In addition, all models consumed approximately 0.60% more training time when LeakyReLU was used. Furthermore, in most cases, the models achieved better classification results, with \(\mathcal{F}_{o}\) scores up to 94.20% and \(\mathcal{A}_{o}\) up to 93.90%, when \(\mathcal{LR}\) was lowered to 0.0001 and \(\mathcal{AF}\) was LeakyReLU, as shown in Tables 5, 6, and 7. Thus, the hyperparameter settings have a significant impact on model performance.

4.3 Program code classification with the optimal CNN model

In this part of the experiment, the results of program code classification using the best model are presented. The best CNN model (CNN-Arch-III) is used for further experiments. Here, program code classification tasks are performed with datasets (A and B), and the corresponding results are presented.

4.3.1 Model performance with Dataset A

Dataset A contains a large number of program codes for various algorithms, such as tree, graph, geometry, computational theory, discrete mathematics, and data structure algorithms (see Table 1); therefore, it is more diverse than dataset B. During model training, \(\mathcal{P}_{o}\), \(\mathcal{R}_{o}\), and \(\mathcal{F}_{o}\) scores are calculated on the validation data for each category, as shown in Fig. 17. All learning curves are generated over 100,000 iterations. From Fig. 17, it is evident that (i) the model obtained relatively low \(\mathcal{P}_{o}\) values (approximately 90%) for the QDSP and COP categories, implying that a large number of false positives (FPs) occurred for both categories, whereas \(\mathcal{P}_{o}\) values of about 99% and 96% were obtained for the SPP and CGP categories; (ii) the model achieved an \(\mathcal{R}_{o}\) value of about 100% for the FNP category, whereas it scored the lowest \(\mathcal{R}_{o}\) values for the NTP and SPP categories, indicating that more false negatives (FNs) occurred; (iii) the \(\mathcal{F}_{o}\) value of each category gradually increases over successive iterations; and (iv) the model achieved the highest \(\mathcal{F}_{o}\) values of approximately 98% for the CGP and FNP categories, and the \(\mathcal{F}_{o}\) values of these two categories are more stable than those of the other categories.

Fig. 17
figure 17

Precision, Recall, and F-measure curves for each category of algorithm

The confusion matrices of \(\mathcal{P}_{o}\) and \(\mathcal{R}_{o}\) are calculated on the test data to examine how well the model learned the features of program codes of different algorithms. The numbers of test program codes for the COP, CGP, FNP, NTP, QDSP, and SPP categories are 61, 30, 22, 22, 31, and 35, respectively. The confusion matrices of \(\mathcal{P}_{o}\) and \(\mathcal{R}_{o}\) for program code classification are shown in Fig. 18(a) and (b). They indicate that the model achieves approximately 100% \(\mathcal{P}_{o}\) and \(\mathcal{R}_{o}\) values for the FNP and CGP categories, respectively.

Fig. 18
figure 18

Confusion matrix of precision (\(\mathcal {P}_{o}\)) and recall (\(\mathcal {R}_{o}\))
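Per-category precision and recall values like those in Fig. 18 can be read off a confusion matrix; the sketch below uses hypothetical labels rather than the 201 held-out codes listed above:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

CATEGORIES = ["COP", "CGP", "FNP", "NTP", "QDSP", "SPP"]

# Hypothetical true/predicted labels for illustration only.
y_true = ["COP", "CGP", "FNP", "NTP", "QDSP", "SPP", "COP", "NTP"]
y_pred = ["COP", "CGP", "FNP", "NTP", "COP",  "SPP", "COP", "SPP"]

cm = confusion_matrix(y_true, y_pred, labels=CATEGORIES)
recall = cm.diagonal() / cm.sum(axis=1)                     # per-row, eq. (7)
precision = cm.diagonal() / np.maximum(cm.sum(axis=0), 1)   # per-column, eq. (6)
print(dict(zip(CATEGORIES, np.round(precision, 2))))
```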

Table 9 shows the validation performance for each category of algorithms. The CNN-Arch-III model achieved average \(\mathcal{P}_{o}\), \(\mathcal{R}_{o}\), and \(\mathcal{F}_{o}\) scores of 95.50%, 94.80%, and 95.00%, respectively, on the validation data. In contrast, the model achieved an average \(\mathcal{P}_{o}\) score of 94.30%, \(\mathcal{R}_{o}\) score of 94.80%, and \(\mathcal{F}_{o}\) score of 94.50% during evaluation, as shown in Table 10. In this part of the experiment, per-category precision and recall as well as validation and evaluation scores are thus computed for dataset A using the CNN-Arch-III model. Given the diversity of dataset A, the overall classification results achieved with the best model are significant.

Table 9 Validation scores for each category of algorithm
Table 10 Evaluation scores for each category of algorithm

4.3.2 Model performance with Dataset B

Dataset B is also used for training, validation, and evaluation of the CNN-Arch-III model, in the same way as Dataset A. The program codes of Dataset B implement sorting algorithms; the purpose of all sorting algorithms is the same, but the way they are implemented in the codes differs. The SFs of the program codes of all sorting algorithms are used for model training, allowing the model to learn the actual features of the sorting algorithms rather than the surface form of the codes. For the evaluation, the average \(\mathcal{P}_{o}\), \(\mathcal{R}_{o}\), and \(\mathcal{F}_{o}\) are calculated for each category of sorting algorithm, as shown in Table 11. The model obtained average \(\mathcal{P}_{o}\), \(\mathcal{R}_{o}\), and \(\mathcal{F}_{o}\) scores of 97.00%, 96.90%, and 96.90%, respectively.

Table 11 Evaluation scores for each category of the sorting algorithm (Dataset B)

Comparing the performance of the model on datasets A and B, the model achieved a higher \(\mathcal{F}_{o}\) score for Dataset B (96.90%) than for Dataset A (94.50%). This is because the program codes and algorithms in Dataset A are more diverse than those in Dataset B, so the model could process and learn the features of the sorting algorithms in Dataset B more easily.

4.4 Program code classification with the LSTM and BiLSTM models

To put the classification performance of the proposed model in context, experiments with the baseline models LSTM and BiLSTM are performed on the same dataset. All results are computed at the 100th epoch; for each combination of hyperparameters, the LSTM model required an average of 3480.16 s of training. The LSTM model achieved its highest \(\mathcal{F}_{o}\) score of 82.02% and \(\mathcal{A}_{o}\) score of 83.10% with the hyperparameter set {LeakyReLU, 16, 0.001}, as shown in Table 12. Furthermore, the experimental results show that the LSTM model fails to yield significant \(\mathcal{A}_{o}\) and \(\mathcal{F}_{o}\) scores when the ReLU AF is used with any combination of \(\mathcal{BS}\) and \(\mathcal{LR}\).

Table 12 Classification results of the LSTM model

On the other hand, the BiLSTM model achieved its highest \(\mathcal{F}_{o}\) score of 84.14% and \(\mathcal{A}_{o}\) score of 84.64% with the hyperparameter set {LeakyReLU, 32, 0.001}, as shown in Table 13. For each combination of hyperparameters, this model took an average of 6084.83 s to train. In addition, the BiLSTM model achieved relatively good \(\mathcal{A}_{o}\) and \(\mathcal{F}_{o}\) scores for every combination of hyperparameters. Overall, the BiLSTM model achieved better results than the LSTM model because it processes code sequences in both directions (forward and backward).

Table 13 Classification results of the BiLSTM model
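For reference, the BiLSTM baseline can be sketched as follows; the hidden size, number of layers, and use of the final hidden states are our assumptions, since the paper only reports the shared {AF, BS, LR} grid for the baselines:

```python
import torch
import torch.nn as nn

class BiLSTMCodeClassifier(nn.Module):
    """Sketch of the BiLSTM baseline over OBM input sequences."""

    def __init__(self, num_tokens=17, hidden=128, num_classes=6):
        super().__init__()
        self.lstm = nn.LSTM(input_size=num_tokens, hidden_size=hidden,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, obm):
        # obm: (batch, seq_len, num_tokens); concatenate the last hidden
        # states of the forward and backward directions.
        _, (h_n, _) = self.lstm(obm)
        return self.fc(torch.cat([h_n[-2], h_n[-1]], dim=1))

logits = BiLSTMCodeClassifier()(torch.zeros(4, 256, 17))  # -> shape (4, 6)
```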

4.5 Comparison with baseline models

To validate the effectiveness of our CNN-based program code classification model, we compare it with different state-of-the-art models that use real-world program codes. The overall approaches, datasets, data preprocessing, and model training, validation, and evaluation procedures vary across models. Therefore, two comparisons are made: first, a comparison with the most similar tasks in different studies, as shown in Table 14, and second, a comparison with baseline models trained on the same dataset, as shown in Table 15.

Table 14 Experimental results comparison with baseline models of different studies
Table 15 Experimental result comparison with baseline models

The experimental results, datasets, number of program codes, languages, and models are considered when making comparisons with other studies, as shown in Table 14. Models such as DP-ARNN [28], RF [28], LSTM [8], and LSTM-AttM [8] are used to classify the defective source codes as either defective or non-defective (i.e., binary classification). In the binary classification, the LSTM-AttM model achieved a comparatively higher \(\mathcal {F}_{o}\) score of 94.00% than the other referenced models. The Stacked Bi-LSTM model achieved an \(\mathcal {F}_{o}\) score of about 89.24% for the multiclass classification task, which is higher than that for other models. In contrast, the proposed CNN-Arch-III model achieved a higher \(\mathcal {F}_{o}\) score of 95.70% than the other comparative multiclass classification models. In addition, the CNN-Arch-III model achieved a higher \(\mathcal {F}_{o}\) score among all classification models (binary and multiclass). Moreover, the experimental data size of our study is 61,614, which is also larger and more diverse than that of the other compared baseline classification models from different studies.

In addition, experiments are performed on the same dataset with the LSTM and BiLSTM models, as shown in Tables 12 and 13, respectively, and the results are compared with those of the proposed CNN models in Table 15. The LSTM model achieved an \(\mathcal{F}_{o}\) score of 82.02% and an \(\mathcal{A}_{o}\) score of 83.10%, the lowest among all the models, while the BiLSTM model obtained \(\mathcal{F}_{o}\) and \(\mathcal{A}_{o}\) scores of 84.14% and 84.64%, respectively, which are better than those of the LSTM model. The CNN-Arch-III model achieved \(\mathcal{F}_{o}\) and \(\mathcal{A}_{o}\) scores of 94.10% and 93.90%, respectively, which are better than those of both the LSTM and BiLSTM models. In particular, all CNN models achieved relatively better results than the LSTM and BiLSTM models under the same dataset and hyperparameters. This comparison demonstrates the advantage of the proposed CNN model in understanding the algorithmic features (or SFs) of the code, and the overall classification results of the proposed CNN models show their potential for detecting algorithms in program codes.

5 Discussion

In this section, we discuss the approach, including the performance and scalability of the model compared with other state-of-the-art models and its usefulness in programming learning and software engineering. In addition, we discuss threats to the validity of the proposed model.

5.1 Model performance analysis

In this paper, we focus on training DNN models using the algorithmic features of the code rather than its meta-information. We considered SFs as the key components of the algorithm in each solution code. A large number of practice-oriented solution codes were collected and processed for training and evaluating the model, and extensive experiments were conducted with different CNN architectures and hyperparameters. The CNN-Arch-III model achieved better training, validation, and evaluation accuracy than the other CNN models. Comparisons were also made between the CNN, LSTM, and BiLSTM models to demonstrate their classification performance. The experimental results show that DNN models recognize the algorithm in solution codes with an acceptable degree of accuracy: the CNN-Arch-III model achieved average \(\mathcal{F}_{o}\) scores of 94.5% and 96.9% for datasets A and B, respectively, in code classification. This result shows that the model classifies program codes with high accuracy without relying on meta-information.

In addition, we reviewed a large body of literature on program code classification. We found that existing studies classify codes based on various types of meta-information, including the programming language [58,59,60,61], code tags [63], errors [8, 28], and category [64, 65]. To the best of our knowledge, no previous study has considered the algorithmic (structural) features of codes in the classification task. A comparison of the proposed CNN-Arch-III model with other relevant classification methods is presented in Table 14. In contrast to these studies, this paper exploits the algorithmic (structural) features of the codes for the classification task. The experimental results (Tables 5, 6, 7, 12, and 13) show that DNN models achieve significant results using the SFs of the program codes.

5.2 Model scalability

In this study, SFs are extracted from the codes, and the CNN model is then trained to classify the program codes. The model classifies the program codes according to the category of algorithms with a high \(\mathcal {F}_{o}\) score of about 95.7%. This high accuracy demonstrates that the proposed approach, including SF extraction, OBM conversion, and training and evaluation of the best CNN model with real-world program codes, is effective. Moreover, the experiments were conducted with C++ program codes, and C++ supports procedural programming. Thus, the proposed model can also be applied to classify program codes written in other procedural languages, such as Python, Java, and C. Based on the comparisons with the baseline classification models, the proposed model (CNN-Arch-III) achieved better classification results than the others, as shown in Tables 14 and 15. The proposed model also scales to large industrial program codes, which are typically long and contain many functions and classes. Because these functions may implement different algorithms, the proposed model can be used to classify codes at the function level. Overall, the proposed code classification model can be useful and scalable for various programming-related tasks.
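As an illustration of function-level use, the sketch below applies a trained classifier to each function of a larger code base. The helpers `extract_functions`, `extract_sf_tokens`, and `encode_tokens` are hypothetical stand-ins for the SF-extraction and OBM-conversion steps and are not part of the published pipeline.

```python
# Minimal sketch (assumption): classifying a large code base one function at a time.
# `extract_functions`, `extract_sf_tokens`, and `encode_tokens` are hypothetical
# helpers representing function splitting, SF extraction, and OBM-style encoding.
import numpy as np

def classify_functions(source_code, model, extract_functions,
                       extract_sf_tokens, encode_tokens, class_names):
    """Return one predicted algorithm label per function in `source_code`."""
    results = []
    for name, body in extract_functions(source_code):   # e.g., [("solve", "..."), ...]
        tokens = extract_sf_tokens(body)                 # structural-feature tokens
        x = encode_tokens(tokens)                        # fixed-length numeric input
        probs = model.predict(np.expand_dims(x, 0), verbose=0)[0]
        results.append((name, class_names[int(np.argmax(probs))]))
    return results
```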

5.3 Model usage in programming learning

One of our research objectives is to help programmers learn programming in real-world environments, and the proposed model has been developed from this viewpoint. The experimental results indicate that the present study can be useful for programming learning. A considerable amount of program code is regularly generated from various sources, such as academia, industry, and programming platforms including OJ systems. However, programmers often find it challenging to identify the algorithms in reference program codes while learning from and searching among a large number of codes. Knowing the algorithm implemented in a program code can therefore help programmers better understand the code and accelerate their learning progress. The proposed code classification model can effectively assist programmers in identifying the algorithms contained in program codes. Moreover, the model can be integrated with various real-world programming learning platforms, including OJ systems.

5.4 Model usage in software engineering

Repositories of real-world program codes play a key role in building effective ML models for SE. ML models are applied in various areas of SE, such as strategic decision making, rapid prototyping, design and analysis, bug detection, code review, bug fixing, code reuse, and intelligent programming assistants (IPAs). ML-enabled IPA systems can also provide the most relevant code examples, best practices, and related texts as just-in-time support. As a result, the importance of ML models in software development and their application in SE is increasing significantly [14, 75]. The proposed CNN model classifies program codes by identifying the algorithms they contain. Therefore, this model can also be used, directly or indirectly, for various SE tasks such as code review, bug detection, code-example retrieval, and code refactoring. In particular, the proposed model can serve as a supporting component of other ML models in SE that deal with the SFs of program codes.
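One hedged way to realize this supporting role is to feed the classifier's predicted algorithm distribution into a downstream SE model as an additional feature. The sketch below assumes a trained Keras-style classifier; `classifier`, `encoded_sf`, and `other_code_features` are illustrative names, not components of the published system.

```python
# Minimal sketch (assumption): using the algorithm-class probabilities predicted by
# the proposed classifier as an extra feature vector for a downstream SE model.
# `classifier`, `encoded_sf`, and `other_code_features` are hypothetical placeholders.
import numpy as np

def augment_features(classifier, encoded_sf, other_code_features):
    """Concatenate predicted algorithm probabilities with existing code features."""
    algo_probs = classifier.predict(np.expand_dims(encoded_sf, 0), verbose=0)[0]
    return np.concatenate([other_code_features, algo_probs])
```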

5.5 Threats to validity

This study applied several novel ideas, from data preprocessing to model development, and the model achieved significant classification results in the experiments. However, the performance of the proposed model may be affected by the following threats: (i) variation in the list of feature tokens for other programming languages; (ii) different data preprocessing strategies; (iii) different sets of programming problems; (iv) problem sets in other programming languages, such as C, Python, Java, and C#; and (v) different hyperparameter values and CNN architectures.

In follow-up work, we plan to validate the model’s performance by addressing the above-mentioned threats.

6 Conclusion and future work

We developed CNN models to classify program codes based on the identified algorithms. Real-world program codes were collected from the AOJ system and used in all experimental tasks. The SFs of the program codes were extracted to train the CNN models and were converted to OBM through several processing steps. Different hyperparameters, such as \(\mathcal {C}{\mathscr{L}}\), \({\mathscr{L}}\mathcal {R}\), \(\mathcal {A}\mathcal {F}\), and \({\mathscr{B}}\mathcal {S}\), were used in the CNN models in different combinations. The top-3 CNN models and their hyperparameters were selected based on the best experimental results. In addition, 10-fold cross-validation was performed to select the most suitable (topmost) CNN model and hyperparameters for further experiments. Subsequently, all experiments with the best CNN model were performed on both datasets (A and B). The model achieved significant classification results for both datasets: average \(\mathcal {P}_{o}\), \(\mathcal {R}_{o}\), and \(\mathcal {F}_{o}\) scores of 94.30%, 94.80%, and 94.50%, respectively, for Dataset A, and 97.00%, 96.90%, and 96.90%, respectively, for Dataset B. Furthermore, the performance of the proposed CNN model was compared with those of other baseline models, and the results indicate that the proposed model outperforms the referenced models. The results also show that the proposed model is scalable in classifying program codes of diverse algorithms and can be useful for classifying program codes of other procedural programming languages, such as C, Java, Python, and C#.
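For completeness, the sketch below shows one standard way to run the 10-fold cross-validation used for model selection. It assumes scikit-learn for the fold splitting and a hypothetical `build_model` factory that returns a freshly compiled Keras model; it is not the paper's exact experimental script.

```python
# Minimal sketch (assumption): 10-fold cross-validation for selecting a CNN
# configuration. `build_model` is a hypothetical factory returning a compiled
# Keras model; X and y stand for the encoded SF inputs and algorithm labels.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validate(build_model, X, y, n_splits=10, epochs=20, batch_size=64):
    """Mean and std of validation accuracy over stratified folds."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = []
    for train_idx, val_idx in skf.split(X, y):
        model = build_model()                      # fresh weights for each fold
        model.fit(X[train_idx], y[train_idx],
                  epochs=epochs, batch_size=batch_size, verbose=0)
        # assumes the model was compiled with metrics=["accuracy"]
        _, acc = model.evaluate(X[val_idx], y[val_idx], verbose=0)
        scores.append(acc)
    return float(np.mean(scores)), float(np.std(scores))
```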

In the future, the code block sequence of program codes, instead of SFs, can be considered to investigate the model performance. Moreover, a multi-label classification model can be developed to classify program codes associated with multiple algorithm labels. In addition, the model can be applied to evaluate large-scale industrial program codes.
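As a pointer toward the multi-label direction, the following sketch shows the usual modification: independent sigmoid outputs with a binary cross-entropy loss, so that one code can carry several algorithm labels. The layer sizes, feature dimension, and decision threshold are illustrative assumptions, not part of the present model.

```python
# Minimal sketch (assumption): a multi-label output head. Independent sigmoid
# units replace the single softmax, so each algorithm label gets its own
# probability. Sizes below are illustrative placeholders.
from tensorflow.keras import layers, models

NUM_CLASSES = 20   # number of algorithm labels (assumed)

multi_label_head = models.Sequential([
    layers.Input(shape=(128,)),                        # features from a shared CNN body (assumed)
    layers.Dense(64, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="sigmoid"),   # one independent probability per label
])
multi_label_head.compile(optimizer="adam", loss="binary_crossentropy")

# At inference, any label whose probability exceeds a threshold (e.g., 0.5) is assigned.
```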