1 Introduction

Information technology (IT) has become an indispensable part of global society. One of the essential requirements for developing IT tools is computer programming, and the importance of programming education is attracting global attention [1]. Programming languages, curricula, teaching and learning methods, and platforms have become the subject of representative basic research on programming education [2,3,4]. As a result, a considerable amount of code is generated and accumulated daily by learners at different levels on platforms such as online judge (OJ) systems [5]. These large code archives can serve as a reference for problem solving and for searching problems and answers, as well as for educational research and analysis [6]. In the context of education, identifying the algorithm in the code can be useful for advanced code analysis, including code evaluation [7, 8], plagiarism checking, and problem evaluation (or difficulty estimation) [9, 10]. Furthermore, educational data mining (EDM) using large-scale programming data from repositories enables various empirical analyses, which reveal correlations between academic achievement and programming skills and support user assessment and learning-path recommendations that facilitate programming learning [1, 11, 12].

In software engineering (SE), algorithms are implemented at the functional level of the code. Solution codes can be reused for various purposes in SE in the form of libraries, open-source projects, components, and APIs. Code reuse, the practice of building new functions or programs from existing code snippets, is an important aspect of faster coding [13], and it requires understanding the reused code and the algorithms it implements. The identification of algorithms is also important for development environments (IDEs, editors, etc.) and related intelligent software tools that provide feedback and support functions. In a development environment, services for various types of searches over a set of program codes are indispensable. Identifying algorithms in code can be useful for advanced code analysis, including code cloning, refactoring, function prediction, debugging, code evaluation, and software metrics. As intelligent software tools, various ML models have been specifically designed for generating, evaluating, modifying, supplementing, and improving source code. The accuracy and efficiency of many specialized ML models for these operations, as well as for augmentation and retrieval tasks, depend strongly on identifying the program code [14, 15]. Therefore, the algorithm implemented in the code can be a useful feature for ML models.

Due to the vast amount of accumulated code, manually searching for codes using keywords, comments/documents, tags, names, and other metadata is a challenging task. The unavailability, non-uniformity, and inadequacy of metadata are also major obstacles to code retrieval. Many keywords are freely defined by programmers, which is the main reason for non-uniformity, and such keywords may not be suitable for accurate code classification. Moreover, metadata alone is not sufficient to find codes that implement similar algorithms for reference purposes. Therefore, artificial intelligence (AI) can be a core technology for solving this problem. In recent years, advanced deep neural network (DNN) models, such as recurrent neural networks (RNN), feed-forward neural networks (FNN), long short-term memory (LSTM) [16], bidirectional long short-term memory (BiLSTM) [17], and convolutional neural networks (CNN) [18], have been used effectively for tasks as diverse as computer vision [19,20,21,22], travel and Internet-of-Things time-series analysis [23, 24], fault diagnosis of chemical data [25], and autonomous transportation systems [26]. Meanwhile, DNN models are also considered an effective method in the context of programming activities.

In recent times, DNN models have achieved significant results for program code classification, recommendation, error detection, prediction, and code assessment [7, 12, 27,28,29,30]. Moreover, DNN models are used for various programming tasks (e.g., code completion, evaluation, repair, generation, and summarization) [31,32,33]. To make DNN models more effective in programming-related tasks, real-world programming data resources can be advantageous, and one such source is OJ data. The OJ system is an effective platform for programming exercises and competitions, allowing programmers to practice throughout the year [34, 35]. OJ systems can effectively provide autonomous learning opportunities through code evaluation and detailed feedback on program errors [9, 10, 12]. Let \(P = \{p_1, p_2, p_3, \ldots, p_n\}\) be the set of problems related to various algorithms and \(V = \{v_1, v_2, v_3, \ldots, v_m\}\) be the set of verdicts. For each problem in P, there are many solutions \(S = \{s_1, s_2, s_3, \ldots, s_w\}\), and each solution receives a verdict in V together with evaluation values such as CPU time. Typically, OJ systems provide decisions or verdicts depending on the errors in and acceptance of the codes. Each error decision gives a specific reason for an error in the code. For example, error decisions such as memory limit exceeded (MLE), time limit exceeded (TLE), and runtime error (RE) are made when the performance of the algorithm is not sufficient for solving the corresponding problem. In contrast, the wrong answer (WA) decision is made when the code contains logical errors. Thus, large collections of real-world OJ data (solution codes with verdicts and performance logs for problem sets) can be a valuable resource for AI-for-coding tasks [36,37,38].

Despite the remarkable results of DNN models in programming tasks, the structural (or algorithmic) features of the code have not been adequately discussed. However, knowing the algorithms used in a program code is important from both an educational and a software development perspective to better understand the code. Therefore, the classification of program code based on its structural features remains an open problem. To address this research gap, we propose a CNN-based program code classification model that can be applied to both programming education and software development. The proposed model classifies program codes by identifying the algorithms contained in the codes. In addition, this study presents a new data preprocessing approach for program codes. The preprocessing involves several steps: (i) user-defined properties/tokens of the program codes, such as functions, classes, keywords, and variables, are filtered out; (ii) structural features (SFs) such as if, else, loops, mathematical operators, bitwise operators, and assignment operators are retained; and (iii) the SFs of each program code are converted into a one-hot binary matrix (OBM). We have collected two different datasets of real-world program codes covering various algorithms for model training, validation, and evaluation. Three CNN models with different structures and hyperparameters are developed, trained, and evaluated to select the best model for program code classification, and the best CNN model is applied to the classification task based on the experimental results. The contributions of this research are as follows:

  • The proposed CNN model can identify the algorithm used in the program code and classify the code based on the identified algorithm.

  • We present a novel strategy for program code preprocessing. SFs are extracted from program codes and converted into OBMs for model training. SFs help the model better understand the algorithmic properties of the codes.

  • The average precision, recall, and F-measure values of the proposed model are 95.65%, 95.85% and 95.70%, respectively, which outperform the values obtained by other referenced models.

  • The proposed classification model and its novel data preprocessing approach can be useful for various educational and industrial applications.

The remainder of this paper is structured as follows. Section 2 presents the background and related works. Section 3 describes the proposed approach, and Section 4 presents the experimental results and evaluations. Section 5 discusses the results in detail, and finally, Section 6 concludes this study with suggestions for future work.

2 Background and related works

This section presents prior studies related to programming education and its challenges, ML in software development practices and its challenges, code evaluation and repair, and code classification.

2.1 Programming education and its challenges

Research in programming education has gained momentum worldwide, and learning programming in higher education has been recognized as significantly important for the sustainable development of IT infrastructure [39]. A data-driven study [1] has shown that better programming skills have a positive impact on students’ academic performance. In [40], EDM has been performed to support programming learning based on programming data. Sun et al. [41] proposed a model to evaluate students’ programming skills in terms of programming and test performance. Based on object-oriented programming tasks, the model observed the improvement of students’ programming skills, and the experimental results showed that test performance was positively correlated with programming performance. Qian et al. [42] conducted a comprehensive study to identify students’ misconceptions and difficulties in introductory programming courses. Students most often struggle with misconceptions related to conceptual, syntactic, and strategic knowledge. The challenges faced by students depend on many factors, including unfamiliar language syntax, programming environments, incorrect concepts and strategies, and instructor competence. Medeiros et al. [2] categorized the challenges in introductory programming and the essential issues for learning programming and teaching in higher education. In addition, the study [43] identified significant challenges such as writing, debugging, conceptualizing, and tracing code, and presented pedagogical teaching/learning techniques and useful learning tools to overcome these challenges. Meanwhile, due to rapid social and technological changes, many interesting and convenient tools are available, which sometimes have a negative impact on programming learning and students’ motivation [39].

2.2 Machine learning in software development practices and its challenges

Recently, ML has been gaining attention as a method for developing various software systems, such as speech recognition, computer vision, natural language processing (NLP), robot control, and other application domains. ML capabilities can be integrated into a software system in many ways, including ML components, tools, libraries (covering ML functionalities), and frameworks [14]. A widespread trend has emerged in which ML-enabled systems are developed and deployed quickly and inexpensively, but their long-term maintenance is not cost-effective [44]. Wan and collaborators investigated the differences in software development practices between ML and non-ML systems [14]. Moreover, common practices and workflows for building large-scale ML applications, systems, and platforms at Microsoft, Amazon, and Google have been presented in [6, 45,46,47]. Additionally, various tools have been proposed to test and debug ML-based applications and systems [48,49,50,51]. Despite these efforts, the standardization and operationalization of reliable ML systems remain necessary. Based on real-world ML-enabled software development practices [6], 11 challenges have been identified, ranging from data collection to model evolution, evaluation, and deployment. In this context, our proposed classification model can be a supporting component in building large-scale ML-based applications and systems that deal with SFs.

2.3 Program code evaluation and repairing

Recently, researchers have made continuous efforts to achieve significant results in this area. Programming languages differ considerably from natural languages because program codes contain a large amount of complex structural information, which makes conventional NLP models inadequate for them. Therefore, a tree-based CNN model for programming code processing was proposed in [52]. Rahman et al. [8] presented a model for source code evaluation using LSTM neural networks combined with an attention mechanism to capture the complex context of the code. During code evaluation, the model identified errors, including logic and syntax errors, with high accuracy.

In [53], a multi-modal attention network (MMAN) has been proposed to properly represent the SFs of source codes and improve the reasoning about which features have the most impact on the final results. The MMAN can represent both structured and non-structured features of source codes, using a tree-LSTM for the abstract syntax tree (AST) and a gated graph neural network (GNN) for the control flow graph. In another study [7], an LSTM model has been developed to identify source code errors in C programs. In this model, characters, variables, keywords, tokens, numbers, functions, and classes are encoded with predefined IDs, and the model detected errors in faulty solution codes with high accuracy. Terada et al. [29] presented a model, built with an LSTM network, that predicts the next code sequence to complete partially written code. The model effectively predicted the correct words to complete the code and can help novice programmers who have difficulty writing complete code from scratch. In addition, code evaluation, completion, and repair tasks have been performed using an LSTM neural network at different levels of programming learning [31, 32].

2.4 Program code classification

A program code classification model is essential for a better understanding of the code, and researchers have proposed various approaches for program code classification. In the early stages of code classification and prediction, NLP models were applied to source code to perform various prediction tasks [54,55,56]. A GNN model [57] was proposed for classifying students’ program codes that integrates the AST and data flow to improve performance; it classifies student program code with an accuracy of 97%. Fan et al. [28] proposed a method for classifying defective source codes using RNNs with attention mechanisms. Two evaluation indicators, the area under the curve (AUC) and the F1-measure, were used, and the method improved them by about 7% and 14%, respectively, compared with other benchmark models.

Furthermore, many models have been proposed for classifying program codes by programming language. Ugurel et al. [58] performed two types of classification using SVM: first, classification of programming languages and, second, classification of different categories of programs (e.g., databases, multimedia, and graphics). Tian et al. [59] used a Latent Dirichlet mapping method to identify the programming language of source code based on its words. Alreshedy et al. [60] presented an ML-based model that classifies source code snippets by programming language using a multinomial Naive Bayes (MNB) classifier, with Stack Overflow posts as experimental data; this classification method used features such as comments, variables, and functions instead of syntactic information. Reyes et al. [61] presented an LSTM model that classifies archived source codes by programming language; empirical results show that it performed better than Naive Bayes and linguistic classifiers. Gilda [62] used a CNN model to identify programming languages from source code snippets.

In [63], classification based on code tags has been performed using three classification methods: SVM, random forest, and AdaBoost. In [64], a decision tree-based classification method has been used to classify source codes related to sorting algorithms. LeClair et al. [65] mentioned that source code can be classified into six categories: games, admin, network, words, science, and usage. Xu et al. [66] used LSTM and CNN to identify vulnerabilities in source code. In addition, a CNN-based classification model was used to classify code based on the algorithms used.

In brief, numerous promising methods have been proposed and experimented with in various studies. Researchers have used traditional unsupervised and supervised classifiers, and CNN and LSTM have been employed as language models for source code-related research and applications. However, the relative merits of these methods are difficult to compare directly. The proposed code classification model differs from other models in its novel data preprocessing and model selection approach for the CNN. In this study, three CNN models with different structures and hyperparameters are trained, validated, and evaluated, and the best CNN model is selected for the classification task based on the results.

3 Proposed approach

Programmers implement algorithms to write efficient code; however, implementing algorithms in code is not a trivial task. This research aims to identify the algorithm contained in a program code and classify the code based on the identified algorithm. We have used real-world solution codes for different algorithms from programming competitions and academic courses. A crucial step is data preprocessing for model training and evaluation, in which SFs are extracted from the codes and all user-defined elements (e.g., variables, classes, and functions) are excluded. These SFs help the DNN model better understand the algorithm’s flow. CNN-based classification models with various structures and hyperparameters are developed for classifying the codes. Although CNN models are widely used in computer vision research, they have recently achieved significant success in various programming-related tasks (classification, error detection, prediction, and language modeling) [67, 68]. The proposed approach includes several phases, from data acquisition to model training and evaluation: (i) data acquisition and categorization, (ii) data preprocessing, (iii) CNN model training, and (iv) program code classification with the optimal CNN model. The basic framework of our proposed approach is shown in Fig. 1 and is explained in detail in the following sections.

Fig. 1
figure 1

Framework of the proposed approach

3.1 Data collection and categorization

Selecting relevant datasets from a real-world data repository is essential in research. In this study, real-world program codes are collected from the Aizu Online Judge (AOJ) system [69, 70]. All program codes are written in the C++ programming language. AOJ is a platform that hosts various academic programming activities and programming competitions. As of February 2022, AOJ has over 3,000 programming problems and 100,000 users, and it organizes programming problems by category and algorithm. The AOJ system has archived more than 6 million solution codes and submission logs, creating research opportunities for SE and programming education. For example, IBM and MIT have used solution codes from AOJ for their CodeNet project [36, 71].

In this study, all program codes are divided into two separate datasets: A and B. For Dataset A, we considered categories that cover a large number of algorithms in computer science and engineering, such as computational geometry problems (CGP), number theory problems (NTP), flow network problems (FNP), shortest path problems (SPP), query for data structures problems (QDSP), and combinatorial optimization problems (COP), as shown in Table 1. These categories include basic algorithms from graph theory, geometry, numerical analysis, puzzles, numbers, search, computational theory, networks, advanced mathematics, and advanced data structures and algorithms. All program codes of each category in Dataset A are collected from the problems of programming competitions in AOJ.

Table 1 Algorithms and number of codes for dataset A

As shown in Table 2, all program codes related to sorting, such as counting sort, bubble sort, insertion sort, merge sort, selection sort, shell sort, and quick sort, are contained in Dataset B. In addition, essential features of the sorting algorithms, such as their complexity and method, are presented.

Table 2 Feature attributes and number of program codes for dataset B

3.2 Data preprocessing

To achieve better results from DNN models, effective input representations play a vital role, and it is essential to create a suitable input shape that represents the actual features of the original data. Program code has a far more complex representation than natural language. Therefore, we extracted suitable features from the program codes so that the model can be trained effectively. The workflow of program code preprocessing is shown in Fig. 2.

Fig. 2
figure 2

Workflow of program code preprocessing

Only structural properties are extracted from the code for tokenization during program code transformation. Program code usually consists of operators, operands, loops, branches, keywords, methods, and classes; therefore, these key attributes of the program code are extracted. In contrast, user-defined elements with little impact, such as comments, variables, classes, and functions, are not considered. A list of feature tokens (T) and their corresponding IDs is shown in Table 3. Initially, SFs are extracted from the program codes according to Algorithm 1. The steps of program code preprocessing are described in the following subsections.

Algorithm 1
figure e

Extraction of structural features from code.

Table 3 List of feature tokens and their IDs

3.2.1 Comments deletion

All comments in the program code are identified and removed with the removeComments() function, because comments are not significant for classification, as shown in Fig. 3.
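The paper gives removeComments() only as a step in Fig. 3, so the following is a minimal sketch of what such a function could look like for C++ sources, assuming a regex-based approach that does not special-case comment markers inside string literals:

```python
import re

def remove_comments(code: str) -> str:
    """Strip // line comments and /* ... */ block comments from C++ source.

    Simplified sketch: comment markers inside string literals are not
    handled, which the authors' removeComments() may treat differently.
    """
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.DOTALL)  # block comments
    code = re.sub(r"//[^\n]*", "", code)                    # line comments
    return code

print(remove_comments("int n = 0; // counter\n/* temp */ n++;"))
```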

Fig. 3
figure 3

Identification and removal of unnecessary comments from the code

3.2.2 Extraction of feature tokens

After removing comments from the code, feature tokens such as if, else, loops, mathematical operators, bitwise operators, assignment operators, compound assignment operators, comparison operators, braces, parentheses, and square brackets are selected. Typically, in C++ programming, parentheses are used for function calls and declarations, conditional and loop statements (if, while, do), and operator precedence. Braces delimit the bodies of functions, classes, structs, if statements, and loops, and square brackets are used to access arrays. With this definition, all the feature tokens in the program code are selected for extraction using the extractSelectedFeatures() function, as shown in Fig. 4.

Fig. 4
figure 4

Extraction of selected feature tokens

In addition, irrelevant tokens such as function and variable names are identified and removed from the code, i.e., all variables and functions arbitrarily defined by the programmer. The names of variables and functions vary depending on the programmer’s definitions, so a single program can contain many different variable and function names. Also, C++ is a statically typed programming language in which variable types must be explicitly specified in the code, whereas Ruby and Python are dynamically typed languages. Therefore, all user-defined variable and function names are removed from the code so that the DNN model can better understand the context of the code, as illustrated by the sketch below.
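As an illustration of this step, the sketch below keeps only structural tokens (keywords, operators, brackets) and drops identifiers; the regular expression and the exact token set are our own simplification of Algorithm 1 and Table 3:

```python
import re

# Structural feature tokens retained by the preprocessing (Section 3.2.2).
# The token set is a simplified approximation of Table 3.
KEYWORDS = {"if", "else", "for", "while", "do", "switch", "case"}
OPERATORS = [
    "<<=", ">>=", "==", "!=", "<=", ">=", "&&", "||", "+=", "-=",
    "*=", "/=", "%=", "&=", "|=", "^=", "++", "--", "<<", ">>",
    "+", "-", "*", "/", "%", "&", "|", "^", "=", "<", ">",
    "(", ")", "{", "}", "[", "]",
]
_TOKEN_RE = re.compile(
    "|".join([r"\b(?:%s)\b" % "|".join(KEYWORDS)]
             + [re.escape(op) for op in OPERATORS]))

def extract_selected_features(code: str) -> list:
    """Return only the structural tokens of a comment-free C++ code."""
    return _TOKEN_RE.findall(code)

print(extract_selected_features("for (int i = 0; i < n; i++) s += a[i];"))
# ['for', '(', '=', '<', '++', ')', '+=', '[', ']']
```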

3.2.3 Tokenization of the features

All the feature tokens are extracted from the code, as shown in Fig. 5(a). Next, the extracted feature tokens are converted into token IDs according to Table 3. This process is called tokenization or encoding. In this research, the tokenization/encoding process represents each SF of the code as a token, and all these tokens are mapped to numeric values to feed the DNN models. When training DNN models, a sequence of tokens is converted into a sequence of numerical vectors, which are then processed by the neural network. DNN models neither know SFs such as { + & = [ ] } nor understand the semantic or algorithmic features of the code; therefore, tokenization/encoding is an essential step for learning the neural network model from scratch. For example, we defined token IDs from 0 to 16 for the different features of the code (Table 3). When features such as { + & = [ ] } are extracted from the code, they are converted to the numeric sequence 13 1 2 0 15 16 14 after tokenization/encoding. Once the tokenization process is completed, the code contains only a list of IDs, as shown in Fig. 5(b).
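A minimal sketch of this mapping is shown below. Only the IDs that appear in the example above ('=' → 0, '+' → 1, '&' → 2, '{' → 13, '}' → 14, '[' → 15, ']' → 16) are taken from the text; the remaining entries are illustrative placeholders for the other rows of Table 3:

```python
# Partial token-to-ID map; entries marked as placeholders are assumptions.
TOKEN_IDS = {
    "=": 0, "+": 1, "&": 2,                      # from the example in the text
    "if": 3, "else": 4, "for": 5, "while": 6,    # placeholders
    "<": 7, ">": 8, "==": 9, "+=": 10,           # placeholders
    "(": 11, ")": 12,                            # placeholders
    "{": 13, "}": 14, "[": 15, "]": 16,          # from the example in the text
}

def tokenize(features):
    """Convert a sequence of structural feature tokens into token IDs."""
    return [TOKEN_IDS[tok] for tok in features if tok in TOKEN_IDS]

print(tokenize(["{", "+", "&", "=", "[", "]", "}"]))  # [13, 1, 2, 0, 15, 16, 14]
```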

Fig. 5
figure 5

(a) A program code with featured tokens, and (b) Tokenization with IDs

3.2.4 One-hot binary matrix conversion from token IDs

After the tokenization process, the IDs are assigned to the corresponding feature tokens. A sequence of token IDs is converted into a \(\mathcal{P} \times \mathcal{Q}\) matrix, where \(\mathcal{P}\) is the number of token IDs and \(\mathcal{Q}\) is the highest token ID value plus one. According to the definition of the token IDs, the maximum value of a token is 16, so \(\mathcal{Q}\) (the number of matrix columns) is 17. Finally, the token IDs are converted into an OBM of \(\mathcal{P}\) rows according to Algorithm 2, which uses (1) to construct the OBM. The conversion of token IDs to an OBM is shown in Fig. 6.

$$ \mathcal{M}_{\mathcal{P}, \mathcal{Q}} = \left\{\begin{array}{ll} 1, & \text{if } \mathcal{Q}=\mathcal{S}_{\mathcal{P}} + 1\\ 0, & \text{otherwise} \end{array}\right. $$
(1)

where \(\mathcal{S}_{\mathcal{P}}\) is the token ID at the \(\mathcal{P}\)-th iteration and \(\mathcal{Q}\) is the column index.

Fig. 6
figure 6

OBM conversion process

In Algorithm 2, line 5 first takes the entire tokenized solution code (e.g., Fig. 5(b)), line 6 then processes the individual tokens for OBM conversion, and finally lines 7-14 are applied repeatedly to construct the OBM until the tokens of the code are exhausted.

Algorithm 2
figure f

OBM conversion from tokenized structural feature (TF).
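A compact NumPy sketch of the conversion in Algorithm 2 and (1) is shown below; the only assumption is that the 1-based column index of (1) is shifted to a 0-based index:

```python
import numpy as np

NUM_COLUMNS = 17  # highest token ID (16) + 1, per Section 3.2.4

def to_one_hot_matrix(token_ids):
    """Convert a token-ID sequence into a P x Q one-hot binary matrix (OBM).

    Row p holds a 1 in column token_ids[p] and 0 elsewhere, i.e.,
    equation (1) with 0-based column indices.
    """
    obm = np.zeros((len(token_ids), NUM_COLUMNS), dtype=np.float32)
    obm[np.arange(len(token_ids)), token_ids] = 1.0
    return obm

print(to_one_hot_matrix([13, 1, 2, 0, 15, 16, 14]).shape)  # (7, 17)
```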

3.2.5 Padding

The final step of program code preprocessing is padding, which is essential for training a DNN model with batches. To train a DNN model, all input sequences in a batch must have the same length. Therefore, random tokens are added to the beginning (pre-padding) and end (post-padding) of the input sequences to make them the same length. Using random rather than constant tokens for padding also helps to avoid overfitting.
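The sketch below pads a batch of OBMs with random one-hot rows at both ends; the split between pre- and post-padding and the random-row construction are our assumptions, since the paper only states that random tokens are added at both ends:

```python
import numpy as np

def pad_batch(matrices, rng=None):
    """Pad a batch of OBMs to a common length with random one-hot rows."""
    rng = np.random.default_rng(0) if rng is None else rng
    target = max(m.shape[0] for m in matrices)
    q = matrices[0].shape[1]
    padded = []
    for m in matrices:
        deficit = target - m.shape[0]
        pre = deficit // 2                      # pre-padding (assumed split)
        rand = np.eye(q, dtype=np.float32)[rng.integers(0, q, deficit)]
        padded.append(np.concatenate([rand[:pre], m, rand[pre:]], axis=0))
    return np.stack(padded)

batch = pad_batch([np.eye(17, dtype=np.float32)[:5],
                   np.eye(17, dtype=np.float32)[:8]])
print(batch.shape)  # (2, 8, 17)
```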

3.3 Architecture of the CNN model

CNNs have become an effective deep learning technique for solving complex tasks in various domains in recent years, and their use has increased significantly in many fields of computer science and engineering [22, 68, 72]. The architecture of a CNN model is illustrated in Fig. 7. The architecture includes convolutional layers (CLs) of different sizes, activation functions (AFs), max-pooling layers, fully connected layers (FCLs), dropout layers, and a softmax function for the classification task. The OBM is fed to the different-sized CLs via a dropout layer. Each CL learns code features from the input sequences, and the output of each CL is passed to an AF (e.g., ReLU/LeakyReLU). The ReLU and LeakyReLU AFs are expressed by (2) and (3), respectively.

$$ \text{ReLU: } f(z)=\max(0,z) $$
(2)
$$ \text{LeakyReLU: } f(z)=\max(\alpha z, z) $$
(3)

where z is the input and \(\alpha\) is a small slope value (e.g., 0.01).

Fig. 7
figure 7

A sample network architecture of the two-parallel convolutional layered CNN model

A max-pooling layer is added after each CL. The max-pooling layer extracts the maximum value from each activation/feature map generated by the convolutional filters/kernels; in this manner, important information is preserved while the size of the feature map is reduced. The outputs of the different max-pooling layers are then concatenated, and the pooled result is passed to an FCL via a dropout layer. The FCL learns combinations of filter responses that are highly correlated with each algorithm category. Finally, the output of the FCL is converted into probabilities via the softmax layer according to (4), where the probability \(Y_k\) is calculated from \(a_k\), the output of the FCL for category k. The loss function \(\mathcal{L}\) is calculated by (5) using the predicted value \(Y_k\) and the actual value \(t_k\).

$$ Y_{k} =\frac{\exp(a_{k})}{\sum_{j=1}^{N} \exp(a_{j})} $$
(4)
$$ \mathcal{L}=-\sum_{k} t_{k} \log(Y_{k}) $$
(5)

Dropout layers are placed before the CLs and FCLs to avoid overfitting. The initial dropout layer randomly zeroes out entire columns, while the other dropout layers randomly zero out individual inputs. Finally, the softmax layer classifies program codes based on probability: for a given code, it outputs a probability for each category, the probabilities of all categories sum to 1, and the category with the highest probability is selected.
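For concreteness, the following PyTorch sketch mirrors the two-parallel architecture of Fig. 7. The channel counts, hidden size, and global max-pooling are our assumptions, since the text does not fully specify them; the softmax of (4) and the loss of (5) are applied together through CrossEntropyLoss:

```python
import torch
import torch.nn as nn

class ParallelConvCodeClassifier(nn.Module):
    """Sketch of the two-parallel-convolution model of Fig. 7 (assumptions:
    full-width kernels of heights 32 and 64, 64 channels each, and global
    max-pooling over the sequence dimension)."""

    def __init__(self, num_tokens=17, num_classes=6,
                 kernel_heights=(32, 64), channels=64, hidden=128):
        super().__init__()
        self.input_dropout = nn.Dropout(0.1)   # 10% dropout on the input OBM
        self.convs = nn.ModuleList(
            [nn.Conv2d(1, channels, kernel_size=(k, num_tokens))
             for k in kernel_heights])
        self.act = nn.LeakyReLU(0.01)
        self.fc = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(channels * len(kernel_heights), hidden),
            nn.LeakyReLU(0.01),
            nn.Dropout(0.5),
            nn.Linear(hidden, num_classes))     # softmax applied by the loss

    def forward(self, obm):
        # obm: (batch, seq_len, num_tokens) padded one-hot binary matrices
        x = self.input_dropout(obm).unsqueeze(1)              # (B, 1, P, Q)
        pooled = [self.act(conv(x)).amax(dim=2).squeeze(-1)   # (B, channels)
                  for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))

model = ParallelConvCodeClassifier()
logits = model(torch.zeros(4, 256, 17))                       # 4 padded OBMs
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1, 2, 3]))
```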

3.4 Hyperparameters

Different architectures and hyperparameters of the CNN model are tuned to select the best/optimal model for program code classification. We used filters/kernels of different sizes, namely 16 × 17, 32 × 17, and 64 × 17, in the CLs, and batch sizes (\(\mathcal{BS}\)) of 16, 32, and 64 for the input sequences. The horizontal length of the convolutional filters and the OBM is always equal, i.e., 17. The output length of all convolutional layers is 64, and thus the length of the training sequence is also 64. The large convolutional filter lengths allow the CLs to learn the characteristics of entire code blocks. The hyperparameters are listed in Table 4.
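As a rough illustration of the search space, the {AF, BS, LR} combinations explored per architecture can be enumerated as follows; Table 4 lists additional hyperparameters not reproduced in this sketch:

```python
from itertools import product

activations = ["ReLU", "LeakyReLU"]
batch_sizes = [16, 32, 64]
learning_rates = [0.01, 0.001, 0.0001]

configs = list(product(activations, batch_sizes, learning_rates))
print(len(configs))  # 18 {AF, BS, LR} combinations per CNN architecture
```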

Table 4 List of hyperparameters

4 Experimental results

4.1 Overview

In this section, we present the target models and experimental steps, dataset preparation, evaluation metrics, and experimental environment. In this paper, we conducted the experiments in two phases. In the first phase, experiments are conducted using different architectures of CNN models. Based on the performance of the CNN models, the best model is selected for further experiments. In the second phase, experiments are conducted with the best CNN model and two other baseline models (i.e., LSTM and BiLSTM). An overview of the experimental phases is shown in Fig. 8.

Fig. 8
figure 8

Overview of the experimental phases

4.1.1 Target models and experimental flow

In this paper, CNN, LSTM, and BiLSTM models are used for training, validation, and evaluation. To this end, comprehensive experiments are performed with the following main steps: (i) experiments with different CNN architectures are performed, (ii) training accuracy, training time, evaluation accuracy, and comparisons are reported, (iii) the best CNN model is selected, (iv) code classification with the best CNN model is performed, (v) code classification with the baseline models LSTM and BiLSTM is performed, and (vi) the overall classification performance is compared with the baseline models and other related studies.

4.1.2 Data preparation for experiments

The details of our datasets and their preprocessing procedures are presented in Sections 3.1 and 3.2. We have two datasets, A and B. Dataset A covers a wide variety of algorithms, including combinatorial, geometric, graph, and numerical algorithms, whereas Dataset B consists of codes related to sorting algorithms. In the experiments, 45,398 and 16,216 program codes are used for datasets A and B, respectively, and about 10% of the program codes of each dataset are randomly selected for evaluation. All program codes are written in the C++ programming language and have been accepted by the AOJ, which means that all codes are “correct” and efficient enough. Since dataset A has more program codes and more diversity than dataset B, dataset A is used for training and evaluation in the first phase of the experiments. Next, the best CNN model is selected based on its performance, and both datasets A and B are used for evaluation. In the second phase, experiments are performed on dataset A and the best CNN, LSTM, and BiLSTM models are compared.

4.1.3 Evaluation metrics

To evaluate model performance, precision (\(\mathcal{P}_{o}\)), recall (\(\mathcal{R}_{o}\)), F-measure (\(\mathcal{F}_{o}\)), and accuracy (\(\mathcal{A}_{o}\)) are calculated as defined in (6), (7), (8), and (9). A larger \(\mathcal{P}_{o}\) value indicates higher credibility of the classification results for a particular category; in other words, \(\mathcal{P}_{o}\) indicates the accuracy of the classification predictions. \(\mathcal{R}_{o}\) measures the proportion of program codes of a category that are correctly classified into that category, and \(\mathcal{F}_{o}\) is the harmonic mean of \(\mathcal{R}_{o}\) and \(\mathcal{P}_{o}\).

$$ \mathcal{P}_{o}=\frac{TP}{TP+FP} $$
(6)
$$ \mathcal{R}_{o}=\frac{TP}{TP+FN} $$
(7)
$$ \mathcal{F}_{o}=2 \times \frac{\mathcal{P}_{o} \times \mathcal{R}_{o}}{\mathcal{P}_{o} + \mathcal{R}_{o}} $$
(8)
$$ \mathcal{A}_{o}=\frac{G}{N} $$
(9)

where TP, FP, and FN denote true positives, false positives, and false negatives, respectively; G is the number of correctly classified program codes; and N is the total number of program codes.
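These per-category scores and their averages can be reproduced with scikit-learn (assumed here purely for illustration); the labels below are hypothetical:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical true and predicted categories for a handful of codes.
y_true = ["CGP", "NTP", "FNP", "SPP", "QDSP", "COP", "CGP", "NTP"]
y_pred = ["CGP", "NTP", "FNP", "SPP", "COP",  "COP", "CGP", "SPP"]

p, r, f, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)  # (6), (7), (8)
acc = accuracy_score(y_true, y_pred)                   # (9): G / N
print(f"P={p:.3f} R={r:.3f} F={f:.3f} A={acc:.3f}")
```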

4.1.4 Implementation details

All experiments are executed with the PyTorch framework on two NVIDIA GeForce GTX 1080 GPUs with 32 GB of memory. The hyperparameters used in the experiments are detailed in Section 3.4.

4.2 Performance of different CNN models

Instead of using a basic CNN model, we conducted experiments by varying the network architectures and hyperparameters of the CNN to investigate the performance. The optimal CNN model is selected based on the learning accuracy and classification accuracy. For this purpose, three different architectures are developed and used in the experiments: (i) single convolutional layer CNN (CNN-Arch-I) model, (ii) two parallel convolutional layer CNN (CNN-Arch-II) model, and (iii) three parallel convolutional layer CNN (CNN-Arch-III) model. In each CL, three different sizes of filters (e.g., 16, 32, and 64) and two AFs (e.g., ReLU and LeakyReLU) are used separately. In addition, three different BSs (e.g., 16, 32, and 64) and LRs (e.g., 0.01, 0.001, and 0.0001) are used. The network architectures of all three models are shown below.

  (i) CNN-Arch-I Model: Input (OBM) → dropout (10%) → single Conv. Layer [Filter/Kernel Size (64) + ReLU/LeakyReLU + MaxPool] → dropout (50%) → FCL → dropout (50%) → FCL → Softmax Layer

  (ii) CNN-Arch-II Model: Input (OBM) → dropout (10%) → two-parallel Conv. Layers [Filter/Kernel Sizes (32, 64) + ReLU/LeakyReLU + MaxPool] → dropout (50%) → FCL → dropout (50%) → FCL → Softmax Layer

  (iii) CNN-Arch-III Model: Input (OBM) → dropout (10%) → three-parallel Conv. Layers [Filter/Kernel Sizes (16, 32, 64) + ReLU/LeakyReLU + MaxPool] → dropout (50%) → FCL → dropout (50%) → FCL → Softmax Layer

Experimental results of training accuracy and time, classification accuracy, comparisons between training, validation, and evaluation scores, and accuracy with 10-fold cross-validation of the models are presented below.

4.2.1 Training accuracy and time of the CNN models

To investigate the training accuracy, Dataset A is used first for model training because it contains program codes from a wider variety of algorithms than dataset B. Training accuracy is calculated at the 100th epoch of the 1st round of 10-fold cross-validation. Figures 9, 10, and 11 show the training accuracy of the CNN-Arch-I, CNN-Arch-II, and CNN-Arch-III models, respectively. The results reveal that (i) the CNN-Arch-II and CNN-Arch-III models achieved training accuracies of more than 96% when the AF was ReLU/LeakyReLU, the \(\mathcal{LR}\) was 0.001, and the \(\mathcal{BS}\) was 16, 32, or 64, as shown in Figs. 10 and 11; (ii) in contrast, the CNN-Arch-I model achieved a training accuracy of approximately 92%, as shown in Fig. 9, which is comparatively lower than that of the other two models.

Fig. 9
figure 9

Training accuracy of CNN-Arch-I model

Fig. 10
figure 10

Training accuracy of CNN-Arch-II model

Fig. 11
figure 11

Training accuracy of CNN-Arch-III model

The training time of each model with different hyperparameter combinations, i.e., {\(\mathcal{AF}\), \(\mathcal{BS}\), \(\mathcal{LR}\)}, is also compared. Figures 12, 13, and 14 compare the training times of all models, where the X-axis represents \(\mathcal{BS}\) and the Y-axis represents the time (in seconds) required to train the model. Several observations can be derived from Figs. 12, 13, and 14: (i) the CNN-Arch-I model required an average of 26172.77 s (7.27 h) of training per hyperparameter combination, which is less than that required by the other two models, whereas the CNN-Arch-II and CNN-Arch-III models required an average of 34772.77 s (9.66 h) and 37595.05 s (10.44 h) per hyperparameter combination, respectively; (ii) the training time increased linearly with the number of CLs, regardless of \(\mathcal{BS}\) and \(\mathcal{LR}\); (iii) all models required relatively more training time when the LeakyReLU AF was used instead of ReLU. This is because ReLU zeroes all negative neuron outputs, which causes many neurons to become inactive and output only 0, a phenomenon known as the dying ReLU/dead neuron problem [73]; LeakyReLU alleviates this problem by using a small slope value (e.g., α = 0.01).

Fig. 12
figure 12

Training time of CNN-Arch-I model

Fig. 13
figure 13

Training time of CNN-Arch-II model

Fig. 14
figure 14

Training time of CNN-Arch-III model

4.2.2 Classification accuracy of the models

In this part of the experiment, we present the initial classification accuracies of all three models with different hyperparameters. All results are calculated on dataset A at the 100th epoch of the 1st round of 10-fold cross-validation. The top-3 models and their corresponding hyperparameters are selected for further experiments based on the classification results. Tables 5, 6, and 7 show the classification results of the CNN-Arch-I, CNN-Arch-II, and CNN-Arch-III models, respectively. For clarity, a hyperparameter set is a combination of values of \(\mathcal{AF}\), \(\mathcal{BS}\), and \(\mathcal{LR}\), e.g., {ReLU, 16, 0.001}. The CNN-Arch-I model achieved its highest \(\mathcal{F}_{o}\) score of 93.30% and \(\mathcal{A}_{o}\) of 93.10% with the hyperparameter set {LeakyReLU, 32, 0.0001}. Similarly, the CNN-Arch-II model obtained its highest \(\mathcal{F}_{o}\) score of 94.20% and \(\mathcal{A}_{o}\) of 93.90% with {LeakyReLU, 16, 0.0001}, and the CNN-Arch-III model achieved its highest \(\mathcal{F}_{o}\) score of 94.1% and \(\mathcal{A}_{o}\) of 93.9% with {LeakyReLU, 64, 0.0001}.

Table 5 Classification results of CNN-Arch-I model
Table 6 Classification results of CNN-Arch-II model
Table 7 Classification results of CNN-Arch-III model

In addition, we performed experiments with deeper models of four, five, and six convolutional layers using the hyperparameter set {LeakyReLU, 16, 0.0001}. The CNN model with four convolutional layers (8, 16, 32, 64) achieved \(\mathcal{P}_{o}\), \(\mathcal{R}_{o}\), and \(\mathcal{F}_{o}\) scores of 93.21%, 92.80%, and 92.65%, respectively, and an \(\mathcal{A}_{o}\) score of 92.42%; it required approximately 11 hours and 18 minutes to train. Similarly, the CNN model with five convolutional layers (8, 16, 32, 64, 128) achieved \(\mathcal{P}_{o}\), \(\mathcal{R}_{o}\), and \(\mathcal{F}_{o}\) scores of 94.03%, 92.10%, and 92.60%, respectively, and an \(\mathcal{A}_{o}\) score of 92.17%, requiring approximately 21 hours and 14 minutes to train. Furthermore, the CNN model with six convolutional layers (8, 16, 32, 64, 128, 256) achieved \(\mathcal{P}_{o}\), \(\mathcal{R}_{o}\), and \(\mathcal{F}_{o}\) scores of 93.31%, 90.93%, and 91.61%, respectively, and an \(\mathcal{A}_{o}\) score of 90.90%. These three deeper CNN models failed to achieve better classification scores and required much more training time than the CNN-Arch-III model. Depending on the combination of hyperparameters, the models yielded different accuracies, reaching classification accuracies of up to about 94%. Therefore, the classification scores help to identify the optimal hyperparameters for each model. Based on the classification \(\mathcal{A}_{o}\) and \(\mathcal{F}_{o}\) scores, the top-3 results are presented in Table 8.

Table 8 List of the top-3 results achieved with all three models and hyperparameters

4.2.3 Comparison between training, validation, and evaluation of the top-3 models

To evaluate the performance of the top-3 models, the training, validation, and evaluation curves, generated over 100,000 iterations, are compared in Fig. 15(a), (b), and (c), respectively. From these figures, the following observations can be made: (i) all models achieved a training accuracy of approximately 96% and validation and evaluation accuracies of approximately 94%; (ii) during the first 55,000 iterations, all three models showed more overfitting; (iii) all accuracies increase almost linearly up to 80,000 iterations and then become more stable.

Fig. 15
figure 15

Comparison between training, validation and evaluation accuracy of the top-3 models

As shown in Table 8, the top-3 models achieved similar \(\mathcal{F}_{o}\) and \(\mathcal{A}_{o}\) scores. Therefore, 10-fold cross-validation is performed with the top-3 models and their corresponding hyperparameters to select the best/optimal model. In each cross-validation round, different sets of training, validation, and test data are randomly selected to verify the effectiveness of the models. The accuracy comparison between the top-3 models for each validation step is shown in Fig. 16. In addition, the average cross-validation accuracy (ACV) is calculated for each model using (10).

$$ ACV=\frac{{\sum}_{i=1}^{H} A_{o_{i}}}{H} $$
(10)

where H is the number of cross-validation folds and \(A_{o_{i}}\) is the accuracy of the i-th fold.
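A minimal sketch of this procedure is given below; train_and_evaluate stands in for one full training and evaluation run of the selected model and is not part of the paper's code:

```python
import numpy as np
from sklearn.model_selection import KFold

def average_cross_validation_accuracy(dataset, train_and_evaluate, folds=10):
    """Equation (10): mean accuracy over H = `folds` cross-validation rounds."""
    accuracies = [train_and_evaluate(train_idx, test_idx)
                  for train_idx, test_idx in
                  KFold(n_splits=folds, shuffle=True).split(dataset)]
    return float(np.mean(accuracies))
```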

Fig. 16
figure 16

10-fold cross-validation accuracy of top-3 models

Figure 16 demonstrates that (i) the CNN-Arch-II and CNN-Arch-III models achieved higher accuracy than the CNN-Arch-I model in the 10-fold cross-validation, except in the 10th round; (ii) the CNN-Arch-III model achieved an ACV of 92.76%, which is higher than that of the other two models; and (iii) the ACV values of the CNN-Arch-I and CNN-Arch-II models are 91.56% and 92.69%, respectively. Considering the training, validation, evaluation, and classification results of all models, the CNN-Arch-III model achieved the best results. To further validate the superiority of the CNN-Arch-III model, we also performed additional experiments with deeper models of four, five, and six convolutional layers (see Section 4.2.2), whose results could not exceed the performance of the CNN-Arch-III model. Henceforth, all experiments are performed with the CNN-Arch-III model.

4.2.4 Effects of tuning hyperparameters

Instead of using a simple CNN architecture, we investigated the performance of CNN models with different architectures and hyperparameters on our dataset. Different sets of hyperparameters, namely \(\mathcal{CL}=\{\text{single}, \text{two-parallel}, \text{three-parallel}\}\), filter/kernel size \(\mathcal{KS}=\{16, 32, 64\}\), \(\mathcal{AF}=\{\text{ReLU}, \text{LeakyReLU}\}\), \(\mathcal{BS}=\{16, 32, 64\}\), and \(\mathcal{LR}=\{0.01, 0.001, 0.0001\}\), were used for the three CNN models in different combinations. The performance of each model strongly depends on the hyperparameter settings. For example, CNN-Arch-I has very low accuracy regardless of \(\mathcal{AF}\) or \(\mathcal{BS}\) when \(\mathcal{LR}\) is high (i.e., 0.01). Similarly, the performance of CNN-Arch-II and CNN-Arch-III varies with the hyperparameters. As shown in Figs. 9, 10, and 11, the CNN-Arch-III model achieved the highest training accuracy of approximately 96.7% at the 100th epoch when \(\mathcal{AF}\), \(\mathcal{BS}\), and \(\mathcal{LR}\) were LeakyReLU, 64, and 0.001, respectively. As shown in Figs. 12, 13, and 14, the training time increased for all models, regardless of \(\mathcal{LR}\) or \(\mathcal{AF}\), when \(\mathcal{BS}\) was set to 16. In addition, all models consumed approximately 0.60% more training time when LeakyReLU was used. Furthermore, in most cases, the models achieved better classification results, with \(\mathcal{F}_{o}\) scores up to 94.20% and \(\mathcal{A}_{o}\) up to 93.90%, when \(\mathcal{LR}\) was lowered to 0.0001 and \(\mathcal{AF}\) was LeakyReLU, as shown in Tables 5, 6, and 7. Thus, the hyperparameter settings have a significant impact on model performance.

4.3 Program code classification with the optimal CNN model

In this part of the experiment, the results of program code classification using the best model are presented. The best CNN model (CNN-Arch-III) is used for further experiments. Here, program code classification tasks are performed with datasets (A and B), and the corresponding results are presented.

4.3.1 Model performance with Dataset A

Dataset A contains a large number of program codes for various algorithms, such as tree, graph, geometry, computational theory, discrete mathematics, and data structure algorithms (see Table 1); therefore, it is more diverse than dataset B. During model training, \(\mathcal{P}_{o}\), \(\mathcal{R}_{o}\), and \(\mathcal{F}_{o}\) scores are calculated on the validation data for each category, as shown in Fig. 17. All learning curves are generated over 100,000 iterations. From Fig. 17, it is evident that (i) the model obtained relatively low \(\mathcal{P}_{o}\) values (approximately 90%) for the QDSP and COP categories, implying that a large number of false positives (FPs) occurred for both categories, whereas \(\mathcal{P}_{o}\) values of about 99% and 96% were obtained for the SPP and CGP categories; (ii) the model achieved an \(\mathcal{R}_{o}\) value of about 100% for the FNP category, whereas it scored the lowest \(\mathcal{R}_{o}\) values for the NTP and SPP categories, indicating that more false negatives (FNs) occurred; (iii) the \(\mathcal{F}_{o}\) value of each category gradually increases over successive iterations; and (iv) the model achieved the highest \(\mathcal{F}_{o}\) values of approximately 98% for the CGP and FNP categories, and the \(\mathcal{F}_{o}\) values of these two categories are more stable than those of the other categories.

Fig. 17
figure 17

Precision, Recall, and F-measure curves for each category of algorithm

The confusion matrices of \(\mathcal{P}_{o}\) and \(\mathcal{R}_{o}\) are calculated on the test data to examine how well the model learned the features of program codes of different algorithms. The numbers of test program codes for the COP, CGP, FNP, NTP, QDSP, and SPP categories are 61, 30, 22, 22, 31, and 35, respectively. The confusion matrices of \(\mathcal{P}_{o}\) and \(\mathcal{R}_{o}\) for program code classification are shown in Fig. 18(a) and (b). They indicate that the model achieves approximately 100% \(\mathcal{P}_{o}\) and \(\mathcal{R}_{o}\) values for the FNP and CGP categories, respectively.

Fig. 18
figure 18

Confusion matrix of precision (\(\mathcal {P}_{o}\)) and recall (\(\mathcal {R}_{o}\))
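Per-category precision and recall values like those in Fig. 18 can be read off a confusion matrix; the sketch below uses hypothetical labels rather than the 201 held-out codes listed above:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

CATEGORIES = ["COP", "CGP", "FNP", "NTP", "QDSP", "SPP"]

# Hypothetical true/predicted labels for illustration only.
y_true = ["COP", "CGP", "FNP", "NTP", "QDSP", "SPP", "COP", "NTP"]
y_pred = ["COP", "CGP", "FNP", "NTP", "COP",  "SPP", "COP", "SPP"]

cm = confusion_matrix(y_true, y_pred, labels=CATEGORIES)
recall = cm.diagonal() / cm.sum(axis=1)                     # per-row, eq. (7)
precision = cm.diagonal() / np.maximum(cm.sum(axis=0), 1)   # per-column, eq. (6)
print(dict(zip(CATEGORIES, np.round(precision, 2))))
```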

Table 9 shows the validation performance for each category of algorithms. The CNN-Arch-III model achieved average \(\mathcal{P}_{o}\), \(\mathcal{R}_{o}\), and \(\mathcal{F}_{o}\) scores of 95.50%, 94.80%, and 95.00%, respectively, on the validation data. In contrast, the model achieved an average \(\mathcal{P}_{o}\) score of 94.30%, \(\mathcal{R}_{o}\) score of 94.80%, and \(\mathcal{F}_{o}\) score of 94.50% during evaluation, as shown in Table 10. In this part of the experiment, per-category precision and recall as well as validation and evaluation scores are thus computed for dataset A using the CNN-Arch-III model. Given the diversity of dataset A, the overall classification results achieved with the best model are significant.

Table 9 Validation scores for each category of algorithm
Table 10 Evaluation scores for each category of algorithm

4.3.2 Model performance with Dataset B

Dataset B is also used for training, validation, and evaluation of the CNN-Arch-III model, in the same way as Dataset A. The program codes of Dataset B implement sorting algorithms; the purpose of all sorting algorithms is the same, but the way they are implemented in the codes differs. The SFs of the program codes of all sorting algorithms are used for model training, allowing the model to learn the actual features of the sorting algorithms rather than the surface form of the codes. For the evaluation, the average \(\mathcal{P}_{o}\), \(\mathcal{R}_{o}\), and \(\mathcal{F}_{o}\) are calculated for each category of sorting algorithm, as shown in Table 11. The model obtained average \(\mathcal{P}_{o}\), \(\mathcal{R}_{o}\), and \(\mathcal{F}_{o}\) scores of 97.00%, 96.90%, and 96.90%, respectively.

Table 11 Evaluation scores for each category of the sorting algorithm (Dataset B)

Comparing the performance of the model on datasets A and B, the model achieved a higher \(\mathcal{F}_{o}\) score for Dataset B (96.90%) than for Dataset A (94.50%). This is because the program codes and algorithms in Dataset A are more diverse than those in Dataset B, so the model could process and learn the features of the sorting algorithms in Dataset B more easily.

4.4 Program code classification with the LSTM and BiLSTM models

To put the classification performance of the proposed model in context, experiments with the baseline models LSTM and BiLSTM are performed on the same dataset. All results are computed at the 100th epoch; for each combination of hyperparameters, the LSTM model required an average of 3480.16 s of training. The LSTM model achieved its highest \(\mathcal{F}_{o}\) score of 82.02% and \(\mathcal{A}_{o}\) score of 83.10% with the hyperparameter set {LeakyReLU, 16, 0.001}, as shown in Table 12. Furthermore, the experimental results show that the LSTM model fails to yield significant \(\mathcal{A}_{o}\) and \(\mathcal{F}_{o}\) scores when the ReLU AF is used with any combination of \(\mathcal{BS}\) and \(\mathcal{LR}\).

Table 12 Classification results of the LSTM model

On the other hand, the BiLSTM model achieved its highest \(\mathcal{F}_{o}\) score of 84.14% and \(\mathcal{A}_{o}\) score of 84.64% with the hyperparameter set {LeakyReLU, 32, 0.001}, as shown in Table 13. For each combination of hyperparameters, this model took an average of 6084.83 s to train. In addition, the BiLSTM model achieved relatively good \(\mathcal{A}_{o}\) and \(\mathcal{F}_{o}\) scores for every combination of hyperparameters. Overall, the BiLSTM model achieved better results than the LSTM model because it processes code sequences in both directions (forward and backward).

Table 13 Classification results of the BiLSTM model
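For reference, the BiLSTM baseline can be sketched as follows; the hidden size, number of layers, and use of the final hidden states are our assumptions, since the paper only reports the shared {AF, BS, LR} grid for the baselines:

```python
import torch
import torch.nn as nn

class BiLSTMCodeClassifier(nn.Module):
    """Sketch of the BiLSTM baseline over OBM input sequences."""

    def __init__(self, num_tokens=17, hidden=128, num_classes=6):
        super().__init__()
        self.lstm = nn.LSTM(input_size=num_tokens, hidden_size=hidden,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, obm):
        # obm: (batch, seq_len, num_tokens); concatenate the last hidden
        # states of the forward and backward directions.
        _, (h_n, _) = self.lstm(obm)
        return self.fc(torch.cat([h_n[-2], h_n[-1]], dim=1))

logits = BiLSTMCodeClassifier()(torch.zeros(4, 256, 17))  # -> shape (4, 6)
```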

4.5 Comparison with baseline models

To validate the effectiveness of our CNN-based program code classification model, we compare it with different state-of-the-art models that use real-world program codes. The overall approaches, datasets, data preprocessing, and model training, validation, and evaluation procedures vary across models. Therefore, two comparisons are made: first, a comparison with the most similar tasks in different studies, as shown in Table 14, and second, a comparison with baseline models trained on the same dataset, as shown in Table 15.

Table 14 Experimental results comparison with baseline models of different studies
Table 15 Experimental result comparison with baseline models

The experimental results, datasets, number of program codes, languages, and models are considered when making comparisons with other studies, as shown in Table 14. Models such as DP-ARNN [28], RF [28], LSTM [8], and LSTM-AttM [8] are used to classify the defective source codes as either defective or non-defective (i.e., binary classification). In the binary classification, the LSTM-AttM model achieved a comparatively higher \(\mathcal {F}_{o}\) score of 94.00% than the other referenced models. The Stacked Bi-LSTM model achieved an \(\mathcal {F}_{o}\) score of about 89.24% for the multiclass classification task, which is higher than that for other models. In contrast, the proposed CNN-Arch-III model achieved a higher \(\mathcal {F}_{o}\) score of 95.70% than the other comparative multiclass classification models. In addition, the CNN-Arch-III model achieved a higher \(\mathcal {F}_{o}\) score among all classification models (binary and multiclass). Moreover, the experimental data size of our study is 61,614, which is also larger and more diverse than that of the other compared baseline classification models from different studies.

In addition, experiments are performed on the same dataset with the LSTM and BiLSTM models, as shown in Tables 12 and 13, respectively, and the results are compared with those of the proposed CNN models in Table 15. The LSTM model achieved an \(\mathcal{F}_{o}\) score of 82.02% and an \(\mathcal{A}_{o}\) score of 83.10%, the lowest among all the models, while the BiLSTM model obtained \(\mathcal{F}_{o}\) and \(\mathcal{A}_{o}\) scores of 84.14% and 84.64%, respectively, which are better than those of the LSTM model. The CNN-Arch-III model achieved \(\mathcal{F}_{o}\) and \(\mathcal{A}_{o}\) scores of 94.10% and 93.90%, respectively, which are better than those of both the LSTM and BiLSTM models. In particular, all CNN models achieved relatively better results than the LSTM and BiLSTM models under the same dataset and hyperparameters. This comparison demonstrates the advantage of the proposed CNN model in understanding the algorithmic features (or SFs) of the code, and the overall classification results of the proposed CNN models show their potential for detecting algorithms in program codes.

5 Discussion

In this section, we discuss the approach, including the performance and scalability of the model compared with other state-of-the-art models and its usefulness in programming learning and software engineering. In addition, we discuss threats to the validity of the proposed model.

5.1 Model performance analysis

In this paper, we focus on training DNN models using the algorithmic features of the code rather than its meta-information. We considered SFs as the key components of the algorithm in each solution code. A large number of practice-oriented solution codes were collected and processed for training and evaluating the model, and extensive experiments were conducted with different CNN architectures and hyperparameters. The CNN-Arch-III model achieved better training, validation, and evaluation accuracy than the other CNN models. Comparisons were also made between the CNN, LSTM, and BiLSTM models to demonstrate their classification performance. The experimental results show that DNN models recognize the algorithm in solution codes with an acceptable degree of accuracy: the CNN-Arch-III model achieved average \(\mathcal{F}_{o}\) scores of 94.5% and 96.9% for datasets A and B, respectively, in code classification. This result shows that the model classifies program codes with high accuracy without relying on meta-information.

In addition, we reviewed a large body of literature on program code classification. We found that existing studies classify codes based on various types of meta-information, including the programming language [58,59,60,61], code tags [63], errors [8, 28], and category [64, 65]. To the best of our knowledge, no previous study has considered the algorithmic (structural) features of codes in the classification task. A comparison of the proposed CNN-Arch-III model with other relevant classification methods is presented in Table 14. In contrast to these studies, this paper exploits the algorithmic (structural) features of the codes for the classification task. The experimental results (Tables 5, 6, 7, 12, and 13) show that DNN models achieve significant results using the SFs of the program codes.

5.2 Model scalability

In this study, SFs are extracted from the codes, and the CNN model is then trained to classify the program codes. The model classifies the program codes according to the category of algorithms with a high \(\mathcal {F}_{o}\) score of about 95.7%. This high accuracy demonstrates that the proposed approach, including SF extraction, OBM conversion, and training and evaluation of the best CNN model with real-world program codes, is effective. Moreover, the experiments were conducted with C++ program codes, and C++ supports procedural programming. Thus, the proposed model can also be applied to classify program codes written in other procedural languages, such as Python, Java, and C. Based on the comparisons with the baseline classification models, the proposed model (CNN-Arch-III) achieved better classification results than the others, as shown in Tables 14 and 15. The proposed model also scales to large industrial program codes, which are typically long and contain many functions and classes. Because these functions may implement different algorithms, the proposed model can be used to classify codes at the function level. Overall, the proposed code classification model can be useful and scalable for various programming-related tasks.
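As an illustration of function-level use, the sketch below applies a trained classifier to each function of a larger code base. The helpers `extract_functions`, `extract_sf_tokens`, and `encode_tokens` are hypothetical stand-ins for the SF-extraction and OBM-conversion steps and are not part of the published pipeline.

```python
# Minimal sketch (assumption): classifying a large code base one function at a time.
# `extract_functions`, `extract_sf_tokens`, and `encode_tokens` are hypothetical
# helpers representing function splitting, SF extraction, and OBM-style encoding.
import numpy as np

def classify_functions(source_code, model, extract_functions,
                       extract_sf_tokens, encode_tokens, class_names):
    """Return one predicted algorithm label per function in `source_code`."""
    results = []
    for name, body in extract_functions(source_code):   # e.g., [("solve", "..."), ...]
        tokens = extract_sf_tokens(body)                 # structural-feature tokens
        x = encode_tokens(tokens)                        # fixed-length numeric input
        probs = model.predict(np.expand_dims(x, 0), verbose=0)[0]
        results.append((name, class_names[int(np.argmax(probs))]))
    return results
```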

5.3 Model usage in programming learning

One of our research objectives is to help programmers learn programming in real-world environments, and the proposed model has been developed from this viewpoint. The experimental results indicate that the present study can be useful for programming learning. A considerable amount of program code is regularly generated from various sources, such as academia, industry, and programming platforms including OJ systems. However, programmers often find it challenging to identify the algorithms in reference program codes while learning from and searching among a large number of codes. Knowing the algorithm implemented in a program code can therefore help programmers better understand the code and accelerate their learning progress. The proposed code classification model can effectively assist programmers in identifying the algorithms contained in program codes. Moreover, the model can be integrated with various real-world programming learning platforms, including OJ systems.

5.4 Model usage in software engineering

Repositories of real-world program codes play a key role in building effective ML models for SE. ML models are applied in various areas of SE, such as strategic decision making, rapid prototyping, design and analysis, bug detection, code review, bug fixing, code reuse, and intelligent programming assistants (IPAs). ML-enabled IPA systems can also provide the most relevant code examples, best practices, and related texts as just-in-time support. As a result, the importance of ML models in software development and their application in SE is increasing significantly [14, 75]. The proposed CNN model classifies program codes by identifying the algorithms they contain. Therefore, this model can also be used, directly or indirectly, for various SE tasks such as code review, bug detection, code-example retrieval, and code refactoring. In particular, the proposed model can serve as a supporting component of other ML models in SE that deal with the SFs of program codes.
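One hedged way to realize this supporting role is to feed the classifier's predicted algorithm distribution into a downstream SE model as an additional feature. The sketch below assumes a trained Keras-style classifier; `classifier`, `encoded_sf`, and `other_code_features` are illustrative names, not components of the published system.

```python
# Minimal sketch (assumption): using the algorithm-class probabilities predicted by
# the proposed classifier as an extra feature vector for a downstream SE model.
# `classifier`, `encoded_sf`, and `other_code_features` are hypothetical placeholders.
import numpy as np

def augment_features(classifier, encoded_sf, other_code_features):
    """Concatenate predicted algorithm probabilities with existing code features."""
    algo_probs = classifier.predict(np.expand_dims(encoded_sf, 0), verbose=0)[0]
    return np.concatenate([other_code_features, algo_probs])
```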

5.5 Threats to validity

This study applied several novel ideas, from data preprocessing to model development, and the model achieved significant classification results in the experiments. However, the performance of the proposed model may be affected by the following threats: (i) variation in the list of feature tokens for other programming languages; (ii) different data preprocessing strategies; (iii) different sets of programming problems; (iv) problem sets in other programming languages, such as C, Python, Java, and C#; and (v) different hyperparameter values and CNN architectures.

In follow-up work, we plan to validate the model’s performance by addressing the above-mentioned threats.

6 Conclusion and future work

We developed CNN models to classify program codes based on the identified algorithms. Real-world program codes were collected from the AOJ system and used in all experimental tasks. The SFs of the program codes were extracted to train the CNN models and were converted to OBM through several processing steps. Different hyperparameters, such as \(\mathcal {C}{\mathscr{L}}\), \({\mathscr{L}}\mathcal {R}\), \(\mathcal {A}\mathcal {F}\), and \({\mathscr{B}}\mathcal {S}\), were used in the CNN models in different combinations. The top-3 CNN models and their hyperparameters were selected based on the best experimental results. In addition, 10-fold cross-validation was performed to select the most suitable (topmost) CNN model and hyperparameters for further experiments. Subsequently, all experiments with the best CNN model were performed on both datasets (A and B). The model achieved significant classification results for both datasets: average \(\mathcal {P}_{o}\), \(\mathcal {R}_{o}\), and \(\mathcal {F}_{o}\) scores of 94.30%, 94.80%, and 94.50%, respectively, for Dataset A, and 97.00%, 96.90%, and 96.90%, respectively, for Dataset B. Furthermore, the performance of the proposed CNN model was compared with those of other baseline models, and the results indicate that the proposed model outperforms the referenced models. The results also show that the proposed model is scalable in classifying program codes of diverse algorithms and can be useful for classifying program codes of other procedural programming languages, such as C, Java, Python, and C#.
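For completeness, the sketch below shows one standard way to run the 10-fold cross-validation used for model selection. It assumes scikit-learn for the fold splitting and a hypothetical `build_model` factory that returns a freshly compiled Keras model; it is not the paper's exact experimental script.

```python
# Minimal sketch (assumption): 10-fold cross-validation for selecting a CNN
# configuration. `build_model` is a hypothetical factory returning a compiled
# Keras model; X and y stand for the encoded SF inputs and algorithm labels.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validate(build_model, X, y, n_splits=10, epochs=20, batch_size=64):
    """Mean and std of validation accuracy over stratified folds."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = []
    for train_idx, val_idx in skf.split(X, y):
        model = build_model()                      # fresh weights for each fold
        model.fit(X[train_idx], y[train_idx],
                  epochs=epochs, batch_size=batch_size, verbose=0)
        # assumes the model was compiled with metrics=["accuracy"]
        _, acc = model.evaluate(X[val_idx], y[val_idx], verbose=0)
        scores.append(acc)
    return float(np.mean(scores)), float(np.std(scores))
```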

In the future, the code block sequence of program codes, instead of SFs, can be considered to investigate the model performance. Moreover, a multi-label classification model can be developed to classify program codes associated with multiple algorithm labels. In addition, the model can be applied to evaluate large-scale industrial program codes.
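As a pointer toward the multi-label direction, the following sketch shows the usual modification: independent sigmoid outputs with a binary cross-entropy loss, so that one code can carry several algorithm labels. The layer sizes, feature dimension, and decision threshold are illustrative assumptions, not part of the present model.

```python
# Minimal sketch (assumption): a multi-label output head. Independent sigmoid
# units replace the single softmax, so each algorithm label gets its own
# probability. Sizes below are illustrative placeholders.
from tensorflow.keras import layers, models

NUM_CLASSES = 20   # number of algorithm labels (assumed)

multi_label_head = models.Sequential([
    layers.Input(shape=(128,)),                        # features from a shared CNN body (assumed)
    layers.Dense(64, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="sigmoid"),   # one independent probability per label
])
multi_label_head.compile(optimizer="adam", loss="binary_crossentropy")

# At inference, any label whose probability exceeds a threshold (e.g., 0.5) is assigned.
```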