1 Introduction

The rapid growth of multimedia information, driven by the development of Internet technology, has surpassed humans’ capacity to effectively process and utilize information within a brief period. To address this challenge, recommender systems (RS) have emerged as a valuable tool for assisting users in obtaining the relevant information they need. RS have been applied in various fields, such as news [1], movies [2], and products [3].

The core task of a recommender system is to estimate the probability that a user is interested in an item based on his or her historical interactions with items. Most existing recommendation techniques use embeddings to model the interaction between users and items, for example via the inner product [4] or neural networks [5]. Collaborative filtering (CF) uses past user-item interactions to make recommendations and has received significant attention in recent years [4, 6,7,8,9,10,11]. However, collaborative filtering-based recommendation methods suffer from data sparsity and cold-start problems.

Recently, knowledge graphs (KGs) have been successfully integrated into recommender systems, where the relational connections in the knowledge graph optimize the vector representations of entities and thereby improve recommendation performance [12, 13]. A knowledge graph is a heterogeneous graph composed of nodes and the relations between them, representing knowledge and information about the real world in graphical form. Introducing KGs into recommender systems provides several benefits [14]: (1) leveraging the rich feature attributes of items in the KG can improve recommendation performance while uncovering potential connections between items; (2) using the different relations in the KG, the potential interests of users can be explored and the diversity of recommended items improved.

Graph convolutional networks (GCNs) have garnered considerable attention in recent years, with applications in several domains [15,16,17], and are widely used in recommender systems [18,19,20]. A GCN extracts collaborative information about items and users through aggregation and propagation, yielding user-item embeddings better suited for recommendation; moreover, higher-order connectivity signals can be captured by stacking multiple embedding propagation layers. Knowledge graph attention network (KGAT) [21] improves recommendation performance by training a global knowledge graph and iteratively propagating vector representations of adjacent nodes while updating the node representations through GCN. LightGCN [22] simplifies the application of GCN in recommender systems, improving recommendation performance while reducing model complexity. By leveraging the nodes and edges within the graph, GCN-based methods infer user interests and item attributes, thereby enhancing the accuracy and efficacy of personalized recommendations. Although GCNs can enhance recommendation performance, further research is needed into their component methods and structures.

Despite the noteworthy successes of KG-based and GCN-based recommendation approaches, three major problems persist: (1) considering only the connections between users or items and entities in the knowledge graph overlooks the high-order semantic information carried by the relations between them; (2) focusing solely on extracting item feature information from the knowledge graph with GCN may neglect the collaborative signals between users and items; (3) directly applying GCN’s complex design elements (such as feature transformations and nonlinear activations) to recommendation yields only marginal improvements in performance while increasing computational overhead [23].

Given the limitations of the aforementioned methods, it is imperative to create a method that can effectively unearth informative signals from the interaction graph and the item knowledge graph in an intuitive and efficient manner. This research tackles this challenge by introducing knowledge-aware recommendation under bi-layer graph convolutional networks (BIKAGCN). The model consists of two main components:

\(\bullet \) Light graph convolution (LGC). In LGC, self-connections, feature transformations, and non-linear activation functions are removed, which simplifies the GCN to a large extent and enhances its ability to extract collaborative signals from the user-item matrix.

\(\bullet \) KAGCN. This research improves the aggregation layer of KGCN [24] into a knowledge-aware layer of attention and graph convolutional networks (KAGCN). It uses the attention mechanism designed in this paper to mine the importance of specific user-relation pairs during information aggregation, by which the personalized interests of users can be captured. In addition, we incorporate a user self-attention mechanism that captures differences among users and enhances their semantic information. The incorporation of a knowledge graph into recommender systems is particularly crucial for addressing the needs of cold-start users. These users lack sufficient (or any) historical behavioral data in the system, making it challenging for traditional personalized recommendation methods to provide accurate suggestions. In recent years, a substantial body of research has been dedicated to mitigating the cold-start problem, ensuring that even new users or new items can receive valuable recommendations and thereby enhancing the comprehensiveness and utility of recommender systems [25,26,27].

In summary, our research’s contributions can be outlined as follows:

  1. In this paper, we propose the BIKAGCN model, which uses the first GCN layer to learn the collaborative information between users and items and update the user-item embeddings, and then feeds the updated user and item embeddings into the second GCN layer.

  2. By integrating the bi-layer GCN to obtain the final user-item embedding representations, our approach effectively alleviates the cold-start problem caused by relying solely on user-item interactions, as well as the lack of collaborative information in the item knowledge graph.

  3. Experimental results on three publicly available datasets (MovieLens-20M, Last-FM, and Book-Crossing) demonstrate that BIKAGCN improves recall@20 by 14.41%, 8.86%, and 20.90%, respectively, and normalized discounted cumulative gain (NDCG@20) by 15.07%, 18.82%, and 22.79%, respectively.

2 State of the art

In this section, we review and summarize related work, including collaborative filtering-based, knowledge graph-based, and GCN-based recommendations.

Collaborative filtering is among the earliest and most widely adopted recommendation methods, predicting and suggesting new items based on user preferences [28]. MF [6] achieves good recommendation performance by mapping user- and item-related features to embeddings. FM [29] and the field-aware factorization machine (FFM) [30] utilize multiple features and their linear combinations to enhance performance, but they consider only low-order feature combinations. More advanced methods therefore integrate deep neural networks to model feature combinations; some distinguished models [31, 32] do so and exhibit excellent recommendation performance. However, these methods use only the user-item matrix to define the loss function and fail to explicitly encode users and items, which leads to insufficient utilization of collaborative signals.

Collaborative filtering-based recommendation methods suffer from data sparsity, cold-start, and interpretability issues. To address these problems, knowledge graph-aware recommendation methods have emerged, which fall mainly into two categories: recommendation based on knowledge graph embedding (KGE) and recommendation based on knowledge graph paths. KGE-based recommendation algorithms [33] directly feed the learned item embeddings into the recommender system, thereby improving recommendation performance and alleviating cold-start problems. However, KGE methods may prove insufficient for recommendation, as they lack intuitiveness and effectiveness in representing item relationships. Conversely, path-based methods [34, 35] explore higher-order latent entity relationships within a knowledge graph by designing specialized meta-paths or meta-graphs for recommendation. Although the path-based approach enhances the interpretability of recommender systems and provides concrete ways to exploit the feature information in the knowledge graph, the meta-paths/graphs it designs may be applicable only to specific domains and have limited generality.

LightGCN [22] presents a simplified graph convolutional network design that enhances its applicability to recommendation. GCN-based recommendation models consider users and items as well as their associated reviews [36] and social networks [20] to enhance the accuracy and interpretability of recommendations. In addition, KGCN [24] uses GCN to extract the attributes shared between items in the knowledge graph with the aim of enhancing recommendation performance, while KGNN-LS [37] applies label smoothing to address the overfitting problem of KGCN. KGAT [21] embeds a high-level knowledge graph for modeling, and the recent KLGCN [38] explores the interactions between users and relations in the knowledge graph, as well as between items and relations. Although GCN-based approaches have improved recommendation performance, they still inadequately exploit the knowledge graph for personalized item-relationship mining and neglect user semantics. In this research, the user-item interaction matrix and the item knowledge graph are used as inputs, and relevant features are extracted by a bi-layer GCN, which accelerates model convergence and improves recommendation performance. For further comparison, Table 1 summarizes our proposed model alongside the most relevant methods.

Table 1 Comparison of BIKAGCN with related methods

The full article is organized as follows: Sect. 1 presents the current research dilemma and outlines the research methodology employed in this paper. Section 2 introduces related technologies and backgrounds. Section 3 describes the recommendation problem and presents model details and the computation process. Section 4 analyzes the experimental results and, through ablation experiments, validates the reasonableness of the model components and the model’s outstanding performance in alleviating cold-start problems. Finally, Sect. 5 summarizes the paper and analyzes possible future improvements.

3 Methodology

3.1 Description of the Problem

In the context of recommendation, we consider a set of k users \(m=\left\{ u_1,u_2,\ldots ,u_k\right\} \) and a set of n items \(p=\left\{ v_1,v_2,\ldots ,v_n\right\} \). A set \(R=\left\{ \textrm{r}_{u,i}\right\} \) represents the interactions between users and items, where \(\textrm{r}_{u,i}=1\) indicates that an interaction exists between user u and item i. A user-item bipartite graph G = {V, E} is then constructed from the historical user-item interaction data, as depicted in Fig. 1. Each user and item is viewed as a node of the bipartite graph, while edges are established between user nodes and item nodes with interaction records. Here, \(M_{m\times n}\) denotes the adjacency matrix of the bipartite graph. The recommendation task is to predict the probability of a user interacting with each item and, in general, to select the Top-N items with the highest predicted probability as the recommendation result.
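For illustration, the following minimal sketch builds the adjacency matrix \(M_{m\times n}\) of the bipartite graph from a list of observed interactions; the function and variable names are ours, not from the paper’s implementation.

```python
import numpy as np

def build_adjacency(interactions, num_users, num_items):
    """Build the user-item adjacency matrix M (m x n): M[u, i] = 1
    iff r_{u,i} = 1, i.e., user u has interacted with item i."""
    M = np.zeros((num_users, num_items), dtype=np.int8)
    for u, i in interactions:
        M[u, i] = 1
    return M

# Example: 3 users, 4 items, 5 observed interactions.
M = build_adjacency([(0, 1), (0, 3), (1, 0), (2, 2), (2, 3)], 3, 4)
```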

Fig. 1 A bipartite graph of user-item interactions constructed from the U-I matrix

3.2 Overall Model Description

The overall framework of the BIKAGCN model is shown in Fig. 2. It consists of three parts: an initialization layer, a bi-layer graph convolutional information propagation layer, and a prediction layer.

Fig. 2 The architecture of BIKAGCN comprises three primary layers: an embedding layer, a bi-layer graph convolutional information propagation layer, and a prediction layer. The figure displays a recommendation scenario between \(\mathrm {i_2}\) and \(\mathrm {u_2}\)

The recommender model in this research comprises several operational steps. First, user nodes, item nodes, entity nodes, and relation nodes are initialized with low-dimensional dense embedding representations. Second, through the LGC layer and its information propagation rules, the initial embeddings of users and items are updated to extract the collaborative information between them. Third, the KAGCN layer aggregates the initialized entity and relation embeddings with the updated user-item embeddings using aggregation functions to obtain entirely new user-item embeddings. Unlike many GCN-based recommendation methods, the KAGCN layer employs different attention mechanisms to separately update the embeddings of users and items, thus acquiring biased user-item embedding representations. Fourth, by stacking the aforementioned operations multiple times, multiple high-order embeddings for users and items are obtained; this stacking enhances the model’s representation capacity and abstraction levels, enabling it to better capture complex relationships and features. The final user-item embeddings are then generated by weighting these high-order embedding representations. Finally, the model employs a simple dot product between the vectors to generate the ultimate recommendation prediction probabilities. The bi-layer graph convolution structure of BIKAGCN is shown in Fig. 3.

Fig. 3 An illustration of the bi-layer GCN architecture. In the information aggregation process, LGC is first used to mine collaborative information from the user-item interaction matrix and update the user-item embedding representations; KAGCN is then used to mine the item knowledge graph information and update the user-item embedding representations again

3.3 Initial Embedding Layer

To attain a suitable final representation, an initial embedding vector \(e_u \in \mathbb {R}^d\) is assigned to each user and \(e_i \in \mathbb {R}^d\) to each item. Additionally, initial embedding vectors \(e_e\) and \(e_r\) are designated for the entities and relations, where \(e_e, e_r \in \mathbb {R}^d\):

$$\begin{aligned} E=\{\underbrace{e_{u_{1}}, \ldots , e_{u_{N}}}_{\text{ users } \text{ embeddings } }, \underbrace{e_{i_{1}}, \ldots , e_{i_{M}}}_{\text{ items } \text{ embeddings } }, \underbrace{e_{e_{M+1}}, \ldots , e_{e_{K}}}_{\text{ non-items/entities }}, \underbrace{\left. e_{r_{1}}, \ldots , e_{r_{Q}}\right\} }_{\text{ relations } \text{ embeddings } } \end{aligned}$$
(1)

The method proposed in this paper achieves end-to-end optimization of the initial embeddings. Compared with traditional methods, it passes the embeddings directly to the information aggregation part and mines the collaborative signals in the user-item graph via LGC, thus improving the performance of the recommender system. In addition, the cold-start problem is mitigated by aggregating information from the item knowledge graph through KAGCN.

3.4 Information Aggregation Layer

We first introduce the LGC layer and then the KAGCN layer proposed in this paper, with the emphasis on the KAGCN part.

3.4.1 The First Layer of GCN–LGC

LightGCN streamlines the GCN architecture by removing the feature transformation and nonlinear activation from the propagation process. Instead, a summation aggregator generates the target node embeddings at each layer using Formula (2):

$$\begin{aligned} {e}_{u}^{(l+1)}&=\sum _{i \in \mathcal {N}_{u}} \frac{1}{\sqrt{|\mathcal {N}_{u}||\mathcal {N}_{i}|}} {e}_{i}^{(l)} \nonumber \\ {e}_{i}^{(l+1)}&=\sum _{u \in \mathcal {N}_{i}} \frac{1}{\sqrt{|\mathcal {N}_{u}||\mathcal {N}_{i}|}} {e}_{u}^{(l)} \end{aligned}$$
(2)

Here, \({e}_u^{(l)}\) and \({e}_i^{\left( l\right) }\) denote the \(l_{th}\)-layer embeddings. \(\vert \mathcal {N}_u\vert \) indicates the number of items user u has interacted with, and \(\left| \mathcal {N}_i\right| \) the number of users who have interacted with item i. The symmetric normalization term \(\frac{1}{\sqrt{\left| \mathcal {N}_u\right| \left| \mathcal {N}_i\right| }}\) prevents the embedding scale from exploding.
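A minimal NumPy sketch of the LGC propagation rule in Formula (2) is given below; it is an illustrative reading of the formula, not the authors’ code.

```python
import numpy as np

def lgc_propagate(user_emb, item_emb, interactions):
    """One LGC layer (Formula 2): pure neighbor averaging with symmetric
    sqrt-degree normalization, no feature transform, no nonlinearity.
    `interactions` is a list of (u, i) index pairs with r_{u,i} = 1."""
    deg_u = np.zeros(user_emb.shape[0])
    deg_i = np.zeros(item_emb.shape[0])
    for u, i in interactions:
        deg_u[u] += 1
        deg_i[i] += 1
    new_u = np.zeros_like(user_emb)
    new_i = np.zeros_like(item_emb)
    for u, i in interactions:
        norm = 1.0 / np.sqrt(deg_u[u] * deg_i[i])
        new_u[u] += norm * item_emb[i]  # e_u^{(l+1)} accumulates e_i^{(l)}
        new_i[i] += norm * user_emb[u]  # e_i^{(l+1)} accumulates e_u^{(l)}
    return new_u, new_i
```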

3.4.2 The Second Layer of GCN–KAGCN

Generally, users possess distinct and individualized interests with respect to relations. The attention mechanism, by dynamically adjusting recommendation results and highlighting the content most relevant to a user’s interests, can better cater to a user’s unique needs [43,44,45,46], enhancing the precision and personalization of recommendations. For instance, one user may be more inclined towards a movie’s genre, while another may prioritize its cast. Consequently, our model incorporates a specifically tailored attention mechanism that accounts for each user’s unique interests during the aggregation process.

To account for the varying importance of different relations to different users, we introduce an attention function \(g:\mathbb {R}^d \times \mathbb {R}^d \rightarrow \mathbb {R}\), as shown in Eq. (3). The attention function evaluates the relative importance of each relation associated with the central node; \({e}_{u}^{(l)}\) and \({e}_{i}^{(l)}\) are both outputs from LGC.

$$\begin{aligned} \omega _{r_{ei}}^u&=g\left( {e}_{u}^{(l)},r_{ei}\right) \nonumber \\ \omega _{r_{ei}}^i&=g\left( {e}_{i}^{(l)},r_{ei}\right) \nonumber \\ \omega _{r_{ei}}^{ui}&=\alpha \omega _{r_{ei}}^u+\beta \omega _{r_{ei}}^i \end{aligned}$$
(3)

where \(r_{ei}\) represents the relation between entity e (\(e\in \mathcal {N}_{i}\)) and item i, and \(\mathcal {N}_{i}\) represents the neighboring entities of item i in the knowledge graph. Because real-world knowledge graphs contain large numbers of entities and connections, we adopt a strategy similar to GraphSage [47] to keep the computation pattern fixed and efficient. Specifically, a fixed-size set of K neighbor nodes is sampled uniformly at random to serve as local neighbors in place of the full neighborhood. This yields a new neighbor set \(S_i\) of size K, with the potential for duplicate entries when \(|\mathcal {N}_{i}|<K\). The final attention score is the weighted sum of the attention weights \(\omega _{r_{ei}}^u\) and \(\omega _{r_{ei}}^i\), with \(\alpha =\beta =0.5\) in the experiments.
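The fixed-size neighbor sampling described above can be sketched as follows; allowing duplicates only when \(|\mathcal {N}_i|<K\) is our reading of the text.

```python
import numpy as np

def sample_neighbors(neighbor_ids, K, rng=None):
    """Uniformly sample a fixed-size neighbor set S_i of size K.
    Duplicates are allowed when the true neighborhood is smaller than K."""
    rng = rng or np.random.default_rng()
    neighbor_ids = np.asarray(neighbor_ids)
    replace = len(neighbor_ids) < K
    return rng.choice(neighbor_ids, size=K, replace=replace)
```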

We generate the neighborhood representation \(e_{S_i}\) of the target entity by aggregating neighbor embeddings, as shown in Formula (4).

$$\begin{aligned} {e}_{S_{i}}^{(l+1)}=\sum _{e \in S_{i}} \tilde{\omega }_{r_{e i}}^{u i} {e}_{e}^{(l)} \end{aligned}$$
(4)

where \(\tilde{\omega }_{r_{ei}}^{ui}\) represents the normalized user-item relational attention score, meaning that the target node attends more strongly to neighboring nodes with higher relational attention scores. \(\tilde{\omega }_{r_{ei}}^{u i}\) is defined by Formula (5):

$$\begin{aligned} \tilde{\omega }_{r_{e i}}^{u i}=\frac{\exp {\left( \omega _{r_{ei}}^{ui}\right) }}{\sum _{e^\prime \in S_i}{\exp {\left( \omega _{r_{e^\prime i}}^{ui}\right) }}} \end{aligned}$$
(5)
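Formulas (3)-(5) can be summarized in the following sketch. The paper leaves the attention function g abstract, so a dot product is assumed here purely for illustration.

```python
import numpy as np

def kagcn_aggregate(e_u, e_i, neighbor_emb, relation_emb, alpha=0.5, beta=0.5):
    """Relation-aware neighborhood aggregation (Formulas 3-5).
    neighbor_emb: (K, d) embeddings of the sampled neighbors in S_i.
    relation_emb: (K, d) embeddings of the relations r_ei to each neighbor."""
    w_u = relation_emb @ e_u           # omega^u: g(e_u, r_ei), g assumed dot product
    w_i = relation_emb @ e_i           # omega^i: g(e_i, r_ei)
    w_ui = alpha * w_u + beta * w_i    # combined user-item relation score
    w_ui = np.exp(w_ui - w_ui.max())
    att = w_ui / w_ui.sum()            # softmax normalization (Formula 5)
    return att @ neighbor_emb          # e_{S_i} (Formula 4)
```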

The self-attention mechanism is a widely applied technique in deep learning and natural language processing [48]. Its fundamental idea is to assign a weight to each position in an input sequence according to its importance and then amalgamate the information from all positions. Self-attention can capture interdependencies among data, facilitating enhanced user perception and information fusion. This paper adds a self-attention mechanism on the user side. First, the query, key, and value vectors are computed from the user embedding:

$$\begin{aligned} Q_{u}^{(l)}=W^{q}\cdot e_{u}^{(l)} \nonumber \\ K_{u}^{(l)}=W^k\cdot e_{u}^{(l)} \nonumber \\ V_{u}^{(l)}=W^v\cdot e_{u}^{(l)} \end{aligned}$$
(6)

Next, the attention score between the query \(Q_{u}^{(l)}\) and the key \(K_{u}^{(l)}\) is calculated:

$$\begin{aligned} {\omega }_{u^{\prime }u}^{u}=F\left( Q_{u}^{(l)},K_{u}^{(l)}\right) \end{aligned}$$
(7)

The scores are then normalized, which in attention mechanisms is usually achieved with the softmax function:

$$\begin{aligned} \tilde{\omega }_{u^{\prime }u}^{u}=\frac{\exp {\left( {\omega }_{u^{\prime }u}^{u}\right) }}{\sum _{u^{\prime } \in S_{u}} {exp({\omega }_{u^{\prime }u}^{u})}} \end{aligned}$$
(8)

Finally, the attention weights are used to form a weighted sum over the value vectors \(V_{u}^{(l)}\):

$$\begin{aligned} {e}_{S_{u}}^{(l+1)}=\sum _{u^{\prime } \in S_{u}} \tilde{\omega }_{u^{\prime } u}^{u} V_{u}^{(l)} \end{aligned}$$
(9)
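The user-side self-attention of Formulas (6)-(9) can be sketched as below; the score function F is assumed to be a dot product, and no scaling factor is applied since the paper does not specify one.

```python
import numpy as np

def user_self_attention(user_embs, Wq, Wk, Wv):
    """Self-attention over a set S_u of user embeddings (rows of user_embs).
    Wq, Wk, Wv are the trainable projections W^q, W^k, W^v (Formula 6)."""
    Q = user_embs @ Wq.T                # Q_u^{(l)}
    K = user_embs @ Wk.T                # K_u^{(l)}
    V = user_embs @ Wv.T                # V_u^{(l)}
    scores = Q @ K.T                    # omega_{u'u} = F(Q, K) (Formula 7)
    scores = np.exp(scores - scores.max(axis=1, keepdims=True))
    att = scores / scores.sum(axis=1, keepdims=True)  # Formula (8)
    return att @ V                      # e_{S_u} per user (Formula 9)
```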

3.4.3 Aggregation Functions

The information propagation phase aggregates the entity representation \(e_{i}^{(l)}\) and its corresponding neighborhood representation \(e_{S_{i}}^{(l)}\) into a new entity representation at layer l. To integrate additional neighborhood semantic information, we explore two approaches to aggregating \(e_{i}^{(l)}\) and \(e_{S_{i}}^{(l)}\) into the final entity representation:

(1) GCN-s aggregator. We remove the feature transformation and non-linear activation function of the original GCN aggregator [49] to obtain the final entity representation, as shown in Formula (10):

$$\begin{aligned} f_{\textrm{GCN}-\textrm{s}}=e_{i}^{(l)}+e_{S_{i}}^{(l)} \end{aligned}$$
(10)

(2) Neighbor-s aggregator. We remove the feature transformation and non-linear activation function of the original Neighbor aggregator [50] to get the final entity representation. As shown in Formula (11):

$$\begin{aligned} f_{\text{ Neighbor-s } }=e_{S_{i}}^{(l)} \end{aligned}$$
(11)
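Both aggregators reduce to one-line operations; the sketch below simply states them explicitly.

```python
def gcn_s_aggregator(e_i, e_Si):
    """Formula (10): self representation plus neighborhood representation,
    without feature transformation or non-linear activation."""
    return e_i + e_Si

def neighbor_s_aggregator(e_i, e_Si):
    """Formula (11): keep only the neighborhood representation."""
    return e_Si
```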

In Algorithm 1, we summarize the entire flow of KAGCN, including the inputs and outputs, as well as the intermediate steps.

Algorithm 1 Embedding generation process of KAGCN

3.5 Prediction Layer

After stacking the bi-layer GCN L times, L+1 vector representations of each item (\(\left\{ e_i^{\left( 0\right) },\cdots ,e_i^{\left( L\right) }\right\} \)) and of each user (\(\left\{ e_u^{\left( 0\right) },\cdots ,e_u^{\left( L\right) }\right\} \)) are obtained. The embeddings of all layers are merged in a weighted sum to obtain the final vector representations of users and items, as shown in Formula (12):

$$\begin{aligned} {e}_{u}^{*}=\sum _{l=0}^{L} \alpha _{l} {e}_{u}^{BIKAGCN(l)} \nonumber \\ {e}_{i}^{*}=\sum _{l=0}^{L} \alpha _{l} {e}_{i}^{BIKAGCN(l)} \end{aligned}$$
(12)

where \(\alpha _l\) represents the weight of the \(l_{th}\)-layer vector in generating the final vector; in the experiments it is set as the constant \(1/\left( l+1\right) \).

Finally, we perform inner product operations on the representations of the user and the item to predict the scores of their interactions:

$$\begin{aligned} \hat{y}\left( u,i\right) ={e_u^{*}}^T{e_i}^{*} \end{aligned}$$
(13)

The expression \(\hat{y}\left( u,i\right) \) represents the probability that user u will interact with item i. When this probability is greater than or equal to 0.5, we consider that the user will interact with the item.
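Formulas (12) and (13) together amount to the following sketch (names illustrative), using the paper’s constant weights \(\alpha _l=1/(l+1)\):

```python
import numpy as np

def predict(user_layer_embs, item_layer_embs):
    """Weighted layer combination (Formula 12) followed by an inner
    product (Formula 13). Each argument is a list of L+1 vectors."""
    e_u = sum(e / (l + 1) for l, e in enumerate(user_layer_embs))
    e_i = sum(e / (l + 1) for l, e in enumerate(item_layer_embs))
    return float(e_u @ e_i)  # \hat{y}(u, i)
```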

3.6 Loss Function

The BIKAGCN model contains a large number of trainable parameters, which creates a potential overfitting problem. \(L_2\) regularization can alleviate this problem to some extent.

The network parameters of the whole model are learned with the help of the BPR loss function [51]. The loss function is defined as follows:

$$\begin{aligned} \mathcal {L}=-\sum _{(u, i, j) \in \mathcal {O}} \ln \sigma \left( \hat{y}_{u i}-\hat{y}_{u j}\right) +\lambda \Vert \Theta \Vert _{2}^{2} \end{aligned}$$
(14)

where \(\mathcal {O}=\left\{ (u,i,j)\mid \left( u,i\right) \in R^+,\left( u,j\right) \in R^-\right\} \) denotes the training set of pairwise samples, with \(R^+\) the observed (positive) interactions and \(R^-\) the sampled negative interactions; \(\Theta \) is the set of all model parameters, and \(\lambda \Vert \Theta \Vert _{2}^{2}\) is the \(L_2\) regularization term.
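A numerically stable sketch of the BPR objective in Formula (14) follows; the triple sampling itself is outside the snippet.

```python
import numpy as np

def bpr_loss(pos_scores, neg_scores, params, lam=1e-4):
    """Formula (14): -sum ln(sigma(y_ui - y_uj)) + lambda * ||Theta||_2^2.
    pos_scores/neg_scores are \hat{y}_{ui}, \hat{y}_{uj} over sampled
    triples (u, i, j); params is the list of parameter arrays Theta."""
    x = np.asarray(pos_scores) - np.asarray(neg_scores)
    log_sigma = -np.logaddexp(0.0, -x)  # stable ln(sigma(x))
    l2 = lam * sum(np.sum(p ** 2) for p in params)
    return -np.sum(log_sigma) + l2
```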

To describe the computational process of BIKAGCN visually, it is summarized in Algorithm 2.

Algorithm 2 The overall recommendation process of BIKAGCN

4 Experimentation and analysis

This section presents the recommendation performance of BIKAGCN in three real recommendation scenarios (MovieLens-20M, Last-FM, and Book-Crossing), including comparative experiments, performance in mitigating the cold-start problem, and ablation experiments.

4.1 Experimental Data and Metrics

The performance of all models was evaluated on three real recommendation datasets (MovieLens-20M, Book-Crossing and Last-FM). These datasets have widespread availability and are extensively employed by researchers. With their variability in both size and sparsity, these datasets are deemed appropriate for conducting comprehensive and robust evaluations of the performance of BIKAGCN across varying dataset sizes.

MovieLens-20M is a widely used benchmark dataset. It includes nearly 20 million explicit historical rating records from the MovieLens website.

Last-FM consists of nearly 93,000 musician listening records from 2,000 users of the Last.fm online music system.

Book-Crossing is a dataset containing book ratings and reviews, with about 1 million rating records.

In view of the datasets’ explicit feedback nature, we converted them into implicit feedback by categorizing positive ratings as 1 and negative ratings as 0. For the MovieLens-20M dataset, we considered ratings of 4 or higher as positive, while for the other datasets, we did not set any specific threshold. Our selection criterion for each dataset required the inclusion of users with a minimum of ten interactions to ensure data quality.

In addition to the user-item interaction records, we also require a corresponding item knowledge graph. We utilize the item knowledge graphs provided by KGCN [24], which were constructed using Microsoft Satori. Summary statistics of the three datasets and their associated knowledge graphs are shown in Table 2.

Table 2 Statistics of datasets

Recall: the proportion of the items a user truly interacted with that appear among the Top-N recommended items, computed as shown in Formula (15):

$$\begin{aligned} recall@k =\frac{\sum _{u \in U}|R(u) \cap T(u)|}{\sum _{u \in U}|T(u)|} \end{aligned}$$
(15)

where R(u) represents the recommendation list generated by the model for user u, while T(u) denotes the set of items with which user u has genuinely interacted in the test dataset.

NDCG: it considers not only the proportion of correctly recommended items but also the positions of the items of interest within the recommendation list. A higher value indicates better performance, as shown in Formula (16):

$$\begin{aligned} ndcg@k =\frac{1}{n} \sum _{i=1}^{k} \frac{2^{\textrm{rel}_{i}}-1}{\log _{2}(i+1)} \end{aligned}$$
(16)

\(rel_i\) represents the user’s rating for the i-th item, and \(\log _2(i+1)\) is a position-dependent decreasing weight.
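For concreteness, the two metrics can be computed per user as sketched below; the binary-relevance reading of \(rel_i\) and the ideal-DCG normalization are our assumptions, not spelled out in the paper.

```python
import numpy as np

def recall_at_k(ranked_items, true_items, k=20):
    """Per-user numerator/denominator of Formula (15): hits in the Top-k
    list divided by the number of test interactions."""
    hits = len(set(ranked_items[:k]) & set(true_items))
    return hits / max(len(true_items), 1)

def ndcg_at_k(ranked_items, true_items, k=20):
    """Formula (16) with binary relevance (rel_i = 1 on a hit), divided by
    the ideal DCG so scores lie in [0, 1]."""
    truth = set(true_items)
    dcg = sum((2 ** (1.0 if it in truth else 0.0) - 1) / np.log2(pos + 2)
              for pos, it in enumerate(ranked_items[:k]))
    idcg = sum(1.0 / np.log2(pos + 2)
               for pos in range(min(len(true_items), k)))
    return dcg / idcg if idcg > 0 else 0.0
```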

In this paper, the dataset is partitioned into a training set and a test set in an 8:2 ratio. To generate negative samples, items without any positive rating are chosen at random and paired with each positive interaction. We use recall@N and ndcg@N to assess the performance of item recommendation and user preference ranking, with a default value of N=20. Hyperparameters are determined by optimizing recall@20 on the test set.

4.2 Baselines

We compared BIKAGCN with the following baselines on the same dataset and metrics to demonstrate its superiority. The details are as follows:

\(\bullet \) MF [6]: The model is a classic recommendation algorithm that performs recommendation by decomposing the user-item interaction matrix into two low-rank matrices.

\(\bullet \) CKE [41]: The model is a typical regularization-based approach that uses semantic embeddings obtained from TransR to enhance matrix factorization.

\(\bullet \) NGCF [40]: The model uses a standard GCN to capture collaborative signals from the user-item interaction graph, effectively injecting collaborative signals into the representations of users and items.

\(\bullet \) KGCN [24]: The model is a knowledge graph-based recommendation algorithm that embeds the entities and relations of the knowledge graph into the recommendation model to improve the accuracy and effectiveness of recommendations.

\(\bullet \) KGAT [21]: The model alternates between training recommendations and knowledge graph embedding. All structural and semantic information is obtained from a combination of the user-item graph and the item knowledge graph.

\(\bullet \) LightGCN [22]: The model is an efficient GCN-based model that abandons the feature transformation and non-linear activation operations in the GCN that are not useful for recommendations.

4.3 Hyperparameters Settings

Hyperparameters play a crucial role in the experimental results. For the embedding size, we evaluated all models on a given dataset using a fixed size of 64. Trainable parameters were initialized with the Xavier initialization method. To avoid local optima during training, we employed the mini-batch Adam optimizer. The batch size was set to 2048 for MovieLens-20M and 1024 for the other datasets. We fine-tuned the following parameters with grid search: the learning rate within {0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001}; the \(L_2\) regularization factor \(\lambda \) between \({10}^{-1}\) and \({10}^{-5}\); the number of layers in the GCN recommendation models (KGCN, KGAT, NGCF, LightGCN, and BIKAGCN) within {1, 2, 3, 4}; and the neighbor sampling size in KGCN and BIKAGCN within {4, 8, 16, 32}. For KGCN, the hidden dimension is set to the same size as the initial embedding dimension; for KGAT and NGCF, the first hidden dimension matches the initial embedding dimension, and each subsequent hidden dimension is half of the previous one. KGAT and NGCF use a dropout ratio of 0.1. Furthermore, owing to the vast scale of the MovieLens-20M dataset, we employed early stopping: training was terminated if recall@20 on the test set did not improve within ten epochs.
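The search described above corresponds to a configuration like the following sketch; the key names are illustrative, not taken from the authors’ code.

```python
search_space = {
    "embedding_size": [64],
    "learning_rate": [0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001],
    "l2_lambda": [1e-1, 1e-2, 1e-3, 1e-4, 1e-5],
    "num_gcn_layers": [1, 2, 3, 4],          # KGCN/KGAT/NGCF/LightGCN/BIKAGCN
    "neighbor_sample_size": [4, 8, 16, 32],  # K, for KGCN and BIKAGCN
}
batch_size = {"MovieLens-20M": 2048, "Last-FM": 1024, "Book-Crossing": 1024}
early_stopping_patience = 10  # epochs without recall@20 improvement (ML-20M)
```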

4.4 Experimental Results

4.4.1 Comparison of Recommended Performance with Baseline

From Table 3, we can conclude the following:

Table 3 Comparison of the recommended performance with the baseline method

Among all the recommendation models based on graph neural networks, KGCN exhibits the weakest performance across all three datasets. This could be attributed to KGCN producing more links between items while overlooking the links between user and item. Additionally, KGCN’s large number of trainable parameters may cause overfitting, diminishing its generalization ability and restricting its performance.

Compared with NGCF, KGCN performs relatively poorly across the three datasets, indicating that the semantic information obtained exclusively from the item knowledge graph is of limited strength and underscoring the significance of collaborative information between users and items. Comparing NGCF and LightGCN, the latter exhibits better recommendation performance across all three datasets, emphasizing the advantage of eliminating feature transformation and nonlinear activation in recommendation. Furthermore, BIKAGCN outperforms KGAT across all three datasets, implying that it is not essential to train the entire knowledge graph simultaneously for recommendation purposes.

A comparative analysis against all baselines shows that BIKAGCN has a clear advantage on all datasets. This suggests that BIKAGCN’s bi-layer graph convolutional layer combines LGC’s ability to extract collaborative information with KAGCN’s ability to selectively and preferentially aggregate item neighborhood information. BIKAGCN is adept at merging complementary information from the item knowledge graph into the item vector representations and at gaining insight into the unique requirements and potential interests of different users; as a result, it produces better recommendations, thereby improving the performance of the recommender system.

Fig. 4 Performance comparison of training sets of different scales on the three datasets. The training set ratio indicates the ratio of the current training set to the original training set

4.4.2 Comparison of Model Cold-Start Scenarios

By conducting a joint analysis of the experimental results from Last-FM and Book-Crossing, we can derive some meaningful conclusions, which are presented in Fig. 4.

In scenarios where data is sparse, adding the item KG as side information to the recommender system can mitigate the cold-start problem, especially in cases of extreme data scarcity. When the training set ratio drops to 0.2, the three recommendation models based on GCN and KG (BIKAGCN, KGAT, and KGCN) perform best, indicating the effectiveness of incorporating knowledge graphs in alleviating the cold-start problem. A possible reason is that, in cold-start scenarios, the KG provides additional high-order item correlations and preference relationships between users and items. Introducing this information can mitigate data sparsity, compensate for the lack of item information, and maintain the performance of the recommender system.

BIKAGCN excels in cold-start scenarios: the results across the three datasets demonstrate that it outperforms the other models at all levels of sparsity, indicating that BIKAGCN is a reliable solution to the cold-start problem.

4.5 Ablation Study of BIKAGCN

The graph convolution layer is a key component of GCN-based models. We therefore explore the effects of different aggregation functions, aggregation depths, sampled neighborhood sizes, and attention mechanisms on model performance.

4.5.1 Impact of BIKAGCN Aggregators

Table 4 summarizes the results of our experimental evaluation of four aggregation functions. From these findings, several noteworthy conclusions can be drawn:

Table 4 Performance of BIKAGCN with different aggregation functions

Our experiments comparing GCN-s with GCN and Neighbor-s with Neighbor demonstrate that eliminating feature transformation and non-linear activation contributes positively to the performance of the recommender system. Representing item features solely via aggregated vectors of neighboring nodes is insufficient to fully exploit user and item information; notably, incorporating self-connections into the item features further boosts performance.

4.5.2 Impact of BIKAGCN Layer Number

To explore the impact of the number of layers on performance, we analyze Table 5 and make the following observations:

Table 5 Performance of BIKAGCN with different layer number

On all three datasets, the best performance is achieved with 1 or 2 layers. Increasing the number of layers enlarges the amount of neighbor information that can be acquired, which enriches the information context and improves performance. However, increasing the number of propagation layers does not necessarily enhance the model’s performance: deeper layers can lead to over-smoothing, i.e., the vector representations of neighboring nodes become increasingly similar, leaving the network unable to distinguish items of interest to users. Moreover, deeper layers also mean that more noisy information is aggregated. It is therefore necessary to select an appropriate number of layers based on practical requirements.

4.5.3 Impact of BIKAGCN Sampled Neighbor Number

Table 6 shows the results of varying the sampled neighbor size K, from which we observe the following:

Table 6 Performance of BIKAGCN with sampled neighbor number

BIKAGCN achieves its best performance when K equals 8 or 16. On the MovieLens-20M and Last-FM datasets the best neighbor number is 8, whereas on the Book-Crossing dataset it is 16. The MovieLens-20M dataset is large enough that 8 neighbors provide sufficient semantic information, and increasing the number of neighbors introduces more noise. The Last-FM dataset is small and easily fitted, so a large number of neighbors likewise introduces excessive noise. For the Book-Crossing dataset, which is both sparse and small, the optimal number of neighbors is 16. Beyond the optimal neighbor size, the performance of BIKAGCN decreases rapidly as the number of neighbors increases, indicating that more item noise is aggregated and the negative impact gradually exceeds the positive.

4.5.4 Impact of BIKAGCN Attention Mechanism

The attention mechanism is a critical factor that significantly influences model performance. Table 7 presents the experimental results, from which several noteworthy observations can be made.

Table 7 Performance of BIKAGCN with attention mechanism

In most cases, the performance of the recommender system improves as more attention mechanisms are added, except for ndcg@20 on the MovieLens-20M dataset. Additionally, "ur+ir+self" consistently yields the best model performance. Notably, the "ur" mechanism outperforms "ir" in all cases, underscoring the indispensable role of personalized interest in recommendation and its importance over neighbor-side attention.

Comparing "ur+ir+self" with "ur+ir" reveals that the former performs better. One possible reason is that considering user self-attention enables the model to fully leverage user semantic information and thus enhance recommendation performance. The reason "ur+ir" outperforms "ur" may be that the former more thoroughly explores the connections between users and items, promoting the aggregation of item neighbor nodes and improving the algorithm’s performance. Finally, the results without any attention mechanism (i.e., "no") show that treating all neighbors equally introduces noise and misleads the embedding propagation process, confirming the important role of the graph attention mechanism.

5 Conclusions

In this paper, we propose a new model, BIKAGCN, which combines the respective characteristics of LGC and KAGCN. Specifically, LightGCN is used as the first layer of the model, followed by KAGCN as the second layer, integrating the two into a novel bi-layer graph convolutional network. This combination enhances the feature extraction capability of GCN and accelerates model convergence. Experimental results on three real-world datasets demonstrate that incorporating a knowledge-aware GCN layer after LightGCN to learn the embedding representations of users and entities in the knowledge graph is both effective and feasible, providing insights for the development of graph neural network-based recommender systems.

We identify two key areas for future research. Firstly, regarding the negative sampling strategy in the model, there are several studies exploring ways to improve upon the current uniform sampling technique [20, 52,53,54,55]. One potential approach involves sampling a different number of neighboring entities for each item, which may result in enhanced recommender system performance. Secondly, our experimental results suggest that incorporating item KG as auxiliary information can improve performance compared to using only the interaction graph. Thus, research could examine the effectiveness of incorporating additional types of auxiliary data, such as contextual data [56] or social networks [57]. In social network recommendation, identifying reliable friends is also a popular research direction [58, 59].