1 Introduction

E-commerce platforms have grown popular over the last two decades as people recognised the ease of buying products online without having to step out of their homes. In the recent past, there has been a boom in online shopping, with many sellers moving their businesses to these e-commerce platforms, Enberg [7]. Product pricing is quite difficult on e-commerce platforms, given the sheer number of products sold online. Several factors influence the pricing of products, and thus pricing tends to be dynamic at times.

Yang and Xia [24] proposed advanced business-to-business (B2B) and customer-to-customer (C2C) methods for the complete supply chain, enabling forward-looking prediction for market analysis, experimental insights and other unknowns of the current market. Predicting a pricing strategy for the share market is a highly demanded topic in today's world. AlJazzazen [2] demonstrated the gap between the skimming and penetration methods to identify the optimal strategy for price prediction. Many investigations have been performed to measure the effectiveness of various marketing plans in improving brand loyalty and individuality. Davcik and Sharma [6] described how campaigns organised at different places, with different products and prices, help increase marketing effectiveness. Li et al. [12] analysed the behaviour and purchase characteristics of end users by conducting feedback and review sessions with customers, and different strategies were framed and formulated on the basis of this analysis. Priester et al. [16] provided two approaches to personalised dynamic pricing based on fairness and privacy policies; this includes location- and history-based marketing. Ren et al. [18] proposed three pricing models by investigating backward induction and game theory, framed on the basis of the service cost assigned to certain weights.

Many real-time examples covering different pricing strategies and profit situations act as attributes in the channel service. Surveys of various consumers and end users help reveal the behaviour of each consumer, which in turn supports analysing the impact of pricing decisions across the entire supply chain, Wei and Li [22]. Theoretical analysis gives only partial results for decision-making, and mathematical analysis narrows down the decision but is quite challenging; the authors also found that channel loyalty increases purchases and favours the manufacturer. Mattos et al. [13] found that the cost-based approach dominates the value-based and competition-based approaches, laying the groundwork for future analysis; this was established through an investigation of 286 papers from 31 journals, which also highlights future opportunities. In addition to business market strategies, the psychological behaviour of individuals provides vital information for determining the cost of products in markets, Hinterhuber et al. [9]. Bürgin and Wilken [5] used experimental studies of partitioned pricing to examine price increases in the market for the welfare and sustainability of products. Gupta and Pathak [8] developed a general architecture using machine learning models to predict the purchase price preferred by the customer; this can be applied to e-commerce platforms and share market prediction, and the research was extended by adopting personalised adaptive pricing. Since price prediction depends on time-series data, Yu [25] used an LSTM model to predict product prices on a daily basis, and the model produced notable results with high accuracy. Spuritha et al. [20] suggested an ensemble model that uses the XGBoost algorithm for sales prediction, with results validated by low RMSE and MAPE error values. Wiyati et al. [23] found that combining supply chain management with cloud computing yields high performance and reliable output. Ho et al. [10] compared the results of a support vector machine model across three performance measures.

Mohamed et al. [14] framed an online retailer data set for the prediction task on seasonal products, combining the autoregressive integrated moving average method with a support vector machine for statistical analysis. Alfaro-Navarro et al. [1] explained the use of ensemble models in real estate property pricing and showed improvement in estimating dwelling prices using random forest and bagging techniques. Pérez Rave et al. [15] discussed the method of incremental samples with resampling (MINREM) for identifying variables for inferential and predictive analysis by integrating machine learning and regression analysis. Grybauskas et al. [3] and Shahrel et al. [19] analysed the influence of attributes on property price revisions during the pandemic and tested different machine learning models.

To achieve high speed and accuracy, interesting ensemble algorithms based on gradient boosting have recently been added to the literature, namely the Extreme Gradient Boosting (XGBoost) regressor, Light Gradient Boosting Machine (LightGBM) regressor, Category Boosting (CatBoost) regressor and Natural Gradient Boosting (NGBoost). XGBoost is a scalable algorithm that has proved to be a strong contender for solving machine learning problems. LightGBM considers the highest-gradient instances using selective sampling and attains fast training performance. CatBoost achieves high accuracy while avoiding prediction shift by modifying the gradients. NGBoost provides probabilistic prediction using gradient boosting conditioned on the covariates. These ensemble learning techniques provide a systematic solution for combining the predictive capabilities of multiple learners, because relying on the results of a single machine learning model may not be enough.

Table 1 Attributes of the data set

For pricing solutions, only basic machine learning models have been used so far, and boosting algorithms are the way forward to achieve speed and efficiency. In the proposed work, XGBoost, LightGBM, CatBoost and X-NGBoost are used to train the huge online data set with speed and accuracy. XGBoost is considered the base learning model for NGBoost: at every iteration, the best gradient is estimated based on a score indicating the presumed outcome given the observed features. XGBoost builds trees in order, so that each subsequent tree minimises the errors generated by the previous tree, but it does not provide probabilistic estimates conditioned on the features. Hence, NGBoost takes the outcomes of XGBoost and provides the best estimate based on the features, further assisting faster training and accuracy.

Table 2 Training data preview
Table 3 Testing dataset preview

Thus, this paper attempts to offer business owners competitive pricing solutions based on similar products being sold on e-commerce platforms by applying ML algorithms, with the main target being small-scale retailers and small business owners. Machine learning does not just give pricing suggestions; it can accurately predict how customers will react to pricing and even forecast demand for a given product. ML considers all of this information and comes up with the right suggestions for pricing thousands of products, making the pricing suggestions more profitable. The objectives of the work are listed below.

Objectives:

  • Offer pricing solutions using ML models trained on historical and competitive data, taking into consideration date time features and group-by statistical features of categorical variables.

  • Help retailers make pricing decisions quicker and adjust prices more responsively.

  • Create predictable, scalable, repeatable and successful pricing decisions, so that business owners always know the reasoning behind a price.

  • Use the outcome of applying the pricing solutions to gain a better understanding of customer engagement and customer behaviour.

2 Data preprocessing and feature engineering

We used knowledge of the data domain to create features that improve the performance of the machine learning models. Given that various factors influencing pricing are already available, the addition of date time features and statistical features based on categorical variables helps in predicting better results.

2.1 Exploratory data analysis

  1. Began with Exploratory Data Analysis.

  2. On plotting the features, the target variable ‘Selling Price’ was found to be highly skewed, due to which the output was also skewed.

  3. Thus, a logarithmic transformation was applied to normalise the target variable.

  4. Next, new features were created: date time features and group-by statistical features of the categorical variables; the categorical variables themselves were handled through a label encoder.

  5. After preprocessing, the models were built: XGBoost, LGBM, CatBoost and X-NGBoost.

  6. Tuning of parameters, together with early stopping, helped achieve a good cross-validation score (a sketch follows this list).

  7. Finally, the results of all the models were presented.
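To illustrate step 6, here is a minimal sketch of parameter tuning with early stopping using XGBoost's scikit-learn interface; the split sizes and hyperparameter values are illustrative assumptions, not the tuned settings of this work.

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Hypothetical preprocessed feature matrix X and log-transformed target y.
X, y = np.random.rand(2453, 10), np.random.rand(2453)

# Hold out a validation fold so training stops when the score stops improving.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = xgb.XGBRegressor(
    n_estimators=1000,         # upper bound; early stopping picks the actual count
    learning_rate=0.05,
    early_stopping_rounds=50,  # stop after 50 rounds without validation improvement
)
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
print("best iteration:", model.best_iteration)
```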

2.2 Data set

The data set used carries various features of the multiple products sold on an e-commerce platform. The attributes used for working the model are given in Table 1. The data set has two parts: the training data set sample shown in Table 2 and the testing data set shown in Table 3. The data set is collected from Kaggle and named E-Commerce_Participants_Data. It contains 2453 training records and 1052 testing records.

2.3 Preprocessing

The distribution plot of the data set in Fig. 1 shows that the target variable, the Selling Price attribute, is highly skewed.

Since the data do not follow a normal bell curve, a logarithmic transformation is applied to make them as “normal” as possible, making statistical analysis of the data more effective.

In other words, the logarithmic transformation reduces or eliminates the skewness of the original data. The resulting normalised variable is shown in Fig. 2.
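As a minimal illustration (assuming a pandas DataFrame with a hypothetical Selling_Price column and toy values), the transformation and its inverse might look like this:

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

df = pd.DataFrame({"Selling_Price": [120, 250, 99, 4999, 310, 75]})  # toy values

print("skewness before:", skew(df["Selling_Price"]))

# log1p is a common choice because it is also defined at zero.
df["Selling_Price_log"] = np.log1p(df["Selling_Price"])
print("skewness after:", skew(df["Selling_Price_log"]))

# Predictions made on the log scale are mapped back with the inverse transform.
predicted_price = np.expm1(df["Selling_Price_log"])
```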

2.4 Attributes of the data set

2.5 Statistical features added

  • Date Features: Seasonality heavily influences the price of a product. Adding these features helps give better insight into how prices change based not just on the date but on the financial quarter, part of the week, the year and more. The data set sample after adding the date features is shown in Table 4.

  • Group-by category unique items feature: Products that belong to the same category generally follow similar pricing trends, so grouping like products helps with better pricing accuracy. The data set sample after adding the group-by category unique items feature is shown in Table 5. A sketch of both feature types follows this list.
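The following is a minimal sketch of both feature types, assuming hypothetical Date, Item_Category and Item columns; the exact features used in the work may differ.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-03", "2020-04-17", "2020-04-20"]),
    "Item_Category": ["electronics", "electronics", "apparel"],
    "Item": ["phone", "laptop", "shirt"],
})

# Date features: expose seasonality to the model.
df["year"] = df["Date"].dt.year
df["quarter"] = df["Date"].dt.quarter
df["day_of_week"] = df["Date"].dt.dayofweek
df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)

# Group-by statistical feature: number of unique items in each category.
df["category_unique_items"] = df.groupby("Item_Category")["Item"].transform("nunique")
```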

2.6 Categorical features added

Most algorithms cannot process categorical features directly and work better with numerical features. Thus, encoding is done using label encoders to add the features below, as shown in Tables 6, 7 and 8 (a sketch follows this list).

  • Item_Category

  • Subcategory_1

  • Subcategory_2

With one-hot encoding, each categorical value is converted into a new column that is assigned a binary value of 1 or 0. Each integer produced by the label encoder corresponds to one position in such a binary vector.
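A minimal sketch of label encoding for the three columns listed above (the category values shown are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "Item_Category": ["electronics", "apparel", "electronics"],
    "Subcategory_1": ["phone", "shirt", "laptop"],
    "Subcategory_2": ["android", "cotton", "ultrabook"],
})

# One-hot encoding alternative: one binary column per category value.
one_hot = pd.get_dummies(df["Item_Category"], prefix="Item_Category")

# Label encoding: replace each categorical column with integer codes.
for col in ["Item_Category", "Subcategory_1", "Subcategory_2"]:
    df[col] = LabelEncoder().fit_transform(df[col])
```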

Table 4 Dataset after the addition of statistical features
Fig. 1 Target variable before preprocessing

Fig. 2 Target variable after preprocessing

Table 5 Dataset after the addition of statistical features
Table 6 Dataset after the addition of categorical features
Table 7 Dataset after the addition of categorical features
Table 8 Dataset after the addition of categorical features

2.7 Graphical overview of processed data set

Heat map

A graphical representation of the correlation between all the attributes is shown in Fig. 3. Here, the correlation values are denoted by colours, with red indicating the strongest correlation and blue the weakest.
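A minimal sketch of producing such a heat map with seaborn, assuming the preprocessed DataFrame df from the previous steps:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation of all numeric attributes, coloured from red (high) to blue (low).
corr = df.corr(numeric_only=True)
sns.heatmap(corr, cmap="coolwarm")
plt.title("Correlation between attributes")
plt.show()
```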

Fig. 3 Heat map of correlation between the attributes

3 Proposed methodology

In this section, the XGBoost, LightGBM, CatBoost and X-NGBoost ensemble learning techniques are proposed for the pricing solution. These ensemble learners provide a systematic solution for combining the predictive capabilities of multiple learners, because relying solely on the results of a single machine learning model may not be enough. The flow chart of the proposed models is shown in Fig. 4.

Fig. 4 The flow chart of the proposed model

The data set is preprocessed to add the statistical and categorical features and to perform the normalisation. The proposed models XGBoost, LightGBM, CatBoost and X-NGBoost are then executed on the data set. The models are explained in the following subsections.

3.1 Gradient boost models

  • XGBoost, LightGBM, CatBoost and NGBoost are ensemble learning techniques.

  • Ensemble learning provides a systematic solution for combining the predictive capabilities of multiple learners, because relying solely on the results of a single machine learning model may not be enough.

  • The aggregated output of multiple models results in a single, stronger model.

  • In boosting, the trees are built in order, so that each subsequent tree aims to reduce the errors of the previous tree.

  • The residuals are updated, and the next tree in the sequence learns from them (a sketch of this idea follows the list).
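As an illustration of this residual-fitting idea, here is a minimal from-scratch sketch of gradient boosting for squared error, where the negative gradient equals the residual; it is a conceptual toy, not the implementation used in this work.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_trees=100, lr=0.1):
    """Sequentially fit small trees to the current residuals."""
    pred = np.full(len(y), y.mean())  # start from the mean prediction
    trees = []
    for _ in range(n_trees):
        residuals = y - pred           # negative gradient of squared error
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        pred += lr * tree.predict(X)   # each tree corrects the previous errors
        trees.append(tree)
    return trees, pred
```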

3.2 Comparison of existing algorithms

The comparison of the boosting algorithms, Bentéjac et al. [4], is presented in Table 9 with respect to tree splits, missing values, leaf growth, training speed, categorical feature handling and parameters to control overfitting.

3.3 XGBoost

XGBoost, Jabeur et al. [11], is a boosting ensemble algorithm obtained by combining decision trees with the gradient boosting algorithm. XGBoost makes use of the first and second derivatives of the loss function instead of a search method. It uses pre-sorting and node-indexing techniques to improve the performance of the algorithm. With the regularisation term, in every iteration the weak learners (decision trees) are suppressed and are not part of the final model. The objective function O of XGBoost is given as:

$$O^{(t)} = \sum_{i=1}^{n} \left[ l(y_i, \hat{y}_i^{(t-1)}) + p_i f_t(x_i) + \frac{1}{2} q_i f_t^2(x_i) \right] + \Omega(f_t)$$
(1)

where i denotes the \(i^{th}\) sample and t the iteration, \(y_i\) represents the \(i^{th}\) sample’s real value and \({\hat{y}}_i^{(t-1)}\) is the predicted outcome of the \((t-1)^{th}\) iteration, \(p_i\) and \(q_i\) are the first and second derivatives of the loss function, and \(\Omega (f_t)\) represents the regularisation term.

The XGBoost algorithm does not support categorical data, so one-hot encoding has to be done manually; no inbuilt feature is present. The algorithm can also be trained on very large data sets because it can be parallelised and take advantage of the capabilities of multi-core computers. It internally has parameters for regularisation, missing values and tree structure. A few important concepts of XGBoost are listed below. Tree ensembles: trees are generated one after the other, and with each iteration attempts are made to reduce the classification mistakes.

At any time n, the results of the model are weighted based on the results at the previous time n-1. The results of correct predictions are given lower weights, and the results of incorrect predictions are given higher weights. Tree pruning is a technique in which an overfitted tree is constructed and then leaves are removed according to selected criteria. Cross-validation is used to compare the partitioned and non-partitioned versions of an overfitted tree; if there is no better result for a particular node, it is excluded (a configuration sketch follows).
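A minimal sketch of configuring these knobs through XGBoost's scikit-learn interface; the hyperparameter values are illustrative assumptions, not the tuned settings of this work.

```python
import xgboost as xgb

model = xgb.XGBRegressor(
    n_estimators=500,
    max_depth=6,           # tree structure
    learning_rate=0.05,
    reg_alpha=0.1,         # L1 regularisation term
    reg_lambda=1.0,        # L2 regularisation term
    subsample=0.8,         # row sampling to curb overfitting
    colsample_bytree=0.8,  # feature sampling per tree
    n_jobs=-1,             # parallel training on multi-core machines
)
# Missing values (NaN) are handled internally via learned default split directions.
```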

Table 9 Comparison table of boosting algorithms

3.4 LightGBM

LightGBM, Sun et al. [21], is also based on decision trees and the gradient boosting learning framework. The major difference between LightGBM and XGBoost is that the former uses a histogram-based algorithm, storing continuous feature values in discrete intervals to speed up the training process. In this approach, the floating-point values are discretised into k bins to construct the histogram. LightGBM offers good accuracy with categorical features that are integer-encoded; therefore, label encoding is preferred, as it often performs better than one-hot encoding for LightGBM models. The framework uses a leaf-wise tree growth algorithm. Leaf-wise growth algorithms tend to converge faster than depth-wise algorithms but are more prone to overfitting.

LightGBM also uses GOSS (Gradient-based One-Side Sampling), built on the idea that not all data points contribute in the same way to training. Data points with small gradients tend to be already well trained; therefore, it is more effective to focus on data points with larger gradients. By calculating each sample's contribution to the change in loss, LightGBM increases the weight of the retained small-gradient samples to keep the estimate unbiased.
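A minimal sketch using the lightgbm scikit-learn interface, reusing the earlier hypothetical split and assuming X_tr is a pandas DataFrame whose categorical columns were integer-encoded:

```python
import lightgbm as lgb

model = lgb.LGBMRegressor(
    n_estimators=500,
    num_leaves=31,    # leaf-wise growth is controlled by the leaf count
    learning_rate=0.05,
    max_bin=255,      # number of histogram bins per feature
)
# Columns named here are treated as categorical without one-hot encoding.
model.fit(
    X_tr, y_tr,
    categorical_feature=["Item_Category", "Subcategory_1", "Subcategory_2"],
)
```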

3.5 Category boosting (CatBoost)

The Category Boosting (CatBoost) approach, Punmiya and Choe [17], mainly focuses on permutations and target-oriented statistics. It works well with multiple categories of data and is widely applied to many kinds of business challenges. It does not require explicit data preprocessing to convert categories to numbers: various statistics, combined with internal categorical features, are used to convert categorical values into numbers. It can deal with large amounts of data while using less memory to run, and it lowers the chance of overfitting, leading to more generalised models. For categorical encoding, a sorting principle called Target-Based with Previous (TBS) is used: inspired by online learning, training examples are received sequentially over time, and the target statistics for each example are based solely on previously observed records. Combinations of categorical features are used as additional features to capture high-order dependencies. CatBoost works in the following steps:

  • Forming the subsets from the records

  • Converting classification labels to numbers

  • Converting the categorical features to numbers

CatBoost uses symmetric (oblivious) decision trees, where the splitting criterion is adopted uniformly throughout each tree level. Such a tree is balanced, less prone to overfitting, and can significantly speed up prediction during testing (a usage sketch follows).
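A minimal sketch with the catboost package, passing raw categorical columns directly so that CatBoost applies its own target statistics; the column names reuse the earlier hypothetical ones and the hyperparameters are illustrative.

```python
from catboost import CatBoostRegressor, Pool

cat_cols = ["Item_Category", "Subcategory_1", "Subcategory_2"]

# Pool marks the categorical columns; no manual label or one-hot encoding needed.
train_pool = Pool(X_tr, y_tr, cat_features=cat_cols)

model = CatBoostRegressor(iterations=500, learning_rate=0.05, depth=6, verbose=False)
model.fit(train_pool)
predictions = model.predict(Pool(X_val, cat_features=cat_cols))
```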

4 X-NGBoost

The primary aim of this work is to design a novel AI-based technique for predicting product prices by combining the XGBoost model with the NGBoost algorithm, called the X-NGBoost technique. The flow chart of the model is shown in Fig. 5.

Fig. 5 X-NGBoost algorithm

In the proposed model, the preprocessed data set is processed with XGBoost adapted with the natural probability prediction algorithm (as the base learning model). The initial training data are fitted to XGBoost to start the model. Further, the hyperparameters of XGBoost are chosen by trial and error and tuned by Bayesian optimisation: based on the estimated score, the optimised parameters are obtained from the Bayesian optimisation. The proposed optimised predictive model effectively enhances the accuracy.
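One plausible reading of this combination, sketched with the ngboost package, is to plug a single-tree XGBoost regressor in as NGBoost's base learner; this is an interpretation under stated assumptions, not the authors' released implementation, and all hyperparameter values are illustrative.

```python
from ngboost import NGBRegressor
from ngboost.distns import Normal
from xgboost import XGBRegressor

# A shallow one-tree XGBoost model serves as the base learner that NGBoost
# fits to the natural gradients at every boosting stage.
base = XGBRegressor(n_estimators=1, max_depth=3, learning_rate=1.0)

model = NGBRegressor(
    Dist=Normal,      # predict a full distribution, not just a point estimate
    Base=base,
    n_estimators=500,
    learning_rate=0.01,
)
model.fit(X_tr, y_tr)

point_pred = model.predict(X_val)   # mean of the predictive distribution
dist_pred = model.pred_dist(X_val)  # full probabilistic prediction
```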

To compare the model, the XGBoost, LightGBM and CatBoost algorithms are trained and evaluated on the same data set. The comparison of the algorithms is made based on the root-mean-square error (RMSE), expressed in the units of the target variable. RMSE is a prominent measure because it is sensitive to large errors. A computation sketch follows.
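RMSE can be computed on the held-out fold as follows (a sketch; if the target was log-transformed, the error can also be reported on the original price scale after inverting the transform):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_pred = model.predict(X_val)
rmse = np.sqrt(mean_squared_error(y_val, y_pred))
print(f"validation RMSE (log scale): {rmse:.4f}")

# Back-transform to the original price scale before computing the error there.
rmse_price = np.sqrt(mean_squared_error(np.expm1(y_val), np.expm1(y_pred)))
```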

5 Results and discussion

The proposed model abides by the objectives discussed and provides an appropriate pricing solution. The proposed ensemble decision tree models LightGBM, XGBoost and CatBoost are implemented. The advantage of these models is that feature importances are estimated along with the prediction; the feature-based scores are used to boost the decision trees within the model, improving the prediction results. In the proposed model, the natural gradient predictive model is adapted alongside XGBoost to further enhance the score.

Table 10 Root-mean-square error of the algorithms

Therefore, from the results in Table 10, it can be concluded that, for this data set, the X-NGBoost algorithm gives the most appropriate pricing solution; it is thus the most suitable among the boosting ensemble models considered. The boosting-based ensemble techniques can further be extended to predict pricing solutions for products across multiple e-commerce sites and to suggest the most appropriate option for the customer, given the multitude of products sold on e-commerce platforms and the wide variety within each category.

6 Conclusion and future work

Since e-commerce platforms have been growing rapidly over the last decade, it has become the norm for many small businesses to begin by selling their products on such platforms rather than setting up a physical shop.

Considering the limited budget of such small business owners, having a tool that is easily accessible and gives them a basic idea of how to price their products effectively would prove to be very useful. Enhancing such businesses in the long run would help improve the economy while promoting local industries to establish and develop themselves.

The results obtained indicate that the X-NGBoost algorithm is suitable for providing appropriate pricing solutions with the lowest error rates. Furthermore, the implementation of ensemble techniques produced reliable and efficient output, potentially with an even lower error rate, thereby providing users with fruitful results and a greater degree of satisfaction.

This work can further be extended by integrating it with an e-commerce platform to provide dynamic results. Additionally, the sales data can be analysed to understand the other factors contributing to effective pricing.