Predicting construction project compliance with machine learning model: case study using Portuguese procurement data

Luís Jacques de Sousa (Department of Civil Engineering, Faculdade de Engenharia, Universidade do Porto (FEUP), Porto, Portugal) (CONSTRUCT, FEUP, Porto, Portugal) (BUILT CoLAB – Collaborative Laboratory for the Future Built Environment Association, Porto, Portugal)

João Poças Martins (Department of Civil Engineering, Faculdade de Engenharia, Universidade do Porto (FEUP), Porto, Portugal) (CONSTRUCT, FEUP, Porto, Portugal) (BUILT CoLAB – Collaborative Laboratory for the Future Built Environment Association, Porto, Portugal)

Luís Sanhudo (BUILT CoLAB – Collaborative Laboratory for the Future Built Environment Association, Porto, Portugal)

Engineering, Construction and Architectural Management

ISSN: 0969-9988

Article publication date: 26 April 2024

Downloads

202

pdf (36.1 MB)

Abstract

Purpose

Factors like bid price, submission time, and number of bidders influence the procurement process in public projects. These factors and the award criteria may impact the project’s financial compliance. Predicting budget compliance in construction projects has been traditionally challenging, but Machine Learning (ML) techniques have revolutionised estimations.

Design/methodology/approach

In this study, Portuguese Public Procurement Data (PPPData) was utilised as the model’s input. Notably, this dataset exhibited a substantial imbalance in the target feature. To address this issue, the study evaluated three distinct data balancing techniques: oversampling, undersampling, and the SMOTE method. Next, a comprehensive feature selection process was conducted, leading to the testing of five different algorithms for forecasting budget compliance. Finally, a secondary test was conducted, refining the features to include only those elements that procurement technicians can modify while also considering the two most accurate predictors identified in the previous test.

Findings

The findings indicate that employing the SMOTE method on the scraped data can achieve a balanced dataset. Furthermore, the results demonstrate that the Adam ANN algorithm outperformed others, boasting a precision rate of 68.1%.

Practical implications

The model can aid procurement technicians during the tendering phase by using historical data and analogous projects to predict performance.

Social implications

Although the study reveals that ML algorithms cannot accurately predict budget compliance using procurement data, they can still provide project owners with insights into the most suitable criteria, aiding decision-making. Further research should assess the model’s impact and capacity within the procurement workflow.

Originality/value

Previous research predominantly focused on forecasting budgets by leveraging data from the private construction execution phase. While some investigations incorporated procurement data, this study distinguishes itself by using an imbalanced dataset and anticipating compliance rather than predicting budgetary figures. The model predicts budget compliance by analysing qualitative and quantitative characteristics of public project contracts. The research paper explores various model architectures and data treatment techniques to develop a model to assist the Client in tender definition.

Keywords

Citation

Jacques de Sousa, L., Poças Martins, J. and Sanhudo, L. (2024), "Predicting construction project compliance with machine learning model: case study using Portuguese procurement data", Engineering, Construction and Architectural Management, Vol. 31 No. 13, pp. 285-302. https://doi.org/10.1108/ECAM-09-2023-0973

Publisher

:

Emerald Publishing Limited

License

Published by Emerald Publishing Limited. This article is published under the Creative Commons Attribution (CC BY 4.0) licence. Anyone may reproduce, distribute, translate and create derivative works of this article (for both commercial and non-commercial purposes), subject to full attribution to the original publication and authors. The full terms of this licence may be seen at http://creativecommons.org/licences/by/4.0/legalcode

1. Introduction

During the procurement process, essential decisions regarding the project’s economic and temporal dimensions, as well as the selection of the general contractor, must be taken. However, despite these decisions potentially significantly impacting the project’s performance and being contractually binding, technicians often make them under very austere time constraints (Aman and Azeanita, 2021; Moon et al., 2022). In the Construction industry, especially in the public sector, selecting the right construction contractor is crucial for the project’s success (Cheaitou et al., 2019; Almohassen et al., 2023). However, there is an overwhelming reliance on the lowest bid method for deciding tenders, as governmental bodies try to safeguard public interest and funds (Almohassen et al., 2023; Cheaitou et al., 2019). This approach frequently increases project risks, including cost overruns, delays, and substandard work (Naji et al., 2022). While there has been an increasing emphasis on multicriteria evaluation and life cycle costing in recent years, tenders are still predominantly awarded based on the lowest bid price. In this sense, it is crucial to grasp the distinctions between cost, price, and value (Hanák et al., 2021). Decision support tools may improve the quality of these submissions and help decide to whom and how to award a construction project.

Construction applications are detailed and complex documents compiled according to different assumptions given a specific project (Poças Martins and Abrantes Almeida, 2009), traditionally, it has been challenging to accurately predict budget compliance (Jafari et al., 2021; Elhegazy et al., 2021). The emergence of Artificial Intelligence may change this paradigm; however, due to technical and cultural reasons, these technologies, which have already been implemented in similar industries, have only started to be adopted by the construction sector (Chen et al., 2019). In recent years, different Machine Learning (ML) algorithm architectures have been adopted in various stages of the construction process (Jacques de Sousa et al., 2023b). The cost prediction aspect of the procurement process has been an area of substantial interest as it would make construction more productive and efficient (aspects where the sector trails behind similar industries) by making more realistic cost assessments and preventing underbudgeting (Kerridge and Halaris, 2001).

The vast diversity of projects adds to the complexity of the ML tools in construction, as different projects must consider distinct variables for budgeting (Elhegazy et al., 2021). Previous studies used these variables as features; for example, Suneja used an artificial neural network (ANN) to predict the cost of road construction projects considering factors such as the type of the road, the number of lanes, and the length of the road, among others, as features and using 124 road projects (Suneja et al., 2021). Juszczyk developed a cost estimation model using ensembles and the Support Vector Machine (SVM) method, which used the building height, gross floor area, and concrete class, among other factors and achieved MAPE error between 5 and 6% (Juszczyk, 2020). In another study, V. Chandanshive developed a model based on a feed-forward ANN that considered the ground floor area, the number of floors, the structural parking area and the foundation types, using a dataset of 78 building projects (Chandanshive and Kambekar, 2019). Jiang developed a back-propagation ANN and applied it to a database of 10 house construction projects using features such as the type of concrete, steel bars, masonry and other electrical and water performance components such as the water supply pipe and indoor drainage pipe (Jiang, 2020).

In similar work, there was a significant divergence in the quantity of data used in each model. Models used data from as little as ten projects to as many as some thousands. For example, while Yaqubi used only ten educational projects developed in India, Juszczyk used 129 sports complex projects, and, in another study, Baker gathered close to 2,345 safety reports from a construction company (Yaqubi and Salhotra, 2019; Juszczyk et al., 2019; Baker et al., 2020). Such variations in data quantity raise concerns regarding the quality of data and the types of features that can be extracted from said data. However, the literature shows no correlation between the number of projects and the performance of the models; despite this, holistic models will require extensive datasets due to the nature of the ML algorithms (Jacques de Sousa et al., 2023b). Researchers must process the data to create a meaningful and compatible sample with their models (Jacques de Sousa et al., 2023b). Usually, this data originates from construction budget applications' digital text files, such as bills of quantities, planning schemes and project specifications.

Concerning research that used procurement data for prediction, in a comparable study, Cheng et al. constructed an SVM model to forecast project performance using 11 features. However, the dataset was limited to 46 projects (Cheng et al., 2010). Leu et al. used a Bayesian-based approach to predict the contract preferences of opponent companies in bilateral Construction negotiations (Leu et al., 2015).

Moreover, improving the optimisation of construction bid selection has been a significant study point in recent years: E. Mohamed et al. developed a mixed approach, combining rule-based and fuzzy-based expert systems. In their work, the rule-based system assessed qualitative aspects of the bid, while the fuzzy system evaluates quantitative factors, providing decision-makers with a comprehensive model that reduces administrative burdens (Mohamed et al., 2022). Naji et al. developed an analytical network Process model for ranking the critical criteria for a project’s success. The study revealed that financial stability, offered price and delivery date are among the essential criteria impacting contract performance (Naji et al., 2022). Almohassen’s et al. study on pre-tender stage evaluation created a model that allows clients to set a minimum threshold for contractor qualification and ensure alignment between the contractor’s capabilities and the project requirements (Almohassen et al., 2023).

Concerning the performance of the algorithms, the literature shows that the trend for the various cost prediction models regarding MAPE error is around 20% and that the industry standard concerning the level of accuracy must be upwards of 90% to be satisfactory (Juszczyk, 2020; Jacques de Sousa et al., 2023b). Previous ML applications for construction cost prediction have mainly used private construction execution phase data (Jacques de Sousa et al., 2023b). Nevertheless, recent research has utilised publicly accessible procurement information for predicting base tender price (García Rodríguez et al., 2020) and contract collusion (García Rodríguez et al., 2022).

Analogously, this study applies a budget compliance model to the construction procurement stage using open-access construction procurement data. However, in contrast with related research, the final goal of this study is to test how compliance prediction algorithms can aid procurement technicians' decisions by predicting the chance of the project complying with the budget using fuzzy logic. Moreover, this study sets itself apart from previous ones by employing validated methods sourced from contexts distinct from the AEC sector. Specifically, it applies these methods to compliance prediction in Construction contracts, utilising public Portuguese procurement data.

The following sections compose the rest of this article: Section 2 states the methodology for developing the model and its insertion in the construction’s procurement workflow. Next, Section 2.1 exposes data sourcing issues, Section 2.2 addresses data processing, and Section 2.3 presents data treatment strategies. Section 2.4 presents an exploratory data analysis crucial for the feature selection of Section 2.5. Section 2.6 demonstrates the algorithm training and testing results for the different tested architectures and selects the best performers. Finally, Section 3 discusses the results, and Section 4 presents the trends and implications of this study for construction procurement.

2. Methodology

The traditional procurement process starts with the project’s approval; a public project’s owner must initiate a public contracting process to hire a general contractor company to start building. To this end, the owner outlines the characteristics of the project procurement, which can be, for example, the base bid price and the award criteria, among other factors. These details and criteria comprise the tender announcement, and the different bidders submit their proposals according to these details. After the contract is awarded to the tender winner, the project is carried out, yielding financial results. The procurement process results, details, and criteria are obligatorily submitted to public platforms and stored in an open format.

The model developed in this study uses data mined from these open-access repositories previously treated as training data, as described in the Data Sourcing section. With the model trained, a new workflow for procurement (in red) is created, as shown in Figure 1. The proposed workflow relies on the model to predict the expected compliance rate based on previous projects with similar characteristics and the tender details and criteria to advise the owner to change or maintain them.

The overall steps followed in this study for the devilment of the model were the following:

Data sourcing;
Data processing;
Data treatment;
Exploratory data analysis;
Feature selection;
Model selection.

Furthermore, all the different model architectures created in this study were developed using Python programming language and Keras (2023b) and Sklearn (2023c). The following chapters are named according to the different steps highlighted in this methodology.

2.1 Data sourcing

Data access represents the greatest conceptual challenge to developing ML algorithms (Jacques de Sousa et al., 2023b) because it is essential in the learning process of ML models and, therefore, fundamental to the goal of budget compliance prediction (Elmousalami, 2020). Since implementing ML algorithms requires a large amount of data to support their training phases, the lack of a reliable database can be critical to implementing these models (Jacques de Sousa et al., 2023b). A high-quality and broad dataset is crucial for computational models to gain experience in correlations and tendencies considering a big data perspective (Kim et al., 2020; Alaka et al., 2019; Sonntag, 2004). However, only in the last couple of years has the importance of storing data for probabilistic analysis been recognised by the AEC sector stakeholders.

This study used the Portuguese Public Procurement Database (PPPData), a dataset comprising more than 5,000 procurement contracts with 37 distinct properties from 2015 to 2022 (Jacques de Sousa et al., 2022, 2023a). PPPData information originates from Portal Base, the Portuguese national repository of public procurement procedures, and Diário da Républica Eléctronico, the official national journal of the Portuguese Republic.

Table 1 organises the 12 features initially considered input data in the model where the target feature is the performance class of the project, calculated as the proportion between the initial price and the effective price at the end of the project. Project characteristics, such as the features of this model, have been considered among the most impactful for tendering (Bohari et al., 2021).

2.2 Data processing

Of the 5,214 contracts available in PPPData, 4,772 contracts remained due to excluding non-applicable values in the features (442 contracts). The “CPV Code” feature required using Sklearn’s label encoder (Sklearn, 2023a) to transform this feature from non-numerical to numerical values. Moreover, the “Environmental Criterion” feature was transformed into a binary class variable (i.e. 0 and 1) from its previous logical format. Lastly, the target feature “Performance Class” was adapted to a categorical format. Additional standardisation efforts were not necessary since the database providing the sample was already in a structured format.

2.3 Data treatment

An initial descriptive statistical analysis performed in Luís Jacques de Sousa et al. (2023), on the dataset showed that the dataset is highly unbalanced regarding the output feature (the “Project Performance” class). Most of the contracts complied with the price, while only 16.4% showcased a slippage in price, and 13.2% indicated recorded budget savings. After data processing, despite achieving better skewness, the same unbalanced distribution was verified, with the price compliance class recording 3,290 contracts, the price spill class recording 794 contracts, and the price savings class recording 688 contracts.

Thus, to balance this dataset and mitigate the skewness influence on the results, three different techniques were tested:

Oversampling: The oversampling method is based on multiplying the number of samples of the minority class until it equals the number of majority class samples (Brownlee, 2020b). The oversampling in this study was done randomly (i.e. random resampling) so that each output class had 3,290 cases;
Class weights balancing: The class weights method overcomes unbalances by assigning higher weights to the minority class and lower weights to the majority class, effectively undermining the over-representativeness of the majority class (Kotsiantis et al., 2005). During model configuration, more precisely during training, the Keras library allows for configuring a weight value to each class, represented by a decimal number, multiplied by the number of cases to balance the data (Keras, 2023a). In the case of this study, the weights given to the classes were 0.214/0.866/1;
Synthetic Minority Oversampling Technique (SMOTE) method: SMOTE is an oversampling technique that generates synthetic samples from the minority class. It obtains a synthetically class-balanced training set (Blagus and Lusa, 2013). SMOTE works by selecting samples close to the minority class, drawing a line between them, and drawing new samples at different points along that line (Chawla et al., 2002). SMOTE matched the lower classes to the class with the most cases, i.e. several synthetic cases were produced until 3,290 cases were made up for each category.

Although all the techniques produced a balanced dataset, a test was performed using the Adam algorithm (Kingma and Ba, 2014) (one of the tested model architectures) to benchmark the different data treatment techniques.

A fixed configuration was set for the algorithm across all balancing methods to benchmark them and retrieve their precision and efficiency. In order to enhance the calculation speed and ensure the algorithm’s convergence, empirical tests were carried out using various settings. Although there is no universally prescribed configuration for ML models, the best-fitting settings employed in this test included a train-test split of 0.33, a batch size of 32, and 50 epochs. Table 2 presents the results.

The results presented in Table 2 allow for comparing the three strategies. The Class Weights strategy is the underachieving technique compared to the other two. Moreover, the oversampling and SMOTE techniques achieved similar results; however, despite the oversampling method performing the 50 epochs in a shorter time than the SMOTE method, the latter had a 0.03 improvement in validation accuracy.

Concerning the macro average results, the SMOTE method achieved the best results among the three techniques, with 0.85 across all metrics.

As shown in Figure 2, the accuracy of these two methods over time favours the SMOTE method. The curve of Figure 2a, compared to Figure 2b, shows that the SMOTE method obtained a smooth train curve, demonstrating more stable training results than the oversampling method. Additionally, SMOTE reaches the plateau of accuracy sooner than the oversampling method for optimal results (approximately 85% validation accuracy at ten epochs, compared with 82% validation accuracy at 22 epochs for the oversampling method). Due to all these factors, the SMOTE method has been applied to balance the dataset.

2.4 Exploratory data analysis

After the data treatment, this study performed a correlation test between the 12 input features and the target feature. The correlation test in exploratory data analysis allows for determining which features overlap each other and which may be discarded. Figure 3 shows a heatmap of the correlation between each influencing factor. When the correlation is close to 1 or -1, it is high; when it is close to 0, it is low. Negative numbers represent inverse correlations.

The heatmap of Figure 3 shows that there are groups of features with significant correlations on the balanced dataset. For example, there is a high correlation between the “Award Criteria Category” and the “Weight of the price factor in award criteria” (0.99). Moreover, the “Base Tender Price” and “Initial Price” features are also closely related (0.98). Figure 3 shows that the target feature primarily correlates with the “Publication Year”, the “Award Criteria Category”, as well as the “Weight of the price factor in award criteria”. Additionally, the least correlated features with the “Performance Class” are the “District Code” and the “Number of Bidders”.

To further understand the impact of the features in the model, a feature importance analysis was performed using the Random Forest method (Breiman, 2001), namely, the impurity-base feature importance and via permutation importance. Feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node. The node probability can be calculated by the number of samples that reach that node, divided by the total number of samples, so that the higher the value, the more influential the feature is (Ronaghan, 2018).

In this study, the Random Forest algorithm from Scikit-learn (SKLearn, 2023b) is implemented to retrieve the feature importance using the RandomForestClassifier class. After training the model, the class provides an embedded property (feature_importances_) that can fetch the relative importance score for each input feature. Similarly, the feature importance via permutation used the Random Forest classifier and its accuracy to retrieve the feature importance by permuting the baseline metric (training data) with the feature column of the validation set. The permutation importance is calculated with the difference between the baseline metric and the metric from permutating the feature column (Brownlee, 2020a). The analysis of features via permutation allows for calculating relative importance scores independent of the model used, which is advantageous from the perspective of comparison with the impurity-based method.

The results of the feature importance analysis are shown in Figure 4, where the most relevant features are ordered from left to right and a threshold line (in red), set at 95% importance, was placed to intersect with the cumulative sum of the importance of the most relevant features. The threshold method aided feature selection by highlighting the attributes that need to be picked to achieve a representative sample without compromising efficiency or adding unnecessary noise to the data.

Both methods highlight the “Publication Year” as the most impactful feature for prediction. Furthermore, the “Award Criteria Category” was the second most relevant feature for classification for both feature importance measurements. The remaining feature importance order differs from one method to another; despite this, some features can be selected as they are relevant in both tests. Among these features, the “Weight of the price factor in award criteria”, the “Initial Price”, the “Base Tender Price”, the “Number of Bidders”, and the “Submission” and “Execution Deadlines” are highlighted as significant predictors both tests.

Moreover, the permutation method considers the “CPV Code”, the “Environmental Criteria”, and the “Initial Price Category” irrelevant for algorithmic classifications. However, this assessment is somewhat contradicted by the impurity-based method, as it considers the “CPV Code” relevant for classification while viewing the “District Code” as irrelevant.

This analysis’s results allow the selection of the features to be considered during the model training and testing presented in the following chapter.

2.5 Feature selection

The exploratory data analysis performed in the previous subsection allowed the selection of the features to be considered for the different model architectures. However, this analysis cannot be considered exclusively since the features must translate into a realistic construction procurement process, in which the owner may introduce his project data. To this end, taking into account the correlations between variables, the feature importance analysis, and the nature of the construction procurement process, eight features were selected among the twelve available, as seen in Table 3.

The “Publication Year” was selected as it was considered the most impactful feature. The “Award Criteria Category” and “Base Tender Price” were chosen because they can be customised at the time of the release of the procurement notice, to the detriment of the “Weight of the price factor in award criteria” and the “Initial Price” as they are significantly correlated.

Finally, the “Submission Deadline”, the “Execution Deadline”, the “Number of Bidders”, the “CPV Code”, and the “District Code” completed the group of selected features, discarding the remaining ones.

2.6 Model selection

The literature on the use of ML to predict budget compliance shows no agreement on which algorithm leads to the best results. The results of each architecture are highly related to the data that make up the model. To this end, different algorithms should be tested to find the optimal solution for each type of application.

The following algorithm types were tested to select which fits this study the best regarding its performance metrics:

Adam ANN: This is an algorithm for first-order gradient-based optimisation of stochastic objective functions. Deem as computationally efficient; this method is suited for problems that are large in terms of data and/or parameters (Kingma and Ba, 2014). This method was previously used in this study for benchmarking the data balance methods in the Data Treatment section;
Random forest: This ensemble method relies on groups of decision trees to make decisions about the classification of the target feature. The selected class is the most frequent among the decision trees (Breiman, 2001),
SVM: Initially a binary classification algorithm, SVM supervised learning algorithm performs classification by mapping a hyperplane or set of hyperplanes in a high or infinite-dimensional space, which can be used for classification by creating a divide between data points (Cristianini and Ricci, 2008; Boser et al., 1992),
Extreme Gradient Boosting (Xgboost): This algorithm compiles decision trees ensemble mechanics with gradient boosting. Similarly to Random Forest, it predicts using an ensemble of weaker models (Chen and Guestrin, 2016),
K-nearest neighbours (KNN): In KNN for classification, a sample is classified by a plurality vote of its neighbours, with the object assigned to the class most common among its nearest neighbours (Cover and Hart, 1967; Evelyn Fix, 1951).

All algorithms were developed in the Python language, in VSCode, using the Keras and Sckit-learn libraries for ML. All methods went through rigorous ablation studies to achieve the best accuracy possible. Table 4 demonstrates the settings of each algorithm in its best-performing configuration and compares the results of each model. All models used a 0.33 train-test split.

Table 4 shows that the KNN method was the model architecture that achieved the worst results. In fact, during training, it overfitted from 7 to 1,000 neighbours, which explains the low validation results and ultimately excluded it from consideration for the model.

Similarly, the XGboost and SVM methods also overfitted the sample. There is a significant discrepancy between the training and testing accuracy, and, despite the remaining metrics indicating good model adaptation to the sample, the training accuracy raises concerns about the models memorising the dataset and not identifying correlations. Another possible justification for overfitting may be that this problem may be relatively easy to classify. In this sense, it was decided to exclude the Xgboost and SVM methods from consideration.

The remaining model archetypes (Adam and Random Forest) proved capable of being applied to the problem at hand, achieving around 85% validation accuracy. Both algorithms had similar performances, so it is considered that the two could be used in this decision support model for award criteria in public construction tenders. Still, if one tries to select the best algorithm among the two, regarding the performance, the Adam ANN method had the best validation accuracy. Regarding consistency, the Random Forest algorithm obtained the slightest oscillation between training and testing metrics.

A second analysis was performed considering these two algorithms but removing the year of publication feature. This analysis was made for two reasons: to test the effects of removing the most influential variable on the performance of the algorithms; and because this feature is not customisable by the client during the definition of the projects tender. In fact, the inclusion of this variable may be artificially affecting the classifications of these algorithms since the performance of the projects in this sample varied according to the year of publication. However, the performance variations throughout the years are irrelevant information to assist the client in configuring the tender, as it cannot be changed.

In this sense, Table 5 presents the results of this second analysis, as it can be seen that the accuracy of both algorithms dropped by approximately 20% due to the elimination of the year of publication variable. In these test settings, the same observations regarding algorithm performance in the previous test are similarly evident: The Adam algorithm is the best performer, and the Random Forest method had the lowest discrepancy between training and testing.

Both algorithms registered an accuracy of around 65%. Since input variables have relatively low correlations between themselves and between the target feature, 0.65 is the realistic value one can expect from tender configuration assist algorithms for public procurement projects.

3. Discussion

This study set out to develop a model to assist procurement professionals in defining the details and award criteria for public procurement contracts for construction. Due to the difficulties in accessing data in the construction sector, this study used PPPData, an open-access database of over 5,000 public procurement contracts in Portugal. This dataset is exclusively composed of Portuguese public contracts awarded using the traditional procurement method (i.e. not considering Design and Build contracts or others). Thus, a potential bias may have been introduced. Furthermore, the data comprised contracts from 2015 to 2022, and global events like the COVID-19 pandemic and inflation crises may have influenced the projects within this timeframe. These external factors could have potentially affected project performance and tender criteria selection.

Data treatment was necessary to input data in ML models, like removing non-applicable values and balancing data, as the target feature (project performance class) was much skewed. The data sourcing and treatment needed to start this study highlight the still-emerging state of data sharing and storage in the construction industry.

The results of the data balance phase of this project indicate that the oversampling and the SMOTE methods could create a balanced dataset that achieves satisfactory performance for the problem at hand. However, the SMOTE method was the best fit because it achieves better results in fewer epochs, has a smoother learning rate, and creates new syntactical data, not just copies of previous entries.

The literature review indicates that financial aspects (e.g. base price) and bidding characteristics (e.g. submission time) significantly influence early project budget predictions. Additionally, criteria like contractor experience are also deemed important, but their omission from the Portuguese database may have hindered classification accuracy.

In a previous study, a descriptive statistical analysis was conducted on PPPData (Luís Jacques de Sousa et al., 2023). This analysis used statistical hypothesis tests, such as the t-test for paired samples, and revealed no direct correlation between the award criteria and budget compliance and that the multifactor award criteria method does not lead to overall enhanced performance. Nonetheless, the multifactor criteria method only improves performance for high-value projects. Furthermore, the study highlighted no discernible correlation between extended submission deadlines for project proposals and improved project outcomes. This lack of correlation between procurement variables and the target outcome underscores the challenges and unpredictability of making construction sector compliance predictions.

The findings of the model selection section show that, for this case, the algorithms that best fit this classification problem were the Adam ANN and the Random Forest ensemble method. Both were capable of achieving around 85% validation accuracy and F1-Score. However, these tests encounter overfitting problems with some algorithms (mainly the Xgboost and KNN methods) that can originate from the model’s architecture, the data’s characteristics or the classification problem.

Following the initial test, a subsequent examination was conducted to assess the effectiveness of the most accurate algorithms in predicting contract performance, disregarding the publication year. This second analysis is relevant since the adjudicator cannot modify this feature. Similarly to the first analysis, the best-performing algorithm in this analysis was Adam ANN, which achieved an accuracy rate of 68%. Thus, this study sets this benchmark value for the effectiveness of decision support algorithms in defining the configuration of public tenders.

The performance of Construction contracts can vary due to multiple reasons making it difficult to predict their compliance. Thus, despite this model’s relatively low value for accuracy, it provides valuable insights to the Client when deciding on the criteria for the public tender. Additionally, it accommodates quantitative and qualitative factors, which is a novelty for compliance prediction algorithms. Finally, it pioneers the test of these decision support tools in Portuguese public procurement.

In summary, forecasting compliance in construction contracts is a challenging task. Clients and Construction professionals should realistically expect that ML models can only provide suggestions on the most promising tender criteria based on probability. However, the accuracy of these models cannot guarantee which criteria will lead to contract compliance.

The proposed model aims to complement clients' decision-making processes while setting up tenders. It provides Clients with a justification for their decisions since the models created its predictions based on historical data. Moreover, it is expected to enhance the accuracy and reliability of tendering decisions, reducing the risk associated with this phase of construction regarding human errors and bias.

4. Conclusion

The use of ML algorithms to predict construction costs has been implemented in many models and paradigms in recent years. However, in this study, a cost compliance prediction model is developed to allow the client to select the best-fitting details and awarding criteria for the project in question using historical data. To this end, this algorithm utilised a dataset of over 5,000 Portuguese public procurement processes for construction from 2015 to 2022 to develop a model to predict construction cost compliance. Additionally, it performed three methods of data balancing, having found that the SMOTE method yielded the best results. Next, using a set of 12 variables, like initial price and award criteria, this study performed an exploratory data analysis followed by a feature importance analysis to select the most significant variables to be considered during model training. Finally, five different ML models were developed to predict the project’s compliance. Among the different architectures, the Random Forest ensembles and the Adam ANN best fit the problem and dataset in question. Additionally, the Adam ANN method achieved the best validation accuracy, while the Random Forest algorithm was the most consistent concerning the difference between training and validation results.

Moreover, a second analysis using only mutable feature was tested, achieving 68.3% accuracy for the Adam ANN model.

This research contributes to the existing body of knowledge regarding tender assistant models, highlighting that ML is not consistently reliable at accurately forecasting the compliance rate of construction projects solely based on procurement data. However, ML can offer valuable insights into the best-fitting procurement criteria for a tender based on probabilistic approaches. Although algorithm results may not always be more accurate than those performed by humans, the noteworthy advantage lies in their ability to deliver predictions in a fraction of the time compared to a technician. Moreover, it provides decision-makers with justifications rooted in historical data. This underscores their effectiveness as valuable tools during the procurement criteria definition process.

Future studies should aim to further the development of this tool to a usable interface to be employed by procurement technicians and test improvements to its accuracy as well as its impacts on efficiency and project financial compliance.

Figures

Figure 1

Model implementation in the procurement process

Figure 2

(a) Model accuracy for SMOTE method (b) model accuracy for oversampling method

Figure 3

Heatmap of correlation between features

Figure 4

Feature importance analysis results