1 Introduction

The recent success of Machine and Deep Learning (ML and DL) is endowing computational agents with more and more smart behaviours, ranging from text [1] and speech [2] recognition to image recognition [3], and many more [4]. Typically, such smart behaviours are learnt from data in a semi-automatic way via human-designed malleable predictors—such as neural networks—which can be algorithmically trained to fit that data. This represents a paradigm shift w.r.t. the well-established practice of software engineering where the source code governing computational agents’ smart behaviours is designed and produced by human beings. Arguably, the increased interest in ML and DL solutions may be attributed to the groundbreaking performance gain that data-driven approaches can bring in comparison to otherwise hard-to-formalise, manually-defined approaches.

However, despite being both flexible and performant, current data-driven solutions come with a number of issues. First, they are data eager—meaning that the learning agent should have access to huge amounts of examples concerning the phenomena to learn. When examples are too few, learning cannot happen. Second, learning takes time. Unlike what happens with humans, learning and exploitation of learnt information are quite distinct stages for computational agents. The trainable components of computational agents are commonly pre-trained by human designers, up to a given performance score, and then provided to the agents for exploitation. Hence, no further learning typically occurs after that. Finally, the benefits of data-driven solutions come at the price of a reduced understandability of smart behaviours in the eyes of human users. Modern ML solutions rely upon sub-symbolic predictors which work as black boxes—so, humans cannot observe them and tell what a predictor has learnt and how it computes its predictions. This may be troublesome when predictions are used by agents to automate decision-making in critical domains, such as e-health [5] and smart transportation systems [6].

The exploitation of sub-symbolic predictors comes with further issues in the multi-agent systems (MAS) context. As shown by recent surveys [7], it is quite common for the MAS community to represent knowledge and agents’ behaviours symbolically both at the conceptual and technological level—often relying upon computational logic [8]. This makes the integration of sub-symbolic predictors both a conceptual and a technical issue, to be addressed since the earliest design phase down to implementation in any MAS engineering process. As a result, it becomes essential for agent-oriented programming frameworks to be interoperable with available ML libraries, and for those libraries to provide a clear and stable API supporting the automation of ML workflows—so that intelligent agents can autonomously exploit them.

Recently, symbolic knowledge injection (SKI) [9,10,11] has emerged as a possible solution to all the aforementioned issues. SKI is the task of letting sub-symbolic predictors acquire symbolic information and behave consistently w.r.t. it. For instance, this may involve a neural network taking into account the information represented by a logic theory when drawing predictions. There, “symbolic” refers to the way knowledge is represented: we consider as symbolic any intelligible language that is naturally interpretable by both human and computational agents. This includes a number of logic formalisms, and excludes the fixed-sized tensors of numbers commonly exploited in sub-symbolic predictors. Therefore, SKI mechanisms aim at training sub-symbolic predictors towards desirable behaviours.

Thanks to SKI, agents (and their designers) are back in control. The benefits of SKI to the training of ML predictors [12] include the following ones: (1) it mitigates the issues arising from the lack of sufficient amounts of training data—as under-represented situations can be suitably represented in symbols—; (2) it reduces learning time by providing straight away the very knowledge that predictors would otherwise struggle to learn by processing huge amounts of data; (3) it improves predictors’ predictive performance in corner cases—as in the case of unbalanced and overlapping classes—; (4) it prevents predictors from working as full black boxes during their training—hence overriding the need for explanations. Furthermore, (5) it harmonises the symbolic and sub-symbolic components of intelligent agents. Hence, agent designers may take advantage of SKI to endow agents with common sense—encoded in some suitable symbolic formalism—, whereas agents themselves may exploit SKI to finely govern their sub-symbolic components—e.g., by tuning them according to their beliefs or desires.

When tuning their sub-symbolic components via SKI, agents will typically aim at maximising predictive performance. It is a common practice to assess SKI mechanisms in terms of the performance gain they introduce w.r.t. some injection-free counterpart [9, 10]. However, performance gain is not the only relevant metric an agent may intend to optimise. For instance, agents situated in resource-constrained environments may need to minimise the energy required to train/exploit ML predictors, as well as the computational resources required for their execution. Analogously, agents that need to interact with human beings may aim at maximising the intelligibility of their decision-making processes. Overall, there are several aspects of sub-symbolic predictors that agents could optimise via SKI. Along this line, as part of our recent research activities [13], we sketched a set of quality-of-service (QoS) metrics for SKI covering several aspects—ranging from energy-related to computational-cost-related ones via comprehensibility-related ones. Unfortunately, at that time we could not assess QoS metrics empirically, due to the lack of general-purpose software technologies supporting them.

The lack of viable software technologies for SKI is preventing not just the assessment of QoS metrics from [13], but also—and first and foremost—the effective exploitation of SKI methods in MAS. However, as further discussed in [11], SKI methods from the literature share a general workflow, which can be briefly summarised as follows: (1) identify a suitable predictor w.r.t. the learning task at hand; (2) attain some symbolic knowledge aimed at describing relevant situations; (3) apply some SKI method to the given predictor and knowledge, hence generating a new predictor that encapsulates the knowledge; (4) train the new predictor on the available data, as usual. Notably, the last two steps may be cyclically repeated by an agent until some target QoS score is reached. Hence, in principle, SKI methods are interchangeable at the functional level as well as at the assessment level. Along this line, we designed a unified open source software library for SKI—namely, PSyKI—supporting the interchange, comparison, and exploitation of SKI methods in arbitrary ML workflows [11]. However, support for QoS-based assessment is currently missing.

Accordingly, in this paper we extend our previous work by proposing a full modelling of the QoS metrics for SKI, as well as their empirical evaluation via PSyKI. To serve this purpose, we also extend PSyKI’s design, API, and codebase to support our QoS metrics. Our empirical analysis demonstrates both the soundness of the proposed metrics and the versatility of PSyKI as the first software tool supporting the application, interchange, and numerical assessment of SKI techniques. Our proposals are—to the best of our knowledge—the first attempt to introduce QoS metrics for SKI, along with the software tools enabling their practical exploitation by both human and computational agents.

The paper is organised as follows. Section 2 introduces some relevant definitions and summarises the background on SKI methods. Section 3 formally defines the QoS metrics, whereas Sect. 4 overviews PSyKI and describes the integration of QoS metrics. Section 5 outlines the experiments and their design, and discusses the results. Finally, Sect. 6 summarises the key findings and contributions, by highlighting the importance of the new QoS metrics in effectively evaluating the strength of SKI mechanisms.

2 Background and definitions

The benefits of sub-symbolic predictors in MAS come along with the issues deriving from their black-box nature and uncertain optimisation processes. This is why incorporating symbolic knowledge into the sub-symbolic prediction process could bring about a number of advantages. For instance, predictors may be able to make informed decisions based on prior knowledge, reducing the chances of producing unexpected results. Moreover, the injection of symbolic knowledge often results in improved prediction performance, as the predictors are better equipped to handle data with inherent structure and meaning. Therefore, a number of recent works [9, 10] have leveraged symbolic knowledge injection to mitigate the common problems of sub-symbolic predictors (lack of interpretability, poor generalisation, fuzzy optimisation procedure, etc.). The underlying idea is to enable the sub-symbolic predictor to take into account some prior symbolic knowledge when drawing its predictions, thus making the predictor more controllable.

The practice of SKI involves a rather simple workflow, yet it may rely on several different injection algorithms, often tailored to specific sorts of predictors or symbolic languages. Differences among those algorithms can be relevant, especially w.r.t. how they perform injection. Hence, we can broadly define SKI as “any algorithmic procedure affecting how sub-symbolic predictors draw their inferences so that predictions are either computed as a function of, or made consistent with, some given symbolic knowledge”.

More formally, given an injection procedure \(\mathscr{I}\), some symbolic knowledge K, and a sub-symbolic predictor N aimed at solving some supervised learning task, we define the “knowledge-aware” predictor \(\hat{N}\) as the result of the application of \(\mathscr {I}\) to K and N:

$$\begin{aligned} \hat{N} = \mathscr {I}(K, N) \end{aligned}$$

There, we call N the uneducated predictor—as it has not yet undergone injection—, and \(\hat{N}\) the educated one.

Focussing on the inputs of SKI—namely, the symbolic knowledge K and the sub-symbolic predictor N—, nearly all SKI methods and techniques available in the literature assume that: (1) K is a logic knowledge base (KB, henceforth) of logic formulæ, encoded via some subset of first-order logic (FOL) [14], while (2) N is a neural network (NN). To support this statement, Table 1 reports a sample of the most relevant SKI techniques from the literature, pointing out the sorts of knowledge and predictor they support.

Should we speculate on the possible motivations behind the choice of FOL and NN, we would argue that logic brings great flexibility in representing knowledge in a way which is similar to how humans reason, whereas neural networks bring about malleability, composability, and trainability to intelligent systems—as they can be structured in various ways to serve diverse purposes.

Many algorithms may fit the definitions above—mostly differing in the particular sort of logic formalism, injection strategy, or neural network they support. For a more detailed discussion on SKI algorithms see [12, 15, 16].

2.1 Knowledge injection workflow

SKI assumes the input knowledge consists of crisp logic formulæ expressed in a logical language of choice. Such formulæ must somehow be converted into numeric form for injection to take place. Later on, injection is performed by training neural networks as usual. In other words, SKI is a process affecting networks before and during training.

Overall, SKI relies on three basic operations, namely parsing (\(\Pi\)), fuzzification (\(\zeta\)), and embedding (E).

The first step of any SKI method is parsing the input formulæ, hence producing a machine-interpretable and -browsable representation—namely, abstract syntax trees (ASTs). ASTs are then visited to produce a numeric representation of the input formulæ, either consisting of functions of real numbers (e.g. a loss function, a neural network structure) or of an array of real numbers.

Fuzzification is the process of converting some AST formula into a function of the form \(f: \mathbb {R}^n \rightarrow \mathbb {R}\), whose input values are numeric interpretations of the original formula, whereas its outputs are either truth degrees (cf. [35])—e.g. 1 means \(\texttt{true}\), 0 means \(\texttt{false}\), \(x \in [0, 1]\) means “\(\texttt{true}\) with degree \(x\)”—, or penalties (cf. [34])—e.g. 0 means “no penalty”, \(x \ne 0\) means “penalty proportional to \(\vert x \vert\)”. This is required when converting formulæ into loss functions (as in constraining methods) or activation functions (as in structuring methods).
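As a minimal illustration, assuming a toy rule stating that a BCW biopsy should be classified as malignant whenever the feature \(BareNuclei\) exceeds 5 (the feature index used below is an assumption), a fuzzified penalty function might look as follows; real fuzzifiers derive such functions automatically by visiting the formula’s AST and support much richer logics.

```python
import tensorflow as tf

def fuzzify_toy_rule(bare_nuclei_index: int = 5):
    """Illustrative fuzzification of `malignant :- BareNuclei > 5` into a
    penalty function: 0 means 'consistent', larger values mean 'violation'."""
    def penalty(x: tf.Tensor, predicted_malignant: tf.Tensor) -> tf.Tensor:
        # Truth degree of the premise "BareNuclei > 5", squashed into [0, 1]
        premise = tf.sigmoid(x[:, bare_nuclei_index] - 5.0)
        # Penalise predictions negating the conclusion when the premise holds
        return premise * (1.0 - predicted_malignant)
    return penalty
```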

Embedding is the process of converting some formula’s AST into a numeric array of fixed size. This is necessary when converting formulæ into numeric datasets for training.

As also highlighted in Table 1, there are three sorts of SKI methods: those that perform injection by constraining the training of neural networks, those that affect their internal structure, and finally those that embed the knowledge into training data.

Table 1 Summary of relevant SKI techniques (not exhaustive)

2.2 Categorisation of injection methods from the literature

In the remainder of this section we delve into the details of the various injection strategies exploited by SKI, and elaborate on how the overall performance of SKI techniques can be assessed.

2.2.1 Constraining neural networks

The key idea behind SKI techniques of this sort is to steer the learning process of a neural network to make it behave consistently w.r.t. some given logic formulæ. This is achieved by penalising the network during training, whenever it violates the logic formulæ. Figure 1 provides an overview of the approach.

Fig. 1 Symbolic knowledge injection via constraining: data flow

A common way to penalise the network under training is by altering the loss function [9, 10, 36]. The neural network training process essentially consists in the use of gradient descent [37], i.e. an optimisation process where the weights of NN synapses are iteratively modified so as to minimise a loss function. Most commonly, the loss function quantifies the overall predictive error of the network: the greater the error, the greater the loss. However, when SKI is applied, the loss function also takes into account the network’s consistency with the logic formulæ. In this way, the learning process not only minimises the network error w.r.t. data, but also its error w.r.t. symbolic knowledge. In other words, the predictor is constrained to be compliant with the prior knowledge up to a certain extent.

The underlying assumption behind injection mechanisms of this kind is that logic formulæ should be converted into functions of real numbers of the form:

$$\begin{aligned} f: \mathscr {X} \rightarrow \mathbb {R}_{\ge 0} \end{aligned}$$

where \(\mathscr {X}\) is the same input space of the network, and \(\mathbb {R}_{\ge 0}\) is the set of non-negative numbers—here representing penalties. There, for any given input vector \(\textbf{x}\), the value \(f(\textbf{x})\) represents the discrepancy between the network prediction corresponding to \(\textbf{x}\) and what the logic formulæ prescribe for \(\textbf{x}\). Hence, \(f(\textbf{x}) = 0\) means that the network is behaving consistently w.r.t. the formulæ, hence it should get no penalty. Conversely, \(f(\textbf{x}) > 0\) means that the network behaviour is deviating from the formulæ, hence it should be penalised.
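For concreteness, a minimal Keras sketch of how such a penalty can be folded into the loss during training follows; the knowledge_penalty callable stands for the fuzzified formulæ (here simplified to depend on the prediction only) and the 0.5 weight is an arbitrary assumption, as actual constraining methods differ in how they compute and balance this term.

```python
import tensorflow as tf

def constrained_loss(knowledge_penalty, weight: float = 0.5):
    """Mixes the usual data-driven error with a knowledge-consistency penalty,
    as done by constraining-based SKI methods."""
    cce = tf.keras.losses.CategoricalCrossentropy()

    def loss(y_true, y_pred):
        data_error = cce(y_true, y_pred)                              # error w.r.t. data
        knowledge_error = tf.reduce_mean(knowledge_penalty(y_pred))   # error w.r.t. formulae
        return data_error + weight * knowledge_error

    return loss

# Usage sketch (model and my_penalty are placeholders):
# model.compile(optimizer="adam", loss=constrained_loss(my_penalty), metrics=["accuracy"])
```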

Some relevant SKI algorithms based on constraining can be found in [9, 10, 25, 29, 30, 38].

2.2.2 Structuring neural networks

The key idea behind SKI techniques of this sort is to construct (a portion of) the neural network undergoing injection in such a way as to make it reflect some given logic formulæ [18, 20, 30, 39, 40]. The resulting network is then trained as usual. However, given that (part of) its internals are tailored to the logic formulæ, the network is expected to have higher predictive performance—or at least require less training effort to reach good performance scores—in all situations which are described by the logic formulæ. Figure 2 provides an overview of the approach.

Fig. 2 Symbolic knowledge injection via structuring: data flow

The underlying assumption behind structuring SKI methods is that (a portion of) a neural network can be constructed to mimic the evaluation of one or more logic formulæ. This is commonly achieved by letting neurons and synapses represent either logic variables or combinations of logic expressions via logic connectives or arithmetic operators. Methods may then decide to keep the weights of the structured portion of network free to vary during training, in order to let them adapt to the peculiarities of the training data at hand.
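For illustration only, the sketch below hand-builds a tiny Keras module approximating the conjunction of two Boolean inputs, to be attached alongside the trainable portion of a network; actual structuring methods derive such modules automatically from the formulæ, and the chosen weight values are just one plausible initialisation.

```python
import tensorflow as tf

def conjunction_module(trainable: bool = True) -> tf.keras.layers.Layer:
    """Approximates `a AND b` for inputs in [0, 1]: sigmoid(5*(a+b) - 7.5)
    is close to 1 only when both inputs are close to 1. Weights may be left
    trainable so that they can adapt to the training data."""
    return tf.keras.layers.Dense(
        1,
        activation="sigmoid",
        kernel_initializer=tf.keras.initializers.Constant(5.0),
        bias_initializer=tf.keras.initializers.Constant(-7.5),
        trainable=trainable,
    )

# Usage sketch: plug the knowledge-derived module into a larger network
inputs = tf.keras.Input(shape=(2,))         # the two logic variables a and b
rule_output = conjunction_module()(inputs)  # knowledge-derived portion
```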

Some relevant SKI algorithms based on structuring can be found in [17,18,19, 21, 26, 27, 31, 33].

2.2.3 Embedding knowledge into neural networks

The key idea behind SKI techniques of this sort is to convert symbolic knowledge into numeric-array form to be used as training data [41,42,43]. Predictors trained with such techniques are usually used as logic reasoning engines. Figure 3 provides an overview of the approach.

Fig. 3 Symbolic knowledge injection via embedding: data flow

The underlying assumption behind embedding SKI methods is that input knowledge can be represented as a (possibly multi-dimensional) array of numbers. This, in turn, requires the knowledge to be ground (i.e. variable-free)—a requirement which heavily limits what logic can actually represent. So, in practice, embedding techniques are commonly applied to simple (i.e. less expressive) logics such as description logics. There, symbolic information consists of knowledge graphs [44], where nodes represent entities and edges represent relations among those entities. The graphs’ adjacency matrices are essentially numeric arrays—and this is one of the tricks exploited by embedding-based SKI methods.
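As a minimal sketch of the adjacency-matrix view mentioned above, with a made-up toy graph, the following code turns ground triples into purely numeric arrays; embedding methods then typically learn dense vectors from such structures rather than using the raw matrices directly.

```python
import numpy as np

# Toy ground knowledge as (subject, relation, object) triples -- illustrative only
triples = [("alice", "friend_of", "bob"),
           ("bob", "friend_of", "carol"),
           ("carol", "works_with", "alice")]

entities = sorted({s for s, _, _ in triples} | {o for _, _, o in triples})
index = {e: i for i, e in enumerate(entities)}
relations = sorted({r for _, r, _ in triples})

# One adjacency matrix per relation: a purely numeric view of the graph
adjacency = {r: np.zeros((len(entities), len(entities))) for r in relations}
for s, r, o in triples:
    adjacency[r][index[s], index[o]] = 1.0
```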

Some relevant SKI algorithms based on knowledge graph embedding can be found in [23, 24, 28, 32].

2.3 Injection assessment

It is common for works in the SKI realm to measure the strength of their mechanism as the gain in performance achieved by the SKI predictor against its uneducated counterpart. In that case, the effectiveness of the injection mechanism \(\mathscr {I}\) when applied to a neural network N to inject the knowledge K is measured via some performance score \(\pi\) (accuracy, F1-score, MSE, etc.), aimed at assessing the performance of N with respect to some test dataset T. More formally:

$$\begin{aligned} \epsilon _{K, N, \pi , T}(\mathscr {I}) = \pi (\mathscr {I}(K, N), T) - \pi (N, T) \end{aligned}$$
(1)

In other words, the effectiveness of some injection mechanism \(\mathscr {I}\) may be assessed differently depending on which knowledge base, neural network, and dataset it is applied to.

While being indicative of the quality of the SKI approach w.r.t. predictive performance, that metric does not capture every aspect of knowledge injection, as there exist multiple properties that one may be willing to optimise through SKI—see Sect. 3.1. Due to the sudden rise in research interest towards sustainable AI approaches [45], there exists the opportunity to analyse if and how SKI brings about benefits in terms of the computations, energy consumption, and data required to train and deploy sub-symbolic approaches. In the remainder of this paper we identify other metrics to reliably measure the performance of SKI.

3 SKI quality-of-service metrics definition

In this section we propose and analyse a novel set of metrics for assessing the quality of SKI systems. An overview of our proposals, along with a brief classification, is provided in Sect. 3.1. Roughly speaking, we introduce metrics for measuring SKI methods’ efficiency—under multiple goodness criteria.

3.1 Overview

The current practice of SKI assessment relies exclusively on measuring improvements in the predictive performance of some educated predictor over an equivalent uneducated counterpart. However, predictive performance is not the only relevant benefit of SKI one may be willing to measure.

There exist multiple aspects of neural predictors which may be affected by SKI—and for which metrics should be defined. Just to name a few, SKI may affect the memory footprint, the latency, as well as the data and energy requirements of the predictors it is applied to. Overall, all such properties contribute to what we informally call a predictor’s efficiency. In the remainder of this paper we rely on the following efficiency properties:

  • memory footprint, i.e., the size of the predictor under examination;

  • latency, i.e., the time required to run a predictor for inference;

  • data efficiency, i.e., the amount of data required to train the predictor;

  • energy consumption, i.e., the amount of energy required to train/run the predictor;

other than, of course:

  • predictive performance, e.g. accuracy, F1-score, mean absolute/squared error, etc.

For the sake of brevity, we also denote as efficiency metrics any function aimed at measuring some efficiency property.

Efficiency metrics provide a score measuring how much some efficiency property P of a given uneducated predictor N improves in its educated counterpart \(\hat{N}\). Of course, the resulting score may be largely influenced by a number of different aspects, such as:

(A1):

Knowledge quality and coverage. The educated predictor \(\hat{N}\) is attained by injecting some input knowledge K. Furthermore, both N and \(\hat{N}\) are aimed at addressing the same learning task—say, classification or regression—, and they are both trained upon some training dataset D, which describes the task. Questions that may arise are, for instance: (1) are K and D coherent? (2) does K cover situations which the data in D exemplify? (3) is K consistent, coherent, and correct? (4) can we say the same for D? Regardless of the particular efficiency property P being measured, the resulting score may greatly vary depending on the answers to these questions. So, in other words, efficiency measures depend on the particular input knowledge (K) and data (D) being used during SKI.

(A2):

Baseline quality. As both the educated (\(\hat{N}\)) and uneducated (N) predictors are targeting the same learning task, one may wonder whether N is adequate to address that task. In this setting, questions that may arise are: (1) is N biased [46] in the statistical sense? (2) if it is, can we expect \(\hat{N}\) to carry any observable improvement on some efficiency measure P? (3) even in the case where N is not biased, is the selected injection mechanism \(\mathscr {I}\) adequate for N? From these questions we understand that efficiency measures may also depend on the nature of the input predictor (N), and, of course, on the injection mechanism of choice (\(\mathscr {I}\)).

(A3):

Task at hand. The learning task targeted by both N and \(\hat{N}\) determines the training dataset as well as the test dataset T. The choice of T impacts the assessment of both N and \(\hat{N}\). Therefore, it may impact the score of any efficiency measure as well. So, efficiency measures may finally also depend on the target learning task, and, consequently on the test data (T).

Summarising, efficiency measures assess some injection method \(\mathscr {I}\) in a very specific setting that depends on (1) the particular knowledge to be injected, (2) the sort of predictor undergoing injection, (3) the training data, and (4) the test data adopted. In other terms, any efficiency measure should be parametric w.r.t. K, N, D, and T.

Accordingly, in the following we propose the implementation and formalisation of metrics to assess the efficiency of SKI. In particular, as discussed at the beginning of this section, memory footprint, latency, energy consumption, and data efficiency are introduced as key performance indicators of SKI. Our objective is to assess the efficiency of SKI in terms of computational resource usage, and to provide insight into how these metrics could be used to inform the design and optimisation of SKI-based systems, with a particular emphasis on their potential benefit within MAS frameworks. Notably, we believe that an in-depth understanding of the trade-off between performance and efficiency is essential for the implementation of AI predictors in MAS frameworks.

3.2 Memory footprint

In the context of MAS, sub-symbolic predictors are gaining importance as the field moves towards more efficient and sustainable AI. As the demand for AI predictors that can operate on resource-constrained devices—such as IoT devices, edge devices, and mobile devices—continues to rise, researchers have focused on developing solutions that require less memory and computational resources [47,48,49].

To address those concerns, several metrics for measuring the memory footprint of AI predictors—in particular the sub-symbolic ones—have been recently proposed in the literature [50,51,52]. For instance, in [13], the authors propose measuring neural networks’ memory footprint by counting the number of parameters they are composed of—i.e., essentially, the amount of synapses composing each neural network. Alternatively, some authors leverage metrics such as Floating Point OPerations (FLOPs) [53] or Multiplication Addition Computations (MACs) [54], which respectively measure the total number of operations, or of multiplications and additions, required to perform a single inference. MACs consider solely multiplications and summations as they represent the most common computations in NNs. These measures are indicative of the amount of memory required either to fit the whole sub-symbolic predictor—total number of parameters—or to run it—FLOPs and MACs. Besides being intuitive, those metrics are also effective for measuring predictors’ complexity and the overall computational memory efficiency. Here, we consider leveraging such measures to analyse the efficiency gain of SKI approaches. In other terms, we consider the ability of SKI mechanisms to produce lightweight sub-symbolic predictors—in terms of memory occupation.

The key insight here is that knowledge injection may lift part of the learning burden from the predictor at hand, by relieving the network from the need to learn complex or data-uncovered notions via trial-and-error. Indeed, the a-priori concepts carried by the input knowledge might now be injected instead of being learnt in a data-driven way. As a result, the amount of notions that sub-symbolic predictors must learn in a data-driven way might be significantly reduced. Fewer concepts to be learned are typically associated with fewer parameters, FLOPs, and MACs—or, in other words, a smaller memory footprint [55]. In the context of SKI, we define the memory footprint improvement score \(\mu _{\Psi , K, N}(\mathscr {I})\) as the amount of memory saved by the educated network \(\hat{N}\) w.r.t. its uneducated counterpart N. The higher the score, the more memory efficient the educated predictor is w.r.t. the uneducated one. However, as one may measure the memory footprint of a sub-symbolic predictor in different ways – e.g., by counting the number of parameters, FLOPs or MACs—, our scoring function is parametric in \(\Psi\)—which represents the memory footprint metric of choice. More formally:

$$\begin{aligned} \mu _{\Psi , K, N}(\mathscr {I}) = \Psi (N) - \Psi (\mathscr {I}(K, N)) \end{aligned}$$
(2)

where \(\hat{N} = \mathscr {I}(K, N)\) represents the educated predictor attained by injecting K into N.
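For instance, when \(\Psi\) counts trainable parameters, the score can be computed directly from two Keras models; uneducated and educated below are placeholders for N and \(\mathscr {I}(K, N)\).

```python
import tensorflow as tf

def parameter_count(model: tf.keras.Model) -> int:
    """Psi: memory footprint measured as the number of parameters."""
    return model.count_params()

def memory_footprint_gain(uneducated: tf.keras.Model,
                          educated: tf.keras.Model) -> int:
    """mu(I) = Psi(N) - Psi(I(K, N)), cf. Eq. (2): positive values mean
    the educated predictor is smaller than the uneducated one."""
    return parameter_count(uneducated) - parameter_count(educated)
```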

It is worth noticing how the proposed memory footprint score may be influenced by quality and coverage of the input knowledge (A1), as well as by the memory footprint of the input predictor N (A2). About (A1), the reason is simple: the better the input knowledge, the lower the expected memory requirements of the educated predictor. Similarly, as far as (A2) is concerned, the better the input predictor, the lower the expected memory footprint improvements of the educated predictor. However, one may also notice from Equation (2) that our memory footprint score is not parametric w.r.t. the task at hand (A3). The reason is simple: the memory footprint of a neural network is not task-dependent, as it is a structural property of the neural network itself.

Finally, we stress that the memory footprint of the educated predictor is expected to be lower than that of the uneducated one. Indeed, our score is measuring the memory footprint improvement. A negative score \(\mu _{\Psi , K, N}(\mathscr {I})\) means that the educated predictor is more memory-hungry than the uneducated one—i.e., that the SKI approach is not effective in reducing the memory footprint of the input predictor.

3.3 Energy consumption

The relationship between MAS and energy consumption is complex and articulated. In fact, to function effectively in resource-constrained environments, MAS require AI systems that consume a reduced amount of energy. Moreover, the distributed nature of MAS and the increasing demand for AI across a variety of applications drive the need for scalable and power-efficient solutions. This need is, however, aggravated by the dynamic and real-time requirements of many MAS applications, which can result in high energy consumption due to computational and resource requirements. The complexity of a MAS, with its multiple agents and agent interactions, adds a new level of difficulty, especially when processing large amounts of data or executing complex algorithms. Altogether, those factors make energy consumption a crucial aspect of the design and implementation of AI systems in MAS frameworks.

Overall, there are numerous ways to address this problem. One approach could be to use energy-efficient hardware, such as low-power processors, or to use distributed and federated learning techniques that can distribute computation complexity across multiple devices [56]. Another approach could exploit more efficient algorithms and data structures so as to reduce the amount of computation required by agents to process data. Along that line, the integration of sub-symbolic predictors, which require less memory and computational resources than conventional symbolic AI approaches, could meaningfully reduce energy consumption. However, improvements can still be made from an energy point of view by making the sub-symbolic systems more efficient. For instance, several techniques rely on ad-hoc strategies to compress or optimise sub-symbolic predictors.

In this context, we see SKI approaches as providing MAS designers with a huge opportunity. In fact, the introduction of injection mechanisms in the data-driven pipeline of sub-symbolic training mechanisms may reduce the amount of computations required to train and run sub-symbolic predictors. Indeed, knowledge injection reduces the complexity of the learning process, providing another source of knowledge other than the training data itself. Thus, one may be interested in assessing whether and to what extent SKI mechanisms contribute to reducing the amount of computations required by a sub-symbolic predictor along its life-cycle.

We propose a new score aimed at measuring the energy consumption of SKI approaches. This is tightly related to the memory footprint score from Sect. 3.2. Indeed, it is usually the case for smaller predictors to require less energy to train and run. However, there may exist memory-efficient predictors requiring a higher amount of energy to train and run, such as sparse ones. Indeed, sparsity induces a lower amount of operations, but it is usually not effectively implemented at the hardware level, increasing power consumption [53]. Therefore, energy consumption is a property which is worth measuring in its own right.

In order to analyse energy consumption as well as the possible improvements that SKI could bring about, we first need to define the life cycle of AI predictors, analysing the resource hungriness of each stage. In order to build and deploy a data-driven AI solution, a number of stages need to be completed, namely:

  1. Model definition, where data scientists analyse the task at hand and select the most adequate sorts of sub-symbolic predictors, and the most promising hyperparameter assignments for those predictors.

  2. Model training, where the parameters of the sub-symbolic predictor of choice are automatically tuned on the training data via some sort of training algorithm. There, the amount of training samples (as well as their dimensionality) may impact energy consumption. Indeed, training algorithms commonly require running the predictor on the data and updating it several times.

  3. Model testing, where the predictor is tested against a—limited—set of testing samples to check whether its performance is satisfactory. As in the training case, energy consumption here may be affected by the amount (and dimensionality) of testing samples.

  4. Model deployment, where the predictor runs multiple times, with a frequency which depends on the specific application at hand.

From the definition of the data-driven AI life-cycle, it is possible to highlight that the training and deployment phases are the most resource-hungry. Indeed, training requires a huge—yet predictable—amount of predictor executions and updates, whereas deployment might be very energy demanding depending on the predictor usage frequency and life expectancy—which are typically hard to anticipate.

Accordingly, as far as energy consumption is concerned, we are interested in measuring the energy consumption of the training and deployment phases, individually. More precisely, for the training phase, we are interested in measuring the energy consumption of the training algorithm itself, hence excluding the cost of the inferences drawn during the training process—as their cost is expected to be analogous to the one of the deployment phase.

Notably, this distinction allows us to evaluate the impact of SKI during both the training and deployment phases—which may, in general, be significantly different. In fact, we expect SKI to decrease the energy consumption of the deployed predictors, at the price of an increased energy consumption of the training phase.

Delving into the details of the energy consumption measurements, we start by defining the \(\Upsilon ^\textsf{i}\) score, aimed at measuring the average energy consumed by a sub-symbolic predictor N on a per-single-inference basis:

$$\begin{aligned} \Upsilon ^\textsf{i}_{\upsilon }(N, T) = \frac{1}{\vert T \vert } \sum _{t \in T} \upsilon (N, t) \end{aligned}$$
(3)

Our definition assumes a function \(\upsilon (N, t)\) is available to measure the energy consumption of a single forward run of a sub-symbolic predictor N on a single sample t. Such a function may for instance estimate the heat dissipated by the hardware running the predictor, during the single inference N(t). Under that assumption, Eq. (3) measures the average energy consumption of a sub-symbolic predictor N on a test dataset T composed of several samples.

We now define the \(\Upsilon ^\textsf{t}\) score, aimed at measuring the average energy consumed while training a sub-symbolic predictor N on a training dataset T:

$$\begin{aligned} \Upsilon^{\textsf{t}}_{\upsilon , \gamma }(e, N, T)= \frac{\gamma (e, N, T)}{e \cdot \vert T \vert } - \Upsilon^{\textsf{i}}_{\upsilon}(N, T) \end{aligned}$$
(4)

Our definition assumes the training involves e epochs, and that during each epoch the whole training set T is used to update the predictor N. The definition also assumes \(\gamma (e, N, T)\) is a function estimating the overall energy consumed by the training phase as a whole—including the energy consumed by the inferences drawn during the training process. Similarly to \(\upsilon\), function \(\gamma\) may for instance estimate the heat dissipated by the hardware running the predictor, during the whole training process. Under such assumptions, Eq. (4) measures the average energy consumption required by the predictor N for a single update, during its training on the dataset T.
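The sketch below mirrors Eqs. (3) and (4); the measurement functions \(\upsilon\) and \(\gamma\) are passed in as callables (e.g. wrappers around hardware power counters), and how energy is actually probed is deliberately left unspecified.

```python
def per_inference_energy(upsilon, predictor, samples):
    """Upsilon^i: average energy of a single forward run (Eq. 3)."""
    return sum(upsilon(predictor, t) for t in samples) / len(samples)

def per_update_energy(gamma, upsilon, epochs, predictor, samples):
    """Upsilon^t: average energy of a single update during training (Eq. 4),
    i.e. the overall training energy per sample and epoch, minus the
    per-inference cost."""
    total = gamma(epochs, predictor, samples)
    return total / (epochs * len(samples)) - per_inference_energy(upsilon, predictor, samples)
```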

We can now define the energy consumption improvement of a SKI mechanism as the amount of energy saved by the educated predictor, compared to its uneducated counterpart. Again, we distinguish between energy consumption during training and energy consumption during inference. Along this line, we introduce two scores, namely \(\varepsilon ^\textsf{i}_{\upsilon , K, N, T}(\mathscr {I})\) (resp. \(\varepsilon ^\textsf{t}_{\upsilon , \gamma , K, N, T}(\mathscr {I})\)), aimed at measuring the energy consumption improvement of a SKI mechanism \(\mathscr {I}\), during inference (resp. training). More formally:

$$\begin{aligned} \begin{aligned} \varepsilon ^\textsf{i}_{\upsilon , K, N, T}(\mathscr {I})&= \Upsilon ^{\textsf{i}}_{\upsilon }(N, T) - \Upsilon ^\textsf{i}_{\upsilon }(\mathscr {I}(K, N), T) \\ \varepsilon ^{\textsf{t}}_{\upsilon , \gamma , e, K, N, T}(\mathscr {I})&= \Upsilon ^{\textsf{t}}_{\upsilon , \gamma }(e, N, T) - \Upsilon ^\textsf{t}_{\upsilon , \gamma }(e, \mathscr {I}(K, N), T) \end{aligned} \end{aligned}$$
(5)

where \(\hat{N} = \mathscr {I}(K, N)\) represents the educated predictor attained by injecting K into N, and T is a reference dataset of choice—most commonly, the training set in the case of \(\varepsilon ^\textsf{t}\), and the test set in the case of \(\varepsilon ^\textsf{i}\).

It is worth noticing how the proposed energy consumption scores may be influenced by all aspects (A1)–(A3). About input knowledge (A1), the reason is simple: the more complex the input knowledge, the higher (resp. lower) the expected energy consumption of the educated predictor during training (resp. inference). Similarly, as far as the input predictor is concerned (A2), the more energy-hungry it is, the higher we expect the educated predictor’s energy consumption improvements to be. Lastly, the task at hand (A3) has a clear effect on our scores, as they are both parametric in the dataset—energy consumption improvements are typically task-specific.

3.4 Latency

The amount of time required to draw a single prediction is one of the most relevant and impactful efficiency measures for sub-symbolic predictors. In what follows, we refer to such a time-lapse as latency. A small latency indicates that a sub-symbolic predictor is able to compute relevant predictions in useful time—which is an important property in real-world applications. For instance, low latency is essential in those scenarios where human lives depend on the timely response of some AI system, such as intelligent transportation [6] and e-health [5]. Moreover, latency assumes a relevant role in multi-agent scenarios, where collaboration between multiple intelligent entities is required, and lag due to lengthy computations cannot be tolerated [57]. Also, the processing of large amounts of data and the execution of complex algorithms, such as those used in decision-making, can result in increased latency as the system struggles to keep up with the demands of the task. As a result, MAS complexity can contribute significantly to increasing latency. This is why recent research efforts in the AI field are focussing on time-sensitive predictors.

One possible solution available to address this problem is the use of SKI approaches. By incorporating symbolic representations, SKI approaches can reduce the amount of computation required to process data, leading to reduced latency. Furthermore, the use of symbolic representations could help to reduce the complexity of the system, making it easier to predict the behaviour of the system and identify the root causes of an increased latency. As a result, we believe it is crucial to measure latency in order to assess the computational efficiency of present AI systems.

In the remainder of this section we assume latency to be computed by averaging the time required to draw a number of predictions from a reference test dataset. More formally, we define the latency of a predictor N as the average time required to draw a single prediction from a dataset T:

$$\begin{aligned} \Lambda (N, T) = \frac{1}{\vert T \vert } \sum _{t \in T} \Theta (N, t) \end{aligned}$$
(6)

where \(\Theta (N, t)\) represents the time required to draw a prediction from N on the input t.

As far as SKI is concerned, we are interested in assessing the latency gain brought by a given SKI mechanism \(\mathscr {I}\) w.r.t. some uneducated predictor. Along this line, one may be interested in figuring out whether injection increases or decreases the latency of a given predictor. Hence, we define the latency gain \(\lambda _{K, N, T}(\mathscr {I})\) introduced by some SKI method \(\mathscr {I}\) as the average difference between the inference time of the educated predictor \(\hat{N}\) and its uneducated counterpart N, over a reference test dataset T. More formally:

$$\begin{aligned} \lambda _{K, N, T}(\mathscr {I}) = \frac{1}{\vert T \vert } \sum _{t \in T} \left( \Theta (N, t) - \Theta (\hat{N}, t) \right) = \Lambda (N, T) - \Lambda (\hat{N}, T) \end{aligned}$$
(7)

where \(\hat{N} = \mathscr {I}(K, N)\) represents the educated predictor attained by injecting K into N.
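A minimal timing sketch of Eqs. (6) and (7) follows, using wall-clock time for \(\Theta\); predict_one is an assumed helper drawing a single prediction from the corresponding predictor.

```python
import time

def latency(predict_one, samples):
    """Lambda(N, T): average time to draw a single prediction (Eq. 6)."""
    elapsed = 0.0
    for t in samples:
        start = time.perf_counter()
        predict_one(t)                        # Theta(N, t): one forward pass
        elapsed += time.perf_counter() - start
    return elapsed / len(samples)

def latency_gain(predict_uneducated, predict_educated, samples):
    """lambda(I) = Lambda(N, T) - Lambda(N_hat, T), cf. Eq. (7)."""
    return latency(predict_uneducated, samples) - latency(predict_educated, samples)
```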

Similarly to the energy measurement, the latency metric is tightly related to the complexity of the educated sub-symbolic predictor, and therefore to its memory footprint. However, like energy consumption, latency is not always directly proportional to the amount of operations that constitute the predictor at hand. Sparsely-structured operations might slow down the inference process due to their inefficient computation at the hardware level. Moreover, input data complexity and quality might alter the latency achieved by the predictor. Indeed, inference over different—yet structurally analogous—samples may take vastly different timings, as shown in the attack proposed in [58].

It is worth noticing how the proposed latency score may be influenced by all aspects (A1)–(A3). About input knowledge (A1), we argue it may have both a positive and a negative effect on the latency gain. In fact, on the one hand, some SKI mechanisms might introduce additional computations—such as the ones required to process the input knowledge K in structuring methods—see Sect. 2.2.2. We expect this effect to be magnified in the case of large knowledge bases, as the number of operations required to process them is expected to be higher. On the other hand, SKI systems might also reduce the inference time of the given predictor, by reducing the number of computations required to draw a prediction—likely, at the expense of higher training times. As far as the input predictor is concerned (A2), the more time-consuming it is, the higher we expect the educated predictor’s latency gain to be. Lastly, the task at hand (A3) has a clear effect on our score, as latencies are computed over task-specific test sets.

3.5 Data efficiency

Data efficiency is a critical aspect of MAS, as the amount of data generated and processed by these systems can be substantial. Inefficient data management can result in increased latency, decreased accuracy, and increased energy consumption, all of which can negatively impact the performance of MAS.

Sub-symbolic predictors, which rely on data-driven training algorithms, can provide groundbreaking performance and flexibility to MAS, but the data-driven procedure comes with several shortcomings. These predictors require collecting significant amounts of data samples for each task to be tackled, leading to increased data storage and processing requirements. Furthermore, not only the quantity but also the quality of the data—here intended as its representativeness of the task at hand—is crucial for the predictor to learn effectively. All such requirements make the data collection process time-costly and—depending on the application—possibly affected by subjectivity or uncertainty—e.g., emotion recognition [59].

For all these reasons, recent research efforts have focused on proposing data-frugal predictors [60]. Among them, knowledge injection mechanisms play a significant role [10]. Indeed, leveraging a-priori knowledge, SKI relieves the learning process from part of its computational burdens. Concepts that an uneducated predictor would need to learn from data might now be injected into the educated predictor, instead. Hopefully, this would let the educated predictor’s learning process require lower amounts of data to attain acceptable performance levels. In this sense, SKI might be considered as a data-efficiency mechanism.

We are here interested in computing the data-efficiency gain brought by a given SKI mechanism w.r.t. some uneducated predictor. To do so, we first need to define the data footprint of a given predictor. Informally, the data footprint of a predictor N is the amount of data it requires to be trained to reach a certain performance level. Hence, assuming that a predictor N is trained on a dataset D—of samples of potentially different dimensions—, via some training process involving e epochs, and that it reached a performance level \(\pi (N, T)\) according to some performance score \(\pi\) over a test dataset T, we define its data footprint as follows:

$$\begin{aligned} \Delta _\pi (e, N, D, T) = \frac{e}{\pi (N, T)} \sum _{d \in D} \beta (d) \end{aligned}$$
(8)

where d is a single training sample, \(\beta (d)\) is the number of bytes required for its in-memory representation, and \(\pi\) is some performance score of choice. As the reader may notice, the data footprint is directly proportional to the number of epochs e, to the size of the training set, and to its dimensionality; whereas it is inversely proportional to the performance score of the resulting predictor.

We define the data-efficiency gain \(\delta _{e, K, N, D, D', T}(\mathscr {I})\) of a given SKI mechanism \(\mathscr {I}\) as the difference between the data footprint of the uneducated predictor N—trained upon some dataset D—and that of the educated predictor \(\mathscr {I}(K, N)\)—trained upon some other dataset \(D'\). The score assumes that the two predictors have been trained for the same number of epochs e, and that their performance is assessed using the same performance score \(\pi\), on the same test set T—in order to keep the comparison fair. More formally:

$$\begin{aligned} \delta _{e, K, N, D, D', T}(\mathscr {I}) = \Delta _\pi (e, N, D, T) - \Delta _\pi (e, \mathscr {I}(K, N), D', T) \end{aligned}$$
(9)
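The following is a minimal sketch of Eqs. (8) and (9), where \(\beta (d)\) is taken as the in-memory size of each (NumPy) sample and \(\pi\) is any performance score of choice, e.g. accuracy:

```python
import numpy as np

def data_footprint(epochs, performance, train_samples):
    """Delta_pi(e, N, D, T): training bytes per unit of performance (Eq. 8)."""
    total_bytes = sum(np.asarray(d).nbytes for d in train_samples)
    return epochs * total_bytes / performance

def data_efficiency_gain(epochs, perf_uneducated, train_uneducated,
                         perf_educated, train_educated):
    """delta(I) = Delta_pi(e, N, D, T) - Delta_pi(e, I(K, N), D', T), cf. Eq. (9)."""
    return (data_footprint(epochs, perf_uneducated, train_uneducated)
            - data_footprint(epochs, perf_educated, train_educated))
```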

The simplest approach to improve data efficiency in SKI mechanisms is to reduce the amount of samples that compose the training dataset—i.e. \(\vert D \vert\) in Eq. (8). However, one may also consider the option of decreasing the size of samples, either by reducing their dimensionality or by compressing their representations—in a nutshell, by reducing \(\beta (d)\) for all \(d \in D\).

To increase the data-efficiency gain, one may also consider engineering SKI and, consequently, the educated predictor. Along this line, the best strategy consists in reducing the size of the training set \(D'\) for the educated predictor by letting the input knowledge K compensate for such lack of data. Notably, this is possible because our score is sensitive to both aspects (A1) and (A3). In other words, both the input knowledge and the task at hand have a measurable effect on the data-efficiency gain. Finally, as far as the baseline predictor is concerned (A2), we argue that the more data-hungry it is, the larger the data-efficiency gain will be.

4 Integration of SKI QoS metrics into PSyKI

In this section we thoroughly discuss the PSyKI system by first providing the reader with a comprehensive overview of the system, then delving into the specifics of how QoS metrics are integrated into the PSyKI library.

PSyKI—acronym for “platform for symbolic knowledge injection”—is a Python library that provides support for the injection of prior symbolic knowledge into sub-symbolic predictors by letting the users—e.g., MAS designers—choose the most adequate method with respect to the ML task to accomplish [11]. PSyKI is a tool for intelligent systems engineers who need to either experiment with already-existing SKI algorithms or invent new ones. PSyKI is publicly available at github.com/psykei/psyki-python.

Currently, PSyKI can be used with predictors created by Tensorflow [61] and supports the following SKI algorithms: (1) KBANN, one of the first SKI algorithms proposed in the literature [17]; (2) KINS, a structuring-based injector that integrates symbolic knowledge into a target neural network [34]; (3) KILL, a constraint-based injector that affects the training of a target neural network [35].

Fig. 4 PSyKI design. Each SKI algorithm follows the workflow represented in the figure. The four yellow boxes represent the four main steps of the workflow. The first step is the parsing (\(\Pi\)) of the symbolic knowledge. The second step is the fuzzification (\(\zeta\)) of the parsed knowledge. The third step is the injection (\(\mathscr {I}\)) of the fuzzified knowledge into the uneducated target predictor (P). The fourth step is the training (\(\mathscr {T}\)) of the new predictor, making it educated (\(P'\))

Essentially, PSyKI is designed around the notion of injector, whose block diagram is shown in Fig. 4. An injector is any algorithm accepting as input an ML predictor and prior symbolic knowledge (typically logic formulæ) and producing a new predictor as output. In order to properly perform injection, injectors may require additional information, such as algorithm-specific hyperparameters. The general workflow for SKI with PSyKI is compliant with the one presented in Sect. 2.1—with specific attention to parsing (\(\Pi\)), fuzzification (\(\zeta\)) and injection (\(\mathscr {I}\)).

PSyKI supports the processing of symbolic knowledge represented via logic formulæ. Based on the sort of logic adopted, users can build an abstract syntax tree (AST) for each formula. The AST can be inspected by a fuzzifier, via the visitor pattern [62], to encode the symbolic knowledge into a sub-symbolic form (e.g. fuzzy logic functions, ad-hoc layers). The resulting sub-symbolic object can finally be used by an injector to create a new predictor. This process—denoted with \(\zeta\) in Figs. 1, 2 and 3—is injector-specific; instead, the same parser \(\Pi\) can be used independently of the injector for logic formulæ of the same type.

The software is organised into well-separated packages and interfaces, so as to ensure extensibility towards new sorts of logics and fuzzifiers—see Fig. 5. A formula AST is represented in the software via instances of the Formula abstract class and its manifold subtypes (not shown in the figure)—aimed at covering the many logic-specific aspects supported by PSyKI. Ad-hoc implementations of Formula are included in PSyKI, one for each logic formalism supported by our framework—currently, Prolog, Datalog, and their subsets—and more may be introduced in the future by either us or other researchers by simply extending that class. The same holds for fuzzifiers (resp. injectors), i.e., sub-types of the Fuzzifier (resp. Injector) abstract class. Currently-available implementations of those classes cover the KBANN [17], KINS [34], and KILL [35] injection algorithms—and the corresponding injectors as well.

Fig. 5 Class diagram of PSyKI. Main entities are Injector, Formula, and Fuzzifier

However, in its original state PSyKI does not include any particular facility to assess SKI. This is why in the remainder of this paper we propose a PSyKI extension aimed at supporting engineers in need of practically assessing the effectiveness—as well as the other QoS properties discussed in this paper—of their SKI workflows.

4.1 QoS metrics implementation in PSyKI

QoS metrics are implemented as a set of classes that extend the Metric abstract class. Each class corresponds to a specific metric and is responsible for computing the corresponding score. Therefore, the Metric class provides a common interface for all metrics.

In particular, it provides two methods to compute the metric value between two predictors. The first method is compute_during_training and it is used to compute the metric during the training phase of the predictors. The second method is compute_during_inference and it is used to compute the metric when predictors are already trained. Both methods accept the predictors to compare as input parameters. Additional parameters can be passed to the methods to customise the computation of the metric to meet the specific needs of the user (e.g., training set, batch size, etc.).

Implemented metrics are:

  1. Memory: memory consumption efficiency of the predictors—Equation (2);

  2. Energy: energy consumption efficiency of the predictors—Equation (5);

  3. Latency: latency efficiency of the predictors—Equation (7);

  4. DataEfficiency: data efficiency of the predictors—Equation (9).

Metrics are included in the psyki.qos package. It is worth noting that all the metrics can be computed using any kind of predictors: there is no need to have one uneducated and one educated predictor. Instead, one can also compare, say, two educated predictors, or two uneducated predictors of any sort.
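For illustration, a hypothetical usage of the psyki.qos package may look as follows; the class and method names are those introduced above, but the exact argument lists are an assumption of ours and may differ in the actual API.

```python
# Hypothetical usage sketch -- exact signatures are assumptions, not the real API
from psyki.qos import Memory, Latency  # Energy and DataEfficiency work analogously

# `uneducated` and `educated` are two (e.g. Tensorflow) predictors to compare
memory_gain = Memory().compute_during_inference(uneducated, educated)    # cf. Eq. (2)
latency_gain = Latency().compute_during_inference(uneducated, educated)  # cf. Eq. (7)
```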

5 Experiments

In this section we present several experiments aimed at assessing the effectiveness of the proposed QoS metrics, as implemented in PSyKI. We first describe the experimental setup, the datasets we adopt, and the rationale behind their choice. Then, we present the results of our experiments, and we discuss them.

The design of our experiments is as follows:

  1. we select three relevant classification tasks from the literature, covering different application domains, and coming with datasets of increasing cardinality;

  2. for each task and its corresponding dataset D, we (1) train some uneducated neural predictor N over the data in D—of course performing train/test-set splitting—and (2) select some symbolic knowledge base K to be injected into N;

  3. for each uneducated predictor N we then apply SKI multiple times, once per injection technique currently supported by PSyKI—namely KBANN, KINS, and KILL—hence attaining as many educated predictors;

  4. finally, for each educated predictor \(\hat{N}\), we compute our QoS metrics, hence comparing \(\hat{N}\) and N w.r.t. data efficiency, energy consumption, memory footprint, latency, and accuracy variation.

The rationale behind this setup is to demonstrate the effectiveness of our QoS metrics in assessing the efficiency of SKI techniques of different sorts.

It is worth highlighting that the experiments presented in this section are not intended as a comprehensive evaluation of knowledge injection techniques per se. Instead, they aim to demonstrate the validity of the proposed QoS metrics, w.r.t. their capability of revealing variations in relevant efficiency metrics, as introduced by SKI. In this respect, negative values may be imputed to either the injection algorithms themselves, or to their implementation in PSyKI. In fact, the primary goal of PSyKI is to provide correct—though not fully optimised—SKI techniques.

For the sake of reproducibility, the code is publicly available at https://github.com/pikalab-unibo/ski-qos-jaamas-experiments-2022.

5.1 Datasets

We select three different datasets from the UCI repository: BCW, PSJGS, and CI.

  • Breast cancer Wisconsin dataset (BCW) [63] The BCW dataset contains 699 instances of breast cancer biopsy results, each with 9 features—summarising biological characteristics—and one class label. Values are integers in the range [1, 10]. The feature \(BareNuclei\) has 16 missing values, which are replaced with the value zero. The dataset’s target variable is a binary indicator of whether a biopsy was benign (\(\texttt {B}\)) or malignant (\(\texttt {M}\)); class repartition is 458 and 241, respectively. The purpose of the dataset is to develop predictors that can accurately diagnose breast cancer based on biopsies using the information contained in the features.

  • Primate splice junction gene sequences (PSJGS) [64] The PSJGS dataset includes information regarding gene splicing. The dataset includes 3190 instances, each representing a sequence of 60 DNA nucleotides. Each nucleotide is represented by one of the four letters \(\texttt {A}\) (adenine), \(\texttt {T}\) (thymine), \(\texttt {C}\) (cytosine), and \(\texttt {G}\) (guanine). Each sequence begins at position -30 and ends at position 30; position zero is excluded.

    One DNA sequence can be classified as an exon–intron (\(\texttt {EI}\)) boundary, an intron–exon (\(\texttt {IE}\)) boundary, or none (\(\texttt {N}\)) of them. Class frequencies are 50% for \(\texttt {N}\), 25% for both \(\texttt {EI}\) and \(\texttt {IE}\).

    In addition to the four nucleotides, the dataset also includes aggregative letters, each indicating that several different nucleotides are allowed at a specific position. For our experiments, we preprocess the dataset by binarising the nucleotides: each nucleotide is represented by a vector of 4 elements, all equal to 0 except for the one corresponding to the nucleotide itself, which is 1. Table 2 reports the complete binarisation of the nucleotides.

    Table 2 Mapping of aggregative symbols and the four nucleotides
  • Census income (CI) [65] The CI dataset contains individual information from the 1994 United States Census. The dataset contains 48,842 instances, each corresponding to one census participant. Each data row includes demographic information such as age, education, and occupation, as well as income data about a single person. The purpose of the dataset is to predict whether an individual’s annual income exceeds 50,000 USD based on their demographic information. Hence, the target variable is binary—37,155 individuals earn at most 50,000 USD per year, while 11,687 earn more than that amount.

    For our experiments, we convert the target \(Income\) to a binary output (1 if the income exceeds 50,000 USD and 0 otherwise). We also drop three features—namely \(Fnlwgt\), \(Education\), and \(Race\)—as they are either irrelevant for our experiment (\(Fnlwgt\) is a similarity metric computed over the other features, and the information provided by \(Education\) is already covered by the feature \(EducationNumeric\)) or may introduce cultural bias (\(Race\)). The remaining features are discretised: in particular, \(CapitalGain\) and \(CapitalLoss\) are binarised, while the remaining nominal categorical features are one-hot encoded (a preprocessing sketch is given right after this list).
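As an illustration of the CI preprocessing described above, a minimal pandas sketch might look as follows (the file path is a placeholder and the column names follow the nomenclature used above; both are assumptions about the raw file layout):

    import pandas as pd

    # Placeholder path; column names follow the nomenclature used above and may
    # need to be renamed when loading the raw UCI file.
    df = pd.read_csv("census-income.csv")

    # Binary target: 1 if the yearly income exceeds 50,000 USD, 0 otherwise.
    df["Income"] = (df["Income"].str.strip() == ">50K").astype(int)

    # Drop irrelevant or potentially bias-inducing features.
    df = df.drop(columns=["Fnlwgt", "Education", "Race"])

    # Binarise capital gain/loss; one-hot encode the remaining nominal features.
    df["CapitalGain"] = (df["CapitalGain"] > 0).astype(int)
    df["CapitalLoss"] = (df["CapitalLoss"] > 0).astype(int)
    nominal = df.select_dtypes(include="object").columns
    df = pd.get_dummies(df, columns=list(nominal))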

We choose these datasets because of their increasing cardinality, which ranges from \(10^2\) to \(10^4\). In this way, we are able to observe the scalability and robustness of our predictors and metrics in handling datasets of different volume or dimensionality. This is important to get a broader overview of the performance of the different predictors, both in terms of their accuracy and of the various efficiency metrics proposed in this work.

We divide each dataset into train and test sets, with a ratio of 2/3 and 1/3 respectively.
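With scikit-learn, for instance, such a split can be obtained as follows (the fixed random seed and the stratification over the class labels are our assumptions, not prescribed above):

    from sklearn.model_selection import train_test_split

    # X: feature matrix, y: class labels (e.g. from the preprocessing above);
    # 2/3 of the data for training, 1/3 for testing.
    x_train, x_test, y_train, y_test = train_test_split(
        X, y, test_size=1/3, random_state=0, stratify=y)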

Finally, we attain the knowledge bases to be injected in a task-specific way. As far as the PSJGS dataset is concerned, we rely on the knowledge base described in the corresponding paper [17], which we suitably convert into Prolog form. Conversely, as far as the BCW and CI datasets are concerned, we leverage symbolic knowledge extraction [66] to automatically generate knowledge bases in Prolog form out of trained predictors. This process is discussed in more detail in the Appendix.

5.2 Methodology

We define and train several neural predictors for each dataset—in particular, one uneducated network and multiple educated counterparts. We attain educated networks by applying SKI via the KINS, KILL, and KBANN algorithms—each exploiting a different approach to perform knowledge injection—see Sect. 4. By constructing all such predictors, we are able to compare and evaluate their performance and their metrics on each dataset.

For each uneducated predictor, we tune the structural hyperparameters (i.e. the number of layers and of neurons per layer) by using a grid search with cross-validation. Networks attained via KBANN are a notable exception here, as in those cases the entire architecture of the network is dictated by KBANN, as a function of the input knowledge. In particular, we vary the number of layers (from 1 to 3) and the number of neurons per layer (10, 50, and 100). The same grid search with cross-validation is repeated for the educated predictors. In this way, we ensure a good hyperparameter selection—in terms of predictive performance—while keeping the computation time reasonable. Table 3 shows the selected hyperparameters for each dataset and predictor.
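The sketch below shows how such a grid search with cross-validation may be carried out for an uneducated predictor. The explored grid (1 to 3 layers, 10/50/100 neurons per layer) follows the text, whereas the MLP architecture, the number of folds, and the number of epochs are assumptions.

    import itertools
    import numpy as np
    from sklearn.model_selection import KFold
    from tensorflow import keras

    def build_mlp(input_size, n_classes, layers, neurons):
        # Simple fully-connected classifier standing in for the uneducated
        # predictor; architecture details other than layers/neurons are assumed.
        model = keras.Sequential([keras.layers.Input(shape=(input_size,))])
        for _ in range(layers):
            model.add(keras.layers.Dense(neurons, activation="relu"))
        model.add(keras.layers.Dense(n_classes, activation="softmax"))
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model

    # x_train/y_train are assumed to be numpy arrays; n_classes is the number
    # of classes of the task at hand.
    best_score, best_config = -np.inf, None
    for layers, neurons in itertools.product([1, 2, 3], [10, 50, 100]):
        folds = KFold(n_splits=5, shuffle=True, random_state=0)
        scores = []
        for train_idx, val_idx in folds.split(x_train):
            model = build_mlp(x_train.shape[1], n_classes, layers, neurons)
            model.fit(x_train[train_idx], y_train[train_idx], epochs=10, verbose=0)
            _, accuracy = model.evaluate(x_train[val_idx], y_train[val_idx], verbose=0)
            scores.append(accuracy)
        if np.mean(scores) > best_score:
            best_score, best_config = np.mean(scores), (layers, neurons)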

Table 3 Results of a comprehensive grid search on various datasets

In order to evaluate each (hyper-)parameter configuration in a statistically significant way, we repeat the training 30 times, each time with different initial conditions and/or random seeds, collecting statistics about the average accuracy along the way. This lets us reduce the variability of the results and obtain a more accurate estimate of each predictor’s actual performance. The outcome of this procedure is shown in Table 4.
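For instance, the repetition procedure can be sketched as follows, reusing the build_mlp helper and the grid-searched configuration from the previous sketch (the number of epochs is, again, an assumption):

    import numpy as np
    import tensorflow as tf

    accuracies = []
    for seed in range(30):
        # Different initial conditions / random seeds at each repetition.
        tf.keras.utils.set_random_seed(seed)
        model = build_mlp(x_train.shape[1], n_classes, *best_config)
        model.fit(x_train, y_train, epochs=10, verbose=0)
        _, accuracy = model.evaluate(x_test, y_test, verbose=0)
        accuracies.append(accuracy)

    print(f"average accuracy: {np.mean(accuracies):.3f} ± {np.std(accuracies):.3f}")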

After calculating the average accuracy, we proceed to compute the predictors’ efficiency metrics for each dataset. In particular, we compute the data-efficiency, energy, memory, and latency metrics—see Sect. 3. The corresponding scores are presented in Table 4, and discussed in the following section.
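As an example of what these scores roughly capture, a simple proxy for the memory metric is the difference in parameter counts between the uneducated predictor and each educated one, reusing the educated_predictors dictionary from the earlier pipeline sketch; the exact definition from Sect. 3, as implemented in psyki.qos, may differ.

    def memory_gain(uneducated, educated):
        # Positive values: the educated predictor has fewer parameters,
        # i.e. a smaller memory footprint (rough proxy only).
        return uneducated.count_params() - educated.count_params()

    for name, educated in educated_predictors.items():
        print(name, memory_gain(uneducated, educated))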

Table 4 Comparison of the performance of different models (KBANN, KILL, and KINS) with respect to the uneducated one on three datasets (Breast Cancer, Splice Junction, and Census Income) in terms of data efficiency, energy consumption, memory usage, latency, and accuracy

5.3 Discussion

In the following we thoroughly analyse and interpret the results of our experiments. Accordingly, we examine the columns of Table 4 from left to right.

It is worth noticing how data-efficiency scores vary widely across predictors and datasets. We recall that a positive data-efficiency score indicates that the educated predictor is more efficient than its uneducated counterpart, whereas a negative score indicates the opposite. In general, as stated in Sect. 3, it is important to consider how data-efficiency scores can be affected by all three aspects (A1)–(A3). The high variation of this score thus points out the importance of selecting the most appropriate predictor for a given task (A2). For instance, the KINS-based solution has a lower data-efficiency score than the other predictors tailored to the BCW dataset, which may indicate that KINS is not the best solution for this task. In contrast, the CI dataset shows positive data-efficiency scores for all three predictors, indicating that, in terms of data efficiency, an improvement is obtained by using any of the three SKI algorithms considered in this work.

The second column of Table 4 shows the energy metrics for both training and testing. Across predictors and datasets, we mostly observe negative values for this metric. Again, it is important to note that energy-consumption scores can be affected by a number of factors, including the input knowledge (A1), the input predictor (A2), and the task to be performed (A3). In most cases, the table indicates that the KBANN-based solution consumes more energy than the other predictors, whereas the KILL-based solution consumes significantly less. Attention should also be paid to the input knowledge (A1): as stated in Sect. 3, the more complex the input knowledge, the more energy the educated predictor is expected to consume during training. Hence, we argue that more complex knowledge may produce a gain in terms of data efficiency for the educated predictor—possibly at the price of higher expenses in terms of energy consumption.

The third column of Table 4 shows the results of the memory metric. We recall that a positive value here indicates that the educated predictor consumes less memory than the uneducated one. Conversely, a negative value indicates that the educated predictor consumes more memory. For example, in the case of the BCW dataset and the KBANN-based solution, the educated predictor shows a positive difference in memory consumption—which means it uses less memory than the uneducated one. In the PSJGS dataset, both KBANN- and KINS-based solutions show negative memory metrics. This suggests that, in this case, those educated predictors are more memory intensive than the uneducated one. Regarding the KILL-based solution, it often shows a memory metric of 0, indicating that there is no difference in memory between the educated and uneducated predictors.

The fourth column of Table 4 shows the latency results. Comparing the latency of educated predictors with that of the uneducated ones, we observe that, as far as KILL is concerned, the two solutions behave very similarly—i.e., the metric is close to 0. KBANN and KINS, on the other hand, exhibit a slightly worse latency on all three datasets. As discussed in Sect. 3, we argue this is due to the complexity of the injected input knowledge, which can negatively affect latency—especially in structuring-type SKI methods, such as KBANN.

Finally, by looking at the accuracy scores—see the last column of Table 4—we observe that the educated and uneducated predictors achieve very similar results. Moreover, the accuracy of all predictors is similar and consistent across all three datasets. In general, results suggest that all predictors perform well and can accurately classify the instances of each dataset.

To conclude, in terms of data efficiency, the educated predictor generally requires less data than the uneducated one to achieve similar accuracy. This is a positive result, as it suggests that the educated predictor is able to make accurate predictions using less data, which is a desirable feature in resource-constrained settings.

As far as energy is concerned, our results show an energy gain in favour of the uneducated predictor during the training phase, whereas during the inference phase the difference is close to 0. We argue that this is due to the knowledge injection process, which in these experiments made training the educated predictor more energy-expensive than training the uneducated one. As for memory, the results are somewhat mixed: the educated predictor sometimes requires more memory, and sometimes less, than the uneducated one. Finally, as far as latency is concerned, results indicate that the uneducated predictor tends to have a slightly lower latency than the educated one.

Overall, our analysis provides valuable information for understanding the performance of different injection-based predictors on different datasets: this can be useful for evaluating a predictor more comprehensively, without resorting to accuracy metrics alone. This points out the importance of adopting specific metrics when evaluating knowledge-injection predictors. Using these sorts of metrics in the MAS context could provide intelligent-systems engineers with a critical tool for comparing different predictors and selecting the best one for a given task.

6 Conclusions

In this work we propose a set of quality-of-service (QoS) metrics for SKI mechanisms, aiming at putting MAS engineers and agents back in control of the selection of the best predictor for a given task. In particular, our metrics focus upon efficiency gains achievable through SKI. Along this line, we formally define four metrics, namely: (1) memory footprint efficiency—i.e., gain in terms of predictor’s complexity; (2) energy efficiency—i.e., gain in terms of total energy required to train and deploy a sub-symbolic predictor; (3) latency efficiency—i.e., improvements in terms of time required for inference; and (4) data efficiency—i.e., improvement in terms of amount of data required to optimise a sub-symbolic predictor.

Furthermore, to support their practical exploitation, we also introduce an extension of the PSyKI library for symbolic knowledge injection, which includes a general-purpose software implementation of the metrics.

Enabled by PSyKI, we then perform a number of experiments aimed at demonstrating the effectiveness of our metrics. Overall, our experiments show that the proposed metrics can be exploited to grasp insights about whether a given SKI mechanism is actually able to improve the efficiency of a given predictor or not—according to some specific efficiency criteria among the aforementioned ones. As a by-product of our experiments, we also show that the injection mechanisms currently supported by PSyKI leave some room for improvement.

In perspective, our QoS metrics for SKI have a role to play in addressing various issues in the field of agent-oriented systems. Indeed, the design and implementation of MAS present significant challenges, such as energy consumption, latency, memory, and data efficiency. System complexity, coupled with the real-time requirements of many multi-agent applications, may lead to increased energy consumption and latency. In addition, the amount of data generated and processed by MAS can have a significant impact on their performance. Along this line, we observe that SKI approaches could reduce the amount of computation required to process data, thus leading to reduced latency and improved energy efficiency. Similarly, they could reduce the amount of data needed for training and improve the quality of the data used, thus resulting in improved performance and efficiency. The key point here is that measuring efficiency gains in all such regards paves the way towards the automation of agents’ decision-making, which may then dynamically optimise their sub-symbolic components according to their goals.