Exploiting the Cone of Influence for Improving the Performance of Wavelet Transform-Based Models for ERP/EEG Classification

Chen, Xiaoqian; Gupta, Resh S.; Gupta, Lalit

doi:10.3390/brainsci13010021

Open AccessArticle

Exploiting the Cone of Influence for Improving the Performance of Wavelet Transform-Based Models for ERP/EEG Classification

by

Xiaoqian Chen

¹,

Resh S. Gupta

² and

Lalit Gupta

^1,*

¹

School of Electrical, Computer, and Biomedical Engineering, Southern Illinois University, Carbondale, IL 62901, USA

²

Center of Excellence for Stress and Mental Health, VA San Diego Healthcare System, San Diego, CA 92161, USA

^*

Author to whom correspondence should be addressed.

Brain Sci. 2023, 13(1), 21; https://doi.org/10.3390/brainsci13010021

Submission received: 22 November 2022 / Revised: 10 December 2022 / Accepted: 19 December 2022 / Published: 22 December 2022

(This article belongs to the Section Computational Neuroscience and Neuroinformatics)

Download

Browse Figures

Versions Notes

Abstract

:

Features extracted from the wavelet transform coefficient matrix are widely used in the design of machine learning models to classify event-related potential (ERP) and electroencephalography (EEG) signals in a wide range of brain activity research and clinical studies. This novel study is aimed at dramatically improving the performance of such wavelet-based classifiers by exploiting information offered by the cone of influence (COI) of the continuous wavelet transform (CWT). The COI is a boundary that is superimposed on the wavelet scalogram to delineate the coefficients that are accurate from those that are inaccurate due to edge effects. The features derived from the inaccurate coefficients are, therefore, unreliable. In this study, it is hypothesized that the classifier performance would improve if unreliable features, which are outside the COI, are zeroed out, and the performance would improve even further if those features are cropped out completely. The entire, zeroed out, and cropped scalograms are referred to as the “same” (S)-scalogram, “zeroed out” (Z)-scalogram, and the “valid” (V)-scalogram, respectively. The strategy to validate the hypotheses is to formulate three classification approaches in which the feature vectors are extracted from the (a) S-scalogram in the standard manner, (b) Z-scalogram, and (c) V-scalogram. A subsampling strategy is developed to generate small-sample ERP ensembles to enable customized classifier design for single subjects, and a strategy is developed to select a subset of channels from multiple ERP channels. The three scalogram approaches are implemented using support vector machines, random forests, k-nearest neighbor, multilayer perceptron neural networks, and deep learning convolution neural networks. In order to validate the performance hypotheses, experiments are designed to classify the multi-channel ERPs of five subjects engaged in distinguishing between synonymous and non-synonymous word pairs. The results confirm that the classifiers using the Z-scalogram features outperform those using the S-scalogram features, and the classifiers using the V-scalogram features outperform those using the Z-scalogram features. Most importantly, the relative improvement of the V-scalogram classifiers over the standard S-scalogram classifiers is dramatic. Additionally, enabling the design of customized classifiers for individual subjects is an important contribution to ERP/EEG-based studies and diagnoses of patient-specific disorders.

Keywords:

Graphical Abstract

1. Introduction

The continuous wavelet transform (CWT) and the discrete wavelet transform (DWT) are indispensable signal processing tools that are extensively used to analyze and develop classifiers for event-related potential (ERP) and electroencephalography (EEG) signals. Like most practical signals, ERPs and EEG epochs have finite lengths. The CWT and DWT of finite-length signals suffer from border distortions which are called cone of influence (COI) effects and border effects, respectively. This study focuses primarily on emphasizing the importance of taking the COI into account in the design of ERP/EEG machine learning classification models which use features derived from the CWT. The COI is a boundary that is superimposed on the CWT scalogram to delineate the accurately computed wavelet coefficients from those that are inaccurate (artifact coefficients) due to the wavelet extending beyond the observation interval of the signal [1,2,3,4]. These edge-effect artifacts are due to the mechanics of convolving the signal and the wavelet near the beginning and near the end of the signal. For a given wavelet, the number of artifact coefficients computed depends on the wavelet scale. Larger scales result in a larger number of artifact coefficients; consequently, the boundary takes on a cone shape in the scalogram. Although it is common practice to exclude unreliable artifact coefficients outside the COI in scalogram analyses, surprisingly, the same cannot be said about classification. This is especially true when wavelet scalograms are used in conjunction with deep learning convolution networks (CNNs). Many recent studies have shown that the conversion of one-dimensional time-domain signals such as EEGs and ERPs into a time-scale matrix representation through the CWT not only offers simultaneous time-scale information but also offers an ideal matrix input format for CNNs [5,6,7,8,9,10,11]. However, due to the mechanics of matrix convolutions in the CNNs, the convolutions are carried out over the entire scalogram in the first stage as well as the subsequent convolution stages. Features extracted from the scalogram are also combined into feature vectors for input to many other classifiers such as support vector machines (SVMs) [7,12,13,14,15], random forests (RFs) [16,17], k-nearest neighbor (k-NN) [15,18,19], and multilayer perceptron (MLP) neural networks [20].

Considering the widespread use of the CWT for classifying ERPs and EEGs as well as numerous other types of non-stationary signals and the fact that COI edge-effect artifacts are well known, it is remarkable how little attention has been paid to addressing the following question: how can knowledge of the COI contribute to the design of wavelet-based classifiers? Our specific goal, therefore, is to demonstrate the importance of exploiting the information provided by the COI by developing three classification approaches that employ features from the (a) entire scalogram, (b) entire scalogram but with zeroed-out coefficients outside the COI, and (c) cropped scalogram consisting of only coefficients inside the COI. It will be shown that the entire scalogram is obtained via “same” convolution and the cropped scalogram is obtained via “valid” convolution; therefore, the entire and cropped scalograms will be referred to as the S-scalogram and V-scalogram, respectively. The scalogram with the zeroed-out coefficients will be referred to as the Z-scalogram. The S-scalogram approach is the “standard” method employed in most wavelet classifiers and can, therefore, serve as a baseline for performance comparisons with the two other approaches. We hypothesize that the Z-scalogram-based classifiers will outperform the S-scalogram-based classifiers, and the V-scalogram-based classifiers will outperform the Z-scalogram-based classifiers. Most importantly, we hypothesize that the improvement of V-scalogram classifiers over the standard baseline S-scalogram classifiers will be dramatic.

The formulations of the scalogram-based classifiers are quite general and are not tied to any particular application. This study focuses on ERP classification; however, the methods developed are equally applicable to classifying EEGs and other non-stationary signals. ERPs are voltage fluctuations in the EEG that occur as a result of an external or internal event and are frequently used to assess sensory, cognitive, motor, and emotion-related processes [21]. EEG is a non-invasive method used to assess neurophysiological function. EEG measures the electrical activity of large, synchronously firing populations of neurons in the brain with electrodes placed on the scalp [20]. EEG signals are often contaminated by various sources of noise, which can be physiological (e.g., eye blinks, eye movements, muscle artifacts) and non-physiological (e.g., electrode movement relative to the scalp, electrode wire movement, electromagnetic noise) in nature. Thus, preprocessing steps, including filtering, artifact correction, and artifact rejection procedures, are typically applied to the raw data [22]. Several other useful and novel methods have been proposed to reduce these sources of noise [23,24]. Altogether, EEG signals have been used in a wide variety of studies ranging from emotion classification [25] to qualifying seizure dynamics [26].

In order to validate the hypotheses and to also show that the performance trends are consistent across classifiers, the three approaches are implemented using a diverse set of classifiers which are trained and tested on the 64-channel ERPs of 5 subjects engaged in a binary semantic task which involved distinguishing between synonymous and non-synonymous word pairs. The problems with developing customized ERP classifiers for individual subjects are identified, and a subsampling strategy is developed to generate small-sample average ensembles to facilitate the design. A rank-of-rank sum channel selection strategy which uses an interclass separation measure is developed to select a subset of channels that carry the most useful discriminatory information for classifying the synonymous and non-synonymous ERPs. Classification experiments are designed using the (5 subjects) (8 channels) = 40 data sets, and it is shown that:

(a): the Z-scalogram classifiers outperform the standard S-scalogram classifiers;
(b): the V-scalogram classifiers significantly outperform the S-scalogram and Z-scalogram classifiers;
(c): the relative improvement of the V-scalogram classifiers over the standard S-scalogram classifiers is dramatic;
(d): the improvement trends across the three scalogram approaches are remarkably consistent across the classifiers, ERP channels, and subjects;
(e): the region outside the COI does not carry useful discriminatory information;
(f): the subsampling strategy to generate small-sample ERP ensembles enables customized classifier design for single subjects.

To the best of our knowledge, we are not aware of any other study that systematically demonstrates how the performance of standard scalogram-based classifiers can be improved through the V-scalogram approach, which excludes the artifact coefficients outside the COI. Furthermore, the subsampling strategy, which enables the design of customized classifiers for individual subjects, is unique.

2. Materials and Methods

This section presents (a) the definitions of the COI, S-scalogram, Z-scalogram, V-scalogram, and V-complement scalogram, (b) methods to reduce effects of edge artifacts, (c) measures to indicate the quality of scalograms, (d) the development of the S-scalogram, Z-scalogram, and V-scalogram approaches, (e) the subsampling strategy to generate small-sample ERPs to enable the development of ERP classifiers for single subjects, (f) the development of the algorithm to rank and select channels for classification problems, (g) the diverse set of classifiers selected to implement the 3 scalogram approaches for ERP classification, (h) the 64-channel dichotomous EEG data of 5 subjects that is used to implement the various classifiers for the three scalogram approaches, (i) the experiments designed to evaluate the performances of the various implementations of the three classification approaches, and (j) details of the wavelet choice, averaging parameter selection, feature vector generation, classifier architectures and hyperparameters, and cross-validation.

2.1. COI

The primary focus of this study is on the development of classifiers that modify the scalogram for feature extraction based on the COI. It is, therefore, important to understand how the COI is generated in the scalogram, which, in turn, requires an understanding of how the scalogram is generated from the CWT. Given a wavelet basis function

Ψ (t)

, the CWT of a signal

x (t)

is given by

W (a, b) = \frac{1}{\sqrt{a}} \int_{- \infty}^{\infty} x (t) Ψ^{*} (\frac{t - b}{a}) d t

(1)

where

a

and

b

are the scale and translation parameters, respectively. That is, the CWT is the convolution of

x (t)

and scaled versions of

Ψ (t)

. Applying the CWT to

x (t)

results in a set of wavelet coefficients which are a function of scale and position. A plot of the coefficients provides a time-scale representation of

x (t)

. In order to implement on a computer, the CWT is discretized; consequently,

W (a, b)

is discrete both in scale and in time. The discretized CWT is often referred to as the DCWT to distinguish it from the discrete wavelet transform (DWT). Using

s

and

n

to denote the discrete-scale and discrete-translation variables, respectively,

W (s, n)

plotted as a function of

s

and

n

is called the CWT coefficient matrix. The scalogram is the amplitude

|W (s, n)|

or power

{|W (s, n)|}^{2}

which can be represented by a matrix

G (s, n), s = 0, 1, \dots (J - 1); n = 0, 1, \dots, (N - 1)

, where

J

is the number of scale bands, and

N

is the signal duration. In this zero-indexed matrix representation, row 0 corresponds to scale

S_{0}

, row 1 corresponds to scale

S_{1}

, …, row

(J - 1)

corresponds to scale

S_{J - 1}

, with

S_{j} < S_{j + 1}

.

The COI is best explained by first understanding the convolution operations that are used to generate the coefficient matrix and to identify the COI. Each row of the coefficient matrix

W (s, n)

is obtained from the linear convolution of

x (n)

with a wavelet of a particular scale. There are several different forms of linear convolution, which include “full”, “same” and “valid”, which are useful for identifying the COI in the coefficient matrix. The three types of convolutions, which are described in detail in Appendix A, differ in the manner in which the start and end points are computed. The standard scalogram, which we refer to as the S-scalogram, is generated by same-convolution. The S-scalogram contains artifact coefficients because it suffers from edge effects. The valid convolution operation yields coefficients that are accurate and is used to identify the COI in the coefficient matrix. The boundary that separates the artifact coefficients from the accurate coefficients defines the COI in the S-scalogram. We refer to the part of the S-scalogram that is inside the COI as the V-scalogram because it corresponds to the coefficients associated with valid convolutions. Furthermore, the entire S-scalogram with zeroed-out coefficients outside the COI is referred to as the Z-scalogram. Figure 1 shows an ERP and the 3 different types of scalograms generated from the Morlet wavelet transform of the ERP.

2.1.1. Edge Artifacts

It is inevitable that the CWT applied to a finite duration signal will lead to edge artifacts in the standard S-scalogram due to the need for extending the signal through padding. The inaccuracies of the artifact coefficients are the highest at the edges and decrease towards the COI. Various notable methods have been made to reduce, but not remove, the effects of the artifacts by smoothing the extensions [27,28,29], forecasting the extensions [30] and modifying the wavelet transform instead of extending the signal [31,32]. The effects of several extension methods have also been compared in [33,34]. For least-squares wavelet analysis (LSWA), which is used to analyze non-stationary signals that are not sampled evenly, the stochastic confidence level surface acts like the COI but is based on degrees of freedom [35].

Methods have also been proposed to approximate the COI, which include (a) using the time constant

1 / e

to delineate the borders of the cone of influence at each scale [2,36,37], (b) defining the extent of the COI at each scale as the point where the wavelet transform magnitude decays to 2% of its peak value [38,39], (c) delineating the COI by adding and subtracting (1/2) the wavelet footprint at the beginning and end of the observation interval at each scale [37,40,41], and (d) using the wavelet time interval that encompasses 95% of the wavelet’s energy for Morse wavelets [42,43,44].

2.1.2. Scalogram Quality

In order to analyze the oscillatory behavior in most physiological signals, the scalogram is represented in terms of the discrete frequency variable

f

, instead of the scale variable. The discrete-time and discrete-frequency S-scalogram can be represented by the matrix

G_{S} (f, n), f = 0, 1, \dots, (J - 1)

, where

J

in this case is the number of discrete-frequency bands. The COI depends not only on the wavelet choice and frequencies but also on the signal duration. For a given scalogram, the ratio

ρ

of the number of V-scalogram coefficients to the total number of S-scalogram coefficients can be used as a measure of the quality of the scalogram. For a (

J \times N)

scalogram, the ratio is given by

ρ = (\sum_{f = 0}^{J - 1} N_{f}) / [(J) (N)],

(2)

where

N_{f}

is the number of coefficients in row

f

that are inside the COI. For a given scalogram, a quality ratio

β

can also be defined in terms of the power enclosed by the COI to the total power in the following manner:

β = \frac{\sum_{f} \sum_{n} {|W (f, n)|}^{2}}{\sum_{f = 0}^{J - 1} \sum_{n = 0}^{N - 1} {|W (f, n)|}^{2}},

(3)

where the summations in the numerator are taken for values of

(f, n)

that fall inside the COI. Higher values of

ρ

and

β

indicate better scalogram quality.

2.2. The Three Scalogram Approaches

Our goal is to compare the performances of classifiers that employ S-scalogram, Z-scalogram, and V-scalogram features; therefore, the most straightforward way to demonstrate this goal is to design classifiers for each ERP channel. Clearly, the same goal can be demonstrated by designing classifiers that combine information from multiple ERP channels [45,46]. However, the performance will then also depend on the combination methods which could mislead the performance comparisons. Therefore, the assumption hereafter is that the classifiers are single-channel classifiers.

The strategy to demonstrate the influence of the COI on classification performance is to develop three classification approaches in which the feature vectors are derived from (a) the S-scalogram

G_{S} (f, n)

, (b) the Z-scalogram

G_{Z} (f, n)

, and (c) the V-scalogram

G_{V} (f, n)

. The scalogram

G_{S} (f, n)

consists of the signal power at each time-frequency location. For classification problems, transforming a signal

x (n)

into

G (f, n)

is regarded as a feature extraction process. Clearly, additional operations can be performed on the scalogram feature matrix to generate other features such as the mean band power, ratios of power in select bands, and those which are typically used in image classification, such as texture, entropies, higher-order statistical features, and PCA-derived features. However, the classification results would then depend on the choice of the features making it difficult to draw direct conclusions about the influence of the COI on classifier performance. Therefore, in order to avoid this problem, no further modifications of the scalogram features are employed in the development of the classifiers. That is, the features are simply the scalogram coefficients.

The most direct way to demonstrate the performance trends across the three classification approaches is to concatenate the rows of the scalogram into a feature vector. In the formulation of the three classification approaches, the region outside the COI is represented by

G_{O} (f, n)

such that

G_{S} (f, n) = G_{V} (f, n) \cup G_{O} (f, n)

(4)

2.2.1. S-Scalogram Approach

The feature vector in this approach is formed by concatenating the rows of the

(J \times N)

S-scalogram. If

Δ

and

T

represent the row concatenation and transpose operations, the

[(J) (N) \times 1]

feature vector

G_{S}

is given by

G_{S} = {[Δ_{f = 0}^{J - 1} G_{S} (f, :)]}^{T}

(5)

where,

G_{S} (f, :)

is row

f

in

G_{S} (f, n)

. The feature vector

G_{S}

includes a combination of accurately computed and artifact features. It can, therefore, be expected that the performance will be impacted negatively by the inclusion of the artifact features.

2.2.2. Z-Scalogram Approach

In this method, the S-scalogram is modified by multiplying it with a mask

M (f, n)

such that the resulting scalogram has elements set to zero outside the COI and are unchanged inside the COI. That is, the zeroed-out S-scalogram is given by

G_{Z} (f, n) = G_{S} (f, n) \times M (f, n)

(6)

where

M (f, n) = \{\begin{matrix} 1, a l l (f, n) \in G_{V} (f, n) \\ 0, a l l (f, n) \in G_{O} (f, n) \end{matrix}

(7)

The feature vector

G_{Z}

, which has the same dimension

[(J) (N) \times 1]

as

G_{S}

, is given by

G_{Z} = {[Δ_{f = 0}^{J - 1} G_{Z} (f, :)]}^{T}

(8)

Although the dimension of the feature vector

G_{Z}

is the same as

G_{S}

, it can be hypothesized that the performance would improve by zeroing out the unreliable features that are outside the COI. However, including features with zero values across all classes is completely uninformative, providing no discriminatory value. Furthermore, the unnecessarily large feature vector can lead to overfitting, which can negatively impact the performance.

2.2.3. V-Scalogram Approach

In this method, the S-scalogram is cropped with a COI-shaped mask

V (f, n)

which selects the features inside the COI and discards the features outside the COI. The resulting COI-shaped 2-dimensional V-scalogram

G_{V} (s, n)

is given by

G_{V} (f, n) = G_{S} (f, n) ∎ V (f, n)

(9)

where

∎

is used to denote the cropping operation, and the mask selects or discards features according to

V (f, n) = \{\begin{matrix} s e l e c t, a l l (f, n) i n s i d e C O I \\ d i s c a r d, a l l (f, n) o u t s i d e C O I \end{matrix}

(10)

The

[(J) (N_{V}) \times 1]

feature vector

G_{V}

is given by

G_{V} = {[Δ_{f = 0}^{J - 1} G_{V} (f, :)]}^{T}

(11)

where,

N_{V} = \sum_{f = 0}^{J - 1} N_{f}

, and

N_{f}

is the duration of the frequency band

f

inside the COI. Unlike the two previous cases, the durations of the frequency bands are not equal; consequently, the feature vector

G_{V}

does not have the same dimension as

G_{S}

. By excluding the features outside the COI, the dimension

(J) (N_{V})

of

G_{V}

will be smaller than the dimensions of

G_{S}

and

G_{Z}

. Through the simultaneous exclusion of unreliable features included in the S-scalogram approach and the cropping out of irrelevant features included in the Z-scalogram approach, it can be hypothesized that classifiers using

G_{V}

for the feature vector should outperform the classifiers using the feature vectors of the S-scalogram and Z-scalogram approaches.

2.2.4. V-Complement Scalogram Classification

In order to erase any doubts about the importance of excluding the features outside the COI, classifiers were also developed using only the features outside the COI, that is, the features contained in

G_{O} (f, n)

. This approach, represented by the

\bar{V}

-scalogram, is referred to as the V-complement scalogram approach because the feature set is the complement of the V-scalogram feature set. If these features are truly unreliable, as claimed throughout this study, the classification accuracies of the

\bar{V}

-scalogram classifiers should be worse than those of the baseline S-scalogram classifiers. The V-complement approach is not considered to be 1 of the 3 scalogram classification approaches but is simply developed to demonstrate that the features outside the COI do not carry any useful discriminatory information.

2.3. Subsample ERPs for Classifier Design

In statistical analyses, an ERP is formed by averaging all the time-locked single trials collected from a channel. The averaging operation is performed independently on the single trials of each channel. That is, 1 ERP is obtained for each channel. A different approach is required for ERP classification because classifiers are designed using the data split into two mutually exclusive subsets: the training set needed to train the model and the test set needed to evaluate the model. The term “design” includes classifier training and testing. Designing customized classifiers for individual subjects, whether it is through direct ERP-based classification or feature-based ERP classification, is impossible if each ERP category is represented by a single ERP. Group-based classifiers would require a very large number of subjects to generate enough ERPs for training and testing, which, in many practical cases, may be impossible. Also, it is important to note that the performance of a classifier is highly dependent on the number of ERPs and the quality of the training set. A small training set would not be representative of the variations in the ERPs, which results in poor generalization. Furthermore, large training sets are required to overcome the “curse of dimensionality” often encountered in the design of parametric classifiers [47]. Poor-quality (poor SNR) training sets lead to poor learning, which, in turn, results in poor performance.

The most obvious approach to increase the size of the training and test sets is to simply use single trials without averaging. However, high classification accuracies cannot be expected with single trial signals due to the poor SNR. Because the SNR improves through averaging [48,49,50], it can be expected that the classification accuracy will improve by averaging over a small number of single trials. The classification accuracy can be expected to increase even further by increasing the number of trials averaged. For convenience, the signal formed by averaging

m

single trials will be referred to as an

m

-ERP, and

m

will be referred to as the “averaging parameter”. Furthermore, the full-sample ERP formed by averaging all

Ω

single trials in an ensemble will be referred to as the

Ω

-ERP, and, for convenience, a single trial will be referred to as a 1-ERP. In order to form an

m

-ERP, the 1-ERP ensemble can be partitioned sequentially into blocks of size

m

and then averaging the 1-ERPs in each block. Alternatively, subsamples of size

m

single trials can be drawn randomly without replacement and averaged to form distinct

m

-ERPs. However, in both cases, the sizes of the

m

-ERP training and test sets, which are equal to the number of 1-ERPs in the respective sets divided by

m

, “shrink” as

m

increases, which leads to overfitting. The obvious solution to this problem is to increase the size of the single trial ensembles. However, collecting large ensembles of single trials to form a large ensemble of

m

-ERPs for training and testing is difficult in practice because subjects tend to lose focus on the task due to fatigue and lapses in concentration over long data collection sessions. Another alternative is to average

m

single trials drawn with replacement to generate large ensembles of

m

-ERPs. However, there is the risk of generating identical

m

-ERPs, which can result in singular covariance matrices. This could be problematic for parametric classifiers such as those with Gaussian discriminant functions, which require the computation of the inverse of the covariance matrix [47].

The real challenge, therefore, in designing ERP classifiers for many applications is to avoid having to repeat the stimulus presentation numerous times to be able to form high SNR

m

-ERPs, especially during real-time testing. To be practical, most applications require accurate classification of ERPs averaged over a small number of single trials. The goal, therefore, is to enable the design of practical ERP classifiers that can accurately classify ERPs averaged over a small number of single trials without having to collect a prohibitively large ensemble of 1-ERPs.

2.3.1. Generation of $m$ -ERPs through “ $m$ -Subsample Averaging”

In order to overcome the shrinking problem that accompanies sequential averaging and averaging random subsamples without replacement, we employ a subsampling and averaging strategy introduced and implemented in [45,51,52]. We will refer to this method of generating subsample ERPs as “

m

-Subsample Averaging (

m

-SA)”. For a given 1-ERP ensemble of size

Ω_{0}

, the

m

-SA strategy can be summarized as follows:

(a): Draw a random subsample of 1-ERPs of size $m,$ $m < Ω_{0}$ . The $m$ 1-ERPs in the subsample are drawn without replacement;
(b): Average the $m$ 1-ERPs in the subsample to obtain an $m$ -ERP;
(c): Replace the $m$ 1-ERPs of the subsample into the single trial ensemble;
(d): Repeat steps (a)–(c) $q$ times to generate an ensemble of $Ω_{q}$ $m$ -ERPs. Each ensemble generation is referred to as a “run”;
(e): Repeat steps (a)–(d) $Q$ times to yield $Q$ runs, with each run containing an $m$ -ERP ensemble of size $Ω_{q}$ .

In order to maintain mutual exclusivity between the training and test sets, the 1-ERP ensemble is first partitioned into training and test sets, and subsequently, m-ERPs are generated from each of these sets. Each run, therefore, begins with randomly partitioned 1-ERP training and test sets. The specified number of

m

-ERPs for the training and test sets set is generated from the 1-ERPs of the training and test sets, respectively. Consequently, none of the 1-ERPs used to form the

m

-ERPs of the training set are included in forming the

m

-ERPs of the test set.

The key point to note is that

m

-SA offers the flexibility of generating large ensembles of

m

-ERPs for classifier design. Most notably, the

m

-ERP ensembles can be generated from the 1-ERP ensembles of individual subjects, thus enabling the design of customized classifiers for individual subjects. Clearly,

m

-SA is equally applicable for generating large ensembles for multi-subject group-based classifier design. Increasing the amount of data for training in this fashion is a form of data augmentation that can be especially beneficial for improving the performance of data-driven classifiers such as deep learning CNNs. Also noteworthy is that the training set can be made up of

m

-ERPs with different values of m to increase the amount of data and the variability in the training set. An additional point to note is that the averaging parameter does not have to be the same for the training and test sets. That is, the classifier could be trained with

m

-ERPs and tested on

n

-ERPs. In practice, it is desirable to keep both

m

and

n

small in order to avoid having to collect a large number of single trials for training and testing, respectively. It is especially important to obtain high accuracies for small values of

n

so that trials do not have to be repeated many times during testing.

2.4. Rank-of-Rank-Sum (RRS) Channel Ranking

The primary reasons for channel selection are to extract a subset of channels that carry useful discriminatory information and to reduce the computational burden. For ERP analysis, temporospatial PCA can be used to identify channels and time windows that best capture effects elicited by stimuli [53,54]. Therefore, the classification performance can be expected to improve by using only the identified channels and time windows. However, the drawback of using such prior knowledge for channel selection with respect to multichannel ERP classifier design is that the information that a particular classifier uses to discriminate between the ERP categories may differ from the information identified by PCA which is aimed at seeking projections that best represent the data in a least-square sense. A different approach which is aimed at selecting channels that best discriminate between the ERP classes is required for classification problems. A common approach is to use a suitable criterion with backward elimination to develop greedy algorithms which start with the full set of channels and decrease the number of channels, one channel at a time, to determine a subset of channels that yield the best criterion score. References [55,56,57] review several notable channel selection methods that have been developed for EEG channel selection.

The rank-of-rank-sum (RRS) algorithm developed in this study is feature-blind and quite general because it is formulated to rank channels for classification problems involving multiple classes (polychotomous) and multiple subjects. Binary classification and single-subject cases are treated as special cases of the RRS algorithm. The channels are ranked according to the interclass separations of the ERPs generated at the channels. The interclass separation measure selected is a scatter-normalized measure of separation between two cluster means. In general, higher classification accuracies can be expected if clusters have higher interclass separations. However, the interclass separation, which is a pairwise measure, is not directly applicable for multi-channel and polychotomous ranking because it provides a separation measure only for a pair of channels and a pair of ERP classes. Moreover, it is not directly applicable to multiple subjects. To solve the first problem, channel ranking is obtained via the ranking of the sum of the pairwise rankings of the

U (U - 1) / 2

pairs of ERP classes where

U

is the number of ERP classes. One approach to solving the second problem is to combine the

B

subjects into a single group. This approach, however, is sensitive to the inter-subject variabilities of the ERPs. Instead, the RRS algorithm solves the second problem by determining the rankings of the sum of the channel rankings across the

B

subjects. The RRS algorithm is described in detail in Appendix A.

2.5. Choice of Classifiers

In order to test the performance-related hypotheses and to determine whether the performance trends are consistent across classifiers, channels, and subjects, a range of diverse classifiers is selected to implement the three scalogram approaches for ERP classification. The following well-known classifiers are selected: SVMs, RFs, k-NN, MLPs, and deep learning CNNs. Each of these classifiers has its advantages and limitations. Details of these well-known classifiers can be found in excellent books on machine learning [58,59,60,61,62].

Two variants of CNN classifiers are designed: vector input CNN-1 and matrix input CNN-2. The CNN-1 classifier accepts vector inputs and can, therefore, be implemented for all 3 approaches through the row concatenation operations described in Section 2.2. As mentioned in the introduction, the matrix scalogram is an ideal format for CNN inputs. Implementing matrix input CNN classifiers for the first 2 approaches is straightforward because the S-scalogram and the Z-scalogram have matrix formats. However, implementing CNN classifiers capable of accepting the unequal row V-scalogram matrix is problematic because the CNN matrix convolutions require equal row matrices. Such an implementation is beyond the scope of this study.

2.6. Data Sets

The EEG data used to demonstrate the application and evaluate the performance of the 3 classification approaches was downloaded from:

https://eeglab.org/tutorials/10_Group_analysis/study_creation.html#description-of-the-5-subject-experiment-tutorial-data (accessed on 1 October 2022).

This data set was selected because it was freely available and ideal for demonstrating the goals of this study. However, the methods formulated in this study are quite general and are not tied to this particular data set or any other data set. According to the description of the study on the website, the EEG data were collected from 5 subjects performing an auditory binary semantic task in which they were asked to distinguish between synonymous and non-synonymous word pairs. The second word was presented 1 s after the presentation of the first word. Data epochs were extracted 2 s before the second-word onset to 2 s after the second-word onset. EEGs were collected from 64 channels using a sampling rate of 200 Hz. After decomposing each subject’s data by ICA, 2 EEG datasets were extracted: the synonymous data set comprising trials in which the second word was synonymous with the first one and the non-synonymous data set in which the second word was not a synonym of the first. Thus, the study included 5 binary data sets or (5 subjects) (2 conditions) = 10 datasets for each of the 5 subjects. Both datasets of each subject were recorded in a single session. The number of epochs for each condition varied from 196 to 235 across the 10 data sets.

For the purpose of this study, the single trials (1-ERPs) for the classification experiments consisted of the 1 s segment extracted after the second-word onset in each epoch. Consequently, the total number of samples in each 1-ERP was 200. In order to facilitate 5-fold cross-validation and to have an equal number of epochs across the 10 data sets, the first 195 epochs were selected from each data set. The top 8 ranked channels for each subject determined by the RRS algorithm were selected for the experiments. The rankings are listed in Table 1. The rankings of the channels differ across the 5 subjects, which is not unexpected. Whether these channels differ from those extracted from temporospatial PCA is not of concern because the focus is on channel selection for classification and not for analyses. Furthermore, whether the top-ranked channels are correlated proximal channels is not taken into account because the goal is to simply select 1-ERPs from a subset of channels, instead of all channels, that can be used as data sets for the experiments. If needed, correlation analyses could be conducted on the ranked channels to systematically include low-correlated channels starting with the top-ranked channel.

Each classifier was designed on (5 subjects) (8 channels) = 40 binary data sets. The 40 binary data sets are large enough to adequately observe the performance trends across the three approaches. Adding more channels will simply increase the size of the already large tables that are included in the results without contributing to the trend analyses. Figure 2 shows the full-sample

Ω

-ERPs of subject

b_{1}

formed by averaging the 1-ERPs in the binary data sets. Observe that the

Ω

-ERPs of the two classes appear to be fairly well separated in all 8 channels.

2.7. Classification Experiments

This section describes various aspects of the design of experiments to test the implementations of the three classification approaches using the 5 different types of classifiers and the 5-subject 8-channel ERP data sets.

2.7.1. Wavelet Choice

Over the years, many different CWTs have been proposed, which include, among many others, Morlet [63,64], Morse [44], Gaussian [2], and Mexican Hat [65,66]. The choice of the wavelet is application dependent, and in this study, the analytic Morlet wavelet is selected because it is frequently used to analyze the oscillatory behavior of ERPs and EEGs. The analytic Morlet mother wavelet, a product of a complex exponential signal of frequency

f_{0}

and a zero-mean Gaussian window with variance

σ^{2}

, is given by

Ψ_{f_{0}} (t) = (A) (e^{- \frac{t^{2}}{2 σ^{2}}}) (e^{j (2 π f_{0}) t})

(12)

where the constant

A

ensures that the wavelet energy is equal to one. The analytic wavelet coefficients are complex, and the power spectrum is zero for negative frequencies. Note that the wavelet is defined in terms of a frequency parameter which is preferred in applications involving oscillatory analyses.

2.7.2. Selection of the Averaging Parameter

The aim of this study is not focused on obtaining the highest classification accuracies but on analyzing the performance trends across the 3 classification approaches using the data sets. In general, classifiers trained and tested on high SNR data sets yield good accuracies across the classifiers, making it difficult to compare their performances. Furthermore, it may not be practical to repeat trials a large number of times to generate high-SNR

m

-ERPs during real-time testing. It is, therefore, important to be able to obtain high accuracies for small values of

m

. For these reasons, the smallest value,

m = 4

, which enabled the analyses of the performance trends across the 3 approaches, was selected for generating the

m

-ERP ensembles. In the experiments to follow, a classifier was developed to test the

4

-ERPs of each channel collected from each subject independently.

2.7.3. Feature Vector Generation

The feature vectors for each approach were generated from the Morlet wavelet power scalogram through row concatenation, as described in Section 2.2. The question as to whether the scalogram is computed from 1-ERPs or

m

-ERPs has to be considered. In general, the scalogram of the average of the single trials is used for evoked power analysis, whereas the average of the scalograms of each single trial, which gives the total power, is used for induced power analysis. The total power is the sum of the evoked and induced power. In this study, for the purpose of feature extraction, a single scalogram is computed for each m-ERP where m is equal to 4. The duration of each 4-ERP (same as the single trial duration) was N = 200 samples, and the number of frequency bands J was 108. The dimensions of the scalograms were, therefore, 108 × 200. Hence, the dimension of the feature vectors

G_{S}

and

G_{Z}

were (21,600 × 1). Using the process outlined in Section 2.2 for selecting the features inside the COI, the dimension of the feature vector

G_{V}

was 17,056 × 1. The ratio

ρ

of the number of V-scalogram coefficients to the total number of S-scalogram coefficients for this case is (17,056/21,600) (100) = 78.96%. That is, approximately 21% of the scalogram coefficients were artifact coefficients.

2.7.4. Classifier Architectures and Parameters

This subsection describes the architectures and hyperparameters of the 6 different classifiers that were implemented. No serious attempt was made to optimize the architectures and hyperparameters because the goal is to compare the relative performances of a given classifier using the S-scalogram, Z-scalogram, and V-scalogram feature vectors. Details of the classifiers are as follows:

SVM Classifiers: The SVM classifiers were implemented using the Gaussian radial basis function kernel. The best combination of the regularization parameter

C

and influence parameter

γ

was selected using an exhaustive grid search.

RF Classifiers: The following hyperparameters were selected: number of trees = 10, maximum depth of a tree = 475, minimum number of samples required to split an internal node = 3, minimum number of samples required to be leaf node = 1, splitting quality criterion = entropy.

k-NN Classifiers: The Euclidean distance metric was used to measure proximity. Through trial and error, it was found that k = 20 yielded good performance across all classifiers.

MLP Classifiers: The MLP classifiers consisted of four layers of neurons: the first layer had 1024 neurons, the second layer had 512 neurons, the third layer had 256 neurons, and the output layer had 2 neurons. The sigmoidal activation function was used in all 4 layers. The following training options were used: initialization = uniform random, optimizer = Adam, learning rate = 0.001, number of epochs = 50, drop-out probabilities = 0.15.

CNN-1 Classifier: The CNN-1 classifiers consisted of three convolution layers followed by a max pooling layer and terminated with a fully connected network (FCN) consisting of four layers of neurons. The convolution layers used the ReLU activation functions, the first three layers in the FCN used sigmoidal activations, and the last layer used softmax activations. The “same” operation was used in the convolution layers. The number of filters in the convolution layers was 32. The filter dimensions in the convolution layers were

(9 \times 1)

. A (

2 \times 1)

max pooling filter with a stride of

2

was used in the pooling layer. The FCN had 1024, 512, 256, and 2 neurons in the four layers. The following training options were used: initialization = uniform random, optimizer = Adam, learning rate = 0.001, number of epochs = 50, drop-out probabilities = 0.15.

CNN-2 Classifiers: The CNN-2 classifiers consisted of a convolution layer, max pooling layer, convolution layer, max pooling layer, convolution layer, max pooling layer, and terminated with a fully connected network (FCN) consisting of four layers of neurons. The convolution filters had dimensions

(3 \times 3),

and the max pooling filter had dimensions

(2 \times 2)

with a stride equal to 2. The “same” operation was used in both convolution layers. The number of filters in both convolution layers was 32. The convolution layers used the ReLU activation functions, the first three layers in the FCN used sigmoidal activations, and the last layer used softmax activations. The FCN had 1024, 512, 256, and 2 neurons in the four layers. The following training options were used: initialization = uniform random, optimizer = Adam, learning rate = 0.001, number of epochs = 50, drop-out probabilities = 0.15.

In summary, the SVM, RF, k-NN, MLP, and CNN-1 classifiers were developed for all three scalogram approaches, and the CNN-2 classifiers were developed for the first 2 approaches to give a total of 17 classifiers. Each classifier was trained and tested independently on the 40 binary data sets. The classifiers were implemented using the PyTorch library.

2.7.5. Cross-Validation

The steps outlined in Section 2.3.1 were used to generate runs consisting of the

m

-ERPs of the training and test sets. For each run that started with randomly ordered 1-ERPs, the 195 1-ERPs of each class were partitioned into 5 folds, each containing 39 1-ERPs. The training set consisted of 4 folds (156 1-ERPs), and the test was the remaining fold. From the 1-ERP training set, an equal number (156) of 4-ERPs were generated for the training set. Similarly, 39 4-ERPs were generated for the test set. This method of generating 4-ERP training and test sets was conducted for each channel. The channel classifiers were trained and tested using 5-fold cross-validation for each run. For each run, the total number 4-ERPs for training and testing each channel classifier consisted of (195 4-ERPs/class) (2 classes) = 390 4-ERPs. The classification accuracy for each run, expressed as a percentage, was estimated as the ratio of the correctly classified 4-ERPs to the total number (390) of 4-ERPs of the 2 classes. The procedure was repeated over 50 runs, and the final classification accuracy of each classifier was estimated by averaging the accuracies of the 50 runs. That is, the final classification accuracy of each classifier was averaged across (390) (50) = 19,500

4

-ERPs belonging to the 2 classes. Importantly, all the classifiers were implemented using identical training and test sets.

3. Results

This section presents the results of the various experiments designed to observe the performance trends of the 3 classification approaches across classifiers, channels, and individual subjects.

3.1. m-ERP Classification Results

In order to observe the trends across the 3 approaches closely, the final classification accuracies obtained from the classifiers across the three approaches are presented separately in five “full” tables, which contain the results for each subject. Due to the large size, the five full tables are presented in Appendix B (Table A1, Table A2, Table A3, Table A4 and Table A5). Each entry in the tables is the average classification accuracy obtained from testing (390 4-ERPs

\times

50 runs) = 19,500 4-ERPs. The best result for each channel is highlighted in bold font. The entries are empty for the CNN-2 classifiers for the V-scalogram approach because they were not implemented. Any inconsistent result (expected improvement is not observed) is highlighted in red. Out of the total of 680 results presented in the five full tables, only 10 (<1.5%) were minorly inconsistent in the sense that the expected improvements of the Z-scalogram classifiers over the S-scalogram classifiers were not observed. Table 2 summarizes the overall average classification across the five subjects and eight channels for each approach. Except in the row labeled Global Average, each entry in Table 2 is the average classification accuracy obtained from testing (five subjects

\times

eight channels

\times

19,500 4-ERPs) = 780,000 4-ERPs. The row labeled Global Average is the average of the classifiers across each approach (column averages). An asterisk on CNN-2 indicates that the CNN-2 results were not included in the computation of the Global Average. The CNN-2 averages are included below the Global Averages to emphasize this point.

Figure 3 shows the 95% confidence interval error bars for the Global Averages in Table 2. The V-scalogram error bar does not overlap with the S-scalogram and Z-scalogram error bars. Therefore, the improvements of the V-scalogram approach over the S-scalogram and Z-scalogram approaches are statistically significant, with p-values less than 0.05.

3.2. Single Trial Results

Up to this point, we have assumed that single trial classification will yield poor classification accuracies. In order to support this assumption, the classifiers were trained and tested on the 1-ERPs of Subject

b_{3}

. The results are shown in Table A6 in Appendix B. For brevity, the results of the 4 other subjects are not included because the trends are the same as those observed for Subject

b_{3}

. As expected, the results for the 1-ERP classification are inferior to the 4-ERP classification. Interestingly, the CNN-1 results using V-scalogram features are quite good for single trial classification. The significant jumps in the CNN results from the S-scalogram accuracies to the V-Scalogram accuracies are noteworthy.

3.3. V-Complement Scalogram Results

Classifiers were developed using only the

(21600 - 17056) = 4544

features outside the COI. Table 3 summarizes the classification accuracies of the classifiers trained using this

\bar{V}

-scalogram approach. The S-scalogram results are also included in the table to facilitate comparison. The

\bar{V}

-scalogram average classification accuracy of 51.95% is close to the 50% accuracy expected for random binary classification. Clearly, the artifact coefficients do not carry useful discriminatory information. Therefore, the performance should improve by zeroing the artifact coefficients and improve even further by removing them completely, as confirmed by the experimental results in Appendix B.

4. Discussion

This section discusses the findings presented in the Section 3.

4.1. Performance Trends

The following general trends in the 6 full tables in Appendix B and the summary table are important to note: for each classifier, there is an improvement in the performance of the Z-scalogram classifiers over the S-scalogram classifiers (including CNN-2) and a consistent improvement of the V-scalogram classifiers over the Z-scalogram classifiers. This is also reflected across the overall averages for the three approaches. Importantly, the CNN-1 results in Table A1, Table A2, Table A3, Table A4 and Table A5 also show that the two-class semantic ERPs can be classified with accuracies exceeding 94% for all 5 subjects using a value as small as

m

= 4 for subsample averaging with the V-scalogram. As noted in the subsampling section, even higher accuracies can be obtained by increasing the averaging parameter

m

.

For the CNN-2 classifier, the improvement of the Z-scalogram approach over the S-scalogram approach is also notable. Quite interestingly, even the Z-scalogram approach can yield accuracies greater than 93% for all five subjects. This impressive performance can be attributed to the fact that the convolution filters are trained to extract discriminatory features from input during the training phase. That is, the zeroed-out region outside the COI is ignored to some degree by the filters. The filters are influenced by the zeroes to a greater extent when they overlap with scalogram regions that are both inside and outside the COI boundary. Based on the observed trends of the other classifiers across the three approaches and the trends between the two CNN-2 implementations across the first two approaches, we can expect a dramatic improvement in the CNN-2 implementation of the V-scalogram approach over the standard S-scalogram approach if the convolution mechanisms in the CNNs can be modified to accommodate unequal row matrices. Furthermore, through an understanding of how the filter weights are trained by convolving over entire matrices for S and Z-scalograms containing artifact and zeroed-out coefficients, respectively, it can be expected that the convolution filters trained only over the correct V-scalogram information are bound to lead to better performance. It is important to note that modifying wavelet-based CNN classifiers to restrict the matrix convolutions within the COI would be a major breakthrough, given the huge attention focused on developing and applying wavelet-based CNN classifiers to numerous problems in various fields. Future efforts will focus on developing direct/indirect methods to restrict matrix convolutions within the COI.

4.2. Relative Improvements

The consistent improvements of the Z-scalogram and V-scalogram approaches over the baseline S-scalogram approach are quite remarkable because they are observed across the

m

-ERPs of all 5 subjects across all 8 channels (all 40 data sets). The relative improvement in the performance of an approach B over an approach A, denoted by

Γ (B, A)

, can be measured using

Γ (B, A) = [\frac{α (B) - α (A)}{α (A)}] \times 100

(13)

where

α (A)

and

α (B)

are the classification accuracies of approaches

A

and

B

, respectively. Table 4 summarizes the relative improvements for the average accuracies listed in Table 2. These results clearly confirm that the Z-scalogram approach outperforms the S-scalogram approach and the V-scalogram approach outperforms the Z-scalogram approach. Most importantly, the improvements of the V-scalogram approach over the standard S-scalogram approach are dramatic.

4.3. Comparison of Classifiers

Although the primary goal of this study is to show that for any given classifier, the performance of the classifier using the standard S-scalogram features can be improved by using Z-scalogram features, and further improvements can be obtained by using V-scalogram features, the results in the tables can also be used to rank the 6 different classifiers to pick the best performing classifier type. From the averages in Table 2, it can be concluded that the best results for the V-scalogram approach across classifiers, channels, and individual subjects were obtained with the CNN-1 classifiers. The superiority of the CNN classifier models can be attributed to the fact that the filters in the convolution layer have the ability to extract complex features that are relevant to classification. Most importantly, the scalograms can be input directly to the CNNs without having to extract a set of hand-engineered features. Consequently, the performance is not dependent on the hand-engineered feature set, which is often determined through trial and error. The size and the shape of the convolution filters can couple information in different ways to extract different types of features from the same input presented to the CNN. For example, row filters couple information from individual frequency bands in the scalogram. A matrix filter can be used to couple information across frequency bands. If needed, the complexity of the features can be increased by increasing the number of convolution layers. In general, CNNs are highly scalable for larger classification problems by increasing the network depth and increasing training. Therefore, CNNs should be a good choice for multi-subject group-based classification. Furthermore, there are several elegant ways of fusing “information” from multiple channels with CNNs to improve performance. The information can be data/features of the channels or decisions of the channel classifiers. The study in [67] involving the classification of multi-axial (multi-channel) signals using CNNs shows that, in general, multi-axial fusion classification models outperform the best uniaxial (single-channel) classification model.

5. Conclusions

The goal of this study was to demonstrate the importance of exploiting information offered by the COI in designing CWT-based classifiers. It was hypothesized that the performance would improve by zeroing out the unreliable scalogram features outside the COI and that the performance would improve even further by completely excluding the unreliable features outside the COI. A subsampling strategy was developed to generate small-average ensembles to enable customized design for individual subjects/patients, and a rank-of-rank sum channel selection strategy was developed to select a subset of channels from multiple channels. The S-, Z-, and V-scalogram classification approaches developed to validate the performance hypotheses were implemented with a set of diverse classifiers and tested on semantic task ERPs of 5 subjects. The results from 40 data sets confirmed the expected relative improvements, and it was noted that the improvements were remarkably consistent across channels, classifiers, and subjects. Most importantly, the relative improvements of the V-scalogram classifiers over the standard S-scalogram classifiers were dramatic. Among the 5 classifier types implemented, the CNN classifiers yielded the best results consistently.

In conclusion, this study demonstrated unequivocally that excluding the region outside the COI for feature extraction results in a significant improvement in classifier performance. The V-scalogram approach is not restricted to ERP/EEG classification and is applicable to improving the performance of wavelet classifiers designed to classify other physiological signals as well as signals in many other fields, including engineering, science, and economics. Undeniably, this is a significant contribution, considering the vast number of wavelet-based classifiers that have been reported but have not exploited the COI for feature extraction in the manner described in this study. Furthermore, the ability to design customized classifiers for individual subjects which yield high classification accuracies using subsample averaging is an important contribution for studying and diagnosing patient-specific disorders using ERPs and EEGs.

Author Contributions

Conceptualization, L.G., X.C. and R.S.G.; methodology, X.C., L.G. and R.S.G.; software, X.C.; validation, L.G., X.C. and R.S.G.; formal analysis, X.C. and L.G; investigation, X.C. and L.G., writing—original draft preparation, X.C. and L.G.; writing—review and editing, L.G., X.C. and R.S.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to thank the anonymous reviewers for their helpful suggestions to improve the paper.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

The convolution operations that lead to the definitions of the S-scalogram, Z-scalogram, V-scalogram, and COI are described in detail in this Appendix. The formulation of the generalized RRS channel selection algorithm applicable to single/multiple subjects and dichotomous/polychotomous classification is also presented in this Appendix.

Appendix A.1. Types of Convolution

The question that is addressed in this section is: what type of convolution causes the edge artifacts to occur in the scalogram coefficient matrix? In order to answer this question, three convolution mechanisms are described by assuming the durations of the signal

x (n)

and the zero-centered mother wavelet filter

h (n)

are

N

and

(2 M + 1)

, respectively. That is, the compact support of

h (n)

is

[- M, M]

. It is also assumed that

M < N

.

Appendix A.1.1. Full-Convolution

In general, the discrete convolution sum is given by

y (n) = \sum_{λ = - M}^{M} x (λ) h (n - λ)

(A1)

where

λ

is a dummy variable. That is, the convolution operation is performed by shifting a folded version of

h (λ)

along

x (λ)

and finding the dot product at each shift

n

. It is assumed that the shift occurs in unit steps. In order to simplify the discussion, it will be assumed that

h (n)

is symmetric. In full convolution, the rightmost sample of

h (λ)

is aligned with the leftmost sample of

x (λ)

at the start of convolution, and the leftmost sample of

h (λ)

is aligned with the rightmost sample of

x (λ)

at the end of convolution. By convention, the filtered output is at the filter center; therefore, the duration of the filtered output is

\tilde{N} = [N + (2 M + 1) - 1 = N + 2 M]

. In order to perform the dot product at the start and end,

x (λ)

has to be padded by

2 M

samples on the left and on the right. A common practice is to pad the ends with zeros; however, other methods, such as padding with the mean value of the signal, linear padding, symmetric padding, periodic extension padding, and fitting polynomial functions, can also be employed [27,68,69]. For the given signal and filter durations, the full-convolution sum can be expressed as

y (n) = \sum_{λ = - M}^{M} \hat{x} (λ) h (n - λ), n = 0, 1, \dots, (\tilde{N} - 1),

(A2)

where,

\hat{x} (λ)

is

x (λ)

padded with

2 M

samples on both ends. It is important to observe that the first

2 M

and the last

2 M

samples of the output are computed from partial overlaps between

x (λ)

and

h (λ),

which are the artifact coefficients in this case. Irrespective of the padding method employed, the outputs computed from the partial overlaps are the inaccurate artifact coefficients.

In practice, full convolution is performed very efficiently in the frequency domain by multiplying the DFTs of

x (n)

and

h (n)

. The filtered output is given by the inverse DFT of the multiplication. The DFTs are computed by augmenting both sequences with zeros to have durations equal to or greater than

\tilde{N}

for the resulting circular convolution to be identical to the desired linear convolution. The DFTs are computed efficiently using fast Fourier transform (FFT) algorithms [70].

Appendix A.1.2. Same-Convolution

This form of convolution is the one used to generate the standard CWT coefficient matrix to ensure that the rows have the same duration as the input. Each row of the coefficient matrix is obtained by performing the same-convolution operation between

x (n)

and

h (n)

. The convolution operation starts by aligning the filter center with the rightmost sample of

x (λ)

and ends with the alignment of the filter center with the rightmost sample of

x (λ)

. The signal

x (λ)

is padded with

M

samples at both ends; therefore, the first

M

and the last

M

samples of the output are computed from partial overlaps, which result in artifact coefficients. For the given signal and filter durations, the same-convolution sum can be written as

y (n) = \sum_{λ = - M}^{M} \hat{x} (λ) h (n - λ), n = 0, 1, \dots, (N - 1),

(A3)

where,

\hat{x} (λ)

is

x (λ)

now padded with

M

samples on both ends. We refer to the resulting scalogram, which has rows with the same duration as the input signal, as the same (S)-scalogram. The shorter duration same-convolution result can be obtained from full-convolution by cropping out the first

M

and last

M

samples. In general, row

j

of the standard scalogram is generated by same-convolving

x (n)

with a wavelet having an effective support

[- (S_{j}) (M), (S_{j}) (M)]

; therefore, the first and last

(S_{j}) (M)

coefficients in the row are the artifact coefficients, and the coefficients in between are the accurate coefficients.

Appendix A.1.3. Valid Convolution

The valid convolution operation yields coefficients that are accurate and is used to identify the COI in the coefficient matrix. In this form of convolution, the leftmost sample of the centered filter is aligned with the leftmost sample of

x (λ)

at the start, and the rightmost sample of the centered filter is aligned with the rightmost sample of

x (λ)

at the end of the convolution. Consequently, the duration of the output is

[N - 2 M]

. Valid convolution requires no padding at the ends of

x (λ)

; thus, none of the output samples are computed from partial overlaps resulting in no artifact coefficients. That is, all valid coefficients are computed accurately. Valid convolution can be obtained from same-convolution by cropping the first and last M samples and by cropping the first and last

2 M

samples from full-convolution. For the given signal and filter durations, the valid convolution sum can be written as

y (n) = \sum_{λ = - M}^{M} x (λ) h (n - λ), n = 0, 1, \dots, (N - 2 M - 1) .

(A4)

As noted in Section 2.1, the COI is the boundary that separates the artifact coefficients from the accurate coefficients in the S-scalogram, the part of the S-scalogram that is inside the COI is the V-scalogram, and the entire S-scalogram with zeroed-out coefficients outside the COI is referred to as the Z-scalogram.

Appendix A.2. RRS Algorithm

The channel rankings can be computed from the 1-ERPs or the

m

-ERPs. The RRS algorithm is described in terms of

m

-ERPs and is applicable to 1-ERPs by setting

m

= 1. Given

C

channels, the ranking

R_{m} (c), c = 1, 2, \dots, C

, is such that the channel with the highest multi-class separation is assigned the highest

r a n k = 1

and the channel with the smallest is assigned the lowest

r a n k = C

. Based on this ranking, channels can be selected by evaluating the performance, starting with the best channel, and adding the next best channels until the performance starts to drop. Alternatively, a subset of channels with the highest ranks can be selected to decrease the computational requirements, especially for dense-electrode arrays. If

L_{m x c b_{i}}

and

L_{m y c b_{i}}

are the

m

-ERPs of subject

b_{i}, i = 1, 2, \dots B

, elicited by stimuli

I_{x}

and

I_{y}

at channel

c

, respectively, and

||L_{1} - L_{2}||

is the Euclidean distance between

L_{1}

and

L_{2}

, the interclass separation between the

m

-ERPs belonging to a pair of classes,

ω_{x}

and

ω_{y}

, at channel

c

is given by

ρ_{(x, y) m b_{i}} (c) = \frac{{||\bar{L_{m x c b_{i}}} - \bar{L_{m y c b_{i}}}||}^{2}}{\sum_{q = 1}^{N_{x}} {||L_{m x c b_{i}}^{(q)} - \bar{L_{m x c b_{i}}}||}^{2} + \sum_{q = 1}^{N_{y}} {||L_{m y c b_{i}}^{(q)} - \bar{L_{m y c b_{i}}}||}^{2}}, c = 1, 2, \dots, C

(A5)

where

L_{m x c b_{i}}^{(q)}

and

L_{m y c b_{i}}^{(q)}

are the

q^{t h}

m

-ERPs;

\bar{L_{m x c b_{i}}}

and

\bar{L_{m y c b_{i}}}

are the means, and

N_{x}

and

N_{y}

are the number of

m

-ERPs in the

ω_{x}

and

ω_{y} m

-ERP ensembles, respectively. Note that the pair

(x, y)

takes on the following values:

(x, y) = (ω_{1}, ω_{2}), (ω_{1}, ω_{3}), \dots, (ω_{1}, ω_{J}), (ω_{2}, ω_{3}), (ω_{2}, ω_{4}), \dots, (ω_{2}, ω_{J}), \dots, [ω_{(U - 1)}, ω_{U}] .

(A6)

The rank

R_{(x, y) m b_{i}} (c)

of

ρ_{(x, y) m b_{i}} (c)

for each subject

b_{i}

is determined for all possible pairs of

x

and

y

. Then, the rank-sum

(R S)

of channel

c

for subject

b_{i}

across all possible

(U) (U - 1) / 2

pairs of ERP classes is given by

{(R S)}_{m b_{i}} (c) = \sum_{x = 1}^{U - 1} \sum_{y = x + 1}^{U} R_{(x, y) m b_{i}} (c)

(A7)

The ranking of the channels for subject

b_{i}

is obtained from the rank of the rank-sum given by

R_{m b_{i}} (c) = R a n k \{{(R S)}_{m b_{i}} (c)\}, c = 1, 2, \dots, C

(A8)

The ranking of the channels across all

B

subjects is determined by first summing the ranks to obtain the rank-sum

{(R S)}_{m} (c) = \sum_{i = 1}^{S} R_{m b_{i}} (c)

(A9)

and the final ranking of the channels across the U ERP classes and B subjects is determined from the rank of the rank-sum given by

R_{m} (c) = R a n k \{{(R S)}_{m} (c)\} .

(A10)

The entire sequence of steps can be summarized as follows:

\{R a n k \{\begin{matrix} S u m a c r o s s \\ s u b j e c t s \end{matrix} \{R a n k \{\begin{matrix} S u m a c r o s s \\ c l a s s p a i r s \end{matrix} \{R a n k \{ρ_{(x, y) m b_{i}} (c)\}\}\}\}\}\}

(A11)

For the special dichotomous (2-class) case, the ranking of the channels for each subject is determined directly from

R_{m b_{i}} (c)

which is the ranking of

ρ_{m b_{i}} (c)

, the interclass separation of the two classes

x

and

y

. The ranking across all subjects is obtained from the rank-sum

{(R S)}_{m} (c)

of channel

c

, which is given by

{(R S)}_{m} (c) = \sum_{i = 1}^{S} R_{m b_{i}} (c)

(A12)

and the final ranking of channel

c

is given by the rank-of-the rank-sum

R_{m} (c) = R a n k \{{(R S)}_{m} (c)\} .

(A13)

A final comment regarding the application of the RRS algorithm is that the channel rankings are determined by applying the RRS algorithm to the ERPs of only the training set.