Fast state tomography with optimal error bounds

M Guţă; J Kahn; R Kueng; J A Tropp

doi:10.1088/1751-8121/ab8111

1. Introduction

Quantum state tomography is the task of reconstructing a quantum state from experimental measurement data. Among various estimation methods, maximum-likelihood (ML) [1, 2] is a universal approach which produces point estimators with good average error properties and is extensively used in practice [3]. However, ML is computationally expensive and the practice of using ML in combination with bootstrap to produce error bars is not always theoretically grounded. Indeed such methods can be used to produce asymptotic confidence regions in scenarios where the theory of local asymptotic normality applies (e.g. for fully mixed states), but are inconsistent for rank-deficient states [4]. This has spurred the development of alternatives, such as Bayesian [5–7] and region [8–10] estimators. However, these methods have other drawbacks, such as comparatively high computational cost and weak (or implicit) convergence guarantees. In parallel, the estimation of low-rank states from a small number of observables has been studied in compressed sensing [11–14], and in a series of papers by Koltchinskii and coauthors [15–18].

In this work we present an in depth analysis of an alternative method, the projected least squares (PLS) estimator, and show that it improves on the status quo in several significant directions. PLS is obtained by performing a standard linear inversion [19] followed by a projection onto the space of density matrices. It is numerically 'cheap' in the sense that its computational cost is dominated by forming the least-squares estimator. Indeed, numerical simulations show that PLS is much faster than prominent alternatives like maximum-likelihood estimation [3] and compressed sensing [11].

PLS is also statistically efficient. Our main theoretical results show that PLS achieves fundamental lower bounds [20] for tomography with separate measurements: to reconstruct an arbitrary d dimensional state ρ of rank-r with accuracy in trace distance it suffices to measure r² d⁻² logd independent samples with a two-design measurement, or r² d⁻² samples with a covariant measurement. This sampling rate improves upon existing results [21] and is competitive with the most powerful techniques in the literature [13, 22]. The statistical performance of the PLS estimator has also been studied in [18] in the context of a different statistical model involving noisy observations of expectations of randomly chosen observables. In this context the authors show that up to logarithmic factors, the PLS estimator achieves general lower bounds for several loss functions.

Note that correlated measurements over several copies of the state can achieve even lower sampling rates [20, 23, 24], but are much more challenging to implement. Here we focus on the practically more relevant case of measuring each copy of ρ separately. The complementary analysis in [25] shows the PLS has a tractable asymptotic behavior which may provide the route to devising simple, tight, and theoretically justified error bars.

For a direct access to results and techniques, the paper is organised according to the onion peeling principle. We first introduce concepts and results and gradually reveal the technical details, with most proofs relegated to the appendices. In section 2 we introduce the PLS estimator. In section 3 we present the main theoretical results which establish concentration bounds for the PLS for several important classes of measurements: two-designs, Pauli bases, Pauli observables and covariant measurements. Section 4 provides the algorithmic details necessary for implementing the PLS estimator for the different measurement classes, including the explicit expression of the LS estimator and the projection onto the space of states. The detailed calculations and the proofs of the main theorems can be found in the appendices. Finally, in section 5 we present simulation results showing that the statistical performance of PLS is comparable to ML and compressed sensing for a range of states and measurements.

2. The projected least squares estimator

A measurement on a d-level system is described by d × d Hermitian matrices ${M}_{1},\dots ,\;{M}_{m}\in {\mathbb{H}}_{d}$ that are positive semidefinite and obey ${\sum }_{i=1}^{m}{M}_{i}=\mathbb{I}$ . All measurements will be assumed to be informationally complete, so that for every pair of distinct states ρ ≠ σ, there exists $i\in \left[m\right]=\left\{1,\dots ,\;m\right\}$ such that tr(M_i ρ) ≠ tr(M_i σ). Measuring a quantum state ρ results in one of m outcomes, indexed by $i\in \left[m\right]$ . The probability of observing outcome i depends on ρ and is described by Born's rule:

$\begin{equation}{\left[p\right]}_{i}=\mathrm{Pr}\left[i\vert \rho \right]=\mathrm{t}\mathrm{r}\left({M}_{i}\rho \right)\quad \mathrm{f}\mathrm{o}\mathrm{r}i\in \left[m\right].\end{equation} \tag{ 1 }$

These probabilities can be estimated by the empirical frequencies: measure n identical copies of the state separately, and set

$\begin{equation}{\left[{f}_{n}\right]}_{i}=\frac{{n}_{i}}{n}\quad \mathrm{f}\mathrm{o}\mathrm{r}i\in \left[m\right],\end{equation} \tag{ 2 }$

where n_i is the number of times outcome i was observed. Indeed by the law of large numbers, ${\left[{f}_{n}\right]}_{i}\to {\left[p\right]}_{i}$ as n → ∞. The least-squares (LS) estimator is the solution to the least-squares problem that results from replacing the true probabilities in Born's rule by frequencies (2):

$\begin{equation}{\hat{L}}_{n}=\underset{X\in {\mathbb{H}}_{d}}{\text{argmin}}\quad \sum _{i=1}^{m}{\left({\left[{f}_{n}\right]}_{i}-\mathrm{t}\mathrm{r}\left({M}_{i}X\right)\right)}^{2}.\end{equation} \tag{ 3 }$

This optimization inverts the m linear equations specified by Born's rule, and the solution has the closed form expression (7) which will be discussed in more detail in section 4. In general, ${\hat{L}}_{n}$ can have negative eigenvalues, so it may fail to be a quantum state. Several ways to overcome this drawback have been proposed [17, 21, 26–29].

In this paper we consider the projected least squares (PLS) estimator, which is obtained by projecting ${\hat{L}}_{n}$ onto the convex set of all quantum states with respect to the Frobenius distance

$\begin{equation}{\hat{\rho }}_{n}=\underset{\sigma \mathrm{i}\mathrm{s}\mathrm{a}\mathrm{q}\mathrm{u}\mathrm{a}\mathrm{n}\mathrm{t}\mathrm{u}\mathrm{m}\mathrm{s}\mathrm{t}\mathrm{a}\mathrm{t}\mathrm{e}}{\text{argmin}}\quad {{\Vert}{\hat{L}}_{n}-\sigma {\Vert}}_{2}.\end{equation} \tag{ 4 }$

The procedure for computing PLS is summarised in table 1.

Table 1. Summary of the estimation technique.

Projected least squares (PLS) estimator ${\hat{\rho }}_{n}$
• Estimate probabilities by frequencies (2),
• Compute the least squares estimator (3),
• Project it onto the set of quantum states (4).

The simplicity of PLS is a key advantage for both the analysis and the actual computation. The least-squares estimator (3) admits a closed-form expression, while the projection onto the set of quantum states (11) can be computed efficiently using a simple thresholding algorithm which adjusts the eigenvalues of the LS estimator while leaving the eigenvectors unchanged [27], as will be detailed in section 4. Hence, computing PLS requires considerably less storage and arithmetic than existing techniques that are based on more complicated optimization problems.

Below we discuss the statistical performance of PLS, and show that it exhibits a rank-dependent risk reduction with respect to least squares which matches fundamental lower bounds of [20].

3. Results

3.1. Error bounds and confidence regions for ${\hat{\rho }}_{n}$

We analyze several important and practically relevant measurement schemes: structured POVMs (e.g. symmetrically informationally complete (SIC)–POVMs, mutually unbiased bases (MUBs) and stabilizer states), (global) Pauli observables, Pauli basis measurements, and the uniform/covariant POVM. For details on these measurement classes, and the computation of the LS and PLS estimators we refer to section 4. Here we focus on explaining the novel results and the key techniques used in establishing them. The following theorem shows that the PLS estimator ${\hat{\rho }}_{n}$ converges to the true state ρ in trace distance ||⋅||₁ and provides rank-dependent concentration bounds for each measurement class and arbitrary sample size n.

Theorem 1. (Error bound for ${\hat{\rho }}_{n}$ ) Let $\rho \in {\mathbb{H}}_{d}$ be a state and fix a number of samples $n\in \mathbb{N}$ . Then, for each of the aforementioned measurements, the PLS estimator (table 1) obeys

$\begin{equation*}\mathrm{Pr}\left[{{\Vert}{\hat{\rho }}_{n}-\rho {\Vert}}_{1}{\geqslant}{\epsilon}\right]{\leqslant}d{\mathrm{e}}^{-\frac{n{{\epsilon}}^{2}}{43g\left(d\right){r}^{2}}}\quad \mathrm{f}\mathrm{o}\mathrm{r}{\epsilon}\in \left[0,1\right],\end{equation*}$

where $r=\mathrm{min}\left\{\mathrm{r}\mathrm{a}\mathrm{n}\mathrm{k}\left(\rho \right),\mathrm{r}\mathrm{a}\mathrm{n}\mathrm{k}\left({\hat{\rho }}_{n}\right)\right\}$ and g(d) specifies dependence on the ambient dimension:

g(d) = 2d for structured POVMs; see equation (8) for ${\hat{L}}_{n}$ ,

g(d) = d² for Pauli observables; see equation (9) for ${\hat{L}}_{n}$ ,

g(d) ≃ d^1.6 for Pauli basis measurements; see equation (10) for ${\hat{L}}_{n}$ .

The following immediate corollary endows ${\hat{\rho }}_{n}$ with rigorous non-asymptotic error-bars in trace distance.

Corollary 1. The trace-norm ball of size $\mathrm{r}\mathrm{a}\mathrm{n}\mathrm{k}\left({\hat{\rho }}_{n}\right)\sqrt{43g\left(d\right){n}^{-1}\mathrm{log}\left(d/\delta \right)}$ around ${\hat{\rho }}_{n}$ (intersected with state space) is a δ-confidence region for the true state ρ.

We emphasize the following aspects of this result:

(a)
(Almost) optimal sampling rate: theorem 1 highlights that
$\begin{equation}n{\geqslant}43g\left(d\right)\frac{\mathrm{r}\mathrm{a}\mathrm{n}\mathrm{k}{\left(\rho \right)}^{2}}{{{\epsilon}}^{2}}\mathrm{log}\left(\frac{d}{\delta }\right)\end{equation} \tag{ 5 }$
samples suffice to ensure ${{\Vert}{\hat{\rho }}_{n}-\rho {\Vert}}_{1}{\leqslant}{\epsilon}$ with probability at least 1 − δ. For structured POVMs and Pauli observables, this sampling rate is comparable to the best theoretical bounds for alternative tomography algorithms [11, 30]. Moreover, fundamental lower bounds in [13, 20] indicate that this scaling is optimal up to a single log(d)-factor, so it cannot be improved substantially.
(b)
Implicit exploitation of (approximate) low rank: the number of samples required to achieve a good estimator scales quadratically in the rank, rather than the ambient dimension d. This behavior extends to the case where ρ, or ${\hat{\rho }}_{n}$ , is well-approximated by a rank-r matrix; see theorem 4 in appendix E. These results are comparable with guarantees for compressed sensing methods [13] that are specifically designed to exploit low-rank. For numerical confirmation we refer to figure 2, and other simulation results in section 5.

Proof sketch for theorem 1. The least-squares estimator ${\hat{L}}_{n}$ can be viewed as a sum of n independent random matrices. To illustrate this, consider a single measurement of the structured POVM type. Then ${\hat{L}}_{1}$ defined in (8) is an instance of the random matrix $X=\left(d+1\right)\vert {v}_{k}\rangle \langle {v}_{k}\vert -\mathbb{I}$ , where $k\in \left[m\right]$ occurs with probability $\frac{d}{m}\langle {v}_{k}\vert \rho \vert {v}_{k}\rangle$ (Born's rule). This generalizes to ${\hat{L}}_{n}=\frac{1}{n}{\sum }_{i=1}^{n}{X}_{i}$ , where the matrices X_i are statistically independent. Such sums of random matrices concentrate sharply around their expectation value $\mathbb{E}{\hat{L}}_{n}=\rho$ , and matrix concentration inequalities [31] quantify this convergence:

$\begin{equation}\mathrm{Pr}\left[\parallel {\hat{L}}_{n}-\rho {\parallel }_{\infty }{\geqslant}\tau \right]{\leqslant}d{\mathrm{e}}^{-\frac{3n{\tau }^{2}}{8g\left(d\right)}}\quad \tau \in \left[0,1\right].\end{equation} \tag{ 6 }$

This bound induces a similar operator norm bound for the PLS estimator: the projection is a contraction in operator norm and the shift in eigenvalues is bounded by τ. The resulting bound ${{\Vert}{\hat{\rho }}_{n}-\rho {\Vert}}_{\infty }{\leqslant}2\tau$ may be transformed into a trace-norm bound, as shown in appendix D. This comparison only depends on the (approximate) rank of the states involved. We refer to appendix E for full details of the proof. □

3.2. Optimal performance guarantee for the uniform POVM

The upper bound in theorem 1 involves a dimension factor d that may be extraneous. This, in turn, introduces an additional log(d)-gap between equation (5) and existing lower bounds [20]. The dimensional factors emerge because we employ matrix-valued concentration inequalities in the proof. Our second main result shows that we can remove this factor for the uniform POVM, which encompasses all rank-one projectors:

Theorem 2. (Convergence of ${\hat{\rho }}_{n}$ for the uniform POVM) For uniform POVM measurements, the PLS estimator obeys

$\begin{equation*}\mathrm{Pr}\left[\parallel {\hat{\rho }}_{n}-\rho {\parallel }_{1}{\geqslant}{\epsilon}\right]{\leqslant}{\mathrm{e}}^{2.2d-\frac{{{\epsilon}}^{2}n}{480{r}^{2}}}\quad \mathrm{f}\mathrm{o}\mathrm{r}{\epsilon}{ >}0,\end{equation*}$

where $r=\mathrm{min}\left\{\mathrm{r}\mathrm{a}\mathrm{n}\mathrm{k}\left(\rho \right),\mathrm{r}\mathrm{a}\mathrm{n}\mathrm{k}\left({\hat{\rho }}_{n}\right)\right\}$ .

This result exactly reproduces the best existing performance guarantees for tomography from independent measurements [22]. The bound follows from standard techniques from high-dimensional probability theory.

Proof sketch of theorem 2. Similarly to the proof of theorm 1, we first establish a (different) concentration bound for the LS estimator ${\hat{L}}_{n}$ with respect to the operator norm, and then use the same arguments as in theorem 1 to convert this into a concentration bound for the PLS with respect to the trace norm. The operator norm has a variational formulation:

$\begin{equation*}{{\Vert}{\hat{L}}_{n}-\rho {\Vert}}_{\infty }={\mathrm{max}}_{z\in {\mathbb{S}}^{d}}\vert \langle z\vert {\hat{L}}_{n}-\rho \vert z\rangle \vert .\end{equation*}$

The optimization over the unit sphere may be replaced by a maximization over a finite point set, called a covering net, whose cardinality scales exponentially in d. For any $z\in {\mathbb{S}}^{d}$ , $\langle z\vert {\hat{L}}_{n}-\rho \vert z\rangle$ is a sum of n i.i.d. random variables that exhibit subexponential tail decay. (Measuring the uniform POVM allows us to draw this conclusion.) Standard concentration inequalities yield a tail bound that decays exponentially in the number n of samples. Applying a union bound over all points z_i in the net then ensures $\mathrm{Pr}\left[{{\Vert}{\hat{L}}_{n}-\rho {\Vert}}_{\infty }{\geqslant}\tau \right]{\leqslant}2{\mathrm{e}}^{{c}_{1}d-{c}_{2}n{\tau }^{2}}.$ Subsequently, closeness in operator norm for ${\hat{L}}_{n}$ may be converted into closeness in trace-norm for ${\hat{\rho }}_{n}$ at the cost of an additional (effective) rank factor, cf appendix D. The details of the concentration bound for ${\hat{L}}_{n}$ can be found in appendix F.

□

4. Algorithmic considerations

4.1. Explicit solutions for the least squares estimator (3)

Tomographically complete measurements can be viewed as injective linear maps $\mathcal{M}:{\mathbb{H}}_{d}\to {\mathbb{R}}^{m}$ with components ${\left[\mathcal{M}\left(X\right)\right]}_{i}=\mathrm{t}\mathrm{r}\left({M}_{i}X\right)$ specified by Born's rule (1). It is well known that the least-squares problem (3) admits the closed-form solution:

$\begin{equation}{\hat{L}}_{n}={\left({\mathcal{M}}^{{\dagger}}\mathcal{M}\right)}^{-1}\left({\mathcal{M}}^{{\dagger}}\;\left({f}_{n}\right)\right).\end{equation} \tag{ 7 }$

We evaluate this formula for different measurements and content ourselves with sketching key steps and results. The details can be found in appendix B.

4.1.1. Structured POVMs and the uniform POVM

Also known as two-designs, these systems include highly structured, rank-one POVMs ${\left\{\frac{d}{m}\vert {v}_{i}\rangle \langle {v}_{i}\vert \right\}}_{i=1}^{m}$ , such as symmetric informationally complete POVMs [32], maximal sets of mutually unbiased bases [33], the set of all stabilizer states [34, 35], as well as the uniform POVM. By definition, for $X\in {\mathbb{H}}_{d}$ , all of the above systems obey

$\begin{equation*}{\mathcal{M}}^{{\dagger}}\mathcal{M}\left(X\right)=\frac{{d}^{2}}{m}\sum _{i=1}^{m}\langle {v}_{i}\vert X\vert {v}_{i}\rangle \vert {v}_{i}\rangle \langle {v}_{i}\vert =\frac{md}{d+1}\left(X+\mathrm{t}\mathrm{r}\left(X\right)\mathbb{I}\right).\end{equation*}$

These equations can readily be inverted, and equation (7) simplifies to

$\begin{equation}{\hat{L}}_{n}=\left(d+1\right)\sum _{i=1}^{m}{\left[{f}_{n}\right]}_{i}\vert {v}_{i}\rangle \langle {v}_{i}\vert -\mathbb{I}.\end{equation} \tag{ 8 }$

4.1.2. Pauli observables

Fix d = 2^k (k qubits), and let ${W}_{1},\dots ,\;{W}_{{d}^{2}}\in {\mathbb{H}}_{d}$ be the set of Pauli observables, comprising all possible k-fold tensor products of the elementary 2 × 2 Pauli matrices. We can approximate the expectation value tr(W_i ρ) of each Pauli observable by the empirical mean ${\hat{\mu }}_{i}={\left[{f}_{n/{d}^{2}}^{+}\right]}_{i}-\left[{f}_{n/{d}^{2}}^{-}\right]$ of the two-outcome POVM ${P}_{i}^{{\pm}}=\frac{1}{2}\left(\mathbb{I}{\pm}{W}_{i}\right)$ . Pauli matrices form a unitary operator basis, and the evaluation of equation (7) is simple:

$\begin{equation}{\hat{L}}_{n}=\frac{1}{d}\sum _{i=1}^{{d}^{2}}{\hat{\mu }}_{i}{W}_{i}=\frac{1}{d}\sum _{i=1}^{{d}^{2}}\left({\left[{f}_{n/{d}^{2}}^{+}\right]}_{i}-{\left[{f}_{n/{d}^{2}}^{-}\right]}_{i}\right){W}_{i}.\end{equation} \tag{ 9 }$

4.1.3. Pauli basis measurements

Rather than approximating (global) expectation values, it is possible to perform different combinations of local Pauli measurements. For d = 2^k, there are 3^k potential combinations in total. Each of the settings $\to {s}\in {\left\{x,y,z\right\}}^{k}$ corresponds to a basis measurement $\vert {b}_{\to {o}}^{\left(\to {s}\right)}\rangle \langle {b}_{\to {o}}^{\left(\to {s}\right)}\vert$ , where $\to {o}\in {\left\{{\pm}1\right\}}^{k}$ labels the 2^k potential outcomes. The union $\mathcal{M}$ of all 3^k bases obeys $\left({\mathcal{M}}^{{\dagger}}\mathcal{M}\right)\left(X\right)={3}^{k}{\mathcal{D}}_{1/3}^{\otimes k}\left(X\right)$ , where ${\mathcal{D}}_{1/3}\left(X\right)=\frac{1}{3}\rho +\left(1-\frac{1}{3}\right)\frac{\mathrm{t}\mathrm{r}\left(X\right)}{2}\mathbb{I}$ denotes a single-qubit depolarizing channel. Evaluating equation (7) yields

$\begin{equation}{\hat{L}}_{n}=\frac{1}{{3}^{k}}\sum _{\to {s},\to {o}}{\left[{f}_{n/{3}^{k}}\right]}_{\to {o}}^{\left(\to {s}\right)}{\left({\mathcal{D}}_{1/3}^{\otimes k}\right)}^{-1}\left(\vert {b}_{\to {o}}^{\left(\to {s}\right)}\rangle \langle {b}_{\to {o}}^{\left(\to {s}\right)}\vert \right).\end{equation} \tag{ 10 }$

Finally, we point out that all of these explicit solutions are guaranteed to have unit trace: $\mathrm{t}\mathrm{r}\left({\hat{L}}_{n}\right)=1$ . They are not positive semidefinite in general.

4.2. Explicit solutions for the projection step (4)

The PLS estimator is defined to be the state closest in Frobenius norm to the least-squares estimator ${\hat{L}}_{n}$ . The search (4) admits a simple, analytic solution [27]. Define the all-ones vector $\to {1}\in {\mathbb{R}}^{d}$ and the thresholding function ${\left[\cdot \right]}^{+}$ with components ${\left[\to {y}\right]}_{i}^{+}=\mathrm{max}\left\{{\left[\to {y}\right]}_{i},0\right\}$ . Let ${\hat{L}}_{n}=U\mathrm{diag}\left(\to {\lambda }\right){U}^{{\dagger}}$ be an eigenvalue decomposition. Then

$\begin{equation}{\hat{\rho }}_{n}=U\mathrm{diag}\left({\left[\to {\lambda }-{x}_{0}\to {1}\right]}^{+}\right){U}^{{\dagger}},\end{equation} \tag{ 11 }$

where ${x}_{0}\in \mathbb{R}$ is chosen so that $\mathrm{t}\mathrm{r}\left({\hat{\rho }}_{n}\right)=1$ . The fact that ${\hat{L}}_{n}$ itself has unit trace ensures that this solution to equation (4) is unique. The number x₀ may be determined by applying a root-finding algorithm to the non-increasing function $f\left(x\right)=2+\mathrm{t}\mathrm{r}\left({\hat{L}}_{n}\right)-\;\mathrm{d}x+{\sum }_{i=1}^{d}\left\vert {\lambda }_{i}-x\right\vert$ .

4.3. Runtime analysis

The two steps discussed here are inherently scalable: just count frequencies to determine the LS estimators (8)–(10) at a total cost of (at most) $\;\mathrm{min}\left\{m,n\right\}$ matrix additions. The subsequent projection onto state-space is a particular type of soft-thresholding. The associated computational cost is dominated by the eigenvalue decomposition and has runtime (at most) $\mathcal{O}\left({d}^{3}\right)$ .

In summary, forming ${\hat{L}}_{n}$ is the dominant cost of a naïve implementation. However, the high degree of structure may allow us to employ techniques from randomized linear algebra [36] to further reduce the cost.

5. Numerical experiments

We numerically compare the performance of PLS to maximum likelihood (ML) and compressed sensing (CS), respectively. Additional numerical studies for MUBs can be found at the end of this section.

5.1. PLS versus ML

Figure 1 compares ML and PLS for Pauli basis measurements in dimension d = 2⁴. The trace-norm error incurred by PLS is within a factor of two of ML for low-rank states. A more exhaustive simulation study comparing PLS with other 'truncated' estimators for a range of states can be found in [25]. This confirms the statistical efficiency of PLS for the family of low rank states which are the focus of this paper.

Computationally, the PLS is significantly faster than ML (implemented using Hradil's algorithm [1, 2]). While an exhaustive comparative analysis of the computational complexity is beyond the scope of the paper, we can get a rough idea by noting that both PLS and ML involve computing a sum of matrices over all outcomes (the LS estimator in the case of PLS and the 'R' operator in the case of ML). For both estimators, this constitutes the most time-consuming subroutine. However, this is done only once in the case of PLS (followed by the projection which is relatively fast), while the ML needs to iterate the step a large number of times to achieve convergence to the maximum. Another advantage of PLS is that the LS part could in principle be computed online while the data is gathered, thus reducing the extra time to that of computing the final projection step.

5.2. PLS versus CS

CS is a natural benchmark for low-rank tomography. The papers [11, 12] apply to Pauli observables, and they show that a random choice of m ⩾ Crdlog⁶(d) Pauli observables is sufficient to reconstruct any rank-r state. The actual reconstruction is performed by solving a convex optimization problem, e.g. the least-squares fit over the set of quantum states [37, 38]. Numerical studies from [13] suggest that m = 256 is appropriate for d = 2⁵ and r = 1. Figure 2 shows that PLS consistently outperforms the CS estimator in this regime. Importantly, PLS was also much faster to evaluate than both, ML and CS.

**Figure 2.** *Comparison between PLS (blue) and Compressed Sensing (red)*: 5-qubit Pauli observables and a (random) pure target state. CS was implemented with m = 256 randomly selected Pauli observables.
Download figure:
Standard image High-resolution image

5.3. PLS for mutually unbiased bases

Maximal sets of mutually unbiased bases (MUBs) form a structured POVM (two-design) that lends itself to numerical investigation. Efficient algebraic constructions of MUBs exist for dimensions d = p^k where p is a prime number and k is an arbitrary integer [39–41]. To further underline the implicit advantage of low-rank we fix a prime dimension d and choose a pure state uniformly from the Haar measure on the complex unit sphere in d dimensions. We compute the outcome probabilities for each of the d + 1 different MUB measurements. We then sample outcomes from each distribution a total of $s=\frac{n}{d+1}$ times and compute the estimator ${\hat{\rho }}_{n}$ associated with the total frequency statistics. Figure 3 shows the relation between reconstruction error (in trace distance) and the number of samples per basis on a log–log -scale for a range of different prime dimensions between d = 100 and d = 200. The significant overlap between the graphs for different dimensions suggests that the rate of convergence scales as d/n in terms of total sample size n = s(d + 1) and dimension d; if true, this would mean that the additional log(d)-factor in the norm-one upper bound resulting from theorem 1 may not be necessary, similarly to the upper bound in theorem 2 for uniform measurements. Whether this is the case remains an open question.

**Figure 3.** *PLS convergence with MUB measurements*: (d + 1) different MUB basis measurements are used to estimate a pure target state. Error decay in the number of samples per basis s seems to be almost independent of the ambient problem dimension d (different colors). *Inset*: ordinary plot of the same data.
Download figure:
Standard image High-resolution image

6. Conclusion and outlook

Linear inversion is one of the oldest and simplest approaches to solve the practically important task of quantum state tomography. In this work, we focused on related method called projected least squares (PLS) that projects the least-squares estimator onto the set of all quantum states. Not only is this estimator numerically cheap, but it comes with strong, non-asymptotic convergence guarantees. These results are derived using concentration inequalities for sums of random matrices, and they exploit the randomness inherent in quantum experiments.

We showed that PLS is competitive, both in theory and in practice. For a variety of measurements, the results match the best existing theoretical results for the sampling rate of other tomography methods. In particular, for the uniform POVM, an order of $\frac{{r}^{2}d}{{{\epsilon}}^{2}}$ samples suffice to reconstruct any rank-r state up to accuracy in trace distance. This result also saturates existing lower bounds [20] on the minimal sampling rate required for any tomographic procedure with independent measurements. Numerical studies underline these competitive features.

6.1. Outlook

Corollary 1 is not (yet) optimal. Bootstrapping could be used to obtain tighter confidence regions, and the low computational cost of PLS may speed up this process considerably. It also seems fruitful to combine the ideas presented here with recent insights from [42]. The proof of theorem 1 indicates that PLS is stable with respect to time-dependent state generation (drift). Moreover, the general principle of PLS can be extended to the related problem of process tomography. We intend to address these points in future work.

Acknowledgments

The authors thank Philippe Faist, Matthias Kleinmann, Anirudh Acharya and Theodore Kypraios for fruitful discussions. Martin Kliesch provided very helpful comments regarding an earlier version of the draft. RK and JT are supported by ONR Award No. N00014-17-12146. RK also acknowledges funding provided by the Institute of Quantum Information and Matter, an NSF Physics Frontiers Center (NSF Grant PHY-1733907).

Appendix A.: Organisation of appendices

At the heart of this work is projected least squares (PLS)—a simple point estimator for quantum state tomography from informationally complete measurements $\left\{{M}_{1},\dots ,\;{M}_{m}\right\}\subset {\mathbb{H}}_{d}$ . PLS is a three-step procedure:

(a)
Estimate outcome probabilities by frequencies.
(b)
Construct the least squares (linear inversion) estimator:
$\begin{equation}{\hat{L}}_{n}=\underset{X\in {\mathbb{H}}_{d}}{\text{argmin}}\quad \sum _{i=1}^{m}{\left({f}_{i}-\mathrm{t}\mathrm{r}\left({M}_{i}X\right)\right)}^{2}.\end{equation} \tag{ A.1 }$
(c)
Project onto the set of all quantum states:

$\begin{equation}{\hat{\rho }}_{n}=\underset{\sigma \mathrm{i}\mathrm{s}\mathrm{a}\mathrm{q}\mathrm{u}\mathrm{a}\mathrm{n}\mathrm{t}\mathrm{u}\mathrm{m}\mathrm{s}\mathrm{t}\mathrm{a}\mathrm{t}\mathrm{e}}{\text{argmin}}{{\Vert}{\hat{L}}_{n}-\sigma {\Vert}}_{2}.\end{equation} \tag{ A.2 }$

We analyze the performance of PLS for a variety of concrete measurement scenarios: structured POVMs, Pauli observables, Pauli basis measurements and the uniform POVM. For each of them, ${\hat{\rho }}_{n}$ may be equipped with rigorous non-asymptotic confidence regions in trace distance. In this appendix, we complement the rather succinct presentation in the main text with additional explanations, motivations and more detailed arguments.

Outline: in appendix B we provide explicit least squares solutions (A.1) for the different measurements. We also review essential features and properties of the individual scenarios to provide context.

Appendix C contains the main conceptual insight of this work: least squares estimators may be interpreted as sums of independent random matrices—the randomness is due to the fundamental laws of quantum mechanics (Born's rule). This allows us to apply strong matrix-valued concentration inequalities to show that, with high probability, ${\hat{L}}_{n}$ is close to the true target state in operator norm.

Appendix D is devoted to showing that closeness of ${\hat{L}}_{n}$ in operator norm implies closeness of ${\hat{\rho }}_{n}$ in trace norm.

We combine these two insights in appendix E to arrive at the main result of this work: convergence guarantees for the PLS estimator in trace norm. The result derived there is a strict generalization of the main result quoted in the main text. It extends to the notion of effective rank which may be beneficial in concrete applications. We illustrate this potential benefit with a caricature of a faulty state preparation apparatus.

In appendix F yet stronger convergence guarantees for the uniform POVM are derived. The proof technique is completely different and we believe that it may be of independent interest to the community.

Appendix B.: Closed-form expressions for least squares estimators

As outlined in the main text, any POVM measurement can be viewed as a linear map $\mathcal{M}:{\mathbb{H}}_{d}\to {\mathbb{R}}^{m}$ , defined component-wise as ${\left[\mathcal{M}\left(X\right)\right]}_{i}=\mathrm{t}\mathrm{r}\left({M}_{i}X\right)$ for $i\in \left[m\right]$ . We will focus our attention on injective, or tomographically complete, measurements. In this case, the least squares estimator (A.1) admits a unique solution:

$\begin{equation*}{\hat{L}}_{n}={\left({\mathcal{M}}^{{\dagger}}\mathcal{M}\right)}^{-1}{\mathcal{M}}^{{\dagger}}\left({f}_{n}\right),\end{equation*}$

where ${f}_{n}\in {\mathbb{R}}^{m}$ subsumes the individual frequency estimates. In this section, we evaluate this formula explicitly for different types of prominent measurements.

B.1. The uniform POVM and two-designs

The uniform/covariant POVM in d-dimensions corresponds to the union of all (properly re-normalized) rank-one projectors: ${\left\{d\vert v\rangle \langle v\vert \mathrm{d}v\right\}}_{v\in \mathbb{S}}^{d}$ . Here, dv denotes the unique, unitarily invariant, measure on the complex unit sphere induced by the Haar measure (over the unitary group U(d)). Its high degree of symmetry allows for analyzing this POVM by means of powerful tools from representation theory. This is widely known, see e.g. [43, 44], but we include a short presentation here to be self-contained. Define the frame operator of order k: ${F}_{\left(k\right)}={\int }_{{\mathbb{S}}^{d}}{\left(\vert v\rangle \langle v\vert \right)}^{\otimes k}\mathrm{d}v\in {\mathbb{H}}_{d}^{\otimes k}$ . Unitary invariance of dv implies that this frame operator commutes with every k-fold tensor product of a unitary matrix U ∈ U(d):

$\begin{equation*}{U}^{\otimes k}{F}_{\left(k\right)}={\int }_{{\mathbb{S}}^{d}}{\left(U\vert v\rangle \langle v\vert \right)}^{\otimes k}\mathrm{d}v={\int }_{{\mathbb{S}}^{d}}{\left(\vert \tilde{v}\rangle \langle \tilde{v}\vert U\right)}^{\otimes k}\mathrm{d}\tilde{v}={F}_{\left(k\right)}{U}^{\otimes k}.\end{equation*}$

Here, we have used a change of variables (v ~ = U v) together with the fact that dv is unitarily invariant (dv ~ = dv). Schur's lemma—one of the most fundamental tools in representation theory—states that any matrix that commutes with every element of a given group representation must be proportional to a sum of the projectors onto the associated irreducible representations (irreps). For the task at hand, the representation of interest is the diagonal representation of the unitary group: U ↦ U^⊗k for all U ∈ U(d). This representation affords, in general, many irreps that may be characterized using Schur–Weyl duality. The symmetric subspace ${\mathrm{S}\mathrm{y}\mathrm{m}}^{\left(k\right)}\subset {\left({\mathbb{C}}^{d}\right)}^{\otimes k}$ is one of them and corresponds to the subspace of all vectors that are invariant under permuting tensor factors. Crucially, F_(k) is an average over rank-one projectors onto vectors |v⟩^⊗k ∈ Sym^(k) and, therefore, its range must be contained entirely within Sym^(k). Combining this with the assertion of Schur's lemma then yields

$\begin{equation}{F}_{\left(k\right)}={\int }_{{\mathbb{S}}^{d}}{\left(\vert v\rangle \langle v\vert \right)}^{\otimes k}\mathrm{d}v={\left(\genfrac{}{}{0.0pt}{}{d+k-1}{k}\right)}^{-1}{P}_{{\mathrm{S}\mathrm{y}\mathrm{m}}^{\left(k\right)}}\quad k\in \mathbb{N},\end{equation} \tag{ B.1 }$

The pre-factor ${\left(\genfrac{}{}{0.0pt}{}{d+k-1}{k}\right)}^{-1}=\mathrm{dim}{\left({\mathrm{S}\mathrm{y}\mathrm{m}}^{\left(k\right)}\right)}^{-1}$ follows from the fact that F_(k) has unit trace.

This closed-form expression is very useful. In particular, it implies that the uniform POVM ${\left\{d\vert v\rangle \langle v\vert \right\}}_{v\in {\mathbb{S}}^{d}}$ is almost an isometry. Fix $X\in {\mathbb{H}}_{d}$ and compute

$\begin{align}\hfill \left(d+1\right){\int }_{{\mathbb{S}}^{d}}d\langle v\vert X\vert v\rangle \vert v\rangle \langle v\vert \mathrm{d}v=& \left(d+1\right)d{\mathrm{t}\mathrm{r}}_{1}\left({\int }_{{\mathbb{S}}^{d}}{\left(\vert v\rangle \langle v\vert \right)}^{\otimes 2}\mathrm{d}v\left(X\otimes \mathbb{I}\right)\right)\hfill \\ \hfill =& 2{\mathrm{t}\mathrm{r}}_{1}\left({P}_{{\mathrm{S}\mathrm{y}\mathrm{m}}^{\left(2\right)}}\left(X\otimes \mathbb{I}\right)\right),\hfill \end{align} \tag{ B.2 }$

where tr₁(A ⊗ B) = tr(A)B denotes the partial trace over the first tensor factor. The projector onto the totally symmetric subspace of two parties has an explicit representation: ${P}_{{\mathrm{S}\mathrm{y}\mathrm{m}}^{\left(2\right)}}=\frac{1}{2}\left(\mathbb{I}+\mathbb{F}\right)$ , where $\mathbb{F}$ denotes the flip operator, i.e. $\mathbb{F}\vert x\rangle \otimes \vert y\rangle =\vert y\rangle \otimes \vert x\rangle$ for all $\vert x\rangle ,\vert y\rangle \in {\mathbb{C}}^{d}$ and extend it linearly to the entire tensor product. Inserting this explicit characterization into equation (B.2) yields

$\begin{equation}\left(d+1\right){\int }_{{\mathbb{S}}^{d}}d\langle v\vert X\vert v\rangle \vert v\rangle \langle v\vert \mathrm{d}v={\mathrm{t}\mathrm{r}}_{1}\left(\left(\mathbb{I}+\mathbb{F}\right)\left(X\otimes \mathbb{I}\right)\right)=X+\mathrm{t}\mathrm{r}\left(X\right)\mathbb{I}.\end{equation} \tag{ B.3 }$

We emphasize that the full symmetry of the uniform POVM is not required to derive this formula: equation (B.1) for k = 2 is sufficient. This motivates the following definition:

Definition 1 (Two-design). A (finite) set of m rank-one projectors ${\left\{\vert {v}_{i}\rangle \langle {v}_{i}\vert \right\}}_{i=1}^{m}$ is called a (complex-projective) two-design if

$\begin{equation*}\frac{1}{m}\sum _{i=1}^{m}{\left(\vert {v}_{i}\rangle \langle {v}_{i}\vert \right)}^{\otimes 2}={\left(\genfrac{}{}{0.0pt}{}{d+1}{2}\right)}^{-1}{P}_{{\mathrm{S}\mathrm{y}\mathrm{m}}^{\left(2\right)}}.\end{equation*}$

Each two-design is proportional to a POVM $\mathcal{M}={\left\{\frac{d}{m}\vert {v}_{i}\rangle \langle {v}_{i}\vert \right\}}_{i=1}^{m}$ . Clearly, each element is positive semidefinite and $\frac{d}{m}{\sum }_{i=1}^{m}\vert {v}_{i}\rangle \langle {v}_{i}\vert =\mathbb{I}$ follows from taking the partial trace of the defining property. Viewed as maps $\mathcal{M}:{\mathbb{H}}_{d}\to {\mathbb{R}}^{m}$ , these POVMs obey

$\begin{equation*}{\mathcal{M}}^{{\dagger}}\mathcal{M}\left(X\right)=\frac{{d}^{2}}{{m}^{2}}\sum _{i=1}^{m}\langle {v}_{i}\vert X\vert {v}_{i}\rangle \vert {v}_{i}\rangle \langle {v}_{i}\vert =\frac{d\left(X+\mathrm{t}\mathrm{r}\left(X\right)\mathbb{I}\right)}{\left(d+1\right)m}\quad \mathrm{f}\mathrm{o}\mathrm{r}X\in {\mathbb{H}}_{d}.\end{equation*}$

This can be readily inverted

$\begin{equation}{\left({\mathcal{M}}^{{\dagger}}\mathcal{M}\right)}^{-1}\left(X\right)=\frac{m}{d}\left(\left(d+1\right)X-\mathrm{t}\mathrm{r}\left(X\right)\mathbb{I}\right).\end{equation} \tag{ B.4 }$

and the least-squares estimator becomes

$\begin{align}\hfill {\hat{L}}_{n}=& {\left({\mathcal{M}}^{{\dagger}}\mathcal{M}\right)}^{-1}\left({\mathcal{M}}^{{\dagger}}\left(f\right)\right)={\left({\mathcal{M}}^{{\dagger}}\mathcal{M}\right)}^{-1}\left(\frac{d}{m}\sum _{i=1}^{m}{f}_{i}\vert {v}_{i}\rangle \langle {v}_{i}\vert \right)\hfill \\ \hfill =& \sum _{i=1}^{m}{f}_{i}\left(\left(d+1\right)\vert {v}_{i}\rangle \langle {v}_{i}\vert -\mathrm{t}\mathrm{r}\left(\vert {v}_{i}\rangle \langle {v}_{i}\vert \right)\mathbb{I}\right)=\left(d+1\right)\sum _{i=1}^{m}{f}_{i}\vert {v}_{i}\rangle \langle {v}_{i}\vert -\mathbb{I}\hfill \end{align} \tag{ B.5 }$

for any frequency vector ${f}_{n}\in {\mathbb{R}}^{m}$ . Mathematically, this is a consequence of the fact that two-design POVMs 'almost' form a tight frame on ${\mathbb{H}}_{d}$ . The close connection to well-behaved, tomographically complete, rank-one POVMs has spurred considerable interest in the identification of two-designs. Over the past decades, the following concrete examples have been identified:

(a)
Equiangular lines (SIC POVMs): a family of m unit vectors $\vert {v}_{1}\rangle ,\dots ,\;\vert {v}_{m}\rangle \in {\mathbb{S}}^{d}$ is equiangular, if ${\left\vert \langle {v}_{i},{v}_{j}\rangle \right\vert }^{2}$ is constant for all i ≠ j. The minimal cardinality of such a set is m = d² in which case the angle must be fixed: ${\left\vert \langle {v}_{i},{v}_{j}\rangle \right\vert }^{2}=\frac{1}{d+1}$ . Such maximal sets of equiangular lines are known to form two-designs [32] and have been termed symmetric, informationally complete (SIC) POVMs. This nomenclature underlines the importance of equation (B.3) for the original quantum motivation of the study of equiangular lines. While several explicit constructions of SIC POVMs exist, the general question of their existence remains an intriguing open problem.
(b)
Mutually unbiased bases (MUBs): two orthonormal bases ${\left\{\vert {b}_{i}\rangle \right\}}_{i=1}^{d}$ and ${\left\{\vert {c}_{i}\rangle \right\}}_{i=1}^{d}$ of ${\mathbb{C}}^{d}$ are mutually unbiased if ${\left\vert \langle {b}_{i},{c}_{j}\rangle \right\vert }^{2}=\frac{1}{d}$ for all 1 ⩽ i, j ⩽ d. The study of such mutually unbiased bases (MUBs) has a rich history in quantum mechanics that dates back to Schwinger [45]. It is known that at most (d + 1) pairwise mutually unbiased bases can exist in dimension d and explicit algebraic constructions are known for prime power dimensions (d = p^k). Klappenecker and Roettler [33] showed that maximal sets of MUBs are guaranteed to form two-designs.
(c)
Stabilizer states (STABs): the stabilizer formalism is one of the cornerstones of quantum computation, fault tolerance and error correction, see e.g. [46]. Let ${\mathcal{P}}_{k}$ be the Pauli group on k qubits (d = 2^k), i.e. the group generated by k-fold tensor products of the elementary Pauli matrices, $\mathbb{I}$ and $i\mathbb{I}$ . It is then possible to find maximal abelian subgroups $\mathcal{S}\subset {\mathcal{P}}_{k}$ of size d = 2^k. Since all matrices $W\in \mathcal{S}$ commute, they can be simultaneously diagonalized and determine a single unit vector which is the joint eigenvector with eigenvalue +1 of all the matrices in $\mathcal{S}$ (provided that $-\mathbb{I}\notin \mathcal{S}$ ). Such vectors are called stabilizer states (STAB) and the group $\mathcal{S}\subset {\mathcal{P}}_{k}$ is its associated stabilizer group. A total of $m={2}^{k}{\prod }_{i=0}^{k}\left({d}^{i}+1\right)={2}^{\frac{1}{2}{k}^{2}+o\left(k\right)}$ different stabilizer states can be generated this way. The union of all of them is actually known to form a three-design [47–49] and, therefore, also a two-design. The latter is also a consequence of earlier results [34, 35]

B.2. Pauli observables

For d = 2^k, the Pauli matrices ${W}_{1},\dots ,\;{W}_{{d}^{2}}\in {\mathbb{H}}_{d}$ arise from all possible k-fold tensor products of elementary Pauli matrices $\left\{\mathbb{I},{\sigma }_{x},{\sigma }_{y},{\sigma }_{z}\right\}\subset {\mathbb{H}}_{2}$ . They are well-known to form a unitary operator basis:

$\begin{equation}X=\frac{1}{d}\sum _{i=1}^{{d}^{2}}\mathrm{t}\mathrm{r}\left({W}_{i}X\right){W}_{i},\end{equation} \tag{ B.6 }$

for all $X\in {\mathbb{H}}_{d}$ . While they do constitute observables, Pauli matrices by themselves are not POVMs. However, every observable W_i may be associated with a two-outcome POVM ${\mathcal{M}}_{i}=\left\{{P}_{i}^{{\pm}}\right\}=\left\{\frac{1}{2}\left(\mathbb{I}{\pm}{W}_{i}\right)\right\}$ . The union ${\bigcup }_{i=1}^{{d}^{2}}{\mathcal{M}}_{i}$ of all these two-outcome POVMs constitutes a linear map $\mathcal{M}:{\mathbb{H}}_{d}\to {\mathbb{R}}^{2\;\mathrm{m}}$ that obeys

$\begin{equation*}{\mathcal{M}}^{{\dagger}}\mathcal{M}\left(X\right)=\sum _{i=1}^{{d}^{2}}\sum _{o={\pm}}\mathrm{t}\mathrm{r}\left({P}_{i}^{o}X\right){P}_{i}^{o}=\sum _{i=1}^{{d}^{2}}\frac{1}{2}\left(\mathrm{t}\mathrm{r}\left(X\right)\mathbb{I}+\mathrm{t}\mathrm{r}\left({W}_{i}X\right){W}_{i}\right)=\frac{1}{2}\left({d}^{2}\mathrm{t}\mathrm{r}\left(X\right)\mathbb{I}+X\right),\end{equation*}$

where the equality follows from equation (B.6). Once more, this expression can be readily inverted:

$\begin{equation*}{\left({\mathcal{M}}^{{\dagger}}\mathcal{M}\right)}^{-1}\left(X\right)=\frac{2}{d}X-\frac{2\mathrm{t}\mathrm{r}\left(X\right)}{{d}^{2}+1}\mathbb{I}.\end{equation*}$

Before we continue, we note that one Pauli matrix is equal to the identity, say ${W}_{1}=\mathbb{I}$ , and the associated POVM is trivial. Hence, we suppose that n copies of ρ are distributed equally among all d² − 1 non-trivial two-Outcome POVMs ${\mathcal{M}}_{i}$ . We denote the resulting frequencies by ${\left[f\right]}_{i}^{{\pm}}$ and suppress the dependence on the number of samples. Then, the explicit solution to the least squares problem becomes

$\begin{align*}\hfill {\hat{L}}_{n}=& {\left({\mathcal{M}}^{{\dagger}}\mathcal{M}\right)}^{-1}\left({\mathcal{M}}^{{\dagger}}\left({f}_{n}\right)\right)={\left({\mathcal{M}}^{{\dagger}}\mathcal{M}\right)}^{-1}\left(\mathbb{I}\right)+\sum _{i=2}^{{d}^{2}}\sum _{o={\pm}}{\left[f\right]}_{i}^{o}{\left({\mathcal{M}}^{{\dagger}}\mathcal{M}\right)}^{-1}\left({P}_{i}^{o}\right)\hfill \\ \hfill =& \frac{2}{d\left({d}^{2}+1\right)}\mathbb{I}+\sum _{i=2}^{{d}^{2}}{\left[f\right]}_{i}^{+}\left(\frac{1}{d}\left(\mathbb{I}+{W}_{i}\right)-\frac{d}{{d}^{2}+1}\mathbb{I}\right)+\sum _{i=2}^{{d}^{2}}{\left[f\right]}_{i}^{-}\left(\frac{1}{d}\left(\mathbb{I}-{W}_{i}\right)-\frac{d}{{d}^{2}+1}\right)\hfill \\ \hfill =& \frac{1}{d}\sum _{i=2}^{{d}^{2}}\left({\left[f\right]}_{i}^{+}-{\left[f\right]}_{i}^{-}\right){W}_{i}+\frac{2+{\sum }_{i=2}^{{d}^{2}}\left({\left[f\right]}_{i}^{+}+{\left[f\right]}_{i}^{-}\right)}{d\left({d}^{2}+1\right)}\mathbb{I}.\hfill \end{align*}$

We can simplify this expression further by noticing that each two-outcome POVM is dichotomic: either + or − is observed for every run. This implies ${\left[f\right]}_{i}^{+}+{\left[f\right]}_{i}^{-}=1$ and, by extension, ${\sum }_{i=2}^{{d}^{2}}\left({\left[{f}_{i}\right]}_{i}^{+}+{\left[{f}_{i}\right]}_{i}^{-}\right)={d}^{2}-1$ . Hence,

$\begin{equation}{\hat{L}}_{n}=\frac{1}{d}\sum _{i=2}^{{d}^{2}}\left({\left[f\right]}_{i}^{+}-{\left[f\right]}_{i}^{-}\right){W}_{i}+\frac{1}{d}\mathbb{I}=\frac{1}{d}\sum _{i=1}^{{d}^{2}}\left({\left[f\right]}_{i}^{+}-{\left[f\right]}_{i}^{-}\right){W}_{i},\end{equation} \tag{ B.7 }$

because ${\left[f\right]}_{1}^{+}=1$ . This is the formula from the main text and has a compelling interpretation: the difference ${\hat{\mu }}_{i}={\left[{f}_{i}\right]}_{i}^{+}-{\left[{f}_{i}\right]}_{i}^{-}$ is an empirical estimate for the expectation value ${\mu }_{i}=\mathrm{t}\mathrm{r}\left({W}_{i}X\right)$ of the ith Pauli observable. Finally, note that this estimator is again unbiased with respect to random fluctuations in the sample statistics:

$\begin{equation}\mathbb{E}\left[{\hat{L}}_{n}\right]=\frac{1}{d}\sum _{i=1}^{d}\mathrm{t}\mathrm{r}\left({W}_{i}\rho \right){W}_{i}=\rho .\end{equation} \tag{ B.8 }$

B.3. Pauli basis measurements

Before considering the general case, we find it instructive to consider the single qubit case in more detail. For now, fix d = 2 and note that there are three non-trivial Pauli matrices σ_s with $s\in \left\{x,y,z\right\}$ . We may associate each σ_s with a two-outcome POVM that is also a basis measurement: $\frac{1}{2}\left(\mathbb{I}{\pm}{\sigma }_{s}\right)=\vert {b}_{{\pm}}^{\left(s\right)}\rangle \langle {b}_{{\pm}}^{\left(s\right)}\vert$ . For s ≠ s'

$\begin{equation*}{\left\vert \langle {b}_{{\pm}}^{\left(s\right)},{b}_{{o}^{\prime }}^{\left({\pm}\right)}\rangle \right\vert }^{2}=\frac{1}{4}\mathrm{t}\mathrm{r}\left(\left(\mathbb{I}{\pm}{\sigma }_{s}\right)\left(\mathbb{I}{\pm}{\sigma }_{{s}^{\prime }}\right)\right)=\frac{1}{2},\end{equation*}$

because σ_s, σ_s' and σ_s σ_s' have vanishing trace. This implies that the six vectors $\vert {b}_{o}^{\left(s\right)}\rangle \langle {b}_{o}^{\left(s\right)}\vert$ with $o\in \left\{{\pm}1\right\}$ form a maximal set of 3 = (d + 1) mutually unbiased bases. Such vector sets form spherical two-designs and equation (B.3) ensures for any $X\in {\mathbb{H}}_{d}$

$\begin{equation}\sum _{s,o}\langle {b}_{o}^{\left(s\right)}\vert X\vert {b}_{o}^{\left(s\right)}\rangle \vert {b}_{o}^{\left(s\right)}\rangle \langle {b}_{o}^{\left(s\right)}\vert =X+\mathrm{t}\mathrm{r}\left(X\right)\mathbb{I}=3\left(\frac{1}{3}X+\frac{2}{3}\frac{\mathrm{t}\mathrm{r}\left(X\right)}{2}\mathbb{I}\right)=3{\mathcal{D}}_{1/3}\left(X\right).\end{equation} \tag{ B.9 }$

Here, ${\mathcal{D}}_{1/3}\left(X\right)=\frac{1}{3}X+\left(1-\frac{1}{3}\right)\frac{\mathrm{t}\mathrm{r}\left(X\right)}{2}\mathbb{I}$ denotes a single-qubit depolarizing channel with loss parameter $p=\frac{1}{3}$ .

This behavior extends to multi-qubit systems, i.e. d = 2^k. Suppose that we perform k local (single-qubit) Pauli measurements on a k-qubit state $\rho \in {\mathbb{H}}_{d}$ . Then, there are a total of k potential combinations that we label by a string $\to {s}=\left({s}_{1},\dots ,\;{s}_{k}\right)\in {\left\{x,y,z\right\}}^{k}$ . Each of them corresponds to a POVM ${\mathcal{M}}^{\left(\to {s}\right)}$ with 2^k = d outcomes that we label by $\to {o}=\left({o}_{1},\dots ,\;{o}_{k}\right)\in {\left\{{\pm}1\right\}}^{k}$ . The POVM element associated with index $\to {s}$ and outcome $\to {o}$ has an appealing tensor-product structure: $\vert {b}_{\to {o}}^{\left(\to {s}\right)}\rangle \langle {b}_{\to {o}}^{\left(\to {s}\right)}\vert ={\otimes }_{i=1}^{k}\vert {b}_{{o}_{i}}^{\left({s}_{i}\right)}\rangle \langle {b}_{{o}_{i}}^{\left({s}_{i}\right)}\vert$ . Let $\mathcal{M}={\bigcup }_{\to {s}}{\mathcal{M}}^{\left(\to {s}\right)}:{\mathbb{H}}_{d}\to {\mathbb{R}}^{{3}^{k}}{\times}{\mathbb{R}}^{{2}^{k}}$ denote the union of all such basis measurements. Then, the following formula is true for tensor product matrices $X={\otimes }_{i=1}^{k}{X}_{i}$ and ${X}_{i}\in {\mathbb{H}}_{2}$ :

$\begin{align}\hfill {\mathcal{M}}^{{\dagger}}\mathcal{M}\left(X\right)& =\sum _{\to {s},\to {o}}\langle {b}_{\to {o}}^{\left(\to {s}\right)}\vert X\vert {b}_{\to {o}}^{\left(\to {s}\right)}\rangle \vert {b}_{\to {o}}^{\left(\to {s}\right)}\rangle \langle {b}_{\to {o}}^{\left(\to {s}\right)}\vert =\underset{i=1}{\overset{k}{\otimes }}\left(\sum _{{s}_{i},{o}_{i}}\langle {b}_{{o}_{i}}^{\left({s}_{i}\right)}\vert {X}_{i}\vert {b}_{{o}_{i}}^{\left({s}_{i}\right)}\rangle \vert {b}_{{o}_{i}}^{\left({s}_{i}\right)}\rangle \langle {b}_{{o}_{i}}^{\left({s}_{i}\right)}\vert \right)\hfill \\ \hfill & ={3}^{k}\underset{i=1}{\overset{k}{\otimes }}\mathcal{D}\left({X}_{i}\right)={3}^{k}{\mathcal{D}}_{1/3}^{\otimes k}\left(X\right),\hfill \end{align} \tag{ B.10 }$

where we have used equation (B.9). Linear extension ensures that this formula remains valid for arbitrary matrices $X\in {\mathbb{H}}_{d}$ . Since the single qubit depolarizing channel is invertible, the same is true for its k-fold tensor product and we conclude

$\begin{equation*}{\left({\mathcal{M}}^{{\dagger}}\mathcal{M}\right)}^{-1}\left(X\right)=\frac{1}{{3}^{k}}{\left({\mathcal{D}}_{1/3}^{\otimes k}\right)}^{-1}\left(X\right).\end{equation*}$

Inserting this explicit expression into the closed-form expression for the least squares estimator yields

$\begin{equation*}{\hat{L}}_{n}={\left({\mathcal{M}}^{{\dagger}}\mathcal{M}\right)}^{-1}\left({\mathcal{M}}^{{\dagger}}\left(\to {f}\right)\right)=\frac{1}{{3}^{k}}\sum _{\to {s},\to {o}}{\left[f\right]}_{\to {o}}^{\left(\to {s}\right)}{\left({\mathcal{D}}_{1/3}^{\otimes k}\right)}^{-1}\left(\vert {b}_{\to {o}}^{\left(\to {s}\right)}\rangle \langle {b}_{\to {o}}^{\left(\to {s}\right)}\vert \right),\end{equation*}$

as advertised in the main text. Here, ${\left[f\right]}_{\to {o}}^{\left(\to {s}\right)}$ is assumed to be a frequency approximation to ${p}_{o}^{\to {\left(s\right)}}=\langle {b}_{\to {o}}^{\left(\to {s}\right)}\vert \rho \vert {b}_{\to {o}}^{\left(\to {s}\right)}\rangle$ .

We conclude this section with a single-qubit observations that allows for characterizing this expression in a more explicit fashion. Note that one may rewrite ${\mathcal{D}}_{1/3}\left(X\right)$ as $\frac{\mathrm{t}\mathrm{r}\left(X\right)}{2}\mathbb{I}+\frac{1}{6}{\sum }_{s}\mathrm{t}\mathrm{r}\left({\sigma }_{s}X\right){\sigma }_{s}$ . This facilitates the computation of the single-qubit inverse:

$\begin{equation*}{\mathcal{D}}_{1/3}^{-1}\left(X\right)=\frac{\mathrm{t}\mathrm{r}\left(X\right)}{2}\mathbb{I}+\frac{3}{2}\sum _{s}\mathrm{t}\mathrm{r}\left({\sigma }_{s}X\right){\sigma }_{s}\end{equation*}$

and, in particular

$\begin{equation*}{\mathcal{D}}_{1/3}^{-1}\left(\vert {b}_{{\pm}}^{\left(s\right)}\rangle \langle {b}_{{\pm}}^{\left(s\right)}\vert \right)=\frac{1}{2}\left({\mathcal{D}}^{-1}\left(\mathbb{I}\right){\pm}{\mathcal{D}}^{-1}\left({\sigma }_{s}\right)\right)=\frac{1}{2}\left(\mathbb{I}{\pm}3{\sigma }_{s}\right)=3\vert {b}_{{\pm}}^{\left(s\right)}\rangle \langle {b}_{{\pm}}^{\left(s\right)}\vert -\mathbb{I}.\end{equation*}$

This ensures

$\begin{equation}{\hat{L}}_{n}=\frac{1}{{3}^{k}}\sum _{\to {s},\to {o}}{\left[f\right]}_{\to {o}}^{\left(\to {s}\right)}\underset{i=1}{\overset{k}{\otimes }}\left(3\vert {b}_{{o}_{i}}^{\left({s}_{i}\right)}\rangle \langle {b}_{{o}_{i}}^{\left({s}_{i}\right)}\vert -\mathbb{I}\right).\end{equation} \tag{ B.11 }$

which, again, is an unbiased estimator with respect to random fluctuations in the sample statistics.

Finally, we also point out another consequence that will be important later on:

$\begin{equation}{\left({\mathcal{D}}_{1/3}^{-1}\left(\vert {b}_{o}^{\left(s\right)}\rangle \langle {b}_{o}^{\left(s\right)}\vert \right)\right)}^{2}=5{\mathcal{D}}_{3/5}\left(\vert {b}_{o}^{\left(s\right)}\rangle \langle {b}_{o}^{\left(s\right)}\vert \right),\end{equation} \tag{ B.12 }$

where ${\mathcal{D}}_{3/5}$ is another single-qubit depolarizing channel.

Appendix C. : The matrix Bernstein inequality and concentration in operator norm

Scalar concentration inequalities provide sharp bounds on the probability of a sum of independent random variables deviating from their mean. Classical examples include Hoeffding's, Chernoff's and Bernstein's inequality—all of which have found widespread use in a variety of scientific disciplines. The main results of this work are based on a matrix generalizations of these classical statements—in particular the matrix Bernstein inequality developed by one of the authors, see [31, theorem 1.4].

Theorem 3 (Matrix Bernstein inequality). Consider a sequence of n independent, Hermitian random matrices ${A}_{1},\dots ,\;{A}_{n}\in {\mathbb{H}}_{d}$ . Assume that each A_i satisfies

$\begin{equation*}\mathbb{E}\left[{A}_{i}\right]=0\quad \mathrm{a}\mathrm{n}\mathrm{d}\quad {{\Vert}{A}_{i}{\Vert}}_{\infty }{\leqslant}R\;\mathrm{a}\mathrm{l}\mathrm{m}\mathrm{o}\mathrm{s}\mathrm{t}\mathrm{s}\mathrm{u}\mathrm{r}\mathrm{e}\mathrm{l}\mathrm{y}.\end{equation*}$

Then, for any t > 0

$\begin{equation*}\mathrm{Pr}\left[{{\Vert}\sum _{i=1}^{n}\left({A}_{i}-\mathbb{E}\left[{A}_{i}\right]\right){\Vert}}_{\infty }{\geqslant}t\right]{\leqslant}\begin{cases}d\mathrm{exp}\left(-\frac{3{t}^{2}}{8{\sigma }^{2}}\right)\quad \hfill & t{\leqslant}\frac{{\sigma }^{2}}{R},\hfill \\ d\mathrm{exp}\left(-\frac{3t}{8R}\right)\quad \hfill & t{\geqslant}\frac{{\sigma }^{2}}{R},\hfill \end{cases}\end{equation*}$

where ${\sigma }^{2}={{\Vert}{\sum }_{i=1}^{n}\mathbb{E}\left[{A}_{i}^{2}\right]{\Vert}}_{\infty }$ .

First results of this kind originate in Banach space theory [50–53] and were later independently developed in quantum information theory [54, 55]. Further advances by Oliveira [56] and one of the authors [31] led to the result that we employ here. We refer to the monograph [31] for a detailed exposition of related work and history.

Similar to the scalar Bernstein inequality, the tail behavior in theorem 3 consists of two regimes. Small deviations are suppressed in a subgaussian fashion, while larger deviations follow a subexponential decay. The ratio $\frac{{\sigma }^{2}}{R}$ marks the transition from one regime into the other. We also note in passing that this result recovers the scalar Bernstein inequality for d = 1 ( ${\mathbb{H}}_{1}\simeq \mathbb{R}$ ).

C.1. Concentration for structured POVM measurements

For structured measurements (two-designs) we may rewrite the (plain) least squares estimator (B.5) as

$\begin{equation*}{\hat{L}}_{n}=\left(d+1\right)\sum _{i=1}^{m}{\left[{f}_{n}\right]}_{i}\vert {v}_{i}\rangle \langle {v}_{i}\vert -\mathbb{I}=\frac{1}{n}\sum _{i=1}^{n}{X}_{i},\end{equation*}$

where each X_i is an i.i.d. copy of the random matrix $X\in {\mathbb{H}}_{d}$ that assumes $\left(d+1\right)\vert {v}_{k}\rangle \langle {v}_{k}\vert -\mathbb{I}$ with probability $\frac{d}{m}\langle {v}_{k}\vert \rho \vert {v}_{k}\rangle$ for all $k\in \left[m\right]$ . Unbiasedness with respect to the sample statistics ensures $\mathbb{E}\left[{\hat{L}}_{n}\right]=\mathbb{E}\left[X\right]=\rho$ . Hence, ${\hat{L}}_{n}-\rho$ is a sum of i.i.d., centered random matrices $\frac{1}{n}\left({X}_{i}-\mathbb{E}\left[{X}_{i}\right]\right)$ . These obey

$\begin{equation*}\frac{1}{n}{{\Vert}{X}_{i}-\mathbb{E}\left[{X}_{i}\right]{\Vert}}_{\infty }=\frac{1}{n}{{\Vert}\left(d+1\right)\vert {v}_{k}\rangle \langle {v}_{k}\vert -\mathbb{I}-\rho {\Vert}}_{\infty }{\leqslant}\frac{d}{n}=:R,\end{equation*}$

where $k\in \left[m\right]$ is arbitrary. Next, note that the random matrix X obeys

$\begin{equation*}\mathbb{E}\left[{\left(X-\mathbb{E}\left[X\right]\right)}^{2}\right]=\mathbb{E}\left[{X}^{2}\right]-\mathbb{E}{\left[X\right]}^{2}=\mathbb{E}\left[{X}^{2}\right]-{\rho }^{2}\end{equation*}$

and also

$\begin{align*}\hfill \mathbb{E}\left[{X}^{2}\right]=& \sum _{k=1}^{m}\frac{d}{m}\langle {v}_{k}\vert \rho \vert {v}_{k}\rangle {\left(\left(d+1\right)\vert {v}_{k}\rangle \langle {v}_{k}\vert -\mathbb{I}\right)}^{2}=\frac{d\left({d}^{2}-1\right)}{m}\sum _{k=1}^{m}\langle {v}_{k}\vert \rho \vert {v}_{k}\rangle \vert {v}_{k}\rangle \langle {v}_{k}\vert +\mathbb{I}\hfill \\ \hfill =& \left(d-1\right)\left(\rho +\mathbb{I}\right)+\mathbb{I},\hfill \end{align*}$

according to equation (B.3). This allows us to bound the variance parameter:

$\begin{equation*}{{\Vert}\sum _{i=1}^{n}\mathbb{E}{\left[\frac{1}{n}\left(X-\mathbb{E}\left[X\right]\right)\right]}^{2}{\Vert}}_{\infty }=\frac{1}{n}{{\Vert}\left(d-1\right)\rho +d\mathbb{I}-{\rho }^{2}{\Vert}}_{\infty }{\leqslant}\frac{2d}{n}=:{\sigma }^{2}.\end{equation*}$

The ratio $\frac{{\sigma }^{2}}{R}=2$ indicates that any choice of $\tau \in \left[0,2\right]$ will fall into the subgaussian regime of the matrix Bernstein inequality and theorem 3 yields

$\begin{equation}\mathrm{Pr}\left[{{\Vert}{\hat{L}}_{n}-\rho {\Vert}}_{\infty }{\geqslant}\tau \right]=\mathrm{Pr}\left[{{\Vert}\frac{1}{n}\sum _{i=1}^{n}\left({X}_{i}-\mathbb{E}\left[{X}_{i}\right]\right){\Vert}}_{\infty }{\geqslant}\tau \right]{\leqslant}d{\mathrm{e}}^{-\frac{3{\tau }^{2}n}{16d}}.\end{equation} \tag{ C.1 }$

C.2. Concentration for (global) Pauli observables

We assume that the total number of samples n is distributed equally among the d² different Pauli measurements. Similar to before, unbiasedness (B.8) and the explicit characterization of the LI estimator (B.7) allow us to write

$\begin{equation*}{\hat{L}}_{n}-\rho =\sum _{k=1}^{{d}^{2}}\frac{1}{n}\sum _{i=1}^{n/{d}^{2}}\left({X}_{i}^{\left(k\right)}-\mathbb{E}\left[{X}_{i}^{\left(k\right)}\right]\right).\end{equation*}$

Here, each ${X}_{i}^{\left(k\right)}$ is an independent instance of the random matrix X^(k) = ±dW_k with probability $\frac{1}{2}\left(1{\pm}\mathrm{t}\mathrm{r}\left({W}_{k}\rho \right)\right)$ each. This is a sum of centered random matrices that are independent, but in general not identically distributed. However, independence alone suffices for applying theorem 3. We note in passing that this would not be the case for earlier (weaker) versions of the matrix Bernstein inequality. Bound

$\begin{equation*}\frac{1}{n}{{\Vert}{X}_{i}^{\left(k\right)}-\mathbb{E}\left[{X}_{i}^{\left(k\right)}\right]{\Vert}}_{\infty }=\frac{d}{n}{{\Vert}\left(1{\pm}\mathrm{t}\mathrm{r}\left({W}_{i}\rho \right)\right)\mathbb{I}{\Vert}}_{\infty }{\leqslant}\frac{2d}{n}=:R\end{equation*}$

and use $\mathbb{E}\left[{\left(X-\mathbb{E}\left[X\right]\right)}^{2}\right]{\leqslant}\mathbb{E}\left[{X}^{2}\right]$ (in the positive semidefinite order) to considerably simplify the variance computation:

$\begin{equation*}\frac{{d}^{2}}{n}{{\Vert}\sum _{k=1}^{{d}^{2}}\mathbb{E}\left[{\left({X}^{\left(k\right)}\right)}^{2}\right]{\Vert}}_{\infty }=\frac{1}{n}{{\Vert}\sum _{k=1}^{{d}^{2}}\mathbb{I}{\Vert}}_{\infty }=\frac{{d}^{2}}{n}=:{\sigma }^{2},\end{equation*}$

because ${\left({X}^{\left(k\right)}\right)}^{2}=\frac{1}{{d}^{2}}\mathbb{I}$ . Applying theorem 3 yields

$\begin{equation*}\mathrm{Pr}\left[{{\Vert}{\hat{L}}_{n}-\rho {\Vert}}_{\infty }{\geqslant}\tau \right]{\leqslant}d{\mathrm{e}}^{-\frac{3{\tau }^{2}\tilde{n}}{8}}\quad \tau \in \left[0,d/2\right].\end{equation*}$

C.3. Concentration for Pauli-basis measurements

Once more, we assume that the total budget of samples n is distributed equally among all 3^k Pauli basis choices. Unbiasedness of the LI estimator together with the explicit description (B.11) allows us to once more interpret ${\hat{L}}_{n}-\rho$ as a sum of independent, centered random matrices:

$\begin{equation*}{\hat{L}}_{n}-\rho =\sum _{\to {s}}\frac{1}{n}\sum _{i=1}^{n/{3}^{k}}\left({X}_{i}^{\left(\to {s}\right)}-\mathbb{E}\left[{X}_{i}^{\left(\to {s}\right)}\right]\right).\end{equation*}$

For each $\to {s}\in {\left\{x,y,z\right\}}^{k}$ , ${X}_{i}^{\left(\to {s}\right)}$ is an independent copy of the random matrix

$\begin{equation*}{X}^{\left(\to {s}\right)}=\underset{i=1}{\overset{k}{\otimes }}\left(3\vert {b}_{{o}_{i}}^{\left({s}_{i}\right)}\rangle \langle {b}_{{o}_{i}}^{\left({s}_{i}\right)}\vert -\mathbb{I}\right)\end{equation*}$

with probability $\langle {b}_{\to {o}}^{\left(\to {s}\right)}\vert \rho \vert {b}_{\to {o}}^{\left(\to {s}\right)}\rangle$ for each $\to {o}\in {\left\{{\pm}1\right\}}^{k}$ . Jensen's inequality implies

$\begin{equation*}\frac{1}{n}{{\Vert}{X}^{\left(\to {s}\right)}-\mathbb{E}\left[{X}^{\left(\to {s}\right)}\right]{\Vert}}_{\infty }{\leqslant}\frac{2}{n}{{\Vert}{X}^{\left(\to {s}\right)}{\Vert}}_{\infty }=\frac{2}{n}\prod _{i=1}^{k}{{\Vert}3\vert {b}_{{o}_{i}}^{\left({s}_{i}\right)}\rangle \langle {b}_{{o}_{i}}^{\left({s}_{i}\right)}\vert -\mathbb{I}{\Vert}}_{\infty }=\frac{{2}^{k+1}}{n}=:R.\end{equation*}$

For the variance, we once more use $\mathbb{E}\left[{\left(X-\mathbb{E}\left[X\right]\right)}^{2}\right]{\leqslant}\mathbb{E}\left[{X}^{2}\right]$ and compute

$\begin{align*}\hfill \frac{1}{n}\sum _{\to {s}}\mathbb{E}{\left({X}^{\left(\to {s}\right)}\right)}^{2}=& \sum _{\to {s},\to {o}}\langle {b}_{\to {o}}^{\left(\to {s}\right)}\vert \rho \vert {b}_{\to {o}}^{\left(\to {s}\right)}\rangle \underset{i=1}{\overset{k}{\otimes }}{\left({\mathcal{D}}^{-1}\left(\vert {b}_{\to {o}}^{\left(\to {s}\right)}\rangle \langle {b}_{\to {o}}^{\left(\to {s}\right)}\vert \right)\right)}^{2}\hfill \\ \hfill =& \sum _{\to {s}}\sum _{\to {o}}\langle {b}_{\to {o}}^{\left(\to {s}\right)}\vert \rho \vert {b}_{\to {o}}^{\left(\to {s}\right)}\rangle \underset{i=1}{\overset{k}{\otimes }}\frac{5}{3}{\mathcal{D}}_{3/5}\left(\vert {b}_{\to {o}}^{\left(\to {s}\right)}\rangle \langle {b}_{\to {o}}^{\left(\to {s}\right)}\vert \right)\hfill \\ \hfill =& {5}^{k}{\mathcal{D}}_{3/5}^{\otimes k}\left(\frac{1}{{3}^{k}}\sum _{\to {s},\to {o}}\langle {b}_{\to {o}}^{\left(\to {s}\right)}\vert \rho \vert {b}_{\to {o}}^{\left(\to {s}\right)}\rangle \vert {b}_{\to {o}}^{\left(\to {s}\right)}\rangle \langle {b}_{\to {o}}^{\left(\to {s}\right)}\vert \right)\hfill \\ \hfill =& {5}^{k}{\mathcal{D}}_{3/5}^{\otimes k}\left({\mathcal{D}}_{1/3}^{\otimes k}\left(\rho \right)\right)=\frac{{5}^{k}}{n}{\mathcal{D}}_{1/5}^{\otimes k}\left(\rho \right),\hfill \end{align*}$

where we have used equation (B.12) and the fact that the combination of two depolarizing channels is again a depolarizing channel. This expression can be evaluated explicitly. For $\alpha \subset \left[k\right]$ , let tr_α(ρ) denote the partial trace over all indices contained in α. Then, for $X={\otimes }_{j=1}^{k}{X}_{j}\in {\mathbb{H}}_{d}$

$\begin{equation*}{5}^{k}{\mathcal{D}}_{1/5}^{\otimes k}\left({\otimes }_{j=1}^{k}{X}_{j}\right)=\underset{j=1}{\overset{k}{\otimes }}\left(\mathrm{t}\mathrm{r}\left({X}_{j}\right)\mathbb{I}+{X}_{j}\right)=\sum _{\alpha \subset \left[k\right]}{2}^{\vert \alpha \vert }{\mathrm{t}\mathrm{r}}_{\alpha }\left(X\right)\otimes {\mathbb{I}}^{\otimes \alpha }\end{equation*}$

and this extends linearly to all of ${\mathbb{H}}_{d}\simeq {\mathbb{H}}_{2}^{\otimes k}$ . Consequently,

$\begin{align*}\hfill {{\Vert}\frac{1}{n}\sum _{\to {s}}\mathbb{E}\left[{\left({X}^{\left(\to {s}\right)}\right)}^{2}\right]{\Vert}}_{\infty }=& \frac{1}{n}{{\Vert}\sum _{\alpha \subset \left[k\right]}{2}^{\vert \alpha \vert }{\mathrm{t}\mathrm{r}}_{\alpha }\left(\rho \right)\otimes {\mathbb{I}}^{\otimes \alpha }{\Vert}}_{\infty }{\leqslant}\frac{1}{n}\sum _{\alpha \subset \left[k\right]}{2}^{\vert \alpha \vert }{{\Vert}{\mathrm{t}\mathrm{r}}_{\alpha }\left(\rho \right){\Vert}}_{\infty }{{\Vert}{\mathbb{I}}^{\otimes \alpha }{\Vert}}_{\infty }\hfill \\ \hfill {\leqslant}& \frac{1}{n}\sum _{\alpha \subset \left[k\right]}{2}^{\vert \alpha \vert }=\frac{1}{n}\sum _{j=0}^{k}\left(\genfrac{}{}{0.0pt}{}{k}{j}\right){2}^{j}=\frac{{\left(2+1\right)}^{k}}{n}=\frac{{3}^{k}}{n}=:{\sigma }^{2}.\hfill \end{align*}$

This estimate is actually tight for pure product states of the form ρ = (|ψ⟩⟨ψ|)^⊗k. We may now apply theorem 3 to conclude

$\begin{equation*}\mathrm{Pr}\left[{{\Vert}{\hat{L}}_{n}-\rho {\Vert}}_{\infty }{\geqslant}\tau \right]{\leqslant}d\mathrm{exp}\left(-\frac{3n{\tau }^{2}}{8{\times}{3}^{k}}\right)\quad \tau \in \left[0,1\right].\end{equation*}$

Appendix D.: Conversion of confidence regions from operator norm to trace norm

The final ingredient for the framework presented in this manuscript is a reliable way to transform operator-norm closeness of the (plain) least squares estimator ${\hat{L}}_{n}$ into a statement about closeness of the PLS estimator ${\hat{\rho }}_{n}$ in trace distance. Recall that the optimization problem (A.2) admits an analytic solution [27]. Let $U\mathrm{diag}\left(\to {\lambda }\right){U}^{{\dagger}}$ be an eigenvalue decomposition of ${\hat{L}}_{n}$ . Then,

$\begin{equation}{\hat{\rho }}_{n}=U\mathrm{diag}\left({\left[\to {\lambda }-{x}_{0}\to {1}\right]}^{+}\right){U}^{{\dagger}},\end{equation} \tag{ D.1 }$

where x₀ is chosen such that $\mathrm{t}\mathrm{r}\left({\hat{\rho }}_{n}\right)=1$ and ${\left[\to {y}\right]}_{i}^{+}=\mathrm{max}\left\{{\left[\to {y}\right]}_{i},0\right\}$ denotes thresholding on non-negative components. This solution is unique, provided that $\mathrm{t}\mathrm{r}\left({\hat{L}}_{n}\right)=1$ , which is the case for all the least squares estimators we consider.

The conversion from closeness in operator norm to closeness in trace norm will introduce a factor that is proportional to the effective rank of the density matrix ρ, rather than a full dimensional factor. For $r\in \mathbb{N}$ , we define the best rank-r approximation ρ_r of a quantum state $\rho \in {\mathbb{H}}_{d}$ as the optimal feasible point of

$\begin{equation}{\sigma }_{r}\left(\rho \right)=\underset{\mathrm{r}\mathrm{a}\mathrm{n}\mathrm{k}\left(Z\right){\leqslant}r}{\mathrm{m}\mathrm{i}\mathrm{n}\mathrm{i}\mathrm{m}\mathrm{i}\mathrm{z}\mathrm{e}}\quad {{\Vert}\rho -Z{\Vert}}_{1}.\end{equation} \tag{ D.2 }$

This problem can be solved analytically. Let $\rho ={\sum }_{i=1}^{d}{\lambda }_{i}\vert {x}_{i}\rangle \langle {x}_{i}\vert$ be an eigenvalue decomposition with eigenvalues arranged in non-increasing order. Then,

$\begin{equation*}{Z}^{{\sharp}}=\sum _{i=1}^{r}{\lambda }_{i}\vert {x}_{i}\rangle \langle {x}_{i}\vert ,\quad \mathrm{a}\mathrm{n}\mathrm{d}\quad {\sigma }_{r}\left(\rho \right)=\sum _{i=r+1}^{d}{\lambda }_{i}=1-\mathrm{t}\mathrm{r}\left({\rho }_{r}\right),\end{equation*}$

highlighting that the best rank-r approximation is simply a truncation onto the r largest contributions in the eigenvalue decomposition. This truncated description is accurate if the residual error σ_r(ρ) is small. If this is the case it is reasonable to say that ρ is well approximated by a rank-r matrix Z^♯ and has effective rank r.

Proposition 1. Suppose that ${\hat{L}}_{n}\in {\mathbb{H}}_{d}$ obeys $\mathrm{t}\mathrm{r}\left({\hat{L}}_{n}\right)=1$ and ${{\Vert}{\hat{L}}_{n}-\rho {\Vert}}_{\infty }{\leqslant}\tau$ for some quantum state $\rho \in {\mathbb{H}}_{d}$ and τ ⩾ 0. Then, for any $r\in \mathbb{N}$ the PLS estimator ${\hat{\rho }}_{n}$ obeys

$\begin{equation*}{{\Vert}{\hat{\rho }}_{n}-\rho {\Vert}}_{1}{\leqslant}4r\tau +2\;\mathrm{min}\left\{{\sigma }_{r}\left(\rho \right),{\sigma }_{r}\left({\hat{\rho }}_{n}\right)\right\},\end{equation*}$

where σ_r(ρ) is defined in equation (D.2).

The statement readily follows from combining two auxiliary results. The first one states that the threshold value x₀ in the analytic solution of ${\hat{\rho }}_{n}$ must be small if ${\hat{L}}_{n}$ is operator-norm close to a quantum state.

Lemma 1. Instantiate the assumptions from proposition 1. Then, the threshold value in equation (D.1) obeys ${x}_{0}\in \left[0,\tau \right]$ .

Proof. By assumption ${\hat{L}}_{n}$ has unit trace. If it is in addition psd, ${\hat{\rho }}_{n}={\hat{L}}_{n}$ , because ${\hat{L}}_{n}$ is already a quantum state and the projection is trivial (x₀ = 0) Otherwise, ${\hat{L}}_{n}$ is indefinite and unit trace ensures that the positive part dominates. Hence, x₀ must be strictly positive to enforce $\mathrm{t}\mathrm{r}\left({\hat{\rho }}_{n}\right)=1$ .

For the upper bound, let $P\in {\mathbb{H}}_{d}$ denote the orthogonal projection onto the range of ${\hat{\rho }}_{n}$ . Then, ${\hat{\rho }}_{n}=P\left({\hat{L}}_{n}-{x}_{0}\mathbb{I}\right)P=P{\hat{L}}_{n}P-{x}_{0}P$ , according to equation (D.1). In semidefinite order, this implies

$\begin{equation*}{\hat{\rho }}_{n}=P\left({\hat{L}}_{n}-\rho \right)P-{x}_{0}P+P\rho P{\leqslant}\left({{\Vert}{\hat{L}}_{n}-\rho {\Vert}}_{\infty }-{x}_{0}\right)P+P\rho P{\leqslant}\left(\tau -{x}_{0}\right)P+P\rho P,\end{equation*}$

where the last line follows from the assumption ${{\Vert}{\hat{L}}_{n}-\rho {\Vert}}_{\infty }{\leqslant}\tau$ . The trace preserves semidefinite order and we conclude

$\begin{equation*}0{\leqslant}\mathrm{t}\mathrm{r}\left({\hat{\rho }}_{n}\right)-\mathrm{t}\mathrm{r}\left(P\rho P\right){\leqslant}\left(\tau -{x}_{0}\right)\mathrm{t}\mathrm{r}\left(P\right)\end{equation*}$

which implies an upper bound of τ, because tr(P) > 0. □

The second technical lemma generalizes a result that is somewhat folklore in quantum information theory: the 'effective rank' of a difference of two quantum states is proportional to the minimal rank of the two density operators involved.

Lemma 2. Fix $r\in \mathbb{N}$ and let $\rho ,\sigma \in {\mathbb{H}}_{d}$ be quantum states. Then,

$\begin{equation*}{{\Vert}\rho -\sigma {\Vert}}_{1}{\leqslant}2r{{\Vert}\rho -\sigma {\Vert}}_{\infty }+2\;\mathrm{min}\left\{{\sigma }_{r}\left(\rho \right),{\sigma }_{r}\left(\sigma \right)\right\},\end{equation*}$

where the residual error σ_r(⋅) was defined in equation (D.2).

Proof. We can without loss of generality assume σ_r(ρ) ⩽ σ_r(σ). Decompose ρ into ρ_r + ρ_c, where ρ_r is the best rank-r approximation (D.2) and ρ_c = ρ − ρ_r denotes the 'tail'. By construction, both ρ_r and ρ_c are positive semidefinite matrices that obey σ_r(ρ) = tr(ρ_c) = 1 − tr(ρ_r). The triangle inequality then implies

$\begin{equation*}{{\Vert}\rho -\sigma {\Vert}}_{1}{\leqslant}{{\Vert}{\rho }_{r}-\sigma {\Vert}}_{1}+{\sigma }_{r}\left(\rho \right),\end{equation*}$

because ${{\Vert}{\rho }_{c}{\Vert}}_{1}={\sigma }_{r}\left(\rho \right)$ . Next, let ${P}_{+},{P}_{-}\in {\mathbb{H}}_{d}$ be the projections onto the positive and non-positive ranges of ρ_r − σ. By construction, P₊ has rank at most rank(ρ_r) = r and the trace norm equals

$\begin{equation*}{{\Vert}{\rho }_{r}-\sigma {\Vert}}_{1}=\mathrm{t}\mathrm{r}\left({P}_{+}\left({\rho }_{r}-\sigma \right)\right)-\mathrm{t}\mathrm{r}\left({P}_{-}\left({\rho }_{r}-\sigma \right)\right).\end{equation*}$

On the other hand,

$\begin{equation*}{\sigma }_{r}\left(\rho \right)=\mathrm{t}\mathrm{r}\left(\sigma -{\rho }_{r}\right)=-\mathrm{t}\mathrm{r}\left({P}_{+}\left({\rho }_{r}-\sigma \right)\right)-\mathrm{t}\mathrm{r}\left({P}_{-}\left({\rho }_{r}-\sigma \right)\right),\end{equation*}$

because ${P}_{+}+{P}_{-}=\mathbb{I}$ . Combining both relations yields

$\begin{align*}{{\Vert}{\rho }_{r}-\sigma {\Vert}}_{1}& =2\mathrm{t}\mathrm{r}\left({P}_{+}\left({\rho }_{r}-\sigma \right)\right)+{\sigma }_{r}\left(\rho \right){\leqslant}2\mathrm{t}\mathrm{r}\left({P}_{+}\left(\rho -\sigma \right)\right)+{\sigma }_{r}\left(\rho \right)\\ & {\leqslant}2{{\Vert}{P}_{+}{\Vert}}_{\infty }{{\Vert}\rho -\sigma {\Vert}}_{1}+{\sigma }_{r}\left(\rho \right),\end{align*}$

where we have used tr(P₊ ρ_c) ⩾ 0 and Hoelder's inequality. Finally, note that ${{\Vert}{P}_{+}{\Vert}}_{1}=\mathrm{r}\mathrm{a}\mathrm{n}\mathrm{k}\left({P}_{+}\right)=r$ by construction and the claim follows. □

The main result of this section is a rather straightforward combination of these two technical statements.

Proof of proposition 1. Fix $r\in \mathbb{N}$ and use, lemma 2 to conclude

$\begin{equation*}{{\Vert}{\hat{\rho }}_{n}-\rho {\Vert}}_{1}{\leqslant}2r{{\Vert}{\hat{\rho }}_{n}-\rho {\Vert}}_{\infty }+2\;\mathrm{min}\left\{{\sigma }_{r}\left(\rho \right),{\sigma }_{r}\left({\hat{\rho }}_{n}\right)\right\}.\end{equation*}$

Next, note that according to (D.1), ${\hat{\rho }}_{n}$ may be viewed as the positive definite part of the matrix ${\hat{L}}_{n}-{x}_{0}\mathbb{I}$ . Such a restriction to the positive part can never increase the operator norm distance to another positive semidefinite matrix. Hence,

$\begin{equation*}{{\Vert}{\hat{\rho }}_{n}-\rho {\Vert}}_{\infty }{\leqslant}{{\Vert}{\hat{L}}_{n}-\rho {\Vert}}_{\infty }+\vert {x}_{0}\vert {{\Vert}\mathbb{I}{\Vert}}_{\infty }{\leqslant}2\tau ,\end{equation*}$

where the last inequality follows from lemma 1. □

Appendix E.: Proof of the main result

By now we have everything in place to provide a complete proof of the main result of this work.

Theorem 4. Let $\rho \in {\mathbb{H}}_{d}$ be a state. Suppose that we either perform n structured POVM measurements (set g(d) = 2d), n Pauli observable measurements (set g(d) = d²), or n Pauli basis measurements (set g(d) = d^1.6). Then, for any $r\in \mathbb{N}$ and ${\epsilon}\in \left[0,1\right]$ ,the PLS estimator ${\hat{\rho }}_{n}$ (A.2) obeys

$\begin{equation*}\mathrm{Pr}\left[{{\Vert}{\hat{\rho }}_{n}-\rho {\Vert}}_{1}{\geqslant}{\epsilon}+2\;\mathrm{min}\left\{{\sigma }_{r}\left(\rho \right),{\sigma }_{r}\left({\hat{\rho }}_{n}\right)\right\}\right]{\leqslant}d{\mathrm{e}}^{-\frac{n{{\epsilon}}^{2}}{43g\left(d\right){r}^{2}}},\end{equation*}$

where ${\sigma }_{r}\left(\rho \right),{\sigma }_{r}\left({\hat{\rho }}_{n}\right)$ denote the residual error of approximating ρ and ${\hat{\rho }}_{n}$ by a rank-r matrix (D.2).

Note that theorem 1 in the main text is an immediate consequence of this more general result: simply set $r=\mathrm{min}\left\{\mathrm{r}\mathrm{a}\mathrm{n}\mathrm{k}\left(\rho \right),\mathrm{r}\mathrm{a}\mathrm{n}\mathrm{k}\left({\hat{\rho }}_{n}\right)\right\}$ which in turn ensures $\mathrm{min}\left\{{\sigma }_{r}\left(\rho \right),{\sigma }_{r}\left({\hat{\rho }}_{n}\right)\right\}=0$ .

However, unlike this specification, theorem 4 does feature an additional degree of freedom. The parameter $r\in \mathbb{N}$ allows for interpolating between small values (small sampling rate, but a potentially large reconstruction error) and large values (high sampling rate, but low reconstruction error). This tradeoff is particulary benign for quantum states that are approximately low-rank. Due to experimental imperfections, such states arise naturally in many experiments that aim at generating a pure quantum state. We illustrate this by means of the following caricature of a faulty state preparation protocol. Suppose that an apparatus either produces a target state |ψ⟩⟨ψ| perfectly, or fails completely, in the sense that it outputs a maximally mixed state. Then, the resulting state is

$\begin{equation*}\rho =\left(1-p\right)\vert \psi \rangle \langle \psi \vert +\frac{p}{d}\mathbb{I},\end{equation*}$

where $p\in \left[0,1\right]$ denotes the probability of failure. This state has clearly full rank. Theorem 1 in the main text requires at least $n{\geqslant}43\frac{g\left(d\right){d}^{2}}{{{\epsilon}}^{2}}\mathrm{log}\left(d/\delta \right)$ samples to estimate it up to trace-norm accuracy with high probability. In contrast, theorem 4 ensures that already $n{\geqslant}43\frac{g\left(d\right)}{{{\epsilon}}^{2}}\mathrm{log}\left(d/\delta \right)$ samples suffice to ensure that, with high probability, the PLS estimator obeys ${{\Vert}{\hat{\rho }}_{n}-\rho {\Vert}}_{1}{\leqslant}{\epsilon}+2p$ . For sufficiently high success probabilities/low accuracy ( ⩾ 2p/d) this outperforms the original statement.

Proof of theorem 4. We illustrate the proof for structured POVMs—the other settings are completely analogous. Fix ${\epsilon}\in \left[0,1\right]$ , $r\in \mathbb{N}$ and set $\tau =\frac{{\epsilon}}{4r}$ . Then, the main result of appendix C.1–equation (C.1)—ensures that the least squares estimator ${\hat{L}}_{n}$ obeys

$\begin{equation}{{\Vert}{\hat{L}}_{n}-\rho {\Vert}}_{\infty }{\leqslant}\tau =\frac{{\epsilon}}{4r}\end{equation} \tag{ E.1 }$

with probability of failure bounded by $d{\mathrm{e}}^{-\frac{{{\epsilon}}^{2}n}{86d{r}^{2}}}$ . Assuming that this condition is true, proposition 1 readily yields ${{\Vert}{\hat{\rho }}_{n}-\rho {\Vert}}_{1}{\leqslant}{\epsilon}+2\;\mathrm{min}\left\{{\sigma }_{r}\left(\rho \right),{\sigma }_{r}\left({\hat{\rho }}_{n}\right)\right\}$ . □

Appendix F.: Improved convergence guarantees for the uniform POVM

All the convergence results derived so far feature an additional log(d)-factor. This is a consequence of the matrix Bernstein inequality that proved instrumental in deriving these results. One can show that such an additional factor necessarily features in all matrix concentration inequalities that are based on exclusively first and second moments of the random matrices in question [57].

However, the following question remains: is this log(d)-factor an artifact of the proof technique, or is it an intrinsic feature?

In this section we rule out the second possibility: a different proof technique allows for avoiding this log(d)-factor, provided that the POVM is sufficiently symmetric and well-behaved. More precisely, we re-visit the uniform POVM ${\left\{d\vert v\rangle \langle v\vert \mathrm{d}v\right\}}_{v\in {\mathbb{S}}^{d}}$ and exploit the fact that equation (B.1) completely characterizes all moments of the resulting outcome distribution. This opens the door for applying very strong proof techniques from large dimensional probability theory that found wide-spread applications in a variety of subjects, including compressed sensing [58] and, more recently, quantum information theory [59]. We believe that this technique may be of independent interest and find it therefore worthwhile to present it in a self-contained fashion. Roughly speaking, it is based on the following steps:

(a)
Reformulation: the operator norm of a random Hermitian matrix $A\in {\mathbb{H}}_{d}$ admits a variational definition:
$\begin{equation}{{\Vert}A{\Vert}}_{\infty }={\mathrm{max}}_{y\in {\mathbb{S}}^{d}}\left\vert \langle y\vert A\vert y\rangle \right\vert .\end{equation} \tag{ F.1 }$
(b)
Discretization: replace the maximization over the entire complex unit sphere ${\mathbb{S}}^{d}$ by a maximization over a finite point set $\mathcal{N}$ that covers ${\mathbb{S}}^{d}$ to sufficiently high accuracy (covering net).
(c)
Concentration: fix $y\in \mathcal{N}$ and show that the scalar random variable s_y = ⟨y|A|y⟩ concentrates sharply around its expectation value.
(d)
Union bound: apply a union bound over all $\vert \mathcal{N}\vert$ random variables s_y to obtain an upper bound on the operator norm ||A||_∞.

Ideally the tail bound from (d) is sharp enough to 'counter-balance' the $\vert \mathcal{N}\vert$ -pre-factor that results from the union bound in step (iv). Should this be not the case, more sophisticated methods, like generic chaining [60], may still allow for drawing non-trivial conclusions. Fortunately, for the task at hand, this turns out to not be necessary and the rather naive strategy sketched above suffices to achieve a result that is (provably) optimal up to a constant factors:

Theorem 5. Suppose that we perform n independent uniform POVM measurements on a quantum state $\rho \in {\mathbb{H}}_{d}$ . Then, the associated least squares estimator ${\hat{L}}_{n}$ obeys

$\begin{equation*}\mathrm{Pr}\left[{{\Vert}{\hat{L}}_{n}-\rho {\Vert}}_{\infty }{\geqslant}\tau \right]{\leqslant}2\;\mathrm{exp}\left({c}_{1}d-{c}_{2}n{\tau }^{2}\right),\end{equation*}$

In particular, $n{\geqslant}C\frac{d}{{\tau }^{2}}\mathrm{log}\left(1/\delta \right)$ suffices to ensure ${\Vert}{\hat{L}}_{n}-\rho {{\Vert}}_{\infty }{\leqslant}\tau$ with probability at least 1 − δ. Here c₁, c₂, C > 0 denote constants of sufficient size.

No effort has been made to optimize the constants. The proof presented here yields c₁ = 2 log(3) and ${c}_{2}=\frac{1}{480}$ which could be further improved by a more careful analysis. Importantly, the second part of this statement can be combined with proposition 1 to readily deduce the last technical result of the main text:

Corollary 2. (Re-statement of theorem 5). For any rank-r state ρ, a number of $n{\geqslant}C\frac{{r}^{2}d}{{{\epsilon}}^{2}}\mathrm{log}\left(1/\delta \right)$ uniform POVM measurements suffice to ensure ${{\Vert}{\hat{\rho }}_{\left(n\right)}^{{\sharp}}-\rho {\Vert}}_{1}{\leqslant}{\epsilon}$ with probability at least 1 − δ.

Not only does this statement reproduce the best known sampling rates for tomography with independent measurements [22], it also exactly matches lower bounds on the minimal sample complexity associated with any tomographic procedure that may apply in this setting [20, table 1].

The remainder of this section is dedicated to proving theorem 5. For the sake of accessibility, we will divide this proof into three subsections that contain the steps summarized above.

F.1. Step I: reformulation and discretization

Suppose that we perform n uniform POVM measurements on a fixed quantum state $\rho \in {\mathbb{H}}_{d}$ . Then, the least squares estimator is equivalent to a sum of i.i.d. random matrices:

$\begin{equation*}{\hat{L}}_{n}=\frac{1}{n}\sum _{i=1}^{n}{X}_{i}.\end{equation*}$

Each X_i is an independent copy of the random matrix X that assumes the value $\left(d+1\right)\vert v\rangle \langle v\vert -\mathbb{I}$ with probability d⟨v|ρ|v⟩dv and v may range over the entire complex unit sphere. Unbiasedness of this estimator in turn implies

$\begin{equation*}{{\Vert}{\hat{L}}_{n}-\rho {\Vert}}_{\infty }={{\Vert}\frac{1}{n}\sum _{i=1}^{n}\left({X}_{i}-\mathbb{E}\left[{X}_{i}\right]\right){\Vert}}_{\infty }={\mathrm{max}}_{y\in {\mathbb{S}}^{d}}\left\vert \langle y\vert \frac{1}{n}\sum _{i=1}^{n}\left({X}_{i}-\mathbb{E}\left[{X}_{i}\right]\right)\vert y\rangle \right\vert .\end{equation*}$

Next, we employ a result that is somewhat folklore in random matrix theory, see e.g. [61, lemma 5.3]. It states that the maximum over the entire unit sphere may be replaced by a maximum over certain finite point sets, called covering nets: a covering-net of ${\mathbb{S}}^{d}$ with fineness θ > 0 is a finite set of unit vectors ${\left\{{z}_{j}\right\}}_{j=1}^{N}\subseteq {\mathbb{S}}^{d}$ that covers the entire (complex) unit sphere in the sense that every $y\in {\mathbb{S}}^{d}$ is at least θ-close to a point in the net.

Lemma 3. Let ${\mathcal{N}}_{\theta }={\left\{{z}_{j}\right\}}_{j=1}^{N}$ be a covering net of ${\mathbb{S}}^{d}$ with fineness θ. Then, for any matrix $A\in {\mathbb{H}}_{d}$ :

$\begin{equation*}{\mathrm{max}}_{j\in \left[N\right]}\left\vert \langle {z}_{j}\vert A\vert {z}_{j}\rangle \right\vert {\leqslant}{{\Vert}A{\Vert}}_{\infty }{\leqslant}\frac{1}{1-2\theta }{\mathrm{max}}_{j\in \left[N\right]}\left\vert \langle {z}_{j}\vert A\vert {z}_{j}\rangle \right\vert .\end{equation*}$

This result highlights that already a rather coarse net suffices to get reasonable approximations to the operator norm. Here, we choose $\theta =\frac{1}{4}$ which, while certainly not optimal, simplifies exposition. In particular,

$\begin{equation}{{\Vert}{\hat{L}}_{n}-\rho {\Vert}}_{\infty }{\leqslant}2\;{\mathrm{max}}_{j\in \left[N\right]}\left\vert \frac{1}{n}\sum _{j=1}^{n}\langle {z}_{j}\vert {X}_{i}-\mathbb{E}\left[{X}_{i}\right]\vert {z}_{j}\rangle \right\vert ,\end{equation} \tag{ F.2 }$

where the maximization is over a covering net of fineness $\theta =\frac{1}{4}$ .

F.2. Step II: concentration

Note that the right hand side of equation (F.2) corresponds to a maximum over N different random variables—each of them labeled by a unit vector z_j in the net. Let $z\in {\mathbb{S}}^{d}$ be such a vector. Then, the associated random variable itself corresponds to an empirical average of n i.i.d. variables:

$\begin{equation*}{s}_{z}=\langle z\vert X-\mathbb{E}\left[X\right]\vert z\rangle .\end{equation*}$

Clearly, s_z obeys $\mathbb{E}\left[{s}_{z}\right]=0$ and, more importantly, has sub-exponential moment growth. While this follows directly from the fact that s_z is bounded, the following result highlights that this tail-behavior is actually independent of the ambient dimension.

Lemma 4. Fix $z\in {\mathbb{S}}^{d}$ . Then for any integer p ⩾ 2, the random variable s_z obeys

$\begin{equation*}\mathbb{E}\left[\vert {s}_{z}{\vert }^{p}\right]{\leqslant}27{\times}{6}^{p-2}p!\end{equation*}$

We divert the proof of this statement to the end of this section and content ourselves with emphasizing that the closed form expression of the frame operator (B.1) is essential for bounding all moments simultaneously. More relevant to the task at hand is that such a moment behavior ensures that the tails of the distribution of s_z follow an exponential decay: $\mathrm{Pr}\left[\vert {s}_{z}\vert {\geqslant}t\right]{\leqslant}{\mathrm{e}}^{-ct}$ , where c is a constant independent of the dimension d. Strong classical concentration inequalities apply for sums of i.i.d. random variables that exhibit such sub-exponential behavior. We choose to apply a rather general version of the classical Bernstein inequality, see e.g. [58, theorem 7.30].

Theorem 6. Let ${s}_{1},\dots ,\;{s}_{n}\in \mathbb{R}$ i.i.d. copies of a mean-zero random variable s that obeys $\mathbb{E}\left[\vert s{\vert }^{p}\right]{\leqslant}p!{R}^{p-2}{\sigma }^{2}/2$ for all integers p ⩾ 2, where R, σ² > 0 are constants. Then, for all t > 0,

$\begin{equation*}\mathrm{Pr}\left[\left\vert \sum _{i=1}^{n}{s}_{i}\right\vert {\geqslant}t\right]{\leqslant}2\;\mathrm{exp}\left(-\frac{{t}^{2}/2}{n{\sigma }^{2}+Rt}\right).\end{equation*}$

Lemma 4 ensures that the random variable s_z meets this requirement with σ² = 54 and R = 6. Hence, the following corollary is an immediate consequence of theorem 6.

Corollary 3. Fix $z\in {\mathbb{S}}^{d}$ . Then, for any t ∈ [0, 1]

$\begin{equation*}\mathrm{Pr}\left[\left\vert \frac{1}{n}\sum _{i=1}^{n}\langle z\vert {X}_{i}-\mathbb{E}\left[{X}_{i}\right]\vert z\rangle \right\vert {\geqslant}t\right]{\leqslant}2{\mathrm{e}}^{-\frac{n{t}^{2}}{120}}.\end{equation*}$

F.3. Step III: union bound

Recall that equation (F.2) upper-bounds ${{\Vert}{\hat{L}}_{n}-\rho {\Vert}}_{\infty }$ by a maximum over finitely many random variables, each of which is controlled by the strong exponential tail inequality from corollary 3. To exploit this, we fix $\tau \in \left[0,1\right]$ and apply a union bound (also known as Boole's inequality) over all these different random variables to obtain

$\begin{align*}\hfill \mathrm{Pr}\left[\parallel {\hat{L}}_{n}-\rho {\parallel }_{\infty }{\geqslant}\tau \right]{\leqslant}& \mathrm{Pr}\left[{\mathrm{max}}_{j\in \left[N\right]}\left\vert \frac{1}{n}\sum _{j=1}^{n}\langle {z}_{j}\vert {X}_{i}-\mathbb{E}\left[{X}_{i}\right]\vert {z}_{j}\rangle \right\vert {\geqslant}\frac{\tau }{2}\right]\hfill \\ \hfill {\leqslant}& N{\mathrm{max}}_{j\in \left[N\right]}\mathrm{Pr}\left[\left\vert \frac{1}{n}\sum _{j=1}^{n}\langle {z}_{j}\vert {X}_{i}-\mathbb{E}\left[{X}_{i}\right]\vert {z}_{j}\rangle \right\vert {\geqslant}\frac{\tau }{2}\right]{\leqslant}2N{\mathrm{e}}^{-\frac{n{\tau }^{2}}{480}},\hfill \end{align*}$

where the last line is due to corollary 3 Here, $N=\vert {\mathcal{N}}_{\frac{1}{4}}\vert$ denotes the cardinality of a covering net for the complex unit sphere ${\mathbb{S}}^{d}$ with fineness $\theta =\frac{1}{4}$ . The complex unit sphere admits an isometric embedding into the real-valued unit sphere in 2d-dimensions: map real- and imaginary parts of each complex vector component onto two distinct real parameters. This map preserves Euclidean lengths and, by extension, also the geometry of the unit sphere. Volumetric upper bounds on the cardinality of covering nets for the 2d-dimensional real-valued unit sphere are widely known, see e.g. [58, proposition C.3] and [61, lemma 5.2]: $\vert {\mathcal{N}}_{\theta }\vert {\leqslant}{\left(1+\frac{2}{\theta }\right)}^{2d}$ . Since a fineness of $\theta =\frac{1}{4}$ suffices for our purpose, we can conclude N ⩽ 3^2d and consequently,

$\begin{equation*}\mathrm{Pr}\left[{{\Vert}{\hat{L}}_{n}-\rho {\Vert}}_{\infty }{\geqslant}\tau \right]{\leqslant}2{\times}{3}^{2d}{\mathrm{e}}^{-\frac{n{\tau }^{2}}{480}}=2{\mathrm{e}}^{2\mathrm{log}\left(3\right)d-\frac{n{\tau }^{2}}{480}}\end{equation*}$

This concludes the proof of theorem 5.

F.4. Proof of lemma 4

Recall that, by assumption, the random matrix X assumes the value $X=\left(d+1\right)\vert v\rangle \langle v\vert -\mathbb{I}$ with probability ⟨v|ρ|v⟩dv, where v may range over the entire complex unit sphere ${\mathbb{S}}^{d}$ . Moreover, $\mathbb{E}\left[X\right]=\rho$ . For fixed $z\in {\mathbb{S}}^{d}$ , we may therefore write

$\begin{equation*}{s}_{z}=\langle z\vert X-\mathbb{E}\left[X\right]\vert z\rangle =\left(d+1\right)\langle v\vert B\vert v\rangle ,\end{equation*}$

where $B=\vert z\rangle \langle z\vert -\frac{1+\langle z\vert \rho \vert z\rangle }{d+1}\mathbb{I}\in {\mathbb{H}}_{d}$ has bounded trace norm

$\begin{equation}{{\Vert}B{\Vert}}_{1}{\leqslant}1+\left(1+\langle z\vert \rho \vert z\rangle \right){\leqslant}3.\end{equation} \tag{ F.3 }$

Next, recall a basic identity from matrix analysis that states

$\begin{equation*}\vert \langle v\vert B\vert v\rangle \vert =\left\vert \mathrm{t}\mathrm{r}\left(\vert v\rangle \langle v\vert B\right)\right\vert {\leqslant}\mathrm{t}\mathrm{r}\left(\vert v\rangle \langle v\vert \vert B\vert \right),\end{equation*}$

where $\vert B\vert =\sqrt{{B}^{2}}$ denotes the absolute value of the matrix B. Also, the Schatten-p norms of matrices and their absolute values coincides, in particular ||B||₁ = tr(|B|) = |||B|||₁. We can use this trick to absorb the absolute value in the moment computation. More precisely, fix an integer p ⩾ 2 and note that $\mathbb{E}\left[\vert {s}_{z}{\vert }^{p}\right]$ obeys

$\begin{equation*}\mathbb{E}\left[{\left\vert \left(d+1\right)\langle v\vert B\vert v\rangle \right\vert }^{p}\right]{\leqslant}{\left(d+1\right)}^{p}\mathbb{E}\left[\mathrm{t}\mathrm{r}{\left(\vert v\rangle \langle v\vert \;\vert B\vert \right)}^{p}\right].\end{equation*}$

We can now include the distribution of the random matrices X, and (by extension) |v⟩⟨v|, to compute

$\begin{align*}\hfill \mathbb{E}\left[\vert {s}_{z}{\vert }^{p}\right]{\leqslant}& {\left(d+1\right)}^{p}\mathbb{E}\left[\mathrm{t}\mathrm{r}{\left(\vert v\rangle \langle v\vert \;\vert B\vert \right)}^{p}\right]=d{\left(d+1\right)}^{p}{\int }_{{\mathbb{S}}^{d}}\langle v\vert \rho \vert v\rangle \mathrm{t}\mathrm{r}{\left(\vert v\rangle \langle v\vert \;\vert B\vert \right)}^{p}\;\mathrm{d}v\hfill \\ \hfill =& d{\left(d+1\right)}^{p}\mathrm{t}\mathrm{r}\left({\int }_{{\mathbb{S}}^{d}}{\left.\left(\vert v\rangle \langle v\vert \right)\right)}^{\otimes \left(p+1\right)}\;\rho \otimes \vert B{\vert }^{\otimes p}\right)\hfill \\ \hfill =& d{\left(d+1\right)}^{p}{\left(\genfrac{}{}{0.0pt}{}{d+p}{p+1}\right)}^{-1}\mathrm{t}\mathrm{r}\left({P}_{{\mathrm{S}\mathrm{y}\mathrm{m}}^{\left(p+1\right)}}\rho \otimes \vert B{\vert }^{\otimes p}\right),\hfill \end{align*}$

where the last equation is due to equation (B.1). Next, we note that Hoelder's inequality implies

$\begin{equation*}\mathrm{t}\mathrm{r}\left({P}_{{\mathrm{S}\mathrm{y}\mathrm{m}}^{\left(p+1\right)}}\rho \otimes \vert B{\vert }^{\otimes p}\right){\leqslant}{{\Vert}{P}_{{\mathrm{S}\mathrm{y}\mathrm{m}}^{p+1}}{\Vert}}_{\infty }{{\Vert}\rho {\Vert}}_{1}{\Vert}\;\vert B{\vert \;{\Vert}}_{1}^{p}{\leqslant}{3}^{p},\end{equation*}$

because ${P}_{{\mathrm{S}\mathrm{y}\mathrm{m}}^{\left(p+1\right)}}$ is an orthogonal projector, ρ is a quantum state and B is bounded in trace norm (F.3). For the remaining pre-factor we use the crude bound

$\begin{equation*}d{\left(d+1\right)}^{p}{\left(\genfrac{}{}{0.0pt}{}{d+p}{p+1}\right)}^{-1}{\leqslant}\left(p+1\right)!{\leqslant}3{\times}{2}^{p-2}p!\end{equation*}$

to establish the statement.