Updates for Hutch++

At SOSA 2021, I presented the paper Hutch++: Optimal Stochastic Trace Estimation (with Cameron Musco, Christopher Musco, and David P. Woodruff). This webpage contains some short updates that may be useful to people.


Hutch++ for Undergrads

I believe that the analysis of both Hutchinson's Estimator and Hutch++ is simple and approachable. This section contains an analysis that an advanced undergraduate should hopefully be able to follow.

Part 1: The Fundamental Shape of Hutchinson's Estimator

Definition 1: Hutchinson's Estimator

Let $\boldsymbol{A} \in \mathbb{R}^{n \times n}$ be a real matrix, and fix $\ell \in \mathbb{N}$. Let $\mathbf{x}_1, \ldots, \mathbf{x}_\ell \in \mathbb{R}^n$ be vectors drawn with i.i.d. $\mathcal{N}(0,1)$ entries. Then, Hutchinson's Estimator is

$$ H_\ell(\boldsymbol{A}) := \frac{1}{\ell} \sum_{i=1}^\ell \mathbf{x}_i^\intercal \boldsymbol{A} \mathbf{x}_i $$
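
As a concrete reference, here is a minimal MATLAB sketch of Hutchinson's Estimator with $\ell$ Gaussian queries (this code and the function name hutchinson are ours, for illustration):

function trace_est = hutchinson(A, ell)
    % Each column of X is one query vector x_i with i.i.d. N(0,1) entries.
    X = randn(size(A,1), ell);
    % Compute (1/ell) * sum_i x_i' * A * x_i without forming an ell-by-ell matrix.
    trace_est = sum(sum(X .* (A*X))) / ell;
end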

We can easily verify two properties of this classical estimator for symmetric $\boldsymbol{A}$:

Lemma 1: Hutchinson's Mean and Variance

$\mathbb{E}[H_\ell(\boldsymbol{A})] = \text{tr}(\boldsymbol{A})$ and $\text{Var}[H_\ell(\boldsymbol{A})] = \frac{2}{\ell}\|\boldsymbol{A}\|_F^2$.

Proof. The expectation is easy to verify:

$$ \mathbb{E}[H_\ell(\boldsymbol{A})] = \mathbb{E}[\mathbf{x}^\intercal\boldsymbol{A}\mathbf{x}] = \sum_{i=1}^n \sum_{j=1}^n \mathbb{E}[[\boldsymbol{A}]_{i,j}\, x_i x_j] = \sum_{i=1}^n \boldsymbol{A}_{i,i} = \text{tr}(\boldsymbol{A}) $$

where we use that $\mathbb{E}[x_i x_j]$ is $1$ if $i = j$ and $0$ otherwise.

In order to analyze the variance, we present the analysis of Lemma 9 from Avron and Toledo (2011). Let $\boldsymbol{A} = \boldsymbol{U}\boldsymbol{\Lambda}\boldsymbol{U}^\intercal$ be the eigendecomposition of $\boldsymbol{A}$, and let $\mathbf{y} := \boldsymbol{U}^\intercal\mathbf{x}$. Then, for $\mathbf{x} \sim \mathcal{N}(0, \boldsymbol{I})$, we have $\mathbf{y} \sim \mathcal{N}(0, \boldsymbol{U}^\intercal\boldsymbol{U}) = \mathcal{N}(0, \boldsymbol{I})$ by the rotational invariance of the Gaussian. Therefore,

$$\begin{aligned} \text{Var}[H_\ell(\boldsymbol{A})] &= \frac{1}{\ell}\,\text{Var}[\mathbf{x}^\intercal\boldsymbol{A}\mathbf{x}] \\ &= \frac{1}{\ell}\,\text{Var}[\mathbf{x}^\intercal\boldsymbol{U}\boldsymbol{\Lambda}\boldsymbol{U}^\intercal\mathbf{x}] \\ &= \frac{1}{\ell}\,\text{Var}[(\boldsymbol{U}^\intercal\mathbf{x})^\intercal\boldsymbol{\Lambda}(\boldsymbol{U}^\intercal\mathbf{x})] \\ &= \frac{1}{\ell}\,\text{Var}[\mathbf{y}^\intercal\boldsymbol{\Lambda}\mathbf{y}] \\ &= \frac{1}{\ell}\,\text{Var}\Big[\sum_{i=1}^n \lambda_i y_i^2\Big] \\ &= \frac{1}{\ell}\sum_{i=1}^n \text{Var}[\lambda_i y_i^2] \\ &= \frac{1}{\ell}\sum_{i=1}^n 2\lambda_i^2 \\ &= \frac{2}{\ell}\,\|\boldsymbol{A}\|_F^2 \end{aligned}$$

Note that we cannot naively apply linearity of variance to the sum $\mathbf{x}^\intercal\boldsymbol{A}\mathbf{x} = \sum_{i=1}^n\sum_{j=1}^n x_i x_j \boldsymbol{A}_{i,j}$, since linearity of variance holds only for independent terms; the change of basis to $\mathbf{y}$ is exactly what makes the summands independent. The second-to-last line then uses $\text{Var}[y_i^2] = 2$ for $y_i \sim \mathcal{N}(0,1)$.

$\blacksquare$

This variance is in terms of the Frobenius norm of $\boldsymbol{A}$. So, we expect Hutchinson's Estimator to satisfy $H_\ell(\boldsymbol{A}) \in \text{tr}(\boldsymbol{A}) \pm \frac{\sqrt{2}}{\sqrt{\ell}}\|\boldsymbol{A}\|_F$ with high probability.[1] Typically in computer science we are given an error tolerance $\varepsilon$, and want to know how many matrix-vector products we need to compute (i.e. how small can we make $\ell$ while still being accurate?). In order to build such a guarantee, we assume $\boldsymbol{A}$ is positive semi-definite (PSD), which in turn implies that $\|\boldsymbol{A}\|_F \leq \text{tr}(\boldsymbol{A})$. With this in mind, we can prove the standard result for Hutchinson's Estimator:

Lemma 2: Hutchinson's Estimator

Let $\boldsymbol{A} \in \mathbb{R}^{d \times d}$ be a PSD matrix. Then, $\text{Var}[H_\ell(\boldsymbol{A})] \leq \frac{2}{\ell}\text{tr}^2(\boldsymbol{A})$.

So, if we want to ensure $H_\ell(\boldsymbol{A}) \in (1 \pm \varepsilon)\text{tr}(\boldsymbol{A})$ with good probability, we need the standard deviation to satisfy $\frac{\sqrt{2}}{\sqrt{\ell}}\text{tr}(\boldsymbol{A}) \leq \varepsilon\,\text{tr}(\boldsymbol{A})$, so we need $\ell = O(\frac{1}{\varepsilon^2})$ samples.
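
As a quick sanity check of this rate, here is a small demo (ours) built on the hutchinson sketch above; since the standard deviation scales as $1/\sqrt{\ell}$, quadrupling $\ell$ should roughly halve the typical relative error:

% Build a random PSD test matrix, then watch the error shrink with ell.
B = randn(2000); A = B*B';
for ell = [100 400 1600]
    err = abs(hutchinson(A, ell) - trace(A)) / trace(A);
    fprintf('ell = %4d: relative error %.3g\n', ell, err);
end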

Part 2: Hutchinson's versus the Top Few Eigenvalues

This was a section in the first draft of the paper, but was unfortunately cut. It builds slow, methodical intuition for the geometry that Hutch++ takes advantage of.

Hutchinson's Estimator is powerful, and gives a nice rate of convergence. However, the proof of Lemma 2 has a suspicious step. If we write out three lines of analysis:

$$\begin{aligned} |\text{tr}(\boldsymbol{A}) - H_\ell(\boldsymbol{A})| &\leq \frac{\sqrt{2}}{\sqrt{\ell}}\,\|\boldsymbol{A}\|_F & & (\text{Standard Deviation}) \\ &\leq \frac{\sqrt{2}}{\sqrt{\ell}}\,\text{tr}(\boldsymbol{A}) & & (\|\boldsymbol{A}\|_F \leq \text{tr}(\boldsymbol{A})) \\ &= \varepsilon\,\text{tr}(\boldsymbol{A}) & & (\ell = O({\textstyle\frac{1}{\varepsilon^2}})) \end{aligned}$$

Looking at these inequalities,

  • The analysis of line 1 is always tight, since we know the variance of $H_\ell(\boldsymbol{A})$ exactly.

  • The analysis of line 3 is always tight, since we just set $\ell$ to the smallest value that gets error $\varepsilon$.

  • The analysis of line 2 is not always tight.

The second line is tight only if $\|\boldsymbol{A}\|_F \approx \text{tr}(\boldsymbol{A})$. That is, the analysis above only tells us that Hutchinson's Estimator needs $O(\frac{1}{\varepsilon^2})$ samples if $\boldsymbol{A}$ is the kind of matrix that has $\|\boldsymbol{A}\|_F \approx \text{tr}(\boldsymbol{A})$. So, what kind of matrix is that?

Claim 1: Eigenvalue Intuition

If $\|\boldsymbol{A}\|_F \approx \text{tr}(\boldsymbol{A})$, then $\boldsymbol{A}$ has very few large eigenvalues, followed by much smaller eigenvalues.

Try to show this yourself. If you're having trouble, look at the argument below.

Proof. First, we rewrite $\|\boldsymbol{A}\|_F$ and $\text{tr}(\boldsymbol{A})$ explicitly in terms of the eigenvalues of $\boldsymbol{A}$. Let $\mathbf{v} = [\lambda_1, \lambda_2, \ldots, \lambda_d]$ be the eigenvalues of $\boldsymbol{A}$. Then, recall that

$$ \|\boldsymbol{A}\|_F = \|\mathbf{v}\|_2 \qquad \text{and} \qquad \text{tr}(\boldsymbol{A}) = \sum_{i=1}^d \lambda_i = \|\mathbf{v}\|_1 $$

where the last equality follows from $\boldsymbol{A}$ being PSD. So, we are given that $\|\mathbf{v}\|_2 \approx \|\mathbf{v}\|_1$: this is now just a statement about vectors in $\mathbb{R}^d$!

We now need to show that if a vector $\mathbf{v}$ has $\|\mathbf{v}\|_2 \approx \|\mathbf{v}\|_1$, then $\mathbf{v}$ is nearly sparse: it has only a few large entries.

There are many ways to build intuition here; we will look at the extreme case and assume that $\|\mathbf{v}\|_2 = \|\mathbf{v}\|_1$ exactly. Prove that $\mathbf{v}$ then has at most one nonzero entry.
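
In particular, here is one short argument (our phrasing): squaring the 1-norm gives

$$ \|\mathbf{v}\|_1^2 = \Big(\sum_{i=1}^d |v_i|\Big)^2 = \sum_{i=1}^d v_i^2 + 2\sum_{i<j} |v_i|\,|v_j| = \|\mathbf{v}\|_2^2 + 2\sum_{i<j} |v_i|\,|v_j| $$

so $\|\mathbf{v}\|_1 = \|\mathbf{v}\|_2$ forces every cross term $|v_i|\,|v_j|$ to vanish, i.e. no two entries of $\mathbf{v}$ can both be nonzero.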

$\blacksquare$

We now finish the discussion with a comparison of two ideas:

  • So, Hutchinson's Estimator only needs to use many samples (i.e. $O(\frac{1}{\varepsilon^2})$ samples) if $\boldsymbol{A}$ has very special structure: it has a small number of large eigenvalues.

  • If a matrix has a small number of large eigenvalues, the trace must be well approximated by the sum of those eigenvalues.

With this, we conclude the fundamental intuition behind Hutch++:

Claim 2: Hutch++ Intuition

If $\boldsymbol{A}$ is the kind of matrix that is hard for Hutchinson's Estimator to handle, then $\text{tr}(\boldsymbol{A})$ is well approximated by the top few eigenvalues of $\boldsymbol{A}$.

This leads us to pick the following rough design of an algorithm:

  1. Find a good low-rank approximation $\tilde{\boldsymbol{A}}_k$

  2. Notice that $\text{tr}(\boldsymbol{A}) = \text{tr}(\tilde{\boldsymbol{A}}_k) + \text{tr}(\boldsymbol{A} - \tilde{\boldsymbol{A}}_k)$

  3. Compute $\text{tr}(\tilde{\boldsymbol{A}}_k)$ exactly

  4. Approximate $\text{tr}(\boldsymbol{A} - \tilde{\boldsymbol{A}}_k)$ with Hutchinson's Estimator

  5. Return $\text{Hutch++}(\boldsymbol{A}) = \text{tr}(\tilde{\boldsymbol{A}}_k) + H_\ell(\boldsymbol{A} - \tilde{\boldsymbol{A}}_k)$

In the next section, we state a formal version of Claim 1 (see Lemma 3), show how to compute such a matrix $\tilde{\boldsymbol{A}}_k$ (see $\boldsymbol{Q}\boldsymbol{Q}^\intercal\boldsymbol{A}$ in Theorem 1), and bound the variance of the Hutch++ estimator (see Theorem 2).

Part 3: The Variance of Hutch++

Special thanks to Joel A. Tropp for working most of this out, and encouraging us to look into sharpening the constants in the analysis of Hutch++.

By expressing the variance of Hutch++ with all of its constants laid bare, and by using a very simple analysis, we hope this section allows practitioners to easily choose the exact parameters in Hutch++ code.

Before bounding the variance of Hutch++, we include the proof of one lemma and import another theorem:

Lemma 3: L2/L1/L0 Bounds

Let $\boldsymbol{A} \in \mathbb{R}^{d \times d}$ be a PSD matrix. Let $\boldsymbol{A}_k$ be the best rank-$k$ approximation to $\boldsymbol{A}$. Then, $\|\boldsymbol{A} - \boldsymbol{A}_k\|_F \leq \frac{1}{2\sqrt{k}}\text{tr}(\boldsymbol{A})$.

Proof. Let $\mathbf{v} = [\lambda_1, \ldots, \lambda_n]$ be the vector of eigenvalues of $\boldsymbol{A}$, sorted so that $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n \geq 0$. Then $\mathbf{v}_k := [\lambda_1, \ldots, \lambda_k, 0, \ldots, 0]$ is the vector of eigenvalues of $\boldsymbol{A}_k$. With this notation, we just need to prove that $\|\mathbf{v} - \mathbf{v}_k\|_2 \leq \frac{1}{2\sqrt{k}}\|\mathbf{v}\|_1$. The rest of this proof is just Lemma 7 from this paper.

Hölder's inequality states that $\sum_{i=1}^n a_i b_i \leq \|\mathbf{a}\|_p \cdot \|\mathbf{b}\|_q$ for any vectors $\mathbf{a}, \mathbf{b}$ so long as $\frac{1}{p} + \frac{1}{q} = 1$. If we take $\mathbf{a} = \mathbf{b} = \mathbf{v} - \mathbf{v}_k$, $p = 1$, and $q = \infty$, we find that $\|\mathbf{v}-\mathbf{v}_k\|_2^2 \leq \|\mathbf{v}-\mathbf{v}_k\|_1 \, \|\mathbf{v}-\mathbf{v}_k\|_\infty$. Further, note that by our construction of $\mathbf{v}_k$, $\|\mathbf{v}\|_1 = \|\mathbf{v}_k\|_1 + \|\mathbf{v}-\mathbf{v}_k\|_1$. Then,

$$ \frac{\|\mathbf{v}-\mathbf{v}_k\|_2}{\|\mathbf{v}\|_1} \leq \frac{\sqrt{\|\mathbf{v}-\mathbf{v}_k\|_1 \, \|\mathbf{v}-\mathbf{v}_k\|_\infty}}{\|\mathbf{v}_k\|_1 + \|\mathbf{v}-\mathbf{v}_k\|_1} $$

Note that $\|\mathbf{v}-\mathbf{v}_k\|_\infty = \lambda_{k+1}$ and $\|\mathbf{v}_k\|_1 = \sum_{i=1}^k \lambda_i \geq k\lambda_{k+1}$. Then, if we let $\gamma := \|\mathbf{v}-\mathbf{v}_k\|_1$, we find

$$ \frac{\sqrt{\|\mathbf{v}-\mathbf{v}_k\|_1 \, \|\mathbf{v}-\mathbf{v}_k\|_\infty}}{\|\mathbf{v}_k\|_1 + \|\mathbf{v}-\mathbf{v}_k\|_1} \leq \frac{\sqrt{\gamma\,\lambda_{k+1}}}{k\lambda_{k+1} + \gamma} $$

We don't have a good way to bound $\gamma = \|\mathbf{v}-\mathbf{v}_k\|_1$, so we instead notice that the function $\gamma \mapsto \frac{\sqrt{a\gamma}}{b+\gamma}$ is maximized at $\gamma = b$ (check with calculus; a quick verification appears after this proof). So, $\frac{\sqrt{a\gamma}}{b+\gamma} \leq \frac{\sqrt{ab}}{2b}$ for all $\gamma$:

$$ \frac{\sqrt{\gamma\,\lambda_{k+1}}}{k\lambda_{k+1} + \gamma} \leq \frac{\sqrt{(k\lambda_{k+1})\,\lambda_{k+1}}}{2k\lambda_{k+1}} = \frac{1}{2\sqrt{k}} $$

So, overall, we have shown that

$$ \frac{\|\boldsymbol{A}-\boldsymbol{A}_k\|_F}{\text{tr}(\boldsymbol{A})} = \frac{\|\mathbf{v}-\mathbf{v}_k\|_2}{\|\mathbf{v}\|_1} \leq \frac{1}{2\sqrt{k}} $$

$\blacksquare$
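
As promised, the quick calculus check (our addition): for constants $a, b > 0$, differentiating $f(\gamma) = \frac{\sqrt{a\gamma}}{b+\gamma}$ gives

$$ f'(\gamma) = \frac{\sqrt{a}\,\frac{b+\gamma}{2\sqrt{\gamma}} - \sqrt{a\gamma}}{(b+\gamma)^2} = \frac{\sqrt{a}\,(b-\gamma)}{2\sqrt{\gamma}\,(b+\gamma)^2} $$

which is positive for $\gamma < b$ and negative for $\gamma > b$, so $f$ is maximized at $\gamma = b$, where $f(b) = \frac{\sqrt{ab}}{2b}$.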

We also import the following theorem, which controls the expected error from our low-rank approximation, and which we do not prove:

Theorem 1: Expected Projection Error

Sample $\boldsymbol{S} \in \mathbb{R}^{n \times (k+p)}$ with i.i.d. $\mathcal{N}(0,1)$ entries. Let $\boldsymbol{Q}$ be any orthonormal basis for the column span of $\boldsymbol{A}\boldsymbol{S}$ (e.g. computed via a QR decomposition). Then,

$$ \mathbb{E}[\|(\boldsymbol{I}-\boldsymbol{Q}\boldsymbol{Q}^\intercal)\boldsymbol{A}\|_F^2] \leq \left(1+{\textstyle\frac{k}{p-1}}\right)\|\boldsymbol{A}-\boldsymbol{A}_k\|_F^2 $$

Proof. This is almost, but not exactly, the statement of Theorem 10.5 in Halko et al. (2011): Theorem 10.5 bounds the expected Frobenius norm (not its square). However, at the bottom of page 57 and the top of page 58 of this version of the paper, we see that the proof of Theorem 10.5 actually establishes

$$ \left(\mathbb{E}[\|(\boldsymbol{I}-\boldsymbol{Q}\boldsymbol{Q}^\intercal)\boldsymbol{A}\|_F^2]\right)^{1/2} \leq \ldots \leq \left(1+{\textstyle\frac{k}{p-1}}\right)^{1/2}\|\boldsymbol{\Sigma}_2\|_F $$

where $\|\boldsymbol{\Sigma}_2\|_F = \|\boldsymbol{A}-\boldsymbol{A}_k\|_F$; squaring both sides completes the import of this theorem.

$\blacksquare$

Now, we proceed to the analysis of the variance of Hutch++:

Theorem 2: Variance of Hutch++

Fix parameters $k$ and $\ell$. Let $p = k+1$ and construct $\boldsymbol{Q} \in \mathbb{R}^{n \times (k+p)}$ as in Theorem 1 (so that $k+p = 2k+1$). Then let

$$ \text{Hutch++}(\boldsymbol{A}) := \text{tr}(\boldsymbol{Q}^\intercal\boldsymbol{A}\boldsymbol{Q}) + H_\ell((\boldsymbol{I}-\boldsymbol{Q}\boldsymbol{Q}^\intercal)\boldsymbol{A}) $$

Then,

$$ \mathbb{E}[\text{Hutch++}(\boldsymbol{A})] = \text{tr}(\boldsymbol{A}) \qquad \text{and} \qquad \text{Var}[\text{Hutch++}(\boldsymbol{A})] \leq \frac{1}{k\ell}\text{tr}^2(\boldsymbol{A}) $$

Proof.

We first look at the expectation by using the tower rule, the expectation of Hutchinson's Estimator, the cyclic property of trace, and the linearity of trace:

$$\begin{aligned} \mathbb{E}[\text{Hutch++}(\boldsymbol{A})] &= \mathbb{E}[\text{tr}(\boldsymbol{Q}^\intercal\boldsymbol{A}\boldsymbol{Q}) + H_\ell((\boldsymbol{I}-\boldsymbol{Q}\boldsymbol{Q}^\intercal)\boldsymbol{A})] \\ &= \mathbb{E}[\text{tr}(\boldsymbol{Q}^\intercal\boldsymbol{A}\boldsymbol{Q})] + \mathbb{E}\left[\mathbb{E}[H_\ell((\boldsymbol{I}-\boldsymbol{Q}\boldsymbol{Q}^\intercal)\boldsymbol{A})\,|\,\boldsymbol{Q}]\right] \\ &= \mathbb{E}[\text{tr}(\boldsymbol{Q}^\intercal\boldsymbol{A}\boldsymbol{Q})] + \mathbb{E}[\text{tr}((\boldsymbol{I}-\boldsymbol{Q}\boldsymbol{Q}^\intercal)\boldsymbol{A})] \\ &= \mathbb{E}[\text{tr}(\boldsymbol{A}\boldsymbol{Q}\boldsymbol{Q}^\intercal)] + \mathbb{E}[\text{tr}(\boldsymbol{A}-\boldsymbol{Q}\boldsymbol{Q}^\intercal\boldsymbol{A})] \\ &= \mathbb{E}[\text{tr}(\boldsymbol{A}\boldsymbol{Q}\boldsymbol{Q}^\intercal) + \text{tr}(\boldsymbol{A}-\boldsymbol{Q}\boldsymbol{Q}^\intercal\boldsymbol{A})] \\ &= \text{tr}(\boldsymbol{A}) \end{aligned}$$

We now move on to the variance bound. First, recall the conditional variance formula, which says that for any random variables $X, Y$,

$$ \text{Var}[Y] = \mathbb{E}[\text{Var}[Y|X]] + \text{Var}[\mathbb{E}[Y|X]] $$

Taking $Y = \text{Hutch++}(\boldsymbol{A})$ and $X = \boldsymbol{Q}$, we can bound the variance of Hutch++:

$$ \text{Var}[\text{Hutch++}] = \mathbb{E}[\text{Var}[\text{Hutch++}\,|\,\boldsymbol{Q}]] + \text{Var}[\mathbb{E}[\text{Hutch++}\,|\,\boldsymbol{Q}]] $$

The second term above is always zero:

$$\begin{aligned} \text{Var}[\mathbb{E}[\text{Hutch++}(\boldsymbol{A})\,|\,\boldsymbol{Q}]] &= \text{Var}\left[\mathbb{E}[\text{tr}(\boldsymbol{Q}^\intercal\boldsymbol{A}\boldsymbol{Q}) + H_\ell((\boldsymbol{I}-\boldsymbol{Q}\boldsymbol{Q}^\intercal)\boldsymbol{A})\,|\,\boldsymbol{Q}]\right] \\ &= \text{Var}\left[\text{tr}(\boldsymbol{Q}^\intercal\boldsymbol{A}\boldsymbol{Q}) + \text{tr}((\boldsymbol{I}-\boldsymbol{Q}\boldsymbol{Q}^\intercal)\boldsymbol{A})\right] \\ &= \text{Var}[\text{tr}(\boldsymbol{A})] \\ &= 0 \end{aligned}$$

So, we just need to bound the first term from the conditional variance formula:

$$\begin{aligned} \mathbb{E}[\text{Var}[\text{Hutch++}\,|\,\boldsymbol{Q}]] &= \mathbb{E}[\text{Var}[\text{tr}(\boldsymbol{Q}^\intercal\boldsymbol{A}\boldsymbol{Q}) + H_\ell((\boldsymbol{I}-\boldsymbol{Q}\boldsymbol{Q}^\intercal)\boldsymbol{A})\,|\,\boldsymbol{Q}]] \\ &= \mathbb{E}[\text{Var}[H_\ell((\boldsymbol{I}-\boldsymbol{Q}\boldsymbol{Q}^\intercal)\boldsymbol{A})\,|\,\boldsymbol{Q}]] \\ &\leq {\textstyle\frac{2}{\ell}}\,\mathbb{E}[\|(\boldsymbol{I}-\boldsymbol{Q}\boldsymbol{Q}^\intercal)\boldsymbol{A}\|_F^2] \\ &\leq {\textstyle\frac{2}{\ell}}\left(1+{\textstyle\frac{k}{p-1}}\right)\|\boldsymbol{A}-\boldsymbol{A}_k\|_F^2 \end{aligned}$$

where the first inequality uses the variance of Hutchinson's Estimator, and the second uses Theorem 1. Recall that we set $p = k+1$. After substituting and simplifying the expression, we find

$$ \mathbb{E}[\text{Var}[\text{Hutch++}\,|\,\boldsymbol{Q}]] \leq {\textstyle\frac{4}{\ell}}\,\|\boldsymbol{A}-\boldsymbol{A}_k\|_F^2 $$

And by Lemma 3, we complete the proof:

$$ {\textstyle\frac{4}{\ell}}\,\|\boldsymbol{A}-\boldsymbol{A}_k\|_F^2 \leq {\textstyle\frac{4}{\ell}} \cdot {\textstyle\frac{1}{4k}}\,\text{tr}^2(\boldsymbol{A}) = {\textstyle\frac{1}{k\ell}}\,\text{tr}^2(\boldsymbol{A}) $$

$\blacksquare$

Part 4: Practitioner Advice

The variance bound in Theorem 2 still has two free parameters in it, which leaves a bit of ambiguity for practitioners. So, we quickly analyze the choice of $k$ and $\ell$ that minimizes our bound on the variance.

Formally, suppose we are allowed to compute exactly $m$ matrix-vector products with $\boldsymbol{A}$. Then,

  • Computing $\boldsymbol{Q}$ uses $k+p = 2k+1$ products

  • Computing $\text{tr}(\boldsymbol{Q}^\intercal\boldsymbol{A}\boldsymbol{Q})$ uses $k+p = 2k+1$ products

  • Computing $H_\ell((\boldsymbol{I}-\boldsymbol{Q}\boldsymbol{Q}^\intercal)\boldsymbol{A})$ uses $\ell$ products

So, we have $m = 2(2k+1) + \ell = 4k + 2 + \ell$. We can then verify that $k = \frac{m-2}{8}$ and $\ell = \frac{m}{2} - 1$ minimize the variance bound. Notably, this is equivalent to setting $\ell = 4k$, which looks more intuitive. This produces a final variance of $\text{Var}[\text{Hutch++}(\boldsymbol{A})] \leq \frac{16}{(m-2)^2}\text{tr}^2(\boldsymbol{A})$. A code sketch with these choices appears below.
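
Putting this together, here is a minimal MATLAB sketch of plain Hutch++ with these parameter choices (our illustrative code; it assumes a symmetric matrix A, uses Gaussian queries, and needs m large enough that k >= 1):

function trace_est = simple_hutchplusplus(A, m)
    k = floor((m-2)/8);           % rank of the low-rank sketch, k = (m-2)/8
    ell = m - 2*(2*k+1);          % remaining queries for Hutchinson's Estimator, ell ~ m/2 - 1
    S = randn(size(A,1), 2*k+1);  % k+p = 2k+1 sketching vectors (p = k+1)
    G = randn(size(A,1), ell);    % Hutchinson query vectors
    [Q,~] = qr(A*S, 0);           % orthonormal basis for the span of A*S (2k+1 products)
    Y = A*Q;                      % another 2k+1 products
    % tr(Q'*A*Q) exactly, plus Hutchinson's Estimator on the residual (I - Q*Q')*A;
    % trace(G'*(A*G)) uses the remaining ell products.
    trace_est = trace(Q'*Y) + (trace(G'*(A*G)) - trace((G'*Q)*(Y'*G))) / ell;
end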

Nyström-Hutch++

One additional approach to implementing Hutch++ for PSD matrices considers the Nyström method, where we use the approximation

$$ \tilde{\boldsymbol{A}}_k = (\boldsymbol{A}\boldsymbol{Q})(\boldsymbol{Q}^\intercal\boldsymbol{A}\boldsymbol{Q})^{-1}(\boldsymbol{A}\boldsymbol{Q})^\intercal $$

Note that we can compute $\tilde{\boldsymbol{A}}_k = \boldsymbol{Y}(\boldsymbol{Q}^\intercal\boldsymbol{Y})^{-1}\boldsymbol{Y}^\intercal$ where $\boldsymbol{Y} = \boldsymbol{A}\boldsymbol{Q}$ contains all the matrix-vector products with $\boldsymbol{A}$. We can then compute $\text{tr}(\tilde{\boldsymbol{A}}_k)$ efficiently using the cyclic property of trace:

$$\begin{aligned} \text{tr}(\tilde{\boldsymbol{A}}_k) &= \text{tr}(\boldsymbol{Y}(\boldsymbol{Q}^\intercal\boldsymbol{Y})^{-1}\boldsymbol{Y}^\intercal) \\ &= \text{tr}((\boldsymbol{Q}^\intercal\boldsymbol{Y})^{-1}(\boldsymbol{Y}^\intercal\boldsymbol{Y})) \end{aligned}$$

The matrix inversion is only for a $(k+p) \times (k+p)$ matrix, which is cheap. We can then define the NYS-Hutch++ estimator as

$$ \text{NYS-Hutch++}(\boldsymbol{A}) := \text{tr}((\boldsymbol{Q}^\intercal\boldsymbol{Y})^{-1}(\boldsymbol{Y}^\intercal\boldsymbol{Y})) + H_\ell(\boldsymbol{A} - \tilde{\boldsymbol{A}}_k) $$

We can write out this Hutchinson's Estimator step as

$$ H_\ell(\boldsymbol{A} - \tilde{\boldsymbol{A}}_k) = \frac{1}{\ell}\left(\text{tr}(\boldsymbol{G}^\intercal\boldsymbol{A}\boldsymbol{G}) - \text{tr}(\boldsymbol{R}^\intercal(\boldsymbol{Q}^\intercal\boldsymbol{Y})^{-1}\boldsymbol{R})\right) $$

where $\boldsymbol{G}$ has random sign-bit entries and $\boldsymbol{R} := \boldsymbol{Y}^\intercal\boldsymbol{G}$. Formally, we can write this out as the following algorithm:

Algorithm 1: NYS-Hutch++

---

input: Matrix-vector oracle access to $\boldsymbol{A} \in \mathbb{R}^{d \times d}$. Number of queries $m$.

output: Approximation to $\text{tr}(\boldsymbol{A})$

  1. Sample $\boldsymbol{S} \in \mathbb{R}^{d \times \frac{m}{4}}$ and $\boldsymbol{G} \in \mathbb{R}^{d \times \frac{m}{2}}$ with i.i.d. $\{+1, -1\}$ entries.

  2. Compute an orthonormal basis $\boldsymbol{Q} \in \mathbb{R}^{d \times \frac{m}{4}}$ for the span of $\boldsymbol{A}\boldsymbol{S}$

  3. Compute $\boldsymbol{Y} = \boldsymbol{A}\boldsymbol{Q}$ and $\boldsymbol{R} = \boldsymbol{Y}^\intercal\boldsymbol{G}$

  4. Return $\text{tr}((\boldsymbol{Q}^\intercal\boldsymbol{Y})^{-1}(\boldsymbol{Y}^\intercal\boldsymbol{Y})) + \frac{2}{m}\left(\text{tr}(\boldsymbol{G}^\intercal\boldsymbol{A}\boldsymbol{G}) - \text{tr}(\boldsymbol{R}^\intercal(\boldsymbol{Q}^\intercal\boldsymbol{Y})^{-1}\boldsymbol{R})\right)$

Or, in MATLAB:

function trace_est = simple_nystrom_hutchplusplus(A, num_queries)
    % Random sign-bit sketching and query matrices (entries in {+1,-1}).
    S = 2*randi(2,size(A,1),ceil(num_queries/4))-3;
    G = 2*randi(2,size(A,1),floor(num_queries/2))-3;
    [Q,~] = qr(A*S,0);        % orthonormal basis for the span of A*S (m/4 products)
    Y = A*Q;                  % another m/4 products
    R = Y' * G;
    QYinv = pinv(Q' * Y);     % pseudo-inverse of the small (m/4)-by-(m/4) core matrix
    % Nystrom trace estimate plus Hutchinson's Estimator on the residual;
    % trace(G'*A*G) uses the remaining m/2 products.
    trace_est = trace(QYinv * (Y'*Y)) + 1/size(G,2)*(trace(G'*A*G) - trace(R'*QYinv*R));
end
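
For example, a quick test of this function might look like (our usage sketch):

% Build a random PSD test matrix and compare against the exact trace.
B = randn(1000); A = B*B';                  % PSD by construction
est = simple_nystrom_hutchplusplus(A, 64);  % 64 matrix-vector products total
fprintf('exact %.4g, estimate %.4g\n', trace(A), est);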

Epilogue

Code

Quiz Questions

If you have read Hutch++, then quiz yourself by trying to answer some intuitive high-level questions:

1. Why is the constant-factor approximation on our low-rank approximation $\|A - \tilde{A}_k\|_F \leq 2\|A - A_k\|_F$ sufficient? Why don't we need a relative-error approximation like $\|A - \tilde{A}_k\|_F \leq (1+\varepsilon)\|A - A_k\|_F$?

Give an intuitive answer, not a mathematical one.

Footnotes

[1] We analyze variance on this page for convenience, but all of these confidence intervals also hold with high probability (i.e. with logarithmic dependence on the failure probability); this is analyzed in the formal publications.

References

Haim Avron and Sivan Toledo. Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix. Journal of the ACM, 58(2), 2011.

Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.

Raphael A. Meyer, Cameron Musco, Christopher Musco, and David P. Woodruff. Hutch++: Optimal stochastic trace estimation. Symposium on Simplicity in Algorithms (SOSA), 2021.