$\sqrt{\tfrac{2}{\ell}}\,\|A\|_F$ with high probability.[1] Typically in computer science we are given an error tolerance $\varepsilon$, and want to know how many matrix-vector products we need to compute (i.e. how small can we make $\ell$ while still being accurate?). In order to build a guarantee here, we assume $A$ is positive semi-definite (PSD), which in turn implies that $\|A\|_F \le \mathrm{tr}(A)$. With this in mind, we can prove the standard result for Hutchinson's Estimator:
Let $A \in \mathbb{R}^{d \times d}$ be a PSD matrix. Then,

$$\mathrm{Var}[H_\ell(A)] \le \frac{2}{\ell}\,\mathrm{tr}^2(A)$$
So, if we want to ensure $H_\ell(A) \in (1 \pm \varepsilon)\,\mathrm{tr}(A)$ with good probability, we need the standard deviation to satisfy $\sqrt{\tfrac{2}{\ell}}\,\mathrm{tr}(A) \le \varepsilon\,\mathrm{tr}(A)$, so we need $\ell = O(\tfrac{1}{\varepsilon^2})$ samples.
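To make this concrete, here is a minimal numpy sketch of Hutchinson's Estimator $H_\ell(A) = \frac{1}{\ell}\sum_{i=1}^{\ell} x_i^\top A x_i$. The names `hutchinson`, `matvec`, `d`, and `ell` are our own, and `matvec` stands in for whatever black-box product $x \mapsto Ax$ we are given:

```python
import numpy as np

def hutchinson(matvec, d, ell, rng=None):
    """Estimate tr(A) using ell matrix-vector products with A."""
    rng = np.random.default_rng(rng)
    estimates = []
    for _ in range(ell):
        x = rng.standard_normal(d)       # Gaussian probe vector
        estimates.append(x @ matvec(x))  # E[x^T A x] = tr(A)
    return np.mean(estimates)
```

For a dense matrix one would call, e.g., `hutchinson(lambda x: A @ x, d, ell)`.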
This was a section in the first draft of the paper, but it was unfortunately cut. It builds a slow, methodical intuition for the geometry that Hutch++ takes advantage of.
Hutchinson's Estimator is powerful, and gives a nice rate of convergence. However, the proof of Lemma 2 has a suspicious step. If we write out three lines of analysis:

$$\begin{aligned} \mathrm{std}[H_\ell(A)] &= \sqrt{\tfrac{2}{\ell}}\,\|A\|_F && \text{(line 1)} \\ &\le \sqrt{\tfrac{2}{\ell}}\,\mathrm{tr}(A) && \text{(line 2)} \\ &= \varepsilon\,\mathrm{tr}(A) && \text{(line 3, taking } \ell = \tfrac{2}{\varepsilon^2}\text{)} \end{aligned}$$
The analysis of line 1 is always tight, since we know the variance of $H_\ell(A)$ exactly.
The analysis of line 3 is always tight, since we just set $\ell$ to the smallest value that gets error $\varepsilon$.
The analysis of line 2 is not always tight.
The second line is tight only if $\|A\|_F \approx \mathrm{tr}(A)$. That is, the analysis above only tells us that Hutchinson's Estimator needs $O(\tfrac{1}{\varepsilon^2})$ samples if $A$ is the kind of matrix that has $\|A\|_F \approx \mathrm{tr}(A)$. So, what kind of matrix is that?
If $\|A\|_F \approx \mathrm{tr}(A)$, then $A$ has very few large eigenvalues, followed by much smaller eigenvalues.
Try to show this yourself. If you're having trouble, look at the argument below.
Proof. First, we rewrite $\|A\|_F$ and $\mathrm{tr}(A)$ explicitly in terms of the eigenvalues of $A$. Let $v = [\lambda_1, \lambda_2, \ldots, \lambda_d]$ be the eigenvalues of $A$. Then, recall that

$$\|A\|_F = \|v\|_2 \qquad\text{and}\qquad \mathrm{tr}(A) = \sum_{i=1}^{d} \lambda_i = \|v\|_1$$

where the last equality follows from $A$ being PSD. So, we are given that $\|v\|_2 \approx \|v\|_1$: this is now just a statement about vectors in $\mathbb{R}^d$!
We now need to show that if a vector $v$ has $\|v\|_2 \approx \|v\|_1$, then $v$ is nearly sparse: it has only a few large values.
There are many ways to build this intuition; we will look at the extreme case, and assume that $\|v\|_2 = \|v\|_1$ exactly. As an exercise, prove that $v$ then has at most one nonzero element in it.
■
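To see the two extremes numerically, here is a quick toy check of our own, comparing a perfectly flat vector of eigenvalues against a 1-sparse one:

```python
import numpy as np

d = 1000
flat = np.full(d, 1.0 / d)   # perfectly flat spectrum
spiky = np.zeros(d)          # a single large eigenvalue
spiky[0] = 1.0

for name, v in [("flat", flat), ("spiky", spiky)]:
    print(name, np.linalg.norm(v, 1), np.linalg.norm(v, 2))
# flat:  ||v||_1 = 1.0, ||v||_2 = 1/sqrt(d) ≈ 0.032  (norms far apart)
# spiky: ||v||_1 = 1.0, ||v||_2 = 1.0                (norms equal)
```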
We now finish the discussion with a comparison of two ideas:
So, Hutchinson's Estimator only needs to use many samples (i.e. the full $O(\tfrac{1}{\varepsilon^2})$ samples) if $A$ has very special structure: it has a small number of large eigenvalues.
If a matrix has a small number of large eigenvalues, the trace must be well approximated by the sum of those eigenvalues.
With this, we conclude the fundamental intuition behind Hutch++:
If $A$ is the kind of matrix that is hard for Hutchinson's Estimator to handle, then $\mathrm{tr}(A)$ is well approximated by the top few eigenvalues of $A$.
This leads us to pick the following rough design of an algorithm (sketched in code after the list):
Find a good low-rank approximation $\tilde{A}_k$
Notice that $\mathrm{tr}(A) = \mathrm{tr}(\tilde{A}_k) + \mathrm{tr}(A - \tilde{A}_k)$
Compute $\mathrm{tr}(\tilde{A}_k)$ exactly
Approximate $\mathrm{tr}(A - \tilde{A}_k)$ with Hutchinson's Estimator
Return $\text{Hutch++}(A) = \mathrm{tr}(\tilde{A}_k) + H_\ell(A - \tilde{A}_k)$
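Assuming we can only touch $A$ through matrix-vector products, a minimal numpy sketch of this design might look as follows. All names are our own, and we use a plain width-$k$ Gaussian sketch for simplicity; the principled construction of $Q$ (and the choice of sketch width $k+p$) comes from Theorem 1 below:

```python
import numpy as np

def hutch_pp(matvec, d, k, ell, rng=None):
    """Rough Hutch++ sketch: exact trace of a low-rank piece,
    plus Hutchinson's Estimator on the remainder."""
    rng = np.random.default_rng(rng)

    # Step 1: low-rank approximation A_k~ = Q Q^T A, where Q is an
    # orthonormal basis for the range of A S.
    S = rng.standard_normal((d, k))
    AS = np.column_stack([matvec(S[:, i]) for i in range(k)])
    Q, _ = np.linalg.qr(AS)

    # Step 3: compute tr(A_k~) = tr(Q^T A Q) exactly.
    AQ = np.column_stack([matvec(Q[:, i]) for i in range(Q.shape[1])])
    trace_top = np.trace(Q.T @ AQ)

    # Step 4: Hutchinson's Estimator on (I - Q Q^T) A.
    total = 0.0
    for _ in range(ell):
        x = rng.standard_normal(d)
        y = x - Q @ (Q.T @ x)   # project the probe away from range(Q)
        total += y @ matvec(y)  # y^T A y estimates tr((I - Q Q^T) A)
    return trace_top + total / ell
```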
In the next section, we state a formal version of Claim 1 (see Lemma 3), show how to compute such a matrix $\tilde{A}_k$ (see $QQ^\top A$ in Theorem 1), and bound the complexity of the Hutch++ estimator (see Theorem 2).
Special thanks to Joel A. Tropp for working most of this out, and encouraging us to look into sharpening the constants in the analysis of Hutch++.
By expressing the variance of Hutch++ with all of its constants laid bare, and by using a very simple analysis, this page will hopefully allow practitioners to easily set the exact parameters in Hutch++ code.
Before bounding the variance of Hutch++, we include the proof of one lemma and import another theorem:
Let $A \in \mathbb{R}^{d \times d}$ be a PSD matrix. Let $A_k$ be the best rank-$k$ approximation to $A$. Then, $\|A - A_k\|_F \le \frac{1}{2\sqrt{k}}\,\mathrm{tr}(A)$.
Proof. Let $v = [\lambda_1, \ldots, \lambda_d]$ be the vector of eigenvalues of $A$, sorted so that $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d \ge 0$. Then $v_k := [\lambda_1, \ldots, \lambda_k, 0, \ldots, 0]$ is the vector of eigenvalues of $A_k$. With this notation, we just need to prove that $\|v - v_k\|_2 \le \frac{1}{2\sqrt{k}}\|v\|_1$. The rest of this proof is just Lemma 7 from this paper.
Hölder's inequality states that $\sum_{i=1}^d a_i b_i \le \|a\|_p \cdot \|b\|_q$ for any vectors $a, b$ so long as $\frac{1}{p} + \frac{1}{q} = 1$. If we take $a = b = v - v_k$, $p = 1$, and $q = \infty$, we find that $\|v - v_k\|_2^2 \le \|v - v_k\|_1 \|v - v_k\|_\infty$. Further, note that by our construction of $v_k$, $\|v\|_1 = \|v_k\|_1 + \|v - v_k\|_1$. Then, writing $\gamma := \|v - v_k\|_1$, and using $\|v - v_k\|_\infty = \lambda_{k+1}$ along with $\|v_k\|_1 \ge k\lambda_{k+1}$,

$$\frac{\|v - v_k\|_2}{\|v\|_1} \le \frac{\sqrt{\|v - v_k\|_1 \, \|v - v_k\|_\infty}}{\|v_k\|_1 + \|v - v_k\|_1} \le \frac{\sqrt{\gamma \lambda_{k+1}}}{k\lambda_{k+1} + \gamma}$$
We don't have a good way to bound $\gamma = \|v - v_k\|_1$, so we instead notice that the function $\gamma \mapsto \frac{\sqrt{a\gamma}}{b + \gamma}$ is maximized at $\gamma = b$ (check with calculus). So, $\frac{\sqrt{a\gamma}}{b + \gamma} \le \frac{\sqrt{ab}}{2b}$ for all $\gamma$:

$$\frac{\sqrt{\gamma \lambda_{k+1}}}{k\lambda_{k+1} + \gamma} \le \frac{\sqrt{(k\lambda_{k+1})\,\lambda_{k+1}}}{2k\lambda_{k+1}} = \frac{1}{2\sqrt{k}}$$
So, overall, we have shown that
$$\frac{\|A - A_k\|_F}{\mathrm{tr}(A)} = \frac{\|v - v_k\|_2}{\|v\|_1} \le \frac{1}{2\sqrt{k}}$$
■
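As a sanity check (a toy experiment of our own), we can verify the lemma numerically on a random PSD matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 200, 10
G = rng.standard_normal((d, d))
A = G @ G.T                                 # a random PSD matrix

lam = np.sort(np.linalg.eigvalsh(A))[::-1]  # eigenvalues, largest first
lhs = np.linalg.norm(lam[k:])               # = ||A - A_k||_F for PSD A
rhs = lam.sum() / (2 * np.sqrt(k))          # = tr(A) / (2 sqrt(k))
print(lhs <= rhs)                           # True, as the lemma promises
```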
We also import the following theorem, which controls the expected error from our low-rank approximation, and which we do not prove:
Sample $S \in \mathbb{R}^{n \times (k+p)}$ with i.i.d. $\mathcal{N}(0,1)$ entries. Let $Q$ be any orthonormal basis for the range of $AS$ (e.g. from a QR decomposition). Then,

$$\mathbb{E}\left[\|(I - QQ^\top)A\|_F^2\right] \le \left(1 + \frac{k}{p-1}\right)\|A - A_k\|_F^2$$
Proof. This is almost exactly the statement of Theorem 10.5 in Halko et al. (2011), but not quite: Theorem 10.5 bounds the expected Frobenius norm, not its square. Instead, at the bottom of page 57 and top of page 58 of this version of the paper, we see that the proof of Theorem 10.5 actually proves
$$\left(\mathbb{E}\left[\|(I - QQ^\top)A\|_F^2\right]\right)^{1/2} \le \ldots \le \left(1 + \frac{k}{p-1}\right)^{1/2}\|\Sigma_2\|_F$$
where $\|\Sigma_2\|_F^2 = \|A - A_k\|_F^2$, so squaring both sides completes the importing of this theorem.
■
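As another toy sanity check of our own, we can estimate the expectation on the left-hand side by Monte Carlo and compare it to the bound:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, p = 100, 5, 6
G = rng.standard_normal((n, n))
A = G @ G.T                                  # test matrix

lam = np.sort(np.linalg.eigvalsh(A))[::-1]
best = np.sum(lam[k:] ** 2)                  # ||A - A_k||_F^2

errs = []
for _ in range(500):
    S = rng.standard_normal((n, k + p))      # Gaussian sketch
    Q, _ = np.linalg.qr(A @ S)               # orthonormal basis for range(AS)
    E = A - Q @ (Q.T @ A)                    # (I - Q Q^T) A
    errs.append(np.sum(E ** 2))              # squared Frobenius norm
print(np.mean(errs) <= (1 + k / (p - 1)) * best)  # True in our runs
```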
Now, we proceed to the analysis of the variance of Hutch++:
Fix parameters $k$ and $\ell$. Let $p = k+1$ and construct $Q \in \mathbb{R}^{n \times (k+p)}$ as in Theorem 1. Then let

$$\text{Hutch++}(A) := \mathrm{tr}(Q^\top A Q) + H_\ell\left((I - QQ^\top)A\right)$$
Then,
$$\mathbb{E}[\text{Hutch++}(A)] = \mathrm{tr}(A) \qquad\text{and}\qquad \mathrm{Var}[\text{Hutch++}(A)] \le \frac{1}{k\ell}\,\mathrm{tr}^2(A)$$
Proof.
We first look at the expectation by using the Tower Law, the expectation of Hutchinson's Estimator, the cyclic property of trace, and the linearity of trace:

$$\begin{aligned} \mathbb{E}[\text{Hutch++}(A)] &= \mathbb{E}\left[\mathrm{tr}(Q^\top A Q) + H_\ell((I - QQ^\top)A)\right] \\ &= \mathbb{E}[\mathrm{tr}(Q^\top A Q)] + \mathbb{E}\left[\mathbb{E}[H_\ell((I - QQ^\top)A) \mid Q]\right] \\ &= \mathbb{E}[\mathrm{tr}(Q^\top A Q)] + \mathbb{E}[\mathrm{tr}((I - QQ^\top)A)] \\ &= \mathbb{E}[\mathrm{tr}(AQQ^\top)] + \mathbb{E}[\mathrm{tr}(A - QQ^\top A)] \\ &= \mathbb{E}\left[\mathrm{tr}(AQQ^\top) + \mathrm{tr}(A - QQ^\top A)\right] \\ &= \mathrm{tr}(A) \end{aligned}$$

We now move on to the variance bound. First, recall the Conditional Variance Formula, which says that for any random variables $X, Y$,
$$\mathrm{Var}[Y] = \mathbb{E}[\mathrm{Var}[Y \mid X]] + \mathrm{Var}[\mathbb{E}[Y \mid X]]$$

Taking $Y = \text{Hutch++}(A)$ and $X = Q$, we can bound the variance of Hutch++:

$$\mathrm{Var}[\text{Hutch++}] = \mathbb{E}[\mathrm{Var}[\text{Hutch++} \mid Q]] + \mathrm{Var}[\mathbb{E}[\text{Hutch++} \mid Q]]$$
The second term above is always zero:

$$\begin{aligned} \mathrm{Var}[\mathbb{E}[\text{Hutch++}(A) \mid Q]] &= \mathrm{Var}\left[\mathbb{E}[\mathrm{tr}(Q^\top A Q) + H_\ell((I - QQ^\top)A) \mid Q]\right] \\ &= \mathrm{Var}\left[\mathrm{tr}(Q^\top A Q) + \mathrm{tr}((I - QQ^\top)A)\right] \\ &= \mathrm{Var}[\mathrm{tr}(A)] \\ &= 0 \end{aligned}$$

So, we just need to bound the first term from the Conditional Variance Formula:

$$\begin{aligned} \mathbb{E}[\mathrm{Var}[\text{Hutch++} \mid Q]] &= \mathbb{E}\left[\mathrm{Var}[\mathrm{tr}(Q^\top A Q) + H_\ell((I - QQ^\top)A) \mid Q]\right] \\ &= \mathbb{E}\left[\mathrm{Var}[H_\ell((I - QQ^\top)A) \mid Q]\right] \\ &\le \frac{2}{\ell}\,\mathbb{E}\left[\|(I - QQ^\top)A\|_F^2\right] \\ &\le \frac{2}{\ell}\left(1 + \frac{k}{p-1}\right)\|A - A_k\|_F^2 \end{aligned}$$

where the first inequality uses the variance of Hutchinson's Estimator, and the second uses Theorem 1. Recall that we set $p = k+1$, so $1 + \frac{k}{p-1} = 2$. Substituting this and Lemma 3, then simplifying, we find

$$\mathrm{Var}[\text{Hutch++}(A)] \le \frac{4}{\ell}\,\|A - A_k\|_F^2 \le \frac{4}{\ell} \cdot \frac{\mathrm{tr}^2(A)}{4k} = \frac{1}{k\ell}\,\mathrm{tr}^2(A)$$

■
The variance bound in Theorem 2 still has two free parameters in it, which leaves a bit of ambiguity for practitioners. So, we quickly analyze the choice of $k$ and $\ell$ that minimizes our bound on the variance.
Formally, suppose we are allowed to compute exactly $m$ matrix-vector products with $A$. Then,
Computing $Q$ uses $k + p = 2k+1$ products
Computing $\mathrm{tr}(Q^\top A Q)$ uses $k + p = 2k+1$ products
Computing $H_\ell((I - QQ^\top)A)$ uses $\ell$ products
So, we have $m = 2(2k+1) + \ell = 4k + 2 + \ell$. We can then verify that $k = \frac{m-2}{8}$ and $\ell = \frac{m-2}{2}$ minimizes the variance. Notably, this is equivalent to setting $\ell = 4k$, which looks more intuitive. This produces a final variance of $\mathrm{Var}[\text{Hutch++}(A)] \le \frac{16}{(m-2)^2}\,\mathrm{tr}^2(A)$.
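In code, this budget split might look like the following small helper (the function name and the integer rounding are our own):

```python
def hutch_pp_budget(m):
    """Split a budget of m matrix-vector products per the analysis above."""
    k = max(1, (m - 2) // 8)  # rank of the low-rank piece, ~ (m - 2) / 8
    ell = m - 2 - 4 * k       # remaining probes for Hutchinson, ~ (m - 2) / 2
    return k, ell
```

For example, `hutch_pp_budget(100)` returns `(12, 50)`, and indeed $4 \cdot 12 + 2 + 50 = 100$.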
If you have read Hutch++, then quiz yourself by trying to answer some intuitive high-level questions:
1. Why is the constant-factor approximation on our low-rank approximation $\|A - \tilde{A}_k\|_F \le 2\|A - A_k\|_F$ sufficient? Why don't we need a relative-error approximation like $\|A - \tilde{A}_k\|_F \le (1+\varepsilon)\|A - A_k\|_F$?
[1] We analyze variance on this page for convenience, but all of these confidence intervals also hold with high probability (i.e. with logarithmic dependence on the failure probability); this is analyzed extensively in the formal publications.