2026-06-08

The Evolution of Random Matrix Theory: From Marchenko-Pastur to Extensions for Data with Nonlinear Correlations

How the spectral analysis of sample covariance matrices became the theoretical foundation of modern machine learning

Table of Contents

0. Introduction

Recently, machine learning models, including deep learning, have undergone dramatic evolution. However, behind this progress lay "mysteries" that could not be fully explained by classical statistics.

According to the common sense of the bias-variance tradeoff based on conventional machine learning theories of generalization (statistical learning theories such as Rademacher complexity and VC dimension), it was believed that in the "over-parameterized regime" where the number of parameters exceeds the number of data points, models would overfit, leading to a significant deterioration in predictive accuracy (generalization performance) on unseen data. However, since the late 2010s, a phenomenon contrary to conventional predictions has been observed in many models: in the over-parameterized regime, generalization performance, after initially deteriorating, begins to improve again. This is the Double Descent phenomenon.

Image
Double descent phenomenon in a polynomial regression model
Image
Conceptual diagram of the double descent phenomenon

This mysterious phenomenon has been theoretically unraveled and is now in the spotlight as a new theoretical foundation for modern high-dimensional statistics and machine learning. This is a field of mathematics and statistics known as Random Matrix Theory (RMT).

What is Random Matrix Theory (RMT)?

To begin with, as the name suggests, Random Matrix Theory is a field that studies the properties of "matrices whose entries are random variables."

Random Matrices

For example, if you generate a 5×55 \times 5 matrix AA where each entry follows an independent normal (Gaussian) distribution with a mean of 0 and a variance of 1, you will get a different matrix like the following every time.

A=[0.3260.2920.2240.4380.3762.0590.1901.5200.0031.2520.5171.3541.7292.0410.3001.1290.1550.4261.0750.7231.5371.8690.2932.540.605] A = \begin{bmatrix} -0.326 & 0.292 & -0.224 & 0.438 & -0.376 \\ -2.059 & -0.190 & -1.520 & 0.003 & -1.252 \\ 0.517 & 1.354 & -1.729 & -2.041 & -0.300 \\ 1.129 & 0.155 & -0.426 & -1.075 & 0.723 \\ 1.537 & -1.869 & 0.293 & 2.54 & -0.605 \end{bmatrix}

Founded in the early 20th century by Wishart in statistics and Wigner in nuclear physics, this theory possesses an astonishing property: "From a collection of seemingly random entries, beautifully deterministic laws emerge as the size (dimension) of the matrix approaches infinity."

Column: The Origins of RMT and Wigner's Semicircle Law

Random Matrix Theory originates from multivariate statistics in the 1920s (Wishart's study of covariance matrices) and quantum mechanics in the 1950s. Physicist Eugene Wigner considered the following random symmetric matrix XX to model the complex energy levels of heavy atomic nuclei:

X:=A+AT2n X := \frac{A + A^{\mathsf{T}}}{\sqrt{2n}}

Surprisingly, it was proven that as the dimension nn of this matrix approaches infinity, the histogram of its eigenvalues draws a perfectly deterministic "semicircle." This is called Wigner's Semicircle Law. A completely deterministic and beautiful law emerges from a collection of random entries. This is part of the charm of RMT.

import numpy as np
import matplotlib.pyplot as plt

# Simulation of Wigner's Semicircle Law (GOE)
n = 1000
A = np.random.randn(n, n)
GOE = (A + A.T) / np.sqrt(2 * n) # Symmetrization
eigvals_GOE = np.linalg.eigvalsh(GOE)

plt.hist(eigvals_GOE, bins=50, density=True, color="skyblue", edgecolor="black")
plt.title("Eigenvalues of GOE (Wigner Semicircle)")
plt.show()
Image
Simulation of Wigner Semicircle Law

Why Could RMT Solve the Mysteries of Machine Learning?

So, why has this mathematical theory become an indispensable tool for analyzing the generalization performance of machine learning? It is because the "datasets" and the "weight parameters" of neural networks handled in machine learning can be viewed as massive random matrices.

To put the difference simply, while conventional statistical learning theory assumes the worst-case scenario ("no matter how unlucky the data is, the error will not exceed this bound") to obtain an upper bound on the generalization error (Worst-case analysis), RMT uses probabilistic limits to describe the "average behavior" of the generalization error (Average-case analysis). Therefore, the theoretical formulas for generalization error derived using RMT successfully and mathematically rigorously explained the double descent phenomenon in the over-parameterized regime.

Roadmap of This Article

In this article, we will explain in an easy-to-understand manner how this random matrix theory has evolved and become the foundation of modern machine learning, following its historical progression and theoretical breakthroughs:

  1. The Era of Independence: The Marchenko-Pastur Law (1967)
  2. Breaking the Constraints: Establishing a Solid Theoretical Foundation by Silverstein (1995)
  3. Establishment of Deterministic Equivalents (2007-2011): A groundbreaking paradigm shift by Hachem, Loubaton, & Najim (2007) and Rubio & Mestre (2011).
  4. Extension to Nonlinear Feature Vectors (Recent years): Theoretical extensions to data with nonlinear correlations by Louart et al.
  5. Analysis of Test Risk and Unraveling the "Double Descent" Phenomenon: How RMT solved the mysteries of machine learning.

Now, let's step into the profound world of Random Matrix Theory.

1. Sample Covariance Matrices and the Stieltjes Transform

The primary object of study in high-dimensional data analysis is the Sample Covariance Matrix.

What is a Sample Covariance Matrix?

Given a data matrix with NsN_{\mathrm{s}} samples and pp features:

X:=[x1,x2,,xNs]TRNs×p X := [\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_{N_{\mathrm{s}}}]^{\mathsf{T}} \in \mathbb{R}^{N_{\mathrm{s}} \times p}

the sample covariance matrix is defined as follows:

Σ^:=1NsXTX=1Nsi=1NsxixiT \hat{\Sigma} := \frac{1}{N_{\mathrm{s}}} X^{\mathsf{T}}X = \frac{1}{N_{\mathrm{s}}} \sum_{i=1}^{N_{\mathrm{s}}} \boldsymbol{x}_i \boldsymbol{x}_i^{\mathsf{T}}

This matrix Σ^\hat{\Sigma} is a crucial matrix that represents the variance structure of the data and plays a central role in training and evaluating the performance of machine learning models. In RMT, the primary interest lies in the distribution of eigenvalues when the matrix dimension pp becomes large. Let the pp eigenvalues of Σ^\hat{\Sigma} be s1,,sps_1, \dots, s_p. Its Empirical Spectral Distribution (ESD) can be written as:

μΣ^(x)=1pj=1pδ(xsj)\mu_{\hat{\Sigma}}(x) = \frac{1}{p} \sum_{j=1}^p \delta(x - s_j)

However, it is extremely difficult to directly calculate the eigenvalues of a massive random matrix and mathematically determine its distribution function. Here, a powerful analytical tool that supported the development of RMT comes into play: the Stieltjes Transform. The Stieltjes transform Sμ(z)S_\mu(z) for a probability measure μ\mu is defined for a complex number zCRz \in \mathbb{C} \setminus \mathbb{R} as follows:

Sμ(z)=1xzdμ(x)(zCR)S_\mu(z) = \int \frac{1}{x-z} d\mu(x) \quad (z \in \mathbb{C} \setminus \mathbb{R})

When this definition is applied to the empirical spectral distribution μΣ^\mu_{\hat{\Sigma}} of the sample covariance matrix, it can be rewritten into a surprisingly simple matrix form:

SΣ^(z)=1pj=1p1sjz=1pTr[(Σ^zIp)1]S_{\hat{\Sigma}}(z) = \frac{1}{p} \sum_{j=1}^p \frac{1}{s_j - z} = \frac{1}{p} \mathrm{Tr}\left[ (\hat{\Sigma} - zI_p)^{-1} \right]

This (Σ^zIp)1(\hat{\Sigma} - zI_p)^{-1} is called the Resolvent Matrix. In RMT, rather than taking the difficult approach of "calculating eigenvalues directly," we follow a refined route: "find the limit of the trace of the resolvent (the Stieltjes transform), and from there, back-calculate the eigenvalue distribution using complex analysis techniques." Furthermore, when evaluating the test error of machine learning models such as ridge regression, the trace of this resolvent appears directly in the formulas. Therefore, deriving the Stieltjes transform directly translates to predicting machine learning performance.

2. The Era of Independence: The Marchenko-Pastur Law (1967)

In classical statistics, we consider the limit where the number of samples NsN_{\mathrm{s}} \to \infty while the number of features pp remains fixed. In this case, by the law of large numbers, the eigenvalue distribution of Σ^\hat{\Sigma} simply converges to the eigenvalues of the population covariance matrix Σ\Sigma.

However, in modern machine learning, it is common for the number of features pp and the number of samples NsN_{\mathrm{s}} to be of a similar magnitude. Actually, in such situations, classical statistical theories cannot be applied. Even when the population covariance matrix Σ\Sigma is the identity matrix IpI_p (where all eigenvalues are 11), the eigenvalue distribution of Σ^\hat{\Sigma} converges to a non-trivial distribution with a certain width.

In 1967, Marchenko and Pastur showed that in the "proportional asymptotic regime," where both the matrix dimension pp and the number of samples NsN_{\mathrm{s}} go to infinity while their ratio γ:=p/Ns(0,)\gamma := p/N_{\mathrm{s}} \in (0, \infty) is kept constant, the eigenvalue distribution of this matrix converges to a deterministic limit distribution.

Marchenko-Pastur distribution (1967)

Assume that all entries XijX_{ij} of the matrix XX are independent and identically distributed (i.i.d.) with mean 00 and variance σx2\sigma_x^2. That is, all xi\boldsymbol{x}_i are independent, and xiN(0,σx2Ip)\boldsymbol{x}_i \sim \mathcal{N}(0, \sigma_x^2 I_p). Then, in the limit as p,Nsp, N_{\mathrm{s}} \to \infty such that p/Nsγ(0,)p/N_{\mathrm{s}} \to \gamma \in (0, \infty), the empirical spectral distribution of Σ^\hat{\Sigma} weakly converges almost surely to a distribution with the following density function gMP(x)g_{\mathrm{MP}}(x):

gMP(x,γ,σx2)=12πσx2(γ+x)(xγ)xγ1[γ,γ+](x)+(11γ)+δ(x) g_{\mathrm{MP}}(x, \gamma, \sigma_x^2) = \frac{1}{2\pi\sigma_x^2} \frac{\sqrt{(\gamma_+ - x)(x - \gamma_-)}}{x\gamma} \mathbf{1}_{[\gamma_-, \gamma_+]}(x) + \left(1-\frac{1}{\gamma}\right)^+ \delta(x)

where γ±=σx2(1±γ)2\gamma_{\pm} = \sigma_x^2(1 \pm \sqrt{\gamma})^2

(Original paper: Distribution of eigenvalues for some sets of random matrices)

As can be seen from this result, when the number of samples NsN_{\mathrm{s}} and the number of features pp are of similar magnitude, the eigenvalues of the sample covariance matrix Σ^\hat{\Sigma} do not concentrate at a single point σx2\sigma_x^2. Instead, due to sampling fluctuations, the eigenvalues spread out into a bulk centered around σx2\sigma_x^2 with a width of [γ,γ+][\gamma_-, \gamma_+]. However, if pp is sufficiently smaller than NsN_{\mathrm{s}}, the eigenvalues will be narrowly distributed around σx2\sigma_x^2, making it consistent with classical statistical theory.

Image

What is noteworthy here is how the distribution changes depending on the dimensional ratio γ:=p/Ns\gamma := p/N_{\mathrm{s}} (the ratio of the number of parameters to the number of samples).

  • γ<1\gamma < 1 (Under-parameterized): Since p<Nsp < N_{\mathrm{s}}, the bulk of the eigenvalues forms a neat mountain away from zero.
  • γ=1\gamma = 1 (Interpolation threshold): Since p=Nsp = N_{\mathrm{s}}, we have γ=0\gamma_- = 0, resulting in a singularity (peak) where the distribution diverges to infinity near zero. This implies that the smallest eigenvalue of the matrix approaches zero, making it extremely ill-conditioned. This is the direct cause of the "interpolation peak (divergence of error) in double descent," which will be discussed later.
  • γ>1\gamma > 1 (Over-parameterized): Since p>Nsp > N_{\mathrm{s}}, the matrix becomes rank-deficient. As a result, a massive number of exactly zero eigenvalues are generated (the Dirac delta function δ(x)\delta(x) in the second term of the equation), and the remaining non-zero eigenvalues form a bulk on the right side.

While this result was extremely powerful, it was based on the strong assumption that "all entries are completely independent (the population covariance matrix is Σ=σx2Ip\Sigma = \sigma_x^2 I_p)." In reality, data generally has complex correlation structures between features, meaning that this assumption of independence could not be applied to the performance analysis of practical machine learning models.

The Brilliance of RMT: Universality

The reason RMT results, including the Marchenko-Pastur Law, boast immense practicality in machine learning lies in their property of Universality. Whether the elements of the data matrix are generated from a neat "Gaussian distribution (normal distribution)" or from a "coin toss (Rademacher distribution)" where +1+1 and 1-1 appear with equal probability, as long as the mean and variance (moments up to the second order) match, the eigenvalue distribution in the limit perfectly aligns with the exact same Marchenko-Pastur Law curve. Because they do not depend on the specific shape of the data distribution (higher-order moments), theoretical formulas hold strong predictive power even for real-world data.

# Simulation of a Wishart matrix (Marchenko-Pastur law)
p, N_s = 500, 1000 # Case where γ = 0.5
X = np.random.randn(p, N_s)
W = np.dot(X, X.T) / N_s
eigvals_W = np.linalg.eigvalsh(W)

# Simulation with coin tosses (Rademacher distribution)
X_rademacher = np.random.choice([-1, 1], size=(p, N_s))
W_rademacher = np.dot(X_rademacher, X_rademacher.T) / N_s
eigvals_W_rademacher = np.linalg.eigvalsh(W_rademacher)

plt.hist(eigvals_W, bins=50, density=True, alpha=0.6, label="Gaussian")
plt.hist(eigvals_W_rademacher, bins=50, density=True, alpha=0.6, label="Rademacher (Coin Toss)")
plt.title("Universality of Marchenko-Pastur Law")
plt.legend()
plt.show()
# Even in a finite experiment, both converge to almost the same distribution (they are completely identical in the limit)
Image
Simulation showing the universality of the Marchenko-Pastur law

3. Breaking the Constraints: Establishing a Solid Theoretical Foundation by Silverstein (1995)

The standard Marchenko-Pastur Law introduced in Chapter 2 is very beautiful and powerful, but it is often applied to cases where "all features are completely independent (the true population covariance matrix is Σ=σx2Ip\Sigma = \sigma_x^2 I_p)." However, in real-world image or audio data, complex correlations exist between pixels and features, so the assumption of independence cannot be applied to the analysis of practical machine learning models.

In fact, the original 1967 paper by Marchenko and Pastur already presented the prototype of the equation for cases considering the correlation structure of data. However, its mathematical proof imposed overly strict constraints that were too harsh to apply to real data, such as "the absolute fourth moments of the matrix entries must be finite."

The mathematician who broke through this theoretical wall and completed a solid foundation leading to modern machine learning was Jack Silverstein in his 1995 research. He considered a setting where the data matrix XX is generated using the true covariance matrix Σp\Sigma_p as follows:

X=ZΣp1/2 X = Z \Sigma_p^{1/2}

Here, ZRNs×pZ \in \mathbb{R}^{N_{\mathrm{s}} \times p} is a noise matrix where each element is i.i.d. (independent and identically distributed). In this case, the sample covariance matrix can be written as Σ^=1NsΣp1/2ZTZΣp1/2\hat{\Sigma} = \frac{1}{N_{\mathrm{s}}} \Sigma_p^{1/2} Z^{\mathsf{T}} Z \Sigma_p^{1/2}. (In the limit as NsN_{\mathrm{s}} \to \infty with pp fixed, 1NsZTZ\frac{1}{N_{\mathrm{s}}} Z^{\mathsf{T}} Z converges to the identity matrix IpI_p, so the eigenvalue distribution of Σ^\hat{\Sigma} converges to that of Σp\Sigma_p.)

Silverstein's greatest achievement was completely removing strict constraints like the 4th moment and proving the theorem under the minimal assumption that "each element of the noise matrix ZZ only needs to have a mean of 00 and a variance of 11 (2nd moment)." Specifically, under a natural high-dimensional limit where p,Nsp, N_{\mathrm{s}} \to \infty (ratio p/Nsγ(0,)p/N_{\mathrm{s}} \to \gamma \in (0, \infty)) and the empirical eigenvalue distribution of the true covariance matrix Σp\Sigma_p converges to some limit distribution HH, he rigorously showed that the spectral distribution of Σ^\hat{\Sigma} converges almost surely (Strong Convergence) to a deterministic limit distribution.

He then proved that the Stieltjes transform m(z)m(z) in that limit is the unique solution to the following integral equation:

m(z)=1τ(1γγzm(z))zdH(τ)(zCR) m(z) = \int \frac{1}{\tau \left(1 - \gamma - \gamma z m(z)\right) - z} dH(\tau) \quad (z \in \mathbb{C} \setminus \mathbb{R})

The brilliance of this result lies in "discarding complete independence of elements and allowing the true correlation structure to be incorporated into the theory as the limit distribution HH." Even for troublesome data with extreme outliers, as long as the variance can be defined, it was guaranteed that the limit behavior of the eigenvalue distribution could be captured through this self-consistent equation.

However, for practitioners and machine learning engineers, there was still a barrier to putting this abstract "integral form" equation directly onto computers and using it for model performance prediction. This is because the analytical form of the limit distribution HH is often unknown.

Nevertheless, the fact established here by Silverstein—that the asymptotic behavior of random matrices can be rigorously described by an equation based on the true correlation structure—provided extremely important inspiration to researchers in the 2000s. Eventually, this idea of an integral equation was reconstructed into a "practical trace-form equation" using the finite-dimensional matrix Σp\Sigma_p as is, which directly connected to a powerful paradigm shift called the "Deterministic Equivalent," explained in the next chapter.

4. Introduction of Deterministic Equivalents (2007-2011)

Thanks to Silverstein's research, it became possible to calculate the "trace (a quantity related to the sum of eigenvalues)" of a sample covariance matrix constructed from correlated data. However, for analyzing actual machine learning algorithms, this was insufficient.

For example, consider Ridge Regression, which adds regularization to linear regression:

β^=argminβRp{1NsyXβ22+λβ22} \hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta} \in \mathbb{R}^p} \left\{ \frac{1}{N_{\mathrm{s}}} \|\boldsymbol{y} - X \boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_2^2 \right\}

The solution that finds the optimal parameter vector β^\hat{\boldsymbol{\beta}} from the training data XX and the target variable y\boldsymbol{y} is given in the following closed-form:

β^=(Σ^+λIp)11NsXTy \hat{\boldsymbol{\beta}} = (\hat{\Sigma} + \lambda I_p)^{-1} \frac{1}{N_{\mathrm{s}}}X^{\mathsf{T}} \boldsymbol{y}

When trying to calculate the "test error (generalization error)" on new unseen data using these parameters, you encounter not just a simple Stieltjes transform: 1pTr[(Σ^+λIp)1]\frac{1}{p} \mathrm{Tr}[(\hat{\Sigma} + \lambda I_p)^{-1}], but more complex terms such as:

Bias=λ2βT(Σ^+λIp)1Σ(Σ^+λIp)1βVariance=σ2NsTr[ΣΣ^(Σ^+λIp)2]\begin{align} \text{Bias} &= \lambda^2\boldsymbol{\beta}_*^{\mathsf{T}} (\hat{\Sigma} + \lambda I_p)^{-1} \Sigma (\hat{\Sigma} + \lambda I_p)^{-1} \boldsymbol{\beta}_* \\ \text{Variance} &= \frac{\sigma^2}{N_{\mathrm{s}}}\mathrm{Tr}\left[\Sigma \hat{\Sigma}(\hat{\Sigma} + \lambda I_p)^{-2} \right] \end{align}

Such terms could not be analyzed simply by finding the limit of the trace.

The breakthrough came with a groundbreaking paradigm called the "Deterministic Equivalent (DE)," introduced by Walid Hachem et al. in 2007, and Fernando Rubio and Xavier Mestre et al. in 2011.

They proved that not only the limit of the trace, but "the inverse matrix itself of a massive random matrix" could be replaced by a mathematically tractable non-random (deterministic) matrix.

What is a Deterministic Equivalent?

For a sequence of random matrices ApA_p and a sequence of deterministic (non-random) matrices BpB_p, if for any deterministic matrix CpC_p with bounded spectral norm (Cp<\|C_p\| < \infty),

limp1pTr[Cp(ApBp)]=0almost surely \lim_{p \to \infty} \frac{1}{p} \mathrm{Tr}[C_p (A_p - B_p)] = 0 \quad \text{almost surely}

holds, we denote ApBpA_p \asymp B_p (BpB_p is a deterministic equivalent of ApA_p). This allows us to replace the asymptotic behavior of complex random matrices with the computation of tractable deterministic (i.e., non-random) matrices.

In 2011, for correlated data X=Σ12ZX = \Sigma^{\frac12} Z, Rubio and Mestre derived the deterministic equivalent of the resolvent—the core of the ridge regression solution—as the following formula.

Rubio and Mestre's Theorem (2011)

When the data matrix is given by X=Σ12ZX = \Sigma^{\frac12} Z, in the limit as Ns,pN_{\mathrm{s}}, p \rightarrow \infty with p/Nsγ(0,)p/N_{\mathrm{s}} \to \gamma \in (0, \infty), the following holds:

λ(Σ^+λIp)1κλ(Σ+κλIp)1 \lambda (\hat{\Sigma} + \lambda I_p)^{-1} \asymp \kappa_\lambda (\Sigma + \kappa_\lambda I_p)^{-1}

Here, the effective regularization parameter κλ\kappa_\lambda is the unique solution to the following self-consistent equation:

1NsTr[Σ(Σ+κλIp)1]+λκλ=1 \frac{1}{N_{\mathrm{s}}} \mathrm{Tr}\left[\Sigma (\Sigma + \kappa_\lambda I_p)^{-1}\right] + \frac{\lambda}{\kappa_\lambda} = 1

(Original paper: Spectral Convergence for a General Class of Random Matrices)

[What makes this so powerful?] When attempting to calculate the prediction error of ridge regression, the inverse matrix of a random matrix, (Σ^+λIp)1(\hat{\Sigma} + \lambda I_p)^{-1}, inevitably appears in the equations, causing the analysis to hit a dead end. However, by applying this "deterministic equivalent" formula, the random Σ^\hat{\Sigma} can instantly be replaced with the true Σ\Sigma and the scalar κλ\kappa_\lambda. This makes it possible to accurately calculate the true performance (such as generalization error) of machine learning algorithms on high-dimensional data using just pen and paper.

5. To the Frontiers of Nonlinearity

Rubio & Mestre's theory was overwhelmingly powerful, but it had a fatal weakness. It required the assumption that "the data generation process must be linear (X=Σ12ZX = \Sigma^{\frac12} Z)." Modern machine learning, such as the generators in Generative Adversarial Networks (GANs) or Random Feature Models, transforms data using highly advanced nonlinear functions. The assumption of linearity completely fails here.

However, recent studies have shattered this wall. For instance, Louart et al. proved that even if the data consists of "Concentrated Feature Vectors" generated through a Lipschitz continuous nonlinear function, Rubio & Mestre's theorem (the linear deterministic equivalent) can be applied exactly as it is (recently, it was shown that similar results hold even under more relaxed conditions). This became the bridge connecting "classical RMT" and "machine learning with modern complex nonlinear models."

Concentrated Feature Vectors

Suppose feature vectors are generated in the form xi=Ω(zi)x_i = \Omega(z_i) through a latent variable ziN(0,Iq)z_i \sim \mathcal{N}(0, I_q) and a Lipschitz continuous nonlinear function Ω()\Omega(\cdot). Feature vectors generated in this way are called "Concentrated Feature Vectors."

Deterministic Equivalent for Concentrated Feature Vectors

Assume the data xix_i is generated by the Lipschitz continuous function described above, and let its population covariance matrix be Σ:=E[xixiT]\Sigma := \mathbb{E}[x_i x_i^{\mathsf{T}}]. Then, in the limit as p,Nsp, N_{\mathrm{s}} \to \infty with p/Nsγ(0,)p/N_{\mathrm{s}} \to \gamma \in (0, \infty), the deterministic equivalent of the resolvent is given as follows:

λ(Σ^+λIp)1κλ(Σ+κλIp)1 \lambda(\hat{\Sigma} + \lambda I_p)^{-1} \asymp \kappa_\lambda(\Sigma + \kappa_\lambda I_p)^{-1}

Here, κλ\kappa_\lambda is the unique non-negative solution to the following equation:

1NsTr[(Σ+κλIp)1Σ]+λκλ=1 \frac{1}{N_{\mathrm{s}}} \mathrm{Tr}\left[ (\Sigma + \kappa_\lambda I_p)^{-1} \Sigma \right] + \frac{\lambda}{\kappa_\lambda} = 1

The fact that exactly the same deterministic equivalent as Rubio & Mestre holds in the limit, even for highly nonlinear features, powerfully corroborates the Universality of RMT.

6. Analysis of Test Risk and Unraveling the "Double Descent" Phenomenon

By obtaining the ultimate weapon known as the Deterministic Equivalent (DE), we became capable of rigorously calculating the test risk (expected prediction error on unseen data) of machine learning models like ridge regression.

The test risk Rλ,Σ^R_{\lambda, \hat{\Sigma}} can be decomposed into three terms: the Bias of the predictive model, the Variance, and the unobservable noise (σ2\sigma^2). Applying Rubio & Mestre's theorem and complex analysis techniques to this equation yields the following theorem.

Theorem 1 (Deterministic Equivalent of Test Risk)

In the limit as p,Nsp, N_{\mathrm{s}} \to \infty with p/Nsγ(0,)p/N_{\mathrm{s}} \to \gamma \in (0, \infty), the test risk of ridge regression Rλ,Σ^R_{\lambda, \hat{\Sigma}} asymptotically converges to the following deterministic equivalent Rλ,ΣDER_{\lambda, \Sigma}^{\mathrm{DE}}:

Rλ,ΣDE:=κλ21ηκβTΣ(Σ+κλIp)2βDE of Bias+σ2ηκ1ηκDE of Variance+σ2 R_{\lambda, \Sigma}^{\mathrm{DE}} := \underbrace{\frac{\kappa_\lambda^2}{1 - \eta_\kappa} \boldsymbol{\beta}_*^{\mathsf{T}} \Sigma(\Sigma + \kappa_\lambda I_p)^{-2} \boldsymbol{\beta}_*}_{\text{DE of Bias}} + \underbrace{\sigma^2 \frac{\eta_\kappa}{1 - \eta_\kappa}}_{\text{DE of Variance}} + \sigma^2

Here, ηκ:=1NsTr[Σ2(Σ+κλIp)2]\eta_\kappa := \frac{1}{N_{\mathrm{s}}} \mathrm{Tr}\left[\Sigma^2(\Sigma + \kappa_\lambda I_p)^{-2}\right] is the "normalized effective degrees of freedom."

Image
How the formula in Theorem 1 completely explains the double descent phenomenon

This formula is precisely what completely explains mathematically why the "Double Descent" phenomenon occurs.

In Chapter 2 on the Marchenko-Pastur Law, we confirmed that when the dimensional ratio γ=p/Ns\gamma = p/N_{\mathrm{s}} is 11 (i.e., the number of parameters equals the number of samples), an infinite peak forms near zero in the eigenvalue distribution (the matrix becomes extremely ill-conditioned).

Looking at the formula in Theorem 1, when the regularization λ0\lambda \to 0 and the number of parameters approaches the number of samples (p=Nsp = N_{\mathrm{s}}), the effective degrees of freedom ηκ\eta_\kappa approaches 11. Then, the denominator of the variance term (1ηκ)(1 - \eta_\kappa) approaches zero, causing the variance (sensitivity to noise) to blow up to infinity. The ill-conditioning of the matrix clearly manifests mathematically as the divergence of the test risk (Interpolation peak).

However, as the number of parameters is further increased and we enter the over-parameterized regime (pNsp \gg N_{\mathrm{s}}), ηκ\eta_\kappa moves away from 11 again. The expressiveness of the model increases, and optimization progresses in a well-conditioned subspace, causing the test risk to decrease once more. This is the mechanism of Double Descent.

7. The Practical Value of RMT: Predicting Performance Without Model Training

The result of Theorem 1 possesses not only theoretical beauty but also extremely high practical value.

Conventionally, to evaluate the generalization performance of a machine learning model, it was necessary to generate large datasets, repeat model training (involving matrix inversions or optimization loops) many times, and empirically calculate the average test error. This is very costly in terms of computational resources.

However, using this asymptotic test risk formula Rλ,ΣDER_{\lambda, \Sigma}^{\mathrm{DE}}, you can completely skip the model training process. By simply substituting the spectral information of the target data's "population covariance matrix Σ\Sigma" and the "target vector β\boldsymbol{\beta}_*" into the self-consistent equations, you can instantly and analytically plot the test risk curve for any data size NsN_{\mathrm{s}} and regularization parameter λ\lambda.

Despite being a theory predicated on p,Nsp, N_{\mathrm{s}} \to \infty, it has been demonstrated that even in finite, small-scale systems, the theoretical prediction curves given by RMT match empirical test errors astonishingly accurately.

8. Application to Free Probability Theory and Optimization Dynamics

The benefits of RMT are not limited to predicting generalization performance after training is complete. In fact, it also brings a dramatic effect to the analysis of how optimization algorithms like Gradient Descent (GD) converge step by step (learning dynamics).

In the learning process of machine learning models, what determines the shape of the valleys in the loss function (Loss Landscape) is the second derivative, the Hessian matrix HH. The Hessian of a neural network is often modeled as the sum H=H0+H1H = H_0 + H_1, where H0H_0 is the "structural term of the data" and H1H_1 is the "noise term from sampling."

Here, a major problem arises. Because matrix multiplication is non-commutative (ABBAAB \neq BA), we cannot calculate the eigenvalue distribution of the sum of multiple random matrices using the "convolution integral" used for adding regular random variables. The solution to this is Free Probability Theory in RMT.

Column: R-Transform and Asymptotic Freeness

As the dimension of random matrices approaches infinity, the property where the directions of the eigenvectors of the matrices become completely random (uncorrelated) relative to each other is called Asymptotic Freeness. When this property holds, by using a special function called the R-Transform, you can directly add the eigenvalue distributions of H0H_0 and H1H_1 to strictly calculate the eigenvalue distribution of H=H0+H1H = H_0 + H_1. (This is conceptually like extending characteristic functions or Laplace transforms from regular random variables to matrices.)

This spectral analysis of the Hessian enables performance evaluation of algorithms not in the worst-case, but in the average-case. Using RMT's theoretical formulas, it is even possible to derive in advance things like "how many steps Gradient Descent will take to reduce the error by how much (Halting time)" and "the optimal learning rate (step size) to finish learning as fast as possible."

9. Conclusion

Random Matrix Theory began with the analysis of the eigenvalue distribution of simple noise matrices (Marchenko-Pastur Law). Later, realistic correlation structures (Σ\Sigma) were incorporated by Silverstein and others, the analytical tool of the Stieltjes transform was refined, and a powerful paradigm called Deterministic Equivalents was established by Hachem, Rubio, Mestre, and others. Today, RMT theories continue to evolve daily, including extensions to nonlinear features by Louart et al.

Currently, Random Matrix Theory has transcended the boundaries of pure mathematics and physics, acting as one of the most powerful theoretical weapons in the analysis of deep learning dynamics and the elucidation of generalization performance.

References