The Problem
Consider an industrial set of measurements, like a parameter from a batch of products. If we assume that
the realisations of our measured random variable are IID, we expect that the resulting distribution is gaussian.
However, if the underlying process has genuinely multiple sources, the resulting distribution will be a mixture of gaussians, or a mixture of other distributions. Such process may occur if the there is a significant difference in the originating raw material
batches.
This is the reason, why it is important to be able to identify a potential bi/multi-modality in the observed distribution. Arguably, the
best method would be to visually inspect all of the distributions, which is sometimes not possible. A visual inspection can be
replaced by a shape recognition algorithm or similar. The resources to run such an algorithm may be limited or unavailable
due to other reasons. This is why one may want to rely on “older”/classical methods such as the usual hypothesis tests.
There are multiple statistical tools and tests to verify multimodality. The most common ones are the Hardigan’s dip test and the Silverman’s test. We will look and implement the latter one.
The core idea
The Silverman’s hypothesis tests whether the underlying distribution has $k$ or less modes ($H_0$) or more than $k$ modes ($H_1$). The main idea consists of two big steps
- Using KDE, smooth out the data using the $h_\text{crit.}$, where it is the threshold, at which the KDE becomes exactly $k$-modal
- Perform the bootstrap using this bandwidth. If many of the samples yield a higher-modal distribution, there is more evidence to reject the $H_0$.
The Kernel Density Estimation
In Silverman’s test, the KDE method is extensively used. In practice, this means smoothing out our histogram/data. Intuitively, this means approximating every “peak” of our histogram by a Gaussian “hat” and summing over all the range. Mathematically, this means to create a new function $\hat{f}_h(x)$ defined by $$ \hat{f}(x) = \frac{1}{Nh}\sum_i^N K \Bigl( \frac{x- x_i}{h} \Bigr) $$ ,where $N$ is the sample size of the dataset.
The critical bandwidth
The core parameter of the KDE is the bandwidth, which determines how wide is the “hat” we’re fitting to each data point. As a result, the higher is the bandwidth, the more smoothing we observe. A low $h$ means the KDE would “react” to every noise bump, resulting in a function with many local maxima/bumps. On the other hand, if the bandwidth is too wide, the histogram smoothing will be too smooth and will not capture all of the modes of our distirbution.
Given the parameter of the null hypothesis $k$, the first thing to do is to the critical bandwidth, so that it is “barely” giving a $k$-mode distribution. Mathematically, we write it as $$ h_\text{crit} = \text{inf} ( h: \hat{f}_h \text{ has } k \text{ modes or less} ) $$
That is, this $h_\text{crit}$ is just at the edge - if we reduce it even slightly, the number of modes will increase from $k$. Perfectly, the number of modes of $\hat{f}$ with $h_\text{crit}$ is $k$ and if we add a tiny “perturbation” $h_\text{crit}+\epsilon$, the number of modes becomes $k+1$.
The search of the $h_\text{crit}$ is a binary search:
- Start with a big $h$
- If the number $k$ of the target modes is larger, we reduce it by half
- If the new number of the modes is smaller than $k$, reduce it by half. If larger, multiply it by 2.
- Repeat the steps until we obtain the $h_\text{crit}$.
The number of peaks can be obtained in different ways, but we use the default find_peaks() method from scipy.
We can show how the number of modes changes with the critical bandwidth.
The bootstrap
The second step is the bootstrap step, which inherently uses the previously found $h_\text{crit}$. Given the original data $[x_1, x_2, …, x_N]$, the algorithms can be summarized as follows:
- Generate a sample set from the original data with replacement $[x_{J_1}, x_{J_2}, …, x_{J_n}]$, where ${J_{1}, J_{2}, …, J_{n}}$ a set of randomly generated indices with replacement.
- Add a scaled gaussian noise to each point: $h_\text{crit}z$ with $z\sim \mathcal{N}(0,1)$ to obtain the bootstrapped dataset $x^{\ast}$, ${x_1, x_2^\ast, … , x_n^\ast}$ where $x_i^\ast \coloneqq x_{J_i} + h_\text{crit}z$.
- We can compute the critical bandwidth based on this newly created dataset ${x_i^\ast}$ denoted $h_\text{crit}^*$.
- Repeat the previous steps $B$ times to obtain a bootstrapped set of bandwidth ${ h_{\text{crit}, i}^{*}}$ with $i \in [1,…,B]$.
- Compute the $p$-value as the number of times the critical bandwidth was larger than the one “original” $h_\text{crit}$ divided by $B$. Mathematically $$p = \frac{1}{B} \sum_i^B I(h_{\text{crit},i} > h_\text{crit})$$ where $I$ - the indicator function.
The intuition is the following: if the actual distribution really has more than k peaks, the observed critical bandwidth $h_\text{crit}$ will tend to be large, because substantial smoothing is needed before the KDE has max. k modes. Therefore, under $H_0$, such a large $h_\text{crit}$ is unlikely, leading to a low $p$-value and rejection of $H_0$.
Working example with marimo
Corrections
The Silverman’s test is known to be conservative. That means that there is a chance to fail to accept the $H_1$ (more precisely fail to reject $H_0$). In other words, it is bad at detecting modes if there are actually more than k modes, if these modes are poorly separated or have low weight.
Variance correction
The first correction will be reffered to as the “variance correction”. This correction involves slightly changing the generation of the $[x^*_i]$. The correction scales the generated data to reduce the variance.
The variance of the full quantity $$ \text{Var}(x^*) = \text{Var}(x) + \text{Var}(hz) = \sigma^2 + h^2 $$ where $\sigma^2$ - the variance of the sample and $h$ - the critical width. Therefore the newly generated data is slightly inflated by $h$. Intuitively, the data is more spread out, meaning detecting new modes is harder. To mitigate the issue, we want to scale the second term: $$ x = \bar{x} + \frac{x+hz-\bar{x}}{\sqrt{1+h^2/\sigma^2}} $$ which gives the same total variance $\sigma^2$.
Asymptotic correction
The second correction is known as the Hall-York correction and consists in comparing the bootstrapped $h$ to a scaled $h_\text{crit}$. That is, the $p$-value is computed as $$ p = \frac{1}{B} \sum_i^B I(h_{\text{crit},i} > \lambda h_\text{crit}) $$
This scaling factor is what helps us to deal with the conservativeness. The basic idea that motivates this in the paper is that the critical bandwidth assymptotically differ. Namely, $$\frac{ h_\text{crit}^*}{h_\text{crit}} \neq 1$$ even asymptotically. In order to find the $\lambda$, we can empirically track this discrepancy using Monte Carlo. We can actually use the value from the paper and use the fit format to compute $\lambda$, that depends on our confidence level $\alpha$. A typical value, that we use is around $0.8$.
Testing
Simple case
Let’s first look at how the Silverman’s test works with the simplest case. As the simplest case, we take the number of modes in the $H_0$ to be $k=2$ and generate a bimodal distribution.
The p-value distribution is significantly larger than any $\alpha$ threshold in almost any simulation and the test gives a sure value.
More exotic cases
It is possible to look at how the tests behave at different regimes in the Silverman’s test repo, where multiple notebooks for experimenting can be found.
Conclusion
The goal of the article is to introduce to the notion of the Silverman’s statistical test. We showed the main theoretical concepts and potential room for improvement. In order to experiment with the different data and with different test variants, all of the resources are available in Silverman’s test repo.