The Problem

Consider the following well-known industrial problem - we manufacture samples (possibly with different dimensions, characteristics, etc.) which we then test/verify. What we want is to characterize the measurement process, as it may have multiple sources of variability. Indeed, the measurement process can depend on the operator and the appraisal (i.e. the testing machine), on the part itself and potentially on other things that we include in our observations. In addition to these single factors, interactions could also be significant (e.g. some tester or operator performs better with a particular type of part).

In what follows, we consider two potential sources of error - the one coming from the tester (or operator) and the one coming from the part - plus, potentially, their interaction.

A possible design, where all of the appraisals test all of the parts, is shown below.

Figure: the full crossed design.

Such a design is known in the field of Design of Experiments (DoE) as a fully crossed (full factorial) design. However, we will simply refer to it as a crossed design.
We can extend the idea from two factors to three and more, with, let’s say, different operators using different appraisals to test different parts. Sometimes these factors do not represent “real entities” but rather time. For example, instead of parts being attributed to a tester (parts 1, 2, 3 being tested with appraisal #14), we test a system 10 times within a day, then repeat again after a week. Although it seems different, this setup brings potential variability at different factors (tester $\rightarrow$ part, week $\rightarrow$ day).
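As a small illustration, the run list of such a crossed design can be enumerated in a few lines of Python; the factor names and counts below are made up:

```python
from itertools import product

# Hypothetical study dimensions: 3 appraisals, 5 parts, 2 repeats per cell
appraisals = ["A1", "A2", "A3"]
parts = ["P1", "P2", "P3", "P4", "P5"]
n_repeats = 2

# In a fully crossed design, every appraisal measures every part n_repeats times
runs = [
    {"appraisal": a, "part": p, "repeat": r}
    for a, p in product(appraisals, parts)
    for r in range(n_repeats)
]

print(len(runs))  # 3 appraisals x 5 parts x 2 repeats = 30 runs
```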

The gauge R&R

Within Measurement System Analysis (MSA), the gauge refers to the measurement system of our appraisal acting on different parts. The central concept of gauge analysis is quantifying and characterizing the variances of the factors1. The two main factors of interest are parts and appraisers. The R&R stands for Repeatability and Reproducibility of the gauge. These characteristics are nothing but metrics based on the obtained numbers and will be defined below.

The classical idea of the Gauge R&R MSA within the field of quality control and $6\sigma$-discipline is to decompose the variability of the process into predefined components.

Variance Decomposition

The variability decomposition that is used within the framework is given by $$ \text{Var}(y) = \sigma_A^2 + \sigma_P^2 + \sigma_{PA}^2 + \sigma_e^2 $$

  1. $\sigma_e^2$ the “error” variance originating from repeated measurement/randomness. This is the part known as residual variability or Repeatability - the variation under repeated measurements.
  2. $\sigma_A^2$ - the variability of the appraisal/measurement system (again, we call it the measurement system, but it may refer to the operator or any other concept related to the measurement system).
  3. $\sigma_P^2$ - the part-to-part variability
  4. $\sigma_{PA}^2$ - the appraisal-to-part variability. In other words, the variation of the part-to-appraisal interaction (how well/poorly an appraisal operates with a certain part).

The MSA approach defines other quantities and equips them with meaning:

  1. $\text{Reproducibility} = \sigma_A^2 + \sigma_{PA}^2$ - the variability that is brought by the appraisal.
  2. One defines the Total Gauge R&R as the sum of the appraisal variability and random errors: $\text{Total Gauge R\&R}=\sigma_e^2 + \sigma_A^2 + \sigma_{PA}^2$. This describes the total uncertainty that does not originate from the parts.
  3. As a result one concludes that the total variation is given by $\text{Total Variation}= \text{Var}(y)= \text{Total Gauge R}\&{R} + \text{Part-to-part}$
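As a quick numeric sanity check of these definitions, here is a toy computation with made-up variance components:

```python
# Made-up variance components for illustration (measurement units squared)
var_e = 0.04   # repeatability (residual)
var_A = 0.02   # appraisal
var_PA = 0.01  # part-appraisal interaction
var_P = 0.25   # part-to-part

reproducibility = var_A + var_PA    # appraisal-driven variability
total_grr = var_e + var_A + var_PA  # everything that is not part-to-part
total_variation = total_grr + var_P # the full Var(y)

print(reproducibility, total_grr, total_variation)
```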

ANOVA

The next step is to estimate the defined variances. In order to find the variance components from the sampled data, we use the well-known unbiased estimators - the mean squares.

The repeatability variance term describes the overall error and is given by the unbiased MSE estimator, computed from the residuals about the mean of the repeated measurements $\bar{y}_{ij\cdot}$: $$ \sigma_e^2 \simeq \text{MSE}_e = \frac{\sum _{ijk} (y _{ijk} - \bar{y} _{ij\cdot})^2}{n_P n_A (n_r - 1)} $$

The remaining estimators for different $\sigma^2$ can be computed using the expressions for the mean squares

$$ \text{MSE}_e = \sigma_e^2 \newline \text{MSE} _{PA} = \sigma_e^2 + n _r \sigma _{PA}^2 \newline \text{MSE} _{A} = \sigma_e^2 + n _r \sigma _{PA}^2 + n _P n _r \sigma_A^2 \newline \text{MSE} _{P} = \sigma_e^2 + n _r \sigma _{PA}^2 + n _A n _r \sigma_P^2 $$ from which, for example, we easily obtain the estimator for one of the interaction terms

$$ \sigma_{PA}^2 = \frac{\text{MSE} _{PA} - \text{MSE} _{e}}{n _r} $$

This is the approach we use within the ANOVA framework to estimate the variance terms - iteratively taking the differences between the mean squares and scaling them by the appropriate sample counts.
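A minimal numpy sketch of this ANOVA estimation on simulated balanced data; the true variances are made up, and the effect draws assume normal distributions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_P, n_A, n_r = 10, 3, 2                   # parts, appraisals, repeats
s_P, s_A, s_PA, s_e = 1.0, 0.3, 0.2, 0.1   # assumed true standard deviations

# Simulate y_ijk = mu + P_i + A_j + PA_ij + e_ijk via broadcasting
P = rng.normal(0, s_P, size=(n_P, 1, 1))
A = rng.normal(0, s_A, size=(1, n_A, 1))
PA = rng.normal(0, s_PA, size=(n_P, n_A, 1))
y = 5.0 + P + A + PA + rng.normal(0, s_e, size=(n_P, n_A, n_r))

# Cell, row, column and grand means
m_ij = y.mean(axis=2, keepdims=True)
m_i = y.mean(axis=(1, 2), keepdims=True)
m_j = y.mean(axis=(0, 2), keepdims=True)
m = y.mean()

# Mean squares for the balanced crossed design
MS_e = ((y - m_ij) ** 2).sum() / (n_P * n_A * (n_r - 1))
MS_PA = n_r * ((m_ij - m_i - m_j + m) ** 2).sum() / ((n_P - 1) * (n_A - 1))
MS_A = n_P * n_r * ((m_j - m) ** 2).sum() / (n_A - 1)
MS_P = n_A * n_r * ((m_i - m) ** 2).sum() / (n_P - 1)

# Invert the expected mean square relations (differences may come out negative)
var_e = MS_e
var_PA = (MS_PA - MS_e) / n_r
var_A = (MS_A - MS_PA) / (n_P * n_r)
var_P = (MS_P - MS_PA) / (n_A * n_r)
print(var_e, var_PA, var_A, var_P)
```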

In practice, estimates may become negative due to sampling variability (e.g. if $\text{MSE} _{PA} < \text{MSE} _e$). This issue is well known within the framework, and we will see that iterative, ANOVA-free methods for estimating the variances are not immune to negative-variance issues either.

Interpreting results

In addition to the individual variance components shown before, one defines the central gauge R&R quantity $\sigma _{\text{GRR}}^2 = \sigma_A^2 + \sigma _{PA}^2 + \sigma_e^2$, which is the sum of all of the variations except the part-to-part variation. As a result, one can write the full variance as the sum $$ \sigma _\text{total}^2 = \sigma _P^2 + \sigma _{\text{GRR}}^2 $$

For each variance, it is natural to define the corresponding standard deviation. For example, the repeatability stdev $$ \sigma_e \coloneqq \sqrt{\sigma_e^2} $$ or the total GRR deviation $$ \sigma_\text{GRR} = \sqrt{ \sigma_A^2 + \sigma_{PA}^2 + \sigma_e^2} $$

Therefore, $\sigma_\text{GRR}$ represents the total variation arising when the appraiser intervenes. In other words, it is the standard deviation due to the measurement system.

For a process that is measured, one usually defines some sort of specification limits (otherwise, the measurement process cannot be properly evaluated, as the outcome is not compared against anything). We want the process to be located within our upper and lower specification limits - USL and LSL.

In a “typical” industrial setting, we want the usual process variation to be within a controlled interval, e.g. an $s\sigma$ interval with, say, $s=6$, which is the well-known origin of the $6$-$\sigma$ methodology. In order to understand and interpret the data, we consider a relative quantity that we will call here GtT - the Gauge(R&R)-to-Tolerance ratio - which we define as $$ \text{GtT}_s = s \cdot \frac{\sigma _\text{GRR}}{\text{Tol}}$$

where $\text{Tol}$ is the tolerance, defined as the specification range $\text{Tol} = \text{USL} - \text{LSL}$.

What the quantity actually represents is how well the $6$-$\sigma$ variation “sits” within the control range. In other words, it is the ratio of the variability range to the total control range.
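A toy computation of the ratio, with made-up specification limits and a made-up $\sigma_\text{GRR}$ estimate:

```python
# Hypothetical numbers: specification limits and an estimated GRR stdev
USL, LSL = 10.8, 9.2       # specification limits (assumed units)
sigma_grr = 0.05           # estimated GRR standard deviation
s = 6                      # the "6 sigma" width

tol = USL - LSL            # tolerance = 1.6
gtt = s * sigma_grr / tol  # 6 * 0.05 / 1.6 = 0.1875

print(f"GtT = {gtt:.3f}")
```

With these made-up numbers, the ratio falls in the "acceptable" $0.1 < \text{GtT} < 0.3$ band discussed below.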

The image below attempts to illustrate the concept of the GRR variation being contained in the tolerance.

Figure: the GRR variation contained within the tolerance range.

In this case, the smaller the quantity $\text{GtT}$, the better, as we want a small variability of our appraisal compared to our control range. In practice, we consider the ratio to be good if $\text{GtT}<0.1$ (or $10\%$) and acceptable when $0.1 < \text{GtT} < 0.3$.

Instead of $\sigma_\text{GRR}$ in the numerator of our ratio, we could use the standard deviation derived from another variation contribution, e.g. $\sigma_{P}$. The interpretation of the ratio in that case strongly depends on the nature of the experiment. The parts may have intentionally come from the same batch, in which case we expect our measurement system to measure them as the same. Similarly, it is possible to use differently produced samples, which would respond differently to measurements. Therefore, the ratio $6\sigma_{\text{P}}/\text{Tol}$ may have very different interpretations under different designs.

The Linear model

Up to now, in the discussed design of experiment, we decomposed the total variance into the (pre-)defined components (natural variation, part, appraisal, part-to-appraisal) using the ANOVA method.

It is well known that ANOVA is nothing but a special case of the (general) linear model. Without going into details, this means that our response variable, which we denote $y$, responds to the inputs linearly as $$ y_{ijk} = \mu + P_i + A_j + PA_{ij} + \epsilon_{ijk} $$ where $P_i$ is the effect of part $i$, $A_j$ the effect of appraisal $j$, $PA_{ij}$ the interaction of part $i$ and appraisal $j$, $\epsilon_{ijk}$ the “random” error of the run and $\mu$ the overall mean2.

This linear model is easily estimated via the ANOVA method (as a first approach, means and variances), with mean squares as variance estimates (see the $\text{MSE}$ definitions above).

Mixed models

The framework of mixed models is known to work well with nested, multilevel, tabular data. In our case, the data is slightly more restricted, but nevertheless exhibits some nestedness - parts and appraisals. Indeed, we may fit a mixed model to our experimental data within the linear model $$ y_{ijk} = \mu + P_i + A_j + PA_{ij} + \epsilon_{ijk} $$ where each effect is modeled as

  • $P_i \sim \mathcal{N}(0, \sigma_P^2)$
  • $A_j \sim \mathcal{N}(0, \sigma_A^2)$
  • $PA _{ij} \sim \mathcal{N}(0, \sigma^2 _{PA})$
  • $\epsilon_{ijk} \sim \mathcal{N}(0, \sigma^2 _{e})$

The formulation is very intuitive, as we expect (and want) each of the effects to be centered at 0 (around the overall mean $\mu$), each with its own variance.
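This generative formulation can be sketched directly, e.g. to produce a synthetic long-format dataset; all variances and counts below are made up:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_P, n_A, n_r = 8, 3, 3  # parts, appraisals, repeats (assumed)
mu = 10.0                # overall mean (assumed)

# Draw each random effect from its assumed normal distribution
P = rng.normal(0, 1.0, n_P)
A = rng.normal(0, 0.3, n_A)
PA = rng.normal(0, 0.2, (n_P, n_A))

# Assemble y_ijk = mu + P_i + A_j + PA_ij + e_ijk in long format
rows = []
for i in range(n_P):
    for j in range(n_A):
        for k in range(n_r):
            rows.append({
                "part": f"P{i}",
                "appraisal": f"A{j}",
                "y": mu + P[i] + A[j] + PA[i, j] + rng.normal(0, 0.1),
            })
d = pd.DataFrame(rows)
```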

What changes in the mixed-models formulation is the way we estimate the $\sigma$’s for every effect. In ANOVA, we would run variance estimations and obtain them as “byproducts”. The framework of Multilevel Models, however, is much broader (in the sense of the ability to model phenomena), and the variances $\sigma$’s are part of the model to be estimated, usually computed using the REML method. This is another fundamental difference between the two approaches.

Tabular real-world data often comes with missing and unbalanced entries. This means that some of the nested factor combinations have fewer or no data points. In our case, this would mean that some part-appraisal combinations have different numbers of entries, few entries, or none at all. In contrast to ANOVA, this is naturally handled and taken into account. In practice, this means we can avoid doing a full-factorial/full-rotation for some MSA studies, in order to get a first overview and understanding of our systems.

Another possible improvement concerns the normality assumption. In fact, this assumption has been present in both methods the whole time. In the case of the “original” ANOVA, normality is assumed at all levels. For the considered framework of Multilevel Models this is also the case. It is nevertheless possible to extend the model to Multilevel Generalized Linear Models.

Implementation

The described situation can be gracefully modeled in R. Python, however, is less natural for that. It is nevertheless possible to use the well-known statsmodels package, which implements a part of the multilevel models’ functionality.

We will now quickly go over the structure of the implementation in Python. We have a dataframe that contains the measurement data in the long format. For example, we measure some quantity $V$ (e.g. voltage), with the parts and appraisals specified in the respective columns of a pandas/polars dataframe.

import pandas as pd
import statsmodels.formula.api as smf

# d - long-format DataFrame with columns "y", "appraisal" and "part"
d["appraisal"] = d["appraisal"].astype("category")
d["part"] = d["part"].astype("category")

# create interaction feature
d["appraisal_part"] = (
    d["appraisal"].astype(str) + ":" + d["part"].astype(str)
).astype("category")

# a dummy group trick - a single constant group spanning the whole dataset
d["_grp"] = 1

The model definition using statsmodels must explicitly specify the variance components:

model = smf.mixedlm(
    "y ~ 1",
    data=d,
    groups=d["_grp"],
    re_formula="0",
    vc_formula={
        "part": "0 + C(part)",
        "appraisal": "0 + C(appraisal)",
        "appraisal_part": "0 + C(appraisal_part)",
    },
)

res = model.fit(reml=True, method='lbfgs')

The groups and re_formula arguments specify the random-effects components of the model, which we do not discuss here3.

We can then retrieve the fitted data:

vcomp = pd.Series(res.vcomp, index=list(model.exog_vc.names))

# clip small negative estimates to zero
var_repeat = max(float(res.scale), 0.0)                              # repeatability (residual)
var_part = max(float(vcomp.get("part", 0.0)), 0.0)                   # part-to-part
var_appraisal = max(float(vcomp.get("appraisal", 0.0)), 0.0)         # appraisal
var_interaction = max(float(vcomp.get("appraisal_part", 0.0)), 0.0)  # interaction

var_reprod = var_appraisal + var_interaction  # reproducibility
var_grr = var_repeat + var_reprod             # total gauge R&R
var_total = var_grr + var_part                # total variation
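From these components one can assemble a small summary table with percent contributions; a sketch with made-up variance values standing in for a real fit:

```python
import pandas as pd

# Assumed variance components (in practice, recovered from res.vcomp / res.scale)
var_repeat, var_appraisal, var_interaction, var_part = 0.010, 0.004, 0.002, 0.080

var_reprod = var_appraisal + var_interaction
var_grr = var_repeat + var_reprod
var_total = var_grr + var_part

summary = pd.DataFrame(
    {"variance": [var_repeat, var_reprod, var_grr, var_part, var_total]},
    index=["Repeatability", "Reproducibility", "Total Gauge R&R",
           "Part-to-part", "Total"],
)
summary["pct_of_total"] = 100 * summary["variance"] / var_total
print(summary)
```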

Adding uncertainty - bootstrapping

Another, even more natural framework would be a Bayesian one. Namely, the Bayesian framework gracefully implements multilevel models and allows high flexibility. A known advantage of Bayesian methods is their intrinsic probabilistic notion; within this framework, it would be natural and straightforward to include uncertainties.

We, however, did not use a natively probabilistic approach. In our case, we need other methods, such as bootstrapping.
Bootstrapping is a straightforward method that consists in repeating our measurement/experiment by synthetically generating data through resampling.

In our case, the idea is to resample the measurements within every appraisal $\times$ part combination.

The approximate code below (python pseudocode) demonstrates this idea.

# d - the dataframe of the experiment; res_0 - the fit on the original data
rng = np.random.default_rng(0)
boot_records = []
cell_groups = list(d.groupby(['appraisal', 'part']))

for boot_id in range(n_bootstrap):
    resampled_cells = []
    for cell_key, cell_df in cell_groups:
        n_rows = len(cell_df)  # number of tests in this appraisal x part cell
        row_ids = rng.integers(low=0, high=n_rows, size=n_rows)  # sample ids with replacement
        resampled_cells.append(cell_df.iloc[row_ids])  # resample and store

    d_boot = pd.concat(resampled_cells)  # create a new synthetic dataset
    model = get_model(d=d_boot)  # build the statsmodels model
    # warm-start from the original fit to speed up convergence
    res = model.fit(reml=True, method='lbfgs', start_params=res_0.params)
    grr_table = grr_table_from_fit(model=model, res=res, tol=tolerance)
    boot_records.append(grr_table)

Once the bootstrapped data is obtained, one can visualize the histograms. These show how “uncertain” our estimated parameters are when resampling the data.

Figure: Histogram of bootstrapped parameters illustrating uncertainty.
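From the same bootstrapped records one can also extract percentile confidence intervals; a sketch on synthetic GtT draws standing in for real bootstrap output:

```python
import numpy as np

# Hypothetical bootstrapped GtT values (in practice, collected in boot_records)
rng = np.random.default_rng(1)
boot_gtt = rng.normal(0.19, 0.02, size=500)

# 95% percentile interval of the bootstrap distribution
lo, hi = np.percentile(boot_gtt, [2.5, 97.5])
print(f"GtT 95% CI: [{lo:.3f}, {hi:.3f}]")
```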

Conclusion

In this short reading, we have presented the Gauge R&R method with its two main computation approaches - the “classical” ANOVA and the more “modern” Multilevel Model. We have stated the main ideas and mathematical considerations, focused on the main benefits and drawbacks of the Multilevel Model vs ANOVA, and provided the skeleton of a Python implementation.


  1. A factor in the context of DoE is the influence/regressor that we are experimenting for (e.g. when analyzing the effects of watering and light on the plant growth, water and light are the factors). ↩︎

  2. A linear model can be defined without including the overall mean $\mu$. In that case, this would mean that the “overall” effect is included in all of the effects’ contributions. ↩︎

  3. The Python syntax for building statistical models is based solely on statsmodels and is very different from R. The latter is built to handle similar well-known cases nicely. Thus, if coming from R, the semantics can seem very unnatural. ↩︎