Documentation

Weighting methodology

This page is the technical companion to the Weighted results FAQ. The FAQ explains what weighting does and when it runs; this page documents the estimator, the variance approximation, and the scope of the inference we produce. It is written for analysts, data-science teams, and methodologists reviewing our approach.

Scope

MX8 Labs uses iterative proportional fitting (IPF, also called raking) to calibrate respondent weights to marginal targets supplied through the quota configuration. Point estimates are computed from the calibrated weights. Standard errors and significance tests are computed using an effective-sample-size adjustment so that precision claims degrade gracefully as weights become more heterogeneous.

This is a calibration-weighting framework with Kish effective-sample-size adjustment. It is not a full design-based variance estimator: it does not model primary sampling units, explicit strata, or finite-population corrections. In practice, for the kinds of online respondent sources MX8 supports, this framework is well matched to the data and the questions people ask of it, but the scope is worth naming explicitly.

Notation

Let respondents be indexed by $i = 1, \dots, n$ with final calibrated weights $w_i \ge 0$.

For a reporting cell $c$ (for example, "women aged 25-34 in the West region"):

  • $S_c$ is the set of respondents in cell $c$.
  • $N_c = |S_c|$ is the raw base for that cell.
  • $W_c = \sum_{i \in S_c} w_i$ is the weighted base.

For a binary outcome $Y_i \in \{0, 1\}$, the weighted proportion is:

$$\hat{p}_c = \frac{\sum_{i \in S_c} w_i Y_i}{\sum_{i \in S_c} w_i}$$

Weighted means for numeric or rating outcomes are computed analogously.
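
As a concrete sketch, the weighted proportion can be computed directly from a cell's weight and outcome vectors. The function name is illustrative, not the platform API:

```python
def weighted_proportion(weights, outcomes):
    """Weighted proportion of a binary outcome within one reporting cell."""
    weighted_base = sum(weights)
    if weighted_base == 0:
        raise ValueError("weighted base is zero")
    return sum(w * y for w, y in zip(weights, outcomes)) / weighted_base

# Three respondents with weights 0.8, 1.2, 1.0 and outcomes 1, 1, 0:
p_hat = weighted_proportion([0.8, 1.2, 1.0], [1, 1, 0])  # (0.8 + 1.2) / 3.0
```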

Target marginals

Weighting targets are supplied as category proportions for one or more questions defined on the respondent source (age bands, gender, region, and so on). Targets may be:

  • One-way, e.g. the marginal distribution of age alone.
  • Nested (joint), e.g. a target distribution over the joint of age and gender. Joint targets are calibrated directly against the joint cell, not against the two marginals separately.

Only categories assigned a positive target are treated as required for the eligibility checks described below. A category with a zero target is permitted to be absent from the eligible data.

The IPF algorithm

Given a seed tensor $S$ built from observed respondents and a set of target margins $\{m_j\}$ on the weighting dimensions, IPF repeatedly rescales $S$ along one dimension at a time so that its marginal on that dimension matches the corresponding target. After each pass it moves to the next dimension and rescales again. The loop continues until the fitted tensor $R$ matches every target to within a convergence tolerance.

The seed tensor is constructed from observed respondent counts. If a previous weighting stage has already written weights, the seed is built from those existing weights rather than from raw counts; this is what makes stage composition multiplicative (see below).
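
The rescaling loop can be sketched for the two-dimensional case. This is a minimal illustration of the pattern, not the platform implementation; targets here are expressed as counts on the same total as the seed:

```python
def ipf_2d(seed, row_targets, col_targets, tol=1e-9, max_iters=1000):
    """Rake a 2-D seed table until both margins match the targets."""
    R = [row[:] for row in seed]  # fitted table, initialized from the seed
    for _ in range(max_iters):
        # Rescale along the row dimension so row sums match the row targets.
        for i, target in enumerate(row_targets):
            s = sum(R[i])
            if s > 0:
                R[i] = [x * target / s for x in R[i]]
        # Rescale along the column dimension so column sums match the column targets.
        for j, target in enumerate(col_targets):
            s = sum(row[j] for row in R)
            if s > 0:
                for row in R:
                    row[j] *= target / s
        # Columns are exact after the column step; converged once rows also match.
        if all(abs(sum(R[i]) - t) < tol for i, t in enumerate(row_targets)):
            return R
    return R

seed = [[30.0, 10.0], [20.0, 40.0]]  # observed counts, 100 respondents
fitted = ipf_2d(seed, row_targets=[50.0, 50.0], col_targets=[60.0, 40.0])
```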

Once $R$ has converged, each calibration cell $k$ receives a cell multiplier:

$$a_k = \frac{R_k}{S_k} \cdot \frac{\sum_k S_k}{\sum_k R_k}$$

The right-hand factor normalizes the weights so that their sum equals the eligible respondent count, which keeps weighted bases on the same scale as raw bases. Each respondent inherits the multiplier of their calibration cell as their weight.
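
Under the same notation, the multiplier-and-normalization step looks like this, sketched over flat cell dictionaries (the layout is illustrative):

```python
def cell_multipliers(S, R):
    """Cell multipliers a_k = (R_k / S_k) * (sum(S) / sum(R))."""
    scale = sum(S.values()) / sum(R.values())
    return {k: (R[k] / S[k]) * scale for k in S}

S = {"a": 30.0, "b": 70.0}  # seed: observed counts, 100 eligible respondents
R = {"a": 50.0, "b": 50.0}  # converged fit matching the targets
a = cell_multipliers(S, R)
# The normalizing factor keeps the weighted total equal to the eligible count:
# sum(S[k] * a[k] for k in S) stays at 100.0 (up to floating point).
```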

Multi-stage weighting

A dataset can be configured with up to two stages: a pre-weighting stage and a main weighting stage. When both are configured, they are applied sequentially:

  1. The pre-weighting stage runs IPF against its own targets and writes respondent weights.
  2. The main weighting stage runs IPF again, but it builds its seed tensor from the pre-weighting weights rather than from raw counts. The effect is a multiplicative update: the final weight on each respondent is the product of the two cell multipliers from the two stages.

Eligibility is evaluated for every configured stage before any weighting runs. If any stage fails eligibility, no stage is applied and respondent weights remain at 1.0. This all-or-nothing behavior prevents a partial calibration from introducing targets that were never intended to be hit in isolation.
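
The multiplicative composition can be shown directly; cell labels and multiplier values below are hypothetical:

```python
def compose_stages(pre_multipliers, main_multipliers, respondents):
    """respondents: list of (pre_cell, main_cell) labels, one pair per respondent.
    The final weight is the product of the two stage multipliers."""
    return [pre_multipliers[c1] * main_multipliers[c2] for c1, c2 in respondents]

pre = {"18-34": 1.2, "35+": 0.9}   # pre-weighting stage cell multipliers
main = {"F": 1.1, "M": 0.95}       # main stage cell multipliers
weights = compose_stages(pre, main, [("18-34", "F"), ("35+", "M")])
```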

Eligibility guardrails

Before weighting runs, the platform checks that calibration is safe to attempt. A stage is eligible only if:

  1. every question referenced by the targets is present in the data,
  2. the number of respondents with complete answers across all weighting questions exceeds the minimum base for weighting, and
  3. every category with a positive target has at least one eligible respondent in that category (including joint categories for nested targets).

Condition (3) is the most commonly hit: it prevents IPF from trying to put mass into a category that is empty in the eligible sample, which would otherwise produce either non-convergence or implausibly large weights.

When a stage fails eligibility, the dataset is reported with unit weights and the weighting diagnostics will show the failure reason so you can fix the underlying issue (for example, collapse a sparse category, revise the target, or increase the sample).
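
The three conditions can be sketched for one-way targets (nested targets check the joint cell the same way). The function and data layout are illustrative assumptions, not the platform API:

```python
def stage_is_eligible(data, targets, min_base):
    """data: list of {question: category} dicts, one per respondent.
    targets: {question: {category: target_share}}."""
    questions = list(targets)
    # (1) every question referenced by the targets appears in the data
    if any(not any(q in row for row in data) for q in questions):
        return False
    # (2) respondents complete across all weighting questions must exceed the minimum base
    complete = [row for row in data if all(row.get(q) is not None for q in questions)]
    if len(complete) <= min_base:
        return False
    # (3) every positive-target category needs at least one eligible respondent;
    # zero-target categories are allowed to be absent
    for q, cats in targets.items():
        observed = {row[q] for row in complete}
        if any(share > 0 and cat not in observed for cat, share in cats.items()):
            return False
    return True
```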

Effective sample size

For any set of weights $\{w_i\}_{i \in S}$, we compute the Kish effective sample size:

$$n_{\text{eff}} = \frac{\left(\sum_{i \in S} w_i\right)^2}{\sum_{i \in S} w_i^2}$$

Two properties to keep in mind:

  • If every respondent has the same weight, $n_{\text{eff}} = |S|$. Unequal weights always reduce $n_{\text{eff}}$ below the raw count.
  • $n_{\text{eff}}$ is computed separately for each reporting cell, so a subgroup with stable weights can retain most of its precision even if the dataset overall has heavy weighting.

We report the weighting efficiency at the dataset level as:

$$\text{Efficiency} = \frac{n_{\text{eff}}}{n}$$

An efficiency near 1 means weighting has cost very little precision. Low efficiency, or a long right tail in the weight histogram, is a signal that the targets are straining the sample — often a cue to collapse sparse categories or revisit the quota.
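
The two formulas above translate directly into code (a minimal sketch):

```python
def kish_neff(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    sw = sum(weights)
    return sw * sw / sum(w * w for w in weights)

def efficiency(weights):
    """Dataset-level weighting efficiency: n_eff / n."""
    return kish_neff(weights) / len(weights)

# Equal weights recover the raw count; any dispersion shrinks it.
```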

Variance approximation

Standard errors for weighted proportions use a binomial-style approximation with $n_{\text{eff}}$ substituted for the raw base:

$$\widehat{\text{SE}}(\hat{p}_c) = \sqrt{\frac{\hat{p}_c (1 - \hat{p}_c)}{n_{\text{eff},c}}}$$

This is the standard calibration-weighting shortcut: it captures the first-order precision cost of unequal weights without requiring a full Taylor-linearization pass for every estimand. Mean-aggregated outputs use the same pattern after scale-normalizing the estimate to a proportion.
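
As a sketch, the approximation is a one-liner once the effective base is known:

```python
import math

def weighted_se(p_hat, n_eff):
    """Binomial-style SE with the effective base substituted for the raw base."""
    return math.sqrt(p_hat * (1.0 - p_hat) / n_eff)

se = weighted_se(0.40, 92.9)  # roughly 0.051
```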

Significance testing

All significance tests in MX8 reports consume $n_{\text{eff}}$ rather than raw counts:

  • Column t-tests compare each cell to other cells in its column using the reported weighted mean and standard error, with sample size set to $n_{\text{eff}}$.
  • Row t-tests are the same, applied across rows.
  • Residual t-tests (the default in cross-tabs) use a critical $t$ with degrees of freedom derived from $n_{\text{eff}} - 1$, so the threshold for flagging a cell as significant tightens when the effective base is small.

The practical consequence is that heavily weighted data produces fewer significant cells than raw counts alone would suggest. This is intentional and avoids overclaiming precision from inflated weighted totals.
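
To illustrate how the effective base feeds a test, here is a two-sample statistic for comparing two cells with unpooled variances. The exact pooling and degrees-of-freedom conventions in the product may differ, so treat this as a sketch:

```python
import math

def cell_comparison_t(p1, neff1, p2, neff2):
    """t statistic for two weighted proportions using effective bases."""
    se = math.sqrt(p1 * (1 - p1) / neff1 + p2 * (1 - p2) / neff2)
    return (p1 - p2) / se

t = cell_comparison_t(0.40, 92.9, 0.30, 120.0)
```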

A small worked example

Suppose a 600-respondent genpop dataset is weighted to national age and gender targets. After weighting, the weights for one reporting cell — "women in the Northeast" — are distributed as follows:

  • 40 respondents with $w_i \approx 0.7$
  • 60 respondents with $w_i \approx 1.3$

For this cell:

$$\sum w_i = 40 \times 0.7 + 60 \times 1.3 = 28 + 78 = 106$$

$$\sum w_i^2 = 40 \times 0.49 + 60 \times 1.69 = 19.6 + 101.4 = 121.0$$

$$n_{\text{eff}} = \frac{106^2}{121.0} = \frac{11{,}236}{121.0} \approx 92.9$$

So the 100 raw respondents in this cell carry the information of roughly 93 equally-weighted respondents. If the weighted proportion on a binary outcome is $\hat{p} = 0.40$, the standard error is:

$$\widehat{\text{SE}} = \sqrt{\frac{0.40 \times 0.60}{92.9}} \approx 0.051$$

Compared to $\sqrt{0.24 / 100} \approx 0.049$ if the cell were unweighted — a modest widening that reflects the mild weight dispersion.
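
The arithmetic above can be checked in a few lines:

```python
import math

weights = [0.7] * 40 + [1.3] * 60
sum_w = sum(weights)                           # ≈ 106
sum_w2 = sum(w * w for w in weights)           # ≈ 121
n_eff = sum_w ** 2 / sum_w2                    # ≈ 92.9
se_weighted = math.sqrt(0.40 * 0.60 / n_eff)   # ≈ 0.051
se_unweighted = math.sqrt(0.40 * 0.60 / 100)   # ≈ 0.049
```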

Diagnostics to monitor

The weighting diagnostics for each dataset report:

  • respondent count $n$,
  • effective sample size $n_{\text{eff}}$,
  • efficiency $n_{\text{eff}} / n$,
  • min, median, mean, and max weight,
  • a histogram of the weight distribution.
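
These summaries are straightforward to compute from the weight vector alone (a sketch using the standard library; the histogram is omitted):

```python
import statistics

def weight_diagnostics(weights):
    """Summary statistics matching the reported weighting diagnostics."""
    n = len(weights)
    n_eff = sum(weights) ** 2 / sum(w * w for w in weights)
    return {
        "n": n,
        "n_eff": n_eff,
        "efficiency": n_eff / n,
        "min": min(weights),
        "median": statistics.median(weights),
        "mean": statistics.mean(weights),
        "max": max(weights),
    }

diag = weight_diagnostics([0.7] * 40 + [1.3] * 60)
```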

Points to watch for:

  1. Efficiency well below 1 — weights are doing a lot of work. Inference is valid but precision is reduced; consider whether the targets are achievable from the realized sample.
  2. A long right tail on the weight histogram — a small number of respondents are carrying a lot of weight. These cells will have the largest effect on point estimates and the smallest effective base.
  3. A stage reported as ineligible — read the failure reason in the diagnostics. Usually this points to an empty target category in the eligible sample.

Assumptions and limitations

  1. Inference assumes independent respondents within calibration cells; there is no explicit modeling of clustering, stratification, or finite-population corrections.
  2. The quality of point estimates depends on the validity of the supplied marginal targets and on overlap between the sample and the targets.
  3. Extreme weights reduce efficiency and widen standard errors. The eligibility guardrails and the diagnostics are the operational controls for this.
  4. The variance approximation is a binomial-style shortcut with $n_{\text{eff}}$ substitution, not a full design-based estimator.

Reproducibility

Weighting is deterministic given the input dataset, the marginal targets, the stage configuration, the respondent status filters, and the IPF convergence tolerance. Re-running the pipeline on the same inputs produces the same weights and the same reported statistics.