Within routinely collected health data, missing data for an individual might provide useful information in itself. This occurs, for example, in electronic health records, where the presence or absence of data is informative. While the naive use of missing indicators to exploit such information can introduce bias, their use in conjunction with multiple imputation may unlock the potential value of missingness to reduce bias in causal effect estimation, particularly in missing not at random scenarios and where missingness might be associated with unmeasured confounders.

We conducted a simulation study to determine when the use of a missing indicator, combined with multiple imputation, would reduce bias for causal effect estimation, under a range of scenarios including unmeasured variables, missing not at random, and missing at random mechanisms. We used directed acyclic graphs and structural models to elucidate a variety of causal structures of interest. We handled missing data using complete case analysis, and multiple imputation with and without missing indicator terms.

We find that multiple imputation combined with a missing indicator gives minimal bias for causal effect estimation in most scenarios. In particular, the approach: 1) does not introduce bias in missing (completely) at random scenarios; 2) reduces bias in missing not at random scenarios where the missing mechanism depends on the missing variable itself; and 3) may reduce or increase bias when unmeasured confounding is present.

In the presence of missing data, careful use of missing indicators, combined with multiple imputation, can improve causal effect estimation when missingness is informative, and is not detrimental when missingness is at random.

Missing data is a common feature in observational studies. It is conventional to view missing data as a nuisance, and as such, methods to handle missing data usually target an estimand that would be available in the absence of missing data (completed data estimand). The mechanism for missingness is conventionally divided into three categories: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) [

Alongside missing data, confounding is a threat to causal effect estimation in observational studies, especially where this is caused by unmeasured variables. Where unmeasured confounding exists, it is not possible to construct unbiased estimators of a causal effect, without making strong, unverifiable assumptions [

For example, consider a scenario in which we are interested in estimating the causal effect of total cholesterol (exposure) on cardiovascular disease (outcome), using electronic health records. Presence (analogous to missingness) of a cholesterol test result for a particular patient indicates that a decision was made to run this test, and the reason for this decision is likely to depend on characteristics of the patient, for example the patient’s diet, which may or may not be recorded. Diet may affect both the result of the laboratory test and the outcome of interest, and is hence a confounder. If information concerning diet is not recorded, we therefore have unmeasured confounding, and unbiased estimators of the causal estimand may not exist. Moreover, the missingness mechanism for the exposure may depend on unmeasured variables, in which case the exposure is MNAR, and an unbiased estimator of the completed data estimand may not exist either.

An emerging hypothesis is that in scenarios such as this, missing data may be a blessing rather than a curse, because the missingness mechanism can be used as a proxy for the unmeasured confounding, through the use of missing indicators [

In this paper we investigate, through simulation supplemented with analytical findings, the potential for using the missingness mechanism to partly adjust for unmeasured confounding and other missing not at random scenarios, and identify the cases where this can reduce bias for causal effect estimation.

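Our aim is to identify missing data strategies that recover causal effects of an exposure on an outcome with minimal bias in a variety of scenarios, especially where the causal effects are affected by unmeasured confounding. The scenarios that we consider in this paper are given in the figure below. Here A is the exposure, Y the outcome, and U a confounder that may be unmeasured; R_{A} is the missingness indicator for A, where R_{A} = 0 when A is missing. In the cholesterol example, R_{A} denotes whether a cholesterol test has been performed or not. A^{∗} is the observable part of A: A^{∗} = A when R_{A} = 1, and is missing when R_{A} = 0. So A^{∗} is what we observe, while A is the underlying value, which may be unobserved.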

Causal directed acyclic graphs denoting missingness mechanism for A, R_{A}: six scenarios considered in the paper

We use the counterfactual notation for consideration of causal effects, e.g. _{A} ≔

First, we consider scenarios in Fig. _{A} is independent of all other variables. All other scenarios, (ii)-(vi), are MNAR since _{A} is dependent on

In scenarios (ii), (iii), (v), and (vi), we could view R_{A} as a proxy for the unobserved U, and so include R_{A} in the outcome model. This may reduce bias in the estimation of the causal effect of A on Y.

Second, we consider each of the six scenarios in Fig. _{A} in the outcome model, but we wish to examine any reduction in performance that doing so may introduce. Scenario (i-U) remains MCAR, while scenarios (ii-U) and (iii-U) are now MAR. Scenarios (iv-U), (v-U), and (vi-U) remain MNAR but only through the dependence of _{A} on

We now specify the structural models that will be assumed for our further derivations and simulations.

_{U}.

R_{A} is binary, with either P[R_{A} = 0] = expit(β_{0} + β_{U}U + β_{A}A + β_{UA}UA), or R_{A} = 1 − U.

The outcome model is linear in _{A} = _{A} + _{U}_{UA}.
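The structural models above can be sketched as a small data generating function. This is a minimal illustration, not the authors' simulation code: the function name `simulate`, the parameter names (`alpha_*` for the exposure model, `beta_*` for the missingness model, `gamma_*` for the outcome model, `pi_u` for P[U = 1]), the normal model for A, and all default values are assumptions made for the example.

```python
import numpy as np


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-x))


def simulate(n, rng, pi_u=0.5,
             alpha_0=0.0, alpha_u=0.5, sigma_a=1.0,
             beta_0=0.0, beta_u=0.5, beta_a=0.0, beta_ua=0.0,
             gamma_0=0.0, gamma_a=1.0, gamma_u=1.0, gamma_ua=0.0,
             sigma_y=1.0, deterministic=False):
    """Draw one data set (all parameter names and defaults are hypothetical)."""
    u = rng.binomial(1, pi_u, size=n)                # binary confounder U
    a = rng.normal(alpha_0 + alpha_u * u, sigma_a)   # assumed normal exposure A
    if deterministic:
        r = 1 - u                                    # special case R_A = 1 - U
    else:
        # P[R_A = 0] is logistic in U, A and their interaction
        p_missing = expit(beta_0 + beta_u * u + beta_a * a + beta_ua * u * a)
        r = 1 - rng.binomial(1, p_missing)           # R_A = 1 means A is observed
    # Outcome is linear in A and U, with an optional U*A interaction
    y = rng.normal(gamma_0 + gamma_a * a + gamma_u * u + gamma_ua * u * a, sigma_y)
    a_star = np.where(r == 1, a, np.nan)             # observable part of A
    return {"u": u, "a": a, "r": r, "a_star": a_star, "y": y}
```

Setting `beta_u = beta_a = beta_ua = 0` makes missingness independent of all other variables (MCAR), while a nonzero `beta_a` makes missingness depend on the missing value itself.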

For notation, we use Greek letters with no superscripts to denote true parameter values (from the data generating mechanisms described in the previous section) – e.g. _{A} - and use the same Greek letters with bracketed superscripts to denote the parameters estimated in the various analysis models – e.g.

First, a complete case analysis. When _{A} = 1. When

Second, we consider multiple imputation, under a joint normal model assuming a MAR mechanism. Thus, when

Throughout, we denote the imputed _{imp}. We then consider the following three outcome/analysis models, when

‘MI(A)’:

‘MI(R + A)’:

‘MI(R*A)’:

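Model 1 represents a standard multiple imputation (MI) approach, while models 2 and 3 are variants of the multiple imputation with missing indicator (MIMI) approach, without and with an interaction between the indicator and the exposure (MI(R + A) and MI(R*A), respectively).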
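A rough sketch of how the three analysis models differ is given below. It is a simplification under stated assumptions, not the procedure used in the paper: missing values of A are imputed from a normal linear regression of A on Y fitted to complete cases, the imputation-model parameters are not redrawn between imputations (so this is not fully proper multiple imputation), and only point estimates are pooled, by averaging. The helper names `ols`, `impute_once` and `fit_models`, and the toy MCAR data, are hypothetical.

```python
import numpy as np


def ols(X, y):
    """Least squares coefficients for the design matrix X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def impute_once(a_star, y, r, rng):
    """Fill missing A with one stochastic draw from a regression of A on Y."""
    obs = r == 1
    X_obs = np.column_stack([np.ones(obs.sum()), y[obs]])
    coef = ols(X_obs, a_star[obs])
    resid_sd = (a_star[obs] - X_obs @ coef).std(ddof=2)
    a_imp = a_star.copy()
    n_mis = int((~obs).sum())
    a_imp[~obs] = coef[0] + coef[1] * y[~obs] + rng.normal(0.0, resid_sd, n_mis)
    return a_imp


def fit_models(a_star, y, r, rng, m=20):
    """Average outcome-model coefficients over m imputed data sets."""
    designs = {
        "MI(A)": lambda a: [a],                # imputed exposure only
        "MI(R + A)": lambda a: [a, r],         # plus missing indicator
        "MI(R*A)": lambda a: [a, r, r * a],    # plus indicator-by-exposure term
    }
    estimates = {name: [] for name in designs}
    for _ in range(m):
        a_imp = impute_once(a_star, y, r, rng)
        for name, columns in designs.items():
            X = np.column_stack([np.ones(len(y))] + columns(a_imp))
            estimates[name].append(ols(X, y))
    return {name: np.mean(v, axis=0) for name, v in estimates.items()}


# Toy MCAR example: true coefficient of A is 1, R_A is independent of everything.
rng = np.random.default_rng(1)
n = 5000
a = rng.normal(0.0, 1.0, n)
y = rng.normal(1.0 * a, 1.0)
r = rng.binomial(1, 0.5, n)
a_star = np.where(r == 1, a, np.nan)
pooled = fit_models(a_star, y, r, rng)
```

Each entry of `pooled` holds the intercept first, then the coefficient of the imputed exposure, then any indicator terms, so `pooled["MI(R + A)"][2]` is the estimated coefficient of the missing indicator.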

When

U. ‘MI(A)’:

U. ‘MI(R + A)’:

U. ‘MI(R*A)’:

Finally, we also include ‘completed data’ models in which we use the original variable

When _{A}) term may act as a partial proxy for

For the cases where

Where present in our models, we hypothesise that the

It is instructive to consider a special case of scenario (ii) (see Fig. _{A} = 1 − _{0} + _{A}_{U}_{A} in the outcome model – which corresponds to the MI(R + A) approach described above (model 2) - would be expected to perform well, as this analysis model matches the true model. Indeed, the regression coefficient of _{A}. However, the model produces a biased estimate of the regression coefficient of _{A}.

While the case _{A} = 1 −

The aims, general structure, and models, are described above. We consider the following specific data generating mechanisms, which cover all of the scenarios (i)-(vi) and (i-U)-(vi-U) described in Fig.

For the R_{A} ≠ 1 − U case:

We fix the sample size (number of observations within each simulation run) to be _{U} = 0.5.

We choose the intercepts as functions of the other parameters: _{0} such that _{0} such that _{0} such that _{A} = 0] varies over the grid {0.25,0.5,0.75}.
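One way to choose a missingness intercept so that the marginal probability of missingness hits a target value is numerical root finding. The sketch below uses bisection on a large Monte Carlo sample; it illustrates the idea only, and the function name `calibrate_intercept`, the sample size, and the parameter values are assumptions rather than the method used in the paper.

```python
import numpy as np


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-x))


def calibrate_intercept(linpred, target, lo=-20.0, hi=20.0, tol=1e-8):
    """Find b0 such that mean(expit(b0 + linpred)) equals target.

    The mean is increasing in b0, so bisection on [lo, hi] converges.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expit(mid + linpred).mean() < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical example: missingness depends on U and A, target P[R_A = 0] = 0.25.
rng = np.random.default_rng(0)
u = rng.binomial(1, 0.5, size=100_000)
a = rng.normal(0.5 * u, 1.0)
linpred = 0.5 * u + 0.5 * a          # beta_U * U + beta_A * A, without intercept
b0 = calibrate_intercept(linpred, target=0.25)
achieved = expit(b0 + linpred).mean()
```

Because the calibration is done on a simulated sample, the achieved proportion matches the target on that sample and only approximately in fresh draws.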

The main effect parameters α_{U}, β_{A}, and γ_{U} are all varied over the grid {0, 0.1, 0.5, 1}, and the parameter β_{U} over the grid {−1, 0, 0.1, 0.5, 1} (a negative β_{U} is included to study whether the direction of correlation between U and R_{A} is important), while we fix γ_{A} = 1.

The interaction effect parameters, β_{UA} and γ_{UA}, each take values in {0, 0.5}.

The standard deviation of Y, σ_{Y}, is varied over the grid {0.1, 0.5, 1}, while we fix σ_{A} = 1.

For the R_{A} = 1 − U case:

We exclude β_{U}, β_{A} and β_{UA}, which are redundant.

We vary _{U} over the grid {0.25,0.5,0.75}, as this is required to vary the proportion of missingness.

All combinations of the parameters are evaluated, resulting in 11,808 scenarios, of which 288 cover the case where R_{A} = 1 − U.
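The scenario count can be checked by enumerating the grids directly. In the sketch below the grid sizes are taken from the text, while the dictionary keys (parameter names) are illustrative labels chosen for this example.

```python
from itertools import product

FOUR_LEVELS = [0, 0.1, 0.5, 1]

# General case: missingness follows the logistic model.
grid_general = {
    "p_missing": [0.25, 0.5, 0.75],   # target P[R_A = 0]
    "alpha_u": FOUR_LEVELS,
    "beta_a": FOUR_LEVELS,
    "gamma_u": FOUR_LEVELS,
    "beta_u": [-1, 0, 0.1, 0.5, 1],
    "beta_ua": [0, 0.5],
    "gamma_ua": [0, 0.5],
    "sigma_y": [0.1, 0.5, 1],
}

# Special case R_A = 1 - U: missingness parameters drop out, P[U = 1] varies.
grid_special = {
    "pi_u": [0.25, 0.5, 0.75],
    "alpha_u": FOUR_LEVELS,
    "gamma_u": FOUR_LEVELS,
    "gamma_ua": [0, 0.5],
    "sigma_y": [0.1, 0.5, 1],
}


def expand(grid):
    """Return every combination of grid values as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*grid.values())]


n_general = len(expand(grid_general))   # 3 * 4 * 4 * 4 * 5 * 2 * 2 * 3
n_special = len(expand(grid_special))   # 3 * 4 * 4 * 2 * 3
n_total = n_general + n_special
```

The two grids give 11,520 and 288 combinations respectively, which together reproduce the 11,808 scenarios stated above.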

For each scenario, we fit the models described in the previous section, and report estimates of the outcome coefficients from the various models. Each scenario is repeated 200 times and summary statistics over these iterations retained. For all parameters of interest – those of the form
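Summarising the repeated runs of one scenario might look like the following. This is a sketch only: the function name `summarise` is hypothetical, and reporting bias as the mean estimate minus the true value is an assumption; the mean and the 2.5th and 97.5th percentiles correspond to what the figures display.

```python
import numpy as np


def summarise(estimates, truth):
    """Summarise one coefficient's estimates across simulation iterations."""
    est = np.asarray(estimates, dtype=float)
    lower, upper = np.percentile(est, [2.5, 97.5])
    return {
        "mean": est.mean(),
        "bias": est.mean() - truth,   # assumed definition of bias
        "lower": lower,               # 2.5th percentile
        "upper": upper,               # 97.5th percentile
    }
```

For example, `summarise(estimates, truth=1.0)` would be applied to the 200 estimates of the exposure coefficient from one scenario and one analysis model.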

Here we present a subset of the simulations that capture the main findings; full results are available – see _{A} = 1, _{U} = 1, and _{Y} = 1, although we consider both _{UA} = 0 and _{UA} = 0.5. We also restrict to cases that result in _{A} = 1] = 0.5. When _{UA} = 0, the marginal causal effect of _{UA} = 0.5, by standardisation the marginal causal effect of
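The standardisation step can be written out as a short worked derivation. The notation is an assumption made for illustration: the outcome mean is taken to be linear in A and U with an interaction coefficient γ_{UA}, and π_{U} denotes P[U = 1], set to 0.5 here; Y^{a} denotes the counterfactual outcome under exposure level a.

```latex
\begin{align*}
E[Y \mid A = a, U = u] &= \gamma_0 + \gamma_A a + \gamma_U u + \gamma_{UA}\, u a, \\
E[Y^{a}] &= \sum_{u \in \{0,1\}} E[Y \mid A = a, U = u]\, P[U = u]
          = \gamma_0 + \gamma_U \pi_U + (\gamma_A + \pi_U \gamma_{UA})\, a, \\
E[Y^{a+1}] - E[Y^{a}] &= \gamma_A + \pi_U \gamma_{UA} = 1 + 0.5 \times 0.5 = 1.25 .
\end{align*}
```

With γ_{UA} = 0 the same expression reduces to γ_{A} = 1, so the marginal and conditional effects coincide in that case.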

Figure _{A} = _{UA} = 0. In addition, for this figure, we fix _{A} = 1] = 0.5, _{A} = 1, _{U} = 1, _{UA} = 0 and _{A} = 1. Scenario (i) is the case where _{U} = _{U} = 0. For Scenario (ii), _{U} controls the strength of the relationship between _{A}, with the extreme case _{A} = 1 − _{U} = 0. For Scenario (iii), _{U} additionally controls the strength of the relationship between

Results for scenarios (i)-(iii), with γ_{UA} = 0. Mean of estimated coefficients across simulations; error bars represent the 2.5th and 97.5th percentiles. Columns are different parameter estimates, rows are different values of β_{U}, with the special case R_{A} = 1 − U on the top row. Within each graph, the y-axis varies α_{U}

The causal effect of _{U} = _{U} = 0 all methods’ estimates of _{A} are able to recover this without bias and with appropriate coverage. As _{U} increases, all methods are still able to estimate the causal effect well, except that MI(A) becomes biased when _{A} = 1 − _{U} increases, the completed data model becomes biased because of unmeasured confounding. We see that the MIMI approaches and complete case analysis are able to mitigate this to some extent, and successfully when _{A} = 1 − _{R} becomes nonzero for the MIMI methods when _{U} ≠ 0: it is through this that the MIMI methods are able to partly correct for the unmeasured confounding. Note that when _{U} = − 1 then the _{U} = 1 case.

Figure _{UA} = 0.5; hence the marginal and conditional causal effects of _{A} = 1 − _{U} = 0 the completed data model estimates _{A} = _{A} + _{U}_{UA} = 1.25, the marginal effect, while complete case analysis estimates the conditional effect when _{A} = 1); this is of course not surprising as there is only data when _{U} = − 1 we now see that all methods have increased bias. This is because the reversal of the correlation between _{A} means that missingness is more likely when

Results for scenarios (i)-(iii), with γ_{UA} = 0.5. Mean of estimated coefficients across simulations; error bars represent the 2.5th and 97.5th percentiles. Columns are different parameter estimates, rows are different values of β_{U}, with the special case R_{A} = 1 − U on the top row. Within each graph, the y-axis varies α_{U}

Figure _{UA} = 0, i.e. the same conditions as Fig. _{A}. MI(A) and MI(R + A) are in almost perfect agreement.

Results for scenarios (i-U)-(iii-U), with γ_{UA} = 0. Mean of estimated coefficients across simulations; error bars represent the 2.5th and 97.5th percentiles. Columns are different parameter estimates, rows are different values of β_{U}, with the special case R_{A} = 1 − U on the top row. Within each graph, the y-axis varies α_{U}

Figure _{U} = _{UA} = 0, and we additionally fix _{A} = 1, _{U} = 1, _{UA} = 0, _{Y} = 1 and _{A} = 1. The key varying parameters are _{A}, which controls the dependence of _{A} on _{U}, which controls the dependence of _{A} on

Results for scenarios (iv) and (v), with γ_{UA} = 0. Mean of estimated coefficients across simulations; error bars represent the 2.5th and 97.5th percentiles. Columns are different parameter estimates, rows are different values of β_{U}. Within each graph, the y-axis varies β_{A}

When β_{U} = 0 (corresponding to scenario (iv)), MI(A) is biased in estimating γ_{A}, whereas MI(R + A) and MI(R*A) are not. When β_{U} ≠ 0 the picture is more complicated, and no single approach consistently minimizes the bias. What is consistent, however, is that the γ_{R} estimates are nonzero whenever either β_{U} or β_{A} is nonzero.

Figure _{U} do not cause particular problem for any method, while nonzero _{A} introduces bias in estimation of _{A} for MI(A) only.

Results for scenarios (iv-U) and (v-U), with γ_{UA} = 0. Mean of estimated coefficients across simulations; error bars represent the 2.5th and 97.5th percentiles. Columns are different parameter estimates, rows are different values of β_{U}. Within each graph, the y-axis varies β_{A}

Figure _{A} = 1, _{U} = 1, _{UA} = 0.5, _{A} = 1, _{Y} = 1 and _{UA} = 0, and _{U} = 0.5.

Results for scenario (vi), with γ_{UA} = 0 and α_{U} = 0.5. Mean of estimated coefficients across simulations; error bars represent the 2.5th and 97.5th percentiles. Columns are different parameter estimates, rows are different values of β_{U}. Within each graph, the y-axis varies β_{A}

The results are similar to those for Scenario (v) except that γ_{A} is more commonly overestimated.

Further results are given in the Supplements: Figs. S

In this paper we have explored, through simulation, the potential merits of supplementing multiple imputation with a missing indicator, particularly in circumstances where missingness is not at random, and the missingness may moreover act as a proxy for unmeasured confounding. We emphasise that, in contrast to the usual missing data literature that targets completed data estimands, here we target causal estimands that are not available in general even with completed data (because of unmeasured confounding). In scenarios where missingness of an exposure depends on an unmeasured confounder, the missingness indicator can be used as a proxy for the unmeasured confounding, and this may reduce bias in some situations. Careful consideration of the likely missingness mechanisms for a given clinical question and dataset is key to deciding on the analytical approach.

In the MCAR and MAR scenarios, without unmeasured confounding, adding a missing indicator to multiple imputation did not introduce bias in estimation of causal effects. In the MNAR scenarios without unmeasured confounding, adding a missing indicator generally reduced bias compared with multiple imputation alone. In the presence of unmeasured confounding, bias in estimation was sometimes better and sometimes worse when including a missing indicator and/or its interaction with the main effect, depending on the relationships between the parameters. This reflects the potentially complex relationships, and shows that care is needed and that decisions should be made on a study-by-study basis. In all cases, when unmeasured confounding and/or MNAR exists, the missing indicator coefficient and/or its interaction with the main effect coefficient were estimated to be non-zero. These non-zero effect estimates of the missing indicators act as a signal that there may be MNAR mechanisms present, and hence that it would be difficult or impossible to obtain unbiased causal effects. Any disagreement between the main effect parameter estimates with and without a missing indicator provides a similar indication.

The ‘missing indicator’ approach has a somewhat negative reputation in the causal inference literature. This is because it is usually coupled with a weak approach to impute the missing data itself - such as using the unconditional mean [

We explored a wide range of simulation settings in a fully factorial design. While we can only present a limited range of results in the paper, the simulation code and results are available online for inspection. Nevertheless, simulations are necessarily simpler than scenarios that might be encountered in practice. First, missingness may affect many covariates. While addition of missing indicators, and interactions, seems robust, it may break down in some scenarios with complex multivariate patterns of missingness, and may also lead to unacceptable model complexity. Second, there may be multiple unmeasured or partially measured confounders. However, we could consider multiple confounders as being summarized by a propensity score, for example, and thus we expect the results here to generalize to the multiple confounders case. We emphasise that we focused on missing data in exposure where the causal estimand rather than the completed data estimand was targeted, and that results here should not be generalized to different scenarios [

We recommend that a missing indicator, and corresponding interaction terms, be used to supplement, but not replace, standard multiple imputation. In particular, we recommend the use of MIMI (including interactions between missing indicators and the corresponding variable) as a strategy for handling missing data in causal effect estimation problems. Non-zero estimates of the missing indicator coefficient then alert the analyst to the possible presence of MNAR and/or unmeasured confounding, and to the need for further sensitivity analysis. We caution that the use of missing indicators should not replace careful consideration of assumed plausible causal structures, and drawing a causal diagram to depict these assumptions remains the starting point for a well-conducted causal inference.

The authors thank Thomas House for useful discussions, and also thank the two reviewers whose comments greatly improved the manuscript.

MS: design of study, perform simulation study, draft paper. GM: design of study, major contributions to editing paper. The authors read and approved the final manuscript.

This work was partially funded by the MRC-NIHR Methodology Research Programme [grant number: MR/T025085/1]. The funding body had no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.

All simulation results are available at

Not applicable.


The authors declare that they have no competing interests.

Here we give an informal justification for the bias result. In this section we use the superscript ∗ to denote true values of parameters.

First consider the imputation model
_{i} ∼ ^{2}). Note that _{i} does not appear in this model because _{i} = 0 for all

Now _{Y, i} instead of _{i}:

In analogy with [

Moreover,

The imputation model is then used to impute values for the missing _{i} s; i.e. for

Now consider again the outcome model,

In the absence of missing data, we would of course simply solve using least squares, and if _{0}, _{A}, _{U}) and

As we have missing data, rewriting the outcome model to replace the missing _{i} s with their imputed versions, for substitution into the least squares formula we have:

The residual sum of squares can then be written as

To consider minimising this expression, consider each bracket in turn. To minimise the first bracket, it is clear that

Rearranging yields,

MAR: Missing at random

MCAR: Missing completely at random

MI: Multiple imputation

MIMI: Multiple imputation with missing indicator

MNAR: Missing not at random
