Internal vs. external validity in studies with incomplete populations
Researchers working with administrative data rarely have access to the entire universe of units they need to estimate effects and make statistical inferences. Examples are varied and come from different disciplines. In social program evaluation, it is common to have data on all households that received the program, but only partial information on the universe of households that applied or could have applied for it. In studies of voter turnout, information on the total number of citizens who voted is usually complete, but data on the total number of voting-eligible citizens are unavailable at low levels of aggregation. In criminology, information on arrests by race is available, but the overall population that could have been arrested is typically unavailable. And in studies of drug overdose deaths, we lack complete information about the full population of drug users.
In all these cases, a reasonable strategy is to study treatment effects and descriptive statistics using the information that is available. This strategy may lack the generality of a full-population study, but it may nonetheless yield valuable information about the included units if it has sufficient internal validity. However, the distinction between internal and external validity becomes harder to draw when the subpopulation of units for which information is available is not defined according to a reproducible criterion, or when this subpopulation is itself defined by the treatment of interest. When this happens, a useful approach is to consider the full range of conclusions that would be obtained under different possible scenarios regarding the missing information. I discuss a general strategy, based on partial identification ideas, that may help assess the sensitivity of the partial-population study under weak (non-parametric) assumptions when the outcome variable is known with certainty for a subset of the units. I also discuss extensions such as the inclusion of covariates in the estimation model and different strategies for statistical inference.
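To make the idea of "the full range of conclusions under different scenarios" concrete, here is a minimal sketch of worst-case bounds on a full-population mean. It is a generic illustration, not necessarily the strategy presented in the talk. It assumes that the outcome is known exactly for the observed units, that it is bounded between y_min and y_max, and that the number of missing units is known or can be posited. The function name and all numbers are hypothetical.

```python
def worst_case_mean_bounds(observed_outcomes, n_missing, y_min=0.0, y_max=1.0):
    """Worst-case bounds on the full-population mean of a bounded outcome.

    Illustrative assumptions (not taken from the abstract):
    - outcomes are known with certainty for the observed units;
    - each missing unit's outcome lies somewhere in [y_min, y_max];
    - n_missing is a posited count of units with no outcome data.
    """
    n_obs = len(observed_outcomes)
    if n_obs == 0:
        raise ValueError("need at least one observed outcome")
    if n_missing < 0:
        raise ValueError("n_missing must be non-negative")
    if any(y < y_min or y > y_max for y in observed_outcomes):
        raise ValueError("observed outcomes must lie in [y_min, y_max]")

    total = n_obs + n_missing
    observed_sum = sum(observed_outcomes)
    # Lower bound: every missing unit takes the smallest possible outcome.
    lower = (observed_sum + n_missing * y_min) / total
    # Upper bound: every missing unit takes the largest possible outcome.
    upper = (observed_sum + n_missing * y_max) / total
    return lower, upper


# Hypothetical example: 100 observed units, 30 of them with outcome 1,
# and a posited 100 additional units whose outcomes are unknown.
observed = [1] * 30 + [0] * 70
bounds = worst_case_mean_bounds(observed, n_missing=100)
print(bounds)
```

With these made-up numbers the observed mean is 0.30, while the full-population mean could lie anywhere between 0.15 and 0.65. Repeating the calculation over a range of plausible values for n_missing shows how the conclusions vary with what is assumed about the missing units.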
Co-sponsored with the Political Science Department, Statistics Department and the Center for Social Statistics