Simulating data on Y1, Y2 where Y2 has missing values

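Consider a two-variable (Y₁, Y₂) problem, with each variable defined as follows:

Y₁ = 1 + Z₁, and Y₁ is fully observed
Y₂ = 5 + 2*(Z₁) + Z₂, and Y₂ is missing if 2*(Y₁ − 1) + Z₃ …
Z₁, Z₂, and Z₃ follow independent standard normal distributions.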

How would we simulate a (complete) dataset of size 500 on (Y₁, Y₂)? This is what I wrote:

    n <- 500
    y <- rnorm(n)

How would we simulate the corresponding observed dataset (by imposing missingness on Y₂)? I'm not sure where to go with this part.

    n <- 500
    z1 <- rnorm(n)
    z2 <- rnorm(n)
    z3 <- rnorm(n)

    y1 <- 1 + z1
    y2 <- 5 + 2*z1 + z2

Display the marginal distribution of Y₂ for the complete (as originally simulated) and the observed (after imposing missingness) data.

Solution

You may want to include an error term in your data simulation, so add another vector with mean zero to your equation, again using rnorm(n).

    seed <- sample(1:1e3, 1)  ## one way to draw a seed at random; not used below

    set.seed(635)  ## for sake of reproducibility

    n <- 500
    z1 <- rnorm(n)
    z2 <- rnorm(n)
    y2 <- 5 + 2*z1 + z2 + rnorm(n)  ## add error term independent of the `z`s

To get the missing values, you can sample a given percentage of the vector's positions and set them to NA.

    pct.mis <- .1  ## proportion of missing values
    y2[sample(length(y2), length(y2)*pct.mis)] <- NA

    ## check 1: resulting missings
    prop.table(table(is.na(y2)))
    # FALSE  TRUE 
    #   0.9   0.1 

    summary(y2)
    #   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    # -2.627   3.372   5.123   4.995   6.643  13.653      50

    ## check 2: rounded coefficients resemble equation
    fit <- lm(y2 ~ z1 + z2)
    round(coef(fit))
    # (Intercept)          z1          z2 
    #           5           2           1 

    ## check 3: number of fitted values equals number of non-missing obs.
    length(fit$fitted.values) / length(y2)
    # [1] 0.9
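The question also asks for the marginal distribution of Y₂ before and after imposing missingness. A minimal sketch of one way to do that with the approach above is to keep a copy of the complete vector before setting values to NA. The names `y2_complete`, `y2_observed`, and `breaks` are illustrative and not part of the answer's code.

```r
set.seed(635)  # same seed as above, for reproducibility

n  <- 500
z1 <- rnorm(n)
z2 <- rnorm(n)
y2_complete <- 5 + 2 * z1 + z2 + rnorm(n)  # complete data, with error term

# copy the complete vector, then blank out 10% of positions at random
pct.mis <- .1
y2_observed <- y2_complete
y2_observed[sample(n, n * pct.mis)] <- NA

# side-by-side histograms on a common set of breaks
breaks <- pretty(y2_complete, n = 20)
op <- par(mfrow = c(1, 2))
hist(y2_complete, breaks = breaks, main = "Complete", xlab = "Y2")
hist(y2_observed, breaks = breaks, main = "Observed", xlab = "Y2")  # NAs are dropped
par(op)
```

Because the positions are drawn completely at random here, the two histograms should look roughly alike. That differs from the rule in the question, where missingness depends on Y₁.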

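In addition to @jay.sf's excellent explanation, another way to show the distributions is to build the missing-data mechanism into a new variable, y2_missing, and compare y2 with y2_missing.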
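A minimal sketch of that idea follows. The cutoff of 0 in the missingness rule is an assumption, since the condition in the question breaks off after Z₃. The seed, the indicator `r`, and the plotting choices are also illustrative.

```r
set.seed(1)  # arbitrary seed, chosen only for reproducibility

n  <- 500
z1 <- rnorm(n)
z2 <- rnorm(n)
z3 <- rnorm(n)

y1 <- 1 + z1
y2 <- 5 + 2 * z1 + z2            # complete data

# ASSUMPTION: Y2 is missing when 2*(Y1 - 1) + Z3 < 0; the cutoff 0 is
# not confirmed by the question text
r <- 2 * (y1 - 1) + z3 < 0       # TRUE where Y2 is to be deleted
y2_missing <- ifelse(r, NA, y2)  # observed data

# compare the marginal distributions numerically
summary(y2)
summary(y2_missing)

# compare them graphically with overlaid kernel density estimates
d_complete <- density(y2)
d_observed <- density(y2_missing, na.rm = TRUE)
plot(d_complete, lwd = 2, col = "blue",
     ylim = range(0, d_complete$y, d_observed$y),
     main = "Marginal distribution of Y2", xlab = "Y2")
lines(d_observed, lwd = 2, col = "red")
legend("topright", legend = c("complete", "observed"),
       col = c("blue", "red"), lwd = 2)
```

Under this assumed rule, Y₂ tends to be kept when Z₁ is large, so the observed values should sit to the right of the complete-data distribution.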
