Estimating a density from interval [start, stop] data in R

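Description

This question is motivated by clinical/epidemiological research, where studies typically enroll patients and then follow them for varying lengths of time.

The age distribution at study entry is usually of interest and easy to assess, but occasionally there is interest in the age distribution at any time during the study.

My question is whether there is a way to estimate such a density from interval data (e.g. [age_start, age_stop]) without expanding the data as shown below. The long-format approach seems inelegant, not to mention its memory usage!

Reproducible example using data from the survival package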

#### Prep Data ###
library(survival)
library(ggplot2)
library(dplyr)

data(colon,package = 'survival')
# example using the colon dataset from the survival package
ccdeath <- colon %>%
  # use data on time to death (not recurrence)
  filter(etype == 2) %>%
  # age at end of follow-up (death or censoring)
  mutate(age_last = age + (time / 365.25))

#### Distribution Using Single Value ####
# age at study entry
ggplot(ccdeath,aes(x = age)) +
  geom_density() +
  labs(title = "Fig 1.",x = "Age at Entry (years)",y = "Density")

#### Using Person-Month Level Data ####
# create counting-process/person-time dataset
ccdeath_cp <- survSplit(Surv(age,age_last,status) ~ .,data = ccdeath,cut = seq(from = floor(min(ccdeath$age)),to = ceiling(max(ccdeath$age_last)),by = 1/12))

nrow(ccdeath_cp) # over 50,000 rows

# distribution of age at person-month level
ggplot(ccdeath_cp,aes(x = age)) +
  geom_density() +
  labs(title = "Figure 2: Density based on approximate person-months",x = "Age (years)",y = "Density")

#### Using Person-Day Level Data ####
# create counting-process/person-time dataset
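ccdeath_cp <- survSplit(Surv(age,age_last,status) ~ .,data = ccdeath,cut = seq(from = floor(min(ccdeath$age)),to = ceiling(max(ccdeath$age_last)),by = 1/365.25))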

nrow(ccdeath_cp) # over 1.5 million rows!

# distribution of age at person-day level
ggplot(ccdeath_cp,aes(x = age)) +
  geom_density() +
  labs(title = "Figure 3: Density based on person-days",x = "Age (years)",y = "Density")

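Note: although I tagged this question "survival" because I thought it would attract people familiar with the area, I am not interested in time to event here, only in the overall age distribution across all study time.

Solution

Rather than computing ever finer time intervals, you can compute a cumulative count of the number of patients under observation at each age: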

library(data.table)

# convert ccdeath to a data.table by reference
setDT(ccdeath)
x <- rbind(
  ccdeath[,.(age = age,num_patients = 1)],ccdeath[,.(age = age_last,num_patients = -1)]
)[,.(num_patients = sum(num_patients)),keyby = age]

cccdeath <- x[x[,.(age = unique(age))],on = 'age']
cccdeath[,num_patients := cumsum(num_patients)]
ggplot(cccdeath,aes(x = age,y = num_patients)) + geom_step()

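The sawtooth pattern arises because every patient is assumed to start at an integer age. I thought about how to smooth this and came up with the following idea: assign equal probability to a set of evenly spaced ages between a given age and age + 1. You get something like this:

smooth_param <- 12
x <- rbindlist(lapply(
  # evenly spaced offsets within one year: 0.5/12, 1.5/12, ..., 11.5/12
  (1:smooth_param-0.5)/smooth_param,function(s) {
    rbind(
      # +1/smooth_param at each shifted entry age, -1/smooth_param at each shifted exit age
      ccdeath[,.(age = age+s,num_patients = 1/smooth_param)],
      ccdeath[,.(age = age_last+s,num_patients = -1/smooth_param)]
    )
  }
))[,.(num_patients = sum(num_patients)),keyby = age]

cccdeath <- x[x[,.(age = sort(unique(age)))],on = 'age']
cccdeath[,num_patients := cumsum(num_patients)]
ggplot(cccdeath,aes(x = age,y = num_patients)) + geom_step()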
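
The step plots above show patient counts rather than a density. As a minimal sketch in base R, the count under observation at age a can be written as the difference of two cumulative counts, and dividing by total person-time rescales the curve so that it integrates to 1. The helper name `interval_density` and the toy intervals are illustrative assumptions and are not part of the original answer.

```r
# Hypothetical helper: person-time density of age from [start, stop) intervals,
# computed without expanding the data to person-months or person-days.
interval_density <- function(start, stop) {
  stopifnot(length(start) == length(stop), all(stop >= start))
  sorted_start <- sort(start)
  sorted_stop  <- sort(stop)
  total_time   <- sum(stop - start)  # total person-time = area under the count curve
  function(a) {
    # findInterval(a, v) counts how many elements of the sorted vector v are <= a,
    # so this is (number entered by age a) - (number exited by age a)
    at_risk <- findInterval(a, sorted_start) - findInterval(a, sorted_stop)
    at_risk / total_time
  }
}

# Toy intervals, made up for illustration (ages in years)
start <- c(50, 52, 60)
stop  <- c(55, 53, 61)
dens  <- interval_density(start, stop)

dens(c(49, 51, 52.5, 54, 60.5))
```

With the question's data this would presumably be called as `interval_density(ccdeath$age, ccdeath$age_last)`, and the returned function evaluated on a grid of ages for plotting. The cost depends on the number of patients and grid points, not on how finely time is split.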

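I would approach it along the following lines:

If you are interested in the age distribution t days into the study, then that age is simply the age at enrollment plus t days. The exceptions you need to handle are patients who have died or been right-censored. In your example, you appear to "freeze" a person's age at the moment they leave the study. Personally, I think the age distribution of the uncensored survivors is more useful in survival analysis, but for this example I will stick with your setup.

So at time t there are two possibilities for each patient: if t is less than the follow-up time, the age is the age at enrollment plus t; otherwise, it is the age at enrollment plus the follow-up time.

If we wrap this in a function, we can see how the age distribution changes over the course of the study. For completeness, we will always plot the age density at enrollment as a faint dashed line, and mark the current mean age with a vertical line: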
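
As a small worked illustration of these two cases, here is a sketch on a made-up three-patient cohort (the data frame `toy` and its values are assumptions for illustration only). It uses `pmin()` to cap each patient's elapsed time at their follow-up time.

```r
# Toy cohort (made-up values): age at entry in years, follow-up time in days
toy <- data.frame(age = c(60, 70, 80), time = c(1000, 200, 365))

t <- 365  # day of study at which ages are evaluated

# Elapsed time is capped at each patient's follow-up time:
# patients still under observation have aged t days,
# patients who already left keep the age they had when they left
toy$age_at_t <- toy$age + pmin(toy$time, t) / 365.25
toy
```

Here the first patient is still in the study at day 365 and has aged 365 days, while the second left at day 200 and is held at the age reached then.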

age_distribution <- function(df,t = 0)
{
  df %>% 
    mutate(age_at_t = age + ifelse(time > t,t,time) / 365.25) %>%
    ggplot() +
    geom_density(aes(age),linetype = 2,colour = "gray50") +
    geom_density(aes(age_at_t)) +
    geom_vline(aes(xintercept = mean(age_at_t)),color = "red",linetype = 2) +
    labs(x = paste("Age at day",t,"of study"),y = "Density",title = paste("Age distribution after",t,"days in study"))
}

For example,

age_distribution(ccdeath,0)

After one year:

age_distribution(ccdeath,365)

After 5 years:

age_distribution(ccdeath,5 * 365.25)

For completeness, the equivalent function that drops censored/dead patients is as follows:

age_distribution <- function(df,t = 0)
{
  df %>% 
    filter(time > t) %>%
    mutate(age_at_t = age + t / 365.25) %>%
    ggplot() +
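    # age at enrollment for the full cohort, for reference
    geom_density(data = df,aes(age),linetype = 2,colour = "gray50") +
    geom_density(aes(age_at_t)) +
    geom_vline(aes(xintercept = mean(age_at_t)),color = "red",linetype = 2) +
    labs(x = paste("Age at day",t,"of study"),y = "Density",title = paste("Age distribution after",t,"days in study"))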
}

So we can see how the shape of the distribution changes after 5 years of follow-up:

age_distribution(ccdeath,5 * 365.25)

This shows more clearly that older patients are disproportionately lost from the original cohort.
