Why does fitting a GAM with random effects using the discrete option lead to very different output?

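I am currently using the bam() function from the mgcv package to model gene expression over time. Some genes take a long time to fit, so I tried setting the discrete option to TRUE. However, the summary output now differs substantially between the two models. This is especially noticeable in the significance of the random-effect terms and in the total deviance explained by the model.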

Does setting discrete = TRUE somehow affect how the random effects are modelled?

The models are fit as follows:

 bam(value ~ oGeno +
      s(age,bs = "gp",k = 8,m=2) +
      s(age,by = oGeno,bs ="gp",m=1) +
      s(age,sex,bs = "re") +
      s(litter,bs = "re") +
      s(hash,bs = "re") +
      s(hpool,bs = "re"),data = data,family = family,method="REML")

 bam(value ~ oGeno +
      s(age,discrete = T,method="fREML"

and they produce these summaries.

When discrete is FALSE:

Family: Tweedie(p=1.32) 
Link function: log 

Formula:
value ~ oGeno + s(age,bs = basis,k = k,m = 2) + s(age,m = 1) + s(age,bs = "re") + s(litter,bs = "re") + s(hash,bs = "re") + s(hpool,bs = "re")

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.78886    0.05635  31.744   <2e-16 ***
oGeno.L     -0.05029    0.04337  -1.159    0.246    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                      edf  Ref.df       F  p-value    
s(age)           4.813266   4.864  48.547  < 2e-16 ***
s(age):oGenomut  3.100540   3.188   7.298 7.00e-05 ***
s(age,sex)       0.005974   1.000   0.039    0.188    
s(litter)       13.606287  89.000   5.819    0.199    
s(hash)         82.078473 148.000  38.136 8.22e-06 ***
s(hpool)        24.505710  28.000 654.942  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.259   Deviance explained = 33.5%
-REML =  27952  Scale est. = 1.9063    n = 15421

When discrete is TRUE:

Family: Tweedie(p=1.32) 
Link function: log 

Formula:
value ~ oGeno + s(age,bs = "re")

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.77812    0.05850  30.393   <2e-16 ***
oGeno.L     -0.07017    0.04388  -1.599     0.11    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                    edf  Ref.df       F p-value    
s(age)           4.8187   4.870  31.101 < 2e-16 ***
s(age):oGenomut  3.1031   3.192   4.819 0.00108 ** 
s(age,sex)       0.3756   1.000 122.771 0.98174    
s(litter)       13.3135  93.000  68.682 1.00000    
s(hash)         81.8371 152.000 140.906 1.00000    
s(hpool)        24.5326  29.000 698.652 1.00000    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.259   Deviance explained = 27.6%
fREML =  28369  Scale est. = 1.9063    n = 15421

Solution

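What happens if you use the default method = "fREML" in the first model fit, without discrete? The two fits don't look hugely different in terms of EDFs, so I suspect there is some rank deficiency in the fit that "REML" can handle but "fREML", or the discrete = TRUE part, can't. So check the fit from bam() with method = "fREML" and no discretization.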
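The check suggested above can be sketched as follows. This is only an illustration on simulated data: the data frame `dat`, the simplified formula `f`, and the fixed Tweedie power of 1.3 are assumptions standing in for the real data and model, which are not available here. The idea is to change one thing at a time (first the smoothness selection method, then the discretization) and see at which step the summaries start to diverge.

```r
library(mgcv)

set.seed(1)
n <- 2000

# Hypothetical stand-in data: a smooth age effect plus a litter random effect
dat <- data.frame(
  age    = runif(n, 0, 10),
  litter = factor(sample(1:30, n, replace = TRUE))
)
b  <- rnorm(nlevels(dat$litter), sd = 0.3)
mu <- exp(1 + sin(dat$age / 2) + b[as.integer(dat$litter)])
dat$value <- rTweedie(mu, p = 1.3, phi = 1)

# Simplified formula and family (assumptions, not the original model)
f   <- value ~ s(age, bs = "cr", k = 8) + s(litter, bs = "re")
fam <- Tweedie(p = 1.3, link = "log")

# Same model, three fitting set-ups
m_reml  <- bam(f, data = dat, family = fam, method = "REML")
m_freml <- bam(f, data = dat, family = fam, method = "fREML")
m_disc  <- bam(f, data = dat, family = fam, method = "fREML", discrete = TRUE)

fits <- list(REML = m_reml, fREML = m_freml, fREML_discrete = m_disc)

# Total EDF and deviance explained, side by side
comparison <- data.frame(
  total_edf = sapply(fits, function(m) sum(m$edf)),
  dev_expl  = sapply(fits, function(m) summary(m)$dev.expl)
)
print(comparison)

# Per-term tables (edf, Ref.df, F, p-value) for each fit
lapply(fits, function(m) summary(m)$s.table)
```

If the "REML" and plain "fREML" rows already disagree, the discretization is probably not the culprit; if only the last row differs, it is.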

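Because you are using gp smooths, the m arguments aren't doing what you might think they are; these smooths don't have derivative-based penalties. Instead, m = 1 is the default, which fits a GP with a spherical covariance function. I therefore suspect that the global and subject-specific smooths are highly correlated, and that this is causing fitting problems for one of the algorithms.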
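To make the point about m concrete, here is a small sketch on simulated data (the variables `x` and `y` and the model names are made up). As I read ?gp.smooth, the first element of m picks the correlation function for a gp smooth (1 = spherical, 2 = power exponential, 3 to 5 = Matérn variants) rather than the order of a derivative penalty, so changing m swaps the covariance model instead of changing how wiggliness is penalized.

```r
library(mgcv)

set.seed(3)
x <- runif(300, 0, 10)
y <- sin(x) + rnorm(300, sd = 0.3)

# Same data, same basis size; only the gp correlation function differs
gp_sph <- gam(y ~ s(x, bs = "gp", k = 8, m = 1), method = "REML")  # spherical
gp_pow <- gam(y ~ s(x, bs = "gp", k = 8, m = 2), method = "REML")  # power exponential

# Compare the effective degrees of freedom of the two fits
edf_by_kernel <- c(spherical = sum(gp_sph$edf), power_exp = sum(gp_pow$edf))
print(edf_by_kernel)
```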

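I don't see any reason for you to use a GP smooth here. If speed is the issue, use bs = "cr" for a simple cubic regression spline. And because those have derivative-based penalties, m = 1 should help avoid problems from fitting multiple smooths of age.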
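A minimal sketch of that suggestion, again on simulated data: `dat`, the two-level ordered factor `oGeno`, and the single `litter` random effect are assumptions that only mimic the structure of the question. One caveat: ?s notes that only some smooth classes use m, so it is worth refitting without m = 1 and confirming that it actually changes the fit for the basis you pick.

```r
library(mgcv)

set.seed(2)
n <- 1500

# Hypothetical stand-in data with an ordered genotype factor
dat <- data.frame(
  age    = runif(n, 0, 10),
  oGeno  = ordered(sample(c("wt", "mut"), n, replace = TRUE),
                   levels = c("wt", "mut")),
  litter = factor(sample(1:25, n, replace = TRUE))
)
b   <- rnorm(nlevels(dat$litter), sd = 0.3)
eta <- 1 + sin(dat$age / 2) +
  0.3 * (dat$oGeno == "mut") * cos(dat$age / 3) +
  b[as.integer(dat$litter)]
dat$value <- rTweedie(exp(eta), p = 1.3, phi = 1)

# Global smooth of age plus a genotype difference smooth, both bs = "cr"
m_cr <- bam(value ~ oGeno +
              s(age, bs = "cr", k = 8) +
              s(age, by = oGeno, bs = "cr", m = 1) +
              s(litter, bs = "re"),
            data = dat,
            family = Tweedie(p = 1.3, link = "log"),
            method = "fREML", discrete = TRUE)

summary(m_cr)
```

Because oGeno is an ordered factor, the by smooth is fitted only for the non-reference level, which matches the single s(age):oGenomut row in the summaries in the question.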