如何使用间接引用将 PCA 输出值映射/返回到原始 data.frame？

如何解决如何使用间接引用将 PCA 输出值映射/返回到原始 data.frame？

假设我们有如下元数据和特征计数的数据（来自 phyloseq R 包的 GlobalPatterns）：

1. Metadata:
> clipr::write_clip(knitr::kable(head(sample_data)))

|        |X.SampleID |Primer  |Final_Barcode |Barcode_truncated_plus_T |Barcode_full_length |SampleType |Description                                |
|:-------|:----------|:-------|:-------------|:------------------------|:-------------------|:----------|:------------------------------------------|
|CL3     |CL3        |ILBC_01 |AACGCA        |TGCGTT                   |CTAGCGTGCGT         |Soil       |Calhoun South Carolina Pine soil,pH 4.9   |
|CC1     |CC1        |ILBC_02 |AACTCG        |CGAGTT                   |CATCGACGAGT         |Soil       |Cedar Creek Minnesota,grassland,pH 6.1   |
|SV1     |SV1        |ILBC_03 |AACTGT        |ACAGTT                   |GTACGCACAGT         |Soil       |Sevilleta new Mexico,desert scrub,pH 8.3 |
|M31Fcsw |M31Fcsw    |ILBC_04 |AAGAGA        |TCTCTT                   |TCGACATCTCT         |Feces      |M3,Day 1,fecal swab,whole body study    |
|M11Fcsw |M11Fcsw    |ILBC_05 |AAGCTG        |CAGCTT                   |CGACTGCAGCT         |Feces      |M1,whole body study    |
|M31Plmr |M31Plmr    |ILBC_07 |AATCGT        |ACGATT                   |CGAGTCACGAT         |Skin       |M3,right palm,whole body study    |

2. OTU/features count:
> clipr::write_clip(knitr::kable(head(t(ps_otu))))

|       | CL3| CC1| SV1| M31Fcsw| M11Fcsw| M31Plmr| M11Plmr| F21Plmr| M31Tong| M11Tong| LMEpi24M| SLEpi20M|   AQC1cm|   AQC4cm|   AQC7cm|     NP2| NP3| NP5| TRRsed1| TRRsed2| TRRsed3| TS28| TS29| Even1| Even2| Even3|
|:------|---:|---:|---:|-------:|-------:|-------:|-------:|-------:|-------:|-------:|--------:|--------:|--------:|--------:|--------:|-------:|---:|---:|-------:|-------:|-------:|----:|----:|-----:|-----:|-----:|
|549322 |   0|   0|   0|       0|       0|       0| 0.0e+00|       0|       0|       0|        0|    8e-07| 2.31e-05| 4.24e-05| 7.65e-05| 1.9e-06|   0|   0|       0|       0|       0|    0|    0|     0|     0|     0|
|522457 |   0|   0|   0|       0|       0|       0| 0.0e+00|       0|       0|       0|        0|    0e+00| 0.00e+00| 8.00e-07| 3.50e-06| 0.0e+00|   0|   0|       0|       0|       0|    0|    0|     0|     0|     0|
|951    |   0|   0|   0|       0|       0|       0| 2.3e-06|       0|       0|       0|        0|    0e+00| 0.00e+00| 0.00e+00| 0.00e+00| 0.0e+00|   0|   0|       0|       0|       0|    0|    0|     0|     0|     0|
|244423 |   0|   0|   0|       0|       0|       0| 0.0e+00|       0|       0|       0|        0|    0e+00| 0.00e+00| 9.30e-06| 1.71e-05| 0.0e+00|   0|   0|       0|       0|       0|    0|    0|     0|     0|     0|
|586076 |   0|   0|   0|       0|       0|       0| 0.0e+00|       0|       0|       0|        0|    0e+00| 0.00e+00| 8.00e-07| 6.00e-07| 0.0e+00|   0|   0|       0|       0|       0|    0|    0|     0|     0|     0|
|246140 |   0|   0|   0|       0|       0|       0| 0.0e+00|       0|       0|       0|        0|    0e+00| 0.00e+00| 4.00e-07| 1.80e-06| 0.0e+00|   0|   0|       0|       0|       0|    0|    0|     0|     0|     0|

简要说明：2.OTU/特征计数表的第一列解释了该表每一行的特征名称。注意：计数表是百分比， rowSums(ps_otu) 返回一系列的。

我们想知道 sample_data$SampleType 引起或预测 PC 响应的程度。据我们所知，我们可以通过广义线性模型 (GLM) 拟合来评估 PC 与元数据之间的关联。

PC 特征加载作为因变量（~ 的左侧）和 SampleType 列作为自变量（~ 的右侧）。

该方法来自以下文章：https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7224172/【16s分析部分】

输入：

library(tidyverse)
library(phyloseq)

# Load data
data("GlobalPatterns") # from phyloseq package
ps <- GlobalPatterns
ps <- transform_sample_counts(ps,function(x)
  x / sum(x))

ps_otu <- otu_table(ps) %>% t() %>% as.data.frame() 
# transposed because prcomp() expects taxa/asv/features as columns,observation/sample as rows ?

sample_data <- sample_data(ps) %>% data.frame()

PCA <- prcomp(ps_otu,center = T) 
# no need to scale because of prevIoUs transform_sample_counts() ?

PCA_load <- PCA$rotation[,1:4] %>%
  as.data.frame() %>%
  rownames_to_column(var = "ID") %>%
  filter((abs(PC1)) &
         (abs(PC2)) &
         (abs(PC3)) &
         (abs(PC4)) >= 0.05) # retain only first four PCs with loading value at least 0.05

sam_data_extended <-
  left_join(sample_data,rownames_to_column(ps_otu),by = c("X.SampleID" = "rowname")) %>%
  select(1:8 | PCA_load$ID)

输出：

> head(PCA_load)

|ID     |        PC1|        PC2|        PC3|        PC4|
|:------|----------:|----------:|----------:|----------:|
|12812  | -0.0097111|  0.0105137| -0.0081787|  0.0647053|
|317658 | -0.0140822| -0.1206650|  0.0345392|  0.0575640|
|329744 | -0.0252158| -0.1705315|  0.0523675|  0.1648879|
|471141 | -0.0080613| -0.0100105| -0.0978075| -0.0626787|
|64396  | -0.0240414| -0.0350119| -0.1862152| -0.0801919|
|549656 |  0.9686046| -0.0324051|  0.0109973| -0.0890339|

> head(sam_data_extended)

|X.SampleID |Primer  |Final_Barcode |Barcode_truncated_plus_T |Barcode_full_length |SampleType |Description                                | 549322|    12812|   317658|   329744|    471141|     64396|    549656|   279599|   557211|    263681|    360229|    536311|      9514|     94166|    550960|    484436|   323845|   228764|   534609|  319044|   87194|    322235|    158660|    331820|    244304|    471122|    361496|    348374|    291090|    259569|    316732|    171551|    114821|    145149|
|:----------|:-------|:-------------|:------------------------|:-------------------|:----------|:------------------------------------------|------:|--------:|--------:|--------:|---------:|---------:|---------:|--------:|--------:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|--------:|--------:|--------:|-------:|-------:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|
|CL3        |ILBC_01 |AACGCA        |TGCGTT                   |CTAGCGTGCGT         |Soil       |Calhoun South Carolina Pine soil,pH 4.9   |      0| 6.90e-06| 1.74e-05| 2.43e-05| 0.0000093| 0.0000116| 0.0002222| 2.78e-05| 1.39e-05| 0.0000486| 0.0000150| 0.0000127| 0.0000127| 0.0000417| 0.0002141| 0.0000035| 9.30e-06| 1.50e-05| 1.39e-05| 8.1e-06| 2.3e-06| 0.0001030| 0.0001632| 0.0004965| 0.0001296| 0.0000162| 0.0000509| 0.0000197| 0.0000301| 0.0000417| 0.0000590| 0.0002187| 0.0000174| 0.0000104|
|CC1        |ILBC_02 |AACTCG        |CGAGTT                   |CATCGACGAGT         |Soil       |Cedar Creek Minnesota,pH 6.1   |      0| 1.23e-05| 1.67e-05| 2.03e-05| 0.0000176| 0.0000159| 0.0003373| 2.03e-05| 1.41e-05| 0.0000229| 0.0000203| 0.0000123| 0.0000247| 0.0000616| 0.0002228| 0.0000106| 6.20e-06| 8.80e-06| 9.70e-06| 1.8e-06| 1.8e-06| 0.0000255| 0.0000749| 0.0001497| 0.0000229| 0.0000106| 0.0000220| 0.0000062| 0.0000097| 0.0000396| 0.0000141| 0.0000423| 0.0000167| 0.0000273|
|SV1        |ILBC_03 |AACTGT        |ACAGTT                   |GTACGCACAGT         |Soil       |Sevilleta new Mexico,pH 8.3 |      0| 1.29e-05| 6.45e-05| 9.61e-05| 0.0000186| 0.0000143| 0.0005663| 5.62e-04| 1.86e-05| 0.0000487| 0.0000172| 0.0000215| 0.0000115| 0.0000473| 0.0002323| 0.0000057| 1.58e-05| 8.60e-06| 1.29e-05| 2.9e-06| 4.3e-06| 0.0000344| 0.0001075| 0.0002996| 0.0000473| 0.0000115| 0.0000215| 0.0000129| 0.0000100| 0.0000516| 0.0000186| 0.0000789| 0.0000172| 0.0000100|
|M31Fcsw    |ILBC_04 |AAGAGA        |TCTCTT                   |TCGACATCTCT         |Feces      |M3,whole body study    |      0| 4.50e-06| 8.40e-06| 1.81e-05| 0.0000097| 0.0000175| 0.0000849| 1.94e-05| 9.10e-06| 0.0000240| 0.0000214| 0.0000084| 0.0000389| 0.0001153| 0.0000091| 0.0000045| 7.10e-06| 8.40e-06| 5.80e-06| 3.2e-06| 6.0e-07| 0.0620752| 0.0535722| 0.2298065| 0.0000557| 0.0000078| 0.0617629| 0.0391013| 0.0173274| 0.0209990| 0.0533208| 0.0328847| 0.0000706| 0.0000499|
|M11Fcsw    |ILBC_05 |AAGCTG        |CAGCTT                   |CGACTGCAGCT         |Feces      |M1,whole body study    |      0| 6.30e-06| 1.01e-05| 2.17e-05| 0.0000058| 0.0000231| 0.0000751| 5.73e-05| 1.11e-05| 0.1018644| 0.0000130| 0.0000053| 0.0000063| 0.0000318| 0.0000154| 0.0000024| 4.80e-06| 1.49e-05| 4.80e-06| 1.9e-06| 1.4e-06| 0.0231705| 0.1177721| 0.2177820| 0.1067559| 0.0000072| 0.0007469| 0.0105332| 0.0304699| 0.0374307| 0.0061484| 0.0065838| 0.0000159| 0.0000140|
|M31Plmr    |ILBC_07 |AATCGT        |ACGATT                   |CGAGTCACGAT         |Skin       |M3,whole body study    |      0| 9.70e-06| 1.25e-05| 2.09e-05| 0.0127910| 0.0467422| 0.0000946| 3.62e-05| 1.53e-05| 0.0000167| 0.0160402| 0.0052994| 0.0155256| 0.0401896| 0.0000209| 0.0016858| 1.25e-05| 1.11e-05| 5.60e-06| 1.4e-06| 0.0e+00| 0.0000876| 0.0000570| 0.0002991| 0.0000278| 0.0106253| 0.0000167| 0.0000167| 0.0000209| 0.0000515| 0.0000389| 0.0000431| 0.0189653| 0.0129051|

通过像下面的代码一样运行 GLM 函数，我们得到 SampleTypeOcean 是 12812 个特征（这是一个特征名称）的统计显着预测器。

glm <- glm(`12812` ~ SampleType,family = "gaussian",data = sam_data_extended)

Call:
glm(formula = `12812` ~ SampleType,data = sam_data_extended)

Deviance Residuals: 
      Min         1Q     Median         3Q        Max  
-0.035425  -0.000021  -0.000001   0.000015   0.051801  

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)   
(Intercept)                   1.306e-05  7.881e-03   0.002  0.99870   
SampleTypeFreshwater          5.364e-06  1.365e-02   0.000  0.99969   
SampleTypeFreshwater (creek)  1.810e-04  1.204e-02   0.015  0.98818   
SampleTypeMock                1.374e-05  1.204e-02   0.001  0.99910   
SampleTypeOcean               4.602e-02  1.204e-02   3.823  0.00136 **
SampleTypeSediment (estuary)  1.718e-03  1.204e-02   0.143  0.88816   
SampleTypeSkin                2.088e-05  1.204e-02   0.002  0.99864   
SampleTypeSoil               -2.330e-06  1.204e-02   0.000  0.99985   
SampleTypetongue              1.033e-03  1.365e-02   0.076  0.94057   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(dispersion parameter for gaussian family taken to be 0.0002484158)

    Null deviance: 0.0097674  on 25  degrees of freedom
Residual deviance: 0.0042231  on 17  degrees of freedom
AIC: -133.07

Number of Fisher Scoring iterations: 2

然而，我们无法直接推断 PC 是如何受 SampleType 影响的。我们尝试了类似 full_join(sample_data,PCA_load) 的方法，但似乎使用这样的连接数据集运行 glm() 会产生奇怪的不显着 p 值（即所有预测变量都有一个 p 值）。

有更好的方法吗？

谢谢。

编辑：归功于 Biocord 社区的 @cldbrk，broom::tidy() 函数几乎解决了这个问题。但是，我仍然无法将 PCA$rotation 中的信息链接/桥接/连接到原始 ps_otu 对象行（包含观察）。

broom::tidy(PCA,matrix="loadings") 

# A tibble: 499,616 x 3
   column    PC        value
   <chr>  <dbl>        <dbl>
 1 549322     1  0.0000954  
 2 549322     2 -0.00000256 
 3 549322     3  0.00000124 
 4 549322     4 -0.00000739 
 5 549322     5  0.000000362
 6 549322     6 -0.00000139 
 7 549322     7  0.00000215 
 8 549322     8  0.00000154 
 9 549322     9 -0.000000883
10 549322    10 -0.000000477
# ... with 499,606 more rows

我期待像 broom::tidy(PCA) 这样的输出，但它的第三列替换为 broom::tidy(PCA,matrix = "loadings") 的第三列。考虑到这两个输出的行长，我认为仅仅 rbind() 是不正确的。