Function to track field changes

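I need a function (in base SAS or RStudio) to determine the ID number as of a given date and the original (root) ID number as of a start date. The dataset contains the old ID, the new ID, and the date the ID changed. Sample data: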

OldID	NewID	ChangeDate
1	2	1/1/10
10	11	1/1/10
2	3	7/1/10
3	4	7/10/10
11	12	8/1/10

I need to know the ID number as of 12/13/10 and the original (root) ID (as of 1/1/10). The output should look like this:

OrigID	LastID
1	4
10	11

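Then I need a flag to help me count the number of OrigIDs that changed within a given interval (in this case, 1/1/10 to 7/15/10). I also need similar counts for multiple dates after 12/13/10.

Is there a function in base SAS or RStudio that can do this?

The features I have looked into in SAS/R (hierarchical loggers, synchronization tracking, sequence tracking) do not seem to fit (e.g., logger, lumberjack, log4r, validate, futile.logger).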

Answers

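This should work; I was just too lazy to type in proper dates, so integers stand in for them.

Note: this assumes the data is sorted in the order the changes occurred.

Data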

df <- data.frame(
  OldID = c(1,10,2,3,11),NewID = c(2,11,3,4,12),ChangeDate = c(1,1,2,2,3))
df
#>   OldID NewID ChangeDate
#> 1     1     2          1
#> 2    10    11          1
#> 3     2     3          2
#> 4     3     4          2
#> 5    11    12          3
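The integer ChangeDate values above stand in for real dates. As a rough sketch, the same frame could be built with the dates from the question, assuming they are in month/day/two-digit-year format. The name df_dates is introduced here for illustration only.

```r
# Hypothetical variant of the sample data with real dates (m/d/yy assumed).
df_dates <- data.frame(
  OldID = c(1, 10, 2, 3, 11),
  NewID = c(2, 11, 3, 4, 12),
  ChangeDate = as.Date(c("1/1/10", "1/1/10", "7/1/10", "7/10/10", "8/1/10"),
                       format = "%m/%d/%y")
)

# Enforce the ordering assumption: oldest change first.
df_dates <- df_dates[order(df_dates$ChangeDate), ]
df_dates
```

Date objects support >= and <=, so a frame like this can be filtered on a date range in the same way as the integer version.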

Function

process <- function(df,from,to) {
  process0 <- function(df,i = 1){
    # fetch new value
    new <- df$NewID[i]
    # check in old column
    j <- match(new,df$OldID)
    
    if(is.na(j)) {
      # if not matched,set i to next row
      i <- i + 1
    } else {
      # else we update current row with new "new" value
      df$NewID[i] <- df$NewID[j]
      # and increment the changes
      df$Changes[i] <- df$Changes[i] + 1
      # and remove obsolete row
      df <- df[-j,]
    }
    # do it all over again except if there is no next row
    if(i <= nrow(df)) process0(df,i) else df
  }
  # filter data frame
  df <- subset(df,ChangeDate >= from & ChangeDate <= to,select = c("OldID","NewID"))
  # start with 1 change per line
  df$Changes <- 1
  # run recursive function
  process0(df)
}

Result

process(df,1,2)
#>   OldID NewID Changes
#> 1     1     4       3
#> 2    10    11       1

Created on 2021-06-09 by the reprex package (v0.3.0)
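The question also asks for a flag and a count of original IDs that changed in the interval. One possible way to derive both from a result shaped like the output above is sketched here. The frame res is typed in by hand rather than computed, and the Changed column and n_changed variable are names made up for this example. Note that "root" here means root relative to the filtered window: if the window starts after an ID's first change, the first OldID seen inside the window is reported instead.

```r
# Result shaped like the output of process() above, entered by hand for illustration.
res <- data.frame(OldID = c(1, 10), NewID = c(4, 11), Changes = c(3, 1))

# Flag each root ID that changed at least once in the interval, then count the flags.
res$Changed <- res$Changes > 0
n_changed <- sum(res$Changed)
n_changed
```

Repeating this for several intervals gives one count per interval.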

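SAS has a number of tools for finding connected subgraphs in a graph defined by a table of [OLDID, NEWID] edges, for example PROC OPTNET from SAS/OR, or the %SUBGRAPHS macro created by PGStats.

So let's first convert your listing into an actual dataset.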

data have ;
  input OldID NewID Date :mmddyy.;
  format date yymmdd10.;
cards;
1 2 1/1/10
10 11 1/1/10
2 3 7/1/10
3 4 7/10/10
11 12 8/1/10
;

Then call the %SUBGRAPHS() macro to get the CLUST (subgraph ID) computed for each node.

%SubGraphs(have,from=oldid,to=newid,out=clusters);

Now join that back to the original data so you have the dates.

proc sql;
  create table groups as 
    select distinct a.clust,b.*
    from clusters a
    inner join have b
      on a.node = b.oldid or a.node=b.newid
    order by a.clust,b.date
  ;
quit;

Once the records in your data are matched to a subgraph ID, finding the first/last node for any date range is straightforward:

data want ;
  do until (last.clust);
    set groups;
    by clust date;
    where '01JAN2010'd <= date <= '15JUL2010'd;
    if first.clust then origid=oldid;
  end;
  lastid=newid;
  keep origid lastid ;
run;

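Of course, if you really want to filter the data by date before searching for subgraphs, you may end up with more subgraphs, because you may have eliminated the edges that connect two sets of nodes.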

Here is an attempt using igraph, although there may be a simpler way. I'll start with the data.

dat <- read.table(text="OldID   NewID   ChangeDate
1   2   1/1/10
10  11  1/1/10
2   3   7/1/10
3   4   7/10/10
11  12  8/1/10",header=TRUE)

dat$ChangeDate <- as.Date(dat$ChangeDate,format="%m/%d/%y")

Set up the graph with attributes

library(igraph)
g <- graph.data.frame(dat)
V(g)$ChangeDate <- dat$ChangeDate[match(V(g)$name,dat$NewID)]
V(g)$ChangeDate[is.na(V(g)$ChangeDate)] <- -Inf

Subset the graph to the date range

g <- induced_subgraph(g,which(V(g)$ChangeDate <= as.Date("2010-07-15")))

Find the start and end points, and the paths between them

din  <- degree(g,mode="in")
st   <- names(din[din == 0])
dout <- degree(g,mode="out")
en   <- names(dout[dout== 0])

fin  <- lapply(st,all_simple_paths,graph=g,to=en)

Query the paths

t(sapply(
  unlist(fin,recursive=FALSE),function(x) {
    # vertex names along the path, from root to last ID
    nx <- as.numeric(names(x))
    c(OldID=head(nx,1),NewID=tail(nx,1),Changes=length(x)-1)
  }
))

#     OldID NewID Changes
#[1,]     1     4       3
#[2,]    10    11       1

Another possibility uses igraph and its function components. I borrow "dat" from @thela. I use data.table for the data wrangling steps, but this could of course also be done in base R.

library(igraph)
library(data.table)

mem = components(graph_from_data_frame(dat))$membership
setDT(dat)
dat[.(as.integer(names(mem))),on = .(OldID),mem := mem]
dat[ChangeDate <= as.Date("2010-07-15"),.(OldID = OldID[1],NewID = NewID[.N],changes = .N),by = mem]
#    mem OldID NewID changes
# 1:   1     1     4       3
# 2:   2    10    11       1

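Use graph_from_data_frame to create a graph in which the 'OldID' and 'NewID' columns are treated as an edge list.
Use components to get the connected components of the graph, i.e. the IDs that are linked directly or indirectly.
Select the membership element to get the component each vertex belongs to.
Join the membership to the original data on 'OldID'.
Subset on date in i and grab the relevant data grouped by membership, i.e. the first 'OldID', the last 'NewID', and the number of rows using .N.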
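The wrangling steps listed above could also be written in base R, as the answer suggests. A minimal sketch, still relying on igraph for the component membership and assuming the rows are sorted by date; sub and res_base are names introduced here for illustration:

```r
library(igraph)

# Sample data from the question, with ISO-formatted dates.
dat <- data.frame(
  OldID = c(1, 10, 2, 3, 11),
  NewID = c(2, 11, 3, 4, 12),
  ChangeDate = as.Date(c("2010-01-01", "2010-01-01", "2010-07-01",
                         "2010-07-10", "2010-08-01"))
)

# Component membership per vertex, looked up by vertex name.
mem <- components(graph_from_data_frame(dat))$membership
dat$mem <- unname(mem[as.character(dat$OldID)])

# Keep changes up to the cutoff, then summarise each component:
# first OldID, last NewID, and number of changes.
sub <- dat[dat$ChangeDate <= as.Date("2010-07-15"), ]
res_base <- do.call(rbind, lapply(split(sub, sub$mem), function(g) {
  data.frame(mem = g$mem[1], OldID = g$OldID[1],
             NewID = g$NewID[nrow(g)], changes = nrow(g))
}))
res_base
```

As in the data.table version, taking the first and last row per group only gives the root and latest ID if the rows within each component are in change order.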

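SAS hashes can track the ID roots and the changes. The changes hash can be set up for the linked-list traversal needed to determine the as-of state.

Example:

Root data rows are needed for the first appearance of each ID. The as-of dates are specified in a temporary array, but could also come from an external dataset loaded into a third hash for iterating over the as-of dates.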

data have; 
attrib OldID NewID length=8 ChangeDate informat=mmddyy10. format=mmddyy10.;
input OldID NewID   ChangeDate;
datalines;
.   1   3/1/09
.  10   6/1/09
1   2   1/1/10
10  11  1/1/10
2   3   7/1/10
3   4   7/10/10
11  12  8/1/10
;

data want(keep=asof origID currID changeCount);
  attrib asof format=mmddyy10. origID currID length=8;

  declare hash roots();
  roots.defineKey('NewID');
  roots.defineData('NewID','ChangeDate');
  roots.defineDone();

  declare hash changes();
  changes.defineKey('OldID');
  changes.defineData('NewID','ChangeDate');
  changes.defineDone();

  do while (not done);
    set have end=done;
    if missing(OldID)
      then roots.add();
      else changes.add();    
  end;

  array asofs (7) _temporary_ (
    '01-JAN-2009'd
    '15-MAY-2009'd
    '15-SEP-2009'd
    '15-MAR-2010'd
    '02-JUL-2010'd
    '15-JUL-2010'd
    '15-AUG-2010'd
  );

  declare hiter hi('roots');

  do index = 1 to dim(asofs);
    asof = asofs(index);

    do while (hi.next() eq 0);

      origID = newID;
      currID = .;

      do changeCount = -1 by 1 while (ChangeDate <= asof);
        currID = NewID;
        if changes.find(key:NewID) ne 0 then leave;
      end;

      output;
    end;
  end;
  stop;
run;
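For R users, the same linked-list idea can be sketched without hashes: start at a root ID and keep following OldID to NewID while the change date is on or before the as-of date. This is not a translation of the SAS step above. It assumes each OldID appears at most once, that there are no cycles, and that roots are simply the OldIDs that never appear as a NewID (creation dates of the roots are ignored). The function id_as_of and all variable names are made up for this example.

```r
changes <- data.frame(
  OldID = c(1, 10, 2, 3, 11),
  NewID = c(2, 11, 3, 4, 12),
  ChangeDate = as.Date(c("2010-01-01", "2010-01-01", "2010-07-01",
                         "2010-07-10", "2010-08-01"))
)

# Follow the chain from `root`, stopping at the first change after `asof`.
id_as_of <- function(root, asof, changes) {
  cur <- root
  n <- 0
  repeat {
    j <- match(cur, changes$OldID)
    if (is.na(j) || changes$ChangeDate[j] > asof) break
    cur <- changes$NewID[j]
    n <- n + 1
  }
  data.frame(origID = root, currID = cur, changeCount = n)
}

asofs <- as.Date(c("2010-03-15", "2010-07-15", "2010-08-15"))
roots <- setdiff(changes$OldID, changes$NewID)

# One row per (as-of date, root) combination; root varies fastest.
grid <- expand.grid(root = roots, i = seq_along(asofs))
want <- do.call(rbind, lapply(seq_len(nrow(grid)), function(k) {
  id_as_of(grid$root[k], asofs[grid$i[k]], changes)
}))
want <- data.frame(asof = asofs[grid$i], want)
want
```

Counting the rows with changeCount > 0 for each as-of date then gives the number of original IDs that had changed by that date.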

Here is my approach with purrr::reduce:

library(dplyr)

split(df,seq(nrow(df))) %>%
  purrr::reduce(~ { index <- which(.y$OldID == .x$NewID)
                    if (length(index) > 0)
                      mutate(.x,NewID = replace(NewID,index,.y$NewID))
                    else bind_rows(.x,.y) }) %>%
  transmute(OrigID = OldID,LastID = NewID)

  OrigID LastID
1      1      4
2     10     12
