Proc SQL - join duplicating values

I have two datasets:

data snippet1;
  input ID callDate :ddmmyy. start_date :ddmmyy. end_date :ddmmyy. cured ;
  format Date start_date end_date  ddmmyy10.;
datalines4;
001 30/11/2020  28/11/2020  01/12/2020  Cured
001 01/12/2020  28/11/2020  01/12/2020  Cured
001 30/12/2020  28/12/2020  04/01/2021  Not Cured
001 31/12/2020  28/12/2020  04/01/2021  Not Cured
001 01/02/2021  28/01/2021  01/02/2021  Cured
    ;;;;

data have1;
  input ID  event_date :ddmmyy. description ;
  format  event_date ddmmyy10.;
datalines4;
001 28Oct2020   
001 29Nov2020   
001 29Nov2020   New Plan
001 30Nov2020   
001 01Dec2020   
001 01Dec2020   New Plan
001 01Dec2020   Stop Category
001 01Dec2020   Review Date
001 02Dec2020   
001 02Dec2020   OLd Contact Strategy Level
001 02Dec2020   
001 04Dec2020   Stop Category
001 04Dec2020   Review Date
001 29Dec2020   
001 29Dec2020   New Plan
001 30Dec2020   
001 31Dec2020   
001 01Jan2021   
001 01Jan2021   
001 02Jan2021   
001 04Jan2021   
001 05Jan2021   OLd Contact Strategy Level
001 05Jan2021   
001 29Jan2021   
001 29Jan2021   New Plan
001 30Jan2021   
001 31Jan2021   
001 01Feb2021   
001 01Feb2021   
001 02Feb2021   
001 02Feb2021   OLd Contact Strategy Level
001 02Feb2021
    ;;;;

I am trying to get, basically, Snippet1 with two extra columns called Description1 and Description2, which hold respectively the first and last description for every period between Calldate and Calldate+2. So for

Calldate = 02Dec2020, Description1 = OLd Contact Strategy Level and Description2 = Review Date

I have more IDs of course, but I think having one is enough to see my issue.

Here is my current code:

proc sql;
create table want as
select a.*,min(c.description) as description1,max(c.description) as description2
from snippet1 a
inner join
        have1 c
on a.id= c.id
and a.calldate<= c.event_date
and c.event_date <= a.calldate+ 2
Group by 1;
Quit;

But in the result, the dates are repeated several times, and I'm not even sure that all the call dates are in there correctly.

Does anyone have an idea?

Solution

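You can't get what you want with pure SQL, or at least not without some modification to your data. SQL doesn't respect "order", and SAS's SQL implementation especially doesn't (it does not allow ordering of intermediate queries). So your request for the "last" row can't succeed: as far as SQL is concerned, all of the rows within calldate+2 are equivalent, and even if the request worked you would effectively get any one of them at random. (It isn't actually random, but you should treat it as if it were when writing code, and only do this if you really don't care which row you get.)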
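To make the ordering point concrete, here is a minimal sketch. It assumes ID and description are character variables (as in the corrected data steps further down), and the table name want_alpha is hypothetical. Grouping by every snippet1 column returns one row per call, avoiding the remerge that SAS performs when `select a.*` is combined with `group by 1`. However, min() and max() on a character column compare values alphabetically, so they do not return the first and last description by position.

```sas
proc sql;
  /* want_alpha is a hypothetical name used only for this illustration */
  create table want_alpha as
  select a.id, a.calldate, a.start_date, a.end_date, a.cured,
         min(c.description) as description1,  /* alphabetical minimum, not the first row */
         max(c.description) as description2   /* alphabetical maximum, not the last row  */
  from snippet1 a
  left join have1 c
    on a.id = c.id
   and c.event_date between a.calldate and a.calldate + 2
   and c.description is not null
  /* every non-aggregated column is listed, so nothing is remerged */
  group by a.id, a.calldate, a.start_date, a.end_date, a.cured;
quit;
```

For the 30/11/2020 call, for example, the alphabetical maximum in the window is Stop Category, whereas the last description in row order is OLd Contact Strategy Level.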

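To do this in SQL, you have to add an ordering field. That is technically possible inside SQL itself using monotonic(), but it is not recommended because monotonic() is an undocumented function. It's better to add the field in a data step view.

First: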

data have1_v/view=have1_v;
    set have1;
    by id;
    if first.id then id_row=0;
    id_row+1;
run;

That establishes the order. Then:

proc sql;
  select snippet1.*,( select description from have1_v where have1_v.id=snippet1.id
                                        and have1_v.event_Date between snippet1.calldate and snippet1.calldate+2
                                        and have1_v.description is not null
                                    having have1_v.id_row = min(have1_v.id_row)
    )as min_descript,( select description from have1_v where have1_v.id=snippet1.id
                                        and have1_v.event_Date between snippet1.calldate and snippet1.calldate+2
                                        and have1_v.description is not null
                                    having have1_v.id_row = max(have1_v.id_row)
    )as max_descript
    from snippet1;
quit;

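This grabs your "min" and "max" descriptions. I think it returns what you are asking for, although there is no row in snippet1 that matches the 12/2 date you mention in the question.

All that said, SAS has better tools for this kind of thing, where order matters. The SAS data step, for example, handles it well: it processes rows in order, assuming you haven't disturbed that order yourself. For example, here is how to create a summarized dataset: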

data have1_summarized;
  set have1;
  by id event_date;
  where not missing(description);
  retain min_description max_description;
  if first.event_date then min_description = description;
  max_description=description;
  if last.event_date then output;
run;

Now you can combine this with the snippet1 dataset using SQL or other tools: since you no longer have duplicate event dates, row order no longer matters.
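One way that combination might look, as a sketch rather than a definitive implementation: it reuses the correlated-subquery pattern shown earlier, but orders by event_date instead of a row counter, which is safe because have1_summarized has at most one row per id and event_date. The names want_sql, description1 and description2 are assumptions for illustration, and ID is assumed to have the same type in both datasets.

```sas
proc sql;
  /* want_sql is a hypothetical output name */
  create table want_sql as
  select a.*,
         /* first description of the earliest event date in the window */
         (select b.min_description
            from have1_summarized b
           where b.id = a.id
             and b.event_date between a.calldate and a.calldate + 2
          having b.event_date = min(b.event_date)) as description1,
         /* last description of the latest event date in the window */
         (select b.max_description
            from have1_summarized b
           where b.id = a.id
             and b.event_date between a.calldate and a.calldate + 2
          having b.event_date = max(b.event_date)) as description2
  from snippet1 a;
quit;
```

Calls with no described event in the window get missing values in both columns.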

This is what is known as a table lookup problem. I recommend the double set technique; it is easy to learn.
By the way, I have fixed several errors in your data input steps.

data snippet1;
  input ID$ callDate :ddmmyy. start_date :ddmmyy. end_date :ddmmyy. cured$13. ;
  format callDate start_date end_date  ddmmyy10.;
datalines4;
001 30/11/2020  28/11/2020  01/12/2020  Cured
001 01/12/2020  28/11/2020  01/12/2020  Cured
001 30/12/2020  28/12/2020  04/01/2021  Not Cured
001 31/12/2020  28/12/2020  04/01/2021  Not Cured
001 01/02/2021  28/01/2021  01/02/2021  Cured
;;;;
run;

data have1;
  input ID$  event_date :date9. description $42. ;
  format  event_date ddmmyy10.;
datalines4;
001 28Oct2020   
001 29Nov2020   
001 29Nov2020   New Plan
001 30Nov2020   
001 01Dec2020   
001 01Dec2020   New Plan
001 01Dec2020   Stop Category
001 01Dec2020   Review Date
001 02Dec2020   
001 02Dec2020   OLd Contact Strategy Level
001 02Dec2020   
001 04Dec2020   Stop Category
001 04Dec2020   Review Date
001 29Dec2020   
001 29Dec2020   New Plan
001 30Dec2020   
001 31Dec2020   
001 01Jan2021   
001 01Jan2021   
001 02Jan2021   
001 04Jan2021   
001 05Jan2021   OLd Contact Strategy Level
001 05Jan2021   
001 29Jan2021   
001 29Jan2021   New Plan
001 30Jan2021   
001 31Jan2021   
001 01Feb2021   
001 01Feb2021   
001 02Feb2021   
001 02Feb2021   OLd Contact Strategy Level
001 02Feb2021
;;;;
run;

data want1;
  length description1 description2 $42.;
  set snippet1;

  do i = 1 to rec;
    set have1(rename=ID=TmpID)nobs=rec point=i;
    if ID=TmpID and callDate <= event_date <= callDate + 2 then do;
      if description1 = '' then description1 = description;
      if description ^= '' then description2 = description;
    end;
  end;
  drop Tmp:;
run;

You may want to read an excellent paper on the double set technique:
Multiple Set Statements in a Data Step: A Powerful Technique for Combining and Aggregating Complex Data

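I took @whymath's solution and made it faster; this is probably only needed if you have gigabytes of data or need very high performance, though.

First, we construct a dataset that stores the first row number for each event_date in have1. We use it in the second data step to guide which rows we retrieve from have1, so that we don't iterate unnecessarily. We also create an index here, to enable the keyed set in the next step.

Second, we read that dataset with the key option to retrieve the starting row, and then use that row in the point loop instead of 1. We also add leave here so that we can stop iterating once we are past the window.

All of this assumes the dataset is in the order you want, but I think we have to assume that, or your whole idea breaks down. Do make sure it is in the right order, or you will have problems.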

data have1_ids(index=(id_calldate=(id calldate)));
  set have1;
  rename event_Date=calldate;
  by id event_date;
  _row+1;   *This keeps track of the row number only;
  if first.event_date;
  keep id event_date   _row;
run;


data want1;
  length description1 description2 $42.;
  set snippet1;
  set have1_ids key=id_calldate;
  do i = _row to rec;                                            *now we can start on the right row;
    set have1(rename=ID=_ID) nobs=rec point=i;              
    if (event_date gt calldate+2) or (ID ne _ID) then leave;     *conditions to exit the loop - if either of these is true then we are done here;
    if missing(description1) then description1 = description;    *populate the earlier description once we hit a valid description;
    if not missing(description) then description2 = description; *keep rewriting this until the end;
  end;
  drop _:;
run;

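Note that I don't check if _IORC_ eq 0 here, as I do in the other keyed set answer. That is because I don't much mind if the lookup fails: if no matching row is found, it is fine to use the prior value of _row. That isn't optimal, but it is close, and there is no good way to get the "next" row.

Here is another approach, which may be the most performant in some cases where have1 is relatively large. It is less flexible in some ways, though.

This uses a keyed set to do all of the work. It takes have1 and makes three copies of each row, one for each date you would want it to qualify for. The keyed set then just grabs the rows for the right date, using the index on the set dataset to find the matching rows.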

data have1_expanded(index=(id_calldate=(id calldate)));
  set have1;
  if not missing(description);
  format calldate date9.;
  do calldate=event_date to event_date-2 by -1;
    output;
  end;
run;

data want1;
  set snippet1;
  do _n_ = 1 by 1 until (_IORC_ ne 0);  *technically pointless but I always include it to make sure I do not forget _IORC_;
    set have1_expanded key=id_calldate end=eof;
    if _IORC_ ne 0 then leave;          *as keyed set iterates,_IORC_ will be zero when it finds a match and nonzero when it does not find any more matches;
    if _n_ eq 1 then description1=description;  *first time through,grab that first description;
    description2=description;                   *every time through,overwrite this to get the last description;
  end;
run;
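One detail worth knowing about keyed sets: when a KEY= lookup finds no further match, SAS also sets the automatic variable _ERROR_ to 1, which dumps the program data vector to the log. Below is a hedged variant of the step above that resets it. The output name want1_quiet is hypothetical, and have1_expanded is assumed to have been built as shown.

```sas
data want1_quiet;                        /* hypothetical output name */
  set snippet1;
  length description1 description2 $42;
  do _n_ = 1 by 1;
    set have1_expanded key=id_calldate;
    if _iorc_ ne 0 then do;
      _error_ = 0;                       /* a failed lookup is expected here, so keep the log quiet */
      leave;                             /* no (more) matches for this id and calldate */
    end;
    if _n_ eq 1 then description1 = description;  /* first match in the window */
    description2 = description;                   /* overwritten until the last match */
  end;
  drop description event_date;           /* leftovers from the most recent successful lookup */
run;
```

The drop statement is a design choice: after a failed lookup, the variables read from have1_expanded still hold values from an earlier match, so keeping them in the output could be misleading.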

Do it in two steps. First find the descriptions that fall within the date range, and order them by event_date.

proc sql ;
create table list as 
  select a.id,a.calldate,a.start_date,a.end_date,a.cured,b.event_date,b.description
  from snippet1 a left join have1 b
  on a.id=b.id 
    and a.calldate<= b.event_date
    and b.event_date <= a.calldate+ 2
    and b.description is not null
  order by a.id,a.calldate,a.start_date,a.end_date,a.cured,b.event_date
;
quit;

Then process the list to reduce it to the first and last.

data want;
  set list;
  by id calldate start_date end_date cured ;
  length description1 description2 $42  ;
  if first.cured then do;
     event_date1=event_date;
     description1=description;
  end;
  retain event_date1 description1;
  if last.cured then do;
    if not first.cured then do;
      description2=description;
      event_date2=event_date;
    end;
    output;
  end;
  drop description event_date;
  format event_date1 event_date2 yymmdd10.;
run;

Result: