What is the most efficient way to create new group IDs from cross-classified nested groups in SAS?

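I have two tables. The first contains about 370,000 rows of data on college programs between 2010 and 2019. The relevant columns are `year`, `program_id` and `college_id`. I kept `program_name` and `college_name` only to make the example easier to follow. This is what my data looks like: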

| year | college_id | college_name         | program_id | program_name           |
|------|------------|----------------------|------------|------------------------|
| 2010 | 001        | Wilbor College       | 001        | Civil Engineering      |
| 2010 | 001        | Wilbor College       | 002        | Electrical Engineering |
| 2010 | 001        | Wilbor College       | 003        | Mechanical Engineering |
| 2010 | 001        | Wilbor College       | 021        | English                |
| 2010 | 002        | Mary College         | 031        | Physics                |
| 2010 | 003        | Francis College      | 041        | Arts                   |
| 2019 | 001        | Wilbor College       | 004        | Engineering            |
| 2019 | 101        | South Wilbor College | 022        | English Teaching       |
| 2019 | 101        | South Wilbor College | 023        | English and Spanish    |
| 2019 | 101        | South Wilbor College | 024        | English Literature     |
| 2019 | 223        | Mary&Francis College | 031        | Physics                |
| 2019 | 223        | Mary&Francis College | 032        | Astronomy              |
| 2019 | 223        | Mary&Francis College | 033        | Geophysics             |
| 2019 | 223        | Mary&Francis College | 034        | Biophysics             |
| 2019 | 223        | Mary&Francis College | 041        | Arts                   |
| 2019 | 223        | Mary&Francis College | 042        | Visual Arts            |
| 2019 | 223        | Mary&Francis College | 043        | History of Art         |
| 2019 | 223        | Mary&Francis College | 044        | Cinema                 |

The second table holds metadata: about 15,000 rows describing linked programs, in two columns, `program_id` and `linked_program_id`. It looks like this:


| program_id | linked_program_id |
|------------|-------------------|
| 001        | 002               |
| 004        | 001               |
| 004        | 002               |
| 004        | 003               |
| 021        | 022               |
| 021        | 023               |
| 023        | 021               |
| 023        | 024               |
| 031        | 032               |
| 032        | 031               |
| 033        | 031               |
| 034        | 031               |
| 041        | 042               |
| 041        | 043               |
| 042        | 044               |

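The second table matters because it identifies programs that are linked to each other. There are several types of links, but put simply, a link is a case where the value of `program_id` changes over time. These changes usually happen because programs split or merge over the years.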

Colleges split and merge as well, but I do not have equivalent metadata for colleges. I could create it, though, if I can track their programs over the years.

I want to add two variables to the first dataset:

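  1. `group_p_id`: one ID shared by all `program_id` values that have been linked in some way over the years;
  2. `group_c_id`: one ID shared by all `college_id` values that have been linked in some way over the years.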

The resulting dataset would look like this:

| year | col_id | college_name         | prog_id | program_name           | group_p_id | group_c_id |
|------|--------|----------------------|---------|------------------------|------------|------------|
| 2010 | 001    | Wilbor College       | 001     | Civil Engineering      | 004        | 001        |
| 2010 | 001    | Wilbor College       | 002     | Electrical Engineering | 004        | 001        |
| 2010 | 001    | Wilbor College       | 003     | Mechanical Engineering | 004        | 001        |
| 2010 | 001    | Wilbor College       | 021     | English                | 021        | 001        |
| 2010 | 002    | Mary College         | 031     | Physics                | 031        | 223        |
| 2010 | 003    | Francis College      | 041     | Arts                   | 041        | 223        |
| 2019 | 001    | Wilbor College       | 004     | Engineering            | 004        | 001        |
| 2019 | 101    | South Wilbor College | 022     | English Teaching       | 021        | 001        |
| 2019 | 101    | South Wilbor College | 023     | English and Spanish    | 021        | 001        |
| 2019 | 101    | South Wilbor College | 024     | English Literature     | 021        | 001        |
| 2019 | 223    | Mary&Francis College | 031     | Physics                | 031        | 223        |
| 2019 | 223    | Mary&Francis College | 032     | Astronomy              | 031        | 223        |
| 2019 | 223    | Mary&Francis College | 033     | Geophysics             | 031        | 223        |
| 2019 | 223    | Mary&Francis College | 034     | Biophysics             | 031        | 223        |
| 2019 | 223    | Mary&Francis College | 041     | Arts                   | 041        | 223        |
| 2019 | 223    | Mary&Francis College | 042     | Visual Arts            | 041        | 223        |
| 2019 | 223    | Mary&Francis College | 043     | History of Art         | 041        | 223        |
| 2019 | 223    | Mary&Francis College | 044     | Cinema                 | 041        | 223        |

Here are the datasets as SAS input steps (the IDs are read as numeric to make conversion easier):

data have1;
INPUT year 1-4 college_id 6-8 college_name $10-29 program_id 31-33 program_name $35-56;
datalines;
2010 001 Wilbor College       001 Civil Engineering
2010 001 Wilbor College       002 Electrical Engineering
2010 001 Wilbor College       003 Mechanical Engineering
2010 001 Wilbor College       021 English
2010 002 Mary College         031 Physics
2010 003 Francis College      041 Arts
2019 001 Wilbor College       004 Engineering
2019 101 South Wilbor College 022 English Teaching
2019 101 South Wilbor College 023 English and Spanish
2019 101 South Wilbor College 024 English Literature
2019 223 Mary&Francis College 031 Physics
2019 223 Mary&Francis College 032 Astronomy
2019 223 Mary&Francis College 033 Geophysics
2019 223 Mary&Francis College 034 Biophysics
2019 223 Mary&Francis College 041 Arts
2019 223 Mary&Francis College 042 Visual Arts
2019 223 Mary&Francis College 043 History of Art
2019 223 Mary&Francis College 044 Cinema
;
run;

data have2;
INPUT program_id 1-3 linked_program_id 5-7;
datalines;
001 002
004 001
004 002
004 003
021 022
021 023
023 021
023 024
031 032
032 031
033 031
034 031
041 042
041 043
042 044
;
run;

data want;
INPUT year 1-4 college_id 6-8 college_name $10-29 program_id 31-33 program_name $35-56 group_p_id 58-60 group_c_id 62-64;
datalines;
2010 001 Wilbor College       001 Civil Engineering      004 001
2010 001 Wilbor College       002 Electrical Engineering 004 001
2010 001 Wilbor College       003 Mechanical Engineering 004 001
2010 001 Wilbor College       021 English                021 001
2010 002 Mary College         031 Physics                031 223
2010 003 Francis College      041 Arts                   041 223
2019 001 Wilbor College       004 Engineering            004 001
2019 101 South Wilbor College 022 English Teaching       021 001
2019 101 South Wilbor College 023 English and Spanish    021 001
2019 101 South Wilbor College 024 English Literature     021 001
2019 223 Mary&Francis College 031 Physics                031 223
2019 223 Mary&Francis College 032 Astronomy              031 223
2019 223 Mary&Francis College 033 Geophysics             031 223
2019 223 Mary&Francis College 034 Biophysics             031 223
2019 223 Mary&Francis College 041 Arts                   041 223
2019 223 Mary&Francis College 042 Visual Arts            041 223
2019 223 Mary&Francis College 043 History of Art         041 223
2019 223 Mary&Francis College 044 Cinema                 041 223
;
run;

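The tricky part is that the second table (the metadata) is not well organized. Some IDs appear in both the `program_id` and `linked_program_id` columns, in a many-to-many fashion.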
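To make the overlap concrete, a query along these lines lists the IDs that occur in both columns. It is only a minimal sketch against the `have2` sample above, and `both_cols` is a table name made up for illustration.

```sas
/* Sketch: IDs that appear both as program_id and as linked_program_id. */
/* both_cols is a hypothetical output table name.                       */
proc sql;
  create table both_cols as
  select distinct a.program_id
  from have2 as a
       inner join have2 as b
       on a.program_id = b.linked_program_id
  order by a.program_id;
quit;
```

For the sample data this returns six IDs: 1, 21, 23, 31, 32 and 42.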

I tried to solve this myself, but the code became very complicated and I am not sure I got it right. Here is what I did:

  1. Make sure every ID is listed in both columns (`program_id` and `linked_program_id`):
proc sql;
create table temp_1 as
select distinct
program_id,linked_program_id
from have2
union
select distinct
linked_program_id as program_id,program_id as linked_program_id
from have2;
quit;
  2. Compute the maximum `program_id` within each `linked_program_id` and vice versa, then take the row-wise maximum of the two:
proc sql;
create table temp_2 as 
select
max(ID_0,ID_1) as ID_2,*
from (select 
   max(linked_program_id) as ID_1,*                 
   from (select 
        max(program_id) as ID_0,*
        from temp_1
        group by linked_program_id) 
    group by program_id);
quit;

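  3. The previous step gave most linked programs the same new ID, but not all of them. So I repeated the same logic three times inside a macro (with the new ID `ID_2` as its first input):

(For the sample dataset provided here, running the loop once is enough. With my 15,000-row dataset I had to run it three times before all linked programs ended up with the same ID.)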

%macro max_rep;
proc sql;
%do i=3 %to 9 %by 3; 
create table temp_%eval(&i.+2) as 
select
max(ID_&i.,ID_%eval(&i.+1)) as ID_%eval(&i.+2),program_id,linked_program_id
from (select 
    max(ID_&i.) as ID_%eval(&i.+1),*
    from (select 
        max(ID_%eval(&i.-1)) as ID_&i.,*
        from temp_%eval(&i.-1)
        group by linked_program_id) 
    group by program_id);

drop table temp_%eval(&i.-1);

%end;
quit;
%mend;
%max_rep;
  4. Finally, I joined `have1` with `temp_11`:
proc sql;
create table want as
select
t1.*,t2.ID_11 as group_p_id
from have1 as t1 
left join (select distinct ID_11,program_id from temp_11) as t2
on t1.program_id=t2.program_id;
quit;
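The fixed count of three passes in step 3 was found by trial and error. A variant that repeats the same "take the largest linked ID" idea until no ID changes any more could look like the sketch below. It assumes the symmetric table `temp_1` from step 1 already exists; the macro name `propagate_max`, the tables `labels` and `next`, and the macro variable `n_changed` are all names invented for this illustration, and it has not been timed on the full 15,000-row table.

```sas
/* Sketch only: max-label propagation that stops once nothing changes. */
/* Assumes temp_1 (symmetric link table from step 1) exists.           */
%macro propagate_max;
  %local n_changed;

  /* Start: every ID is its own label. */
  proc sql noprint;
    create table labels as
    select distinct program_id, program_id as label
    from temp_1;
  quit;

  %do %until (&n_changed = 0);
    proc sql noprint;
      /* Each ID takes the largest label among itself and its neighbours. */
      create table next as
      select program_id, max(label) as label
      from (select program_id, label from labels
            union all
            select e.program_id, n.label
            from temp_1 as e
                 inner join labels as n
                 on e.linked_program_id = n.program_id)
      group by program_id;

      /* Count the IDs whose label moved in this pass. */
      select count(*) into :n_changed trimmed
      from labels as a
           inner join next as b
           on a.program_id = b.program_id
      where a.label ne b.label;

      drop table labels;
      create table labels as select * from next;
      drop table next;
    quit;
  %end;
%mend propagate_max;
%propagate_max;
```

On the sample data the loop ends with four labels, 4, 24, 34 and 44, one per group of linked programs. The number of passes grows with the longest chain of links, so this is still a brute-force approach, just one without a hard-coded repeat count.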

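That gets me halfway. I now have one ID shared by all programs that belong to the same group of linked programs. I then made some further changes, based on other variables, so that the broadest program in each linked group serves as my `group_p_id` (rather than simply the maximum). Then I realized I still need to work out how to create the new group ID for colleges (`group_c_id`). There must be a simpler, more efficient way to do this in SAS, mustn't there?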
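For the college side, one possible starting point is to treat two colleges as linked whenever they have programs in the same program group. The sketch below builds such a link table from `want`. It assumes `want` already carries `group_p_id`; `college_groups` and `college_links` are hypothetical table names.

```sas
/* Sketch: derive college-to-college links from shared program groups. */
/* college_groups and college_links are hypothetical names.            */
proc sql;
  /* One row per (college, program group) keeps the self-join small. */
  create table college_groups as
  select distinct college_id, group_p_id
  from want
  where group_p_id is not missing;  /* missing values would match each other */

  create table college_links as
  select distinct a.college_id,
                  b.college_id as linked_college_id
  from college_groups as a
       inner join college_groups as b
       on a.group_p_id = b.group_p_id
  where a.college_id ne b.college_id;
quit;
```

With the sample data this should link 001 with 101, and both 002 and 003 with 223. `college_links` has the same two-column shape as `have2`, so whatever procedure groups the programs could be reused on it. Colleges that never share a program group with another college do not appear in it and would simply keep their own ID.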

Thanks for your help.
