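Why does my Stata code take so long to run, and how can I fix it?
I have employment data on employees at different companies. My goal is to create variables that measure the size and strength of an individual's network, based on employment spells that overlap with those of other people at other companies. The dataset is fairly large: more than 16,000,000 observations on more than 350,000 individuals. I wrote the Stata code below to produce the required output, but it takes far too long to run; on my full data it would need about a month to finish. There must be a way to make the code more efficient, but I don't know how. I would appreciate any help. I am using Stata 16.
Below I provide a minimal, reproducible example.
First, input and save a raw toy dataset: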
clear
input byte(individual_id company_id) double(fiscal_year_end year)
1 1 18992 2011
1 1 19358 2012
1 2 18992 2011
1 2 19358 2012
1 2 19723 2013
1 2 20088 2014
2 1 18992 2011
2 1 19358 2012
2 1 19723 2013
2 3 19358 2012
2 3 19723 2013
3 2 19723 2013
3 2 20088 2014
4 1 18992 2011
4 1 19358 2012
5 1 19723 2013
5 1 20088 2014
5 3 19358 2012
5 3 19723 2013
end
format %td fiscal_year_end
save raw_data.dta,replace
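Second, create temporary datasets holding, for each year, the IDs of individuals who were active (observed in that year or in one of the three preceding years): [Rationale: I don't want individuals to be included in someone's network once they have been inactive for a while. This is meant to capture things like retirement.]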
* Restrict to unique individual year pairs
keep individual_id year
duplicates drop
* Fill in missing individual year observations
fillin individual_id year
* Sort data by individual ID and year
sort individual_id year
* Generate indicator for active during last three years
gen active_3yrs = (_fillin==0)
by individual_id (year): replace active_3yrs = 1 if _fillin[_n-1]==0 | _fillin[_n-2]==0 | _fillin[_n-3]==0
* Drop observations for inactive individuals
drop if active_3yrs==0
* Save temporary datasets with active individual IDs for each year
levelsof year,local(years)
foreach i in `years' {
    preserve
    keep if year==`i'
    keep individual_id
    save active_`i'.dta,replace
    restore
    drop if year==`i'
}
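To make the three-year activity rule concrete, here is a minimal sketch on a hypothetical mini-dataset (not my real data): individual 1 is observed only in 2011 and 2016, while individual 2 is observed in every year, so that `fillin` creates a row for each year from 2011 to 2016. Note that the block clears the data in memory.

```stata
* Hypothetical mini-example (values are made up for illustration)
clear
input byte individual_id int year
1 2011
1 2016
2 2011
2 2012
2 2013
2 2014
2 2015
2 2016
end
* Create a row for every individual-year combination
fillin individual_id year
sort individual_id year
* Active if observed this year or in one of the three preceding rows
gen byte active_3yrs = (_fillin==0)
by individual_id (year): replace active_3yrs = 1 if _fillin[_n-1]==0 | _fillin[_n-2]==0 | _fillin[_n-3]==0
list individual_id year _fillin active_3yrs if individual_id==1, noobs
```

For individual 1, only 2015 ends up flagged as inactive: 2012 to 2014 are within three rows of the 2011 observation, and 2016 is observed again. The lags `_n-1` to `_n-3` count rows, not calendar years, so they correspond to the previous three years only when every year appears somewhere in the data.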
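Third, create a temporary dataset with the IDs of individuals who were at at least two companies: [Rationale: otherwise they cannot have an external network.]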
* Load raw data
use raw_data.dta,clear
keep individual_id company_id
duplicates drop
duplicates tag individual_id,gen(tmp)
drop if tmp==0
drop tmp
keep individual_id
duplicates drop
rename individual_id focal_id
sort focal_id
save ids.dta,replace
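As a cross-check of the "at least two companies" filter, the same selection can be sketched with `egen` on a hypothetical mini-dataset. The variable and scalar names (`first_pair`, `n_companies`, `n_multi`) are made up for illustration, and `preserve`/`restore` is used so that the data currently in memory is left untouched.

```stata
* Hypothetical mini-example; preserve/restore keeps the data in memory intact
preserve
clear
input byte(individual_id company_id) int year
1 1 2011
1 1 2012
1 2 2012
2 1 2011
2 1 2012
end
* Tag one observation per individual-company pair
egen byte first_pair = tag(individual_id company_id)
* Count the distinct companies of each individual
bysort individual_id: egen n_companies = total(first_pair)
* Keep one row per individual with at least two distinct companies
keep if n_companies >= 2
keep individual_id
duplicates drop
list, noobs
* Store the number of qualifying individuals before restoring
scalar n_multi = _N
restore
```

In this toy input, only individual 1 worked at two companies, so a single ID remains.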
Fourth, create temporary datasets with potential network observations for each individual-year pair:
forvalues i = 1/`=_N' {
    * Load temporary dataset with individual IDs
    use ids.dta,clear
    * Restrict to focal individual ID and store it as local macro
    keep if _n==`i'
    quietly summarize focal_id
    local x "`r(mean)'"
    * Load raw dataset with base information for network variable creation
    use raw_data.dta,clear
    * Generate focal individual ID variable
    gen focal_id = `x'
    format focal_id %12.0f
    * Keep observations for individuals who are at the same company during the same fiscal year as the focal individual
    gen tmp = (individual_id==focal_id)
    sort company_id fiscal_year_end
    by company_id fiscal_year_end: egen tmp2 = max(tmp)
    drop if tmp==1
    drop if tmp2==0
    drop tmp tmp2
    * Save as temporary dataset (if dataset is not empty)
    capture assert _N>0
    if !_rc {
        save tmp_individual_`i'.dta,replace
    }
    else {
        // do nothing
    }
}
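Fifth, clean the temporary datasets with potential network observations for each individual-year pair: [Note: drop observations of inactive individuals and observations that occur after the focal year. Then create the network size and strength variables, and keep the unique company-fiscal-year-end pairs for the focal year.]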
* Define programme that calculates network size and strength variables
capture program drop network_size_strength
program network_size_strength
    quietly tab individual_id if rr_company_id!=company_id & rr_fiscal_year_end>=fiscal_year_end
    gen network_size = r(r)
    gen network_strength = r(N) / r(r)
end
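For readers unfamiliar with the stored results used in the program above: after a one-way `tabulate`, `r(r)` holds the number of distinct values (rows) and `r(N)` the number of observations. A minimal sketch on hypothetical data follows; the scalar names `toy_size` and `toy_strength` are made up for illustration, and the block clears the data in memory.

```stata
* Hypothetical mini-example: three observations on two distinct individuals
clear
input byte individual_id
3
3
4
end
quietly tabulate individual_id
* Store the results immediately, before another r-class command overwrites them
scalar toy_size = r(r)
scalar toy_strength = r(N)/r(r)
display "distinct individuals (network size): " toy_size
display "observations per individual (network strength): " toy_strength
```

Here two distinct individuals account for three observations, so the size is 2 and the strength is 1.5. In the expected output further down, rows with a network size of 0 have a missing strength, which is consistent with a division of zero by zero.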
* Load temporary dataset with IDs of individuals who were at at least two companies
use ids.dta,clear
forvalues i = 1/`=_N' {
    * Load temporary dataset (if existing)
    capture confirm file "tmp_individual_`i'.dta"
    if !_rc {
        use tmp_individual_`i'.dta,clear
        levelsof year,local(years)
        foreach j in `years' {
            * Preserve data
            preserve
            * Drop observations after focal year
            drop if year>`j'
            * Drop inactive individuals
            merge m:1 individual_id using active_`j'.dta,nogen keep(match)
            * Generate network size and strength variables
            rangerun network_size_strength,int(fiscal_year_end . .) sprefix(rr_) use(individual_id company_id fiscal_year_end)
            * Keep observations for focal year
            keep if year==`j'
            * Keep variables of interest
            keep company_id fiscal_year_end focal_id network_size network_strength
            * Rename individual ID variable
            rename focal_id individual_id
            * Drop duplicates
            duplicates drop
            * Save as temporary dataset
            save tmp_individual_`i'_`j'.dta,replace
            * Restore data
            restore
        }
        * Remove temporary dataset
        rm tmp_individual_`i'.dta
    }
    else {
        // do nothing
    }
}
Sixth, append all temporary datasets:
* Load temporary datasets with base information for network variable creation
use raw_data.dta,clear
* Store unique year values in local macro
levelsof year,local(years)
* Load temporary dataset with IDs of individuals who were at at least two companies
use ids.dta,clear
forvalues i = 1/`=_N' {
    foreach j in `years' {
        * Drop focal ID variable (if existing)
        capture confirm variable focal_id
        if !_rc {
            drop focal_id
        }
        else {
            // do nothing
        }
        * Check if temporary dataset exists
        capture confirm file "tmp_individual_`i'_`j'.dta"
        if !_rc {
            * Append temporary dataset
            append using tmp_individual_`i'_`j'.dta
            * Delete temporary dataset
            rm tmp_individual_`i'_`j'.dta
        }
        else {
            // do nothing
        }
    }
}
* Save data
save network_variables.dta,replace
Seventh, create the final dataset by merging with the raw data:
merge 1:1 individual_id company_id fiscal_year_end using raw_data.dta,nogen keep(match using)
order individual_id company_id fiscal_year_end year,first
sort individual_id company_id fiscal_year_end
save final_data.dta,replace
The final data should look like this:
clear
input double individual_id byte company_id double(fiscal_year_end year network_size network_strength)
1 1 18992 2011 0 .
1 1 19358 2012 0 .
1 2 18992 2011 . .
1 2 19358 2012 . .
1 2 19723 2013 2 2
1 2 20088 2014 2 2
2 1 18992 2011 0 .
2 1 19358 2012 1 1
2 1 19723 2013 1 2
2 3 19358 2012 2 2
2 3 19723 2013 3 1.6666666666666667
3 2 19723 2013 . .
3 2 20088 2014 . .
4 1 18992 2011 . .
4 1 19358 2012 . .
5 1 19723 2013 1 2
5 1 20088 2014 . .
5 3 19358 2012 0 .
5 3 19723 2013 1 1
end
format %td fiscal_year_end