Why does my Stata code take so long to run?

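I have employment data on employees at different companies. My goal is to create variables that measure the strength and size of an individual's network, based on overlapping employment periods with other people at other companies. My data is rather large: more than 16,000,000 observations for more than 350,000 individuals. I wrote the Stata code below to generate the desired output, but unfortunately it takes a very long time to run; on my full data it would need about a month to finish. There must be a way to make the code more efficient, but I don't know how. I would appreciate any help. I am using Stata 16.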

Below I provide a minimal, reproducible example.

First, input and save a raw toy dataset:

clear
input byte(individual_id company_id) double(fiscal_year_end year)
1 1 18992 2011
1 1 19358 2012
1 2 18992 2011
1 2 19358 2012
1 2 19723 2013
1 2 20088 2014
2 1 18992 2011
2 1 19358 2012
2 1 19723 2013
2 3 19358 2012
2 3 19723 2013
3 2 19723 2013
3 2 20088 2014
4 1 18992 2011
4 1 19358 2012
5 1 19723 2013
5 1 20088 2014
5 3 19358 2012
5 3 19723 2013
end
format %td fiscal_year_end
save raw_data.dta,replace

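Second, create temporary datasets with the IDs of active individuals (active during the last three years) for each year. [Rationale: I do not want individuals to be included in someone's network once they have been inactive for a while. This should capture things like retirement.]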

* Restrict to unique individual year pairs
keep individual_id year
duplicates drop

* Fill in missing individual year observations
fillin individual_id year

* Sort data by individual ID and year
sort individual_id year

* Generate indicator for active during last three years
gen active_3yrs = (_fillin==0)
by individual_id (year): replace active_3yrs = 1 if _fillin[_n-1]==0 | _fillin[_n-2]==0  | _fillin[_n-3]==0

* Drop observations for inactive individuals
drop if active_3yrs==0

* Save temporary datasets with active individual IDs for each year
levelsof year,local(years)
foreach i in `years' {
    preserve
    keep if year==`i'
    keep individual_id
    save active_`i'.dta,replace
    restore
    drop if year==`i'
}
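
To make the rolling window in the `replace` line concrete, here is a minimal sketch on a hand-typed panel. The variable `filled` is a hypothetical stand-in for `_fillin`, and the two individuals mirror IDs 3 and 4 of the toy data after `fillin`. It relies on the same behaviour as the code above: a subscript that points before the first observation of a by-group returns missing, so the comparison with 0 is false there.

```stata
* Hypothetical mini panel: filled==1 marks a year added by fillin
clear
input byte(individual_id) int(year) byte(filled)
3 2011 1
3 2012 1
3 2013 0
3 2014 0
4 2011 0
4 2012 0
4 2013 1
4 2014 1
end

* Active if observed this year ...
gen byte active_3yrs = (filled == 0)

* ... or observed in any of the three preceding years of the same individual
bysort individual_id (year): replace active_3yrs = 1 if filled[_n-1]==0 | filled[_n-2]==0 | filled[_n-3]==0

list, sepby(individual_id)
```

In this sketch individual 3 is inactive in 2011 and 2012 (no earlier observation exists), while individual 4 stays active through 2014 because the last real observation, 2012, is still within three years.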

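Third, create a temporary dataset with the IDs of individuals who were at at least two companies. [Rationale: otherwise they cannot have an outside network.]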

* Load raw data
use raw_data.dta,clear
keep individual_id company_id
duplicates drop
duplicates tag individual_id,gen(tmp)
drop if tmp==0
drop tmp
keep individual_id
duplicates drop
rename individual_id focal_id
sort focal_id
save ids.dta,replace
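
For reference, the same selection can be written by counting distinct companies per individual directly. This is only a sketch of an equivalent formulation on a hypothetical six-row extract, not a claim that it is faster; `pair_tag` and `n_companies` are names I made up for the illustration.

```stata
* Hypothetical extract: individual 1 appears at two companies, 3 and 4 at one
clear
input byte(individual_id company_id)
1 1
1 1
1 2
3 2
3 2
4 1
end

* Tag exactly one row per individual-company pair
egen byte pair_tag = tag(individual_id company_id)

* The number of tagged rows per individual is the number of distinct companies
bysort individual_id: egen n_companies = total(pair_tag)

* Keep one row per individual with at least two companies
keep if n_companies >= 2
keep individual_id
duplicates drop
rename individual_id focal_id
list
```

Only individual 1 survives here, which matches what the `duplicates tag` approach would return for these rows.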

Fourth, create temporary datasets with potential network observations for each individual-year pair:

forvalues i = 1/`=_N' {
    * Load temporary dataset with individual IDs
    use ids.dta,clear
    
    * Restrict to focal individual ID and store it as local macro
    keep if _n==`i'
    quietly summarize focal_id
    local x "`r(mean)'"
    
    * Load temporary datasets with base information for network variable creation
    use raw_data.dta,clear
    
    * Generate focal individual ID variable
    gen focal_id = `x'
    format focal_id %12.0f
    
    * Keep observations for individuals who are at the same company during the same fiscal year as the focal individual
    gen tmp = (individual_id==focal_id)
    sort company_id fiscal_year_end
    by company_id fiscal_year_end: egen tmp2 = max(tmp)
    drop if tmp==1
    drop if tmp2==0
    drop tmp tmp2
    
    * Save as temporary dataset (if dataset is not empty)
    if _N > 0 {
        save tmp_individual_`i'.dta,replace
    }
}
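
The loop above reloads `raw_data.dta` once per focal individual. One alternative I am aware of is to form all focal/co-worker pairs in a single pass with `joinby`. The following is a minimal sketch under my own assumptions, not a tested replacement: it uses a hypothetical six-row extract, it does not restrict the focal side to the IDs in `ids.dta`, and the number of pairs grows quadratically with the headcount of each company-year, so memory may become the bottleneck on the full data.

```stata
* Hypothetical extract of the raw data
clear
input byte(individual_id company_id) int(fiscal_year_end)
1 1 18992
2 1 18992
4 1 18992
1 2 18992
1 2 19723
3 2 19723
end

* Keep a copy that will play the role of the co-workers
tempfile coworkers
save `coworkers'

* Focal side: the same rows, with the person renamed to focal_id
rename individual_id focal_id

* Form every focal/co-worker combination within a company and fiscal year end
joinby company_id fiscal_year_end using `coworkers'

* A person is not part of their own network
drop if individual_id == focal_id

sort focal_id company_id fiscal_year_end individual_id
list, sepby(focal_id)
```

Each remaining row pairs a `focal_id` with one co-worker observation, which is the same information that the per-individual `tmp_individual_` files hold, stacked into one dataset.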

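Fifth, clean the temporary datasets with potential network observations for each individual-year pair. [Note: drop observations of inactive directors and observations that occur after the focal year. Then create the network size and strength variables and keep unique company-fiscal-year-end pairs for the focal year.]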

* Define programme that calculates network size and strength variables
capture program drop network_size_strength
program network_size_strength
    quietly tab individual_id if rr_company_id!=company_id & rr_fiscal_year_end>=fiscal_year_end
    gen network_size = r(r)
    gen network_strength = r(N) / r(r)
end

* Load temporary dataset with IDs of individuals who were at at least two companies
use ids.dta,clear

forvalues i = 1/`=_N' {
    * Load temporary dataset (if existing)
    capture confirm file "tmp_individual_`i'.dta"
    if !_rc {
        use tmp_individual_`i'.dta,clear
        
        levelsof year,local(years)
        foreach j in `years' {
            * Preserve data
            preserve
                
            * Drop observations after focal year
            drop if year>`j'
            
            * Drop inactive individuals
            merge m:1 individual_id using active_`j'.dta,nogen keep(match)
            
            * Generate network size and strength variables
            * (rangerun is community-contributed: ssc install rangerun; it requires rangestat)
            rangerun network_size_strength,int(fiscal_year_end . .) sprefix(rr_) use(individual_id company_id fiscal_year_end)
            
            * Keep observations for focal year
            keep if year==`j'
            
            * Keep variables of interest
            keep company_id fiscal_year_end focal_id network_size network_strength
            
            * Rename individual ID variable
            rename focal_id individual_id
            
            * Drop duplicates
            duplicates drop
            
            * Save as temporary dataset
            save tmp_individual_`i'_`j'.dta,replace
            
            * Restore data
            restore
        }
        * Remove temporary dataset
        rm tmp_individual_`i'.dta
    }
}
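
To spell out what the `network_size_strength` program computes: it relies on the results that `tabulate` leaves behind, where `r(r)` is the number of distinct values and `r(N)` is the number of observations tabulated. The sketch below reproduces that arithmetic on a hypothetical list of co-worker observations; the scalar names `net_size` and `net_strength` are mine.

```stata
* Hypothetical co-worker observations that pass the if-condition in the program
clear
input byte(individual_id)
2
2
4
5
5
5
end

quietly tabulate individual_id

* Size: number of distinct co-workers
scalar net_size = r(r)

* Strength: average number of qualifying observations per distinct co-worker
scalar net_strength = r(N) / r(r)

display "network size     = " net_size
display "network strength = " net_strength
```

Here three distinct co-workers contribute six observations, so the size is 3 and the strength is 2. When no observation qualifies, the division is presumably 0/0, which would explain the rows in the expected output with a size of 0 and a missing strength.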

Sixth, append all temporary datasets:

* Load temporary datasets with base information for network variable creation
use raw_data.dta,clear

* Store unique year values in local macro
levelsof year,local(years)

* Load temporary dataset with IDs of individuals who were at at least two companies
use ids.dta,clear

forvalues i = 1/`=_N' {
    foreach j in `years' {
        * Drop focal ID variable (if existing)
        capture confirm variable focal_id
        if !_rc {
            drop focal_id
        }
        else {
            // do nothing
        }
        
        * Check if temporary dataset exists
        capture confirm file "tmp_individual_`i'_`j'.dta"
        if !_rc {
            * Append temporary dataset
            append using tmp_individual_`i'_`j'.dta
            
            * Delete temporary dataset
            rm tmp_individual_`i'_`j'.dta
        }
        else {
            // do nothing
        }
    }
}

* Save data
save network_variables.dta,replace
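
The double loop above calls `confirm file` once for every individual-year combination, whether or not a file exists. A possible alternative is to ask Stata for the files that actually exist and loop over those. This is a minimal sketch with hypothetical file names (`tmp_demo_1.dta`, `tmp_demo_2.dta`) written to the working directory; I have not checked whether the resulting list of file names would fit in a local macro for hundreds of thousands of files.

```stata
* Create two hypothetical temporary datasets to append
clear
input byte individual_id
1
end
save tmp_demo_1.dta, replace

clear
input byte individual_id
2
end
save tmp_demo_2.dta, replace

* Collect the names of the files that exist, then append and delete each one
clear
local files : dir "." files "tmp_demo_*.dta"
foreach f of local files {
    append using "`f'"
    erase "`f'"
}

sort individual_id
list
```

The pattern passed to `dir` would have to match only the per-individual, per-year result files, so the naming scheme needs to keep them distinguishable from other datasets in the same folder.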

Seventh, create the final data by merging onto the raw data:

merge 1:1 individual_id company_id fiscal_year_end using raw_data.dta,nogen keep(match using)
order individual_id company_id fiscal_year_end year,first
sort individual_id company_id fiscal_year_end
save final_data.dta,replace

The final data should look as follows:

clear
input double individual_id byte company_id double(fiscal_year_end year network_size network_strength)
1 1 18992 2011 0                  .
1 1 19358 2012 0                  .
1 2 18992 2011 .                  .
1 2 19358 2012 .                  .
1 2 19723 2013 2                  2
1 2 20088 2014 2                  2
2 1 18992 2011 0                  .
2 1 19358 2012 1                  1
2 1 19723 2013 1                  2
2 3 19358 2012 2                  2
2 3 19723 2013 3 1.6666666666666667
3 2 19723 2013 .                  .
3 2 20088 2014 .                  .
4 1 18992 2011 .                  .
4 1 19358 2012 .                  .
5 1 19723 2013 1                  2
5 1 20088 2014 .                  .
5 3 19358 2012 0                  .
5 3 19723 2013 1                  1
end
format %td fiscal_year_end