What is an efficient way to run a logistic regression on a large dataset (200 million observations x 2 variables)?

Question

use*_*057 5 python matlab hadoop r stata

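I am currently trying to run a logistic regression model. My data has two variables, one response variable and one predictor variable. The catch is that I have 200 million observations. Even with the help of EC2 instances on Amazon, fitting the model in R/Stata/MATLAB is proving extremely difficult. I believe the problem lies in how the logistic regression functions are defined in the languages themselves. Is there another way to run a logistic regression quickly? Currently the problem is that my data quickly fills up whatever space it is using. I have even tried using up to 30 GB of RAM, to no avail. Any solutions would be most welcome.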

Answer 1

Bri*_*roe 4

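If your main issue is the ability to estimate a logit model given your computer's memory constraints, rather than the speed of estimation, you can take advantage of the additivity of maximum likelihood estimation (the log-likelihood of the full sample is the sum of the log-likelihoods of its pieces) and write a custom program for ml. A logit model is simply maximum likelihood estimation using the logistic distribution. The fact that there is only one independent variable simplifies the problem. I have simulated the problem below. You should create two do-files from the following code blocks.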
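As an aside, the additivity property is easy to check numerically. The following is a minimal Python sketch, not part of the Stata workflow in this answer; the function name logit_loglik, the trial coefficients (0.8 and -4.0), the sample size, and the 10-way split are all illustrative assumptions. The simulated data mirrors the simulation used in the do-files.

```python
import numpy as np

def logit_loglik(beta, const, x, y):
    """Log-likelihood of a one-predictor logit model on arrays x and y (0/1)."""
    xb = beta * x + const
    # ln(invlogit(xb)) = -ln(1 + exp(-xb)); ln(1 - invlogit(xb)) = -ln(1 + exp(xb)).
    # np.logaddexp(0, z) computes ln(1 + exp(z)) in a numerically stable way.
    return np.sum(-y * np.logaddexp(0.0, -xb) - (1 - y) * np.logaddexp(0.0, xb))

# Simulated data in the spirit of the Stata simulation (sizes are assumptions).
rng = np.random.default_rng(0)
n = 100_000
x = 10 * rng.uniform(size=n)
y = (x > 5 + rng.normal(0, 2, size=n)).astype(float)

# Log-likelihood of the full sample at some arbitrary trial coefficients ...
full = logit_loglik(0.8, -4.0, x, y)

# ... and the same quantity accumulated piece by piece.
pieces = sum(
    logit_loglik(0.8, -4.0, xc, yc)
    for xc, yc in zip(np.array_split(x, 10), np.array_split(y, 10))
)

print(full, pieces)  # the two totals agree up to floating-point rounding
```

Because the total only ever needs one piece at a time, an optimizer that asks for the log-likelihood at a given coefficient vector never needs the whole dataset in memory.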

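If you have no problem loading the entire dataset (which you shouldn't: my simulation used only about 2 GB of RAM with 200 million observations and 2 variables, though your mileage may vary), the first step is to break the dataset into manageable pieces. For example: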

depvar = your dependent variable (0 or 1)
indepvar = your independent variable (some numeric data type)

cd "/path/to/largelogit"

clear all
set more off

set obs 200000000

// We have two variables, an independent variable and a dependent variable.
gen indepvar = 10*runiform()
gen depvar = .

// As indepvar increases, the probability of depvar being 1 also increases.
replace depvar = 1 if indepvar > ( 5 + rnormal(0,2) )
replace depvar = 0 if depvar == .

save full, replace
clear all

// Need to split the dataset into manageable pieces

local max_opp = 20000000    // maximum observations per piece

local obs_num = `max_opp'

local i = 1
while `obs_num' == `max_opp' {

    clear

    local h = `i' - 1

    local obs_beg = (`h' * `max_opp') + 1
    local obs_end = (`i' * `max_opp')

    capture noisily use in `obs_beg'/`obs_end' using full

    if _rc == 198 {
        capture noisily use in `obs_beg'/l using full
    }
    if _rc == 198 { 
        continue,break
    }

    save piece_`i', replace

    sum
    local obs_num = `r(N)'

    local i = `i' + 1

}

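From here, to minimize memory usage, close Stata and reopen it. When you create a dataset this large, Stata keeps some memory allocated for overhead and the like, even after you clear the dataset. You can type memory after the save full and again after the clear all to see what I mean.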

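Next, you must define your own custom ml program that feeds in these pieces one at a time, computes and sums the log-likelihood of every observation within each piece, and adds the piece totals together. You need to use the d0 ml method rather than the lf method, because the optimization routine for lf requires all the data to be loaded into Stata.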
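For readers who want the same idea outside Stata, here is a rough Python sketch of piece-wise estimation. It is not a translation of the d0 evaluator: d0 supplies only the log-likelihood and lets ml differentiate numerically, whereas this sketch swaps in Newton-Raphson with analytic gradient and Hessian, which are additive across pieces in the same way. The names chunk_stats and fit_chunked_logit, the tolerance, and the in-memory pieces are assumptions; in practice each piece would be read from disk inside the loop.

```python
import numpy as np

def chunk_stats(b, x, y):
    """Log-likelihood, gradient and Hessian contributions of one piece.

    b = (slope, constant), x = predictor, y = 0/1 outcome.
    """
    xb = b[0] * x + b[1]
    p = 1.0 / (1.0 + np.exp(-xb))
    ll = np.sum(-y * np.logaddexp(0.0, -xb) - (1 - y) * np.logaddexp(0.0, xb))
    X = np.column_stack([x, np.ones_like(x)])
    grad = X.T @ (y - p)
    hess = -(X * (p * (1 - p))[:, None]).T @ X
    return ll, grad, hess

def fit_chunked_logit(chunks, tol=1e-8, max_iter=50):
    """Newton-Raphson logit fit; `chunks()` returns an iterable of (x, y) pieces."""
    b = np.zeros(2)
    for _ in range(max_iter):
        ll, grad, hess = 0.0, np.zeros(2), np.zeros((2, 2))
        for x, y in chunks():  # only one piece is processed at a time
            c_ll, c_grad, c_hess = chunk_stats(b, x, y)
            ll += c_ll
            grad += c_grad
            hess += c_hess
        step = np.linalg.solve(hess, grad)
        if np.max(np.abs(step)) < tol:
            break  # ll and hess were evaluated at the returned b
        b = b - step
    se = np.sqrt(np.diag(np.linalg.inv(-hess)))  # inverse negative Hessian
    return b, se, ll

# Simulated data in the spirit of the Stata simulation (sizes are assumptions).
rng = np.random.default_rng(1)
n = 200_000
x = 10 * rng.uniform(size=n)
y = (x > 5 + rng.normal(0, 2, size=n)).astype(float)

def pieces():
    # Stand-in for reading piece_1 ... piece_20 from disk one by one.
    return zip(np.array_split(x, 20), np.array_split(y, 20))

def whole():
    return [(x, y)]

b_chunked, se_chunked, ll_chunked = fit_chunked_logit(pieces)
b_full, se_full, ll_full = fit_chunked_logit(whole)
print(b_chunked, b_full)  # slope and constant from both fits
```

Fitting on 20 pieces and on the whole sample should give the same coefficients and standard errors up to rounding, which is the property the Stata program below relies on.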

clear all
set more off

cd "/path/to/largelogit"

// This local stores the names of all the pieces 
local p : dir "/path/to/largelogit" files "piece*.dta"

local i = 1
foreach j of local p {    // Loop through all the names to count the pieces

    global pieces = `i'    // This is important for the program
    local i = `i' + 1

}

// Generate our custom MLE logit program. This is using the d0 ml method

program define llogit_d0

    args todo b lnf 

    tempvar y xb llike
    tempname tot_llike it_llike    // scalars get temporary names, not temporary variables

quietly {

    forvalues i=1/$pieces {

        capture drop _merge
        capture drop depvar indepvar
        capture drop `y'
        capture drop `xb'
        capture drop `llike' 
        capture scalar drop `it_llike'

        merge 1:1 _n using piece_`i'

        generate int `y' = depvar

        generate double `xb' = (indepvar * `b'[1,1]) + `b'[1,2]    // The linear combination of the coefficients and independent variable and the constant

        generate double `llike' = .

        replace `llike' = ln(invlogit( `xb')) if `y'==1    // the log of the probability should the dependent variable be 1
        replace `llike' = ln(1-invlogit(`xb')) if `y'==0   // the log of the probability should the dependent variable be 0

        sum `llike' 
        scalar `it_llike' = `r(sum)'    // The sum of the logged probabilities for this iteration

        if `i' == 1     scalar `tot_llike' = `it_llike'    // Total log likelihood for first iteration
        else            scalar `tot_llike' = `tot_llike' + `it_llike' // Total log likelihood is the sum of all the iterated log likelihoods `it_llike'

    }

    scalar `lnf' = `tot_llike'   // The total log likelihood which must be returned to ml

}

end

//This should work

use piece_1, clear

ml model d0 llogit_d0 (beta : depvar = indepvar )
ml search
ml maximize

I just ran the two code blocks above and received the following output:

[Image: large logit output]

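Pros and cons of this approach:
Pros:

- The smaller the max_opp size, the lower the memory usage. As noted above, I never used more than 1 GB in the simulation.

- You end up with unbiased estimators, the full log-likelihood of the estimator for the entire dataset, and the correct standard errors: basically everything important for making inferences.

Cons:

- What you save in memory you must sacrifice in CPU time. I ran this on my personal laptop with an i5 processor using Stata SE (single core), and it took me overnight.

- The Wald Chi2 statistic is wrong, but I believe you can calculate it from the correct data mentioned above.

- You do not get a pseudo-R2 as you would with logit.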
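Both missing statistics can be rebuilt by hand from quantities the piece-wise fit does report correctly. The Python sketch below is one way to do so under stated assumptions: with a single predictor, the Wald chi-squared for the slope is the squared ratio of the coefficient to its standard error, and the pseudo-R2 is taken to be McFadden's, 1 - ll_model / ll_null. The function names and the numbers in the usage lines are hypothetical and are not the output of the run above.

```python
import math

def wald_chi2_single(coef, se):
    """Wald chi-squared (1 degree of freedom) for H0: coef = 0."""
    return (coef / se) ** 2

def chi2_1df_pvalue(chi2):
    """Upper-tail p-value of a chi-squared statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

def null_loglik(n_ones, n_total):
    """Log-likelihood of the constant-only logit model.

    Needs only the count of ones and the sample size (0 < n_ones < n_total),
    both of which can be accumulated piece by piece.
    """
    p = n_ones / n_total
    return n_ones * math.log(p) + (n_total - n_ones) * math.log(1.0 - p)

def mcfadden_pseudo_r2(ll_model, ll_null):
    """McFadden's pseudo-R2 from the fitted and constant-only log-likelihoods."""
    return 1.0 - ll_model / ll_null

# Hypothetical numbers, for illustration only (not taken from the answer's output).
chi2 = wald_chi2_single(coef=0.85, se=0.01)
p_value = chi2_1df_pvalue(chi2)
ll0 = null_loglik(n_ones=50, n_total=100)
r2 = mcfadden_pseudo_r2(ll_model=-40.0, ll_null=-80.0)
print(chi2, p_value, ll0, r2)
```

The coefficient, its standard error, and the model log-likelihood come from the ml output; the count of ones would need one extra pass over the pieces.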

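To check that the coefficients really are the same as those from a standard logit, set obs to something relatively small, such as 100000, and set max_opp to something like 1000. Run my code and look at the output, then run logit depvar indepvar and look at its output: the two are the same apart from what I mention under "Cons" above. Setting obs to the same value as max_opp will correct the Wald Chi2 statistic.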
