Is there a way to implement sample weights?

I am using statsmodels in Python for a logistic regression analysis. For example:

import statsmodels.api as sm
import numpy as np
x = np.arange(0, 1, 0.01)  # 100 evenly spaced points in [0, 1)
y = np.random.rand(100)
y[y<=x] = 1
y[y!=1] = 0
x = sm.add_constant(x)
lr = sm.Logit(y,x)
result = lr.fit().summary()


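But I would like to define different weights for my observations. I have combined 4 datasets of different sizes, and I want to weight the analysis so that observations from the largest dataset do not dominate the model.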
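
One way to read the weighting described above is to give every dataset the same total weight. The sketch below only builds such a weight vector; the four dataset sizes are hypothetical and not taken from the question.

```python
import numpy as np

# Hypothetical sizes of the four combined datasets (not from the question).
sizes = np.array([1000, 200, 50, 10])

# Dataset label for every observation: 0, 0, ..., 1, 1, ..., 3.
dataset_id = np.repeat(np.arange(len(sizes)), sizes)

# Give each observation the weight N / (K * n_k), so that every dataset
# contributes the same total weight and all weights sum to N.
n_total = sizes.sum()
weights = n_total / (len(sizes) * sizes[dataset_id])

# Total weight per dataset is now identical.
per_dataset_total = np.bincount(dataset_id, weights=weights)
print(per_dataset_total)
```

As far as I know, `sm.Logit` does not take observation weights, while `sm.GLM(y, x, family=sm.families.Binomial(), freq_weights=weights)` accepts a `freq_weights` argument. Whether frequency weights give the intended standard errors for this use case is a separate question.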

python sample statsmodels logistic-regression

use*_*817

2019 05-23

7
votes

1
answer

10k
views

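sklearn.linear_model.LogisticRegression returns different coefficients every time, even though random_state is set

I am fitting a logistic regression model with the random state set to a fixed value.

Every time I call fit I get different coefficients, for example: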

classifier_instance.fit(train_examples_features, train_examples_labels)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty='l2', random_state=1, tol=0.0001)

>>> classifier_instance.raw_coef_
array([[ 0.071101940040772596  ,  0.05143724979709707323,  0.071101940040772596  , -0.04089477198935181912, -0.0407380696457252528 ,  0.03622160087086594843,  0.01055345545606742319,
         0.01071861708285645406, -0.36248634699444892693, -0.06159019047096317423,  0.02370064668025737009,  0.02370064668025737009, -0.03159781822495803805,  0.11221150783553821006,
         0.02728295348681779309,  0.071101940040772596  ,  0.071101940040772596  ,  0.                    ,  0.10882033432637286396,  0.64630314505709030026,  0.09617956519989406816,
         0.0604133873444507169 ,  0.                    ,  0.04111685986987245051,  0.                    ,  0.                    ,  0.18312324521915510078,  0.071101940040772596  ,
         0.071101940040772596  ,  0.                    , -0.59561802045324663268, -0.61490898457874587635,  1.07812569991461248975,  0.071101940040772596  ]])

classifier_instance.fit(train_examples_features, train_examples_labels)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty='l2', random_state=1, tol=0.0001)

>>> classifier_instance.raw_coef_
array([[ 0.07110193825129411394,  0.05143724970282205489,  0.07110193825129411394, -0.04089477178162870957, …

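To see how large the discrepancy actually is, the first four coefficients printed for each fit can be compared directly. This is only a sketch on the values visible above; it does not identify the cause, though differences this far below the printed `tol` would be consistent with a solver stopping at slightly different points within tolerance.

```python
import numpy as np

# First four coefficients from each of the two fits shown above.
run_1 = np.array([0.071101940040772596, 0.05143724979709707323,
                  0.071101940040772596, -0.04089477198935181912])
run_2 = np.array([0.07110193825129411394, 0.05143724970282205489,
                  0.07110193825129411394, -0.04089477178162870957])

# Largest absolute difference between the two runs.
max_abs_diff = np.max(np.abs(run_1 - run_2))
print(max_abs_diff)

# tol=0.0001 is the stopping tolerance printed in the estimator repr.
same_within_tol = max_abs_diff < 1e-4
print(same_within_tol)
```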

python scikit-learn logistic-regression

jon*_*ans

2014 06-26

7
votes

1
answer

1820
views

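Vowpal Wabbit logistic regression

I am using Vowpal Wabbit to perform logistic regression on a dataset with 25 features and 48 million instances. I have a question about the current predict values. Shouldn't they lie between 0 and 1?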

average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.693147   0.693147            1         1.0  -1.0000   0.0000       24
0.419189   0.145231            2         2.0  -1.0000  -1.8559       24
0.235457   0.051725            4         4.0  -1.0000  -2.7588       23
6.371911   12.508365           8         8.0  -1.0000  -3.7784       24
3.485084   0.598258           16        16.0  -1.0000  -2.2767       24
1.765249   0.045413           32        32.0  -1.0000  -2.8924       24
1.017911   0.270573           64        64.0  -1.0000  -3.0438       25
0.611419   0.204927          128       128.0  -1.0000  -3.1539       25
0.469127   0.326834          256       256.0  -1.0000  -1.6101       23
0.403473 …

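The progress table above is consistent with the predictions being raw scores on the logit scale rather than probabilities: the logistic loss computed from the first two rows reproduces the "since last" column. The sketch below checks that and maps the scores to probabilities with a sigmoid. It assumes labels in {-1, +1}, as shown in the table.

```python
import math

def sigmoid(score):
    """Map a raw score on the logit scale to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

def logistic_loss(label, score):
    """Logistic loss for a label in {-1, +1} and a raw score."""
    return math.log1p(math.exp(-label * score))

# Raw predictions copied from the first four rows of the table above.
raw_predictions = [0.0, -1.8559, -2.7588, -3.7784]

# Rows 1 and 2 each cover a single example, so the loss can be compared
# with the "since last" column (0.693147 and 0.145231).
print(logistic_loss(-1, 0.0))
print(logistic_loss(-1, -1.8559))

probabilities = [sigmoid(s) for s in raw_predictions]
print(probabilities)
```

Vowpal Wabbit also has a `--link logistic` option that applies this transformation to the reported predictions.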

machine-learning vowpalwabbit logistic-regression

use*_*694

2014 11-10

7
votes

1
answer

1007
views

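Logistic Regression in scikit-learn: getting the coefficients for classification

I am doing multiclass classification and applying logistic regression to it.

When I fit the data by calling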

logistic.fit(InputDATA,OutputDATA)


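the estimator "logistic" is fitted to the data.

Now, when I call logistic.coef_, it prints a 2D array with 4 rows (I have four classes) and n columns (one per feature).

This is what I saw on the scikit-learn website:

coef_ : array, shape (n_features, ) or (n_targets, n_features). Estimated coefficients for the linear regression problem. If multiple targets are passed during the fit (y 2D), this is a 2D array of shape (n_targets, n_features), while if only one target is passed, this is a 1D array of length n_features.

Now my question is: why are there different coefficients for different classes, when I only need one hypothesis that predicts the output?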
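
A minimal sketch of how one coefficient row per class can still yield a single prediction: each row scores its own class, and the class with the highest score wins. The coefficients, intercepts and inputs below are hypothetical values chosen for illustration, not output from a fitted model.

```python
import numpy as np

# Hypothetical fitted parameters for 4 classes and 3 features,
# laid out like coef_ (one row per class) and intercept_.
coef = np.array([[ 1.0, -0.5,  0.0],
                 [-1.0,  0.5,  0.2],
                 [ 0.0,  1.0, -1.0],
                 [ 0.3, -1.0,  1.0]])
intercept = np.array([0.1, 0.0, -0.2, 0.0])

# Two hypothetical samples.
X = np.array([[2.0, 0.0, 0.0],
              [0.0, 0.0, 3.0]])

# One linear score per (sample, class) pair.
scores = X @ coef.T + intercept

# The predicted class is the one with the largest score.
predicted_class = np.argmax(scores, axis=1)
print(scores)
print(predicted_class)
```

Note that the documentation excerpt quoted in the question describes the coefficients of a linear regression estimator, which may be why it does not mention classes.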

machine-learning scikit-learn logistic-regression

Shi*_*gal

2015 07-22

7
votes

1
answer

3650
views

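Scikit F-score metric error

I am trying to use scikit-learn's logistic regression to predict a set of labels. My data is heavily imbalanced (there are many more '0' labels than '1' labels), so I have to use the F1 score metric in the cross-validation step to "balance" the result.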

[Input]
X_training, y_training, X_test, y_test = generate_datasets(df_X, df_y, 0.6)
logistic = LogisticRegressionCV(
    Cs=50,
    cv=4,
    penalty='l2', 
    fit_intercept=True,
    scoring='f1'
)
logistic.fit(X_training, y_training)
print('Predicted: %s' % str(logistic.predict(X_test)))
print('F1-score: %f'% f1_score(y_test, logistic.predict(X_test)))
print('Accuracy score: %f'% logistic.score(X_test, y_test))

[Output]
>> Predicted: [0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
>> Actual:    [0 0 0 1 0 0 0 0 0 1 1 0 …


python machine-learning scikit-learn cross-validation logistic-regression

Dav*_*vid

lucky-day

7
votes

1
answer

10k
views

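Classification: PCA and logistic regression with sklearn

Step 0: Problem description

I have a classification problem: I want to predict a binary target from a collection of numerical features, using logistic regression after running a principal component analysis (PCA).

I have 2 datasets, df_train and df_valid (the training set and the validation set respectively), as pandas data frames containing the features and the target. As a first step, I used the pandas get_dummies function to convert all categorical variables to boolean. For example, I would have: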

import numpy as np
import pandas as pd

n_train = 10
np.random.seed(0)  # fixed seed so the frame below is reproducible
df_train = pd.DataFrame({"f1": np.random.random(n_train),
                         "f2": np.random.random(n_train),
                         "f3": np.random.randint(0, 2, n_train).astype(bool),
                         "target": np.random.randint(0, 2, n_train).astype(bool)})

In [36]: df_train
Out[36]: 
         f1        f2     f3 target
0  0.548814  0.791725  False  False
1  0.715189  0.528895   True   True
2  0.602763  0.568045  False   True
3  0.544883  0.925597   True   True
4  0.423655  0.071036   True   True
5  0.645894  0.087129   True  False
6  0.437587  0.020218   True   True
7  0.891773  0.832620   True  False
8  0.963663  0.778157  False  False
9  0.383442  0.870012   True   True …


python classification pca scikit-learn logistic-regression

ldo*_*cao

2019 07-09

7
votes

1
answer

4553
views

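Different sigmoid equations and their implementation

While reviewing the sigmoid function used in neural networks, we found this equation at https://en.wikipedia.org/wiki/Softmax_function#Softmax_Normalization:

It differs from the standard sigmoid equation:

The first equation somehow involves the mean and the standard deviation (I hope I am not misreading the notation), whereas the second generalizes away the subtraction of the mean and the division by the standard deviation as constants, since they are the same for every term in the vector/matrix/tensor.

So when implementing the equations, I get different results.

With the second equation (the standard sigmoid function):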

def sigmoid(x):
    return 1. / (1 + np.exp(-x))


I get these outputs:

>>> x = np.array([1,2,3])
>>> print(sigmoid(x))
[ 0.73105858  0.88079708  0.95257413]


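I expected the first function to behave similarly, but the gap between the first and second elements widens considerably (although the ranking of the elements is preserved):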

def get_statistics(x):
    n = float(len(x))
    m = x.sum() / n
    s2 = sum((x - m)**2) / (n-1.) 
    s = s2**0.5
    return m, s2, s

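# Unpack in the order the function returns: mean, variance, standard deviation.
m, s2, s = get_statistics(x)

# Standardize each element by the standard deviation before the sigmoid.
sigmoid_x1 = 1 / (1 + np.exp(-(x[0] - m) / s))
sigmoid_x2 = 1 / (1 + np.exp(-(x[1] - m) / s))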
sigmoid_x3 …

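The two variants discussed above can be compared side by side. This sketch assumes the first equation is the sigmoid applied to the standardized input (x - mean) / std, with the sample standard deviation (n - 1 in the denominator) as in get_statistics, and uses the same x as the question.

```python
import numpy as np

def sigmoid(x):
    """Standard sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([1.0, 2.0, 3.0])

# Standard sigmoid on the raw values.
plain = sigmoid(x)

# Sigmoid on standardized values: subtract the mean and divide by the
# sample standard deviation (ddof=1 matches the n - 1 denominator).
m = x.mean()
s = x.std(ddof=1)
standardized = sigmoid((x - m) / s)

print(plain)         # approx. [0.73105858 0.88079708 0.95257413]
print(standardized)  # approx. [0.26894142 0.5        0.73105858]
```

Standardizing centers the inputs on 0, so the middle element maps to exactly 0.5 and the outputs spread out symmetrically around it, which is why the gaps between elements are larger than with the plain sigmoid.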

python math neural-network logistic-regression softmax

alv*_*vas

2016 04-28

7
votes

1
answer

995
views

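R geepack: unreasonably large estimates using GEE

I am using geepack for R to estimate a logistic marginal model with geeglm(). But I get garbage estimates. They are about 16 orders of magnitude too large. However, the p-values seem similar to what I expected. This means the response essentially becomes a step function. See the attached plot.

Here is the code that generates the plot: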

require(geepack)
data = read.csv(url("http://folk.uio.no/mariujon/data.csv"))
fit = geeglm(moden ~ 1 + power, id = defacto, data=data, corstr = "exchangeable", family=binomial)
summary(fit)
plot(moden ~ power, data=data)
x = 0:2500
y = predict(fit, newdata=data.frame(power = x), type="response" )
lines(x,y)


Here is the regression table:

Call:
geeglm(formula = moden ~ 1 + power, family = binomial, data = data, 
    id = defacto, corstr = "exchangeable")

 Coefficients:
             Estimate   Std.err  Wald Pr(>|W|)    
(Intercept) -7.38e+15  1.47e+15  25.1  5.4e-07 ***
power        2.05e+13  1.60e+12 164.4  < 2e-16 …


r glm random-effects mixed-models logistic-regression

Mik*_*Rev

2017 01-17

7
votes

1
answer

714
views

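Statsmodels ANOVA for logistic regression

I find the statsmodels implementation of the ANOVA test for linear models very useful (http://www.statsmodels.org/dev/generated/statsmodels.stats.anova.anova_lm.html#statsmodels.stats.anova.anova_lm), but since an equivalent does not exist in the library, I was wondering how to build one for logistic regression.

Formulas: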

from statsmodels.formula.api import ols, logit
import statsmodels.api as sm

ols(formula_str, data=data_on_which_to_perform_analysis).fit()
logit(formula_str, data=data_on_which_to_perform_analysis).fit()

sm.stats.anova_lm()


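This essentially means (from looking at the source code) replicating anova_single. Has anyone already done this in some remote repository? I ask because the implementation very quickly goes deep into the statsmodels core library, so working it out is not easy (at least at my current skill level).

Any suggestions on how to proceed?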

python anova statsmodels logistic-regression

Ash*_*r11

lucky-day

7
votes

0
answers

982
views

TypeError: 'numpy.float64' object is not callable - when printing the F1 score

I am trying to run the following code in a Jupyter Notebook:

lr = LogisticRegression(class_weight='balanced')
lr.fit(X_train,y_train)
y_pred = lr.predict(X_train)

acc_log = round(lr.score(X_train, y_train) * 100, 2)
prec_log = round(precision_score(y_train,y_pred) * 100,2)
recall_log = round(recall_score(y_train,y_pred) * 100,2)
f1_log = round(f1_score(y_train,y_pred) * 100,2)
roc_auc_log = roc_auc_score(y_train,y_pred)


When I try to run it, I get the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-46-bcb2d9729eb6> in <module>
      6 prec_log = round(precision_score(y_train,y_pred) * 100,2)
      7 recall_log = round(recall_score(y_train,y_pred) * 100,2)
----> 8 f1_log = round(f1_score(y_train,y_pred) * 100,2)
      9 roc_auc_log = roc_auc_score(y_train,y_pred)

TypeError: 'numpy.float64' object is not callable


I can't seem to figure out what I am doing wrong.
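
One common way to get this message is for the name f1_score to have been rebound to a number in an earlier notebook cell, so that a later call hits a float instead of the function. Whether that happened here cannot be told from the excerpt. The sketch below reproduces the effect with a hypothetical stand-in for the real metric function.

```python
import numpy as np

def f1_score(y_true, y_pred):
    """Hypothetical stand-in for an F1 function on binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return np.float64(2 * tp / (2 * tp + fp + fn))

y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]

# Accidental rebinding: the result is stored under the function's own name.
f1_score = f1_score(y_true, y_pred)

try:
    f1_score(y_true, y_pred)  # the name now refers to a float, not a function
    message = None
except TypeError as err:
    message = str(err)

print(message)
```

In a notebook, re-importing the function or restarting the kernel restores the original binding; storing the result under a different name, as f1_log does in the question, avoids the clash.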

python scikit-learn logistic-regression

exc*_*man

2021 11-26

7
votes

1
answer

6793
views