I'm working on a project that involves some natural language processing, and I'm using the Stanford MaxEnt classifier for it. But I'm not sure whether the maximum entropy model is the same thing as logistic regression, or some special kind of logistic regression.
Can anyone offer an explanation?
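For reference, a multiclass maximum entropy classifier is mathematically a multinomial (softmax) logistic regression, and with two classes the softmax collapses to the logistic sigmoid. A minimal numerical check of that identity (a sketch, not Stanford classifier code):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

# A two-class maxent model with class scores [0, w.x] is exactly logistic regression:
w_dot_x = 1.3
p = softmax(np.array([0.0, w_dot_x]))
sigmoid = 1.0 / (1.0 + np.exp(-w_dot_x))
assert np.isclose(p[1], sigmoid)  # softmax([0, z])[1] == 1 / (1 + exp(-z))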
I was thinking of posting my question on Cross Validated, but decided to come here. I am using the multinom() function from the nnet package to estimate the probability of being employed, unemployed, or not in the labor force as a function of age and education level. I need some help with the interpretation.
I have the following dataset with one categorical dependent variable, employment status (EmpSt), and two categorical independent variables: age (Age) and education level (Education).
>head(df)
EmpSt Age Education
1 Employed 61+ Less than a high school diploma
2 Employed 50-60 High school graduates, no college
3 Not in labor force 50-60 Less than a high school diploma
4 Employed 30-39 Bachelor's degree or higher
5 Employed 20-29 Some college or associate degree
6 Employed 20-29 Some college or associate degree
Here is a summary of the levels:
>summary(df)
EmpSt Age Education
Not in universe : 0 16-19: 6530 Less than a high school diploma :14686
Employed :61478 20-29:16031 …
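On interpretation: multinom() takes the first level of EmpSt as the reference category and reports, for each other level, coefficients on the log-odds scale relative to that baseline; per-class probabilities come from predict(fit, type = "probs"). A rough Python analog of the same model, as a sketch only (the toy rows are copied from head(df) above):

import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "EmpSt": ["Employed", "Employed", "Not in labor force",
              "Employed", "Employed", "Employed"],
    "Age": ["61+", "50-60", "50-60", "30-39", "20-29", "20-29"],
    "Education": ["Less than a high school diploma",
                  "High school graduates, no college",
                  "Less than a high school diploma",
                  "Bachelor's degree or higher",
                  "Some college or associate degree",
                  "Some college or associate degree"],
})
X = pd.get_dummies(df[["Age", "Education"]])   # one-hot encode both factors
clf = LogisticRegression(max_iter=1000).fit(X, df["EmpSt"])
print(clf.predict_proba(X))   # one probability column per outcome level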
How do I run a multivariate (multiple dependent variables) logistic regression in R?

I know you can do this for linear regression, and it works:
form <- cbind(A, B, C, D) ~ shopping_pt + price
mlm.model.1 <- lm(form, data = train)
But when I try the logistic regression below, it does not work:
model.logistic <- glm(form, family=binomial(link=logit), data=train)
Thanks for your help.
To add: even my code for doing this with the linear model above may not be correct. I am trying to do what is outlined in this paper, which some may find useful:
ftp://ftp.cis.upenn.edu/pub/datamining/public_html/ReadingGroup/papers/multiResponse.pdf
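One note: in a binomial glm(), a cbind() on the left-hand side means successes/failures counts for a single response, not several separate responses, which is why the lm() formula doesn't carry over. A common workaround is one logistic fit per response column; a sketch in Python (synthetic stand-in data, since the question's train frame isn't shown):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                  # stand-ins for shopping_pt and price
Y = (rng.random((100, 4)) < 0.5).astype(int)   # four binary responses A, B, C, D

multi = MultiOutputClassifier(LogisticRegression()).fit(X, Y)  # one fit per column
print(multi.predict_proba(X)[0])               # class probabilities for response A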
I am modeling data for a logit model with 34 dependent variables, and it keeps throwing a singular matrix error, as shown in the traceback below:
Traceback (most recent call last):
File "<pyshell#1116>", line 1, in <module>
test_scores = smf.Logit(m['event'], train_cols,missing='drop').fit()
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/discrete/discrete_model.py", line 1186, in fit
disp=disp, callback=callback, **kwargs)
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/discrete/discrete_model.py", line 164, in fit
disp=disp, callback=callback, **kwargs)
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/base/model.py", line 357, in fit
hess=hess)
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/base/model.py", line 405, in _fit_mle_newton
newparams = oldparams - np.dot(np.linalg.inv(H),
File "/usr/local/lib/python2.7/site-packages/numpy/linalg/linalg.py", line 445, in inv
return wrap(solve(a, identity(a.shape[0], dtype=a.dtype)))
File "/usr/local/lib/python2.7/site-packages/numpy/linalg/linalg.py", line 328, in solve
raise LinAlgError, 'Singular matrix'
LinAlgError: Singular matrix
This happens even after I reduce the matrix to its independent columns with an approach like this:
def independent_columns(A, tol = …
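For what it's worth, one standard way to pick out a linearly independent subset of columns is QR decomposition with column pivoting; a sketch (not necessarily the asker's truncated helper above):

import numpy as np
from scipy.linalg import qr

def independent_columns_qr(A, tol=1e-8):
    # columns whose pivoted-R diagonal exceeds tol form an independent subset
    _, R, piv = qr(A, pivoting=True)
    rank = np.sum(np.abs(np.diag(R)) > tol)
    return np.sort(piv[:rank])

A = np.array([[1., 2., 3.],
              [2., 4., 1.],
              [3., 6., 5.]])      # column 1 is 2 * column 0
print(independent_columns_qr(A))  # -> [1 2]: one of each dependent pair survives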
I fitted a logistic regression model predicting the binary outcome vs from mpg (mtcars dataset); the plot is shown below. How can I determine the value of mpg for any particular value of vs? For example, I want to find out what mpg is when the predicted probability of vs is 0.50. Thanks for any help anyone can provide!

model <- glm(vs ~ mpg, data = mtcars, family = binomial)
ggplot(mtcars, aes(mpg, vs)) +
geom_point() +
stat_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE)
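For a single-predictor logistic fit, the p = 0.50 point has a closed form: the log-odds b0 + b1*mpg are zero there, so mpg = -b0/b1. A Python sketch of the calculation (fetching mtcars through statsmodels' get_rdataset is an assumption about data access; the fit mirrors the R glm above):

import statsmodels.api as sm
import statsmodels.formula.api as smf

mtcars = sm.datasets.get_rdataset("mtcars").data   # downloads the R dataset
fit = smf.logit("vs ~ mpg", data=mtcars).fit()
b0, b1 = fit.params["Intercept"], fit.params["mpg"]
print(-b0 / b1)   # mpg at which the predicted probability of vs is 0.50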
I want to score different classifiers with different parameters.
For speed, I use LogisticRegressionCV for LogisticRegression (it is at least 2x faster) and plan to use GridSearchCV for the others.
The problem is that although it gives me the same C parameter, the ROC AUC scores differ.
I tried fixing many parameters, such as scorer, random_state, solver, max_iter, tol... Please look at the example (the actual data doesn't matter):
Test data and the common part:
from sklearn import datasets
boston = datasets.load_boston()
X = boston.data
y = boston.target
y = boston.target
y[y <= y.mean()] = 0; y[y > 0] = 1  # binarize the target around its mean
import numpy as np
from sklearn.cross_validation import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import LogisticRegressionCV
fold = KFold(len(y), n_folds=5, shuffle=True, random_state=777)
grid = {
'C': …
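A likely source of the mismatch is that the two searches are not scored identically: LogisticRegressionCV defaults to accuracy unless a scorer is passed, and fold assignments differ unless both get the same splitter. A sketch that pins both down (written against the newer sklearn.model_selection API, with load_breast_cancer standing in because load_boston is gone from recent scikit-learn):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import GridSearchCV, KFold

X, y = load_breast_cancer(return_X_y=True)
fold = KFold(n_splits=5, shuffle=True, random_state=777)
Cs = np.logspace(-3, 3, 7)

# same folds and same scorer for both searches
lrcv = LogisticRegressionCV(Cs=Cs, cv=fold, scoring='roc_auc', max_iter=5000).fit(X, y)
gs = GridSearchCV(LogisticRegression(max_iter=5000), {'C': Cs},
                  cv=fold, scoring='roc_auc').fit(X, y)
print(lrcv.C_, gs.best_params_, gs.best_score_)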
I have a test dataset and a training dataset as shown below. I am providing sample data with minimal records, but my real data has more than 1000 records. Here E is the target variable that I need to predict using an algorithm. It has only four categories, 1, 2, 3, 4, and can only take one of these values.
Training dataset:
A B C D E
1 20 30 1 1
2 22 12 33 2
3 45 65 77 3
12 43 55 65 4
11 25 30 1 1
22 23 19 31 2
31 41 11 70 3
1 48 23 60 4
Test dataset:
A B C D E
11 21 12 11
1 2 3 4
5 6 7 8
99 87 65 34
11 21 24 12
Since E has only 4 categories, I want to predict it with multinomial logistic regression (one-vs-rest logic). I am trying to implement it in Python.
I know the logic needed to set these targets up in a variable and use an algorithm to predict any of these values:
output = …
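A minimal sketch of the one-vs-rest setup in scikit-learn, using the training rows from the tables above (the estimator choice and max_iter are illustrative):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X_train = np.array([[1, 20, 30, 1], [2, 22, 12, 33], [3, 45, 65, 77],
                    [12, 43, 55, 65], [11, 25, 30, 1], [22, 23, 19, 31],
                    [31, 41, 11, 70], [1, 48, 23, 60]])   # columns A, B, C, D
y_train = np.array([1, 2, 3, 4, 1, 2, 3, 4])              # column E
X_test = np.array([[11, 21, 12, 11], [1, 2, 3, 4]])       # first two test rows

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_train, y_train)
print(clf.predict(X_test))   # predicted E value (1-4) for each test row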
I want to use cross-validation to test/train my dataset and evaluate the performance of a logistic regression model on the entire dataset, not only on the test set (e.g. 25%).
These concepts are totally new to me, and I am not sure whether I am doing it right. I would be grateful if someone could tell me the correct steps and point out where I have gone wrong. Part of my code is shown below.
Also, how can I plot the ROCs for "y2" and "y3" on the same graph as the current one?
Thanks
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split, cross_val_score, cross_val_predict
from sklearn import metrics
from nltk import ConfusionMatrix

Data = pd.read_csv('C:\\Dataset.csv', index_col='SNo')
feature_cols = ['A', 'B', 'C', 'D', 'E']
X = Data[feature_cols]
Y = Data['Status']
Y1 = Data['Status1']   # predictions from elsewhere
Y2 = Data['Status2']   # predictions from elsewhere

# split first, then fit on the training portion only
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)

# 10-fold cross-validated predictions and accuracy over the whole dataset
predicted = cross_val_predict(logreg, X, Y, cv=10)
print(metrics.accuracy_score(Y, predicted))
accuracy = cross_val_score(logreg, X, Y, cv=10, scoring='accuracy')
print(accuracy)
print(accuracy.mean())
print(ConfusionMatrix(list(Y), list(predicted)))
#print …
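On the ROC question: roc_curve can be called once per prediction column and the curves drawn on one axis. A sketch that continues from the variables above (it assumes Y1 and Y2 hold scores or probabilities; the names come from the snippet):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

for name, scores in [('Status1', Y1), ('Status2', Y2)]:
    fpr, tpr, _ = roc_curve(Y, scores)
    plt.plot(fpr, tpr, label='%s (AUC = %.2f)' % (name, auc(fpr, tpr)))
plt.plot([0, 1], [0, 1], 'k--')   # chance line
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend(loc='lower right')
plt.show()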
I am doing multiclass/multilabel text classification and am trying to get rid of a "ConvergenceWarning".
When I raised max_iter from the default to 4000, the warning disappeared. However, my model accuracy decreased from 78 to 75.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
logreg = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', LogisticRegression(n_jobs=1, C=1e5, solver='lbfgs',multi_class='ovr' ,random_state=0, class_weight='balanced' )),
])
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print('Logistic Regression Accuracy %s' % accuracy_score(y_pred, y_test))
cv_score = cross_val_score(logreg, train_tfidf, y_train, cv=10, scoring='accuracy')
print("CV Score : Mean : %.7g | Std : %.7g | Min : %.7g | Max : %.7g" % (np.mean(cv_score),np.std(cv_score),np.min(cv_score),np.max(cv_score)))
Why does my accuracy decrease when max_iter = 4000? Is there any other way to fix "ConvergenceWarning: lbfgs failed to converge. Increase the number of iterations."?
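Two levers that often remove the warning without a huge max_iter: meaningful regularization (C=1e5 is nearly unregularized, a common cause of slow lbfgs convergence on tf-idf features) and a solver suited to sparse input. A sketch of the same pipeline with those changes (the C value is illustrative, not tuned):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

logreg = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    # C=1.0 restores real regularization; 'saga' handles sparse tf-idf well
    ('clf', LogisticRegression(C=1.0, solver='saga', max_iter=1000,
                               multi_class='ovr', class_weight='balanced',
                               random_state=0)),
])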
I am trying out the code from this page. I ran it through the LR (tf-idf) part and got similar results.
After that I decided to try GridSearchCV. My questions are as follows:
1)
#lets try gridsearchcv
#https://www.kaggle.com/enespolat/grid-search-with-logistic-regression
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

grid = {"C": np.logspace(-3, 3, 7), "penalty": ["l2"]}  # l1 lasso, l2 ridge
logreg = LogisticRegression(solver='liblinear')
logreg_cv=GridSearchCV(logreg,grid,cv=3,scoring='f1')
logreg_cv.fit(X_train_vectors_tfidf, y_train)
print("tuned hpyerparameters :(best parameters) ",logreg_cv.best_params_)
print("best score :",logreg_cv.best_score_)
#tuned hpyerparameters :(best parameters) {'C': 10.0, 'penalty': 'l2'}
#best score : 0.7390325593588823
Then I calculated the F1 score manually. Why doesn't it match?
logreg_cv.predict_proba(X_train_vectors_tfidf)[:,1]
final_prediction=np.where(logreg_cv.predict_proba(X_train_vectors_tfidf)[:,1]>=0.5,1,0)
#https://www.statology.org/f1-score-in-python/
from sklearn.metrics import f1_score
#calculate F1 score
f1_score(y_train, final_prediction)
0.9839388145315489
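For what it's worth, the two numbers measure different things: best_score_ is the mean F1 over held-out CV folds, while the manual calculation above scores the refit model on the very data it was trained on, which is optimistic. A sketch that approximately reproduces best_score_ (names reused from the snippets above):

from sklearn.model_selection import cross_val_score

# held-out-fold F1, comparable to logreg_cv.best_score_
print(cross_val_score(logreg_cv.best_estimator_, X_train_vectors_tfidf,
                      y_train, cv=3, scoring='f1').mean())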
2) Why do I get the error below with scoring='precision'? It is not clear to me, mainly because I have a relatively balanced dataset (55-45%), and F1, which needs precision, is computed without any problem.

#lets try gridsearchcv
#https://www.kaggle.com/enespolat/grid-search-with-logistic-regression
from sklearn.model_selection import GridSearchCV
grid={"C":np.logspace(-3,3,7), …