I have a large amount of Yelp data, and I have to classify the reviews into 8 different categories.
Categories:
Cleanliness
Customer Service
Parking
Billing
Food Pricing
Food Quality
Waiting time
Unspecified
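A review can contain multiple categories, so I used multi-label classification. But I am confused about how to handle positive/negative sentiment. A single review may be positive about food quality but negative about customer service. For example: "food taste was very good but staff behaviour was very bad", so the review contains positive Food Quality but negative Customer Service.
How should I handle this case? Should I run sentiment analysis before classification? Please help.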
classification machine-learning sentiment-analysis multilabel-classification multiclass-classification
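One possible way to frame the problem above, sketched under assumptions: treat every (category, polarity) pair as its own label, so that a single multi-label target carries both the aspect and its sentiment. The names `LABELS`, `encode_review`, and `decode_prediction`, and the restriction to two polarities, are illustrative choices and do not come from the question.

```python
# Hypothetical label scheme: each (category, polarity) pair is one label.
CATEGORIES = [
    "Cleanliness", "Customer Service", "Parking", "Billing",
    "Food Pricing", "Food Quality", "Waiting time", "Unspecified",
]
POLARITIES = ["positive", "negative"]  # assumption: only two polarities

LABELS = [(c, p) for c in CATEGORIES for p in POLARITIES]
LABEL_INDEX = {label: i for i, label in enumerate(LABELS)}


def encode_review(annotations):
    """Turn {category: polarity} into a multi-hot target vector."""
    target = [0] * len(LABELS)
    for category, polarity in annotations.items():
        target[LABEL_INDEX[(category, polarity)]] = 1
    return target


def decode_prediction(vector):
    """Map a multi-hot vector back to (category, polarity) pairs."""
    return [LABELS[i] for i, flag in enumerate(vector) if flag]


# The example review from the question, annotated by hand.
example = {"Food Quality": "positive", "Customer Service": "negative"}
target = encode_review(example)
print(decode_prediction(target))
```

A vector like `target` could then serve as the training target of any multi-label classifier. Running sentiment analysis as a separate step per detected category is an alternative design that this sketch does not cover.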
Hi, I am implementing a multi-class classification model (5 classes) with the new spaCy model en_pytt_bertbaseuncased_lg. The code for the new pipeline is here:
nlp = spacy.load('en_pytt_bertbaseuncased_lg')
textcat = nlp.create_pipe(
    'pytt_textcat',
    config={
        "nr_class": 5,
        "exclusive_classes": True,
    }
)
nlp.add_pipe(textcat, last=True)
textcat.add_label("class1")
textcat.add_label("class2")
textcat.add_label("class3")
textcat.add_label("class4")
textcat.add_label("class5")
The training code is as follows and is based on the example here (https://pypi.org/project/spacy-pytorch-transformers/):
def extract_cat(x):
    for key in x.keys():
        if x[key]:
            return key

# get names of other pipes to disable them during training
n_iter = 250  # number of epochs
train_data = list(zip(train_texts, [{"cats": cats} for cats in train_cats]))
dev_cats_single = [extract_cat(x) for x in dev_cats]
train_cats_single = [extract_cat(x) for x in train_cats] …
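This relates to a previous Stack Overflow post, "Model() got multiple values for argument 'nr_class' - SpaCy multi-classification model (BERT integration)", in which my problem was partially resolved. I want to share the issue that appeared after implementing the solution.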
If I remove the nr_class parameter, I get this error:
ValueError: operands could not be broadcast together with shapes (1,2) (1,5)
I actually expected this to happen because I did not specify the nr_class parameter. Is that correct?
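The error message itself is raised by NumPy and can be reproduced in isolation. The sketch below assumes, purely for illustration, that one array holds scores for 2 classes while the other holds a target vector for 5 labels; it does not reproduce the spaCy internals, and the variable names are hypothetical.

```python
import numpy as np

scores = np.zeros((1, 2))   # hypothetical: output sized for 2 classes
targets = np.zeros((1, 5))  # hypothetical: gold vector for 5 labels

try:
    scores - targets  # element-wise op on incompatible shapes
    message = None
except ValueError as err:
    message = str(err)

print(message)
```

Shapes (1,2) and (1,5) cannot be broadcast because the trailing dimensions differ and neither is 1, which would be consistent with a model output layer whose size does not match the number of labels.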
Once again, my code for the multi-class model:
nlp = spacy.load('en_pytt_bertbaseuncased_lg')
textcat = nlp.create_pipe(
    'pytt_textcat',
    config={
        "nr_class": 5,
        "exclusive_classes": True,
    }
)
nlp.add_pipe(textcat, last=True)
textcat.add_label("class1")
textcat.add_label("class2")
textcat.add_label("class3")
textcat.add_label("class4")
textcat.add_label("class5")
The training code is as follows and is based on the example here (https://pypi.org/project/spacy-pytorch-transformers/):
def extract_cat(x):
    for key in x.keys():
        if x[key]:
            return key

# get names of other pipes to disable them during training
n_iter = 250  # number of epochs
train_data = list(zip(train_texts, [{"cats": cats} for …
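I built a decision tree, which also gives me the feature importances for my classification. But how can I tell my program to give me the feature importances for each class? To get the overall feature importances, I have this code: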
import numpy as np
import matplotlib.pyplot as plt

importances = tree.feature_importances_
# std only makes sense for a forest, so it stays commented out for a single tree:
# std = np.std([tree.feature_importances_ for tree in forest.estimators_],
#              axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")
for f in range(X.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

# Plot the feature importances of the tree (no error bars, since std is not computed)
plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
        color="r", align="center")
plt.xticks(range(X.shape[1]), [feature_cols[i] for i in indices])
plt.xlim([-1, X.shape[1]])
plt.show()
I have four classes: 0, 1, 2, 3. Does anyone know a solution?
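One workaround, offered as a sketch and not as a built-in scikit-learn feature: fit a separate one-vs-rest binary tree per class and read `feature_importances_` from each. The synthetic data from `make_classification` stands in for the asker's `X` and `y`, and `max_depth=4` is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data: 4 classes, 6 features.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

per_class_importances = {}
for cls in np.unique(y):
    # Binary target: this class against all the others.
    binary_target = (y == cls).astype(int)
    clf = DecisionTreeClassifier(max_depth=4, random_state=0)
    clf.fit(X, binary_target)
    per_class_importances[int(cls)] = clf.feature_importances_

for cls, importances in per_class_importances.items():
    ranking = np.argsort(importances)[::-1]
    print("class %d: features ranked %s" % (cls, ranking.tolist()))
```

These importances describe each binary tree, not the original multi-class tree, so the two sets of numbers are not directly comparable.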
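In a pickle...
I have a dataset with >100,000 observations; its columns include CustomerID, VendorID, ProductID, and CatNMap. Here is what it looks like:
As you can see, the values in the first 3 columns (CustomerID, VendorID, ProductID) are unique numeric ID mappings that would be meaningless if plotted on an x,y plane (which rules out many classification methods); the last column contains a string with the category assigned by the customer. Now, here is the part I do not understand and am not sure how to approach...
Goal: predict future CatNMap values for a customer. It seems to me, however, that the features I have here are not useful; is that true? If so, what method can I use, given that the CatNMap column has >7,000 unique values? Also, suppose different customers assigned 2 or more different categories to the same product: how would any method handle classifying future items? Do I need to implement a NN for this?
Thanks for all the answers!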
python classification machine-learning neural-network multiclass-classification
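Before reaching for a neural network, a frequency baseline can show how much signal the ID columns carry. The sketch below is one such baseline under stated assumptions: the rows are invented, and the fallback order (customer and product, then product, then customer) is an illustrative choice. Conflicting categories for the same product are resolved by majority vote.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (CustomerID, VendorID, ProductID, CatNMap)
rows = [
    (1, 10, 100, "Office"),
    (2, 10, 100, "Office"),
    (3, 11, 100, "Supplies"),
    (1, 12, 101, "Travel"),
    (1, 12, 102, "Travel"),
]

by_customer_product = defaultdict(Counter)
by_product = defaultdict(Counter)
by_customer = defaultdict(Counter)
for customer, _vendor, product, category in rows:
    by_customer_product[(customer, product)][category] += 1
    by_product[product][category] += 1
    by_customer[customer][category] += 1


def predict(customer, product):
    """Most frequent category, falling back from specific to general."""
    candidates = (
        by_customer_product.get((customer, product)),
        by_product.get(product),
        by_customer.get(customer),
    )
    for counts in candidates:
        if counts:
            return counts.most_common(1)[0][0]
    return None  # nothing known about this customer or product


print(predict(3, 100))  # customer 3's own label for product 100
print(predict(4, 100))  # unseen customer: majority label for the product
```

Any learned model would need to beat this baseline on held-out data to justify its extra complexity.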
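I am trying to explore how XGBoost works for binary classification as well as for multi-class classification. In the binary case, I observed that base_score is treated as the starting probability, and that it also has a major effect on the computed Gain and Cover.
In the multi-class case, I cannot figure out the significance of the base_score parameter, because I get the same Gain and Cover values for different (any) values of base_score.
I also cannot find out why there is a factor of 2 when computing the Cover for multi-class, i.e. 2*p*(1-p).
Can someone help me with these two parts?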
statistics machine-learning boosting xgboost multiclass-classification
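One property that may be relevant to the first part is that softmax is invariant to adding the same constant to every class score, whereas the sigmoid is not. The sketch below assumes, as a simplification, that base_score enters as an identical offset on each class's raw score; the 2*p*(1-p) term is taken from the question as stated, and the function names and scores are hypothetical.

```python
import math


def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))


def multiclass_cover_terms(scores):
    # Per-class term 2*p*(1-p), the formula quoted in the question.
    return [2.0 * p * (1.0 - p) for p in softmax(scores)]


raw_scores = [0.3, -1.2, 0.8]  # hypothetical per-class raw scores
offset = 5.0                   # hypothetical constant added to every class

baseline = multiclass_cover_terms(raw_scores)
shifted = multiclass_cover_terms([s + offset for s in raw_scores])

# Binary case for contrast: the term p*(1-p) does change with the offset.
binary_before = sigmoid(0.3) * (1.0 - sigmoid(0.3))
binary_after = sigmoid(0.3 + offset) * (1.0 - sigmoid(0.3 + offset))

print(baseline, shifted)
print(binary_before, binary_after)
```

Under that assumption the multi-class terms are unchanged by the offset while the binary term shifts substantially, which matches the behaviour described in the question. The sketch does not explain where the factor 2 comes from.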
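I am running a multi-class model (40 classes in total) for 2000 epochs. The model ran fine until epoch 828, but at epoch 829 it gave me an InvalidArgumentError (see the screenshot below).
Below is the code I used to build the model.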
n_cats = 40
input_bow = tf.keras.Input(shape=(40), name="bow")
hidden_1 = tf.keras.layers.Dense(200, activation="relu")(input_bow)
hidden_2 = tf.keras.layers.Dense(100, activation="relu")(hidden_1)
hidden_3 = tf.keras.layers.Dense(80, activation="relu")(hidden_2)
hidden_4 = tf.keras.layers.Dense(70, activation="relu")(hidden_3)
output = tf.keras.layers.Dense(n_cats, activation="sigmoid")(hidden_4)
model = tf.keras.Model(inputs=[input_bow], outputs=output)
METRICS = [
    tf.keras.metrics.Accuracy(name="Accuracy"),
    tf.keras.metrics.Precision(name="precision"),
    tf.keras.metrics.Recall(name="recall"),
    tf.keras.metrics.AUC(name="auc"),
    tf.keras.metrics.BinaryAccuracy(name="binaryAcc")
]
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    "my_keras_model.h5", save_best_only=True)
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(initial_learning_rate=1e-2,
                                                             decay_steps=10000,
                                                             decay_rate=0.9)
adam_optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
model.compile(loss="categorical_crossentropy",
              optimizer="adam", metrics=METRICS)
training_history = model.fit(
    (bow_train),
    indus_cat_train,
    epochs=2000,
    batch_size=128,
    callbacks=[checkpoint_cb],
    validation_data=(bow_test, indus_cat_test)) …
I am trying to plot the hyperplanes when SVM-OVA is performed as follows:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

x = np.array([[1, 1.1], [1, 2], [2, 1]])
y = np.array([0, 100, 250])
classifier = OneVsRestClassifier(SVC(kernel='linear'))
Based on the answer to the question "Plot hyperplane Linear SVM python", I wrote the following code:
fig, ax = plt.subplots()
# create a mesh to plot in
x_min, x_max = x[:, 0].min() - 1, x[:, 0].max() + 1
y_min, y_max = x[:, 1].min() - 1, x[:, 1].max() + 1
xx2, yy2 = np.meshgrid(np.arange(x_min, x_max, .2),np.arange(y_min, y_max, .2))
Z = classifier.predict(np.c_[xx2.ravel(), yy2.ravel()])
Z = Z.reshape(xx2.shape)
ax.contourf(xx2, …
python machine-learning svm scikit-learn multiclass-classification
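For a linear kernel in two dimensions, each one-vs-rest hyperplane is the line w[0]*x + w[1]*y + b = 0, so drawing it only needs two points on that line. The sketch below shows that geometry step with invented weights. With a fitted `OneVsRestClassifier(SVC(kernel='linear'))`, `w` and `b` would be read from `classifier.estimators_[i].coef_[0]` and `classifier.estimators_[i].intercept_[0]`. The helper name `boundary_endpoints` is hypothetical.

```python
def boundary_endpoints(w, b, x_min, x_max, y_min, y_max):
    """Return two points on the line w[0]*x + w[1]*y + b = 0."""
    w0, w1 = w
    if abs(w1) > 1e-12:
        # Solve for y at the left and right edges of the x-range.
        return [(x, -(w0 * x + b) / w1) for x in (x_min, x_max)]
    if abs(w0) > 1e-12:
        # Vertical line: x is fixed, y spans the y-range.
        x = -b / w0
        return [(x, y_min), (x, y_max)]
    raise ValueError("w must not be the zero vector")


# Hypothetical (w, b) per class label; not fitted values.
hyperplanes = {
    0: ((-1.0, -1.0), 3.0),
    100: ((0.0, 2.0), -3.1),
    250: ((2.0, 0.0), -3.0),
}

segments = {
    label: boundary_endpoints(w, b, 0.0, 3.0, 0.0, 3.0)
    for label, (w, b) in hyperplanes.items()
}
for label, points in segments.items():
    print(label, points)
```

Each pair of points could then be passed to `ax.plot` as x and y coordinate lists to overlay the lines on the contour plot.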
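I am trying to verify that I have correctly understood how SVM-OVA (one-vs-all) works by comparing the function OneVsRestClassifier with my own implementation.
In the code below, I implemented num_classes classifiers in the training phase, then tested all of them on the test set and selected the one that returned the highest probability value.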
import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score,classification_report
from sklearn.preprocessing import scale
# Read dataset
df = pd.read_csv('In/winequality-white.csv', delimiter=';')
X = df.loc[:, df.columns != 'quality']
Y = df.loc[:, df.columns == 'quality']
my_classes = np.unique(Y)
num_classes = len(my_classes)
# Train-test split
np.random.seed(42)
msk = np.random.rand(len(df)) <= 0.8
train = df[msk]
test = df[~msk]
# From dataset to features and labels
X_train = train.loc[:, train.columns …
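The comparison described above can be sketched compactly as follows. This is a minimal version under assumptions, not the asker's code: the iris dataset replaces the wine-quality CSV, a linear kernel is used, and classes are ranked by `decision_function` scores instead of probabilities.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Manual one-vs-all: one binary SVM per class, scored on the test set.
classes = np.unique(y_train)
scores = np.column_stack([
    SVC(kernel="linear")
    .fit(X_train, (y_train == cls).astype(int))
    .decision_function(X_test)
    for cls in classes
])
# Pick, per sample, the class whose binary SVM gives the highest score.
manual_pred = classes[np.argmax(scores, axis=1)]

# Reference: scikit-learn's wrapper with the same base estimator.
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X_train, y_train)
ovr_pred = ovr.predict(X_test)

agreement = np.mean(manual_pred == ovr_pred)
print("agreement with OneVsRestClassifier:", agreement)
```

If the manual version ranks classes by `predict_proba` instead, small disagreements with the wrapper are possible, because calibrated probabilities and raw decision scores need not order the classes identically.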
python ×6
pytorch ×2
scikit-learn ×2
spacy ×2
svm ×2
boosting ×1
python-3.x ×1
statistics ×1
xgboost ×1