Posts by Mat*_*ath

Building an ML classifier with imbalanced data

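I have a dataset with 1400 observations and 19 columns. The target variable takes the values 1 (the value I am most interested in) and 0. The class distribution is imbalanced (70:30).

With the code below I get strange values (all 1s). I don't know whether this is caused by overfitting, by the class imbalance, or by feature selection (I used Pearson correlation, since all values are numeric/boolean). I think the steps that follow are wrong.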

import math

import numpy as np
from sklearn.metrics import classification_report, f1_score
from sklearn.tree import DecisionTreeClassifier

y = df['Label']
X = df.drop('Label',axis=1)

def create_cv(X,y):
    if type(X)!=np.ndarray:
        X=X.values
        y=y.values
 
    test_size=1/5
    proportion_of_true=y[y==1].shape[0]/y.shape[0]
    num_test_samples=math.ceil(y.shape[0]*test_size)
    num_test_true_labels=math.floor(num_test_samples*proportion_of_true)
    num_test_false_labels=math.floor(num_test_samples-num_test_true_labels)
    
    y_test=np.concatenate([y[y==0][:num_test_false_labels],y[y==1][:num_test_true_labels]])
    y_train=np.concatenate([y[y==0][num_test_false_labels:],y[y==1][num_test_true_labels:]])

    X_test=np.concatenate([X[y==0][:num_test_false_labels] ,X[y==1][:num_test_true_labels]],axis=0)
    X_train=np.concatenate([X[y==0][num_test_false_labels:],X[y==1][num_test_true_labels:]],axis=0)
    return X_train,X_test,y_train,y_test

X_train,X_test,y_train,y_test=create_cv(X,y)
X_train,X_crossv,y_train,y_crossv=create_cv(X_train,y_train)
    
tree = DecisionTreeClassifier(max_depth = 5)
tree.fit(X_train, y_train)       

y_predict_test = tree.predict(X_test)

print(classification_report(y_test, y_predict_test))
f1_score(y_test, y_predict_test)

Output:

     precision    recall  f1-score   support

           0       1.00      1.00      1.00        24
           1       1.00      1.00 …
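
For comparison, here is a minimal sketch of a shuffled, stratified split plus stratified cross-validation using scikit-learn's own helpers instead of the hand-written `create_cv`. The data is synthetic (random features and random labels at roughly 70:30) and only stands in for the real `df`; the names `rng`, `cv` and `scores` are illustrative assumptions, not part of the original post.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the dataset in the question: 1400 rows, 18 features,
# labels drawn at random with roughly 70% ones (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(1400, 18))
y = (rng.random(1400) < 0.7).astype(int)

# Shuffled, stratified hold-out split: both parts keep roughly the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Stratified 5-fold cross-validation on the training part, scored with F1.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    DecisionTreeClassifier(max_depth=5, random_state=0),
    X_train, y_train, cv=cv, scoring="f1",
)
print(scores.mean())
```

Because the labels here are independent of the features, the scores stay below 1. If the same procedure still gives perfect scores on the real data, one possible explanation worth checking is that a feature encodes the label directly.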

python machine-learning resampling scikit-learn cross-validation

Mat*_*ath

2021-10-02

5
Votes

1
Answers

1223
Views

SMOTE - could not convert string to float

I think I am missing something in the code below.

import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE


# Split into training and test sets

# Testing Count Vectorizer

X = df[['Spam']]
y = df['Value']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)
X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train)


sm =  pd.concat([X_resampled, y_resampled], axis=1)

I get this error:

ValueError: could not convert string to float: ---> 19 X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train)

A sample of the data:

Spam                                             Value
Your microsoft account was compromised             1
Manchester United lost against PSG                 0
I like cooking                                     0

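I would consider transforming the training and test sets to fix the problem that causes the error, but I don't know how to apply the transformation to both. I tried a few examples found on Google, but they did not solve the problem.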
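
One way to apply the same transformation to both sets, sketched under assumptions: SMOTE needs numeric features, so the text is first turned into counts with scikit-learn's `CountVectorizer`, fitted on the training texts only and then reused on the test texts. The helper names `vectorize_train_test` and `oversample` are hypothetical, and the three sentences are the sample rows from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer


def vectorize_train_test(train_texts, test_texts):
    """Fit a bag-of-words vocabulary on the training texts only, then transform both sets."""
    vectorizer = CountVectorizer()
    X_train_vec = vectorizer.fit_transform(train_texts)  # learns the vocabulary
    X_test_vec = vectorizer.transform(test_texts)        # reuses it, no refitting
    return X_train_vec, X_test_vec, vectorizer


def oversample(X_train_vec, y_train, k_neighbors=5):
    """Apply SMOTE to the numeric training matrix (requires imbalanced-learn)."""
    # Imported lazily so the vectorizing step above can run without imbalanced-learn.
    from imblearn.over_sampling import SMOTE
    smote = SMOTE(k_neighbors=k_neighbors, random_state=40)
    return smote.fit_resample(X_train_vec, y_train)


train_texts = [
    "Your microsoft account was compromised",
    "Manchester United lost against PSG",
]
test_texts = ["I like cooking"]

X_train_vec, X_test_vec, vectorizer = vectorize_train_test(train_texts, test_texts)
print(X_train_vec.shape, X_test_vec.shape)
```

On a real training set, `oversample(X_train_vec, y_train)` would then be called on the training part only; the test set is transformed but never resampled. SMOTE's `k_neighbors` must be smaller than the number of minority-class samples, which is why it is not called on this three-row sample.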

python sampling resampling pandas smote

Mat*_*ath

2020-12-14

4
Votes

1
Answers

10k
Views

Text similarity using Word2Vec

I would like to use Word2Vec to check the similarity of texts.

I am currently using a different logic:

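from fuzzywuzzy import fuzz

def sim(name, dataset):
    # NOTE: the comparison operator is garbled in the source ("= 0.5"); ">=" is assumed here.
    matches = dataset.apply(lambda row: fuzz.ratio(row['Text'], name) >= 0.5, axis=1)
    return  # the returned expression is missing in the source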

(name is my column).

To apply this function, I do the following:

df['Sim']=df.apply(lambda row: sim(row['Text'], df), axis=1)

Can you tell me how to replace fuzz.ratio with Word2Vec in order to compare the texts in the dataset?

Example dataset:

Text
Hello, this is Peter, what would you need me to help you with today? 
I need you
Good Morning, John here, are you calling regarding your cell phone bill? 
Hi, this this is John. What can I do for you?
...

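The first and the last text are very similar, although they use different words to express a similar concept. I would like to create a new column that holds, for each row, the similar text. I hope you can help me.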
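
A rough sketch of one common approach, assuming word vectors are already available: average the vectors of the words in each text and compare texts by cosine similarity. The `toy_vectors` dictionary below is a hypothetical stand-in for a trained Word2Vec model, and the helper names are illustrative only.

```python
import numpy as np

# Hypothetical toy embeddings standing in for a trained Word2Vec model's vectors.
toy_vectors = {
    "hello": np.array([0.9, 0.1, 0.0]),
    "hi": np.array([0.8, 0.2, 0.0]),
    "bill": np.array([0.0, 0.1, 0.9]),
    "phone": np.array([0.1, 0.0, 0.8]),
}


def sentence_vector(text, vectors):
    """Average the vectors of the known words in `text`; None if no word is known."""
    words = [w.strip(".,?!").lower() for w in text.split()]
    known = [vectors[w] for w in words if w in vectors]
    return np.mean(known, axis=0) if known else None


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def most_similar(index, texts, vectors):
    """Return the position of the text most similar to texts[index], or None."""
    query = sentence_vector(texts[index], vectors)
    if query is None:
        return None
    best, best_score = None, float("-inf")
    for i, text in enumerate(texts):
        if i == index:
            continue  # skip the text itself
        vec = sentence_vector(text, vectors)
        if vec is None:
            continue  # no known words, cannot compare
        score = cosine(query, vec)
        if score > best_score:
            best, best_score = i, score
    return best


texts = ["Hello, this is Peter", "Phone bill", "Hi, this is John"]
print(most_similar(0, texts, toy_vectors))
```

With a real model (for example the `wv` attribute of a gensim `Word2Vec` model, which supports the same `word in vectors` and `vectors[word]` lookups), the dictionary would be replaced by the model's vectors, and the new column could be filled by calling `most_similar` for each row position.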

python similarity pandas word2vec

Mat*_*ath

2021-02-09

3
Votes

1
Answers

1402
Views

Assigning a boolean to column values that start with a number

I have this dataset

    Name              Col1 Col2 Col3 Col4 Col5 Col6 Col7
    tfedup.sm           1   1   1   1   1   1   1
    13wham.cc           1   1   1   1   1   1   1   
    1chancerslane.cc    1   1   1   1   1   1   1   
    24layover.cc        1   1   1   1   1   1   1   
    301-joy.cycle.cc    1   1   1   1   1   1   1

I would like to create a new column that says whether the name starts with a number. I did

# Starts with numbers

df['Nane_num'] = list(
    map(lambda x: x.isdigit(), df['Name']))

But it only gives me False values. What is wrong with my code above?
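
A small illustration of the likely cause, using the names from the sample data as a plain list: `str.isdigit()` is True only when every character of the string is a digit, so it is False for all of these names. Checking just the first character gives the intended result. The variable names are illustrative.

```python
names = ["tfedup.sm", "13wham.cc", "1chancerslane.cc", "24layover.cc", "301-joy.cycle.cc"]

# isdigit() on the whole string: False as soon as any character is not a digit.
whole_string = [name.isdigit() for name in names]

# Slicing the first character (name[:1] is safe even for an empty string).
starts_with_digit = [name[:1].isdigit() for name in names]

print(whole_string)
print(starts_with_digit)
```

In pandas the same check can be written on the column directly, for instance `df['Name'].str[0].str.isdigit()`.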

python pandas

Mat*_*ath

lucky-day

1
Votes

1
Answers

24
Views

Tag statistics

python ×4

pandas ×3

resampling ×2

cross-validation ×1

machine-learning ×1

sampling ×1

scikit-learn ×1

similarity ×1

smote ×1

word2vec ×1
