I have a time-series dataset with two labels (0 and 1). I am using Dynamic Time Warping (DTW) as a similarity measure for classification with k-nearest neighbours (kNN), as described in these two wonderful blog posts:
http://alexminnaar.com/2014/04/16/Time-Series-Classification-and-Clustering-with-Python.html
Arguments
---------
n_neighbors : int, optional (default = 5)
    Number of neighbors to use by default for KNN
max_warping_window : int, optional (default = infinity)
    Maximum warping window allowed by the DTW dynamic
    programming function
subsample_step : int, optional (default …

I am using RandomForestClassifier() with 10-fold cross validation as follows.
clf=RandomForestClassifier(random_state = 42, class_weight="balanced")
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
accuracy = cross_val_score(clf, X, y, cv=k_fold, scoring = 'accuracy')
print(accuracy.mean())
I want to identify the important features in the feature space. Obtaining feature importances for a single fitted classifier seems straightforward, as shown below.
print("Features sorted by their score:")
feature_importances = pd.DataFrame(clf.feature_importances_,
                                   index=X_train.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)
print(feature_importances)
However, I cannot find a way to perform feature importance together with cross validation in sklearn.
In summary, I want to identify the most effective features across the 10 folds of cross validation (e.g., by using an average importance score).
I am happy to provide more details if needed.
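A minimal sketch of one way to do this (my own assumption, not an sklearn built-in; it reuses the `X`, `y` and `clf` names from the snippets above and assumes `X` is a pandas DataFrame and `y` a Series): fit the classifier on each stratified fold and average `feature_importances_` over the 10 fits.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

clf = RandomForestClassifier(random_state=42, class_weight="balanced")
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# collect the importance vector produced by each of the 10 fold-specific fits
fold_importances = []
for train_idx, _ in k_fold.split(X, y):
    clf.fit(X.iloc[train_idx], y.iloc[train_idx])
    fold_importances.append(clf.feature_importances_)

# average the per-fold importances and sort, mirroring the single-fit snippet above
mean_importances = pd.DataFrame(
    {"importance": np.mean(fold_importances, axis=0)}, index=X.columns
).sort_values("importance", ascending=False)
print(mean_importances)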
python classification machine-learning scikit-learn cross-validation
I recorded signals from machines (m1, m2, and so on) for 28 days. (Note: each day's signal has a length of 360.)
machine_num, day1, day2, ..., day28
m1, [12, 10, 5, 6, ...], [78, 85, 32, 12, ...], ..., [12, 12, 12, 12, ...]
m2, [2, 0, 5, 6, ...], [8, 5, 32, 12, ...], ..., [1, 1, 12, 12, ...]
...
m2000, [1, 1, 5, 6, ...], [79, 86, 3, 1, ...], ..., [1, 1, 12, 12, ...]
I want to predict the signal sequence of each machine for the next 3 days, i.e., on day29, day30, day31. However, I do not have values for days 29, 30 and 31. So, my plan is to use the following …
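As one hedged illustration (my own assumption, not the elided plan above), the recordings could be stacked into a 3D array so that past days serve as input and the last days as targets.

import numpy as np

# Assumed layout only: 2000 machines, 28 days, 360 samples per day (figures from the question).
n_machines, n_days, samples_per_day = 2000, 28, 360
signals = np.zeros((n_machines, n_days, samples_per_day))
# signals[i, d] would hold the 360-sample recording of machine i on day d+1.

# Example framing on historical data: the first 25 days as input, the last 3 days as target,
# mirroring the "predict day29-day31 from day1-day28" goal.
X_seq = signals[:, :25, :].reshape(n_machines, -1)   # shape (2000, 25 * 360)
y_seq = signals[:, 25:, :].reshape(n_machines, -1)   # shape (2000, 3 * 360)
print(X_seq.shape, y_seq.shape)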
I would like to know whether there is a way to look up Wikidata entities by a specified property using their API. For example, many entities have the Freebase ID property (property: P646). It is a unique identifier, and I would like to fetch an entity via this identifier.
Does anyone know how to achieve this?
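One hedged way to do such a reverse lookup (not necessarily the only or recommended one) is the Wikidata Query Service SPARQL endpoint, where the Freebase ID property appears as wdt:P646; the ID below is only an illustrative placeholder.

import requests

# Look up the Wikidata entity carrying a given Freebase ID (property P646)
# via the public SPARQL endpoint. "/m/02mjmr" is just a placeholder value.
query = """
SELECT ?item WHERE {
  ?item wdt:P646 "/m/02mjmr" .
}
"""
response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
)
for row in response.json()["results"]["bindings"]:
    print(row["item"]["value"])  # URI of the matching entity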
I am using recursive feature elimination with cross validation (rfecv) as a feature selector for a randomforest classifier, as follows.
X = df[[my_features]] #all my features
y = df['gold_standard'] #labels
clf = RandomForestClassifier(random_state = 42, class_weight="balanced")
rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(10), scoring='roc_auc')
rfecv.fit(X,y)
print("Optimal number of features : %d" % rfecv.n_features_)
features=list(X.columns[rfecv.support_])
I also perform GridSearchCV as follows, to tune the hyperparameters of RandomForestClassifier.
X = df[[my_features]] #all my features
y = df['gold_standard'] #labels
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)
rfc = RandomForestClassifier(random_state=42, class_weight = 'balanced')
param_grid = {
    'n_estimators': [200, 500],
    'max_features': …

python machine-learning scikit-learn grid-search data-science
I am following the gensim tutorial below to convert a word2vec model to tensors. Link to the tutorial: https://radimrehurek.com/gensim/scripts/word2vec2tensor.html
More specifically, I ran the following command:
python -m gensim.scripts.word2vec2tensor -i C:\Users\Emi\Desktop\word2vec\model_name -o C:\Users\Emi\Desktop\word2vec
However, for the above command I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
When I use model.wv.save_word2vec_format(model_name) to save the model (as described in this link: https://github.com/RaRe-Technologies/gensim/issues/1847) and then run the above command, I get the following error:
ValueError: invalid vector on line 1 (is this really the text format?)
Just wondering whether I have made any mistake in the syntax of the command. Please let me know how to resolve this issue.
I am happy to provide more details if needed.
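As far as I know, the word2vec2tensor script reads the plain-text word2vec format, so one hedged thing to try (assuming the original file on disk was produced with model.save(), which the script cannot parse directly) is to re-save the vectors explicitly as text and point -i at that file:

from gensim.models import Word2Vec

# Assumes the file was written by model.save(); paths reuse the ones from the question.
model = Word2Vec.load(r"C:\Users\Emi\Desktop\word2vec\model_name")

# Write the vectors in the plain-text word2vec format that word2vec2tensor expects.
model.wv.save_word2vec_format(
    r"C:\Users\Emi\Desktop\word2vec\model_name.txt", binary=False
)
# Then run:
#   python -m gensim.scripts.word2vec2tensor -i C:\Users\Emi\Desktop\word2vec\model_name.txt -o C:\Users\Emi\Desktop\word2vec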
I have a highly imbalanced dataset on which I want to perform binary classification.
While reading some posts, I found that sklearn provides class_weight="balanced" for imbalanced datasets. So, my classifier code is as follows.
clf=RandomForestClassifier(random_state = 42, class_weight="balanced")
Then I performed 10-fold cross validation with the above classifier, as follows.
k_fold = KFold(n_splits=10, shuffle=True, random_state=42)
new_scores = cross_val_score(clf, X, y, cv=k_fold, n_jobs=1)
print(new_scores.mean())
However, I am not sure whether class_weight="balanced" is actually reflected through the 10-fold cross validation. Am I doing this wrong? If so, is there a better way to do this in sklearn?
I am happy to provide more details if needed.
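For what it's worth, a minimal sketch of how this can be read (reusing the `X` and `y` names from above): cross_val_score fits a fresh clone of clf on every fold, so class_weight="balanced" is applied inside each fold; for an imbalanced problem, stratified folds and an imbalance-aware metric such as F1 or ROC AUC are often preferred over plain accuracy.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

clf = RandomForestClassifier(random_state=42, class_weight="balanced")

# StratifiedKFold keeps the class ratio of the imbalanced dataset in every fold.
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# class_weight="balanced" is re-applied on each fold because cross_val_score
# clones and refits clf per split; F1 is used here instead of accuracy.
scores = cross_val_score(clf, X, y, cv=k_fold, scoring="f1")
print(scores.mean())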
I am performing hyperparameter tuning of RandomForest using GridSearchCV, as shown below.
X = np.array(df[features]) #all features
y = np.array(df['gold_standard']) #labels
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
rfc = RandomForestClassifier(random_state=42, class_weight='balanced')  # classifier to tune
param_grid = {
    'n_estimators': [200, 500],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [4, 5, 6, 7, 8],
    'criterion': ['gini', 'entropy']
}
CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv= 5)
CV_rfc.fit(x_train, y_train)
print(CV_rfc.best_params_)
The result I get is as follows:
{'criterion': 'gini', 'max_depth': 6, 'max_features': 'auto', 'n_estimators': 200}
After that, I applied the tuned parameters to x_test as follows.
rfc=RandomForestClassifier(random_state=42, criterion ='gini', max_depth= 6, max_features = 'auto', n_estimators = 200, class_weight = 'balanced')
rfc.fit(x_train, y_train)
pred=rfc.predict(x_test)
print(precision_recall_fscore_support(y_test,pred))
print(roc_auc_score(y_test,pred)) …

I used the code below to obtain the optimized parameters of randomforest using gridsearchcv.
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)
rfc = RandomForestClassifier(random_state=42, class_weight = 'balanced')
param_grid = {
    'n_estimators': [200, 500],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [4, 5, 6, 7, 8],
    'criterion': ['gini', 'entropy']
}
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv= 10, scoring = 'roc_auc')
CV_rfc.fit(x_train, y_train)
print(CV_rfc.best_params_)
print(CV_rfc.best_score_)
Now, I want to apply the tuned parameters to x_test. To do this, I did the following:
pred = CV_rfc.decision_function(x_test)
print(roc_auc_score(y_test, pred))
However, decision_function does not seem to be supported by randomforest, because I get the following error:
AttributeError: 'RandomForestClassifier' object has no attribute 'decision_function'
Is there any other way to do this?
I am happy to provide more details if needed.
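A minimal sketch of one alternative (assuming the positive class is labelled 1): RandomForestClassifier exposes predict_proba, and the probability of the positive class can be passed to roc_auc_score in place of a decision-function score.

# Probability of the positive class from the refitted best estimator inside GridSearchCV.
pred_proba = CV_rfc.predict_proba(x_test)[:, 1]
print(roc_auc_score(y_test, pred_proba))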
python machine-learning random-forest scikit-learn grid-search
I have the following graph created using networkx.
import networkx as nx
G = nx.Graph()
G.add_nodes_from(["John", "Mary", "Jill", "Todd",
                  "iPhone5", "Kindle Fire", "Fitbit Flex Wireless", "Harry Potter", "Hobbit"])
G.add_edges_from([
    ("John", "iPhone5"),
    ("John", "Kindle Fire"),
    ("Mary", "iPhone5"),
    ("Mary", "Kindle Fire"),
    ("Mary", "Fitbit Flex Wireless"),
    ("Jill", "iPhone5"),
    ("Jill", "Kindle Fire"),
    ("Jill", "Fitbit Flex Wireless"),
    ("Todd", "Fitbit Flex Wireless"),
    ("Todd", "Harry Potter"),
    ("Todd", "Hobbit"),
])
Now, I want to run random walk with restarts to identify the nodes most closely related to John. I searched the networkx documentation, but could not find an implementation of it there.
Please let me know if there is a python library/code that can perform random walk with restarts.
I am happy to provide more details if needed.
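One hedged option (not a dedicated RWR routine, but equivalent as far as I know, and reusing the graph G built above): random walk with restarts to a single node is personalized PageRank with all restart mass on that node, which networkx supports through the personalization argument of nx.pagerank.

import networkx as nx

# Restart the walk at "John": put all restart probability on that node.
personalization = {node: 0.0 for node in G.nodes()}
personalization["John"] = 1.0

# alpha is the continuation probability, so the walker restarts at "John"
# with probability 1 - alpha at every step.
scores = nx.pagerank(G, alpha=0.85, personalization=personalization)

# Nodes with the highest stationary probability are the ones most related to "John".
for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(node, round(score, 4))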