I have a time-series dataset with two labels (0 and 1). I am using Dynamic Time Warping (DTW) as a similarity measure for classification with k-nearest neighbours (kNN), as described in these two wonderful blog posts:
http://alexminnaar.com/2014/04/16/Time-Series-Classification-and-Clustering-with-Python.html
Arguments
---------
n_neighbors : int, optional (default = 5)
    Number of neighbors to use by default for KNN
max_warping_window : int, optional (default = infinity)
    Maximum warping window allowed by the DTW dynamic
    programming function
subsample_step : int, optional (default …

I am using RandomForestClassifier() with 10-fold cross validation as follows.
clf=RandomForestClassifier(random_state = 42, class_weight="balanced")
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
accuracy = cross_val_score(clf, X, y, cv=k_fold, scoring = 'accuracy')
print(accuracy.mean())
I want to identify the important features in the feature space. Obtaining feature importances for a single fitted classifier seems straightforward, as shown below.
print("Features sorted by their score:")
feature_importances = pd.DataFrame(clf.feature_importances_,
                                   index=X_train.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)
print(feature_importances)
However, I cannot find a way to perform feature importance together with cross validation in sklearn.
In summary, I want to identify the most effective features across the 10 folds of cross validation (e.g., by using an average importance score).
I am happy to provide more details if needed.
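A minimal sketch of one way to do this (my own assumption, not an sklearn built-in; it reuses the `X`, `y` and `clf` names from the snippets above and assumes `X` is a pandas DataFrame and `y` a Series): fit the classifier on each stratified fold and average `feature_importances_` over the 10 fits.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

clf = RandomForestClassifier(random_state=42, class_weight="balanced")
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# collect the importance vector produced by each of the 10 fold-specific fits
fold_importances = []
for train_idx, _ in k_fold.split(X, y):
    clf.fit(X.iloc[train_idx], y.iloc[train_idx])
    fold_importances.append(clf.feature_importances_)

# average the per-fold importances and sort, mirroring the single-fit snippet above
mean_importances = pd.DataFrame(
    {"importance": np.mean(fold_importances, axis=0)}, index=X.columns
).sort_values("importance", ascending=False)
print(mean_importances)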
python classification machine-learning scikit-learn cross-validation
I recorded signals from machines (m1, m2, and so on) for 28 days. (Note: each day's signal has a length of 360.)
machine_num, day1, day2, ..., day28
m1, [12, 10, 5, 6, ...], [78, 85, 32, 12, ...], ..., [12, 12, 12, 12, ...]
m2, [2, 0, 5, 6, ...], [8, 5, 32, 12, ...], ..., [1, 1, 12, 12, ...]
...
m2000, [1, 1, 5, 6, ...], [79, 86, 3, 1, ...], ..., [1, 1, 12, 12, ...]
I want to predict the signal sequence of each machine for the next 3 days, i.e., on day29, day30, day31. However, I do not have values for days 29, 30 and 31. So, my plan is to use the following …
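As one hedged illustration (my own assumption, not the elided plan above), the recordings could be stacked into a 3D array so that past days serve as input and the last days as targets.

import numpy as np

# Assumed layout only: 2000 machines, 28 days, 360 samples per day (figures from the question).
n_machines, n_days, samples_per_day = 2000, 28, 360
signals = np.zeros((n_machines, n_days, samples_per_day))
# signals[i, d] would hold the 360-sample recording of machine i on day d+1.

# Example framing on historical data: the first 25 days as input, the last 3 days as target,
# mirroring the "predict day29-day31 from day1-day28" goal.
X_seq = signals[:, :25, :].reshape(n_machines, -1)   # shape (2000, 25 * 360)
y_seq = signals[:, 25:, :].reshape(n_machines, -1)   # shape (2000, 3 * 360)
print(X_seq.shape, y_seq.shape)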
I would like to know whether there is a way to look up Wikidata entities by a specified property using their API. For example, many entities have the Freebase ID property (property: P646). It is a unique identifier, and I would like to fetch an entity via this identifier.
Does anyone know how to achieve this?
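One hedged way to do such a reverse lookup (not necessarily the only or recommended one) is the Wikidata Query Service SPARQL endpoint, where the Freebase ID property appears as wdt:P646; the ID below is only an illustrative placeholder.

import requests

# Look up the Wikidata entity carrying a given Freebase ID (property P646)
# via the public SPARQL endpoint. "/m/02mjmr" is just a placeholder value.
query = """
SELECT ?item WHERE {
  ?item wdt:P646 "/m/02mjmr" .
}
"""
response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
)
for row in response.json()["results"]["bindings"]:
    print(row["item"]["value"])  # URI of the matching entity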
I am using recursive feature elimination with cross validation (rfecv) as a feature selector for a randomforest classifier, as follows.
X = df[[my_features]] #all my features
y = df['gold_standard'] #labels
clf = RandomForestClassifier(random_state = 42, class_weight="balanced")
rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(10), scoring='roc_auc')
rfecv.fit(X,y)
print("Optimal number of features : %d" % rfecv.n_features_)
features=list(X.columns[rfecv.support_])
I also perform GridSearchCV as follows, to tune the hyperparameters of RandomForestClassifier.
X = df[[my_features]] #all my features
y = df['gold_standard'] #labels
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)
rfc = RandomForestClassifier(random_state=42, class_weight = 'balanced')
param_grid = {
    'n_estimators': [200, 500],
    'max_features': …

python machine-learning scikit-learn grid-search data-science
I am following the gensim tutorial below to convert a word2vec model to tensors. Link to the tutorial: https://radimrehurek.com/gensim/scripts/word2vec2tensor.html
More specifically, I ran the following command:
python -m gensim.scripts.word2vec2tensor -i C:\Users\Emi\Desktop\word2vec\model_name -o C:\Users\Emi\Desktop\word2vec
However, for the above command I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
When I use model.wv.save_word2vec_format(model_name) to save the model (as described in this link: https://github.com/RaRe-Technologies/gensim/issues/1847) and then run the above command, I get the following error:
ValueError: invalid vector on line 1 (is this really the text format?)
Just wondering whether I have made any mistake in the syntax of the command. Please let me know how to resolve this issue.
I am happy to provide more details if needed.
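As far as I know, the word2vec2tensor script reads the plain-text word2vec format, so one hedged thing to try (assuming the original file on disk was produced with model.save(), which the script cannot parse directly) is to re-save the vectors explicitly as text and point -i at that file:

from gensim.models import Word2Vec

# Assumes the file was written by model.save(); paths reuse the ones from the question.
model = Word2Vec.load(r"C:\Users\Emi\Desktop\word2vec\model_name")

# Write the vectors in the plain-text word2vec format that word2vec2tensor expects.
model.wv.save_word2vec_format(
    r"C:\Users\Emi\Desktop\word2vec\model_name.txt", binary=False
)
# Then run:
#   python -m gensim.scripts.word2vec2tensor -i C:\Users\Emi\Desktop\word2vec\model_name.txt -o C:\Users\Emi\Desktop\word2vec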
I have a highly imbalanced dataset on which I want to perform binary classification.
While reading some posts, I found that sklearn provides class_weight="balanced" for imbalanced datasets. So, my classifier code is as follows.
clf=RandomForestClassifier(random_state = 42, class_weight="balanced")
Then I performed 10-fold cross validation with the above classifier, as follows.
k_fold = KFold(n_splits=10, shuffle=True, random_state=42)
new_scores = cross_val_score(clf, X, y, cv=k_fold, n_jobs=1)
print(new_scores.mean())
However, I am not sure whether class_weight="balanced" is actually reflected through the 10-fold cross validation. Am I doing this wrong? If so, is there a better way to do this in sklearn?
I am happy to provide more details if needed.
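For what it's worth, a minimal sketch of how this can be read (reusing the `X` and `y` names from above): cross_val_score fits a fresh clone of clf on every fold, so class_weight="balanced" is applied inside each fold; for an imbalanced problem, stratified folds and an imbalance-aware metric such as F1 or ROC AUC are often preferred over plain accuracy.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

clf = RandomForestClassifier(random_state=42, class_weight="balanced")

# StratifiedKFold keeps the class ratio of the imbalanced dataset in every fold.
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# class_weight="balanced" is re-applied on each fold because cross_val_score
# clones and refits clf per split; F1 is used here instead of accuracy.
scores = cross_val_score(clf, X, y, cv=k_fold, scoring="f1")
print(scores.mean())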
I am performing hyperparameter tuning of RandomForest using GridSearchCV, as shown below.
X = np.array(df[features]) #all features
y = np.array(df['gold_standard']) #labels
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
rfc = RandomForestClassifier(random_state=42, class_weight='balanced')  # classifier to tune
param_grid = {
    'n_estimators': [200, 500],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [4, 5, 6, 7, 8],
    'criterion': ['gini', 'entropy']
}
CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv= 5)
CV_rfc.fit(x_train, y_train)
print(CV_rfc.best_params_)
The result I get is as follows:
{'criterion': 'gini', 'max_depth': 6, 'max_features': 'auto', 'n_estimators': 200}
After that, I applied the tuned parameters to x_test as follows.
rfc=RandomForestClassifier(random_state=42, criterion ='gini', max_depth= 6, max_features = 'auto', n_estimators = 200, class_weight = 'balanced')
rfc.fit(x_train, y_train)
pred=rfc.predict(x_test)
print(precision_recall_fscore_support(y_test,pred))
print(roc_auc_score(y_test,pred)) …

I used the code below to obtain the optimized parameters of randomforest using gridsearchcv.
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)
rfc = RandomForestClassifier(random_state=42, class_weight = 'balanced')
param_grid = {
    'n_estimators': [200, 500],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [4, 5, 6, 7, 8],
    'criterion': ['gini', 'entropy']
}
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv= 10, scoring = 'roc_auc')
CV_rfc.fit(x_train, y_train)
print(CV_rfc.best_params_)
print(CV_rfc.best_score_)
Now, I want to apply the tuned parameters to x_test. To do this, I did the following:
pred = CV_rfc.decision_function(x_test)
print(roc_auc_score(y_test, pred))
However, decision_function does not seem to be supported by randomforest, because I get the following error:
AttributeError: 'RandomForestClassifier' object has no attribute 'decision_function'
Is there any other way to do this?
I am happy to provide more details if needed.
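A minimal sketch of one alternative (assuming the positive class is labelled 1): RandomForestClassifier exposes predict_proba, and the probability of the positive class can be passed to roc_auc_score in place of a decision-function score.

# Probability of the positive class from the refitted best estimator inside GridSearchCV.
pred_proba = CV_rfc.predict_proba(x_test)[:, 1]
print(roc_auc_score(y_test, pred_proba))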
python machine-learning random-forest scikit-learn grid-search
I have the following graph created using networkx.
import networkx as nx
G = nx.Graph()
G.add_nodes_from(["John", "Mary", "Jill", "Todd",
                  "iPhone5", "Kindle Fire", "Fitbit Flex Wireless", "Harry Potter", "Hobbit"])
G.add_edges_from([
    ("John", "iPhone5"),
    ("John", "Kindle Fire"),
    ("Mary", "iPhone5"),
    ("Mary", "Kindle Fire"),
    ("Mary", "Fitbit Flex Wireless"),
    ("Jill", "iPhone5"),
    ("Jill", "Kindle Fire"),
    ("Jill", "Fitbit Flex Wireless"),
    ("Todd", "Fitbit Flex Wireless"),
    ("Todd", "Harry Potter"),
    ("Todd", "Hobbit"),
])
Now, I want to run random walk with restarts to identify the nodes most closely related to John. I searched the networkx documentation, but could not find an implementation of it there.
Please let me know if there is a python library/code that can perform random walk with restarts.
I am happy to provide more details if needed.
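One hedged option (not a dedicated RWR routine, but equivalent as far as I know, and reusing the graph G built above): random walk with restarts to a single node is personalized PageRank with all restart mass on that node, which networkx supports through the personalization argument of nx.pagerank.

import networkx as nx

# Restart the walk at "John": put all restart probability on that node.
personalization = {node: 0.0 for node in G.nodes()}
personalization["John"] = 1.0

# alpha is the continuation probability, so the walker restarts at "John"
# with probability 1 - alpha at every step.
scores = nx.pagerank(G, alpha=0.85, personalization=personalization)

# Nodes with the highest stationary probability are the ones most related to "John".
for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(node, round(score, 4))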