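I am using the GA package, and my goal is to find the best initial centroid positions for the k-means clustering algorithm. My data is a sparse matrix of TF-IDF scores, which can be downloaded here. Below are the stages I have implemented so far:
0. Libraries and dataset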
library(clusterSim) ## for index.DB()
library(GA) ## for ga()
corpus <- read.csv("Corpus_EnglishMalay_tfidf.csv") ## a dataset of 5000 x 1168
1. Binary encoding and generation of the initial population.
k_min <- 15
initial_population <- function(object) {
## generate a population to turn-on 15 cluster bits
init <- t(replicate(object@popSize, sample(rep(c(1, 0), c(k_min, object@nBits - k_min))), TRUE))
return(init)
}
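For readers who do not use R, the idea behind this generator can be sketched in Python. This is only an illustration, not the GA package's API: the names `pop_size`, `n_bits`, and `seed` are assumptions standing in for `object@popSize` and `object@nBits`, and each chromosome is a random arrangement of exactly 15 ones among zeros.

```python
import random

K_MIN = 15  # number of cluster bits switched on, as in the question


def initial_population(pop_size, n_bits, k_on=K_MIN, seed=None):
    """Return pop_size binary chromosomes, each with exactly k_on ones."""
    if k_on > n_bits:
        raise ValueError("k_on cannot exceed n_bits")
    rng = random.Random(seed)
    template = [1] * k_on + [0] * (n_bits - k_on)
    population = []
    for _ in range(pop_size):
        chromosome = template[:]  # copy so every row is shuffled independently
        rng.shuffle(chromosome)
        population.append(chromosome)
    return population


# Small illustrative sizes (assumed, not taken from the question).
population = initial_population(pop_size=4, n_bits=50, seed=1)
for chromosome in population:
    print(sum(chromosome), chromosome)
```

Every row sums to 15, so each candidate solution selects exactly 15 positions as initial centroids.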
2. Fitness function that minimizes the Davies-Bouldin (DB) index. Here I evaluate the DBI of each solution generated by initial_population.
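As background, the Davies-Bouldin index averages, over all clusters, the worst-case ratio of within-cluster scatter to between-centroid distance, so lower values indicate tighter and better separated clusters. The following is a minimal pure-Python sketch of that textbook definition with Euclidean distance. It is not the implementation behind index.DB() in clusterSim, and the function name `davies_bouldin` and the toy data are assumptions made for illustration.

```python
import math


def davies_bouldin(points, labels):
    """Davies-Bouldin index with Euclidean distance (lower is better)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    centroids, scatter = {}, {}
    for label, members in clusters.items():
        dim = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        centroids[label] = centroid
        # Mean distance of the cluster's points to its own centroid.
        scatter[label] = sum(math.dist(p, centroid) for p in members) / len(members)

    total = 0.0
    for i in clusters:
        # Worst (largest) similarity ratio between cluster i and any other cluster.
        # Note: this divides by zero if two centroids coincide.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


# Toy data: two well separated clusters of two points each.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]
dbi = davies_bouldin(points, labels)
print(dbi)
```

In a GA that maximizes fitness, a quantity like this would typically be negated or inverted before being returned.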
DBI2 <- function(x) {
## x is a vector of solution of nBits
## exclude first column of corpus
initial_centroid <- …

I am trying to classify a dataset. I first used XGBoost:
import xgboost as xgb
import pandas as pd
import numpy as np
train = pd.read_csv("train_users_processed_onehot.csv")
labels = train["Buy"].map({"Y":1, "N":0})
features = train.drop("Buy", axis=1)
data_dmat = xgb.DMatrix(data=features, label=labels)
params={"max_depth":5, "min_child_weight":2, "eta": 0.1, "subsamples":0.9, "colsample_bytree":0.8, "objective" : "binary:logistic", "eval_metric": "logloss"}
rounds = 180
result = xgb.cv(params=params, dtrain=data_dmat, num_boost_round=rounds, early_stopping_rounds=50, as_pandas=True, seed=23333)
print(result)
The result is:
     test-logloss-mean  test-logloss-std  train-logloss-mean
0             0.683539          0.000141            0.683407
179           0.622302          0.001504            0.606452
We can see that the test log loss ends up at around 0.622.
But when I switched to sklearn with exactly the same parameters (I think), the result was quite different. Here is my code:
from sklearn.model_selection import cross_val_score
from xgboost.sklearn import XGBClassifier
import pandas as pd …
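When comparing the two APIs, it can help to write down how the native parameter dictionary maps onto the keyword arguments of XGBClassifier. The sketch below is a hypothetical helper (`to_sklearn_kwargs` is not part of xgboost) and runs without either library installed. It relies on three points that are worth double-checking against the installed versions: the wrapper calls `eta` `learning_rate`, the number of boosting rounds becomes `n_estimators`, and the native row-sampling parameter is spelled `subsample`, whereas the dictionary in the question spells it `subsamples`.

```python
# Hypothetical helper: translate native xgboost params into keyword
# arguments for xgboost.sklearn.XGBClassifier. Not part of xgboost itself.
NATIVE_TO_SKLEARN = {"eta": "learning_rate"}


def to_sklearn_kwargs(native_params, num_boost_round):
    kwargs = {}
    for name, value in native_params.items():
        if name == "eval_metric":
            # The metric is chosen through `scoring` in cross_val_score instead.
            continue
        kwargs[NATIVE_TO_SKLEARN.get(name, name)] = value
    kwargs["n_estimators"] = num_boost_round
    return kwargs


params = {
    "max_depth": 5,
    "min_child_weight": 2,
    "eta": 0.1,
    "subsample": 0.9,  # assumed intended spelling
    "colsample_bytree": 0.8,
    "objective": "binary:logistic",
    "eval_metric": "logloss",
}
sk_kwargs = to_sklearn_kwargs(params, num_boost_round=180)
print(sk_kwargs)

# With xgboost and scikit-learn installed, these could then be used as:
#   clf = XGBClassifier(**sk_kwargs)
#   scores = cross_val_score(clf, features, labels, cv=3, scoring="neg_log_loss")
```

Two further differences may matter for the comparison. cross_val_score reports the estimator's default score (accuracy for classifiers) unless `scoring` is set, so `scoring="neg_log_loss"` is needed to get a log loss, and it comes back negated. The fold setup can also differ: xgb.cv defaults to 3 unstratified folds, while cross_val_score stratifies for classifiers and its default fold count depends on the scikit-learn version.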
I have an address in which 81000 is the postal code (always a 5-digit number).
address <- "F47, First Floor, PTD 106273, Persiaran Indahpura Utama, Bandar Indahpura, 81000 Kulaijaya, Johor"
I am trying to extract the postal code using a regex, and I tried the following:
## postal code pattern
postal_pattern <- '\\d{5}'
## extract postal code
postal_code <- stringr::str_extract_all(address, postal_pattern)
However, I got the following output, which is only partially correct:
> postal_code
[[1]]
[1] "10627" "81000"
How can I extract only 81000, using a regex or any library?
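The extra match "10627" appears because \d{5} is free to match the first five digits of the six-digit number 106273. One possible direction is to require that the five digits are not part of a longer run of digits. The sketch below shows this in Python rather than R, as an assumption made for illustration. stringr's ICU-based engine is expected to accept the same lookaround syntax, though that is not verified here.

```python
import re

address = (
    "F47, First Floor, PTD 106273, Persiaran Indahpura Utama, "
    "Bandar Indahpura, 81000 Kulaijaya, Johor"
)

# (?<!\d) rejects a match preceded by a digit and (?!\d) rejects one
# followed by a digit, so only standalone five-digit runs are returned.
postal_pattern = re.compile(r"(?<!\d)\d{5}(?!\d)")
postal_codes = postal_pattern.findall(address)
print(postal_codes)  # ['81000']
```

A word-boundary version, \b\d{5}\b, behaves the same way on this address. The lookaround form is slightly stricter, because it still works when the code is directly attached to letters.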