I am using an Elasticsearch range query from Python to fetch records from one date to another, but I only get 10 records back.
Below is the query:
{"query": {"range": {"date": {"gte":"2022-01-01 01:00:00", "lte":"2022-10-10 01:00:00"}}}}
Sample output:
{
  "took": 12,
  "timed_out": false,
  "_shards": {
    "total": 8,
    "successful": 8,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": 1.0,
    "hits": [{"_source": {}}]
  }
}
The "hits" list contains only 10 records, yet when I check my database there are far more than 10.
Can anyone tell me how to modify the query to fetch all records in the above date range?
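By default Elasticsearch returns only the first 10 hits no matter how many documents match (note `hits.total.value: 10000` with `relation: "gte"` above). Raising `size` works up to the `index.max_result_window` cap (10 000 by default); beyond that, the official Python client's scroll helper streams every hit. A minimal sketch, assuming the official `elasticsearch` client and a hypothetical index name:

```python
# The same range query as above, with an explicit "size" (default is only 10).
query = {
    "size": 10000,  # max per request without scrolling (index.max_result_window)
    "query": {
        "range": {
            "date": {
                "gte": "2022-01-01 01:00:00",
                "lte": "2022-10-10 01:00:00",
            }
        }
    },
}

# For result sets larger than 10 000 hits, stream everything with the scan
# helper instead. Sketch (assumes a reachable cluster and index "my-index"):
#
#   from elasticsearch import Elasticsearch, helpers
#   es = Elasticsearch("http://localhost:9200")
#   all_hits = list(helpers.scan(es, index="my-index",
#                                query={"query": query["query"]}))
```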
I have a review dataset with positive/negative class labels, and I am applying Naive Bayes to it. First I convert the reviews into a bag of words. Here sorted_data['Text'] holds the reviews and final_counts is the sparse matrix:
count_vect = CountVectorizer()
final_counts = count_vect.fit_transform(sorted_data['Text'].values)
I split the data into training and test datasets:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation has been removed

X_1, X_test, y_1, y_test = train_test_split(final_counts, labels, test_size=0.3, random_state=0)
I apply the Naive Bayes algorithm as follows:
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score

optimal_alpha = 1
NB_optimal = BernoulliNB(alpha=optimal_alpha)  # "optimal_aplha" typo fixed

# fitting the model (X_1/y_1 come from the split above)
NB_optimal.fit(X_1, y_1)

# predict the response
pred = NB_optimal.predict(X_test)

# evaluate accuracy
acc = accuracy_score(y_test, pred) * 100
print('\nThe accuracy of the NB classifier for alpha = %d is %f%%' % (optimal_alpha, acc))
Here X_test is the test dataset, and pred tells us whether each vector in X_test is predicted as the positive or the negative class.
X_test has shape (54626, 82343), i.e. 54626 rows and 82343 dimensions.
len(pred) is 54626.
My question: I want to get the words with the highest probability in each class, so I can see from the words why a review was predicted positive or negative. How do I get the words with the highest probability?
I have a dataframe with a timestamp column whose dtype is object.
0 2020-07-09T04:23:50.267Z
1 2020-07-09T11:21:55.536Z
2 2020-07-09T11:23:18.015Z
3 2020-07-09T04:03:28.581Z
4 2020-07-09T04:03:33.874Z
Name: timestamp, dtype: object
I don't know the format of the datetimes in the dataframe above. I applied pd.to_datetime to the column, and the dtype changed to datetime64[ns, UTC].
df['timestamp'] = pd.to_datetime(df.timestamp)
Now the dataframe looks like this:
0 2020-07-09 04:23:50.267000+00:00
1 2020-07-09 11:21:55.536000+00:00
2 2020-07-09 11:23:18.015000+00:00
3 2020-07-09 04:03:28.581000+00:00
4 2020-07-09 04:03:33.874000+00:00
Name: timestamp, dtype: datetime64[ns, UTC]
I want to convert the datetime64[ns, UTC] values above to a normal datetime.
For example,
2020-07-09 04:23:50.267000+00:00 to 2020-07-09 04:23:50
Can anyone explain what 2020-07-09T04:23:50.267Z means, and how to convert it to a datetime object?
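For the record: `2020-07-09T04:23:50.267Z` is an ISO 8601 timestamp, where `T` separates date and time and the trailing `Z` ("Zulu") marks UTC, which is why pandas parses it as timezone-aware. A sketch of one way to get the plain datetime asked for, using `tz_localize(None)` to drop the UTC offset and `floor("s")` to drop the fractional seconds:

```python
import pandas as pd

# Two of the sample values from the question
df = pd.DataFrame({"timestamp": ["2020-07-09T04:23:50.267Z",
                                 "2020-07-09T11:21:55.536Z"]})

df["timestamp"] = pd.to_datetime(df["timestamp"])  # -> datetime64[ns, UTC]
df["timestamp"] = (
    df["timestamp"]
    .dt.tz_localize(None)  # strip the +00:00 offset
    .dt.floor("s")         # drop the .267000 fractional seconds
)
print(df["timestamp"].iloc[0])
```

If the sub-second precision should be kept, use only `tz_localize(None)` and skip the `floor`.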
I want to use adjusted R-squared with cross_val_score. I tried the make_scorer function, but it doesn't work.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score  # sklearn.cross_validation has been removed
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, r2_score  # sklearn.metrics.scorer is gone as well

X_tr, X_test, y_tr, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
regression = LinearRegression()  # `normalize=True` was removed from LinearRegression

def adjusted_rsquare(y_true, y_pred):
    adjusted_r_squared = 1 - (1 - r2_score(y_true, y_pred)) * (len(y_pred) - 1) / (len(y_pred) - X_test.shape[1] - 1)
    return adjusted_r_squared

my_scorer = make_scorer(adjusted_rsquare, greater_is_better=True)
score = np.mean(cross_val_score(regression, X_tr, y_tr, scoring=my_scorer, cv=crossvalidation, n_jobs=1))
It throws an error:
IndexError: positional indexers are out-of-bounds
Is there any way to use my custom function, adjusted_rsquare, with cross_val_score?
I have a review dataset with positive/negative class labels, and I am applying logistic regression to it. First I convert the reviews into a bag of words. Here sorted_data['Text'] holds the reviews and final_counts is a sparse matrix:
count_vect = CountVectorizer()
final_counts = count_vect.fit_transform(sorted_data['Text'].values)
standardized_data = StandardScaler(with_mean=False).fit_transform(final_counts)
Splitting the dataset into train and test:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation has been removed

X_1, X_test, y_1, y_test = train_test_split(final_counts, labels, test_size=0.3, random_state=0)
X_tr, X_cv, y_tr, y_cv = train_test_split(X_1, y_1, test_size=0.3)
I apply the logistic regression algorithm as follows:
optimal_lambda = 0.001000
log_reg_optimal = LogisticRegression(C=optimal_lambda)
# fitting the model
log_reg_optimal.fit(X_tr, y_tr)
# predict the response
pred = log_reg_optimal.predict(X_test)
# evaluate accuracy
acc = accuracy_score(y_test, pred) * 100
print('\nThe accuracy of the Logistic Regression for C = %f is %f%%' % (optimal_lambda, …
I have a document collection of 5000 reviews, and I applied tf-idf to it. Here sample_data contains the 5000 reviews, and I apply a tf-idf vectorizer with a unigram range to sample_data. Now I want to get the top 1000 words from the sample data with the highest tf-idf values. Can anyone tell me how to get those top words?
from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf_vect = TfidfVectorizer(ngram_range=(1,1))
tf_idf_vect.fit(sample_data)
final_tf_idf = tf_idf_vect.transform(sample_data)
I have a dataframe consisting of two columns. I use a function as a UDF and run it with applyInPandas in pyspark.
Below is the code:
import pandas as pd
from pyspark.sql.functions import pandas_udf, ceil
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

def normalize(pdf):
    v = pdf.v
    return pdf.assign(v=(v - v.mean()) / v.std())

df.groupby("id").applyInPandas(
    normalize, schema="id long, v double").show()
I have to pass one more argument to the normalize UDF. When I pass the argument, I get an error.
Below is the code:
import pandas as pd
from pyspark.sql.functions import pandas_udf, ceil
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

def normalize(pdf, value):
    v …

I have a review dataset with positive/negative class labels, and I am applying a decision tree to it. First I convert the reviews into a bag of words. Here sorted_data['Text'] holds the reviews and final_counts is a sparse matrix.
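On the applyInPandas question above: Spark calls the grouped-map function with exactly one `pandas.DataFrame` per group, so a two-argument `normalize(pdf, value)` fails. The usual workaround is to bind the extra argument beforehand with `functools.partial` (or a closure). The sketch below exercises the binding on plain pandas, which is what Spark hands the UDF per group; the Spark call itself is in a comment, and the use of `value` inside `normalize` (shifting the normalized values) is made up for illustration:

```python
import pandas as pd
from functools import partial

def normalize(pdf, value):
    v = pdf.v
    # `value` is the extra argument; here it just shifts the result
    return pdf.assign(v=(v - v.mean()) / v.std() + value)

# Bind the second argument; the result takes a single DataFrame, as required
normalize_with_5 = partial(normalize, value=5.0)

# With Spark (sketch):
#   df.groupby("id").applyInPandas(
#       normalize_with_5, schema="id long, v double").show()

# Demonstrate on one group's worth of data, as Spark would pass it in:
group = pd.DataFrame({"id": [2, 2, 2], "v": [3.0, 5.0, 10.0]})
print(normalize_with_5(group))
```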
I split the data into training and test datasets:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation has been removed

X_tr, X_test, y_tr, y_test = train_test_split(sorted_data['Text'], labels, test_size=0.3, random_state=0)

# BOW
count_vect = CountVectorizer()
count_vect.fit(X_tr.values)
final_counts = count_vect.transform(X_tr.values)  # "transfrom" typo fixed
Applying the decision tree algorithm as follows:
# instantiate learning model
# Applying the vectors of train data on the test data
optimal_lambda = 15
final_counts_x_test = count_vect.transform(X_test.values)
bow_reg_optimal = DecisionTreeClassifier(max_depth=optimal_lambda,random_state=0)
# fitting the model
bow_reg_optimal.fit(final_counts, y_tr)
# predict the response
pred = bow_reg_optimal.predict(final_counts_x_test)
# evaluate accuracy
acc = accuracy_score(y_test, pred) * 100
print('\nThe accuracy of the Decision Tree for depth = %f …