I created the following example code to plot an "impulse response function" in R with the help of the [vars][1] package.
library(vars)
data(Canada)
Canada <- data.frame(Canada)
irfplot = function(x, y) {
  VAR <- VAR(cbind(x, y), p = 2, type = "trend")
  irf_o <- irf(VAR, impulse = colnames(VAR$y)[1], response = colnames(VAR$y)[2],
               boot = TRUE, cumulative = FALSE, n.ahead = 20, ci = 0.90)
  plot(irf_o)
}
irfplot(Canada["rw"],Canada["U"])
This works so far. However, when I try to make the script more flexible by writing the function as
irfplot = function(x, y, lags, deter) {
  VAR <- VAR(cbind(x, y), p = lags, type = deter)
  ...
irfplot(Canada["rw"],Canada["U"], 2, "trend")
it returns:
Error in VAR(y = ysampled, p = lags, type = "trend") : …

I have a question concerning gensim. I would like to know whether it is recommended or necessary to use pickle when saving or loading a model (or several models), since I have found scripts on GitHub that do it either way.
mymodel = Doc2Vec(documents, size=100, window=8, min_count=5, workers=4)
mymodel.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)
See here.
Variant 1:
import pickle

# Save
mymodel.save("mymodel.pkl")  # stores a *.pkl file

# Load (note: pickle.load expects a file object, not a path string)
with open("mymodel.pkl", "rb") as f:
    mymodel = pickle.load(f)
Variant 2:
# Save
model.save(mymodel) # Stores *.model file
# Load
model = Doc2Vec.load(mymodel)
It seems to me that a pickle function is embedded in gensim.utils: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/utils.py
def save(...):
    ...
    try:
        _pickle.dump(self, fname_or_handle, protocol=pickle_protocol)
    ...
The goal of my question: I would like to learn 1) whether I need pickle (for better memory management) and 2) if so, why it is better than loading a *.model file.
Thanks!
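For reference, a plain-pickle round trip looks like the following. This is a minimal stdlib-only sketch with a hypothetical stand-in object rather than a real gensim model; it only illustrates the mechanics that pickle.dump()/pickle.load() provide, which gensim's own save()/load() wrap internally.

```python
import io
import pickle

# Hypothetical stand-in for a trained model: pickle round-trips arbitrary
# Python objects, which is all pickle.dump()/pickle.load() do.
model = {"vocab": ["rw", "U"], "vectors": [[0.1, 0.2], [0.3, 0.4]]}

buf = io.BytesIO()
pickle.dump(model, buf, protocol=pickle.HIGHEST_PROTOCOL)
buf.seek(0)

# Note: pickle.load() takes a file object, not a path string.
restored = pickle.load(buf)
print(restored == model)  # True
```

The practical difference is mostly that gensim's native save() can split large numpy arrays into separate files, which raw pickle does not do.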
I am having trouble matching tables where one data frame contains special characters and the other does not. Example: Doña Ana County vs. Dona Ana County.

Here is a script with which you can reproduce the output:
library(tidyverse)
library(acs)
tbl_df(acs::fips.place) # contains "Do\xf1a Ana County"
tbl_df(tigris::fips_codes) # contains "Dona Ana County"
Example:
tbl_df(tigris::fips_codes) %>% filter(county == "Dona Ana County")
This returns:
# A tibble: 1 x 5
state state_code state_name county_code county
<chr> <chr> <chr> <chr> <chr>
1 NM 35 New Mexico 013 Dona Ana County
Unfortunately, the following queries return nothing:
tbl_df(acs::fips.place) %>% filter(COUNTY == "Do\xf1a Ana County")
tbl_df(acs::fips.place) %>% filter(COUNTY == "Doña Ana County")
tbl_df(acs::fips.place) %>% filter(COUNTY == "Dona Ana County")
# …

I currently have the following script, which helps to find the best model for doc2vec. It works as follows: first, several models are trained based on the given parameters, and these are then tested against a classifier. Finally, it outputs the best model and classifier (I hope).
Data
The sample data (data.csv) can be downloaded here: https://pastebin.com/takYp6T8. Note that the data are structured in such a way that an ideal classifier should reach an accuracy of 1.0.
Script
import sys
import os
from time import time
from operator import itemgetter
import pickle
import pandas as pd
import numpy as np
from argparse import ArgumentParser
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import gensim.models.doc2vec
from gensim.models import KeyedVectors
from sklearn.base import BaseEstimator
from gensim import corpora
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import …

I have a question about converting a tseries.period.PeriodIndex to datetime.
I have a DataFrame that looks like this:
            colors  country
time_month
2010-09     xxx     xxx
2010-10     xxx     xxx
2010-11     xxx     xxx
...
time_month is the index.
type(df.index)
returns
class 'pandas.tseries.period.PeriodIndex'
When I try to run a VAR analysis with the DataFrame (http://statsmodels.sourceforge.net/devel/vector_ar.html#vector-autoregressions-tsa-vector-ar),
VAR(mdata)
it returns:
Given a pandas object and the index does not contain dates
Apparently, a Period does not count as a datetime. My question now is: how do I convert the index (time_month) into a datetime that the VAR analysis can work with?
df.index = pandas.DatetimeIndex(df.index)
returns
cannot convert Int64Index->DatetimeIndex
Thanks for your help!
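For what it is worth, pandas offers PeriodIndex.to_timestamp() for exactly this conversion. A minimal sketch with toy data (column names borrowed from the question; values are placeholders):

```python
import pandas as pd

# Toy frame with a monthly PeriodIndex, mimicking the structure above.
df = pd.DataFrame(
    {"colors": [1.0, 2.0, 3.0], "country": [4.0, 5.0, 6.0]},
    index=pd.period_range("2010-09", periods=3, freq="M", name="time_month"),
)

# PeriodIndex -> DatetimeIndex (each period mapped to its start timestamp),
# which statsmodels' VAR accepts as a date index.
df.index = df.index.to_timestamp()
print(type(df.index).__name__)  # DatetimeIndex
```

The cannot convert Int64Index->DatetimeIndex error above arises because pandas.DatetimeIndex(df.index) tries to reinterpret the periods' integer ordinals rather than their timestamps.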
How do I change the default start page? Currently, index.html is always the index of the posts.
I would like index.html to show the content of my current /about page, and to have /articles link to the index of the posts.
Is copy-pasting the content of /about/index.html into /index.html the only solution?
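If I understand Jekyll's permalink handling correctly, one copy-free option is to give the about page a `permalink: /` in its front matter so it is served at the site root (a sketch; the layout and title values are placeholders, and the existing root index.html would need to be removed or moved first):

```yaml
---
layout: page
title: About
permalink: /        # serve this page at the site root
---
```

A second page (e.g. articles.html with `permalink: /articles/`) can then hold the posts loop that used to live in index.html.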
I have the following output from a classification report:
             precision    recall  f1-score   support

          0     0.6772    0.5214    0.5892       491
          1     0.8688    0.9273    0.8971      1678

avg / total     0.8254    0.8354    0.8274      2169
The true labels in the dataset are s and p.
Question: how do I know which label is "0" and which is "1"? Or: how do I assign the labels in the correct order via labels= or target_names=?
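As a sketch of how the ordering works (toy labels, not the question's data): seeing rows "0" and "1" suggests the labels were numerically encoded before reporting; with the original string labels, classification_report orders rows by sorted unique labels unless labels= pins the order explicitly, and target_names= then renames rows positionally in that same order.

```python
from sklearn.metrics import classification_report

# Toy predictions with the question's label names.
y_true = ["p", "s", "s", "p", "s"]
y_pred = ["p", "s", "p", "p", "s"]

# labels= fixes the row order: first row is "p", second row is "s".
print(classification_report(y_true, y_pred, labels=["p", "s"]))
```

If an encoder (e.g. LabelEncoder) produced the 0/1 labels, its classes_ attribute tells you the mapping: classes_[0] is "0", classes_[1] is "1".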
I am working through the following example on Pipelines and GridSearchCV in Python: http://www.davidsbatista.net/blog/2017/04/01/document_classification/
Logistic regression:
pipeline = Pipeline([
('tfidf', TfidfVectorizer(stop_words=stop_words)),
('clf', OneVsRestClassifier(LogisticRegression(solver='sag'))),
])
parameters = {
'tfidf__max_df': (0.25, 0.5, 0.75),
'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
"clf__estimator__C": [0.01, 0.1, 1],
"clf__estimator__class_weight": ['balanced', None],
}
SVM:
pipeline = Pipeline([
('tfidf', TfidfVectorizer(stop_words=stop_words)),
('clf', OneVsRestClassifier(LinearSVC())),
])
parameters = {
'tfidf__max_df': (0.25, 0.5, 0.75),
'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
"clf__estimator__C": [0.01, 0.1, 1],
"clf__estimator__class_weight": ['balanced', None],
}
Is there a way to combine logistic regression and the SVM into a single pipeline? Say I have one TfidfVectorizer and would like to test multiple classifiers, each of which then reports its best model/parameters.
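One way (a sketch, not from the linked post, using toy documents): GridSearchCV accepts a list of parameter grids, and a pipeline step itself can be swapped as a parameter, so a single search can cover both classifiers under one shared TfidfVectorizer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy corpus; the placeholder classifier in "clf" is replaced by the grid.
docs = ["good movie", "bad movie", "great film", "awful film"] * 5
labels = [1, 0, 1, 0] * 5

pipeline = Pipeline([("tfidf", TfidfVectorizer()),
                     ("clf", LogisticRegression())])

# Each dict is its own sub-grid; the "clf" entry swaps the estimator itself.
param_grid = [
    {"clf": [LogisticRegression(solver="sag")], "clf__C": [0.1, 1]},
    {"clf": [LinearSVC()], "clf__C": [0.1, 1]},
]

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(docs, labels)
print(search.best_params_["clf"])  # the winning classifier instance
```

search.cv_results_ then holds the scores for every classifier/parameter combination, so per-classifier bests can be read off as well.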
My question concerns plotting a map of the entire U.S. at the MSA level via choroplethr and choroplethrZip.
In the example below, we plot 1) census population data on a U.S. county-level map and 2) a zoomed map of a selected metropolitan/micropolitan statistical area (MSA).
Example R code:
library(choroplethr)
library(choroplethrZip)
?zip.regions
data(zip.regions)
head(zip.regions)
?df_pop_county
data(df_pop_county)
df_pop_county
?df_pop_zip
data(df_pop_zip)
# U.S. County Population Data
county_choropleth(df_pop_county, legend = "Population")
# NY-NJ-PA MSA Population Data
zip_choropleth(df_pop_zip,
msa_zoom = "New York-Newark-Jersey City, NY-NJ-PA",
title = "2012 NY-Newark-Jersey City MSA\nZCTA Population Estimates",
legend = "Population")
Can we also plot a U.S. map at the MSA level as a whole, rather than just zooming into a specific MSA? An approach like
zip_choropleth(df_pop_zip, legend = "Population")
does not work, and would presumably plot ZCTA regions rather than MSA regions anyway.
Thanks!
I would like to run the following workflow:
Here is the code, which you can reproduce:
Grid search:
%%time
import numpy as np
import pandas as pd
from sklearn.externals import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from gensim.utils import simple_preprocess
np.random.seed(0)
data = pd.read_csv('https://pastebin.com/raw/dqKFZ12m')
X_train, X_test, y_train, y_test = train_test_split([simple_preprocess(doc) for doc in data.text],
data.label, random_state=0)
# Find best Tfidf model using LR
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(preprocessor=' '.join, tokenizer=None)), …