I want to see a progress bar in my Jupyter Notebook while a computation runs with Dask. I'm counting all the values of the "id" column of a big (4+ GB) CSV file, so any ideas?
import dask.dataframe as dd
df = dd.read_csv('data/train.csv')
df.id.count().compute()
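
A minimal sketch of one way to get this, assuming the default local scheduler: Dask ships a text progress bar in dask.diagnostics that renders while compute() runs.

import dask.dataframe as dd
from dask.diagnostics import ProgressBar

df = dd.read_csv('data/train.csv')

# The context manager prints a live progress bar for every compute() inside it
with ProgressBar():
    print(df.id.count().compute())

With the distributed scheduler, dask.distributed.progress plays the same role.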
I want to improve the parameters of this GridSearchCV for a Random Forest Regressor.
def Grid_Search_CV_RFR(X_train, y_train):
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import ShuffleSplit
    from sklearn.ensemble import RandomForestRegressor

    estimator = RandomForestRegressor()
    param_grid = {
        "n_estimators": [10, 20, 30],
        "max_features": ["auto", "sqrt", "log2"],
        "min_samples_split": [2, 4, 8],
        "bootstrap": [True, False],
    }
    grid = GridSearchCV(estimator, param_grid, n_jobs=-1, cv=5)
    grid.fit(X_train, y_train)
    return grid.best_score_, grid.best_params_

def RFR(X_train, X_test, y_train, y_test, best_params):
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import r2_score

    estimator = RandomForestRegressor(n_jobs=-1).set_params(**best_params)
    estimator.fit(X_train, y_train)
    y_predict = estimator.predict(X_test)
    print("R2 score:", r2_score(y_test, y_predict))
    return y_test, y_predict

def splitter_v2(tab, y_indicator):
    from …
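
One common way to improve on a small fixed grid (a sketch, not from the original post; the values below are illustrative) is to widen the search space and sample it with RandomizedSearchCV instead of enumerating every combination:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Illustrative, wider space than the original grid
param_distributions = {
    "n_estimators": [50, 100, 200, 400],
    "max_features": ["sqrt", "log2", None],
    "min_samples_split": [2, 4, 8, 16],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

search = RandomizedSearchCV(
    RandomForestRegressor(),
    param_distributions,
    n_iter=30,         # sample 30 combinations instead of the full product
    cv=5,
    n_jobs=-1,
    random_state=0,
)
# search.fit(X_train, y_train); search.best_score_, search.best_params_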
I'm trying to use pivot_table with Dask on the following dataframe:
         date  store_nbr  item_nbr  unit_sales  year  month
0  2013-01-01         25    103665         7.0  2013      1
1  2013-01-01         25    105574         1.0  2013      1
2  2013-01-01         25    105575         2.0  2013      1
3  2013-01-01         25    108079         1.0  2013      1
4  2013-01-01         25    108701         1.0  2013      1
When I try pivot_table like this:
ddf.pivot_table(values='unit_sales', index={'store_nbr','item_nbr'},
                columns={'year','month'}, aggfunc={'mean','sum'})
I get this error:
ValueError: 'index' must be the name of an existing column
And if I use only a single value for the index and columns arguments, like this:
df.pivot_table(values='unit_sales', index='store_nbr',
               columns='year', aggfunc={'sum'})
I get this error:
ValueError: 'columns' must be category dtype
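
Both errors point at the constraints of Dask's pivot_table: index and columns must each be a single column name (not a set), and the columns column must be categorical with known categories. A sketch of what usually resolves both, using the column names from the frame above:

# categorize() computes the categories so pivot_table can lay out the columns
ddf = ddf.categorize(columns=['year'])

pivoted = ddf.pivot_table(
    values='unit_sales',
    index='store_nbr',   # one column name, not a set
    columns='year',
    aggfunc='sum',       # one aggregation at a time
)
result = pivoted.compute()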
I'm trying to import the numpy library in an AWS Lambda, building a layer with the following steps:
layer=numpy
mkdir -p $layer/python/lib/python3.7/site-packages/
cd $layer/python/lib/python3.7/site-packages/
pip install -t . numpy
cd ../../../../
zip -r $layer.zip .
This is the error from the Lambda:
[ERROR] Runtime.ImportModuleError: Unable to import module 'lambda_function':

IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!

Importing the numpy c-extensions failed.
- Try uninstalling and reinstalling numpy.
- If you have already done that, then:
  1. Check that you expected to use Python3.7 from "/var/lang/bin/python3.7",
     and that you have no directories in your PATH or PYTHONPATH that can
     interfere with the Python and numpy version "1.17.3" you're trying to use.
  2. If (1) looks fine, you can open a new issue at
     https://github.com/numpy/numpy/issues. Please include details on:
     - how you installed Python
     - how you installed numpy
     - your operating system
     - whether or not you have multiple versions of Python installed
     - if you built from source, your compiler versions and ideally a build log
- If you're working with a numpy git repository, try
git …
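
A likely cause is that the layer was built on a machine that is not binary-compatible with the Lambda runtime, so numpy's compiled c-extensions fail to load. A sketch of building the layer inside an Amazon Linux-compatible container instead (lambci/lambda:build-python3.7 is a community build image, not part of the original steps):

layer=numpy
mkdir -p $layer/python/lib/python3.7/site-packages/
# pip runs inside the container, so the wheels match the python3.7 runtime
docker run --rm -v "$PWD/$layer":/opt lambci/lambda:build-python3.7 \
    pip install numpy -t /opt/python/lib/python3.7/site-packages/
cd $layer
zip -r ../$layer.zip .
cd ..

AWS also publishes a prebuilt SciPy/NumPy layer for the Python 3.7 runtime, which sidesteps building numpy altogether.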
How can I use Dask's apply to compute the logarithm on a single column of a large dataset?
df_train.apply(lambda x: np.log1p(x), axis=1, meta={'column_name': 'float32'}).compute()
The dataset is very big (125 million rows), so how should I do this?
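
A minimal sketch, assuming the column is literally named column_name: np.log1p is a NumPy ufunc, so it can be applied to the one column directly and lazily, avoiding a row-wise apply over the whole frame:

import numpy as np

# ufuncs operate element-wise on a dask Series and return a dask Series
result = np.log1p(df_train['column_name']).compute()

# equivalent partition-by-partition form
result = df_train['column_name'].map_partitions(
    np.log1p, meta=('column_name', 'float32')).compute()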
I want to impute the negative values of a Dask DataFrame. With Pandas I use the following code:
df.loc[(df.column_name < 0),'column_name'] = 0
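
A sketch of the Dask equivalent, assuming the same column name: boolean .loc assignment isn't supported on a dask DataFrame, but clip() or where() expresses the same imputation lazily:

# replace negative values with 0
df['column_name'] = df['column_name'].clip(lower=0)

# or, spelling the condition out
df['column_name'] = df['column_name'].where(df['column_name'] >= 0, 0)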
I have this,
      col1     col2
       no yes   no yes
index
A       2   8    2   6
B       0   2    1   1
and I want the percentage of the 'yes' columns, like this,
     col1  col2
      yes   yes
col0
A     0.8  0.75
B     1.0   0.5
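
A pandas sketch, using the labels from the tables above: slice the 'no' and 'yes' sub-columns out of the MultiIndex and divide:

# fraction of 'yes' within each top-level column
yes = df.xs('yes', axis=1, level=1)
no = df.xs('no', axis=1, level=1)
pct = yes / (yes + no)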
I'm trying to add lemmatization to scikit-learn's CountVectorizer, as follows:
import nltk
from pattern.es import lemma
from nltk import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import WordNetLemmatizer
class LemmaTokenizer(object):
    def __call__(self, text):
        return [lemma(t) for t in word_tokenize(text)]
vectorizer = CountVectorizer(stop_words=stopwords.words('spanish'),tokenizer=LemmaTokenizer())
sentence = ["EVOLUCIÓN de los sucesos y la EXPANSIÓN, ellos juegan y yo les dije lo que hago","hola, qué tal vas?"]
vectorizer.fit_transform(sentence)
This is the output:
[u',', u'?', u'car', u'decir', u'der', u'evoluci\xf3n', u'expansi\xf3n', u'hacer', u'holar', u'ir', u'jugar', u'lar', u'ler', u'sucesos', u'tal', u'yar']
UPDATE
These are the stop words that show up in the output, already lemmatized:
u'lar', u'ler', u'der'
It lemmatizes all the words and does not remove the stop words. So, any ideas?
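
One plausible explanation: CountVectorizer applies stop_words to the tokens after the tokenizer has run, so by then 'la' has already been lemmatized to 'lar' and no longer matches the stop-word list. A sketch that filters stop words inside the tokenizer instead:

from pattern.es import lemma
from nltk import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

spanish_stopwords = set(stopwords.words('spanish'))

class LemmaTokenizer(object):
    def __call__(self, text):
        # drop stop words before lemmatizing, so 'la' never turns into 'lar'
        return [lemma(t) for t in word_tokenize(text)
                if t.lower() not in spanish_stopwords]

vectorizer = CountVectorizer(tokenizer=LemmaTokenizer())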
I'm trying to deploy and update the code of several Lambdas at once, but when I push to my branch and CodeBuild deploys, I get the following error:
An error occurred (InvalidParameterValueException) when calling the UpdateFunctionCode operation: Unzipped size must be smaller than 350198 bytes
[Container] 2021/04/24 00:09:31 Command did not exit successfully aws lambda update-function-code --function-name my_lambda_03 --zip-file fileb://my_lambda_03.zip exit status 254
[Container] 2021/04/24 00:09:31 Phase complete: POST_BUILD State: FAILED
[Container] 2021/04/24 00:09:31 Phase context status code: COMMAND_EXECUTION_ERROR Message: Error while executing command: aws lambda update-function-code --function-name my_lambda_03 --zip-file fileb://my_lambda_03.zip. Reason: exit status 254
This is the buildspec.yml:
version: 0.2
phases:
  install:
    runtime-versions:
      python: 3.x
    commands:
      - echo "Installing dependencies..."
  build:
    commands:
      - echo "Zipping all my functions....."
      - cd my_lambda_01/
      - zip -r9 ../my_lambda_01.zip .
      - cd ..
      - cd my_lambda_02/
      - zip -r9 ../my_lambda_02.zip . …
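
The limit applies to the unzipped size of each function package, so the usual workarounds (a sketch, not from the original buildspec) are to keep non-runtime files out of each zip, or to move heavy shared dependencies into a Lambda layer:

# exclude caches, tests and VCS metadata when zipping each function
zip -r9 ../my_lambda_03.zip . -x "*.pyc" "__pycache__/*" "tests/*" ".git/*"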
First, download and unpack the .tgz:
tar xvf zeppelin-0.7.3-bin-all.tgz
Second, set the SPARK_HOME variable:
vi ~/.bashrc
Add:
export SPARK_HOME="/home/miguel/spark-2.3.0-bin-hadoop2.7/"
Third, launch Zeppelin from the command line:
bin/zeppelin-daemon.sh start
Fourth, try to run pyspark:
%pyspark
print("Hello")
I got this error:
java.lang.ClassNotFoundException: org.apache.spark.ui.jobs.JobProgressListener
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.zeppelin.spark.SparkInterpreter.setupListeners(SparkInterpreter.java:170)
at org.apache.zeppelin.spark.SparkInterpreter.getSparkContext(SparkInterpreter.java:148)
at org.apache.zeppelin.spark.SparkInterpreter.open(SparkInterpreter.java:843)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:70)
at org.apache.zeppelin.spark.PySparkInterpreter.getSparkInterpreter(PySparkInterpreter.java:565)
at org.apache.zeppelin.spark.PySparkInterpreter.createGatewayServerAndStartScript(PySparkInterpreter.java:209)
at org.apache.zeppelin.spark.PySparkInterpreter.open(PySparkInterpreter.java:162)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:70)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:491)
at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Run Code Online (Sandbox Code Playgroud) dask ×4