I am trying to install pyarrow on the master instance of my EMR cluster, but I keep getting this error.
[hadoop@ip-XXX-XXX-XXX-XXX ~]$ sudo /usr/bin/pip-3.4 install pyarrow
Collecting pyarrow
Downloading https://files.pythonhosted.org/packages/c0/a0/f7e9dfd8988d94f4952f9b50eb04e14a80fbe39218520725aab53daab57c/pyarrow-0.10.0.tar.gz (2.1MB)
100% |████████████████████████████████| 2.2MB 643kB/s
Requirement already satisfied: numpy>=1.10 in /usr/local/lib64/python3.4/site-packages (from pyarrow)
Requirement already satisfied: six>=1.0.0 in /usr/local/lib/python3.4/site-packages (from pyarrow)
Installing collected packages: pyarrow
Running setup.py install for pyarrow ... error
Complete output from command /usr/bin/python3.4 -u -c "import setuptools, tokenize;__file__='/mnt/tmp/pip-build-pr3y5_mu/pyarrow/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-vmywdpeg-record/install-record.txt --single-version-externally-managed --compile:
/usr/lib64/python3.4/distutils/dist.py:260: UserWarning: Unknown distribution option: 'long_description_content_type'
warnings.warn(msg)
/mnt/tmp/pip-build-pr3y5_mu/pyarrow/.eggs/setuptools_scm-3.1.0-py3.4.egg/setuptools_scm/utils.py:118: UserWarning: 'git' was not found
running …

Background: I am using RandomForestClassifier from pyspark.ml for a simple binary classification. Before feeding the data into training, I use VectorIndexer and set its maxCategories parameter to decide whether each feature is numerical or categorical.
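To make the maxCategories rule concrete, here is a minimal pure-Python sketch of the documented behaviour: a feature with at most maxCategories distinct values is treated as categorical, and anything above that is declared continuous. It is not Spark's implementation. The function name `split_by_max_categories` and the distinct-value counts for features 1 and 2 are hypothetical; only the 10765 values of feature 0 and the threshold of 30 come from the question.

```python
def split_by_max_categories(distinct_counts, max_categories):
    """Split feature indices the way the maxCategories rule describes.

    distinct_counts[i] is the number of distinct values seen in feature i.
    Features with <= max_categories distinct values are categorical;
    the rest are treated as continuous.
    """
    categorical = {}   # feature index -> number of categories
    continuous = []    # feature indices treated as numeric
    for idx, n_values in enumerate(distinct_counts):
        if n_values <= max_categories:
            categorical[idx] = n_values
        else:
            continuous.append(idx)
    return categorical, continuous


# Feature 0 has 10765 distinct values (from the question);
# the counts for features 1 and 2 are made up for illustration.
categorical, continuous = split_by_max_categories([10765, 4, 2], max_categories=30)
print(categorical)  # {1: 4, 2: 2}
print(continuous)   # [0]
```

Under this rule alone, feature 0 would be expected to end up continuous, which is why the error in the question is surprising.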
Problem: Even though I use VectorIndexer with maxCategories set to 30, I still get an error in the training pipeline:
An error occurred while calling o15371.fit.
: java.lang.IllegalArgumentException: requirement failed: DecisionTree requires maxBins (= 32) to be at least as large as the number of values in each categorical feature, but categorical feature 0 has 10765 values. Considering remove this and other categorical features with a large number of values, or add more training examples.
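The requirement stated in the error message can be sketched as a simple check: every feature that the tree treats as categorical must have no more values than maxBins. The snippet below is a pure-Python illustration of that condition only, not Spark's actual code. The helper name `features_exceeding_max_bins` and the arity of feature 3 are hypothetical; feature 0 with 10765 values and maxBins = 32 are taken from the message.

```python
def features_exceeding_max_bins(categorical_arity, max_bins=32):
    """Return the categorical features that violate the maxBins requirement.

    categorical_arity maps feature index -> number of category values.
    A feature violates the requirement when its arity is larger than max_bins.
    """
    return {idx: n for idx, n in categorical_arity.items() if n > max_bins}


# Feature 0 and its 10765 values come from the error message;
# feature 3 with 12 values is a made-up example that passes the check.
offenders = features_exceeding_max_bins({0: 10765, 3: 12}, max_bins=32)
print(offenders)  # {0: 10765}
```

A non-empty result corresponds to the situation reported above: feature 0 is still marked as categorical when it reaches the tree, even though it has far more than 30 distinct values.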
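My code is simple: col_idx is a list of column-name strings I generated, which is passed to the StringIndexer; col_all is a list of column-name strings that is passed to both the StringIndexer and the OneHotEncoder; col_num holds the numeric column names.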
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler, IndexToString, VectorIndexer
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
my_data.cache()
# stringindexers and encoders
stIndexers …