Raj*_*dra 14 numpy amazon-web-services amazon-emr pandas pyspark
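After launching the cluster with the bootstrap code below and getting the stdout below, I try to import pandas in pyspark and get the following error. It is caused by a conflict with a different numpy version, one that does not appear in the stdout. So pyspark seems to selectively ignore the numpy installation and use an older version, which causes the conflict. How do I resolve this?
The EMR version I am using is emr-5.33.0.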
import pandas as pd
File "/usr/local/lib64/python3.7/site-packages/pandas/__init__.py", line 22, in <module>
from pandas.compat import (
File "/usr/local/lib64/python3.7/site-packages/pandas/compat/__init__.py", line 15, in <module>
from pandas.compat.numpy import (
File "/usr/local/lib64/python3.7/site-packages/pandas/compat/numpy/__init__.py", line 21, in <module>
f"this version of pandas is incompatible with numpy < {_min_numpy_ver}\n"
ImportError: this version of pandas is incompatible with numpy < 1.17.3
your numpy version is 1.16.5.
Please upgrade numpy to >= 1.17.3 to use this pandas version
This is the bootstrap code I am using:
#!/bin/bash
set -x -e
echo -e 'export PYSPARK_PYTHON=/usr/bin/python3
export HADOOP_CONF_DIR=/etc/hadoop/conf
export SPARK_JARS_DIR=/usr/lib/spark/jars
export SPARK_HOME=/usr/lib/spark' >> $HOME/.bashrc && source $HOME/.bashrc
sudo python3 -m pip install
sudo python3 -m pip install numpy pandas awscli boto spark-nlp
sudo python3 -m pip freeze
sudo ls /usr/local/lib64/python3.7/site-packages/
set +x
exit 0
This is the software configuration I provided:
[{
"Classification": "spark-env",
"Configurations": [{
"Classification": "export",
"Properties": {
"PYSPARK_PYTHON": "/usr/bin/python3"
}
}]
},
{
"Classification": "spark-defaults",
"Properties": {
"spark.yarn.stagingDir": "hdfs:///tmp",
"spark.yarn.preserve.staging.files": "true",
"spark.kryoserializer.buffer.max": "2000M",
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.driver.maxResultSize": "0",
"spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:3.1.2"
}
}
]
This is the stdout I get after bootstrapping:
Collecting numpy
Downloading https://files.pythonhosted.org/packages/2c/d2/8973eb282fc3c7e6c4db0469f0390d81d8eb9ae56dfaa2a7e6db07283682/numpy-1.21.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (14.1MB)
Installing collected packages: numpy
Successfully installed numpy-1.21.0
Collecting pandas
Downloading https://files.pythonhosted.org/packages/99/f7/01cea7f6c963100f045876eb4aa1817069c5c9eca73d2dbfb5d31ff9a39f/pandas-1.3.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (10.8MB)
Collecting awscli
Downloading https://files.pythonhosted.org/packages/aa/24/e098cf5ce28a764bca174e88f4ccb70754e9f049c9bf986e582aedcb7420/awscli-1.19.112-py2.py3-none-any.whl (3.6MB)
Requirement already satisfied: boto in /usr/local/lib/python3.7/site-packages
Collecting spark-nlp
Downloading https://files.pythonhosted.org/packages/6a/98/5e860fdd0227b8eac3907acd5f896c9b2aae0a93cd676aaaf2aa4f48dfe0/spark_nlp-3.1.2-py2.py3-none-any.whl (45kB)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.7/site-packages (from pandas)
Requirement already satisfied: numpy>=1.17.3 in /root/.local/lib/python3.7/site-packages (from pandas)
Collecting python-dateutil>=2.7.3 (from pandas)
Downloading https://files.pythonhosted.org/packages/36/7a/87837f39d0296e723bb9b62bbb257d0355c7f6128853c78955f57342a56d/python_dateutil-2.8.2-py2.py3-none-any.whl (247kB)
Collecting rsa<4.8,>=3.1.2; python_version > "2.7" (from awscli)
Downloading https://files.pythonhosted.org/packages/e9/93/0c0f002031f18b53af7a6166103c02b9c0667be528944137cc954ec921b3/rsa-4.7.2-py3-none-any.whl
Collecting docutils<0.16,>=0.10 (from awscli)
Downloading https://files.pythonhosted.org/packages/22/cd/a6aa959dca619918ccb55023b4cb151949c64d4d5d55b3f4ffd7eee0c6e8/docutils-0.15.2-py3-none-any.whl (547kB)
Requirement already satisfied: PyYAML<5.5,>=3.10 in /usr/local/lib64/python3.7/site-packages (from awscli)
Collecting s3transfer<0.5.0,>=0.4.0 (from awscli)
Downloading https://files.pythonhosted.org/packages/63/d0/693477c688348654ddc21dcdce0817653a294aa43f41771084c25e7ff9c7/s3transfer-0.4.2-py2.py3-none-any.whl (79kB)
Collecting colorama<0.4.4,>=0.2.5 (from awscli)
Downloading https://files.pythonhosted.org/packages/c9/dc/45cdef1b4d119eb96316b3117e6d5708a08029992b2fee2c143c7a0a5cc5/colorama-0.4.3-py2.py3-none-any.whl
Collecting botocore==1.20.112 (from awscli)
Downloading https://files.pythonhosted.org/packages/c7/ea/11c3beca131920f552602b98d7ba9fc5b46bee6a59cbd48a95a85cbb8f41/botocore-1.20.112-py2.py3-none-any.whl (7.7MB)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/site-packages (from python-dateutil>=2.7.3->pandas)
Collecting pyasn1>=0.1.3 (from rsa<4.8,>=3.1.2; python_version > "2.7"->awscli)
Downloading https://files.pythonhosted.org/packages/62/1e/a94a8d635fa3ce4cfc7f506003548d0a2447ae76fd5ca53932970fe3053f/pyasn1-0.4.8-py2.py3-none-any.whl (77kB)
Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /usr/local/lib/python3.7/site-packages (from botocore==1.20.112->awscli)
Collecting urllib3<1.27,>=1.25.4 (from botocore==1.20.112->awscli)
Downloading https://files.pythonhosted.org/packages/5f/64/43575537846896abac0b15c3e5ac678d787a4021e906703f1766bfb8ea11/urllib3-1.26.6-py2.py3-none-any.whl (138kB)
Installing collected packages: python-dateutil, pandas, pyasn1, rsa, docutils, urllib3, botocore, s3transfer, colorama, awscli, spark-nlp
Successfully installed awscli-1.19.112 botocore-1.20.112 colorama-0.4.3 docutils-0.15.2 pandas-1.3.0 pyasn1-0.4.8 python-dateutil-2.8.2 rsa-4.7.2 s3transfer-0.4.2 spark-nlp-3.1.2 urllib3-1.26.6
awscli==1.19.112
beautifulsoup4==4.9.3
boto==2.49.0
botocore==1.20.112
click==7.1.2
colorama==0.4.3
docutils==0.15.2
jmespath==0.10.0
joblib==1.0.1
lxml==4.6.2
mysqlclient==1.4.2
nltk==3.5
nose==1.3.4
numpy==1.21.0
pandas==1.3.0
py-dateutil==2.2
pyasn1==0.4.8
python-dateutil==2.8.2
pytz==2021.1
PyYAML==5.4.1
regex==2021.3.17
rsa==4.7.2
s3transfer==0.4.2
six==1.13.0
spark-nlp==3.1.2
tqdm==4.59.0
urllib3==1.26.6
windmill==1.6
click
click-7.1.2.dist-info
joblib
joblib-1.0.1.dist-info
lxml
lxml-4.6.2-py3.7.egg-info
mysqlclient-1.4.2-py3.7.egg-info
MySQLdb
pandas
pandas-1.3.0.dist-info
PyYAML-5.4.1-py3.7.egg-info
regex
regex-2021.3.17-py3.7.egg-info
tqdm
tqdm-4.59.0.dist-info
yaml
_yaml
r_g*_*_s_ 16
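This is actually an EMR bug, and it is being discussed on the AWS forums: https://forums.aws.amazon.com/thread.jspa?messageID=989210&tstart=0
I faced the same issue on emr 6.3.0; my solution was to pin pandas to 1.2.5 in the bootstrap script (pandas==1.2.5 in pip syntax). It is a quick fix until AWS fixes the issue.
Also, I see some solutions/hacks posted here:
How to install multiple versions of numpy on Amazon EMR and how to remove earlier versions?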
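Before applying a workaround, it may help to confirm which numpy the PySpark interpreter actually loads. The bootstrap stdout in the question reports numpy>=1.17.3 as already satisfied in /root/.local/lib/python3.7/site-packages, which is root's per-user directory and may not be visible to the user running pyspark. The following is a minimal diagnostic sketch to run inside the pyspark shell; the helper name `locate` is hypothetical and not part of any library.

```python
import importlib
import sys


def locate(module_name):
    """Return (version, file path) of a module as seen by this interpreter.

    Returns (None, None) if the module cannot be imported. `locate` is a
    hypothetical helper written for this check only.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None, None
    return getattr(module, "__version__", None), getattr(module, "__file__", None)


# Which Python binary is running (compare with PYSPARK_PYTHON).
print("interpreter:", sys.executable)

# Which numpy this interpreter resolves, and from which directory.
version, path = locate("numpy")
print("numpy version:", version)
print("numpy path:", path)

# Directories are searched in this order; the first numpy found wins.
for entry in sys.path:
    print("sys.path entry:", entry)
```

If the printed numpy path points somewhere other than the directory pip installed into during bootstrap, the old 1.16.5 copy is shadowing the new one for that user, which would be consistent with the ImportError in the question.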