ℕʘʘ*_*ḆḽḘ 6 python jython apache-pig pytz cloudera
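I use datetime in some Python UDFs that I call from my Pig script. So far so good. I am using Pig 0.12.0 on Cloudera 5.5.
However, I also need to use the pytz or dateutil packages, and they do not seem to be part of a vanilla Python install.
Can I use them in my Pig UDFs somehow? If so, how? I think dateutil is installed on my nodes (I am not an admin, so how can I actually check that this is the case?), but when I type: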
import sys
#I append the path to dateutil on my local windows machine. Is that correct?
sys.path.append('C:/Users/me/AppData/Local/Continuum/Anaconda2/lib/site-packages')
from dateutil import tz
in my udfs.py script, I get:
2016-08-30 09:56:06,572 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1121: Python Error. Traceback (most recent call last):
File "udfs.py", line 23, in <module>
from dateutil import tz
ImportError: No module named dateutil
when I run my Pig script.
All my other Python UDFs (the ones using datetime, for example) work just fine. Any idea how to fix this?
Many thanks!
UPDATE
After playing with the Python path a bit, I am now able to
import dateutil
(at least Pig does not crash). But if I try:
from dateutil import tz
I get an error:
from dateutil import tz
File "/opt/python/lib/python2.7/site-packages/dateutil/tz.py", line 16, in <module>
from six import string_types, PY3
File "/opt/python/lib/python2.7/site-packages/six.py", line 604, in <module>
viewkeys = operator.methodcaller("viewkeys")
AttributeError: type object 'org.python.modules.operator' has no attribute 'methodcaller'
How can I overcome this? I use tz in the following way:
to_zone = dateutil.tz.gettz('US/Eastern')
from_zone = dateutil.tz.gettz('UTC')
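and then I change the timezone of my timestamps. Can I just import dateutil directly? What is the proper syntax?
UPDATE 2
Following yakuza's suggestion, I am now able to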
import sys
sys.path.append('/opt/python/lib/python2.7/site-packages')
sys.path.append('/opt/python/lib/python2.7/site-packages/pytz/zoneinfo')
import pytz
but now I get another error:
Caused by: Traceback (most recent call last):
File "udfs.py", line 158, in to_date_local
File "__pyclasspath__/pytz/__init__.py", line 180, in timezone
pytz.exceptions.UnknownTimeZoneError: 'America/New_York'
when I define
to_zone = pytz.timezone('America/New_York')
from_zone = pytz.timezone('UTC')
I found some hints here: "UnknownTimezoneError Exception Raised with Python Application Compiled with Py2Exe".
What should I do? Awww, I just want to convert timezones in Pig :(
Well, as you probably know, Python UDFs are not executed by the regular Python interpreter but by the Jython that is distributed with Pig. By default in 0.12.0 it should be Jython 2.5.3. Unfortunately the six package, which dateutil requires, supports Python only from version 2.6 on. That is why your traceback fails on operator.methodcaller, which was added in Python 2.6. pytz, however, does not seem to have such a dependency and should support Python versions starting from 2.4.
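To confirm which interpreter really executes a UDF file, a small check like the one below can help. It is only a sketch: `describe_interpreter` is a hypothetical helper name, not part of Pig or Jython, and it uses nothing beyond the standard `sys` and `operator` modules.

```python
import operator
import sys


def describe_interpreter():
    """Return a short report about the interpreter executing this file."""
    major, minor = sys.version_info[0], sys.version_info[1]
    # Jython reports a platform string that starts with "java".
    is_jython = sys.platform.startswith('java')
    # operator.methodcaller was added in Python 2.6; six relies on it.
    has_methodcaller = hasattr(operator, 'methodcaller')
    return {
        'version': '%d.%d' % (major, minor),
        'is_jython': is_jython,
        'has_methodcaller': has_methodcaller,
    }
```

If `has_methodcaller` comes back false inside the UDF, any package that depends on six is likely to fail in the same way as in the question.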
So to achieve your goal you should distribute the pytz package, in a version that supports Python 2.5, to all your nodes and add its path to sys.path in your Pig UDF. If you repeat the steps you did for dateutil, everything should work as you expect. We use the very same approach with pygeoip and it works like a charm.
When you run a Pig script that references a Python UDF (more precisely, a Jython UDF), your script gets compiled to a map/reduce job; all REGISTERed files are included in the JAR file and distributed to the nodes where the code is actually executed. When your code runs, the Jython interpreter is started and executed from Java code. So when Python code is executed on each node taking part in the computation, all Python imports are resolved locally on that node. Imports from the standard library are taken from the Jython implementation, but all other "packages" have to be installed separately, as there is no pip for it.
So to make external packages available to a Python UDF, you have to install the required packages manually, either with another pip or from sources, but remember to download a package compatible with Python 2.5! Then, in every single UDF file, you have to append the site-packages directory where you installed the packages (it's important to use the same directory on each node). For example:
import sys
sys.path.append('/path/to/site-packages')
# Imports of non-stdlib packages
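The append shown above can be made a little more defensive. The sketch below assumes a hypothetical helper called `add_site_packages`; it only adds the directory if it exists on the node and is not already on sys.path, so a wrong path is easier to spot. The path in the last line is the same placeholder as above.

```python
import os
import sys


def add_site_packages(path):
    """Append path to sys.path once, and only if it is a directory.

    Returns True if the path is on sys.path afterwards, False otherwise.
    """
    if not os.path.isdir(path):
        return False
    if path not in sys.path:
        sys.path.append(path)
    return True


# Placeholder path; use the real directory that exists on every node.
add_site_packages('/path/to/site-packages')
```

A False return value means the directory is missing on that node, which would otherwise only show up later as an ImportError.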
Let's assume we have the following files:
/opt/pytz_test/test_pytz.pig:
REGISTER '/opt/pytz_test/test_pytz_udf.py' using jython as test;
A = LOAD '/opt/pytz_test/test_pytz_data.csv' AS (timestamp:int);
B = FOREACH A GENERATE
test.to_date_local(timestamp);
STORE B INTO '/tmp/test_pytz_output.csv' using PigStorage(',');
/opt/pytz_test/test_pytz_udf.py:
from datetime import datetime
import sys

# Directory where pytz is installed; it must be the same on every node.
sys.path.append('/usr/lib/python2.6/site-packages/')
import pytz


@outputSchema('date:chararray')
def to_date_local(unix_timestamp):
    """
    Converts a Unix timestamp to a date (YYYY-MM-DD) in America/New_York.
    """
    to_zone = pytz.timezone('America/New_York')
    from_zone = pytz.timezone('UTC')
    try:
        # Interpret the timestamp as UTC, then shift it to the target zone.
        as_datetime = datetime.utcfromtimestamp(unix_timestamp) \
            .replace(tzinfo=from_zone).astimezone(to_zone) \
            .date().strftime('%Y-%m-%d')
    except:
        # Fall back to the raw value if the conversion fails.
        as_datetime = unix_timestamp
    return as_datetime
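It can be handy to exercise the conversion logic with a plain Python interpreter before submitting a Pig job. The following is only a sketch under a few assumptions: outside Pig the `outputSchema` decorator is not defined, so a no-op fallback is declared; `FixedOffset` and `to_date_fixed_offset` are hypothetical names; and a fixed UTC-5 offset stands in for pytz's America/New_York so that no third-party package is needed. It ignores daylight saving time, so it is a test aid and not a replacement for pytz.

```python
from datetime import datetime, timedelta, tzinfo

# Inside Pig, outputSchema is already defined; outside Pig we fall back to a
# no-op decorator so the file can be imported by a plain interpreter.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator


class FixedOffset(tzinfo):
    """Hypothetical stand-in for a pytz zone: constant offset, no DST."""

    def __init__(self, hours, name):
        self._offset = timedelta(hours=hours)
        self._name = name

    def utcoffset(self, dt):
        return self._offset

    def dst(self, dt):
        return timedelta(0)

    def tzname(self, dt):
        return self._name


UTC = FixedOffset(0, 'UTC')
EST = FixedOffset(-5, 'EST')  # US Eastern standard time only, DST ignored

EPOCH = datetime(1970, 1, 1, tzinfo=UTC)


@outputSchema('date:chararray')
def to_date_fixed_offset(unix_timestamp, to_zone=EST):
    """Convert a Unix timestamp to a YYYY-MM-DD date in to_zone."""
    as_utc = EPOCH + timedelta(seconds=unix_timestamp)
    return as_utc.astimezone(to_zone).strftime('%Y-%m-%d')


print(to_date_fixed_offset(1294778181))  # 2011-01-11
```

The first sample timestamp, 1294778181, is 20:36:21 UTC on 11 January 2011, so it stays on the same calendar day after shifting five hours back.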
/opt/pytz_test/test_pytz_data.csv:
1294778181
1294778182
1294778183
1294778184
Now let's install pytz on our node. It has to be installed with a Python version for which the pytz release is also compatible with Python 2.5 (that is, 2.5-2.7); in my case I'll use Python 2.6:
sudo pip2.6 install pytz
Please make sure that the file /opt/pytz_test/test_pytz_udf.py appends to sys.path the site-packages directory where pytz is installed.
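If you are not sure which site-packages directory actually holds pytz, a quick check along these lines may help. It is a sketch only: `find_package` is a hypothetical helper, the two candidate directories are simply the ones mentioned in this thread, and the result describes only the machine the check runs on, not the other nodes.

```python
import os


def find_package(name, candidate_dirs):
    """Return the first directory in candidate_dirs that contains package name.

    A package counts as present if <dir>/<name>/__init__.py exists.
    Returns None when no candidate contains it.
    """
    for directory in candidate_dirs:
        init_file = os.path.join(directory, name, '__init__.py')
        if os.path.isfile(init_file):
            return directory
    return None


# Both paths appear in this thread; which one exists depends on the node.
candidates = [
    '/usr/lib/python2.6/site-packages',
    '/opt/python/lib/python2.7/site-packages',
]
print(find_package('pytz', candidates))
```

The directory it prints is the one to pass to sys.path.append in the UDF file; None means pytz was not found in any of the candidates.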
Now once we run Pig with our test script:
pig -x local /opt/pytz_test/test_pytz.pig
we should be able to read the output of the job, which should list:
2011-01-11
2011-01-11
2011-01-11
2011-01-11
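Note that Pig's STORE writes a directory, not a single file, so /tmp/test_pytz_output.csv will contain one or more part-* files. A minimal sketch for collecting them with a regular Python interpreter could look like this; `read_pig_output` is a hypothetical helper name.

```python
import glob
import os


def read_pig_output(output_dir):
    """Return all lines from the part-* files in a Pig output directory."""
    lines = []
    # Sorting keeps the part files in a stable order.
    for part in sorted(glob.glob(os.path.join(output_dir, 'part-*'))):
        with open(part) as handle:
            lines.extend(line.rstrip('\n') for line in handle)
    return lines


for line in read_pig_output('/tmp/test_pytz_output.csv'):
    print(line)
```

Marker files such as _SUCCESS are skipped because they do not match the part-* pattern.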