Hive with a Python transform: "cannot recognize input near 'transform'" error

Ale*_*ord 4 python hadoop hive

I have a Hive table that tracks the status of objects as they move through the stages of a process. The table looks like this:

hive> desc journeys;
object_id           string                                      
journey_statuses    array<string>

Here is a typical record:

12345678    ["A","A","A","B","B","B","C","C","C","C","D"]

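The records in the table were generated with collect_list in Hive 0.13, and the statuses are ordered (if order didn't matter, I would have used collect_set). For each object_id, I want to abbreviate the journey by collapsing consecutive repeats, so that it returns the journey statuses in the order they appear.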

I wrote a quick Python script that reads from stdin:

#!/usr/bin/env python
import sys
import itertools

for line in sys.stdin:
    inputList = eval(line.strip())
    readahead = iter(inputList)
    next(readahead)
    result = []
    for id, (a, b) in enumerate(itertools.izip(inputList, readahead)):
        if id == 0:
          result.append(a)
        if a != b:
          result.append(b)
    print result

I intend to use it in a Hive transform call. It seems to work when run locally:

$ echo '["A","A","A","B","B","B","C","C","C","C","D"]' | python abbreviate_list.py
['A', 'B', 'C', 'D']

However, when I add the file and try to run it in Hive, I get an error:

hive> add file abbreviateList.py;                                                                           
Added resource: abbreviateList.py

hive> select
    >   object_id,
    >   transform(journey_statuses) using 'python abbreviateList.py' as journey_statuses_abbreviated
    > from journeys;
NoViableAltException( ... wall of Java error messages ... )
FAILED: ParseException line 3:2 cannot recognize input near 'transform' '(' 'journey_statuses' in select expression

Can you see what I'm doing wrong?

rch*_*ang 5

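Apparently you can't select other fields alongside the transform when they aren't part of it (object_id in your example). This other SO question seems to address the issue indirectly: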

How to select a column and do a TRANSFORM in Hive?

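In theory, you could modify the Python script to accept object_id as an additional input and, if you need it in the output, pass it through unchanged as another output field.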
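A rough sketch of that pass-through idea, not a tested drop-in: it assumes Hive sends each row to the script as tab-separated text with object_id first, and that the statuses column arrives as a JSON-style list literal like the one in your local echo test. I have not verified how Hive actually serializes array<string> for a transform, so the parsing may need adjusting. The helper names abbreviate and process_line are made up for illustration, and I swapped your izip loop for itertools.groupby, which also copes with empty and single-element lists.

```python
#!/usr/bin/env python
"""Sketch: collapse consecutive duplicate statuses, passing object_id through."""
import itertools
import json
import sys


def abbreviate(statuses):
    """Keep one entry per run of consecutive equal statuses, preserving order."""
    return [status for status, _run in itertools.groupby(statuses)]


def process_line(line):
    """Turn 'object_id<TAB>statuses' into 'object_id<TAB>abbreviated statuses'.

    ASSUMPTION: the statuses column arrives as a JSON-style list literal,
    as in the local echo test; adjust the parsing if Hive sends another format.
    """
    object_id, raw_statuses = line.rstrip("\n").split("\t", 1)
    statuses = json.loads(raw_statuses)
    return "{0}\t{1}".format(object_id, json.dumps(abbreviate(statuses)))


if __name__ == "__main__":
    # Hive streams rows on stdin and reads tab-separated rows back from stdout.
    for line in sys.stdin:
        if line.strip():
            print(process_line(line))
```

Both columns would then go inside the transform, with nothing selected outside it, along the lines of: select transform(object_id, journey_statuses) using 'python abbreviateList.py' as (object_id, journey_statuses_abbreviated) from journeys;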