How do I access Hive from Python?

Question

The example at https://cwiki.apache.org/confluence/display/Hive/HiveClient#HiveClient-Python appears to be outdated.

When I add this to /etc/profile:

export PYTHONPATH=$PYTHONPATH:/usr/lib/hive/lib/py

I can then do the imports listed in the link, with the exception of from hive import ThriftHive, which actually needs to be:

from hive_service import ThriftHive

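Next, the port in the example was 10000, which caused the program to hang when I tried it. The default Hive Thrift port is 9083, and using it stopped the hanging.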

So I set it up like this:

from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol
try:
    transport = TSocket.TSocket('<node-with-metastore>', 9083)
    transport = TTransport.TBufferedTransport(transport)
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    client = ThriftHive.Client(protocol)
    transport.open()
    client.execute("CREATE TABLE test(c1 int)")

    transport.close()
except Thrift.TException, tx:
    print '%s' % (tx.message)

I received the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/hive/lib/py/hive_service/ThriftHive.py", line 68, in execute
    self.recv_execute()
  File "/usr/lib/hive/lib/py/hive_service/ThriftHive.py", line 84, in recv_execute
    raise x
thrift.Thrift.TApplicationException: Invalid method name: 'execute'

But inspecting the ThriftHive.py file shows that the execute method does exist in the Client class.

How can I use Python to access Hive?

Answer 1

Tri*_*eid 44

I believe the easiest way is to use PyHive.

To install it, you need these libraries:

pip install sasl
pip install thrift
pip install thrift-sasl
pip install PyHive

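Note that although you install the library as PyHive, you import the module as pyhive, all lower case.

If you are on Linux, you may need to install SASL separately before running the commands above. Install the package libsasl2-dev using apt-get, yum, or whatever package manager your distribution uses. For Windows, there are some options on GNU.org where you can download a binary installer. On a Mac, SASL should be available if you have installed the Xcode developer tools (xcode-select --install in Terminal).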
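Before trying to connect, it can help to confirm that the four packages actually import. The following is a minimal sketch that only checks importability; the helper name missing_modules is made up for illustration, and the list uses the import names (for example thrift_sasl rather than thrift-sasl).

```python
import importlib.util

# Import names (not pip names) of the packages installed above.
REQUIRED_MODULES = ["sasl", "thrift", "thrift_sasl", "pyhive"]


def missing_modules(names):
    """Return the names from `names` that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Not importable:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

If sasl shows up as missing, the system SASL development package mentioned above is a likely cause, since the pip build needs it.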

After installation, you can connect to Hive like this:

from pyhive import hive
conn = hive.Connection(host="YOUR_HIVE_HOST", port=PORT, username="YOU")

Now that you have the Hive connection, you have options for how to use it. You can query directly:

cursor = conn.cursor()
cursor.execute("SELECT cool_stuff FROM hive_table")
for result in cursor.fetchall():
  use_result(result)

...or use the connection to build a Pandas dataframe:

import pandas as pd
df = pd.read_sql("SELECT cool_stuff FROM hive_table", conn)
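Both fetchall() and pd.read_sql() load the whole result set into memory at once. For large tables, one option is to read in batches with fetchmany(), which is part of the Python DB-API that PyHive cursors follow. The sketch below is an illustration only: iter_rows is a made-up helper name, and an in-memory sqlite3 database stands in for the Hive connection so the example runs anywhere.

```python
import sqlite3


def iter_rows(cursor, batch_size=1000):
    """Yield rows from an already-executed DB-API cursor, one batch at a time."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        for row in batch:
            yield row


if __name__ == "__main__":
    # Stand-in connection; with PyHive this would be the `conn` created above.
    demo_conn = sqlite3.connect(":memory:")
    demo_cursor = demo_conn.cursor()
    demo_cursor.execute("CREATE TABLE hive_table (cool_stuff INTEGER)")
    demo_cursor.executemany(
        "INSERT INTO hive_table VALUES (?)", [(i,) for i in range(5)]
    )
    demo_cursor.execute("SELECT cool_stuff FROM hive_table ORDER BY cool_stuff")
    for row in iter_rows(demo_cursor, batch_size=2):
        print(row)
```

With a real Hive cursor, the same generator would be used after cursor.execute(...), and batch_size can be tuned to the row width.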

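I had some trouble connecting to HiveServer2 on Debian. The error was: "SASL authentication failure: No worthy mechs found". I had to install the libsasl2-modules package (via apt-get) to get it working. (2 upvotes)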

Answer 2

hus*_*lta 24

I assume you are using HiveServer2, which is why the code does not work.

You can use pyhs2 to access Hive correctly. Example code:

import pyhs2

with pyhs2.connect(host='localhost',
               port=10000,
               authMechanism="PLAIN",
               user='root',
               password='test',
               database='default') as conn:
    with conn.cursor() as cur:
        #Show databases
        print cur.getDatabases()

        #Execute query
        cur.execute("select * from table")

        #Return column info from query
        print cur.getSchema()

        #Fetch table results
        for i in cur.fetch():
            print i

Note that you may need to install python-devel.x86_64 and cyrus-sasl-devel.x86_64 before installing pyhs2 with pip.

Hope this helps.

Reference: https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2#SettingUpHiveServer2-PythonClientDriver
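When it is unclear which port a cluster exposes, a quick TCP probe with a timeout shows whether anything is listening, without the risk of a client hanging. This is a rough sketch using only the standard library; port_is_open is a made-up helper, 'localhost' is a placeholder host, and the two ports are simply the ones mentioned in this thread. An open port does not prove which service is behind it.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder host; replace with the node you are trying to reach.
    for port in (10000, 9083):
        state = "open" if port_is_open("localhost", port) else "closed or unreachable"
        print(port, state)
```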

Answer 3

pyt*_*ter 12

The Python program below should work for accessing Hive tables from Python:

import commands

cmd = "hive -S -e 'SELECT * FROM db_name.table_name LIMIT 1;' "

status, output = commands.getstatusoutput(cmd)

if status == 0:
   print output
else:
   print "error"

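+1. This can be good in some quick-and-dirty cases when you cannot install external yum or pip packages on a server. (4 upvotes)
@python-starter, your method works only when Hive resides on the same server where Python is installed. If you are accessing Hive tables on a remote server, I guess something else is required. (2 upvotes)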
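The commands module used in this answer exists only in Python 2. On Python 3, a rough equivalent can be written with subprocess.run, passing the arguments as a list so that no shell quoting of the query is needed. This is a sketch under those assumptions: build_hive_command and run_command are made-up helper names, and the query is the placeholder from the answer.

```python
import subprocess


def build_hive_command(query):
    """Build the argument list for the `hive -S -e <query>` call used above."""
    return ["hive", "-S", "-e", query]


def run_command(argv):
    """Run a command without a shell; return (exit status, stdout and stderr as text)."""
    completed = subprocess.run(
        argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    return completed.returncode, completed.stdout


if __name__ == "__main__":
    argv = build_hive_command("SELECT * FROM db_name.table_name LIMIT 1;")
    try:
        status, output = run_command(argv)
    except FileNotFoundError:
        print("hive executable not found on this machine")
    else:
        print(output if status == 0 else "error")
```

As the comment above notes, this still requires the hive CLI to be installed on the machine running the script.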

Answer 4

Nav*_*ani 6

You can use the hive library; to do so, import the ThriftHive class with from hive import ThriftHive.

Try this example:

import sys

from hive import ThriftHive
from hive.ttypes import HiveServerException

from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol

try:
  transport = TSocket.TSocket('localhost', 10000)
  transport = TTransport.TBufferedTransport(transport)
  protocol = TBinaryProtocol.TBinaryProtocol(transport)
  client = ThriftHive.Client(protocol)
  transport.open()
  client.execute("CREATE TABLE r(a STRING, b INT, c DOUBLE)")
  client.execute("LOAD DATA LOCAL INPATH '/path' INTO TABLE r")
  client.execute("SELECT * FROM r")
  while (1):
    row = client.fetchOne()
    if (row == None):
       break
    print row

  client.execute("SELECT * FROM r")
  print client.fetchAll()
  transport.close()
except Thrift.TException, tx:
  print '%s' % (tx.message)

Answer 5

小智 6

To connect with a username/password and specify the port, the code looks like this:

from pyhive import presto

cursor = presto.connect(host='host.example.com',
                    port=8081,
                    username='USERNAME:PASSWORD').cursor()

sql = 'select * from table limit 10'

cursor.execute(sql)

print(cursor.fetchone())
print(cursor.fetchall())

Answer 6

Ole*_*nko 6

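It is common practice to prohibit users from downloading and installing packages and libraries on cluster nodes. In that case, the solutions from @python-starter and @goks work perfectly if Hive runs on the same node. Otherwise, you can use the command-line tool beeline instead of hive.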

#python 2
import commands

cmd = 'beeline -u "jdbc:hive2://node07.foo.bar:10000/...<your connect string>" -e "SELECT * FROM db_name.table_name LIMIT 1;"'

status, output = commands.getstatusoutput(cmd)

if status == 0:
   print output
else:
   print "error"

#python 3
import subprocess

cmd = 'beeline -u "jdbc:hive2://node07.foo.bar:10000/...<your connect string>" -e "SELECT * FROM db_name.table_name LIMIT 1;"'

status, output = subprocess.getstatusoutput(cmd)

if status == 0:
   print(output)
else:
   print("error")
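The output captured above is plain text. If beeline is called with --outputformat=csv2 and --silent=true added to the command, the result comes back as comma-separated lines with a header row, which the csv module can turn into records. The sketch below assumes that output shape; parse_csv2 is a made-up helper name and the sample text is invented for illustration.

```python
import csv
import io


def parse_csv2(output):
    """Parse CSV text with a header row into a list of dicts, one per data row."""
    return list(csv.DictReader(io.StringIO(output)))


if __name__ == "__main__":
    # Invented sample standing in for the `output` string returned by beeline.
    sample_output = "id,name\n1,alice\n2,bob\n"
    for record in parse_csv2(sample_output):
        print(record["id"], record["name"])
```

All values come back as strings, so numeric columns need explicit conversion.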

Answer 7

小智 6

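This is a generic approach that makes things easy for me, because I keep connecting to several servers (SQL, Teradata, Hive, etc.) from Python. For that reason I use the pyodbc connector. Here are some basic steps to get going with pyodbc, in case you have never used it:

Prerequisite: before following the steps below, you should have the relevant ODBC connection configured in your Windows setup.

Once that is done: STEP 1. Install with pip: pip install pyodbc (the relevant driver can be downloaded from Microsoft's website)

STEP 2. Now, import it in your Python script: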

import pyodbc

STEP 3. Finally, provide the connection details as follows:

# ODBC connection strings separate attributes with semicolons, not commas
conn_hive = pyodbc.connect('DSN=YOUR_DSN_NAME;SERVER=YOUR_SERVER_NAME;UID=USER_ID;PWD=PSWD')

The best part of using pyodbc is that I only have to import one package to connect to almost any data source.
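Since the same pattern is reused across several data sources, the connection string can be assembled from keyword arguments. This is a small sketch, not part of pyodbc itself: build_connection_string is a made-up helper, the attribute values are the placeholders from STEP 3, and values containing semicolons or braces would need extra escaping that is not handled here.

```python
def build_connection_string(**attributes):
    """Join ODBC attributes as KEY=value pairs separated by semicolons."""
    return ";".join("{}={}".format(key, value) for key, value in attributes.items())


if __name__ == "__main__":
    conn_str = build_connection_string(
        DSN="YOUR_DSN_NAME", UID="USER_ID", PWD="PSWD"
    )
    print(conn_str)
    # With pyodbc installed and the DSN configured, the string would be used as:
    #   import pyodbc
    #   conn_hive = pyodbc.connect(conn_str, autocommit=True)
    # autocommit=True is an assumption here; it is often reported as necessary
    # for Hive because Hive ODBC drivers may not support transactions.
```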
