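I am trying to read a table from Hive (Hortonworks) that holds collected Twitter data, for use in a machine learning project. I am using pyhive because pyhs2 is not supported on Python 3.6.
Here is my code: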
from pyhive import hive
conn = hive.Connection(host='192.168.1.11', port=10000, auth='NOSASL')
import pandas as pd
import sys
df = pd.read_sql("SELECT * FROM my_table", conn)
print(sys.getsizeof(df))
df.head()
I get this error:
Traceback (most recent call last):
File "C:\Users\PWST112\Desktop\import.py", line 44, in <module>
conn = hive.Connection(host='192.168.1.11', port=10000, auth='NOSASL')
File "C:\Users\PWST112\AppData\Local\Programs\Python\Python36\lib\site-packages\pyhive\hive.py", line 164, in __init__
response = self._client.OpenSession(open_session_req)
File "C:\Users\PWST112\AppData\Local\Programs\Python\Python36\lib\site-packages\TCLIService\TCLIService.py", line 187, in OpenSession
return self.recv_OpenSession()
File "C:\Users\PWST112\AppData\Local\Programs\Python\Python36\lib\site-packages\TCLIService\TCLIService.py", line 199, in recv_OpenSession
(fname, mtype, rseqid) = iprot.readMessageBegin()
File "C:\Users\PWST112\AppData\Local\Programs\Python\Python36\lib\site-packages\thrift\protocol\TBinaryProtocol.py", line 148, in readMessageBegin
name …
I am trying to connect, from outside the container, to a HiveServer2 instance running inside a Docker container, using the DB-API (asynchronous) example with Python (PyHive 0.5, Python 2.7):
from pyhive import hive
conn = hive.connect(host='172.17.0.2', port='10001', auth='NOSASL')
However, I get the following error:
Traceback (most recent call last):
File "py_2.py", line 4, in <module>
conn = hive.connect(host='172.17.0.2', port='10001', auth='NOSASL')
File "/home/foodie/anaconda2/lib/python2.7/site-packages/pyhive/hive.py", line 64, in connect
return Connection(*args, **kwargs)
File "/home/foodie/anaconda2/lib/python2.7/site-packages/pyhive/hive.py", line 164, in __init__
response = self._client.OpenSession(open_session_req)
File "/home/foodie/anaconda2/lib/python2.7/site-packages/TCLIService/TCLIService.py", line 187, in OpenSession
return self.recv_OpenSession()
File "/home/foodie/anaconda2/lib/python2.7/site-packages/TCLIService/TCLIService.py", line 199, in recv_OpenSession
(fname, mtype, rseqid) = iprot.readMessageBegin()
File "/home/foodie/anaconda2/lib/python2.7/site-packages/thrift/protocol/TBinaryProtocol.py", line 148, in readMessageBegin …
I am running multiple queries on Hive. I have a Hadoop cluster with 6 nodes, and the cluster has 21 vcores in total.
I need to allocate only 2 cores to one Python process so that the remaining cores can be used by another main process.
Code
from pyhive import hive
hive_host_name = "subdomain.domain.com"
hive_port = 20000
hive_user = "user"
hive_password = "password"
hive_database = "database"
conn = hive.Connection(host=hive_host_name, port=hive_port,username=hive_user, database=hive_database, configuration={})
cursor = conn.cursor()
cursor.execute('select count(distinct field) from somedata')
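The `configuration` argument in the code above is empty. PyHive passes that dictionary to the server as session-level Hive settings. One possible direction, sketched here under assumptions, is to route the session's jobs to a YARN queue that a cluster administrator has capped at the desired number of vcores. The helper name `queue_configuration` and the queue name `small` are hypothetical; `mapreduce.job.queuename` and `tez.queue.name` are the usual queue properties for the MapReduce and Tez engines. Whether this limits cores as intended depends on how the scheduler and the queue are configured.

```python
def queue_configuration(queue_name, engine="mr"):
    """Build Hive session settings that route queries to one YARN queue.

    The returned dict is meant to be passed as `configuration=` to
    pyhive's hive.Connection. The queue itself (and its vcore cap) is
    assumed to already exist on the cluster.
    """
    if engine == "mr":
        return {"mapreduce.job.queuename": queue_name}
    if engine == "tez":
        return {"tez.queue.name": queue_name}
    raise ValueError("unsupported engine: {!r}".format(engine))
```

It could then be used as `hive.Connection(host=hive_host_name, port=hive_port, username=hive_user, database=hive_database, configuration=queue_configuration("small"))`.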
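I am trying to install the sasl3-0.2.11 Python package on a Windows 10 machine (64-bit). It fails with fatal error C1083.
Because of some proxies that I cannot avoid, I install packages by downloading the tar.gz from PyPI, changing into the extracted folder, and running python setup.py install.
This approach works for every module except sasl.
I then read this helpful comment, but the Cyrus SASL .whl files do not work either. They support Python 3.7, not 3.8.
I would really like to know how to work around this, or whether I can use PyHive without sasl.
Thanks in advance.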
努鲁
With pyhive, is it possible to execute multiple HQL statements in one call, for example 'CREATE TABLE TABLE1 (ITEM_KEY BIGINT);CREATE TABLE TABLE2 (ITEM_NAME BIGINT);'?
Sample code
from pyhive import hive
conn = hive.Connection(host=host, port=port, username=user,
                       password=passwd, auth=auth)
cursor = conn.cursor()
query = 'CREATE TABLE TABLE1 (ITEM_KEY BIGINT );CREATE TABLE TABLE2 (ITEM_NAME BIGINT );'
cursor.execute(query)
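A minimal sketch of one workaround, assuming the server accepts only one statement per `execute` call: split the string and send the statements one at a time. The helper names `split_statements` and `execute_script` are hypothetical, and the split on `;` is naive, so it would break on semicolons inside string literals or comments.

```python
def split_statements(script):
    """Split a script on ';' and drop empty fragments (naive: ignores quoting)."""
    return [part.strip() for part in script.split(';') if part.strip()]


def execute_script(cursor, script):
    """Run each statement of `script` with its own cursor.execute call.

    `cursor` is any DB-API style cursor, for example the one returned by
    conn.cursor() on a pyhive connection. Returns the statements executed.
    """
    statements = split_statements(script)
    for statement in statements:
        cursor.execute(statement)
    return statements
```

With the sample code above, this would be called as `execute_script(cursor, query)`.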
I am trying to create a table in a Hive database using the SQLAlchemy ORM. My setup is Python 3.6 with PyHive==0.6.1 and SQLAlchemy==1.2.11 (and their respective dependencies) and Hive 1.1.0-cdh5.15.1.
My approach is as follows:
from sqlalchemy import create_engine
host = 'localhost'
port = 10000
database = 'foo'
engine = create_engine(f'hive://{host}:{port}')
engine.execute(f'CREATE DATABASE {database}')
engine.execute(f'USE {database}')
Connecting to Hive and creating a new database works fine. At this point I create the data model:
from sqlalchemy import Column
from sqlalchemy import String
from sqlalchemy import Integer
from sqlalchemy.ext.declarative import declarative_base
ModelBase = declarative_base()
class TestTable(ModelBase):
    __tablename__ = 'test_table'
    id = Column(Integer, primary_key=True)
    text = Column(String(32), index=True)
I then try:
ModelBase.metadata.create_all(engine)
This fails, because the following exception is raised:
OperationalError: (pyhive.exc.OperationalError) TExecuteStatementResp(status=TStatus(statusCode=3, infoMessages=["*org.apache.hive.service.cli.HiveSQLException:Error while …
Following this link, I am trying to connect to a remote Hive instance. The code I used and the error message I received are given below.
Code
from pyhive import hive
conn = hive.Connection(host="10.111.22.11", port=10000, username="user1" ,database="default")
Error message
Could not connect to any of [('10.111.22.11', 10000)]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/anaconda3/lib/python3.6/site-packages/pyhive/hive.py", line 131, in __init__
self._transport.open()
File "/opt/anaconda3/lib/python3.6/site-packages/thrift_sasl/__init__.py", line 61, in open
self._trans.open()
File "/opt/anaconda3/lib/python3.6/site-packages/thrift/transport/TSocket.py", line 113, in open
raise TTransportException(TTransportException.NOT_OPEN, msg)
thrift.transport.TTransport.TTransportException: Could not connect to any of [('10.111.22.11', 10000)]
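What else is required for a successful connection? I can connect to the server directly (using PuTTY) and run Hive there. But when I try from another server X, I get this error. I can also ping the Hive server from server X.
Could the port number be the problem? How can I check the correct port number?
As discussed in the answer below, I tried starting hiveserver2, but the command does not seem to work. Any help is much appreciated.
When I execute a query from the Hive shell, the port I see in the logs is 8088 …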
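A successful ping only shows that the host is reachable. It does not show that anything accepts TCP connections on port 10000. The sketch below, using only the standard library, checks whether a given port accepts a connection from server X. The function name `port_is_open` is hypothetical. Note that 10000 is the default HiveServer2 Thrift port, while 8088 is by default the YARN ResourceManager web UI, so the latter is probably not the port PyHive needs.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Running `port_is_open("10.111.22.11", 10000)` from server X would return `False` if HiveServer2 is not running or a firewall blocks the port, which would match the `Could not connect to any of [...]` error above.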
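I created a Presto cluster using AWS EMR, with all default configurations. I want to write a Python script on the master node that submits queries to Presto and retrieves the results.
I found the PyHive library, but I do not know what to put in the connection string: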
from pyhive import presto # or import hive
cursor = presto.connect('localhost').cursor()
statement = 'SELECT * FROM my_awesome_data LIMIT 10'
cursor.execute(statement)
my_results = cursor.fetchall()
I assumed localhost would be correct, since I am running the script on the master node of the Presto cluster, but I get an error:
OperationalError: Unexpected status code 404
b'<!DOCTYPE html><html><head><title>Apache Tomcat/8.0.45 - Error report</title><style type="text/css">H1 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;} H2 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:16px;} H3 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;} BODY {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;} B {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;} P {font-family:Tahoma,Arial,sans-serif;background:white;color:black;font-size:12px;}A {color : black;}A.name {color : black;}.line {height: 1px; background-color: #525D76; border: none;}</style> </head><body><h1>HTTP Status 404 - /v1/statement</h1><div class="line"></div><p><b>type</b> Status report</p><p><b>message</b> …
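The 404 page above was produced by Apache Tomcat, which suggests that the port PyHive used by default (8080) is served by something other than Presto on this node. A rough way to find the right port, sketched here under assumptions, is to probe candidate ports for the `/v1/info` endpoint that a Presto coordinator exposes. The function name `looks_like_presto` is hypothetical, and 8889 is only an assumption about where EMR places Presto.

```python
import http.client
import json
import urllib.request


def looks_like_presto(host, port, timeout=3.0):
    """Return True if http://host:port/v1/info answers with a JSON object."""
    url = "http://{}:{}/v1/info".format(host, port)
    # Bypass any configured HTTP proxy: the target is a cluster-local host.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            payload = json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError, http.client.HTTPException):
        # Connection errors, HTTP error statuses (such as the 404 above),
        # and non-JSON bodies all mean "not a Presto coordinator here".
        return False
    return isinstance(payload, dict)
```

If, for example, `looks_like_presto("localhost", 8889)` returns `True`, the connection in the script would become `presto.connect('localhost', port=8889).cursor()`.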