psycopg2 Postgres COPY EXPERT to Pandas read_csv using an in-memory buffer fails with ValueError

Question

I'm running the following code with the psycopg2 driver in Python 3.5 to load query results into pandas 0.19.x.

 buf = io.StringIO()
 cursor = conn.cursor()
 sql_query = 'COPY ('+ base_sql + ' limit 100) TO STDOUT WITH CSV HEADER'
 cursor.copy_expert(sql_query, buf)
 df = pd.read_csv(buf.getvalue(),engine='c')
 buf.close()

When reading from the in-memory buffer, read_csv blows up:

pandas\parser.pyx in pandas.parser.TextReader.__cinit__ (pandas\parser.c:4175)()

pandas\parser.pyx in pandas.parser.TextReader._setup_parser_source (pandas\parser.c:8333)()

C:\Users\....\AppData\Local\Continuum\Anaconda3\lib\genericpath.py in exists(path)
     17     """Test whether a path exists.  Returns False for broken symbolic links"""
     18     try:
---> 19         os.stat(path)
     20     except OSError:
     21         return False

ValueError: stat: path too long for Windows

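Uh... what path? buf is in memory. What am I missing here?

FYI, the COPY itself seems to work as expected.

Solution code below

Thanks to the answer below, my query speed doubled and my memory usage dropped by 500% with this approach. Here is my final test code, included to help others with their performance problems. I'd love to see any code that improves on it! Please be sure to link back to this question in yours.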

# COPY TO CSV quick and dirty performance test
import io
import time
from urllib.parse import urlparse

import pandas as pd
import psycopg2
from sqlalchemy import create_engine

# user_id, pswd and my_sql_query are placeholders defined elsewhere
start = time.time()
conn_str_copy = r'postgresql+psycopg2://' + user_id + r":" + pswd + r"@xxx.xxx.xxx.xxx:ppppp/my_database"
result = urlparse(conn_str_copy)
username = result.username
password = result.password
database = result.path[1:]
hostname = result.hostname

size = 2**30  # buffer size passed to copy_expert
buf = io.BytesIO()
# buf = io.StringIO()

engine = create_engine(conn_str_copy)
conn_copy = psycopg2.connect(
    database=database, user=username, password=password, host=hostname)

cursor_copy = conn_copy.cursor()
sql_query = 'COPY (' + my_sql_query + ' ) TO STDOUT WITH CSV HEADER'
cursor_copy.copy_expert(sql_query, buf, size)
print('time:', (time.time() - start)/60, 'minutes or ', time.time() - start, 'seconds')
buf.seek(0)  # rewind: copy_expert leaves the position at the end of the buffer
df = pd.read_csv(buf, engine='c', low_memory=False)
buf.close()
print('time:', (time.time() - start)/60, 'minutes or ', time.time() - start, 'seconds')

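Copying the data out of Postgres took about 4 minutes, and loading it into a pandas DataFrame took under 30 seconds. Note that this COPY interface (copy_expert) is a feature of the psycopg2 driver and may not be available in other drivers.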
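
The COPY-then-parse pattern can be wrapped in a small helper. This is only a minimal sketch, not the code used in the test above: the name `copy_to_dataframe` is hypothetical, `cursor` is assumed to be a psycopg2 cursor (anything exposing `copy_expert(sql, file)` would do), and `select_sql` is assumed to be a trusted SELECT statement without a trailing semicolon, since it is concatenated into the COPY command rather than escaped.

```python
import io

import pandas as pd


def copy_to_dataframe(cursor, select_sql):
    """Run select_sql through COPY ... TO STDOUT and parse the CSV with pandas.

    Hypothetical helper: `cursor` is assumed to be a psycopg2 cursor, and
    `select_sql` a trusted SELECT statement (it is concatenated, not escaped).
    """
    copy_sql = "COPY (" + select_sql + ") TO STDOUT WITH CSV HEADER"
    with io.StringIO() as buf:
        cursor.copy_expert(copy_sql, buf)  # writes CSV text into buf
        buf.seek(0)                        # rewind before reading
        return pd.read_csv(buf)
```

With a live connection this would be called as `df = copy_to_dataframe(conn.cursor(), my_sql_query)`. Keeping the buffer inside the function means it is closed as soon as the DataFrame has been built.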

Answer 1

Jea*_*bre 3

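You have to pass a file handle or a file name to pandas.read_csv().

Passing buf.getvalue() makes pandas read_csv believe you are passing a file name, because the object has no read method. Here, however, the "file name" is the entire contents of the buffer, so it is considered too long (Windows file names are limited to 255 characters).

You almost had it. Since buf is already a file-like object, just pass it as-is. One small detail: you have to rewind it first, because the preceding call to cursor.copy_expert(sql_query, buf) probably used write, which leaves the position of buf at the end (try it without the rewind and you will probably get an empty DataFrame).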
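
To illustrate the distinction: a plain `str` is treated as a location to open, whereas an object with a `read` method is read directly. Here is a small sketch, with made-up sample data, of the file-like route for text that is already in memory:

```python
import io

import pandas as pd

csv_text = "id,name\n1,alice\n2,bob\n"  # made-up sample data

# Wrapping the text in StringIO gives pandas an object with .read(),
# so the contents are parsed instead of being interpreted as a path.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```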

buf.seek(0)  # rewind because you're at the end of the buffer
df = pd.read_csv(buf,engine='c')
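
The effect of the stream position can be seen without a database. This is a minimal sketch using only the standard library, with made-up CSV text standing in for whatever `copy_expert` writes:

```python
import io

buf = io.StringIO()
buf.write("id,name\n1,alice\n")  # stand-in for what copy_expert writes

at_end = buf.read()        # position is at the end, so nothing is left to read
buf.seek(0)                # rewind to the start of the buffer
after_rewind = buf.read()  # now the full contents come back

print(repr(at_end))
print(repr(after_rewind))
```

Any reader that consumes the buffer from its current position, `read_csv` included, sees the same thing: no data until the buffer is rewound.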
