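I have a very large .xz file (several GB) full of plain text. I want to process the text to create a custom dataset, and because the file is so large I want to read it line by line. Does anyone know how to do this?
I have already tried the approach from "How to open and read LZMA file in-memory", but it does not work.
EDIT: I get the error "'ascii' codec can't decode byte 0xfd in position 0: ordinal not in range(128)" on the line "for line in uncompressed:" taken from the linked answer.
EDIT2: My code (using Python 3.5):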
with open(filename) as compressed:
    with lzma.LZMAFile(compressed) as uncompressed:
        for line in uncompressed:
            print(line)
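The `'ascii' codec can't decode byte 0xfd` error is consistent with `open(filename)` opening the compressed file in text mode: `0xfd` is the first byte of the .xz header, so the failure occurs before any decompression takes place. A minimal sketch, assuming the decompressed text is UTF-8 (the encoding is an assumption, and the name `iter_xz_lines` is hypothetical), lets `lzma.open` handle both decompression and decoding and yields one line at a time:

```python
import lzma


def iter_xz_lines(path, encoding="utf-8"):
    """Yield decoded lines from an .xz file without loading it all into memory."""
    # Mode "rt" makes lzma.open return a text-mode wrapper that decompresses
    # incrementally and decodes with the given encoding (assumed UTF-8 here).
    with lzma.open(path, mode="rt", encoding=encoding) as uncompressed:
        for line in uncompressed:
            # Strip only the trailing newline; keep other whitespace intact.
            yield line.rstrip("\n")
```

Because this is a generator, a loop such as `for line in iter_xz_lines(filename): ...` holds only one line in memory at a time, which is the property that matters for a multi-gigabyte file.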
I am trying to open a .csv file that has been compressed as a .lzma file on Linux, using the following code:
import lzma
import pandas as pd

myfile = '/home/stacey/work/roll_158_oe_2018-03-02/BBG.XTKS.8219.S/inst.BBG.XTKS.8219.S.csv.lzma'
with lzma.open(myfile, 'rt') as f:
    pair_info = pd.read_csv(f, engine='c', header=0, index_col=0)
where myfile is a path that exists on the Linux machine.
However, I get this error:
with lzma.open(stock,'rt') as f:
AttributeError: 'module' object has no attribute 'open'
I then tried the following instead:
import lzma
import pandas as pd

myfile = '/home/stacey/work/roll_158_oe_2018-03-02/BBG.XTKS.8219.S/inst.BBG.XTKS.8219.S.csv.lzma'
with open(myfile) as compressed:
    with lzma.LZMAFile(compressed, 'r') as uncompressed:
        for line in uncompressed:
            print(line)
But I get this error:
with lzma.LZMAFile(compressed,'r') as uncompressed:
TypeError: coercing to Unicode: need string or buffer, file found
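Both messages above ("'module' object has no attribute 'open'" and "coercing to Unicode: need string or buffer, file found") are worded the way Python 2 reports these errors, and `lzma.open` is part of the standard library only from Python 3.3 onward, so the script may be running under a Python 2 interpreter with a third-party `lzma` module. A sketch under that assumption, meant to be run with Python 3: pass the path itself, not an already opened file object, to `lzma.open` in text mode. The helper name `read_lzma_csv` is hypothetical, UTF-8 is an assumed encoding, and the standard `csv` module stands in for pandas to keep the example free of dependencies.

```python
import csv
import lzma


def read_lzma_csv(path, encoding="utf-8"):
    """Return (header, rows) from an .lzma- or .xz-compressed CSV file."""
    # lzma.open takes the path directly. When reading, the container format
    # (.xz or legacy .lzma) is auto-detected by default.
    # newline="" is what the csv module recommends for files it parses.
    with lzma.open(path, mode="rt", encoding=encoding, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, [])  # empty list if the file has no rows
        rows = list(reader)
    return header, rows
```

With pandas installed, the same text-mode file object can be handed to `pd.read_csv(f, header=0, index_col=0)` instead of `csv.reader`, as in the first attempt above.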
I also tried:
import pandas as pd
import lzma …