Chr*_*ris 1 python tar tarfile python-3.x
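I'm new to Python. I'm having trouble reading the contents of a tarfile into Python.
The data are the contents of a journal article (hosted at PubMed Central). See the info below, and the link to the tarfile that I want to read into Python.
http://www.pubmedcentral.nih.gov/utils/oa/oa.fcgi?id=PMC13901 ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz
I have a list of similar .tar.gz files that I will eventually want to read as well. I think (know) all of the tarfiles have a .nxml file associated with them. It is the contents of the .nxml files that I'm actually interested in extracting/reading. Open to any suggestions on the best way to do this...
Here is what I have if I save the tarfile to my PC. It all runs as expected.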
import tarfile

tarfile_name = "F:/PMC_OA_TextMining/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz"
tfile = tarfile.open(tarfile_name)
tfile_members = tfile.getmembers()
tfile_members1 = []
for i in range(len(tfile_members)):
    tfile_members_name = tfile_members[i].name
    tfile_members1.append(tfile_members_name)
tfile_members2 = []
for i in range(len(tfile_members1)):
    if tfile_members1[i].endswith('.nxml'):
        tfile_members2.append(tfile_members1[i])
tfile_extract1 = tfile.extractfile(tfile_members2[0])
tfile_extract1_text = tfile_extract1.read()
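I learned today that in order to access the tarfile directly from the PubMed Central FTP site, I have to set up a network request using urllib. Below is the revised code (and a link to the Stack Overflow answer I received):
Read contents of .tar.gz file from website into a python 3.x object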
import tarfile
import urllib.request

tarfile_name = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz"
ftpstream = urllib.request.urlopen(tarfile_name)
tfile = tarfile.open(fileobj=ftpstream, mode="r|gz")
However, when I run the remaining piece of the code (below), I get an error message ("seeking backwards is not allowed"). How come?
tfile_members = tfile.getmembers()
tfile_members1 = []
for i in range(len(tfile_members)):
    tfile_members_name = tfile_members[i].name
    tfile_members1.append(tfile_members_name)
tfile_members2 = []
for i in range(len(tfile_members1)):
    if tfile_members1[i].endswith('.nxml'):
        tfile_members2.append(tfile_members1[i])
tfile_extract1 = tfile.extractfile(tfile_members2[0])
tfile_extract1_text = tfile_extract1.read()
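The code fails on the last line, where I try to read the .nxml content associated with my tarfile. Below is the actual error message I receive. What does it mean? What is my best workaround for reading/accessing the content of these .nxml files, which are all embedded in tarfiles?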
Traceback (most recent call last):
  File "F:\PMC_OA_TextMining\test2.py", line 135, in <module>
    tfile_extract1_text = tfile_extract1.read()
  File "C:\Python30\lib\tarfile.py", line 804, in read
    buf += self.fileobj.read()
  File "C:\Python30\lib\tarfile.py", line 715, in read
    return self.readnormal(size)
  File "C:\Python30\lib\tarfile.py", line 722, in readnormal
    self.fileobj.seek(self.offset + self.position)
  File "C:\Python30\lib\tarfile.py", line 531, in seek
    raise StreamError("seeking backwards is not allowed")
tarfile.StreamError: seeking backwards is not allowed
Thanks in advance for your help. Chris
Dam*_*ick 11
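What went wrong: Tar files are stored interleaved. They come in the order header, data, header, data, header, data, and so on. When you enumerated the files with getmembers(), you had already read through the entire file to get the headers. Then, when you asked the tarfile object to read the data, it tried to seek backward from the last header to the first data. But you can't seek backward in a network stream without closing and reopening the urllib request.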
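The failure can be reproduced without any network access. The sketch below is an illustration, not part of the original question: it builds a tiny two-member .tar.gz in memory (the member names and the helper `build_sample_archive` are made up for the demo), then runs the same getmembers()-then-read() sequence in both stream mode ("r|gz") and seekable mode ("r:gz").

```python
import io
import tarfile


def build_sample_archive() -> bytes:
    """Build a small .tar.gz in memory (a hypothetical stand-in for a PMC archive)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, payload in [("article/figure.txt", b"figure data"),
                              ("article/article.nxml", b"<article>demo</article>")]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


archive = build_sample_archive()

# "r|gz" treats the source as a forward-only stream, like the FTP response.
stream_error = None
with tarfile.open(fileobj=io.BytesIO(archive), mode="r|gz") as tfile:
    members = tfile.getmembers()  # reads every header, up to the end of the archive
    nxml = [m for m in members if m.name.endswith(".nxml")][0]
    try:
        tfile.extractfile(nxml).read()  # must jump back to the member's data
    except tarfile.StreamError as exc:
        stream_error = exc

print("stream mode:", stream_error)

# "r:gz" on a seekable object can jump back, so the same sequence works.
with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tfile:
    members = tfile.getmembers()
    nxml = [m for m in members if m.name.endswith(".nxml")][0]
    seekable_text = tfile.extractfile(nxml).read()

print("seekable mode:", seekable_text)
```

The only difference between the two halves is the mode string, which is what decides whether tarfile is allowed to seek backward.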
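How to work around it: You'll need to download the file, save a temporary copy to disk or to an in-memory buffer (io.BytesIO in Python 3, as used below; StringIO in Python 2), enumerate the files in this temporary copy, and then extract the files you want.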
#!/usr/bin/env python3
from io import BytesIO
import urllib.request
import tarfile
tarfile_url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz"
ftpstream = urllib.request.urlopen(tarfile_url)
# BytesIO creates an in-memory temporary file.
# See the Python manual: http://docs.python.org/3/library/io.html
tmpfile = BytesIO()
while True:
    # Download a piece of the file from the connection
    s = ftpstream.read(16384)
    # Once the entire file has been downloaded, read() returns b''
    # (the empty bytes) which is a falsey value
    if not s:
        break
    # Otherwise, write the piece of the file to the temporary file.
    tmpfile.write(s)
ftpstream.close()
# Now that the FTP stream has been downloaded to the temporary file,
# we can ditch the FTP stream and have the tarfile module work with
# the temporary file. Begin by seeking back to the beginning of the
# temporary file.
tmpfile.seek(0)
# Now tell the tarfile module that you're using a file object
# that supports seeking backward.
# r|gz forbids seeking backward; r:gz allows seeking backward
tfile = tarfile.open(fileobj=tmpfile, mode="r:gz")
# You want to limit it to the .nxml files
tfile_members2 = [filename
                  for filename in tfile.getnames()
                  if filename.endswith('.nxml')]
tfile_extract1 = tfile.extractfile(tfile_members2[0])
tfile_extract1_text = tfile_extract1.read()
# And when you're done extracting members:
tfile.close()
tmpfile.close()
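If holding a temporary copy is undesirable, a different technique from the one above is to stay in stream mode and read each member at the moment its header is encountered, so that no backward seek is ever needed. The following is a minimal sketch under that assumption; `first_nxml_from_stream` and `demo_archive` are hypothetical names introduced for illustration, and the demo uses an in-memory archive in place of the FTP response.

```python
import io
import tarfile


def first_nxml_from_stream(fileobj):
    """Return the bytes of the first .nxml member, reading strictly forward.

    `fileobj` is any binary file-like object with read(), for example the
    object returned by urllib.request.urlopen(). Returns None if the archive
    contains no .nxml member.
    """
    with tarfile.open(fileobj=fileobj, mode="r|gz") as tfile:
        for member in tfile:  # yields one header at a time, in archive order
            if member.isfile() and member.name.endswith(".nxml"):
                # The stream is positioned at this member's data right now,
                # so reading it here does not require seeking backward.
                return tfile.extractfile(member).read()
    return None


def demo_archive() -> bytes:
    """Build a small .tar.gz in memory so the sketch runs without a network."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, payload in [("article/figure.txt", b"figure data"),
                              ("article/article.nxml", b"<article>demo</article>")]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


result = first_nxml_from_stream(io.BytesIO(demo_archive()))
print(result)
```

The trade-off is that members can only be visited once and in order. If several members are needed in arbitrary order, the buffered copy with "r:gz" shown above remains the simpler choice.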