Read the contents of a tar file in a Python script without extracting it

ran*_*psp 72 python tar

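I have a tar file that contains multiple files. I need to write a Python script that reads the contents of those files and reports the total number of characters (letters, spaces, newlines, everything) without extracting the tar file.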

gho*_*g74 113

You can use getmembers():

>>> import tarfile
>>> tar = tarfile.open("test.tar")
>>> tar.getmembers()

After that, you can use extractfile() to get a member as a file object. Just an example:

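import os
import tarfile

os.chdir("/tmp/foo")
# Read every regular file straight from the archive; nothing is written to disk.
with tarfile.open("test.tar") as tar:
    for member in tar.getmembers():
        f = tar.extractfile(member)
        if f is None:  # directories and other non-file members have no content
            continue
        content = f.read()  # bytes; decode first if you need true character counts
        print("%s has %d newlines" % (member.name, content.count(b"\n")))
        print("%s has %d spaces" % (member.name, content.count(b" ")))
        print("%s has %d characters" % (member.name, len(content)))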

With the file object "f" from the example above, you can use read(), readlines(), and so on.
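To tie this back to the question's "total number of characters", here is a minimal sketch. The helper names `build_demo_tar` and `count_characters`, the sample file names, and the UTF-8 encoding are assumptions for illustration; the archive is built in memory only so that the example runs on its own.

```python
import io
import tarfile


def build_demo_tar():
    """Build a small in-memory tar archive (hypothetical sample data)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, text in [("a.txt", "hello world\n"), ("b.txt", "héllo\n")]:
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


def count_characters(fileobj, encoding="utf-8"):
    """Return {member name: character count} for every regular file."""
    counts = {}
    with tarfile.open(fileobj=fileobj, mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links, devices
            # Decode so multi-byte characters are counted once, not per byte.
            text = tar.extractfile(member).read().decode(encoding)
            counts[member.name] = len(text)
    return counts


counts = count_characters(build_demo_tar())
print(counts, "total:", sum(counts.values()))
```

For a tar file on disk, passing an open binary file object (for example `open("test.tar", "rb")`) should work the same way. If the files are not text, counting bytes with `len(f.read())` is the safer choice.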

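  • "for member in tar.getmembers()" can be changed to "for member in tar", which is either a generator or an iterator (I'm not sure which). Either way, it yields one member at a time. (13 upvotes)
  • I just had a similar problem, but the tarfile module seemed to eat my RAM, even though I used the `'r|'` option. (2 upvotes)
  • Ah, I solved it. Assuming you write the code as hinted by huggie, you have to "clean" the member list every now and then. So, given the code example above, that would be `tar.members = []`. More info: http://bit.ly/JKXrg6 (2 upvotes)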
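The suggestions in these comments can be combined into one loop: iterate over the archive directly, open it in stream mode (`'r|'`), and reset `tar.members` after each member so the cached TarInfo list does not grow. This is only a sketch; `build_demo_tar` and `stream_sizes` are hypothetical names, and the in-memory archive stands in for a large file or pipe.

```python
import io
import tarfile


def build_demo_tar(n=5):
    """Build a small in-memory tar archive (hypothetical sample data)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for i in range(n):
            data = ("line %d\n" % i).encode("ascii")
            info = tarfile.TarInfo(name="file%d.txt" % i)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


def stream_sizes(fileobj):
    """Read members one by one in stream mode and return (name, byte count) pairs."""
    sizes = []
    with tarfile.open(fileobj=fileobj, mode="r|") as tar:
        for member in tar:
            f = tar.extractfile(member)
            if f is not None:
                # In stream mode each member must be read before moving on.
                sizes.append((member.name, len(f.read())))
            # Drop the cached TarInfo objects, as suggested in the comment above.
            tar.members = []
    return sizes


sizes = stream_sizes(build_demo_tar())
print(sizes)
```

The trade-off is that a stream can only be read front to back, so looking members up by name afterwards is not possible in this mode.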

Ste*_*ini 11

You need to use the tarfile module. Specifically, you use an instance of the class TarFile to access the file, and then access the names with TarFile.getnames():

 |  getnames(self)
 |      Return the members of the archive as a list of their names. It has
 |      the same order as the list returned by getmembers().

If you want to read the contents, use this method:

 |  extractfile(self, member)
 |      Extract a member from the archive as a file object. `member' may be
 |      a filename or a TarInfo object. If `member' is a regular file, a
 |      file-like object is returned. If `member' is a link, a file-like
 |      object is constructed from the link's target. If `member' is none of
 |      the above, None is returned.
 |      The file-like object is read-only and provides the following
 |      methods: read(), readline(), readlines(), seek() and tell()
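A short sketch of the two methods quoted above, showing in particular that extractfile() returns None for a member that is neither a regular file nor a link. The archive layout ("docs" and "docs/readme.txt") is made up for illustration and is built in memory so the example is self-contained.

```python
import io
import tarfile

# Build a tiny archive with one directory and one file (hypothetical layout).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    directory = tarfile.TarInfo(name="docs")
    directory.type = tarfile.DIRTYPE
    tar.addfile(directory)

    data = b"first line\nsecond line\n"
    readme = tarfile.TarInfo(name="docs/readme.txt")
    readme.size = len(data)
    tar.addfile(readme, io.BytesIO(data))
buf.seek(0)

contents = {}
with tarfile.open(fileobj=buf, mode="r") as tar:
    names = tar.getnames()
    for name in names:
        f = tar.extractfile(name)  # accepts a name or a TarInfo object
        # A directory yields None; a regular file yields a read-only file object.
        contents[name] = None if f is None else f.readline()

print(names)
print(contents)
```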


Tho*_*ner 6

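Previously, this post showed an example that used "dict(zip(...))" to pair the member names with the member list. That was silly and caused the archive to be read more often than necessary. To achieve the same thing, we can use a dictionary comprehension: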

index = {i.name: i for i in my_tarfile.getmembers()}

More information on how to use tarfile

Extracting a tarfile member

#!/usr/bin/env python3
import tarfile

my_tarfile = tarfile.open('/path/to/mytarfile.tar')

print(my_tarfile.extractfile('./path/to/file.png').read())

Indexing a tar file

#!/usr/bin/env python3
import tarfile
import pprint

my_tarfile = tarfile.open('/path/to/mytarfile.tar')

index = my_tarfile.getnames()  # a list of strings, one per member name
# or
# index = {i.name: i for i in my_tarfile.getmembers()}

pprint.pprint(index)

Indexing, reading, and dynamically extracting from a tar file

#!/usr/bin/env python3

import tarfile
import base64
import textwrap
import random

# note, indexing a tar file requires reading it completely once
# if we want to do anything after indexing it, it must be a file
# that can be seeked (not a stream), so here we open a file we
# can seek
my_tarfile = tarfile.open('/path/to/mytar.tar')


# tarfile.getmembers is similar to os.stat kind of, it will
# give you the member names (i.name) as well as TarInfo attributes:
#
# chksum,devmajor,devminor,gid,gname,linkname,linkpath,
# mode,mtime,name,offset,offset_data,path,pax_headers,
# size,sparse,tarfile,type,uid,uname
#
# here we use a dictionary comprehension to index all TarInfo
# members by the member name
index = {i.name: i for i in my_tarfile.getmembers()}

print(index.keys())

# pick your member
# note: if you can pick your member before indexing the tar file,
# you don't need to index it to read that file, you can directly
# my_tarfile.extractfile(name)
# or my_tarfile.getmember(name)

# pick your filename from the index dynamically
my_file_name = random.choice(list(index))  # dict views cannot be indexed in Python 3

my_file_tarinfo = index[my_file_name]
my_file_size = my_file_tarinfo.size
my_file_buf = my_tarfile.extractfile( 
    my_file_name
    # or my_file_tarinfo
)

print('file_name: {}'.format(my_file_name))
print('file_size: {}'.format(my_file_size))
print('----- BEGIN FILE BASE64 -----')
print(
    textwrap.fill(
        base64.b64encode(
            my_file_buf.read()
        ).decode(),
        72
    )
)
print('----- END FILE BASE64 -----')

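A tarfile with duplicate members

If we have a strangely created tar, in this example one made by appending multiple versions of the same file to the same tar archive, we can still work with it if we are careful. I have annotated which members contain what text; suppose we want the fourth member (index 3), "capturetheflag\n".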

tar -tf mybadtar.tar 
mymember.txt  # "version 1\n"
mymember.txt  # "version 1\n"
mymember.txt  # "version 2\n"
mymember.txt  # "capturetheflag\n"
mymember.txt  # "version 3\n"

#!/usr/bin/env python3

import tarfile
my_tarfile = tarfile.open('mybadtar.tar')

# >>> my_tarfile.getnames()
# ['mymember.txt', 'mymember.txt', 'mymember.txt', 'mymember.txt', 'mymember.txt']

# if we use extractfile on a name, we get the last entry: tarfile scans
# the whole archive and, when a name occurs more than once, treats the
# last occurrence as the most up-to-date version

# >>> my_tarfile.extractfile('mymember.txt').read()
# b'version 3\n'

# >>> my_tarfile.extractfile(my_tarfile.getmembers()[3]).read()
# b'capturetheflag\n'

Alternatively, we can iterate over the tar file:

#!/usr/bin/env python3

import tarfile
my_tarfile = tarfile.open('mybadtar.tar')
# note, if we do anything to the tarfile object that will
# cause a full read, the tarfile.next() method will return None,
# so call next() in a loop as the first thing you do if you want to
# iterate

while True:
    my_member = my_tarfile.next()
    if not my_member:
        break
    print((my_member.offset, my_tarfile.extractfile(my_member).read()))

# (0, b'version 1\n')
# (1024, b'version 1\n')
# (2048, b'version 2\n')
# (3072, b'capturetheflag\n')
# (4096, b'version 3\n')
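To try the duplicate-member behaviour without building mybadtar.tar by hand, the following sketch creates an equivalent archive in memory. The `versions` list mirrors the annotated listing above; the other variable names are illustrative.

```python
import io
import tarfile

versions = [
    b"version 1\n",
    b"version 1\n",
    b"version 2\n",
    b"capturetheflag\n",
    b"version 3\n",
]

# Write five members that all share the same name.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for data in versions:
        info = tarfile.TarInfo(name="mymember.txt")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
buf.seek(0)

with tarfile.open(fileobj=buf, mode="r") as tar:
    # Lookup by name resolves to the last member with that name.
    by_name = tar.extractfile("mymember.txt").read()
    # Lookup by TarInfo object reaches every stored version.
    members = tar.getmembers()
    all_versions = [tar.extractfile(m).read() for m in members]
    fourth = tar.extractfile(members[3]).read()

print(by_name)
print(fourth)
```

Passing a TarInfo object rather than a name is what makes the earlier versions reachable, since a name alone cannot tell the duplicates apart.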


    