Reading the CR2 (Raw Canon Image) header using Python

Question

Esc*_*alo 10 python metadata image-processing binary-data

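I'm trying to extract the date/time at which a picture was taken from a CR2 file (Canon's format for raw pictures).

I know the CR2 specification, and I know I can use Python's struct module to extract pieces from a binary buffer.

Briefly, the specification says that in tag 0x0132 / 306 I can find a string of length 20: the date and time.

I tried to get that tag by using: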

struct.unpack_from(20*'s', buffer, 0x0132)

But I get

('\x00', '\x00', "'", '\x88, ...[and more crap])

Any ideas?

Edit

Thanks a lot for all the effort! The answers are phenomenal, and I learned a lot about handling binary data.

Answer 1

Jon*_*age 7

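Have you taken into account the header that should (according to the spec) precede the IFD block you're talking about?

I looked through the spec, and it says the first IFD block follows the 16-byte header. So if we read bytes 16 and 17 (at offset 0x10 hex), we should get the number of entries in the first IFD block. Then we just have to search through each entry until we find a matching tag ID, which (as I read it) gives us the byte offset of the date/time string.

This works for me: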

from struct import unpack_from

def FindDateTimeOffsetFromCR2( buffer, ifd_offset ):
    # Read the number of entries in IFD #0.
    # '<' assumes little-endian (Intel, "II") data and forces standard sizes;
    # a bare 'L' uses the native size, which is 8 bytes on some platforms.
    (num_of_entries,) = unpack_from('<H', buffer, ifd_offset)
    print "ifd #0 contains %d entries"%num_of_entries

    # Work out where the date time is stored
    datetime_offset = -1
    # range() excludes its end value, so this visits every entry
    for entry_num in range(0,num_of_entries):
        # Each entry is 12 bytes: tag id, type, value count, value/offset
        (tag_id, tag_type, num_of_value, value) = unpack_from('<HHLL', buffer, ifd_offset+2+entry_num*12)
        if tag_id == 0x0132:
            print "found datetime at offset %d"%value
            datetime_offset = value
    return datetime_offset

if __name__ == '__main__':
    with open("IMG_6113.CR2", "rb") as f:
        buffer = f.read(1024) # read the first 1kb of the file should be enough to find the date / time
        datetime_offset = FindDateTimeOffsetFromCR2(buffer, 0x10)
        print unpack_from(20*'s', buffer, datetime_offset)

The output for my example file is:

ifd #0 contains 14 entries
found datetime at offset 250
('2', '0', '1', '0', ':', '0', '8', ':', '0', '1', ' ', '2', '3', ':', '4', '5', ':', '4', '6', '\x00')
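
The tuple of single characters above still has to be turned into something usable. A minimal sketch of that last step, written for Python 3 (unlike the Python 2 snippets in this thread) and assuming the field holds the 20-byte "YYYY:MM:DD HH:MM:SS" string plus a trailing NUL; the helper name `parse_exif_datetime` is made up for illustration:

```python
from datetime import datetime

def parse_exif_datetime(raw):
    """Turn the raw 20-byte DateTime field into a datetime object."""
    # The field is ASCII and padded with a trailing NUL byte.
    text = raw.decode("ascii").rstrip("\x00")
    # The date part uses colons, not dashes: "YYYY:MM:DD HH:MM:SS".
    return datetime.strptime(text, "%Y:%m:%d %H:%M:%S")

# The value from the sample output above, as one bytes object.
print(parse_exif_datetime(b"2010:08:01 23:45:46\x00"))
```

In Python 3, `unpack_from('20s', buffer, datetime_offset)[0]` would give the bytes object to pass in.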

[Edit] - a revised / more thorough example

from struct import *

recognised_tags = { 
    0x0100 : 'imageWidth',
    0x0101 : 'imageLength',
    0x0102 : 'bitsPerSample',
    0x0103 : 'compression',
    0x010f : 'make',    
    0x0110 : 'model',
    0x0111 : 'stripOffset',
    0x0112 : 'orientation', 
    0x0117 : 'stripByteCounts',
    0x011a : 'xResolution',
    0x011b : 'yResolution',
    0x0128 : 'resolutionUnit',
    0x0132 : 'dateTime',
    0x8769 : 'EXIF',
    0x8825 : 'GPS data'};

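def GetHeaderFromCR2( buffer ):
    # The byte-order mark reads the same in either byte order (0x4949 = "II",
    # 0x4D4D = "MM"), so peek at it first and pick an explicit flag.
    (byte_order,) = unpack_from('<H', buffer)
    endian_flag = '>' if byte_order == 0x4D4D else '<'

    # Unpack the header into a tuple. The explicit flag also forces standard
    # sizes and no padding; a bare 'HHLHBBL' uses native sizes, where 'L' can
    # be 8 bytes and the 16-byte header would be misread.
    header = unpack_from(endian_flag+'HHLHBBL', buffer)

    print "\nbyte_order = 0x%04X"%header[0]
    print "tiff_magic_word = %d"%header[1]
    print "tiff_offset = 0x%08X"%header[2]
    print "cr2_magic_word = %d"%header[3]
    print "cr2_major_version = %d"%header[4]
    print "cr2_minor_version = %d"%header[5]
    print "raw_ifd_offset = 0x%08X\n"%header[6]

    return header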

def FindDateTimeOffsetFromCR2( buffer, ifd_offset, endian_flag ):
    # Read the number of entries in IFD #0
    (num_of_entries,) = unpack_from(endian_flag+'H', buffer, ifd_offset)
    print "Image File Directory #0 contains %d entries\n"%num_of_entries

    # Work out where the date time is stored
    datetime_offset = -1

    # Go through all the entries looking for the datetime field
    print " id  | type |  number  |  value   "
    for entry_num in range(0,num_of_entries):

        # Grab this IFD entry
        (tag_id, tag_type, num_of_value, value) = unpack_from(endian_flag+'HHLL', buffer, ifd_offset+2+entry_num*12)

        # Print out the entry for information
        print "%04X | %04X | %08X | %08X "%(tag_id, tag_type, num_of_value, value),
        # Always finish the line, even for tags missing from recognised_tags
        print recognised_tags.get(tag_id, '')

        # If this is the datetime one we're looking for, make a note of the offset
        if tag_id == 0x0132:
            assert tag_type == 2
            assert num_of_value == 20
            datetime_offset = value

    return datetime_offset

if __name__ == '__main__':
    with open("IMG_6113.CR2", "rb") as f:
        # read the first 1kb of the file should be enough to find the date/time
        buffer = f.read(1024) 

        # Grab the various parts of the header
        (byte_order, tiff_magic_word, tiff_offset, cr2_magic_word, cr2_major_version, cr2_minor_version, raw_ifd_offset) = GetHeaderFromCR2(buffer)

        # Set the endian flag
        endian_flag = '@'
        if byte_order == 0x4D4D:
            # motorola format
            endian_flag = '>'
        elif byte_order == 0x4949:
            # intel format
            endian_flag = '<'

        # Search for the datetime entry offset
        datetime_offset = FindDateTimeOffsetFromCR2(buffer, 0x10, endian_flag)

        datetime_string = unpack_from(20*'s', buffer, datetime_offset)
        print "\nDatetime: "+"".join(datetime_string)+"\n"

Answer 2

Jim*_*som 6

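0x0132 is not the offset, it's the tag number of the date. CR2, like the TIFF format it is based on, is a directory-based format. You have to look up the entry for the (known) tag you are looking for.

Edit: OK, first of all, you have to find out whether the file data was saved in little-endian or big-endian format. The first eight bytes make up the header, and the first two bytes of that header specify the byte order. Python's struct module lets you handle little- and big-endian data by prefixing the format string with '<' or '>'. So, assuming data is a buffer containing your CR2 image, you can handle the byte order via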

header = data[:8]
endian_flag = "<" if header[:2] == "II" else ">"
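
One caveat if this is run under Python 3: slicing a bytes buffer yields bytes, so the comparison with the str "II" is always false and the flag silently becomes ">". A minimal Python 3 sketch of the same check; the function name `detect_endian_flag` and the choice to raise on an unknown mark are my own assumptions, not part of the answer:

```python
def detect_endian_flag(data):
    """Return the struct byte-order prefix for a TIFF-style buffer."""
    mark = data[:2]
    if mark == b"II":   # Intel, little-endian
        return "<"
    if mark == b"MM":   # Motorola, big-endian
        return ">"
    # Anything else is not a TIFF/CR2 byte-order mark.
    raise ValueError("unexpected byte-order mark: %r" % mark)

# First eight bytes of a little-endian header with the first IFD at offset 16.
print(detect_endian_flag(b"II*\x00\x10\x00\x00\x00"))
```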

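The format specification states that the first image file directory (IFD) begins at an offset relative to the beginning of the file, and that this offset is given in the last 4 bytes of the header. So, to get the offset of the first IFD, you can use a line similar to this one: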

ifd_offset = struct.unpack("{0}I".format(endian_flag), header[4:])[0]

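You can now go ahead and read the first IFD. The number of entries in the directory is stored at that offset in the file and is two bytes wide. So you would read the number of entries in the first IFD using: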

number_of_entries = struct.unpack("{0}H".format(endian_flag), data[ifd_offset:ifd_offset+2])[0]

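A field entry is 12 bytes long, so you can calculate the length of the IFD. After the number_of_entries * 12 bytes of entries comes another 4-byte offset, telling you where to look for the next directory. That is basically how you work with TIFF and CR2 images.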
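
Following that next-directory offset from one IFD to the next can be sketched as below. This is a Python 3 illustration under the layout just described, not code from the answer; `iter_ifd_offsets` is a hypothetical name, and the loop relies on the TIFF convention that a next-IFD offset of 0 ends the chain.

```python
import struct

def iter_ifd_offsets(data, endian_flag):
    """Yield the file offset of each IFD by following the next-IFD pointers."""
    # Bytes 4..7 of the header hold the offset of the first IFD.
    (offset,) = struct.unpack_from(endian_flag + "I", data, 4)
    seen = set()  # guards against a corrupt file whose pointers form a loop
    while offset != 0 and offset not in seen:
        seen.add(offset)
        yield offset
        (count,) = struct.unpack_from(endian_flag + "H", data, offset)
        # The pointer sits right after the 2-byte count and count * 12 entry bytes.
        (offset,) = struct.unpack_from(endian_flag + "I", data, offset + 2 + count * 12)
```

Calling `list(iter_ifd_offsets(data, "<"))` on a whole little-endian file would list where each directory starts; the buffer has to contain every IFD in the chain, so reading only the first kilobyte is not enough here.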

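The "magic" here is to note that in each 12-byte field entry, the first two bytes are the tag ID. That is where you look for your tag 0x0132. So, given that you know the first IFD starts at ifd_offset in the file, you can scan the first directory via: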

current_position = ifd_offset + 2
# The end bound is measured from current_position, not from 0;
# otherwise the scan stops before it reaches the last entries.
for field_offset in xrange(current_position, current_position + number_of_entries*12, 12):
    field_tag = struct.unpack("{0}H".format(endian_flag), data[field_offset:field_offset+2])[0]
    field_type = struct.unpack("{0}H".format(endian_flag), data[field_offset+2:field_offset+4])[0]
    value_count = struct.unpack("{0}I".format(endian_flag), data[field_offset+4:field_offset+8])[0]
    value_offset = struct.unpack("{0}I".format(endian_flag), data[field_offset+8:field_offset+12])[0]

    if field_tag == 0x0132:
        # You are now reading a field entry containing the date and time
        assert field_type == 2 # Type 2 is ASCII
        assert value_count == 20 # You would expect a string length of 20 here
        date_time = struct.unpack("20s", data[value_offset:value_offset+20])
        print date_time
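
Put together, the steps in this answer might look like the following Python 3 sketch. The function and constant names are hypothetical, it only inspects the first IFD, and it treats any byte-order mark other than "II" as big-endian, as the snippet earlier in this answer does.

```python
import struct

DATETIME_TAG = 0x0132  # TIFF DateTime tag
ASCII_TYPE = 2         # TIFF field type for ASCII strings

def find_datetime(data):
    """Return the DateTime string from IFD #0 of a TIFF/CR2 buffer, or None."""
    endian = "<" if data[:2] == b"II" else ">"
    # The last 4 bytes of the 8-byte header give the offset of the first IFD.
    (ifd_offset,) = struct.unpack_from(endian + "I", data, 4)
    (count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    first = ifd_offset + 2
    for pos in range(first, first + count * 12, 12):
        tag, ftype, n, value = struct.unpack_from(endian + "HHII", data, pos)
        if tag == DATETIME_TAG and ftype == ASCII_TYPE:
            # A 20-byte string does not fit in the 4-byte value slot,
            # so `value` is an offset to the string elsewhere in the file.
            raw = data[value:value + n]
            return raw.rstrip(b"\x00").decode("ascii")
    return None

# Possible usage; "IMG_6113.CR2" is the sample file name used in the other answer.
# with open("IMG_6113.CR2", "rb") as f:
#     print(find_datetime(f.read(1024)))
```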

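You would obviously want to refactor the unpacking into a common function, and probably wrap the whole format in a nice class, but that is beyond the scope of this example. You can also shorten the unpacking by combining the multiple format strings into one, which yields a larger tuple containing all the fields that you can unpack into separate variables; I left that out for clarity.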
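
As a rough idea of what that refactoring could look like, here is a Python 3 sketch that uses one combined format string per entry. The names `IfdEntry` and `read_ifd_entries` are invented for this example; the field names mirror the variables used in the loop above.

```python
import struct
from collections import namedtuple

# One record per 12-byte IFD entry.
IfdEntry = namedtuple("IfdEntry", "tag field_type value_count value_offset")

def read_ifd_entries(data, ifd_offset, endian_flag):
    """Return every entry of the IFD at ifd_offset as a list of IfdEntry tuples."""
    (count,) = struct.unpack_from(endian_flag + "H", data, ifd_offset)
    # "HHII" reads tag, type, count and value/offset in a single call.
    entry_struct = struct.Struct(endian_flag + "HHII")
    return [
        IfdEntry(*entry_struct.unpack_from(data, ifd_offset + 2 + i * entry_struct.size))
        for i in range(count)
    ]
```

Finding the date then becomes a filter over the result, for example `[e for e in read_ifd_entries(data, ifd_offset, "<") if e.tag == 0x0132]`.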
