Mat*_*nez 2 pdf base64 python-3.x
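I extracted an embedded object from an Excel spreadsheet. The object is a PDF file, but the Excel zip archive stores embedded objects as binary files.
I am trying to read the binary file and restore it to its original format (PDF). I took some code from another question about a similar problem, but when I try to open the resulting PDF in Adobe I get the error "could not be opened because the file is damaged... wasn't correctly decoded..."
Does anyone know a way to do this?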
import base64
import os

with open('oleObject1.bin', 'rb') as f:
    binaryData = f.read()
print(binaryData)

with open(os.path.expanduser('test1.pdf'), 'wb') as fout:
    fout.write(base64.decodebytes(binaryData))
Thanks Ryan, I see what you mean. Here is the solution for future reference.
import os

str1 = b'%PDF-'  # Begin PDF
str2 = b'%%EOF'  # End PDF

with open('oleObject1.bin', 'rb') as f:
    binary_data = f.read()

# Convert bytes to a mutable bytearray
binary_byte_array = bytearray(binary_data)

# Find where the PDF begins
result1 = binary_byte_array.find(str1)
print(result1)
if result1 == -1:
    raise ValueError('no %PDF- header found in oleObject1.bin')

# Remove all bytes before the PDF begins
del binary_byte_array[:result1]

# Find where the PDF ends
result2 = binary_byte_array.find(str2)
print(result2)
if result2 == -1:
    raise ValueError('no %%EOF marker found in oleObject1.bin')

# Keep everything up to and including the %%EOF marker (5 bytes)
# and delete whatever follows it
del binary_byte_array[result2 + len(str2):]

with open(os.path.expanduser('test1.pdf'), 'wb') as fout:
    fout.write(binary_byte_array)
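For anyone reusing this, the same idea can be wrapped in a small function. This is only a sketch: `extract_pdf` is a name made up for illustration, and it assumes the PDF bytes sit uncompressed inside the `.bin` wrapper, as they did in this case. It also searches for the last `%%EOF` with `rfind` rather than the first, which is a deliberate change from the code above, since a PDF saved with incremental updates can contain more than one `%%EOF` marker.

```python
def extract_pdf(blob: bytes) -> bytes:
    """Return the bytes from the first %PDF- header through the last %%EOF marker.

    Hypothetical helper: assumes the PDF is stored as-is inside the wrapper.
    """
    header = b"%PDF-"
    trailer = b"%%EOF"

    start = blob.find(header)
    if start == -1:
        raise ValueError("no %PDF- header found")

    # rfind returns -1 when the marker is missing, which is also < start.
    end = blob.rfind(trailer)
    if end < start:
        raise ValueError("no %%EOF marker found after the header")

    # Slicing avoids the pitfall of `del data[-0:]`, which would empty the buffer.
    return blob[start:end + len(trailer)]


if __name__ == "__main__":
    # Synthetic stand-in for the OLE wrapper: junk bytes around a PDF-like payload.
    wrapped = b"\x01\x02wrapper" + b"%PDF-1.4\nbody\n%%EOF" + b"\x00tail"
    print(extract_pdf(wrapped))
```

With the real file, the call would be something like `extract_pdf(open('oleObject1.bin', 'rb').read())`, and the result is written to `test1.pdf` in `'wb'` mode as before.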