Oli*_*Oli 2 php linux character-encoding
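I need to convert some files to UTF-8 because they are being output on an otherwise UTF-8 site, and the content sometimes looks rather ugly.
I can either convert them now, or do it as they are read in (through PHP, just using fopen, nothing fancy). Any suggestions are welcome.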
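I don't have a definitive PHP solution, but for Python I personally use the Universal Encoding Detector library (chardet), which does a pretty good job of guessing which encoding a file was written in.
To get you started, here is the Python script I used for the conversion (the original purpose was converting a Japanese code base from a mixture of UTF-16 and Shift-JIS, and I made a default guess whenever chardet was not confident about the detected encoding):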
import sys
from chardet.universaldetector import UniversalDetector


def DetectEncoding(fileHdl):
    """Detect the encoding of a file opened in binary mode.

    Returns the chardet result dictionary."""
    detector = UniversalDetector()
    for line in fileHdl:
        detector.feed(line)
        if detector.done:
            break
    detector.close()
    return detector.result


def ReencodeFileToUtf8(fileName, encoding):
    """Re-encode the file to UTF-8, in place."""
    # TODO: This is dangerous ^^||, would need a backup option :)
    # NOTE: Use the 'replace' option, which tolerates erroneous characters.
    # newline='' keeps the original line endings untouched (Python 3).
    with open(fileName, 'r', encoding=encoding, errors='replace', newline='') as src:
        data = src.read()
    with open(fileName, 'wb') as dst:
        dst.write(data.encode('utf-8', 'replace'))


if __name__ == '__main__':
    # Check for arguments first
    if len(sys.argv) != 2:
        sys.exit("Invalid arguments supplied")
    fileName = sys.argv[1]
    try:
        # Open file and detect encoding
        with open(fileName, 'rb') as fileHdl:
            encResult = DetectEncoding(fileHdl)
        # Was it an empty file? chardet reports no encoding in that case
        if encResult['encoding'] is None:
            sys.exit("Possible empty file")
        # Only attempt to reencode file if we are confident about the
        # encoding and if it's not UTF-8
        encoding = encResult['encoding'].lower()
        if encResult['confidence'] >= 0.7:
            if encoding != 'utf-8':
                ReencodeFileToUtf8(fileName, encoding)
        else:
            # TODO: Probably you could make a default guess and try to encode, or
            # just simply make it fail (which is what happens here)
            sys.exit("Not confident enough about the detected encoding")
    except IOError:
        sys.exit('An IOError occurred')
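The two TODOs in the script (a backup before overwriting, and a default guess when detection is not confident) could be sketched roughly as below. This is only an illustration using the standard library: the helper names `pick_encoding` and `reencode_with_backup`, the `.bak` suffix, and the `shift_jis` default are assumptions made for the example, not part of chardet or of the script above.

```python
import shutil


def pick_encoding(detected, confidence, default="shift_jis", threshold=0.7):
    """Return the detected encoding if it is trusted, else a default guess.

    `default` and `threshold` are illustrative assumptions; choose values
    that match the files you actually have.
    """
    if detected is not None and confidence >= threshold:
        return detected.lower()
    return default


def reencode_with_backup(path, source_encoding, backup_suffix=".bak"):
    """Copy `path` to `path + backup_suffix`, then rewrite `path` as UTF-8.

    Returns the path of the backup copy.
    """
    backup_path = path + backup_suffix
    shutil.copy2(path, backup_path)  # keep the original bytes recoverable
    with open(path, "rb") as src:
        raw = src.read()
    # 'replace' substitutes U+FFFD for bytes that cannot be decoded
    text = raw.decode(source_encoding, "replace")
    with open(path, "wb") as dst:
        dst.write(text.encode("utf-8"))
    return backup_path
```

With these, the main block could call `reencode_with_backup(fileName, pick_encoding(encResult['encoding'], encResult['confidence']))` instead of failing on low confidence, at the cost of occasionally converting with the wrong guess. The backup makes that mistake recoverable.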