2To*_*oad 140
该StreamReader.CurrentEncoding属性很少为我返回正确的文本文件编码.通过分析字节顺序标记(BOM),我在确定文件的字节顺序方面取得了更大的成功:
/// <summary>
/// Determines a text file's encoding by analyzing its byte order mark (BOM).
/// Defaults to ASCII when detection of the text file's endianness fails.
/// </summary>
/// <param name="filename">The text file to analyze.</param>
/// <returns>The detected encoding.</returns>
public static Encoding GetEncoding(string filename)
{
// Read the BOM
var bom = new byte[4];
using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
{
file.Read(bom, 0, 4);
}
// Analyze the BOM
if (bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76) return Encoding.UTF7;
if (bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) return Encoding.UTF8;
if (bom[0] == 0xff && bom[1] == 0xfe) return Encoding.Unicode; //UTF-16LE
if (bom[0] == 0xfe && bom[1] == 0xff) return Encoding.BigEndianUnicode; //UTF-16BE
if (bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff) return Encoding.UTF32;
return Encoding.ASCII;
}
Run Code Online (Sandbox Code Playgroud)
作为旁注,您可能希望修改此方法的最后一行以返回Encoding.Default,因此默认情况下会返回OS当前ANSI代码页的编码.
Sim*_*ier 41
以下代码适用于我,使用StreamReader类:
using (var reader = new StreamReader(fileName, defaultEncodingIfNoBom, true))
{
reader.Peek(); // you need this!
var encoding = reader.CurrentEncoding;
}
Run Code Online (Sandbox Code Playgroud)
诀窍是使用Peek调用,否则,.NET没有做任何事情(并且它没有读取前导码,BOM).当然,如果ReadXXX在检查编码之前使用任何其他调用,它也可以工作.
如果文件没有BOM,则将使用defaultEncodingIfNoBom编码.还有一个没有这种重载方法的StreamReader(在这种情况下,默认(ANSI)编码将用作defaultEncodingIfNoBom),但我建议您在上下文中定义您认为的默认编码.
我已成功测试了包含UTF8,UTF16/Unicode(LE&BE)和UTF32(LE&BE)BOM的文件.它不适用于UTF7.
Ber*_*eux 13
提供@CodesInChaos 提出的步骤的实现细节:
1) 检查是否有字节顺序标记
2) 检查文件是否是有效的 UTF8
3) 使用本地的“ANSI”代码页(微软定义的 ANSI)
第 2 步之所以有效,是因为代码页中的大多数非 ASCII 序列不是 UTF8,而是无效的 UTF8。/sf/answers/316557601/更详细地解释了该策略。
using System; using System.IO; using System.Text;
// Using encoding from BOM or UTF8 if no BOM found,
// check if the file is valid, by reading all lines
// If decoding fails, use the local "ANSI" codepage
public string DetectFileEncoding(Stream fileStream)
{
var Utf8EncodingVerifier = Encoding.GetEncoding("utf-8", new EncoderExceptionFallback(), new DecoderExceptionFallback());
using (var reader = new StreamReader(fileStream, Utf8EncodingVerifier,
detectEncodingFromByteOrderMarks: true, leaveOpen: true, bufferSize: 1024))
{
string detectedEncoding;
try
{
while (!reader.EndOfStream)
{
var line = reader.ReadLine();
}
detectedEncoding = reader.CurrentEncoding.BodyName;
}
catch (Exception e)
{
// Failed to decode the file using the BOM/UT8.
// Assume it's local ANSI
detectedEncoding = "ISO-8859-1";
}
// Rewind the stream
fileStream.Seek(0, SeekOrigin.Begin);
return detectedEncoding;
}
}
[Test]
public void Test1()
{
Stream fs = File.OpenRead(@".\TestData\TextFile_ansi.csv");
var detectedEncoding = DetectFileEncoding(fs);
using (var reader = new StreamReader(fs, Encoding.GetEncoding(detectedEncoding)))
{
// Consume your file
var line = reader.ReadLine();
...
Run Code Online (Sandbox Code Playgroud)
Cod*_*aos 12
我尝试以下步骤:
1)检查是否有字节顺序标记
2)检查文件是否有效UTF8
3)使用本地"ANSI"代码页(Microsoft定义的ANSI)
第2步可行,因为大多数非ASCII序列在其他UTF8无效UTF8的代码页中.
检查一下。
这是Mozilla Universal Charset Detector的端口,您可以像这样使用它...
public static void Main(String[] args)
{
string filename = args[0];
using (FileStream fs = File.OpenRead(filename)) {
Ude.CharsetDetector cdet = new Ude.CharsetDetector();
cdet.Feed(fs);
cdet.DataEnd();
if (cdet.Charset != null) {
Console.WriteLine("Charset: {0}, confidence: {1}",
cdet.Charset, cdet.Confidence);
} else {
Console.WriteLine("Detection failed.");
}
}
}
Run Code Online (Sandbox Code Playgroud)
小智 6
@nonoandy 提出的解决方案非常有趣,我已经成功地测试了它并且似乎工作得很好。
需要的nuget包是Microsoft.ProgramSynthesis.Detection(目前版本8.17.0)
我建议使用EncodingTypeUtils.GetDotNetName而不是使用开关来获取实例Encoding:
using System.Text;
using Microsoft.ProgramSynthesis.Detection.Encoding;
...
public Encoding? DetectEncoding(Stream stream)
{
try
{
if (stream.CanSeek)
{
// Read from the beginning if possible
stream.Seek(0, SeekOrigin.Begin);
}
// Detect encoding type (enum)
var encodingType = EncodingIdentifier.IdentifyEncoding(stream);
// Get the corresponding encoding name to be passed to System.Text.Encoding.GetEncoding
var encodingDotNetName = EncodingTypeUtils.GetDotNetName(encodingType);
if (!string.IsNullOrEmpty(encodingDotNetName))
{
return Encoding.GetEncoding(encodingDotNetName);
}
}
catch (Exception e)
{
// Handle exception (log, throw, etc...)
}
// In case of error return null or a default value
return null;
}
Run Code Online (Sandbox Code Playgroud)
.NET 不是很有帮助,但您可以尝试以下算法:
这是电话:
var encoding = FileHelper.GetEncoding(filePath);
if (encoding == null)
throw new Exception("The file encoding is not supported. Please choose one of the following encodings: UTF8/UTF7/iso-8859-1");
Run Code Online (Sandbox Code Playgroud)
这是代码:
public class FileHelper
{
/// <summary>
/// Determines a text file's encoding by analyzing its byte order mark (BOM) and if not found try parsing into diferent encodings
/// Defaults to UTF8 when detection of the text file's endianness fails.
/// </summary>
/// <param name="filename">The text file to analyze.</param>
/// <returns>The detected encoding or null.</returns>
public static Encoding GetEncoding(string filename)
{
var encodingByBOM = GetEncodingByBOM(filename);
if (encodingByBOM != null)
return encodingByBOM;
// BOM not found :(, so try to parse characters into several encodings
var encodingByParsingUTF8 = GetEncodingByParsing(filename, Encoding.UTF8);
if (encodingByParsingUTF8 != null)
return encodingByParsingUTF8;
var encodingByParsingLatin1 = GetEncodingByParsing(filename, Encoding.GetEncoding("iso-8859-1"));
if (encodingByParsingLatin1 != null)
return encodingByParsingLatin1;
var encodingByParsingUTF7 = GetEncodingByParsing(filename, Encoding.UTF7);
if (encodingByParsingUTF7 != null)
return encodingByParsingUTF7;
return null; // no encoding found
}
/// <summary>
/// Determines a text file's encoding by analyzing its byte order mark (BOM)
/// </summary>
/// <param name="filename">The text file to analyze.</param>
/// <returns>The detected encoding.</returns>
private static Encoding GetEncodingByBOM(string filename)
{
// Read the BOM
var byteOrderMark = new byte[4];
using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
{
file.Read(byteOrderMark, 0, 4);
}
// Analyze the BOM
if (byteOrderMark[0] == 0x2b && byteOrderMark[1] == 0x2f && byteOrderMark[2] == 0x76) return Encoding.UTF7;
if (byteOrderMark[0] == 0xef && byteOrderMark[1] == 0xbb && byteOrderMark[2] == 0xbf) return Encoding.UTF8;
if (byteOrderMark[0] == 0xff && byteOrderMark[1] == 0xfe) return Encoding.Unicode; //UTF-16LE
if (byteOrderMark[0] == 0xfe && byteOrderMark[1] == 0xff) return Encoding.BigEndianUnicode; //UTF-16BE
if (byteOrderMark[0] == 0 && byteOrderMark[1] == 0 && byteOrderMark[2] == 0xfe && byteOrderMark[3] == 0xff) return Encoding.UTF32;
return null; // no BOM found
}
private static Encoding GetEncodingByParsing(string filename, Encoding encoding)
{
var encodingVerifier = Encoding.GetEncoding(encoding.BodyName, new EncoderExceptionFallback(), new DecoderExceptionFallback());
try
{
using (var textReader = new StreamReader(filename, encodingVerifier, detectEncodingFromByteOrderMarks: true))
{
while (!textReader.EndOfStream)
{
textReader.ReadLine(); // in order to increment the stream position
}
// all text parsed ok
return textReader.CurrentEncoding;
}
}
catch (Exception ex) { }
return null; //
}
}
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
115289 次 |
| 最近记录: |