Geo*_*ge2 6 c# validation encoding utf-8
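I have a very large XML document (about 120 MB), and I do not want to load it into memory all at once. My goal is to check whether this file is valid UTF-8.
Any ideas for a quick check that does not read the whole file into a byte[]?
I am using VSTS 2008 and C#.
When XmlDocument is used to load an XML document that contains invalid byte sequences, an exception is thrown, but when I read the whole content into a byte array and then check it for UTF-8, no exception is thrown. Any ideas?
Here is a screenshot showing the contents of my XML file, or you can download a copy of the file from here
Edit 1: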
using System;
using System.IO;
using System.Text;
using System.Xml;

class Program
{
    // Reads the entire file into memory as raw bytes.
    public static byte[] RawReadingTest(string fileName)
    {
        byte[] buff = null;
        try
        {
            // Dispose the stream and reader even if reading fails.
            using (FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read))
            using (BinaryReader br = new BinaryReader(fs))
            {
                long numBytes = new FileInfo(fileName).Length;
                buff = br.ReadBytes((int)numBytes);
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
        }
        return buff;
    }

    // Loads the document with XmlDocument; this reports the invalid bytes.
    static void XMLTest()
    {
        try
        {
            XmlDocument xDoc = new XmlDocument();
            xDoc.Load("c:\\abc.xml");
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
        }
    }

    static void Main()
    {
        try
        {
            XMLTest();
            // Decodes the raw bytes as UTF-8; no exception is thrown here.
            Encoding ae = Encoding.GetEncoding("utf-8");
            string filename = "c:\\abc.xml";
            ae.GetString(RawReadingTest(filename));
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
        }
    }
}
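A minimal sketch of why the byte-array check above stays silent, assuming the behavior of the stock .NET encodings: the object returned by Encoding.GetEncoding("utf-8") replaces invalid sequences with U+FFFD instead of throwing, while a UTF8Encoding built with throwOnInvalidBytes set to true raises DecoderFallbackException. The class name, the helper IsValidUtf8, and the sample bytes are made up for illustration.

```csharp
using System;
using System.Text;

public static class StrictVersusLenientDemo
{
    // Hypothetical helper: true if the bytes decode cleanly as UTF-8.
    public static bool IsValidUtf8(byte[] bytes)
    {
        // Second argument (throwOnInvalidBytes) makes the decoder throw
        // instead of substituting a replacement character.
        UTF8Encoding strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // "<a>" followed by 0xC3 0x28: a two-byte lead byte with a bad continuation byte.
        byte[] bad = { 0x3C, 0x61, 0x3E, 0xC3, 0x28 };

        // Lenient decoding: no exception, the bad byte becomes U+FFFD.
        string lenient = Encoding.GetEncoding("utf-8").GetString(bad);
        Console.WriteLine(lenient.IndexOf('\uFFFD') >= 0);

        // Strict decoding: the same bytes are rejected.
        Console.WriteLine(IsValidUtf8(bad));
    }
}
```

This still takes a whole byte array, so it only illustrates the difference between the two decoders; it does not address the memory concern for a 120 MB file.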
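Edit 2: With new UTF8Encoding(true, true) an exception is thrown, but with new UTF8Encoding(false, true) no exception is thrown. I am confused: the second parameter should be the one that controls whether an exception is thrown for invalid byte sequences, so why does the first parameter matter?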
public static void TestTextReader2()
{
    try
    {
        // Create an instance of StreamReader to read from a file.
        // The using statement also closes the StreamReader.
        using (StreamReader sr = new StreamReader(
            "c:\\a.xml",
            new UTF8Encoding(true, true)))
        {
            int bufferSize = 10 * 1024 * 1024; // could be anything
            char[] buffer = new char[bufferSize];
            // Read from the file until the end of the file is reached.
            int actualsize = sr.Read(buffer, 0, bufferSize);
            while (actualsize > 0)
            {
                actualsize = sr.Read(buffer, 0, bufferSize);
            }
        }
    }
    catch (Exception e)
    {
        // Let the user know what went wrong.
        Console.WriteLine("The file could not be read:");
        Console.WriteLine(e.Message);
    }
}
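// Read the file in fixed-size chunks so it is never fully in memory.
var buffer = new char[32768];
using (var stream = new StreamReader(pathToFile,
    new UTF8Encoding(true, true)))
{
    while (true)
    {
        try
        {
            // Read returns 0 at end of file: every chunk decoded cleanly.
            if (stream.Read(buffer, 0, buffer.Length) == 0)
                return GoodUTF8File;
        }
        catch (ArgumentException)
        {
            // DecoderFallbackException derives from ArgumentException.
            return BadUTF8File;
        }
    }
}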
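As a variation on the same idea, here is a rough sketch that feeds raw bytes from a Stream to a strict UTF-8 Decoder, so no StreamReader encoding detection is involved and only one small buffer is in memory at a time. The class name Utf8StreamCheck, the helper IsValidUtf8, and the 64 KB buffer size are assumptions for illustration, not part of the code above.

```csharp
using System;
using System.IO;
using System.Text;

public static class Utf8StreamCheck
{
    // Hypothetical helper: true if the whole stream is valid UTF-8.
    public static bool IsValidUtf8(Stream input)
    {
        // throwOnInvalidBytes = true, so bad sequences raise DecoderFallbackException.
        UTF8Encoding strict = new UTF8Encoding(false, true);
        Decoder decoder = strict.GetDecoder();

        byte[] bytes = new byte[64 * 1024];
        char[] chars = new char[strict.GetMaxCharCount(bytes.Length)];

        try
        {
            int read;
            while ((read = input.Read(bytes, 0, bytes.Length)) > 0)
            {
                // flush = false: a sequence split across two chunks is
                // kept in the decoder's state and completed by the next call.
                decoder.GetChars(bytes, 0, read, chars, 0, false);
            }
            // flush = true: a truncated sequence at end of input is an error.
            decoder.GetChars(bytes, 0, 0, chars, 0, true);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    public static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.Error.WriteLine("usage: Utf8StreamCheck <file>");
            return;
        }
        using (FileStream fs = File.OpenRead(args[0]))
        {
            Console.WriteLine(IsValidUtf8(fs) ? "valid UTF-8" : "invalid UTF-8");
        }
    }
}
```

The decoded characters are discarded; the decoder is used only for its validation side effect. A leading UTF-8 byte order mark decodes as U+FEFF and is treated as valid input.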