Convert source code to UTF-8 without BOM

Question

Vla*_*hov 4 powershell utf-8 character-encoding

I'm trying to convert all source files in a target folder to UTF-8 (without BOM) encoding. I use the following PowerShell script:

$MyPath = "D:\my projects\etc\"
Get-ChildItem $MyPath\* -Include *.h, *.cpp, *.c | Foreach-Object {
    $content = Get-Content $_.FullName  
    $Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding $False
    [System.IO.File]::WriteAllLines($_.FullName, $content, $Utf8NoBomEncoding)    
}
cmd /c pause | out-null

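It works fine if a file is not yet UTF-8. However, if a file is already BOM-less UTF-8, all national characters are converted to unknown symbols (for example, if I run the script a second time). How can I change the script to fix this?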

Answer 1

mkl*_*nt0 5

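As Ansgar Wiechers points out in a comment, the problem is that Windows PowerShell, in the absence of a BOM, defaults to interpreting files as "ANSI"-encoded, i.e., using the encoding implied by the legacy system locale (ANSI code page), as reflected by .NET Framework (but not .NET Core) in [System.Text.Encoding]::Default.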

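Given that, and given that (per your follow-up comments) the BOM-less files among your input files are a mix of Windows-1251-encoded and UTF-8-encoded files, you must examine their content to determine their specific encoding: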

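Read each file with -Encoding Utf8 and test whether the resulting string contains the Unicode REPLACEMENT CHARACTER (U+FFFD). If it does, the implication is that the file is not UTF-8, because this special character signals that byte sequences were encountered that are invalid in UTF-8.
If the file is not valid UTF-8, simply read it again without specifying -Encoding, which causes Windows PowerShell to interpret the file as Windows-1251-encoded, because that is the encoding (code page) implied by your system locale.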

$MyPath = "D:\my projects\etc"
Get-ChildItem $MyPath\* -Include *.h, *.cpp, *.c | Foreach-Object {
    # Note:
    #  * the use of -Encoding Utf8 to first try to read the file as UTF-8.
    #  * the use of -Raw to read the entire file as a *single string*.
    $content = Get-Content -Raw -Encoding Utf8 $_.FullName  

    # If the replacement char. is found in the content, the implication
    # is that the file is NOT UTF-8, so read it again *without -Encoding*,
    # which interprets the files as "ANSI" encoded (Windows-1251, in your case).
    if ($content.Contains([char] 0xfffd)) {
      $content = Get-Content -Raw $_.FullName  
    }

    # Note the use of WriteAllText() in lieu of WriteAllLines()
    # and that no explicit encoding object is passed, given that
    # .NET *defaults* to BOM-less UTF-8.
    # CAVEAT: There's a slight risk of data loss if writing back to the input
    #         file is interrupted.
    [System.IO.File]::WriteAllText($_.FullName, $content)    
}

A faster alternative is to use [IO.File]::ReadAllText() with a UTF-8 encoding object that throws an exception when it encounters bytes that are invalid as UTF-8 (PSv5+ syntax):

$utf8EncodingThatThrows = [Text.UTF8Encoding]::new($false, $true)

# ...

  try {
     $content = [IO.File]::ReadAllText($_.FullName, $utf8EncodingThatThrows)
  } catch [Text.DecoderFallbackException] {         
     $content = [IO.File]::ReadAllText($_.FullName, [Text.Encoding]::Default)
  }

# ...

Adapting the solutions above to PowerShell Core / .NET Core:

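PowerShell Core defaults to (BOM-less) UTF-8, so simply omitting -Encoding does not work for reading ANSI-encoded files.
Similarly, [System.Text.Encoding]::Default always reports UTF-8 in .NET Core.

Therefore, you must manually determine the active system locale's ANSI code page and obtain the corresponding encoding object: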

$ansiEncoding = [Text.Encoding]::GetEncoding(
  [int] (Get-ItemPropertyValue HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage ACP)
)

You then need to pass this encoding explicitly, either to Get-Content -Encoding (Get-Content -Raw -Encoding $ansiEncoding $_.FullName) or to the .NET methods ([IO.File]::ReadAllText($_.FullName, $ansiEncoding)).
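
The pieces above (a throwing UTF-8 decoder plus an explicit ANSI encoding object) could be combined roughly as follows. This is only a sketch: the function name Read-TextUtf8OrAnsi and its parameters are hypothetical, and the demo uses hard-coded Windows-1251 bytes in a temporary file rather than real source files.

```powershell
# Hypothetical helper (not part of the original answer): read a file as strict
# UTF-8 and fall back to the given "ANSI" encoding if invalid bytes are found.
function Read-TextUtf8OrAnsi {
  param(
    [Parameter(Mandatory)] [string] $Path,
    [Parameter(Mandatory)] [Text.Encoding] $AnsiEncoding
  )
  # $false: emit no BOM when encoding; $true: throw on invalid UTF-8 bytes.
  $strictUtf8 = [Text.UTF8Encoding]::new($false, $true)
  try {
    [IO.File]::ReadAllText($Path, $strictUtf8)
  } catch [Text.DecoderFallbackException] {
    [IO.File]::ReadAllText($Path, $AnsiEncoding)
  }
}

# Demo with a temporary file; code page 1251 is assumed for illustration only.
$cp1251 = [Text.Encoding]::GetEncoding(1251)
$ansiBytes = [byte[]] (0xCF, 0xF0, 0xE8, 0xE2, 0xE5, 0xF2)  # Cyrillic text in Windows-1251
$tmp = [IO.Path]::GetTempFileName()

# Case 1: the file holds Windows-1251 bytes, which are not valid UTF-8.
[IO.File]::WriteAllBytes($tmp, $ansiBytes)
$fromAnsi = Read-TextUtf8OrAnsi -Path $tmp -AnsiEncoding $cp1251

# Case 2: the file holds the same text as BOM-less UTF-8.
[IO.File]::WriteAllBytes($tmp, [Text.UTF8Encoding]::new($false).GetBytes($fromAnsi))
$fromUtf8 = Read-TextUtf8OrAnsi -Path $tmp -AnsiEncoding $cp1251

Remove-Item -LiteralPath $tmp
$fromAnsi -eq $fromUtf8   # both reads should yield the same string
```

In the loop from the answer, the call would take the place of the read step, e.g. $content = Read-TextUtf8OrAnsi -Path $_.FullName -AnsiEncoding $ansiEncoding, followed by the unchanged [System.IO.File]::WriteAllText() call.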

The answer in its original form, for input files that are all already UTF-8-encoded:

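Therefore, if some of your UTF-8-encoded files are (already) BOM-less, you must explicitly instruct Get-Content to treat them as UTF-8, using -Encoding Utf8; otherwise, they will be misinterpreted if they contain characters outside the 7-bit ASCII range: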

$MyPath = "D:\my projects\etc"
Get-ChildItem $MyPath\* -Include *.h, *.cpp, *.c | Foreach-Object {
    # Note:
    #  * the use of -Encoding Utf8 to ensure the correct interpretation of the input file
    #  * the use of -Raw to read the entire file as a *single string*.
    $content = Get-Content -Raw -Encoding Utf8 $_.FullName  

    # Note the use of WriteAllText() in lieu of WriteAllLines()
    # and that no explicit encoding object is passed, given that
    # .NET *defaults* to BOM-less UTF-8.
    # CAVEAT: There's a slight risk of data loss if writing back to the input
    #         file is interrupted.
    [System.IO.File]::WriteAllText($_.FullName, $content)    
}

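Note: Rewriting BOM-less UTF-8 files is not necessary in your scenario, but doing so is benign and simplifies the code; the alternative would be to test whether the first 3 bytes of each file are the UTF-8 BOM and to skip the files that lack one:
$hasUtf8Bom = "$(Get-Content -Encoding Byte -First 3 $_.FullName)" -eq '239 187 191' (Windows PowerShell) or
$hasUtf8Bom = "$(Get-Content -AsByteStream -First 3 $_.FullName)" -eq '239 187 191' (PowerShell Core).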
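
If a single test that behaves the same in both editions is preferred, the BOM check could also be done with .NET types directly. A minimal sketch, assuming PSv5+ syntax; the function name Test-Utf8Bom is hypothetical:

```powershell
# Hypothetical helper: returns $true if the file starts with the UTF-8 BOM
# (bytes 0xEF 0xBB 0xBF, i.e. decimal 239 187 191).
# Pass a full path (such as $_.FullName), because .NET resolves relative
# paths against the process working directory, not PowerShell's location.
function Test-Utf8Bom {
  param([Parameter(Mandatory)] [string] $Path)
  $stream = [IO.File]::OpenRead($Path)
  try {
    $buffer = [byte[]]::new(3)
    $read = $stream.Read($buffer, 0, 3)
  } finally {
    $stream.Dispose()
  }
  $read -eq 3 -and $buffer[0] -eq 0xEF -and $buffer[1] -eq 0xBB -and $buffer[2] -eq 0xBF
}
```

Inside the ForEach-Object script block, a line such as if (-not (Test-Utf8Bom $_.FullName)) { return } would then skip the files that have no BOM and move on to the next input file.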

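As an aside: if some input files use a non-UTF-8 encoding (e.g., UTF-16), the solution still works as long as those files have a BOM, because PowerShell (quietly) gives precedence to a BOM over the encoding specified via -Encoding.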

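Note that using -Raw / WriteAllText() to read / write each file as a whole (as a single string) not only speeds up processing a bit, but also ensures that the following characteristics of each input file are preserved:

The specific newline style (CRLF (Windows) vs. LF-only (Unix))
Whether or not the last line has a trailing newline.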

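By contrast, not using -Raw (reading line by line) and using .WriteAllLines() does not preserve these characteristics: you invariably get the platform-appropriate newlines (in Windows PowerShell, always CRLF), and you always get a trailing newline.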
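
The difference can be observed with a throwaway file. The following is an illustrative sketch only; the temporary file and all variable names are made up for the demo:

```powershell
$tmp = [IO.Path]::GetTempFileName()
$original = "a`nb"   # LF-only newline, no trailing newline
[IO.File]::WriteAllText($tmp, $original)

# Approach 1: read as a single string, write as a single string.
$raw = Get-Content -Raw -Encoding Utf8 $tmp
[IO.File]::WriteAllText($tmp, $raw)
$afterRaw = [IO.File]::ReadAllText($tmp)

# Approach 2: read line by line, write with WriteAllLines().
$lines = Get-Content -Encoding Utf8 $tmp
[IO.File]::WriteAllLines($tmp, [string[]] $lines)
$afterLines = [IO.File]::ReadAllText($tmp)

Remove-Item -LiteralPath $tmp

$rawPreserved   = $afterRaw -eq $original     # content is unchanged
$linesPreserved = $afterLines -eq $original   # newlines were normalized and one was appended
$rawPreserved, $linesPreserved
```

With the second approach, $afterLines should end in the platform's newline sequence, which is why the comparison with the original content fails.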

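Note that the cross-platform PowerShell Core edition sensibly defaults to UTF-8 when reading a file without a BOM and also creates BOM-less UTF-8 files by default; creating a UTF-8 file with a BOM requires explicit opt-in via -Encoding utf8BOM.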

Therefore, a PowerShell Core solution is much simpler:

# PowerShell Core only.
$MyPath = "D:\my projects\etc"
Get-ChildItem $MyPath\* -Include *.h, *.cpp, *.c | Foreach-Object {
    # Note:
    #  * the absence of -Encoding, given that PowerShell Core defaults to
    #    UTF-8 when reading BOM-less files and to BOM-less UTF-8 when writing.
    #  * the use of -Raw to read the entire file as a *single string*.
    $content = Get-Content -Raw $_.FullName

    # Note the use of -NoNewline to prevent Set-Content from appending
    # a newline to the string that was read with -Raw.
    # CAVEAT: There's a slight risk of data loss if writing back to the input
    #         file is interrupted.
    Set-Content -NoNewline -LiteralPath $_.FullName -Value $content
}

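A faster, .NET-type-based solution

The solutions above work, but Get-Content and Set-Content are relatively slow, so using .NET types to read and rewrite the files will perform better.

As above, no encoding needs to be specified explicitly in the following solution (not even in Windows PowerShell), because .NET itself has defaulted to BOM-less UTF-8 since its inception (while still recognizing a UTF-8 BOM, if present):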

$MyPath = "D:\my projects\etc"
Get-ChildItem $MyPath\* -Include *.h, *.cpp, *.c | Foreach-Object {
  # CAVEAT: There's a slight risk of data loss if writing back to the input
  #         file is interrupted.
  [System.IO.File]::WriteAllText(
    $_.FullName,
    [System.IO.File]::ReadAllText($_.FullName)
  )   
}
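
The caveat in the code comments could be mitigated by writing to a sibling temporary file first and replacing the original only after the write has completed. This is a sketch of one possible approach, not something the original answer prescribes; the function name Write-TextViaTempFile and the ".tmp" naming scheme are hypothetical, and it assumes that no file of that name already exists next to the target.

```powershell
# Hypothetical helper: write BOM-less UTF-8 text to "<path>.tmp", then move
# the temporary file over the target so that an interrupted write leaves the
# original file untouched.
function Write-TextViaTempFile {
  param(
    [Parameter(Mandatory)] [string] $Path,
    [Parameter(Mandatory)] [AllowEmptyString()] [string] $Content
  )
  $tempPath = "$Path.tmp"   # assumption: this name is free to use
  [IO.File]::WriteAllText($tempPath, $Content)
  # -Force lets Move-Item replace the existing target file.
  Move-Item -LiteralPath $tempPath -Destination $Path -Force
}
```

Inside the loop, [System.IO.File]::WriteAllText($_.FullName, ...) would then become Write-TextViaTempFile -Path $_.FullName -Content ([System.IO.File]::ReadAllText($_.FullName)).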

Archived:	7 years, 3 months ago
Views:	6800
Last activity:	7 years, 3 months ago