Remove Unicode characters using PowerShell

Question


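I am having some problems using VLOOKUP in Excel. I can see what the problem is, but I have not found a solution yet.

My txt file has a large number of lines, and these lines contain Unicode characters.

Example:
This line: 'S0841488.JPG06082014‏‎08.21'
contains these two Unicode characters: U+200F U+200E
that is: 'S0841488.JPG06082014 U+200F U+200E 08.21'

Please tell me how to remove these Unicode characters using PowerShell.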

Answer 1

mkl*_*nt0 14

If you want to remove all characters outside the ASCII range (Unicode code point range U+0000 - U+007F):

```powershell
# Removes any non-ASCII characters from the LHS string,
# which includes the problematic hidden control characters.
# (The literal contains an invisible U+200F and U+200E between '2014' and '08.21'.)
'S0841488.JPG06082014‏‎08.21' -creplace '\P{IsBasicLatin}'
```

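The solution uses -creplace, the case-sensitive variant [1] of the regex-based -replace operator, together with \P, the negated form of a Unicode block-name match, applied to the block name IsBasicLatin, which refers to the ASCII sub-range of Unicode. In short: \P{IsBasicLatin} matches any non-ASCII character and, since no replacement string is specified, effectively removes it. Because -creplace always replaces all matches in the input string, all non-ASCII characters are removed.
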
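Since the question is about a text file with many lines, the same operator can be applied to all lines at once. The following is a minimal sketch, not part of the original answer: the file names (sample-in.txt and sample-out.txt in the temp folder) are hypothetical, the input is assumed to be UTF-8-encoded, and the sketch relies on -creplace acting on each element when its left-hand side is an array.

```powershell
# Hypothetical file names, for illustration only.
$inFile  = Join-Path ([IO.Path]::GetTempPath()) 'sample-in.txt'
$outFile = Join-Path ([IO.Path]::GetTempPath()) 'sample-out.txt'

# Create a sample input file; the [char] casts make the invisible marks explicit.
$rlm = [char] 0x200F   # RIGHT-TO-LEFT MARK
$lrm = [char] 0x200E   # LEFT-TO-RIGHT MARK
"S0841488.JPG06082014${rlm}${lrm}08.21", 'plain ASCII line' |
  Set-Content -LiteralPath $inFile -Encoding UTF8

# Get-Content returns an array of lines; -creplace processes each line.
$cleaned = (Get-Content -LiteralPath $inFile -Encoding UTF8) -creplace '\P{IsBasicLatin}'

# Write the cleaned lines to a separate file, leaving the input untouched.
$cleaned | Set-Content -LiteralPath $outFile -Encoding UTF8
```

Writing to a separate output file is a deliberate choice here, so that the original data can still be inspected if the result is not what was expected.
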
Notes:

If you want to remove characters outside the ISO-8859-1 range (Unicode code point range U+0000 - U+00FF), which includes accented characters such as é, use the following:

```powershell
# Removes non ISO-8859-1 characters.
# -> 'Café £', i.e. '€' and '—' (em dash) were removed,
# but 'é' and '£' were retained.
'Café €—£' -creplace '[^\p{IsBasicLatin}\p{IsLatin-1Supplement}]'
```

Caveat: ISO-8859-1 is largely, but not fully, the same as Windows-1252. One notable consequence is the absence of €, as shown above. You can manually include the missing characters in the character-set expression above ([...]) in order to achieve full Windows-1252 compatibility: €‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ

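One possible way to add those characters to the character set is sketched below. The variable names are illustrative only, and the sample string is made up for the demonstration; the extra characters are the ones listed in the caveat.

```powershell
# Characters that Windows-1252 defines beyond ISO-8859-1 (as listed in the caveat).
$extra = '€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ'

# Negated character set: matches anything that is neither ASCII,
# nor in the Latin-1 Supplement block, nor one of the extra characters.
# [regex]::Escape() is a precaution in case the list ever contains regex metacharacters.
$pattern = '[^\p{IsBasicLatin}\p{IsLatin-1Supplement}' + [regex]::Escape($extra) + ']'

# Sample input: Windows-1252 characters plus one invisible LEFT-TO-RIGHT MARK.
$sample = "Café €—£$([char] 0x200E)"
$result = $sample -creplace $pattern

# Only the invisible mark is removed, so the length drops by one.
"$($sample.Length) -> $($result.Length)"
```

If the code lives in a script file, the file's own encoding matters for the non-ASCII literal; building the list from code points would avoid that dependency at the cost of readability.
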
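Given the direct correspondence with Unicode code point ranges, you can also use the following solutions, which are more concise but less descriptive:

... -replace '[^\x00-\x7F]' keeps only ASCII-range characters.

... -replace '[^\x00-\xFF]' keeps only ISO-8859-1-range characters.
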
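As a quick check that the range-based patterns behave like their block-name counterparts, here is a small sketch with a made-up sample string. It uses -creplace rather than -replace, in line with the footnote about case-sensitive matching; the variable names are illustrative only.

```powershell
# Build the sample from code points so that the script itself stays ASCII-only.
$eAcute = [char] 0xE9     # U+00E9, inside the ISO-8859-1 range
$euro   = [char] 0x20AC   # U+20AC, outside the ISO-8859-1 range
$sample = "Caf${eAcute} ${euro}"

$asciiOnly  = $sample -creplace '[^\x00-\x7F]'   # drops both non-ASCII characters
$latin1Only = $sample -creplace '[^\x00-\xFF]'   # drops only the euro sign

# Compare each result (ordinally) with the block-name-based equivalent.
$sameAscii  = $asciiOnly.Equals(($sample -creplace '\P{IsBasicLatin}'))
$sameLatin1 = $latin1Only.Equals(($sample -creplace '[^\p{IsBasicLatin}\p{IsLatin-1Supplement}]'))

"ASCII-only length: $($asciiOnly.Length); Latin-1-only length: $($latin1Only.Length)"
```
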
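You can verify that this effectively removes the (invisible) LEFT-TO-RIGHT MARK (U+200E) and RIGHT-TO-LEFT MARK (U+200F) characters from the string with the help of the Debug-String function, which is available as an MIT-licensed Gist:
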
```powershell
# Download and define the Debug-String function.
# NOTE:
#  I can personally assure you that doing this is safe, but
#  you should always check the source code first.
irm https://gist.github.com/mklement0/7f2f1e13ac9c2afaf0a0906d08b392d1/raw/Debug-String.ps1 | iex

# Visualize the existing non-ASCII-range characters
'S0841488.JPG06082014‏‎08.21' | Debug-String -UnicodeEscapes

# Remove them and verify that they're gone.
'S0841488.JPG06082014‏‎08.21' -replace '\P{IsBasicLatin}' | Debug-String -UnicodeEscapes
```

The above yields the following:

```
S0841488.JPG06082014`u{200f}`u{200e}08.21
S0841488.JPG0608201408.21
```

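Note the visualization of the invisible control characters as `u{200f} and `u{200e} in the original input string, and how they are no longer present after applying the -replace operation.

In PowerShell (Core) 7+ (but not Windows PowerShell), such Unicode escape sequences can also be used in expandable strings, i.e. inside double-quoted string literals (e.g., "Hi`u{21}" expands to verbatim Hi!); see the conceptual about_Special_Characters help topic.
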
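If downloading a helper function is not an option, the code points can also be inspected with built-in means, and the two marks can be targeted specifically instead of removing every non-ASCII character. The following is a sketch under those assumptions and is not part of the original answer; the variable names are illustrative, and the string is built with [char] casts so that it works in Windows PowerShell as well.

```powershell
$rlm = [char] 0x200F   # RIGHT-TO-LEFT MARK
$lrm = [char] 0x200E   # LEFT-TO-RIGHT MARK
# In PowerShell 7+, "...`u{200F}`u{200E}..." would be an alternative spelling.
$line = "S0841488.JPG06082014${rlm}${lrm}08.21"

# List the code points of all non-ASCII characters in the string.
$found = [char[]] $line |
  Where-Object { [int] $_ -gt 0x7F } |
  ForEach-Object { 'U+{0:X4}' -f [int] $_ }

# Remove only the two marks; other non-ASCII characters would be left alone.
$clean = $line -creplace '[\u200E\u200F]'

"Found: $($found -join ', '); length $($line.Length) -> $($clean.Length)"
```

Targeting just U+200E and U+200F is the more conservative choice when the file may also contain legitimate non-ASCII text, such as accented letters, that should survive the cleanup.
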
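[1] See this answer for why case-sensitive matching must be used.
Note that even though the operator is case-sensitive, regex constructs that are inherently case-insensitive, such as \P{L}, still also exclude lowercase letters (whereas \P{Lu} / \P{Ll} would exclude only uppercase / lowercase letters).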

Archived:	4 years, 7 months ago
Views:	8385
Last recorded:	2 years, 4 months ago