Iterating over a string fails with StringIndexError

Question

Note: this question was prompted by this Discourse thread.

Consider the following example string:

str = "This is some text that initially consists of normal ASCII characters—but oh wait, the em-dash is only part of the extended ASCII character set!"

Attempting to iterate over this string using its length:

for i in 1:length(str)
  println(i, str[i])
end

fails with a StringIndexError partway through the loop, with the following message:

ERROR: StringIndexError("This is some text that initially consists of normal ASCII characters—but oh wait, the em-dash is only part of the extended ASCII character set!", 70)
Stacktrace:
 [1] string_index_err(::String, ::Int64) at ./strings/string.jl:12
 [2] getindex_continued(::String, ::Int64, ::UInt32) at ./strings/string.jl:217
 [3] getindex(::String, ::Int64) at ./strings/string.jl:210
 [4] top-level scope at ./REPL[4]:2

What exactly causes this behavior?

Answer 1

Wol*_*olf 5

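Strings in Julia fully support Unicode through the UTF-8 encoding. UTF-8 is a variable-width encoding, however, so the number of bytes used to encode a single character depends on the character.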

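Standard ASCII characters (code points below 128) use one byte each and produce the expected behavior during iteration. The em dash (U+2014), however, is not an ASCII character: UTF-8 encodes it in three bytes, here at byte indices 69 to 71. String indices in Julia refer to bytes, so stepping through the indices with a uniform step of 1 lands on index 70, in the middle of that character, and throws the error. For more information on strings and their behavior, see the documentation (in particular the "Unicode and UTF-8" section).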
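The variable width can be made visible with a minimal sketch. The short sample string `s` below is an assumption chosen for illustration (it is not the string from the question), and the em dash is written as the escape `\u2014`:

```julia
# Hypothetical sample: 'r', 's', an em dash (U+2014), then 'b', 'u', 't'.
s = "rs\u2014but"

# Number of UTF-8 code units (bytes) each character occupies.
bytes_per_char = [ncodeunits(string(c)) for c in s]
println(bytes_per_char)

# isvalid(s, i) is true only where a character starts.
valid = [isvalid(s, i) for i in 1:ncodeunits(s)]
println(valid)
```

Only the indices for which `isvalid` returns `true` can be passed to `s[i]`; the others fall inside the three-byte em dash.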

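Edit: As Stefan mentioned in the comments, note that length(str) behaves as expected and returns the actual number of characters in the string. The last valid index position can be retrieved with lastindex(str).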
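As a small illustration of the difference between the character count and the index range, here is a sketch on a hypothetical three-character string (again an assumption, not the string from the question):

```julia
# Hypothetical sample: 'a', an em dash (U+2014), 'b'.
s = "a\u2014b"

n_chars = length(s)       # number of characters
n_bytes = ncodeunits(s)   # number of UTF-8 code units (bytes)
last_i  = lastindex(s)    # byte index at which the last character starts

println((n_chars, n_bytes, last_i))
println(collect(eachindex(s)))   # the valid indices

# Index 3 is within 1:length(s) but falls inside the em dash.
err = try
    s[3]
catch e
    e
end
println(err isa StringIndexError)
```

For a pure-ASCII string the three numbers coincide, which is why `1:length(str)` appears to work until the first multi-byte character is reached.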

The error can be avoided in several ways, depending on the desired behavior:

Option 1: Iterate directly over the characters of the string
If the index is not relevant, this is the simplest approach:

for c in str
  println(c)
end

Option 2: Use eachindex to obtain the valid string indices
If the actual index positions in the string are relevant, one can do:

for bi in eachindex(str)
  println(bi, str[bi])
end

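Option 3: Use enumerate to get the linear index position and the character
If the "character" index (i.e. the ordinal position of the current character, not its byte index) and the corresponding character are relevant: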

for (ci, c) in enumerate(str)
  println(ci, c)
end
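When the byte index and the character are needed together, one further possibility (not part of the original answer, so treat it as a sketch) is to iterate over `pairs`, which for a string yields `index => character` pairs. The sample string is a hypothetical one:

```julia
# Hypothetical sample: 'a', ' ', U+2200 (three bytes), ' ', 'x'.
s = "a \u2200 x"

# Each pair holds the byte index at which the character starts, and the character.
for (bi, c) in pairs(s)
    println(bi, " => ", repr(c))
end

# Wrapping in enumerate adds the ordinal character position as well.
triples = [(ci, bi, c) for (ci, (bi, c)) in enumerate(pairs(s))]
println(triples)
```

This avoids the separate `str[bi]` lookup of Option 2 while still exposing indices that are safe to use for indexing.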

Edit 2: Added a small example to clarify. Take the string str = "a ∀ x ∃ y" as an example.

Option 1 returns:

julia> for c in str; print(c, " | "); end
a |   | ∀ |   | x |   | ∃ |   | y |

Option 2 returns:

julia> for bi in eachindex(str); print(bi, " ", str[bi], " | "); end
1 a | 2   | 3 ∀ | 6   | 7 x | 8   | 9 ∃ | 12   | 13 y |

Note, for example, the jump from 3 to 6: ∀ occupies three bytes (indices 3 to 5), so the next valid index is 6.

Option 3 returns:

julia> for (ci, c) in enumerate(str); print(ci, " ", c, " | "); end
1 a | 2   | 3 ∀ | 4   | 5 x | 6   | 7 ∃ | 8   | 9 y |

I think the problem is that you are using `ci` for both kinds of index, but they are very different:

julia> str = "∀ x ∃ y"
"∀ x ∃ y"

julia> collect(eachindex(str))
7-element Array{Int64,1}:
  1
  4
  5
  6
  7
 10
 11

julia> collect(enumerate(str))
7-element Array{Tuple{Int64,Char},1}:
 (1, '∀')
 (2, ' ')
 (3, 'x')
 (4, ' ')
 (5, '∃')
 (6, ' ')
 (7, 'y')

Note that the former indices are unevenly spaced, while the latter are always exactly `1:length(str)`. (3 upvotes)
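The two kinds of index can be converted into each other. The sketch below uses the three-argument `nextind` and `length` methods from Base; the helper names `char_to_byte` and `byte_to_char` are hypothetical and introduced only for illustration:

```julia
s = "\u2200 x \u2203 y"   # the string from the comment, written with escapes

# Byte index at which the n-th character starts (n counted from 1).
char_to_byte(str, n) = nextind(str, 0, n)

# Ordinal position of the character that starts at valid byte index i.
byte_to_char(str, i) = length(str, 1, i)

byte_idx = char_to_byte(s, 5)
println(byte_idx, " ", repr(s[byte_idx]))
println(byte_to_char(s, 10))

# The loop from the question, repaired by translating each character index.
chars = [s[char_to_byte(s, ci)] for ci in 1:length(s)]
println(chars)
```

Each conversion scans the string from the start, so the repaired loop takes quadratic time; for plain traversal, the options from the answer remain the better choice.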
