Nat*_*lin 2 dictionary if-statement julia
I start with a dictionary of word counts:
julia> import Iterators: partition
julia> import StatsBase: countmap
julia> s = split("the lazy fox jumps over the brown dog");
julia> vocab_counter = countmap(s)
Dict{SubString{String},Int64} with 7 entries:
"brown" => 1
"lazy" => 1
"jumps" => 1
"the" => 2
"fox" => 1
"over" => 1
"dog" => 1
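Then I want to count the number of n-grams for each word and store the result in a nested dictionary. The outer key is the n-gram, the inner key is the word, and the innermost value is the count of that n-gram for the given word.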
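The ngram function used in the code below is not defined anywhere in the post. Judging by the output, it yields tuples of consecutive characters, so a minimal stand-in might look like the following sketch. The name and signature mirror the call ngram(word, 2), but the body is an assumption, not the author's definition.

```julia
# Hypothetical stand-in for the `ngram` function, which the post does not show.
# Returns every run of `n` consecutive characters of `word` as a tuple.
function ngram(word::AbstractString, n::Integer)
    chars = collect(word)  # collect first so indexing is safe for non-ASCII text
    return [Tuple(chars[i:i+n-1]) for i in 1:(length(chars) - n + 1)]
end

ngram("the", 2)  # [('t', 'h'), ('h', 'e')]
```

A word shorter than n yields an empty vector, so the loops that consume it simply skip that word.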
I tried:
ngram_word_counter = Dict{Tuple,Dict}()
for (word, count) in vocab_counter
    for ng in ngram(word, 2) # bigrams.
        if ! haskey(ngram_word_counter, ng)
            ngram_word_counter[ng] = Dict{String,Int64}()
            ngram_word_counter[ng][word] = 0
        end
        ngram_word_counter[ng][word] += 1
    end
end
This gives me the data structure I need:
julia> ngram_word_counter
Dict{Tuple,Dict} with 20 entries:
('b','r') => Dict("brown"=>1)
('t','h') => Dict("the"=>1)
('o','w') => Dict("brown"=>1)
('z','y') => Dict("lazy"=>1)
('o','g') => Dict("dog"=>1)
('u','m') => Dict("jumps"=>1)
('o','x') => Dict("fox"=>1)
('e','r') => Dict("over"=>1)
('a','z') => Dict("lazy"=>1)
('p','s') => Dict("jumps"=>1)
('h','e') => Dict("the"=>1)
('d','o') => Dict("dog"=>1)
('w','n') => Dict("brown"=>1)
('m','p') => Dict("jumps"=>1)
('l','a') => Dict("lazy"=>1)
('o','v') => Dict("over"=>1)
('v','e') => Dict("over"=>1)
('r','o') => Dict("brown"=>1)
('f','o') => Dict("fox"=>1)
('j','u') => Dict("jumps"=>1)
But note that these values are wrong:
('t','h') => Dict("the"=>1)
('h','e') => Dict("the"=>1)
They should be:
('t','h') => Dict("the"=>2)
('h','e') => Dict("the"=>2)
since the word "the" appears twice.
On closer inspection, it seems that haskey(ngram_word_counter, ng) is always false =(
julia> ngram_word_counter = Dict{Tuple,Dict}()
for (word, count) in vocab_counter
    for ng in ngram(word, 2) # bigrams.
        println(haskey(ngram_word_counter, ng))
    end
end
[OUT]:
false
false
false
false
false
false
false
false
false
false
false
false
false
false
false
false
false
false
false
false
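Why is haskey() always false in this case?
TL;DR: it should be ngram_word_counter[ng][word] += count instead of ngram_word_counter[ng][word] += 1.
Adding just 1 ignores the repeated contributions of a word that occurs more than once. The number of times a word occurs is stored as the value in vocab_counter, which becomes the variable count in the for loop. The increment should therefore be count.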
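The later debugging check is invalid and, as often happens, a bug in the debugging code obscures the real problem: the loop never inserts anything into ngram_word_counter, so haskey can only return false. The intended check was probably: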
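julia> ngram_word_counter = Dict{Tuple,Dict}()
for (word, count) in vocab_counter
    for ng in ngram(word, 2) # bigrams.
        println(haskey(ngram_word_counter, ng))
        # Record the key so later haskey calls can see it. The value must be
        # a Dict here, because the dictionary's value type is Dict.
        ngram_word_counter[ng] = Dict{String,Int64}()
    end
end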
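The difference between the two debugging loops can be shown in isolation. The following sketch uses a hypothetical helper, haskey_trace, which visits a list of keys and records what haskey reports at each step, with or without inserting the key afterwards.

```julia
# Hypothetical helper: record what `haskey` returns for each key in order.
# When `insert` is false, nothing is ever stored, as in the question's check.
function haskey_trace(keys_in_order; insert::Bool)
    d = Dict{Tuple,Int}()
    trace = Bool[]
    for k in keys_in_order
        push!(trace, haskey(d, k))
        if insert
            d[k] = get(d, k, 0) + 1
        end
    end
    return trace
end

bigrams = [('t', 'h'), ('h', 'e'), ('t', 'h')]
haskey_trace(bigrams; insert=false)  # [false, false, false]
haskey_trace(bigrams; insert=true)   # [false, false, true]
```

With the data in the question, even a check that records keys prints false every time: vocab_counter holds each word once, and the seven words contribute 2 + 3 + 2 + 4 + 3 + 4 + 2 = 20 bigrams, all distinct, which matches the 20 entries in the result. haskey is therefore behaving correctly, and the wrong totals come only from incrementing by 1 instead of count.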