Suppose I have a variable containing the following words:
ChicKen120
Chicken1.20
Chicken(1.20)
Cow
cow.
cow/
cat
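I realize I can use grep("chicken", df$words, ignore.case=T) to find all words similar to chicken, but running it for every word would be tedious: first chicken, then cow, then cat, and so on.
Is there a way to find similar words across the whole column?
I would like to convert the similar words to one standard format: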
chicken(1.20)
chicken(1.20)
chicken(1.20)
cow
cow
cow
cat
Regarding your first question, you could try adist():
text <- c("ChicKen120", "Chicken1.20", "Chicken(1.20)", "Cow", "cow.", "cow/")
adist(text)
#     [,1] [,2] [,3] [,4] [,5] [,6]
#[1,]    0    2    4    9    9    9
#[2,]    2    0    2   10    9   10
#[3,]    4    2    0   12   11   12
#[4,]    9   10   12    0    2    2
#[5,]    9    9   11    2    0    1
#[6,]    9   10   12    2    1    0
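Each matrix entry is the edit distance between two of the six words, so entries with a value of 2 or less identify pairs of words that differ by at most two characters.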
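To see where a single entry comes from, two of the words can be compared directly. The sketch below also tries the ignore.case argument of adist(), which the answer itself does not use; it is shown only as an option, given that the grep() call in the question ignored case.

```r
# Default (case-sensitive): substituting "K" -> "k" and inserting "." costs 2 edits
d_case <- adist("ChicKen120", "Chicken1.20")[1, 1]

# With ignore.case = TRUE, only the inserted "." counts, so the distance is 1
d_nocase <- adist("ChicKen120", "Chicken1.20", ignore.case = TRUE)[1, 1]

c(case_sensitive = d_case, case_insensitive = d_nocase)
```

Whether case should count as a difference depends on the data; for the words in the question, ignoring it makes the variants of the same word look closer.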
More specifically, the pairs of words that are not identical and differ by at most two characters can be listed like this:
which(adist(text)<=2 & upper.tri(adist(text)), arr.ind=T)
#     row col
#[1,]   1   2
#[2,]   2   3
#[3,]   4   5
#[4,]   4   6
#[5,]   5   6
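Here the logical function upper.tri() is used only to select the upper triangle of the matrix. This prevents each pair from being output twice (that is, repeated in reverse order) and removes the identical pairs on the diagonal.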
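The effect of the upper.tri() mask can be checked by counting matches with and without it. This is a small sketch; storing the distance matrix in a variable named d is a convenience introduced here so that adist() is only called once.

```r
text <- c("ChicKen120", "Chicken1.20", "Chicken(1.20)", "Cow", "cow.", "cow/")
d <- adist(text)  # compute the distance matrix once and reuse it

# upper.tri() is TRUE strictly above the diagonal and FALSE elsewhere
mask_example <- upper.tri(matrix(0, 3, 3))

# Without the mask: 6 diagonal zeros plus each of the 5 close pairs counted twice
n_unmasked <- sum(d <= 2)

# With the mask: each close pair is counted exactly once
n_masked <- sum(d <= 2 & upper.tri(d))

c(unmasked = n_unmasked, masked = n_masked)
```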
The words corresponding to the row and column numbers listed above can be retrieved like this:
words <- text[which(adist(text)<=2 & upper.tri(adist(text)), arr.ind=T)]
matrix(words,ncol=2)
#     [,1]          [,2]
#[1,] "ChicKen120"  "Chicken1.20"
#[2,] "Chicken1.20" "Chicken(1.20)"
#[3,] "Cow"         "cow."
#[4,] "Cow"         "cow/"
#[5,] "cow."        "cow/"
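For the second part of the question (converting similar words to one standard format), one possible approach is to cluster the words on their distances and map each cluster to a label. The following is only a sketch under several assumptions that are not in the answer above: the distances are computed with ignore.case = TRUE and divided by the length of the longer word, the cut-off of 0.3 is tuned by hand to this example, single-linkage clustering via hclust() is one choice among many, and the vector canonical holds labels picked by hand from the desired output in the question. The word "cat" from the question is added to the input.

```r
text <- c("ChicKen120", "Chicken1.20", "Chicken(1.20)", "Cow", "cow.", "cow/", "cat")

# Edit distances, ignoring case
d <- adist(text, ignore.case = TRUE)

# Normalize by the longer word of each pair; otherwise short words such as
# "cat" and "cow" (raw distance 2) look as close as the chicken variants
len <- nchar(text)
d_norm <- d / outer(len, len, pmax)

# Single-linkage clustering, cut at an assumed threshold of 0.3
tree <- hclust(as.dist(d_norm), method = "single")
groups <- cutree(tree, h = 0.3)

# Hand-picked label for each group, in order of first appearance (assumption)
canonical <- c("chicken(1.20)", "cow", "cat")
standardized <- canonical[groups]

data.frame(original = text, group = groups, standardized = standardized)
```

The normalization step matters for this data: with raw distances and a cut-off of 2, "cat" would be merged into the cow group. On other data the threshold and the labels would need to be checked by eye before being trusted.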