Using cosine similarity to filter out similar strings in a String vector

ALE*_*HEW 5 r

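I have a vector of strings. Some of the strings in the vector (possibly more than two) are similar to each other in terms of the words they contain. I want to filter out the strings that have a cosine similarity of more than 30% with any other string in the vector. Of the two strings being compared, I want to keep the one with more words. That is, I only want those strings that have less than 30% similarity with any string of the original vector. My aim is to filter out similar strings and keep only the roughly distinct ones.

For example, the vector is: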

x <- c("Dan is a good man and very smart", "A good man is rare", "Alex can be trusted with anything", "Dan likes to share his food", "Rare are man who can be trusted", "Please share food")

The result should be (assuming a similarity below 30%):

c("Dan is a good man and very smart", "Dan likes to share his food", "Rare are man who can be trusted")

The result above has not been verified.

The cosine code I am using:

library(tm)  # VCorpus, VectorSource, DocumentTermMatrix, weightTf

CSString_vector <- c("String One", "String Two")
corp <- VCorpus(VectorSource(CSString_vector))
controlForMatrix <- list(removePunctuation = TRUE, wordLengths = c(1, Inf),
                         weighting = weightTf)
dtm <- DocumentTermMatrix(corp, control = controlForMatrix)
matrix_of_vector <- as.matrix(dtm)
# cosine similarity between the term-count rows of the two strings
res <- lsa::cosine(matrix_of_vector[1, ], matrix_of_vector[2, ])

I am working in RStudio.

Jan*_*uGe 2

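So, to rephrase what you want: you want to compute the pairwise similarity for all pairs of strings. You then want to use that similarity matrix to identify groups of strings that are dissimilar enough to form distinct groups. For each group, you want to drop all but the longest string and return it. Did I get that right?

After some experimenting, here is the solution I came up with, step by step:

  • Compute the similarity matrix and binarize it using a threshold
  • Identify the distinct groups (cliques) using graph algorithms from the igraph package
  • Find all strings in each clique and keep the longest one

Note: I had to adjust the threshold to 0.4 to make your example work.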
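To get a feel for the scale of these scores, here is a minimal base-R sketch of cosine similarity on raw term counts. The helper names `tokenize` and `cosine_tf` are made up for illustration, and the tokenization (lower-casing, stripping punctuation, splitting on whitespace) is an assumption that only approximates the tm pre-processing used in this answer.

```r
# Split a string into lower-case word tokens, dropping punctuation.
tokenize <- function(s) {
  s <- tolower(gsub("[[:punct:]]", "", s))
  strsplit(trimws(s), "\\s+")[[1]]
}

# Cosine similarity of two strings based on raw term counts.
cosine_tf <- function(a, b) {
  ta <- tokenize(a)
  tb <- tokenize(b)
  terms <- union(ta, tb)
  # count vectors over the shared vocabulary (zero where a term is absent)
  va <- as.numeric(table(factor(ta, levels = terms)))
  vb <- as.numeric(table(factor(tb, levels = terms)))
  sum(va * vb) / sqrt(sum(va^2) * sum(vb^2))
}

# 4 shared terms, 8 and 5 distinct terms: 4 / sqrt(8 * 5)
s12 <- cosine_tf("Dan is a good man and very smart", "A good man is rare")
# 2 shared terms, 5 and 7 distinct terms: 2 / sqrt(5 * 7)
s25 <- cosine_tf("A good man is rare", "Rare are man who can be trusted")
```

With these counts the first pair scores about 0.63 and the second about 0.34, so the second pair counts as similar at a 0.3 threshold but not at 0.4.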


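Similarity matrix

This is largely based on the code you provided, but I wrapped it in a function and used the tidyverse to make the code (at least to my taste) more readable.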

library(tm)
library(lsa)
library(tidyverse)

get_cos_sim <- function(corpus) {
  # pre-process corpus
  doc <- corpus %>%
    VectorSource %>%
    tm::VCorpus()
  # get term frequency matrix
  tfm <- doc %>%
    DocumentTermMatrix(
      control = list(
        removePunctuation = TRUE,
        wordLengths = c(1, Inf),
        weighting = weightTf)) %>%
    as.matrix()
  # get row-wise similarity
  sim <- NULL
  for (i in seq_len(nrow(tfm))) {
    sim_i <- apply(
      X = tfm, 
      MARGIN = 1, 
      FUN = lsa::cosine, 
      tfm[i,])
    sim <- rbind(sim, sim_i)
  }
  # set identity diagonal to zero
  diag(sim) <- 0
  # label and return
  rownames(sim) <- corpus
  return(sim)
}

Now we apply this function to your example data:
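The row-by-row loop can also be written as a single matrix operation. The following is a sketch only, not part of the original answer: `cos_sim_matrix` is a hypothetical name, and it assumes a numeric term-count matrix with one row per string and no all-zero rows.

```r
# Pairwise cosine similarity between the rows of a term-count matrix.
cos_sim_matrix <- function(tfm) {
  norms <- sqrt(rowSums(tfm^2))                   # Euclidean norm of each row
  sim <- tcrossprod(tfm) / outer(norms, norms)    # dot products / norm products
  diag(sim) <- 0                                  # ignore self-similarity
  sim
}

# toy term-count matrix: 3 documents, 3 terms
toy <- rbind(c(1, 1, 0),
             c(1, 0, 1),
             c(0, 0, 2))
toy_sim <- cos_sim_matrix(toy)
```

This avoids growing `sim` with `rbind` inside a loop, which matters mainly for larger corpora.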

# example corpus
strings <- c(
  "Dan is a good man and very smart", 
  "A good man is rare", 
  "Alex can be trusted with anything", 
  "Dan likes to share his food", 
  "Rare are man who can be trusted", 
  "Please share food")

# get pairwise similarities
sim <- get_cos_sim(strings)
# binarize (using a different threshold to make your example work)
sim <- sim > .4  
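
Identifying distinct groups

This turned out to be an interesting problem! I found this paper, Chalermsook & Chuzhoy: Maximum Independent Set of Rectangles, which led me to this implementation in the igraph package. Basically, we treat similar strings as connected vertices in a graph and then look for distinct groups in the graph of the whole similarity matrix.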

library(igraph)

# create graph from adjacency matrix
cliques <- sim %>% 
  tibble::as_tibble() %>%   # dplyr::as_data_frame() is deprecated
  mutate(from = row_number()) %>% 
  gather(key = 'to', value = 'edge', -from) %>% 
  filter(edge == TRUE) %>%
  graph_from_data_frame(directed = FALSE) %>%
  max_cliques()
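
Finding the longest string

Now we can use the list of cliques to retrieve the strings for each clique's vertices and pick the longest string per clique. Note: strings that have no similar string in the corpus are missing from the graph. I am adding them back manually. There may be a function in the igraph package that handles this better; I would be interested if anyone finds something.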
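The snippet that follows calls `find_longest`, which is not defined anywhere in this answer. A minimal sketch, assuming it takes a vector of string indices plus the full string vector and returns the string with the most characters (the question asks for the most words, so counting words instead would be an equally plausible reading):

```r
# Assumed helper, not shown in the original answer:
# return the longest string (by character count) among strings[index].
find_longest <- function(index, strings) {
  candidates <- strings[index]
  candidates[which.max(nchar(candidates))]
}

# a clique of strings 1 and 2 keeps the longer one
find_longest(c(1, 2), c("A good man is rare", "Dan is a good man and very smart"))
```

A single index, as used for the unconnected strings, simply returns that string.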

# get the string indices per vertex clique first
string_cliques_index <- cliques %>% 
  unlist %>%
  names %>%
  as.numeric
# find the indices that are distinct but not in a clique
# (i.e. unconnected vertices)
string_uniques_index <- colnames(sim)[!colnames(sim) %in% string_cliques_index] %>%
  as.numeric
# get a list with all indices
all_distinct <- cliques %>% 
  lapply(names) %>% 
  lapply(as.numeric) %>%
  c(string_uniques_index)
# get a list of distinct strings
lapply(all_distinct, find_longest, strings)  

Test case:

Let's test this with a longer vector of different strings:

strings <- c(
  "Dan is a good man and very smart", 
  "A good man is rare", 
  "Alex can be trusted with anything", 
  "Dan likes to share his food", 
  "Rare are man who can be trusted", 
  "Please share food",
  "NASA is a government organisation",
  "The FBI organisation is part of the government of USA",
  "Hurricanes are a tragedy",
  "Mangoes are very tasty to eat ",
  "I like to eat tasty food",
  "The thief was caught by the FBI")

I get this binarized similarity matrix:

Dan is a good man and very smart                      FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
A good man is rare                                     TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Alex can be trusted with anything                     FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Dan likes to share his food                           FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Rare are man who can be trusted                       FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Please share food                                     FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
NASA is a government organisation                     FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
The FBI organisation is part of the government of USA FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
Hurricanes are a tragedy                              FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Mangoes are very tasty to eat                         FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
I like to eat tasty food                              FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
The thief was caught by the FBI                       FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE

Based on these similarities, the expected result would be:

# included
Dan is a good man and very smart
Alex can be trusted with anything
Dan likes to share his food
NASA is a government organisation
The FBI organisation is part of the government of USA
Hurricanes are a tragedy
Mangoes are very tasty to eat

# omitted
A good man is rare
Rare are man who can be trusted
Please share food
I like to eat tasty food
The thief was caught by the FBI

The actual output has the correct elements, but not in the original order. You can reorder it using the original string vector:
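As a cross-check that does not depend on igraph, the same kind of list can be produced with a simple greedy pass over a binarized similarity matrix. This is a different technique from the clique approach in this answer, and `greedy_filter` is a hypothetical name. It visits strings from longest to shortest and keeps one only if it is not similar to anything already kept, so it can give a different result when similar strings form chains.

```r
# Greedy filter: walk from the longest to the shortest string and keep a
# string only if it is not similar to any string that was already kept.
greedy_filter <- function(strings, similar) {
  keep <- integer(0)
  for (i in order(nchar(strings), decreasing = TRUE)) {
    if (!any(similar[i, keep])) keep <- c(keep, i)
  }
  strings[sort(keep)]   # sort() restores the original order
}

# toy data: strings 1 and 2 are similar, strings 3 and 4 are similar
toy_strings <- c("a b c d", "a b c", "x y", "x y z w v")
toy_similar <- matrix(FALSE, 4, 4)
toy_similar[1, 2] <- toy_similar[2, 1] <- TRUE
toy_similar[3, 4] <- toy_similar[4, 3] <- TRUE

toy_kept <- greedy_filter(toy_strings, toy_similar)
```

Here `similar` is assumed to be a symmetric logical matrix with a `FALSE` diagonal, like the binarized `sim` above.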

[[1]]
[1] "The FBI organisation is part of the government of USA"

[[2]]
[1] "Dan is a good man and very smart"

[[3]]
[1] "Alex can be trusted with anything"

[[4]]
[1] "Dan likes to share his food"

[[5]]
[1] "Mangoes are very tasty to eat "

[[6]]
[1] "NASA is a government organisation"

[[7]]
[1] "Hurricanes are a tragedy"

That's it! I hope this is what you were looking for, and that it may be useful to others.
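Restoring the original order takes one line of base R. A small sketch, assuming the kept strings arrive as a list like the output above and that the strings are unique; `restore_order` is a made-up name:

```r
# Return the kept strings in the order they appear in the original vector.
restore_order <- function(kept, strings) {
  strings[strings %in% unlist(kept)]
}

original <- c("first string", "second string", "third string")
kept     <- list("third string", "first string")
ordered  <- restore_order(kept, original)
```

Because the lookup compares whole strings, details such as the trailing space in "Mangoes are very tasty to eat " are preserved as long as they match the original vector exactly.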