How to find similar sentences/phrases in R?

sgt*_*per 7 statistics nlp r

For example, I have billions of phrases, and I want to group the similar ones into clusters.

strings.to.cluster <- c("Best Toyota dealer in bay area. Drive out with a new car today",
                        "Largest Selection of Furniture. Stock updated everyday",
                        " Unique selection of Handcrafted Jewelry",
                        "Free Shipping for orders above $60. Offer Expires soon",
                        "XXXX is where smart men buy anniversary gifts",
                        "2012 Camrys on Sale. 0% APR for select customers",
                        "Closing Sale on office desks. All Items must go"
                        )

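Suppose this vector has hundreds of thousands of rows. Is there a package in R that clusters these phrases by meaning? Or can someone suggest a way to rank "similar" phrases by how close their meaning is to a given phrase?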

Vin*_*ynd 8

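You can view your phrases as "bags of words", i.e., build a matrix (a "term-document" matrix) with one row per phrase and one column per word, containing 1 if the word occurs in the phrase and 0 otherwise. (You can replace the 1 with a weight that accounts for phrase length and word frequency.) You can then apply any clustering algorithm. The tm package can help you build this matrix. Note that tm's TermDocumentMatrix puts terms in rows and phrases in columns, which is why the code transposes the matrix before clustering.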

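library(tm)
library(Matrix)
# Build the term-document matrix: one row per term, one column per phrase
x <- TermDocumentMatrix( Corpus( VectorSource( strings.to.cluster ) ) )
# Convert tm's triplet representation (i, j, v) into a sparse Matrix;
# passing dims explicitly keeps the shape right even if a phrase yields no terms
y <- sparseMatrix( i=x$i, j=x$j, x=x$v, dims = dim(x), dimnames = dimnames(x) )
# Transpose so that phrases are rows, then cluster hierarchically
plot( hclust(dist(t(y))) )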
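The answer mentions replacing the 0/1 entries with weights, and the question asks for a way to rank phrases by similarity to a given one. The following is a minimal sketch of one way this could look in base R, assuming TF-IDF as the weight and cosine similarity as the measure; neither choice comes from the answer itself, and the names `tfidf`, `sim`, and `rank_similar` are made up for illustration. It captures word overlap only, not meaning in any deeper sense.

```r
# Phrases from the question
phrases <- c(
  "Best Toyota dealer in bay area. Drive out with a new car today",
  "Largest Selection of Furniture. Stock updated everyday",
  "Unique selection of Handcrafted Jewelry",
  "Free Shipping for orders above $60. Offer Expires soon",
  "XXXX is where smart men buy anniversary gifts",
  "2012 Camrys on Sale. 0% APR for select customers",
  "Closing Sale on office desks. All Items must go"
)

# Tokenize: lower-case, split on anything that is not a letter or digit
tokens <- strsplit(tolower(phrases), "[^a-z0-9]+")
tokens <- lapply(tokens, function(w) w[nzchar(w)])
vocab  <- sort(unique(unlist(tokens)))

# Term-frequency matrix: one row per phrase, one column per word
tf <- t(vapply(tokens,
               function(w) as.numeric(table(factor(w, levels = vocab))),
               numeric(length(vocab))))
colnames(tf) <- vocab

# Assumed weighting: TF-IDF, which down-weights words found in many phrases
idf   <- log(nrow(tf) / colSums(tf > 0))
tfidf <- sweep(tf, 2, idf, `*`)

# Cosine similarity between phrases (assumes no phrase has an all-zero row)
unit <- tfidf / sqrt(rowSums(tfidf^2))
sim  <- tcrossprod(unit)

# Hypothetical helper: rank all other phrases by similarity to phrase i
rank_similar <- function(i) {
  ord <- order(sim[i, ], decreasing = TRUE)
  ord <- ord[ord != i]
  data.frame(phrase = phrases[ord], similarity = sim[i, ord])
}

# Clustering on cosine distance, cut into an arbitrarily chosen 4 groups
groups <- cutree(hclust(as.dist(1 - sim), method = "average"), k = 4)
```

Calling `rank_similar(6)` lists the other phrases in decreasing order of similarity to the Camry ad. This sketch builds dense matrices, so for hundreds of thousands of phrases a sparse representation such as the one in the answer above would be needed.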