Finding similar texts based on paraphrase detection

Sum*_* TV 0 nlp text-mining nltk semantic-analysis

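I am interested in finding similar content (text) based on paraphrasing. How can I do this? Are there any specific tools that can do it? Preferably in Python.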

IVR*_*IVR 5

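I believe the tool you are looking for is Latent Semantic Analysis (LSA).

Given that my post is going to be quite lengthy, I won't explain the theory behind it in detail. If you think it is indeed what you are looking for, I recommend you look it up. The best place to start is: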

http://staff.scm.uws.edu.au/~lapark/lt.pdf

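In summary, LSA attempts to uncover the hidden (latent) meaning of words and phrases, based on the assumption that similar words appear in similar documents. I'll use R to demonstrate how it works.

I'm going to set up a function that retrieves similar documents based on their latent meaning: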
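To make the idea concrete before the full function, here is a minimal base-R sketch on a made-up term-document matrix. The matrix, the term names and the helper `cosine_sim` are illustrative assumptions, not part of the answer's code; the query projection mirrors the `t(q) %*% tk %*% diag(sk)` step used in the function further down.

```r
# Toy term-document matrix (made-up counts): rows are terms, columns are documents.
tdm <- matrix(c(1, 1, 0, 0,
                1, 0, 0, 0,
                0, 1, 0, 0,
                0, 0, 2, 2,
                0, 0, 1, 0,
                0, 0, 0, 1),
              nrow = 6, byrow = TRUE,
              dimnames = list(c("trade", "tariff", "export",
                                "health", "hospital", "nurse"),
                              c("d1", "d2", "d3", "d4")))

# Truncated SVD: keep only the k largest singular values.
k  <- 2
s  <- svd(tdm)
tk <- s$u[, 1:k, drop = FALSE]   # term vectors
sk <- s$d[1:k]                   # singular values
dk <- s$v[, 1:k, drop = FALSE]   # document vectors

# Hypothetical helper: cosine similarity between two numeric vectors.
cosine_sim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Query containing only the term "tariff", projected into the latent space.
q     <- as.numeric(rownames(tdm) == "tariff")
q_lat <- as.vector(t(q) %*% tk %*% diag(sk, nrow = k))

# Similarity of the query to every document in the latent space.
sims <- apply(dk, 1, function(d) cosine_sim(q_lat, d))
names(sims) <- colnames(tdm)
round(sims, 3)

# For comparison: similarity to d2 in the raw term space.
cosine_sim(q, tdm[, "d2"])
```

In this toy example, d2 never mentions "tariff", so its raw term-space similarity to the query is zero, yet it scores about as high as d1 in the latent space because both documents share "trade". That is the kind of indirect link LSA is meant to surface.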

# Setting up all the needed functions:

SemanticLink = function(text,expression,LSAS,n=length(text),Out="Text"){ 

  # Query Vector
  LookupPhrase = function(phrase,LSAS){ 
    lsatm = as.textmatrix(LSAS) 
    QV = function(phrase){ 
      q = query(phrase,rownames(lsatm)) 
      t(q)%*%LSAS$tk%*%diag(LSAS$sk) 
    } 

    q = QV(phrase) 
    qd = 0 

    for (i in 1:nrow(LSAS$dk)){ 
      qd[i] <- cosine(as.vector(q),as.vector(LSAS$dk[i,])) 
    }  
    qd  
  } 

  # Handling Synonyms
  Syns = function(word){   
    wl    =   gsub("(.*[[:space:]].*)","", 
                   gsub("^c\\(|[[:punct:]]+|^[[:space:]]+|[[:space:]]+$","", 
                        unlist(strsplit(PlainTextDocument(synonyms(word)),",")))) 
    wl = wl[wl!=""] 
    return(wl)  
  } 

  ex = unlist(strsplit(expression," "))
  for(i in seq(ex)){ex = c(ex,Syns(ex[i]))}
  ex = unique(wordStem(ex))

  cache = LookupPhrase(paste(ex,collapse=" "),LSAS) 

  if(Out=="Text"){return(text[which(match(cache,sort(cache,decreasing=T)[1:n])!="NA")])} 
  if(Out=="ValuesSorted"){return(sort(cache,decreasing=T)[1:n]) } 
  if(Out=="Index"){return(which(match(cache,sort(cache,decreasing=T)[1:n])!="NA"))} 
  if(Out=="ValuesUnsorted"){return(cache)} 

} 

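Note that we use synonyms when assembling the query vector. This approach is not perfect, because some of the synonyms in the qdap library are remote at best, and this may interfere with your search query. To get more accurate but less general results, you can simply remove the synonym part and manually select all the relevant terms that make up your query vector.

Let's try it out. I'll also be using the US Congress dataset from the RTextTools package: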
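As a rough sketch of the manual alternative, the query terms can be expanded from a hand-curated table instead of `synonyms()`. The table `manual_synonyms` and the helper `expand_query` are hypothetical names introduced here for illustration, and stemming is left out for brevity.

```r
# Hypothetical hand-picked synonym table (illustrative entries only).
manual_synonyms <- list(trade   = c("commerce", "export"),
                        support = c("aid"))

# Split the expression into words and append only the curated synonyms.
expand_query <- function(expression, table) {
  words <- unlist(strsplit(expression, " "))
  extra <- unlist(table[words], use.names = FALSE)  # words without an entry add nothing
  unique(c(words, extra))
}

expand_query("support trade", manual_synonyms)
# [1] "support"  "trade"    "aid"      "commerce" "export"
```

The resulting terms could then be collapsed with `paste(..., collapse = " ")` and passed on as the query phrase, in place of the automatic synonym lookup.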

library(tm)
library(RTextTools)
library(lsa)
library(data.table)
library(stringr)
library(qdap)

data(USCongress)

text = as.character(USCongress$text)

corp = Corpus(VectorSource(text)) 

parameters = list(minDocFreq        = 1, 
                  wordLengths       = c(2,Inf), 
                  tolower           = TRUE, 
                  stripWhitespace   = TRUE, 
                  removeNumbers     = TRUE, 
                  removePunctuation = TRUE, 
                  stemming          = TRUE, 
                  stopwords         = TRUE, 
                  tokenize          = NULL, 
                  weighting         = function(x) weightSMART(x,spec="ltn"))

tdm = TermDocumentMatrix(corp,control=parameters)
tdm.reduced = removeSparseTerms(tdm,0.999)

# setting up LSA space - this may take a little while...
td.mat = as.matrix(tdm.reduced) 
td.mat.lsa = lw_bintf(td.mat)*gw_idf(td.mat) # you can experiment with weightings here
lsaSpace = lsa(td.mat.lsa,dims=dimcalc_raw()) # you don't have to keep all dimensions
lsa.tm = as.textmatrix(lsaSpace)

l = 50 
exp = "support trade" 
SemanticLink(text,exp,n=5,lsaSpace,Out="Text") 

[1] "A bill to amend the Internal Revenue Code of 1986 to provide tax relief for small businesses, and for other purposes."                                                                       
[2] "A bill to authorize the Secretary of Transportation to issue a certificate of documentation with appropriate endorsement for employment in the coastwise trade for the vessel AJ."           
[3] "A bill to authorize the Secretary of Transportation to issue a certificate of documentation with appropriate endorsement for employment in the coastwise trade for the yacht EXCELLENCE III."
[4] "A bill to authorize the Secretary of Transportation to issue a certificate of documentation with appropriate endorsement for employment in the coastwise trade for the vessel M/V Adios."    
[5] "A bill to amend the Internal Revenue Code of 1986 to provide tax relief for small business, and for other purposes." 

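As you can see, although "support trade" may not appear as such in the examples above, the function has retrieved a set of documents that are relevant to the query. The function is designed to retrieve documents with a semantic link rather than exact matches.

We can also see how "close" these documents are to the query vector by plotting the cosine similarities: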

plot(1:l,SemanticLink(text,exp,lsaSpace,n=l,Out="ValuesSorted") 
     ,type="b",pch=16,col="blue",main=paste("Query Vector Proximity",exp,sep=" "), 
     xlab="observations",ylab="Cosine") 

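I don't have enough reputation to post the plot yet, sorry.

As you would see, the first two entries appear to be more closely associated with the query vector than the rest (although about 5 entries are particularly relevant), even though you would not think so from reading them. I would say this is an effect of using synonyms to build the query vector. Ignoring that, however, the graph lets us see how many other documents are remotely similar to the query vector.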


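EDIT:


Just recently I had to solve the same problem you are trying to solve, but the function above would not work well, simply because the data was very poor: the texts were short, there were few of them, and not many topics were covered. So, to find relevant entries in that situation, I developed another function that is based purely on regular expressions.

Here it is: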

HLS.Extract = function(pattern,text=active.text){


  require(qdap)
  require(tm)
  require(RTextTools)

  p = unlist(strsplit(pattern," "))
  p = unique(wordStem(p))
  p = gsub("(.*)i$","\\1y",p)

  Syns = function(word){   
    wl    =   gsub("(.*[[:space:]].*)","",      
                   gsub("^c\\(|[[:punct:]]+|^[[:space:]]+|[[:space:]]+$","",  
                        unlist(strsplit(PlainTextDocument(synonyms(word)),",")))) 
    wl = wl[wl!=""] 
    return(wl)     
  } 

  trim = function(x){

    temp_L  = nchar(x)
    if(temp_L < 5)                {N = 0}
    if(temp_L > 4 && temp_L < 8)  {N = 1}
    if(temp_L > 7 && temp_L < 10) {N = 2}
    if(temp_L > 9)                {N = 3}
    x = substr(x,0,nchar(x)-N)
    x = gsub("(.*)","\\1\\\\\\w\\*",x)

    return(x)
  }

  # SINGLE WORD SCENARIO

  if(length(p)<2){

    # EXACT
    p = trim(p)
    ndx_exact  = grep(p,text,ignore.case=T)
    text_exact = text[ndx_exact]

    # SEMANTIC
    p = unlist(strsplit(pattern," "))

    express  = new.exp = list()
    express  = c(p,Syns(p))
    p        = unique(wordStem(express))

    temp_exp = unlist(strsplit(express," "))
    temp.p = double(length(seq(temp_exp)))

    for(j in seq(temp_exp)){
      temp_exp[j] = trim(temp_exp[j])
    }

    rgxp   = paste(temp_exp,collapse="|")
    ndx_s  = grep(paste(temp_exp,collapse="|"),text,ignore.case=T,perl=T)
    text_s = as.character(text[ndx_s])

    f.object = list("ExactIndex"    = ndx_exact,
                    "SemanticIndex" = ndx_s,
                    "ExactText"     = text_exact,
                    "SemanticText"  = text_s)
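  } else {

    # TWO OR MORE WORDS
    # (an else branch, because the single-word case above reassigns p
    #  to a stemmed synonym list that may have length > 1)
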

    require(combinat)

    # EXACT
    for(j in seq(p)){p[j] = trim(p[j])}

    fp     = factorial(length(p))
    pmns   = permn(length(p))
    tmat   = matrix(0,fp,length(p))
    permut = double(fp)
    temp   = double(length(p))
    for(i in 1:fp){
      tmat[i,] = pmns[[i]]
    }

    for(i in 1:fp){
      for(j in seq(p)){
        temp[j] = paste(p[tmat[i,j]])
      }
      permut[i] = paste(temp,collapse=" ")
    }

    permut = gsub("[[:space:]]",
                  "[[:space:]]+([[:space:]]*\\\\w{,3}[[:space:]]+)*(\\\\w*[[:space:]]+)?([[:space:]]*\\\\w{,3}[[:space:]]+)*",permut)

    ndx_exact  = grep(paste(permut,collapse="|"),text)
    text_exact = as.character(text[ndx_exact])


    # SEMANTIC

    p = unlist(strsplit(pattern," "))
    express = list()
    charexp = permut = double(length(p))
    for(i in seq(p)){
      express[[i]] = c(p[i],Syns(p[i]))
      express[[i]] = unique(wordStem(express[[i]]))
      express[[i]] = gsub("(.*)i$","\\1y",express[[i]])
      for(j in seq(express[[i]])){
        express[[i]][j] = trim(express[[i]][j])
      }
      charexp[i] = paste(express[[i]],collapse="|")
    }

    charexp  = gsub("(.*)","\\(\\1\\)",charexp)
    charexpX = double(length(p))
    for(i in 1:fp){
      for(j in seq(p)){
        temp[j] = paste(charexp[tmat[i,j]])
      }
      permut[i] = paste(temp,collapse=
                          "[[:space:]]+([[:space:]]*\\w{,3}[[:space:]]+)*(\\w*[[:space:]]+)?([[:space:]]*\\w{,3}[[:space:]]+)*")
    }
    rgxp   = paste(permut,collapse="|")
    ndx_s  = grep(rgxp,text,ignore.case=T)
    text_s = as.character(text[ndx_s])

    temp.f = function(x){
      if(length(x)==0){x=0}
    }

    temp.f(ndx_exact);  temp.f(ndx_s)
    temp.f(text_exact); temp.f(text_s)

    f.object = list("ExactIndex"    = ndx_exact,
                    "SemanticIndex" = ndx_s,
                    "ExactText"     = text_exact,
                    "SemanticText"  = text_s,
                    "Synset"        = express)

  }
  # Print the summary before returning; code placed after return() never runs
  cat(paste("Exact Matches:",length(ndx_exact),sep=""))
  cat(paste("\n"))
  cat(paste("Semantic Matches:",length(ndx_s),"\n",sep=""))
  return(f.object)
}

Try it out:

HLS.Extract("buy house",
            c("we bought a new house",
              "I'm thinking about buying a new home",
              "purchasing a brand new house"))[["SemanticText"]]

$SemanticText
[1] "I'm thinking about buying a new home" "purchasing a brand new house"

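As you can see, the function is quite flexible. It would pick up "house buying" as well. It did not pick up "we bought a new house" because "bought" is an irregular form of the verb. This is the kind of thing that LSA would have picked up.

So you might want to try both and see which one works better. The SemanticLink function also needs a lot of memory, and you won't be able to use it when you have a particularly large corpus.

Cheers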
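The irregular-verb miss can be reproduced in isolation. The sketch below uses a prefix pattern of the kind `trim()` builds for a short word (the word followed by `\w*`); the `irregular` lookup is a hypothetical workaround added for illustration, not something the function above does.

```r
texts <- c("we bought a new house",
           "I'm thinking about buying a new home",
           "purchasing a brand new house")

# Prefix pattern: matches "buy", "buying", "buyer", ... but not "bought".
grepl("buy\\w*", texts, ignore.case = TRUE)
# [1] FALSE  TRUE FALSE

# Hypothetical workaround: list irregular forms by hand and add them as alternatives.
irregular <- c(buy = "bought")
pattern   <- paste(c("buy\\w*", irregular[["buy"]]), collapse = "|")
grepl(pattern, texts, ignore.case = TRUE)
# [1]  TRUE  TRUE FALSE
```

A hand-maintained list like this only covers the forms you think of in advance, which is where a distributional method such as LSA has the edge.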
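A back-of-the-envelope estimate shows why memory becomes a problem: the LSA setup above converts the term-document matrix to a dense matrix with `as.matrix()`, and a dense numeric matrix in R stores 8 bytes per cell. The helper `dense_matrix_gib` and the corpus sizes below are made-up assumptions for illustration.

```r
# Hypothetical helper: approximate size of a dense numeric matrix in GiB.
dense_matrix_gib <- function(n_terms, n_docs, bytes_per_cell = 8) {
  n_terms * n_docs * bytes_per_cell / 1024^3
}

# Example: 50,000 terms by 100,000 documents (made-up corpus size).
dense_matrix_gib(50000, 100000)
# roughly 37 GiB, before the SVD allocates anything

# Sanity check of the 8-bytes-per-cell assumption on a small matrix.
object.size(matrix(0, 100, 200))
```

Pruning terms more aggressively with `removeSparseTerms()` shrinks the matrix, at the cost of dropping rare terms.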