xii*_*408 10 [matlab] [r] [graph] [distance] [text-mining]
I'm working on a research project for school. I've written some text-mining software that analyses the legal texts in a collection and spits out a score indicating how similar they are. I ran the program to compare each text with every other text, and I have data like this (though with many more points):
codeofhammurabi.txt crete.txt 0.570737
codeofhammurabi.txt iraqi.txt 1.13475
codeofhammurabi.txt magnacarta.txt 0.945746
codeofhammurabi.txt us.txt 1.25546
crete.txt iraqi.txt 0.329545
crete.txt magnacarta.txt 0.589786
crete.txt us.txt 0.491903
iraqi.txt magnacarta.txt 0.834488
iraqi.txt us.txt 1.37718
magnacarta.txt us.txt 1.09582
Now I need to plot them on a graph. I can easily invert the scores so that a small value now indicates texts that are similar and a large value indicates texts that are dissimilar: the values can then be the distances between points on a graph, with each point representing a text (a minimal sketch of this inversion step follows the list below).
codeofhammurabi.txt crete.txt 1.75212
codeofhammurabi.txt iraqi.txt 0.8812
codeofhammurabi.txt magnacarta.txt 1.0573
codeofhammurabi.txt us.txt 0.7965
crete.txt iraqi.txt 3.0344
crete.txt magnacarta.txt 1.6955
crete.txt us.txt 2.0329
iraqi.txt magnacarta.txt 1.1983
iraqi.txt us.txt 0.7261
magnacarta.txt us.txt 0.9125
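For concreteness, a minimal R sketch of that inversion step (the data frame scores and its column names are invented for illustration):

# raw similarity scores, one row per pair of texts
scores <- data.frame(a   = c("codeofhammurabi.txt", "codeofhammurabi.txt"),
                     b   = c("crete.txt", "iraqi.txt"),
                     sim = c(0.570737, 1.13475))
# invert so that small values mean similar texts: take the reciprocal
scores$dist <- 1 / scores$sim
scores   # 1/0.570737 is roughly 1.75212, matching the list above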
Short version: the values above are distances between points on a scatterplot (1.75212 is the distance between the codeofhammurabi point and the crete point). I can imagine a big system of equations with circles representing the distances between the points. What's the best way to make this graph? I have MATLAB, R, Excel, and access to pretty much any software I might need.
If you can even just point me in the right direction, I'll be infinitely grateful.
Rei*_*son 10
Your data are really distances (of some form) in the multivariate space spanned by the corpus of words contained in the documents. Dissimilarity data such as these are often ordinated to provide the best k-dimensional mapping of the dissimilarities. Principal coordinates analysis (PCO) and non-metric multidimensional scaling (nMDS) are two such methods. I suggest you plot the results of applying one or other of these methods to your data. I provide examples of both below.
First, load the data you supplied (without labels at this stage):
con <- textConnection("1.75212
0.8812
1.0573
0.7965
3.0344
1.6955
2.0329
1.1983
0.7261
0.9125
")
vec <- scan(con)
close(con)
What you actually have is the following distance matrix:
mat <- matrix(ncol = 5, nrow = 5)
mat[lower.tri(mat)] <- vec
colnames(mat) <- rownames(mat) <-
c("codeofhammurabi","crete","iraqi","magnacarta","us")
> mat
codeofhammurabi crete iraqi magnacarta us
codeofhammurabi NA NA NA NA NA
crete 1.75212 NA NA NA NA
iraqi 0.88120 3.0344 NA NA NA
magnacarta 1.05730 1.6955 1.1983 NA NA
us 0.79650 2.0329 0.7261 0.9125 NA
Normally, R wants a dissimilarity object of class "dist". We could use as.dist(mat) now to get such an object, or we could skip creating mat and go straight to the "dist" object like this:
class(vec) <- "dist"
attr(vec, "Labels") <- c("codeofhammurabi","crete","iraqi","magnacarta","us")
attr(vec, "Size") <- 5
attr(vec, "Diag") <- FALSE
attr(vec, "Upper") <- FALSE
> vec
codeofhammurabi crete iraqi magnacarta
crete 1.75212
iraqi 0.88120 3.03440
magnacarta 1.05730 1.69550 1.19830
us 0.79650 2.03290 0.72610 0.91250
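As an aside, if you did build mat as above, the same object drops out in one step via the as.dist() route mentioned earlier; a small sketch:

# equivalent one-liner: convert the lower-triangular matrix to a "dist" object
vec2 <- as.dist(mat)
all.equal(c(vec2), c(vec))   # should be TRUE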
Now that we have an object of the right type, we can ordinate it. R has many packages and functions for doing this (see the Multivariate or Environmetrics Task Views on CRAN), but I'll use the vegan package as I'm somewhat familiar with it...
require("vegan")
First up, I'll show how to do a principal coordinates analysis on your data with vegan.
pco <- capscale(vec ~ 1, add = TRUE)
pco
> pco
Call: capscale(formula = vec ~ 1, add = TRUE)
Inertia Rank
Total 10.42
Unconstrained 10.42 3
Inertia is squared Unknown distance (euclidified)
Eigenvalues for unconstrained axes:
MDS1 MDS2 MDS3
7.648 1.672 1.098
Constant added to distances: 0.7667353
The first PCO axis is by far the most important in explaining the differences between the texts, as indicated by the eigenvalues. An ordination plot can now be produced by plotting the eigenvectors of the PCO, using the plot method:
plot(pco)
which produces:
[plot: PCO ordination of the five texts]
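A small aside I'll add here (not part of the original recipe): vegan's eigenvals() extractor works on capscale objects, so you can express each axis's share of the total inertia directly:

# proportion of total inertia captured by each PCO axis
ev <- eigenvals(pco)
ev / sum(ev)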
Non-metric multidimensional scaling (nMDS) does not attempt to find a low-dimensional representation of the original distances in a Euclidean space. Instead it tries to find a mapping in k dimensions that best preserves the rank ordering of the distances between observations. There is no closed-form solution to this problem (unlike the PCO applied above); an iterative algorithm is required, and random restarts are recommended to make sure the algorithm has not converged to a sub-optimal, locally optimal solution. vegan's metaMDS function incorporates these features and more besides. If you want plain old nMDS, see isoMDS in the MASS package.
set.seed(42)
sol <- metaMDS(vec)
> sol
Call:
metaMDS(comm = vec)
global Multidimensional Scaling using monoMDS
Data: vec
Distance: user supplied
Dimensions: 2
Stress: 0
Stress type 1, weak ties
No convergent solutions - best solution after 20 tries
Scaling: centring, PC rotation
Species: scores missing
With this small data set we can essentially represent the rank ordering of the dissimilarities perfectly (hence the warning, not shown). A plot can be produced with the plot method:
plot(sol, type = "text", display = "sites")
which produces:
[plot: nMDS ordination of the five texts]
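One extra diagnostic worth mentioning (my addition, not in the original answer): a Shepard diagram shows how faithfully the ordination distances preserve the rank order of the observed dissimilarities, and vegan provides stressplot() for metaMDS results:

# Shepard diagram: observed dissimilarities vs ordination distances
stressplot(sol)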
In both cases, the distances between samples on the plot are the best 2-d approximation of their dissimilarities. In the case of the PCO plot, it is a 2-d approximation of the real dissimilarities (3 dimensions would be needed to represent all of the dissimilarities fully), whereas in the nMDS plot the distances between samples reflect the rank dissimilarities, not the actual dissimilarities, between the observations. But essentially, distance on the plot represents computed dissimilarity: texts close together are the most similar, and texts far apart on the plot are the most dissimilar to one another.
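Should you want to compare the two ordinations directly, vegan's procrustes() rotates one configuration onto the other; this is an aside of mine, not something the answer above relies on:

# rotate the nMDS configuration onto the PCO site scores and plot the residuals
pro <- procrustes(scores(pco, display = "sites"),
                  scores(sol, display = "sites"))
plot(pro)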
Ben*_*Ben 10
If the question is "how can I do what this guy did?" (from xiii1408's comment on the question), then the answer is to use Gephi's built-in Force Atlas 2 algorithm on the Euclidean distances of the documents' topic posterior probabilities.
"This guy" is Matt Jockers, an innovative scholar in the digital humanities. He has documented some of his methods on his blog and elsewhere. Jockers mostly works in R and shares some of his code; his basic workflow is roughly what the example below walks through.
Here is a small-scale reproducible example in R (with an export to Gephi) that might be close to what Jockers did:
#### prepare workspace
# delete current objects and clear RAM
rm(list = ls(all.names = TRUE))
gc()
Get the data...
#### import text
# working from the topicmodels package vignette
# using collection of abstracts of the Journal of Statistical Software (JSS) (up to 2010-08-05).
install.packages("corpus.JSS.papers", repos = "http://datacube.wu.ac.at/", type = "source")
data("JSS_papers", package = "corpus.JSS.papers")
# For reproducibility of results we use only abstracts published up to 2010-08-05
JSS_papers <- JSS_papers[JSS_papers[,"date"] < "2010-08-05",]
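A quick check I've added, just to see how many abstracts survive the date filter:

# number of abstracts retained for modelling
nrow(JSS_papers)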
Clean and reshape...
#### clean and reshape data
# Omit abstracts containing non-ASCII characters in the abstracts
JSS_papers <- JSS_papers[sapply(JSS_papers[, "description"], Encoding) == "unknown",]
# remove greek characters (from math notation, etc.)
library("tm")
library("XML")
remove_HTML_markup <- function(s) tryCatch({
    doc <- htmlTreeParse(paste("<!DOCTYPE html>", s),
                         asText = TRUE, trim = FALSE)
    xmlValue(xmlRoot(doc))
}, error = function(s) s)
# create corpus
corpus <- Corpus(VectorSource(sapply(JSS_papers[, "description"], remove_HTML_markup)))
# clean corpus by removing stopwords, numbers, punctuation and extra whitespace
# (filtering out words < 3 characters long is better left to the control argument
# of DocumentTermMatrix; passing wordLengths to tm_map has no effect here)
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(tolower, removePunctuation, removeNumbers, stripWhitespace, skipWords)
corpus_clean <- tm_map(corpus, FUN = tm_reduce, tmFuns = funcs)
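Another small check I've added: peek at the first document to confirm the cleaning did what the comments claim:

# look at the first cleaned document
inspect(corpus_clean[1])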
Part-of-speech tagging and subsetting the nouns...
#### Part-of-speech tagging to extract nouns only
library("NLP")
library("openNLP")
# function for POS tagging
tagPOS <- function(x) {
    s <- NLP::as.String(x)
    ## Need sentence and word token annotations.
    a1 <- NLP::Annotation(1L, "sentence", 1L, nchar(s))
    a2 <- NLP::annotate(s, openNLP::Maxent_Word_Token_Annotator(), a1)
    a3 <- NLP::annotate(s, openNLP::Maxent_POS_Tag_Annotator(), a2)
    ## Determine the distribution of POS tags for word tokens.
    a3w <- a3[a3$type == "word"]
    POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
    ## Extract token/POS pairs (all of them): easy - not needed
    # POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
    return(unlist(POStags))
}
# a loop to do POS tagging on each document and do garbage cleaning after each document
# first prepare vector to hold results (for optimal loop speed)
corpus_clean_tagged <- vector(mode = "list", length = length(corpus_clean))
# then loop through each doc and do POS tagging
# warning: this may take some time!
for(i in 1:length(corpus_clean)){
    corpus_clean_tagged[[i]] <- tagPOS(corpus_clean[[i]])
    print(i) # nice to see what we're up to
    gc()
}
# subset nouns
wrds <- lapply(unlist(corpus_clean), function(i) unlist(strsplit(i, split = " ")))
NN <- lapply(corpus_clean_tagged, function(i) i == "NN")
Noun_strings <- lapply(1:length(wrds), function(i) unlist(wrds[i])[unlist(NN[i])])
Noun_strings <- lapply(Noun_strings, function(i) paste(i, collapse = " "))
# have a look to see what we've got
Noun_strings[[1]]
[1] "variogram model splus user quality variogram model pairs locations measurements variogram nonstationarity outliers variogram fit sets soil nitrogen concentration"
Topic modelling with latent Dirichlet allocation...
#### topic modelling with LDA
# (Jockers uses the lda package and MALLET, maybe topicmodels also, I'm not sure.
#  I'm most familiar with the topicmodels package, so here it is. Note that
#  MALLET can be run from R: https://gist.github.com/benmarwick/4537873)
# put the cleaned documents back into a corpus for topic modelling
corpus <- Corpus(VectorSource(Noun_strings))
# create document term matrix
JSS_dtm <- DocumentTermMatrix(corpus)
# generate topic model
library("topicmodels")
k = 30 # arbitrary number of topics (there are ways to optimise this)
JSS_TM <- LDA(JSS_dtm, k) # make topic model
# make data frame where rows are documents, columns are topics and cells
# are posterior probabilities of topics
JSS_topic_df <- setNames(as.data.frame(JSS_TM@gamma), paste0("topic_",1:k))
# add row names that link each document to a human-readble bit of data
# in this case we'll just use a few words of the title of each paper
row.names(JSS_topic_df) <- lapply(1:length(JSS_papers[,1]), function(i) gsub("\\s","_",substr(JSS_papers[,1][[i]], 1, 60)))
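A sanity check I've added, using the terms() method from topicmodels: the highest-probability words in each topic should read as coherent themes:

# ten most probable terms in each of the k topics
terms(JSS_TM, 10)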
Using the topic probabilities as the 'DNA' of each document, calculate the Euclidean distance of each document from every other one...
#### Euclidean distance matrix
library(cluster)
JSS_topic_df_dist <- as.matrix(daisy(JSS_topic_df, metric = "euclidean", stand = TRUE))
# Change row values to zero if less than row minimum plus row standard deviation
# This is how Jockers subsets the distance matrix to keep only
# closely related documents and avoid a dense spaghetti diagram
# that's difficult to interpret (hat-tip: http://stackoverflow.com/a/16047196/1036500)
JSS_topic_df_dist[ sweep(JSS_topic_df_dist, 1, (apply(JSS_topic_df_dist,1,min) + apply(JSS_topic_df_dist,1,sd) )) > 0 ] <- 0
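A one-line check I've added to see how aggressively the thresholding has sparsified the matrix:

# fraction of off-diagonal entries kept as edges after thresholding
mean(JSS_topic_df_dist[row(JSS_topic_df_dist) != col(JSS_topic_df_dist)] > 0)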
Visualise with a force-directed graph...
#### network diagram using Fruchterman & Reingold algorithm (Jockers uses the ForceAtlas2 algorithm which is unique to Gephi)
library(igraph)
# the matrix is real-valued, so build a weighted graph (without weighted = TRUE,
# graph.adjacency expects integer edge counts); the row-wise thresholding can leave
# the matrix slightly asymmetric, so collapse reciprocal edges by averaging
g <- as.undirected(graph.adjacency(JSS_topic_df_dist, weighted = TRUE),
                   mode = "collapse", edge.attr.comb = "mean")
layout1 <- layout.fruchterman.reingold(g, niter=500)
plot(g, layout=layout1, edge.curved = TRUE, vertex.size = 1, vertex.color= "grey", edge.arrow.size = 0.1, vertex.label.dist=0.5, vertex.label = NA)
If you want to use the Force Atlas 2 algorithm in Gephi, you simply export the R graph object to a graphml file, then open it in Gephi and set the layout to Force Atlas 2:
# this line will export from R and make the file 'JSS.graphml' in your working directory ready to open with Gephi
write.graph(g, file="JSS.graphml", format="graphml")
Here is the Gephi plot using the Force Atlas 2 algorithm:
[image: Gephi network of JSS abstracts laid out with Force Atlas 2]