xii*_*408 10 [matlab] [r] [graph] [distance] [text-mining]
I'm working on a research project for school. I've written some text-mining software that analyses the legal texts in a collection and spits out a score indicating how similar they are. I ran the program to compare each text with every other text, and I have data like this (though with many more points):
codeofhammurabi.txt crete.txt 0.570737
codeofhammurabi.txt iraqi.txt 1.13475
codeofhammurabi.txt magnacarta.txt 0.945746
codeofhammurabi.txt us.txt 1.25546
crete.txt iraqi.txt 0.329545
crete.txt magnacarta.txt 0.589786
crete.txt us.txt 0.491903
iraqi.txt magnacarta.txt 0.834488
iraqi.txt us.txt 1.37718
magnacarta.txt us.txt 1.09582
Now I need to plot them on a graph. I can easily invert the scores so that a small value now indicates texts that are similar and a large value indicates texts that are dissimilar: the values can then be the distances between points on a graph, with each point representing a text (a minimal sketch of this inversion step follows the list below).
codeofhammurabi.txt crete.txt 1.75212
codeofhammurabi.txt iraqi.txt 0.8812
codeofhammurabi.txt magnacarta.txt 1.0573
codeofhammurabi.txt us.txt 0.7965
crete.txt iraqi.txt 3.0344
crete.txt magnacarta.txt 1.6955
crete.txt us.txt 2.0329
iraqi.txt magnacarta.txt 1.1983
iraqi.txt us.txt 0.7261
magnacarta.txt us.txt 0.9125
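For concreteness, a minimal R sketch of that inversion step (the data frame scores and its column names are invented for illustration):

# raw similarity scores, one row per pair of texts
scores <- data.frame(a   = c("codeofhammurabi.txt", "codeofhammurabi.txt"),
                     b   = c("crete.txt", "iraqi.txt"),
                     sim = c(0.570737, 1.13475))
# invert so that small values mean similar texts: take the reciprocal
scores$dist <- 1 / scores$sim
scores   # 1/0.570737 is roughly 1.75212, matching the list above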
Short version: the values above are distances between points on a scatterplot (1.75212 is the distance between the codeofhammurabi point and the crete point). I can imagine a big system of equations with circles representing the distances between the points. What's the best way to make this graph? I have MATLAB, R, Excel, and access to pretty much any software I might need.
If you can even just point me in the right direction, I'll be infinitely grateful.
Rei*_*son 10
Your data are really distances (of some form) in the multivariate space spanned by the corpus of words contained in the documents. Dissimilarity data such as these are often ordinated to provide the best k-dimensional mapping of the dissimilarities. Principal coordinates analysis (PCO) and non-metric multidimensional scaling (nMDS) are two such methods. I suggest you plot the results of applying one or other of these methods to your data. I provide examples of both below.
First, load the data you supplied (without labels at this stage):
con <- textConnection("1.75212
0.8812
1.0573
0.7965
3.0344
1.6955
2.0329
1.1983
0.7261
0.9125
")
vec <- scan(con)
close(con)
What you actually have is the following distance matrix:
mat <- matrix(ncol = 5, nrow = 5)
mat[lower.tri(mat)] <- vec
colnames(mat) <- rownames(mat) <-
c("codeofhammurabi","crete","iraqi","magnacarta","us")
> mat
codeofhammurabi crete iraqi magnacarta us
codeofhammurabi NA NA NA NA NA
crete 1.75212 NA NA NA NA
iraqi 0.88120 3.0344 NA NA NA
magnacarta 1.05730 1.6955 1.1983 NA NA
us 0.79650 2.0329 0.7261 0.9125 NA
Normally, R wants a dissimilarity object of class "dist". We could use as.dist(mat) now to get such an object, or we could skip creating mat and go straight to the "dist" object like this:
class(vec) <- "dist"
attr(vec, "Labels") <- c("codeofhammurabi","crete","iraqi","magnacarta","us")
attr(vec, "Size") <- 5
attr(vec, "Diag") <- FALSE
attr(vec, "Upper") <- FALSE
> vec
codeofhammurabi crete iraqi magnacarta
crete 1.75212
iraqi 0.88120 3.03440
magnacarta 1.05730 1.69550 1.19830
us 0.79650 2.03290 0.72610 0.91250
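As an aside, if you did build mat as above, the same object drops out in one step via the as.dist() route mentioned earlier; a small sketch:

# equivalent one-liner: convert the lower-triangular matrix to a "dist" object
vec2 <- as.dist(mat)
all.equal(c(vec2), c(vec))   # should be TRUE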
Now that we have an object of the right type, we can ordinate it. R has many packages and functions for doing this (see the Multivariate or Environmetrics Task Views on CRAN), but I'll use the vegan package as I'm somewhat familiar with it...
require("vegan")
First up, I'll show how to do a principal coordinates analysis on your data with vegan.
pco <- capscale(vec ~ 1, add = TRUE)
pco
> pco
Call: capscale(formula = vec ~ 1, add = TRUE)
Inertia Rank
Total 10.42
Unconstrained 10.42 3
Inertia is squared Unknown distance (euclidified)
Eigenvalues for unconstrained axes:
MDS1 MDS2 MDS3
7.648 1.672 1.098
Constant added to distances: 0.7667353
The first PCO axis is by far the most important in explaining the differences between the texts, as indicated by the eigenvalues. An ordination plot can now be produced by plotting the eigenvectors of the PCO, using the plot method:
plot(pco)
which produces:
[plot: PCO ordination of the five texts]
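A small aside I'll add here (not part of the original recipe): vegan's eigenvals() extractor works on capscale objects, so you can express each axis's share of the total inertia directly:

# proportion of total inertia captured by each PCO axis
ev <- eigenvals(pco)
ev / sum(ev)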
Non-metric multidimensional scaling (nMDS) does not attempt to find a low-dimensional representation of the original distances in a Euclidean space. Instead it tries to find a mapping in k dimensions that best preserves the rank ordering of the distances between observations. There is no closed-form solution to this problem (unlike the PCO applied above); an iterative algorithm is required, and random restarts are recommended to make sure the algorithm has not converged to a sub-optimal, locally optimal solution. vegan's metaMDS function incorporates these features and more besides. If you want plain old nMDS, see isoMDS in the MASS package.
set.seed(42)
sol <- metaMDS(vec)
> sol
Call:
metaMDS(comm = vec)
global Multidimensional Scaling using monoMDS
Data: vec
Distance: user supplied
Dimensions: 2
Stress: 0
Stress type 1, weak ties
No convergent solutions - best solution after 20 tries
Scaling: centring, PC rotation
Species: scores missing
With this small data set we can essentially represent the rank ordering of the dissimilarities perfectly (hence the warning, not shown). A plot can be produced with the plot method:
plot(sol, type = "text", display = "sites")
which produces:
[plot: nMDS ordination of the five texts]
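One extra diagnostic worth mentioning (my addition, not in the original answer): a Shepard diagram shows how faithfully the ordination distances preserve the rank order of the observed dissimilarities, and vegan provides stressplot() for metaMDS results:

# Shepard diagram: observed dissimilarities vs ordination distances
stressplot(sol)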
In both cases, the distances between samples on the plot are the best 2-d approximation of their dissimilarities. In the case of the PCO plot, it is a 2-d approximation of the real dissimilarities (3 dimensions would be needed to represent all of the dissimilarities fully), whereas in the nMDS plot the distances between samples reflect the rank dissimilarities, not the actual dissimilarities, between the observations. But essentially, distance on the plot represents computed dissimilarity: texts close together are the most similar, and texts far apart on the plot are the most dissimilar to one another.
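Should you want to compare the two ordinations directly, vegan's procrustes() rotates one configuration onto the other; this is an aside of mine, not something the answer above relies on:

# rotate the nMDS configuration onto the PCO site scores and plot the residuals
pro <- procrustes(scores(pco, display = "sites"),
                  scores(sol, display = "sites"))
plot(pro)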
Ben*_*Ben 10
If the question is "how can I do what this guy did?" (from xiii1408's comment on the question), then the answer is to use Gephi's built-in Force Atlas 2 algorithm on the Euclidean distances of the documents' topic posterior probabilities.
"This guy" is Matt Jockers, an innovative scholar in the digital humanities. He has documented some of his methods on his blog and elsewhere. Jockers mostly works in R and shares some of his code; his basic workflow is roughly what the example below walks through.
Here is a small-scale reproducible example in R (with an export to Gephi) that might be close to what Jockers did:
#### prepare workspace
# delete current objects and clear RAM
rm(list = ls(all.names = TRUE))
gc()
Get the data...
#### import text
# working from the topicmodels package vignette
# using collection of abstracts of the Journal of Statistical Software (JSS) (up to 2010-08-05).
install.packages("corpus.JSS.papers", repos = "http://datacube.wu.ac.at/", type = "source")
data("JSS_papers", package = "corpus.JSS.papers")
# For reproducibility of results we use only abstracts published up to 2010-08-05
JSS_papers <- JSS_papers[JSS_papers[,"date"] < "2010-08-05",]
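A quick check I've added, just to see how many abstracts survive the date filter:

# number of abstracts retained for modelling
nrow(JSS_papers)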
Clean and reshape...
#### clean and reshape data
# Omit abstracts containing non-ASCII characters in the abstracts
JSS_papers <- JSS_papers[sapply(JSS_papers[, "description"], Encoding) == "unknown",]
# remove greek characters (from math notation, etc.)
library("tm")
library("XML")
remove_HTML_markup <- function(s) tryCatch({
    doc <- htmlTreeParse(paste("<!DOCTYPE html>", s),
                         asText = TRUE, trim = FALSE)
    xmlValue(xmlRoot(doc))
}, error = function(s) s)
# create corpus
corpus <- Corpus(VectorSource(sapply(JSS_papers[, "description"], remove_HTML_markup)))
# clean corpus by removing stopwords, numbers, punctuation and extra whitespace
# (filtering out words < 3 characters long is better left to the control argument
# of DocumentTermMatrix; passing wordLengths to tm_map has no effect here)
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(tolower, removePunctuation, removeNumbers, stripWhitespace, skipWords)
corpus_clean <- tm_map(corpus, FUN = tm_reduce, tmFuns = funcs)
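Another small check I've added: peek at the first document to confirm the cleaning did what the comments claim:

# look at the first cleaned document
inspect(corpus_clean[1])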
Part-of-speech tagging and subsetting the nouns...
#### Part-of-speech tagging to extract nouns only
library("NLP")
library("openNLP")
# function for POS tagging
tagPOS <- function(x) {
    s <- NLP::as.String(x)
    ## Need sentence and word token annotations.
    a1 <- NLP::Annotation(1L, "sentence", 1L, nchar(s))
    a2 <- NLP::annotate(s, openNLP::Maxent_Word_Token_Annotator(), a1)
    a3 <- NLP::annotate(s, openNLP::Maxent_POS_Tag_Annotator(), a2)
    ## Determine the distribution of POS tags for word tokens.
    a3w <- a3[a3$type == "word"]
    POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
    ## Extract token/POS pairs (all of them): easy - not needed
    # POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
    return(unlist(POStags))
}
# a loop to do POS tagging on each document and do garbage cleaning after each document
# first prepare vector to hold results (for optimal loop speed)
corpus_clean_tagged <- vector(mode = "list", length = length(corpus_clean))
# then loop through each doc and do POS tagging
# warning: this may take some time!
for(i in 1:length(corpus_clean)){
    corpus_clean_tagged[[i]] <- tagPOS(corpus_clean[[i]])
    print(i) # nice to see what we're up to
    gc()
}
# subset nouns
wrds <- lapply(unlist(corpus_clean), function(i) unlist(strsplit(i, split = " ")))
NN <- lapply(corpus_clean_tagged, function(i) i == "NN")
Noun_strings <- lapply(1:length(wrds), function(i) unlist(wrds[i])[unlist(NN[i])])
Noun_strings <- lapply(Noun_strings, function(i) paste(i, collapse = " "))
# have a look to see what we've got
Noun_strings[[1]]
[1] "variogram model splus user quality variogram model pairs locations measurements variogram nonstationarity outliers variogram fit sets soil nitrogen concentration"
Topic modelling with latent Dirichlet allocation...
#### topic modelling with LDA
# (Jockers uses the lda package and MALLET, maybe topicmodels also, I'm not sure.
#  I'm most familiar with the topicmodels package, so here it is. Note that
#  MALLET can be run from R: https://gist.github.com/benmarwick/4537873)
# put the cleaned documents back into a corpus for topic modelling
corpus <- Corpus(VectorSource(Noun_strings))
# create document term matrix
JSS_dtm <- DocumentTermMatrix(corpus)
# generate topic model
library("topicmodels")
k = 30 # arbitrary number of topics (there are ways to optimise this)
JSS_TM <- LDA(JSS_dtm, k) # make topic model
# make data frame where rows are documents, columns are topics and cells
# are posterior probabilities of topics
JSS_topic_df <- setNames(as.data.frame(JSS_TM@gamma), paste0("topic_",1:k))
# add row names that link each document to a human-readble bit of data
# in this case we'll just use a few words of the title of each paper
row.names(JSS_topic_df) <- lapply(1:length(JSS_papers[,1]), function(i) gsub("\\s","_",substr(JSS_papers[,1][[i]], 1, 60)))
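A sanity check I've added, using the terms() method from topicmodels: the highest-probability words in each topic should read as coherent themes:

# ten most probable terms in each of the k topics
terms(JSS_TM, 10)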
Using the topic probabilities as the 'DNA' of each document, calculate the Euclidean distance of each document from every other one...
#### Euclidean distance matrix
library(cluster)
JSS_topic_df_dist <- as.matrix(daisy(JSS_topic_df, metric = "euclidean", stand = TRUE))
# Change row values to zero if less than row minimum plus row standard deviation
# This is how Jockers subsets the distance matrix to keep only
# closely related documents and avoid a dense spaghetti diagram
# that's difficult to interpret (hat-tip: http://stackoverflow.com/a/16047196/1036500)
JSS_topic_df_dist[ sweep(JSS_topic_df_dist, 1, (apply(JSS_topic_df_dist,1,min) + apply(JSS_topic_df_dist,1,sd) )) > 0 ] <- 0
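A one-line check I've added to see how aggressively the thresholding has sparsified the matrix:

# fraction of off-diagonal entries kept as edges after thresholding
mean(JSS_topic_df_dist[row(JSS_topic_df_dist) != col(JSS_topic_df_dist)] > 0)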
Visualise with a force-directed graph...
#### network diagram using Fruchterman & Reingold algorithm (Jockers uses the ForceAtlas2 algorithm which is unique to Gephi)
library(igraph)
# the matrix is real-valued, so build a weighted graph (without weighted = TRUE,
# graph.adjacency expects integer edge counts); the row-wise thresholding can leave
# the matrix slightly asymmetric, so collapse reciprocal edges by averaging
g <- as.undirected(graph.adjacency(JSS_topic_df_dist, weighted = TRUE),
                   mode = "collapse", edge.attr.comb = "mean")
layout1 <- layout.fruchterman.reingold(g, niter=500)
plot(g, layout=layout1, edge.curved = TRUE, vertex.size = 1, vertex.color= "grey", edge.arrow.size = 0.1, vertex.label.dist=0.5, vertex.label = NA)
If you want to use the Force Atlas 2 algorithm in Gephi, you simply export the R graph object to a graphml file, then open it in Gephi and set the layout to Force Atlas 2:
# this line will export from R and make the file 'JSS.graphml' in your working directory ready to open with Gephi
write.graph(g, file="JSS.graphml", format="graphml")
Here is the Gephi plot using the Force Atlas 2 algorithm:
[image: Gephi network of JSS abstracts laid out with Force Atlas 2]