Clustering text in MATLAB

Tin*_*lin 3 matlab cluster-analysis text-mining

I want to do hierarchical agglomerative clustering on text in MATLAB. Say I have four sentences:

I have a pen.
I have a paper. 
I have a pencil.
I have a cat. 

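I want to cluster the four sentences above to see which ones are more similar. I know the Statistics Toolbox has commands such as pdist to measure pairwise distances, linkage to build the hierarchical cluster tree, and so on. A simple piece of code such as: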

X=[1 2; 2 3; 1 4];
Y=pdist(X, 'euclidean');
Z=linkage(Y, 'single');
H=dendrogram(Z)

works fine and returns a dendrogram.

I would like to know whether I can use these commands on text such as the sentences above. Any ideas?


UPDATE:

Thanks, Amro. I read up on it, understood it, and computed the distances between strings. The code is below:

clc
S1='I have a pen'; % first String

f_id=fopen('events.txt','r'); %saved strings to compare with
events=textscan(f_id, '%s', 'Delimiter', '\n');
fclose(f_id); %close file.
events=events{1}; % saving the text read.

ii=numel(events); % number of stored strings
% compare S1 against every stored string

for kk=1:ii

   S2=events(kk);
   S2=cell2mat(S2);
   Z=levenshtein_distance(S1,S2);
   X(kk)=Z;

end 

I enter one string, and I have 4 saved strings. I then computed the pairwise distances with the levenshtein_distance function. It returns a matrix X=[ 17 0 16 18 16].

I assume this is my pairwise distance matrix, similar to what pdist returns. Is it?

Now I am trying to pass X to linkage:

Z=linkage(X, 'single');

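The output I get is:

Error using ==> linkage at 93 Size of Y not compatible with the output of the PDIST function.

Error in ==> Untitled2 at 20 Z=linkage(X, 'single').

Why is that? Can I use the linkage function this way at all? Help appreciated.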

UPDATE 2

clc
S1='I have a pen';

f_id=fopen('events.txt','r');
events=textscan(f_id, '%s', 'Delimiter', '\n');
fclose(f_id); %close file.
events=events{1}; % saving the text read.

ii=numel(events)+1; % total number of strings in the comparison

D=zeros(ii, ii); % initialized distance matrix;
for kk=1:ii 

    S2=events(kk);

    %S2=cell2mat(S2);

    for jk=kk+1:ii

  D(kk,jk)= levenshtein_distance(S1{kk},S2{jk});

    end

end

D = D + D';       %'# symmetric distance matrix

%# linkage expects the output format to match that of pdist,
%# so we convert D to a row vector (lower/upper part of matrix)
D = squareform(D, 'tovector');

T = linkage(D, 'single');
dendrogram(T)

Error: ??? Cell contents reference from a non-cell array object. Error in ==> Untitled2 at 22 D(kk,jk)= levenshtein_distance(S1{kk},S2{jk});

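Also, why am I reading the events from the file inside the first loop? That does not seem logical. I am a bit confused about whether I can make it work this way, or whether the only solution is to type all the strings into the code. Thanks a lot.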

UPDATE

Code for comparing two sentences:

clc
str1 = 'Fire in NY';
str2= 'Jeff is sick';

D=levenshtein_distance(str1,str2);
D = D + D';       %'# symmetric distance matrix

%# linkage expects the output format to match that of pdist,
%# so we convert D to a row vector (lower/upper part of matrix)

%D = squareform(D, 'tovector');

T = linkage(D, 'complete');
[H,P] = dendrogram(T,'colorthreshold','default');  

Output: D = 18.

With different strings:

clc
str1 = 'Fire in NY';
str2= 'NY catches fire';

D=levenshtein_distance(str1,str2);
D = D + D';       %'# symmetric distance matrix

%# linkage expects the output format to match that of pdist,
%# so we convert D to a row vector (lower/upper part of matrix)

%D = squareform(D, 'tovector');

T = linkage(D, 'complete');
[H,P] = dendrogram(T,'colorthreshold','default'); 

d = 28.

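Judging by the distances, the completely different sentences look more similar. What I am trying to do: if I have already stored 'Fire in NY', I do not want to store 'NY catches fire'. In the first case, however, I would store the new sentence, because the information is new.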

Is LD (Levenshtein distance) sufficient to do this? Help appreciated.

Amr*_*mro 5

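What you need is a distance function that can handle strings. Look into the Levenshtein distance (edit distance). There are many implementations out there.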
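levenshtein_distance is not a built-in MATLAB function, so the code in this thread assumes some implementation is on the path. As a minimal sketch (not necessarily the implementation used in the question), the classic dynamic-programming edit distance with unit costs for insertion, deletion, and substitution could look like this, saved as levenshtein_distance.m:

```matlab
function d = levenshtein_distance(s, t)
% Sketch of the edit distance between char vectors s and t.
% Assumption: insertion, deletion and substitution each cost 1.
    m = numel(s);
    n = numel(t);
    D = zeros(m+1, n+1);
    D(:,1) = (0:m)';          % deleting the first i characters of s
    D(1,:) = 0:n;             % inserting the first j characters of t
    for i = 1:m
        for j = 1:n
            cost = double(s(i) ~= t(j));            % 0 if characters match
            D(i+1,j+1) = min([D(i,j+1) + 1, ...     % deletion
                              D(i+1,j) + 1, ...     % insertion
                              D(i,j)   + cost]);    % substitution or match
        end
    end
    d = D(m+1, n+1);
end
```

With these unit costs the distance can never exceed the length of the longer string, so other implementations with different costs will return different numbers.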

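Alternatively, you could extract some interesting features (e.g. number of vowels, length of the string, etc.) to build a vector-space representation; you can then apply any of the usual distance measures (Euclidean, ...) on the new representation.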
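A rough sketch of the feature-based route, using the four sentences from the question and the two example features mentioned above (vowel count and string length). The choice of features is only illustrative; it assumes the Statistics Toolbox for pdist, linkage and dendrogram:

```matlab
% Sentences from the question, stored as a column cell array
str = {'I have a pen.'
       'I have a paper.'
       'I have a pencil.'
       'I have a cat.'};

% Feature 1: number of vowels; feature 2: string length
vowels = cellfun(@(s) sum(ismember(lower(s), 'aeiou')), str);
len    = cellfun(@numel, str);
X = [vowels, len];            % one row per sentence, one column per feature

Y = pdist(X, 'euclidean');    % pairwise distances in feature space
Z = linkage(Y, 'single');
dendrogram(Z)
```

Note that these two crude features give the first and last sentences identical vectors (5 vowels, 13 characters), so they end up at distance zero; more informative features would be needed to tell them apart.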


EDIT

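The problem with your code is that LINKAGE expects the input distances to be in the same format as the output of PDIST, namely a row vector corresponding to pairs of observations in the order 1-vs-2, 1-vs-3, 2-vs-3, etc. This is basically the lower half of the full distance matrix (since it is supposed to be symmetric: dist(1,2) == dist(2,1)).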

%# instances
str = {'I have a pen.'
    'I have a paper.'
    'I have a pencil.'
    'I have a cat.'};
numStr = numel(str);

%# create and fill upper half only of distance matrix
D = zeros(numStr,numStr);
for i=1:numStr
    for j=i+1:numStr
        D(i,j) = levenshtein_distance(str{i},str{j});
    end
end
D = D + D';       %'# symmetric distance matrix

%# linkage expects the output format to match that of pdist,
%# so we convert D to a row vector (lower/upper part of matrix)
D = squareform(D, 'tovector');

T = linkage(D, 'single');
dendrogram(T)
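To see why the five-element vector X from the question is rejected, it may help to check the PDIST format directly. A vector describing m observations must have m*(m-1)/2 entries; the short sketch below shows the ordering squareform produces and tests whether a given length is valid (the variable names are illustrative only):

```matlab
% Small symmetric distance matrix for three observations
D = [0 1 2;
     1 0 3;
     2 3 0];
v = squareform(D, 'tovector');   % order: 1-vs-2, 1-vs-3, 2-vs-3

% Length check for the vector from the question
X = [17 0 16 18 16];
n = numel(X);
m = (1 + sqrt(1 + 8*n)) / 2;     % solve m*(m-1)/2 == n for m
isValid = (m == floor(m));       % false: no integer m gives 5 pairs
```

Three observations give 3 pairs and four give 6, so a vector of 5 distances cannot correspond to any number of observations. Distances from one string to several others are a single row of the full matrix, not the full set of pairs.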

For more information, see the documentation of the functions involved...
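The same approach should also work when the strings come from a file, as with events.txt in the question, rather than being typed into the code. The sketch below wraps the loop in a hypothetical helper, cluster_strings, that takes any cell array of strings and a handle to a distance function; the helper name and its signature are assumptions made for illustration:

```matlab
function T = cluster_strings(str, distfun)
% CLUSTER_STRINGS  Hypothetical helper: single-linkage tree for a cell
% array of strings, given distfun(a,b) returning a scalar distance.
%
% Possible usage with the file from the question:
%   fid = fopen('events.txt', 'r');
%   events = textscan(fid, '%s', 'Delimiter', '\n');
%   fclose(fid);
%   str = [{'I have a pen'}; events{1}];   % new string plus stored ones
%   T = cluster_strings(str, @levenshtein_distance);
%   dendrogram(T)
    n = numel(str);
    D = zeros(n, n);
    for i = 1:n
        for j = i+1:n
            D(i,j) = distfun(str{i}, str{j});   % upper half only
        end
    end
    D = D + D';                                 % symmetric, zero diagonal
    T = linkage(squareform(D, 'tovector'), 'single');
end
```

Reading the file once before the loops and indexing a single cell array with str{i} and str{j} avoids the "Cell contents reference from a non-cell array object" error, which arises when brace indexing is applied to a plain char vector.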