我想测量熊猫DataFrame中文本之间的jaccard相似度。更确切地说,我有一些实体组,并且一段时间内每个实体都有一些文本。我想针对每个实体分别分析一段时间内的文本相似度(此处为Jaccard相似度)。
一个最小的例子来说明我的观点:
import pandas as pd
entries = [
{'Entity_Id':'Firm1', 'date':'2001-02-05', 'text': 'This is a text'},
{'Entity_Id':'Firm1', 'date':'2001-03-07', 'text': 'This is a text'},
{'Entity_Id':'Firm1', 'date':'2003-01-04', 'text': 'No similarity'},
{'Entity_Id':'Firm1', 'date':'2007-10-12', 'text': 'Some similarity'},
{'Entity_Id':'Firm2', 'date':'2001-10-10', 'text': 'Another firm'},
{'Entity_Id':'Firm2', 'date':'2005-12-03', 'text': 'Another year'},
{'Entity_Id':'Firm3', 'date':'2002-05-05', 'text': 'Something different'}
]
df = pd.DataFrame(entries)
Run Code Online (Sandbox Code Playgroud)
Entity_Id日期文字
Firm1 2001-02-05 'This is a text'
Firm1 2001-03-07 'This is a text'
Firm1 2003-01-04 'No similarity'
Firm1 2007-10-12 'Some similarity'
Firm2 2001-10-10 'Another firm'
Firm2 2005-12-03 'Another …Run Code Online (Sandbox Code Playgroud)