mk_*_*sch 9 python list matrix find-occurrences sklearn-pandas
I have a list of names, such as:
names = ['A', 'B', 'C', 'D']
and a list of documents, each of which mentions some of these names.
document =[['A', 'B'], ['C', 'B', 'K'],['A', 'B', 'C', 'D', 'Z']]
I want to get a co-occurrence matrix as output, like:
  A B C D
A 0 2 1 1
B 2 0 2 1
C 1 2 0 1
D 1 1 1 0
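There is a solution to this problem in R (creating a co-occurrence matrix), but I could not reproduce it in Python. I would like to do this in pandas, but I have made no progress so far!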
Moc*_*ird 10
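Another option is to use the constructor
csr_matrix((data, (row_ind, col_ind)), [shape=(M, N)]) from scipy.sparse.csr_matrix, where data, row_ind and col_ind satisfy the relationship a[row_ind[k], col_ind[k]] = data[k].
The trick is to generate row_ind and col_ind by iterating over the documents and creating a list of tuples (doc_id, word_id). data is simply a vector of ones of the same length.
Multiplying the transpose of the docs-words matrix by the matrix itself gives the co-occurrence matrix.
In addition, this is efficient in both run time and memory usage, so it should also handle large corpora.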
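To make the multiplication step concrete, here is a small dense sketch using the data from the question. It is only an illustration, not the sparse implementation this answer uses: it assumes a binary presence matrix (a name repeated inside one document counts once), and the variable names X and cooc are made up for this example.

```python
import numpy as np

names = ['A', 'B', 'C', 'D']
document = [['A', 'B'], ['C', 'B', 'K'], ['A', 'B', 'C', 'D', 'Z']]

# Binary docs-words matrix: X[i, j] == 1 if names[j] appears in document i.
# Names that are not in `names` ('K', 'Z') simply get no column.
X = np.array([[int(name in doc) for name in names] for doc in document])

# (X.T @ X)[j, k] sums X[i, j] * X[i, k] over all documents i, and that
# product is 1 only when both names occur in document i.
cooc = X.T @ X
np.fill_diagonal(cooc, 0)  # drop each name's count with itself
print(cooc)
```

Before the diagonal is zeroed, entry [j, j] holds the number of documents that mention names[j].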
import numpy as np
import itertools
from scipy.sparse import csr_matrix
def create_co_occurences_matrix(allowed_words, documents):
    print(f"allowed_words:\n{allowed_words}")
    print(f"documents:\n{documents}")
    word_to_id = dict(zip(allowed_words, range(len(allowed_words))))
    # keep only the allowed words of each document, as sorted integer ids
    documents_as_ids = [np.sort([word_to_id[w] for w in doc if w in word_to_id]).astype('uint32') for doc in documents]
    row_ind, col_ind = zip(*itertools.chain(*[[(i, w) for w in doc] for i, doc in enumerate(documents_as_ids)]))
    data = np.ones(len(row_ind), dtype='uint32')  # use unsigned int for better memory utilization
    # one column per allowed word, so the shape does not depend on which words actually occur
    docs_words_matrix = csr_matrix((data, (row_ind, col_ind)), shape=(len(documents_as_ids), len(word_to_id)))  # efficient arithmetic operations with CSR * CSR
    words_cooc_matrix = docs_words_matrix.T * docs_words_matrix  # multiplying the transpose of docs_words_matrix by the matrix itself generates the co-occurrence matrix
    words_cooc_matrix.setdiag(0)
    print(f"words_cooc_matrix:\n{words_cooc_matrix.todense()}")
    return words_cooc_matrix, word_to_id
Running the example:
allowed_words = ['A', 'B', 'C', 'D']
documents = [['A', 'B'], ['C', 'B', 'K'],['A', 'B', 'C', 'D', 'Z']]
words_cooc_matrix, word_to_id = create_co_occurences_matrix(allowed_words, documents)
Output:
allowed_words:
['A', 'B', 'C', 'D']
documents:
[['A', 'B'], ['C', 'B', 'K'], ['A', 'B', 'C', 'D', 'Z']]
words_cooc_matrix:
[[0 2 1 1]
 [2 0 2 1]
 [1 2 0 1]
 [1 1 1 0]]
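The two return values can be combined to look up the count for a single pair of words. The sketch below is self-contained, so it rebuilds stand-ins for word_to_id and words_cooc_matrix by hand from the output shown above; cooc_count is a hypothetical helper name, not part of the answer.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Stand-ins for the function's return values, copied from the printed output.
word_to_id = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
words_cooc_matrix = csr_matrix(np.array([[0, 2, 1, 1],
                                         [2, 0, 2, 1],
                                         [1, 2, 0, 1],
                                         [1, 1, 1, 0]]))


def cooc_count(first, second):
    """Hypothetical helper: co-occurrence count for one pair of words."""
    return int(words_cooc_matrix[word_to_id[first], word_to_id[second]])


print(cooc_count('A', 'B'))
print(cooc_count('C', 'D'))
```

Because the matrix is symmetric, the order of the two words does not matter.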
from collections import OrderedDict
document = [['A', 'B'], ['C', 'B'], ['A', 'B', 'C', 'D']]
names = ['A', 'B', 'C', 'D']
occurrences = OrderedDict((name, OrderedDict((name, 0) for name in names)) for name in names)
# Find the co-occurrences:
for l in document:
    for i in range(len(l)):
        for item in l[:i] + l[i + 1:]:
            occurrences[l[i]][item] += 1
# Print the matrix:
print(' ', ' '.join(occurrences.keys()))
for name, values in occurrences.items():
    print(name, ' '.join(str(i) for i in values.values()))
Output:
  A B C D
A 0 2 1 1
B 2 0 2 1
C 1 2 0 1
D 1 1 1 0
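Note that the loop above indexes occurrences by every item it meets, so with the question's original data, which also contains 'K' and 'Z', it would raise a KeyError. A minimal sketch of one way around that, assuming items outside names should simply be ignored (the names allowed, kept and matrix are illustrative, not from the answer):

```python
names = ['A', 'B', 'C', 'D']
document = [['A', 'B'], ['C', 'B', 'K'], ['A', 'B', 'C', 'D', 'Z']]

# Nested dict of counters: occurrences[row][col], all starting at zero.
occurrences = {row: {col: 0 for col in names} for row in names}
allowed = set(names)

for doc in document:
    # Keep only the items that are rows/columns of the matrix.
    kept = [item for item in doc if item in allowed]
    for i, current in enumerate(kept):
        # Pair the current item with every other kept item in the document.
        for other in kept[:i] + kept[i + 1:]:
            occurrences[current][other] += 1

# Flatten to a list of rows in the order given by `names`.
matrix = [[occurrences[row][col] for col in names] for row in names]
for name, row in zip(names, matrix):
    print(name, ' '.join(str(value) for value in row))
```

Plain dicts keep insertion order in Python 3.7 and later, which is why this sketch does not need OrderedDict.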
You can also use a matrix trick to find the co-occurrence matrix. This should scale well when you have a larger vocabulary.
import scipy.sparse as sp
voc2id = dict(zip(names, range(len(names))))
rows, cols, vals = [], [], []
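for r, d in enumerate(document):
    for e in d:
        if voc2id.get(e) is not None:
            rows.append(r)
            cols.append(voc2id[e])
            vals.append(1)
# pass the shape explicitly so that names which never occur still get a column
X = sp.csr_matrix((vals, (rows, cols)), shape=(len(document), len(names)))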
Now you can find the co-occurrence matrix by simply multiplying X.T by X:
Xc = (X.T * X) # co-occurrence matrix
Xc.setdiag(0)
print(Xc.toarray())
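We can greatly simplify this using NetworkX. Here names are the nodes we want to consider, and the lists in document contain the nodes to connect.
We can connect the nodes in each sublist by taking the combinations of length 2, and create a MultiGraph to account for co-occurrences: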
import networkx as nx
from itertools import combinations
G = nx.from_edgelist((c for n_nodes in document for c in combinations(n_nodes, r=2)),
                     create_using=nx.MultiGraph)
nx.to_pandas_adjacency(G, nodelist=names, dtype='int')
   A  B  C  D
A  0  2  1  1
B  2  0  2  1
C  1  2  0  1
D  1  1  1  0
Obviously this can be extended for your purposes, but it performs the general operation you have in mind:
import math
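for a in 'ABCD':
    for b in 'ABCD':
        count = 0
        for x in document:
            if a != b:
                # different names: count the documents that mention both
                if a in x and b in x:
                    count += 1
            else:
                # same name: count the pairs among its repeats, n choose 2
                n = x.count(a)
                if n >= 2:
                    count += math.factorial(n) // math.factorial(n - 2) // 2
        print('{} x {} = {}'.format(a, b, count))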
Here is another solution using itertools and the Counter class from the collections module.
import numpy
import itertools
from collections import Counter
document =[['A', 'B'], ['C', 'B'],['A', 'B', 'C', 'D']]
# Get all of the unique entries you have
varnames = tuple(sorted(set(itertools.chain(*document))))
# Get a list of all of the combinations you have
expanded = [tuple(itertools.combinations(d, 2)) for d in document]
expanded = itertools.chain(*expanded)
# Sort the combinations so that A,B and B,A are treated the same
expanded = [tuple(sorted(d)) for d in expanded]
# count the combinations
c = Counter(expanded)
# Create the table
table = numpy.zeros((len(varnames),len(varnames)), dtype=int)
for i, v1 in enumerate(varnames):
    for j, v2 in enumerate(varnames[i:]):
        j = j + i
        table[i, j] = c[v1, v2]
        table[j, i] = c[v1, v2]
# Display the output
for row in table:
    print(row)
The output (which can easily be turned into a DataFrame) is:
[0 2 1 1]
[2 0 2 1]
[1 2 0 1]
[1 1 1 0]
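As a rough sketch of that last step, the table can be wrapped in a pandas DataFrame labelled with the names. To keep the snippet runnable on its own, varnames and table are re-entered by hand here from the output above rather than recomputed; df is an illustrative name.

```python
import numpy as np
import pandas as pd

varnames = ('A', 'B', 'C', 'D')
# The table printed above, entered by hand so this snippet runs on its own.
table = np.array([[0, 2, 1, 1],
                  [2, 0, 2, 1],
                  [1, 2, 0, 1],
                  [1, 1, 1, 0]])

# Use the names as both the row index and the column labels.
df = pd.DataFrame(table, index=list(varnames), columns=list(varnames))
print(df)

# Label-based lookup of a single pair.
print(df.loc['A', 'B'])
```

With the labels in place, a pair can be looked up by name instead of by integer position.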