Error "power iteration failed to converge within 100 iterations" when trying to summarize a text document with Python networkx

Gra*_*ace 6 python nlp networkx

When I try to summarize a text document with Python networkx (code below), I get PowerIterationFailedConvergence: (PowerIterationFailedConvergence(...), 'power iteration failed to converge within 100 iterations'). The error is raised at the line scores = nx.pagerank(sentence_similarity_graph).

import re

import networkx as nx
import numpy as np
from nltk.cluster.util import cosine_distance
from nltk.corpus import stopwords


def read_article(file_name):
    file = open(file_name, "r", encoding="utf8")
    filedata = file.readlines()
    text = ""
    for s in filedata:
        text = text + s.replace("\n", "")
        text = re.sub(' +', ' ', text)  # collapse repeated spaces
        text = re.sub('\xe2\x80\x94', ' ', text)

    article = text.split(". ")
    sentences = []
    for sentence in article:
        sentences.append(sentence.replace("[^a-zA-Z]", "").split(" "))
    sentences.pop()

    # drop words that (case-insensitively) repeat the previous word
    new_sent = []
    for lst in sentences:
        newlst = []
        for i in range(len(lst)):
            if lst[i].lower() != lst[i - 1].lower():
                newlst.append(lst[i])
        new_sent.append(newlst)
    return new_sent


def sentence_similarity(sent1, sent2, stopwords=None):
    if stopwords is None:
        stopwords = []

    sent1 = [w.lower() for w in sent1]
    sent2 = [w.lower() for w in sent2]

    all_words = list(set(sent1 + sent2))

    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)

    # build the vector for the first sentence
    for w in sent1:
        if w in stopwords:
            continue
        vector1[all_words.index(w)] += 1

    # build the vector for the second sentence
    for w in sent2:
        if w in stopwords:
            continue
        vector2[all_words.index(w)] += 1

    return 1 - cosine_distance(vector1, vector2)


def build_similarity_matrix(sentences, stop_words):
    # Create an empty similarity matrix
    similarity_matrix = np.zeros((len(sentences), len(sentences)))

    for idx1 in range(len(sentences)):
        for idx2 in range(len(sentences)):
            if idx1 == idx2:  # ignore if both are the same sentence
                continue
            similarity_matrix[idx1][idx2] = sentence_similarity(sentences[idx1], sentences[idx2], stop_words)

    return similarity_matrix


stop_words = stopwords.words('english')
summarize_text = []

# Step 1 - Read the text and split it
new_sent = read_article("C:\\Users\\Documents\\fedPressConference_0620.txt")

# Step 2 - Generate the similarity matrix across sentences
sentence_similarity_matrix = build_similarity_matrix(new_sent, stop_words)

# Step 3 - Rank sentences in the similarity matrix
sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)
scores = nx.pagerank(sentence_similarity_graph)

# Step 4 - Sort the ranks and pick the top sentences
ranked_sentence = sorted(((scores[i], s) for i, s in enumerate(new_sent)), reverse=True)
print("Indexes of top ranked_sentence order are ", ranked_sentence)

for i in range(10):
    summarize_text.append(" ".join(ranked_sentence[i][1]))

# Step 5 - Of course, output the summarized text
print("Summarize Text: \n", ". ".join(summarize_text))

Car*_*s B 11

Maybe you have solved this by now.

The problem is that the vectors you are using are too long. They are built over the entire vocabulary, which may be too large for the model to converge within 100 cycles (the default for pagerank).
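As a rough check (a hypothetical diagnostic, not part of the original code, assuming new_sent is the list of tokenized sentences returned by the question's read_article), you can print how large that vocabulary actually is:

# Hypothetical diagnostic: measure vocabulary size and sentence count.
vocabulary = set(word.lower() for sentence in new_sent for word in sentence)
print("vocabulary size:", len(vocabulary))
print("number of sentences:", len(new_sent))

If the vocabulary is huge relative to the number of sentences, the count vectors are long and mostly zero, which, as noted above, can keep pagerank from converging in only 100 iterations.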

You can reduce the size of the vocabulary (have you checked that stop words are actually being removed?) or apply some other technique, such as dropping the least frequent words or using TF-IDF.
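For the TF-IDF route, here is a minimal sketch (assuming scikit-learn is installed and new_sent is the tokenized-sentence list from the question) that could stand in for the question's sentence_similarity and build_similarity_matrix helpers:

# Sketch of the TF-IDF suggestion; assumes scikit-learn and the question's new_sent.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

sentences_as_text = [" ".join(tokens) for tokens in new_sent]
tfidf = TfidfVectorizer(stop_words="english")        # drops stop words, down-weights common terms
tfidf_matrix = tfidf.fit_transform(sentences_as_text)

similarity_matrix = cosine_similarity(tfidf_matrix)  # n_sentences x n_sentences
np.fill_diagonal(similarity_matrix, 0.0)             # ignore self-similarity, as the original loop does

The resulting similarity_matrix can then be passed to nx.from_numpy_array exactly as in the question.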

In my case I ran into the same problem, but with GloVe word embeddings: the 300-dimensional model would not converge, and switching to the 100-dimensional model fixed it easily.

Another thing you can try is to increase the max_iter parameter when calling nx.pagerank:

nx.pagerank(nx_graph, max_iter=600) # Or any number that will work for you.

The default is 100 iterations.
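If you would rather the script not crash while you tune this, you can also catch the exception and retry with a larger budget (a sketch using nx.pagerank's max_iter and tol keywords; the looser tolerance is my own choice here, not something from the question):

# Sketch: fall back to a larger iteration budget if the default does not converge.
import networkx as nx

try:
    scores = nx.pagerank(sentence_similarity_graph)   # defaults: max_iter=100, tol=1e-06
except nx.PowerIterationFailedConvergence:
    scores = nx.pagerank(sentence_similarity_graph, max_iter=600, tol=1e-04)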