Gra*_*ace 6 python nlp networkx
When I try to summarize a text document using python networkx (as shown in the code below), I get PowerIterationFailedConvergence: (PowerIterationFailedConvergence(...), 'power iteration failed to converge within 100 iterations'). The error is raised by the line scores = nx.pagerank(sentence_similarity_graph).
    import re
    import numpy as np
    import networkx as nx
    from nltk.corpus import stopwords
    from nltk.cluster.util import cosine_distance

    def read_article(file_name):
        file = open(file_name, "r", encoding="utf8")
        filedata = file.readlines()
        text = ""
        for s in filedata:
            text = text + s.replace("\n", "")
        text = re.sub(' +', ' ', text)            # collapse repeated spaces
        text = re.sub('\xe2\x80\x94', ' ', text)  # strip mis-decoded em dashes

        article = text.split(". ")
        sentences = []
        for sentence in article:
            sentences.append(sentence.replace("[^a-zA-Z]", "").split(" "))
        sentences.pop()

        # Drop consecutive duplicate words (case-insensitive)
        new_sent = []
        for lst in sentences:
            newlst = []
            for i in range(len(lst)):
                if i == 0 or lst[i].lower() != lst[i - 1].lower():
                    newlst.append(lst[i])
            new_sent.append(newlst)
        return new_sent

    def sentence_similarity(sent1, sent2, stopwords=None):
        if stopwords is None:
            stopwords = []

        sent1 = [w.lower() for w in sent1]
        sent2 = [w.lower() for w in sent2]

        all_words = list(set(sent1 + sent2))

        vector1 = [0] * len(all_words)
        vector2 = [0] * len(all_words)

        # Build the vector for the first sentence
        for w in sent1:
            if w in stopwords:
                continue
            vector1[all_words.index(w)] += 1

        # Build the vector for the second sentence
        for w in sent2:
            if w in stopwords:
                continue
            vector2[all_words.index(w)] += 1

        return 1 - cosine_distance(vector1, vector2)

    def build_similarity_matrix(sentences, stop_words):
        # Create an empty similarity matrix
        similarity_matrix = np.zeros((len(sentences), len(sentences)))

        for idx1 in range(len(sentences)):
            for idx2 in range(len(sentences)):
                if idx1 == idx2:  # skip comparing a sentence with itself
                    continue
                similarity_matrix[idx1][idx2] = sentence_similarity(
                    sentences[idx1], sentences[idx2], stop_words)

        return similarity_matrix

    stop_words = stopwords.words('english')
    summarize_text = []

    # Step 1 - Read the text and split it into sentences
    new_sent = read_article("C:\\Users\\Documents\\fedPressConference_0620.txt")

    # Step 2 - Generate the similarity matrix across sentences
    sentence_similarity_matrix = build_similarity_matrix(new_sent, stop_words)

    # Step 3 - Rank sentences in the similarity matrix
    sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)
    scores = nx.pagerank(sentence_similarity_graph)

    # Step 4 - Sort by rank and pick the top sentences
    ranked_sentence = sorted(((scores[i], s) for i, s in enumerate(new_sent)), reverse=True)
    print("Indexes of top ranked_sentence order are ", ranked_sentence)

    for i in range(10):
        summarize_text.append(" ".join(ranked_sentence[i][1]))

    # Step 5 - Output the summarized text
    print("Summarize Text: \n", ". ".join(summarize_text))
Car*_*s B 11
Maybe you have solved it by now.
The problem is that the vectors you are using are too long. They are built over the whole vocabulary, which may be too large for the model to converge within 100 iterations (the default for pagerank).
You can reduce the size of the vocabulary (did you check that the stopwords are actually being removed?) or apply some other technique, such as dropping the least frequent words or using TF-IDF. A sketch of the TF-IDF route is shown below.
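For illustration, here is a minimal sketch of the TF-IDF approach, assuming scikit-learn is available and that the sentences are plain strings (the function name is just for this example). TF-IDF downweights very frequent words, which shortens the effective vocabulary and usually yields a better-conditioned similarity graph:

    # Sketch: TF-IDF similarity matrix (assumes scikit-learn; illustrative only)
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def build_tfidf_similarity_matrix(sentences):
        # sentences: list of plain-text sentence strings
        tfidf = TfidfVectorizer(stop_words='english').fit_transform(sentences)
        similarity_matrix = cosine_similarity(tfidf)
        np.fill_diagonal(similarity_matrix, 0.0)  # no self-similarity edges
        return similarity_matrix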
In my case I hit the same problem, but using GloVe word embeddings. With 300 dimensions it would not converge, which was easily solved by switching to the 100-dimension model.
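As a rough sketch of that setup (hypothetical: `embeddings` stands for a word-to-vector dict you have loaded from a GloVe file yourself), a sentence vector is just the average of its word vectors, so its length is fixed by the embedding dimension rather than by the vocabulary size:

    # Sketch: fixed-length sentence vectors from word embeddings
    # ("embeddings" is a hypothetical dict: word -> 100-d GloVe vector)
    import numpy as np

    def sentence_vector(words, embeddings, dim=100):
        vecs = [embeddings[w] for w in words if w in embeddings]
        if not vecs:
            return np.zeros(dim)  # fallback for out-of-vocabulary sentences
        return np.mean(vecs, axis=0)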
Another thing you can try is to increase the max_iter parameter when calling nx.pagerank:
    nx.pagerank(nx_graph, max_iter=600)  # or any number that works for you
The default value is 100 iterations.
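If raising max_iter alone is not enough, nx.pagerank also takes a tol parameter (the convergence tolerance, 1e-06 by default); loosening it relaxes the stopping criterion at the cost of slightly less precise scores:

    nx.pagerank(nx_graph, max_iter=600, tol=1.0e-4)  # looser tolerance, easier convergence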