I have a sentence in a text file that I want to display in Python, but I want to display it so that a new line starts after every full stop (period).
For example, my paragraph is:

"Dr. Harrison bought bargain.co.uk for 2.5 million pounds, i.e. he paid a lot for it. Did he mind? John Smith, Esq. thinks he didn't. Nevertheless, this isn't true... Well, with a probability of .9 it isn't."

but I want it displayed like this:

"Dr. Harrison bought bargain.co.uk for 2.5 million pounds, i.e. he paid a lot for it.
Did he mind? John Smith, Esq. thinks he didn't.
Nevertheless, this isn't true...
Well, with a probability of .9 it isn't."

The other full stops that occur in the sentence (e.g. in "Dr.", "Esq.", ".9", in the website address, and of course the first two dots of the ellipsis) make this increasingly difficult.
… "Another systemic problem with Naive Bayes is that features are assumed to be independent. As a result, even when words are dependent, each word contributes evidence individually. Thus the magnitude of the weights for classes with strong word dependencies is larger than for classes with weak word dependencies. To keep classes with more dependencies from dominating, we normalize the classification weights." (reference)
What exactly does this mean? Is there an example that explains it better?
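A minimal sketch of the normalization step the quote describes (my own illustration, with made-up numbers): each class's weight vector is divided by the sum of the absolute values of its entries, so a class whose correlated words all pile up evidence cannot win simply because its weights have larger magnitude.

import numpy as np

# Hypothetical per-class log-weights over a 4-word vocabulary.
weights = {
    "class_strong_deps": np.array([-4.0, -4.2, -4.1, -0.5]),  # correlated words, large magnitudes
    "class_weak_deps":   np.array([-1.0, -0.9, -2.0, -0.3]),
}

# Weight normalization: divide each weight vector by its L1 norm so that
# no class dominates just because its weights are larger overall.
normalized = {c: w / np.abs(w).sum() for c, w in weights.items()}
print(normalized)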
Traceback (most recent call last):
  File "run_summarization.py", line 327, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "run_summarization.py", line 306, in main
    batcher = Batcher(FLAGS.data_path, vocab, hps, single_pass=FLAGS.single_pass)
  File "/home/hdm/hdm/program/CNN/pointer-generator-master/batcher.py", line 238, in __init__
    self._example_queue = Queue.Queue(self.BATCH_QUEUE_MAX * self._hps.batch_size)
TypeError: unsupported operand type(s) for *: 'int' and 'Flag'
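The error says that self._hps.batch_size is a Flag object rather than a plain integer, so it cannot be multiplied by an int. A minimal defensive sketch (my own suggestion, assuming the flag is an absl/TensorFlow Flag object that exposes its value via .value):

def as_int(maybe_flag):
    """Return a plain int whether we are given an int or a Flag-like object."""
    return maybe_flag.value if hasattr(maybe_flag, "value") else maybe_flag

# Inside Batcher.__init__ the failing line would then become, e.g.:
#   self._example_queue = Queue.Queue(self.BATCH_QUEUE_MAX * as_int(self._hps.batch_size))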
Given some text, how can I get the most common n-grams for n = 1 to 6? I have seen ways to get 3-grams or 2-grams, one n at a time, but is there a way to extract the most meaningful maximum-length phrases along with all the rest?
For example, in this text, which is for demonstration purposes only:
fri evening commute can be long. some people avoid fri evening commute by choosing off-peak hours. there are much less traffic during off-peak.
The ideal result of n-grams and their counts would be:
fri evening commute: 3,
off-peak: 2,
rest of the words: 1
Any suggestions are appreciated. Thanks.
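A minimal sketch of one way to collect the counts (my own illustration, assuming NLTK is installed; it only produces counts and does not by itself decide which phrase is "most meaningful"):

from collections import Counter
from nltk.util import ngrams  # yields tuples of consecutive tokens

text = ("fri evening commute can be long. some people avoid fri evening commute "
        "by choosing off-peak hours. there are much less traffic during off-peak.")
tokens = [w.strip(".,") for w in text.lower().split()]

counts = Counter()
for n in range(1, 7):  # n = 1 .. 6
    counts.update(" ".join(g) for g in ngrams(tokens, n))

# Show the most common phrases first.
for phrase, count in counts.most_common(10):
    print(phrase, count)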
I want to extract author names from PDF papers. Does anyone know of a reliable way to do this?
For example, I want to extract the name Archana Shukla from this PDF: https://arxiv.org/pdf/1111.1648
python pdf nlp named-entity-recognition information-extraction
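A minimal sketch of one possible approach (my own suggestion, not a definitive method): extract the text of the first page with pdfminer.six and run spaCy's named entity recognizer over it, keeping PERSON entities. Author names on title pages are often, though not always, picked up this way.

from pdfminer.high_level import extract_text  # pdfminer.six
import spacy

# Assumes the PDF was downloaded locally and the small English model is installed
# (python -m spacy download en_core_web_sm).
first_page = extract_text("1111.1648.pdf", page_numbers=[0])

nlp = spacy.load("en_core_web_sm")
doc = nlp(first_page)

authors = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
print(authors)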
I am on Windows 10 and installed spaCy with pip, but now I get an error when I run
import spacy
in the Python shell.
My error message is:
Traceback (most recent call last):
  File "C:\Users\Administrator\errbot-root\plugins\utility\model_training_test.py", line 17, in <module>
import spacy
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\spacy\__init__.py", line 4, in <module>
from .cli.info import info as cli_info
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\spacy\cli\__init__.py", line 1, in <module>
from .download import download
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\spacy\cli\download.py", line 5, in <module>
import requests
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\requests\__init__.py", line 43, in <module>
import urllib3
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\urllib3\__init__.py", line 8, in <module>
from .connectionpool import (
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\urllib3\connectionpool.py", line 11, in <module> …Run Code Online (Sandbox Code Playgroud) 我试图编写正则表达式,仅匹配由Python中超过3个字母的英文字母文本组成的文本。我试过:
 regex = r'[a-z][a-z][a-z]+'
but it does not filter out strings like
 how@@
Any ideas would be appreciated :)
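A minimal sketch of one way to do this (my own suggestion; it reads "more than 3 letters" as at least 4 and requires the whole string to be letters, which is what the how@@ example seems to call for):

import re

# fullmatch forces the pattern to cover the entire string, so trailing
# characters such as "@@" make the match fail.
pattern = re.compile(r"[A-Za-z]{4,}")

print(bool(pattern.fullmatch("however")))  # True
print(bool(pattern.fullmatch("how@@")))    # False: contains non-letter characters
print(bool(pattern.fullmatch("how")))      # False: only 3 letters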
I want to compute the edit distance between sentences in a document. I found code that computes the distance at the character level, but I would like it at the word level. For example, the character-level output for the strings below is 6, but I want it to be 1, meaning that only one word needs to be deleted to change b into a (or a into b):
\n\na = "The patient tolerated this ."\nb = "The patient tolerated ."\n\ndef levenshtein_distance(a, b):\n\n    if a == b:\n        return 0\n    if len(a) < len(b):\n        a, b = b, a\n    if not a:\n        return len(b)\n    previous_row = range(len(b) + 1)\n    for i, column1 in enumerate(a):\n        current_row = [i + 1]\n        for j, column2 in enumerate(b):\n            insertions = previous_row[j + 1] + 1\n            deletions = current_row[j] + …Run Code Online (Sandbox Code Playgroud) 我在从下面的元组列表中获取文本列表时遇到了很大的挑战,这是我从 nltk 库中获取的关键字
I am having a hard time getting a list of just the texts out of the list of tuples below, which are keywords I obtained with the nltk library:

[('supreme court justice ruth bader ginsburg may', 14.0),
 ('justice ruth bader ginsburg rightly holds', 12.0),
 ('vintage ruth— ‘ straight ahead', 10.0),
 ('fellow supreme court colleagues penned', 10.0),
 ('could make things better', 8.0),
 ('neighbor sanford greenberg says', 8.0),
 ('live. ” ginsburg ’', 8.0),]
This is the expected output I would like to get:
['supreme court justice ruth bader ginsburg may',
 'justice ruth bader ginsburg rightly holds',
 'vintage ruth— ‘ straight ahead',
 'fellow supreme court colleagues penned',
 'could make things better',
 'neighbor sanford greenberg says', 
 'live. ” ginsburg …
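A minimal sketch of one way to get that (my own suggestion): each item is a (phrase, score) pair, so a list comprehension that keeps only the first element is enough.

keywords = [('supreme court justice ruth bader ginsburg may', 14.0),
            ('justice ruth bader ginsburg rightly holds', 12.0),
            ('could make things better', 8.0)]

phrases = [phrase for phrase, score in keywords]  # drop the scores
print(phrases)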
I am stuck setting the input shape of this simple Keras model :( Both X and Y are numpy.ndarray, but I do not know what is wrong with it! I have tried different shapes for X, but the error is still there! The dataset (dimensions, number of samples, etc.) is shown in the code. The .pkl file for X_train comes from the hidden states of a pretrained model.

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from keras.utils import np_utils
from keras import Input, Model
from keras.layers import Dense
import numpy as np
############################## X_Train ############################
X_Train_3embed1 = pd.read_pickle("XX_Train_3embeding.pkl")
X_Train_3embed = np.array(X_Train_3embed1)
print("X-Train")
print(X_Train_3embed.shape)   # (230, 1, 128)
print(type(X_Train_3embed))  # <class 'numpy.ndarray'>
print(X_Train_3embed[0].shape) # (1, 128)
print(type(X_Train_3embed[0])) # <class 'numpy.ndarray'>
############################## Y_Train ############################
Y_Train_labels_list = pd.read_pickle("lis_Y_all_Train.pkl")
print(type(Y_Train_labels_list))  #<class 'numpy.ndarray'>
print(type(Y_Train_labels_list[0])) #<class 'str'>
encoder …
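The post is cut off before the model definition and the error message, so only a generic sketch is possible. Assuming X_Train_3embed really has shape (230, 1, 128) and the labels are strings (both stated above), a minimal model that accepts that shape could look like this (my own illustration with made-up stand-in data, not the poster's code):

from keras import Input, Model
from keras.layers import Dense, Flatten
from keras.utils import np_utils
from sklearn.preprocessing import LabelEncoder
import numpy as np

# Hypothetical stand-ins for the pickled data: 230 samples of shape (1, 128).
X = np.random.rand(230, 1, 128)
y_labels = np.array(["class_a", "class_b"] * 115)

encoder = LabelEncoder()
y = np_utils.to_categorical(encoder.fit_transform(y_labels))  # one-hot targets

inputs = Input(shape=(1, 128))  # shape of one sample, batch dimension excluded
x = Flatten()(inputs)           # (1, 128) -> (128,)
outputs = Dense(y.shape[1], activation="softmax")(x)

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=2, batch_size=16)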