Posts by red*_*red

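Extracting words from German text with nltk

I am trying to extract words from a German document. When I use the following method, as described in the nltk tutorial, I fail to get the words that contain language-specific special characters.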

import nltk

ptcr = nltk.corpus.PlaintextCorpusReader(Corpus, '.*')
words = nltk.Text(ptcr.words(DocumentName))


How can I get the list of words in the document?

An example with nltk.tokenize.WordPunctTokenizer() for the German phrase Veränderungen über einen Walzer looks like this:

In [231]: nltk.tokenize.WordPunctTokenizer().tokenize(u"Veränderungen über einen Walzer")

Out[231]: [u'Ver\xc3', u'\xa4', u'nderungen', u'\xc3\xbcber', u'einen', u'Walzer']


In this example, "ä" is treated as a delimiter, even though "ü" is not.
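One possible explanation, offered as an assumption rather than a diagnosis: the output above looks like UTF-8 bytes being tokenized one byte at a time rather than decoded text, since "ä" is `\xc3\xa4` and "ü" is `\xc3\xbc` in UTF-8. A minimal sketch in Python 3, using only the standard `re` module and the pattern `\w+|[^\w\s]+` (the one WordPunctTokenizer is built on, as far as I know). The helper name `tokenize_word_punct` is hypothetical and not part of nltk:

```python
import re

# Runs of word characters, or runs of punctuation (assumed to match
# the pattern used by nltk's WordPunctTokenizer).
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")


def tokenize_word_punct(text):
    """Split decoded (Unicode) text into word and punctuation tokens."""
    return WORD_PUNCT.findall(text)


# Bytes as they would come from a UTF-8 encoded file on disk.
raw = "Veränderungen über einen Walzer".encode("utf-8")

# Decode to Unicode text first, then tokenize.
text = raw.decode("utf-8")
tokens = tokenize_word_punct(text)
print(tokens)
```

Once the text is decoded, `\w` matches "ä" and "ü" like any other letter, so the umlauts stay inside their words.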

python nlp text-mining nltk

red*_*red

2017 02-13

9
votes

2
answers

10k
views

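Plotting bivariate Gaussian distributions in matplotlib

How can we plot (in Python matplotlib) bivariate Gaussian distributions, given their centers and covariance matrices as numpy arrays?

Suppose our parameters are as follows: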

import numpy as np

center1 = np.array([3, 3])
center2 = np.array([5, 5])
cov1 = np.array([[1., .5], [.5, .1]])
cov2 = np.array([[.2, .5], [.5, .2]])

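A rough sketch of one way to do this: evaluate each density on a grid with numpy and draw contour lines with matplotlib. Note that the two covariance matrices in the question have negative determinants, so they are not valid covariance matrices. The values of `cov1` and `cov2` below are replacements I made up for illustration, and the helper names `check_covariance` and `gaussian_pdf` are hypothetical:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to use plt.show()
import matplotlib.pyplot as plt


def check_covariance(cov):
    """Raise ValueError unless cov is symmetric positive definite."""
    if not np.allclose(cov, cov.T) or np.any(np.linalg.eigvalsh(cov) <= 0):
        raise ValueError("covariance must be symmetric positive definite")


def gaussian_pdf(points, mean, cov):
    """Bivariate normal density at points of shape (..., 2)."""
    check_covariance(cov)
    diff = points - mean
    inv = np.linalg.inv(cov)
    norm = 1.0 / (2.0 * np.pi * np.sqrt(np.linalg.det(cov)))
    # Quadratic form diff^T inv diff, evaluated for every point.
    quad = np.einsum("...i,ij,...j->...", diff, inv, diff)
    return norm * np.exp(-0.5 * quad)


center1 = np.array([3, 3])
center2 = np.array([5, 5])
# Assumed replacements: the matrices in the question are not positive definite.
cov1 = np.array([[1.0, 0.5], [0.5, 1.0]])
cov2 = np.array([[0.2, 0.1], [0.1, 0.2]])

xs = np.linspace(0, 8, 200)
ys = np.linspace(0, 8, 200)
X, Y = np.meshgrid(xs, ys)
grid = np.dstack((X, Y))  # shape (200, 200, 2)

Z1 = gaussian_pdf(grid, center1, cov1)
Z2 = gaussian_pdf(grid, center2, cov2)

fig, ax = plt.subplots()
ax.contour(X, Y, Z1, cmap="Blues")
ax.contour(X, Y, Z2, cmap="Reds")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_aspect("equal")
fig.savefig("bivariate_gaussians.png")
```

Passing the question's original `cov1` or `cov2` to `gaussian_pdf` raises `ValueError`, which is deliberate: the square root of a negative determinant would otherwise silently produce NaN densities.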

python statistics numpy normal-distribution matplotlib

red*_*red

lucky-day

8
votes

1
answer

10k
views

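Does UTF-8 cover all the characters of the ISO8859-xx and windows-12xx encodings?

I am trying to write a generic document indexer in Python for a bunch of documents with different encodings. I would like to know whether I can read all of my files (encoded in utf-8, ISO8859-xx and windows-12xx) as utf-8 without losing characters.

The reading part is as follows: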

import codecs

with codecs.open(doc_name, "r", "utf-8") as fin:
    doc_content = fin.read()

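On the question itself: UTF-8 can represent every character found in the ISO 8859 and Windows-125x code pages, but bytes written in those encodings are generally not valid UTF-8, so each file has to be decoded with its own codec. A minimal sketch of a fallback strategy, assuming the candidate list `("utf-8", "cp1252", "latin-1")` is a reasonable guess for the corpus. The helper names `decode_with_fallback` and `read_document` are hypothetical:

```python
def decode_with_fallback(data, encodings=("utf-8", "cp1252", "latin-1")):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


def read_document(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Read a file as bytes and decode it with the first codec that works."""
    with open(path, "rb") as fin:
        return decode_with_fallback(fin.read(), encodings)


# The same word stored in two different encodings.
utf8_bytes = "über".encode("utf-8")
latin1_bytes = "über".encode("latin-1")

text_a, enc_a = decode_with_fallback(utf8_bytes)
text_b, enc_b = decode_with_fallback(latin1_bytes)
print(text_a, enc_a)
print(text_b, enc_b)
```

Strict UTF-8 goes first because it rejects most text written in a single-byte code page. `latin-1` goes last because it accepts any byte sequence, so a successful decode there proves nothing; telling one single-byte code page from another needs extra knowledge about the documents.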

python text-processing character-encoding

red*_*red

2012 02-29

1
vote

1
answer

315
views

Tag statistics

python ×3

character-encoding ×1

matplotlib ×1

nlp ×1

nltk ×1

normal-distribution ×1

numpy ×1

statistics ×1

text-mining ×1

text-processing ×1
