Pan*_*hal 12 python apostrophe python-2.7
mycorpus.txt
Human where's machine interface for lab abc computer applications
A where's survey of user opinion of computer system response time
Contents of stopWords.txt:
let's
ain't
there's
The following code
corpus = set()
for line in open("path\\to\\mycorpus.txt"):
    corpus.update(set(line.lower().split()))
print corpus

stoplist = set()
for line in open("C:\\Users\\Pankaj\\Desktop\\BTP\\stopwords_new.txt"):
    stoplist.add(line.lower().strip())
print stoplist
gives the following output:
set(['a', "where's", 'abc', 'for', 'of', 'system', 'lab', 'machine', 'applications', 'computer', 'survey', 'user', 'human', 'time', 'interface', 'opinion', 'response'])
set(['let\x92s', 'ain\x92t', 'there\x92s'])
Why does the apostrophe become \x92 in the second set?
CB *_*ley 17
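Byte 92 (hex) in the Windows-1252 encoding maps to Unicode code point 2019 (hex), "RIGHT SINGLE QUOTATION MARK". It looks very much like an apostrophe, and it is most likely the actual character in your stopwords.txt. Judging by how Python displayed it, I'd guess the file is encoded in windows-1252, or in another encoding that shares its code point values for ASCII and for ’. Python 2's open() returns raw byte strings, so the set's repr shows that non-ASCII byte as the escape \x92, while the plain ASCII apostrophe in mycorpus.txt prints unchanged.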
’ vs '
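One way to make the two sets comparable is to decode the stop-word bytes and replace the curly quote with a plain apostrophe. The following is only a sketch: it assumes the file really is Windows-1252 (the answer above is a guess about that), and the helper names normalize_stopword and load_stoplist are hypothetical, not part of the original code.

```python
# -*- coding: utf-8 -*-
import io

RIGHT_SINGLE_QUOTE = u"\u2019"  # what byte 0x92 means in Windows-1252

# The bytes as they would sit in a Windows-1252 encoded file.
raw = b"let\x92s"
decoded = raw.decode("cp1252")                      # u"let\u2019s"
normalized = decoded.replace(RIGHT_SINGLE_QUOTE, u"'")


def normalize_stopword(raw_bytes, encoding="cp1252"):
    """Decode one line of bytes and map the curly quote to an ASCII apostrophe.

    The cp1252 default is an assumption about the file's encoding.
    """
    text = raw_bytes.decode(encoding)
    return text.replace(RIGHT_SINGLE_QUOTE, u"'").strip().lower()


def load_stoplist(path, encoding="cp1252"):
    """Read a stop-word file into a set of normalized unicode strings."""
    stoplist = set()
    with io.open(path, "rb") as handle:
        for line in handle:
            word = normalize_stopword(line, encoding)
            if word:  # skip blank lines
                stoplist.add(word)
    return stoplist
```

With the stop words normalized this way, an entry such as "there's" in the stop list uses the same apostrophe as "where's" in the corpus. Under Python 2 the corpus lines would still need decoding before comparing them with these unicode strings; re-saving stopwords.txt with plain ASCII apostrophes avoids the issue altogether.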