Ham*_*mad 1 python file-io file web-crawler scrapy
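I want to find the total number of positive and negative words matched in a given text. I have a list of positive words in positive.txt and a list of negative words in negative.txt. If a word matches one in the positive list, I want a simple integer variable to be incremented by 1, and likewise for a word that matches the negative list. With the code below I extract the text under @class=[story-hed]. This is the text I want to compare against the positive and negative word lists to get the total counts. My code is: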
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from dawn.items import DawnItem

class dawnSpider(BaseSpider):
    name = "dawn"
    allowed_domains = ["dawn.com"]
    start_urls = [
        "http://dawn.com/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//h3[@class="story-hed"]//a/text()').extract()
        items = []
        for site in sites:
            item = DawnItem()
            item['title'] = site
            items.append(item)
        return items
The following standalone code does the job:
from collections import Counter

def readwords(filename):
    # read one word per line, closing the file when done
    with open(filename) as f:
        return [line.rstrip() for line in f]

positive = readwords('positive.txt')
negative = readwords('negative.txt')

paragraph = 'this is really bad and in fact awesome. really awesome.'

# count every whitespace-separated token in the text
count = Counter(paragraph.split())

pos = 0
neg = 0
for key, val in count.items():
    key = key.rstrip('.,?!\n')  # removing possible punctuation signs
    if key in positive:
        pos += val
    if key in negative:
        neg += val

print("%d %d" % (pos, neg))
These are the contents of the two input files:
positive.txt:
good
awesome
negative.txt:
bad
ugly
The output is: 2 1
To implement this in Scrapy, you will probably want to use an item pipeline: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
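As a rough illustration of the pipeline idea, here is a minimal sketch rather than a tested part of the project. It assumes the item exposes the scraped text under 'title' (as in the question) and that two extra fields, 'positive' and 'negative', have been added to DawnItem. The class name WordCountPipeline, the helper load_words and those two field names are hypothetical. Unlike the standalone code above, it lowercases the text and strips punctuation on both sides of each word.

```python
import re


def load_words(filename):
    """Read one word per line into a set, skipping blank lines."""
    with open(filename) as f:
        return {line.strip().lower() for line in f if line.strip()}


class WordCountPipeline(object):
    """Hypothetical pipeline: stores positive/negative word counts on each item."""

    def __init__(self, positive=None, negative=None):
        # Word sets can be passed in directly (handy for testing);
        # otherwise they are loaded from the files in open_spider().
        self.positive = positive
        self.negative = negative

    def open_spider(self, spider):
        # Scrapy calls this once when the spider starts.
        if self.positive is None:
            self.positive = load_words('positive.txt')
        if self.negative is None:
            self.negative = load_words('negative.txt')

    def process_item(self, item, spider):
        # Lowercase the title and split it into alphabetic words.
        words = re.findall(r"[a-z']+", item['title'].lower())
        # 'positive' and 'negative' are assumed extra fields on the item.
        item['positive'] = sum(1 for w in words if w in self.positive)
        item['negative'] = sum(1 for w in words if w in self.negative)
        return item
```

To try it, the class would need to be listed in the project's ITEM_PIPELINES setting, for example {'dawn.pipelines.WordCountPipeline': 300}, where the module path is a guess at the project layout.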