gardai-plan-crackdown-on-troublemakers-at-protest-2438316.html': {'dail': 1, 'focus': 1, 'actions': 1, 'trade': 2, 'protest': 1, 'identify': 1, 'previous': 1, 'detectives': 1, 'republican': 1, 'group': 1, 'monitor': 1, 'clashes': 1, 'civil': 1, 'charge': 1, 'breaches': 1, 'travelling': 1, 'main': 1, 'disrupt': 1, 'real': 1, 'policing': 3, 'march': 6, 'finance': 1, 'drawn': 1, 'assistant': 1, 'protesters': 1, 'emphasised': 1, 'department': 1, 'traffic': 2, 'outbreak': 1, 'culprits': 1, 'proportionate': 1, 'instructions': 1, 'warned': 2, 'commanders': 1, 'michael': 2, 'exploit': 1, 'culminating': 1, 'large': 2, 'continue': 1, 'team': …

I am an absolute beginner. I have never created a classifier or anything else in Weka using Java; I have used the interface before. Basically I am a bit lost. I have looked at Weka's filter classes and played with them a little. My files are text files, and I need to classify them into two categories.
I'm not sure how to define the categories or how to load the documents to be classified into the IDE.
:-(
Any help, tutorials or pointers would be appreciated.
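One possible route, sketched here under assumptions rather than taken from the question: Weka reads ARFF files, so word-frequency dictionaries like the one above can be written out as one numeric attribute per word plus a nominal class attribute holding the two categories. The function name `to_arff`, the `w_` attribute prefix, the `doc_class` attribute name and the example labels are all hypothetical. The sketch assumes every word matches `\w+`, so attribute names need no quoting. Weka also ships a `TextDirectoryLoader` converter that reads one sub-directory per class, which may be simpler if the raw text files are already sorted into two folders.

```python
def to_arff(docs, labels, relation='documents'):
    """Build ARFF text from word counts and class labels.

    docs   -- {filename: {word: count}}, as produced by a word-count script
    labels -- {filename: category}, supplied by hand (hypothetical input)
    """
    # One numeric attribute per distinct word across all documents.
    vocab = sorted({word for freqs in docs.values() for word in freqs})
    classes = sorted(set(labels.values()))

    lines = ['@relation %s' % relation, '']
    for word in vocab:
        # The w_ prefix keeps word attributes apart from the class attribute.
        lines.append('@attribute w_%s numeric' % word)
    # The class attribute goes last, where Weka's Explorer looks by default.
    lines.append('@attribute doc_class {%s}' % ','.join(classes))
    lines.append('')
    lines.append('@data')
    for name in sorted(docs):
        row = [str(docs[name].get(word, 0)) for word in vocab]
        row.append(labels[name])
        lines.append(','.join(row))
    return '\n'.join(lines) + '\n'
```

Writing the returned string to a file ending in `.arff` should give something the Weka interface can open directly; the category of each training document still has to be decided by hand and passed in through `labels`.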
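I created a Python module that reads the files, removes stop words and outputs a Python dictionary containing the words and their frequencies (the number of times each occurs in the document).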
import os
import re

# path and stopwordfile are module-level settings that are not shown here.
def run():
    filelist = os.listdir(path)
    # Capture the article text inside <div class="body">...</div>.
    regex = re.compile(r'.*<div class="body">(.*?)</div>.*', re.DOTALL | re.IGNORECASE)
    # Match opening and closing <a> and <p> tags.
    reg1 = re.compile(r'<\/?[ap][^>]*>', re.DOTALL | re.IGNORECASE)
    quotereg = re.compile(r'"', re.DOTALL | re.IGNORECASE)
    puncreg = re.compile(r'[^\w]', re.DOTALL | re.IGNORECASE)
    f = open(stopwordfile, 'r')
    stopwords = f.read().lower().split()
    totalfreq = {}
    filewords = {}
    htmlfiles = []
    # Keep only the .html files.
    for file in filelist:
        if file[-5:] == '.html':
            htmlfiles.append(file)
    for file in htmlfiles:
        f = open(path + file, 'r')
        words = f.read().lower()
        words = regex.findall(words)[0]
        words = quotereg.sub(' ', words)
        words = reg1.sub(' …

I created a Google App Engine project in Python. It runs on my localhost, but when I upload it to geo-event-maps.appspot.com the markers are not displayed. I have a cron job that calls /place. There are no errors in the logs, yet my datastore is empty! The txt files are being uploaded with:
file_path = os.path.dirname(__file__)
path = os.path.join(file_path, 'storing', 'txtFiles')
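Is there any way to check whether the files have been uploaded?!
I am at an absolute loss. Has anyone run into these problems before?
Below is my main.py: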
#!/usr/bin/env python
'''
Created on Mar 30, 2011
@author: kimmasterson
'''
from google.appengine.ext import webapp
from google.appengine.ext import db
from placemaker import placemaker
import logging
import wsgiref.handlers
import os, glob
from google.appengine.dist import use_library
use_library('django', '1.2')
from google.appengine.ext.webapp import template

# Datastore model for one geocoded story.
class Story(db.Model):
    id = db.StringProperty()
    loc_name = db.StringProperty()
    title = db.StringProperty()
    long = db.FloatProperty()
    lat = db.FloatProperty()
    link = db.StringProperty()
    date = db.StringProperty()

class MyStories(webapp.RequestHandler):
    def …

python cron google-app-engine google-maps google-maps-markers
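For the question of whether the txt files were actually uploaded, one low-tech check is to log what the deployed application can see at the path it builds. The following is a minimal sketch using only the standard library; the helper name `describe_data_dir` is hypothetical and not part of the original main.py.

```python
import logging
import os


def describe_data_dir(path):
    """Return (exists, sorted .txt filenames) for path and log the result."""
    if not os.path.isdir(path):
        # Nothing was deployed at this location, or the path is built wrongly.
        logging.warning('data directory missing: %s', path)
        return False, []
    names = sorted(n for n in os.listdir(path) if n.endswith('.txt'))
    logging.info('%d txt file(s) in %s: %s', len(names), path, names)
    return True, names
```

Calling `describe_data_dir(path)` at the start of the handler that the cron job hits, with `path` built as in the question, should leave a line in the application logs after the next cron run. If the directory turns out to be missing only on the deployed version, it may be worth checking whether it is matched by a static file handler in app.yaml, since files served that way are not readable from application code by default.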
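I have a Python script that takes in '.html' files, removes stop words and returns all the other words in a Python dictionary. But if the same word occurs in more than one file, I want it returned only once, i.e. the result should contain the non-stop words, each of them only once.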
import os
import re

# path and stopwordfile are module-level settings that are not shown here.
def run():
    filelist = os.listdir(path)
    # Capture the article text inside <div class="body">...</div>.
    regex = re.compile(r'.*<div class="body">(.*?)</div>.*', re.DOTALL | re.IGNORECASE)
    # Match opening and closing <a> and <p> tags.
    reg1 = re.compile(r'<\/?[ap][^>]*>', re.DOTALL | re.IGNORECASE)
    quotereg = re.compile(r'"', re.DOTALL | re.IGNORECASE)
    puncreg = re.compile(r'[^\w]', re.DOTALL | re.IGNORECASE)
    f = open(stopwordfile, 'r')
    stopwords = f.read().lower().split()
    filewords = {}
    htmlfiles = []
    # Keep only the .html files.
    for file in filelist:
        if file[-5:] == '.html':
            htmlfiles.append(file)
    totalfreq = {}
    for file in htmlfiles:
        f = open(path + file, 'r')
        words = f.read().lower()
        words = regex.findall(words)[0]
        words = quotereg.sub(' ', words)
        words = reg1.sub(' …
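Since the rest of the script is cut off, the following is only a sketch of one way to get each non-stop word exactly once across all files: keep a single dictionary keyed by word and add every file's counts into it, so that a word seen in several files still has one entry. The function name `merge_word_counts` and its input shape are assumptions; `filewords` and `totalfreq` reuse the names from the script above.

```python
from collections import Counter


def merge_word_counts(per_file_words, stopwords):
    """Count non-stop words per file and across all files.

    per_file_words -- {filename: [word, ...]} (hypothetical input shape)
    Returns (filewords, totalfreq); each word is a key in totalfreq once.
    """
    stop = set(stopwords)  # set membership tests are fast
    filewords = {}
    totalfreq = Counter()
    for name, words in per_file_words.items():
        # Skip empty tokens and stop words before counting.
        counts = Counter(w for w in words if w and w not in stop)
        filewords[name] = dict(counts)
        # Adding into one shared Counter merges repeated words.
        totalfreq.update(counts)
    return filewords, dict(totalfreq)
```

If only the distinct words are needed and not their counts, `set(totalfreq)` gives them directly.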