I have the following data:
20120219,\\n,43166053
20120220,\\n,46813269
20120221,\\n,47277204
20120222,\\n,46344556
20120223,\\n,26926236
20120224,\\n,6472506
20120225,\\n,39580476
20120226,\\n,55968342
20120227,\\n,32889948
20120228,\\n,32116361
20120229,\\n,32424829
20120301,\\n,56123889
20120302,\\n,67102459
20120303,\\n,81681885
20120304,\\n,85740021
20120305,\\n,83874668
20120306,\\n,83606683
20120307,\\n,56660981
20120308,\\n,44534668
20120309,\\n,37532071
20120310,\\n,39260242
20120311,\\n,40491186
20120312,\\n,39041085
20120313,\\n,27010562
20120314,\\n,44121900
20120315,\\n,87750645
20120316,\\n,86588523
20120317,\\n,86121469
20120318,\\n,89343506
20120319,\\n,89198664
20120320,\\n,90273127
I have the following code to create a bar chart:
import matplotlib
matplotlib.use('Agg')
from matplotlib.mlab import csv2rec
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from pylab import *
from datetime import datetime
import dateutil
import sys
import matplotlib.ticker as mticker
y = []
input = open(sys.argv[1], 'r')
data = csv2rec(input, names=['date', 'symbol', 'count'])
for item in …

I'm trying to gather twitter statistics from a specific dataset that was provided to me. I have no control over how the data is formatted before it is given to me, so I'm locked into this messy form.
I would like some suggestions on how I can build a Python program to parse this sort of input and output something more along the lines of a CSV file, with the field titles as a header and the values below.
I want …
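A minimal sketch of one way to normalize lines in that shape (a date, a literal `\n` placeholder field, and a count) into a clean CSV with a header row; the inline `raw` sample stands in for reading the real file:

```python
import csv
import io

# Sample in the same shape as the data above: the middle field is the
# literal two characters backslash-n, just as in the source file.
raw = """20120219,\\n,43166053
20120220,\\n,46813269
20120221,\\n,47277204"""

rows = []
for line in raw.splitlines():
    parts = line.split(",")
    if len(parts) != 3:
        continue  # skip malformed lines rather than crashing
    date, _, count = parts  # drop the junk middle field
    rows.append((date, int(count)))

# Write a clean CSV with a header row.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["date", "count"])
writer.writerows(rows)
print(out.getvalue())
```

Swapping the inline string for `open(sys.argv[1])` would make this a drop-in preprocessing step for the plotting script above.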
I'm experimenting with multithreading in Python. I have working code that counts words and lines of text and builds a dictionary with the count of each word. It runs quickly on small files like the one mentioned in the code comments, but I usually use glob to pull in multiple files, and when I do, my run time goes up significantly. Meanwhile, since my script is single-threaded, I can see three of my other cores sitting idle while one is maxed out.
I thought I'd give Python's threading module a shot, and here's what I have so far (non-working):
#!/bin/python
#
# test file: http://www.gutenberg.org/ebooks/2852.txt.utf-8
import fileinput
from collections import defaultdict
import threading
import time
inputfilename = 'pg2852.txt'
exitFlag = 0
line = []
line_counter = 0
tot_words = 0
word_dict = defaultdict(int)
def myCounters(threadName, delay):
    # Without these globals, the += below raises UnboundLocalError.
    global tot_words, line_counter
    for line in fileinput.input([inputfilename]):
        line = line.strip()
        if not line: continue
        words = line.split()
        tot_words += len(words)
        line_counter += 1
        for word in words:
            word_dict[word] += 1
    print "%s: %s:" % (threadName, time.ctime(time.time()))
    print word_dict
    print "Total Words: ", …

I have an RDF file here, rdf.rdf, with 35696 records in it. I'm trying to process it with Jena:
./bin/sparql --data=/tmp/rdf.rdf --query=./basic.query
But I get:
21:25:27 ERROR riot :: Element type "j.0:target" must be followed by either attribute specifications, ">" or "/>".
Failed to load data
I believe the problem is one specific record, but I don't know which one. Does anyone have a way to check this, or a command that produces the line number of the offending record?
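Since the riot error here doesn't include a position, one hedged workaround (an assumption on my part, not part of Jena) is to run the file through Python's non-validating SAX parser, which reports the exact line and column of the first well-formedness error:

```python
import io
import xml.sax

def find_first_error(source):
    """Parse RDF/XML with a non-validating SAX parser and return
    (line, column, message) for the first syntax error, or None."""
    handler = xml.sax.ContentHandler()  # no-op handler; we only want errors
    try:
        xml.sax.parse(source, handler)
    except xml.sax.SAXParseException as e:
        return (e.getLineNumber(), e.getColumnNumber(), e.getMessage())
    return None

# A tiny malformed stand-in for rdf.rdf: the mismatched tag is on line 2.
bad = io.BytesIO(b"<root>\n<a></b>\n</root>\n")
print(find_first_error(bad))
```

Passing the real path (`find_first_error("/tmp/rdf.rdf")`) works the same way, since `xml.sax.parse` accepts a filename or a file object.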
I have a constantly growing CSV file that looks like:
143100, 2012-05-21 09:52:54.165852
125820, 2012-05-21 09:53:54.666780
109260, 2012-05-21 09:54:55.144712
116340, 2012-05-21 09:55:55.642197
125640, 2012-05-21 09:56:56.094999
122820, 2012-05-21 09:57:56.546567
124770, 2012-05-21 09:58:57.046050
103830, 2012-05-21 09:59:57.497299
114120, 2012-05-21 10:00:58.000978
-31549410, 2012-05-21 10:01:58.063470
90390, 2012-05-21 10:02:58.108794
81690, 2012-05-21 10:03:58.161329
80940, 2012-05-21 10:04:58.227664
102180, 2012-05-21 10:05:58.289882
99750, 2012-05-21 10:06:58.322063
87000, 2012-05-21 10:07:58.391256
92160, 2012-05-21 10:08:58.442438
80130, 2012-05-21 10:09:58.506494
The negative numbers show up when the service that generates the file has an API connection failure. I've been using matplotlib to graph the data, but the artificial negative numbers squash the graph badly. I'd like to find all the negative entries and remove the corresponding rows; in no case do the negative numbers represent real data.
In Bash I would do something like:
awk '{print $1}' original.csv | sed '/-/d' > new.csv
But that's clumsy and tends to be slow, and I'd really rather not embed Bash commands in my Python graphing script if I can help it.
Can anyone point me in the right direction?
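A minimal pure-Python sketch of the filtering step; the inline `raw` sample stands in for the real file, and only rows whose first column is non-negative are kept:

```python
import csv
import io

# Stand-in for the growing CSV; the middle row is an API-failure artifact.
raw = """143100, 2012-05-21 09:52:54.165852
-31549410, 2012-05-21 10:01:58.063470
90390, 2012-05-21 10:02:58.108794
"""

kept = []
for row in csv.reader(io.StringIO(raw)):
    if not row:
        continue
    try:
        value = int(row[0])
    except ValueError:
        continue  # skip unparseable lines instead of crashing
    if value >= 0:
        kept.append(row)  # only real (non-negative) readings survive

print(len(kept))  # → 2
```

Replacing `io.StringIO(raw)` with `open("original.csv")` gives the same effect as the awk/sed pipeline, and the surviving `kept` rows can feed straight into the plotting code below.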
EDIT:

Here's the code I'm using to read/plot the data:
import matplotlib
matplotlib.use('Agg')
from matplotlib.mlab import csv2rec
import matplotlib.pyplot as …

I have a small program that uses NLTK to get the frequency distribution of a fairly large dataset. The problem is that after a few million words, I start eating up all the RAM on my system. Here are the lines of code I believe are relevant:
freq_distribution = nltk.FreqDist(filtered_words) # get the frequency distribution of all the words
top_words = freq_distribution.keys()[:10] # get the top used words
bottom_words = freq_distribution.keys()[-10:] # get the least used words
There has to be a way to write the key/value store to disk, I'm just not sure how. I'm trying to stay away from a document store like MongoDB and keep things purely Pythonic. If anyone has some suggestions I'd appreciate it.
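One stdlib-only sketch (an illustrative substitute, not NLTK's API) keeps the counts in a shelve database on disk instead of an in-memory FreqDist; the short word list here stands in for `filtered_words`:

```python
import os
import shelve
import tempfile

# Stand-in for the question's filtered_words.
filtered_words = ["the", "cat", "sat", "on", "the", "mat", "the"]

# The counts live in a dbm file on disk, not in RAM.
path = os.path.join(tempfile.mkdtemp(), "freqs")
with shelve.open(path) as db:
    for word in filtered_words:
        db[word] = db.get(word, 0) + 1
    # Analogous to freq_distribution.keys()[:10] / [-10:]
    ranked = sorted(db.items(), key=lambda kv: kv[1], reverse=True)
    top_words = [w for w, _ in ranked[:10]]
    bottom_words = [w for w, _ in ranked[-10:]]

print(top_words[0])  # → the
```

Ranking still iterates over every key, but only the keys being compared are in memory at once; `heapq.nlargest(10, db.items(), key=lambda kv: kv[1])` would avoid building the full sorted list.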
I have a time format:
Sat Jan 17 04:33:06 +0000 2015
I can't match it to a strptime format. The closest I could find in the basic datetime docs is %c, "Locale's appropriate date and time representation," but that doesn't quite match.
I'm going with:
time_vec = [datetime.strptime(str(x),'%c') for x in data['time']]
Any help would be appreciated.
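For reference, this Twitter-style timestamp matches the directive string sketched below on Python 3; note that the %z offset directive is only supported by strptime in Python 3, so on Python 2 this raises ValueError:

```python
from datetime import datetime

# "%a %b %d %H:%M:%S %z %Y" matches "Sat Jan 17 04:33:06 +0000 2015":
# weekday name, month name, day, time, UTC offset, year.
stamp = "Sat Jan 17 04:33:06 +0000 2015"
parsed = datetime.strptime(stamp, "%a %b %d %H:%M:%S %z %Y")
print(parsed.isoformat())  # → 2015-01-17T04:33:06+00:00
```

The same format string drops into the list comprehension above in place of '%c'.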