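I have a text document containing 32 articles, and I want to find the date of each article. I have noticed that the date appears on the 5th line of each article. So far, I have split the text into the 32 articles: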
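import re

sections = []
current = []
with open("Aberdeen2005.txt") as f:
    for line in f:
        # Each article starts with a header such as "1 of 32 DOCUMENTS"
        if re.search(r"(?i)\d+ of \d+ DOCUMENTS", line):
            if current:  # skip an empty chunk before the first header
                sections.append("".join(current))
            current = [line]
        else:
            current.append(line)
# The last article is only complete once the file ends
if current:
    sections.append("".join(current))
print(len(sections))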
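I want to create a list containing the date of each article, MONTH and YEAR only.
The dates follow the format shown in the screenshot above, but sometimes the day of the week (e.g. Thursday) is not included.
Any ideas?
Kind regards,
Andres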
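One possible approach is a regular expression applied to the fifth line of each article. The sketch below assumes the dates look like "December 1, 2005 Thursday" (month name, day, comma, year, optional weekday); the screenshot is not available here, so that format, the helper name `month_year`, and the sample text are all assumptions for illustration.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
# Matches e.g. "December 1, 2005", whether or not a weekday follows.
DATE_RE = re.compile(r"\b(%s)\s+\d{1,2},\s*(\d{4})\b" % MONTHS)


def month_year(section, line_number=5):
    """Return 'Month Year' from the given 1-based line of an article, or None."""
    lines = section.splitlines()
    if len(lines) < line_number:
        return None
    match = DATE_RE.search(lines[line_number - 1])
    if match is None:
        return None
    return "%s %s" % (match.group(1), match.group(2))


# Hypothetical article text; only its fifth line matters here.
example = "1 of 32 DOCUMENTS\n\nSome Newspaper\n\nDecember 1, 2005 Thursday\n\nBody text"
print(month_year(example))
```

With the `sections` list from the question, `[month_year(s) for s in sections]` would give one entry per article, with `None` wherever the fifth line does not match the assumed format.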
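I am trying to compute the probability that a document belongs to each of the topics found by an LDA model. I have managed to build the LDA model, but now I am stuck. My code is as follows: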
## Libraries to download
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
## Tokenizing
tokenizer = RegexpTokenizer(r'\w+')
# create English stop words list
en_stop = stopwords.words('english')
# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()
import json
import nltk
import re
import pandas
appended_data = []
for i in range(2005,2016):
    if i > 2013:
        df0 = pandas.DataFrame([json.loads(l) for l in open('SDM_%d.json' % i)])
        appended_data.append(df0)
    df1 …
I have the following data:
  Newspaper    Month Year          Date      Topic1      Topic2     Topic3      Topic4      Topic5
1  Scotsman December 2005 December 2005 0.013749700 0.000127470 0.38575261 0.000127470 0.070778523
2  Scotsman December 2005 December 2005 0.000165017 0.000165017 0.05219433 0.004611941 0.000165017
3  Scotsman December 2005 December 2005 0.000356507 0.024344932 0.01135670 0.000356507 0.000356507
4  Scotsman December 2005 December 2005 0.000185186 0.000185186 0.10796924 0.044639345 0.106613401
5  Scotsman December 2005 December 2005 0.065869506 0.009775978 0.09610254 0.017584819 0.000103681
6  Scotsman December 2005 December 2005 0.000153257 0.000153257 0.11443001 0.000153257 0.046316677
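I would like to create a separate variable that holds, for each row, the number N of the TopicN column with the highest proportion.
For the first article (row), that would be 3. Any ideas?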
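A minimal sketch of one way to do this, assuming the table is loaded into a pandas DataFrame: `idxmax(axis=1)` returns the label of the column holding the largest value in each row. The column name `MainTopic` is made up for illustration, and only the first three rows of the table are reproduced.

```python
import pandas as pd

# First three rows of the table above (topic columns only, plus the newspaper).
df = pd.DataFrame({
    "Newspaper": ["Scotsman"] * 3,
    "Topic1": [0.013749700, 0.000165017, 0.000356507],
    "Topic2": [0.000127470, 0.000165017, 0.024344932],
    "Topic3": [0.38575261, 0.05219433, 0.01135670],
    "Topic4": [0.000127470, 0.004611941, 0.000356507],
    "Topic5": [0.070778523, 0.000165017, 0.000356507],
})

topic_cols = [c for c in df.columns if c.startswith("Topic")]

# Label of the largest topic column per row, e.g. "Topic3", reduced to the number 3.
df["MainTopic"] = (
    df[topic_cols]
    .idxmax(axis=1)
    .str.replace("Topic", "", regex=False)
    .astype(int)
)
print(df[["Newspaper", "MainTopic"]])
```

If two topics tie within a row, `idxmax` keeps the first one it encounters, which may or may not be the desired tie-breaking rule.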
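I want to plot the similarity between different words in a simple vector-space graph. I have computed the similarities with the word2vec model provided by gensim, but I cannot find any graphical examples in the literature. My code is as follows: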
## Libraries to download
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
import json
import nltk
import re
import pandas
appended_data = []
#for i in range(2014,2016):
# df0 = pandas.DataFrame([json.loads(l) for l in open('SDM_%d.json' % i)])
# appended_data.append(df0)
for i in range(2005,2016):
    if i > 2013:
        df0 = pandas.DataFrame([json.loads(l) for l in open('SDM_%d.json' % i)])
        appended_data.append(df0)
    df1 = pandas.DataFrame([json.loads(l) for l in open('Scot_%d.json' …
I have 10 JSON files named Herald500_2005, Herald500_2006, ..., Herald500_2015. I am trying to run the same keyword search on each file. Rather than doing it file by file, I would like to do it in a loop. So far I have tried the following code:
for i in range(5,15):
    df = pandas.DataFrame([json.loads(l) for l in open('Herald500_200i.json')])  # literal file name: i is never substituted
    # Parse dates and set index
    df.date = pandas.to_datetime(df.date)
    df.set_index('date', inplace=True)
    # match keywords
    matchingbodies = df[df.body.str.contains("|".join(keywords3))&df.body.str.contains("|".join(keywords2))&df.body.str.contains("|".join(keywords1))].body
    # Count by month
    counts = matchingbodies.groupby(lambda x: x.month).agg(len)
    print("TH 200i")
    print(counts)
Running this code, I get the following error:
<ipython-input-9-76f2d2649df0> in <module>()
1 for i in range(5,15):
----> 2 df = pandas.DataFrame([json.loads(l) for l in open('Herald500_200i.json')])
3 # Parse dates and set index
4 df.date = pandas.to_datetime(df.date)
5 df.set_index('date', …
I am trying to read a JSON file in Python and, believe it or not, I don't know where to start. Here is what my JSON file, named AEE_2007.json, looks like:
{"date": "October 30, 2007 Tuesday", "body": "For those of us who have been around Aberdeen for a while, your question \"What now for the oil industry? (Evening Express, October 26) had a touch of deja vu about it. That same question has been asked almost since the day the first drop of oil was pumped out of the North Sea. In the past 30 years we have seen a constant cycle of ups and downs, booms and busts in …
I am using the topic-visualization library LDAvis:
## visualization of the topics
import pyLDAvis
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)
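It produces a principal-component plot of the topics uncovered by the LDA (latent Dirichlet allocation) model. I would like to download the image, but I am stuck. Any help is much appreciated!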
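One option, assuming a standalone copy of the interactive plot is acceptable in place of a static bitmap, is to write the prepared visualization to an HTML file with `pyLDAvis.save_html`. The helper name `save_lda_visualization` and the default file name below are made up for illustration. In more recent pyLDAvis releases the gensim helper module is, as far as I know, imported as `pyLDAvis.gensim_models` instead of `pyLDAvis.gensim`.

```python
def save_lda_visualization(prepared, path="lda_topics.html"):
    """Write a prepared pyLDAvis visualization to a standalone HTML file."""
    # Imported inside the function so that defining the helper
    # does not itself require pyLDAvis to be installed.
    import pyLDAvis

    pyLDAvis.save_html(prepared, path)
    return path


# Intended use with the objects from the question (not executed here):
#   vis = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)
#   save_lda_visualization(vis, "lda_topics.html")
```

The resulting file can be opened in a browser, from which a screenshot or printed PDF gives a static image if one is needed.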
I have the following data frame:
d =
id group value
1 A 1
2 A 2
3 A 10
4 B 100
5 B 200
6 B 1000
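I want to replace values above the 99th percentile (the 0.99 quantile) with NA, where the percentile is computed within the group each value belongs to. In this example, those would be observations (id) 3 and 6. So far I have this code, which does what I want but does not take the groups into account: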
d[d$value > quantile(d$value, 0.99), 'value'] <- NA
Any help?
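The question is written in R; for comparison, here is a sketch of the same per-group logic in pandas, assuming the data frame is available as a DataFrame with the columns shown above. The idea is to compute the 0.99 quantile within each group, broadcast it back to the rows of that group, and blank out anything above it.

```python
import pandas as pd

d = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "group": ["A", "A", "A", "B", "B", "B"],
    "value": [1.0, 2.0, 10.0, 100.0, 200.0, 1000.0],
})

# 0.99 quantile of each group, repeated for every row of that group.
threshold = d.groupby("group")["value"].transform(lambda s: s.quantile(0.99))

# Keep values at or below their group's threshold; the rest become NaN.
d["value"] = d["value"].where(d["value"] <= threshold)
print(d)
```

With pandas' default linear interpolation, the thresholds for this sample are 9.84 for group A and 984 for group B, so only ids 3 and 6 are replaced.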