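I have successfully used Selenium and PhantomJS to keep scrolling a dynamically loaded infinite-scroll page, as in the example below. But how can I modify it so that, instead of manually setting a number of reloads, the program stops when it reaches the bottom?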
import time
from selenium import webdriver
reloads = 100000 #set the number of times to reload
pause = 0 #initial time interval between reloads
driver = webdriver.PhantomJS()
# Load Twitter page and click to view all results
driver.get(url)
driver.find_element_by_link_text("All").click()
# Keep reloading and pausing to reach the bottom
for _ in range(reloads):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(pause)
text_file.write(driver.page_source.encode("utf-8"))
text_file.close()
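One common way to stop at the bottom is to compare document.body.scrollHeight before and after each scroll and quit once it stops growing. The following is a minimal sketch under that assumption, not a definitive implementation: the function name scroll_to_bottom and the max_rounds safety cap are hypothetical, and it relies only on the driver's execute_script method used above.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=1000):
    """Scroll until document.body.scrollHeight stops growing.

    Returns the number of scrolls performed. `max_rounds` is a
    hypothetical safety cap so the loop cannot run forever.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # nothing new was loaded: assume the bottom is reached
        last_height = new_height
    return max_rounds
```

A pause of 0, as in the question, will probably make the check fire too early, because the page has no time to load new content before the heights are compared.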
I have an R data frame (df) that looks like this:
blogger; word; n; total
joe; dorothy; 17; 718
paul; sheriff; 10; 354
joe; gray; 9; 718
joe; toto; 9; 718
mick; robin; 9; 607
paul; robin; 9; 354
...
I want to use ggplot2 to plot n divided by total for each blogger.
I have this code:
ggplot(df, aes(n/total, fill = blogger)) +
  geom_histogram(show.legend = FALSE) +
  xlim(NA, 0.0004) +
  facet_wrap(~blogger, ncol = 2, scales = "free_y")
But it produces this warning:
Warning message:
“Removed 1474 rows containing non-finite values (stat_bin).”
Warning message in rep(no, length.out = length(ans)):
“'x' …
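Here is my code. Because of the content of the raw data being parsed, I end up with a "user list" and a "tweet list" of different lengths. When I write the lists as columns of a data frame, I get ValueError: arrays must all be same length. I understand why this happens, but I have been looking for a way around it that puts 0 or NaN in the right places of the shorter array. Any ideas?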
import pandas
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('#raw.html'))
chunk = soup.find_all('div', class_='content')
userlist = []
tweetlist = []
for tweet in chunk:
    username = tweet.find_all(class_='username js-action-profile-name')
    for user in username:
        user2 = user.get_text()
        userlist.append(user2)
for text in chunk:
    tweets = text.find_all(class_='js-tweet-text tweet-text')
    for tweet in tweets:
        tweet2 = tweet.get_text().encode('utf-8')
        tweetlist.append('|'+tweet2)
print len(tweetlist)
print len(userlist)
#MAKE A DATAFRAME WITH THIS
data = {'tweet' : tweetlist, 'user' : …
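One possible way to avoid the length mismatch is to pad the shorter list before building the dictionary. The sketch below uses itertools.zip_longest from the standard library, with made-up sample values standing in for the scraped lists. Note the assumption: it pads only at the end of the shorter list, so it cannot put the gap next to the specific tweet whose user was missing. For that, user and text would probably have to be collected together, one pair per element of chunk.

```python
from itertools import zip_longest

# Made-up sample values standing in for the scraped lists
userlist = ["@alice", "@bob", "@carol"]
tweetlist = ["|first tweet", "|second tweet"]

# zip_longest pads the shorter list with fillvalue, at the END only
rows = list(zip_longest(userlist, tweetlist, fillvalue=None))

# Both columns now have the same length
data = {
    "user": [user for user, _ in rows],
    "tweet": [tweet for _, tweet in rows],
}
print(data)
```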
At the end of each iteration of a for loop, I want to write a new row of content (including a newline) to a csv file. I have this:
# Set up an output csv file with column headers
with open('outfile.csv','w') as f:
    f.write("title; post")
    f.write("\n")
This does not seem to write an actual \n (newline) to the file. Further down:
# Concatenate into a row to write to the output csv file
csv_line = topic_title + ";" + thread_post
with open('outfile.csv','w') as outfile:
    outfile.write(csv_line + "\n")
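This does not move the cursor in the outfile to the next line either. Every iteration of the loop overwrites the latest line.
I also tried outfile.write(os.linesep), but that did not work.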
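A likely cause is the 'w' mode: each open('outfile.csv','w') truncates the file, so reopening it inside the loop discards everything written before. A minimal sketch of one alternative is to open the file once and write every row inside that single with block (reopening with mode 'a' on each iteration would also append). The rows list here is made-up sample data, and the csv module is used to handle the ";" delimiter.

```python
import csv

# Made-up sample data standing in for the scraped titles and posts
rows = [("first title", "first post"), ("second title", "second post")]

# Opening once in "w" mode truncates the file a single time;
# every row written inside this block is then kept.
with open("outfile.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile, delimiter=";")
    writer.writerow(["title", "post"])  # column headers
    for topic_title, thread_post in rows:
        writer.writerow([topic_title, thread_post])  # one line per iteration
```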
Why doesn't this code plot the x-axis sorted by "value"?
import pandas as pd
import matplotlib.pyplot as plt
# creating dataframe
df=pd.DataFrame()
df['name'] = [1,2,3]
df['value'] = [4,3,5]
# sorting dataframe
df.sort_values('value', ascending = False, inplace= True)
# plot
plt.scatter(df['value'],df['name'])
plt.show()
Here are two dictionaries:
monkeydict = {'16:43': 1, '16:44': 1, '16:49': 3}
pigdict = {'16:41': 3, '16:44': 2, '16:51': 3}
This is the desired data frame:
time,monkeydict,pigdict
16:41,,3
16:43,1,
16:44,1,2
16:49,3,
16:51,,3
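One way to get this layout, sketched here with pandas, is to wrap each dictionary in a Series and let the DataFrame constructor align them on the union of their keys. Missing entries become NaN, so the value columns end up as floats. Naming the column "time" via rename_axis is a choice made for this sketch to match the desired header.

```python
import pandas as pd

monkeydict = {'16:43': 1, '16:44': 1, '16:49': 3}
pigdict = {'16:41': 3, '16:44': 2, '16:51': 3}

# Each Series is indexed by its dictionary keys; the constructor aligns
# them on the union of the keys and fills the gaps with NaN.
df = pd.DataFrame({
    "monkeydict": pd.Series(monkeydict),
    "pigdict": pd.Series(pigdict),
})

# Sort by time and turn the index into a regular "time" column
df = df.sort_index().rename_axis("time").reset_index()
print(df.to_csv(index=False))
```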
I have the code below, which searches Twitter and scrolls through the infinite scroll. But the "print data" line doesn't work for me. Any ideas?
# Import Selenium stuff
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import NoAlertPresentException
# Import other needed packages
import sys
import unittest, time, re
# Call up Firefox, do the Twitter search, click the "All" link and start paging
class Sel(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Firefox()
        self.driver.implicitly_wait(30)
        self.base_url = "https://twitter.com" …
I have this:
date = chunk.find_all('a', title=True, class_='tweet-timestamp js-permalink js-nav js-tooltip')
Which returns:
<a class="tweet-timestamp js-permalink js-nav js-tooltip" href="/15colleen/status/537395294133313536" title="3:59 PM - 25 Nov 2014"><span class="_timestamp js-short-timestamp " data-aria-label-part="last" data-long-form="true" data-time="1416959997" data-time-ms="1416959997000">Nov 25</span></a>
Obviously get_text() returns Nov 25, but I want to extract the fragment 3:59 PM - 25 Nov 2014.
Obviously, if we do this, the counter stays at 0, because it is reset at the start of every iteration:
for thing in stuff:
    count = 0
    print count
    count =+1
    write_f.write(thing)
But because I have this code inside a function, this doesn't work either:
count=0
for thing in stuff:
    print count
    count =+1
    write_f.write(thing)
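I have several different indentation levels, and no matter where I move count=0, it either has no effect or throws UnboundLocalError: local variable 'count' referenced before assignment. Is there a way to produce a simple iteration counter inside a for loop?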
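Two things may be going on here. First, count =+1 parses as count = +1, which assigns 1 on every pass, while the increment operator is count += 1. Second, the built-in enumerate supplies a counter without any manual bookkeeping. Below is a minimal sketch in Python 3: the function name write_all, its write parameter, and the sample list are hypothetical stand-ins for the original function and write_f.write.

```python
def write_all(stuff, write):
    """Write every item and return how many items were written."""
    count = 0  # defined inside the function, so it is a proper local variable
    for count, thing in enumerate(stuff, start=1):
        write(thing)
    return count  # stays 0 when stuff is empty


# Usage with made-up data; out.append stands in for write_f.write
out = []
total = write_all(["a\n", "b\n", "c\n"], out.append)
print(total)
```

Initialising count inside the function, before the loop, is also what would avoid the UnboundLocalError if the manual count += 1 form is preferred.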
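I am running a while True: loop in a web scraping script. I want the scraper to run in an incrementing loop until it hits a certain error. The general question is how to break out of a while True loop when some condition is met. As it stands, the code just keeps printing the first run forever: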
output 1;1
...
output 1;n
Here is a minimal reproducible example of my code.
runs = [1,2,3]
for r in runs:
    go = 0
    while True:
        go +=1
        output = ("output " + str(r) + ";" +str(go))
        try:
            print(output)
        except go > 3:
            break
The desired output is:
output 1;1
output 1;2
output 1;3
output 2;1
output 2;2
output 2;3
output 3;1
output 3;2
output 3;3
[done]
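An except clause expects an exception class, not a boolean expression, and it is only consulted when the try body actually raises. Since print(output) never raises here, go > 3 is never checked and the loop never ends. A minimal sketch of the condition-based version uses a plain if with break. The lines list is added in this sketch only to collect the output.

```python
lines = []  # collected here so the result can be inspected afterwards
runs = [1, 2, 3]
for r in runs:
    go = 0
    while True:
        go += 1
        if go > 3:
            break  # leaves only the inner while loop; the for loop continues
        lines.append("output " + str(r) + ";" + str(go))

print("\n".join(lines))
```

This yields nine lines, three per run. To stop on a real scraping error instead, the same break can sit in an except block that names the specific exception the scraper raises.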