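I am using Selenium Webdriver (in Python) to automatically download thousands of files. I want to set Chrome's download folder programmatically. After reading this, I tried the following: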
chromepath = '/Users/thiagomarzagao/Desktop/searchcode/chromedriver'
desired_caps = {'prefs': {'download': {'default_directory': '/Users/thiagomarzagao/Desktop/downloaded_files/'}}}
driver = webdriver.Chrome(executable_path = chromepath, desired_capabilities = desired_caps)
No good. Downloads still go to the default download folder ("/Users/thiagomarzagao/Downloads").
Any ideas?
(Python 2.7.5, Selenium 2.2.0, Chromedriver 2.1.210398, Mac OS X 10.6.8)
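For reference, a minimal sketch of one commonly used approach: passing the download directory as a Chrome preference through ChromeOptions rather than through desired capabilities. The helper names build_download_prefs and make_driver are made up for illustration, and the options= keyword assumes a reasonably recent Selenium release (older releases used a different keyword), so treat this as a starting point rather than a confirmed fix for the versions listed above.

```python
import os


def build_download_prefs(download_dir):
    """Return the Chrome 'prefs' dict that points downloads at download_dir."""
    return {
        "download.default_directory": os.path.abspath(download_dir),
        "download.prompt_for_download": False,
    }


def make_driver(download_dir):
    """Start Chrome with the download preferences applied (hypothetical helper)."""
    # Imported lazily so build_download_prefs can be used without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_experimental_option("prefs", build_download_prefs(download_dir))
    # Assumption: a Selenium release that accepts the `options` keyword.
    return webdriver.Chrome(options=options)


prefs = build_download_prefs("downloaded_files")
print(prefs["download.default_directory"])
```

Calling make_driver('/Users/thiagomarzagao/Desktop/downloaded_files/') would then replace the webdriver.Chrome(...) call in the attempt above.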
I am building a RESTful API on top of Apache Spark. Passing the following Python script to spark-submit seems to work fine:
import cherrypy
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('myApp').getOrCreate()
sc = spark.sparkContext
class doStuff(object):
    @cherrypy.expose
    def compute(self, user_input):
        # do something spark-y with the user input
        return user_output
cherrypy.quickstart(doStuff())
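But while googling I came across things like Livy and spark-jobserver. I have read those projects' documentation and a few tutorials, but I still don't fully understand what advantages Livy or spark-jobserver offer over a simple script built on CherryPy, Flask, or any other web framework. Is it about scalability? Context management? What am I missing here? If all I want is a simple RESTful API with few users, is Livy or spark-jobserver worth it? If so, why?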
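For comparison with the CherryPy script, here is a rough sketch of what talking to Livy looks like from the client side: the Spark context lives inside a Livy session, and the client only sends HTTP requests. The URL, port, and helper names are assumptions for illustration, and the requests are only built here, not sent.

```python
import json
import urllib.request

# Assumption: a Livy server reachable locally on its default port.
LIVY_URL = "http://localhost:8998"


def new_session_request(base_url=LIVY_URL):
    """Build (but do not send) the POST that asks Livy for a PySpark session."""
    return urllib.request.Request(
        base_url + "/sessions",
        data=json.dumps({"kind": "pyspark"}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def statement_request(session_id, code, base_url=LIVY_URL):
    """Build the POST that runs a snippet of code inside an existing session."""
    return urllib.request.Request(
        "%s/sessions/%d/statements" % (base_url, session_id),
        data=json.dumps({"code": code}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = statement_request(0, "sc.parallelize(range(10)).sum()")
print(req.get_method(), req.full_url)
```

Each request could be sent with urllib.request.urlopen(req). Whether the extra moving part is worth it for an API with few users is exactly the open question above.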
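So, I have about 4,000 CSV files and I need to outer-join all of them. Each file has two columns (a string and a float) and between 10,000 and 1,000,000 rows, and I want to join on the first column (i.e., the string variable).
I tried numpy.lib.recfunctions.join_by, but that was painfully slow. I switched to pandas.merge, which is a lot faster, but it is still too slow given the number (and size) of the tables I have. It also seems really memory-intensive: the machine becomes unusable when the files being merged have hundreds of thousands of rows (I am mostly using a MacBook Pro, 2.4GHz, 4GB).
So now I am looking for alternatives. Are there other potential solutions I am missing? What other outer-join implementations exist for Python? Is there a paper or site somewhere that discusses and compares the time complexity of each implementation? Would it be more efficient to simply have Python call sqlite3 and let sqlite3 do the join? Is the string key the problem? If I could use a numeric key instead, should it be any faster?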
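On the question of alternative outer-join implementations, one option that needs no third-party library is a plain hash join: keep a dictionary keyed on the string and fill in one column per file in a single pass. The sketch below is illustrative only (outer_join is a made-up name, and in-memory text buffers stand in for the real CSV files); it still holds everything in memory, so it does not by itself address the 4GB limit.

```python
import csv
import io


def outer_join(sources):
    """Outer-join two-column CSV sources on their first column.

    sources: iterable of (column_name, file_like) pairs with rows "key,value".
    Returns (column_names, table), where table maps key -> {column_name: float}.
    Keys missing from a source simply have no entry for that column.
    """
    table = {}
    names = []
    for name, handle in sources:
        names.append(name)
        for key, value in csv.reader(handle):
            table.setdefault(key, {})[name] = float(value)
    return names, table


# Small stand-ins for two of the CSV files.
first = io.StringIO("x,1.0\ny,2.0\n")
second = io.StringIO("y,3.0\nz,4.0\n")
names, table = outer_join([("a", first), ("b", second)])
for key in sorted(table):
    print(key, [table[key].get(name) for name in names])
```

Each file is read exactly once, so the cost grows with the total number of rows rather than with repeated merges of an ever larger table.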
In case it helps give you a more concrete idea of what I am trying to achieve, here is my code using pandas.merge:
import os
import pandas as pd
def load_and_merge(file_names, path_to_files, columns):
    '''
    seq, str, dict -> pandas.DataFrame
    '''
    output = pd.DataFrame(columns = ['mykey']) # initialize output DataFrame
    for file in file_names:
        # load new data
        new_data = pd.read_csv(os.path.join(path_to_files, file),
                               usecols = [col for col in columns.keys()],
                               dtype = columns,
                               names = ['mykey', file.replace('.csv', '')],
                               header = None)
        # merge with previous data
        output = pd.merge(output, …
When I run from rpy2.robjects import r, I get an error:
>>> from rpy2.robjects import r
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/rpy2/robjects/__init__.py", line 27, in <module>
    from . import language
  File "/usr/local/lib/python3.6/dist-packages/rpy2/robjects/language.py", line 16, in <module>
    _str2lang = ri.baseenv['str2lang']
  File "/usr/local/lib/python3.6/dist-packages/rpy2/rinterface_lib/conversion.py", line 44, in _
    cdata = function(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/rpy2/rinterface_lib/_rinterface_capi.py", line 282, in _robj = function(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/rpy2/rinterface_lib/sexp.py", line 355, in __getitem__
    raise KeyError("'%s' not found" % key)
KeyError: "'str2lang' not found"
But when I downgrade to version 3.2.0, everything works fine.
Any ideas? …
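One thing that may be worth checking: as far as I know, str2lang was added to base R in version 3.6.0, so a KeyError for it would be consistent with a newer rpy2 running against an older R installation. A minimal sketch of that check, assuming an R executable is on the PATH (the helper names are hypothetical):

```python
import re
import subprocess


def parse_r_version(text):
    """Extract (major, minor, patch) from text like 'R version 3.4.4 (2018-03-15)'."""
    match = re.search(r"R version (\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no R version found in: %r" % text)
    return tuple(int(part) for part in match.groups())


def has_str2lang(version):
    # Assumption: str2lang first appeared in R 3.6.0.
    return version >= (3, 6, 0)


def installed_r_version():
    """Ask the R on the PATH for its version banner and parse it."""
    result = subprocess.run(
        ["R", "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return parse_r_version(result.stdout)


example = parse_r_version("R version 3.4.4 (2018-03-15)")
print(example, has_str2lang(example))
```

If has_str2lang(installed_r_version()) turns out to be False, upgrading R rather than downgrading rpy2 would be the other way to line the two up.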
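So, I have lists of words and I need to know how often each word appears in each list. Using ".count(word)" works, but it is too slow (each list has thousands of words and I have thousands of lists).
I have been trying to speed things up with numpy. I generated a unique numeric code for each word so that I could use numpy.bincount (since it only works with integers, not strings). But I get "ValueError: array is too big".
So now I am trying to tweak the "bins" argument of the numpy.histogram function to make it return the frequency counts I need (somehow numpy.histogram doesn't seem to have trouble with big arrays). But so far no luck. Has anyone out there happened to do this before? Is it even possible? Is there some simpler solution that I am not seeing?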
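As a possible simpler route than numeric codes, the standard library's collections.Counter tallies every word of a list in one pass, instead of rescanning the list once per word as ".count(word)" does. A minimal sketch with made-up sample data:

```python
from collections import Counter


def word_frequencies(word_lists):
    """Return one Counter per list, mapping each word to its count in that list."""
    return [Counter(words) for words in word_lists]


# Tiny stand-in for the thousands of real lists.
lists = [["a", "b", "a"], ["b", "c"]]
freqs = word_frequencies(lists)
print(freqs[0]["a"])  # count of "a" in the first list
print(freqs[1]["a"])  # words absent from a list count as zero
```

A Counter works directly on strings, so the word-to-integer encoding step is not needed.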
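I am using Selenium Webdriver (in Python) to automatically download thousands of files from a certain website (which cannot be scraped through conventional means such as urllib, httplib, etc.). My script works perfectly with Firefox, but I don't need to see the magic happening, so I am trying to use PhantomJS. It works almost all the way through, except when it tries to click a certain button to close a window. This is the command where the script gets stuck: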
browser.find_element_by_css_selector("img[alt=\"Close Window\"]").click()
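It just hangs there; nothing happens.
PhantomJS is faster than Firefox (since there are no visuals), so I thought the problem might be that the "Close Window" button was not becoming clickable soon enough. So I tried an explicit wait: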
element = WebDriverWait(browser, 30).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "img[alt=\"Close Window\"]")))
print "done with waiting"
browser.find_element_by_css_selector("img[alt=\"Close Window\"]").click()
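It doesn't work: the wait ends quickly (the "done with waiting" message appears after a second or so), but then the code hangs again. I also tried an implicit wait, but that didn't work either.
So, I am at a loss. The same script runs like a charm when I use Firefox, so why doesn't it work with PhantomJS?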
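One workaround that is sometimes tried when a native click hangs is to trigger the click through JavaScript instead. The sketch below only assumes a driver object exposing find_element and execute_script; js_click is a made-up helper name, and whether this avoids the hang in PhantomJS is not something the information above can confirm.

```python
def js_click(browser, css_selector):
    """Locate an element by CSS selector and click it via JavaScript."""
    # "css selector" is the locator strategy string behind By.CSS_SELECTOR.
    element = browser.find_element("css selector", css_selector)
    # Dispatch the click inside the page instead of through the native click command.
    browser.execute_script("arguments[0].click();", element)
    return element
```

With the script above, this would be called as js_click(browser, 'img[alt="Close Window"]') in place of the .click() call that hangs.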
I don't know if this helps, but here is the page source:
http://www.flickr.com/photos/88729961@N00/9512669916/sizes/l/in/photostream/
I don't know if this helps either, but when I interrupt execution with Ctrl-C, I get this:
Traceback (most recent call last):
  File "myscript.py", line 361, in <module>
    myfunction(some_argument, some_other_argument)
  File "myscript.py", line 277, in myfunction
    browser.find_element_by_css_selector("img[alt=\"Close Window\"]").click()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/selenium-2.33.0-py2.7.egg/selenium/webdriver/remote/webelement.py", line 54, in click
    self._execute(Command.CLICK_ELEMENT)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/selenium-2.33.0-py2.7.egg/selenium/webdriver/remote/webelement.py", line 228, in _execute
    return self._parent.execute(command, params)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/selenium-2.33.0-py2.7.egg/selenium/webdriver/remote/webdriver.py", line 163, in execute
    response = self.command_executor.execute(driver_command, params)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/selenium-2.33.0-py2.7.egg/selenium/webdriver/remote/remote_connection.py", line 349, in execute
    return self._request(url, method=command_info[0], data=data) …
I am trying to use scikit-learn's DBSCAN implementation to cluster a bunch of documents. First I create the TF-IDF matrix with scikit-learn's TfidfVectorizer (it is a 163405x13029 sparse matrix of type numpy.float64). Then I try to cluster specific subsets of that matrix. Things work fine when the subset is small (say, up to a few thousand rows). But with large subsets (tens of thousands of rows) I get ValueError: could not convert integer scalar.
Here is the full traceback (idxs is a list of indices):
ValueError Traceback (most recent call last)
<ipython-input-1-73ee366d8de5> in <module>()
193 # use descriptions to clusterize items
194 ncm_clusterizer = DBSCAN()
--> 195 ncm_clusterizer.fit_predict(tfidf[idxs])
196 idxs_clusters = list(zip(idxs, ncm_clusterizer.labels_))
197 for e in idxs_clusters:
/usr/local/lib/python3.4/site-packages/sklearn/cluster/dbscan_.py in fit_predict(self, X, y, sample_weight)
294 cluster labels
295 """
--> 296 self.fit(X, sample_weight=sample_weight)
297 return self.labels_
/usr/local/lib/python3.4/site-packages/sklearn/cluster/dbscan_.py in fit(self, X, y, sample_weight)
264 X = check_array(X, accept_sparse='csr')
265 clust = dbscan(X, sample_weight=sample_weight,
--> 266 …
I am using Selenium Webdriver (Python bindings), and my script works on a Mac (OS X 10.6.8) but not on a PC (Windows 7 Enterprise). This is the error I get:
C:\Python27>python myscript.py
Traceback (most recent call last):
File "myscript.py", line 303, in <module>
myfunction(arg1)
File "myscript.py", line 87, in myfunction
browser = webdriver.Firefox(firefox_profile = fp)
File "C:\Python27\lib\site-packages\selenium\webdriver\firefox\webdriver.py",
line 61, in __init__
self.binary, timeout),
File "C:\Python27\lib\site-packages\selenium\webdriver\firefox\extension_conne
ction.py", line 47, in __init__
self.binary.launch_browser(self.profile)
File "C:\Python27\lib\site-packages\selenium\webdriver\firefox\firefox_binary.
py", line 61, in launch_browser
self._wait_until_connectable()
File "C:\Python27\lib\site-packages\selenium\webdriver\firefox\firefox_binary.
py", line 105, in _wait_until_connectable
self.profile.path, self._get_firefox_output()))
selenium.common.exceptions.WebDriverException: Message: "Can't load the profile.
Profile Dir: c:\\users\\marzagao.1\\appdata\\local\\temp\\tmpnn0nhk Firefox out
put: "
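Here is the relevant part of my script (I am iterating over different download folders):
for download_folder …
I can open the Jupyter console without any problem, but when I create a new notebook it keeps connecting to and disconnecting from the kernel (the messages "Connecting to kernel" / "Connected" keep showing in the upper right corner). Here is what Chrome's console outputs (the same happens in Firefox):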
Untitled3.ipynb?kernel_name=python3:121 loaded custom.js
default.js:48Default extension for cell metadata editing loaded.
rawcell.js:82Raw Cell Format toolbar preset loaded.
slideshow.js:43Slideshow extension for metadata editing loaded.
menubar.js:240actions jupyter-notebook:find-and-replace does not exist, still binding it in case it will be defined later...
MenuBar.bind_events @ menubar.js:240
extension.js Failed to load resource: the server responded with a status of 404 (Not Found)
main.js:184Widgets are not available. Please install widgetsnbextension or ipywidgets 4.0
(anonymous) @ main.js:184
session.js:54Session: kernel_created (1b236a4b-902d-4b33-9118-63013be4f270)
kernel.js:456Starting WebSockets: …
So, I am trying to use SQS to pass a Python object between two EC2 instances. Here is my failed attempt:
import boto.sqs
from boto.sqs.message import Message
class UserInput(Message):
    def set_email(self, email):
        self.email = email
    def set_data(self, data):
        self.data = data
    def get_email(self):
        return self.email
    def get_data(self):
        return self.data
conn = boto.sqs.connect_to_region('us-west-2')
q = conn.create_queue('user_input_queue')
q.set_message_class(UserInput)
m = UserInput()
m.set_email('something@something.com')
m.set_data({'foo': 'bar'})
q.write(m)
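It returns the error message The request must contain the parameter MessageBody. Indeed, the tutorial tells us to call m.set_body('something') before writing the message to the queue. But here I am not passing a string; I want to pass an instance of my UserInput class. So, what should MessageBody be? I have read the documentation and it says: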
The constructor for the Message class must accept a keyword parameter “body” which represents the content or body of the message. The …
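Since the message body has to be a string, one possible way around this is to serialize the object's fields into the body rather than attaching them as attributes. The sketch below uses JSON and only the standard library; encode_user_input and decode_user_input are hypothetical helper names, and it assumes the data is JSON-serializable.

```python
import json


def encode_user_input(email, data):
    """Serialize the fields into a JSON string usable as a message body."""
    return json.dumps({"email": email, "data": data})


def decode_user_input(body):
    """Inverse of encode_user_input: return (email, data) from a message body."""
    payload = json.loads(body)
    return payload["email"], payload["data"]


body = encode_user_input("something@something.com", {"foo": "bar"})
# With boto, this string would be handed to a plain Message via m.set_body(body),
# and recovered on the other instance with decode_user_input(m.get_body()).
email, data = decode_user_input(body)
print(email, data)
```

For objects that are not JSON-friendly, a different serialization would be needed; the point is only that whatever goes into the body ends up as text.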