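I want to extract:
the src of the image tag and the text of the anchor tag inside the div with class data. I managed to extract the img src, but I cannot extract the text from the anchor tag.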
<a class="title" href="http://www.amazon.com/Nikon-COOLPIX-Digital-Camera-NIKKOR/dp/B0073HSK0K/ref=sr_1_1?s=electronics&ie=UTF8&qid=1343628292&sr=1-1&keywords=digital+camera">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a>
Here is the link to the entire HTML page.
Here is my code:
for div in soup.findAll('div', attrs={'class':'image'}):
    print "\n"
    for data in div.findNextSibling('div', attrs={'class':'data'}):
        for a in data.findAll('a', attrs={'class':'title'}):
            print a.text
    for img in div.findAll('img'):
        print img['src']
What I want to do is extract the image src (the link) and the title inside div class=data, for example:
<a class="title" href="http://www.amazon.com/Nikon-COOLPIX-Digital-Camera-NIKKOR/dp/B0073HSK0K/ref=sr_1_1?s=electronics&ie=UTF8&qid=1343628292&sr=1-1&keywords=digital+camera">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a> …

Using Python, I created the following data frame containing similarity values:
  cosinFcolor cosinEdge cosinTexture histoFcolor histoEdge histoTexture    jaccard
1       0.770     0.489        0.388  0.57500000 0.5845137    0.3920000 0.00000000
2       0.067     0.496        0.912  0.13865546 0.6147309    0.6984127 0.00000000
3       0.514     0.426        0.692  0.36440678 0.4787535    0.5198413 0.05882353
4       0.102     0.430        0.739  0.11297071 0.5288008    0.5436508 0.00000000
5       0.560     0.735        0.554  0.48148148 0.8168083    0.4603175 0.00000000
6       0.029     0.302        0.558  0.08547009 0.3928234    0.4603175 0.00000000
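I am trying to write an R script that generates another data frame reflecting the binned data, where my binning conditions apply only if the value is greater than 0.5.
Pseudocode:
if (cosinFcolor > 0.5 & cosinFcolor <= 0.6)
    bin = 1
if (cosinFcolor > 0.6 & cosinFcolor <= 0.7)
    bin = 2 …

I am parsing JSON data and trying to store some of it in a MySQL database. I am currently getting the following Unicode error. My question is how I should handle this.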
Here is my table structure:
CREATE TABLE yahoo_questions (
question_id varchar(40) NOT NULL,
question_subj varbinary(255),
question_content varbinary(255),
question_userId varchar(40) NOT NULL,
question_timestamp varchar(40),
category_id varbinary(20) NOT NULL,
category_name varchar(40) NOT NULL,
choosen_answer varbinary(255),
choosen_userId varchar(40),
choosen_usernick varchar(40),
choosen_ans_timestamp varchar(40),
UNIQUE (question_id)
);
Error when inserting through the Python code:
Traceback (most recent call last):
File "YahooQueryData.py", line 78, in <module>
+"VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)", (row[2], row[5], row[6], quserId, questionTime, categoryId, categoryName, qChosenAnswer, choosenUserId, choosenNickName, choosenTimeStamp))
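File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/MySQLdb/cursors.py", line 159, …

I have written many scrapers, but I am not sure how to handle infinite scrolling. Most websites these days, such as Facebook and Pinterest, use infinite scrolling.

I have the following method, in which I select all the IDs from a table, append them to a list, and return that list. But when this code runs, I end up with a "tuple indices must be integers..." error. I have attached the error and the printed output along with my method: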
def questionIds(con):
    print 'getting all the question ids'
    cur = con.cursor()
    qIds = []
    getQuestionId = "SELECT question_id from questions_new"
    try:
        cur.execute(getQuestionId)
        for row in cur.fetchall():
            print 'printing row'
            print row
            qIds.append(str(row['question_id']))
    except Exception, e:
        traceback.print_exc()
    return qIds
Output printed by my method:
Database version : 5.5.10
getting all the question ids
printing row
(u'20090225230048AAnhStI',)
Traceback (most recent call last):
File "YahooAnswerScraper.py", line 76, in questionIds
qIds.append(str(row['question_id'][0]))
TypeError: tuple indices must be integers, not str
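
One way to read this error: the printed row, `(u'20090225230048AAnhStI',)`, is a plain tuple, so it can only be indexed by position. Below is a minimal sketch of that idea. It uses the standard-library `sqlite3` module as a stand-in for the MySQL connection (an assumption, so the example runs without a server). The `question_ids` name and the in-memory table are illustrative and not taken from the original code.

```python
import sqlite3


def question_ids(con):
    """Return every question_id in questions_new as a list of strings."""
    cur = con.cursor()
    cur.execute("SELECT question_id FROM questions_new")
    # Each row is a tuple such as ('20090225230048AAnhStI',),
    # so the first column is row[0], not row['question_id'].
    return [str(row[0]) for row in cur.fetchall()]


# Hypothetical in-memory table standing in for the real database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE questions_new (question_id TEXT)")
con.execute("INSERT INTO questions_new VALUES ('20090225230048AAnhStI')")
ids = question_ids(con)
print(ids)
```

If lookup by column name is preferred with MySQLdb, passing `cursorclass=MySQLdb.cursors.DictCursor` to `MySQLdb.connect` is another option, since rows then come back as dictionaries.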

I have a dataset containing user and purchase data. Below is a sample, where the first element is the userId, the second is the productId, and the third is a boolean.
(2147481832,23355149,1)
(2147481832,973010692,1)
(2147481832,2134870842,1)
(2147481832,541023347,1)
(2147481832,1682206630,1)
(2147481832,1138211459,1)
(2147481832,852202566,1)
(2147481832,201375938,1)
(2147481832,486538879,1)
(2147481832,919187908,1)
...
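I want to make sure I take only 80% of each user's data to build one RDD, and take the remaining 20% to build another RDD. Let's call them train and test. I would like to avoid starting with groupBy, since it can cause memory problems because the dataset is large. What is the best way to do this?
I could do the following, but it does not give 80% of each user.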
// Tag each record with a random integer in [0, 100), keeping the record itself as the value
val percentData = data.map(x => ((math.random * 100).toInt, (x._1, x._2, x._3)))
val train = percentData.filter(x => x._1 < 80).values.repartition(10).cache()
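
My Python level is novice. I have never written a web scraper or crawler. I wrote Python code to connect to an API and extract the data I want. But for some of the extracted data, I want to get the author's gender. I found this website, http://bookblog.net/gender/genie.php, but the downside is that no API is available. I would like to know how to write Python code that submits data to the form on the page and extracts the returned data. Any guidance would be a great help.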
Here is the DOM of the form:
<form action="analysis.php" method="POST">
<textarea cols="75" rows="13" name="text"></textarea>
<div class="copyright">(NOTE: The genie works best on texts of more than 500 words.)</div>
<p>
<b>Genre:</b>
<input type="radio" value="fiction" name="genre">
fiction
<input type="radio" value="nonfiction" name="genre">
nonfiction
<input type="radio" value="blog" name="genre">
blog entry
</p>
<p>
</form>
DOM of the result page:
<p>
<b>The Gender Genie thinks the author of this passage is:</b>
male!
</p>
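
A rough sketch of one possible approach, using only the standard library and assuming Python 3. The field names `text` and `genre` and the `analysis.php` action come from the form above. The helper names, the regular expression, and the assumption that the result page always has the shape shown are illustrative guesses. The network call is left commented out.

```python
import re
from urllib.parse import urlencode, urljoin
from urllib.request import urlopen

FORM_PAGE = "http://bookblog.net/gender/genie.php"
# The form's action is relative, so resolve it against the form page.
ACTION_URL = urljoin(FORM_PAGE, "analysis.php")


def build_form_body(text, genre="fiction"):
    """Encode the form fields as a URL-encoded POST body."""
    return urlencode({"text": text, "genre": genre}).encode("utf-8")


def parse_gender(html):
    """Pull the verdict (e.g. 'male') out of the result page, or None."""
    match = re.search(r"author of this passage is:\s*</b>\s*(\w+)", html)
    return match.group(1) if match else None


def guess_gender(text, genre="fiction"):
    """Submit the form and parse the reply (requires network access)."""
    # Passing data= makes urlopen send a POST request.
    with urlopen(ACTION_URL, data=build_form_body(text, genre)) as resp:
        return parse_gender(resp.read().decode("utf-8", errors="replace"))


# The result-page fragment quoted above, used as offline sample input.
sample = """<p>
<b>The Gender Genie thinks the author of this passage is:</b>
male!
</p>"""
print(parse_gender(sample))
# print(guess_gender("some long passage ..."))  # uncomment to contact the site
```

A regular expression is fragile against markup changes. An HTML parser would be a sturdier choice if the page structure turns out to vary.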

How can I get the low and high values from an ion.rangeSlider component when a button is clicked?
Here is my jQuery code:
<script>
$(document).ready(function(){
$("#range_1").ionRangeSlider({
min: 10,
max: 50,
from: 10,
to: 20,
type: 'double',
step: 1,
prettify: true,
hasGrid: false
});
});
</script>
<script>
$(document).ready(function(){
$('#get_values').click(function(){
var low = $('#range_1').... ???;
var high = $('#range_1').... ???;
alert(low);
});
});
</script>
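
I have a scenario where I call an API and, based on its results, query the database for each record the API returns. The API call returns strings, and when I run the database query for the returned items, I get the following error for some elements.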
Traceback (most recent call last):
File "TopLevelCategories.py", line 267, in <module>
cursor.execute(categoryQuery, {'title': startCategory});
File "/opt/ts/python/2.7/lib/python2.7/site-packages/MySQLdb/cursors.py", line 158, in execute
query = query % db.literal(args)
File "/opt/ts/python/2.7/lib/python2.7/site-packages/MySQLdb/connections.py", line 265, in literal
return self.escape(o, self.encoders)
File "/opt/ts/python/2.7/lib/python2.7/site-packages/MySQLdb/connections.py", line 203, in unicode_literal
return db.literal(u.encode(unicode_literal.charset))
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2013' in position 3: ordinal not in range(256)
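
To illustrate what the last line of the traceback says: U+2013 (an en dash) has no latin-1 encoding, whereas UTF-8 can represent it. The sketch below demonstrates only the encoding behaviour, and the sample title is made up. On the MySQLdb side, a commonly suggested direction is to open the connection with `charset='utf8'` and `use_unicode=True`, assuming the table columns can store UTF-8.

```python
title = u"A \u2013 B"  # made-up sample; \u2013 is an en dash


def can_encode(text, codec):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


latin1_ok = can_encode(title, "latin-1")  # latin-1 covers only U+0000..U+00FF
utf8_ok = can_encode(title, "utf-8")      # UTF-8 covers all of Unicode
print(latin1_ok, utf8_ok)
```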
The code snippet referenced by the error above is:
...
for startCategory in value[0]:
    categoryResults = []
    try:
        categoryRow = ""
        baseCategoryTree[startCategory] = []
        #print categoryQuery % {'title': startCategory};
        cursor.execute(categoryQuery, {'title': …

I just started using jenv, and I followed a blog post explaining how to use jenv to set up multiple Java versions on Mac OS X. The problem I have now is setting JAVA_HOME. When I switch Java environments with jenv, I want to make sure that JAVA_HOME in my bash_profile changes accordingly.
How do I do that?
I have the following in my ~/.bash_profile:
if which jenv > /dev/null; then eval "$(jenv init -)"; fi