My web page looks like this:
<p>
<strong class="offender">YOB:</strong> 1987<br/>
<strong class="offender">RACE:</strong> WHITE<br/>
<strong class="offender">GENDER:</strong> FEMALE<br/>
<strong class="offender">HEIGHT:</strong> 5'05''<br/>
<strong class="offender">WEIGHT:</strong> 118<br/>
<strong class="offender">EYE COLOR:</strong> GREEN<br/>
<strong class="offender">HAIR COLOR:</strong> BROWN<br/>
</p>
I want to extract this information for each person, so that I end up with YOB:1987, RACE:WHITE, and so on.
What I tried:
subc = soup.find_all('p')
subc1 = subc[1]
subc2 = subc1.find_all('strong')
However, this only gives me the labels YOB:, RACE:, and so on.
Is there a way to get the data in the form YOB:1987, RACE:WHITE?
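The values sit in the text that follows each `<strong>` tag, not inside it, which is why reading only the `<strong>` tags returns just the labels. With BeautifulSoup, the usual way to get that text is each tag's `next_sibling`. The sketch below shows the same pairing idea using only the standard library's `html.parser`, so it runs without extra packages. The class name `LabelValueParser` and the shortened sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser


class LabelValueParser(HTMLParser):
    """Collect '<strong>LABEL:</strong> value' pairs into a dict (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.pairs = {}
        self._in_strong = False
        self._label = None

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self._in_strong = True

    def handle_endtag(self, tag):
        if tag == "strong":
            self._in_strong = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_strong:
            # Text inside <strong> is the label, e.g. "YOB:"
            self._label = text.rstrip(":")
        elif self._label is not None:
            # The first non-empty text after a label is taken as its value
            self.pairs[self._label] = text
            self._label = None


# Shortened copy of the markup from the question
html = """<p>
<strong class="offender">YOB:</strong> 1987<br/>
<strong class="offender">RACE:</strong> WHITE<br/>
<strong class="offender">HEIGHT:</strong> 5'05''<br/>
</p>"""

parser = LabelValueParser()
parser.feed(html)
parser.close()
for label, value in parser.pairs.items():
    print(f"{label}: {value}")
```

A label with no text after it is simply skipped, because the next label overwrites it before any value is stored.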
I am trying to write a simple script to verify checksums between HDFS and the local file system.
On HDFS I get:
[m@x01tbipapp3a ~]$ hadoop fs -checksum /user/m/file.txt
/user/m/file.txt MD5-of-0MD5-of-512CRC32C **000002000000000000000000755ca25bd89d1a2d64990a68dedb5514**
On the local file system I get:
[m@x01tbipapp3a ~]$ cksum file.txt
**3802590149 26276247** file.txt
[m@x01tbipapp3a ~]$ md5sum file.txt
**c1aae0db584d72402d5bcf5cbc29134c** file.txt
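Now, how do I compare them? I tried converting the HDFS checksum from hexadecimal to decimal to see whether it matches the cksum output, but it does not.
Is there a way to compare the two checksums using any algorithm?
Thanks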
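The label `MD5-of-0MD5-of-512CRC32C` in the HDFS output indicates a composite checksum: an MD5 over MD5s of per-chunk CRC32C values. That is a different quantity from what `cksum` or `md5sum` compute on the raw bytes, so the numbers are not expected to match. One possible workaround, sketched below, is to hash the file contents with the same algorithm on both sides by streaming the HDFS file through `hadoop fs -cat`. The function names are assumptions for illustration, and `hdfs_md5` assumes the `hadoop` CLI is on the PATH.

```python
import hashlib
import io
import subprocess


def md5_of_stream(stream, chunk_size=1024 * 1024):
    """Return the hex MD5 of a binary stream, reading it in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def local_md5(path):
    """MD5 of a local file; should equal the first column of `md5sum path`."""
    with open(path, "rb") as handle:
        return md5_of_stream(handle)


def hdfs_md5(hdfs_path):
    """Stream the file out of HDFS with `hadoop fs -cat` and hash the bytes."""
    proc = subprocess.Popen(
        ["hadoop", "fs", "-cat", hdfs_path], stdout=subprocess.PIPE
    )
    try:
        result = md5_of_stream(proc.stdout)
    finally:
        proc.stdout.close()
    if proc.wait() != 0:
        raise RuntimeError("hadoop fs -cat failed for " + hdfs_path)
    return result


def checksums_match(local_path, hdfs_path):
    """True when both copies hash to the same MD5."""
    return local_md5(local_path) == hdfs_md5(hdfs_path)


# Small in-memory demonstration of the hashing helper
print(md5_of_stream(io.BytesIO(b"hello")))
```

The trade-off is that the whole file is read back out of HDFS, which can be slow for large files, but the comparison then uses the same algorithm on the same bytes.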
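I want to extract headings and paragraph text from a web page. The problem is that the number of headings and paragraphs varies, and they all use the same heading tag and the same paragraph tag.
Sample HTML: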
<h6>PHYSICAL DESCRIPTION</h6>
<p>
<strong class="offender">YOB:</strong> 1987<br />
<strong class="offender">RACE:</strong> WHITE<br />
<strong class="offender">GENDER:</strong> FEMALE<br />
<strong class="offender">HEIGHT:</strong> 5'05''<br />
<strong class="offender">WEIGHT:</strong> 118<br />
<strong class="offender">EYE COLOR:</strong> GREEN<br />
<strong class="offender">HAIR COLOR:</strong> BROWN<br />
</p>
<h6>SCARS, MARKS, TATTOOS</h6>
<p>
</p>
The code I am using is below:
sub = soup.findAll('h6')
print sub.text
sub = soup.findAll('p')
for strong_tag in sub.find_all('strong'):
    print strong_tag.text, strong_tag.next_sibling
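Since the headings are not inside the p tags, I am not sure how to process the HTML to get this written out.
Is there a way to treat the HTML like a file: find the next h6 tag, then the next p tag, and keep doing that until the end?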
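Walking the document in order and remembering the most recent heading is one way to do this. In BeautifulSoup that could mean looping over the `h6` tags and calling `find_next_sibling('p')` on each. The sketch below does the same sequential walk with the standard library's `html.parser`, so it needs no extra packages. `SectionParser` is a hypothetical name, and the sample markup is a shortened copy of the HTML above.

```python
from html.parser import HTMLParser


class SectionParser(HTMLParser):
    """Group the text of each <p> under the <h6> heading that precedes it."""

    def __init__(self):
        super().__init__()
        self.sections = {}     # heading -> list of text lines
        self._current = None   # heading currently being filled
        self._tag = None       # "h6" or "p" while inside one, else None
        self._line = []        # text pieces of the line being built

    def _flush_line(self):
        text = " ".join(self._line).strip()
        if text and self._current is not None:
            self.sections[self._current].append(text)
        self._line = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h6", "p"):
            self._tag = tag
        elif tag == "br" and self._tag == "p":
            # A <br> ends one "LABEL: value" line inside the paragraph
            self._flush_line()

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush_line()
            self._tag = None
        elif tag == "h6":
            self._tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._tag == "h6":
            self._current = text
            self.sections.setdefault(text, [])
        elif self._tag == "p":
            self._line.append(text)


html = """<h6>PHYSICAL DESCRIPTION</h6>
<p>
<strong class="offender">YOB:</strong> 1987<br />
<strong class="offender">RACE:</strong> WHITE<br />
</p>
<h6>SCARS, MARKS, TATTOOS</h6>
<p>
</p>"""

parser = SectionParser()
parser.feed(html)
parser.close()
for heading, lines in parser.sections.items():
    print(heading, lines)
```

An empty paragraph, like the one under SCARS, MARKS, TATTOOS, yields a heading with an empty list, so every heading still appears in the result.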
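The mlr package is great, and the idea of a ModelMultiplexer is very helpful. However, the ModelMultiplexer "selects" one single model from the models used.
Is there any support, or any plan to support, creating a bagged or boosted ensemble of the individual models?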
bls = list(
makeLearner("classif.ksvm"),
makeLearner("classif.randomForest")
)
lrn = makeModelMultiplexer(bls)
ps = makeModelMultiplexerParamSet(lrn,
makeNumericParam("sigma", lower = -10, upper = 10, trafo = function(x) 2^x),
makeIntegerParam("ntree", lower = 1L, upper = 500L))
> print(res)
Tune result:
**Op. pars: selected.learner=classif.randomForest; classif.randomForest.ntree=197
mmce.test.mean=0.0333**
sqldf has a limit option to fetch 'X' rows. Can we also take an 'x%' sample with sqldf?
For example:
> sqldf("select * from iris limit 3")
Loading required package: tcltk
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
> sqldf("select * from iris sample 0.01")
Error in sqliteSendQuery(con, statement, bind.data) :
error in statement: near ".1": syntax error
Is there a workaround for this?
Manish
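sqldf runs its queries on SQLite by default, and SQLite has no `sample` clause, which explains the syntax error. A common workaround is to shuffle the rows with `ORDER BY RANDOM()` and cap them with `LIMIT`, where the limit is computed from the row count. The sketch below demonstrates that SQL through Python's built-in `sqlite3` module. The `demo` table and the `sample_fraction` helper are assumptions for illustration. In R, the equivalent would presumably be a query string such as `select * from iris order by random() limit 15` passed to `sqldf`.

```python
import sqlite3


def sample_fraction(conn, table, fraction):
    """Return roughly `fraction` of the rows of `table`, chosen at random."""
    # Table names cannot be bound as parameters, so `table` must be trusted input.
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    n = max(1, round(total * fraction)) if total else 0
    return conn.execute(
        f"SELECT * FROM {table} ORDER BY RANDOM() LIMIT ?", (n,)
    ).fetchall()


# Hypothetical 150-row table standing in for iris
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id INTEGER)")
conn.executemany("INSERT INTO demo VALUES (?)", [(i,) for i in range(150)])

rows = sample_fraction(conn, "demo", 0.10)
print(len(rows))
```

Sorting by `RANDOM()` touches every row, so this suits small and medium tables better than very large ones.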
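I have a long R script that takes about 2-3 hours to run and knit to HTML. However, knitting aborts on even a minor error or warning. In the example below, it aborted because of a savehistory error.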
processing file: model_v64.Rmd
|...................... | 33%
ordinary text without R code
|........................................... | 67%
label: unnamed-chunk-1 (with options)
List of 1
$ echo: logi TRUE
Quitting from lines 21-278 (model_v64.Rmd)
**Error in .External2(C_savehistory, file) : no history available to save**
Calls: <Anonymous> ... withCallingHandlers -> withVisible -> eval -> eval -> savehistory
Execution halted
Is there any way to
Part of my HTML looks like this:
<div id="qryNav">
<form method="post" action="OffQryRedirector.jsp" id="form1" name="form1">
<input type="hidden" name="NextPage" value="7" />
<input type="submit" name="Action" id="oq-nav-begin" value="<<" />
<input type="submit" name="Action" id="oq-nav-prv" value="<" />
<span class="oq-nav-btwn">Page 1 of 4</span>
<input type="submit" name="Action" id="oq-nav-nxt" value=">" />
<input type="submit" name="Action" id="oq-nav-end" value=">>" />
</form>
<a href="OffQryForm.jsp" class="qryNav"><span>Start a New Search</span></a>
<!--<a href="javascript:history.back()" class="qryNav"><span>Modify Your Search</span> </a>-->
</div>
I am trying to determine the number of pages and then move to the next page. My code looks like this:
html = driver.page_source
soup = BeautifulSoup(html)
pages = soup.find_all('span', {'class': 'oq-nav-btwn'})[0].text.encode('ascii', 'ignore').strip().upper()
loc_of = pages.find('OF')
num_pages = int(pages[loc_of+2:].strip())
>>> print …
I am using Python to move data from a MySQL DB to a Postgres DB. My code looks like this:
conn = psycopg2.connect("dbname='aaa' user='aaa' host='localhost' password='aaa' ")
curp = conn.cursor()
db = MySQLdb.connect(host="127.0.0.1", user="root", passwd="root" , unix_socket='/var/mysql/mysql.sock', port=3306 )
cur.execute('select * from aaa_pledge')
list = cur.fetchall()
for l in list:
    print l
    for i in range(len(l)):
        print i, l[i]
    qry = "insert into aaa_pledge values ('%s','%s','%s','%s',%s,%s,'%s',%s,%s,%s,%s,'%s')" %( l[0], l[1], l[2], l[3], l[4], l[5], l[6], str(l[7]) , l[8], l[9], l[10], l[11] )
    print qry
    res = curp.execute( qry)
    print res
curp.execute( "commit")
curp.close()
cur.execute( "commit")
cur.close()
With valid dates, the code …
I have some text data:
>>> print content
Date,Open,High,Low,Close,Volume,Adj Close
2015-03-17,4355.83,4384.98,4349.69,4375.63,1724370000,4375.63
2015-03-16,4338.29,4371.46,4327.27,4370.47,1713480000,4370.47
2015-03-13,4328.09,4347.87,4289.30,4314.90,1851410000,4314.90
2015-03-12,4302.73,4339.20,4300.87,4336.23,1855110000,4336.23
2015-03-11,4336.05,4342.87,4304.28,4305.38,1846020000,4305.38
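Now I want to convert it to a dict so that I can load it into the database with cursor.executemany, which accepts dicts as input.
Is there a module that can convert this to a dict automatically? I looked at NumPy's loadtxt, but that requires me to write a file first. Is there a way to do this without creating a file?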
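The standard library's `csv.DictReader` accepts any file-like object, so wrapping the string in `io.StringIO` avoids writing a file. A minimal sketch follows, using a shortened copy of the data above as `content`. The variable `rows` is a name chosen for illustration.

```python
import csv
import io

# Shortened copy of the text from the question
content = """Date,Open,High,Low,Close,Volume,Adj Close
2015-03-17,4355.83,4384.98,4349.69,4375.63,1724370000,4375.63
2015-03-16,4338.29,4371.46,4327.27,4370.47,1713480000,4370.47
"""

# StringIO makes the string behave like an open text file
reader = csv.DictReader(io.StringIO(content))
rows = [dict(row) for row in reader]  # one dict per data line, keyed by header

print(rows[0]["Date"], rows[0]["Adj Close"])
```

Every value comes back as a string, so numeric columns may need converting with `float` or `int` before loading, depending on what the database driver expects.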
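After following all the links on Stack Overflow, I installed and set up PostgreSQL on a compute instance. It is up and running with the following configuration: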
pg_hba.conf --
# TYPE DATABASE USER ADDRESS METHOD
local all all peer
host all all 127.0.0.1/32 ident
host all all 0.0.0.0/0 md5
-bash-4.2$ cat postgresql.conf | grep listen
listen_addresses = '*' # what IP address(es) to listen on;
After changing the listen address and pg_hba.conf, I restarted. The service is up and running:
[xxxxxxx_gmail_com@python-postgres ~]$ sudo systemctl status postgresql-9.4
postgresql-9.4.service - PostgreSQL 9.4 database server
Loaded: loaded (/usr/lib/systemd/system/postgresql-9.4.service; enabled)
Active: active (running) since Wed 2015-02-18 13:07:55 UTC; 12min ago
[xxxxxxx_gmail_com@python-postgres ~]$ netstat -a --numeric-ports | grep 5432
tcp 0 0 0.0.0.0:5432 …