Uma*_*air 5 python beautifulsoup python-2.7
I am scraping this link with BeautifulSoup4.
I parse the page HTML like this:
page = BeautifulSoup(page.replace('ISO-8859-1', 'utf-8'),"html5lib")
You can see values like this: -4 -115 (separated by -)
I want both values in a list, so I use this regex:
value = re.findall(r'[+-]?\d+', value)
It works perfectly, except for values like +2½ -102, where I only get [-102].
To fix this, I also tried:
value = value.replace("½","0.5")
value = re.findall(r'[+-]?\d+', value)
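But this gives me an error about encoding, saying I have to set the encoding of my file.
I also tried setting encoding=utf-8 at the top of the file, but it still gives the same error.
What I need to ask is how to convert ½ to 0.5.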
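To embed a Unicode literal like ½ in a Python 2 script, you need a special comment at the top of the script so the interpreter knows how the Unicode is encoded. If you want to use UTF-8, you also need to tell your editor to save the file as UTF-8. And if you want to print Unicode text, make sure your terminal is set to use UTF-8 as well.
Here is a short example, tested on Python 2.6.6: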
# -*- coding: utf-8 -*-
value = "a string with fractions like 2½ in it"
value = value.replace("½",".5")
print(value)
Output:
a string with fractions like 2.5 in it
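Note that I used ".5" as the replacement string; using "0.5" would convert "2½" to "20.5", which is incorrect.
Strictly speaking, those strings should be marked as Unicode strings, like this: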
# -*- coding: utf-8 -*-
value = u"a string with fractions like 2½ in it"
value = value.replace(u"½", u".5")
print(value)
For more information on using Unicode in Python, see Pragmatic Unicode, written by veteran Ned Batchelder.
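As a side note on the encoding issue, here is a minimal sketch that assumes you would rather not depend on how the file is saved. Writing ½ as the escape `u"\u00bd"` keeps the source pure ASCII, so no coding comment is needed. The name `replace_half` is only an illustrative helper, not part of any library.

```python
# Sketch: write the fraction as an escape so the source file stays pure ASCII.
# replace_half is a hypothetical helper name, not part of any library.
from __future__ import print_function

HALF = u"\u00bd"  # U+00BD VULGAR FRACTION ONE HALF


def replace_half(text, replacement=u".5"):
    """Return text with every U+00BD replaced by the given string."""
    return text.replace(HALF, replacement)


sample = u"+2" + HALF + u" -102"
print(replace_half(sample))          # +2.5 -102
print(replace_half(sample, u"0.5"))  # pitfall: +20.5 -102
```

The second call shows the same "0.5" pitfall as before: the digit in front of the fraction gets glued to the leading zero.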
I should also mention that you need to change your regex pattern so that it allows a decimal point. For example:
# -*- coding: utf-8 -*-
from __future__ import print_function
import re
pat = re.compile(r'[-+]?(?:\d*?[.])?\d+', re.U)
data = u"+2½ -105 -2½ -115 +2½ -105 -2½ -115 +2½ -102 -2½ -114"
print(data)
print(pat.findall(data.replace(u"½", u".5")))
Output:
+2½ -105 -2½ -115 +2½ -105 -2½ -115 +2½ -102 -2½ -114
[u'+2.5', u'-105', u'-2.5', u'-115', u'+2.5', u'-105', u'-2.5', u'-115', u'+2.5', u'-102', u'-2.5', u'-114']
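The matches are still strings at this point. As a rough sketch, assuming you want numeric values in the end, you could pass each match to `float`. The function name `parse_values` is made up for illustration, and ½ is written as the escape `\u00bd` so the snippet does not depend on the file encoding.

```python
from __future__ import print_function
import re

# Same pattern as above: optional sign, optional "digits + dot", then digits.
pat = re.compile(r'[-+]?(?:\d*?[.])?\d+', re.U)


def parse_values(text):
    """Replace U+00BD with '.5', then return every number found as a float."""
    cleaned = text.replace(u"\u00bd", u".5")
    return [float(match) for match in pat.findall(cleaned)]


print(parse_values(u"+2\u00bd -102"))  # [2.5, -102.0]
```

If you need to keep whole numbers as integers, you would have to branch on whether the match contains a dot; this sketch converts everything to float for simplicity.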
There are more vulgar fractions in Unicode than just ½; here is some code that can parse them:
# coding=utf8
# curl -s "http://www.unicode.org/Public/UNIDATA/extracted/DerivedNumericValues.txt" | grep "VULGAR FRACTION"
fractions = {
0x2189: 0.0, # ; ; 0 # No VULGAR FRACTION ZERO THIRDS
0x2152: 0.1, # ; ; 1/10 # No VULGAR FRACTION ONE TENTH
0x2151: 0.11111111, # ; ; 1/9 # No VULGAR FRACTION ONE NINTH
0x215B: 0.125, # ; ; 1/8 # No VULGAR FRACTION ONE EIGHTH
0x2150: 0.14285714, # ; ; 1/7 # No VULGAR FRACTION ONE SEVENTH
0x2159: 0.16666667, # ; ; 1/6 # No VULGAR FRACTION ONE SIXTH
0x2155: 0.2, # ; ; 1/5 # No VULGAR FRACTION ONE FIFTH
0x00BC: 0.25, # ; ; 1/4 # No VULGAR FRACTION ONE QUARTER
0x2153: 0.33333333, # ; ; 1/3 # No VULGAR FRACTION ONE THIRD
0x215C: 0.375, # ; ; 3/8 # No VULGAR FRACTION THREE EIGHTHS
0x2156: 0.4, # ; ; 2/5 # No VULGAR FRACTION TWO FIFTHS
0x00BD: 0.5, # ; ; 1/2 # No VULGAR FRACTION ONE HALF
0x2157: 0.6, # ; ; 3/5 # No VULGAR FRACTION THREE FIFTHS
0x215D: 0.625, # ; ; 5/8 # No VULGAR FRACTION FIVE EIGHTHS
0x2154: 0.66666667, # ; ; 2/3 # No VULGAR FRACTION TWO THIRDS
0x00BE: 0.75, # ; ; 3/4 # No VULGAR FRACTION THREE QUARTERS
0x2158: 0.8, # ; ; 4/5 # No VULGAR FRACTION FOUR FIFTHS
0x215A: 0.83333333, # ; ; 5/6 # No VULGAR FRACTION FIVE SIXTHS
0x215E: 0.875, # ; ; 7/8 # No VULGAR FRACTION SEVEN EIGHTHS
}
rx = r'(?u)([+-])?(\d*)(%s)' % '|'.join(map(unichr, fractions))
test = u'15? and ¼ and +212½ and -?'
import re
for sign, d, f in re.findall(rx, test):
    sign = -1 if sign == '-' else 1
    d = int(d) if d else 0
    number = sign * (d + fractions[ord(f)])
    print 'found', number
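A possible variation, sketched on the assumption that the `unicodedata` module in your Python build knows these characters: instead of a hand-written table, `unicodedata.numeric` can supply the value of each fraction character. The character ranges below cover the same code points as the table above; `FRACTION_RE` and `parse_fractions` are illustrative names only. It should run on both Python 2.7 and Python 3.

```python
from __future__ import print_function
import re
import unicodedata

# Optional sign, optional whole digits, then one vulgar fraction character
# (U+00BC-U+00BE, U+2150-U+215E, U+2189: the code points in the table above).
FRACTION_RE = re.compile(
    u"([+-])?(\\d*)([\u00bc-\u00be\u2150-\u215e\u2189])", re.U)


def parse_fractions(text):
    """Return every 'digits + vulgar fraction' in text as a signed float."""
    results = []
    for sign, digits, frac in FRACTION_RE.findall(text):
        whole = int(digits) if digits else 0
        value = whole + unicodedata.numeric(frac)
        results.append(-value if sign == u"-" else value)
    return results


print(parse_fractions(u"\u00bc and +212\u00bd and -\u00be"))  # [0.25, 212.5, -0.75]
```

One difference from the table approach: values such as 1/3 come back at full float precision rather than rounded to eight decimal places.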