How to get an attribute value using BeautifulSoup and Python?

Question

I'm failing miserably to get an attribute value using BeautifulSoup and Python. Here is how the XML is structured:

...
</total>
<tag>
    <stat fail="0" pass="1">TR=111111 Sandbox=3000613</stat>
    <stat fail="0" pass="1">TR=121212 Sandbox=3000618</stat>
    ...
    <stat fail="0" pass="1">TR=999999 Sandbox=3000617</stat>
</tag>
<suite>
...


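What I would like to get is the pass value, but for the life of me I just can't understand how to do it. I checked the BeautifulSoup documentation, and it seems I should be using something like stat['pass'], but that doesn't seem to work.

Here's my code: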

with open('../results/output.xml') as raw_resuls:
    results = soup(raw_resuls, 'lxml')
    for stat in results.find_all('tag'):
        print stat['pass']


If I do results.stat['pass'], it returns a value that belongs to another tag, way up in the XML blob.

If I print the stat variable, I get the following:

<stat fail="0" pass="1">TR=787878 Sandbox=3000614</stat>
...
<stat fail="0" pass="1">TR=888888 Sandbox=3000610</stat>


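This seems to be OK.

I'm pretty sure I'm missing something or doing something wrong. Where should I look? Am I taking the wrong approach?

Any advice or guidance will be greatly appreciated! Thanks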

Answer 1

dte*_*ell 10

Please consider this approach:

from bs4 import BeautifulSoup

with open('test.xml') as raw_resuls:
    results = BeautifulSoup(raw_resuls, 'lxml')

for element in results.find_all("tag"):
    for stat in element.find_all("stat"):
        print(stat['pass'])


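The problem with your solution is that pass is an attribute of stat, not of the tag element in which you were looking for it.

This solution searches for all tag elements and, within each of them, for stat elements. The pass value is then read from those results.

For the XML file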

<tag>
    <stat fail="0" pass="1">TR=111111 Sandbox=3000613</stat>
    <stat fail="0" pass="1">TR=121212 Sandbox=3000618</stat>
    <stat fail="0" pass="1">TR=999999 Sandbox=3000617</stat>
</tag>


the script above produces the output

1
1
1

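
To make the difference concrete, here is a minimal sketch of what happens when the attribute is requested from the outer tag element rather than from stat. The sample values are illustrative, and 'html.parser' is an assumption made here only so the snippet runs without lxml installed; the script above uses 'lxml'.

```python
from bs4 import BeautifulSoup

# Sample data mirroring the structure in the question (values are illustrative).
xml = """
<tag>
    <stat fail="0" pass="1">TR=111111 Sandbox=3000613</stat>
    <stat fail="1" pass="0">TR=121212 Sandbox=3000618</stat>
</tag>
"""

# Assumption: 'html.parser' is used here to avoid the lxml dependency.
soup = BeautifulSoup(xml, "html.parser")
tag = soup.find("tag")

# <tag> itself has no attributes, so subscripting raises KeyError ...
try:
    tag["pass"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# ... while .get() returns None instead of raising.
missing = tag.get("pass")

# The attribute lives on the nested <stat> elements.
pass_values = [stat["pass"] for stat in tag.find_all("stat")]
print(lookup_failed, missing, pass_values)
```

Using .get() instead of subscripting can be handy when some elements might lack the attribute, since it avoids the exception.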

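Addition

Since some details still seem to be unclear (see comments), consider this complete workaround that uses BeautifulSoup to get everything you want. Storing dictionaries as list elements might not be ideal if you run into performance issues. But since you seem to be having some trouble with Python and Soup, I tried to keep this example as simple as possible by making all relevant information accessible by name instead of by index.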

from bs4 import BeautifulSoup

# Parses a string of the form 'TR=abc123 Sandbox=abc123' and returns a dictionary with the
# following structure: {'TR': 'abc123', 'Sandbox': 'abc123'}.
def parseTestID(testid):
    parts = testid.split(" ")
    return {'TR': parts[0].split("=")[1], 'Sandbox': parts[1].split("=")[1]}

# Parses the XML content of 'rawdata' and stores the pass value, TR-ID and Sandbox-ID in a
# dictionary of the following form: {'Pass': passvalue, 'TR': TR-ID, 'Sandbox': Sandbox-ID}.
# Each dictionary is appended to a list, which is returned.
def getTestState(rawdata):
    # initialize parser
    soup = BeautifulSoup(rawdata, 'lxml')
    parsedData = []

    # search for tag elements
    for tag in soup.find_all("tag"):
        # search each tag element for stat elements
        for stat in tag.find_all("stat"):
            # parse the element text once and store everything in a dictionary
            testid = parseTestID(stat.string)
            parsedData.append({'Pass': stat['pass'], 'TR': testid['TR'], 'Sandbox': testid['Sandbox']})

    # return list
    return parsedData


You can use the script above as follows to do whatever you want with the data (e.g. just print it out):

# open file
with open('test.xml') as raw_resuls:
    # get list of parsed data 
    data = getTestState(raw_resuls)

# print parsed data
for element in data:
    print("TR = {0}\tSandbox = {1}\tPass = {2}".format(element['TR'],element['Sandbox'],element['Pass']))


The output looks like this:

TR = 111111 Sandbox = 3000613   Pass = 1
TR = 121212 Sandbox = 3000618   Pass = 1
TR = 222222 Sandbox = 3000612   Pass = 1
TR = 232323 Sandbox = 3000618   Pass = 1
TR = 333333 Sandbox = 3000605   Pass = 1
TR = 343434 Sandbox = ZZZZZZ    Pass = 1
TR = 444444 Sandbox = 3000604   Pass = 1
TR = 454545 Sandbox = 3000608   Pass = 1
TR = 545454 Sandbox = XXXXXX    Pass = 1
TR = 555555 Sandbox = 3000617   Pass = 1
TR = 565656 Sandbox = 3000615   Pass = 1
TR = 626262 Sandbox = 3000602   Pass = 1
TR = 666666 Sandbox = 3000616   Pass = 1
TR = 676767 Sandbox = 3000599   Pass = 1
TR = 737373 Sandbox = 3000603   Pass = 1
TR = 777777 Sandbox = 3000611   Pass = 1
TR = 787878 Sandbox = 3000614   Pass = 1
TR = 828282 Sandbox = 3000600   Pass = 1
TR = 888888 Sandbox = 3000610   Pass = 1
TR = 999999 Sandbox = 3000617   Pass = 1

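
Because every entry is a dictionary, later processing can refer to fields by name. The following is only a sketch of what that might look like: the sample list is hypothetical data in the same shape as the list returned by getTestState(), and the names failed and by_tr are made up for illustration.

```python
# Hypothetical sample in the same shape as the list returned by getTestState().
data = [
    {"Pass": "1", "TR": "111111", "Sandbox": "3000613"},
    {"Pass": "0", "TR": "121212", "Sandbox": "3000618"},
    {"Pass": "1", "TR": "999999", "Sandbox": "3000617"},
]

# Attribute values come back as strings, so compare against "1", not 1.
failed = [entry["TR"] for entry in data if entry["Pass"] != "1"]

# Index the entries by TR-ID for direct lookup.
by_tr = {entry["TR"]: entry for entry in data}

print(failed)
print(by_tr["999999"]["Sandbox"])
```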

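Let's summarize the core elements that have been used:

Finding XML tags: To find XML tags, use soup.find("tag"), which returns the first matching tag, or soup.find_all("tag"), which finds all matching tags and stores them in a list. The individual tags can easily be accessed by iterating over that list.

Finding nested tags: To find nested tags, apply find() or find_all() again to each result of the first find_all().

Accessing the content of a tag: To access the content of a tag, use string on a single tag. For example, if tag = <tag>I love Soup!</tag>, then tag.string is "I love Soup!".

Finding attribute values: To get the value of an attribute, use subscript notation. For example, if tag = <tag color=red>I love Soup!</tag>, then tag['color'] is "red".

To parse strings of the form "TR=abc123 Sandbox=abc123" I used ordinary Python string splitting. You can read more about it here: How can I split and parse a string in Python?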
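
The four building blocks above can be tried out together in a short sketch. The markup and variable names are made up for illustration, and 'html.parser' is an assumption chosen so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Illustrative markup; not taken from the question's file.
markup = '<root><tag color="red">I love Soup!</tag><tag color="blue">Me too</tag></root>'
soup = BeautifulSoup(markup, "html.parser")

first = soup.find("tag")                    # first matching element
all_tags = soup.find_all("tag")             # all matching elements
nested = soup.find("root").find_all("tag")  # search within an earlier result

print(first.string)                         # text content of a single tag
print(first["color"])                       # attribute value via subscript
print([t["color"] for t in all_tags])
```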
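
As a possible variation on the splitting used in parseTestID, the sketch below handles any number of key=value pairs instead of exactly two. The function name parse_key_values is hypothetical and not part of the answer's script.

```python
def parse_key_values(text):
    """Turn 'TR=abc123 Sandbox=abc123' into {'TR': 'abc123', 'Sandbox': 'abc123'}."""
    pairs = {}
    for token in text.split():                 # split on any whitespace
        key, sep, value = token.partition("=")  # split at the first '=' only
        if sep:                                 # skip tokens that contain no '='
            pairs[key] = value
    return pairs

print(parse_key_values("TR=111111 Sandbox=3000613"))
```

Using partition rather than indexing into split("=") means a token without an equals sign is skipped instead of raising an IndexError.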
