How can I use BeautifulSoup to get a parent tag's name value when its child tag has a specific attribute value?

Question


Pen*_*nny 3 python xml tags attributes beautifulsoup

To make the question easier to understand, here is an example:

<Tag name="Thumbnail" inline="no" nonsearchable="yes">
<Attribute>
<Attribute name="AText" Searchable="yes"></Attribute>
</Attribute>
</Tag>

<Tag name="Label" inline="no" nonsearchable="yes">
<Attribute>
<Attribute name="AText" Searchable="no"></Attribute>
</Attribute>
</Tag>

<Tag name="Image" inline="no" nonsearchable="yes">
<Attribute>
<Attribute name="BText" Searchable="yes">
</Attribute>
</Tag>

<Tag name="Wonder" inline="no" nonsearchable="yes">
<Attribute>
<Attribute name="BText" Searchable="yes"></Attribute>
</Attribute>
</Tag>


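Expected result

In Excel, for each Attribute tag whose Searchable value is "yes", the first row should hold the Attribute tag's name value; the name values of the parent Tag elements of those qualifying Attribute tags should then be listed below it.

So far, I can only find the name values of all Tag elements whose children have a Searchable value of "yes", but I cannot group them under the name value of the corresponding Attribute tag. Here is my initial code: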

import os, openpyxl
from bs4 import BeautifulSoup

cwd = os.getcwd()

def func(x):
    for file in os.listdir(cwd):
        if file.endswith('.xml'):
            f = open(file, encoding='utf-8', mode='r+')
            soup = BeautifulSoup(f, 'lxml')
            AttrYES = soup.find_all(attrs={"Searchable": "yes"})
            for items in AttrYES:
                tagName = items.parent.parent.get('name')
                print(tagName)

x = os.listdir(cwd)
func(x)
I will keep working on this myself, but to speed things up, please share any ideas you have. Thanks!

Answer 1

Tin*_*y.D 5

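Your code finds nothing: if you print AttrYES, it is []. The problem is that when you use bs4 with the lxml parser, all tag and attribute names are converted to lowercase (see the official documentation). If you print soup, you get: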

<html><body><tag inline="no" name="Thumbnail" nonsearchable="yes">
<attribute>
<attribute name="AText" searchable="yes"></attribute>
</attribute>
</tag>
<tag inline="no" name="Label" nonsearchable="yes">
<attribute>
<attribute name="AText" searchable="no"></attribute>
</attribute>
</tag>
<tag inline="no" name="Image" nonsearchable="yes">
<attribute>
<attribute name="BText" searchable="yes">
</attribute>
</attribute></tag>
<tag inline="no" name="Wonder" nonsearchable="yes">
<attribute>
<attribute name="BText" searchable="yes"></attribute>
</attribute>
</tag></body></html>
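
The lowercasing can be reproduced on a small inline sample. The sketch below is only an illustration: the one-element XML string is made up to resemble the question's data, and it assumes Python's built-in html.parser in place of lxml so that nothing beyond bs4 is required. According to the Beautiful Soup documentation, its HTML parsers all lowercase tag and attribute names, so the effect should be the same.

```python
from bs4 import BeautifulSoup

# Hypothetical one-element sample, shaped like the XML in the question.
xml = (
    '<Tag name="Thumbnail" inline="no" nonsearchable="yes">'
    '<Attribute><Attribute name="AText" Searchable="yes"></Attribute></Attribute>'
    '</Tag>'
)

# Assumption: html.parser stands in for lxml so nothing beyond bs4 is needed.
soup = BeautifulSoup(xml, 'html.parser')

# The original mixed-case attribute name no longer exists in the parsed tree.
mixed_case = soup.find_all(attrs={'Searchable': 'yes'})

# The lowercased name matches; attribute values keep their original case.
lower_case = soup.find_all(attrs={'searchable': 'yes'})

print(len(mixed_case), len(lower_case))
print(lower_case[0].parent.parent.get('name'))
```

Only names are lowercased, which is why the value "Thumbnail" can still be read back unchanged from the grandparent tag.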


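So you can modify your code like this:

import bs4

# Parse with lxml's HTML parser, which lowercases tag and attribute names.
with open('test.xml', encoding='utf-8') as f:
    soup = bs4.BeautifulSoup(f, 'lxml')

# Query with the lowercased attribute name.
AttrYES = soup.find_all(attrs={'searchable': 'yes'})

# Map each Attribute name to the name values of its grandparent Tag elements.
result = dict()
for items in AttrYES:
    result.setdefault(items.get('name'), []).append(items.parent.parent.get('name'))
print(result)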


The output is:

{'AText': ['Thumbnail'], 'BText': ['Image', 'Wonder']}
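
The grouping step that produces this dictionary can also be written with collections.defaultdict. This is a minimal sketch rather than the answer's own code: the pairs list is hypothetical and stands in for the (attribute name, parent Tag name) values the loop reads from the parsed document.

```python
from collections import defaultdict

# Hypothetical (attribute name, parent Tag name) pairs, as the loop would see them.
pairs = [('AText', 'Thumbnail'), ('BText', 'Image'), ('BText', 'Wonder')]

# Each new key starts with an empty list, so no existence check is needed.
grouped = defaultdict(list)
for attr_name, tag_name in pairs:
    grouped[attr_name].append(tag_name)

# Convert back to a plain dict for printing and later use.
result = dict(grouped)
print(result)
```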


Then you can write them to your Excel file:

import xlsxwriter

workbook = xlsxwriter.Workbook('result.xlsx')
worksheet = workbook.add_worksheet()

# One column per Attribute name: header in row 0, Tag names below it.
# (Indexing result.keys()[0] only works in Python 2, so iterate instead.)
for col, (attr_name, tag_names) in enumerate(result.items()):
    worksheet.write(0, col, attr_name)
    worksheet.write_column(1, col, tag_names)

workbook.close()


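The resulting result.xlsx has AText in the first column with Thumbnail below it, and BText in the second column with Image and Wonder below it.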

Update: using openpyxl

from openpyxl import Workbook

wb = Workbook()

# Grab the active worksheet
ws = wb.active

# One column per Attribute name: header in row 1, Tag names from row 2 down.
# `result` is the dictionary built above.
for col, (key, values) in enumerate(result.items(), start=1):
    ws.cell(column=col, row=1, value=key)
    for row, value in enumerate(values, start=2):
        ws.cell(column=col, row=row, value=value)
wb.save("result.xlsx")
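
If writing row by row is more convenient than cell by cell, the dictionary of columns can first be turned into rows. The sketch below is one possible approach under stated assumptions, not part of the original answer: columns_to_rows is a hypothetical helper name, and the result dictionary is copied from the output printed earlier.

```python
from itertools import zip_longest

# The mapping printed earlier in this answer.
result = {'AText': ['Thumbnail'], 'BText': ['Image', 'Wonder']}

def columns_to_rows(columns):
    """Hypothetical helper: header row first, then one row per list position."""
    header = list(columns.keys())
    # zip_longest pads the shorter columns with None (an empty cell).
    body = [list(row) for row in zip_longest(*columns.values())]
    return [header] + body

rows = columns_to_rows(result)
for row in rows:
    print(row)
```

With openpyxl, each of these rows could then be passed to ws.append(row), which writes one worksheet row per call.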


Archived:	8 years, 8 months ago
Views:	995
Last activity:	8 years, 8 months ago