我需要匹配所有这些开始标记:
<p>
<a href="foo">
Run Code Online (Sandbox Code Playgroud)
但不是这些:
<br />
<hr class="foo" />
Run Code Online (Sandbox Code Playgroud)
我想出了这个,并希望确保我做对了.我只抓住了a-z.
<([a-z]+) *[^/]*?>
Run Code Online (Sandbox Code Playgroud)
我相信它说:
/,然后我有这个权利吗?更重要的是,你怎么看?
Interestingly, pandas I/O tools does not maintain a read_xml() method and the counterpart to_xml(). However, read_json proves tree-like structures can be implemented for dataframe import and read_html for markup formats.
Now, if the pandas team does consider such a read_xml method for a future pandas version, what implementation would they pursue: parsing with built-in xml.etree.ElementTree with its iterfind() or iterparse() functions or the third-party module, lxml with its XPath 1.0 and XSLT 1.0 methods?
下面是我在一个简单,扁平,以元素为中心的XML输入上的四种方法类型的测试运行.所有这些都设置为root的任何二级子级的基因化解析,并且每个方法应该产生完全相同的pandas数据帧.除了pd.Dataframe()字典列表上的最后一次调用之外的所有内容.该XSLT转换的方法,以XML为CSV铸造StringIO()在 …
我有这个xml文件想要在python中将内容转换为csv文件的数据框:
<?xml version="1.0" encoding="utf-8"?>
<dashboardreport name="jvm_report" version="7.0.21.1017" reportdate="2018-08-08T10:37:01.510-04:00" description="">
<source name="CORP_GTM">
<filters summary="from Jul-30 23:40 to Jul-31 02:40">
<filter>tf:CustomTimeframe?1533008450802:1533019250802</filter>
</filters>
</source>
<reportheader>
<reportdetails>
<user>test1</user>
</reportdetails>
</reportheader>
<data>
<chartdashlet name="jvm_mem_percent" description="" showabsolutevalues="false">
<measures structuretype="tree">
<measure measure="Memory Utilization - Memory Utilization (split by Agent)" color="#800080" aggregation="Maximum" unit="%" thresholds="false" drawingorder="1">
<measure measure="Memory Utilization - test@server1" color="#7aebd0" aggregation="Maximum" unit="%" thresholds="false">
<measurement timestamp="1533008460000" avg="11.116939544677734" min="11.007165908813477" max="11.143875122070312" sum="66.7016372680664" count="6"></measurement>
<measurement timestamp="1533008520000" avg="11.204706827799479" min="11.144883155822754" max="11.268420219421387" sum="67.22824096679688" count="6"></measurement>
</measure>
<measure measure="Memory Utilization - test@server2" color="#a6f2e0" aggregation="Maximum" unit="%" thresholds="false"> …Run Code Online (Sandbox Code Playgroud) 我有几个从 Sharepoint 查询数据的 .iqy 文件。我需要在 Python Pandas 中组合和处理这些。Python有办法做到这一点吗?我知道 Python Sharepoint 库存在,但我试图避免通过 Python 设置自己的连接,而是依赖 .iqy 文件。有任何想法吗?
为了回答这个问题,假设该表如下所示:
+------+------+
| col1 | col2 |
+------+------+
| 1 | 2 |
| 3 | 4 |
+------+------+
Run Code Online (Sandbox Code Playgroud)
此外,我愿意接受非 Python 解决方案,以获取自动运行 .iqy 查询并将数据转换为 Python 可读格式(例如 .csv)的方法。但不确定这种方法是什么样的
有一个如下所示的 XML 文件:
?<?xml version="1.0" encoding="utf-8"?>
<comments>
<row Id="1" PostId="2" Score="0" Text="(...)" CreationDate="2011-08-30T21:15:28.063" UserId="16" />
<row Id="2" PostId="17" Score="1" Text="(...)" CreationDate="2011-08-30T21:24:56.573" UserId="27" />
<row Id="3" PostId="26" Score="0" Text="(...)" UserId="9" />
</comments>
Run Code Online (Sandbox Code Playgroud)
我想要做的是将 ID、Text 和 CreationDate 列提取到 Pandas DF 中,我尝试了以下操作:
import xml.etree.cElementTree as et
import pandas as pd
path = '/.../...'
dfcols = ['ID', 'Text', 'CreationDate']
df_xml = pd.DataFrame(columns=dfcols)
root = et.parse(path)
rows = root.findall('.//row')
for row in rows:
ID = row.find('Id')
text = row.find('Text')
date = row.find('CreationDate')
print(ID, text, date) …Run Code Online (Sandbox Code Playgroud) 有人可以帮助将以下 XML 文件转换为 Pandas 数据框:
<?xml version="1.0" encoding="UTF-8" ?>
<root>
<bathrooms type="dict">
<n35237 type="number">1.0</n35237>
<n32238 type="number">3.0</n32238>
<n44699 type="number">nan</n44699>
</bathrooms>
<price type="dict">
<n35237 type="number">7020000.0</n35237>
<n32238 type="number">10000000.0</n32238>
<n44699 type="number">4128000.0</n44699>
</price>
<property_id type="dict">
<n35237 type="number">35237.0</n35237>
<n32238 type="number">32238.0</n32238>
<n44699 type="number">44699.0</n44699>
</property_id>
</root>Run Code Online (Sandbox Code Playgroud)
它应该是这样的——
这是我写的代码:-
import pandas as pd
import xml.etree.ElementTree as ET
tree = ET.parse('real_state.xml')
root = tree.getroot()
dfcols = ['property_id', 'price', 'bathrooms']
df_xml = pd.DataFrame(columns=dfcols)
for node in root:
property_id = node.attrib.get('property_id')
price = node.attrib.get('price')
bathrooms = node.attrib.get('bathrooms')
df_xml = df_xml.append(
pd.Series([property_id, …Run Code Online (Sandbox Code Playgroud) 我正在尝试将大型 (53MB) XML 文件加载到 Pandas 数据帧中。这里有 3 行实际数据(来自 NTSB 航空事故报告的公共数据库),但实际文件有 77257 行:
<?xml version="1.0"?>
<DATA xmlns="http://www.ntsb.gov">
<ROWS>
<ROW EventId="20150901X74304" InvestigationType="Accident" AccidentNumber="GAA15CA244" EventDate="09/01/2015" Location="Truckee, CA" Country="United States" Latitude="" Longitude="" AirportCode="" AirportName="" InjurySeverity="" AircraftDamage="" AircraftCategory="" RegistrationNumber="N786AB" Make="JOE SALOMONE" Model="SUPER CUB SQ2" AmateurBuilt="" NumberOfEngines="" EngineType="" FARDescription="" Schedule="" PurposeOfFlight="" AirCarrier="" TotalFatalInjuries="" TotalSeriousInjuries="" TotalMinorInjuries="" TotalUninjured="" WeatherCondition="" BroadPhaseOfFlight="" ReportStatus="Preliminary" PublicationDate=""/>
<ROW EventId="20150901X92332" InvestigationType="Accident" AccidentNumber="CEN15LA392" EventDate="08/31/2015" Location="Houston, TX" Country="United States" Latitude="29.809444" Longitude="-95.668889" AirportCode="IWS" AirportName="WEST HOUSTON" InjurySeverity="Non-Fatal" AircraftDamage="Substantial" AircraftCategory="Airplane" RegistrationNumber="N452CS" Make="CESSNA" Model="T240" AmateurBuilt="No" NumberOfEngines="" EngineType="" FARDescription="Part 91: General …Run Code Online (Sandbox Code Playgroud) 我必须将 xml 文件转换为数据框熊猫。我尝试了很多模式,但结果是一样的:无,无......我错了什么?另一个图书馆更好吗?是否可能是因为我的 XML 格式?xml 文件的类型为:
<Document xmlns="xxx/zzz/yyy">
<Header>
<DocumentName>GXXXXXXXXXX</DocumentName>
<DocumentType>G10</DocumentType>
<Version>2.0.0.0</Version>
<Created>2018-12-11T09:00:02.987777+00:00</Created>
<TargetProcessingDate>2019-02-11</TargetProcessingDate>
<Part>
<CurrentPage>1</CurrentPage>
<TotalPages>1</TotalPages>
</Part>
</Header>
<Body>
<Accounts>
<Account>
<Type>20WE</Type>
<OldType>19WE</OldType>
<Kids>
<Kid>
<Name>marc</Name>
<BirthDate>2000-02-06</BirthDate>
<Year>19</Year>
<Email>marc@xxx.com</Email>
</Kid>
</Kids>
</Account>
</Accounts>
</Body>
</Document>
Run Code Online (Sandbox Code Playgroud)
import xml.etree.ElementTree as ET
import pandas as pd
class XML2DataFrame:
def __init__(self, xml_data):
self.root = ET.XML(xml_data)
def parse_root(self, root):
"""Return a list of dictionaries from the text and attributes of the
children under this XML root."""
return [parse_element(child) for child in root.getchildren()]
def …Run Code Online (Sandbox Code Playgroud)