Naw*_*waz 3 python xml youtube-api elementtree python-3.x
I am trying to parse an XML string fetched from a YouTube video feed, using Python 3.3.1. Here is the code:
import re
import sys
import urllib.request
import urllib.parse
import xml.etree.ElementTree as element_tree

def get_video_id(video_url):
    return re.search(r'watch\?v=.*', video_url).group(0)[8:]

def get_video_feed(video_url):
    video_feed = "http://gdata.youtube.com/feeds/api/videos/" + get_video_id(video_url)
    return urllib.request.urlopen(video_feed).read()

def get_media_info(video_url):
    content = get_video_feed(video_url)
    content = str(content, 'ascii')
    media = {}
    e = element_tree.XML(content)
    print ( "CONTENT: \n" + content )
    print ( "\n\nELEMENTS : \n")
    for i in list(e):
        print (i)
    media['title'] = e.findall('title')  # NOTE THIS!
    return media

def main():
    video_url = 'http://youtube.com/watch?v=q5sOLzEerwA'
    print ( get_media_info(video_url) )

if __name__ == '__main__':
    main()
I don't understand why the for loop in get_media_info() prints the element as
<Element '{http://www.w3.org/2005/Atom}title' at 0x0000000002BF7D18>
rather than this:
<Element 'title' at 0x0000000002BF7D18>
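Frankly, I don't care how it is printed. All I care about is that I pass 'title' to findall() and get back a list of the matching element(s). But it returns an empty list, even though the XML contains an element named title.
So I tried this: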
media['title'] = e.findall('{http://www.w3.org/2005/Atom}title')
That does return a list with one element. I'm sure this isn't the right way to do it, and I feel I'm missing something.
How do I fix this?
Here is the output of the code above:
CONTENT:
<?xml version='1.0' encoding='UTF-8'?>
<entry xmlns='http://www.w3.org/2005/Atom' xmlns:media='http://search.yahoo.com/mrss/' xmlns:gd='http://schemas.google.com/g/2005' xmlns:yt='http://gdata.youtube.com/schemas/2007'>
<id>http://gdata.youtube.com/feeds/api/videos/q5sOLzEerwA</id>
<published>2011-12-01T18:18:36.000Z</published>
<updated>2013-05-07T03:20:04.000Z</updated>
<category scheme='http://schemas.google.com/g/2005#kind' term='http://gdata.youtube.com/schemas/2007#video'/>
<category scheme='http://gdata.youtube.com/schemas/2007/categories.cat' term='Music' label='Music'/>
<title type='text'>Kala Bazaar - Khoya Khoya Chand Khula Aasman - Mohd Rafi.flv</title>
<content type='text'>tanhayi me akele me khoya khoya chand.........</content>
<link rel='alternate' type='text/html' href='http://www.youtube.com/watch?v=q5sOLzEerwA&amp;feature=youtube_gdata'/>
<link rel='http://gdata.youtube.com/schemas/2007#video.responses' type='application/atom+xml' href='http://gdata.youtube.com/feeds/api/videos/q5sOLzEerwA/responses'/>
<link rel='http://gdata.youtube.com/schemas/2007#video.related' type='application/atom+xml' href='http://gdata.youtube.com/feeds/api/videos/q5sOLzEerwA/related'/>
<link rel='http://gdata.youtube.com/schemas/2007#mobile' type='text/html' href='http://m.youtube.com/details?v=q5sOLzEerwA'/>
<link rel='self' type='application/atom+xml' href='http://gdata.youtube.com/feeds/api/videos/q5sOLzEerwA'/>
<author>
<name>a1a2a3a4a786</name>
<uri>http://gdata.youtube.com/feeds/api/users/a1a2a3a4a786</uri>
</author>
<gd:comments>
<gd:feedLink rel='http://gdata.youtube.com/schemas/2007#comments' href='http://gdata.youtube.com/feeds/api/videos/q5sOLzEerwA/comments' countHint='6'/>
</gd:comments>
<media:group>
<media:category label='Music' scheme='http://gdata.youtube.com/schemas/2007/categories.cat'>Music</media:category>
<media:content url='http://www.youtube.com/v/q5sOLzEerwA?version=3&amp;f=videos&amp;app=youtube_gdata' type='application/x-shockwave-flash' medium='video' isDefault='true' expression='full' duration='293' yt:format='5'/>
<media:content url='rtsp://v6.cache3.c.youtube.com/CiILENy73wIaGQkArx4xLw6bqxMYDSANFEgGUgZ2aWRlb3MM/0/0/0/video.3gp' type='video/3gpp' medium='video' expression='full' duration='293' yt:format='1'/>
<media:content url='rtsp://v6.cache3.c.youtube.com/CiILENy73wIaGQkArx4xLw6bqxMYESARFEgGUgZ2aWRlb3MM/0/0/0/video.3gp' type='video/3gpp' medium='video' expression='full' duration='293' yt:format='6'/>
<media:description type='plain'>tanhayi me akele me khoya khoya chand.........</media:description>
<media:keywords/>
<media:player url='http://www.youtube.com/watch?v=q5sOLzEerwA&amp;feature=youtube_gdata_player'/>
<media:thumbnail url='http://i.ytimg.com/vi/q5sOLzEerwA/0.jpg' height='360' width='480' time='00:02:26.500'/>
<media:thumbnail url='http://i.ytimg.com/vi/q5sOLzEerwA/1.jpg' height='90' width='120' time='00:01:13.250'/>
<media:thumbnail url='http://i.ytimg.com/vi/q5sOLzEerwA/2.jpg' height='90' width='120' time='00:02:26.500'/>
<media:thumbnail url='http://i.ytimg.com/vi/q5sOLzEerwA/3.jpg' height='90' width='120' time='00:03:39.750'/>
<media:title type='plain'>Kala Bazaar - Khoya Khoya Chand Khula Aasman - Mohd Rafi.flv</media:title>
<yt:duration seconds='293'/>
</media:group>
<gd:rating average='4.733333' max='5' min='1' numRaters='30' rel='http://schemas.google.com/g/2005#overall'/>
<yt:statistics favoriteCount='0' viewCount='8140'/>
</entry>
ELEMENTS :
<Element '{http://www.w3.org/2005/Atom}id' at 0x0000000002BF79F8>
<Element '{http://www.w3.org/2005/Atom}published' at 0x0000000002BF7B88>
<Element '{http://www.w3.org/2005/Atom}updated' at 0x0000000002BF7A48>
<Element '{http://www.w3.org/2005/Atom}category' at 0x0000000002BF7C78>
<Element '{http://www.w3.org/2005/Atom}category' at 0x0000000002BF7CC8>
<Element '{http://www.w3.org/2005/Atom}title' at 0x0000000002BF7D18>
<Element '{http://www.w3.org/2005/Atom}content' at 0x0000000002BF7D68>
<Element '{http://www.w3.org/2005/Atom}link' at 0x0000000002BF7DB8>
<Element '{http://www.w3.org/2005/Atom}link' at 0x0000000002BF7E08>
<Element '{http://www.w3.org/2005/Atom}link' at 0x0000000002BF7E58>
<Element '{http://www.w3.org/2005/Atom}link' at 0x0000000002BF7EA8>
<Element '{http://www.w3.org/2005/Atom}link' at 0x0000000002BF7EF8>
<Element '{http://www.w3.org/2005/Atom}author' at 0x0000000002BF7F48>
<Element '{http://schemas.google.com/g/2005}comments' at 0x0000000002C0B0E8>
<Element '{http://search.yahoo.com/mrss/}group' at 0x0000000002C0B1D8>
<Element '{http://schemas.google.com/g/2005}rating' at 0x0000000002C0B778>
<Element '{http://gdata.youtube.com/schemas/2007}statistics' at 0x0000000002C0B7C8>
{'title': []}
Mar*_*nen 12
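Namespaces are an essential part of an XML document. ElementTree requires tags to be fully qualified in order to find the right elements. Here is an example of three elements that have the same tag in different namespaces: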
data = '''\
<root xmlns="xyz" xmlns:name="abc">
<object name="one" />
<name:object name="two" />
<object xmlns="def" name="three" />
</root>
'''
Here are the elements as ElementTree sees them:
>>> from xml.etree import ElementTree as et
>>> tree = et.fromstring(data)
>>> print(tree.findall('.//*'))
[<Element '{xyz}object' at 0x0000000003B07BD8>,
<Element '{abc}object' at 0x0000000003B07C28>,
<Element '{def}object' at 0x0000000003B07C78>]
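The `{uri}tag` form printed above is simply the string ElementTree stores in each element's `.tag`. A minimal sketch, assuming the same `data` sample, that splits those strings back into a namespace URI and a local name. The helper `split_tag` is a hypothetical name used for illustration, not part of ElementTree:

```python
from xml.etree import ElementTree as et

data = '''\
<root xmlns="xyz" xmlns:name="abc">
<object name="one" />
<name:object name="two" />
<object xmlns="def" name="three" />
</root>
'''

def split_tag(tag):
    """Hypothetical helper: split '{uri}local' into (uri, local)."""
    if tag.startswith('{'):
        uri, local = tag[1:].split('}', 1)
        return uri, local
    return None, tag  # element without a namespace

tree = et.fromstring(data)
# Iterating an element yields its direct children in document order.
pairs = [split_tag(child.tag) for child in tree]
for uri, local in pairs:
    print(uri, local)
```

This only changes how you look at the tag after the fact; searches with `find()` and `findall()` still need the qualified name.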
So you are doing it correctly. Given the default namespace definition:
<entry xmlns='http://www.w3.org/2005/Atom' ...
To access the 'title' tag, which uses the default namespace:
media['title'] = e.findall('{http://www.w3.org/2005/Atom}title')
To access the 'media:group' tag, look at the media namespace definition:
<entry ... xmlns:media='http://search.yahoo.com/mrss/' ...
and use:
e.findall('{http://search.yahoo.com/mrss/}group')
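Writing out the full URI each time gets verbose. `find()`, `findall()` and `findtext()` also accept an optional `namespaces` mapping from prefix to URI, so the path can use short prefixes instead. A minimal sketch on a cut-down copy of the feed. The names `sample` and `NS` and the prefix `atom` are assumptions chosen for illustration; the prefixes in the mapping do not have to match the ones used in the document:

```python
import xml.etree.ElementTree as element_tree

# Cut-down stand-in for the downloaded entry (assumption: same structure).
sample = '''\
<entry xmlns='http://www.w3.org/2005/Atom' xmlns:media='http://search.yahoo.com/mrss/'>
<title type='text'>Kala Bazaar - Khoya Khoya Chand Khula Aasman - Mohd Rafi.flv</title>
<media:group>
<media:title type='plain'>Kala Bazaar - Khoya Khoya Chand Khula Aasman - Mohd Rafi.flv</media:title>
</media:group>
</entry>
'''

# Prefix -> namespace URI; the prefixes are local to this mapping.
NS = {
    'atom': 'http://www.w3.org/2005/Atom',
    'media': 'http://search.yahoo.com/mrss/',
}

e = element_tree.XML(sample)
atom_titles = e.findall('atom:title', NS)
media_title = e.findtext('media:group/media:title', namespaces=NS)

print(atom_titles[0].text)
print(media_title)
print(e.findall('title'))  # unqualified name: still no match
```

The mapping is only consulted while the path is parsed; the elements themselves still carry the full `{uri}tag` names.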
Note the different ways a namespace can be specified:
<root xmlns="xyz" xmlns:name="abc"> # default namespace and
# 'abc' namespace with id 'name'.
<object name="one" /> # Uses default namespace 'xyz'.
<name:object name="two" /> # uses 'abc' namespace (specified by id).
<object xmlns="def" name="three" /> # change the default namespace to 'def'.
</root>
To read a specific tag from a specific namespace:
>>> print(tree.find('{abc}object').attrib['name'])
two
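If the namespace genuinely does not matter, newer Python versions offer a wildcard: per the ElementTree documentation, `{*}tag` matches `tag` in any namespace (or none). That support was added in Python 3.8, so it is not available on the 3.3.1 mentioned in the question. A small sketch, assuming Python 3.8+ and the same `data` sample:

```python
from xml.etree import ElementTree as et

data = '''\
<root xmlns="xyz" xmlns:name="abc">
<object name="one" />
<name:object name="two" />
<object xmlns="def" name="three" />
</root>
'''

tree = et.fromstring(data)

# '{*}object' ignores the namespace (requires Python 3.8 or later).
any_ns = [el.attrib['name'] for el in tree.findall('{*}object')]

# A fully qualified name still selects exactly one namespace.
only_abc = [el.attrib['name'] for el in tree.findall('{abc}object')]

print(any_ns)
print(only_abc)
```

Be aware that the wildcard can match more than intended: in the YouTube feed, both the Atom `title` and `media:title` have the local name `title`.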
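Note that namespace IDs are just shortcuts. Here is what happens when the parsed XML tree is dumped. ElementTree doesn't bother to save the original namespace IDs and generates its own in the format ns#: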
>>> et.dump(tree)
<ns0:root xmlns:ns0="xyz" xmlns:ns1="abc" xmlns:ns2="def">
<ns0:object name="one" />
<ns1:object name="two" />
<ns2:object name="three" />
</ns0:root>
If you want to define specific shortcuts, use `register_namespace`:
>>> et.register_namespace('','xyz') # default namespace
>>> et.register_namespace('name','abc')
>>> et.register_namespace('custom','def')
>>> et.dump(tree)
<root xmlns="xyz" xmlns:custom="def" xmlns:name="abc">
<object name="one" />
<name:object name="two" />
<custom:object name="three" />
</root>
I actually tried the following approach using xml.dom.minidom, in case it helps you.
#!/usr/bin/env python3
from xml.dom.minidom import parseString
import re
import urllib.request

def get_video_id(video_url):
    return re.search(r'watch\?v=.*', video_url).group(0)[8:]

def get_video_feed(video_url):
    video_feed = "http://gdata.youtube.com/feeds/api/videos/" + get_video_id(video_url)
    print(video_feed)
    return urllib.request.urlopen(video_feed).read()

def get_media_info(video_url):
    content = get_video_feed(video_url)
    dom = parseString(content)
    media = {}
    # The first element whose tag name is exactly 'title' is the Atom title.
    media['title'] = dom.getElementsByTagName('title')[0].firstChild.nodeValue
    return media

def main():
    video_url = 'http://youtube.com/watch?v=q5sOLzEerwA'
    print ( get_media_info(video_url) )

if __name__ == '__main__':
    main()
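One thing to keep in mind with `getElementsByTagName('title')`: minidom compares against the tag name as written in the document, so it finds the Atom `title` but not `media:title`. To select by namespace explicitly, `getElementsByTagNameNS` takes the namespace URI and the local name. A minimal sketch on a cut-down sample; the `sample` string and its placeholder titles are assumptions standing in for the downloaded content:

```python
from xml.dom.minidom import parseString

# Cut-down stand-in for the feed (placeholder titles, not real data).
sample = """\
<entry xmlns='http://www.w3.org/2005/Atom' xmlns:media='http://search.yahoo.com/mrss/'>
<title type='text'>Atom title</title>
<media:group>
<media:title type='plain'>Media title</media:title>
</media:group>
</entry>"""

ATOM = 'http://www.w3.org/2005/Atom'
MEDIA = 'http://search.yahoo.com/mrss/'

dom = parseString(sample)

# Match on (namespace URI, local name) instead of the literal tag name.
atom_titles = [n.firstChild.nodeValue
               for n in dom.getElementsByTagNameNS(ATOM, 'title')]
media_titles = [n.firstChild.nodeValue
                for n in dom.getElementsByTagNameNS(MEDIA, 'title')]

print(atom_titles)
print(media_titles)
print(len(dom.getElementsByTagName('title')))  # literal tag name 'title' only
```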