Metadata harvesting

Ara*_*idi 6 python metadata oai

I am trying to use the metadata harvesting package pyoai (https://pypi.python.org/pypi/pyoai) to harvest the data from this site: https://www.duo.uio.no/oai/request?verb=Identify

I tried the example from the pyoai site, but it did not work. When I run it, I get an error. The code is:

from oaipmh.client import Client
from oaipmh.metadata import MetadataRegistry, oai_dc_reader

URL = 'http://uni.edu/ir/oaipmh'
registry = MetadataRegistry()
registry.registerReader('oai_dc', oai_dc_reader)
client = Client(URL, registry)

for record in client.listRecords(metadataPrefix='oai_dc'):
    print record

Here is the stack trace:

Traceback (most recent call last):
  File "/Users/arashsaidi/PycharmProjects/get-new-DUO/get-files.py", line 8, in <module>
    for record in client.listRecords(metadataPrefix='oai_dc'):
  File "/Users/arashsaidi/.virtualenvs/lbk/lib/python2.7/site-packages/oaipmh/common.py", line 115, in method
    return obj(self, **kw)
  File "/Users/arashsaidi/.virtualenvs/lbk/lib/python2.7/site-packages/oaipmh/common.py", line 110, in __call__
    return bound_self.handleVerb(self._verb, kw)
  File "/Users/arashsaidi/.virtualenvs/lbk/lib/python2.7/site-packages/oaipmh/client.py", line 65, in handleVerb
    kw, self.makeRequestErrorHandling(verb=verb, **kw))    
  File "/Users/arashsaidi/.virtualenvs/lbk/lib/python2.7/site-packages/oaipmh/client.py", line 273, in makeRequestErrorHandling
    raise error.XMLSyntaxError(kw)
oaipmh.error.XMLSyntaxError: {'verb': 'ListRecords', 'metadataPrefix': 'oai_dc'}

I need to access all the files on the page I linked above and generate an additional file with some metadata.

Any suggestions?

Ara*_*idi 3

I ended up using the Sickle package, which I found to be better documented and easier to use:

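This code fetches all the sets and then retrieves every record in each set. Given that there are more than 30,000 records to process, this seemed like the best solution. Going set by set also gives more control. I hope this helps others. I don't know why libraries use OAI; to me it doesn't seem like a good way to organize data...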

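        # Excerpt from a method of a larger class; requires: from sickle import Sickle
        # Create a Sickle client for the OAI-PMH endpoint
        sickle = Sickle('http://www.duo.uio.no/oai/request')
        sets = sickle.ListSets()  # gets all sets
        for recs in sets:
            # Each set iterates as (field name, list of values) pairs
            for rec in recs:
                if rec[0] == 'setSpec':
                    try:
                        print('{} {}'.format(rec[1][0], self.spec_list[rec[1][0]]))
                        # Request every non-deleted record in this set
                        records = sickle.ListRecords(metadataPrefix='xoai', set=rec[1][0], ignore_deleted=True)
                        self.write_file_and_metadata()
                    except Exception as e:
                        # simple exception handling if it is not possible to retrieve the records
                        print('Exception: {}'.format(e))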
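The snippet above is cut out of a larger class, so it will not run on its own. A self-contained version of the same per-set loop might look like the sketch below. The names `make_client` and `harvest_by_set` are hypothetical and not part of Sickle, and the sketch assumes that Sickle's set objects expose the set identifier as a `setSpec` attribute. It has not been run against the DUO endpoint, so treat it as a starting point rather than a tested harvester.

```python
def make_client(url):
    """Build a Sickle client (hypothetical helper, not part of Sickle).

    Sickle is imported lazily so that harvest_by_set() below can also be
    exercised with any stand-in object that has the same two methods.
    """
    from sickle import Sickle  # third-party package: pip install sickle
    return Sickle(url)


def harvest_by_set(client, metadata_prefix="oai_dc"):
    """Yield (set_spec, record) pairs, working through one set at a time.

    `client` is assumed to offer Sickle-style ListSets() and
    ListRecords(**kwargs) methods.
    """
    for oai_set in client.ListSets():
        # Assumption: each set object carries its identifier as .setSpec
        set_spec = oai_set.setSpec
        try:
            records = client.ListRecords(
                metadataPrefix=metadata_prefix,
                set=set_spec,
                ignore_deleted=True,
            )
            for record in records:
                yield set_spec, record
        except Exception as exc:
            # Same broad handling as in the snippet above: report the
            # problem and carry on with the next set.
            print("Skipping set {}: {}".format(set_spec, exc))
```

With the endpoint from the question, this could be driven by something like `for set_spec, record in harvest_by_set(make_client('http://www.duo.uio.no/oai/request'), metadata_prefix='xoai'): ...`, writing each record and its metadata to disk inside the loop. Because it is a generator, records are handled as they arrive instead of being collected in memory first, which matters with more than 30,000 of them.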