假设我有一个关于一个人找到工作的简单HTML页面:
<!DOCTYPE HTML>
<html>
<head>
<meta charset="utf-8">
<title>New Job for John Doe</title>
</head>
<body>
<h1>New Job for John Doe</h1>
<p>This week John Doe accepted an offer to become a Software Engineer at MITRE. John graduated from MIT in 2005 with a BS in Computer Science. He previously worked at a small company near Boston. Blah, blah, blah.</p>
<p>The MITRE Corporation is a not-for-profit organization chartered to work in the public interest. The MITRE Corporation has two principal locations: Bedford, Massachusetts, …Run Code Online (Sandbox Code Playgroud) 我主要使用Ruby来做到这一点,但到目前为止我的攻击计划如下:
使用gems rdf,rdf-rdfa和rdf-microdata或mida来解析给定任何URI的数据.我认为最好映射到schema.org之类的统一模式,例如,使用这个yaml文件试图描述数据词汇表和opengraph到schema.org之间的转换:
# Schema X to schema.org conversion
#data-vocabulary
DV:
name:name
street-address:streetAddress
region:addressRegion
locality:addressLocality
photo:image
country-name:addressCountry
postal-code:postalCode
tel:telephone
latitude:latitude
longitude:longitude
type:type
#opengraph
OG:
title:name
type:type
image:image
site_name:site_name
description:description
latitude:latitude
longitude:longitude
street-address:streetAddress
locality:addressLocality
region:addressRegion
postal-code:postalCode
country-name:addressCountry
phone_number:telephone
email:email
Run Code Online (Sandbox Code Playgroud)
然后,我可以存储以一种格式找到的信息,并使用schema.org语法重新显示它们.
另一部分是确定类型.我会在schema.org之后对我的表进行建模,我想知道记录的"Thing"(Thing)类型.因此,如果我解析一个opengraph类型的'bar',我会存储它是'BarOrPub'(BarOrPub).
有没有更好的方法呢?什么东西自动化?已有解决方案吗?任何输入赞赏.
编辑:
所以我发现这个解析得很好(其中all_tags包含我感兴趣的标签作为键,schema.org等同于值):
RDF::RDFa::Reader.open(url) do |reader|
reader.each_statement do |statement|
tag = statement.predicate.to_s.split('/')[-1].split('#')[-1]
Rails.logger.debug "rdf tag: #{tag}"
Rails.logger.debug "rdf predicate: #{statement.predicate}"
if all_tags.keys.include? tag
Rails.logger.debug "Found mapping for #{statement.predicate} and #{all_tags[tag]}"
results[all_tags[tag]] = statement.object.to_s.strip
end
end
end
Run Code Online (Sandbox Code Playgroud)