Bru*_*oij 7 — python, database-design, rdf, freebase, triplestore
I admit this is basically a duplicate of "Use freebase data on local server?", but I need a more detailed answer than what has already been given there.

I have fallen completely in love with Freebase. What I want now is essentially to create a very simple Freebase clone, for storing content that may not belong in Freebase itself but that can be described using the Freebase schema. Basically, I want a simple and elegant way to store data just like Freebase does, and to be able to use that data easily from a Python (CherryPy) web application.

Chapter 2 of the MQL reference guide states:

The database that underlies Metaweb is fundamentally different from the relational databases you may be familiar with. Relational databases store data in the form of tables, but the Metaweb database stores data as a graph of nodes and the relationships between those nodes.

I suppose that means I should use either a triplestore or a graph database such as Neo4j? Does anybody here have experience using either of those from a Python environment?

(What I have actually tried so far is to create a relational database schema that could easily store Freebase topics, but I ran into problems configuring the mappings in SQLAlchemy.)
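A minimal sketch of what the triplestore route can look like from Python, using rdflib — the library choice, the namespace URI, and the sample values here are assumptions for illustration, not something from the question:

from rdflib import Graph, Namespace, Literal

# Hypothetical namespace for Freebase-style identifiers.
FB = Namespace('http://rdf.freebase.com/ns/')

g = Graph()
# Store a couple of Freebase-like statements as (subject, property, object) triples.
g.add((FB['m.0cwtm'], FB['type.object.name'], Literal('Some topic')))
g.add((FB['m.0cwtm'], FB['type.object.type'], FB['people.person']))

# Enumerate everything the graph knows about the node.
for p, o in g.predicate_objects(FB['m.0cwtm']):
    print p, o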
Things I am looking into:

Update [28/12/2011]:

I found a post on the Freebase blog describing the proprietary tuple store/database that Freebase itself uses (graphd): http://blog.freebase.com/2008/04/09/a-brief-tour-of-graphd/
pep*_*_bg 10
This is what worked for me. It lets you load the full Freebase dump into a standard MySQL installation on less than 100GB of disk. The key is understanding the layout of the data in the dump and then transforming it (optimizing for space and speed).

Freebase concepts you should understand before you attempt to use this (all taken from the documentation):

Some other important Freebase details:
the root namespace can be queried with [{'id':'/','mid':null}]; some mids denote topics ('/m/0cwtm' is a human being), while an entity like '/m/03lmb2f' of type '/film/performance' is not a topic (I chose to treat these as blank nodes in RDF, although that may not be philosophically accurate), and '/m/04y78wb' of type '/film/director' (etc.) is (see the Python code at the bottom)
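For context, a root-namespace MQL query like the one above could be issued from Python against the public mqlread service. The endpoint and envelope format below are taken from the old API documentation and the service has since been retired, so treat this purely as illustration:

import json
import urllib

# Hedged sketch: api.freebase.com and its JSON envelope are historical
# assumptions; the public mqlread service no longer exists.
query = [{'id': '/', 'mid': None}]   # MQL null == Python None
envelope = json.dumps({'query': query})
url = 'http://api.freebase.com/api/service/mqlread?query=' + urllib.quote(envelope)
response = json.load(urllib.urlopen(url))
print response['result']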
TRANSFORM 1 (from the shell; split links from namespaced entries, ignoring notable_for and non-/lang/en text):
python parse.py freebase.tsv #end up with freebase_links.tsv and freebase_ns.tsv
TRANSFORM 2 (from the Python console; split freebase_ns.tsv into freebase_ns_types.tsv and freebase_ns_props.tsv, plus 15 other files that we ignore for now):
import e
e.split_external_keys( 'freebase_ns.tsv' )
TRANSFORM 3 (from the Python console; convert properties and destinations to mids):
import e
ns = e.get_namespaced_data( 'freebase_ns_types.tsv' )
e.replace_property_and_destination_with_mid( 'freebase_links.tsv', ns ) #produces freebase_links_pdmids.tsv
e.replace_property_with_mid( 'freebase_ns_props.tsv', ns ) #produces freebase_ns_props_pmids.tsv
TRANSFORM 4 (from the MySQL console; load freebase_links_pdmids.tsv, freebase_ns_props_pmids.tsv and freebase_ns_base_plus_types.tsv into the database):
CREATE TABLE links(
  source      VARCHAR(20),
  property    VARCHAR(20),
  destination VARCHAR(20),
  value       VARCHAR(1)
) ENGINE=MyISAM CHARACTER SET utf8;

CREATE TABLE ns(
  source      VARCHAR(20),
  property    VARCHAR(20),
  destination VARCHAR(40),
  value       VARCHAR(255)
) ENGINE=MyISAM CHARACTER SET utf8;

CREATE TABLE types(
  source      VARCHAR(20),
  property    VARCHAR(40),
  destination VARCHAR(40),
  value       VARCHAR(40)
) ENGINE=MyISAM CHARACTER SET utf8;

LOAD DATA LOCAL INFILE "/data/freebase_links_pdmids.tsv" INTO TABLE links FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INFILE "/data/freebase_ns_props_pmids.tsv" INTO TABLE ns FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INFILE "/data/freebase_ns_base_plus_types.tsv" INTO TABLE types FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';

CREATE INDEX links_source ON links (source) USING BTREE;
CREATE INDEX ns_source ON ns (source) USING BTREE;
CREATE INDEX ns_value ON ns (value) USING BTREE;
CREATE INDEX types_source ON types (source) USING BTREE;
CREATE INDEX types_destination_value ON types (destination, value) USING BTREE;
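With the tables loaded and indexed, fetching everything known about a topic is a couple of indexed lookups. A minimal sketch, assuming the MySQLdb driver and a local database named freebase (both are my assumptions, not part of the original setup):

import MySQLdb

# Hypothetical connection parameters; adjust to your installation.
db = MySQLdb.connect(host='localhost', user='root', db='freebase', charset='utf8')
cursor = db.cursor()

mid = '/m/0cwtm'  # example mid from the notes above

# Graph edges out of the topic (links table).
cursor.execute('SELECT property, destination FROM links WHERE source = %s', (mid,))
for prop, dest in cursor.fetchall():
    print prop, '->', dest

# Literal values attached to the topic (ns table).
cursor.execute('SELECT property, value FROM ns WHERE source = %s', (mid,))
for prop, value in cursor.fetchall():
    print prop, '=', value

db.close()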
Save this as e.py:
import sys

#returns a dict to be used by mid(...) and replace_property_and_destination_with_mid(...) below
def get_namespaced_data( file_name ):
    f = open( file_name )
    result = {}
    for line in f:
        elements = line[:-1].split('\t')
        if len( elements ) < 4:
            print 'Skip...'
            continue
        result[(elements[2], elements[3])] = elements[0]
    return result

#runs out of memory
def load_links( file_name ):
    f = open( file_name )
    result = {}
    for line in f:
        if len( result ) % 1000000 == 0:
            print len(result)
        elements = line[:-1].split('\t')
        src, prop, dest = elements[0], elements[1], elements[2]
        if result.get( src, False ):
            if result[ src ].get( prop, False ):
                result[ src ][ prop ].append( dest )
            else:
                result[ src ][ prop ] = [dest]
        else:
            result[ src ] = dict([( prop, [dest] )])
    return result

#same as load_links but for the namespaced data
def load_ns( file_name ):
    f = open( file_name )
    result = {}
    for line in f:
        if len( result ) % 1000000 == 0:
            print len(result)
        elements = line[:-1].split('\t')
        src, prop, value = elements[0], elements[1], elements[3]
        if result.get( src, False ):
            if result[ src ].get( prop, False ):
                result[ src ][ prop ].append( value )
            else:
                result[ src ][ prop ] = [value]
        else:
            result[ src ] = dict([( prop, [value] )])
    return result

def links_in_set( file_name ):
    f = open( file_name )
    result = set()
    for line in f:
        elements = line[:-1].split('\t')
        result.add( elements[0] )
    return result

#resolves a namespace/key path to its mid via the ns dict built by get_namespaced_data(...)
def mid( key, ns ):
    if key == '':
        return False
    elif key == '/':
        key = '/boot/root_namespace'
    parts = key.split('/')
    if len(parts) == 1: #cover the case of something which doesn't start with '/'
        print key
        return False
    if parts[1] == 'm': #already a mid
        return key
    namespace = '/'.join(parts[:-1])
    key = parts[-1]
    return ns.get( (namespace, key), False )

def replace_property_and_destination_with_mid( file_name, ns ):
    fn = file_name.split('.')[0]
    f = open( file_name )
    f_out_mids = open(fn+'_pdmids'+'.tsv', 'w')

    def convert_to_mid_if_possible( value ):
        m = mid( value, ns )
        if m: return m
        else: return None

    counter = 0
    for line in f:
        elements = line[:-1].split('\t')
        md = convert_to_mid_if_possible(elements[1])
        dest = convert_to_mid_if_possible(elements[2])
        if md and dest:
            elements[1] = md
            elements[2] = dest
            f_out_mids.write( '\t'.join(elements)+'\n' )
        else:
            counter += 1
    f_out_mids.close()
    print 'Skipped: ' + str( counter )

def replace_property_with_mid( file_name, ns ):
    fn = file_name.split('.')[0]
    f = open( file_name )
    f_out_mids = open(fn+'_pmids'+'.tsv', 'w')

    def convert_to_mid_if_possible( value ):
        m = mid( value, ns )
        if m: return m
        else: return None

    for line in f:
        elements = line[:-1].split('\t')
        md = convert_to_mid_if_possible(elements[1])
        if md:
            elements[1] = md
            f_out_mids.write( '\t'.join(elements)+'\n' )
        else:
            #print 'Skipping ' + elements[1]
            pass
    f_out_mids.close()

#cPickle snippets for caching the ns dict between sessions:
#ns = e.get_namespaced_data('freebase_2.tsv')
#import cPickle
#cPickle.dump( ns, open('ttt.dump','wb'), protocol=2 )
#ns = cPickle.load( open('ttt.dump','rb') )
#fn = '/m/0'
#n = fn.split('/')[2]
#dir = n[:-1]

def is_mid( value ):
    parts = value.split('/')
    if len(parts) == 1: #it doesn't start with '/'
        return False
    if parts[1] == 'm':
        return True
    return False

def check_if_property_or_destination_are_mid( file_name ):
    f = open( file_name )
    for line in f:
        elements = line[:-1].split('\t')
        #if is_mid( elements[1] ) or is_mid( elements[2] ):
        if is_mid( elements[1] ):
            print line

#splits freebase_ns.tsv into per-namespace files; only the _props and _types
#outputs are used by the later transforms
def split_external_keys( file_name ):
    fn = file_name.split('.')[0]
    f = open( file_name )
    f_out_extkeys = open(fn+'_extkeys' + '.tsv', 'w')
    f_out_intkeys = open(fn+'_intkeys' + '.tsv', 'w')
    f_out_props = open(fn+'_props' + '.tsv', 'w')
    f_out_types = open(fn+'_types' + '.tsv', 'w')
    f_out_m = open(fn+'_m' + '.tsv', 'w')
    f_out_src = open(fn+'_src' + '.tsv', 'w')
    f_out_usr = open(fn+'_usr' + '.tsv', 'w')
    f_out_base = open(fn+'_base' + '.tsv', 'w')
    f_out_blg = open(fn+'_blg' + '.tsv', 'w')
    f_out_bus = open(fn+'_bus' + '.tsv', 'w')
    f_out_soft = open(fn+'_soft' + '.tsv', 'w')
    f_out_uri = open(fn+'_uri' + '.tsv', 'w')
    f_out_quot = open(fn+'_quot' + '.tsv', 'w')
    f_out_frb = open(fn+'_frb' + '.tsv', 'w')
    f_out_tag = open(fn+'_tag' + '.tsv', 'w')
    f_out_guid = open(fn+'_guid' + '.tsv', 'w')
    f_out_dtwrld = open(fn+'_dtwrld' + '.tsv', 'w')
    for line in f:
        elements = line[:-1].split('\t')
        parts_2 = elements[2].split('/')
        if len(parts_2) == 1: #the blank destination elements - '', plus the root domain ones
            if elements[1] == '/type/object/key':
                f_out_types.write( line )
            else:
                f_out_props.write( line )
        elif elements[2] == '/lang/en':
            f_out_props.write( line )
        elif (parts_2[1] == 'wikipedia' or parts_2[1] == 'authority') and len( parts_2 ) > 2:
            f_out_extkeys.write( line )
        elif parts_2[1] == 'm':
            f_out_m.write( line )
        elif parts_2[1] == 'en':
            f_out_intkeys.write( line )
        elif parts_2[1] == 'source' and len( parts_2 ) > 2:
            f_out_src.write( line )
        elif parts_2[1] == 'user':
            f_out_usr.write( line )
        elif parts_2[1] == 'base' and len( parts_2 ) > 2:
            if elements[1] == '/type/object/key':
                f_out_types.write( line )
            else:
                f_out_base.write( line )
        elif parts_2[1] == 'biology' and len( parts_2 ) > 2:
            f_out_blg.write( line )
        elif parts_2[1] == 'business' and len( parts_2 ) > 2:
            f_out_bus.write( line )
        elif parts_2[1] == 'soft' and len( parts_2 ) > 2:
            f_out_soft.write( line )
        elif parts_2[1] == 'uri':
            f_out_uri.write( line )
        elif parts_2[1] == 'quotationsbook' and len( parts_2 ) > 2:
            f_out_quot.write( line )
        elif parts_2[1] == 'freebase' and len( parts_2 ) > 2:
            f_out_frb.write( line )
        elif parts_2[1] == 'tag' and len( parts_2 ) > 2:
            f_out_tag.write( line )
        elif parts_2[1] == 'guid' and len( parts_2 ) > 2:
            f_out_guid.write( line )
        elif parts_2[1] == 'dataworld' and len( parts_2 ) > 2:
            f_out_dtwrld.write( line )
        else:
            f_out_types.write( line )
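To make the moving parts of e.py concrete, here is a small usage sketch from the Python console (the file name follows the transforms above; output depends on your dump):

import e

# Build the (namespace, key) -> mid lookup from the types output of TRANSFORM 2.
ns = e.get_namespaced_data( 'freebase_ns_types.tsv' )

# Resolve a schema key to its mid (this is what TRANSFORM 3 does for every row);
# prints the mid, or False if the key is not in the lookup.
print e.mid( '/type/object/name', ns )

# Mids pass through unchanged.
print e.mid( '/m/0cwtm', ns )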
Save this as parse.py:
import sys

def parse_freebase_quadruple_tsv_file( file_name ):
    fn = file_name.split('.')[0]
    f = open( file_name )
    f_out_links = open(fn+'_links'+'.tsv', 'w')
    f_out_ns = open(fn+'_ns' +'.tsv', 'w')
    for line in f:
        elements = line[:-1].split('\t')
        if len( elements ) < 4:
            print 'Skip...'
            continue
        #print 'Processing ' + str( elements )
        #cases described here http://wiki.freebase.com/wiki/Data_dumps
        if elements[1].endswith('/notable_for'): #ignore notable_for, it has JSON in it
            continue
        elif elements[2] and not elements[3]: #case 1, linked
            f_out_links.write( line )
        elif not (elements[2].startswith('/lang/') and elements[2] != '/lang/en'): #ignore languages other than English
            f_out_ns.write( line )
    f_out_links.close()
    f_out_ns.close()

if len(sys.argv[1:]) == 0:
    print 'Pass a list of .tsv filenames'

for file_name in sys.argv[1:]:
    parse_freebase_quadruple_tsv_file( file_name )
Standard disclaimers apply here. I did this a couple of months ago. I believe it is mostly correct, but I apologize if my notes missed something. Unfortunately the project I needed it for fell through, but I hope this helps someone else. If anything is unclear, post a comment here.
My 2 cents...

I used some Java code to convert the Freebase data dump to RDF: https://github.com/castagna/freebase2rdf

I used Apache Jena's TDB store to load the RDF data, and Fuseki to serve it over HTTP via the SPARQL protocol.
See also:
SPARQL is the query language for querying RDF; it lets you write SQL-like queries. Most RDF databases implement a SPARQL interface. In addition, Freebase lets you export its data as RDF, so you can use that data directly in an RDF database and query it with SPARQL.

I would take a look at this tutorial to get a better understanding of SPARQL.

If you are going to handle a big dataset like Freebase, I would use 4store together with any of the Python clients. 4store exposes SPARQL over HTTP, so you can issue HTTP requests to assert, remove and query data. It also returns result sets as JSON, which is really handy from Python. I have used this infrastructure in a couple of projects — not with CherryPy but with Django — but I suppose that difference does not matter.
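Both setups above (Fuseki or 4store) speak the standard SPARQL protocol, so querying from Python is a plain HTTP GET. A minimal sketch — the endpoint URL and the Freebase RDF namespace are assumptions you would adapt to your installation:

import json
import urllib
import urllib2

# Hypothetical endpoint: Fuseki typically serves /<dataset>/query,
# 4store serves /sparql/ -- adjust to your installation.
endpoint = 'http://localhost:3030/ds/query'

query = """
SELECT ?p ?o
WHERE { <http://rdf.freebase.com/ns/m.0cwtm> ?p ?o }
LIMIT 10
"""

params = urllib.urlencode({'query': query, 'output': 'json'})
request = urllib2.Request(endpoint + '?' + params,
                          headers={'Accept': 'application/sparql-results+json'})
results = json.load(urllib2.urlopen(request))

# Print each (property, object) binding from the JSON result set.
for row in results['results']['bindings']:
    print row['p']['value'], row['o']['value']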