I have a large file (> 50 MB) that contains a JSON hash. Something like:
{
  "obj1": {
    "key1": "val1",
    "key2": "val2"
  },
  "obj2": {
    "key1": "val1",
    "key2": "val2"
  }
  ...
}
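Rather than parsing the entire file and then taking, say, the first ten elements, I'd like to parse each item in the hash one at a time. I don't actually care about the key, i.e. obj1.
If I convert the above to this: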
{
  "key1": "val1",
  "key2": "val2"
}
{
  "key1": "val1",
  "key2": "val2"
}
I can easily achieve what I want using Yajl streaming:
require 'yajl'

io = File.open(path_to_file)
count = 10
# The block is called once per complete top-level JSON object in the stream.
Yajl::Parser.parse(io) do |obj|
  puts "Parsed: #{obj}"
  count -= 1
  break if count == 0
end
io.close
Is there a way to do this without having to alter the file? Perhaps some kind of callback in Yajl?
rai*_*inz 12
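In the end I solved this using JSON::Stream, which has callbacks for start_document, start_object, and so on.
I gave my 'parser' a to_enum method, which emits all the 'Resource' objects as they are parsed. Note that ResourceCollectionNode is never really used unless you parse the JSON stream completely, and ResourceNode is just a subclass of ObjectNode for naming purposes, though I might just get rid of it: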
require 'json/stream'

class Parser
  METHODS = %w[start_document end_document start_object end_object start_array end_array key value]

  attr_reader :result

  def initialize(io, chunk_size = 1024)
    @io = io
    @chunk_size = chunk_size
    @parser = JSON::Stream::Parser.new
    # register callback methods
    METHODS.each do |name|
      @parser.send(name, &method(name))
    end
  end

  # Returns an Enumerator that feeds the IO to the parser chunk by chunk
  # and yields each second-level object as soon as it is complete.
  def to_enum
    Enumerator.new do |yielder|
      @yielder = yielder
      begin
        until @io.eof?
          # puts "READING CHUNK"
          chunk = @io.read(@chunk_size)
          @parser << chunk
        end
      ensure
        @yielder = nil
      end
    end
  end

  def start_document
    @stack = []
    @result = nil
  end

  def end_document
    # @result = @stack.pop.obj
  end

  # The stack depth decides which kind of node an opening brace starts.
  def start_object
    if @stack.size == 0
      @stack.push(ResourceCollectionNode.new)
    elsif @stack.size == 1
      @stack.push(ResourceNode.new)
    else
      @stack.push(ObjectNode.new)
    end
  end

  def end_object
    if @stack.size == 2
      # A complete resource: store it in the collection and emit it.
      node = @stack.pop
      # puts "Stack depth: #{@stack.size}. Node: #{node.class}"
      @stack[-1] << node.obj
      # puts "Parsed complete resource: #{node.obj}"
      @yielder << node.obj
    elsif @stack.size == 1
      # puts "Parsed all resources"
      @result = @stack.pop.obj
    else
      node = @stack.pop
      # puts "Stack depth: #{@stack.size}. Node: #{node.class}"
      @stack[-1] << node.obj
    end
  end

  def end_array
    node = @stack.pop
    @stack[-1] << node.obj
  end

  def start_array
    @stack.push(ArrayNode.new)
  end

  def key(key)
    # puts "Stack depth: #{@stack.size} KEY: #{key}"
    @stack[-1] << key
  end

  def value(value)
    node = @stack[-1]
    node << value
  end

  # Collects alternating key/value pushes into a Hash.
  class ObjectNode
    attr_reader :obj

    def initialize
      @obj, @key = {}, nil
    end

    def <<(node)
      if @key
        @obj[@key] = node
        @key = nil
      else
        @key = node
      end
      self
    end
  end

  class ResourceNode < ObjectNode
  end

  # Node that contains all the resources - a Hash keyed by url
  class ResourceCollectionNode < ObjectNode
    def <<(node)
      if @key
        @obj[@key] = node
        # puts "Completed Resource: #{@key} => #{node}"
        @key = nil
      else
        @key = node
      end
      self
    end
  end

  class ArrayNode
    attr_reader :obj

    def initialize
      @obj = []
    end

    def <<(node)
      @obj << node
      self
    end
  end
end
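To see the depth-tracking idea on its own, without the gem, here is a minimal sketch that replays a hand-written list of events. The EVENTS list and the method names second_level_objects and store are made up for illustration; they only stand in for the callbacks a streaming parser would fire. Unlike the class above, the sketch does not also keep the finished objects in the root hash.

```ruby
# Hypothetical events for: {"a": {"k": 1}, "b": {"k": [2, 3]}}
EVENTS = [
  [:start_object],
  [:key, "a"], [:start_object], [:key, "k"], [:value, 1], [:end_object],
  [:key, "b"], [:start_object], [:key, "k"],
  [:start_array], [:value, 2], [:value, 3], [:end_array],
  [:end_object],
  [:end_object]
].freeze

# Puts a finished value into the innermost open container.
def store(stack, keys, value)
  parent = stack.last
  if parent.is_a?(Hash)
    parent[keys.last] = value
  else
    parent << value
  end
end

# Replays the events and collects every object that is a direct child
# of the root object.
def second_level_objects(events)
  stack = []   # containers that are currently open
  keys  = []   # pending key for each open container (unused for arrays)
  found = []

  events.each do |type, arg|
    case type
    when :start_object
      stack.push({})
      keys.push(nil)
    when :start_array
      stack.push([])
      keys.push(nil)
    when :key
      keys[-1] = arg
    when :value
      store(stack, keys, arg)
    when :end_object, :end_array
      done = stack.pop
      keys.pop
      next if stack.empty?              # the root object itself closed
      if stack.size == 1 && type == :end_object
        found << done                   # direct child of the root
      else
        store(stack, keys, done)        # nested deeper: attach to parent
      end
    end
  end
  found
end

second_level_objects(EVENTS).each { |object| p object }
```

In a real streaming setup the `found << done` line is where you would hand the object to the caller instead of collecting it.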
And an example of use:
require 'stringio'
require 'pp'

def json
  <<-EOJ
  {
    "1": {
      "url": "url_1",
      "title": "title_1",
      "http_req": {
        "status": 200,
        "time": 10
      }
    },
    "2": {
      "url": "url_2",
      "title": "title_2",
      "http_req": {
        "status": 404,
        "time": -1
      }
    },
    "3": {
      "url": "url_1",
      "title": "title_1",
      "http_req": {
        "status": 200,
        "time": 10
      }
    },
    "4": {
      "url": "url_2",
      "title": "title_2",
      "http_req": {
        "status": 404,
        "time": -1
      }
    },
    "5": {
      "url": "url_1",
      "title": "title_1",
      "http_req": {
        "status": 200,
        "time": 10
      }
    },
    "6": {
      "url": "url_2",
      "title": "title_2",
      "http_req": {
        "status": 404,
        "time": -1
      }
    }
  }
  EOJ
end

io = StringIO.new(json)
resource_parser = Parser.new(io, 100)
count = 0
resource_parser.to_enum.each do |resource|
  count += 1
  puts "READ: #{count}"
  pp resource
  break # stop after the first resource; the rest of the input is not read
end
io.close
Output:
READ: 1
{"url"=>"url_1", "title"=>"title_1", "http_req"=>{"status"=>200, "time"=>10}}
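The reason the break above stops the reading early is that an Enumerator only runs its block as far as the consumer asks. A small stdlib-only sketch of that behaviour, using lines instead of JSON objects (each_line_in_chunks is a made-up helper for illustration, not part of any gem):

```ruby
require 'stringio'

# Reads the IO in small chunks and yields one line at a time.
def each_line_in_chunks(io, chunk_size)
  Enumerator.new do |yielder|
    buffer = +""
    until io.eof?
      buffer << io.read(chunk_size)
      # Emit every complete line currently sitting in the buffer.
      while (newline = buffer.index("\n"))
        yielder << buffer.slice!(0..newline).chomp
      end
    end
    yielder << buffer unless buffer.empty?
  end
end

io = StringIO.new("one\ntwo\nthree\nfour\n")
first_line = each_line_in_chunks(io, 4).first
puts first_line  # the first line only
puts io.pos      # how far the reader got before the consumer stopped
```

Here only the first 4-byte chunk is consumed, so io.pos stays well short of the end of the input. The same applies to the parser's to_enum: once the consumer breaks, no further chunks are read.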
I ran into the same problem and created the gem json-streamer, which saves you from having to write your own callbacks.
Usage for your case would be (v 0.4.0):
require 'json/streamer'

io = File.open(path_to_file)
streamer = Json::Streamer::JsonStreamer.new(io)
streamer.get(nesting_level:1).each do |object|
  p object
end
io.close
Applied to your example, it yields the objects without the 'obj' keys:
{
  "key1": "val1",
  "key2": "val2"
}
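For a document small enough to load whole, the set of objects described above is the same as the values of the top-level hash. The following is not streaming and does not use the gem; it is only a stdlib sketch of which objects to expect, and top_level_values is a made-up name:

```ruby
require 'json'

# Non-streaming reference: parse everything, then drop the top-level keys.
def top_level_values(json_text)
  JSON.parse(json_text).values
end

doc = '{"obj1": {"key1": "val1", "key2": "val2"}, ' \
      '"obj2": {"key1": "val1", "key2": "val2"}}'

top_level_values(doc).each { |object| p object }
```

This can be handy as a check on a small sample file before pointing a streaming parser at the 50 MB one.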