I am writing a program that requires me to generate a very large JSON file. I know the traditional way is to dump a list of dictionaries with json.dump(), but the list is so large that even the total memory plus swap space cannot hold it before the dump. Is there any way to stream it into the JSON file, i.e. write the data into the JSON file incrementally?
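For reference, the conventional approach I mean looks roughly like the sketch below (the file name and record contents are just placeholders); the whole list has to exist in memory before json.dump() is called, which is exactly what fails here.

import json

# Build the entire list in memory, then dump it in one shot.
# With a truly large dataset this list alone can exhaust memory + swap.
records = [{'id': i, 'value': i * 2} for i in range(10)]  # placeholder data

with open('output.json', 'w') as f:
    json.dump(records, f)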
Solution
I know this is a year late, but the question is still relevant, and I am surprised that json.iterencode() has not been mentioned.
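For anyone who has not used it, json.JSONEncoder().iterencode() is itself a generator that yields the JSON output in small fragments rather than building one big string. A minimal illustration (Python 3 syntax here):

import json

encoder = json.JSONEncoder()
for piece in encoder.iterencode([{'a': 1}, {'b': 2}]):
    print(piece)  # fragments such as '[', '{', '"a"', ': ', '1', '}', ...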
The potential problem with iterencode here is that you want to process the large dataset iteratively by using a generator, but json encoding will not serialize generators.
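To see the problem concretely, hand a generator straight to the encoder and iterate over it: the encoder does not recognise the generator type and raises a TypeError (the exact message varies by Python version):

import json

def gen():
    for i in range(3):
        yield {'hello_world': i}

try:
    for piece in json.JSONEncoder().iterencode(gen()):
        pass
except TypeError as e:
    print(e)  # e.g. "Object of type generator is not JSON serializable"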
The way around this is to subclass the list type and override the __iter__ magic method so that it yields the generator's output.
Here is an example of such a list subclass.
class StreamArray(list):
    """
    Converts a generator into a list object that can be JSON serialised
    while still retaining the iterative nature of a generator.

    i.e. it converts the generator to a list without having to exhaust the
    generator and keep its contents in memory.
    """
    def __init__(self, generator):
        self.generator = generator
        self._len = 1

    def __iter__(self):
        self._len = 0
        for item in self.generator:
            yield item
            self._len += 1

    def __len__(self):
        """
        The json encoder looks for this method to confirm whether or not
        the object can be encoded.
        """
        return self._len
From here, usage is very simple. Get a handle on the generator, pass it into the StreamArray class, pass the stream array object to iterencode(), and iterate over the chunks. The chunks are JSON-formatted output that can be written directly to file.
Example usage:
import json

# Function that will iteratively generate a large set of data.
def large_list_generator_func():
    for i in xrange(5):
        chunk = {'hello_world': i}
        print 'Yielding chunk: ', chunk
        yield chunk

# Write the contents to file:
with open('/tmp/streamed_write.json', 'w') as outfile:
    large_generator_handle = large_list_generator_func()
    stream_array = StreamArray(large_generator_handle)

    for chunk in json.JSONEncoder().iterencode(stream_array):
        print 'Writing chunk: ', chunk
        outfile.write(chunk)
Yielding chunk:  {'hello_world': 0}
Writing chunk:  [
Writing chunk:  {
Writing chunk:  "hello_world"
Writing chunk:  :
Writing chunk:  0
Writing chunk:  }
Yielding chunk:  {'hello_world': 1}
Writing chunk:  ,
Writing chunk:  {
Writing chunk:  "hello_world"
Writing chunk:  :
Writing chunk:  1
Writing chunk:  }
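The code above is written for Python 2 (xrange and print statements). A rough Python 3 sketch of the same technique follows; only the syntax changes, the StreamArray idea is identical:

import json

class StreamArray(list):
    # Wraps a generator so the encoder treats it as a non-empty list
    # without materialising its contents in memory.
    def __init__(self, generator):
        self.generator = generator
        self._len = 1

    def __iter__(self):
        self._len = 0
        for item in self.generator:
            yield item
            self._len += 1

    def __len__(self):
        return self._len

def large_list_generator_func():
    for i in range(5):
        yield {'hello_world': i}

with open('/tmp/streamed_write.json', 'w') as outfile:
    stream_array = StreamArray(large_list_generator_func())
    for chunk in json.JSONEncoder().iterencode(stream_array):
        outfile.write(chunk)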