我有一大堆JSON数据格式如下:
[
[{
"created_at": "2017-04-28T16:52:36Z","as_of": "2017-04-28T17:00:05Z","trends": [{
"url": "http://twitter.com/search?q=%23ChavezSigueCandanga","query": "%23ChavezSigueCandanga","tweet_volume": 44587,"name": "#ChavezSigueCandanga","promoted_content": null
},{
"url": "http://twitter.com/search?q=%2327Abr","query": "%2327Abr","tweet_volume": 79781,"name": "#27Abr","promoted_content": null
}],"locations": [{
"woeid": 395277,"name": "Turmero"
}]
}],[{
"created_at": "2017-04-28T16:57:35Z","as_of": "2017-04-28T17:00:03Z","trends": [{
"url": "http://twitter.com/search?q=%23fyrefestival","query": "%23fyrefestival","tweet_volume": 141385,"name": "#fyrefestival",{
"url": "http://twitter.com/search?q=%23HotDocs17","query": "%23HotDocs17","tweet_volume": null,"name": "#HotDocs17","locations": [{
"woeid": 9807,"name": "Vancouver"
}]
}]
]...
我编写了一个函数,将其格式化为采用以下形式的pandas数据框:
+----+--------------------------------+------------------+----------------------------------+--------------+--------------------------------------------------------------+----------------------+----------------------+---------------+----------------+
| | name | promoted_content | query | tweet_volume | url | as_of | created_at | location_name | location_woeid |
+----+--------------------------------+------------------+----------------------------------+--------------+--------------------------------------------------------------+----------------------+----------------------+---------------+----------------+
| 47 | #BatesMotel | | %23BatesMotel | 59748 | http://twitter.com/search?q=%23BatesMotel | 2017-04-25T17:00:05Z | 2017-04-25T16:53:43Z | Winnipeg | 2972 |
| 48 | #AdviceForPeopleJoiningTwitter | | %23AdviceForPeopleJoiningTwitter | 51222 | http://twitter.com/search?q=%23AdviceForPeopleJoiningTwitter | 2017-04-25T17:00:05Z | 2017-04-25T16:53:43Z | Winnipeg | 2972 |
| 49 | #CADTHSymp | | %23CADTHSymp | | http://twitter.com/search?q=%23CADTHSymp | 2017-04-25T17:00:05Z | 2017-04-25T16:53:43Z | Winnipeg | 2972 |
| 0 | #WorldPenguinDay | | %23WorldPenguinDay | 79006 | http://twitter.com/search?q=%23WorldPenguinDay | 2017-04-25T17:00:05Z | 2017-04-25T16:58:22Z | Toronto | 4118 |
| 1 | #TravelTuesday | | %23TravelTuesday | | http://twitter.com/search?q=%23TravelTuesday | 2017-04-25T17:00:05Z | 2017-04-25T16:58:22Z | Toronto | 4118 |
| 2 | #DigitalLeap | | %23DigitalLeap | | http://twitter.com/search?q=%23DigitalLeap | 2017-04-25T17:00:05Z | 2017-04-25T16:58:22Z | Toronto | 4118 |
| … | … | … | … | … | … | … | … | … | … |
| 0 | #nusnc17 | | %23nusnc17 | | http://twitter.com/search?q=%23nusnc17 | 2017-04-25T17:00:05Z | 2017-04-25T16:58:24Z | Birmingham | 12723 |
| 1 | #WorldPenguinDay | | %23WorldPenguinDay | 79006 | http://twitter.com/search?q=%23WorldPenguinDay | 2017-04-25T17:00:05Z | 2017-04-25T16:58:24Z | Birmingham | 12723 |
| 2 | #littleboyblue | | %23littleboyblue | 20772 | http://twitter.com/search?q=%23littleboyblue | 2017-04-25T17:00:05Z | 2017-04-25T16:58:24Z | Birmingham | 12723 |
+----+--------------------------------+------------------+----------------------------------+--------------+--------------------------------------------------------------+----------------------+----------------------+---------------+----------------+
这是将JSON写入DataFrame的函数:
def trends_to_dataframe(data):
df = pd.DataFrame()
for location in data:
temp_df = pd.DataFrame()
for trend in location[0]['trends']:
temp_df = temp_df.append(pd.Series(trend),ignore_index=True)
temp_df['as_of'] = location[0]['as_of']
temp_df['created_at'] = location[0]['created_at']
temp_df['location_name'] = location[0]['locations'][0]['name']
temp_df['location_woeid'] = location[0]['locations'][0]['woeid']
df = df.append(temp_df)
return df
不幸的是,由于我拥有的数据量(以及我测试过的一些简单的计时器),这将需要大约4个小时才能完成.有关如何提高速度的任何想法?
最佳答案
您可以通过使用concurrent.futures异步展平数据来加快速度,然后将其全部加载到带有from_records的DataFrame中.
from concurrent.futures import ThreadPoolExecutor
def get_trends(location):
trends = []
for trend in location[0]['trends']:
trend['as_of'] = location[0]['as_of']
trend['created_at'] = location[0]['created_at']
trend['location_name'] = location[0]['locations'][0]['name']
trend['location_woeid'] = location[0]['locations'][0]['woeid']
trends.append(trend)
return trends
flat_data = []
with ThreadPoolExecutor() as executor:
for location in data:
flat_data += get_trends(location)
df = pd.DataFrame.from_records(flat_data)