python – Speeding up JSON-to-DataFrame conversion with a lot of data manipulation

I have a large amount of JSON data in the following format:

[
    [{
        "created_at": "2017-04-28T16:52:36Z","as_of": "2017-04-28T17:00:05Z","trends": [{
            "url": "http://twitter.com/search?q=%23ChavezSigueCandanga","query": "%23ChavezSigueCandanga","tweet_volume": 44587,"name": "#ChavezSigueCandanga","promoted_content": null
        },{
            "url": "http://twitter.com/search?q=%2327Abr","query": "%2327Abr","tweet_volume": 79781,"name": "#27Abr","promoted_content": null
        }],"locations": [{
            "woeid": 395277,"name": "Turmero"
        }]
    }],[{
        "created_at": "2017-04-28T16:57:35Z","as_of": "2017-04-28T17:00:03Z","trends": [{
            "url": "http://twitter.com/search?q=%23fyrefestival","query": "%23fyrefestival","tweet_volume": 141385,"name": "#fyrefestival","promoted_content": null
        },{
            "url": "http://twitter.com/search?q=%23HotDocs17","query": "%23HotDocs17","tweet_volume": null,"name": "#HotDocs17","promoted_content": null
        }],"locations": [{
            "woeid": 9807,"name": "Vancouver"
        }]
    }]
]...

I wrote a function that formats it into a pandas DataFrame of the following form:

+----+--------------------------------+------------------+----------------------------------+--------------+--------------------------------------------------------------+----------------------+----------------------+---------------+----------------+
|    |              name              | promoted_content |              query               | tweet_volume |                             url                              |        as_of         |      created_at      | location_name | location_woeid |
+----+--------------------------------+------------------+----------------------------------+--------------+--------------------------------------------------------------+----------------------+----------------------+---------------+----------------+
| 47 | #BatesMotel                    |                  | %23BatesMotel                    | 59748        | http://twitter.com/search?q=%23BatesMotel                    | 2017-04-25T17:00:05Z | 2017-04-25T16:53:43Z | Winnipeg      | 2972           |
| 48 | #AdviceForPeopleJoiningTwitter |                  | %23AdviceForPeopleJoiningTwitter | 51222        | http://twitter.com/search?q=%23AdviceForPeopleJoiningTwitter | 2017-04-25T17:00:05Z | 2017-04-25T16:53:43Z | Winnipeg      | 2972           |
| 49 | #CADTHSymp                     |                  | %23CADTHSymp                     |              | http://twitter.com/search?q=%23CADTHSymp                     | 2017-04-25T17:00:05Z | 2017-04-25T16:53:43Z | Winnipeg      | 2972           |
| 0  | #WorldPenguinDay               |                  | %23WorldPenguinDay               | 79006        | http://twitter.com/search?q=%23WorldPenguinDay               | 2017-04-25T17:00:05Z | 2017-04-25T16:58:22Z | Toronto       | 4118           |
| 1  | #TravelTuesday                 |                  | %23TravelTuesday                 |              | http://twitter.com/search?q=%23TravelTuesday                 | 2017-04-25T17:00:05Z | 2017-04-25T16:58:22Z | Toronto       | 4118           |
| 2  | #DigitalLeap                   |                  | %23DigitalLeap                   |              | http://twitter.com/search?q=%23DigitalLeap                   | 2017-04-25T17:00:05Z | 2017-04-25T16:58:22Z | Toronto       | 4118           |
| …  | …                              | …                | …                                | …            | …                                                            | …                    | …                    | …             | …              |
| 0  | #nusnc17                       |                  | %23nusnc17                       |              | http://twitter.com/search?q=%23nusnc17                       | 2017-04-25T17:00:05Z | 2017-04-25T16:58:24Z | Birmingham    | 12723          |
| 1  | #WorldPenguinDay               |                  | %23WorldPenguinDay               | 79006        | http://twitter.com/search?q=%23WorldPenguinDay               | 2017-04-25T17:00:05Z | 2017-04-25T16:58:24Z | Birmingham    | 12723          |
| 2  | #littleboyblue                 |                  | %23littleboyblue                 | 20772        | http://twitter.com/search?q=%23littleboyblue                 | 2017-04-25T17:00:05Z | 2017-04-25T16:58:24Z | Birmingham    | 12723          |
+----+--------------------------------+------------------+----------------------------------+--------------+--------------------------------------------------------------+----------------------+----------------------+---------------+----------------+

Here is the function that writes the JSON into the DataFrame:

import pandas as pd

def trends_to_dataframe(data):
    df = pd.DataFrame()

    for location in data:
        temp_df = pd.DataFrame()

        for trend in location[0]['trends']:
            temp_df = temp_df.append(pd.Series(trend),ignore_index=True)

        temp_df['as_of'] = location[0]['as_of']
        temp_df['created_at'] = location[0]['created_at']
        temp_df['location_name'] = location[0]['locations'][0]['name']
        temp_df['location_woeid'] = location[0]['locations'][0]['woeid']

        df = df.append(temp_df)

    return df

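Unfortunately, given the amount of data I have (and some simple timing tests I ran), this would take about 4 hours to complete. Any ideas on how to speed it up?

Best answer

You can speed this up by flattening the data asynchronously with concurrent.futures and then loading it all into a DataFrame in one step with from_records. Building the DataFrame once avoids the repeated append calls in the original function, each of which copies the whole frame accumulated so far.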

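from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def get_trends(location):
    # Copy the location-level fields onto every trend of this location.
    # Note: this adds keys to the trend dicts in place.
    trends = []
    for trend in location[0]['trends']:
        trend['as_of'] = location[0]['as_of']
        trend['created_at'] = location[0]['created_at']
        trend['location_name'] = location[0]['locations'][0]['name']
        trend['location_woeid'] = location[0]['locations'][0]['woeid']
        trends.append(trend)
    return trends

flat_data = []
with ThreadPoolExecutor() as executor:
    # Hand each location to the pool; map() yields results in input order.
    for trends in executor.map(get_trends, data):
        flat_data += trends

# Build the DataFrame once from the flat list of records.
df = pd.DataFrame.from_records(flat_data)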
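
For comparison, here is a minimal single-threaded sketch of the same flattening idea. Because the work is pure-Python dictionary handling, threads are unlikely to add much on top of it under the GIL, so it may be worth timing this version first. The names `flatten_trends` and `sample` are hypothetical, chosen for illustration, and the sample rows are a shortened copy of the data in the question. Unlike `get_trends`, this version builds new dicts instead of adding keys to the input.

```python
import pandas as pd

def flatten_trends(data):
    """Return one row per trend, with the location-level fields repeated."""
    rows = []
    for location in data:
        snapshot = location[0]            # each entry is a one-element list
        place = snapshot['locations'][0]  # assumes one location per snapshot
        shared = {
            'as_of': snapshot['as_of'],
            'created_at': snapshot['created_at'],
            'location_name': place['name'],
            'location_woeid': place['woeid'],
        }
        for trend in snapshot['trends']:
            # Merge into a new dict so the input data is left unchanged.
            rows.append({**trend, **shared})
    # Construct the DataFrame once, from the complete list of rows.
    return pd.DataFrame.from_records(rows)

# Hypothetical sample, shortened from the question's data.
sample = [
    [{
        'created_at': '2017-04-28T16:52:36Z', 'as_of': '2017-04-28T17:00:05Z',
        'trends': [
            {'name': '#ChavezSigueCandanga', 'tweet_volume': 44587},
            {'name': '#27Abr', 'tweet_volume': 79781},
        ],
        'locations': [{'woeid': 395277, 'name': 'Turmero'}],
    }],
    [{
        'created_at': '2017-04-28T16:57:35Z', 'as_of': '2017-04-28T17:00:03Z',
        'trends': [{'name': '#fyrefestival', 'tweet_volume': 141385}],
        'locations': [{'woeid': 9807, 'name': 'Vancouver'}],
    }],
]

df = flatten_trends(sample)
print(df)
```

The design choice is the same as in the answer above: collect plain dicts in a Python list, which is cheap to grow, and hand them to pandas only once at the end.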
