How to resample a dataframe with different functions applied to each column? - 【StackMirror】|python|numpy|time-series|pandas

I have a times series with temperature and radiation in a pandas dataframe. The time resolution is 1 minute in regular steps.

import datetime
import pandas as pd
import numpy as np

date_times = pd.date_range(datetime.datetime(2012, 4, 5, 8, 0),
                           datetime.datetime(2012, 4, 5, 12, 0),
                           freq='1min')
tamb = np.random.sample(date_times.size) * 10.0
radiation = np.random.sample(date_times.size) * 10.0
frame = pd.DataFrame(data={'tamb': tamb, 'radiation': radiation},
                     index=date_times)
frame
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 241 entries, 2012-04-05 08:00:00 to 2012-04-05 12:00:00
Freq: T
Data columns:
radiation    241  non-null values
tamb         241  non-null values
dtypes: float64(2)

How can I down-sample this dataframe to a resolution of one hour, computing the hourly mean for the temperature and the hourly sum for radiation?

2012-04-04 23:17
by bmu

With pandas 0.18 the resample API changed (see the docs). So for pandas >= 0.18 the answer is:

In [31]: frame.resample('1H').agg({'radiation': np.sum, 'tamb': np.mean})
Out[31]: 
                         tamb   radiation
2012-04-05 08:00:00  5.161235  279.507182
2012-04-05 09:00:00  4.968145  290.941073
2012-04-05 10:00:00  4.478531  317.678285
2012-04-05 11:00:00  4.706206  335.258633
2012-04-05 12:00:00  2.457873    8.655838

Old Answer:

I am answering my question to reflect the time series related changes in pandas >= 0.8 (all other answers are outdated).

Using pandas >= 0.8 the answer is:

In [30]: frame.resample('1H', how={'radiation': np.sum, 'tamb': np.mean})
Out[30]: 
                         tamb   radiation
2012-04-05 08:00:00  5.161235  279.507182
2012-04-05 09:00:00  4.968145  290.941073
2012-04-05 10:00:00  4.478531  317.678285
2012-04-05 11:00:00  4.706206  335.258633
2012-04-05 12:00:00  2.457873    8.655838

2012-07-22 19:13
by bmu

This can be extended to a list of functions per column: frame.resample('1H', how={'radiation': [np.sum, np.min], 'tamb': np.mean}). The resulting DataFrame has a MultiIndex on its columns, with the original column name as level 0 and the function name as level 1 - Def_Os 2015-05-26 23:36

To add to my previous comment: instead of a list of functions per column, you can also use a dictionary, where the key is the new column name and the value is the function to use: frame.resample('1H', how={'radiation': {'sum_rad': np.sum, 'min_rad': np.min}, 'tamb': np.mean})Def_Os 2015-05-27 20:27

say if you want to add a NEW column into the result, such as count() of each row in the resample period - codingknob 2016-03-29 19:45

@codingknob: Sorry, I don't undestand your comment - bmu 2016-04-08 12:24

@bmu , what if I want to aggregate by a custom function that combines data from two columns. For instance, what if I want to aggregate by np.mean(frame['radiation'] * frame['tamb']) - Amitai 2017-07-23 12:37

You can also downsample using the asof method of pandas.DateRange objects.

In [21]: hourly = pd.DateRange(datetime.datetime(2012, 4, 5, 8, 0),
...                          datetime.datetime(2012, 4, 5, 12, 0),
...                          offset=pd.datetools.Hour())

In [22]: frame.groupby(hourly.asof).size()
Out[22]: 
key_0
2012-04-05 08:00:00    60
2012-04-05 09:00:00    60
2012-04-05 10:00:00    60
2012-04-05 11:00:00    60
2012-04-05 12:00:00    1
In [23]: frame.groupby(hourly.asof).agg({'radiation': np.sum, 'tamb': np.mean})
Out[23]: 
                     radiation  tamb 
key_0                                
2012-04-05 08:00:00  271.54     4.491
2012-04-05 09:00:00  266.18     5.253
2012-04-05 10:00:00  292.35     4.959
2012-04-05 11:00:00  283.00     5.489
2012-04-05 12:00:00  0.5414     9.532

2012-04-05 01:45
by Garrett

+1 for using DateRange.asofdiliop 2012-04-05 02:48

To tantalize you, in pandas 0.8.0 (under heavy development in the timeseries branch on GitHub), you'll be able to do:

In [5]: frame.convert('1h', how='mean')
Out[5]: 
                     radiation      tamb
2012-04-05 08:00:00   7.840989  8.446109
2012-04-05 09:00:00   4.898935  5.459221
2012-04-05 10:00:00   5.227741  4.660849
2012-04-05 11:00:00   4.689270  5.321398
2012-04-05 12:00:00   4.956994  5.093980

The above mentioned methods are the right strategy with the current production version of pandas.

2012-04-07 20:34
by Wes McKinney

Thanks, but what I want to have would be something like frame.convert('1h', how={'radiation': 'sum, 'tamb': 'mean'}). Is this an option in 0.8 - bmu 2012-04-08 10:30

@ Wes McKinney this should be resample in 0.8, isn't it - bmu 2012-06-15 20:58

If you would update your answer, I would accept it. otherwise you should remove it I think, because it will point users to the wrong direction - bmu 2012-10-13 08:23

You need to use groupby as such:

grouped = frame.groupby(lambda x: x.hour)
grouped.agg({'radiation': np.sum, 'tamb': np.mean})
# Same as: grouped.agg({'radiation': 'sum', 'tamb': 'mean'})

with the output being:

        radiation      tamb
key_0                      
8      298.581107  4.883806
9      311.176148  4.983705
10     315.531527  5.343057
11     288.013876  6.022002
12       5.527616  8.507670

So in essence I am splitting on the hour value and then calculating the mean of tamb and the sum of radiation and returning back the DataFrame (similar approach to R's ddply). For more info I would check the documentation page for groupby as well as this blog post.

Edit: To make this scale a bit better you could group on both the day and time as such:

grouped = frame.groupby(lambda x: (x.day, x.hour))
grouped.agg({'radiation': 'sum', 'tamb': 'mean'})
          radiation      tamb
key_0                        
(5, 8)   298.581107  4.883806
(5, 9)   311.176148  4.983705
(5, 10)  315.531527  5.343057
(5, 11)  288.013876  6.022002
(5, 12)    5.527616  8.507670

2012-04-05 01:11
by diliop