I have a times series with temperature and radiation in a pandas dataframe
. The time resolution is 1 minute in regular steps.
import datetime
import pandas as pd
import numpy as np
date_times = pd.date_range(datetime.datetime(2012, 4, 5, 8, 0),
datetime.datetime(2012, 4, 5, 12, 0),
freq='1min')
tamb = np.random.sample(date_times.size) * 10.0
radiation = np.random.sample(date_times.size) * 10.0
frame = pd.DataFrame(data={'tamb': tamb, 'radiation': radiation},
index=date_times)
frame
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 241 entries, 2012-04-05 08:00:00 to 2012-04-05 12:00:00
Freq: T
Data columns:
radiation 241 non-null values
tamb 241 non-null values
dtypes: float64(2)
How can I down-sample this dataframe
to a resolution of one hour, computing the hourly mean for the temperature and the hourly sum for radiation?
With pandas 0.18 the resample API changed (see the docs). So for pandas >= 0.18 the answer is:
In [31]: frame.resample('1H').agg({'radiation': np.sum, 'tamb': np.mean})
Out[31]:
tamb radiation
2012-04-05 08:00:00 5.161235 279.507182
2012-04-05 09:00:00 4.968145 290.941073
2012-04-05 10:00:00 4.478531 317.678285
2012-04-05 11:00:00 4.706206 335.258633
2012-04-05 12:00:00 2.457873 8.655838
Old Answer:
I am answering my question to reflect the time series related changes in pandas >= 0.8
(all other answers are outdated).
Using pandas >= 0.8 the answer is:
In [30]: frame.resample('1H', how={'radiation': np.sum, 'tamb': np.mean})
Out[30]:
tamb radiation
2012-04-05 08:00:00 5.161235 279.507182
2012-04-05 09:00:00 4.968145 290.941073
2012-04-05 10:00:00 4.478531 317.678285
2012-04-05 11:00:00 4.706206 335.258633
2012-04-05 12:00:00 2.457873 8.655838
frame.resample('1H', how={'radiation': {'sum_rad': np.sum, 'min_rad': np.min}, 'tamb': np.mean})
Def_Os 2015-05-27 20:27
You can also downsample using the asof
method of pandas.DateRange
objects.
In [21]: hourly = pd.DateRange(datetime.datetime(2012, 4, 5, 8, 0),
... datetime.datetime(2012, 4, 5, 12, 0),
... offset=pd.datetools.Hour())
In [22]: frame.groupby(hourly.asof).size()
Out[22]:
key_0
2012-04-05 08:00:00 60
2012-04-05 09:00:00 60
2012-04-05 10:00:00 60
2012-04-05 11:00:00 60
2012-04-05 12:00:00 1
In [23]: frame.groupby(hourly.asof).agg({'radiation': np.sum, 'tamb': np.mean})
Out[23]:
radiation tamb
key_0
2012-04-05 08:00:00 271.54 4.491
2012-04-05 09:00:00 266.18 5.253
2012-04-05 10:00:00 292.35 4.959
2012-04-05 11:00:00 283.00 5.489
2012-04-05 12:00:00 0.5414 9.532
DateRange.asof
diliop 2012-04-05 02:48
To tantalize you, in pandas 0.8.0 (under heavy development in the timeseries
branch on GitHub), you'll be able to do:
In [5]: frame.convert('1h', how='mean')
Out[5]:
radiation tamb
2012-04-05 08:00:00 7.840989 8.446109
2012-04-05 09:00:00 4.898935 5.459221
2012-04-05 10:00:00 5.227741 4.660849
2012-04-05 11:00:00 4.689270 5.321398
2012-04-05 12:00:00 4.956994 5.093980
The above mentioned methods are the right strategy with the current production version of pandas.
frame.convert('1h', how={'radiation': 'sum, 'tamb': 'mean'})
. Is this an option in 0.8 - bmu 2012-04-08 10:30
You need to use groupby
as such:
grouped = frame.groupby(lambda x: x.hour)
grouped.agg({'radiation': np.sum, 'tamb': np.mean})
# Same as: grouped.agg({'radiation': 'sum', 'tamb': 'mean'})
with the output being:
radiation tamb
key_0
8 298.581107 4.883806
9 311.176148 4.983705
10 315.531527 5.343057
11 288.013876 6.022002
12 5.527616 8.507670
So in essence I am splitting on the hour value and then calculating the mean of tamb
and the sum of radiation
and returning back the DataFrame
(similar approach to R's ddply
). For more info I would check the documentation page for groupby as well as this blog post.
Edit: To make this scale a bit better you could group on both the day and time as such:
grouped = frame.groupby(lambda x: (x.day, x.hour))
grouped.agg({'radiation': 'sum', 'tamb': 'mean'})
radiation tamb
key_0
(5, 8) 298.581107 4.883806
(5, 9) 311.176148 4.983705
(5, 10) 315.531527 5.343057
(5, 11) 288.013876 6.022002
(5, 12) 5.527616 8.507670
frame.resample('1H', how={'radiation': [np.sum, np.min], 'tamb': np.mean})
. The resulting DataFrame has a MultiIndex on its columns, with the original column name as level 0 and the function name as level 1 - Def_Os 2015-05-26 23:36