I have a data frame (71568 x 4) consisting of several variables that are observed every hour (24 observations per day) and contain many NAs.
I want to find the maximum value in every 24-hour period (in other words, the daily maximum) for each variable. If 12 or more hourly observations are missing during the 24-hour period on any day, the data for that day are considered missing and should be reported as NA. Can anyone help me do this in R?
Here is a small example:
tDate <- rep(c(19980101,19980102,19980103), each = 24)
tTime <- rep(seq(1:24), 3)
x1 <- c(c(1:4),rep(NA,7),c(2:10),6,2,9,1,rep(NA,4),c(4:23),c(2:8),
rep(NA,7),c(3:5),rep(NA,7))
x2 <- c(rep(NA,3),c(11:15),NA,c(3:15),rep(NA,10),c(7:10),NA,c(2:4),NA,3,
rep(NA,5),c(6:9),NA,c(8:20),rep(NA,5),5,1)
datmat <- cbind(tDate,tTime,x1,x2)
The expected output should look like this:
> matrix(c(10,23,NA,15,NA,20), byrow = FALSE, ncol = 2)
Many thanks in advance.
I'd define a custom function to take the max that you want:
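my.max <- function(vec) {
  # A day with 12 or more missing hourly values counts as missing
  if (sum(is.na(vec)) >= 12) {
    return(NA)
  } else {
    # Otherwise take the maximum of the observed values
    return(max(vec, na.rm = TRUE))
  }
}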
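As a quick sanity check, here is a minimal sketch of the same idea with the threshold pulled out as an argument. The names `daily_max` and `max_missing` are my own (not from the question), and the two test vectors are just the first and third day of x1 from the sample data.

```r
# Hypothetical generalisation of my.max: `daily_max` and `max_missing`
# are assumed names, not part of the original question or answer.
daily_max <- function(vec, max_missing = 12) {
  # Count the missing hourly values; at or above the threshold the day is NA
  if (sum(is.na(vec)) >= max_missing) {
    return(NA)
  }
  max(vec, na.rm = TRUE)
}

day_ok  <- c(1:4, rep(NA, 7), 2:10, 6, 2, 9, 1)  # first day of x1: 7 NAs
day_bad <- c(2:8, rep(NA, 7), 3:5, rep(NA, 7))   # third day of x1: 14 NAs

daily_max(day_ok)   # 10
daily_max(day_bad)  # NA
```

Counting with sum(is.na(vec)) avoids building the subset vec[is.na(vec)] just to measure its length, and a day with exactly 12 missing hours is treated as missing, which matches the "12 or more" rule in the question.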
Then use plyr, and specifically ddply:
library(plyr)
ddply(as.data.frame(datmat), .(tDate), summarise, x1 = my.max(x1), x2 = my.max(x2))
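If you would rather not depend on plyr, something along these lines should also work in base R. This is only a sketch using aggregate(), not part of the approach above. It rebuilds the sample data from the question, and the names `dat` and `res` are just placeholders I picked.

```r
# Rebuild the sample data from the question
tDate <- rep(c(19980101, 19980102, 19980103), each = 24)
x1 <- c(1:4, rep(NA, 7), 2:10, 6, 2, 9, 1, rep(NA, 4), 4:23, 2:8,
        rep(NA, 7), 3:5, rep(NA, 7))
x2 <- c(rep(NA, 3), 11:15, NA, 3:15, rep(NA, 10), 7:10, NA, 2:4, NA, 3,
        rep(NA, 5), 6:9, NA, 8:20, rep(NA, 5), 5, 1)
dat <- data.frame(tDate, x1, x2)

# Daily maximum, or NA when 12 or more hourly values are missing
my.max <- function(vec) {
  if (sum(is.na(vec)) >= 12) NA else max(vec, na.rm = TRUE)
}

# Use the data-frame interface (not the formula interface, whose default
# na.action drops rows containing NA before FUN ever sees them)
res <- aggregate(dat[c("x1", "x2")], by = list(tDate = dat$tDate), FUN = my.max)
res
#      tDate x1 x2
# 1 19980101 10 15
# 2 19980102 23 NA
# 3 19980103 NA 20
```

The thing to watch is the formula interface, e.g. aggregate(cbind(x1, x2) ~ tDate, ...): by default it removes incomplete rows first, so the NA count inside my.max would always be zero unless you pass na.action = na.pass.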