I'm looking for a general solution for updating one large data frame with the contents of a second similar data frame. I have dozens of datasets, each with thousands of rows and upwards of 10,000 columns. An "update" dataset will overlap its corresponding "base" dataset by anywhere from a few percent to perhaps 50 percent, rowwise. The datasets have a "key" column and there will be only one row per each unique key value in any given dataset.
The basic rule is: if a non-NA value exists in the update dataset for a given cell, replace the same cell in the base dataset with that value. (The "same cell" means same value of the "key" column and colname.)
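Stated as code, the per-cell rule is just (a sketch; the function name is mine):

combine <- function(base, upd) ifelse(is.na(upd), base, upd)  # update value wins unless it is NA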
Note the update dataset will likely contain new rows ("inserts") which I can handle with an rbind.
So given the base data frame "df1", where column "K" is the unique key column, and "P1" .. "P3" represent the 10,000 columns, whose names will vary from one pair of datasets to the next:
K P1 P2 P3
1 A 1 1 1
2 B 1 1 1
3 C 1 1 1
...and the update data frame "df2":
K P1 P2 P3
1 B 2 NA 2
2 C NA 2 2
3 D 2 2 2
The result I need is as follows, where the 1's for "B" and "C" were overwritten by the 2's but not overwritten by the NA's:
K P1 P2 P3
1 A 1 1 1
2 B 2 1 2
3 C 1 2 2
4 D 2 2 2
This doesn't seem to be a merge candidate as merge gives me either duplicate rows (with respect to the "key" column) or duplicate columns (e.g. P1.x, P1.y), which I have to iterate over to collapse somehow.
I have tried pre-allocating a matrix with the dimensions of the final rows/columns, and populating it with the contents of df1, then iterating over the overlapping rows of df2, but I cannot get better than 20 cells per second performance, requiring hours to complete (compared to minutes for the equivalent DATA step UPDATE functionality in SAS).
I'm sure I'm missing something, but can't find a comparable example.
I see ddply usage that looks close, but not a general solution. The data.table package didn't seem to help, as it's not obvious to me that this is a join problem, at least not generally over so many columns.
Also a solution that focuses only on the intersecting rows is adequate as I can identify the others and rbind them in.
Here is some code to fabricate the data frames above:
cat("K,P1,P2,P3", "A,1,1,1", "B,1,1,1", "C,1,1,1", file="f1.dat", sep="\n");
cat("K,P1,P2,P3", "B,2,,2", "C,,2,2", "D,2,2,2", file="f2.dat", sep="\n");
df1 <- read.table("f1.dat", sep=",", header=TRUE, stringsAsFactors=FALSE);
df2 <- read.table("f2.dat", sep=",", header=TRUE, stringsAsFactors=FALSE);
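Since a solution over the intersecting rows is enough, here is a sketch of the split I have in mind (using the df1/df2 fabricated above):

overlap <- df2$K %in% df1$K
df2.updates <- df2[overlap, ]   # rows whose key already exists in df1
df2.inserts <- df2[!overlap, ]  # new rows to rbind in afterwards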
Thanks
data.table
One way would be to flatten both df1 and df2 to 3 columns (K,P,val), each with a 2-column key (K,P). Then df1[df2,val:=df2.val] and unflatten afterwards. Or, keeping the same structure you have, loop through df2 doing df1[k,p:=value,with=FALSE], which will be fast because loops on data.tables are much faster. If you like the loop approach then set() is even faster than := - Matt Dowle 2012-04-10 12:05
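A minimal sketch of the flattening step described in that comment (reshape2::melt is one way to build the (K,P,val) form; the names are illustrative, not from the comment):

library(data.table)
library(reshape2)
f1 <- data.table(melt(df1, id.vars="K", variable.name="P", value.name="val"), key=c("K","P"))
f2 <- data.table(melt(df2, id.vars="K", variable.name="P", value.name="val"), key=c("K","P"))
# then update matching (K,P) cells as in the comment, and unflatten (e.g. with dcast) afterwards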
df1[df2,val:=df2.val] gives Error in := (val, df2.val) : := is defined for use in j only; i.e., DT[i,col:=1L] not DT[i,col]:=1L or DT[i]$col:=1L. - gkaupas 2012-04-12 18:25
df1 needs to be a data.table; e.g. df1=as.data.table(df1). I'll add something to that error message to suggest checking that type - Matt Dowle 2012-04-12 20:47
dt1 <- data.table(read.table(text="K1 K2 V\nA P1 1\nA P2 1\nB P1 1\nB P2 1", header=TRUE), key=c("K1","K2")); then dt2 <- data.table(read.table(text="K1 K2 V\nB P1 2\nB P2 1\nC P1 2\nC P2 2", header=TRUE), key=c("K1","K2")); then dt1[dt2,V:=dt2.V] yields error combining bywithoutby with := in j is not yet implemented - gkaupas 2012-04-13 13:50
Using the decider function below: fabricate two very small example normalized tables, apply the function; the result is correct. 1: decider <- function(x, y) ifelse(is.na(x), y, x); 2: dt1 <- data.table(read.table(text="KEY1 KEY2 OLDVAL\nA P1 1\nA P2 1\nB P1 1\nB P2 1", header=TRUE), key=c("KEY1","KEY2")); 3: dt2 <- data.table(read.table(text="KEY1 KEY2 NEWVAL\nA P1 2\nA P2 NA\nB P1 2\nB P2 2", header=TRUE), key=c("KEY1","KEY2")); 4: dt1[dt2,decider(NEWVAL,OLDVAL)] - gkaupas 2012-04-13 14:50
With grpsize = ceiling(6e5/26^2); and dt1<-data.table(data.frame(KEY1=rep(factor(LETTERS),each=26*grpsize), KEY2=rep(factor(letters),each=grpsize), OLDVAL=runif(grpsize*26^2)), key=c("KEY1","KEY2")); and the same for dt2 (change OLDVAL to NEWVAL), then running dt1[dt2,decider(NEWVAL,OLDVAL)] with gc before/after, I see memory peak at 8 Gb for two 10 Mb tables... far too costly. If I change grpsize to use 7e5, the [ fails with cannot allocate vector of size 2.7 Gb - gkaupas 2012-04-13 14:51
With grpsize = ceiling(1e7/26^2); as shown in the data.table documentation, the dt1[dt2,decider(NEWVAL,OLDVAL)] fails with Error in rep(1L, sum(lengths)) : invalid 'times' argument In addition: Warning message: In sum(lengths) : Integer overflow - use sum(as.numeric(.)). A normalized dataset of 1e7 rows is "small" for my application. It seems I have to write a layer of partitioning code and spoon-feed the datasets to R in smaller pieces. It is not obvious to me how packages like bigmemory or ff help, as there is no mention of actively paging data to/from disk on demand - gkaupas 2012-04-13 14:52
This loops by column, setting dt1 by reference, and (hopefully) should be quick.
require(data.table)
dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
if (!identical(names(dt1), names(dt2)))
    stop("Assumed for now. Can relax later if needed.")
w = chmatch(dt2$K, dt1$K)  # positions of dt2's keys within dt1 (NA = new row)
for (i in 2:ncol(dt2)) {
    nna = !is.na(dt2[[i]])               # only non-NA cells overwrite
    set(dt1, w[nna], i, dt2[[i]][nna])   # assign by reference, column by column
}
dt1 = rbind(dt1, dt2[is.na(w)])  # append the new rows ("inserts")
dt1
K P1 P2 P3
[1,] A 1 1 1
[2,] B 2 1 2
[3,] C 1 2 2
[4,] D 2 2 2
When I instead build the tables with data.table(read.table("f1.dat", sep=",", header=TRUE, stringsAsFactors=FALSE), key=c("K")), the result is all of dt1 apparently replaced by all of dt2. What might cause the distinction? - gkaupas 2012-04-19 19:11
Perhaps the difference between NA_character_ and "NA" - Matt Dowle 2012-04-19 23:22
The files needed na.strings="" in read.table. Once that was added, your code works fine. Example here where dt2 contains 2 updates: the first record overwrites the same one in dt1; the second has both columns NA (one is numeric, one is character), and it correctly has no effect on the matching record in dt1. I will re-run my benchmarks where dt2 has random NA's in it and read.table has the correct options, but this solution is clearly the most effective one. Thank you for your diligence - gkaupas 2012-04-20 16:54
is.na(DF) <- DF == -999 (whether DF's or DT's) is killing me with a 16K cell/second fix rate. Is there a better way to write that fix using data.table kung fu? - gkaupas 2012-04-25 15:51
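One possible alternative (a sketch, not from the thread; DT stands for a data.table version of the data, my naming) is to fix one column at a time with set(), avoiding the full logical matrix that DF == -999 builds in one go:

for (j in seq_along(DT)) {
  idx <- which(DT[[j]] == -999)         # rows needing the fix in this column
  if (length(idx)) set(DT, idx, j, NA)  # assign NA in place, by reference
}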
This is likely not the fastest solution, but it is done entirely in base R.
(updated answer per Tommy's comments)
#READING IN YOUR DATA FRAMES
df1 <- read.table(text=" K P1 P2 P3
1 A 1 1 1
2 B 1 1 1
3 C 1 1 1", header=TRUE)
df2 <- read.table(text=" K P1 P2 P3
1 B 2 NA 2
2 C NA 2 2
3 D 2 2 2", header=TRUE)
all <- c(levels(df1$K), levels(df2$K)) #all cells of key column
dups <- all[duplicated(all)] #the overlapping key cells
ndups <- all[!all %in% dups] #unique key cells
df3 <- rbind(df1[df1$K%in%ndups, ], df2[df2$K%in%ndups, ]) #bind the unique rows
decider <- function(x, y) ifelse(is.na(x), y, x) #function replaces NAs with existing values
df4 <- data.frame(mapply(df2[df2$K%in%dups, ], df1[df1$K%in%dups, ],
                         FUN = decider)) #replace all NAs of df2 with df1 values if they exist
df5 <- rbind(df3, df4) #bind unique rows of df1 and df2 with NA replaced df4
df5 <- df5[order(df5$K), ] #reorder based on key column
rownames(df5) <- 1:nrow(df5) #give proper non duplicated rownames
df5
This yields:
K P1 P2 P3
1 A 1 1 1
2 B 2 1 2
3 C 1 2 2
4 D 2 2 2
Upon closer reading, not all columns have the same name, but I am assuming the same order. This may be a more helpful approach:
all <- c(levels(df1$K), levels(df2$K))
dups <- all[duplicated(all)]
ndups <- all[!all %in% dups]
LS <- list(df1, df2)
LS2 <- lapply(seq_along(LS), function(i) {
    colnames(LS[[i]]) <- colnames(LS[[2]])  # assume df2's column names apply to both
    return(LS[[i]])
})
LS3 <- lapply(seq_along(LS2), function(i) LS2[[i]][LS2[[i]]$K%in%ndups, ]) #unique rows
LS4 <- lapply(seq_along(LS2), function(i) LS2[[i]][LS2[[i]]$K%in%dups, ])  #overlapping rows
decider <- function(x, y) ifelse(is.na(x), y, x)
DF <- data.frame(mapply(LS4[[2]], LS4[[1]], FUN = decider)) #replace NAs of df2 with df1 values
DF$K <- LS4[[1]]$K #restore the key column
LS3[[3]] <- DF
df5 <- do.call("rbind", LS3)
df5 <- df5[order(df5$K), ]
rownames(df5) <- 1:nrow(df5)
df5
If I do df1$K <- as.factor(df1$K), then the rbind borks with invalid factor level, NAs generated. The second approach works; I am trying it with a large dataset - gkaupas 2012-04-04 18:24
EDIT: Please ignore this answer. Bad idea to loop by row; it works but is very slow. Left for posterity! See my 2nd attempt as a separate answer.
require(data.table)
dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
setkey(dt1, K)  # key needed for the J() join below
K = dt2[[1]]
for (i in 1:nrow(dt2)) {
    k = K[i]
    p = unlist(dt2[i, -1, with=FALSE])  # this row's values as a named vector
    p = p[!is.na(p)]                    # keep only non-NA values
    dt1[J(k), names(p):=as.list(p), with=FALSE]  # update those columns for key k
}
Or, can you use matrix instead of data.frame? If so, it could be a single line using A[B] syntax, where B is a 2-column matrix containing the row and column numbers to update.
Error in [.data.table(p, !is.na(p)) : i is invalid type (matrix). Perhaps in future a 2 column matrix could return a list of elements of DT (in the spirit of A[B] in FAQ 2.14). Re: the A[B] syntax, if I need to construct B with update instructions, then I have to iterate through all the cells to determine that, which defeats the purpose, unless I'm misunderstanding - gkaupas 2012-04-12 18:20
unlist rather than as.vector. Have edited answer. On A[B] you can very likely vectorize the creation of B. If matrix is ok for your case (all columns integer or all columns numeric, and n*m < 2^31) then matrix and A[B] seem more appropriate and should be fast - Matt Dowle 2012-04-12 21:10
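A minimal sketch of vectorizing the creation of B with which(..., arr.ind=TRUE) (illustrative all-numeric objects, not from the thread):

A <- matrix(1, nrow=3, ncol=3)                   # "base" values
upd <- matrix(c(2,NA,2, NA,2,2, 2,2,2), nrow=3)  # "update" values containing NAs
B <- which(!is.na(upd), arr.ind=TRUE)            # (row, col) pairs of non-NA cells, no cell loop
A[B] <- upd[B]                                   # one vectorized assignment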
Supplied n items to be assigned to 1 items of column 'P1' (n-1 unused). My data will be a mix of strings and numbers, so I guess that means no matrix route here - gkaupas 2012-04-13 13:56
Changed names(p):=p to names(p):=as.list(p). Ok, yes that means no matrix route. Must be an easier way in data.table (perhaps by wrapping it up in some generic function that does that task) - Matt Dowle 2012-04-13 15:22
The dt1[dt2,decider(NEWVAL,OLDVAL)] I posted in the 3-part comment at the top, based on your input and Tyler's, is tantalizingly simple, even if it requires me to normalize my data first, and only handles a subset of the rows (i.e. I can either get back all rows of dt1 or all rows of dt2 and have to rbind in the rest). Unfortunately it just doesn't scale to my data volume - gkaupas 2012-04-13 18:15
:= updates by reference. See ?":=". Are you sure that dt2 (not df2) has not been updated? - Matt Dowle 2012-04-17 09:09
If I change dt2[ to dt1[, then dt1 does indeed change as desired. Otherwise apparently dt2 just overwrites itself. I will try this with some larger datasets this afternoon. I will still need to rbind in the new row(s) from dt2, but I already conceded to having to do that in my original post. Thanks - gkaupas 2012-04-17 14:38
The data.table loop was a pretty consistent 32,300 cells/second. Previously on the same machine, using mapply on data frames, I averaged 147,800 cells/second. I get 600,200 cells/second with SAS on ancient 1.35 GHz RISC CPUs (vs the 2.0 GHz Nehalems I'm using for R). gc() says memory maxes at 4 Gb when combining the two 100,000,000 cell data.tables, and the server has 8 Gb available and nothing else running - gkaupas 2012-04-18 17:13
The for() loop was the wrong way around; it should be by column (not row). Or, back to the normalized (flattened) route as first suggested. Looking back, the error combining bywithoutby with := in j is not yet implemented can be solved by setting mult="first", assuming that the keys are unique. Also assuming the tests are simplified and we can't speed up by knowing input1 and input2 have the same number of rows with the same keys as in the tests. Anyway, will modify the loop to go by column... - Matt Dowle 2012-04-19 00:12
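For reference, a sketch of that mult="first" suggestion, grafted onto the expression from the earlier comment (assumes the keyed, normalized dt1/dt2 from above, with unique keys):

dt1[dt2, V:=dt2.V, mult="first"]  # per the comment; sidesteps the bywithoutby error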
The following gives the correct answer for the small example data, tries to minimize the number of "copies" of tables, and uses the new fread and (new?) rbindlist. Does it work with your larger actual data set? I didn't quite follow all the comments in the original post about the memory issues you had when trying to flatten/normalize/stack, so apologies if you've already tried this route.
library(data.table)
library(reshape2)
cat("K,P1,P2,P3", "A,1,1,1", "B,1,1,1", "C,1,1,1", file="f1.dat", sep="\n")
cat("K,P1,P2,P3", "B,2,,2", "C,,2,2", "D,2,2,2", file="f2.dat", sep="\n")
# Read each file, melt to long/stacked format (dropping NAs from the update), and key by (K, variable)
dt1s <- data.table(melt(fread("f1.dat"), id.vars="K"), key=c("K","variable"))
dt2s <- data.table(melt(fread("f2.dat"), id.vars="K", na.rm=TRUE), key=c("K","variable"))
setnames(dt2s, "value", "value.new")
dt1s[dt2s, value:=value.new]  # update matching (K, variable) cells with the new values
# Use rbindlist to insert the new records, then reshape back to wide format
dtout <- reshape(rbindlist(list(dt1s, dt1s[dt2s][is.na(value), list(K, variable, value=value.new)])),
                 direction="wide", idvar="K", timevar="variable")
setkey(dtout, K)
setnames(dtout, colnames(dtout), sub("value.", "", colnames(dtout)))  # clean up the column names
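If this behaves the same on the example files, dtout should then match the desired result from the original question:

   K P1 P2 P3
1: A  1  1  1
2: B  2  1  2
3: C  1  2  2
4: D  2  2  2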