Improving performance of updating contents of large data frame using contents of similar data frame


Score: 6

I'm looking for a general solution for updating one large data frame with the contents of a second, similar data frame. I have dozens of datasets, each with thousands of rows and upwards of 10,000 columns. An "update" dataset will overlap its corresponding "base" dataset by anywhere from a few percent to perhaps 50 percent, row-wise. The datasets have a "key" column, and there will be only one row for each unique key value in any given dataset.

The basic rule is: if a non-NA value exists in the update dataset for a given cell, replace the same cell in the base dataset with that value. (The "same cell" means the same value of the "key" column and the same column name.)

Note the update dataset will likely contain new rows ("inserts") which I can handle with an rbind.

So given the base data frame "df1", where column "K" is the unique key column, and "P1" .. "P3" represent the 10,000 columns, whose names will vary from one pair of datasets to the next:

  K P1 P2 P3
1 A  1  1  1
2 B  1  1  1
3 C  1  1  1

...and the update data frame "df2":

  K P1 P2 P3
1 B  2 NA  2
2 C NA  2  2
3 D  2  2  2

The result I need is as follows, where the 1's for "B" and "C" were overwritten by the 2's but not overwritten by the NA's:

  K P1 P2 P3
1 A  1  1  1
2 B  2  1  2
3 C  1  2  2
4 D  2  2  2

This doesn't seem to be a merge candidate as merge gives me either duplicate rows (with respect to the "key" column) or duplicate columns (e.g. P1.x, P1.y), which I have to iterate over to collapse somehow.

I have tried pre-allocating a matrix with the dimensions of the final rows/columns, and populating it with the contents of df1, then iterating over the overlapping rows of df2, but I cannot get better than 20 cells per second performance, requiring hours to complete (compared to minutes for the equivalent DATA step UPDATE functionality in SAS).

I'm sure I'm missing something, but can't find a comparable example.

I see ddply usage that looks close, but not a general solution. The data.table package didn't seem to help as it's not obvious to me that this is a join problem, at least not generally over so many columns.

Also, a solution that focuses only on the intersecting rows is adequate, as I can identify the others and rbind them in.

Here is some code to fabricate the data frames above:

cat("K,P1,P2,P3", "A,1,1,1", "B,1,1,1", "C,1,1,1", file="f1.dat", sep="\n");
cat("K,P1,P2,P3", "B,2,,2", "C,,2,2", "D,2,2,2", file="f2.dat", sep="\n");
df1 <- read.table("f1.dat", sep=",", header=TRUE, stringsAsFactors=FALSE);
df2 <- read.table("f2.dat", sep=",", header=TRUE, stringsAsFactors=FALSE);
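
For reference, here is a naive cell-by-cell sketch of the rule, just to pin down the semantics (it assumes identical column names as above, and is correct on this example but far too slow at my scale):

for (k in intersect(df2$K, df1$K)) {
  for (j in names(df1)[-1]) {
    v <- df2[df2$K == k, j]
    if (!is.na(v)) df1[df1$K == k, j] <- v   # only non-NA values overwrite
  }
}
df1 <- rbind(df1, df2[!(df2$K %in% df1$K), ])  # append the new ("insert") rows
df1 <- df1[order(df1$K), ]
df1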

Thanks

2012-04-04 16:37
by gkaupas
Possible duplicate: http://stackoverflow.com/questions/9918450/join-matching-columns-in-a-data-frame-or-data-table/9918769#991876 - Tyler Rinker 2012-04-04 16:52
I take it back; this is not a duplicate. I didn't read carefully enough. You want a merge that replaces NAs of one df with values from another df. A bit more complex. - Tyler Rinker 2012-04-04 17:37
In data.table one way would be to flatten both df1 and df2 to 3 columns: (K,P,val) each with a 2-column key (K,P). Then df1[df2,val:=df2.val] and unflatten afterwards. Or, keeping the same structure you have, in a loop through df2 do df1[k,p:=value,with=FALSE] which will be fast because loops on data.tables are much faster. If you like the loop approach then set() is even faster than := - Matt Dowle 2012-04-10 12:05
@MatthewDowle The normalized (flattened) route with df1[df2,val:=df2.val] gives Error in := (val, df2.val) : := is defined for use in j only; i.e., DT[i,col:=1L] not DT[i,col]:=1L or DT[i]$col:=1L. - gkaupas 2012-04-12 18:25
df1 needs to be a data.table; e.g. df1=as.data.table(df1). I'll add something to that error message to suggest checking that type - Matt Dowle 2012-04-12 20:47
dt1 <- data.table(read.table(text="K1 K2 V\nA P1 1\nA P2 1\nB P1 1\nB P2 1", header=TRUE),key=c("K1","K2")); then dt2 <- data.table(read.table(text="K1 K2 V\nB P1 2\nB P2 1\nC P1 2\nC P2 2", header=TRUE),key=c("K1","K2")); then dt1[dt2,V:=dt2.V] yields the error "combining bywithoutby with := in j is not yet implemented" - gkaupas 2012-04-13 13:50
Three part comment, part one: Take Tyler's decider function below. Fabricate two very small example normalized tables. Apply the function. Result is correct. 1: decider <- function(x, y) ifelse(is.na(x), y, x); 2: dt1 <- data.table(read.table(text="KEY1 KEY2 OLDVAL\nA P1 1\nA P2 1\nB P1 1\nB P2 1", header=TRUE),key=c("KEY1","KEY2")); 3: dt2 <- data.table(read.table(text="KEY1 KEY2 NEWVAL\nA P1 2\nA P2 NA\nB P1 2\nB P2 2", header=TRUE),key=c("KEY1","KEY2")); 4: dt1[dt2,decider(NEWVAL,OLDVAL)] - gkaupas 2012-04-13 14:50
Second part: Now use data.table example to generate monster table. With grpsize = ceiling(6e5/26^2); and dt1<-data.table(data.frame(KEY1=rep(factor(LETTERS),each=26*grpsize),KEY2=rep(factor(letters),each=grpsize),OLDVAL=runif(grpsize*26^2)),key=c("KEY1","KEY2")); and repeat for dt2 (change OLDVAL to NEWVAL), then dt1[dt2,decider(NEWVAL,OLDVAL)] and gc before/after I see memory peaks at 8 Gb for two 10 Mb tables... far too costly. If I change grpsize to use 7e5, the [ fails with cannot allocate vector of size 2.7 Gb - gkaupas 2012-04-13 14:51
Third part: Now use grpsize = ceiling(1e7/26^2); as shown in data.table documentation. The dt1[dt2,decider(NEWVAL,OLDVAL)] fails with Error in rep(1L, sum(lengths)) : invalid 'times' argument In addition: Warning message: In sum(lengths) : Integer overflow - use sum(as.numeric(.)). A normalized dataset of 1e7 rows is "small" for my application. It seems I have to write a layer of partitioning code and spoon-feed the datasets to R in smaller pieces. It is not obvious to me how packages like bigmemory or ff help as there is no mention of actively paging data to/from disk on demand - gkaupas 2012-04-13 14:52
Please confirm if the approach in my answer works or not, and scales. Then at least I understand the problem and we can look at a more elegant solution after that .. - Matt Dowle 2012-04-15 17:04


Score: 1

This loops by column, setting dt1 by reference and (hopefully) should be quick.

dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
if (!identical(names(dt1),names(dt2)))
    stop("Assumed for now. Can relax later if needed.")
w = chmatch(dt2$K, dt1$K)            # row positions of dt2's keys within dt1 (NA = new key)
for (i in 2:ncol(dt2)) {
    nna = !is.na(dt2[[i]])           # only non-NA update values overwrite
    set(dt1,w[nna],i,dt2[[i]][nna])  # assign by reference, one column at a time
}
dt1 = rbind(dt1,dt2[is.na(w)])       # append the new ("insert") rows
dt1
     K P1 P2 P3
[1,] A  1  1  1
[2,] B  2  1  2
[3,] C  1  2  2
[4,] D  2  2  2
2012-04-19 00:52
by Matt Dowle
Now we're cooking. I am getting "merge" rates of 3,200,000 cells per second for 1000-row datasets, increasing to 7,200,000 cells per second for 10,000-row datasets, with the rates seemingly independent of the number of columns in the dataset. However the code seems to behave only if the values are numeric. My test case has all character columns, populated with A's and B's, read in with data.table(read.table("f1.dat", sep=",", header=TRUE, stringsAsFactors=FALSE), key=c("K")), with the result being all of dt1 apparently replaced by all of dt2. What might cause the distinction? - gkaupas 2012-04-19 19:11
Phew, sounds promising. There's no reason why character shouldn't work the same as numeric, afaik. Will need an example please showing what you get vs what you expect. Wild guess that it's the difference between NA_character_ and "NA" - Matt Dowle 2012-04-19 23:22
My fault, your guess is indeed the cause; in my read.table I forgot to specify na.strings="". Once that was added, your code works fine. Example here where dt2 contains 2 updates, first record overwrites same in dt1, second has both columns NA (one is numeric, one is character), and it correctly has no effect on matching record in dt1. I will re-run my benchmarks where dt2 has random NA's in it and read.table has correct options, but this solution is clearly the most effective one. Thank you for your diligence - gkaupas 2012-04-20 16:54
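A sketch of the corrected read described above (file name from the question's example); the key point is na.strings="", so that empty fields come in as real NAs rather than empty strings:

library(data.table)
dt2 <- data.table(read.table("f2.dat", sep=",", header=TRUE,
                             stringsAsFactors=FALSE, na.strings=""),
                  key="K")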
Relief. Thanks for your patience, got there in the end. It's been useful for me too as it turns out - Matt Dowle 2012-04-20 19:31
Epilogue: Despite the 3 million+ cell/second processing, on my hardware, unfortunately my overall throughput is gated by reading/writing the data. Whether I read/write CSV files, or load/save uncompressed .RData files, my effective processing rate drops to 250K cells/second on the server I used for the R benchmarking, or around 120K cells/second on the hardware running SAS. These are numbers for just read/write or load/save, no merging code in between. SAS reads/merges/writes at a net 600K cells/second. Foo - gkaupas 2012-04-23 18:23
I don't quite follow. 250k/sec (R) is faster than 120k/sec (SAS). Is the time R takes to load data missing? You have taken into account network latency, and isolated that by testing local files? However R's file reader is very slow. There are some settings to try (search SO). Sqldf has a fast file reader. Data.table has the bare bones of a fast file reader (to be completed) - Matt Dowle 2012-04-23 19:59
Full SAS script (read, merge, write) on "the SAS server" runs 600K cells/second. Full R script (read, data.table merge, write) on a faster server runs 250K cells/second. Simple R script (read, write) in R 2.7 on "the SAS server" runs 120K cells/second. In all cases the files are on the same NFS-mounted filesystem (yeah, not the best). All 3 (NFS filer, SAS server, R 2.15 server) are in the same data center. Testing plain R I/O on "the SAS server" was to get closer to comparing apples to apples, and base R takes 5X the time SAS does. I'll revisit sqldf - gkaupas 2012-04-24 16:21
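For reference, the sqldf reader mentioned above is read.csv.sql; a minimal sketch (file name taken from the question's example):

library(sqldf)
# Reads the CSV through SQLite rather than base read.table. Note SQLite has no
# NA concept, so empty fields still need the placeholder/clean-up step
# discussed in the next comments.
df1 <- read.csv.sql("f1.dat", sql = "select * from file", header = TRUE, sep = ",")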
Well, sqldf is reading at around 500K cells/second. But now I have to deal with the NA problem, where the hack is to doctor the incoming data with placeholder values, then clean up those values afterwards. That cleanup process, at least using the suggested is.na(DF) <- DF == -999 (whether DF's or DT's), is killing me with a 16K cell/second fix rate. Is there a better way to write that fix using data.table kung fu? - gkaupas 2012-04-25 15:51
There is. See this answer (go straight to the EDIT at the end) and this thread for more info. And glad sqldf worked out - Matt Dowle 2012-04-25 16:16
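The linked approach is, roughly, a column-wise set() loop rather than the whole-table is.na(DF) <- DF == -999 assignment; a minimal sketch, with the -999 placeholder taken from the comment above and a toy table standing in for the real data:

library(data.table)
DT <- data.table(K=c("A","B"), P1=c(1,-999), P2=c(-999,2))   # toy stand-in with placeholders
for (j in seq_along(DT)) {
  # which() picks out only the rows holding the placeholder in column j,
  # and set() overwrites them with NA by reference, one column at a time
  set(DT, i = which(DT[[j]] == -999), j = j, value = NA)
}
DT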


Score: 2

This is likely not the fastest solution but is done entirely in base R.

(updated answer per Tommy's comments)

#READING IN YOUR DATA FRAMES
df1 <- read.table(text="  K P1 P2 P3
1 A  1  1  1
2 B  1  1  1
3 C  1  1  1", header=TRUE)

df2 <- read.table(text="  K P1 P2 P3
1 B  2 NA  2
2 C NA  2  2
3 D  2  2  2", header=TRUE)

all <- c(levels(df1$K), levels(df2$K))                  #all cells of key column
dups <- all[duplicated(all)]                            #the overlapping key cells
ndups <- all[!all %in% dups]                            #unique key cells
df3 <- rbind(df1[df1$K%in%ndups, ], df2[df2$K%in%ndups, ]) #bind the unique rows

decider <- function(x, y) ifelse(is.na(x), y, x) #function replaces NAs if existing
df4 <- data.frame(mapply(df2[df2$K%in%dups, ], df1[df1$K%in%dups, ], 
    FUN = decider)) #replace all NAs of df2 with df1 values if they exist

df5 <- rbind(df3, df4) #bind unique rows of df1 and df2 with NA replaced df4
df5 <- df5[order(df5$K), ]  #reorder based on key column
rownames(df5) <- 1:nrow(df5)  #give proper non duplicated rownames
df5

This yields:

  K P1 P2 P3
1 A  1  1  1
2 B  2  1  2
3 C  1  2  2
4 D  2  2  2

Upon closer reading, not all columns have the same name, but I am assuming the same order. This may be a more helpful approach:

all <- c(levels(df1$K), levels(df2$K))
dups <- all[duplicated(all)]
ndups <- all[!all %in% dups]
LS <- list(df1, df2)
LS2 <- lapply(seq_along(LS), function(i) {
        colnames(LS[[i]]) <- colnames(LS[[2]])
        return(LS[[i]])
    }
)

LS3 <- lapply(seq_along(LS2), function(i) LS2[[i]][LS2[[i]]$K%in%ndups, ])
LS4 <- lapply(seq_along(LS2), function(i) LS2[[i]][LS2[[i]]$K%in%dups, ])

decider <- function(x, y) ifelse(is.na(x), y, x)
DF <- data.frame(mapply(LS4[[2]], LS4[[1]], FUN = decider))
DF$K <- LS4[[1]]$K
LS3[[3]] <- DF
df5 <- do.call("rbind", LS3)
df5 <- df5[order(df5$K), ]
rownames(df5) <- 1:nrow(df5)
df5
2012-04-04 16:50
by Tyler Rinker
...so how does this handle the key column matching? And it seems all numeric values turn into factors... - Tommy 2012-04-04 16:54
@Tommy I updated to reflect your factor comment. That was because I just cut and pasted from the last time I gave a similar answer. - Tyler Rinker 2012-04-04 17:13
@Tommy I get the key column matching now as well. Didn't catch that before. Back to the drawing board. I thought the 2 dataframes were the same observations. - Tyler Rinker 2012-04-04 17:15
@TylerRinker It is OK to assume the columns will have the same names and be in the same order; I have some quick pre-processing that ensures that - gkaupas 2012-04-04 18:11
If all files have the same names then my first approach is likely easier. Not sure how it will fare with speed. Once you test it, let me know. - Tyler Rinker 2012-04-04 18:12
@TylerRinker If I use the first approach but with my files/read.tables, then my data frames don't have factors, so the "all" object is NULL. If I coerce the K columns to be factors with df1$K <- as.factor(df1$K), then the rbind borks with invalid factor level, NAs generated. The second approach works; I am trying with a large dataset - gkaupas 2012-04-04 18:24
@gkaupas I addressed this in the second approach but not the first. So this approach works for you then? How is the speed on 10,000 columns? - Tyler Rinker 2012-04-04 18:26
@TylerRinker: 200 seconds to read in the files, 93 seconds through creation of LS2, 35 more seconds through LS4, 886 seconds for the mapply, 51 seconds after that, plus 57 seconds to write.csv the result; 22 minutes total, about 3X the time SAS requires. Much better than my for-loops for sure - gkaupas 2012-04-04 19:37


Score: 1

EDIT: Please ignore this answer. Bad idea to loop by row. It works but is very slow. Left for posterity! See my 2nd attempt as a separate answer.

require(data.table)
dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
K = dt2[[1]]
for (i in 1:nrow(dt2)) {
    k = K[i]
    p = unlist(dt2[i,-1,with=FALSE])
    p = p[!is.na(p)]
    dt1[J(k),names(p):=as.list(p),with=FALSE]
}

Or, can you use a matrix instead of a data.frame? If so, it could be a single line using A[B] syntax, where B is a 2-column matrix containing the row and column numbers to update.
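A minimal sketch of that A[B] idea (only viable when the whole table can be held in a single-type matrix):

A <- matrix(1L, nrow=3, ncol=3, dimnames=list(c("A","B","C"), c("P1","P2","P3")))
B <- cbind(row=c(2L, 2L, 3L, 3L), col=c(1L, 3L, 2L, 3L))  # (row, column) of each cell to update
A[B] <- 2L   # one vectorised assignment covers every updated cell
A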

2012-04-10 12:15
by Matt Dowle
This fails with Error in [.data.table(p, !is.na(p)) : i is invalid type (matrix). Perhaps in future a 2 column matrix could return a list of elements of DT (in the spirit of A[B] in FAQ 2.14). Re: the A[B] syntax, if I need to construct B with update instructions, then I have to iterate through all the cells to determine that, which defeats the purpose, unless I'm misunderstanding. - gkaupas 2012-04-12 18:20
@gkaupas Apols, meant unlist rather than as.vector. Have edited answer. On A[B] you can very likely vectorize the creation of B. If matrix is ok for your case then (all columns integer or all columns numeric, and n*m < 2^31) then matrix and A[B] seems more appropriate and should be fast - Matt Dowle 2012-04-12 21:10
This moves the error down to the last line in the loop; something of the form Supplied n items to be assigned to 1 items of column 'P1' (n-1 unused). My data will be a mix of strings and numbers so I guess that means no matrix route here - gkaupas 2012-04-13 13:56
@gkaupas Apols again, will edit answer to change names(p):=p to names(p):=as.list(p). Ok, yes that means no matrix route. Must be an easier way in data.table (perhaps by wrapping it up in some generic function that does that task) - Matt Dowle 2012-04-13 15:22
Well the dt1[dt2,decider(NEWVAL,OLDVAL)] I posted in the 3-part comment at the top, based on your input and Tyler's, is tantalizingly simple, even if it requires me to normalize my data first, and only handles a subset of the rows (i.e. I can either get back all rows of dt1 or all rows of dt2 and have to rbind in the rest). Unfortunately it just doesn't scale to my data volume - gkaupas 2012-04-13 18:15
But does this answer work now ok? And does it scale for your data volume? I'm not following the comments to the question at the top - could you draw it together into one code block that I could paste into a fresh R session? - Matt Dowle 2012-04-15 16:56
Well it runs, but I don't see what it does. Both dt1 and dt2 appear unchanged afterwards. Do I need to add code inside the for loop? IIRC, when I put print()'s around the statements to see what they returned, the last statement in the for loop just returned the entire dt2 object each time; I don't see how that helps. As for the comments at the top: the 600-byte comment length limit made it prohibitive to repeat the long code over and over - gkaupas 2012-04-16 12:49
:= updates by reference. See ?":=". Are you sure that dt2 (not df2) has not been updated? - Matt Dowle 2012-04-17 09:09
I see: if I change the last line inside the loop from dt2[ to dt1[, then dt1 does indeed change as desired. Otherwise apparently dt2 just overwrites itself. I will try this with some larger datasets this afternoon. I will still need to rbind in the new row(s) from dt2, but I already conceded to having to do that in my original post. Thanks - gkaupas 2012-04-17 14:38
Apols again, edited answer (dt2[ now dt1[) - Matt Dowle 2012-04-17 15:13
I ran 32 cases, with datasets ranging from 1,000,000 cells to 100,000,000 cells. The "merge" rate of this data.table loop was a pretty consistent 32,300 cells/second. Previously on the same machine, using mapply on data frames, I averaged 147,800 cells/second. I get 600,200 cells/second with SAS on ancient 1.35 GHz RISC CPUs (vs the 2.0 GHz Nehalem's I'm using for R). gc() says memory maxes at 4 Gb when combining the 2 100,000,000 cell data.tables, and the server has 8 Gb available and nothing else running - gkaupas 2012-04-18 17:13
Something is wrong then. If you post your test code so we can reproduce your timings then I (at least) will try to improve it - Matt Dowle 2012-04-18 20:23
You can see my Perl script here. It takes two (optional) arguments, the number of rows and columns to generate. It creates two input files with the first record containing column headings starting with K, then P000000, P000001, so on. Then rows of data with key values K000000 incrementing upward. The first file's cells are all A's. The second all B's. It generates an R script and has a system call to invoke it. The R script just reads in the two files, runs your loop, writes the result out as a CSV, reports the timings - gkaupas 2012-04-18 22:53
Wow. Ok, you know your stuff and it's a thorough set of tests. Reproduced here ok. Realised the for() loop was the wrong way around, should be by column (not row). Or, back to the normalized (flattened) route as first suggested. Looking back, the error combining bywithoutby with := in j is not yet implemented can be solved by setting mult="first", assuming that the keys are unique. Also assuming that the tests are simplified and we can't speed up by knowing input1 and input2 have the same number of rows with the same keys like the tests. Anyway, will modify the loop to go by column .. - Matt Dowle 2012-04-19 00:12


Score: 0

The following gives the correct answer for the small example data, tries to minimize the number of "copies" of tables, and uses the new fread and (new?) rbindlist. Does it work with your larger actual data set? I didn't quite follow all the comments in the original post about the memory issues you had when trying to flatten/normalize/stack, so apologies if you've already tried this route.

library(data.table)
library(reshape2)

cat("K,P1,P2,P3", "A,1,1,1", "B,1,1,1", "C,1,1,1", file="f1.dat", sep="\n")
cat("K,P1,P2,P3", "B,2,,2", "C,,2,2", "D,2,2,2", file="f2.dat", sep="\n")

dt1s<-data.table(melt(fread("f1.dat"), id.vars="K"), key=c("K","variable")) # read f1.dat, melt to long/stacked format, and convert to data.table

dt2s<-data.table(melt(fread("f2.dat"), id.vars="K", na.rm=T), key=c("K","variable")) # read f2.dat, melt to long/stacked format (removing NAs), and convert to data.table
setnames(dt2s,"value","value.new")

dt1s[dt2s,value:=value.new] # Update new values

dtout<-reshape(rbindlist(list(dt1s,dt1s[dt2s][is.na(value),list(K,variable,value=value.new)])), direction="wide", idvar="K", timevar="variable") # Use rbindlist to insert new records, and then reshape
setkey(dtout,K)
setnames(dtout,colnames(dtout),sub("value.", "", colnames(dtout))) # Clean up the column names
2013-01-16 04:16
by dnlbrky