Is the whole Map copied when a new binding is inserted?

I would like to better understand the interns of e.g. Data.Map. When I insert a new binding in a Map, then, because of immutability of data I get back a new data structure that is identical with the old data structure plus the new binding.

I would like to understand how this is achieved. Does the compiler eventually implement this by copying the whole data structure with e.g. millions of bindings? Can it generally be said that mutable data structures/arrays (e.g. Data.Judy) or imperative programming languages perform better in such cases? Does immutable data have any advantage when it comes to dictionaries/key-value stores?

haskell

2012-04-03 23:30
by J Fritsch

Map is built on a tree data structure. Basically, a new Map value is constructed, but it'll be filled almost entirely with pointers to the old structure. Since values never change in Haskell, this is a safe, and very important optimisation, known as sharing.

This means that you can have many similar versions of the same data structure hanging around, but only the branches of the tree that differ will be stored anew; the rest will simply be pointers to the original copy of the branch. And, of course, if you throw away the old Map, the branches you did change will be reclaimed by the garbage collector.

Sharing is key to the performance of immutable data structures. You might find this Wikipedia article helpful; it has some enlightening graphs showing how modified data gets represented with sharing.

2012-04-03 23:41
by ehird

Irrefragable answer - Nikita Volkov 2012-04-03 23:56

If this is all pointers every time where does the speed difference of insertions into Judy arrays come from - J Fritsch 2012-04-03 23:57

@JFritsch: Well, it still has to construct the modified parts of the tree. That's mitigated by laziness (only the parts of the tree you actually look at get constructed), but if you don't need any of the advantages immutability offers (much simpler programming model, you can keep multiple modified versions around without storing whole copies, etc.), then of course simply writing directly to memory will be faster. But it's often not as big a difference as you might think - ehird 2012-04-04 00:19

@JFritsch: After a map insert you need to copy O(log N) nodes for a tree with N elements. In fact, in general to work with a pure data structure you may have to do an extra logarithmic amount of work over a single mutable ephemeral data structure. But for insert into a tree, you would have to pay for this same amount of overhead just to recurse down to the nodes in the first place, so while a little more memory is allocated, the asymptotic performance of that operation remains the same. With a Judy array, it can use small hash tables internally, going O(1) in places - Edward KMETT 2012-04-04 01:31

@JFritsch: However, with immutable stores you can always revert to any previous version of the store, just by keeping a pointer to it. This gives you more or less free 'undo' functionality, and lets you evaluate multiple possible edits to the store (say, if it was a game board) in parallel with no extra reasoning or locking overhead - Edward KMETT 2012-04-04 01:33

No. The documentation for Data.Map.insert states that insertion takes O(log n) time. It would be impossible to satisfy that bound if it had to copy the entire structure.

2012-04-03 23:48
by hammar

Data.Map doesn't copy the old map; it (lazily) allocates O(log N) new nodes, which point to (and thereby share) most of the old map.

Because "updating" the map doesn't disrupt old versions, this kind of datastructure gives you greater freedom in building concurrent algorithms.

2012-04-04 00:16
by comingstorm