I've written a program in Haskell which has to load and parse a big UTF-8 text file. The file represents a dictionary with key:value pairs, one per line. In my program I want a Data.Map container for fast dictionary search. My file is about 40 MB, but after loading it my program uses 1.5 GB of RAM, which is never freed. What did I do wrong? Is this memory usage expected?
Here is a code sample from my program:
module Main where

import Engine
import Codec.Archive.Zip
import Data.Char (toLower)
import Data.IORef
import System.IO
import System.Directory
import qualified System.IO.UTF8 as UTF8
import qualified Data.ByteString.Lazy as B
import qualified Data.ByteString.UTF8 as BsUtf
import qualified Data.Map as Map
import Graphics.UI.Gtk
import Graphics.UI.Gtk.Glade

maybeRead :: Read a => BsUtf.ByteString -> Maybe a
maybeRead s = case reads $ BsUtf.toString s of
    [(x, "")] -> Just x
    _         -> Nothing

parseToEntries :: [BsUtf.ByteString] -> [(BsUtf.ByteString, Int)]
parseToEntries [] = []
parseToEntries (x:xs) = let (key, svalue) = BsUtf.break (== ':') x
                            value = maybeRead svalue
                        in case value of
                             Just x  -> [(key, x)] ++ parseToEntries xs
                             Nothing -> parseToEntries xs

createDict :: BsUtf.ByteString -> IO (Map.Map BsUtf.ByteString Int)
createDict str = do
    let entries = parseToEntries $ BsUtf.lines str
        dict = Map.fromList entries
    return dict

main :: IO ()
main = do
    currFileName <- newIORef ""
    dictZipFile <- B.readFile "data.db"
    extractFilesFromArchive [] $ toArchive dictZipFile
    dictFile <- UTF8.readFile "dict.txt"
    dict <- createDict $ BsUtf.fromString dictFile
    ...

searchAccent :: Map.Map BsUtf.ByteString Int -> String -> Int
searchAccent dict word = let sword = BsUtf.fromString $ map toLower word
                             entry = Map.lookup sword dict
                         in case entry of
                              Nothing -> -1
                              Just match -> 0
The ++ in this case is not relevant - max taldykin 2012-04-05 21:21
Why this use of ++ works is shown in Okasaki's book, page 9 - max taldykin 2012-04-05 21:43
Quick answer.
The main problem is that System.IO.UTF8.readFile reads the file into a String.
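To put rough numbers on it: String is a lazy linked list of Char, so on a 64-bit GHC each character can cost up to five machine words (a 3-word cons cell plus a 2-word boxed Char), i.e. about 40 bytes. A mostly one-byte-per-character 40 MB UTF-8 file decodes to roughly 40 million Chars, and 40M x 40 bytes is about 1.6 GB, which is right around the 1.5 GB you are seeing.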
The suspected bottleneck is here:
dictFile <- UTF8.readFile "dict.txt"
dict <- createDict $ BsUtf.fromString dictFile
When dealing with UTF-8 text it is better to use Data.Text instead of ByteString.
Try something like this:
import qualified Data.Text.Lazy as LT
import qualified Data.Text.Lazy.Encoding as LT
...
dictFile <- B.readFile "dict.txt"
dict <- createDict $ LT.decodeUtf8 dictFile
Another bottleneck is number parsing: you are converting each ByteString to a String and then reading it. It is better to use Data.Text.Lazy.Read:
import qualified Data.Text.Lazy.Read as LT

maybeRead :: LT.Text -> Maybe Int
maybeRead s = case LT.decimal s of
    Left _       -> Nothing
    Right (i, _) -> Just i   -- decimal returns the value plus any leftover text
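To tie the pieces together, here is a minimal self-contained sketch of the loading path with both changes applied. It is not from the original program; note that break leaves the ':' at the front of the value, so the sketch drops it before parsing:

import qualified Data.ByteString.Lazy as B
import qualified Data.Map as Map
import qualified Data.Text.Lazy as LT
import qualified Data.Text.Lazy.Encoding as LT
import qualified Data.Text.Lazy.Read as LT

maybeRead :: LT.Text -> Maybe Int
maybeRead s = case LT.decimal s of
    Left _       -> Nothing
    Right (i, _) -> Just i

parseToEntries :: [LT.Text] -> [(LT.Text, Int)]
parseToEntries [] = []
parseToEntries (x:xs) =
    let (key, svalue) = LT.break (== ':') x
    in case maybeRead (LT.drop 1 svalue) of  -- drop the ':' that break leaves in place
         Just v  -> (key, v) : parseToEntries xs
         Nothing -> parseToEntries xs

createDict :: LT.Text -> Map.Map LT.Text Int
createDict = Map.fromList . parseToEntries . LT.lines

main :: IO ()
main = do
    dictFile <- B.readFile "dict.txt"
    let dict = createDict (LT.decodeUtf8 dictFile)
    print (Map.size dict)  -- forcing the map lets the decoded text be collected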
The Haskell String type is an indirect (because of laziness) linked list of characters; it is extremely wasteful space-wise. You may wish to try Data.Text (from http://hackage.haskell.org/package/text) instead for large amounts of text.

(Edit: now that the source is up, I see the strings are lazy ByteString instead of String, so this is not relevant. Profiling is the next step.)
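For reference, the usual GHC profiling workflow looks roughly like this (use -auto-all instead of -fprof-auto on GHC older than 7.4):

$ ghc -O2 -rtsopts -prof -fprof-auto Main.hs
$ ./Main +RTS -s -RTS        # allocation and GC summary on stderr
$ ./Main +RTS -hc -p -RTS    # heap profile by cost centre (writes Main.hp, Main.prof)
$ hp2ps -c Main.hp           # renders the heap profile to Main.ps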
The only String use is in maybeRead, the arguments to which are presumably relatively small - ehird 2012-04-05 21:34
The code uses ByteString (which is not ideal but far better than String). It is possible you're too lazy and have thunks built up; the next step is to use profiling to see where the space is going - geekosaur 2012-04-05 21:40
The ++ syntax is memory-expensive, whereas the cons operator (:) is cheap. Is it possible to use something like (key, x) : parseToEntries xs? Again... my Haskell is very rusty, so this might be way off - jpm 2012-04-05 21:16
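For what it's worth, max taldykin's point above can be seen directly from the definition of (++): a singleton left operand reduces away in one step, so the two spellings allocate essentially the same:

[(key, x)] ++ rest
  = (key, x) : ([] ++ rest)   -- second equation of (++)
  = (key, x) : rest           -- first equation of (++)

The quadratic behaviour of (++) only appears when the left operand is itself a growing list, as in a left-nested chain like ((xs ++ ys) ++ zs).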