I've written a program in Haskell which has to load and parse a big UTF-8 text file. The file represents a dictionary with key:value pairs, one per line. In my program I want a Data.Map container for fast dictionary search. My file is about 40 MB, but after loading it my program uses 1.5 GB of RAM, which is never freed. What did I do wrong? Is this memory usage expected?
Here is a code sample from my program:
module Main where

import Engine
import Codec.Archive.Zip
import Data.Char (toLower)
import Data.IORef
import System.IO
import System.Directory
import qualified System.IO.UTF8 as UTF8
import qualified Data.ByteString.Lazy as B
import qualified Data.ByteString.UTF8 as BsUtf
import qualified Data.Map as Map
import Graphics.UI.Gtk
import Graphics.UI.Gtk.Glade

maybeRead :: Read a => BsUtf.ByteString -> Maybe a
maybeRead s = case reads $ BsUtf.toString s of
    [(x, "")] -> Just x
    _         -> Nothing

parseToEntries :: [BsUtf.ByteString] -> [(BsUtf.ByteString, Int)]
parseToEntries [] = []
parseToEntries (x:xs) = let (key, svalue) = BsUtf.break (== ':') x
                            value = maybeRead svalue
                        in case value of
                             Just x  -> [(key, x)] ++ parseToEntries xs
                             Nothing -> parseToEntries xs

createDict :: BsUtf.ByteString -> IO (Map.Map BsUtf.ByteString Int)
createDict str = do
    let entries = parseToEntries $ BsUtf.lines str
        dict = Map.fromList entries
    return dict

main :: IO ()
main = do
    currFileName <- newIORef ""
    dictZipFile <- B.readFile "data.db"
    extractFilesFromArchive [] $ toArchive dictZipFile
    dictFile <- UTF8.readFile "dict.txt"
    dict <- createDict $ BsUtf.fromString dictFile
    ...

searchAccent :: Map.Map BsUtf.ByteString Int -> String -> Int
searchAccent dict word = let sword = BsUtf.fromString $ map toLower word
                             entry = Map.lookup sword dict
                         in case entry of
                              Nothing -> -1
                              Just match -> 0
The ++ in this case is not relevant - max taldykin 2012-04-05 21:21
Why this use of ++ works is shown in Okasaki's book, page 9 - max taldykin 2012-04-05 21:43
Quick answer.
The main problem is that System.IO.UTF8.readFile reads the file into a String.
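To put rough numbers on it: String is a lazy linked list of Char, so on a 64-bit GHC each character can cost up to five machine words (a 3-word cons cell plus a 2-word boxed Char), i.e. about 40 bytes. A mostly one-byte-per-character 40 MB UTF-8 file decodes to roughly 40 million Chars, and 40M x 40 bytes is about 1.6 GB, which is right around the 1.5 GB you are seeing.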
The suspected bottleneck is here:
dictFile <- UTF8.readFile "dict.txt"
dict <- createDict $ BsUtf.fromString dictFile
When dealing with UTF-8 text it is better to use Data.Text instead of ByteString.
Try something like this:
import qualified Data.Text.Lazy as LT
import qualified Data.Text.Lazy.Encoding as LT
...
dictFile <- B.readFile "dict.txt"
dict <- createDict $ LT.decodeUtf8 dictFile
Another bottleneck is number parsing: you are converting each ByteString to a String and then reading it. It is better to use Data.Text.Lazy.Read:
import qualified Data.Text.Lazy.Read as LT

maybeRead :: LT.Text -> Maybe Int
maybeRead s = case LT.decimal s of
    Left _       -> Nothing
    Right (i, _) -> Just i   -- decimal returns the value plus any leftover text
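To tie the pieces together, here is a minimal self-contained sketch of the loading path with both changes applied. It is not from the original program; note that break leaves the ':' at the front of the value, so the sketch drops it before parsing:

import qualified Data.ByteString.Lazy as B
import qualified Data.Map as Map
import qualified Data.Text.Lazy as LT
import qualified Data.Text.Lazy.Encoding as LT
import qualified Data.Text.Lazy.Read as LT

maybeRead :: LT.Text -> Maybe Int
maybeRead s = case LT.decimal s of
    Left _       -> Nothing
    Right (i, _) -> Just i

parseToEntries :: [LT.Text] -> [(LT.Text, Int)]
parseToEntries [] = []
parseToEntries (x:xs) =
    let (key, svalue) = LT.break (== ':') x
    in case maybeRead (LT.drop 1 svalue) of  -- drop the ':' that break leaves in place
         Just v  -> (key, v) : parseToEntries xs
         Nothing -> parseToEntries xs

createDict :: LT.Text -> Map.Map LT.Text Int
createDict = Map.fromList . parseToEntries . LT.lines

main :: IO ()
main = do
    dictFile <- B.readFile "dict.txt"
    let dict = createDict (LT.decodeUtf8 dictFile)
    print (Map.size dict)  -- forcing the map lets the decoded text be collected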
The Haskell String type is an indirect (because of laziness) linked list of characters; it is extremely wasteful space-wise. You may wish to try Data.Text (from http://hackage.haskell.org/package/text) instead for large amounts of text.

(Edit: now that the source is up, I see the strings are lazy ByteString instead of String, so this is not relevant. Profiling is the next step.)
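For reference, the usual GHC profiling workflow looks roughly like this (use -auto-all instead of -fprof-auto on GHC older than 7.4):

$ ghc -O2 -rtsopts -prof -fprof-auto Main.hs
$ ./Main +RTS -s -RTS        # allocation and GC summary on stderr
$ ./Main +RTS -hc -p -RTS    # heap profile by cost centre (writes Main.hp, Main.prof)
$ hp2ps -c Main.hp           # renders the heap profile to Main.ps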
The only String use is in maybeRead, the arguments to which are presumably relatively small - ehird 2012-04-05 21:34
The code uses ByteString (which is not ideal but far better than String). It is possible you're too lazy and have thunks built up; the next step is to use profiling to see where the space is going - geekosaur 2012-04-05 21:40
The ++ syntax is memory-expensive, whereas the cons operator (:) is cheap. Is it possible to use something like (key, x) : parseToEntries xs? Again... my Haskell is very rusty, so this might be way off - jpm 2012-04-05 21:16
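For what it's worth, max taldykin's point above can be seen directly from the definition of (++): a singleton left operand reduces away in one step, so the two spellings allocate essentially the same:

[(key, x)] ++ rest
  = (key, x) : ([] ++ rest)   -- second equation of (++)
  = (key, x) : rest           -- first equation of (++)

The quadratic behaviour of (++) only appears when the left operand is itself a growing list, as in a left-nested chain like ((xs ++ ys) ++ zs).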