Monday, 22 February 2010

The BookWorm challenge

Chris Smith and Brian McNamara of Microsoft have been squabbling like mildly-naughty henchmen in recent blog posts over F# solutions to the BookWorm challenge, a word search on a hexagonal board:

Despite their best efforts, their results make them look more like plucky comic relief than genuinely evil. We'll save you the bluster and focus on destroying their entire line of thinking. Suffice to say, our blog post should be read with an English accent.

We present a solution that is simultaneously less than half the length of their shortest solution and over eight times faster than their fastest solution on the largest example problem they have given to date. Moreover, our solution requires no search data structure and, consequently, uses far less memory. In fact, a simple alteration can even evade the need to store the entire dictionary, reducing the total memory consumption to only a few bytes for each word and the puzzle itself.

To date, Chris and Brian have developed in a single direction: using a trie data structure to perform asymptotically fast string searches. This sounds like a good idea but, in fact, the sole purpose of a trie is to be asymptotically efficient in the limit of infinite-length keys. Consequently, tries work wonders when your keys are giga-length strings such as DNA sequences but, as Chris noted, the keys for this problem are English words under 16 characters in length. As such small strings easily fit into a cache line, traversing them directly is very fast. In contrast, a tree-based search data structure that incurs an indirection at every character imposes a much higher constant prefactor in the time taken. Moreover, their optimizations have come at the expense of correctness because they lost the ability to handle board positions containing multiple characters such as Qu in their own example board pictured above.

Our program uses a simple string array array to represent the board such that positions may contain multiple characters and a string array to represent the dictionary. Each word is searched for in the board in turn and those found are printed and, finally, the total number of words found in the board is printed. The entire program is around 40 lines of code and the whole process, including loading the dictionary, takes only 0.9s on this 8-core machine. Here is the complete program:

let data = [|[|"O"; "U"; "B"; "N"; "O"; "V"; "S"|]; [|"J"; "A"; "P"; "A"; "L"; "O"; "M"|]; [|"G"; "T"; "E"; "J"; "T"; "S"; "H"|]; [|"U"; "A"; "E"; "O"; "F"; "S"; "A"|]; [|"N"; "O"; "A"; "E"; "L"; "E"; "P"|]; [|"O"; "S"; "O"; "U"; "G"; "N"; "P"|]; [|"A"; "T"; "F"; "G"; "A"; "I"; "I"|]; [|"W"; "U"; "B"; "N"; "O"; "V"; "S"|]; [|"J"; "E"; "P"; "A"; "L"; "O"; "M"|]; [|"G"; "S"; "E"; "J"; "T"; "Y"; "C"|]; [|"U"; "A"; "O"; "O"; "F"; "Y"; "R"|]; [|"N"; "O"; "M"; "E"; "L"; "U"; "T"|]; [|"O"; "S"; "O"; "E"; "G"; "E"; "I"|]; [|"H"; "T"; "F"; "N"; "A"; "I"; "Y"|]; [|"O"; "U"; "B"; "N"; "E"; "V"; "S"|]; [|"J"; "A"; "P"; "A"; "S"; "O"; "M"|]; [|"G"; "T"; "E"; "J"; "T"; "S"; "C"|]; [|"U"; "A"; "E"; "O"; "F"; "Y"; "R"|]; [|"N"; "O"; "A"; "E"; "L"; "U"; "T"|]; [|"O"; "S"; "O"; "U"; "G"; "E"; "I"|]; [|"H"; "T"; "F"; "G"; "A"; "I"; "Y"|]|] let rows, cols = data.Length, data.[0].Length let read file = System.IO.File.ReadAllLines(@"C:\Users\Jon\Documents\" + file) let words = let ignored = HashSet(read "IgnoredWords.txt", HashIdentity.Structural) [|for word in read "TWL06.txt" do if word.Length > 2 && not(ignored.Contains word) then yield word|] let neighbors = [|for y in 0..rows-1 -> [|for x in 0..cols-1 -> [|for dx, dy in [-1, -1; 0, -1; -1, 0; 1, 0; -1, 1; 0, 1] do let x, y = x + dx + (if y%2=1 && dy<>0 then 1 else 0), y + dy if x>=0 && x<cols && y>=0 && y<rows then yield x, y|]|]|] let rec eq (word: string) i (elt: string) j = j = elt.Length || i < word.Length && word.[i] = elt.[j] && eq word (i+1) elt (j+1) let rec search (visited: _ [] []) word i (x, y) = let elt = data.[y].[x] eq word i elt 0 && let i = i + elt.Length i = word.Length || (visited.[y].[x] <- true let pred(x, y) = not visited.[y].[x] && search visited word i (x, y) let result = Array.exists pred neighbors.[y].[x] visited.[y].[x] <- false result) let xys = [|for y in 0..rows-1 do for x in 0..cols-1 do yield x, y|] let contains (word: string) = let visited = Array.map (Array.map (fun _ -> false)) data if Array.exists (search visited word 0) xys then Some word else None do let words = Array.Parallel.choose contains words Array.iter (printfn "%s") words printfn "%d words" words.Length

10 comments:

Anonymous said...

I found this too - my old WordFind program for Palm (www.wordfind.org) originally used the input pattern to drive the search of the DAWG (Trie), but eventually I realized that for most searches simply iterating through the dictionary and seeing if each word matched the pattern was not only faster but was better bounded in the upper limit. That broke down for multi-word search where I still reverted to pattern-driven matching, but for single word searches it worked well. Not always faster but certainly more general and way faster in many cases.

Ben said...

One thing that both Brian's and Chris's solutions have over yours is a more detailed explanation ;)

Flying Frog Consultancy Ltd. said...

@Ben: Did we not explain it fully enough for you? Oh, how evil of us...

Anonymous said...

Is this correct?
neighbors.[1].[0] = [|(0,0); (1,0); (1,1); (0,2); (1,2)|]

Anonymous said...

Yes, it is.

Anonymous said...

A man is only as good as what he loves.


--------------
Kyoto University

Anonymous said...

greetings I am looking for men to come to my Guild Caesary come check it out. and join my group please! hawks raiders at http://www.lekool.com. Been playing Caesary game for 6 weeks. Caesary has been the best browser game in a long time!

[url=http://caesary.lekool.com]Caesary[/url]
[url=http://caesary.lekool.com]browser based[/url]
[url=http://video.lekool.com]Game Videos[/url]
[url=http://dc.lekool.com]Game Videos[/url]

Anonymous said...

To be both a speaker of words and a doer of deeds.

-----------------------------------

Anonymous said...

Ok, so if you include the time taken to initialise the trie data structure your approach wins out.

How does it fare once created though? Ie if you were to make a bulk bookworm solver responsible for processing thousands of boards, does the trie then begin paying back dividends?

Anonymous said...

hetfcmutb http://www.free-mass-traffic.net/free-mass-traffic/free-mass-traffic-review Free Mass Traffic