Tuesday, 22 December 2009

Zach Cox' word count challenge

Zach Cox published an interesting challenge on his blog recently: to traverse a directory hierarchy of files and compute the total distribution of word frequencies. Here is a simple F# solution:

open System.IO

let rec words_of_dir dir =
  seq { for dir in Directory.GetDirectories dir do
          yield! words_of_dir dir
        for file in Directory.GetFiles dir do
          use reader = new StreamReader(File.OpenRead file)
          while not reader.EndOfStream do
            yield! reader.ReadLine() |> String.split [' '; '.'; ':'; ';'; '?'] }

  |> words_of_dir
  |> Seq.countBy id
  |> Seq.sortBy (fun (_, n) -> -n)
  |> printf "%A\n"

Taking only 17.5s at only 14 lines of code, this F# combines the brevity of the most concise solution (Ruby) with the performance of the fastest solution (Scala).


Nathan said...

Are you getting the Ruby/Scala numbers from the same machine?

I wouldn't compare the numbers either way until it does the same set of operations (split via regex, 2x sorting, write results to file) either.

.NET 4.0 does make this a bit cleaner, and using a mutable dictionary will probably always be faster than Seq.countBy. Switching to String.split doubles the performance again.

Jules said...

You code does not write the output in alphabetic order, and it doesn't write to a file. Here is my Ruby solution to your problem:

words = Hash.new 0
for file in Dir["thedir/**/*.txt"]
File.read(file).scan(/\w+/).each{|w| words[w] += 1}
puts words.sort_by{|w,n| -n}

And if you want alphabetic too:

puts words.sort