A parallel grep(1) prototype in Elixir

Suraj N. Kurapati

31 July 2016

Note: This all happened a year ago. I’m just late in writing about it.

A colleague needed a way to quickly grep(1) through several terabytes of plain text reports at work today, so I wrote a basic prototype of parallel grep(1) in Elixir during my lunch break. Its performance, to my surprise, was astonishing!

$ cat search.exs
defmodule Search do
  def launch(regex, files) do
    files |> Stream.reject(&File.dir?/1) |> Enum.map(fn file ->
      spawn(__MODULE__, :search, [regex, file, self])
    end)
  end

  def search(regex, file, receiver) do
    file |> File.stream! |> Enum.each(fn line ->
      case Regex.run(regex, line) do
        nil -> ; # line didn't match so ignore it
        matches -> send receiver, {file, matches}
      end
    end)
  end

  def results(count\\0) do
    receive do
      _result -> results(count+1) # consume the message and loop again
      after 1000 -> IO.puts "got #{count} results; no more in last 1s"
    end
  end
end

[pattern|files] = System.argv
regex = Regex.compile!(pattern)
Search.launch(regex, files)
Search.results

Extracting lines, along with capture groups, that matched a Perl-compatible regular expression from a small sample of 17GB of plain text reports took 19 seconds for the prototype (note that it waits for an extra second after finding the last search result), compared to 1 minute and 50 seconds for GNU grep 2.5.1:

$ regex='^\s*(\S+)(?:\s+([\d\.e+-]+)){4,}'

$ du -csh /some/*/location/ | tail -1
17G     total

$ time elixir -r search.exs -- "$regex" /some/*/location/*
165984 results; nothing more in last 1s
elixir -r search.exs --    220.98s user 6.38s system 1147% cpu 19.816 total

$ time grep -Pq "$regex" /some/*/location/*
grep -Pq    31.92s user 6.51s system 34% cpu 1:50.30 total

This meant that the Elixir prototype was 5.7 times faster than grep(1), but why?

The answer is parallelism: grep(1) is a single OS process that handles a single file at a time. Whereas in the Elixir prototype, each file is handled by its own dedicated Erlang “process”, which is an actor (see also Command Pattern) that is significantly lighter than any OS process or thread. This difference is readily seen in the CPU utilization statistics reported by the time(1) command:

	User time	System time	CPU utilization	Elapsed time
Elixir	220.98s user	6.38s system	1147% cpu	19.816 total
grep(1)	31.92s user	6.51s system	34% cpu	1:50.30 total

Here we see that the Elixir prototype utilized 11.47 CPU cores worth of computational resources, whereas grep(1) utilized only 34% of only one CPU core. Since the machine that ran this search had 24 CPU cores, several of which were already under heavy use by ~800 other users on it, this was a remarkable result.

In short, parallelism won. And so did Elixir and the Erlang VM, for this was my first time, in 18 years of programming, seeing my code exhaust a machine’s entire CPU capacity with no extra effort on my part! The revolution has arrived.

Comments

NOTE: Comments are temporarily disabled on this page while I migrate them from Disqus to Gisqus (GitHub Discussions) because the former has 💪 forced the display of advertisements 🤑 on this website recently.