Speeding up I/O on very large files in Elixir

Suraj N. Kurapati

  1. Problem
  2. Approach
  3. Solution
  4. Conclusion


          Problem

          At work today, a colleague needed to process a very large (1.5 GiB) log file, line by line, using Elixir 1.4.2 and Erlang 19.2 on Linux 3.16. A simple iex experiment yielded abysmal performance, taking 8 hours and 7 minutes to finish!

          file = "path/to/very/large/file"
          lines = file |> File.stream!
          grep = lines |> Enum.filter(&String.match?(&1, ~r/regex_to_grep/))

          In contrast, a similar experiment in Python 2.7.8 finished in merely 4 seconds:

          import re
          regex_to_grep = re.compile('regex_to_grep')
          file = open('path/to/very/large/file', 'r')
          grep = [line for line in file if regex_to_grep.search(line)]


          Approach

          My first instinct was to parallelize the grep operation, but that didn’t help:

          file = "path/to/very/large/file"
          lines = file |> File.stream!
          grep = lines |>
            Task.async_stream(&(String.match?(&1, ~r/regex_to_grep/) and &1)) |>
            Stream.map(fn {:ok, async_stream_result} -> async_stream_result end) |>
            Enum.filter(& &1) # drop lines that did not match

          Surprised, I retreated to a simple loop over the entire length of the file:

          file = "path/to/very/large/file"
          lines = file |> File.stream!
          num_lines = lines |> Enum.count

          Inconceivably, it hung again! So I now suspected that something was wrong with the log file itself. Paging through it, I discovered an enormous sequence of NUL bytes embedded several thousand lines deep. That was unexpected, but why should NUL bytes only affect Elixir and not Python? Were C programs immune as well?

          $ grep -c 'regex_to_grep' path/to/very/large/file
          grep: line too long

          Aha! GNU Grep has a built-in safeguard against memory exhaustion: it aborts on very long lines, such as the one elongated by that enormous NUL sequence.
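          To see why such a file defeats line-oriented tools, here is a small Python sketch (the file contents and the 1 MiB size are made up for illustration) that embeds a newline-free NUL run in an otherwise ordinary file:

```python
import os
import tempfile

# Build a small stand-in for the corrupted log: ordinary lines with a
# long run of NUL bytes embedded between them as a single "line".
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"before\n")
    f.write(b"\x00" * (1 << 20) + b"\n")  # 1 MiB of NULs, one newline
    f.write(b"after\n")

# A line-oriented reader must buffer the entire NUL run as one line,
# because nothing inside it terminates a line.
with open(path, "rb") as f:
    lines = f.readlines()

print(len(lines))     # 3 lines total
print(len(lines[1]))  # but the middle one is over a mebibyte long
os.remove(path)
```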

          Perhaps excessive line lengths also paralyzed Elixir? To verify, I wrote my own file reader that would lazily read a single 128 KiB chunk (32 filesystem blocks, at the common 4 KiB block size) at a time and then split it into lines:

          file = "path/to/very/large/file"
          lines = file |> File.stream!([], 32 * 4096) |>
            Stream.map(&String.split(&1, ~r/(?<=\n)/)) |>
            Stream.transform("", fn lines, acc ->
              if Enum.empty?(lines) do
                if acc == "" do
                  {:halt, acc}
                else
                  {[acc], ""}
                end
              else
                # prepend previous partial line to this chunk
                [first_line | rest_lines] = lines
                lines = [acc <> first_line | rest_lines]

                # carry any partial line at end of this chunk
                if lines |> List.last() |> String.ends_with?("\n") do
                  {lines, ""}
                else
                  {lines, [acc]} = Enum.split(lines, -1)
                  {lines, acc}
                end
              end
            end)
          num_lines = lines |> Enum.count

          This workaround successfully prevented the enormous NUL byte sequence from hanging Elixir (hurray!), but the overall operation still took 24 minutes to complete.
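          The chunk-and-carry idea above translates directly to other languages. Here is a minimal Python sketch of the same algorithm (the function name iter_lines and the tiny test data are my own):

```python
import io

def iter_lines(f, chunk_size=32 * 4096):
    """Yield newline-terminated lines from fixed-size chunks, carrying
    any partial line at the end of one chunk over into the next."""
    acc = b""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            if acc:  # leftover partial line at end of file
                yield acc
            return
        pieces = (acc + chunk).split(b"\n")
        acc = pieces.pop()  # b"" when the chunk ended on a newline
        for piece in pieces:
            yield piece + b"\n"

# A tiny chunk size exercises the carry logic across chunk boundaries.
lines = list(iter_lines(io.BytesIO(b"one\ntwo\nthree\nfour"), chunk_size=4))
print(lines)  # [b'one\n', b'two\n', b'three\n', b'four']
```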

          How could I optimize this further? Well, I imagined that Elixir already uses a similar (in spirit) but heavily optimized line-splitting algorithm of its own, so eliminating mine would be ideal (deleted code is debugged code). In addition, I had control over the size of the chunks I read from the file.


          Solution

          Following this intuition, I searched the web and unearthed two disparate hints:

          1. Use IO.binstream/2 to read the file as a raw byte sequence (not as UTF-8, which is the case with IO.stream/2 and thus with File.stream!/2 as well). http://stackoverflow.com/a/31397000

          2. Ask for large chunks while reading the file to counteract the overhead of :file.read/2 and use additional buffering on top of it for performance. http://stackoverflow.com/a/9771547
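          For comparison, the Python analogue of both hints is to open the file in binary mode with a large buffer (a tiny temporary file stands in here for the 1.5 GiB log):

```python
import os
import tempfile

# A tiny stand-in file; the article's real log was 1.5 GiB.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"alpha\nbeta\ngamma\n")

# "rb" skips UTF-8 decoding (hint 1); buffering=128 KiB amortizes the
# cost of each underlying read, much like read_ahead: 128 * 1024 (hint 2).
with open(path, "rb", buffering=128 * 1024) as f:
    num_lines = sum(1 for _ in f)

print(num_lines)  # 3
os.remove(path)
```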

          Applying these hints yielded a 16x speedup, cutting execution time from 24 minutes to 90 seconds:

          file = "path/to/very/large/file"
          io = file |> File.open!(read_ahead: 128 * 1024) # 128 KiB generous caching
          lines = io |> IO.binstream(:line)
          num_lines = lines |> Enum.count

          That’s actually good performance, since wc took 74 seconds to count characters:

          $ time wc path/to/very/large/file
           92474     498826 1603770580 path/to/very/large/file
          wc path/to/very/large/file  73.57s user 0.84s system 100% cpu 1:14.41 total

          Although this performance still pales in comparison to Python’s 4 seconds, it’s a reasonable trade-off considering the concurrency and fault-tolerance merits of Elixir over Python. Furthermore, the remaining gap might stem from the underlying algorithms Elixir uses to fill buffers and split lines, given how closely its execution time matches wc’s when counting characters.


          Conclusion

          Read large files as raw binary streams (not as UTF-8) with generous read-ahead caching:

          file = "path/to/very/large/file"
          io = file |> File.open!(read_ahead: 128 * 1024) # 128 KiB generous caching
          lines = io |> IO.binstream(:line)