Speeding up I/O on very large files in Elixir

Suraj N. Kurapati


  1. Problem
  2. Approach
  3. Solution
  4. Conclusion

          Problem

          At work today, a colleague needed to process a very large (1.5 GiB) log file, line by line, using Elixir 1.4.2 and Erlang 19.2 on Linux 3.16. A simple iex experiment yielded abysmal performance, taking 8 hours and 7 minutes to finish!

          file = "path/to/very/large/file"
          lines = file |> File.stream!
          grep = lines |> Enum.filter(&String.match?(&1, ~r/regex_to_grep/))
          

          In contrast, a similar experiment in Python 2.7.8 finished in merely 4 seconds:

          import re
          regex_to_grep = re.compile('regex_to_grep')
          file = open('path/to/very/large/file', 'r')
          grep = [line for line in file if regex_to_grep.search(line)]
          

          Approach

          My first instinct was to parallelize the grep operation, but that didn’t help:

          file = "path/to/very/large/file"
          lines = file |> File.stream!
          grep = lines |>
            Task.async_stream(&(String.match?(&1, ~r/regex_to_grep/) and &1)) |>
            Stream.map(fn {:ok, async_stream_result} -> async_stream_result end) |>
            Enum.filter(&is_binary/1)
          

          Surprised, I retreated to a simple pass over the entire file that merely counted its lines:

          file = "path/to/very/large/file"
          lines = file |> File.stream!
          num_lines = lines |> Enum.count
          

          Inconceivably, it hung again! So I now suspected that something might be wrong with the log file itself. Paging through it, I discovered an enormous sequence of NUL bytes entombed several thousand lines deep. That was unexpected, but why should NUL bytes only affect Elixir and not Python? Were C programs immune too?

          $ grep -c 'regex_to_grep' path/to/very/large/file
          grep: line too long
          

          Aha! GNU Grep actually had a built-in safeguard that prevents memory exhaustion by aborting on very long lines, and this file’s lines had likely been elongated by that enormous NUL sequence.
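
          As an aside, the same suspicion could also be confirmed from within Elixir without ever materializing a full line: for instance, a sketch along these lines streams the file in raw chunks and scans them for NUL bytes (the 64 KiB chunk size is an arbitrary choice):

          file = "path/to/very/large/file"

          # stream raw 64 KiB chunks instead of lines, so an absurdly long
          # "line" cannot blow up memory while we look for NUL bytes
          has_nul? = file |> File.stream!([], 64 * 1024) |>
            Enum.any?(&(:binary.match(&1, <<0>>) != :nomatch))

          IO.puts("contains NUL bytes? #{has_nul?}")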

          Perhaps excessive line lengths also paralyzed Elixir? To verify, I wrote my own file reader that would lazily read the file one 128 KiB chunk at a time (32 filesystem blocks, which are commonly 4 KiB each) and then split each chunk into lines:

          file = "path/to/very/large/file"
          lines = file |> File.stream!([], 32 * 4096) |>
            Stream.map(&String.split(&1, ~r/(?<=\n)/)) |>
            Stream.transform("", fn lines, acc ->
              if Enum.empty?(lines) do
                # empty chunk: flush any leftover partial line (or halt if none)
                if acc == "" do
                  {:halt, acc}
                else
                  {[acc], ""}
                end
              else
                # prepend previous partial line to this chunk
                [first_line | rest_lines] = lines
                lines = [acc <> first_line | rest_lines]

                # carry any partial line at end of this chunk into the accumulator
                if List.last(lines) |> String.ends_with?("\n") do
                  {lines, ""}
                else
                  {whole_lines, [partial_line]} = Enum.split(lines, -1)
                  {whole_lines, partial_line}
                end
              end
            end)
          num_lines = lines |> Enum.count
          

          This workaround successfully prevented the enormous NUL byte sequence from hanging Elixir (hurray!), but the overall operation still took 24 minutes to complete.

          How do I optimize this further? Well, I would imagine that Elixir already uses a similar (in spirit) but heavily optimized line-splitting algorithm, so eliminating my own would be ideal (deleted code is debugged code). In addition, I also have control over the size of the chunks I read from the file.
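
          To get a feel for that second knob, a rough timing harness along these lines could compare raw chunk-reading throughput across a few candidate chunk sizes (an illustrative sketch only: the sizes are arbitrary and line splitting is deliberately skipped):

          file = "path/to/very/large/file"

          # time a plain chunk count for several candidate chunk sizes
          for chunk_size <- [4 * 1024, 32 * 1024, 128 * 1024, 1024 * 1024] do
            {usec, chunks} = :timer.tc(fn ->
              file |> File.stream!([], chunk_size) |> Enum.count
            end)

            IO.puts("#{div(chunk_size, 1024)} KiB chunks: #{chunks} read in #{div(usec, 1000)} ms")
          end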

          Solution

          Following this intuition, I searched the web and unearthed two disparate hints:

          1. Use IO.binstream/2 to read the file as a raw byte sequence (not as UTF-8, which is the case with IO.stream/2 and thus with File.stream!/2 as well). http://stackoverflow.com/a/31397000

          2. Ask for large chunks while reading the file to counteract the overhead of :file.read/2 and use additional buffering on top of it for performance. http://stackoverflow.com/a/9771547

          Applying these hints yielded a 16x speedup, cutting the execution time down to 90 seconds:

          file = "path/to/very/large/file"
          io = file |> File.open!(read_ahead: 128 * 1024) # 128 KiB generous caching
          lines = io |> IO.binstream(:line)
          num_lines = lines |> Enum.count
          

          That’s actually good performance, considering that plain wc took 74 seconds just to count lines, words, and bytes:

          $ time wc path/to/very/large/file
           92474     498826 1603770580 path/to/very/large/file
          wc path/to/very/large/file  73.57s user 0.84s system 100% cpu 1:14.41 total
          

          Although this performance still pales in comparison to Python’s 4 seconds, it’s a reasonable trade-off considering Elixir’s merits over Python in concurrency and fault tolerance. Furthermore, the remaining slowness might stem from the underlying algorithms Elixir uses to fill buffers and split lines, given how closely the execution time tracks that of wc.

          Conclusion

          Read large files as raw binary streams (not as UTF-8) with a generous read-ahead buffer:

          file = "path/to/very/large/file"
          io = file |> File.open!(read_ahead: 128 * 1024) # 128 KiB generous caching
          lines = io |> IO.binstream(:line)
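
          From there, downstream processing composes as usual; for example, the original grep task maps onto this recipe roughly as follows (an illustrative sketch, not timed separately in the measurements above):

          file = "path/to/very/large/file"
          io = file |> File.open!(read_ahead: 128 * 1024)

          # filter matching lines out of the raw binary line stream
          grep = io |> IO.binstream(:line) |>
            Enum.filter(&String.match?(&1, ~r/regex_to_grep/))

          File.close(io)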
          

          Updates