Speeding up I/O on very large files in Elixir

Suraj N. Kurapati

  1. Summary
  2. Problem
  3. Approach
  4. Solution

          Summary

          Read large files as raw binary streams (instead of UTF-8) with generous caching:

          file = "path/to/very/large/file"
          io = file |> File.open!(read_ahead: 128 * 1024) # 128 KiB generous caching
          lines = io |> IO.binstream(:line)
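
          For example, the resulting stream can be consumed like any other enumerable and the device closed once it is no longer needed. This is only a sketch, reusing the placeholder regex from the experiments below:

          grep = lines |> Enum.filter(&String.match?(&1, ~r/regex_to_grep/))
          io |> File.close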
          

          Problem

          At work today, a colleague needed to process a very large (1.5 GiB) log file, line by line, using Elixir 1.4.2 and Erlang 19.2 on Linux 3.16. A simple iex experiment yielded abysmal performance, taking 8 hours and 7 minutes to finish!

          file = "path/to/very/large/file"
          lines = file |> File.stream!
          grep = lines |> Enum.filter(&String.match?(&1, ~r/regex_to_grep/))
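
          As an aside, the elapsed time of such experiments can be captured with Erlang's :timer.tc/1. The following sketch merely wraps the same pipeline and was not part of the original session:

          {micros, grep} = :timer.tc(fn ->
            "path/to/very/large/file"
            |> File.stream!()
            |> Enum.filter(&String.match?(&1, ~r/regex_to_grep/))
          end)
          IO.puts("matched #{length(grep)} lines in #{div(micros, 1_000_000)} seconds")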
          

          In contrast, a similar experiment in Python 2.7.8 finished in merely 4 seconds:

          import re
          regex_to_grep = re.compile('regex_to_grep')
          file = open('path/to/very/large/file', 'r')
          grep = [line for line in file if regex_to_grep.search(line)]
          

          Approach

          My first instinct was to parallelize the grep operation, but that didn’t help:

          file = "path/to/very/large/file"
          lines = file |> File.stream!
          grep = lines |>
            # map each line to itself if it matches, or to `false` otherwise
            Task.async_stream(&(String.match?(&1, ~r/regex_to_grep/) and &1)) |>
            # unwrap the {:ok, result} tuples produced by Task.async_stream
            Stream.map(fn {:ok, async_stream_result} -> async_stream_result end) |>
            # keep only the matching lines (binaries), discarding `false`
            Enum.filter(&is_binary/1)
          

          Surprised, I retreated to a simple loop over the entire length of the file, merely counting its lines:

          file = "path/to/very/large/file"
          lines = file |> File.stream!
          num_lines = lines |> Enum.count
          

          Inconceivably, even this simple count appeared to hang! So I now suspected that something was wrong with the log file itself. Paging through it, I discovered an enormous sequence of NUL bytes embedded several thousand lines deep. That was unexpected, but why should NUL bytes affect only Elixir and not Python? Were C programs immune too?

          $ grep -c 'regex_to_grep' path/to/very/large/file
          grep: line too long
          

          Aha! GNU Grep had a built-in safeguard that aborts on very long lines to prevent excessive memory use, and the offending line had likely been elongated by that enormous NUL sequence.
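
          To double-check that those NUL bytes were the culprit, they can be counted with a raw, fixed-size chunk read that sidesteps line splitting entirely. This is a sketch (the 64 KiB chunk size is arbitrary) and was not part of the original investigation:

          file = "path/to/very/large/file"
          nul_bytes = file |>
            File.stream!([], 64 * 1024) |>
            Enum.reduce(0, fn chunk, count ->
              # count NUL bytes in this raw binary chunk
              count + length(:binary.matches(chunk, <<0>>))
            end)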

          Perhaps excessive line lengths also paralyzed Elixir? To verify, I wrote my own file reader that lazily reads one 128 KiB chunk at a time (32 filesystem blocks of the common 4 KiB size) and then splits it into lines:

          file = "path/to/very/large/file"
          lines = file |> File.stream!([], 32 * 4096) |>
            Stream.map(&String.split(&1, ~r/(?<=\n)/)) |>
            Stream.transform("", fn lines, acc ->
              if Enum.empty?(lines) do
                if acc == "" do
                  {:halt, acc}
                else
                  {[acc], ""}
                end
              else
                # prepend previous partial line to this chunk
                [first_line | rest_lines] = lines
                lines = [acc <> first_line | rest_lines]
                acc = ""
          
                # carry any partial line at end of this chunk
                unless List.last(lines) |> String.ends_with?("\n") do
                  {lines, [acc]} = Enum.split(lines, -1)
                end
          
                {lines, acc}
              end
            end)
          num_lines = lines |> Enum.count
          

          This workaround successfully prevented the enormous NUL byte sequence from hanging Elixir (hurray!) but the overall operation took 24 minutes to complete.

          How could I optimize this further? Presumably Elixir already uses a similar (in spirit) line-splitting algorithm that is heavily optimized, so eliminating my own would be ideal (deleted code is debugged code). In addition, I also had control over the size of the chunks read from the file.

          Solution

          Following this intuition, I searched the web and unearthed two disparate hints:

          1. Use IO.binstream/2 to read the file as a raw byte sequence (not as UTF-8, which is the case with IO.stream/2 and thus with File.stream!/2 as well). http://stackoverflow.com/a/31397000

          2. Ask for large chunks while reading the file to counteract the overhead of :file.read/2 and use additional buffering on top of it for performance. http://stackoverflow.com/a/9771547

          Applying these hints yielded a 16x speedup, cutting execution time from 24 minutes down to 90 seconds:

          file = "path/to/very/large/file"
          io = file |> File.open!(read_ahead: 128 * 1024) # 128 KiB generous caching
          lines = io |> IO.binstream(:line)
          num_lines = lines |> Enum.count
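
          As a follow-up, the earlier parallel grep attempt could be layered on top of this faster binary stream; whether that helps further would need measuring. This is a sketch that was not timed as part of the original experiments:

          file = "path/to/very/large/file"
          io = file |> File.open!(read_ahead: 128 * 1024)
          grep = io |> IO.binstream(:line) |>
            Task.async_stream(&(String.match?(&1, ~r/regex_to_grep/) and &1)) |>
            Stream.map(fn {:ok, result} -> result end) |>
            Enum.filter(&is_binary/1)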
          

          That’s actually good performance, considering that wc took 74 seconds just to count the lines, words, and bytes in the same file:

          $ time wc path/to/very/large/file
           92474     498826 1603770580 path/to/very/large/file
          wc path/to/very/large/file  73.57s user 0.84s system 100% cpu 1:14.41 total
          

          Although this performance still pales in comparison to Python’s 4 seconds, it’s a reasonable trade-off considering Elixir’s concurrency and fault-tolerance merits over Python. Furthermore, the remaining gap might stem from the underlying algorithms Elixir uses to fill buffers and split lines, given how closely its execution time matches that of wc.