Speeding up I/O on very large files in Elixir

Suraj N. Kurapati

  1. Summary
  2. Problem
  3. Approach
  4. Solution

          Summary

          Read large files as raw binary streams (instead of UTF-8) with generous caching:

          file = "path/to/very/large/file"
          io = file |> File.open!(read_ahead: 128 * 1024) # 128 KiB generous caching
          lines = io |> IO.binstream(:line)
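
          For example, the resulting stream can be consumed like any other enumerable and the device closed once it is no longer needed. This is only a sketch, reusing the placeholder regex from the experiments below:

          grep = lines |> Enum.filter(&String.match?(&1, ~r/regex_to_grep/))
          io |> File.close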
          

          Problem

          At work today, a colleague needed to process a very large (1.5 GiB) log file, line by line, using Elixir 1.4.2 and Erlang 19.2 on Linux 3.16. A simple iex experiment yielded abysmal performance, taking 8 hours and 7 minutes to finish!

          file = "path/to/very/large/file"
          lines = file |> File.stream!
          grep = lines |> Enum.filter(&String.match?(&1, ~r/regex_to_grep/))
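
          As an aside, the elapsed time of such experiments can be captured with Erlang's :timer.tc/1. The following sketch merely wraps the same pipeline and was not part of the original session:

          {micros, grep} = :timer.tc(fn ->
            "path/to/very/large/file"
            |> File.stream!()
            |> Enum.filter(&String.match?(&1, ~r/regex_to_grep/))
          end)
          IO.puts("matched #{length(grep)} lines in #{div(micros, 1_000_000)} seconds")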
          

          In contrast, a similar experiment in Python 2.7.8 finished in merely 4 seconds:

          import re
          regex_to_grep = re.compile('regex_to_grep')
          file = open('path/to/very/large/file', 'r')
          grep = [line for line in file if regex_to_grep.search(line)]
          

          Approach

          My first instinct was to parallelize the grep operation, but that didn’t help:

          file = "path/to/very/large/file"
          lines = file |> File.stream!
          grep = lines |>
            # map each line to itself if it matches, or to `false` otherwise
            Task.async_stream(&(String.match?(&1, ~r/regex_to_grep/) and &1)) |>
            # unwrap the {:ok, result} tuples produced by Task.async_stream
            Stream.map(fn {:ok, async_stream_result} -> async_stream_result end) |>
            # keep only the matching lines (binaries), discarding `false`
            Enum.filter(&is_binary/1)
          

          Surprised, I retreated to a simple loop over the entire length of the file, merely counting its lines:

          file = "path/to/very/large/file"
          lines = file |> File.stream!
          num_lines = lines |> Enum.count
          

          Inconceivably, even this simple count appeared to hang! So I now suspected that something was wrong with the log file itself. Paging through it, I discovered an enormous sequence of NUL bytes embedded several thousand lines deep. That was unexpected, but why should NUL bytes affect only Elixir and not Python? Were C programs immune too?

          $ grep -c 'regex_to_grep' path/to/very/large/file
          grep: line too long
          

          Aha! GNU Grep had a built-in safeguard that aborts on very long lines to prevent excessive memory use, and the offending line had likely been elongated by that enormous NUL sequence.
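
          To double-check that those NUL bytes were the culprit, they can be counted with a raw, fixed-size chunk read that sidesteps line splitting entirely. This is a sketch (the 64 KiB chunk size is arbitrary) and was not part of the original investigation:

          file = "path/to/very/large/file"
          nul_bytes = file |>
            File.stream!([], 64 * 1024) |>
            Enum.reduce(0, fn chunk, count ->
              # count NUL bytes in this raw binary chunk
              count + length(:binary.matches(chunk, <<0>>))
            end)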

          Perhaps excessive line lengths also paralyzed Elixir? To verify, I wrote my own file reader that lazily reads one 128 KiB chunk at a time (32 filesystem blocks of the common 4 KiB size) and then splits it into lines:

          file = "path/to/very/large/file"
          lines = file |> File.stream!([], 32 * 4096) |>
            Stream.map(&String.split(&1, ~r/(?<=\n)/)) |>
            Stream.transform("", fn lines, acc ->
              if Enum.empty?(lines) do
                if acc == "" do
                  {:halt, acc}
                else
                  {[acc], ""}
                end
              else
                # prepend previous partial line to this chunk
                [first_line | rest_lines] = lines
                lines = [acc <> first_line | rest_lines]
                acc = ""
          
                # carry any partial line at end of this chunk
                unless List.last(lines) |> String.ends_with?("\n") do
                  {lines, [acc]} = Enum.split(lines, -1)
                end
          
                {lines, acc}
              end
            end)
          num_lines = lines |> Enum.count
          

          This workaround successfully prevented the enormous NUL byte sequence from hanging Elixir (hurray!) but the overall operation took 24 minutes to complete.

          How could I optimize this further? Presumably Elixir already uses a similar (in spirit) line-splitting algorithm that is heavily optimized, so eliminating my own would be ideal (deleted code is debugged code). In addition, I also had control over the size of the chunks read from the file.

          Solution

          Following this intuition, I searched the web and unearthed two disparate hints:

          1. Use IO.binstream/2 to read the file as a raw byte sequence (not as UTF-8, which is the case with IO.stream/2 and thus with File.stream!/2 as well). http://stackoverflow.com/a/31397000

          2. Ask for large chunks while reading the file to counteract the overhead of :file.read/2 and use additional buffering on top of it for performance. http://stackoverflow.com/a/9771547

          Applying these hints yielded a 16x speedup, cutting execution time from 24 minutes down to 90 seconds:

          file = "path/to/very/large/file"
          io = file |> File.open!(read_ahead: 128 * 1024) # 128 KiB generous caching
          lines = io |> IO.binstream(:line)
          num_lines = lines |> Enum.count
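
          As a follow-up, the earlier parallel grep attempt could be layered on top of this faster binary stream; whether that helps further would need measuring. This is a sketch that was not timed as part of the original experiments:

          file = "path/to/very/large/file"
          io = file |> File.open!(read_ahead: 128 * 1024)
          grep = io |> IO.binstream(:line) |>
            Task.async_stream(&(String.match?(&1, ~r/regex_to_grep/) and &1)) |>
            Stream.map(fn {:ok, result} -> result end) |>
            Enum.filter(&is_binary/1)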
          

          That’s actually good performance, considering that wc took 74 seconds just to count the lines, words, and bytes in the same file:

          $ time wc path/to/very/large/file
           92474     498826 1603770580 path/to/very/large/file
          wc path/to/very/large/file  73.57s user 0.84s system 100% cpu 1:14.41 total
          

          Although this performance still pales in comparison to Python’s 4 seconds, it’s a reasonable trade-off considering Elixir’s concurrency and fault-tolerance merits over Python. Furthermore, the remaining gap might stem from the underlying algorithms Elixir uses to fill buffers and split lines, given how closely its execution time matches that of wc.