Speeding up I/O on very large files in Elixir
Problem
At work today, a colleague needed to process a very large (1.5 GiB) log file,
line by line, using Elixir 1.4.2 and Erlang 19.2 on Linux 3.16. A simple iex
experiment yielded abysmal performance, taking 8 hours and 7 minutes to finish!
file = "path/to/very/large/file"
lines = file |> File.stream!
grep = lines |> Enum.filter(&String.match?(&1, ~r/regex_to_grep/))
In contrast, a similar experiment in Python 2.7.8 finished in merely 4 seconds:
import re
regex_to_grep = re.compile('regex_to_grep')
file = open('path/to/very/large/file', 'r')
grep = [line for line in file if regex_to_grep.search(line)]
Approach
My first instinct was to parallelize the grep operation, but that didn’t help:
file = "path/to/very/large/file"
lines = file |> File.stream!
grep = lines |>
  Task.async_stream(&(String.match?(&1, ~r/regex_to_grep/) and &1)) |>
  Stream.map(fn {:ok, async_stream_result} -> async_stream_result end) |>
  Enum.filter(&is_binary/1)
Surprised, I retreated to a simple loop over the entire file:
file = "path/to/very/large/file"
lines = file |> File.stream!
num_lines = lines |> Enum.count
Inconceivably, it hung again! So I now suspected that something might be wrong with the log file itself. Paging through it, I discovered an enormous sequence of NUL bytes buried several thousand lines deep. That was unexpected, but why should NUL bytes affect only Elixir and not Python? Were C programs immune too?
$ grep -c 'regex_to_grep' path/to/very/large/file
grep: line too long
Aha! GNU grep has a built-in safeguard that aborts on very long lines to prevent runaway memory use, and that enormous NUL sequence had likely produced exactly such a line.
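Incidentally, the presence of that NUL run can also be confirmed from Elixir itself, without ever materializing a full line. Here is a minimal sketch under my own assumptions (the 128 KiB chunk size and the String.contains?/2 check are illustrative choices, not part of the original experiment):
file = "path/to/very/large/file"
chunk_size = 128 * 1024
# scan raw fixed-size chunks (no line splitting) for the first one containing a NUL byte
found = file |> File.stream!([], chunk_size) |>
  Stream.with_index() |>
  Enum.find(fn {chunk, _index} -> String.contains?(chunk, <<0>>) end)
case found do
  {_chunk, index} -> IO.puts("NUL bytes begin near byte offset #{index * chunk_size}")
  nil -> IO.puts("no NUL bytes found")
end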
Perhaps excessive line lengths also paralyzed Elixir? To verify, I wrote my own file reader that lazily reads one 128 KiB chunk at a time (32 filesystem blocks of the common 4 KiB size) and then splits it into lines:
file = "path/to/very/large/file"
lines = file |> File.stream!([], 32 * 4096) |>
Stream.map(&String.split(&1, ~r/(?<=\n)/)) |>
Stream.transform("", fn lines, acc ->
if Enum.empty?(lines) do
if acc == "" do
{:halt, acc}
else
{[acc], ""}
end
else
# prepend previous partial line to this chunk
[first_line | rest_lines] = lines
lines = [acc <> first_line | rest_lines]
acc = ""
# carry any partial line at end of this chunk
unless List.last(lines) |> String.ends_with?("\n") do
{lines, [acc]} = Enum.split(lines, -1)
end
{lines, acc}
end
end)
num_lines = lines |> Enum.count
This workaround successfully prevented the enormous NUL byte sequence from hanging Elixir (hurray!), but the overall operation still took 24 minutes to complete.
How could I optimize this further? Well, I imagined that Elixir already uses a similar (in spirit) line-splitting algorithm that is heavily optimized, so eliminating my own would be ideal (deleted code is debugged code). In addition, I also had control over the size of the chunks read from the file.
Solution
Following this intuition, I searched the web and unearthed two disparate hints:
1. Use IO.binstream/2 to read the file as a raw byte sequence (not as UTF-8,
   which is the case with IO.stream/2 and thus with File.stream!/2 as well).
   http://stackoverflow.com/a/31397000
2. Ask for large chunks while reading the file to counteract the overhead of
   :file.read/2, and use additional buffering on top of it for performance.
   http://stackoverflow.com/a/9771547
Applying these hints yielded a 16x speedup, cutting execution time to 90 seconds:
file = "path/to/very/large/file"
io = file |> File.open!(read_ahead: 128 * 1024) # 128 KiB generous caching
lines = io |> IO.binstream(:line)
num_lines = lines |> Enum.count
That’s actually good performance, since wc took 74 seconds to count characters:
$ time wc path/to/very/large/file
92474 498826 1603770580 path/to/very/large/file
wc path/to/very/large/file 73.57s user 0.84s system 100% cpu 1:14.41 total
Although this performance still pales in comparison to Python’s 4 seconds, it’s a reasonable trade-off considering the concurrency and fault-tolerance merits of Elixir over Python. Furthermore, this subpar performance might be an artifact of the underlying algorithms Elixir uses to fill buffers and split lines, given how close its execution time is to that of wc when counting characters.
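As a sanity check on the chunk-sizing intuition from earlier, the measurement is easy to repeat for different read_ahead buffer sizes using :timer.tc; the sizes below are arbitrary illustrative choices, not figures from the original runs:
file = "path/to/very/large/file"
for buffer_size <- [4 * 1024, 64 * 1024, 128 * 1024, 1024 * 1024] do
  # time a full line count of the file with this read_ahead buffer size
  {microseconds, num_lines} = :timer.tc(fn ->
    io = file |> File.open!([:read, read_ahead: buffer_size])
    num_lines = io |> IO.binstream(:line) |> Enum.count
    File.close(io)
    num_lines
  end)
  IO.puts("read_ahead=#{buffer_size} bytes: #{num_lines} lines in #{div(microseconds, 1000)} ms")
end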
Conclusion
Read large files as raw binary streams (not as UTF-8) with generous read-ahead caching:
file = "path/to/very/large/file"
io = file |> File.open!(read_ahead: 128 * 1024) # 128 KiB generous caching
lines = io |> IO.binstream(:line)
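And, tying it back to the original problem, the same setup feeds the line-by-line grep that started this whole investigation (a sketch reusing the placeholder regex_to_grep pattern from above):
file = "path/to/very/large/file"
io = file |> File.open!(read_ahead: 128 * 1024) # 128 KiB generous caching
grep = io |> IO.binstream(:line) |>
  Enum.filter(&String.match?(&1, ~r/regex_to_grep/))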