Hunt-Szymanski diff
A port of the Hunt-Szymanski diff algorithm to Elixir, with a focus on performance and memory efficiency.
This library is particularly suited for longer texts such as wiki pages or blog articles.
Credits
This library was generated with assistance from ChatGPT.
While I guided its development and reviewed the implementation, the core algorithm was recreated by AI.
Usage
left = """
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.
"""
right = """
Roses are blue,
Violets are blue,
I love fluffy clouds,
And so are you.
"""
# 1. Diff two strings
diff = HSDiff.diff(left, right)
# =>
# [
# del: ["Roses are red,"],
# ins: ["Roses are blue,"],
# eq: ["Violets are blue,"],
# del: ["Sugar is sweet,"],
# ins: ["I love fluffy clouds,"],
# eq: ["And so are you."],
# eq: [""]
# ]
# 2. Optimize the diff result
optimized = HSDiff.optimize(diff)
# =>
# [
# del: 1,
# ins: ["Roses are blue,"],
# eq: 1,
# del: 1,
# ins: ["I love fluffy clouds,"],
# eq: 2
# ]
# 3. Patch
restored = HSDiff.patch(left, optimized)
# => "Roses are blue,\nViolets are blue,\nI love fluffy clouds,\nAnd so are you.\n"Why Use Hunt-Szymanski Diff?
Elixir’s standard library already provides String.myers_difference/2 to generate diffs between strings (and lists). However:
- High Memory Usage: For larger inputs (e.g., a typical article or blog post),
String.myers_difference/2can quickly use gigabytes of memory. (Note: this can be alleviated by chunking the input into smaller pieces) - Small vs Large Inputs:
String.myers_difference/2performs well for short strings or larger strings with small changes. But for large-scale diffs—such as comparing entire documents—the Hunt–Szymanski line-based diff is more memory-efficient and performs better.
In short, if you need a diff algorithm for big textual data, Hunt–Szymanski can help you avoid the heavy memory footprint that comes with other approaches.
Installation
Add hunt_szymanski_diff to your list of dependencies in mix.exs:
def deps do
[
{:hunt_szymanski_diff, "~> 0.1.0"}
]
end