Yargy

Earley parser with grammar DSL for rule-based information extraction from Russian text.

Elixir port of natasha/yargy. Depends on morph_ru for morphological analysis and razdel for sentence segmentation.

Installation

def deps do
  [
    {:yargy, "~> 0.5"}
  ]
end

Performance

Benchmarked on real Russian legal text (81-character short, 2.5 KB medium, and 18.5 KB long samples). The Person grammar extracts full names with patronymics in four patterns (Surname+Name+Patronymic, surname with initials, etc.).

End-to-end: text → extracted matches (tokenize + morph tag + parse):

Grammar   Text               Python yargy 0.16   Elixir yargy   Speedup
Person    short (81 chars)   213 µs              242 µs         0.9×
Person    medium (2.5 KB)    8.3 ms              6.7 ms         1.2×
Person    long (18.5 KB)     60 ms               41 ms          1.5×
Date      short              171 µs              98 µs          1.7×
Date      medium             3.7 ms              2.5 ms         1.5×
Date      long               29 ms               17 ms          1.7×
Amount    short              133 µs              66 µs          2.0×
Amount    medium             4.3 ms              2.1 ms         2.0×
Amount    long               27 ms               15 ms          1.8×

Component breakdown (long text, 18.5 KB, ~3600 tokens):

Component              Python    Elixir   Speedup
Tokenize + morph tag   2.6 ms    2.9 ms   0.9×
Sentenize              1.5 ms    2.7 ms   0.6×
Parse only (Person)    ~57 ms    36 ms    1.6×

Python's tokenizer is backed by C-compiled regex + pymorphy2's DAWG in C, giving it an edge on raw tokenization. The Elixir port is near parity on tokenize+morph and wins on the parser core where Earley chart operations dominate — tuple-based O(1) term access, integer rule IDs for fast dedup, and prediction caching reduce constant factors.

Run benchmarks: mix run bench/parser_bench.exs (Elixir), python3 bench/python_bench.py (Python).

Grammar DSL

Define grammars declaratively with use Yargy.Grammar. Rules are compiled once at module load time and cached in :persistent_term.

defmodule MyApp.PersonGrammar do
  use Yargy.Grammar

  defrule :surname, all([gram("Surn"), capitalized()])
  defrule :first_name, all([gram("Name"), capitalized()])
  defrule :patronymic, all([gram("Patr"), capitalized()])
  defrule :dot, token(".")
  defrule :initial, all([upper(), length_eq(1)])
  defrule :initial_dot, rule(:initial) ~> rule(:dot)

  defgrammar :person, choice([
    rule(:surname) ~> rule(:first_name) ~> optional(rule(:patronymic)),
    rule(:first_name) ~> optional(rule(:patronymic)) ~> rule(:surname),
    rule(:surname) ~> rule(:initial_dot) ~> rule(:initial_dot),
    rule(:initial_dot) ~> rule(:initial_dot) ~> rule(:surname)
  ])
end

# Use the generated functions:
MyApp.PersonGrammar.person_text("Адвокат Иванов Иван Петрович подписал")
# [%Yargy.Parser.Match{tokens: [...], start: 1, stop: 4}]

Terminals

Function                              Matches
token("ст")                           exact value
token(~w[ГК ТК НК])                   value in set
lemma("статья")                       any morphological form
lemma(~w[статья закон])               lemma in set
gram("VERB")                          has OpenCorpora grammeme
integer()                             token type :int
word()                                token type :word
punct(".")                            punctuation with exact value
capitalized()                         first char uppercase
upper()                               all chars uppercase
caseless("рублей")                    case-insensitive
all([gram("Surn"), capitalized()])    AND — same token matches all
any([token("ст"), lemma("статья")])   OR — same token matches any
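Terminals compose directly into rules. As a sketch, a statute-reference grammar built only from the terminals above — the MyApp.StatuteGrammar module and its rule names are illustrative, not part of the library:

```elixir
defmodule MyApp.StatuteGrammar do
  use Yargy.Grammar

  # "ст" as an exact token, or any morphological form of "статья"
  defrule :st, any([token("ст"), lemma("статья")])
  defrule :dot, token(".")
  defrule :num, integer()
  defrule :code, token(~w[ГК ТК НК])

  # Matches "ст. 15 ГК", "статья 15", and similar references
  defgrammar :statute,
    rule(:st) ~> optional(rule(:dot)) ~> rule(:num) ~> optional(rule(:code))
end
```

MyApp.StatuteGrammar.statute_text("согласно ст. 15 ГК") would then return the match list, assuming the same *_text/1 generation as for :person.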

Composition

Syntax                      Meaning
a ~> b                      sequence (a then b)
choice([a, b])              alternation
optional(a)                 zero or one
repeat(a)                   one or more
repeat(a, min: 2, max: 5)   bounded repetition
rule(:name)                 reference a defrule
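These combinators nest freely. As a hypothetical example (module and rule names are illustrative), a comma-separated run of integers built with bounded repetition:

```elixir
defmodule MyApp.NumListGrammar do
  use Yargy.Grammar

  defrule :num, integer()
  defrule :comma, token(",")
  # A trailing ", <int>" pair that can repeat
  defrule :tail, rule(:comma) ~> rule(:num)

  # Two to six integers: "15, 16", "15, 16, 17", ...
  defgrammar :num_list,
    rule(:num) ~> repeat(rule(:tail), min: 1, max: 5)
end
```

num_list_text/1 would then find such runs, assuming the same *_text/1 generation as for :person.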

Generated functions

defgrammar :person, ... generates person_text/1, which takes a raw string and returns the list of %Yargy.Parser.Match{} structs shown in the example above.

Bag-of-features matchers

defmatch checks for unordered token presence — no sequential parsing needed. Useful for sentence classification.

defmodule MyApp.SentenceClassifier do
  use Yargy.Grammar

  defmatch :evidence, all_of([
    any_token(lemma(~w[подтверждаться подтвердить])),
    any_token(lemma(~w[акт квитанция чек выписка]))
  ])

  defmatch :demand, any_token(all([
    lemma(~w[требовать просить взыскать]),
    gram("VERB")
  ]))

  defmatch :short_title, all_of([
    any_token(lemma("претензия")),
    max_words(5)
  ])
end

MyApp.SentenceClassifier.evidence_match?("Оплата подтверждается актом")
# true

MyApp.SentenceClassifier.demand_match?("Истец требует возмещения")
# true

Match combinators

Function             Meaning
any_token(pred)      ∃ token matching predicate
no_token(pred)       ¬∃ token matching predicate
first_token(pred)    first word token matches
all_of([...])        all conditions hold (AND)
any_of([...])        at least one holds (OR)
max_words(n)         ≤ n word tokens
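As an illustrative combination of these combinators (the module, question-word list, and word limit are chosen arbitrarily):

```elixir
defmodule MyApp.QuestionClassifier do
  use Yargy.Grammar

  # Short, non-negated interrogatives: first word is a question word,
  # no "не" anywhere in the sentence, at most 8 word tokens.
  defmatch :question, all_of([
    first_token(lemma(~w[кто что где когда почему])),
    no_token(lemma("не")),
    max_words(8)
  ])
end
```

question_match?/1 would then classify sentences such as "Где подписан договор?".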

Generated functions

defmatch :evidence, ... generates evidence_match?/1, which takes a raw string and returns a boolean, as shown in the example above.

Low-level API

The DSL compiles to the same Rule / Predicate / Parser primitives you can use directly:

alias Yargy.{Parser, Predicate, Rule, Tokenizer}

date = Rule.rule([
  Predicate.type(:int), Predicate.eq("."),
  Predicate.type(:int), Predicate.eq("."),
  Predicate.type(:int)
])

tokens = Tokenizer.tokenize("Договор от 15.03.2024 подписан.")
Parser.findall(Parser.new(date), tokens)

Partial matching (autocomplete)

When the input ends mid-parse, partial_matches/2 returns incomplete states from the Earley chart — useful for autocomplete.

rule = Rule.rule([
  Predicate.in_(~w(ООО ПАО АО)),
  Predicate.or_([Predicate.eq("«"), Predicate.eq("\"")]),
  Predicate.type(:word),
  Predicate.or_([Predicate.eq("»"), Predicate.eq("\"")])
])

parser = Parser.new(rule)
tokens = Tokenizer.tokenize("Директору ООО «Ромашка")

Parser.partial_matches(parser, tokens)
# [%{dot: 3, production_length: 4, progress: 0.75,
#    matched_text: "ООО « Ромашка", ...}]
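For autocomplete, the progress and matched_text fields can be used to rank candidates. A sketch continuing the example above:

```elixir
# Most-complete partial parses first (fields as returned above)
parser
|> Yargy.Parser.partial_matches(tokens)
|> Enum.sort_by(& &1.progress, :desc)
|> Enum.map(& &1.matched_text)
```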

Sentence splitting

Yargy.Sentenize.sentenize("Привет. Как дела? Хорошо!")
# ["Привет.", "Как дела?", "Хорошо!"]
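Sentence splitting pairs naturally with the bag-of-features matchers above: split first, then classify each sentence. A sketch using the illustrative classifier from earlier:

```elixir
"Оплата подтверждается актом. Истец требует возмещения."
|> Yargy.Sentenize.sentenize()
|> Enum.filter(&MyApp.SentenceClassifier.evidence_match?/1)
```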

Core modules

Module             Purpose
Yargy.Grammar      Declarative DSL — defrule, defgrammar, terminals, ~>
Yargy.Parser       Earley parser — findall/2, find/2, partial_matches/2
Yargy.Rule         Low-level rule builder — rule, or_rule, optional, repeatable
Yargy.Predicate    Token predicates — gram, type, eq, normalized
Yargy.Relations    Agreement — gnc_relation, nc_relation
Yargy.Tokenizer    UTF-8 tokenizer with byte/char position tracking
Yargy.Sentenize    Sentence splitter (via razdel)
Yargy.MorphTagger  Morphological tagging via morph_ru
Yargy.Grammars.*   Date, NamedDate, Amount, Person

License

MIT © Danila Poyarkov