Nasty → Natural Abstract Syntax Tree Yeoman

A comprehensive NLP library for Elixir that treats natural language with the same rigor as programming languages.

Nasty provides a complete grammatical Abstract Syntax Tree (AST) for multiple natural languages (English, Spanish, and Catalan), with a full NLP pipeline from tokenization to text summarization.

Quick Start

# Run the complete demo
mix run demo.exs

# Or try specific examples
mix run examples/catalan_example.exs
mix run examples/roundtrip_translation.exs
mix run examples/multilingual_pipeline.exs

New to Nasty? Start with the Getting Started Guide for a beginner-friendly tutorial.

alias Nasty.Language.English

# Simple example
text = "John Smith works at Google in New York."

{:ok, tokens} = English.tokenize(text)
{:ok, tagged} = English.tag_pos(tokens)
{:ok, document} = English.parse(tagged)

# Extract entities
alias Nasty.Language.English.EntityRecognizer
entities = EntityRecognizer.recognize(tagged)
# => [%Entity{type: :person, text: "John Smith"}, 
#     %Entity{type: :org, text: "Google"}, ...]

# Extract dependencies
alias Nasty.Language.English.DependencyExtractor
sentences = document.paragraphs |> Enum.flat_map(& &1.sentences)
deps = Enum.flat_map(sentences, &DependencyExtractor.extract/1)

# Semantic role labeling
{:ok, document_with_srl} = Nasty.Language.English.parse(tagged, semantic_roles: true)
# Access semantic frames
frames = document_with_srl.semantic_frames
# => [%SemanticFrame{predicate: "works", roles: [%Role{type: :agent, text: "John Smith"}, ...]}]

# Coreference resolution
{:ok, document_with_coref} = Nasty.Language.English.parse(tagged, coreference: true)
# Access coreference chains
chains = document_with_coref.coref_chains
# => [%CorefChain{representative: "John Smith", mentions: ["John Smith", "he"], ...}]

# Summarize
summary = English.summarize(document, ratio: 0.3)  # 30% compression
# or
summary = English.summarize(document, max_sentences: 3)  # Fixed count

# MMR (Maximal Marginal Relevance) for reduced redundancy
summary_mmr = English.summarize(document, max_sentences: 3, method: :mmr, mmr_lambda: 0.5)

# Question answering
{:ok, answers} = English.answer_question(document, "Who works at Google?")
# => [%Answer{text: "John Smith", confidence: 0.85, ...}]

# Statistical POS tagging (auto-loads from priv/models/)
{:ok, tokens_hmm} = English.tag_pos(tokens, model: :hmm)

# Neural POS tagging (97-98% accuracy)
{:ok, tokens_neural} = English.tag_pos(tokens, model: :neural)

# Or ensemble mode (combines neural + statistical + rule-based)
{:ok, tokens_ensemble} = English.tag_pos(tokens, model: :ensemble)

# Text classification
# Train a sentiment classifier
training_data = [
  {positive_doc1, :positive},
  {positive_doc2, :positive},
  {negative_doc1, :negative},
  {negative_doc2, :negative}
]
model = English.train_classifier(training_data, features: [:bow, :lexical])

# Classify new documents
{:ok, predictions} = English.classify(test_doc, model)
# => [%Classification{class: :positive, confidence: 0.85, ...}, ...]

# Information extraction
# Extract relations between entities
{:ok, relations} = English.extract_relations(document)
# => [%Relation{type: :works_at, subject: person, object: org, confidence: 0.8}]

# Extract events with participants
{:ok, events} = English.extract_events(document)
# => [%Event{type: :business_acquisition, trigger: "acquired", participants: %{agent: ..., patient: ...}}]

# Template-based extraction
templates = [TemplateExtractor.employment_template()]
{:ok, results} = English.extract_templates(document, templates)
# => [%{template: "employment", slots: %{employee: "John", employer: "Google"}, confidence: 0.85}]
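The `method: :mmr` option in the summarization example above refers to Maximal Marginal Relevance: each step picks the sentence that maximizes `lambda * relevance - (1 - lambda) * redundancy`, where redundancy is the highest similarity to any sentence already chosen. The following is a minimal standalone sketch of that selection rule, not Nasty's implementation. The module name `MMRSketch`, the word-overlap (Jaccard) similarity, and the precomputed relevance scores are all assumptions made for illustration.

```elixir
defmodule MMRSketch do
  @moduledoc "Illustrative MMR sentence selection; not part of Nasty."

  # Jaccard similarity over lowercase word sets (an assumed similarity measure).
  def similarity(a, b) do
    sa = words(a)
    sb = words(b)
    union = MapSet.size(MapSet.union(sa, sb))
    if union == 0, do: 0.0, else: MapSet.size(MapSet.intersection(sa, sb)) / union
  end

  # candidates: list of {sentence, relevance}; returns up to k sentences in pick order.
  def select(candidates, k, lambda \\ 0.5), do: pick(candidates, [], k, lambda)

  defp pick([], selected, _k, _lambda), do: Enum.reverse(selected)

  defp pick(_candidates, selected, k, _lambda) when length(selected) >= k,
    do: Enum.reverse(selected)

  defp pick(candidates, selected, k, lambda) do
    {best, _relevance} =
      Enum.max_by(candidates, fn {sentence, relevance} ->
        # Redundancy is the closest match among already-selected sentences (0.0 if none).
        redundancy =
          selected
          |> Enum.map(&similarity(sentence, &1))
          |> Enum.max(fn -> 0.0 end)

        lambda * relevance - (1 - lambda) * redundancy
      end)

    remaining = Enum.reject(candidates, fn {sentence, _} -> sentence == best end)
    pick(remaining, [best | selected], k, lambda)
  end

  defp words(text) do
    text |> String.downcase() |> String.split(~r/\W+/u, trim: true) |> MapSet.new()
  end
end
```

With `lambda` at 1.0 the rule reduces to ranking by relevance alone; lower values penalize near-duplicate sentences more heavily, which is the trade-off the `mmr_lambda` option name suggests.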

Architecture

graph LR
    A[Text] --> B[Tokenization]
    B --> C[POS Tagging]
    C --> D[Phrase Parsing]
    D --> E[Sentence Parsing]
    E --> F[Document AST]
    F --> G[Dependencies]
    F --> H[Entities]
    F --> I[Summarization]
    F --> J[Translation]
    F --> K[More...]
    
    style F fill:#e1f5ff
    style A fill:#fff3e0

Complete Pipeline

  1. Tokenization (English.Tokenizer) → Split text into tokens
  2. POS Tagging (English.POSTagger) → Assign grammatical categories
  3. Morphology (English.Morphology) → Lemmatization and features
  4. Phrase Parsing (English.PhraseParser) → Build NP, VP, PP structures
  5. Sentence Parsing (English.SentenceParser) → Detect clauses and structure
  6. Dependency Extraction (English.DependencyExtractor) → Grammatical relations
  7. Entity Recognition (English.EntityRecognizer) → Named entities
  8. Semantic Role Labeling (English.SemanticRoleLabeler) → Predicate-argument structure
  9. Coreference Resolution (English.CoreferenceResolver) → Link mentions
  10. Summarization (English.Summarizer) → Extract key sentences
  11. Question Answering (English.QuestionAnalyzer, English.AnswerExtractor) → Answer questions
  12. Text Classification (English.FeatureExtractor, English.TextClassifier) → Train and classify documents
  13. Information Extraction (English.RelationExtractor, English.EventExtractor, English.TemplateExtractor) → Extract structured information
  14. AST Rendering (Rendering.Text) → Convert AST back to natural language
  15. AST Utilities (Utils.Traversal, Utils.Query, Utils.Validator, Utils.Transform) → Traverse, query, validate, and transform trees
  16. Visualization (Rendering.Visualization, Rendering.PrettyPrint) → Export to DOT/JSON and debug output
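The stages listed above consume each other's output, and the Quick Start calls suggest that each stage returns an `{:ok, result}` tuple. One way to chain such stages so that the first failure short-circuits the rest is sketched below. This is an illustration in plain Elixir only: `PipelineSketch` and its toy `tokenize/1` and `tag/1` functions are hypothetical stand-ins, not Nasty modules.

```elixir
defmodule PipelineSketch do
  @moduledoc "Illustrative stage chaining; the stages below are toy stand-ins, not Nasty's."

  # Runs each stage on the previous stage's output and halts on the first {:error, _}.
  def run(input, stages) do
    Enum.reduce_while(stages, {:ok, input}, fn stage, {:ok, acc} ->
      case stage.(acc) do
        {:ok, _} = ok -> {:cont, ok}
        {:error, _} = error -> {:halt, error}
      end
    end)
  end

  # Toy tokenizer: whitespace split, rejecting blank input.
  def tokenize(text) do
    case String.split(text) do
      [] -> {:error, :empty_input}
      tokens -> {:ok, tokens}
    end
  end

  # Toy tagger: capitalized words become :propn, everything else :x.
  def tag(tokens) do
    {:ok, Enum.map(tokens, fn t -> {t, if(t =~ ~r/^[A-Z]/, do: :propn, else: :x)} end)}
  end
end
```

Calling `PipelineSketch.run("John works", [&PipelineSketch.tokenize/1, &PipelineSketch.tag/1])` threads the text through both stages; a blank string stops at the tokenizer and the tagger is never invoked.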

Features

Phrase Structures

Sentence Types

Dependencies (Universal Dependencies)

Entity Types

Multi-Language Support

Nasty provides a language-agnostic architecture using Elixir behaviours, enabling support for multiple natural languages:

Supported Languages

Usage

alias Nasty.Language.Spanish

# Spanish text processing
text = "El gato duerme en el sofá."
{:ok, tokens} = Spanish.tokenize(text)
{:ok, tagged} = Spanish.tag_pos(tokens)
{:ok, document} = Spanish.parse(tagged)

# Works identically to English
summary = Spanish.summarize(document, ratio: 0.3)
{:ok, entities} = Spanish.extract_entities(document)

# Catalan text processing
alias Nasty.Language.Catalan

text_ca = "El gat dorm al sofà."
{:ok, tokens_ca} = Catalan.tokenize(text_ca)
{:ok, tagged_ca} = Catalan.tag_pos(tokens_ca)
{:ok, document_ca} = Catalan.parse(tagged_ca)

# Extract entities (Catalan-specific lexicons)
alias Nasty.Language.Catalan.EntityRecognizer
{:ok, entities_ca} = EntityRecognizer.recognize(tagged_ca)

# Translate between languages (AST-based)
alias Nasty.Language.English
alias Nasty.Translation.Translator

# English to Spanish
{:ok, tokens_en} = English.tokenize("The quick cat runs.")
{:ok, tagged_en} = English.tag_pos(tokens_en)
{:ok, doc_en} = English.parse(tagged_en)
{:ok, doc_es} = Translator.translate_document(doc_en, :es)
{:ok, text_es} = Nasty.render(doc_es)
# => "El gato rápido corre."

# Spanish to English
{:ok, tokens_es} = Spanish.tokenize("La casa grande.")
{:ok, tagged_es} = Spanish.tag_pos(tokens_es)
{:ok, doc_es} = Spanish.parse(tagged_es)
{:ok, doc_en} = Translator.translate_document(doc_es, :en)
{:ok, text_en} = Nasty.render(doc_en)
# => "The big house."
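Both translation examples above involve moving the adjective: English places it before the noun, Spanish usually after. Working on a tree rather than a word sequence makes that reordering a local operation on one noun phrase. The sketch below illustrates the idea on a toy noun phrase; it is not how `Translator` is implemented. The map shape, the two-entry lexicons, and the `TransferSketch` module are all assumptions.

```elixir
defmodule TransferSketch do
  @moduledoc "Toy English-to-Spanish noun phrase transfer; lexicon and node shape are assumptions."

  # Tiny hand-written lexicon: noun => {translation, gender}, adjective => masculine form.
  @nouns %{"cat" => {"gato", :m}, "house" => {"casa", :f}}
  @adjectives %{"quick" => "rápido", "big" => "grande"}

  # Input: %{det: "the", adjectives: [...], noun: "..."} holding English lemmas.
  def translate_np(%{det: "the", adjectives: adjs, noun: noun}) do
    {noun_es, gender} = Map.fetch!(@nouns, noun)
    det_es = if gender == :m, do: "el", else: "la"
    adjs_es = Enum.map(adjs, &agree(Map.fetch!(@adjectives, &1), gender))

    # Spanish typically places descriptive adjectives after the noun.
    Enum.join([det_es, noun_es | adjs_es], " ")
  end

  # Adjectives ending in -o take -a in the feminine; others are left unchanged here.
  defp agree(adj, :f), do: String.replace_suffix(adj, "o", "a")
  defp agree(adj, :m), do: adj
end
```

The article and adjective agree with the gender stored on the noun, which is information a flat word-for-word substitution would not have at hand.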

Language Registry

All languages are registered in Nasty.Language.Registry and can be accessed dynamically:

# Auto-detect language
{:ok, lang} = Nasty.Language.Registry.detect_language("¿Cómo estás?")
# => :es

# Get language module
{:ok, language_module} = Nasty.Language.Registry.get(:es)
# => Nasty.Language.Spanish
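The pattern behind a registry like this, where each language implements a shared behaviour and is looked up by its code, can be sketched in a few lines of plain Elixir. Everything below is hypothetical: `LanguageSketch`, `ToySpanish`, and `RegistrySketch` are illustrative names, and the two callbacks are a much smaller contract than whatever Nasty's real behaviour defines, which this README does not list.

```elixir
defmodule LanguageSketch do
  @moduledoc "Hypothetical behaviour; Nasty's real callbacks may differ."
  @callback language_code() :: atom()
  @callback tokenize(String.t()) :: {:ok, [String.t()]} | {:error, term()}
end

defmodule ToySpanish do
  @moduledoc "Toy implementation used only to exercise the sketch."
  @behaviour LanguageSketch

  @impl true
  def language_code, do: :es

  @impl true
  def tokenize(text), do: {:ok, String.split(text)}
end

defmodule RegistrySketch do
  @moduledoc "Maps language codes to implementing modules."

  # Builds a lookup table from modules that implement LanguageSketch.
  def new(modules), do: Map.new(modules, fn mod -> {mod.language_code(), mod} end)

  # Returns {:ok, module} or {:error, :language_not_found}.
  def get(registry, code) do
    case Map.fetch(registry, code) do
      {:ok, mod} -> {:ok, mod}
      :error -> {:error, :language_not_found}
    end
  end
end
```

Because callers only hold a module reference, code such as `{:ok, lang} = RegistrySketch.get(registry, :es)` followed by `lang.tokenize(text)` stays the same no matter which language was looked up.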

See complete language-specific examples:

Text Summarization

Question Answering

Text Classification

Information Extraction

Code Interoperability

Convert between natural language and Elixir code bidirectionally:

AST Rendering & Utilities

Convert AST back to text, traverse and query trees, validate structures, and export visualizations:
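As a rough picture of what traversing, querying, and rendering a syntax tree involve, the sketch below walks a toy tree in pre-order, collects nodes that satisfy a predicate, and joins the leaf tokens back into text. The tuple-based node shapes and the `TreeSketch` module are assumptions for illustration; Nasty's own AST structs and the `Utils.*` and `Rendering.*` APIs are not reproduced here.

```elixir
defmodule TreeSketch do
  @moduledoc "Toy syntax tree helpers; node shapes are assumptions, not Nasty's structs."

  # A node is either a leaf {:token, pos_tag, text} or a phrase {label, children}.

  # Pre-order walk that collects every node satisfying the predicate.
  def find_all({:token, _, _} = node, pred), do: if(pred.(node), do: [node], else: [])

  def find_all({_label, children} = node, pred) do
    here = if pred.(node), do: [node], else: []
    here ++ Enum.flat_map(children, &find_all(&1, pred))
  end

  # Renders the tree back to text by joining leaf tokens in order.
  def render(tree) do
    tree
    |> find_all(&match?({:token, _, _}, &1))
    |> Enum.map_join(" ", fn {:token, _, text} -> text end)
  end
end
```

A query is then just a predicate: `TreeSketch.find_all(tree, &match?({:np, _}, &1))` returns every noun phrase node in this toy representation.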

Testing

# Run all tests
mix test

# Run specific module tests
mix test test/language/english/tokenizer_test.exs
mix test test/language/english/phrase_parser_test.exs
mix test test/language/english/dependency_extractor_test.exs

Documentation

Comprehensive documentation is available in the docs/ directory:

Getting Started

Core Documentation

Language-Specific Documentation

Statistical & Neural Models

Nasty includes comprehensive statistical and neural network models for state-of-the-art NLP:

Statistical Models

Sequence Labeling

Parsing

Classification

Neural Models

Transformer Models (Bumblebee Integration)

See Statistical Models for the complete reference, Neural Models for architecture details, Training Neural for the training guide, Pretrained Models for transformer usage, Zero Shot for zero-shot classification, and Quantization for model optimization.

Quick Start: Model Management

# List available models
mix nasty.models list

# Train HMM POS tagger (fast, 95% accuracy)
mix nasty.train.pos \
  --corpus data/UD_English-EWT/en_ewt-ud-train.conllu \
  --test data/UD_English-EWT/en_ewt-ud-test.conllu \
  --output priv/models/en/pos_hmm_v1.model

# Train neural POS tagger (slower, 97-98% accuracy)
mix nasty.train.neural_pos \
  --corpus data/UD_English-EWT/en_ewt-ud-train.conllu \
  --output priv/models/en/pos_neural_v1.axon \
  --epochs 10 \
  --batch-size 32

# Train CRF for NER
mix nasty.train.crf \
  --corpus data/train.conllu \
  --test data/test.conllu \
  --output priv/models/en/ner_crf.model \
  --task ner \
  --iterations 100

# Train PCFG parser
mix nasty.train.pcfg \
  --corpus data/en_ewt-ud-train.conllu \
  --test data/en_ewt-ud-test.conllu \
  --output priv/models/en/pcfg.model \
  --smoothing 0.001

# Evaluate models
mix nasty.eval.pos \
  --model priv/models/en/pos_hmm_v1.model \
  --test data/UD_English-EWT/en_ewt-ud-test.conllu \
  --baseline

mix nasty.eval \
  --model priv/models/en/ner_crf.model \
  --test data/test.conllu \
  --type crf \
  --task ner

mix nasty.eval \
  --model priv/models/en/pcfg.model \
  --test data/test.conllu \
  --type pcfg
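The accuracy figures quoted for the POS taggers are most naturally read as token-level accuracy: the fraction of tokens whose predicted tag equals the gold tag. The README does not say how `mix nasty.eval.pos` computes its numbers, so the sketch below only illustrates that assumed metric; `AccuracySketch` is a hypothetical module, not Nasty's evaluator.

```elixir
defmodule AccuracySketch do
  @moduledoc "Token-level tagging accuracy; an illustrative metric, not Nasty's evaluator."

  # gold and predicted are equal-length, non-empty lists of tags, one per token.
  def accuracy(gold, predicted) when length(gold) == length(predicted) and gold != [] do
    correct =
      gold
      |> Enum.zip(predicted)
      |> Enum.count(fn {g, p} -> g == p end)

    correct / length(gold)
  end
end
```

For example, three matching tags out of four gives `AccuracySketch.accuracy([:det, :noun, :verb, :adv], [:det, :noun, :verb, :adj])` a value of 0.75.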

Future Enhancements

License

MIT License — see LICENSE file for details.


Built with ❤️ using Elixir and NimbleParsec