dataprep

Composable, type-driven preprocessing and validation combinator library for Gleam.

dataprep is a combinator toolkit, not a rule catalog.

Requirements

Supported targets

Install

gleam add dataprep

Quick start

import dataprep/prep
import dataprep/validated.{type Validated}
import dataprep/rules

pub type User {
  User(name: String, age: Int)
}

pub type Err {
  NameEmpty
  AgeTooYoung
}

pub fn validate_user(name: String, age: Int) -> Validated(User, Err) {
  let clean = prep.trim() |> prep.then(prep.lowercase())
  let check_name = rules.not_empty(NameEmpty)
  let check_age = rules.min_int(0, AgeTooYoung)

  validated.map2(
    User,
    name |> clean |> check_name,
    check_age(age),
  )
}

// validate_user("  Alice ", 25)   -> Valid(User("alice", 25))
// validate_user("", -1)           -> Invalid([NameEmpty, AgeTooYoung])

Note on composing rules: Each rules.* function returns a validator (fn(a) -> Validated(a, e)), not a transformed value. You cannot pipe one rule into another directly. Use validator.both to run both checks on the same value (accumulating errors) or validator.guard to short-circuit (skip the later check if the earlier one fails):

import dataprep/rules
import dataprep/validator

// ✗ Won't compile — piping a validator fn into another rule
let check = rules.not_empty(Empty) |> rules.min_length(3, TooShort)

// ✓ Correct — combine validators explicitly
let check =
  rules.not_empty(Empty)
  |> validator.guard(rules.min_length(3, TooShort))

Examples

Field validation with structured error context

Attach field names to errors so callers can identify which field failed.

import dataprep/prep
import dataprep/rules
import dataprep/validated.{type Validated}
import dataprep/validator

pub type FormError {
  Field(name: String, detail: FieldDetail)
}

pub type FieldDetail {
  Empty
  TooShort(min: Int)
  TooLong(max: Int)
}

pub fn validate_username(raw: String) -> Validated(String, FormError) {
  let clean = prep.trim() |> prep.then(prep.lowercase())
  let check =
    rules.not_empty(Empty)
    |> validator.guard(
      rules.min_length(3, TooShort(3))
      |> validator.both(rules.max_length(20, TooLong(20))),
    )
    |> validator.label("username", Field)

  raw |> clean |> check
}

// validate_username("  Al  ")
//   -> Invalid([Field("username", TooShort(3))])
// validate_username("  Alice  ")
//   -> Valid("alice")

Parse then validate

Use validated.and_then to bridge type-changing parsing with same-type validation. Parsing short-circuits; validation accumulates.

import dataprep/parse
import dataprep/rules
import dataprep/validated.{type Validated}
import dataprep/validator

pub type AgeError {
  NotAnInteger(raw: String)
  TooYoung(min: Int)
  TooOld(max: Int)
}

pub fn validate_age(raw: String) -> Validated(Int, AgeError) {
  let check_range =
    rules.min_int(0, TooYoung(0))
    |> validator.both(rules.max_int(150, TooOld(150)))

  parse.int(raw, NotAnInteger)
  |> validated.and_then(check_range)
}

// validate_age("abc") -> Invalid([NotAnInteger("abc")])
// validate_age("200") -> Invalid([TooOld(150)])
// validate_age("25")  -> Valid(25)

Nested error labeling with map3

Combine multiple fields into a domain type. All errors from all fields are accumulated with their field names.

import dataprep/prep
import dataprep/rules
import dataprep/validated.{type Validated}
import dataprep/validator

pub type SignupForm {
  SignupForm(name: String, email: String, age: Int)
}

pub type SignupError {
  Field(name: String, detail: Detail)
}

pub type Detail {
  Empty
  TooShort(min: Int)
  OutOfRange(min: Int, max: Int)
}

fn validate_name(raw: String) -> Validated(String, SignupError) {
  let clean = prep.trim() |> prep.then(prep.lowercase())
  let check =
    rules.not_empty(Empty)
    |> validator.guard(rules.min_length(2, TooShort(2)))
    |> validator.label("name", Field)
  raw |> clean |> check
}

fn validate_email(raw: String) -> Validated(String, SignupError) {
  let clean = prep.trim() |> prep.then(prep.lowercase())
  let check =
    rules.not_empty(Empty)
    |> validator.label("email", Field)
  raw |> clean |> check
}

fn validate_age(age: Int) -> Validated(Int, SignupError) {
  let check =
    rules.min_int(0, OutOfRange(0, 150))
    |> validator.both(rules.max_int(150, OutOfRange(0, 150)))
    |> validator.label("age", Field)
  check(age)
}

pub fn validate_signup(
  name: String,
  email: String,
  age: Int,
) -> Validated(SignupForm, SignupError) {
  validated.map3(
    SignupForm,
    validate_name(name),
    validate_email(email),
    validate_age(age),
  )
}

// validate_signup("", "", 200)
//   -> Invalid([
//        Field("name", Empty),
//        Field("email", Empty),
//        Field("age", OutOfRange(0, 150)),
//      ])

Pattern matching with rules.matches / matches_string

matches and matches_string use regexp.check semantics: they pass as long as the pattern hits anywhere in the input. A pattern like [0-9]+ will accept "abc123def" because the digit run matches a substring. For the validation case ("the whole string must look like an email / slug / number"), use the matches_fully / matches_fully_string siblings, which compare the matched span against the entire input.

There are three intended ways to construct a regex-driven validator. The example below shows two of them, a literal pattern and a caller-supplied dynamic pattern:

import dataprep/rules
import dataprep/validated.{type Validated}
import gleam/regexp
import gleam/result

pub type TagError {
  BadFormat
}

// Literal pattern with full-match semantics — the convenience
// helper compiles once at construction. No `let assert Ok(_)`
// boilerplate at the call site, and a substring hit on a partial
// pattern (like `[a-z0-9-]+`) does NOT silently slip through.
pub fn validate_tag(raw: String) -> Validated(String, TagError) {
  let check =
    rules.matches_fully_string(pattern: "[a-z0-9-]+", error: BadFormat)
  check(raw)
}

// Dynamic pattern — the caller controls the compile error.
pub fn validate_with(
  raw: String,
  pattern: String,
) -> Result(Validated(String, TagError), regexp.CompileError) {
  use re <- result.map(regexp.from_string(pattern))
  rules.matches(pattern: re, error: BadFormat)(raw)
}

// validate_tag("ok-1") -> Valid("ok-1")
// validate_tag("BAD!") -> Invalid([BadFormat])
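The difference between substring and full-match semantics can be seen by running the same pattern through both kinds of rule. The sketch below uses only the two call shapes already shown above, rules.matches with a compiled regexp and rules.matches_fully_string with a literal pattern. The names DigitsError, contains_digits, and only_digits are hypothetical and exist only for this illustration.

```gleam
import dataprep/rules
import dataprep/validated.{type Validated}
import gleam/regexp

// Hypothetical error type for this sketch.
pub type DigitsError {
  NotDigits
}

// Substring semantics: passes if the pattern hits anywhere in the input.
pub fn contains_digits(raw: String) -> Validated(String, DigitsError) {
  // The pattern is a fixed, known-good literal, so asserting that it
  // compiles is acceptable in a sketch.
  let assert Ok(re) = regexp.from_string("[0-9]+")
  rules.matches(pattern: re, error: NotDigits)(raw)
}

// Full-match semantics: the matched span must cover the entire input.
pub fn only_digits(raw: String) -> Validated(String, DigitsError) {
  rules.matches_fully_string(pattern: "[0-9]+", error: NotDigits)(raw)
}
```

Going by the semantics described above, contains_digits should accept "abc123def" because the digit run matches a substring, while only_digits should reject it and accept an input such as "123".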

default vs default_when_blank

prep.default(fallback) only fires on the literal empty string "". Whitespace-only inputs (" ", "\t", "\r\n") pass through unchanged. Use prep.default_when_blank(fallback) when "blank" should also include whitespace-only.

import dataprep/prep

let strict = prep.default("N/A")
strict("")          // "N/A"
strict(" ")         // " "          ← passed through
strict("\t")        // "\t"         ← passed through
strict("hi")        // "hi"

let lenient = prep.default_when_blank("N/A")
lenient("")         // "N/A"
lenient(" ")        // "N/A"        ← whitespace-only treated as blank
lenient("\t\n")     // "N/A"
lenient("  hi  ")   // "  hi  "     ← original input preserved on non-blank

// Want the trimmed form on the non-blank path? Compose explicitly:
let normalised = prep.trim() |> prep.then(first: _, next: prep.default("N/A"))
normalised("  hi  ") // "hi"
normalised("   ")    // "N/A"

More examples are available in the doc/recipes/ directory of the repository.

Modules

| Module | Responsibility |
| --- | --- |
| dataprep/prep | Infallible transformations: trim, lowercase, uppercase, collapse_space (ASCII whitespace only), collapse_unicode_space (full Unicode \s), replace, default, default_when_blank. Compose with then or sequence. |
| dataprep/validator | Checks without transformation: check, predicate, both, all, alt, guard, map_error, label, each, optional. |
| dataprep/validated | Applicative error accumulation: map, map_error, and_then, from_result, from_result_map, to_result, map2..map5, sequence, traverse, traverse_indexed. |
| dataprep/non_empty_list | At-least-one guarantee for error lists: single, cons, append, concat, map, flat_map, to_list, from_list. |
| dataprep/rules | Built-in rules: not_empty, not_blank, matches, matches_string, matches_string_checked, matches_fully, matches_fully_string, matches_fully_string_checked, min_length, max_length, length_between, min_int, max_int, min_float, max_float, non_negative_int, non_negative_float, one_of, equals. |
| dataprep/parse | Parse helpers: int, float, float_strict. Bridge String to typed Validated with custom error mapping. |

Composition overview

| Phase | Combinator | Errors | When to use |
| --- | --- | --- | --- |
| Prep | prep.then | (none) | Chain infallible transforms |
| Validate | validator.both / all | Accumulate all | Independent checks on same value |
| Validate | validator.alt | Accumulate on full failure | Accept alternative forms |
| Validate | validator.guard | Short-circuit | Skip if prerequisite fails |
| Combine | validated.map2..map5 | Accumulate all | Build domain types from independent fields |
| Bridge | validated.and_then | Short-circuit | Parse then validate (type changes) |
| Bridge | parse.int / parse.float | Short-circuit | String to typed Validated in one step |
| Bridge | raw \|> prep \|> validator | (prep has none) | Apply infallible transform before validation |
| Collection | validated.sequence / traverse | Accumulate all | Validate a list of values |
| Collection | validator.each | Accumulate all | Apply a validator to every list element |
| Collection | validator.optional | (none if None) | Skip validation for absent values |
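The Errors column separates combinators that accumulate from those that short-circuit. As a rough illustration of that distinction, the sketch below models the two behaviours with a stand-alone type. Checked, Check, both_model, guard_model, not_empty, and at_least_three are hypothetical names invented for this sketch. They are not part of dataprep, and the library's own implementation may differ (for instance, it keeps errors in a non-empty list rather than a plain List).

```gleam
import gleam/list
import gleam/string

// Hypothetical stand-in for a validation result; not dataprep's Validated.
pub type Checked(a, e) {
  Passed(a)
  Failed(List(e))
}

// A check inspects a value and either passes it through or reports errors.
pub type Check(a, e) =
  fn(a) -> Checked(a, e)

pub type NameError {
  Empty
  TooShort
}

// Accumulating: run both checks on the same value and merge their errors.
pub fn both_model(first: Check(a, e), second: Check(a, e)) -> Check(a, e) {
  fn(value) {
    case first(value), second(value) {
      Failed(e1), Failed(e2) -> Failed(list.append(e1, e2))
      Failed(e1), _ -> Failed(e1)
      _, Failed(e2) -> Failed(e2)
      _, _ -> Passed(value)
    }
  }
}

// Short-circuiting: run the second check only if the first one passes.
pub fn guard_model(first: Check(a, e), second: Check(a, e)) -> Check(a, e) {
  fn(value) {
    case first(value) {
      Failed(errors) -> Failed(errors)
      Passed(_) -> second(value)
    }
  }
}

pub fn not_empty(value: String) -> Checked(String, NameError) {
  case value {
    "" -> Failed([Empty])
    _ -> Passed(value)
  }
}

pub fn at_least_three(value: String) -> Checked(String, NameError) {
  case string.length(value) >= 3 {
    True -> Passed(value)
    False -> Failed([TooShort])
  }
}

// both_model(not_empty, at_least_three)("")     -> Failed([Empty, TooShort])
// guard_model(not_empty, at_least_three)("")    -> Failed([Empty])
// guard_model(not_empty, at_least_three)("abc") -> Passed("abc")
```

In this model the accumulating form reports every failure on the value at once, while the guarded form stops at the first failing prerequisite, which is the trade-off the table asks you to choose between.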

Scope policy

dataprep is a combinator toolkit, not a rule catalog. The library deliberately ships only the building blocks needed to construct typed, error-accumulating validators:

Domain-specific parsers (email, url, uuid, iso_datetime, ipv4, ...) are intentionally not in scope. The recommended path is to compose the primitives above into the parser you need, or to depend on a domain-specific package alongside dataprep.

See "Building your own parser" below for the recipes.

Building your own parser

The recipes below cover the common shapes a caller actually wants. Each recipe uses only the public API and is verified by the tests in test/dataprep/cookbook_test.gleam.

The recipes share one error type so they can be combined inside the same form-validation flow:

pub type Err {
  NotAnInteger(raw: String)
  NotPositive
  WrongLength(min: Int, max: Int, got: Int)
  NotUuid(raw: String)
  NotAllowed(raw: String)
}

Recipe 1: positive_int

Parse to Int, then enforce > 0. Uses validated.and_then to short-circuit when the parse itself fails.

import dataprep/parse
import dataprep/validated.{type Validated}
import dataprep/validator

fn positive_int(raw: String) -> Validated(Int, Err) {
  use n <- validated.and_then(parse.int(raw, NotAnInteger))
  validator.predicate(fn(x) { x > 0 }, NotPositive)(n)
}

Recipe 2: bounded_string

Trim, then enforce length is in [min, max].

import dataprep/prep
import dataprep/rules
import dataprep/validated.{type Validated}
import gleam/string

fn bounded_string(
  min: Int,
  max: Int,
) -> fn(String) -> Validated(String, Err) {
  fn(raw: String) {
    let trimmed = prep.run(prep: prep.trim(), value: raw)
    rules.length_between(
      minimum: min,
      maximum: max,
      error: WrongLength(min, max, string.length(trimmed)),
    )(trimmed)
  }
}

Recipe 3: uuid_v4_lowercase

Trim, lowercase, then regex match. Demonstrates prep.then for chained infallible normalisation before validation.

import dataprep/prep
import dataprep/rules
import dataprep/validated.{type Validated}

fn uuid_v4_lowercase(raw: String) -> Validated(String, Err) {
  let normalized =
    prep.run(
      prep: prep.then(first: prep.trim(), next: prep.lowercase()),
      value: raw,
    )
  rules.matches_fully_string(
    pattern: "[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}",
    error: NotUuid(raw),
  )(normalized)
}

Recipe 4: enum_of_strings_ci

Case-insensitive match against a fixed allow-list. Returns a parameterised validator so callers can vary the allowed set.

import dataprep/prep
import dataprep/rules
import dataprep/validated.{type Validated}

fn enum_of_strings_ci(
  allowed: List(String),
) -> fn(String) -> Validated(String, Err) {
  fn(raw: String) {
    let normalized = prep.run(prep: prep.lowercase(), value: raw)
    rules.one_of(allowed: allowed, error: NotAllowed(raw))(normalized)
  }
}

The composition pattern is the same throughout: where the input needs normalising, prep.run does that first; a rules or validator combinator then checks the normalised value; and validated.and_then optionally chains a follow-up step that depends on a successful prior step.
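Because the recipes share one error type, their results can sit side by side in an accumulating combinator. The sketch below restates recipes 1 and 2 (with a trimmed-down Err) so that it stands alone, then combines them with validated.map2 using the argument order shown in the Quick start. Order and validate_order are hypothetical names introduced only for this illustration.

```gleam
import dataprep/parse
import dataprep/prep
import dataprep/rules
import dataprep/validated.{type Validated}
import dataprep/validator
import gleam/string

// Subset of the shared error type used by recipes 1 and 2.
pub type Err {
  NotAnInteger(raw: String)
  NotPositive
  WrongLength(min: Int, max: Int, got: Int)
}

// Hypothetical domain type for this sketch.
pub type Order {
  Order(quantity: Int, note: String)
}

// Recipe 1, restated: parse to Int, then require a value above zero.
fn positive_int(raw: String) -> Validated(Int, Err) {
  use n <- validated.and_then(parse.int(raw, NotAnInteger))
  validator.predicate(fn(x) { x > 0 }, NotPositive)(n)
}

// Recipe 2, restated: trim, then require the length to be in [min, max].
fn bounded_string(min: Int, max: Int) -> fn(String) -> Validated(String, Err) {
  fn(raw: String) {
    let trimmed = prep.run(prep: prep.trim(), value: raw)
    rules.length_between(
      minimum: min,
      maximum: max,
      error: WrongLength(min, max, string.length(trimmed)),
    )(trimmed)
  }
}

// Both fields are validated independently, so a failure in one
// does not hide a failure in the other.
pub fn validate_order(quantity: String, note: String) -> Validated(Order, Err) {
  validated.map2(Order, positive_int(quantity), bounded_string(1, 80)(note))
}
```

With this shape, a call such as validate_order("2", "  gift wrap  ") is expected to produce a valid Order holding the parsed quantity and the trimmed note, while bad input in both fields should surface both errors together.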

Out of scope, by design

The following are intentionally not provided by dataprep, even though some of them appear in adjacent libraries in other languages:

These have implementation-defining standards or substantial spec surface that would push dataprep from "small primitives" toward "kitchen sink." Keeping the scope tight is what lets the combinators stay composable.

Development

This project uses mise to manage Gleam and Erlang versions, and the just command runner for project tasks.

mise install    # install Gleam and Erlang
just ci         # format check, typecheck, build, test
just test       # gleam test
just format     # gleam format
just check      # all checks without deps download

Contributing

Contributions are welcome. See CONTRIBUTING.md for details.

License

MIT