dataprep

Composable, type-driven preprocessing and validation combinator library for Gleam.

dataprep is a combinator toolkit, not a rule catalog.

Built-in and user-defined rules are identical in power.
No domain-specific rules (email, URL, UUID). Write your own or use a dedicated package.
No schema, no DSL, no reflection.
Prep transforms. Validator checks. They do not mix.
Errors are your types, not ours.

Requirements

Gleam 1.15 or later
Erlang/OTP 27 or later (when targeting Erlang)
Node.js 18 or later (when targeting JavaScript)

Supported targets

Erlang (BEAM) — for server-side use (e.g. wisp, mist)
JavaScript — for client-side use (e.g. Lustre form validation in the browser). The package contains zero FFI and zero target-specific code, so the same Validator(a, e) can be shared between client and server code.

Install

gleam add dataprep

Quick start

import dataprep/prep
import dataprep/validated.{type Validated}
import dataprep/rules

pub type User {
  User(name: String, age: Int)
}

pub type Err {
  NameEmpty
  AgeTooYoung
}

pub fn validate_user(name: String, age: Int) -> Validated(User, Err) {
  let clean = prep.trim() |> prep.then(prep.lowercase())
  let check_name = rules.not_empty(NameEmpty)
  let check_age = rules.min_int(0, AgeTooYoung)

  validated.map2(
    User,
    name |> clean |> check_name,
    check_age(age),
  )
}

// validate_user("  Alice ", 25)   -> Valid(User("alice", 25))
// validate_user("", -1)           -> Invalid([NameEmpty, AgeTooYoung])

Note on composing rules: Each rules.* function returns a validator (fn(a) -> Validated(a, e)), not a transformed value. You cannot pipe one rule into another directly — use validator.both to run checks in parallel (accumulating errors) or validator.guard to short-circuit (skip later checks if an earlier one fails):
import dataprep/rules
import dataprep/validator

// ✗ Won't compile — piping a validator fn into another rule
let check = rules.not_empty(Empty) |> rules.min_length(3, TooShort)

// ✓ Correct — combine validators explicitly
let check =
  rules.not_empty(Empty)
  |> validator.guard(rules.min_length(3, TooShort))
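
To make the difference between the two combinators concrete, the sketch below builds the same pair of rules once with validator.both and once with validator.guard. It uses only calls that appear elsewhere in this README; the error type NameError and the function names check_all and check_in_order are hypothetical and exist only for illustration.

```gleam
import dataprep/rules
import dataprep/validated.{type Validated}
import dataprep/validator

// Hypothetical error type for this sketch.
pub type NameError {
  Empty
  TooShort
}

// `both` runs the two checks independently, so an empty input is
// expected to report Empty and TooShort together.
pub fn check_all(raw: String) -> Validated(String, NameError) {
  let check =
    rules.not_empty(Empty)
    |> validator.both(rules.min_length(3, TooShort))
  check(raw)
}

// `guard` runs the length check only when `not_empty` has passed, so
// an empty input is expected to report Empty alone.
pub fn check_in_order(raw: String) -> Validated(String, NameError) {
  let check =
    rules.not_empty(Empty)
    |> validator.guard(rules.min_length(3, TooShort))
  check(raw)
}
```

A reasonable default is guard when the later check is meaningless without the earlier one (length of an empty string), and both when each failure is independently useful to the caller.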

Examples

Field validation with structured error context

Attach field names to errors so callers can identify which field failed.

import dataprep/prep
import dataprep/rules
import dataprep/validated.{type Validated}
import dataprep/validator

pub type FormError {
  Field(name: String, detail: FieldDetail)
}

pub type FieldDetail {
  Empty
  TooShort(min: Int)
  TooLong(max: Int)
}

pub fn validate_username(raw: String) -> Validated(String, FormError) {
  let clean = prep.trim() |> prep.then(prep.lowercase())
  let check =
    rules.not_empty(Empty)
    |> validator.guard(
      rules.min_length(3, TooShort(3))
      |> validator.both(rules.max_length(20, TooLong(20))),
    )
    |> validator.label("username", Field)

  raw |> clean |> check
}

// validate_username("  Al  ")
//   -> Invalid([Field("username", TooShort(3))])
// validate_username("  Alice  ")
//   -> Valid("alice")

Parse then validate

Use validated.and_then to bridge type-changing parsing with same-type validation. Parsing short-circuits; validation accumulates.

import dataprep/parse
import dataprep/rules
import dataprep/validated.{type Validated}
import dataprep/validator

pub type AgeError {
  NotAnInteger(raw: String)
  TooYoung(min: Int)
  TooOld(max: Int)
}

pub fn validate_age(raw: String) -> Validated(Int, AgeError) {
  let check_range =
    rules.min_int(0, TooYoung(0))
    |> validator.both(rules.max_int(150, TooOld(150)))

  parse.int(raw, NotAnInteger)
  |> validated.and_then(check_range)
}

// validate_age("abc") -> Invalid([NotAnInteger("abc")])
// validate_age("200") -> Invalid([TooOld(150)])
// validate_age("25")  -> Valid(25)

Nested error labeling with map3

Combine multiple fields into a domain type. All errors from all fields are accumulated with their field names.

import dataprep/prep
import dataprep/rules
import dataprep/validated.{type Validated}
import dataprep/validator

pub type SignupForm {
  SignupForm(name: String, email: String, age: Int)
}

pub type SignupError {
  Field(name: String, detail: Detail)
}

pub type Detail {
  Empty
  TooShort(min: Int)
  OutOfRange(min: Int, max: Int)
}

fn validate_name(raw: String) -> Validated(String, SignupError) {
  let clean = prep.trim() |> prep.then(prep.lowercase())
  let check =
    rules.not_empty(Empty)
    |> validator.guard(rules.min_length(2, TooShort(2)))
    |> validator.label("name", Field)
  raw |> clean |> check
}

fn validate_email(raw: String) -> Validated(String, SignupError) {
  let clean = prep.trim() |> prep.then(prep.lowercase())
  let check =
    rules.not_empty(Empty)
    |> validator.label("email", Field)
  raw |> clean |> check
}

fn validate_age(age: Int) -> Validated(Int, SignupError) {
  let check =
    rules.min_int(0, OutOfRange(0, 150))
    |> validator.both(rules.max_int(150, OutOfRange(0, 150)))
    |> validator.label("age", Field)
  check(age)
}

pub fn validate_signup(
  name: String,
  email: String,
  age: Int,
) -> Validated(SignupForm, SignupError) {
  validated.map3(
    SignupForm,
    validate_name(name),
    validate_email(email),
    validate_age(age),
  )
}

// validate_signup("", "", 200)
//   -> Invalid([
//        Field("name", Empty),
//        Field("email", Empty),
//        Field("age", OutOfRange(0, 150)),
//      ])

Pattern matching with `rules.matches` / `matches_string`

matches and matches_string use regexp.check semantics: they pass as long as the pattern hits anywhere in the input. A pattern like [0-9]+ will accept "abc123def" because the digit run matches a substring. For the validation case ("the whole string must look like an email / slug / number"), use the matches_fully / matches_fully_string siblings, which compare the matched span against the entire input.

There are three intended ways to construct a regex-driven validator:

Pre-compiled Regexp: pass a regexp.Regexp to matches / matches_fully. Pattern errors surface as the Result returned by regexp.from_string at the call site, before the validator is built.
Literal convenience: matches_string / matches_fully_string. The helper compiles internally and panics on a malformed literal, which is a programmer error with no useful recovery. Use this only when the pattern is hard-coded at the call site.
Checked dynamic pattern: matches_string_checked / matches_fully_string_checked return Result(Validator, RegexRuleError). Use this for config-driven or admin-supplied patterns, where a malformed regex is a runtime condition that should be handled rather than crash the process.

import dataprep/rules
import dataprep/validated.{type Validated}
import gleam/regexp
import gleam/result

pub type TagError {
  BadFormat
}

// Literal pattern with full-match semantics — the convenience
// helper compiles once at construction. No `let assert Ok(_)`
// boilerplate at the call site, and a substring hit on a partial
// pattern (like `[a-z0-9-]+`) does NOT silently slip through.
pub fn validate_tag(raw: String) -> Validated(String, TagError) {
  let check =
    rules.matches_fully_string(pattern: "[a-z0-9-]+", error: BadFormat)
  check(raw)
}

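// Dynamic pattern, pre-compiled by the caller, who handles the compile error.
// Note: `matches` has substring semantics; swap in `matches_fully` to
// require the whole input to match.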
pub fn validate_with(
  raw: String,
  pattern: String,
) -> Result(Validated(String, TagError), regexp.CompileError) {
  use re <- result.map(regexp.from_string(pattern))
  rules.matches(pattern: re, error: BadFormat)(raw)
}

// validate_tag("ok-1") -> Valid("ok-1")
// validate_tag("BAD!") -> Invalid([BadFormat])
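
The third construction route, the checked helpers, is not shown above. A minimal sketch follows, assuming matches_fully_string_checked accepts the same pattern: / error: labels as matches_fully_string (the labels are an assumption, not confirmed here). The function name validate_with_checked is hypothetical, and the return type is left to inference rather than spelling out the RegexRuleError type.

```gleam
import dataprep/rules
import gleam/result

pub type TagError {
  BadFormat
}

// Assumption: `matches_fully_string_checked` takes the same
// `pattern:` / `error:` labels as `matches_fully_string`.
// A malformed pattern comes back as an Error value instead of a
// panic; a well-formed one yields a validator that is applied to
// `raw` inside the Ok branch.
pub fn validate_with_checked(raw: String, pattern: String) {
  rules.matches_fully_string_checked(pattern: pattern, error: BadFormat)
  |> result.map(fn(check) { check(raw) })
}
```

The outer Result separates "the pattern itself was bad" from "the input did not match", so a caller can report a configuration problem differently from a validation failure.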

`default` vs `default_when_blank`

prep.default(fallback) only fires on the literal empty string "". Whitespace-only inputs (" ", "\t", "\r\n") pass through unchanged. Use prep.default_when_blank(fallback) when "blank" should also include whitespace-only input.

import dataprep/prep

let strict = prep.default("N/A")
strict("")          // "N/A"
strict(" ")         // " "          ← passed through
strict("\t")        // "\t"         ← passed through
strict("hi")        // "hi"

let lenient = prep.default_when_blank("N/A")
lenient("")         // "N/A"
lenient(" ")        // "N/A"        ← whitespace-only treated as blank
lenient("\t\n")     // "N/A"
lenient("  hi  ")   // "  hi  "     ← original input preserved on non-blank

// Want the trimmed form on the non-blank path? Compose explicitly:
let normalised = prep.trim() |> prep.then(first: _, next: prep.default("N/A"))
normalised("  hi  ") // "hi"
normalised("   ")    // "N/A"

More examples are available in the doc/recipes/ directory of the repository.

Modules

| Module | Responsibility |
| --- | --- |
| `dataprep/prep` | Infallible transformations: `trim`, `lowercase`, `uppercase`, `collapse_space` (ASCII whitespace only), `collapse_unicode_space` (full Unicode `\s`), `replace`, `default`, `default_when_blank`. Compose with `then` or `sequence`. |
| `dataprep/validator` | Checks without transformation: `check`, `predicate`, `both`, `all`, `alt`, `guard`, `map_error`, `label`, `each`, `optional`. |
| `dataprep/validated` | Applicative error accumulation: `map`, `map_error`, `and_then`, `from_result`, `from_result_map`, `to_result`, `map2`..`map5`, `sequence`, `traverse`, `traverse_indexed`. |
| `dataprep/non_empty_list` | At-least-one guarantee for error lists: `single`, `cons`, `append`, `concat`, `map`, `flat_map`, `to_list`, `from_list`. |
| `dataprep/rules` | Built-in rules: `not_empty`, `not_blank`, `matches`, `matches_string`, `matches_string_checked`, `matches_fully`, `matches_fully_string`, `matches_fully_string_checked`, `min_length`, `max_length`, `length_between`, `min_int`, `max_int`, `min_float`, `max_float`, `non_negative_int`, `non_negative_float`, `one_of`, `equals`. |
| `dataprep/parse` | Parse helpers: `int`, `float`, `float_strict`. Bridge `String` to typed `Validated` with custom error mapping. |

Composition overview

| Phase | Combinator | Errors | When to use |
| --- | --- | --- | --- |
| Prep | `prep.then` | (none) | Chain infallible transforms |
| Validate | `validator.both` / `all` | Accumulate all | Independent checks on same value |
| Validate | `validator.alt` | Accumulate on full failure | Accept alternative forms |
| Validate | `validator.guard` | Short-circuit | Skip if prerequisite fails |
| Combine | `validated.map2`..`map5` | Accumulate all | Build domain types from independent fields |
| Bridge | `validated.and_then` | Short-circuit | Parse then validate (type changes) |
| Bridge | `parse.int` / `parse.float` | Short-circuit | String to typed Validated in one step |
| Bridge | `raw \|> prep \|> validator` | (prep has none) | Apply infallible transform before validation |
| Collection | `validated.sequence` / `traverse` | Accumulate all | Validate a list of values |
| Collection | `validator.each` | Accumulate all | Apply a validator to every list element |
| Collection | `validator.optional` | (none if None) | Skip validation for absent values |

Scope policy

dataprep is a combinator toolkit, not a rule catalog. The library deliberately ships only the building blocks needed to construct typed, error-accumulating validators:

infallible string transforms (prep),
generic checks (validator, rules),
applicative error accumulation (validated, non_empty_list),
String to Int / Float parsers (parse).

Domain-specific parsers (email, url, uuid, iso_datetime, ipv4, ...) are intentionally not in scope. The recommended path is to compose the primitives above into the parser you need, or to depend on a domain-specific package alongside dataprep.

See "Building your own parser" below for the recipes.

Building your own parser

The recipes below cover the common shapes a caller actually wants. Each recipe uses only the public API and is verified by the tests in test/dataprep/cookbook_test.gleam.

The recipes share one error type so they can be combined inside the same form-validation flow:

pub type Err {
  NotAnInteger(raw: String)
  NotPositive
  WrongLength(min: Int, max: Int, got: Int)
  NotUuid(raw: String)
  NotAllowed(raw: String)
}

Recipe 1: `positive_int`

Parse to Int, then enforce > 0. Uses validated.and_then to short-circuit when the parse itself fails.

import dataprep/parse
import dataprep/validated.{type Validated}
import dataprep/validator

fn positive_int(raw: String) -> Validated(Int, Err) {
  use n <- validated.and_then(parse.int(raw, NotAnInteger))
  validator.predicate(fn(x) { x > 0 }, NotPositive)(n)
}

Recipe 2: `bounded_string`

Trim, then enforce that the trimmed length is in [min, max].

import dataprep/prep
import dataprep/rules
import dataprep/validated.{type Validated}
import gleam/string

fn bounded_string(
  min: Int,
  max: Int,
) -> fn(String) -> Validated(String, Err) {
  fn(raw: String) {
    let trimmed = prep.run(prep: prep.trim(), value: raw)
    rules.length_between(
      minimum: min,
      maximum: max,
      error: WrongLength(min, max, string.length(trimmed)),
    )(trimmed)
  }
}

Recipe 3: `uuid_v4_lowercase`

Trim, lowercase, then regex match. Demonstrates prep.then for chained infallible normalisation before validation.

import dataprep/prep
import dataprep/rules
import dataprep/validated.{type Validated}

fn uuid_v4_lowercase(raw: String) -> Validated(String, Err) {
  let normalized =
    prep.run(
      prep: prep.then(first: prep.trim(), next: prep.lowercase()),
      value: raw,
    )
  rules.matches_fully_string(
    pattern: "[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}",
    error: NotUuid(raw),
  )(normalized)
}

Recipe 4: `enum_of_strings_ci`

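Case-insensitive match against a fixed allow-list. The input is lowercased before the comparison, so the entries in allowed must themselves be lowercase. Returns a parameterised validator so callers can vary the allowed set.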

import dataprep/prep
import dataprep/rules
import dataprep/validated.{type Validated}

fn enum_of_strings_ci(
  allowed: List(String),
) -> fn(String) -> Validated(String, Err) {
  fn(raw: String) {
    let normalized = prep.run(prep: prep.lowercase(), value: raw)
    rules.one_of(allowed: allowed, error: NotAllowed(raw))(normalized)
  }
}

The composition pattern is the same in every case: where the raw input needs normalising, prep.run does that first; a rules or validator combinator then checks the resulting value; and validated.and_then optionally chains a follow-up step that depends on a successful prior step.
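
Because the recipes share one error type, they can feed validated.map2 directly. The sketch below repeats Recipes 1 and 2 so that it is self-contained and combines them into one record. The LineItem type, the validate_line_item function and the 1..32 length bounds are hypothetical, chosen only for illustration.

```gleam
import dataprep/parse
import dataprep/prep
import dataprep/rules
import dataprep/validated.{type Validated}
import dataprep/validator
import gleam/string

// Subset of the shared error type needed by the two recipes used here.
pub type Err {
  NotAnInteger(raw: String)
  NotPositive
  WrongLength(min: Int, max: Int, got: Int)
}

// Hypothetical domain type for this sketch.
pub type LineItem {
  LineItem(sku: String, quantity: Int)
}

// Recipe 1: parse to Int, then require a value greater than zero.
fn positive_int(raw: String) -> Validated(Int, Err) {
  use n <- validated.and_then(parse.int(raw, NotAnInteger))
  validator.predicate(fn(x) { x > 0 }, NotPositive)(n)
}

// Recipe 2: trim, then require the trimmed length to be in [min, max].
fn bounded_string(
  min: Int,
  max: Int,
) -> fn(String) -> Validated(String, Err) {
  fn(raw: String) {
    let trimmed = prep.run(prep: prep.trim(), value: raw)
    rules.length_between(
      minimum: min,
      maximum: max,
      error: WrongLength(min, max, string.length(trimmed)),
    )(trimmed)
  }
}

// Each field is validated independently; map2 builds the record only
// when both succeed and otherwise collects the errors from both fields.
pub fn validate_line_item(
  sku: String,
  quantity: String,
) -> Validated(LineItem, Err) {
  validated.map2(
    LineItem,
    bounded_string(1, 32)(sku),
    positive_int(quantity),
  )
}
```

Inside positive_int the parse and the range check are sequenced with and_then, while across fields map2 accumulates, which matches the short-circuit versus accumulate split in the composition overview.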

Out of scope, by design

The following are intentionally not provided by dataprep, even though some of them appear in adjacent libraries in other languages:

email, url, uri parsing (use a URI- or email-specific package)
iso_datetime / time arithmetic (use a gleam_time-shaped package)
uuid / ulid generation (use a UUID-shaped package)
JSON shape validation (use a JSON-schema package)
HTML / XML sanitisation (use a sanitiser package)

These have implementation-defining standards or substantial spec surface that would push dataprep from "small primitives" toward "kitchen sink." Keeping the scope tight is what lets the combinators stay composable.

Development

This project uses mise to manage Gleam and Erlang versions, and just as a task runner.

mise install    # install Gleam and Erlang
just ci         # format check, typecheck, build, test
just test       # gleam test
just format     # gleam format
just check      # all checks without deps download

Contributing

Contributions are welcome. See CONTRIBUTING.md for details.

License

MIT