em_filter

An Erlang library for building Emergence agents connected to an em_disco discovery service.

Philosophy

Emergence is a distributed discovery network, not a search engine with a central index. Any agent can contribute any result type. Emquest (the web gateway) fans out queries across all connected agents in parallel, deduplicates results by URL, and streams cards to the browser in real time.

em_filter is the library side of this: it handles the WebSocket connection to em_disco, receives queries, calls your handler, and sends results back. Your handler focuses entirely on one thing — turning a query into a list of result maps (embryos).

Features

Concepts

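Every node in the Emergence system is an agent. An agent has two optional features: capabilities, which it announces when it starts, and memory, a map passed to the handler that can persist state between queries.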

Memory is best used for caching expensive operations (HTTP responses, DNS lookups, rate limit state). Do not use memory to deduplicate results — deduplication is handled upstream by the Emquest pipeline.

Handler contract

Every handler module must export handle/2:

handle(Body :: binary(), Memory :: map()) ->
    {Result :: term(), NewMemory :: map()}

Body is the raw JSON query binary. Result is typically a list of embryo maps. A stateless handler can return Memory unchanged as NewMemory.
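To make the contract concrete, here is a minimal sketch of a handler that counts the queries it has seen, plus a local demo that threads the returned memory into the next call. The module name contract_sketch, the seen key, and demo/0 are hypothetical and only for illustration; how em_filter itself passes memory between calls is assumed from the handle/2 signature, not taken from its source.

```erlang
-module(contract_sketch).
-export([handle/2, demo/0]).

%% Hypothetical handler: returns no results and counts the queries seen so far.
-spec handle(binary(), map()) -> {term(), map()}.
handle(_Body, Memory) ->
    Seen = maps:get(seen, Memory, 0),
    {[], Memory#{seen => Seen + 1}}.

%% Calls handle/2 twice, feeding the first NewMemory into the second call.
%% This imitates a caller; it is not how em_filter is implemented.
-spec demo() -> non_neg_integer().
demo() ->
    {[], M1} = handle(<<"{}">>, #{}),
    {[], M2} = handle(<<"{}">>, M1),
    maps:get(seen, M2).
```

Calling contract_sketch:demo() returns 2, since each call increments the counter stored in memory.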

Embryo format

Agents return a list of embryo maps:

#{
    <<"type">>       => <<"rss">>,        %% agent-defined type
    <<"properties">> => #{
        <<"url">>    => <<"https://...">>,
        <<"title">>  => <<"...">>,
        <<"resume">> => <<"...">>
    }
}
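A small constructor can keep embryo maps consistent across a handler. The sketch below builds the map shown above from four binaries; the module name embryo_sketch and the function embryo/4 are hypothetical helpers, not part of em_filter.

```erlang
-module(embryo_sketch).
-export([embryo/4]).

%% Builds an embryo map with the keys shown in the format above.
%% All four arguments are expected to be binaries.
-spec embryo(binary(), binary(), binary(), binary()) -> map().
embryo(Type, Url, Title, Resume) ->
    #{
        <<"type">> => Type,
        <<"properties">> => #{
            <<"url">> => Url,
            <<"title">> => Title,
            <<"resume">> => Resume
        }
    }.
```

A handler could then return a list such as [embryo_sketch:embryo(<<"rss">>, Url, Title, Summary)] as its Result.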

Installation

Add to your rebar.config:

{deps, [
    {em_filter, "1.2.4"}
]}.

Usage

Stateless agent

A stateless agent announces capabilities but does not persist state between queries.

em_filter:start_agent(my_agent, my_handler, #{
    capabilities => [<<"search">>, <<"web">>]
}).

-module(my_handler).
-export([handle/2]).

handle(Body, Memory) ->
    Results = do_search(Body),
    {Results, Memory}.

Agent with memory (cache)

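The handler below caches results in the memory map, keyed by the raw query body, so a repeated query skips the API call. The agent is started with memory => ets.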

-module(my_handler).
-export([handle/2]).

handle(Body, Memory) ->
    Cache = maps:get(cache, Memory, #{}),
    case maps:get(Body, Cache, undefined) of
        undefined ->
            Results  = fetch_from_api(Body),
            NewCache = Cache#{Body => Results},
            {Results, Memory#{cache => NewCache}};
        Cached ->
            {Cached, Memory}
    end.

em_filter:start_agent(my_agent, my_handler, #{
    capabilities => [<<"search">>],
    memory       => ets
}).

Multi-disco connectivity

An agent connects to every disco node listed in emergence.conf. Each node gets its own persistent WebSocket connection and worker process.

[em_disco]
nodes = localhost:8080, em-disco.roques.me

With this config, start_agent/3 spawns two workers automatically, one for localhost:8080 and one for em-disco.roques.me.
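As an illustration of how a nodes value breaks down into one entry per worker, the sketch below splits the comma-separated string into host and port pairs. It is not em_filter's parser: the module nodes_sketch and parse_nodes/1 are hypothetical, and an entry without an explicit port is returned with undefined rather than guessing the library's default.

```erlang
-module(nodes_sketch).
-export([parse_nodes/1]).

%% Splits a comma-separated nodes value into {Host, Port} pairs.
%% Port is 'undefined' when the entry carries no explicit port;
%% how the library resolves that case is not assumed here.
-spec parse_nodes(binary()) -> [{binary(), non_neg_integer() | undefined}].
parse_nodes(Value) ->
    Entries = [string:trim(E) || E <- binary:split(Value, <<",">>, [global])],
    [parse_entry(E) || E <- Entries, E =/= <<>>].

%% Splits a single entry on the first colon, if there is one.
parse_entry(Entry) ->
    case binary:split(Entry, <<":">>) of
        [Host, Port] -> {Host, binary_to_integer(Port)};
        [Host] -> {Host, undefined}
    end.
```

For the config above, parse_nodes(<<"localhost:8080, em-disco.roques.me">>) yields two entries, matching the two workers.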

Port and transport resolution:

Configuration

The em_disco address is resolved in this order:

  1. [em_disco] nodes in emergence.conf (recommended)
  2. EM_DISCO_HOST / EM_DISCO_PORT environment variables (legacy, single node)
  3. Default: localhost:8080
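The resolution order above can be sketched as a simple fallback chain. This is an illustration under assumptions, not the library's code: resolve/1 and the module resolve_sketch are hypothetical, the nodes from emergence.conf are assumed to be already parsed into a list, and falling back to port 8080 when only EM_DISCO_HOST is set is a guess.

```erlang
-module(resolve_sketch).
-export([resolve/1]).

%% ConfNodes: {Host, Port} pairs already read from emergence.conf ([] if none).
-spec resolve([{string(), integer()}]) -> [{string(), integer()}].
resolve([_ | _] = ConfNodes) ->
    %% 1. Nodes from emergence.conf take priority.
    ConfNodes;
resolve([]) ->
    case os:getenv("EM_DISCO_HOST") of
        false ->
            %% 3. Nothing configured: use the documented default.
            [{"localhost", 8080}];
        Host ->
            %% 2. Legacy single-node environment variables.
            %% Assumption: port defaults to 8080 if EM_DISCO_PORT is unset.
            Port = case os:getenv("EM_DISCO_PORT") of
                false -> 8080;
                P -> list_to_integer(P)
            end,
            [{Host, Port}]
    end.
```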

emergence.conf locations:

Full example:

[em_disco]
nodes = localhost:8080, em-disco.roques.me

Console output

When running, em_filter logs two events at the notice level:

[em_filter] agent connected: my_agent @ localhost:8080
[em_filter] query: <body>

Connection warnings (auth rejected, timeout, unreachable) are logged at the warning level. OTP startup progress reports are suppressed.

HTML utilities

The following helpers are available for agents that scrape HTML:

Function                  Description
strip_scripts/1           Removes <script> tags
extract_elements/2        CSS-style element extraction
get_text/1                Strips all HTML tags
extract_attribute/2       Extracts a tag attribute value
clean_text/3              Strips noise and decodes entities
decode_html_entities/1    Decodes &amp;, &#x...;, &#...;
should_skip_link/2        Filters out unwanted URLs
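For a sense of what tag stripping involves, here is a standalone sketch that uses the standard re module. It is not the implementation of get_text/1, whose module and exact behaviour are not shown in this README; strip_tags_sketch and strip_tags/1 are hypothetical names, and the regex is a simplification that does not handle a ">" inside attribute values.

```erlang
-module(strip_tags_sketch).
-export([strip_tags/1]).

%% Removes anything that looks like an HTML tag from a binary.
%% Simplified: treats every "<...>" run as a tag.
-spec strip_tags(binary()) -> binary().
strip_tags(Html) ->
    re:replace(Html, <<"<[^>]*>">>, <<>>, [global, {return, binary}]).
```

For example, strip_tags_sketch:strip_tags(<<"<p>Hello <b>world</b></p>">>) returns <<"Hello world">>.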

License

Apache 2.0 — see LICENSE.md.