
LlamaCppSdk

llama_cpp_sdk is the first concrete backend package for the self-hosted inference stack:

external_runtime_transport
  -> self_hosted_inference_core
  -> llama_cpp_sdk
  -> req_llm through published EndpointDescriptor values

It owns the llama-server specifics that do not belong in the shared kernel: boot spec normalization, spawned-process launch and readiness probing, and publication of the endpoint descriptor.

It does not parse OpenAI payloads, token streams, or inference responses. Those stay northbound in req_llm and the calling control plane.

The phase-1 proof fixture also serves /v1/chat/completions with both standard JSON and SSE streaming responses so the published endpoint contract can be exercised honestly by northbound clients.
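
An SSE stream separates events with blank lines, each event carrying a `data: <json>` payload until a `data: [DONE]` sentinel. A minimal parsing sketch of that shape (hypothetical helper, not part of llama_cpp_sdk; the chunk contents are illustrative):

```elixir
defmodule SseSketch do
  # Hypothetical helper: split a raw SSE body into its `data:` payloads.
  # Shown only to illustrate the stream shape the fixture serves.
  def data_lines(raw) when is_binary(raw) do
    raw
    |> String.split("\n\n", trim: true)
    |> Enum.flat_map(fn event ->
      # Keep only `data:` lines, dropping the terminal [DONE] sentinel.
      for "data: " <> payload <- String.split(event, "\n", trim: true),
          payload != "[DONE]",
          do: payload
    end)
  end
end

raw = """
data: {"choices":[{"delta":{"content":"Hel"}}]}

data: {"choices":[{"delta":{"content":"lo"}}]}

data: [DONE]
"""

SseSketch.data_lines(raw)
# two JSON payload strings, with the [DONE] sentinel dropped
```

Each payload string would then be JSON-decoded into an OpenAI-style chat completion chunk by the northbound client.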

Current Release Boundary

The first backend release is intentionally narrow and truthful.

Installation

Add the package to your dependency list:

def deps do
  [
    {:llama_cpp_sdk, "~> 0.1.0"}
  ]
end

llama_cpp_sdk depends on self_hosted_inference_core, which in turn depends on external_runtime_transport.

Quick Start

Resolve a spawned endpoint through the shared kernel:

alias SelfHostedInferenceCore.ConsumerManifest

consumer =
  ConsumerManifest.new!(
    consumer: :jido_integration_req_llm,
    accepted_runtime_kinds: [:service],
    accepted_management_modes: [:jido_managed],
    accepted_protocols: [:openai_chat_completions],
    required_capabilities: %{streaming?: true},
    optional_capabilities: %{tool_calling?: :unknown},
    constraints: %{startup_kind: :spawned},
    metadata: %{}
  )

{:ok, resolution} =
  LlamaCppSdk.resolve_endpoint(
    %{
      model: "/models/qwen3-14b-instruct.gguf",
      alias: "qwen3-14b-instruct",
      host: "127.0.0.1",
      port: 8080,
      ctx_size: 8_192,
      gpu_layers: :all,
      threads: 8,
      parallel: 2,
      flash_attn: :auto
    },
    consumer,
    owner_ref: "run-123",
    ttl_ms: 30_000
  )

resolution.endpoint.base_url  # published base URL for northbound requests
resolution.lease.lease_ref    # lease reference for the resolved runtime

The backend normalizes the boot spec, registers itself with self_hosted_inference_core, and publishes an endpoint descriptor once the service is actually ready.

That published descriptor is the northbound contract used by jido_integration. The caller should execute requests against the base_url and authorization headers published on the descriptor, not against internal process details.
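
Concretely, a northbound call targets the published base_url with an OpenAI-style chat completions payload. A hypothetical sketch of the request construction (the base_url value mirrors the Quick Start spawn above; none of this is llama_cpp_sdk API):

```elixir
# Assumed value: in practice this comes from resolution.endpoint.base_url.
base_url = "http://127.0.0.1:8080"

# The chat completions route served by llama-server and the proof fixture.
url = base_url <> "/v1/chat/completions"

# OpenAI-style payload per the :openai_chat_completions protocol.
payload = %{
  model: "qwen3-14b-instruct",
  messages: [%{role: "user", content: "Hello"}],
  stream: false
}
```

With an HTTP client such as Req the call would look roughly like `Req.post!(url, json: payload)`; a streaming request would set `stream: true` and consume SSE chunks instead.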

Supported Boot Fields

The first release supports normalized fields for the installed llama-server CLI surface.

See guides/boot_spec.md for the full contract. When api_key_file is provided, llama_cpp_sdk reads it to derive the published authorization header for northbound clients.
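
The derived header is a standard bearer token. A minimal sketch of that derivation (hypothetical helper, not the package's actual code):

```elixir
defmodule AuthSketch do
  # Hypothetical derivation: read the key file, trim trailing whitespace,
  # and build a standard bearer Authorization header for northbound use.
  def authorization_header(api_key_file) do
    key = api_key_file |> File.read!() |> String.trim()
    {"authorization", "Bearer " <> key}
  end
end
```

A file containing `secret-token` would yield `{"authorization", "Bearer secret-token"}`.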

Readiness And Health

Readiness is owned here, above the transport seam:

  1. launch the spawned process via external_runtime_transport
  2. probe TCP reachability on the requested host and port
  3. probe HTTP availability on /health or /v1/models
  4. publish the endpoint only after readiness succeeds
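
The probe sequence above amounts to a bounded retry loop. A sketch under stated assumptions (hypothetical helper functions, not the released API; only Erlang stdlib calls):

```elixir
defmodule ReadinessSketch do
  # Hypothetical helper: retry a probe function until it reports :ok
  # or the attempt budget runs out.
  def wait_until_ready(probe, attempts, delay_ms \\ 100)

  def wait_until_ready(_probe, 0, _delay_ms), do: {:error, :not_ready}

  def wait_until_ready(probe, attempts, delay_ms) do
    case probe.() do
      :ok ->
        :ok

      {:error, _reason} ->
        Process.sleep(delay_ms)
        wait_until_ready(probe, attempts - 1, delay_ms)
    end
  end

  # Step 2 as a probe: TCP reachability via the Erlang stdlib.
  def tcp_probe(host, port) do
    case :gen_tcp.connect(String.to_charlist(host), port, [], 500) do
      {:ok, socket} ->
        :gen_tcp.close(socket)
        :ok

      {:error, reason} ->
        {:error, reason}
    end
  end
end
```

Step 3 would layer an HTTP GET of /health or /v1/models on the same retry loop before the endpoint is published.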

Health continues to poll after publication so the shared kernel can expose healthy, degraded, or unavailable runtime truth.

Examples And Guides

See guides/boot_spec.md for the full boot spec contract.

Development

Run the normal quality checks from the repo root when your environment allows Mix to create its local coordination socket:

mix format --check-formatted
mix compile --warnings-as-errors
mix test
MIX_ENV=test mix credo --strict
MIX_ENV=dev mix dialyzer
mix docs

License

This repository is released under the MIT License. See LICENSE for the canonical license text and CHANGELOG.md for release history.