# LlamaCppSdk
`llama_cpp_sdk` is the first concrete backend package for the self-hosted
inference stack:

```
external_runtime_transport
  -> self_hosted_inference_core
    -> llama_cpp_sdk
      -> req_llm (through published EndpointDescriptor values)
```
It owns the llama-server specifics that do not belong in the shared kernel:

- boot-spec normalization
- llama-server flag rendering
- readiness and health probes
- stop semantics for a spawned service
- backend manifest publication
- OpenAI-compatible endpoint descriptor production
It does not parse OpenAI payloads, token streams, or inference responses.
Those stay northbound in req_llm and the calling control plane.
The phase-1 proof fixture also serves `/v1/chat/completions` with both
standard JSON and SSE streaming responses, so the published endpoint contract
can be exercised honestly by northbound clients.
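As a hedged illustration of exercising that contract, a northbound client could consume the fixture's SSE stream roughly like this. This sketch assumes the Req HTTP client; the URL, model name, and payload shape are illustrative, not part of this package's API:

```elixir
# Illustrative only: stream SSE chunks from a locally running fixture.
# Req's `into:` option delivers response body chunks as they arrive.
Req.post(
  url: "http://127.0.0.1:8080/v1/chat/completions",
  json: %{
    model: "qwen3-14b-instruct",
    stream: true,
    messages: [%{role: "user", content: "Hello"}]
  },
  into: fn {:data, chunk}, {req, resp} ->
    # Each `chunk` is a raw SSE frame ("data: {...}\n\n"); parsing is left
    # to the northbound consumer (req_llm in the real stack).
    IO.write(chunk)
    {:cont, {req, resp}}
  end
)
```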
## Current Release Boundary

The first backend release is intentionally narrow and truthful:

- supported startup kind: `:spawned`
- supported execution surface: `:local_subprocess`
- non-local execution surfaces: rejected during boot-spec normalization
- published protocol: `:openai_chat_completions`
- northbound integration: `self_hosted_inference_core`
- `:ssh_exec` story: documented as a future additive path once remote
  model-path semantics, readiness reachability, and shutdown guarantees are
  verified
## Installation

Add the package to your dependency list:

```elixir
def deps do
  [
    {:llama_cpp_sdk, "~> 0.1.0"}
  ]
end
```

`llama_cpp_sdk` depends on `self_hosted_inference_core`, which in turn depends
on `external_runtime_transport`.
## Quick Start

Resolve a spawned endpoint through the shared kernel:
```elixir
alias LlamaCppSdk
alias SelfHostedInferenceCore.ConsumerManifest

consumer =
  ConsumerManifest.new!(
    consumer: :jido_integration_req_llm,
    accepted_runtime_kinds: [:service],
    accepted_management_modes: [:jido_managed],
    accepted_protocols: [:openai_chat_completions],
    required_capabilities: %{streaming?: true},
    optional_capabilities: %{tool_calling?: :unknown},
    constraints: %{startup_kind: :spawned},
    metadata: %{}
  )

{:ok, resolution} =
  LlamaCppSdk.resolve_endpoint(
    %{
      model: "/models/qwen3-14b-instruct.gguf",
      alias: "qwen3-14b-instruct",
      host: "127.0.0.1",
      port: 8080,
      ctx_size: 8_192,
      gpu_layers: :all,
      threads: 8,
      parallel: 2,
      flash_attn: :auto
    },
    consumer,
    owner_ref: "run-123",
    ttl_ms: 30_000
  )

resolution.endpoint.base_url
resolution.lease.lease_ref
```
The backend normalizes the boot spec, registers itself with
self_hosted_inference_core, and publishes an endpoint descriptor once the
service is actually ready.
That published descriptor is the northbound contract used by
`jido_integration`. The caller should execute requests against:

- `endpoint.base_url <> "/chat/completions"` for chat completions
- `endpoint.headers` for bearer auth or other published headers
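Putting those two pieces together, a minimal non-streaming request might look like the following sketch. It assumes the Req HTTP client and a `resolution` obtained as in the Quick Start; the payload shape follows the OpenAI chat-completions convention rather than anything this package defines:

```elixir
# Illustrative only: issue one chat completion against the published endpoint,
# reusing the descriptor's base URL and headers verbatim.
endpoint = resolution.endpoint

{:ok, response} =
  Req.post(
    url: endpoint.base_url <> "/chat/completions",
    headers: endpoint.headers,
    json: %{
      model: "qwen3-14b-instruct",
      messages: [%{role: "user", content: "Hello"}]
    }
  )

response.body["choices"]
```

The point of the design is that the caller never assembles auth or host details itself; everything needed to make the request is carried by the published descriptor.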
## Supported Boot Fields

The first release supports normalized fields for the installed
llama-server CLI surface:

`binary_path`, `launcher_args`, `model`, `alias`, `host`, `port`, `ctx_size`,
`gpu_layers`, `threads`, `threads_batch`, `parallel`, `flash_attn`,
`embeddings`, `api_key`, `api_key_file`, `api_prefix`, `timeout_seconds`,
`threads_http`, `pooling`, `environment`, and `extra_args`

See `guides/boot_spec.md` for the full contract.
When `api_key_file` is provided, `llama_cpp_sdk` reads it to derive the
published authorization header for northbound clients.
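Conceptually, that derivation amounts to something like the following sketch. The file path and header shape here are hypothetical illustrations, not the package's actual internals:

```elixir
# Hypothetical sketch: read the key file, trim trailing whitespace, and
# publish a bearer Authorization header alongside the endpoint descriptor.
api_key =
  "/run/secrets/llama_api_key"
  |> File.read!()
  |> String.trim()

headers = [{"authorization", "Bearer " <> api_key}]
```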
## Readiness And Health

Readiness is owned here, above the transport seam:

- launch the spawned process via `external_runtime_transport`
- probe TCP reachability on the requested host and port
- probe HTTP availability on `/health` or `/v1/models`
- publish the endpoint only after readiness succeeds
Health continues to poll after publication so the shared kernel can expose
healthy, degraded, or unavailable runtime truth.
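The probe sequence above can be sketched as a single check: TCP reachability first, then an HTTP probe. This is a minimal illustration assuming OTP's `:gen_tcp` and the Req HTTP client; the module name, timeout, and probe path are hypothetical:

```elixir
# Illustrative readiness check: connect to the port, then hit /health.
# Real probing in the package also retries and feeds health state back
# to the shared kernel after publication.
defmodule ReadinessSketch do
  def ready?(host, port) do
    with {:ok, socket} <- :gen_tcp.connect(String.to_charlist(host), port, [], 1_000),
         :ok <- :gen_tcp.close(socket),
         {:ok, %{status: 200}} <- Req.get(url: "http://#{host}:#{port}/health") do
      true
    else
      _ -> false
    end
  end
end
```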
## Examples And Guides

- `guides/architecture.md`
- `guides/readiness_and_health.md`
- `guides/integration_with_self_hosted_inference_core.md`
- `examples/README.md`
## Development

Run the normal quality checks from the repo root when your environment allows
Mix to create its local coordination socket:

```shell
mix format --check-formatted
mix compile --warnings-as-errors
mix test
MIX_ENV=test mix credo --strict
MIX_ENV=dev mix dialyzer
mix docs
```

## License
This repository is released under the MIT License. See `LICENSE` for the
canonical license text and `CHANGELOG.md` for release history.