Variable Declaration
Variables allow you to specify configurable values when deploying a service. If no value is provided during deployment, the variable will use its defined default value.
variable "memory_mib" {
  type = number
  description = "Memory allocation in MiB for the model task"
  default = 14360
}
Keyword | Purpose |
---|---|
variable | Declares a reusable value block |
type | Specifies the type (number, string, list, etc.) |
description | Describes what this variable is used for |
default | Provides a fallback value if none is given at deploy time |
Reference variables with ${var.variable_name} in your task or resource definitions.
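As a minimal sketch of how a declaration and a reference fit together, the snippet below declares memory_mib and then uses it inside a resources block. The surrounding task and its image name are placeholders borrowed from the other examples on this page, not a required layout.

```hcl
# Declare the variable once; the default applies when no value is given at deploy time.
variable "memory_mib" {
  type = number
  description = "Memory allocation in MiB for the model task"
  default = 14360
}

# Placeholder task, shown only to illustrate where the reference goes.
task {
  name = "my-model"
  driver = "docker"
  config {
    image = "my-model-image:latest"
  }
  resources {
    cpu_mhz = 10240
    # Resolves to 14360 unless a different value is provided during deployment.
    memory_mib = "${var.memory_mib}"
  }
}
```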
Application Block
The application block is the top-level definition for an application deployment. It describes the overall app and contains one or more service blocks.
application {
  name = "My Custom Model"
  summary = "https://huggingface.co/your-company/model"
  description = "A custom model deployment"
  category = "model"
  tags = ["llm", "text2text"]
  service "model" {
    name = "my-model"
    // ...service definition...
  }
  service "ui" {
    name = "frontend"
    // ...service definition...
  }
}
Keyword | Purpose |
---|---|
name | Human-readable name for your application |
summary | Link to model card or documentation (optional) |
description | A short description of your deployment |
category | Category for grouping (e.g., model, agent, tool) |
tags | List of tags for search and filtering |
service | Defines a component of your application (model, UI, etc.) |
Each service block describes a deployable component (such as a model backend, UI, or internal service) and can contain nested task_group and task blocks for detailed configuration.
Service Block
A service block defines a deployable component of your application, such as a model backend, UI, or internal service. Each service can contain one or more task_group blocks, which in turn contain task blocks for detailed configuration.
service "model" {
  name = "my-model"
  task_group "model" {
    name = "my-model"
    count = 1
    task {
      name = "my-model"
      driver = "docker"
      config {
        image = "my-model-image:latest"
      }
      endpoint api {
        port = 5000
      }
      resources {
        cpu_mhz = 10240
        memory_mib = 4096
      }
    }
  }
}
# or if you want to re-use an existing service that's deployed
service "model" {
  name = "my-model"
  uses = "${var.model_slug}"
}
Keyword | Purpose |
---|---|
service | Declares a service component (e.g., model, ui, internal) |
name | Name of the service instance |
uses | (Optional) Reference to another service by its distinct slug |
task_group | Defines a group of tasks for this service |
config | Service-level configuration (e.g., dependencies) |
Each task_group block can contain one or more task blocks, which define how containers or processes are run. Services can also include dependencies and environment variables as needed.
Task Group Block
A task_group block represents a group of tasks (such as containers or commands) that are scheduled together on a single node. All tasks within a task_group share the same lifecycle and can communicate via localhost.
task_group "model" {
  name = "my-model"
  count = 1
  task {
    name = "my-model"
    driver = "docker"
    config {
      image = "my-model-image:latest"
    }
    resources {
      cpu_mhz = 10240
      memory_mib = 4096
    }
  }
}
Keyword | Purpose |
---|---|
task_group | Declares a group of tasks to be scheduled on one node |
name | Name of the task group |
count | Number of instances of this group to run |
task | Defines a unit of execution within the group |
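Because tasks in a group are placed on the same node, one task can reach another over localhost. The sketch below illustrates that with two tasks in a single group. The task names, image names, and the API_URL variable are hypothetical, and it assumes the endpoint port of the first task is reachable at localhost from the second.

```hcl
task_group "internal" {
  name = "indexer"
  count = 1

  # First task: an internal HTTP service listening on port 5000.
  task {
    name = "indexer-api"
    driver = "docker"
    config {
      image = "indexer-api:latest" # hypothetical image
    }
    endpoint api {
      port = 5000
    }
    resources {
      cpu_mhz = 1024
      memory_mib = 512
    }
  }

  # Second task: a worker scheduled on the same node as the API task.
  task {
    name = "indexer-worker"
    driver = "docker"
    config {
      image = "indexer-worker:latest" # hypothetical image
    }
    env {
      # Assumption: the sibling task is reachable via localhost on its endpoint port.
      API_URL = "http://localhost:5000"
    }
    resources {
      memory_mib = 256
    }
  }
}
```

With count = 1 there is one copy of the pair; raising count would run additional copies of the whole group, each with both tasks.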
Service Types
Use the service type that matches how the component should be operated and discovered.
- system: Default job type for general-purpose services. Deploys to the cluster based on the resources requested.
- batch: One-off or scheduled jobs that run to completion (ETL, migrations). Prefer driver = "exec" for short-lived containers.
- daemon: Long-running background agents that should be present on ALL nodes (log shippers, monitors). Not user-facing.
- model: Model backends (LLMs, embeddings, VLMs). Requires a task_group "model" with the primary model task. A proxy/API will be auto-generated.
- public-api: Publicly exposed backend API (stable, internet-facing). These backends must implement their own auth layer; token-based auth is not enforced for them.
- ui: End-user facing web frontends. Typically receive a public domain and depend on backend services.
- internal: Private backend components not exposed publicly (DBs, indexers, workers). Consumed by other services.
- tool: Auxiliary tool services used by agents or backends (e.g., MCP tools). Usually internal-only.
Tips
- Model services require a task_group labeled model; other labels are rejected.
- Public APIs and UIs are the typical places where you rely on ${service.<name>.domain_name}.
- Batch services commonly use task_group "internal" or task_group "tool" with tasks that exit on success.
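To illustrate the batch tip, here is a rough sketch of a one-off job built from the exec task syntax shown elsewhere on this page. The service name, task name, and migrate.sh script are hypothetical, and it assumes the service type is selected by the block label, as with service "model" and service "ui".

```hcl
# Hypothetical one-off job: runs a migration script and exits.
service "batch" {
  name = "db-migrate"

  task_group "internal" {
    name = "db-migrate"
    count = 1

    task {
      name = "migrate"
      driver = "exec"
      config {
        command = "/bin/bash"
        # Hypothetical script; it should exit with status 0 on success.
        args = ["migrate.sh"]
      }
      resources {
        memory_mib = 128
      }
    }
  }
}
```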
Task Group Types
Task group type communicates the intent and runtime characteristics of the tasks inside.
- model: Model-serving workloads with one primary model task; a proxy/API may be auto-generated.
- tool: Tooling sidecars or tool-specific services used by agents or backends.
- ui: Frontend tasks serving static assets or web UIs for users.
- internal: Private backends like DBs, workers, or internal HTTP services.
- daemon: Background/agent processes; long-running and not user-facing.
- public-api: Publicly reachable HTTP APIs with domain exposure and health checks.
When to choose which
- Choose model for LLM/embedding services that expose inference endpoints.
- Choose internal for non-public components (databases, indexers, workers) used by other services.
- Choose public-api when clients (including UIs) should call this service directly from outside the cluster.
- Choose ui for end-user web applications (frontends) or dashboards.
- Choose tool for MCP servers.
- Choose daemon for node-level or background tasks that run on all nodes and should not terminate.
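As one way to picture the public-api and ui choices together, the sketch below pairs a public API with a frontend that calls it. All names, images, ports, and the API_BASE_URL variable are hypothetical. It also assumes that <name> in ${service.<name>.domain_name} refers to the service's name attribute and that the reference may be used inside an env block; check both against your environment.

```hcl
# Backend that clients call directly from outside the cluster.
service "public-api" {
  name = "api"

  task_group "public-api" {
    name = "api"
    count = 1
    task {
      name = "api"
      driver = "docker"
      config {
        image = "my-api-image:latest" # hypothetical image
      }
      endpoint api {
        port = 8080
      }
      resources {
        cpu_mhz = 1024
        memory_mib = 512
      }
    }
  }
}

# Frontend that needs to know where the API is published.
service "ui" {
  name = "frontend"

  task_group "ui" {
    name = "frontend"
    count = 1
    task {
      name = "frontend"
      driver = "docker"
      config {
        image = "my-frontend-image:latest" # hypothetical image
      }
      env {
        # Assumption: "api" matches the name of the public-api service above.
        API_BASE_URL = "https://${service.api.domain_name}"
      }
      endpoint web {
        port = 3000
      }
      resources {
        memory_mib = 256
      }
    }
  }
}
```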
Task Block
A task block defines a single unit of execution, such as a container or shell command. Tasks are the smallest deployable units and can be containers, scripts, or binaries.
# container task
task {
  name = "my-model"
  driver = "docker"
  config {
    image = "my-model-image:latest"
  }
  env {
    EXAMPLE_ENV = "value"
  }
  endpoint api {
    port = 5000
  }
  resources {
    cpu_mhz = 10240
    memory_mib = 4096
  }
}
# exec task
task {
  name = "backup-task"
  driver = "exec"
  config {
    command = "/bin/bash"
    args = ["backup.sh"]
  }
  resources {
    memory_mib = 128
  }
}
Keyword | Purpose |
---|---|
task | Declares a single unit of execution (container, command, etc.) |
name | Name of the task |
driver | Execution backend (e.g., docker, exec) |
config | Task-specific configuration (image, command, args, etc.) |
env | Environment variables for the task |
endpoint | Network endpoint exposed by the task |
resources | Resource allocation for the task |
Model Block
The model block is syntactic sugar for service "model". The compiler automatically converts a model block into an equivalent service "model" block during processing.
The model block provides a simplified way to deploy a model.
model {
  name = "llama-8b"
  config {
    image = "sglang_hf_generic:latest"
    args = [
      "--model-path", "meta-llama/Llama-3.1-8B-Instruct",
      "--host", "0.0.0.0",
      "--port", "5000"
    ]
  }
  endpoint llm {
    port = 5000
    health {
      type = "http"
      path = "/health"
      interval = "10s"
      timeout = "2s"
    }
  }
  resources {
    cpu_mhz = 10240
    memory_mib = 14360
    sharing_strategy = "${var.sharing_strategy}"
    device {
      name = "${var.gpu_name}"
      memory_mibs = "${var.gpu_memory_mibs}"
    }
  }
}
Keyword | Purpose |
---|---|
name | Unique identifier for your model deployment |
count | Number of model replicas to run |
config | Container configuration including image and startup arguments |
args | List of arguments passed to the container |
endpoint | Network endpoint and health check configuration |
resources | Hardware resource allocation for the model |
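To make the "syntactic sugar" relationship more concrete, the sketch below places a short model block next to a service "model" block it might correspond to. The exact structure the compiler produces is not specified here, so the long form is an assumption, including the docker driver and the reuse of the model name for the task group and task. The two forms are alternatives and are not meant to appear in the same recipe.

```hcl
# Shorthand form.
model {
  name = "llama-8b"
  config {
    image = "sglang_hf_generic:latest"
  }
  endpoint llm {
    port = 5000
  }
  resources {
    cpu_mhz = 10240
    memory_mib = 14360
  }
}

# Assumed long form of the same deployment (illustrative, not compiler output).
service "model" {
  name = "llama-8b"
  task_group "model" {
    name = "llama-8b"
    count = 1
    task {
      name = "llama-8b"
      driver = "docker"
      config {
        image = "sglang_hf_generic:latest"
      }
      endpoint llm {
        port = 5000
      }
      resources {
        cpu_mhz = 10240
        memory_mib = 14360
      }
    }
  }
}
```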
Daemon Block
The daemon block is syntactic sugar for service "daemon". The compiler automatically converts a daemon block into an equivalent service "daemon" block during processing.
The daemon block provides a simplified way to define background services such as log shippers, sidecars, or monitoring agents. daemon services are deployed on every node in the cluster, allowing them to run continuously in the background.
daemon {
  name = "vector"
  config {
    image = "timberio/vector:latest-debian"
    volumes = [
      "/var/log:/var/log:ro"
    ]
    args = [
      "--config-dir", "/local/etc/vector"
    ]
  }
  template {
    data = <<EOF
[sources.my_source]
type = "file"
include = ["/var/log/**/*.log"]
fingerprint.strategy = "device_and_inode"
[sources.docker_logs]
type = "docker_logs"
[sinks.my_sink]
type = "console"
inputs = ["my_source", "docker_logs"]
encoding.codec = "json"
EOF
    destination = "/local/etc/vector/vector.toml"
  }
  resources {
    cpu_mhz = 100
    memory_mib = 1024
  }
}
Keyword | Purpose |
---|---|
name | Unique identifier for your daemon |
config | Container configuration for the daemon |
image | Docker image to launch |
volumes | List of volume mounts for the container |
args | Command-line arguments passed to the container |
template | File templating block for configuration files |
data | Contents of the template file |
destination | Path inside the container where the template will be written |
resources | Resource allocation for the daemon |
The daemon block is ideal for defining persistent background processes that support your main application, such as log collectors or monitoring agents. It’s also useful for running system jobs that need to execute on every node in the cluster.
Config
Configure how your model container should be launched.
config {
  image = "your-custom-model-image:latest"
  args = ["--host", "0.0.0.0", "--port", "5000"]
}
Keyword | Purpose |
---|---|
image | Docker image to launch |
args | Command-line arguments passed to your container |
Endpoint & Health Check
Specify the model’s serving port and health check mechanism.
endpoint llm {
  port = 5000
  health {
    type = "http"
    path = "/health"
    interval = "10s"
    timeout = "2s"
  }
}
Keyword | Purpose |
---|---|
endpoint | Exposes a port for requests and defines health checks |
port | Container port on which the model serves traffic |
static | (Optional) Host port on which the model serves traffic. If not specified, a host port is allocated dynamically |
health | Block that determines if the container is “healthy” |
type | Health check type (http) |
path | HTTP path to hit for health checking |
interval | How often to check the health |
timeout | Timeout for each health check attempt |
Ensure the health check path is correctly implemented in your model.
If the health check fails, the model may not be marked as healthy.
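The static keyword is listed in the table but not used in the example. A minimal sketch of pinning the host port might look like the following; the placement of static directly inside the endpoint block and the value 8080 are assumptions for illustration.

```hcl
endpoint llm {
  # Port the model listens on inside the container.
  port = 5000
  # Assumed placement: fixed host port. Omit this line to get a dynamically allocated one.
  static = 8080
  health {
    type = "http"
    # The model server must answer on this path for the check to pass.
    path = "/health"
    interval = "10s"
    timeout = "2s"
  }
}
```

A fixed host port can collide with other workloads on the same node, so dynamic allocation is usually the safer default unless an external client depends on a known port.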
Resources and GPU Allocation
Declare CPU, memory, and GPU resources for your task.
resources {
  cpu_mhz = 10240
  memory_mib = "${var.memory_mib}"
  sharing_strategy = "${var.sharing_strategy}"
  device {
    name = "${var.gpu_name}"
    memory_mibs = "${var.gpu_memory_mibs}"
  }
}
Keyword | Purpose |
---|---|
cpu_mhz | How much CPU to allocate (in MHz) |
memory_mib | RAM allocated to this task (in MiB) |
sharing_strategy | GPU usage model (mps for CUDA Multi-Process Service) |
device | GPU allocation block |
name | GPU device name (e.g., nvidia/gpu) |
memory_mibs | Memory to allocate per GPU (MiB). e.g., [10240] requests a node with at least 1 GPU having 10240 MiB of memory |
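The resources example references three GPU-related variables without showing their declarations. A plausible set of declarations, using the variable syntax from the start of this page, is sketched below. The descriptions and default values are illustrative assumptions drawn from the examples in the table above.

```hcl
variable "sharing_strategy" {
  type = string
  description = "GPU usage model for the task"
  default = "mps" # CUDA Multi-Process Service
}

variable "gpu_name" {
  type = string
  description = "GPU device name to request"
  default = "nvidia/gpu"
}

variable "gpu_memory_mibs" {
  type = list
  description = "Memory to allocate per GPU, in MiB"
  # One entry per GPU: this requests a node with at least 1 GPU having 10240 MiB.
  default = [10240]
}
```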
Sample Deployment Flow
# 1. Seed the model recipe
sora recipe seed model.hcl
# 2. View the available recipes
sora recipe list
# 3. Deploy the model using its slug
sora recipe deploy <recipe-slug>