Doramagic Project Pack · Human Manual
BentoML
The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!
BentoML Overview and Getting Started
Related topics: Service Definition, IO Types, and API Protocols (HTTP, gRPC, SSE), Model Store, Bento Build, and Framework Integrations, Deployment, Containerization, BentoCloud, and Operations
Purpose and Scope
BentoML is a Python library for building online serving systems optimized for AI applications and model inference. The framework is designed around three core capabilities: turning model inference scripts into REST API servers using standard Python type hints, packaging everything into reproducible Docker container images, and maximizing compute utilization through features like dynamic batching, model parallelism, multi-stage pipelines, and multi-model inference-graph orchestration. Source: README.md
A BentoML project lifecycle typically follows these stages: develop a Service locally, build it into a Bento (a standardized deployable artifact), containerize it as a Docker image, and then deploy it to BentoCloud or a custom infrastructure. The CLI commands bentoml serve, bentoml build, and bentoml containerize map directly to these stages. Source: README.md
The project is licensed under Apache 2.0, collects anonymous usage analytics (opt-out via BENTOML_DO_NOT_TRACK=True or the --do-not-track CLI flag), and maintains its main documentation site separately from the repository. Source: README.md
High-Level Architecture
The runtime architecture is organized into several cooperating subsystems: a serving layer that manages services and resources, an SDK that defines Service and Model abstractions, a client layer for remote invocation, and internal utilities for analytics, containerization, file watching, and configuration.
flowchart TB
User[Developer / Client] --> CLI[bentoml CLI]
CLI --> Serve[bentoml serve]
CLI --> Build[bentoml build]
CLI --> Container[bentoml containerize]
Serve --> Server[serving.py / Server]
Server --> Allocator[ResourceAllocator]
Build --> Bento[Bento Artifact]
Container --> Docker[Docker Image]
Server --> Models[Model Store]
Models --> HF[HuggingFaceModel]
Models --> Store[StoredModel]
Docker --> Deploy[Deploy to BentoCloud / Cluster]
Deploy --> Client[RemoteProxy / SyncHTTPClient / AsyncHTTPClient]
The serving subsystem is implemented in src/_bentoml_impl/server/serving.py, which exposes a Server wrapper around Circus processes and a _get_server_socket helper that negotiates between Unix Domain Sockets (UDS) and TCP based on the platform. On Windows, WSL, or when BENTOML_NO_UDS is set, the server falls back to localhost TCP. Source: src/_bentoml_impl/server/serving.py
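The transport choice described above can be sketched as a small decision function. This is an illustration only, not the code in serving.py: the name pick_transport and its parameters are hypothetical, WSL detection is passed in as a flag rather than probed, and treating any non-empty BENTOML_NO_UDS value as "set" is an assumption.

```python
from typing import Mapping


def pick_transport(platform: str, is_wsl: bool, environ: Mapping[str, str]) -> str:
    """Return "tcp" or "uds" following the rule described above.

    Hypothetical helper; the real logic lives in _get_server_socket.
    """
    # Windows and WSL fall back to localhost TCP.
    if platform.startswith("win") or is_wsl:
        return "tcp"
    # Assumption: any non-empty value of BENTOML_NO_UDS forces TCP.
    if environ.get("BENTOML_NO_UDS"):
        return "tcp"
    return "uds"
```

A caller would typically pass sys.platform and os.environ; keeping them as parameters makes the rule easy to exercise without changing the real environment.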
The ResourceAllocator class in src/_bentoml_impl/server/allocator.py tracks GPU and CPU assignments. It queries system_resources() for the available NVIDIA devices, decrements a remaining-gpus counter on each assignment, and supports fractional allocation. Two environment variables influence behavior: BENTOML_DISABLE_GPU_ALLOCATION and CUDA_VISIBLE_DEVICES. Source: src/_bentoml_impl/server/allocator.py
Key Subsystems and Workflows
Services and IO Descriptors
Services are the unit of deployment. Each Service declares API endpoints via decorators; each endpoint consumes and produces data described by IO descriptors. The community has reported bugs related to descriptor resolution, for example an IndexError in IODescriptor.from_output() when methods return bare (unparameterized) iterator annotations such as t.Iterator or t.Generator. Source: Issue #5625
For type-hint validation similar to FastAPI's Pydantic integration, the SDK uses IODescriptor. A long-standing community request asks for richer Pydantic model schema support. Source: Issue #1480
Model Store and Model References
bentoml.models provides two complementary abstractions: a ModelStore for saved artifacts on disk and SDK Model reference types such as HuggingFaceModel. The SDK base class Model[T] is defined in src/_bentoml_sdk/models/base.py and declares abstract methods to_info() and from_info() for round-tripping between a model reference and a BentoModelInfo object used during build. Source: src/_bentoml_sdk/models/base.py
HuggingFaceModel (in src/_bentoml_sdk/models/huggingface.py) accepts a model_id, an optional revision, an endpoint (defaulting to HF_ENDPOINT or https://huggingface.co), and include/exclude file patterns. A recent fix in v1.4.37 ensures that HuggingFaceModel is correctly loaded inside a container. Source: v1.4.37 release notes
For model-info persistence, the on-disk YAML format is parsed in src/bentoml/_internal/models/model.py, where ModelInfo.from_yaml strips legacy fields and provides sensible defaults when a pre-1.0 model is loaded. Source: src/bentoml/_internal/models/model.py
Client and Remote Proxy
The client layer is structured around an abstract AbstractClient (in src/_bentoml_impl/client/base.py) that exposes endpoints as a dictionary of ClientEndpoint records. A map_exception helper translates HTTP responses into typed BentoMLException subclasses via an error_mapping table. Source: src/_bentoml_impl/client/base.py
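The status-to-exception translation can be pictured as a dictionary lookup with a fallback. The sketch below is a simplified stand-in, not the BentoML implementation: every class name, the ERROR_MAPPING table, and the function status_to_exception are hypothetical.

```python
class ServiceError(Exception):
    """Hypothetical base class standing in for BentoMLException."""

    status_code = 500


class NotFoundError(ServiceError):
    status_code = 404


class BadInputError(ServiceError):
    status_code = 400


# Lookup table keyed by HTTP status code, similar in spirit to error_mapping.
ERROR_MAPPING = {cls.status_code: cls for cls in (NotFoundError, BadInputError)}


def status_to_exception(status_code: int, body: str) -> ServiceError:
    """Build a typed exception for a failed response, defaulting to the base class."""
    exc_type = ERROR_MAPPING.get(status_code, ServiceError)
    return exc_type(f"{status_code}: {body}")
```

Callers can then catch a specific subclass (for example NotFoundError) instead of inspecting status codes by hand.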
RemoteProxy (in src/_bentoml_impl/client/proxy.py) wraps a service URL and provides sync/async HTTP clients. The default timeout is derived from the service configuration plus a 1% margin, falling back to 60 seconds when no service object is passed. A v1.4.35 fix in PR #5541 ensures that the connector is recreated on session refresh to prevent closed-session errors. Source: v1.4.35 release notes
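The timeout rule above amounts to a one-line calculation. The sketch below assumes the service timeout is already available as a number; the function name client_timeout and the constant name are hypothetical, and how the real code reads the value from the service configuration is not shown here.

```python
from typing import Optional

DEFAULT_TIMEOUT_SECONDS = 60.0  # fallback when no service object is passed


def client_timeout(service_timeout: Optional[float]) -> float:
    """Return the service timeout plus a 1% margin, or the 60 second fallback."""
    if service_timeout is None:
        return DEFAULT_TIMEOUT_SECONDS
    return service_timeout * 1.01
```

A plausible reason for the margin is that the client then waits slightly longer than the server, so a server-side timeout surfaces as a proper response rather than a client-side disconnect.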
Build, Container, and Configuration
The build path records analytics events; the handler in src/bentoml/_internal/utils/analytics/cli_events.py differentiates between BentoInfo and BentoInfoV2 to count runners correctly and reports the total Bento size, model size, and model types. Source: src/bentoml/_internal/utils/analytics/cli_events.py
The legacy buildx shim in src/bentoml/_internal/utils/buildx.py is preserved for bentoctl compatibility but emits a DeprecationWarning and forwards to the new bentoml.container.build and bentoml.container.health APIs. Multi-arch builds gained sharing=locked cache mounts in v1.4.39 to avoid corruption across parallel builds. Source: v1.4.39 release notes
For configuration, BentoML supports .env files via the dotenv module in src/bentoml/_internal/utils/dotenv.py, which parses KEY=VALUE and export KEY=VALUE syntax. Separately, the community has requested support for the pylock.toml standard for per-dependency pinning; it is not yet supported. Source: Issue #5466
Development Workflow and Reloading
The Circus-based file watcher in src/bentoml/_internal/utils/circus/watchfilesplugin.py loads a BentoBuildConfig to derive include/exclude path specs, filters changes through those specs, and triggers a restart call on the Circus arbiter when matching files change. This is the mechanism that powers bentoml serve --reload. Source: src/bentoml/_internal/utils/circus/watchfilesplugin.py
The internal server README in src/bentoml/_internal/server/README.md demonstrates running a service both via bentoml serve and directly through uvicorn hello:app --reload, then exercising it with curl. Source: src/bentoml/_internal/server/README.md
Common Failure Modes and Limitations
- Deprecated IO types: bentoml.io emits a BentoMLDeprecationWarning from v1.4 onward. Users should migrate to the new-style IO descriptors. Source: Issue #5365
- Iterator return annotations: Bare (unparameterized) iterator types in service methods crash IODescriptor.from_output(). Source: Issue #5625
- Missing dependency formats: pylock.toml is not yet supported as an alternative to requirements.txt. Source: Issue #5466
- gRPC transport: Native gRPC support is not in core; this is a long-running community request. Source: Issue #703
- Server-Sent Events: SSE for streaming responses is not natively supported. Source: Issue #3743
- SpaCy runner: Removed in v1.0; no built-in runner currently exists. Source: Issue #4134
- Resource exhaustion: When more GPUs are requested than available, ResourceAllocator warns and continues, which may cause runtime failures downstream. Source: src/_bentoml_impl/server/allocator.py
- Multi-bento model import: Importing multiple Bentos that share a model required a fix in v1.4.35. Source: v1.4.35 release notes
- Container path resolution: A path-traversal hardening landed in v1.4.34. Source: v1.4.34 release notes
Quick Start Example
A minimal end-to-end loop, as documented in the project README, is:
# 1. Install BentoML
pip install bentoml
# 2. Develop a service (defined in service.py)
bentoml serve service.py:svc
# 3. Build a Bento (standardized deployable artifact)
bentoml build
# 4. Containerize (requires Docker daemon)
bentoml containerize summarization:latest
# 5. Run the image
docker run --rm -p 3000:3000 summarization:latest
# 6. Deploy to BentoCloud
bentoml cloud login
bentoml deploy
Source: README.md
Source: https://github.com/bentoml/BentoML / Human Manual
Service Definition, IO Types, and API Protocols (HTTP, gRPC, SSE)
Related topics: BentoML Overview and Getting Started, Model Store, Bento Build, and Framework Integrations
Overview
In BentoML, a *Service* is the top-level object that bundles one or more ML models behind strongly-typed HTTP APIs. Each service method maps to an HTTP route whose request/response payload is governed by an IODescriptor. The runtime exposes two cooperating transports: an HTTP/JSON (and HTTP/pickle) implementation built on httpx / aiohttp, and an in-process reverse-proxy that can forward to external commands. Both transports share the same ClientEndpoint model so that the same service class is reachable through the sync HTTPClient, the async AsyncHTTPClient, or the RemoteProxy. Long-running or streaming endpoints are surfaced through Task / AsyncTask handles. Native gRPC and Server-Sent Events (SSE) are not part of the current v1.4 protocol surface; they are tracked as community feature requests (see issue #703 and issue #3743).
Service Definition and Method Endpoints
A service is declared by decorating a class with @bentoml.service and marking its API methods with @bentoml.api. The decorator machinery turns each API method into a ClientEndpoint whose route, input_spec, output_spec, doc, stream_output, and is_task fields are derived from the Python type annotations. When a Service object is passed into an HTTP client, the constructor iterates service.apis.items() and registers one route per API, generating JSON-Schema-compatible input/output descriptors:
    for name, method in service.apis.items():
        routes[name] = ClientEndpoint(
            name=name,
            route=method.route,
            input=method.input_spec.model_json_schema(),
            output=method.output_spec.model_json_schema(),
            doc=method.doc,
            input_spec=method.input_spec,
            output_spec=method.output_spec,
            stream_output=method.is_stream,
            is_task=method.is_task,
        )

Source: src/_bentoml_impl/client/proxy2.py:1-220.

The same endpoint shape is reused on the server side, so the route table is symmetric between the producer (the service definition) and the consumer (the client). Service-level configuration such as the readyz endpoint is read from service.config.get("endpoints", {}).get("readyz", "/readyz") (src/_bentoml_impl/client/proxy2.py:1-220). The serve_http entry point exported by src/_bentoml_impl/server/__init__.py is the public API used to start a service.
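To make the idea of deriving endpoint records from type annotations concrete, here is a standard-library-only sketch. It does not use BentoML: the Endpoint dataclass, collect_endpoints, the Summarizer class, and the "/<method name>" route convention are all hypothetical stand-ins for the real ClientEndpoint machinery.

```python
import inspect
from dataclasses import dataclass
from typing import Any, Dict, get_type_hints


@dataclass
class Endpoint:
    """Hypothetical, simplified stand-in for a ClientEndpoint record."""

    name: str
    route: str
    input_types: Dict[str, Any]
    output_type: Any
    doc: str


def collect_endpoints(cls: type) -> Dict[str, Endpoint]:
    """Build one Endpoint per public method, using its type annotations."""
    endpoints = {}
    for name, func in inspect.getmembers(cls, inspect.isfunction):
        if name.startswith("_"):
            continue  # skip private and dunder methods
        hints = get_type_hints(func)
        output_type = hints.pop("return", Any)
        endpoints[name] = Endpoint(
            name=name,
            route=f"/{name}",  # assumed default route, for illustration
            input_types=hints,
            output_type=output_type,
            doc=inspect.getdoc(func) or "",
        )
    return endpoints


class Summarizer:
    def summarize(self, text: str, max_words: int = 50) -> str:
        """Return a shortened version of the text."""
        return " ".join(text.split()[:max_words])
```

Calling collect_endpoints(Summarizer) yields a single record whose input and output types come straight from the method signature, which is the same information a client needs to serialize a request.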
IO Types via IODescriptor
The IODescriptor abstraction (referenced as from _bentoml_sdk import IODescriptor in src/_bentoml_impl/client/http.py:1-80 and instantiated from Pydantic models in src/_bentoml_impl/client/base.py:1-80) is responsible for serializing and validating request/response payloads. Two helper attributes matter most:
- input_spec.model_json_schema() and output_spec.model_json_schema() produce the JSON Schema that the HTTP layer exposes for clients that want to validate their own payloads.
- is_stream and is_task flag whether the endpoint returns a streamed response or an asynchronous task handle.
A known bug surfaces when the return annotation is an *unparameterized* iterator such as t.Iterator or t.AsyncIterator: IODescriptor.from_output() then crashes with an IndexError (see issue #5625). Since v1.4, bentoml.io is also deprecated in favor of these new-style IO descriptors (see issue #5365 and release v1.4.39).
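The IndexError is easy to reproduce with the standard typing module alone. The sketch below assumes the descriptor reads the element type by indexing the annotation's type arguments, which the issue report suggests but this page does not confirm; first_type_arg and safe_first_type_arg are hypothetical helpers, not BentoML functions.

```python
import typing as t


def first_type_arg(annotation: t.Any) -> t.Any:
    """Return the first type parameter of a generic annotation.

    Raises IndexError when the annotation carries no parameters.
    """
    return t.get_args(annotation)[0]


def safe_first_type_arg(annotation: t.Any, default: t.Any = t.Any) -> t.Any:
    """Same lookup, but fall back to a default for bare annotations."""
    args = t.get_args(annotation)
    return args[0] if args else default
```

t.get_args returns an empty tuple for a bare t.Iterator, so the unguarded version fails while t.Iterator[int] works. Parameterizing the annotation on the service method avoids the problem from the caller's side.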
HTTP Protocol, Client Transports, and Reverse Proxy
BentoML ships two HTTP transports. The first, HTTPClient in src/_bentoml_impl/client/http.py:1-80, is built on httpx.Client / httpx.AsyncClient. The second, ProxyClient in src/_bentoml_impl/client/proxy2.py:1-220, is built on aiohttp and is used by RemoteProxy for inter-service calls inside a Bento deployment. Both implement the same AbstractClient contract and surface the same ClientEndpoint registry.
flowchart LR
Client[Python SDK caller] --> HC[SyncHTTPClient / AsyncHTTPClient]
HC -->|httpx| HTTP[BentoML HTTP server]
Client --> RP[RemoteProxy]
RP -->|aiohttp| HTTP
HTTP --> Circus[circus Server]
Circus -->|UDS / TCP| Worker[Bento worker process]
Worker --> Models[(ModelStore / HuggingFaceModel)]
Source: src/_bentoml_impl/server/serving.py:1-80 and src/_bentoml_impl/client/http.py:1-80. The serving layer uses circus to spawn one or more worker processes, connecting them via Unix Domain Sockets on POSIX or TCP sockets on Windows/WSL (the choice is gated by BENTOML_NO_UDS in src/_bentoml_impl/server/serving.py:1-80). Models themselves are looked up through the abstract Model class — including concrete subclasses such as HuggingFaceModel (src/_bentoml_sdk/models/base.py:1-80 and src/_bentoml_sdk/models/huggingface.py:1-80) — which resolves to entries in the on-disk ModelStore (src/bentoml/_internal/models/model.py:1-160).
For services that wrap a third-party HTTP server (e.g., a Triton or vLLM command), create_proxy_app (src/_bentoml_impl/server/proxy.py:1-80) exposes a Starlette app that forwards all requests — including a configurable /health endpoint — to the spawned child process.
Long-Running Calls, Streaming, and Community-Requested Protocols
Endpoints flagged with is_task=True return a Task or AsyncTask handle instead of the final payload, allowing the client to poll, cancel, or retry the operation. The relevant methods on the handle are get_status(), cancel(), get(), and retry() (src/_bentoml_impl/client/task.py:1-80). Endpoints flagged with stream_output=True are surfaced through the same HTTP transport but use a streaming response (used internally for SSE-style outputs; see issue #3743).
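A client typically drives a task handle with a polling loop. The sketch below uses an in-memory fake so it runs without a server; FakeTask, wait_for_result, and the status strings "in_progress" and "success" are assumptions for illustration, and only the method names get_status() and get() come from the text above.

```python
import time


class FakeTask:
    """Hypothetical in-memory task; real handles talk to the server over HTTP."""

    def __init__(self, polls_until_done: int, result: str) -> None:
        self._remaining = polls_until_done
        self._result = result

    def get_status(self) -> str:
        # Status strings are placeholders, not BentoML's actual values.
        if self._remaining > 0:
            self._remaining -= 1
            return "in_progress"
        return "success"

    def get(self) -> str:
        return self._result


def wait_for_result(task, poll_interval: float = 0.0, max_polls: int = 10):
    """Poll get_status() until the task reports success, then fetch the result."""
    for _ in range(max_polls):
        if task.get_status() == "success":
            return task.get()
        time.sleep(poll_interval)
    raise TimeoutError("task did not finish within the polling budget")
```

A production loop would also handle failure and cancellation states and would use a non-zero poll interval; those details are omitted here.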
| Protocol | Status in v1.4 | Transport file | Notes |
|---|---|---|---|
| HTTP/JSON | Supported | src/_bentoml_impl/client/http.py | Default media type is application/json |
| HTTP/pickle | Supported | src/_bentoml_impl/client/proxy.py | Used by RemoteProxy (application/vnd.bentoml+pickle) |
| Async tasks | Supported | src/_bentoml_impl/client/task.py | Returns Task / AsyncTask handles |
| Streaming (SSE-like) | Supported via is_stream | src/_bentoml_impl/client/base.py | Requested as first-class SSE in issue #3743 |
| gRPC | Not implemented | n/a | Requested in issue #703 (9 comments) |
Common Failure Modes
- Deprecated bentoml.io: emits BentoMLDeprecationWarning on import; migrate to the new IO descriptors shipped in v1.4 (issue #5365).
- Bare iterator return type: IODescriptor.from_output() raises IndexError on t.Iterator without parameters (issue #5625).
- Symlink file copy: resolved in v1.4.39 by preventing symlink traversal in BentoStore (release v1.4.39).
- Closed aiohttp session: fixed in v1.4.35 by recreating the connector on session refresh (release v1.4.35).
- Missing pylock.toml support: tracked in issue #5466 as a missing dependency-locking feature.
See Also
- Model Loading and Model Store
- Workers and Model Parallelization
- Service Configuration and Dependencies
- Async Tasks and Streaming Endpoints
- Community: issue #703 gRPC support, issue #3743 SSE support, issue #5625 IODescriptor IndexError
Model Store, Bento Build, and Framework Integrations
Related topics: BentoML Overview and Getting Started, Service Definition, IO Types, and API Protocols (HTTP, gRPC, SSE), Deployment, Containerization, BentoCloud, and Operations
BentoML is a Python framework for building online serving systems for AI applications and model inference. The Model Store, the Bento build pipeline, and the framework integrations together form the "packaging" half of the project: they decide *what* artifact travels with a service, *how* that artifact is assembled, and *how* it is turned into a container that runs anywhere. This page documents the moving parts behind those decisions, as they exist in the source tree.
1. Model Store: Persistent On-Disk Model Management
The Model Store is the on-disk location where BentoML keeps versioned, tagged model artifacts. Every saved model is represented as a directory containing a model.yaml manifest that captures metadata, signatures, and the captured ModelContext (Python version, framework versions, etc.).
1.1 The `model.yaml` Manifest
The manifest loader strips forward-incompatible fields and normalizes the representation before handing it to a cattr structure hook. Key behaviors defined in the loader:
- The top-level name field is promoted into a Tag, while the explicit tag field is *ignored* on save.
- For backwards compatibility with Bentos created before 1.0.0rc1, the loader deletes version, bentoml_version, and the legacy context.pip_dependencies (rewriting it into framework_versions).
- A missing signatures section is silently defaulted to an empty mapping.
- An unexpected field raises a BentoMLException through cattr's TypeError handling.
Source: src/bentoml/_internal/models/model.py
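The legacy-field handling listed above can be sketched as a pure function over an already-parsed mapping. This is not the code in ModelInfo.from_yaml: normalize_manifest is a hypothetical name, and replacing pip_dependencies with an empty framework_versions mapping is an assumption about what the rewrite produces.

```python
import copy


def normalize_manifest(raw: dict) -> dict:
    """Apply the loader behaviors listed above to a parsed model.yaml mapping."""
    data = copy.deepcopy(raw)  # leave the caller's mapping untouched

    # Fields written by pre-1.0 releases are dropped.
    data.pop("version", None)
    data.pop("bentoml_version", None)

    # Legacy pip_dependencies is replaced by framework_versions.
    # Assumption: the replacement starts out as an empty mapping.
    context = data.get("context", {})
    if "pip_dependencies" in context:
        del context["pip_dependencies"]
        context.setdefault("framework_versions", {})

    # A missing signatures section defaults to an empty mapping.
    data.setdefault("signatures", {})
    return data
```

Tag promotion and the cattrs structuring step are left out; the sketch covers only the dictionary clean-up that happens before structuring.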
1.2 Internal vs. SDK Model Types
There are two parallel "model" types in the codebase:
| Layer | Type | Role |
|---|---|---|
| Internal | bentoml._internal.models.Model (a StoredModel) | The persisted, filesystem-backed record in the Model Store. |
| SDK | _bentoml_sdk.models.base.Model[T] | An abstract, generic, framework-agnostic reference a service author declares. |
The SDK base class derives from abc.ABC and t.Generic[T]. It requires subclasses to implement to_info, from_info, to_create_schema, and resolve. The descriptor protocol (__get__) lazily resolves the model on first attribute access and caches the resolved value in a name-mangled _Model__resolved slot. This is what makes a class-level attribute like model = HuggingFaceModel("org/name") behave like an already-resolved runtime object inside a service.
Source: src/_bentoml_sdk/models/base.py
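The lazy-resolution behavior relies on Python's descriptor protocol and name mangling, both of which can be shown in a few lines. LazyModel and MyService below are hypothetical; the resolve step is a placeholder string rather than a real model load.

```python
class LazyModel:
    """Hypothetical model reference that resolves itself on first access."""

    def __init__(self, model_id: str) -> None:
        self.model_id = model_id
        self.__resolved = None  # stored as _LazyModel__resolved via name mangling
        self.resolve_calls = 0

    def resolve(self) -> str:
        # Placeholder for an expensive download or load step.
        self.resolve_calls += 1
        return f"loaded:{self.model_id}"

    def __get__(self, instance, owner):
        if instance is None:
            return self  # class-level access returns the reference itself
        if self.__resolved is None:
            self.__resolved = self.resolve()
        return self.__resolved


class MyService:
    model = LazyModel("org/name")
```

Reading MyService.model returns the reference object, while reading the attribute on an instance triggers resolve() once and reuses the cached value afterwards. Note that in this sketch the cache lives on the descriptor, so all instances of the class share one resolved value.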
A concrete specialization is HuggingFaceModel, an attrs-frozen, hashable reference. Its to_create_schema produces a BentoCloud-compatible CreateModelSchema carrying the model ID, revision, endpoint (defaulting to the HF_ENDPOINT env var), include/exclude patterns, and a ModelManifestSchema with the BentoML version, size, and context. The model URL is https://huggingface.co/{model_id} by default (DEFAULT_HF_ENDPOINT), and a change in v1.4.37 ensures this reference is correctly materialized inside the container (see release notes).
Source: src/_bentoml_sdk/models/huggingface.py
2. The Bento Build Pipeline
A "Bento" is the standardized deployable artifact. Building one is what bridges the developer's Python code, the SDK-level model references, and the container frontend.
flowchart LR
A[Service module<br/>+ Model references] --> B[bentoml build]
B --> C[Model Store lookup<br/>+ on-disk copy]
C --> D[BentoInfo / BentoInfoV2]
D --> E[bentoml containerize]
E --> F[Dockerfile frontend]
F --> G[OCI image]
2.1 What a `bentoml build` Produces
The build command collects services, runners, and models, then writes a Bento object. Analytics, which is opt-out and controlled by the BENTOML_DO_NOT_TRACK env var, emits a BentoBuildEvent summarizing creation timestamp, total size, model size, runner count, and model module names. The schema distinguishes between BentoInfo (v1) and BentoInfoV2 (v1.2+) when counting runners, which keeps the metric stable across schema migrations.
Source: src/bentoml/_internal/utils/analytics/cli_events.py
2.2 Container Frontend and Supported Runtimes
The Dockerfile frontend is the typed contract that enumerates what the build can target. It exports a strict list of supported runtimes, which guarantees the generated image is reproducible.
| Runtime | Supported versions |
|---|---|
| Python | 3.9, 3.10, 3.11, 3.12, 3.13, 3.14 |
| CUDA | 12.8.1, 12.8.0, 12.6.x, 12.1.x, 12.0.x, 11.8.0, 11.7.1, 11.6.2, 11.4.3, 11.2.2 |
User-supplied CUDA versions are normalized through ALLOWED_CUDA_VERSION_ARGS (e.g. "12" → "12.8.1"). Release v1.4.38 corrected the NVIDIA CUDA base images for Debian, which previously caused subtle runtime mismatches.
Source: src/bentoml/_internal/container/frontend/dockerfile/__init__.py
2.3 The `buildx` Shim and Deprecation
bentoml/_internal/utils/buildx.py is intentionally a thin shim for bentoctl and is not for direct use. It re-exports build and health, but every call emits a DeprecationWarning directing users to bentoml.container.build and bentoml.container.health. The shim also normalizes legacy keyword names (tags becomes tag, and subprocess_env is dropped) so older callers keep working while the internal API moves to the new container module.
Source: src/bentoml/_internal/utils/buildx.py
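The shim pattern described above (warn, translate keywords, forward) looks roughly like the following. Both functions are hypothetical stand-ins: new_build simply echoes its arguments, and legacy_build is not the actual buildx.build signature.

```python
import warnings


def new_build(tag=None, **kwargs):
    """Hypothetical stand-in for the new container build entry point."""
    return {"tag": tag, **kwargs}


def legacy_build(**kwargs):
    """Forward a legacy call to new_build, translating old keyword names."""
    warnings.warn(
        "legacy_build is deprecated; call new_build instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    # Rename the legacy plural keyword to the new singular one.
    if "tags" in kwargs:
        kwargs["tag"] = kwargs.pop("tags")
    # Drop a keyword the new API no longer accepts.
    kwargs.pop("subprocess_env", None)
    return new_build(**kwargs)
```

Using stacklevel=2 attributes the warning to the caller's line rather than to the shim, which makes it easier for downstream users to find the call they need to migrate.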
3. Framework Integrations and the Serving Side
Framework integrations are not a single class — they are realized through the SDK's Model subclasses, the container frontend's options, and the runtime resource allocator.
3.1 GPU / CPU Resource Allocation
ResourceAllocator is the runtime component that decides which GPU indices are assigned to which worker. It uses the system-detected nvidia.com/gpu count, tracks remaining_gpus as it hands them out, and supports fractional allocation (less than one whole GPU). Two environment variables alter its behavior:
- BENTOML_DISABLE_GPU_ALLOCATION: disables automatic allocation entirely.
- CUDA_VISIBLE_DEVICES: also disables automatic allocation, leaving it to the user.
If the request exceeds remaining capacity, a ResourceWarning is emitted, and the counter is clamped to zero rather than going negative.
Source: src/_bentoml_impl/server/allocator.py
3.2 Client-Side Integration Surface
Once a Bento is deployed, the HTTP client is the primary way framework users invoke the service. HTTPClient (in _bentoml_impl/client/http.py) is a generic httpx-backed client that introspects the service's IODescriptor endpoints — captured as ClientEndpoint records with input_spec, output_spec, stream_output, and is_task flags. This is what enables strongly-typed client stubs generated from a service's Python type hints, regardless of the framework the underlying model uses.
Source: src/_bentoml_impl/client/http.py
3.3 Project-Level Commands
The README.md documents the canonical developer workflow that ties Model Store, build, and containerization together:
bentoml build # assemble Bento
bentoml containerize <tag> # produce OCI image
docker run --rm -p 3000:3000 <tag>
bentoml cloud login # optional: BentoCloud
bentoml deploy # optional: deploy
Release v1.4.39 hardened the Model Store by preventing symlink traversal during file copies, which is relevant for any framework integration that stores weights in symlinked cache directories.
Source: README.md
4. Common Failure Modes and Migration Notes
Three recurring pain points are visible in the community and the release notes:
- Old IO imports. from bentoml.io import JSON emits BentoMLDeprecationWarning since v1.4. The fix is to migrate to the new SDK IO types in _bentoml_sdk. Issue #5365 reports confusion from PyTorch tutorials still using the old path.
- Per-dependency resolution. Issue #5466 requests pylock.toml support so dependency resolution can be configured per-package; the current options cannot express that granularity.
- Iterator return annotations. Issue #5625 shows IODescriptor.from_output() raising IndexError on bare (unparameterized) iterator annotations like t.Iterator or t.Generator. Parameterize them (t.Iterator[int]) to work around.
Each of these is a boundary case where the Model Store / build / framework contract is in flux, and they are the most likely sources of integration friction in the current release line.
See Also
- BentoML Documentation — Model Store
- BentoML Documentation — Hello World
- BentoML Releases
- Related wiki page: *Services, Runners, and IO Descriptors* (descriptor protocol for Model[T])
- Related wiki page: *Container Frontends and Docker Build* (Dockerfile options, buildx migration)
Deployment, Containerization, BentoCloud, and Operations
Related topics: Service Definition, IO Types, and API Protocols (HTTP, gRPC, SSE), Model Store, Bento Build, and Framework Integrations
BentoML provides an end-to-end path from a Python inference script to a reproducible deployable artifact. This page documents the deployment surface area: how a service is packaged into a Bento, how that Bento is turned into a Docker image, how resources (especially GPUs) are allocated at runtime, how the HTTP client talks to a running deployment, and the operational tooling (dev-mode reload, telemetry, dotenv parsing) that surrounds these flows. Sources throughout this page are taken directly from the repository's main branch as of release v1.4.39 (v1.4.39 release notes).
1. The Bento → Container → Cloud Pipeline
The deployment story is described at a high level in the project README:
bentoml build to package necessary code, models, dependency configs into a Bento — the standardized deployable artifact in BentoML.
Generate a Docker container image for deployment: bentoml containerize summarization:latest.
Run the generated image: docker run --rm -p 3000:3000 summarization:latest.
Deploy from current directory: bentoml deploy.
Source: README.md
The pipeline is intentionally the same regardless of whether the target is a local Docker daemon or BentoCloud: produce a deterministic, versioned Bento first, then turn it into a container image, then schedule it. The BentoInfo / BentoInfoV2 schemas (read in src/bentoml/_internal/utils/analytics/cli_events.py) distinguish the legacy V1 format (with a runners list) from the V2 service-oriented format (with a services map), and the analytics emitter counts num_of_runners accordingly.
flowchart LR
A[Service script<br/>bentoml.Service] --> B[bentoml build<br/>produces a Bento]
B --> C[bentoml containerize<br/>or bentoml.container.build]
C --> D[Docker daemon<br/>docker run -p 3000:3000]
C --> E[BentoCloud<br/>bentoml deploy]
D --> F[HTTP client<br/>HTTPClient]
E --> F
The internal container subsystem exposes a backend abstraction (get_backend("buildx")) so the legacy bentoml._internal.utils.buildx module can remain a thin compatibility shim for bentoctl:
This module is shim for bentoctl. NOT FOR DIRECT USE. Make sure to use bentoml.container.build and bentoml.container.health instead.
Source: src/bentoml/_internal/utils/buildx.py
The shim re-keys tags → tag and pops subprocess_env before delegating to the registered backend, making the new container API the only forward-compatible entry point.
2. Local Containerization, Dev Mode, and Environment
Two operational conveniences sit around the build/run pipeline: dotenv-based configuration and dev-mode file watching.
dotenv.py parses KEY=VALUE and export KEY=VALUE lines (and KEY: VALUE separators) so a .env file can supply secrets, registry credentials, and per-deployment toggles without code changes. The parser tolerates whitespace and the optional export prefix, which matches the conventions of django-dotenv, the upstream project this module was ported from.
Source: src/bentoml/_internal/utils/dotenv.py
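A reduced version of such a parser fits in a few lines. The sketch below is not the module's code: parse_dotenv is a hypothetical name, it handles only the KEY=VALUE and export KEY=VALUE forms, and the comment skipping and quote stripping are assumptions about typical .env conventions rather than documented behavior.

```python
import re

# Optional "export " prefix, a key, "=", then the rest of the line as the value.
_LINE = re.compile(r"^\s*(?:export\s+)?([A-Za-z_][A-Za-z0-9_]*)\s*=\s*(.*?)\s*$")


def parse_dotenv(text: str) -> dict:
    """Parse KEY=VALUE and export KEY=VALUE lines into a dictionary."""
    values = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # ignore blank lines and comments
        match = _LINE.match(line)
        if match is None:
            continue  # skip lines this simplified parser does not understand
        key, value = match.groups()
        # Remove one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        values[key] = value
    return values
```

The result is a plain dictionary; whether existing environment variables take precedence over file values is a policy decision left to the caller.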
In development, watchfilesplugin.py wraps the watchfiles change stream into a Circus plugin that restarts workers when any file matched by BentoBuildConfig.include changes. The plugin evaluates BentoPathSpec(build_config.include, build_config.exclude, working_dir) once and filters each event against that spec before issuing restart name="*" to the supervisor. This is what powers the bentoml serve --reload semantics.
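The include/exclude filtering step can be illustrated with the standard library. This sketch swaps in fnmatch glob patterns as a stand-in for BentoPathSpec, whose matching rules may differ; matching_changes and should_restart are hypothetical names.

```python
from fnmatch import fnmatch
from typing import Iterable, List


def matching_changes(
    changed: Iterable[str], include: List[str], exclude: List[str]
) -> List[str]:
    """Keep changed paths that match an include pattern and no exclude pattern."""
    kept = []
    for path in changed:
        included = any(fnmatch(path, pattern) for pattern in include)
        excluded = any(fnmatch(path, pattern) for pattern in exclude)
        if included and not excluded:
            kept.append(path)
    return kept


def should_restart(changed, include, exclude) -> bool:
    """A restart is warranted only when at least one relevant file changed."""
    return bool(matching_changes(changed, include, exclude))
```

Filtering before restarting keeps edits to unrelated files, such as notes or test fixtures, from bouncing the workers on every save.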
3. Resource Allocation at Runtime
Once a Bento is running, the API server needs to know which GPU(s) it may bind to. The ResourceAllocator in allocator.py drives this:
- It inspects system_resources()["nvidia.com/gpu"] to learn how many NVIDIA devices are visible.
- Each GPU is modeled as (remaining_fraction, unit) so fractional allocation (e.g. 0.5 of a GPU per runner) is supported.
- gpu_allocation_disabled() returns True if either BENTOML_DISABLE_GPU_ALLOCATION or CUDA_VISIBLE_DEVICES is set in the environment; in that mode the allocator yields control back to the user.
- If a request exceeds the remaining budget, a ResourceWarning is emitted telling the operator to set BENTOML_DISABLE_GPU_ALLOCATION=1 and allocate GPUs manually.
Source: src/_bentoml_impl/server/allocator.py
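The bookkeeping in the list above can be sketched with a small pool class. This is a simplified illustration, not the real ResourceAllocator: FractionalGpuPool and its assign method are hypothetical, the environment is passed in explicitly, and returning an empty list after the warning is a choice made for this sketch (the real allocator's behavior after warning may differ).

```python
import warnings
from typing import Dict, List, Mapping


def gpu_allocation_disabled(environ: Mapping[str, str]) -> bool:
    """Automatic allocation is off when either variable is present."""
    return (
        "BENTOML_DISABLE_GPU_ALLOCATION" in environ
        or "CUDA_VISIBLE_DEVICES" in environ
    )


class FractionalGpuPool:
    """Hypothetical, simplified pool; not the real ResourceAllocator."""

    def __init__(self, gpu_count: int) -> None:
        # Remaining share of each device, 1.0 meaning completely free.
        self.remaining: Dict[int, float] = {i: 1.0 for i in range(gpu_count)}

    def assign(self, amount: float) -> List[int]:
        """Reserve a fraction of one GPU or a whole number of free GPUs."""
        if amount < 1:
            # Fractional request: first device with enough free share.
            for index, free in self.remaining.items():
                if free + 1e-9 >= amount:
                    self.remaining[index] = max(free - amount, 0.0)
                    return [index]
        else:
            # Whole-device request: take completely free devices only.
            free_devices = [i for i, f in self.remaining.items() if f >= 1.0 - 1e-9]
            wanted = int(amount)
            if len(free_devices) >= wanted:
                chosen = free_devices[:wanted]
                for index in chosen:
                    self.remaining[index] = 0.0
                return chosen
        # Warn and continue, as the text above describes.
        warnings.warn("requested more GPUs than available", ResourceWarning)
        return []
```

With two devices, two half-GPU requests land on device 0 and a subsequent whole-GPU request takes device 1; any further request triggers the warning.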
This design lets a single Bento serve multiple models that share one or more GPUs without an external scheduler, while still permitting the well-known CUDA_VISIBLE_DEVICES escape hatch for fine-grained control.
4. HTTP Client, Telemetry, and BentoCloud Operations
The HTTPClient in src/_bentoml_impl/client/http.py is the primary way to invoke a deployed Bento. It is parameterized over both httpx.Client (sync) and httpx.AsyncClient (async), implements automatic retries up to MAX_RETRIES = 3, and maps HTTP status codes to typed BentoMLException subclasses via map_exception (defined in base.py). Each ClientEndpoint carries input_spec / output_spec IODescriptor classes so the SDK can serialize Python types correctly.
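A bounded retry loop of the kind mentioned above can be sketched as follows. Only the constant MAX_RETRIES = 3 comes from the text; TransientError, call_with_retries, and the interpretation "one initial attempt plus up to three retries" are assumptions, and the sketch does not say which failures the real client treats as retryable.

```python
MAX_RETRIES = 3


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying."""


def call_with_retries(send, max_retries: int = MAX_RETRIES):
    """Call send(); on TransientError, retry up to max_retries more times."""
    last_error = None
    for _attempt in range(max_retries + 1):
        try:
            return send()
        except TransientError as exc:
            last_error = exc  # remember the failure and try again
    raise last_error
```

Non-transient exceptions propagate immediately, so validation errors are not retried. A real client would usually add a delay or backoff between attempts.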
Operational telemetry is emitted by cli_events.py under the cli_events_map["bentos"]["build"] key. A BentoBuildEvent records bento_creation_timestamp, bento_size_in_kb, model_size_in_kb, num_of_models, num_of_runners, and model_types, giving the BentoML team aggregate insight into how the build pipeline is used. The model metadata itself is round-tripped through bentoml_cattr, a cattrs Converter with omit_if_default=True and a custom datetime ISO-8601 hook so that BentoInfo / ModelInfo instances can be serialized losslessly.
For cloud deployments, the README points operators at the bentoml cloud login / bentoml deploy workflow and the BentoCloud web UI. Because deployments are built from the same Bento artifact produced locally, the same Model Store entry — including HuggingFaceModel references from src/_bentoml_sdk/models/huggingface.py which resolve model_id + revision against the Hugging Face Hub — can be shipped unchanged between a local Docker run and BentoCloud.
5. Common Failure Modes and Community Notes
Several issues documented in the community affect the deployment surface:
- Container security: v1.4.34 fixed a security issue when resolving user-supplied file paths (v1.4.34 release).
- Symlink traversal in BentoStore: v1.4.39 added a guard so copy_model no longer follows symlinks out of the store (v1.4.39 release notes).
- Multi-arch BuildKit caching: v1.4.39 switched cache mounts to sharing=locked to avoid corruption on parallel multi-arch builds (v1.4.39 release notes).
- HuggingFace loading in containers: v1.4.37 fixed bentoml.models.HuggingFaceModel resolution when running inside the generated image (v1.4.37 release notes).
- CPU worker floor: v1.4.32 ensures at least one CPU worker is scheduled even when resources.cpu == 0 (v1.4.32 release notes).
- Bundling arbitrary local files: a long-standing feature request (#685) asks for first-class inclusion of non-Python assets beyond build_config.include / build_config.exclude.
- Server-Sent Events: #3743 requests streaming SSE responses from the API server, which would benefit LLM chat and live-monitoring use cases.
- gRPC transport: #703 requests native gRPC support for the API server, complementing the existing HTTPClient.
See Also
- Hello World example: end-to-end bentoml build → containerize → run walkthrough.
- Model loading and Model Store: how the Model Store feeds the Bento artifact.
- Distributed serving systems: multi-runner composition relevant to ResourceAllocator.
- BentoCloud deployment: managed deployment counterpart to local Docker runs.
- Contributing Guide and Development Guide for working on the deployment subsystems themselves.
Source: https://github.com/bentoml/BentoML / Human Manual
Doramagic Pitfall Log
Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.
Found 9 structured pitfall item(s), including 2 high/blocking item(s). Top priority: Installation risk - Installation risk requires verification.
1. Installation risk: Installation risk requires verification
- Severity: high
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/bentoml/BentoML/issues/5466
2. Configuration risk: Configuration risk requires verification
- Severity: high
- Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/bentoml/BentoML/issues/5365
3. Capability evidence risk: Capability evidence risk requires verification
- Severity: medium
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: capability.assumptions | github_repo:178976529 | https://github.com/bentoml/BentoML
4. Runtime risk: Runtime risk requires verification
- Severity: medium
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/bentoml/BentoML/issues/5625
5. Maintenance risk: Maintenance risk requires verification
- Severity: medium
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | github_repo:178976529 | https://github.com/bentoml/BentoML
6. Security or permission risk: Security or permission risk requires verification
- Severity: medium
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: downstream_validation.risk_items | github_repo:178976529 | https://github.com/bentoml/BentoML
7. Security or permission risk: Security or permission risk requires verification
- Severity: medium
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: risks.scoring_risks | github_repo:178976529 | https://github.com/bentoml/BentoML
8. Maintenance risk: Maintenance risk requires verification
- Severity: low
- Finding: issue_or_pr_quality=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | github_repo:178976529 | https://github.com/bentoml/BentoML
9. Maintenance risk: Maintenance risk requires verification
- Severity: low
- Finding: release_recency=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | github_repo:178976529 | https://github.com/bentoml/BentoML
Source: Doramagic discovery, validation, and Project Pack records
Community Discussion Evidence
These external discussion links are review inputs, not standalone proof that the project is production-ready.
Open the linked issues or discussions before treating the pack as ready for your environment.
Doramagic exposes project-level community discussion separately from official documentation. Review these links before using BentoML with real data or production workflows.
- Example: Serve FunASR/SenseVoice as speech recognition API - github / github_issue
- bug: Bentoml Pytorch model serve bug - github / github_issue
- feature: support for pylock.toml - github / github_issue
- BUG: IndexError in IODescriptor.from_output() with bare (unparameterized - github / github_issue
- v1.4.39 - github / github_release
- v1.4.38 - github / github_release
- v1.4.37 - github / github_release
- v1.4.36 - github / github_release
- v1.4.35 - github / github_release
- v1.4.34 - github / github_release
- v1.4.33 - github / github_release
- v1.4.32 - github / github_release
Source: Project Pack community evidence and pitfall evidence