Doramagic Project Pack · Human Manual

BentoML

The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!

BentoML Overview and Getting Started

Related topics: Service Definition, IO Types, and API Protocols (HTTP, gRPC, SSE), Model Store, Bento Build, and Framework Integrations, Deployment, Containerization, BentoCloud, and Operations

Purpose and Scope

BentoML is a Python library for building online serving systems optimized for AI applications and model inference. The framework is designed around three core capabilities: turning model inference scripts into REST API servers using standard Python type hints, packaging everything into reproducible Docker container images, and maximizing compute utilization through features like dynamic batching, model parallelism, multi-stage pipelines, and multi-model inference-graph orchestration. Source: README.md

A BentoML project lifecycle typically follows these stages: develop a Service locally, build it into a Bento (a standardized deployable artifact), containerize it as a Docker image, and then deploy it to BentoCloud or custom infrastructure. The CLI commands bentoml serve, bentoml build, and bentoml containerize map directly to the first three stages. Source: README.md

The project is licensed under Apache 2.0, collects anonymous usage analytics (opt-out via BENTOML_DO_NOT_TRACK=True or the --do-not-track CLI flag), and maintains its main documentation site separately from the repository. Source: README.md

High-Level Architecture

The runtime architecture is organized into several cooperating subsystems: a serving layer that manages services and resources, an SDK that defines Service and Model abstractions, a client layer for remote invocation, and internal utilities for analytics, containerization, file watching, and configuration.

flowchart TB
    User[Developer / Client] --> CLI[bentoml CLI]
    CLI --> Serve[bentoml serve]
    CLI --> Build[bentoml build]
    CLI --> Container[bentoml containerize]
    Serve --> Server[serving.py / Server]
    Server --> Allocator[ResourceAllocator]
    Build --> Bento[Bento Artifact]
    Container --> Docker[Docker Image]
    Server --> Models[Model Store]
    Models --> HF[HuggingFaceModel]
    Models --> Store[StoredModel]
    Docker --> Deploy[Deploy to BentoCloud / Cluster]
    Deploy --> Client[RemoteProxy / SyncHTTPClient / AsyncHTTPClient]

The serving subsystem is implemented in src/_bentoml_impl/server/serving.py, which exposes a Server wrapper around Circus processes and a _get_server_socket helper that negotiates between Unix Domain Sockets (UDS) and TCP based on the platform. On Windows, WSL, or when BENTOML_NO_UDS is set, the server falls back to localhost TCP. Source: src/_bentoml_impl/server/serving.py
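
The transport choice described above can be pictured as a small decision function. The following is a minimal sketch, not the code in serving.py: the name pick_transport, its parameters, and the rule that any non-empty BENTOML_NO_UDS value counts as "set" are assumptions made for illustration.

```python
from typing import Mapping


def pick_transport(system: str, is_wsl: bool, env: Mapping[str, str]) -> str:
    """Return "tcp" or "uds" following the fallback rules described in the text.

    Hypothetical helper: the real _get_server_socket creates sockets, while this
    sketch only models the decision.
    """
    if system == "Windows" or is_wsl:
        return "tcp"  # platforms where the text says UDS is not used
    if env.get("BENTOML_NO_UDS"):
        return "tcp"  # explicit opt-out (assumed: any non-empty value)
    return "uds"


if __name__ == "__main__":
    print(pick_transport("Linux", False, {}))
    print(pick_transport("Linux", False, {"BENTOML_NO_UDS": "1"}))
```

Keeping the decision separate from socket creation makes the platform rules easy to test without opening any sockets.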

The ResourceAllocator class in src/_bentoml_impl/server/allocator.py tracks GPU and CPU assignments. It queries system_resources() for the available NVIDIA devices, decrements a remaining-gpus counter on each assignment, and supports fractional allocation. Two environment variables influence behavior: BENTOML_DISABLE_GPU_ALLOCATION and CUDA_VISIBLE_DEVICES. Source: src/_bentoml_impl/server/allocator.py

Key Subsystems and Workflows

Services and IO Descriptors

Services are the unit of deployment. Each Service declares API endpoints via decorators; each endpoint consumes and produces data described by IO descriptors. The community has reported bugs related to descriptor resolution, for example an IndexError in IODescriptor.from_output() when methods return bare (unparameterized) iterator annotations such as t.Iterator or t.Generator. Source: Issue #5625

For type-hint validation similar to FastAPI's Pydantic integration, the SDK uses IODescriptor. A long-standing community request asks for richer Pydantic model schema support. Source: Issue #1480

Model Store and Model References

bentoml.models provides two complementary abstractions: a ModelStore for saved artifacts on disk and SDK Model reference types such as HuggingFaceModel. The SDK base class Model[T] is defined in src/_bentoml_sdk/models/base.py and declares abstract methods to_info() and from_info() for round-tripping between a model reference and a BentoModelInfo object used during build. Source: src/_bentoml_sdk/models/base.py

HuggingFaceModel (in src/_bentoml_sdk/models/huggingface.py) accepts a model_id, an optional revision, an endpoint (defaulting to HF_ENDPOINT or https://huggingface.co), and include/exclude file patterns. A recent fix in v1.4.37 ensures that HuggingFaceModel is correctly loaded inside a container. Source: v1.4.37 release notes

For model-info persistence, the on-disk YAML format is parsed in src/bentoml/_internal/models/model.py, where ModelInfo.from_yaml strips legacy fields and provides sensible defaults when a pre-1.0 model is loaded. Source: src/bentoml/_internal/models/model.py

Client and Remote Proxy

The client layer is structured around an abstract AbstractClient (in src/_bentoml_impl/client/base.py) that exposes endpoints as a dictionary of ClientEndpoint records. A map_exception helper translates HTTP responses into typed BentoMLException subclasses via an error_mapping table. Source: src/_bentoml_impl/client/base.py
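
The idea of a status-to-exception table can be sketched with plain Python. This is an illustrative stand-in, not the contents of base.py: the exception classes, the two table entries, and the function name map_status_to_exception are all hypothetical.

```python
class ServiceError(Exception):
    """Hypothetical base class standing in for BentoMLException."""

    def __init__(self, message: str, status: int) -> None:
        super().__init__(message)
        self.status = status


class NotFoundError(ServiceError):
    """Hypothetical error for HTTP 404 responses."""


class InvalidArgumentError(ServiceError):
    """Hypothetical error for HTTP 400 responses."""


# Assumed table contents; the real error_mapping may hold different entries.
ERROR_MAPPING = {
    400: InvalidArgumentError,
    404: NotFoundError,
}


def map_status_to_exception(status: int, body: str) -> ServiceError:
    """Pick the most specific exception type for a status code."""
    exc_type = ERROR_MAPPING.get(status, ServiceError)
    return exc_type(body, status)
```

A caller would typically raise the returned exception, so application code can catch a specific subclass instead of inspecting status codes.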

RemoteProxy (in src/_bentoml_impl/client/proxy.py) wraps a service URL and provides sync/async HTTP clients. The default timeout is derived from the service configuration plus a 1% margin, falling back to 60 seconds when no service object is passed. A v1.4.35 fix in PR #5541 ensures that the connector is recreated on session refresh to prevent closed-session errors. Source: v1.4.35 release notes
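
The timeout rule above amounts to a one-line calculation. The sketch below assumes the service timeout lives under a traffic.timeout key in the service configuration and that a missing key falls back to 60 seconds before the margin is applied; both details are assumptions, as is the helper name client_timeout.

```python
from typing import Any, Mapping, Optional

DEFAULT_TIMEOUT_SECONDS = 60.0  # fallback stated in the text
CLIENT_MARGIN = 1.01  # service timeout plus a 1% margin


def client_timeout(service_config: Optional[Mapping[str, Any]] = None) -> float:
    """Derive a client-side timeout from a service configuration mapping."""
    if service_config is None:
        # No service object was passed, so use the plain default.
        return DEFAULT_TIMEOUT_SECONDS
    traffic = service_config.get("traffic", {})  # assumed config layout
    service_timeout = float(traffic.get("timeout", DEFAULT_TIMEOUT_SECONDS))
    return service_timeout * CLIENT_MARGIN
```

The margin gives the server a chance to report its own timeout before the client gives up on the connection.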

Build, Container, and Configuration

The build path records analytics events; the handler in src/bentoml/_internal/utils/analytics/cli_events.py differentiates between BentoInfo and BentoInfoV2 to count runners correctly and reports the total Bento size, model size, and model types. Source: src/bentoml/_internal/utils/analytics/cli_events.py

The legacy buildx shim in src/bentoml/_internal/utils/buildx.py is preserved for bentoctl compatibility but emits a DeprecationWarning and forwards to the new bentoml.container.build and bentoml.container.health API. Multi-arch builds gained sharing=locked cache mounts in v1.4.39 to avoid corruption across parallel builds. Source: v1.4.39 release notes

For configuration, BentoML supports .env files via the dotenv module in src/bentoml/_internal/utils/dotenv.py, which parses KEY=VALUE and export KEY=VALUE syntax. Support for the pylock.toml standard for per-dependency pinning is not yet available; it remains an open community request. Source: Issue #5466

Development Workflow and Reloading

The Circus-based file watcher in src/bentoml/_internal/utils/circus/watchfilesplugin.py loads a BentoBuildConfig to derive include/exclude path specs, filters changes through those specs, and triggers a restart call on the Circus arbiter when matching files change. This is the mechanism that powers bentoml serve --reload. Source: src/bentoml/_internal/utils/circus/watchfilesplugin.py
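
The filter-then-restart step can be sketched as follows. The real plugin uses BentoPathSpec; this sketch swaps in the standard library's fnmatch as a simpler stand-in, so pattern semantics differ, and the function names are hypothetical.

```python
from fnmatch import fnmatch
from typing import Iterable, Sequence


def _matches(path: str, patterns: Sequence[str]) -> bool:
    """True if the path matches at least one glob pattern."""
    return any(fnmatch(path, pattern) for pattern in patterns)


def should_restart(
    changed: Iterable[str], include: Sequence[str], exclude: Sequence[str]
) -> bool:
    """Decide whether a batch of changed paths warrants a worker restart."""
    return any(
        _matches(path, include) and not _matches(path, exclude) for path in changed
    )


if __name__ == "__main__":
    print(should_restart(["service.py"], ["*.py"], ["tests/*"]))
```

Evaluating exclude after include means a file must be explicitly in scope before it can trigger a restart.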

The internal server README in src/bentoml/_internal/server/README.md demonstrates running a service both via bentoml serve and directly through uvicorn hello:app --reload, then exercising it with curl. Source: src/bentoml/_internal/server/README.md

Common Failure Modes and Limitations

  • Deprecated IO types: bentoml.io emits a BentoMLDeprecationWarning from v1.4 onward. Users should migrate to the new-style IO descriptors. Source: Issue #5365
  • Iterator return annotations: Bare (unparameterized) iterator types in service methods crash IODescriptor.from_output(). Source: Issue #5625
  • Missing dependency formats: pylock.toml is not yet supported as an alternative to requirements.txt. Source: Issue #5466
  • gRPC transport: Native gRPC support is not in core; long-running community request. Source: Issue #703
  • Server-Sent Events: SSE for streaming responses is not natively supported. Source: Issue #3743
  • SpaCy runner: Removed in v1.0; no built-in runner currently exists. Source: Issue #4134
  • Resource exhaustion: When more GPUs are requested than available, ResourceAllocator warns and continues, which may cause runtime failures downstream. Source: src/_bentoml_impl/server/allocator.py
  • Multi-bento model import: Importing multiple Bentos that share a model required a fix in v1.4.35. Source: v1.4.35 release notes
  • Container path resolution: A path-traversal hardening landed in v1.4.34. Source: v1.4.34 release notes

Quick Start Example

A minimal end-to-end loop, as documented in the project README, is:

# 1. Install BentoML
pip install bentoml

# 2. Develop a service (defined in service.py)
bentoml serve service.py:svc

# 3. Build a Bento (standardized deployable artifact)
bentoml build

# 4. Containerize (requires Docker daemon)
bentoml containerize summarization:latest

# 5. Run the image
docker run --rm -p 3000:3000 summarization:latest

# 6. Deploy to BentoCloud
bentoml cloud login
bentoml deploy

Source: README.md

Source: https://github.com/bentoml/BentoML / Human Manual

Service Definition, IO Types, and API Protocols (HTTP, gRPC, SSE)

Related topics: BentoML Overview and Getting Started, Model Store, Bento Build, and Framework Integrations

Overview

In BentoML, a *Service* is the top-level object that bundles one or more ML models behind strongly-typed HTTP APIs. Each service method maps to an HTTP route whose request/response payload is governed by an IODescriptor. The runtime exposes two cooperating transports: an HTTP/JSON (and HTTP/pickle) implementation built on httpx / aiohttp, and an in-process reverse-proxy that can forward to external commands. Both transports share the same ClientEndpoint model so that the same service class is reachable through the sync HTTPClient, the async AsyncHTTPClient, or the RemoteProxy. Long-running or streaming endpoints are surfaced through Task / AsyncTask handles. Native gRPC and Server-Sent Events (SSE) are not part of the current v1.4 protocol surface; they are tracked as community feature requests (see issue #703 and issue #3743).

Service Definition and Method Endpoints

A service is declared by decorating a class with @bentoml.service and marking its endpoint methods with @bentoml.api. Each API method is then represented as a ClientEndpoint whose route, input_spec, output_spec, doc, stream_output, and is_task fields are derived from the Python type annotations. When a Service object is passed into an HTTP client, the constructor iterates over service.apis.items() and registers one route per API, generating JSON-Schema-compatible input/output descriptors:

for name, method in service.apis.items():
    routes[name] = ClientEndpoint(
        name=name,
        route=method.route,
        input=method.input_spec.model_json_schema(),
        output=method.output_spec.model_json_schema(),
        doc=method.doc,
        input_spec=method.input_spec,
        output_spec=method.output_spec,
        stream_output=method.is_stream,
        is_task=method.is_task,
    )

Source: src/_bentoml_impl/client/proxy2.py:1-220. The same endpoint shape is reused on the server side, so the route table is symmetric between the producer (the service definition) and the consumer (the client). Service-level configuration such as the readyz endpoint is read from service.config.get("endpoints", {}).get("readyz", "/readyz") (src/_bentoml_impl/client/proxy2.py:1-220). The serve_http entry point exported by src/_bentoml_impl/server/__init__.py is the public API used to start a service.

IO Types via IODescriptor

The IODescriptor abstraction (referenced as from _bentoml_sdk import IODescriptor in src/_bentoml_impl/client/http.py:1-80 and instantiated from Pydantic models in src/_bentoml_impl/client/base.py:1-80) is responsible for serializing and validating request/response payloads. Two helper attributes matter most:

  • input_spec.model_json_schema() and output_spec.model_json_schema() produce the JSON-Schema that the HTTP layer exposes for clients that want to validate their own payloads.
  • is_stream and is_task flag whether the endpoint returns a streamed response or an asynchronous task handle.

A known regression surfaces when the return annotation is an *unparameterized* iterator such as t.Iterator or t.AsyncIterator: IODescriptor.from_output() then crashes with an IndexError (see issue #5625). The v1.4 line also deprecates bentoml.io in favor of the new-style IO descriptors (see issue #5365 and release v1.4.39).
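
The reason a bare iterator annotation can lead to an IndexError is visible with the standard typing module alone: an unparameterized alias has no type arguments, so indexing the first one fails. The helper below is a hypothetical illustration of a defensive lookup, not the body of IODescriptor.from_output().

```python
import typing as t


def first_type_arg(annotation: t.Any) -> t.Any:
    """Return the first type argument of an annotation, or t.Any if it has none."""
    args = t.get_args(annotation)
    # A bare annotation such as t.Iterator yields an empty tuple here,
    # so an unguarded args[0] would raise IndexError.
    return args[0] if args else t.Any


if __name__ == "__main__":
    print(t.get_args(t.Iterator))
    print(first_type_arg(t.Iterator[int]))
```

Until the issue is resolved upstream, parameterizing the annotation (for example t.Iterator[int]) avoids the empty-arguments case altogether.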

HTTP Protocol, Client Transports, and Reverse Proxy

BentoML ships two HTTP transports. The first, HTTPClient in src/_bentoml_impl/client/http.py:1-80, is built on httpx.Client / httpx.AsyncClient. The second, ProxyClient in src/_bentoml_impl/client/proxy2.py:1-220, is built on aiohttp and is used by RemoteProxy for inter-service calls inside a Bento deployment. Both implement the same AbstractClient contract and surface the same ClientEndpoint registry.

flowchart LR
  Client[Python SDK caller] --> HC[SyncHTTPClient / AsyncHTTPClient]
  HC -->|httpx| HTTP[BentoML HTTP server]
  Client --> RP[RemoteProxy]
  RP -->|aiohttp| HTTP
  HTTP --> Circus[circus Server]
  Circus -->|UDS / TCP| Worker[Bento worker process]
  Worker --> Models[(ModelStore / HuggingFaceModel)]

Source: src/_bentoml_impl/server/serving.py:1-80 and src/_bentoml_impl/client/http.py:1-80. The serving layer uses circus to spawn one or more worker processes, connecting them via Unix Domain Sockets on POSIX or TCP sockets on Windows/WSL (the choice is gated by BENTOML_NO_UDS in src/_bentoml_impl/server/serving.py:1-80). Models themselves are looked up through the abstract Model class — including concrete subclasses such as HuggingFaceModel (src/_bentoml_sdk/models/base.py:1-80 and src/_bentoml_sdk/models/huggingface.py:1-80) — which resolves to entries in the on-disk ModelStore (src/bentoml/_internal/models/model.py:1-160).

For services that wrap a third-party HTTP server (e.g., a Triton or vLLM command), create_proxy_app (src/_bentoml_impl/server/proxy.py:1-80) exposes a Starlette app that forwards all requests — including a configurable /health endpoint — to the spawned child process.

Long-Running Calls, Streaming, and Community-Requested Protocols

Endpoints flagged with is_task=True return a Task or AsyncTask handle instead of the final payload, allowing the client to poll, cancel, or retry the operation. The relevant methods on the handle are get_status(), cancel(), get(), and retry() (src/_bentoml_impl/client/task.py:1-80). Endpoints flagged with stream_output=True are surfaced through the same HTTP transport but use a streaming response (used internally for SSE-style outputs; see issue #3743).
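
A polling loop over such a handle might look like the sketch below. It runs against a FakeTask stand-in rather than a real Task, and it assumes get_status() returns a plain string with the values "in_progress", "success", "failure", and "cancelled"; the real handle's status type and values may differ.

```python
import time
from typing import Any, Iterable, List


class FakeTask:
    """Stand-in for a task handle; replays a scripted list of statuses."""

    def __init__(self, statuses: Iterable[str], result: Any) -> None:
        self._statuses: List[str] = list(statuses)
        self._result = result

    def get_status(self) -> str:
        # Report each queued status once, then keep repeating the last one.
        if len(self._statuses) > 1:
            return self._statuses.pop(0)
        return self._statuses[0]

    def get(self) -> Any:
        return self._result


def wait_for_result(task: Any, *, interval: float = 0.0, max_polls: int = 100) -> Any:
    """Poll get_status() until the task finishes, then return get()."""
    for _ in range(max_polls):
        status = task.get_status()
        if status == "success":  # assumed status value
            return task.get()
        if status in ("failure", "cancelled"):  # assumed status values
            raise RuntimeError(f"task ended with status {status!r}")
        time.sleep(interval)
    raise TimeoutError(f"task still pending after {max_polls} polls")
```

Bounding the loop with max_polls keeps a stuck task from blocking the caller indefinitely; a production client would more likely bound it by wall-clock time.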

| Protocol | Status in v1.4 | Transport file | Notes |
| --- | --- | --- | --- |
| HTTP/JSON | Supported | src/_bentoml_impl/client/http.py | Default media type is application/json |
| HTTP/pickle | Supported | src/_bentoml_impl/client/proxy.py | Used by RemoteProxy (application/vnd.bentoml+pickle) |
| Async tasks | Supported | src/_bentoml_impl/client/task.py | Returns Task / AsyncTask handles |
| Streaming (SSE-like) | Supported via is_stream | src/_bentoml_impl/client/base.py | Requested as first-class SSE in issue #3743 |
| gRPC | Not implemented | n/a | Requested in issue #703 (9 comments) |

Common Failure Modes

  • Deprecated bentoml.io: emits BentoMLDeprecationWarning on import; migrate to the new IO descriptors shipped in v1.4 (issue #5365).
  • Bare iterator return type: IODescriptor.from_output() raises IndexError on t.Iterator without parameters (issue #5625).
  • Symlink file copy: resolved in v1.4.39 by preventing symlink traversal in BentoStore (release v1.4.39).
  • Closed aiohttp session: fixed in v1.4.35 by recreating the connector on session refresh (release v1.4.35).
  • Missing pylock.toml support: tracked in issue #5466 as a missing dependency-locking feature.

Model Store, Bento Build, and Framework Integrations

Related topics: BentoML Overview and Getting Started, Service Definition, IO Types, and API Protocols (HTTP, gRPC, SSE), Deployment, Containerization, BentoCloud, and Operations

BentoML is a Python framework for building online serving systems for AI applications and model inference. The Model Store, the Bento build pipeline, and the framework integrations together form the "packaging" half of the project: they decide *what* artifact travels with a service, *how* that artifact is assembled, and *how* it is turned into a container that runs anywhere. This page documents the moving parts behind those decisions, as they exist in the source tree.

1. Model Store: Persistent On-Disk Model Management

The Model Store is the on-disk location where BentoML keeps versioned, tagged model artifacts. Every saved model is represented as a directory containing a model.yaml manifest that captures metadata, signatures, and the captured ModelContext (Python version, framework versions, etc.).

1.1 The `model.yaml` Manifest

The manifest loader strips forward-incompatible fields and normalizes the representation before handing it to a cattr structure hook. Key behaviors defined in the loader:

  • The top-level name field is promoted into a Tag, while the explicit tag field is *ignored* on save.
  • For backwards compatibility with Bentos created before 1.0.0rc1, the loader deletes version, bentoml_version, and the legacy context.pip_dependencies (rewriting it into framework_versions).
  • A missing signatures section is silently defaulted to an empty mapping.
  • An unexpected field raises a BentoMLException through cattr's TypeError handling.

Source: src/bentoml/_internal/models/model.py

1.2 Internal vs. SDK Model Types

There are two parallel "model" types in the codebase:

| Layer | Type | Role |
| --- | --- | --- |
| Internal | bentoml._internal.models.Model (a StoredModel) | The persisted, filesystem-backed record in the Model Store. |
| SDK | _bentoml_sdk.models.base.Model[T] | An abstract, generic, framework-agnostic reference a service author declares. |

The SDK base class derives from abc.ABC and t.Generic[T]. It requires subclasses to implement to_info, from_info, to_create_schema, and resolve. The descriptor protocol (__get__) lazily resolves the model on first attribute access and caches the resolved value in a name-mangled _Model__resolved slot. This is what lets a class-level attribute such as model = HuggingFaceModel("org/name") behave like an already-resolved runtime object when it is read inside a service.

Source: src/_bentoml_sdk/models/base.py
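
The lazy-resolution behavior relies on Python's descriptor protocol, which can be demonstrated in isolation. The sketch below is not the SDK's Model class: LazyModelRef, DemoService, the fake path returned by resolve(), and the resolve_calls counter are hypothetical names used only to show the resolve-once-then-cache pattern.

```python
from typing import Any, Optional


class LazyModelRef:
    """Hypothetical reference that resolves on first access and caches the result."""

    def __init__(self, model_id: str) -> None:
        self.model_id = model_id
        self.resolve_calls = 0  # instrumentation for the demo only
        self._resolved: Optional[str] = None

    def resolve(self) -> str:
        # Stand-in for an expensive lookup or download.
        self.resolve_calls += 1
        return f"/models/{self.model_id}"

    def __get__(self, instance: Any, owner: Optional[type] = None) -> Any:
        if instance is None:
            return self  # class-level access returns the reference itself
        if self._resolved is None:
            self._resolved = self.resolve()
        return self._resolved


class DemoService:
    model = LazyModelRef("org/name")


if __name__ == "__main__":
    service = DemoService()
    print(service.model)
    print(DemoService.model.resolve_calls)
```

Reading the attribute on the class returns the reference, while reading it on an instance returns the resolved value, which is why declaring the reference costs nothing until a service actually uses it.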

A concrete specialization is HuggingFaceModel, an attrs-frozen, hashable reference. Its to_create_schema produces a BentoCloud-compatible CreateModelSchema carrying the model ID, revision, endpoint (defaulting to the HF_ENDPOINT env var), include/exclude patterns, and a ModelManifestSchema with the BentoML version, size, and context. The model URL is https://huggingface.co/{model_id} by default (DEFAULT_HF_ENDPOINT), and a change in v1.4.37 ensures this reference is correctly materialized inside the container (see release notes).

Source: src/_bentoml_sdk/models/huggingface.py

2. The Bento Build Pipeline

A "Bento" is the standardized deployable artifact. Building one is what bridges the developer's Python code, the SDK-level model references, and the container frontend.

flowchart LR
    A[Service module<br/>+ Model references] --> B[bentoml build]
    B --> C[Model Store lookup<br/>+ on-disk copy]
    C --> D[BentoInfo / BentoInfoV2]
    D --> E[bentoml containerize]
    E --> F[Dockerfile frontend]
    F --> G[OCI image]

2.1 What `bentoml build` Produces

The build command collects services, runners, and models, then writes a Bento object. Analytics, which are on by default and can be turned off with the BENTOML_DO_NOT_TRACK env var, emit a BentoBuildEvent summarizing creation timestamp, total size, model size, runner count, and model module names. The schema distinguishes between BentoInfo (v1) and BentoInfoV2 (v1.2+) when counting runners, which keeps the metric stable across schema migrations.

Source: src/bentoml/_internal/utils/analytics/cli_events.py

2.2 Container Frontend and Supported Runtimes

The Dockerfile frontend is the typed contract that enumerates what the build can target. It exports a fixed list of supported runtimes, which helps keep generated images reproducible.

| Runtime | Supported versions |
| --- | --- |
| Python | 3.9, 3.10, 3.11, 3.12, 3.13, 3.14 |
| CUDA | 12.8.1, 12.8.0, 12.6.x, 12.1.x, 12.0.x, 11.8.0, 11.7.1, 11.6.2, 11.4.3, 11.2.2 |

User-supplied CUDA versions are normalized through ALLOWED_CUDA_VERSION_ARGS (e.g. "12" → "12.8.1"). Release v1.4.38 corrected the NVIDIA CUDA base images for Debian, which previously caused subtle runtime mismatches.

Source: src/bentoml/_internal/container/frontend/dockerfile/__init__.py
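
Version normalization of this kind is essentially an alias lookup followed by a membership check. The sketch below is an assumption-laden illustration: only the "12" alias comes from the text, the supported set lists just the fully specified versions named in the table (the ".x" entries are left out because their exact patch levels are not given), and normalize_cuda_version is a hypothetical name.

```python
# Fully specified CUDA versions named in the table; ".x" entries are omitted.
SUPPORTED_CUDA_VERSIONS = {
    "12.8.1", "12.8.0", "11.8.0", "11.7.1", "11.6.2", "11.4.3", "11.2.2",
}

# Only this alias is stated in the text; the real table may contain more.
CUDA_ALIASES = {"12": "12.8.1"}


def normalize_cuda_version(requested: str) -> str:
    """Expand a shorthand CUDA version and reject unknown ones."""
    version = CUDA_ALIASES.get(requested, requested)
    if version not in SUPPORTED_CUDA_VERSIONS:
        raise ValueError(f"unsupported CUDA version: {requested!r}")
    return version
```

Rejecting unknown versions at build time surfaces a typo immediately instead of producing an image that fails to pull its base layer.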

2.3 The `buildx` Shim and Deprecation

bentoml/_internal/utils/buildx.py is intentionally a thin shim for bentoctl and is not for direct use. It re-exports build and health, but every call emits a DeprecationWarning directing users to bentoml.container.build and bentoml.container.health. The shim also normalizes legacy keyword names (it renames tags → tag and drops subprocess_env) so older callers keep working while the internal API moves to the new container module.

Source: src/bentoml/_internal/utils/buildx.py
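
The shim pattern described above (warn, rename legacy keywords, forward) can be sketched in a few lines. Here container_build is a stand-in that simply echoes its arguments, and legacy_build is a hypothetical name; neither is the actual function in buildx.py.

```python
import warnings
from typing import Any, Dict


def container_build(**kwargs: Any) -> Dict[str, Any]:
    """Stand-in for the new API; it just echoes what it received."""
    return kwargs


def legacy_build(**kwargs: Any) -> Dict[str, Any]:
    """Forward a legacy call to the new API after normalizing its keywords."""
    warnings.warn(
        "legacy_build is deprecated; use bentoml.container.build instead.",
        DeprecationWarning,
        stacklevel=2,  # point the warning at the caller, not at this shim
    )
    if "tags" in kwargs:
        kwargs["tag"] = kwargs.pop("tags")  # legacy keyword renamed
    kwargs.pop("subprocess_env", None)  # legacy keyword dropped
    return container_build(**kwargs)
```

Passing stacklevel=2 makes the warning point at the calling code, which is where the migration needs to happen.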

3. Framework Integrations and the Serving Side

Framework integrations are not a single class — they are realized through the SDK's Model subclasses, the container frontend's options, and the runtime resource allocator.

3.1 GPU / CPU Resource Allocation

ResourceAllocator is the runtime component that decides which GPU indices are assigned to which worker. It uses the system-detected nvidia.com/gpu count, tracks remaining_gpus as it hands them out, and supports fractional allocation (less than one whole GPU). Two environment variables alter its behavior:

  • BENTOML_DISABLE_GPU_ALLOCATION — disables automatic allocation entirely.
  • CUDA_VISIBLE_DEVICES — also disables automatic allocation, leaving it to the user.

If the request exceeds remaining capacity, a ResourceWarning is emitted, and the counter is clamped to zero rather than going negative.

Source: src/_bentoml_impl/server/allocator.py
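
The bookkeeping described in this subsection can be sketched as a counter that warns and clamps. This is a simplified illustration rather than the ResourceAllocator implementation: GpuBudget and its assign method are hypothetical, GPU indices are not modeled, and treating a variable as "set" whenever it is present in the environment is an assumption.

```python
import warnings
from typing import Mapping


def gpu_allocation_disabled(env: Mapping[str, str]) -> bool:
    """True if either variable named in the text is present (assumed rule)."""
    return any(
        name in env for name in ("BENTOML_DISABLE_GPU_ALLOCATION", "CUDA_VISIBLE_DEVICES")
    )


class GpuBudget:
    """Hypothetical tracker for how many GPUs remain unassigned."""

    def __init__(self, detected_gpus: int) -> None:
        self.remaining_gpus = float(detected_gpus)

    def assign(self, count: float) -> float:
        """Reserve count GPUs (may be fractional) and return what remains."""
        if count > self.remaining_gpus:
            warnings.warn(
                f"Requested {count} GPU(s) but only {self.remaining_gpus} remain; "
                "set BENTOML_DISABLE_GPU_ALLOCATION=1 to allocate GPUs manually.",
                ResourceWarning,
                stacklevel=2,
            )
        # Clamp at zero so the counter never goes negative.
        self.remaining_gpus = max(0.0, self.remaining_gpus - count)
        return self.remaining_gpus
```

Note that Python hides ResourceWarning by default, so a warning like this is only visible when warnings are enabled (for example with python -W default).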

3.2 Client-Side Integration Surface

Once a Bento is deployed, the HTTP client is the primary way framework users invoke the service. HTTPClient (in _bentoml_impl/client/http.py) is a generic httpx-backed client that introspects the service's IODescriptor endpoints — captured as ClientEndpoint records with input_spec, output_spec, stream_output, and is_task flags. This is what enables strongly-typed client stubs generated from a service's Python type hints, regardless of the framework the underlying model uses.

Source: src/_bentoml_impl/client/http.py

3.3 Project-Level Commands

The README.md documents the canonical developer workflow that ties Model Store, build, and containerization together:

bentoml build              # assemble Bento
bentoml containerize <tag> # produce OCI image
docker run --rm -p 3000:3000 <tag>
bentoml cloud login        # optional: BentoCloud
bentoml deploy             # optional: deploy

Release v1.4.39 hardened the Model Store by preventing symlink traversal during file copies, which is relevant for any framework integration that stores weights in symlinked cache directories.

Source: README.md

4. Common Failure Modes and Migration Notes

Three recurring pain points are visible in the community and the release notes:

  1. Old IO imports. from bentoml.io import JSON emits a BentoMLDeprecationWarning since v1.4. The fix is to migrate to the new SDK IO types in _bentoml_sdk. Issue #5365 reports confusion from PyTorch tutorials still using the old path.
  2. Per-dependency resolution. Issue #5466 requests pylock.toml support so dependency resolution can be configured per-package — the current options cannot express that granularity.
  3. Iterator return annotations. Issue #5625 shows IODescriptor.from_output() raising IndexError on bare (unparameterized) iterator annotations like t.Iterator or t.Generator. Parameterize them (t.Iterator[int]) to work around.

Each of these is a boundary case where the Model Store / build / framework contract is in flux, and they are the most likely sources of integration friction in the current release line.

Deployment, Containerization, BentoCloud, and Operations

Related topics: Service Definition, IO Types, and API Protocols (HTTP, gRPC, SSE), Model Store, Bento Build, and Framework Integrations

BentoML provides an end-to-end path from a Python inference script to a reproducible deployable artifact. This page documents the deployment surface area: how a service is packaged into a Bento, how that Bento is turned into a Docker image, how resources (especially GPUs) are allocated at runtime, how the HTTP client talks to a running deployment, and the operational tooling (dev-mode reload, telemetry, dotenv parsing) that surrounds these flows. Sources throughout this page are taken directly from the repository's main branch as of release v1.4.39 (v1.4.39 release notes).

1. The Bento → Container → Cloud Pipeline

The deployment story is described at a high level in the project README:

Run bentoml build to package necessary code, models, dependency configs into a Bento, the standardized deployable artifact in BentoML.
Generate a Docker container image for deployment: bentoml containerize summarization:latest.
Run the generated image: docker run --rm -p 3000:3000 summarization:latest.
Deploy from current directory: bentoml deploy.

Source: README.md

The pipeline is intentionally the same regardless of whether the target is a local Docker daemon or BentoCloud: produce a deterministic, versioned Bento first, then turn it into a container image, then schedule it. The BentoInfo / BentoInfoV2 schemas (read in src/bentoml/_internal/utils/analytics/cli_events.py) distinguish the legacy V1 format (with a runners list) from the V2 service-oriented format (with a services map), and the analytics emitter counts num_of_runners accordingly.

flowchart LR
    A[Service script<br/>bentoml.Service] --> B[bentoml build<br/>produces a Bento]
    B --> C[bentoml containerize<br/>or bentoml.container.build]
    C --> D[Docker daemon<br/>docker run -p 3000:3000]
    C --> E[BentoCloud<br/>bentoml deploy]
    D --> F[HTTP client<br/>HTTPClient]
    E --> F

The internal container subsystem exposes a backend abstraction (get_backend("buildx")) so the legacy bentoml._internal.utils.buildx module can remain a thin compatibility shim for bentoctl:

This module is shim for bentoctl. NOT FOR DIRECT USE. Make sure to use bentoml.container.build and bentoml.container.health instead.

Source: src/bentoml/_internal/utils/buildx.py

The shim re-keys tags → tag and pops subprocess_env before delegating to the registered backend, making the new container API the only forward-compatible entry point.

2. Local Containerization, Dev Mode, and Environment

Two operational conveniences sit around the build/run pipeline: dotenv-based configuration and dev-mode file watching.

dotenv.py parses KEY=VALUE and export KEY=VALUE lines (and also accepts KEY: VALUE) so a .env file can supply secrets, registry credentials, and per-deployment toggles without code changes. The parser tolerates whitespace and the optional export prefix, matching the conventions of django-dotenv, the upstream project this module was ported from.

Source: src/bentoml/_internal/utils/dotenv.py
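
A parser for the line formats mentioned above can be sketched with one regular expression. This is a simplified stand-in and not the code in dotenv.py: the function name parse_dotenv, the handling of comment lines, and the stripping of one pair of matching quotes are assumptions, and features such as variable expansion are left out.

```python
import re
from typing import Dict

# Optional "export " prefix, a key, "=" or ":" as separator, then the value.
_LINE = re.compile(r"^\s*(?:export\s+)?([A-Za-z_][A-Za-z0-9_]*)\s*[=:]\s*(.*?)\s*$")


def parse_dotenv(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE, export KEY=VALUE, and KEY: VALUE lines into a dict."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blanks and comments (assumed behavior)
        match = _LINE.match(line)
        if match is None:
            continue  # ignore lines that do not look like assignments
        key, value = match.groups()
        # Strip one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        values[key] = value
    return values


if __name__ == "__main__":
    print(parse_dotenv("API_KEY=abc\nexport REGION = us-east-1\n"))
```

Returning a plain dict instead of writing to os.environ leaves the caller in control of whether existing environment variables take precedence.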

In development, watchfilesplugin.py wraps the watchfiles change stream into a Circus plugin that restarts workers when any file matched by BentoBuildConfig.include changes. The plugin evaluates BentoPathSpec(build_config.include, build_config.exclude, working_dir) once and filters each event against that spec before issuing restart name="*" to the supervisor. This is what powers the reload behavior of bentoml serve --reload.

3. Resource Allocation at Runtime

Once a Bento is running, the API server needs to know which GPU(s) it may bind to. The ResourceAllocator in allocator.py drives this:

  • It inspects system_resources()["nvidia.com/gpu"] to learn how many NVIDIA devices are visible.
  • Each GPU is modeled as (remaining_fraction, unit) so fractional allocation (e.g. 0.5 of a GPU per runner) is supported.
  • gpu_allocation_disabled() returns True if either BENTOML_DISABLE_GPU_ALLOCATION or CUDA_VISIBLE_DEVICES is set in the environment — in that mode the allocator yields control back to the user.
  • If a request exceeds the remaining budget, a ResourceWarning is emitted telling the operator to set BENTOML_DISABLE_GPU_ALLOCATION=1 and allocate GPUs manually.

Source: src/_bentoml_impl/server/allocator.py

This design lets a single Bento serve multiple models that share one or more GPUs without an external scheduler, while still permitting the well-known CUDA_VISIBLE_DEVICES escape hatch for fine-grained control.

4. HTTP Client, Telemetry, and BentoCloud Operations

The HTTPClient in src/_bentoml_impl/client/http.py is the primary way to invoke a deployed Bento. It is parameterized over both httpx.Client (sync) and httpx.AsyncClient, implements automatic retries up to MAX_RETRIES = 3, and maps HTTP status codes to typed BentoMLException subclasses via map_exception (defined in base.py). Each ClientEndpoint carries input_spec / output_spec IODescriptor classes so the SDK can serialize Python types correctly.

Operational telemetry is emitted by cli_events.py under the cli_events_map["bentos"]["build"] key. A BentoBuildEvent records bento_creation_timestamp, bento_size_in_kb, model_size_in_kb, num_of_models, num_of_runners, and model_types, giving the BentoML team aggregate insight into how the build pipeline is used. The model metadata itself is round-tripped through bentoml_cattr, a cattrs Converter with omit_if_default=True and a custom datetime ISO-8601 hook so that BentoInfo / ModelInfo instances can be serialized losslessly.

For cloud deployments, the README points operators at the bentoml cloud login / bentoml deploy workflow and the BentoCloud web UI. Because deployments are built from the same Bento artifact produced locally, the same Model Store entry — including HuggingFaceModel references from src/_bentoml_sdk/models/huggingface.py which resolve model_id + revision against the Hugging Face Hub — can be shipped unchanged between a local Docker run and BentoCloud.

5. Common Failure Modes and Community Notes

Several issues documented in the community affect the deployment surface:

  • Container security: v1.4.34 fixed a security issue when resolving user-supplied file paths (v1.4.34 release).
  • Symlink traversal in BentoStore: v1.4.39 added a guard so copy_model no longer follows symlinks out of the store (v1.4.39 release notes).
  • Multi-arch BuildKit caching: v1.4.39 switched cache mounts to sharing=locked to avoid corruption on parallel multi-arch builds (v1.4.39 release notes).
  • HuggingFace loading in containers: v1.4.37 fixed bentoml.models.HuggingFaceModel resolution when running inside the generated image (v1.4.37 release notes).
  • CPU worker floor: v1.4.32 ensures at least one CPU worker is scheduled even when resources.cpu == 0 (v1.4.32 release notes).
  • Bundling arbitrary local files: a long-standing feature request (#685) asks for first-class inclusion of non-Python assets beyond build_config.include / build_config.exclude.
  • Server-Sent Events: #3743 requests streaming SSE responses from the API server, which would benefit LLM chat and live-monitoring use cases.
  • gRPC transport: #703 requests native gRPC support for the API server, complementing the existing HTTPClient.
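For readers unfamiliar with the symlink-traversal class of bug mentioned in the list above, the general shape of such a guard can be sketched as follows. This is a generic containment check built on the standard library, not the fix shipped in v1.4.39. The function name is_within_store and the example paths are hypothetical.

```python
import os


def is_within_store(candidate: str, store_root: str) -> bool:
    """Return True only if candidate resolves to a location under store_root.

    os.path.realpath follows symlinks and collapses ".." segments, so a link
    or relative hop that leads outside the store is detected after resolution.
    """
    root = os.path.realpath(store_root)
    target = os.path.realpath(candidate)
    # commonpath compares whole path components, so "/srv/store-other"
    # is not mistaken for a child of "/srv/store".
    return os.path.commonpath([root, target]) == root
```

A copy routine could call this for every source path and skip or reject entries for which it returns False.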

Source: https://github.com/bentoml/BentoML / Human Manual

Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.

high Installation risk requires verification

May increase setup, validation, or first-run risk for the user.

high Configuration risk requires verification

May increase setup, validation, or first-run risk for the user.

medium Capability evidence risk requires verification

May increase setup, validation, or first-run risk for the user.

medium Runtime risk requires verification

May increase setup, validation, or first-run risk for the user.

Found 9 structured pitfall item(s), including 2 high/blocking item(s). Top priority: Installation risk - Installation risk requires verification.

1. Installation risk: Installation risk requires verification

  • Severity: high
  • Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | https://github.com/bentoml/BentoML/issues/5466

2. Configuration risk: Configuration risk requires verification

  • Severity: high
  • Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | https://github.com/bentoml/BentoML/issues/5365

3. Capability evidence risk: Capability evidence risk requires verification

  • Severity: medium
  • Finding: README/documentation is current enough for a first validation pass.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: capability.assumptions | github_repo:178976529 | https://github.com/bentoml/BentoML

4. Runtime risk: Runtime risk requires verification

  • Severity: medium
  • Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: community_evidence:github | https://github.com/bentoml/BentoML/issues/5625

5. Maintenance risk: Maintenance risk requires verification

  • Severity: medium
  • Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: evidence.maintainer_signals | github_repo:178976529 | https://github.com/bentoml/BentoML

6. Security or permission risk: Security or permission risk requires verification

  • Severity: medium
  • Finding: no_demo
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: downstream_validation.risk_items | github_repo:178976529 | https://github.com/bentoml/BentoML

7. Security or permission risk: Security or permission risk requires verification

  • Severity: medium
  • Finding: no_demo
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: risks.scoring_risks | github_repo:178976529 | https://github.com/bentoml/BentoML

8. Maintenance risk: Maintenance risk requires verification

  • Severity: low
  • Finding: issue_or_pr_quality=unknown.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: evidence.maintainer_signals | github_repo:178976529 | https://github.com/bentoml/BentoML

9. Maintenance risk: Maintenance risk requires verification

  • Severity: low
  • Finding: release_recency=unknown.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: evidence.maintainer_signals | github_repo:178976529 | https://github.com/bentoml/BentoML

Source: Doramagic discovery, validation, and Project Pack records

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Sources: 12 (the count of project-level external discussion links exposed on this manual page).

Use: review before install. Open the linked issues or discussions before treating the pack as ready for your environment.

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using BentoML with real data or production workflows.

Source: Project Pack community evidence and pitfall evidence