# https://github.com/bentoml/BentoML Project Manual

Generated at: 2026-06-14 00:26:12 UTC

## Table of Contents

- [BentoML Overview and Getting Started](#page-1)
- [Service Definition, IO Types, and API Protocols (HTTP, gRPC, SSE)](#page-2)
- [Model Store, Bento Build, and Framework Integrations](#page-3)
- [Deployment, Containerization, BentoCloud, and Operations](#page-4)

<a id='page-1'></a>

## BentoML Overview and Getting Started

### Related Pages

Related topics: [Service Definition, IO Types, and API Protocols (HTTP, gRPC, SSE)](#page-2), [Model Store, Bento Build, and Framework Integrations](#page-3), [Deployment, Containerization, BentoCloud, and Operations](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/bentoml/BentoML/blob/main/README.md)
- [src/_bentoml_impl/server/serving.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/server/serving.py)
- [src/_bentoml_impl/server/allocator.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/server/allocator.py)
- [src/_bentoml_impl/client/base.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/client/base.py)
- [src/_bentoml_impl/client/proxy.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/client/proxy.py)
- [src/_bentoml_sdk/models/base.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_sdk/models/base.py)
- [src/_bentoml_sdk/models/huggingface.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_sdk/models/huggingface.py)
- [src/bentoml/_internal/models/model.py](https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/models/model.py)
- [src/bentoml/_internal/utils/analytics/cli_events.py](https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/utils/analytics/cli_events.py)
- [src/bentoml/_internal/utils/buildx.py](https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/utils/buildx.py)
- [src/bentoml/_internal/utils/circus/watchfilesplugin.py](https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/utils/circus/watchfilesplugin.py)
- [src/bentoml/_internal/server/README.md](https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/server/README.md)
- [src/bentoml/_internal/utils/dotenv.py](https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/utils/dotenv.py)
- [examples/README.md](https://github.com/bentoml/BentoML/blob/main/examples/README.md)
- [typings/README.md](https://github.com/bentoml/BentoML/blob/main/typings/README.md)
</details>

## Purpose and Scope

BentoML is a Python library for building online serving systems optimized for AI applications and model inference. The framework is designed around three core capabilities: turning model inference scripts into REST API servers using standard Python type hints, packaging everything into reproducible Docker container images, and maximizing compute utilization through features like dynamic batching, model parallelism, multi-stage pipelines, and multi-model inference-graph orchestration. Source: [README.md](https://github.com/bentoml/BentoML/blob/main/README.md)

A BentoML project lifecycle typically follows these stages: develop a Service locally, build it into a Bento (a standardized deployable artifact), containerize it as a Docker image, and then deploy it to BentoCloud or a custom infrastructure. The CLI commands `bentoml serve`, `bentoml build`, and `bentoml containerize` map directly to these stages. Source: [README.md](https://github.com/bentoml/BentoML/blob/main/README.md)

The project is licensed under Apache 2.0, collects anonymous usage analytics (opt-out via `BENTOML_DO_NOT_TRACK=True` or the `--do-not-track` CLI flag), and maintains its main documentation site separately from the repository. Source: [README.md](https://github.com/bentoml/BentoML/blob/main/README.md)

## High-Level Architecture

The runtime architecture is organized into several cooperating subsystems: a serving layer that manages services and resources, an SDK that defines `Service` and `Model` abstractions, a client layer for remote invocation, and internal utilities for analytics, containerization, file watching, and configuration.

```mermaid
flowchart TB
    User[Developer / Client] --> CLI[bentoml CLI]
    CLI --> Serve[bentoml serve]
    CLI --> Build[bentoml build]
    CLI --> Container[bentoml containerize]
    Serve --> Server[serving.py / Server]
    Server --> Allocator[ResourceAllocator]
    Build --> Bento[Bento Artifact]
    Container --> Docker[Docker Image]
    Server --> Models[Model Store]
    Models --> HF[HuggingFaceModel]
    Models --> Store[StoredModel]
    Docker --> Deploy[Deploy to BentoCloud / Cluster]
    Deploy --> Client[RemoteProxy / SyncHTTPClient / AsyncHTTPClient]
```

The serving subsystem is implemented in [src/_bentoml_impl/server/serving.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/server/serving.py), which exposes a `Server` wrapper around Circus processes and a `_get_server_socket` helper that negotiates between Unix Domain Sockets (UDS) and TCP based on the platform. On Windows, WSL, or when `BENTOML_NO_UDS` is set, the server falls back to localhost TCP. Source: [src/_bentoml_impl/server/serving.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/server/serving.py)
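
The fallback rule described above can be sketched as a small pure function. This is an illustrative reconstruction, not the code in `serving.py`: the name `choose_transport` and its parameters are hypothetical, and it assumes that the mere presence of `BENTOML_NO_UDS` in the environment is enough to force TCP.

```python
import os
import sys


def choose_transport(platform: str, environ: dict, is_wsl: bool = False) -> str:
    """Return "uds" or "tcp" for worker sockets (hypothetical helper)."""
    # Windows and WSL are treated as lacking usable Unix Domain Sockets.
    if platform.startswith("win") or is_wsl:
        return "tcp"
    # Assumption: any value of BENTOML_NO_UDS opts out of UDS.
    if "BENTOML_NO_UDS" in environ:
        return "tcp"
    return "uds"


# Decide for the current process.
current_transport = choose_transport(sys.platform, dict(os.environ))
```

Keeping the decision in a function of its inputs makes the platform branches easy to exercise without actually running on each platform.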

The `ResourceAllocator` class in [src/_bentoml_impl/server/allocator.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/server/allocator.py) tracks GPU and CPU assignments. It queries `system_resources()` for the available NVIDIA devices, decrements a remaining-gpus counter on each assignment, and supports fractional allocation. Two environment variables influence behavior: `BENTOML_DISABLE_GPU_ALLOCATION` and `CUDA_VISIBLE_DEVICES`. Source: [src/_bentoml_impl/server/allocator.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/server/allocator.py)

## Key Subsystems and Workflows

### Services and IO Descriptors

Services are the unit of deployment. Each `Service` declares API endpoints via decorators; each endpoint consumes and produces data described by IO descriptors. The community has reported bugs related to descriptor resolution, for example an `IndexError` in `IODescriptor.from_output()` when methods return bare (unparameterized) iterator annotations such as `t.Iterator` or `t.Generator`. Source: [Issue #5625](https://github.com/bentoml/BentoML/issues/5625)

For type-hint validation similar to FastAPI's Pydantic integration, the SDK uses `IODescriptor`. A long-standing community request asks for richer Pydantic model schema support. Source: [Issue #1480](https://github.com/bentoml/BentoML/issues/1480)

### Model Store and Model References

`bentoml.models` provides two complementary abstractions: a `ModelStore` for saved artifacts on disk and SDK `Model` reference types such as `HuggingFaceModel`. The SDK base class `Model[T]` is defined in [src/_bentoml_sdk/models/base.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_sdk/models/base.py) and declares abstract methods `to_info()` and `from_info()` for round-tripping between a model reference and a `BentoModelInfo` object used during build. Source: [src/_bentoml_sdk/models/base.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_sdk/models/base.py)

`HuggingFaceModel` (in [src/_bentoml_sdk/models/huggingface.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_sdk/models/huggingface.py)) accepts a `model_id`, an optional `revision`, an `endpoint` (defaulting to `HF_ENDPOINT` or `https://huggingface.co`), and `include`/`exclude` file patterns. A recent fix in v1.4.37 ensures that `HuggingFaceModel` is correctly loaded inside a container. Source: [v1.4.37 release notes](https://github.com/bentoml/BentoML/releases/tag/v1.4.37)

For model-info persistence, the on-disk YAML format is parsed in [src/bentoml/_internal/models/model.py](https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/models/model.py), where `ModelInfo.from_yaml` strips legacy fields and provides sensible defaults when a pre-1.0 model is loaded. Source: [src/bentoml/_internal/models/model.py](https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/models/model.py)

### Client and Remote Proxy

The client layer is structured around an abstract `AbstractClient` (in [src/_bentoml_impl/client/base.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/client/base.py)) that exposes `endpoints` as a dictionary of `ClientEndpoint` records. A `map_exception` helper translates HTTP responses into typed `BentoMLException` subclasses via an `error_mapping` table. Source: [src/_bentoml_impl/client/base.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/client/base.py)
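
A status-to-exception table of this kind might look like the sketch below. The exception classes and the specific status codes are assumptions chosen for illustration; the real `error_mapping` table and `BentoMLException` hierarchy are not reproduced here.

```python
from http import HTTPStatus


class ServiceError(Exception):
    """Hypothetical base class standing in for BentoMLException."""

    def __init__(self, message: str, status: int):
        super().__init__(message)
        self.status = status


class InvalidInput(ServiceError):
    pass


class NotFound(ServiceError):
    pass


class Unavailable(ServiceError):
    pass


# Assumed mapping; HTTPStatus members hash and compare like plain ints.
ERROR_MAPPING = {
    HTTPStatus.BAD_REQUEST: InvalidInput,
    HTTPStatus.NOT_FOUND: NotFound,
    HTTPStatus.SERVICE_UNAVAILABLE: Unavailable,
}


def map_exception(status: int, body: str) -> ServiceError:
    """Build a typed exception for a failed response, with a generic fallback."""
    exc_class = ERROR_MAPPING.get(status, ServiceError)
    return exc_class(body, status)
```

A caller would typically `raise map_exception(resp_status, resp_text)` after detecting a non-success response, so that application code can catch specific failure types.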

`RemoteProxy` (in [src/_bentoml_impl/client/proxy.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/client/proxy.py)) wraps a service URL and provides sync/async HTTP clients. The default timeout is derived from the service configuration plus a 1% margin, falling back to 60 seconds when no service object is passed. A v1.4.35 fix in [PR #5541](https://github.com/bentoml/BentoML/pull/5541) ensures that the connector is recreated on session refresh to prevent closed-session errors. Source: [v1.4.35 release notes](https://github.com/bentoml/BentoML/releases/tag/v1.4.35)
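
The timeout rule can be sketched in a few lines. The function name is hypothetical, and the `traffic.timeout` configuration key is an assumption about where the service timeout lives; only the 1% margin and the 60-second fallback come from the description above.

```python
def default_client_timeout(service_config=None, fallback=60.0, margin=0.01):
    """Derive a client timeout slightly above the server-side timeout."""
    if service_config is None:
        # No service object was supplied, so use the plain fallback.
        return fallback
    # Assumed key layout: {"traffic": {"timeout": <seconds>}}.
    server_timeout = service_config.get("traffic", {}).get("timeout", fallback)
    return server_timeout * (1 + margin)
```

The margin lets the server time out first, so the client sees the server's error response rather than its own local timeout.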

### Build, Container, and Configuration

The build path records analytics events; the handler in [src/bentoml/_internal/utils/analytics/cli_events.py](https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/utils/analytics/cli_events.py) differentiates between `BentoInfo` and `BentoInfoV2` to count runners correctly and reports the total Bento size, model size, and model types. Source: [src/bentoml/_internal/utils/analytics/cli_events.py](https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/utils/analytics/cli_events.py)

The legacy `buildx` shim in [src/bentoml/_internal/utils/buildx.py](https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/utils/buildx.py) is preserved for `bentoctl` compatibility but emits a `DeprecationWarning` and forwards to the new `bentoml.container.build` and `bentoml.container.health` API. Multi-arch builds gained `sharing=locked` cache mounts in v1.4.39 to avoid corruption across parallel builds. Source: [v1.4.39 release notes](https://github.com/bentoml/BentoML/releases/tag/v1.4.39)

For configuration, BentoML supports `.env` files via the `dotenv` module in [src/bentoml/_internal/utils/dotenv.py](https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/utils/dotenv.py), which parses `KEY=VALUE` and `export KEY=VALUE` syntax. Per-dependency pinning through the `pylock.toml` standard is not yet supported; it is tracked as a community feature request. Source: [Issue #5466](https://github.com/bentoml/BentoML/issues/5466)
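
To make the `KEY=VALUE` and `export KEY=VALUE` syntax concrete, here is a minimal parser sketch. It is not the implementation in `dotenv.py`: the function name is hypothetical, and the handling of comments, blank lines, and surrounding quotes is an assumption.

```python
def parse_dotenv(text: str) -> dict:
    """Parse KEY=VALUE and `export KEY=VALUE` lines into a dict."""
    env = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Assumption: blank lines and '#' comments are skipped.
        if not line or line.startswith("#"):
            continue
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not an assignment
        value = value.strip()
        # Assumption: one pair of matching quotes is stripped.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        env[key.strip()] = value
    return env
```

For example, `parse_dotenv("A=1\nexport B=two")` yields a mapping with both `A` and `B`, treating the `export` prefix as optional.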

### Development Workflow and Reloading

The Circus-based file watcher in [src/bentoml/_internal/utils/circus/watchfilesplugin.py](https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/utils/circus/watchfilesplugin.py) loads a `BentoBuildConfig` to derive include/exclude path specs, filters changes through those specs, and triggers a `restart` call on the Circus arbiter when matching files change. This is the mechanism that powers `bentoml serve --reload`. Source: [src/bentoml/_internal/utils/circus/watchfilesplugin.py](https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/utils/circus/watchfilesplugin.py)
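
The include/exclude filtering step can be approximated with glob patterns from the standard library. This sketch uses `fnmatch` as a stand-in; the actual plugin derives its path specs from `BentoBuildConfig`, and the function name here is hypothetical.

```python
from fnmatch import fnmatchcase


def should_reload(changed_paths, include, exclude):
    """Return True if any changed path is included and not excluded."""
    for path in changed_paths:
        included = any(fnmatchcase(path, pattern) for pattern in include)
        excluded = any(fnmatchcase(path, pattern) for pattern in exclude)
        if included and not excluded:
            return True
    return False
```

A watcher loop would call this on each batch of filesystem events and ask the process manager to restart the workers only when it returns `True`.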

The internal server README in [src/bentoml/_internal/server/README.md](https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/server/README.md) demonstrates running a service both via `bentoml serve` and directly through `uvicorn hello:app --reload`, then exercising it with `curl`. Source: [src/bentoml/_internal/server/README.md](https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/server/README.md)

## Common Failure Modes and Limitations

- **Deprecated IO types**: `bentoml.io` raises a `BentoMLDeprecationWarning` from v1.4 onward. Users should migrate to the new style IO descriptors. Source: [Issue #5365](https://github.com/bentoml/BentoML/issues/5365)
- **Iterator return annotations**: Bare (unparameterized) iterator types in service methods crash `IODescriptor.from_output()`. Source: [Issue #5625](https://github.com/bentoml/BentoML/issues/5625)
- **Missing dependency formats**: `pylock.toml` is not yet supported as an alternative to `requirements.txt`. Source: [Issue #5466](https://github.com/bentoml/BentoML/issues/5466)
- **gRPC transport**: Native gRPC support is not in core and remains a long-running community request. Source: [Issue #703](https://github.com/bentoml/BentoML/issues/703)
- **Server-Sent Events**: SSE for streaming responses is not natively supported. Source: [Issue #3743](https://github.com/bentoml/BentoML/issues/3743)
- **SpaCy runner**: Removed in v1.0; no built-in runner currently exists. Source: [Issue #4134](https://github.com/bentoml/BentoML/issues/4134)
- **Resource exhaustion**: When more GPUs are requested than available, `ResourceAllocator` warns and continues, which may cause runtime failures downstream. Source: [src/_bentoml_impl/server/allocator.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/server/allocator.py)
- **Multi-bento model import**: Importing multiple Bentos that share a model required a fix in v1.4.35. Source: [v1.4.35 release notes](https://github.com/bentoml/BentoML/releases/tag/v1.4.35)
- **Container path resolution**: A path-traversal hardening landed in v1.4.34. Source: [v1.4.34 release notes](https://github.com/bentoml/BentoML/releases/tag/v1.4.34)

## Quick Start Example

A minimal end-to-end loop, as documented in the project README, is:

```bash
# 1. Install BentoML
pip install bentoml

# 2. Develop a service (defined in service.py)
bentoml serve service.py:svc

# 3. Build a Bento (standardized deployable artifact)
bentoml build

# 4. Containerize (requires Docker daemon)
bentoml containerize summarization:latest

# 5. Run the image
docker run --rm -p 3000:3000 summarization:latest

# 6. Deploy to BentoCloud
bentoml cloud login
bentoml deploy
```

Source: [README.md](https://github.com/bentoml/BentoML/blob/main/README.md)

## See Also

- [Hello World Tutorial](https://docs.bentoml.com/en/latest/get-started/hello-world.html)
- [Model Loading and Model Store](https://docs.bentoml.com/en/latest/build-with-bentoml/model-loading-and-management.html)
- [GPU Inference](https://docs.bentoml.com/en/latest/build-with-bentoml/gpu-inference.html)
- [Adaptive Batching](https://docs.bentoml.com/en/latest/get-started/adaptive-batching.html)
- [Distributed Services](https://docs.bentoml.com/en/latest/build-with-bentoml/distributed-services.html)
- [BentoCloud Deployment](https://docs.bentoml.com/en/latest/get-started/cloud-deployment.html)
- [Examples Index](https://github.com/bentoml/BentoML/blob/main/examples/README.md)

---

<a id='page-2'></a>

## Service Definition, IO Types, and API Protocols (HTTP, gRPC, SSE)

### Related Pages

Related topics: [BentoML Overview and Getting Started](#page-1), [Model Store, Bento Build, and Framework Integrations](#page-3)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [src/_bentoml_impl/client/http.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/client/http.py)
- [src/_bentoml_impl/client/base.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/client/base.py)
- [src/_bentoml_impl/client/proxy.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/client/proxy.py)
- [src/_bentoml_impl/client/proxy2.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/client/proxy2.py)
- [src/_bentoml_impl/client/task.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/client/task.py)
- [src/_bentoml_impl/server/serving.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/server/serving.py)
- [src/_bentoml_impl/server/proxy.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/server/proxy.py)
- [src/_bentoml_sdk/models/base.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_sdk/models/base.py)
- [src/bentoml/_internal/models/model.py](https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/models/model.py)
</details>

## Overview

In BentoML, a *Service* is the top-level object that bundles one or more ML models behind strongly-typed HTTP APIs. Each service method maps to an HTTP route whose request/response payload is governed by an `IODescriptor`. The runtime exposes two cooperating transports: an HTTP/JSON (and HTTP/pickle) implementation built on `httpx` / `aiohttp`, and an in-process reverse-proxy that can forward to external commands. Both transports share the same `ClientEndpoint` model, so the same service class is reachable through `SyncHTTPClient`, `AsyncHTTPClient`, or `RemoteProxy`. Long-running endpoints are surfaced through `Task` / `AsyncTask` handles, while streaming endpoints return a streamed HTTP response. Native gRPC and Server-Sent Events (SSE) are not part of the current v1.4 protocol surface; they are tracked as community feature requests (see [issue #703](https://github.com/bentoml/BentoML/issues/703) and [issue #3743](https://github.com/bentoml/BentoML/issues/3743)).

## Service Definition and Method Endpoints

A service is declared by decorating a class with `@bentoml.service` and exposing its methods with `@bentoml.api`. The decorator machinery turns each API method into a `ClientEndpoint` whose `route`, `input_spec`, `output_spec`, `doc`, `stream_output`, and `is_task` fields are derived from the Python type annotations. When a `Service` object is passed into an HTTP client, the constructor iterates `service.apis.items()` and registers one route per API, generating JSON-Schema-compatible input/output descriptors:

```python
for name, method in service.apis.items():
    routes[name] = ClientEndpoint(
        name=name,
        route=method.route,
        input=method.input_spec.model_json_schema(),
        output=method.output_spec.model_json_schema(),
        doc=method.doc,
        input_spec=method.input_spec,
        output_spec=method.output_spec,
        stream_output=method.is_stream,
        is_task=method.is_task,
    )
```

Source: [src/_bentoml_impl/client/proxy2.py:1-220](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/client/proxy2.py). The same endpoint shape is reused on the server side, so the route table is symmetric between the producer (the service definition) and the consumer (the client). Service-level configuration such as the readyz endpoint is read from `service.config.get("endpoints", {}).get("readyz", "/readyz")` ([src/_bentoml_impl/client/proxy2.py:1-220](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/client/proxy2.py)). The `serve_http` entry point exported by `src/_bentoml_impl/server/__init__.py` is the public API used to start a service.
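
The idea of deriving endpoint records from type annotations can be illustrated with the standard library alone. The sketch below does not use BentoML: `EndpointSketch`, `describe_endpoints`, and the `Summarizer` class are hypothetical, and the route naming (`/<method name>`) is an assumption.

```python
import collections.abc
import inspect
import typing as t
from dataclasses import dataclass

# Return-type origins treated as "streaming" in this sketch.
STREAM_ORIGINS = (
    collections.abc.Iterator,
    collections.abc.AsyncIterator,
    collections.abc.Generator,
    collections.abc.AsyncGenerator,
)


@dataclass(frozen=True)
class EndpointSketch:
    name: str
    route: str
    input_types: dict
    output_type: t.Any
    stream_output: bool


def describe_endpoints(cls) -> dict:
    """Build one endpoint record per public method from its annotations."""
    endpoints = {}
    for name, func in inspect.getmembers(cls, inspect.isfunction):
        if name.startswith("_"):
            continue
        hints = t.get_type_hints(func)
        output = hints.pop("return", t.Any)
        is_stream = t.get_origin(output) in STREAM_ORIGINS
        endpoints[name] = EndpointSketch(name, f"/{name}", hints, output, is_stream)
    return endpoints


class Summarizer:
    def summarize(self, text: str, max_length: int = 50) -> str:
        return text[:max_length]

    def stream(self, text: str) -> t.Iterator[str]:
        yield from text.split()


endpoints = describe_endpoints(Summarizer)
```

Here `endpoints["stream"].stream_output` is set because the return annotation is a parameterized iterator, which mirrors how a streaming flag can be inferred from the signature.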

## IO Types via IODescriptor

The `IODescriptor` abstraction (imported as `from _bentoml_sdk import IODescriptor` in [src/_bentoml_impl/client/http.py:1-80](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/client/http.py) and built from Pydantic models in [src/_bentoml_impl/client/base.py:1-80](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/client/base.py)) is responsible for serializing and validating request/response payloads. Two groups of attributes matter most:

- `input_spec.model_json_schema()` and `output_spec.model_json_schema()` produce the JSON-Schema that the HTTP layer exposes for clients that want to validate their own payloads.
- `is_stream` and `is_task` flag whether the endpoint returns a streamed response or an asynchronous task handle.

A known bug surfaces when the return annotation is an *unparameterized* iterator such as `t.Iterator` or `t.AsyncIterator`: `IODescriptor.from_output()` then crashes with an `IndexError` (see [issue #5625](https://github.com/bentoml/BentoML/issues/5625)). As of v1.4, `bentoml.io` is also deprecated in favor of the new-style IO descriptors (see [issue #5365](https://github.com/bentoml/BentoML/issues/5365) and [release v1.4.39](https://github.com/bentoml/BentoML/releases/tag/v1.4.39)).
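
One plausible way such an `IndexError` arises is indexing the type arguments of a bare annotation, since `typing.get_args` returns an empty tuple for unparameterized generics. The sketch below demonstrates that behavior and a defensive alternative; `stream_item_type` is a hypothetical helper, and whether `IODescriptor.from_output()` fails in exactly this way is an assumption.

```python
import typing as t

# A bare iterator annotation carries no type arguments.
bare_args = t.get_args(t.Iterator)
parameterized_args = t.get_args(t.Iterator[str])


def stream_item_type(annotation: t.Any) -> t.Any:
    """Return the item type of an iterator annotation, or Any if unspecified."""
    args = t.get_args(annotation)
    # Indexing args[0] unconditionally would raise IndexError for bare types.
    return args[0] if args else t.Any
```

Until the issue is resolved, annotating streaming endpoints with a parameterized type such as `t.Iterator[str]` sidesteps the empty-arguments case.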

## HTTP Protocol, Client Transports, and Reverse Proxy

BentoML ships **two** HTTP transports. The first, `HTTPClient` in [src/_bentoml_impl/client/http.py:1-80](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/client/http.py), is built on `httpx.Client` / `httpx.AsyncClient`. The second, `ProxyClient` in [src/_bentoml_impl/client/proxy2.py:1-220](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/client/proxy2.py), is built on `aiohttp` and is used by `RemoteProxy` for inter-service calls inside a Bento deployment. Both implement the same `AbstractClient` contract and surface the same `ClientEndpoint` registry.

```mermaid
flowchart LR
  Client[Python SDK caller] --> HC[SyncHTTPClient / AsyncHTTPClient]
  HC -->|httpx| HTTP[BentoML HTTP server]
  Client --> RP[RemoteProxy]
  RP -->|aiohttp| HTTP
  HTTP --> Circus[circus Server]
  Circus -->|UDS / TCP| Worker[Bento worker process]
  Worker --> Models[(ModelStore / HuggingFaceModel)]
```

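Source: [src/_bentoml_impl/server/serving.py:1-80](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/server/serving.py) and [src/_bentoml_impl/client/http.py:1-80](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/client/http.py). The serving layer uses `circus` to spawn one or more worker processes, connecting them via Unix Domain Sockets on POSIX or TCP sockets on Windows/WSL; TCP is also used when `BENTOML_NO_UDS` is set ([src/_bentoml_impl/server/serving.py:1-80](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/server/serving.py)). Models are looked up through the abstract `Model` class and concrete subclasses such as `HuggingFaceModel` ([src/_bentoml_sdk/models/base.py:1-80](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_sdk/models/base.py) and [src/_bentoml_sdk/models/huggingface.py:1-80](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_sdk/models/huggingface.py)), which resolve to entries in the on-disk `ModelStore` ([src/bentoml/_internal/models/model.py:1-160](https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/models/model.py)).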

For services that wrap a third-party HTTP server (e.g., a Triton or vLLM command), `create_proxy_app` ([src/_bentoml_impl/server/proxy.py:1-80](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/server/proxy.py)) exposes a Starlette app that forwards all requests, including a configurable `/health` endpoint, to the spawned child process.

## Long-Running Calls, Streaming, and Community-Requested Protocols

Endpoints flagged with `is_task=True` return a `Task` or `AsyncTask` handle instead of the final payload, allowing the client to poll, cancel, or retry the operation. The relevant methods on the handle are `get_status()`, `cancel()`, `get()`, and `retry()` ([src/_bentoml_impl/client/task.py:1-80](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/client/task.py)). Endpoints flagged with `stream_output=True` are surfaced through the same HTTP transport but use a streaming response (used internally for SSE-style outputs; see [issue #3743](https://github.com/bentoml/BentoML/issues/3743)).

| Protocol | Status in v1.4 | Transport file | Notes |
|----------|---------------|----------------|-------|
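| HTTP/JSON | Supported | [src/_bentoml_impl/client/http.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/client/http.py) | Default media type is `application/json` |
| HTTP/pickle | Supported | [src/_bentoml_impl/client/proxy.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/client/proxy.py) | Used by `RemoteProxy` (`application/vnd.bentoml+pickle`) |
| Async tasks | Supported | [src/_bentoml_impl/client/task.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/client/task.py) | Returns `Task` / `AsyncTask` handles |
| Streaming (SSE-like) | Supported via `is_stream` | [src/_bentoml_impl/client/base.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/client/base.py) | Requested as first-class SSE in [issue #3743](https://github.com/bentoml/BentoML/issues/3743) |
| gRPC | Not implemented | n/a | Requested in [issue #703](https://github.com/bentoml/BentoML/issues/703) (9 comments) |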
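
A client-side polling loop over the task handle methods named above (`get_status()`, `get()`, `cancel()`) could be sketched as follows. The status strings (`"in_progress"`, `"success"`, `"failure"`) and the `FakeTask` stand-in are assumptions made so the example runs without a server; the real status representation may differ.

```python
import time


def wait_for_task(task, poll_interval=0.5, timeout=30.0):
    """Poll a task handle until it finishes, cancelling it on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        status = task.get_status()  # status values here are hypothetical
        if status == "success":
            return task.get()
        if status == "failure":
            raise RuntimeError("task failed")
        if time.monotonic() >= deadline:
            task.cancel()
            raise TimeoutError("task did not finish in time")
        time.sleep(poll_interval)


class FakeTask:
    """In-memory stand-in for a Task handle."""

    def __init__(self, statuses, result):
        self._statuses = iter(statuses)
        self._result = result
        self.cancelled = False

    def get_status(self):
        return next(self._statuses)

    def get(self):
        return self._result

    def cancel(self):
        self.cancelled = True
```

Calling `wait_for_task(FakeTask(["in_progress", "success"], 42), poll_interval=0)` returns the stored result once the fake status sequence reaches `"success"`.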

## Common Failure Modes

- **Deprecated `bentoml.io`**: emits `BentoMLDeprecationWarning` on import; migrate to the new IO descriptors shipped in v1.4 ([issue #5365](https://github.com/bentoml/BentoML/issues/5365)).
- **Bare iterator return type**: `IODescriptor.from_output()` raises `IndexError` on `t.Iterator` without parameters ([issue #5625](https://github.com/bentoml/BentoML/issues/5625)).
- **Symlink file copy**: resolved in v1.4.39 by preventing symlink traversal in `BentoStore` ([release v1.4.39](https://github.com/bentoml/BentoML/releases/tag/v1.4.39)).
- **Closed aiohttp session**: fixed in v1.4.35 by recreating the connector on session refresh ([release v1.4.35](https://github.com/bentoml/BentoML/releases/tag/v1.4.35)).
- **Missing `pylock.toml` support**: tracked in [issue #5466](https://github.com/bentoml/BentoML/issues/5466) as a missing dependency-locking feature.

## See Also

- [Model Loading and Model Store](model-store.md)
- [Workers and Model Parallelization](parallelization.md)
- [Service Configuration and Dependencies](service-config.md)
- [Async Tasks and Streaming Endpoints](async-tasks.md)
- Community: [issue #703 gRPC support](https://github.com/bentoml/BentoML/issues/703), [issue #3743 SSE support](https://github.com/bentoml/BentoML/issues/3743), [issue #5625 IODescriptor IndexError](https://github.com/bentoml/BentoML/issues/5625)

---

<a id='page-3'></a>

## Model Store, Bento Build, and Framework Integrations

### Related Pages

Related topics: [BentoML Overview and Getting Started](#page-1), [Service Definition, IO Types, and API Protocols (HTTP, gRPC, SSE)](#page-2), [Deployment, Containerization, BentoCloud, and Operations](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [src/bentoml/_internal/models/model.py](https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/models/model.py)
- [src/_bentoml_sdk/models/base.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_sdk/models/base.py)
- [src/_bentoml_sdk/models/huggingface.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_sdk/models/huggingface.py)
- [src/bentoml/_internal/container/frontend/dockerfile/__init__.py](https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/container/frontend/dockerfile/__init__.py)
- [src/bentoml/_internal/utils/buildx.py](https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/utils/buildx.py)
- [src/bentoml/_internal/utils/analytics/cli_events.py](https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/utils/analytics/cli_events.py)
- [src/_bentoml_impl/server/allocator.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/server/allocator.py)
- [src/_bentoml_impl/client/http.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/client/http.py)
- [README.md](https://github.com/bentoml/BentoML/blob/main/README.md)
</details>

BentoML is a Python framework for building online serving systems for AI applications and model inference. The Model Store, the Bento build pipeline, and the framework integrations together form the "packaging" half of the project: they decide *what* artifact travels with a service, *how* that artifact is assembled, and *how* it is turned into a container that runs anywhere. This page documents the moving parts behind those decisions, as they exist in the source tree.

## 1. Model Store: Persistent On-Disk Model Management

The Model Store is the on-disk location where BentoML keeps versioned, tagged model artifacts. Every saved model is represented as a directory containing a `model.yaml` manifest that captures metadata, signatures, and the captured `ModelContext` (Python version, framework versions, etc.).

### 1.1 The `model.yaml` Manifest

The manifest loader strips legacy fields and normalizes the representation before handing it to a `cattr` structure hook. Key behaviors defined in the loader:

- The top-level `name` field is promoted into a `Tag`, while the explicit `tag` field is *ignored* on save.
- For backwards compatibility with Bentos created before `1.0.0rc1`, the loader deletes `version`, `bentoml_version`, and the legacy `context.pip_dependencies` (rewriting it into `framework_versions`).
- A missing `signatures` section is silently defaulted to an empty mapping.
- An unexpected field causes `cattr` to raise a `TypeError`, which the loader re-raises as a `BentoMLException`.

Source: [src/bentoml/_internal/models/model.py](https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/models/model.py)
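
The normalization steps listed above can be sketched on a plain dictionary. This is an approximation rather than the loader itself: the function name is hypothetical, and it assumes that `version` is folded into the tag string and that the legacy `pip_dependencies` entry is replaced by an empty `framework_versions` mapping.

```python
import copy


def normalize_model_yaml(raw: dict) -> dict:
    """Apply the legacy-field cleanup to a parsed model.yaml mapping."""
    data = copy.deepcopy(raw)
    # Promote name (and, by assumption, version) into a single tag string;
    # any explicit "tag" key in the input is overwritten.
    name = data.pop("name")
    version = data.pop("version")
    data["tag"] = f"{name}:{version}"
    # Drop the legacy top-level version marker.
    data.pop("bentoml_version", None)
    # A missing signatures section defaults to an empty mapping.
    data.setdefault("signatures", {})
    context = data.get("context", {})
    if "pip_dependencies" in context:
        del context["pip_dependencies"]
        context.setdefault("framework_versions", {})  # assumed replacement
    return data
```

Working on a deep copy keeps the parsed YAML untouched, which makes the cleanup easy to test in isolation.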

### 1.2 Internal vs. SDK Model Types

There are two parallel "model" types in the codebase:

| Layer | Type | Role |
|-------|------|------|
| Internal | `bentoml._internal.models.Model` (a `StoredModel`) | The persisted, filesystem-backed record in the Model Store. |
| SDK | `_bentoml_sdk.models.base.Model[T]` | An abstract, generic, framework-agnostic reference a service author declares. |

The SDK base class derives from `abc.ABC` and `t.Generic[T]`. It requires subclasses to implement `to_info`, `from_info`, `to_create_schema`, and `resolve`. The descriptor protocol (`__get__`) lazily resolves the model on first attribute access and caches the resolved value in a name-mangled `_Model__resolved` slot. This is what makes a class-level attribute like `model = HuggingFaceModel("org/name")` behave like an already-resolved runtime object inside a service, even though resolution is deferred until first access.

Source: [src/_bentoml_sdk/models/base.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_sdk/models/base.py)
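
The lazy-resolution pattern can be shown with a small descriptor. This is a generic sketch of the technique, not the SDK's `Model` class: `LazyReference`, `load_weights`, and `MyService` are hypothetical names, and the cache lives on the descriptor itself for simplicity.

```python
class LazyReference:
    """Descriptor that resolves its value on first access and caches it."""

    _UNSET = object()

    def __init__(self, loader):
        self._loader = loader
        self._resolved = self._UNSET

    def __get__(self, instance, owner=None):
        if instance is None:
            # Accessed on the class: return the reference itself.
            return self
        if self._resolved is self._UNSET:
            self._resolved = self._loader()
        return self._resolved


load_calls = []


def load_weights():
    load_calls.append("load")
    return {"weights": [1, 2, 3]}


class MyService:
    model = LazyReference(load_weights)
```

Nothing is loaded when `MyService` is defined; the loader runs once on the first `instance.model` access and later accesses reuse the cached value.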

A concrete specialization is `HuggingFaceModel`, an `attrs`-frozen, hashable reference. Its `to_create_schema` produces a BentoCloud-compatible `CreateModelSchema` carrying the model ID, revision, endpoint (defaulting to the `HF_ENDPOINT` env var), include/exclude patterns, and a `ModelManifestSchema` with the BentoML version, size, and context. When `HF_ENDPOINT` is unset, the endpoint falls back to `https://huggingface.co` (`DEFAULT_HF_ENDPOINT`), so the model URL is `https://huggingface.co/{model_id}` by default. A change in v1.4.37 ensures this reference is correctly materialized inside the container (see the [v1.4.37 release notes](https://github.com/bentoml/BentoML/releases/tag/v1.4.37)).

Source: [src/_bentoml_sdk/models/huggingface.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_sdk/models/huggingface.py)

## 2. The Bento Build Pipeline

A "Bento" is the standardized deployable artifact. Building one is what bridges the developer's Python code, the SDK-level model references, and the container frontend.

```mermaid
flowchart LR
    A[Service module<br/>+ Model references] --> B[bentoml build]
    B --> C[Model Store lookup<br/>+ on-disk copy]
    C --> D[BentoInfo / BentoInfoV2]
    D --> E[bentoml containerize]
    E --> F[Dockerfile frontend]
    F --> G[OCI image]
```

### 2.1 What a `bento build` Produces

The build command collects services, runners, and models, then writes a `Bento` object. Analytics, which are enabled by default and can be disabled with the `BENTOML_DO_NOT_TRACK` env var, emit a `BentoBuildEvent` summarizing creation timestamp, total size, model size, runner count, and model module names. The schema distinguishes between `BentoInfo` (v1) and `BentoInfoV2` (v1.2+) when counting runners, which keeps the metric stable across schema migrations.

Source: [src/bentoml/_internal/utils/analytics/cli_events.py](https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/utils/analytics/cli_events.py)

### 2.2 Container Frontend and Supported Runtimes

The Dockerfile frontend enumerates what the build can target. It exports a fixed list of supported runtimes, which helps keep the generated image reproducible.

| Runtime | Supported versions |
|---------|--------------------|
| Python | 3.9, 3.10, 3.11, 3.12, 3.13, 3.14 |
| CUDA | 12.8.1, 12.8.0, 12.6.x, 12.1.x, 12.0.x, 11.8.0, 11.7.1, 11.6.2, 11.4.3, 11.2.2 |

User-supplied CUDA versions are normalized through `ALLOWED_CUDA_VERSION_ARGS` (e.g. `"12"` → `"12.8.1"`). Release `v1.4.38` corrected the NVIDIA CUDA base images for Debian, which previously caused subtle runtime mismatches.

Source: [src/bentoml/_internal/container/frontend/dockerfile/__init__.py](https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/container/frontend/dockerfile/__init__.py)
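
Alias normalization of this kind can be sketched with a lookup table. Only the `"12"` to `"12.8.1"` mapping comes from the text above; the function name is hypothetical, the supported list is limited to the exact versions in the table, and the structure of the real `ALLOWED_CUDA_VERSION_ARGS` may differ.

```python
# Exact versions taken from the table above (wildcard rows omitted).
SUPPORTED_CUDA_VERSIONS = [
    "12.8.1", "12.8.0", "11.8.0", "11.7.1", "11.6.2", "11.4.3", "11.2.2",
]

# Short aliases resolved to a concrete version; only "12" is documented.
CUDA_ALIASES = {"12": "12.8.1"}


def normalize_cuda_version(requested: str) -> str:
    """Resolve an alias and reject versions outside the supported list."""
    resolved = CUDA_ALIASES.get(requested, requested)
    if resolved not in SUPPORTED_CUDA_VERSIONS:
        raise ValueError(f"unsupported CUDA version: {requested!r}")
    return resolved
```

Resolving aliases before validation means a build configuration can say `"12"` while the generated image still pins one concrete CUDA release.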

### 2.3 The `buildx` Shim and Deprecation

`bentoml/_internal/utils/buildx.py` is intentionally a thin shim for `bentoctl` and is not for direct use. It re-exports `build` and `health`, but every call emits a `DeprecationWarning` directing users to `bentoml.container.build` and `bentoml.container.health`. The shim also normalizes legacy keyword names (`tags` becomes `tag`, and `subprocess_env` is dropped) so older callers keep working while the internal API moves to the new container module.

Source: [src/bentoml/_internal/utils/buildx.py](https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/utils/buildx.py)
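
The shim pattern (warn, rename legacy keywords, forward) can be sketched generically. Here `new_build` is a hypothetical stand-in for the new container API so the example runs on its own; it does not reproduce the signature of `bentoml.container.build`.

```python
import warnings


def new_build(*, tag, **kwargs):
    """Stand-in for the replacement API; returns what it received."""
    return {"tag": tag, **kwargs}


def build(**kwargs):
    """Legacy entry point that warns and forwards to the new API."""
    warnings.warn(
        "build() is deprecated; use the new container build API instead",
        DeprecationWarning,
        stacklevel=2,
    )
    # Rename the legacy keyword and drop the one no longer accepted.
    if "tags" in kwargs:
        kwargs["tag"] = kwargs.pop("tags")
    kwargs.pop("subprocess_env", None)
    return new_build(**kwargs)
```

Using `stacklevel=2` attributes the warning to the caller's line, which is what makes such a shim useful for finding code that still needs migrating.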

## 3. Framework Integrations and the Serving Side

Framework integrations are not a single class — they are realized through the SDK's `Model` subclasses, the container frontend's options, and the runtime resource allocator.

### 3.1 GPU / CPU Resource Allocation

`ResourceAllocator` is the runtime component that decides which GPU indices are assigned to which worker. It uses the system-detected `nvidia.com/gpu` count, tracks `remaining_gpus` as it hands them out, and supports fractional allocation (less than one whole GPU). Two environment variables alter its behavior:

- `BENTOML_DISABLE_GPU_ALLOCATION` — disables automatic allocation entirely.
- `CUDA_VISIBLE_DEVICES` — also disables automatic allocation, leaving it to the user.

If the request exceeds remaining capacity, a `ResourceWarning` is emitted, and the counter is clamped to zero rather than going negative.

Source: [src/_bentoml_impl/server/allocator.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/server/allocator.py)
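A minimal sketch of the opt-out check, assuming that the mere presence of either variable is enough to disable automatic allocation. `is_allocation_disabled` is a hypothetical helper, and the real allocator may inspect the values rather than just their presence.

```python
from __future__ import annotations

import os
from collections.abc import Mapping

# Variable names taken from the list above.
DISABLE_VARS = ("BENTOML_DISABLE_GPU_ALLOCATION", "CUDA_VISIBLE_DEVICES")


def is_allocation_disabled(env: Mapping[str, str] | None = None) -> bool:
    # Fall back to the process environment when no mapping is supplied.
    if env is None:
        env = os.environ
    # Assumption: a variable counts as "set" even when its value is empty.
    return any(name in env for name in DISABLE_VARS)
```

Passing the environment as a parameter keeps the check easy to test without mutating `os.environ`.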

### 3.2 Client-Side Integration Surface

Once a Bento is deployed, the HTTP client is the primary way framework users invoke the service. `HTTPClient` (in `_bentoml_impl/client/http.py`) is a generic `httpx`-backed client that introspects the service's `IODescriptor` endpoints — captured as `ClientEndpoint` records with `input_spec`, `output_spec`, `stream_output`, and `is_task` flags. This is what enables strongly-typed client stubs generated from a service's Python type hints, regardless of the framework the underlying model uses.

Source: [src/_bentoml_impl/client/http.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/client/http.py)

### 3.3 Project-Level Commands

The `README.md` documents the canonical developer workflow that ties Model Store, build, and containerization together:

```bash
bentoml build              # assemble Bento
bentoml containerize <tag> # produce OCI image
docker run --rm -p 3000:3000 <tag>
bentoml cloud login        # optional: BentoCloud
bentoml deploy             # optional: deploy
```

Release `v1.4.39` hardened the Model Store by preventing symlink traversal during file copies, which is relevant for any framework integration that stores weights in symlinked cache directories.

Source: [README.md](https://github.com/bentoml/BentoML/blob/main/README.md)

## 4. Common Failure Modes and Migration Notes

Three recurring pain points are visible in the community and the release notes:

1. **Old IO imports.** `from bentoml.io import JSON` emits a `BentoMLDeprecationWarning` since v1.4. The fix is to migrate to the new SDK IO types in `_bentoml_sdk`. Issue #5365 reports confusion from PyTorch tutorials still using the old path.
2. **Per-dependency resolution.** Issue #5466 requests `pylock.toml` support so dependency resolution can be configured per-package — the current options cannot express that granularity.
3. **Iterator return annotations.** Issue #5625 shows `IODescriptor.from_output()` raising `IndexError` on bare (unparameterized) iterator annotations like `t.Iterator` or `t.Generator`. Parameterize them (`t.Iterator[int]`) to work around.

Each of these is a boundary case where the Model Store / build / framework contract is in flux, and they are the most likely sources of integration friction in the current release line.
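The exact code path inside `IODescriptor.from_output()` is not reproduced here. The standard-library behavior below shows one plausible reason a bare iterator annotation can end in an `IndexError`: `typing.get_args` returns an empty tuple for unparameterized generics, so indexing the first argument fails. `element_type` is a hypothetical helper that guards against that case.

```python
from __future__ import annotations

import typing as t


def element_type(annotation: t.Any) -> t.Any:
    # get_args() yields () for bare generics such as t.Iterator.
    args = t.get_args(annotation)
    if not args:
        # Without this guard, args[0] would raise IndexError.
        raise TypeError(
            f"{annotation!r} must be parameterized, e.g. Iterator[int]"
        )
    # For Iterator[X] and Generator[X, S, R] the yielded type comes first.
    return args[0]
```

With this guard, `element_type(t.Iterator[int])` yields `int`, while a bare `t.Iterator` fails with a message that names the fix.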

## See Also

- [BentoML Documentation — Model Store](https://docs.bentoml.com/en/latest/build-with-bentoml/model-loading-and-management.html)
- [BentoML Documentation — Hello World](https://docs.bentoml.com/en/latest/get-started/hello-world.html)
- [BentoML Releases](https://github.com/bentoml/BentoML/releases)
- Related wiki page: *Services, Runners, and IO Descriptors* (descriptor protocol for `Model[T]`)
- Related wiki page: *Container Frontends and Docker Build* (Dockerfile options, `buildx` migration)

---

<a id='page-4'></a>

## Deployment, Containerization, BentoCloud, and Operations

### Related Pages

Related topics: [Service Definition, IO Types, and API Protocols (HTTP, gRPC, SSE)](#page-2), [Model Store, Bento Build, and Framework Integrations](#page-3)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/bentoml/BentoML/blob/main/README.md)
- [src/bentoml/_internal/utils/buildx.py](https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/utils/buildx.py)
- [src/_bentoml_impl/server/allocator.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/server/allocator.py)
- [src/bentoml/_internal/utils/circus/watchfilesplugin.py](https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/utils/circus/watchfilesplugin.py)
- [src/bentoml/_internal/utils/dotenv.py](https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/utils/dotenv.py)
- [src/bentoml/_internal/utils/analytics/cli_events.py](https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/utils/analytics/cli_events.py)
- [src/_bentoml_impl/client/http.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/client/http.py)
- [src/bentoml/_internal/utils/cattr.py](https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/utils/cattr.py)
- [src/_bentoml_sdk/models/base.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_sdk/models/base.py)
- [examples/README.md](https://github.com/bentoml/BentoML/blob/main/examples/README.md)
</details>

# Deployment, Containerization, BentoCloud, and Operations

BentoML provides an end-to-end path from a Python inference script to a reproducible deployable artifact. This page documents the deployment surface area: how a service is packaged into a Bento, how that Bento is turned into a Docker image, how resources (especially GPUs) are allocated at runtime, how the HTTP client talks to a running deployment, and the operational tooling (dev-mode reload, telemetry, dotenv parsing) that surrounds these flows. Sources throughout this page are taken directly from the repository's main branch as of release v1.4.39 ([v1.4.39 release notes](https://github.com/bentoml/BentoML/releases/tag/v1.4.39)).

## 1. The Bento → Container → Cloud Pipeline

The deployment story is described at a high level in the project README:

> `bentoml build` to package necessary code, models, dependency configs into a Bento — the standardized deployable artifact in BentoML.
> Generate a Docker container image for deployment: `bentoml containerize summarization:latest`.
> Run the generated image: `docker run --rm -p 3000:3000 summarization:latest`.
> Deploy from current directory: `bentoml deploy`.

Source: [README.md](https://github.com/bentoml/BentoML/blob/main/README.md)

The pipeline is intentionally the same regardless of whether the target is a local Docker daemon or BentoCloud: produce a deterministic, versioned Bento first, then turn it into a container image, then schedule it. The `BentoInfo` / `BentoInfoV2` schemas (read in [`src/bentoml/_internal/utils/analytics/cli_events.py`](https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/utils/analytics/cli_events.py)) distinguish the legacy V1 format (with a `runners` list) from the V2 service-oriented format (with a `services` map), and the analytics emitter counts `num_of_runners` accordingly.

```mermaid
flowchart LR
    A[Service script<br/>bentoml.Service] --> B[bentoml build<br/>produces a Bento]
    B --> C[bentoml containerize<br/>or bentoml.container.build]
    C --> D[Docker daemon<br/>docker run -p 3000:3000]
    C --> E[BentoCloud<br/>bentoml deploy]
    D --> F[HTTP client<br/>HTTPClient]
    E --> F
```

The internal container subsystem exposes a backend abstraction (`get_backend("buildx")`) so the legacy `bentoml._internal.utils.buildx` module can remain a thin compatibility shim for `bentoctl`:

> This module is shim for bentoctl. NOT FOR DIRECT USE. Make sure to use `bentoml.container.build` and `bentoml.container.health` instead.

Source: [src/bentoml/_internal/utils/buildx.py](https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/utils/buildx.py)

The shim re-keys `tags → tag` and pops `subprocess_env` before delegating to the registered backend, making the new container API the only forward-compatible entry point.

## 2. Local Containerization, Dev Mode, and Environment

Two operational conveniences sit around the build/run pipeline: dotenv-based configuration and dev-mode file watching.

`dotenv.py` parses `KEY=VALUE` and `export KEY=VALUE` lines, and also accepts `:` as a separator (`KEY: VALUE`), so a `.env` file can supply secrets, registry credentials, and per-deployment toggles without code changes. The parser tolerates whitespace and the optional `export` prefix, following the conventions of `django-dotenv`, the upstream project this module was ported from.

Source: [src/bentoml/_internal/utils/dotenv.py](https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/utils/dotenv.py)
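The line grammar described above can be approximated with one regular expression. This is a simplified sketch, not the module's parser: `parse_dotenv` is a hypothetical name, and the comment and quote handling are assumptions about typical `.env` conventions.

```python
from __future__ import annotations

import re

_LINE = re.compile(
    r"""
    ^\s*(?:export\s+)?                  # optional "export" prefix
    (?P<key>[A-Za-z_][A-Za-z0-9_]*)     # variable name
    \s*[=:]\s*                          # "=" or ":" separator
    (?P<value>.*?)\s*$                  # value, trailing whitespace trimmed
    """,
    re.VERBOSE,
)


def parse_dotenv(text: str) -> dict[str, str]:
    env: dict[str, str] = {}
    for line in text.splitlines():
        stripped = line.strip()
        # Assumption: blank lines and "#" comments are skipped.
        if not stripped or stripped.startswith("#"):
            continue
        match = _LINE.match(line)
        if match is None:
            continue
        value = match.group("value")
        # Assumption: a matching pair of surrounding quotes is removed.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        env[match.group("key")] = value
    return env
```

The function returns a plain dictionary rather than mutating `os.environ`, so the caller decides whether file values override existing variables.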

In development, [`watchfilesplugin.py`](https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/utils/circus/watchfilesplugin.py) wraps the `watchfiles` change stream into a [Circus](https://circus.readthedocs.io/) plugin that restarts workers when any file matched by `BentoBuildConfig.include` changes. The plugin evaluates `BentoPathSpec(build_config.include, build_config.exclude, working_dir)` once and filters each event against that spec before issuing `restart name="*"` to the supervisor. This is what powers the reload-on-change behavior of `bentoml serve --reload`.
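The filter-then-restart decision can be sketched as follows. This uses `fnmatch` globbing as a stand-in for the path-spec matching the plugin performs, so pattern semantics differ from the real implementation; `is_watched` and `needs_restart` are hypothetical names.

```python
from __future__ import annotations

from fnmatch import fnmatch
from typing import Iterable, Sequence


def is_watched(path: str, include: Sequence[str], exclude: Sequence[str]) -> bool:
    # A path is watched when it matches an include pattern and no exclude pattern.
    included = any(fnmatch(path, pattern) for pattern in include)
    excluded = any(fnmatch(path, pattern) for pattern in exclude)
    return included and not excluded


def needs_restart(
    changed_paths: Iterable[str],
    include: Sequence[str],
    exclude: Sequence[str],
) -> bool:
    # One relevant change in a batch is enough to restart the workers.
    return any(is_watched(path, include, exclude) for path in changed_paths)
```

Filtering before restarting avoids reload loops triggered by files the build would never package, such as virtual environments or caches.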

## 3. Resource Allocation at Runtime

Once a Bento is running, the API server needs to know which GPU(s) it may bind to. The `ResourceAllocator` in [`allocator.py`](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/server/allocator.py) drives this:

- It inspects `system_resources()["nvidia.com/gpu"]` to learn how many NVIDIA devices are visible.
- Each GPU is modeled as `(remaining_fraction, unit)` so fractional allocation (e.g. `0.5` of a GPU per runner) is supported.
- `gpu_allocation_disabled()` returns `True` if either `BENTOML_DISABLE_GPU_ALLOCATION` or `CUDA_VISIBLE_DEVICES` is set in the environment — in that mode the allocator yields control back to the user.
- If a request exceeds the remaining budget, a `ResourceWarning` is emitted telling the operator to set `BENTOML_DISABLE_GPU_ALLOCATION=1` and allocate GPUs manually.

Source: [src/_bentoml_impl/server/allocator.py](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/server/allocator.py)

This design lets a single Bento serve multiple models that share one or more GPUs without an external scheduler, while still permitting the well-known `CUDA_VISIBLE_DEVICES` escape hatch for fine-grained control.
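To make the fractional bookkeeping concrete, here is a deliberately simplified pool. It is not the actual `ResourceAllocator`: `GpuPool` and `assign` are hypothetical names, and the first-fit placement policy is an assumption chosen for brevity.

```python
from __future__ import annotations

import warnings


class GpuPool:
    """Hypothetical, simplified pool that hands out whole or fractional GPUs."""

    def __init__(self, gpu_count: int) -> None:
        self.remaining = float(gpu_count)
        # Remaining fraction of each device, indexed by GPU id.
        self._free = [1.0] * gpu_count

    def assign(self, count: float) -> list[int]:
        if count > self.remaining:
            warnings.warn(
                f"Requested {count} GPUs, only {self.remaining} remaining.",
                ResourceWarning,
                stacklevel=2,
            )
        # Clamp at zero so the counter never goes negative.
        self.remaining = max(0.0, self.remaining - count)
        if count < 1:
            # First fit: share the first device with enough free capacity.
            for index, free in enumerate(self._free):
                if free >= count:
                    self._free[index] = free - count
                    return [index]
            return []
        # Whole GPUs: take untouched devices only.
        chosen = [i for i, free in enumerate(self._free) if free == 1.0]
        chosen = chosen[: int(count)]
        for index in chosen:
            self._free[index] = 0.0
        return chosen
```

In this sketch two workers that each request `0.5` land on the same device, and a third request triggers the warning and receives no device.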

## 4. HTTP Client, Telemetry, and BentoCloud Operations

The `HTTPClient` in [`src/_bentoml_impl/client/http.py`](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/client/http.py) is the primary way to invoke a deployed Bento. It is parameterized over both `httpx.Client` (sync) and `httpx.AsyncClient`, implements automatic retries up to `MAX_RETRIES = 3`, and maps HTTP status codes to typed `BentoMLException` subclasses via `map_exception` (defined in [`base.py`](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_impl/client/base.py)). Each `ClientEndpoint` carries `input_spec` / `output_spec` `IODescriptor` classes so the SDK can serialize Python types correctly.
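The retry and status-mapping behavior can be illustrated without a network. Everything below is a hypothetical sketch: the exception classes, `error_for_status`, and `call_with_retries` are invented for illustration, only the `MAX_RETRIES = 3` value comes from the text, and retrying on `ConnectionError` alone is an assumption.

```python
from __future__ import annotations

from typing import Callable

MAX_RETRIES = 3  # value quoted in the paragraph above


class ClientError(Exception):
    """Hypothetical base class for mapped HTTP failures."""


class BadInput(ClientError):
    pass


class NotFound(ClientError):
    pass


class ServiceUnavailable(ClientError):
    pass


_STATUS_TO_ERROR = {400: BadInput, 404: NotFound, 503: ServiceUnavailable}


def error_for_status(status: int, message: str) -> ClientError:
    # Unknown status codes fall back to the generic base class.
    return _STATUS_TO_ERROR.get(status, ClientError)(message)


def call_with_retries(send: Callable[[], tuple[int, str]]) -> str:
    last_error: Exception | None = None
    # Assumption: MAX_RETRIES bounds the total number of attempts.
    for _ in range(MAX_RETRIES):
        try:
            status, body = send()
        except ConnectionError as exc:
            # Transport failures are retried; HTTP errors are not.
            last_error = exc
            continue
        if status >= 400:
            raise error_for_status(status, body)
        return body
    raise ServiceUnavailable(f"gave up after {MAX_RETRIES} attempts") from last_error
```

Typed exceptions let callers handle a missing route differently from a temporarily unavailable service without parsing status codes themselves.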

Operational telemetry is emitted by [`cli_events.py`](https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/utils/analytics/cli_events.py) under the `cli_events_map["bentos"]["build"]` key. A `BentoBuildEvent` records `bento_creation_timestamp`, `bento_size_in_kb`, `model_size_in_kb`, `num_of_models`, `num_of_runners`, and `model_types`, giving the BentoML team aggregate insight into how the build pipeline is used. The model metadata itself is round-tripped through [`bentoml_cattr`](https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/utils/cattr.py), a `cattrs` `Converter` with `omit_if_default=True` and a custom `datetime` ISO-8601 hook so that `BentoInfo` / `ModelInfo` instances can be serialized losslessly.

For cloud deployments, the README points operators at the `bentoml cloud login` / `bentoml deploy` workflow and the BentoCloud web UI. Because deployments are built from the same Bento artifact produced locally, the same Model Store entry — including `HuggingFaceModel` references from [`src/_bentoml_sdk/models/huggingface.py`](https://github.com/bentoml/BentoML/blob/main/src/_bentoml_sdk/models/huggingface.py) which resolve `model_id` + `revision` against the Hugging Face Hub — can be shipped unchanged between a local Docker run and BentoCloud.

## 5. Common Failure Modes and Community Notes

Several issues documented in the community affect the deployment surface:

- **Container security**: v1.4.34 fixed a security issue when resolving user-supplied file paths ([v1.4.34 release](https://github.com/bentoml/BentoML/releases/tag/v1.4.34)).
- **Symlink traversal in BentoStore**: v1.4.39 added a guard so `copy_model` no longer follows symlinks out of the store ([v1.4.39 release notes](https://github.com/bentoml/BentoML/releases/tag/v1.4.39)).
- **Multi-arch BuildKit caching**: v1.4.39 switched cache mounts to `sharing=locked` to avoid corruption on parallel multi-arch builds ([v1.4.39 release notes](https://github.com/bentoml/BentoML/releases/tag/v1.4.39)).
- **HuggingFace loading in containers**: v1.4.37 fixed `bentoml.models.HuggingFaceModel` resolution when running inside the generated image ([v1.4.37 release notes](https://github.com/bentoml/BentoML/releases/tag/v1.4.37)).
- **CPU worker floor**: v1.4.32 ensures at least one CPU worker is scheduled even when `resources.cpu == 0` ([v1.4.32 release notes](https://github.com/bentoml/BentoML/releases/tag/v1.4.32)).
- **Bundling arbitrary local files**: a long-standing feature request (#685) asks for first-class inclusion of non-Python assets beyond `build_config.include` / `build_config.exclude`.
- **Server-Sent Events**: #3743 requests streaming SSE responses from the API server, which would benefit LLM chat and live-monitoring use cases.
- **gRPC transport**: #703 requests native gRPC support for the API server, complementing the existing `HTTPClient`.

## See Also

- [Hello World example](https://docs.bentoml.com/en/latest/get-started/hello-world.html) — end-to-end `bentoml build → containerize → run` walkthrough.
- [Model loading and Model Store](https://docs.bentoml.com/en/latest/build-with-bentoml/model-loading-and-management.html) — how the Model Store feeds the Bento artifact.
- [Distributed serving systems](https://docs.bentoml.com/en/latest/build-with-bentoml/distributed-services.html) — multi-runner composition relevant to `ResourceAllocator`.
- [BentoCloud deployment](https://docs.bentoml.com/en/latest/get-started/cloud-deployment.html) — managed deployment counterpart to local Docker runs.
- [Contributing Guide](https://github.com/bentoml/BentoML/blob/main/CONTRIBUTING.md) and [Development Guide](https://github.com/bentoml/BentoML/blob/main/DEVELOPMENT.md) for working on the deployment subsystems themselves.

---


## Pitfall Log

Project: bentoml/BentoML

Summary: Found 9 structured pitfall items, including 2 high/blocking items. Top priority: Installation risk - Installation risk requires verification.

## 1. Installation risk - Installation risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/bentoml/BentoML/issues/5466

## 2. Configuration risk - Configuration risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/bentoml/BentoML/issues/5365

## 3. Capability evidence risk - Capability evidence risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: capability.assumptions | github_repo:178976529 | https://github.com/bentoml/BentoML

## 4. Runtime risk - Runtime risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/bentoml/BentoML/issues/5625

## 5. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | github_repo:178976529 | https://github.com/bentoml/BentoML

## 6. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: downstream_validation.risk_items | github_repo:178976529 | https://github.com/bentoml/BentoML

## 7. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: risks.scoring_risks | github_repo:178976529 | https://github.com/bentoml/BentoML

## 8. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: issue_or_pr_quality=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | github_repo:178976529 | https://github.com/bentoml/BentoML

## 9. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: release_recency=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | github_repo:178976529 | https://github.com/bentoml/BentoML

