# https://github.com/NotJoeMartinez/yt-fts Project Manual

Generated at: 2026-06-27 01:03:25 UTC

## Table of Contents

- [Project Overview and Architecture](#page-1)
- [Download, Ingestion, and Channel Management](#page-2)
- [Full Text and Semantic Search](#page-3)
- [AI Features: Embeddings, RAG Chat, and Summarization](#page-4)

<a id='page-1'></a>

## Project Overview and Architecture

### Related Pages

Related topics: [Download, Ingestion, and Channel Management](#page-2), [Full Text and Semantic Search](#page-3), [AI Features: Embeddings, RAG Chat, and Summarization](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/NotJoeMartinez/yt-fts/blob/main/README.md)
- [CHANGELOG.md](https://github.com/NotJoeMartinez/yt-fts/blob/main/CHANGELOG.md)
- [src/yt_fts/yt_fts.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/yt_fts.py)
- [src/yt_fts/search.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/search.py)
- [src/yt_fts/download/download_handler.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/download/download_handler.py)
- [src/yt_fts/llm/get_embeddings.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/llm/get_embeddings.py)
- [src/yt_fts/llm/chatbot.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/llm/chatbot.py)
- [src/yt_fts/llm/summarize.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/llm/summarize.py)
- [src/yt_fts/utils.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/utils.py)
</details>

# Project Overview and Architecture

## Purpose and Scope

`yt-fts` is a command-line tool that downloads YouTube channel subtitles, indexes them into a local SQLite full-text search database, and exposes them for fast keyword search, semantic (vector) search, and LLM-powered Q&A. The project targets users who want to search across large video archives (channels or playlists) by spoken content rather than by titles or metadata.

The tool is distributed as a Click-based CLI (the `yt-fts` command) and currently sits at version `0.1.62` per [CHANGELOG.md](https://github.com/NotJoeMartinez/yt-fts/blob/main/CHANGELOG.md). Core capabilities advertised in the [README.md](https://github.com/NotJoeMartinez/yt-fts/blob/main/README.md) include:

- `download` — fetch VTT subtitles for a channel, playlist, or single video via `yt-dlp`.
- `search` / `vsearch` — full-text and semantic (vector) search over the downloaded transcripts.
- `export` — export a channel's subtitles to disk (TXT/VTT) and export search results to CSV.
- `embeddings` — generate vector embeddings for a channel to enable `vsearch`.
- `summarize` — produce an LLM summary of a single video with timestamped URLs.
- `llm` — interactive chat that answers questions using retrieved transcript context.
- `list` / `delete` / `update` — channel lifecycle management.
- `config` — print configuration paths.

A frequent community pain point (see issue [#46](https://github.com/NotJoeMartinez/yt-fts/issues/46)) is that the wrong channel can be downloaded when a user pastes a handle URL like `@TomScottGo`; the channel id is resolved from the URL and used as the canonical scope, so the resolution logic is central to the architecture.

## High-Level Architecture

The codebase is organized around a thin CLI layer in [src/yt_fts/yt_fts.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/yt_fts.py) that dispatches to one of several handler classes. Each handler encapsulates a single concern and persists results into either SQLite (full-text search) or ChromaDB (vector search).

```mermaid
flowchart LR
    A[CLI: yt-fts] --> B[yt_fts.py<br/>Click commands]
    B --> D[DownloadHandler<br/>download]
    B --> S[SearchHandler<br/>search / vsearch]
    B --> E[ExportHandler<br/>export]
    B --> Em[EmbeddingsHandler<br/>embeddings]
    B --> Su[SummarizeHandler<br/>summarize]
    B --> L[LLMHandler<br/>llm]
    B --> C[config / list / delete / update]

    D -->|yt-dlp + VTT| DB[(SQLite FTS5<br/>channels, videos, subs)]
    Em -->|split + embed| CH[(ChromaDB<br/>vector store)]
    S --> DB
    S --> CH
    L --> CH
    L -->|chat completions| API[(OpenAI / Gemini)]
    Su -->|chat completions| API
    Su --> DB
```

The two persistence backends are intentionally split:

- **SQLite** stores raw subtitles and provides `FTS5` full-text search. It is queried by the `search` command and is also the source of truth for channel/video metadata (titles, dates, channel ids).
- **ChromaDB** stores per-channel vector embeddings and is queried only by `vsearch`, `llm`, and `summarize` workflows. Embeddings must be generated separately via the `embeddings` command — a channel that has been embedded is marked `(ss)` in `list_channels` output.

This split is reflected in the module layout: SQL-only code lives in [src/yt_fts/search.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/search.py) and `db_utils.py`, while vector/embedding/LLM code is isolated under [src/yt_fts/llm/](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/llm/).

## Core Components

### CLI Entry Point

The Click group is declared in [src/yt_fts/yt_fts.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/yt_fts.py). Each `@cli.command` block is intentionally thin: it parses arguments, constructs a handler, and delegates. For example, the `download` command accepts `url`, `--playlist`, `--language`, `--jobs` (default `8`), and `--cookies-from-browser`, then instantiates `DownloadHandler`. Higher-level commands (`summarize`, `llm`) additionally resolve a model via `get_model_config()` and build an `OpenAI` client before handing off.

### Download Subsystem

[src/yt_fts/download/download_handler.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/download/download_handler.py) wraps `yt_dlp.YoutubeDL` to extract WebVTT subtitles (`writeautomaticsub=True`, `skip_download=True`, `subtitleslangs=['en', '-live_chat']`). A user-agent pool is randomly selected per request — this was added in v0.1.62 alongside retry logic, per [CHANGELOG.md](https://github.com/NotJoeMartinez/yt-fts/blob/main/CHANGELOG.md) (`"Fix download only one, randomize user agents"`). Default parallel job count was raised from 1 to 8 in v0.1.60 for faster ingestion.

Community issue [#183](https://github.com/NotJoeMartinez/yt-fts/issues/183) ("Requested format not available") is rooted in this subsystem: it occurs when `yt-dlp` cannot negotiate a subtitle stream for the requested language. The handler mitigates this by probing `info.get('subtitles')` and `info.get('automatic_captions')` and aborting with a clear warning when both are empty, rather than failing later.

### Search Subsystem

[src/yt_fts/search.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/search.py) implements `SearchHandler`, which supports three scopes: `all`, `channel`, and `video`. The scope is chosen from CLI flags; the handler resolves a channel id (if needed) via `get_channel_id_from_input()` and then queries the SQLite FTS index. The handler also formats results with bolded query matches using `bold_query_matches` and can write a CSV via `--export`.

Issue [#189](https://github.com/NotJoeMartinez/yt-fts/issues/189) ("Search results show only one quote per video") pointed at a tuple-shape mismatch in the FTS result unpacking on line 174 of `search.py` — a useful illustration that the result-row contract between `db_utils.search_*` and `SearchHandler` is fragile and worth attention when modifying either side.

### Embeddings and Vector Subsystem

[src/yt_fts/llm/get_embeddings.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/llm/get_embeddings.py) defines `EmbeddingsHandler.add_embeddings_to_chroma()`. It fetches every video id for a channel, splits each transcript into chunks of `interval` seconds (default 30, configurable via `--interval`), prefixes each chunk with metadata (channel name, video title, date), and writes the resulting vectors plus metadata into ChromaDB. The metadata fields (`channel_name`, `video_title`, `video_date`, `video_id`, `start`, `subs`) are the same fields later used by `LLMHandler.format_context()` to build prompt context.

### LLM Subsystem

Two LLM-driven handlers live under [src/yt_fts/llm/](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/llm/):

- `chatbot.py` (`LLMHandler`) — runs an interactive REPL. On each turn it builds a prompt from the top-k ChromaDB hits; if the model replies with "I don't know", it asks the model to generate a follow-up retrieval query and re-issues the search, then continues the conversation.
- `summarize.py` (`SummarizeHandler`) — pulls a single video's transcript (from the DB if present, otherwise via a one-shot `yt-dlp` call) and asks the chat model for a structured summary with timestamped `youtu.be/<id>?t=<sec>` links.

Both rely on `get_model_config()` in [src/yt_fts/utils.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/utils.py) to select between `OPENAI` (`text-embedding-ada-002` + `gpt-4o`) and `GEMINI` (`text-embedding-004` + `gemini-2.5-flash`), dispatching on the API key prefix (`sk-` vs `AIza`) or the matching `OPENAI_API_KEY` / `GEMINI_API_KEY` environment variable.

## Configuration and Model Selection

Configuration is path-centric rather than file-centric: `config` simply prints the database, ChromaDB, and config directories. There is no persistent config file for model selection — model choice is implicit from the API key. The supported matrix is:

| Provider | Detection | Embedding Model | Chat Model |
|----------|-----------|-----------------|------------|
| OpenAI | `sk-...` key or `OPENAI_API_KEY` env | `text-embedding-ada-002` | `gpt-4o` |
| Gemini | `AIza...` key or `GEMINI_API_KEY` env | `text-embedding-004` | `gemini-2.5-flash` |

Source: [src/yt_fts/utils.py:Model](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/utils.py).

## Common Failure Modes and Limitations

- **Wrong channel downloaded** ([#46](https://github.com/NotJoeMartinez/yt-fts/issues/46)): handle-based URLs can resolve to a different channel id than the user expects; verify the resolved id before downloading large channels.
- **Requested format not available** ([#183](https://github.com/NotJoeMartinez/yt-fts/issues/183)): fixed by retrying or by extracting cookies from the browser via `--cookies-from-browser`.
- **Scoped search** ([#47](https://github.com/NotJoeMartinez/yt-fts/issues/47)): the current `search`/`vsearch` scopes are `all`, `channel`, and `video`; there is no regex-over-titles filter, so users with very large channels must fall back to per-video lookups.
- **Search result cardinality** ([#189](https://github.com/NotJoeMartinez/yt-fts/issues/189)): row unpacking between `db_utils` and `SearchHandler` must stay aligned — a regression here collapses multi-quote results to one row per video.

## See Also

- Download and ingestion details: `DownloadHandler` in [src/yt_fts/download/download_handler.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/download/download_handler.py)
- Search and ranking: `SearchHandler` in [src/yt_fts/search.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/search.py)
- Embeddings and vector store: [src/yt_fts/llm/get_embeddings.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/llm/get_embeddings.py)
- LLM chat and summarization: [src/yt_fts/llm/chatbot.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/llm/chatbot.py), [src/yt_fts/llm/summarize.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/llm/summarize.py)
- Version history: [CHANGELOG.md](https://github.com/NotJoeMartinez/yt-fts/blob/main/CHANGELOG.md)

---

<a id='page-2'></a>

## Download, Ingestion, and Channel Management

### Related Pages

Related topics: [Project Overview and Architecture](#page-1), [Full Text and Semantic Search](#page-3)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [src/yt_fts/yt_fts.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/yt_fts.py)
- [src/yt_fts/download/download_handler.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/download/download_handler.py)
- [src/yt_fts/utils.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/utils.py)
- [src/yt_fts/db_utils.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/db_utils.py)
- [src/yt_fts/config.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/config.py)
- [src/yt_fts/search.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/search.py)
- [CHANGELOG.md](https://github.com/NotJoeMartinez/yt-fts/blob/main/CHANGELOG.md)
- [README.md](https://github.com/NotJoeMartinez/yt-fts/blob/main/README.md)
</details>

# Download, Ingestion, and Channel Management

## Overview and Purpose

`yt-fts` is a command-line tool that uses `yt-dlp` to scrape a YouTube channel's subtitle tracks and load them into a local SQLite database, making them searchable from the terminal. The download and ingestion pipeline is the entry point of every other feature in the project: semantic search (`vsearch`), the LLM/RAG chat bot (`llm`), and video summarization (`summarize`) all depend on rows that originate from the `download` command. Source: [src/yt_fts/yt_fts.py:1-40](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/yt_fts.py) and [README.md:1-20](https://github.com/NotJoeMartinez/yt-fts/blob/main/README.md).

Channel management covers the full lifecycle of a channel in the local database: creating a new channel record via `download`, refreshing it with `update`, inspecting it with `list`, and removing it with `delete`. These commands are implemented as a Click group in `yt_fts.py` and delegate to `DownloadHandler` and database utilities in `db_utils.py`.

## The Download Pipeline

The `download` command is defined as a Click subcommand in [src/yt_fts/yt_fts.py:25-60](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/yt_fts.py). It accepts a channel or playlist URL plus a language flag, a parallelism flag (`-j/--jobs`, default 8), an optional `--cookies-from-browser` flag for rate-limit bypass, and a random user-agent switch that was introduced in v0.1.62. Source: [CHANGELOG.md:18-24](https://github.com/NotJoeMartinez/yt-fts/blob/main/CHANGELOG.md).

Internally the command instantiates `DownloadHandler` from [src/yt_fts/download/download_handler.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/download/download_handler.py). The handler is responsible for:

1. Resolving the human-friendly URL (`/@handle`, `/channel/UC...`, `/playlist?list=...`) into a stable channel identifier using helpers in `db_utils.py`.
2. Calling `yt-dlp` against the channel's `/videos` tab in parallel workers.
3. Parsing the resulting `vtt` subtitle files with `parse_vtt` from [src/yt_fts/utils.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/utils.py) and inserting one row per caption cue into the `subtitles` table.
4. Storing video metadata (title, upload date, duration) in the `videos` table and a row in the `channels` table for the channel itself.

The architecture can be visualized as follows:

```mermaid
flowchart LR
    A[CLI: yt-fts download] --> B[DownloadHandler]
    B --> C[yt-dlp workers in parallel]
    C --> D[VTT subtitle files]
    D --> E[parse_vtt in utils.py]
    E --> F[(SQLite: channels, videos, subtitles)]
    F --> G[search / vsearch / summarize / llm]
```

A reliability improvement shipped in v0.1.60 changed `DownloadHandler` so that running `download` against an already-ingested channel no longer exits with an error — it now transparently routes the call into the `update` flow. Source: [CHANGELOG.md:8-15](https://github.com/NotJoeMartinez/yt-fts/blob/main/CHANGELOG.md).

## Channel Management Commands

Beyond `download`, the Click group exposes four management commands, all declared in [src/yt_fts/yt_fts.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/yt_fts.py):

| Command | Purpose | Key behavior |
| --- | --- | --- |
| `update` | Refresh subtitles and metadata for a previously downloaded channel | Reuses `DownloadHandler`; can target a single channel or, by default since v0.1.57, all channels. Source: [CHANGELOG.md:30-34](https://github.com/NotJoeMartinez/yt-fts/blob/main/CHANGELOG.md) |
| `list` | Print every channel currently in the database | Appends `(ss)` next to a channel name when embeddings exist for it. Source: [README.md:90-100](https://github.com/NotJoeMartinez/yt-fts/blob/main/README.md) |
| `delete` | Remove a channel and all of its subtitles after a confirmation prompt | Calls `delete_channel` in `db_utils.py`. Source: [src/yt_fts/yt_fts.py:110-130](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/yt_fts.py) |
| `config` | Print the on-disk locations of the SQLite database, the Chroma vector store, and the config file | Values are read from [src/yt_fts/config.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/config.py) via `get_db_path`, `get_or_make_chroma_path`, and `get_config_path` |

The `list` output is the canonical way to verify what was actually ingested. If a channel name looks wrong — for example, downloading `https://youtube.com/@TomScottGo` and ending up with `@tomscottplus` (community issue #46) — the URL-to-channel-id resolver is almost always the cause, and inspecting the entry with `yt-fts list` is the first diagnostic step.

## Configuration and Common Failure Modes

Configuration is read from a platform-appropriate user directory by [src/yt_fts/config.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/config.py). The same module produces the SQLite path used by `download`/`update`/`list` and the Chroma path used by the embeddings subsystem. Source: [src/yt_fts/yt_fts.py:140-150](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/yt_fts.py).

Three recurring failure modes show up in community discussions and are worth calling out:

- **"Requested Format Not Available"** (issue #183). This is raised by `yt-dlp` when the channel has no manually uploaded subtitles in the requested language. Re-run with `-l` set to a language that the channel actually publishes (`auto` is often the only option for auto-generated captions) and, if it persists, pass `--cookies-from-browser` to bypass bot detection. Source: [src/yt_fts/yt_fts.py:30-45](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/yt_fts.py) and [CHANGELOG.md:30-34](https://github.com/NotJoeMartinez/yt-fts/blob/main/CHANGELOG.md).
- **"Wrong channel downloaded"** (issue #46). YouTube serves several handles that look similar (`@TomScottGo` vs `@tomscottplus`); the channel-id resolver in `db_utils.py` will pick whatever handle the URL canonicalizes to, with no fuzzy matching. Always confirm the URL before downloading, and use `yt-fts list` to verify after the fact.
- **No rows after a successful download** (v0.1.55 changelog). This was a regression where ingested cues failed to be persisted; it was fixed by adjusting the download/commit boundary. If you see an empty database after `download`, ensure you are on >= 0.1.55 and that the process was not killed mid-write. Source: [CHANGELOG.md:50-55](https://github.com/NotJoeMartinez/yt-fts/blob/main/CHANGELOG.md).

Two open feature requests also touch the ingestion boundary: issue #47 asks for a regex filter on video titles so that a user could ingest only a subset of a large channel, and issue #60 asks for a built-in way to download the audio or video clip surrounding a search hit instead of having to shell out to `yt-dlp` and `ffmpeg`. Neither is implemented in the current source tree, so today the workaround is the manual `yt-dl` + `ffmpeg` pipeline described in #60.

## See Also

- [README.md](https://github.com/NotJoeMartinez/yt-fts/blob/main/README.md) — full command reference for `download`, `update`, `list`, `delete`, and `config`.
- [CHANGELOG.md](https://github.com/NotJoeMartinez/yt-fts/blob/main/CHANGELOG.md) — release notes for download-related fixes in v0.1.55, v0.1.57, v0.1.60, and v0.1.62.
- [src/yt_fts/search.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/search.py) — downstream consumer of the `subtitles` table produced by `download`.
- [src/yt_fts/llm/get_embeddings.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/llm/get_embeddings.py) — consumes the same database to build Chroma embeddings used by `vsearch` and `llm`.

---

<a id='page-3'></a>

## Full Text and Semantic Search

### Related Pages

Related topics: [Project Overview and Architecture](#page-1), [AI Features: Embeddings, RAG Chat, and Summarization](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [src/yt_fts/search.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/search.py)
- [src/yt_fts/db_utils.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/db_utils.py)
- [src/yt_fts/utils.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/utils.py)
- [src/yt_fts/llm/get_embeddings.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/llm/get_embeddings.py)
- [src/yt_fts/llm/chatbot.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/llm/chatbot.py)
- [src/yt_fts/llm/summarize.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/llm/summarize.py)
- [src/yt_fts/yt_fts.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/yt_fts.py)
- [src/yt_fts/config.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/config.py)
- [CHANGELOG.md](https://github.com/NotJoeMartinez/yt-fts/blob/main/CHANGELOG.md)
- [README.md](https://github.com/NotJoeMartinez/yt-fts/blob/main/README.md)
</details>

# Full Text and Semantic Search

## Overview

yt-fts exposes two complementary search mechanisms over downloaded YouTube subtitle transcripts: a SQLite-backed **full text search** (FTS5) for exact substring matching, and a Chroma-backed **semantic search** (vector) for similarity matching using embedding models. Both are surfaced through a `SearchHandler` class that dispatches to the appropriate scope and backend [Source: [src/yt_fts/search.py:1-39]()](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/search.py).

The CLI commands `search`, `vsearch`, `llm`, and `summarize` are the primary user-facing entry points. The `vsearch` and `llm` paths require that a channel first be processed with `embeddings` so that vectors are stored in a local ChromaDB instance [Source: [README.md:130-160]()](https://github.com/NotJoeMartinez/yt-fts/blob/main/README.md).

## Full Text Search (FTS5)

The keyword search path uses SQLite's FTS5 virtual table `Subtitles_fts`, joined back to `Subtitles` and `Videos` tables, so that ranking is computed by FTS5 while metadata (timestamps, video IDs, channel IDs) is read from the relational tables [Source: [src/yt_fts/db_utils.py:1-50]()](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/db_utils.py).

### Query parsing and boolean operators

`parse_query` accepts user input and tokenizes it, preserving explicit `AND`/`OR` operators and escaping the remaining terms with FTS5 quote-escaping rules. Changelog entry 0.1.56 documents a fix where `OR`, `AND`, and quoted search expressions were not being honored, which is now resolved by routing input through `parse_query` [Source: [CHANGELOG.md:0.1.56]()](https://github.com/NotJoeMartinez/yt-fts/blob/main/CHANGELOG.md).

### Scopes

`SearchHandler.full_text_search` selects a scope from three values:

| Scope     | Backend call            | Trigger                                |
|-----------|-------------------------|----------------------------------------|
| `all`     | `search_all`            | No `--channel` flag                    |
| `channel` | `search_channel`        | `--channel` resolves to a channel ID   |
| `video`   | `search_video`          | `--video-id` restricts to a single video |

The selected call receives the FTS5 query and an optional limit, returning rows ordered by `rank` (FTS5's BM25-style score) [Source: [src/yt_fts/search.py:32-55]()](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/search.py).

### Limits, length, and timestamps

`utils.show_message` enforces a 40-character ceiling on the raw search string and surfaces a "search_too_long" error. `time_to_secs` converts VTT `HH:MM:SS` cue starts into integer seconds for use in `youtu.be/<id>?t=<secs>` links, with a deliberate 3-second backward offset so the URL points a moment before the cue begins [Source: [src/yt_fts/utils.py:1-50]()](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/utils.py).

### Known limitations

Community issue #189 reported that `search` returned at most one quote per video due to a `video_id` field omitted from a result tuple inside `search_video`; the row was deduplicated downstream in `SearchHandler` [Source: issue #189](https://github.com/NotJoeMartinez/yt-fts/issues/189). Issue #47 requests the ability to restrict results to videos whose titles match a regex, which is not currently supported by `search_channel` or `search_video` [Source: issue #47](https://github.com/NotJoeMartinez/yt-fts/issues/47).

## Semantic Search (Vector)

Vector search is opt-in per channel. The `embeddings` command runs `EmbeddingsHandler.add_embeddings_to_chroma`, which fetches every video ID for the channel, splits subtitles into time-windowed segments (default 30 seconds, configurable with `--interval`), and concatenates each segment with its `channel_name`, `video_title`, and `video_date` before embedding [Source: [src/yt_fts/llm/get_embeddings.py:1-50]()](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/llm/get_embeddings.py).

### Model selection

`get_model_config` (in `utils.py`) auto-detects the provider from the API key prefix: keys beginning with `sk-` select `OPENAI` (`text-embedding-ada-002`, `gpt-4o`, base URL `https://api.openai.com/v1`); keys beginning with `AIza` select `GEMINI` (`text-embedding-004`, `gemini-2.5-flash`, base URL `https://generativelanguage.googleapis.com/v1beta`). Gemini support was added in v0.1.64 [Source: [src/yt_fts/utils.py:1-40]()](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/utils.py), [Source: [CHANGELOG.md:0.1.64]()](https://github.com/NotJoeMartinez/yt-fts/blob/main/CHANGELOG.md).

### Querying

`vsearch` reuses the `SearchHandler` constructor; the only difference from `search` is that the OpenAI/Chroma client is constructed and the scope is forwarded to a vector query instead of an FTS query. Channel and video scopes are mirrored for consistency. A successful run appends a `(ss)` marker to the channel in `list_channels` output [Source: [README.md:160-200]()](https://github.com/NotJoeMartinez/yt-fts/blob/main/README.md).

## LLM-Augmented Workflows

Two higher-level commands build on top of vector search:

### `llm` (chatbot)

`LLMHandler` begins by calling `start_llm`, which produces initial context from the prompt via `create_context` and asks the chat model to answer using only that context. If the response begins with "i don't know", `get_expand_context_query` reformulates the conversation into a fresh vector query and re-asks with the expanded context, allowing multi-hop retrieval [Source: [src/yt_fts/llm/chatbot.py:1-80]()](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/llm/chatbot.py). `format_context` injects per-cue metadata (video title, date, youtu.be link) into the prompt so the model can cite sources [Source: [src/yt_fts/llm/chatbot.py:80-110]()](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/llm/chatbot.py).

### `summarize`

`SummarizeHandler` accepts either a full YouTube URL or a raw video ID. If the ID is not present locally, it calls `download_transcript` via yt-dlp; otherwise it reads the joined subtitle text from the database. A system prompt instructs the model to emit timestamped `https://youtu.be/<id>?t=<secs>` URLs for each section [Source: [src/yt_fts/llm/summarize.py:1-50]()](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/llm/summarize.py).

```mermaid
flowchart LR
    A[yt-fts download] --> B[(SQLite: Subtitles_fts)]
    A --> C[(Chroma: vectors)]
    B --> D[search]
    C --> E[vsearch]
    C --> F[llm / summarize]
    D --> G[export CSV]
    E --> G
```

## Common Failure Modes

- **No matches found** — displayed when FTS5 returns an empty set; the user is prompted to shorten the query or use wildcards [Source: [src/yt_fts/utils.py:1-20]()](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/utils.py).
- **Wrong channel downloaded** — issue #46 reports that a `@handle` URL sometimes resolves to a sibling channel's videos; this is a download-side resolver issue rather than a search defect [Source: issue #46](https://github.com/NotJoeMartinez/yt-fts/issues/46).
- **Download-only flow for sound bites** — issue #60 outlines a manual `yt-dl` + `ffmpeg` workflow because there is no first-class "export clip" command after `search` returns a timestamp [Source: issue #60](https://github.com/NotJoeMartinez/yt-fts/issues/60).
- **`Requested format not available`** — issue #183 reports yt-dlp format errors on download that subsequently leave the search index empty [Source: issue #183](https://github.com/NotJoeMartinez/yt-fts/issues/183). v0.1.62 introduced randomized user-agent headers and a retry method to mitigate this [Source: [CHANGELOG.md:0.1.62]()](https://github.com/NotJoeMartinez/yt-fts/blob/main/CHANGELOG.md).

## See Also

- yt-fts README — [README.md](https://github.com/NotJoeMartinez/yt-fts/blob/main/README.md)
- Download pipeline — `src/yt_fts/download/download_handler.py`
- Configuration paths — `src/yt_fts/config.py`

---

<a id='page-4'></a>

## AI Features: Embeddings, RAG Chat, and Summarization

### Related Pages

Related topics: [Project Overview and Architecture](#page-1), [Full Text and Semantic Search](#page-3)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [src/yt_fts/llm/get_embeddings.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/llm/get_embeddings.py)
- [src/yt_fts/llm/chatbot.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/llm/chatbot.py)
- [src/yt_fts/llm/summarize.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/llm/summarize.py)
- [src/yt_fts/utils.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/utils.py)
- [src/yt_fts/config.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/config.py)
- [src/yt_fts/yt_fts.py](https://github.com/NotJoeMartinez/yt-fts/blob/main/src/yt_fts/yt_fts.py)
- [CHANGELOG.md](https://github.com/NotJoeMartinez/yt-fts/blob/main/CHANGELOG.md)
- [README.md](https://github.com/NotJoeMartinez/yt-fts/blob/main/README.md)
</details>

# AI Features: Embeddings, RAG Chat, and Summarization

## Overview

`yt-fts` ships three optional AI-driven commands that sit on top of the project's local subtitle database (`channels.db`) and a parallel ChromaDB vector store: `embeddings` (precompute semantic vectors), `vsearch` (semantic search), `llm` (Retrieval-Augmented Generation chat), and `summarize` (one-shot transcript summary). Source: [CHANGELOG.md:11-13](). These features share a common configuration layer (`get_model_config`) and a common transcript source, but they differ in whether they require a channel to be embedded first.

The model layer supports two providers out of the box — OpenAI and Google Gemini — and the key for the active provider is selected by inspecting the prefix of the API key string (`sk-` for OpenAI, `AIza` for Gemini). Source: [src/yt_fts/utils.py:18-37](). Gemini support landed in v0.1.64 (2025-08-10) and is documented in [CHANGELOG.md:5-7]().

## Embeddings Pipeline

The `embeddings` command is implemented by `EmbeddingsHandler.add_embeddings_to_chroma`. It walks every video in a channel, fetches its subtitle rows from SQLite, splits them into fixed-interval chunks, and writes the resulting vectors into ChromaDB.

```mermaid
flowchart LR
    A[yt-fts embeddings --channel X] --> B[Load video IDs]
    B --> C[Fetch subs from SQLite]
    C --> D[split_subtitles by interval]
    D --> E[Build text + metadata]
    E --> F[OpenAI / Gemini embed]
    F --> G[ChromaDB collection]
```

Key behaviors:

- **Interval chunking**: `split_subtitles` groups consecutive caption lines into buckets of `self.interval` seconds (default 30s, configurable via `--interval`). Source: [src/yt_fts/llm/get_embeddings.py:78-114](). Videos shorter than the interval are skipped with a console message rather than raising an error. Source: [src/yt_fts/llm/get_embeddings.py:81-85]().
- **Metadata enrichment**: Each chunk is enriched with channel name, video title, date, video id, and a timestamped `youtu.be` link so retrieved chunks are directly linkable. Source: [src/yt_fts/llm/get_embeddings.py:62-72]().
- **Re-runnable**: Because the same channel is upserted into Chroma, running `embeddings` again refreshes vectors without failing. This mirrors the duplicate-download behavior introduced in v0.1.60. Source: [CHANGELOG.md:25-27]().

## RAG Chat (the `llm` command)

The interactive chat flow is implemented by `LLMHandler` in [src/yt_fts/llm/chatbot.py:21-30](). It requires the channel to already have embeddings in Chroma (i.e. `embeddings` must have been run first — the README notes "(ss)" appears next to the channel name in `list` once enabled). Source: [README.md:152-154]().

The chat loop works in two stages:

1. **Initial query** — the user's first message is sent to the LLM with a system prompt instructing the model to rewrite the question into a *vector search query string*. Source: [src/yt_fts/llm/chatbot.py:55-70](). The rewritten question is then used to query Chroma via `get_chroma_client()`, and the top matches become the LLM's context.
2. **Continuation turns** — subsequent user messages are appended to a message history and the LLM is asked again to produce a search query; the retrieved chunks are formatted with `format_context` and passed back as the assistant context. The loop exits when the user types `exit`. Source: [src/yt_fts/llm/chatbot.py:31-44]() and [src/yt_fts/llm/chatbot.py:81-95]().

This iterative "rewrite → retrieve → answer" pattern is what makes the bot feel like it can drill down across a long conversation, but it is also a hard dependency on the quality of the embeddings and the chat model's instruction following.

## Summarization (the `summarize` command)

`SummarizeHandler` takes either a YouTube URL or a bare video ID. If the video is already in `channels.db`, the transcript is read from SQLite; otherwise it falls back to a one-off `yt_dlp` download via `download_transcript()`. Source: [src/yt_fts/llm/summarize.py:26-37]().

The system prompt is constructed to force a specific output shape — numbered sections of key points, each annotated with a `youtu.be` deep link of the form `https://youtu.be/{id}?t={seconds}`. Source: [src/yt_fts/llm/summarize.py:44-60](). This is the same timestamped-URL pattern referenced in community discussion #60, which asked for a way to download a specific audio/video quote; the summarizer already produces clickable timestamps, but the actual media-clip workflow still requires external `yt-dlp` + `ffmpeg` steps. Source: [README.md:67-86]() and community issue #60.

The completion call sets `temperature=0.5` and `max_tokens=2000`, mirroring the chat handler for consistency. Source: [src/yt_fts/llm/summarize.py:67-81]().

## Configuration & Model Selection

| Aspect | Behavior | Source |
| --- | --- | --- |
| Provider selection | API key prefix (`sk-` → OpenAI, `AIza` → Gemini) | [src/yt_fts/utils.py:24-33]() |
| Default OpenAI models | `text-embedding-ada-002`, `gpt-4o` | [src/yt_fts/utils.py:14-18]() |
| Default Gemini models | `text-embedding-004`, `gemini-2.5-flash` | [src/yt_fts/utils.py:14-18]() |
| Key resolution order | Explicit `--api-key` → `OPENAI_API_KEY` / `GEMINI_API_KEY` env var | [src/yt_fts/utils.py:34-39]() |
| Default embedding interval | 30 seconds (override with `--interval`) | [README.md:147-149]() |
| Default parallel jobs | 8 (changed in v0.1.60) | [CHANGELOG.md:25-27]() |

If neither env var is set and no key is passed, `get_model_config` raises `ValueError("No model configuration found. ...")`. Source: [src/yt_fts/utils.py:40-41]().

## Common Failure Modes

- **"No matches found"** in `vsearch` — almost always means `embeddings` was never run for the channel; the README calls this out explicitly. Source: [README.md:155-161]().
- **Channel resolution errors** — `llm` resolves `--channel` via `get_channel_id_from_input`; if multiple channels match, the chat handler will surface the same `multiple_channels_found` error message used elsewhere. Source: [src/yt_fts/utils.py:14-22]() and [src/yt_fts/llm/chatbot.py:25-28]().
- **Provider-specific LLM params** — `frequency_penalty` and `stop` are set to `NotGiven()` for non-OpenAI providers because Gemini rejects those parameters; this is centralized in `get_completion`. Source: [src/yt_fts/llm/chatbot.py:71-85]() and [src/yt_fts/llm/summarize.py:67-81]().
- **"Wrong channel downloaded"** (#46) is unrelated to AI features but is worth knowing: the channel ID is derived from the URL's `/@handle` segment, and a similar mis-resolution is possible when piping a name into `llm --channel`. Pass a numeric channel id to disambiguate.

## See Also

- [CHANGELOG.md](CHANGELOG.md) — release history including v0.1.52 (`llm`), v0.1.57 (`summarize`), and v0.1.64 (Gemini).
- [README.md](README.md) — user-facing command reference for `embeddings`, `vsearch`, `llm`, and `summarize`.

---

<!-- evidence_pipeline_checked: true -->
<!-- evidence_injected: true -->

---

## Pitfall Log

Project: NotJoeMartinez/yt-fts

Summary: Found 15 structured pitfall item(s), including 0 high/blocking item(s). Top priority: Installation risk - Installation risk requires verification.

## 1. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/NotJoeMartinez/yt-fts/issues/138

## 2. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/NotJoeMartinez/yt-fts/issues/180

## 3. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/NotJoeMartinez/yt-fts/issues/183

## 4. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/NotJoeMartinez/yt-fts/issues/192

## 5. Configuration risk - Configuration risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/NotJoeMartinez/yt-fts/issues/184

## 6. Capability evidence risk - Capability evidence risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: capability.assumptions | https://github.com/NotJoeMartinez/yt-fts

## 7. Runtime risk - Runtime risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/NotJoeMartinez/yt-fts/issues/182

## 8. Runtime risk - Runtime risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/NotJoeMartinez/yt-fts/issues/189

## 9. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/NotJoeMartinez/yt-fts/issues/181

## 10. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/NotJoeMartinez/yt-fts

## 11. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: downstream_validation.risk_items | https://github.com/NotJoeMartinez/yt-fts

## 12. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: risks.scoring_risks | https://github.com/NotJoeMartinez/yt-fts

## 13. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/NotJoeMartinez/yt-fts/issues/168

## 14. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: issue_or_pr_quality=unknown。
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/NotJoeMartinez/yt-fts

## 15. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: release_recency=unknown。
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/NotJoeMartinez/yt-fts

<!-- canonical_name: NotJoeMartinez/yt-fts; human_manual_source: deepwiki_human_wiki -->
