# https://github.com/lucaong/minisearch Project Manual

Generated at: 2026-06-26 16:15:18 UTC

## Table of Contents

- [Overview and Getting Started](#page-overview)
- [Core Architecture and Data Model](#page-architecture)
- [Configuration and Search API](#page-api)
- [Extensibility, Internationalization, and Troubleshooting](#page-extensibility)

## Overview and Getting Started

### Related Pages

Related topics: [Core Architecture and Data Model](#page-architecture), [Configuration and Search API](#page-api)

Related Source Files

The following source files were used to generate this page:

- [README.md](https://github.com/lucaong/minisearch/blob/main/README.md)
- [package.json](https://github.com/lucaong/minisearch/blob/main/package.json)
- [src/MiniSearch.ts](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.ts)
- [src/MiniSearch.test.js](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.test.js)
- [src/SearchableMap/types.ts](https://github.com/lucaong/minisearch/blob/main/src/SearchableMap/types.ts)
- [examples/plain_js/README.md](https://github.com/lucaong/minisearch/blob/main/examples/plain_js/README.md)

# Overview and Getting Started

## What is MiniSearch

`MiniSearch` is a tiny but powerful in-memory full-text search engine written in JavaScript. According to [README.md](https://github.com/lucaong/minisearch/blob/main/README.md), it is "respectful of resources" and can comfortably run both in Node.js and the browser. The package has zero runtime dependencies ([package.json:13](https://github.com/lucaong/minisearch/blob/main/package.json)) and ships UMD, ESM, and CJS bundles for flexible consumption.

The library is described by its author as a "Tiny but powerful full-text search engine for browser and Node" ([package.json:3](https://github.com/lucaong/minisearch/blob/main/package.json)). It supports BM25-based relevance scoring ([src/MiniSearch.ts](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.ts)), fuzzy search, prefix search, auto-suggestions, faceted filtering, and full serialization/deserialization of indexes. The index itself is built on a memory-efficient radix tree, whose types are declared in [`src/SearchableMap/types.ts`](https://github.com/lucaong/minisearch/blob/main/src/SearchableMap/types.ts).

## Installation

Install the package from npm:

```bash
npm install minisearch
```

The published artifact exposes three entry points ([package.json:7-21](https://github.com/lucaong/minisearch/blob/main/package.json)): a CommonJS build, an ES module build, and a standalone UMD bundle served by both unpkg and jsDelivr. TypeScript type definitions are bundled at `dist/es/index.d.ts`. A separate entry point, `./SearchableMap`, is exported for advanced use cases that need direct access to the radix-tree implementation ([src/SearchableMap/types.ts](https://github.com/lucaong/minisearch/blob/main/src/SearchableMap/types.ts)).

## Quick Start

The following example demonstrates the typical lifecycle: instantiate, index, search.
```javascript
import MiniSearch from 'minisearch'

const documents = [
  { id: 1, title: 'Moby Dick', text: 'Call me Ishmael...' },
  { id: 2, title: 'Zen and the Art of Motorcycle Maintenance', text: 'I can see by my watch...' },
  { id: 3, title: 'Neuromancer', text: 'The sky above the port was...' }
]

const miniSearch = new MiniSearch({
  fields: ['title', 'text'], // fields to index for full-text search
  storeFields: ['title']     // fields to return in search results
})

miniSearch.addAll(documents)

const results = miniSearch.search('zen art motorcycle')
// => [{ id: 2, title: 'Zen and the Art of Motorcycle Maintenance', score: 2.77258, ... }]
```

Every document must include a unique identifier, read by default from the `id` field (configurable through the `idField` option). Fields listed in `fields` are tokenized and added to the inverted index; fields listed in `storeFields` are retained verbatim and returned with each result ([src/MiniSearch.ts](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.ts)).

## How the Indexing Pipeline Works

Each document passes through a deterministic, four-stage pipeline before being searchable:

```mermaid
flowchart LR
    A[Document] --> B[extractField]
    B --> C[stringifyField]
    C --> D[tokenize]
    D --> E[processTerm]
    E --> F[Inverted Index]
```

The stages are defined and documented in [`src/MiniSearch.ts`](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.ts):

| Stage | Option | Default | Purpose |
|---|---|---|---|
| Extract | `extractField` | `document[fieldName]` | Pull a raw value from the document |
| Stringify | `stringifyField` | `fieldValue.toString()` | Convert non-string values to strings |
| Tokenize | `tokenize` | `string.split(SPACE_OR_PUNCTUATION)` | Split the string into terms |
| Process | `processTerm` | `term.toLowerCase()` | Normalize, stem, or expand a term |

Both `tokenize` and `processTerm` receive the field name as their second argument, allowing per-field customization ([src/MiniSearch.test.js:25-56](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.test.js)).
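As a rough illustration of per-field customization, the sketch below defines a tokenizer and a term processor that branch on the field name. The `tags` field, the comma-splitting rule, and the stop-word list are hypothetical and exist only for this example; the regex is assumed to match the default described in the table above.

```javascript
// Regex assumed to match the library default described above.
const SPACE_OR_PUNCTUATION = /[\n\r\p{Z}\p{P}]+/u

// Hypothetical rule: the `tags` field holds comma-separated phrases that
// should stay intact; every other field uses the default-style split.
const tokenizeByField = (text, fieldName) =>
  fieldName === 'tags'
    ? text.split(',').map((tag) => tag.trim()).filter(Boolean)
    : text.split(SPACE_OR_PUNCTUATION).filter(Boolean)

// Hypothetical stop-word list, applied to every field except `tags`.
const STOP_WORDS = new Set(['and', 'the', 'of'])

const processTermByField = (term, fieldName) => {
  const lower = term.toLowerCase()
  if (fieldName === 'tags') return lower
  // Returning a falsy value discards the term.
  return STOP_WORDS.has(lower) ? null : lower
}
```

These functions would be supplied as the `tokenize` and `processTerm` constructor options, alongside a `fields` list that includes `tags`.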
The `processTerm` function may return a string, a falsy value (to discard the term), or an array of strings (to expand one token into many indexable terms) ([src/MiniSearch.ts](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.ts)).

For documents with nested fields or non-plain-object shapes, a custom `extractField` is the recommended approach ([README.md:120-145](https://github.com/lucaong/minisearch/blob/main/README.md)):

```javascript
const miniSearch = new MiniSearch({
  fields: ['title', 'author.name', 'pubYear'],
  extractField: (document, fieldName) => {
    // Computed field: derive the publication year from a Date value
    if (fieldName === 'pubYear') return document.pubDate.getFullYear().toString()
    // Nested field: walk a dotted path such as 'author.name'
    return fieldName.split('.').reduce((doc, key) => doc?.[key], document)
  }
})
```

## Common Pitfalls and Community-Reported Issues

Several recurring themes appear in community discussions and are worth understanding before adopting MiniSearch.

**Non-Latin and agglutinative languages.** The default tokenizer splits on the Unicode property classes `\p{Z}` (separators) and `\p{P}` (punctuation) ([src/MiniSearch.ts](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.ts)). This works poorly for Chinese ([Issue #201](https://github.com/lucaong/minisearch/issues/201)) and Korean, since morphemes are not separated by spaces. A community-maintained Korean morphological tokenizer is available at [garu-minisearch-tokenizer](https://www.npmjs.com/package/garu-minisearch-tokenizer), referenced in [Issue #314](https://github.com/lucaong/minisearch/issues/314) and [Issue #312](https://github.com/lucaong/minisearch/issues/312).

**Punctuation handling in the default tokenizer.** Because the Unicode `Punctuation` class includes quotation marks and apostrophes, the default regex can split `song's` into `song` and `s` ([Issue #309](https://github.com/lucaong/minisearch/issues/309)). Supply a custom `tokenize` function if your content contains many contractions or quoted strings.
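One possible shape for such a tokenizer is sketched below. It is not the library's implementation: it matches runs of letters and digits and keeps an apostrophe only when it sits between two such runs. Unlike the default, it also treats symbol characters (for example `+` or `$`) as separators, which is an assumption made to keep the example short. The function name is hypothetical.

```javascript
// Keep intra-word apostrophes (straight ' and curly ’) and drop all other
// punctuation. Lowercasing is left to `processTerm`.
const tokenizeKeepingApostrophes = (text) =>
  text.match(/[\p{L}\p{N}]+(?:['’][\p{L}\p{N}]+)*/gu) ?? []
```

It would be passed as the `tokenize` option. Quotation marks that wrap a word are still removed, because an apostrophe is only kept when letters or digits appear on both sides of it.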
**Not all fields are searched.** Only fields listed in `fields` are indexed. If a field appears in the stored JSON dump but not in results, verify it is included in the `fields` option ([Issue #298](https://github.com/lucaong/minisearch/issues/298)).

**Wildcard misuse.** The built-in `MiniSearch.wildcard` sentinel is a symbol (`Symbol('*')`) intended for programmatic use, not a query-string syntax. Passing a plain string containing `*`, or a wildcard value other than the exported symbol, can throw `Cannot read properties of undefined (reading 'map')` ([Issue #307](https://github.com/lucaong/minisearch/issues/307)).

**Combination queries.** When building nested `QueryCombination` trees, the `filter` property is applied at the top-level result set, not per sub-query ([Issue #304](https://github.com/lucaong/minisearch/issues/304)).

## Running the Example

A self-contained browser demo lives in [`examples/plain_js/`](https://github.com/lucaong/minisearch/blob/main/examples/plain_js/README.md). To launch it:

```bash
cd examples/plain_js
python3 -m http.server
# or: npx http-server -p 8000
```

Then open the printed URL in a browser. The example showcases search, auto-completion, and several advanced configuration options ([examples/plain_js/README.md:7-15](https://github.com/lucaong/minisearch/blob/main/examples/plain_js/README.md)).

## See Also

- [Search and Auto-Suggest](search-and-auto-suggest.md)
- [Indexing Pipeline and Tokenization](indexing-pipeline.md)
- [Scoring and BM25 Tuning](scoring-and-bm25.md)
- [Serialization and Vacuuming](serialization-and-vacuuming.md)
- [Advanced Query Combinations](advanced-queries.md)

---

## Core Architecture and Data Model

### Related Pages

Related topics: [Overview and Getting Started](#page-overview), [Configuration and Search API](#page-api), [Extensibility, Internationalization, and Troubleshooting](#page-extensibility)

Related Source Files

The following source files were used to generate this page:

- [src/MiniSearch.ts](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.ts)
- [src/SearchableMap/types.ts](https://github.com/lucaong/minisearch/blob/main/src/SearchableMap/types.ts)
- [README.md](https://github.com/lucaong/minisearch/blob/main/README.md)
- [package.json](https://github.com/lucaong/minisearch/blob/main/package.json)
- [src/MiniSearch.test.js](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.test.js)

# Core Architecture and Data Model

## 1. Purpose and Scope

MiniSearch is a tiny, dependency-free, in-memory full-text search engine for JavaScript that runs in both Node and the browser. As of version **7.2.0** ([package.json](https://github.com/lucaong/minisearch/blob/main/package.json)), it provides BM25-ranked full-text search, fuzzy matching, prefix matching, auto-suggest, filtering, document serialization, and incremental add/remove operations entirely in RAM.

The "core architecture" describes how a single `MiniSearch` instance is structured internally, the pipeline that turns a document into searchable terms, and the data structures that store both the inverted index and the user-facing metadata. The "data model" describes the typed shapes used for documents, queries, results, and the serialized JSON form of the index.

## 2. Class Layout and Internal State

The default export is the `MiniSearch` class ([src/MiniSearch.ts](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.ts)).
Each instance owns the following protected/private fields, which together constitute the live in-memory state:

| Field | Type | Role |
| --- | --- | --- |
| `_options` | `OptionsWithDefaults` | Resolved indexer options with defaults applied |
| `_index` | `SearchableMap` | Inverted index of term → postings per field |
| `_documentIds` / `_idToShortId` | `Map` / `Map` | Bidirectional lookup between user IDs and internal numeric IDs |
| `_fieldIds` | `{ [name: string]: number }` | Compact numeric IDs for each indexed field |
| `_fieldLength` | `Map` | Per-document field length (one entry per field) |
| `_avgFieldLength` | `number[]` | Running average field length, indexed by field ID |
| `_storedFields` | `Map` | Stored fields returned verbatim in search results |
| `_documentCount`, `_nextId`, `_dirtCount` | numbers | Counters used by the indexer and vacuum logic |

The internal `_index` is a `SearchableMap`, which is built on top of a radix tree whose types are declared in [src/SearchableMap/types.ts](https://github.com/lucaong/minisearch/blob/main/src/SearchableMap/types.ts). That tree representation enables efficient exact lookup, prefix iteration, and fuzzy neighbor traversal without a separate trie. See `RadixTree` in `types.ts` for the underlying typed interface.

## 3. The Indexing Pipeline

A document becomes searchable through three main user-pluggable steps (with an optional `stringifyField` conversion between extraction and tokenization), documented in [README.md](https://github.com/lucaong/minisearch/blob/main/README.md) and the JSDoc in [src/MiniSearch.ts](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.ts):

```mermaid
flowchart LR
    A[Document object] -->|extractField| B[String field value]
    B -->|tokenize| C[Raw terms]
    C -->|processTerm| D[Normalized terms]
    D --> E[(Inverted index in _index)]
```

- **extractField(document, fieldName)** retrieves the raw value for each entry in the user-supplied `fields` array. The default simply reads `document[fieldName]`, but a custom extractor can reach nested keys (e.g. `author.name`) or convert `Date` values to strings ([README.md](https://github.com/lucaong/minisearch/blob/main/README.md)).
- **tokenize(text, fieldName)** splits that value into terms. The default tokenizer splits on the Unicode-aware regex `SPACE_OR_PUNCTUATION` (`/[\n\r\p{Z}\p{P}]+/u`). This default has been a source of community discussion: see [issue #309](https://github.com/lucaong/minisearch/issues/309), where users note that `\p{P}` also matches ASCII apostrophes, causing words like `song's` to be split.
- **processTerm(term, fieldName)** normalizes each token (default: lower-case). It may return a falsy value to discard the term, a single string, or an array of expanded strings (for stemming or synonyms) ([src/MiniSearch.ts](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.ts)).

Test coverage in [src/MiniSearch.test.js](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.test.js) verifies that both `tokenize` and `processTerm` are invoked with the field name as the second argument, and that returning an array from `processTerm` correctly expands one token into several indexed terms.

## 4. Scoring: BM25 with Field Length Normalization

Relevance is computed using **BM25+** in [src/MiniSearch.ts](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.ts). The default parameters are:

```js
const defaultBM25params = { k: 1.2, b: 0.7, d: 0.5 }
```

Here `k` controls term-frequency saturation, `b` controls length normalization, and `d` is the BM25+ floor that guarantees a non-zero contribution from any matching term. The scoring formula needs the following inputs for each term, document, and field: term frequency, the number of documents matching the term, the total document count, the field length, and the average field length. All of these are already tracked by `_index`, `_fieldLength`, `_avgFieldLength`, and `_documentCount` (see Section 2).
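To make the roles of `k`, `b`, and `d` concrete, here is a minimal sketch of a BM25+ term score written from the textbook formula. The function and argument names are illustrative, and the exact inverse-document-frequency variant is an assumption; consult `src/MiniSearch.ts` for what the library actually computes.

```javascript
// Sketch of a BM25+ score for one term in one field of one document.
// Names and the IDF variant are assumptions, not copied from the library.
const defaultBM25params = { k: 1.2, b: 0.7, d: 0.5 }

function bm25PlusScore (termFreq, matchingCount, totalCount, fieldLength, avgFieldLength, params = defaultBM25params) {
  const { k, b, d } = params
  // Rarer terms (small matchingCount) get a larger inverse document frequency.
  const idf = Math.log(1 + (totalCount - matchingCount + 0.5) / (matchingCount + 0.5))
  // Fields longer than average are penalized in proportion to b.
  const lengthNorm = 1 - b + b * (fieldLength / avgFieldLength)
  // k caps how much repeated occurrences can add; d is the BM25+ floor.
  return idf * (d + (termFreq * (k + 1)) / (termFreq + k * lengthNorm))
}
```

With an average-length field and a single occurrence, the saturation term evaluates to exactly 1, so the score reduces to `idf * (d + 1)`. Raising `b` penalizes long fields more strongly, and raising `k` lets repeated occurrences keep adding to the score for longer.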
Community feedback on scoring (issues [#129](https://github.com/lucaong/minisearch/issues/129) and [#263](https://github.com/lucaong/minisearch/issues/263)) indicates that the default parameters can produce surprising rankings when documents are similar; users are expected to tune `k`, `b`, and `d`, or to add `boostDocument` and per-field `boost` weights when domain-specific ranking is required.

## 5. Query and Result Data Model

Queries are typed as `Query = QueryCombination | string | Wildcard` ([src/MiniSearch.ts](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.ts)). A plain string is the common case, a `QueryCombination` (`{ combineWith, queries, ...searchOptions }`) expresses nested `AND` / `OR` / `AND_NOT` trees, and the `Wildcard` is the unique `Symbol('*')` exposed as `MiniSearch.wildcard` for matching every document.

Search results are typed by `SearchResult`, which includes `id`, `score`, `terms`, `queryTerms`, and a `match: MatchInfo` object describing which terms hit which fields. Stored fields are merged in via `Object.assign(result, this._storedFields.get(docId))` and the user-supplied `filter` predicate is applied before sorting by `score` ([src/MiniSearch.ts](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.ts)). Note that `filter` runs on the merged result set, not on each sub-query of a `QueryCombination`; this distinction has been raised in community discussion (see [issue #304](https://github.com/lucaong/minisearch/issues/304)).

For persistence, `MiniSearch` exposes a serialization shape defined by `AsPlainObject` in [src/MiniSearch.ts](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.ts), containing `documentCount`, `documentIds`, `fieldIds`, `fieldLength`, `averageFieldLength`, `storedFields`, `dirtCount`, the `index` array, and a `serializationVersion` integer.
Discarded documents accumulate as "dirt"; `vacuum()` cleans them in configurable batches (`batchSize`, `batchWait`) and can also be triggered automatically by `minDirtCount` / `minDirtFactor` thresholds.

## See Also

- README and quick-start examples: [README.md](https://github.com/lucaong/minisearch/blob/main/README.md)
- Plain-JS example app: [examples/plain_js/README.md](https://github.com/lucaong/minisearch/blob/main/examples/plain_js/README.md)
- SearchableMap radix tree types: [src/SearchableMap/types.ts](https://github.com/lucaong/minisearch/blob/main/src/SearchableMap/types.ts)
- Package metadata and build configuration: [package.json](https://github.com/lucaong/minisearch/blob/main/package.json)
- Community discussion on Korean segmentation: [issue #314](https://github.com/lucaong/minisearch/issues/314)

---

## Configuration and Search API

### Related Pages

Related topics: [Overview and Getting Started](#page-overview), [Core Architecture and Data Model](#page-architecture), [Extensibility, Internationalization, and Troubleshooting](#page-extensibility)

Related Source Files

The following source files were used to generate this page:

- [src/MiniSearch.ts](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.ts)
- [src/MiniSearch.test.js](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.test.js)
- [README.md](https://github.com/lucaong/minisearch/blob/main/README.md)
- [package.json](https://github.com/lucaong/minisearch/blob/main/package.json)
- [examples/plain_js/README.md](https://github.com/lucaong/minisearch/blob/main/examples/plain_js/README.md)

# Configuration and Search API

MiniSearch exposes a single primary class, `MiniSearch`, whose surface area is split between **construction options** (how documents are indexed) and **search options** (how queries are evaluated at runtime). Both surfaces are declared in [src/MiniSearch.ts](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.ts), where the `Options` and `SearchOptions` types live alongside their defaults. Mastering these two option bags is the central task for anyone integrating the library, since every downstream behavior (tokenization, scoring, filtering, suggestion generation) is controlled through them.

This page documents the configuration options passed to `new MiniSearch(...)`, the search-time options accepted by `miniSearch.search(...)` and `miniSearch.autoSuggest(...)`, and the query expression API (`Query`, `QueryCombination`, `Wildcard`).

## Construction: Indexing Configuration

The `MiniSearch` constructor accepts an `Options` object that controls how documents are broken into terms, stored, and ranked. The full list, with library defaults, is documented inline in [src/MiniSearch.ts:1-30](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.ts) and replicated in the README. The most consequential options are summarized below.

| Option | Type | Default | Purpose |
| --- | --- | --- | --- |
| `idField` | `string` | `'id'` | Field used as the unique document identifier. Source: [src/MiniSearch.ts](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.ts) |
| `fields` | `string[]` | `undefined` (required) | Document fields to index for full-text search. |
| `storeFields` | `string[]` | `[]` | Fields copied into search results, whether or not they are also indexed. |
| `extractField` | `(doc, field) => unknown` | `(doc, field) => doc[field]` | Custom field extractor (e.g. for nested documents). |
| `tokenize` | `(text, fieldName?) => string[]` | splits on `SPACE_OR_PUNCTUATION` | Splits raw text into terms before normalization. |
| `processTerm` | `(term, fieldName?) => string \| string[] \| falsy` | `term => term.toLowerCase()` | Normalizes or stems a term; falsy values are dropped. |
| `searchOptions` | `SearchOptions` | `undefined` | Default options merged into every `search()` call. |
| `searchOptions.bm25` | `BM25Params` (`{ k, b, d }`) | `{ k: 1.2, b: 0.7, d: 0.5 }` | Okapi BM25+ parameters for ranking, supplied as the `bm25` search option. Source: [src/MiniSearch.ts](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.ts) |

The default tokenizer is `string.split(SPACE_OR_PUNCTUATION)`, where `SPACE_OR_PUNCTUATION` is the Unicode regex `/[\n\r\p{Z}\p{P}]+/u`. Community issue [#309](https://github.com/lucaong/minisearch/issues/309) shows that this regex also matches quotation marks (`"`, `'`), so users whose text contains contractions such as `song's` may want to override `tokenize` (or `processTerm`) to keep the apostrophe intact. A more invasive option is to swap in a language-aware tokenizer, as demonstrated by the community-built `garu-minisearch-tokenizer` for Korean ([#312](https://github.com/lucaong/minisearch/issues/312), [#314](https://github.com/lucaong/minisearch/issues/314)) and discussed for Chinese in [#201](https://github.com/lucaong/minisearch/issues/201).

> The `extractField` function can return any value, but only stringish content will produce useful search terms. Issue [#302](https://github.com/lucaong/minisearch/issues/302) requests richer return-type support for `extractField`; for now, numeric or array fields used purely for storage should be placed in `storeFields` rather than `fields`.

### Tokenization and Term Processing Pipeline

The indexing pipeline is a four-step chain: `extractField` → `stringifyField` → `tokenize` → `processTerm`. Each step can be overridden independently, and `processTerm` may return an array to expand one token into multiple indexed terms (useful for hyphenated words or morphological splitters).
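As one illustration of array expansion, the sketch below indexes an accented term under both its original and its accent-free spelling. Diacritic folding is an example chosen here, not something the source describes, and the function names are hypothetical.

```javascript
// Remove combining marks after canonical decomposition, then recompose so
// that scripts such as Hangul are left unchanged.
const stripDiacritics = (term) =>
  term.normalize('NFD').replace(/\p{M}/gu, '').normalize('NFC')

// Returns null (discard), a string (one term), or an array (several terms).
const processTermWithFolding = (term) => {
  const lower = term.toLowerCase()
  if (lower.length === 0) return null
  const folded = stripDiacritics(lower)
  return folded === lower ? lower : [lower, folded]
}
```

Passing the same function as `processTerm` at index time means a query for either spelling can reach the document, at the cost of extra index entries for accented terms.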
This same pipeline is reused on the query side unless `searchOptions.tokenize` or `searchOptions.processTerm` override it. Source: [src/MiniSearch.ts:1-40](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.ts).

## Search-Time Configuration

The `SearchOptions` object can be passed to `miniSearch.search(query, options)` or pre-baked into the constructor via `searchOptions`. Notable members (see the type definitions in [src/MiniSearch.ts](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.ts)) include:

- **`prefix`**: `boolean | (term, index, terms) => boolean`. When `true`, every query term is matched as a prefix; the function form decides term by term (for example, only the last one). Issue [#299](https://github.com/lucaong/minisearch/issues/299) notes that prefix matches are weighted *lower* than fuzzy matches by default; the `weights` option (`{ fuzzy, prefix }`) can rebalance this.
- **`fuzzy`**: `boolean | number | function`. Numeric values between `0` and `1` are interpreted as a maximum edit distance relative to the term length; values `>= 1` are absolute edit distances. Several scoring anomalies around fuzzy search are tracked in [#129](https://github.com/lucaong/minisearch/issues/129) and [#263](https://github.com/lucaong/minisearch/issues/263).
- **`boostDocument`**: `(id, term, storedFields?) => number` returning `>= 0`. Lets callers reweight or zero-out results based on the document content, enabling business-logic-driven ranking.
- **`combineWith`**: `'AND' | 'OR' | 'AND_NOT'`. Determines how the results for multiple query terms are combined within a single string query.
- **`filter`**: `(result) => boolean`. Applied to the *final* result set. Issue [#304](https://github.com/lucaong/minisearch/issues/304) clarifies that `filter` is *not* applied per sub-query when using `QueryCombination` trees, despite the type signature suggesting it is.
- **`fields`**: restricts a search call to a subset of indexed fields. Issue [#298](https://github.com/lucaong/minisearch/issues/298) is a common pitfall: if the desired field is missing from results, it usually means the field was not declared in either `fields` (to be indexed) or `storeFields` (to be returned).

The search method signature is:

```ts
search(query: Query, searchOptions: SearchOptions = {}): SearchResult[]
```

Source: [src/MiniSearch.ts](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.ts).

## Query Expression API

A `Query` is one of:

1. A plain string (e.g. `'zen art motorcycle'`).
2. A `Wildcard`: the symbol `MiniSearch.wildcard` (i.e. `Symbol('*')`). Issue [#307](https://github.com/lucaong/minisearch/issues/307) reports that passing the wildcard through certain wrappers can cause a `Cannot read properties of undefined (reading 'map')` crash in `executeQuery`; the safe form is to use `MiniSearch.wildcard` directly.
3. A `QueryCombination` object: `{ combineWith, queries: Query[] }`, optionally with any other `SearchOptions` to scope the sub-query (e.g. boosting or filtering inside a clause).

```ts
miniSearch.search({
  combineWith: 'OR',
  queries: [
    { combineWith: 'AND', queries: ['apple', 'pear'] },
    'juice',
    'tree'
  ]
})
```

This expression-tree API lets external parsers build complex boolean queries. Issue [#297](https://github.com/lucaong/minisearch/issues/297) discusses how to translate a parenthesized mini-language into nested `QueryCombination` nodes, and [#311](https://github.com/lucaong/minisearch/issues/311) flags an edge case where `combineWith: 'AND'` misbehaves when one sub-query contributes no terms.

The `autoSuggest(queryString, options)` method reuses the same `SearchOptions`, but defaults to prefix-searching the last term with `combineWith: 'AND'`. Source: [src/MiniSearch.ts](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.ts).
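The function forms of `prefix` and `fuzzy` can be sketched as plain callbacks. The first mirrors the "prefix-search only the last term" behavior described for `autoSuggest`; the second is a hypothetical policy (the length threshold and the `0.2` ratio are made up for illustration).

```javascript
// Prefix-match only the last query term, i.e. the word still being typed.
const prefixLastTermOnly = (term, index, terms) => index === terms.length - 1

// Hypothetical policy: exact matching for short terms, a relative edit
// distance of 20% of the term length for longer ones.
const fuzzyForLongTerms = (term) => (term.length > 4 ? 0.2 : false)

// Options object shaped like the SearchOptions members listed above.
const suggestLikeOptions = {
  combineWith: 'AND',
  prefix: prefixLastTermOnly,
  fuzzy: fuzzyForLongTerms
}
```

An object like `suggestLikeOptions` would be passed as the second argument to `search`, or set once through the constructor's `searchOptions`.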
## Lifecycle and Result Shape

`MiniSearch.search` returns `SearchResult[]`, each entry containing `id`, `score`, `terms` (matched document terms), `queryTerms` (the original query terms that produced a hit), a `match` map of term → fields, and any stored fields. The full shape is declared in [src/MiniSearch.ts](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.ts). For sorting stability over equal-score ties, issue [#301](https://github.com/lucaong/minisearch/issues/301) recommends keeping the original document array in memory and breaking ties by the pre-search index, since MiniSearch does not preserve insertion order through scoring.

The library's runtime layout (a single default export plus an optional `SearchableMap` companion) is declared in [package.json](https://github.com/lucaong/minisearch/blob/main/package.json) (`"exports": { ".": { ... }, "./SearchableMap": { ... } }`), letting advanced users import just the data structure when needed. A minimal end-to-end usage example is provided in [examples/plain_js/README.md](https://github.com/lucaong/minisearch/blob/main/examples/plain_js/README.md), and integration scenarios (including a Rust+WebAssembly port) are referenced in issue [#313](https://github.com/lucaong/minisearch/issues/313).
## See Also

- MiniSearch overview and features: [README.md](https://github.com/lucaong/minisearch/blob/main/README.md)
- Full type definitions and defaults: [src/MiniSearch.ts](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.ts)
- Test coverage for `replace`, `vacuum`, and search semantics: [src/MiniSearch.test.js](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.test.js)
- Package metadata and ESM/CJS/UMD entry points: [package.json](https://github.com/lucaong/minisearch/blob/main/package.json)
- Plain-JavaScript demo walkthrough: [examples/plain_js/README.md](https://github.com/lucaong/minisearch/blob/main/examples/plain_js/README.md)

---

## Extensibility, Internationalization, and Troubleshooting

### Related Pages

Related topics: [Core Architecture and Data Model](#page-architecture), [Configuration and Search API](#page-api)

Related Source Files

The following source files were used to generate this page:

- [src/MiniSearch.ts](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.ts)
- [README.md](https://github.com/lucaong/minisearch/blob/main/README.md)
- [src/MiniSearch.test.js](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.test.js)
- [package.json](https://github.com/lucaong/minisearch/blob/main/package.json)
- [examples/plain_js/README.md](https://github.com/lucaong/minisearch/blob/main/examples/plain_js/README.md)

# Extensibility, Internationalization, and Troubleshooting

MiniSearch ships with sensible defaults (a whitespace/punctuation tokenizer, lowercased term normalization, and BM25 scoring), but real-world use cases frequently demand more. The library exposes a small but powerful set of extension hooks that allow developers to adapt indexing and searching to non-English text, domain-specific data, and unusual retrieval semantics. This page documents those hooks, explains how they enable internationalization (with particular attention to languages such as Korean and Chinese), and catalogs common failure modes reported by the community along with their mitigations.

## Extension Points

### Custom Field Extraction (`extractField`)

The `extractField` option determines how a configured field name is read from a document before being tokenized. The default implementation treats documents as plain objects, but the option can be replaced to support nested fields, computed fields, or alternate storage backends.

```javascript
const miniSearch = new MiniSearch({
  fields: ['title', 'author.name', 'pubYear'],
  extractField: (document, fieldName) => {
    // Computed field: derive the year, tolerating a missing pubDate
    if (fieldName === 'pubYear') {
      return document.pubDate && document.pubDate.getFullYear().toString()
    }
    // Nested field: walk a dotted path such as 'author.name'
    return fieldName.split('.').reduce((doc, key) => doc && doc[key], document)
  }
})
```

Source: [README.md](https://github.com/lucaong/minisearch/blob/main/README.md) and [`extractField` option documentation in src/MiniSearch.ts](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.ts).

Because the extracted value is fed directly into the tokenizer and `processTerm` pipeline, `extractField` can return any type (including numbers, dates, or arrays) as long as downstream stages can stringify it. This is the primary lever for supporting structured document shapes.

### Custom Tokenization (`tokenize`)

`tokenize` controls how a string field value is split into terms.
Its signature is `(text: string, fieldName?: string) => string[]`, allowing different tokenization strategies per field. The default implementation splits on the regex `/[\n\r\p{Z}\p{P}]+/u`, which treats any Unicode whitespace or punctuation character as a delimiter.

The function signature and contract are defined in [`src/MiniSearch.ts`](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.ts):

```typescript
tokenize?: (text: string, fieldName?: string) => string[],
```

Tests confirm that the field name is passed as the second argument, enabling per-field strategies. Source: [src/MiniSearch.test.js: passes field value and name to tokenizer](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.test.js).

### Term Processing and Expansion (`processTerm`)

`processTerm` is invoked on every tokenized term and may normalize, stem, or expand it. It may return a single string, an array of strings (each indexed as a separate term), or a falsy value (which discards the term). This is the canonical place to plug in stemming algorithms, synonyms, or morphological analyzers. Source: [`processTerm` option documentation in src/MiniSearch.ts](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.ts).

Tests verify expansion behavior:

```javascript
const processTerm = (string) => string === 'foobar' ? ['foo', 'bar'] : string
```

Source: [src/MiniSearch.test.js: allows processTerm to expand a single term into several terms](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.test.js).

### Document-Level Boosting (`boostDocument`)

Search options expose `boostDocument`, a function returning a multiplicative factor for a (document, term) pair. Returning a number > 1 promotes the document, < 1 demotes it, and a falsy value excludes it entirely. This is distinct from field-level boosting and is useful for time-decay or popularity-based ranking. Source: [`boostDocument` option in src/MiniSearch.ts](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.ts).

### Query Construction Hooks

Beyond indexing, search-time behavior is also extensible.
The `prefix`, `fuzzy`, `boost`, `combineWith`, and `filter` options, all defined in [`src/MiniSearch.ts`](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.ts), take literal values or callback functions, enabling per-term or per-document decisions at query time.

## Internationalization

### Default Tokenizer Limitations

MiniSearch's default tokenization strategy, splitting on Unicode whitespace and punctuation, works well for languages with explicit word boundaries (English, most European languages). It does **not** segment languages that do not delimit words with whitespace, including Chinese, Japanese, and Thai, nor does it handle morphological complexity in agglutinative languages such as Korean.

The community has reported that this default behavior breaks Korean search in particular: a query for `학교` ("school") will not match documents containing `학교에` or `학교를` because the tokenizer treats the bound particle as part of the same token. Source: [Issue #314: Community Korean tokenizer](https://github.com/lucaong/minisearch/issues/314).

### Korean Tokenization (Community Plugin)

A community-maintained morphological tokenizer, **garu-minisearch-tokenizer**, addresses this gap by integrating a Korean morphological analyzer into MiniSearch's `tokenize`/`processTerm` pipeline. The plugin separates stems from endings and particles, ensuring that `먹다`, `먹었다`, and `먹습니다` all share a common stem token. Source: [Issue #312: Korean tokenizer plugin](https://github.com/lucaong/minisearch/issues/312).
### Practical Internationalization Strategies

| Strategy | When to Use | Implementation Hook |
|----------|-------------|---------------------|
| Whitespace + punctuation split | English, European languages | Default |
| Morphological analyzer | Korean, Turkish, Finnish | `tokenize` or `processTerm` |
| n-gram segmentation | Chinese, Japanese, Thai | `tokenize` returning fixed-length substrings |
| Stemmer (Porter, Snowball) | English variants, Romance languages | `processTerm` |
| Synonym expansion | Domain jargon, plurals | `processTerm` returning arrays |

For Chinese and similar languages, the most common pattern is to override `tokenize` to return character bigrams or character n-grams, ensuring every query segment has a chance of overlapping with an indexed segment.

Source: [Issue #201: Chinese search support](https://github.com/lucaong/minisearch/issues/201).

## Common Issues and Troubleshooting

### Quotation Marks Split by `SPACE_OR_PUNCTUATION`

The default tokenizer regex, `/[\n\r\p{Z}\p{P}]+/u`, treats Unicode punctuation as delimiters, including straight quotation marks (`"`, `'`) and curly ones (`“`, `”`, `‘`, `’`). As a result, `song's` is split into `song` and `s`, which can produce unexpected matches. Developers handling user-generated content with apostrophes should override `tokenize` with a regex that preserves intra-word apostrophes.

Source: [Issue #309: tokenize() issue with SPACE_OR_PUNCTUATION and quotes](https://github.com/lucaong/minisearch/issues/309).

### `filter` Applied Only to Final Results

The `filter` search option is evaluated only once, against the aggregated result set, not against each sub-query in a `QueryCombination`. The TypeScript signature may suggest otherwise, but the implementation applies `filter` after combining sub-query results.

Source: [Issue #304: Filtering not working at sub-queries level](https://github.com/lucaong/minisearch/issues/304).
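One possible workaround is to run each sub-query as its own search, filter its results separately, and merge the filtered sets by hand. The sketch below assumes a hypothetical `Hit` shape carrying only `id`, `score`, and stored fields, and a hypothetical `mergeFiltered` helper. Summing scores when a document appears in more than one set is an assumption of this sketch, not a statement about how MiniSearch combines scores internally.

```typescript
// Hypothetical result shape: only the parts used by this sketch are modelled.
interface Hit {
  id: string | number
  score: number
  [storedField: string]: unknown
}

interface FilteredPart {
  hits: Hit[]                  // results of one independently executed search
  keep: (hit: Hit) => boolean  // predicate applied to that search only
}

// Filters each part with its own predicate, then merges with OR semantics.
// Ids are keyed by their string form, so 1 and '1' count as the same document.
function mergeFiltered(parts: FilteredPart[]): Hit[] {
  const merged: Record<string, Hit> = {}
  for (const part of parts) {
    for (const hit of part.hits.filter(part.keep)) {
      const key = String(hit.id)
      const seen = merged[key]
      // Assumption: scores of a document found by several parts are summed.
      merged[key] = seen ? { ...seen, score: seen.score + hit.score } : { ...hit }
    }
  }
  return Object.keys(merged)
    .map((key) => merged[key])
    .sort((a, b) => b.score - a.score)
}
```

Each entry in `parts` would typically hold the array returned by one search call together with the predicate that should apply to that sub-query alone.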
When sub-query filtering is required, post-process the results yourself, or restructure the query so each sub-query carries identical filterable stored fields.

### `Cannot read properties of undefined (reading 'keys')`

This runtime error has been reported from production crash logs. It originates in internal state accessed during query execution when the index is in an unexpected configuration.

Source: [Issue #306: Cannot read properties of undefined (reading 'keys')](https://github.com/lucaong/minisearch/issues/306).

Mitigation typically involves ensuring that documents are added before searches are issued, and that no `replace` operations are interleaved with concurrent searches.

### Wildcard Symbol Misuse

`MiniSearch.wildcard` is exposed as a `Symbol('*')`, but passing the *string* `"*"` (or invoking `Symbol('*')` inline) does not match the wildcard identity check. Searching with such a value can produce `Cannot read properties of undefined (reading 'map')` because the internal code path expects the exact exported symbol.

Source: [Issue #307: BUG global wildcard symbol is not safe](https://github.com/lucaong/minisearch/issues/307).

Always use `MiniSearch.wildcard` rather than constructing a new symbol.

### Scoring Surprises

Several reports (#263, #129) describe cases where documents containing the exact query term score lower than documents with only fuzzy or prefix matches. This typically stems from the BM25+ frequency normalization lower bound (the `d` parameter) or from `boostDocument` functions that return values below 1 for exact matches.

Source: [Issue #263: Fuzzy score issues](https://github.com/lucaong/minisearch/issues/263), [Issue #129: Issues with scoring](https://github.com/lucaong/minisearch/issues/129).

Adjust `BM25Params` or per-term boost weights to rebalance.

### Preserving External Sort Order

When results are pre-sorted by an external criterion (e.g., update time) and search is layered on top, equal-scoring results may be reordered unpredictably.
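One way to keep an external order among equal-scoring results is to sort with an explicit tie-breaker rather than relying on the sort being stable. The following is a minimal sketch, assuming a hypothetical `Hit` shape with numeric ids and a hypothetical `externalOrder` array listing ids in the desired pre-sorted order.

```typescript
// Hypothetical result shape: only `id` and `score` are needed here.
interface Hit {
  id: number
  score: number
}

// Sorts by descending score and breaks ties by position in `externalOrder`.
// Ids missing from `externalOrder` are placed after all known ids.
function sortWithTieBreaker(hits: Hit[], externalOrder: number[]): Hit[] {
  const rank: Record<number, number> = {}
  externalOrder.forEach((id, position) => {
    rank[id] = position
  })
  const rankOf = (id: number): number =>
    id in rank ? rank[id] : externalOrder.length

  // Copy first so the caller's array is not mutated.
  return hits.slice().sort((a, b) => {
    if (a.score !== b.score) return b.score - a.score
    return rankOf(a.id) - rankOf(b.id)
  })
}
```

Because the comparator never returns 0 for two ids with different positions in `externalOrder`, the result does not depend on whether the engine's sort is stable.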
Source: [Issue #301: How to correctly retain sorting?](https://github.com/lucaong/minisearch/issues/301).

Stable sort behavior in JavaScript is guaranteed only since ES2019. If targeting older environments, sort with a tie-breaker that re-applies the original index.

## See Also

- [README.md](https://github.com/lucaong/minisearch/blob/main/README.md): High-level overview and quickstart examples
- [src/MiniSearch.ts](https://github.com/lucaong/minisearch/blob/main/src/MiniSearch.ts): Full API surface, option types, and BM25 implementation
- [examples/plain_js/](https://github.com/lucaong/minisearch/tree/main/examples/plain_js): Reference browser application
- [garu-minisearch-tokenizer](https://www.npmjs.com/package/garu-minisearch-tokenizer): Community Korean tokenizer
- [minisearch-wasm](https://github.com/epoyraz/minisearch-wasm): Rust + WebAssembly port

---

## Pitfall Log

Project: lucaong/minisearch

Summary: Found 19 structured pitfall item(s), including 4 high/blocking item(s).

Top priority: Capability evidence risk - Capability evidence risk requires verification.

## 1. Capability evidence risk - Capability evidence risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a capability evidence risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/lucaong/minisearch/issues/304

## 2. Runtime risk - Runtime risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/lucaong/minisearch/issues/306

## 3. Runtime risk - Runtime risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/lucaong/minisearch/issues/307

## 4. Security or permission risk - Security or permission risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/lucaong/minisearch/issues/309

## 5. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/lucaong/minisearch/issues/302

## 6. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/lucaong/minisearch/issues/300

## 7. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/lucaong/minisearch/issues/308

## 8. Capability evidence risk - Capability evidence risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: capability.assumptions | https://github.com/lucaong/minisearch

## 9. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/lucaong/minisearch

## 10. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: `no_demo`
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: downstream_validation.risk_items | https://github.com/lucaong/minisearch

## 11. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: `no_demo`
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: risks.scoring_risks | https://github.com/lucaong/minisearch

## 12. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/lucaong/minisearch/issues/314

## 13. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/lucaong/minisearch/issues/311

## 14. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/lucaong/minisearch/issues/296

## 15. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/lucaong/minisearch/issues/312

## 16. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/lucaong/minisearch/issues/298

## 17. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/lucaong/minisearch/issues/297

## 18. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: `issue_or_pr_quality=unknown`
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/lucaong/minisearch

## 19. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: `release_recency=unknown`
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/lucaong/minisearch