# https://github.com/benseverndev-oss/goldenmatch Project Manual

Generated at: 2026-05-30 20:24:26 UTC

## Table of Contents

- [Home](#home)
- [Getting Started](#getting-started)
- [Suite Packages Overview](#suite-packages)
- [Installation](#installation)
- [System Architecture](#architecture)
- [Backend Systems](#backend-systems)
- [Core Matching Engine](#core-matching)
- [AutoConfig System](#autoconfig)
- [Learning Memory](#learning-memory)
- [Blocking and Scoring](#blocking-scoring)

<a id='home'></a>

## Home

### Related Pages

Related topics: [Getting Started](#getting-started), [Suite Packages Overview](#suite-packages)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/README.md)
- [packages/python/goldenmatch/README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/README.md)
- [packages/python/goldencheck/README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldencheck/README.md)
- [packages/python/goldenflow/README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenflow/README.md)
- [packages/python/infermap/README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/infermap/README.md)
- [packages/typescript/goldencheck-types/README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/typescript/goldencheck-types/README.md)
</details>

# Golden Suite

**A polyglot data-quality and entity-resolution toolkit. Polished, opinionated, AI-native.**

*GoldenCheck profiles → GoldenFlow standardizes → GoldenMatch deduplicates → GoldenPipe orchestrates. With InferMap for schema mapping and a Rust extension layer for Postgres / DuckDB.*

## Overview

Golden Suite is a comprehensive toolkit for data quality and entity resolution, designed to handle the complete lifecycle of messy data: profiling, standardization, deduplication, and orchestration. The project supports both Python and TypeScript ecosystems, with optional Rust acceleration for high-performance workloads.

Source: [README.md:1-5]()

## Packages Overview

| Tool | Languages | Purpose | Install |
|------|-----------|---------|---------|
| **[GoldenMatch](packages/python/goldenmatch/README.md)** | Python · TS | Zero-config entity resolution. Fuzzy + exact + probabilistic + LLM. Headline package. | `pip install goldenmatch` · `npm i goldenmatch` |
| **[GoldenCheck](packages/python/goldencheck/README.md)** | Python · TS | Data-quality scanning: encoding, Unicode, format validation, anomaly detection. | `pip install goldencheck` · `npm i goldencheck` |
| **[GoldenFlow](packages/python/goldenflow/README.md)** | Python · TS | Transforms & standardizers: phone, date, address, categorical normalization. | `pip install goldenflow` · `npm i goldenflow` |
| **[GoldenPipe](packages/python/goldenpipe/README.md)** | Python · TS | Orchestrator that wires Check → Flow → Match into one declarative pipeline. | `pip install goldenpipe` · `npm i goldenpipe` |
| **[InferMap](packages/python/infermap/README.md)** | Python · TS | Schema mapping engine — auto-aligns columns across heterogeneous sources. | `pip install infermap` · `npm i infermap` |
| **[goldenmatch-extensions](packages/rust/extensions/README.md)** | Rust | Postgres extension (pgrx) + DuckDB UDFs. SQL-native fuzzy matching. | source build |
| **[dbt-goldensuite](packages/python/goldenmatch/dbt-goldensuite/README.md)** | dbt · Python | dbt package — quality-gate tests, correction CRUD macros + GoldenCheck assertions for warehouse models. | `pip install dbt-goldensuite` |
| **[goldencheck-action](packages/actions/goldencheck/README.md)** | YAML | GitHub Action — CI with PR comments for data validation. | `uses: benseverndev-oss/goldencheck-action@v1` |

Source: [README.md:28-41]()

## Architecture

```mermaid
graph LR
    A[Raw Data] --> B[GoldenCheck<br/>Profile & Validate]
    B --> C[GoldenFlow<br/>Standardize]
    C --> D[InferMap<br/>Schema Mapping]
    D --> E[GoldenMatch<br/>Deduplicate]
    E --> F[GoldenPipe<br/>Orchestrate]
    
    G[Postgres/DuckDB] --> E
    H[GitHub CI] --> B
    I[MCP Server] --> F
```

### Data Flow

1. **GoldenCheck** profiles your data and discovers quality rules automatically
2. **GoldenFlow** transforms messy fields into canonical formats
3. **InferMap** aligns columns across heterogeneous schemas
4. **GoldenMatch** identifies and merges duplicate records
5. **GoldenPipe** orchestrates the entire pipeline declaratively

Source: [packages/python/goldencheck/README.md:1-10]()

## Quick Start

### Python

```bash
# Headline package: dedupe a CSV in 30 seconds
pip install goldenmatch && goldenmatch dedupe customers.csv
```

Source: [README.md:64-66]()

### TypeScript / Node.js

```bash
# TypeScript / Edge runtimes
npm install goldenmatch
```

Source: [README.md:69]()

## GoldenMatch

The headline package for entity resolution. Supports multiple matching strategies:

- **Fuzzy matching** — handles typos and variations
- **Exact matching** — bit-for-bit comparisons
- **Probabilistic matching** — Bayesian-style confidence scoring
- **LLM matching** — semantic clustering with language models
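
To make the difference between the exact and fuzzy strategies concrete, here is a minimal sketch using only the Python standard library. It does not call GoldenMatch: `difflib.SequenceMatcher` stands in for whatever similarity function the library applies, and the helper names and the 0.85 threshold are illustrative assumptions.

```python
from difflib import SequenceMatcher


def exact_match(a: str, b: str) -> bool:
    """Exact strategy: the two values must be identical."""
    return a == b


def fuzzy_score(a: str, b: str) -> float:
    """Similarity in [0, 1]; case and surrounding whitespace are ignored."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def fuzzy_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Fuzzy strategy: accept the pair when the similarity clears the threshold."""
    return fuzzy_score(a, b) >= threshold


left, right = "Jonathan Smith", "Jonathon Smith"
print(exact_match(left, right))  # False: the names differ by one letter
print(fuzzy_match(left, right))  # True: the similarity is well above 0.85
```

The threshold controls the usual trade-off: lowering it catches more typos but also admits more false merges.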

### Key Features

- Zero-config dedup for common cases
- Configurable matchkeys and blocking strategies
- Field-level score explanations
- Streaming/incremental matching for new records
- PPRL (Privacy-Preserving Record Linkage) for cross-organization matching

### Performance

**v1.24.0** cut wall time on the 10M-record benchmark by 81% with no change in F1:

| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| 10M records (wall time) | 2604s | 502s | -81% |
| Peak RSS | baseline | 18% lower | -18% |
| F1 Score | 0.9886 | 0.9886 | unchanged |

Source: [packages/python/goldenmatch/README.md:1-50]()
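
The wall-time figure in the table follows directly from the two timings. A quick arithmetic check:

```python
before_s = 2604  # wall time before, in seconds
after_s = 502    # wall time after, in seconds

reduction_pct = (before_s - after_s) / before_s * 100
speedup = before_s / after_s

print(f"wall-time reduction: {reduction_pct:.1f}%")  # 80.7%, reported as -81%
print(f"speedup: {speedup:.2f}x")
```

An 81% reduction in wall time corresponds to a little over a 5x speedup on this benchmark.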

## GoldenCheck

Data validation that discovers rules from your data so you don't have to write them.

### Core Capabilities

- Automatic rule discovery from data patterns
- Encoding and Unicode validation
- Format validation and anomaly detection
- Health score grading (A-F)
- Multiple output formats: HTML, JSON, TUI
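
As a rough idea of what discovering a rule from data can look like, the sketch below infers the dominant character-class pattern of a column and flags values that deviate from it. This is not GoldenCheck's algorithm: the `shape` and `discover_format_rule` helpers and the 0.8 support threshold are hypothetical.

```python
from collections import Counter


def shape(value: str) -> str:
    """Collapse a value to its character-class shape, e.g. 'AB-12' -> 'AA-99'."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)


def discover_format_rule(values: list[str], min_support: float = 0.8):
    """Return (dominant_shape, violations) if one shape covers enough rows, else None."""
    if not values:
        return None
    shapes = Counter(shape(v) for v in values)
    dominant, count = shapes.most_common(1)[0]
    if count / len(values) < min_support:
        return None  # no single format is common enough to call it a rule
    violations = [v for v in values if shape(v) != dominant]
    return dominant, violations


zip_codes = ["30301", "10001", "94105", "6060A", "73301"]
rule = discover_format_rule(zip_codes)
print(rule)  # ('99999', ['6060A'])
```

Four of the five values share the five-digit shape, so that shape becomes the rule and the remaining value is reported as a violation.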

### Domain Type Packs

Community-contributed semantic type definitions for improved detection:

| Domain | Types | Description |
|--------|-------|-------------|
| [healthcare](domains/healthcare.yaml) | 10 | NPI, ICD codes, insurance IDs, patient demographics, CPT, DRG |
| [finance](domains/finance.yaml) | 8 | Account numbers, routing numbers, CUSIP/ISIN, currency, transactions |
| [ecommerce](domains/ecommerce.yaml) | 9 | SKUs, order IDs, tracking numbers, categories, shipping |

Source: [packages/typescript/goldencheck-types/README.md:1-30]()

## GoldenFlow

Transforms and standardizers for messy data fields.

### Transform Categories

| Category | Count | Examples |
|----------|-------|----------|
| Text Transforms | 18 | strip, lowercase, normalize_unicode, remove_html_tags |
| Phone Transforms | 5 | phone_e164, phone_national, phone_format |
| Date Transforms | 7 | date_parse, date_format, date_floor |
| Address Transforms | 6 | address_parse, address_standardize |
| Numeric Transforms | 4 | parse_currency, parse_number |
| Categorical Transforms | 4 | category_normalize, category_map |

Source: [packages/python/goldenflow/README.md:1-100]()

## InferMap

Inference-driven schema mapping engine. Maps messy source columns to known target schemas with confidence scores and human-readable reasoning.
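
The idea of mapping source columns to a target schema with a confidence score and a reason can be sketched with name similarity alone. The code below is an assumption-laden illustration, not InferMap's API: `normalize`, `map_columns`, and the 0.6 cutoff are hypothetical, and a real engine would likely also look at column values and types.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and drop separators so 'E-Mail' compares as 'email'."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def map_columns(source: list[str], target: list[str], min_confidence: float = 0.6):
    """Map each source column to its most similar target column, with a reason."""
    mapping = {}
    for src in source:
        scored = [
            (SequenceMatcher(None, normalize(src), normalize(tgt)).ratio(), tgt)
            for tgt in target
        ]
        confidence, best = max(scored)
        if confidence >= min_confidence:
            mapping[src] = {
                "target": best,
                "confidence": round(confidence, 2),
                "reason": f"name similarity {confidence:.2f} after normalization",
            }
    return mapping


result = map_columns(["E-Mail", "phone_no", "fname"], ["email", "phone", "first_name"])
for src, info in result.items():
    print(src, "->", info["target"], info["confidence"])
```

Columns whose best score falls below the cutoff are left unmapped rather than guessed, which keeps low-confidence matches visible for review.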

### Supported Data Sources

- CSV files
- DataFrames
- Database tables
- In-memory records

### TypeScript Compatibility

- Next.js Server Components
- Route Handlers
- Server Actions
- Edge Runtime

Source: [packages/python/infermap/README.md:1-80]()

## Optional Components

### Native Acceleration

For maximum performance on large datasets:

```bash
pip install "goldenmatch[native]"
```

This pulls `goldenmatch-native`, a separately distributed compiled (Rust/PyO3 abi3) runtime.

### MCP Server (Claude Desktop)

```bash
pip install "goldencheck[mcp]"
```

Source: [packages/python/goldencheck/README.md:1-50]()

## Integrations

### dbt

Add data-quality gates to dbt:

```yaml
# packages.yml
packages:
  - package: benseverndev-oss/dbt-goldensuite
```

### GitHub Actions

```yaml
- uses: benseverndev-oss/goldencheck-action@v1
  with:
    files: "data/*.csv"
    fail-on: error
```

### Airflow

12 drop-in DAGs are available at `examples/airflow/`.

### MCP Container

Run the full suite from a single MCP container:

```bash
docker run ghcr.io/benseverndev-oss/goldensuite-mcp:latest
```

Source: [README.md:42-60]()

## Web UI

GoldenMatch includes an interactive web workbench:

```bash
pip install "goldenmatch[web]"
goldenmatch serve-ui <project>
```

Features:
- Pair drilldown with cluster members
- Field-level diff view
- Natural language explanations per pair

Source: [README.md:58-62]()

## Latest Release

**v1.24.0** — 10M QIS-bucket-realistic: 2604s → 502s (-81% wall) at F1=0.9886 invariant + 18% RSS reduction.

Key improvements:
- ~15 performance PRs
- Chao1 scale-aware cardinality
- Heuristic rule expansion
- Diagnostic harness

See [CHANGELOG.md](CHANGELOG.md) for the full PR list.

Source: [community_context](community_context)
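
The release notes name Chao1 without describing how it is applied. As general background rather than a description of GoldenMatch's implementation, Chao1 is a standard estimator of the number of distinct values in a population, computed from how many values appear exactly once and exactly twice in a sample. The sketch below uses the bias-corrected form; the `chao1` function name is illustrative.

```python
from collections import Counter


def chao1(sample: list[str]) -> float:
    """Bias-corrected Chao1 estimate of the number of distinct values."""
    counts = Counter(sample)
    observed = len(counts)
    f1 = sum(1 for c in counts.values() if c == 1)  # values seen exactly once
    f2 = sum(1 for c in counts.values() if c == 2)  # values seen exactly twice
    return observed + f1 * (f1 - 1) / (2 * (f2 + 1))


sample = ["a", "a", "b", "b", "c", "d", "e", "f"]
print(chao1(sample))  # 8.0: six observed values plus an estimated two unseen
```

Many singletons relative to doubletons suggest the sample has missed values, so the estimate rises above the observed count.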

## Getting Help

| Resource | Link |
|----------|------|
| Documentation | [Wiki](https://github.com/benseverndev-oss/goldenmatch/wiki) |
| Examples | [Python examples](./packages/python/goldenmatch/examples/) · [TypeScript examples](./packages/typescript/goldenmatch/examples/) |
| Issues | [GitHub Issues](https://github.com/benseverndev-oss/goldenmatch/issues) |
| Discussions | [GitHub Discussions](https://github.com/benseverndev-oss/goldenmatch/discussions) |

## License

MIT — see [LICENSE](LICENSE)

---

<a id='getting-started'></a>

## Getting Started

### Related Pages

Related topics: [Installation](#installation), [System Architecture](#architecture)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [examples/python/01_quickstart_dedupe.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/examples/python/01_quickstart_dedupe.py)
- [examples/python/02_full_suite_pipeline.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/examples/python/02_full_suite_pipeline.py)
- [examples/python/03_multi_source_unify.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/examples/python/03_multi_source_unify.py)
- [packages/python/goldenmatch/examples/basic_dedupe.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/examples/basic_dedupe.py)
- [packages/typescript/goldenmatch/examples/README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/typescript/goldenmatch/examples/README.md)
- [README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/README.md)
</details>

The Golden Suite is a polyglot data-quality and entity-resolution toolkit designed for deduplication, data standardization, and schema mapping. It consists of five core packages: **GoldenCheck** (data validation), **GoldenFlow** (transforms and standardization), **GoldenMatch** (record deduplication), **GoldenPipe** (pipeline orchestration), and **InferMap** (schema mapping). Source: [README.md:1-10]()

This guide walks you through installation, basic usage patterns, and recommended starting points for common use cases.

---

## Installation

### Python

```bash
# Core deduplication package
pip install goldenmatch

# Optional: native acceleration (Rust runtime for 5x speedup)
pip install "goldenmatch[native]"

# Optional: web UI workbench
pip install "goldenmatch[web]"

# Full suite
pip install goldencheck goldenflow goldenpipe infermap
```

Source: [README.md:80-85]()

### TypeScript / Node.js

```bash
# Core package (zero runtime dependencies)
npm install goldenmatch

# Optional peer dependencies for specific features
npm install yaml  # YAML config support
```

Source: [packages/typescript/goldenmatch/examples/README.md:1-15]()

---

## Quick Start: Dedupe a CSV in 30 Seconds

The fastest way to get started is deduplicating a customer CSV file with zero configuration:

```bash
goldenmatch dedupe customers.csv
```

This single command performs blocking, scoring, clustering, and golden record creation. For Python scripts, the equivalent:

```python
from goldenmatch import dedupe

# Zero-config: infer blocking and matching strategy
result = dedupe("customers.csv")
result.to_csv("customers_deduped.csv")
```

Source: [examples/python/01_quickstart_dedupe.py:1-20]()
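
The four stages named above (blocking, scoring, clustering, golden record creation) can be sketched in plain Python. This is a toy illustration under stated assumptions, not GoldenMatch's implementation: blocking on `zip`, scoring names with `difflib`, the 0.85 threshold, and the "longest value wins" merge rule are all hypothetical choices.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

records = [
    {"id": 1, "name": "Jon Smith", "zip": "30301", "email": ""},
    {"id": 2, "name": "John Smith", "zip": "30301", "email": "john@example.com"},
    {"id": 3, "name": "Mary Jones", "zip": "30301", "email": "mary@example.com"},
    {"id": 4, "name": "Jon Smith", "zip": "94105", "email": "js@example.com"},
]

# 1. Blocking: only records sharing a zip code are compared.
blocks = defaultdict(list)
for rec in records:
    blocks[rec["zip"]].append(rec)

# 2. Scoring: name similarity for each candidate pair inside a block.
matches = []
for block in blocks.values():
    for a, b in combinations(block, 2):
        score = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
        if score >= 0.85:  # hypothetical threshold
            matches.append((a["id"], b["id"]))

# 3. Clustering: union-find over the matched pairs.
parent = {rec["id"]: rec["id"] for rec in records}


def find(x: int) -> int:
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x


for a, b in matches:
    parent[find(b)] = find(a)

clusters = defaultdict(list)
for rec in records:
    clusters[find(rec["id"])].append(rec)

# 4. Golden record: per field, keep the longest value found in the cluster.
golden = [
    {f: max((r[f] for r in members), key=len) for f in ("name", "zip", "email")}
    for members in clusters.values()
]
print(golden)
```

Records 1 and 2 merge into one golden record, while record 4 stays separate because it sits in a different block. That is the cost of blocking: it cuts the number of comparisons but can hide true matches that disagree on the blocking key.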

---

## Basic Deduplication

### Minimal Script

The `basic_dedupe.py` example demonstrates finding and merging duplicate records using fuzzy matching:

```python
from goldenmatch import GoldenMatch
import polars as pl

# Load your data
records = pl.read_csv("customers.csv")

# Initialize with auto-configured parameters
matcher = GoldenMatch()
clusters = matcher.fit_predict(records)

# Get deduplicated golden records
golden_records = matcher.build_golden_records(clusters)
golden_records.write_csv("customers_golden.csv")
```

Source: [packages/python/goldenmatch/examples/basic_dedupe.py:1-30]()

### With Ground Truth Evaluation

To measure accuracy against labeled data:

```python
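import polars as pl
from goldenmatch import GoldenMatch, evaluate

# Load the records to deduplicate
records = pl.read_csv("customers.csv")

matcher = GoldenMatch()
clusters = matcher.fit_predict(records)

# Evaluate against ground truth clusters.
# ground_truth_clusters is assumed to be loaded from your own labeled data.
metrics = evaluate(clusters, ground_truth_clusters)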
print(f"Precision: {metrics.precision}")
print(f"Recall: {metrics.recall}")
print(f"F1: {metrics.f1}")
```

Source: [examples/python/01_quickstart_dedupe.py:40-60]()
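
For intuition about what these three numbers measure, the sketch below computes pairwise precision, recall, and F1 from two clusterings. It is one common definition and may differ from what `evaluate` computes internally; the `pairs` and `pairwise_metrics` helpers are hypothetical.

```python
from itertools import combinations


def pairs(clusters: list[set[int]]) -> set[tuple[int, ...]]:
    """All unordered record pairs that share a cluster."""
    return {tuple(sorted(p)) for c in clusters for p in combinations(c, 2)}


def pairwise_metrics(predicted: list[set[int]], truth: list[set[int]]):
    """Precision, recall, and F1 over record pairs."""
    pred, gold = pairs(predicted), pairs(truth)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 1.0
    recall = tp / len(gold) if gold else 1.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


predicted = [{1, 2, 3}, {4}, {5, 6}]
truth = [{1, 2}, {3}, {4}, {5, 6}]
precision, recall, f1 = pairwise_metrics(predicted, truth)
print(precision, recall, round(f1, 3))  # 0.5 1.0 0.667
```

Here the prediction wrongly pulls record 3 into the first cluster, which adds two false pairs and halves precision while recall stays perfect.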

---

## Full Suite Pipeline

For production workflows, compose all Golden Suite components:

```mermaid
graph TD
    A[Raw Data] --> B[GoldenCheck<br/>Profile & Validate]
    B --> C[GoldenFlow<br/>Standardize]
    C --> D[InferMap<br/>Schema Mapping]
    D --> E[GoldenMatch<br/>Deduplicate]
    E --> F[GoldenPipe<br/>Orchestrate]
```

Source: [examples/python/02_full_suite_pipeline.py:1-50]()

### Pipeline Example

```python
from goldencheck import scan, GoldenCheckConfig
from goldenflow import apply_transforms, TransformPipeline
from infermap import InferMap
from goldenpipe import GoldenPipe

# Step 1: Profile and validate data quality
profile = scan("raw_data.csv", config=GoldenCheckConfig())
print(f"Health grade: {profile.health_grade}")

# Step 2: Standardize messy fields
pipeline = TransformPipeline([
    {"column": "phone", "transform": "phone_e164"},
    {"column": "email", "transform": "lowercase"},
    {"column": "address", "transform": "address_standardize"},
])
standardized = apply_transforms("raw_data.csv", pipeline)

# Step 3: Map to target schema (target_schema is assumed to be defined by the caller)
mapper = InferMap()
schema_mapping = mapper.fit_predict(standardized, target_schema)
mapped = mapper.transform(standardized)

# Step 4: Deduplicate
from goldenmatch import GoldenMatch
matcher = GoldenMatch()
clusters = matcher.fit_predict(mapped)
golden = matcher.build_golden_records(clusters)

# Step 5: Orchestrate with GoldenPipe
pipe = GoldenPipe(name="customer_360")
pipe.add_step("check", scan, config=GoldenCheckConfig())
pipe.add_step("flow", apply_transforms, pipeline=pipeline)
pipe.add_step("match", matcher)
result = pipe.run("raw_data.csv")
```

Source: [examples/python/02_full_suite_pipeline.py:20-60]()

---

## Multi-Source Unification

Merge customer data from multiple sources (CRM, marketing, vendor) into a unified golden record:

```python
import polars as pl
from goldenmatch import GoldenMatch

# Load data from multiple sources
crm = pl.read_csv("crm_export.csv")
marketing = pl.read_csv("marketing_platform.csv")
vendor = pl.read_csv("vendor_data.csv")

# Create unified dataset with source tracking
unified = pl.concat([
    crm.with_columns(pl.lit("crm").alias("_source")),
    marketing.with_columns(pl.lit("marketing").alias("_source")),
    vendor.with_columns(pl.lit("vendor").alias("_source")),
])

# Match and deduplicate across sources
matcher = GoldenMatch()
clusters = matcher.fit_predict(unified)

# Build golden records with provenance
golden = matcher.build_golden_records(
    clusters,
    provenance=True,  # Track which source contributed each field
)
```

Source: [examples/python/03_multi_source_unify.py:1-40]()
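
To show what field-level provenance can look like in the output, here is a small standalone sketch. It assumes a simple survivorship rule (the first non-empty value from the highest-priority source wins); the `merge_with_provenance` helper and the priority order are hypothetical and are not taken from GoldenMatch.

```python
def merge_with_provenance(cluster: list[dict], priority: list[str]) -> dict:
    """Build one golden record; each field keeps its value and contributing source."""
    ranked = sorted(cluster, key=lambda r: priority.index(r["_source"]))
    fields = {k for r in cluster for k in r if k != "_source"}
    golden = {}
    for field in sorted(fields):
        for rec in ranked:
            value = rec.get(field)
            if value:  # first non-empty value from the highest-priority source wins
                golden[field] = {"value": value, "source": rec["_source"]}
                break
    return golden


cluster = [
    {"_source": "marketing", "name": "J. Smith", "email": "john@example.com", "phone": ""},
    {"_source": "crm", "name": "John Smith", "email": "", "phone": "+15550123456"},
]
golden = merge_with_provenance(cluster, priority=["crm", "marketing", "vendor"])
for field, info in golden.items():
    print(field, info)
```

The CRM record supplies the name and phone, while the email falls through to the marketing record because the CRM value is empty.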

---

## TypeScript / Node.js Quick Start

### Basic Deduplication

```typescript
import { GoldenMatch } from 'goldenmatch';
import { readFileSync } from 'fs';

// Load CSV data (readFileSync is synchronous, so no await is needed)
const records = readFileSync('customers.csv', 'utf-8');

// Initialize matcher with default config
const matcher = new GoldenMatch();
const clusters = await matcher.fitPredict(records);

// Build golden records
const golden = matcher.buildGoldenRecords(clusters);
console.log(`Found ${clusters.length} unique records`);
```

Source: [packages/typescript/goldenmatch/examples/README.md:1-30]()

### Edge Runtime Example

```typescript
// Vercel Edge or Cloudflare Workers compatible
import { GoldenMatch } from 'goldenmatch';

export default {
  async fetch(request: Request): Promise<Response> {
    const matcher = new GoldenMatch();
    const records = await request.json();
    const clusters = await matcher.fitPredict(records);
    return Response.json({ clusters });
  }
};
```

---

## Example Scripts Reference

| Script | Use Case | Key Dependencies |
|--------|----------|------------------|
| `01_quickstart_dedupe.py` | Zero-config CSV deduplication | goldenmatch |
| `02_full_suite_pipeline.py` | Complete Check → Flow → Match pipeline | goldencheck, goldenflow, goldenmatch, goldenpipe |
| `03_multi_source_unify.py` | Cross-system customer unification | goldenmatch |
| `basic_dedupe.py` | Find and merge duplicate records | goldenmatch |
| `explain_match.py` | Per-field score breakdown | goldenmatch |
| `evaluate_and_tune.py` | Accuracy measurement against ground truth | goldenmatch |

Source: [examples/README.md:1-30]()

Run any example:

```bash
cd examples/python
python 01_quickstart_dedupe.py
```

---

## Performance Notes

- **v1.24.0**: 10M records in 502s (an 81% wall-time reduction) with F1 unchanged at 0.9886, using `backend="bucket"`.
- **Native acceleration**: Install `goldenmatch[native]` for Rust-powered clustering and block-scoring kernels.
- **Memory**: ~18% RSS reduction in v1.24.0 compared to v1.23.0 on large datasets.

Source: [Community Context - v1.24.0]()

---

## Next Steps

| Goal | Resource |
|------|----------|
| Customize matching strategy | [Advanced Config Example](examples/python/advanced_config.py) |
| Build custom matchkeys | [Custom Config Example](examples/python/custom_config.py) |
| Use LLM for clustering | [LLM Product Matching](examples/python/llm_product_matching.py) |
| Incremental matching | [Streaming Example](examples/python/streaming_incremental.py) |
| PPRL (privacy-preserving) | [PPRL Healthcare](examples/python/pprl_healthcare.py) |
| Run in Airflow | [Airflow DAGs](examples/airflow/) |

---

<a id='suite-packages'></a>

## Suite Packages Overview

### Related Pages

Related topics: [Core Matching Engine](#core-matching)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/README.md)
- [packages/python/goldencheck/README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldencheck/README.md)
- [packages/python/goldenflow/README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenflow/README.md)
- [packages/python/infermap/README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/infermap/README.md)
- [examples/README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/examples/README.md)
- [packages/typescript/goldencheck-types/README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/typescript/goldencheck-types/README.md)
- [packages/rust/extensions/README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/rust/extensions/README.md)
</details>

The Golden Suite is a polyglot data-quality and entity-resolution toolkit that provides a complete pipeline from data profiling to deduplication. The suite consists of multiple purpose-built packages that can be used independently or composed together for end-to-end workflows.

Source: [README.md:1-15]()

## Package Architecture

```mermaid
graph TD
    A[Raw Data] --> B[GoldenCheck]
    B --> C[GoldenFlow]
    C --> D[InferMap]
    D --> E[GoldenMatch]
    E --> F[GoldenPipe]
    
    G[goldenmatch-extensions] --> E
    H[dbt-goldensuite] --> B
    I[goldencheck-types] --> B
    
    J[goldencheck-action] --> B
    K[GoldenCheck MCP] --> B
    L[Goldensuite MCP] --> F
```

## Package Summary

| Package | Languages | Purpose | Install |
|---------|-----------|---------|---------|
| **GoldenMatch** | Python, TypeScript | Zero-config entity resolution. Fuzzy + exact + probabilistic + LLM | `pip install goldenmatch` / `npm i goldenmatch` |
| **GoldenCheck** | Python, TypeScript | Data-quality scanning: encoding, Unicode, format validation, anomaly detection | `pip install goldencheck` / `npm i goldencheck` |
| **GoldenFlow** | Python, TypeScript | Transforms & standardizers: phone, date, address, categorical normalization | `pip install goldenflow` / `npm i goldenflow` |
| **GoldenPipe** | Python, TypeScript | Orchestrator that wires Check → Flow → Match into one declarative pipeline | `pip install goldenpipe` / `npm i goldenpipe` |
| **InferMap** | Python, TypeScript | Schema mapping engine — auto-aligns columns across heterogeneous sources | `pip install infermap` / `npm i infermap` |
| **goldenmatch-extensions** | Rust | Postgres extension (pgrx) + DuckDB UDFs. SQL-native fuzzy matching | source build |
| **dbt-goldensuite** | dbt, Python | dbt package — quality-gate tests, correction CRUD macros + GoldenCheck assertions | `pip install dbt-goldensuite` |
| **goldencheck-action** | — | GitHub Action for CI with PR comments | marketplace |
| **goldencheck-types** | TypeScript | Community-contributed domain type packs (healthcare, finance, e-commerce) | `npm i goldencheck-types` |

Source: [README.md:80-100]()

## GoldenCheck

GoldenCheck is the data-quality scanning and profiling component of the Golden Suite. It detects issues such as encoding problems, Unicode anomalies, format violations, and data anomalies.

### Core Capabilities

- **Encoding Detection**: Identifies and reports encoding issues in text fields
- **Unicode Validation**: Detects malformed Unicode sequences and normalization issues
- **Format Validation**: Validates against expected formats (email, phone, URL, etc.)
- **Anomaly Detection**: Statistical analysis to identify outliers and unusual patterns
- **LLM-Powered Analysis**: Uses language models to identify semantic data quality issues missed by automated profilers

Source: [packages/python/goldencheck/README.md:1-50]()

### Domain Type Packs

GoldenCheck supports domain-specific type definitions through community-contributed packs:

| Domain | Types Included |
|--------|----------------|
| **Healthcare** | NPI, ICD codes, insurance IDs, patient demographics, CPT, DRG |
| **Finance** | Account numbers, routing numbers, CUSIP/ISIN, currency, transactions |
| **E-commerce** | SKUs, order IDs, tracking numbers, categories, shipping |

Source: [packages/typescript/goldencheck-types/README.md:1-30]()

### MCP Server

GoldenCheck includes an MCP (Model Context Protocol) server providing 10 agent-level tools, including:

- `analyze_data` — Domain detection and strategy recommendation
- `auto_triage` — Automated issue classification
- `explain_finding` — Natural language explanation of findings
- `explain_column` — Column-level analysis
- `compare_domains` — Cross-domain comparison
- `generate_handoff` — Pipeline handoff generation

Source: [packages/typescript/goldencheck/src/node/mcp/agent-tools.ts:1-50]()

## GoldenFlow

GoldenFlow provides 76+ transformation functions for standardizing messy data fields. It focuses on transforming data before matching to improve deduplication accuracy.

Source: [packages/python/goldenflow/README.md:1-50]()

### Transform Categories

#### Text Transforms (18)

| Transform | Description |
|-----------|-------------|
| `strip` | Trim whitespace |
| `lowercase` / `uppercase` | Case conversion |
| `title_case` | Proper casing ("john smith" → "John Smith") |
| `normalize_unicode` | NFKD normalization, strip accents |
| `normalize_quotes` | Smart/curly quotes → straight quotes |
| `collapse_whitespace` | Multiple spaces → single space |
| `remove_punctuation` | Strip punctuation characters |
| `remove_html_tags` | Strip HTML markup from scraped data |
| `fix_mojibake` | Fix common UTF-8/Latin-1 encoding garbling |
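
A few of the text transforms described in the table can be approximated with the standard library. The functions below are stand-ins that share names with the transforms for readability; they are not GoldenFlow's implementations and may differ in edge cases.

```python
import re
import unicodedata


def normalize_unicode(text: str) -> str:
    """NFKD-decompose, then drop combining marks so accented letters lose accents."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_quotes(text: str) -> str:
    """Replace curly single and double quotes with straight ones."""
    table = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})
    return text.translate(table)


def collapse_whitespace(text: str) -> str:
    """Turn any run of whitespace into a single space."""
    return re.sub(r"\s+", " ", text)


raw = "  Jos\u00e9   \u201cPepe\u201d  Garc\u00eda "
clean = collapse_whitespace(normalize_quotes(normalize_unicode(raw))).strip()
print(clean)  # Jose "Pepe" Garcia
```

Applying these before matching means that values differing only in accents, quote style, or spacing compare as equal.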

#### Phone Transforms (5)

| Transform | Description |
|-----------|-------------|
| `phone_e164` | Any format → +15550123456 |
| `phone_national` | Any format → (555) 012-3456 |
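
The two output formats in the table can be reproduced for the simplest case with a short sketch. It assumes North American numbers only (country code 1, ten digits) and is not GoldenFlow's implementation; the `_us` function names are hypothetical, and real-world phone parsing needs per-country rules.

```python
import re
from typing import Optional


def phone_e164_us(raw: str) -> Optional[str]:
    """Normalize a US/Canada number to E.164, or return None if it does not fit."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the leading country code
    if len(digits) != 10:
        return None
    return "+1" + digits


def phone_national_us(raw: str) -> Optional[str]:
    """Format a US/Canada number as (AAA) BBB-CCCC, or return None if invalid."""
    e164 = phone_e164_us(raw)
    if e164 is None:
        return None
    d = e164[2:]
    return f"({d[:3]}) {d[3:6]}-{d[6:]}"


print(phone_e164_us("(555) 012-3456"))    # +15550123456
print(phone_national_us("1-555-012-3456"))  # (555) 012-3456
```

Returning `None` for inputs that do not fit keeps bad values visible instead of silently producing a malformed number.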

#### Date Transforms

- Date parsing and normalization across multiple formats
- Timezone normalization

#### Domain-Specific Transforms

| Domain | Capabilities |
|--------|-------------|
| **Healthcare** | NPI, ICD, CPT, DRG parsing, transaction dates, amount parsing |
| **E-commerce** | SKU normalization, price parsing, order dates, address standardization |
| **Real Estate** | Property addresses, listing dates, price normalization, geo fields |

Source: [packages/python/goldenflow/README.md:50-120]()

## GoldenMatch

GoldenMatch is the core entity resolution (deduplication) engine. It supports multiple matching strategies and scales to millions of records.

Source: [README.md:80-85]()

### Matching Strategies

| Strategy | Use Case |
|----------|----------|
| **Fuzzy Matching** | Name/address variants with typo tolerance |
| **Exact Matching** | Identifier deduplication |
| **Probabilistic Matching** | Record linkage with confidence scores |
| **LLM Clustering** | Semantic product matching for complex domains |
| **PPRL** | Privacy-preserving record linkage (cross-organization) |
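
The source does not say which probabilistic model GoldenMatch uses. As background, a common formulation is the Fellegi-Sunter model, where each field contributes a log-likelihood weight and the total is converted to a match probability. The sketch below uses that model with made-up `m`/`u` probabilities and a made-up prior; all names and numbers are hypothetical.

```python
import math

# Hypothetical per-field probabilities:
# m = P(field agrees | same entity), u = P(field agrees | different entities)
FIELD_PROBS = {
    "name": (0.95, 0.01),
    "zip": (0.90, 0.05),
    "email": (0.98, 0.001),
}


def match_weight(agreements: dict[str, bool]) -> float:
    """Sum of log2 likelihood ratios over the compared fields."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = FIELD_PROBS[field]
        total += math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))
    return total


def match_probability(agreements: dict[str, bool], prior: float = 0.01) -> float:
    """Posterior probability of a match via Bayes' rule on the odds scale."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * 2 ** match_weight(agreements)
    return posterior_odds / (1 + posterior_odds)


all_agree = {"name": True, "zip": True, "email": True}
name_only = {"name": True, "zip": False, "email": False}
print(round(match_probability(all_agree), 4))
print(round(match_probability(name_only), 4))
```

Agreement on a rarely coinciding field such as email adds far more weight than agreement on a zip code, which is why the per-field probabilities matter more than a simple count of matching fields.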

### Performance Benchmarks

| Dataset Size | Wall Time | Peak RSS | F1 Score |
|--------------|-----------|----------|----------|
| 10M records (QIS-bucket-realistic) | 502s (-81%) | 18% reduction | 0.9886 |

Source: [v1.24.0 Release Notes](https://github.com/benseverndev-oss/goldenmatch/releases/tag/v1.24.0)

### Native Acceleration

An optional Rust-based acceleration runtime is available:

```bash
pip install "goldenmatch[native]"
```

This pulls `goldenmatch-native`, a separately distributed compiled (Rust/PyO3 abi3) runtime. The native runtime is discovered automatically when installed.

Source: [v1.21.0 Release Notes](https://github.com/benseverndev-oss/goldenmatch/releases/tag/v1.21.0)

### Configuration Options

| Option | Description |
|--------|-------------|
| `backend` | `"bucket"` (recommended for 5M+ records), `"chunked"` |
| `auto_config` | Auto-tune recall thresholds |
| `lineage_provenance` | Track source row for golden record fields |

## InferMap

InferMap is a schema mapping engine that automatically aligns columns across heterogeneous data sources. It supports both Python and TypeScript with full API parity.

Source: [packages/python/infermap/README.md:1-40]()

### Key Features

- **Auto Schema Alignment**: Detects and maps columns across different source schemas
- **Custom Scorers**: Configurable similarity scoring algorithms
- **Domain Dictionaries**: Industry-specific vocabulary for better matching
- **Calibration Tools**: Score matrix introspection and tuning
- **Edge Runtime Support**: Works in Vercel Edge Runtime and Next.js

Source: [packages/python/infermap/README.md:40-80]()

## GoldenPipe

GoldenPipe is the orchestrator that wires Check → Flow → Match into a single declarative pipeline. It enables pipeline definitions in YAML or Python.

Source: [README.md:85-90]()
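
The Check → Flow → Match wiring can be pictured as an ordered list of named stages, each consuming the previous stage's output. The sketch below is a toy model of that idea, not GoldenPipe's API: `run_pipeline` and the three stand-in stage functions are hypothetical.

```python
from typing import Callable

Step = tuple[str, Callable[[list[dict]], list[dict]]]


def run_pipeline(steps: list[Step], rows: list[dict]) -> list[dict]:
    """Apply each named stage to the output of the previous one."""
    for name, stage in steps:
        rows = stage(rows)
        print(f"{name}: {len(rows)} rows")
    return rows


def check(rows: list[dict]) -> list[dict]:
    """Stand-in for validation: drop rows with no email."""
    return [r for r in rows if r.get("email")]


def flow(rows: list[dict]) -> list[dict]:
    """Stand-in for standardization: trim and lowercase emails."""
    return [{**r, "email": r["email"].strip().lower()} for r in rows]


def match(rows: list[dict]) -> list[dict]:
    """Stand-in for entity resolution: exact dedupe on email."""
    seen: dict[str, dict] = {}
    for r in rows:
        seen.setdefault(r["email"], r)
    return list(seen.values())


pipeline = [("check", check), ("flow", flow), ("match", match)]
rows = [{"email": "A@example.com "}, {"email": "a@example.com"}, {"email": ""}]
output = run_pipeline(pipeline, rows)
print(output)  # [{'email': 'a@example.com'}]
```

Stage order matters here: the two addresses only dedupe because standardization runs before matching.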

### Pipeline Stages

```mermaid
graph LR
    A[Check] --> B[Flow]
    B --> C[Match]
    C --> D[Output]
    
    style A fill:#e1f5fe
    style B fill:#fff3e0
    style C fill:#e8f5e9
    style D fill:#f3e5f5
```

### Deployment Options

| Runtime | Description |
|---------|-------------|
| **Airflow** | 12 drop-in DAGs for daily/incremental/warehouse-native dedupe |
| **MCP Container** | Single container: `docker run ghcr.io/benseverndev-oss/goldensuite-mcp:latest` |
| **Python/CLI** | Direct script execution |

Source: [README.md:90-110]()

## goldenmatch-extensions

A Rust-based Postgres extension using pgrx that provides SQL-native fuzzy matching capabilities.

Source: [packages/rust/extensions/README.md:1-30]()

### Capabilities

- **13 pgrx Functions**: Core API parity with Python/TypeScript
- **goldenflow_* Transforms**: SQL-level data standardization
- **memory_learn/memory_stats CRUD**: Learning memory database operations
- **DuckDB UDFs**: Fuzzy matching in DuckDB queries

Source: [goldenmatch_pg v0.5.0 Release](https://github.com/benseverndev-oss/goldenmatch/releases/tag/goldenmatch-pg-v0.5.0)

## dbt-goldensuite

A dbt package that adds Golden Suite capabilities as dbt tests and macros.

### Features

- Quality-gate tests for warehouse models
- Correction CRUD macros for data repair
- GoldenCheck assertions integrated with dbt test framework

Source: [packages/python/goldenmatch/dbt-goldensuite/README.md:1-20]()

## goldencheck-action

GitHub Action for CI integration that:

- Runs GoldenCheck scans on PR data changes
- Posts PR comments with findings
- Fails builds based on configurable severity thresholds

Source: [packages/actions/goldencheck/README.md:1-30]()

## Examples and Quick Start

The repository includes comprehensive examples organized by deployment target:

Source: [examples/README.md:1-30]()

| Directory | Audience | Highlights |
|-----------|----------|------------|
| `python/` | Python users | 6 scripts: zero-config quickstart, full-suite pipeline, customer 360, PPRL |
| `typescript/` | TypeScript/edge users | 4 scripts: quickstart, Vercel-Edge, MCP client |
| `sql/` | SQL/warehouse users | DuckDB + Postgres core-API examples |
| `airflow/` | Data-platform users | 12 drop-in DAGs |

### Quick Start Commands

```bash
# Headline package: dedupe a CSV
pip install goldenmatch && goldenmatch dedupe customers.csv

# TypeScript / Edge
npm install goldenmatch

# Full suite via MCP
docker run ghcr.io/benseverndev-oss/goldensuite-mcp:latest
```

Source: [README.md:110-130]()

## Related Documentation

- [Getting Started with GoldenMatch](https://github.com/benseverndev-oss/goldenmatch/wiki/Getting-Started)
- [GoldenMatch Web UI](https://github.com/benseverndev-oss/goldenmatch/wiki/Web-UI)
- [ER Agent / A2A Integration](https://github.com/benseverndev-oss/goldenmatch/wiki/ER-Agent)
- [InferMap Wiki](https://github.com/benseverndev-oss/infermap/wiki)

---

<a id='installation'></a>

## Installation

### Related Pages

Related topics: [Getting Started](#getting-started)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [packages/python/goldenmatch/README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/README.md)
- [packages/python/goldencheck/README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldencheck/README.md)
- [packages/python/goldenflow/README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenflow/README.md)
- [packages/python/goldenpipe/README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenpipe/README.md)
- [packages/python/infermap/README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/infermap/README.md)
- [packages/typescript/goldenmatch/README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/typescript/goldenmatch/README.md)
- [packages/python/goldensuite-mcp/README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldensuite-mcp/README.md)
- [packages/typescript/goldencheck/README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/typescript/goldencheck/README.md)
- [packages/rust/extensions/README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/rust/extensions/README.md)
- [README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/README.md)
- [examples/README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/examples/README.md)
</details>

# Installation

The Golden Suite is a polyglot data-quality and entity-resolution toolkit available across Python, TypeScript/Node.js, Rust, and SQL environments. This page documents installation methods for all suite packages, optional dependencies, system requirements, and deployment patterns.

## System Requirements

| Component | Minimum Version | Recommended |
|-----------|-----------------|-------------|
| Python | 3.11+ | 3.11, 3.12 |
| Node.js | 20+ | 22 LTS |
| Rust | 1.75+ (for extensions) | Latest stable |
| PostgreSQL | 15+ (for extensions) | 16+ |
| DuckDB | 1.0+ (for extensions) | Latest |

Source: [README.md:1-15]()

## Quick Start

```bash
# Entity resolution (Python)
pip install goldenmatch

# TypeScript / Edge runtimes
npm install goldenmatch
```

Source: [README.md:55-60]()

## Python Installation

All Python packages are published to PyPI and support pip installation with optional extras.

### Core Packages

| Package | Purpose | Install Command |
|---------|---------|-----------------|
| **goldenmatch** | Zero-config entity resolution (fuzzy + exact + probabilistic + LLM) | `pip install goldenmatch` |
| **goldencheck** | Data-quality scanning and validation | `pip install goldencheck` |
| **goldenflow** | Data transforms and standardization | `pip install goldenflow` |
| **goldenpipe** | Pipeline orchestrator | `pip install goldenpipe` |
| **infermap** | Schema mapping engine | `pip install infermap` |

Source: [packages/python/goldenmatch/README.md](), [packages/python/goldencheck/README.md](), [packages/python/goldenflow/README.md](), [packages/python/goldenpipe/README.md](), [packages/python/infermap/README.md]()

### GoldenMatch Optional Dependencies

GoldenMatch supports modular installation through extras:

```bash
# Basic installation
pip install goldenmatch

# With native acceleration (Rust/PyO3 abi3 runtime)
pip install "goldenmatch[native]"

# With LLM support (Anthropic SDK)
pip install "goldenmatch[llm]"

# With baseline profiling support
pip install "goldenmatch[baseline]"

# With semantic type inference
pip install "goldenmatch[semantic]"

# With web UI
pip install "goldenmatch[web]"

# With MCP server
pip install "goldenmatch[mcp]"

# Full installation with all extras
pip install "goldenmatch[native,llm,baseline,semantic,web,mcp]"
```

Source: [packages/python/goldenmatch/README.md](), [README.md:50-55]()

### GoldenCheck Optional Dependencies

```bash
# Basic installation
pip install goldencheck

# With LLM enhancement
pip install "goldencheck[llm]"

# With MCP server for Claude Desktop
pip install "goldencheck[mcp]"

# With all extras
pip install "goldencheck[llm,mcp]"
```

Source: [packages/python/goldencheck/README.md]()

## TypeScript / Node.js Installation

All TypeScript packages are published to npm with zero runtime dependencies for the core packages (edge-safe).

```bash
# Core packages
npm install goldenmatch
npm install goldencheck
npm install goldenflow
npm install goldenpipe
npm install infermap

# MCP server (Node.js only)
npm install @benseverndev-oss/goldensuite-mcp
```

Source: [packages/typescript/goldenmatch/README.md](), [packages/typescript/goldencheck/README.md]()

### Peer Dependencies

Some TypeScript examples require optional peer dependencies:

| Package | Purpose | Install Command |
|---------|---------|-----------------|
| `yaml` | YAML configuration parsing | `npm install yaml` |
| `nodejs-polars` | Parquet reading (Node.js only) | auto-installed when needed |
| `csv-parse` | CSV reading (Node.js only) | auto-installed when needed |
| `@modelcontextprotocol/sdk` | MCP server (Node.js only) | auto-installed when needed |

Source: [packages/typescript/goldenmatch/examples/README.md]()

## Docker Installation

For a self-contained MCP server deployment, use the official container image:

```bash
# Pull the image (if not already cached) and start the MCP server
docker run ghcr.io/benseverndev-oss/goldensuite-mcp:latest
```

Source: [README.md:40]()

## GoldenMatch Native Acceleration

As of v1.21.0, GoldenMatch offers an optional compiled Rust runtime via the `goldenmatch-native` package. It is distributed as a separate wheel and built with PyO3 against Python's stable ABI (abi3), so a single wheel can serve multiple Python versions.

```mermaid
graph TD
    A[User: pip install goldenmatch] --> B{Has native extra?}
    B -->|No| C[Pure-Python wheel]
    B -->|Yes| D[pip install goldenmatch-native]
    D --> E[Rust/PyO3 abi3 runtime]
    E --> F[Auto-discovery at runtime]
    C --> G[Standard Polars backend]
    F --> H[Optimized clustering + scoring kernels]
    G --> H
```

Source: [v1.21.0 Release Notes](), [README.md]()

### Installation Commands for Native

```bash
# Option 1: Install with native extra (recommended)
pip install "goldenmatch[native]"

# Option 2: Install separately
pip install goldenmatch
pip install goldenmatch-native
```

Source: [v1.21.0 Release Notes]()

## Rust Extensions (PostgreSQL / DuckDB)

For SQL-native fuzzy matching, install the Rust extension package.

### PostgreSQL Extension

```bash
# Clone and build from source
git clone https://github.com/benseverndev-oss/goldenmatch.git
cd goldenmatch/packages/rust/extensions
cargo build --release

# Then load the extension from a SQL session (e.g. psql), not from the shell:
#   CREATE EXTENSION goldenmatch_pg;
```

Source: [packages/rust/extensions/README.md]()

### DuckDB UDFs

The same Rust package provides DuckDB UDFs for in-database matching:

```bash
# Install via source build
cargo build --release
# UDFs are loaded via SQL commands in DuckDB
```

Source: [packages/rust/extensions/README.md]()

## MCP Server Setup

The Golden Suite includes an MCP (Model Context Protocol) server for Claude Desktop integration.

### Python Installation

```bash
pip install "goldencheck[mcp]"
# or for full suite
pip install "goldenmatch[mcp]"
```

Source: [packages/python/goldencheck/README.md]()

### Claude Desktop Configuration

Add to your Claude Desktop config (`claude_desktop_config.json`):

```json
{
  "mcpServers": {
    "goldencheck": {
      "command": "python",
      "args": ["-m", "goldencheck.mcp.server"]
    }
  }
}
```

Source: [packages/python/goldencheck/README.md]()

### Docker Deployment

For production MCP deployments:

```bash
docker run ghcr.io/benseverndev-oss/goldensuite-mcp:latest
```

Source: [README.md:40](), [packages/python/goldensuite-mcp/README.md]()

## dbt Integration

Install the dbt package for data warehouse quality gates:

```bash
pip install dbt-goldensuite
```

Source: [README.md:35]()

## GitHub Actions

For CI/CD integration:

```bash
pip install goldencheck
# or use the action directly in workflows
```

Source: [README.md:35]()

## Airflow DAGs

Run the Golden Suite as Airflow DAGs:

```bash
# Install Apache Airflow
pip install apache-airflow

# Use pre-built DAGs from examples/
cp -r examples/airflow/ /path/to/airflow/dags/
```

Source: [examples/README.md]()

## Verification

Verify installation with the following commands:

```bash
# Python packages
python -c "import goldenmatch; print(goldenmatch.__version__)"
python -c "import goldencheck; print(goldencheck.__version__)"
python -c "import goldenflow; print(goldenflow.__version__)"

# TypeScript packages
node -e "console.log(require('goldenmatch/package.json').version)"

# CLI tools
goldenmatch --version
goldencheck --version
goldenflow --version
```
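
The same check can be scripted without importing the packages, which avoids relying on every package exposing a `__version__` attribute. The sketch below reads installed distribution metadata through the standard library; the helper name `installed_versions` is illustrative and not part of the suite.

```python
from importlib import metadata

# Distribution names as they appear in the install commands on this page.
PACKAGES = ["goldenmatch", "goldencheck", "goldenflow", "goldenpipe", "infermap"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name:12} {version or 'not installed'}")
```

A missing package shows up as `not installed` instead of raising, so the script can run in a partially provisioned environment.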

## Common Installation Issues

### Python Version Mismatch

GoldenMatch requires Python 3.11+. Check your version:

```bash
python --version
```
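
For setup scripts, the version check can also be done in Python. This is a minimal sketch; `meets_minimum` is a hypothetical helper, and the `(3, 11)` floor is taken from the System Requirements table.

```python
import sys

MIN_PYTHON = (3, 11)  # minimum version stated in System Requirements


def meets_minimum(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True when the interpreter is at least `minimum` (major, minor)."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:3])
    if meets_minimum():
        print(f"Python {found} is new enough")
    else:
        print(f"Python {found} is too old; 3.11+ is required")
```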

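If your interpreter is older than 3.11, install a newer Python first, then create a virtual environment with it. A virtual environment reuses the interpreter that creates it; it does not upgrade Python.

```bash
python3.11 -m venv golden-env   # Windows: py -3.11 -m venv golden-env
source golden-env/bin/activate  # Linux/macOS
# or
golden-env\Scripts\activate  # Windows
pip install goldenmatch
```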

### Native Extension Not Found

If `goldenmatch-native` isn't auto-discovered:

```bash
# Reinstall with native extra
pip uninstall goldenmatch-native
pip install "goldenmatch[native]"
```

### MCP Server Connection Issues

For Claude Desktop MCP integration, ensure the config is in the correct location:

- Linux: `~/.config/Claude/claude_desktop_config.json`
- macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`
- Windows: `%APPDATA%\Claude\claude_desktop_config.json`
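
When troubleshooting, it can help to resolve the expected path programmatically and confirm that the server entry is present. The sketch below encodes the three locations listed above; the function names are illustrative, and the `goldencheck` server key matches the configuration example earlier on this page.

```python
import json
import os
import sys
from pathlib import Path


def claude_config_path(platform=sys.platform, env=None):
    """Return the expected claude_desktop_config.json path for a platform."""
    env = os.environ if env is None else env
    if platform.startswith("win"):
        base = Path(env.get("APPDATA", "")) / "Claude"
    elif platform == "darwin":
        base = Path.home() / "Library" / "Application Support" / "Claude"
    else:  # treat everything else as Linux
        base = Path.home() / ".config" / "Claude"
    return base / "claude_desktop_config.json"


def has_server(config_text, name="goldencheck"):
    """Check whether a config document declares the named MCP server."""
    config = json.loads(config_text)
    return name in config.get("mcpServers", {})


if __name__ == "__main__":
    path = claude_config_path()
    if path.exists():
        print(path, "declares goldencheck:", has_server(path.read_text()))
    else:
        print("No config found at", path)
```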

## Installation Hierarchy

```mermaid
graph TB
    subgraph "Full Stack Installation"
        A[Golden Suite MCP Container] --> B[GoldenMatch + Native]
        A --> C[GoldenCheck + MCP]
        A --> D[GoldenFlow]
        A --> E[GoldenPipe]
        A --> F[InferMap]
    end
    
    subgraph "Python-Only Stack"
        G[goldenmatch] --> H[Polars]
        G --> I[Polars-Runtime]
        G --> J[goldenmatch-native optional]
    end
    
    subgraph "TypeScript-Only Stack"
        K[goldenmatch] --> L[Zero deps]
        K --> M[nodejs-polars optional]
    end
```

## Related Pages

- [Quick Start Guide](Quick-Start)
- [GoldenMatch Documentation](GoldenMatch)
- [GoldenCheck Documentation](GoldenCheck)
- [GoldenFlow Transforms](GoldenFlow)
- [MCP Server Setup](MCP-Server)
- [Rust Extensions](Extensions)

---

<a id='architecture'></a>

## System Architecture

### Related Pages

Related topics: [Backend Systems](#backend-systems), [Core Matching Engine](#core-matching), [Blocking and Scoring](#blocking-scoring)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [packages/python/goldenmatch/README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/README.md)
- [packages/python/goldencheck/README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldencheck/README.md)
- [packages/python/goldenflow/README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenflow/README.md)
- [packages/python/infermap/README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/infermap/README.md)
- [packages/typescript/goldencheck-types/README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/typescript/goldencheck-types/README.md)
- [packages/typescript/goldencheck/src/core/llm/prompts.ts](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/typescript/goldencheck/src/core/llm/prompts.ts)
- [packages/typescript/goldencheck/src/core/reporters/json.ts](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/typescript/goldencheck/src/core/reporters/json.ts)
- [packages/typescript/goldencheck/src/node/mcp/agent-tools.ts](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/typescript/goldencheck/src/node/mcp/agent-tools.ts)
- [packages/typescript/infermap/src/node/mcp/server.ts](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/typescript/infermap/src/node/mcp/server.ts)
- [packages/typescript/goldencheck/src/core/semantic/types.ts](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/typescript/goldencheck/src/core/semantic/types.ts)
</details>

# System Architecture

The Golden Suite is a polyglot data-quality and entity-resolution toolkit designed for AI-native workflows. The architecture follows a modular, pipeline-oriented design where each component addresses a specific stage of data processing: profiling, standardization, schema mapping, and deduplication. Source: [packages/python/goldenmatch/README.md]()

## Architecture Overview

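The system consists of five core packages (GoldenMatch, GoldenCheck, GoldenFlow, GoldenPipe, and InferMap), each published for both Python and TypeScript, plus a Rust extension layer and integration packages for dbt and GitHub Actions. Each package is independently installable and can operate standalone or as part of an orchestrated pipeline.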

### Package Overview

| Package | Language | Purpose | Install Command |
|---------|----------|---------|-----------------|
| **GoldenMatch** | Python · TS | Zero-config entity resolution (fuzzy + exact + probabilistic + LLM) | `pip install goldenmatch` · `npm i goldenmatch` |
| **GoldenCheck** | Python · TS | Data-quality scanning: encoding, Unicode, format validation, anomaly detection | `pip install goldencheck` · `npm i goldencheck` |
| **GoldenFlow** | Python · TS | Transforms & standardizers: phone, date, address, categorical normalization | `pip install goldenflow` · `npm i goldenflow` |
| **GoldenPipe** | Python · TS | Orchestrator wiring Check → Flow → Match into declarative pipelines | `pip install goldenpipe` · `npm i goldenpipe` |
| **InferMap** | Python · TS | Schema mapping engine — auto-aligns columns across heterogeneous sources | `pip install infermap` · `npm i infermap` |
| **goldenmatch-extensions** | Rust | Postgres extension (pgrx) + DuckDB UDFs for SQL-native fuzzy matching | source build |
| **dbt-goldensuite** | dbt · Python | dbt package with quality-gate tests and correction CRUD macros | `pip install dbt-goldensuite` |
| **goldencheck-action** | Action | GitHub Action for CI with PR comments | via GitHub Marketplace |

Source: [packages/python/goldenmatch/README.md]()

## Core Data Flow

The canonical pipeline processes data through four stages:

```mermaid
graph TD
    A[Raw Data CSV/Parquet] --> B[GoldenCheck<br/>Profile & Validate]
    B --> C[GoldenFlow<br/>Standardize & Transform]
    C --> D[InferMap<br/>Schema Mapping]
    D --> E[GoldenMatch<br/>Deduplicate & Merge]
    E --> F[Golden Records<br/>with Provenance]
    
    B -->|Findings Report| G[Data Quality Score]
    E -->|Cluster Decisions| H[Memory Learning]
    H -->|Auto-config| E
```

### Pipeline Stage Details

| Stage | Input | Output | Key Capabilities |
|-------|-------|--------|------------------|
| **Profile** | Raw CSV/Parquet | Schema, statistics, health score | Encoding detection, null analysis, cardinality profiling |
| **Standardize** | Messy fields | Normalized fields | Phone E.164, date parsing, address standardization, unicode normalization |
| **Map** | Heterogeneous schemas | Column alignments | Domain dictionaries, custom scorers, confidence scoring |
| **Match** | Canonical records | Duplicate clusters | Fuzzy matching, blocking, probabilistic scoring, LLM clustering |

Source: [packages/python/goldenflow/README.md]()

## GoldenMatch Architecture

GoldenMatch is the headline package, providing entity resolution (ER) capabilities. The v1.24.0 release cut wall-clock time by 81% (2604 s → 502 s) on a 10M-record dataset through the QIS-bucket-realistic optimization path. Source: [Community Context - v1.24.0]()

### Match Pipeline Components

```mermaid
graph LR
    A[Input Records] --> B[Blocking<br/>Key Generation]
    B --> C[Candidate Pair<br/>Generation]
    C --> D[Vectorized<br/>Scoring Kernels]
    D --> E[Probabilistic<br/>Classifier]
    E --> F[Cluster<br/>Formation]
    F --> G[Golden Record<br/>Survivorship]
    G --> H[Output with<br/>Provenance]
```
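
To make the first stages of the diagram concrete, here is a toy end-to-end sketch of blocking, candidate-pair generation, scoring, and cluster formation. It is not GoldenMatch's implementation: the exact-key blocking on `zip`, the `difflib` similarity scorer, the 0.85 threshold, and all function names are assumptions chosen for illustration, and the probabilistic classifier and survivorship steps are left out.

```python
from difflib import SequenceMatcher
from itertools import combinations

records = [
    {"id": 0, "name": "John Smith", "zip": "10001"},
    {"id": 1, "name": "Jon Smith", "zip": "10001"},
    {"id": 2, "name": "Mary Jones", "zip": "10001"},
    {"id": 3, "name": "John Smith", "zip": "94105"},
]


def block(records, key):
    """Group record ids by a blocking key so only same-block pairs are compared."""
    blocks = {}
    for rec in records:
        blocks.setdefault(rec[key], []).append(rec["id"])
    return blocks


def candidate_pairs(blocks):
    """Yield every unordered pair of ids that share a block."""
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)


def score(a, b):
    """Similarity in [0, 1] between two names (difflib ratio as a stand-in scorer)."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def cluster(ids, matched_pairs):
    """Union-find over matched pairs; returns a sorted list of sorted clusters."""
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in matched_pairs:
        parent[find(a)] = find(b)
    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(g) for g in groups.values())


by_id = {r["id"]: r for r in records}
pairs = list(candidate_pairs(block(records, "zip")))
matches = [(a, b) for a, b in pairs if score(by_id[a], by_id[b]) >= 0.85]
clusters = cluster(by_id.keys(), matches)
print(clusters)
```

Records 0 and 1 end up in one cluster, while record 3 stays separate even though its name is identical to record 0: it sits in a different block and is never compared. That is the usual trade-off of blocking, which reduces the number of comparisons at the cost of missing matches that disagree on the blocking key.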

### Backend Modes

GoldenMatch supports multiple execution backends optimized for different dataset scales:

| Backend | Use Case | Records | Performance |
|---------|----------|---------|-------------|
| `chunked` | Development / small datasets | < 1M | Single-threaded baseline |
| `bucket` | **Recommended 5M-on-one-node config** | 1-10M | 5x wall reduction, 2x RSS reduction |
| `native` | Maximum performance | Any | Rust/PyO3 abi3 compiled kernels |

The `bucket` backend became the recommended path in v1.16.0, processing 5M records in 9.94 minutes with 6.4 GB peak RSS on a single 16-core node. Source: [packages/python/goldenmatch/README.md]()

### Native Acceleration Layer

The optional native runtime (`pip install "goldenmatch[native]"`) ships the compiled `_native` kernel as a separately distributed wheel. The runtime is discovered automatically at import time:

```python
import goldenmatch
# Automatically detects and uses native runtime if available
result = goldenmatch.dedupe("customers.csv")
```

Source: [Community Context - v1.21.0](), [goldenmatch-native v0.1.0]()
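
Automatic discovery of an optional compiled module is commonly implemented as a "probe, then fall back" check. The sketch below shows that general pattern with the standard library only. The import name `goldenmatch_native` is a hypothetical placeholder, and nothing here is taken from GoldenMatch's actual discovery code.

```python
from importlib import util


def module_available(name):
    """True if a top-level module can be imported, without importing it."""
    return util.find_spec(name) is not None


def select_runtime(native_module="goldenmatch_native"):
    """Pick 'native' when the compiled module is present, else 'python'.

    `goldenmatch_native` is a hypothetical import name used for illustration.
    """
    return "native" if module_available(native_module) else "python"


if __name__ == "__main__":
    print("selected runtime:", select_runtime())
```

Probing with `find_spec` keeps the check cheap and side-effect free; the compiled module is only loaded if the caller later decides to import it.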

## GoldenCheck Architecture

GoldenCheck provides data validation that discovers rules from your data. The TypeScript implementation follows a scanner-reporter pattern with an LLM enhancement layer. Source: [packages/python/goldencheck/README.md]()

### Core Engine Components

| Component | File Location | Purpose |
|-----------|---------------|---------|
| **Scanner** | `src/core/engine/scanner.ts` | Analyzes data files, profiles columns |
| **Engine** | `src/core/engine/` | Executes discovered quality rules |
| **Confidence** | `src/core/engine/confidence.ts` | Applies severity downgrades based on findings |
| **Triage** | `src/core/engine/triage.ts` | Auto-categorizes findings by priority |
| **Fixer** | `src/core/engine/fixer.ts` | Applies automated corrections |
| **Agent** | `src/core/agent/` | Strategy selection and explanation |

Source: [packages/typescript/goldencheck/src/node/mcp/agent-tools.ts]()

### LLM Integration Layer

The TypeScript implementation includes a comprehensive LLM interface for semantic analysis:

```typescript
interface LLMResponse {
  columns: Record<string, LLMColumnAssessment>;
  relations: LLMRelation[];
}

interface LLMColumnAssessment {
  semantic_type: string | null;
  issues: LLMIssue[];
  upgrades: LLMUpgrade[];
  downgrades: LLMDowngrade[];
}
```

Source: [packages/typescript/goldencheck/src/core/llm/prompts.ts]()

### Semantic Type System

GoldenCheck ships with a bundled base type system defined as TypeScript constants (no runtime YAML dependency):

```typescript
export const BASE_TYPES: Readonly<Record<string, TypeDef>> = {
  identifier: {
    nameHints: ["id", "key", "pk", "code", "sku", "number", "num", "record"],
    valueSignals: { min_unique_pct: 0.95 },
    suppress: ["cardinality", "pattern_consistency", "drift_detection"],
  },
  person_name: {
    nameHints: ["first_name", "last_name", "full_name", ...],
    valueSignals: { mixed_case: true },
    suppress: ["pattern_consistency", "cardinality"],
  },
  email: {
    nameHints: ["email", "mail", "e_mail"],
    valueSignals: { format_match: "email", min_match_pct: 0.70 },
    suppress: ["pattern_consistency"],
  },
  phone: {
    nameHints: ["phone", "tel", "fax", "mobile", "cell"],
    valueSignals: { format_match: "phone", min_match_pct: 0.70 },
    suppress: ["type_inference", "pattern_consistency"],
  },
  address: {
    nameHints: ["address", "street", "addr", "line1", "line2"],
    valueSignals: { avg_length_min: 15 },
    suppress: ["pattern_consistency", "cardinality"],
  },
  free_text: {
    nameHints: ["notes", "comments", "description", ...],
    // ...
  },
};
```

Source: [packages/typescript/goldencheck/src/core/semantic/types.ts]()
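
The `nameHints` lists suggest a simple first pass: guess a column's semantic type from its name before looking at values. The Python sketch below illustrates that idea using the hints from the excerpt (elided entries are omitted rather than guessed). The matching rule, where single-word hints must equal a whole token and multi-word hints match as substrings, is an assumption and not the documented GoldenCheck algorithm.

```python
import re

# Name hints copied from the BASE_TYPES excerpt (elided entries omitted).
NAME_HINTS = {
    "identifier": ["id", "key", "pk", "code", "sku", "number", "num", "record"],
    "person_name": ["first_name", "last_name", "full_name"],
    "email": ["email", "mail", "e_mail"],
    "phone": ["phone", "tel", "fax", "mobile", "cell"],
    "address": ["address", "street", "addr", "line1", "line2"],
}


def guess_type(column_name):
    """Return the first type whose hint matches the column name, else None.

    Single-word hints must equal a whole token; multi-word hints (containing
    an underscore) match as substrings of the normalized name.
    """
    name = re.sub(r"[^a-z0-9]+", "_", column_name.lower()).strip("_")
    tokens = set(name.split("_"))
    for type_name, hints in NAME_HINTS.items():
        for hint in hints:
            if hint in tokens or ("_" in hint and hint in name):
                return type_name
    return None


if __name__ == "__main__":
    for column in ["customer_email", "User ID", "First Name", "mobile", "amount"]:
        print(f"{column!r:18} -> {guess_type(column)}")
```

Name matching alone is ambiguous: in this sketch `phone_number` would resolve to `identifier` because the `number` hint is checked first. That kind of collision is one reason the type definitions also carry `valueSignals` such as `format_match` and `min_match_pct`.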

### Reporter System

The JSON reporter produces machine-readable output matching the spec schema:

```typescript
interface ReportOutput {
  file: string;
  rows: number;
  columns: number;
  health_grade: string;
  health_score: number;
  summary: { errors: number; warnings: number; info: number };
  findings: Array<{
    severity: string;
    column: string;
    check: string;
    message: string;
    affected_rows: number;
    sample_values: string[];
  }>;
}
```

Source: [packages/typescript/goldencheck/src/core/reporters/json.ts]()

## GoldenFlow Transform Architecture

GoldenFlow provides 76 transforms organized into semantic categories for data standardization. Source: [packages/python/goldenflow/README.md]()

### Transform Categories

| Category | Count | Examples |
|----------|-------|----------|
| **Text Transforms** | 18 | `strip`, `lowercase`, `uppercase`, `normalize_unicode`, `remove_html_tags` |
| **Phone Transforms** | 5 | `phone_e164`, `phone_national`, `phone_format` |
| **Date Transforms** | 8 | `date_parse`, `date_format`, `fuzzy_date` |
| **Numeric Transforms** | 6 | `parse_currency`, `extract_numbers`, `round_precision` |
| **Address Transforms** | 7 | `standardize_address`, `parse_components` |
| **Categorical Transforms** | 4 | `normalize_category`, `fuzzy_category` |

### Domain-Specific Transforms

| Domain | Transforms |
|--------|------------|
| **Healthcare** | NPI normalization, ICD code parsing, insurance ID formatting, CPT/DRG parsing |
| **Finance** | Account number formatting, routing number validation, CUSIP/ISIN parsing, currency normalization |
| **E-commerce** | SKU normalization, price parsing, order date standardization, address standardization |
| **Real Estate** | Property address parsing, listing date normalization, price standardization, geo field extraction |

Source: [packages/python/goldenflow/README.md]()

## InferMap Architecture

InferMap is an inference-driven schema mapping engine that automatically aligns columns across heterogeneous data sources. Source: [packages/python/infermap/README.md]()

### MCP Server Tools

The TypeScript implementation exposes a comprehensive MCP tool interface:

| Tool | Purpose |
|------|---------|
| `inspect` | Analyze schema of a data source |
| `suggest-mappings` | Propose column alignments with confidence scores |
| `apply` | Generate remapped output file |
| `compare-schemas` | Side-by-side schema comparison |
| `domain-mapping` | Domain-specific mapping using dictionaries |

Source: [packages/typescript/infermap/src/node/mcp/server.ts]()

### Scorer Architecture

InferMap supports custom scorers for mapping decisions:

```mermaid
graph TD
    A[Source Column] --> B[Name Similarity]
    A --> C[Type Compatibility]
    A --> D[Value Distribution]
    A --> E[Domain Dictionary]
    B --> F[Composite Score]
    C --> F
    D --> F
    E --> F
    F --> G[Mapping Confidence]
```

Source: [packages/python/infermap/README.md]()
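
One plausible reading of the diagram is a weighted combination of the four signals. The sketch below shows that arithmetic; the weights, signal names, and helper functions are all hypothetical, since the source does not state how InferMap combines scorer outputs.

```python
# Hypothetical weights; the source does not state how InferMap combines signals.
WEIGHTS = {"name": 0.4, "type": 0.2, "distribution": 0.2, "dictionary": 0.2}


def composite_score(signals, weights=WEIGHTS):
    """Weighted average of per-signal scores in [0, 1]; missing signals count as 0."""
    total_weight = sum(weights.values())
    weighted = sum(weights[k] * signals.get(k, 0.0) for k in weights)
    return weighted / total_weight


def best_mapping(source_column, candidates):
    """Pick the target column with the highest composite score."""
    target, signals = max(candidates.items(), key=lambda item: composite_score(item[1]))
    return source_column, target, round(composite_score(signals), 3)


if __name__ == "__main__":
    candidates = {
        "email": {"name": 0.9, "type": 1.0, "distribution": 0.8, "dictionary": 1.0},
        "phone": {"name": 0.1, "type": 1.0, "distribution": 0.1, "dictionary": 0.0},
    }
    print(best_mapping("e_mail", candidates))
```

Dividing by the total weight keeps the result in [0, 1] even if the weights do not sum to one, which makes it easier to add or drop a scorer without rescaling the rest.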

## GoldenPipe Orchestration Layer

GoldenPipe wires Check → Flow → Match into declarative pipelines with zero boilerplate. Source: [packages/python/goldenmatch/README.md]()

### Pipeline Declaration

```python
from goldenpipe import Pipeline

pipeline = (
    Pipeline("customer-dedup")
    .check("raw_data.csv")
    .flow(standardization_rules)
    .map(source_schema, target_schema)
    .match(blocking_rules, match_config)
    .output("golden_records.csv")
)
pipeline.run()
```

### Integration Points

| Integration | Package | Use Case |
|-------------|---------|----------|
| **Airflow DAGs** | `examples/airflow/` | 12 drop-in DAG templates |
| **dbt** | `dbt-goldensuite` | Quality-gate tests for warehouse models |
| **GitHub Actions** | `goldencheck-action` | PR-level data validation |
| **MCP Server** | `goldensuite-mcp` | Single-container MCP deployment |

Source: [packages/python/goldenmatch/README.md]()

## MCP Agent Tools Architecture

The GoldenCheck MCP server exposes 10 agent-level tools for Claude Desktop integration; the excerpt below lists eight of them. Source: [packages/typescript/goldencheck/src/node/mcp/agent-tools.ts]()

```typescript
export const AGENT_TOOLS: readonly Tool[] = [
  { name: "analyze_data", description: "Analyze data file to detect domain and recommend strategy" },
  { name: "explain_finding", description: "Explain a specific finding in natural language" },
  { name: "explain_column", description: "Explain column quality assessment" },
  { name: "auto_triage", description: "Auto-categorize findings by priority" },
  { name: "apply_fixes", description: "Apply automated corrections" },
  { name: "compare_domains", description: "Compare data against known domain schemas" },
  { name: "generate_handoff", description: "Generate pipeline handoff documentation" },
  { name: "build_review_queue", description: "Build prioritized review queue" },
];
```

### Agent Tool Execution Flow

```mermaid
graph TD
    A[Claude Desktop] -->|MCP Protocol| B[Agent Tools Layer]
    B --> C{Command Router}
    C -->|analyze_data| D[Scanner Engine]
    C -->|explain_*| E[Agent Explanation]
    C -->|apply_fixes| F[Fixer Engine]
    C -->|auto_triage| G[Triage Engine]
    D --> H[Findings Report]
    E --> I[Natural Language Output]
    F --> J[Corrected Data]
    G --> K[Prioritized Queue]
```

Source: [packages/typescript/goldencheck/src/node/mcp/agent-tools.ts]()

## Memory Learning System

The Learning Memory system in GoldenMatch consists of three tuners that consume user feedback to auto-configure thresholds: Source: [Community Context - v1.20.0]()

| Tuner | Level | Input | Output |
|-------|-------|-------|--------|
| **MemoryLearner** | Pair-level | Approve/reject pair decisions | Per-field auto-approve threshold |
| **Field-Strategy Tuner** | Field-level | Field match quality feedback | Per-field match strategy selection |
| **Cluster-Decision Tuner** | Cluster-level | Cluster approve/reject decisions | Per-dataset auto-approve threshold |

### Auto-Configuration Flow

```mermaid
graph LR
    A[User Feedback] --> B{Memory System}
    B --> C[Pair-Level Learning]
    B --> D[Field-Level Learning]
    B --> E[Cluster-Level Learning]
    C --> F[Auto-Config Commit]
    D --> F
    E --> F
    F --> G[Threshold Proposals]
    G --> H[Match Pipeline]
```
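
As an illustration of pair-level learning, the sketch below derives an auto-approve cutoff from approve/reject feedback by choosing the score that agrees with the most labels. This is a deliberately simple stand-in, not the MemoryLearner algorithm; the function name and the `(score, approved)` feedback shape are assumptions.

```python
def learn_threshold(feedback):
    """Pick the score cutoff that agrees with the most approve/reject labels.

    `feedback` is a list of (score, approved) tuples. A pair is predicted
    as a match when score >= cutoff. Ties go to the higher (stricter) cutoff.
    """
    if not feedback:
        raise ValueError("need at least one labelled pair")
    candidates = sorted({score for score, _ in feedback}, reverse=True)
    best_cutoff, best_correct = None, -1
    for cutoff in candidates:
        correct = sum((score >= cutoff) == approved for score, approved in feedback)
        if correct > best_correct:
            best_cutoff, best_correct = cutoff, correct
    return best_cutoff


if __name__ == "__main__":
    feedback = [(0.95, True), (0.91, True), (0.88, True), (0.84, False), (0.70, False)]
    print("proposed auto-approve threshold:", learn_threshold(feedback))
```

Preferring the stricter cutoff on ties is a conservative choice: when the feedback cannot distinguish two thresholds, fewer pairs are auto-approved.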

### Zero-Label Confidence (v1.23.0)

As of v1.23.0, auto-config commits are driven by zero-label confidence by default. Among candidates that share a health rank and carry a `zero_label` profile, the controller's `pick_committed` tiebreaker considers `-overall_confidence` before `-mass_separation`, so a candidate with higher overall confidence wins even if another has higher mass separation. Source: [Community Context - v1.23.0]()

## Golden Record Provenance

The v1.22.0 release added field-level golden-record provenance tracking: Source: [Community Context - v1.22.0]()

### Provenance Data Model

```python
# When provenance=True is enabled:
result = build_golden_records_batch(records, provenance=True)

# Each field dict includes:
{
    "value": "John Smith",
    "source_row_id": "__row_id__ of winning record",
    "survivorship_winner": True
}
```

### Lineage Configuration

| Config Option | Type | Default | Description |
|--------------|------|---------|-------------|
| `config.output.lineage_provenance` | bool | `False` | Enable field-level source tracking |

Source: [Community Context - v1.22.0]()
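
The sketch below shows what field-level provenance means in practice: each surviving value is stored together with the row it came from. The survivorship rule used here (longest non-empty value wins) and the function name are assumptions for illustration; only the `value` and `source_row_id` keys and the `__row_id__` column name mirror the data model above.

```python
def build_golden_record(cluster, row_id_key="__row_id__"):
    """Merge a cluster of records, keeping per-field provenance.

    Survivorship rule (an assumption for this sketch): the longest non-empty
    string value wins; ties go to the earliest record in the cluster.
    """
    fields = [k for k in cluster[0] if k != row_id_key]
    golden = {}
    for field in fields:
        winner = max(cluster, key=lambda rec: len(str(rec.get(field) or "")))
        golden[field] = {
            "value": winner.get(field),
            "source_row_id": winner[row_id_key],
        }
    return golden


if __name__ == "__main__":
    cluster = [
        {"__row_id__": 7, "name": "J. Smith", "email": "john@example.com"},
        {"__row_id__": 9, "name": "John Smith", "email": ""},
    ]
    for field, info in build_golden_record(cluster).items():
        print(field, info)
```

In this example the name comes from row 9 and the email from row 7, so a reviewer can trace every field of the golden record back to its source row.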

## TypeScript Package Structure

The TypeScript packages follow a consistent structure optimized for edge runtimes:

```
packages/typescript/
├── goldencheck/
│   ├── src/
│   │   ├── core/
│   │   │   ├── engine/       # Scanner, fixer, triage, confidence
│   │   │   ├── agent/       # Strategy selection, explanation
│   │   │   ├── llm/         # LLM interface, prompts
│   │   │   ├── reporters/   # JSON, HTML output formatters
│   │   │   ├── semantic/    # Type definitions, domain matching
│   │   │   └── types.ts     # Core type definitions
│   │   └── node/
│   │       └── mcp/         # MCP server, agent tools
│   └── domains/             # Bundled domain packs (YAML)
├── goldencheck-types/
│   ├── domains/             # Community-contributed domain packs
│   └── src/                 # Type definitions
└── infermap/
    ├── src/
    │   ├── core/            # Mapping engine
    │   └── node/
    │       └── mcp/         # MCP server tools
    └── domains/             # Domain dictionaries
```

### Zero Runtime Dependencies

The core TypeScript packages have no runtime dependencies (edge-safe):

| Package | Dependencies |
|---------|---------------|
| `goldencheck` core | None (pure TypeScript) |
| `goldencheck-types` | `js-yaml` (dev: `tsup`, `vitest`, `typescript`) |
| `infermap` core | None (pure TypeScript) |

Optional peer dependencies:
- `nodejs-polars` — Parquet reading (Node.js only)
- `csv-parse` — CSV reading (Node.js only)
- `@modelcontextprotocol/sdk` — MCP server (Node.js only)

Source: [packages/typescript/goldencheck-types/package.json]()

## Cross-Language Record Fingerprint

The v1.21.0 release introduced cross-language record fingerprinting enabling consistent record identification across Python, TypeScript, and Rust execution environments. Source: [Community Context - v1.21.0]()
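
A fingerprint is only portable across languages if every implementation hashes the same bytes. The sketch below shows one way to get there in Python: canonicalize the record, then hash it. The canonical form chosen here (NFC normalization, stripped strings, sorted keys, compact JSON, SHA-256) is an assumption; the source does not describe GoldenMatch's actual fingerprint format.

```python
import hashlib
import json
import unicodedata


def record_fingerprint(record):
    """Hash a flat record so equal content gives equal digests everywhere.

    Canonical form (assumed for this sketch): NFC-normalized, stripped string
    values; keys sorted; compact JSON separators; UTF-8; SHA-256.
    """
    canonical = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = unicodedata.normalize("NFC", value).strip()
        canonical[str(key)] = value
    payload = json.dumps(canonical, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    a = {"name": "Jos\u00e9", "zip": "10001"}
    b = {"zip": "10001", "name": "Jose\u0301 "}  # decomposed accent, trailing space
    print(record_fingerprint(a) == record_fingerprint(b))
```

Key order, Unicode composition, and whitespace are typical sources of cross-runtime drift, which is why they are pinned down before hashing.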

## Rust Extensions Layer

The `goldenmatch-extensions` package provides SQL-native fuzzy matching through Postgres UDFs (pgrx) and DuckDB UDFs: Source: [Community Context - v1.24.0]()

### Extension Capabilities (v0.5.0)

| Capability | Postgres Functions | DuckDB Functions |
|------------|-------------------|------------------|
| Core API parity | 13 pgrx functions | Multiple UDFs |
| GoldenFlow transforms | `goldenflow_*` transforms | Equivalent UDFs |
| Memory learning | `memory_learn` CRUD | `memory_stats` CRUD |

## Performance Characteristics

### Scaling Benchmarks (v1.24.0)

| Dataset Size | Wall Clock | Memory (RSS) | Configuration |
|-------------|------------|--------------|---------------|
| 100K records | ~30s | < 1 GB | Default |
| 1M records | ~2 min | ~2 GB | Default |
| 5M records | 9.94 min | 6.4 GB | `backend="bucket"`, 16-core |
| 10M records | 8.37 min | ~5 GB | `backend="bucket"`, optimized |

The QIS-bucket-realistic path holds F1 = 0.9886 constant across all scales. Source: [Community Context - v1.24.0]()
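
The headline figures can be cross-checked with a little arithmetic. The snippet below recomputes the wall-clock reduction from the 2604 s and 502 s timings quoted for v1.24.0 and derives throughput for the 5M and 10M rows; note that 502 s is about 8.37 minutes, which matches the 10M row. The helper names are illustrative.

```python
def percent_reduction(before_s, after_s):
    """Wall-clock reduction as a percentage of the original time."""
    return 100.0 * (before_s - after_s) / before_s


def throughput(records, minutes):
    """Records processed per second."""
    return records / (minutes * 60.0)


print(f"{percent_reduction(2604, 502):.1f}% wall-clock reduction")
print(f"{throughput(5_000_000, 9.94):,.0f} records/s at 5M")
print(f"{throughput(10_000_000, 8.37):,.0f} records/s at 10M")
```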

### Memory Optimization

The native Rust acceleration (`goldenmatch-native`) provides:
- 18% RSS reduction on large datasets
- Compiled clustering kernels
- Optimized block-scoring operations

Source: [Community Context - v1.19.0]()

## Related Documentation

| Topic | Documentation |
|-------|--------------|
| Getting Started | [GoldenMatch Quick Start](https://github.com/benseverndev-oss/goldenmatch#quick-start) |
| Python API | [packages/python/goldenmatch/](packages/python/goldenmatch/) |
| TypeScript API | [packages/typescript/goldenmatch/](packages/typescript/goldenmatch/) |
| MCP Integration | [Web UI Wiki](https://github.com/benseverndev-oss/goldenmatch/wiki/Web-UI) |
| ER Agent | [ER Agent / A2A Wiki](https://github.com/benseverndev-oss/goldenmatch/wiki/ER-Agent) |
| Examples | [Python Examples](./examples/), [TypeScript Examples](./examples/typescript/) |

---

<a id='backend-systems'></a>

## Backend Systems

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [packages/python/goldenmatch/goldenmatch/backends/duckdb_backend.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/backends/duckdb_backend.py)
- [packages/python/goldenmatch/goldenmatch/backends/ray_backend.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/backends/ray_backend.py)
- [packages/python/goldenmatch/goldenmatch/backends/polars_backend.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/backends/polars_backend.py)
- [packages/python/goldenmatch/goldenmatch/backends/base.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/backends/base.py)
- [packages/python/goldenmatch/goldenmatch/backends/__init__.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/backends/__init__.py)
- [packages/python/goldenmatch/goldenmatch/backends/bucket_backend.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/backends/bucket_backend.py)
- [docs/scale-envelope.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/docs/scale-envelope.md)
</details>

# Backend Systems

GoldenMatch provides a pluggable backend architecture that allows the entity resolution engine to execute across different computational substrates. This design enables users to choose the optimal backend based on dataset size, infrastructure constraints, and performance requirements.

## Overview

The backend abstraction decouples the core matching logic from data execution, allowing GoldenMatch to scale from single-threaded operations on small datasets to distributed clusters processing tens of millions of records.

```mermaid
graph TD
    User[User Code] --> Config[GoldenMatch Config]
    Config --> Registry[Backend Registry]
    Registry --> Polars[Polars Backend]
    Registry --> Bucket[Bucket Backend]
    Registry --> DuckDB[DuckDB Backend]
    Registry --> Ray[Ray Backend]
    
    Polars --> Data1[In-Memory Data]
    Bucket --> Data2[Chunked Data]
    DuckDB --> Data3[SQL Engine]
    Ray --> Data4[Distributed Data]
    
    subgraph Core["Core ER Engine"]
        Blocking[Blocking]
        Scoring[Pair Scoring]
        Clustering[Clustering]
    end
    
    Data1 --> Core
    Data2 --> Core
    Data3 --> Core
    Data4 --> Core
```

## Backend Types

### Polars Backend

The Polars backend is the default. It runs on a single node with multi-threaded execution and uses Polars' native vectorized operations for blocking, scoring, and clustering.

| Characteristic | Value |
|----------------|-------|
| Execution Model | Single node, multi-threaded |
| Memory Model | In-memory |
| Typical Dataset Size | Up to 5M records |
| Dependencies | polars |
| Configuration Key | `backend="polars"` |

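The Polars backend is the recommended starting point for datasets under 5 million records. Source: [packages/python/goldenmatch/goldenmatch/backends/polars_backend.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/backends/polars_backend.py)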

### Bucket Backend

The bucket backend is an evolution of the Polars backend, optimized for larger datasets on a single node. It partitions data into manageable chunks that are processed sequentially, which reduces peak memory consumption.

| Characteristic | Value |
|----------------|-------|
| Execution Model | Single node, chunked processing |
| Memory Model | Memory-efficient, streaming |
| Typical Dataset Size | 5M-10M records |
| RSS Reduction | ~50% (2x) vs v1.15 chunked baseline |
| Wall Time Reduction | ~80% (5x) vs v1.15 chunked baseline |

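Source: [packages/python/goldenmatch/goldenmatch/backends/bucket_backend.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/backends/bucket_backend.py)

Source: [docs/scale-envelope.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/docs/scale-envelope.md)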

> **v1.16.0+ Performance Note**: The `backend="bucket"` path processes 5M records in 9.94 minutes with 6.4 GB peak RSS on a 16-core node, a 5x reduction in wall time and a 2x reduction in peak RSS compared to the v1.15 chunked baseline. Source: [README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/README.md)

### DuckDB Backend

The DuckDB backend provides SQL-native fuzzy matching, so GoldenMatch operations can run inside DuckDB queries. This is particularly useful for warehouse-native entity resolution workflows.

| Characteristic | Value |
|----------------|-------|
| Execution Model | DuckDB SQL engine |
| Memory Model | Arrow-based, out-of-core |
| Use Case | Warehouse-native ER, SQL integration |
| Key Functions | `goldenflow_*` transforms, memory operations |

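Source: [packages/python/goldenmatch/goldenmatch/backends/duckdb_backend.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/backends/duckdb_backend.py)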

### Ray Backend

The Ray backend enables distributed entity resolution by spreading blocking, scoring, and clustering across the nodes of a Ray cluster.

| Characteristic | Value |
|----------------|-------|
| Execution Model | Distributed Ray cluster |
| Memory Model | Distributed, cluster-sharded |
| Typical Dataset Size | 10M+ records |
| Dependencies | ray |

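Source: [packages/python/goldenmatch/goldenmatch/backends/ray_backend.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/backends/ray_backend.py)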

## Backend Selection

GoldenMatch automatically selects an appropriate backend based on dataset characteristics, but users can explicitly override this via configuration:

```python
from goldenmatch import GoldenMatch

config = {
    "backend": "bucket",  # Explicit backend selection
    "backend_options": {
        "chunk_size": 500_000,
        "num_threads": 16
    }
}

matcher = GoldenMatch(config)
```

### Auto-Detection Logic

The backend registry implements automatic selection based on:

1. **Record count**: Datasets under 1M records use `polars`; larger datasets may trigger `bucket` or `ray`
2. **Memory availability**: Detected via RSS monitoring during execution
3. **Cluster availability**: If Ray is initialized, `ray` backend becomes available
4. **User configuration**: Explicit `backend` setting takes precedence
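
The precedence described above can be sketched as a small function. This is a minimal illustration, not the registry's actual code: `choose_backend` and both threshold constants are hypothetical, with the cutoffs taken loosely from the dataset sizes quoted in this section, and the memory-availability check is omitted.

```python
from typing import Optional

# Hypothetical cutoffs, loosely based on the sizes quoted in this section.
POLARS_MAX_RECORDS = 1_000_000
BUCKET_MAX_RECORDS = 10_000_000


def choose_backend(
    n_records: int,
    explicit: Optional[str] = None,
    ray_available: bool = False,
) -> str:
    """Pick a backend name; an explicit setting always wins."""
    if explicit is not None:
        return explicit
    if n_records < POLARS_MAX_RECORDS:
        return "polars"
    if n_records > BUCKET_MAX_RECORDS and ray_available:
        return "ray"
    # Large single-node workloads fall back to chunked processing.
    return "bucket"
```

Checking the explicit setting first keeps the override behavior obvious: no heuristic can silently replace a backend the user asked for.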

## Backend Interface

All backends implement a common interface defined in `base.py`:

```python
from typing import Protocol

from polars import DataFrame  # assumed: the custom backend example below uses pl.DataFrame
from goldenmatch.core.config import Config

class Backend(Protocol):
    def setup(self, config: Config) -> None: ...
    def teardown(self) -> None: ...
    def execute_blocking(self, records: DataFrame, config: Config) -> DataFrame: ...
    def execute_scoring(self, pairs: DataFrame, config: Config) -> DataFrame: ...
    def execute_clustering(self, pairs: DataFrame, config: Config) -> DataFrame: ...
    def execute_merge(self, records: DataFrame, clusters: DataFrame, config: Config) -> DataFrame: ...
```

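Source: [packages/python/goldenmatch/goldenmatch/backends/base.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/backends/base.py)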

## Native Acceleration (Rust/PyO3)

v1.21.0 introduced optional native acceleration via `goldenmatch-native`, a separately distributed compiled runtime built with Rust and PyO3 abi3:

```bash
pip install "goldenmatch[native]"
```

The native runtime is discovered automatically when installed and provides:

- Compiled clustering kernels
- Optimized block-scoring operations
- Polars-compatible ABI3 bindings

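Source: [packages/python/goldenmatch/goldenmatch/backends/__init__.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/backends/__init__.py)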

> **Note**: `goldenmatch-native` is not a standalone package. It must be installed alongside the core `goldenmatch` package and is automatically discovered at import time.
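
Automatic discovery of an optional compiled module is commonly done by probing the import system without importing the module. The sketch below shows that general pattern only; the module name `goldenmatch_native` is an assumption and the helper `native_available` is hypothetical, so this is not necessarily how GoldenMatch performs the check.

```python
import importlib.util


def native_available(module_name: str = "goldenmatch_native") -> bool:
    """Return True if the optional top-level module can be found on the import path."""
    # find_spec returns None for a missing top-level module instead of raising.
    return importlib.util.find_spec(module_name) is not None


if native_available():
    print("native kernels found")
else:
    print("falling back to pure-Python kernels")
```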

## Memory Management

### Health Monitoring

All backends implement health monitoring to detect memory pressure:

```python
class HealthMonitor:
    """Tracks RSS usage and execution metrics."""
    
    def check_health(self) -> HealthStatus:
        """Returns current memory and execution health."""
    
    def get_metrics(self) -> dict:
        """Returns detailed metrics for diagnostics."""
```
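
A minimal sketch of how such a monitor could work, assuming a simple RSS limit. `SimpleHealthMonitor`, the fields of `HealthStatus`, and the injected `read_rss` callable are all hypothetical and chosen for illustration; the real classes may track more than this.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class HealthStatus:
    rss_bytes: int
    limit_bytes: int

    @property
    def under_pressure(self) -> bool:
        # Memory pressure means the current RSS has reached the limit.
        return self.rss_bytes >= self.limit_bytes


class SimpleHealthMonitor:
    def __init__(self, read_rss: Callable[[], int], limit_bytes: int) -> None:
        self._read_rss = read_rss  # injected so the monitor is easy to test
        self._limit = limit_bytes
        self._peak = 0

    def check_health(self) -> HealthStatus:
        rss = self._read_rss()
        self._peak = max(self._peak, rss)
        return HealthStatus(rss_bytes=rss, limit_bytes=self._limit)

    def get_metrics(self) -> dict:
        return {"peak_rss_bytes": self._peak, "limit_bytes": self._limit}
```

In a real process the `read_rss` callable could wrap something like `psutil.Process().memory_info().rss`. Injecting it keeps the sketch free of third-party dependencies.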

### RSS Reduction Features

| Version | Feature | RSS Impact |
|---------|---------|------------|
| v1.16.0 | Bucket backend introduction | 50% reduction vs chunked |
| v1.24.0 | Scale-aware cardinality | 18% reduction |
| v1.24.0 | Heuristic rule expansion | Additional savings |

Source: [docs/scale-envelope.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/docs/scale-envelope.md)

## Configuration Reference

### Backend-Specific Options

```yaml
backend: "bucket"  # polars, bucket, duckdb, ray

backend_options:
  # Polars/Bucket options
  num_threads: 16
  chunk_size: 500_000
  memory_limit_gb: 32
  
  # Ray options
  num_actors: 8
  actor_placement: "node1,node2,node3"
  
  # DuckDB options
  catalog: "memory"
  threads: 8
```

### Performance Tuning

For the recommended 5M-on-one-node configuration:

```yaml
backend: "bucket"
backend_options:
  num_threads: 16
  chunk_size: 500_000
```

On a 16-core node, this configuration completes the 5M-record benchmark in 9.94 minutes of wall time with 6.4 GB peak RSS.

Source: [docs/scale-envelope.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/docs/scale-envelope.md)

## Execution Pipeline

The backend executes the entity resolution pipeline in these stages:

```mermaid
graph LR
    A[Raw Records] --> B[Blocking]
    B --> C[Pair Generation]
    C --> D[Pair Scoring]
    D --> E[Clustering]
    E --> F[Golden Record Construction]
    F --> G[Output]
    
    B1[Block Key Computation] --> B
    D1[Field Scorers] --> D
    E1[Linkage Criteria] --> E
```

Each backend implements these stages in its own way:

| Stage | Polars | Bucket | DuckDB | Ray |
|-------|--------|--------|--------|-----|
| Blocking | Vectorized | Chunked vectorized | SQL | Distributed actors |
| Scoring | Multi-threaded | Chunked | SQL | Distributed |
| Clustering | Single-node | Chunked | SQL | Distributed |

## Extending Backends

To implement a custom backend:

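```python
from goldenmatch.backends.base import Backend
from goldenmatch.core.config import Config
import polars as pl

class CustomBackend(Backend):
    def setup(self, config: Config) -> None:
        # Keep the config for use by later stages
        self.config = config

    def execute_blocking(self, records: pl.DataFrame, config: Config) -> pl.DataFrame:
        # Implement custom blocking logic and return candidate pairs
        raise NotImplementedError

    def execute_scoring(self, pairs: pl.DataFrame, config: Config) -> pl.DataFrame:
        # Implement custom scoring logic and return scored pairs
        raise NotImplementedError

    def execute_clustering(self, pairs: pl.DataFrame, config: Config) -> pl.DataFrame:
        # Implement custom clustering logic and return cluster assignments
        raise NotImplementedError

    def execute_merge(self, records: pl.DataFrame, clusters: pl.DataFrame, config: Config) -> pl.DataFrame:
        # Implement golden record construction for each cluster
        raise NotImplementedError

    def teardown(self) -> None:
        # Release any resources acquired in setup()
        pass
```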

Register the backend:

```python
from goldenmatch.backends import register_backend

register_backend("custom", CustomBackend)
```

## See Also

- [System Architecture](#architecture): High-level system design
- [Scale Envelope](docs/scale-envelope.md): Performance benchmarks
- [CLI Reference](CLI): Command-line backend selection
- [Configuration Guide](Configuration): Backend configuration options
- [GoldenMatch PostgreSQL Extension](packages/rust/extensions/README.md): SQL-native ER with pgrx

---

<a id='core-matching'></a>

## Core Matching Engine

### Related Pages

Related topics: [AutoConfig System](#autoconfig), [Blocking and Scoring](#blocking-scoring)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [packages/python/goldenmatch/goldenmatch/core/scorer.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/core/scorer.py)
- [packages/python/goldenmatch/goldenmatch/core/cluster.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/core/cluster.py)
- [packages/python/goldenmatch/goldenmatch/core/autoconfig_controller.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/core/autoconfig_controller.py)
- [packages/python/goldenmatch/goldenmatch/core/blocker.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/core/blocker.py)
- [packages/python/goldenmatch/goldenmatch/core/controller.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/core/controller.py)
- [packages/python/goldenmatch/goldenmatch/core/matcher.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/core/matcher.py)
- [packages/python/goldenmatch/goldenmatch/core/field_strategies.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/core/field_strategies.py)
</details>

# Core Matching Engine

The Core Matching Engine is the central processing component of GoldenMatch, responsible for identifying, scoring, and clustering duplicate records within datasets. It orchestrates the entire entity resolution pipeline from candidate pair generation through final cluster formation, enabling users to deduplicate records with configurable precision, recall, and performance characteristics.

## Architecture Overview

The Core Matching Engine consists of several interconnected modules that work together to perform entity resolution at scale. The engine processes input records through a staged pipeline: blocking reduces the candidate space, scoring evaluates pair similarity, clustering groups related records, and survivorship rules turn each cluster into a golden record.

```mermaid
graph TD
    A[Input Records] --> B[Blocker]
    B --> C[Candidate Pairs]
    C --> D[Scorer]
    D --> E[Scored Pairs]
    E --> F[Cluster]
    F --> G[Golden Records]
    
    H[Config] --> B
    H --> D
    H --> F
    
    I[Autoconfig Controller] -.-> H
```

### Core Components

| Component | Purpose | Key Responsibilities |
|-----------|---------|----------------------|
| **Blocker** | Candidate reduction | Generate candidate pairs using blocking keys, handle QIS-bucket blocking |
| **Scorer** | Similarity evaluation | Compute field-level and overall pair scores, apply field strategies |
| **Cluster** | Record grouping | Form clusters from scored pairs, manage transitive closure, build golden records |
| **Autoconfig Controller** | Self-configuration | Auto-tune thresholds, manage recall targets, handle zero-label commits |
| **Controller** | Orchestration | Coordinate pipeline stages, manage state, handle incremental processing |

Source: [packages/python/goldenmatch/goldenmatch/core/controller.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/core/controller.py)

## Blocking Module

The blocking module avoids the O(n²) cost of comparing every pair of records. It groups records by shared blocking keys and generates candidate pairs only within each block, so the number of comparisons depends on block sizes rather than on the square of the dataset size.
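
The idea can be illustrated with a few lines of plain Python. This is a sketch of standard key-based blocking only: `candidate_pairs` and its parameters are hypothetical names, not GoldenMatch's blocker API, and the sample records are made up.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, key_fn, max_block_size=100_000):
    """Group record indices by blocking key and pair them only within a block."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[key_fn(rec)].append(idx)

    pairs = set()
    for members in blocks.values():
        # Singleton blocks yield no pairs; oversized blocks are skipped.
        if 2 <= len(members) <= max_block_size:
            pairs.update(combinations(members, 2))
    return pairs


records = [
    {"name": "Ann Lee", "zip": "10001"},
    {"name": "Anne Lee", "zip": "10001"},
    {"name": "Bob Ray", "zip": "94105"},
    {"name": "Rob Ray", "zip": "94105"},
    {"name": "Cy Dunn", "zip": "60601"},
]
pairs = candidate_pairs(records, key_fn=lambda r: r["zip"])
# 2 candidate pairs instead of the 10 an all-pairs comparison would need
```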

### Blocking Strategies

GoldenMatch supports multiple blocking strategies optimized for different data characteristics and scale requirements:

| Strategy | Description | Use Case |
|----------|-------------|----------|
| **QIS Bucket** | Quality-Interval-Sorted bucketing with Chao1 cardinality estimation | Large-scale datasets (10M+ records), realistic data distributions |
| **Standard** | Traditional blocking on normalized field values | General purpose deduplication |
| **Multi-pass** | Sequential blocking with different keys | High recall requirements |
| **ANN Fallback** | Approximate nearest neighbor blocking | Fuzzy matching with edit distance |

The QIS-bucket strategy introduced in v1.24.0 processes 10M records in 502 s, down from 2604 s (roughly 5.2x faster), with F1 unchanged at 0.9886 and an 18% RSS reduction.

Source: [packages/python/goldenmatch/goldenmatch/core/blocker.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/core/blocker.py)

### Blocking Key Configuration

```python
config = {
    "blocking": {
        "keys": ["name_soundex", "zip_code", "phone_area"],
        "min_block_size": 2,
        "max_block_size": 100000
    }
}
```

## Scoring Module

The scoring module evaluates candidate pairs by computing similarity scores at both field and record levels. It applies configurable field strategies to determine how each field contributes to the overall match probability.

### Field Strategies

Field strategies define how individual fields are compared and weighted:

| Strategy | Description | Best For |
|----------|-------------|----------|
| `exact` | Binary match/mismatch | IDs, codes, categorical |
| `fuzzy` | Edit distance or Jaro-Winkler | Names, addresses |
| `token_set` | Token overlap comparison | Multi-word fields |
| `numeric` | Threshold-based comparison | Ages, amounts |
| `date` | Temporal proximity | Date fields |
| `phonetic` | Soundex/Metaphone matching | Names |

Source: [packages/python/goldenmatch/goldenmatch/core/field_strategies.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/core/field_strategies.py)
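
To make the table concrete, here is a rough sketch of what a few of these comparators might compute, each returning a score in [0, 1]. The function names mirror the strategy names for readability, but the bodies are illustrative assumptions: in particular, `difflib.SequenceMatcher` stands in for edit distance or Jaro-Winkler, and token overlap is computed as a Jaccard ratio.

```python
from difflib import SequenceMatcher


def exact(a: str, b: str) -> float:
    """Binary match or mismatch."""
    return 1.0 if a == b else 0.0


def fuzzy(a: str, b: str) -> float:
    """Case-insensitive similarity ratio (stand-in for edit distance or Jaro-Winkler)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_set(a: str, b: str) -> float:
    """Jaccard overlap of whitespace-separated tokens, ignoring order and case."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def numeric(a: float, b: float, tolerance: float) -> float:
    """Threshold-based comparison: match if the values differ by at most tolerance."""
    return 1.0 if abs(a - b) <= tolerance else 0.0
```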

### Score Computation

The scorer aggregates field-level scores into an overall pair score using weighted combination:

```python
total_weight = sum(field_weights[f] for f in fields)  # normalizer
overall_score = sum(field_scores[f] * field_weights[f] for f in fields) / total_weight
```

Provided each field score lies in [0, 1], the overall score also ranges from 0.0 (definitely different) to 1.0 (definitely a match) and is interpreted as the likelihood that the two records refer to the same entity.
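
As a worked example of the weighted combination, the sketch below uses the field weights from the configuration reference later in this page (name 2.0, email 1.5, phone 1.0). The field scores are made-up inputs and `overall_score` is a hypothetical helper, not the scorer's actual function.

```python
def overall_score(field_scores: dict, field_weights: dict) -> float:
    """Weighted average of per-field scores."""
    total_weight = sum(field_weights[f] for f in field_scores)
    if total_weight == 0:
        raise ValueError("at least one field needs a non-zero weight")
    weighted = sum(field_scores[f] * field_weights[f] for f in field_scores)
    return weighted / total_weight


weights = {"name": 2.0, "email": 1.5, "phone": 1.0}
scores = {"name": 0.9, "email": 1.0, "phone": 0.5}
score = overall_score(scores, weights)
# (0.9*2.0 + 1.0*1.5 + 0.5*1.0) / 4.5 = 3.8 / 4.5 ≈ 0.844
```

With a `score_threshold` of 0.85, this pair would fall just short of a match even though the email matches exactly, because the weak phone score pulls the average down.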

Source: [packages/python/goldenmatch/goldenmatch/core/scorer.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/core/scorer.py)

## Clustering Module

The clustering module transforms scored pairs into connected clusters representing unique entities. It handles transitive closure to ensure consistent grouping across the record graph.

### Clustering Algorithm

GoldenMatch uses a connected-components approach with configurable linkage criteria:

| Linkage Type | Behavior |
|--------------|----------|
| **Single** | Two clusters merge if any pair of records spanning them exceeds the threshold |
| **Complete** | Two clusters merge only if every pair of records spanning them exceeds the threshold |
| **Average** | Two clusters merge if the mean pairwise similarity between them exceeds the threshold |
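
Single linkage is equivalent to finding connected components in the graph whose edges are the pairs at or above the threshold. A minimal union-find sketch of that case follows; `connected_components` is a hypothetical helper and the scored pairs are made up, so treat it as an illustration of the technique rather than the module's implementation.

```python
def connected_components(n_records, scored_pairs, threshold):
    """Cluster record indices by transitive closure over above-threshold pairs."""
    parent = list(range(n_records))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for i in range(n_records):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


pairs = [(0, 1, 0.92), (1, 2, 0.88), (3, 4, 0.40)]
clusters = connected_components(5, pairs, threshold=0.85)
# [[0, 1, 2], [3], [4]]
```

Records 0 and 2 land in the same cluster without ever being scored against each other. That is the transitive closure at work, and it is also why single linkage can over-merge when one weak link bridges two groups.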

### Golden Record Generation

Once clusters are formed, the clustering module generates golden records by applying survivorship rules:

```python
golden_record = build_golden_records_batch(cluster_members, provenance=True)
```

With `provenance=True` (introduced in v1.22.0), each field dict includes `source_row_id` tracking which record contributed the winning value.
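
The sketch below illustrates the shape of provenance output with a simple "most complete value wins" survivorship rule. Everything here is an assumption made for illustration: `build_golden_record` is a hypothetical single-cluster helper, not `build_golden_records_batch`, and the `row_id` key and the `value` key in the output are invented names.

```python
def build_golden_record(cluster_members, provenance=False):
    """Keep the longest non-empty value per field; optionally record its source row."""
    fields = {k for rec in cluster_members for k in rec if k != "row_id"}
    golden = {}
    for field in sorted(fields):
        # Missing or None values count as empty, so they never win.
        best = max(cluster_members, key=lambda r: len(str(r.get(field) or "")))
        value = best.get(field)
        if provenance:
            golden[field] = {"value": value, "source_row_id": best["row_id"]}
        else:
            golden[field] = value
    return golden


members = [
    {"row_id": 1, "name": "A. Lee", "email": "ann@example.com"},
    {"row_id": 2, "name": "Ann Lee", "email": None},
]
golden = build_golden_record(members, provenance=True)
# golden["name"] == {"value": "Ann Lee", "source_row_id": 2}
```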

Source: [packages/python/goldenmatch/goldenmatch/core/cluster.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/core/cluster.py)

## Autoconfig Controller

The autoconfig controller enables self-tuning of matching parameters based on labeled data or automatic heuristics. It replaces manual threshold tuning with data-driven optimization.

### Auto-Configuration Features

| Feature | Description | Version |
|---------|-------------|---------|
| **Zero-label commit** | Prefer higher confidence candidates when labels unavailable | v1.23.0+ |
| **Recall targeting** | Auto-configure thresholds to meet desired recall | v1.20.0+ |
| **Cluster threshold tuning** | Tune decision threshold based on cluster-level decisions | v1.20.0+ |
| **Field strategy tuning** | Auto-select field comparison strategies | v1.19.0+ |

### Zero-Label Confidence Handling

In v1.23.0, the `pick_committed` method was enhanced to handle zero-label profiles. When multiple candidates share the same health rank, the controller now breaks the tie on `-overall_confidence` instead of `-mass_separation`; read as negated sort keys, this means the candidate with the highest overall confidence is committed. This addresses precision-collapse issues in unlabeled data scenarios.

Source: [packages/python/goldenmatch/goldenmatch/core/autoconfig_controller.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/core/autoconfig_controller.py)

## Performance Characteristics

The Core Matching Engine is optimized for both scale and accuracy:

### Benchmark Results (v1.24.0)

| Dataset Size | Wall Time | Peak RSS | F1 Score |
|--------------|-----------|----------|----------|
| 5M records | 9.94 min | 6.4 GB | 0.99+ |
| 10M records | 502 s (~8.4 min) | ~18% lower than v1.23 | 0.9886 |

### Performance Strategies

1. **Vectorization**: A single group-by per column when building golden records
2. **Scale-aware cardinality**: Chao1 estimation for blocking key selection
3. **Native acceleration**: Optional Rust/PyO3 runtime via `pip install "goldenmatch[native]"`
4. **Incremental processing**: Support for streaming and incremental matching

Source: [packages/python/goldenmatch/goldenmatch/core/matcher.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/core/matcher.py)

## Configuration Reference

### Core Configuration Options

```yaml
matching:
  # Scoring thresholds
  score_threshold: 0.85        # Minimum score to consider a match
  decision_threshold: 0.5      # Threshold for cluster decisions
  
  # Field configuration
  fields:
    name:
      strategy: fuzzy
      weight: 2.0
    email:
      strategy: exact
      weight: 1.5
    phone:
      strategy: token_set
      weight: 1.0

  # Blocking
  blocking:
    keys: ["name_soundex", "phone_last4"]
    method: qis_bucket

  # Output
  output:
    lineage_provenance: false  # Track source records for golden fields
    include_scores: true
```

### Autoconfig Options

```yaml
autoconfig:
  enabled: true
  recall_target: 0.95
  zero_label_commit: true      # v1.23.0+ default behavior
  tune_cluster_threshold: true  # v1.20.0+
```

## Pipeline Integration

The Core Matching Engine integrates with the broader Golden Suite ecosystem:

```mermaid
graph LR
    A[GoldenCheck] --> B[GoldenFlow]
    B --> C[GoldenMatch]
    C --> D[InferMap]
    
    E[GoldenPipe] -. orchestrates .-> A
    E -. orchestrates .-> B
    E -. orchestrates .-> C
    E -. orchestrates .-> D
```

- **GoldenCheck**: Validates data quality before matching
- **GoldenFlow**: Standardizes messy fields (phone, date, address)
- **InferMap**: Maps columns across heterogeneous schemas
- **GoldenPipe**: Orchestrates the full pipeline declaratively

## Related Documentation

- [API Reference](../api/core.md)
- [Configuration Guide](../guides/configuration.md)
- [Performance Tuning](../guides/performance.md)
- [CLI Reference](../cli/reference.md)

---

<a id='autoconfig'></a>

## AutoConfig System

### Related Pages

Related topics: [Core Matching Engine](#core-matching), [Learning Memory](#learning-memory)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [packages/python/goldenmatch/goldenmatch/core/autoconfig_controller.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/core/autoconfig_controller.py)
- [packages/python/goldenmatch/goldenmatch/core/autoconfig_policy.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/core/autoconfig_policy.py)
- [packages/python/goldenmatch/goldenmatch/core/autoconfig_cluster_threshold_tuner.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/core/autoconfig_cluster_threshold_tuner.py)
- [packages/python/goldenmatch/goldenmatch/core/zero_label_confidence.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/core/zero_label_confidence.py)
- [packages/python/goldenmatch/docs/design/2026-05-25-zero-label-confidence-autoconfig-design.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/docs/design/2026-05-25-zero-label-confidence-autoconfig-design.md)
- [packages/python/goldenmatch/web/frontend/src/lib/api.ts](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/web/frontend/src/lib/api.ts)
</details>

# AutoConfig System

The AutoConfig System is GoldenMatch's self-configuration engine. It tunes entity resolution parameters from data characteristics, reducing the need for manual threshold and strategy tuning. It was introduced incrementally across versions v1.19.0 through v1.24.0 and works alongside the "Learning Memory" family of tuners, which consume human feedback to propose configurations.

## Overview

AutoConfig addresses a central difficulty in entity resolution: the right balance between precision and recall depends on the distribution of your specific data. Rather than requiring users to specify match thresholds, blocking strategies, and scoring weights by hand, AutoConfig:

- **Analyzes** data characteristics through profiling
- **Proposes** configuration parameters via iterative tuning
- **Learns** from user decisions in review queues
- **Commits** configurations based on zero-label confidence

Source: [docs/design/2026-05-25-zero-label-confidence-autoconfig-design.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/docs/design/2026-05-25-zero-label-confidence-autoconfig-design.md)

## Architecture

The AutoConfig System comprises four primary components that work together in an iterative feedback loop:

```mermaid
graph TD
    A[Data Input] --> B[Controller]
    B --> C[Policy Engine]
    C --> D[Cluster Threshold Tuner]
    D --> E[Zero-Label Confidence]
    E --> B
    F[User Decisions] --> D
    G[Telemetry] --> B
    B --> H[Committed Config]
```

### Component Overview

| Component | Purpose | Location |
|-----------|---------|----------|
| **Controller** | Orchestrates the auto-config loop, manages iterations | `core/autoconfig_controller.py` |
| **Policy Engine** | Evaluates candidate configurations against health metrics | `core/autoconfig_policy.py` |
| **Cluster Threshold Tuner** | Proposes per-dataset approve thresholds from cluster decisions | `core/autoconfig_cluster_threshold_tuner.py` |
| **Zero-Label Confidence** | Assigns confidence scores based on unlabeled data patterns | `core/zero_label_confidence.py` |

Source: [packages/python/goldenmatch/goldenmatch/core/autoconfig_controller.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/core/autoconfig_controller.py)

## Controller

The `AutoConfigController` is the central orchestrator that manages the iterative configuration process. It coordinates between the policy engine, telemetry collectors, and the zero-label confidence system.

### Controller Responsibilities

```mermaid
graph LR
    A[Blocking<br>Summary] --> C[Controller]
    B[Scoring<br>Summary] --> C
    D[Cluster<br>Summary] --> C
    E[Indicators] --> C
    F[Column<br>Priors] --> C
    C --> G[Decisions]
    C --> H[Errors]
    C --> I[Committed<br>Matchkeys]
```

### Telemetry Data Model

The controller collects and emits structured telemetry for each iteration:

```typescript
interface ControllerScoringSummary {
  n_pairs_scored: number;
  candidates_compared: number;
  mass_above_threshold: number;
  mass_in_borderline: number;
  dip_statistic: number;
}

interface ControllerBlockingSummary {
  n_blocks: number;
  reduction_ratio: number;
  block_sizes_p50: number;
  block_sizes_p99: number;
  block_sizes_max: number;
  oversized_block_count: number;
  keys_used: string[][];
}

interface ControllerClusterSummary {
  n_clusters: number;
  cluster_size_p50: number;
  cluster_size_p99: number;
  cluster_size_max: number;
  transitivity_rate: number;
  oversized_cluster_count: number;
}
```

Source: [web/frontend/src/lib/api.ts](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/web/frontend/src/lib/api.ts)
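
To show what the blocking summary fields could mean in practice, the sketch below derives them from a list of block sizes. It assumes the conventional definition of reduction ratio (one minus candidate pairs over all possible pairs), non-overlapping blocks, and a nearest-rank percentile; `blocking_summary`, `percentile`, and the `oversized_at` cutoff are hypothetical, and how GoldenMatch actually computes these fields is not confirmed here.

```python
import math


def percentile(sorted_values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    rank = max(1, math.ceil(q / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def blocking_summary(block_sizes, n_records, oversized_at=1000):
    if n_records < 2 or not block_sizes:
        raise ValueError("need at least two records and one block")
    sizes = sorted(block_sizes)
    # Pairs generated inside blocks versus pairs in an all-vs-all comparison.
    candidate_pairs = sum(s * (s - 1) // 2 for s in sizes)
    total_pairs = n_records * (n_records - 1) // 2
    return {
        "n_blocks": len(sizes),
        "reduction_ratio": 1 - candidate_pairs / total_pairs,
        "block_sizes_p50": percentile(sizes, 50),
        "block_sizes_p99": percentile(sizes, 99),
        "block_sizes_max": sizes[-1],
        "oversized_block_count": sum(s > oversized_at for s in sizes),
    }


summary = blocking_summary([2, 2, 1], n_records=5)
```

For five records in blocks of sizes 2, 2, and 1, blocking generates 2 of the 10 possible pairs, so the reduction ratio is 0.8.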

### Decision Recording

Each iteration produces a `ControllerDecision` record:

| Field | Type | Description |
|-------|------|-------------|
| `iteration` | `int` | Iteration number |
| `rule_name` | `str` | Name of the policy rule that triggered |
| `rationale` | `str` | Human-readable explanation |
| `config_diff` | `dict[str, str]` | Changes made to configuration |
| `wall_clock_ms` | `int` | Time taken for this iteration |

Source: [web/frontend/src/lib/api.ts](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/web/frontend/src/lib/api.ts)

## Policy Engine

The `AutoConfigPolicy` evaluates candidate configurations against a health ranking system. It ranks configurations by their expected precision-collapse behavior and mass separation characteristics.

### Health Ranking

Configurations are evaluated on multiple health dimensions:

| Metric | Description | Impact |
|--------|-------------|--------|
| `overall_confidence` | Aggregate confidence score | Higher is better |
| `mass_separation` | Gap between match/non-match distributions | Larger gap indicates better discrimination |
| `precision_collapse` | Tendency to over-cluster | Lower is better |

### Commit Decision Logic

The policy engine's `pick_committed` method determines which configuration to commit when multiple candidates have equal health rank. In v1.23.0, the tiebreaker for candidates carrying a `zero_label` profile changed from `-mass_separation` to `-overall_confidence`; read as negated sort keys, this favors the candidate with the highest overall confidence.

Source: [docs/design/2026-05-25-zero-label-confidence-autoconfig-design.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/docs/design/2026-05-25-zero-label-confidence-autoconfig-design.md)

## Cluster Threshold Tuner

The `AutoConfigClusterThresholdTuner` is the third tuner in the Learning Memory family, alongside the pair-level `MemoryLearner` and the field-level `FieldStrategyTuner`.

### Tuner Function: `tune_decision_threshold`

```python
def tune_decision_threshold(
    decisions: list[ClusterDecision],
    current_threshold: float
) -> float:
    """
    Proposes a per-dataset auto-approve threshold based on
    cluster-level approve/reject decisions.
    """
```

### Input: Cluster Decisions

| Field | Type | Description |
|-------|------|-------------|
| `cluster_id` | `str` | Unique cluster identifier |
| `decision` | `str` | "approve" or "reject" |
| `confidence` | `float` | Model confidence in decision |
| `cluster_size` | `int` | Number of records in cluster |

### Output

Returns an updated threshold value that balances precision and recall based on the observed decision pattern.
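
One simple way such a tuner could work is to pick the threshold that misclassifies the fewest reviewed clusters. The sketch below implements that rule as an illustration only: the selection rule, the tie-breaking, and the fallback are assumptions, and the `ClusterDecision` dataclass simply mirrors the input table above rather than the library's own type.

```python
from dataclasses import dataclass


@dataclass
class ClusterDecision:
    cluster_id: str
    decision: str  # "approve" or "reject"
    confidence: float
    cluster_size: int


def tune_decision_threshold(decisions, current_threshold):
    approved = [d.confidence for d in decisions if d.decision == "approve"]
    rejected = [d.confidence for d in decisions if d.decision == "reject"]
    if not approved or not rejected:
        return current_threshold  # not enough signal to move the threshold

    def errors(t):
        # Rejected clusters that would be auto-approved, plus approved ones that would not.
        return sum(c >= t for c in rejected) + sum(c < t for c in approved)

    candidates = sorted(set(approved + rejected + [current_threshold]))
    # Fewest errors wins; ties go to the candidate closest to the current threshold.
    return min(candidates, key=lambda t: (errors(t), abs(t - current_threshold)))


decisions = [
    ClusterDecision("c1", "approve", 0.9, 3),
    ClusterDecision("c2", "approve", 0.8, 2),
    ClusterDecision("c3", "reject", 0.7, 4),
    ClusterDecision("c4", "reject", 0.6, 2),
]
new_threshold = tune_decision_threshold(decisions, current_threshold=0.5)
```

Here the reviewers approved everything at 0.8 or above and rejected everything at 0.7 or below, so the sketch moves the threshold from 0.5 to 0.8, the lowest value that separates the two groups.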

Source: [packages/python/goldenmatch/goldenmatch/core/autoconfig_cluster_threshold_tuner.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/core/autoconfig_cluster_threshold_tuner.py)

## Zero-Label Confidence

The zero-label confidence mechanism enables AutoConfig to work without labeled training data by analyzing the structure of unlabeled record pairs and their natural clustering behavior.

### Design Principles

1. **No Ground Truth Required**: Analyzes inherent data structure rather than relying on labeled examples
2. **Precision-Collapse Aware**: Identifies when configurations would cause over-merging
3. **Profile-Based Scoring**: Assigns confidence scores based on zero-label profile characteristics

### Zero-Label Profile

A zero-label profile captures characteristics of record pairs that help predict match quality without explicit labels:

```python
from dataclasses import dataclass

@dataclass
class ZeroLabelProfile:
    mass_separation: float
    overall_confidence: float
    precision_collapse_risk: float
    zero_label: bool  # True if profile carries zero-label characteristics
```

### Commit Behavior

Starting in v1.23.0, AutoConfig commits by zero-label confidence by default. The `pick_committed` tiebreaker logic:

```
IF several candidates share the same health rank:
    IF at least one carries a zero_label profile:
        SORT by -overall_confidence (highest overall_confidence first)
    ELSE:
        SORT by -mass_separation (highest mass_separation first)
    COMMIT the first candidate
```

This change addressed precision-collapse scenarios where `mass_separation` alone was insufficient.
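
The tiebreaker can be sketched in Python as follows. This is an illustrative reading rather than the controller's code: the `Candidate` dataclass, the assumption that a lower `health_rank` is healthier, and the interpretation of `-overall_confidence` and `-mass_separation` as negated sort keys are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    health_rank: int  # assumption: lower is healthier
    mass_separation: float
    overall_confidence: float
    zero_label: bool = False


def pick_committed(candidates):
    best_rank = min(c.health_rank for c in candidates)
    tied = [c for c in candidates if c.health_rank == best_rank]
    if any(c.zero_label for c in tied):
        key = lambda c: -c.overall_confidence  # highest confidence sorts first
    else:
        key = lambda c: -c.mass_separation  # highest separation sorts first
    return min(tied, key=key)


a = Candidate("a", 0, mass_separation=0.9, overall_confidence=0.6, zero_label=True)
b = Candidate("b", 0, mass_separation=0.5, overall_confidence=0.8, zero_label=True)
c = Candidate("c", 1, mass_separation=0.99, overall_confidence=0.99, zero_label=True)
winner = pick_committed([a, b, c])
# winner.name == "b": best health rank, then highest overall_confidence
```

Negating the value and taking the minimum is the same as taking the maximum of the raw value, which is probably why the keys are written with a leading minus sign.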

Source: [packages/python/goldenmatch/goldenmatch/core/zero_label_confidence.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/core/zero_label_confidence.py)

Source: [docs/design/2026-05-25-zero-label-confidence-autoconfig-design.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/docs/design/2026-05-25-zero-label-confidence-autoconfig-design.md)

## Interfaces

AutoConfig is accessible through multiple interfaces:

### CLI

```bash
goldenmatch autoconfig <input.csv> [--iterations N] [--output config.yaml]
```

### REST API

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/autoconfig` | POST | Start auto-configuration run |
| `/autoconfig/status` | GET | Get current iteration status |
| `/controller/telemetry` | GET | Retrieve full telemetry snapshot |

### Python API

```python
from goldenmatch import AutoConfigController

controller = AutoConfigController(config)
controller.run(iterations=10)
telemetry = controller.get_telemetry()
```

### SQL Interface (Postgres Extension)

```sql
SELECT goldenmatch_autoconfig('customers.csv');
SELECT gm_telemetry();
```

Source: [packages/python/goldenmatch/goldenmatch/core/autoconfig_controller.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/core/autoconfig_controller.py)

## Web UI Integration

The AutoConfig System integrates with the GoldenMatch web workbench for visual monitoring and manual override:

```mermaid
graph LR
    A[Web UI] -->|Ctrl+A| B[AutoConfig Panel]
    B --> C[Iteration List]
    C --> D[Decision Details]
    D --> E[Config Diff View]
    B --> F[Telemetry Charts]
    F --> G[Blocking Summary]
    F --> H[Cluster Summary]
```

### Telemetry Visualization

The frontend displays real-time telemetry including:

- **Blocking Summary**: Reduction ratio, block size distribution
- **Scoring Summary**: Mass above threshold, mass in the borderline band, dip statistic
- **Cluster Summary**: Size distribution, transitivity rate
- **Indicators**: Matchkey hit rate, cross-blocking overlap

Source: [web/frontend/src/lib/api.ts](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/web/frontend/src/lib/api.ts)

## Configuration Options

### AutoConfig Settings

| Parameter | Default | Description |
|-----------|---------|-------------|
| `autoconfig.enabled` | `true` | Enable auto-configuration |
| `autoconfig.max_iterations` | `10` | Maximum iterations before commit |
| `autoconfig.zero_label_commit` | `true` | Prefer zero-label confidence (v1.23.0+) |
| `autoconfig.recall_target` | `0.95` | Target recall for auto-config |
| `autoconfig.precision_floor` | `0.90` | Minimum acceptable precision |

### Learning Memory Integration

AutoConfig integrates with the broader Learning Memory system:

| Tuner | Level | Learns From |
|-------|-------|-------------|
| `MemoryLearner` | Pair | Pair-level approve/reject decisions |
| `FieldStrategyTuner` | Field | Field-level strategy preferences |
| `ClusterDecisionTuner` | Cluster | Cluster-level approve/reject decisions |

## Version History

| Version | Change |
|---------|--------|
| v1.24.0 | Heuristic rule expansion + diagnostic harness |
| v1.23.0 | Auto-config commits by zero-label confidence by default |
| v1.20.0 | Cluster decision tuner (`tune_decision_threshold`) |
| v1.19.0 | Native acceleration + autoconfig + probabilistic improvements |

Source: [packages/python/goldenmatch/goldenmatch/core/autoconfig_controller.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/core/autoconfig_controller.py)

## Best Practices

1. **Start with defaults**: The zero-label confidence commit behavior (v1.23.0+) provides sensible defaults for most datasets
2. **Review telemetry**: Monitor blocking and cluster summaries to identify oversized blocks or clusters
3. **Use strict mode for evaluation**: The `_strictAutoconfig` flag disables runtime threshold shifts for reproducible results
4. **Integrate with review queue**: Feed cluster decisions back to the `ClusterDecisionTuner` for continuous improvement

---

<a id='learning-memory'></a>

## Learning Memory

### Related Pages

Related topics: [AutoConfig System](#autoconfig)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [packages/python/goldenmatch/goldenmatch/core/memory/learner.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/core/memory/learner.py)
- [packages/python/goldenmatch/goldenmatch/core/memory/corrections.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/core/memory/corrections.py)
- [packages/python/goldenmatch/goldenmatch/core/autoconfig_golden_strategy_tuner.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/core/autoconfig_golden_strategy_tuner.py)
- [packages/python/goldenmatch/goldenmatch/core/autoconfig_cluster_threshold_tune.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/core/autoconfig_cluster_threshold_tune.py)
- [packages/python/goldenmatch/goldenmatch/core/autoconfig_field_strategy_tuner.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/core/autoconfig_field_strategy_tuner.py)
- [packages/python/goldenmatch/goldenmatch/core/memory/__init__.py](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/core/memory/__init__.py)
- [packages/python/goldenmatch/CHANGELOG.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/CHANGELOG.md)
</details>

# Learning Memory

Learning Memory is GoldenMatch's adaptive feedback system that continuously improves entity resolution accuracy by learning from human corrections, labeled decisions, and performance feedback. It forms the closed-loop optimization layer that distinguishes GoldenMatch from static rule-based matching systems.

## Overview

Learning Memory captures human domain knowledge and transforms it into automated configuration improvements. Rather than requiring users to manually tune thresholds, define field strategies, or configure blocking rules, Learning Memory observes how humans resolve ambiguous cases and propagates those decisions across the entire dataset.

The system operates at three distinct levels:

| Level | Tuner | Input | Output |
|-------|-------|-------|--------|
| **Pair-level** | `MemoryLearner` | Human approve/reject on pair comparisons | Learned matchkey weights and thresholds |
| **Field-level** | `FieldStrategyTuner` | Field-level corrections | Per-field strategy selection (exact, fuzzy, tokenized, etc.) |
| **Cluster-level** | `ClusterDecisionTuner` | Cluster approve/reject decisions | Per-dataset auto-approve threshold |

This multi-level approach ensures that feedback at any granularity flows back into the appropriate configuration layer. Source: [packages/python/goldenmatch/CHANGELOG.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/CHANGELOG.md)

## Architecture

Learning Memory consists of several interconnected components that handle storage, learning, and application of corrections.

```mermaid
graph TD
    subgraph "Input Layer"
        A[Human Corrections] --> B[Corrections Store]
        C[Review Queue Feedback] --> B
        D[Ground Truth Labels] --> E[Memory Learner]
    end
    
    subgraph "Learning Layer"
        E --> F[Pair-Level Tuning]
        F --> G[Threshold Adjuster]
        F --> H[Matchkey Weighter]
        E --> I[Field-Level Tuning]
        I --> J[Strategy Selector]
        E --> K[Cluster-Level Tuning]
        K --> L[Decision Threshold Tuner]
    end
    
    subgraph "Output Layer"
        G --> M[AutoConfig Controller]
        H --> M
        J --> M
        L --> M
        M --> N[Matching Pipeline]
    end
    
    subgraph "Feedback Loop"
        N --> O[Review Queue]
        O --> A
    end
```

### Core Components

| Component | File | Responsibility |
|-----------|------|----------------|
| `MemoryCorrections` | `core/memory/corrections.py` | Persistent storage of human corrections |
| `MemoryLearner` | `core/memory/learner.py` | Pair-level learning from corrections |
| `FieldStrategyTuner` | `core/autoconfig_field_strategy_tuner.py` | Field-level strategy optimization |
| `ClusterDecisionTuner` | `core/autoconfig_cluster_threshold_tune.py` | Cluster threshold optimization |

## Corrections Store

The `MemoryCorrections` class provides the persistent backing store for all human feedback. It tracks corrections at both the pair level (which records should be linked or unlinked) and the field level (which field values should win in survivorship). Source: [packages/python/goldenmatch/goldenmatch/core/memory/corrections.py:1-50](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/core/memory/corrections.py#L1-L50)

### Data Model

```python
class MemoryCorrections:
    corrections: list[Correction]
    
class Correction:
    record_id_a: str      # First record in the pair
    record_id_b: str      # Second record in the pair
    decision: str         # "approve" or "reject"
    confidence: float     # Human confidence 0.0-1.0
    source: str           # "human", "ground_truth", "review_queue"
    timestamp: datetime
    metadata: dict        # Additional context
```

### CRUD Operations

| Operation | Method | Description |
|-----------|--------|-------------|
| Create | `add_correction()` | Record a new correction |
| Read | `get_corrections()` | Retrieve corrections with filters |
| Update | `update_correction()` | Modify an existing correction |
| Delete | `remove_correction()` | Remove a correction |

The corrections store supports filtering by:
- Source type (human, ground_truth, review_queue)
- Decision type (approve, reject)
- Date range
- Record pair
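
The filters listed above can be sketched as a plain in-memory function. This is only an illustration under assumptions: `filter_corrections` and its parameter names are hypothetical, and the real `get_corrections()` signature is not shown in this section.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class Correction:
    """Mirrors the fields of the data model shown earlier in this section."""
    record_id_a: str
    record_id_b: str
    decision: str
    confidence: float
    source: str
    timestamp: datetime
    metadata: dict = field(default_factory=dict)


def filter_corrections(
    corrections: list[Correction],
    source: Optional[str] = None,
    decision: Optional[str] = None,
    since: Optional[datetime] = None,
    until: Optional[datetime] = None,
    pair: Optional[tuple[str, str]] = None,
) -> list[Correction]:
    """Hypothetical helper: keep corrections that pass every filter given."""
    wanted = frozenset(pair) if pair is not None else None
    kept = []
    for c in corrections:
        if source is not None and c.source != source:
            continue
        if decision is not None and c.decision != decision:
            continue
        if since is not None and c.timestamp < since:
            continue
        if until is not None and c.timestamp > until:
            continue
        # Treat (a, b) and (b, a) as the same pair.
        if wanted is not None and frozenset((c.record_id_a, c.record_id_b)) != wanted:
            continue
        kept.append(c)
    return kept


store = [
    Correction("r1", "r2", "approve", 0.95, "human", datetime(2026, 1, 5)),
    Correction("r2", "r3", "reject", 0.80, "review_queue", datetime(2026, 2, 1)),
    Correction("r4", "r5", "approve", 1.00, "ground_truth", datetime(2026, 3, 9)),
]
human_approvals = filter_corrections(store, source="human", decision="approve")
```

Filters combine with logical AND, so passing none of them returns every stored correction.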

## Pair-Level Learning: MemoryLearner

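The `MemoryLearner` processes corrections at the record-pair level and extracts patterns about which matchkey combinations indicate true matches versus false positives. Source: [packages/python/goldenmatch/goldenmatch/core/memory/learner.py:1-100](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/core/memory/learner.py#L1-L100)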

### Learning Algorithm

```python
class MemoryLearner:
    def learn(self, corrections: MemoryCorrections) -> LearnedWeights:
        """
        Process corrections and compute updated matchkey weights.
        """
```

The learner computes:
1. **Precision per matchkey**: Ratio of correct to total positive predictions for each matchkey
2. **Recall per matchkey**: Coverage of true matches captured by each matchkey
3. **Composite weights**: Combination weights that balance precision and recall

### Weight Computation

Weights are computed with a TF-IDF-inspired formula that scales each matchkey's precision by the log of its frequency:

| Metric | Formula | Purpose |
|--------|---------|---------|
| Matchkey Precision | `correct_pairs / total_pairs_for_key` | How reliable is this key? |
| Key Frequency | `pairs_using_key / total_pairs` | How common is this key? |
| Composite Weight | `precision * log(key_frequency + 1)` | Balanced importance |
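
The formulas in the table translate directly into a few lines of arithmetic. The sketch below assumes a natural logarithm and treats `pairs_using_key` and `total_pairs_for_key` as the same count; neither detail is stated in the table, and `composite_weight` is a hypothetical name.

```python
import math


def composite_weight(correct_pairs: int, total_pairs_for_key: int, total_pairs: int) -> float:
    """precision * log(key_frequency + 1), following the table above."""
    if total_pairs_for_key == 0 or total_pairs == 0:
        return 0.0
    precision = correct_pairs / total_pairs_for_key
    key_frequency = total_pairs_for_key / total_pairs
    return precision * math.log(key_frequency + 1)  # natural log (assumption)


# Hypothetical counts out of 200 reviewed pairs.
email_weight = composite_weight(correct_pairs=40, total_pairs_for_key=50, total_pairs=200)
name_weight = composite_weight(correct_pairs=60, total_pairs_for_key=120, total_pairs=200)
```

With these made-up counts the less precise but more frequent name key ends up with the larger weight, which shows how strongly the frequency term influences the result.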

## Field-Level Learning: FieldStrategyTuner

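The `FieldStrategyTuner` optimizes which scoring strategy to use for each field based on correction patterns. Source: [packages/python/goldenmatch/goldenmatch/core/autoconfig_field_strategy_tuner.py:1-80](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/core/autoconfig_field_strategy_tuner.py#L1-L80)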

### Available Field Strategies

| Strategy | Use Case | Example |
|----------|----------|---------|
| `exact` | Unique identifiers, codes | SSN, account numbers |
| `fuzzy` | Names, addresses with typos | "John" vs "Jon" |
| `tokenized` | Multi-word fields | "John Smith" vs "Smith, John" |
| `numeric` | Numbers with tolerance | Prices, quantities |
| `date` | Temporal fields | Birth dates, transaction dates |
| `phonetic` | Names with spelling variants | Soundex, Metaphone |

### Tuning Process

1. **Collect field-level signals** from corrections (which field caused the error?)
2. **Compute strategy accuracy** per field for each strategy
3. **Select best strategy** using cross-validation to avoid overfitting
4. **Generate strategy map**: `{field_name: strategy_name}`
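
Steps 2 to 4 can be sketched as follows, assuming the cross-validation step has already produced one held-out accuracy per fold for each field and strategy. The input shape and the name `select_strategies` are hypothetical; the tuner's actual internals are not shown in this section.

```python
from statistics import mean


def select_strategies(fold_accuracy: dict[str, dict[str, list[float]]]) -> dict[str, str]:
    """Pick, per field, the strategy with the best mean held-out accuracy."""
    strategy_map = {}
    for field_name, per_strategy in fold_accuracy.items():
        best = max(per_strategy, key=lambda s: mean(per_strategy[s]))
        strategy_map[field_name] = best
    return strategy_map


# Hypothetical held-out accuracies from a 3-fold split of the corrections.
scores = {
    "name": {"exact": [0.60, 0.62, 0.58], "fuzzy": [0.90, 0.88, 0.91]},
    "phone": {"exact": [0.97, 0.96, 0.98], "fuzzy": [0.90, 0.93, 0.91]},
}
strategy_map = select_strategies(scores)
```

Averaging over folds rather than scoring on all corrections at once is what guards against picking a strategy that only fits a handful of reviewed pairs.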

## Cluster-Level Learning: ClusterDecisionTuner

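The `ClusterDecisionTuner` (introduced in v1.20.0) consumes cluster-level approve/reject decisions and proposes a per-dataset auto-approve threshold. Source: [packages/python/goldenmatch/goldenmatch/core/autoconfig_cluster_threshold_tune.py:1-60](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/core/autoconfig_cluster_threshold_tune.py#L1-L60)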

### Threshold Tuning Algorithm

```python
class ClusterDecisionTuner:
    def tune_decision_threshold(
        self,
        decisions: list[ClusterDecision],
        target_recall: float = 0.95
    ) -> float:
        """
        Find the threshold that achieves target_recall on approved clusters.
        """
```

### Tuning Inputs

| Input | Type | Description |
|-------|------|-------------|
| `decisions` | `list[ClusterDecision]` | Human cluster decisions |
| `target_recall` | `float` | Desired recall target (default 0.95) |
| `min_approve_confidence` | `float` | Minimum confidence for auto-approve |

### Tuning Outputs

| Output | Type | Description |
|--------|------|-------------|
| `threshold` | `float` | Suggested auto-approve threshold |
| `expected_precision` | `float` | Estimated precision at this threshold |
| `calibration_curve` | `list[tuple]` | Precision-recall tradeoff points |
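
One way to read "find the threshold that achieves `target_recall` on approved clusters" is sketched below. The `ClusterDecision` shape (a score plus an approved flag) and the function `tune_threshold` are assumptions made for illustration; the real tuner may weigh decisions differently.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterDecision:
    """Hypothetical shape: a cluster confidence score and the human verdict."""
    score: float
    approved: bool


def tune_threshold(decisions: list[ClusterDecision], target_recall: float = 0.95):
    """Return (threshold, expected_precision) for the given recall target."""
    approved = sorted((d.score for d in decisions if d.approved), reverse=True)
    if not approved:
        raise ValueError("need at least one approved cluster")
    # Number of top-scoring approved clusters needed to reach the target recall.
    k = max(1, math.ceil(target_recall * len(approved)))
    threshold = approved[k - 1]
    # Precision among everything that would be auto-approved at this threshold.
    kept = [d for d in decisions if d.score >= threshold]
    precision = sum(d.approved for d in kept) / len(kept)
    return threshold, precision


decisions = [
    ClusterDecision(0.90, True),
    ClusterDecision(0.80, True),
    ClusterDecision(0.75, False),
    ClusterDecision(0.70, True),
    ClusterDecision(0.60, True),
    ClusterDecision(0.50, False),
]
threshold, expected_precision = tune_threshold(decisions, target_recall=0.75)
```

Raising `target_recall` lowers the threshold, which admits more rejected clusters and so reduces the expected precision.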

## Auto-Config Integration

Learning Memory integrates with GoldenMatch's AutoConfig system, which automatically optimizes matching parameters. The `pick_committed` method in the controller determines which candidate configuration to commit. Source: [packages/python/goldenmatch/goldenmatch/core/autoconfig_golden_strategy_tuner.py:1-100](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/goldenmatch/core/autoconfig_golden_strategy_tuner.py#L1-L100)

### Commit Priority

The controller uses a multi-factor ranking for candidate selection:

1. **Health rank**: Overall dataset health improvement
2. **Mass separation**: Confidence gap between best and second-best candidate
3. **Zero-label confidence** (v1.23.0+): Preference for higher overall_confidence among zero-label profile candidates

```python
def pick_committed(self, candidates: list[Candidate]) -> Candidate:
    """
    Select the best candidate using health rank, mass separation,
    and zero-label confidence tiebreaker.
    """
```
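
A multi-factor ranking of this kind is often written as a tuple sort key. The sketch below assumes the three factors are applied lexicographically in the listed order and that a lower `health_rank` is better; both are assumptions, and the `Candidate` fields are hypothetical stand-ins for whatever the controller actually stores.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical candidate record, for illustration only."""
    name: str
    health_rank: int           # lower is better (assumption)
    mass_separation: float     # higher is better
    overall_confidence: float  # higher is better, used as the tiebreaker


def pick_committed(candidates: list[Candidate]) -> Candidate:
    """Order by health rank, then mass separation, then confidence."""
    if not candidates:
        raise ValueError("no candidates to choose from")
    return min(
        candidates,
        key=lambda c: (c.health_rank, -c.mass_separation, -c.overall_confidence),
    )


best = pick_committed([
    Candidate("a", health_rank=1, mass_separation=0.4, overall_confidence=0.70),
    Candidate("b", health_rank=1, mass_separation=0.4, overall_confidence=0.90),
    Candidate("c", health_rank=2, mass_separation=0.9, overall_confidence=0.99),
])
```

Candidates `a` and `b` tie on the first two factors, so the confidence tiebreaker decides; `c` loses on health rank despite its higher confidence.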

### AutoConfig Workflow

```mermaid
graph LR
    A[Initialize Config] --> B[Generate Candidates]
    B --> C[Score Candidates]
    C --> D[Learn from Corrections]
    D --> E[Update Weights]
    E --> F[Filter Candidates]
    F --> G{Rank by Health?}
    G -->|Yes| H[Select Best]
    G -->|No| I[Apply Learning Memory]
    I --> E
    H --> J[Commit Configuration]
    J --> K[Execute Matching]
    K --> L[Generate Review Queue]
    L --> M[Human Feedback]
    M --> D
```

## Memory in Postgres Extension

The `goldenmatch_pg` extension (v0.5.0+) provides native Postgres functions for Learning Memory operations, enabling SQL-native corrections and statistics. Source: [packages/python/goldenmatch/CHANGELOG.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/CHANGELOG.md)

### Available Functions

| Function | Purpose |
|----------|---------|
| `memory_learn()` | Record corrections from SQL |
| `memory_stats()` | Retrieve learning statistics |
| `memory_clear()` | Reset corrections for a dataset |

### SQL Usage Example

```sql
-- Record a correction
SELECT memory_learn(
    'record_a_id',
    'record_b_id', 
    'approve',  -- or 'reject'
    0.95
);

-- Get learning statistics
SELECT * FROM memory_stats('my_dataset');

-- Clear corrections for re-learning
SELECT memory_clear('my_dataset');
```

## Usage Patterns

### Basic Correction Flow

```python
from goldenmatch import GoldenMatch, MemoryCorrections

# Initialize with memory
gm = GoldenMatch(config)
corrections = MemoryCorrections()

# Run initial matching
results = gm.match(data)

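# Present review queue to human. human_review is a placeholder for your own
# review callback; it should return "approve" or "reject".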
for pair in results.review_queue:
    decision = human_review(pair)
    corrections.add_correction(
        record_id_a=pair.id_a,
        record_id_b=pair.id_b,
        decision=decision,
        confidence=0.95
    )

# Apply learning and re-run
gm.memory_learner.learn(corrections)
refined_results = gm.match(data)  # Uses learned weights
```

### Field Strategy Tuning

```python
from goldenmatch.core.autoconfig_field_strategy_tuner import FieldStrategyTuner

tuner = FieldStrategyTuner(dataset_id="my_dataset")

# Tune from corrections
strategy_map = tuner.tune(
    corrections=corrections,
    fields=["name", "address", "phone", "email"]
)

# Apply to config
config.field_strategies = strategy_map
```

### Cluster Threshold Tuning

```python
from goldenmatch.core.autoconfig_cluster_threshold_tune import ClusterDecisionTuner

tuner = ClusterDecisionTuner()

# Tune from cluster decisions
threshold = tuner.tune_decision_threshold(
    decisions=cluster_decisions,
    target_recall=0.97
)

# Apply auto-approve threshold
config.auto_approve_threshold = threshold
```

## Configuration Options

| Option | Default | Description |
|--------|---------|-------------|
| `memory.enabled` | `True` | Enable/disable learning memory |
| `memory.learning_rate` | `0.1` | Rate at which new corrections update weights |
| `memory.decay_factor` | `0.95` | Weight decay for older corrections |
| `memory.min_corrections` | `10` | Minimum corrections before tuning |
| `memory.strategy` | `"balanced"` | Tuning strategy: `balanced`, `precision`, `recall` |

## Version History

| Version | Feature |
|---------|---------|
| v1.19.0 | Initial Learning Memory with MemoryLearner |
| v1.20.0 | Added ClusterDecisionTuner (third tuner in family) |
| v1.23.0 | Zero-label confidence tiebreaker in `pick_committed` |
| v1.24.0 | Heuristic rule expansion + diagnostic harness |

## Limitations and Considerations

### Data Requirements

- Learning Memory requires a minimum number of corrections (default: 10) before producing reliable recommendations
- Highly imbalanced datasets (rare true matches) may need more corrections for accurate threshold tuning

### Convergence

- Weights converge faster when corrections are evenly distributed across matchkey types
- Cluster threshold tuning may require iterative refinement for datasets with unusual cluster size distributions

### Production Considerations

- Periodically review learned weights to ensure they remain aligned with business rules
- Reset memory when significant schema changes occur
- Monitor precision/recall drift over time

---

<a id='blocking-scoring'></a>

## Blocking and Scoring

### Related Pages

Related topics: [Core Matching Engine](#core-matching)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [packages/python/goldenmatch/examples/README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/examples/README.md)
- [packages/python/goldenmatch/web/frontend/src/lib/types.ts](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/web/frontend/src/lib/types.ts)
- [packages/python/goldenmatch/web/frontend/src/lib/api.ts](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/web/frontend/src/lib/api.ts)
- [README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/README.md)
- [packages/python/goldenflow/README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenflow/README.md)
</details>

# Blocking and Scoring

## Overview

Blocking and scoring are the two core mechanisms that enable GoldenMatch to perform entity resolution at scale. Without blocking, comparing every record against every other record would result in O(n²) comparisons—a computationally infeasible approach for datasets with millions of records. The blocking phase reduces the candidate space by grouping records that are likely to represent the same entity, while the scoring phase evaluates the similarity of each candidate pair to determine whether they should be merged.

In GoldenMatch, these mechanisms work together through a configurable pipeline that supports multiple blocking strategies (exact, fuzzy, token-based, and approximate nearest neighbor), multiple scoring approaches (string similarity, field weighting, and optional LLM-based evaluation), and a feedback-driven tuning system that learns from user corrections to improve accuracy over time.

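Source: [packages/python/goldenmatch/examples/README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/examples/README.md)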

## Architecture

The blocking and scoring pipeline follows a multi-stage architecture:

```mermaid
graph TD
    A[Input Records] --> B[Standardization]
    B --> C[Blocking Strategy]
    C --> D[Candidate Pairs]
    D --> E[Field-Level Scoring]
    E --> F[Record-Level Score]
    F --> G[Clustering]
    G --> H[Golden Records]
    C -->|ann_* strategies| I[Approximate Nearest Neighbors]
    F -->|optional| J[Cross-Encoder Reranking]
```

### Components

| Component | Purpose | Key Classes/Modules |
|-----------|---------|---------------------|
| **Blocker** | Generates candidate pairs using blocking keys | `blocker.py` |
| **Matchkey** | Defines how fields contribute to blocking and scoring | `matchkey.py` |
| **Scorer** | Computes field and record-level similarity | `scoring.py` |
| **Cross-Encoder** | Optional reranking of candidate pairs | `cross_encoder.py` |
| **Controller** | Orchestrates the pipeline and applies thresholds | `controller.py` |

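Source: [packages/python/goldenmatch/examples/README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/examples/README.md)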

## Blocking Strategies

### Exact Blocking

Exact blocking groups records that share identical values on one or more key fields. This is the simplest and fastest approach, ideal for high-quality, well-standardized data.

```python
blocking = {
    "strategy": "exact",
    "fields": ["email"]
}
```
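
Conceptually, exact blocking is a group-by on the key fields followed by pairing within each group. The following is a minimal sketch of that idea, not GoldenMatch's `blocker.py`; the function names and the record shape (dicts with an `id` key) are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def exact_blocks(records: list[dict], fields: list[str]) -> dict[tuple, list]:
    """Group record ids by the exact values of the blocking fields."""
    blocks = defaultdict(list)
    for rec in records:
        key = tuple(rec.get(f) for f in fields)
        if any(v is None or v == "" for v in key):
            continue  # records with a missing key value join no block
        blocks[key].append(rec["id"])
    return dict(blocks)


def candidate_pairs(blocks: dict[tuple, list]) -> set[tuple]:
    """All unordered id pairs that share a block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


records = [
    {"id": 1, "email": "a@x.com"},
    {"id": 2, "email": "a@x.com"},
    {"id": 3, "email": "b@x.com"},
    {"id": 4, "email": ""},
]
pairs = candidate_pairs(exact_blocks(records, ["email"]))
```

Only records 1 and 2 share a key, so a single candidate pair is produced instead of the six pairs an all-against-all comparison would need.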

### Sorted Neighborhood Blocking

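Records are sorted by their blocking key values, and each record is compared only with the others that fall inside a sliding window (`window_size`) over that sorted order. This catches near-duplicates whose keys differ slightly and so would never share a block under exact blocking.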

```python
blocking = {
    "strategy": "sorted_neighborhood",
    "fields": ["last_name", "zip5"],
    "window_size": 5
}
```
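
The sliding-window idea can be sketched in plain Python. The sketch assumes a window of size `w` pairs each record with the next `w - 1` records in sorted order, which is the textbook convention; GoldenMatch's exact window semantics are not specified here, and `sorted_neighborhood_pairs` is a hypothetical name.

```python
def sorted_neighborhood_pairs(records: list[dict], fields: list[str], window_size: int = 3) -> set[tuple]:
    """Sort by the blocking key and pair each record with its window neighbors."""
    ordered = sorted(records, key=lambda r: tuple(str(r.get(f, "")) for f in fields))
    pairs = set()
    for i, rec in enumerate(ordered):
        # The next (window_size - 1) records share a window with rec.
        for other in ordered[i + 1 : i + window_size]:
            pairs.add(tuple(sorted((rec["id"], other["id"]))))
    return pairs


records = [
    {"id": 1, "last_name": "smith", "zip5": "10001"},
    {"id": 2, "last_name": "smyth", "zip5": "10001"},
    {"id": 3, "last_name": "smithe", "zip5": "10001"},
    {"id": 4, "last_name": "jones", "zip5": "10001"},
]
narrow = sorted_neighborhood_pairs(records, ["last_name", "zip5"], window_size=2)
wide = sorted_neighborhood_pairs(records, ["last_name", "zip5"], window_size=3)
```

Widening the window trades extra comparisons for recall: "smith" and "smyth" are not adjacent after sorting, so they are only paired once the window reaches three.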

### Multi-Pass Blocking

For complex datasets, multiple blocking passes with different keys can capture different types of matches:

```python
blocking = [
    {"strategy": "exact", "fields": ["email"]},
    {"strategy": "sorted_neighborhood", "fields": ["last_name", "first_name"], "window_size": 5},
    {"strategy": "sorted_neighborhood", "fields": ["phone"]}
]
```
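
Multi-pass blocking amounts to taking the union of the candidate pairs from each pass. A small sketch, assuming each pass yields a set of id pairs; `merge_passes` and the pass labels are hypothetical. Keeping track of which passes found each pair also gives a rough view of the cross-blocking overlap.

```python
def merge_passes(passes: dict[str, set[tuple]]) -> dict[tuple, set[str]]:
    """Union pairs across passes, recording which passes produced each pair."""
    merged: dict[tuple, set[str]] = {}
    for pass_name, pairs in passes.items():
        for a, b in pairs:
            key = (a, b) if a <= b else (b, a)  # canonical order removes duplicates
            merged.setdefault(key, set()).add(pass_name)
    return merged


merged = merge_passes({
    "email": {(1, 2)},
    "name": {(2, 1), (2, 3)},
})
found_by_several = [pair for pair, names in merged.items() if len(names) > 1]
```

Each distinct pair is scored once, no matter how many passes proposed it.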

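Source: [packages/python/goldenmatch/examples/README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/examples/README.md)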

### Approximate Nearest Neighbor (ANN) Blocking

For high-cardinality string fields, ANN blocking provides efficient similarity-based candidate generation:

```python
blocking = {
    "strategy": "ann_l2",
    "fields": ["full_address"],
    "distance_threshold": 0.3
}
```

The v1.24.0 release introduced the **QIS-bucket** strategy and **Chao1 scale-aware cardinality** estimation, which together improve blocking accuracy on large datasets. On a 10M-record benchmark, the release cut wall-clock time by 81% (2604s → 502s) while maintaining F1=0.9886.
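
For readers unfamiliar with Chao1, the estimator itself is short. The sketch below implements the standard Chao1 richness formula (observed distinct values plus a correction from the counts of values seen once and twice); how GoldenMatch feeds the estimate into bucket sizing is not described here, so only the estimator is shown.

```python
from collections import Counter


def chao1(values: list) -> float:
    """Standard Chao1 estimate of the number of distinct values in the population."""
    counts = Counter(values)
    s_obs = len(counts)                               # distinct values observed
    f1 = sum(1 for c in counts.values() if c == 1)    # values seen exactly once
    f2 = sum(1 for c in counts.values() if c == 2)    # values seen exactly twice
    if f2 == 0:
        # Bias-corrected form, used when no value occurs exactly twice.
        return s_obs + f1 * (f1 - 1) / 2
    return s_obs + f1 * f1 / (2 * f2)


sample = ["a", "a", "b", "b", "c", "d", "e"]
estimate = chao1(sample)
```

Many singletons in a sample suggest that many more distinct values remain unseen, which is why the estimate grows with the singleton count.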

Source: [README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/README.md)

### Blocking Configuration Fields

| Field | Type | Description |
|-------|------|-------------|
| `strategy` | string | One of: `exact`, `sorted_neighborhood`, `ann_l2`, `ann_cosine`, `canopy`, `learned` |
| `fields` | list[string] | Fields to use for blocking |
| `window_size` | int | Window size for sorted neighborhood (default: 3) |
| `distance_threshold` | float | Distance threshold for ANN strategies |
| `extras` | dict | Advanced strategy-specific parameters |

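Source: [packages/python/goldenmatch/web/frontend/src/lib/types.ts](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/web/frontend/src/lib/types.ts)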

## Scoring

### Field-Level Scoring

Each field contributes a similarity score based on its configured strategy:

| Strategy | Description | Use Case |
|----------|-------------|----------|
| `levenshtein` | Character-level edit distance | Names, addresses |
| `jaro_winkler` | Transposition-tolerant similarity that favors a shared prefix; suited to short strings | Names |
| `token_set` | Set intersection of tokens | Address components |
| `numeric` | Absolute difference / range | Dates, amounts |
| `exact` | Binary match/no-match | IDs, codes |
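
To make the table concrete, here are pure-Python stand-ins for three of the strategies. They are sketches under assumptions: the Levenshtein score is normalized by the longer string's length, and `token_set` is approximated with Jaccard overlap of lowercase whitespace tokens, which may differ from the library GoldenMatch uses in `scoring.py`.

```python
def levenshtein_similarity(a: str, b: str) -> float:
    """1 - edit_distance / max_length, in the range 0.0 to 1.0."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free when equal)
            ))
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b))


def token_set_similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercase whitespace tokens (stand-in for token_set)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def exact_similarity(a: str, b: str) -> float:
    """Binary match / no-match."""
    return 1.0 if a == b else 0.0


name_score = levenshtein_similarity("john", "jon")
order_score = token_set_similarity("John Smith", "smith john")
```

The token-based score ignores word order, which is why it suits multi-word fields, while the edit-distance score tolerates single-character typos.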

### Matchkey Definition

Matchkeys define how fields participate in both blocking and scoring:

```python
matchkeys = [
    {
        "fields": ["email"],
        "blocking": True,
        "score_type": "exact",
        "weight": 1.0
    },
    {
        "fields": ["first_name", "last_name"],
        "blocking": True,
        "score_type": "token_set",
        "weight": 0.8
    },
    {
        "fields": ["phone"],
        "blocking": True,
        "score_type": "levenshtein",
        "weight": 0.6
    }
]
```

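Source: [packages/python/goldenmatch/examples/README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/examples/README.md)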

### Record-Level Scoring

The record-level score combines field scores using weighted averaging:

```
record_score = Σ(field_score × field_weight) / Σ(field_weight)
```

Candidate pairs whose record-level score exceeds the threshold are linked. The default threshold is tuned automatically from dataset characteristics by the autoconfig system introduced in v1.19.0.
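
The weighted average above works out as follows for a single candidate pair. The field scores are made up for illustration, the weights reuse the values from the matchkey example, and `record_score` is a hypothetical helper name.

```python
def record_score(field_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of field scores: sum(score * weight) / sum(weight)."""
    total_weight = sum(weights[f] for f in field_scores)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(field_scores[f] * weights[f] for f in field_scores)
    return weighted_sum / total_weight


weights = {"email": 1.0, "name": 0.8, "phone": 0.6}
scores = {"email": 1.0, "name": 0.5, "phone": 0.0}  # hypothetical field scores
score = record_score(scores, weights)  # (1.0 + 0.4 + 0.0) / 2.4
```

Dividing by the sum of weights keeps the result in the same 0 to 1 range as the field scores, so a single threshold remains meaningful whatever weights are chosen.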

Source: [README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/README.md)

### Weighted Matchkeys

Matchkeys can carry different weights so that more discriminative fields dominate the record-level score:

```python
matchkeys = [
    {"fields": ["company_name"], "weight": 0.7, "score_type": "token_set"},
    {"fields": ["address", "city"], "weight": 0.3, "score_type": "token_set"},
    # Optional multi-pass with different weight profiles
]
```

The equipment deduplication example demonstrates multi-pass blocking with ANN fallback, weighted fuzzy matching, and LLM calibration for challenging datasets.

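Source: [packages/python/goldenmatch/examples/README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/examples/README.md)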

## Cross-Encoder Reranking

The optional cross-encoder module provides a secondary scoring pass that considers field interactions:

```python
from goldenmatch.core.cross_encoder import CrossEncoderScorer

scorer = CrossEncoderScorer(model="cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = list(candidate_pairs)  # each item is a (record_a, record_b) tuple
reranked = scorer.score_batch(pairs)
```

Cross-encoding is particularly valuable when field combinations carry more signal than individual fields—for example, a name and address together are more distinctive than either alone.

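Source: [packages/python/goldenmatch/examples/README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/examples/README.md)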

## Configuration Payload

The web frontend communicates blocking and scoring configuration using a typed payload:

```typescript
export type RulesPayload = {
  threshold: number;
  matchkeys: Matchkey[];
  standardization?: StandardizationRules | null;
  blocking?: BlockingPayload | null;
};
```

Known blocking keys are validated at the server boundary, while unknown keys are preserved in an `extras` field for advanced strategies:

```typescript
const BLOCKING_KNOWN_KEYS = new Set([
  "strategy", "fields", "window_size", "distance_threshold",
  "_block_size", "skip_oversized", "auto_suggest", "auto_select"
]);
```
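
The known-key / `extras` split described above can be sketched in Python as follows. This is an illustration of the idea rather than the server's validation code: the function name is hypothetical, the set is a subset of the TypeScript constant, and `n_probes` is a made-up advanced parameter.

```python
# Subset of the known keys listed in the TypeScript constant above.
BLOCKING_KNOWN_KEYS = {"strategy", "fields", "window_size", "distance_threshold"}


def split_blocking_payload(raw: dict) -> dict:
    """Keep known keys at the top level and move the rest under 'extras'."""
    payload = {k: v for k, v in raw.items() if k in BLOCKING_KNOWN_KEYS}
    extras = {k: v for k, v in raw.items() if k not in BLOCKING_KNOWN_KEYS}
    if extras:
        payload["extras"] = extras
    return payload


payload = split_blocking_payload(
    {"strategy": "ann_l2", "fields": ["full_address"], "n_probes": 8}
)
```

Preserving unknown keys instead of rejecting them lets advanced strategies accept new parameters without a change to the typed payload.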

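Source: [packages/python/goldenmatch/web/frontend/src/lib/types.ts](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenmatch/web/frontend/src/lib/types.ts)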

## Standardization Pipeline

Effective blocking and scoring depend on data standardization. GoldenFlow provides the transform library that should be applied before matching:

| Transform Category | Examples |
|-------------------|----------|
| **Text** | `strip`, `lowercase`, `normalize_unicode`, `normalize_quotes` |
| **Phone** | `phone_e164`, `phone_national` |
| **Address** | `address` (full address standardization) |
| **Numeric** | `extract_numbers`, `parse_currency` |

```python
standardization = {
    "email": "lowercase",
    "phone": "phone_e164",
    "address": "address",
    "state": "state"
}
```
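
To show why standardization matters before blocking, here is a standard-library sketch of three of the text transforms. It does not use GoldenFlow: the `TRANSFORMS` table, the `standardize` helper, the list-of-transforms rule shape, and the choice of NFKC for `normalize_unicode` are all assumptions made for illustration.

```python
import unicodedata

# Stand-ins for a few text transforms; GoldenFlow's own implementations may differ.
TRANSFORMS = {
    "strip": str.strip,
    "lowercase": str.lower,
    "normalize_unicode": lambda s: unicodedata.normalize("NFKC", s),
}


def standardize(record: dict, rules: dict[str, list[str]]) -> dict:
    """Apply the named transforms, in order, to each configured field."""
    out = dict(record)
    for field_name, transform_names in rules.items():
        value = out.get(field_name)
        if value is None:
            continue
        for name in transform_names:
            value = TRANSFORMS[name](value)
        out[field_name] = value
    return out


raw = {"email": "  Alice@Example.COM ", "name": "Ｊｏｈｎ"}
clean = standardize(raw, {"email": ["strip", "lowercase"], "name": ["normalize_unicode"]})
```

After cleaning, the email would fall into the same exact block as `alice@example.com`, which the raw value would have missed.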

Source: [packages/python/goldenflow/README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/packages/python/goldenflow/README.md)

## Performance Considerations

### Bucket Strategies

The QIS-bucket strategy (v1.24.0) provides scale-aware cardinality estimation that adjusts bucket parameters based on dataset size. This prevents both over-blocking (too many candidates) and under-blocking (missing matches).

### Memory Reduction

The v1.24.0 release achieved an 18% RSS reduction through optimized data structures and streaming processing. Key optimizations include:

- Single-group-by-per-column vectorization for golden record building
- Lazy evaluation of scoring for low-confidence pairs
- Chunked processing for very large candidate sets
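
Chunked processing of a large candidate set usually means consuming a lazy iterator in fixed-size batches so that only one batch is held in memory at a time. A generic sketch of that pattern follows; `chunked` is a hypothetical helper, not a GoldenMatch function.

```python
from itertools import islice
from typing import Iterable, Iterator


def chunked(items: Iterable, size: int) -> Iterator[list]:
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Pairs come from a generator, so they are never all materialized at once.
candidate_stream = ((i, i + 1) for i in range(5))
batches = list(chunked(candidate_stream, 2))
```

Each batch can be scored and discarded before the next one is read, which bounds peak memory by the batch size rather than by the total number of candidate pairs.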

Source: [README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/README.md)

## Auto-Configuration

GoldenMatch can automatically tune blocking and scoring parameters:

```python
from goldenmatch.core.autoconfig import auto_configure

config = auto_configure(
    data=df,
    ground_truth=labels_df,  # Optional for supervised tuning
    target_recall=0.95
)
```

The autoconfig system uses:
- **Chao1 estimation** for cardinality-aware blocking tuning
- **Zero-label confidence** analysis for threshold calibration (v1.23.0)
- **Cluster-level tuning** for decision threshold optimization (v1.20.0)

Source: [README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/README.md)

## Memory and Learning

GoldenMatch maintains learned patterns across runs:

| Memory Type | Scope | Purpose |
|-------------|-------|---------|
| `MemoryLearner` | Pair-level | Learn from labeled match/non-match pairs |
| `FieldStrategyTuner` | Field-level | Optimize per-field scoring strategy |
| `ClusterDecisionTuner` | Cluster-level | Tune merge/reject thresholds |

These learning mechanisms enable the system to improve accuracy over time as users correct its decisions.

Source: [README.md](https://github.com/benseverndev-oss/goldenmatch/blob/main/README.md)

## Workflow Summary

```mermaid
graph LR
    A[Raw Data] --> B[GoldenFlow Standardization]
    B --> C[Blocking]
    C --> D[Scoring]
    D --> E[Clustering]
    E --> F{Threshold}
    F -->|Above| G[Auto-Approve]
    F -->|Below| H[Auto-Reject]
    F -->|Uncertain| I[Review Queue]
    I --> J[User Labels]
    J --> K[Memory Update]
    K --> C
```
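
The three-way branch in the diagram implies two cut-offs rather than one. A minimal sketch, assuming an upper auto-approve threshold and a lower auto-reject threshold; the parameter names and default values are hypothetical and are not GoldenMatch configuration keys.

```python
def route(score: float, approve_at: float = 0.90, reject_below: float = 0.60) -> str:
    """Send a scored cluster to auto-approve, auto-reject, or human review."""
    if score >= approve_at:
        return "auto_approve"
    if score < reject_below:
        return "auto_reject"
    return "review_queue"


routes = [route(s) for s in (0.95, 0.75, 0.40)]
```

Narrowing the gap between the two cut-offs shrinks the review queue at the cost of more automated decisions on borderline cases.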

## See Also

- [Golden Match Examples](packages/python/goldenmatch/examples/README.md) — Runnable scripts demonstrating all blocking and scoring patterns
- [Web UI Wiki](https://github.com/benseverndev-oss/goldenmatch/wiki/Web-UI) — Interactive blocking configuration in the browser
- [Auto-Configuration](https://github.com/benseverndev-oss/goldenmatch/wiki/Auto-Configuration) — Advanced tuning documentation
- [GoldenFlow Transforms](packages/python/goldenflow/README.md) — Standardization for improved matching

---


## Pitfall Log

Project: benseverndev-oss/goldenmatch

Summary: Found 6 structured pitfall items, none of them high-severity or blocking. Top priority: capability evidence risk, which requires verification.

## 1. Capability evidence risk - Capability evidence risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Suggested check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: capability.assumptions | github_repo:1183640892 | https://github.com/benseverndev-oss/goldenmatch

## 2. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Suggested check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | github_repo:1183640892 | https://github.com/benseverndev-oss/goldenmatch

## 3. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Suggested check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: downstream_validation.risk_items | github_repo:1183640892 | https://github.com/benseverndev-oss/goldenmatch

## 4. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Suggested check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: risks.scoring_risks | github_repo:1183640892 | https://github.com/benseverndev-oss/goldenmatch

## 5. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: issue_or_pr_quality=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Suggested check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | github_repo:1183640892 | https://github.com/benseverndev-oss/goldenmatch

## 6. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: release_recency=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Suggested check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | github_repo:1183640892 | https://github.com/benseverndev-oss/goldenmatch

<!-- canonical_name: benseverndev-oss/goldenmatch; human_manual_source: deepwiki_human_wiki -->