goldenmatch Manual - Doramagic.ai

Doramagic Project Pack · Human Manual

goldenmatch

Related topics: Getting Started, Suite Packages Overview

Golden Suite

A polyglot data-quality and entity-resolution toolkit. Polished, opinionated, AI-native.

*GoldenCheck profiles → GoldenFlow standardizes → GoldenMatch deduplicates → GoldenPipe orchestrates. With InferMap for schema mapping and a Rust extension layer for Postgres / DuckDB.*

Overview

Golden Suite is a comprehensive toolkit for data quality and entity resolution, designed to handle the complete lifecycle of messy data: profiling, standardization, deduplication, and orchestration. The project supports both Python and TypeScript ecosystems, with optional Rust acceleration for high-performance workloads.

Source: README.md:1-5

Packages Overview

Tool	Languages	Purpose	Install
GoldenMatch	Python · TS	Zero-config entity resolution. Fuzzy + exact + probabilistic + LLM. Headline package.	`pip install goldenmatch` · `npm i goldenmatch`
GoldenCheck	Python · TS	Data-quality scanning: encoding, Unicode, format validation, anomaly detection.	`pip install goldencheck` · `npm i goldencheck`
GoldenFlow	Python · TS	Transforms & standardizers: phone, date, address, categorical normalization.	`pip install goldenflow` · `npm i goldenflow`
GoldenPipe	Python · TS	Orchestrator that wires Check → Flow → Match into one declarative pipeline.	`pip install goldenpipe` · `npm i goldenpipe`
InferMap	Python · TS	Schema mapping engine — auto-aligns columns across heterogeneous sources.	`pip install infermap` · `npm i infermap`
goldenmatch-extensions	Rust	Postgres extension (pgrx) + DuckDB UDFs. SQL-native fuzzy matching.	source build
dbt-goldensuite	dbt · Python	dbt package — quality-gate tests, correction CRUD macros + GoldenCheck assertions for warehouse models.	`pip install dbt-goldensuite`
goldencheck-action	YAML	GitHub Action — CI with PR comments for data validation.	`uses: benseverndev-oss/goldencheck-action@v1`

Source: README.md:28-41

Architecture

graph LR
    A[Raw Data] --> B[GoldenCheck<br/>Profile & Validate]
    B --> C[GoldenFlow<br/>Standardize]
    C --> D[InferMap<br/>Schema Mapping]
    D --> E[GoldenMatch<br/>Deduplicate]
    E --> F[GoldenPipe<br/>Orchestrate]
    
    G[Postgres/DuckDB] --> E
    H[GitHub CI] --> B
    I[MCP Server] --> F

Data Flow

GoldenCheck profiles your data and discovers quality rules automatically
GoldenFlow transforms messy fields into canonical formats
InferMap aligns columns across heterogeneous schemas
GoldenMatch identifies and merges duplicate records
GoldenPipe orchestrates the entire pipeline declaratively

Source: packages/python/goldencheck/README.md:1-10

Quick Start

Python

# Headline package: dedupe a CSV in 30 seconds
pip install goldenmatch && goldenmatch dedupe customers.csv

Source: README.md:64-66

TypeScript / Node.js

# TypeScript / Edge runtimes
npm install goldenmatch

Source: README.md:69

GoldenMatch

The headline package for entity resolution. Supports multiple matching strategies:

Fuzzy matching: handles typos and variations
Exact matching: bit-for-bit comparisons
Probabilistic matching: Bayesian-style confidence scoring
LLM matching: semantic clustering with language models
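
To make the difference between the exact and fuzzy strategies concrete, here is a minimal sketch using only the Python standard library. It is not GoldenMatch's implementation: `difflib.SequenceMatcher` stands in for whatever similarity measure the package uses, and the function names and the 0.85 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def exact_match(a: str, b: str) -> bool:
    """Exact strategy: the two values must be identical."""
    return a == b


def fuzzy_score(a: str, b: str) -> float:
    """Fuzzy strategy: similarity in [0, 1] after trimming and lowercasing."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def is_fuzzy_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat a pair as a match once similarity reaches the assumed threshold."""
    return fuzzy_score(a, b) >= threshold


if __name__ == "__main__":
    left, right = "Jon Smith", "John Smith"
    print("exact:", exact_match(left, right))
    print("fuzzy score:", round(fuzzy_score(left, right), 3))
    print("fuzzy match:", is_fuzzy_match(left, right))
```

A typo such as "Jon" for "John" fails the exact comparison but clears the fuzzy threshold, which is the gap the fuzzy strategy is meant to cover.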

Key Features

Zero-config dedup for common cases
Configurable matchkeys and blocking strategies
Field-level score explanations
Streaming/incremental matching for new records
PPRL (Privacy-Preserving Record Linkage) for cross-organization matching

Performance

v1.24.0 cut wall time on the 10M-record benchmark by 81% while leaving F1 unchanged:

Metric	Before	After	Improvement
10M records (wall time)	2604s	502s	-81%
Peak RSS	baseline	18% below baseline	-18%
F1 Score	0.9886	0.9886	invariant

Source: packages/python/goldenmatch/README.md:1-50
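
The headline percentage can be checked directly from the two wall-time figures in the table. The short calculation below uses only those two numbers; the variable names are illustrative.

```python
before_s = 2604  # wall time before, in seconds (from the table)
after_s = 502    # wall time after, in seconds (from the table)

reduction = (before_s - after_s) / before_s  # fraction of wall time removed
speedup = before_s / after_s                 # how many times faster the run is

print(f"wall-time reduction: {reduction:.1%}")
print(f"speedup factor: {speedup:.2f}x")
```

The reduction works out to roughly 80.7%, which the release notes round to 81%, equivalent to a speedup of a little over 5x.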

GoldenCheck

Data validation that discovers rules from your data so you don't have to write them.

Core Capabilities

Automatic rule discovery from data patterns
Encoding and Unicode validation
Format validation and anomaly detection
Health score grading (A-F)
Multiple output formats: HTML, JSON, TUI

Domain Type Packs

Community-contributed semantic type definitions for improved detection:

Domain	Types	Description
healthcare	10	NPI, ICD codes, insurance IDs, patient demographics, CPT, DRG
finance	8	Account numbers, routing numbers, CUSIP/ISIN, currency, transactions
ecommerce	9	SKUs, order IDs, tracking numbers, categories, shipping

Source: packages/typescript/goldencheck-types/README.md:1-30

GoldenFlow

Transforms and standardizers for messy data fields.

Transform Categories

Category	Count	Examples
Text Transforms	18	strip, lowercase, normalize_unicode, remove_html_tags
Phone Transforms	5	phone_e164, phone_national, phone_format
Date Transforms	7	date_parse, date_format, date_floor
Address Transforms	6	address_parse, address_standardize
Numeric Transforms	4	parse_currency, parse_number
Categorical Transforms	4	category_normalize, category_map

Source: packages/python/goldenflow/README.md:1-100

InferMap

Inference-driven schema mapping engine. Maps messy source columns to known target schemas with confidence scores and human-readable reasoning.

Supported Data Sources

CSV files
DataFrames
Database tables
In-memory records

TypeScript Compatibility

Next.js Server Components
Route Handlers
Server Actions
Edge Runtime

Source: packages/python/infermap/README.md:1-80

Optional Components

Native Acceleration

For maximum performance on large datasets:

pip install "goldenmatch[native]"

This pulls goldenmatch-native, a separately distributed compiled (Rust/PyO3 abi3) runtime.

MCP Server (Claude Desktop)

pip install "goldencheck[mcp]"

Source: packages/python/goldencheck/README.md:1-50

Integrations

dbt

Add data-quality gates to dbt:

# packages.yml (dbt reads package dependencies from this file, not dbt_project.yml)
packages:
  - package: benseverndev-oss/dbt-goldensuite

GitHub Actions

- uses: benseverndev-oss/goldencheck-action@v1
  with:
    files: "data/*.csv"
    fail-on: error

Airflow

12 drop-in DAGs available at examples/airflow/.

MCP Container

Run from a single MCP container:

docker run ghcr.io/benseverndev-oss/goldensuite-mcp:latest

Source: README.md:42-60

Web UI

GoldenMatch includes an interactive web workbench:

pip install "goldenmatch[web]"
goldenmatch serve-ui <project>

Features:

Pair drilldown with cluster members
Field-level diff view
Natural language explanations per pair

Source: README.md:58-62

Latest Release

v1.24.0: on the 10M-record QIS-bucket-realistic benchmark, wall time fell from 2604s to 502s (-81%) with F1 unchanged at 0.9886 and peak RSS reduced by 18%.

Key improvements:

~15 performance PRs
Chao1 scale-aware cardinality
Heuristic rule expansion
Diagnostic harness

See CHANGELOG.md for the full PR list.

Source: community_context

Getting Help

Resource	Link
Documentation	Wiki
Examples	Python examples · TypeScript examples
Issues	GitHub Issues
Discussions	GitHub Discussions

License

MIT — see LICENSE

Source: https://github.com/benseverndev-oss/goldenmatch / Human Manual

Getting Started

Related topics: Installation, System Architecture

The Golden Suite is a polyglot data-quality and entity-resolution toolkit designed for deduplication, data standardization, and schema mapping. It consists of five core packages: GoldenCheck (data validation), GoldenFlow (transforms and standardization), GoldenMatch (record deduplication), GoldenPipe (pipeline orchestration), and InferMap (schema mapping). Source: README.md:1-10

This guide walks you through installation, basic usage patterns, and recommended starting points for common use cases.

Suite Packages Overview

Related topics: Core Matching Engine

The Golden Suite is a polyglot data-quality and entity-resolution toolkit that provides a complete pipeline from data profiling to deduplication. The suite consists of multiple purpose-built packages that can be used independently or composed together for end-to-end workflows.

Source: README.md:1-15

Package Architecture

graph TD
    A[Raw Data] --> B[GoldenCheck]
    B --> C[GoldenFlow]
    C --> D[InferMap]
    D --> E[GoldenMatch]
    E --> F[GoldenPipe]
    
    G[goldenmatch-extensions] --> E
    H[dbt-goldensuite] --> B
    I[goldencheck-types] --> B
    
    J[goldencheck-action] --> B
    K[GoldenCheck MCP] --> B
    L[Goldensuite MCP] --> F

Package Summary

Package	Languages	Purpose	Install
GoldenMatch	Python, TypeScript	Zero-config entity resolution. Fuzzy + exact + probabilistic + LLM	`pip install goldenmatch` / `npm i goldenmatch`
GoldenCheck	Python, TypeScript	Data-quality scanning: encoding, Unicode, format validation, anomaly detection	`pip install goldencheck` / `npm i goldencheck`
GoldenFlow	Python, TypeScript	Transforms & standardizers: phone, date, address, categorical normalization	`pip install goldenflow` / `npm i goldenflow`
GoldenPipe	Python, TypeScript	Orchestrator that wires Check → Flow → Match into one declarative pipeline	`pip install goldenpipe` / `npm i goldenpipe`
InferMap	Python, TypeScript	Schema mapping engine — auto-aligns columns across heterogeneous sources	`pip install infermap` / `npm i infermap`
goldenmatch-extensions	Rust	Postgres extension (pgrx) + DuckDB UDFs. SQL-native fuzzy matching	source build
dbt-goldensuite	dbt, Python	dbt package — quality-gate tests, correction CRUD macros + GoldenCheck assertions	`pip install dbt-goldensuite`
goldencheck-action	—	GitHub Action for CI with PR comments	marketplace
goldencheck-types	TypeScript	Community-contributed domain type packs (healthcare, finance, e-commerce)	`npm i goldencheck-types`

Source: README.md:80-100

GoldenCheck

GoldenCheck is the data-quality scanning and profiling component of the Golden Suite. It detects issues such as encoding problems, Unicode anomalies, format violations, and data anomalies.

Core Capabilities

Encoding Detection: Identifies and reports encoding issues in text fields
Unicode Validation: Detects malformed Unicode sequences and normalization issues
Format Validation: Validates against expected formats (email, phone, URL, etc.)
Anomaly Detection: Statistical analysis to identify outliers and unusual patterns
LLM-Powered Analysis: Uses language models to identify semantic data quality issues missed by automated profilers

Source: packages/python/goldencheck/README.md:1-50

Domain Type Packs

GoldenCheck supports domain-specific type definitions through community-contributed packs:

Domain	Types Included
Healthcare	NPI, ICD codes, insurance IDs, patient demographics, CPT, DRG
Finance	Account numbers, routing numbers, CUSIP/ISIN, currency, transactions
E-commerce	SKUs, order IDs, tracking numbers, categories, shipping

Source: packages/typescript/goldencheck-types/README.md:1-30

MCP Server

GoldenCheck includes an MCP (Model Context Protocol) server providing 10 agent-level tools, including:

analyze_data — Domain detection and strategy recommendation
auto_triage — Automated issue classification
explain_finding — Natural language explanation of findings
explain_column — Column-level analysis
compare_domains — Cross-domain comparison
generate_handoff — Pipeline handoff generation

Source: packages/typescript/goldencheck/src/node/mcp/agent-tools.ts:1-50

GoldenFlow

GoldenFlow provides 76+ transformation functions for standardizing messy data fields. It focuses on transforming data before matching to improve deduplication accuracy.

Source: packages/python/goldenflow/README.md:1-50

Transform Categories

#### Text Transforms (18)

Transform	Description
`strip`	Trim whitespace
`lowercase` / `uppercase`	Case conversion
`title_case`	Proper casing ("john smith" → "John Smith")
`normalize_unicode`	NFKD normalization, strip accents
`normalize_quotes`	Smart/curly quotes → straight quotes
`collapse_whitespace`	Multiple spaces → single space
`remove_punctuation`	Strip punctuation characters
`remove_html_tags`	Strip HTML markup from scraped data
`fix_mojibake`	Fix common UTF-8/Latin-1 encoding garbling
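
Several of the text transforms above are simple enough to sketch with the standard library. The functions below borrow the transform names from the table for readability, but they are standalone approximations, not GoldenFlow's code; in particular, `clean` and the order in which it chains the steps are assumptions.

```python
import re
import unicodedata


def normalize_unicode(text: str) -> str:
    """NFKD-decompose, then drop combining marks (this strips accents)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_quotes(text: str) -> str:
    """Map curly single and double quotes to straight ASCII quotes."""
    table = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})
    return text.translate(table)


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace with a single space."""
    return re.sub(r"\s+", " ", text)


def clean(text: str) -> str:
    """Hypothetical chain: unicode, quotes, whitespace, strip, lowercase."""
    text = collapse_whitespace(normalize_quotes(normalize_unicode(text)))
    return text.strip().lower()


if __name__ == "__main__":
    print(clean("  Jos\u00e9   \u201cPepe\u201d  Garc\u00eda "))
```

Running transforms like these before matching means that values differing only in accents, quote style, or spacing compare as equal.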

#### Phone Transforms (5)

Transform	Description
`phone_e164`	Any format → +15550123456
`phone_national`	Any format → (555) 012-3456

#### Date Transforms

Date parsing and normalization across multiple formats
Timezone normalization

#### Domain-Specific Transforms

Domain	Capabilities
Healthcare	NPI, ICD, CPT, DRG parsing, transaction dates, amount parsing
E-commerce	SKU normalization, price parsing, order dates, address standardization
Real Estate	Property addresses, listing dates, price normalization, geo fields

Source: packages/python/goldenflow/README.md:50-120

GoldenMatch

GoldenMatch is the core entity resolution (deduplication) engine. It supports multiple matching strategies and scales to millions of records.

Source: README.md:80-85

Matching Strategies

Strategy	Use Case
Fuzzy Matching	Name/address variants with typo tolerance
Exact Matching	Identifier deduplication
Probabilistic Matching	Record linkage with confidence scores
LLM Clustering	Semantic product matching for complex domains
PPRL	Privacy-preserving record linkage (cross-organization)

Performance Benchmarks

Dataset Size	Wall Time	Peak RSS	F1 Score
10M records (QIS-bucket-realistic)	502s (-81%)	18% reduction	0.9886

Source: v1.24.0 Release Notes

Native Acceleration

An optional Rust-based acceleration runtime is available:

pip install "goldenmatch[native]"

This pulls goldenmatch-native, a separately distributed compiled (Rust/PyO3 abi3) runtime. The native runtime is discovered automatically when installed.

Source: v1.21.0 Release Notes

Configuration Options

Option	Description
`backend`	`"bucket"` (recommended for 5M+ records), `"chunked"`
`auto_config`	Auto-tune recall thresholds
`lineage_provenance`	Track source row for golden record fields

InferMap

InferMap is a schema mapping engine that automatically aligns columns across heterogeneous data sources. It supports both Python and TypeScript with full API parity.

Source: packages/python/infermap/README.md:1-40

Key Features

Auto Schema Alignment: Detects and maps columns across different source schemas
Custom Scorers: Configurable similarity scoring algorithms
Domain Dictionaries: Industry-specific vocabulary for better matching
Calibration Tools: Score matrix introspection and tuning
Edge Runtime Support: Works in Vercel Edge Runtime and Next.js

Source: packages/python/infermap/README.md:40-80

GoldenPipe

GoldenPipe is the orchestrator that wires Check → Flow → Match into a single declarative pipeline. It enables pipeline definitions in YAML or Python.

Source: README.md:85-90

Pipeline Stages

graph LR
    A[Check] --> B[Flow]
    B --> C[Match]
    C --> D[Output]
    
    style A fill:#e1f5fe
    style B fill:#fff3e0
    style C fill:#e8f5e9
    style D fill:#f3e5f5

Deployment Options

Runtime	Description
Airflow	12 drop-in DAGs for daily/incremental/warehouse-native dedupe
MCP Container	Single container: `docker run ghcr.io/benseverndev-oss/goldensuite-mcp:latest`
Python/CLI	Direct script execution

Source: README.md:90-110

goldenmatch-extensions

A Rust-based Postgres extension using pgrx that provides SQL-native fuzzy matching capabilities.

Source: packages/rust/extensions/README.md:1-30

Capabilities

13 pgrx Functions: Core API parity with Python/TypeScript
goldenflow_* Transforms: SQL-level data standardization
memory_learn/memory_stats CRUD: Learning memory database operations
DuckDB UDFs: Fuzzy matching in DuckDB queries

Source: goldenmatch_pg v0.5.0 Release

dbt-goldensuite

A dbt package that adds Golden Suite capabilities as dbt tests and macros.

Features

Quality-gate tests for warehouse models
Correction CRUD macros for data repair
GoldenCheck assertions integrated with dbt test framework

Source: packages/python/goldenmatch/dbt-goldensuite/README.md:1-20

goldencheck-action

GitHub Action for CI integration that:

Runs GoldenCheck scans on PR data changes
Posts PR comments with findings
Fails builds based on configurable severity thresholds

Source: packages/actions/goldencheck/README.md:1-30

Examples and Quick Start

The repository includes comprehensive examples organized by deployment target:

Source: examples/README.md:1-30

Directory	Audience	Highlights
`python/`	Python users	6 scripts: zero-config quickstart, full Suite composed, customer 360, PPRL
`typescript/`	TypeScript/edge users	4 scripts: quickstart, Vercel-Edge, MCP client
`sql/`	SQL/warehouse users	DuckDB + Postgres core-API examples
`airflow/`	Data-platform users	12 drop-in DAGs

Quick Start Commands

# Headline package: dedupe a CSV
pip install goldenmatch && goldenmatch dedupe customers.csv

# TypeScript / Edge
npm install goldenmatch

# Full suite via MCP
docker run ghcr.io/benseverndev-oss/goldensuite-mcp:latest

Source: README.md:110-130

Installation

The Golden Suite is a polyglot data-quality and entity-resolution toolkit available across Python, TypeScript/Node.js, Rust, and SQL environments. This page documents installation methods for all suite packages, optional dependencies, system requirements, and deployment patterns.

System Requirements

Component	Minimum Version	Recommended
Python	3.11+	3.11, 3.12
Node.js	20+	22 LTS
Rust	1.75+ (for extensions)	Latest stable
PostgreSQL	15+ (for extensions)	16+
DuckDB	1.0+ (for extensions)	Latest

Source: README.md:1-15

Quick Start

# Entity resolution (Python)
pip install goldenmatch

# TypeScript / Edge runtimes
npm install goldenmatch

Source: README.md:55-60

Python Installation

All Python packages are published to PyPI and support pip installation with optional extras.

Core Packages

Package	Purpose	Install Command
goldenmatch	Zero-config entity resolution (fuzzy + exact + probabilistic + LLM)	`pip install goldenmatch`
goldencheck	Data-quality scanning and validation	`pip install goldencheck`
goldenflow	Data transforms and standardization	`pip install goldenflow`
goldenpipe	Pipeline orchestrator	`pip install goldenpipe`
infermap	Schema mapping engine	`pip install infermap`

Source: packages/python/goldenmatch/README.md, packages/python/goldencheck/README.md, packages/python/goldenflow/README.md, packages/python/goldenpipe/README.md, packages/python/infermap/README.md

GoldenMatch Optional Dependencies

GoldenMatch supports modular installation through extras:

# Basic installation
pip install goldenmatch

# With native acceleration (Rust/PyO3 abi3 runtime)
pip install "goldenmatch[native]"

# With LLM support (Anthropic SDK)
pip install "goldenmatch[llm]"

# With baseline profiling support
pip install "goldenmatch[baseline]"

# With semantic type inference
pip install "goldenmatch[semantic]"

# With web UI
pip install "goldenmatch[web]"

# With MCP server
pip install "goldenmatch[mcp]"

# Full installation with all extras
pip install "goldenmatch[native,llm,baseline,semantic,web,mcp]"

Source: packages/python/goldenmatch/README.md, README.md:50-55

GoldenCheck Optional Dependencies

# Basic installation
pip install goldencheck

# With LLM enhancement
pip install "goldencheck[llm]"

# With MCP server for Claude Desktop
pip install "goldencheck[mcp]"

# With all extras
pip install "goldencheck[llm,mcp]"

Source: packages/python/goldencheck/README.md

TypeScript / Node.js Installation

All TypeScript packages are published to npm with zero runtime dependencies for the core packages (edge-safe).

# Core packages
npm install goldenmatch
npm install goldencheck
npm install goldenflow
npm install goldenpipe
npm install infermap

# MCP server (Node.js only)
npm install @benseverndev-oss/goldensuite-mcp

Source: packages/typescript/goldenmatch/README.md, packages/typescript/goldencheck/README.md

Peer Dependencies

Some TypeScript examples require optional peer dependencies:

Package	Purpose	Install Command
`yaml`	YAML configuration parsing	`npm install yaml`
`nodejs-polars`	Parquet reading (Node.js only)	auto-installed when needed
`csv-parse`	CSV reading (Node.js only)	auto-installed when needed
`@modelcontextprotocol/sdk`	MCP server (Node.js only)	auto-installed when needed

Source: packages/typescript/goldenmatch/examples/README.md

Docker Installation

For a self-contained MCP server deployment, use the official container image:

# Pull (if not cached) and run the latest MCP server image
docker run ghcr.io/benseverndev-oss/goldensuite-mcp:latest

Source: README.md:40

GoldenMatch Native Acceleration

As of v1.21.0, GoldenMatch offers an optional compiled Rust runtime via the goldenmatch-native package. This is separately distributed and uses PyO3 abi3 for compatibility.

graph TD
    A[User: pip install goldenmatch] --> B{Has native extra?}
    B -->|No| C[Pure-Python wheel]
    B -->|Yes| D[pip install goldenmatch-native]
    D --> E[Rust/PyO3 abi3 runtime]
    E --> F[Auto-discovery at runtime]
    C --> G[Standard Polars backend]
    F --> H[Optimized clustering + scoring kernels]
    G --> H

Source: v1.21.0 Release Notes, README.md

Installation Commands for Native

# Option 1: Install with native extra (recommended)
pip install "goldenmatch[native]"

# Option 2: Install separately
pip install goldenmatch
pip install goldenmatch-native

Source: v1.21.0 Release Notes

Rust Extensions (PostgreSQL / DuckDB)

For SQL-native fuzzy matching, install the Rust extension package.

PostgreSQL Extension

# Clone and build from source
git clone https://github.com/benseverndev-oss/goldenmatch.git
cd goldenmatch/packages/rust/extensions
cargo build --release

# Then, from a psql session connected to the target database:
CREATE EXTENSION goldenmatch_pg;

Source: packages/rust/extensions/README.md

DuckDB UDFs

The same Rust package provides DuckDB UDFs for in-database matching:

# Install via source build
cargo build --release
# UDFs are loaded via SQL commands in DuckDB

Source: packages/rust/extensions/README.md

MCP Server Setup

The Golden Suite includes an MCP (Model Context Protocol) server for Claude Desktop integration.

Python Installation

pip install "goldencheck[mcp]"
# or for full suite
pip install "goldenmatch[mcp]"

Source: packages/python/goldencheck/README.md

Claude Desktop Configuration

Add to your Claude Desktop config (claude_desktop_config.json):

{
  "mcpServers": {
    "goldencheck": {
      "command": "python",
      "args": ["-m", "goldencheck.mcp.server"]
    }
  }
}

Source: packages/python/goldencheck/README.md

Docker Deployment

For production MCP deployments:

docker run ghcr.io/benseverndev-oss/goldensuite-mcp:latest

Source: README.md:40, packages/python/goldensuite-mcp/README.md

dbt Integration

Install the dbt package for data warehouse quality gates:

pip install dbt-goldensuite

Source: README.md:35

GitHub Actions

For CI/CD integration:

pip install goldencheck
# or use the action directly in workflows

Source: README.md:35

Airflow DAGs

Run the Golden Suite as Airflow DAGs:

# Install Airflow
pip install apache-airflow

# Use pre-built DAGs from examples/
cp -r examples/airflow/ /path/to/airflow/dags/

Source: examples/README.md

Verification

Verify installation with the following commands:

# Python packages
python -c "import goldenmatch; print(goldenmatch.__version__)"
python -c "import goldencheck; print(goldencheck.__version__)"
python -c "import goldenflow; print(goldenflow.__version__)"

# TypeScript packages
node -e "console.log(require('goldenmatch/package.json').version)"

# CLI tools
goldenmatch --version
goldencheck --version
goldenflow --version

Common Installation Issues

Python Version Mismatch

GoldenMatch requires Python 3.11+. Check your version:

python --version

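If your default interpreter is older, create a virtual environment from a Python 3.11+ interpreter (the `python3.11` command below assumes one is installed):

python3.11 -m venv golden-env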
source golden-env/bin/activate  # Linux/macOS
# or
golden-env\Scripts\activate  # Windows
pip install goldenmatch

Native Extension Not Found

If goldenmatch-native isn't auto-discovered:

# Reinstall with native extra
pip uninstall goldenmatch-native
pip install "goldenmatch[native]"

MCP Server Connection Issues

For Claude Desktop MCP integration, ensure the config is in the correct location:

Linux: ~/.config/Claude/claude_desktop_config.json
macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json
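
A quick way to rule out a misplaced or incomplete config is to resolve the path for the current platform and check for the server entry. The sketch below uses the paths listed above and the `goldencheck` entry from the Claude Desktop Configuration section; the helper names are hypothetical, and the script only reads the file, it never writes to it.

```python
import json
import os
import sys
from pathlib import Path

CONFIG_NAME = "claude_desktop_config.json"


def claude_config_path() -> Path:
    """Resolve the Claude Desktop config path for the current platform."""
    if sys.platform == "darwin":
        return Path.home() / "Library" / "Application Support" / "Claude" / CONFIG_NAME
    if sys.platform.startswith("win"):
        return Path(os.environ.get("APPDATA", "")) / "Claude" / CONFIG_NAME
    return Path.home() / ".config" / "Claude" / CONFIG_NAME


def with_goldencheck_server(config: dict) -> dict:
    """Return a copy of config with the goldencheck MCP entry added."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["goldencheck"] = {"command": "python", "args": ["-m", "goldencheck.mcp.server"]}
    merged["mcpServers"] = servers
    return merged


if __name__ == "__main__":
    path = claude_config_path()
    current = json.loads(path.read_text()) if path.exists() else {}
    print("config path:", path)
    print("file exists:", path.exists())
    print("goldencheck entry present:", "goldencheck" in current.get("mcpServers", {}))
    print("expected config:")
    print(json.dumps(with_goldencheck_server(current), indent=2))
```

Because `with_goldencheck_server` copies rather than mutates, any other MCP servers already in the file are preserved in the printed result.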

Installation Hierarchy

graph TB
    subgraph "Full Stack Installation"
        A[Golden Suite MCP Container] --> B[GoldenMatch + Native]
        A --> C[GoldenCheck + MCP]
        A --> D[GoldenFlow]
        A --> E[GoldenPipe]
        A --> F[InferMap]
    end
    
    subgraph "Python-Only Stack"
        G[goldenmatch] --> H[Polars]
        G --> I[Polars-Runtime]
        G --> J[goldenmatch-native optional]
    end
    
    subgraph "TypeScript-Only Stack"
        K[goldenmatch] --> L[Zero deps]
        K --> M[nodejs-polars optional]
    end

System Architecture

Related topics: Backend Systems, Core Matching Engine, Blocking and Scoring

The Golden Suite is a polyglot data-quality and entity-resolution toolkit designed for AI-native workflows. The architecture follows a modular, pipeline-oriented design where each component addresses a specific stage of data processing: profiling, standardization, schema mapping, and deduplication. Source: packages/python/goldenmatch/README.md

Architecture Overview

The system consists of five core packages, each published for both Python and TypeScript, plus a Rust extension layer. Each package is independently installable and can operate standalone or as part of an orchestrated pipeline.

Package Overview

Package	Language	Purpose	Install Command
GoldenMatch	Python · TS	Zero-config entity resolution (fuzzy + exact + probabilistic + LLM)	`pip install goldenmatch` · `npm i goldenmatch`
GoldenCheck	Python · TS	Data-quality scanning: encoding, Unicode, format validation, anomaly detection	`pip install goldencheck` · `npm i goldencheck`
GoldenFlow	Python · TS	Transforms & standardizers: phone, date, address, categorical normalization	`pip install goldenflow` · `npm i goldenflow`
GoldenPipe	Python · TS	Orchestrator wiring Check → Flow → Match into declarative pipelines	`pip install goldenpipe` · `npm i goldenpipe`
InferMap	Python · TS	Schema mapping engine — auto-aligns columns across heterogeneous sources	`pip install infermap` · `npm i infermap`
goldenmatch-extensions	Rust	Postgres extension (pgrx) + DuckDB UDFs for SQL-native fuzzy matching	source build
dbt-goldensuite	dbt · Python	dbt package with quality-gate tests and correction CRUD macros	`pip install dbt-goldensuite`
goldencheck-action	Action	GitHub Action for CI with PR comments	via GitHub Marketplace

Source: packages/python/goldenmatch/README.md

Core Data Flow

The canonical pipeline processes data through four stages:

graph TD
    A[Raw Data CSV/Parquet] --> B[GoldenCheck<br/>Profile & Validate]
    B --> C[GoldenFlow<br/>Standardize & Transform]
    C --> D[InferMap<br/>Schema Mapping]
    D --> E[GoldenMatch<br/>Deduplicate & Merge]
    E --> F[Golden Records<br/>with Provenance]
    
    B -->|Findings Report| G[Data Quality Score]
    E -->|Cluster Decisions| H[Memory Learning]
    H -->|Auto-config| E

Pipeline Stage Details

Stage	Input	Output	Key Capabilities
Profile	Raw CSV/Parquet	Schema, statistics, health score	Encoding detection, null analysis, cardinality profiling
Standardize	Messy fields	Normalized fields	Phone E.164, date parsing, address standardization, unicode normalization
Map	Heterogeneous schemas	Column alignments	Domain dictionaries, custom scorers, confidence scoring
Match	Canonical records	Duplicate clusters	Fuzzy matching, blocking, probabilistic scoring, LLM clustering

Source: packages/python/goldenflow/README.md

GoldenMatch Architecture

GoldenMatch is the headline package, providing entity resolution (ER) capabilities. The v1.24.0 release achieved 81% wall-clock reduction (2604s → 502s) on 10M record datasets through the QIS-bucket-realistic optimization path. Source: Community Context - v1.24.0

Match Pipeline Components

graph LR
    A[Input Records] --> B[Blocking<br/>Key Generation]
    B --> C[Candidate Pair<br/>Generation]
    C --> D[Vectorized<br/>Scoring Kernels]
    D --> E[Probabilistic<br/>Classifier]
    E --> F[Cluster<br/>Formation]
    F --> G[Golden Record<br/>Survivorship]
    G --> H[Output with<br/>Provenance]
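
The first stages of the diagram (blocking, candidate pairs, scoring, cluster formation) can be sketched in a few dozen lines. This is a toy illustration under stated assumptions, not GoldenMatch's engine: the blocking key (first letter of the name plus ZIP code) is hypothetical, `difflib` stands in for the vectorized scoring kernels, a fixed 0.8 threshold stands in for the probabilistic classifier, and survivorship is omitted.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(record: dict) -> str:
    """Hypothetical blocking key: first letter of the name plus the ZIP code."""
    return f"{record['name'][:1].lower()}|{record['zip']}"


def candidate_pairs(records: list[dict]) -> list[tuple[int, int]]:
    """Only records that share a blocking key are ever compared."""
    blocks: dict[str, list[int]] = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[blocking_key(rec)].append(idx)
    pairs: list[tuple[int, int]] = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


def score(a: dict, b: dict) -> float:
    """Name similarity in [0, 1]; a stand-in for real field-level scoring."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def cluster(records: list[dict], threshold: float = 0.8) -> list[list[int]]:
    """Union-find over pairs scoring at or above the threshold."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in candidate_pairs(records):
        if score(records[i], records[j]) >= threshold:
            parent[find(i)] = find(j)

    groups: dict[int, list[int]] = defaultdict(list)
    for i in range(len(records)):
        groups[find(i)].append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    rows = [
        {"name": "John Smith", "zip": "10001"},
        {"name": "Jon Smith", "zip": "10001"},
        {"name": "Jane Doe", "zip": "10001"},
        {"name": "John Smith", "zip": "94105"},
    ]
    print(cluster(rows))
```

In the sample data the last row has the same name as the first but a different ZIP code, so the two are never compared. That is the trade-off blocking makes: far fewer comparisons in exchange for possibly missing matches that fall into different blocks.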

Backend Modes

GoldenMatch supports multiple execution backends optimized for different dataset scales:

Backend	Use Case	Records	Performance
`chunked`	Development / small datasets	< 1M	Single-threaded baseline
`bucket`	Recommended 5M-on-one-node config	1-10M	5x wall reduction, 2x RSS reduction
`native`	Maximum performance	Any	Rust/PyO3 abi3 compiled kernels

The bucket backend became the recommended path in v1.16.0, processing 5M records in 9.94 minutes with 6.4 GB peak RSS on a single 16-core node. Source: packages/python/goldenmatch/README.md

Native Acceleration Layer

The optional native runtime (pip install "goldenmatch[native]") ships the compiled _native kernel as a separately distributed wheel. The runtime is discovered automatically at import time:

import goldenmatch
# Automatically detects and uses native runtime if available
result = goldenmatch.dedupe("customers.csv")

Source: Community Context - v1.21.0, goldenmatch-native v0.1.0
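
Automatic discovery of an optional compiled module usually follows a probe-and-fall-back pattern. The sketch below shows that general pattern only; the importable module name `goldenmatch_native` is an assumption (the source names the distribution `goldenmatch-native` and a `_native` kernel, not the import path), and `pick_kernel` is a hypothetical helper.

```python
import importlib.util


def native_available(module_name: str = "goldenmatch_native") -> bool:
    """Return True when a top-level module with this name is importable.

    The default module name is an assumption made for illustration.
    """
    return importlib.util.find_spec(module_name) is not None


def pick_kernel(module_name: str = "goldenmatch_native") -> str:
    """Prefer the compiled kernel when present, else fall back to pure Python."""
    return "native" if native_available(module_name) else "pure-python"


if __name__ == "__main__":
    print("kernel in use:", pick_kernel())
```

Probing with `find_spec` avoids importing the module just to test for it, so the check stays cheap when the native wheel is absent.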

GoldenCheck Architecture

GoldenCheck provides data validation that discovers rules from your data. The TypeScript implementation follows a scanner-reporter pattern with an LLM enhancement layer. Source: packages/python/goldencheck/README.md

Core Engine Components

Component	File Location	Purpose
Scanner	`src/core/engine/scanner.ts`	Analyzes data files, profiles columns
Engine	`src/core/engine/`	Executes discovered quality rules
Confidence	`src/core/engine/confidence.ts`	Applies severity downgrades based on findings
Triage	`src/core/engine/triage.ts`	Auto-categorizes findings by priority
Fixer	`src/core/engine/fixer.ts`	Applies automated corrections
Agent	`src/core/agent/`	Strategy selection and explanation

Source: packages/typescript/goldencheck/src/node/mcp/agent-tools.ts

LLM Integration Layer

The TypeScript implementation includes a comprehensive LLM interface for semantic analysis:

interface LLMResponse {
  columns: Record<string, LLMColumnAssessment>;
  relations: LLMRelation[];
}

interface LLMColumnAssessment {
  semantic_type: string | null;
  issues: LLMIssue[];
  upgrades: LLMUpgrade[];
  downgrades: LLMDowngrade[];
}

Source: packages/typescript/goldencheck/src/core/llm/prompts.ts

Semantic Type System

GoldenCheck ships with a bundled base type system defined as TypeScript constants (no runtime YAML dependency):

export const BASE_TYPES: Readonly<Record<string, TypeDef>> = {
  identifier: {
    nameHints: ["id", "key", "pk", "code", "sku", "number", "num", "record"],
    valueSignals: { min_unique_pct: 0.95 },
    suppress: ["cardinality", "pattern_consistency", "drift_detection"],
  },
  person_name: {
    nameHints: ["first_name", "last_name", "full_name", ...],
    valueSignals: { mixed_case: true },
    suppress: ["pattern_consistency", "cardinality"],
  },
  email: {
    nameHints: ["email", "mail", "e_mail"],
    valueSignals: { format_match: "email", min_match_pct: 0.70 },
    suppress: ["pattern_consistency"],
  },
  phone: {
    nameHints: ["phone", "tel", "fax", "mobile", "cell"],
    valueSignals: { format_match: "phone", min_match_pct: 0.70 },
    suppress: ["type_inference", "pattern_consistency"],
  },
  address: {
    nameHints: ["address", "street", "addr", "line1", "line2"],
    valueSignals: { avg_length_min: 15 },
    suppress: ["pattern_consistency", "cardinality"],
  },
  free_text: {
    nameHints: ["notes", "comments", "description", ...],
    // ...
  },
};

Source: packages/typescript/goldencheck/src/core/semantic/types.ts

Reporter System

The JSON reporter produces machine-readable output matching the spec schema:

interface ReportOutput {
  file: string;
  rows: number;
  columns: number;
  health_grade: string;
  health_score: number;
  summary: { errors: number; warnings: number; info: number };
  findings: Array<{
    severity: string;
    column: string;
    check: string;
    message: string;
    affected_rows: number;
    sample_values: string[];
  }>;
}

Source: packages/typescript/goldencheck/src/core/reporters/json.ts

GoldenFlow Transform Architecture

GoldenFlow provides 76 transforms organized into semantic categories for data standardization. Source: packages/python/goldenflow/README.md

Transform Categories

Category	Count	Examples
Text Transforms	18	`strip`, `lowercase`, `uppercase`, `normalize_unicode`, `remove_html_tags`
Phone Transforms	5	`phone_e164`, `phone_national`, `phone_format`
Date Transforms	8	`date_parse`, `date_format`, `fuzzy_date`
Numeric Transforms	6	`parse_currency`, `extract_numbers`, `round_precision`
Address Transforms	7	`standardize_address`, `parse_components`
Categorical Transforms	4	`normalize_category`, `fuzzy_category`

Domain-Specific Transforms

Domain	Transforms
Healthcare	NPI normalization, ICD code parsing, insurance ID formatting, CPT/DRG parsing
Finance	Account number formatting, routing number validation, CUSIP/ISIN parsing, currency normalization
E-commerce	SKU normalization, price parsing, order date standardization, address standardization
Real Estate	Property address parsing, listing date normalization, price standardization, geo field extraction

Source: packages/python/goldenflow/README.md

InferMap Architecture

InferMap is an inference-driven schema mapping engine that automatically aligns columns across heterogeneous data sources. Source: packages/python/infermap/README.md

MCP Server Tools

The TypeScript implementation exposes a comprehensive MCP tool interface:

Tool	Purpose
`inspect`	Analyze schema of a data source
`suggest-mappings`	Propose column alignments with confidence scores
`apply`	Generate remapped output file
`compare-schemas`	Side-by-side schema comparison
`domain-mapping`	Domain-specific mapping using dictionaries

Source: packages/typescript/infermap/src/node/mcp/server.ts

Scorer Architecture

InferMap supports custom scorers for mapping decisions:

graph TD
    A[Source Column] --> B[Name Similarity]
    A --> C[Type Compatibility]
    A --> D[Value Distribution]
    A --> E[Domain Dictionary]
    B --> F[Composite Score]
    C --> F
    D --> F
    E --> F
    F --> G[Mapping Confidence]

Source: packages/python/infermap/README.md
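
To make the diagram concrete, the composite score can be read as a weighted average of the four signals. This is only a sketch: the `SignalScores` type, the `composite_score` helper, and the weights are assumptions chosen for illustration, since the source does not state how InferMap combines its scorers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignalScores:
    """Per-signal scores in [0, 1] for one source/target column pair."""
    name_similarity: float
    type_compatibility: float
    value_distribution: float
    domain_dictionary: float


# Illustrative weights only; the real scorer weights are not given in the source.
WEIGHTS = {
    "name_similarity": 0.4,
    "type_compatibility": 0.2,
    "value_distribution": 0.2,
    "domain_dictionary": 0.2,
}


def composite_score(scores: SignalScores) -> float:
    """Weighted average of the four signals, normalised by the total weight."""
    total = sum(WEIGHTS.values())
    weighted = sum(getattr(scores, name) * w for name, w in WEIGHTS.items())
    return weighted / total
```

Normalising by the total weight keeps the result in [0, 1] even if the weights are later retuned and no longer sum to one.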

GoldenPipe Orchestration Layer

GoldenPipe wires Check → Flow → Match into declarative pipelines with zero boilerplate. Source: packages/python/goldenmatch/README.md

Pipeline Declaration

from goldenpipe import Pipeline

pipeline = (
    Pipeline("customer-dedup")
    .check("raw_data.csv")
    .flow(standardization_rules)
    .map(source_schema, target_schema)
    .match(blocking_rules, match_config)
    .output("golden_records.csv")
)
pipeline.run()

Integration Points

Integration	Package	Use Case
Airflow DAGs	`examples/airflow/`	12 drop-in DAG templates
dbt	`dbt-goldensuite`	Quality-gate tests for warehouse models
GitHub Actions	`goldencheck-action`	PR-level data validation
MCP Server	`goldensuite-mcp`	Single-container MCP deployment

Source: packages/python/goldenmatch/README.md

MCP Agent Tools Architecture

The GoldenCheck MCP server exposes 10 agent-level tools for Claude Desktop integration; the excerpt below lists eight of them. Source: packages/typescript/goldencheck/src/node/mcp/agent-tools.ts

export const AGENT_TOOLS: readonly Tool[] = [
  { name: "analyze_data", description: "Analyze data file to detect domain and recommend strategy" },
  { name: "explain_finding", description: "Explain a specific finding in natural language" },
  { name: "explain_column", description: "Explain column quality assessment" },
  { name: "auto_triage", description: "Auto-categorize findings by priority" },
  { name: "apply_fixes", description: "Apply automated corrections" },
  { name: "compare_domains", description: "Compare data against known domain schemas" },
  { name: "generate_handoff", description: "Generate pipeline handoff documentation" },
  { name: "build_review_queue", description: "Build prioritized review queue" },
];

Agent Tool Execution Flow

graph TD
    A[Claude Desktop] -->|MCP Protocol| B[Agent Tools Layer]
    B --> C{Command Router}
    C -->|analyze_data| D[Scanner Engine]
    C -->|explain_*| E[Agent Explanation]
    C -->|apply_fixes| F[Fixer Engine]
    C -->|auto_triage| G[Triage Engine]
    D --> H[Findings Report]
    E --> I[Natural Language Output]
    F --> J[Corrected Data]
    G --> K[Prioritized Queue]

Source: packages/typescript/goldencheck/src/node/mcp/agent-tools.ts

Memory Learning System

The Learning Memory system in GoldenMatch consists of three tuners that consume user feedback to auto-configure thresholds: Source: Community Context - v1.20.0

Tuner	Level	Input	Output
MemoryLearner	Pair-level	Approve/reject pair decisions	Per-field auto-approve threshold
Field-Strategy Tuner	Field-level	Field match quality feedback	Per-field match strategy selection
Cluster-Decision Tuner	Cluster-level	Cluster approve/reject decisions	Per-dataset auto-approve threshold

Auto-Configuration Flow

graph LR
    A[User Feedback] --> B{Memory System}
    B --> C[Pair-Level Learning]
    B --> D[Field-Level Learning]
    B --> E[Cluster-Level Learning]
    C --> F[Auto-Config Commit]
    D --> F
    E --> F
    F --> G[Threshold Proposals]
    G --> H[Match Pipeline]

Zero-Label Confidence (v1.23.0)

As of v1.23.0, auto-config commits on zero-label confidence by default. Among same-health-rank candidates carrying a zero_label profile, the controller's pick_committed tiebreaker prefers the candidate with the higher overall_confidence rather than the one with the higher mass_separation. Source: Community Context - v1.23.0

Golden Record Provenance

The v1.22.0 release added field-level golden-record provenance tracking: Source: Community Context - v1.22.0

Provenance Data Model

# When provenance=True is enabled:
result = build_golden_records_batch(records, provenance=True)

# Each field dict includes:
{
    "value": "John Smith",
    "source_row_id": "__row_id__ of winning record",
    "survivorship_winner": True
}

Lineage Configuration

Config Option	Type	Default	Description
`config.output.lineage_provenance`	bool	`False`	Enable field-level source tracking

Source: Community Context - v1.22.0
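
The provenance shape above can be illustrated with a small survivorship sketch. The rule used here (most frequent non-empty value wins, ties go to the value seen first) and the helper name `build_golden_record` are assumptions for illustration; the source does not describe the survivorship rules that `build_golden_records_batch` actually applies.

```python
from collections import Counter
from typing import Any


def build_golden_record(members: list[dict[str, Any]], fields: list[str]) -> dict[str, dict[str, Any]]:
    """Pick one winning value per field and record which row supplied it."""
    golden: dict[str, dict[str, Any]] = {}
    for field in fields:
        # Keep (row id, value) for every member with a non-empty value.
        candidates = [
            (m["__row_id__"], m.get(field))
            for m in members
            if m.get(field) not in (None, "")
        ]
        if not candidates:
            golden[field] = {"value": None, "source_row_id": None, "survivorship_winner": False}
            continue
        counts = Counter(value for _, value in candidates)
        winner = max(counts, key=counts.get)  # max keeps the first key on ties
        source_row = next(row_id for row_id, value in candidates if value == winner)
        golden[field] = {"value": winner, "source_row_id": source_row, "survivorship_winner": True}
    return golden
```

Recording `source_row_id` alongside the value is what lets a reviewer trace any golden field back to the input row it came from.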

TypeScript Package Structure

The TypeScript packages follow a consistent structure optimized for edge runtimes:

packages/typescript/
├── goldencheck/
│   ├── src/
│   │   ├── core/
│   │   │   ├── engine/       # Scanner, fixer, triage, confidence
│   │   │   ├── agent/       # Strategy selection, explanation
│   │   │   ├── llm/         # LLM interface, prompts
│   │   │   ├── reporters/   # JSON, HTML output formatters
│   │   │   ├── semantic/    # Type definitions, domain matching
│   │   │   └── types.ts     # Core type definitions
│   │   └── node/
│   │       └── mcp/         # MCP server, agent tools
│   └── domains/             # Bundled domain packs (YAML)
├── goldencheck-types/
│   ├── domains/             # Community-contributed domain packs
│   └── src/                 # Type definitions
└── infermap/
    ├── src/
    │   ├── core/            # Mapping engine
    │   └── node/
    │       └── mcp/         # MCP server tools
    └── domains/             # Domain dictionaries

Zero Runtime Dependencies

The core TypeScript packages have no runtime dependencies (edge-safe):

Package	Dependencies
`goldencheck` core	None (pure TypeScript)
`goldencheck-types`	`js-yaml` (dev: `tsup`, `vitest`, `typescript`)
`infermap` core	None (pure TypeScript)

Optional peer dependencies:

nodejs-polars — Parquet reading (Node.js only)
csv-parse — CSV reading (Node.js only)
@modelcontextprotocol/sdk — MCP server (Node.js only)

Source: packages/typescript/goldencheck-types/package.json

Cross-Language Record Fingerprint

The v1.21.0 release introduced cross-language record fingerprinting enabling consistent record identification across Python, TypeScript, and Rust execution environments. Source: Community Context - v1.21.0
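
One common way to get a fingerprint that agrees across runtimes is to hash a canonical serialization of the record. The sketch below assumes canonical JSON plus SHA-256 and a hypothetical helper `record_fingerprint`; the actual GoldenMatch fingerprint algorithm is not described in the source.

```python
import hashlib
import json
import unicodedata


def record_fingerprint(record: dict[str, object]) -> str:
    """Hash a record so that key order and Unicode form do not change the result."""
    canonical = {
        unicodedata.normalize("NFC", str(key)): (
            unicodedata.normalize("NFC", value) if isinstance(value, str) else value
        )
        for key, value in record.items()
    }
    # Sorted keys, fixed separators, and ASCII escaping give a byte-stable encoding.
    payload = json.dumps(canonical, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

A real cross-language scheme also has to pin down number formatting, since floats do not always serialize identically in Python, JavaScript, and Rust; this sketch leaves that detail open.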

Rust Extensions Layer

The goldenmatch-extensions package provides SQL-native fuzzy matching through Postgres UDFs (pgrx) and DuckDB UDFs: Source: Community Context - v1.24.0

Extension Capabilities (v0.5.0)

Capability	Postgres Functions	DuckDB Functions
Core API parity	13 pgrx functions	Multiple UDFs
GoldenFlow transforms	`goldenflow_*` transforms	Equivalent UDFs
Memory learning	`memory_learn` CRUD	`memory_stats` CRUD

Performance Characteristics

Scaling Benchmarks (v1.24.0)

Dataset Size	Wall Clock	Memory (RSS)	Configuration
100K records	~30s	< 1 GB	Default
1M records	~2 min	~2 GB	Default
5M records	9.94 min	6.4 GB	`backend="bucket"`, 16-core
10M records	8.37 min	~5 GB	`backend="bucket"`, optimized

The QIS-bucket-realistic path holds F1 at 0.9886 across all benchmarked scales. Source: Community Context - v1.24.0

Memory Optimization

The native Rust acceleration (goldenmatch-native) provides:

18% RSS reduction on large datasets
Compiled clustering kernels
Optimized block-scoring operations

Source: Community Context - v1.19.0

Topic	Documentation
Getting Started	GoldenMatch Quick Start
Python API	packages/python/goldenmatch/
TypeScript API	packages/typescript/goldenmatch/
MCP Integration	Web UI Wiki
ER Agent	ER Agent / A2A Wiki
Examples	Python Examples, TypeScript Examples

Source: https://github.com/benseverndev-oss/goldenmatch / Human Manual

Backend Systems

GoldenMatch provides a pluggable backend architecture that allows the entity resolution engine to execute across different computational substrates. This design enables users to choose the optimal backend based on dataset size, infrastructure constraints, and performance requirements.

Overview

The backend abstraction decouples the core matching logic from data execution, allowing GoldenMatch to scale from single-threaded operations on small datasets to distributed clusters processing tens of millions of records.

graph TD
    User[User Code] --> Config[GoldenMatch Config]
    Config --> Registry[Backend Registry]
    Registry --> Polars[Polars Backend]
    Registry --> Bucket[Bucket Backend]
    Registry --> DuckDB[DuckDB Backend]
    Registry --> Ray[Ray Backend]
    
    Polars --> Data1[In-Memory Data]
    Bucket --> Data2[Chunked Data]
    DuckDB --> Data3[SQL Engine]
    Ray --> Data4[Distributed Data]
    
    subgraph Core["Core ER Engine"]
        Blocking[Blocking]
        Scoring[Pair Scoring]
        Clustering[Clustering]
    end
    
    Data1 --> Core
    Data2 --> Core
    Data3 --> Core
    Data4 --> Core

Backend Types

Polars Backend

The default backend, designed for single-node operation with multi-threaded execution. It leverages Polars' native vectorized operations for blocking, scoring, and clustering.

Characteristic	Value
Execution Model	Single node, multi-threaded
Memory Model	In-memory
Typical Dataset Size	Up to 5M records
Dependencies	polars
Configuration Key	`backend="polars"`

The Polars backend is the recommended starting point for datasets under 5 million records. Source: packages/python/goldenmatch/goldenmatch/backends/polars_backend.py

Bucket Backend

An evolution of the Polars backend optimized for larger datasets on single nodes. The bucket backend partitions data into manageable chunks that are processed sequentially, reducing peak memory consumption.

Characteristic	Value
Execution Model	Single node, chunked processing
Memory Model	Memory-efficient, streaming
Typical Dataset Size	5M-10M records
RSS Reduction	~50% (2x) vs v1.15 chunked baseline
Wall Time Reduction	~81% vs v1.15 baseline

Source: packages/python/goldenmatch/goldenmatch/backends/bucket_backend.py; docs/scale-envelope.md

v1.16.0+ Performance Note: The backend="bucket" path processes 5M records in 9.94 minutes with 6.4 GB peak RSS on a 16-core node, a 5x wall-time reduction and a 2x peak-RSS reduction compared to the v1.15 chunked baseline. Source: README.md

DuckDB Backend

Provides SQL-native fuzzy matching capabilities, enabling GoldenMatch operations to execute within DuckDB queries. This is particularly useful for warehouse-native entity resolution workflows.

Characteristic	Value
Execution Model	DuckDB SQL engine
Memory Model	Arrow-based, out-of-core
Use Case	Warehouse-native ER, SQL integration
Key Functions	`goldenflow_*` transforms, memory operations

Source: packages/python/goldenmatch/goldenmatch/backends/duckdb_backend.py

Ray Backend

Enables distributed entity resolution across a Ray cluster. The Ray backend distributes blocking, scoring, and clustering operations across multiple nodes.

Characteristic	Value
Execution Model	Distributed Ray cluster
Memory Model	Distributed, cluster-sharded
Typical Dataset Size	10M+ records
Dependencies	ray

Source: packages/python/goldenmatch/goldenmatch/backends/ray_backend.py

Backend Selection

GoldenMatch automatically selects an appropriate backend based on dataset characteristics, but users can explicitly override this via configuration:

from goldenmatch import GoldenMatch

config = {
    "backend": "bucket",  # Explicit backend selection
    "backend_options": {
        "chunk_size": 500_000,
        "num_threads": 16
    }
}

matcher = GoldenMatch(config)

Auto-Detection Logic

The backend registry implements automatic selection based on:

Record count: Datasets under 1M records use polars; larger datasets may trigger bucket or ray
Memory availability: Detected via RSS monitoring during execution
Cluster availability: If Ray is initialized, ray backend becomes available
User configuration: Explicit backend setting takes precedence
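
The selection rules above can be sketched as a small decision function. The 1M cutoff comes from the list and the 10M cutoff from the Ray backend's typical dataset size; the function name `select_backend`, its parameters, and the exact ordering of checks are assumptions, and the memory-availability rule is omitted for brevity.

```python
from typing import Optional


def select_backend(
    n_records: int,
    ray_initialized: bool = False,
    explicit: Optional[str] = None,
) -> str:
    """Pick a backend name; an explicit user setting always takes precedence."""
    if explicit is not None:
        return explicit
    if n_records < 1_000_000:
        return "polars"
    # Assumed rule: only go distributed when a Ray cluster is already up.
    if ray_initialized and n_records >= 10_000_000:
        return "ray"
    return "bucket"
```

Checking the explicit setting first mirrors the stated precedence, so auto-detection never overrides a backend the user asked for.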

Backend Interface

All backends implement a common interface defined in base.py:

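from typing import Protocol

from polars import DataFrame  # assumed: the custom-backend example uses polars DataFrames
from goldenmatch.core.config import Config

class Backend(Protocol):
    """Contract shared by all backends: lifecycle hooks plus one method per pipeline stage."""
    def setup(self, config: Config) -> None: ...
    def teardown(self) -> None: ...
    def execute_blocking(self, records: DataFrame, config: Config) -> DataFrame: ...
    def execute_scoring(self, pairs: DataFrame, config: Config) -> DataFrame: ...
    def execute_clustering(self, pairs: DataFrame, config: Config) -> DataFrame: ...
    def execute_merge(self, records: DataFrame, clusters: DataFrame, config: Config) -> DataFrame: ...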

Source: packages/python/goldenmatch/goldenmatch/backends/base.py

Native Acceleration (Rust/PyO3)

v1.21.0 introduced optional native acceleration via goldenmatch-native, a separately distributed compiled runtime built with Rust and PyO3 abi3:

pip install "goldenmatch[native]"

The native runtime is discovered automatically when installed and provides:

Compiled clustering kernels
Optimized block-scoring operations
Polars-compatible ABI3 bindings

Source: packages/python/goldenmatch/goldenmatch/backends/__init__.py

Note: goldenmatch-native is not a standalone package. It must be installed alongside the core goldenmatch package and is automatically discovered at import time.

Memory Management

Health Monitoring

All backends implement health monitoring to detect memory pressure:

class HealthMonitor:
    """Tracks RSS usage and execution metrics."""
    
    def check_health(self) -> HealthStatus:
        """Returns current memory and execution health."""
    
    def get_metrics(self) -> dict:
        """Returns detailed metrics for diagnostics."""

RSS Reduction Features

Version	Feature	RSS Impact
v1.16.0	Bucket backend introduction	50% reduction vs chunked
v1.24.0	Scale-aware cardinality	18% reduction
v1.24.0	Heuristic rule expansion	Additional savings

Source: docs/scale-envelope.md
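
Peak RSS, the metric tracked by the health monitor and quoted in the table above, can be read from the standard library on Unix systems. This is a minimal sketch, not the library's `HealthMonitor`: the `HealthStatus` fields and the `check_health` signature are assumptions, with `memory_limit_gb` borrowed from the backend options.

```python
import resource
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthStatus:
    peak_rss_gb: float
    under_pressure: bool


def peak_rss_bytes() -> int:
    """Peak resident set size of the current process (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    return peak if sys.platform == "darwin" else peak * 1024


def check_health(memory_limit_gb: float) -> HealthStatus:
    """Flag memory pressure when peak RSS exceeds the configured limit."""
    peak_gb = peak_rss_bytes() / 1024**3
    return HealthStatus(peak_rss_gb=peak_gb, under_pressure=peak_gb > memory_limit_gb)
```

The `resource` module is not available on Windows, so a portable monitor would need a different source for the same number.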

Configuration Reference

Backend-Specific Options

backend: "bucket"  # polars, bucket, duckdb, ray

backend_options:
  # Polars/Bucket options
  num_threads: 16
  chunk_size: 500_000
  memory_limit_gb: 32
  
  # Ray options
  num_actors: 8
  actor_placement: "node1,node2,node3"
  
  # DuckDB options
  catalog: "memory"
  threads: 8

Performance Tuning

For the recommended 5M-on-one-node configuration:

backend: "bucket"
backend_options:
  num_threads: 16
  chunk_size: 500_000

This configuration achieves approximately 9.94 minutes wall time and 6.4 GB peak RSS on a 16-core node.

Source: docs/scale-envelope.md

Execution Pipeline

The backend executes the entity resolution pipeline in these stages:

graph LR
    A[Raw Records] --> B[Blocking]
    B --> C[Pair Generation]
    C --> D[Pair Scoring]
    D --> E[Clustering]
    E --> F[Golden Record Construction]
    F --> G[Output]
    
    B1[Block Key Computation] --> B
    D1[Field Scorers] --> D
    E1[Linkage Criteria] --> E

Each stage is backend-specific for optimal execution:

Stage	Polars	Bucket	DuckDB	Ray
Blocking	Vectorized	Chunked vectorized	SQL	Distributed actors
Scoring	Multi-threaded	Chunked	SQL	Distributed
Clustering	Single-node	Chunked	SQL	Distributed

Extending Backends

To implement a custom backend:

import polars as pl

from goldenmatch.backends import register_backend
from goldenmatch.backends.base import Backend
from goldenmatch.core.config import Config


class CustomBackend(Backend):
    def setup(self, config: Config) -> None:
        # Keep the config for use by the stage methods
        self.config = config

    def execute_blocking(self, records: pl.DataFrame, config: Config) -> pl.DataFrame:
        # Placeholder: replace with custom blocking logic
        return records

    def execute_scoring(self, pairs: pl.DataFrame, config: Config) -> pl.DataFrame:
        # Placeholder: replace with custom scoring logic
        return pairs

    def execute_clustering(self, pairs: pl.DataFrame, config: Config) -> pl.DataFrame:
        # Placeholder: replace with custom clustering logic
        return pairs

    def execute_merge(self, records: pl.DataFrame, clusters: pl.DataFrame, config: Config) -> pl.DataFrame:
        # Placeholder: replace with golden-record construction
        return records

    def teardown(self) -> None:
        # Cleanup resources
        pass


# Make the backend selectable via backend="custom"
register_backend("custom", CustomBackend)

Core Matching Engine

Related topics: AutoConfig System, Blocking and Scoring

The Core Matching Engine is the central processing component of GoldenMatch, responsible for identifying, scoring, and clustering duplicate records within datasets. It orchestrates the entire entity resolution pipeline from candidate pair generation through final cluster formation, enabling users to deduplicate records with configurable precision, recall, and performance characteristics.

Architecture Overview

The Core Matching Engine consists of several interconnected modules that work together to perform entity resolution at scale. The engine processes input records through a staged pipeline: blocking reduces the candidate space, scoring evaluates pair similarity, and clustering groups related records into golden records.

graph TD
    A[Input Records] --> B[Blocker]
    B --> C[Candidate Pairs]
    C --> D[Scorer]
    D --> E[Scored Pairs]
    E --> F[Cluster]
    F --> G[Golden Records]
    
    H[Config] --> B
    H --> D
    H --> F
    
    I[Autoconfig Controller] -.-> H

Core Components

Component	Purpose	Key Responsibilities
Blocker	Candidate reduction	Generate candidate pairs using blocking keys, handle QIS-bucket blocking
Scorer	Similarity evaluation	Compute field-level and overall pair scores, apply field strategies
Cluster	Record grouping	Form clusters from scored pairs, manage transitive closure, build golden records
Autoconfig Controller	Self-configuration	Auto-tune thresholds, manage recall targets, handle zero-label commits
Controller	Orchestration	Coordinate pipeline stages, manage state, handle incremental processing

Source: packages/python/goldenmatch/goldenmatch/core/controller.py

Blocking Module

The blocking module cuts the number of comparisons from the O(n²) all-pairs space down to a manageable candidate set. It groups records by shared blocking keys and generates candidate pairs only within the same block.
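
As a minimal sketch of key-based blocking, the function below groups row ids by a blocking key and emits pairs only inside each group. The name `candidate_pairs` and the callable `block_key` parameter are illustrative and do not mirror the GoldenMatch blocker API.

```python
from collections import defaultdict
from itertools import combinations
from typing import Callable, Hashable


def candidate_pairs(
    records: dict[int, dict],
    block_key: Callable[[dict], Hashable],
) -> set[tuple[int, int]]:
    """Return the row-id pairs that share a blocking key."""
    blocks: dict[Hashable, list[int]] = defaultdict(list)
    for row_id, record in records.items():
        blocks[block_key(record)].append(row_id)

    pairs: set[tuple[int, int]] = set()
    for row_ids in blocks.values():
        # Pairs are generated within a block only, never across blocks.
        pairs.update(combinations(sorted(row_ids), 2))
    return pairs
```

With four records where three share a ZIP code, blocking on ZIP yields 3 candidate pairs instead of the 6 that an all-pairs comparison would produce.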

Blocking Strategies

GoldenMatch supports multiple blocking strategies optimized for different data characteristics and scale requirements:

Strategy	Description	Use Case
QIS Bucket	Quality-Interval-Sorted bucketing with Chao1 cardinality estimation	Large-scale datasets (10M+ records), realistic data distributions
Standard	Traditional blocking on normalized field values	General purpose deduplication
Multi-pass	Sequential blocking with different keys	High recall requirements
ANN Fallback	Approximate nearest neighbor blocking	Fuzzy matching with edit distance

The QIS-bucket strategy introduced in v1.24.0 achieves significant performance improvements: 10M records processed in 502s (down from 2604s) with invariant F1=0.9886 and 18% RSS reduction.

Source: packages/python/goldenmatch/goldenmatch/core/blocker.py
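
For readers unfamiliar with Chao1, it estimates how many distinct values exist in a population from the number of values seen exactly once and exactly twice in a sample. The sketch below implements the standard estimator; how GoldenMatch feeds it into blocking-key selection is not specified in the source, and the function name `chao1` is illustrative.

```python
from collections import Counter
from typing import Hashable, Iterable


def chao1(sample: Iterable[Hashable]) -> float:
    """Chao1 richness estimate: observed distinct values plus a correction for unseen ones."""
    counts = Counter(sample)
    observed = len(counts)
    f1 = sum(1 for c in counts.values() if c == 1)  # values seen exactly once
    f2 = sum(1 for c in counts.values() if c == 2)  # values seen exactly twice
    if f2 > 0:
        return observed + (f1 * f1) / (2 * f2)
    # Bias-corrected form, used when no value occurs exactly twice.
    return observed + f1 * (f1 - 1) / 2
```

Many singletons relative to doubletons signal that the sample has missed a lot of distinct values, which pushes the estimate well above the observed count.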

Blocking Key Configuration

config = {
    "blocking": {
        "keys": ["name_soundex", "zip_code", "phone_area"],
        "min_block_size": 2,
        "max_block_size": 100000
    }
}

Scoring Module

The scoring module evaluates candidate pairs by computing similarity scores at both field and record levels. It applies configurable field strategies to determine how each field contributes to the overall match probability.

Field Strategies

Field strategies define how individual fields are compared and weighted:

Strategy	Description	Best For
`exact`	Binary match/mismatch	IDs, codes, categorical
`fuzzy`	Edit distance or Jaro-Winkler	Names, addresses
`token_set`	Token overlap comparison	Multi-word fields
`numeric`	Threshold-based comparison	Ages, amounts
`date`	Temporal proximity	Date fields
`phonetic`	Soundex/Metaphone matching	Names

Source: packages/python/goldenmatch/goldenmatch/core/field_strategies.py

Score Computation

The scorer aggregates field-level scores into an overall pair score using weighted combination:

overall_score = sum(field_score * field_weight for field in fields) / total_weight

The resulting score represents the probability that two records refer to the same entity, ranging from 0.0 (definitely different) to 1.0 (definitely match).

Source: packages/python/goldenmatch/goldenmatch/core/scorer.py
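
The weighted combination can be worked through with concrete numbers. This sketch is a direct reading of the formula above; the function signature and the sample field scores are illustrative, while the weights follow the sample configuration (name 2.0, email 1.5, phone 1.0).

```python
def overall_score(field_scores: dict[str, float], field_weights: dict[str, float]) -> float:
    """Weighted average of field-level scores, normalised by the total weight."""
    total_weight = sum(field_weights[f] for f in field_scores)
    if total_weight == 0:
        raise ValueError("at least one field must carry a positive weight")
    weighted = sum(field_scores[f] * field_weights[f] for f in field_scores)
    return weighted / total_weight


weights = {"name": 2.0, "email": 1.5, "phone": 1.0}
scores = {"name": 0.9, "email": 1.0, "phone": 0.5}
pair_score = overall_score(scores, weights)  # (1.8 + 1.5 + 0.5) / 4.5
```

Here the pair scores about 0.844, which falls just under a `score_threshold` of 0.85 despite an exact email match, because the weak phone score pulls the average down.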

Clustering Module

The clustering module transforms scored pairs into connected clusters representing unique entities. It handles transitive closure to ensure consistent grouping across the record graph.

Clustering Algorithm

GoldenMatch uses a connected-components approach with configurable linkage criteria:

Linkage Type	Behavior
Single	Two clusters merge if any cross-cluster pair exceeds the threshold
Complete	All pairs within merged clusters must exceed threshold
Average	Uses mean pairwise similarity
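
Single linkage over thresholded pairs is equivalent to finding connected components, which a union-find structure handles compactly. The sketch below shows that technique only; the function name `cluster_pairs` and the `(id, id, score)` tuple layout are assumptions, and complete or average linkage would need additional checks beyond this.

```python
def cluster_pairs(scored_pairs: list[tuple[int, int, float]], threshold: float) -> list[list[int]]:
    """Group record ids into connected components over pairs scoring at or above threshold."""
    parent: dict[int, int] = {}

    def find(x: int) -> int:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        root_a, root_b = find(a), find(b)  # also registers both ids
        if score >= threshold and root_a != root_b:
            parent[root_a] = root_b

    clusters: dict[int, list[int]] = {}
    for node in list(parent):
        clusters.setdefault(find(node), []).append(node)
    return sorted(sorted(members) for members in clusters.values())
```

Transitive closure falls out of the structure: if records 1 and 2 match and 2 and 3 match, all three share a root even when the pair (1, 3) was never scored.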

Golden Record Generation

Once clusters are formed, the clustering module generates golden records by applying survivorship rules:

golden_record = build_golden_records_batch(cluster_members, provenance=True)

With provenance=True (introduced in v1.22.0), each field dict includes source_row_id tracking which record contributed the winning value.

Source: packages/python/goldenmatch/goldenmatch/core/cluster.py

Autoconfig Controller

The autoconfig controller enables self-tuning of matching parameters based on labeled data or automatic heuristics. It replaces manual threshold tuning with data-driven optimization.

Auto-Configuration Features

Feature	Description	Version
Zero-label commit	Prefer higher confidence candidates when labels unavailable	v1.23.0+
Recall targeting	Auto-configure thresholds to meet desired recall	v1.20.0+
Cluster threshold tuning	Tune decision threshold based on cluster-level decisions	v1.20.0+
Field strategy tuning	Auto-select field comparison strategies	v1.19.0+

Zero-Label Confidence Handling

In v1.23.0, the pick_committed method was enhanced to handle zero-label profiles. When multiple candidates have equal health rank, the controller prefers the candidate with the higher overall_confidence rather than the one with the higher mass_separation. This addresses precision-collapse issues in unlabeled data scenarios.

Source: packages/python/goldenmatch/goldenmatch/core/autoconfig_controller.py
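
The tiebreak can be expressed as a composite sort key. This is a sketch of the described ordering, not the controller's code: the `Candidate` type, the convention that a lower `health_rank` is healthier, and the name `pick_committed_sketch` are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    health_rank: int           # assumed convention: lower is healthier
    overall_confidence: float
    mass_separation: float


def pick_committed_sketch(candidates: list[Candidate]) -> Candidate:
    """Best health rank first; ties go to higher confidence, then higher separation."""
    if not candidates:
        raise ValueError("no candidates to commit")
    return min(
        candidates,
        key=lambda c: (c.health_rank, -c.overall_confidence, -c.mass_separation),
    )
```

Negating a field inside the key makes `min` prefer its larger values, so confidence decides between equally healthy candidates before mass separation is consulted.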

Performance Characteristics

The Core Matching Engine is optimized for both scale and accuracy:

Benchmark Results (v1.24.0)

Dataset Size	Wall Time	Peak RSS	F1 Score
5M records	9.94 min	6.4 GB	0.99+
10M records	8.37 min (502 s)	~18% reduction vs v1.23	0.9886

Performance Strategies

Vectorization: One group-by per column when building golden records
Scale-aware cardinality: Chao1 estimation for blocking key selection
Native acceleration: Optional Rust/PyO3 runtime via pip install "goldenmatch[native]"
Incremental processing: Support for streaming and incremental matching

Source: packages/python/goldenmatch/goldenmatch/core/matcher.py

Configuration Reference

Core Configuration Options

matching:
  # Scoring thresholds
  score_threshold: 0.85        # Minimum score to consider a match
  decision_threshold: 0.5      # Threshold for cluster decisions
  
  # Field configuration
  fields:
    name:
      strategy: fuzzy
      weight: 2.0
    email:
      strategy: exact
      weight: 1.5
    phone:
      strategy: token_set
      weight: 1.0

  # Blocking
  blocking:
    keys: ["name_soundex", "phone_last4"]
    method: qis_bucket

  # Output
  output:
    lineage_provenance: false  # Track source records for golden fields
    include_scores: true

Autoconfig Options

autoconfig:
  enabled: true
  recall_target: 0.95
  zero_label_commit: true      # v1.23.0+ default behavior
  tune_cluster_threshold: true  # v1.20.0+

Pipeline Integration

The Core Matching Engine integrates with the broader GoldenSuite ecosystem:

graph LR
    A[GoldenCheck] --> B[GoldenFlow]
    B --> C[GoldenMatch]
    C --> D[InferMap]
    
    E[GoldenPipe] -. orchestrates .-> A
    E -. orchestrates .-> B
    E -. orchestrates .-> C
    E -. orchestrates .-> D

GoldenCheck: Validates data quality before matching
GoldenFlow: Standardizes messy fields (phone, date, address)
InferMap: Maps columns across heterogeneous schemas
GoldenPipe: Orchestrates the full pipeline declaratively


AutoConfig System

Related topics: Core Matching Engine, Learning Memory

The AutoConfig System is GoldenMatch's intelligent self-configuration engine that automatically tunes entity resolution parameters based on data characteristics, eliminating the need for manual threshold and strategy tuning. Introduced incrementally across versions v1.19.0 through v1.24.0, it represents the "Learning Memory" family of auto-tuning features that consume human feedback to propose optimal configurations.

Overview

AutoConfig addresses the fundamental challenge in entity resolution: finding the right balance between precision and recall requires understanding your specific data distribution. Rather than requiring users to manually specify match thresholds, blocking strategies, and scoring weights, AutoConfig:

Analyzes data characteristics through profiling
Proposes configuration parameters via iterative tuning
Learns from user decisions in review queues
Commits configurations based on zero-label confidence

Source: docs/design/2026-05-25-zero-label-confidence-autoconfig-design.md

Architecture

The AutoConfig System comprises four primary components that work together in an iterative feedback loop:

graph TD
    A[Data Input] --> B[Controller]
    B --> C[Policy Engine]
    C --> D[Cluster Threshold Tuner]
    D --> E[Zero-Label Confidence]
    E --> B
    F[User Decisions] --> D
    G[Telemetry] --> B
    B --> H[Committed Config]

Component Overview

Component	Purpose	Location
Controller	Orchestrates the auto-config loop, manages iterations	`core/autoconfig_controller.py`
Policy Engine	Evaluates candidate configurations against health metrics	`core/autoconfig_policy.py`
Cluster Threshold Tuner	Proposes per-dataset approve thresholds from cluster decisions	`core/autoconfig_cluster_threshold_tuner.py`
Zero-Label Confidence	Assigns confidence scores based on unlabeled data patterns	`core/zero_label_confidence.py`

Source: packages/python/goldenmatch/goldenmatch/core/autoconfig_controller.py

Controller

The AutoConfigController is the central orchestrator that manages the iterative configuration process. It coordinates between the policy engine, telemetry collectors, and the zero-label confidence system.

Controller Responsibilities

graph LR
    A[Blocking<br>Summary] --> C[Controller]
    B[Scoring<br>Summary] --> C
    D[Cluster<br>Summary] --> C
    E[Indicators] --> C
    F[Column<br>Priors] --> C
    C --> G[Decisions]
    C --> H[Errors]
    C --> I[Committed<br>Matchkeys]

Telemetry Data Model

The controller collects and emits structured telemetry for each iteration:

interface ControllerScoringSummary {
  n_pairs_scored: number;
  candidates_compared: number;
  mass_above_threshold: number;
  mass_in_borderline: number;
  dip_statistic: number;
}

interface ControllerBlockingSummary {
  n_blocks: number;
  reduction_ratio: number;
  block_sizes_p50: number;
  block_sizes_p99: number;
  block_sizes_max: number;
  oversized_block_count: number;
  keys_used: string[][];
}

interface ControllerClusterSummary {
  n_clusters: number;
  cluster_size_p50: number;
  cluster_size_p99: number;
  cluster_size_max: number;
  transitivity_rate: number;
  oversized_cluster_count: number;
}

Source: web/frontend/src/lib/api.ts

Decision Recording

Each iteration produces a ControllerDecision record:

Field	Type	Description
`iteration`	`int`	Iteration number
`rule_name`	`str`	Name of the policy rule that triggered
`rationale`	`str`	Human-readable explanation
`config_diff`	`dict[str, str]`	Changes made to configuration
`wall_clock_ms`	`int`	Time taken for this iteration

Source: web/frontend/src/lib/api.ts

Policy Engine

The AutoConfigPolicy evaluates candidate configurations against a health ranking system. It ranks configurations by their expected precision-collapse behavior and mass separation characteristics.

Health Ranking

Configurations are evaluated on multiple health dimensions:

Metric	Description	Impact
`overall_confidence`	Aggregate confidence score	Higher is better
`mass_separation`	Gap between match/non-match distributions	Larger gap indicates better discrimination
`precision_collapse`	Tendency to over-cluster	Lower is better

Commit Decision Logic

The policy engine's pick_committed method determines which configuration to commit when multiple candidates share the same health rank. As of v1.23.0, when a tied candidate carries a zero_label profile, the tiebreaker prefers the higher overall_confidence rather than the higher mass_separation (that is, it sorts on -overall_confidence instead of -mass_separation).

Source: docs/design/2026-05-25-zero-label-confidence-autoconfig-design.md

Cluster Threshold Tuner

The AutoConfigClusterThresholdTuner is the third tuner in the Learning Memory family, alongside the pair-level MemoryLearner and the field-level field-strategy tuner.

Tuner Function: `tune_decision_threshold`

def tune_decision_threshold(
    decisions: list[ClusterDecision],
    current_threshold: float
) -> float:
    """
    Proposes a per-dataset auto-approve threshold based on
    cluster-level approve/reject decisions.
    """

Input: Cluster Decisions

Field	Type	Description
`cluster_id`	`str`	Unique cluster identifier
`decision`	`str`	"approve" or "reject"
`confidence`	`float`	Model confidence in decision
`cluster_size`	`int`	Number of records in cluster

Output

Returns an updated threshold value that balances precision and recall based on the observed decision pattern.

Source: packages/python/goldenmatch/goldenmatch/core/autoconfig_cluster_threshold_tuner.py

Zero-Label Confidence

The zero-label confidence mechanism enables AutoConfig to work without labeled training data by analyzing the structure of unlabeled record pairs and their natural clustering behavior.

Design Principles

No Ground Truth Required: Analyzes inherent data structure rather than relying on labeled examples
Precision-Collapse Aware: Identifies when configurations would cause over-merging
Profile-Based Scoring: Assigns confidence scores based on zero-label profile characteristics

Zero-Label Profile

A zero-label profile captures characteristics of record pairs that help predict match quality without explicit labels:

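from dataclasses import dataclass

@dataclass
class ZeroLabelProfile:
    mass_separation: float          # gap between match and non-match distributions
    overall_confidence: float       # aggregate confidence score
    precision_collapse_risk: float  # tendency to over-cluster
    zero_label: bool  # True if profile carries zero-label characteristics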

Commit Behavior

Starting in v1.23.0, AutoConfig commits by zero-label confidence by default. The pick_committed tiebreaker logic:

IF several candidates share the best health rank:
    IF at least one of them carries a zero_label profile:
        PREFER the candidate with the higher overall_confidence
    ELSE:
        PREFER the candidate with the higher mass_separation

This change addressed precision-collapse scenarios where mass_separation alone was insufficient.
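
The tiebreaker can be sketched in a few lines of Python. This is an illustrative reading of the rule above, not the project's implementation: the `Candidate` fields and the convention that a lower `health_rank` is healthier are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    # Hypothetical candidate record; field names mirror the prose above.
    name: str
    health_rank: int           # assumption: lower rank means healthier
    mass_separation: float
    overall_confidence: float
    zero_label: bool

def pick_committed(candidates: list[Candidate]) -> Candidate:
    """Pick the healthiest candidate, breaking ties as described above."""
    best_rank = min(c.health_rank for c in candidates)
    tied = [c for c in candidates if c.health_rank == best_rank]
    if any(c.zero_label for c in tied):
        # A zero-label profile is present: prefer overall confidence.
        return max(tied, key=lambda c: c.overall_confidence)
    # No zero-label profile: fall back to mass separation.
    return max(tied, key=lambda c: c.mass_separation)

a = Candidate("a", health_rank=1, mass_separation=0.9, overall_confidence=0.6, zero_label=False)
b = Candidate("b", health_rank=1, mass_separation=0.5, overall_confidence=0.8, zero_label=True)
c = Candidate("c", health_rank=2, mass_separation=0.99, overall_confidence=0.99, zero_label=True)
chosen = pick_committed([a, b, c])
print(chosen.name)
```

Here `a` and `b` tie on health rank, and because `b` carries a zero-label profile the tie is broken on `overall_confidence`, even though `a` has the larger `mass_separation`.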

Source: packages/python/goldenmatch/goldenmatch/core/zero_label_confidence.py

Source: docs/design/2026-05-25-zero-label-confidence-autoconfig-design.md

Interfaces

AutoConfig is accessible through multiple interfaces:

CLI

goldenmatch autoconfig <input.csv> [--iterations N] [--output config.yaml]

REST API

Endpoint	Method	Description
`/autoconfig`	POST	Start auto-configuration run
`/autoconfig/status`	GET	Get current iteration status
`/controller/telemetry`	GET	Retrieve full telemetry snapshot

Python API

from goldenmatch import AutoConfigController

controller = AutoConfigController(config)
controller.run(iterations=10)
telemetry = controller.get_telemetry()

SQL Interface (Postgres Extension)

SELECT goldenmatch_autoconfig('customers.csv');
SELECT gm_telemetry();

Source: packages/python/goldenmatch/goldenmatch/core/autoconfig_controller.py

Web UI Integration

The AutoConfig System integrates with the GoldenMatch web workbench for visual monitoring and manual override:

graph LR
    A[Web UI] -->|Ctrl+A| B[AutoConfig Panel]
    B --> C[Iteration List]
    C --> D[Decision Details]
    D --> E[Config Diff View]
    B --> F[Telemetry Charts]
    F --> G[Blocking Summary]
    F --> H[Cluster Summary]

Telemetry Visualization

The frontend displays real-time telemetry including:

Blocking Summary: Reduction ratio, block size distribution
Scoring Summary: Mass above the threshold and in the borderline band, dip statistic
Cluster Summary: Size distribution, transitivity rate
Indicators: Matchkey hit rate, cross-blocking overlap

Source: web/frontend/src/lib/api.ts

Configuration Options

AutoConfig Settings

Parameter	Default	Description
`autoconfig.enabled`	`true`	Enable auto-configuration
`autoconfig.max_iterations`	`10`	Maximum iterations before commit
`autoconfig.zero_label_commit`	`true`	Prefer zero-label confidence (v1.23.0+)
`autoconfig.recall_target`	`0.95`	Target recall for auto-config
`autoconfig.precision_floor`	`0.90`	Minimum acceptable precision

Learning Memory Integration

AutoConfig integrates with the broader Learning Memory system:

Tuner	Level	Learns From
`MemoryLearner`	Pair	Pair-level approve/reject decisions
`FieldStrategyTuner`	Field	Field-level strategy preferences
`ClusterThresholdTuner`	Cluster	Cluster-level approve/reject decisions

Version History

Version	Change
v1.24.0	Heuristic rule expansion + diagnostic harness
v1.23.0	Auto-config commits by zero-label confidence by default
v1.20.0	Cluster decision tuner (`tune_decision_threshold`)
v1.19.0	Native acceleration + autoconfig + probabilistic improvements

Source: packages/python/goldenmatch/goldenmatch/core/autoconfig_controller.py

Best Practices

Start with defaults: The zero-label confidence commit behavior (v1.23.0+) provides sensible defaults for most datasets
Review telemetry: Monitor blocking and cluster summaries to identify oversized blocks or clusters
Use strict mode for evaluation: The _strictAutoconfig flag disables runtime threshold shifts for reproducible results
Integrate with review queue: Feed cluster decisions back to the ClusterThresholdTuner for continuous improvement

Source: https://github.com/benseverndev-oss/goldenmatch / Human Manual

Learning Memory

Learning Memory is GoldenMatch's adaptive feedback system that continuously improves entity resolution accuracy by learning from human corrections, labeled decisions, and performance feedback. It forms the closed-loop optimization layer that distinguishes GoldenMatch from static rule-based matching systems.

Overview

Learning Memory captures human domain knowledge and transforms it into automated configuration improvements. Rather than requiring users to manually tune thresholds, define field strategies, or configure blocking rules, Learning Memory observes how humans resolve ambiguous cases and propagates those decisions across the entire dataset.

The system operates at three distinct levels:

Level	Tuner	Input	Output
Pair-level	`MemoryLearner`	Human approve/reject on pair comparisons	Learned matchkey weights and thresholds
Field-level	`field_strategy_tuner`	Field-level corrections	Per-field strategy selection (exact, fuzzy, tokenized, etc.)
Cluster-level	`cluster_decision_tuner`	Cluster approve/reject decisions	Per-dataset auto-approve threshold

This multi-level approach ensures that feedback at any granularity flows back into the appropriate configuration layer. Source: packages/python/goldenmatch/CHANGELOG.md

Architecture

Learning Memory consists of several interconnected components that handle storage, learning, and application of corrections.

graph TD
    subgraph "Input Layer"
        A[Human Corrections] --> B[Corrections Store]
        C[Review Queue Feedback] --> B
        D[Ground Truth Labels] --> E[Memory Learner]
    end
    
    subgraph "Learning Layer"
        E --> F[Pair-Level Tuning]
        F --> G[Threshold Adjuster]
        F --> H[Matchkey Weighter]
        E --> I[Field-Level Tuning]
        I --> J[Strategy Selector]
        E --> K[Cluster-Level Tuning]
        K --> L[Decision Threshold Tuner]
    end
    
    subgraph "Output Layer"
        G --> M[AutoConfig Controller]
        H --> M
        J --> M
        L --> M
        M --> N[Matching Pipeline]
    end
    
    subgraph "Feedback Loop"
        N --> O[Review Queue]
        O --> A
    end

Core Components

Component	File	Responsibility
`MemoryCorrections`	`core/memory/corrections.py`	Persistent storage of human corrections
`MemoryLearner`	`core/memory/learner.py`	Pair-level learning from corrections
`FieldStrategyTuner`	`core/autoconfig_field_strategy_tuner.py`	Field-level strategy optimization
`ClusterDecisionTuner`	`core/autoconfig_cluster_threshold_tune.py`	Cluster threshold optimization

Corrections Store

The MemoryCorrections class provides the persistent backing store for all human feedback. It tracks corrections at both the pair level (which records should be linked or unlinked) and the field level (which field values should win in survivorship). Source: packages/python/goldenmatch/goldenmatch/core/memory/corrections.py:1-50

Data Model

class MemoryCorrections:
    corrections: list[Correction]
    
class Correction:
    record_id_a: str      # First record in the pair
    record_id_b: str      # Second record in the pair
    decision: str         # "approve" or "reject"
    confidence: float     # Human confidence 0.0-1.0
    source: str           # "human", "ground_truth", "review_queue"
    timestamp: datetime
    metadata: dict        # Additional context

CRUD Operations

Operation	Method	Description
Create	`add_correction()`	Record a new correction
Read	`get_corrections()`	Retrieve corrections with filters
Update	`update_correction()`	Modify an existing correction
Delete	`remove_correction()`	Remove a correction

The corrections store supports filtering by:

Source type (human, ground_truth, review_queue)
Decision type (approve, reject)
Date range
Record pair
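
A minimal in-memory sketch of such a store is shown below. The class name `InMemoryCorrections` and the filter parameter names (`source`, `decision`, `since`, `until`, `pair`) are hypothetical and chosen only to mirror the filters listed above; the real `MemoryCorrections` signatures may differ.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Correction:
    record_id_a: str
    record_id_b: str
    decision: str            # "approve" or "reject"
    confidence: float
    source: str              # "human", "ground_truth", "review_queue"
    timestamp: datetime
    metadata: dict = field(default_factory=dict)

class InMemoryCorrections:
    """Hypothetical stand-in for MemoryCorrections, backed by a plain list."""

    def __init__(self) -> None:
        self.corrections: list[Correction] = []

    def add_correction(self, correction: Correction) -> None:
        self.corrections.append(correction)

    def get_corrections(self, source=None, decision=None,
                        since=None, until=None, pair=None):
        # Treat a pair as unordered so (a, b) and (b, a) match the same record.
        wanted = frozenset(pair) if pair else None
        result = []
        for c in self.corrections:
            if source is not None and c.source != source:
                continue
            if decision is not None and c.decision != decision:
                continue
            if since is not None and c.timestamp < since:
                continue
            if until is not None and c.timestamp > until:
                continue
            if wanted is not None and frozenset((c.record_id_a, c.record_id_b)) != wanted:
                continue
            result.append(c)
        return result

store = InMemoryCorrections()
store.add_correction(Correction("r1", "r2", "approve", 0.95, "human", datetime(2026, 1, 10)))
store.add_correction(Correction("r1", "r3", "reject", 0.80, "review_queue", datetime(2026, 2, 5)))
store.add_correction(Correction("r4", "r5", "approve", 0.90, "human", datetime(2026, 3, 1)))
human_only = store.get_corrections(source="human")
```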

Pair-Level Learning: MemoryLearner

The MemoryLearner processes corrections at the record-pair level and extracts patterns about which matchkey combinations indicate true matches versus false positives. Source: packages/python/goldenmatch/goldenmatch/core/memory/learner.py:1-100

Learning Algorithm

class MemoryLearner:
    def learn(self, corrections: MemoryCorrections) -> LearnedWeights:
        """
        Process corrections and compute updated matchkey weights.
        """

The learner computes:

Precision per matchkey: Ratio of correct to total positive predictions for each matchkey
Recall per matchkey: Coverage of true matches captured by each matchkey
Composite weights: Combination weights that balance precision and recall

Weight Computation

Weights are computed using a modified TF-IDF approach:

Metric	Formula	Purpose
Matchkey Precision	`correct_pairs / total_pairs_for_key`	How reliable is this key?
Key Frequency	`pairs_using_key / total_pairs`	How common is this key?
Composite Weight	`precision * log(key_frequency + 1)`	Balanced importance
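
As a worked example of the formulas in the table, the sketch below evaluates them for two hypothetical matchkeys. The function name, the counts, and the use of the natural logarithm are assumptions; the table does not state the log base.

```python
import math

def composite_weight(correct_pairs: int, total_pairs_for_key: int, total_pairs: int) -> float:
    """Evaluate precision * log(key_frequency + 1) from the table above."""
    precision = correct_pairs / total_pairs_for_key
    key_frequency = total_pairs_for_key / total_pairs
    return precision * math.log(key_frequency + 1)  # natural log assumed

# Hypothetical review outcome over 200 corrected pairs.
email_weight = composite_weight(correct_pairs=45, total_pairs_for_key=50, total_pairs=200)
name_weight = composite_weight(correct_pairs=60, total_pairs_for_key=120, total_pairs=200)
print(round(email_weight, 4), round(name_weight, 4))
```

Because the log term grows with key frequency, a common key with moderate precision (the name key here) can end up with a larger weight than a rarer, more precise one (the email key). That is worth keeping in mind when reading learned weights.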

Field-Level Learning: FieldStrategyTuner

The FieldStrategyTuner optimizes which scoring strategy to use for each field based on correction patterns. Source: packages/python/goldenmatch/goldenmatch/core/autoconfig_field_strategy_tuner.py:1-80

Available Field Strategies

Strategy	Use Case	Example
`exact`	Unique identifiers, codes	SSN, account numbers
`fuzzy`	Names, addresses with typos	"John" vs "Jon"
`tokenized`	Multi-word fields	"John Smith" vs "Smith, John"
`numeric`	Numbers with tolerance	Prices, quantities
`date`	Temporal fields	Birth dates, transaction dates
`phonetic`	Names with spelling variants	Soundex, Metaphone

Tuning Process

Collect field-level signals from corrections (which field caused the error?)
Compute strategy accuracy per field for each strategy
Select best strategy using cross-validation to avoid overfitting
Generate strategy map: {field_name: strategy_name}
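
The steps above can be sketched as follows. This is a simplified illustration rather than the tuner's actual code: it scores each strategy by plain accuracy on labeled pairs and omits the cross-validation step, and the three scoring functions, the 0.8 cutoff, and the sample data are all assumptions.

```python
from difflib import SequenceMatcher

def exact(a: str, b: str) -> float:
    return 1.0 if a == b else 0.0

def fuzzy(a: str, b: str) -> float:
    # Standard-library similarity ratio, used here as a stand-in fuzzy scorer.
    return SequenceMatcher(None, a, b).ratio()

def tokenized(a: str, b: str) -> float:
    ta = set(a.lower().replace(",", " ").split())
    tb = set(b.lower().replace(",", " ").split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0

STRATEGIES = {"exact": exact, "fuzzy": fuzzy, "tokenized": tokenized}

def accuracy(strategy, labeled_pairs, cutoff: float = 0.8) -> float:
    """Share of pairs where (score >= cutoff) agrees with the human label."""
    hits = sum((strategy(a, b) >= cutoff) == is_match for a, b, is_match in labeled_pairs)
    return hits / len(labeled_pairs)

def build_strategy_map(signals: dict) -> dict:
    # Ties go to the strategy listed first in STRATEGIES.
    return {
        field: max(STRATEGIES, key=lambda name: accuracy(STRATEGIES[name], pairs))
        for field, pairs in signals.items()
    }

signals = {
    "name": [("John Smith", "Smith, John", True),
             ("John Smith", "Jane Doe", False),
             ("Ann Lee", "Lee Ann", True)],
    "ssn": [("123-45-6789", "123-45-6789", True),
            ("123-45-6789", "123-45-6780", False)],
    "city": [("Springfield", "Springfeild", True),
             ("Springfield", "Shelbyville", False)],
}
strategy_map = build_strategy_map(signals)
print(strategy_map)
```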

Cluster-Level Learning: ClusterDecisionTuner

The ClusterDecisionTuner (introduced in v1.20.0) consumes cluster-level approve/reject decisions and proposes a per-dataset auto-approve threshold. Source: packages/python/goldenmatch/goldenmatch/core/autoconfig_cluster_threshold_tune.py:1-60

Threshold Tuning Algorithm

class ClusterDecisionTuner:
    def tune_decision_threshold(
        self,
        decisions: list[ClusterDecision],
        target_recall: float = 0.95
    ) -> float:
        """
        Find the threshold that achieves target_recall on approved clusters.
        """

Tuning Inputs

Input	Type	Description
`decisions`	`list[ClusterDecision]`	Human cluster decisions
`target_recall`	`float`	Desired recall target (default 0.95)
`min_approve_confidence`	`float`	Minimum confidence for auto-approve

Tuning Outputs

Output	Type	Description
`threshold`	`float`	Suggested auto-approve threshold
`expected_precision`	`float`	Estimated precision at this threshold
`calibration_curve`	`list[tuple]`	Precision-recall tradeoff points
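
One way to picture the tuning step is a sweep over observed confidences, stopping at the highest threshold that still meets the recall target. The sketch below is a hedged illustration under that assumption; it is a free function rather than the `ClusterDecisionTuner` method, and the project's real algorithm may differ.

```python
from dataclasses import dataclass

@dataclass
class ClusterDecision:
    cluster_id: str
    decision: str        # "approve" or "reject"
    confidence: float
    cluster_size: int

def tune_decision_threshold(decisions, target_recall=0.95):
    """Return (threshold, expected_precision, calibration_curve)."""
    approved_total = sum(d.decision == "approve" for d in decisions)
    if approved_total == 0:
        raise ValueError("need at least one approved cluster")
    curve = []
    best = None
    # Try each observed confidence as a threshold, highest first.
    for t in sorted({d.confidence for d in decisions}, reverse=True):
        selected = [d for d in decisions if d.confidence >= t]
        true_pos = sum(d.decision == "approve" for d in selected)
        precision = true_pos / len(selected)
        recall = true_pos / approved_total
        curve.append((t, precision, recall))
        if best is None and recall >= target_recall:
            best = (t, precision)   # highest threshold meeting the target
    threshold, expected_precision = best
    return threshold, expected_precision, curve

decisions = [
    ClusterDecision("c1", "approve", 0.9, 3),
    ClusterDecision("c2", "approve", 0.8, 2),
    ClusterDecision("c3", "reject", 0.7, 5),
    ClusterDecision("c4", "approve", 0.6, 2),
    ClusterDecision("c5", "reject", 0.4, 8),
]
threshold, expected_precision, curve = tune_decision_threshold(decisions, target_recall=0.95)
print(threshold, expected_precision)
```

With these five decisions, reaching 95% recall on approved clusters requires lowering the threshold to 0.6, at which point one rejected cluster is also admitted and the estimated precision drops to 0.75.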

Auto-Config Integration

Learning Memory integrates with GoldenMatch's AutoConfig system, which automatically optimizes matching parameters. The pick_committed method in the controller determines which candidate configurations to commit. Source: packages/python/goldenmatch/goldenmatch/core/autoconfig_golden_strategy_tuner.py:1-100

Commit Priority

The controller uses a multi-factor ranking for candidate selection:

Health rank: Overall dataset health improvement
Mass separation: Gap between the match and non-match score distributions
Zero-label confidence (v1.23.0+): Preference for higher overall_confidence among zero-label profile candidates

def pick_committed(self, candidates: list[Candidate]) -> Candidate:
    """
    Select the best candidate using health rank, mass separation,
    and zero-label confidence tiebreaker.
    """

AutoConfig Workflow

graph LR
    A[Initialize Config] --> B[Generate Candidates]
    B --> C[Score Candidates]
    C --> D[Learn from Corrections]
    D --> E[Update Weights]
    E --> F[Filter Candidates]
    F --> G{Rank by Health?}
    G -->|Yes| H[Select Best]
    G -->|No| I[Apply Learning Memory]
    I --> E
    H --> J[Commit Configuration]
    J --> K[Execute Matching]
    K --> L[Generate Review Queue]
    L --> M[Human Feedback]
    M --> D

Memory in Postgres Extension

The goldenmatch_pg extension (v0.5.0+) provides native Postgres functions for Learning Memory operations, enabling SQL-native corrections and statistics. Source: packages/python/goldenmatch/CHANGELOG.md

Available Functions

Function	Purpose
`memory_learn()`	Record corrections from SQL
`memory_stats()`	Retrieve learning statistics
`memory_clear()`	Reset corrections for a dataset

SQL Usage Example

-- Record a correction
SELECT memory_learn(
    'record_a_id',
    'record_b_id', 
    'approve',  -- or 'reject'
    0.95
);

-- Get learning statistics
SELECT * FROM memory_stats('my_dataset');

-- Clear corrections for re-learning
SELECT memory_clear('my_dataset');

Usage Patterns

Basic Correction Flow

from goldenmatch import GoldenMatch, MemoryCorrections

# Initialize with memory
gm = GoldenMatch(config)
corrections = MemoryCorrections()

# Run initial matching
results = gm.match(data)

# Present review queue to human (human_review is your own callback returning "approve" or "reject")
for pair in results.review_queue:
    decision = human_review(pair)
    corrections.add_correction(
        record_id_a=pair.id_a,
        record_id_b=pair.id_b,
        decision=decision,
        confidence=0.95
    )

# Apply learning and re-run
gm.memory_learner.learn(corrections)
refined_results = gm.match(data)  # Uses learned weights

Field Strategy Tuning

from goldenmatch.core.autoconfig_field_strategy_tuner import FieldStrategyTuner

tuner = FieldStrategyTuner(dataset_id="my_dataset")

# Tune from corrections
strategy_map = tuner.tune(
    corrections=corrections,
    fields=["name", "address", "phone", "email"]
)

# Apply to config
config.field_strategies = strategy_map

Cluster Threshold Tuning

from goldenmatch.core.autoconfig_cluster_threshold_tune import ClusterDecisionTuner

tuner = ClusterDecisionTuner()

# Tune from cluster decisions
threshold = tuner.tune_decision_threshold(
    decisions=cluster_decisions,
    target_recall=0.97
)

# Apply auto-approve threshold
config.auto_approve_threshold = threshold

Configuration Options

Option	Default	Description
`memory.enabled`	`True`	Enable/disable learning memory
`memory.learning_rate`	`0.1`	Rate at which new corrections update weights
`memory.decay_factor`	`0.95`	Weight decay for older corrections
`memory.min_corrections`	`10`	Minimum corrections before tuning
`memory.strategy`	`"balanced"`	Tuning strategy: `balanced`, `precision`, `recall`

Version History

Version	Feature
v1.19.0	Initial Learning Memory with MemoryLearner
v1.20.0	Added ClusterDecisionTuner (third tuner in family)
v1.23.0	Zero-label confidence tiebreaker in `pick_committed`
v1.24.0	Heuristic rule expansion + diagnostic harness

Limitations and Considerations

Data Requirements

Learning Memory requires a minimum number of corrections (default: 10) before producing reliable recommendations
Highly imbalanced datasets (rare true matches) may need more corrections for accurate threshold tuning

Convergence

Weights converge faster when corrections are evenly distributed across matchkey types
Cluster threshold tuning may require iterative refinement for datasets with unusual cluster size distributions

Production Considerations

Periodically review learned weights to ensure they remain aligned with business rules
Reset memory when significant schema changes occur
Monitor precision/recall drift over time

Source: https://github.com/benseverndev-oss/goldenmatch / Human Manual

Blocking and Scoring

Related topics: Core Matching Engine

Overview

Blocking and scoring are the two core mechanisms that enable GoldenMatch to perform entity resolution at scale. Without blocking, comparing every record against every other record would result in O(n²) comparisons—a computationally infeasible approach for datasets with millions of records. The blocking phase reduces the candidate space by grouping records that are likely to represent the same entity, while the scoring phase evaluates the similarity of each candidate pair to determine whether they should be merged.
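
To make the scale argument concrete, the short calculation below compares the all-pairs count with the count after blocking. The block layout (100,000 blocks of 10 records) is a made-up example, and the reduction ratio is computed with the common textbook definition, which may not match how GoldenMatch reports `reduction_ratio`.

```python
from math import comb

def total_pairs(n: int) -> int:
    """Number of unordered record pairs in an n-record dataset."""
    return comb(n, 2)

def blocked_pairs(block_sizes: list[int]) -> int:
    """Pairs compared when records are only compared inside their own block."""
    return sum(comb(size, 2) for size in block_sizes)

n = 1_000_000
sizes = [10] * 100_000                 # hypothetical blocking outcome
all_pairs = total_pairs(n)             # 499,999,500,000 pairs without blocking
candidates = blocked_pairs(sizes)      # 4,500,000 pairs with blocking
reduction_ratio = 1 - candidates / all_pairs
print(all_pairs, candidates, round(reduction_ratio, 6))
```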

In GoldenMatch, these mechanisms work together through a configurable pipeline that supports multiple blocking strategies (exact, fuzzy, token-based, and approximate nearest neighbor), multiple scoring approaches (string similarity, field weighting, and optional LLM-based evaluation), and a feedback-driven tuning system that learns from user corrections to improve accuracy over time.

Source: packages/python/goldenmatch/examples/README.md

Architecture

The blocking and scoring pipeline follows a multi-stage architecture:

graph TD
    A[Input Records] --> B[Standardization]
    B --> C[Blocking Strategy]
    C --> D[Candidate Pairs]
    D --> E[Field-Level Scoring]
    E --> F[Record-Level Score]
    F --> G[Clustering]
    G --> H[Golden Records]
    C -->|ann_* strategies| I[Approximate Nearest Neighbors]
    F -->|optional| J[Cross-Encoder Reranking]

Components

Component	Purpose	Key Classes/Modules
Blocker	Generates candidate pairs using blocking keys	`blocker.py`
Matchkey	Defines how fields contribute to blocking and scoring	`matchkey.py`
Scorer	Computes field and record-level similarity	`scoring.py`
Cross-Encoder	Optional reranking of candidate pairs	`cross_encoder.py`
Controller	Orchestrates the pipeline and applies thresholds	`controller.py`

Source: packages/python/goldenmatch/examples/README.md

Blocking Strategies

Exact Blocking

Exact blocking groups records that share identical values on one or more key fields. This is the simplest and fastest approach, ideal for high-quality, well-standardized data.

blocking = {
    "strategy": "exact",
    "fields": ["email"]
}
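
Conceptually, exact blocking is a group-by on the key fields followed by pairing within each group. The sketch below illustrates that idea on plain dictionaries; the helper names and the `id` field are assumptions, not part of the GoldenMatch API.

```python
from collections import defaultdict
from itertools import combinations

def exact_blocks(records: list[dict], fields: list[str]) -> dict:
    """Group record ids by the exact values of the blocking fields."""
    blocks = defaultdict(list)
    for rec in records:
        key = tuple(rec[f] for f in fields)
        blocks[key].append(rec["id"])
    return blocks

def candidate_pairs(blocks: dict) -> set:
    """Emit every unordered pair of ids that share a block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

records = [
    {"id": 1, "email": "a@x.com"},
    {"id": 2, "email": "b@x.com"},
    {"id": 3, "email": "a@x.com"},
]
pairs = candidate_pairs(exact_blocks(records, ["email"]))
print(pairs)
```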

Sorted Neighborhood Blocking

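Records are sorted by their blocking key values, and each record is compared only with its neighbors inside a sliding window of window_size records. This catches near-duplicates whose keys differ slightly (for example, a typo near the end of a surname), which exact blocking would place in different blocks.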

blocking = {
    "strategy": "sorted_neighborhood",
    "fields": ["last_name", "zip5"],
    "window_size": 5
}
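
The sliding window can be sketched as below. This is a generic illustration of the sorted-neighborhood technique under the assumption that a window of size w pairs each record with the next w - 1 records in sort order; the function name and the `id` field are hypothetical.

```python
def sorted_neighborhood_pairs(records: list[dict], fields: list[str], window_size: int) -> set:
    """Pair each record with the next (window_size - 1) records in key order."""
    ordered = sorted(records, key=lambda r: tuple(r[f] for f in fields))
    pairs = set()
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1 : i + window_size]:
            pairs.add(tuple(sorted((rec["id"], other["id"]))))
    return pairs

records = [
    {"id": 1, "last_name": "Smith", "zip5": "12345"},
    {"id": 2, "last_name": "Smyth", "zip5": "12345"},
    {"id": 3, "last_name": "Jones", "zip5": "12345"},
    {"id": 4, "last_name": "Smithe", "zip5": "12345"},
]
narrow = sorted_neighborhood_pairs(records, ["last_name", "zip5"], window_size=2)
wide = sorted_neighborhood_pairs(records, ["last_name", "zip5"], window_size=3)
print(sorted(narrow), sorted(wide))
```

Widening the window from 2 to 3 adds the Smith/Smyth pair, which the narrower window misses because "Smithe" sorts between them. Larger windows trade more comparisons for better recall.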

Multi-Pass Blocking

For complex datasets, multiple blocking passes with different keys can capture different types of matches:

blocking = [
    {"strategy": "exact", "fields": ["email"]},
    {"strategy": "sorted_neighborhood", "fields": ["last_name", "first_name"], "window_size": 5},
    {"strategy": "sorted_neighborhood", "fields": ["phone"]}
]

Source: packages/python/goldenmatch/examples/README.md

Approximate Nearest Neighbor (ANN) Blocking

For high-cardinality string fields, ANN blocking provides efficient similarity-based candidate generation:

blocking = {
    "strategy": "ann_l2",
    "fields": ["full_address"],
    "distance_threshold": 0.3
}

The v1.24.0 release introduced the QIS-bucket strategy and Chao1 scale-aware cardinality estimation, which together improve blocking accuracy on large datasets. On a 10M-record benchmark, the release cut wall-clock time by 81% (2604s → 502s) while holding F1 at 0.9886.

Source: README.md

Blocking Configuration Fields

Field	Type	Description
`strategy`	string	One of: `exact`, `sorted_neighborhood`, `ann_l2`, `ann_cosine`, `canopy`, `learned`
`fields`	list[string]	Fields to use for blocking
`window_size`	int	Window size for sorted neighborhood (default: 3)
`distance_threshold`	float	Distance threshold for ANN strategies
`extras`	dict	Advanced strategy-specific parameters

Source: packages/python/goldenmatch/web/frontend/src/lib/types.ts

Scoring

Field-Level Scoring

Each field contributes a similarity score based on its configured strategy:

Strategy	Description	Use Case
`levenshtein`	Character-level edit distance	Names, addresses
`jaro_winkler`	Optimized for short strings	Names
`token_set`	Set intersection of tokens	Address components
`numeric`	Absolute difference / range	Dates, amounts
`exact`	Binary match/no-match	IDs, codes
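
For intuition, three of these strategies can be written as small pure-Python functions that return a score between 0 and 1. The normalization choices here (edit distance divided by the longer length, Jaccard overlap for token sets) are assumptions for illustration and may not match GoldenMatch's internal scorers.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def levenshtein_similarity(a: str, b: str) -> float:
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

def token_set_similarity(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return 1.0 if not union else len(ta & tb) / len(union)

def exact_similarity(a: str, b: str) -> float:
    return 1.0 if a == b else 0.0

print(levenshtein_similarity("John", "Jon"))
print(token_set_similarity("12 Main St", "Main St 12"))
print(exact_similarity("A-100", "A-101"))
```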

Matchkey Definition

Matchkeys define how fields participate in both blocking and scoring:

matchkeys = [
    {
        "fields": ["email"],
        "blocking": True,
        "score_type": "exact",
        "weight": 1.0
    },
    {
        "fields": ["first_name", "last_name"],
        "blocking": True,
        "score_type": "token_set",
        "weight": 0.8
    },
    {
        "fields": ["phone"],
        "blocking": True,
        "score_type": "levenshtein",
        "weight": 0.6
    }
]

Source: packages/python/goldenmatch/examples/README.md

Record-Level Scoring

The record-level score combines field scores using weighted averaging:

record_score = Σ(field_score × field_weight) / Σ(field_weight)

Pairs whose record-level score exceeds the threshold are linked. The default threshold is tuned automatically from dataset characteristics by the autoconfig system, which the version history lists under v1.19.0.
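
The weighted average can be checked with a few lines of Python. The function name is hypothetical; the weights reuse the 1.0 / 0.8 / 0.6 values from the matchkey example, and the field scores are made up.

```python
def record_score(field_scores: dict, weights: dict) -> float:
    """Weighted average of field scores, as in the formula above."""
    total_weight = sum(weights[name] for name in field_scores)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    weighted = sum(field_scores[name] * weights[name] for name in field_scores)
    return weighted / total_weight

weights = {"email": 1.0, "name": 0.8, "phone": 0.6}
field_scores = {"email": 1.0, "name": 0.5, "phone": 0.9}   # hypothetical pair
score = record_score(field_scores, weights)
print(round(score, 4))   # (1.0 + 0.4 + 0.54) / 2.4
```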

Source: README.md

Weighted Matchkeys

GoldenMatch supports sophisticated weighting schemes:

matchkeys = [
    {"fields": ["company_name"], "weight": 0.7, "score_type": "token_set"},
    {"fields": ["address", "city"], "weight": 0.3, "score_type": "token_set"},
    # Optional multi-pass with different weight profiles
]

The equipment deduplication example demonstrates multi-pass blocking with ANN fallback, weighted fuzzy matching, and LLM calibration for challenging datasets.

Source: packages/python/goldenmatch/examples/README.md

Cross-Encoder Reranking

The optional cross-encoder module provides a secondary scoring pass that considers field interactions:

from goldenmatch.core.cross_encoder import CrossEncoderScorer

scorer = CrossEncoderScorer(model="cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(record_a, record_b) for record_a, record_b in candidate_pairs]
reranked = scorer.score_batch(pairs)

Cross-encoding is particularly valuable when field combinations carry more signal than individual fields—for example, a name and address together are more distinctive than either alone.

Source: packages/python/goldenmatch/examples/README.md

Configuration Payload

The web frontend communicates blocking and scoring configuration using a typed payload:

export type RulesPayload = {
  threshold: number;
  matchkeys: Matchkey[];
  standardization?: StandardizationRules | null;
  blocking?: BlockingPayload | null;
};

Known blocking keys are validated at the server boundary, while unknown keys are preserved in an extras field for advanced strategies:

const BLOCKING_KNOWN_KEYS = new Set([
  "strategy", "fields", "window_size", "distance_threshold",
  "_block_size", "skip_oversized", "auto_suggest", "auto_select"
]);

Source: packages/python/goldenmatch/web/frontend/src/lib/types.ts
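
The known-key / extras split can be pictured with the following Python sketch. It mirrors the TypeScript set above verbatim, but the function name and the exact shape of the returned dictionary are assumptions rather than the server's actual behavior.

```python
BLOCKING_KNOWN_KEYS = {
    "strategy", "fields", "window_size", "distance_threshold",
    "_block_size", "skip_oversized", "auto_suggest", "auto_select",
}

def split_blocking_payload(raw: dict) -> dict:
    """Keep known keys at the top level and tuck everything else under 'extras'."""
    known = {k: v for k, v in raw.items() if k in BLOCKING_KNOWN_KEYS}
    extras = {k: v for k, v in raw.items() if k not in BLOCKING_KNOWN_KEYS}
    if extras:
        known["extras"] = extras
    return known

payload = split_blocking_payload({
    "strategy": "ann_l2",
    "fields": ["full_address"],
    "distance_threshold": 0.3,
    "n_probe": 8,            # hypothetical strategy-specific parameter
})
print(payload)
```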

Standardization Pipeline

Effective blocking and scoring depend on data standardization. GoldenFlow provides the transform library that should be applied before matching:

Transform Category	Examples
Text	`strip`, `lowercase`, `normalize_unicode`, `normalize_quotes`
Phone	`phone_e164`, `phone_national`
Address	`address` (full address standardization)
Numeric	`extract_numbers`, `parse_currency`

standardization = {
    "email": "lowercase",
    "phone": "phone_e164",
    "address": "address",
    "state": "state"
}

Source: packages/python/goldenflow/README.md

Performance Considerations

Bucket Strategies

The QIS-bucket strategy (v1.24.0) provides scale-aware cardinality estimation that adjusts bucket parameters based on dataset size. This prevents both over-blocking (too many candidates) and under-blocking (missing matches).

Memory Reduction

The v1.24.0 release achieved an 18% RSS reduction through optimized data structures and streaming processing. Key optimizations include:

Single-group-by-per-column vectorization for golden record building
Lazy evaluation of scoring for low-confidence pairs
Chunked processing for very large candidate sets

Source: README.md

Auto-Configuration

GoldenMatch can automatically tune blocking and scoring parameters:

from goldenmatch.core.autoconfig import auto_configure

config = auto_configure(
    data=df,
    ground_truth=labels_df,  # Optional for supervised tuning
    target_recall=0.95
)

The autoconfig system uses:

Chao1 estimation for cardinality-aware blocking tuning
Zero-label confidence analysis for threshold calibration (v1.23.0)
Cluster-level tuning for decision threshold optimization (v1.20.0)

Source: README.md

Memory and Learning

GoldenMatch maintains learned patterns across runs:

Memory Type	Scope	Purpose
`MemoryLearner`	Pair-level	Learn from labeled match/non-match pairs
`field-strategy tuner`	Field-level	Optimize per-field scoring strategy
`cluster-decision tuner`	Cluster-level	Tune merge/reject thresholds

These learning mechanisms enable the system to improve accuracy over time as users correct its decisions.

Source: README.md

Workflow Summary

graph LR
    A[Raw Data] --> B[GoldenFlow Standardization]
    B --> C[Blocking]
    C --> D[Scoring]
    D --> E[Clustering]
    E --> F{Threshold}
    F -->|Above| G[Auto-Approve]
    F -->|Below| H[Auto-Reject]
    F -->|Uncertain| I[Review Queue]
    I --> J[User Labels]
    J --> K[Memory Update]
    K --> C
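
The three-way branch in the diagram amounts to comparing a score against two cut-offs. The sketch below shows that routing; the function name and the 0.9 / 0.6 values are placeholders, not GoldenMatch defaults.

```python
def route(score: float, approve_at: float = 0.9, reject_below: float = 0.6) -> str:
    """Route a scored cluster to auto-approve, auto-reject, or human review."""
    if score >= approve_at:
        return "auto_approve"
    if score < reject_below:
        return "auto_reject"
    return "review_queue"    # uncertain band between the two cut-offs

routed = [route(s) for s in (0.95, 0.75, 0.30)]
print(routed)
```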

Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.

Found 6 structured pitfall item(s), including 0 high/blocking item(s). Top priority: Capability evidence risk - Capability evidence risk requires verification.

1. Capability evidence risk: Capability evidence risk requires verification

Severity: medium
Finding: README/documentation is current enough for a first validation pass.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: capability.assumptions | github_repo:1183640892 | https://github.com/benseverndev-oss/goldenmatch

2. Maintenance risk: Maintenance risk requires verification

Severity: medium
Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | github_repo:1183640892 | https://github.com/benseverndev-oss/goldenmatch

3. Security or permission risk: Security or permission risk requires verification

Severity: medium
Finding: no_demo
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: downstream_validation.risk_items | github_repo:1183640892 | https://github.com/benseverndev-oss/goldenmatch

4. Security or permission risk: Security or permission risk requires verification

Severity: medium
Finding: no_demo
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: risks.scoring_risks | github_repo:1183640892 | https://github.com/benseverndev-oss/goldenmatch

5. Maintenance risk: Maintenance risk requires verification

Severity: low
Finding: issue_or_pr_quality=unknown.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | github_repo:1183640892 | https://github.com/benseverndev-oss/goldenmatch

6. Maintenance risk: Maintenance risk requires verification

Severity: low
Finding: release_recency=unknown.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | github_repo:1183640892 | https://github.com/benseverndev-oss/goldenmatch

Source: Doramagic discovery, validation, and Project Pack records

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Sources 11

Count of project-level external discussion links exposed on this manual page.

Use Review before install

Open the linked issues or discussions before treating the pack as ready for your environment.

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using goldenmatch with real data or production workflows.

v1.24.0 - github / github_release
v1.23.0 - auto-config recall + zero-label commit default - github / github_release
v1.22.0 - github / github_release
v1.21.0 - github / github_release
goldenmatch-native v0.1.0 - github / github_release
goldenmatch v1.20.0 - github / github_release
goldenmatch v1.19.0 - github / github_release
infermap-js v0.5.0 - github / github_release
goldenpipe-js v0.2.0 - github / github_release
goldenmatch_pg v0.5.0 - github / github_release
Capability evidence risk requires verification - GitHub / issue

Source: Project Pack community evidence and pitfall evidence
