Doramagic Project Pack · Human Manual

goldenmatch


Home

Related topics: Getting Started, Suite Packages Overview


Golden Suite

A polyglot data-quality and entity-resolution toolkit. Polished, opinionated, AI-native.

*GoldenCheck profiles → GoldenFlow standardizes → GoldenMatch deduplicates → GoldenPipe orchestrates. With InferMap for schema mapping and a Rust extension layer for Postgres / DuckDB.*

Overview

Golden Suite is a comprehensive toolkit for data quality and entity resolution, designed to handle the complete lifecycle of messy data: profiling, standardization, deduplication, and orchestration. The project supports both Python and TypeScript ecosystems, with optional Rust acceleration for high-performance workloads.

Source: README.md:1-5

Packages Overview

| Tool | Languages | Purpose | Install |
| --- | --- | --- | --- |
| GoldenMatch | Python · TS | Zero-config entity resolution. Fuzzy + exact + probabilistic + LLM. Headline package. | pip install goldenmatch · npm i goldenmatch |
| GoldenCheck | Python · TS | Data-quality scanning: encoding, Unicode, format validation, anomaly detection. | pip install goldencheck · npm i goldencheck |
| GoldenFlow | Python · TS | Transforms & standardizers: phone, date, address, categorical normalization. | pip install goldenflow · npm i goldenflow |
| GoldenPipe | Python · TS | Orchestrator that wires Check → Flow → Match into one declarative pipeline. | pip install goldenpipe · npm i goldenpipe |
| InferMap | Python · TS | Schema mapping engine: auto-aligns columns across heterogeneous sources. | pip install infermap · npm i infermap |
| goldenmatch-extensions | Rust | Postgres extension (pgrx) + DuckDB UDFs. SQL-native fuzzy matching. | source build |
| dbt-goldensuite | dbt · Python | dbt package: quality-gate tests, correction CRUD macros + GoldenCheck assertions for warehouse models. | pip install dbt-goldensuite |
| goldencheck-action | YAML | GitHub Action: CI with PR comments for data validation. | uses: benseverndev-oss/goldencheck-action@v1 |

Source: README.md:28-41

Architecture

graph LR
    A[Raw Data] --> B[GoldenCheck<br/>Profile & Validate]
    B --> C[GoldenFlow<br/>Standardize]
    C --> D[InferMap<br/>Schema Mapping]
    D --> E[GoldenMatch<br/>Deduplicate]
    E --> F[GoldenPipe<br/>Orchestrate]
    
    G[Postgres/DuckDB] --> E
    H[GitHub CI] --> B
    I[MCP Server] --> F

Data Flow

  1. GoldenCheck profiles your data and discovers quality rules automatically
  2. GoldenFlow transforms messy fields into canonical formats
  3. InferMap aligns columns across heterogeneous schemas
  4. GoldenMatch identifies and merges duplicate records
  5. GoldenPipe orchestrates the entire pipeline declaratively

Source: packages/python/goldencheck/README.md:1-10

Quick Start

Python

# Headline package: dedupe a CSV in 30 seconds
pip install goldenmatch && goldenmatch dedupe customers.csv

Source: README.md:64-66

TypeScript / Node.js

# TypeScript / Edge runtimes
npm install goldenmatch

Source: README.md:69

GoldenMatch

The headline package for entity resolution. Supports multiple matching strategies:

  • Fuzzy matching — handles typos and variations
  • Exact matching — bit-for-bit comparisons
  • Probabilistic matching: Bayesian-style confidence scoring
  • LLM matching — semantic clustering with language models
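The difference between the exact and fuzzy strategies can be illustrated with a minimal standard-library sketch. This is not GoldenMatch's implementation: the function names, the use of `difflib.SequenceMatcher`, and the 0.85 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def exact_match(a: str, b: str) -> bool:
    """Exact strategy: the two values must be identical."""
    return a == b


def fuzzy_score(a: str, b: str) -> float:
    """Fuzzy strategy: similarity ratio in [0, 1] after light normalization."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def fuzzy_match(a: str, b: str, threshold: float = 0.85) -> bool:
    # The 0.85 threshold is an illustrative value, not a GoldenMatch default.
    return fuzzy_score(a, b) >= threshold


if __name__ == "__main__":
    # A one-letter typo fails the exact test but passes the fuzzy one.
    print(exact_match("Jon Smith", "John Smith"))
    print(fuzzy_match("Jon Smith", "John Smith"))
```

Exact comparison suits stable identifiers, while a similarity threshold tolerates typos at the cost of occasional false positives.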

Key Features

  • Zero-config dedup for common cases
  • Configurable matchkeys and blocking strategies
  • Field-level score explanations
  • Streaming/incremental matching for new records
  • PPRL (Privacy-Preserving Record Linkage) for cross-organization matching
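As background on the PPRL item, one widely used approach encodes each value's character bigrams into a keyed Bloom filter and compares encodings with the Dice coefficient, so that parties exchange bit positions rather than raw values. The sketch below shows that general technique only; whether GoldenMatch uses it is not stated in the source, and every name and parameter here (`bloom_encode`, `secret`, `size`, `num_hashes`) is hypothetical.

```python
import hashlib


def bigrams(value: str) -> set[str]:
    """Character bigrams of the normalized value, padded at both ends."""
    padded = f"_{value.strip().lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value: str, secret: str, size: int = 1024, num_hashes: int = 4) -> set[int]:
    """Return the set of bit positions that the value switches on."""
    bits: set[int] = set()
    for gram in bigrams(value):
        for k in range(num_hashes):
            # Both parties must share the secret for encodings to be comparable.
            digest = hashlib.sha256(f"{secret}|{k}|{gram}".encode("utf-8")).digest()
            bits.add(int.from_bytes(digest[:8], "big") % size)
    return bits


def dice(a: set[int], b: set[int]) -> float:
    """Dice coefficient between two encodings (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    key = "shared-secret"  # illustrative only
    print(dice(bloom_encode("Smith", key), bloom_encode("Smyth", key)))
    print(dice(bloom_encode("Smith", key), bloom_encode("Jones", key)))
```

Similar names share most bigrams and therefore most bit positions, which is what lets fuzzy comparison survive the encoding.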

Performance

v1.24.0 reduced wall time on the 10M-record benchmark by 81% without changing F1:

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| 10M records (wall time) | 2604s | 502s | -81% |
| Peak RSS | baseline | - | 18% reduction |
| F1 Score | 0.9886 | 0.9886 | invariant |

Source: packages/python/goldenmatch/README.md:1-50

GoldenCheck

Data validation that discovers rules from your data so you don't have to write them.

Core Capabilities

  • Automatic rule discovery from data patterns
  • Encoding and Unicode validation
  • Format validation and anomaly detection
  • Health score grading (A-F)
  • Multiple output formats: HTML, JSON, TUI

Domain Type Packs

Community-contributed semantic type definitions for improved detection:

| Domain | Types | Description |
| --- | --- | --- |
| healthcare | 10 | NPI, ICD codes, insurance IDs, patient demographics, CPT, DRG |
| finance | 8 | Account numbers, routing numbers, CUSIP/ISIN, currency, transactions |
| ecommerce | 9 | SKUs, order IDs, tracking numbers, categories, shipping |

Source: packages/typescript/goldencheck-types/README.md:1-30

GoldenFlow

Transforms and standardizers for messy data fields.

Transform Categories

| Category | Count | Examples |
| --- | --- | --- |
| Text Transforms | 18 | strip, lowercase, normalize_unicode, remove_html_tags |
| Phone Transforms | 5 | phone_e164, phone_national, phone_format |
| Date Transforms | 7 | date_parse, date_format, date_floor |
| Address Transforms | 6 | address_parse, address_standardize |
| Numeric Transforms | 4 | parse_currency, parse_number |
| Categorical Transforms | 4 | category_normalize, category_map |

Source: packages/python/goldenflow/README.md:1-100

InferMap

Inference-driven schema mapping engine. Maps messy source columns to known target schemas with confidence scores and human-readable reasoning.
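To make "confidence scores" concrete, here is a minimal sketch of name-based column mapping. It is an assumption-laden illustration, not InferMap's API: `map_columns`, `normalize`, and the 0.6 cutoff are hypothetical, and real schema mapping would typically also inspect column values.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def map_columns(
    source: list[str], target: list[str], min_confidence: float = 0.6
) -> dict[str, tuple[str, float]]:
    """Map each source column to its most similar target column.

    Returns {source_column: (target_column, confidence)}; columns whose best
    score falls below min_confidence are left unmapped.
    """
    mapping: dict[str, tuple[str, float]] = {}
    for col in source:
        scored = [
            (SequenceMatcher(None, normalize(col), normalize(t)).ratio(), t)
            for t in target
        ]
        confidence, best = max(scored)
        if confidence >= min_confidence:
            mapping[col] = (best, round(confidence, 2))
    return mapping


if __name__ == "__main__":
    print(map_columns(["First Name", "e-mail", "Phone #"], ["first_name", "email", "phone"]))
```

This greedy version does not enforce a one-to-one assignment; two source columns may map to the same target.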

Supported Data Sources

  • CSV files
  • DataFrames
  • Database tables
  • In-memory records

TypeScript Compatibility

  • Next.js Server Components
  • Route Handlers
  • Server Actions
  • Edge Runtime

Source: packages/python/infermap/README.md:1-80

Optional Components

Native Acceleration

For maximum performance on large datasets:

pip install "goldenmatch[native]"

This pulls goldenmatch-native, a separately distributed compiled (Rust/PyO3 abi3) runtime.

MCP Server (Claude Desktop)

pip install "goldencheck[mcp]"

Source: packages/python/goldencheck/README.md:1-50

Integrations

dbt

Add data-quality gates to dbt:

# packages.yml (next to dbt_project.yml)
packages:
  - package: benseverndev-oss/dbt-goldensuite

GitHub Actions

- uses: benseverndev-oss/goldencheck-action@v1
  with:
    files: "data/*.csv"
    fail-on: error

Airflow

12 drop-in DAGs available at examples/airflow/.

MCP Container

Run from a single MCP container:

docker run ghcr.io/benseverndev-oss/goldensuite-mcp:latest

Source: README.md:42-60

Web UI

GoldenMatch includes an interactive web workbench:

pip install "goldenmatch[web]"
goldenmatch serve-ui <project>

Features:

  • Pair drilldown with cluster members
  • Field-level diff view
  • Natural language explanations per pair

Source: README.md:58-62

Latest Release

v1.24.0: on the 10M-record QIS-bucket-realistic benchmark, wall time dropped from 2604s to 502s (-81%) with F1 unchanged at 0.9886 and peak RSS reduced by 18%.

Key improvements:

  • ~15 performance PRs
  • Chao1 scale-aware cardinality
  • Heuristic rule expansion
  • Diagnostic harness

See CHANGELOG.md for the full PR list.

Source: community_context

Getting Help

| Resource | Link |
| --- | --- |
| Documentation | Wiki |
| Examples | Python examples · TypeScript examples |
| Issues | GitHub Issues |
| Discussions | GitHub Discussions |

License

MIT — see LICENSE

Source: https://github.com/benseverndev-oss/goldenmatch / Human Manual

Getting Started

Related topics: Installation, System Architecture


The Golden Suite is a polyglot data-quality and entity-resolution toolkit designed for deduplication, data standardization, and schema mapping. It consists of five core packages: GoldenCheck (data validation), GoldenFlow (transforms and standardization), GoldenMatch (record deduplication), GoldenPipe (pipeline orchestration), and InferMap (schema mapping). Source: README.md:1-10

This guide walks you through installation, basic usage patterns, and recommended starting points for common use cases.


Suite Packages Overview

Related topics: Core Matching Engine


The Golden Suite is a polyglot data-quality and entity-resolution toolkit that provides a complete pipeline from data profiling to deduplication. The suite consists of multiple purpose-built packages that can be used independently or composed together for end-to-end workflows.

Source: README.md:1-15

Package Architecture

graph TD
    A[Raw Data] --> B[GoldenCheck]
    B --> C[GoldenFlow]
    C --> D[InferMap]
    D --> E[GoldenMatch]
    E --> F[GoldenPipe]
    
    G[goldenmatch-extensions] --> E
    H[dbt-goldensuite] --> B
    I[goldencheck-types] --> B
    
    J[goldencheck-action] --> B
    K[GoldenCheck MCP] --> B
    L[Goldensuite MCP] --> F

Package Summary

| Package | Languages | Purpose | Install |
| --- | --- | --- | --- |
| GoldenMatch | Python, TypeScript | Zero-config entity resolution. Fuzzy + exact + probabilistic + LLM | pip install goldenmatch / npm i goldenmatch |
| GoldenCheck | Python, TypeScript | Data-quality scanning: encoding, Unicode, format validation, anomaly detection | pip install goldencheck / npm i goldencheck |
| GoldenFlow | Python, TypeScript | Transforms & standardizers: phone, date, address, categorical normalization | pip install goldenflow / npm i goldenflow |
| GoldenPipe | Python, TypeScript | Orchestrator that wires Check → Flow → Match into one declarative pipeline | pip install goldenpipe / npm i goldenpipe |
| InferMap | Python, TypeScript | Schema mapping engine: auto-aligns columns across heterogeneous sources | pip install infermap / npm i infermap |
| goldenmatch-extensions | Rust | Postgres extension (pgrx) + DuckDB UDFs. SQL-native fuzzy matching | source build |
| dbt-goldensuite | dbt, Python | dbt package: quality-gate tests, correction CRUD macros + GoldenCheck assertions | pip install dbt-goldensuite |
| goldencheck-action | | GitHub Action for CI with PR comments | marketplace |
| goldencheck-types | TypeScript | Community-contributed domain type packs (healthcare, finance, e-commerce) | npm i goldencheck-types |

Source: README.md:80-100

GoldenCheck

GoldenCheck is the data-quality scanning and profiling component of the Golden Suite. It detects issues such as encoding problems, Unicode anomalies, format violations, and data anomalies.

Core Capabilities

  • Encoding Detection: Identifies and reports encoding issues in text fields
  • Unicode Validation: Detects malformed Unicode sequences and normalization issues
  • Format Validation: Validates against expected formats (email, phone, URL, etc.)
  • Anomaly Detection: Statistical analysis to identify outliers and unusual patterns
  • LLM-Powered Analysis: Uses language models to identify semantic data quality issues missed by automated profilers
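Two of the capabilities above, format validation and statistical anomaly detection, can be sketched with the standard library. This is a simplified illustration under stated assumptions, not GoldenCheck's engine: the regular expression is deliberately loose, and `format_match_pct`, `zscore_outliers`, and the z-score threshold of 3.0 are hypothetical names and values.

```python
import re
import statistics

# Deliberately permissive email shape: something@something.something
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def format_match_pct(values: list[str], pattern: re.Pattern = EMAIL_RE) -> float:
    """Share of non-empty values that match the expected format."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return 0.0
    return sum(bool(pattern.match(v)) for v in non_empty) / len(non_empty)


def zscore_outliers(numbers: list[float], threshold: float = 3.0) -> list[float]:
    """Values lying more than `threshold` standard deviations from the mean."""
    if len(numbers) < 2:
        return []
    mean = statistics.fmean(numbers)
    stdev = statistics.pstdev(numbers)
    if stdev == 0:
        return []
    return [x for x in numbers if abs(x - mean) / stdev > threshold]


if __name__ == "__main__":
    print(format_match_pct(["a@b.co", "bad", "c@d.org", ""]))
    print(zscore_outliers([10.0] * 20 + [1000.0]))
```

A profiler could flag a column when its match percentage falls below a chosen cutoff, or when the outlier list is non-empty.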

Source: packages/python/goldencheck/README.md:1-50

Domain Type Packs

GoldenCheck supports domain-specific type definitions through community-contributed packs:

| Domain | Types Included |
| --- | --- |
| Healthcare | NPI, ICD codes, insurance IDs, patient demographics, CPT, DRG |
| Finance | Account numbers, routing numbers, CUSIP/ISIN, currency, transactions |
| E-commerce | SKUs, order IDs, tracking numbers, categories, shipping |

Source: packages/typescript/goldencheck-types/README.md:1-30

MCP Server

GoldenCheck includes an MCP (Model Context Protocol) server providing 10 agent-level tools, including:

  • analyze_data — Domain detection and strategy recommendation
  • auto_triage — Automated issue classification
  • explain_finding — Natural language explanation of findings
  • explain_column — Column-level analysis
  • compare_domains — Cross-domain comparison
  • generate_handoff — Pipeline handoff generation

Source: packages/typescript/goldencheck/src/node/mcp/agent-tools.ts:1-50

GoldenFlow

GoldenFlow provides 76+ transformation functions for standardizing messy data fields. It focuses on transforming data before matching to improve deduplication accuracy.

Source: packages/python/goldenflow/README.md:1-50

Transform Categories

#### Text Transforms (18)

| Transform | Description |
| --- | --- |
| strip | Trim whitespace |
| lowercase / uppercase | Case conversion |
| title_case | Proper casing ("john smith" → "John Smith") |
| normalize_unicode | NFKD normalization, strip accents |
| normalize_quotes | Smart/curly quotes → straight quotes |
| collapse_whitespace | Multiple spaces → single space |
| remove_punctuation | Strip punctuation characters |
| remove_html_tags | Strip HTML markup from scraped data |
| fix_mojibake | Fix common UTF-8/Latin-1 encoding garbling |
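Several of these text transforms are simple enough to sketch with the standard library. The functions below mirror the behaviour described in the table but are stand-ins written for illustration; they are not GoldenFlow's own code, and the chaining helper `clean` is hypothetical.

```python
import re
import unicodedata


def normalize_unicode(text: str) -> str:
    """NFKD-decompose, then drop combining marks (strips accents)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_quotes(text: str) -> str:
    """Replace curly single and double quotes with straight ones."""
    table = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})
    return text.translate(table)


def collapse_whitespace(text: str) -> str:
    """Collapse any run of whitespace into a single space."""
    return re.sub(r"\s+", " ", text)


def clean(text: str) -> str:
    """Apply a fixed chain of transforms, in order."""
    for step in (str.strip, normalize_unicode, normalize_quotes, collapse_whitespace, str.lower):
        text = step(text)
    return text


if __name__ == "__main__":
    print(clean("  Jos\u00e9   O\u2019Brien "))  # prints: jose o'brien
```

Running such transforms before matching means "José" and "Jose" compare as equal rather than as near-misses.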

#### Phone Transforms (5)

| Transform | Description |
| --- | --- |
| phone_e164 | Any format → +15550123456 |
| phone_national | Any format → (555) 012-3456 |
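The two output formats in the table can be reproduced with a short sketch. It assumes North American (NANP) numbers with country code 1 only, which is a much narrower scope than "any format"; `to_e164` and `to_national` are hypothetical helper names, not the GoldenFlow transforms themselves.

```python
import re


def _nanp_digits(raw: str) -> str:
    """Return the 11-digit NANP number (leading 1) or raise ValueError."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 10:
        digits = "1" + digits  # assume the US/Canada country code
    if len(digits) != 11 or not digits.startswith("1"):
        raise ValueError(f"not a recognizable NANP number: {raw!r}")
    return digits


def to_e164(raw: str) -> str:
    """Format as E.164, e.g. +15550123456."""
    return "+" + _nanp_digits(raw)


def to_national(raw: str) -> str:
    """Format as (555) 012-3456."""
    d = _nanp_digits(raw)[1:]
    return f"({d[:3]}) {d[3:6]}-{d[6:]}"


if __name__ == "__main__":
    print(to_e164("(555) 012-3456"))     # prints: +15550123456
    print(to_national("1-555-012-3456"))  # prints: (555) 012-3456
```

Handling international numbers correctly needs per-country rules, which is why production code usually relies on a dedicated phone-number library.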

#### Date Transforms

  • Date parsing and normalization across multiple formats
  • Timezone normalization

#### Domain-Specific Transforms

| Domain | Capabilities |
| --- | --- |
| Healthcare | NPI, ICD, CPT, DRG parsing, transaction dates, amount parsing |
| E-commerce | SKU normalization, price parsing, order dates, address standardization |
| Real Estate | Property addresses, listing dates, price normalization, geo fields |

Source: packages/python/goldenflow/README.md:50-120

GoldenMatch

GoldenMatch is the core entity resolution (deduplication) engine. It supports multiple matching strategies and scales to millions of records.

Source: README.md:80-85

Matching Strategies

| Strategy | Use Case |
| --- | --- |
| Fuzzy Matching | Name/address variants with typo tolerance |
| Exact Matching | Identifier deduplication |
| Probabilistic Matching | Record linkage with confidence scores |
| LLM Clustering | Semantic product/matching for complex domains |
| PPRL | Privacy-preserving record linkage (cross-organization) |
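For the probabilistic strategy, a common formulation is the Fellegi-Sunter model, where each field contributes a log-likelihood weight depending on whether it agrees. The sketch below illustrates that general model; the source does not say GoldenMatch uses it, and the m/u probabilities, the prior, and the function names are made-up values for demonstration.

```python
import math

# Hypothetical per-field parameters:
#   m = P(field agrees | records are a true match)
#   u = P(field agrees | records are not a match)
FIELD_PARAMS = {
    "name": (0.95, 0.01),
    "zip": (0.90, 0.05),
    "email": (0.98, 0.001),
}


def match_weight(agreements: dict[str, bool]) -> float:
    """Sum of log2 likelihood ratios over the compared fields."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = FIELD_PARAMS[field]
        total += math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))
    return total


def match_probability(agreements: dict[str, bool], prior: float = 0.01) -> float:
    """Posterior match probability via Bayes' rule on the odds scale."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * 2 ** match_weight(agreements)
    return posterior_odds / (1 + posterior_odds)


if __name__ == "__main__":
    print(match_probability({"name": True, "zip": True, "email": True}))
    print(match_probability({"name": True, "zip": False, "email": False}))
```

Fields that rarely agree by chance (low u) carry the most evidence when they do agree, which is why an email match outweighs a zip match here.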

Performance Benchmarks

| Dataset Size | Wall Time | Peak RSS | F1 Score |
| --- | --- | --- | --- |
| 10M records (QIS-bucket-realistic) | 502s (-81%) | 18% reduction | 0.9886 |

Source: v1.24.0 Release Notes

Native Acceleration

An optional Rust-based acceleration runtime is available:

pip install "goldenmatch[native]"

This pulls goldenmatch-native, a separately distributed compiled (Rust/PyO3 abi3) runtime. The native runtime is discovered automatically when installed.

Source: v1.21.0 Release Notes

Configuration Options

| Option | Description |
| --- | --- |
| backend | "bucket" (recommended for 5M+ records), "chunked" |
| auto_config | Auto-tune recall thresholds |
| lineage_provenance | Track source row for golden record fields |

InferMap

InferMap is a schema mapping engine that automatically aligns columns across heterogeneous data sources. It supports both Python and TypeScript with full API parity.

Source: packages/python/infermap/README.md:1-40

Key Features

  • Auto Schema Alignment: Detects and maps columns across different source schemas
  • Custom Scorers: Configurable similarity scoring algorithms
  • Domain Dictionaries: Industry-specific vocabulary for better matching
  • Calibration Tools: Score matrix introspection and tuning
  • Edge Runtime Support: Works in Vercel Edge Runtime and Next.js

Source: packages/python/infermap/README.md:40-80

GoldenPipe

GoldenPipe is the orchestrator that wires Check → Flow → Match into a single declarative pipeline. It enables pipeline definitions in YAML or Python.
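The idea of a declarative pipeline, where the definition is data and a small runner interprets it, can be sketched as follows. This is a toy model, not GoldenPipe's API: the stage functions, `REGISTRY`, and `run_pipeline` are hypothetical, and each stage is reduced to a one-line stand-in.

```python
from typing import Callable

Record = dict[str, str]
Stage = Callable[[list[Record]], list[Record]]


def check(records: list[Record]) -> list[Record]:
    """Stand-in for validation: drop records with no email."""
    return [r for r in records if r.get("email")]


def flow(records: list[Record]) -> list[Record]:
    """Stand-in for standardization: trim and lowercase the email."""
    return [{**r, "email": r["email"].strip().lower()} for r in records]


def match(records: list[Record]) -> list[Record]:
    """Stand-in for deduplication: keep the first record per email."""
    first_seen: dict[str, Record] = {}
    for r in records:
        first_seen.setdefault(r["email"], r)
    return list(first_seen.values())


REGISTRY: dict[str, Stage] = {"check": check, "flow": flow, "match": match}

# The pipeline itself is plain data, so it could equally be loaded from YAML.
PIPELINE = ["check", "flow", "match"]


def run_pipeline(stage_names: list[str], records: list[Record]) -> list[Record]:
    for name in stage_names:
        records = REGISTRY[name](records)
    return records


if __name__ == "__main__":
    rows = [
        {"name": "Ann", "email": " Ann@Example.com "},
        {"name": "Ann B.", "email": "ann@example.com"},
        {"name": "Bob", "email": ""},
    ]
    print(run_pipeline(PIPELINE, rows))
```

Keeping the stage order in data rather than code is what allows the same runner to execute different pipelines without modification.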

Source: README.md:85-90

Pipeline Stages

graph LR
    A[Check] --> B[Flow]
    B --> C[Match]
    C --> D[Output]
    
    style A fill:#e1f5fe
    style B fill:#fff3e0
    style C fill:#e8f5e9
    style D fill:#f3e5f5

Deployment Options

| Runtime | Description |
| --- | --- |
| Airflow | 12 drop-in DAGs for daily/incremental/warehouse-native dedupe |
| MCP Container | Single container: docker run ghcr.io/benseverndev-oss/goldensuite-mcp:latest |
| Python/CLI | Direct script execution |

Source: README.md:90-110

goldenmatch-extensions

A Rust-based Postgres extension using pgrx that provides SQL-native fuzzy matching capabilities.

Source: packages/rust/extensions/README.md:1-30

Capabilities

  • 13 pgrx Functions: Core API parity with Python/TypeScript
  • goldenflow_* Transforms: SQL-level data standardization
  • memory_learn/memory_stats CRUD: Learning memory database operations
  • DuckDB UDFs: Fuzzy matching in DuckDB queries

Source: goldenmatch_pg v0.5.0 Release

dbt-goldensuite

A dbt package that adds Golden Suite capabilities as dbt tests and macros.

Features

  • Quality-gate tests for warehouse models
  • Correction CRUD macros for data repair
  • GoldenCheck assertions integrated with dbt test framework

Source: packages/python/goldenmatch/dbt-goldensuite/README.md:1-20

goldencheck-action

GitHub Action for CI integration that:

  • Runs GoldenCheck scans on PR data changes
  • Posts PR comments with findings
  • Fails builds based on configurable severity thresholds

Source: packages/actions/goldencheck/README.md:1-30

Examples and Quick Start

The repository includes comprehensive examples organized by deployment target:

Source: examples/README.md:1-30

| Directory | Audience | Highlights |
| --- | --- | --- |
| python/ | Python users | 6 scripts: zero-config quickstart, full Suite composed, customer 360, PPRL |
| typescript/ | TypeScript/edge users | 4 scripts: quickstart, Vercel-Edge, MCP client |
| sql/ | SQL/warehouse users | DuckDB + Postgres core-API examples |
| airflow/ | Data-platform users | 12 drop-in DAGs |

Quick Start Commands

# Headline package: dedupe a CSV
pip install goldenmatch && goldenmatch dedupe customers.csv

# TypeScript / Edge
npm install goldenmatch

# Full suite via MCP
docker run ghcr.io/benseverndev-oss/goldensuite-mcp:latest

Source: README.md:110-130


Installation

Related topics: Getting Started


The Golden Suite is a polyglot data-quality and entity-resolution toolkit available across Python, TypeScript/Node.js, Rust, and SQL environments. This page documents installation methods for all suite packages, optional dependencies, system requirements, and deployment patterns.

System Requirements

| Component | Minimum Version | Recommended |
| --- | --- | --- |
| Python | 3.11+ | 3.11, 3.12 |
| Node.js | 20+ | 22 LTS |
| Rust | 1.75+ (for extensions) | Latest stable |
| PostgreSQL | 15+ (for extensions) | 16+ |
| DuckDB | 1.0+ (for extensions) | Latest |

Source: README.md:1-15

Quick Start

# Entity resolution (Python)
pip install goldenmatch

# TypeScript / Edge runtimes
npm install goldenmatch

Source: README.md:55-60

Python Installation

All Python packages are published to PyPI and support pip installation with optional extras.

Core Packages

| Package | Purpose | Install Command |
| --- | --- | --- |
| goldenmatch | Zero-config entity resolution (fuzzy + exact + probabilistic + LLM) | pip install goldenmatch |
| goldencheck | Data-quality scanning and validation | pip install goldencheck |
| goldenflow | Data transforms and standardization | pip install goldenflow |
| goldenpipe | Pipeline orchestrator | pip install goldenpipe |
| infermap | Schema mapping engine | pip install infermap |

Source: packages/python/goldenmatch/README.md, packages/python/goldencheck/README.md, packages/python/goldenflow/README.md, packages/python/goldenpipe/README.md, packages/python/infermap/README.md

GoldenMatch Optional Dependencies

GoldenMatch supports modular installation through extras:

# Basic installation
pip install goldenmatch

# With native acceleration (Rust/PyO3 abi3 runtime)
pip install "goldenmatch[native]"

# With LLM support (Anthropic SDK)
pip install "goldenmatch[llm]"

# With baseline profiling support
pip install "goldenmatch[baseline]"

# With semantic type inference
pip install "goldenmatch[semantic]"

# With web UI
pip install "goldenmatch[web]"

# With MCP server
pip install "goldenmatch[mcp]"

# Full installation with all extras
pip install "goldenmatch[native,llm,baseline,semantic,web,mcp]"

Source: packages/python/goldenmatch/README.md, README.md:50-55

GoldenCheck Optional Dependencies

# Basic installation
pip install goldencheck

# With LLM enhancement
pip install "goldencheck[llm]"

# With MCP server for Claude Desktop
pip install "goldencheck[mcp]"

# With all extras
pip install "goldencheck[llm,mcp]"

Source: packages/python/goldencheck/README.md

TypeScript / Node.js Installation

All TypeScript packages are published to npm with zero runtime dependencies for the core packages (edge-safe).

# Core packages
npm install goldenmatch
npm install goldencheck
npm install goldenflow
npm install goldenpipe
npm install infermap

# MCP server (Node.js only)
npm install @benseverndev-oss/goldensuite-mcp

Source: packages/typescript/goldenmatch/README.md, packages/typescript/goldencheck/README.md

Peer Dependencies

Some TypeScript examples require optional peer dependencies:

| Package | Purpose | Install Command |
| --- | --- | --- |
| yaml | YAML configuration parsing | npm install yaml |
| nodejs-polars | Parquet reading (Node.js only) | auto-installed when needed |
| csv-parse | CSV reading (Node.js only) | auto-installed when needed |
| @modelcontextprotocol/sdk | MCP server (Node.js only) | auto-installed when needed |

Source: packages/typescript/goldenmatch/examples/README.md

Docker Installation

For a self-contained MCP server deployment, use the official container image:

# Pull (if not cached) and run the latest MCP server image
docker run ghcr.io/benseverndev-oss/goldensuite-mcp:latest

Source: README.md:40

GoldenMatch Native Acceleration

As of v1.21.0, GoldenMatch offers an optional compiled Rust runtime via the goldenmatch-native package. This is separately distributed and uses PyO3 abi3 for compatibility.

graph TD
    A[User: pip install goldenmatch] --> B{Has native extra?}
    B -->|No| C[Pure-Python wheel]
    B -->|Yes| D[pip install goldenmatch-native]
    D --> E[Rust/PyO3 abi3 runtime]
    E --> F[Auto-discovery at runtime]
    C --> G[Standard Polars backend]
    F --> H[Optimized clustering + scoring kernels]
    G --> H

Source: v1.21.0 Release Notes, README.md

Installation Commands for Native

# Option 1: Install with native extra (recommended)
pip install "goldenmatch[native]"

# Option 2: Install separately
pip install goldenmatch
pip install goldenmatch-native

Source: v1.21.0 Release Notes

Rust Extensions (PostgreSQL / DuckDB)

For SQL-native fuzzy matching, install the Rust extension package.

PostgreSQL Extension

# Clone and build from source
git clone https://github.com/benseverndev-oss/goldenmatch.git
cd goldenmatch/packages/rust/extensions
cargo build --release

-- Then, in a SQL session on the target database, load the extension
CREATE EXTENSION goldenmatch_pg;

Source: packages/rust/extensions/README.md

DuckDB UDFs

The same Rust package provides DuckDB UDFs for in-database matching:

# Install via source build
cargo build --release
# UDFs are loaded via SQL commands in DuckDB

Source: packages/rust/extensions/README.md

MCP Server Setup

The Golden Suite includes an MCP (Model Context Protocol) server for Claude Desktop integration.

Python Installation

pip install "goldencheck[mcp]"
# or for full suite
pip install "goldenmatch[mcp]"

Source: packages/python/goldencheck/README.md

Claude Desktop Configuration

Add to your Claude Desktop config (claude_desktop_config.json):

{
  "mcpServers": {
    "goldencheck": {
      "command": "python",
      "args": ["-m", "goldencheck.mcp.server"]
    }
  }
}

Source: packages/python/goldencheck/README.md

Docker Deployment

For production MCP deployments:

docker run ghcr.io/benseverndev-oss/goldensuite-mcp:latest

Source: README.md:40, packages/python/goldensuite-mcp/README.md

dbt Integration

Install the dbt package for data warehouse quality gates:

pip install dbt-goldensuite

Source: README.md:35

GitHub Actions

For CI/CD integration:

pip install goldencheck
# or use the action directly in workflows

Source: README.md:35

Airflow DAGs

Run the Golden Suite as Airflow DAGs:

# Install Airflow itself
pip install apache-airflow

# Use pre-built DAGs from examples/
cp -r examples/airflow/ /path/to/airflow/dags/

Source: examples/README.md

Verification

Verify installation with the following commands:

# Python packages
python -c "import goldenmatch; print(goldenmatch.__version__)"
python -c "import goldencheck; print(goldencheck.__version__)"
python -c "import goldenflow; print(goldenflow.__version__)"

# TypeScript packages
node -e "console.log(require('goldenmatch/package.json').version)"

# CLI tools
goldenmatch --version
goldencheck --version
goldenflow --version

Common Installation Issues

Python Version Mismatch

GoldenMatch requires Python 3.11+. Check your version:

python --version

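If your default interpreter is older than 3.11, install Python 3.11+ first, then create a virtual environment from that newer interpreter (a virtual environment inherits the version of the Python that creates it):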

python -m venv golden-env
source golden-env/bin/activate  # Linux/macOS
# or
golden-env\Scripts\activate  # Windows
pip install goldenmatch

Native Extension Not Found

If goldenmatch-native isn't auto-discovered:

# Reinstall with native extra
pip uninstall goldenmatch-native
pip install "goldenmatch[native]"

MCP Server Connection Issues

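For Claude Desktop MCP integration, ensure claude_desktop_config.json is saved where Claude Desktop reads it on your operating system and that it contains the mcpServers entry shown under Claude Desktop Configuration.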

Installation Hierarchy

graph TB
    subgraph "Full Stack Installation"
        A[Golden Suite MCP Container] --> B[GoldenMatch + Native]
        A --> C[GoldenCheck + MCP]
        A --> D[GoldenFlow]
        A --> E[GoldenPipe]
        A --> F[InferMap]
    end
    
    subgraph "Python-Only Stack"
        G[goldenmatch] --> H[Polars]
        G --> I[Polars-Runtime]
        G --> J[goldenmatch-native optional]
    end
    
    subgraph "TypeScript-Only Stack"
        K[goldenmatch] --> L[Zero deps]
        K --> M[nodejs-polars optional]
    end


System Architecture

Related topics: Backend Systems, Core Matching Engine, Blocking and Scoring


The Golden Suite is a polyglot data-quality and entity-resolution toolkit designed for AI-native workflows. The architecture follows a modular, pipeline-oriented design where each component addresses a specific stage of data processing: profiling, standardization, schema mapping, and deduplication. Source: packages/python/goldenmatch/README.md

Architecture Overview

The system consists of five core packages, each listed for both Python and TypeScript in the package table below, plus a Rust extension layer. Each package is independently installable and can operate standalone or as part of an orchestrated pipeline.

Package Overview

| Package | Language | Purpose | Install Command |
| --- | --- | --- | --- |
| GoldenMatch | Python · TS | Zero-config entity resolution (fuzzy + exact + probabilistic + LLM) | pip install goldenmatch · npm i goldenmatch |
| GoldenCheck | Python · TS | Data-quality scanning: encoding, Unicode, format validation, anomaly detection | pip install goldencheck · npm i goldencheck |
| GoldenFlow | Python · TS | Transforms & standardizers: phone, date, address, categorical normalization | pip install goldenflow · npm i goldenflow |
| GoldenPipe | Python · TS | Orchestrator wiring Check → Flow → Match into declarative pipelines | pip install goldenpipe · npm i goldenpipe |
| InferMap | Python · TS | Schema mapping engine: auto-aligns columns across heterogeneous sources | pip install infermap · npm i infermap |
| goldenmatch-extensions | Rust | Postgres extension (pgrx) + DuckDB UDFs for SQL-native fuzzy matching | source build |
| dbt-goldensuite | dbt · Python | dbt package with quality-gate tests and correction CRUD macros | pip install dbt-goldensuite |
| goldencheck-action | Action | GitHub Action for CI with PR comments | via GitHub Marketplace |

Source: packages/python/goldenmatch/README.md

Core Data Flow

The canonical pipeline processes data through four stages:

graph TD
    A[Raw Data CSV/Parquet] --> B[GoldenCheck<br/>Profile & Validate]
    B --> C[GoldenFlow<br/>Standardize & Transform]
    C --> D[InferMap<br/>Schema Mapping]
    D --> E[GoldenMatch<br/>Deduplicate & Merge]
    E --> F[Golden Records<br/>with Provenance]
    
    B -->|Findings Report| G[Data Quality Score]
    E -->|Cluster Decisions| H[Memory Learning]
    H -->|Auto-config| E

Pipeline Stage Details

| Stage | Input | Output | Key Capabilities |
| --- | --- | --- | --- |
| Profile | Raw CSV/Parquet | Schema, statistics, health score | Encoding detection, null analysis, cardinality profiling |
| Standardize | Messy fields | Normalized fields | Phone E.164, date parsing, address standardization, Unicode normalization |
| Map | Heterogeneous schemas | Column alignments | Domain dictionaries, custom scorers, confidence scoring |
| Match | Canonical records | Duplicate clusters | Fuzzy matching, blocking, probabilistic scoring, LLM clustering |

Source: packages/python/goldenflow/README.md

GoldenMatch Architecture

GoldenMatch is the headline package, providing entity resolution (ER) capabilities. The v1.24.0 release achieved 81% wall-clock reduction (2604s → 502s) on 10M record datasets through the QIS-bucket-realistic optimization path. Source: Community Context - v1.24.0

Match Pipeline Components

graph LR
    A[Input Records] --> B[Blocking<br/>Key Generation]
    B --> C[Candidate Pair<br/>Generation]
    C --> D[Vectorized<br/>Scoring Kernels]
    D --> E[Probabilistic<br/>Classifier]
    E --> F[Cluster<br/>Formation]
    F --> G[Golden Record<br/>Survivorship]
    G --> H[Output with<br/>Provenance]
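The first stages of the diagram (blocking, candidate pair generation, scoring, cluster formation) can be sketched end to end in a few lines. This is a toy illustration of the general technique, not GoldenMatch's kernels: the blocking key, the `difflib` scorer, the 0.75 threshold, and the record fields `surname` and `zip` are all assumptions.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def block_key(record: dict[str, str]) -> str:
    """Blocking key: first letter of the surname plus the zip code."""
    return record["surname"][:1].lower() + record["zip"]


def candidate_pairs(records: list[dict[str, str]]):
    """Yield index pairs only for records that share a blocking key."""
    blocks: dict[str, list[int]] = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[block_key(rec)].append(idx)
    for members in blocks.values():
        yield from combinations(members, 2)


def score(a: dict[str, str], b: dict[str, str]) -> float:
    return SequenceMatcher(None, a["surname"].lower(), b["surname"].lower()).ratio()


def cluster(records: list[dict[str, str]], threshold: float = 0.75) -> list[list[int]]:
    """Union-find over pairs scoring at or above the threshold."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in candidate_pairs(records):
        if score(records[i], records[j]) >= threshold:
            parent[find(j)] = find(i)

    groups: dict[int, list[int]] = defaultdict(list)
    for i in range(len(records)):
        groups[find(i)].append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    people = [
        {"surname": "Smith", "zip": "10001"},
        {"surname": "Smyth", "zip": "10001"},
        {"surname": "Smith", "zip": "94105"},
        {"surname": "Jones", "zip": "10001"},
    ]
    print(cluster(people))
```

Blocking is what keeps the pair count manageable: only records sharing a key are compared, at the cost of missing duplicates that land in different blocks (here, the two Smiths with different zip codes).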

Backend Modes

GoldenMatch supports multiple execution backends optimized for different dataset scales:

| Backend | Use Case | Records | Performance |
| --- | --- | --- | --- |
| chunked | Development / small datasets | < 1M | Single-threaded baseline |
| bucket | Recommended 5M-on-one-node config | 1-10M | 5x wall reduction, 2x RSS reduction |
| native | Maximum performance | Any | Rust/PyO3 abi3 compiled kernels |

The bucket backend became the recommended path in v1.16.0, processing 5M records in 9.94 minutes with 6.4 GB peak RSS on a single 16-core node. Source: packages/python/goldenmatch/README.md

Native Acceleration Layer

The optional native runtime (pip install "goldenmatch[native]") ships the compiled _native kernel as a separately distributed wheel. The runtime is discovered automatically at import time:

import goldenmatch
# Automatically detects and uses native runtime if available
result = goldenmatch.dedupe("customers.csv")

Source: Community Context - v1.21.0, goldenmatch-native v0.1.0
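Automatic discovery of an optional compiled runtime is usually built on a simple "is the module importable?" check. The sketch below shows that generic pattern with the standard library; it is not GoldenMatch's actual discovery code, and the module name `goldenmatch_native` is an assumption (the source names only the `goldenmatch-native` distribution and a `_native` kernel).

```python
import importlib.util


def pick_backend(module_name: str = "goldenmatch_native") -> str:
    """Return "native" if the optional module is importable, else "python".

    find_spec returns None for a missing top-level module and does not
    import the module itself, so the check is cheap.
    """
    if importlib.util.find_spec(module_name) is not None:
        return "native"
    return "python"


if __name__ == "__main__":
    # Prints "python" unless a module with the assumed name is installed.
    print(pick_backend())
```

With this pattern, callers never branch on the backend themselves; installing or removing the optional wheel is enough to change which path runs.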

GoldenCheck Architecture

GoldenCheck provides data validation that discovers rules from your data. The TypeScript implementation follows a scanner-reporter pattern with an LLM enhancement layer. Source: packages/python/goldencheck/README.md

Core Engine Components

| Component | File Location | Purpose |
| --- | --- | --- |
| Scanner | src/core/engine/scanner.ts | Analyzes data files, profiles columns |
| Engine | src/core/engine/ | Executes discovered quality rules |
| Confidence | src/core/engine/confidence.ts | Applies severity downgrades based on findings |
| Triage | src/core/engine/triage.ts | Auto-categorizes findings by priority |
| Fixer | src/core/engine/fixer.ts | Applies automated corrections |
| Agent | src/core/agent/ | Strategy selection and explanation |

Source: packages/typescript/goldencheck/src/node/mcp/agent-tools.ts

LLM Integration Layer

The TypeScript implementation includes a comprehensive LLM interface for semantic analysis:

interface LLMResponse {
  columns: Record<string, LLMColumnAssessment>;
  relations: LLMRelation[];
}

interface LLMColumnAssessment {
  semantic_type: string | null;
  issues: LLMIssue[];
  upgrades: LLMUpgrade[];
  downgrades: LLMDowngrade[];
}

Source: packages/typescript/goldencheck/src/core/llm/prompts.ts

Semantic Type System

GoldenCheck ships with a bundled base type system defined as TypeScript constants (no runtime YAML dependency):

export const BASE_TYPES: Readonly<Record<string, TypeDef>> = {
  identifier: {
    nameHints: ["id", "key", "pk", "code", "sku", "number", "num", "record"],
    valueSignals: { min_unique_pct: 0.95 },
    suppress: ["cardinality", "pattern_consistency", "drift_detection"],
  },
  person_name: {
    nameHints: ["first_name", "last_name", "full_name", ...],
    valueSignals: { mixed_case: true },
    suppress: ["pattern_consistency", "cardinality"],
  },
  email: {
    nameHints: ["email", "mail", "e_mail"],
    valueSignals: { format_match: "email", min_match_pct: 0.70 },
    suppress: ["pattern_consistency"],
  },
  phone: {
    nameHints: ["phone", "tel", "fax", "mobile", "cell"],
    valueSignals: { format_match: "phone", min_match_pct: 0.70 },
    suppress: ["type_inference", "pattern_consistency"],
  },
  address: {
    nameHints: ["address", "street", "addr", "line1", "line2"],
    valueSignals: { avg_length_min: 15 },
    suppress: ["pattern_consistency", "cardinality"],
  },
  free_text: {
    nameHints: ["notes", "comments", "description", ...],
    // ...
  },
};

Source: packages/typescript/goldencheck/src/core/semantic/types.ts
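
To make the role of nameHints and valueSignals concrete, here is a minimal Python sketch of how a column might be classified against such a table. Only the hint lists and the 0.95 uniqueness floor come from BASE_TYPES above. The token-matching rule, the `guess_semantic_type` helper, and the omission of the email format check are assumptions made for illustration, not GoldenCheck's actual logic.

```python
# Two entries transcribed from BASE_TYPES; the matching rules below are assumptions.
TYPE_DEFS = {
    "identifier": {
        "name_hints": ["id", "key", "pk", "code", "sku", "number", "num", "record"],
        "min_unique_pct": 0.95,
    },
    "email": {"name_hints": ["email", "mail", "e_mail"]},
}


def name_matches(column: str, hints: list) -> bool:
    """True when the column name, or one of its underscore-separated tokens, equals a hint."""
    lowered = column.lower()
    tokens = lowered.split("_")
    return any(hint in tokens or hint == lowered for hint in hints)


def guess_semantic_type(column: str, values: list):
    """Return the first type whose name hint and uniqueness signal both hold, else None."""
    unique_pct = len(set(values)) / len(values) if values else 0.0
    for type_name, type_def in TYPE_DEFS.items():
        if not name_matches(column, type_def["name_hints"]):
            continue
        # Types without a uniqueness floor accept any distribution in this sketch.
        if unique_pct >= type_def.get("min_unique_pct", 0.0):
            return type_name
    return None


id_guess = guess_semantic_type("customer_id", [1, 2, 3, 4])
code_guess = guess_semantic_type("status_code", ["A", "A", "B", "A"])
email_guess = guess_semantic_type("email", ["a@x.com", "a@x.com"])
```

Here `status_code` matches an identifier name hint but fails the uniqueness signal, which is the kind of case where a value signal keeps a name hint from misclassifying a column.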

Reporter System

The JSON reporter produces machine-readable output matching the spec schema:

interface ReportOutput {
  file: string;
  rows: number;
  columns: number;
  health_grade: string;
  health_score: number;
  summary: { errors: number; warnings: number; info: number };
  findings: Array<{
    severity: string;
    column: string;
    check: string;
    message: string;
    affected_rows: number;
    sample_values: string[];
  }>;
}

Source: packages/typescript/goldencheck/src/core/reporters/json.ts

GoldenFlow Transform Architecture

GoldenFlow provides 76 transforms organized into semantic categories for data standardization. Source: packages/python/goldenflow/README.md

Transform Categories

| Category | Count | Examples |
| --- | --- | --- |
| Text Transforms | 18 | strip, lowercase, uppercase, normalize_unicode, remove_html_tags |
| Phone Transforms | 5 | phone_e164, phone_national, phone_format |
| Date Transforms | 8 | date_parse, date_format, fuzzy_date |
| Numeric Transforms | 6 | parse_currency, extract_numbers, round_precision |
| Address Transforms | 7 | standardize_address, parse_components |
| Categorical Transforms | 4 | normalize_category, fuzzy_category |

Domain-Specific Transforms

| Domain | Transforms |
| --- | --- |
| Healthcare | NPI normalization, ICD code parsing, insurance ID formatting, CPT/DRG parsing |
| Finance | Account number formatting, routing number validation, CUSIP/ISIN parsing, currency normalization |
| E-commerce | SKU normalization, price parsing, order date standardization, address standardization |
| Real Estate | Property address parsing, listing date normalization, price standardization, geo field extraction |

Source: packages/python/goldenflow/README.md

InferMap Architecture

InferMap is an inference-driven schema mapping engine that automatically aligns columns across heterogeneous data sources. Source: packages/python/infermap/README.md

MCP Server Tools

The TypeScript implementation exposes a comprehensive MCP tool interface:

| Tool | Purpose |
| --- | --- |
| inspect | Analyze schema of a data source |
| suggest-mappings | Propose column alignments with confidence scores |
| apply | Generate remapped output file |
| compare-schemas | Side-by-side schema comparison |
| domain-mapping | Domain-specific mapping using dictionaries |

Source: packages/typescript/infermap/src/node/mcp/server.ts

Scorer Architecture

InferMap supports custom scorers for mapping decisions:

graph TD
    A[Source Column] --> B[Name Similarity]
    A --> C[Type Compatibility]
    A --> D[Value Distribution]
    A --> E[Domain Dictionary]
    B --> F[Composite Score]
    C --> F
    D --> F
    E --> F
    F --> G[Mapping Confidence]

Source: packages/python/infermap/README.md
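
The diagram above shows four signals feeding one composite score. As a rough illustration, the sketch below combines them with a weighted mean. The weights, the `composite_score` helper, and the use of `difflib` for name similarity are all assumptions; the source does not state how InferMap actually combines its scorers.

```python
from difflib import SequenceMatcher

# Hypothetical weights; the source does not say how the four signals are weighted.
WEIGHTS = {"name": 0.4, "type": 0.2, "values": 0.2, "dictionary": 0.2}


def name_similarity(source: str, target: str) -> float:
    """Case-insensitive string similarity in [0, 1] (stand-in for a real name scorer)."""
    return SequenceMatcher(None, source.lower(), target.lower()).ratio()


def composite_score(signals: dict) -> float:
    """Weighted mean of the supplied per-signal scores, each expected in [0, 1]."""
    total = sum(WEIGHTS[name] for name in signals)
    if total == 0:
        return 0.0
    return sum(WEIGHTS[name] * score for name, score in signals.items()) / total


score = composite_score(
    {
        "name": name_similarity("cust_email", "email"),
        "type": 1.0,        # both columns are strings
        "values": 0.9,      # assumed value-distribution overlap
        "dictionary": 1.0,  # assumed domain-dictionary hit
    }
)
```

Dividing by the weights of the signals actually supplied keeps the result in [0, 1] even when a scorer is unavailable for a given column pair.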

GoldenPipe Orchestration Layer

GoldenPipe wires Check → Flow → Match into declarative pipelines with zero boilerplate. Source: packages/python/goldenmatch/README.md

Pipeline Declaration

from goldenpipe import Pipeline

pipeline = (
    Pipeline("customer-dedup")
    .check("raw_data.csv")
    .flow(standardization_rules)
    .map(source_schema, target_schema)
    .match(blocking_rules, match_config)
    .output("golden_records.csv")
)
pipeline.run()

Integration Points

| Integration | Package | Use Case |
| --- | --- | --- |
| Airflow DAGs | examples/airflow/ | 12 drop-in DAG templates |
| dbt | dbt-goldensuite | Quality-gate tests for warehouse models |
| GitHub Actions | goldencheck-action | PR-level data validation |
| MCP Server | goldensuite-mcp | Single-container MCP deployment |

Source: packages/python/goldenmatch/README.md

MCP Agent Tools Architecture

The GoldenCheck MCP server exposes 10 agent-level tools for Claude Desktop integration; the listing below shows eight of them. Source: packages/typescript/goldencheck/src/node/mcp/agent-tools.ts

export const AGENT_TOOLS: readonly Tool[] = [
  { name: "analyze_data", description: "Analyze data file to detect domain and recommend strategy" },
  { name: "explain_finding", description: "Explain a specific finding in natural language" },
  { name: "explain_column", description: "Explain column quality assessment" },
  { name: "auto_triage", description: "Auto-categorize findings by priority" },
  { name: "apply_fixes", description: "Apply automated corrections" },
  { name: "compare_domains", description: "Compare data against known domain schemas" },
  { name: "generate_handoff", description: "Generate pipeline handoff documentation" },
  { name: "build_review_queue", description: "Build prioritized review queue" },
];

Agent Tool Execution Flow

graph TD
    A[Claude Desktop] -->|MCP Protocol| B[Agent Tools Layer]
    B --> C{Command Router}
    C -->|analyze_data| D[Scanner Engine]
    C -->|explain_*| E[Agent Explanation]
    C -->|apply_fixes| F[Fixer Engine]
    C -->|auto_triage| G[Triage Engine]
    D --> H[Findings Report]
    E --> I[Natural Language Output]
    F --> J[Corrected Data]
    G --> K[Prioritized Queue]

Source: packages/typescript/goldencheck/src/node/mcp/agent-tools.ts

Memory Learning System

The Learning Memory system in GoldenMatch consists of three tuners that consume user feedback to auto-configure thresholds: Source: Community Context - v1.20.0

| Tuner | Level | Input | Output |
| --- | --- | --- | --- |
| MemoryLearner | Pair-level | Approve/reject pair decisions | Per-field auto-approve threshold |
| Field-Strategy Tuner | Field-level | Field match quality feedback | Per-field match strategy selection |
| Cluster-Decision Tuner | Cluster-level | Cluster approve/reject decisions | Per-dataset auto-approve threshold |

Auto-Configuration Flow

graph LR
    A[User Feedback] --> B{Memory System}
    B --> C[Pair-Level Learning]
    B --> D[Field-Level Learning]
    B --> E[Cluster-Level Learning]
    C --> F[Auto-Config Commit]
    D --> F
    E --> F
    F --> G[Threshold Proposals]
    G --> H[Match Pipeline]

Zero-Label Confidence (v1.23.0)

Since v1.23.0, auto-config commits by zero-label confidence by default. Among candidates that share a health rank and carry a zero_label profile, the controller's pick_committed tiebreaker prefers the candidate with the higher overall_confidence over the one with the higher mass_separation. Source: Community Context - v1.23.0

Golden Record Provenance

The v1.22.0 release added field-level golden-record provenance tracking: Source: Community Context - v1.22.0

Provenance Data Model

# When provenance=True is enabled:
result = build_golden_records_batch(records, provenance=True)

# Each field dict includes:
{
    "value": "John Smith",
    "source_row_id": "__row_id__ of winning record",
    "survivorship_winner": True
}

Lineage Configuration

| Config Option | Type | Default | Description |
| --- | --- | --- | --- |
| config.output.lineage_provenance | bool | False | Enable field-level source tracking |

Source: Community Context - v1.22.0

TypeScript Package Structure

The TypeScript packages follow a consistent structure optimized for edge runtimes:

packages/typescript/
├── goldencheck/
│   ├── src/
│   │   ├── core/
│   │   │   ├── engine/       # Scanner, fixer, triage, confidence
│   │   │   ├── agent/       # Strategy selection, explanation
│   │   │   ├── llm/         # LLM interface, prompts
│   │   │   ├── reporters/   # JSON, HTML output formatters
│   │   │   ├── semantic/    # Type definitions, domain matching
│   │   │   └── types.ts     # Core type definitions
│   │   └── node/
│   │       └── mcp/         # MCP server, agent tools
│   └── domains/             # Bundled domain packs (YAML)
├── goldencheck-types/
│   ├── domains/             # Community-contributed domain packs
│   └── src/                 # Type definitions
└── infermap/
    ├── src/
    │   ├── core/            # Mapping engine
    │   └── node/
    │       └── mcp/         # MCP server tools
    └── domains/             # Domain dictionaries

Zero Runtime Dependencies

The core TypeScript packages have no runtime dependencies (edge-safe):

| Package | Dependencies |
| --- | --- |
| goldencheck core | None (pure TypeScript) |
| goldencheck-types | js-yaml (dev: tsup, vitest, typescript) |
| infermap core | None (pure TypeScript) |

Optional peer dependencies:

  • nodejs-polars — Parquet reading (Node.js only)
  • csv-parse — CSV reading (Node.js only)
  • @modelcontextprotocol/sdk — MCP server (Node.js only)

Source: packages/typescript/goldencheck-types/package.json

Cross-Language Record Fingerprint

The v1.21.0 release introduced cross-language record fingerprinting enabling consistent record identification across Python, TypeScript, and Rust execution environments. Source: Community Context - v1.21.0

Rust Extensions Layer

The goldenmatch-extensions package provides SQL-native fuzzy matching through Postgres UDFs (pgrx) and DuckDB UDFs: Source: Community Context - v1.24.0

Extension Capabilities (v0.5.0)

| Capability | Postgres Functions | DuckDB Functions |
| --- | --- | --- |
| Core API parity | 13 pgrx functions | Multiple UDFs |
| GoldenFlow transforms | goldenflow_* transforms | Equivalent UDFs |
| Memory learning | memory_learn CRUD | memory_stats CRUD |

Performance Characteristics

Scaling Benchmarks (v1.24.0)

| Dataset Size | Wall Clock | Memory (RSS) | Configuration |
| --- | --- | --- | --- |
| 100K records | ~30s | < 1 GB | Default |
| 1M records | ~2 min | ~2 GB | Default |
| 5M records | 9.94 min | 6.4 GB | backend="bucket", 16-core |
| 10M records | 8.37 min | ~5 GB | backend="bucket", optimized |

The QIS-bucket-realistic path achieves F1=0.9886 invariant across all scales. Source: Community Context - v1.24.0

Memory Optimization

The native Rust acceleration (goldenmatch-native) provides:

  • 18% RSS reduction on large datasets
  • Compiled clustering kernels
  • Optimized block-scoring operations

Source: Community Context - v1.19.0

| Topic | Documentation |
| --- | --- |
| Getting Started | GoldenMatch Quick Start |
| Python API | packages/python/goldenmatch/ |
| TypeScript API | packages/typescript/goldenmatch/ |
| MCP Integration | Web UI Wiki |
| ER Agent | ER Agent / A2A Wiki |
| Examples | Python Examples, TypeScript Examples |

Source: https://github.com/benseverndev-oss/goldenmatch / Human Manual

Backend Systems

GoldenMatch provides a pluggable backend architecture that allows the entity resolution engine to execute across different computational substrates. This design enables users to choose the optimal backend based on dataset size, infrastructure constraints, and performance requirements.

Overview

The backend abstraction decouples the core matching logic from data execution, allowing GoldenMatch to scale from single-threaded operations on small datasets to distributed clusters processing tens of millions of records.

graph TD
    User[User Code] --> Config[GoldenMatch Config]
    Config --> Registry[Backend Registry]
    Registry --> Polars[Polars Backend]
    Registry --> Bucket[Bucket Backend]
    Registry --> DuckDB[DuckDB Backend]
    Registry --> Ray[Ray Backend]
    
    Polars --> Data1[In-Memory Data]
    Bucket --> Data2[Chunked Data]
    DuckDB --> Data3[SQL Engine]
    Ray --> Data4[Distributed Data]
    
    subgraph Core["Core ER Engine"]
        Blocking[Blocking]
        Scoring[Pair Scoring]
        Clustering[Clustering]
    end
    
    Data1 --> Core
    Data2 --> Core
    Data3 --> Core
    Data4 --> Core

Backend Types

Polars Backend

The default backend, designed for single-node operation with multi-threaded execution. It leverages Polars' native vectorized operations for blocking, scoring, and clustering.

| Characteristic | Value |
| --- | --- |
| Execution Model | Single node, multi-threaded |
| Memory Model | In-memory |
| Typical Dataset Size | Up to 5M records |
| Dependencies | polars |
| Configuration Key | backend="polars" |

The Polars backend is the recommended starting point for datasets under 5 million records. Source: packages/python/goldenmatch/goldenmatch/backends/polars_backend.py

Bucket Backend

An evolution of the Polars backend optimized for larger datasets on single nodes. The bucket backend partitions data into manageable chunks that are processed sequentially, reducing peak memory consumption.

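| Characteristic | Value |
| --- | --- |
| Execution Model | Single node, chunked processing |
| Memory Model | Memory-efficient, streaming |
| Typical Dataset Size | 5M-10M records |
| RSS Reduction | ~50% (2x) vs v1.15 chunked baseline; a further ~18% from v1.24.0 scale-aware cardinality |
| Wall Time Reduction | ~81% vs v1.15 baseline |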

Source: packages/python/goldenmatch/goldenmatch/backends/bucket_backend.py Source: docs/scale-envelope.md

v1.16.0+ Performance Note: The backend="bucket" path processes 5M records in 9.94 minutes with 6.4 GB peak RSS on a 16-core node, a 5x wall-time reduction and a 2x peak-RSS reduction compared to the v1.15 chunked baseline. Source: README.md

DuckDB Backend

Provides SQL-native fuzzy matching capabilities, enabling GoldenMatch operations to execute within DuckDB queries. This is particularly useful for warehouse-native entity resolution workflows.

| Characteristic | Value |
| --- | --- |
| Execution Model | DuckDB SQL engine |
| Memory Model | Arrow-based, out-of-core |
| Use Case | Warehouse-native ER, SQL integration |
| Key Functions | goldenflow_* transforms, memory operations |

Source: packages/python/goldenmatch/goldenmatch/backends/duckdb_backend.py

Ray Backend

Enables distributed entity resolution across a Ray cluster. The Ray backend distributes blocking, scoring, and clustering operations across multiple nodes.

| Characteristic | Value |
| --- | --- |
| Execution Model | Distributed Ray cluster |
| Memory Model | Distributed, cluster-sharded |
| Typical Dataset Size | 10M+ records |
| Dependencies | ray |

Source: packages/python/goldenmatch/goldenmatch/backends/ray_backend.py

Backend Selection

GoldenMatch automatically selects an appropriate backend based on dataset characteristics, but users can explicitly override this via configuration:

from goldenmatch import GoldenMatch

config = {
    "backend": "bucket",  # Explicit backend selection
    "backend_options": {
        "chunk_size": 500_000,
        "num_threads": 16
    }
}

matcher = GoldenMatch(config)

Auto-Detection Logic

The backend registry implements automatic selection based on:

  1. Record count: Datasets under 1M records use polars; larger datasets may trigger bucket or ray
  2. Memory availability: Detected via RSS monitoring during execution
  3. Cluster availability: If Ray is initialized, ray backend becomes available
  4. User configuration: Explicit backend setting takes precedence
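
The selection order listed above can be sketched as a small function. This is only an illustration: `choose_backend` and both thresholds are hypothetical, chosen to echo the dataset sizes quoted on this page, and the memory-availability check is left out because the source does not describe how it feeds the decision.

```python
from typing import Optional

# Hypothetical thresholds, loosely taken from the dataset sizes quoted on this page.
POLARS_MAX_RECORDS = 1_000_000
RAY_MIN_RECORDS = 10_000_000


def choose_backend(
    n_records: int,
    explicit: Optional[str] = None,
    ray_available: bool = False,
) -> str:
    """Illustrative order: explicit setting first, then record count and cluster availability."""
    if explicit is not None:
        return explicit  # an explicit backend setting always wins
    if n_records < POLARS_MAX_RECORDS:
        return "polars"
    if ray_available and n_records >= RAY_MIN_RECORDS:
        return "ray"
    return "bucket"


small = choose_backend(500_000)
medium = choose_backend(5_000_000)
large = choose_backend(20_000_000, ray_available=True)
forced = choose_backend(20_000_000, explicit="duckdb", ray_available=True)
```

Checking the explicit setting before anything else mirrors the rule that user configuration takes precedence over every detected signal.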

Backend Interface

All backends implement a common interface defined in base.py:

class Backend(Protocol):
    def setup(self, config: Config) -> None: ...
    def teardown(self) -> None: ...
    def execute_blocking(self, records: DataFrame, config: Config) -> DataFrame: ...
    def execute_scoring(self, pairs: DataFrame, config: Config) -> DataFrame: ...
    def execute_clustering(self, pairs: DataFrame, config: Config) -> DataFrame: ...
    def execute_merge(self, records: DataFrame, clusters: DataFrame, config: Config) -> DataFrame: ...

Source: packages/python/goldenmatch/goldenmatch/backends/base.py

Native Acceleration (Rust/PyO3)

v1.21.0 introduced optional native acceleration via goldenmatch-native, a separately distributed compiled runtime built with Rust and PyO3 abi3:

pip install "goldenmatch[native]"

The native runtime is discovered automatically when installed and provides:

  • Compiled clustering kernels
  • Optimized block-scoring operations
  • Polars-compatible ABI3 bindings

Source: packages/python/goldenmatch/goldenmatch/backends/__init__.py

Note: goldenmatch-native is not a standalone package. It must be installed alongside the core goldenmatch package and is automatically discovered at import time.

Memory Management

Health Monitoring

All backends implement health monitoring to detect memory pressure:

class HealthMonitor:
    """Tracks RSS usage and execution metrics."""
    
    def check_health(self) -> HealthStatus:
        """Returns current memory and execution health."""
    
    def get_metrics(self) -> dict:
        """Returns detailed metrics for diagnostics."""

RSS Reduction Features

| Version | Feature | RSS Impact |
| --- | --- | --- |
| v1.16.0 | Bucket backend introduction | 50% reduction vs chunked |
| v1.24.0 | Scale-aware cardinality | 18% reduction |
| v1.24.0 | Heuristic rule expansion | Additional savings |

Source: docs/scale-envelope.md
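
As a sketch of the health-monitoring interface described in this section, the class below fills in `check_health` and `get_metrics` with the simplest behavior that fits their docstrings. The `SimpleHealthMonitor` name, the `HealthStatus` fields, and the injected `read_rss_bytes` callable are assumptions for illustration; GoldenMatch's real monitor may sample and report memory differently.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class HealthStatus:
    rss_bytes: int
    limit_bytes: int

    @property
    def under_pressure(self) -> bool:
        """True once the sampled RSS exceeds the configured limit."""
        return self.rss_bytes > self.limit_bytes


class SimpleHealthMonitor:
    """Illustrative monitor: compares a sampled RSS value against a fixed limit."""

    def __init__(self, read_rss_bytes: Callable[[], int], limit_bytes: int) -> None:
        self._read_rss_bytes = read_rss_bytes  # injected so the sketch stays portable
        self._limit_bytes = limit_bytes
        self._peak = 0

    def check_health(self) -> HealthStatus:
        rss = self._read_rss_bytes()
        self._peak = max(self._peak, rss)  # remember the highest sample seen so far
        return HealthStatus(rss_bytes=rss, limit_bytes=self._limit_bytes)

    def get_metrics(self) -> dict:
        return {"peak_rss_bytes": self._peak, "limit_bytes": self._limit_bytes}


GIB = 1024**3
samples = iter([2 * GIB, 7 * GIB])  # fake RSS readings for the example
monitor = SimpleHealthMonitor(read_rss_bytes=lambda: next(samples), limit_bytes=6 * GIB)
first = monitor.check_health()
second = monitor.check_health()
```

In a real process the callable could wrap a platform memory reader; injecting it keeps the threshold logic testable without depending on actual memory use.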

Configuration Reference

Backend-Specific Options

backend: "bucket"  # polars, bucket, duckdb, ray

backend_options:
  # Polars/Bucket options
  num_threads: 16
  chunk_size: 500_000
  memory_limit_gb: 32
  
  # Ray options
  num_actors: 8
  actor_placement: "node1,node2,node3"
  
  # DuckDB options
  catalog: "memory"
  threads: 8

Performance Tuning

For the recommended 5M-on-one-node configuration:

backend: "bucket"
backend_options:
  num_threads: 16
  chunk_size: 500_000

This configuration achieves approximately 9.94 minutes wall time and 6.4 GB peak RSS on a 16-core node.

Source: docs/scale-envelope.md

Execution Pipeline

The backend executes the entity resolution pipeline in these stages:

graph LR
    A[Raw Records] --> B[Blocking]
    B --> C[Pair Generation]
    C --> D[Pair Scoring]
    D --> E[Clustering]
    E --> F[Golden Record Construction]
    F --> G[Output]
    
    B1[Block Key Computation] --> B
    D1[Field Scorers] --> D
    E1[Linkage Criteria] --> E

Each stage is backend-specific for optimal execution:

| Stage | Polars | Bucket | DuckDB | Ray |
| --- | --- | --- | --- | --- |
| Blocking | Vectorized | Chunked vectorized | SQL | Distributed actors |
| Scoring | Multi-threaded | Chunked | SQL | Distributed |
| Clustering | Single-node | Chunked | SQL | Distributed |

Extending Backends

To implement a custom backend:

from goldenmatch.backends.base import Backend
from goldenmatch.core.config import Config
import polars as pl

class CustomBackend(Backend):
    def setup(self, config: Config) -> None:
        # Keep the configuration for use by the execute_* stages
        self.config = config

    def execute_blocking(self, records: pl.DataFrame, config: Config) -> pl.DataFrame:
        # Placeholder: replace with custom blocking logic
        return records

    def execute_scoring(self, pairs: pl.DataFrame, config: Config) -> pl.DataFrame:
        # Placeholder: replace with custom scoring logic
        return pairs

    def execute_clustering(self, pairs: pl.DataFrame, config: Config) -> pl.DataFrame:
        # Placeholder: replace with custom clustering logic
        return pairs

    def execute_merge(self, records: pl.DataFrame, clusters: pl.DataFrame, config: Config) -> pl.DataFrame:
        # Placeholder: replace with custom golden-record merge logic
        return records

    def teardown(self) -> None:
        # Release any resources acquired in setup()
        pass

Register the backend:

from goldenmatch.backends import register_backend

register_backend("custom", CustomBackend)

Core Matching Engine

Related topics: AutoConfig System, Blocking and Scoring

The Core Matching Engine is the central processing component of GoldenMatch, responsible for identifying, scoring, and clustering duplicate records within datasets. It orchestrates the entire entity resolution pipeline from candidate pair generation through final cluster formation, enabling users to deduplicate records with configurable precision, recall, and performance characteristics.

Architecture Overview

The Core Matching Engine consists of several interconnected modules that work together to perform entity resolution at scale. The engine processes input records through a staged pipeline: blocking reduces the candidate space, scoring evaluates pair similarity, and clustering groups related records into golden records.

graph TD
    A[Input Records] --> B[Blocker]
    B --> C[Candidate Pairs]
    C --> D[Scorer]
    D --> E[Scored Pairs]
    E --> F[Cluster]
    F --> G[Golden Records]
    
    H[Config] --> B
    H --> D
    H --> F
    
    I[Autoconfig Controller] -.-> H

Core Components

| Component | Purpose | Key Responsibilities |
| --- | --- | --- |
| Blocker | Candidate reduction | Generate candidate pairs using blocking keys, handle QIS-bucket blocking |
| Scorer | Similarity evaluation | Compute field-level and overall pair scores, apply field strategies |
| Cluster | Record grouping | Form clusters from scored pairs, manage transitive closure, build golden records |
| Autoconfig Controller | Self-configuration | Auto-tune thresholds, manage recall targets, handle zero-label commits |
| Controller | Orchestration | Coordinate pipeline stages, manage state, handle incremental processing |

Source: packages/python/goldenmatch/goldenmatch/core/controller.py

Blocking Module

The blocking module is responsible for reducing the computational complexity of record matching from O(n²) to a manageable candidate set. It groups records by shared blocking keys and only generates candidate pairs within the same block.
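
To illustrate the idea, here is a minimal sketch of key-based blocking in plain Python. It is not GoldenMatch's blocker: the `candidate_pairs` helper and the choice of `zip_code` as the only blocking key are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, key_fn):
    """Group records by blocking key and pair records only within the same block."""
    blocks = defaultdict(list)
    for row_id, record in records.items():
        blocks[key_fn(record)].append(row_id)
    pairs = set()
    for members in blocks.values():
        # Every unordered pair inside a block becomes a candidate
        pairs.update(combinations(sorted(members), 2))
    return pairs


records = {
    1: {"name": "John Smith", "zip_code": "10001"},
    2: {"name": "Jon Smith", "zip_code": "10001"},
    3: {"name": "Mary Jones", "zip_code": "94105"},
    4: {"name": "M. Jones", "zip_code": "94105"},
}

pairs = candidate_pairs(records, key_fn=lambda r: r["zip_code"])
all_pairs = len(records) * (len(records) - 1) // 2  # exhaustive comparison count
```

With four records an exhaustive comparison needs six pairs, while blocking on the ZIP code leaves two. The gap widens quickly as the dataset grows, provided no single block dominates.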

Blocking Strategies

GoldenMatch supports multiple blocking strategies optimized for different data characteristics and scale requirements:

| Strategy | Description | Use Case |
| --- | --- | --- |
| QIS Bucket | Quality-Interval-Sorted bucketing with Chao1 cardinality estimation | Large-scale datasets (10M+ records), realistic data distributions |
| Standard | Traditional blocking on normalized field values | General purpose deduplication |
| Multi-pass | Sequential blocking with different keys | High recall requirements |
| ANN Fallback | Approximate nearest neighbor blocking | Fuzzy matching with edit distance |

The QIS-bucket strategy introduced in v1.24.0 achieves significant performance improvements: 10M records processed in 502s (down from 2604s) with invariant F1=0.9886 and 18% RSS reduction.

Source: packages/python/goldenmatch/goldenmatch/core/blocker.py

Blocking Key Configuration

config = {
    "blocking": {
        "keys": ["name_soundex", "zip_code", "phone_area"],
        "min_block_size": 2,
        "max_block_size": 100000
    }
}

Scoring Module

The scoring module evaluates candidate pairs by computing similarity scores at both field and record levels. It applies configurable field strategies to determine how each field contributes to the overall match probability.

Field Strategies

Field strategies define how individual fields are compared and weighted:

| Strategy | Description | Best For |
| --- | --- | --- |
| exact | Binary match/mismatch | IDs, codes, categorical |
| fuzzy | Edit distance or Jaro-Winkler | Names, addresses |
| token_set | Token overlap comparison | Multi-word fields |
| numeric | Threshold-based comparison | Ages, amounts |
| date | Temporal proximity | Date fields |
| phonetic | Soundex/Metaphone matching | Names |

Source: packages/python/goldenmatch/goldenmatch/core/field_strategies.py

Score Computation

The scorer aggregates field-level scores into an overall pair score using weighted combination:

overall_score = sum(field_score * field_weight for field in fields) / total_weight

The resulting score represents the probability that two records refer to the same entity, ranging from 0.0 (definitely different) to 1.0 (definitely match).

Source: packages/python/goldenmatch/goldenmatch/core/scorer.py
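
The weighted combination can be made concrete with a short, self-contained sketch. The comparator functions here are stand-ins: `fuzzy` uses `difflib` rather than the Jaro-Winkler or edit-distance scorers named in the strategy table, and the `overall_score` helper with its `fields` mapping is a hypothetical shape, not the library's API.

```python
from difflib import SequenceMatcher


def exact(a: str, b: str) -> float:
    """Binary comparison: 1.0 on equality, otherwise 0.0."""
    return 1.0 if a == b else 0.0


def fuzzy(a: str, b: str) -> float:
    """Stand-in fuzzy comparator based on difflib's similarity ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def overall_score(left: dict, right: dict, fields: dict) -> float:
    """Weighted mean of per-field scores; `fields` maps name -> (comparator, weight)."""
    total_weight = sum(weight for _, weight in fields.values())
    if total_weight == 0:
        return 0.0
    weighted = sum(
        comparator(left[name], right[name]) * weight
        for name, (comparator, weight) in fields.items()
    )
    return weighted / total_weight


FIELDS = {"name": (fuzzy, 2.0), "email": (exact, 1.5)}
a = {"name": "John Smith", "email": "john@example.com"}
b = {"name": "john smith", "email": "john@example.com"}
c = {"name": "Mary Jones", "email": "mary@example.com"}

score_same = overall_score(a, b, FIELDS)
score_diff = overall_score(a, c, FIELDS)
```

Because the sum is divided by the total weight, the result stays in the 0.0 to 1.0 range whenever each field comparator does.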

Clustering Module

The clustering module transforms scored pairs into connected clusters representing unique entities. It handles transitive closure to ensure consistent grouping across the record graph.

Clustering Algorithm

GoldenMatch uses a connected-components approach with configurable linkage criteria:

| Linkage Type | Behavior |
| --- | --- |
| Single | Records merge if any pair within the cluster exceeds threshold |
| Complete | All pairs within merged clusters must exceed threshold |
| Average | Uses mean pairwise similarity |
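
For the single-linkage case, connected components can be computed with a union-find structure. The sketch below is a generic illustration of that technique under assumed input shapes (a list of `(left, right, score)` tuples); it is not taken from GoldenMatch's cluster module, and complete or average linkage would need additional checks before merging.

```python
def cluster_pairs(scored_pairs, threshold):
    """Single-linkage clustering: union any two records whose pair score meets the threshold."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for left, right, score in scored_pairs:
        root_left, root_right = find(left), find(right)
        if score >= threshold and root_left != root_right:
            parent[root_right] = root_left

    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)
    return sorted(sorted(members) for members in clusters.values())


scored = [("a", "b", 0.93), ("b", "c", 0.88), ("c", "d", 0.40), ("d", "e", 0.91)]
clusters = cluster_pairs(scored, threshold=0.85)
```

Records `a` and `c` end up in the same cluster although they were never scored against each other, which is the transitive closure described above. The same property is why single linkage can chain unrelated records together when the threshold is too low.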

Golden Record Generation

Once clusters are formed, the clustering module generates golden records by applying survivorship rules:

golden_record = build_golden_records_batch(cluster_members, provenance=True)

With provenance=True (introduced in v1.22.0), each field dict includes source_row_id tracking which record contributed the winning value.

Source: packages/python/goldenmatch/goldenmatch/core/cluster.py
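
The following sketch shows what field-level provenance amounts to for a single cluster. It does not call `build_golden_records_batch`; the `build_golden_record` helper and its survivorship rule (longest non-empty value wins, ties go to the earliest record) are assumptions for illustration. Only the `value` and `source_row_id` keys and the `__row_id__` column name follow the description above.

```python
def build_golden_record(cluster_members, fields):
    """Pick one winning value per field and record which row supplied it.

    Assumed survivorship rule: the longest non-empty value wins,
    with ties going to the earliest record in the cluster.
    """
    golden = {}
    for field in fields:
        candidates = [m for m in cluster_members if m.get(field)]
        if not candidates:
            golden[field] = {"value": None, "source_row_id": None}
            continue
        # max() returns the first maximal element, so earlier rows win ties
        winner = max(candidates, key=lambda m: len(str(m[field])))
        golden[field] = {"value": winner[field], "source_row_id": winner["__row_id__"]}
    return golden


members = [
    {"__row_id__": 0, "name": "J. Smith", "phone": ""},
    {"__row_id__": 1, "name": "John Smith", "phone": "+12125550100"},
    {"__row_id__": 2, "name": "John Smith", "phone": None},
]
golden = build_golden_record(members, ["name", "phone", "email"])
```

Keeping the source row next to each value lets a reviewer trace any golden field back to the record it came from, at the cost of a larger output.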

Autoconfig Controller

The autoconfig controller enables self-tuning of matching parameters based on labeled data or automatic heuristics. It replaces manual threshold tuning with data-driven optimization.

Auto-Configuration Features

| Feature | Description | Version |
| --- | --- | --- |
| Zero-label commit | Prefer higher confidence candidates when labels unavailable | v1.23.0+ |
| Recall targeting | Auto-configure thresholds to meet desired recall | v1.20.0+ |
| Cluster threshold tuning | Tune decision threshold based on cluster-level decisions | v1.20.0+ |
| Field strategy tuning | Auto-select field comparison strategies | v1.19.0+ |

Zero-Label Confidence Handling

In v1.23.0, the pick_committed method was enhanced to handle zero-label profiles. When multiple candidates share the same health rank, the controller prefers the candidate with the higher overall_confidence over the one with the higher mass_separation. This addresses precision-collapse issues in unlabeled data scenarios.

Source: packages/python/goldenmatch/goldenmatch/core/autoconfig_controller.py

Performance Characteristics

The Core Matching Engine is optimized for both scale and accuracy:

Benchmark Results (v1.24.0)

| Dataset Size | Wall Time | Peak RSS | F1 Score |
| --- | --- | --- | --- |
| 5M records | 9.94 min | 6.4 GB | 0.99+ |
| 10M records | 502s | ~18% reduction vs v1.23 | 0.9886 |

Performance Strategies

  1. Vectorization: Single-group-by-per-column operations for golden record building
  2. Scale-aware cardinality: Chao1 estimation for blocking key selection
  3. Native acceleration: Optional Rust/PyO3 runtime via pip install "goldenmatch[native]"
  4. Incremental processing: Support for streaming and incremental matching

Source: packages/python/goldenmatch/goldenmatch/core/matcher.py

Configuration Reference

Core Configuration Options

matching:
  # Scoring thresholds
  score_threshold: 0.85        # Minimum score to consider a match
  decision_threshold: 0.5      # Threshold for cluster decisions
  
  # Field configuration
  fields:
    name:
      strategy: fuzzy
      weight: 2.0
    email:
      strategy: exact
      weight: 1.5
    phone:
      strategy: token_set
      weight: 1.0

  # Blocking
  blocking:
    keys: ["name_soundex", "phone_last4"]
    method: qis_bucket

  # Output
  output:
    lineage_provenance: false  # Track source records for golden fields
    include_scores: true

Autoconfig Options

autoconfig:
  enabled: true
  recall_target: 0.95
  zero_label_commit: true      # v1.23.0+ default behavior
  tune_cluster_threshold: true  # v1.20.0+

Pipeline Integration

The Core Matching Engine integrates with the broader GoldenSuite ecosystem:

graph LR
    A[GoldenCheck] --> B[GoldenFlow]
    B --> C[GoldenMatch]
    C --> D[InferMap]
    
    E[GoldenPipe] -. orchestrates .-> A
    E -. orchestrates .-> B
    E -. orchestrates .-> C
    E -. orchestrates .-> D

  • GoldenCheck: Validates data quality before matching
  • GoldenFlow: Standardizes messy fields (phone, date, address)
  • InferMap: Maps columns across heterogeneous schemas
  • GoldenPipe: Orchestrates the full pipeline declaratively

AutoConfig System

Related topics: Core Matching Engine, Learning Memory

The AutoConfig System is GoldenMatch's intelligent self-configuration engine that automatically tunes entity resolution parameters based on data characteristics, eliminating the need for manual threshold and strategy tuning. Introduced incrementally across versions v1.19.0 through v1.24.0, it represents the "Learning Memory" family of auto-tuning features that consume human feedback to propose optimal configurations.

Overview

AutoConfig addresses the fundamental challenge in entity resolution: finding the right balance between precision and recall requires understanding your specific data distribution. Rather than requiring users to manually specify match thresholds, blocking strategies, and scoring weights, AutoConfig:

  • Analyzes data characteristics through profiling
  • Proposes configuration parameters via iterative tuning
  • Learns from user decisions in review queues
  • Commits configurations based on zero-label confidence

Source: docs/design/2026-05-25-zero-label-confidence-autoconfig-design.md

Architecture

The AutoConfig System comprises four primary components that work together in an iterative feedback loop:

graph TD
    A[Data Input] --> B[Controller]
    B --> C[Policy Engine]
    C --> D[Cluster Threshold Tuner]
    D --> E[Zero-Label Confidence]
    E --> B
    F[User Decisions] --> D
    G[Telemetry] --> B
    B --> H[Committed Config]

Component Overview

| Component | Purpose | Location |
| --- | --- | --- |
| Controller | Orchestrates the auto-config loop, manages iterations | core/autoconfig_controller.py |
| Policy Engine | Evaluates candidate configurations against health metrics | core/autoconfig_policy.py |
| Cluster Threshold Tuner | Proposes per-dataset approve thresholds from cluster decisions | core/autoconfig_cluster_threshold_tuner.py |
| Zero-Label Confidence | Assigns confidence scores based on unlabeled data patterns | core/zero_label_confidence.py |

Source: packages/python/goldenmatch/goldenmatch/core/autoconfig_controller.py

Controller

The AutoConfigController is the central orchestrator that manages the iterative configuration process. It coordinates between the policy engine, telemetry collectors, and the zero-label confidence system.

Controller Responsibilities

graph LR
    A[Blocking<br>Summary] --> C[Controller]
    B[Scoring<br>Summary] --> C
    D[Cluster<br>Summary] --> C
    E[Indicators] --> C
    F[Column<br>Priors] --> C
    C --> G[Decisions]
    C --> H[Errors]
    C --> I[Committed<br>Matchkeys]

Telemetry Data Model

The controller collects and emits structured telemetry for each iteration:

interface ControllerScoringSummary {
  n_pairs_scored: number;
  candidates_compared: number;
  mass_above_threshold: number;
  mass_in_borderline: number;
  dip_statistic: number;
}

interface ControllerBlockingSummary {
  n_blocks: number;
  reduction_ratio: number;
  block_sizes_p50: number;
  block_sizes_p99: number;
  block_sizes_max: number;
  oversized_block_count: number;
  keys_used: string[][];
}

interface ControllerClusterSummary {
  n_clusters: number;
  cluster_size_p50: number;
  cluster_size_p99: number;
  cluster_size_max: number;
  transitivity_rate: number;
  oversized_cluster_count: number;
}
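
To make the blocking fields above concrete, here is a minimal Python sketch of how such a summary could be derived from a list of block sizes. It is not the controller's implementation. It assumes that reduction_ratio means 1 minus (pairs compared inside blocks / all possible pairs), that percentiles use the nearest-rank method, and that blocks do not overlap; the helper names and the oversized_limit parameter are hypothetical.

```python
import math


def nearest_rank(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def blocking_summary(block_sizes, n_records, oversized_limit=1000):
    """Summarize block sizes; oversized_limit is a hypothetical cutoff."""
    sizes = sorted(block_sizes)
    # Pairs compared inside blocks versus all possible record pairs.
    blocked_pairs = sum(s * (s - 1) // 2 for s in sizes)
    all_pairs = n_records * (n_records - 1) // 2
    return {
        "n_blocks": len(sizes),
        "reduction_ratio": 1 - blocked_pairs / all_pairs,
        "block_sizes_p50": nearest_rank(sizes, 50),
        "block_sizes_p99": nearest_rank(sizes, 99),
        "block_sizes_max": sizes[-1],
        "oversized_block_count": sum(s > oversized_limit for s in sizes),
    }


summary = blocking_summary([2, 3, 3, 4, 8], n_records=20, oversized_limit=5)
print(summary)
```

With five blocks over 20 records, 41 of the 190 possible pairs are compared, so the reduction ratio is about 0.78 and one block exceeds the illustrative limit of 5.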

Source: web/frontend/src/lib/api.ts

Decision Recording

Each iteration produces a ControllerDecision record:

Field | Type | Description
iteration | int | Iteration number
rule_name | str | Name of the policy rule that triggered
rationale | str | Human-readable explanation
config_diff | Record[str, str] | Changes made to configuration
wall_clock_ms | int | Time taken for this iteration, in milliseconds

Source: web/frontend/src/lib/api.ts

Policy Engine

The AutoConfigPolicy evaluates candidate configurations against a health ranking system. It ranks configurations by their expected precision-collapse behavior and mass separation characteristics.

Health Ranking

Configurations are evaluated on multiple health dimensions:

Metric | Description | Impact
overall_confidence | Aggregate confidence score | Higher is better
mass_separation | Gap between match/non-match distributions | Larger gap indicates better discrimination
precision_collapse | Tendency to over-cluster | Lower is better

Commit Decision Logic

The policy engine's pick_committed method determines which configuration to commit when multiple candidates have equal health rank. In v1.23.0, the tiebreaker was changed: when tied candidates carry a zero_label profile, the candidate with the higher overall_confidence is preferred; otherwise the candidate with the higher mass_separation is preferred.

Source: docs/design/2026-05-25-zero-label-confidence-autoconfig-design.md

Cluster Threshold Tuner

The AutoConfigClusterThresholdTuner is the third tuner in the Learning Memory family, alongside the pair-level MemoryLearner and field-level field-strategy tuner.

Tuner Function: `tune_decision_threshold`

def tune_decision_threshold(
    decisions: list[ClusterDecision],
    current_threshold: float
) -> float:
    """
    Proposes a per-dataset auto-approve threshold based on
    cluster-level approve/reject decisions.
    """

Input: Cluster Decisions

Field | Type | Description
cluster_id | str | Unique cluster identifier
decision | str | "approve" or "reject"
confidence | float | Model confidence in decision
cluster_size | int | Number of records in cluster

Output

Returns an updated threshold value that balances precision and recall based on the observed decision pattern.

Source: packages/python/goldenmatch/goldenmatch/core/autoconfig_cluster_threshold_tuner.py

Zero-Label Confidence

The zero-label confidence mechanism enables AutoConfig to work without labeled training data by analyzing the structure of unlabeled record pairs and their natural clustering behavior.

Design Principles

  1. No Ground Truth Required: Analyzes inherent data structure rather than relying on labeled examples
  2. Precision-Collapse Aware: Identifies when configurations would cause over-merging
  3. Profile-Based Scoring: Assigns confidence scores based on zero-label profile characteristics

Zero-Label Profile

A zero-label profile captures characteristics of record pairs that help predict match quality without explicit labels:

from dataclasses import dataclass

@dataclass
class ZeroLabelProfile:
    mass_separation: float
    overall_confidence: float
    precision_collapse_risk: float
    zero_label: bool  # True if profile carries zero-label characteristics

Commit Behavior

Starting in v1.23.0, AutoConfig commits by zero-label confidence by default. The pick_committed tiebreaker works as follows:

IF several candidates share the same health rank:
    IF at least one of them carries a zero_label profile:
        PREFER the candidate with the higher overall_confidence
    ELSE:
        PREFER the candidate with the higher mass_separation

This change addressed precision-collapse scenarios where mass_separation alone was insufficient.
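
The tiebreaker described above can be sketched in Python as follows. This is an illustration of the stated rule, not the code in autoconfig_policy.py: the Candidate class and its fields are hypothetical, and the sketch assumes that a lower health_rank is healthier.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    # Hypothetical container; field names mirror the metrics named in this manual.
    name: str
    health_rank: int            # assumption: lower rank means healthier
    overall_confidence: float
    mass_separation: float
    zero_label: bool = False    # True if the candidate carries a zero-label profile


def pick_committed(candidates):
    """Pick the best-ranked candidate, breaking ties as described above."""
    best_rank = min(c.health_rank for c in candidates)
    tied = [c for c in candidates if c.health_rank == best_rank]
    if any(c.zero_label for c in tied):
        # v1.23.0+ behavior: prefer the higher overall_confidence.
        return max(tied, key=lambda c: c.overall_confidence)
    # Otherwise fall back to the higher mass_separation.
    return max(tied, key=lambda c: c.mass_separation)


with_profile = [
    Candidate("a", 1, overall_confidence=0.70, mass_separation=0.90, zero_label=True),
    Candidate("b", 1, overall_confidence=0.85, mass_separation=0.40, zero_label=True),
    Candidate("c", 2, overall_confidence=0.99, mass_separation=0.99, zero_label=True),
]
without_profile = [
    Candidate("a", 1, overall_confidence=0.70, mass_separation=0.90),
    Candidate("b", 1, overall_confidence=0.85, mass_separation=0.40),
]
print(pick_committed(with_profile).name, pick_committed(without_profile).name)
```

Candidate "c" never wins despite its high scores because it has a worse health rank; the tiebreaker only applies among candidates that share the best rank.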

Source: packages/python/goldenmatch/goldenmatch/core/zero_label_confidence.py

Source: docs/design/2026-05-25-zero-label-confidence-autoconfig-design.md

Interfaces

AutoConfig is accessible through multiple interfaces:

CLI

goldenmatch autoconfig <input.csv> [--iterations N] [--output config.yaml]

REST API

Endpoint | Method | Description
/autoconfig | POST | Start auto-configuration run
/autoconfig/status | GET | Get current iteration status
/controller/telemetry | GET | Retrieve full telemetry snapshot

Python API

from goldenmatch import AutoConfigController

controller = AutoConfigController(config)
controller.run(iterations=10)
telemetry = controller.get_telemetry()

SQL Interface (Postgres Extension)

SELECT goldenmatch_autoconfig('customers.csv');
SELECT gm_telemetry();

Source: packages/python/goldenmatch/goldenmatch/core/autoconfig_controller.py

Web UI Integration

The AutoConfig System integrates with the GoldenMatch web workbench for visual monitoring and manual override:

graph LR
    A[Web UI] -->|Ctrl+A| B[AutoConfig Panel]
    B --> C[Iteration List]
    C --> D[Decision Details]
    D --> E[Config Diff View]
    B --> F[Telemetry Charts]
    F --> G[Blocking Summary]
    F --> H[Cluster Summary]

Telemetry Visualization

The frontend displays real-time telemetry including:

  • Blocking Summary: Reduction ratio, block size distribution
  • Scoring Summary: Mass above threshold, mass in the borderline band, dip statistic
  • Cluster Summary: Size distribution, transitivity rate
  • Indicators: Matchkey hit rate, cross-blocking overlap

Source: web/frontend/src/lib/api.ts

Configuration Options

AutoConfig Settings

Parameter | Default | Description
autoconfig.enabled | true | Enable auto-configuration
autoconfig.max_iterations | 10 | Maximum iterations before commit
autoconfig.zero_label_commit | true | Prefer zero-label confidence (v1.23.0+)
autoconfig.recall_target | 0.95 | Target recall for auto-config
autoconfig.precision_floor | 0.90 | Minimum acceptable precision

Learning Memory Integration

AutoConfig integrates with the broader Learning Memory system:

Tuner | Level | Learns From
MemoryLearner | Pair | Pair-level approve/reject decisions
FieldStrategyTuner | Field | Field-level strategy preferences
ClusterThresholdTuner | Cluster | Cluster-level approve/reject decisions

Version History

Version | Change
v1.24.0 | Heuristic rule expansion + diagnostic harness
v1.23.0 | Auto-config commits by zero-label confidence by default
v1.20.0 | Cluster decision tuner (tune_decision_threshold)
v1.19.0 | Native acceleration + autoconfig + probabilistic improvements

Source: packages/python/goldenmatch/goldenmatch/core/autoconfig_controller.py

Best Practices

  1. Start with defaults: The zero-label confidence commit behavior (v1.23.0+) provides sensible defaults for most datasets
  2. Review telemetry: Monitor blocking and cluster summaries to identify oversized blocks or clusters
  3. Use strict mode for evaluation: The _strictAutoconfig flag disables runtime threshold shifts for reproducible results
  4. Integrate with review queue: Feed cluster decisions back to the ClusterThresholdTuner for continuous improvement

Source: https://github.com/benseverndev-oss/goldenmatch / Human Manual

Learning Memory

Related topics: AutoConfig System

Learning Memory is GoldenMatch's adaptive feedback system that continuously improves entity resolution accuracy by learning from human corrections, labeled decisions, and performance feedback. It forms the closed-loop optimization layer that distinguishes GoldenMatch from static rule-based matching systems.

Overview

Learning Memory captures human domain knowledge and transforms it into automated configuration improvements. Rather than requiring users to manually tune thresholds, define field strategies, or configure blocking rules, Learning Memory observes how humans resolve ambiguous cases and propagates those decisions across the entire dataset.

The system operates at three distinct levels:

Level | Tuner | Input | Output
Pair-level | MemoryLearner | Human approve/reject on pair comparisons | Learned matchkey weights and thresholds
Field-level | field_strategy_tuner | Field-level corrections | Per-field strategy selection (exact, fuzzy, tokenized, etc.)
Cluster-level | cluster_decision_tuner | Cluster approve/reject decisions | Per-dataset auto-approve threshold

This multi-level approach ensures that feedback at any granularity flows back into the appropriate configuration layer. Source: packages/python/goldenmatch/CHANGELOG.md

Architecture

Learning Memory consists of several interconnected components that handle storage, learning, and application of corrections.

graph TD
    subgraph "Input Layer"
        A[Human Corrections] --> B[Corrections Store]
        C[Review Queue Feedback] --> B
        D[Ground Truth Labels] --> E[Memory Learner]
    end
    
    subgraph "Learning Layer"
        E --> F[Pair-Level Tuning]
        F --> G[Threshold Adjuster]
        F --> H[Matchkey Weighter]
        E --> I[Field-Level Tuning]
        I --> J[Strategy Selector]
        E --> K[Cluster-Level Tuning]
        K --> L[Decision Threshold Tuner]
    end
    
    subgraph "Output Layer"
        G --> M[AutoConfig Controller]
        H --> M
        J --> M
        L --> M
        M --> N[Matching Pipeline]
    end
    
    subgraph "Feedback Loop"
        N --> O[Review Queue]
        O --> A
    end

Core Components

Component | File | Responsibility
MemoryCorrections | core/memory/corrections.py | Persistent storage of human corrections
MemoryLearner | core/memory/learner.py | Pair-level learning from corrections
FieldStrategyTuner | core/autoconfig_field_strategy_tuner.py | Field-level strategy optimization
ClusterDecisionTuner | core/autoconfig_cluster_threshold_tuner.py | Cluster threshold optimization

Corrections Store

The MemoryCorrections class provides the persistent backing store for all human feedback. It tracks corrections at both the pair level (which records should be linked or unlinked) and the field level (which field values should win in survivorship). Source: packages/python/goldenmatch/goldenmatch/core/memory/corrections.py:1-50

Data Model

from datetime import datetime

class Correction:
    record_id_a: str      # First record in the pair
    record_id_b: str      # Second record in the pair
    decision: str         # "approve" or "reject"
    confidence: float     # Human confidence 0.0-1.0
    source: str           # "human", "ground_truth", "review_queue"
    timestamp: datetime
    metadata: dict        # Additional context

class MemoryCorrections:
    corrections: list[Correction]  # Declared after Correction so the annotation resolves

CRUD Operations

Operation | Method | Description
Create | add_correction() | Record a new correction
Read | get_corrections() | Retrieve corrections with filters
Update | update_correction() | Modify an existing correction
Delete | remove_correction() | Remove a correction

The corrections store supports filtering by:

  • Source type (human, ground_truth, review_queue)
  • Decision type (approve, reject)
  • Date range
  • Record pair

Pair-Level Learning: MemoryLearner

The MemoryLearner processes corrections at the record-pair level and extracts patterns about which matchkey combinations indicate true matches versus false positives. Source: packages/python/goldenmatch/goldenmatch/core/memory/learner.py:1-100

Learning Algorithm

class MemoryLearner:
    def learn(self, corrections: MemoryCorrections) -> LearnedWeights:
        """
        Process corrections and compute updated matchkey weights.
        """

The learner computes:

  1. Precision per matchkey: Ratio of correct to total positive predictions for each matchkey
  2. Recall per matchkey: Coverage of true matches captured by each matchkey
  3. Composite weights: Combination weights that balance precision and recall

Weight Computation

Weights are computed using a modified TF-IDF approach:

Metric | Formula | Purpose
Matchkey Precision | correct_pairs / total_pairs_for_key | How reliable is this key?
Key Frequency | pairs_using_key / total_pairs | How common is this key?
Composite Weight | precision * log(key_frequency + 1) | Balanced importance
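
The three formulas in the table can be worked through with a short Python sketch. It is an illustration rather than the MemoryLearner code: the function name and input shape are hypothetical, each reviewed pair is assumed to be attributed to exactly one matchkey, and the logarithm is assumed to be natural because the table does not state a base.

```python
import math
from collections import defaultdict


def composite_weights(reviewed_pairs):
    """reviewed_pairs: iterable of (matchkey, is_correct) tuples, one per reviewed pair."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for key, is_correct in reviewed_pairs:
        total[key] += 1
        correct[key] += int(is_correct)
    total_pairs = sum(total.values())
    weights = {}
    for key, count in total.items():
        precision = correct[key] / count        # correct_pairs / total_pairs_for_key
        key_frequency = count / total_pairs     # pairs_using_key / total_pairs
        weights[key] = precision * math.log(key_frequency + 1)
    return weights


reviewed = (
    [("email", True)] * 8 + [("email", False)] * 2
    + [("name_zip", True)] * 5 + [("name_zip", False)] * 5
)
weights = composite_weights(reviewed)
print(weights)
```

Both keys cover half of the reviewed pairs, so the frequency term is identical and the email key ends up with the larger weight purely because its precision is 0.8 rather than 0.5.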

Field-Level Learning: FieldStrategyTuner

The FieldStrategyTuner optimizes which scoring strategy to use for each field based on correction patterns. Source: packages/python/goldenmatch/goldenmatch/core/autoconfig_field_strategy_tuner.py:1-80

Available Field Strategies

Strategy | Use Case | Example
exact | Unique identifiers, codes | SSN, account numbers
fuzzy | Names, addresses with typos | "John" vs "Jon"
tokenized | Multi-word fields | "John Smith" vs "Smith, John"
numeric | Numbers with tolerance | Prices, quantities
date | Temporal fields | Birth dates, transaction dates
phonetic | Names with spelling variants | Soundex, Metaphone

Tuning Process

  1. Collect field-level signals from corrections (which field caused the error?)
  2. Compute strategy accuracy per field for each strategy
  3. Select best strategy using cross-validation to avoid overfitting
  4. Generate strategy map: {field_name: strategy_name}
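
Steps 2 and 4 above can be sketched in Python under simplifying assumptions. This is not the FieldStrategyTuner implementation: the three predicate functions, the 0.8 fuzzy cutoff, and the input shape are hypothetical, and the cross-validation in step 3 is deliberately left out, so the sketch simply picks the strategy with the highest accuracy on the supplied examples.

```python
from difflib import SequenceMatcher


def exact(a, b):
    return a == b


def fuzzy(a, b, cutoff=0.8):
    # Character-level similarity from the standard library; the cutoff is illustrative.
    return SequenceMatcher(None, a, b).ratio() >= cutoff


def tokens(text):
    return set(text.replace(",", " ").lower().split())


def tokenized(a, b):
    return tokens(a) == tokens(b)


STRATEGIES = {"exact": exact, "fuzzy": fuzzy, "tokenized": tokenized}


def accuracy(predict, examples):
    """Share of (value_a, value_b, is_match) examples the predicate gets right."""
    return sum(predict(a, b) == label for a, b, label in examples) / len(examples)


def tune(examples_by_field):
    """Return {field_name: strategy_name}; ties go to the first strategy listed."""
    return {
        field: max(STRATEGIES, key=lambda name: accuracy(STRATEGIES[name], examples))
        for field, examples in examples_by_field.items()
    }


strategy_map = tune({
    "name": [
        ("John Smith", "Smith, John", True),
        ("Ann Lee", "Lee Ann", True),
        ("John Smith", "Jane Smith", False),
    ],
    "email": [
        ("a@x.com", "a@x.com", True),
        ("a@x.com", "b@x.com", False),
        ("bob@x.com", "rob@x.com", False),
    ],
})
print(strategy_map)
```

Selecting on the same examples used for scoring is exactly what the cross-validation step is meant to avoid, so treat the output as a demonstration of the strategy map's shape, not as a tuning recommendation.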

Cluster-Level Learning: ClusterDecisionTuner

The ClusterDecisionTuner (introduced in v1.20.0) consumes cluster-level approve/reject decisions and proposes a per-dataset auto-approve threshold. Source: packages/python/goldenmatch/goldenmatch/core/autoconfig_cluster_threshold_tuner.py:1-60

Threshold Tuning Algorithm

class ClusterDecisionTuner:
    def tune_decision_threshold(
        self,
        decisions: list[ClusterDecision],
        target_recall: float = 0.95
    ) -> float:
        """
        Find the threshold that achieves target_recall on approved clusters.
        """

Tuning Inputs

Input | Type | Description
decisions | list[ClusterDecision] | Human cluster decisions
target_recall | float | Desired recall target (default 0.95)
min_approve_confidence | float | Minimum confidence for auto-approve

Tuning Outputs

Output | Type | Description
threshold | float | Suggested auto-approve threshold
expected_precision | float | Estimated precision at this threshold
calibration_curve | list[tuple] | Precision-recall tradeoff points
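
One plausible reading of "find the threshold that achieves target_recall on approved clusters" is sketched below in Python. It is an assumption about the approach, not the tuner's actual algorithm: recall is taken to be the share of human-approved clusters whose confidence is at or above the threshold, and the ClusterDecision dataclass is a hypothetical stand-in built from the fields listed earlier. The calibration curve is omitted.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterDecision:
    # Hypothetical stand-in using the documented field names.
    cluster_id: str
    decision: str        # "approve" or "reject"
    confidence: float
    cluster_size: int


def tune_decision_threshold(decisions, target_recall=0.95):
    """Return (threshold, expected_precision) for the given target recall."""
    approved = sorted(
        (d.confidence for d in decisions if d.decision == "approve"), reverse=True
    )
    if not approved:
        raise ValueError("need at least one approved cluster")
    # Fewest top-confidence approved clusters needed to reach the target recall.
    needed = max(1, math.ceil(target_recall * len(approved)))
    threshold = approved[needed - 1]
    above = [d for d in decisions if d.confidence >= threshold]
    expected_precision = sum(d.decision == "approve" for d in above) / len(above)
    return threshold, expected_precision


decisions = [
    ClusterDecision("c1", "approve", 0.95, 2),
    ClusterDecision("c2", "approve", 0.90, 3),
    ClusterDecision("c3", "reject", 0.85, 4),
    ClusterDecision("c4", "approve", 0.80, 2),
    ClusterDecision("c5", "approve", 0.60, 5),
    ClusterDecision("c6", "reject", 0.50, 2),
    ClusterDecision("c7", "reject", 0.30, 6),
]
threshold, expected_precision = tune_decision_threshold(decisions, target_recall=0.75)
print(threshold, expected_precision)
```

Here three of the four approved clusters must clear the threshold, which puts it at 0.80; one rejected cluster also sits above that line, so the estimated precision is 3 out of 4.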

Auto-Config Integration

Learning Memory integrates with GoldenMatch's AutoConfig system, which automatically optimizes matching parameters. The pick_committed method in the controller determines which candidate configurations to commit. Source: packages/python/goldenmatch/goldenmatch/core/autoconfig_golden_strategy_tuner.py:1-100

Commit Priority

The controller uses a multi-factor ranking for candidate selection:

  1. Health rank: Overall dataset health improvement
  2. Mass separation: Gap between the match and non-match score distributions
  3. Zero-label confidence (v1.23.0+): Preference for higher overall_confidence among zero-label profile candidates

def pick_committed(self, candidates: list[Candidate]) -> Candidate:
    """
    Select the best candidate using health rank, mass separation,
    and zero-label confidence tiebreaker.
    """

AutoConfig Workflow

graph LR
    A[Initialize Config] --> B[Generate Candidates]
    B --> C[Score Candidates]
    C --> D[Learn from Corrections]
    D --> E[Update Weights]
    E --> F[Filter Candidates]
    F --> G{Rank by Health?}
    G -->|Yes| H[Select Best]
    G -->|No| I[Apply Learning Memory]
    I --> E
    H --> J[Commit Configuration]
    J --> K[Execute Matching]
    K --> L[Generate Review Queue]
    L --> M[Human Feedback]
    M --> D

Memory in Postgres Extension

The goldenmatch_pg extension (v0.5.0+) provides native Postgres functions for Learning Memory operations, enabling SQL-native corrections and statistics. Source: packages/python/goldenmatch/CHANGELOG.md

Available Functions

Function | Purpose
memory_learn() | Record corrections from SQL
memory_stats() | Retrieve learning statistics
memory_clear() | Reset corrections for a dataset

SQL Usage Example

-- Record a correction
SELECT memory_learn(
    'record_a_id',
    'record_b_id', 
    'approve',  -- or 'reject'
    0.95
);

-- Get learning statistics
SELECT * FROM memory_stats('my_dataset');

-- Clear corrections for re-learning
SELECT memory_clear('my_dataset');

Usage Patterns

Basic Correction Flow

from goldenmatch import GoldenMatch, MemoryCorrections

# Initialize with memory
gm = GoldenMatch(config)
corrections = MemoryCorrections()

# Run initial matching
results = gm.match(data)

# Present review queue to human
for pair in results.review_queue:
    decision = human_review(pair)
    corrections.add_correction(
        record_id_a=pair.id_a,
        record_id_b=pair.id_b,
        decision=decision,
        confidence=0.95
    )

# Apply learning and re-run
gm.memory_learner.learn(corrections)
refined_results = gm.match(data)  # Uses learned weights

Field Strategy Tuning

from goldenmatch.core.autoconfig_field_strategy_tuner import FieldStrategyTuner

tuner = FieldStrategyTuner(dataset_id="my_dataset")

# Tune from corrections
strategy_map = tuner.tune(
    corrections=corrections,
    fields=["name", "address", "phone", "email"]
)

# Apply to config
config.field_strategies = strategy_map

Cluster Threshold Tuning

from goldenmatch.core.autoconfig_cluster_threshold_tuner import ClusterDecisionTuner

tuner = ClusterDecisionTuner()

# Tune from cluster decisions
threshold = tuner.tune_decision_threshold(
    decisions=cluster_decisions,
    target_recall=0.97
)

# Apply auto-approve threshold
config.auto_approve_threshold = threshold

Configuration Options

Option | Default | Description
memory.enabled | True | Enable/disable learning memory
memory.learning_rate | 0.1 | Rate at which new corrections update weights
memory.decay_factor | 0.95 | Weight decay for older corrections
memory.min_corrections | 10 | Minimum corrections before tuning
memory.strategy | "balanced" | Tuning strategy: balanced, precision, recall

Version History

Version | Feature
v1.19.0 | Initial Learning Memory with MemoryLearner
v1.20.0 | Added ClusterDecisionTuner (third tuner in family)
v1.23.0 | Zero-label confidence tiebreaker in pick_committed
v1.24.0 | Heuristic rule expansion + diagnostic harness

Limitations and Considerations

Data Requirements

  • Learning Memory requires a minimum number of corrections (default: 10) before producing reliable recommendations
  • Highly imbalanced datasets (rare true matches) may need more corrections for accurate threshold tuning

Convergence

  • Weights converge faster when corrections are evenly distributed across matchkey types
  • Cluster threshold tuning may require iterative refinement for datasets with unusual cluster size distributions

Production Considerations

  • Periodically review learned weights to ensure they remain aligned with business rules
  • Reset memory when significant schema changes occur
  • Monitor precision/recall drift over time

Source: https://github.com/benseverndev-oss/goldenmatch / Human Manual

Blocking and Scoring

Related topics: Core Matching Engine

Overview

Blocking and scoring are the two core mechanisms that enable GoldenMatch to perform entity resolution at scale. Without blocking, comparing every record against every other record would result in O(n²) comparisons—a computationally infeasible approach for datasets with millions of records. The blocking phase reduces the candidate space by grouping records that are likely to represent the same entity, while the scoring phase evaluates the similarity of each candidate pair to determine whether they should be merged.

In GoldenMatch, these mechanisms work together through a configurable pipeline that supports multiple blocking strategies (exact, fuzzy, token-based, and approximate nearest neighbor), multiple scoring approaches (string similarity, field weighting, and optional LLM-based evaluation), and a feedback-driven tuning system that learns from user corrections to improve accuracy over time.

Source: packages/python/goldenmatch/examples/README.md

Architecture

The blocking and scoring pipeline follows a multi-stage architecture:

graph TD
    A[Input Records] --> B[Standardization]
    B --> C[Blocking Strategy]
    C --> D[Candidate Pairs]
    D --> E[Field-Level Scoring]
    E --> F[Record-Level Score]
    F --> G[Clustering]
    G --> H[Golden Records]
    C -->|ann_* strategies| I[Approximate Nearest Neighbors]
    F -->|optional| J[Cross-Encoder Reranking]

Components

Component | Purpose | Key Classes/Modules
Blocker | Generates candidate pairs using blocking keys | blocker.py
Matchkey | Defines how fields contribute to blocking and scoring | matchkey.py
Scorer | Computes field and record-level similarity | scoring.py
Cross-Encoder | Optional reranking of candidate pairs | cross_encoder.py
Controller | Orchestrates the pipeline and applies thresholds | controller.py

Source: packages/python/goldenmatch/examples/README.md

Blocking Strategies

Exact Blocking

Exact blocking groups records that share identical values on one or more key fields. This is the simplest and fastest approach, ideal for high-quality, well-standardized data.

blocking = {
    "strategy": "exact",
    "fields": ["email"]
}
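
Conceptually, exact blocking groups records by their key values and only pairs records inside the same group. The Python sketch below illustrates that idea with the standard library; it is not GoldenMatch's blocker, and the function name, the input shape, and the choice to skip records with a missing key are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def exact_block_pairs(records, fields):
    """records: {record_id: {field: value}}; returns candidate pairs as a set of id tuples."""
    blocks = defaultdict(list)
    for record_id, record in records.items():
        key = tuple(record.get(f) for f in fields)
        if None not in key:  # assumption: records missing a key field are not blocked
            blocks[key].append(record_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


records = {
    1: {"email": "a@x.com"},
    2: {"email": "a@x.com"},
    3: {"email": "b@x.com"},
    4: {"email": "a@x.com"},
    5: {"email": None},
}
pairs = exact_block_pairs(records, ["email"])
print(sorted(pairs))
```

Five records would need 10 comparisons without blocking; grouping on email leaves 3 candidate pairs, all within the a@x.com block.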

Sorted Neighborhood Blocking

Records are sorted by their blocking key values, and each record is compared only with its neighbors inside a sliding window. This catches near-duplicates whose keys differ slightly and would therefore land in different blocks under exact blocking.

blocking = {
    "strategy": "sorted_neighborhood",
    "fields": ["last_name", "zip5"],
    "window_size": 5
}
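
The sliding-window idea can be illustrated with a small Python sketch of the classic sorted-neighborhood method. It is not GoldenMatch's implementation: the function name and input shape are hypothetical, and window_size is assumed to mean that each record is compared with the next window_size - 1 records in sort order.

```python
def sorted_neighborhood_pairs(records, fields, window_size=3):
    """records: {record_id: {field: value}}; returns candidate pairs as a set of id tuples."""
    # Sort record ids by the concatenated blocking key.
    ordered = sorted(records, key=lambda rid: tuple(str(records[rid][f]) for f in fields))
    pairs = set()
    for i, left in enumerate(ordered):
        # Compare each record with the next window_size - 1 records only.
        for right in ordered[i + 1 : i + window_size]:
            pairs.add((min(left, right), max(left, right)))
    return pairs


records = {
    1: {"last_name": "smith", "zip5": "10001"},
    2: {"last_name": "smyth", "zip5": "10001"},
    3: {"last_name": "jones", "zip5": "94105"},
    4: {"last_name": "smith", "zip5": "10002"},
}
pairs = sorted_neighborhood_pairs(records, ["last_name", "zip5"], window_size=3)
print(sorted(pairs))
```

Records 1 ("smith") and 2 ("smyth") have different keys, so exact blocking on these fields would never pair them, but they fall within the same window after sorting. The pair (2, 3) is not generated because "jones" and "smyth" sit at opposite ends of the sort order.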

Multi-Pass Blocking

For complex datasets, multiple blocking passes with different keys can capture different types of matches:

blocking = [
    {"strategy": "exact", "fields": ["email"]},
    {"strategy": "sorted_neighborhood", "fields": ["last_name", "first_name"], "window_size": 5},
    {"strategy": "sorted_neighborhood", "fields": ["phone"]}
]

Source: packages/python/goldenmatch/examples/README.md

Approximate Nearest Neighbor (ANN) Blocking

For high-cardinality string fields, ANN blocking provides efficient similarity-based candidate generation:

blocking = {
    "strategy": "ann_l2",
    "fields": ["full_address"],
    "distance_threshold": 0.3
}

The v1.24.0 release introduced the QIS-bucket strategy and Chao1 scale-aware cardinality estimation, which together improve blocking accuracy on large datasets. On a 10M-record benchmark, the performance work in this release cut wall-clock time by 81% (2604s → 502s) while maintaining F1=0.9886.

Source: README.md

Blocking Configuration Fields

Field | Type | Description
strategy | string | One of: exact, sorted_neighborhood, ann_l2, ann_cosine, canopy, learned
fields | list[string] | Fields to use for blocking
window_size | int | Window size for sorted neighborhood (default: 3)
distance_threshold | float | Distance threshold for ANN strategies
extras | dict | Advanced strategy-specific parameters

Source: packages/python/goldenmatch/web/frontend/src/lib/types.ts

Scoring

Field-Level Scoring

Each field contributes a similarity score based on its configured strategy:

Strategy | Description | Use Case
levenshtein | Character-level edit distance | Names, addresses
jaro_winkler | Optimized for short strings | Names
token_set | Set intersection of tokens | Address components
numeric | Absolute difference / range | Dates, amounts
exact | Binary match/no-match | IDs, codes
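
To show what three of these strategies compute, here is a self-contained Python sketch. The functions are textbook versions, not GoldenMatch's scorers: the normalization of edit distance to a 0-1 similarity and the use of Jaccard overlap for token_set are assumptions, and the function names are hypothetical.

```python
def levenshtein_similarity(a, b):
    """1 - edit_distance / max_length, so identical strings score 1.0."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if characters match)
            ))
        prev = curr
    return 1 - prev[-1] / max(len(a), len(b))


def token_set_similarity(a, b):
    """Jaccard overlap of lowercase whitespace tokens; word order is ignored."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def exact_similarity(a, b):
    return 1.0 if a == b else 0.0


print(levenshtein_similarity("john", "jon"))
print(token_set_similarity("12 main st", "12 main street"))
print(exact_similarity("A-100", "A-100"))
```

The name pair differs by one deleted character out of four, the address pair shares two of four distinct tokens, and the ID pair is identical, which shows why each strategy suits the use case listed for it.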

Matchkey Definition

Matchkeys define how fields participate in both blocking and scoring:

matchkeys = [
    {
        "fields": ["email"],
        "blocking": True,
        "score_type": "exact",
        "weight": 1.0
    },
    {
        "fields": ["first_name", "last_name"],
        "blocking": True,
        "score_type": "token_set",
        "weight": 0.8
    },
    {
        "fields": ["phone"],
        "blocking": True,
        "score_type": "levenshtein",
        "weight": 0.6
    }
]

Source: packages/python/goldenmatch/examples/README.md

Record-Level Scoring

The record-level score combines field scores using weighted averaging:

record_score = Σ(field_score × field_weight) / Σ(field_weight)

Pairs whose record-level score exceeds the threshold are linked. By default, the threshold is tuned automatically from the dataset's characteristics by the autoconfig system introduced in v1.20.0.
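
The weighted average can be checked by hand with a short Python sketch. The weights reuse the values from the matchkey example earlier in this section; the field scores and the function name are hypothetical.

```python
def record_score(field_scores, field_weights):
    """Weighted average of field scores: sum(score * weight) / sum(weight)."""
    total_weight = sum(field_weights[f] for f in field_scores)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    weighted = sum(field_scores[f] * field_weights[f] for f in field_scores)
    return weighted / total_weight


weights = {"email": 1.0, "name": 0.8, "phone": 0.6}
scores = {"email": 1.0, "name": 0.75, "phone": 0.5}  # illustrative field scores
score = record_score(scores, weights)
print(round(score, 4))
```

The numerator is 1.0 + 0.6 + 0.3 = 1.9 and the weights sum to 2.4, giving a record-level score of roughly 0.79. Whether that pair is linked then depends on the configured threshold.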

Source: README.md

Weighted Matchkeys

GoldenMatch supports sophisticated weighting schemes:

matchkeys = [
    {"fields": ["company_name"], "weight": 0.7, "score_type": "token_set"},
    {"fields": ["address", "city"], "weight": 0.3, "score_type": "token_set"},
    # Optional multi-pass with different weight profiles
]

The equipment deduplication example demonstrates multi-pass blocking with ANN fallback, weighted fuzzy matching, and LLM calibration for challenging datasets.

Source: packages/python/goldenmatch/examples/README.md

Cross-Encoder Reranking

The optional cross-encoder module provides a secondary scoring pass that considers field interactions:

from goldenmatch.core.cross_encoder import CrossEncoderScorer

scorer = CrossEncoderScorer(model="cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(record_a, record_b) for record_a, record_b in candidate_pairs]
reranked = scorer.score_batch(pairs)

Cross-encoding is particularly valuable when field combinations carry more signal than individual fields—for example, a name and address together are more distinctive than either alone.

Source: packages/python/goldenmatch/examples/README.md

Configuration Payload

The web frontend communicates blocking and scoring configuration using a typed payload:

export type RulesPayload = {
  threshold: number;
  matchkeys: Matchkey[];
  standardization?: StandardizationRules | null;
  blocking?: BlockingPayload | null;
};

Known blocking keys are validated at the server boundary, while unknown keys are preserved in an extras field for advanced strategies:

const BLOCKING_KNOWN_KEYS = new Set([
  "strategy", "fields", "window_size", "distance_threshold",
  "_block_size", "skip_oversized", "auto_suggest", "auto_select"
]);

Source: packages/python/goldenmatch/web/frontend/src/lib/types.ts

Standardization Pipeline

Effective blocking and scoring depend on data standardization. GoldenFlow provides the transform library that should be applied before matching:

Transform Category | Examples
Text | strip, lowercase, normalize_unicode, normalize_quotes
Phone | phone_e164, phone_national
Address | address (full address standardization)
Numeric | extract_numbers, parse_currency

standardization = {
    "email": "lowercase",
    "phone": "phone_e164",
    "address": "address",
    "state": "state"
}

Source: packages/python/goldenflow/README.md

Performance Considerations

Bucket Strategies

The QIS-bucket strategy (v1.24.0) provides scale-aware cardinality estimation that adjusts bucket parameters based on dataset size. This prevents both over-blocking (too many candidates) and under-blocking (missing matches).

Memory Reduction

The v1.24.0 release achieved an 18% RSS reduction through optimized data structures and streaming processing. Key optimizations include:

  • Single-group-by-per-column vectorization for golden record building
  • Lazy evaluation of scoring for low-confidence pairs
  • Chunked processing for very large candidate sets

Source: README.md

Auto-Configuration

GoldenMatch can automatically tune blocking and scoring parameters:

from goldenmatch.core.autoconfig import auto_configure

config = auto_configure(
    data=df,
    ground_truth=labels_df,  # Optional for supervised tuning
    target_recall=0.95
)

The autoconfig system uses:

  • Chao1 estimation for cardinality-aware blocking tuning
  • Zero-label confidence analysis for threshold calibration (v1.23.0)
  • Cluster-level tuning for decision threshold optimization (v1.20.0)

Source: README.md

Memory and Learning

GoldenMatch maintains learned patterns across runs:

Memory Type | Scope | Purpose
MemoryLearner | Pair-level | Learn from labeled match/non-match pairs
field-strategy tuner | Field-level | Optimize per-field scoring strategy
cluster-decision tuner | Cluster-level | Tune merge/reject thresholds

These learning mechanisms enable the system to improve accuracy over time as users correct its decisions.

Source: README.md

Workflow Summary

graph LR
    A[Raw Data] --> B[GoldenFlow Standardization]
    B --> C[Blocking]
    C --> D[Scoring]
    D --> E[Clustering]
    E --> F{Threshold}
    F -->|Above| G[Auto-Approve]
    F -->|Below| H[Auto-Reject]
    F -->|Uncertain| I[Review Queue]
    I --> J[User Labels]
    J --> K[Memory Update]
    K --> C
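
The three-way split at the Threshold node can be sketched in Python as below. The manual does not say how the uncertain band is defined, so the two cutoffs, their names, and their default values are assumptions made for illustration.

```python
def route(score, approve_at=0.90, reject_below=0.60):
    """Route a scored cluster; both cutoffs are hypothetical values."""
    if score >= approve_at:
        return "auto_approve"
    if score < reject_below:
        return "auto_reject"
    return "review_queue"  # uncertain band between the two cutoffs


for s in (0.97, 0.75, 0.42):
    print(s, route(s))
```

Only the middle band reaches a human reviewer, and the labels collected there are what feed the memory update step in the diagram.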

Source: https://github.com/benseverndev-oss/goldenmatch / Human Manual

Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.

Found 6 structured pitfall item(s), including 0 high/blocking item(s). Top priority: Capability evidence risk - Capability evidence risk requires verification.

1. Capability evidence risk: Capability evidence risk requires verification

  • Severity: medium
  • Finding: README/documentation is current enough for a first validation pass.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: capability.assumptions | github_repo:1183640892 | https://github.com/benseverndev-oss/goldenmatch

2. Maintenance risk: Maintenance risk requires verification

  • Severity: medium
  • Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: evidence.maintainer_signals | github_repo:1183640892 | https://github.com/benseverndev-oss/goldenmatch

3. Security or permission risk: Security or permission risk requires verification

  • Severity: medium
  • Finding: no_demo
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: downstream_validation.risk_items | github_repo:1183640892 | https://github.com/benseverndev-oss/goldenmatch

4. Security or permission risk: Security or permission risk requires verification

  • Severity: medium
  • Finding: no_demo
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: risks.scoring_risks | github_repo:1183640892 | https://github.com/benseverndev-oss/goldenmatch

5. Maintenance risk: Maintenance risk requires verification

  • Severity: low
  • Finding: issue_or_pr_quality=unknown.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: evidence.maintainer_signals | github_repo:1183640892 | https://github.com/benseverndev-oss/goldenmatch

6. Maintenance risk: Maintenance risk requires verification

  • Severity: low
  • Finding: release_recency=unknown.
  • User impact: May increase setup, validation, or first-run risk for the user.
  • Recommended check: Reproduce the official install and quickstart path in an isolated environment.
  • Evidence: evidence.maintainer_signals | github_repo:1183640892 | https://github.com/benseverndev-oss/goldenmatch

Source: Doramagic discovery, validation, and Project Pack records

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Sources: 11 (count of project-level external discussion links exposed on this manual page).

Use: review before install. Open the linked issues or discussions before treating the pack as ready for your environment.

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using goldenmatch with real data or production workflows.

Source: Project Pack community evidence and pitfall evidence