Doramagic Project Pack · Human Manual
goldenmatch
Home
Related topics: Getting Started, Suite Packages Overview
Golden Suite
A polyglot data-quality and entity-resolution toolkit. Polished, opinionated, AI-native.
*GoldenCheck profiles → GoldenFlow standardizes → GoldenMatch deduplicates → GoldenPipe orchestrates. With InferMap for schema mapping and a Rust extension layer for Postgres / DuckDB.*
Overview
Golden Suite is a comprehensive toolkit for data quality and entity resolution, designed to handle the complete lifecycle of messy data: profiling, standardization, deduplication, and orchestration. The project supports both Python and TypeScript ecosystems, with optional Rust acceleration for high-performance workloads.
Source: README.md:1-5
Packages Overview
| Tool | Languages | Purpose | Install |
|---|---|---|---|
| GoldenMatch | Python · TS | Zero-config entity resolution. Fuzzy + exact + probabilistic + LLM. Headline package. | pip install goldenmatch · npm i goldenmatch |
| GoldenCheck | Python · TS | Data-quality scanning: encoding, Unicode, format validation, anomaly detection. | pip install goldencheck · npm i goldencheck |
| GoldenFlow | Python · TS | Transforms & standardizers: phone, date, address, categorical normalization. | pip install goldenflow · npm i goldenflow |
| GoldenPipe | Python · TS | Orchestrator that wires Check → Flow → Match into one declarative pipeline. | pip install goldenpipe · npm i goldenpipe |
| InferMap | Python · TS | Schema mapping engine — auto-aligns columns across heterogeneous sources. | pip install infermap · npm i infermap |
| goldenmatch-extensions | Rust | Postgres extension (pgrx) + DuckDB UDFs. SQL-native fuzzy matching. | source build |
| dbt-goldensuite | dbt · Python | dbt package — quality-gate tests, correction CRUD macros + GoldenCheck assertions for warehouse models. | pip install dbt-goldensuite |
| goldencheck-action | YAML | GitHub Action — CI with PR comments for data validation. | uses: benseverndev-oss/goldencheck-action@v1 |
Source: README.md:28-41
Architecture
graph LR
A[Raw Data] --> B[GoldenCheck<br/>Profile & Validate]
B --> C[GoldenFlow<br/>Standardize]
C --> D[InferMap<br/>Schema Mapping]
D --> E[GoldenMatch<br/>Deduplicate]
E --> F[GoldenPipe<br/>Orchestrate]
G[Postgres/DuckDB] --> E
H[GitHub CI] --> B
I[MCP Server] --> F
Data Flow
- GoldenCheck profiles your data and discovers quality rules automatically
- GoldenFlow transforms messy fields into canonical formats
- InferMap aligns columns across heterogeneous schemas
- GoldenMatch identifies and merges duplicate records
- GoldenPipe orchestrates the entire pipeline declaratively
Source: packages/python/goldencheck/README.md:1-10
Quick Start
Python
# Headline package: dedupe a CSV in 30 seconds
pip install goldenmatch && goldenmatch dedupe customers.csv
Source: README.md:64-66
TypeScript / Node.js
# TypeScript / Edge runtimes
npm install goldenmatch
Source: README.md:69
GoldenMatch
The headline package for entity resolution. Supports multiple matching strategies:
- Fuzzy matching — handles typos and variations
- Exact matching — bit-for-bit comparisons
- Probabilistic matching — Bayesian-style confidence scoring
- LLM matching — semantic clustering with language models
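To make "Bayesian-style confidence scoring" concrete, here is a minimal sketch of the classical Fellegi-Sunter approach to probabilistic record linkage. It is an illustration only: the source does not say which model GoldenMatch uses, and the field names, the `m`/`u` values, and the prior are all hypothetical.

```python
import math

# Hypothetical per-field probabilities (illustrative values, not GoldenMatch defaults):
#   m = P(field agrees | the two records are a true match)
#   u = P(field agrees | the two records are not a match)
FIELD_PARAMS = {
    "name": {"m": 0.95, "u": 0.01},
    "zip": {"m": 0.90, "u": 0.05},
}


def match_weight(agreements: dict[str, bool]) -> float:
    """Sum the log2 likelihood ratios over all compared fields."""
    total = 0.0
    for field, agrees in agreements.items():
        m = FIELD_PARAMS[field]["m"]
        u = FIELD_PARAMS[field]["u"]
        if agrees:
            total += math.log2(m / u)              # agreement raises the weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    return total


def match_probability(weight: float, prior: float = 0.01) -> float:
    """Turn a weight into a posterior probability using Bayes' rule in odds form."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * 2 ** weight
    return posterior_odds / (1 + posterior_odds)


if __name__ == "__main__":
    w = match_weight({"name": True, "zip": True})
    print(f"weight={w:.2f} probability={match_probability(w):.3f}")
```

The weight is additive across fields, which is what makes per-field score explanations straightforward in this family of models.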
Key Features
- Zero-config dedup for common cases
- Configurable matchkeys and blocking strategies
- Field-level score explanations
- Streaming/incremental matching for new records
- PPRL (Privacy-Preserving Record Linkage) for cross-organization matching
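The blocking strategies mentioned in the list above can be illustrated with a small standard-library sketch. It is not GoldenMatch's implementation: the blocking key (ZIP code), the `difflib` similarity, the 0.85 threshold, and the sample records are all assumptions chosen for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Illustrative sample data.
records = [
    {"id": 0, "name": "John Smith", "zip": "10001"},
    {"id": 1, "name": "Jon Smith", "zip": "10001"},
    {"id": 2, "name": "Mary Jones", "zip": "94105"},
    {"id": 3, "name": "Mary Jonse", "zip": "94105"},
    {"id": 4, "name": "Alan Turing", "zip": "10001"},
]


def blocking_key(rec: dict) -> str:
    """Only records sharing this key are ever compared (hypothetical choice: ZIP)."""
    return rec["zip"]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def dedupe(recs: list[dict], threshold: float = 0.85) -> list[list[int]]:
    # 1. Blocking: group records by key so we avoid comparing every pair.
    blocks = defaultdict(list)
    for rec in recs:
        blocks[blocking_key(rec)].append(rec)

    # 2. Union-find structure to merge matching pairs into clusters.
    parent = {rec["id"]: rec["id"] for rec in recs}

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # 3. Score candidate pairs within each block and link those above threshold.
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if similarity(a["name"], b["name"]) >= threshold:
                parent[find(a["id"])] = find(b["id"])

    # 4. Collect clusters of record ids.
    clusters = defaultdict(list)
    for rec in recs:
        clusters[find(rec["id"])].append(rec["id"])
    return sorted(sorted(c) for c in clusters.values())


if __name__ == "__main__":
    print(dedupe(records))
```

Blocking trades a little recall for a large reduction in comparisons: records 0 and 2 are never scored against each other because they fall in different blocks.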
Performance
v1.24.0 cut wall time on the 10M-record benchmark by 81% while leaving F1 unchanged:
| Metric | Before | After | Change |
|---|---|---|---|
| 10M records (wall time) | 2604s | 502s | -81% |
| Peak RSS | baseline | 18% below baseline | -18% |
| F1 Score | 0.9886 | 0.9886 | unchanged |
Source: packages/python/goldenmatch/README.md:1-50
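As a quick check of the wall-time figures, the reported percentage can be recomputed from the two timings. The only inputs are the 2604s and 502s values from the table; the variable names are illustrative.

```python
# Wall-time figures reported for the 10M-record benchmark.
before_s = 2604
after_s = 502

reduction = (before_s - after_s) / before_s  # fraction of wall time removed
speedup = before_s / after_s                 # how many times faster the run is

print(f"wall-time reduction: {reduction:.1%}")
print(f"speedup: {speedup:.2f}x")
```

The reduction rounds to the 81% quoted in the release notes, which corresponds to a speedup of a little over 5x.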
GoldenCheck
Data validation that discovers rules from your data so you don't have to write them.
Core Capabilities
- Automatic rule discovery from data patterns
- Encoding and Unicode validation
- Format validation and anomaly detection
- Health score grading (A-F)
- Multiple output formats: HTML, JSON, TUI
Domain Type Packs
Community-contributed semantic type definitions for improved detection:
| Domain | Types | Description |
|---|---|---|
| healthcare | 10 | NPI, ICD codes, insurance IDs, patient demographics, CPT, DRG |
| finance | 8 | Account numbers, routing numbers, CUSIP/ISIN, currency, transactions |
| ecommerce | 9 | SKUs, order IDs, tracking numbers, categories, shipping |
Source: packages/typescript/goldencheck-types/README.md:1-30
GoldenFlow
Transforms and standardizers for messy data fields.
Transform Categories
| Category | Count | Examples |
|---|---|---|
| Text Transforms | 18 | strip, lowercase, normalize_unicode, remove_html_tags |
| Phone Transforms | 5 | phone_e164, phone_national, phone_format |
| Date Transforms | 7 | date_parse, date_format, date_floor |
| Address Transforms | 6 | address_parse, address_standardize |
| Numeric Transforms | 4 | parse_currency, parse_number |
| Categorical Transforms | 4 | category_normalize, category_map |
Source: packages/python/goldenflow/README.md:1-100
InferMap
Inference-driven schema mapping engine. Maps messy source columns to known target schemas with confidence scores and human-readable reasoning.
Supported Data Sources
- CSV files
- DataFrames
- Database tables
- In-memory records
TypeScript Compatibility
- Next.js Server Components
- Route Handlers
- Server Actions
- Edge Runtime
Source: packages/python/infermap/README.md:1-80
Optional Components
Native Acceleration
For maximum performance on large datasets:
pip install "goldenmatch[native]"
This pulls goldenmatch-native, a separately distributed compiled (Rust/PyO3 abi3) runtime.
MCP Server (Claude Desktop)
pip install "goldencheck[mcp]"
Source: packages/python/goldencheck/README.md:1-50
Integrations
dbt
Add data-quality gates to dbt:
# packages.yml (next to dbt_project.yml)
packages:
  - package: benseverndev-oss/dbt-goldensuite
GitHub Actions
- uses: benseverndev-oss/goldencheck-action@v1
  with:
    files: "data/*.csv"
    fail-on: error
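The `fail-on` input implies a severity threshold. The sketch below shows one plausible way such a gate could work. It is not the action's code: the severity names other than `error`, their ordering, and the shape of a finding are assumptions.

```python
# Hypothetical severity ladder; only "error" appears in the workflow example.
SEVERITY_ORDER = {"info": 0, "warning": 1, "error": 2}


def should_fail(findings: list[dict], fail_on: str = "error") -> bool:
    """Return True if any finding is at or above the configured severity."""
    threshold = SEVERITY_ORDER[fail_on]
    return any(SEVERITY_ORDER[f["severity"]] >= threshold for f in findings)


if __name__ == "__main__":
    findings = [
        {"column": "email", "severity": "warning"},
        {"column": "phone", "severity": "info"},
    ]
    # A CI step would typically exit non-zero when the gate trips.
    print("fail build:", should_fail(findings, fail_on="error"))
```

Under these assumptions, lowering `fail_on` to `warning` makes the same findings fail the build.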
Airflow
12 drop-in DAGs available at examples/airflow/.
MCP Container
Run from a single MCP container:
docker run ghcr.io/benseverndev-oss/goldensuite-mcp:latest
Source: README.md:42-60
Web UI
GoldenMatch includes an interactive web workbench:
pip install "goldenmatch[web]"
goldenmatch serve-ui <project>
Features:
- Pair drilldown with cluster members
- Field-level diff view
- Natural language explanations per pair
Source: README.md:58-62
Latest Release
v1.24.0: on the 10M-record QIS-bucket-realistic benchmark, wall time dropped from 2604s to 502s (-81%) with F1 unchanged at 0.9886 and an 18% reduction in RSS.
Key improvements:
- ~15 performance PRs
- Chao1 scale-aware cardinality
- Heuristic rule expansion
- Diagnostic harness
See CHANGELOG.md for the full PR list.
Source: community_context
Getting Help
| Resource | Link |
|---|---|
| Documentation | Wiki |
| Examples | Python examples · TypeScript examples |
| Issues | GitHub Issues |
| Discussions | GitHub Discussions |
License
MIT — see LICENSE
Source: https://github.com/benseverndev-oss/goldenmatch / Human Manual
Getting Started
Related topics: Installation, System Architecture
The Golden Suite is a polyglot data-quality and entity-resolution toolkit designed for deduplication, data standardization, and schema mapping. It consists of five core packages: GoldenCheck (data validation), GoldenFlow (transforms and standardization), GoldenMatch (record deduplication), GoldenPipe (pipeline orchestration), and InferMap (schema mapping). Source: README.md:1-10
This guide walks you through installation, basic usage patterns, and recommended starting points for common use cases.
Suite Packages Overview
Related topics: Core Matching Engine
The Golden Suite is a polyglot data-quality and entity-resolution toolkit that provides a complete pipeline from data profiling to deduplication. The suite consists of multiple purpose-built packages that can be used independently or composed together for end-to-end workflows.
Source: README.md:1-15
Package Architecture
graph TD
A[Raw Data] --> B[GoldenCheck]
B --> C[GoldenFlow]
C --> D[InferMap]
D --> E[GoldenMatch]
E --> F[GoldenPipe]
G[goldenmatch-extensions] --> E
H[dbt-goldensuite] --> B
I[goldencheck-types] --> B
J[goldencheck-action] --> B
K[GoldenCheck MCP] --> B
L[Goldensuite MCP] --> F
Package Summary
| Package | Languages | Purpose | Install |
|---|---|---|---|
| GoldenMatch | Python, TypeScript | Zero-config entity resolution. Fuzzy + exact + probabilistic + LLM | pip install goldenmatch / npm i goldenmatch |
| GoldenCheck | Python, TypeScript | Data-quality scanning: encoding, Unicode, format validation, anomaly detection | pip install goldencheck / npm i goldencheck |
| GoldenFlow | Python, TypeScript | Transforms & standardizers: phone, date, address, categorical normalization | pip install goldenflow / npm i goldenflow |
| GoldenPipe | Python, TypeScript | Orchestrator that wires Check → Flow → Match into one declarative pipeline | pip install goldenpipe / npm i goldenpipe |
| InferMap | Python, TypeScript | Schema mapping engine — auto-aligns columns across heterogeneous sources | pip install infermap / npm i infermap |
| goldenmatch-extensions | Rust | Postgres extension (pgrx) + DuckDB UDFs. SQL-native fuzzy matching | source build |
| dbt-goldensuite | dbt, Python | dbt package — quality-gate tests, correction CRUD macros + GoldenCheck assertions | pip install dbt-goldensuite |
| goldencheck-action | — | GitHub Action for CI with PR comments | marketplace |
| goldencheck-types | TypeScript | Community-contributed domain type packs (healthcare, finance, e-commerce) | npm i goldencheck-types |
Source: README.md:80-100
GoldenCheck
GoldenCheck is the data-quality scanning and profiling component of the Golden Suite. It detects issues such as encoding problems, Unicode anomalies, format violations, and data anomalies.
Core Capabilities
- Encoding Detection: Identifies and reports encoding issues in text fields
- Unicode Validation: Detects malformed Unicode sequences and normalization issues
- Format Validation: Validates against expected formats (email, phone, URL, etc.)
- Anomaly Detection: Statistical analysis to identify outliers and unusual patterns
- LLM-Powered Analysis: Uses language models to identify semantic data quality issues missed by automated profilers
Source: packages/python/goldencheck/README.md:1-50
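Statistical anomaly detection of the kind listed above is often done with the interquartile-range (IQR) rule. The sketch below shows that rule using only the standard library. Whether GoldenCheck uses IQR, z-scores, or something else is not stated in the source, so treat the method and the `k=1.5` multiplier as assumptions.

```python
import statistics


def iqr_outliers(values: list[float], k: float = 1.5) -> list[float]:
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


if __name__ == "__main__":
    order_totals = [10, 11, 12, 11, 10, 12, 11, 500]
    print(iqr_outliers(order_totals))
```

The IQR rule is robust to the outliers it is trying to find, unlike a mean-and-standard-deviation test, where a single extreme value inflates the spread.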
Domain Type Packs
GoldenCheck supports domain-specific type definitions through community-contributed packs:
| Domain | Types Included |
|---|---|
| Healthcare | NPI, ICD codes, insurance IDs, patient demographics, CPT, DRG |
| Finance | Account numbers, routing numbers, CUSIP/ISIN, currency, transactions |
| E-commerce | SKUs, order IDs, tracking numbers, categories, shipping |
Source: packages/typescript/goldencheck-types/README.md:1-30
MCP Server
GoldenCheck includes an MCP (Model Context Protocol) server providing 10 agent-level tools, including:
- analyze_data: Domain detection and strategy recommendation
- auto_triage: Automated issue classification
- explain_finding: Natural language explanation of findings
- explain_column: Column-level analysis
- compare_domains: Cross-domain comparison
- generate_handoff: Pipeline handoff generation
Source: packages/typescript/goldencheck/src/node/mcp/agent-tools.ts:1-50
GoldenFlow
GoldenFlow provides 76+ transformation functions for standardizing messy data fields. It focuses on transforming data before matching to improve deduplication accuracy.
Source: packages/python/goldenflow/README.md:1-50
Transform Categories
#### Text Transforms (18)
| Transform | Description |
|---|---|
| strip | Trim whitespace |
| lowercase / uppercase | Case conversion |
| title_case | Proper casing ("john smith" → "John Smith") |
| normalize_unicode | NFKD normalization, strip accents |
| normalize_quotes | Smart/curly quotes → straight quotes |
| collapse_whitespace | Multiple spaces → single space |
| remove_punctuation | Strip punctuation characters |
| remove_html_tags | Strip HTML markup from scraped data |
| fix_mojibake | Fix common UTF-8/Latin-1 encoding garbling |
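Several of the text transforms in the table are easy to approximate with the standard library, which helps clarify what each one does. The functions below are local stand-ins written for this example; they are not GoldenFlow's implementations, and their names and composition order are assumptions.

```python
import re
import unicodedata

# Curly quotes mapped to their straight equivalents.
_QUOTE_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single curly quotes
    "\u201c": '"', "\u201d": '"',   # double curly quotes
})


def strip_accents(text: str) -> str:
    """NFKD-decompose, then drop combining marks (e.g. 'é' becomes 'e')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def straighten_quotes(text: str) -> str:
    """Replace smart/curly quotes with straight quotes."""
    return text.translate(_QUOTE_MAP)


def squeeze_whitespace(text: str) -> str:
    """Collapse any run of whitespace into a single space."""
    return re.sub(r"\s+", " ", text)


def clean(text: str) -> str:
    """Compose the stand-ins: accents, quotes, whitespace, strip, lowercase."""
    return squeeze_whitespace(straighten_quotes(strip_accents(text))).strip().lower()


if __name__ == "__main__":
    print(clean("  Jos\u00e9   \u201cPepe\u201d  O\u2019Neil "))
```

Applying the same chain to every source before matching means that differences in accents, quote style, or spacing no longer count against a candidate pair.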
#### Phone Transforms (5)
| Transform | Description |
|---|---|
| phone_e164 | Any format → +15550123456 |
| phone_national | Any format → (555) 012-3456 |
#### Date Transforms
- Date parsing and normalization across multiple formats
- Timezone normalization
#### Domain-Specific Transforms
| Domain | Capabilities |
|---|---|
| Healthcare | NPI, ICD, CPT, DRG parsing, transaction dates, amount parsing |
| E-commerce | SKU normalization, price parsing, order dates, address standardization |
| Real Estate | Property addresses, listing dates, price normalization, geo fields |
Source: packages/python/goldenflow/README.md:50-120
GoldenMatch
GoldenMatch is the core entity resolution (deduplication) engine. It supports multiple matching strategies and scales to millions of records.
Source: README.md:80-85
Matching Strategies
| Strategy | Use Case |
|---|---|
| Fuzzy Matching | Name/address variants with typo tolerance |
| Exact Matching | Identifier deduplication |
| Probabilistic Matching | Record linkage with confidence scores |
| LLM Clustering | Semantic product matching for complex domains |
| PPRL | Privacy-preserving record linkage (cross-organization) |
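Privacy-preserving record linkage is commonly built on keyed Bloom-filter encodings of character bigrams, compared with the Dice coefficient. The sketch below shows that general technique. The source does not describe GoldenMatch's PPRL scheme, so the encoding, the filter size, the number of hash functions, and all function names are assumptions.

```python
import hashlib
import hmac


def bigrams(text: str) -> set[str]:
    """Character bigrams of the padded, lowercased string."""
    padded = f"_{text.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def encode(text: str, secret: bytes, size: int = 1024, num_hashes: int = 4) -> set[int]:
    """Return the set of bit positions a keyed Bloom filter would set for `text`."""
    bits = set()
    for gram in bigrams(text):
        for k in range(num_hashes):
            digest = hmac.new(secret, f"{k}:{gram}".encode(), hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:8], "big") % size)
    return bits


def dice(a: set[int], b: set[int]) -> float:
    """Dice coefficient of two bit-position sets, in [0, 1]."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    key = b"shared-secret"  # both organizations must agree on this key out of band
    smith = encode("smith", key)
    print("smith vs smyth:", round(dice(smith, encode("smyth", key)), 2))
    print("smith vs jones:", round(dice(smith, encode("jones", key)), 2))
```

Each party shares only the encoded bit positions, so similar names still score as similar while the raw values never leave the organization.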
Performance Benchmarks
| Dataset Size | Wall Time | Peak RSS | F1 Score |
|---|---|---|---|
| 10M records (QIS-bucket-realistic) | 502s (-81%) | 18% reduction | 0.9886 |
Source: v1.24.0 Release Notes
Native Acceleration
An optional Rust-based acceleration runtime is available:
pip install "goldenmatch[native]"
This pulls goldenmatch-native, a separately distributed compiled (Rust/PyO3 abi3) runtime. The native runtime is discovered automatically when installed.
Source: v1.21.0 Release Notes
Configuration Options
| Option | Description |
|---|---|
| backend | "bucket" (recommended for 5M+ records), "chunked" |
| auto_config | Auto-tune recall thresholds |
| lineage_provenance | Track source row for golden record fields |
InferMap
InferMap is a schema mapping engine that automatically aligns columns across heterogeneous data sources. It supports both Python and TypeScript with full API parity.
Source: packages/python/infermap/README.md:1-40
Key Features
- Auto Schema Alignment: Detects and maps columns across different source schemas
- Custom Scorers: Configurable similarity scoring algorithms
- Domain Dictionaries: Industry-specific vocabulary for better matching
- Calibration Tools: Score matrix introspection and tuning
- Edge Runtime Support: Works in Vercel Edge Runtime and Next.js
Source: packages/python/infermap/README.md:40-80
GoldenPipe
GoldenPipe is the orchestrator that wires Check → Flow → Match into a single declarative pipeline. It enables pipeline definitions in YAML or Python.
Source: README.md:85-90
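A declarative pipeline separates the description of the stages from the code that runs them. The toy sketch below illustrates that separation for a Check → Flow → Match chain. It does not use GoldenPipe's API: the stage functions, the `STAGES` registry, and the list-based spec are hypothetical.

```python
from typing import Callable

Record = dict[str, str]
Stage = Callable[[list[Record]], list[Record]]


def check(records: list[Record]) -> list[Record]:
    """Toy quality gate: drop records with a missing or empty email."""
    return [r for r in records if r.get("email")]


def flow(records: list[Record]) -> list[Record]:
    """Toy standardizer: trim and lowercase the email field."""
    return [{**r, "email": r["email"].strip().lower()} for r in records]


def match(records: list[Record]) -> list[Record]:
    """Toy deduplicator: keep the first record seen for each email."""
    seen: dict[str, Record] = {}
    for r in records:
        seen.setdefault(r["email"], r)
    return list(seen.values())


# Registry of available stages, looked up by name.
STAGES: dict[str, Stage] = {"check": check, "flow": flow, "match": match}

# The declarative part: the pipeline is data, not code.
PIPELINE = ["check", "flow", "match"]


def run(spec: list[str], records: list[Record]) -> list[Record]:
    """Apply each named stage in order."""
    for name in spec:
        records = STAGES[name](records)
    return records


if __name__ == "__main__":
    raw = [{"email": " A@x.com "}, {"email": "a@x.com"}, {"email": ""}]
    print(run(PIPELINE, raw))
```

Because the spec is plain data, it could equally be loaded from a YAML file, which is the usual motivation for this design.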
Pipeline Stages
graph LR
A[Check] --> B[Flow]
B --> C[Match]
C --> D[Output]
style A fill:#e1f5fe
style B fill:#fff3e0
style C fill:#e8f5e9
style D fill:#f3e5f5
Deployment Options
| Runtime | Description |
|---|---|
| Airflow | 12 drop-in DAGs for daily/incremental/warehouse-native dedupe |
| MCP Container | Single container: docker run ghcr.io/benseverndev-oss/goldensuite-mcp:latest |
| Python/CLI | Direct script execution |
Source: README.md:90-110
goldenmatch-extensions
A Rust-based Postgres extension using pgrx that provides SQL-native fuzzy matching capabilities.
Source: packages/rust/extensions/README.md:1-30
Capabilities
- 13 pgrx Functions: Core API parity with Python/TypeScript
- goldenflow_* Transforms: SQL-level data standardization
- memory_learn/memory_stats CRUD: Learning memory database operations
- DuckDB UDFs: Fuzzy matching in DuckDB queries
Source: goldenmatch_pg v0.5.0 Release
dbt-goldensuite
A dbt package that adds Golden Suite capabilities as dbt tests and macros.
Features
- Quality-gate tests for warehouse models
- Correction CRUD macros for data repair
- GoldenCheck assertions integrated with dbt test framework
Source: packages/python/goldenmatch/dbt-goldensuite/README.md:1-20
goldencheck-action
GitHub Action for CI integration that:
- Runs GoldenCheck scans on PR data changes
- Posts PR comments with findings
- Fails builds based on configurable severity thresholds
Source: packages/actions/goldencheck/README.md:1-30
Examples and Quick Start
The repository includes comprehensive examples organized by deployment target:
Source: examples/README.md:1-30
| Directory | Audience | Highlights |
|---|---|---|
| python/ | Python users | 6 scripts: zero-config quickstart, full Suite composed, customer 360, PPRL |
| typescript/ | TypeScript/edge users | 4 scripts: quickstart, Vercel-Edge, MCP client |
| sql/ | SQL/warehouse users | DuckDB + Postgres core-API examples |
| airflow/ | Data-platform users | 12 drop-in DAGs |
Quick Start Commands
# Headline package: dedupe a CSV
pip install goldenmatch && goldenmatch dedupe customers.csv
# TypeScript / Edge
npm install goldenmatch
# Full suite via MCP
docker run ghcr.io/benseverndev-oss/goldensuite-mcp:latest
Source: README.md:110-130
Installation
Related topics: Getting Started
The Golden Suite is a polyglot data-quality and entity-resolution toolkit available across Python, TypeScript/Node.js, Rust, and SQL environments. This page documents installation methods for all suite packages, optional dependencies, system requirements, and deployment patterns.
System Requirements
| Component | Minimum Version | Recommended |
|---|---|---|
| Python | 3.11+ | 3.11, 3.12 |
| Node.js | 20+ | 22 LTS |
| Rust | 1.75+ (for extensions) | Latest stable |
| PostgreSQL | 15+ (for extensions) | 16+ |
| DuckDB | 1.0+ (for extensions) | Latest |
Source: README.md:1-15
Quick Start
# Entity resolution (Python)
pip install goldenmatch
# TypeScript / Edge runtimes
npm install goldenmatch
Source: README.md:55-60
Python Installation
All Python packages are published to PyPI and support pip installation with optional extras.
Core Packages
| Package | Purpose | Install Command |
|---|---|---|
| goldenmatch | Zero-config entity resolution (fuzzy + exact + probabilistic + LLM) | pip install goldenmatch |
| goldencheck | Data-quality scanning and validation | pip install goldencheck |
| goldenflow | Data transforms and standardization | pip install goldenflow |
| goldenpipe | Pipeline orchestrator | pip install goldenpipe |
| infermap | Schema mapping engine | pip install infermap |
Source: packages/python/goldenmatch/README.md, packages/python/goldencheck/README.md, packages/python/goldenflow/README.md, packages/python/goldenpipe/README.md, packages/python/infermap/README.md
GoldenMatch Optional Dependencies
GoldenMatch supports modular installation through extras:
# Basic installation
pip install goldenmatch
# With native acceleration (Rust/PyO3 abi3 runtime)
pip install "goldenmatch[native]"
# With LLM support (Anthropic SDK)
pip install "goldenmatch[llm]"
# With baseline profiling support
pip install "goldenmatch[baseline]"
# With semantic type inference
pip install "goldenmatch[semantic]"
# With web UI
pip install "goldenmatch[web]"
# With MCP server
pip install "goldenmatch[mcp]"
# Full installation with all extras
pip install "goldenmatch[native,llm,baseline,semantic,web,mcp]"
Source: packages/python/goldenmatch/README.md, README.md:50-55
GoldenCheck Optional Dependencies
# Basic installation
pip install goldencheck
# With LLM enhancement
pip install "goldencheck[llm]"
# With MCP server for Claude Desktop
pip install "goldencheck[mcp]"
# With all extras
pip install "goldencheck[llm,mcp]"
Source: packages/python/goldencheck/README.md
TypeScript / Node.js Installation
All TypeScript packages are published to npm with zero runtime dependencies for the core packages (edge-safe).
# Core packages
npm install goldenmatch
npm install goldencheck
npm install goldenflow
npm install goldenpipe
npm install infermap
# MCP server (Node.js only)
npm install @benseverndev-oss/goldensuite-mcp
Source: packages/typescript/goldenmatch/README.md, packages/typescript/goldencheck/README.md
Peer Dependencies
Some TypeScript examples require optional peer dependencies:
| Package | Purpose | Install Command |
|---|---|---|
| yaml | YAML configuration parsing | npm install yaml |
| nodejs-polars | Parquet reading (Node.js only) | auto-installed when needed |
| csv-parse | CSV reading (Node.js only) | auto-installed when needed |
| @modelcontextprotocol/sdk | MCP server (Node.js only) | auto-installed when needed |
Source: packages/typescript/goldenmatch/examples/README.md
Docker Installation
For a self-contained MCP server deployment, use the official container image:
# Pull (if not cached) and run the latest MCP server image
docker run ghcr.io/benseverndev-oss/goldensuite-mcp:latest
Source: README.md:40
GoldenMatch Native Acceleration
As of v1.21.0, GoldenMatch offers an optional compiled Rust runtime via the goldenmatch-native package. This is separately distributed and uses PyO3 abi3 for compatibility.
graph TD
A[User: pip install goldenmatch] --> B{Has native extra?}
B -->|No| C[Pure-Python wheel]
B -->|Yes| D[pip install goldenmatch-native]
D --> E[Rust/PyO3 abi3 runtime]
E --> F[Auto-discovery at runtime]
C --> G[Standard Polars backend]
F --> H[Optimized clustering + scoring kernels]
G --> H
Source: v1.21.0 Release Notes, README.md
Installation Commands for Native
# Option 1: Install with native extra (recommended)
pip install "goldenmatch[native]"
# Option 2: Install separately
pip install goldenmatch
pip install goldenmatch-native
Source: v1.21.0 Release Notes
Rust Extensions (PostgreSQL / DuckDB)
For SQL-native fuzzy matching, install the Rust extension package.
PostgreSQL Extension
# Clone and build from source
git clone https://github.com/benseverndev-oss/goldenmatch.git
cd goldenmatch/packages/rust/extensions
cargo build --release
-- Then, in a PostgreSQL session (for example psql), load the extension
CREATE EXTENSION goldenmatch_pg;
Source: packages/rust/extensions/README.md
DuckDB UDFs
The same Rust package provides DuckDB UDFs for in-database matching:
# Install via source build
cargo build --release
# UDFs are loaded via SQL commands in DuckDB
Source: packages/rust/extensions/README.md
MCP Server Setup
The Golden Suite includes an MCP (Model Context Protocol) server for Claude Desktop integration.
Python Installation
pip install "goldencheck[mcp]"
# or for full suite
pip install "goldenmatch[mcp]"
Source: packages/python/goldencheck/README.md
Claude Desktop Configuration
Add to your Claude Desktop config (claude_desktop_config.json):
{
  "mcpServers": {
    "goldencheck": {
      "command": "python",
      "args": ["-m", "goldencheck.mcp.server"]
    }
  }
}
Source: packages/python/goldencheck/README.md
Docker Deployment
For production MCP deployments:
docker run ghcr.io/benseverndev-oss/goldensuite-mcp:latest
Source: README.md:40, packages/python/goldensuite-mcp/README.md
dbt Integration
Install the dbt package for data warehouse quality gates:
pip install dbt-goldensuite
Source: README.md:35
GitHub Actions
For CI/CD integration:
pip install goldencheck
# or use the action directly in workflows
Source: README.md:35
Airflow DAGs
Run the Golden Suite as Airflow DAGs:
# Install Airflow
pip install apache-airflow
# Use pre-built DAGs from examples/
cp -r examples/airflow/ /path/to/airflow/dags/
Source: examples/README.md
Verification
Verify installation with the following commands:
# Python packages
python -c "import goldenmatch; print(goldenmatch.__version__)"
python -c "import goldencheck; print(goldencheck.__version__)"
python -c "import goldenflow; print(goldenflow.__version__)"
# TypeScript packages
node -e "console.log(require('goldenmatch/package.json').version)"
# CLI tools
goldenmatch --version
goldencheck --version
goldenflow --version
Common Installation Issues
Python Version Mismatch
GoldenMatch requires Python 3.11+. Check your version:
python --version
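If it is older, install Python 3.11 or newer first, then create a virtual environment with that interpreter (substitute its executable for python below):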
python -m venv golden-env
source golden-env/bin/activate # Linux/macOS
# or
golden-env\Scripts\activate # Windows
pip install goldenmatch
Native Extension Not Found
If goldenmatch-native isn't auto-discovered:
# Reinstall with native extra
pip uninstall goldenmatch-native
pip install "goldenmatch[native]"
MCP Server Connection Issues
For Claude Desktop MCP integration, ensure the config is in the correct location:
- Linux: ~/.config/Claude/claude_desktop_config.json
- macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
- Windows: %APPDATA%\Claude\claude_desktop_config.json
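A small script can confirm that the config file is where Claude Desktop expects it and that it registers the server. The sketch below uses the three paths listed above and the `mcpServers` key from the configuration example. The helper names are illustrative, and treating every non-Windows, non-macOS platform as Linux is a simplifying assumption.

```python
import json
import os
import sys
from pathlib import Path

CONFIG_NAME = "claude_desktop_config.json"


def config_path(platform: str = sys.platform) -> Path:
    """Return the expected config location for the given platform string."""
    if platform.startswith("win"):
        return Path(os.environ.get("APPDATA", "")) / "Claude" / CONFIG_NAME
    if platform == "darwin":
        return Path.home() / "Library" / "Application Support" / "Claude" / CONFIG_NAME
    return Path.home() / ".config" / "Claude" / CONFIG_NAME  # assumed Linux layout


def has_server(config_text: str, name: str = "goldencheck") -> bool:
    """True if the JSON config registers an MCP server under `name`."""
    config = json.loads(config_text)
    return name in config.get("mcpServers", {})


if __name__ == "__main__":
    path = config_path()
    if not path.exists():
        print(f"No config found at {path}")
    else:
        registered = has_server(path.read_text(encoding="utf-8"))
        print(f"{path}: goldencheck registered = {registered}")
```

A `json.JSONDecodeError` from `has_server` usually means a trailing comma or a missing brace in the config file, which is a common cause of the server silently not loading.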
Installation Hierarchy
graph TB
subgraph "Full Stack Installation"
A[Golden Suite MCP Container] --> B[GoldenMatch + Native]
A --> C[GoldenCheck + MCP]
A --> D[GoldenFlow]
A --> E[GoldenPipe]
A --> F[InferMap]
end
subgraph "Python-Only Stack"
G[goldenmatch] --> H[Polars]
G --> I[Polars-Runtime]
G --> J[goldenmatch-native optional]
end
subgraph "TypeScript-Only Stack"
K[goldenmatch] --> L[Zero deps]
K --> M[nodejs-polars optional]
end
System Architecture
Related topics: Backend Systems, Core Matching Engine, Blocking and Scoring
The Golden Suite is a polyglot data-quality and entity-resolution toolkit designed for AI-native workflows. The architecture follows a modular, pipeline-oriented design where each component addresses a specific stage of data processing: profiling, standardization, schema mapping, and deduplication. Source: packages/python/goldenmatch/README.md
Architecture Overview
The system consists of five core Python packages, three TypeScript packages, and a Rust extension layer. Each package is independently installable and can operate standalone or as part of an orchestrated pipeline.
Package Overview
| Package | Language | Purpose | Install Command |
|---|---|---|---|
| GoldenMatch | Python · TS | Zero-config entity resolution (fuzzy + exact + probabilistic + LLM) | pip install goldenmatch · npm i goldenmatch |
| GoldenCheck | Python · TS | Data-quality scanning: encoding, Unicode, format validation, anomaly detection | pip install goldencheck · npm i goldencheck |
| GoldenFlow | Python · TS | Transforms & standardizers: phone, date, address, categorical normalization | pip install goldenflow · npm i goldenflow |
| GoldenPipe | Python · TS | Orchestrator wiring Check → Flow → Match into declarative pipelines | pip install goldenpipe · npm i goldenpipe |
| InferMap | Python · TS | Schema mapping engine — auto-aligns columns across heterogeneous sources | pip install infermap · npm i infermap |
| goldenmatch-extensions | Rust | Postgres extension (pgrx) + DuckDB UDFs for SQL-native fuzzy matching | source build |
| dbt-goldensuite | dbt · Python | dbt package with quality-gate tests and correction CRUD macros | pip install dbt-goldensuite |
| goldencheck-action | Action | GitHub Action for CI with PR comments | via GitHub Marketplace |
Source: packages/python/goldenmatch/README.md
Core Data Flow
The canonical pipeline processes data through four stages:
graph TD
A[Raw Data CSV/Parquet] --> B[GoldenCheck<br/>Profile & Validate]
B --> C[GoldenFlow<br/>Standardize & Transform]
C --> D[InferMap<br/>Schema Mapping]
D --> E[GoldenMatch<br/>Deduplicate & Merge]
E --> F[Golden Records<br/>with Provenance]
B -->|Findings Report| G[Data Quality Score]
E -->|Cluster Decisions| H[Memory Learning]
H -->|Auto-config| E
Pipeline Stage Details
| Stage | Input | Output | Key Capabilities |
|---|---|---|---|
| Profile | Raw CSV/Parquet | Schema, statistics, health score | Encoding detection, null analysis, cardinality profiling |
| Standardize | Messy fields | Normalized fields | Phone E.164, date parsing, address standardization, unicode normalization |
| Map | Heterogeneous schemas | Column alignments | Domain dictionaries, custom scorers, confidence scoring |
| Match | Canonical records | Duplicate clusters | Fuzzy matching, blocking, probabilistic scoring, LLM clustering |
Source: packages/python/goldenflow/README.md
GoldenMatch Architecture
GoldenMatch is the headline package, providing entity resolution (ER) capabilities. The v1.24.0 release achieved 81% wall-clock reduction (2604s → 502s) on 10M record datasets through the QIS-bucket-realistic optimization path. Source: Community Context - v1.24.0
Match Pipeline Components
graph LR
A[Input Records] --> B[Blocking<br/>Key Generation]
B --> C[Candidate Pair<br/>Generation]
C --> D[Vectorized<br/>Scoring Kernels]
D --> E[Probabilistic<br/>Classifier]
E --> F[Cluster<br/>Formation]
F --> G[Golden Record<br/>Survivorship]
G --> H[Output with<br/>Provenance]
Backend Modes
GoldenMatch supports multiple execution backends optimized for different dataset scales:
| Backend | Use Case | Records | Performance |
|---|---|---|---|
| chunked | Development / small datasets | < 1M | Single-threaded baseline |
| bucket | Recommended 5M-on-one-node config | 1-10M | 5x wall reduction, 2x RSS reduction |
| native | Maximum performance | Any | Rust/PyO3 abi3 compiled kernels |
The bucket backend became the recommended path in v1.16.0, processing 5M records in 9.94 minutes with 6.4 GB peak RSS on a single 16-core node. Source: packages/python/goldenmatch/README.md
Native Acceleration Layer
The optional native runtime (pip install "goldenmatch[native]") ships the compiled _native kernel as a separately distributed wheel. The runtime is discovered automatically at import time:
import goldenmatch
# Automatically detects and uses native runtime if available
result = goldenmatch.dedupe("customers.csv")
Source: Community Context - v1.21.0, goldenmatch-native v0.1.0
GoldenCheck Architecture
GoldenCheck provides data validation that discovers rules from your data. The TypeScript implementation follows a scanner-reporter pattern with an LLM enhancement layer. Source: packages/python/goldencheck/README.md
Core Engine Components
| Component | File Location | Purpose |
|---|---|---|
| Scanner | src/core/engine/scanner.ts | Analyzes data files, profiles columns |
| Engine | src/core/engine/ | Executes discovered quality rules |
| Confidence | src/core/engine/confidence.ts | Applies severity downgrades based on findings |
| Triage | src/core/engine/triage.ts | Auto-categorizes findings by priority |
| Fixer | src/core/engine/fixer.ts | Applies automated corrections |
| Agent | src/core/agent/ | Strategy selection and explanation |
Source: packages/typescript/goldencheck/src/node/mcp/agent-tools.ts
LLM Integration Layer
The TypeScript implementation includes a comprehensive LLM interface for semantic analysis:
interface LLMResponse {
  columns: Record<string, LLMColumnAssessment>;
  relations: LLMRelation[];
}
interface LLMColumnAssessment {
  semantic_type: string | null;
  issues: LLMIssue[];
  upgrades: LLMUpgrade[];
  downgrades: LLMDowngrade[];
}
Source: packages/typescript/goldencheck/src/core/llm/prompts.ts
Semantic Type System
GoldenCheck ships with a bundled base type system defined as TypeScript constants (no runtime YAML dependency):
export const BASE_TYPES: Readonly<Record<string, TypeDef>> = {
  identifier: {
    nameHints: ["id", "key", "pk", "code", "sku", "number", "num", "record"],
    valueSignals: { min_unique_pct: 0.95 },
    suppress: ["cardinality", "pattern_consistency", "drift_detection"],
  },
  person_name: {
    nameHints: ["first_name", "last_name", "full_name", ...],
    valueSignals: { mixed_case: true },
    suppress: ["pattern_consistency", "cardinality"],
  },
  email: {
    nameHints: ["email", "mail", "e_mail"],
    valueSignals: { format_match: "email", min_match_pct: 0.70 },
    suppress: ["pattern_consistency"],
  },
  phone: {
    nameHints: ["phone", "tel", "fax", "mobile", "cell"],
    valueSignals: { format_match: "phone", min_match_pct: 0.70 },
    suppress: ["type_inference", "pattern_consistency"],
  },
  address: {
    nameHints: ["address", "street", "addr", "line1", "line2"],
    valueSignals: { avg_length_min: 15 },
    suppress: ["pattern_consistency", "cardinality"],
  },
  free_text: {
    nameHints: ["notes", "comments", "description", ...],
    // ...
  },
};
Source: packages/typescript/goldencheck/src/core/semantic/types.ts
Reporter System
The JSON reporter produces machine-readable output matching the spec schema:
interface ReportOutput {
  file: string;
  rows: number;
  columns: number;
  health_grade: string;
  health_score: number;
  summary: { errors: number; warnings: number; info: number };
  findings: Array<{
    severity: string;
    column: string;
    check: string;
    message: string;
    affected_rows: number;
    sample_values: string[];
  }>;
}
Source: packages/typescript/goldencheck/src/core/reporters/json.ts
GoldenFlow Transform Architecture
GoldenFlow provides 76 transforms organized into semantic categories for data standardization. Source: packages/python/goldenflow/README.md
Transform Categories
| Category | Count | Examples |
|---|---|---|
| Text Transforms | 18 | strip, lowercase, uppercase, normalize_unicode, remove_html_tags |
| Phone Transforms | 5 | phone_e164, phone_national, phone_format |
| Date Transforms | 8 | date_parse, date_format, fuzzy_date |
| Numeric Transforms | 6 | parse_currency, extract_numbers, round_precision |
| Address Transforms | 7 | standardize_address, parse_components |
| Categorical Transforms | 4 | normalize_category, fuzzy_category |
Domain-Specific Transforms
| Domain | Transforms |
|---|---|
| Healthcare | NPI normalization, ICD code parsing, insurance ID formatting, CPT/DRG parsing |
| Finance | Account number formatting, routing number validation, CUSIP/ISIN parsing, currency normalization |
| E-commerce | SKU normalization, price parsing, order date standardization, address standardization |
| Real Estate | Property address parsing, listing date normalization, price standardization, geo field extraction |
Source: packages/python/goldenflow/README.md
InferMap Architecture
InferMap is an inference-driven schema mapping engine that automatically aligns columns across heterogeneous data sources. Source: packages/python/infermap/README.md
MCP Server Tools
The TypeScript implementation exposes a comprehensive MCP tool interface:
| Tool | Purpose |
|---|---|
| inspect | Analyze schema of a data source |
| suggest-mappings | Propose column alignments with confidence scores |
| apply | Generate remapped output file |
| compare-schemas | Side-by-side schema comparison |
| domain-mapping | Domain-specific mapping using dictionaries |
Source: packages/typescript/infermap/src/node/mcp/server.ts
Scorer Architecture
InferMap supports custom scorers for mapping decisions:
graph TD
A[Source Column] --> B[Name Similarity]
A --> C[Type Compatibility]
A --> D[Value Distribution]
A --> E[Domain Dictionary]
B --> F[Composite Score]
C --> F
D --> F
E --> F
F --> G[Mapping Confidence]
Source: packages/python/infermap/README.md
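As a rough illustration of how the four signals in the diagram could be folded into a single confidence value, the sketch below takes a weighted mean of per-signal scores. The function name `composite_score`, the signal keys, and the weights are all hypothetical; they are not InferMap's API or its defaults.

```python
from typing import Mapping

# Hypothetical weights for the four signals in the diagram; not InferMap defaults.
DEFAULT_WEIGHTS = {
    "name_similarity": 0.4,
    "type_compatibility": 0.2,
    "value_distribution": 0.2,
    "domain_dictionary": 0.2,
}


def composite_score(signals: Mapping[str, float],
                    weights: Mapping[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted mean of the signals that are present, each expected in [0, 1]."""
    used = [name for name in weights if name in signals]
    total_weight = sum(weights[name] for name in used)
    if total_weight == 0:
        return 0.0  # no usable signal, so no confidence
    return sum(signals[name] * weights[name] for name in used) / total_weight


score = composite_score({
    "name_similarity": 0.9,
    "type_compatibility": 1.0,
    "value_distribution": 0.5,
    "domain_dictionary": 0.0,
})
```

Normalizing by the weights of the signals actually supplied means a missing signal (for example, no domain dictionary) does not drag the score toward zero.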
GoldenPipe Orchestration Layer
GoldenPipe wires Check → Flow → Match into declarative pipelines with zero boilerplate. Source: packages/python/goldenmatch/README.md
Pipeline Declaration
from goldenpipe import Pipeline

pipeline = (
    Pipeline("customer-dedup")
    .check("raw_data.csv")
    .flow(standardization_rules)
    .map(source_schema, target_schema)
    .match(blocking_rules, match_config)
    .output("golden_records.csv")
)
pipeline.run()
Integration Points
| Integration | Package | Use Case |
|---|---|---|
| Airflow DAGs | examples/airflow/ | 12 drop-in DAG templates |
| dbt | dbt-goldensuite | Quality-gate tests for warehouse models |
| GitHub Actions | goldencheck-action | PR-level data validation |
| MCP Server | goldensuite-mcp | Single-container MCP deployment |
Source: packages/python/goldenmatch/README.md
MCP Agent Tools Architecture
The GoldenCheck MCP server exposes 10 agent-level tools for Claude Desktop integration; eight of them are listed below. Source: packages/typescript/goldencheck/src/node/mcp/agent-tools.ts
export const AGENT_TOOLS: readonly Tool[] = [
  { name: "analyze_data", description: "Analyze data file to detect domain and recommend strategy" },
  { name: "explain_finding", description: "Explain a specific finding in natural language" },
  { name: "explain_column", description: "Explain column quality assessment" },
  { name: "auto_triage", description: "Auto-categorize findings by priority" },
  { name: "apply_fixes", description: "Apply automated corrections" },
  { name: "compare_domains", description: "Compare data against known domain schemas" },
  { name: "generate_handoff", description: "Generate pipeline handoff documentation" },
  { name: "build_review_queue", description: "Build prioritized review queue" },
];
Agent Tool Execution Flow
graph TD
A[Claude Desktop] -->|MCP Protocol| B[Agent Tools Layer]
B --> C{Command Router}
C -->|analyze_data| D[Scanner Engine]
C -->|explain_*| E[Agent Explanation]
C -->|apply_fixes| F[Fixer Engine]
C -->|auto_triage| G[Triage Engine]
D --> H[Findings Report]
E --> I[Natural Language Output]
F --> J[Corrected Data]
G --> K[Prioritized Queue]
Source: packages/typescript/goldencheck/src/node/mcp/agent-tools.ts
Memory Learning System
The Learning Memory system in GoldenMatch consists of three tuners that consume user feedback to auto-configure thresholds: Source: Community Context - v1.20.0
| Tuner | Level | Input | Output |
|---|---|---|---|
| MemoryLearner | Pair-level | Approve/reject pair decisions | Per-field auto-approve threshold |
| Field-Strategy Tuner | Field-level | Field match quality feedback | Per-field match strategy selection |
| Cluster-Decision Tuner | Cluster-level | Cluster approve/reject decisions | Per-dataset auto-approve threshold |
Auto-Configuration Flow
graph LR
A[User Feedback] --> B{Memory System}
B --> C[Pair-Level Learning]
B --> D[Field-Level Learning]
B --> E[Cluster-Level Learning]
C --> F[Auto-Config Commit]
D --> F
E --> F
F --> G[Threshold Proposals]
G --> H[Match Pipeline]
Zero-Label Confidence (v1.23.0)
The v1.23.0 release enabled auto-config commits based on zero-label confidence by default. The controller's pick_committed tiebreaker prefers higher -overall_confidence candidates over higher -mass_separation among same-health-rank candidates carrying a zero_label profile. Source: Community Context - v1.23.0
Golden Record Provenance
The v1.22.0 release added field-level golden-record provenance tracking: Source: Community Context - v1.22.0
Provenance Data Model
# When provenance=True is enabled:
result = build_golden_records_batch(records, provenance=True)

# Each field dict includes:
{
    "value": "John Smith",
    "source_row_id": "__row_id__ of winning record",
    "survivorship_winner": True
}
Lineage Configuration
| Config Option | Type | Default | Description |
|---|---|---|---|
| config.output.lineage_provenance | bool | False | Enable field-level source tracking |
Source: Community Context - v1.22.0
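To make the field dict shape above concrete, here is a minimal sketch that builds one golden record from a cluster and records which row supplied each value. The function `sketch_golden_record` and its "longest non-empty value wins" survivorship rule are assumptions chosen for illustration; the library's actual survivorship logic is not shown in the source.

```python
def sketch_golden_record(cluster_rows, fields):
    """Pick one winning value per field and record which row supplied it."""
    golden = {}
    for field in fields:
        candidates = [row for row in cluster_rows if row.get(field)]
        if not candidates:
            # No row has a usable value for this field.
            golden[field] = {"value": None, "source_row_id": None,
                             "survivorship_winner": False}
            continue
        # Hypothetical survivorship rule: the longest non-empty value wins.
        winner = max(candidates, key=lambda row: len(str(row[field])))
        golden[field] = {
            "value": winner[field],
            "source_row_id": winner["__row_id__"],
            "survivorship_winner": True,
        }
    return golden


rows = [
    {"__row_id__": 11, "name": "J. Smith", "phone": ""},
    {"__row_id__": 12, "name": "John Smith", "phone": "555-0100"},
]
golden = sketch_golden_record(rows, fields=["name", "phone"])
```

Because every field carries its own `source_row_id`, a golden record can mix values from different source rows and still be traced back field by field.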
TypeScript Package Structure
The TypeScript packages follow a consistent structure optimized for edge runtimes:
packages/typescript/
├── goldencheck/
│ ├── src/
│ │ ├── core/
│ │ │ ├── engine/ # Scanner, fixer, triage, confidence
│ │ │ ├── agent/ # Strategy selection, explanation
│ │ │ ├── llm/ # LLM interface, prompts
│ │ │ ├── reporters/ # JSON, HTML output formatters
│ │ │ ├── semantic/ # Type definitions, domain matching
│ │ │ └── types.ts # Core type definitions
│ │ └── node/
│ │ └── mcp/ # MCP server, agent tools
│ └── domains/ # Bundled domain packs (YAML)
├── goldencheck-types/
│ ├── domains/ # Community-contributed domain packs
│ └── src/ # Type definitions
└── infermap/
├── src/
│ ├── core/ # Mapping engine
│ └── node/
│ └── mcp/ # MCP server tools
└── domains/ # Domain dictionaries
Zero Runtime Dependencies
The core TypeScript packages have no runtime dependencies (edge-safe):
| Package | Dependencies |
|---|---|
| goldencheck core | None (pure TypeScript) |
| goldencheck-types | js-yaml (dev: tsup, vitest, typescript) |
| infermap core | None (pure TypeScript) |
Optional peer dependencies:
- nodejs-polars: Parquet reading (Node.js only)
- csv-parse: CSV reading (Node.js only)
- @modelcontextprotocol/sdk: MCP server (Node.js only)
Source: packages/typescript/goldencheck-types/package.json
Cross-Language Record Fingerprint
The v1.21.0 release introduced cross-language record fingerprinting enabling consistent record identification across Python, TypeScript, and Rust execution environments. Source: Community Context - v1.21.0
Rust Extensions Layer
The goldenmatch-extensions package provides SQL-native fuzzy matching through Postgres UDFs (pgrx) and DuckDB UDFs: Source: Community Context - v1.24.0
Extension Capabilities (v0.5.0)
| Capability | Postgres Functions | DuckDB Functions |
|---|---|---|
| Core API parity | 13 pgrx functions | Multiple UDFs |
| GoldenFlow transforms | goldenflow_* transforms | Equivalent UDFs |
| Memory learning | memory_learn CRUD | memory_stats CRUD |
Performance Characteristics
Scaling Benchmarks (v1.24.0)
| Dataset Size | Wall Clock | Memory (RSS) | Configuration |
|---|---|---|---|
| 100K records | ~30s | < 1 GB | Default |
| 1M records | ~2 min | ~2 GB | Default |
| 5M records | 9.94 min | 6.4 GB | backend="bucket", 16-core |
| 10M records | 8.37 min | ~5 GB | backend="bucket", optimized |
The QIS-bucket-realistic path achieves F1=0.9886 invariant across all scales. Source: Community Context - v1.24.0
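The 10M-record figures can be cross-checked with a few lines of arithmetic. The snippet below uses only the two wall-clock values quoted in this document (2604s and 502s) and derives the percentage reduction, the speedup factor, and the minute value shown in the table.

```python
baseline_s = 2604   # 10M-record wall clock before the optimization, from the text
optimized_s = 502   # 10M-record wall clock in v1.24.0, from the text

reduction_pct = (baseline_s - optimized_s) / baseline_s * 100
speedup = baseline_s / optimized_s
optimized_min = optimized_s / 60

print(f"{reduction_pct:.1f}% reduction, {speedup:.2f}x faster, {optimized_min:.2f} min")
# prints "80.7% reduction, 5.19x faster, 8.37 min"
```

The 80.7% result rounds to the quoted 81%, and 502 seconds matches the 8.37 min entry for 10M records.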
Memory Optimization
The native Rust acceleration (goldenmatch-native) provides:
- 18% RSS reduction on large datasets
- Compiled clustering kernels
- Optimized block-scoring operations
Source: Community Context - v1.19.0
Related Documentation
| Topic | Documentation |
|---|---|
| Getting Started | GoldenMatch Quick Start |
| Python API | packages/python/goldenmatch/ |
| TypeScript API | packages/typescript/goldenmatch/ |
| MCP Integration | Web UI Wiki |
| ER Agent | ER Agent / A2A Wiki |
| Examples | Python Examples, TypeScript Examples |
Source: https://github.com/benseverndev-oss/goldenmatch / Human Manual
Backend Systems
GoldenMatch provides a pluggable backend architecture that allows the entity resolution engine to execute across different computational substrates. This design enables users to choose the optimal backend based on dataset size, infrastructure constraints, and performance requirements.
Overview
The backend abstraction decouples the core matching logic from data execution, allowing GoldenMatch to scale from single-threaded operations on small datasets to distributed clusters processing tens of millions of records.
graph TD
User[User Code] --> Config[GoldenMatch Config]
Config --> Registry[Backend Registry]
Registry --> Polars[Polars Backend]
Registry --> Bucket[Bucket Backend]
Registry --> DuckDB[DuckDB Backend]
Registry --> Ray[Ray Backend]
Polars --> Data1[In-Memory Data]
Bucket --> Data2[Chunked Data]
DuckDB --> Data3[SQL Engine]
Ray --> Data4[Distributed Data]
subgraph Core["Core ER Engine"]
Blocking[Blocking]
Scoring[Pair Scoring]
Clustering[Clustering]
end
Data1 --> Core
Data2 --> Core
Data3 --> Core
Data4 --> Core
Backend Types
Polars Backend
The default backend, designed for single-node operation with multi-threaded execution. It leverages Polars' native vectorized operations for blocking, scoring, and clustering.
| Characteristic | Value |
|---|---|
| Execution Model | Single node, multi-threaded |
| Memory Model | In-memory |
| Typical Dataset Size | Up to 5M records |
| Dependencies | polars |
| Configuration Key | backend="polars" |
The Polars backend is the recommended starting point for datasets under 5 million records. Source: packages/python/goldenmatch/goldenmatch/backends/polars_backend.py
Bucket Backend
An evolution of the Polars backend optimized for larger datasets on single nodes. The bucket backend partitions data into manageable chunks that are processed sequentially, reducing peak memory consumption.
| Characteristic | Value |
|---|---|
| Execution Model | Single node, chunked processing |
| Memory Model | Memory-efficient, streaming |
| Typical Dataset Size | 5M-10M records |
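| RSS Reduction | ~2x peak RSS reduction vs v1.15 chunked baseline (v1.16.0); 18% reduction in v1.24.0 |
| Wall Time Reduction | ~5x vs v1.15 chunked baseline (v1.16.0); 81% on 10M records in v1.24.0 |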
Source: packages/python/goldenmatch/goldenmatch/backends/bucket_backend.py Source: docs/scale-envelope.md
v1.16.0+ Performance Note: The backend="bucket" path achieves 5M records in 9.94 minutes with 6.4 GB peak RSS on a 16-core node. This represents a 5x wall reduction and 2x peak RSS reduction compared to the v1.15 chunked baseline. Source: README.md
DuckDB Backend
Provides SQL-native fuzzy matching capabilities, enabling GoldenMatch operations to execute within DuckDB queries. This is particularly useful for warehouse-native entity resolution workflows.
| Characteristic | Value |
|---|---|
| Execution Model | DuckDB SQL engine |
| Memory Model | Arrow-based, out-of-core |
| Use Case | Warehouse-native ER, SQL integration |
| Key Functions | goldenflow_* transforms, memory operations |
Source: packages/python/goldenmatch/goldenmatch/backends/duckdb_backend.py
Ray Backend
Enables distributed entity resolution across a Ray cluster. The Ray backend distributes blocking, scoring, and clustering operations across multiple nodes.
| Characteristic | Value |
|---|---|
| Execution Model | Distributed Ray cluster |
| Memory Model | Distributed, cluster-sharded |
| Typical Dataset Size | 10M+ records |
| Dependencies | ray |
Source: packages/python/goldenmatch/goldenmatch/backends/ray_backend.py
Backend Selection
GoldenMatch automatically selects an appropriate backend based on dataset characteristics, but users can explicitly override this via configuration:
from goldenmatch import GoldenMatch
config = {
    "backend": "bucket",  # Explicit backend selection
    "backend_options": {
        "chunk_size": 500_000,
        "num_threads": 16
    }
}
matcher = GoldenMatch(config)
Auto-Detection Logic
The backend registry implements automatic selection based on:
- Record count: Datasets under 1M records use polars; larger datasets may trigger bucket or ray
- Memory availability: Detected via RSS monitoring during execution
- Cluster availability: If Ray is initialized, the ray backend becomes available
- User configuration: An explicit backend setting takes precedence
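The selection criteria above can be sketched as a small decision function. This is not the registry's real code: `choose_backend`, its parameters, and the 10M cutoff for Ray (borrowed from the Ray backend's "10M+ records" row) are assumptions, and the memory-availability check is left out because the source does not say how it feeds into the decision.

```python
from typing import Optional


def choose_backend(record_count: int,
                   ray_initialized: bool = False,
                   explicit: Optional[str] = None) -> str:
    """Hypothetical selection order mirroring the criteria listed above."""
    if explicit is not None:
        return explicit        # user configuration takes precedence
    if record_count < 1_000_000:
        return "polars"        # small datasets stay in memory
    if ray_initialized and record_count > 10_000_000:
        return "ray"           # assumed cutoff: a cluster is up and the data is large
    return "bucket"            # single-node chunked processing


small = choose_backend(500_000)
medium = choose_backend(5_000_000)
large = choose_backend(20_000_000, ray_initialized=True)
forced = choose_backend(100, explicit="duckdb")
```

Checking the explicit setting first keeps the override rule unconditional, so no data-size heuristic can silently replace a backend the user asked for.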
Backend Interface
All backends implement a common interface defined in base.py:
class Backend(Protocol):
    def setup(self, config: Config) -> None: ...
    def teardown(self) -> None: ...
    def execute_blocking(self, records: DataFrame, config: Config) -> DataFrame: ...
    def execute_scoring(self, pairs: DataFrame, config: Config) -> DataFrame: ...
    def execute_clustering(self, pairs: DataFrame, config: Config) -> DataFrame: ...
    def execute_merge(self, records: DataFrame, clusters: DataFrame, config: Config) -> DataFrame: ...
Source: packages/python/goldenmatch/goldenmatch/backends/base.py
Native Acceleration (Rust/PyO3)
v1.21.0 introduced optional native acceleration via goldenmatch-native, a separately distributed compiled runtime built with Rust and PyO3 abi3:
pip install "goldenmatch[native]"
The native runtime is discovered automatically when installed and provides:
- Compiled clustering kernels
- Optimized block-scoring operations
- Polars-compatible ABI3 bindings
Source: packages/python/goldenmatch/goldenmatch/backends/__init__.py
Note: goldenmatch-native is not a standalone package. It must be installed alongside the core goldenmatch package and is automatically discovered at import time.
Memory Management
Health Monitoring
All backends implement health monitoring to detect memory pressure:
class HealthMonitor:
    """Tracks RSS usage and execution metrics."""

    def check_health(self) -> HealthStatus:
        """Returns current memory and execution health."""

    def get_metrics(self) -> dict:
        """Returns detailed metrics for diagnostics."""
RSS Reduction Features
| Version | Feature | RSS Impact |
|---|---|---|
| v1.16.0 | Bucket backend introduction | 50% reduction vs chunked |
| v1.24.0 | Scale-aware cardinality | 18% reduction |
| v1.24.0 | Heuristic rule expansion | Additional savings |
Source: docs/scale-envelope.md
Configuration Reference
Backend-Specific Options
backend: "bucket" # polars, bucket, duckdb, ray
backend_options:
  # Polars/Bucket options
  num_threads: 16
  chunk_size: 500_000
  memory_limit_gb: 32
  # Ray options
  num_actors: 8
  actor_placement: "node1,node2,node3"
  # DuckDB options
  catalog: "memory"
  threads: 8
Performance Tuning
For the recommended 5M-on-one-node configuration:
backend: "bucket"
backend_options:
  num_threads: 16
  chunk_size: 500_000
This configuration achieves approximately 9.94 minutes wall time and 6.4 GB peak RSS on a 16-core node.
Source: docs/scale-envelope.md
Execution Pipeline
The backend executes the entity resolution pipeline in these stages:
graph LR
A[Raw Records] --> B[Blocking]
B --> C[Pair Generation]
C --> D[Pair Scoring]
D --> E[Clustering]
E --> F[Golden Record Construction]
F --> G[Output]
B1[Block Key Computation] --> B
D1[Field Scorers] --> D
E1[Linkage Criteria] --> E
Each stage is backend-specific for optimal execution:
| Stage | Polars | Bucket | DuckDB | Ray |
|---|---|---|---|---|
| Blocking | Vectorized | Chunked vectorized | SQL | Distributed actors |
| Scoring | Multi-threaded | Chunked | SQL | Distributed |
| Clustering | Single-node | Chunked | SQL | Distributed |
Extending Backends
To implement a custom backend:
from goldenmatch.backends.base import Backend
from goldenmatch.core.config import Config
import polars as pl
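class CustomBackend(Backend):
    def setup(self, config: Config) -> None:
        self.config = config

    def execute_blocking(self, records: pl.DataFrame, config: Config) -> pl.DataFrame:
        # Implement custom blocking logic; this placeholder passes records through
        return records

    def execute_scoring(self, pairs: pl.DataFrame, config: Config) -> pl.DataFrame:
        # Implement custom scoring logic; this placeholder returns the pairs unchanged
        return pairs

    def execute_clustering(self, pairs: pl.DataFrame, config: Config) -> pl.DataFrame:
        # Implement custom clustering logic; this placeholder returns the pairs unchanged
        return pairs

    def execute_merge(self, records: pl.DataFrame, clusters: pl.DataFrame, config: Config) -> pl.DataFrame:
        # Part of the common Backend interface; this placeholder returns records unchanged
        return records

    def teardown(self) -> None:
        # Cleanup resources
        pass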
Register the backend:
from goldenmatch.backends import register_backend
register_backend("custom", CustomBackend)
See Also
- Architecture Overview — High-level system design
- Scale Envelope — Performance benchmarks
- CLI Reference — Command-line backend selection
- Configuration Guide — Backend configuration options
- GoldenMatch PostgreSQL Extension — SQL-native ER with pgrx
Core Matching Engine
Related topics: AutoConfig System, Blocking and Scoring
The Core Matching Engine is the central processing component of GoldenMatch, responsible for identifying, scoring, and clustering duplicate records within datasets. It orchestrates the entire entity resolution pipeline from candidate pair generation through final cluster formation, enabling users to deduplicate records with configurable precision, recall, and performance characteristics.
Architecture Overview
The Core Matching Engine consists of several interconnected modules that work together to perform entity resolution at scale. The engine processes input records through a staged pipeline: blocking reduces the candidate space, scoring evaluates pair similarity, and clustering groups related records into golden records.
graph TD
A[Input Records] --> B[Blocker]
B --> C[Candidate Pairs]
C --> D[Scorer]
D --> E[Scored Pairs]
E --> F[Cluster]
F --> G[Golden Records]
H[Config] --> B
H --> D
H --> F
I[Autoconfig Controller] -.-> H
Core Components
| Component | Purpose | Key Responsibilities |
|---|---|---|
| Blocker | Candidate reduction | Generate candidate pairs using blocking keys, handle QIS-bucket blocking |
| Scorer | Similarity evaluation | Compute field-level and overall pair scores, apply field strategies |
| Cluster | Record grouping | Form clusters from scored pairs, manage transitive closure, build golden records |
| Autoconfig Controller | Self-configuration | Auto-tune thresholds, manage recall targets, handle zero-label commits |
| Controller | Orchestration | Coordinate pipeline stages, manage state, handle incremental processing |
Source: packages/python/goldenmatch/goldenmatch/core/controller.py
Blocking Module
The blocking module is responsible for reducing the computational complexity of record matching from O(n²) to a manageable candidate set. It groups records by shared blocking keys and only generates candidate pairs within the same block.
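The idea can be shown with a small standard-library sketch: group records by a blocking key, then generate pairs only inside each group. The `candidate_pairs` helper and the sample records are illustrative assumptions, not GoldenMatch's blocker.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, key_fn):
    """Group records by blocking key and pair only records that share a key."""
    blocks = defaultdict(list)
    for row_id, record in records.items():
        blocks[key_fn(record)].append(row_id)
    pairs = set()
    for members in blocks.values():
        # Pairs are generated within a block, never across blocks.
        pairs.update(combinations(sorted(members), 2))
    return pairs


records = {
    1: {"name": "John Smith", "zip": "10001"},
    2: {"name": "Jon Smith", "zip": "10001"},
    3: {"name": "Ann Lee", "zip": "94105"},
    4: {"name": "Anne Lee", "zip": "94105"},
}
pairs = candidate_pairs(records, key_fn=lambda r: r["zip"])
all_pairs = len(records) * (len(records) - 1) // 2
```

With four records the full comparison space is six pairs, while blocking on `zip` leaves two. The trade-off is that true duplicates with different keys are never compared, which is why multi-pass blocking exists.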
Blocking Strategies
GoldenMatch supports multiple blocking strategies optimized for different data characteristics and scale requirements:
| Strategy | Description | Use Case |
|---|---|---|
| QIS Bucket | Quality-Interval-Sorted bucketing with Chao1 cardinality estimation | Large-scale datasets (10M+ records), realistic data distributions |
| Standard | Traditional blocking on normalized field values | General purpose deduplication |
| Multi-pass | Sequential blocking with different keys | High recall requirements |
| ANN Fallback | Approximate nearest neighbor blocking | Fuzzy matching with edit distance |
The QIS-bucket strategy introduced in v1.24.0 achieves significant performance improvements: 10M records processed in 502s (down from 2604s) with invariant F1=0.9886 and 18% RSS reduction.
Source: packages/python/goldenmatch/goldenmatch/core/blocker.py
Blocking Key Configuration
config = {
    "blocking": {
        "keys": ["name_soundex", "zip_code", "phone_area"],
        "min_block_size": 2,
        "max_block_size": 100000
    }
}
Scoring Module
The scoring module evaluates candidate pairs by computing similarity scores at both field and record levels. It applies configurable field strategies to determine how each field contributes to the overall match probability.
Field Strategies
Field strategies define how individual fields are compared and weighted:
| Strategy | Description | Best For |
|---|---|---|
| exact | Binary match/mismatch | IDs, codes, categorical |
| fuzzy | Edit distance or Jaro-Winkler | Names, addresses |
| token_set | Token overlap comparison | Multi-word fields |
| numeric | Threshold-based comparison | Ages, amounts |
| date | Temporal proximity | Date fields |
| phonetic | Soundex/Metaphone matching | Names |
Source: packages/python/goldenmatch/goldenmatch/core/field_strategies.py
Score Computation
The scorer aggregates field-level scores into an overall pair score using weighted combination:
overall_score = sum(field_scores[f] * field_weights[f] for f in fields) / sum(field_weights[f] for f in fields)
The resulting score represents the probability that two records refer to the same entity, ranging from 0.0 (definitely different) to 1.0 (definitely match).
Source: packages/python/goldenmatch/goldenmatch/core/scorer.py
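A runnable sketch of the weighted combination follows. It is an illustration under stated assumptions: the `exact` and `fuzzy` helpers, the `FIELDS` table, and the weights are hypothetical, and `fuzzy` uses `difflib.SequenceMatcher` from the standard library as a stand-in rather than the edit-distance or Jaro-Winkler scorers named in the table above.

```python
from difflib import SequenceMatcher


def exact(a: str, b: str) -> float:
    """Binary match/mismatch."""
    return 1.0 if a == b else 0.0


def fuzzy(a: str, b: str) -> float:
    # Stand-in similarity from the standard library, not Jaro-Winkler.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical field setup: field name -> (strategy function, weight).
FIELDS = {
    "name": (fuzzy, 2.0),
    "email": (exact, 1.5),
}


def overall_score(left: dict, right: dict) -> float:
    """Weighted mean of per-field similarity scores."""
    total_weight = sum(weight for _, weight in FIELDS.values())
    weighted = sum(
        strategy(left[field], right[field]) * weight
        for field, (strategy, weight) in FIELDS.items()
    )
    return weighted / total_weight


a = {"name": "John Smith", "email": "john@example.com"}
b = {"name": "Jon Smith", "email": "john@example.com"}
score = overall_score(a, b)
```

Dividing by the total weight keeps the result in the 0.0 to 1.0 range regardless of how the individual weights are scaled.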
Clustering Module
The clustering module transforms scored pairs into connected clusters representing unique entities. It handles transitive closure to ensure consistent grouping across the record graph.
Clustering Algorithm
GoldenMatch uses a connected-components approach with configurable linkage criteria:
| Linkage Type | Behavior |
|---|---|
| Single | Records merge if any pair within the cluster exceeds threshold |
| Complete | All pairs within merged clusters must exceed threshold |
| Average | Uses mean pairwise similarity |
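Single linkage over scored pairs amounts to finding connected components, which a union-find structure handles compactly. The sketch below is a generic implementation of that technique; `cluster_pairs` and the sample scores are illustrative and are not taken from the GoldenMatch source.

```python
def cluster_pairs(scored_pairs, threshold):
    """Single-linkage clustering: union records whose pair score meets the threshold."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        root_a, root_b = find(a), find(b)  # also registers unseen records
        if score >= threshold and root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return sorted(sorted(members) for members in clusters.values())


scored = [(1, 2, 0.93), (2, 3, 0.88), (3, 4, 0.40), (5, 6, 0.91)]
clusters = cluster_pairs(scored, threshold=0.85)
```

Records 1 and 3 are never scored against each other, yet they land in the same cluster through record 2. That is the transitive closure the module description refers to, and it is also why single linkage can over-merge when the threshold is too low.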
Golden Record Generation
Once clusters are formed, the clustering module generates golden records by applying survivorship rules:
golden_record = build_golden_records_batch(cluster_members, provenance=True)
With provenance=True (introduced in v1.22.0), each field dict includes source_row_id tracking which record contributed the winning value.
Source: packages/python/goldenmatch/goldenmatch/core/cluster.py
Autoconfig Controller
The autoconfig controller enables self-tuning of matching parameters based on labeled data or automatic heuristics. It replaces manual threshold tuning with data-driven optimization.
Auto-Configuration Features
| Feature | Description | Version |
|---|---|---|
| Zero-label commit | Prefer higher confidence candidates when labels unavailable | v1.23.0+ |
| Recall targeting | Auto-configure thresholds to meet desired recall | v1.20.0+ |
| Cluster threshold tuning | Tune decision threshold based on cluster-level decisions | v1.20.0+ |
| Field strategy tuning | Auto-select field comparison strategies | v1.19.0+ |
Zero-Label Confidence Handling
In v1.23.0, the pick_committed method was enhanced to handle zero-label profiles. When multiple candidates have equal health rank, the controller prefers candidates with higher -overall_confidence over those with higher -mass_separation. This addresses precision-collapse issues in unlabeled data scenarios.
Source: packages/python/goldenmatch/goldenmatch/core/autoconfig_controller.py
Performance Characteristics
The Core Matching Engine is optimized for both scale and accuracy:
Benchmark Results (v1.24.0)
| Dataset Size | Wall Time | Peak RSS | F1 Score |
|---|---|---|---|
| 5M records | 9.94 min | 6.4 GB | 0.99+ |
| 10M records | 502s | ~18% reduction vs v1.23 | 0.9886 |
Performance Strategies
- Vectorization: Single-group-by-per-column operations for golden record building
- Scale-aware cardinality: Chao1 estimation for blocking key selection
- Native acceleration: Optional Rust/PyO3 runtime via pip install "goldenmatch[native]"
- Incremental processing: Support for streaming and incremental matching
Source: packages/python/goldenmatch/goldenmatch/core/matcher.py
Configuration Reference
Core Configuration Options
matching:
  # Scoring thresholds
  score_threshold: 0.85 # Minimum score to consider a match
  decision_threshold: 0.5 # Threshold for cluster decisions
  # Field configuration
  fields:
    name:
      strategy: fuzzy
      weight: 2.0
    email:
      strategy: exact
      weight: 1.5
    phone:
      strategy: token_set
      weight: 1.0
# Blocking
blocking:
  keys: ["name_soundex", "phone_last4"]
  method: qis_bucket
# Output
output:
  lineage_provenance: false # Track source records for golden fields
  include_scores: true
Autoconfig Options
autoconfig:
  enabled: true
  recall_target: 0.95
  zero_label_commit: true # v1.23.0+ default behavior
  tune_cluster_threshold: true # v1.20.0+
Pipeline Integration
The Core Matching Engine integrates with the broader GoldenSuite ecosystem:
graph LR
A[GoldenCheck] --> B[GoldenFlow]
B --> C[GoldenMatch]
C --> D[InferMap]
E[GoldenPipe] -. orchestrates .-> A
E -. orchestrates .-> B
E -. orchestrates .-> C
E -. orchestrates .-> D
- GoldenCheck: Validates data quality before matching
- GoldenFlow: Standardizes messy fields (phone, date, address)
- InferMap: Maps columns across heterogeneous schemas
- GoldenPipe: Orchestrates the full pipeline declaratively
AutoConfig System
Related topics: Core Matching Engine, Learning Memory
The AutoConfig System is GoldenMatch's intelligent self-configuration engine that automatically tunes entity resolution parameters based on data characteristics, eliminating the need for manual threshold and strategy tuning. Introduced incrementally across versions v1.19.0 through v1.24.0, it represents the "Learning Memory" family of auto-tuning features that consume human feedback to propose optimal configurations.
Overview
AutoConfig addresses the fundamental challenge in entity resolution: finding the right balance between precision and recall requires understanding your specific data distribution. Rather than requiring users to manually specify match thresholds, blocking strategies, and scoring weights, AutoConfig:
- Analyzes data characteristics through profiling
- Proposes configuration parameters via iterative tuning
- Learns from user decisions in review queues
- Commits configurations based on zero-label confidence
Source: docs/design/2026-05-25-zero-label-confidence-autoconfig-design.md
Architecture
The AutoConfig System comprises four primary components that work together in an iterative feedback loop:
graph TD
A[Data Input] --> B[Controller]
B --> C[Policy Engine]
C --> D[Cluster Threshold Tuner]
D --> E[Zero-Label Confidence]
E --> B
F[User Decisions] --> D
G[Telemetry] --> B
B --> H[Committed Config]
Component Overview
| Component | Purpose | Location |
|---|---|---|
| Controller | Orchestrates the auto-config loop, manages iterations | core/autoconfig_controller.py |
| Policy Engine | Evaluates candidate configurations against health metrics | core/autoconfig_policy.py |
| Cluster Threshold Tuner | Proposes per-dataset approve thresholds from cluster decisions | core/autoconfig_cluster_threshold_tuner.py |
| Zero-Label Confidence | Assigns confidence scores based on unlabeled data patterns | core/zero_label_confidence.py |
Source: packages/python/goldenmatch/goldenmatch/core/autoconfig_controller.py
Controller
The AutoConfigController is the central orchestrator that manages the iterative configuration process. It coordinates between the policy engine, telemetry collectors, and the zero-label confidence system.
Controller Responsibilities
graph LR
A[Blocking<br>Summary] --> C[Controller]
B[Scoring<br>Summary] --> C
D[Cluster<br>Summary] --> C
E[Indicators] --> C
F[Column<br>Priors] --> C
C --> G[Decisions]
C --> H[Errors]
C --> I[Committed<br>Matchkeys]
Telemetry Data Model
The controller collects and emits structured telemetry for each iteration:
interface ControllerScoringSummary {
  n_pairs_scored: number;
  candidates_compared: number;
  mass_above_threshold: number;
  mass_in_borderline: number;
  dip_statistic: number;
}
interface ControllerBlockingSummary {
  n_blocks: number;
  reduction_ratio: number;
  block_sizes_p50: number;
  block_sizes_p99: number;
  block_sizes_max: number;
  oversized_block_count: number;
  keys_used: string[][];
}
interface ControllerClusterSummary {
  n_clusters: number;
  cluster_size_p50: number;
  cluster_size_p99: number;
  cluster_size_max: number;
  transitivity_rate: number;
  oversized_cluster_count: number;
}
Source: web/frontend/src/lib/api.ts
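As a rough illustration of how some ControllerBlockingSummary fields could be derived, the sketch below computes n_blocks, reduction_ratio, and block-size statistics from a list of block sizes. The helper name `summarize_blocks`, the `oversized_cutoff` default, and the reduction-ratio formula (one minus the share of all possible pairs that survive blocking) are assumptions for illustration, not GoldenMatch's confirmed implementation.

```python
import statistics


def summarize_blocks(block_sizes: list[int], n_records: int, oversized_cutoff: int = 1000) -> dict:
    """Hypothetical helper: derive blocking-summary style fields from block sizes."""
    # Pairs compared inside a block of k records: k * (k - 1) / 2.
    candidate_pairs = sum(k * (k - 1) // 2 for k in block_sizes)
    # Pairs a full cross-comparison of all records would require.
    total_pairs = n_records * (n_records - 1) // 2
    return {
        "n_blocks": len(block_sizes),
        "reduction_ratio": 1.0 - candidate_pairs / total_pairs if total_pairs else 0.0,
        "block_sizes_p50": statistics.median(block_sizes),
        "block_sizes_max": max(block_sizes),
        "oversized_block_count": sum(1 for k in block_sizes if k > oversized_cutoff),
    }


summary = summarize_blocks([2, 3, 5], n_records=10)
```

A reduction ratio close to 1.0 means blocking removed most of the pairwise comparisons; a growing oversized_block_count suggests a blocking key that is too coarse.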
Decision Recording
Each iteration produces a ControllerDecision record:
| Field | Type | Description |
|---|---|---|
| iteration | int | Iteration number |
| rule_name | str | Name of the policy rule that triggered |
| rationale | str | Human-readable explanation |
| config_diff | Record[str, str] | Changes made to configuration |
| wall_clock_ms | int | Time taken for this iteration |
Source: web/frontend/src/lib/api.ts
Policy Engine
The AutoConfigPolicy evaluates candidate configurations against a health ranking system. It ranks configurations by their expected precision-collapse behavior and mass separation characteristics.
Health Ranking
Configurations are evaluated on multiple health dimensions:
| Metric | Description | Impact |
|---|---|---|
| overall_confidence | Aggregate confidence score | Higher is better |
| mass_separation | Gap between match/non-match distributions | Larger gap indicates better discrimination |
| precision_collapse | Tendency to over-cluster | Lower is better |
Commit Decision Logic
The policy engine's pick_committed method determines which configuration to commit when multiple candidates have equal health rank. In v1.23.0, the tiebreaker was modified to prefer higher overall_confidence over higher mass_separation for candidates carrying a zero_label profile.
Source: docs/design/2026-05-25-zero-label-confidence-autoconfig-design.md
Cluster Threshold Tuner
The AutoConfigClusterThresholdTuner is the third tuner in the Learning Memory family, alongside the pair-level MemoryLearner and field-level field-strategy tuner.
Tuner Function: `tune_decision_threshold`
def tune_decision_threshold(
    decisions: list[ClusterDecision],
    current_threshold: float
) -> float:
    """
    Proposes a per-dataset auto-approve threshold based on
    cluster-level approve/reject decisions.
    """
Input: Cluster Decisions
| Field | Type | Description |
|---|---|---|
| cluster_id | str | Unique cluster identifier |
| decision | str | "approve" or "reject" |
| confidence | float | Model confidence in decision |
| cluster_size | int | Number of records in cluster |
Output
Returns an updated threshold value that balances precision and recall based on the observed decision pattern.
Source: packages/python/goldenmatch/goldenmatch/core/autoconfig_cluster_threshold_tuner.py
Zero-Label Confidence
The zero-label confidence mechanism enables AutoConfig to work without labeled training data by analyzing the structure of unlabeled record pairs and their natural clustering behavior.
Design Principles
- No Ground Truth Required: Analyzes inherent data structure rather than relying on labeled examples
- Precision-Collapse Aware: Identifies when configurations would cause over-merging
- Profile-Based Scoring: Assigns confidence scores based on zero-label profile characteristics
Zero-Label Profile
A zero-label profile captures characteristics of record pairs that help predict match quality without explicit labels:
from dataclasses import dataclass

@dataclass
class ZeroLabelProfile:
    mass_separation: float
    overall_confidence: float
    precision_collapse_risk: float
    zero_label: bool  # True if profile carries zero-label characteristics
Commit Behavior
Starting in v1.23.0, AutoConfig commits by zero-label confidence by default. The pick_committed tiebreaker logic:
IF same-health-rank candidates exist
   AND at least one carries a zero_label profile:
    PREFER the candidate with higher overall_confidence
ELSE:
    PREFER the candidate with higher mass_separation
This change addressed precision-collapse scenarios where mass_separation alone was insufficient.
Source: packages/python/goldenmatch/goldenmatch/core/zero_label_confidence.py
Source: docs/design/2026-05-25-zero-label-confidence-autoconfig-design.md
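The tiebreaker described under Commit Behavior can be sketched in a few lines. This is a minimal illustration, not the project's code: the `Candidate` class, its `health_rank` field, and the convention that a lower rank is healthier are all assumptions; only the preference order (overall_confidence when a zero-label profile is present, mass_separation otherwise) comes from the text.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical candidate record; metric names follow ZeroLabelProfile."""
    name: str
    health_rank: int          # assumption: lower rank means healthier
    mass_separation: float
    overall_confidence: float
    zero_label: bool


def pick_committed(candidates: list[Candidate]) -> Candidate:
    # Keep only the candidates tied at the best health rank.
    best_rank = min(c.health_rank for c in candidates)
    tied = [c for c in candidates if c.health_rank == best_rank]
    if any(c.zero_label for c in tied):
        # Described v1.23.0+ behavior: break the tie on overall_confidence.
        return max(tied, key=lambda c: c.overall_confidence)
    # Otherwise fall back to mass_separation.
    return max(tied, key=lambda c: c.mass_separation)


candidates = [
    Candidate("a", 1, mass_separation=0.9, overall_confidence=0.6, zero_label=True),
    Candidate("b", 1, mass_separation=0.5, overall_confidence=0.8, zero_label=True),
    Candidate("c", 2, mass_separation=0.99, overall_confidence=0.99, zero_label=True),
]
chosen = pick_committed(candidates)
```

Here candidate "c" is never considered because its health rank is worse, and "b" wins the tie over "a" despite its lower mass_separation.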
Interfaces
AutoConfig is accessible through multiple interfaces:
CLI
goldenmatch autoconfig <input.csv> [--iterations N] [--output config.yaml]
REST API
| Endpoint | Method | Description |
|---|---|---|
| /autoconfig | POST | Start auto-configuration run |
| /autoconfig/status | GET | Get current iteration status |
| /controller/telemetry | GET | Retrieve full telemetry snapshot |
Python API
from goldenmatch import AutoConfigController
controller = AutoConfigController(config)
controller.run(iterations=10)
telemetry = controller.get_telemetry()
SQL Interface (Postgres Extension)
SELECT goldenmatch_autoconfig('customers.csv');
SELECT gm_telemetry();
Source: packages/python/goldenmatch/goldenmatch/core/autoconfig_controller.py
Web UI Integration
The AutoConfig System integrates with the GoldenMatch web workbench for visual monitoring and manual override:
graph LR
A[Web UI] -->|Ctrl+A| B[AutoConfig Panel]
B --> C[Iteration List]
C --> D[Decision Details]
D --> E[Config Diff View]
B --> F[Telemetry Charts]
F --> G[Blocking Summary]
F --> H[Cluster Summary]
Telemetry Visualization
The frontend displays real-time telemetry including:
- Blocking Summary: Reduction ratio, block size distribution
- Scoring Summary: Mass above/below threshold, DIP statistic
- Cluster Summary: Size distribution, transitivity rate
- Indicators: Matchkey hit rate, cross-blocking overlap
Source: web/frontend/src/lib/api.ts
Configuration Options
AutoConfig Settings
| Parameter | Default | Description |
|---|---|---|
| autoconfig.enabled | true | Enable auto-configuration |
| autoconfig.max_iterations | 10 | Maximum iterations before commit |
| autoconfig.zero_label_commit | true | Prefer zero-label confidence (v1.23.0+) |
| autoconfig.recall_target | 0.95 | Target recall for auto-config |
| autoconfig.precision_floor | 0.90 | Minimum acceptable precision |
Learning Memory Integration
AutoConfig integrates with the broader Learning Memory system:
| Tuner | Level | Learns From |
|---|---|---|
| MemoryLearner | Pair | Pair-level approve/reject decisions |
| FieldStrategyTuner | Field | Field-level strategy preferences |
| ClusterThresholdTuner | Cluster | Cluster-level approve/reject decisions |
Version History
| Version | Change |
|---|---|
| v1.24.0 | Heuristic rule expansion + diagnostic harness |
| v1.23.0 | Auto-config commits by zero-label confidence by default |
| v1.20.0 | Cluster decision tuner (tune_decision_threshold) |
| v1.19.0 | Native acceleration + autoconfig + probabilistic improvements |
Source: packages/python/goldenmatch/goldenmatch/core/autoconfig_controller.py
Best Practices
- Start with defaults: The zero-label confidence commit behavior (v1.23.0+) provides sensible defaults for most datasets
- Review telemetry: Monitor blocking and cluster summaries to identify oversized blocks or clusters
- Use strict mode for evaluation: The _strictAutoconfig flag disables runtime threshold shifts for reproducible results
- Integrate with review queue: Feed cluster decisions back to the ClusterThresholdTuner for continuous improvement
Source: https://github.com/benseverndev-oss/goldenmatch / Human Manual
Learning Memory
Related topics: AutoConfig System
Learning Memory is GoldenMatch's adaptive feedback system that continuously improves entity resolution accuracy by learning from human corrections, labeled decisions, and performance feedback. It forms the closed-loop optimization layer that distinguishes GoldenMatch from static rule-based matching systems.
Overview
Learning Memory captures human domain knowledge and transforms it into automated configuration improvements. Rather than requiring users to manually tune thresholds, define field strategies, or configure blocking rules, Learning Memory observes how humans resolve ambiguous cases and propagates those decisions across the entire dataset.
The system operates at three distinct levels:
| Level | Tuner | Input | Output |
|---|---|---|---|
| Pair-level | MemoryLearner | Human approve/reject on pair comparisons | Learned matchkey weights and thresholds |
| Field-level | field_strategy_tuner | Field-level corrections | Per-field strategy selection (exact, fuzzy, tokenized, etc.) |
| Cluster-level | cluster_decision_tuner | Cluster approve/reject decisions | Per-dataset auto-approve threshold |
This multi-level approach ensures that feedback at any granularity flows back into the appropriate configuration layer. Source: packages/python/goldenmatch/CHANGELOG.md
Architecture
Learning Memory consists of several interconnected components that handle storage, learning, and application of corrections.
graph TD
subgraph "Input Layer"
A[Human Corrections] --> B[Corrections Store]
C[Review Queue Feedback] --> B
D[Ground Truth Labels] --> E[Memory Learner]
end
subgraph "Learning Layer"
E --> F[Pair-Level Tuning]
F --> G[Threshold Adjuster]
F --> H[Matchkey Weighter]
E --> I[Field-Level Tuning]
I --> J[Strategy Selector]
E --> K[Cluster-Level Tuning]
K --> L[Decision Threshold Tuner]
end
subgraph "Output Layer"
G --> M[AutoConfig Controller]
H --> M
J --> M
L --> M
M --> N[Matching Pipeline]
end
subgraph "Feedback Loop"
N --> O[Review Queue]
O --> A
end
Core Components
| Component | File | Responsibility |
|---|---|---|
| MemoryCorrections | core/memory/corrections.py | Persistent storage of human corrections |
| MemoryLearner | core/memory/learner.py | Pair-level learning from corrections |
| FieldStrategyTuner | core/autoconfig_field_strategy_tuner.py | Field-level strategy optimization |
| ClusterDecisionTuner | core/autoconfig_cluster_threshold_tune.py | Cluster threshold optimization |
Corrections Store
The MemoryCorrections class provides the persistent backing store for all human feedback. It tracks corrections at both the pair level (which records should be linked or unlinked) and the field level (which field values should win in survivorship). Source: packages/python/goldenmatch/goldenmatch/core/memory/corrections.py:1-50
Data Model
from datetime import datetime

class Correction:
    record_id_a: str      # First record in the pair
    record_id_b: str      # Second record in the pair
    decision: str         # "approve" or "reject"
    confidence: float     # Human confidence 0.0-1.0
    source: str           # "human", "ground_truth", "review_queue"
    timestamp: datetime
    metadata: dict        # Additional context

# Defined after Correction so the annotation below resolves
class MemoryCorrections:
    corrections: list[Correction]
CRUD Operations
| Operation | Method | Description |
|---|---|---|
| Create | add_correction() | Record a new correction |
| Read | get_corrections() | Retrieve corrections with filters |
| Update | update_correction() | Modify an existing correction |
| Delete | remove_correction() | Remove a correction |
The corrections store supports filtering by:
- Source type (human, ground_truth, review_queue)
- Decision type (approve, reject)
- Date range
- Record pair
Pair-Level Learning: MemoryLearner
The MemoryLearner processes corrections at the record-pair level and extracts patterns about which matchkey combinations indicate true matches versus false positives. Source: packages/python/goldenmatch/goldenmatch/core/memory/learner.py:1-100
Learning Algorithm
class MemoryLearner:
    def learn(self, corrections: MemoryCorrections) -> LearnedWeights:
        """
        Process corrections and compute updated matchkey weights.
        """
The learner computes:
- Precision per matchkey: Ratio of correct to total positive predictions for each matchkey
- Recall per matchkey: Coverage of true matches captured by each matchkey
- Composite weights: Combination weights that balance precision and recall
Weight Computation
Weights are computed using a modified TF-IDF approach:
| Metric | Formula | Purpose |
|---|---|---|
| Matchkey Precision | correct_pairs / total_pairs_for_key | How reliable is this key? |
| Key Frequency | pairs_using_key / total_pairs | How common is this key? |
| Composite Weight | precision * log(key_frequency + 1) | Balanced importance |
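The three formulas in the table can be worked through on a small set of reviewed pairs. The sketch below is only an illustration of that arithmetic: the helper name `composite_weights` and the `(matchkey_name, was_correct)` input shape are assumptions, and it uses the natural logarithm because the table does not state a base.

```python
import math


def composite_weights(pairs: list[tuple[str, bool]]) -> dict[str, float]:
    """Hypothetical helper: pairs are (matchkey_name, was_correct) tuples."""
    total_pairs = len(pairs)
    totals: dict[str, int] = {}
    correct: dict[str, int] = {}
    for key, was_correct in pairs:
        totals[key] = totals.get(key, 0) + 1
        correct[key] = correct.get(key, 0) + int(was_correct)
    weights = {}
    for key, n in totals.items():
        precision = correct[key] / n        # correct_pairs / total_pairs_for_key
        key_frequency = n / total_pairs     # pairs_using_key / total_pairs
        weights[key] = precision * math.log(key_frequency + 1)
    return weights


weights = composite_weights([
    ("email", True), ("email", True),
    ("name", True), ("name", False),
])
```

Both keys cover half of the pairs, so the difference in weight comes entirely from precision: "email" (precision 1.0) ends up with twice the weight of "name" (precision 0.5).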
Field-Level Learning: FieldStrategyTuner
The FieldStrategyTuner optimizes which scoring strategy to use for each field based on correction patterns. Source: packages/python/goldenmatch/goldenmatch/core/autoconfig_field_strategy_tuner.py:1-80
Available Field Strategies
| Strategy | Use Case | Example |
|---|---|---|
| exact | Unique identifiers, codes | SSN, account numbers |
| fuzzy | Names, addresses with typos | "John" vs "Jon" |
| tokenized | Multi-word fields | "John Smith" vs "Smith, John" |
| numeric | Numbers with tolerance | Prices, quantities |
| date | Temporal fields | Birth dates, transaction dates |
| phonetic | Names with spelling variants | Soundex, Metaphone |
Tuning Process
- Collect field-level signals from corrections (which field caused the error?)
- Compute strategy accuracy per field for each strategy
- Select best strategy using cross-validation to avoid overfitting
- Generate strategy map: {field_name: strategy_name}
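The last two steps of the tuning process can be sketched as follows. This assumes the per-field accuracies have already been estimated on held-out corrections; the cross-validation itself is omitted, and `select_strategies` is a hypothetical helper rather than part of FieldStrategyTuner.

```python
def select_strategies(accuracy: dict[str, dict[str, float]]) -> dict[str, str]:
    """Hypothetical helper: accuracy[field][strategy] is a held-out accuracy estimate."""
    strategy_map = {}
    for field_name, by_strategy in accuracy.items():
        # Pick the strategy with the highest estimated accuracy for this field.
        strategy_map[field_name] = max(by_strategy, key=by_strategy.get)
    return strategy_map


strategy_map = select_strategies({
    "name": {"exact": 0.71, "fuzzy": 0.93, "phonetic": 0.88},
    "phone": {"exact": 0.97, "fuzzy": 0.90},
})
```

The result has the `{field_name: strategy_name}` shape described in the final step.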
Cluster-Level Learning: ClusterDecisionTuner
The ClusterDecisionTuner (introduced in v1.20.0) consumes cluster-level approve/reject decisions and proposes a per-dataset auto-approve threshold. Source: packages/python/goldenmatch/goldenmatch/core/autoconfig_cluster_threshold_tune.py:1-60
Threshold Tuning Algorithm
class ClusterDecisionTuner:
    def tune_decision_threshold(
        self,
        decisions: list[ClusterDecision],
        target_recall: float = 0.95
    ) -> float:
        """
        Find the threshold that achieves target_recall on approved clusters.
        """
Tuning Inputs
| Input | Type | Description |
|---|---|---|
| decisions | list[ClusterDecision] | Human cluster decisions |
| target_recall | float | Desired recall target (default 0.95) |
| min_approve_confidence | float | Minimum confidence for auto-approve |
Tuning Outputs
| Output | Type | Description |
|---|---|---|
| threshold | float | Suggested auto-approve threshold |
| expected_precision | float | Estimated precision at this threshold |
| calibration_curve | list[tuple] | Precision-recall tradeoff points |
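One plausible reading of "find the threshold that achieves target_recall on approved clusters" is sketched below: choose the highest confidence cutoff that still auto-approves at least the target share of the human-approved clusters. This is an assumption about the approach, not the library's algorithm; the `ClusterDecision` stand-in reuses the fields listed under "Input: Cluster Decisions", and the sketch returns only the threshold, not expected_precision or calibration_curve.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterDecision:
    """Hypothetical stand-in for the library's cluster decision record."""
    cluster_id: str
    decision: str      # "approve" or "reject"
    confidence: float
    cluster_size: int


def tune_decision_threshold(decisions: list[ClusterDecision], target_recall: float = 0.95) -> float:
    # Confidences of human-approved clusters, highest first.
    approved = sorted((d.confidence for d in decisions if d.decision == "approve"), reverse=True)
    if not approved:
        raise ValueError("need at least one approved cluster")
    # Smallest number of approved clusters that must clear the threshold.
    needed = max(math.ceil(target_recall * len(approved)), 1)
    # The highest threshold that still auto-approves `needed` of them.
    return approved[needed - 1]


decisions = [
    ClusterDecision("c1", "approve", 0.95, 2),
    ClusterDecision("c2", "approve", 0.90, 3),
    ClusterDecision("c3", "approve", 0.80, 2),
    ClusterDecision("c4", "approve", 0.60, 4),
    ClusterDecision("c5", "reject", 0.70, 5),
]
threshold = tune_decision_threshold(decisions, target_recall=0.75)
```

With four approved clusters and a 0.75 target, three must clear the cutoff, so the sketch returns the third-highest approved confidence. Raising the target lowers the threshold and admits more rejected clusters, which is the precision-recall tradeoff the calibration curve is meant to expose.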
Auto-Config Integration
Learning Memory integrates with GoldenMatch's AutoConfig system, which automatically optimizes matching parameters. The pick_committed method in the controller determines which candidate configurations to commit. Source: packages/python/goldenmatch/goldenmatch/core/autoconfig_golden_strategy_tuner.py:1-100
Commit Priority
The controller uses a multi-factor ranking for candidate selection:
- Health rank: Overall dataset health improvement
- Mass separation: Gap between the match and non-match score distributions
- Zero-label confidence (v1.23.0+): Preference for higher overall_confidence among zero-label profile candidates
def pick_committed(self, candidates: list[Candidate]) -> Candidate:
    """
    Select the best candidate using health rank, mass separation,
    and zero-label confidence tiebreaker.
    """
AutoConfig Workflow
graph LR
A[Initialize Config] --> B[Generate Candidates]
B --> C[Score Candidates]
C --> D[Learn from Corrections]
D --> E[Update Weights]
E --> F[Filter Candidates]
F --> G{Rank by Health?}
G -->|Yes| H[Select Best]
G -->|No| I[Apply Learning Memory]
I --> E
H --> J[Commit Configuration]
J --> K[Execute Matching]
K --> L[Generate Review Queue]
L --> M[Human Feedback]
M --> D
Memory in Postgres Extension
The goldenmatch_pg extension (v0.5.0+) provides native Postgres functions for Learning Memory operations, enabling SQL-native corrections and statistics. Source: packages/python/goldenmatch/CHANGELOG.md
Available Functions
| Function | Purpose |
|---|---|
| memory_learn() | Record corrections from SQL |
| memory_stats() | Retrieve learning statistics |
| memory_clear() | Reset corrections for a dataset |
SQL Usage Example
-- Record a correction
SELECT memory_learn(
    'record_a_id',
    'record_b_id',
    'approve',  -- or 'reject'
    0.95
);
-- Get learning statistics
SELECT * FROM memory_stats('my_dataset');
-- Clear corrections for re-learning
SELECT memory_clear('my_dataset');
Usage Patterns
Basic Correction Flow
from goldenmatch import GoldenMatch, MemoryCorrections
# Initialize with memory
gm = GoldenMatch(config)
corrections = MemoryCorrections()
# Run initial matching
results = gm.match(data)
# Present review queue to human
for pair in results.review_queue:
    # human_review is a placeholder for your own review step
    decision = human_review(pair)
    corrections.add_correction(
        record_id_a=pair.id_a,
        record_id_b=pair.id_b,
        decision=decision,
        confidence=0.95
    )
# Apply learning and re-run
gm.memory_learner.learn(corrections)
refined_results = gm.match(data) # Uses learned weights
Field Strategy Tuning
from goldenmatch.core.autoconfig_field_strategy_tuner import FieldStrategyTuner
tuner = FieldStrategyTuner(dataset_id="my_dataset")
# Tune from corrections
strategy_map = tuner.tune(
    corrections=corrections,
    fields=["name", "address", "phone", "email"]
)
# Apply to config
config.field_strategies = strategy_map
Cluster Threshold Tuning
from goldenmatch.core.autoconfig_cluster_threshold_tune import ClusterDecisionTuner
tuner = ClusterDecisionTuner()
# Tune from cluster decisions
threshold = tuner.tune_decision_threshold(
    decisions=cluster_decisions,
    target_recall=0.97
)
# Apply auto-approve threshold
config.auto_approve_threshold = threshold
Configuration Options
| Option | Default | Description |
|---|---|---|
| memory.enabled | True | Enable/disable learning memory |
| memory.learning_rate | 0.1 | Rate at which new corrections update weights |
| memory.decay_factor | 0.95 | Weight decay for older corrections |
| memory.min_corrections | 10 | Minimum corrections before tuning |
| memory.strategy | "balanced" | Tuning strategy: balanced, precision, recall |
Version History
| Version | Feature |
|---|---|
| v1.19.0 | Initial Learning Memory with MemoryLearner |
| v1.20.0 | Added ClusterDecisionTuner (third tuner in family) |
| v1.23.0 | Zero-label confidence tiebreaker in pick_committed |
| v1.24.0 | Heuristic rule expansion + diagnostic harness |
Limitations and Considerations
Data Requirements
- Learning Memory requires a minimum number of corrections (default: 10) before producing reliable recommendations
- Highly imbalanced datasets (rare true matches) may need more corrections for accurate threshold tuning
Convergence
- Weights converge faster when corrections are evenly distributed across matchkey types
- Cluster threshold tuning may require iterative refinement for datasets with unusual cluster size distributions
Production Considerations
- Periodically review learned weights to ensure they remain aligned with business rules
- Reset memory when significant schema changes occur
- Monitor precision/recall drift over time
Source: https://github.com/benseverndev-oss/goldenmatch / Human Manual
Blocking and Scoring
Related topics: Core Matching Engine
Overview
Blocking and scoring are the two core mechanisms that enable GoldenMatch to perform entity resolution at scale. Without blocking, comparing every record against every other record would result in O(n²) comparisons—a computationally infeasible approach for datasets with millions of records. The blocking phase reduces the candidate space by grouping records that are likely to represent the same entity, while the scoring phase evaluates the similarity of each candidate pair to determine whether they should be merged.
In GoldenMatch, these mechanisms work together through a configurable pipeline that supports multiple blocking strategies (exact, fuzzy, token-based, and approximate nearest neighbor), multiple scoring approaches (string similarity, field weighting, and optional LLM-based evaluation), and a feedback-driven tuning system that learns from user corrections to improve accuracy over time.
Source: packages/python/goldenmatch/examples/README.md
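To make the reduction in comparisons concrete, the sketch below counts candidate pairs with and without blocking on four toy records. It is a generic illustration of exact blocking, not GoldenMatch's blocker; the function name `candidate_pairs` and the sample records are made up for the example.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records: list[dict], key: str) -> set[tuple[int, int]]:
    """Exact blocking sketch: only records sharing the same key value are paired."""
    blocks: dict[str, list[int]] = defaultdict(list)
    for idx, record in enumerate(records):
        blocks[record[key]].append(idx)
    pairs = set()
    for members in blocks.values():
        # Compare every record only against the others in its own block.
        pairs.update(combinations(members, 2))
    return pairs


records = [
    {"name": "Ann Lee", "zip5": "10001"},
    {"name": "Anne Lee", "zip5": "10001"},
    {"name": "Bob Ray", "zip5": "94105"},
    {"name": "Rob Ray", "zip5": "94105"},
]
all_pairs = len(records) * (len(records) - 1) // 2   # 6 comparisons without blocking
blocked = candidate_pairs(records, "zip5")            # 2 comparisons with blocking
```

The gap widens quadratically with dataset size, which is why the choice of blocking key dominates runtime on large inputs.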
Architecture
The blocking and scoring pipeline follows a multi-stage architecture:
graph TD
A[Input Records] --> B[Standardization]
B --> C[Blocking Strategy]
C --> D[Candidate Pairs]
D --> E[Field-Level Scoring]
E --> F[Record-Level Score]
F --> G[Clustering]
G --> H[Golden Records]
C -->|ann_* strategies| I[Approximate Nearest Neighbors]
F -->|optional| J[Cross-Encoder Reranking]
Components
| Component | Purpose | Key Classes/Modules |
|---|---|---|
| Blocker | Generates candidate pairs using blocking keys | blocker.py |
| Matchkey | Defines how fields contribute to blocking and scoring | matchkey.py |
| Scorer | Computes field and record-level similarity | scoring.py |
| Cross-Encoder | Optional reranking of candidate pairs | cross_encoder.py |
| Controller | Orchestrates the pipeline and applies thresholds | controller.py |
Source: packages/python/goldenmatch/examples/README.md
Blocking Strategies
Exact Blocking
Exact blocking groups records that share identical values on one or more key fields. This is the simplest and fastest approach, ideal for high-quality, well-standardized data.
blocking = {
    "strategy": "exact",
    "fields": ["email"]
}
Sorted Neighborhood Blocking
Records are sorted by blocking key values and compared within a sliding window. This catches near-duplicates whose keys differ slightly and would therefore never share a block under exact blocking.
blocking = {
    "strategy": "sorted_neighborhood",
    "fields": ["last_name", "zip5"],
    "window_size": 5
}
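The sliding-window idea can be sketched in plain Python. This is a textbook version of sorted neighborhood blocking under one assumption about semantics: a window of size w pairs each record with the next w - 1 records in sort order. Whether GoldenMatch interprets window_size the same way is not confirmed here, and `sorted_neighborhood_pairs` is a hypothetical helper.

```python
def sorted_neighborhood_pairs(records, fields, window_size=3):
    """Pair each record with the next (window_size - 1) records in sort order."""
    # Sort record indices by the tuple of blocking-key values.
    order = sorted(range(len(records)), key=lambda i: tuple(records[i][f] for f in fields))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window_size]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


records = [
    {"last_name": "Smith", "zip5": "10001"},
    {"last_name": "Smyth", "zip5": "10001"},
    {"last_name": "Jones", "zip5": "94105"},
    {"last_name": "Smithe", "zip5": "10001"},
]
pairs = sorted_neighborhood_pairs(records, ["last_name", "zip5"], window_size=2)
```

With a window of 2, "Smith" and "Smithe" are paired because they sort next to each other, while "Smith" and "Smyth" are not; a larger window trades more comparisons for a better chance of catching such variants.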
Multi-Pass Blocking
For complex datasets, multiple blocking passes with different keys can capture different types of matches:
blocking = [
    {"strategy": "exact", "fields": ["email"]},
    {"strategy": "sorted_neighborhood", "fields": ["last_name", "first_name"], "window_size": 5},
    {"strategy": "sorted_neighborhood", "fields": ["phone"]}
]
Source: packages/python/goldenmatch/examples/README.md
Approximate Nearest Neighbor (ANN) Blocking
For high-cardinality string fields, ANN blocking provides efficient similarity-based candidate generation:
blocking = {
    "strategy": "ann_l2",
    "fields": ["full_address"],
    "distance_threshold": 0.3
}
The v1.24.0 release introduced the QIS-bucket strategy and Chao1 scale-aware cardinality estimation, which improve blocking accuracy on large datasets. The same release reports an 81% wall-clock reduction (2604s → 502s) on a 10M-record benchmark while maintaining F1=0.9886.
Source: README.md
Blocking Configuration Fields
| Field | Type | Description |
|---|---|---|
| strategy | string | One of: exact, sorted_neighborhood, ann_l2, ann_cosine, canopy, learned |
| fields | list[string] | Fields to use for blocking |
| window_size | int | Window size for sorted neighborhood (default: 3) |
| distance_threshold | float | Distance threshold for ANN strategies |
| extras | dict | Advanced strategy-specific parameters |
Source: packages/python/goldenmatch/web/frontend/src/lib/types.ts
Scoring
Field-Level Scoring
Each field contributes a similarity score based on its configured strategy:
| Strategy | Description | Use Case |
|---|---|---|
| levenshtein | Character-level edit distance | Names, addresses |
| jaro_winkler | Optimized for short strings | Names |
| token_set | Set intersection of tokens | Address components |
| numeric | Absolute difference / range | Dates, amounts |
| exact | Binary match/no-match | IDs, codes |
Matchkey Definition
Matchkeys define how fields participate in both blocking and scoring:
matchkeys = [
    {
        "fields": ["email"],
        "blocking": True,
        "score_type": "exact",
        "weight": 1.0
    },
    {
        "fields": ["first_name", "last_name"],
        "blocking": True,
        "score_type": "token_set",
        "weight": 0.8
    },
    {
        "fields": ["phone"],
        "blocking": True,
        "score_type": "levenshtein",
        "weight": 0.6
    }
]
Source: packages/python/goldenmatch/examples/README.md
Record-Level Scoring
The record-level score combines field scores using weighted averaging:
record_score = Σ(field_score × field_weight) / Σ(field_weight)
Records exceeding the threshold are linked; the default threshold is tuned automatically based on the dataset characteristics via the autoconfig system introduced in v1.20.0.
Source: README.md
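The weighted average can be checked by hand on one candidate pair. The sketch below applies the record_score formula to made-up field scores, using the weights from the matchkey example; the helper name, the sample scores, and the 0.8 threshold are assumptions for illustration.

```python
def record_score(field_scores: dict[str, float], field_weights: dict[str, float]) -> float:
    """Weighted average of field scores: sum(score * weight) / sum(weight)."""
    total_weight = sum(field_weights[name] for name in field_scores)
    if total_weight == 0:
        return 0.0
    weighted = sum(score * field_weights[name] for name, score in field_scores.items())
    return weighted / total_weight


score = record_score(
    {"email": 1.0, "name": 0.9, "phone": 0.5},
    {"email": 1.0, "name": 0.8, "phone": 0.6},
)
is_match = score >= 0.8   # hypothetical threshold
```

Here the score is (1.0 + 0.72 + 0.30) / 2.4, roughly 0.84: the exact email match carries the pair over the threshold even though the phone numbers agree only weakly.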
Weighted Matchkeys
GoldenMatch supports sophisticated weighting schemes:
matchkeys = [
    {"fields": ["company_name"], "weight": 0.7, "score_type": "token_set"},
    {"fields": ["address", "city"], "weight": 0.3, "score_type": "token_set"},
    # Optional multi-pass with different weight profiles
]
The equipment deduplication example demonstrates multi-pass blocking with ANN fallback, weighted fuzzy matching, and LLM calibration for challenging datasets.
Source: packages/python/goldenmatch/examples/README.md
Cross-Encoder Reranking
The optional cross-encoder module provides a secondary scoring pass that considers field interactions:
from goldenmatch.core.cross_encoder import CrossEncoderScorer
scorer = CrossEncoderScorer(model="cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(record_a, record_b) for record_a, record_b in candidate_pairs]
reranked = scorer.score_batch(pairs)
Cross-encoding is particularly valuable when field combinations carry more signal than individual fields—for example, a name and address together are more distinctive than either alone.
Source: packages/python/goldenmatch/examples/README.md
Configuration Payload
The web frontend communicates blocking and scoring configuration using a typed payload:
export type RulesPayload = {
  threshold: number;
  matchkeys: Matchkey[];
  standardization?: StandardizationRules | null;
  blocking?: BlockingPayload | null;
};
Known blocking keys are validated at the server boundary, while unknown keys are preserved in an extras field for advanced strategies:
const BLOCKING_KNOWN_KEYS = new Set([
  "strategy", "fields", "window_size", "distance_threshold",
  "_block_size", "skip_oversized", "auto_suggest", "auto_select"
]);
Source: packages/python/goldenmatch/web/frontend/src/lib/types.ts
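The known-key and extras split can be mirrored in Python for readers working on the server side. This is a sketch of the idea only: the key set is copied verbatim from the frontend list, while `split_blocking_payload` and the `n_probes` parameter are hypothetical and do not correspond to a confirmed GoldenMatch function or option.

```python
# Copied verbatim from the frontend's BLOCKING_KNOWN_KEYS list.
BLOCKING_KNOWN_KEYS = {
    "strategy", "fields", "window_size", "distance_threshold",
    "_block_size", "skip_oversized", "auto_suggest", "auto_select",
}


def split_blocking_payload(raw: dict) -> dict:
    """Hypothetical helper: keep known keys at the top level, move the rest to `extras`."""
    known = {k: v for k, v in raw.items() if k in BLOCKING_KNOWN_KEYS}
    extras = {k: v for k, v in raw.items() if k not in BLOCKING_KNOWN_KEYS}
    if extras:
        known["extras"] = extras
    return known


payload = split_blocking_payload({
    "strategy": "ann_l2",
    "fields": ["full_address"],
    "distance_threshold": 0.3,
    "n_probes": 8,   # hypothetical strategy-specific parameter
})
```

Keeping unknown keys under `extras` rather than dropping them lets newer strategies pass through an older client unchanged.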
Standardization Pipeline
Effective blocking and scoring depend on data standardization. GoldenFlow provides the transform library that should be applied before matching:
| Transform Category | Examples |
|---|---|
| Text | strip, lowercase, normalize_unicode, normalize_quotes |
| Phone | phone_e164, phone_national |
| Address | address (full address standardization) |
| Numeric | extract_numbers, parse_currency |
standardization = {
    "email": "lowercase",
    "phone": "phone_e164",
    "address": "address",
    "state": "state"
}
Source: packages/python/goldenflow/README.md
Performance Considerations
Bucket Strategies
The QIS-bucket strategy (v1.24.0) provides scale-aware cardinality estimation that adjusts bucket parameters based on dataset size. This prevents both over-blocking (too many candidates) and under-blocking (missing matches).
Memory Reduction
The v1.24.0 release achieved an 18% RSS reduction through optimized data structures and streaming processing. Key optimizations include:
- Single-group-by-per-column vectorization for golden record building
- Lazy evaluation of scoring for low-confidence pairs
- Chunked processing for very large candidate sets
Source: README.md
Auto-Configuration
GoldenMatch can automatically tune blocking and scoring parameters:
from goldenmatch.core.autoconfig import auto_configure
config = auto_configure(
    data=df,
    ground_truth=labels_df,  # Optional for supervised tuning
    target_recall=0.95
)
The autoconfig system uses:
- Chao1 estimation for cardinality-aware blocking tuning
- Zero-label confidence analysis for threshold calibration (v1.23.0)
- Cluster-level tuning for decision threshold optimization (v1.20.0)
Source: README.md
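For readers unfamiliar with Chao1, the estimator itself is short. The sketch below implements the standard bias-corrected form, S_obs + f1(f1 - 1) / (2(f2 + 1)), where f1 and f2 are the numbers of values seen exactly once and exactly twice in a sample. How GoldenMatch feeds this estimate into blocking decisions is not described in this manual, so the sketch covers only the estimator.

```python
from collections import Counter


def chao1(values: list[str]) -> float:
    """Bias-corrected Chao1 estimate of the number of distinct values."""
    counts = Counter(values)
    s_obs = len(counts)                                  # distinct values observed
    f1 = sum(1 for c in counts.values() if c == 1)       # values seen exactly once
    f2 = sum(1 for c in counts.values() if c == 2)       # values seen exactly twice
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


sample = ["a", "a", "b", "b", "c", "d", "e"]
estimate = chao1(sample)
```

The sample contains 5 distinct values, 3 of them singletons and 2 doubletons, so the estimate is 5 + 3·2 / (2·3) = 6: many singletons signal that the sample has not yet seen every distinct value in the column.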
Memory and Learning
GoldenMatch maintains learned patterns across runs:
| Memory Type | Scope | Purpose |
|---|---|---|
| MemoryLearner | Pair-level | Learn from labeled match/non-match pairs |
| field-strategy tuner | Field-level | Optimize per-field scoring strategy |
| cluster-decision tuner | Cluster-level | Tune merge/reject thresholds |
These learning mechanisms enable the system to improve accuracy over time as users correct its decisions.
Source: README.md
Workflow Summary
graph LR
A[Raw Data] --> B[GoldenFlow Standardization]
B --> C[Blocking]
C --> D[Scoring]
D --> E[Clustering]
E --> F{Threshold}
F -->|Above| G[Auto-Approve]
F -->|Below| H[Auto-Reject]
F -->|Uncertain| I[Review Queue]
I --> J[User Labels]
J --> K[Memory Update]
K --> C
See Also
- Golden Match Examples — Runnable scripts demonstrating all blocking and scoring patterns
- Web UI Wiki — Interactive blocking configuration in the browser
- Auto-Configuration — Advanced tuning documentation
- GoldenFlow Transforms — Standardization for improved matching
Source: https://github.com/benseverndev-oss/goldenmatch / Human Manual
Doramagic Pitfall Log
Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.
Found 6 structured pitfall item(s), including 0 high/blocking item(s). Top priority: Capability evidence risk - Capability evidence risk requires verification.
1. Capability evidence risk: Capability evidence risk requires verification
- Severity: medium
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: capability.assumptions | github_repo:1183640892 | https://github.com/benseverndev-oss/goldenmatch
2. Maintenance risk: Maintenance risk requires verification
- Severity: medium
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | github_repo:1183640892 | https://github.com/benseverndev-oss/goldenmatch
3. Security or permission risk: Security or permission risk requires verification
- Severity: medium
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: downstream_validation.risk_items | github_repo:1183640892 | https://github.com/benseverndev-oss/goldenmatch
4. Security or permission risk: Security or permission risk requires verification
- Severity: medium
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: risks.scoring_risks | github_repo:1183640892 | https://github.com/benseverndev-oss/goldenmatch
5. Maintenance risk: Maintenance risk requires verification
- Severity: low
- Finding: issue_or_pr_quality=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | github_repo:1183640892 | https://github.com/benseverndev-oss/goldenmatch
6. Maintenance risk: Maintenance risk requires verification
- Severity: low
- Finding: release_recency=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | github_repo:1183640892 | https://github.com/benseverndev-oss/goldenmatch
Source: Doramagic discovery, validation, and Project Pack records
Community Discussion Evidence
These external discussion links are review inputs, not standalone proof that the project is production-ready.
Open the linked issues or discussions before treating the pack as ready for your environment.
Doramagic exposes project-level community discussion separately from official documentation. Review these links before using goldenmatch with real data or production workflows.
- v1.24.0 - github / github_release
- v1.23.0 - auto-config recall + zero-label commit default - github / github_release
- v1.22.0 - github / github_release
- v1.21.0 - github / github_release
- goldenmatch-native v0.1.0 - github / github_release
- goldenmatch v1.20.0 - github / github_release
- goldenmatch v1.19.0 - github / github_release
- infermap-js v0.5.0 - github / github_release
- goldenpipe-js v0.2.0 - github / github_release
- goldenmatch_pg v0.5.0 - github / github_release
- Capability evidence risk requires verification - GitHub / issue
Source: Project Pack community evidence and pitfall evidence