Doramagic Project Pack · Human Manual

pytorch-hessian-eigenthings

hessian-eigenthings is a PyTorch library that provides efficient and scalable computation of eigendecomposition for the Hessian matrix and related curvature operators in neural networks.

Introduction to hessian-eigenthings

Related topics: Curvature Matrices Explained, System Architecture


Overview

hessian-eigenthings is a PyTorch library that provides efficient and scalable computation of eigendecomposition for the Hessian matrix and related curvature operators in neural networks. The library enables practitioners to compute top eigenvalues and eigenvectors via Lanczos or stochastic power iteration, trace estimates via Hutch++, and spectral density via Stochastic Lanczos Quadrature.

Sources: README.md:1

The project targets researchers and engineers studying generalization properties of neural networks, where Hessian eigenvalues and eigenvectors have been implicated in understanding flat minima and model robustness.

Core Concepts

What is a Hessian?

The Hessian matrix is the matrix of second-order partial derivatives of a loss function with respect to model parameters. For a neural network with parameters θ and loss L, the Hessian H is defined as:

H[i,j] = ∂²L / ∂θ[i] ∂θ[j]

For modern large-scale models, the Hessian is prohibitively expensive to compute explicitly—it has O(n²) entries where n is the number of parameters (e.g., billions for large language models).
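For intuition at toy scale, the full Hessian of a small quadratic can still be materialized directly with PyTorch's built-in torch.autograd.functional.hessian; this snippet is illustrative and independent of the library:

```python
import torch

# For a toy quadratic loss L(θ) = ½ θᵀ A θ, the Hessian is exactly A.
A = torch.tensor([[2.0, 1.0], [1.0, 3.0]])

def loss(theta):
    return 0.5 * theta @ A @ theta

theta = torch.zeros(2)
H = torch.autograd.functional.hessian(loss, theta)
print(H)  # tensor([[2., 1.], [1., 3.]]) — matches A
```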

Sources: hessian_eigenthings/operators/hessian.py:1-50

Curvature Operators

Instead of computing the full Hessian matrix, this library works with curvature operators that implement matrix-vector products (matvecs). Given a vector v, these operators efficiently compute:

H @ v → operator.matvec(v)

This approach reduces memory from O(n²) to O(n), making analysis feasible for models with billions of parameters.
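To see why a matvec-only interface suffices, here is a minimal, self-contained power-iteration sketch; it is illustrative only, not this library's implementation:

```python
import torch

def power_iteration(matvec, n, num_iters=100, tol=1e-6):
    """Estimate the top eigenpair of a symmetric operator given only its
    matrix-vector product. `matvec` maps an (n,) tensor to an (n,) tensor."""
    v = torch.randn(n)
    v /= v.norm()
    eigval = 0.0
    for _ in range(num_iters):
        w = matvec(v)
        new_eigval = torch.dot(v, w).item()  # Rayleigh quotient estimate
        v = w / w.norm()
        if abs(new_eigval - eigval) < tol:
            break
        eigval = new_eigval
    return eigval, v

# Example with an explicit 3x3 matrix standing in for a curvature operator.
A = torch.diag(torch.tensor([3.0, 1.0, 0.5]))
lam, vec = power_iteration(lambda v: A @ v, n=3)
print(lam)  # ≈ 3.0
```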

Supported Curvature Matrices

| Operator | Description | Use Case |
|---|---|---|
| HessianOperator | Full Hessian of the loss | General curvature analysis |
| GGNOperator | Generalized Gauss-Newton approximation | More stable than raw Hessian; equals Fisher for cross-entropy + softmax |
| Custom Operators | User-defined curvature operators | Extend to other matrices |

Sources: hessian_eigenthings/operators/ggn.py:1-30

Architecture

The library follows a clean separation of concerns with four main layers:

graph TD
    A[User Code] --> B[Algorithms]
    A --> C[Loss Functions]
    B --> D[Curvature Operators]
    C --> D
    D --> E[LinAlgBackend]
    
    B --> B1[Lanczos]
    B --> B2[Stochastic Power Iteration]
    B --> B3[Trace Estimation]
    
    D --> D1[HessianOperator]
    D --> D2[GGNOperator]
    D --> D3[Custom Operators]
    
    E --> E1[SingleDeviceBackend]
    E --> E2[Distributed Backends]

Component Layers

  1. Algorithms Layer (hessian_eigenthings/algorithms/): Eigenvalue/eigenvector computation methods that operate on any CurvatureOperator.
  2. Operators Layer (hessian_eigenthings/operators/): Implementations of various curvature matrices that provide the matvec() interface.
  3. Loss Functions Layer (hessian_eigenthings/loss_fns/): Pre-built loss functions with analytical Hessian-vector products for common use cases.
  4. Backend Layer (hessian_eigenthings/backends/): Abstraction for linear algebra operations supporting single-device and distributed execution.

Sources: CONTRIBUTING.md:1-30

Algorithms

Lanczos Eigendecomposition

The Lanczos algorithm computes the top k eigenvalues and eigenvectors of a symmetric matrix using only matrix-vector products. It does so by iteratively constructing a small tridiagonal matrix whose eigenvalues converge to the extreme eigenvalues of the original operator.

graph LR
    A[Start Vector v₀] --> B[Iterate i = 1 to k]
    B --> C[Compute αᵢ = vᵢᵀAvᵢ]
    C --> D[Compute βᵢvᵢ₊₁ = Avᵢ - αᵢvᵢ - βᵢ₋₁vᵢ₋₁]
    D --> E[Build Tridiagonal T]
    E --> F{Eigenvalues of T ≈ eigenvalues of A?}
    F -->|Yes| G[Eigenpairs Converged]

Key parameters for the Lanczos algorithm:

| Parameter | Type | Default | Description |
|---|---|---|---|
| k | int | required | Number of eigenpairs to compute |
| max_iter | int | 100 | Maximum Lanczos iterations |
| tol | float | 1e-6 | Convergence tolerance |
| seed | int | None | Random seed for reproducibility |
| which | str | "LM" | Which eigenvalues: "LM" (largest magnitude), "LA" (largest algebraic), "SA" (smallest algebraic) |

Sources: hessian_eigenthings/algorithms/lanczos.py:1-80

Trace Estimation

The trace of a matrix can be estimated using stochastic methods without forming the full matrix:

Hutchinson's Estimator:

trace(A) ≈ (1/m) Σᵢ vᵢᵀ A vᵢ

where vᵢ are random probe vectors.
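A minimal, self-contained sketch of Hutchinson's estimator with Rademacher probes (illustrative only; the library's trace function wraps the same idea behind the operator interface):

```python
import torch

def hutchinson_trace(matvec, n, num_probes=100, seed=0):
    """Estimate trace(A) = E[vᵀ A v] using Rademacher probe vectors."""
    gen = torch.Generator().manual_seed(seed)
    total = 0.0
    for _ in range(num_probes):
        v = (torch.randint(0, 2, (n,), generator=gen) * 2 - 1).float()
        total += torch.dot(v, matvec(v)).item()
    return total / num_probes

# Sanity check on an explicit diagonal matrix: trace = 1 + 2 + ... + 100 = 5050.
A = torch.diag(torch.arange(1.0, 101.0))
print(hutchinson_trace(lambda v: A @ v, n=100))  # ≈ 5050
```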

Hutch++ Estimator: An improved estimator with lower variance, which combines a low-rank range sketch with Hutchinson probing on the deflated remainder:

trace(A) ≈ tr(QᵀAQ) + (1/m) Σⱼ wⱼᵀ A wⱼ

where Q is an orthonormal basis for a random sketch of the range of A and wⱼ = (I − QQᵀ)vⱼ are deflated probe vectors.

| Method | num_matvecs | Variance | Use Case |
|---|---|---|---|
| hutchinson | 100 | Higher | Quick estimates |
| hutch++ | 30 | Lower | Production estimates |

Sources: hessian_eigenthings/algorithms/trace.py:1-60

Operators

HessianOperator

The primary operator for computing Hessian eigendecomposition. It supports two HVP computation methods:

| Method | Description | Memory | Precision |
|---|---|---|---|
| "autograd" (default) | Exact double-backward via torch.autograd.grad with create_graph=True | Higher | Numerically exact |
| "finite_difference" | Central-difference approximation | Lower | O(ε²) bias |

```python
from hessian_eigenthings import HessianOperator, lanczos
from hessian_eigenthings.util import match_names

# Basic usage
op = HessianOperator(model=model, dataloader=dataloader, loss_fn=loss_fn)
eigenvalues, eigenvectors = lanczos(op, k=10)

# With parameter filtering (subset of parameters)
op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=match_names("transformer.h.*.attn.*"),
)
```

Sources: hessian_eigenthings/operators/hessian.py:1-100

GGNOperator

The Generalized Gauss-Newton (GGN) operator provides a more numerically stable approximation to the Hessian. For cross-entropy + softmax classification, the GGN equals the Fisher information matrix.

from hessian_eigenthings import GGNOperator

op = GGNOperator(
    model=model,
    dataloader=dataloader,
    forward_fn=model_forward,
    loss_of_output_fn=loss_of_output_fn,
)

Two matvec implementations:

| Implementation | Description | Memory Footprint |
|---|---|---|
| "analytical" (default) | Finite-difference JVP + analytical loss-Hessian-vec product | Matches one training step |
| "autograd" | Full torch.func.jvp + autograd double-backward | Scales badly with output size |

Sources: hessian_eigenthings/operators/ggn.py:1-80

Loss Functions

The library provides optimized loss functions with closed-form Hessian-vector products for common use cases.

Standard Loss Functions

from hessian_eigenthings.loss_fns.standard import cross_entropy_loss_of_output

loss_fn = cross_entropy_loss_of_output()  # Returns loss_of_output_fn

The closed-form cross-entropy HVP is:

H @ u = (p * u - p * <p, u>) / n

where p = softmax(output) and n is the number of valid positions.
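One way to sanity-check this closed form is to compare it against PyTorch's generic autograd HVP on random logits; the snippet below is an illustrative check, not library code:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)             # n = 4 positions, 10 classes
targets = torch.randint(0, 10, (4,))
u = torch.randn(4, 10)                  # direction for the HVP

def ce(x):
    return F.cross_entropy(x, targets)  # mean reduction over positions

# Generic double-backward HVP of the loss with respect to the logits.
_, hvp_autograd = torch.autograd.functional.hvp(ce, logits, u)

# Closed form: (p ⊙ u − p ⟨p, u⟩) / n with p = softmax(logits).
p = logits.softmax(dim=-1)
hvp_closed = (p * u - p * (p * u).sum(dim=-1, keepdim=True)) / logits.shape[0]

print(torch.allclose(hvp_autograd, hvp_closed, atol=1e-5))  # True
```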

Sources: hessian_eigenthings/loss_fns/standard.py:1-60

HuggingFace Transformers Loss

Specialized support for HuggingFace models with fused CUDA kernels:

from hessian_eigenthings.loss_fns.huggingface import hf_lm_loss

loss_fn = hf_lm_loss()  # For language modeling

Fused backend options:

| Backend | Device | Speedup | Memory |
|---|---|---|---|
| "triton" | CUDA | ~3.4x faster | 2x reduction |
| "compile" | Any | ~2.6x faster | 2x reduction |
| "eager" | Any | Baseline | Baseline |
| "auto" | Auto-detect | Best available | Best available |

Sources: hessian_eigenthings/loss_fns/huggingface.py:1-80

Usage Examples

Basic Hessian Eigendecomposition

import torch
from torch import nn
from hessian_eigenthings import HessianOperator, lanczos

# Define model and data
model = nn.Sequential(nn.Linear(100, 50), nn.ReLU(), nn.Linear(50, 10))
dataloader = [(torch.randn(32, 100), torch.randint(0, 10, (32,)))]

# Loss function
def loss_fn(model, batch):
    x, y = batch
    return nn.functional.cross_entropy(model(x), y)

# Compute top 3 eigenvalues
op = HessianOperator(model=model, dataloader=dataloader, loss_fn=loss_fn)
result = lanczos(op, k=3, max_iter=20, tol=1e-3, seed=0)

for i, val in enumerate(result.eigenvalues):
    print(f"λ_{i+1} = {val.item():.4e}")

Sources: examples/huggingface_tiny_gpt2.py:1-50

Analyzing Attention Layers Only

from hessian_eigenthings import HessianOperator, lanczos
from hessian_eigenthings.util import match_names

# Filter to attention parameters only
attn_op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=match_names("blocks.*.attn.*"),
)
eigenvalues = lanczos(attn_op, k=5)

Trace Estimation

from hessian_eigenthings import HessianOperator, trace

op = HessianOperator(model=model, dataloader=dataloader, loss_fn=loss_fn)
result = trace(op, num_matvecs=30, method="hutch++", seed=0)
print(f"Trace estimate: {result.estimate:.4e}")

Performance Considerations

Memory Management

For large models, consider these strategies:

  1. Parameter Filtering: Analyze only relevant subsets of parameters

```python
op = HessianOperator(model, dataloader, loss_fn, param_filter=filter_func)
```

  2. Microbatching: Process data in smaller chunks

```python
op = HessianOperator(model, dataloader, loss_fn, microbatch_size=8)
```

  3. Finite Difference Method: Use "finite_difference" for lower memory with FSDP/HSDP/TP

```python
op = HessianOperator(model, dataloader, loss_fn, method="finite_difference")
```

Scalability

| Model Size | Recommended Method | Notes |
|---|---|---|
| < 1B params | "autograd" HVP | Numerically exact |
| 1B - 7B params | "analytical" GGN | Good memory efficiency |
| > 7B params | "finite_difference" HVP | Works with distributed training |

Computation Cost

The primary cost driver is the number of matrix-vector products (matvecs):

  • Lanczos: up to max_iter matvecs (one per iteration), largely independent of k
  • Trace (Hutch++): num_matvecs matvecs
  • Spectral Density: num_steps × num_random_start matvecs

API Reference

Core Functions

| Function | Module | Description |
|---|---|---|
| lanczos | hessian_eigenthings.algorithms | Lanczos eigendecomposition |
| stochastic_power_iteration | hessian_eigenthings.algorithms | Stochastic power iteration |
| trace | hessian_eigenthings.algorithms | Trace estimation |
| spectral_density | hessian_eigenthings.algorithms | Stochastic Lanczos Quadrature |

Operators

| Class | Module | Description |
|---|---|---|
| HessianOperator | hessian_eigenthings.operators | Full Hessian operator |
| GGNOperator | hessian_eigenthings.operators | Generalized Gauss-Newton operator |
| CurvatureOperator | hessian_eigenthings.operators | Base class for custom operators |

Utility Functions

| Function | Description |
|---|---|
| match_names(glob_pattern) | Create parameter filter from glob pattern |
| SingleDeviceBackend | Linear algebra backend for single-device execution |

Project Information

Acknowledgements

The original 2018 implementation was developed by Noah Golmant, Zhewei Yao, Amir Gholami, Michael Mahoney, and Joseph Gonzalez at UC Berkeley's RISELab.

The deflated power iteration is based on code from HessianFlow (Z. Yao, A. Gholami, Q. Lei, K. Keutzer, M. Mahoney. *"Hessian-based Analysis of Large Batch Training and Robustness to Adversaries"*, NeurIPS 2018).

Accelerated stochastic power iteration is from C. De Sa et al.

Citation

@misc{hessian-eigenthings,
    author       = {Noah Golmant and Zhewei Yao and Amir Gholami and Michael Mahoney and Joseph Gonzalez},
    title        = {pytorch-hessian-eigenthings: efficient PyTorch Hessian eigendecomposition},
    month        = oct,
    year         = 2018,
    version      = {1.0},
    url          = {https://github.com/noahgolmant/pytorch-hessian-eigenthings}
}

Installation

# From PyPI (stable release)
pip install hessian-eigenthings

# Development setup
git clone https://github.com/noahgolmant/pytorch-hessian-eigenthings
cd pytorch-hessian-eigenthings
uv sync --group dev --group docs

Documentation

Full documentation is available at noahgolmant.github.io/pytorch-hessian-eigenthings/.

Sources: README.md:1-100, CONTRIBUTING.md:1-60, mkdocs.yml:1-50

Installation Guide

Related topics: Introduction to hessian-eigenthings


Overview

This guide covers all aspects of setting up the hessian-eigenthings library for computing Hessian eigendecomposition and related curvature matrix operations in PyTorch models.

The library provides efficient methods for computing top eigenvalues and eigenvectors via Lanczos or stochastic power iteration, trace estimates via Hutch++, and spectral density via Stochastic Lanczos Quadrature. Sources: README.md:1-20

Installation Methods

PyPI Release (Recommended for Users)

The latest stable release is available on PyPI:

pip install hessian-eigenthings

This installs the core library without optional dependencies for transformer and curvlinops integrations. Sources: README.md:1-10

Development Installation (For Contributors)

For development, clone the repository and install with all optional dependency groups:

git clone https://github.com/noahgolmant/pytorch-hessian-eigenthings
cd pytorch-hessian-eigenthings
uv sync --group dev --group docs --extra transformers --extra transformer-lens --extra curvlinops

Sources: CONTRIBUTING.md:5-12

Optional Dependency Groups

The library uses optional dependency groups defined in pyproject.toml to enable specialized functionality:

| Group | Purpose | Typical Use Case |
|---|---|---|
| dev | Testing, linting, type checking | Running CI checks locally |
| docs | Building documentation | mkdocs build --strict |
| transformers | HuggingFace Transformers integration | GGNOperator with HF models |
| transformer-lens | TransformerLens integration | Attention-only Hessian analysis |
| curvlinops | Cross-library validation tests | Testing against external oracle |

Sources: CONTRIBUTING.md:8-10

Development Environment Setup

Prerequisites

| Requirement | Version | Purpose |
|---|---|---|
| Python | ≥3.10 | Core runtime |
| uv | Latest | Package manager |
| PyTorch | ≥2.0 | Backend tensor operations |
| CUDA (optional) | 11.8+ | GPU acceleration for large models |

Setup Workflow

graph TD
    A[Clone Repository] --> B[Install uv if needed]
    B --> C[Run uv sync with groups]
    C --> D[Verify Installation]
    D --> E{Which workflow?}
    E -->|Development| F[Run linting checks]
    E -->|Testing| G[Run pytest]
    E -->|Documentation| H[Build docs]
    F --> I[Ready to contribute]
    G --> I
    H --> I

Verification Commands

After installation, verify the setup by running the full check suite:

uv run ruff check .
uv run black --check .
uv run mypy
uv run pytest
uv run mkdocs build --strict

Sources: CONTRIBUTING.md:14-23

CUDA/GPU Support

The library provides optimized CUDA kernels for specific operations:

Triton Kernels (CUDA Only)

The hessian_eigenthings.loss_fns._fused_ce_hvp module includes a hand-written Triton CUDA kernel for fused CE HVP computation. This kernel:

  • Allocates zero (N, V) intermediates (output buffer only)
  • Provides ~3.4x speedup over eager mode
  • Reduces peak memory by 2x compared to eager
  • Falls back to torch.compile if Triton/CUDA is unavailable

Sources: hessian_eigenthings/loss_fns/_fused_ce_hvp.py:50-80

Backend Selection

For HuggingFace language model loss functions, the fused parameter controls kernel selection:

| Setting | Behavior |
|---|---|
| "auto" (default) | Picks fastest available: Triton on CUDA (~3.4x speedup), else torch.compile |
| "eager" | Plain PyTorch implementation, useful for debugging |
| "compile" | torch.compile-fused via Inductor, works on CPU/CUDA/MPS |
| "triton" | Hand-written CUDA Triton kernel (CUDA only) |

Sources: hessian_eigenthings/loss_fns/huggingface.py:1-50

Operator-Specific Dependencies

Different curvature operators have different computational requirements:

HessianOperator

Two HVP methods are supported:

| Method | Memory Profile | Precision | FSDP/TP Compatible |
|---|---|---|---|
| "autograd" (default) | Higher (requires create_graph=True) | Numerically exact | No (requires special handling) |
| "finite_difference" | Matches one training step | O(ε²) truncation bias | Yes |

Sources: hessian_eigenthings/operators/hessian.py:1-40

GGNOperator

For the Generalized Gauss-Newton operator, two matvec implementations are available:

| Implementation | Memory | Use Case |
|---|---|---|
| "analytical" (default) | Matches one training step | LM-scale use, prevents OOM |
| "autograd" | Scales with output size | Losses without analytical .hvp |

Sources: hessian_eigenthings/operators/ggn.py:1-60

CI/CD Verification

The repository uses GitHub Actions for continuous integration. CI runs:

  • All linting and type checks
  • Full pytest test suite
  • Example scripts execution
  • Documentation codeblock tests

Sources: CONTRIBUTING.md:23-26

Troubleshooting

Common Issues

| Issue | Solution |
|---|---|
| Memory OOM with GGNOperator | Use loss_hvp="analytical" (default in recent versions) |
| FSDP/TP compatibility issues | Use method="finite_difference" for HessianOperator |
| Triton not available | Falls back to torch.compile automatically |
| Type checking failures | Run uv run mypy locally before submitting PR |

Diagnostic Scripts

The repository includes diagnostic scripts for troubleshooting, such as scripts/repro_ggn_oom.py and scripts/bench_fused_ce_hvp.py.

Sources: scripts/repro_ggn_oom.py:1-40, scripts/bench_fused_ce_hvp.py:1-50

Package Metadata

| Property | Value |
|---|---|
| Package Name | hessian-eigenthings |
| License | MIT |
| Documentation | noahgolmant.github.io/pytorch-hessian-eigenthings |
| CI Status | ![CI](https://github.com/noahgolmant/pytorch-hessian-eigenthings/actions/workflows/ci.yml) |

Sources: README.md:1-20

Sources: CONTRIBUTING.md:5-12

Curvature Matrices Explained

Related topics: Why Hessian-Vector Products, Curvature Operators


Overview

Curvature matrices characterize the second-order behavior of loss functions in neural networks, providing critical information about optimization landscapes, generalization properties, and model robustness. The hessian-eigenthings library provides efficient, matrix-free computation of eigendecompositions for three key curvature matrices: the Hessian, the Generalized Gauss-Newton (GGN), and the Empirical Fisher.

These curvature operators serve as the foundation for analyzing flat minima, understanding generalization, and performing second-order optimization. The library implements matrix-vector products (matvecs) directly, avoiding explicit matrix construction which would be computationally infeasible for large neural networks with billions of parameters.

Curvature Matrices Architecture

graph TD
    A[Loss Function Lθ] --> B[Hessian H = ∇²L]
    A --> C[Generalized Gauss-Newton G]
    A --> D[Empirical Fisher F]
    
    B --> E[Matrix-Free MatVec]
    C --> E
    D --> E
    
    E --> F[Lanczos Eigendecomposition]
    E --> G[Trace Estimation]
    E --> H[Spectral Density]
    
    F --> I[Top-k Eigenpairs]
    G --> J[Trace Estimate ± SE]
    H --> K[Spectral Density Plot]

The Hessian Matrix

Definition and Role

The Hessian matrix H = ∇²L(θ) is the second derivative of the loss with respect to parameters. It captures the exact local curvature of the loss landscape, making it the most precise but also most computationally expensive curvature matrix.

The Hessian is symmetric by construction and its eigenvalues reveal critical properties:

  • Large positive eigenvalues indicate sharp curvature, suggesting the model is in a narrow minimum
  • Small eigenvalues indicate flat regions associated with better generalization
  • Negative eigenvalues signal instability and potential divergence

Sources: hessian_eigenthings/operators/hessian.py:1-30

HessianOperator Implementation

The HessianOperator class provides two methods for computing Hessian-vector products (HVPs):

class HessianOperator(CurvatureOperator):
    """Hessian of `loss_fn(model, batch)` averaged over batches in dataloader."""
    
    def __init__(
        self,
        model: nn.Module,
        dataloader: Iterable[Any],
        loss_fn: LossFn,
        *,
        param_filter: ParamFilter | None = None,
        full_dataset: bool = True,
        num_batches: int | None = None,
        microbatch_size: int | None = None,
        method: HvpMethod = "autograd",
        fd_eps: float | None = None,
        backend: LinAlgBackend[torch.Tensor] | None = None,
    ) -> None:

HVP Computation Methods:

| Method | Description | Memory | Precision |
|---|---|---|---|
| "autograd" | Exact double-backward via torch.autograd.grad with create_graph=True | Higher (scales with model size) | Numerically exact to rounding |
| "finite_difference" | Central-difference (∇L(θ+εv) − ∇L(θ−εv)) / 2ε | Lower (two forward+backward passes) | O(ε²) truncation bias |

The finite-difference method uses dtype-specific epsilon values for optimal precision:

| dtype | Epsilon |
|---|---|
| float64 | 6e-6 |
| float32 | 5e-3 |
| bfloat16 | 0.2 |
| float16 | 5e-2 |

Sources: hessian_eigenthings/operators/hessian.py:30-55

When to Use the Hessian

The Hessian is ideal for:

  • Single-device analysis of models up to ~7B parameters
  • Scenarios requiring exact curvature information
  • Research on loss landscape topology
  • Verifying approximations against ground truth

Generalized Gauss-Newton (GGN)

Definition and Mathematical Foundation

The Generalized Gauss-Newton matrix G is a positive semi-definite (PSD) approximation to the Hessian. For a loss of the form L = (1/n) Σ l(f(xᵢ;θ), yᵢ), the GGN is defined as:

G = Jᵀ · H_loss · J

Where:

  • J is the Jacobian of model outputs with respect to parameters
  • H_loss is the Hessian of the loss with respect to model outputs

The GGN is always PSD because G = Jᵀ · H_loss · J and H_loss is PSD for convex losses.

Sources: hessian_eigenthings/operators/ggn.py:1-40

GGNOperator Implementation

The GGNOperator class provides two matvec implementations:

class GGNOperator(CurvatureOperator):
    def __init__(
        self,
        model: nn.Module,
        dataloader: Iterable[Any],
        forward_fn: ForwardFn,
        loss_of_output_fn: LossOfOutputFn,
        *,
        loss_hvp: Literal["analytical", "autograd"] = "analytical",
    ) -> None:

| Method | Description | Memory | Use Case |
|---|---|---|---|
| "analytical" | Finite-difference JVP + analytical loss-Hessian-vector product | Matches one training step | LM-scale use, OOM-safe |
| "autograd" | torch.func.jvp + autograd double-backward + vjp | Scales with output size | Exact for arbitrary losses |

For cross-entropy + softmax classification, G equals the Fisher information matrix, making the GGN and Fisher equivalent in this common case.

Sources: hessian_eigenthings/operators/ggn.py:40-80

Two-Function API Design

The GGNOperator uses a separation between forward_fn and loss_of_output_fn:

ForwardFn = Callable[[nn.Module, Any], torch.Tensor]
LossOfOutputFn = Callable[[torch.Tensor, Any], torch.Tensor]

This design enables computing J·v, H_loss·(J·v), and Jᵀ·(H_loss·J·v) without coupling to loss internals.
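A hedged sketch of what such a pair might look like for a plain classifier, where the (x, y) batch structure is an assumption made for illustration:

```python
import torch
from torch import nn
import torch.nn.functional as F

# ForwardFn: (model, batch) -> model output (here, logits).
def model_forward(model: nn.Module, batch) -> torch.Tensor:
    x, _ = batch
    return model(x)

# LossOfOutputFn: (output, batch) -> scalar loss of the output alone.
def loss_of_output(output: torch.Tensor, batch) -> torch.Tensor:
    _, y = batch
    return F.cross_entropy(output, y)
```

The split lets the operator differentiate through the network (for J and Jᵀ) and through the loss head (for H_loss) independently.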

Closed-Form Cross-Entropy HVP

For mean-reduced cross-entropy with softmax, the library provides an optimized analytical HVP:

H_loss @ u = (p * u - p * ⟨p, u⟩) / n

Where p = softmax(logits) and n is the count of non-ignored positions.

Sources: hessian_eigenthings/loss_fns/huggingface.py:1-60

Fused CE HVP Implementations

The library provides three backend implementations for the cross-entropy HVP:

| Backend | Description | Memory | Speedup |
|---|---|---|---|
| "eager" | Plain PyTorch reference | Highest | 1x baseline |
| "compile" | torch.compile-fused; Inductor fuses operations | Reduced | ~2.6x faster |
| "triton" | Hand-written CUDA Triton kernel | Minimal (output buffer only) | ~3.4x faster |

Sources: hessian_eigenthings/loss_fns/_fused_ce_hvp.py:1-50

Empirical Fisher

Definition

The Empirical Fisher matrix F is defined as:

F = (1/n) Σ ∇lᵢ · ∇lᵢᵀ

Where the expectation over data is replaced by the empirical average over batches. It is always PSD and serves as an approximation to the Fisher Information Matrix.
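Because F is a sum of rank-1 gradient outer products, an F·v product never requires forming the n × n matrix. A minimal sketch, assuming per-sample gradients have already been stacked into an (N, n) tensor:

```python
import torch

def empirical_fisher_matvec(per_sample_grads: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Compute F @ v = (1/N) Σᵢ gᵢ (gᵢᵀ v) from stacked per-sample gradients."""
    coeffs = per_sample_grads @ v                        # (N,) values gᵢᵀ v
    return per_sample_grads.T @ coeffs / per_sample_grads.shape[0]
```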

EmpiricalFisherOperator

class EmpiricalFisherOperator(CurvatureOperator):
    def __init__(
        self,
        model: nn.Module,
        dataloader: Iterable[Any],
        loss_fn: LossFn,
        *,
        per_sample: bool = False,
    ) -> None:

Sources: hessian_eigenthings/operators/fisher.py

Curvature Operator Interface

All curvature matrices implement the CurvatureOperator base class:

class CurvatureOperator(ABC):
    @property
    @abstractmethod
    def size(self) -> int:
        """Number of parameters (matrix dimension)."""
        ...
    
    @property
    def dtype(self) -> torch.dtype:
        ...
    
    @property
    def device(self) -> torch.device:
        ...
    
    @abstractmethod
    def matvec(self, v: torch.Tensor) -> torch.Tensor:
        """Compute matrix-vector product A @ v."""
        ...

Sources: hessian_eigenthings/operators/base.py

Parameter Filtering

Curvature operators support computing curvature only over a subset of parameters using ParamFilter:

def match_names(*patterns: str) -> ParamFilter:
    """Match parameters by name patterns."""
    
def match_regex(pattern: str) -> ParamFilter:
    """Match parameters by regex pattern."""

This enables analysis of specific components:

# Analyze only attention weights
attn_op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=match_names("blocks.*.attn.*"),
)

Algorithmic Foundations

Lanczos Eigendecomposition

The Lanczos algorithm computes eigenvalues and eigenvectors of large sparse matrices using only matrix-vector products:

def lanczos(
    operator: CurvatureOperator,
    k: int = 10,
    max_iter: int = 100,
    tol: float = 1e-6,
    which: str = "LM",
) -> EigenResult:

The algorithm:

  1. Builds a tridiagonal matrix T from matvec operations
  2. Computes eigenvalues of T as Ritz approximations
  3. Accumulates eigenvectors directly via rank-1 outer-product updates
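A condensed sketch of these steps with full reorthogonalization (illustrative only; the library's implementation adds convergence tracking and Ritz-vector accumulation):

```python
import torch

def lanczos_sketch(matvec, n, m, seed=0):
    """Run m Lanczos steps against a symmetric matvec and return Ritz values."""
    gen = torch.Generator().manual_seed(seed)
    v = torch.randn(n, generator=gen)
    v /= v.norm()
    V, alphas, betas = [v], [], []
    for _ in range(m):
        w = matvec(V[-1])
        alpha = torch.dot(w, V[-1])
        w = w - alpha * V[-1] - (betas[-1] * V[-2] if betas else 0)
        for u in V:                       # full reorthogonalization
            w = w - torch.dot(w, u) * u
        alphas.append(alpha)
        beta = w.norm()
        if beta < 1e-10:                  # Krylov space exhausted
            break
        betas.append(beta)
        V.append(w / beta)
    T = torch.diag(torch.stack(alphas))   # build the tridiagonal matrix T
    if len(alphas) > 1:
        off = torch.stack(betas[: len(alphas) - 1])
        T = T + torch.diag(off, 1) + torch.diag(off, -1)
    return torch.linalg.eigvalsh(T)       # Ritz values ≈ extreme eigenvalues

A = torch.diag(torch.tensor([10.0, 3.0, 1.0, 0.1]))
print(lanczos_sketch(lambda v: A @ v, n=4, m=4))  # recovers the spectrum of A
```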

Sources: hessian_eigenthings/algorithms/lanczos.py:1-60

Trace Estimation

Trace estimation uses stochastic probing to estimate tr(A) without computing the full matrix:

| Method | Samples | Variance | Description |
|---|---|---|---|
| Hutchinson | m | O(1/√m) | (1/m) Σ vᵢᵀ A vᵢ with Rademacher/Gaussian vectors |
| Hutch++ | m | O(1/m) | Improved estimator with better constant factors |

```python
def trace(
    operator: CurvatureOperator,
    *,
    num_matvecs: int = 100,
    method: Method = "hutch++",
) -> TraceResult:
```

Sources: hessian_eigenthings/algorithms/trace.py:1-50

Operator Selection Guide

graph LR
    A[Need Curvature?] --> B{Exact Hessian?}
    B -->|Yes, small model| C[HessianOperator<br/>method=autograd]
    B -->|No| D{Need Fisher/GGN?}
    D -->|Yes, cross-entropy| E[GGNOperator<br/>loss_hvp=analytical]
    D -->|Yes, other loss| F[GGNOperator<br/>loss_hvp=autograd]
    D -->|Empirical Fisher| G[EmpiricalFisherOperator]
    
    C --> H[Use lanczos for eigenvalues]
    E --> H
    F --> H
    G --> H

| Scenario | Recommended Operator | Method |
|---|---|---|
| Exact Hessian, single GPU | HessianOperator | method="autograd" |
| Large model, distributed | HessianOperator | method="finite_difference" |
| Language modeling, cross-entropy | GGNOperator | loss_hvp="analytical" |
| Custom loss, need exact | GGNOperator | loss_hvp="autograd" |
| Natural gradient optimization | EmpiricalFisherOperator | Default |

Module Exports

from hessian_eigenthings.operators import (
    CurvatureOperator,
    HessianOperator,
    GGNOperator,
    EmpiricalFisherOperator,
    DDPHessianOperator,  # For DistributedDataParallel
    LambdaOperator,      # Custom curvature wrappers
)

from hessian_eigenthings.algorithms import (
    lanczos,              # Top-k eigendecomposition
    trace,                # Trace estimation
    spectral_density,    # Density plot via SLQ
    deflated_power_iteration,
)

Sources: hessian_eigenthings/operators/__init__.py:1-25

Summary

The hessian-eigenthings library provides a unified interface for computing curvature information in neural networks:

  • Hessian: Exact local curvature via autograd or finite-difference approximation
  • GGN: Positive semi-definite approximation ideal for large-scale analysis
  • Empirical Fisher: Sample-based Fisher approximation for natural gradient methods

All operators provide matrix-free matvec implementations, enabling eigendecomposition and trace estimation for models with billions of parameters without explicit matrix construction.

Sources: [hessian_eigenthings/operators/hessian.py:1-30](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/hessian.py)

Why Hessian-Vector Products

Related topics: Curvature Matrices Explained, Eigendecomposition Algorithms


Hessian-vector products (HVPs) are the computational foundation of this library. Understanding *why* we use HVPs instead of computing the full Hessian matrix is essential for appreciating the design and capabilities of hessian-eigenthings.

The Full Hessian Problem

The Hessian matrix $H$ of a neural network's loss function is a second-order partial derivative matrix with dimensions $[n \times n]$, where $n$ is the number of parameters. For modern large-scale models:

| Model | Parameters | Hessian Size | Memory (fp32) |
|---|---|---|---|
| BERT-Base | 110M | 110M × 110M | ~48 PB |
| GPT-2 | 1.5B | 1.5B × 1.5B | ~9 EB |
| LLaMA-7B | 7B | 7B × 7B | ~196 EB |

Storing the full Hessian is fundamentally infeasible. Even computing it via automatic differentiation requires $O(n^2)$ operations and memory that scales quadratically with model size. Sources: README.md:1-30

What is a Hessian-Vector Product?

A Hessian-vector product computes $Hv$ for a given vector $v$ without ever constructing $H$ explicitly. The operation takes $O(n)$ time and memory—linear in the number of parameters.

Formally, given:

  • Loss function $\mathcal{L}(\theta)$
  • Parameter vector $\theta \in \mathbb{R}^n$
  • Direction vector $v \in \mathbb{R}^n$

The HVP is: $$Hv = \nabla_\theta^2 \mathcal{L} \cdot v = \frac{\partial}{\partial \theta} \left( \nabla_\theta \mathcal{L} \cdot v \right)$$

This is implemented as a double-backward pass:

  1. Forward pass → compute loss
  2. Backward pass → compute gradient $\nabla_\theta \mathcal{L}$
  3. Second backward pass → differentiate the scalar $\nabla_\theta \mathcal{L} \cdot v$ to obtain $H \cdot v$
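A minimal sketch of this double-backward pattern (the library wraps it inside HessianOperator; the flat-vector layout used here is an illustrative assumption):

```python
import torch

def hvp(loss: torch.Tensor, params: list[torch.Tensor], v: torch.Tensor) -> torch.Tensor:
    """Compute H @ v for a scalar `loss`, with v flattened across all params."""
    grads = torch.autograd.grad(loss, params, create_graph=True)  # ∇L, graph kept
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    grad_dot_v = torch.dot(flat_grad, v)                          # scalar ⟨∇L, v⟩
    hvps = torch.autograd.grad(grad_dot_v, params)                # ∇⟨∇L, v⟩ = H v
    return torch.cat([h.reshape(-1) for h in hvps])
```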

Sources: hessian_eigenthings/operators/hessian.py:1-50

Why HVPs Enable Scalable Curvature Analysis

By avoiding explicit Hessian construction, HVP-based algorithms can operate on models of any size. This library provides several key algorithms that all rely on HVP as their primitive operation:

graph TD
    A[Hessian-Vector Product] --> B[Lanczos Eigendecomposition]
    A --> C[Hutchinson Trace Estimation]
    A --> D[Hutch++ Trace Estimation]
    A --> E[Stochastic Lanczos Quadrature]
    
    B --> F[Top-k Eigenvalues & Eigenvectors]
    C --> G[Trace Estimation]
    D --> G
    E --> H[Spectral Density Plot]

Eigendecomposition via Lanczos

The Lanczos algorithm iteratively builds an orthogonal basis that tridiagonalizes the operator. It requires only matrix-vector products, making it perfect for HVP-based curvature analysis:

| Property | Full Eigendecomp | Lanczos + HVP |
|---|---|---|
| Memory | $O(n^2)$ | $O(n \cdot k)$ |
| Time | $O(n^3)$ | $O(n \cdot k^2)$ |
| Storage | Entire matrix | $k$ Lanczos vectors |

Where $k$ is the number of desired eigenpairs (typically 1-20). Sources: hessian_eigenthings/algorithms/lanczos.py:1-60

Trace Estimation via Hutchinson's Method

The trace of the Hessian can be estimated without constructing the full matrix:

$$\text{tr}(H) \approx \frac{1}{m} \sum_{i=1}^{m} v_i^T H v_i$$

where $v_i$ are random probe vectors (typically Rademacher or Gaussian). Each term $v_i^T H v_i$ is a single HVP plus a dot product. Sources: hessian_eigenthings/algorithms/trace.py:1-45

HVP Implementation Strategies

The library provides two distinct methods for computing HVPs, each with different trade-offs:

Method 1: Autograd (Default)

Uses torch.autograd.grad with create_graph=True for exact double-backward computation:

def __init__(
    self,
    model: nn.Module,
    dataloader: Iterable[Any],
    loss_fn: LossFn,
    *,
    method: HvpMethod = "autograd",  # Default
    ...
) -> None:

Advantages:

  • Numerically exact (to floating-point rounding)
  • Works with any differentiable loss function
  • Simple implementation

Disadvantages:

  • Builds the full computation graph for second derivatives
  • Memory scales with model complexity and output size

Sources: hessian_eigenthings/operators/hessian.py:20-45

Method 2: Finite Difference

Uses central-difference approximation: $$\frac{\nabla_\theta \mathcal{L}(\theta + \epsilon v) - \nabla_\theta \mathcal{L}(\theta - \epsilon v)}{2\epsilon}$$

Advantages:

  • No second-backward graph → lower memory footprint
  • Compatible with distributed training (FSDP/HSDP/TP) without special handling

Disadvantages:

  • $O(\epsilon^2)$ truncation bias
  • Precision-dependent roundoff (~1e-5 fp32, ~1e-2 bf16)
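A minimal sketch of the central-difference scheme over a single flat parameter tensor (illustrative; the epsilon follows the fp32 default listed elsewhere in this manual):

```python
import torch

def finite_difference_hvp(loss_fn, theta: torch.Tensor, v: torch.Tensor, eps: float = 5e-3):
    """Approximate H @ v as (∇L(θ+εv) − ∇L(θ−εv)) / (2ε)."""
    def grad_at(point):
        point = point.detach().requires_grad_(True)
        loss_fn(point).backward()
        return point.grad
    return (grad_at(theta + eps * v) - grad_at(theta - eps * v)) / (2 * eps)
```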

Sources: hessian_eigenthings/operators/hessian.py:30-40

Fused HVP for Cross-Entropy Losses

For language models with large vocabulary softmax heads, computing $H_{\text{loss}} \cdot u$ naively allocates multiple $(N, V)$ intermediate tensors (where $V$ is vocabulary size). This is addressed with fused implementations:

| Backend | Speedup | Memory Reduction | Requirements |
|---|---|---|---|
| eager | 1× (baseline) | 1× (baseline) | Any |
| compile | ~2.6× | ~2× | torch.compile |
| triton | ~3.4× | ~2× | CUDA + Triton |

The fused computation computes: $$H_{\text{loss}} \cdot u = \frac{p \odot u - p \, \langle p, u \rangle}{n} \odot \text{mask}$$

Where $p = \text{softmax}(\text{logits})$ and the implementation avoids materializing the full $(N, V)$ softmax output. Sources: hessian_eigenthings/loss_fns/_fused_ce_hvp.py:1-50

Generalized Gauss-Newton (GGN) Approximation

For optimization-focused curvature analysis, the GGN matrix $G$ provides a positive semi-definite (PSD) approximation to the Hessian:

$$G = J^T \cdot H_{\text{loss}} \cdot J$$

Where $J$ is the Jacobian of the model outputs with respect to parameters. For cross-entropy + softmax classification, $G$ equals the Fisher information matrix. The GGN is always PSD by construction, making it suitable for optimization algorithms. Sources: hessian_eigenthings/operators/ggn.py:1-40
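To make the $J$, $H_{\text{loss}}$, $J^T$ factorization concrete, here is a hedged sketch of one GGN matvec using torch.func; the function names and single-batch layout are illustrative assumptions, not the library's implementation:

```python
import torch
from torch import nn
from torch.func import functional_call, jvp, vjp

def ggn_matvec(model: nn.Module, params: dict, x: torch.Tensor, v: dict) -> dict:
    """Compute (Jᵀ H_loss J) v for mean-reduced softmax cross-entropy.
    `params` and `v` are dicts of tensors, e.g. dict(model.named_parameters())."""
    def output_of_params(p):
        return functional_call(model, p, (x,))

    # 1. Jv: forward-mode JVP through the network.
    out, Jv = jvp(output_of_params, (params,), (v,))

    # 2. H_loss (Jv): analytical CE Hessian-vector product, (p ⊙ u − p ⟨p, u⟩)/n.
    p_soft = out.softmax(dim=-1)
    HJv = (p_soft * Jv - p_soft * (p_soft * Jv).sum(-1, keepdim=True)) / out.shape[0]

    # 3. Jᵀ (H_loss Jv): reverse-mode VJP back into parameter space.
    _, vjp_fn = vjp(output_of_params, params)
    return vjp_fn(HJv)[0]
```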

GGN Matvec Implementation

The GGNOperator supports two matvec paths:

  1. Analytical (default): Finite-difference JVP + analytical loss-Hessian-vector product + single normal backward. Memory footprint matches one normal training step.
  2. Autograd: Original torch.func.jvp + autograd double-backward + torch.func.vjp. Numerically exact but scales badly with vocabulary size.

Sources: hessian_eigenthings/operators/ggn.py:25-45

Practical Implications

The HVP approach enables:

| Capability | HVP-Based | Full Hessian |
|---|---|---|
| 7B parameter model | ✅ ~hours | ❌ impossible |
| Top-10 eigenpairs | ✅ | ❌ |
| Trace estimation | ✅ | ❌ |
| Spectral density | ✅ | ❌ |
| FSDP compatibility | ✅ (finite-diff) | ❌ |

The eigenvalues and eigenvectors of the Hessian have been implicated in generalization properties of neural networks. Researchers hypothesize that "flat minima" generalize better, that Hessians of large models are very low-rank, and that curvature analysis can guide optimization. Sources: README.md:25-35

Summary

Hessian-vector products are the fundamental building block that makes large-scale curvature analysis possible:

  1. Memory efficiency: $O(n)$ vs $O(n^2)$ for the full Hessian
  2. Computational efficiency: $O(n)$ per matvec vs $O(n^2)$ for full computation
  3. Scalability: Works with models of any size via iterative algorithms
  4. Flexibility: Supports exact (autograd) or memory-efficient (finite-difference) computation

The hessian-eigenthings library provides production-ready implementations of HVP computation and HVP-based algorithms for practical curvature analysis in PyTorch.

Sources: hessian_eigenthings/operators/hessian.py:1-50

System Architecture

Related topics: Curvature Operators, Eigendecomposition Algorithms, Loss Functions


The pytorch-hessian-eigenthings library provides an efficient and scalable framework for computing eigendecompositions of curvature matrices—including the Hessian, Generalized Gauss-Newton (GGN) matrix, and empirical Fisher—for arbitrary PyTorch models. The architecture is designed around three core abstractions: Curvature Operators, Algorithms, and Linear Algebra Backends.

High-Level Architecture Overview

The library implements a layered architecture that separates mathematical curvature computations from numerical algorithms:

graph TD
    subgraph "User Layer"
        U[User Code]
    end
    
    subgraph "Algorithm Layer"
        LA[Lanczos]
        TR[Trace Estimation]
        SP[Stochastic Power Iteration]
    end
    
    subgraph "Operator Layer"
        HO[HessianOperator]
        GGN[GGNOperator]
        FO[FisherOperator]
    end
    
    subgraph "Backend Layer"
        B[LinAlgBackend]
        SD[SingleDeviceBackend]
    end
    
    subgraph "PyTorch Core"
        PT[PyTorch Autograd]
    end
    
    U -->|uses| LA
    U -->|uses| TR
    U -->|uses| HO
    LA -->|operates on| HO
    TR -->|operates on| HO
    HO -->|implemented via| B
    B -->|delegates to| PT

Core Components

1. Curvature Operators

Curvature operators are the foundation of the library. They abstract away the details of how matrix-vector products (matvecs) with curvature matrices are computed, providing a unified interface for algorithms to work with.

#### Base Interface

All operators inherit from CurvatureOperator, which defines the contract for curvature computations:

| Property/Method | Type | Description |
|---|---|---|
| size | int | Total number of parameters in the curvature matrix |
| dtype | torch.dtype | Data type of the operator |
| device | torch.device | Device where computations run |
| matvec(v) | Callable | Computes A @ v for input vector v |

#### Hessian Operator

The HessianOperator computes the Hessian of a loss function with respect to model parameters:

HessianOperator(
    model: nn.Module,
    dataloader: Iterable[Any],
    loss_fn: LossFn,
    *,
    param_filter: ParamFilter | None = None,
    method: HvpMethod = "autograd"  # or "finite_difference"
)

Sources: hessian_eigenthings/operators/hessian.py:1-50

Two HVP computation methods are supported:

| Method | Description | Use Case |
|---|---|---|
| "autograd" (default) | Exact double-backward via torch.autograd.grad | Up to ~7B parameters |
| "finite_difference" | Central-difference approximation | FSDP/HSDP/TP at scale |

The finite difference method uses the approximation:

Hv ≈ (∇L(θ+εv) − ∇L(θ−εv)) / 2ε

This avoids the second-backward graph entirely, making it compatible with distributed training setups.

#### GGN Operator

The GGNOperator implements the Generalized Gauss-Newton matrix, which is always positive semi-definite:

GGNOperator(
    model: nn.Module,
    dataloader: Iterable[Any],
    forward_fn: ForwardFn,
    loss_of_output_fn: LossOfOutputFn,
    *,
    loss_hvp: Literal["analytical", "autograd"] = "analytical"
)

Sources: hessian_eigenthings/operators/ggn.py:1-80

The GGN decomposes as G = J^T · H_loss · J where:

  • J is the Jacobian of the model output with respect to parameters
  • H_loss is the Hessian of the loss with respect to the output

For cross-entropy + softmax classification, G equals the Fisher information matrix.

2. Algorithms Layer

Algorithms operate on any CurvatureOperator via its matvec interface, enabling eigenvalue computation, trace estimation, and spectral density analysis.

#### Lanczos Eigensolver

The Lanczos algorithm computes the top-k eigenvalues and eigenvectors of a symmetric matrix using only matrix-vector products:

lanczos(
    operator: CurvatureOperator,
    k: int = 10,
    max_iter: int = 100,
    tol: float = 1e-3,
    which: str = "LA"  # LA, SA, or LM
) -> EigendecompositionResult

Sources: hessian_eigenthings/algorithms/lanczos.py:1-50

Key features:

  • Ritz vector accumulation: Directly accumulates Ritz vectors into final (k, n) layout via rank-1 outer-product updates, avoiding transient (n, k) transpose copies
  • Convergence tracking: Monitors residual norms |β_k · s_{k}| to determine convergence
  • Eigenvalue selection: Supports "LA" (largest algebraic), "SA" (smallest algebraic), and "LM" (largest magnitude)

#### Trace Estimation

The library provides multiple trace estimation methods:

| Method | Description | Samples Required |
|---|---|---|
| hutchinson | Classical Hutchinson: (1/m) Σ vᵢᵀ A vᵢ | Higher variance |
| hutch++ | Improved estimator with lower variance | ~30 recommended |

```python
trace(
    operator: CurvatureOperator,
    num_matvecs: int = 30,
    method: str = "hutch++",
    seed: int | None = None
) -> TraceResult
```

Sources: hessian_eigenthings/algorithms/trace.py:1-40

The Hutch++ estimator achieves lower variance by combining a low-rank sketch of the operator with Hutchinson probing on the deflated remainder.
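A hedged sketch of the Hutch++ recipe (sketch, deflate, then probe the remainder), written against a bare matvec callable rather than the library's operator class:

```python
import torch

def hutchpp_trace(matvec, n, num_matvecs=30, seed=0):
    """Hutch++: tr(Qᵀ A Q) exactly on a range sketch Q, plus Hutchinson
    probing of the deflated remainder; uses ~num_matvecs matvec calls."""
    gen = torch.Generator().manual_seed(seed)
    m = num_matvecs // 3
    S = (torch.randint(0, 2, (n, m), generator=gen) * 2 - 1).float()
    AS = torch.stack([matvec(S[:, i]) for i in range(m)], dim=1)
    Q, _ = torch.linalg.qr(AS)                          # orthonormal range sketch
    t1 = sum(torch.dot(Q[:, i], matvec(Q[:, i])) for i in range(Q.shape[1]))
    G = (torch.randint(0, 2, (n, m), generator=gen) * 2 - 1).float()
    G = G - Q @ (Q.T @ G)                               # deflate: (I − QQᵀ) G
    t2 = sum(torch.dot(G[:, i], matvec(G[:, i])) for i in range(m)) / m
    return (t1 + t2).item()

A = torch.diag(torch.arange(1.0, 101.0))                # trace = 5050
print(hutchpp_trace(lambda v: A @ v, n=100))            # ≈ 5050
```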

3. Backend Layer

The LinAlgBackend abstract interface decouples linear algebra operations from specific device implementations:

classDiagram
    class LinAlgBackend~T~ {
        <<abstract>>
        +matmul(a, b) T
        +dot(a, b) T
        +norm(v) T
        +fill(v, value) T
        +copy(v) T
    }
    
    class SingleDeviceBackend {
        +matmul(a, b) Tensor
        +dot(a, b) Tensor
        +norm(v) Tensor
    }
    
    LinAlgBackend <|-- SingleDeviceBackend

Backends provide:

  • Vector arithmetic operations (dot product, norm, fill, copy)
  • Device-specific optimizations
  • Memory allocation strategies

Data Flow

Eigendecomposition Workflow

sequenceDiagram
    participant User
    participant Operator as CurvatureOperator
    participant Backend as LinAlgBackend
    participant Algo as Lanczos Algorithm
    participant PyTorch as PyTorch Autograd
    
    User->>Operator: Instantiate with model, dataloader
    User->>Algo: Call lanczos(operator, k)
    Algo->>Operator: Request matvec(v)
    Operator->>Backend: Allocate probe vector
    Backend->>PyTorch: Create tensor
    Operator->>PyTorch: Forward pass + backward
    PyTorch-->>Operator: Return HVP result
    Operator-->>Algo: Return Av
    Algo->>Algo: Repeat for m iterations
    Algo-->>User: Return eigenvalues, eigenvectors

Loss Function Integration

The library supports two loss function patterns:

graph LR
    subgraph "Single Function API"
        L1[loss_fn<br/>model, batch → scalar]
    end
    
    subgraph "Two Function API (for GGN)"
        F1[forward_fn<br/>model, batch → output]
        L2[loss_of_output_fn<br/>output, batch → scalar]
    end
    
    L1 --> HO[HessianOperator]
    F1 --> GGN[GGNOperator]
    L2 --> GGN

#### HuggingFace Integration

For language models, the library provides optimized loss functions:

hf_lm_loss(fused="auto")  # Auto-selects Triton or torch.compile

Sources: hessian_eigenthings/loss_fns/huggingface.py:1-30

The fused implementation:

  • Uses Triton kernels on CUDA (~3.4x speedup, 2x peak-memory reduction)
  • Falls back to torch.compile (~2.6x speedup, 2x peak-memory reduction)
  • Eliminates most (N, V) intermediates

Configuration Options

HessianOperator Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| model | nn.Module | Required | PyTorch model |
| dataloader | Iterable | Required | Data batches |
| loss_fn | LossFn | Required | Loss computation function |
| param_filter | ParamFilter | None | Filter parameters by name |
| method | HvpMethod | "autograd" | HVP computation method |
| fd_eps | float | None | Finite difference epsilon |

GGNOperator Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| loss_hvp | str | "analytical" | "analytical" or "autograd" |
| full_dataset | bool | True | Average over full dataset |
| num_batches | int | None | Limit to first N batches |
| microbatch_size | int | None | Process in smaller chunks |

Lanczos Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| k | int | 10 | Number of eigenvalues to compute |
| max_iter | int | 100 | Maximum Lanczos iterations |
| tol | float | 1e-3 | Convergence tolerance |
| which | str | "LA" | Which eigenvalues ("LA", "SA", "LM") |
| reorthogonalize | bool | False | Full reorthogonalization |

Usage Patterns

Basic Hessian Eigenvalue Computation

from hessian_eigenthings import HessianOperator, lanczos

# Create operator
hessian_op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=lambda m, b: torch.nn.functional.cross_entropy(m(b[0]), b[1])
)

# Compute top eigenvalues
eig_result = lanczos(hessian_op, k=10, max_iter=100)
print(eig_result.eigenvalues)

Parameter-Filtered Analysis

from hessian_eigenthings import HessianOperator, lanczos

# Analyze only attention parameters
attn_op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=match_names("blocks.*.attn.*")
)
eig_attn = lanczos(attn_op, k=3)

Trace Estimation

from hessian_eigenthings import HessianOperator, trace

trace_result = trace(
    hessian_op,
    num_matvecs=30,
    method="hutch++",
    seed=42
)
print(f"Trace estimate: {trace_result.estimate:.4e}")

Architecture Benefits

| Benefit | Description |
|---|---|
| Separation of Concerns | Operators define "what" to compute; algorithms define "how" |
| Flexibility | Any operator can use any algorithm |
| Scalability | Backends enable device-specific optimizations |
| Composability | Easy to add new operators or algorithms |
| Memory Efficiency | Matrix-free design avoids explicit matrix storage |

Extension Points

Adding Custom Curvature Operators

New operators should subclass CurvatureOperator and implement the matvec method:

class CustomCurvatureOperator(CurvatureOperator):
    def __init__(self, model, dataloader):
        super().__init__()
        self.model = model
        self.dataloader = dataloader
        # Register parameters
    
    def matvec(self, v: torch.Tensor) -> torch.Tensor:
        # Implement A @ v
        return custom_computation(v)

Adding New Algorithms

Algorithms should accept any CurvatureOperator and use the backend exclusively:

def custom_algorithm(
    operator: CurvatureOperator,
    backend: LinAlgBackend | None = None
) -> SomeResult:
    backend = backend or SingleDeviceBackend()
    # Use backend for all vector operations

Summary

The system architecture of pytorch-hessian-eigenthings follows a clean, modular design that separates curvature matrix computation (operators), numerical algorithms (Lanczos, trace estimation), and linear algebra primitives (backends). This design enables efficient Hessian and GGN eigendecomposition for models ranging from small MLPs to large language models, with support for distributed training and optimized fused computations.

Sources: hessian_eigenthings/operators/hessian.py:1-50

Curvature Operators

Related topics: System Architecture, Distributed Computing with DDP


Overview

Curvature Operators in hessian-eigenthings provide a matrix-free abstraction for computing Hessian eigendecomposition and related curvature matrices for arbitrary PyTorch models. They implement the CurvatureOperator base class interface, enabling efficient computation of eigenvalues, eigenvectors, traces, and spectral densities without explicitly forming potentially massive matrices.

The core abstraction allows algorithms (Lanczos, power iteration, Hutch++) to operate on any curvature matrix through a unified matvec(v) interface that computes $Av$ for any vector $v$, enabling scalability to large models with billions of parameters.

Sources: hessian_eigenthings/__init__.py:1-10

Architecture

graph TD
    subgraph "Curvature Operators"
        Base[CurvatureOperator<br/>Base Class]
        Hessian[HessianOperator]
        GGN[GGNOperator]
        Fisher[EmpiricalFisherOperator]
        Lambda[LambdaOperator]
        DDP[DDPHessianOperator]
    end
    
    subgraph "Algorithms"
        Lanczos[Lanczos Eigendecomposition]
        Power[Power Iteration]
        Trace[Trace Estimation<br/>Hutch++/Hutchinson]
        Spectral[Spectral Density<br/>Stochastic Lanczos Quadrature]
    end
    
    Base --> Hessian
    Base --> GGN
    Base --> Fisher
    Base --> Lambda
    Base --> DDP
    
    Hessian --> Lanczos
    GGN --> Lanczos
    Fisher --> Lanczos
    Lambda --> Lanczos
    
    Hessian --> Trace
    GGN --> Trace
    Fisher --> Trace
    
    Hessian --> Power
    GGN --> Power
    Fisher --> Power
    
    Hessian --> Spectral
    GGN --> Spectral
    Fisher --> Spectral

Base Class: CurvatureOperator

All curvature operators inherit from CurvatureOperator, which defines the contract that subclasses must fulfill.

Core Interface

| Method | Description |
|---|---|
| matvec(v) | Compute $Av$ where $A$ is the curvature matrix |
| size | Total number of parameters in the operator's scope |
| dtype, device | Tensor dtype and device for vector operations |

Sources: hessian_eigenthings/operators/base.py

Parameter Filtering

Curvature operators can be restricted to subsets of model parameters using param_filter, enabling analysis of specific components (e.g., attention layers only).

from hessian_eigenthings import HessianOperator, match_names

# Filter to attention parameters only
attn_op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=match_names("blocks.*.attn.*")
)

Sources: hessian_eigenthings/__init__.py:35-38

HessianOperator

Computes the Hessian $\nabla_{\theta}^2 \mathcal{L}$ of the loss function with respect to model parameters.

Key Features

  • Two HVP methods: autograd (exact double-backward via torch.autograd.grad) and finite_difference (central difference for FSDP/TP compatibility)
  • Batched computation: Automatically averages over multiple batches from the dataloader
  • Microbatch support: For large models, process batches in smaller microbatches

Constructor Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| model | nn.Module | Required | PyTorch model |
| dataloader | Iterable | Required | Data batches |
| loss_fn | LossFn | Required | Loss computation function |
| param_filter | `ParamFilter \| None` | None | Parameter name filter |
| full_dataset | bool | True | Average over full dataset |
| num_batches | `int \| None` | None | Limit batches for stochastic estimate |
| microbatch_size | `int \| None` | None | Split batches into smaller microbatches |
| method | HvpMethod | "autograd" | HVP computation method |
| fd_eps | `float \| None` | None | Finite difference epsilon |
| backend | `LinAlgBackend \| None` | None | Linear algebra backend |

HVP Method Comparison

| Method | Accuracy | Memory | FSDP/TP Compatible | Speed |
|---|---|---|---|---|
| autograd | Exact (to rounding) | High | No | Fast |
| finite_difference | $O(\epsilon^2)$ bias | Low | Yes | 2x passes |

Sources: hessian_eigenthings/operators/hessian.py:1-60

Finite Difference Epsilon Table

| Dtype | Optimal $\epsilon$ |
|---|---|
| float64 | 6e-6 |
| float32 | 5e-3 |
| bfloat16 | 0.2 |
| float16 | 5e-2 |

Sources: hessian_eigenthings/operators/hessian.py:34-40

GGNOperator

The Generalized Gauss-Newton (GGN) matrix $G = J^T H_{loss} J$ provides a PSD approximation to the Hessian that is computationally cheaper while preserving the eigenvalues that matter for optimization.

Key Features

  • Always PSD: Unlike the exact Hessian, the GGN is positive semi-definite by construction
  • Analytical HVP path: For losses with known HVP (e.g., cross-entropy), uses analytical computation
  • For cross-entropy + softmax: $G$ equals the Fisher information matrix

Two Matvec Implementations

| loss_hvp | Description | Memory | Use Case |
|---|---|---|---|
| "analytical" (default) | FD JVP + analytical loss-Hessian-vec + one backward | Matches one training step | LM-scale, large vocab |
| "autograd" | torch.func.jvp + double-backward + torch.func.vjp | Scales with output size | Exact, small vocab |

Sources: hessian_eigenthings/operators/ggn.py:1-50

Fused Cross-Entropy HVP

For language model training, the GGN operator includes a fused kernel for the CE HVP computation:

# Auto-selects fastest backend: Triton > torch.compile > eager
hf_lm_loss_of_output(..., fused="auto")

The fused implementation reduces peak memory by 2x compared to eager, with Triton providing ~3.4x speedup on CUDA.

Sources: hessian_eigenthings/loss_fns/huggingface.py:1-30

EmpiricalFisherOperator

Computes the empirical Fisher information matrix $F = \frac{1}{N} \sum_{i=1}^N \nabla_{\theta} \log p(y_i|x_i) \nabla_{\theta} \log p(y_i|x_i)^T$.

For classification with cross-entropy loss, the empirical Fisher approximates the GGN; the two coincide when the gradient outer products are taken in expectation over the model's predictive distribution rather than over the observed labels.

Sources: hessian_eigenthings/operators/fisher.py

LambdaOperator

Creates custom curvature operators from lambda functions for testing or custom curvature definitions.

from hessian_eigenthings import LambdaOperator

# Custom operator that always returns a scaled vector
custom_op = LambdaOperator(
    size=1000,
    matvec=lambda v: 2.0 * v  # Represents 2*I
)
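Because every algorithm touches only matvec, such a stub can be passed straight to the eigensolvers; a short illustrative check (assuming the lanczos entry point shown earlier in this manual):

```python
from hessian_eigenthings import lanczos

# custom_op above represents 2·I, so every eigenvalue should be ≈ 2.0.
result = lanczos(custom_op, k=3)
print(result.eigenvalues)  # expect three values close to 2.0
```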

DDPHessianOperator

Distributed Data Parallel-aware Hessian operator that handles gradient synchronization across processes.

Sources: hessian_eigenthings/operators/__init__.py:15-18

Common Usage Patterns

Computing Top Eigenvalues

from hessian_eigenthings import HessianOperator, lanczos

operator = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn
)

result = lanczos(operator, k=10, max_iter=100)
print(f"Top eigenvalues: {result.eigenvalues}")

Estimating Trace

from hessian_eigenthings import GGNOperator, trace

operator = GGNOperator(
    model=model,
    dataloader=dataloader,
    forward_fn=model_forward,
    loss_of_output_fn=loss_fn
)

result = trace(operator, num_matvecs=100, method="hutch++")
print(f"Trace estimate: {result.estimate:.4e} ± {result.stderr:.4e}")

Component-Specific Analysis

from hessian_eigenthings import HessianOperator, match_regex

# Analyze only attention weights in transformer
attn_only = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=match_regex(r"blocks\.\d+\.attn\.")
)

# Analyze only MLP weights
mlp_only = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=match_regex(r"blocks\.\d+\.mlp\.")
)

Linear Algebra Backends

The operators use pluggable LinAlgBackend for vector operations, enabling support for different hardware configurations and precision requirements.

| Backend | Use Case |
|---|---|
| SingleDeviceBackend | Single GPU/CPU |
| (Distributed backends) | Multi-GPU via FSDP/TP |

Module Exports

from hessian_eigenthings.operators import (
    CurvatureOperator,
    DDPHessianOperator,
    EmpiricalFisherOperator,
    GGNOperator,
    HessianOperator,
    LambdaOperator,
)

Sources: hessian_eigenthings/operators/__init__.py:1-20

Sources: [hessian_eigenthings/__init__.py:1-10](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/__init__.py)

Eigendecomposition Algorithms

Related topics: System Architecture, Why Hessian-Vector Products


The hessian-eigenthings library provides a suite of efficient iterative algorithms for computing eigendecompositions of curvature matrices (Hessian, Generalized Gauss-Newton, and Fisher) in PyTorch models. These algorithms enable analysis of neural network loss landscapes by extracting eigenvalues, eigenvectors, spectral densities, and trace estimates without explicitly constructing the full curvature matrix—a critical capability for modern large-scale models.

Overview

Computing eigendecompositions of curvature matrices is fundamental to understanding generalization properties, flat minima, and training dynamics of neural networks. However, these curvature matrices are prohibitively large (n × n where n is the number of parameters), making explicit construction impossible for modern models.

The library implements Krylov subspace methods that only require matrix-vector products, enabling efficient computation of:

| Capability | Algorithm | Use Case |
| --- | --- | --- |
| Top-k eigenvalues/eigenvectors | Lanczos, Power Iteration | Finding most-curved directions |
| Trace estimation | Hutchinson, Hutch++ | Computing average curvature |
| Spectral density | Stochastic Lanczos Quadrature | Visualizing eigenvalue distribution |

Sources: hessian_eigenthings/algorithms/__init__.py:1-29

Algorithm Architecture

The algorithms in this module follow a consistent design pattern: they accept any CurvatureOperator and use the LinAlgBackend exclusively for vector arithmetic, ensuring portability across single-device and distributed settings.

graph TD
    A[CurvatureOperator] --> B[Lanczos Algorithm]
    A --> C[Power Iteration]
    A --> D[Trace Estimation]
    A --> E[Spectral Density]
    
    B --> F[EigenResult]
    C --> F
    D --> G[TraceResult]
    E --> H[SpectralDensityResult]
    
    F --> I[eigenvalues: Tensor]
    F --> J[eigenvectors: Tensor]
    F --> K[residuals: Tensor]

Sources: hessian_eigenthings/algorithms/result.py

Lanczos Algorithm

The Lanczos algorithm is the primary method for computing eigenvalues and eigenvectors of symmetric matrices. It builds a Krylov subspace through repeated matrix-vector products, then solves the small tridiagonal eigenvalue problem.
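For intuition, here is a minimal textbook version of the symmetric Lanczos recurrence against a bare matvec callable (a sketch under simplified assumptions, not the library's implementation; names are illustrative):

import torch

def lanczos_sketch(matvec, n: int, m: int, reorthogonalize: bool = True):
    """m-step symmetric Lanczos: tridiagonal coefficients plus the Krylov basis,
    given only matvec(v) -> A @ v for a symmetric operator A."""
    q = torch.randn(n)
    q = q / q.norm()
    basis, alphas, betas = [q], [], []
    q_prev, beta = torch.zeros(n), 0.0
    for _ in range(m):
        w = matvec(q) - beta * q_prev          # three-term recurrence
        alpha = torch.dot(w, q)
        w = w - alpha * q
        if reorthogonalize:                    # full reorthogonalization against all prior vectors
            for u in basis:
                w = w - torch.dot(w, u) * u
        alphas.append(alpha.item())
        beta = w.norm().item()
        if beta < 1e-10:                       # invariant subspace reached; stop early
            break
        betas.append(beta)
        q_prev, q = q, w / beta
        basis.append(q)
    return alphas, betas, basis

The Ritz values, i.e. the eigenvalues of the small tridiagonal matrix with alphas on the diagonal and betas off it, approximate the extreme eigenvalues of A and can be computed densely (e.g., with scipy.linalg.eigh_tridiagonal).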

Symmetric Lanczos Implementation

The in-house Lanczos implementation provides optional full reorthogonalization to address the loss-of-orthogonality issues classical Lanczos is known for (Paige 1976).

def lanczos_tridiagonal(
    operator: CurvatureOperator,
    v0: torch.Tensor,
    max_iter: int,
    *,
    reorthogonalize: bool = True,
    backend: LinAlgBackend[torch.Tensor] | None = None,
) -> LanczosTridiag

Key characteristics:

  • Default reorthogonalization: Enabled for max_iter <= 50 to suppress ghost eigenvalues
  • Computational tradeoff: For larger Krylov dimensions, reorthogonalization becomes O(m²n); users analyzing near-degenerate spectra should re-enable it
  • Memory efficiency: Accumulates Ritz vectors directly via rank-1 outer-product updates, avoiding transient (n, k) → (k, n) transpose copies

Sources: hessian_eigenthings/algorithms/lanczos.py:30-58

Lanczos Output Structure

@dataclass(frozen=True)
class LanczosTridiag:
    """Output of one Lanczos run: tridiagonal coefficients + the basis used to build them."""
    alphas: torch.Tensor      # (m,) diagonal
    betas: torch.Tensor       # (m-1,) off-diagonal
    basis: list[torch.Tensor] # length m, each (n,)
    last_beta: float          # ||r_m|| residual norm at termination
    iterations: int          # m, the actual number of Lanczos steps completed

Sources: hessian_eigenthings/algorithms/lanczos.py:30-43

Full Lanczos Eigensolver

The high-level lanczos() function computes top-k eigenvalues with configurable eigenpair selection:

def lanczos(
    operator: CurvatureOperator,
    k: int = 1,
    max_iter: int = 100,
    *,
    which: Which = "LM",
    tol: float = 1e-8,
    seed: int | None = None,
    reorthogonalize: bool | None = None,
    backend: LinAlgBackend[torch.Tensor] | None = None,
) -> EigenResult

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| operator | CurvatureOperator | required | The curvature matrix operator |
| k | int | 1 | Number of eigenpairs to compute |
| max_iter | int | 100 | Maximum Lanczos iterations |
| which | Literal["LM", "LA", "SA"] | "LM" | Which eigenvalues: LM = largest magnitude, LA = largest algebraic, SA = smallest algebraic |
| tol | float | 1e-8 | Convergence tolerance |
| seed | int \| None | None | Random seed for reproducibility |
| reorthogonalize | bool \| None | None | Override the default reorthogonalization setting |

Sources: hessian_eigenthings/algorithms/lanczos.py:58-100

Eigenvalue Selection Logic

The algorithm selects eigenvalues based on the which parameter:

if which == "LM":
    order = torch.argsort(theta.abs(), descending=True)
elif which == "LA":
    order = torch.argsort(theta, descending=True)
elif which == "SA":
    order = torch.argsort(theta, descending=False)

Sources: hessian_eigenthings/algorithms/lanczos.py:75-83

Power Iteration

Power iteration is a simpler method for finding the dominant eigenvalue. The library implements deflated power iteration to compute multiple eigenpairs sequentially by projecting out previously found directions.

Single Power Iteration

def power_iteration_one(
    operator: CurvatureOperator,
    v0: torch.Tensor,
    max_iter: int,
    tol: float = 1e-6,
    backend: LinAlgBackend | None = None,
) -> tuple[torch.Tensor, torch.Tensor]

Sources: hessian_eigenthings/algorithms/power_iteration.py

Deflated Power Iteration

Deflated power iteration extends the basic method to compute multiple eigenpairs:

def deflated_power_iteration(
    operator: CurvatureOperator,
    num_eigs: int,
    max_iter: int,
    tol: float = 1e-6,
    seed: int | None = None,
    backend: LinAlgBackend | None = None,
) -> EigenResult

The deflation process removes previously found eigenvectors from the subspace before searching for the next eigenpair, preventing convergence to already-computed directions.
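The idea can be sketched in a few lines (illustrative only; the library's internals may differ, and the function names here are hypothetical):

import torch

def deflate(v: torch.Tensor, found: list[torch.Tensor]) -> torch.Tensor:
    """Project v onto the orthogonal complement of already-found eigenvectors."""
    for q in found:
        v = v - torch.dot(v, q) * q
    return v

def deflated_power_iteration_sketch(matvec, n: int, num_eigs: int, max_iter: int = 100):
    eigenvalues, eigenvectors = [], []
    for _ in range(num_eigs):
        v = deflate(torch.randn(n), eigenvectors)
        v = v / v.norm()
        for _ in range(max_iter):
            w = deflate(matvec(v), eigenvectors)  # keep iterates off the found subspace
            v = w / w.norm()
        eigenvalues.append(torch.dot(v, matvec(v)).item())  # Rayleigh quotient
        eigenvectors.append(v)
    return eigenvalues, eigenvectors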

Sources: hessian_eigenthings/algorithms/power_iteration.py

Trace Estimation

Trace estimation provides a way to compute the average eigenvalue (trace / dimension) without full eigendecomposition. This is computationally much cheaper and useful for understanding overall curvature magnitude.

Hutchinson's Estimator

Hutchinson's method estimates the trace using random probe vectors:

def hutchinson(
    operator: CurvatureOperator,
    *,
    num_samples: int = 100,
    distribution: Distribution = "rademacher",
    seed: int | None = None,
    backend: LinAlgBackend[torch.Tensor] | None = None,
) -> TraceResult

The estimator computes: (1/m) Σ vᵢᵀ A vᵢ

Where vᵢ are random vectors from the specified distribution. Rademacher distribution provides lower variance than Gaussian.
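A minimal sketch of the estimator against any matvec callable (illustrative, not the library's code):

import torch

def hutchinson_sketch(matvec, n: int, num_samples: int = 100, seed: int = 0):
    """Estimate tr(A) as the sample mean of v^T A v over Rademacher probes."""
    gen = torch.Generator().manual_seed(seed)
    quads = []
    for _ in range(num_samples):
        v = torch.randint(0, 2, (n,), generator=gen).float() * 2 - 1  # entries in {-1, +1}
        quads.append(torch.dot(v, matvec(v)).item())
    quads = torch.tensor(quads)
    return quads.mean().item(), (quads.std() / num_samples ** 0.5).item()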

Sources: hessian_eigenthings/algorithms/trace.py:48-71

Hutch++ Estimator

Hutch++ is an improved estimator with better convergence properties:

def hutch_plus_plus(
    operator: CurvatureOperator,
    *,
    num_matvecs: int = 30,
    seed: int | None = None,
    backend: LinAlgBackend[torch.Tensor] | None = None,
) -> TraceResult

Hutch++ uses a structured random sampling approach that achieves lower variance than standard Hutchinson with the same number of matrix-vector products.
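The structure behind that variance reduction can be sketched as follows (illustrative only; function names are hypothetical): the matvec budget is split into a low-rank range sketch, an exact trace on that sketch, and a Hutchinson pass on the residual.

import torch

def hutch_pp_sketch(matvec, n: int, num_matvecs: int = 30, seed: int = 0):
    """Hutch++: exact trace of a low-rank range sketch plus Hutchinson
    on the residual, splitting the matvec budget into three equal parts."""
    gen = torch.Generator().manual_seed(seed)
    k = num_matvecs // 3
    rademacher = lambda cols: torch.randint(0, 2, (n, cols), generator=gen).float() * 2 - 1
    S = rademacher(k)
    AS = torch.stack([matvec(S[:, i]) for i in range(k)], dim=1)   # k matvecs
    Q, _ = torch.linalg.qr(AS)                                     # orthonormal basis for the sketched range
    AQ = torch.stack([matvec(Q[:, i]) for i in range(Q.shape[1])], dim=1)  # k matvecs
    trace_top = torch.einsum("ij,ij->", Q, AQ)                     # tr(Q^T A Q), exact on the sketch
    G = rademacher(k)
    G = G - Q @ (Q.T @ G)                                          # project probes off the sketched range
    AG = torch.stack([matvec(G[:, i]) for i in range(k)], dim=1)   # k matvecs
    trace_rest = torch.einsum("ij,ij->", G, AG) / k                # Hutchinson on the residual
    return (trace_top + trace_rest).item()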

Sources: hessian_eigenthings/algorithms/trace.py:22-47

Unified Trace Interface

def trace(
    operator: CurvatureOperator,
    num_matvecs: int = 30,
    *,
    method: Literal["hutchinson", "hutch++"] = "hutch++",
    seed: int | None = None,
    backend: LinAlgBackend[torch.Tensor] | None = None,
) -> TraceResult

Validation: The num_matvecs parameter is validated to be at least 1.

Sources: hessian_eigenthings/algorithms/trace.py:71-84

Trace Result Structure

@dataclass
class TraceResult:
    estimate: float      # The trace estimate
    stderr: float       # Standard error of the estimate
    num_matvecs: int     # Number of matrix-vector products used
    operator_size: int   # Dimension of the operator

Sources: hessian_eigenthings/algorithms/result.py

Spectral Density

Spectral density estimation computes the eigenvalue distribution (density function) across the spectrum, enabling visualization and analysis of the full eigenvalue structure.

Stochastic Lanczos Quadrature

The spectral_density() function implements Stochastic Lanczos Quadrature (SLQ) to compute the spectral density:

def spectral_density(
    operator: CurvatureOperator,
    num_runs: int = 16,
    lanczos_steps: int = 50,
    seed: int | None = None,
    backend: LinAlgBackend[torch.Tensor] | None = None,
) -> SpectralDensityResult

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| operator | CurvatureOperator | required | The curvature matrix operator |
| num_runs | int | 16 | Number of randomized runs for averaging |
| lanczos_steps | int | 50 | Lanczos iterations per run |
| seed | int \| None | None | Random seed |

Sources: hessian_eigenthings/algorithms/spectral_density.py

Spectral Density Result

@dataclass
class SpectralDensityResult:
    grid: torch.Tensor      # Eigenvalue grid points
    density: torch.Tensor   # Density values at each grid point
    eigenvalues: list[torch.Tensor]  # Eigenvalues from each run
    eigenvectors: list[list[torch.Tensor]]  # Corresponding eigenvectors

The spectral density integrates to 1: ∫ density(λ) dλ ≈ 1, which can be verified using numerical integration.
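For example, assuming a SpectralDensityResult named density_result (a usage sketch):

import torch

# Trapezoidal-rule check that the estimated density is normalized
mass = torch.trapezoid(density_result.density, density_result.grid)
print(f"integral of density ≈ {mass.item():.4f}")  # should be close to 1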

Sources: hessian_eigenthings/algorithms/result.py

Common Result Types

All algorithms return standardized result objects that encapsulate the computed quantities along with metadata about the computation.

@dataclass
class EigenResult:
    eigenvalues: torch.Tensor       # (k,) tensor of eigenvalues
    eigenvectors: torch.Tensor      # (k, n) matrix of eigenvectors
    residuals: torch.Tensor          # (k,) convergence residuals
    iterations: int                 # Number of iterations run
    converged: bool                 # Whether all eigenpairs converged

Sources: hessian_eigenthings/algorithms/result.py

Algorithm Selection Guide

graph LR
    A[Goal] --> B{Eigenpairs?}
    B -->|Yes, top-k| C[How many?]
    C -->|1-10| D[Lanczos]
    C -->|Many| E{Orthogonality critical?}
    E -->|Yes| D
    E -->|No| F[Deflated Power Iteration]
    
    B -->|Trace only| G{Accuracy priority?}
    G -->|High| H[Hutch++]
    G -->|Standard| I[Hutchinson]
    
    B -->|Full distribution| J[Spectral Density]
    
    D --> K[EigenResult]
    F --> K
    H --> L[TraceResult]
    I --> L
    J --> M[SpectralDensityResult]

Decision Criteria

| Scenario | Recommended Algorithm | Notes |
| --- | --- | --- |
| Top eigenvalues for single/large batch | lanczos() | Best accuracy, moderate cost |
| Quick dominant eigenvalue | deflated_power_iteration() | Lower memory, less accurate |
| Trace with limited matvecs | hutch_plus_plus() | Better convergence than Hutchinson |
| Trace estimation | hutchinson() | Simpler, more matvecs needed |
| Eigenvalue histogram/distribution | spectral_density() | Visualize full spectrum |

Integration with Curvature Operators

The algorithms are designed to work with any CurvatureOperator implementation, including:

  • HessianOperator: Exact Hessian via autograd or finite differences
  • GGNOperator: Generalized Gauss-Newton matrix
  • EmpiricalFisherOperator: Empirical Fisher information matrix
  • DDPHessianOperator: Distributed Data Parallel Hessian

This abstraction allows the same algorithm code to work across different curvature definitions without modification.

Sources: hessian_eigenthings/algorithms/__init__.py:1-29

Performance Considerations

Memory Efficiency in Lanczos

The Lanczos implementation optimizes memory for large-scale models by:

  1. Avoiding allocation of full (n, m) basis matrix
  2. Using rank-1 outer-product updates for eigenvector accumulation
  3. Computing Ritz vectors directly into final (k, n) layout

Reorthogonalization Tradeoffs

| Setting | Memory | Computation | Accuracy |
| --- | --- | --- | --- |
| reorthogonalize=True | O(mn) | O(m²n) | High orthogonality |
| reorthogonalize=False | O(mn) basis list | O(mn) | May have ghost eigenvalues |

For max_iter <= 50, reorthogonalization is enabled by default. For larger Krylov dimensions, it defaults off to maintain acceptable performance.

Sources: hessian_eigenthings/algorithms/lanczos.py:23-28

Example Usage

from hessian_eigenthings import HessianOperator, lanczos, trace, spectral_density

# Create Hessian operator
operator = HessianOperator(model, dataloader, loss_fn)

# Compute top-5 eigenvalues and eigenvectors
eig_result = lanczos(operator, k=5, max_iter=40, tol=1e-7, seed=0)
print(f"Top eigenvalue: {eig_result.eigenvalues[0]}")

# Estimate trace with Hutch++
trace_result = trace(operator, num_matvecs=99, method="hutch++", seed=0)
print(f"Trace estimate: {trace_result.estimate}")

# Compute spectral density
density_result = spectral_density(operator, num_runs=8, lanczos_steps=40, seed=0)
# Visualize with: plt.plot(density_result.grid, density_result.density)

Sources: examples/supervised_mlp.py:1-50

Sources: hessian_eigenthings/algorithms/__init__.py:1-29

Loss Functions

Related topics: Curvature Operators, Parameter Utilities


Loss functions in this repository serve as the bridge between model outputs and curvature operators (Hessian and Generalized Gauss-Newton matrices). They provide the necessary computations for Hessian-vector products and support multiple backend implementations optimized for different use cases.

Overview

The loss functions module (hessian_eigenthings/loss_fns/) provides two distinct function signatures depending on the target operator:

| Function Type | Signature | Used By |
| --- | --- | --- |
| loss_fn | (model: nn.Module, batch: Any) -> torch.Tensor | HessianOperator |
| loss_of_output_fn | (output: torch.Tensor, batch: Any) -> torch.Tensor | GGNOperator |

Sources: hessian_eigenthings/operators/hessian.py:1-50 Sources: hessian_eigenthings/operators/ggn.py:1-60

Architecture

graph TD
    A[Loss Function Entry Points] --> B[Standard Losses]
    A --> C[HuggingFace Losses]
    A --> D[TransformerLens Losses]
    
    B --> B1[MSE Loss]
    B --> B2[Cross-Entropy Loss with HVP]
    
    C --> C1[Autoregressive LM Loss]
    C --> C2[Shifted CE with Analytical HVP]
    C --> C3[Fused CE HVP Backends]
    
    D --> D1[TransformerLens HookedModel Loss]
    
    C3 --> C3a[Triton Kernel]
    C3 --> C3b[torch.compile]
    C3 --> C3c[Eager Reference]

Standard Loss Functions

The standard.py module provides loss functions for common supervised learning scenarios with closed-form Hessian-vector products.

Sources: hessian_eigenthings/loss_fns/standard.py:1-80

MSE Loss

Returns a wrapper compatible with GGNOperator for mean-squared error loss:

def mse_loss_of_output() -> Callable[[torch.Tensor, tuple[torch.Tensor, torch.Tensor]], torch.Tensor]:
    """Make a `loss_of_output_fn` for `GGNOperator` from a (output, target) criterion."""

Cross-Entropy Loss with Analytical HVP

The cross-entropy implementation includes a closed-form Hessian-vector product for efficient computation:

def _ce_hvp(
    output: torch.Tensor, batch: tuple[torch.Tensor, torch.Tensor], u: torch.Tensor
) -> torch.Tensor:
    """Closed-form H @ u for mean-reduced softmax + cross-entropy.
    
    `output` has shape `(N, C)` (logits). For each row,
    `H_row = (diag(p) - p p^T) / N` where `p = softmax(output)`.
    """

Mathematical Foundation:

For mean-reduced softmax + cross-entropy, the Hessian takes the form:

H_row = (diag(p) - p·p^T) / N

where:

  • p = softmax(output) is the predicted probability distribution
  • N is the number of samples
  • u is the vector being multiplied in the Hessian-vector product H @ u (see the sketch below)
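A minimal eager-mode sketch of this closed form (illustrative; the shapes and the function name are assumptions, and the library's reference implementation may differ in details):

import torch

def ce_hvp_sketch(logits: torch.Tensor, u: torch.Tensor) -> torch.Tensor:
    """H @ u for mean-reduced softmax cross-entropy; logits and u are (N, C)."""
    n = logits.shape[0]
    p = torch.softmax(logits, dim=-1)
    # Row-wise: diag(p) @ u = p * u, and (p p^T) @ u = p * <p, u>
    pu = (p * u).sum(dim=-1, keepdim=True)
    return (p * u - p * pu) / n

At toy scale the sketch can be checked against an autograd double backward through torch.nn.functional.cross_entropy.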

Sources: hessian_eigenthings/loss_fns/standard.py:40-55

HuggingFace Transformers Integration

The huggingface.py module provides loss functions specifically designed for HuggingFace Transformers models. These handle the internal loss computation that occurs when labels are present in the batch.

Sources: hessian_eigenthings/loss_fns/huggingface.py:60-90

Autoregressive Language Model Loss

def hf_lm_loss() -> Callable[[nn.Module, dict[str, Any]], torch.Tensor]:
    """For autoregressive LMs: `loss_fn(model, batch)` calls `model(**batch).loss`."""

The batch must include labels so HuggingFace computes the loss internally. For causal language models, this is typically labels=input_ids with the standard internal shift.

Shifted Cross-Entropy with Analytical HVP

For large-scale language model analysis, a shifted cross-entropy variant provides both the loss function and its analytical Hessian-vector product:

def hf_lm_shifted_ce(fused: FusedCEHvpBackend = "auto") -> _LossOfOutputWithHvp:
    """Shifted CE loss with analytical H @ u for autoregressive LMs."""

Shift Mechanism (sketched below):

  • The loss shifts logits left (discards last position) and labels right (discards first position)
  • Matches how cross_entropy(ignore_index=-100) handles gradient computation
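A minimal sketch of the shift (illustrative; the actual HuggingFace helper differs in details):

import torch.nn.functional as F

def shifted_ce_sketch(logits, labels):
    """Next-token cross-entropy: position t predicts token t+1 (illustrative)."""
    shift_logits = logits[:, :-1, :]   # drop the final position's logits
    shift_labels = labels[:, 1:]       # drop the first token
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,             # masked positions contribute no gradient
    )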

Sources: hessian_eigenthings/loss_fns/huggingface.py:1-50

Fused Cross-Entropy HVP Backends

The _fused_ce_hvp.py module implements optimized backends for computing the cross-entropy Hessian-vector product.

Sources: hessian_eigenthings/loss_fns/_fused_ce_hvp.py:1-50

Backend Selection

| Backend | Description | Performance | Availability |
| --- | --- | --- | --- |
| "auto" | Auto-select fastest available | Optimal | Default |
| "triton" | Hand-written CUDA Triton kernel | ~3.4x speedup, 2x memory reduction | CUDA + Triton |
| "compile" | torch.compile-fused | ~2.6x speedup, 2x memory reduction | torch >= 2.0 |
| "eager" | Plain PyTorch reference | Baseline | Always |

FusedCEHvpBackend = Literal["auto", "eager", "compile", "triton"]

Sources: hessian_eigenthings/loss_fns/huggingface.py:25-35

Backend Resolution Logic

graph LR
    A[Backend: "auto"] --> B{CUDA + Triton available?}
    B -->|Yes| C[Triton Kernel]
    B -->|No| D{torch.compile available?}
    D -->|Yes| E[torch.compile Backend]
    D -->|No| F[Eager Backend]

The resolution checks:

  1. If backend != "auto", use the specified backend
  2. If "auto" and CUDA + Triton available → Triton
  3. If "auto" and torch.compile available → compile
  4. Otherwise → eager

Important: The Triton kernel asserts logits.is_cuda, so on CUDA-equipped hosts running CPU inputs, the system falls back to compile.

Sources: hessian_eigenthings/loss_fns/huggingface.py:40-60

Memory Optimization

At LM scale (e.g., B=64, T=256, V=50304, fp32):

| Implementation | Memory Footprint | Intermediate Tensors |
| --- | --- | --- |
| Eager | ~19.6 GB | ~6 (N, V) tensors |
| Compile | ~3.3 GB | ~1 (N, V) tensor |
| Target | ~3.3 GB | Output buffer only |

The fused implementations eliminate intermediates by computing:

out_flat = (p * u - p * <p, u>) * mask / n_valid

with shape (N, V) in a single kernel pass.

Sources: scripts/bench_fused_ce_hvp.py:1-60

Loss Function Wrapper

The _LossOfOutputWithHvp class wraps a loss function with its analytical Hessian-vector product:

class _LossOfOutputWithHvp:
    """Loss-of-output callable that also carries an analytical `.hvp` method.
    
    Wraps a plain `(output, batch) -> loss` function and a `(output, batch, u)
    -> H_loss @ u` function in a single callable. `GGNOperator` checks for the
    presence of `.hvp` and uses it as the loss-Hessian-vector product, skipping
    the autograd `create_graph=True` double-backward path entirely.
    """

Sources: hessian_eigenthings/loss_fns/huggingface.py:100-120

GGN Operator Integration

The GGNOperator automatically detects and uses analytical HVPs when available:

GGNOperator picks this up automatically and skips the autograd double-backward.

Two implementations of the matvec are available via loss_hvp=:

  • "analytical" (default): finite-difference JVP + analytical loss-Hessian-vector product (read from loss_of_output_fn.hvp, which must be present) + a single normal backward to apply J^T. Memory footprint matches one normal training step. Required for LM-scale use.

  • "autograd": the original torch.func.jvp + autograd double-backward + torch.func.vjp path. Numerically exact and supports any loss, but memory scales badly with output size.

Sources: hessian_eigenthings/operators/ggn.py:10-30

TransformerLens Integration

For TransformerLens HookedModel architectures, a dedicated loss function handles the hook-based forward pass:

def tlens_loss() -> Callable[[nn.Module, Any], torch.Tensor]:
    """Loss function for TransformerLens HookedModel."""

Sources: hessian_eigenthings/loss_fns/transformer_lens.py:1-40

Workflow: Choosing a Loss Function

graph TD
    A[Start] --> B{Model Type?}
    B -->|Standard MLP/CNN| C[Use HessianOperator]
    B -->|HuggingFace Transformers| D[Use GGNOperator]
    B -->|TransformerLens| E[Use HessianOperator]
    
    C --> F[standard.mse_loss_of_output]
    C --> G[standard.cross_entropy_loss_of_output]
    
    D --> H{Scale?}
    H -->|Small model| I[hf_lm_loss with GGNOperator]
    H -->|Large model| J[hf_lm_shifted_ce with GGNOperator]
    
    E --> K[tlens_loss with HessianOperator]
    
    J --> L[Choose HVP Backend]
    L --> M{Device?}
    M -->|CUDA + Triton| N[Use Triton backend]
    M -->|CPU/MPS| O[Use compile or eager]

Complete API Reference

Standard Module

| Function | Returns | HVP Available |
| --- | --- | --- |
| mse_loss_of_output() | loss_of_output_fn | No |
| cross_entropy_loss_of_output() | loss_of_output_fn | Yes |
| _ce_hvp() | Analytical HVP | - |

HuggingFace Module

| Function | Returns | HVP Available |
| --- | --- | --- |
| hf_lm_loss() | loss_fn | Via GGNOperator |
| hf_lm_shifted_ce(fused) | _LossOfOutputWithHvp | Yes |
| _LossOfOutputWithHvp | Wrapper class | Via .hvp attribute |

Fused CE HVP Module

| Function | Description |
| --- | --- |
| _ce_hvp_reference() | Eager reference implementation |
| _get_compiled_impl() | Returns torch.compile wrapped version |
| compiled_ce_hvp() | Compiled backend entry point |
| triton_ce_hvp() | Triton kernel entry point |

Usage Examples

Standard Classification

import torch
from hessian_eigenthings.operators import HessianOperator

operator = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=lambda m, b: torch.nn.functional.cross_entropy(m(b[0]), b[1]),
)

HuggingFace Large Model (Memory-Optimized)

from hessian_eigenthings.operators import GGNOperator
from hessian_eigenthings.loss_fns.huggingface import hf_lm_shifted_ce

loss_fn = hf_lm_shifted_ce(fused="auto")  # Auto-selects best backend

operator = GGNOperator(
    model=model,
    dataloader=dataloader,
    forward_fn=lambda m, b: m(**b).logits,
    loss_of_output_fn=loss_fn,
    loss_hvp="analytical",  # Default, uses .hvp attribute
)

Sources: hessian_eigenthings/loss_fns/__init__.py

Sources: [hessian_eigenthings/operators/hessian.py:1-50](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/hessian.py)

Parameter Utilities

Related topics: Curvature Operators


The parameter utilities module (param_utils.py) provides essential infrastructure for managing, filtering, and manipulating PyTorch model parameters within the Hessian eigendecomposition pipeline. These utilities form the foundation that connects curvature operators to the underlying model parameters, enabling efficient computation of Hessian-vector products and eigendecomposition across arbitrary subsets of model parameters.

Overview

When working with large neural networks, it is often necessary to compute curvature information for only a subset of parameters. The parameter utilities support this use case through a flexible filtering mechanism combined with utilities for parameter vectorization, reshaping, and batch management.

The core responsibilities of the parameter utilities include:

  1. Parameter Extraction - Gathering named parameters from PyTorch modules
  2. Parameter Filtering - Selecting subsets of parameters based on name patterns or custom predicates
  3. Vectorization - Flattening parameters into vectors and reshaping vectors back to parameter shapes
  4. Size Tracking - Maintaining offset mappings for efficient vector-to-parameter conversions

Sources: hessian_eigenthings/operators/hessian.py

Core Types and Interfaces

ParamFilter Type

The ParamFilter type alias defines the contract for parameter selection functions:

ParamFilter = Callable[[str, nn.Parameter], bool]

A ParamFilter is a callable that takes two arguments:

  • name: str - The fully-qualified parameter name within the model
  • param: nn.Parameter - The parameter tensor itself

The function returns True if the parameter should be included in the operation, False otherwise.

Sources: hessian_eigenthings/operators/hessian.py:1-50

Parameter Collection Utilities

The module provides functions for extracting and organizing model parameters:

| Function | Purpose |
| --- | --- |
| get_param_names(model) | Returns list of parameter names as fully-qualified strings |
| get_param_list(model) | Returns list of parameter tensors |
| get_param_sizes(model) | Returns list of parameter tensor sizes |
| get_filtered_params(model, param_filter) | Returns filtered parameter names and tensors |

These functions work together to build the data structures required by curvature operators.

Sources: hessian_eigenthings/operators/ggn.py

Parameter Vectorization

Flattening Parameters to Vectors

The utilities support bidirectional conversion between parameter dictionaries and flat vectors. This is essential for Lanczos-based eigendecomposition algorithms that operate on vector spaces.

def flatten_params(param_dict: dict[str, Tensor]) -> Tensor:
    """Flatten all parameters into a single 1D tensor."""

The flattening process concatenates all parameter tensors in a deterministic order, preserving the mapping between parameter names and vector offsets.
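A minimal sketch of the flattening direction (illustrative, not the library's code):

import torch

def flatten_params_sketch(param_dict: dict[str, torch.Tensor]) -> torch.Tensor:
    """Concatenate parameters in insertion order into one 1D tensor."""
    return torch.cat([p.reshape(-1) for p in param_dict.values()])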

Reshaping Vectors Back to Parameters

def unflatten_params(
    vec: Tensor,
    param_names: list[str],
    param_list: list[Tensor],
    sizes: list[torch.Size]
) -> dict[str, Tensor]:
    """Reshape a flat vector back to parameter dictionary."""

The unflattening operation uses offset tracking to slice the vector and reshape each slice to match the original parameter shape:

offset = 0
for name, param, size in zip(param_names, param_list, sizes, strict=True):
    n = size.numel()  # torch.Size.numel() gives the element count
    out[name] = vec[offset : offset + n].reshape_as(param)
    offset += n

Sources: hessian_eigenthings/operators/ggn.py

Parameter Filtering Patterns

Name-Based Filtering with `match_names`

The most common filtering pattern uses glob-style matching against parameter names. The match_names function creates a ParamFilter from a list of name patterns:

def match_names(*patterns: str) -> ParamFilter:
    """Create a filter matching parameter names against glob patterns."""

Example usage:

from hessian_eigenthings.operators import HessianOperator
from hessian_eigenthings.param_utils import match_names

# Filter attention parameters only
attn_filter = match_names("blocks.*.attn.*")
attn_op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=attn_filter
)

# Filter MLP parameters only
mlp_filter = match_names("blocks.*.mlp.*")
mlp_op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=mlp_filter
)

Sources: examples/transformer_lens_attention_only.py

Multiple Pattern Matching

The match_names function supports multiple patterns, useful for targeting disjoint parameter groups:

# Match multiple parameter groups
filter_fn = match_names(
    "transformer.h.*.attn.*",
    "transformer.h.*.mlp.*"
)

HuggingFace-Specific Patterns

When working with HuggingFace transformers, parameter names follow a predictable structure:

# Attention parameters in GPT-2
attn_op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=hf_lm_loss(),
    param_filter=match_names("transformer.h.*.attn.*"),
)

Sources: examples/huggingface_tiny_gpt2.py

Integration with Curvature Operators

Operator Size and Parameter Tracking

Curvature operators maintain internal state about the parameters they operate on:

| Attribute | Type | Description |
| --- | --- | --- |
| _param_names | list[str] | Names of parameters in the filtered set |
| _param_list | list[Tensor] | Parameter tensors |
| _sizes | list[torch.Size] | Original tensor shapes for reshaping |
| size | int | Total number of parameters (sum of all parameter elements) |

The size property is computed as:

self.size = sum(p.numel() for p in self._param_list)

This total parameter count determines the dimensionality of the vector space in which eigendecomposition occurs.

Sources: hessian_eigenthings/operators/hessian.py

Data Flow Diagram

graph TD
    A[PyTorch Model] --> B[get_param_names]
    A --> C[get_param_list]
    A --> D[get_param_sizes]
    B --> E[ParamFilter Application]
    C --> E
    D --> E
    E --> F[Filtered Parameter Collections]
    F --> G[Curvature Operator]
    G --> H[matvec Operations]
    H --> I[Eigendecomposition Results]
    
    J[Input Vector] --> K[unflatten_params]
    F --> K
    K --> L[Parameter Dict]
    L --> H

Practical Examples

Computing Attention-Only Hessian Eigendecomposition

from hessian_eigenthings.algorithms import lanczos
from hessian_eigenthings.operators import HessianOperator
from hessian_eigenthings.param_utils import match_names

# Create attention-only operator
attn_op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=match_names("blocks.*.attn.*")
)

print(f"Attention-only Hessian size: {attn_op.size} parameters")

# Compute top-3 eigenvalues
eig_attn = lanczos(attn_op, k=3, max_iter=20, tol=1e-3, seed=0)
for i, val in enumerate(eig_attn.eigenvalues):
    print(f"  λ_{i + 1} = {val.item(): .4e}")

Comparing Block-Specific Curvature

# Full model Hessian
full_op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn
)

# Attention block only
attn_op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=match_names("blocks.*.attn.*")
)

# MLP block only
mlp_op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=match_names("blocks.*.mlp.*")
)

# Compare eigenvalue spectra
full_eig = lanczos(full_op, k=10)
attn_eig = lanczos(attn_op, k=10)
mlp_eig = lanczos(mlp_op, k=10)

Sources: examples/transformer_lens_attention_only.py

Advanced Filtering

Custom Filter Functions

For complex filtering logic beyond glob matching, implement a custom ParamFilter:

def custom_filter(name: str, param: nn.Parameter) -> bool:
    # Include only parameters with > 1000 elements
    if param.numel() < 1000:
        return False
    # Exclude certain modules
    if "embedding" in name:
        return False
    # Include based on naming patterns
    return "layer" in name or "head" in name

operator = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=custom_filter
)

Filter Composition

Filters can be combined using standard Python patterns:

# Intersection of patterns
combined_filter = lambda name, param: (
    match_names("blocks.*.*.*")(name, param) and
    param.dtype == torch.float32
)

# Negation (match_names takes glob patterns, not regexes)
exclude_ln = lambda name, param: not (
    match_names("*layernorm*", "*ln*")(name, param)
)

Performance Considerations

Parameter Access Patterns

The parameter utilities maintain strict ordering between name lists and tensor lists to enable efficient offset-based indexing. When iterating over parameters in performance-critical paths:

  1. Use the pre-computed _sizes list to avoid repeated param.shape calls
  2. Leverage the strict=True zip when all lists are guaranteed to be aligned
  3. Prefer in-place reshaping over copies when possible

Memory Implications

| Operation | Memory Pattern |
| --- | --- |
| flatten_params | Allocates new tensor of size sum(numel) |
| unflatten_params | Creates dict of views/reshapes over the original vector |
| matvec | No parameter data copies; uses VJP/JVP chains |

The vectorization maintains a view relationship with original parameters where possible, minimizing memory overhead during iterative algorithms.

API Reference

`match_names`

def match_names(*patterns: str) -> ParamFilter:
    """Create a ParamFilter matching parameter names against glob patterns."""

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| patterns | str | Glob patterns to match against parameter names |

Returns: A callable ParamFilter that returns True for parameters matching any of the provided patterns.

Supported Glob Patterns:

  • * - Matches any sequence of characters within a path component
  • ** - Matches any sequence of path components (if supported)
  • ? - Matches a single character
  • [abc] - Matches any character in the set

Parameter Extraction Functions

def get_param_names(model: nn.Module) -> list[str]:
    """Extract fully-qualified parameter names from a model."""

def get_param_list(model: nn.Module) -> list[Tensor]:
    """Extract parameter tensors from a model."""

def get_filtered_params(
    model: nn.Module,
    param_filter: ParamFilter | None
) -> tuple[list[str], list[Tensor]]:
    """Extract filtered parameter names and tensors."""

Sources: hessian_eigenthings/param_utils.py

Sources: [hessian_eigenthings/operators/hessian.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/hessian.py)

Distributed Computing with DDP

Related topics: Curvature Operators


The hessian-eigenthings library provides native support for distributed training scenarios through DDPHessianOperator, a specialized curvature operator that extends the base HessianOperator to work correctly with PyTorch's DistributedDataParallel (DDP) wrapper.

Overview

In distributed training environments, the Hessian eigenvalue computations must account for how DDP synchronizes gradients across multiple processes. The DDPHessianOperator handles this synchronization transparently, ensuring that the Hessian-vector products (HVPs) computed across different ranks are properly averaged.

Key characteristics:

  • Subclass of HessianOperator with distributed awareness
  • Automatically averages HVPs across all data-parallel ranks
  • Compatible with standard DDP-wrapped models
  • Supports the same API as the base HessianOperator
  • Handles the autograd graph complexity introduced by DDP's all-reduce operations

Architecture

Class Hierarchy

CurvatureOperator (base interface)
    └── HessianOperator (base implementation)
            └── DDPHessianOperator (DDP-aware extension)

Data Flow Diagram

graph TD
    A[Model wrapped with DDP] --> B[DDPHessianOperator]
    B --> C[Per-rank HVP Computation]
    C --> D[Autograd-aware All-Reduce]
    D --> E[Synchronized HVP across Ranks]
    
    F[torch.autograd.grad calls] --> G[Regular .backward hooks]
    G --> H[No explicit all-reduce]
    
    I[Expected HVP] --> J[Actual HVP without DDPHessianOperator]
    
    K[Expected HVP] --> L[Actual HVP with DDPHessianOperator]
    
    style D fill:#90EE90
    style H fill:#FFB6C1
    style L fill:#90EE90

DDP Behavior Explanation

The core challenge addressed by DDPHessianOperator stems from how PyTorch's DistributedDataParallel handles gradient synchronization:

  1. DDP's all-reduce mechanism: DDP normally fires its all-reduce operation inside the autograd graph during loss.backward(), synchronizing gradients across all ranks
  2. Standard HessianOperator limitation: When using torch.autograd.grad directly (as the base HessianOperator does), the DDP hooks fire on .grad accumulation rather than on autograd.grad's return value
  3. Resulting discrepancy: Without explicit handling, the computed HVP does not match the single-process HVP computed on the union of all per-rank batches

The DDPHessianOperator resolves this by adding an explicit autograd-aware all-reduce after each gradient computation call, ensuring the resulting HVP equals the single-process HVP.
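Conceptually, the synchronization amounts to averaging each per-rank HVP (a sketch of the arithmetic only; the library additionally makes the collective autograd-aware, which a plain dist.all_reduce is not):

import torch
import torch.distributed as dist

def average_hvp_across_ranks(hvp: torch.Tensor) -> torch.Tensor:
    """Sum the per-rank HVPs and divide by world size so every rank
    holds the same synchronized result."""
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(hvp, op=dist.ReduceOp.SUM)
        hvp = hvp / dist.get_world_size()
    return hvp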

API Reference

DDPHessianOperator

class DDPHessianOperator(HessianOperator):
    """HessianOperator that all-reduces the HVP across torch.distributed ranks."""

Constructor Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| model | nn.Module | Yes | - | Model (may be DDP-wrapped; params are read directly) |
| dataloader | Iterable[Any] | Yes | - | Data loader providing batches to average over |
| loss_fn | LossFn | Yes | - | Loss function loss_fn(model, batch) -> loss |
| param_filter | ParamFilter \| None | No | None | Optional filter for subset of parameters |
| full_dataset | bool | No | True | Whether to compute Hessian over full dataset |
| num_batches | int \| None | No | None | Number of batches to sample if not full dataset |
| microbatch_size | int \| None | No | None | Chunk batch into micro-batches for memory |
| microbatch_unsafe | bool | No | False | Skip gradient accumulation safety checks |
| method | HvpMethod | No | "autograd" | HVP computation method |
| fd_eps | float \| None | No | None | Finite difference epsilon |
| backend | LinAlgBackend[torch.Tensor] \| None | No | None | Linear algebra backend |

Inherits all parameters from HessianOperator base class.

Inherited Methods

| Method | Description |
| --- | --- |
| matvec(v) | Compute H·v where H is the Hessian averaged over batches |
| size | Total number of parameters in the filtered parameter set |
| dtype | Data type of parameters |
| device | Device of parameters |

Import Location

from hessian_eigenthings.operators import DDPHessianOperator

Or via the distributed submodule:

from hessian_eigenthings.operators.distributed import DDPHessianOperator

Sources: hessian_eigenthings/operators/distributed/__init__.py:1-3

Usage Patterns

Basic Usage with DDP-wrapped Model

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from hessian_eigenthings.operators import DDPHessianOperator
from hessian_eigenthings.algorithms import lanczos

# Assume model, dataloader, and loss_fn are already set up
ddp_model = DDP(model)

# Create the distributed Hessian operator
hessian_op = DDPHessianOperator(
    model=ddp_model,
    dataloader=dataloader,
    loss_fn=loss_fn,
)

# Compute eigenvalues using the Lanczos algorithm (returns an EigenResult)
result = lanczos(hessian_op, k=10, max_iter=50)
eigenvalues, eigenvectors = result.eigenvalues, result.eigenvectors

With Parameter Filtering

from hessian_eigenthings.param_utils import match_names

# Focus on specific layer parameters
hessian_op = DDPHessianOperator(
    model=ddp_model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=match_names("layer.4.*"),
)

Using with Different HVP Methods

# Using finite difference method (more memory-efficient)
hessian_op_fd = DDPHessianOperator(
    model=ddp_model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    method="finite_difference",
    fd_eps=1e-5,
)

Key Design Decisions

Autograd-aware All-Reduce

The DDPHessianOperator adds an explicit all-reduce operation that integrates with PyTorch's autograd engine. This ensures:

  • The all-reduce operation is included in the autograd graph when needed
  • Gradient flows correctly through the distributed computation
  • The final HVP is properly synchronized across all ranks

Parameter Access

The operator reads parameters directly from the model, whether or not it is wrapped with DDP:

# From the source:
# "The model passed in may already be wrapped with
#  torch.nn.parallel.DistributedDataParallel; we read params from it directly."

This design allows seamless usage with existing DDP-wrapped models without modification.

Batch Distribution

Each rank should receive its own shard of the dataset:

"Each rank should be receiving its own shard of the dataset (typical pattern: a torch.utils.data.distributed.DistributedSampler)."

Sources: hessian_eigenthings/operators/distributed/ddp.py:21-25

Comparison with Single-Process HessianOperator

| Aspect | HessianOperator | DDPHessianOperator |
| --- | --- | --- |
| Use case | Single GPU / CPU | Multi-GPU distributed |
| Gradient sync | Manual handling required | Automatic via all-reduce |
| DDP compatibility | May produce incorrect HVPs | Correct by design |
| API | Identical | Identical |
| Performance overhead | None | Single all-reduce per HVP |

Relationship to Other Operators

The hessian_eigenthings package provides multiple curvature operators:

| Operator | Description | Distributed Support |
| --- | --- | --- |
| HessianOperator | Full Hessian computation | Not DDP-aware |
| DDPHessianOperator | Full Hessian with DDP sync | DDP-aware |
| GGNOperator | Generalized Gauss-Newton | Not DDP-aware (as of v1.0) |
| EmpiricalFisherOperator | Empirical Fisher matrix | Not DDP-aware |

Sources: hessian_eigenthings/__init__.py:15-24

Limitations and Considerations

  1. Current scope: Only HessianOperator has a DDP-aware counterpart; other operators like GGNOperator and EmpiricalFisherOperator do not yet have distributed variants
  2. Gradient hooks: The operator does not currently support all DDP gradient hook mechanisms
  3. Multi-node training: While the operator uses standard torch.distributed primitives, performance at very large scale (>8 nodes) has not been extensively benchmarked
  4. Mixed precision: When using fp16/bf16 training, ensure consistent dtype across all ranks

Error Handling

The operator relies on standard PyTorch distributed error handling:

  • If torch.distributed is not initialized, standard errors will be raised
  • Mismatched tensor shapes across ranks will result in collective operation errors
  • Device mismatch (e.g., some ranks on CUDA, some on CPU) is not supported

Testing and Validation

The DDP functionality should be tested in a true distributed environment. Basic validation includes:

  1. Consistency check: HVP computed via DDPHessianOperator should equal the single-process HVP when aggregating all batch shards
  2. Numerical accuracy: Eigenvalues computed with DDP should match single-GPU results within floating-point tolerance
  3. Scaling: Computation time should scale sub-linearly with number of GPUs for large models

Sources: hessian_eigenthings/operators/distributed/__init__.py:1-3


Doramagic Pitfall Log

Doramagic extracted 15 source-linked risk signals. Review them before installing or handing real data to the project.

1. Project risk: Project risk needs validation

  • Severity: medium
  • Finding: Project risk is backed by a source signal: Project risk needs validation. Treat it as a review item until the current version is checked.
  • User impact: The project should not be treated as fully validated until this signal is reviewed.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: identity.distribution | hn_item:48132232 | https://news.ycombinator.com/item?id=48132232 | repo=pytorch-hessian-eigenthings; install=hessian-eigenthings

2. Installation risk: Python Error: the following arguments are required: experimentname

  • Severity: medium
  • Finding: Installation risk is backed by a source signal: Python Error: the following arguments are required: experimentname. Treat it as a review item until the current version is checked.
  • User impact: First-time setup may fail or require extra isolation and rollback planning.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/noahgolmant/pytorch-hessian-eigenthings/issues/39

3. Installation risk: v1.0.0a2 — packaging fix

  • Severity: medium
  • Finding: Installation risk is backed by a source signal: v1.0.0a2 — packaging fix. Treat it as a review item until the current version is checked.
  • User impact: First-time setup may fail or require extra isolation and rollback planning.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/noahgolmant/pytorch-hessian-eigenthings/releases/tag/v1.0.0a2

4. Installation risk: v1.0.0a3 — fix lanczos OOM

  • Severity: medium
  • Finding: Installation risk is backed by a source signal: v1.0.0a3 — fix lanczos OOM. Treat it as a review item until the current version is checked.
  • User impact: First-time setup may fail or require extra isolation and rollback planning.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/noahgolmant/pytorch-hessian-eigenthings/releases/tag/v1.0.0a3

5. Installation risk: v1.0.0a4 — backend handles CPU-generator + CUDA-tensor combo

  • Severity: medium
  • Finding: Installation risk is backed by a source signal: v1.0.0a4 — backend handles CPU-generator + CUDA-tensor combo. Treat it as a review item until the current version is checked.
  • User impact: First-time setup may fail or require extra isolation and rollback planning.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/noahgolmant/pytorch-hessian-eigenthings/releases/tag/v1.0.0a4

6. Installation risk: v1.0.0a5 — comprehensive LLM-scale memory fixes + regression tests

  • Severity: medium
  • Finding: Installation risk is backed by a source signal: v1.0.0a5 — comprehensive LLM-scale memory fixes + regression tests. Treat it as a review item until the current version is checked.
  • User impact: First-time setup may fail or require extra isolation and rollback planning.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/noahgolmant/pytorch-hessian-eigenthings/releases/tag/v1.0.0a5

7. Configuration risk: RuntimeError: One of the differentiated Tensors appears to not have been used in the graph.

  • Severity: medium
  • Finding: Configuration risk is backed by a source signal: RuntimeError: One of the differentiated Tensors appears to not have been used in the graph. Treat it as a review item until the current version is checked.
  • User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/noahgolmant/pytorch-hessian-eigenthings/issues/30

8. Configuration risk: ValueError: PENet on the Kitti benchmark suite

  • Severity: medium
  • Finding: Configuration risk is backed by a source signal: ValueError: PENet on the Kitti benchmark suite. Treat it as a review item until the current version is checked.
  • User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/noahgolmant/pytorch-hessian-eigenthings/issues/41

9. Capability assumption: README/documentation is current enough for a first validation pass.

  • Severity: medium
  • Finding: README/documentation is current enough for a first validation pass.
  • User impact: The project should not be treated as fully validated until this signal is reviewed.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: capability.assumptions | hn_item:48132232 | https://news.ycombinator.com/item?id=48132232 | README/documentation is current enough for a first validation pass.

10. Project risk: AttributeError: 'HVPOperator' object has no attribute 'zero_grad'

  • Severity: medium
  • Finding: Project risk is backed by a source signal: AttributeError: 'HVPOperator' object has no attribute 'zero_grad'. Treat it as a review item until the current version is checked.
  • User impact: The project should not be treated as fully validated until this signal is reviewed.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/noahgolmant/pytorch-hessian-eigenthings/issues/38

11. Maintenance risk: Maintainer activity is unknown

  • Severity: medium
  • Finding: Maintenance risk is backed by a source signal: Maintainer activity is unknown. Treat it as a review item until the current version is checked.
  • User impact: Users cannot judge support quality until recent activity, releases, and issue response are checked.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: evidence.maintainer_signals | hn_item:48132232 | https://news.ycombinator.com/item?id=48132232 | last_activity_observed missing

12. Security or permission risk: no_demo

  • Severity: medium
  • Finding: no_demo
  • User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: downstream_validation.risk_items | hn_item:48132232 | https://news.ycombinator.com/item?id=48132232 | no_demo; severity=medium

Source: Doramagic discovery, validation, and Project Pack records


Community Discussion Evidence

Doramagic exposes nine project-level community discussion links separately from official documentation. These links are review inputs, not standalone proof that the project is production-ready; review them before using pytorch-hessian-eigenthings with real data or production workflows.

Source: Project Pack community evidence and pitfall evidence