Doramagic Project Pack · Human Manual

pytorch-hessian-eigenthings

hessian-eigenthings is a PyTorch library that provides efficient and scalable computation of eigendecomposition for the Hessian matrix and related curvature operators in neural networks.

Introduction to hessian-eigenthings

Related topics: Curvature Matrices Explained, System Architecture


Overview

hessian-eigenthings is a PyTorch library that provides efficient and scalable computation of eigendecomposition for the Hessian matrix and related curvature operators in neural networks. The library enables practitioners to compute top eigenvalues and eigenvectors via Lanczos or stochastic power iteration, trace estimates via Hutch++, and spectral density via Stochastic Lanczos Quadrature.

Sources: README.md:1

The project targets researchers and engineers studying generalization properties of neural networks, where Hessian eigenvalues and eigenvectors have been implicated in understanding flat minima and model robustness.

Core Concepts

What is a Hessian?

The Hessian matrix is the matrix of second-order partial derivatives of a loss function with respect to model parameters. For a neural network with parameters θ and loss L, the Hessian H is defined as:

H[i,j] = ∂²L / ∂θ[i] ∂θ[j]

For modern large-scale models, the Hessian is prohibitively expensive to compute explicitly—it has O(n²) entries where n is the number of parameters (e.g., billions for large language models).
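For intuition at toy scale, the full Hessian of a small quadratic can still be materialized directly with PyTorch's built-in torch.autograd.functional.hessian; this snippet is illustrative and independent of the library:

```python
import torch

# For a toy quadratic loss L(θ) = ½ θᵀ A θ, the Hessian is exactly A.
A = torch.tensor([[2.0, 1.0], [1.0, 3.0]])

def loss(theta):
    return 0.5 * theta @ A @ theta

theta = torch.zeros(2)
H = torch.autograd.functional.hessian(loss, theta)
print(H)  # tensor([[2., 1.], [1., 3.]]) — matches A
```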

Sources: hessian_eigenthings/operators/hessian.py:1-50

Curvature Operators

Instead of computing the full Hessian matrix, this library works with curvature operators that implement matrix-vector products (matvecs). Given a vector v, these operators efficiently compute:

H @ v → operator.matvec(v)

This approach reduces memory from O(n²) to O(n), making analysis feasible for models with billions of parameters.
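To see why a matvec-only interface suffices, here is a minimal, self-contained power-iteration sketch; it is illustrative only, not this library's implementation:

```python
import torch

def power_iteration(matvec, n, num_iters=100, tol=1e-6):
    """Estimate the top eigenpair of a symmetric operator given only its
    matrix-vector product. `matvec` maps an (n,) tensor to an (n,) tensor."""
    v = torch.randn(n)
    v /= v.norm()
    eigval = 0.0
    for _ in range(num_iters):
        w = matvec(v)
        new_eigval = torch.dot(v, w).item()  # Rayleigh quotient estimate
        v = w / w.norm()
        if abs(new_eigval - eigval) < tol:
            break
        eigval = new_eigval
    return eigval, v

# Example with an explicit 3x3 matrix standing in for a curvature operator.
A = torch.diag(torch.tensor([3.0, 1.0, 0.5]))
lam, vec = power_iteration(lambda v: A @ v, n=3)
print(lam)  # ≈ 3.0
```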

Supported Curvature Matrices

| Operator | Description | Use Case |
|---|---|---|
| HessianOperator | Full Hessian of the loss | General curvature analysis |
| GGNOperator | Generalized Gauss-Newton approximation | More stable than raw Hessian; equals Fisher for cross-entropy + softmax |
| Custom Operators | User-defined curvature operators | Extend to other matrices |

Sources: hessian_eigenthings/operators/ggn.py:1-30

Architecture

The library follows a clean separation of concerns with four main layers:

graph TD
    A[User Code] --> B[Algorithms]
    A --> C[Loss Functions]
    B --> D[Curvature Operators]
    C --> D
    D --> E[LinAlgBackend]
    
    B --> B1[Lanczos]
    B --> B2[Stochastic Power Iteration]
    B --> B3[Trace Estimation]
    
    D --> D1[HessianOperator]
    D --> D2[GGNOperator]
    D --> D3[Custom Operators]
    
    E --> E1[SingleDeviceBackend]
    E --> E2[Distributed Backends]

Component Layers

  1. Algorithms Layer (hessian_eigenthings/algorithms/): Eigenvalue/eigenvector computation methods that operate on any CurvatureOperator.
  2. Operators Layer (hessian_eigenthings/operators/): Implementations of various curvature matrices that provide the matvec() interface.
  3. Loss Functions Layer (hessian_eigenthings/loss_fns/): Pre-built loss functions with analytical Hessian-vector products for common use cases.
  4. Backend Layer (hessian_eigenthings/backends/): Abstraction for linear algebra operations supporting single-device and distributed execution.

Sources: CONTRIBUTING.md:1-30

Algorithms

Lanczos Eigendecomposition

The Lanczos algorithm computes the top k eigenvalues and eigenvectors of a symmetric matrix using only matrix-vector products. It does so by iteratively constructing a small tridiagonal matrix whose eigenvalues converge to the extreme eigenvalues of the original operator.

graph LR
    A[Start Vector v₀] --> B[Iterate i = 1 to k]
    B --> C[Compute αᵢ = vᵢᵀAvᵢ]
    C --> D[Compute βᵢvᵢ₊₁ = Avᵢ - αᵢvᵢ - βᵢ₋₁vᵢ₋₁]
    D --> E[Build Tridiagonal T]
    E --> F{Eigenvalues of T ≈ eigenvalues of A?}
    F -->|Yes| G[Eigenpairs Converged]

Key parameters for the Lanczos algorithm:

| Parameter | Type | Default | Description |
|---|---|---|---|
| k | int | required | Number of eigenpairs to compute |
| max_iter | int | 100 | Maximum Lanczos iterations |
| tol | float | 1e-6 | Convergence tolerance |
| seed | int | None | Random seed for reproducibility |
| which | str | "LM" | Which eigenvalues: "LM" (largest magnitude), "LA" (largest algebraic), "SA" (smallest algebraic) |

Sources: hessian_eigenthings/algorithms/lanczos.py:1-80

Trace Estimation

The trace of a matrix can be estimated using stochastic methods without forming the full matrix:

Hutchinson's Estimator:

trace(A) ≈ (1/m) Σᵢ vᵢᵀ A vᵢ

where vᵢ are random probe vectors.
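A minimal, self-contained sketch of Hutchinson's estimator with Rademacher probes (illustrative only; the library's trace function wraps the same idea behind the operator interface):

```python
import torch

def hutchinson_trace(matvec, n, num_probes=100, seed=0):
    """Estimate trace(A) = E[vᵀ A v] using Rademacher probe vectors."""
    gen = torch.Generator().manual_seed(seed)
    total = 0.0
    for _ in range(num_probes):
        v = (torch.randint(0, 2, (n,), generator=gen) * 2 - 1).float()
        total += torch.dot(v, matvec(v)).item()
    return total / num_probes

# Sanity check on an explicit diagonal matrix: trace = 1 + 2 + ... + 100 = 5050.
A = torch.diag(torch.arange(1.0, 101.0))
print(hutchinson_trace(lambda v: A @ v, n=100))  # ≈ 5050
```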

Hutch++ Estimator: An improved estimator with lower variance, which combines a low-rank range sketch with Hutchinson probing on the deflated remainder:

trace(A) ≈ tr(QᵀAQ) + (1/m) Σⱼ wⱼᵀ A wⱼ

where Q is an orthonormal basis for a random sketch of the range of A and wⱼ = (I − QQᵀ)vⱼ are deflated probe vectors.

| Method | num_matvecs | Variance | Use Case |
|---|---|---|---|
| hutchinson | 100 | Higher | Quick estimates |
| hutch++ | 30 | Lower | Production estimates |

Sources: hessian_eigenthings/algorithms/trace.py:1-60

Operators

HessianOperator

The primary operator for computing Hessian eigendecomposition. It supports two HVP computation methods:

| Method | Description | Memory | Precision |
|---|---|---|---|
| "autograd" (default) | Exact double-backward via torch.autograd.grad with create_graph=True | Higher | Numerically exact |
| "finite_difference" | Central-difference approximation | Lower | O(ε²) bias |

```python
from hessian_eigenthings import HessianOperator, lanczos
from hessian_eigenthings.util import match_names

# Basic usage
op = HessianOperator(model=model, dataloader=dataloader, loss_fn=loss_fn)
eigenvalues, eigenvectors = lanczos(op, k=10)

# With parameter filtering (subset of parameters)
op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=match_names("transformer.h.*.attn.*"),
)
```

Sources: hessian_eigenthings/operators/hessian.py:1-100

GGNOperator

The Generalized Gauss-Newton (GGN) operator provides a more numerically stable approximation to the Hessian. For cross-entropy + softmax classification, the GGN equals the Fisher information matrix.

from hessian_eigenthings import GGNOperator

op = GGNOperator(
    model=model,
    dataloader=dataloader,
    forward_fn=model_forward,
    loss_of_output_fn=loss_of_output_fn,
)

Two matvec implementations:

| Implementation | Description | Memory Footprint |
|---|---|---|
| "analytical" (default) | Finite-difference JVP + analytical loss-Hessian-vec product | Matches one training step |
| "autograd" | Full torch.func.jvp + autograd double-backward | Scales badly with output size |

Sources: hessian_eigenthings/operators/ggn.py:1-80

Loss Functions

The library provides optimized loss functions with closed-form Hessian-vector products for common use cases.

Standard Loss Functions

from hessian_eigenthings.loss_fns.standard import cross_entropy_loss_of_output

loss_fn = cross_entropy_loss_of_output()  # Returns loss_of_output_fn

The closed-form cross-entropy HVP is:

H @ u = (p * u - p * <p, u>) / n

where p = softmax(output) and n is the number of valid positions.
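One way to sanity-check this closed form is to compare it against PyTorch's generic autograd HVP on random logits; the snippet below is an illustrative check, not library code:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)             # n = 4 positions, 10 classes
targets = torch.randint(0, 10, (4,))
u = torch.randn(4, 10)                  # direction for the HVP

def ce(x):
    return F.cross_entropy(x, targets)  # mean reduction over positions

# Generic double-backward HVP of the loss with respect to the logits.
_, hvp_autograd = torch.autograd.functional.hvp(ce, logits, u)

# Closed form: (p ⊙ u − p ⟨p, u⟩) / n with p = softmax(logits).
p = logits.softmax(dim=-1)
hvp_closed = (p * u - p * (p * u).sum(dim=-1, keepdim=True)) / logits.shape[0]

print(torch.allclose(hvp_autograd, hvp_closed, atol=1e-5))  # True
```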

Sources: hessian_eigenthings/loss_fns/standard.py:1-60

HuggingFace Transformers Loss

Specialized support for HuggingFace models with fused CUDA kernels:

from hessian_eigenthings.loss_fns.huggingface import hf_lm_loss

loss_fn = hf_lm_loss()  # For language modeling

Fused backend options:

| Backend | Device | Speedup | Memory |
|---|---|---|---|
| "triton" | CUDA | ~3.4x faster | 2x reduction |
| "compile" | Any | ~2.6x faster | 2x reduction |
| "eager" | Any | Baseline | Baseline |
| "auto" | Auto-detect | Best available | Best available |

Sources: hessian_eigenthings/loss_fns/huggingface.py:1-80

Usage Examples

Basic Hessian Eigendecomposition

import torch
from torch import nn
from hessian_eigenthings import HessianOperator, lanczos

# Define model and data
model = nn.Sequential(nn.Linear(100, 50), nn.ReLU(), nn.Linear(50, 10))
dataloader = [(torch.randn(32, 100), torch.randint(0, 10, (32,)))]

# Loss function
def loss_fn(model, batch):
    x, y = batch
    return nn.functional.cross_entropy(model(x), y)

# Compute top 3 eigenvalues
op = HessianOperator(model=model, dataloader=dataloader, loss_fn=loss_fn)
result = lanczos(op, k=3, max_iter=20, tol=1e-3, seed=0)

for i, val in enumerate(result.eigenvalues):
    print(f"λ_{i+1} = {val.item():.4e}")

Sources: examples/huggingface_tiny_gpt2.py:1-50

Analyzing Attention Layers Only

from hessian_eigenthings import HessianOperator, lanczos
from hessian_eigenthings.util import match_names

# Filter to attention parameters only
attn_op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=match_names("blocks.*.attn.*"),
)
eigenvalues = lanczos(attn_op, k=5)

Trace Estimation

from hessian_eigenthings import HessianOperator, trace

op = HessianOperator(model=model, dataloader=dataloader, loss_fn=loss_fn)
result = trace(op, num_matvecs=30, method="hutch++", seed=0)
print(f"Trace estimate: {result.estimate:.4e}")

Performance Considerations

Memory Management

For large models, consider these strategies:

  1. Parameter Filtering: Analyze only relevant subsets of parameters

```python
op = HessianOperator(model, dataloader, loss_fn, param_filter=filter_func)
```

  2. Microbatching: Process data in smaller chunks

```python
op = HessianOperator(model, dataloader, loss_fn, microbatch_size=8)
```

  3. Finite Difference Method: Use "finite_difference" for lower memory with FSDP/HSDP/TP

```python
op = HessianOperator(model, dataloader, loss_fn, method="finite_difference")
```

Scalability

| Model Size | Recommended Method | Notes |
|---|---|---|
| < 1B params | "autograd" HVP | Numerically exact |
| 1B - 7B params | "analytical" GGN | Good memory efficiency |
| > 7B params | "finite_difference" HVP | Works with distributed training |

Computation Cost

The primary cost driver is the number of matrix-vector products (matvecs):

  • Lanczos: up to max_iter matvecs (one per iteration), largely independent of k
  • Trace (Hutch++): num_matvecs matvecs
  • Spectral Density: num_steps × num_random_start matvecs

API Reference

Core Functions

| Function | Module | Description |
|---|---|---|
| lanczos | hessian_eigenthings.algorithms | Lanczos eigendecomposition |
| stochastic_power_iteration | hessian_eigenthings.algorithms | Stochastic power iteration |
| trace | hessian_eigenthings.algorithms | Trace estimation |
| spectral_density | hessian_eigenthings.algorithms | Stochastic Lanczos Quadrature |

Operators

| Class | Module | Description |
|---|---|---|
| HessianOperator | hessian_eigenthings.operators | Full Hessian operator |
| GGNOperator | hessian_eigenthings.operators | Generalized Gauss-Newton operator |
| CurvatureOperator | hessian_eigenthings.operators | Base class for custom operators |

Utility Functions

| Function | Description |
|---|---|
| match_names(glob_pattern) | Create parameter filter from glob pattern |
| SingleDeviceBackend | Linear algebra backend for single-device execution |

Project Information

Acknowledgements

The original 2018 implementation was developed by Noah Golmant, Zhewei Yao, Amir Gholami, Michael Mahoney, and Joseph Gonzalez at UC Berkeley's RISELab.

The deflated power iteration is based on code from HessianFlow (Z. Yao, A. Gholami, Q. Lei, K. Keutzer, M. Mahoney. *"Hessian-based Analysis of Large Batch Training and Robustness to Adversaries"*, NeurIPS 2018).

Accelerated stochastic power iteration is from C. De Sa et al.

Citation

@misc{hessian-eigenthings,
    author       = {Noah Golmant and Zhewei Yao and Amir Gholami and Michael Mahoney and Joseph Gonzalez},
    title        = {pytorch-hessian-eigenthings: efficient PyTorch Hessian eigendecomposition},
    month        = oct,
    year         = 2018,
    version      = {1.0},
    url          = {https://github.com/noahgolmant/pytorch-hessian-eigenthings}
}

Installation

# From PyPI (stable release)
pip install hessian-eigenthings

# Development setup
git clone https://github.com/noahgolmant/pytorch-hessian-eigenthings
cd pytorch-hessian-eigenthings
uv sync --group dev --group docs

Documentation

Full documentation is available at noahgolmant.github.io/pytorch-hessian-eigenthings/.

Sources: README.md:1-100, CONTRIBUTING.md:1-60, mkdocs.yml:1-50

Installation Guide

Related topics: Introduction to hessian-eigenthings


Overview

This guide covers all aspects of setting up the hessian-eigenthings library for computing Hessian eigendecomposition and related curvature matrix operations in PyTorch models.

The library provides efficient methods for computing top eigenvalues and eigenvectors via Lanczos or stochastic power iteration, trace estimates via Hutch++, and spectral density via Stochastic Lanczos Quadrature. Sources: README.md:1-20

Installation Methods

PyPI Release (Recommended for Users)

The latest stable release is available on PyPI:

pip install hessian-eigenthings

This installs the core library without optional dependencies for transformer and curvlinops integrations. Sources: README.md:1-10

Development Installation (For Contributors)

For development, clone the repository and install with all optional dependency groups:

git clone https://github.com/noahgolmant/pytorch-hessian-eigenthings
cd pytorch-hessian-eigenthings
uv sync --group dev --group docs --extra transformers --extra transformer-lens --extra curvlinops

Sources: CONTRIBUTING.md:5-12

Optional Dependency Groups

The library uses optional dependency groups defined in pyproject.toml to enable specialized functionality:

| Group | Purpose | Typical Use Case |
|---|---|---|
| dev | Testing, linting, type checking | Running CI checks locally |
| docs | Building documentation | mkdocs build --strict |
| transformers | HuggingFace Transformers integration | GGNOperator with HF models |
| transformer-lens | TransformerLens integration | Attention-only Hessian analysis |
| curvlinops | Cross-library validation tests | Testing against external oracle |

Sources: CONTRIBUTING.md:8-10

Development Environment Setup

Prerequisites

| Requirement | Version | Purpose |
|---|---|---|
| Python | ≥3.10 | Core runtime |
| uv | Latest | Package manager |
| PyTorch | ≥2.0 | Backend tensor operations |
| CUDA (optional) | 11.8+ | GPU acceleration for large models |

Setup Workflow

graph TD
    A[Clone Repository] --> B[Install uv if needed]
    B --> C[Run uv sync with groups]
    C --> D[Verify Installation]
    D --> E{Which workflow?}
    E -->|Development| F[Run linting checks]
    E -->|Testing| G[Run pytest]
    E -->|Documentation| H[Build docs]
    F --> I[Ready to contribute]
    G --> I
    H --> I

Verification Commands

After installation, verify the setup by running the full check suite:

uv run ruff check .
uv run black --check .
uv run mypy
uv run pytest
uv run mkdocs build --strict

Sources: CONTRIBUTING.md:14-23

CUDA/GPU Support

The library provides optimized CUDA kernels for specific operations:

Triton Kernels (CUDA Only)

The hessian_eigenthings.loss_fns._fused_ce_hvp module includes a hand-written Triton CUDA kernel for fused CE HVP computation. This kernel:

  • Allocates zero (N, V) intermediates (output buffer only)
  • Provides ~3.4x speedup over eager mode
  • Reduces peak memory by 2x compared to eager
  • Falls back to torch.compile if Triton/CUDA is unavailable

Sources: hessian_eigenthings/loss_fns/_fused_ce_hvp.py:50-80

Backend Selection

For HuggingFace language model loss functions, the fused parameter controls kernel selection:

| Setting | Behavior |
|---|---|
| "auto" (default) | Picks fastest available: Triton on CUDA (~3.4x speedup), else torch.compile |
| "eager" | Plain PyTorch implementation, useful for debugging |
| "compile" | torch.compile-fused via Inductor, works on CPU/CUDA/MPS |
| "triton" | Hand-written CUDA Triton kernel (CUDA only) |

Sources: hessian_eigenthings/loss_fns/huggingface.py:1-50

Operator-Specific Dependencies

Different curvature operators have different computational requirements:

HessianOperator

Two HVP methods are supported:

| Method | Memory Profile | Precision | FSDP/TP Compatible |
|---|---|---|---|
| "autograd" (default) | Higher (requires create_graph=True) | Numerically exact | No (requires special handling) |
| "finite_difference" | Matches one training step | O(ε²) truncation bias | Yes |

Sources: hessian_eigenthings/operators/hessian.py:1-40

GGNOperator

For the Generalized Gauss-Newton operator, two matvec implementations are available:

| Implementation | Memory | Use Case |
|---|---|---|
| "analytical" (default) | Matches one training step | LM-scale use, prevents OOM |
| "autograd" | Scales with output size | Losses without analytical .hvp |

Sources: hessian_eigenthings/operators/ggn.py:1-60

CI/CD Verification

The repository uses GitHub Actions for continuous integration. CI runs:

  • All linting and type checks
  • Full pytest test suite
  • Example scripts execution
  • Documentation codeblock tests

Sources: CONTRIBUTING.md:23-26

Troubleshooting

Common Issues

| Issue | Solution |
|---|---|
| Memory OOM with GGNOperator | Use loss_hvp="analytical" (default in recent versions) |
| FSDP/TP compatibility issues | Use method="finite_difference" for HessianOperator |
| Triton not available | Falls back to torch.compile automatically |
| Type checking failures | Run uv run mypy locally before submitting PR |

Diagnostic Scripts

The repository includes diagnostic scripts for troubleshooting, such as scripts/repro_ggn_oom.py and scripts/bench_fused_ce_hvp.py.

Sources: scripts/repro_ggn_oom.py:1-40, scripts/bench_fused_ce_hvp.py:1-50

Package Metadata

| Property | Value |
|---|---|
| Package Name | hessian-eigenthings |
| License | MIT |
| Documentation | noahgolmant.github.io/pytorch-hessian-eigenthings |
| CI Status | ![CI](https://github.com/noahgolmant/pytorch-hessian-eigenthings/actions/workflows/ci.yml) |

Sources: README.md:1-20

Sources: CONTRIBUTING.md:5-12

Curvature Matrices Explained

Related topics: Why Hessian-Vector Products, Curvature Operators


Overview

Curvature matrices characterize the second-order behavior of loss functions in neural networks, providing critical information about optimization landscapes, generalization properties, and model robustness. The hessian-eigenthings library provides efficient, matrix-free computation of eigendecompositions for three key curvature matrices: the Hessian, the Generalized Gauss-Newton (GGN), and the Empirical Fisher.

These curvature operators serve as the foundation for analyzing flat minima, understanding generalization, and performing second-order optimization. The library implements matrix-vector products (matvecs) directly, avoiding explicit matrix construction which would be computationally infeasible for large neural networks with billions of parameters.

Curvature Matrices Architecture

graph TD
    A[Loss Function Lθ] --> B[Hessian H = ∇²L]
    A --> C[Generalized Gauss-Newton G]
    A --> D[Empirical Fisher F]
    
    B --> E[Matrix-Free MatVec]
    C --> E
    D --> E
    
    E --> F[Lanczos Eigendecomposition]
    E --> G[Trace Estimation]
    E --> H[Spectral Density]
    
    F --> I[Top-k Eigenpairs]
    G --> J[Trace Estimate ± SE]
    H --> K[Spectral Density Plot]

The Hessian Matrix

Definition and Role

The Hessian matrix H = ∇²L(θ) is the second derivative of the loss with respect to parameters. It captures the exact local curvature of the loss landscape, making it the most precise but also most computationally expensive curvature matrix.

The Hessian is symmetric by construction and its eigenvalues reveal critical properties:

  • Large positive eigenvalues indicate sharp curvature, suggesting the model is in a narrow minimum
  • Small eigenvalues indicate flat regions associated with better generalization
  • Negative eigenvalues signal instability and potential divergence

Sources: hessian_eigenthings/operators/hessian.py:1-30

HessianOperator Implementation

The HessianOperator class provides two methods for computing Hessian-vector products (HVPs):

class HessianOperator(CurvatureOperator):
    """Hessian of `loss_fn(model, batch)` averaged over batches in dataloader."""
    
    def __init__(
        self,
        model: nn.Module,
        dataloader: Iterable[Any],
        loss_fn: LossFn,
        *,
        param_filter: ParamFilter | None = None,
        full_dataset: bool = True,
        num_batches: int | None = None,
        microbatch_size: int | None = None,
        method: HvpMethod = "autograd",
        fd_eps: float | None = None,
        backend: LinAlgBackend[torch.Tensor] | None = None,
    ) -> None:

HVP Computation Methods:

| Method | Description | Memory | Precision |
|---|---|---|---|
| "autograd" | Exact double-backward via torch.autograd.grad with create_graph=True | Higher (scales with model size) | Numerically exact to rounding |
| "finite_difference" | Central-difference (∇L(θ+εv) − ∇L(θ−εv)) / 2ε | Lower (two forward+backward passes) | O(ε²) truncation bias |

The finite-difference method uses dtype-specific epsilon values for optimal precision:

| dtype | Epsilon |
|---|---|
| float64 | 6e-6 |
| float32 | 5e-3 |
| bfloat16 | 0.2 |
| float16 | 5e-2 |

Sources: hessian_eigenthings/operators/hessian.py:30-55

When to Use the Hessian

The Hessian is ideal for:

  • Single-device analysis of models up to ~7B parameters
  • Scenarios requiring exact curvature information
  • Research on loss landscape topology
  • Verifying approximations against ground truth

Generalized Gauss-Newton (GGN)

Definition and Mathematical Foundation

The Generalized Gauss-Newton matrix G is a positive semi-definite (PSD) approximation to the Hessian. For a loss of the form L = (1/n) Σ l(f(xᵢ;θ), yᵢ), the GGN is defined as:

G = Jᵀ · H_loss · J

Where:

  • J is the Jacobian of model outputs with respect to parameters
  • H_loss is the Hessian of the loss with respect to model outputs

The GGN is always PSD because G = Jᵀ · H_loss · J and H_loss is PSD for convex losses.

Sources: hessian_eigenthings/operators/ggn.py:1-40

GGNOperator Implementation

The GGNOperator class provides two matvec implementations:

class GGNOperator(CurvatureOperator):
    def __init__(
        self,
        model: nn.Module,
        dataloader: Iterable[Any],
        forward_fn: ForwardFn,
        loss_of_output_fn: LossOfOutputFn,
        *,
        loss_hvp: Literal["analytical", "autograd"] = "analytical",
    ) -> None:

| Method | Description | Memory | Use Case |
|---|---|---|---|
| "analytical" | Finite-difference JVP + analytical loss-Hessian-vector product | Matches one training step | LM-scale use, OOM-safe |
| "autograd" | torch.func.jvp + autograd double-backward + vjp | Scales with output size | Exact for arbitrary losses |

For cross-entropy + softmax classification, G equals the Fisher information matrix, making the GGN and Fisher equivalent in this common case.

Sources: hessian_eigenthings/operators/ggn.py:40-80

Two-Function API Design

The GGNOperator uses a separation between forward_fn and loss_of_output_fn:

ForwardFn = Callable[[nn.Module, Any], torch.Tensor]
LossOfOutputFn = Callable[[torch.Tensor, Any], torch.Tensor]

This design enables computing J·v, H_loss·(J·v), and Jᵀ·(H_loss·J·v) without coupling to loss internals.
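A hedged sketch of what such a pair might look like for a plain classifier, where the (x, y) batch structure is an assumption made for illustration:

```python
import torch
from torch import nn
import torch.nn.functional as F

# ForwardFn: (model, batch) -> model output (here, logits).
def model_forward(model: nn.Module, batch) -> torch.Tensor:
    x, _ = batch
    return model(x)

# LossOfOutputFn: (output, batch) -> scalar loss of the output alone.
def loss_of_output(output: torch.Tensor, batch) -> torch.Tensor:
    _, y = batch
    return F.cross_entropy(output, y)
```

The split lets the operator differentiate through the network (for J and Jᵀ) and through the loss head (for H_loss) independently.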

Closed-Form Cross-Entropy HVP

For mean-reduced cross-entropy with softmax, the library provides an optimized analytical HVP:

H_loss @ u = (p * u - p * ⟨p, u⟩) / n

Where p = softmax(logits) and n is the count of non-ignored positions.

Sources: hessian_eigenthings/loss_fns/huggingface.py:1-60

Fused CE HVP Implementations

The library provides three backend implementations for the cross-entropy HVP:

| Backend | Description | Memory | Speedup |
|---|---|---|---|
| "eager" | Plain PyTorch reference | Highest | 1x baseline |
| "compile" | torch.compile-fused; Inductor fuses operations | Reduced | ~2.6x faster |
| "triton" | Hand-written CUDA Triton kernel | Minimal (output buffer only) | ~3.4x faster |

Sources: hessian_eigenthings/loss_fns/_fused_ce_hvp.py:1-50

Empirical Fisher

Definition

The Empirical Fisher matrix F is defined as:

F = (1/n) Σ ∇lᵢ · ∇lᵢᵀ

Where the expectation over data is replaced by the empirical average over batches. It is always PSD and serves as an approximation to the Fisher Information Matrix.
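Because F is a sum of rank-1 gradient outer products, an F·v product never requires forming the n × n matrix. A minimal sketch, assuming per-sample gradients have already been stacked into an (N, n) tensor:

```python
import torch

def empirical_fisher_matvec(per_sample_grads: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Compute F @ v = (1/N) Σᵢ gᵢ (gᵢᵀ v) from stacked per-sample gradients."""
    coeffs = per_sample_grads @ v                        # (N,) values gᵢᵀ v
    return per_sample_grads.T @ coeffs / per_sample_grads.shape[0]
```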

EmpiricalFisherOperator

class EmpiricalFisherOperator(CurvatureOperator):
    def __init__(
        self,
        model: nn.Module,
        dataloader: Iterable[Any],
        loss_fn: LossFn,
        *,
        per_sample: bool = False,
    ) -> None:

Sources: hessian_eigenthings/operators/fisher.py

Curvature Operator Interface

All curvature matrices implement the CurvatureOperator base class:

class CurvatureOperator(ABC):
    @property
    @abstractmethod
    def size(self) -> int:
        """Number of parameters (matrix dimension)."""
        ...
    
    @property
    def dtype(self) -> torch.dtype:
        ...
    
    @property
    def device(self) -> torch.device:
        ...
    
    @abstractmethod
    def matvec(self, v: torch.Tensor) -> torch.Tensor:
        """Compute matrix-vector product A @ v."""
        ...

Sources: hessian_eigenthings/operators/base.py

Parameter Filtering

Curvature operators support computing curvature only over a subset of parameters using ParamFilter:

def match_names(*patterns: str) -> ParamFilter:
    """Match parameters by name patterns."""
    
def match_regex(pattern: str) -> ParamFilter:
    """Match parameters by regex pattern."""

This enables analysis of specific components:

# Analyze only attention weights
attn_op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=match_names("blocks.*.attn.*"),
)

Algorithmic Foundations

Lanczos Eigendecomposition

The Lanczos algorithm computes eigenvalues and eigenvectors of large sparse matrices using only matrix-vector products:

def lanczos(
    operator: CurvatureOperator,
    k: int = 10,
    max_iter: int = 100,
    tol: float = 1e-6,
    which: str = "LM",
) -> EigenResult:

The algorithm:

  1. Builds a tridiagonal matrix T from matvec operations
  2. Computes eigenvalues of T as Ritz approximations
  3. Accumulates eigenvectors directly via rank-1 outer-product updates
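A condensed sketch of these steps with full reorthogonalization (illustrative only; the library's implementation adds convergence tracking and Ritz-vector accumulation):

```python
import torch

def lanczos_sketch(matvec, n, m, seed=0):
    """Run m Lanczos steps against a symmetric matvec and return Ritz values."""
    gen = torch.Generator().manual_seed(seed)
    v = torch.randn(n, generator=gen)
    v /= v.norm()
    V, alphas, betas = [v], [], []
    for _ in range(m):
        w = matvec(V[-1])
        alpha = torch.dot(w, V[-1])
        w = w - alpha * V[-1] - (betas[-1] * V[-2] if betas else 0)
        for u in V:                       # full reorthogonalization
            w = w - torch.dot(w, u) * u
        alphas.append(alpha)
        beta = w.norm()
        if beta < 1e-10:                  # Krylov space exhausted
            break
        betas.append(beta)
        V.append(w / beta)
    T = torch.diag(torch.stack(alphas))   # build the tridiagonal matrix T
    if len(alphas) > 1:
        off = torch.stack(betas[: len(alphas) - 1])
        T = T + torch.diag(off, 1) + torch.diag(off, -1)
    return torch.linalg.eigvalsh(T)       # Ritz values ≈ extreme eigenvalues

A = torch.diag(torch.tensor([10.0, 3.0, 1.0, 0.1]))
print(lanczos_sketch(lambda v: A @ v, n=4, m=4))  # recovers the spectrum of A
```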

Sources: hessian_eigenthings/algorithms/lanczos.py:1-60

Trace Estimation

Trace estimation uses stochastic probing to estimate tr(A) without computing the full matrix:

| Method | Samples | Variance | Description |
|---|---|---|---|
| Hutchinson | m | O(1/√m) | (1/m) Σ vᵢᵀ A vᵢ with Rademacher/Gaussian vectors |
| Hutch++ | m | O(1/m) | Improved estimator with better constant factors |

```python
def trace(
    operator: CurvatureOperator,
    *,
    num_matvecs: int = 100,
    method: Method = "hutch++",
) -> TraceResult:
```

Sources: hessian_eigenthings/algorithms/trace.py:1-50

Operator Selection Guide

graph LR
    A[Need Curvature?] --> B{Exact Hessian?}
    B -->|Yes, small model| C[HessianOperator<br/>method=autograd]
    B -->|No| D{Need Fisher/GGN?}
    D -->|Yes, cross-entropy| E[GGNOperator<br/>loss_hvp=analytical]
    D -->|Yes, other loss| F[GGNOperator<br/>loss_hvp=autograd]
    D -->|Empirical Fisher| G[EmpiricalFisherOperator]
    
    C --> H[Use lanczos for eigenvalues]
    E --> H
    F --> H
    G --> H

| Scenario | Recommended Operator | Method |
|---|---|---|
| Exact Hessian, single GPU | HessianOperator | method="autograd" |
| Large model, distributed | HessianOperator | method="finite_difference" |
| Language modeling, cross-entropy | GGNOperator | loss_hvp="analytical" |
| Custom loss, need exact | GGNOperator | loss_hvp="autograd" |
| Natural gradient optimization | EmpiricalFisherOperator | Default |

Module Exports

from hessian_eigenthings.operators import (
    CurvatureOperator,
    HessianOperator,
    GGNOperator,
    EmpiricalFisherOperator,
    DDPHessianOperator,  # For DistributedDataParallel
    LambdaOperator,      # Custom curvature wrappers
)

from hessian_eigenthings.algorithms import (
    lanczos,              # Top-k eigendecomposition
    trace,                # Trace estimation
    spectral_density,    # Density plot via SLQ
    deflated_power_iteration,
)

Sources: hessian_eigenthings/operators/__init__.py:1-25

Summary

The hessian-eigenthings library provides a unified interface for computing curvature information in neural networks:

  • Hessian: Exact local curvature via autograd or finite-difference approximation
  • GGN: Positive semi-definite approximation ideal for large-scale analysis
  • Empirical Fisher: Sample-based Fisher approximation for natural gradient methods

All operators provide matrix-free matvec implementations, enabling eigendecomposition and trace estimation for models with billions of parameters without explicit matrix construction.

Sources: [hessian_eigenthings/operators/hessian.py:1-30](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/hessian.py)

Why Hessian-Vector Products

Related topics: Curvature Matrices Explained, Eigendecomposition Algorithms


Hessian-vector products (HVPs) are the computational foundation of this library. Understanding *why* we use HVPs instead of computing the full Hessian matrix is essential for appreciating the design and capabilities of hessian-eigenthings.

The Full Hessian Problem

The Hessian matrix $H$ of a neural network's loss function is a second-order partial derivative matrix with dimensions $[n \times n]$, where $n$ is the number of parameters. For modern large-scale models:

| Model | Parameters | Hessian Size | Memory (fp32) |
|---|---|---|---|
| BERT-Base | 110M | 110M × 110M | ~48 PB |
| GPT-2 | 1.5B | 1.5B × 1.5B | ~9 EB |
| LLaMA-7B | 7B | 7B × 7B | ~196 EB |

Storing the full Hessian is fundamentally infeasible. Even computing it via automatic differentiation requires $O(n^2)$ operations and memory that scales quadratically with model size. Sources: README.md:1-30

What is a Hessian-Vector Product?

A Hessian-vector product computes $Hv$ for a given vector $v$ without ever constructing $H$ explicitly. The operation takes $O(n)$ time and memory—linear in the number of parameters.

Formally, given:

  • Loss function $\mathcal{L}(\theta)$
  • Parameter vector $\theta \in \mathbb{R}^n$
  • Direction vector $v \in \mathbb{R}^n$

The HVP is: $$Hv = \nabla_\theta^2 \mathcal{L} \cdot v = \frac{\partial}{\partial \theta} \left( \nabla_\theta \mathcal{L} \cdot v \right)$$

This is implemented as a double-backward pass:

  1. Forward pass → compute loss
  2. Backward pass → compute gradient $\nabla_\theta \mathcal{L}$
  3. Second backward pass → differentiate the scalar $\nabla_\theta \mathcal{L} \cdot v$ to obtain $H \cdot v$
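A minimal sketch of this double-backward pattern (the library wraps it inside HessianOperator; the flat-vector layout used here is an illustrative assumption):

```python
import torch

def hvp(loss: torch.Tensor, params: list[torch.Tensor], v: torch.Tensor) -> torch.Tensor:
    """Compute H @ v for a scalar `loss`, with v flattened across all params."""
    grads = torch.autograd.grad(loss, params, create_graph=True)  # ∇L, graph kept
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    grad_dot_v = torch.dot(flat_grad, v)                          # scalar ⟨∇L, v⟩
    hvps = torch.autograd.grad(grad_dot_v, params)                # ∇⟨∇L, v⟩ = H v
    return torch.cat([h.reshape(-1) for h in hvps])
```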

Sources: hessian_eigenthings/operators/hessian.py:1-50

Why HVPs Enable Scalable Curvature Analysis

By avoiding explicit Hessian construction, HVP-based algorithms can operate on models of any size. This library provides several key algorithms that all rely on HVP as their primitive operation:

graph TD
    A[Hessian-Vector Product] --> B[Lanczos Eigendecomposition]
    A --> C[Hutchinson Trace Estimation]
    A --> D[Hutch++ Trace Estimation]
    A --> E[Stochastic Lanczos Quadrature]
    
    B --> F[Top-k Eigenvalues & Eigenvectors]
    C --> G[Trace Estimation]
    D --> G
    E --> H[Spectral Density Plot]

Eigendecomposition via Lanczos

The Lanczos algorithm iteratively builds an orthogonal basis that tridiagonalizes the operator. It requires only matrix-vector products, making it perfect for HVP-based curvature analysis:

| Property | Full Eigendecomp | Lanczos + HVP |
|---|---|---|
| Memory | $O(n^2)$ | $O(n \cdot k)$ |
| Time | $O(n^3)$ | $O(n \cdot k^2)$ |
| Storage | Entire matrix | $k$ Lanczos vectors |

Where $k$ is the number of desired eigenpairs (typically 1-20). Sources: hessian_eigenthings/algorithms/lanczos.py:1-60

Trace Estimation via Hutchinson's Method

The trace of the Hessian can be estimated without constructing the full matrix:

$$\text{tr}(H) \approx \frac{1}{m} \sum_{i=1}^{m} v_i^T H v_i$$

where $v_i$ are random probe vectors (typically Rademacher or Gaussian). Each term $v_i^T H v_i$ is a single HVP plus a dot product. Sources: hessian_eigenthings/algorithms/trace.py:1-45

HVP Implementation Strategies

The library provides two distinct methods for computing HVPs, each with different trade-offs:

Method 1: Autograd (Default)

Uses torch.autograd.grad with create_graph=True for exact double-backward computation:

def __init__(
    self,
    model: nn.Module,
    dataloader: Iterable[Any],
    loss_fn: LossFn,
    *,
    method: HvpMethod = "autograd",  # Default
    ...
) -> None:

Advantages:

  • Numerically exact (to floating-point rounding)
  • Works with any differentiable loss function
  • Simple implementation

Disadvantages:

  • Builds the full computation graph for second derivatives
  • Memory scales with model complexity and output size

Sources: hessian_eigenthings/operators/hessian.py:20-45

Method 2: Finite Difference

Uses central-difference approximation: $$\frac{\nabla_\theta \mathcal{L}(\theta + \epsilon v) - \nabla_\theta \mathcal{L}(\theta - \epsilon v)}{2\epsilon}$$

Advantages:

  • No second-backward graph → lower memory footprint
  • Compatible with distributed training (FSDP/HSDP/TP) without special handling

Disadvantages:

  • $O(\epsilon^2)$ truncation bias
  • Precision-dependent roundoff (~1e-5 fp32, ~1e-2 bf16)
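A minimal sketch of the central-difference scheme over a single flat parameter tensor (illustrative; the epsilon follows the fp32 default listed elsewhere in this manual):

```python
import torch

def finite_difference_hvp(loss_fn, theta: torch.Tensor, v: torch.Tensor, eps: float = 5e-3):
    """Approximate H @ v as (∇L(θ+εv) − ∇L(θ−εv)) / (2ε)."""
    def grad_at(point):
        point = point.detach().requires_grad_(True)
        loss_fn(point).backward()
        return point.grad
    return (grad_at(theta + eps * v) - grad_at(theta - eps * v)) / (2 * eps)
```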

Sources: hessian_eigenthings/operators/hessian.py:30-40

Fused HVP for Cross-Entropy Losses

For language models with large vocabulary softmax heads, computing $H_{\text{loss}} \cdot u$ naively allocates multiple $(N, V)$ intermediate tensors (where $V$ is vocabulary size). This is addressed with fused implementations:

| Backend | Speedup | Memory Reduction | Requirements |
|---|---|---|---|
| eager | 1× (baseline) | 1× (baseline) | Any |
| compile | ~2.6× | ~2× | torch.compile |
| triton | ~3.4× | ~2× | CUDA + Triton |

The fused computation computes: $$H_{\text{loss}} \cdot u = \frac{p \odot u - p \, \langle p, u \rangle}{n} \odot \text{mask}$$

Where $p = \text{softmax}(\text{logits})$ and the implementation avoids materializing the full $(N, V)$ softmax output. Sources: hessian_eigenthings/loss_fns/_fused_ce_hvp.py:1-50

Generalized Gauss-Newton (GGN) Approximation

For optimization-focused curvature analysis, the GGN matrix $G$ provides a positive semi-definite (PSD) approximation to the Hessian:

$$G = J^T \cdot H_{\text{loss}} \cdot J$$

Where $J$ is the Jacobian of the model outputs with respect to parameters. For cross-entropy + softmax classification, $G$ equals the Fisher information matrix. The GGN is always PSD by construction, making it suitable for optimization algorithms. Sources: hessian_eigenthings/operators/ggn.py:1-40
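To make the $J$, $H_{\text{loss}}$, $J^T$ factorization concrete, here is a hedged sketch of one GGN matvec using torch.func; the function names and single-batch layout are illustrative assumptions, not the library's implementation:

```python
import torch
from torch import nn
from torch.func import functional_call, jvp, vjp

def ggn_matvec(model: nn.Module, params: dict, x: torch.Tensor, v: dict) -> dict:
    """Compute (Jᵀ H_loss J) v for mean-reduced softmax cross-entropy.
    `params` and `v` are dicts of tensors, e.g. dict(model.named_parameters())."""
    def output_of_params(p):
        return functional_call(model, p, (x,))

    # 1. Jv: forward-mode JVP through the network.
    out, Jv = jvp(output_of_params, (params,), (v,))

    # 2. H_loss (Jv): analytical CE Hessian-vector product, (p ⊙ u − p ⟨p, u⟩)/n.
    p_soft = out.softmax(dim=-1)
    HJv = (p_soft * Jv - p_soft * (p_soft * Jv).sum(-1, keepdim=True)) / out.shape[0]

    # 3. Jᵀ (H_loss Jv): reverse-mode VJP back into parameter space.
    _, vjp_fn = vjp(output_of_params, params)
    return vjp_fn(HJv)[0]
```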

GGN Matvec Implementation

The GGNOperator supports two matvec paths:

  1. Analytical (default): Finite-difference JVP + analytical loss-Hessian-vector product + single normal backward. Memory footprint matches one normal training step.
  2. Autograd: Original torch.func.jvp + autograd double-backward + torch.func.vjp. Numerically exact but scales badly with vocabulary size.

Sources: hessian_eigenthings/operators/ggn.py:25-45

Practical Implications

The HVP approach enables:

| Capability | HVP-Based | Full Hessian |
|---|---|---|
| 7B parameter model | ✅ ~hours | ❌ impossible |
| Top-10 eigenpairs | ✅ | ❌ |
| Trace estimation | ✅ | ❌ |
| Spectral density | ✅ | ❌ |
| FSDP compatibility | ✅ (finite-diff) | ❌ |

The eigenvalues and eigenvectors of the Hessian have been implicated in generalization properties of neural networks. Researchers hypothesize that "flat minima" generalize better, that Hessians of large models are very low-rank, and that curvature analysis can guide optimization. Sources: README.md:25-35

Summary

Hessian-vector products are the fundamental building block that makes large-scale curvature analysis possible:

  1. Memory efficiency: $O(n)$ vs $O(n^2)$ for the full Hessian
  2. Computational efficiency: $O(n)$ per matvec vs $O(n^2)$ for full computation
  3. Scalability: Works with models of any size via iterative algorithms
  4. Flexibility: Supports exact (autograd) or memory-efficient (finite-difference) computation

The hessian-eigenthings library provides production-ready implementations of HVP computation and HVP-based algorithms for practical curvature analysis in PyTorch.

Sources: hessian_eigenthings/operators/hessian.py:1-50

System Architecture

Related topics: Curvature Operators, Eigendecomposition Algorithms, Loss Functions


The pytorch-hessian-eigenthings library provides an efficient and scalable framework for computing eigendecompositions of curvature matrices—including the Hessian, Generalized Gauss-Newton (GGN) matrix, and empirical Fisher—for arbitrary PyTorch models. The architecture is designed around three core abstractions: Curvature Operators, Algorithms, and Linear Algebra Backends.

High-Level Architecture Overview

The library implements a layered architecture that separates mathematical curvature computations from numerical algorithms:

graph TD
    subgraph "User Layer"
        U[User Code]
    end
    
    subgraph "Algorithm Layer"
        LA[Lanczos]
        TR[Trace Estimation]
        SP[Stochastic Power Iteration]
    end
    
    subgraph "Operator Layer"
        HO[HessianOperator]
        GGN[GGNOperator]
        FO[FisherOperator]
    end
    
    subgraph "Backend Layer"
        B[LinAlgBackend]
        SD[SingleDeviceBackend]
    end
    
    subgraph "PyTorch Core"
        PT[PyTorch Autograd]
    end
    
    U -->|uses| LA
    U -->|uses| TR
    U -->|uses| HO
    LA -->|operates on| HO
    TR -->|operates on| HO
    HO -->|implemented via| B
    B -->|delegates to| PT

Core Components

1. Curvature Operators

Curvature operators are the foundation of the library. They abstract away the details of how matrix-vector products (matvecs) with curvature matrices are computed, providing a unified interface for algorithms to work with.

#### Base Interface

All operators inherit from CurvatureOperator, which defines the contract for curvature computations:

| Property/Method | Type | Description |
|---|---|---|
| size | int | Total number of parameters in the curvature matrix |
| dtype | torch.dtype | Data type of the operator |
| device | torch.device | Device where computations run |
| matvec(v) | Callable | Computes A @ v for input vector v |

#### Hessian Operator

The HessianOperator computes the Hessian of a loss function with respect to model parameters:

HessianOperator(
    model: nn.Module,
    dataloader: Iterable[Any],
    loss_fn: LossFn,
    *,
    param_filter: ParamFilter | None = None,
    method: HvpMethod = "autograd"  # or "finite_difference"
)

Sources: hessian_eigenthings/operators/hessian.py:1-50

Two HVP computation methods are supported:

| Method | Description | Use Case |
|---|---|---|
| "autograd" (default) | Exact double-backward via torch.autograd.grad | Up to ~7B parameters |
| "finite_difference" | Central-difference approximation | FSDP/HSDP/TP at scale |

The finite difference method uses the approximation:

Hv ≈ (∇L(θ+εv) − ∇L(θ−εv)) / 2ε

This avoids the second-backward graph entirely, making it compatible with distributed training setups.

#### GGN Operator

The GGNOperator implements the Generalized Gauss-Newton matrix, which is always positive semi-definite:

GGNOperator(
    model: nn.Module,
    dataloader: Iterable[Any],
    forward_fn: ForwardFn,
    loss_of_output_fn: LossOfOutputFn,
    *,
    loss_hvp: Literal["analytical", "autograd"] = "analytical"
)

Sources: hessian_eigenthings/operators/ggn.py:1-80

The GGN decomposes as G = J^T · H_loss · J where:

  • J is the Jacobian of the model output with respect to parameters
  • H_loss is the Hessian of the loss with respect to the output

For cross-entropy + softmax classification, G equals the Fisher information matrix.

2. Algorithms Layer

Algorithms operate on any CurvatureOperator via its matvec interface, enabling eigenvalue computation, trace estimation, and spectral density analysis.

#### Lanczos Eigensolver

The Lanczos algorithm computes the top-k eigenvalues and eigenvectors of a symmetric matrix using only matrix-vector products:

lanczos(
    operator: CurvatureOperator,
    k: int = 10,
    max_iter: int = 100,
    tol: float = 1e-3,
    which: str = "LA"  # LA, SA, or LM
) -> EigendecompositionResult

Sources: hessian_eigenthings/algorithms/lanczos.py:1-50

Key features:

  • Ritz vector accumulation: Directly accumulates Ritz vectors into final (k, n) layout via rank-1 outer-product updates, avoiding transient (n, k) transpose copies
  • Convergence tracking: Monitors residual norms |β_k · s_{k}| to determine convergence
  • Eigenvalue selection: Supports "LA" (largest algebraic), "SA" (smallest algebraic), and "LM" (largest magnitude)

#### Trace Estimation

The library provides multiple trace estimation methods:

| Method | Description | Samples Required |
|---|---|---|
| hutchinson | Classical Hutchinson: (1/m) Σ vᵢᵀ A vᵢ | Higher variance |
| hutch++ | Improved estimator with lower variance | ~30 recommended |

```python
trace(
    operator: CurvatureOperator,
    num_matvecs: int = 30,
    method: str = "hutch++",
    seed: int | None = None
) -> TraceResult
```

Sources: hessian_eigenthings/algorithms/trace.py:1-40

The Hutch++ estimator achieves lower variance by combining a low-rank sketch of the operator with Hutchinson probing on the deflated remainder.
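A hedged sketch of the Hutch++ recipe (sketch, deflate, then probe the remainder), written against a bare matvec callable rather than the library's operator class:

```python
import torch

def hutchpp_trace(matvec, n, num_matvecs=30, seed=0):
    """Hutch++: tr(Qᵀ A Q) exactly on a range sketch Q, plus Hutchinson
    probing of the deflated remainder; uses ~num_matvecs matvec calls."""
    gen = torch.Generator().manual_seed(seed)
    m = num_matvecs // 3
    S = (torch.randint(0, 2, (n, m), generator=gen) * 2 - 1).float()
    AS = torch.stack([matvec(S[:, i]) for i in range(m)], dim=1)
    Q, _ = torch.linalg.qr(AS)                          # orthonormal range sketch
    t1 = sum(torch.dot(Q[:, i], matvec(Q[:, i])) for i in range(Q.shape[1]))
    G = (torch.randint(0, 2, (n, m), generator=gen) * 2 - 1).float()
    G = G - Q @ (Q.T @ G)                               # deflate: (I − QQᵀ) G
    t2 = sum(torch.dot(G[:, i], matvec(G[:, i])) for i in range(m)) / m
    return (t1 + t2).item()

A = torch.diag(torch.arange(1.0, 101.0))                # trace = 5050
print(hutchpp_trace(lambda v: A @ v, n=100))            # ≈ 5050
```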

3. Backend Layer

The LinAlgBackend abstract interface decouples linear algebra operations from specific device implementations:

classDiagram
    class LinAlgBackend~T~ {
        <<abstract>>
        +matmul(a, b) T
        +dot(a, b) T
        +norm(v) T
        +fill(v, value) T
        +copy(v) T
    }
    
    class SingleDeviceBackend {
        +matmul(a, b) Tensor
        +dot(a, b) Tensor
        +norm(v) Tensor
    }
    
    LinAlgBackend <|-- SingleDeviceBackend

Backends provide:

  • Vector arithmetic operations (dot product, norm, fill, copy)
  • Device-specific optimizations
  • Memory allocation strategies

Data Flow

Eigendecomposition Workflow

sequenceDiagram
    participant User
    participant Operator as CurvatureOperator
    participant Backend as LinAlgBackend
    participant Algo as Lanczos Algorithm
    participant PyTorch as PyTorch Autograd
    
    User->>Operator: Instantiate with model, dataloader
    User->>Algo: Call lanczos(operator, k)
    Algo->>Operator: Request matvec(v)
    Operator->>Backend: Allocate probe vector
    Backend->>PyTorch: Create tensor
    Operator->>PyTorch: Forward pass + backward
    PyTorch-->>Operator: Return HVP result
    Operator-->>Algo: Return Av
    Algo->>Algo: Repeat for m iterations
    Algo-->>User: Return eigenvalues, eigenvectors

Loss Function Integration

The library supports two loss function patterns:

graph LR
    subgraph "Single Function API"
        L1[loss_fn<br/>model, batch → scalar]
    end
    
    subgraph "Two Function API (for GGN)"
        F1[forward_fn<br/>model, batch → output]
        L2[loss_of_output_fn<br/>output, batch → scalar]
    end
    
    L1 --> HO[HessianOperator]
    F1 --> GGN[GGNOperator]
    L2 --> GGN

#### HuggingFace Integration

For language models, the library provides optimized loss functions:

hf_lm_loss(fused="auto")  # Auto-selects Triton or torch.compile

Sources: hessian_eigenthings/loss_fns/huggingface.py:1-30

The fused implementation:

  • Uses Triton kernels on CUDA (~3.4x speedup, 2x peak-memory reduction)
  • Falls back to torch.compile (~2.6x speedup, 2x peak-memory reduction)
  • Eliminates most (N, V) intermediates

Configuration Options

HessianOperator Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| model | nn.Module | Required | PyTorch model |
| dataloader | Iterable | Required | Data batches |
| loss_fn | LossFn | Required | Loss computation function |
| param_filter | ParamFilter | None | Filter parameters by name |
| method | HvpMethod | "autograd" | HVP computation method |
| fd_eps | float | None | Finite difference epsilon |

GGNOperator Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| loss_hvp | str | "analytical" | "analytical" or "autograd" |
| full_dataset | bool | True | Average over full dataset |
| num_batches | int | None | Limit to first N batches |
| microbatch_size | int | None | Process in smaller chunks |

Lanczos Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| k | int | 10 | Number of eigenvalues to compute |
| max_iter | int | 100 | Maximum Lanczos iterations |
| tol | float | 1e-3 | Convergence tolerance |
| which | str | "LA" | Which eigenvalues ("LA", "SA", "LM") |
| reorthogonalize | bool | False | Full reorthogonalization |

Usage Patterns

Basic Hessian Eigenvalue Computation

from hessian_eigenthings import HessianOperator, lanczos

# Create operator
hessian_op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=lambda m, b: torch.nn.functional.cross_entropy(m(b[0]), b[1])
)

# Compute top eigenvalues
eig_result = lanczos(hessian_op, k=10, max_iter=100)
print(eig_result.eigenvalues)

Parameter-Filtered Analysis

from hessian_eigenthings import HessianOperator, lanczos

# Analyze only attention parameters
attn_op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=match_names("blocks.*.attn.*")
)
eig_attn = lanczos(attn_op, k=3)

Trace Estimation

from hessian_eigenthings import HessianOperator, trace

trace_result = trace(
    hessian_op,
    num_matvecs=30,
    method="hutch++",
    seed=42
)
print(f"Trace estimate: {trace_result.estimate:.4e}")

Architecture Benefits

| Benefit | Description |
|---|---|
| Separation of Concerns | Operators define "what" to compute; algorithms define "how" |
| Flexibility | Any operator can use any algorithm |
| Scalability | Backends enable device-specific optimizations |
| Composability | Easy to add new operators or algorithms |
| Memory Efficiency | Matrix-free design avoids explicit matrix storage |

Extension Points

Adding Custom Curvature Operators

New operators should subclass CurvatureOperator and implement the matvec method:

class CustomCurvatureOperator(CurvatureOperator):
    def __init__(self, model, dataloader):
        super().__init__()
        self.model = model
        self.dataloader = dataloader
        # Register parameters
    
    def matvec(self, v: torch.Tensor) -> torch.Tensor:
        # Implement A @ v
        return custom_computation(v)

Adding New Algorithms

Algorithms should accept any CurvatureOperator and use the backend exclusively:

def custom_algorithm(
    operator: CurvatureOperator,
    backend: LinAlgBackend | None = None
) -> SomeResult:
    backend = backend or SingleDeviceBackend()
    # Use backend for all vector operations

Summary

The system architecture of pytorch-hessian-eigenthings follows a clean, modular design that separates curvature matrix computation (operators), numerical algorithms (Lanczos, trace estimation), and linear algebra primitives (backends). This design enables efficient Hessian and GGN eigendecomposition for models ranging from small MLPs to large language models, with support for distributed training and optimized fused computations.

Sources: hessian_eigenthings/operators/hessian.py:1-50

Curvature Operators

Related topics: System Architecture, Distributed Computing with DDP


Overview

Curvature Operators in hessian-eigenthings provide a matrix-free abstraction for computing Hessian eigendecomposition and related curvature matrices for arbitrary PyTorch models. They implement the CurvatureOperator base class interface, enabling efficient computation of eigenvalues, eigenvectors, traces, and spectral densities without explicitly forming potentially massive matrices.

The core abstraction allows algorithms (Lanczos, power iteration, Hutch++) to operate on any curvature matrix through a unified matvec(v) interface that computes $Av$ for any vector $v$, enabling scalability to large models with billions of parameters.

Sources: hessian_eigenthings/__init__.py:1-10

Architecture

graph TD
    subgraph "Curvature Operators"
        Base[CurvatureOperator<br/>Base Class]
        Hessian[HessianOperator]
        GGN[GGNOperator]
        Fisher[EmpiricalFisherOperator]
        Lambda[LambdaOperator]
        DDP[DDPHessianOperator]
    end
    
    subgraph "Algorithms"
        Lanczos[Lanczos Eigendecomposition]
        Power[Power Iteration]
        Trace[Trace Estimation<br/>Hutch++/Hutchinson]
        Spectral[Spectral Density<br/>Stochastic Lanczos Quadrature]
    end
    
    Base --> Hessian
    Base --> GGN
    Base --> Fisher
    Base --> Lambda
    Base --> DDP
    
    Hessian --> Lanczos
    GGN --> Lanczos
    Fisher --> Lanczos
    Lambda --> Lanczos
    
    Hessian --> Trace
    GGN --> Trace
    Fisher --> Trace
    
    Hessian --> Power
    GGN --> Power
    Fisher --> Power
    
    Hessian --> Spectral
    GGN --> Spectral
    Fisher --> Spectral

Base Class: CurvatureOperator

All curvature operators inherit from CurvatureOperator, which defines the contract that subclasses must fulfill.

Core Interface

| Method | Description |
|---|---|
| matvec(v) | Compute $Av$ where $A$ is the curvature matrix |
| size | Total number of parameters in the operator's scope |
| dtype, device | Tensor dtype and device for vector operations |

Sources: hessian_eigenthings/operators/base.py

Parameter Filtering

Curvature operators can be restricted to subsets of model parameters using param_filter, enabling analysis of specific components (e.g., attention layers only).

from hessian_eigenthings import HessianOperator, match_names

# Filter to attention parameters only
attn_op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=match_names("blocks.*.attn.*")
)

Sources: hessian_eigenthings/__init__.py:35-38

HessianOperator

Computes the Hessian $\nabla_{\theta}^2 \mathcal{L}$ of the loss function with respect to model parameters.

Key Features

  • Two HVP methods: autograd (exact double-backward via torch.autograd.grad) and finite_difference (central difference for FSDP/TP compatibility)
  • Batched computation: Automatically averages over multiple batches from the dataloader
  • Microbatch support: For large models, process batches in smaller microbatches

Constructor Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| model | nn.Module | Required | PyTorch model |
| dataloader | Iterable | Required | Data batches |
| loss_fn | LossFn | Required | Loss computation function |
| param_filter | `ParamFilter \| None` | None | Parameter name filter |
| full_dataset | bool | True | Average over full dataset |
| num_batches | `int \| None` | None | Limit batches for stochastic estimate |
| microbatch_size | `int \| None` | None | Split batches into smaller microbatches |
| method | HvpMethod | "autograd" | HVP computation method |
| fd_eps | `float \| None` | None | Finite difference epsilon |
| backend | `LinAlgBackend \| None` | None | Linear algebra backend |

HVP Method Comparison

| Method | Accuracy | Memory | FSDP/TP Compatible | Speed |
|---|---|---|---|---|
| autograd | Exact (to rounding) | High | No | Fast |
| finite_difference | $O(\epsilon^2)$ bias | Low | Yes | 2x passes |

Sources: hessian_eigenthings/operators/hessian.py:1-60

Finite Difference Epsilon Table

| Dtype | Optimal $\epsilon$ |
|---|---|
| float64 | 6e-6 |
| float32 | 5e-3 |
| bfloat16 | 0.2 |
| float16 | 5e-2 |

Sources: hessian_eigenthings/operators/hessian.py:34-40

GGNOperator

The Generalized Gauss-Newton (GGN) matrix $G = J^T H_{loss} J$ provides a PSD approximation to the Hessian that is computationally cheaper while preserving the eigenvalues that matter for optimization.

Key Features

  • Always PSD: Unlike the exact Hessian, the GGN is positive semi-definite by construction
  • Analytical HVP path: For losses with known HVP (e.g., cross-entropy), uses analytical computation
  • For cross-entropy + softmax: $G$ equals the Fisher information matrix

Two Matvec Implementations

| loss_hvp | Description | Memory | Use Case |
|---|---|---|---|
| "analytical" (default) | FD JVP + analytical loss-Hessian-vec + one backward | Matches one training step | LM-scale, large vocab |
| "autograd" | torch.func.jvp + double-backward + torch.func.vjp | Scales with output size | Exact, small vocab |

Sources: hessian_eigenthings/operators/ggn.py:1-50

Fused Cross-Entropy HVP

For language model training, the GGN operator includes a fused kernel for the CE HVP computation:

# Auto-selects fastest backend: Triton > torch.compile > eager
hf_lm_loss_of_output(..., fused="auto")

The fused implementation reduces peak memory by 2x compared to eager, with Triton providing ~3.4x speedup on CUDA.

Sources: hessian_eigenthings/loss_fns/huggingface.py:1-30

EmpiricalFisherOperator

Computes the empirical Fisher information matrix $F = \frac{1}{N} \sum_{i=1}^N \nabla_{\theta} \log p(y_i|x_i) \nabla_{\theta} \log p(y_i|x_i)^T$.

For classification with cross-entropy loss, the empirical Fisher approximates the GGN; the two coincide when the gradient outer products are taken in expectation over the model's predictive distribution rather than over the observed labels.

Sources: hessian_eigenthings/operators/fisher.py

LambdaOperator

Creates custom curvature operators from lambda functions for testing or custom curvature definitions.

from hessian_eigenthings import LambdaOperator

# Custom operator that always returns a scaled vector
custom_op = LambdaOperator(
    size=1000,
    matvec=lambda v: 2.0 * v  # Represents 2*I
)
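Because every algorithm touches only matvec, such a stub can be passed straight to the eigensolvers; a short illustrative check (assuming the lanczos entry point shown earlier in this manual):

```python
from hessian_eigenthings import lanczos

# custom_op above represents 2·I, so every eigenvalue should be ≈ 2.0.
result = lanczos(custom_op, k=3)
print(result.eigenvalues)  # expect three values close to 2.0
```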

DDPHessianOperator

Distributed Data Parallel-aware Hessian operator that handles gradient synchronization across processes.

Sources: hessian_eigenthings/operators/__init__.py:15-18

Common Usage Patterns

Computing Top Eigenvalues

from hessian_eigenthings import HessianOperator, lanczos

operator = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn
)

result = lanczos(operator, k=10, max_iter=100)
print(f"Top eigenvalues: {result.eigenvalues}")

Estimating Trace

from hessian_eigenthings import GGNOperator, trace

operator = GGNOperator(
    model=model,
    dataloader=dataloader,
    forward_fn=model_forward,
    loss_of_output_fn=loss_fn
)

result = trace(operator, num_matvecs=100, method="hutch++")
print(f"Trace estimate: {result.estimate:.4e} ± {result.stderr:.4e}")

Component-Specific Analysis

from hessian_eigenthings import HessianOperator, match_regex

# Analyze only attention weights in transformer
attn_only = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=match_regex(r"blocks\.\d+\.attn\.")
)

# Analyze only MLP weights
mlp_only = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=match_regex(r"blocks\.\d+\.mlp\.")
)

Linear Algebra Backends

The operators use pluggable LinAlgBackend for vector operations, enabling support for different hardware configurations and precision requirements.

| Backend | Use Case |
|---|---|
| SingleDeviceBackend | Single GPU/CPU |
| (Distributed backends) | Multi-GPU via FSDP/TP |

Module Exports

from hessian_eigenthings.operators import (
    CurvatureOperator,
    DDPHessianOperator,
    EmpiricalFisherOperator,
    GGNOperator,
    HessianOperator,
    LambdaOperator,
)

Sources: hessian_eigenthings/operators/__init__.py:1-20

Sources: [hessian_eigenthings/__init__.py:1-10](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/__init__.py)

Eigendecomposition Algorithms

Related topics: System Architecture, Why Hessian-Vector Products


The hessian-eigenthings library provides a suite of efficient iterative algorithms for computing eigendecompositions of curvature matrices (Hessian, Generalized Gauss-Newton, and Fisher) in PyTorch models. These algorithms enable analysis of neural network loss landscapes by extracting eigenvalues, eigenvectors, spectral densities, and trace estimates without explicitly constructing the full curvature matrix—a critical capability for modern large-scale models.

Overview

Computing eigendecompositions of curvature matrices is fundamental to understanding generalization properties, flat minima, and training dynamics of neural networks. However, these curvature matrices are prohibitively large (n × n where n is the number of parameters), making explicit construction impossible for modern models.

The library implements Krylov subspace methods that only require matrix-vector products, enabling efficient computation of:

| Capability | Algorithm | Use Case |
| --- | --- | --- |
| Top-k eigenvalues/eigenvectors | Lanczos, Power Iteration | Finding most-curved directions |
| Trace estimation | Hutchinson, Hutch++ | Computing average curvature |
| Spectral density | Stochastic Lanczos Quadrature | Visualizing eigenvalue distribution |

Sources: hessian_eigenthings/algorithms/__init__.py:1-29

Algorithm Architecture

The algorithms in this module follow a consistent design pattern: they accept any CurvatureOperator and use the LinAlgBackend exclusively for vector arithmetic, ensuring portability across single-device and distributed settings.

graph TD
    A[CurvatureOperator] --> B[Lanczos Algorithm]
    A --> C[Power Iteration]
    A --> D[Trace Estimation]
    A --> E[Spectral Density]
    
    B --> F[EigenResult]
    C --> F
    D --> G[TraceResult]
    E --> H[SpectralDensityResult]
    
    F --> I[eigenvalues: Tensor]
    F --> J[eigenvectors: Tensor]
    F --> K[residuals: Tensor]

Sources: hessian_eigenthings/algorithms/result.py

Lanczos Algorithm

The Lanczos algorithm is the primary method for computing eigenvalues and eigenvectors of symmetric matrices. It builds a Krylov subspace through repeated matrix-vector products, then solves the small tridiagonal eigenvalue problem.
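For intuition, here is a minimal textbook version of the symmetric Lanczos recurrence against a bare matvec callable (a sketch under simplified assumptions, not the library's implementation; names are illustrative):

import torch

def lanczos_sketch(matvec, n: int, m: int, reorthogonalize: bool = True):
    """m-step symmetric Lanczos: tridiagonal coefficients plus the Krylov basis,
    given only matvec(v) -> A @ v for a symmetric operator A."""
    q = torch.randn(n)
    q = q / q.norm()
    basis, alphas, betas = [q], [], []
    q_prev, beta = torch.zeros(n), 0.0
    for _ in range(m):
        w = matvec(q) - beta * q_prev          # three-term recurrence
        alpha = torch.dot(w, q)
        w = w - alpha * q
        if reorthogonalize:                    # full reorthogonalization against all prior vectors
            for u in basis:
                w = w - torch.dot(w, u) * u
        alphas.append(alpha.item())
        beta = w.norm().item()
        if beta < 1e-10:                       # invariant subspace reached; stop early
            break
        betas.append(beta)
        q_prev, q = q, w / beta
        basis.append(q)
    return alphas, betas, basis

The Ritz values, i.e. the eigenvalues of the small tridiagonal matrix with alphas on the diagonal and betas off it, approximate the extreme eigenvalues of A and can be computed densely (e.g., with scipy.linalg.eigh_tridiagonal).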

Symmetric Lanczos Implementation

The in-house Lanczos implementation provides optional full reorthogonalization to address the loss-of-orthogonality issues classical Lanczos is known for (Paige 1976).

def lanczos_tridiagonal(
    operator: CurvatureOperator,
    v0: torch.Tensor,
    max_iter: int,
    *,
    reorthogonalize: bool = True,
    backend: LinAlgBackend[torch.Tensor] | None = None,
) -> LanczosTridiag

Key characteristics:

  • Default reorthogonalization: Enabled for max_iter <= 50 to suppress ghost eigenvalues
  • Computational tradeoff: For larger Krylov dimensions, reorthogonalization becomes O(m²n); users analyzing near-degenerate spectra should re-enable it
  • Memory efficiency: Accumulates Ritz vectors directly via rank-1 outer-product updates, avoiding transient (n, k) → (k, n) transpose copies

Sources: hessian_eigenthings/algorithms/lanczos.py:30-58

Lanczos Output Structure

@dataclass(frozen=True)
class LanczosTridiag:
    """Output of one Lanczos run: tridiagonal coefficients + the basis used to build them."""
    alphas: torch.Tensor      # (m,) diagonal
    betas: torch.Tensor       # (m-1,) off-diagonal
    basis: list[torch.Tensor] # length m, each (n,)
    last_beta: float          # ||r_m|| residual norm at termination
    iterations: int          # m, the actual number of Lanczos steps completed

Sources: hessian_eigenthings/algorithms/lanczos.py:30-43

Full Lanczos Eigensolver

The high-level lanczos() function computes top-k eigenvalues with configurable eigenpair selection:

def lanczos(
    operator: CurvatureOperator,
    k: int = 1,
    max_iter: int = 100,
    *,
    which: Which = "LM",
    tol: float = 1e-8,
    seed: int | None = None,
    reorthogonalize: bool | None = None,
    backend: LinAlgBackend[torch.Tensor] | None = None,
) -> EigenResult

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| operator | CurvatureOperator | required | The curvature matrix operator |
| k | int | 1 | Number of eigenpairs to compute |
| max_iter | int | 100 | Maximum Lanczos iterations |
| which | Literal["LM", "LA", "SA"] | "LM" | Which eigenvalues: LM = largest magnitude, LA = largest algebraic, SA = smallest algebraic |
| tol | float | 1e-8 | Convergence tolerance |
| seed | int \| None | None | Random seed for reproducibility |
| reorthogonalize | bool \| None | None | Override the default reorthogonalization setting |

Sources: hessian_eigenthings/algorithms/lanczos.py:58-100

Eigenvalue Selection Logic

The algorithm selects eigenvalues based on the which parameter:

if which == "LM":
    order = torch.argsort(theta.abs(), descending=True)
elif which == "LA":
    order = torch.argsort(theta, descending=True)
elif which == "SA":
    order = torch.argsort(theta, descending=False)

Sources: hessian_eigenthings/algorithms/lanczos.py:75-83

Power Iteration

Power iteration is a simpler method for finding the dominant eigenvalue. The library implements deflated power iteration to compute multiple eigenpairs sequentially by projecting out previously found directions.

Single Power Iteration

def power_iteration_one(
    operator: CurvatureOperator,
    v0: torch.Tensor,
    max_iter: int,
    tol: float = 1e-6,
    backend: LinAlgBackend | None = None,
) -> tuple[torch.Tensor, torch.Tensor]

Sources: hessian_eigenthings/algorithms/power_iteration.py

Deflated Power Iteration

Deflated power iteration extends the basic method to compute multiple eigenpairs:

def deflated_power_iteration(
    operator: CurvatureOperator,
    num_eigs: int,
    max_iter: int,
    tol: float = 1e-6,
    seed: int | None = None,
    backend: LinAlgBackend | None = None,
) -> EigenResult

The deflation process removes previously found eigenvectors from the subspace before searching for the next eigenpair, preventing convergence to already-computed directions.
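The idea can be sketched in a few lines (illustrative only; the library's internals may differ, and the function names here are hypothetical):

import torch

def deflate(v: torch.Tensor, found: list[torch.Tensor]) -> torch.Tensor:
    """Project v onto the orthogonal complement of already-found eigenvectors."""
    for q in found:
        v = v - torch.dot(v, q) * q
    return v

def deflated_power_iteration_sketch(matvec, n: int, num_eigs: int, max_iter: int = 100):
    eigenvalues, eigenvectors = [], []
    for _ in range(num_eigs):
        v = deflate(torch.randn(n), eigenvectors)
        v = v / v.norm()
        for _ in range(max_iter):
            w = deflate(matvec(v), eigenvectors)  # keep iterates off the found subspace
            v = w / w.norm()
        eigenvalues.append(torch.dot(v, matvec(v)).item())  # Rayleigh quotient
        eigenvectors.append(v)
    return eigenvalues, eigenvectors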

Sources: hessian_eigenthings/algorithms/power_iteration.py

Trace Estimation

Trace estimation provides a way to compute the average eigenvalue (trace / dimension) without full eigendecomposition. This is computationally much cheaper and useful for understanding overall curvature magnitude.

Hutchinson's Estimator

Hutchinson's method estimates the trace using random probe vectors:

def hutchinson(
    operator: CurvatureOperator,
    *,
    num_samples: int = 100,
    distribution: Distribution = "rademacher",
    seed: int | None = None,
    backend: LinAlgBackend[torch.Tensor] | None = None,
) -> TraceResult

The estimator computes: (1/m) Σ vᵢᵀ A vᵢ

Where vᵢ are random vectors from the specified distribution. Rademacher distribution provides lower variance than Gaussian.
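A minimal sketch of the estimator against any matvec callable (illustrative, not the library's code):

import torch

def hutchinson_sketch(matvec, n: int, num_samples: int = 100, seed: int = 0):
    """Estimate tr(A) as the sample mean of v^T A v over Rademacher probes."""
    gen = torch.Generator().manual_seed(seed)
    quads = []
    for _ in range(num_samples):
        v = torch.randint(0, 2, (n,), generator=gen).float() * 2 - 1  # entries in {-1, +1}
        quads.append(torch.dot(v, matvec(v)).item())
    quads = torch.tensor(quads)
    return quads.mean().item(), (quads.std() / num_samples ** 0.5).item()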

Sources: hessian_eigenthings/algorithms/trace.py:48-71

Hutch++ Estimator

Hutch++ is an improved estimator with better convergence properties:

def hutch_plus_plus(
    operator: CurvatureOperator,
    *,
    num_matvecs: int = 30,
    seed: int | None = None,
    backend: LinAlgBackend[torch.Tensor] | None = None,
) -> TraceResult

Hutch++ uses a structured random sampling approach that achieves lower variance than standard Hutchinson with the same number of matrix-vector products.
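The structure behind that variance reduction can be sketched as follows (illustrative only; function names are hypothetical): the matvec budget is split into a low-rank range sketch, an exact trace on that sketch, and a Hutchinson pass on the residual.

import torch

def hutch_pp_sketch(matvec, n: int, num_matvecs: int = 30, seed: int = 0):
    """Hutch++: exact trace of a low-rank range sketch plus Hutchinson
    on the residual, splitting the matvec budget into three equal parts."""
    gen = torch.Generator().manual_seed(seed)
    k = num_matvecs // 3
    rademacher = lambda cols: torch.randint(0, 2, (n, cols), generator=gen).float() * 2 - 1
    S = rademacher(k)
    AS = torch.stack([matvec(S[:, i]) for i in range(k)], dim=1)   # k matvecs
    Q, _ = torch.linalg.qr(AS)                                     # orthonormal basis for the sketched range
    AQ = torch.stack([matvec(Q[:, i]) for i in range(Q.shape[1])], dim=1)  # k matvecs
    trace_top = torch.einsum("ij,ij->", Q, AQ)                     # tr(Q^T A Q), exact on the sketch
    G = rademacher(k)
    G = G - Q @ (Q.T @ G)                                          # project probes off the sketched range
    AG = torch.stack([matvec(G[:, i]) for i in range(k)], dim=1)   # k matvecs
    trace_rest = torch.einsum("ij,ij->", G, AG) / k                # Hutchinson on the residual
    return (trace_top + trace_rest).item()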

Sources: hessian_eigenthings/algorithms/trace.py:22-47

Unified Trace Interface

def trace(
    operator: CurvatureOperator,
    num_matvecs: int = 30,
    *,
    method: Literal["hutchinson", "hutch++"] = "hutch++",
    seed: int | None = None,
    backend: LinAlgBackend[torch.Tensor] | None = None,
) -> TraceResult

Validation: The num_matvecs parameter is validated to be at least 1.

Sources: hessian_eigenthings/algorithms/trace.py:71-84

Trace Result Structure

@dataclass
class TraceResult:
    estimate: float      # The trace estimate
    stderr: float       # Standard error of the estimate
    num_matvecs: int     # Number of matrix-vector products used
    operator_size: int   # Dimension of the operator

Sources: hessian_eigenthings/algorithms/result.py

Spectral Density

Spectral density estimation computes the eigenvalue distribution (density function) across the spectrum, enabling visualization and analysis of the full eigenvalue structure.

Stochastic Lanczos Quadrature

The spectral_density() function implements Stochastic Lanczos Quadrature (SLQ) to compute the spectral density:

def spectral_density(
    operator: CurvatureOperator,
    num_runs: int = 16,
    lanczos_steps: int = 50,
    seed: int | None = None,
    backend: LinAlgBackend[torch.Tensor] | None = None,
) -> SpectralDensityResult

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| operator | CurvatureOperator | required | The curvature matrix operator |
| num_runs | int | 16 | Number of randomized runs for averaging |
| lanczos_steps | int | 50 | Lanczos iterations per run |
| seed | int \| None | None | Random seed |

Sources: hessian_eigenthings/algorithms/spectral_density.py

Spectral Density Result

@dataclass
class SpectralDensityResult:
    grid: torch.Tensor      # Eigenvalue grid points
    density: torch.Tensor   # Density values at each grid point
    eigenvalues: list[torch.Tensor]  # Eigenvalues from each run
    eigenvectors: list[list[torch.Tensor]]  # Corresponding eigenvectors

The spectral density integrates to 1: ∫ density(λ) dλ ≈ 1, which can be verified using numerical integration.
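For example, assuming a SpectralDensityResult named density_result (a usage sketch):

import torch

# Trapezoidal-rule check that the estimated density is normalized
mass = torch.trapezoid(density_result.density, density_result.grid)
print(f"integral of density ≈ {mass.item():.4f}")  # should be close to 1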

Sources: hessian_eigenthings/algorithms/result.py

Common Result Types

All algorithms return standardized result objects that encapsulate the computed quantities along with metadata about the computation.

@dataclass
class EigenResult:
    eigenvalues: torch.Tensor       # (k,) tensor of eigenvalues
    eigenvectors: torch.Tensor      # (k, n) matrix of eigenvectors
    residuals: torch.Tensor          # (k,) convergence residuals
    iterations: int                 # Number of iterations run
    converged: bool                 # Whether all eigenpairs converged

Sources: hessian_eigenthings/algorithms/result.py

Algorithm Selection Guide

graph LR
    A[Goal] --> B{Eigenpairs?}
    B -->|Yes, top-k| C[How many?]
    C -->|1-10| D[Lanczos]
    C -->|Many| E{Orthogonality critical?}
    E -->|Yes| D
    E -->|No| F[Deflated Power Iteration]
    
    B -->|Trace only| G{Accuracy priority?}
    G -->|High| H[Hutch++]
    G -->|Standard| I[Hutchinson]
    
    B -->|Full distribution| J[Spectral Density]
    
    D --> K[EigenResult]
    F --> K
    H --> L[TraceResult]
    I --> L
    J --> M[SpectralDensityResult]

Decision Criteria

| Scenario | Recommended Algorithm | Notes |
| --- | --- | --- |
| Top eigenvalues for single/large batch | lanczos() | Best accuracy, moderate cost |
| Quick dominant eigenvalue | deflated_power_iteration() | Lower memory, less accurate |
| Trace with limited matvecs | hutch_plus_plus() | Better convergence than Hutchinson |
| Trace estimation | hutchinson() | Simpler, more matvecs needed |
| Eigenvalue histogram/distribution | spectral_density() | Visualize full spectrum |

Integration with Curvature Operators

The algorithms are designed to work with any CurvatureOperator implementation, including:

  • HessianOperator: Exact Hessian via autograd or finite differences
  • GGNOperator: Generalized Gauss-Newton matrix
  • EmpiricalFisherOperator: Empirical Fisher information matrix
  • DDPHessianOperator: Distributed Data Parallel Hessian

This abstraction allows the same algorithm code to work across different curvature definitions without modification.

Sources: hessian_eigenthings/algorithms/__init__.py:1-29

Performance Considerations

Memory Efficiency in Lanczos

The Lanczos implementation optimizes memory for large-scale models by:

  1. Avoiding allocation of full (n, m) basis matrix
  2. Using rank-1 outer-product updates for eigenvector accumulation
  3. Computing Ritz vectors directly into final (k, n) layout

Reorthogonalization Tradeoffs

| Setting | Memory | Computation | Accuracy |
| --- | --- | --- | --- |
| reorthogonalize=True | O(mn) | O(m²n) | High orthogonality |
| reorthogonalize=False | O(mn) basis list | O(mn) | May have ghost eigenvalues |

For max_iter <= 50, reorthogonalization is enabled by default. For larger Krylov dimensions, it defaults off to maintain acceptable performance.

Sources: hessian_eigenthings/algorithms/lanczos.py:23-28

Example Usage

from hessian_eigenthings import HessianOperator, lanczos, trace, spectral_density

# Create Hessian operator
operator = HessianOperator(model, dataloader, loss_fn)

# Compute top-5 eigenvalues and eigenvectors
eig_result = lanczos(operator, k=5, max_iter=40, tol=1e-7, seed=0)
print(f"Top eigenvalue: {eig_result.eigenvalues[0]}")

# Estimate trace with Hutch++
trace_result = trace(operator, num_matvecs=99, method="hutch++", seed=0)
print(f"Trace estimate: {trace_result.estimate}")

# Compute spectral density
density_result = spectral_density(operator, num_runs=8, lanczos_steps=40, seed=0)
# Visualize with: plt.plot(density_result.grid, density_result.density)

Sources: examples/supervised_mlp.py:1-50

Sources: hessian_eigenthings/algorithms/__init__.py:1-29

Loss Functions

Related topics: Curvature Operators, Parameter Utilities


Loss functions in this repository serve as the bridge between model outputs and curvature operators (Hessian and Generalized Gauss-Newton matrices). They provide the necessary computations for Hessian-vector products and support multiple backend implementations optimized for different use cases.

Overview

The loss functions module (hessian_eigenthings/loss_fns/) provides two distinct function signatures depending on the target operator:

| Function Type | Signature | Used By |
| --- | --- | --- |
| loss_fn | (model: nn.Module, batch: Any) -> torch.Tensor | HessianOperator |
| loss_of_output_fn | (output: torch.Tensor, batch: Any) -> torch.Tensor | GGNOperator |

Sources: hessian_eigenthings/operators/hessian.py:1-50 Sources: hessian_eigenthings/operators/ggn.py:1-60

Architecture

graph TD
    A[Loss Function Entry Points] --> B[Standard Losses]
    A --> C[HuggingFace Losses]
    A --> D[TransformerLens Losses]
    
    B --> B1[MSE Loss]
    B --> B2[Cross-Entropy Loss with HVP]
    
    C --> C1[Autoregressive LM Loss]
    C --> C2[Shifted CE with Analytical HVP]
    C --> C3[Fused CE HVP Backends]
    
    D --> D1[TransformerLens HookedModel Loss]
    
    C3 --> C3a[Triton Kernel]
    C3 --> C3b[torch.compile]
    C3 --> C3c[Eager Reference]

Standard Loss Functions

The standard.py module provides loss functions for common supervised learning scenarios with closed-form Hessian-vector products.

Sources: hessian_eigenthings/loss_fns/standard.py:1-80

MSE Loss

Returns a wrapper compatible with GGNOperator for mean-squared error loss:

def mse_loss_of_output() -> Callable[[torch.Tensor, tuple[torch.Tensor, torch.Tensor]], torch.Tensor]:
    """Make a `loss_of_output_fn` for `GGNOperator` from a (output, target) criterion."""

Cross-Entropy Loss with Analytical HVP

The cross-entropy implementation includes a closed-form Hessian-vector product for efficient computation:

def _ce_hvp(
    output: torch.Tensor, batch: tuple[torch.Tensor, torch.Tensor], u: torch.Tensor
) -> torch.Tensor:
    """Closed-form H @ u for mean-reduced softmax + cross-entropy.
    
    `output` has shape `(N, C)` (logits). For each row,
    `H_row = (diag(p) - p p^T) / N` where `p = softmax(output)`.
    """

Mathematical Foundation:

For mean-reduced softmax + cross-entropy, the Hessian takes the form:

H_row = (diag(p) - p·p^T) / N

where:

  • p = softmax(output) is the predicted probability distribution
  • N is the number of samples
  • u is the vector being multiplied in the Hessian-vector product H @ u (see the sketch below)
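A minimal eager-mode sketch of this closed form (illustrative; the shapes and the function name are assumptions, and the library's reference implementation may differ in details):

import torch

def ce_hvp_sketch(logits: torch.Tensor, u: torch.Tensor) -> torch.Tensor:
    """H @ u for mean-reduced softmax cross-entropy; logits and u are (N, C)."""
    n = logits.shape[0]
    p = torch.softmax(logits, dim=-1)
    # Row-wise: diag(p) @ u = p * u, and (p p^T) @ u = p * <p, u>
    pu = (p * u).sum(dim=-1, keepdim=True)
    return (p * u - p * pu) / n

At toy scale the sketch can be checked against an autograd double backward through torch.nn.functional.cross_entropy.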

Sources: hessian_eigenthings/loss_fns/standard.py:40-55

HuggingFace Transformers Integration

The huggingface.py module provides loss functions specifically designed for HuggingFace Transformers models. These handle the internal loss computation that occurs when labels are present in the batch.

Sources: hessian_eigenthings/loss_fns/huggingface.py:60-90

Autoregressive Language Model Loss

def hf_lm_loss() -> Callable[[nn.Module, dict[str, Any]], torch.Tensor]:
    """For autoregressive LMs: `loss_fn(model, batch)` calls `model(**batch).loss`."""

The batch must include labels so HuggingFace computes the loss internally. For causal language models, this is typically labels=input_ids with the standard internal shift.

Shifted Cross-Entropy with Analytical HVP

For large-scale language model analysis, a shifted cross-entropy variant provides both the loss function and its analytical Hessian-vector product:

def hf_lm_shifted_ce(fused: FusedCEHvpBackend = "auto") -> _LossOfOutputWithHvp:
    """Shifted CE loss with analytical H @ u for autoregressive LMs."""

Shift Mechanism (sketched below):

  • The loss shifts logits left (discards last position) and labels right (discards first position)
  • Matches how cross_entropy(ignore_index=-100) handles gradient computation
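A minimal sketch of the shift (illustrative; the actual HuggingFace helper differs in details):

import torch.nn.functional as F

def shifted_ce_sketch(logits, labels):
    """Next-token cross-entropy: position t predicts token t+1 (illustrative)."""
    shift_logits = logits[:, :-1, :]   # drop the final position's logits
    shift_labels = labels[:, 1:]       # drop the first token
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,             # masked positions contribute no gradient
    )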

Sources: hessian_eigenthings/loss_fns/huggingface.py:1-50

Fused Cross-Entropy HVP Backends

The _fused_ce_hvp.py module implements optimized backends for computing the cross-entropy Hessian-vector product.

Sources: hessian_eigenthings/loss_fns/_fused_ce_hvp.py:1-50

Backend Selection

| Backend | Description | Performance | Availability |
| --- | --- | --- | --- |
| "auto" | Auto-select fastest available | Optimal | Default |
| "triton" | Hand-written CUDA Triton kernel | ~3.4x speedup, 2x memory reduction | CUDA + Triton |
| "compile" | torch.compile-fused | ~2.6x speedup, 2x memory reduction | torch >= 2.0 |
| "eager" | Plain PyTorch reference | Baseline | Always |

FusedCEHvpBackend = Literal["auto", "eager", "compile", "triton"]

Sources: hessian_eigenthings/loss_fns/huggingface.py:25-35

Backend Resolution Logic

graph LR
    A[Backend: "auto"] --> B{CUDA + Triton available?}
    B -->|Yes| C[Triton Kernel]
    B -->|No| D{torch.compile available?}
    D -->|Yes| E[torch.compile Backend]
    D -->|No| F[Eager Backend]

The resolution checks:

  1. If backend != "auto", use the specified backend
  2. If "auto" and CUDA + Triton available → Triton
  3. If "auto" and torch.compile available → compile
  4. Otherwise → eager

Important: The Triton kernel asserts logits.is_cuda, so on CUDA-equipped hosts running CPU inputs, the system falls back to compile.

Sources: hessian_eigenthings/loss_fns/huggingface.py:40-60

Memory Optimization

At LM scale (e.g., B=64, T=256, V=50304, fp32):

| Implementation | Memory Footprint | Intermediate Tensors |
| --- | --- | --- |
| Eager | ~19.6 GB | ~6 (N, V) tensors |
| Compile | ~3.3 GB | ~1 (N, V) tensor |
| Target | ~3.3 GB | Output buffer only |

The fused implementations eliminate intermediates by computing:

out_flat = (p * u - p * <p, u>) * mask / n_valid

with shape (N, V) in a single kernel pass.

Sources: scripts/bench_fused_ce_hvp.py:1-60

Loss Function Wrapper

The _LossOfOutputWithHvp class wraps a loss function with its analytical Hessian-vector product:

class _LossOfOutputWithHvp:
    """Loss-of-output callable that also carries an analytical `.hvp` method.
    
    Wraps a plain `(output, batch) -> loss` function and a `(output, batch, u)
    -> H_loss @ u` function in a single callable. `GGNOperator` checks for the
    presence of `.hvp` and uses it as the loss-Hessian-vector product, skipping
    the autograd `create_graph=True` double-backward path entirely.
    """

Sources: hessian_eigenthings/loss_fns/huggingface.py:100-120

GGN Operator Integration

The GGNOperator automatically detects and uses analytical HVPs when available:

GGNOperator picks this up automatically and skips the autograd double-backward.

Two implementations of the matvec are available via loss_hvp=:

  • "analytical" (default): finite-difference JVP + analytical loss-Hessian-vector product (read from loss_of_output_fn.hvp, which must be present) + a single normal backward to apply J^T. Memory footprint matches one normal training step. Required for LM-scale use.

  • "autograd": the original torch.func.jvp + autograd double-backward + torch.func.vjp path. Numerically exact and supports any loss, but memory scales badly with output size.

Sources: hessian_eigenthings/operators/ggn.py:10-30

TransformerLens Integration

For TransformerLens HookedModel architectures, a dedicated loss function handles the hook-based forward pass:

def tlens_loss() -> Callable[[nn.Module, Any], torch.Tensor]:
    """Loss function for TransformerLens HookedModel."""

Sources: hessian_eigenthings/loss_fns/transformer_lens.py:1-40

Workflow: Choosing a Loss Function

graph TD
    A[Start] --> B{Model Type?}
    B -->|Standard MLP/CNN| C[Use HessianOperator]
    B -->|HuggingFace Transformers| D[Use GGNOperator]
    B -->|TransformerLens| E[Use HessianOperator]
    
    C --> F[standard.mse_loss_of_output]
    C --> G[standard.cross_entropy_loss_of_output]
    
    D --> H{Scale?}
    H -->|Small model| I[hf_lm_loss with GGNOperator]
    H -->|Large model| J[hf_lm_shifted_ce with GGNOperator]
    
    E --> K[tlens_loss with HessianOperator]
    
    J --> L[Choose HVP Backend]
    L --> M{Device?}
    M -->|CUDA + Triton| N[Use Triton backend]
    M -->|CPU/MPS| O[Use compile or eager]

Complete API Reference

Standard Module

| Function | Returns | HVP Available |
| --- | --- | --- |
| mse_loss_of_output() | loss_of_output_fn | No |
| cross_entropy_loss_of_output() | loss_of_output_fn | Yes |
| _ce_hvp() | Analytical HVP | - |

HuggingFace Module

| Function | Returns | HVP Available |
| --- | --- | --- |
| hf_lm_loss() | loss_fn | Via GGNOperator |
| hf_lm_shifted_ce(fused) | _LossOfOutputWithHvp | Yes |
| _LossOfOutputWithHvp | Wrapper class | Via .hvp attribute |

Fused CE HVP Module

| Function | Description |
| --- | --- |
| _ce_hvp_reference() | Eager reference implementation |
| _get_compiled_impl() | Returns torch.compile wrapped version |
| compiled_ce_hvp() | Compiled backend entry point |
| triton_ce_hvp() | Triton kernel entry point |

Usage Examples

Standard Classification

import torch
from hessian_eigenthings.operators import HessianOperator

operator = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=lambda m, b: torch.nn.functional.cross_entropy(m(b[0]), b[1]),
)

HuggingFace Large Model (Memory-Optimized)

from hessian_eigenthings.operators import GGNOperator
from hessian_eigenthings.loss_fns.huggingface import hf_lm_shifted_ce

loss_fn = hf_lm_shifted_ce(fused="auto")  # Auto-selects best backend

operator = GGNOperator(
    model=model,
    dataloader=dataloader,
    forward_fn=lambda m, b: m(**b).logits,
    loss_of_output_fn=loss_fn,
    loss_hvp="analytical",  # Default, uses .hvp attribute
)

Sources: hessian_eigenthings/loss_fns/__init__.py

Sources: [hessian_eigenthings/operators/hessian.py:1-50](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/hessian.py)

Parameter Utilities

Related topics: Curvature Operators


The parameter utilities module (param_utils.py) provides essential infrastructure for managing, filtering, and manipulating PyTorch model parameters within the Hessian eigendecomposition pipeline. These utilities form the foundation that connects curvature operators to the underlying model parameters, enabling efficient computation of Hessian-vector products and eigendecomposition across arbitrary subsets of model parameters.

Overview

When working with large neural networks, it is often necessary to compute curvature information for only a subset of parameters. The parameter utilities support this use case through a flexible filtering mechanism combined with utilities for parameter vectorization, reshaping, and batch management.

The core responsibilities of the parameter utilities include:

  1. Parameter Extraction - Gathering named parameters from PyTorch modules
  2. Parameter Filtering - Selecting subsets of parameters based on name patterns or custom predicates
  3. Vectorization - Flattening parameters into vectors and reshaping vectors back to parameter shapes
  4. Size Tracking - Maintaining offset mappings for efficient vector-to-parameter conversions

Sources: hessian_eigenthings/operators/hessian.py

Core Types and Interfaces

ParamFilter Type

The ParamFilter type alias defines the contract for parameter selection functions:

ParamFilter = Callable[[str, nn.Parameter], bool]

A ParamFilter is a callable that takes two arguments:

  • name: str - The fully-qualified parameter name within the model
  • param: nn.Parameter - The parameter tensor itself

The function returns True if the parameter should be included in the operation, False otherwise.

Sources: hessian_eigenthings/operators/hessian.py:1-50

Parameter Collection Utilities

The module provides functions for extracting and organizing model parameters:

| Function | Purpose |
| --- | --- |
| get_param_names(model) | Returns list of parameter names as fully-qualified strings |
| get_param_list(model) | Returns list of parameter tensors |
| get_param_sizes(model) | Returns list of parameter tensor sizes |
| get_filtered_params(model, param_filter) | Returns filtered parameter names and tensors |

These functions work together to build the data structures required by curvature operators.

Sources: hessian_eigenthings/operators/ggn.py

Parameter Vectorization

Flattening Parameters to Vectors

The utilities support bidirectional conversion between parameter dictionaries and flat vectors. This is essential for Lanczos-based eigendecomposition algorithms that operate on vector spaces.

def flatten_params(param_dict: dict[str, Tensor]) -> Tensor:
    """Flatten all parameters into a single 1D tensor."""

The flattening process concatenates all parameter tensors in a deterministic order, preserving the mapping between parameter names and vector offsets.
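A minimal sketch of the flattening direction (illustrative, not the library's code):

import torch

def flatten_params_sketch(param_dict: dict[str, torch.Tensor]) -> torch.Tensor:
    """Concatenate parameters in insertion order into one 1D tensor."""
    return torch.cat([p.reshape(-1) for p in param_dict.values()])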

Reshaping Vectors Back to Parameters

def unflatten_params(
    vec: Tensor,
    param_names: list[str],
    param_list: list[Tensor],
    sizes: list[torch.Size]
) -> dict[str, Tensor]:
    """Reshape a flat vector back to parameter dictionary."""

The unflattening operation uses offset tracking to slice the vector and reshape each slice to match the original parameter shape:

offset = 0
for name, param, size in zip(param_names, param_list, sizes, strict=True):
    n = size.numel()  # torch.Size.numel() gives the element count
    out[name] = vec[offset : offset + n].reshape_as(param)
    offset += n

Sources: hessian_eigenthings/operators/ggn.py

Parameter Filtering Patterns

Name-Based Filtering with `match_names`

The most common filtering pattern uses glob-style matching against parameter names. The match_names function creates a ParamFilter from a list of name patterns:

def match_names(*patterns: str) -> ParamFilter:
    """Create a filter matching parameter names against glob patterns."""

Example usage:

from hessian_eigenthings.operators import HessianOperator
from hessian_eigenthings.param_utils import match_names

# Filter attention parameters only
attn_filter = match_names("blocks.*.attn.*")
attn_op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=attn_filter
)

# Filter MLP parameters only
mlp_filter = match_names("blocks.*.mlp.*")
mlp_op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=mlp_filter
)

Sources: examples/transformer_lens_attention_only.py

Multiple Pattern Matching

The match_names function supports multiple patterns, useful for targeting disjoint parameter groups:

# Match multiple parameter groups
filter_fn = match_names(
    "transformer.h.*.attn.*",
    "transformer.h.*.mlp.*"
)

HuggingFace-Specific Patterns

When working with HuggingFace transformers, parameter names follow a predictable structure:

# Attention parameters in GPT-2
attn_op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=hf_lm_loss(),
    param_filter=match_names("transformer.h.*.attn.*"),
)

Sources: examples/huggingface_tiny_gpt2.py

Integration with Curvature Operators

Operator Size and Parameter Tracking

Curvature operators maintain internal state about the parameters they operate on:

| Attribute | Type | Description |
| --- | --- | --- |
| _param_names | list[str] | Names of parameters in the filtered set |
| _param_list | list[Tensor] | Parameter tensors |
| _sizes | list[torch.Size] | Original tensor shapes for reshaping |
| size | int | Total number of parameters (sum of all parameter elements) |

The size property is computed as:

self.size = sum(p.numel() for p in self._param_list)

This total parameter count determines the dimensionality of the vector space in which eigendecomposition occurs.

Sources: hessian_eigenthings/operators/hessian.py

Data Flow Diagram

graph TD
    A[PyTorch Model] --> B[get_param_names]
    A --> C[get_param_list]
    A --> D[get_param_sizes]
    B --> E[ParamFilter Application]
    C --> E
    D --> E
    E --> F[Filtered Parameter Collections]
    F --> G[Curvature Operator]
    G --> H[matvec Operations]
    H --> I[Eigendecomposition Results]
    
    J[Input Vector] --> K[unflatten_params]
    F --> K
    K --> L[Parameter Dict]
    L --> H

Practical Examples

Computing Attention-Only Hessian Eigendecomposition

from hessian_eigenthings.algorithms import lanczos
from hessian_eigenthings.operators import HessianOperator
from hessian_eigenthings.param_utils import match_names

# Create attention-only operator
attn_op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=match_names("blocks.*.attn.*")
)

print(f"Attention-only Hessian size: {attn_op.size} parameters")

# Compute top-3 eigenvalues
eig_attn = lanczos(attn_op, k=3, max_iter=20, tol=1e-3, seed=0)
for i, val in enumerate(eig_attn.eigenvalues):
    print(f"  λ_{i + 1} = {val.item(): .4e}")

Comparing Block-Specific Curvature

# Full model Hessian
full_op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn
)

# Attention block only
attn_op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=match_names("blocks.*.attn.*")
)

# MLP block only
mlp_op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=match_names("blocks.*.mlp.*")
)

# Compare eigenvalue spectra
full_eig = lanczos(full_op, k=10)
attn_eig = lanczos(attn_op, k=10)
mlp_eig = lanczos(mlp_op, k=10)

Sources: examples/transformer_lens_attention_only.py

Advanced Filtering

Custom Filter Functions

For complex filtering logic beyond glob matching, implement a custom ParamFilter:

def custom_filter(name: str, param: nn.Parameter) -> bool:
    # Include only parameters with > 1000 elements
    if param.numel() < 1000:
        return False
    # Exclude certain modules
    if "embedding" in name:
        return False
    # Include based on naming patterns
    return "layer" in name or "head" in name

operator = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=custom_filter
)

Filter Composition

Filters can be combined using standard Python patterns:

# Intersection of patterns
combined_filter = lambda name, param: (
    match_names("blocks.*.*.*")(name, param) and
    param.dtype == torch.float32
)

# Negation (match_names takes glob patterns, not regexes)
exclude_ln = lambda name, param: not (
    match_names("*layernorm*", "*ln*")(name, param)
)

Performance Considerations

Parameter Access Patterns

The parameter utilities maintain strict ordering between name lists and tensor lists to enable efficient offset-based indexing. When iterating over parameters in performance-critical paths:

  1. Use the pre-computed _sizes list to avoid repeated param.shape calls
  2. Leverage the strict=True zip when all lists are guaranteed to be aligned
  3. Prefer in-place reshaping over copies when possible

Memory Implications

| Operation | Memory Pattern |
| --- | --- |
| flatten_params | Allocates new tensor of size sum(numel) |
| unflatten_params | Creates dict of views/reshapes over the original vector |
| matvec | No parameter data copies; uses VJP/JVP chains |

The vectorization maintains a view relationship with original parameters where possible, minimizing memory overhead during iterative algorithms.

API Reference

`match_names`

def match_names(*patterns: str) -> ParamFilter:
    """Create a ParamFilter matching parameter names against glob patterns."""

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| patterns | str | Glob patterns to match against parameter names |

Returns: A callable ParamFilter that returns True for parameters matching any of the provided patterns.

Supported Glob Patterns:

  • * - Matches any sequence of characters within a path component
  • ** - Matches any sequence of path components (if supported)
  • ? - Matches a single character
  • [abc] - Matches any character in the set

Parameter Extraction Functions

def get_param_names(model: nn.Module) -> list[str]:
    """Extract fully-qualified parameter names from a model."""

def get_param_list(model: nn.Module) -> list[Tensor]:
    """Extract parameter tensors from a model."""

def get_filtered_params(
    model: nn.Module,
    param_filter: ParamFilter | None
) -> tuple[list[str], list[Tensor]]:
    """Extract filtered parameter names and tensors."""

Sources: hessian_eigenthings/param_utils.py

Sources: [hessian_eigenthings/operators/hessian.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/hessian.py)

Distributed Computing with DDP

Related topics: Curvature Operators


The hessian-eigenthings library provides native support for distributed training scenarios through DDPHessianOperator, a specialized curvature operator that extends the base HessianOperator to work correctly with PyTorch's DistributedDataParallel (DDP) wrapper.

Overview

In distributed training environments, the Hessian eigenvalue computations must account for how DDP synchronizes gradients across multiple processes. The DDPHessianOperator handles this synchronization transparently, ensuring that the Hessian-vector products (HVPs) computed across different ranks are properly averaged.

Key characteristics:

  • Subclass of HessianOperator with distributed awareness
  • Automatically averages HVPs across all data-parallel ranks
  • Compatible with standard DDP-wrapped models
  • Supports the same API as the base HessianOperator
  • Handles the autograd graph complexity introduced by DDP's all-reduce operations

Architecture

Class Hierarchy

CurvatureOperator (base interface)
    └── HessianOperator (base implementation)
            └── DDPHessianOperator (DDP-aware extension)

Data Flow Diagram

graph TD
    A[Model wrapped with DDP] --> B[DDPHessianOperator]
    B --> C[Per-rank HVP Computation]
    C --> D[Autograd-aware All-Reduce]
    D --> E[Synchronized HVP across Ranks]
    
    F[torch.autograd.grad calls] --> G[Regular .backward hooks]
    G --> H[No explicit all-reduce]
    
    I[Expected HVP] --> J[Actual HVP without DDPHessianOperator]
    
    K[Expected HVP] --> L[Actual HVP with DDPHessianOperator]
    
    style D fill:#90EE90
    style H fill:#FFB6C1
    style L fill:#90EE90

DDP Behavior Explanation

The core challenge addressed by DDPHessianOperator stems from how PyTorch's DistributedDataParallel handles gradient synchronization:

  1. DDP's all-reduce mechanism: DDP normally fires its all-reduce operation inside the autograd graph during loss.backward(), synchronizing gradients across all ranks
  2. Standard HessianOperator limitation: When using torch.autograd.grad directly (as the base HessianOperator does), the DDP hooks fire on .grad accumulation rather than on autograd.grad's return value
  3. Resulting discrepancy: Without explicit handling, the computed HVP does not match the single-process HVP computed on the union of all per-rank batches

The DDPHessianOperator resolves this by adding an explicit autograd-aware all-reduce after each gradient computation call, ensuring the resulting HVP equals the single-process HVP.
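Conceptually, the synchronization amounts to averaging each per-rank HVP (a sketch of the arithmetic only; the library additionally makes the collective autograd-aware, which a plain dist.all_reduce is not):

import torch
import torch.distributed as dist

def average_hvp_across_ranks(hvp: torch.Tensor) -> torch.Tensor:
    """Sum the per-rank HVPs and divide by world size so every rank
    holds the same synchronized result."""
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(hvp, op=dist.ReduceOp.SUM)
        hvp = hvp / dist.get_world_size()
    return hvp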

API Reference

DDPHessianOperator

class DDPHessianOperator(HessianOperator):
    """HessianOperator that all-reduces the HVP across torch.distributed ranks."""

Constructor Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| model | nn.Module | Yes | - | Model (may be DDP-wrapped; params are read directly) |
| dataloader | Iterable[Any] | Yes | - | Data loader providing batches to average over |
| loss_fn | LossFn | Yes | - | Loss function loss_fn(model, batch) -> loss |
| param_filter | ParamFilter \| None | No | None | Optional filter for subset of parameters |
| full_dataset | bool | No | True | Whether to compute Hessian over full dataset |
| num_batches | int \| None | No | None | Number of batches to sample if not full dataset |
| microbatch_size | int \| None | No | None | Chunk batch into micro-batches for memory |
| microbatch_unsafe | bool | No | False | Skip gradient accumulation safety checks |
| method | HvpMethod | No | "autograd" | HVP computation method |
| fd_eps | float \| None | No | None | Finite difference epsilon |
| backend | LinAlgBackend[torch.Tensor] \| None | No | None | Linear algebra backend |

Inherits all parameters from HessianOperator base class.

Inherited Methods

| Method | Description |
| --- | --- |
| matvec(v) | Compute H·v where H is the Hessian averaged over batches |
| size | Total number of parameters in the filtered parameter set |
| dtype | Data type of parameters |
| device | Device of parameters |

Import Location

from hessian_eigenthings.operators import DDPHessianOperator

Or via the distributed submodule:

from hessian_eigenthings.operators.distributed import DDPHessianOperator

Sources: hessian_eigenthings/operators/distributed/__init__.py:1-3

Usage Patterns

Basic Usage with DDP-wrapped Model

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from hessian_eigenthings.operators import DDPHessianOperator
from hessian_eigenthings.algorithms import lanczos

# Assume model, dataloader, and loss_fn are already set up
ddp_model = DDP(model)

# Create the distributed Hessian operator
hessian_op = DDPHessianOperator(
    model=ddp_model,
    dataloader=dataloader,
    loss_fn=loss_fn,
)

# Compute eigenvalues using the Lanczos algorithm (returns an EigenResult)
result = lanczos(hessian_op, k=10, max_iter=50)
eigenvalues, eigenvectors = result.eigenvalues, result.eigenvectors

With Parameter Filtering

from hessian_eigenthings.param_utils import match_names

# Focus on specific layer parameters
hessian_op = DDPHessianOperator(
    model=ddp_model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=match_names("layer.4.*"),
)

Using with Different HVP Methods

# Using finite difference method (more memory-efficient)
hessian_op_fd = DDPHessianOperator(
    model=ddp_model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    method="finite_difference",
    fd_eps=1e-5,
)

Key Design Decisions

Autograd-aware All-Reduce

The DDPHessianOperator adds an explicit all-reduce operation that integrates with PyTorch's autograd engine. This ensures:

  • The all-reduce operation is included in the autograd graph when needed
  • Gradient flows correctly through the distributed computation
  • The final HVP is properly synchronized across all ranks

Parameter Access

The operator reads parameters directly from the model, whether or not it is wrapped with DDP:

# From the source:
# "The model passed in may already be wrapped with
#  torch.nn.parallel.DistributedDataParallel; we read params from it directly."

This design allows seamless usage with existing DDP-wrapped models without modification.

Batch Distribution

Each rank should receive its own shard of the dataset:

"Each rank should be receiving its own shard of the dataset (typical pattern: a torch.utils.data.distributed.DistributedSampler)."

Sources: hessian_eigenthings/operators/distributed/ddp.py:21-25

Comparison with Single-Process HessianOperator

| Aspect | HessianOperator | DDPHessianOperator |
| --- | --- | --- |
| Use case | Single GPU / CPU | Multi-GPU distributed |
| Gradient sync | Manual handling required | Automatic via all-reduce |
| DDP compatibility | May produce incorrect HVPs | Correct by design |
| API | Identical | Identical |
| Performance overhead | None | Single all-reduce per HVP |

Relationship to Other Operators

The hessian_eigenthings package provides multiple curvature operators:

| Operator | Description | Distributed Support |
| --- | --- | --- |
| HessianOperator | Full Hessian computation | Not DDP-aware |
| DDPHessianOperator | Full Hessian with DDP sync | DDP-aware |
| GGNOperator | Generalized Gauss-Newton | Not DDP-aware (as of v1.0) |
| EmpiricalFisherOperator | Empirical Fisher matrix | Not DDP-aware |

Sources: hessian_eigenthings/__init__.py:15-24

Limitations and Considerations

  1. Current scope: Only HessianOperator has a DDP-aware counterpart; other operators like GGNOperator and EmpiricalFisherOperator do not yet have distributed variants
  2. Gradient hooks: The operator does not currently support all DDP gradient hook mechanisms
  3. Multi-node training: While the operator uses standard torch.distributed primitives, performance at very large scale (>8 nodes) has not been extensively benchmarked
  4. Mixed precision: When using fp16/bf16 training, ensure consistent dtype across all ranks

Error Handling

The operator relies on standard PyTorch distributed error handling:

  • If torch.distributed is not initialized, standard errors will be raised
  • Mismatched tensor shapes across ranks will result in collective operation errors
  • Device mismatch (e.g., some ranks on CUDA, some on CPU) is not supported

Testing and Validation

The DDP functionality should be tested in a true distributed environment. Basic validation includes:

  1. Consistency check: HVP computed via DDPHessianOperator should equal the single-process HVP when aggregating all batch shards
  2. Numerical accuracy: Eigenvalues computed with DDP should match single-GPU results within floating-point tolerance
  3. Scaling: Computation time should scale sub-linearly with number of GPUs for large models

Sources: hessian_eigenthings/operators/distributed/__init__.py:1-3


Doramagic Pitfall Log

Doramagic extracted 15 source-linked risk signals. Review them before installing or handing real data to the project.

1. Project risk: Project risk needs validation

  • Severity: medium
  • Finding: Project risk is backed by a source signal: Project risk needs validation. Treat it as a review item until the current version is checked.
  • User impact: The project should not be treated as fully validated until this signal is reviewed.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: identity.distribution | hn_item:48132232 | https://news.ycombinator.com/item?id=48132232 | repo=pytorch-hessian-eigenthings; install=hessian-eigenthings

2. Installation risk: Python Error: the following arguments are required: experimentname

  • Severity: medium
  • Finding: Installation risk is backed by a source signal: Python Error: the following arguments are required: experimentname. Treat it as a review item until the current version is checked.
  • User impact: First-time setup may fail or require extra isolation and rollback planning.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/noahgolmant/pytorch-hessian-eigenthings/issues/39

3. Installation risk: v1.0.0a2 — packaging fix

  • Severity: medium
  • Finding: Installation risk is backed by a source signal: v1.0.0a2 — packaging fix. Treat it as a review item until the current version is checked.
  • User impact: First-time setup may fail or require extra isolation and rollback planning.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/noahgolmant/pytorch-hessian-eigenthings/releases/tag/v1.0.0a2

4. Installation risk: v1.0.0a3 — fix lanczos OOM

  • Severity: medium
  • Finding: Installation risk is backed by a source signal: v1.0.0a3 — fix lanczos OOM. Treat it as a review item until the current version is checked.
  • User impact: First-time setup may fail or require extra isolation and rollback planning.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/noahgolmant/pytorch-hessian-eigenthings/releases/tag/v1.0.0a3

5. Installation risk: v1.0.0a4 — backend handles CPU-generator + CUDA-tensor combo

  • Severity: medium
  • Finding: Installation risk is backed by a source signal: v1.0.0a4 — backend handles CPU-generator + CUDA-tensor combo. Treat it as a review item until the current version is checked.
  • User impact: First-time setup may fail or require extra isolation and rollback planning.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/noahgolmant/pytorch-hessian-eigenthings/releases/tag/v1.0.0a4

6. Installation risk: v1.0.0a5 — comprehensive LLM-scale memory fixes + regression tests

  • Severity: medium
  • Finding: Installation risk is backed by a source signal: v1.0.0a5 — comprehensive LLM-scale memory fixes + regression tests. Treat it as a review item until the current version is checked.
  • User impact: First-time setup may fail or require extra isolation and rollback planning.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/noahgolmant/pytorch-hessian-eigenthings/releases/tag/v1.0.0a5

7. Configuration risk: RuntimeError: One of the differentiated Tensors appears to not have been used in the graph.

  • Severity: medium
  • Finding: Configuration risk is backed by a source signal: RuntimeError: One of the differentiated Tensors appears to not have been used in the graph. Treat it as a review item until the current version is checked.
  • User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/noahgolmant/pytorch-hessian-eigenthings/issues/30

8. Configuration risk: ValueError: PENet on the Kitti benchmark suite

  • Severity: medium
  • Finding: Configuration risk is backed by a source signal: ValueError: PENet on the Kitti benchmark suite. Treat it as a review item until the current version is checked.
  • User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/noahgolmant/pytorch-hessian-eigenthings/issues/41

9. Capability assumption: README/documentation is current enough for a first validation pass.

  • Severity: medium
  • Finding: README/documentation is current enough for a first validation pass.
  • User impact: The project should not be treated as fully validated until this signal is reviewed.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: capability.assumptions | hn_item:48132232 | https://news.ycombinator.com/item?id=48132232 | README/documentation is current enough for a first validation pass.

10. Project risk: AttributeError: 'HVPOperator' object has no attribute 'zero_grad'

  • Severity: medium
  • Finding: Project risk is backed by a source signal: AttributeError: 'HVPOperator' object has no attribute 'zero_grad'. Treat it as a review item until the current version is checked.
  • User impact: The project should not be treated as fully validated until this signal is reviewed.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/noahgolmant/pytorch-hessian-eigenthings/issues/38

11. Maintenance risk: Maintainer activity is unknown

  • Severity: medium
  • Finding: Maintenance risk is backed by a source signal: Maintainer activity is unknown. Treat it as a review item until the current version is checked.
  • User impact: Users cannot judge support quality until recent activity, releases, and issue response are checked.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: evidence.maintainer_signals | hn_item:48132232 | https://news.ycombinator.com/item?id=48132232 | last_activity_observed missing

12. Security or permission risk: no_demo

  • Severity: medium
  • Finding: no_demo
  • User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: downstream_validation.risk_items | hn_item:48132232 | https://news.ycombinator.com/item?id=48132232 | no_demo; severity=medium

Source: Doramagic discovery, validation, and Project Pack records


Community Discussion Evidence

Doramagic exposes nine project-level community discussion links separately from official documentation. These links are review inputs, not standalone proof that the project is production-ready; review them before using pytorch-hessian-eigenthings with real data or production workflows.

Source: Project Pack community evidence and pitfall evidence