Doramagic Project Pack · Human Manual
pytorch-hessian-eigenthings
Introduction to hessian-eigenthings
Related topics: Curvature Matrices Explained, System Architecture
Overview
hessian-eigenthings is a PyTorch library that provides efficient and scalable computation of eigendecomposition for the Hessian matrix and related curvature operators in neural networks. The library enables practitioners to compute top eigenvalues and eigenvectors via Lanczos or stochastic power iteration, trace estimates via Hutch++, and spectral density via Stochastic Lanczos Quadrature.
Sources: README.md:1
The project targets researchers and engineers studying generalization properties of neural networks, where Hessian eigenvalues and eigenvectors have been implicated in understanding flat minima and model robustness.
Core Concepts
What is a Hessian?
The Hessian matrix is the matrix of second-order partial derivatives of a loss function with respect to the model parameters. For a neural network with parameters θ and loss L, the entries of the Hessian H are:
H_ij = ∂²L / ∂θ_i ∂θ_j
For modern large-scale models, the Hessian is prohibitively expensive to compute explicitly—it has O(n²) entries where n is the number of parameters (e.g., billions for large language models).
Sources: hessian_eigenthings/operators/hessian.py:1-50
Curvature Operators
Instead of computing the full Hessian matrix, this library works with curvature operators that implement matrix-vector products (matvecs). Given a vector v, these operators efficiently compute:
H @ v → operator.matvec(v)
This approach reduces memory from O(n²) to O(n), making analysis feasible for models with billions of parameters.
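To make the contract concrete, here is a minimal matrix-free operator sketch (a toy scaled identity, not one of the library's operators) showing how a matvec-only interface avoids ever materializing the (n, n) matrix:

```python
import torch

class ScaledIdentityOperator:
    """Toy matrix-free operator behaving like A = c * I without storing A."""

    def __init__(self, n: int, c: float = 2.0):
        self.size = n
        self.c = c

    def matvec(self, v: torch.Tensor) -> torch.Tensor:
        # O(n) work and memory: the (n, n) matrix is never built.
        return self.c * v

op = ScaledIdentityOperator(n=10_000)
v = torch.randn(10_000)
print(op.matvec(v).shape)  # torch.Size([10000])
```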
Supported Curvature Matrices
| Operator | Description | Use Case |
|---|---|---|
| HessianOperator | Full Hessian of the loss | General curvature analysis |
| GGNOperator | Generalized Gauss-Newton approximation | More stable than raw Hessian; equals Fisher for cross-entropy + softmax |
| Custom Operators | User-defined curvature operators | Extend to other matrices |
Sources: hessian_eigenthings/operators/ggn.py:1-30
Architecture
The library follows a clean separation of concerns with three main layers:
graph TD
A[User Code] --> B[Algorithms]
A --> C[Loss Functions]
B --> D[Curvature Operators]
C --> D
D --> E[LinAlgBackend]
B --> B1[Lanczos]
B --> B2[Stochastic Power Iteration]
B --> B3[Trace Estimation]
D --> D1[HessianOperator]
D --> D2[GGNOperator]
D --> D3[Custom Operators]
E --> E1[SingleDeviceBackend]
E --> E2[Distributed Backends]
Component Layers
- Algorithms Layer (hessian_eigenthings/algorithms/): Eigenvalue/eigenvector computation methods that operate on any CurvatureOperator.
- Operators Layer (hessian_eigenthings/operators/): Implementations of various curvature matrices that provide the matvec() interface.
- Loss Functions Layer (hessian_eigenthings/loss_fns/): Pre-built loss functions with analytical Hessian-vector products for common use cases.
- Backend Layer (hessian_eigenthings/backends/): Abstraction for linear algebra operations supporting single-device and distributed execution.
Sources: CONTRIBUTING.md:1-30
Algorithms
Lanczos Eigendecomposition
The Lanczos algorithm computes the top k eigenvalues and eigenvectors of a symmetric matrix using only matrix-vector products. It does so by iteratively building a small tridiagonal matrix whose eigenvalues (the Ritz values) approximate those of the original operator.
graph LR
A[Start Vector v₀] --> B[Iterate i = 1 to k]
B --> C[Compute αᵢ = vᵢᵀAvᵢ]
C --> D[Compute βᵢvᵢ₊₁ = Avᵢ - αᵢvᵢ - βᵢ₋₁vᵢ₋₁]
D --> E[Build Tridiagonal T]
E --> F{Eigenvalues of T ≈ eigenvalues of A?}
F -->|Yes| G[Eigenpairs Converged]
Key parameters for the Lanczos algorithm:
| Parameter | Type | Default | Description |
|---|---|---|---|
| k | int | required | Number of eigenpairs to compute |
| max_iter | int | 100 | Maximum Lanczos iterations |
| tol | float | 1e-6 | Convergence tolerance |
| seed | int | None | Random seed for reproducibility |
| which | str | "LM" | Which eigenvalues: "LM" (largest magnitude), "LA" (largest algebraic), "SA" (smallest algebraic) |
Sources: hessian_eigenthings/algorithms/lanczos.py:1-80
Trace Estimation
The trace of a matrix can be estimated using stochastic methods without forming the full matrix:
Hutchinson's Estimator:
trace(A) ≈ (1/m) Σᵢ vᵢᵀ A vᵢ
where vᵢ are random probe vectors.
Hutch++ Estimator: An improved estimator with lower variance. It spends part of the matvec budget building an orthonormal sketch Q of the dominant range of A, computes trace(QᵀAQ) exactly, and applies Hutchinson's estimator only to the deflated remainder:
trace(A) ≈ trace(QᵀAQ) + (1/m') Σᵢ gᵢᵀ (I − QQᵀ) A (I − QQᵀ) gᵢ
| Method | num_matvecs | Variance | Use Case |
|---|---|---|---|
| hutchinson | 100 | Higher | Quick estimates |
| hutch++ | 30 | Lower | Production estimates |
Sources: hessian_eigenthings/algorithms/trace.py:1-60
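For intuition, a self-contained sketch of Hutchinson's estimator with Rademacher probes (the function name and the explicit-matrix sanity check are illustrative, not the library's trace API):

```python
import torch

def hutchinson_trace(matvec, n: int, num_matvecs: int = 100, seed: int = 0) -> float:
    # Classical Hutchinson: average of v^T A v over Rademacher probe vectors.
    g = torch.Generator().manual_seed(seed)
    total = 0.0
    for _ in range(num_matvecs):
        v = torch.randint(0, 2, (n,), generator=g).float() * 2 - 1  # entries in {-1, +1}
        total += torch.dot(v, matvec(v)).item()
    return total / num_matvecs

# Sanity check against an explicit matrix (only feasible for tiny n).
A = torch.randn(50, 50)
A = A @ A.T
print(hutchinson_trace(lambda v: A @ v, n=50), torch.trace(A).item())
```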
Operators
HessianOperator
The primary operator for computing Hessian eigendecomposition. It supports two HVP computation methods:
| Method | Description | Memory | Precision |
|---|---|---|---|
"autograd" (default) | Exact double-backward via torch.autograd.grad with create_graph=True | Higher | Numerically exact |
"finite_difference" | Central-difference approximation | Lower | O(ε²) bias |
from hessian_eigenthings import HessianOperator, lanczos
from hessian_eigenthings.util import match_names
# Basic usage
op = HessianOperator(model=model, dataloader=dataloader, loss_fn=loss_fn)
eigenvalues, eigenvectors = lanczos(op, k=10)
# With parameter filtering (subset of parameters)
op = HessianOperator(
model=model,
dataloader=dataloader,
loss_fn=loss_fn,
param_filter=match_names("transformer.h.*.attn.*"),
)
Sources: hessian_eigenthings/operators/hessian.py:1-100
GGNOperator
The Generalized Gauss-Newton (GGN) operator provides a more numerically stable approximation to the Hessian. For cross-entropy + softmax classification, the GGN equals the Fisher information matrix.
from hessian_eigenthings import GGNOperator
op = GGNOperator(
model=model,
dataloader=dataloader,
forward_fn=model_forward,
loss_of_output_fn=loss_of_output_fn,
)
Two matvec implementations:
| Implementation | Description | Memory Footprint |
|---|---|---|
"analytical" (default) | Finite-difference JVP + analytical loss-Hessian-vec product | Matches one training step |
"autograd" | Full torch.func.jvp + autograd double-backward | Scales badly with output size |
Sources: hessian_eigenthings/operators/ggn.py:1-80
Loss Functions
The library provides optimized loss functions with closed-form Hessian-vector products for common use cases.
Standard Loss Functions
from hessian_eigenthings.loss_fns.standard import cross_entropy_loss_of_output
loss_fn = cross_entropy_loss_of_output() # Returns loss_of_output_fn
The closed-form cross-entropy HVP is:
H_loss @ u = (p * u - p * ⟨p, u⟩) / n
where p = softmax(output) and n is the number of valid positions.
Sources: hessian_eigenthings/loss_fns/standard.py:1-60
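As an illustration, a minimal single-batch sketch of this closed-form HVP in plain PyTorch (assuming mean reduction over the n rows of the logits; not the library's fused kernel):

```python
import torch

def ce_loss_hvp(logits: torch.Tensor, u: torch.Tensor) -> torch.Tensor:
    # H_loss @ u for mean-reduced softmax cross-entropy:
    # (p * u - p * <p, u>) / n, with <p, u> taken over the class dimension.
    p = torch.softmax(logits, dim=-1)          # (n, num_classes)
    inner = (p * u).sum(dim=-1, keepdim=True)  # <p, u> per row
    return (p * u - p * inner) / logits.shape[0]
```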
HuggingFace Transformers Loss
Specialized support for HuggingFace models with fused CUDA kernels:
from hessian_eigenthings.loss_fns.huggingface import hf_lm_loss
loss_fn = hf_lm_loss() # For language modeling
Fused backend options:
| Backend | Device | Speedup | Memory |
|---|---|---|---|
"triton" | CUDA | ~3.4x faster | 2x reduction |
"compile" | Any | ~2.6x faster | 2x reduction |
"eager" | Any | Baseline | Baseline |
"auto" | Auto-detect | Best available | Best available |
Sources: hessian_eigenthings/loss_fns/huggingface.py:1-80
Usage Examples
Basic Hessian Eigendecomposition
import torch
from torch import nn
from hessian_eigenthings import HessianOperator, lanczos
# Define model and data
model = nn.Sequential(nn.Linear(100, 50), nn.ReLU(), nn.Linear(50, 10))
dataloader = [(torch.randn(32, 100), torch.randint(0, 10, (32,)))]
# Loss function
def loss_fn(model, batch):
x, y = batch
return nn.functional.cross_entropy(model(x), y)
# Compute top 3 eigenvalues
op = HessianOperator(model=model, dataloader=dataloader, loss_fn=loss_fn)
result = lanczos(op, k=3, max_iter=20, tol=1e-3, seed=0)
for i, val in enumerate(result.eigenvalues):
print(f"λ_{i+1} = {val.item():.4e}")
Sources: examples/huggingface_tiny_gpt2.py:1-50
Analyzing Attention Layers Only
from hessian_eigenthings import HessianOperator, lanczos
from hessian_eigenthings.util import match_names
# Filter to attention parameters only
attn_op = HessianOperator(
model=model,
dataloader=dataloader,
loss_fn=loss_fn,
param_filter=match_names("blocks.*.attn.*"),
)
result = lanczos(attn_op, k=5)
Trace Estimation
from hessian_eigenthings import HessianOperator, trace
op = HessianOperator(model=model, dataloader=dataloader, loss_fn=loss_fn)
result = trace(op, num_matvecs=30, method="hutch++", seed=0)
print(f"Trace estimate: {result.estimate:.4e}")
Performance Considerations
Memory Management
For large models, consider these strategies:
- Parameter Filtering: Analyze only relevant subsets of parameters.
  `op = HessianOperator(model, dataloader, loss_fn, param_filter=filter_func)`
- Microbatching: Process data in smaller chunks.
  `op = HessianOperator(model, dataloader, loss_fn, microbatch_size=8)`
- Finite Difference Method: Use `method="finite_difference"` for lower memory with FSDP/HSDP/TP.
  `op = HessianOperator(model, dataloader, loss_fn, method="finite_difference")`
Scalability
| Model Size | Recommended Method | Notes |
|---|---|---|
| < 1B params | "autograd" HVP | Numerically exact |
| 1B - 7B params | "analytical" GGN | Good memory efficiency |
| > 7B params | "finite_difference" HVP | Works with distributed training |
Computation Cost
The primary cost driver is the number of matrix-vector products (matvecs):
- Lanczos: up to max_iter matvecs (one per Lanczos iteration)
- Trace (Hutch++): num_matvecs matvecs
- Spectral Density: num_steps × num_random_start matvecs
API Reference
Core Functions
| Function | Module | Description |
|---|---|---|
| lanczos | hessian_eigenthings.algorithms | Lanczos eigendecomposition |
| stochastic_power_iteration | hessian_eigenthings.algorithms | Stochastic power iteration |
| trace | hessian_eigenthings.algorithms | Trace estimation |
| spectral_density | hessian_eigenthings.algorithms | Stochastic Lanczos Quadrature |
Operators
| Class | Module | Description |
|---|---|---|
| HessianOperator | hessian_eigenthings.operators | Full Hessian operator |
| GGNOperator | hessian_eigenthings.operators | Generalized Gauss-Newton operator |
| CurvatureOperator | hessian_eigenthings.operators | Base class for custom operators |
Utility Functions
| Name | Description |
|---|---|
| match_names(glob_pattern) | Create parameter filter from glob pattern |
| SingleDeviceBackend | Linear algebra backend for single-device execution |
Project Information
Acknowledgements
The original 2018 implementation was developed by Noah Golmant, Zhewei Yao, Amir Gholami, Michael Mahoney, and Joseph Gonzalez at UC Berkeley's RISELab.
The deflated power iteration is based on code from HessianFlow (Z. Yao, A. Gholami, Q. Lei, K. Keutzer, M. Mahoney. *"Hessian-based Analysis of Large Batch Training and Robustness to Adversaries"*, NeurIPS 2018).
Accelerated stochastic power iteration is from C. De Sa et al.
Citation
@misc{hessian-eigenthings,
author = {Noah Golmant and Zhewei Yao and Amir Gholami and Michael Mahoney and Joseph Gonzalez},
title = {pytorch-hessian-eigenthings: efficient PyTorch Hessian eigendecomposition},
month = oct,
year = 2018,
version = {1.0},
url = {https://github.com/noahgolmant/pytorch-hessian-eigenthings}
}
Installation
# From PyPI (stable release)
pip install hessian-eigenthings
# Development setup
git clone https://github.com/noahgolmant/pytorch-hessian-eigenthings
cd pytorch-hessian-eigenthings
uv sync --group dev --group docs
Documentation
Full documentation is available at noahgolmant.github.io/pytorch-hessian-eigenthings/.
Sources: README.md:1-100, CONTRIBUTING.md:1-60, mkdocs.yml:1-50
Installation Guide
Related topics: Introduction to hessian-eigenthings
Overview
This guide covers all aspects of setting up the hessian-eigenthings library for computing Hessian eigendecomposition and related curvature matrix operations in PyTorch models.
The library provides efficient methods for computing top eigenvalues and eigenvectors via Lanczos or stochastic power iteration, trace estimates via Hutch++, and spectral density via Stochastic Lanczos Quadrature.
Sources: README.md:1-20
Installation Methods
PyPI Release (Recommended for Users)
The latest stable release is available on PyPI:
pip install hessian-eigenthings
This installs the core library without optional dependencies for transformer and curvlinops integrations.
Sources: README.md:1-10
Development Installation (For Contributors)
For development, clone the repository and install with all optional dependency groups:
git clone https://github.com/noahgolmant/pytorch-hessian-eigenthings
cd pytorch-hessian-eigenthings
uv sync --group dev --group docs --extra transformers --extra transformer-lens --extra curvlinops
Sources: CONTRIBUTING.md:5-12
Optional Dependency Groups
The library uses optional dependency groups defined in pyproject.toml to enable specialized functionality:
| Group | Purpose | Typical Use Case |
|---|---|---|
| dev | Testing, linting, type checking | Running CI checks locally |
| docs | Building documentation | mkdocs build --strict |
| transformers | HuggingFace Transformers integration | GGNOperator with HF models |
| transformer-lens | TransformerLens integration | Attention-only Hessian analysis |
| curvlinops | Cross-library validation tests | Testing against external oracle |
Sources: CONTRIBUTING.md:8-10
Development Environment Setup
Prerequisites
| Requirement | Version | Purpose |
|---|---|---|
| Python | ≥3.10 | Core runtime |
| uv | Latest | Package manager |
| PyTorch | ≥2.0 | Backend tensor operations |
| CUDA (optional) | 11.8+ | GPU acceleration for large models |
Setup Workflow
graph TD
A[Clone Repository] --> B[Install uv if needed]
B --> C[Run uv sync with groups]
C --> D[Verify Installation]
D --> E{Which workflow?}
E -->|Development| F[Run linting checks]
E -->|Testing| G[Run pytest]
E -->|Documentation| H[Build docs]
F --> I[Ready to contribute]
G --> I
H --> I
Verification Commands
After installation, verify the setup by running the full check suite:
uv run ruff check .
uv run black --check .
uv run mypy
uv run pytest
uv run mkdocs build --strict
Sources: CONTRIBUTING.md:14-23
CUDA/GPU Support
The library provides optimized CUDA kernels for specific operations:
Triton Kernels (CUDA Only)
The hessian_eigenthings.loss_fns._fused_ce_hvp module includes a hand-written Triton CUDA kernel for fused CE HVP computation. This kernel:
- Allocates zero (N, V) intermediates (output buffer only)
- Provides ~3.4x speedup over eager mode
- Reduces peak memory by 2x compared to eager
- Falls back to torch.compile if Triton/CUDA is unavailable
Sources: hessian_eigenthings/loss_fns/_fused_ce_hvp.py:50-80
Backend Selection
For HuggingFace language model loss functions, the fused parameter controls kernel selection:
| Setting | Behavior |
|---|---|
"auto" (default) | Picks fastest available: Triton on CUDA (~3.4x speedup), else torch.compile |
"eager" | Plain PyTorch implementation, useful for debugging |
"compile" | torch.compile-fused via Inductor, works on CPU/CUDA/MPS |
"triton" | Hand-written CUDA Triton kernel (CUDA only) |
Sources: hessian_eigenthings/loss_fns/huggingface.py:1-50
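For example, to pin the portable fused backend rather than letting "auto" choose (assuming the fused keyword accepted by hf_lm_loss as documented above):

```python
from hessian_eigenthings.loss_fns.huggingface import hf_lm_loss

# torch.compile-fused kernel: works on CPU/CUDA/MPS; "auto" would prefer
# Triton whenever CUDA is available.
loss_fn = hf_lm_loss(fused="compile")
```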
Operator-Specific Dependencies
Different curvature operators have different computational requirements:
HessianOperator
Two HVP methods are supported:
| Method | Memory Profile | Precision | FSDP/TP Compatible |
|---|---|---|---|
"autograd" (default) | Higher (requires create_graph=True) | Numerically exact | No (requires special handling) |
"finite_difference" | Matches one training step | O(ε²) truncation bias | Yes |
Sources: hessian_eigenthings/operators/hessian.py:1-40
GGNOperator
For the Generalized Gauss-Newton operator, two matvec implementations are available:
| Implementation | Memory | Use Case |
|---|---|---|
"analytical" (default) | Matches one training step | LM-scale use, prevents OOM |
"autograd" | Scales with output size | Losses without analytical .hvp |
Sources: hessian_eigenthings/operators/ggn.py:1-60
CI/CD Verification
The repository uses GitHub Actions for continuous integration. CI runs:
- All linting and type checks
- Full pytest test suite
- Example scripts execution
- Documentation codeblock tests
Sources: CONTRIBUTING.md:23-26
Troubleshooting
Common Issues
| Issue | Solution |
|---|---|
| Memory OOM with GGNOperator | Use loss_hvp="analytical" (default in recent versions) |
| FSDP/TP compatibility issues | Use method="finite_difference" for HessianOperator |
| Triton not available | Falls back to torch.compile automatically |
| Type checking failures | Run uv run mypy locally before submitting PR |
Diagnostic Scripts
The repository includes diagnostic scripts for troubleshooting:
- scripts/repro_ggn_oom.py: CPU-side memory regression test for GGNOperator OOM issues
- scripts/bench_fused_ce_hvp.py: Microbenchmark for eager vs fused CE HVP performance
Sources: scripts/repro_ggn_oom.py:1-40, scripts/bench_fused_ce_hvp.py:1-50
Package Metadata
| Property | Value |
|---|---|
| Package Name | hessian-eigenthings |
| License | MIT |
| Documentation | noahgolmant.github.io/pytorch-hessian-eigenthings |
Sources: README.md:1-20
Sources: CONTRIBUTING.md:5-12
Curvature Matrices Explained
Related topics: Why Hessian-Vector Products, Curvature Operators
Overview
Curvature matrices characterize the second-order behavior of loss functions in neural networks, providing critical information about optimization landscapes, generalization properties, and model robustness. The hessian-eigenthings library provides efficient, matrix-free computation of eigendecompositions for three key curvature matrices: the Hessian, the Generalized Gauss-Newton (GGN), and the Empirical Fisher.
These curvature operators serve as the foundation for analyzing flat minima, understanding generalization, and performing second-order optimization. The library implements matrix-vector products (matvecs) directly, avoiding explicit matrix construction which would be computationally infeasible for large neural networks with billions of parameters.
Curvature Matrices Architecture
graph TD
A[Loss Function Lθ] --> B[Hessian H = ∇²L]
A --> C[Generalized Gauss-Newton G]
A --> D[Empirical Fisher F]
B --> E[Matrix-Free MatVec]
C --> E
D --> E
E --> F[Lanczos Eigendecomposition]
E --> G[Trace Estimation]
E --> H[Spectral Density]
F --> I[Top-k Eigenpairs]
G --> J[Trace Estimate ± SE]
H --> K[Spectral Density Plot]
The Hessian Matrix
Definition and Role
The Hessian matrix H = ∇²L(θ) is the second derivative of the loss with respect to parameters. It captures the exact local curvature of the loss landscape, making it the most precise but also most computationally expensive curvature matrix.
The Hessian is symmetric by construction and its eigenvalues reveal critical properties:
- Large positive eigenvalues indicate sharp curvature, suggesting the model is in a narrow minimum
- Small eigenvalues indicate flat regions associated with better generalization
- Negative eigenvalues signal instability and potential divergence
Sources: hessian_eigenthings/operators/hessian.py:1-30
HessianOperator Implementation
The HessianOperator class provides two methods for computing Hessian-vector products (HVPs):
class HessianOperator(CurvatureOperator):
"""Hessian of `loss_fn(model, batch)` averaged over batches in dataloader."""
def __init__(
self,
model: nn.Module,
dataloader: Iterable[Any],
loss_fn: LossFn,
*,
param_filter: ParamFilter | None = None,
full_dataset: bool = True,
num_batches: int | None = None,
microbatch_size: int | None = None,
method: HvpMethod = "autograd",
fd_eps: float | None = None,
backend: LinAlgBackend[torch.Tensor] | None = None,
) -> None:
HVP Computation Methods:
| Method | Description | Memory | Precision |
|---|---|---|---|
"autograd" | Exact double-backward via torch.autograd.grad with create_graph=True | Higher (scales with model size) | Numerically exact to rounding |
"finite_difference" | Central-difference (∇L(θ+εv) − ∇L(θ−εv)) / 2ε | Lower (two forward+backward passes) | O(ε²) truncation bias |
The finite-difference method uses dtype-specific epsilon values for optimal precision:
| dtype | Epsilon |
|---|---|
| float64 | 6e-6 |
| float32 | 5e-3 |
| bfloat16 | 0.2 |
| float16 | 5e-2 |
Sources: hessian_eigenthings/operators/hessian.py:30-55
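A minimal sketch of the central-difference HVP (the grad_fn callable returning ∇L(θ) and the normalization by ‖v‖ are assumptions for illustration, not the operator's internals):

```python
import torch

def finite_difference_hvp(grad_fn, theta: torch.Tensor, v: torch.Tensor,
                          eps: float = 5e-3) -> torch.Tensor:
    # H v ≈ (∇L(θ + hv) − ∇L(θ − hv)) / (2h), with h scaled by ||v|| so the
    # perturbation magnitude matches the float32 epsilon from the table above.
    h = eps / v.norm().clamp_min(1e-12)
    return (grad_fn(theta + h * v) - grad_fn(theta - h * v)) / (2 * h)
```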
When to Use the Hessian
The Hessian is ideal for:
- Single-device analysis of models up to ~7B parameters
- Scenarios requiring exact curvature information
- Research on loss landscape topology
- Verifying approximations against ground truth
Generalized Gauss-Newton (GGN)
Definition and Mathematical Foundation
The Generalized Gauss-Newton matrix G is a positive semi-definite (PSD) approximation to the Hessian. For a loss of the form L = (1/n) Σ l(f(xᵢ;θ), yᵢ), the GGN is defined as:
G = Jᵀ · H_loss · J
Where:
- J is the Jacobian of model outputs with respect to parameters
- H_loss is the Hessian of the loss with respect to model outputs
The GGN is always PSD because vᵀGv = (Jv)ᵀ H_loss (Jv) ≥ 0 whenever H_loss is PSD, which holds for convex losses.
Sources: hessian_eigenthings/operators/ggn.py:1-40
GGNOperator Implementation
The GGNOperator class provides two matvec implementations:
class GGNOperator(CurvatureOperator):
def __init__(
self,
model: nn.Module,
dataloader: Iterable[Any],
forward_fn: ForwardFn,
loss_of_output_fn: LossOfOutputFn,
*,
loss_hvp: Literal["analytical", "autograd"] = "analytical",
) -> None:
| Method | Description | Memory | Use Case |
|---|---|---|---|
"analytical" | Finite-difference JVP + analytical loss-Hessian-vector product | Matches one training step | LM-scale use, OOM-safe |
"autograd" | torch.func.jvp + autograd double-backward + vjp | Scales with output size | Exact for arbitrary losses |
For cross-entropy + softmax classification, G equals the Fisher information matrix, making the GGN and Fisher equivalent in this common case.
Sources: hessian_eigenthings/operators/ggn.py:40-80
Two-Function API Design
The GGNOperator uses a separation between forward_fn and loss_of_output_fn:
ForwardFn = Callable[[nn.Module, Any], torch.Tensor]
LossOfOutputFn = Callable[[torch.Tensor, Any], torch.Tensor]
This design enables computing J·v, H_loss·(J·v), and Jᵀ·(H_loss·J·v) without coupling to loss internals.
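A sketch of how those three products compose into one GGN matvec using torch.func (a simplified single-batch version; the function shape and names are assumptions, not GGNOperator's implementation):

```python
import torch
from torch.func import functional_call, jvp, vjp

def ggn_matvec(model, x, y, loss_of_output_fn, v):
    # v: dict of per-parameter tangents matching dict(model.named_parameters()).
    params = dict(model.named_parameters())

    def f(p):  # params -> model outputs on this batch
        return functional_call(model, p, (x,))

    # 1) J v via forward-mode AD through the network.
    out, Jv = jvp(f, (params,), (v,))

    # 2) H_loss (J v) via double-backward through the loss w.r.t. outputs.
    out = out.detach().requires_grad_(True)
    (g,) = torch.autograd.grad(loss_of_output_fn(out, y), out, create_graph=True)
    (HJv,) = torch.autograd.grad(g, out, grad_outputs=Jv.detach())

    # 3) J^T (H_loss J v) via reverse-mode AD through the network.
    _, vjp_fn = vjp(f, params)
    (Gv,) = vjp_fn(HJv)
    return Gv  # dict of per-parameter results
```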
Closed-Form Cross-Entropy HVP
For mean-reduced cross-entropy with softmax, the library provides an optimized analytical HVP:
H_loss @ u = (p * u - p * ⟨p, u⟩) / n
Where p = softmax(logits) and n is the count of non-ignored positions.
Sources: hessian_eigenthings/loss_fns/huggingface.py:1-60
Fused CE HVP Implementations
The library provides three backend implementations for the cross-entropy HVP:
| Backend | Description | Memory | Speedup |
|---|---|---|---|
"eager" | Plain PyTorch reference | Highest | 1x baseline |
"compile" | torch.compile-fused; Inductor fuses operations | Reduced | ~2.6x faster |
"triton" | Hand-written CUDA Triton kernel | Minimal (output buffer only) | ~3.4x faster |
Sources: hessian_eigenthings/loss_fns/_fused_ce_hvp.py:1-50
Empirical Fisher
Definition
The Empirical Fisher matrix F is defined as:
F = (1/n) Σ ∇lᵢ · ∇lᵢᵀ
Where the expectation over data is replaced by the empirical average over batches. It is always PSD and serves as an approximation to the Fisher Information Matrix.
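Given stacked per-sample gradients, the matvec follows directly from this definition. A sketch with an assumed (n, p) gradient matrix; the operator itself computes this batch-wise without materializing all per-sample gradients at once:

```python
import torch

def empirical_fisher_matvec(per_sample_grads: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # F v = (1/n) Σᵢ gᵢ (gᵢ · v): two matrix-vector products, never the (p, p) matrix.
    # per_sample_grads: (n, p) stacked gradients; v: (p,) probe vector.
    coeffs = per_sample_grads @ v  # (n,), entries gᵢ · v
    return per_sample_grads.T @ coeffs / per_sample_grads.shape[0]
```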
EmpiricalFisherOperator
class EmpiricalFisherOperator(CurvatureOperator):
def __init__(
self,
model: nn.Module,
dataloader: Iterable[Any],
loss_fn: LossFn,
*,
per_sample: bool = False,
) -> None:
Sources: hessian_eigenthings/operators/fisher.py
Curvature Operator Interface
All curvature matrices implement the CurvatureOperator base class:
class CurvatureOperator(ABC):
@property
@abstractmethod
def size(self) -> int:
"""Number of parameters (matrix dimension)."""
...
@property
def dtype(self) -> torch.dtype:
...
@property
def device(self) -> torch.device:
...
@abstractmethod
def matvec(self, v: torch.Tensor) -> torch.Tensor:
"""Compute matrix-vector product A @ v."""
...
Sources: hessian_eigenthings/operators/base.py
Parameter Filtering
Curvature operators support computing curvature only over a subset of parameters using ParamFilter:
def match_names(*patterns: str) -> ParamFilter:
"""Match parameters by name patterns."""
def match_regex(pattern: str) -> ParamFilter:
"""Match parameters by regex pattern."""
This enables analysis of specific components:
# Analyze only attention weights
attn_op = HessianOperator(
model=model,
dataloader=dataloader,
loss_fn=loss_fn,
param_filter=match_names("blocks.*.attn.*"),
)
Algorithmic Foundations
Lanczos Eigendecomposition
The Lanczos algorithm computes eigenvalues and eigenvectors of large sparse matrices using only matrix-vector products:
def lanczos(
operator: CurvatureOperator,
k: int = 10,
max_iter: int = 100,
tol: float = 1e-6,
which: str = "LM",
) -> EigenResult:
The algorithm:
- Builds a tridiagonal matrix T from matvec operations
- Computes eigenvalues of T as Ritz approximations
- Accumulates eigenvectors directly via rank-1 outer-product updates
Sources: hessian_eigenthings/algorithms/lanczos.py:1-60
Trace Estimation
Trace estimation uses stochastic probing to estimate tr(A) without computing the full matrix:
| Method | Samples | Variance | Description |
|---|---|---|---|
| Hutchinson | m | O(1/√m) | (1/m) Σ vᵢᵀ A vᵢ with Rademacher/Gaussian vectors |
| Hutch++ | m | O(1/m) | Improved estimator with better constant factors |
def trace(
operator: CurvatureOperator,
*,
num_matvecs: int = 100,
method: Method = "hutch++",
) -> TraceResult:
Sources: hessian_eigenthings/algorithms/trace.py:1-50
Operator Selection Guide
graph LR
A[Need Curvature?] --> B{Exact Hessian?}
B -->|Yes, small model| C[HessianOperator<br/>method=autograd]
B -->|No| D{Need Fisher/GGN?}
D -->|Yes, cross-entropy| E[GGNOperator<br/>loss_hvp=analytical]
D -->|Yes, other loss| F[GGNOperator<br/>loss_hvp=autograd]
D -->|Empirical Fisher| G[EmpiricalFisherOperator]
C --> H[Use lanczos for eigenvalues]
E --> H
F --> H
G --> H
| Scenario | Recommended Operator | Method |
|---|---|---|
| Exact Hessian, single GPU | HessianOperator | method="autograd" |
| Large model, distributed | HessianOperator | method="finite_difference" |
| Language modeling, cross-entropy | GGNOperator | loss_hvp="analytical" |
| Custom loss, need exact | GGNOperator | loss_hvp="autograd" |
| Natural gradient optimization | EmpiricalFisherOperator | Default |
Module Exports
from hessian_eigenthings.operators import (
CurvatureOperator,
HessianOperator,
GGNOperator,
EmpiricalFisherOperator,
DDPHessianOperator, # For DistributedDataParallel
LambdaOperator, # Custom curvature wrappers
)
from hessian_eigenthings.algorithms import (
lanczos, # Top-k eigendecomposition
trace, # Trace estimation
spectral_density, # Density plot via SLQ
deflated_power_iteration,
)
Sources: hessian_eigenthings/operators/__init__.py:1-25
Summary
The hessian-eigenthings library provides a unified interface for computing curvature information in neural networks:
- Hessian: Exact local curvature via autograd or finite-difference approximation
- GGN: Positive semi-definite approximation ideal for large-scale analysis
- Empirical Fisher: Sample-based Fisher approximation for natural gradient methods
All operators provide matrix-free matvec implementations, enabling eigendecomposition and trace estimation for models with billions of parameters without explicit matrix construction.
Sources: hessian_eigenthings/operators/hessian.py:1-30
Why Hessian-Vector Products
Related topics: Curvature Matrices Explained, Eigendecomposition Algorithms
Hessian-vector products (HVPs) are the computational foundation of this library. Understanding *why* we use HVPs instead of computing the full Hessian matrix is essential for appreciating the design and capabilities of hessian-eigenthings.
The Full Hessian Problem
The Hessian matrix $H$ of a neural network's loss function is a second-order partial derivative matrix with dimensions $[n \times n]$, where $n$ is the number of parameters. For modern large-scale models:
| Model | Parameters | Hessian Size | Memory (fp32) |
|---|---|---|---|
| BERT-Base | 110M | 110M × 110M | ~48 PB |
| GPT-2 | 1.5B | 1.5B × 1.5B | ~9 EB |
| LLaMA-7B | 7B | 7B × 7B | ~196 EB |
Storing the full Hessian is fundamentally infeasible. Even computing it via automatic differentiation requires $O(n^2)$ operations and memory that scales quadratically with model size.
Sources: README.md:1-30
What is a Hessian-Vector Product?
A Hessian-vector product computes $Hv$ for a given vector $v$ without ever constructing $H$ explicitly. The operation takes $O(n)$ time and memory—linear in the number of parameters.
Formally, given:
- Loss function $\mathcal{L}(\theta)$
- Parameter vector $\theta \in \mathbb{R}^n$
- Direction vector $v \in \mathbb{R}^n$
The HVP is: $$Hv = \nabla_\theta^2 \mathcal{L} \cdot v = \frac{\partial}{\partial \theta} \left( \nabla_\theta \mathcal{L} \cdot v \right)$$
This is implemented as a double-backward pass:
- Forward pass → compute loss
- Backward pass → compute gradient $\nabla_\theta \mathcal{L}$
- Second backward pass → differentiate $\nabla_\theta \mathcal{L} \cdot v$ to obtain $H \cdot v$
Sources: hessian_eigenthings/operators/hessian.py:1-50
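A minimal sketch of this recipe using standard torch.autograd calls (illustrative, not the library's HessianOperator):

```python
import torch

def hvp(loss: torch.Tensor, params: list[torch.Tensor],
        v: list[torch.Tensor]) -> tuple[torch.Tensor, ...]:
    # Backward pass kept differentiable so we can differentiate again.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # Differentiating <grad, v> w.r.t. params yields H v.
    gv = sum((g * vi).sum() for g, vi in zip(grads, v))
    return torch.autograd.grad(gv, params)
```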
Why HVPs Enable Scalable Curvature Analysis
By avoiding explicit Hessian construction, HVP-based algorithms can operate on models of any size. This library provides several key algorithms that all rely on HVP as their primitive operation:
graph TD
A[Hessian-Vector Product] --> B[Lanczos Eigendecomposition]
A --> C[Hutchinson Trace Estimation]
A --> D[Hutch++ Trace Estimation]
A --> E[Stochastic Lanczos Quadrature]
B --> F[Top-k Eigenvalues & Eigenvectors]
C --> G[Trace Estimation]
D --> G
E --> H[Spectral Density Plot]
Eigendecomposition via Lanczos
The Lanczos algorithm iteratively builds an orthogonal basis that tridiagonalizes the operator. It requires only matrix-vector products, making it perfect for HVP-based curvature analysis:
| Property | Full Eigendecomp | Lanczos + HVP |
|---|---|---|
| Memory | $O(n^2)$ | $O(n \cdot k)$ |
| Time | $O(n^3)$ | $O(n \cdot k^2)$ |
| Storage | Entire matrix | $k$ Lanczos vectors |
Where $k$ is the number of desired eigenpairs (typically 1-20).
Sources: hessian_eigenthings/algorithms/lanczos.py:1-60
Trace Estimation via Hutchinson's Method
The trace of the Hessian can be estimated without constructing the full matrix:
$$\text{tr}(H) \approx \frac{1}{m} \sum_{i=1}^{m} v_i^T H v_i$$
where $v_i$ are random probe vectors (typically Rademacher or Gaussian). Each term $v_i^T H v_i$ is a single HVP plus a dot product.
Sources: hessian_eigenthings/algorithms/trace.py:1-45
HVP Implementation Strategies
The library provides two distinct methods for computing HVPs, each with different trade-offs:
Method 1: Autograd (Default)
Uses torch.autograd.grad with create_graph=True for exact double-backward computation:
def __init__(
self,
model: nn.Module,
dataloader: Iterable[Any],
loss_fn: LossFn,
*,
method: HvpMethod = "autograd", # Default
...
) -> None:
Advantages:
- Numerically exact (to floating-point rounding)
- Works with any differentiable loss function
- Simple implementation
Disadvantages:
- Builds the full computation graph for second derivatives
- Memory scales with model complexity and output size
Sources: hessian_eigenthings/operators/hessian.py:20-45
Method 2: Finite Difference
Uses central-difference approximation: $$\frac{\nabla_\theta \mathcal{L}(\theta + \epsilon v) - \nabla_\theta \mathcal{L}(\theta - \epsilon v)}{2\epsilon}$$
Advantages:
- No second-backward graph → lower memory footprint
- Compatible with distributed training (FSDP/HSDP/TP) without special handling
Disadvantages:
- $O(\epsilon^2)$ truncation bias
- Precision-dependent roundoff (~1e-5 fp32, ~1e-2 bf16)
Sources: hessian_eigenthings/operators/hessian.py:30-40
Fused HVP for Cross-Entropy Losses
For language models with large vocabulary softmax heads, computing $H_{\text{loss}} \cdot u$ naively allocates multiple $(N, V)$ intermediate tensors (where $V$ is vocabulary size). This is addressed with fused implementations:
| Backend | Speedup | Memory Reduction | Requirements |
|---|---|---|---|
| eager | 1× (baseline) | 1× | Any |
| compile | ~2.6× | ~2× | torch.compile |
| triton | ~3.4× | ~2× | CUDA + Triton |
The fused computation computes: $$H_{\text{loss}} \cdot u = \frac{p \odot u - p \odot \langle p, u \rangle}{n} \odot \text{mask}$$
Where $p = \text{softmax}(\text{logits})$ and the implementation avoids materializing the full $(N, V)$ softmax output.
Sources: hessian_eigenthings/loss_fns/_fused_ce_hvp.py:1-50
Generalized Gauss-Newton (GGN) Approximation
For optimization-focused curvature analysis, the GGN matrix $G$ provides a positive semi-definite (PSD) approximation to the Hessian:
$$G = J^T \cdot H_{\text{loss}} \cdot J$$
Where $J$ is the Jacobian of the model outputs with respect to parameters. For cross-entropy + softmax classification, $G$ equals the Fisher information matrix. The GGN is always PSD by construction, making it suitable for optimization algorithms.
Sources: hessian_eigenthings/operators/ggn.py:1-40
GGN Matvec Implementation
The GGNOperator supports two matvec paths:
- Analytical (default): Finite-difference JVP + analytical loss-Hessian-vector product + single normal backward. Memory footprint matches one normal training step.
- Autograd: Original torch.func.jvp + autograd double-backward + torch.func.vjp. Numerically exact but scales badly with vocabulary size.
Sources: hessian_eigenthings/operators/ggn.py:25-45
Practical Implications
The HVP approach enables:
| Capability | HVP-Based | Full Hessian |
|---|---|---|
| 7B parameter model | ✅ ~hours | ❌ impossible |
| Top-10 eigenpairs | ✅ | ❌ |
| Trace estimation | ✅ | ❌ |
| Spectral density | ✅ | ❌ |
| FSDP compatibility | ✅ (finite-diff) | ❌ |
The eigenvalues and eigenvectors of the Hessian have been implicated in generalization properties of neural networks. Researchers hypothesize that "flat minima" generalize better, that Hessians of large models are very low-rank, and that curvature analysis can guide optimization.
Sources: README.md:25-35
Summary
Hessian-vector products are the fundamental building block that makes large-scale curvature analysis possible:
- Memory efficiency: $O(n)$ vs $O(n^2)$ for the full Hessian
- Computational efficiency: $O(n)$ per matvec vs $O(n^2)$ for full computation
- Scalability: Works with models of any size via iterative algorithms
- Flexibility: Supports exact (autograd) or memory-efficient (finite-difference) computation
The hessian-eigenthings library provides production-ready implementations of HVP computation and HVP-based algorithms for practical curvature analysis in PyTorch.
Sources: hessian_eigenthings/operators/hessian.py:1-50
System Architecture
Related topics: Curvature Operators, Eigendecomposition Algorithms, Loss Functions
The pytorch-hessian-eigenthings library provides an efficient and scalable framework for computing eigendecompositions of curvature matrices—including the Hessian, Generalized Gauss-Newton (GGN) matrix, and empirical Fisher—for arbitrary PyTorch models. The architecture is designed around three core abstractions: Curvature Operators, Algorithms, and Linear Algebra Backends.
High-Level Architecture Overview
The library implements a layered architecture that separates mathematical curvature computations from numerical algorithms:
graph TD
subgraph "User Layer"
U[User Code]
end
subgraph "Algorithm Layer"
LA[Lanczos]
TR[Trace Estimation]
SP[Stochastic Power Iteration]
end
subgraph "Operator Layer"
HO[HessianOperator]
GGN[GGNOperator]
FO[FisherOperator]
end
subgraph "Backend Layer"
B[LinAlgBackend]
SD[SingleDeviceBackend]
end
subgraph "PyTorch Core"
PT[PyTorch Autograd]
end
U -->|uses| LA
U -->|uses| TR
U -->|uses| HO
LA -->|operates on| HO
TR -->|operates on| HO
HO -->|implemented via| B
B -->|delegates to| PT
Core Components
1. Curvature Operators
Curvature operators are the foundation of the library. They abstract away the details of how matrix-vector products (matvecs) with curvature matrices are computed, providing a unified interface for algorithms to work with.
#### Base Interface
All operators inherit from CurvatureOperator, which defines the contract for curvature computations:
| Property/Method | Type | Description |
|---|---|---|
| size | int | Total number of parameters in the curvature matrix |
| dtype | torch.dtype | Data type of the operator |
| device | torch.device | Device where computations run |
| matvec(v) | Callable | Computes A @ v for input vector v |
#### Hessian Operator
The HessianOperator computes the Hessian of a loss function with respect to model parameters:
HessianOperator(
model: nn.Module,
dataloader: Iterable[Any],
loss_fn: LossFn,
*,
param_filter: ParamFilter | None = None,
method: HvpMethod = "autograd" # or "finite_difference"
)
Sources: hessian_eigenthings/operators/hessian.py:1-50
Two HVP computation methods are supported:
| Method | Description | Use Case |
|---|---|---|
"autograd" (default) | Exact double-backward via torch.autograd.grad | Up to ~7B parameters |
"finite_difference" | Central-difference approximation | FSDP/HSDP/TP at scale |
The finite difference method uses the approximation:
H(v) ≈ (∇L(θ+εv) − ∇L(θ−εv)) / 2ε
This avoids the second-backward graph entirely, making it compatible with distributed training setups.
#### GGN Operator
The GGNOperator implements the Generalized Gauss-Newton matrix, which is always positive semi-definite:
GGNOperator(
model: nn.Module,
dataloader: Iterable[Any],
forward_fn: ForwardFn,
loss_of_output_fn: LossOfOutputFn,
*,
loss_hvp: Literal["analytical", "autograd"] = "analytical"
)
Sources: hessian_eigenthings/operators/ggn.py:1-80
The GGN decomposes as G = J^T · H_loss · J where:
- J is the Jacobian of the model output with respect to parameters
- H_loss is the Hessian of the loss with respect to the output
For cross-entropy + softmax classification, G equals the Fisher information matrix.
2. Algorithms Layer
Algorithms operate on any CurvatureOperator via its matvec interface, enabling eigenvalue computation, trace estimation, and spectral density analysis.
#### Lanczos Eigensolver
The Lanczos algorithm computes the top-k eigenvalues and eigenvectors of a symmetric matrix using only matrix-vector products:
lanczos(
operator: CurvatureOperator,
k: int = 10,
max_iter: int = 100,
tol: float = 1e-3,
which: str = "LA" # LA, SA, or LM
) -> EigendecompositionResult
Sources: hessian_eigenthings/algorithms/lanczos.py:1-50
Key features:
- Ritz vector accumulation: Directly accumulates Ritz vectors into the final (k, n) layout via rank-1 outer-product updates, avoiding transient (n, k) transpose copies
- Convergence tracking: Monitors residual norms |β_k · s_k| to determine convergence
- Eigenvalue selection: Supports "LA" (largest algebraic), "SA" (smallest algebraic), and "LM" (largest magnitude)
#### Trace Estimation
The library provides multiple trace estimation methods:
| Method | Description | Notes |
|---|---|---|
| hutchinson | Classical Hutchinson: (1/m) Σ vᵢᵀ A vᵢ | Higher variance |
| hutch++ | Improved estimator with lower variance | ~30 matvecs recommended |
trace(
operator: CurvatureOperator,
num_matvecs: int = 30,
method: str = "hutch++",
seed: int | None = None
) -> TraceResult
Sources: hessian_eigenthings/algorithms/trace.py:1-40
The Hutch++ estimator achieves lower variance by spending part of the matvec budget on a low-rank sketch of the operator and deflating it from the Hutchinson estimate.
3. Backend Layer
The LinAlgBackend abstract interface decouples linear algebra operations from specific device implementations:
classDiagram
class LinAlgBackend~T~ {
<<abstract>>
+matmul(a, b) T
+dot(a, b) T
+norm(v) T
+fill(v, value) T
+copy(v) T
}
class SingleDeviceBackend {
+matmul(a, b) Tensor
+dot(a, b) Tensor
+norm(v) Tensor
}
LinAlgBackend <|-- SingleDeviceBackend
Backends provide:
- Vector arithmetic operations (dot product, norm, fill, copy)
- Device-specific optimizations
- Memory allocation strategies
Data Flow
Eigendecomposition Workflow
sequenceDiagram
participant User
participant Operator as CurvatureOperator
participant Backend as LinAlgBackend
participant Algo as Lanczos Algorithm
participant PyTorch as PyTorch Autograd
User->>Operator: Instantiate with model, dataloader
User->>Algo: Call lanczos(operator, k)
Algo->>Operator: Request matvec(v)
Operator->>Backend: Allocate probe vector
Backend->>PyTorch: Create tensor
Operator->>PyTorch: Forward pass + backward
PyTorch-->>Operator: Return HVP result
Operator-->>Algo: Return Av
Algo->>Algo: Repeat for m iterations
Algo-->>User: Return eigenvalues, eigenvectors
Loss Function Integration
The library supports two loss function patterns:
graph LR
subgraph "Single Function API"
L1[loss_fn<br/>model, batch → scalar]
end
subgraph "Two Function API (for GGN)"
F1[forward_fn<br/>model, batch → output]
L2[loss_of_output_fn<br/>output, batch → scalar]
end
L1 --> HO[HessianOperator]
F1 --> GGN[GGNOperator]
L2 --> GGN
#### HuggingFace Integration
For language models, the library provides optimized loss functions:
hf_lm_loss(fused="auto") # Auto-selects Triton or torch.compile
Sources: hessian_eigenthings/loss_fns/huggingface.py:1-30
The fused implementation:
- Uses Triton kernels on CUDA (~3.4x speedup, 2x peak-memory reduction)
- Falls back to torch.compile (~2.6x speedup, 2x peak-memory reduction)
- Eliminates most (N, V) intermediates
Configuration Options
HessianOperator Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | nn.Module | Required | PyTorch model |
| dataloader | Iterable | Required | Data batches |
| loss_fn | LossFn | Required | Loss computation function |
| param_filter | ParamFilter | None | Filter parameters by name |
| method | HvpMethod | "autograd" | HVP computation method |
| fd_eps | float | None | Finite difference epsilon |
GGNOperator Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| loss_hvp | str | "analytical" | "analytical" or "autograd" |
| full_dataset | bool | True | Average over full dataset |
| num_batches | int | None | Limit to first N batches |
| microbatch_size | int | None | Process in smaller chunks |
Lanczos Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| k | int | 10 | Number of eigenvalues to compute |
| max_iter | int | 100 | Maximum Lanczos iterations |
| tol | float | 1e-3 | Convergence tolerance |
| which | str | "LA" | Which eigenvalues ("LA", "SA", "LM") |
| reorthogonalize | bool | False | Full reorthogonalization |
Usage Patterns
Basic Hessian Eigenvalue Computation
from hessian_eigenthings import HessianOperator, lanczos
# Create operator
hessian_op = HessianOperator(
model=model,
dataloader=dataloader,
loss_fn=lambda m, b: torch.nn.functional.cross_entropy(m(b[0]), b[1])
)
# Compute top eigenvalues
eig_result = lanczos(hessian_op, k=10, max_iter=100)
print(eig_result.eigenvalues)
Parameter-Filtered Analysis
from hessian_eigenthings import HessianOperator, lanczos, match_names
# Analyze only attention parameters
attn_op = HessianOperator(
model=model,
dataloader=dataloader,
loss_fn=loss_fn,
param_filter=match_names("blocks.*.attn.*")
)
eig_attn = lanczos(attn_op, k=3)
Trace Estimation
from hessian_eigenthings import HessianOperator, trace
trace_result = trace(
hessian_op,
num_matvecs=30,
method="hutch++",
seed=42
)
print(f"Trace estimate: {trace_result.estimate:.4e}")
Architecture Benefits
| Benefit | Description |
|---|---|
| Separation of Concerns | Operators define "what" to compute; algorithms define "how" |
| Flexibility | Any operator can use any algorithm |
| Scalability | Backends enable device-specific optimizations |
| Composability | Easy to add new operators or algorithms |
| Memory Efficiency | Matrix-free design avoids explicit matrix storage |
Extension Points
Adding Custom Curvature Operators
New operators should subclass CurvatureOperator and implement the matvec method:
class CustomCurvatureOperator(CurvatureOperator):
    def __init__(self, model, dataloader):
        super().__init__()
        self.model = model
        self.dataloader = dataloader
        # Register parameters / set up any state the product needs
    def matvec(self, v: torch.Tensor) -> torch.Tensor:
        # Implement the matrix-free product A @ v
        return custom_computation(v)  # placeholder for your computation
Adding New Algorithms
Algorithms should accept any CurvatureOperator and use the backend exclusively:
def custom_algorithm(
operator: CurvatureOperator,
backend: LinAlgBackend | None = None
) -> SomeResult:
backend = backend or SingleDeviceBackend()
# Use backend for all vector operations
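For instance, a bare-bones power iteration written against that contract (the operator only needs to expose .size and .matvec; torch is used directly in place of a backend for brevity):

```python
import torch

def power_iteration_top_eig(operator, num_iters: int = 100, seed: int = 0):
    # Estimates the dominant eigenpair of any matvec-only operator.
    g = torch.Generator().manual_seed(seed)
    v = torch.randn(operator.size, generator=g)
    v = v / v.norm()
    for _ in range(num_iters):
        w = operator.matvec(v)
        v = w / w.norm()
    return torch.dot(v, operator.matvec(v)), v  # (eigenvalue, eigenvector)
```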
Summary
The system architecture of pytorch-hessian-eigenthings follows a clean, modular design that separates curvature matrix computation (operators), numerical algorithms (Lanczos, trace estimation), and linear algebra primitives (backends). This design enables efficient Hessian and GGN eigendecomposition for models ranging from small MLPs to large language models, with support for distributed training and optimized fused computations.
Sources: hessian_eigenthings/operators/hessian.py:1-50
Curvature Operators
Related topics: System Architecture, Distributed Computing with DDP
Overview
Curvature Operators in hessian-eigenthings provide a matrix-free abstraction for computing Hessian eigendecomposition and related curvature matrices for arbitrary PyTorch models. They implement the CurvatureOperator base class interface, enabling efficient computation of eigenvalues, eigenvectors, traces, and spectral densities without explicitly forming potentially massive matrices.
The core abstraction allows algorithms (Lanczos, power iteration, Hutch++) to operate on any curvature matrix through a unified matvec(v) interface that computes $Av$ for any vector $v$, enabling scalability to large models with billions of parameters.
Sources: hessian_eigenthings/__init__.py:1-10
Architecture
graph TD
subgraph "Curvature Operators"
Base[CurvatureOperator<br/>Base Class]
Hessian[HessianOperator]
GGN[GGNOperator]
Fisher[EmpiricalFisherOperator]
Lambda[LambdaOperator]
DDP[DDPHessianOperator]
end
subgraph "Algorithms"
Lanczos[Lanczos Eigendecomposition]
Power[Power Iteration]
Trace[Trace Estimation<br/>Hutch++/Hutchinson]
Spectral[Spectral Density<br/>Stochastic Lanczos Quadrature]
end
Base --> Hessian
Base --> GGN
Base --> Fisher
Base --> Lambda
Base --> DDP
Hessian --> Lanczos
GGN --> Lanczos
Fisher --> Lanczos
Lambda --> Lanczos
Hessian --> Trace
GGN --> Trace
Fisher --> Trace
Hessian --> Power
GGN --> Power
Fisher --> Power
Hessian --> Spectral
GGN --> Spectral
Fisher --> Spectral
Base Class: CurvatureOperator
All curvature operators inherit from CurvatureOperator, which defines the contract that subclasses must fulfill.
Core Interface
| Method | Description |
|---|---|
| matvec(v) | Compute $Av$ where $A$ is the curvature matrix |
| size | Total number of parameters in the operator's scope |
| dtype, device | Tensor dtype and device for vector operations |
Sources: hessian_eigenthings/operators/base.py
Parameter Filtering
Curvature operators can be restricted to subsets of model parameters using param_filter, enabling analysis of specific components (e.g., attention layers only).
from hessian_eigenthings import HessianOperator, match_names
# Filter to attention parameters only
attn_op = HessianOperator(
model=model,
dataloader=dataloader,
loss_fn=loss_fn,
param_filter=match_names("blocks.*.attn.*")
)
Sources: hessian_eigenthings/__init__.py:35-38
HessianOperator
Computes the Hessian $\nabla_{\theta}^2 \mathcal{L}$ of the loss function with respect to model parameters.
Key Features
- Two HVP methods: autograd (exact double-backward via torch.autograd.grad) and finite_difference (central difference for FSDP/TP compatibility)
- Batched computation: Automatically averages over multiple batches from the dataloader
- Microbatch support: For large models, process batches in smaller microbatches
Constructor Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | nn.Module | Required | PyTorch model |
| dataloader | Iterable | Required | Data batches |
| loss_fn | LossFn | Required | Loss computation function |
| param_filter | ParamFilter \| None | None | Parameter name filter |
| full_dataset | bool | True | Average over full dataset |
| num_batches | int \| None | None | Limit batches for stochastic estimate |
| microbatch_size | int \| None | None | Split batches into smaller microbatches |
| method | HvpMethod | "autograd" | HVP computation method |
| fd_eps | float \| None | None | Finite difference epsilon |
| backend | LinAlgBackend \| None | None | Linear algebra backend |
HVP Method Comparison
| Method | Accuracy | Memory | FSDP/TP Compatible | Speed |
|---|---|---|---|---|
| autograd | Exact (to rounding) | High | No | Fast |
| finite_difference | $O(\epsilon^2)$ bias | Low | Yes | 2x passes |
Sources: hessian_eigenthings/operators/hessian.py:1-60
Finite Difference Epsilon Table
| Dtype | Optimal $\epsilon$ |
|---|---|
| float64 | 6e-6 |
| float32 | 5e-3 |
| bfloat16 | 0.2 |
| float16 | 5e-2 |
Sources: hessian_eigenthings/operators/hessian.py:34-40
GGNOperator
The Generalized Gauss-Newton (GGN) matrix $G = J^T H_{loss} J$ provides a PSD approximation to the Hessian that is computationally cheaper while preserving the eigenvalues that matter for optimization.
Key Features
- Always PSD: Unlike the exact Hessian, the GGN is positive semi-definite by construction
- Analytical HVP path: For losses with known HVP (e.g., cross-entropy), uses analytical computation
- For cross-entropy + softmax: $G$ equals the Fisher information matrix
Two Matvec Implementations
| loss_hvp | Description | Memory | Use Case |
|---|---|---|---|
| "analytical" (default) | FD JVP + analytical loss-Hessian-vec + one backward | Matches one training step | LM-scale, large vocab |
| "autograd" | torch.func.jvp + double-backward + torch.func.vjp | Scales with output size | Exact, small vocab |
Sources: hessian_eigenthings/operators/ggn.py:1-50
Fused Cross-Entropy HVP
For language model training, the GGN operator includes a fused kernel for the CE HVP computation:
# Auto-selects fastest backend: Triton > torch.compile > eager
hf_lm_loss_of_output(..., fused="auto")
The fused implementation reduces peak memory by 2x compared to eager, with Triton providing ~3.4x speedup on CUDA.
Sources: hessian_eigenthings/loss_fns/huggingface.py:1-30
EmpiricalFisherOperator
Computes the empirical Fisher information matrix $F = \frac{1}{N} \sum_{i=1}^N \nabla_{\theta} \log p(y_i|x_i) \nabla_{\theta} \log p(y_i|x_i)^T$.
For classification with cross-entropy loss, the GGN equals the Fisher computed under the model's predictive distribution; the empirical Fisher, which uses observed labels in place of that expectation, is an approximation to it.
Sources: hessian_eigenthings/operators/fisher.py
LambdaOperator
Creates custom curvature operators from lambda functions for testing or custom curvature definitions.
from hessian_eigenthings import LambdaOperator
# Custom operator that always returns a scaled vector
custom_op = LambdaOperator(
size=1000,
matvec=lambda v: 2.0 * v # Represents 2*I
)
DDPHessianOperator
Distributed Data Parallel-aware Hessian operator that handles gradient synchronization across processes.
Sources: hessian_eigenthings/operators/__init__.py:15-18
Common Usage Patterns
Computing Top Eigenvalues
from hessian_eigenthings import HessianOperator, lanczos
operator = HessianOperator(
model=model,
dataloader=dataloader,
loss_fn=loss_fn
)
result = lanczos(operator, k=10, max_iter=100)
print(f"Top eigenvalues: {result.eigenvalues}")
Estimating Trace
from hessian_eigenthings import GGNOperator, trace
operator = GGNOperator(
model=model,
dataloader=dataloader,
forward_fn=model_forward,
loss_of_output_fn=loss_fn
)
result = trace(operator, num_matvecs=100, method="hutch++")
print(f"Trace estimate: {result.estimate:.4e} ± {result.stderr:.4e}")
Component-Specific Analysis
from hessian_eigenthings import HessianOperator, match_regex
# Analyze only attention weights in transformer
attn_only = HessianOperator(
model=model,
dataloader=dataloader,
loss_fn=loss_fn,
param_filter=match_regex(r"blocks\.\d+\.attn\.")
)
# Analyze only MLP weights
mlp_only = HessianOperator(
model=model,
dataloader=dataloader,
loss_fn=loss_fn,
param_filter=match_regex(r"blocks\.\d+\.mlp\.")
)
Linear Algebra Backends
The operators use pluggable LinAlgBackend for vector operations, enabling support for different hardware configurations and precision requirements.
| Backend | Use Case |
|---|---|
| SingleDeviceBackend | Single GPU/CPU |
| (Distributed backends) | Multi-GPU via FSDP/TP |
Module Exports
from hessian_eigenthings.operators import (
CurvatureOperator,
DDPHessianOperator,
EmpiricalFisherOperator,
GGNOperator,
HessianOperator,
LambdaOperator,
)
Sources: hessian_eigenthings/__init__.py:1-10
Eigendecomposition Algorithms
Related topics: System Architecture, Why Hessian-Vector Products
Eigendecomposition Algorithms
The hessian-eigenthings library provides a suite of efficient iterative algorithms for computing eigendecompositions of curvature matrices (Hessian, Generalized Gauss-Newton, and Fisher) in PyTorch models. These algorithms enable analysis of neural network loss landscapes by extracting eigenvalues, eigenvectors, spectral densities, and trace estimates without explicitly constructing the full curvature matrix—a critical capability for modern large-scale models.
Overview
Computing eigendecompositions of curvature matrices is fundamental to understanding generalization properties, flat minima, and training dynamics of neural networks. However, these curvature matrices are prohibitively large (n × n where n is the number of parameters), making explicit construction impossible for modern models.
The library implements Krylov subspace methods that only require matrix-vector products, enabling efficient computation of:
| Capability | Algorithm | Use Case |
|---|---|---|
| Top-k eigenvalues/eigenvectors | Lanczos, Power Iteration | Finding most-curved directions |
| Trace estimation | Hutchinson, Hutch++ | Computing average curvature |
| Spectral density | Stochastic Lanczos Quadrature | Visualizing eigenvalue distribution |
Sources: hessian_eigenthings/algorithms/__init__.py:1-29
Algorithm Architecture
The algorithms in this module follow a consistent design pattern: they accept any CurvatureOperator and use the LinAlgBackend exclusively for vector arithmetic, ensuring portability across single-device and distributed settings.
graph TD
A[CurvatureOperator] --> B[Lanczos Algorithm]
A --> C[Power Iteration]
A --> D[Trace Estimation]
A --> E[Spectral Density]
B --> F[EigenResult]
C --> F
D --> G[TraceResult]
E --> H[SpectralDensityResult]
F --> I[eigenvalues: Tensor]
F --> J[eigenvectors: Tensor]
F --> K[residuals: Tensor]
Sources: hessian_eigenthings/algorithms/result.py
Lanczos Algorithm
The Lanczos algorithm is the primary method for computing eigenvalues and eigenvectors of symmetric matrices. It builds a Krylov subspace through repeated matrix-vector products, then solves the small tridiagonal eigenvalue problem.
Symmetric Lanczos Implementation
The in-house Lanczos implementation provides optional full reorthogonalization to address the loss-of-orthogonality issues classical Lanczos is known for (Paige 1976).
def lanczos_tridiagonal(
operator: CurvatureOperator,
v0: torch.Tensor,
max_iter: int,
*,
reorthogonalize: bool = True,
backend: LinAlgBackend[torch.Tensor] | None = None,
) -> LanczosTridiag
Key characteristics:
- Default reorthogonalization: Enabled for max_iter <= 50 to suppress ghost eigenvalues
- Computational tradeoff: For larger Krylov dimensions, reorthogonalization becomes O(m²n), so it defaults off; users analyzing near-degenerate spectra should re-enable it
- Memory efficiency: Accumulates Ritz vectors directly via rank-1 outer-product updates, avoiding transient (n, k) → (k, n) transpose copies
Sources: hessian_eigenthings/algorithms/lanczos.py:30-58
Lanczos Output Structure
@dataclass(frozen=True)
class LanczosTridiag:
"""Output of one Lanczos run: tridiagonal coefficients + the basis used to build them."""
alphas: torch.Tensor # (m,) diagonal
betas: torch.Tensor # (m-1,) off-diagonal
basis: list[torch.Tensor] # length m, each (n,)
last_beta: float # ||r_m|| residual norm at termination
iterations: int # m, the actual number of Lanczos steps completed
Sources: hessian_eigenthings/algorithms/lanczos.py:30-43
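To show how this output is typically consumed, here is a minimal sketch (not the library's own eigensolver) that assembles the tridiagonal matrix from alphas and betas and extracts Ritz values with torch.linalg.eigvalsh:
import torch

def ritz_values(alphas: torch.Tensor, betas: torch.Tensor) -> torch.Tensor:
    # Build the (m, m) symmetric tridiagonal matrix T from the coefficients.
    T = torch.diag(alphas)
    if betas.numel() > 0:
        T = T + torch.diag(betas, 1) + torch.diag(betas, -1)
    # Eigenvalues of T (the Ritz values) approximate eigenvalues of the operator.
    return torch.linalg.eigvalsh(T)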
Full Lanczos Eigensolver
The high-level lanczos() function computes top-k eigenvalues with configurable eigenpair selection:
def lanczos(
operator: CurvatureOperator,
k: int = 1,
max_iter: int = 100,
*,
which: Which = "LM",
tol: float = 1e-8,
seed: int | None = None,
reorthogonalize: bool | None = None,
backend: LinAlgBackend[torch.Tensor] | None = None,
) -> EigenResult
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| operator | CurvatureOperator | required | The curvature matrix operator |
| k | int | 1 | Number of eigenpairs to compute |
| max_iter | int | 100 | Maximum Lanczos iterations |
| which | Literal["LM", "LA", "SA"] | "LM" | Which eigenvalues: LM=largest magnitude, LA=largest algebraic, SA=smallest algebraic |
| tol | float | 1e-8 | Convergence tolerance |
| seed | int \| None | None | Random seed for reproducibility |
| reorthogonalize | bool \| None | None | Override default reorthogonalization setting |
Sources: hessian_eigenthings/algorithms/lanczos.py:58-100
Eigenvalue Selection Logic
The algorithm selects eigenvalues based on the which parameter:
if which == "LM":
order = torch.argsort(theta.abs(), descending=True)
elif which == "LA":
order = torch.argsort(theta, descending=True)
elif which == "SA":
order = torch.argsort(theta, descending=False)
Sources: hessian_eigenthings/algorithms/lanczos.py:75-83
Power Iteration
Power iteration is a simpler method for finding the dominant eigenvalue. The library implements deflated power iteration to compute multiple eigenpairs sequentially by projecting out previously found directions.
Single Power Iteration
def power_iteration_one(
operator: CurvatureOperator,
v0: torch.Tensor,
max_iter: int,
tol: float = 1e-6,
backend: LinAlgBackend | None = None,
) -> tuple[torch.Tensor, torch.Tensor]
Sources: hessian_eigenthings/algorithms/power_iteration.py
Deflated Power Iteration
Deflated power iteration extends the basic method to compute multiple eigenpairs:
def deflated_power_iteration(
operator: CurvatureOperator,
num_eigs: int,
max_iter: int,
tol: float = 1e-6,
seed: int | None = None,
backend: LinAlgBackend | None = None,
) -> EigenResult
The deflation process removes previously found eigenvectors from the subspace before searching for the next eigenpair, preventing convergence to already-computed directions.
Sources: hessian_eigenthings/algorithms/power_iteration.py
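As a concrete illustration of that deflation step, here is a minimal single-device sketch under simplifying assumptions (dense vectors, symmetric operator); it is not the library's implementation:
import torch

def deflated_power_sketch(matvec, n, num_eigs=3, max_iter=100, tol=1e-6):
    eigvals, eigvecs = [], []
    for _ in range(num_eigs):
        v = torch.randn(n)
        v = v / v.norm()
        for _ in range(max_iter):
            w = matvec(v)
            # Deflation: project out every previously found eigenvector.
            for u in eigvecs:
                w = w - (u @ w) * u
            w_norm = w.norm()
            if w_norm == 0:
                break
            v_new = w / w_norm
            # Compare up to sign, since eigenvectors are sign-ambiguous.
            if min((v_new - v).norm(), (v_new + v).norm()) < tol:
                v = v_new
                break
            v = v_new
        eigvals.append((v @ matvec(v)).item())  # Rayleigh quotient
        eigvecs.append(v)
    return eigvals, eigvecs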
Trace Estimation
Trace estimation provides a way to compute the average eigenvalue (trace / dimension) without full eigendecomposition. This is computationally much cheaper and useful for understanding overall curvature magnitude.
Hutchinson's Estimator
Hutchinson's method estimates the trace using random probe vectors:
def hutchinson(
operator: CurvatureOperator,
*,
num_samples: int = 100,
distribution: Distribution = "rademacher",
seed: int | None = None,
backend: LinAlgBackend[torch.Tensor] | None = None,
) -> TraceResult
The estimator computes: (1/m) Σ vᵢᵀ A vᵢ
Where vᵢ are random vectors from the specified distribution. Rademacher distribution provides lower variance than Gaussian.
Sources: hessian_eigenthings/algorithms/trace.py:48-71
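A minimal sketch of the estimator, assuming a plain callable matvec on one device (the library routes these operations through its LinAlgBackend):
import torch

def hutchinson_sketch(matvec, n, num_samples=100):
    samples = []
    for _ in range(num_samples):
        # Rademacher probe: each entry is +1 or -1 with equal probability.
        v = (torch.randint(0, 2, (n,)) * 2 - 1).float()
        samples.append(torch.dot(v, matvec(v)))
    est = torch.stack(samples)
    # The mean is the trace estimate; the standard error quantifies sampling noise.
    return est.mean().item(), (est.std() / num_samples ** 0.5).item()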
Hutch++ Estimator
Hutch++ is an improved estimator with better convergence properties:
def hutch_plus_plus(
operator: CurvatureOperator,
*,
num_matvecs: int = 30,
seed: int | None = None,
backend: LinAlgBackend[torch.Tensor] | None = None,
) -> TraceResult
Hutch++ uses a structured random sampling approach that achieves lower variance than standard Hutchinson with the same number of matrix-vector products.
Sources: hessian_eigenthings/algorithms/trace.py:22-47
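The structure of Hutch++ can be illustrated with the following hedged sketch (dense single-device tensors, matvec budget split in thirds; not the library's implementation):
import torch

def hutchpp_sketch(matvec, n, num_matvecs=30):
    # Split the budget in thirds: sketch, projection, and residual probes.
    k = max(num_matvecs // 3, 1)
    S = torch.randn(n, k)
    AS = torch.stack([matvec(S[:, i]) for i in range(k)], dim=1)
    Q, _ = torch.linalg.qr(AS)  # orthonormal basis for the top directions
    AQ = torch.stack([matvec(Q[:, i]) for i in range(k)], dim=1)
    low_rank = (Q * AQ).sum()  # tr(Q^T A Q), computed exactly
    # Hutchinson on the deflated remainder (I - QQ^T) A (I - QQ^T).
    G = (torch.randint(0, 2, (n, k)) * 2 - 1).float()
    G = G - Q @ (Q.T @ G)
    AG = torch.stack([matvec(G[:, i]) for i in range(k)], dim=1)
    residual = (G * AG).sum() / k
    return (low_rank + residual).item()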
Unified Trace Interface
def trace(
operator: CurvatureOperator,
num_matvecs: int = 30,
*,
method: Literal["hutchinson", "hutch++"] = "hutch++",
seed: int | None = None,
backend: LinAlgBackend[torch.Tensor] | None = None,
) -> TraceResult
Validation: The num_matvecs parameter is validated to be at least 1.
Sources: hessian_eigenthings/algorithms/trace.py:71-84
Trace Result Structure
@dataclass
class TraceResult:
    estimate: float  # The trace estimate
    stderr: float  # Standard error of the estimate
    num_matvecs: int  # Number of matrix-vector products used
    operator_size: int  # Dimension of the operator
Sources: hessian_eigenthings/algorithms/result.py
Spectral Density
Spectral density estimation computes the eigenvalue distribution (density function) across the spectrum, enabling visualization and analysis of the full eigenvalue structure.
Stochastic Lanczos Quadrature
The spectral_density() function implements Stochastic Lanczos Quadrature (SLQ) to compute the spectral density:
def spectral_density(
operator: CurvatureOperator,
num_runs: int = 16,
lanczos_steps: int = 50,
seed: int | None = None,
backend: LinAlgBackend[torch.Tensor] | None = None,
) -> SpectralDensityResult
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| operator | CurvatureOperator | required | The curvature matrix operator |
| num_runs | int | 16 | Number of randomized runs for averaging |
| lanczos_steps | int | 50 | Lanczos iterations per run |
| seed | int \| None | None | Random seed |
Sources: hessian_eigenthings/algorithms/spectral_density.py
Spectral Density Result
@dataclass
class SpectralDensityResult:
    grid: torch.Tensor  # Eigenvalue grid points
    density: torch.Tensor  # Density values at each grid point
    eigenvalues: list[torch.Tensor]  # Eigenvalues from each run
    eigenvectors: list[list[torch.Tensor]]  # Corresponding eigenvectors
The spectral density integrates to 1: ∫ density(λ) dλ ≈ 1, which can be verified using numerical integration.
Sources: hessian_eigenthings/algorithms/result.py
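A quick way to verify this property, assuming the result fields documented above and an operator built as in the earlier examples:
import torch
from hessian_eigenthings import spectral_density

# `operator` as constructed in the earlier examples
result = spectral_density(operator, num_runs=8, lanczos_steps=40, seed=0)
mass = torch.trapezoid(result.density, result.grid)  # trapezoidal integration
print(f"Integrated density: {mass.item():.4f} (expected close to 1)")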
Common Result Types
All algorithms return standardized result objects that encapsulate the computed quantities along with metadata about the computation.
@dataclass
class EigenResult:
    eigenvalues: torch.Tensor  # (k,) tensor of eigenvalues
    eigenvectors: torch.Tensor  # (k, n) matrix of eigenvectors
    residuals: torch.Tensor  # (k,) convergence residuals
    iterations: int  # Number of iterations run
    converged: bool  # Whether all eigenpairs converged
Sources: hessian_eigenthings/algorithms/result.py
Algorithm Selection Guide
graph LR
A[Goal] --> B{Eigenpairs?}
B -->|Yes, top-k| C[How many?]
C -->|1-10| D[Lanczos]
C -->|Many| E{Orthogonality critical?}
E -->|Yes| D
E -->|No| F[Deflated Power Iteration]
B -->|Trace only| G{Accuracy priority?}
G -->|High| H[Hutch++]
G -->|Standard| I[Hutchinson]
B -->|Full distribution| J[Spectral Density]
D --> K[EigenResult]
F --> K
H --> L[TraceResult]
I --> L
J --> M[SpectralDensityResult]
Decision Criteria
| Scenario | Recommended Algorithm | Notes |
|---|---|---|
| Top eigenvalues for single/large batch | lanczos() | Best accuracy, moderate cost |
| Quick dominant eigenvalue | deflated_power_iteration() | Lower memory, less accurate |
| Trace with limited matvecs | hutch_plus_plus() | Better convergence than Hutchinson |
| Trace estimation | hutchinson() | Simpler, more matvecs needed |
| Eigenvalue histogram/distribution | spectral_density() | Visualize full spectrum |
Integration with Curvature Operators
The algorithms are designed to work with any CurvatureOperator implementation, including:
- HessianOperator: Exact Hessian via autograd or finite differences
- GGNOperator: Generalized Gauss-Newton matrix
- EmpiricalFisherOperator: Empirical Fisher information matrix
- DDPHessianOperator: Distributed Data Parallel Hessian
This abstraction allows the same algorithm code to work across different curvature definitions without modification.
Sources: hessian_eigenthings/algorithms/__init__.py:1-29
Performance Considerations
Memory Efficiency in Lanczos
The Lanczos implementation optimizes memory for large-scale models by:
- Avoiding allocation of full (n, m) basis matrix
- Using rank-1 outer-product updates for eigenvector accumulation
- Computing Ritz vectors directly into final (k, n) layout
Reorthogonalization Tradeoffs
| Setting | Memory | Computation | Accuracy |
|---|---|---|---|
| reorthogonalize=True | O(mn) | O(m²n) | High orthogonality |
| reorthogonalize=False | O(mn) basis list | O(mn) | May have ghost eigenvalues |
For max_iter <= 50, reorthogonalization is enabled by default. For larger Krylov dimensions, it defaults off to maintain acceptable performance.
Sources: hessian_eigenthings/algorithms/lanczos.py:23-28
Example Usage
from hessian_eigenthings import HessianOperator, lanczos, trace, spectral_density
# Create Hessian operator
operator = HessianOperator(model, dataloader, loss_fn)
# Compute top-5 eigenvalues and eigenvectors
eig_result = lanczos(operator, k=5, max_iter=40, tol=1e-7, seed=0)
print(f"Top eigenvalue: {eig_result.eigenvalues[0]}")
# Estimate trace with Hutch++
trace_result = trace(operator, num_matvecs=99, method="hutch++", seed=0)
print(f"Trace estimate: {trace_result.estimate}")
# Compute spectral density
density_result = spectral_density(operator, num_runs=8, lanczos_steps=40, seed=0)
# Visualize with: plt.plot(density_result.grid, density_result.density)
Sources: examples/supervised_mlp.py:1-50
Sources: hessian_eigenthings/algorithms/__init__.py:1-29
Loss Functions
Related topics: Curvature Operators, Parameter Utilities
Loss functions in this repository serve as the bridge between model outputs and curvature operators (Hessian and Generalized Gauss-Newton matrices). They provide the necessary computations for Hessian-vector products and support multiple backend implementations optimized for different use cases.
Overview
The loss functions module (hessian_eigenthings/loss_fns/) provides two distinct function signatures depending on the target operator:
| Function Type | Signature | Used By |
|---|---|---|
| loss_fn | (model: nn.Module, batch: Any) -> torch.Tensor | HessianOperator |
| loss_of_output_fn | (output: torch.Tensor, batch: Any) -> torch.Tensor | GGNOperator |
Sources: hessian_eigenthings/operators/hessian.py:1-50
Sources: hessian_eigenthings/operators/ggn.py:1-60
Architecture
graph TD
A[Loss Function Entry Points] --> B[Standard Losses]
A --> C[HuggingFace Losses]
A --> D[TransformerLens Losses]
B --> B1[MSE Loss]
B --> B2[Cross-Entropy Loss with HVP]
C --> C1[Autoregressive LM Loss]
C --> C2[Shifted CE with Analytical HVP]
C --> C3[Fused CE HVP Backends]
D --> D1[TransformerLens HookedModel Loss]
C3 --> C3a[Triton Kernel]
C3 --> C3b[torch.compile]
C3 --> C3c[Eager Reference]
Standard Loss Functions
The standard.py module provides loss functions for common supervised learning scenarios with closed-form Hessian-vector products.
Sources: hessian_eigenthings/loss_fns/standard.py:1-80
MSE Loss
Returns a wrapper compatible with GGNOperator for mean-squared error loss:
def mse_loss_of_output() -> Callable[[torch.Tensor, tuple[torch.Tensor, torch.Tensor]], torch.Tensor]:
"""Make a `loss_of_output_fn` for `GGNOperator` from a (output, target) criterion."""
Cross-Entropy Loss with Analytical HVP
The cross-entropy implementation includes a closed-form Hessian-vector product for efficient computation:
def _ce_hvp(
output: torch.Tensor, batch: tuple[torch.Tensor, torch.Tensor], u: torch.Tensor
) -> torch.Tensor:
"""Closed-form H @ u for mean-reduced softmax + cross-entropy.
`output` has shape `(N, C)` (logits). For each row,
`H_row = (diag(p) - p p^T) / N` where `p = softmax(output)`.
"""
Mathematical Foundation:
For mean-reduced softmax + cross-entropy, the Hessian takes the form:
H_row = (diag(p) - p·p^T) / N
where:
- p = softmax(output) is the predicted probability distribution
- N is the number of samples
- u is the vector the Hessian is applied to
Sources: hessian_eigenthings/loss_fns/standard.py:40-55
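Transcribing this identity directly into PyTorch gives the following sketch of the row-wise product H_row @ u = (p * u - p <p, u>) / N; the library's _ce_hvp may differ in details:
import torch

def ce_hvp_sketch(logits: torch.Tensor, u: torch.Tensor) -> torch.Tensor:
    # logits and u both have shape (N, C); returns H @ u computed row by row.
    N = logits.shape[0]
    p = torch.softmax(logits, dim=-1)
    inner = (p * u).sum(dim=-1, keepdim=True)  # <p, u> per row
    return (p * u - p * inner) / N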
HuggingFace Transformers Integration
The huggingface.py module provides loss functions specifically designed for HuggingFace Transformers models. These handle the internal loss computation that occurs when labels are present in the batch.
Sources: hessian_eigenthings/loss_fns/huggingface.py:60-90
Autoregressive Language Model Loss
def hf_lm_loss() -> Callable[[nn.Module, dict[str, Any]], torch.Tensor]:
"""For autoregressive LMs: `loss_fn(model, batch)` calls `model(**batch).loss`."""
The batch must include labels so HuggingFace computes the loss internally. For causal language models, this is typically labels=input_ids with the standard internal shift.
Shifted Cross-Entropy with Analytical HVP
For large-scale language model analysis, a shifted cross-entropy variant provides both the loss function and its analytical Hessian-vector product:
def hf_lm_shifted_ce(fused: FusedCEHvpBackend = "auto") -> _LossOfOutputWithHvp:
"""Shifted CE loss with analytical H @ u for autoregressive LMs."""
Shift Mechanism:
- The loss shifts logits left (discards the last position) and labels right (discards the first position)
- Matches how cross_entropy(ignore_index=-100) handles gradient computation
Sources: hessian_eigenthings/loss_fns/huggingface.py:1-50
Fused Cross-Entropy HVP Backends
The _fused_ce_hvp.py module implements optimized backends for computing the cross-entropy Hessian-vector product.
Sources: hessian_eigenthings/loss_fns/_fused_ce_hvp.py:1-50
Backend Selection
| Backend | Description | Performance | Availability |
|---|---|---|---|
"auto" | Auto-select fastest available | Optimal | Default |
"triton" | Hand-written CUDA Triton kernel | ~3.4x speedup, 2x memory reduction | CUDA + Triton |
"compile" | torch.compile-fused | ~2.6x speedup, 2x memory reduction | torch >= 2.0 |
"eager" | Plain PyTorch reference | Baseline | Always |
FusedCEHvpBackend = Literal["auto", "eager", "compile", "triton"]
Sources: hessian_eigenthings/loss_fns/huggingface.py:25-35
Backend Resolution Logic
graph LR
A[Backend: "auto"] --> B{Device: CUDA?}
B -->|Yes + Triton available| C[Triton Kernel]
B -->|No| D{torch.compile available?}
D -->|Yes| E[torch.compile Backend]
D -->|No| F[Eager Backend]
The resolution checks:
- If backend != "auto", use the specified backend
- If "auto" and CUDA + Triton available → Triton
- If "auto" and torch.compile available → compile
- Otherwise → eager
Important: The Triton kernel asserts logits.is_cuda, so on CUDA-equipped hosts running CPU inputs, the system falls back to compile.
Sources: hessian_eigenthings/loss_fns/huggingface.py:40-60
Memory Optimization
At LM scale (e.g., B=64, T=256, V=50304, fp32):
| Implementation | Memory Footprint | Intermediate Tensors |
|---|---|---|
| Eager | ~19.6 GB | ~6 (N, V) tensors |
| Compile | ~3.3 GB | ~1 (N, V) tensor |
| Target | ~3.3 GB | Output buffer only |
The fused implementations eliminate intermediates by computing:
out_flat = (p * u - p * <p, u>) * mask / n_valid
with shape (N, V) in a single kernel pass.
Sources: scripts/bench_fused_ce_hvp.py:1-60
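An eager reference for this masked computation might look like the following sketch (the mask and n_valid handling follow the formula above; the actual fused backends compute the same quantity in a single kernel pass):
import torch

def masked_ce_hvp_sketch(logits, labels, u, ignore_index=-100):
    # logits, u: (N, V); labels: (N,). Ignored rows contribute nothing,
    # and the mean is taken over valid rows only.
    p = torch.softmax(logits, dim=-1)
    mask = (labels != ignore_index).to(logits.dtype).unsqueeze(-1)
    n_valid = mask.sum().clamp(min=1)
    inner = (p * u).sum(dim=-1, keepdim=True)
    return (p * u - p * inner) * mask / n_valid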
Loss Function Wrapper
The _LossOfOutputWithHvp class wraps a loss function with its analytical Hessian-vector product:
class _LossOfOutputWithHvp:
"""Loss-of-output callable that also carries an analytical `.hvp` method.
Wraps a plain `(output, batch) -> loss` function and a `(output, batch, u)
-> H_loss @ u` function in a single callable. `GGNOperator` checks for the
presence of `.hvp` and uses it as the loss-Hessian-vector product, skipping
the autograd `create_graph=True` double-backward path entirely.
"""
Sources: hessian_eigenthings/loss_fns/huggingface.py:100-120
GGN Operator Integration
The GGNOperator automatically detects and uses analytical HVPs when available:
When loss_of_output_fn carries a .hvp attribute, GGNOperator picks it up automatically and skips the autograd double-backward. Two implementations of the matvec are available via loss_hvp=:
- "analytical" (default): finite-difference JVP + analytical loss-Hessian-vector product (read from loss_of_output_fn.hvp, which must be present) + a single normal backward to apply Jᵀ. Memory footprint matches one normal training step. Required for LM-scale use.
- "autograd": the original torch.func.jvp + autograd double-backward + torch.func.vjp path. Numerically exact and supports any loss, but memory scales badly with output size.
Sources: hessian_eigenthings/operators/ggn.py:10-30
TransformerLens Integration
For TransformerLens HookedModel architectures, a dedicated loss function handles the hook-based forward pass:
Sources: hessian_eigenthings/loss_fns/transformer_lens.py:1-40
def tlens_loss() -> Callable[[nn.Module, Any], torch.Tensor]:
"""Loss function for TransformerLens HookedModel."""
Workflow: Choosing a Loss Function
graph TD
A[Start] --> B{Model Type?}
B -->|Standard MLP/CNN| C[Use HessianOperator]
B -->|HuggingFace Transformers| D[Use GGNOperator]
B -->|TransformerLens| E[Use HessianOperator]
C --> F[standard.mse_loss_of_output]
C --> G[standard.cross_entropy_loss_of_output]
D --> H{Scale?}
H -->|Small model| I[hf_lm_loss with GGNOperator]
H -->|Large model| J[hf_lm_shifted_ce with GGNOperator]
E --> K[tlens_loss with HessianOperator]
J --> L[Choose HVP Backend]
L --> M{Device?}
M -->|CUDA + Triton| N[Use Triton backend]
M -->|CPU/MPS| O[Use compile or eager]
Complete API Reference
Standard Module
| Function | Returns | HVP Available |
|---|---|---|
| mse_loss_of_output() | loss_of_output_fn | No |
| cross_entropy_loss_of_output() | loss_of_output_fn | Yes |
| _ce_hvp() | Analytical HVP | - |
HuggingFace Module
| Function | Returns | HVP Available |
|---|---|---|
| hf_lm_loss() | loss_fn | Via GGNOperator |
| hf_lm_shifted_ce(fused) | _LossOfOutputWithHvp | Yes |
| _LossOfOutputWithHvp | Wrapper class | Via .hvp attribute |
Fused CE HVP Module
| Function | Description |
|---|---|
| _ce_hvp_reference() | Eager reference implementation |
| _get_compiled_impl() | Returns torch.compile wrapped version |
| compiled_ce_hvp() | Compiled backend entry point |
| triton_ce_hvp() | Triton kernel entry point |
Usage Examples
Standard Classification
import torch
from hessian_eigenthings.operators import HessianOperator

operator = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=lambda m, b: torch.nn.functional.cross_entropy(m(b[0]), b[1]),
)
HuggingFace Large Model (Memory-Optimized)
from hessian_eigenthings.operators import GGNOperator
from hessian_eigenthings.loss_fns.huggingface import hf_lm_shifted_ce
loss_fn = hf_lm_shifted_ce(fused="auto") # Auto-selects best backend
operator = GGNOperator(
model=model,
dataloader=dataloader,
forward_fn=lambda m, b: m(**b).logits,
loss_of_output_fn=loss_fn,
loss_hvp="analytical", # Default, uses .hvp attribute
)
Sources: [hessian_eigenthings/operators/hessian.py:1-50](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/hessian.py)
Parameter Utilities
Related topics: Curvature Operators
The parameter utilities module (param_utils.py) provides essential infrastructure for managing, filtering, and manipulating PyTorch model parameters within the Hessian eigendecomposition pipeline. These utilities form the foundation that connects curvature operators to the underlying model parameters, enabling efficient computation of Hessian-vector products and eigendecomposition across arbitrary subsets of model parameters.
Overview
When working with large neural networks, it is often necessary to compute curvature information for only a subset of parameters. The parameter utilities support this use case through a flexible filtering mechanism combined with utilities for parameter vectorization, reshaping, and batch management.
The core responsibilities of the parameter utilities include:
- Parameter Extraction - Gathering named parameters from PyTorch modules
- Parameter Filtering - Selecting subsets of parameters based on name patterns or custom predicates
- Vectorization - Flattening parameters into vectors and reshaping vectors back to parameter shapes
- Size Tracking - Maintaining offset mappings for efficient vector-to-parameter conversions
Sources: hessian_eigenthings/operators/hessian.py
Core Types and Interfaces
ParamFilter Type
The ParamFilter type alias defines the contract for parameter selection functions:
ParamFilter = Callable[[str, nn.Parameter], bool]
A ParamFilter is a callable that takes two arguments:
- name (str): the fully-qualified parameter name within the model
- param (nn.Parameter): the parameter tensor itself
The function returns True if the parameter should be included in the operation, False otherwise.
Sources: hessian_eigenthings/operators/hessian.py:1-50
Parameter Collection Utilities
The module provides functions for extracting and organizing model parameters:
| Function | Purpose |
|---|---|
| get_param_names(model) | Returns list of parameter names as fully-qualified strings |
| get_param_list(model) | Returns list of parameter tensors |
| get_param_sizes(model) | Returns list of parameter tensor sizes |
| get_filtered_params(model, param_filter) | Returns filtered parameter names and tensors |
These functions work together to build the data structures required by curvature operators.
Sources: hessian_eigenthings/operators/ggn.py
Parameter Vectorization
Flattening Parameters to Vectors
The utilities support bidirectional conversion between parameter dictionaries and flat vectors. This is essential for Lanczos-based eigendecomposition algorithms that operate on vector spaces.
def flatten_params(param_dict: dict[str, Tensor]) -> Tensor:
"""Flatten all parameters into a single 1D tensor."""
The flattening process concatenates all parameter tensors in a deterministic order, preserving the mapping between parameter names and vector offsets.
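A minimal sketch of this step, assuming the dict preserves insertion order (as Python dicts do):
import torch
from torch import Tensor

def flatten_params_sketch(param_dict: dict[str, Tensor]) -> Tensor:
    # Concatenate each parameter's flattened view in a deterministic order,
    # so that offsets into the result stay stable across calls.
    return torch.cat([p.reshape(-1) for p in param_dict.values()])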
Reshaping Vectors Back to Parameters
def unflatten_params(
vec: Tensor,
param_names: list[str],
param_list: list[Tensor],
sizes: list[torch.Size]
) -> dict[str, Tensor]:
"""Reshape a flat vector back to parameter dictionary."""
The unflattening operation uses offset tracking to slice the vector and reshape each slice to match the original parameter shape:
out: dict[str, Tensor] = {}
offset = 0
# Offset tracking: slice out each parameter's flat segment, then reshape it.
for name, param, size in zip(param_names, param_list, sizes, strict=True):
    out[name] = vec[offset : offset + size].reshape_as(param)
    offset += size
Sources: hessian_eigenthings/operators/ggn.py
Parameter Filtering Patterns
Name-Based Filtering with `match_names`
The most common filtering pattern uses glob-style matching against parameter names. The match_names function creates a ParamFilter from a list of name patterns:
def match_names(*patterns: str) -> ParamFilter:
"""Create a filter matching parameter names against glob patterns."""
Example usage:
from hessian_eigenthings.operators import HessianOperator
from hessian_eigenthings.param_utils import match_names
# Filter attention parameters only
attn_filter = match_names("blocks.*.attn.*")
attn_op = HessianOperator(
model=model,
dataloader=dataloader,
loss_fn=loss_fn,
param_filter=attn_filter
)
# Filter MLP parameters only
mlp_filter = match_names("blocks.*.mlp.*")
mlp_op = HessianOperator(
model=model,
dataloader=dataloader,
loss_fn=loss_fn,
param_filter=mlp_filter
)
Sources: examples/transformer_lens_attention_only.py
Multiple Pattern Matching
The match_names function supports multiple patterns, useful for targeting disjoint parameter groups:
# Match multiple parameter groups
filter_fn = match_names(
"transformer.h.*.attn.*",
"transformer.h.*.mlp.*"
)
HuggingFace-Specific Patterns
When working with HuggingFace transformers, parameter names follow a predictable structure:
# Attention parameters in GPT-2
attn_op = HessianOperator(
model=model,
dataloader=dataloader,
loss_fn=hf_lm_loss(),
param_filter=match_names("transformer.h.*.attn.*"),
)
Sources: examples/huggingface_tiny_gpt2.py
Integration with Curvature Operators
Operator Size and Parameter Tracking
Curvature operators maintain internal state about the parameters they operate on:
| Attribute | Type | Description |
|---|---|---|
| _param_names | list[str] | Names of parameters in the filtered set |
| _param_list | list[Tensor] | Parameter tensors |
| _sizes | list[torch.Size] | Original tensor shapes for reshaping |
| size | int | Total number of parameters (sum of all parameter elements) |
The size property is computed as:
self.size = sum(p.numel() for p in self._param_list)
This total parameter count determines the dimensionality of the vector space in which eigendecomposition occurs.
Sources: hessian_eigenthings/operators/hessian.py
Data Flow Diagram
graph TD
A[PyTorch Model] --> B[get_param_names]
A --> C[get_param_list]
A --> D[get_param_sizes]
B --> E[ParamFilter Application]
C --> E
D --> E
E --> F[Filtered Parameter Collections]
F --> G[Curvature Operator]
G --> H[matvec Operations]
H --> I[Eigendecomposition Results]
J[Input Vector] --> K[unflatten_params]
F --> K
K --> L[Parameter Dict]
L --> H
Practical Examples
Computing Attention-Only Hessian Eigendecomposition
from hessian_eigenthings.algorithms import lanczos
from hessian_eigenthings.operators import HessianOperator
from hessian_eigenthings.param_utils import match_names
# Create attention-only operator
attn_op = HessianOperator(
model=model,
dataloader=dataloader,
loss_fn=loss_fn,
param_filter=match_names("blocks.*.attn.*")
)
print(f"Attention-only Hessian size: {attn_op.size} parameters")
# Compute top-3 eigenvalues
eig_attn = lanczos(attn_op, k=3, max_iter=20, tol=1e-3, seed=0)
for i, val in enumerate(eig_attn.eigenvalues):
print(f" λ_{i + 1} = {val.item(): .4e}")
Comparing Block-Specific Curvature
# Full model Hessian
full_op = HessianOperator(
model=model,
dataloader=dataloader,
loss_fn=loss_fn
)
# Attention block only
attn_op = HessianOperator(
model=model,
dataloader=dataloader,
loss_fn=loss_fn,
param_filter=match_names("blocks.*.attn.*")
)
# MLP block only
mlp_op = HessianOperator(
model=model,
dataloader=dataloader,
loss_fn=loss_fn,
param_filter=match_names("blocks.*.mlp.*")
)
# Compare eigenvalue spectra
full_eig = lanczos(full_op, k=10)
attn_eig = lanczos(attn_op, k=10)
mlp_eig = lanczos(mlp_op, k=10)
Sources: examples/transformer_lens_attention_only.py
Advanced Filtering
Custom Filter Functions
For complex filtering logic beyond glob matching, implement a custom ParamFilter:
def custom_filter(name: str, param: nn.Parameter) -> bool:
    # Include only parameters with at least 1000 elements
    if param.numel() < 1000:
        return False
    # Exclude certain modules
    if "embedding" in name:
        return False
    # Include based on naming patterns
    return "layer" in name or "head" in name
operator = HessianOperator(
model=model,
dataloader=dataloader,
loss_fn=loss_fn,
param_filter=custom_filter
)
Filter Composition
Filters can be combined using standard Python patterns:
# Intersection of patterns
combined_filter = lambda name, param: (
match_names("blocks.*.*.*")(name, param) and
param.dtype == torch.float32
)
# Negation
exclude_ln = lambda name, param: not (
match_names(".*laynorm.*", ".*ln.*")(name, param)
)
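For reuse, these patterns can be wrapped in small composition helpers; the names all_of and any_of below are hypothetical conveniences, not part of the library:
def all_of(*filters):
    # Passes only when every sub-filter passes.
    return lambda name, param: all(f(name, param) for f in filters)

def any_of(*filters):
    # Passes when at least one sub-filter passes.
    return lambda name, param: any(f(name, param) for f in filters)

combined = all_of(match_names("blocks.*"), lambda n, p: p.ndim >= 2)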
Performance Considerations
Parameter Access Patterns
The parameter utilities maintain strict ordering between name lists and tensor lists to enable efficient offset-based indexing. When iterating over parameters in performance-critical paths:
- Use the pre-computed _sizes list to avoid repeated param.shape calls
- Leverage the strict=True zip when all lists are guaranteed to be aligned
- Prefer in-place reshaping over copies when possible
Memory Implications
| Operation | Memory Pattern |
|---|---|
| flatten_params | Allocates new tensor of size sum(numel) |
| unflatten_params | Creates dict, views from original vector |
| matvec | No parameter data copies; uses VJP/JVP chains |
The vectorization maintains a view relationship with original parameters where possible, minimizing memory overhead during iterative algorithms.
API Reference
`match_names`
def match_names(*patterns: str) -> ParamFilter:
"""Create a ParamFilter matching parameter names against glob patterns."""
Parameters:
| Parameter | Type | Description |
|---|---|---|
| patterns | str | Glob patterns to match against parameter names |
Returns: A callable ParamFilter that returns True for parameters matching any of the provided patterns.
Supported Glob Patterns:
- * matches any sequence of characters within a path component
- ** matches any sequence of path components (if supported)
- ? matches a single character
- [abc] matches any character in the set
Parameter Extraction Functions
def get_param_names(model: nn.Module) -> list[str]:
    """Extract fully-qualified parameter names from a model."""

def get_param_list(model: nn.Module) -> list[Tensor]:
    """Extract parameter tensors from a model."""

def get_filtered_params(
    model: nn.Module,
    param_filter: ParamFilter | None
) -> tuple[list[str], list[Tensor]]:
    """Extract filtered parameter names and tensors."""
Sources: hessian_eigenthings/param_utils.py
Sources: [hessian_eigenthings/operators/hessian.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/hessian.py)
Distributed Computing with DDP
Related topics: Curvature Operators
The hessian-eigenthings library provides native support for distributed training scenarios through DDPHessianOperator, a specialized curvature operator that extends the base HessianOperator to work correctly with PyTorch's DistributedDataParallel (DDP) wrapper.
Overview
In distributed training environments, the Hessian eigenvalue computations must account for how DDP synchronizes gradients across multiple processes. The DDPHessianOperator handles this synchronization transparently, ensuring that the Hessian-vector products (HVPs) computed across different ranks are properly averaged.
Key characteristics:
- Subclass of HessianOperator with distributed awareness
- Automatically averages HVPs across all data-parallel ranks
- Compatible with standard DDP-wrapped models
- Supports the same API as the base HessianOperator
- Handles the autograd graph complexity introduced by DDP's all-reduce operations
Architecture
Class Hierarchy
CurvatureOperator (base interface)
└── HessianOperator (base implementation)
└── DDPHessianOperator (DDP-aware extension)
Data Flow Diagram
graph TD
A[Model wrapped with DDP] --> B[DDPHessianOperator]
B --> C[Per-rank HVP Computation]
C --> D[Autograd-aware All-Reduce]
D --> E[Synchronized HVP across Ranks]
F[torch.autograd.grad calls] --> G[Regular .backward hooks]
G --> H[No explicit all-reduce]
I[Expected HVP] --> J[Actual HVP without DDPHessianOperator]
K[Expected HVP] --> L[Actual HVP with DDPHessianOperator]
style D fill:#90EE90
style H fill:#FFB6C1
style L fill:#90EE90
DDP Behavior Explanation
The core challenge addressed by DDPHessianOperator stems from how PyTorch's DistributedDataParallel handles gradient synchronization:
- DDP's all-reduce mechanism: DDP normally fires its all-reduce operation inside the autograd graph during loss.backward(), synchronizing gradients across all ranks
- Standard HessianOperator limitation: When using torch.autograd.grad directly (as the base HessianOperator does), the DDP hooks fire on .grad accumulation rather than on autograd.grad's return value
- Resulting discrepancy: Without explicit handling, the computed HVP does not match the single-process HVP computed on the union of all per-rank batches
The DDPHessianOperator resolves this by adding an explicit autograd-aware all-reduce after each gradient computation call, ensuring the resulting HVP equals the single-process HVP.
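Conceptually, the synchronization step reduces to averaging the per-rank HVP. The following sketch shows only that averaging step (the real operator keeps the all-reduce inside the autograd graph rather than applying it after the fact):
import torch
import torch.distributed as dist

def sync_hvp(hvp: torch.Tensor) -> torch.Tensor:
    # Sum the per-rank HVPs, then divide so every rank holds the average.
    dist.all_reduce(hvp, op=dist.ReduceOp.SUM)
    hvp /= dist.get_world_size()
    return hvp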
API Reference
DDPHessianOperator
class DDPHessianOperator(HessianOperator):
"""HessianOperator that all-reduces the HVP across torch.distributed ranks."""
Constructor Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| model | nn.Module | Yes | - | Model (may be DDP-wrapped; params are read directly) |
| dataloader | Iterable[Any] | Yes | - | Data loader providing batches to average over |
| loss_fn | LossFn | Yes | - | Loss function loss_fn(model, batch) -> loss |
| param_filter | ParamFilter \| None | No | None | Optional filter for subset of parameters |
| full_dataset | bool | No | True | Whether to compute Hessian over full dataset |
| num_batches | int \| None | No | None | Number of batches to sample if not full dataset |
| microbatch_size | int \| None | No | None | Chunk batch into micro-batches for memory |
| microbatch_unsafe | bool | No | False | Skip gradient accumulation safety checks |
| method | HvpMethod | No | "autograd" | HVP computation method |
| fd_eps | float \| None | No | None | Finite difference epsilon |
| backend | LinAlgBackend[torch.Tensor] \| None | No | None | Linear algebra backend |
Inherits all parameters from HessianOperator base class.
Inherited Methods
| Method | Description |
|---|---|
| matvec(v) | Compute H·v where H is the Hessian averaged over batches |
| size | Total number of parameters in the filtered parameter set |
| dtype | Data type of parameters |
| device | Device of parameters |
Import Location
from hessian_eigenthings.operators import DDPHessianOperator
Or via the distributed submodule:
from hessian_eigenthings.operators.distributed import DDPHessianOperator
Sources: hessian_eigenthings/operators/distributed/__init__.py:1-3
Usage Patterns
Basic Usage with DDP-wrapped Model
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from hessian_eigenthings.operators import DDPHessianOperator
from hessian_eigenthings.algorithms import lanczos
# Assume model, dataloader, and loss_fn are already set up
ddp_model = DDP(model)
# Create the distributed Hessian operator
hessian_op = DDPHessianOperator(
model=ddp_model,
dataloader=dataloader,
loss_fn=loss_fn,
)
# Compute eigenvalues using the Lanczos algorithm
result = lanczos(hessian_op, k=10, max_iter=50)
eigenvalues, eigenvectors = result.eigenvalues, result.eigenvectors
With Parameter Filtering
from hessian_eigenthings.param_utils import match_names
# Focus on specific layer parameters
hessian_op = DDPHessianOperator(
model=ddp_model,
dataloader=dataloader,
loss_fn=loss_fn,
param_filter=match_names("layer.4.*"),
)
Using with Different HVP Methods
# Using finite difference method (more memory-efficient)
hessian_op_fd = DDPHessianOperator(
model=ddp_model,
dataloader=dataloader,
loss_fn=loss_fn,
method="finite_difference",
fd_eps=1e-5,
)
Key Design Decisions
Autograd-aware All-Reduce
The DDPHessianOperator adds an explicit all-reduce operation that integrates with PyTorch's autograd engine. This ensures:
- The all-reduce operation is included in the autograd graph when needed
- Gradient flows correctly through the distributed computation
- The final HVP is properly synchronized across all ranks
Parameter Access
The operator reads parameters directly from the model, whether or not it is wrapped with DDP:
# From the source:
# "The model passed in may already be wrapped with
# torch.nn.parallel.DistributedDataParallel; we read params from it directly."
This design allows seamless usage with existing DDP-wrapped models without modification.
Batch Distribution
Each rank should receive its own shard of the dataset:
"Each rank should be receiving its own shard of the dataset (typical pattern: a torch.utils.data.distributed.DistributedSampler)."
Sources: hessian_eigenthings/operators/distributed/ddp.py:21-25
Comparison with Single-Process HessianOperator
| Aspect | HessianOperator | DDPHessianOperator |
|---|---|---|
| Use case | Single GPU / CPU | Multi-GPU distributed |
| Gradient sync | Manual handling required | Automatic via all-reduce |
| DDP compatibility | May produce incorrect HVPs | Correct by design |
| API | Identical | Identical |
| Performance overhead | None | Single all-reduce per HVP |
Relationship to Other Operators
The hessian_eigenthings package provides multiple curvature operators:
| Operator | Description | Distributed Support |
|---|---|---|
| HessianOperator | Full Hessian computation | Not DDP-aware |
| DDPHessianOperator | Full Hessian with DDP sync | DDP-aware |
| GGNOperator | Generalized Gauss-Newton | Not DDP-aware (as of v1.0) |
| EmpiricalFisherOperator | Empirical Fisher matrix | Not DDP-aware |
Sources: hessian_eigenthings/__init__.py:15-24
Limitations and Considerations
- Current scope: Only HessianOperator has a DDP-aware counterpart; other operators like GGNOperator and EmpiricalFisherOperator do not yet have distributed variants
- Gradient hooks: The operator does not currently support all DDP gradient hook mechanisms
- Multi-node training: While the operator uses standard torch.distributed primitives, performance at very large scale (>8 nodes) has not been extensively benchmarked
- Mixed precision: When using fp16/bf16 training, ensure consistent dtype across all ranks
Error Handling
The operator relies on standard PyTorch distributed error handling:
- If torch.distributed is not initialized, standard errors will be raised
- Mismatched tensor shapes across ranks will result in collective operation errors
- Device mismatch (e.g., some ranks on CUDA, some on CPU) is not supported
Testing and Validation
The DDP functionality should be tested in a true distributed environment. Basic validation includes:
- Consistency check: the HVP computed via DDPHessianOperator should equal the single-process HVP when aggregating all batch shards (see the sketch below)
- Numerical accuracy: eigenvalues computed with DDP should match single-GPU results within floating-point tolerance
- Scaling: computation time should scale sub-linearly with the number of GPUs for large models
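A minimal version of the consistency check might look like this sketch, where single_process_op is a hypothetical HessianOperator built on the full, un-sharded dataset for comparison:
import torch

# Hypothetical consistency check, run on a single rank with both operators
# constructed over the same data (move `v` to the operators' device as needed).
v = torch.randn(hessian_op.size)
hv_ddp = hessian_op.matvec(v)
hv_ref = single_process_op.matvec(v)
assert torch.allclose(hv_ddp, hv_ref, rtol=1e-4, atol=1e-6)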
Sources: hessian_eigenthings/operators/distributed/__init__.py:1-3
Doramagic Pitfall Log
Doramagic extracted 15 source-linked risk signals. Review them before installing or handing real data to the project.
1. Project risk: Project risk needs validation
- Severity: medium
- Finding: Project risk is backed by a source signal: Project risk needs validation. Treat it as a review item until the current version is checked.
- User impact: The project should not be treated as fully validated until this signal is reviewed.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: identity.distribution | hn_item:48132232 | https://news.ycombinator.com/item?id=48132232 | repo=pytorch-hessian-eigenthings; install=hessian-eigenthings
2. Installation risk: Python Error: the following arguments are required: experimentname
- Severity: medium
- Finding: Installation risk is backed by a source signal: Python Error: the following arguments are required: experimentname. Treat it as a review item until the current version is checked.
- User impact: First-time setup may fail or require extra isolation and rollback planning.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/noahgolmant/pytorch-hessian-eigenthings/issues/39
3. Installation risk: v1.0.0a2 — packaging fix
- Severity: medium
- Finding: Installation risk is backed by a source signal: v1.0.0a2 — packaging fix. Treat it as a review item until the current version is checked.
- User impact: First-time setup may fail or require extra isolation and rollback planning.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/noahgolmant/pytorch-hessian-eigenthings/releases/tag/v1.0.0a2
4. Installation risk: v1.0.0a3 — fix lanczos OOM
- Severity: medium
- Finding: Installation risk is backed by a source signal: v1.0.0a3 — fix lanczos OOM. Treat it as a review item until the current version is checked.
- User impact: First-time setup may fail or require extra isolation and rollback planning.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/noahgolmant/pytorch-hessian-eigenthings/releases/tag/v1.0.0a3
5. Installation risk: v1.0.0a4 — backend handles CPU-generator + CUDA-tensor combo
- Severity: medium
- Finding: Installation risk is backed by a source signal: v1.0.0a4 — backend handles CPU-generator + CUDA-tensor combo. Treat it as a review item until the current version is checked.
- User impact: First-time setup may fail or require extra isolation and rollback planning.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/noahgolmant/pytorch-hessian-eigenthings/releases/tag/v1.0.0a4
6. Installation risk: v1.0.0a5 — comprehensive LLM-scale memory fixes + regression tests
- Severity: medium
- Finding: Installation risk is backed by a source signal: v1.0.0a5 — comprehensive LLM-scale memory fixes + regression tests. Treat it as a review item until the current version is checked.
- User impact: First-time setup may fail or require extra isolation and rollback planning.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/noahgolmant/pytorch-hessian-eigenthings/releases/tag/v1.0.0a5
7. Configuration risk: RuntimeError: One of the differentiated Tensors appears to not have been used in the graph.
- Severity: medium
- Finding: Configuration risk is backed by a source signal: RuntimeError: One of the differentiated Tensors appears to not have been used in the graph. Treat it as a review item until the current version is checked.
- User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/noahgolmant/pytorch-hessian-eigenthings/issues/30
8. Configuration risk: ValueError: PENet on the Kitti benchmark suite
- Severity: medium
- Finding: Configuration risk is backed by a source signal: ValueError: PENet on the Kitti benchmark suite. Treat it as a review item until the current version is checked.
- User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/noahgolmant/pytorch-hessian-eigenthings/issues/41
9. Capability assumption: README/documentation is current enough for a first validation pass.
- Severity: medium
- Finding: README/documentation is current enough for a first validation pass.
- User impact: The project should not be treated as fully validated until this signal is reviewed.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: capability.assumptions | hn_item:48132232 | https://news.ycombinator.com/item?id=48132232 | README/documentation is current enough for a first validation pass.
10. Project risk: AttributeError: 'HVPOperator' object has no attribute 'zero_grad'
- Severity: medium
- Finding: Project risk is backed by a source signal: AttributeError: 'HVPOperator' object has no attribute 'zero_grad'. Treat it as a review item until the current version is checked.
- User impact: The project should not be treated as fully validated until this signal is reviewed.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: Source-linked evidence: https://github.com/noahgolmant/pytorch-hessian-eigenthings/issues/38
11. Maintenance risk: Maintainer activity is unknown
- Severity: medium
- Finding: Maintenance risk is backed by a source signal: Maintainer activity is unknown. Treat it as a review item until the current version is checked.
- User impact: Users cannot judge support quality until recent activity, releases, and issue response are checked.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: evidence.maintainer_signals | hn_item:48132232 | https://news.ycombinator.com/item?id=48132232 | last_activity_observed missing
12. Security or permission risk: no_demo
- Severity: medium
- Finding: no_demo
- User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: downstream_validation.risk_items | hn_item:48132232 | https://news.ycombinator.com/item?id=48132232 | no_demo; severity=medium
Source: Doramagic discovery, validation, and Project Pack records
Community Discussion Evidence
These external discussion links are review inputs, not standalone proof that the project is production-ready.
Doramagic exposes project-level community discussion separately from official documentation. Review these links before using pytorch-hessian-eigenthings with real data or production workflows.
- ValueError: PENet on the Kitti benchmark suite - github / github_issue
- RuntimeError: One of the differentiated Tensors appears to not have been used in the graph - github / github_issue
- AttributeError: 'HVPOperator' object has no attribute 'zero_grad' - github / github_issue
- Python Error: the following arguments are required: experimentname - github / github_issue
- v1.0.0a5 — comprehensive LLM-scale memory fixes + regression tests - github / github_release
- v1.0.0a4 — backend handles CPU-generator + CUDA-tensor combo - github / github_release
- v1.0.0a3 — fix lanczos OOM - github / github_release
- v1.0.0a2 — packaging fix - github / github_release
- Project risk needs validation - GitHub / issue
Source: Project Pack community evidence and pitfall evidence