# Project Documentation: https://github.com/noahgolmant/pytorch-hessian-eigenthings

Generated: 2026-05-16 15:14:47 UTC

## Table of Contents

- [Introduction to hessian-eigenthings](#introduction)
- [Installation Guide](#installation)
- [Curvature Matrices Explained](#curvature-matrices)
- [Why Hessian-Vector Products](#hvp-approach)
- [System Architecture](#architecture)
- [Curvature Operators](#curvature-operators)
- [Eigendecomposition Algorithms](#algorithms)
- [Loss Functions](#loss-functions)
- [Parameter Utilities](#parameter-utilities)
- [Distributed Computing with DDP](#distributed-computing)

<a id='introduction'></a>

## Introduction to hessian-eigenthings

### Related Pages

Related topics: [Curvature Matrices Explained](#curvature-matrices), [System Architecture](#architecture)

<details>
<summary>Relevant source files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/README.md)
- [hessian_eigenthings/operators/hessian.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/hessian.py)
- [hessian_eigenthings/operators/ggn.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/ggn.py)
- [hessian_eigenthings/algorithms/lanczos.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/algorithms/lanczos.py)
- [hessian_eigenthings/algorithms/trace.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/algorithms/trace.py)
- [hessian_eigenthings/loss_fns/huggingface.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/loss_fns/huggingface.py)
- [hessian_eigenthings/loss_fns/standard.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/loss_fns/standard.py)
- [examples/huggingface_tiny_gpt2.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/examples/huggingface_tiny_gpt2.py)
- [mkdocs.yml](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/mkdocs.yml)
</details>

# Introduction to hessian-eigenthings

## Overview

`hessian-eigenthings` is a PyTorch library that provides efficient and scalable computation of eigendecomposition for the Hessian matrix and related curvature operators in neural networks. The library enables practitioners to compute top eigenvalues and eigenvectors via Lanczos or stochastic power iteration, trace estimates via Hutch++, and spectral density via Stochastic Lanczos Quadrature.

Sources: [README.md:1]()

The project targets researchers and engineers studying generalization properties of neural networks, where Hessian eigenvalues and eigenvectors have been implicated in understanding flat minima and model robustness.

## Core Concepts

### What is a Hessian?

The Hessian matrix is the matrix of second-order partial derivatives of a loss function with respect to the model parameters. For a neural network with parameters θ and loss L, the Hessian H is defined as:

```
H[i,j] = ∂²L / ∂θ[i] ∂θ[j]
```

For modern large-scale models, the Hessian is prohibitively expensive to compute explicitly: it has O(n²) entries, where n is the number of parameters (e.g., billions for large language models).

Sources: [hessian_eigenthings/operators/hessian.py:1-50]()

### Curvature Operators

Instead of computing the full Hessian matrix, this library works with **curvature operators** that implement matrix-vector products (matvecs). Given a vector v, these operators efficiently compute:

```
H @ v → operator.matvec(v)
```

This approach reduces memory from O(n²) to O(n), making analysis feasible for models with billions of parameters.
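
To make this concrete, the snippet below shows an algorithm that consumes nothing but a `matvec` closure. Plain power iteration stands in for the library's solvers here; it is an independent illustration of the matrix-free idea, not the library's `stochastic_power_iteration`:

```python
import torch

def power_iteration(matvec, n: int, steps: int = 100):
    """Estimate the top eigenpair of a symmetric operator from matvecs alone."""
    v = torch.randn(n, dtype=torch.float64)
    v /= v.norm()
    for _ in range(steps):
        w = matvec(v)
        v = w / w.norm()
    return torch.dot(v, matvec(v)), v  # Rayleigh quotient, eigenvector estimate

# The algorithm never sees the matrix, only the closure.
A = torch.diag(torch.tensor([3.0, 1.0, 0.5], dtype=torch.float64))
top_eigenvalue, _ = power_iteration(lambda x: A @ x, n=3)
print(top_eigenvalue)  # ≈ 3.0
```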

### Supported Curvature Matrices

| Operator | Description | Use Case |
|----------|-------------|----------|
| `HessianOperator` | Full Hessian of the loss | General curvature analysis |
| `GGNOperator` | Generalized Gauss-Newton approximation | More stable than raw Hessian; equals Fisher for cross-entropy + softmax |
| Custom Operators | User-defined curvature operators | Extend to other matrices |

Sources: [hessian_eigenthings/operators/ggn.py:1-30]()

## Architecture

The library follows a clean separation of concerns with three main layers:

```mermaid
graph TD
    A[User Code] --> B[Algorithms]
    A --> C[Loss Functions]
    B --> D[Curvature Operators]
    C --> D
    D --> E[LinAlgBackend]
    
    B --> B1[Lanczos]
    B --> B2[Stochastic Power Iteration]
    B --> B3[Trace Estimation]
    
    D --> D1[HessianOperator]
    D --> D2[GGNOperator]
    D --> D3[Custom Operators]
    
    E --> E1[SingleDeviceBackend]
    E --> E2[Distributed Backends]
```

### Component Layers

1. **Algorithms Layer** (`hessian_eigenthings/algorithms/`): Eigenvalue/eigenvector computation methods that operate on any `CurvatureOperator`.

2. **Operators Layer** (`hessian_eigenthings/operators/`): Implementations of various curvature matrices that provide the `matvec()` interface.

3. **Loss Functions Layer** (`hessian_eigenthings/loss_fns/`): Pre-built loss functions with analytical Hessian-vector products for common use cases.

4. **Backend Layer** (`hessian_eigenthings/backends/`): Abstraction for linear algebra operations supporting single-device and distributed execution.

Sources: [CONTRIBUTING.md:1-30]()

## Algorithms

### Lanczos Eigendecomposition

The Lanczos algorithm computes the top k eigenvalues and eigenvectors of a symmetric matrix using only matrix-vector products. It does so by iteratively building a small tridiagonal matrix whose eigenvalues (Ritz values) converge to the extremal eigenvalues of the operator.

```mermaid
graph LR
    A[Start Vector v₀] --> B[Iterate i = 1 to k]
    B --> C[Compute αᵢ = vᵢᵀAvᵢ]
    C --> D[Compute βᵢvᵢ₊₁ = Avᵢ - αᵢvᵢ - βᵢ₋₁vᵢ₋₁]
    D --> E[Build Tridiagonal T]
    E --> F{Eigenvalues of T ≈ eigenvalues of A?}
    F -->|Yes| G[Eigenpairs Converged]
```

Key parameters for the Lanczos algorithm:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `k` | int | required | Number of eigenpairs to compute |
| `max_iter` | int | 100 | Maximum Lanczos iterations |
| `tol` | float | 1e-6 | Convergence tolerance |
| `seed` | int | None | Random seed for reproducibility |
| `which` | str | "LM" | Which eigenvalues: "LM" (largest magnitude), "LA" (largest algebraic), "SA" (smallest algebraic) |

Sources: [hessian_eigenthings/algorithms/lanczos.py:1-80]()
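
As a standalone illustration of the recurrence (not the library's `lanczos` implementation), the sketch below runs m Lanczos steps against an explicit symmetric matrix and compares the top Ritz values with the exact eigenvalues. It omits reorthogonalization and breakdown handling:

```python
import torch

def lanczos_ritz_values(matvec, n: int, m: int) -> torch.Tensor:
    """m-step Lanczos sketch; returns eigenvalues of the tridiagonal T."""
    v = torch.randn(n, dtype=torch.float64)
    v /= v.norm()
    v_prev = torch.zeros_like(v)
    beta = torch.tensor(0.0, dtype=torch.float64)
    alphas, betas = [], []
    for _ in range(m):
        w = matvec(v) - beta * v_prev   # three-term recurrence
        alpha = torch.dot(v, w)
        alphas.append(alpha)
        w = w - alpha * v
        beta = w.norm()
        betas.append(beta)
        v_prev, v = v, w / beta
    T = (torch.diag(torch.stack(alphas))
         + torch.diag(torch.stack(betas[:-1]), 1)
         + torch.diag(torch.stack(betas[:-1]), -1))
    return torch.linalg.eigvalsh(T)    # Ritz values

A = torch.randn(50, 50, dtype=torch.float64)
A = (A + A.T) / 2                      # symmetrize
print(lanczos_ritz_values(lambda x: A @ x, n=50, m=20)[-3:])  # top Ritz values
print(torch.linalg.eigvalsh(A)[-3:])                          # exact top eigenvalues
```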

### Trace Estimation

The trace of a matrix can be estimated using stochastic methods without forming the full matrix:

**Hutchinson's Estimator:**
```
trace(A) ≈ (1/m) Σᵢ vᵢᵀ A vᵢ
```
where vᵢ are random probe vectors.

**Hutch++ Estimator:**
An improved estimator with lower variance. It spends part of the matvec budget building an orthonormal sketch `Q` of the operator's dominant range, computes that part of the trace exactly, and applies Hutchinson only to the deflated remainder:
```
trace(A) ≈ trace(Qᵀ A Q) + (1/m) Σⱼ wⱼᵀ (I − QQᵀ) A (I − QQᵀ) wⱼ
```

| Method | `num_matvecs` | Variance | Use Case |
|--------|---------------|----------|----------|
| `hutchinson` | 100 | Higher | Quick estimates |
| `hutch++` | 30 | Lower | Production estimates |

Sources: [hessian_eigenthings/algorithms/trace.py:1-60]()
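
A minimal sketch of Hutchinson's estimator with Rademacher probes, checked against an explicit matrix (illustrative only, not the library's `trace` function):

```python
import torch

def hutchinson_trace(matvec, n: int, m: int = 100) -> torch.Tensor:
    """Average v^T A v over m random ±1 probe vectors."""
    total = torch.tensor(0.0, dtype=torch.float64)
    for _ in range(m):
        v = (torch.randint(0, 2, (n,)) * 2 - 1).to(torch.float64)  # Rademacher
        total += torch.dot(v, matvec(v))
    return total / m

A = torch.randn(100, 100, dtype=torch.float64)
A = (A + A.T) / 2
print(hutchinson_trace(lambda x: A @ x, n=100))  # stochastic estimate
print(torch.trace(A))                            # exact value
```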

## Operators

### HessianOperator

The primary operator for computing Hessian eigendecomposition. It supports two HVP computation methods:

| Method | Description | Memory | Precision |
|--------|-------------|--------|-----------|
| `"autograd"` (default) | Exact double-backward via `torch.autograd.grad` with `create_graph=True` | Higher | Numerically exact |
| `"finite_difference"` | Central-difference approximation | Lower | O(ε²) bias |

```python
from hessian_eigenthings import HessianOperator, lanczos

# Basic usage
op = HessianOperator(model=model, dataloader=dataloader, loss_fn=loss_fn)
eigenvalues, eigenvectors = lanczos(op, k=10)

# With parameter filtering (subset of parameters)
op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=match_names("transformer.h.*.attn.*"),
)
```

Sources: [hessian_eigenthings/operators/hessian.py:1-100]()

### GGNOperator

The Generalized Gauss-Newton (GGN) operator provides a more numerically stable approximation to the Hessian. For cross-entropy + softmax classification, the GGN equals the Fisher information matrix.

```python
from hessian_eigenthings import GGNOperator

op = GGNOperator(
    model=model,
    dataloader=dataloader,
    forward_fn=model_forward,
    loss_of_output_fn=loss_of_output_fn,
)
```

**Two matvec implementations:**

| Implementation | Description | Memory Footprint |
|----------------|-------------|------------------|
| `"analytical"` (default) | Finite-difference JVP + analytical loss-Hessian-vec product | Matches one training step |
| `"autograd"` | Full `torch.func.jvp` + autograd double-backward | Scales badly with output size |

Sources: [hessian_eigenthings/operators/ggn.py:1-80]()

## Loss Functions

The library provides optimized loss functions with closed-form Hessian-vector products for common use cases.

### Standard Loss Functions

```python
from hessian_eigenthings.loss_fns.standard import cross_entropy_loss_of_output

loss_fn = cross_entropy_loss_of_output()  # Returns loss_of_output_fn
```

The closed-form cross-entropy HVP is:
```
H @ u = (p * u - p * <p, u>) / n
```
where `p = softmax(output)` and `n` is the number of valid positions.

Sources: [hessian_eigenthings/loss_fns/standard.py:1-60]()
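
The identity can be checked directly against autograd's double backward on a single position (so `n = 1`); this is a standalone verification sketch, not library code:

```python
import torch

logits = torch.randn(5, dtype=torch.float64, requires_grad=True)
target = torch.tensor(2)
u = torch.randn(5, dtype=torch.float64)

# Autograd HVP: differentiate <grad, u> a second time.
loss = torch.nn.functional.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))
(g,) = torch.autograd.grad(loss, logits, create_graph=True)
(hvp_autograd,) = torch.autograd.grad(g @ u, logits)

# Closed form: H @ u = p * u - p * <p, u> (n = 1 here).
p = torch.softmax(logits.detach(), dim=-1)
hvp_closed_form = p * u - p * (p @ u)
print(torch.allclose(hvp_autograd, hvp_closed_form))  # True
```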

### HuggingFace Transformers Loss

Specialized support for HuggingFace models with fused CUDA kernels:

```python
from hessian_eigenthings.loss_fns.huggingface import hf_lm_loss

loss_fn = hf_lm_loss()  # For language modeling
```

**Fused backend options:**

| Backend | Device | Speedup | Memory |
|---------|--------|---------|--------|
| `"triton"` | CUDA | ~3.4x faster | 2x reduction |
| `"compile"` | Any | ~2.6x faster | 2x reduction |
| `"eager"` | Any | Baseline | Baseline |
| `"auto"` | Auto-detect | Best available | Best available |

Sources: [hessian_eigenthings/loss_fns/huggingface.py:1-80]()

## Usage Examples

### Basic Hessian Eigendecomposition

```python
import torch
from torch import nn
from hessian_eigenthings import HessianOperator, lanczos

# Define model and data
model = nn.Sequential(nn.Linear(100, 50), nn.ReLU(), nn.Linear(50, 10))
dataloader = [(torch.randn(32, 100), torch.randint(0, 10, (32,)))]

# Loss function
def loss_fn(model, batch):
    x, y = batch
    return nn.functional.cross_entropy(model(x), y)

# Compute top 3 eigenvalues
op = HessianOperator(model=model, dataloader=dataloader, loss_fn=loss_fn)
result = lanczos(op, k=3, max_iter=20, tol=1e-3, seed=0)

for i, val in enumerate(result.eigenvalues):
    print(f"λ_{i+1} = {val.item():.4e}")
```

Sources: [examples/huggingface_tiny_gpt2.py:1-50]()

### Analyzing Attention Layers Only

```python
from hessian_eigenthings import HessianOperator, lanczos
from hessian_eigenthings.util import match_names

# Filter to attention parameters only
attn_op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=match_names("blocks.*.attn.*"),
)
result = lanczos(attn_op, k=5)
```

### Trace Estimation

```python
from hessian_eigenthings import HessianOperator, trace

op = HessianOperator(model=model, dataloader=dataloader, loss_fn=loss_fn)
result = trace(op, num_matvecs=30, method="hutch++", seed=0)
print(f"Trace estimate: {result.estimate:.4e}")
```

## Performance Considerations

### Memory Management

For large models, consider these strategies:

1. **Parameter Filtering**: Analyze only relevant subsets of parameters
   ```python
   op = HessianOperator(model, dataloader, loss_fn, param_filter=filter_func)
   ```

2. **Microbatching**: Process data in smaller chunks
   ```python
   op = HessianOperator(model, dataloader, loss_fn, microbatch_size=8)
   ```

3. **Finite Difference Method**: Use `"finite_difference"` for lower memory with FSDP/HSDP/TP
   ```python
   op = HessianOperator(model, dataloader, loss_fn, method="finite_difference")
   ```

### Scalability

| Model Size | Recommended Method | Notes |
|------------|-------------------|-------|
| < 1B params | `"autograd"` HVP | Numerically exact |
| 1B - 7B params | `"analytical"` GGN | Good memory efficiency |
| > 7B params | `"finite_difference"` HVP | Works with distributed training |

### Computation Cost

The primary cost driver is the number of matrix-vector products (`matvecs`):

- **Lanczos**: ~max_iter matvecs (one per iteration, largely independent of k)
- **Trace (Hutch++)**: num_matvecs matvecs
- **Spectral Density**: num_steps × num_random_start matvecs

## API Reference

### Core Functions

| Function | Module | Description |
|----------|--------|-------------|
| `lanczos` | `hessian_eigenthings.algorithms` | Lanczos eigendecomposition |
| `stochastic_power_iteration` | `hessian_eigenthings.algorithms` | Stochastic power iteration |
| `trace` | `hessian_eigenthings.algorithms` | Trace estimation |
| `spectral_density` | `hessian_eigenthings.algorithms` | Stochastic Lanczos Quadrature |

### Operators

| Class | Module | Description |
|-------|--------|-------------|
| `HessianOperator` | `hessian_eigenthings.operators` | Full Hessian operator |
| `GGNOperator` | `hessian_eigenthings.operators` | Generalized Gauss-Newton operator |
| `CurvatureOperator` | `hessian_eigenthings.operators` | Base class for custom operators |

### Utility Functions

| Function | Description |
|----------|-------------|
| `match_names(glob_pattern)` | Create parameter filter from glob pattern |
| `SingleDeviceBackend` | Linear algebra backend for single-device execution |

## Project Information

### Acknowledgements

The original 2018 implementation was developed by Noah Golmant, Zhewei Yao, Amir Gholami, Michael Mahoney, and Joseph Gonzalez at UC Berkeley's RISELab.

The deflated power iteration is based on code from [HessianFlow](https://github.com/amirgholami/HessianFlow) (Z. Yao, A. Gholami, Q. Lei, K. Keutzer, M. Mahoney. *"Hessian-based Analysis of Large Batch Training and Robustness to Adversaries"*, NeurIPS 2018).

The accelerated stochastic power iteration follows C. De Sa et al., *"Accelerated Stochastic Power Iteration"* (AISTATS 2018).

### Citation

```bibtex
@misc{hessian-eigenthings,
    author       = {Noah Golmant and Zhewei Yao and Amir Gholami and Michael Mahoney and Joseph Gonzalez},
    title        = {pytorch-hessian-eigenthings: efficient PyTorch Hessian eigendecomposition},
    month        = oct,
    year         = 2018,
    version      = {1.0},
    url          = {https://github.com/noahgolmant/pytorch-hessian-eigenthings}
}
```

### Installation

```bash
# From PyPI (stable release)
pip install hessian-eigenthings

# Development setup
git clone https://github.com/noahgolmant/pytorch-hessian-eigenthings
cd pytorch-hessian-eigenthings
uv sync --group dev --group docs
```

### Documentation

Full documentation is available at [noahgolmant.github.io/pytorch-hessian-eigenthings/](https://noahgolmant.github.io/pytorch-hessian-eigenthings/).

Sources: [README.md:1-100]()
Sources: [CONTRIBUTING.md:1-60]()
Sources: [mkdocs.yml:1-50]()

---

<a id='installation'></a>

## Installation Guide

### Related Pages

Related topics: [Introduction to hessian-eigenthings](#introduction)

<details>
<summary>Relevant source files</summary>

The following source files were used to generate this page:

- [CONTRIBUTING.md](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/CONTRIBUTING.md)
- [README.md](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/README.md)
- [pyproject.toml](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/pyproject.toml)
- [hessian_eigenthings/operators/hessian.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/hessian.py)
- [hessian_eigenthings/operators/ggn.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/ggn.py)
- [examples/transformer_lens_attention_only.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/examples/transformer_lens_attention_only.py)
- [examples/huggingface_tiny_gpt2.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/examples/huggingface_tiny_gpt2.py)
</details>

# Installation Guide

## Overview

This guide covers all aspects of setting up the `hessian-eigenthings` library for computing Hessian eigendecomposition and related curvature matrix operations in PyTorch models.

The library provides efficient methods for computing top eigenvalues and eigenvectors via Lanczos or stochastic power iteration, trace estimates via Hutch++, and spectral density via Stochastic Lanczos Quadrature. Sources: [README.md:1-20]()

## Installation Methods

### PyPI Release (Recommended for Users)

The latest stable release is available on PyPI:

```bash
pip install hessian-eigenthings
```

This installs the core library without optional dependencies for transformer and curvlinops integrations. Sources: [README.md:1-10]()

### Development Installation (For Contributors)

For development, clone the repository and install with all optional dependency groups:

```bash
git clone https://github.com/noahgolmant/pytorch-hessian-eigenthings
cd pytorch-hessian-eigenthings
uv sync --group dev --group docs --extra transformers --extra transformer-lens --extra curvlinops
```

Sources: [CONTRIBUTING.md:5-12]()

## Optional Dependency Groups

The library uses optional dependency groups defined in `pyproject.toml` to enable specialized functionality:

| Group | Purpose | Typical Use Case |
|-------|---------|------------------|
| `dev` | Testing, linting, type checking | Running CI checks locally |
| `docs` | Building documentation | `mkdocs build --strict` |
| `transformers` | HuggingFace Transformers integration | `GGNOperator` with HF models |
| `transformer-lens` | TransformerLens integration | Attention-only Hessian analysis |
| `curvlinops` | Cross-library validation tests | Testing against external oracle |

Sources: [CONTRIBUTING.md:8-10]()

## Development Environment Setup

### Prerequisites

| Requirement | Version | Purpose |
|-------------|---------|---------|
| Python | ≥3.10 | Core runtime |
| uv | Latest | Package manager |
| PyTorch | ≥2.0 | Backend tensor operations |
| CUDA (optional) | 11.8+ | GPU acceleration for large models |

### Setup Workflow

```mermaid
graph TD
    A[Clone Repository] --> B[Install uv if needed]
    B --> C[Run uv sync with groups]
    C --> D[Verify Installation]
    D --> E{Which workflow?}
    E -->|Development| F[Run linting checks]
    E -->|Testing| G[Run pytest]
    E -->|Documentation| H[Build docs]
    F --> I[Ready to contribute]
    G --> I
    H --> I
```

### Verification Commands

After installation, verify the setup by running the full check suite:

```bash
uv run ruff check .
uv run black --check .
uv run mypy
uv run pytest
uv run mkdocs build --strict
```

Sources: [CONTRIBUTING.md:14-23]()

## CUDA/GPU Support

The library provides optimized CUDA kernels for specific operations:

### Triton Kernels (CUDA Only)

The `hessian_eigenthings.loss_fns._fused_ce_hvp` module includes a hand-written Triton CUDA kernel for fused CE HVP computation. This kernel:

- Allocates no extra `(N, V)` intermediates (only the output buffer)
- Provides ~3.4x speedup over eager mode
- Reduces peak memory by 2x compared to eager
- Falls back to `torch.compile` if Triton/CUDA is unavailable

Sources: [hessian_eigenthings/loss_fns/_fused_ce_hvp.py:50-80]()

### Backend Selection

For HuggingFace language model loss functions, the `fused` parameter controls kernel selection:

| Setting | Behavior |
|---------|----------|
| `"auto"` (default) | Picks fastest available: Triton on CUDA (~3.4x speedup), else `torch.compile` |
| `"eager"` | Plain PyTorch implementation, useful for debugging |
| `"compile"` | `torch.compile`-fused via Inductor, works on CPU/CUDA/MPS |
| `"triton"` | Hand-written CUDA Triton kernel (CUDA only) |

Sources: [hessian_eigenthings/loss_fns/huggingface.py:1-50]()

## Operator-Specific Dependencies

Different curvature operators have different computational requirements:

### HessianOperator

Two HVP methods are supported:

| Method | Memory Profile | Precision | FSDP/TP Compatible |
|--------|---------------|-----------|-------------------|
| `"autograd"` (default) | Higher (requires `create_graph=True`) | Numerically exact | No (requires special handling) |
| `"finite_difference"` | Matches one training step | O(ε²) truncation bias | Yes |

Sources: [hessian_eigenthings/operators/hessian.py:1-40]()

### GGNOperator

For the Generalized Gauss-Newton operator, two matvec implementations are available:

| Implementation | Memory | Use Case |
|----------------|--------|----------|
| `"analytical"` (default) | Matches one training step | LM-scale use, prevents OOM |
| `"autograd"` | Scales with output size | Losses without analytical `.hvp` |

Sources: [hessian_eigenthings/operators/ggn.py:1-60]()

## CI/CD Verification

The repository uses GitHub Actions for continuous integration. CI runs:

- All linting and type checks
- Full pytest test suite
- Example scripts execution
- Documentation codeblock tests

Sources: [CONTRIBUTING.md:23-26]()

## Troubleshooting

### Common Issues

| Issue | Solution |
|-------|----------|
| Memory OOM with GGNOperator | Use `loss_hvp="analytical"` (default in recent versions) |
| FSDP/TP compatibility issues | Use `method="finite_difference"` for HessianOperator |
| Triton not available | Falls back to `torch.compile` automatically |
| Type checking failures | Run `uv run mypy` locally before submitting PR |

### Diagnostic Scripts

The repository includes diagnostic scripts for troubleshooting:

- `scripts/repro_ggn_oom.py`: CPU-side memory regression test for GGNOperator OOM issues
- `scripts/bench_fused_ce_hvp.py`: Microbenchmark for eager vs fused CE HVP performance

Sources: [scripts/repro_ggn_oom.py:1-40]()
Sources: [scripts/bench_fused_ce_hvp.py:1-50]()

## Package Metadata

| Property | Value |
|----------|-------|
| Package Name | `hessian-eigenthings` |
| License | MIT |
| Documentation | [noahgolmant.github.io/pytorch-hessian-eigenthings](https://noahgolmant.github.io/pytorch-hessian-eigenthings/) |
| CI Status | [![CI](https://github.com/noahgolmant/pytorch-hessian-eigenthings/actions/workflows/ci.yml/badge.svg)](https://github.com/noahgolmant/pytorch-hessian-eigenthings/actions/workflows/ci.yml) |

Sources: [README.md:1-20]()

---

<a id='curvature-matrices'></a>

## Curvature Matrices Explained

### Related Pages

Related topics: [Why Hessian-Vector Products](#hvp-approach), [Curvature Operators](#curvature-operators)

<details>
<summary>Relevant source files</summary>

The following source files were used to generate this page:

- [hessian_eigenthings/operators/hessian.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/hessian.py)
- [hessian_eigenthings/operators/ggn.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/ggn.py)
- [hessian_eigenthings/operators/__init__.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/__init__.py)
- [hessian_eigenthings/operators/fisher.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/fisher.py)
- [hessian_eigenthings/algorithms/lanczos.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/algorithms/lanczos.py)
- [hessian_eigenthings/algorithms/trace.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/algorithms/trace.py)
- [hessian_eigenthings/loss_fns/huggingface.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/loss_fns/huggingface.py)
- [hessian_eigenthings/loss_fns/_fused_ce_hvp.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/loss_fns/_fused_ce_hvp.py)
</details>

# Curvature Matrices Explained

## Overview

Curvature matrices characterize the second-order behavior of loss functions in neural networks, providing critical information about optimization landscapes, generalization properties, and model robustness. The `hessian-eigenthings` library provides efficient, matrix-free computation of eigendecompositions for three key curvature matrices: the **Hessian**, the **Generalized Gauss-Newton (GGN)**, and the **Empirical Fisher**.

These curvature operators serve as the foundation for analyzing flat minima, understanding generalization, and performing second-order optimization. The library implements matrix-vector products (matvecs) directly, avoiding explicit matrix construction which would be computationally infeasible for large neural networks with billions of parameters.

## Curvature Matrices Architecture

```mermaid
graph TD
    A["Loss Function L(θ)"] --> B[Hessian H = ∇²L]
    A --> C[Generalized Gauss-Newton G]
    A --> D[Empirical Fisher F]
    
    B --> E[Matrix-Free MatVec]
    C --> E
    D --> E
    
    E --> F[Lanczos Eigendecomposition]
    E --> G[Trace Estimation]
    E --> H[Spectral Density]
    
    F --> I[Top-k Eigenpairs]
    G --> J[Trace Estimate ± SE]
    H --> K[Spectral Density Plot]
```

## The Hessian Matrix

### Definition and Role

The Hessian matrix `H = ∇²L(θ)` is the second derivative of the loss with respect to parameters. It captures the exact local curvature of the loss landscape, making it the most precise but also most computationally expensive curvature matrix.

The Hessian is symmetric by construction and its eigenvalues reveal critical properties:

- **Large positive eigenvalues** indicate sharp curvature, suggesting the model is in a narrow minimum
- **Small eigenvalues** indicate flat regions associated with better generalization
- **Negative eigenvalues** signal instability and potential divergence

Sources: [hessian_eigenthings/operators/hessian.py:1-30](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/hessian.py)

### HessianOperator Implementation

The `HessianOperator` class provides two methods for computing Hessian-vector products (HVPs):

```python
class HessianOperator(CurvatureOperator):
    """Hessian of `loss_fn(model, batch)` averaged over batches in dataloader."""
    
    def __init__(
        self,
        model: nn.Module,
        dataloader: Iterable[Any],
        loss_fn: LossFn,
        *,
        param_filter: ParamFilter | None = None,
        full_dataset: bool = True,
        num_batches: int | None = None,
        microbatch_size: int | None = None,
        method: HvpMethod = "autograd",
        fd_eps: float | None = None,
        backend: LinAlgBackend[torch.Tensor] | None = None,
    ) -> None:
```

**HVP Computation Methods:**

| Method | Description | Memory | Precision |
|--------|-------------|--------|-----------|
| `"autograd"` | Exact double-backward via `torch.autograd.grad` with `create_graph=True` | Higher (scales with model size) | Numerically exact to rounding |
| `"finite_difference"` | Central-difference `(∇L(θ+εv) − ∇L(θ−εv)) / 2ε` | Lower (two forward+backward passes) | O(ε²) truncation bias |

The finite-difference method uses dtype-specific epsilon values for optimal precision:

| dtype | Epsilon |
|-------|---------|
| float64 | 6e-6 |
| float32 | 5e-3 |
| bfloat16 | 0.2 |
| float16 | 5e-2 |

Sources: [hessian_eigenthings/operators/hessian.py:30-55](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/hessian.py)

### When to Use the Hessian

The Hessian is ideal for:
- Single-device analysis of models up to ~7B parameters
- Scenarios requiring exact curvature information
- Research on loss landscape topology
- Verifying approximations against ground truth

## Generalized Gauss-Newton (GGN)

### Definition and Mathematical Foundation

The Generalized Gauss-Newton matrix `G` is a positive semi-definite (PSD) approximation to the Hessian. For a loss of the form `L = (1/n) Σ l(f(xᵢ;θ), yᵢ)`, the GGN is defined as:

```
G = Jᵀ · H_loss · J
```

Where:
- `J` is the Jacobian of model outputs with respect to parameters
- `H_loss` is the Hessian of the loss with respect to model outputs

The GGN is always PSD because `G = Jᵀ · H_loss · J` and `H_loss` is PSD for convex losses.

Sources: [hessian_eigenthings/operators/ggn.py:1-40](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/ggn.py)

### GGNOperator Implementation

The `GGNOperator` class provides two matvec implementations:

```python
class GGNOperator(CurvatureOperator):
    def __init__(
        self,
        model: nn.Module,
        dataloader: Iterable[Any],
        forward_fn: ForwardFn,
        loss_of_output_fn: LossOfOutputFn,
        *,
        loss_hvp: Literal["analytical", "autograd"] = "analytical",
    ) -> None:
```

| Method | Description | Memory | Use Case |
|--------|-------------|--------|----------|
| `"analytical"` | Finite-difference JVP + analytical loss-Hessian-vector product | Matches one training step | LM-scale use, OOM-safe |
| `"autograd"` | `torch.func.jvp` + autograd double-backward + `vjp` | Scales with output size | Exact for arbitrary losses |

For cross-entropy + softmax classification, `G` equals the Fisher information matrix, making the GGN and Fisher equivalent in this common case.

Sources: [hessian_eigenthings/operators/ggn.py:40-80](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/ggn.py)

### Two-Function API Design

The GGNOperator uses a separation between `forward_fn` and `loss_of_output_fn`:

```python
ForwardFn = Callable[[nn.Module, Any], torch.Tensor]
LossOfOutputFn = Callable[[torch.Tensor, Any], torch.Tensor]
```

This design enables computing `J·v`, `H_loss·(J·v)`, and `Jᵀ·(H_loss·J·v)` without coupling to loss internals.
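
Under these assumptions, a GGN matvec can be composed from `torch.func` primitives. The sketch below treats the model as a plain function of a flat parameter vector; `ggn_matvec` and `ce_hess_vec` are illustrative names, not the library's API:

```python
import torch
from torch.func import jvp, vjp

def ggn_matvec(f, loss_hess_vec, theta, v):
    """G @ v = Jᵀ (H_loss (J v)), composed from jvp/vjp primitives."""
    _, jv = jvp(f, (theta,), (v,))   # J @ v
    out, vjp_fn = vjp(f, theta)
    hjv = loss_hess_vec(out, jv)     # H_loss @ (J @ v)
    (gv,) = vjp_fn(hjv)              # Jᵀ @ (H_loss @ J @ v)
    return gv

def ce_hess_vec(out, u):
    """Closed-form softmax-CE loss Hessian product (single position)."""
    p = torch.softmax(out, dim=-1)
    return p * u - p * (p @ u)

# Illustration with a linear "model" f(θ) = W θ, so J = W exactly.
W = torch.randn(3, 4, dtype=torch.float64)
theta = torch.randn(4, dtype=torch.float64)
v = torch.randn(4, dtype=torch.float64)
print(ggn_matvec(lambda t: W @ t, ce_hess_vec, theta, v))
```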

### Closed-Form Cross-Entropy HVP

For mean-reduced cross-entropy with softmax, the library provides an optimized analytical HVP:

```
H_loss @ u = (p * u - p * ⟨p, u⟩) / n
```

Where `p = softmax(logits)` and `n` is the count of non-ignored positions.

Sources: [hessian_eigenthings/loss_fns/huggingface.py:1-60](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/loss_fns/huggingface.py)

### Fused CE HVP Implementations

The library provides three backend implementations for the cross-entropy HVP:

| Backend | Description | Memory | Speedup |
|---------|-------------|--------|---------|
| `"eager"` | Plain PyTorch reference | Highest | 1x baseline |
| `"compile"` | `torch.compile`-fused; Inductor fuses operations | Reduced | ~2.6x faster |
| `"triton"` | Hand-written CUDA Triton kernel | Minimal (output buffer only) | ~3.4x faster |

Sources: [hessian_eigenthings/loss_fns/_fused_ce_hvp.py:1-50](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/loss_fns/_fused_ce_hvp.py)

## Empirical Fisher

### Definition

The Empirical Fisher matrix `F` is defined as:

```
F = (1/n) Σ ∇lᵢ · ∇lᵢᵀ
```

Where the expectation over data is replaced by the empirical average over batches. It is always PSD and serves as an approximation to the Fisher Information Matrix.
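
Because `F` is a sum of rank-one terms, its matvec never needs the full matrix. Given a stack of per-sample gradients, a sketch (not the library's implementation) is:

```python
import torch

def empirical_fisher_matvec(grads: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """F @ v = (1/n) Σ gᵢ (gᵢᵀ v), with grads of shape (n_samples, n_params)."""
    return grads.T @ (grads @ v) / grads.shape[0]

grads = torch.randn(8, 5, dtype=torch.float64)  # e.g. per-sample gradients
v = torch.randn(5, dtype=torch.float64)
print(empirical_fisher_matvec(grads, v))
```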

### EmpiricalFisherOperator

```python
class EmpiricalFisherOperator(CurvatureOperator):
    def __init__(
        self,
        model: nn.Module,
        dataloader: Iterable[Any],
        loss_fn: LossFn,
        *,
        per_sample: bool = False,
    ) -> None:
```

Sources: [hessian_eigenthings/operators/fisher.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/fisher.py)

## Curvature Operator Interface

All curvature matrices implement the `CurvatureOperator` base class:

```python
class CurvatureOperator(ABC):
    @property
    @abstractmethod
    def size(self) -> int:
        """Number of parameters (matrix dimension)."""
        ...
    
    @property
    def dtype(self) -> torch.dtype:
        ...
    
    @property
    def device(self) -> torch.device:
        ...
    
    @abstractmethod
    def matvec(self, v: torch.Tensor) -> torch.Tensor:
        """Compute matrix-vector product A @ v."""
        ...
```

Sources: [hessian_eigenthings/operators/base.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/base.py)

## Parameter Filtering

Curvature operators support computing curvature only over a subset of parameters using `ParamFilter`:

```python
def match_names(*patterns: str) -> ParamFilter:
    """Match parameters by name patterns."""
    
def match_regex(pattern: str) -> ParamFilter:
    """Match parameters by regex pattern."""
```

This enables analysis of specific components:

```python
# Analyze only attention weights
attn_op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=match_names("blocks.*.attn.*"),
)
```
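
For intuition, a glob-based filter of this shape can be built on `fnmatch`; the helper below is a hypothetical stand-in, and the library's actual `match_names`/`ParamFilter` may differ in signature:

```python
import fnmatch

def match_names_sketch(*patterns: str):
    """Return a name-based parameter filter from glob patterns (hypothetical)."""
    def param_filter(name: str) -> bool:
        return any(fnmatch.fnmatch(name, pattern) for pattern in patterns)
    return param_filter

keep = match_names_sketch("blocks.*.attn.*")
print(keep("blocks.0.attn.W_Q"))  # True
print(keep("blocks.0.mlp.W_in"))  # False
```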

## Algorithmic Foundations

### Lanczos Eigendecomposition

The Lanczos algorithm computes eigenvalues and eigenvectors of large sparse matrices using only matrix-vector products:

```python
def lanczos(
    operator: CurvatureOperator,
    k: int = 10,
    max_iter: int = 100,
    tol: float = 1e-6,
    which: str = "LM",
) -> EigenResult:
```

The algorithm:
1. Builds a tridiagonal matrix `T` from matvec operations
2. Computes eigenvalues of `T` as Ritz approximations
3. Accumulates eigenvectors directly via rank-1 outer-product updates

Sources: [hessian_eigenthings/algorithms/lanczos.py:1-60](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/algorithms/lanczos.py)

### Trace Estimation

Trace estimation uses stochastic probing to estimate `tr(A)` without computing the full matrix:

| Method | Samples | Error decay | Description |
|--------|---------|-------------|-------------|
| Hutchinson | m | O(1/√m) | `(1/m) Σ vᵢᵀ A vᵢ` with Rademacher/Gaussian vectors |
| Hutch++ | m | O(1/m) | Low-rank deflation plus Hutchinson on the remainder |

```python
def trace(
    operator: CurvatureOperator,
    *,
    num_matvecs: int = 100,
    method: Method = "hutch++",
) -> TraceResult:
```

Sources: [hessian_eigenthings/algorithms/trace.py:1-50](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/algorithms/trace.py)

## Operator Selection Guide

```mermaid
graph LR
    A[Need Curvature?] --> B{Exact Hessian?}
    B -->|Yes, small model| C[HessianOperator<br/>method=autograd]
    B -->|No| D{Need Fisher/GGN?}
    D -->|Yes, cross-entropy| E[GGNOperator<br/>loss_hvp=analytical]
    D -->|Yes, other loss| F[GGNOperator<br/>loss_hvp=autograd]
    D -->|Empirical Fisher| G[EmpiricalFisherOperator]
    
    C --> H[Use lanczos for eigenvalues]
    E --> H
    F --> H
    G --> H
```

| Scenario | Recommended Operator | Method |
|----------|----------------------|--------|
| Exact Hessian, single GPU | `HessianOperator` | `method="autograd"` |
| Large model, distributed | `HessianOperator` | `method="finite_difference"` |
| Language modeling, cross-entropy | `GGNOperator` | `loss_hvp="analytical"` |
| Custom loss, need exact | `GGNOperator` | `loss_hvp="autograd"` |
| Natural gradient optimization | `EmpiricalFisherOperator` | Default |

## Module Exports

```python
from hessian_eigenthings.operators import (
    CurvatureOperator,
    HessianOperator,
    GGNOperator,
    EmpiricalFisherOperator,
    DDPHessianOperator,  # For DistributedDataParallel
    LambdaOperator,      # Custom curvature wrappers
)

from hessian_eigenthings.algorithms import (
    lanczos,              # Top-k eigendecomposition
    trace,                # Trace estimation
    spectral_density,    # Density plot via SLQ
    deflated_power_iteration,
)
```

Sources: [hessian_eigenthings/operators/__init__.py:1-25](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/__init__.py)

## Summary

The `hessian-eigenthings` library provides a unified interface for computing curvature information in neural networks:

- **Hessian**: Exact local curvature via autograd or finite-difference approximation
- **GGN**: Positive semi-definite approximation ideal for large-scale analysis
- **Empirical Fisher**: Sample-based Fisher approximation for natural gradient methods

All operators provide matrix-free matvec implementations, enabling eigendecomposition and trace estimation for models with billions of parameters without explicit matrix construction.

---

<a id='hvp-approach'></a>

## Why Hessian-Vector Products

### Related Pages

Related topics: [Curvature Matrices Explained](#curvature-matrices), [Eigendecomposition Algorithms](#algorithms)

<details>
<summary>Relevant source files</summary>

The following source files were used to generate this page:

- [hessian_eigenthings/operators/hessian.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/hessian.py)
- [hessian_eigenthings/operators/ggn.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/ggn.py)
- [hessian_eigenthings/loss_fns/_fused_ce_hvp.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/loss_fns/_fused_ce_hvp.py)
- [hessian_eigenthings/loss_fns/huggingface.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/loss_fns/huggingface.py)
- [hessian_eigenthings/loss_fns/standard.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/loss_fns/standard.py)
- [hessian_eigenthings/algorithms/lanczos.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/algorithms/lanczos.py)
- [hessian_eigenthings/algorithms/trace.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/algorithms/trace.py)
</details>

# Why Hessian-Vector Products

Hessian-vector products (HVPs) are the computational foundation of this library. Understanding *why* we use HVPs instead of computing the full Hessian matrix is essential for appreciating the design and capabilities of `hessian-eigenthings`.

## The Full Hessian Problem

The Hessian matrix $H$ of a neural network's loss function is a second-order partial derivative matrix with dimensions $[n \times n]$, where $n$ is the number of parameters. For modern large-scale models:

| Model | Parameters | Hessian Size | Memory (fp32) |
|-------|-----------|--------------|---------------|
| BERT-Base | 110M | 110M × 110M | ~48 PB |
| GPT-2 | 1.5B | 1.5B × 1.5B | ~9 EB |
| LLaMA-7B | 7B | 7B × 7B | ~196 EB |

Storing the full Hessian is fundamentally infeasible. Even computing it via automatic differentiation requires $O(n^2)$ operations and memory that scales quadratically with model size. Sources: [README.md:1-30]()

## What is a Hessian-Vector Product?

A Hessian-vector product computes $Hv$ for a given vector $v$ without ever constructing $H$ explicitly. The operation takes $O(n)$ time and memory—linear in the number of parameters.

Formally, given:
- Loss function $\mathcal{L}(\theta)$
- Parameter vector $\theta \in \mathbb{R}^n$
- Direction vector $v \in \mathbb{R}^n$

The HVP is:
$$Hv = \nabla_\theta^2 \mathcal{L} \cdot v = \frac{\partial}{\partial \theta} \left( \nabla_\theta \mathcal{L} \cdot v \right)$$

This is implemented as a double-backward pass:
1. Forward pass → compute loss
2. Backward pass → compute gradient $\nabla_\theta \mathcal{L}$
3. Second backward pass → differentiate the scalar $\nabla_\theta \mathcal{L} \cdot v$ to obtain $H \cdot v$

Sources: [hessian_eigenthings/operators/hessian.py:1-50]()
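
These three steps translate almost line-for-line into autograd calls. A minimal, self-contained sketch (independent of this library's `HessianOperator`):

```python
import torch

def hvp(loss: torch.Tensor, params: list, v: torch.Tensor) -> torch.Tensor:
    """H @ v via double backward: differentiate <grad, v> a second time."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    hv = torch.autograd.grad(flat_grad @ v, params)
    return torch.cat([h.reshape(-1) for h in hv])

# Check on a quadratic loss whose Hessian is known exactly (H = 2I).
theta = torch.randn(3, dtype=torch.float64, requires_grad=True)
loss = (theta ** 2).sum()
v = torch.tensor([1.0, 0.0, 0.0], dtype=torch.float64)
print(hvp(loss, [theta], v))  # ≈ [2.0, 0.0, 0.0]
```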

## Why HVPs Enable Scalable Curvature Analysis

By avoiding explicit Hessian construction, HVP-based algorithms can operate on models of any size. This library provides several key algorithms that all rely on HVP as their primitive operation:

```mermaid
graph TD
    A[Hessian-Vector Product] --> B[Lanczos Eigendecomposition]
    A --> C[Hutchinson Trace Estimation]
    A --> D[Hutch++ Trace Estimation]
    A --> E[Stochastic Lanczos Quadrature]
    
    B --> F[Top-k Eigenvalues & Eigenvectors]
    C --> G[Trace Estimation]
    D --> G
    E --> H[Spectral Density Plot]
```

### Eigendecomposition via Lanczos

The Lanczos algorithm iteratively builds an orthogonal basis that tridiagonalizes the operator. It requires only matrix-vector products, making it perfect for HVP-based curvature analysis:

| Property | Full Eigendecomp | Lanczos + HVP |
|----------|------------------|---------------|
| Memory | $O(n^2)$ | $O(n \cdot k)$ |
| Time | $O(n^3)$ | $O(n \cdot k^2)$ |
| Storage | Entire matrix | $k$ Lanczos vectors |

Where $k$ is the number of desired eigenpairs (typically 1-20). Sources: [hessian_eigenthings/algorithms/lanczos.py:1-60]()

### Trace Estimation via Hutchinson's Method

The trace of the Hessian can be estimated without constructing the full matrix:

$$\text{tr}(H) \approx \frac{1}{m} \sum_{i=1}^{m} v_i^T H v_i$$

where $v_i$ are random probe vectors (typically Rademacher or Gaussian). Each term $v_i^T H v_i$ is a single HVP plus a dot product. Sources: [hessian_eigenthings/algorithms/trace.py:1-45]()

## HVP Implementation Strategies

The library provides two distinct methods for computing HVPs, each with different trade-offs:

### Method 1: Autograd (Default)

Uses `torch.autograd.grad` with `create_graph=True` for exact double-backward computation:

```python
def __init__(
    self,
    model: nn.Module,
    dataloader: Iterable[Any],
    loss_fn: LossFn,
    *,
    method: HvpMethod = "autograd",  # Default
    ...
) -> None:
```

**Advantages:**
- Numerically exact (to floating-point rounding)
- Works with any differentiable loss function
- Simple implementation

**Disadvantages:**
- Builds the full computation graph for second derivatives
- Memory scales with model complexity and output size

Sources: [hessian_eigenthings/operators/hessian.py:20-45]()

### Method 2: Finite Difference

Uses central-difference approximation:
$$\frac{\nabla_\theta \mathcal{L}(\theta + \epsilon v) - \nabla_\theta \mathcal{L}(\theta - \epsilon v)}{2\epsilon}$$

**Advantages:**
- No second-backward graph → lower memory footprint
- Compatible with distributed training (FSDP/HSDP/TP) without special handling

**Disadvantages:**
- $O(\epsilon^2)$ truncation bias
- Precision-dependent roundoff (~1e-5 fp32, ~1e-2 bf16)

Sources: [hessian_eigenthings/operators/hessian.py:30-40]()
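
The central difference translates into a short flat-parameter sketch; `fd_hvp` and `loss_of_theta` are illustrative names, not library API:

```python
import torch

def fd_hvp(loss_of_theta, theta, v, eps: float = 1e-3) -> torch.Tensor:
    """Central-difference HVP: (g(θ+εv) − g(θ−εv)) / 2ε."""
    def grad_at(point):
        point = point.detach().requires_grad_(True)
        (g,) = torch.autograd.grad(loss_of_theta(point), point)
        return g
    return (grad_at(theta + eps * v) - grad_at(theta - eps * v)) / (2 * eps)

# Check against a quadratic 0.5 * θᵀ A θ, whose Hessian is exactly A.
A = torch.randn(4, 4, dtype=torch.float64)
A = (A + A.T) / 2
theta = torch.randn(4, dtype=torch.float64)
v = torch.randn(4, dtype=torch.float64)
print(fd_hvp(lambda t: 0.5 * t @ A @ t, theta, v))
print(A @ v)  # agreement up to O(eps²)
```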

## Fused HVP for Cross-Entropy Losses

For language models with large vocabulary softmax heads, computing $H_{\text{loss}} \cdot u$ naively allocates multiple $(N, V)$ intermediate tensors (where $V$ is vocabulary size). This is addressed with fused implementations:

| Backend | Speedup | Memory Reduction | Requirements |
|---------|---------|-------------------|--------------|
| `eager` | 1× (baseline) | 1× | Any |
| `compile` | ~2.6× | ~2× | torch.compile |
| `triton` | ~3.4× | ~2× | CUDA + Triton |

The fused computation computes:
$$H_{\text{loss}} \cdot u = \frac{p \odot u - \langle p, u \rangle \, p}{n} \odot \text{mask}$$

Where $p = \text{softmax}(\text{logits})$ and the implementation avoids materializing the full $(N, V)$ softmax output. Sources: [hessian_eigenthings/loss_fns/_fused_ce_hvp.py:1-50]()

## Generalized Gauss-Newton (GGN) Approximation

For optimization-focused curvature analysis, the GGN matrix $G$ provides a positive semi-definite (PSD) approximation to the Hessian:

$$G = J^T \cdot H_{\text{loss}} \cdot J$$

Where $J$ is the Jacobian of the model outputs with respect to parameters. For cross-entropy + softmax classification, $G$ equals the Fisher information matrix. The GGN is always PSD by construction, making it suitable for optimization algorithms. Sources: [hessian_eigenthings/operators/ggn.py:1-40]()

### GGN Matvec Implementation

The GGNOperator supports two matvec paths:

1. **Analytical (default):** Finite-difference JVP + analytical loss-Hessian-vector product + single normal backward. Memory footprint matches one normal training step.

2. **Autograd:** Original `torch.func.jvp` + autograd double-backward + `torch.func.vjp`. Numerically exact but scales badly with vocabulary size.

Sources: [hessian_eigenthings/operators/ggn.py:25-45]()

## Practical Implications

The HVP approach enables:

| Capability | HVP-Based | Full Hessian |
|------------|-----------|--------------|
| 7B parameter model | ✅ ~hours | ❌ impossible |
| Top-10 eigenpairs | ✅ | ❌ |
| Trace estimation | ✅ | ❌ |
| Spectral density | ✅ | ❌ |
| FSDP compatibility | ✅ (finite-diff) | ❌ |

The eigenvalues and eigenvectors of the Hessian have been implicated in generalization properties of neural networks. Researchers hypothesize that "flat minima" generalize better, that Hessians of large models are very low-rank, and that curvature analysis can guide optimization. Sources: [README.md:25-35]()

## Summary

Hessian-vector products are the fundamental building block that makes large-scale curvature analysis possible:

1. **Memory efficiency:** $O(n)$ vs $O(n^2)$ for the full Hessian
2. **Computational efficiency:** $O(n)$ per matvec vs $O(n^2)$ for full computation
3. **Scalability:** Works with models of any size via iterative algorithms
4. **Flexibility:** Supports exact (autograd) or memory-efficient (finite-difference) computation

The `hessian-eigenthings` library provides production-ready implementations of HVP computation and HVP-based algorithms for practical curvature analysis in PyTorch.

---

<a id='architecture'></a>

## System Architecture

### Related Pages

Related topics: [Curvature Operators](#curvature-operators), [Eigendecomposition Algorithms](#algorithms), [Loss Functions](#loss-functions)

<details>
<summary>Relevant source files</summary>

The following source files were used to generate this page:

- [hessian_eigenthings/operators/base.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/base.py)
- [hessian_eigenthings/algorithms/__init__.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/algorithms/__init__.py)
- [hessian_eigenthings/linalg/__init__.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/linalg/__init__.py)
- [hessian_eigenthings/operators/hessian.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/hessian.py)
- [hessian_eigenthings/operators/ggn.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/ggn.py)
- [hessian_eigenthings/algorithms/lanczos.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/algorithms/lanczos.py)
- [hessian_eigenthings/algorithms/trace.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/algorithms/trace.py)
- [hessian_eigenthings/loss_fns/huggingface.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/loss_fns/huggingface.py)
</details>

# System Architecture

The `pytorch-hessian-eigenthings` library provides an efficient and scalable framework for computing eigendecompositions of curvature matrices—including the Hessian, Generalized Gauss-Newton (GGN) matrix, and empirical Fisher—for arbitrary PyTorch models. The architecture is designed around three core abstractions: **Curvature Operators**, **Algorithms**, and **Linear Algebra Backends**.

## High-Level Architecture Overview

The library implements a layered architecture that separates mathematical curvature computations from numerical algorithms:

```mermaid
graph TD
    subgraph "User Layer"
        U[User Code]
    end
    
    subgraph "Algorithm Layer"
        LA[Lanczos]
        TR[Trace Estimation]
        SP[Stochastic Power Iteration]
    end
    
    subgraph "Operator Layer"
        HO[HessianOperator]
        GGN[GGNOperator]
        FO[FisherOperator]
    end
    
    subgraph "Backend Layer"
        B[LinAlgBackend]
        SD[SingleDeviceBackend]
    end
    
    subgraph "PyTorch Core"
        PT[PyTorch Autograd]
    end
    
    U -->|uses| LA
    U -->|uses| TR
    U -->|uses| HO
    LA -->|operates on| HO
    TR -->|operates on| HO
    HO -->|implemented via| B
    B -->|delegates to| PT
```

## Core Components

### 1. Curvature Operators

Curvature operators are the foundation of the library. They abstract away the details of how matrix-vector products (matvecs) with curvature matrices are computed, providing a unified interface for algorithms to work with.

#### Base Interface

All operators inherit from `CurvatureOperator`, which defines the contract for curvature computations:

| Property/Method | Type | Description |
|-----------------|------|-------------|
| `size` | `int` | Total number of parameters in the curvature matrix |
| `dtype` | `torch.dtype` | Data type of the operator |
| `device` | `torch.device` | Device where computations run |
| `matvec(v)` | `Callable` | Computes `A @ v` for input vector `v` |

#### Hessian Operator

The `HessianOperator` computes the Hessian of a loss function with respect to model parameters:

```python
HessianOperator(
    model: nn.Module,
    dataloader: Iterable[Any],
    loss_fn: LossFn,
    *,
    param_filter: ParamFilter | None = None,
    method: HvpMethod = "autograd"  # or "finite_difference"
)
```

Sources: [hessian_eigenthings/operators/hessian.py:1-50]()

Two HVP computation methods are supported:

| Method | Description | Use Case |
|--------|-------------|----------|
| `"autograd"` (default) | Exact double-backward via `torch.autograd.grad` | Up to ~7B parameters |
| `"finite_difference"` | Central-difference approximation | FSDP/HSDP/TP at scale |

The finite difference method uses the approximation:

```
Hv ≈ (∇L(θ+εv) − ∇L(θ−εv)) / (2ε)
```

This avoids the second-backward graph entirely, making it compatible with distributed training setups.

#### GGN Operator

The `GGNOperator` implements the Generalized Gauss-Newton matrix, which is always positive semi-definite:

```python
GGNOperator(
    model: nn.Module,
    dataloader: Iterable[Any],
    forward_fn: ForwardFn,
    loss_of_output_fn: LossOfOutputFn,
    *,
    loss_hvp: Literal["analytical", "autograd"] = "analytical"
)
```

Sources: [hessian_eigenthings/operators/ggn.py:1-80]()

The GGN decomposes as `G = J^T · H_loss · J` where:
- `J` is the Jacobian of the model output with respect to parameters
- `H_loss` is the Hessian of the loss with respect to the output

For cross-entropy + softmax classification, `G` equals the Fisher information matrix.

### 2. Algorithms Layer

Algorithms operate on any `CurvatureOperator` via its `matvec` interface, enabling eigenvalue computation, trace estimation, and spectral density analysis.

#### Lanczos Eigensolver

The Lanczos algorithm computes the top-k eigenvalues and eigenvectors of a symmetric matrix using only matrix-vector products:

```python
lanczos(
    operator: CurvatureOperator,
    k: int = 10,
    max_iter: int = 100,
    tol: float = 1e-3,
    which: str = "LA"  # LA, SA, or LM
) -> EigendecompositionResult
```

Sources: [hessian_eigenthings/algorithms/lanczos.py:1-50]()

**Key features:**

- **Ritz vector accumulation**: Directly accumulates Ritz vectors into final `(k, n)` layout via rank-1 outer-product updates, avoiding transient `(n, k)` transpose copies
- **Convergence tracking**: Monitors residual norms `|β_k · s_{k}|` to determine convergence
- **Eigenvalue selection**: Supports "LA" (largest algebraic), "SA" (smallest algebraic), and "LM" (largest magnitude)

#### Trace Estimation

The library provides multiple trace estimation methods:

| Method | Description | Samples Required |
|--------|-------------|------------------|
| `hutchinson` | Classical Hutchinson: `(1/m) Σ vᵢᵀ A vᵢ` | Higher variance |
| `hutch++` | Improved estimator with lower variance | ~30 recommended |

```python
trace(
    operator: CurvatureOperator,
    num_matvecs: int = 30,
    method: str = "hutch++",
    seed: int | None = None
) -> TraceResult
```

Sources: [hessian_eigenthings/algorithms/trace.py:1-40]()

The Hutch++ estimator achieves lower variance by spending part of its matvec budget on a low-rank sketch of the operator and applying Hutchinson probing only to the deflated remainder.
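
A compact sketch of that split, spending the matvec budget in three equal parts (illustrative only, not the library's `trace` internals):

```python
import torch

def hutchpp_trace(matvec, n: int, m: int) -> torch.Tensor:
    """Hutch++ sketch with m matvecs: deflate a low-rank part, probe the rest."""
    k = m // 3
    S = torch.randn(n, k, dtype=torch.float64).sign()            # sketch matrix
    AS = torch.stack([matvec(S[:, i]) for i in range(k)], dim=1)
    Q, _ = torch.linalg.qr(AS)                                   # basis of range(AS)
    AQ = torch.stack([matvec(Q[:, i]) for i in range(k)], dim=1)
    exact_part = torch.trace(Q.T @ AQ)                           # trace on range(Q)
    G = torch.randn(n, k, dtype=torch.float64).sign()
    G = G - Q @ (Q.T @ G)                                        # deflate the probes
    AG = torch.stack([matvec(G[:, i]) for i in range(k)], dim=1)
    return exact_part + (G * AG).sum() / k                       # Hutchinson residual

A = torch.randn(200, 200, dtype=torch.float64)
A = (A + A.T) / 2
print(hutchpp_trace(lambda x: A @ x, n=200, m=30))  # estimate
print(torch.trace(A))                               # exact value
```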

### 3. Backend Layer

The `LinAlgBackend` abstract interface decouples linear algebra operations from specific device implementations:

```mermaid
classDiagram
    class LinAlgBackend~T~ {
        <<abstract>>
        +matmul(a, b) T
        +dot(a, b) T
        +norm(v) T
        +fill(v, value) T
        +copy(v) T
    }
    
    class SingleDeviceBackend {
        +matmul(a, b) Tensor
        +dot(a, b) Tensor
        +norm(v) Tensor
    }
    
    LinAlgBackend <|-- SingleDeviceBackend
```

Backends provide:
- Vector arithmetic operations (dot product, norm, fill, copy)
- Device-specific optimizations
- Memory allocation strategies
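
A single-device backend satisfying the surface sketched in the class diagram could look like the following; the method names are taken from the diagram above, and this is a sketch rather than the library's actual class:

```python
import torch

class SingleDeviceBackendSketch:
    """Minimal single-device linear algebra backend (illustrative)."""

    def matmul(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        return a @ b

    def dot(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        return torch.dot(a, b)

    def norm(self, v: torch.Tensor) -> torch.Tensor:
        return torch.linalg.vector_norm(v)

    def fill(self, v: torch.Tensor, value: float) -> torch.Tensor:
        return v.fill_(value)

    def copy(self, v: torch.Tensor) -> torch.Tensor:
        return v.clone()
```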

## Data Flow

### Eigendecomposition Workflow

```mermaid
sequenceDiagram
    participant User
    participant Operator as CurvatureOperator
    participant Backend as LinAlgBackend
    participant Algo as Lanczos Algorithm
    participant PyTorch as PyTorch Autograd
    
    User->>Operator: Instantiate with model, dataloader
    User->>Algo: Call lanczos(operator, k)
    Algo->>Operator: Request matvec(v)
    Operator->>Backend: Allocate probe vector
    Backend->>PyTorch: Create tensor
    Operator->>PyTorch: Forward pass + backward
    PyTorch-->>Operator: Return HVP result
    Operator-->>Algo: Return Av
    Algo->>Algo: Repeat for m iterations
    Algo-->>User: Return eigenvalues, eigenvectors
```

### Loss Function Integration

The library supports two loss function patterns:

```mermaid
graph LR
    subgraph "Single Function API"
        L1[loss_fn<br/>model, batch → scalar]
    end
    
    subgraph "Two Function API (for GGN)"
        F1[forward_fn<br/>model, batch → output]
        L2[loss_of_output_fn<br/>output, batch → scalar]
    end
    
    L1 --> HO[HessianOperator]
    F1 --> GGN[GGNOperator]
    L2 --> GGN
```

#### HuggingFace Integration

For language models, the library provides optimized loss functions:

```python
hf_lm_loss(fused="auto")  # Auto-selects Triton or torch.compile
```

Sources: [hessian_eigenthings/loss_fns/huggingface.py:1-30]()

The fused implementation:
- Uses Triton kernels on CUDA (~3.4x speedup, 2x peak-memory reduction)
- Falls back to `torch.compile` (~2.6x speedup, 2x peak-memory reduction)
- Eliminates most `(N, V)` intermediates

## Configuration Options

### HessianOperator Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `model` | `nn.Module` | Required | PyTorch model |
| `dataloader` | `Iterable` | Required | Data batches |
| `loss_fn` | `LossFn` | Required | Loss computation function |
| `param_filter` | `ParamFilter` | `None` | Filter parameters by name |
| `method` | `HvpMethod` | `"autograd"` | HVP computation method |
| `fd_eps` | `float` | `None` | Finite difference epsilon |

### GGNOperator Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `loss_hvp` | `str` | `"analytical"` | `"analytical"` or `"autograd"` |
| `full_dataset` | `bool` | `True` | Average over full dataset |
| `num_batches` | `int` | `None` | Limit to first N batches |
| `microbatch_size` | `int` | `None` | Process in smaller chunks |

### Lanczos Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `k` | `int` | `10` | Number of eigenvalues to compute |
| `max_iter` | `int` | `100` | Maximum Lanczos iterations |
| `tol` | `float` | `1e-3` | Convergence tolerance |
| `which` | `str` | `"LA"` | Which eigenvalues ("LA", "SA", "LM") |
| `reorthogonalize` | `bool` | `False` | Full reorthogonalization |

## Usage Patterns

### Basic Hessian Eigenvalue Computation

```python
import torch
from hessian_eigenthings import HessianOperator, lanczos

# Create operator
hessian_op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=lambda m, b: torch.nn.functional.cross_entropy(m(b[0]), b[1])
)

# Compute top eigenvalues
eig_result = lanczos(hessian_op, k=10, max_iter=100)
print(eig_result.eigenvalues)
```

### Parameter-Filtered Analysis

```python
from hessian_eigenthings import HessianOperator, lanczos
from hessian_eigenthings.util import match_names

# Analyze only attention parameters
attn_op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=match_names("blocks.*.attn.*")
)
eig_attn = lanczos(attn_op, k=3)
```

### Trace Estimation

```python
from hessian_eigenthings import HessianOperator, trace

trace_result = trace(
    hessian_op,
    num_matvecs=30,
    method="hutch++",
    seed=42
)
print(f"Trace estimate: {trace_result.estimate:.4e}")
```

## Architecture Benefits

| Benefit | Description |
|---------|-------------|
| **Separation of Concerns** | Operators define "what" to compute; algorithms define "how" |
| **Flexibility** | Any operator can use any algorithm |
| **Scalability** | Backends enable device-specific optimizations |
| **Composability** | Easy to add new operators or algorithms |
| **Memory Efficiency** | Matrix-free design avoids explicit matrix storage |

## Extension Points

### Adding Custom Curvature Operators

New operators should subclass `CurvatureOperator` and implement the `matvec` method:

```python
class CustomCurvatureOperator(CurvatureOperator):
    def __init__(self, model, dataloader):
        super().__init__()
        self.model = model
        self.dataloader = dataloader
        # Register parameters
    
    def _matvec(self, v: torch.Tensor) -> torch.Tensor:
        # Implement A @ v; `custom_computation` is a placeholder for the
        # operator-specific matrix-vector product
        return custom_computation(v)
```

### Adding New Algorithms

Algorithms should accept any `CurvatureOperator` and use the backend exclusively:

```python
def custom_algorithm(
    operator: CurvatureOperator,
    backend: LinAlgBackend | None = None
) -> SomeResult:
    backend = backend or SingleDeviceBackend()
    # Use the backend for all vector arithmetic so the algorithm runs
    # unchanged on single-device and distributed setups
    ...
```

## Summary

The system architecture of `pytorch-hessian-eigenthings` follows a clean, modular design that separates curvature matrix computation (operators), numerical algorithms (Lanczos, trace estimation), and linear algebra primitives (backends). This design enables efficient Hessian and GGN eigendecomposition for models ranging from small MLPs to large language models, with support for distributed training and optimized fused computations.

---

<a id='curvature-operators'></a>

## Curvature Operators

### Related Pages

Related topics: [System Architecture](#architecture), [Distributed Computing with DDP](#distributed-computing)

<details>
<summary>Relevant Source Files</summary>

The following source files were used to generate this page:

- [hessian_eigenthings/operators/base.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/base.py)
- [hessian_eigenthings/operators/hessian.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/hessian.py)
- [hessian_eigenthings/operators/ggn.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/ggn.py)
- [hessian_eigenthings/operators/fisher.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/fisher.py)
- [hessian_eigenthings/operators/__init__.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/__init__.py)
</details>

# Curvature Operators

## Overview

Curvature Operators in `hessian-eigenthings` provide a matrix-free abstraction for computing Hessian eigendecomposition and related curvature matrices for arbitrary PyTorch models. They implement the `CurvatureOperator` base class interface, enabling efficient computation of eigenvalues, eigenvectors, traces, and spectral densities without explicitly forming potentially massive matrices.

The core abstraction allows algorithms (Lanczos, power iteration, Hutch++) to operate on any curvature matrix through a unified `matvec(v)` interface that computes $Av$ for any vector $v$, enabling scalability to large models with billions of parameters.

Sources: [hessian_eigenthings/__init__.py:1-10](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/__init__.py)

## Architecture

```mermaid
graph TD
    subgraph "Curvature Operators"
        Base[CurvatureOperator<br/>Base Class]
        Hessian[HessianOperator]
        GGN[GGNOperator]
        Fisher[EmpiricalFisherOperator]
        Lambda[LambdaOperator]
        DDP[DDPHessianOperator]
    end
    
    subgraph "Algorithms"
        Lanczos[Lanczos Eigendecomposition]
        Power[Power Iteration]
        Trace[Trace Estimation<br/>Hutch++/Hutchinson]
        Spectral[Spectral Density<br/>Stochastic Lanczos Quadrature]
    end
    
    Base --> Hessian
    Base --> GGN
    Base --> Fisher
    Base --> Lambda
    Base --> DDP
    
    Hessian --> Lanczos
    GGN --> Lanczos
    Fisher --> Lanczos
    Lambda --> Lanczos
    
    Hessian --> Trace
    GGN --> Trace
    Fisher --> Trace
    
    Hessian --> Power
    GGN --> Power
    Fisher --> Power
    
    Hessian --> Spectral
    GGN --> Spectral
    Fisher --> Spectral
```

## Base Class: CurvatureOperator

All curvature operators inherit from `CurvatureOperator`, which defines the contract that subclasses must fulfill.

### Core Interface

| Method | Description |
|--------|-------------|
| `matvec(v)` | Compute $Av$ where $A$ is the curvature matrix |
| `size` | Total number of parameters in the operator's scope |
| `dtype`, `device` | Tensor dtype and device for vector operations |

Sources: [hessian_eigenthings/operators/base.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/base.py)

### Parameter Filtering

Curvature operators can be restricted to subsets of model parameters using `param_filter`, enabling analysis of specific components (e.g., attention layers only).

```python
from hessian_eigenthings import HessianOperator, match_names

# Filter to attention parameters only
attn_op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=match_names("blocks.*.attn.*")
)
```

Sources: [hessian_eigenthings/__init__.py:35-38](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/__init__.py)

## HessianOperator

Computes the Hessian $\nabla_{\theta}^2 \mathcal{L}$ of the loss function with respect to model parameters.

### Key Features

- **Two HVP methods**: `autograd` (exact double-backward via `torch.autograd.grad`) and `finite_difference` (central difference for FSDP/TP compatibility)
- **Batched computation**: Automatically averages over multiple batches from the dataloader
- **Microbatch support**: For large models, process batches in smaller microbatches

### Constructor Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `model` | `nn.Module` | Required | PyTorch model |
| `dataloader` | `Iterable` | Required | Data batches |
| `loss_fn` | `LossFn` | Required | Loss computation function |
| `param_filter` | `ParamFilter \| None` | `None` | Parameter name filter |
| `full_dataset` | `bool` | `True` | Average over full dataset |
| `num_batches` | `int \| None` | `None` | Limit batches for stochastic estimate |
| `microbatch_size` | `int \| None` | `None` | Split batches into smaller microbatches |
| `method` | `HvpMethod` | `"autograd"` | HVP computation method |
| `fd_eps` | `float \| None` | `None` | Finite difference epsilon |
| `backend` | `LinAlgBackend \| None` | `None` | Linear algebra backend |

### HVP Method Comparison

| Method | Accuracy | Memory | FSDP/TP Compatible | Speed |
|--------|----------|--------|-------------------|-------|
| `autograd` | Exact (to rounding) | High | No | Fast |
| `finite_difference` | $O(\epsilon^2)$ bias | Low | Yes | 2x passes |

Sources: [hessian_eigenthings/operators/hessian.py:1-60](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/hessian.py)

### Finite Difference Epsilon Table

| Dtype | Optimal $\epsilon$ |
|-------|-------------------|
| `float64` | `6e-6` |
| `float32` | `5e-3` |
| `bfloat16` | `0.2` |
| `float16` | `5e-2` |

Sources: [hessian_eigenthings/operators/hessian.py:34-40](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/hessian.py)
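As a concrete illustration of the central-difference scheme behind `finite_difference` (a self-contained sketch; `fd_hvp` is a hypothetical helper, not library code):

```python
import torch

def fd_hvp(loss_fn, theta, v, eps=5e-3):
    """Central-difference HVP: H @ v ≈ (∇L(θ+εv) − ∇L(θ−εv)) / (2ε)."""
    def grad_at(point):
        point = point.detach().requires_grad_(True)
        (g,) = torch.autograd.grad(loss_fn(point), point)
        return g
    return (grad_at(theta + eps * v) - grad_at(theta - eps * v)) / (2 * eps)

# Sanity check on a quadratic loss L(θ) = ½ θᵀAθ, whose Hessian is exactly A
A = torch.tensor([[3.0, 1.0], [1.0, 2.0]])
theta, v = torch.randn(2), torch.randn(2)
hv = fd_hvp(lambda t: 0.5 * t @ (A @ t), theta, v)
print(torch.allclose(hv, A @ v, atol=1e-3))  # True, up to O(ε²) bias
```

Since the gradient of a quadratic is linear, the central difference is exact here up to rounding; on a real network the $O(\epsilon^2)$ bias from the table above applies.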

## GGNOperator

The Generalized Gauss-Newton (GGN) matrix $G = J^T H_{loss} J$ provides a PSD approximation to the Hessian that is cheaper to apply: it keeps the curvature of the loss head while dropping the indefinite term involving second derivatives of the network function.

### Key Features

- **Always PSD**: Unlike the exact Hessian, the GGN is positive semi-definite by construction
- **Analytical HVP path**: For losses with known HVP (e.g., cross-entropy), uses analytical computation
- **For cross-entropy + softmax**: $G$ equals the Fisher information matrix

### Two Matvec Implementations

| `loss_hvp` | Description | Memory | Use Case |
|------------|-------------|--------|----------|
| `"analytical"` (default) | FD JVP + analytical loss-Hessian-vec + one backward | Matches one training step | LM-scale, large vocab |
| `"autograd"` | `torch.func.jvp` + double-backward + `torch.func.vjp` | Scales with output size | Exact, small vocab |

Sources: [hessian_eigenthings/operators/ggn.py:1-50](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/ggn.py)
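For intuition, an `"autograd"`-style GGN matvec can be assembled from `torch.func` primitives, computing $Gv = J^\top H_{loss} (Jv)$ in three steps. This is an illustrative sketch (the helper name, the `(inputs, targets)` batch layout, and the structure are assumptions, not the library's internals):

```python
import torch
from torch.func import functional_call, jvp, vjp

def ggn_matvec_sketch(model, loss_of_output_fn, batch, v_dict):
    """G @ v = Jᵀ H_loss (J v): forward JVP, loss double-backward, reverse VJP."""
    params = dict(model.named_parameters())
    names = list(params.keys())
    primals = tuple(p.detach() for p in params.values())
    tangents = tuple(v_dict[n] for n in names)

    def forward(*flat):
        return functional_call(model, dict(zip(names, flat)), (batch[0],))

    # 1. Jv: forward-mode directional derivative of the network output
    output, jv = jvp(forward, primals, tangents)
    # 2. H_loss @ (Jv): double-backward through the loss head only
    out = output.detach().requires_grad_(True)
    (g,) = torch.autograd.grad(loss_of_output_fn(out, batch), out, create_graph=True)
    (h_jv,) = torch.autograd.grad(g, out, grad_outputs=jv)
    # 3. Jᵀ (H_loss Jv): reverse-mode pullback to parameter space
    _, pullback = vjp(forward, *primals)
    return dict(zip(names, pullback(h_jv)))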

### Fused Cross-Entropy HVP

For language model training, the GGN operator includes a fused kernel for the CE HVP computation:

```python
# Auto-selects fastest backend: Triton > torch.compile > eager
hf_lm_shifted_ce(fused="auto")
```

The fused implementation reduces peak memory by 2x compared to eager, with Triton providing ~3.4x speedup on CUDA.

Sources: [hessian_eigenthings/loss_fns/huggingface.py:1-30](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/loss_fns/huggingface.py)

## EmpiricalFisherOperator

Computes the empirical Fisher information matrix $F = \frac{1}{N} \sum_{i=1}^N \nabla_{\theta} \log p(y_i|x_i) \nabla_{\theta} \log p(y_i|x_i)^T$.

For softmax classification with cross-entropy loss, the GGN coincides with the (true) Fisher information matrix; the empirical Fisher replaces the expectation under the model's predictive distribution with the observed labels.

Sources: [hessian_eigenthings/operators/fisher.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/fisher.py)

## LambdaOperator

Creates custom curvature operators from lambda functions for testing or custom curvature definitions.

```python
from hessian_eigenthings import LambdaOperator

# Custom operator that always returns a scaled vector
custom_op = LambdaOperator(
    size=1000,
    matvec=lambda v: 2.0 * v  # Represents 2*I
)
```

## DDPHessianOperator

Distributed Data Parallel-aware Hessian operator that handles gradient synchronization across processes.

Sources: [hessian_eigenthings/operators/__init__.py:15-18](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/__init__.py)

## Common Usage Patterns

### Computing Top Eigenvalues

```python
from hessian_eigenthings import HessianOperator, lanczos

operator = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn
)

result = lanczos(operator, k=10, max_iter=100)
print(f"Top eigenvalues: {result.eigenvalues}")
```

### Estimating Trace

```python
from hessian_eigenthings import GGNOperator, trace

operator = GGNOperator(
    model=model,
    dataloader=dataloader,
    forward_fn=model_forward,
    loss_of_output_fn=loss_fn
)

result = trace(operator, num_matvecs=100, method="hutch++")
print(f"Trace estimate: {result.estimate:.4e} ± {result.stderr:.4e}")
```

### Component-Specific Analysis

```python
from hessian_eigenthings import HessianOperator, match_regex

# Analyze only attention weights in transformer
attn_only = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=match_regex(r"blocks\.\d+\.attn\.")
)

# Analyze only MLP weights
mlp_only = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=match_regex(r"blocks\.\d+\.mlp\.")
)
```

## Linear Algebra Backends

The operators use pluggable `LinAlgBackend` for vector operations, enabling support for different hardware configurations and precision requirements.

| Backend | Use Case |
|---------|----------|
| `SingleDeviceBackend` | Single GPU/CPU |
| (Distributed backends) | Multi-GPU via FSDP/TP |

## Module Exports

```python
from hessian_eigenthings.operators import (
    CurvatureOperator,
    DDPHessianOperator,
    EmpiricalFisherOperator,
    GGNOperator,
    HessianOperator,
    LambdaOperator,
)
```

Sources: [hessian_eigenthings/operators/__init__.py:1-20](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/__init__.py)

---

<a id='algorithms'></a>

## Eigendecomposition Algorithms

### Related Pages

Related topics: [System Architecture](#architecture), [Why Hessian-Vector Products](#hvp-approach)

<details>
<summary>Relevant Source Files</summary>

The following source files were used to generate this page:

- [hessian_eigenthings/algorithms/lanczos.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/algorithms/lanczos.py)
- [hessian_eigenthings/algorithms/power_iteration.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/algorithms/power_iteration.py)
- [hessian_eigenthings/algorithms/trace.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/algorithms/trace.py)
- [hessian_eigenthings/algorithms/spectral_density.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/algorithms/spectral_density.py)
- [hessian_eigenthings/algorithms/result.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/algorithms/result.py)
- [hessian_eigenthings/algorithms/__init__.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/algorithms/__init__.py)
</details>

# Eigendecomposition Algorithms

The `hessian-eigenthings` library provides a suite of efficient iterative algorithms for computing eigendecompositions of curvature matrices (Hessian, Generalized Gauss-Newton, and Fisher) in PyTorch models. These algorithms enable analysis of neural network loss landscapes by extracting eigenvalues, eigenvectors, spectral densities, and trace estimates without explicitly constructing the full curvature matrix—a critical capability for modern large-scale models.

## Overview

Computing eigendecompositions of curvature matrices is fundamental to understanding generalization properties, flat minima, and training dynamics of neural networks. However, these curvature matrices are prohibitively large (n × n where n is the number of parameters), making explicit construction impossible for modern models.

The library implements Krylov subspace methods that only require matrix-vector products, enabling efficient computation of:

| Capability | Algorithm | Use Case |
|------------|-----------|----------|
| Top-k eigenvalues/eigenvectors | Lanczos, Power Iteration | Finding most-curved directions |
| Trace estimation | Hutchinson, Hutch++ | Computing average curvature |
| Spectral density | Stochastic Lanczos Quadrature | Visualizing eigenvalue distribution |

Sources: [hessian_eigenthings/algorithms/__init__.py:1-29]()

## Algorithm Architecture

The algorithms in this module follow a consistent design pattern: they accept any `CurvatureOperator` and use the `LinAlgBackend` exclusively for vector arithmetic, ensuring portability across single-device and distributed settings.

```mermaid
graph TD
    A[CurvatureOperator] --> B[Lanczos Algorithm]
    A --> C[Power Iteration]
    A --> D[Trace Estimation]
    A --> E[Spectral Density]
    
    B --> F[EigenResult]
    C --> F
    D --> G[TraceResult]
    E --> H[SpectralDensityResult]
    
    F --> I[eigenvalues: Tensor]
    F --> J[eigenvectors: Tensor]
    F --> K[residuals: Tensor]
```

Sources: [hessian_eigenthings/algorithms/result.py]()

## Lanczos Algorithm

The Lanczos algorithm is the primary method for computing eigenvalues and eigenvectors of symmetric matrices. It builds a Krylov subspace through repeated matrix-vector products, then solves the small tridiagonal eigenvalue problem.

### Symmetric Lanczos Implementation

The in-house Lanczos implementation provides optional full reorthogonalization to address the loss-of-orthogonality issues classical Lanczos is known for (Paige 1976).

```python
def lanczos_tridiagonal(
    operator: CurvatureOperator,
    v0: torch.Tensor,
    max_iter: int,
    *,
    reorthogonalize: bool = True,
    backend: LinAlgBackend[torch.Tensor] | None = None,
) -> LanczosTridiag
```

**Key characteristics:**

- **Default reorthogonalization**: Enabled for `max_iter <= 50` to suppress ghost eigenvalues
- **Computational tradeoff**: For larger Krylov dimensions, reorthogonalization becomes O(m²n); users analyzing near-degenerate spectra should re-enable it
- **Memory efficiency**: Accumulates Ritz vectors directly via rank-1 outer-product updates, avoiding transient (n, k) → (k, n) transpose copies

Sources: [hessian_eigenthings/algorithms/lanczos.py:30-58]()

### Lanczos Output Structure

```python
@dataclass(frozen=True)
class LanczosTridiag:
    """Output of one Lanczos run: tridiagonal coefficients + the basis used to build them."""
    alphas: torch.Tensor      # (m,) diagonal
    betas: torch.Tensor       # (m-1,) off-diagonal
    basis: list[torch.Tensor] # length m, each (n,)
    last_beta: float          # ||r_m|| residual norm at termination
    iterations: int          # m, the actual number of Lanczos steps completed
```

Sources: [hessian_eigenthings/algorithms/lanczos.py:30-43]()

### Full Lanczos Eigensolver

The high-level `lanczos()` function computes top-k eigenvalues with configurable eigenpair selection:

```python
def lanczos(
    operator: CurvatureOperator,
    k: int = 1,
    max_iter: int = 100,
    *,
    which: Which = "LM",
    tol: float = 1e-8,
    seed: int | None = None,
    reorthogonalize: bool | None = None,
    backend: LinAlgBackend[torch.Tensor] | None = None,
) -> EigenResult
```

**Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `operator` | `CurvatureOperator` | required | The curvature matrix operator |
| `k` | `int` | 1 | Number of eigenpairs to compute |
| `max_iter` | `int` | 100 | Maximum Lanczos iterations |
| `which` | `Literal["LM", "LA", "SA"]` | "LM" | Which eigenvalues: LM=largest magnitude, LA=largest algebraic, SA=smallest algebraic |
| `tol` | `float` | 1e-8 | Convergence tolerance |
| `seed` | `int \| None` | None | Random seed for reproducibility |
| `reorthogonalize` | `bool \| None` | None | Override default reorthogonalization setting |

Sources: [hessian_eigenthings/algorithms/lanczos.py:58-100]()

### Eigenvalue Selection Logic

The algorithm selects eigenvalues based on the `which` parameter:

```python
if which == "LM":
    order = torch.argsort(theta.abs(), descending=True)
elif which == "LA":
    order = torch.argsort(theta, descending=True)
elif which == "SA":
    order = torch.argsort(theta, descending=False)
```

Sources: [hessian_eigenthings/algorithms/lanczos.py:75-83]()

## Power Iteration

Power iteration is a simpler method for finding the dominant eigenvalue. The library implements deflated power iteration to compute multiple eigenpairs sequentially by projecting out previously found directions.

### Single Power Iteration

```python
def power_iteration_one(
    operator: CurvatureOperator,
    v0: torch.Tensor,
    max_iter: int,
    tol: float = 1e-6,
    backend: LinAlgBackend | None = None,
) -> tuple[torch.Tensor, torch.Tensor]
```

Sources: [hessian_eigenthings/algorithms/power_iteration.py]()

### Deflated Power Iteration

Deflated power iteration extends the basic method to compute multiple eigenpairs:

```python
def deflated_power_iteration(
    operator: CurvatureOperator,
    num_eigs: int,
    max_iter: int,
    tol: float = 1e-6,
    seed: int | None = None,
    backend: LinAlgBackend | None = None,
) -> EigenResult
```

The deflation process removes previously found eigenvectors from the subspace before searching for the next eigenpair, preventing convergence to already-computed directions.

Sources: [hessian_eigenthings/algorithms/power_iteration.py]()
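A minimal dense sketch of this deflation loop (illustrative only; the library's version works through `CurvatureOperator` and the backend abstraction, and differs in detail):

```python
import torch

def deflated_power_sketch(matvec, n, num_eigs=3, max_iter=500, tol=1e-6):
    """Find eigenpairs of largest |λ| one at a time, projecting out found ones."""
    eigvals, eigvecs = [], []
    for _ in range(num_eigs):
        v = torch.randn(n)
        lam = torch.tensor(0.0)
        for _ in range(max_iter):
            for w in eigvecs:               # deflation: remove found directions
                v = v - torch.dot(w, v) * w
            v = v / v.norm()
            av = matvec(v)
            lam = torch.dot(v, av)
            if (av - lam * v).norm() <= tol * lam.abs():
                break
            v = av
        eigvals.append(lam)
        eigvecs.append(v / v.norm())
    return torch.stack(eigvals), torch.stack(eigvecs)

# Example: A = diag(5, 2, 1) → eigenvalues found in order [5, 2, 1]
A = torch.diag(torch.tensor([5.0, 2.0, 1.0]))
vals, _ = deflated_power_sketch(lambda x: A @ x, n=3)
```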

## Trace Estimation

Trace estimation provides a way to compute the average eigenvalue (trace / dimension) without full eigendecomposition. This is computationally much cheaper and useful for understanding overall curvature magnitude.

### Hutchinson's Estimator

Hutchinson's method estimates the trace using random probe vectors:

```python
def hutchinson(
    operator: CurvatureOperator,
    *,
    num_samples: int = 100,
    distribution: Distribution = "rademacher",
    seed: int | None = None,
    backend: LinAlgBackend[torch.Tensor] | None = None,
) -> TraceResult
```

The estimator computes $\frac{1}{m}\sum_{i=1}^{m} v_i^{\top} A v_i$, where the $v_i$ are random probe vectors drawn from the specified distribution. Rademacher probes give lower variance than Gaussian ones.

Sources: [hessian_eigenthings/algorithms/trace.py:48-71]()
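A self-contained sketch of the estimator (illustrative; the helper name and defaults are assumptions):

```python
import torch

def hutchinson_sketch(matvec, n, num_samples=100, seed=0):
    """tr(A) ≈ (1/m) Σ vᵢᵀ A vᵢ with Rademacher (±1) probe vectors."""
    gen = torch.Generator().manual_seed(seed)
    samples = []
    for _ in range(num_samples):
        v = torch.randint(0, 2, (n,), generator=gen).float() * 2 - 1
        samples.append(torch.dot(v, matvec(v)))
    samples = torch.stack(samples)
    return samples.mean().item(), (samples.std() / num_samples**0.5).item()

# tr(diag(1..5)) = 15
A = torch.diag(torch.arange(1.0, 6.0))
est, stderr = hutchinson_sketch(lambda v: A @ v, n=5, num_samples=2000)
```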

### Hutch++ Estimator

Hutch++ is an improved estimator with better convergence properties:

```python
def hutch_plus_plus(
    operator: CurvatureOperator,
    *,
    num_matvecs: int = 30,
    seed: int | None = None,
    backend: LinAlgBackend[torch.Tensor] | None = None,
) -> TraceResult
```

Hutch++ uses a structured random sampling approach that achieves lower variance than standard Hutchinson with the same number of matrix-vector products.

Sources: [hessian_eigenthings/algorithms/trace.py:22-47]()
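The split behind Hutch++ can be sketched as follows (illustrative; the library's version runs through the backend abstraction): roughly a third of the matvec budget builds a range sketch whose trace is computed exactly, and plain Hutchinson handles the deflated remainder.

```python
import torch

def hutch_pp_sketch(matvec, n, num_matvecs=30, seed=0):
    """Hutch++: exact trace on a sketched top subspace + Hutchinson on the rest."""
    gen = torch.Generator().manual_seed(seed)
    k = max(num_matvecs // 3, 1)
    apply_cols = lambda M: torch.stack(
        [matvec(M[:, i]) for i in range(M.shape[1])], dim=1
    )

    S = torch.randn(n, k, generator=gen)
    Q, _ = torch.linalg.qr(apply_cols(S))      # orthonormal basis of range sketch
    exact = torch.trace(Q.T @ apply_cols(Q))   # tr(QᵀAQ), exact

    G = torch.randn(n, k, generator=gen)
    G = G - Q @ (Q.T @ G)                      # deflate probes against range(Q)
    residual = torch.trace(G.T @ apply_cols(G)) / k
    return (exact + residual).item()

# tr(diag(1..5)) = 15
A = torch.diag(torch.arange(1.0, 6.0))
print(hutch_pp_sketch(lambda v: A @ v, n=5, num_matvecs=9))
```

Because curvature spectra are typically dominated by a few large eigenvalues, the exact low-rank term captures most of the trace, which is what gives Hutch++ its variance advantage.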

### Unified Trace Interface

```python
def trace(
    operator: CurvatureOperator,
    num_matvecs: int = 30,
    *,
    method: Literal["hutchinson", "hutch++"] = "hutch++",
    seed: int | None = None,
    backend: LinAlgBackend[torch.Tensor] | None = None,
) -> TraceResult
```

**Validation:** The `num_matvecs` parameter is validated to be at least 1.

Sources: [hessian_eigenthings/algorithms/trace.py:71-84]()

### Trace Result Structure

```python
@dataclass
class TraceResult:
    estimate: float      # The trace estimate
    stderr: float       # Standard error of the estimate
    num_matvecs: int     # Number of matrix-vector products used
    operator_size: int   # Dimension of the operator
```

Sources: [hessian_eigenthings/algorithms/result.py]()

## Spectral Density

Spectral density estimation computes the eigenvalue distribution (density function) across the spectrum, enabling visualization and analysis of the full eigenvalue structure.

### Stochastic Lanczos Quadrature

The `spectral_density()` function implements Stochastic Lanczos Quadrature (SLQ) to compute the spectral density:

```python
def spectral_density(
    operator: CurvatureOperator,
    num_runs: int = 16,
    lanczos_steps: int = 50,
    seed: int | None = None,
    backend: LinAlgBackend[torch.Tensor] | None = None,
) -> SpectralDensityResult
```

**Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `operator` | `CurvatureOperator` | required | The curvature matrix operator |
| `num_runs` | `int` | 16 | Number of randomized runs for averaging |
| `lanczos_steps` | `int` | 50 | Lanczos iterations per run |
| `seed` | `int \| None` | None | Random seed |

Sources: [hessian_eigenthings/algorithms/spectral_density.py]()
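The mechanics of SLQ can be sketched directly from the Lanczos recurrence (an illustrative dense-vector sketch; the broadening width, grid construction, and helper name are assumptions): each run draws a random unit probe, builds a tridiagonal matrix, and turns its Ritz values and squared first eigenvector components into quadrature nodes and weights, which are then smoothed into a density.

```python
import torch

def slq_density_sketch(matvec, n, num_runs=16, steps=50, sigma=0.1, grid_pts=256):
    """Stochastic Lanczos Quadrature: Gaussian-broadened Ritz-value histogram."""
    nodes, weights = [], []
    for _ in range(num_runs):
        v = torch.randn(n)
        q, q_prev = v / v.norm(), None
        alphas, betas = [], []
        for _ in range(min(steps, n)):
            w = matvec(q)
            if q_prev is not None:
                w = w - betas[-1] * q_prev
            alphas.append(torch.dot(q, w))
            w = w - alphas[-1] * q
            b = w.norm()
            if b < 1e-8:
                break
            betas.append(b)
            q_prev, q = q, w / b
        T = torch.diag(torch.stack(alphas))
        m = len(alphas)
        if m > 1:
            off = torch.stack(betas[: m - 1])
            T = T + torch.diag(off, 1) + torch.diag(off, -1)
        theta, U = torch.linalg.eigh(T)          # Ritz values and vectors
        nodes.append(theta)
        weights.append(U[0, :] ** 2)             # Golub-Welsch quadrature weights
    nodes = torch.cat(nodes)
    weights = torch.cat(weights) / num_runs      # total mass ≈ 1
    lo = nodes.min().item() - 3 * sigma
    hi = nodes.max().item() + 3 * sigma
    grid = torch.linspace(lo, hi, grid_pts)
    kernel = torch.exp(-((grid[:, None] - nodes[None, :]) ** 2) / (2 * sigma**2))
    density = (weights[None, :] * kernel).sum(dim=1) / (sigma * (2 * torch.pi) ** 0.5)
    return grid, density
```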

### Spectral Density Result

```python
@dataclass
class SpectralDensityResult:
    grid: torch.Tensor      # Eigenvalue grid points
    density: torch.Tensor   # Density values at each grid point
    eigenvalues: list[torch.Tensor]  # Eigenvalues from each run
    eigenvectors: list[list[torch.Tensor]]  # Corresponding eigenvectors
```

The spectral density integrates to 1: `∫ density(λ) dλ ≈ 1`, which can be verified using numerical integration.

Sources: [hessian_eigenthings/algorithms/result.py]()

## Common Result Types

All algorithms return standardized result objects that encapsulate the computed quantities along with metadata about the computation.

```python
@dataclass
class EigenResult:
    eigenvalues: torch.Tensor       # (k,) tensor of eigenvalues
    eigenvectors: torch.Tensor      # (k, n) matrix of eigenvectors
    residuals: torch.Tensor          # (k,) convergence residuals
    iterations: int                 # Number of iterations run
    converged: bool                 # Whether all eigenpairs converged
```

Sources: [hessian_eigenthings/algorithms/result.py]()

## Algorithm Selection Guide

```mermaid
graph LR
    A[Goal] --> B{Eigenpairs?}
    B -->|Yes, top-k| C[How many?]
    C -->|1-10| D[Lanczos]
    C -->|Many| E{Orthogonality critical?}
    E -->|Yes| D
    E -->|No| F[Deflated Power Iteration]
    
    B -->|Trace only| G{Accuracy priority?}
    G -->|High| H[Hutch++]
    G -->|Standard| I[Hutchinson]
    
    B -->|Full distribution| J[Spectral Density]
    
    D --> K[EigenResult]
    F --> K
    H --> L[TraceResult]
    I --> L
    J --> M[SpectralDensityResult]
```

### Decision Criteria

| Scenario | Recommended Algorithm | Notes |
|----------|----------------------|-------|
| Top eigenvalues for single/large batch | `lanczos()` | Best accuracy, moderate cost |
| Quick dominant eigenvalue | `deflated_power_iteration()` | Lower memory, less accurate |
| Trace with limited matvecs | `hutch_plus_plus()` | Better convergence than Hutchinson |
| Trace with a simple unbiased estimator | `hutchinson()` | Simpler; needs more matvecs for the same accuracy |
| Eigenvalue histogram/distribution | `spectral_density()` | Visualize full spectrum |

## Integration with Curvature Operators

The algorithms are designed to work with any `CurvatureOperator` implementation, including:

- `HessianOperator`: Exact Hessian via autograd or finite differences
- `GGNOperator`: Generalized Gauss-Newton matrix
- `EmpiricalFisherOperator`: Empirical Fisher information matrix
- `DDPHessianOperator`: Distributed Data Parallel Hessian

This abstraction allows the same algorithm code to work across different curvature definitions without modification.

Sources: [hessian_eigenthings/algorithms/__init__.py:1-29]()

## Performance Considerations

### Memory Efficiency in Lanczos

The Lanczos implementation optimizes memory for large-scale models by:

1. Avoiding allocation of full (n, m) basis matrix
2. Using rank-1 outer-product updates for eigenvector accumulation
3. Computing Ritz vectors directly into final (k, n) layout

### Reorthogonalization Tradeoffs

| Setting | Memory | Computation | Accuracy |
|---------|--------|-------------|----------|
| `reorthogonalize=True` | O(mn) | O(m²n) | High orthogonality |
| `reorthogonalize=False` | O(mn) basis list | O(mn) | May have ghost eigenvalues |

For `max_iter <= 50`, reorthogonalization is enabled by default. For larger Krylov dimensions, it defaults off to maintain acceptable performance.

Sources: [hessian_eigenthings/algorithms/lanczos.py:23-28]()

## Example Usage

```python
from hessian_eigenthings import HessianOperator, lanczos, trace, spectral_density

# Create Hessian operator
operator = HessianOperator(model, dataloader, loss_fn)

# Compute top-5 eigenvalues and eigenvectors
eig_result = lanczos(operator, k=5, max_iter=40, tol=1e-7, seed=0)
print(f"Top eigenvalue: {eig_result.eigenvalues[0]}")

# Estimate trace with Hutch++
trace_result = trace(operator, num_matvecs=99, method="hutch++", seed=0)
print(f"Trace estimate: {trace_result.estimate}")

# Compute spectral density
density_result = spectral_density(operator, num_runs=8, lanczos_steps=40, seed=0)
# Visualize with: plt.plot(density_result.grid, density_result.density)
```

Sources: [examples/supervised_mlp.py:1-50]()

---

<a id='loss-functions'></a>

## Loss Functions

### Related Pages

Related topics: [Curvature Operators](#curvature-operators), [Parameter Utilities](#parameter-utilities)

<details>
<summary>Relevant Source Files</summary>

The following source files were used to generate this page:

- [hessian_eigenthings/loss_fns/__init__.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/loss_fns/__init__.py)
- [hessian_eigenthings/loss_fns/standard.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/loss_fns/standard.py)
- [hessian_eigenthings/loss_fns/huggingface.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/loss_fns/huggingface.py)
- [hessian_eigenthings/loss_fns/transformer_lens.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/loss_fns/transformer_lens.py)
- [hessian_eigenthings/loss_fns/_fused_ce_hvp.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/loss_fns/_fused_ce_hvp.py)
- [hessian_eigenthings/operators/ggn.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/ggn.py)
- [hessian_eigenthings/operators/hessian.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/hessian.py)
</details>

# Loss Functions

Loss functions in this repository serve as the bridge between model outputs and curvature operators (Hessian and Generalized Gauss-Newton matrices). They provide the necessary computations for Hessian-vector products and support multiple backend implementations optimized for different use cases.

## Overview

The loss functions module (`hessian_eigenthings/loss_fns/`) provides two distinct function signatures depending on the target operator:

| Function Type | Signature | Used By |
|---------------|-----------|---------|
| `loss_fn` | `(model: nn.Module, batch: Any) -> torch.Tensor` | `HessianOperator` |
| `loss_of_output_fn` | `(output: torch.Tensor, batch: Any) -> torch.Tensor` | `GGNOperator` |

Sources: [hessian_eigenthings/operators/hessian.py:1-50](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/hessian.py)
Sources: [hessian_eigenthings/operators/ggn.py:1-60](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/ggn.py)
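The two signatures in concrete form (a minimal illustration; `batch` is assumed to be an `(inputs, targets)` tuple):

```python
import torch.nn.functional as F

# loss_fn for HessianOperator: sees the model and runs the forward pass itself
def loss_fn(model, batch):
    inputs, targets = batch
    return F.cross_entropy(model(inputs), targets)

# loss_of_output_fn for GGNOperator: sees only the precomputed output
def loss_of_output_fn(output, batch):
    _, targets = batch
    return F.cross_entropy(output, targets)
```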

## Architecture

```mermaid
graph TD
    A[Loss Function Entry Points] --> B[Standard Losses]
    A --> C[HuggingFace Losses]
    A --> D[TransformerLens Losses]
    
    B --> B1[MSE Loss]
    B --> B2[Cross-Entropy Loss with HVP]
    
    C --> C1[Autoregressive LM Loss]
    C --> C2[Shifted CE with Analytical HVP]
    C --> C3[Fused CE HVP Backends]
    
    D --> D1[TransformerLens HookedModel Loss]
    
    C3 --> C3a[Triton Kernel]
    C3 --> C3b[torch.compile]
    C3 --> C3c[Eager Reference]
```

## Standard Loss Functions

The `standard.py` module provides loss functions for common supervised learning scenarios with closed-form Hessian-vector products.

Sources: [hessian_eigenthings/loss_fns/standard.py:1-80](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/loss_fns/standard.py)

### MSE Loss

Returns a wrapper compatible with `GGNOperator` for mean-squared error loss:

```python
def mse_loss_of_output() -> Callable[[torch.Tensor, tuple[torch.Tensor, torch.Tensor]], torch.Tensor]:
    """Make a `loss_of_output_fn` for `GGNOperator` from a (output, target) criterion."""
```

### Cross-Entropy Loss with Analytical HVP

The cross-entropy implementation includes a closed-form Hessian-vector product for efficient computation:

```python
def _ce_hvp(
    output: torch.Tensor, batch: tuple[torch.Tensor, torch.Tensor], u: torch.Tensor
) -> torch.Tensor:
    """Closed-form H @ u for mean-reduced softmax + cross-entropy.
    
    `output` has shape `(N, C)` (logits). For each row,
    `H_row = (diag(p) - p p^T) / N` where `p = softmax(output)`.
    """
```

**Mathematical Foundation:**

For mean-reduced softmax + cross-entropy, the Hessian takes the form:

```
H_row = (diag(p) - p·p^T) / N
```

where:
- `p = softmax(output)` is the predicted probability distribution
- `N` is the number of samples
- `u` is the vector the Hessian is applied to

Sources: [hessian_eigenthings/loss_fns/standard.py:40-55](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/loss_fns/standard.py)
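The closed form can be checked against autograd on a tiny instance (a self-contained sanity check, not library code):

```python
import torch
import torch.nn.functional as F

N, C = 4, 3
logits = torch.randn(N, C, requires_grad=True)
targets = torch.randint(0, C, (N,))
u = torch.randn(N, C)

# Autograd reference: H @ u via double-backward through mean-reduced CE
loss = F.cross_entropy(logits, targets)
(g,) = torch.autograd.grad(loss, logits, create_graph=True)
(hu_autograd,) = torch.autograd.grad(g, logits, grad_outputs=u)

# Closed form: per row, (diag(p) - p pᵀ) u / N = (p ⊙ u − p ⟨p, u⟩) / N
p = logits.detach().softmax(dim=1)
hu_closed = (p * u - p * (p * u).sum(dim=1, keepdim=True)) / N
print(torch.allclose(hu_autograd, hu_closed, atol=1e-5))  # True
```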

## HuggingFace Transformers Integration

The `huggingface.py` module provides loss functions specifically designed for HuggingFace Transformers models. These handle the internal loss computation that occurs when `labels` are present in the batch.

Sources: [hessian_eigenthings/loss_fns/huggingface.py:60-90](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/loss_fns/huggingface.py)

### Autoregressive Language Model Loss

```python
def hf_lm_loss() -> Callable[[nn.Module, dict[str, Any]], torch.Tensor]:
    """For autoregressive LMs: `loss_fn(model, batch)` calls `model(**batch).loss`."""
```

The batch must include `labels` so HuggingFace computes the loss internally. For causal language models, this is typically `labels=input_ids` with the standard internal shift.

### Shifted Cross-Entropy with Analytical HVP

For large-scale language model analysis, a shifted cross-entropy variant provides both the loss function and its analytical Hessian-vector product:

```python
def hf_lm_shifted_ce(fused: FusedCEHvpBackend = "auto") -> _LossOfOutputWithHvp:
    """Shifted CE loss with analytical H @ u for autoregressive LMs."""
```

**Shift Mechanism:**
- The loss shifts logits left (discards last position) and labels right (discards first position)
- Matches how `cross_entropy(ignore_index=-100)` handles gradient computation

Sources: [hessian_eigenthings/loss_fns/huggingface.py:1-50](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/loss_fns/huggingface.py)

## Fused Cross-Entropy HVP Backends

The `_fused_ce_hvp.py` module implements optimized backends for computing the cross-entropy Hessian-vector product.

Sources: [hessian_eigenthings/loss_fns/_fused_ce_hvp.py:1-50](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/loss_fns/_fused_ce_hvp.py)

### Backend Selection

| Backend | Description | Performance | Availability |
|---------|-------------|-------------|--------------|
| `"auto"` | Auto-select fastest available | Optimal | Default |
| `"triton"` | Hand-written CUDA Triton kernel | ~3.4x speedup, 2x memory reduction | CUDA + Triton |
| `"compile"` | `torch.compile`-fused | ~2.6x speedup, 2x memory reduction | torch >= 2.0 |
| `"eager"` | Plain PyTorch reference | Baseline | Always |

```python
FusedCEHvpBackend = Literal["auto", "eager", "compile", "triton"]
```

Sources: [hessian_eigenthings/loss_fns/huggingface.py:25-35](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/loss_fns/huggingface.py)

### Backend Resolution Logic

```mermaid
graph LR
    A[Backend: "auto"] --> B{Device: CUDA?}
    B -->|Yes + Triton available| C[Triton Kernel]
    B -->|No| D{torch.compile available?}
    D -->|Yes| E[torch.compile Backend]
    D -->|No| F[Eager Backend]
```

The resolution checks:
1. If `backend != "auto"`, use the specified backend
2. If `"auto"` and CUDA + Triton available → Triton
3. If `"auto"` and torch.compile available → compile
4. Otherwise → eager

**Important:** The Triton kernel asserts `logits.is_cuda`, so CPU tensors (even on a CUDA-equipped host) fall back to `compile`.

Sources: [hessian_eigenthings/loss_fns/huggingface.py:40-60](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/loss_fns/huggingface.py)
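A sketch of that resolution order (illustrative; the helper name and exact availability checks are assumptions):

```python
import torch

def resolve_fused_backend(backend: str, logits: torch.Tensor) -> str:
    if backend != "auto":
        return backend
    try:
        import triton  # noqa: F401
        have_triton = True
    except ImportError:
        have_triton = False
    if logits.is_cuda and have_triton:
        return "triton"          # fastest path on CUDA
    if hasattr(torch, "compile"):
        return "compile"         # torch >= 2.0 fallback
    return "eager"               # always-available reference
```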

### Memory Optimization

At LM scale (e.g., B=64, T=256, V=50304, fp32):

| Implementation | Memory Footprint | Intermediate Tensors |
|----------------|------------------|----------------------|
| Eager | ~19.6 GB | ~6 (N, V) tensors |
| Compile | ~3.3 GB | ~1 (N, V) tensor |
| Target | ~3.3 GB | Output buffer only |

The fused implementations eliminate intermediates by computing:

```
out_flat = (p * u - p * <p, u>) * mask / n_valid
```

with shape `(N, V)` in a single kernel pass.

Sources: [scripts/bench_fused_ce_hvp.py:1-60](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/scripts/bench_fused_ce_hvp.py)

## Loss Function Wrapper

The `_LossOfOutputWithHvp` class wraps a loss function with its analytical Hessian-vector product:

```python
class _LossOfOutputWithHvp:
    """Loss-of-output callable that also carries an analytical `.hvp` method.
    
    Wraps a plain `(output, batch) -> loss` function and a `(output, batch, u)
    -> H_loss @ u` function in a single callable. `GGNOperator` checks for the
    presence of `.hvp` and uses it as the loss-Hessian-vector product, skipping
    the autograd `create_graph=True` double-backward path entirely.
    """
```

Sources: [hessian_eigenthings/loss_fns/huggingface.py:100-120](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/loss_fns/huggingface.py)

## GGN Operator Integration

The `GGNOperator` automatically detects and uses analytical HVPs when available:

From the `ggn.py` module docstring: when `loss_of_output_fn` carries an analytical `.hvp`, `GGNOperator` picks this up automatically and skips the autograd double-backward.

Two implementations of the matvec are available via `loss_hvp=`:

- `"analytical"` (default): finite-difference JVP + analytical loss-Hessian-vector product (read from `loss_of_output_fn.hvp`, which must be present) + a single normal backward to apply $J^T$. Memory footprint matches one normal training step. Required for LM-scale use.
- `"autograd"`: the original `torch.func.jvp` + autograd double-backward + `torch.func.vjp` path. Numerically exact and supports any loss, but memory scales badly with output size.

Sources: [hessian_eigenthings/operators/ggn.py:10-30](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/ggn.py)

## TransformerLens Integration

For TransformerLens `HookedModel` architectures, a dedicated loss function handles the hook-based forward pass:

Sources: [hessian_eigenthings/loss_fns/transformer_lens.py:1-40](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/loss_fns/transformer_lens.py)

```python
def tlens_loss() -> Callable[[nn.Module, Any], torch.Tensor]:
    """Loss function for TransformerLens HookedModel."""
```

## Workflow: Choosing a Loss Function

```mermaid
graph TD
    A[Start] --> B{Model Type?}
    B -->|Standard MLP/CNN| C{Operator?}
    B -->|HuggingFace Transformers| H{Scale?}
    B -->|TransformerLens| E[Use HessianOperator]
    
    C -->|HessianOperator| F[Plain loss_fn<br/>e.g. cross-entropy on model output]
    C -->|GGNOperator| G[standard.mse_loss_of_output or<br/>standard.cross_entropy_loss_of_output]
    
    H -->|Small model| I[hf_lm_loss with HessianOperator]
    H -->|Large model| J[hf_lm_shifted_ce with GGNOperator]
    
    E --> K[tlens_loss with HessianOperator]
    
    J --> L[Choose HVP Backend]
    L --> M{Device?}
    M -->|CUDA + Triton| N[Use Triton backend]
    M -->|CPU/MPS| O[Use compile or eager]
```

## Complete API Reference

### Standard Module

| Function | Returns | HVP Available |
|----------|---------|---------------|
| `mse_loss_of_output()` | `loss_of_output_fn` | No |
| `cross_entropy_loss_of_output()` | `loss_of_output_fn` | Yes |
| `_ce_hvp()` | Analytical HVP | - |

### HuggingFace Module

| Function | Returns | HVP Available |
|----------|---------|---------------|
| `hf_lm_loss()` | `loss_fn` | No (pair with `HessianOperator`) |
| `hf_lm_shifted_ce(fused)` | `_LossOfOutputWithHvp` | Yes |
| `_LossOfOutputWithHvp` | Wrapper class | Via `.hvp` attribute |

### Fused CE HVP Module

| Function | Description |
|----------|-------------|
| `_ce_hvp_reference()` | Eager reference implementation |
| `_get_compiled_impl()` | Returns `torch.compile` wrapped version |
| `compiled_ce_hvp()` | Compiled backend entry point |
| `triton_ce_hvp()` | Triton kernel entry point |

## Usage Examples

### Standard Classification

```python
import torch
from hessian_eigenthings.operators import HessianOperator

operator = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=lambda m, b: torch.nn.functional.cross_entropy(m(b[0]), b[1]),
)
```

### HuggingFace Large Model (Memory-Optimized)

```python
from hessian_eigenthings.operators import GGNOperator
from hessian_eigenthings.loss_fns.huggingface import hf_lm_shifted_ce

loss_fn = hf_lm_shifted_ce(fused="auto")  # Auto-selects best backend

operator = GGNOperator(
    model=model,
    dataloader=dataloader,
    forward_fn=lambda m, b: m(**b).logits,
    loss_of_output_fn=loss_fn,
    loss_hvp="analytical",  # Default, uses .hvp attribute
)
```

Sources: [hessian_eigenthings/loss_fns/__init__.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/loss_fns/__init__.py)

---

<a id='parameter-utilities'></a>

## Parameter Utilities

### Related Pages

Related topics: [Curvature Operators](#curvature-operators)

<details>
<summary>Relevant Source Files</summary>

The following source files were used to generate this page:

- [hessian_eigenthings/param_utils.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/param_utils.py)
- [hessian_eigenthings/operators/hessian.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/hessian.py)
- [hessian_eigenthings/operators/ggn.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/ggn.py)
- [examples/transformer_lens_attention_only.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/examples/transformer_lens_attention_only.py)
- [examples/huggingface_tiny_gpt2.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/examples/huggingface_tiny_gpt2.py)
</details>

# Parameter Utilities

The parameter utilities module (`param_utils.py`) provides essential infrastructure for managing, filtering, and manipulating PyTorch model parameters within the Hessian eigendecomposition pipeline. These utilities form the foundation that connects curvature operators to the underlying model parameters, enabling efficient computation of Hessian-vector products and eigendecomposition across arbitrary subsets of model parameters.

## Overview

When working with large neural networks, it is often necessary to compute curvature information for only a subset of parameters. The parameter utilities support this use case through a flexible filtering mechanism combined with utilities for parameter vectorization, reshaping, and batch management.

The core responsibilities of the parameter utilities include:

1. **Parameter Extraction** - Gathering named parameters from PyTorch modules
2. **Parameter Filtering** - Selecting subsets of parameters based on name patterns or custom predicates
3. **Vectorization** - Flattening parameters into vectors and reshaping vectors back to parameter shapes
4. **Size Tracking** - Maintaining offset mappings for efficient vector-to-parameter conversions

Sources: [hessian_eigenthings/operators/hessian.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/hessian.py)

## Core Types and Interfaces

### ParamFilter Type

The `ParamFilter` type alias defines the contract for parameter selection functions:

```python
ParamFilter = Callable[[str, nn.Parameter], bool]
```

A `ParamFilter` is a callable that takes two arguments:
- `name: str` - The fully-qualified parameter name within the model
- `param: nn.Parameter` - The parameter tensor itself

The function returns `True` if the parameter should be included in the operation, `False` otherwise.

Sources: [hessian_eigenthings/operators/hessian.py:1-50](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/hessian.py)

### Parameter Collection Utilities

The module provides functions for extracting and organizing model parameters:

| Function | Purpose |
|----------|---------|
| `get_param_names(model)` | Returns list of parameter names as fully-qualified strings |
| `get_param_list(model)` | Returns list of parameter tensors |
| `get_param_sizes(model)` | Returns list of parameter tensor sizes |
| `get_filtered_params(model, param_filter)` | Returns filtered parameter names and tensors |

These functions work together to build the data structures required by curvature operators.

Sources: [hessian_eigenthings/operators/ggn.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/ggn.py)

## Parameter Vectorization

### Flattening Parameters to Vectors

The utilities support bidirectional conversion between parameter dictionaries and flat vectors. This is essential for Lanczos-based eigendecomposition algorithms that operate on vector spaces.

```python
def flatten_params(param_dict: dict[str, Tensor]) -> Tensor:
    """Flatten all parameters into a single 1D tensor."""
```

The flattening process concatenates all parameter tensors in a deterministic order, preserving the mapping between parameter names and vector offsets.

### Reshaping Vectors Back to Parameters

```python
def unflatten_params(
    vec: Tensor,
    param_names: list[str],
    param_list: list[Tensor],
    sizes: list[int]  # per-parameter element counts (p.numel()), used as slice widths
) -> dict[str, Tensor]:
    """Reshape a flat vector back to parameter dictionary."""
```

The unflattening operation uses offset tracking to slice the vector and reshape each slice to match the original parameter shape:

```python
for name, param, size in zip(param_names, param_list, sizes, strict=True):
    out[name] = vec[offset : offset + size].reshape_as(param)
    offset += size
```

Sources: [hessian_eigenthings/operators/ggn.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/ggn.py)
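A round trip on a toy model makes the offset bookkeeping concrete (a self-contained sketch that mirrors, rather than calls, the helpers above):

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 2)
params = dict(model.named_parameters())  # deterministic insertion order

# Flatten: concatenate every parameter as a 1-D slice of one vector
flat = torch.cat([p.reshape(-1) for p in params.values()])

# Unflatten: slice by running offset, reshape each slice back
out, offset = {}, 0
for name, p in params.items():
    out[name] = flat[offset : offset + p.numel()].reshape_as(p)
    offset += p.numel()

assert all(torch.equal(out[k], params[k]) for k in params)
```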

## Parameter Filtering Patterns

### Name-Based Filtering with `match_names`

The most common filtering pattern uses glob-style matching against parameter names. The `match_names` function creates a `ParamFilter` from a list of name patterns:

```python
def match_names(*patterns: str) -> ParamFilter:
    """Create a filter matching parameter names against glob patterns."""
```

**Example usage:**

```python
from hessian_eigenthings.operators import HessianOperator
from hessian_eigenthings.param_utils import match_names

# Filter attention parameters only
attn_filter = match_names("blocks.*.attn.*")
attn_op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=attn_filter
)

# Filter MLP parameters only
mlp_filter = match_names("blocks.*.mlp.*")
mlp_op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=mlp_filter
)
```

Sources: [examples/transformer_lens_attention_only.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/examples/transformer_lens_attention_only.py)
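For intuition, a filter of this shape can be built on the standard library's `fnmatch` (a sketch under that assumption; the library's actual glob semantics may differ, e.g. in whether `*` crosses dots):

```python
import torch
import torch.nn as nn
from fnmatch import fnmatchcase

def match_names_sketch(*patterns: str):
    """ParamFilter that accepts a name matching any of the glob patterns."""
    def _filter(name: str, param: nn.Parameter) -> bool:
        return any(fnmatchcase(name, pat) for pat in patterns)
    return _filter

f = match_names_sketch("blocks.*.attn.*")
print(f("blocks.0.attn.W_Q", nn.Parameter(torch.empty(1))))  # True
```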

### Multiple Pattern Matching

The `match_names` function supports multiple patterns, useful for targeting disjoint parameter groups:

```python
# Match multiple parameter groups
filter_fn = match_names(
    "transformer.h.*.attn.*",
    "transformer.h.*.mlp.*"
)
```

### HuggingFace-Specific Patterns

When working with HuggingFace transformers, parameter names follow a predictable structure:

```python
# Attention parameters in GPT-2
attn_op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=hf_lm_loss(),
    param_filter=match_names("transformer.h.*.attn.*"),
)
```

Sources: [examples/huggingface_tiny_gpt2.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/examples/huggingface_tiny_gpt2.py)

## Integration with Curvature Operators

### Operator Size and Parameter Tracking

Curvature operators maintain internal state about the parameters they operate on:

| Attribute | Type | Description |
|-----------|------|-------------|
| `_param_names` | `list[str]` | Names of parameters in the filtered set |
| `_param_list` | `list[Tensor]` | Parameter tensors |
| `_sizes` | `list[torch.Size]` | Original tensor shapes for reshaping |
| `size` | `int` | Total number of parameters (sum of all parameter elements) |

The `size` property is computed as:

```python
self.size = sum(p.numel() for p in self._param_list)
```

This total parameter count determines the dimensionality of the vector space in which eigendecomposition occurs.

Sources: [hessian_eigenthings/operators/hessian.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/hessian.py)

### Data Flow Diagram

```mermaid
graph TD
    A[PyTorch Model] --> B[get_param_names]
    A --> C[get_param_list]
    A --> D[get_param_sizes]
    B --> E[ParamFilter Application]
    C --> E
    D --> E
    E --> F[Filtered Parameter Collections]
    F --> G[Curvature Operator]
    G --> H[matvec Operations]
    H --> I[Eigendecomposition Results]
    
    J[Input Vector] --> K[unflatten_params]
    F --> K
    K --> L[Parameter Dict]
    L --> H
```

## Practical Examples

### Computing Attention-Only Hessian Eigendecomposition

```python
from hessian_eigenthings.algorithms import lanczos
from hessian_eigenthings.operators import HessianOperator
from hessian_eigenthings.param_utils import match_names

# Create attention-only operator
attn_op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=match_names("blocks.*.attn.*")
)

print(f"Attention-only Hessian size: {attn_op.size} parameters")

# Compute top-3 eigenvalues
eig_attn = lanczos(attn_op, k=3, max_iter=20, tol=1e-3, seed=0)
for i, val in enumerate(eig_attn.eigenvalues):
    print(f"  λ_{i + 1} = {val.item(): .4e}")
```

### Comparing Block-Specific Curvature

```python
# Full model Hessian
full_op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn
)

# Attention block only
attn_op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=match_names("blocks.*.attn.*")
)

# MLP block only
mlp_op = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=match_names("blocks.*.mlp.*")
)

# Compare eigenvalue spectra
full_eig = lanczos(full_op, k=10)
attn_eig = lanczos(attn_op, k=10)
mlp_eig = lanczos(mlp_op, k=10)
```

Sources: [examples/transformer_lens_attention_only.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/examples/transformer_lens_attention_only.py)

## Advanced Filtering

### Custom Filter Functions

For complex filtering logic beyond glob matching, implement a custom `ParamFilter`:

```python
def custom_filter(name: str, param: nn.Parameter) -> bool:
    # Include only parameters with > 1000 elements
    if param.numel() < 1000:
        return False
    # Exclude certain modules
    if "embedding" in name:
        return False
    # Include based on naming patterns
    return "layer" in name or "head" in name

operator = HessianOperator(
    model=model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=custom_filter
)
```

### Filter Composition

Filters can be combined using standard Python patterns:

```python
# Intersection of patterns
combined_filter = lambda name, param: (
    match_names("blocks.*.*.*")(name, param) and
    param.dtype == torch.float32
)

# Negation: exclude layer-norm parameters
exclude_ln = lambda name, param: not (
    match_names("*layernorm*", "*ln*")(name, param)
)
```

## Performance Considerations

### Parameter Access Patterns

The parameter utilities maintain strict ordering between name lists and tensor lists to enable efficient offset-based indexing. When iterating over parameters in performance-critical paths:

1. Use the pre-computed `_sizes` list to avoid repeated `param.shape` calls
2. Leverage the `strict=True` zip when all lists are guaranteed to be aligned
3. Prefer in-place reshaping over copies when possible

### Memory Implications

| Operation | Memory Pattern |
|-----------|----------------|
| `flatten_params` | Allocates new tensor of size `sum(numel)` |
| `unflatten_params` | Creates dict, views from original vector |
| `matvec` | No parameter data copies; uses VJP/JVP chains |

The vectorization maintains a view relationship with original parameters where possible, minimizing memory overhead during iterative algorithms.

## API Reference

### `match_names`

```python
def match_names(*patterns: str) -> ParamFilter:
    """Create a ParamFilter matching parameter names against glob patterns."""
```

**Parameters:**

| Parameter | Type | Description |
|-----------|------|-------------|
| `patterns` | `str` | Glob patterns to match against parameter names |

**Returns:** A callable `ParamFilter` that returns `True` for parameters matching any of the provided patterns.

**Supported Glob Patterns:**
- `*` - Matches any sequence of characters within a path component
- `**` - Matches any sequence of path components (if supported)
- `?` - Matches a single character
- `[abc]` - Matches any character in the set

### Parameter Extraction Functions

```python
def get_param_names(model: nn.Module) -> list[str]:
    """Extract fully-qualified parameter names from a model."""

def get_param_list(model: nn.Module) -> list[Tensor]:
    """Extract parameter tensors from a model."""

def get_filtered_params(
    model: nn.Module,
    param_filter: ParamFilter | None
) -> tuple[list[str], list[Tensor]]:
    """Extract filtered parameter names and tensors."""
```

Sources: [hessian_eigenthings/param_utils.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/param_utils.py)

---

<a id='distributed-computing'></a>

## Distributed Computing with DDP

### Related Pages

Related topics: [Curvature Operators](#curvature-operators)

<details>
<summary>Relevant Source Files</summary>

The following source files were used to generate this page:

- [hessian_eigenthings/operators/distributed/ddp.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/distributed/ddp.py)
- [hessian_eigenthings/operators/distributed/__init__.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/distributed/__init__.py)
- [hessian_eigenthings/operators/hessian.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/hessian.py)
- [hessian_eigenthings/__init__.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/__init__.py)
- [hessian_eigenthings/operators/__init__.py](https://github.com/noahgolmant/pytorch-hessian-eigenthings/blob/main/hessian_eigenthings/operators/__init__.py)
</details>

# Distributed Computing with DDP

The `hessian-eigenthings` library provides native support for distributed training scenarios through `DDPHessianOperator`, a specialized curvature operator that extends the base `HessianOperator` to work correctly with PyTorch's `DistributedDataParallel` (DDP) wrapper.

## Overview

In distributed training environments, the Hessian eigenvalue computations must account for how DDP synchronizes gradients across multiple processes. The `DDPHessianOperator` handles this synchronization transparently, ensuring that the Hessian-vector products (HVPs) computed across different ranks are properly averaged.

**Key characteristics:**

- Subclass of `HessianOperator` with distributed awareness
- Automatically averages HVPs across all data-parallel ranks
- Compatible with standard DDP-wrapped models
- Supports the same API as the base `HessianOperator`
- Handles the autograd graph complexity introduced by DDP's all-reduce operations

## Architecture

### Class Hierarchy

```
CurvatureOperator (base interface)
    └── HessianOperator (base implementation)
            └── DDPHessianOperator (DDP-aware extension)
```

### Data Flow Diagram

```mermaid
graph TD
    A[Model wrapped with DDP] --> B[DDPHessianOperator]
    B --> C[Per-rank HVP Computation]
    C --> D[Autograd-aware All-Reduce]
    D --> E[Synchronized HVP across Ranks]
    
    F[torch.autograd.grad calls] --> G[DDP backward hooks bypassed]
    G --> H[No implicit all-reduce:<br/>HVP diverges across ranks]
    
    style D fill:#90EE90
    style E fill:#90EE90
    style H fill:#FFB6C1
```

### DDP Behavior Explanation

The core challenge addressed by `DDPHessianOperator` stems from how PyTorch's `DistributedDataParallel` handles gradient synchronization:

1. **DDP's all-reduce mechanism**: DDP normally fires its all-reduce operation inside the autograd graph during `loss.backward()`, synchronizing gradients across all ranks
2. **Standard HessianOperator limitation**: When using `torch.autograd.grad` directly (as the base `HessianOperator` does), the DDP hooks fire on `.grad` accumulation rather than on `autograd.grad`'s return value
3. **Resulting discrepancy**: Without explicit handling, the computed HVP does not match the single-process HVP computed on the union of all per-rank batches

The `DDPHessianOperator` resolves this by adding an explicit autograd-aware all-reduce after each gradient computation call, ensuring the resulting HVP equals the single-process HVP.

## API Reference

### DDPHessianOperator

```python
class DDPHessianOperator(HessianOperator):
    """HessianOperator that all-reduces the HVP across torch.distributed ranks."""
```

#### Constructor Parameters

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `model` | `nn.Module` | Yes | - | Model (may be DDP-wrapped; params are read directly) |
| `dataloader` | `Iterable[Any]` | Yes | - | Data loader providing batches to average over |
| `loss_fn` | `LossFn` | Yes | - | Loss function `loss_fn(model, batch) -> loss` |
| `param_filter` | `ParamFilter \| None` | No | `None` | Optional filter for subset of parameters |
| `full_dataset` | `bool` | No | `True` | Whether to compute Hessian over full dataset |
| `num_batches` | `int \| None` | No | `None` | Number of batches to sample if not full dataset |
| `microbatch_size` | `int \| None` | No | `None` | Chunk batch into micro-batches for memory |
| `microbatch_unsafe` | `bool` | No | `False` | Skip gradient accumulation safety checks |
| `method` | `HvpMethod` | No | `"autograd"` | HVP computation method |
| `fd_eps` | `float \| None` | No | `None` | Finite difference epsilon |
| `backend` | `LinAlgBackend[torch.Tensor] \| None` | No | `None` | Linear algebra backend |

All of these parameters are inherited from the `HessianOperator` base class.
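
For illustration, a hedged sketch combining several of the optional parameters from the table; the values are arbitrary, and `ddp_model`, `dataloader`, and `loss_fn` are assumed from the surrounding examples:

```python
from hessian_eigenthings.operators import DDPHessianOperator

# Sample 16 batches rather than sweeping the full dataset, and chunk
# each batch into micro-batches of 8 samples to bound peak memory.
hessian_op = DDPHessianOperator(
    model=ddp_model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    full_dataset=False,
    num_batches=16,
    microbatch_size=8,
)
```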

#### Inherited Methods

| Method | Description |
|--------|-------------|
| `matvec(v)` | Compute H·v where H is the Hessian averaged over batches |
| `size` | Total number of parameters in the filtered parameter set |
| `dtype` | Data type of parameters |
| `device` | Device of parameters |

### Import Location

```python
from hessian_eigenthings.operators import DDPHessianOperator
```

Or via the distributed submodule:

```python
from hessian_eigenthings.operators.distributed import DDPHessianOperator
```

Source: [hessian_eigenthings/operators/distributed/__init__.py:1-3]()

## Usage Patterns

### Basic Usage with DDP-wrapped Model

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from hessian_eigenthings.operators import DDPHessianOperator
from hessian_eigenthings.algorithms import lanczos

# Assume model, dataloader, and loss_fn are already set up
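# torch.distributed must already be initialized on every rank,
# e.g. dist.init_process_group("nccl") when launched via torchrun.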
ddp_model = DDP(model)

# Create the distributed Hessian operator
hessian_op = DDPHessianOperator(
    model=ddp_model,
    dataloader=dataloader,
    loss_fn=loss_fn,
)

# Compute eigenvalues using Lanczos algorithm
eigenvalues, eigenvectors = lanczos(hessian_op, k=10, max_iter=50)
```

### With Parameter Filtering

```python
from hessian_eigenthings.param_utils import match_names

# Focus on specific layer parameters
hessian_op = DDPHessianOperator(
    model=ddp_model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    param_filter=match_names("layer.4.*"),
)
```

### Using with Different HVP Methods

```python
# Using finite difference method (more memory-efficient)
hessian_op_fd = DDPHessianOperator(
    model=ddp_model,
    dataloader=dataloader,
    loss_fn=loss_fn,
    method="finite_difference",
    fd_eps=1e-5,
)
```

## Key Design Decisions

### Autograd-aware All-Reduce

The `DDPHessianOperator` adds an explicit all-reduce operation that integrates with PyTorch's autograd engine (a sketch of the underlying primitive follows the list below). This ensures:

- The all-reduce operation is included in the autograd graph when needed
- Gradient flows correctly through the distributed computation
- The final HVP is properly synchronized across all ranks
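
For intuition, PyTorch itself ships differentiable collectives in `torch.distributed.nn`, where the reduction is recorded in the autograd graph. A hedged sketch of an autograd-aware gradient average built on that standard primitive (not necessarily how this library implements it):

```python
import torch
import torch.distributed as dist
import torch.distributed.nn  # differentiable collectives

def synced_grads(loss, params):
    # Per-rank gradients, kept in the graph so a second
    # differentiation (for the HVP) remains possible.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    world = dist.get_world_size()
    # torch.distributed.nn.all_reduce participates in autograd: its
    # backward pass also all-reduces incoming gradients, keeping
    # double-backward consistent across ranks.
    return [torch.distributed.nn.all_reduce(g) / world for g in grads]
```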

### Parameter Access

The operator reads parameters directly from the model, whether or not it is wrapped with DDP:

```python
# From the source:
# "The model passed in may already be wrapped with
#  torch.nn.parallel.DistributedDataParallel; we read params from it directly."
```

This design allows seamless usage with existing DDP-wrapped models without modification.

### Batch Distribution

Each rank should receive its own shard of the dataset:

> "Each rank should be receiving its own shard of the dataset (typical pattern: a `torch.utils.data.distributed.DistributedSampler`)."

Source: [hessian_eigenthings/operators/distributed/ddp.py:21-25]()
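
For illustration, the quoted sharding pattern looks like this in standard PyTorch (the `dataset` object is assumed to exist):

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Each rank iterates a disjoint shard; drop_last keeps per-rank batch
# counts equal so the collective calls inside the operator stay in lockstep.
sampler = DistributedSampler(dataset, shuffle=False, drop_last=True)
dataloader = DataLoader(dataset, batch_size=32, sampler=sampler)
```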

## Comparison with Single-Process HessianOperator

| Aspect | `HessianOperator` | `DDPHessianOperator` |
|--------|-------------------|----------------------|
| Use case | Single GPU / CPU | Multi-GPU distributed |
| Gradient sync | Manual handling required | Automatic via all-reduce |
| DDP compatibility | May produce incorrect HVPs | Correct by design |
| API | Identical | Identical |
| Performance overhead | None | Single all-reduce per HVP |

## Relationship to Other Operators

The `hessian_eigenthings` package provides multiple curvature operators:

| Operator | Description | Distributed Support |
|----------|-------------|---------------------|
| `HessianOperator` | Full Hessian computation | Not DDP-aware |
| `DDPHessianOperator` | Full Hessian with DDP sync | DDP-aware |
| `GGNOperator` | Generalized Gauss-Newton | Not DDP-aware (as of v1.0) |
| `EmpiricalFisherOperator` | Empirical Fisher matrix | Not DDP-aware |

Source: [hessian_eigenthings/__init__.py:15-24]()

## Limitations and Considerations

1. **Current scope**: Only `HessianOperator` has a DDP-aware counterpart; other operators like `GGNOperator` and `EmpiricalFisherOperator` do not yet have distributed variants
2. **Gradient hooks**: The operator does not currently support all DDP gradient hook mechanisms
3. **Multi-node training**: While the operator uses standard `torch.distributed` primitives, performance at very large scale (>8 nodes) has not been extensively benchmarked
4. **Mixed precision**: When using fp16/bf16 training, ensure consistent dtype across all ranks

## Error Handling

The operator relies on standard PyTorch distributed error handling; a defensive preamble is sketched after the list below:

- If `torch.distributed` is not initialized, standard errors will be raised
- Mismatched tensor shapes across ranks will result in collective operation errors
- Device mismatch (e.g., some ranks on CUDA, some on CPU) is not supported
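
Reflecting these conditions, a hedged defensive preamble using only standard `torch.distributed` calls might look like:

```python
import torch
import torch.distributed as dist

# DDPHessianOperator presumes an initialized process group.
if not (dist.is_available() and dist.is_initialized()):
    raise RuntimeError("initialize torch.distributed before building the operator")

# Mixed CPU/CUDA ranks are unsupported: pin every rank to the same
# device type before constructing the operator.
device = (
    torch.device("cuda", torch.cuda.current_device())
    if torch.cuda.is_available()
    else torch.device("cpu")
)
```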

## Testing and Validation

The DDP functionality should be tested in a true distributed environment. Basic validation includes:

1. **Consistency check**: HVP computed via `DDPHessianOperator` should equal the single-process HVP when aggregating all batch shards (sketched below)
2. **Numerical accuracy**: Eigenvalues computed with DDP should match single-GPU results within floating-point tolerance
3. **Scaling**: Computation time should scale sub-linearly with number of GPUs for large models
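
A minimal sketch of the consistency check in item 1, assuming `matvec` returns a flat tensor (per the API table above), a direction vector `v`, and a precomputed single-process reference `reference_hvp`; all names besides `matvec` are hypothetical:

```python
import torch
import torch.distributed as dist

def hvp_matches_reference(ddp_op, v, reference_hvp, atol=1e-5):
    # Every rank should receive the same synchronized HVP, so the
    # comparison must succeed on all ranks, not just rank 0.
    hvp = ddp_op.matvec(v)
    ok = torch.allclose(hvp, reference_hvp, atol=atol)
    flag = torch.tensor([1.0 if ok else 0.0], device=hvp.device)
    dist.all_reduce(flag, op=dist.ReduceOp.MIN)  # fails if any rank disagrees
    return bool(flag.item())
```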

---

## Doramagic Pitfall Log

Project: noahgolmant/pytorch-hessian-eigenthings

Summary: 15 potential pitfalls identified, 0 of them high/blocking; highest priority: identity pitfall - repository name and install name differ.

## 1. Identity pitfall · Repository name and install name differ

- Severity: medium
- Evidence strength: runtime_trace
- Finding: the repository name `pytorch-hessian-eigenthings` does not exactly match the install entry point `hessian-eigenthings`.
- User impact: users searching for the package by repo name, or for the repo by package name, can easily land on the wrong entry point.
- Suggested check: confirm the package-name mapping on PyPI/GitHub and in the official README.
- Repro command: `pip install hessian-eigenthings`
- Mitigation: the page must show both the repo name and the actual install entry point so users do not search for the wrong package.
- Evidence: identity.distribution | hn_item:48132232 | https://news.ycombinator.com/item?id=48132232 | repo=pytorch-hessian-eigenthings; install=hessian-eigenthings

## 2. Installation pitfall · Source evidence: Python Error: the following arguments are required: experimentname

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified installation-related issue in this project: Python Error: the following arguments are required: experimentname
- User impact: may increase the cost of first trials and production adoption for new users.
- Suggested check: the source suggests a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Mitigation: do not amplify this into a definitive conclusion detached from the source link; note the applicable version and re-verification status.
- Evidence: community_evidence:github | cevd_24f46464d79f4ae3830f046c077a2574 | https://github.com/noahgolmant/pytorch-hessian-eigenthings/issues/39 | The source discussion mentions Python-related conditions; re-verify before installing/trying.

## 3. Installation pitfall · Source evidence: v1.0.0a2 — packaging fix

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified installation-related issue in this project: v1.0.0a2 — packaging fix
- User impact: may increase the cost of first trials and production adoption for new users.
- Suggested check: the source suggests a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Mitigation: do not amplify this into a definitive conclusion detached from the source link; note the applicable version and re-verification status.
- Evidence: community_evidence:github | cevd_7540a696b30c46cdba07c12f33388567 | https://github.com/noahgolmant/pytorch-hessian-eigenthings/releases/tag/v1.0.0a2 | Unverified usage conditions surfaced by source type github_release.

## 4. Installation pitfall · Source evidence: v1.0.0a3 — fix lanczos OOM

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified installation-related issue in this project: v1.0.0a3 — fix lanczos OOM
- User impact: may increase the cost of first trials and production adoption for new users.
- Suggested check: the source suggests a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Mitigation: do not amplify this into a definitive conclusion detached from the source link; note the applicable version and re-verification status.
- Evidence: community_evidence:github | cevd_e5e68e2f24e1436cb8f3c2f11cefe326 | https://github.com/noahgolmant/pytorch-hessian-eigenthings/releases/tag/v1.0.0a3 | Unverified usage conditions surfaced by source type github_release.

## 5. Installation pitfall · Source evidence: v1.0.0a4 — backend handles CPU-generator + CUDA-tensor combo

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified installation-related issue in this project: v1.0.0a4 — backend handles CPU-generator + CUDA-tensor combo
- User impact: may increase the cost of first trials and production adoption for new users.
- Suggested check: the source suggests a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Mitigation: do not amplify this into a definitive conclusion detached from the source link; note the applicable version and re-verification status.
- Evidence: community_evidence:github | cevd_914b7653aa8b4ef2844a2b4690fab2ad | https://github.com/noahgolmant/pytorch-hessian-eigenthings/releases/tag/v1.0.0a4 | Unverified usage conditions surfaced by source type github_release.

## 6. Installation pitfall · Source evidence: v1.0.0a5 — comprehensive LLM-scale memory fixes + regression tests

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified installation-related issue in this project: v1.0.0a5 — comprehensive LLM-scale memory fixes + regression tests
- User impact: may increase the cost of first trials and production adoption for new users.
- Suggested check: the source suggests a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Mitigation: do not amplify this into a definitive conclusion detached from the source link; note the applicable version and re-verification status.
- Evidence: community_evidence:github | cevd_509356ab9b68434992d7237219952ba6 | https://github.com/noahgolmant/pytorch-hessian-eigenthings/releases/tag/v1.0.0a5 | The source discussion mentions Python-related conditions; re-verify before installing/trying.

## 7. Configuration pitfall · Source evidence: RuntimeError: One of the differentiated Tensors appears to not have been used in the graph.

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified configuration-related issue in this project: RuntimeError: One of the differentiated Tensors appears to not have been used in the graph.
- User impact: may increase the cost of first trials and production adoption for new users.
- Suggested check: the source suggests a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Mitigation: do not amplify this into a definitive conclusion detached from the source link; note the applicable version and re-verification status.
- Evidence: community_evidence:github | cevd_f79a3a34cbab435cb3730b7ae17cf492 | https://github.com/noahgolmant/pytorch-hessian-eigenthings/issues/30 | The source discussion mentions Python-related conditions; re-verify before installing/trying.

## 8. Configuration pitfall · Source evidence: ValueError: PENet on the Kitti benchmark suite

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified configuration-related issue in this project: ValueError: PENet on the Kitti benchmark suite
- User impact: may increase the cost of first trials and production adoption for new users.
- Suggested check: the source suggests a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Mitigation: do not amplify this into a definitive conclusion detached from the source link; note the applicable version and re-verification status.
- Evidence: community_evidence:github | cevd_850f8cf0010c4d269ab71d864610097a | https://github.com/noahgolmant/pytorch-hessian-eigenthings/issues/41 | The source discussion mentions Python-related conditions; re-verify before installing/trying.

## 9. Capability pitfall · Capability judgment relies on an assumption

- Severity: medium
- Evidence strength: source_linked
- Finding: the analysis assumes "README/documentation is current enough for a first validation pass."
- User impact: if the assumption does not hold, users will not get the promised capability.
- Suggested check: convert the assumption into a downstream verification checklist.
- Mitigation: assumptions must become verification items; they cannot be stated as fact before verification results exist.
- Evidence: capability.assumptions | hn_item:48132232 | https://news.ycombinator.com/item?id=48132232 | README/documentation is current enough for a first validation pass.

## 10. Runtime pitfall · Source evidence: AttributeError: 'HVPOperator' object has no attribute 'zero_grad'

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified runtime-related issue in this project: AttributeError: 'HVPOperator' object has no attribute 'zero_grad'
- User impact: may increase the cost of first trials and production adoption for new users.
- Suggested check: the source suggests a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Mitigation: do not amplify this into a definitive conclusion detached from the source link; note the applicable version and re-verification status.
- Evidence: community_evidence:github | cevd_b515f5c06a5744b19b667bcbc8123348 | https://github.com/noahgolmant/pytorch-hessian-eigenthings/issues/38 | Unverified usage conditions surfaced by source type github_issue.

## 11. Maintenance pitfall · Maintenance activity unknown

- Severity: medium
- Evidence strength: source_linked
- Finding: `last_activity_observed` is not recorded.
- User impact: new, stalled, and active projects get mixed together, lowering trust in recommendations.
- Suggested check: add signals for recent GitHub commits, releases, and issue/PR responsiveness.
- Mitigation: while maintenance activity is unknown, recommendation strength must not be marked high-trust.
- Evidence: evidence.maintainer_signals | hn_item:48132232 | https://news.ycombinator.com/item?id=48132232 | last_activity_observed missing

## 12. Security/permissions pitfall · Downstream validation flagged a risk item

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: downstream has already requested re-review; the page must not downplay it.
- Suggested check: enter the security/permissions governance review queue.
- Mitigation: while downstream risk exists, the review/recommendation downgrade must be kept in place.
- Evidence: downstream_validation.risk_items | hn_item:48132232 | https://news.ycombinator.com/item?id=48132232 | no_demo; severity=medium

## 13. Security/permissions pitfall · Scoring risk present

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: the risk affects whether the project is suitable for ordinary users to install.
- Suggested check: write the risk into the boundary card and confirm whether manual review is required.
- Mitigation: scoring risks must go into the boundary card, not remain internal scores only.
- Evidence: risks.scoring_risks | hn_item:48132232 | https://news.ycombinator.com/item?id=48132232 | no_demo; severity=medium

## 14. Maintenance pitfall · Issue/PR response quality unknown

- Severity: low
- Evidence strength: source_linked
- Finding: `issue_or_pr_quality=unknown`.
- User impact: users cannot tell whether anyone will respond when they hit problems.
- Suggested check: sample recent issues/PRs to judge whether they go unattended long-term.
- Mitigation: while issue/PR responsiveness is unknown, the maintenance risk must be flagged.
- Evidence: evidence.maintainer_signals | hn_item:48132232 | https://news.ycombinator.com/item?id=48132232 | issue_or_pr_quality=unknown

## 15. Maintenance pitfall · Release cadence unclear

- Severity: low
- Evidence strength: source_linked
- Finding: `release_recency=unknown`.
- User impact: install commands and docs may lag behind the code, raising the odds that users hit pitfalls.
- Suggested check: confirm the latest release/tag matches the README install command.
- Mitigation: while release cadence is unknown or stale, install instructions must note possible drift.
- Evidence: evidence.maintainer_signals | hn_item:48132232 | https://news.ycombinator.com/item?id=48132232 | release_recency=unknown

<!-- canonical_name: noahgolmant/pytorch-hessian-eigenthings; human_manual_source: deepwiki_human_wiki -->
