llnl · kab163 · Apr 21, 2026 · Apr 22, 2026 · Apr 22, 2026 · Apr 24, 2026
diff --git a/AGENTS.md b/AGENTS.md
@@ -0,0 +1,387 @@
+# Umpire Agent Guide
+
+This file provides guidance to coding agents working in this repository.
+
+## Overview
+
+Umpire is a resource management library for discovering, provisioning, and managing memory on machines with multiple memory devices like NUMA nodes and GPUs. It provides a unified interface to allocate and free data across different memory spaces (host, device, unified, pinned, etc.) and supports various memory allocation strategies (pools, advisors, prefetchers, etc.).
+
+## Repo Skills
+
+Use the narrowest matching repo-local skill under `skills/` before making non-trivial changes.
+
+Use more than one only when a task genuinely spans multiple areas, such as a backend change that also needs new tests.
+
+## ⚠️ WARNING: Auto-Generated Code
+
+**NEVER directly edit auto-generated files.** Umpire uses code generation tools, and any manual edits to generated files will be overwritten the next time they are regenerated.
+
+### Fortran Interface (Shroud)
+
+The Fortran interface is generated using [Shroud](https://shroud.readthedocs.io/en/latest/).
+
+**Auto-generated files** (DO NOT EDIT):
+- `src/umpire/interface/c_fortran/*.f` - All Fortran files
+- `src/umpire/interface/c_fortran/wrap*.cpp` - C wrapper files
+- `src/umpire/interface/c_fortran/wrap*.h` - C wrapper headers
+- `src/umpire/interface/c_fortran/types*.h` - Type definition files
+- `src/umpire/interface/c_fortran/genc*.inc` - Generated include files
+
+**To modify Fortran interface:**
+1. Edit `src/umpire/interface/umpire_shroud.yaml`
+2. Run Shroud to regenerate files (via the build system or GitHub workflow)
+3. Never manually edit the generated `.f`, `.cpp`, or `.h` files
+
+**Identifying auto-generated files:**
+- Look for headers like `! Generated by genumpiresplicer.py` or `! wrapf*.f`
+- Check for Shroud copyright/generation comments at the top
+- Files in the `c_fortran/` directory whose names start with `wrap`, `types`, or `gen` are generated
+
+## Build System
+
+Umpire uses CMake and BLT (Build, Link, and Test) as its build system. **BLT is included as a submodule** - always ensure submodules are initialized before building.
+
+### Initial Setup
+
+```bash
+git submodule update --init --recursive
+mkdir build && cd build
+cmake ..
+make
+```
+
+### Running Tests
+
+```bash
+# From build directory
+make test
+# Or use ctest directly
+ctest
+ctest -R <test_name_pattern>  # Run specific tests
+```
+
+### Running a Single Test
+
+```bash
+# From build directory
+./bin/<test_executable>
+# Example:
+./bin/allocator_tests
+```
+
+### Common CMake Options
+
+`UMPIRE_ENABLE_*` options are owned by Umpire; some build examples also use BLT's `ENABLE_*` options. Note that `UMPIRE_ENABLE_OPENMP_TARGET` is distinct from `UMPIRE_ENABLE_OPENMP`.
+
+- `BLT_CXX_STD`: C++ standard (default and minimum: c++20)
+- `UMPIRE_ENABLE_CUDA`: Build with CUDA support (defaults to the value of `ENABLE_CUDA`)
+- `UMPIRE_ENABLE_HIP`: Build with HIP support (defaults to the value of `ENABLE_HIP`)
+- `UMPIRE_ENABLE_OPENMP`: Build with OpenMP support
+- `UMPIRE_ENABLE_TESTS`: Build tests (default: On)
+- `UMPIRE_ENABLE_EXAMPLES`: Build examples (default: On)
+- `UMPIRE_ENABLE_BENCHMARKS`: Build benchmarks (requires `UMPIRE_ENABLE_DEVELOPER_BENCHMARKS`)
+- `UMPIRE_ENABLE_LOGGING`: Enable logging (default: On)
+- `CMAKE_BUILD_TYPE`: Build type (Release, Debug, RelWithDebInfo)
+
+## Architecture
+
+### Core Components
+
+1. **ResourceManager** (`src/umpire/ResourceManager.{hpp,cpp}`): Singleton that manages all allocators and provides the primary interface for getting allocators and introspecting allocations.
+
+2. **Allocator** (`src/umpire/Allocator.{hpp,cpp}`): User-facing interface for memory allocation/deallocation. Wraps an AllocationStrategy and provides a unified API.
+
+3. **AllocationStrategy** (`src/umpire/strategy/AllocationStrategy.hpp`): Abstract base class for all allocation strategies. Strategies are composable and can be chained.
+
+4. **MemoryResource** (`src/umpire/resource/MemoryResource.hpp`): Represents actual memory resources (e.g., host, CUDA device, HIP device).
+
+5. **MemoryOperation** (`src/umpire/op/MemoryOperation.hpp`): Platform-specific operations for memory manipulation (copy, memset, prefetch, advise). Operations are registered in the MemoryOperationRegistry and selected based on source/destination resource types.
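+
+The selection step can be pictured with a small, self-contained toy model. `ToyOperationRegistry`, the `Platform` enum, and the implementation names below are hypothetical stand-ins, not Umpire's actual `MemoryOperationRegistry` interface; the sketch only shows the idea of keying an operation on its name plus the source and destination platforms.
+
+```cpp
+#include <map>
+#include <stdexcept>
+#include <string>
+#include <tuple>
+
+// Hypothetical platform tags; Umpire's real platform enumeration differs.
+enum class Platform { host, cuda };
+
+// Toy registry: (operation, source platform, destination platform) -> implementation name.
+class ToyOperationRegistry {
+ public:
+  void add(const std::string& op, Platform src, Platform dst, const std::string& impl)
+  {
+    m_ops[std::make_tuple(op, src, dst)] = impl;
+  }
+
+  // Look up the implementation registered for this platform pair; throw if none exists.
+  const std::string& find(const std::string& op, Platform src, Platform dst) const
+  {
+    auto it = m_ops.find(std::make_tuple(op, src, dst));
+    if (it == m_ops.end()) {
+      throw std::runtime_error("no operation registered for " + op);
+    }
+    return it->second;
+  }
+
+ private:
+  std::map<std::tuple<std::string, Platform, Platform>, std::string> m_ops;
+};
+```
+
+Generic code only asks for `COPY`; the platform pair decides which implementation runs, which is what keeps backend-specific code out of the generic layers.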
+
+### Key Directories
+
+- `src/umpire/`: Core library implementation
+  - `alloc/`: Low-level allocator implementations (MallocAllocator, CudaMallocAllocator, etc.)
+  - `resource/`: Memory resource implementations and factories
+  - `strategy/`: Allocation strategies (pools, advisors, prefetchers, limiters, etc.)
+  - `op/`: Memory operations (copy, memset, advise, prefetch)
+  - `util/`: Utility classes (AllocationMap, Platform, MemoryResourceTraits)
+  - `interface/`: C and Fortran interfaces
+  - `event/`: Event recording and replay functionality
+
+- `tests/`: Test suite
+  - `unit/`: Unit tests for individual components
+  - `integration/`: Integration tests for end-to-end functionality
+  - `applications/`: Application-level tests
+
+- `examples/`: Example code and tutorials
+  - `tutorial/`: Tutorial examples (C++, C, and Fortran)
+  - `cookbook/`: Recipe-style examples
+
+### Memory Resources
+
+Available memory resources (platform-dependent):
+- `HOST`: Standard host memory
+- `DEVICE`: GPU device memory (CUDA/HIP/SYCL)
+- `UM`: Unified memory (CUDA/HIP managed memory)
+- `PINNED`: Pinned/page-locked host memory
+- `DEVICE_CONST`: Constant memory on device
+- `FILE`: File-backed memory (memory-mapped files)
+- `SHARED`: Shared memory between processes
+  - `SHARED::POSIX`: IPC shared memory (POSIX implementation)
+  - `SHARED::MPI3`: MPI-3 shared memory
+  - Note: Use full names (`SHARED::POSIX` or `SHARED::MPI3`) when both are enabled
+- `NO_OP`: No-op memory resource (for testing/debugging)
+
+### Allocation Strategies
+
+Strategies can be composed to create complex allocation patterns:
+- **Pools**: `DynamicPoolList`, `DynamicSizePool`, `QuickPool`, `FixedPool`, `MixedPool`
+- **Advisors**: `AllocationAdvisor` (for memory access hints)
+- **Prefetchers**: `AllocationPrefetcher` (for explicit prefetching)
+- **Limiters**: `SizeLimiter` (caps the total number of bytes allocated through it)
+- **Alignment**: `AlignedAllocator` (enforce memory alignment)
+- **NUMA**: `NumaPolicy` (NUMA node binding)
+
+## Development Workflow
+
+### Branch Strategy
+
+- `develop`: Main branch for all development (PRs target this branch)
+- Feature branches: `feature/<name>`
+- Bugfix branches: `bugfix/<name>`
+- Task branches: `task/<name>`
+- Note: The `main` branch is deprecated and should not be used
+
+### Code Style
+
+- C++20 standard is required
+- Use Doxygen comments for public APIs
+- Follow existing code formatting patterns
+
+### Adding New Features
+
+1. Create feature branch from `develop`
+2. Implement feature with tests
+3. Add Doxygen documentation for new public APIs
+4. Add a minimal example demonstrating basic functionality (in `examples/`, `examples/cookbook/`, or `examples/tutorial/`)
+5. Ensure all tests pass
+6. Create PR targeting `develop`
+
+**Examples should be:**
+- Simple and focused on demonstrating the core feature
+- Self-contained and easy to compile
+- Well-commented to explain what the feature does
+- Useful for both automated testing and human understanding
+
+### Testing Requirements
+
+- Add unit tests for new classes/functions in `tests/unit/`
+- Add integration tests for end-to-end features in `tests/integration/`
+- Ensure tests pass on various configurations (host-only, CUDA, HIP)
+
+## Common Patterns
+
+### Getting an Allocator
+
+```cpp
+auto& rm = umpire::ResourceManager::getInstance();
+umpire::Allocator alloc = rm.getAllocator("HOST");
+void* data = alloc.allocate(1024);  // size in bytes
+alloc.deallocate(data);
+```
+
+### Creating a Pool
+
+```cpp
+auto& rm = umpire::ResourceManager::getInstance();
+auto alloc = rm.getAllocator("HOST");
+auto pool = rm.makeAllocator<umpire::strategy::QuickPool>(
+    "my_pool", alloc);
+```
+
+### Introspecting Allocations
+
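+```cpp
+auto& rm = umpire::ResourceManager::getInstance();
+void* ptr = rm.getAllocator("HOST").allocate(256);
+// findAllocationRecord returns a pointer to the tracked record
+const auto* record = rm.findAllocationRecord(ptr);
+std::size_t size = record->size;
+std::string name = record->name;
+rm.deallocate(ptr);
+```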
+
+### Memory Operations
+
+```cpp
+auto& rm = umpire::ResourceManager::getInstance();
+rm.copy(dest_ptr, src_ptr);  // Automatically determines copy operation
+rm.memset(ptr, 0);           // Set memory to value
+```
+
+## Key Architectural Principles
+
+1. **Strategy Pattern**: AllocationStrategy implementations allow flexible composition of memory management behaviors
+2. **Factory Pattern**: MemoryResourceFactory creates appropriate resources based on platform capabilities
+3. **Singleton Pattern**: ResourceManager is the central registry for all allocators
+4. **Introspection**: AllocationMap tracks all allocations for debugging and introspection
+5. **Platform Abstraction**: Platform-specific operations are abstracted through MemoryOperation registry
+
+## Critical Constraints (HPC Performance Library)
+
+Umpire is a **performance-critical HPC library**. All changes must preserve performance, portability, and API stability.
+If you believe this is not possible, inform the user before proceeding.
+
+### Architectural Invariants
+
+**Core concepts and their relationships:**
+- ResourceManager (singleton, thread-safe)
+- Allocator (lightweight handle; its operations must remain O(1))
+  - If a change would break this, inform the user
+- MemoryResource (backend abstraction)
+- AllocationStrategy (composable, may have different complexity)
+  - For example, you can apply a SizeLimiter strategy to a QuickPool to impose a strict upper bound on how much the QuickPool can grow.
+- MemoryOperation (platform-specific operations)
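+
+The SizeLimiter-over-QuickPool example can be sketched as a toy model of strategy composition. `ToyStrategy`, `ToyHostResource`, and `ToySizeLimiter` are hypothetical classes, not Umpire's `AllocationStrategy` hierarchy, and a plain `malloc`-backed resource stands in for the QuickPool to keep the example self-contained:
+
+```cpp
+#include <cstddef>
+#include <cstdlib>
+#include <new>
+#include <stdexcept>
+#include <unordered_map>
+
+// Hypothetical minimal strategy interface (not Umpire's AllocationStrategy).
+struct ToyStrategy {
+  virtual ~ToyStrategy() = default;
+  virtual void* allocate(std::size_t bytes) = 0;
+  virtual void deallocate(void* ptr) = 0;
+};
+
+// Leaf strategy: plain malloc/free, standing in for a pool or memory resource.
+struct ToyHostResource : ToyStrategy {
+  void* allocate(std::size_t bytes) override
+  {
+    void* ptr = std::malloc(bytes);
+    if (ptr == nullptr) throw std::bad_alloc{};
+    return ptr;
+  }
+  void deallocate(void* ptr) override { std::free(ptr); }
+};
+
+// Wrapping strategy: forwards to its parent but caps the total bytes outstanding.
+class ToySizeLimiter : public ToyStrategy {
+ public:
+  ToySizeLimiter(ToyStrategy& parent, std::size_t limit) : m_parent{parent}, m_limit{limit} {}
+
+  void* allocate(std::size_t bytes) override
+  {
+    // m_total never exceeds m_limit, so this subtraction cannot underflow.
+    if (bytes > m_limit - m_total) throw std::bad_alloc{};
+    void* ptr = m_parent.allocate(bytes);
+    m_sizes[ptr] = bytes;
+    m_total += bytes;
+    return ptr;
+  }
+
+  void deallocate(void* ptr) override
+  {
+    auto it = m_sizes.find(ptr);
+    if (it == m_sizes.end()) throw std::invalid_argument{"pointer not owned by this limiter"};
+    m_total -= it->second;
+    m_sizes.erase(it);
+    m_parent.deallocate(ptr);
+  }
+
+  std::size_t total() const { return m_total; }
+
+ private:
+  ToyStrategy& m_parent;
+  std::size_t m_limit;
+  std::size_t m_total{0};
+  std::unordered_map<void*, std::size_t> m_sizes;  // bytes per live pointer
+};
+```
+
+Because the limiter only forwards to its parent, the same wrapper could sit on top of any other strategy; this is the composability the invariants in this section are meant to preserve.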
+
+**Must maintain:**
+- Allocators remain lightweight handles (no heavy state)
+- Allocation/deallocation O(1) unless strategy explicitly requires otherwise
+- Backend support remains conditionally compiled (no forced dependencies)
+- No backend-specific code in generic layers (strict separation)
+- No cross-layer violations
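+
+As an illustration of how a fixed-size pool can keep allocation and deallocation O(1), here is a hypothetical toy free-list pool (not Umpire's `FixedPool`): every block has the same size, so both operations are a single pointer pop or push.
+
+```cpp
+#include <cstddef>
+#include <new>
+#include <vector>
+
+// Hypothetical toy pool (not Umpire's FixedPool): all blocks share one size, so
+// allocate() pops and deallocate() pushes on an intrusive free list, both O(1).
+class ToyFixedPool {
+ public:
+  ToyFixedPool(std::size_t block_size, std::size_t num_blocks)
+      : m_block_size{round_up(block_size)}, m_storage(m_block_size * num_blocks)
+  {
+    // Thread every block onto the free list once, outside the hot path.
+    for (std::size_t i = 0; i < num_blocks; ++i) {
+      push(m_storage.data() + i * m_block_size);
+    }
+  }
+
+  void* allocate()
+  {
+    if (m_head == nullptr) return nullptr;  // pool exhausted
+    Node* node = m_head;
+    m_head = node->next;
+    return node;
+  }
+
+  void deallocate(void* ptr) { push(static_cast<std::byte*>(ptr)); }
+
+ private:
+  struct Node {
+    Node* next;
+  };
+
+  // Each block must hold a Node and keep the following blocks suitably aligned.
+  static std::size_t round_up(std::size_t bytes)
+  {
+    const std::size_t align = alignof(std::max_align_t);
+    const std::size_t size = bytes < sizeof(Node) ? sizeof(Node) : bytes;
+    return (size + align - 1) / align * align;
+  }
+
+  void push(std::byte* block) { m_head = new (block) Node{m_head}; }
+
+  std::size_t m_block_size;
+  std::vector<std::byte> m_storage;
+  Node* m_head{nullptr};
+};
+```
+
+The toy never grows and returns `nullptr` when exhausted; a production pool needs a growth and error-reporting policy, and that is exactly where any extra cost has to be justified.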
+
+### Performance Requirements
+
+**In hot paths (allocate/deallocate), avoid:**
+- Virtual calls (unless already required by design)
+- `dynamic_cast` in performance-sensitive code
+- `std::function` in allocator operations
+- Exceptions in fast allocation paths
+- Unnecessary heap allocations
+- Hidden device synchronization (`cudaDeviceSynchronize()`, etc.)
+
+**All new logic in allocation paths must justify its performance cost.**
+
+### Thread Safety Rules
+
+- ResourceManager is thread-safe (already implemented)
+- Allocators must not introduce race conditions
+- Strategies must document thread-safety guarantees
+- No static non-const globals outside ResourceManager
+- No global mutable state
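+
+One way a strategy can document and enforce a simple thread-safety guarantee is to serialize every call through a mutex owned by the instance rather than by a global. The classes below are hypothetical toys, not Umpire types, and `stress` is a hypothetical helper that exercises the wrapper from several threads:
+
+```cpp
+#include <cstddef>
+#include <cstdlib>
+#include <mutex>
+#include <new>
+#include <thread>
+#include <vector>
+
+// Hypothetical inner strategy that is NOT thread-safe: m_live is updated without locking.
+class ToyCountingStrategy {
+ public:
+  void* allocate(std::size_t bytes)
+  {
+    void* ptr = std::malloc(bytes);
+    if (ptr == nullptr) throw std::bad_alloc{};
+    ++m_live;
+    return ptr;
+  }
+
+  void deallocate(void* ptr)
+  {
+    std::free(ptr);
+    --m_live;
+  }
+
+  std::size_t live() const { return m_live; }
+
+ private:
+  std::size_t m_live{0};
+};
+
+// Documented guarantee: every call is serialized by a mutex owned by this instance.
+class ToyThreadSafeWrapper {
+ public:
+  void* allocate(std::size_t bytes)
+  {
+    std::lock_guard<std::mutex> lock{m_mutex};
+    return m_inner.allocate(bytes);
+  }
+
+  void deallocate(void* ptr)
+  {
+    std::lock_guard<std::mutex> lock{m_mutex};
+    m_inner.deallocate(ptr);
+  }
+
+  std::size_t live()
+  {
+    std::lock_guard<std::mutex> lock{m_mutex};
+    return m_inner.live();
+  }
+
+ private:
+  std::mutex m_mutex;
+  ToyCountingStrategy m_inner;
+};
+
+// Hypothetical helper: allocate/deallocate concurrently, then report live allocations.
+std::size_t stress(ToyThreadSafeWrapper& alloc, int threads, int iterations)
+{
+  std::vector<std::thread> workers;
+  for (int t = 0; t < threads; ++t) {
+    workers.emplace_back([&alloc, iterations] {
+      for (int i = 0; i < iterations; ++i) {
+        alloc.deallocate(alloc.allocate(64));
+      }
+    });
+  }
+  for (auto& worker : workers) {
+    worker.join();
+  }
+  return alloc.live();
+}
+```
+
+A lock adds cost to every call, so under the performance rules in this guide such a wrapper should be opt-in rather than added to existing allocation fast paths.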
+
+### GPU Backend Rules
+
+When working with CUDA, HIP, SYCL, or device allocators:
+- Ensure proper conditional compilation (`#ifdef UMPIRE_ENABLE_CUDA`, etc.)
+- No device-host synchronization unless explicitly required
+- No implicit stream synchronization
+- No hidden `cudaDeviceSynchronize()` or equivalent
+- Document any synchronization points
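+
+The conditional-compilation rule can be illustrated with a minimal, hypothetical helper. In a real build the `UMPIRE_ENABLE_CUDA` and `UMPIRE_ENABLE_HIP` macros would come from Umpire's build configuration; in this standalone sketch they are left undefined, so only the host branch is compiled and no GPU headers are needed:
+
+```cpp
+#include <string>
+
+// Hypothetical helper: reports which backend this translation unit was compiled for.
+// Each backend branch exists only when its macro is defined, so a host-only build
+// never sees CUDA or HIP code or headers.
+inline std::string toy_backend_name()
+{
+#if defined(UMPIRE_ENABLE_CUDA)
+  return "cuda";
+#elif defined(UMPIRE_ENABLE_HIP)
+  return "hip";
+#else
+  return "host";
+#endif
+}
+```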
+
+### What You Can Do
+
+- Add tests (unit, integration)
+- Improve documentation
+- Refactor internal implementation without changing the public API (get approval from the user first)
+- Add new allocators or strategies (if explicitly requested)
+- Fix bugs with tests demonstrating the issue
+
+### What You Must NOT Do (Without Explicit Approval)
+
+- **Edit auto-generated code** (see warning section above - edit source YAML instead)
+- Break public API compatibility
+- Modify allocator semantics silently
+- Change default allocator behavior
+- Modify ResourceManager initialization logic
+- Alter memory tracking logic
+- Change device memory semantics
+- Remove or break backend support (CUDA, HIP, SYCL, etc.)
+- Introduce runtime overhead to allocation fast paths
+- Add global mutable state
+- Embed backend-specific logic in generic code
+
+### Testing Requirements
+
+**All changes must:**
+- Build with host-only configuration
+- Add or update unit tests if behavior changes
+- Avoid introducing nondeterminism in tests
+
+**Testing considerations:**
+- Ask the user whether the test should build with CUDA enabled.
+- Ask the user whether the test should build with HIP enabled.
+- Ask the user whether the test should build with any other options enabled (e.g., sanitizer support or Fortran).
+- If you are trying to test IPC shared memory or MPI3 shared memory and you are working in a macOS or Windows environment:
+  - Tell the user that the test can't be run locally; they will have to use LC (Livermore Computing) resources to build and run it
+
+**Tests should:**
+- Avoid large allocations unless done by design as part of the test
+- Avoid device synchronization unless validating behavior
+- Clean up all allocations
+- Run quickly (this is a CI constraint)
+
+### Code Style Expectations
+
+- Follow existing formatting conventions (RAJA/LLNL C++ style)
+- Prefer clarity over cleverness
+- No unnecessary template metaprogramming
+- Avoid deep inheritance chains
+- Avoid overengineering or unnecessary abstraction layers
+- Use Doxygen comments for all public APIs
+
+### Documentation Requirements
+
+When adding allocators, strategies, or resources:
+- Update relevant documentation
+- Add usage examples
+- Document performance implications if possible
+- Document backend constraints and requirements if possible
+- Document thread-safety guarantees if possible
+
+### When Uncertain
+
+**If architectural impact is unclear:**
+- Ask for clarification before implementing
+- Do not guess about memory semantics
+- Do not assume thread-safety without verification
+- Do not make changes that could affect hot paths without discussion
+
+**Umpire correctness and performance take precedence over feature velocity.**
+
+## Important Conventions
+
+- Memory sizes are in bytes
+- Allocator names must be unique strings
+- AllocationStrategy objects form a hierarchy (strategies can wrap other strategies)
+- C++20 is the minimum required standard (enforced by CMake configuration)
+- The ResourceManager must be initialized before use (this happens automatically on the first call to `getInstance()`)
+
+## Documentation
+
+- Full documentation: https://umpire.readthedocs.io/
+- Tutorial: https://umpire.readthedocs.io/en/develop/sphinx/tutorial.html
+- API documentation is generated from Doxygen comments in headers
+
+## Other Notes
+
+- When you need to create an example or show the user some sample Umpire code which involves a memory pool, always use QuickPool unless:
+  - The allocation size is known to always be the same amount in bytes - then, you can use FixedPool
+  - The allocations need some sort of synchronization guarantee to avoid data races - then, you can use ResourceAwarePool
+  - The allocations will be deallocated in the opposite order in which they were allocated - then, you can use DynamicPoolList
+  - The user specifically asks for you to use a different pool
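+
+The pool-selection rule can also be written as a small decision function. `ToyUsage` and `choose_pool` are hypothetical, and the order in which the exceptions are checked is an assumption, since the list does not rank them; an explicit user request is checked first:
+
+```cpp
+#include <string>
+
+// Hypothetical description of an allocation pattern (not an Umpire type).
+struct ToyUsage {
+  bool user_requested_other_pool = false;
+  std::string requested_pool;
+  bool fixed_size = false;            // every allocation has the same byte count
+  bool needs_sync_guarantee = false;  // allocations must be protected from data races
+  bool lifo_deallocation = false;     // freed in reverse allocation order
+};
+
+// QuickPool is the default unless one of the listed exceptions applies.
+inline std::string choose_pool(const ToyUsage& usage)
+{
+  if (usage.user_requested_other_pool) return usage.requested_pool;
+  if (usage.fixed_size) return "FixedPool";
+  if (usage.needs_sync_guarantee) return "ResourceAwarePool";
+  if (usage.lifo_deallocation) return "DynamicPoolList";
+  return "QuickPool";
+}
+```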
+
+## Maintaining This File
+
+**When to update this file:**
+
+This file should be updated when making changes that affect how future developers (human or AI) work with the codebase:
+
+- Adding new core components or architectural patterns
+- Introducing new memory resources or allocation strategies
+- Adding new build options or dependencies
+- Changing development workflows or testing requirements
+- Adding new code generation tools (like Shroud)
+- Modifying branch strategies or contribution guidelines
+- Adding new categories of files that should/shouldn't be edited
+- Introducing new performance constraints or safety requirements
+
+**At minimum:** When completing a major feature or architectural change, ask the user whether `AGENTS.md` and `CLAUDE.md` should be updated to reflect the changes. Consider:
+- Would a future agent benefit from knowing about this?
+- Are there new constraints or patterns that should be documented?
+
+Keeping this file up-to-date ensures that future coding agents and human developers have accurate, helpful guidance.
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -0,0 +1 @@
+AGENTS.md