Skip to content

Mpi evaluationmanager dev#761

Open
nychiang wants to merge 9 commits into
developfrom
mpi-evaluationmanager-dev
Open

Mpi evaluationmanager dev#761
nychiang wants to merge 9 commits into
developfrom
mpi-evaluationmanager-dev

Conversation

@nychiang
Copy link
Copy Markdown
Collaborator

@nychiang nychiang commented May 20, 2026

Redesign EvaluationManager to work with MPI on multiple nodes, and add comprehensive testing

Major fixes:
- Fix MPI hanging by switching to OpenMPI and python -m mpi4py.futures
- Fix missing profiling output by storing timing data as instance variables

New features:
- Comprehensive scaling test infrastructure (submit_bo_scaling.sh)
- BODriverEX_mpi.py: Simple 2D test problem for MPI scaling
- Thread/Process/MPI executor test scripts
- Complete documentation suite

Details in COMMIT_MESSAGE.txt and CHANGELOG.md

Files changed:
- hiopbbpy/utils/evaluation_manager.py (profiling fix)
- hiopbbpy/utils/util.py (debug output)
- test_multinode_mpi.sh (OpenMPI + proper launcher)
- Created: BODriverEX_mpi.py, scaling test scripts, documentation

Tested on LLNL LC (Dane) with 1-64 workers, single and multi-node.

@nychiang nychiang marked this pull request as ready for review May 20, 2026 21:02
@nychiang nychiang requested review from cnpetra and thartland May 20, 2026 21:04
nychiang and others added 3 commits May 20, 2026 14:06
bug fixed

Fix multi-node MPI worker detection in EvaluationManager

Fixed critical bug in _get_num_workers() that prevented correct worker
counting for true multi-node MPI execution (mpiexec -n N where N > 1).

Key changes:
- evaluation_manager.py: Query MPI.COMM_WORLD.Get_size() - 1 instead of
  relying solely on MPI4PY_FUTURES_MAX_WORKERS environment variable
- util.py: Enhanced MPIEvaluator documentation with multi-node examples
- EvaluationManagerCI.py: Added -e flag to test thread/process/mpi executors
  with proper rank handling for multi-node MPI

Before: Multi-node profiling incorrectly reported 1 worker
After:  Correctly reports N-1 workers for mpiexec -n N

Backward compatible - legacy single-node MPI mode still supported.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
EvaluationManagerCI.py now handles all executor types (thread/process/mpi)
making the separate MPI-only example redundant.

update

fix CI
Major fixes:
- Fix MPI hanging by switching to OpenMPI and python -m mpi4py.futures
- Fix missing profiling output by storing timing data as instance variables

New features:
- Comprehensive scaling test infrastructure (submit_bo_scaling.sh)
- BODriverEX_mpi.py: Simple 2D test problem for MPI scaling
- Thread/Process/MPI executor test scripts
- Complete documentation suite

Details in COMMIT_MESSAGE.txt and CHANGELOG.md

Files changed:
- hiopbbpy/utils/evaluation_manager.py (profiling fix)
- hiopbbpy/utils/util.py (debug output)
- test_multinode_mpi.sh (OpenMPI + proper launcher)
- Created: BODriverEX_mpi.py, scaling test scripts, documentation

Tested on LLNL LC (Dane) with 1-64 workers, single and multi-node.
@nychiang nychiang force-pushed the mpi-evaluationmanager-dev branch from 9c8e5f9 to 80c54df Compare May 20, 2026 21:07
Comment thread src/Drivers/hiopbbpy/test_process.sh Outdated
Comment thread src/Drivers/hiopbbpy/test_thread.sh Outdated
if __name__ == "__main__":
do_profiling = True

for prob_type in prob_type_l:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If prob_type_l contains one element do we want to keep this loop over elements of prob_type_l?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

umm... let's keep it since we can easily switch prob_type_l to a list with more than one elements

Comment thread src/Drivers/hiopbbpy/README.md Outdated
Comment thread src/Drivers/hiopbbpy/submit_bo_template.sbatch Outdated
Comment thread src/Drivers/hiopbbpy/submit_scaling_tests.sh Outdated
Comment thread src/Drivers/hiopbbpy/submit_scaling_tests.sh Outdated
Comment thread src/Drivers/hiopbbpy/submit_scaling_tests.sh Outdated
Comment thread src/Drivers/hiopbbpy/TESTING_GUIDE.md Outdated
Comment thread src/Drivers/hiopbbpy/TESTING_GUIDE.md Outdated

**BODriverEX_mpi.py**: Simple 2D LpNorm optimization
- 64 initial samples, 20 BO iterations
- No xfoil dependency, fast evaluations
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove "No xfoil dependency"

Comment thread src/Drivers/hiopbbpy/TESTING_GUIDE.md Outdated
Comment thread src/Drivers/hiopbbpy/TESTING_GUIDE.md Outdated
Comment thread src/Drivers/hiopbbpy/TESTING_GUIDE.md Outdated
Comment thread src/Drivers/hiopbbpy/TESTING_GUIDE.md Outdated
Comment thread src/Drivers/hiopbbpy/test_thread.sh Outdated
Comment thread src/Drivers/hiopbbpy/test_multinode_mpi.sh Outdated
Comment thread src/Drivers/hiopbbpy/README.md Outdated
Comment thread src/Drivers/hiopbbpy/README.md Outdated
Comment thread src/Drivers/hiopbbpy/README.md Outdated
Comment thread src/Drivers/hiopbbpy/README.md
Copy link
Copy Markdown
Collaborator

@thartland thartland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See inline comments.

nychiang added a commit that referenced this pull request May 28, 2026
Remove all legacy xfoil references from shell scripts and documentation, and clean up unused variables in response to reviewer feedback.

Changes:
- Update job names in test scripts from xfoil_* to eval_mgr_*
- Remove unused HIOP_XFOIL_JOB_ROOT and TOTAL_TASKS variables
- Remove xfoil references from README.md and TESTING_GUIDE.md
- Remove references to non-existent files (README_EvaluationManager.md, README_XFOIL.md)
- Remove ds4mems package dependency mention

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
nychiang added 2 commits May 28, 2026 13:45
Remove all legacy xfoil references from shell scripts and documentation, and clean up unused variables in response to reviewer feedback.

Changes:
- Update job names in test scripts from xfoil_* to eval_mgr_*
- Remove unused HIOP_XFOIL_JOB_ROOT and TOTAL_TASKS variables
- Remove xfoil references from README.md and TESTING_GUIDE.md
- Remove references to non-existent files (README_EvaluationManager.md, README_XFOIL.md)
- Remove ds4mems package dependency mention
Remove remaining xfoil sections that were missed in the previous cleanup:
- Production xfoil BO section
- xfoil Problems subsection with file references
- xfoil_bo/submit_bo_xfoil.sbatch reference
@nychiang nychiang force-pushed the mpi-evaluationmanager-dev branch from b261406 to 42c3a01 Compare May 28, 2026 20:48
@nychiang
Copy link
Copy Markdown
Collaborator Author

nychiang commented May 28, 2026

See inline comments.

All the comments have been addressed in the new commits.

The --no-cache flag was preventing Spack from using cached dependency builds,
causing 'compiler-wrapper not installed' errors. Dependencies installed in the
previous step should be available from cache for the package-only build.
Copy link
Copy Markdown
Collaborator

@thartland thartland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants