Add GPU Support for ML Models by wk9874 · Pull Request #291 · ukaea/toktagger

wk9874 · 2026-06-15T16:12:29Z

Summary

Adds GPU support for ML models. For Train and Predict tasks, the user can now use a switch in the UI to request that a GPU worker node be reserved for the task. The choice is passed to the backend and on to Ray, which allocates Actors to hardware accordingly. Note that it is up to each model implementation whether and how it uses the hardware.

Changes

Adds code to main.py that determines how many CPU and GPU nodes are available (users can override these values via env vars)
ActorRegistry tracks GPU actors separately from CPU ones, and compares these to the number of GPU nodes available before comparing total numbers, removing stale GPU actors first
Adds use_gpu query params to train and predict endpoints, and passes these into worker tasks
If a GPU is requested, the Worker checks whether the existing Actor has access to a GPU, stopping it and restarting it on a GPU node if not
Adds fixes to the disruption model so that it handles device selection correctly
Adds 'Use GPU' switch and contextual help to train and predict modals via SchemaForm to set use_gpu query params
Adds tests and refactors the model tests to make them more reliable
Adds documentation
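
The node-count detection with overrides described above could look roughly like the sketch below. This is an illustration only, not the code in main.py: the function name and the two environment variable names are hypothetical, and a plain dict stands in for the resource mapping that Ray reports.

```python
import os


def resolve_worker_counts(cluster_resources, env=None):
    """Return (cpu_nodes, gpu_nodes), letting env vars override detection."""
    env = os.environ if env is None else env

    # Detected values; fall back to the local CPU count if "CPU" is missing.
    detected_cpus = int(cluster_resources.get("CPU", 0)) or (os.cpu_count() or 1)
    detected_gpus = int(cluster_resources.get("GPU", 0))

    # Hypothetical variable names, used here only for illustration.
    cpus = int(env.get("TOKTAGGER_NUM_CPUS", detected_cpus))
    gpus = int(env.get("TOKTAGGER_NUM_GPUS", detected_gpus))
    return cpus, gpus


# No overrides: the detected values are used.
print(resolve_worker_counts({"CPU": 8.0, "GPU": 1.0}, env={}))
# Override: GPU support is switched off even though a GPU was detected.
print(resolve_worker_counts({"CPU": 8.0, "GPU": 1.0}, env={"TOKTAGGER_NUM_GPUS": "0"}))
```

Passing `env` explicitly keeps the function easy to test without touching the real process environment.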

Closes #264

Don't merge until #273 is merged

samueljackson92

Code Review

Bugs

1. ActorRegistry.update_actors() always stores self.gpu_enabled, not the actual GPU status of the actor — toktagger/api/models/base.py

# new actor path:
self.actors[actor_name] = self.gpu_enabled  # always the server-wide flag

The actors dict is supposed to map actor_name → is_gpu_actor, but update_actors doesn't receive use_gpu. When a CPU-only prediction is dispatched on a GPU-enabled server, it gets stored as True, inflating gpu_count and causing legitimate GPU actors to be incorrectly evicted. The update_actors call sites in the router need to pass use_gpu through.
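
A minimal sketch of the suggested fix, assuming the registry only needs to map each actor name to its own GPU status. The class below is a simplified stand-in, not the real ActorRegistry, and its method signatures are assumptions.

```python
class ActorRegistrySketch:
    """Simplified stand-in for ActorRegistry (illustration only)."""

    def __init__(self, gpu_enabled: bool):
        self.gpu_enabled = gpu_enabled      # server-wide capability
        self.actors: dict[str, bool] = {}   # actor_name -> is_gpu_actor

    def update_actors(self, actor_name: str, use_gpu: bool) -> None:
        # Store what this actor was actually given, not the server-wide flag.
        self.actors[actor_name] = use_gpu and self.gpu_enabled

    @property
    def gpu_count(self) -> int:
        # True counts as 1, so this is the number of GPU actors.
        return sum(self.actors.values())


registry = ActorRegistrySketch(gpu_enabled=True)
registry.update_actors("train-1", use_gpu=True)
registry.update_actors("predict-1", use_gpu=False)  # CPU-only task
print(registry.gpu_count)  # 1
```

With the per-actor flag stored, a CPU-only prediction no longer inflates `gpu_count` on a GPU-enabled server.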

2. create_sample_predictions ignores the use_gpu query param — toktagger/api/routers/models.py

task = get_predictions.remote(
    ...
    use_gpu=task_registry.gpu_enabled,  # should be: use_gpu=use_gpu
)

The use_gpu query param is validated at the top of the function (raising 409 if not available), but then the worker is always launched with the server-wide gpu_enabled flag rather than the user's choice.
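
The intended flow can be sketched as a single helper that both validates the request and returns the value to forward to the worker. The names below are hypothetical, and the exception is only a stand-in for the HTTP 409 response raised by the real endpoint.

```python
class GpuUnavailableError(Exception):
    """Stand-in for the HTTP 409 response (illustration only)."""


def resolve_use_gpu(requested: bool, gpu_enabled: bool) -> bool:
    """Validate the user's choice and return the value to pass to the worker."""
    if requested and not gpu_enabled:
        raise GpuUnavailableError("GPU requested but none is available")
    # Forward the user's choice, not the server-wide gpu_enabled flag.
    return requested


print(resolve_use_gpu(requested=False, gpu_enabled=True))  # False
print(resolve_use_gpu(requested=True, gpu_enabled=True))   # True
```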

3. int(cluster_resources.get("CPU")) will raise TypeError if "CPU" is absent — toktagger/api/models/base.py

cpus_available = int(cluster_resources.get("CPU")) or os.cpu_count()

int(None) raises TypeError. Should be int(cluster_resources.get("CPU", 0)).
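
The failure and the suggested default can be reproduced with a plain dict in place of the cluster resources mapping. The helper name is hypothetical, and the extra `or 1` is an added assumption to cover `os.cpu_count()` returning `None`.

```python
import os


def cpus_available(cluster_resources: dict) -> int:
    # Default to 0 so int() never receives None, then fall back to the host.
    return int(cluster_resources.get("CPU", 0)) or (os.cpu_count() or 1)


# The original expression fails when the "CPU" key is absent.
try:
    int({}.get("CPU"))
except TypeError as exc:
    print(f"original expression fails: {exc}")

print(cpus_available({"CPU": 4.0}))  # 4
```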

4. max_gpu_actors calculation with 1 GPU silently disables GPU support — toktagger/api/models/base.py

max_gpu_actors = int(cluster_resources.get("GPU", 0)) - 1

With 1 GPU: 1 - 1 = 0 → gpu_enabled = False. If this subtraction is intentional (reserving one GPU for something else), it needs a comment. As written it makes GPU support impossible on single-GPU systems.
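
If a GPU really does need to be held back, one option is to make the reservation explicit and configurable. This is a sketch under that assumption: `reserved_gpus` is a hypothetical parameter, not something present in the PR.

```python
def max_gpu_actors(cluster_resources: dict, reserved_gpus: int = 0) -> int:
    """Number of GPU actors allowed after holding back reserved_gpus."""
    total = int(cluster_resources.get("GPU", 0))
    return max(total - reserved_gpus, 0)


single_gpu = {"GPU": 1.0}
for reserved in (1, 0):
    limit = max_gpu_actors(single_gpu, reserved_gpus=reserved)
    # gpu_enabled mirrors the "limit > 0" check implied by the review comment.
    print(f"reserved={reserved} max_gpu_actors={limit} gpu_enabled={limit > 0}")
```

With `reserved_gpus=1` a single-GPU machine ends up with GPU support disabled, which is the behaviour the review flags. With `reserved_gpus=0` it stays enabled.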

5. Wrong model ID checked in test_update_model — tests/api/crud/test_utils.py

await utils.update_model(db_client, model_id=setup_model_db["model_id_3"], updates=...)
# then checks:
model_updated = await db_client.get_document_by_id("models", ObjectId(setup_model_db["model_id_1"]))

The test updates model_id_3 but asserts the result on model_id_1. This is a pre-existing bug carried forward from the old setup_db fixture.

6. load_model return type annotation is wrong — toktagger/api/worker.py

def load_model(...) -> tuple[str, str | None]:
    ...
    return {"project_id": ..., "model_id": ..., "message": ...}

Annotated as returning a tuple, actually returns a dict. The caller in the router uses result.get(...) so it works at runtime, but the annotation is misleading.

7. test_model_load_local_disabled is missing @pytest.mark.models_enabled — tests/api/routers/test_models.py

Every other test in that file has this mark, but test_model_load_local_disabled does not. Without it, the check_models_status autouse fixture won't skip it when Ray is absent, and it will fail with an unexpected error.

Code Quality / Medium Issues

8. get_actor() uses exception-as-control-flow and a private Ray API — toktagger/api/worker.py

ray.get(ml_model.__ray_terminate__.remote())
raise ValueError("Actor has no GPU, but GPU has been requested.")
# falls through to:
except ValueError:
    # "Actor not alive, so load from weights"
    ml_model = ray.remote(num_gpus=1 if use_gpu else 0)(model_type)...

Using __ray_terminate__ is fragile (private internal API). Raising ValueError to trigger actor recreation is confusing — the log message "Actor not alive, so load from weights" fires even though the actor was alive. A cleaner approach would be a flag variable or an explicit condition check.
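
The explicit condition check could be separated from the Ray calls, as in the sketch below. `ActorAction` and `choose_actor_action` are hypothetical names, and the Ray calls themselves are left out. For stopping the actor, Ray's public `ray.kill(actor)` might replace `__ray_terminate__`, assuming a forceful stop is acceptable here.

```python
from enum import Enum


class ActorAction(Enum):
    REUSE = "reuse"                    # existing actor is fine as it is
    RESTART_ON_GPU = "restart_on_gpu"  # alive, but lacks the requested GPU
    CREATE = "create"                  # no live actor, load from weights


def choose_actor_action(actor_alive: bool, actor_has_gpu: bool, use_gpu: bool) -> ActorAction:
    """Decide what to do with an actor without raising exceptions."""
    if not actor_alive:
        return ActorAction.CREATE
    if use_gpu and not actor_has_gpu:
        return ActorAction.RESTART_ON_GPU
    return ActorAction.REUSE


action = choose_actor_action(actor_alive=True, actor_has_gpu=False, use_gpu=True)
print(action.value)  # restart_on_gpu
```

Because each case has its own branch, the log message for "actor not alive" would fire only when the actor really is missing.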

9. get_load_model_status() return type is -> bool but returns mixed types — toktagger/api/routers/models.py

The function returns JSONResponse (202), True (200), or raises. FastAPI will serialize True as true in the 200 response, which is a valid but unusual API shape. The misleading annotation may cause confusion for future callers. The return type should be JSONResponse | bool.

10. VideoCNN is missing @ray.remote decorator — toktagger/api/models/temp.py

All other model classes that serve as Ray actors are decorated with @ray.remote. VideoCNN lacks it and cannot be used as an actor. If this is intentionally incomplete (the filename is temp.py), it should at least have a TODO noting this.

Minor

11. Leftover print() in check_models_status fixture — tests/conftest.py

def check_models_status(request):
    print()  # leftover debug call

12. Redundant ternary in gpu_available() — toktagger/api/models/base.py

return True if assigned_resources.get("GPU") else False
# cleaner:
return bool(assigned_resources.get("GPU"))

Summary

Severity	Count	Key items
Bug	7	GPU flag not tracked per-actor (#1), `use_gpu` param ignored in sample predict (#2), potential `TypeError` (#3), wrong model checked in test (#5), wrong return type (#6), missing test mark (#7)
Medium	3	Exception-as-control-flow + private Ray API (#8), mixed return types behind a `-> bool` annotation (#9), `VideoCNN` missing `@ray.remote` (#10)
Minor	2	Leftover `print()` (#11), redundant ternary (#12)

The most critical issues are #1 and #2 — the GPU tracking data structure is inconsistent with the actual GPU status of each actor, and the sample prediction endpoint ignores the user's use_gpu choice.

wk9874 · 2026-06-23T15:01:58Z

Finally passes the CI

@abdullah-ukaea & @praksharma reminder to give this a quick functionality test when you get a chance pls!

abdullah-ukaea · 2026-06-24T12:16:26Z

> Finally passes the CI
>
> @abdullah-ukaea & @praksharma reminder to give this a quick functionality test when you get a chance pls!

LGTM! Tested on an NVIDIA GeForce RTX 4090. Model training and prediction both work as expected.

praksharma · 2026-06-26T13:29:17Z

Tested on an M4 Max. A GPU was not discovered as expected.

But PyTorch detects the GPU (MPS) for both training and prediction.

(YoloVideoDetectionModel pid=17952) {'model': 'yolo26n.pt', 'epochs': 1, 'batch': 5, 'imgsz': 640, 'workers': 0, 'device': 'mps', 'save': False, 'plots': False, 'val': False, 'close_mosaic': 0}
(YoloVideoDetectionModel pid=17952) Ultralytics 8.4.33 🚀 Python-3.12.2 torch-2.11.0 MPS (Apple M4 Max)
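
For context, the log above is consistent with device selection happening inside the model via PyTorch, independently of the GPU count used by the server. The sketch below shows that kind of selection in general terms. It is not toktagger or Ultralytics code, and it falls back to CPU when PyTorch is not installed.

```python
def pick_torch_device() -> str:
    """Return "cuda", "mps" or "cpu" depending on what PyTorch can see."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch not installed, nothing to probe

    if torch.cuda.is_available():
        return "cuda"

    # The MPS backend exists only in newer PyTorch builds, so probe defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_torch_device())
```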

wk9874 added 23 commits June 1, 2026 17:37

Update actor registry to track GPU tasks

4a39381

Designate actors to all be GPU if GPU is enabled

fe7e19a

Make use_gpus static based on if they are available

68a55f0

Incorporate some of abdullah's fixes, add gpu status to health endpoint

91b5b0c

Add abdullah's fixes to model

b2a0fe5

Add query param to train and predict for use_gpu

d327d86

Add use gpu toggle

0281622

Fix query param

6ccd8bd

Rebuild static

2684900

Move detection to main

4eed25c

Improve handling of cpu and gpu actors

e6acf2f

Handle gpu killing better

1e3763c

Add if

b3a582b

Add try except

1921b1d

Log messages

cc25de0

Allow overriding

c68b7f7

Check validated samples and anns before creating model

a3aa119

Add use GPU toggle to form

b4b4e9f

Add contextual help and docs

9ceb94e

Merge branch 'wk9874/models/local_load_support' into wk9874/models/gpu_support

6e6091c

Fix tests

de5f778

Move use GPU to schema form, improve typing, add e2e test

172e451

Resolve conflict

53b05bd

wk9874 requested review from abdullah-ukaea and samueljackson92 June 15, 2026 16:15

wk9874 changed the title ~~Wk9874/models/gpu support~~ Add GPU Support for ML Models Jun 16, 2026

samueljackson92 reviewed Jun 17, 2026

samueljackson92 assigned wk9874 Jun 18, 2026

samueljackson92 added the enhancement New feature or request label Jun 18, 2026

change to use_gpu in update_actors

3ff8ce8

wk9874 and others added 20 commits June 18, 2026 17:22

Merge branch 'wk9874/models/gpu_support' of github.com:ukaea/toktagger into wk9874/models/gpu_support

ab0a1e2

Fix max_actors calculation

de3fd15

Reduce num GPU in tests to 1

dc940a6

Change Use GPU to Allocate

307bbca

Change Use GPU to Allocate

82eaeb7

Rebuild

435e0dd

chore: update build output [skip ci]

f682b08

Fix mistake in conftest

0424abe

Merge branch 'wk9874/models/gpu_support' of github.com:ukaea/toktagger into wk9874/models/gpu_support

34b4ee2

Try setting env var

4d0f37b

Rebuild

1dad8c9

Merge remote-tracking branch 'origin/main' into wk9874/models/gpu_support

7479b47

chore: update build output [skip ci]

56e9280

Change env var to str

bcdfb66

Merge branch 'wk9874/models/gpu_support' of github.com:ukaea/toktagger into wk9874/models/gpu_support

e95cee8

Merge branch 'dev' into wk9874/models/gpu_support

4f85974

Increase timeout, rebuild

1d74b95

chore: update build output [skip ci]

4a6a1fc

Merge branch 'dev' into wk9874/models/gpu_support

d6ab68a

Merge branch 'wk9874/models/gpu_support' of github.com:ukaea/toktagger into wk9874/models/gpu_support

08c01af

wk9874 and others added 6 commits June 24, 2026 14:05

Fix gpu check in disruption cnn

ae9637f

Merge branch 'dev' into wk9874/models/gpu_support

a79c2fb

chore: update build output [skip ci]

8058095

Fix config default script for bools, update package.lock, rebuild

3e3f596

Merge branch 'wk9874/models/gpu_support' of github.com:ukaea/toktagger into wk9874/models/gpu_support

e702171

chore: update build output [skip ci]

8787a17

wk9874 requested a review from samueljackson92 June 26, 2026 08:36

wk9874 wants to merge 57 commits into dev from wk9874/models/gpu_support

4 participants
