feat: N=1 cooperative fiber scheduler replacing per-thread host threads#137
Draft
smmathews wants to merge 5 commits into
Draft
feat: N=1 cooperative fiber scheduler replacing per-thread host threads#137smmathews wants to merge 5 commits into
smmathews wants to merge 5 commits into
Conversation
Replaces the thread-per-PS2-thread model with a single executor thread
(g_guest_thread) running all guest fibers via POSIX ucontext. Implements
correct publish-before-arm wakeup ordering, SleepThread Mesa loop,
generation-token-validated cross-thread wakeups, borrowed-worker guards,
and a clean shutdown path.
New files:
- ps2_scheduler.{h,cpp,_internal.h}: scheduler core, arm_park/block_current
protocol, N=1 executor, async_guest_begin/end, yield_point fast-path exit
- ps2_fiber.{h,cpp}: ucontext fiber backend with guard pages
Key correctness properties:
- Publish-before-arm: waiters push to object wait-list under object mutex,
release it, then arm_park(). Wakers seeing g_running_fiber==fc set
wake_pending instead of touching fc->next; block_current() consumes it.
make_ready/enqueue_external_wakeup have no state==Blocked gate so
publish-window wakes are never dropped.
- SleepThread Mesa loop: re-arms every iteration; distinguishes terminate /
forceRelease / genuine wakeupCount / spurious; re-asserts THS_WAIT on
spurious so ResumeThread without a permit does not count as a wakeup.
- Generation tokens: SemaInfo/EventFlagInfo waitLists store {tid, token}
pairs; all cross-thread wakeups use enqueue_external_wakeup_validated so
a recycled tid with a new generation drops stale wakeups.
- Borrowed-worker guards: all 7 self-targeting syscalls (tid==0 → self)
return KE_ILLEGAL_THID when g_currentThreadId==-1 (IRQ/alarm context).
- Finished-branch teardown: g_fiber_map entry erased under g_sched_mutex
before dropping the lock for munmap, closing a UAF window where a
borrowed worker recycling the tid could have its new FiberContext
immediately clobbered by the old teardown's post-relock erase.
- Shutdown: stop callback + yield_point fast-path throw ensure executor
exits even for fibers that rarely hit the 128-sample yield gate.
Tests: 46 new scheduler test cases (Y1/Y2/Y3/Y4/Z1 suites) added to
ps2_runtime_expansion_tests.cpp covering all the above properties plus
regression tests for every correctness class fixed during development.
dcd18be to
b01382c
Compare
Contributor
Author
|
leaving this in draft while I give it a more thorough review, but overall I think it should be a more robust solution than #134 |
~446 lines removed, all behavior-neutral: - Remove rate-limited RUNTIME_LOG/debug-trace blocks from Thread.cpp, Sync.cpp, and Interrupt.cpp that buried the Mesa-monitor control flow under counter statics and stream expressions. - Remove dead log-feeder locals (beforeCount, afterCount, newBits, statusAfter, newWakeupCount) that only existed to feed the deleted logs. - Unify dispatchIntcHandlersForCause and dispatchDmacHandlersForCause into one dispatchHandlersForCause helper — they were identical in structure, differing only in map, mask, and tag. - Collapse triple-line // ---- banner comments above functions to nothing or a single line; keep the explanatory blocks above wake_locked, arm_park, block_current, and guest_executor_main. - Compress over-documented field comments in ps2_scheduler_internal.h to two-liners with pointers to the authoritative explanation in the .cpp. - Drop the unused runtime parameter from waitWhileSuspended and the dead s_rpcInvokeStackBase thread-local. - Remove one unreachable line in ps2_fiber.cpp (the author's own comment already called it unreachable). - Trim the pc==0 diagnostic dump in ps2_runtime.cpp (normal thread return, not an error).
Contributor
Author
|
still leaving in draft, I don't like how vita is broken and I can't test it. working on it. |
guest_executor_main unlocked before munmap but never re-locked, so the next g_sched_cv.wait() ran with an unowned unique_lock. On the break path lk.unlock() then threw EPERM, propagating out of the thread function and hitting std::terminate.
…n hardening - ps2_fiber.cpp: add SceFiber backend for PLATFORM_VITA; replace the old #error stub with a full implementation using sceFiberRun/sceFiberReturnToThread with guard-paged mmap stacks. sceFiberFinalize is always called (not just when started) since sceFiberInitialize always runs in ps2fiber_alloc. pthread backend gains guard-paged mmap stacks via pthread_attr_setstack. - ps2_fiber.h: widen ps2fiber_set_tid guard to cover PS2X_FIBER_PTHREAD too. - ps2_scheduler.cpp: call ps2fiber_set_tid in create_fiber for Vita/pthread; join_fiber handles non-fiber callers (host thread TerminateThread) via g_sched_cv.wait instead of a fiber yield; block_current throws ThreadExitException on shutdown instead of returning WokenInWindow so Mesa wait-loops unwind cleanly without an extra re-block cycle. - Sync.cpp: WaitEventFlag uses a Mesa-style re-block loop for fibers; non-fiber path gets one bounded yield only. - Lifecycle.cpp: reset g_activeThreads in notifyRuntimeStop so test cycles that restart the scheduler begin with a clean count. - RPC.cpp: waitList stores pair<int,uint64_t>; use .first when extracting tid. - CMakeLists.txt: add PS2X_FIBER_PTHREAD option (Linux TSan/ASan testing) and PS2X_SANITIZE cache variable. - Test fixes: T4 no longer spins (avoids priority deadlock); T5 waits for counter stability before sampling suspend; T10/T11 use WakeupThread instead of enqueue_external_wakeup; U2 priority/sequence corrected.
…spatch, fix fiber TLS under pthread
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What was wrong
PS2 thread scheduling is cooperative and strictly priority-ordered. When two threads are runnable, the highest-priority one runs uninterrupted until it explicitly yields — there is no time-slicing, no preemption, no "two threads run at once." The previous implementation modeled each PS2 thread as a host
std::thread, handing scheduling decisions to the OS. That is structurally incompatible with EE semantics.The visible symptoms in DQ8:
ResumeThreadun-suspends a sleeping thread but does not count as aWakeupThreadpermit — the thread goes back to sleep. The old model woke the host thread unconditionally, so DQ8's sleep loops would proceed when they shouldn't and corrupt the state machine they were protecting.scheduler_shutdowncould block indefinitely.SuspendThread(0)would try to look up its own thread ID, get-1, corrupt the thread table, and eventually crash.These problems are not DQ8-specific. Any game that relies on EE-accurate cooperative scheduling, explicit priority ordering, or the expected behavior of
SleepThread/WakeupThreadcan hit them. DQ8 is where they became unavoidable.What changed
New: N=1 fiber scheduler (
ps2_scheduler.{h,cpp,_internal.h},ps2_fiber.{h,cpp})All PS2 threads now run as fibers (POSIX
ucontext_twith mmap'd stacks and guard pages) on a single dedicated executor thread. Only one fiber runs at any moment. The scheduler maintains a priority-sorted run queue; when the currently-running fiber blocks or yields, the next-highest-priority fiber resumes immediately. This is structurally the same as what the EE BIOS does.The correctness-critical protocol is the arm_park / block_current wakeup window. The outline: a fiber publishes itself to an object's wait-list under the object's mutex, releases that mutex, then calls
arm_park()(sets state=Blocked) followed byblock_current()(actually suspends). A waker that fires after the publish but beforeblock_current()setswake_pendingrather than touching the fiber's run-queue links —block_current()sees this and returns immediately without ever parking. This closes the lost-wakeup window that exists in any lock-then-block pattern.SleepThreadnow uses a Mesa re-check loop. Each iteration: arm_park → block_current → re-acquire lock → check terminate / forceRelease / wakeupCount. AResumeThreadthat finds no permit re-parks the fiber immediately. AWakeupThreadthat incrementswakeupCountto 1 releases it. These are now unambiguous.Generation tokens on semaphore and event flag wait-lists.
SemaInfo::waitListandEventFlagInfo::waitListnow store{tid, token}pairs where the token encodes{generation, tid}. All cross-thread wakeups (SignalSema,DeleteSema,SetEventFlag,DeleteEventFlag, vsync wait-list,notifyRuntimeStop) useenqueue_external_wakeup_validated(tid, token)which drops the wakeup if the tid has been recycled.Borrowed-worker guards. Seven self-targeting syscalls (
SuspendThread,ResumeThread,TerminateThread,ReferThreadStatus,CancelWakeupThread,ChangeThreadPriority,RotateThreadReadyQueue) now checkg_currentThreadId == -1before resolvingtid == 0 → selfand returnKE_ILLEGAL_THID. Blocking syscalls gateThreadInfocreation ononFiber, so a borrowed worker can never aliasg_threads[-1].Shutdown.
scheduler_set_stop_callbacklets the runtime register a callback that fires insidescheduler_shutdown, kicking the executor's dispatch loop.yield_point()has an unsampled fast-path: ifg_stop && terminateRequested, it throwsThreadExitExceptionimmediately rather than waiting for the 128-backedge gate. Both are necessary: the callback wakes a fiber blocked ing_sched_cv.wait, and the fast-path exits a fiber that is running but rarely yields.Finished-branch teardown fix. The executor's
Finishedbranch now erases theg_fiber_mapentry and moves theFiberContextout to a local before droppingg_sched_mutexfor themunmap. Previously the map entry was erased after re-acquiring the lock, and a borrowed worker racing inasync_guest_begin→StartThreadon the same tid during that window could have its newFiberContextimmediately destroyed by the old teardown's post-relock erase.How it applies to DQ8
DQ8 has a thread hierarchy with strict priority ordering: a main game loop thread at priority 32, several subsystem threads at higher priorities (lower numbers), and timer/alarm callbacks that fire as borrowed workers. The old model's incorrect
SleepThreadbehavior was the most immediate symptom, but the priority and suspend/resume semantics were also wrong in ways that matter for the save/load system and the enemy AI scheduler. The new model makes all of these correct by construction.How it applies to other games
Any recompiled PS2 game that uses the EE threading syscalls gets these fixes. The correctness invariants (single executor, priority queue, publish-before-arm wakeup protocol, Mesa sleeps, generation-validated wakeups, borrowed-worker safety) are structural — they apply to the scheduler itself, not to DQ8-specific code paths.
Tests
46 new scheduler test cases in
ps2_runtime_expansion_tests.cppcovering:SignalSemahammered by a concurrent thread with no coordination — the waiter must never miss a signal (X1).notify_allcoverage underasync_guest_begin/endcontention (X2).ResumeThreadwithout aWakeupThreadpermit must not wake the sleeper; the sleeper re-parks and eventually wakes on a realWakeupThread(Y1).block_current's shutdown guard (Y2, W3).g_threads[-1]non-creation asserted after each call (Y3).