feat(sandbox): implement snapshot and restore functionality for sandb…#21
Open
saggiyogesh wants to merge 13 commits into
Open
feat(sandbox): implement snapshot and restore functionality for sandb…#21saggiyogesh wants to merge 13 commits into
saggiyogesh wants to merge 13 commits into
Conversation
…oxes - Added new methods for snapshotting and restoring sandboxes in the SandboxService. - Updated AutoLifecycleConfig to include parameters for snapshotting and restoring. - Refactored lifecycle management to support automatic snapshotting of idle sandboxes. - Enhanced network namespace management to ensure proper configuration during snapshot and restore operations. - Updated API routes to replace start/stop with snapshot/restore for better clarity in sandbox lifecycle management.
- Renamed API endpoints from /start to /snapshot and /resume to /restore for clarity. - Updated summaries and descriptions to reflect new functionality for snapshotting and restoring sandboxes. - Removed deprecated start, stop, and pause endpoints from the OpenAPI specification. - Cleaned up initial data population logic by removing unused system image creation code.
- Added `github.com/cenkalti/backoff/v4` version 4.3.0 to manage retry logic for transient errors. - Updated OpenAPI specification to include `snapshotted` state in the lifecycle enum for better clarity on sandbox states. - Refactored sandbox service methods to improve error handling and ensure proper context management during network configuration and sandbox restoration.
…t management - Introduced SetSnapshottedAtAndOrg method in ISandboxRepository to update the status and timestamp for a sandbox based on its ID and organization ID. - Updated SandboxService to utilize the new method for setting the snapshotted state, enhancing error handling and ensuring proper state management.
…rialization Introduces a refcounted, per-sandbox-ID keyed mutex used to serialize lifecycle operations (Snapshot, Restore, Delete, auto-snapshot, auto-delete) on the same sandbox. Prevents concurrent operations from spawning duplicate VMM processes, corrupting snapshot directories, killing reused PIDs, or leaving DB status mismatched with the runtime. Entries are removed from the map when no longer held, so the working set stays proportional to in-flight operations rather than the total number of sandboxes the process has ever touched. Track this file now so subsequent commits that wire it into SandboxService, LifecycleManager, and server setup can rely on the type being defined.
LifecycleManager runs two queries every 30s tick (default CheckIntervalSec):
- FindIdleRunning filter: {status: "running", lastActivityAt: { $lt: t }}
- FindStaleSnapshotted filter: {status: "snapshotted", snapshottedAt: { $lt: t }}
Until now only {orgId} was indexed on the sandboxes collection, so each
sweep did a full collection scan. At ~10k sandboxes that's two full scans
per minute on the hot collection. The compound indexes turn both into
index range scans (equality on status + range on the timestamp field is
fully covered).
Switched to CreateMany so all three indexes ship in one round-trip;
behavior on failure unchanged (warn-only, Init still succeeds so the
service can come up if Mongo is briefly read-only).
SEC-04 stopgap. On a dense host with rapid snapshot/restore churn, the kernel can reuse a freshly-freed PID before our pidfile is invalidated. A subsequent forceKillByPIDFile would then SIGKILL an unrelated process that happens to share that PID. This change reads /proc/<pid>/cmdline and refuses to signal unless argv[0] matches the configured cloud-hypervisor binary (by absolute path or basename). When the cmdline doesn't match, we log a warning and return nil — the real CLH has already exited and there's nothing for us to kill. Mechanism mirrors the existing runtime.InstancesRoot pattern: a package-level CHBinary var set once at startup from cfg.CHBinary via runtime.SetCHBinary, wired in server.New. When CHBinary is unset (unit tests that bypass server.New), the check degrades to legacy behavior, so existing TestForceKillByPIDFile keeps passing. Added TestForceKillByPIDFile_RefusesNonCH regression test. Proper race-free follow-up (pidfd_open + pidfd_send_signal) is folded into the actor refactor next week.
SEC-01. iptables is first-match-wins. With the conntrack ESTABLISHED,RELATED ACCEPT sitting at the top of the FORWARD chain, a stale conntrack entry from before a policy update would short-circuit the destination DROPs that came after it. A guest holding an existing flow to a destination later added to the blocklist could keep reaching it via the conntrack accept. Move the ESTABLISHED,RELATED ACCEPT to the end so policy decisions are based on destination first, conntrack state second. Effective order is now: MAC anti-spoof -> destination DROPs (169.254.169.254 + RFC1918) -> DNS ACCEPTs -> ESTABLISHED,RELATED ACCEPT fall-through. Default policy stays ACCEPT, so the ESTABLISHED rule is technically redundant today but kept for the day we tighten default to DROP. Scope: affects only sandboxes whose netns is created after this change. Existing snapshotted sandboxes keep the old order baked into their persistent netns until SEC-02 (reapply iptables on restore) ships. TestNetworkNSCreationAndIptables now asserts ESTABLISHED comes after both the destination DROPs and the DNS ACCEPTs; live iptables -L confirms the order on the wire.
…ion DROPs" This reverts commit 178bfb3.
- Introduced BalloonEnabled flag in SandboxConfig to enable virtio-balloon device for memory management. - Updated default configuration to enable ballooning by default. - Added validateNameservers function to enforce strict validation of DNS nameservers, ensuring only valid public IPs are accepted. - Enhanced New function to include balloon configuration and nameserver validation during sandbox setup.
- Introduced comprehensive tests for SandboxLifecycleLocks to ensure mutual exclusion, no contention across different IDs, and proper handling of multiple holders for the same ID. - Added a test to verify reference counting under high churn conditions, ensuring no lock entries are leaked. - Implemented a helper function to track the current number of active lock entries for testing purposes.
- Deleted the outdated document detailing the snapshot/restore scale and security review, which is no longer relevant to the current project direction. - This document contained assessments and recommendations that have been superseded by recent developments in the snapshot/restore functionality.
- Updated OpenAPI specification to change the endpoint paths from `/sandboxes/{id}/snapshot` to `/sandboxes/{id}/sleep` and from `/sandboxes/{id}/restore` to `/sandboxes/{id}/wake`.
- Modified the summaries and descriptions to reflect the new functionality: putting a sandbox to sleep and waking it from a persisted state.
- Adjusted the server routing to align with the new endpoint names and operations.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
…oxes