feat(sam3): add SAM3-LiteText image segmentation presets #264

Open
wep21 wants to merge 1 commit into jamjamjon:main from wep21:sam3-litetext

wep21 (Contributor) commented Jun 30, 2026

Summary

Adds SAM3-LiteText (arXiv:2602.12173) — SAM3-Image with the heavy text encoder replaced by a distilled MobileCLIP student. The ViT-H vision encoder, geometry encoder and mask decoder are kept intact, so the presets reuse the existing SAM3 image vision/decoder ONNX (jamjamjon/assets sam3) and only swap in the lightweight MobileCLIP text encoder (wep21/assets sam3-litetext).

  • Config::sam3_litetext_s0() / _s1() / _l() (MobileCLIP-S0 / S1 / MobileCLIP2-L)
  • open-set-segmentation example: sam3-litetext subcommand (reuses the Sam3Image inference path) + README usage
  • scripts/sam3-litetext/: text-encoder ONNX export (fp32 + fp16 via NVIDIA Model Optimizer AutoCast)

Verification

  • text encoder: ONNX Runtime (ORT) output matches PyTorch (fp32 cos=1.0, fp16 cos≈0.99999) for s0/s1/l
  • end-to-end (vision + text + decoder ONNX) reproduces the HF Sam3LiteTextModel output
  • runtime check on dog.jpg with -p dog: SAM3-LiteText scores dog 0.965 vs SAM3-Image 0.972, with equivalent masks
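
The `cos` figures above are cosine similarities between the exported ONNX output and the PyTorch reference. As a rough illustration of that metric (not the PR's actual verification script), the sketch below compares a made-up embedding against itself and against a copy rounded through half precision. The vector contents, the 256-element length, and the helper names `cosine_similarity` and `to_fp16` are assumptions for illustration only; a real check would compare the ORT and PyTorch output tensors.

```python
import math
import struct


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def to_fp16(values):
    """Round each value to IEEE 754 half precision and back (simulates an fp16 cast)."""
    return [struct.unpack("e", struct.pack("e", v))[0] for v in values]


# Hypothetical stand-in for a reference (fp32 PyTorch) text embedding.
reference = [math.sin(0.37 * i) for i in range(1, 257)]
# The same embedding after an fp16 round trip, standing in for an fp16 export.
half = to_fp16(reference)

cos_fp32 = cosine_similarity(reference, reference)
cos_fp16 = cosine_similarity(reference, half)
print(f"fp32 cos={cos_fp32:.6f}, fp16 cos={cos_fp16:.6f}")
```

A value very close to 1.0 means the two outputs point in the same direction. This only simulates rounding of the final output, so it says nothing about error accumulated inside a real fp16 graph.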

Quality / perf vs SAM3-Image (dog.jpg)

| Metric | SAM3-Image | SAM3-LiteText-S0 |
| --- | --- | --- |
| detection | dog 0.972 | dog 0.965 |
| text encoder ONNX (fp16) | 674 MB | 83 MB (~8× smaller) |
| TensorRT total per call (proc CUDA) | 486 ms | 449 ms |

The benefit is text-encoder memory/size (~8×) and load time; end-to-end latency is similar (ViT-H vision dominates).
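
The ratios behind that statement can be recomputed from the reported numbers. This is plain arithmetic on the figures above, with variable names chosen here for illustration:

```python
# Figures reported in the comparison for dog.jpg.
sam3_text_mb = 674.0      # SAM3-Image text encoder ONNX (fp16), MB
litetext_text_mb = 83.0   # SAM3-LiteText-S0 text encoder ONNX (fp16), MB
sam3_ms = 486.0           # SAM3-Image TensorRT total per call, ms
litetext_ms = 449.0       # SAM3-LiteText-S0 TensorRT total per call, ms

size_ratio = sam3_text_mb / litetext_text_mb
latency_saving = (sam3_ms - litetext_ms) / sam3_ms

print(f"text encoder: {size_ratio:.1f}x smaller")          # 8.1x smaller
print(f"end-to-end latency: {latency_saving:.1%} lower")   # 7.6% lower
```

The size reduction is roughly 8×, while the per-call latency difference is under 8%, consistent with the vision encoder dominating runtime.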

🤖 Generated with Claude Code

SAM3-LiteText (arXiv:2602.12173) is SAM3-Image with the heavy text encoder
replaced by a distilled MobileCLIP student; the ViT-H vision encoder, geometry
encoder and mask decoder are kept intact. The presets therefore reuse the SAM3
image vision/decoder ONNX (jamjamjon/assets `sam3`) and only swap in the
lightweight MobileCLIP text encoder (wep21/assets `sam3-litetext`).

- Config::sam3_litetext_s0() / _s1() / _l() (MobileCLIP-S0 / S1 / MobileCLIP2-L)
- open-set-segmentation example: `sam3-litetext` subcommand (reuses the
  Sam3Image inference path) + README usage
- scripts/sam3-litetext: text-encoder ONNX export (fp32 + fp16 via modelopt
  AutoCast), verified ORT==PyTorch (fp32 cos=1.0, fp16 cos~=0.99999)

Quality matches SAM3-Image (dog.jpg: 0.972 vs 0.965); the text encoder is ~8x
smaller (674MB -> 83MB) with equivalent end-to-end masks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>