feat(sam3): add SAM3-LiteText image segmentation presets by wep21 · Pull Request #264 · jamjamjon/usls

wep21 · 2026-06-30T15:25:39Z

Summary

Adds SAM3-LiteText (arXiv:2602.12173): SAM3-Image with the heavy text encoder replaced by a distilled MobileCLIP student. The ViT-H vision encoder, geometry encoder and mask decoder are kept intact, so the presets reuse the existing SAM3 image vision/decoder ONNX (jamjamjon/assets `sam3`) and only swap in the lightweight MobileCLIP text encoder (wep21/assets `sam3-litetext`).

- Config::sam3_litetext_s0() / _s1() / _l() (MobileCLIP-S0 / S1 / MobileCLIP2-L)
- open-set-segmentation example: `sam3-litetext` subcommand (reuses the Sam3Image inference path) + README usage
- scripts/sam3-litetext/: text-encoder ONNX export (fp32 + fp16 via NVIDIA Model Optimizer AutoCast)

Verification

- text encoder ONNX: ORT output matches PyTorch (fp32 cos=1.0, fp16 cos≈0.99999) for s0/s1/l
- end-to-end (vision + text + decoder ONNX) reproduces the HF Sam3LiteTextModel output
- runtime on dog.jpg with -p dog: SAM3-LiteText scores dog 0.965 vs 0.972 for SAM3-Image, with equivalent masks
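
The ORT-versus-PyTorch check above is reported as a cosine similarity. A minimal sketch of that comparison, assuming both encoder outputs have already been flattened to equal-length lists of floats. The helper name `cosine_similarity`, the sample vectors, and the 0.9999 threshold are illustrative assumptions and are not taken from scripts/sam3-litetext:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length flat vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical stand-ins for flattened text-encoder outputs
# (reference = PyTorch, candidate = ONNX Runtime).
reference = [0.12, -0.53, 0.98, 0.07]
candidate = [0.1201, -0.5298, 0.9803, 0.0699]

sim = cosine_similarity(reference, candidate)
passes = sim > 0.9999  # assumed tolerance, chosen only for illustration
print(f"cos={sim:.6f} pass={passes}")
```

In practice the two lists would come from running the same tokenized prompt through both backends; a lower tolerance is usually appropriate for fp16 than for fp32.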

Quality / perf vs SAM3-Image (dog.jpg)

| | SAM3-Image | SAM3-LiteText-S0 |
|---|---|---|
| detection | dog 0.972 | dog 0.965 |
| text encoder ONNX (fp16) | 674 MB | 83 MB (~8× smaller) |
| TensorRT total /call (proc CUDA) | 486 ms | 449 ms |

The benefit is text-encoder memory/size (~8×) and load time; end-to-end latency is similar (ViT-H vision dominates).
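
The ratios behind that summary can be recomputed from the reported figures. A small sketch; the variable names are illustrative, and the numbers are the dog.jpg measurements quoted above:

```python
# Reported figures (SAM3-Image vs SAM3-LiteText-S0).
sam3_text_mb, lite_text_mb = 674, 83  # fp16 text encoder ONNX size
sam3_ms, lite_ms = 486, 449           # TensorRT total per call

size_ratio = sam3_text_mb / lite_text_mb
latency_saving = (sam3_ms - lite_ms) / sam3_ms

summary = (
    f"{size_ratio:.1f}x smaller text encoder, "
    f"{latency_saving:.1%} lower total latency per call"
)
print(summary)
# prints: 8.1x smaller text encoder, 7.6% lower total latency per call
```

The size reduction is large while the latency change is under 10%, which is consistent with the vision encoder dominating the per-call cost.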

🤖 Generated with Claude Code

SAM3-LiteText (arXiv:2602.12173) is SAM3-Image with the heavy text encoder replaced by a distilled MobileCLIP student; the ViT-H vision encoder, geometry encoder and mask decoder are kept intact. The presets therefore reuse the SAM3 image vision/decoder ONNX (jamjamjon/assets `sam3`) and only swap in the lightweight MobileCLIP text encoder (wep21/assets `sam3-litetext`).

- Config::sam3_litetext_s0() / _s1() / _l() (MobileCLIP-S0/S1/L)
- open-set-segmentation example: `sam3-litetext` subcommand (reuses the Sam3Image inference path) + README usage
- scripts/sam3-litetext: text-encoder ONNX export (fp32 + fp16 via modelopt AutoCast), verified ORT==PyTorch (fp32 cos=1.0, fp16 cos~=0.99999)

Quality matches SAM3-Image (dog.jpg: 0.972 vs 0.965); the text encoder is ~8x smaller (674MB -> 83MB) with equivalent end-to-end masks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

wep21 force-pushed the sam3-litetext branch from 9a14dea to 0deab24 Compare June 30, 2026 15:30

wep21 wants to merge 1 commit into jamjamjon:main from wep21:sam3-litetext