| Grounded SAM has become an essential tool in my toolbox (for others: it lets you mask any image using only a text prompt). HUGE thank you to the team at Meta, I can't wait to try SAM2! |
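A minimal sketch of that text-prompt-to-mask workflow, assuming the GroundingDINO and segment-anything packages with their published checkpoints; the paths, prompt, and thresholds are illustrative placeholders:

```python
# Hedged sketch: text prompt -> boxes (Grounding DINO) -> masks (SAM).
# Checkpoint/config paths and the prompt are illustrative, not prescriptive.
import numpy as np
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import sam_model_registry, SamPredictor

dino = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
image_source, image = load_image("photo.jpg")  # HWC uint8 RGB array + model tensor

# 1) Text prompt -> bounding boxes (normalized cxcywh)
boxes, logits, phrases = predict(
    model=dino, image=image, caption="soccer ball",
    box_threshold=0.35, text_threshold=0.25,
)

# 2) Boxes -> pixel-accurate masks with SAM
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image_source)

h, w, _ = image_source.shape
for box in boxes:  # convert normalized cxcywh -> absolute xyxy
    cx, cy, bw, bh = box.cpu().numpy() * np.array([w, h, w, h])
    xyxy = np.array([cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2])
    masks, scores, _ = predictor.predict(box=xyxy, multimask_output=False)
    print(phrases, scores, masks.shape)  # one boolean mask per detected box
```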
| Yes, I did give it a glance, polite and clever HN member: it showed an object in a sequence of images extracted from video, and evidently followed that object through the sequence.
Perhaps, however, my interpretation of what happens here is way off, which is why I asked in an obviously incorrect and stupid way that you have pointed out to me without clarifying exactly why it was incorrect and stupid. So anyway, there is the extraction of the object I referred to, but it also seems to follow the object through a sequence of scenes? https://github.com/facebookresearch/segment-anything-2/raw/m... So it seems to me that they identify the object and follow it for a contiguous sequence: img1, img2, img3, img4. Is my interpretation incorrect here? What I am wondering is: what happens if the object is not in img3 - for example two people talking, with the viewpoint shifting from the person talking to the person listening, so the person talking is in img1, img2, img4. Can you get that sequence, or is the sequence just img1, img2? It says "We extend SAM to video by considering images as a video with a single frame.", which I don't know what that means. Does it mean that they concatenated all the video frames into a single image and identified the object in them? In that case their example still shows contiguous images without the object ever disappearing, so my question still pertains. So anyway my conclusion is that what you said when addressing me was wrong, to quote: "what SAM does is immediately apparent when you view the home page" - because I (the you addressed) viewed the homepage and still wondered about some things? Obviously wrong things that you have identified as being wrong. And thus my question is: if what SAM does is immediately apparent when you view the home page, can you point out where my understanding has failed? On edit: grammar fixes for last paragraph / question. |
| > A segment then is a collection of images that follow each other in time?
A segment is a visually distinctive... segment of an image; segmentation is basically splitting an image into objects: https://segment-anything.com. As such it has nothing to do with time or video. Now, SAM 2 is about video, so they seem to add object tracking (that is, attributing the same object to the same segment across frames). The videos in the main article demonstrate that it can track objects in and out of frame (the one with the bacteria, or the one with the boy going around the tree). However, they do acknowledge this part of the algorithm can produce incorrect results sometimes (the example with the horses). The answer to your question is img1, img2, img4, as there is no reason to believe that it can only track objects in a contiguous sequence. |
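A rough sketch of how that video tracking is exposed, following the usage pattern shown in the segment-anything-2 repo README (checkpoint/config paths and the click coordinates are placeholders, and exact function names may have changed since release):

```python
# Hedged sketch of SAM 2 video tracking, based on the repo README's usage pattern.
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    # The video is given as a directory of frames: img1, img2, img3, ...
    state = predictor.init_state(video_path="frames/")

    # Click once on the object (e.g. the person talking) in frame 0.
    frame_idx, object_ids, masks = predictor.add_new_points(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[420, 260]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),  # 1 = foreground click
    )

    # Propagate the prompt through the whole video; frames where the object is
    # absent simply yield an (almost) empty mask, and tracking resumes when it returns.
    for frame_idx, object_ids, masks in predictor.propagate_in_video(state):
        print(frame_idx, object_ids, [(m > 0).sum().item() for m in masks])
```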
| I covered SAM 1 a year ago (https://news.ycombinator.com/item?id=35558522). Notes from a quick read of the SAM 2 paper https://ai.meta.com/research/publications/sam-2-segment-anyt...
1. SAM 2 was trained on 256 A100 GPUs for 108 hours (SAM 1 was 68 hrs on the same cluster). Taking the upper-end ~$2/hr A100 cost off gpulist, that means SAM 2 cost ~$50k to train - surprisingly cheap for adding video understanding?
2. New dataset: the new SA-V dataset is "only" 50k videos, with careful attention given to scene/object/geographical diversity, including that of the annotators. I wonder if LAION or Datacomp (AFAICT the only other real players in the open image data space) can reach this standard...
3. Bootstrapped annotation: similar to SAM 1, a 3-phase approach where 16k initial annotations across 1.4k videos were then expanded to 63k+197k more with SAM 1+2 assistance, with annotation time accelerating dramatically (89% faster than with SAM 1 only) by the end.
4. Memory attention: SAM 2 is a transformer with memory across frames! Special "object pointer" tokens are stored in a "memory bank" FIFO queue of recent and prompted frames (see the sketch below). Has this been explored in language models? whoa? (written up in https://x.com/swyx/status/1818074658299855262) |
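A conceptual sketch of that memory-bank idea from point 4 - not the paper's implementation, just an illustration of cross-attending to a FIFO queue of per-frame features; all module names and shapes here are invented for the example:

```python
# Illustrative only: a FIFO "memory bank" that each new frame cross-attends to.
# This is NOT the SAM 2 code; dimensions and class names are made up.
from collections import deque
import torch
import torch.nn as nn


class MemoryBankAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8, max_memories: int = 7):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.memories = deque(maxlen=max_memories)  # FIFO: oldest frame drops out

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        """frame_tokens: (B, N, dim) features of the current frame."""
        if self.memories:
            memory = torch.cat(list(self.memories), dim=1)  # (B, N*k, dim)
            # Condition the current frame on past (and prompted) frames.
            frame_tokens, _ = self.cross_attn(frame_tokens, memory, memory)
        # Store this frame's (detached) features for future frames.
        self.memories.append(frame_tokens.detach())
        return frame_tokens


# Usage: feed frames in temporal order; each one sees up to the last 7 frames.
attn = MemoryBankAttention()
for _ in range(10):
    out = attn(torch.randn(1, 64, 256))
print(out.shape)  # torch.Size([1, 64, 256])
```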
| I might be in the minority, but I am not that surprised by the results or the not-so-significant GPU hours. I've been doing video segment tracking for a while now, using SAM for mask generation and some of the robust academic video-object segmentation models for tracking the mask (see CUTIE: https://hkchengrex.com/Cutie/, presented at CVPR this year).
I need to read the SAM 2 paper, but point 4 sounds a lot like what Rex has in CUTIE. CUTIE can consistently track segments across video frames even if they get occluded or go out of frame for a while. |
| I tried it on the default video (the white soccer ball), and it seems to really struggle with the trees in the background; maybe you could benefit from more such examples. |
| > This research demo is not open to residents of, or those accessing the demo from, the States of Illinois or Texas.
Are there laws in those places that are stricter than in California or the EU? |
| I’m not sure what exactly you are looking for a reference to, but segmentation as a preprocessing step for tracking has been one of the most typical workflows, if not the primary one, for decades. |
| I think the first SAM is the open source model I've gotten the most mileage out of. Very excited to play around with SAM2! |
| I used the original SAM (alongside Grounding DINO) to create an ever-growing database of all the individual objects I see as I go about my daily life. It automatically parses all the photos I take on my Meta Raybans and my phone, along with all my laptop screenshots. I made it for an artwork that's exhibiting in Australia, and it will likely form the basis of many artworks to come.
I haven't put it up on my website yet (and proper documentation is still coming), so unfortunately the best I can do is show you an Instagram link: https://www.instagram.com/p/C98t1hlzDLx/?igsh=MWxuOHlsY2lvdT... Not exactly functional, but fun. Artwork aside, it's quite interesting to see your life broken into all its little bits. Provides a new perspective (apparently, there are a lot more teacups in my life than I notice). |
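For anyone curious how that kind of object harvesting can be wired up: a minimal sketch using SAM's automatic mask generator to crop every segmented object out of a photo. The Grounding DINO side is left out, and all paths and the output layout are placeholders, not the commenter's pipeline:

```python
# Hedged sketch: break one photo into its individual objects with SAM's
# automatic mask generator, then save each object as its own cropped image.
import os
import cv2
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # list of dicts: 'segmentation', 'bbox', 'area', ...

os.makedirs("objects", exist_ok=True)
for i, m in enumerate(sorted(masks, key=lambda m: m["area"], reverse=True)):
    x, y, w, h = map(int, m["bbox"])                      # bbox is XYWH
    cutout = image[y:y + h, x:x + w].copy()
    cutout[~m["segmentation"][y:y + h, x:x + w]] = 255    # white out background pixels
    cv2.imwrite(f"objects/object_{i:03d}.png", cv2.cvtColor(cutout, cv2.COLOR_RGB2BGR))
```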
| After playing with the SAM2 demo for far too long, my immediate thought was: this would be brilliant for things like (accessible, responsive) interactive videos. I've coded up such a thing before[1], but that uses hardcoded data to track the position of the geese, and a filter to identify the swans. When I loaded that raw video into the SAM2 demo, it had very little problem tracking the various birds - which would make building the interactivity on top of it very easy, I think.
Sadly my knowledge of how to make use of these models is limited to what I learned playing with some (very ancient) MediaPipe and TensorFlow models. Those models provided some WASM code to run the model in the browser, and I was able to find the data from that to pipe it through to my canvas effects[2]. I'd love to get something similar working with SAM2! [1] - https://scrawl-v8.rikweb.org.uk/demo/canvas-027.html [2] - https://scrawl-v8.rikweb.org.uk/demo/mediapipe-003.html |
| > This research demo is not open to residents of, or those accessing the demo from, the States of Illinois or Texas.
Alright, I'll bite, why not? |
| Nice! Of particular interest to me are the slightly improved mIoU and the 6x speedup on images [1] (though they say the speedup comes mainly from the more efficient encoder, so multiple segmentations of the same image would presumably see less benefit?). It would also be nice to get a comparison to the original SAM with bounding box inputs - I didn't see that in the paper, though I may have missed it.
[1] - page 11 of https://ai.meta.com/research/publications/sam-2-segment-anyt... |
| > We extend SAM to video by considering images as a video with a single frame.
I can't make sense of this sentence. Is there some mistake? |
| One thing it's enabled is automated annotation for segmentation, even on out-of-distribution examples. E.g., in the first 7 months of SAM, users on Roboflow used SAM-powered labeling to label over 13 million images, saving ~21 years[0] of labeling time. That doesn't include labeling from self-hosting autodistill[1] for automated annotation either.
[0] Based on comparing avg labeling session time for individual polygon creation vs SAM-powered polygon examples. [1] https://github.com/autodistill/autodistill |
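A small sketch of what that kind of automated annotation looks like with autodistill, following the GroundedSAM pattern in its README; the ontology, folder paths, and target model are placeholders:

```python
# Hedged sketch of SAM-assisted auto-labeling with autodistill, based on the
# GroundedSAM example in its README; prompts and paths are placeholders.
from autodistill.detection import CaptionOntology
from autodistill_grounded_sam import GroundedSAM
from autodistill_yolov8 import YOLOv8

# Map text prompts -> class names for the labels you want to generate.
base_model = GroundedSAM(
    ontology=CaptionOntology({"shipping container": "container"})
)

# Auto-label a folder of raw images; a labeled dataset is written alongside it.
base_model.label("./images", extension=".jpg")

# Optionally distill the auto-labels into a small, fast model.
target_model = YOLOv8("yolov8n.pt")
target_model.train("./images_labeled/data.yaml", epochs=50)
```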
| I used it for segmentation for this home climbing/spray wall project: https://freeclimbs.org/wall/demo/edit-set
It does detection on the backend and then feeds those bounding boxes into SAM running in the browser. This is a little slow on the first pass, but it allows the user to adjust the bboxes and get new segmentations in nearly real time, without putting a ton of load on the server. It saved me having to label a bunch of holds with precise masks/polygons (I labeled 10k for the detection model and that was quite enough). I might try using SAM's output to train a smaller model in the future; I haven't gotten around to it yet. (The site is early in development and not ready for actual users, but feel free to mess around.) |
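The box-to-mask step in that workflow looks roughly like this with the segment-anything API - a sketch only; the server-side detector is omitted and the box values are stand-ins:

```python
# Hedged sketch: refine detector bounding boxes into precise masks with SAM.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("wall.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # one (expensive) embedding pass per image

detected_holds = [
    np.array([120, 340, 180, 410]),  # XYXY boxes from the detector (stand-ins)
    np.array([300, 220, 355, 280]),
]

for box in detected_holds:
    # Each box prompt reuses the cached image embedding, so re-running after
    # the user nudges a box is nearly instant.
    masks, scores, _ = predictor.predict(box=box, multimask_output=False)
    print(box, scores[0], masks[0].sum(), "mask pixels")
```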
| This is great! Can someone point me to examples of how to bundle something like this to run offline in a browser, if that's possible at all? |
This is what I was getting at: I tried it on my MBP and had no luck. It might just be an installer issue, but I wanted confirmation from someone with more know-how before diving in.