July 05, 2025
Disclaimer:
This project was completed entirely on personal time and hardware. It is not affiliated with, endorsed by, or representative of any institutions or organizations with which I am affiliated. The views and opinions expressed herein are solely my own and do not represent those of my employer or any associated institutions.
LLMs and VLMs have been eating the world. Last week, I listened to a talk by Andrej Karpathy on his view of "Software 3.0." Unlike the two extreme perspectives I'm usually surrounded by, Andrej’s talk was refreshingly optimistic—without ignoring the pitfalls of LLMs.
It took me back to 2014, when I took on my first paid gig. I was still in high school and had recently released a small Arduino library that classified phonemes. Amid recruiter emails and general noise, I somehow ended up doing freelance work before I was called up for national service.
My first client wanted an "indoor location system" using BLE beacons. I charged a measly $500—not realizing I probably could have added two zeros if I had even a shred of business sense. The system was simple: a user would carry a Bluetooth tag, and the server would read the BLE signal and run a basic decision tree classifier to guess which room the tag was in. The dashboard was written in PHP, and we used XMPP as the communication protocol between the IoT sensors (PCDuinos, if I remember correctly).
This idea kind of followed me around. In undergrad, I built a simple app that used AR to show directions within our School of Computing. It was far from a working solution—we didn’t have an accurate map to align things with—but it planted a seed.
Fast forward to 2025. I’m now married and found myself shopping for gifts. While my wife was browsing in a large shopping mall in Singapore, I was bored in a corner and decided to see if I could "vibe-code" something on my phone. I was especially curious: could we use today’s VLM-based technologies to do indoor localization in a mall?
Generally, if a user looks at a map in a shopping mall, he or she is greeted with something like this:
This floorplan is sufficient for humans to navigate. So what about machines? Is there a way for them to use VLMs for localization? Let's start simply: what features are on these maps? Well, there are markings for corridors, shops, and toilets. For simplicity's sake, let's start with shops:
This got my head spinning. Maybe we could simply code up a localization system that uses semantic maps to figure out where a user is within a building. So I got to prototyping over the weekend. I vibed out an annotation tool and a tool for parsing the annotated files. The annotation tool is available here. It looks something like this:
It was amazing that I could do this in two prompts. In general, I've found vibe coding to be great for building such one-off tools. After that, we post-process the annotations: for each point on the corridor, we determine which shops are visible depending on the direction a user is facing.
def preprocess_visible_shops(annotations, grid, grid_info, radius=50):
    """
    Go through all the corridor points and find the shops visible from each
    (x, y, heading) pose. Returns a dict mapping a visible-shop signature to
    the list of poses that produce it.
    """
    visible_shops = {}
    explored = set()
    for annotation in annotations:
        if annotation.get('type') != 'corridor':
            continue
        points = annotation.get('points', [])
        if not points:
            continue
        for point in points:
            cx = int(point['x'])
            cy = int(point['y'])
            if (cx, cy) in explored:
                continue
            explored.add((cx, cy))
            # Sample visible shops in all directions, every 30 degrees
            for heading in range(0, 360, 30):
                shops = sample_grid_visible_shops(grid, grid_info, cx, cy, heading, fov=50, radius=radius)
                # Use a sorted tuple so the shop set is hashable and order-independent
                key = tuple(sorted(shops))
                visible_shops.setdefault(key, []).append((cx, cy, heading))
    return visible_shops
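For context, here is a hypothetical example of what the parsed annotations might look like, reconstructed from the fields the function above reads; the actual files likely carry more metadata, and the shop fields in particular are guesses:

annotations = [
    {
        # A corridor polyline: the walkable points we sample poses from
        "type": "corridor",
        "points": [{"x": 120, "y": 340}, {"x": 135, "y": 340}, {"x": 150, "y": 342}],
    },
    {
        # A shop outline (these field names are assumptions, not taken from the code above)
        "type": "shop",
        "name": "Example Shop",
        "points": [{"x": 90, "y": 300}, {"x": 90, "y": 360}, {"x": 110, "y": 360}],
    },
]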
We save the individual poses and the visible shops to a pickle file.
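A minimal sketch of that step, assuming the standard pickle module and a placeholder filename:

import pickle

# Hypothetical save step: persist the pose lookup built by preprocess_visible_shops.
# "visible_shops.pkl" is a placeholder filename, not necessarily what the repo uses.
visible_shops = preprocess_visible_shops(annotations, grid, grid_info)
with open("visible_shops.pkl", "wb") as f:
    pickle.dump(visible_shops, f)

Next, we write a small API to query the shops in an image.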
import os

def detect_shops_in_image(image_path):
    """Send a photo to Gemini and return a sorted list of detected shop names."""
    if not os.environ.get("GOOGLE_API_KEY"):
        raise ValueError("GOOGLE_API_KEY environment variable is not set.")
    with open(image_path, 'rb') as f:
        image_contents = f.read()
    image_part = {
        "mime_type": "image/jpeg",
        "data": image_contents
    }
    # Call the Gemini API to detect shops (call_gemini is a small wrapper defined elsewhere)
    gemini_result = call_gemini(image_part)
    # Normalize the detected shop names so they line up with the map annotations
    return sorted([normalize_shop_name(item["shop_name"]) for item in gemini_result])
We can now match the list of shops detected in the photo against the pickled visibility table.
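As a rough sketch of that lookup, assuming the placeholder filename from earlier and treating a pose as a match when its stored shop set covers everything detected in the photo (the repo's actual matching criterion may differ):

import pickle

# Hypothetical matching step: placeholder names, not the repo's actual API.
def localize(image_path, table_path="visible_shops.pkl"):
    with open(table_path, "rb") as f:
        visible_shops = pickle.load(f)
    detected = set(detect_shops_in_image(image_path))
    candidates = []
    for shop_set, poses in visible_shops.items():
        # Keep poses whose visible shops include all shops seen in the photo
        if detected <= set(shop_set):
            candidates.extend(poses)
    return candidates  # candidate (x, y, heading) poses, plotted as the yellow dots

This is the result I got. The yellow dots represent potential positions.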
For reference, the image I used was:
This was honestly incredible. Going from a single photo to reasonably accurate map coordinates was genuinely surprising. Despite some ambiguity, the locations marked were correct—I was indeed standing among the yellow circles! It’s amazing that we can localize against an imprecise map using just a bit of prompting and glue code.
Some of the ambiguity likely stems from the fact that we’re only using text-based prompting, rather than feeding both the image and the map directly into the VLM.
Admittedly, this example is somewhat cherry-picked—the photo I used had clearly visible shop signs. Still, it strongly suggests that VLMs can be useful for localization tasks. With video input and additional phone sensor data—perhaps paired with a particle filter—we might be able to further refine the estimate. Alternatively, we could go the "benchmark dataset" route, training a model to map photos to rough map positions more robustly. In the process we could probably kill the planet a little faster as well.
I genuinely believe there’s real potential here, especially with the upcoming wave of AR devices. There may also be applications in robotics, though it’s worth noting that there’s a long road from this proof-of-concept to a reliable, production-ready system.
For now, this is just a fun experiment—so take it as exactly that.
The tools for doing this are all available in this repo.