New Apple Study Shows LLMs Can Tell What You're Doing from Audio and Motion Data

Original link: https://9to5mac.com/2025/11/21/apple-research-llm-study-audio-motion-activity/

## Apple Explores Using LLMs for Activity Recognition

Apple researchers have demonstrated the potential of large language models (LLMs) to accurately infer user activities, even *without* task-specific training. Their study, "Using LLMs for Late Multimodal Sensor Fusion for Activity Recognition," uses an LLM to combine insights from audio descriptions and motion tracking (via IMU), rather than the raw data itself, to identify activities such as cooking, exercising, or watching TV.

Using the Ego4D dataset, the study shows that LLMs achieve accuracy significantly above chance in both zero-shot (no prior examples) and one-shot (a single example) classification settings. This "late fusion" approach, which combines the outputs of specialized models through an LLM, is especially valuable when training data is limited.

Apple emphasizes that this could make activity analysis more precise, particularly when sensor data is incomplete. Notably, the researchers have publicly released their experimental data to encourage further research in the field, potentially paving the way for more nuanced, context-aware health and activity tracking features.

## Apple Research and Surveillance Concerns

A new Apple study shows that large language models (LLMs) can accurately infer what a person is doing using only the audio and motion data collected by devices such as the Apple Watch. The data is not fed to the LLM directly; instead, it is processed by models that generate text descriptions, which are then passed to the LLM for interpretation.

This sparked a discussion on Hacker News about the broader implications of the technology. While concerns about data collection are nothing new (they date back to early Android apps), commenters stressed that collected data is stored indefinitely and that its potential uses will only grow as the technology advances.

Many expressed concern about the erosion of privacy, comparing current surveillance capabilities to the dystopia of 1984. Some argued that even people with "nothing to hide" are contributing to systems that could be used against others, providing valuable training data for AI.

Others debated the practical value of analyzing historical data while acknowledging how easy it has become to track our activities. The discussion also touched on possible mitigations, such as denying sensor permissions, and on the possibility that ubiquitous tracking may eventually become unavoidable.

Apple researchers have published a study that looks into how LLMs can analyze audio and motion data to get a better overview of the user’s activities. Here are the details.

## They’re good at it, but not in a creepy way

A new paper titled “Using LLMs for Late Multimodal Sensor Fusion for Activity Recognition” offers insight into how Apple may be considering incorporating LLM analysis alongside traditional sensor data to gain a more precise understanding of user activity.

This, they argue, has great potential to make activity analysis more precise, even in situations where there isn’t enough sensor data.

From the researchers:

“Sensor data streams provide valuable information around activities and context for downstream applications, though integrating complementary information can be challenging. We show that large language models (LLMs) can be used for late fusion for activity classification from audio and motion time series data. We curated a subset of data for diverse activity recognition across contexts (e.g., household activities, sports) from the Ego4D dataset. Evaluated LLMs achieved 12-class zero- and one-shot classification F1-scores significantly above chance, with no task-specific training. Zero-shot classification via LLM-based fusion from modality-specific models can enable multimodal temporal applications where there is limited aligned training data for learning a shared embedding space. Additionally, LLM-based fusion can enable model deploying without requiring additional memory and computation for targeted application-specific multimodal models.”

In other words, LLMs are actually pretty good at inferring what a user is doing from basic audio and motion signals, even when they’re not specifically trained for that. Moreover, when given just a single example, their accuracy improves even further.

One important distinction is that in this study, the LLM wasn’t fed the actual audio recording, but rather short text descriptions generated by audio models and an IMU-based motion model (which tracks movement through accelerometer and gyroscope data).
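To make the "text in, not audio in" distinction concrete, here is a minimal sketch of what such a late-fusion prompt might look like. The caption strings, labels, and the `build_fusion_prompt` helper are hypothetical illustrations rather than code from the paper; the only grounded detail is that the LLM receives short text outputs from an audio model and an IMU model instead of raw sensor streams.

```python
# Hypothetical sketch of LLM-based late fusion: the model only ever sees
# short text produced by upstream audio and IMU models, never raw signals.

def build_fusion_prompt(audio_caption: str, audio_label: str, imu_prediction: str) -> str:
    """Assemble the per-clip text the LLM is asked to reason over."""
    return (
        "You are given outputs from two sensor models describing a 20-second clip.\n"
        f"Audio caption: {audio_caption}\n"
        f"Audio event label: {audio_label}\n"
        f"Motion (IMU) model prediction: {imu_prediction}\n"
        "What activity is the person most likely doing? Answer with a short phrase."
    )

# Example inputs (invented for illustration):
prompt = build_fusion_prompt(
    audio_caption="water running and dishes clinking in a sink",
    audio_label="dishes",
    imu_prediction="standing, repetitive arm motion",
)
print(prompt)  # This string would be sent to an LLM such as Gemini-2.5-pro or Qwen-32B.
```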

## Diving a bit deeper

In the paper, the researchers explain that they used Ego4D, a massive dataset of media shot in first-person perspective. The data contains thousands of hours of real-world environments and situations, from household tasks to outdoor activities.

From the study:

“We curated a dataset of day-to-day activities from the Ego4D dataset by searching for activities of daily living within the provided narrative descriptions. The curated dataset includes 20 second samples from twelve high-level activities: vacuum cleaning, cooking, doing laundry, eating, playing basketball, playing soccer, playing with pets, reading a book, using a computer, washing dishes, watching TV, workout/weightlifting. These activities were selected to span a range of household and fitness tasks, and based on their prevalence in the larger dataset.”
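For a sense of what one curated sample amounts to as data, here is a small hypothetical sketch of a 20-second segment record. The field names and values are assumptions; the grounded details are the 20-second duration and the fact that the released supplemental materials include Ego4D segment IDs and timestamps.

```python
from dataclasses import dataclass

@dataclass
class CuratedSample:
    """Hypothetical record for one curated 20-second Ego4D segment."""
    segment_id: str   # Ego4D segment identifier (released in the supplemental materials)
    start_s: float    # clip start time in seconds
    end_s: float      # clip end time (start_s + 20)
    activity: str     # one of the twelve high-level activities, e.g. "cooking"

sample = CuratedSample(segment_id="example-id", start_s=120.0, end_s=140.0, activity="cooking")
print(sample)
```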

The researchers ran the audio and motion data through smaller models that generated text captions and class predictions, then fed those outputs into different LLMs (Gemini-2.5-pro and Qwen-32B) to see how well they could identify the activity.

Then, Apple compared the performance of these models in two different situations: one in which they were given the list of the 12 possible activities to choose from (closed-set), and another where they weren’t given any options (open-ended).
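The difference between the two setups is easiest to see in the prompt itself. Below is a hedged sketch of how a closed-set prompt (with the twelve activities listed) might differ from an open-ended one; the wording is an assumption, and only the two-condition design and the twelve class names come from the article.

```python
# Hypothetical prompt variants for the two evaluation settings described above.

ACTIVITIES = [
    "vacuum cleaning", "cooking", "doing laundry", "eating",
    "playing basketball", "playing soccer", "playing with pets",
    "reading a book", "using a computer", "washing dishes",
    "watching TV", "workout/weightlifting",
]

def closed_set_prompt(evidence: str) -> str:
    """Closed-set: the LLM must pick one of the 12 known activities."""
    options = "; ".join(ACTIVITIES)
    return (
        f"{evidence}\n"
        f"Choose the single most likely activity from this list: {options}."
    )

def open_ended_prompt(evidence: str) -> str:
    """Open-ended: no candidate list is provided."""
    return f"{evidence}\nDescribe the most likely activity in a few words."

evidence = (
    "Audio caption: ball bouncing on a hard court, sneakers squeaking.\n"
    "IMU prediction: running with frequent jumps."
)
print(closed_set_prompt(evidence))
print(open_ended_prompt(evidence))
```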

For each test, the models were given different combinations of audio captions, audio labels, IMU activity predictions, and extra context, and the paper compares how each combination performed.
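The quoted abstract reports performance as 12-class F1 scores "significantly above chance." As a rough illustration of what that evaluation involves (not the paper's actual code), here is how macro F1 over the twelve classes could be computed with scikit-learn; the labels and predictions below are invented.

```python
# Hypothetical scoring sketch: macro F1 over the 12 activity classes.
# With 12 balanced classes, random guessing sits around 1/12 ≈ 8.3% accuracy,
# which is the "chance" baseline the reported F1 scores are compared against.
from sklearn.metrics import f1_score

# Invented ground-truth labels and LLM predictions for a handful of clips.
y_true = ["cooking", "washing dishes", "watching TV", "eating", "cooking"]
y_pred = ["cooking", "washing dishes", "eating", "eating", "reading a book"]

macro_f1 = f1_score(y_true, y_pred, average="macro")
print(f"Macro F1: {macro_f1:.3f}")
```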

In the end, the researchers note that the results of this study offer interesting insights into how combining multiple models can benefit activity and health analysis, especially in cases where raw sensor data alone is insufficient to provide a clear picture of the user’s activity.

Perhaps more importantly, Apple published supplemental materials alongside the study, including the Ego4D segment IDs, timestamps, prompts, and one-shot examples used in the experiments, to assist researchers interested in reproducing the results.
