Show HN: I built an AI Agent that uses the iPhone

Original link: https://github.com/rounak/PhoneAgent

This iPhone Agent app is built on OpenAI models and enables AI-driven phone control that mimics a human user's interactions. It uses Xcode's UI-testing facilities (no jailbreak required) to interact with apps and iOS itself by reading the accessibility tree. Powered by GPT-4.1, it can perform tasks such as taking and sending a selfie, downloading apps, sending messages, calling an Uber, and toggling system settings (for example, the flashlight). You can issue commands by text or voice, and an optional background Always On mode listens for a wake word. The agent reads accessibility data and can tap, swipe, scroll, type, and open apps; the host app and the UI test communicate over a TCP server. Although it works surprisingly well, limitations remain: keyboard input could be improved, animations interfere with capturing the view hierarchy, the agent may give up on long-running tasks prematurely, and it lacks visual screen understanding (though that is feasible). Treat this as experimental software, run it in an isolated environment, and note that app contents are sent to OpenAI's API and the model can make mistakes.

Rounak built an AI agent with OpenAI's GPT-4.1 that controls an iPhone through Xcode UI tests and the accessibility tree, letting it interact with apps by swiping and tapping. The project sparked discussion of the security and privacy implications of AI agents that need access to sensitive user data such as credit card details, calendars, and messaging apps, potentially bypassing the operating system's safeguards; some worried about data being processed off-device. One user jokingly imagined a robot that harms humans while technically obeying the Three Laws of Robotics, highlighting how hard it is to ensure an AI understands and follows ethical constraints. Others discussed how such an AI might evolve into "synthetic life" capable of resource extraction and self-replication. Some speculated that Apple will integrate similar AI features into its own ecosystem, contrasting that with the slow progress of Apple Intelligence.

Original text

This is an iPhone-using agent that uses OpenAI models to get things done on a phone, spanning multiple apps, much like a human user would. It was built during an OpenAI hackathon last year.


Example prompts:

  • Click a new selfie and send it to {Contact name} with a haiku about the weekend
  • Download {app name} from the App Store
  • Send a message to {Contact name}: my flight is DL 1715
  • Call an Uber X to SFO
  • Open Control Center and enable the torch
Setup:

  • Clone the repo
  • Open the Xcode project
  • Open PhoneAgentUITests.swift and run the testLoop function
  • Paste your OpenAI API key, then enter your command (text or voice)
Features:

  • The model can see an app's accessibility tree
  • It can tap, swipe, scroll, type, and open apps
  • You can follow up on a task by replying to the completion notification
  • You can talk to the agent using the microphone button
  • An optional Always On mode listens for prompts starting with a wake word ("Agent" by default) even when the app is backgrounded, so you can say something like "Agent, open Settings"
  • The app persists your OpenAI API key securely in your device's keychain
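Keychain storage like the README describes is typically done with the Security framework. Below is a minimal, hypothetical sketch (the service and account identifiers are illustrative, not taken from the repo):

```swift
import Foundation
import Security

// Hypothetical sketch: persist a secret string in the iOS keychain.
// "PhoneAgent" / "openai-api-key" are illustrative identifiers.
func saveAPIKey(_ key: String) -> Bool {
    let query: [String: Any] = [
        kSecClass as String: kSecClassGenericPassword,
        kSecAttrService as String: "PhoneAgent",
        kSecAttrAccount as String: "openai-api-key",
    ]
    // Remove any existing item, then add the new value.
    SecItemDelete(query as CFDictionary)
    var attributes = query
    attributes[kSecValueData as String] = Data(key.utf8)
    return SecItemAdd(attributes as CFDictionary, nil) == errSecSuccess
}

func loadAPIKey() -> String? {
    let query: [String: Any] = [
        kSecClass as String: kSecClassGenericPassword,
        kSecAttrService as String: "PhoneAgent",
        kSecAttrAccount as String: "openai-api-key",
        kSecReturnData as String: true,
    ]
    var result: AnyObject?
    guard SecItemCopyMatching(query as CFDictionary, &result) == errSecSuccess,
          let data = result as? Data else { return nil }
    return String(data: data, encoding: .utf8)
}
```

Keychain items survive app reinstalls and are encrypted at rest, which is why they are preferred over UserDefaults for secrets like API keys.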

iOS apps are sandboxed, so this project uses Xcode's UI-testing harness to inspect and interact with apps and the system (no jailbreak required).
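The UI-testing approach can be sketched with a generic XCTest example. This is not the project's actual test; the target app and element names are placeholders:

```swift
import XCTest

// Generic sketch of driving another app from a UI test.
// XCUIApplication(bundleIdentifier:) can launch apps outside the
// test target's own bundle, which is what makes system-wide
// automation possible without a jailbreak.
final class AgentUITests: XCTestCase {
    func testTapInSettings() {
        let settings = XCUIApplication(bundleIdentifier: "com.apple.Preferences")
        settings.launch()

        // debugDescription dumps the accessibility hierarchy — the
        // same kind of tree the agent serializes for the model.
        print(settings.debugDescription)

        // Interact through accessibility elements.
        let cell = settings.cells.staticTexts["General"]
        if cell.waitForExistence(timeout: 5) {
            cell.tap()
        }
    }
}
```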

The agent is powered by OpenAI's gpt-4.1 model. It is surprisingly good at using the iPhone from just the accessibility contents of an app. It has access to these tools:

  • getting the contents of the current app
  • tapping on a UI element
  • typing in a text field
  • opening an app

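This tool set maps naturally onto OpenAI function calling. A hypothetical sketch of how the model's tool calls might be decoded on the Swift side (tool names and argument shapes are assumptions, not the repo's actual schema):

```swift
import Foundation

// Hypothetical sketch: decode a model tool call into an agent action.
// Tool names mirror the list above; argument keys are illustrative.
enum AgentAction {
    case getContents
    case tap(elementID: String)
    case type(text: String)
    case openApp(bundleID: String)
}

struct ToolCall: Decodable {
    let name: String
    let arguments: [String: String]
}

func decode(_ call: ToolCall) -> AgentAction? {
    switch call.name {
    case "get_current_app_contents":
        return .getContents
    case "tap_element":
        return call.arguments["id"].map { .tap(elementID: $0) }
    case "type_text":
        return call.arguments["text"].map { .type(text: $0) }
    case "open_app":
        return call.arguments["bundle_id"].map { .openApp(bundleID: $0) }
    default:
        return nil
    }
}
```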
The host app communicates with the UI test via a TCP Server to trigger prompts.
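One minimal way to set up such a channel is with Apple's Network framework. This sketch has the UI test listen on localhost while the host app connects and writes a prompt; the port and framing are assumptions, not the project's actual protocol:

```swift
import Foundation
import Network

// Hypothetical sketch: a TCP listener that receives prompt strings.
// Port 8081 is illustrative.
func startPromptListener(onPrompt: @escaping (String) -> Void) throws -> NWListener {
    let listener = try NWListener(using: .tcp, on: 8081)
    listener.newConnectionHandler = { connection in
        connection.start(queue: .main)
        connection.receive(minimumIncompleteLength: 1,
                           maximumLength: 64 * 1024) { data, _, _, _ in
            if let data, let prompt = String(data: data, encoding: .utf8) {
                onPrompt(prompt)  // hand the prompt to the agent loop
            }
        }
    }
    listener.start(queue: .main)
    return listener
}
```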

Known limitations:

  • Keyboard input can be improved
  • Capturing the view hierarchy while an animation is in flight confuses the model
  • The model doesn't wait for long-running tasks to complete, so it might give up prematurely
  • The model doesn't yet see an image representation of the screen, though that is possible via XCTest APIs
Caveats:

  • This is experimental software
  • This is a personal project
  • Running it in an isolated environment is recommended
  • App contents are sent to OpenAI's API
  • The model can sometimes get things wrong