I built the Playwright for desktop apps. 80% token savings

Original link: https://github.com/lahfir/agent-desktop

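## Agent-Desktop: A Desktop Automation CLI for AI Agents

Agent-Desktop is a fast, native Rust CLI that gives AI agents desktop automation capabilities. It reaches applications through the operating system's accessibility tree, with no screenshots or pixel matching, and works with any application that exposes an accessibility tree (such as Finder, Safari, and Slack).

It provides 53 commands for observation, interaction (keyboard and mouse), window management, and notifications, and emits structured JSON for machine consumption. Its "progressive skeleton traversal" reduces token usage on complex applications such as Slack by returning a shallow overview first and drilling down only where needed. Deterministic element references (@e1, @e2) simplify workflows.

The tool also ships a C-ABI cdylib that can be loaded from Python, Swift, Go, Node.js, and other languages, which avoids forking the CLI for every call. The CLI currently runs on macOS (Accessibility permission required). Linux and Windows support is listed as planned, although prebuilt FFI archives are published for all three platforms. Install with `npm install -g agent-desktop` or build from source (requires Rust 1.78+). Detailed documentation and examples are on GitHub ([https://github.com/lahfir/agent-desktop](https://github.com/lahfir/agent-desktop)).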

Hacker News discussion: "I built the Playwright for desktop apps. 80% token savings" (github.com/lahfir), 12 points, submitted by lahfir 2 hours ago, 1 comment.

jstanley (1 minute ago): lahfir, I vouched for your comment (it is still hidden for now) because I found it interesting. I think it was hidden because it reads as LLM-generated (you "quietly" published it on GitHub? Who says that?).

Original text

OBSERVE. DECIDE. ACT.



agent-desktop is a native desktop automation CLI designed for AI agents, built with Rust. It gives structured access to any application through OS accessibility trees — no screenshots, no pixel matching, no browser required.

[Figure: agent-desktop architecture diagram]

  • Native Rust CLI: Fast, single binary, no runtime dependencies
  • C-ABI cdylib (libagent_desktop_ffi): Load once from Python / Swift / Go / Ruby / Node / C instead of forking the CLI per call
  • 53 commands: Observation, interaction, keyboard, mouse, notifications, clipboard, window management
  • Progressive skeleton traversal: 78–96% token reduction on dense apps via shallow overview + targeted drill-down
  • Snapshot & refs: AI-optimized workflow using deterministic element references (@e1, @e2)
  • AX-first interactions: Every action exhausts pure accessibility API strategies before falling back to mouse events
  • Structured JSON output: Machine-readable responses with error codes and recovery hints
  • Works with any app: Finder, Safari, System Settings, Xcode, Slack — anything with an accessibility tree
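
The "AX-first interactions" item describes a fallback chain: accessibility strategies are tried in order, and synthesized mouse events come last. The pattern can be sketched in Python as below. This is only an illustration of the idea, not the tool's implementation, and the strategy names (`ax_press`, `ax_confirm`, `mouse_click`) are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# A strategy is a (name, callable) pair; the callable returns True on success.
Strategy = Tuple[str, Callable[[str], bool]]


def run_chain(ref: str, strategies: List[Strategy]) -> Optional[str]:
    """Try each strategy in order and return the name of the first that succeeds."""
    for name, attempt in strategies:
        try:
            if attempt(ref):
                return name
        except Exception:
            # A failing strategy is not fatal; fall through to the next one.
            continue
    return None


# Hypothetical strategies, standing in for real accessibility and mouse calls.
def ax_press(ref: str) -> bool:
    return False  # pretend the element does not support a press action


def ax_confirm(ref: str) -> bool:
    raise RuntimeError("action unsupported")  # pretend the API raised


def mouse_click(ref: str) -> bool:
    return True  # last resort: synthesized mouse event


CHAIN: List[Strategy] = [
    ("ax_press", ax_press),
    ("ax_confirm", ax_confirm),
    ("mouse_click", mouse_click),
]
```

Keeping the mouse strategy at the end of the list is what makes the chain "AX-first": it only runs when every accessibility attempt has returned false or raised.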

Install via npm:

npm install -g agent-desktop        # downloads prebuilt binary automatically

Or without installing:

npx agent-desktop snapshot --app Finder -i

Or build from source:

git clone https://github.com/lahfir/agent-desktop
cd agent-desktop
cargo build --release
cp target/release/agent-desktop /usr/local/bin/

Requires Rust 1.78+ and macOS 13.0+.

macOS requires Accessibility permission. Grant it in System Settings > Privacy & Security > Accessibility by adding your terminal app, or:

agent-desktop permissions --request   # trigger system dialog

Every GitHub Release ships a prebuilt C-ABI cdylib alongside the CLI tarballs. Hosts that need in-process calls (Python agents, Swift apps, Go services, Node tools, Ruby scripts, C/C++ code) dlopen the dylib and call the functions declared in agent_desktop.h — no fork-exec per command.

| Platform | Artifact |
| --- | --- |
| macOS arm64 | `agent-desktop-ffi-v<ver>-aarch64-apple-darwin.tar.gz` |
| macOS x86_64 | `agent-desktop-ffi-v<ver>-x86_64-apple-darwin.tar.gz` |
| Linux x86_64 (glibc) | `agent-desktop-ffi-v<ver>-x86_64-unknown-linux-gnu.tar.gz` |
| Linux arm64 (glibc) | `agent-desktop-ffi-v<ver>-aarch64-unknown-linux-gnu.tar.gz` |
| Windows x86_64 (MSVC) | `agent-desktop-ffi-v<ver>-x86_64-pc-windows-msvc.zip` |
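
A host script that downloads the right archive needs to map its own platform onto one of these names. The following is a minimal sketch of that mapping. The `(system, machine)` keys are the values Python's `platform.system()` and `platform.machine()` usually report on those hosts, which is an assumption about the host rather than something this project documents, and the function name is hypothetical.

```python
from typing import Dict, Tuple

# (platform.system(), platform.machine()) -> Rust target triple from the table above.
TARGETS: Dict[Tuple[str, str], str] = {
    ("Darwin", "arm64"): "aarch64-apple-darwin",
    ("Darwin", "x86_64"): "x86_64-apple-darwin",
    ("Linux", "x86_64"): "x86_64-unknown-linux-gnu",
    ("Linux", "aarch64"): "aarch64-unknown-linux-gnu",
    ("Windows", "AMD64"): "x86_64-pc-windows-msvc",
}


def ffi_artifact_name(version: str, system: str, machine: str) -> str:
    """Build the release archive name for a version and host description.

    Raises KeyError for hosts that have no prebuilt archive in the table.
    """
    target = TARGETS[(system, machine)]
    # Only the Windows archive is a .zip; the others are .tar.gz.
    ext = "zip" if system == "Windows" else "tar.gz"
    return f"agent-desktop-ffi-v{version}-{target}.{ext}"
```

On a real host the call would look like `ffi_artifact_name(ver, platform.system(), platform.machine())`, with `ver` taken from the release being installed. The Linux archives are glibc builds, so a musl-based host is not covered by this table.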

Each archive contains lib/libagent_desktop_ffi.{dylib,so,dll}, include/agent_desktop.h, LICENSE, and a short README. Verify the download with the release's checksums.txt:

shasum -a 256 -c checksums.txt
gh attestation verify agent-desktop-ffi-v*.tar.gz --repo lahfir/agent-desktop   # Sigstore provenance
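
Where `shasum` is not available (for example inside a Python installer), the same check can be sketched with `hashlib`. This assumes `checksums.txt` uses the usual `shasum` layout of one `<hex digest>  <filename>` pair per line, which the `shasum -c` command above implies but the source does not spell out. The function names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict


def parse_checksums(text: str) -> Dict[str, str]:
    """Parse shasum-style lines into a {filename: lowercase hex digest} map."""
    entries: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        digest, name = line.split(None, 1)
        # shasum marks binary-mode entries with a leading '*' on the filename.
        entries[name.lstrip("*")] = digest.lower()
    return entries


def digest_matches(data: bytes, name: str, checksums_text: str) -> bool:
    """Return True if the SHA-256 of data equals the digest listed for name."""
    expected = parse_checksums(checksums_text).get(name)
    if expected is None:
        return False
    return hashlib.sha256(data).hexdigest() == expected


def verify_file(path: Path, checksums_path: Path) -> bool:
    """Convenience wrapper that reads both files from disk."""
    return digest_matches(path.read_bytes(), path.name, checksums_path.read_text())
```

This only covers the checksum step. It does not replace the `gh attestation verify` provenance check.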

Minimal Python round-trip:

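import ctypes

lib = ctypes.CDLL("./lib/libagent_desktop_ffi.dylib")  # .so on Linux, .dll on Windows
lib.ad_adapter_create.restype = ctypes.c_void_p
# Declare the pointer argument; without argtypes ctypes passes Python ints as C int
# and would truncate a 64-bit handle.
lib.ad_adapter_destroy.argtypes = [ctypes.c_void_p]
adapter = lib.ad_adapter_create()
# ... call ad_list_apps / ad_get_tree / ad_execute_action, see docs below
lib.ad_adapter_destroy(adapter)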

Full consumer guide — error-handling contract, ownership rules, threading constraints, every entrypoint with Safety docs: skills/agent-desktop-ffi/.

For dense apps (Slack, VS Code, Notion), use progressive skeleton traversal to minimize token usage:

# 1. Shallow overview — depth-3 map, truncated containers show children_count
agent-desktop snapshot --skeleton --app Slack -i --compact

# 2. Drill into a region of interest (named containers get refs as drill targets)
agent-desktop snapshot --root @e3 -i --compact

# 3. Act on an element found in the drill-down
agent-desktop click @e12

# 4. Re-drill the same region to verify the state change
agent-desktop snapshot --root @e3 -i --compact
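
The token saving in step 1 comes from cutting the tree at a fixed depth and replacing the hidden subtrees with a count. A minimal Python sketch of that truncation follows. The node shape (`role`, `children`) is an assumption made for illustration, since the snapshot JSON schema is not shown here, and this is not the tool's own algorithm.

```python
from typing import Any, Dict

Node = Dict[str, Any]


def skeleton(node: Node, max_depth: int = 3, depth: int = 1) -> Node:
    """Copy a tree down to max_depth levels.

    Containers at the cut-off level lose their children and gain a
    children_count field instead, so the caller knows where to drill down.
    """
    out = {key: value for key, value in node.items() if key != "children"}
    children = node.get("children", [])
    if not children:
        return out
    if depth >= max_depth:
        out["children_count"] = len(children)
    else:
        out["children"] = [skeleton(child, max_depth, depth + 1) for child in children]
    return out


# Hypothetical four-level tree: window > group > list > two buttons.
TREE: Node = {
    "role": "window",
    "children": [
        {
            "role": "group",
            "children": [
                {"role": "list", "children": [{"role": "button"}, {"role": "button"}]}
            ],
        }
    ],
}
```

Calling `skeleton(TREE)` keeps the window, group, and list nodes and reports `children_count: 2` on the list. A follow-up `--root` snapshot on that container would then expand only that region.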

For simple apps, a full snapshot is fine:

agent-desktop snapshot --app Finder -i   # get interactive elements with refs
agent-desktop click @e3                  # click a button by ref
agent-desktop type @e5 "quarterly report"  # type into a text field
agent-desktop press cmd+s               # keyboard shortcut
agent-desktop snapshot -i               # re-observe after UI changes

Agent loop:  snapshot → decide → act → snapshot → decide → act → ...

Observation

agent-desktop snapshot --app Safari -i           # accessibility tree with refs
agent-desktop snapshot --surface menu            # capture open menu
agent-desktop screenshot --app Finder            # PNG screenshot
agent-desktop find --role button --app TextEdit  # search by role, name, value, text
agent-desktop get @e3 value                      # read element property
agent-desktop is @e7 checked                     # check boolean state
agent-desktop list-surfaces --app Notes          # list menus, sheets, popovers, alerts

Interaction

agent-desktop click @e3                  # smart AX-first click (15-step chain)
agent-desktop double-click @e3           # open files, select words
agent-desktop triple-click @e3           # select lines/paragraphs
agent-desktop right-click @e3            # context menu (returns menu tree inline)
agent-desktop type @e5 "hello world"     # type text into element
agent-desktop set-value @e5 "new value"  # set value directly via AX
agent-desktop clear @e5                  # clear element value
agent-desktop focus @e5                  # set keyboard focus
agent-desktop select @e9 "Option B"      # select option in dropdown/list
agent-desktop toggle @e12                # flip checkbox or switch
agent-desktop check @e12                 # idempotent check
agent-desktop uncheck @e12               # idempotent uncheck
agent-desktop expand @e15                # expand disclosure/tree item
agent-desktop collapse @e15              # collapse disclosure/tree item
agent-desktop scroll @e1 down 3          # scroll (AX-first, 10-step chain)
agent-desktop scroll-to @e20             # scroll element into view

Keyboard

agent-desktop press cmd+s               # key combo
agent-desktop press cmd+shift+z          # multi-modifier
agent-desktop press escape               # single key
agent-desktop key-down shift             # hold key
agent-desktop key-up shift               # release key

Mouse

agent-desktop hover @e3                  # move cursor to element
agent-desktop hover --xy 500,300         # move cursor to coordinates
agent-desktop drag @e3 --to @e8          # drag between elements
agent-desktop drag --xy 100,200 --to-xy 400,200  # drag between coordinates
agent-desktop mouse-click --xy 500,300   # click at coordinates
agent-desktop mouse-down --xy 500,300    # press at coordinates
agent-desktop mouse-up --xy 500,300      # release at coordinates

App & window management

agent-desktop launch Safari              # launch app by name
agent-desktop launch com.apple.Safari    # launch by bundle ID
agent-desktop close-app Safari           # quit app
agent-desktop close-app Safari --force   # force quit (SIGKILL)
agent-desktop list-apps                  # list running GUI apps
agent-desktop list-windows               # list visible windows
agent-desktop list-windows --app Finder  # windows for specific app
agent-desktop focus-window w-4521        # bring window to front
agent-desktop resize-window w-4521 800 600  # resize
agent-desktop move-window w-4521 100 100    # move
agent-desktop minimize w-4521            # minimize
agent-desktop maximize w-4521            # maximize
agent-desktop restore w-4521             # restore

Notifications (macOS only)

agent-desktop list-notifications                       # list all notifications
agent-desktop list-notifications --app "Slack"         # filter by app
agent-desktop list-notifications --text "deploy" --limit 5  # filter by text
agent-desktop dismiss-notification 1                   # dismiss by index
agent-desktop dismiss-all-notifications                # dismiss all
agent-desktop dismiss-all-notifications --app "Slack"  # dismiss all from app
agent-desktop notification-action 1 --action "Reply"   # click action button

Clipboard

agent-desktop clipboard-get              # read clipboard text
agent-desktop clipboard-set "copied"     # write to clipboard
agent-desktop clipboard-clear            # clear clipboard

Wait

agent-desktop wait 500                                       # sleep 500ms
agent-desktop wait --element @e3 --timeout 5000              # wait for element
agent-desktop wait --window "Save" --timeout 10000           # wait for window
agent-desktop wait --text "Loading complete" --app Safari    # wait for text
agent-desktop wait --menu --timeout 3000                     # wait for menu

Batch

agent-desktop batch '[
  {"command": "click", "args": {"ref_id": "@e2"}},
  {"command": "type", "args": {"ref_id": "@e5", "text": "hello"}},
  {"command": "press", "args": {"combo": "return"}}
]' --stop-on-error
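
When the batch is assembled by a program, generating the JSON with a serializer avoids hand-quoting mistakes. The sketch below builds the same payload as the command above and the argument list to run it. The helper names `step` and `batch_argv` are hypothetical; the command names and argument keys are the ones shown in the example.

```python
import json
from typing import Any, Dict, List


def step(command: str, **args: Any) -> Dict[str, Any]:
    """Build one batch entry in the {"command": ..., "args": {...}} shape."""
    return {"command": command, "args": args}


def batch_argv(steps: List[Dict[str, Any]], stop_on_error: bool = True) -> List[str]:
    """Return an argv list for 'agent-desktop batch' with the steps as one JSON argument."""
    argv = ["agent-desktop", "batch", json.dumps(steps)]
    if stop_on_error:
        argv.append("--stop-on-error")
    return argv


STEPS = [
    step("click", ref_id="@e2"),
    step("type", ref_id="@e5", text="hello"),
    step("press", combo="return"),
]
```

Passing the resulting list to `subprocess.run(batch_argv(STEPS), capture_output=True, text=True)` hands the JSON to the CLI as a single argument without going through a shell, so no shell quoting is needed.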

System

agent-desktop status                     # platform, permission state
agent-desktop permissions                # check accessibility permission
agent-desktop permissions --request      # trigger system dialog
agent-desktop version                    # version string

Snapshot options

agent-desktop snapshot [OPTIONS]

| Flag | Default | Description |
| --- | --- | --- |
| `--app <NAME>` | focused app | Filter to a specific application |
| `--window-id <ID>` | - | Filter to a specific window |
| `-i` / `--interactive-only` | off | Only include interactive elements |
| `--compact` | off | Omit empty structural nodes |
| `--include-bounds` | off | Include pixel bounds (x, y, width, height) |
| `--max-depth <N>` | 10 | Maximum tree depth |
| `--skeleton` | off | Shallow 3-level overview; truncated containers show children_count and get refs as drill targets |
| `--root <REF>` | - | Start traversal from this ref; merges into existing refmap with scoped invalidation |
| `--surface <TYPE>` | window | window, focused, menu, menubar, sheet, popover, alert |
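
A wrapper that calls `snapshot` from code can turn these flags into a typed options object. The following is a minimal sketch under that assumption. `SnapshotOptions` and `snapshot_argv` are hypothetical names, and only flags from the table above are emitted.

```python
from dataclasses import dataclass
from typing import List, Optional

# Allowed values for --surface, taken from the flag table.
SURFACES = {"window", "focused", "menu", "menubar", "sheet", "popover", "alert"}


@dataclass
class SnapshotOptions:
    app: Optional[str] = None
    window_id: Optional[str] = None
    interactive_only: bool = False
    compact: bool = False
    include_bounds: bool = False
    max_depth: Optional[int] = None  # None leaves the CLI default in place
    skeleton: bool = False
    root: Optional[str] = None
    surface: Optional[str] = None  # None leaves the CLI default in place


def snapshot_argv(opts: SnapshotOptions) -> List[str]:
    """Translate options into an argv list, omitting anything left at its default."""
    if opts.surface is not None and opts.surface not in SURFACES:
        raise ValueError(f"unknown surface: {opts.surface}")
    argv = ["agent-desktop", "snapshot"]
    if opts.app:
        argv += ["--app", opts.app]
    if opts.window_id:
        argv += ["--window-id", opts.window_id]
    if opts.interactive_only:
        argv.append("-i")
    if opts.compact:
        argv.append("--compact")
    if opts.include_bounds:
        argv.append("--include-bounds")
    if opts.max_depth is not None:
        argv += ["--max-depth", str(opts.max_depth)]
    if opts.skeleton:
        argv.append("--skeleton")
    if opts.root:
        argv += ["--root", opts.root]
    if opts.surface:
        argv += ["--surface", opts.surface]
    return argv
```

Validating `--surface` before spawning the process catches a typo locally instead of through an argument parse error from the CLI.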

Every command returns structured JSON:

{
  "version": "1.0",
  "ok": true,
  "command": "click",
  "data": { "action": "click" }
}

Errors include machine-readable codes and recovery hints:

{
  "version": "1.0",
  "ok": false,
  "command": "click",
  "error": {
    "code": "STALE_REF",
    "message": "Element at @e7 no longer matches the last snapshot",
    "suggestion": "Run 'snapshot' to refresh refs, then retry"
  }
}

| Code | Meaning |
| --- | --- |
| PERM_DENIED | Accessibility permission not granted |
| ELEMENT_NOT_FOUND | No element matched the ref or query |
| APP_NOT_FOUND | Application not running or no windows |
| STALE_REF | Ref is from a previous snapshot |
| ACTION_FAILED | The OS rejected the action |
| TIMEOUT | Wait condition expired |
| INVALID_ARGS | Invalid argument values |

Exit codes: 0 on success, 1 on a structured error (JSON on stdout), 2 on an argument parse error.
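
A caller can combine the exit code with the JSON envelope to decide how to react. The sketch below shows one way to do that in Python. `AgentDesktopError` and `interpret` are hypothetical names, and the `"UNKNOWN"` fallback code is a placeholder of this sketch, not one of the documented error codes.

```python
import json
from typing import Any, Dict, Optional


class AgentDesktopError(Exception):
    """Structured error carrying the code and recovery hint from the JSON envelope."""

    def __init__(self, code: str, message: str, suggestion: Optional[str] = None):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.suggestion = suggestion


def interpret(exit_code: int, stdout: str) -> Dict[str, Any]:
    """Return the 'data' object on success; raise on either failure mode."""
    if exit_code == 2:
        # Argument parse errors are not reported as a JSON envelope.
        raise ValueError("argument parse error (exit code 2)")
    payload = json.loads(stdout)
    if exit_code == 0 and payload.get("ok"):
        return payload.get("data", {})
    err = payload.get("error", {})
    # "UNKNOWN" is this sketch's placeholder for a missing code field.
    raise AgentDesktopError(
        err.get("code", "UNKNOWN"), err.get("message", ""), err.get("suggestion")
    )
```

With `subprocess.run(..., capture_output=True, text=True)`, the two inputs are the completed process's `returncode` and `stdout`.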

snapshot assigns refs to interactive elements in depth-first order: @e1, @e2, @e3, etc. Refs are valid until the next snapshot replaces them.

Interactive roles that receive refs: button, textfield, checkbox, link, menuitem, tab, slider, combobox, treeitem, cell, radiobutton, incrementor, menubutton, switch, colorwell, dockitem.

Static elements (labels, groups, containers) appear in the tree for context but have no ref.
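
The ref assignment rule above can be sketched as a depth-first walk that numbers only nodes with an interactive role. This is an illustration of the rule, not the tool's code. The node shape (`role`, `children`) and the use of pre-order (parent before children) are assumptions, and the skeleton-mode behavior where named containers also receive refs is left out.

```python
from typing import Any, Dict

Node = Dict[str, Any]

# Roles that receive refs, as listed above.
INTERACTIVE_ROLES = {
    "button", "textfield", "checkbox", "link", "menuitem", "tab", "slider",
    "combobox", "treeitem", "cell", "radiobutton", "incrementor", "menubutton",
    "switch", "colorwell", "dockitem",
}


def assign_refs(root: Node) -> Dict[str, Node]:
    """Walk the tree depth-first, tag interactive nodes with @eN, and return the refmap."""
    refmap: Dict[str, Node] = {}

    def visit(node: Node) -> None:
        if node.get("role") in INTERACTIVE_ROLES:
            ref = f"@e{len(refmap) + 1}"
            node["ref"] = ref
            refmap[ref] = node
        for child in node.get("children", []):
            visit(child)

    visit(root)
    return refmap


# Hypothetical tree: static nodes (window, group, label) stay without refs.
TREE: Node = {
    "role": "window",
    "children": [
        {
            "role": "group",
            "children": [
                {"role": "button", "name": "OK"},
                {"role": "label", "name": "Name"},
                {"role": "textfield", "name": "Name"},
            ],
        },
        {"role": "button", "name": "Cancel"},
    ],
}
```

Because numbering depends only on traversal order, the same tree always yields the same refs, which is what makes them deterministic between identical snapshots.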

Stale ref recovery:

snapshot → act → STALE_REF? → snapshot again → retry
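
The recovery flow can be sketched as a small retry loop. The example takes the command runner and the decision step as parameters so that it stays independent of how the CLI is invoked. `act_with_recovery`, `run`, and `choose_action` are hypothetical names, and the result dictionaries follow the JSON envelope shown earlier.

```python
from typing import Any, Callable, Dict, List

Result = Dict[str, Any]
Runner = Callable[[List[str]], Result]


def act_with_recovery(
    run: Runner,
    snapshot_argv: List[str],
    choose_action: Callable[[Result], List[str]],
    max_attempts: int = 2,
) -> Result:
    """Snapshot, act, and repeat both steps if the action reports STALE_REF.

    choose_action picks the action from each fresh snapshot, because refs
    from an older snapshot may no longer point at the same element.
    """
    result: Result = {}
    for _ in range(max_attempts):
        snapshot = run(snapshot_argv)
        result = run(choose_action(snapshot))
        stale = result.get("error", {}).get("code") == "STALE_REF"
        if result.get("ok") or not stale:
            break  # success, or an error that a new snapshot will not fix
    return result
```

In a real agent, `run` would wrap `subprocess.run` and parse stdout as JSON, and `choose_action` would be the model's decision step. Errors other than `STALE_REF` are returned immediately so the caller can act on their own `suggestion` field.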

| Feature | macOS | Windows | Linux |
| --- | --- | --- | --- |
| Accessibility tree | Yes | Planned | Planned |
| Click / type / keyboard | Yes | Planned | Planned |
| Mouse input | Yes | Planned | Planned |
| Screenshot | Yes | Planned | Planned |
| Clipboard | Yes | Planned | Planned |
| App & window management | Yes | Planned | Planned |
| Notifications | Yes | Planned | Planned |

Development

cargo build                               # debug build
cargo build --release                     # optimized (<15MB)
cargo test --lib --workspace              # run tests
cargo clippy --all-targets -- -D warnings # lint (must pass with zero warnings)

License: Apache-2.0
