Show HN: Open-Source SDK for AI Knowledge Work

Original link: https://github.com/ClioAI/kw-sdk

## ClioAI/kw-sdk: A Python SDK for AI Knowledge Work

The ClioAI SDK is a Python toolkit for building AI agents that handle complex "knowledge work" such as research, analysis, writing, and strategic decision-making. Unlike coding, knowledge work lacks clear pass/fail tests and has a vast, underspecified solution space. The SDK addresses this by introducing **rubrics**: predefined criteria for what "good" work looks like, enabling self-verification and iterative improvement.

The SDK runs a self-verifying loop: task creation, rubric generation, task execution (using tools such as web search and code execution), and verification against the rubric. If verification fails, the agent iterates; if it passes, the result is submitted.

Key features include multiple modes ("standard", "plan", "explore", "iterate") for different kinds of tasks, plus extensibility for custom modes and providers. It supports popular LLMs (Gemini, OpenAI, Anthropic) and provides tools such as file access and user clarification.

The project has been open-sourced to advance AI for knowledge work, since current tooling focuses mostly on code. The creator hopes community contributions will improve verification, and possibly lead to models trained specifically for rubric-based evaluation. The SDK aims to save developers time and unlock new possibilities for AI-driven research, recommendations, and document generation.

The project and installation instructions are available here: [https://github.com/ClioAI/kw-sdk](https://github.com/ClioAI/kw-sdk)

## ClioAI's Open-Source SDK for AI Knowledge Work

ClioAI has released an open-source SDK ([https://github.com/ClioAI/kw-sdk](https://github.com/ClioAI/kw-sdk)) that treats knowledge work (such as research and strategy) as an engineering problem. Unlike many AI agent frameworks that focus on code, this SDK emphasizes **verification** through a "task → brief → rubric → work → verify" loop. The rubric is a hidden evaluation standard that provides a structured reward signal, which is especially useful for model training.

Key features include an **explore mode** that generates multiple approaches with their trade-offs, and **checkpoints** that allow resuming or forking an exploration. The SDK supports remote execution environments (including browsers) and flexible tool invocation via context or documentation.

The core idea is to make knowledge work verifiable, moving beyond subjective "sounds good" evaluation. The developer is seeking feedback, particularly on how others handle verification in non-code agent workflows. The project is MIT-licensed and ships with extensive examples and guides.

Original Text

A Python SDK for building AI agents that perform knowledge work—research, analysis, writing, and decision-making tasks that require iteration, verification, and structured thinking.

Why Knowledge Work is Different from Code

Code has a tight feedback loop: write code → run tests → fix errors → repeat. The solution space is constrained—there's usually one correct answer, and automated tests tell you if you found it.

Knowledge work is fundamentally different. The solution space is vast and underspecified. A "market analysis" could be a two-paragraph summary or a 50-page deep dive. A "strategy recommendation" could emphasize cost, speed, risk, innovation, or any combination. There's no test suite that returns pass/fail.

Our approach: Since knowledge work lacks natural verification, we synthesize one using rubrics. A rubric defines what "good" looks like before execution begins, enabling:

  • Self-verification: The agent checks its own work against explicit criteria
  • Iterative refinement: Failed verification triggers targeted improvement
  • Transparent evaluation: Humans can audit the rubric and verification process
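
For a concrete sense of what a rubric can look like, here is a minimal, purely illustrative sketch: a hand-written rubric passed to the harness via the rubric parameter shown in the full configuration example later in this README (the SDK can also generate one automatically during execution). The criteria themselves are examples, not SDK defaults.

from verif import RLHarness

# Purely illustrative rubric; adjust criteria to your task
RUBRIC = """
- [ ] States the core recommendation up front
- [ ] Cites at least three primary sources
- [ ] Quantifies the main trade-offs
"""

harness = RLHarness(provider="gemini", rubric=RUBRIC)
result = harness.run_single("Write a market analysis memo on remote-work software.")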

This SDK implements a self-verifying agentic loop that brings structure to the inherently open-ended nature of knowledge work. The agent can search the web, read and write files, execute code, generate artifacts, and ask the user for clarification—all coordinated through an orchestrator that verifies its own output.

This started as a harness for running RL training on knowledge tasks. I'm open-sourcing it because:

  1. Knowledge workflows are underexplored. Most AI tooling focuses on code. But knowledge work—research, analysis, strategy, writing—is where most professionals spend their time. The primitives for building these systems aren't well established yet.

  2. This could be a useful building block. If you're building products that involve AI doing research, making recommendations, or producing documents, this verification loop might save you weeks of iteration.

  3. Models still struggle with verification. The self-check step is the weakest link. If this gets adoption, an open-source model provider could train specifically on rubric-based verification—improving the entire ecosystem.

I'd rather see these ideas spread than keep them proprietary.

┌─────────────────────────────────────────────────────────────┐
│  1. BRIEF CREATION                                          │
│     → Formalize task into structured requirements           │
└─────────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────────┐
│  2. RUBRIC CREATION                                         │
│     → Generate evaluation criteria (hidden from executor)   │
└─────────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────────┐
│  3. TASK EXECUTION                                          │
│     → Orchestrator delegates to subagents, runs searches    │
└─────────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────────┐
│  4. VERIFICATION                                            │
│     → Check answer against rubric → PASS or FAIL            │
└─────────────────────────────────────────────────────────────┘
                          ↓ (if FAIL)
                    ← ITERATE ←
                          ↓ (if PASS)
┌─────────────────────────────────────────────────────────────┐
│  5. SUBMISSION                                              │
│     → Submit verified answer                                │
└─────────────────────────────────────────────────────────────┘

As a dependency (recommended)

uv pip install git+https://github.com/ClioAI/kw-sdk.git

Or add to your pyproject.toml:

dependencies = [
    "verif @ git+https://github.com/ClioAI/kw-sdk.git",
]
For local development, install from source:

git clone https://github.com/ClioAI/kw-sdk.git
cd kw-sdk
uv venv && source .venv/bin/activate
uv pip install -e ".[dev]"

Create a .env file:

GEMINI_API_KEY=your_gemini_key
OPENAI_API_KEY=your_openai_key
ANTHROPIC_API_KEY=your_anthropic_key
from verif import RLHarness

harness = RLHarness(provider="gemini")  # or "openai" or "anthropic"
result = harness.run_single("Analyze the economic impact of remote work on urban real estate.")

print(result.answer)  # The analysis
print(result.rubric)  # Auto-generated evaluation criteria

The SDK provides different modes optimized for different types of knowledge work:

| Mode | Best For | Rubric Strategy |
|------|----------|-----------------|
| standard | General research & analysis | Auto-created during execution |
| plan | Complex multi-step tasks | User-provided or auto-created |
| explore | Creative/divergent thinking | Quality checklist (no accuracy rubric) |
| iterate | Refining existing work | Uses existing rubric + feedback |

| Provider | Config | Thinking Control |
|----------|--------|------------------|
| Gemini | provider="gemini" | thinking_level: LOW / MEDIUM / HIGH |
| OpenAI | provider="openai" | reasoning_effort: low / medium / high |
| Anthropic | provider="anthropic" | thinking_budget: token count (default 10000) |
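
As a quick illustration, using only the fields shown in the full configuration example later in this README, a provider can be selected either by name or with an explicit ProviderConfig:

from verif import RLHarness, ProviderConfig

# Select a provider by name, with default thinking settings
harness = RLHarness(provider="openai")

# Or with explicit thinking control (reasoning_effort applies to OpenAI)
harness = RLHarness(
    provider=ProviderConfig(name="openai", reasoning_effort="medium"),
)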

Standard Mode

For general tasks. The orchestrator creates the brief and rubric automatically.

from verif import RLHarness

harness = RLHarness(provider="gemini", enable_search=True)

result = harness.run_single(
    "Compare carbon tax vs cap-and-trade for reducing industrial emissions."
)

print(result.answer)
print(result.rubric)  # Auto-generated

See: examples/standard_mode.py

Plan Mode

For structured execution with explicit control over strategy.

from verif import RLHarness

harness = RLHarness(provider="gemini", enable_search=True)

PLAN = """
## Investigation Phase
1. Research incident postmortem best practices
2. Identify key sections for blameless postmortems

## Writing Phase
3. Write executive summary
4. Document timeline with timestamps
5. Describe root cause analysis
"""

RUBRIC = """
## Structure (40 points)
- [ ] Has executive summary
- [ ] Includes timeline with timestamps
- [ ] Contains root cause analysis

## Blameless Culture (30 points)
- [ ] No individual blame
- [ ] Uses "we" language
"""

result = harness.run_single(
    task="Write a postmortem for a 47-minute database outage.",
    mode="plan",
    plan=PLAN,
    rubric=RUBRIC,  # Optional - omit to auto-create
)

See: examples/plan_mode.py

Explore Mode

For divergent thinking—generate multiple distinct perspectives. Unlike standard mode, explore doesn't optimize for a single "right" answer. It maps the solution space.

How explore differs from standard:

  • No accuracy rubric. Standard mode creates a rubric to verify correctness. Explore uses a quality checklist—are the takes distinct? Do they cover different assumptions?
  • Forces gap identification. Each take must state its assumptions and what would break it. This surfaces blind spots you wouldn't find with a single answer.
  • Quantity over convergence. Standard iterates toward one verified answer. Explore produces N parallel answers that may contradict each other—that's the point.
from verif import RLHarness

harness = RLHarness(provider="gemini", enable_search=True)

result = harness.run_single(
    task="""Explore database architectures for a fintech handling 10K TPS 
    with strong consistency and multi-region deployment.""",
    mode="explore",
    num_takes=3,  # Generate 3 distinct approaches
)

# Result contains multiple takes separated by ===
takes = result.answer.split("===")
for i, take in enumerate(takes, 1):
    print(f"--- Approach {i} ---\n{take[:500]}...")

Each take includes:

  • The solution/recommendation
  • Assumptions: What must be true for this to work (e.g., "assumes budget for multi-region replication")
  • Counterfactual: What could make this fail (e.g., "breaks if latency requirements tighten to <10ms")

The output ends with set-level gaps: what's missing from the entire set? This tells you which angles weren't covered—maybe all takes assumed a single cloud provider, or none considered regulatory constraints. The gaps are often more valuable than the takes themselves.

Use explore when you're not sure what the right question is, or when the "best" answer depends on unstated constraints.

See: examples/explore_mode.py

Iterate Mode

For refining existing work based on user feedback.

# Initial execution
result = harness.run_single(task="Write a market analysis memo.")

# User provides feedback
iterate_result = harness.iterate(
    task="Write a market analysis memo.",
    answer=result.answer,
    rubric=result.rubric,
    feedback="Use 2024 data instead of 2023. Add executive summary.",
    rubric_update="Must address data residency requirements.",  # Optional
)

print(iterate_result.answer)  # Refined version

See: examples/iterate_workflow.py

Checkpoints

Save execution state at every step. Resume from any checkpoint with optional feedback and rubric updates.

from verif import RLHarness

harness = RLHarness(provider="gemini", enable_search=True)

# Run with checkpointing
result = harness.run_single(
    "Analyze the power dynamics among Olympian gods.",
    checkpoint=True,
)

# List checkpoints
for snap_id, snap in harness.snapshots.items():
    print(f"{snap_id} (step {snap.step})")

# Resume from any checkpoint with new direction
resumed = harness.resume(
    checkpoint_id="<snap_id>",
    feedback="Focus more on the Trojan War.",
    rubric_update="Must include analysis of divine intervention in the Iliad.",
)

See: tests/test_checkpoint.py


Explore → Select → Execute

The most powerful pattern: brainstorm, pick the best approach, then execute.

# Stage 1: Explore multiple approaches (TASK is your task string)
explore_result = harness.run_single(task=TASK, mode="explore", num_takes=3)
takes = explore_result.answer.split("===")

# Stage 2: Use an LLM to select the best approach
selector = GeminiProvider()  # provider class from the SDK
selection = selector.generate(f"Pick the best approach:\n{explore_result.answer}")

# Stage 3: Execute with the selected plan
# (selected_plan / selected_rubric are extracted from the selection above)
final_result = harness.run_single(
    task=TASK,
    mode="plan",
    plan=selected_plan,
    rubric=selected_rubric,
)

See: examples/end_to_end_workflow.py


harness = RLHarness(
    provider="gemini",
    enable_search=True,  # Adds search_web tool
)

See: examples/standard_with_search.py

harness = RLHarness(
    provider="gemini",
    enable_bash=True,  # Adds search_files tool (ls, find, grep, cat)
)
from verif.executor import SubprocessExecutor

harness = RLHarness(
    provider="gemini",
    enable_code=True,
    code_executor=SubprocessExecutor("./artifacts"),
    artifacts_dir="./artifacts",
)

The code executor is stateful—variables persist across calls. Files saved to artifacts_dir are tracked and returned.
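
As an illustrative follow-up (the task string and file names are examples, not part of the SDK), a run against the harness configured above might look like this, with produced files landing in ./artifacts:

import os

# Illustrative task: the agent can run Python across multiple execute_code
# calls, and any files it writes under ./artifacts are tracked.
result = harness.run_single(
    "Simulate 1,000 coin flips, compute the running mean, "
    "and save the results as a CSV in the artifacts directory."
)

print(result.answer)
print(os.listdir("./artifacts"))  # files produced during the run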

See: examples/with_code_execution.py

from verif import RLHarness, Attachment, Prompt

# Create attachment with preview
attachment = Attachment(
    content="/path/to/data.csv",
    mime_type="text/csv",
    name="data.csv",
    preview="col1,col2\n1,2\n3,4...",  # First N lines
)

# Build multimodal prompt
prompt: Prompt = [
    "Analyze the attached sales data and create a summary.",
    attachment,
]

result = harness.run_single(prompt)

See: examples/with_files.py

User Clarification (ask_user)

Enable interactive clarification when tasks are ambiguous:

import threading
from verif import RLHarness, ProviderConfig

def on_event(entry, harness):
    if entry.entry_type == "user_question":
        question_id = entry.metadata["question_id"]
        questions = entry.metadata["questions"]
        
        # Generate or collect answers
        answers = {0: "B2B SaaS platform", 1: "$50,000 budget"}
        
        # Send response back (in a thread to not block)
        threading.Thread(
            target=lambda: harness.provider.receive_user_response(question_id, answers)
        ).start()

harness = RLHarness(
    provider="gemini",
    enable_ask_user=True,
    on_event=lambda e: on_event(e, harness),
)

result = harness.run_single("Create a project plan for my product launch.")

The orchestrator can call ask_user to request clarification. Verification is blocked until all pending questions are answered.

See: tests/test_ask_user.py


from verif import RLHarness, HistoryEntry

def on_event(event: HistoryEntry):
    if event.entry_type == "tool_call":
        print(f"→ {event.content}")
    elif event.entry_type == "thought":
        print(f"💭 {event.content[:100]}...")

harness = RLHarness(
    provider="gemini",
    on_event=on_event,
    stream=True,           # Stream orchestrator output
    stream_subagents=True, # Stream subagent output
)

See: examples/with_streaming.py


from verif import RLHarness, ProviderConfig, CompactionConfig
from verif.executor import SubprocessExecutor

harness = RLHarness(
    # Provider: "gemini" | "openai" | "anthropic" | ProviderConfig
    provider=ProviderConfig(
        name="gemini",
        thinking_level="MEDIUM",  # Gemini: LOW | MEDIUM | HIGH
        # OR for OpenAI:
        # name="openai",
        # reasoning_effort="medium",  # low | medium | high
        # OR for Anthropic:
        # name="anthropic",
        # thinking_budget=10000,  # token budget for extended thinking
    ),
    
    # Tool Capabilities
    enable_search=True,     # Web search tool
    enable_bash=False,      # File system navigation
    enable_code=False,      # Python code execution
    enable_ask_user=False,  # User clarification tool
    
    # Code Execution (required if enable_code=True)
    code_executor=SubprocessExecutor("./artifacts"),
    artifacts_dir="./artifacts",
    
    # Execution Limits
    max_iterations=30,
    
    # Mode Selection
    default_mode="standard",  # "standard" | "plan" | "explore"
    
    # Pre-set Rubric (optional)
    rubric="1. Must be accurate\n2. Must cite sources",
    
    # Event Streaming
    on_event=lambda e: print(f"[{e.entry_type}] {e.content[:100]}"),
    stream=True,
    stream_subagents=True,
    
    # Context Compaction (for long tasks)
    compaction_config=CompactionConfig(
        enabled=True,
        threshold=0.8,        # Trigger at 80% context capacity
        keep_recent_turns=3,
    ),
)

result = harness.run_single(task)

result.task          # Original task text
result.answer        # Final submitted answer
result.rubric        # Evaluation rubric used
result.history       # List[HistoryEntry] - full execution trace
result.mode          # Mode used: "standard" | "plan" | "explore"
result.plan          # Plan (if plan mode)
result.brief         # Brief (if available)
# Get formatted history
print(harness.get_history_markdown())
print(harness.get_history_text())

# Access raw entries
for entry in result.history:
    print(f"[{entry.timestamp}] {entry.entry_type}: {entry.content[:100]}")

Tools Available to Orchestrator

| Tool | Description | When Available |
|------|-------------|----------------|
| create_brief | Formalize task requirements | standard, explore |
| create_rubric | Generate evaluation criteria | standard, plan |
| spawn_subagent | Delegate subtasks | All modes |
| search_web | Web search | enable_search=True |
| search_files | File read/search | enable_bash=True |
| execute_code | Python REPL | enable_code=True |
| ask_user | Request user clarification | enable_ask_user=True |
| verify_answer | Check against rubric | standard, plan, iterate |
| verify_exploration | Check quality checklist | explore |
| submit_answer | Submit final answer | All modes |

  • Computer use subagent — Attach a computer-use capable subagent for GUI interaction (filling forms, navigating apps, extracting data from web interfaces).
  • Multi-app workflows — Working across browsers, spreadsheets, and documents in a single run.
  • Parallel verification — Run multiple verification passes and take consensus, reducing single-verifier bias.
  • Rubric quality scoring — Meta-evaluation: score the rubric itself before using it for verification. Catch "always-pass" rubrics early.
  • Structured output from runs — Return typed sections (executive summary, recommendations, evidence) instead of a single answer string.
  • Eval framework — Systematic comparison across providers/modes/rubric strategies on a benchmark task set. run_eval exists but needs scoring and reporting.
  • Token usage tracking — Surface per-run token counts by phase (brief, rubric, execution, verification) for cost analysis.
  • Mixed-model orchestration — Use different models for orchestrator vs subagents (e.g., Opus for orchestration, Flash for search subagents). Currently the same provider handles both. I kept it this way because RL training benefits from a single policy, but for production use the cost savings of routing cheap tasks to smaller models would be significant.

¹ See TOOL_CALLING_GUIDE.md for the philosophy: skip MCP servers, use code as tools.
² See EXTENSIONS.md for creating custom modes and providers.

See examples/outputs/ for sample execution results.



If you're using this for RL training:

Experiment relentlessly. The reward signal for knowledge work is noisy. What works for one task type may fail for another.

Train selectively on the control plane. In my experience, training works best when you focus on:

  • Orchestrator outputs (tool calls, sequencing decisions)
  • Brief creation (task formalization)
  • Rubric creation (evaluation criteria)

Leave out subagent outputs, search results, and code execution from the training signal—even if they're generated by the same policy. The goal is to improve the orchestration and verification layers. Everything else is downstream; if the orchestrator gets better at decomposition and the rubric gets better at capturing intent, the subagents benefit automatically.
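
As a rough sketch of what that filtering could look like (only the "thought" and "tool_call" entry types appear elsewhere in this README; treating everything else as downstream output is an assumption about how the trace is labeled):

# result is the object returned by harness.run_single(...)
# Keep only orchestrator-level entries when assembling a training trace;
# drop subagent output, search results, and code execution results.
CONTROL_PLANE_TYPES = {"thought", "tool_call"}

control_plane = [
    entry for entry in result.history
    if entry.entry_type in CONTROL_PLANE_TYPES
]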

Verification is the bottleneck. Most training gains come from improving the verify step. A model that can accurately assess its own work against a rubric is more valuable than one that generates slightly better first drafts.


Verification is only as good as the model. The rubric is generated by the same model that does the work. If the model has blind spots, the rubric will too. This is a fundamental constraint of self-verification.

External grounding happens at brief level, not verification. If you need external validation (e.g., checking facts against a database), you can provide your own rubric. But be careful: the verifier is intentionally limited—it doesn't have access to search or filesystem. The design assumes grounding happens during task execution (via the brief and subagents), not during verification. The verifier checks internal consistency against the rubric, not external correctness.

Rubrics can be gamed. A sufficiently clever model could write a rubric that's easy to pass. This is why human review of rubrics matters for high-stakes tasks.

Context compaction requires a Gemini API key. Compaction (summarizing mid-context to stay under token limits) uses gemini-3-flash-preview regardless of your chosen provider. If you enable compaction with OpenAI or Anthropic as the orchestrator, you'll still need a GEMINI_API_KEY. Free keys are available from Google AI Studio.
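
For example (a sketch using only the CompactionConfig fields shown in the configuration section above), enabling compaction with an Anthropic orchestrator still requires GEMINI_API_KEY in your environment:

from verif import RLHarness, CompactionConfig

# Orchestrator runs on Anthropic; compaction still calls a Gemini model,
# so GEMINI_API_KEY must be set alongside ANTHROPIC_API_KEY.
harness = RLHarness(
    provider="anthropic",
    compaction_config=CompactionConfig(enabled=True, threshold=0.8),
)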


License: MIT
