GT: [Experimental] Multiplexing Tensor Framework
GT – Experimental multiplexing tensor framework for distributed GPU computing

Original link: https://github.com/bwasti/gt

## GT: A Dynamic, Asynchronous Framework for Distributed GPU Computing

GT is a novel Python framework that aims to simplify distributed GPU computing for machine learning, breaking away from the traditional lock-step approach. Borrowing ideas from multi-core operating systems, it adopts dynamic scheduling and asynchronous execution while presenting a familiar, eager-mode API similar to PyTorch's.

The system consists of clients (the users), a dispatcher, and workers (one per GPU). Clients define tensor operations; the dispatcher translates them into GPU execution instructions, sharding data according to YAML configuration and client "signals"; workers process the instructions asynchronously, optionally JIT-compiling them.

Key features include high-performance transport via ZeroMQ, tape-based automatic differentiation, PyTorch compatibility, and tooling for real-time monitoring, instruction logging, and AI-assisted development. GT prioritizes readability and debuggability, aiming to be easy to understand and modify with the help of AI coding assistants. It is a research prototype focused on simplicity, and contributions are welcome.

From the Hacker News discussion (30 points, 1 comment): almostgotcaught wrote, "Bram always builds really elegant things."

Original article

An experimental multiplexing tensor framework for distributed GPU computing.

[gt_viz animation]

pip install git+https://github.com/bwasti/gt.git
python -c 'import gt; print(gt.randn(2,2))'

The motivation for this project is a rejection of the clunky lock-step paradigm ML researchers tend to use. GT pulls in ideas from decades of development on multi-core operating systems, fully embracing dynamic scheduling and heavily asynchronous execution while presenting a familiar eager frontend.

  • Three components
    • N × clients (as many users as you want!)
    • 1 × dispatcher (for coordinating)
    • N × workers (1 per GPU)
  • Everything communicates with a stream of instructions
    • Clients deal with math. They emit (GPU-unaware) pure functional instructions
    • The dispatcher rewrites these instructions on the fly to be GPU-aware and sends them to the workers
    • Workers asynchronously process these instructions, optionally JIT compiling
  • Instruction streams are annotated
    • Clients can send "signals" which let the dispatcher shard tensors more appropriately (see the sketch after this list)
    • Dispatchers annotate "hot" paths to give hints to workers about JIT compiling
    • Annotations are supplemented with YAML configs that specify sharding and compilation information
    • Every annotation can be safely ignored, so the same code can run anywhere (just remove the YAML)
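For illustration, here is a hedged sketch of signal annotation: gt.signal.context appears in the architecture diagram further down, but the exact YAML schema it pairs with is not shown on this page, so treat the details as assumptions.

import gt

# Annotate a region of the instruction stream with a named signal.
# gt.signal.context is shown in the architecture diagram below; if no
# YAML config refers to 'layer1', the annotation is safely ignored.
with gt.signal.context('layer1'):
    x = gt.randn(100, 64)
    w = gt.randn(64, 64)
    y = x @ w

A complete end-to-end example: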
import gt

# Build a computation with the eager, PyTorch-like API.
a = gt.randn(1000, 1000)
b = gt.randn(1000, 1000)
c = a @ b           # matrix multiply, dispatched to a worker
result = c[:4, :4]  # slice a small 4x4 view of the product
print(result)       # viewing the result forces it to be materialized on the client

It may not look like it, but in the background GT automatically spins up an asynchronous dispatching server and a GPU worker.

  • High-performance transport - ZeroMQ (ZMQ) with automatic message batching and efficient DEALER/ROUTER pattern
  • Autograd support - Tape-based automatic differentiation, handled exclusively at the client layer (see the sketch after this list)
  • PyTorch-compatible API - Familiar syntax for tensor operations
  • Signal-based sharding - Declarative YAML configuration for distributed training
  • Real-time monitoring - htop-style visualization of worker activity
  • Instruction logging - Debug distributed execution with timeline visualizations
  • AI-assisted development - Optimized for collaboration with AI coding assistants
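
As a hedged sketch of the autograd workflow: loss.backward() appears in the architecture diagram below, while requires_grad, sum(), and .grad are assumptions based on the PyTorch-compatible API rather than names documented on this page.

import gt

# Hypothetical autograd usage; flag and attribute names are assumed
# from the PyTorch-compatible API, not confirmed by this page.
w = gt.randn(64, 64, requires_grad=True)  # assumed PyTorch-style flag
x = gt.randn(100, 64)
loss = (x @ w).sum()  # forward ops recorded on the client-side tape
loss.backward()       # tape walked in reverse to produce gradients
print(w.grad)         # assumed PyTorch-style gradient attribute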

📚 Read the full documentation

See examples/ directory for demonstrations:

  • demo.py - Basic tensor operations
  • signal_demo.py - Signal-based sharding
  • compile_demo.py - Compilation directives
  • debug_demo.py - Debug utilities
  • visualize_demo.py - Instruction tape visualization
┌─────────────────────────────────────────────────────────────────┐
│                          User Code                              │
│  import gt                                                      │
│  with gt.signal.context('layer1'):                              │
│      x = gt.randn(100, 64)                                      │
│      loss = model(x)                                            │
│      loss.backward()                                            │
└──────────────────────┬──────────────────────────────────────────┘
                       │ PyTorch-like API + Signal Metadata
                       │
┌──────────────────────▼──────────────────────────────────────────┐
│                      gt/client/                                 │
│  ┌──────────────┐  ┌─────────────┐  ┌──────────────┐            │
│  │   Tensor     │  │  Autograd   │  │  nn.Module   │            │
│  │ (Remote Data)│  │   (Tape)    │  │  (Layers)    │            │
│  └──────────────┘  └─────────────┘  └──────────────┘            │
└──────────────────────┬──────────────────────────────────────────┘
                       │ ZMQ (DEALER → ROUTER)
                       │
┌──────────────────────▼──────────────────────────────────────────┐
│                    gt/dispatcher/                               │
│  • ZMQ ROUTER socket handles all connections                    │
│  • Reads signal configs from YAML                               │
│  • Routes operations based on sharding strategy                 │
│  • Logs instruction stream to file                              │
│  • Handles multiple clients concurrently                        │
└───────┬──────────────┬──────────────┬───────────────────────────┘
        │              │              │ ZMQ (DEALER ← ROUTER)
        │              │              │
    ┌───▼────┐    ┌───▼────┐    ┌───▼────┐
    │Worker 0│    │Worker 1│    │Worker N│ (1 per GPU)
    │PyTorch │    │PyTorch │    │PyTorch │
    │  GPU   │    │  GPU   │    │  GPU   │
    └────────┘    └────────┘    └────────┘

Optimized for AI Development

GT is designed to be understood, modified, and debugged with AI coding assistants:

  • CLAUDE.md - Detailed architecture documentation for AI assistants
  • Declarative YAML configs - Easy for AI to parse and generate
  • Tape-based debugging - Inspect computation graphs with gt.debug.print_tape() (see the sketch after this list)
  • Instruction logging - Track every operation with timestamps
  • Comprehensive test suite - 50+ tests serving as executable specifications
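
A minimal debugging sketch: gt.debug.print_tape() is the entry point named in the list above, while the surrounding tensor calls reuse the PyTorch-like API from earlier examples.

import gt

# Record a couple of operations, then dump the client-side
# computation tape for inspection.
a = gt.randn(32, 32)
b = a @ a
gt.debug.print_tape()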

Contributions welcome! This is a research prototype focused on simplicity and readability.

See Contributing Guide for development workflow, testing, code style, and PR guidelines.

MIT

See License for details.
