Cedana (YC S23) Is Hiring

Original link: https://www.ycombinator.com/companies/cedana/jobs/d1vYocG-forward-deployed-engineer-ai-hpc

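**Cedana** is an AI/HPC infrastructure startup that maximizes cluster reliability and throughput through automated, transparent GPU checkpointing. Cedana migrates workloads without code changes, addressing key efficiency bottlenecks in Kubernetes and SLURM environments and keeping research and production jobs running smoothly through hardware failures.

Cedana is hiring a **Forward Deployed Engineer** to lead end-to-end integrations at customer sites, including research universities, neoclouds, and Fortune 100 enterprises. The role involves deploying and configuring complex HPC stacks, troubleshooting low-level Linux/kernel issues, and driving product innovation from field experience.

**The ideal candidate has:**

* 3-10 years of relevant experience, with deep knowledge of SLURM configuration and the Linux kernel (cgroups, namespaces, networking).
* Production experience with Kubernetes, GPU orchestration, and customer-facing deployments.
* A hands-on approach to solving complex, full-stack infrastructure problems.

The role is remote within the US with about 25% travel, pays $140,000 to $180,000, and includes equity. Join an elite team of researchers and experienced founders building the next generation of reliable AI compute platforms.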

Original text

The Problem

AI and HPC infrastructure suffers from scarcity and high costs, so when failures happen they are costly in terms of time and money. Cluster productivity directly determines research output and revenue. Achieving high utilization and throughput is increasingly challenging due to the complexity of workloads, hardware, and operations.

Cedana’s Solution

Cedana maximizes AI+HPC cluster utilization and reliability with automated GPU checkpointing infrastructure. We enable transparent and fast migration of GPU workloads across instances, without losing work. Workloads automatically migrate to achieve new levels of reliability and throughput while accelerating time to results. Our system operates at the kernel/OS level, requires no code or config changes, and works seamlessly with Kubernetes, SLURM, and NVIDIA Dynamo. Today, we're deploying into leading inference platforms, neoclouds, enterprises, and research clusters.

The Team

Cedana's founding team has spent over a decade making computation run fast, productively, and reliably for AI. Our research appears in NeurIPS and CVPR. We published some of the earliest formal methods for guaranteeing convergence in distributed training. At Shopify, we deployed warehouse automation and robot fleets, building behavior trees, fleet control planes, and OTA infrastructure that performs reliably over constrained networks. We bring repeat founder experience, having built and exited a healthcare AI company.

What you’ll own

As a Forward Deployed Engineer at Cedana, you’ll lead and own technical engagement from end to end. You’ll engage with customers to understand and deploy in their environments: production SLURM at a university, bare-metal Kubernetes at an inference provider, or a hybrid setup at a Fortune 100 pharma enterprise. You’ll rapidly understand their key pain points and use Cedana to solve their problems. For each customer, you own everything from the OS up: SLURM plugins, Kubernetes operators, node configuration, networking, and observability.

This role will expose you to the cutting edge of AI and HPC infrastructure, working with the world’s leading research and commercial customers to deliver a breakthrough solution.

What you'll do

Engineer solutions at client sites: Lead customer integrations. Install, configure, and deploy Cedana into SLURM, Kubernetes, and Dynamo environments.
Drive product innovation from the field: Identify technical gaps while embedded with clients, then provide product feedback for new capabilities that become core product features.
Measure and optimize platform performance: Measure reliability, throughput, and performance using our internal tools. Design and implement policy-based migration automations to optimize all three.
Own critical deployments: Ensure our platform performs reliably for clients' critical operations, debugging issues across the full stack. Debug install issues against unfamiliar customer infrastructure, and escalate to engineering when necessary.
Improve scalability: Build the internal install playbook so the second customer in each segment is faster than the first.
Respect our customers: Understand ways to make their lives easier and minimize the time and overhead we ask of them.

What we are looking for

3-10 years of software engineering experience with a track record of configuring and managing SLURM deployments.
A multi-month enterprise or research deployment you led end-to-end, from scoping through signoff. You write effective status updates that keep your team informed and on schedule.
Production experience standing up SLURM in a customer or research environment. You've configured slurmctld, slurmdbd, accounting, cgroup integration, and GPU resource selection.
Strong Linux fundamentals: systemd, cgroups v2, namespaces, networking, filesystems, kernel module loading, and PAM session modules. You read strace and dmesg output and form a hypothesis.
Working knowledge of Kubernetes operations, including operators, CRDs, device plugins, and node-level debugging. You've debugged a controller in production even if you haven't written one from scratch.

Bonus if you have

Experience on an HPC integrator's field team.
Client-facing technical experience working directly with customers.
Background in national lab user services or university research computing.
Experience developing SLURM plugins, with an understanding of their architecture and how they fit into the overall platform.
Familiarity with CRIU, container runtimes, GPU driver internals, and distributed training stacks.
Hands-on with NVIDIA Dynamo, Determined, Ray, Kueue, KServe, or comparable AI orchestration.
Contributed to open-source schedulers or job systems (SLURM, Flux, Torque, PBS).
A passion for debugging a weird cgroup issue at 11pm just as much as writing a clean install playbook the next morning.

Logistics

Remote, US-based. ~25% travel for customer installs.
Base $140,000–$180,000 + meaningful early-stage equity.

Benefits

100% covered medical, dental, and vision insurance for employees and families
Unlimited PTO policy
401(k) plan

Equal Opportunity Employer

Cedana is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, age, protected veteran status, or disability status.