Forward Deployed Engineer: AI + HPC at Cedana | Y Combinator


Cedana

Fast, reliable, reproducible AI with GPU live migration

Forward Deployed Engineer: AI + HPC

$140K - $180K • 0.10% - 0.25% • US / Remote (US)

**Job type**

Full-time

**Role**

Engineering, Backend

**Experience**

6+ years

**Visa**

US citizen/visa only

Neel Master

Founder

About the role

**Introducing Cedana**

**The Problem**

AI and HPC infrastructure is scarce and expensive, so failures are costly in both time and money. Cluster productivity directly determines research output and revenue. Achieving high utilization and throughput is increasingly difficult because of the complexity of workloads, hardware, and operations.

**Cedana’s Solution**

Cedana maximizes AI+HPC cluster utilization and reliability with automated GPU checkpointing infrastructure. We enable fast, transparent migration of GPU workloads across instances without losing work. Workloads migrate automatically to improve reliability and throughput while accelerating time to results. Our system operates at the kernel/OS level, requires no code or config changes, and works seamlessly with Kubernetes, SLURM, and NVIDIA Dynamo. Today, we're deploying into leading inference platforms, neoclouds, enterprises, and research clusters.

**The Team**

Cedana's founding team has spent over a decade making computation run fast, productively, and reliably for AI. Our research appears in NeurIPS and CVPR. We published some of the earliest formal methods for guaranteeing convergence in distributed training. At Shopify, we deployed warehouse automation and robot fleets, building behavior trees, fleet control planes, and OTA infrastructure that performs reliably over constrained networks. We also bring repeat-founder experience, having built and exited a healthcare AI company.

**The Role**

**What you’ll own**

As a Forward Deployed Engineer at Cedana, you’ll lead and own technical engagements from end to end. You’ll work with customers to understand their environments and deploy into them: from production SLURM at a university, to bare-metal Kubernetes at an inference provider, to a hybrid setup at a Fortune 100 pharma enterprise. You’ll rapidly understand their key pain points and use Cedana to solve their problems. For each customer, you own everything from the OS up: SLURM plugins, Kubernetes operators, node configuration, networking, and observability.

This role will expose you to the cutting edge of AI and HPC infrastructure, working with the world’s leading research and commercial customers to deliver a breakthrough solution.

**What You'll Do**

- **Engineer solutions at client sites:** Lead customer integrations. Install, configure, and deploy Cedana into SLURM, Kubernetes, and Dynamo environments.

- **Drive product innovation from the field:** Identify technical gaps while embedded with clients, then provide product feedback for new capabilities that become core product features.

- **Measure and optimize platform performance:** Measure reliability, throughput, and performance using our internal tools. Design and implement policy-based migration automations to optimize reliability, throughput, and performance.

- **Own critical deployments:** Ensure our platform performs reliably for clients' critical operations, debugging issues across the full stack. Debug install issues against unfamiliar customer infrastructure, and escalate to engineering when necessary.

- **Improve scalability:** Build and own the internal installation playbook so that the second customer in each segment is onboarded faster than the first.

- **Respect our customers:** Understand how to make their lives easier and minimize their time and overhead.

What we are looking for

- Team management experience. Requires strong project and time management skills, delivering milestones on time, and effective

- 3-10 years of software engineering experience with a track record of configuring and managing SLURM deployments.

- A multi-month enterprise or research deployment you led end-to-end, from scoping through signoff. You write effective status updates to keep your team updated and on schedule.

- Production experience in standing up SLURM in a customer or research environment. You've configured slurmctld, slurmdbd, accounting, cgroup integration, and GPU resource selection.

- Strong Linux fundamentals: systemd, cgroups v2, namespaces, networking, filesystems, firewalls, kernel module loading, and PAM session modules. You can read strace and dmesg output and form a hypothesis.

- Experience with Kubernetes operations including operators, CRDs, CNIs, device plugins, and node-level debugging. You've debugged a controller in production even if you haven't written one from scratch.

Bonus if you have

- Experience in an HPC integrator field team.

- Client-facing technical experience working directly with customers.

- Background in national lab user services or university research computing.

- You’ve developed SLURM plug-ins, and understand their architecture and how they fit into the overall platform.

- Familiarity with CRIU, container runtimes, GPU driver internals, and distributed training stacks.

- Hands-on with NVIDIA Dynamo, Determined, Ray, Kueue, KServe, or comparable AI orchestration.

- Contributed to open-source schedulers or job systems (SLURM, Flux, Torque, PBS).

- A passion for debugging a weird cgroup issue at 11pm just as much as writing a clean install playbook the next morning.

**Logistics**

- Remote, US-based. ~25% travel for customer installs.

- Base $140,000–$180,000 + meaningful early-stage equity.

**Benefits**

- 100% covered medical, dental, and vision insurance for employees and families

- Unlimited PTO policy

- 401K Plan

**Equal Opportunity Employer**

Cedana is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, age, protected veteran status, or disability status.

About the interview

- Initial interview for fit

- Written component to understand background and motivation. Not a coding test.

- Interviews with engineering team.

- References

About Cedana

Cedana is pause/migrate/resume for compute workloads. We're building a global, real-time system for compute. This means a paradigm shift in how resources are allocated to workloads such as high-performance computing, numerical simulation, and training and running machine learning models. We take a systems-level, deep-tech approach to these problems, working at the Linux kernel layer and with hardware.

Cedana

Founded: 2023

Batch: S23

Team Size: 5

Status: Active

Location: New York

Founders

Neel Master

Founder

Niranjan Ravichandra

Founder
