Company Name: Captions
Location: Union Square, New York City, United States (In-person at NYC HQ)
Job Type: Full-time
Salary Range: $160,000 – $250,000 yearly (Offers Equity)
Industry: AI Research / Machine Learning / Video AI / Software
Job Overview
Captions is an AI research company working in machine learning and video AI, dedicated to transforming how video content is created and understood. We are seeking a highly skilled and experienced Member of Technical Staff, ML Ops to join our team in Union Square, New York City. This full-time, experienced-to-senior-level role is an opportunity to design, build, and optimize the scalable ML infrastructure that powers our generative models and shapes the future of our video AI software.
As a Member of Technical Staff, ML Ops, you will be instrumental in developing and optimizing distributed training frameworks for multimodal data, managing GPU clusters, and implementing creative solutions for cost optimization. You will leverage your strong programming skills in Python and systems programming, deep expertise in PyTorch internals, and experience with containerization and orchestration to build reliable and performant large-scale ML systems. If you love diving deep into complex systems optimization, thrive in fast-paced, research-driven environments, and are eager to work directly with researchers on cutting-edge ML problems, Captions invites you to contribute your expertise to our groundbreaking mission.
Duties and Responsibilities
- Develop and optimize distributed training frameworks for multimodal data, ensuring high performance and scalability.
- Build flexible systems for cross-modal training orchestration and efficient experimentation.
- Design reproducible training environments and implement comprehensive testing frameworks for ML models.
- Create infrastructure for systematic model quality assessment and benchmarking.
- Design and manage GPU clusters optimized for distributed training, ensuring efficient resource utilization.
- Build out comprehensive automated metrics collection and alerting systems for ML pipelines.
- Profile and optimize model training throughput using advanced techniques.
- Develop custom CUDA and Triton kernels to accelerate compute paths.
- Implement creative solutions for cost optimization across various cloud instances.
- Design and optimize real-time inference systems for rapid research iteration.
- Build infrastructure enabling rapid testing of research hypotheses.
- Create systems supporting close collaboration between infrastructure and research teams.
- Develop frameworks for reproducible research experimentation.
- Enable seamless deployment of research innovations to production.
- Demonstrate strong programming skills in Python and systems programming.
- Possess experience with distributed systems and scalable infrastructure.
- Maintain a proven track record of building reliable, performant large-scale ML systems.
- Apply deep expertise in PyTorch internals and distributed training frameworks (FSDP, DeepSpeed).
- Possess experience with GPU cluster management and optimization.
- Demonstrate proficiency in performance profiling and systems optimization.
- Apply expertise in CUDA programming and kernel optimization.
- Possess strong experience with containerization and orchestration (Docker, Kubernetes).
- Have experience with ML model serving and deployment at scale.
- Be familiar with language models, attention mechanisms, video/audio processing pipelines, and large-scale diffusion models.
- Love diving deep into complex systems optimization challenges.
- Take ownership of critical infrastructure and collaborate effectively with cross-functional teams.
- Be excited about pushing the boundaries of ML system performance.
- Want to work directly with researchers on cutting-edge ML problems.
- Thrive in fast-paced, research-driven environments.
Qualifications
- Experience Level: Experienced to Senior-Level. The role calls for strong hands-on experience, deep technical expertise, and the ability to architect and scale complex ML infrastructure.
- Required Skills and Experience:
  - Education: Bachelor’s or Master’s degree in Computer Science, Machine Learning, or a related field.
  - Technical Background:
    - Strong programming skills in Python and systems programming.
    - Experience with distributed systems and scalable infrastructure.
    - Proven track record of building reliable, performant large-scale ML systems.
    - Deep expertise in PyTorch internals and distributed training frameworks (FSDP, DeepSpeed).
    - Experience with GPU cluster management and optimization.
    - Proficiency in performance profiling and systems optimization.
    - CUDA programming and kernel optimization.
    - Strong experience with containerization and orchestration (Docker, Kubernetes).
    - Experience with ML model serving and deployment at scale.
    - Familiarity with language models, attention mechanisms, video/audio processing pipelines, and large-scale diffusion models.
  - Engineering Approach:
    - Loves diving deep into complex systems optimization challenges.
    - Takes ownership of critical infrastructure and collaborates effectively.
    - Excited about pushing the boundaries of ML system performance.
    - Wants to work directly with researchers on cutting-edge ML problems.
    - Thrives in fast-paced, research-driven environments.
Salary and Benefits
Captions offers an annual salary of $160,000 to $250,000 for this full-time Member of Technical Staff, ML Ops position. The compensation package also includes equity in the company. We believe in rewarding top talent and fostering a dynamic work environment. Beyond salary and equity, Captions is committed to providing a comprehensive benefits package designed to support your overall well-being and professional growth, which typically includes health, dental, and vision insurance, generous paid time off, and opportunities for continuous professional development in AI research.
Working Conditions
This is a full-time, in-person position based at our NYC HQ in Union Square, New York City, United States. You will work in a highly collaborative office environment, engaging directly with researchers, ML engineers, and software development teams. The role demands exceptional technical expertise in large-scale ML systems, strong programming skills, and the ability to design and optimize critical infrastructure. You will be expected to push the boundaries of ML system performance and contribute to the deployment of cutting-edge AI models. Standard business hours are generally observed.
Why Work with Us
At Captions, you’re not just joining a company; you’re becoming part of a team that’s redefining the future of video content creation through AI Research and cutting-edge Video AI technology. We are a pioneering force, building sophisticated software that empowers users with unimaginable creative capabilities. As a Member of Technical Staff, ML Ops, your role is pivotal in architecting and scaling the robust ML infrastructure that fuels our innovation.
We offer a challenging yet incredibly rewarding environment where your expertise in Machine Learning Operations (ML Ops), distributed systems, and GPU optimization will be highly valued. You will be empowered to develop distributed training frameworks, manage GPU clusters, and directly contribute to state-of-the-art generative AI. If you are a results-driven ML Ops professional with a clear passion for pushing the boundaries of AI, and eager to make a tangible impact on a rapidly evolving software landscape, Captions offers an unparalleled opportunity for your next career chapter.