Company Name: Captions
Location: Union Square, New York City, United States (In-person at NYC HQ)
Job Type: Full-time
Salary Range: $160K – $250K yearly (Offers Equity)
Industry: AI Research / Video AI / Software (specifically video creation using generative models)
Job Overview
Captions is a pioneering company at the forefront of AI Research, Machine Learning, and Video AI, dedicated to transforming how video content is created and understood using groundbreaking generative models. We are seeking an exceptional Member of Technical Staff (MOTS), Model Evaluation (Research Engineer) to join our team in the heart of Union Square, New York City. This Full-time, Senior-Level / Principal-Level role offers a pivotal opportunity to drive the research, design, and implementation of cutting-edge evaluation methodologies for large-scale multimodal diffusion models, directly shaping the future of video creation software.
As a Member of Technical Staff, Model Evaluation, you will be instrumental in systematically analyzing and improving model behavior, designing new evaluation metrics, and validating research directly through product deployment and user feedback. You will leverage your deep expertise in generative modeling, distributed training systems, and rigorous empirical research to ensure our models are robust, reliable, and perform optimally at massive scale. If you have a strong track record of research contributions at top ML conferences, thrive on tackling complex technical and research challenges, and are passionate about bringing state-of-the-art AI to real-world applications, Captions invites you to contribute your expertise to our groundbreaking mission.
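To give candidates a concrete sense of the evaluation work described above, here is a minimal, illustrative sketch of the Fréchet-distance computation that underlies FID/FVD-style generation-quality metrics. It is not Captions' internal tooling: the feature extractor is assumed to run upstream, and the array shapes and function names are placeholders.

```python
# Minimal sketch: Frechet distance between real and generated feature
# distributions, the computation behind FID/FVD-style quality metrics.
# Feature extraction (e.g., from a pretrained video or image backbone) is
# assumed to happen upstream; inputs are (num_samples, feature_dim) arrays.
import numpy as np
from scipy import linalg


def frechet_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 * sqrt(C_r @ C_g))."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)

    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):  # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(0.0, 1.0, size=(2048, 512))   # stand-in for real-sample features
    fake = rng.normal(0.1, 1.1, size=(2048, 512))   # stand-in for generated-sample features
    print(f"Frechet distance: {frechet_distance(real, fake):.4f}")
```

In practice, features for video would typically come from a pretrained spatio-temporal backbone, and distribution-level metrics like this are reported alongside human preference studies and product signals.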
Duties and Responsibilities
- Design and implement novel architectures for large-scale video and multimodal diffusion models.
- Develop new approaches to multimodal fusion, temporal modeling, and video control.
- Research temporal video editing and controllable generation techniques.
- Research and validate scaling laws for video generation models.
- Create new loss functions and training objectives for improved generation quality (see the diffusion-objective sketch after this list).
- Drive rapid experimentation with model architectures and training strategies.
- Validate research directly through product deployment and user feedback, ensuring real-world impact.
- Train and optimize models at massive scale (tens to hundreds of billions of parameters).
- Develop sophisticated distributed training approaches using FSDP, DeepSpeed, and Megatron-LM (see the FSDP sketch after this list).
- Design and implement model surgery techniques (pruning, distillation, quantization) for model refinement.
- Create new approaches to memory optimization and training efficiency.
- Research techniques for improving training stability at scale.
- Conduct systematic empirical studies of architecture and optimization choices.
- Advance the state-of-the-art in video model architecture design and optimization.
- Develop new approaches to temporal modeling for video generation.
- Create novel solutions for multimodal learning and cross-modal alignment.
- Research and implement new optimization techniques for generative modeling and sampling.
- Design and validate new evaluation metrics for generation quality.
- Systematically analyze and improve model behavior across different regimes.
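For illustration only (not Captions' training stack), the following sketch shows the standard epsilon-prediction denoising objective that most diffusion training objectives build on; the toy denoiser and linear noise schedule are assumptions made purely for the example.

```python
# Minimal sketch of the standard epsilon-prediction diffusion objective:
# corrupt clean latents with noise at a random timestep, ask the model to
# predict that noise, and take an MSE loss. The tiny MLP "denoiser" and the
# linear beta schedule are placeholders for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000                                   # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)      # simple linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)


def diffusion_loss(model: nn.Module, x0: torch.Tensor) -> torch.Tensor:
    """MSE between true and predicted noise at a uniformly sampled timestep."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)
    noise = torch.randn_like(x0)

    a_bar = alphas_cumprod.to(x0.device)[t].view(b, *([1] * (x0.dim() - 1)))
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise   # forward process q(x_t | x_0)

    pred = model(x_t, t)                    # denoiser predicts the added noise
    return F.mse_loss(pred, noise)


class ToyDenoiser(nn.Module):
    """Placeholder denoiser; a real video model would be a spatio-temporal transformer or U-Net."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        t_feat = (t.float() / T).unsqueeze(-1)              # crude timestep conditioning
        return self.net(torch.cat([x_t, t_feat], dim=-1))


if __name__ == "__main__":
    model = ToyDenoiser()
    x0 = torch.randn(8, 64)                 # stand-in for clean latents
    print("loss:", diffusion_loss(model, x0).item())
```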
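As a hedged illustration of the distributed-training work referenced in the list above, the sketch below shards a small model with PyTorch FSDP. The toy transformer, hyperparameters, and torchrun launch assumption are illustrative; production setups with DeepSpeed or Megatron-LM differ substantially.

```python
# Minimal sketch: sharding a model with PyTorch FSDP for data-parallel training.
# Intended to be launched with `torchrun --nproc_per_node=<gpus> fsdp_sketch.py`;
# the toy transformer and random data are placeholders, not a production setup.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def main() -> None:
    dist.init_process_group(backend="nccl")          # torchrun sets RANK/WORLD_SIZE env vars
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    model = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
        num_layers=6,
    ).cuda()

    # Wrap: parameters, gradients, and optimizer state are sharded across ranks.
    model = FSDP(model)
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):                            # toy training loop on random data
        x = torch.randn(4, 128, 512, device="cuda")
        loss = model(x).pow(2).mean()
        loss.backward()
        optim.step()
        optim.zero_grad()
        if dist.get_rank() == 0:
            print(f"step {step}: loss {loss.item():.4f}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```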
Qualifications
- Experience Level: Senior-Level / Principal-Level (a highly experienced individual contributor specializing in ML research, with a Master’s or PhD, a strong research track record, and hands-on experience with large-scale models).
- Education Requirement: Master’s or PhD in Computer Science, Machine Learning, or related field.
- Required Skills:
- Research Experience: track record of research contributions at top ML conferences (NeurIPS, ICML, ICLR); demonstrated experience implementing and improving upon state-of-the-art architectures; deep expertise in generative modeling approaches (diffusion, autoregressive, VAEs, etc.); strong background in optimization techniques and loss function design; experience with empirical scaling studies and systematic architecture research.
- Technical Expertise: strong proficiency in modern deep learning tooling (PyTorch, CUDA, Triton, FSDP, etc.); experience training diffusion models with 10B+ parameters (experience with very large language models of 200B+ parameters is a plus); deep understanding of attention, transformers, and modern multimodal architectures; expertise in distributed training systems and model parallelism; proven ability to implement and improve complex model architectures; track record of systematic empirical research and rigorous evaluation.
- Engineering Capabilities: ability to write clean, modular research code that scales; strong software engineering practices, including testing and code review; experience with rapid prototyping and experimental design; strong analytical skills for debugging model behavior and training dynamics; facility with profiling and optimization tools (a brief profiling sketch follows this section); track record of bringing research ideas to production; experience maintaining high code quality in a research environment.
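As a brief, illustrative example of the profiling fluency mentioned under Engineering Capabilities (not a prescribed workflow), the sketch below uses torch.profiler to attribute time in a forward/backward pass of a placeholder model.

```python
# Minimal sketch: using torch.profiler to attribute time in a forward/backward
# pass. The small MLP and input sizes are placeholders purely for illustration.
import torch
import torch.nn as nn
from torch.profiler import ProfilerActivity, profile, record_function

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
x = torch.randn(64, 1024, device=device)

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, record_shapes=True) as prof:
    for _ in range(10):
        with record_function("train_step"):
            loss = model(x).pow(2).mean()
            loss.backward()

# Sort operators by self time to find the hot spots worth optimizing.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```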
Salary and Benefits
Captions offers an exceptional annual salary ranging from $160K – $250K for this Full-time Member of Technical Staff, Model Evaluation position. The compensation package also includes equity in the company. We believe in rewarding top talent and fostering a dynamic work environment. Beyond salary and equity, Captions is committed to providing a comprehensive benefits package designed to support your overall well-being and professional growth, which typically includes robust health, dental, and vision insurance, generous paid time off, and opportunities for continuous professional development at the cutting edge of AI research.
Working Conditions
This is a Full-time position based in-person at NYC HQ in Union Square, New York City, United States. You will work within a highly collaborative and innovative office environment, engaging directly with researchers, ML engineers, and software development teams. The role demands exceptional technical expertise in large-scale generative models, strong software engineering practices, and the ability to conduct rigorous empirical research. You will be expected to drive rapid experimentation and contribute to the deployment of cutting-edge AI models. Standard business hours are generally observed.
Why Work with Us
At Captions, you’re not just joining a company; you’re becoming part of a team that’s redefining the future of video content creation through AI Research and cutting-edge Video AI technology. We are a pioneering force, building sophisticated software that empowers users with unprecedented creative capabilities, with a specific focus on video creation using generative models.
We offer a challenging yet incredibly rewarding environment where your expertise in Model Evaluation, distributed training, and model optimization will be highly valued. You will be empowered to design novel architectures, research temporal video editing, and contribute directly to state-of-the-art generative AI. If you are a results-driven researcher with a passion for pushing the boundaries of AI and are eager to make a tangible impact on a rapidly evolving software landscape, Captions offers an unparalleled opportunity for your next career chapter.