Research Scientist (Model Evaluation)
Sanas
Sanas is pioneering the future of human communication. Founded by a team of Stanford researchers and entrepreneurs with deep industry experience, Sanas has developed the world's first real-time speech AI platform capable of accent translation, noise cancellation, speech enhancement, cross-language communication, and more.
Sanas makes conversations clearer, more inclusive, and more effective, removing barriers that prevent people from being understood, regardless of accent, background noise, or native language.
Sanas is currently one of the fastest-growing startups in Silicon Valley, growing from $16M to $50M ARR in 2025. The company's core business is profitable and on track to end 2026 with >$120M ARR. Our team combines deep expertise in model innovation and systems engineering with a design-minded product engineering culture to build and ship cutting-edge AI models and experiences — entirely in-house.
Sanas is a 180-strong team, established in 2020. In that short span, we've secured over $100 million in funding from the industry's leading investors, including Insight Partners, Google Ventures, Quadrille Capital, General Catalyst, Quiet Capital, and others. Our reputation is further solidified by collaborations with numerous Fortune 100 companies. With Sanas, you're not just adopting a product; you're investing in the future of communication.
If you’re looking to play a significant role in shaping the roadmap and driving technical direction, to ship big, challenging ideas without heavy process or overhead, and to leave your mark on an ambitious, generational mission to change how the world thinks about speech + AI, then Sanas is a well-suited place for you.
About the Role
Progress in speech AI is only as meaningful as our ability to measure it. At Sanas, model quality spans dimensions that automated metrics struggle to capture — accent naturalness, perceptual clarity, speaker identity preservation, noise suppression without speech distortion, translation fluency under real-world disfluency. We're looking for a Research Scientist who can define what "better" actually means across all of Sanas's model families, build the evaluation infrastructure to measure it rigorously, and close the loop between research progress and real-world impact. This role sits at the intersection of research, product, and infrastructure — and directly shapes how every model team at Sanas measures progress.
Job Description
Evaluation framework design
- Design and own evaluation frameworks across Sanas's full model portfolio — Accent Translation, Noise Cancellation, Speech Enhancement, Language Translation, and more — ensuring each captures meaningful progress, not just benchmark performance.
- Develop novel quantitative metrics for subjective and perceptual qualities: accent similarity, naturalness, speaker identity preservation, intelligibility under noise, and translation fluency in spoken-language domains.
- Build evaluation systems that bridge automated metrics and human judgment — designing listening studies, MOS/MUSHRA protocols, and preference tests that are statistically rigorous and operationally scalable (see the MOS aggregation sketch after this list).
- Define evaluation splits, test sets, and benchmark suites that accurately reflect production conditions — diverse accents, languages, noise environments, recording devices, and telephony codecs.
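To make "statistically rigorous" concrete, here is a minimal sketch of aggregating per-system listening-study scores into a MOS with bootstrap confidence intervals. It is illustrative only: the function name, data shapes, and ratings are assumptions, not Sanas code.

```python
import numpy as np

def mos_with_ci(ratings, n_boot=10_000, alpha=0.05, seed=0):
    """Mean opinion score plus a bootstrap (1 - alpha) confidence interval.

    ratings: 1-D array of listener scores (e.g. a 1-5 Likert scale) for one system.
    """
    ratings = np.asarray(ratings, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample ratings with replacement and recompute the mean each time.
    boots = rng.choice(ratings, size=(n_boot, ratings.size), replace=True).mean(axis=1)
    lo, hi = np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(ratings.mean()), float(lo), float(hi)

# Hypothetical ratings for two systems on the same test clips.
system_a = [4, 4, 5, 3, 4, 4, 5, 4, 3, 4]
system_b = [3, 4, 3, 3, 4, 3, 4, 3, 3, 4]
for name, scores in [("A", system_a), ("B", system_b)]:
    mos, lo, hi = mos_with_ci(scores)
    print(f"system {name}: MOS {mos:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

Reporting the interval alongside the mean makes it clear whether a gap between two systems is real signal or listener noise.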
Evaluation infrastructure & tooling
- Build and maintain automated evaluation pipelines that run continuously against model checkpoints — surfacing regressions early and tracking quality trends across training runs.
- Develop reference-based and reference-free metrics calibrated to Sanas's specific model tasks: SI-SDR, PESQ, STOI, DNSMOS, speaker similarity, WER delta, COMET, and task-specific custom metrics where off-the-shelf measures fall short (an SI-SDR sketch follows this list).
- Instrument model quality monitoring in production — detecting degradation across language pairs, accent profiles, and acoustic conditions in live customer traffic.
- Build tooling that allows research scientists and ML engineers to run rigorous ablations, compare model versions, and understand quality tradeoffs without needing to design the evaluation from scratch each time.
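As a concrete example of a reference-based metric from the list above, here is a minimal SI-SDR sketch in plain NumPy, following the standard definition (Le Roux et al., 2019). The names and synthetic signals are assumptions; in a real pipeline a metric like this would sit behind a common interface so every checkpoint comparison runs the same suite.

```python
import numpy as np

def si_sdr(reference, estimate, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    # Mean-center so a DC offset doesn't affect the projection.
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to isolate the target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return float(10 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps)))

# Sanity check on synthetic signals: a scaled, lightly corrupted copy of the
# reference scores high; an uncorrelated signal scores strongly negative.
rng = np.random.default_rng(0)
ref = rng.standard_normal(16_000)                      # 1 s of "audio" at 16 kHz
good = 0.5 * ref + 0.01 * rng.standard_normal(16_000)  # scaled copy + slight noise
bad = rng.standard_normal(16_000)                      # unrelated signal
print(f"good estimate: {si_sdr(ref, good):.1f} dB")    # roughly +34 dB
print(f"bad estimate:  {si_sdr(ref, bad):.1f} dB")     # far below 0 dB
```

Scale invariance is the point here: multiplying the estimate by any nonzero gain leaves the score unchanged, which keeps loudness differences from masking real quality differences.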
Human evaluation & research
- Design and operate human evaluation programs — listener panels, crowdsourced annotation, and expert evaluator workflows — that produce reliable signal on dimensions automated metrics cannot capture.
- Conduct research into evaluation methodology itself: when do automated metrics correlate with human perception, when do they diverge, and what does that tell us about model behavior? (A metric-vs-listener correlation sketch follows this list.)
- Partner directly with research scientists across model teams to translate open-ended quality questions into concrete, measurable evaluation protocols.
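As a small illustration of the methodology question above, a rank correlation between per-clip metric scores and mean listener ratings is a common first check, since many metrics are only expected to be monotonically related to perception rather than linearly. The numbers below are fabricated purely for illustration.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-clip automated scores (e.g. a DNSMOS-style predictor)
# and the mean human MOS for the same clips.
metric = np.array([2.9, 3.4, 3.1, 4.0, 3.7, 2.5, 3.9, 3.3])
human = np.array([3.2, 3.5, 3.0, 4.3, 3.6, 2.8, 4.1, 3.1])

rho, p = spearmanr(metric, human)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
# A low rho on some slice (one accent, one noise condition) flags where the
# automated metric diverges from listeners and needs a closer look.
```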
Cross-functional impact
- Work closely with ML research, product, and customer success teams to ensure evaluation reflects what customers actually experience — not just what lab conditions optimize for.
- Feed evaluation insights back into data acquisition and model training priorities — identifying which failure modes require more data, architectural changes, or training procedure improvements.
- Communicate evaluation results clearly to both technical and non-technical stakeholders, translating metric movements into product quality narratives that inform roadmap decisions.
Qualifications
- 4+ years of research or applied research experience in speech, audio, or NLP, with a demonstrated focus on evaluation methodology and quality measurement.
- Deep familiarity with speech and audio quality metrics — perceptual (MOS, MUSHRA, PESQ, STOI), signal-level (SI-SDR, SNR), and task-specific (WER, speaker similarity, DNSMOS) — and an understanding of when each is and isn't the right tool.
- Experience designing and running human evaluation studies — listener panels, crowdsourced annotation, inter-annotator agreement analysis — with statistical rigor.
- Strong engineering skills: you can build production-quality evaluation pipelines, not just run scripts. Proficiency in Python and PyTorch or equivalent.
- Creativity in defining novel quantitative metrics for subjective or behavioral qualities — you've identified gaps in existing evaluation approaches and built something better.
- Ability to take open-ended research questions and translate them into concrete, measurable evaluation systems that run reliably at scale.
- Curiosity and rigor in equal measure — you're as motivated by discovering the right way to measure progress as by the progress itself.
Bonus
- Experience evaluating models across multiple speech tasks — ASR, TTS, speech enhancement, speaker verification, or machine translation.
- Familiarity with real-time or streaming model evaluation — latency-quality tradeoffs, codec-degraded audio, telephony channel conditions.
- Background in psychoacoustics or perceptual audio quality — understanding of how humans perceive speech naturalness, noise, and distortion.
- Experience with multilingual evaluation — cross-lingual quality metrics, language-specific annotation challenges, low-resource language evaluation.
- Published research at INTERSPEECH, ICASSP, ACL, EMNLP, or equivalent venues on evaluation methodology, speech quality, or related topics.