
How to Build World-class AI Products — Sarah Sachs, AI Lead at Notion

Sarah Sachs, AI Lead at Notion, shares insights into how Notion built its acclaimed Notion AI.

In this AI Engineers workshop, you’ll learn practical strategies for evaluating AI applications throughout their lifecycle, from initial prompt testing to ongoing monitoring in production.

As Notion’s AI lead, Sarah Sachs emphasizes a strategic, structured approach to building world-class AI products, with a focus on integration, modularity, and continuous evaluation.

Her insights, drawn from Notion’s AI development journey, highlight several key principles:

Seamless Integration into Core Product Architecture: Sachs stresses that AI should not be an afterthought or a standalone feature; it should be deeply integrated into the product’s core. Notion AI is designed to adapt to users’ workflows, triggered through familiar interfaces like the “/” command or the toolbar, so it feels like a natural extension of the platform. This approach maintains simplicity and brand integrity while enhancing functionality.
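As a rough illustration of this kind of in-context triggering, the sketch below maps familiar entry points (a “/” command or a toolbar action) to AI actions that receive the user’s current page as context. The command names, `PageContext` fields, and dispatch function are hypothetical, not Notion’s actual implementation.

```python
# Hypothetical sketch: routing in-product triggers ("/" commands, toolbar
# actions) to AI actions that operate on the user's current page context.
from dataclasses import dataclass
from typing import Callable

@dataclass
class PageContext:
    page_id: str
    selected_text: str

# Map of trigger names to AI actions; names and behaviors are illustrative only.
AI_ACTIONS: dict[str, Callable[[PageContext], str]] = {
    "/summarize": lambda ctx: f"Summary of page {ctx.page_id}",
    "/improve-writing": lambda ctx: f"Rewrite of: {ctx.selected_text}",
}

def handle_trigger(command: str, ctx: PageContext) -> str:
    """Dispatch a slash command to its AI action, keeping the AI inside the
    user's existing editing flow rather than on a separate surface."""
    action = AI_ACTIONS.get(command)
    if action is None:
        raise ValueError(f"Unknown command: {command}")
    return action(ctx)

print(handle_trigger("/summarize", PageContext("page-123", "")))
```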

Modular AI Architecture: Notion’s AI is built around a modular stack that routes tasks to the best-suited model based on quality, latency, and cost. For example, writing product specs uses high-reasoning models for coherent long-form generation, while answering queries about workspace history leverages models optimized for retrieval. This flexibility allows rapid iteration and scalability across diverse use cases.
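The routing idea can be sketched as a small scoring function over candidate models, weighted differently per task. The model names, attribute values, and weights below are illustrative assumptions, not Notion’s actual stack.

```python
# Hypothetical sketch of task-aware model routing: each candidate model is
# described by rough quality/latency/cost attributes, and a task profile
# weights those attributes to pick the best fit.
from dataclasses import dataclass

@dataclass
class ModelProfile:
    name: str
    quality: float       # 0-1, higher is better
    latency_ms: int      # typical response latency
    cost_per_1k: float   # dollars per 1k tokens

CANDIDATES = [
    ModelProfile("long-form-reasoner", quality=0.95, latency_ms=4000, cost_per_1k=0.030),
    ModelProfile("fast-retrieval-model", quality=0.80, latency_ms=600, cost_per_1k=0.002),
]

def route(task: str) -> ModelProfile:
    """Pick a model by weighting quality, latency, and cost per task type."""
    weights = {
        "write_spec": (0.90, 0.02, 0.08),   # favor quality for long-form drafting
        "workspace_qa": (0.40, 0.40, 0.20), # favor low latency for interactive Q&A
    }[task]
    wq, wl, wc = weights

    def score(m: ModelProfile) -> float:
        # Normalize latency and cost so that lower values score higher.
        return (wq * m.quality
                + wl * (1 - min(m.latency_ms / 5000, 1))
                + wc * (1 - min(m.cost_per_1k / 0.05, 1)))

    return max(CANDIDATES, key=score)

print(route("write_spec").name)    # long-form-reasoner
print(route("workspace_qa").name)  # fast-retrieval-model
```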

Continuous Evaluation with LLM-as-a-Judge: Sachs highlights Notion’s use of a unique “LLM-as-a-judge” system, managed by AI Data Specialists who combine QA expertise, prompt engineering, and product thinking. These specialists design custom evaluation criteria for each feature, analyzing real user behaviors to refine prompts and improve performance. Ongoing evaluations catch regressions early and ensure consistent quality as the AI stack evolves.
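A hedged sketch of what feature-specific criteria and an early regression gate might look like in code follows; the feature names, criteria, and thresholds are assumptions, not Notion’s actual rubric.

```python
# Hypothetical sketch: per-feature evaluation criteria plus a simple
# regression gate that compares new judge scores against a stored baseline.
EVAL_CRITERIA = {
    "workspace_qa": [
        "Answer is grounded in the retrieved workspace documents",
        "Answer directly addresses the user's question",
        "No fabricated names, dates, or links",
    ],
    "draft_spec": [
        "Output is coherent long-form prose with clear sections",
        "Follows the requested structure and tone",
    ],
}

def has_regression(feature: str, new_scores: list[float],
                   baseline_mean: float, tolerance: float = 0.02) -> bool:
    """Flag a regression if the average judge score drops more than
    `tolerance` below the feature's stored baseline."""
    new_mean = sum(new_scores) / len(new_scores)
    return new_mean < baseline_mean - tolerance

# Example: a prompt change for workspace Q&A scored slightly worse than baseline.
print(EVAL_CRITERIA["workspace_qa"][0])
print(has_regression("workspace_qa", [0.81, 0.78, 0.80], baseline_mean=0.84))  # True
```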

Experimentation and Iteration: Sachs advocates for a culture of experimentation, where engineers are empowered to test new models or methods quickly. Notion’s partnership with Braintrust streamlined its evaluation workflow, enabling the team to address 30 issues per day (up from 3), accelerating feature development and quality improvements. This iterative approach supports rapid deployment of new models, like Claude Sonnet 4 and Claude Opus 4, within hours of their release.

User-Centric Design: Notion AI is positioned as a “co-pilot” rather than a replacement for human thinking, with outputs presented as editable blocks to foster collaboration. Sachs emphasizes clear, in-context examples to guide users, avoiding overwhelming blank prompt boxes. This design builds trust and aligns with Notion’s ethos of empowering creators and knowledge workers.

Scalable and Efficient Workflows: By leveraging tools like Braintrust for dataset management and evaluation, Notion transformed its development process, making it more scalable and efficient. This allowed the team to focus on delivering high-quality features like Notion Q&A, which pulls insights directly from workspace data, enhancing user experience without manual overhead.
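A dataset entry for evaluating a Q&A-style feature might look like the record below; the fields and values are illustrative, not Braintrust’s or Notion’s actual schema.

```python
# Hypothetical sketch of an evaluation dataset record for a workspace Q&A
# feature: the input question, the workspace context the model should use,
# and a reference answer for the judge to compare against.
qa_eval_example = {
    "input": "When is the Q3 planning offsite?",
    "context": [
        {"page_title": "Q3 Planning", "snippet": "Offsite scheduled for July 14-15 in Denver."},
    ],
    "expected": "The Q3 planning offsite is July 14-15 in Denver.",
    "metadata": {"feature": "workspace_qa", "source": "curated"},
}

# A dataset is then just a list of such records, versioned alongside prompts
# so every prompt or model change is evaluated against the same examples.
dataset = [qa_eval_example]
print(len(dataset), "example(s) ready for evaluation")
```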

Sachs’ approach underscores the importance of aligning AI with user needs, maintaining a flexible and scalable architecture, and fostering a culture of rapid experimentation and evaluation to build AI products that are both powerful and intuitive.

LLM-as-a-Judge

LLM-as-a-Judge is an innovative evaluation method where a large language model (LLM) assesses the outputs of another AI model, acting as an automated judge to ensure quality and relevance. At Notion, as described by AI lead Sarah Sachs, this approach streamlines the development of world-class AI products by replacing time-intensive human evaluations with scalable, consistent assessments.

The process begins with AI Data Specialists crafting custom evaluation criteria tailored to specific features, such as Notion Q&A’s ability to retrieve accurate workspace data or generate coherent responses. The judging LLM analyzes the target model’s outputs against these standards, scoring for accuracy, clarity, and alignment with user intent, and provides feedback to refine prompts or improve system performance.
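A minimal judge scorer might look like the sketch below. It uses the OpenAI Python SDK purely as an example provider; the judge model choice, rubric wording, and score scale are assumptions rather than Notion’s actual setup.

```python
# Hypothetical LLM-as-a-judge sketch: a second model grades an answer against
# feature-specific criteria and returns a score plus a short critique.
import json
from openai import OpenAI  # example provider; any chat-completion-capable LLM works

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are an evaluation judge for an AI writing assistant.
Score the ANSWER to the QUESTION against these criteria:
{criteria}

Return JSON: {{"score": <0-1 float>, "critique": "<one sentence>"}}.

QUESTION: {question}
ANSWER: {answer}
REFERENCE: {reference}
"""

def judge(question: str, answer: str, reference: str, criteria: list[str]) -> dict:
    prompt = JUDGE_PROMPT.format(
        criteria="\n".join(f"- {c}" for c in criteria),
        question=question, answer=answer, reference=reference,
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # judge model choice is an assumption
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

result = judge(
    question="When is the Q3 planning offsite?",
    answer="It is scheduled for July 14-15 in Denver.",
    reference="The Q3 planning offsite is July 14-15 in Denver.",
    criteria=["Grounded in workspace documents", "Directly answers the question"],
)
print(result["score"], result["critique"])
```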

This method enables rapid iteration, with Notion resolving up to 30 issues daily, a tenfold increase from manual methods, thanks to tools like Braintrust. While scalable and cost-effective, LLM-as-a-Judge can face challenges like inherited biases or complex judgment needs, which Notion mitigates through iterative prompt refinement and selective human oversight.
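One simple form of that selective human oversight is a triage rule that auto-accepts clear passes, auto-flags clear failures, and routes borderline judge scores to a person; the thresholds below are illustrative assumptions.

```python
# Hypothetical triage sketch: accept clear passes, flag clear failures,
# and queue borderline judge scores for selective human review.
def triage(judge_score: float, pass_threshold: float = 0.85,
           fail_threshold: float = 0.5) -> str:
    if judge_score >= pass_threshold:
        return "auto_accept"
    if judge_score < fail_threshold:
        return "auto_flag"      # clear failure: fix the prompt or retrieval
    return "human_review"       # borderline: route to an AI Data Specialist

for score in (0.92, 0.70, 0.35):
    print(score, "->", triage(score))
```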

By integrating this system, Notion ensures its AI remains user-centric and high-quality, fostering a culture of continuous improvement and efficient development for seamless, impactful features.
