Introduction: Clinicians often spend a substantial amount of time on documentation, which can detract from patient care and contribute to burnout. Ambient digital scribes (ADS) aim to alleviate this burden by automatically generating clinical notes during patient encounters, allowing clinicians to focus more on their patients. Unlike traditional machine learning, generative AI introduces new challenges in this context. It is difficult to evaluate, because appropriate metrics are complex to design and traditional measures such as precision, recall, and F1 have limited applicability. Generative AI models can also produce hallucinations or omit critical information, so rigorous safety and value assessments are needed to ensure their reliability and effectiveness in clinical settings.