NLP/LLM Interest Group
Beyond Benchmarks: Evaluating LLM Safety, Reasoning, and Real-World Impact in Clinical Medicine
Large language models have achieved near-saturated performance on medical knowledge benchmarks, yet high exam scores tell us little about clinical safety or real-world utility.
In this talk, I review three recent studies that collectively reframe how we should evaluate LLMs for clinical use: NOHARM, a safety-oriented benchmark revealing that most LLM errors are harmful omissions rather than commissions; MedR-Bench, which decomposes clinical reasoning into stages and exposes critical weaknesses beyond diagnosis; and the first randomized controlled trial of ambient AI scribes, which highlights the gap between technical capability and clinical adoption. Together, these works suggest a paradigm shift: from asking "are LLMs smart enough for medicine?" to asking "how do we rigorously evaluate their safety, understand their failure modes, and validate their real-world impact?"
Host Organizations
- Biomedical Informatics & Data Science
- Clinical NLP Lab