Everyone (Public)

NLP/LLM Interest Group

Beyond Benchmarks: Evaluating LLM Safety, Reasoning, and Real-World Impact in Clinical Medicine

Large language models have achieved near-saturated performance on medical knowledge benchmarks, yet high exam scores tell us little about clinical safety or real-world utility.

In this talk, I review three recent studies that collectively reframe how we should evaluate LLMs for clinical use: NOHARM, which introduces a safety-oriented benchmark revealing that most LLM errors are harmful omissions rather than commissions; MedR-Bench, which decomposes clinical reasoning into stages and exposes critical weaknesses beyond diagnosis; and the first randomized controlled trial of ambient AI scribes, which highlights the gap between technical capability and clinical adoption. Together, these works suggest a paradigm shift: from asking "are LLMs smart enough for medicine?" to asking "how do we rigorously evaluate their safety, understand their failure modes, and validate their real-world impact?"

Speaker

Contacts

Host Organizations

Admission

Free

Event Type

Lectures and Seminars

Tag

Next upcoming occurrences of this event

Monday, April 6, 2026