NLP/LLM Interest Group
Beyond Benchmarks: Evaluating LLM Safety, Reasoning, and Real-World Impact in Clinical Medicine
Large language models have achieved near-saturated performance on medical knowledge benchmarks, yet high exam scores tell us little about clinical safety or real-world utility.
In this talk, I review three recent studies that collectively reframe how we should evaluate LLMs for clinical use: NOHARM, a safety-oriented benchmark revealing that most LLM errors are harmful omissions rather than commissions; MedR-Bench, which decomposes clinical reasoning into stages and exposes critical weaknesses beyond diagnosis; and the first randomized controlled trial of ambient AI scribes, which highlights the gap between technical capability and clinical adoption. Together, these works suggest a paradigm shift: from asking "are LLMs smart enough for medicine?" to asking "how do we rigorously evaluate their safety, understand their failure modes, and validate their real-world impact?"
Host Organizations
- Biomedical Informatics & Data Science
- Clinical NLP Lab