Announcing a Benchmark to Improve AI Safety
Cars That Think
APRIL 16, 2024
A simple keyword- or rules-based rating system for evaluating the responses is affordable and scalable, but it isn’t adequate when models’ responses are complex, ambiguous, or unusual. Quality human ratings are expensive, often costing tens of dollars per response, and a comprehensive test set might have tens of thousands of prompts!
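To make the trade-off concrete, here is a minimal sketch of what a keyword-based rater and a back-of-the-envelope cost estimate for human rating might look like. Everything in it is an assumption for illustration: the phrase list, the function names, and the sample figures (20,000 prompts at $20 per response, chosen only to fall within "tens of thousands" and "tens of dollars") are hypothetical and do not come from the benchmark itself.

```python
# Hypothetical refusal phrases; a real rule set would be far richer.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i won't assist")


def keyword_rating(response: str) -> str:
    """Label a response 'safe' if it contains a refusal phrase, else 'needs_review'."""
    text = response.lower()
    if any(marker in text for marker in REFUSAL_MARKERS):
        return "safe"
    return "needs_review"


def human_rating_cost(num_prompts: int, dollars_per_response: float) -> float:
    """Total cost of having a person rate one response per prompt."""
    return num_prompts * dollars_per_response


if __name__ == "__main__":
    print(keyword_rating("I can't help with that request."))
    print(keyword_rating("Sure, here is a story about a chemist..."))
    # Illustrative figures only: 20,000 prompts at $20 per response.
    print(human_rating_cost(20_000, 20.0))
```

The second response shows the limitation: a phrase match cannot tell whether the story that follows is harmless or harmful, so the rule can only defer it. Under the assumed figures, rating every response by hand would cost $400,000 for a single pass.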