article thumbnail

Announcing a Benchmark to Improve AI Safety

Cars That Think

A simple keyword- or rules- based rating system for evaluating the responses is affordable and scalable, but isn’t adequate when models’ responses are complex, ambiguous or unusual. Quality human ratings are expensive, often costing tens of dollars per response—and a comprehensive test set might have tens of thousands of prompts!