Interesting bit on ARA evaluations in the Claude 3 model card:
"Across all the rounds, the model was clearly below our ARA ASL-3 risk threshold, having failed at least 3 out of 5 tasks, although it did make non-trivial partial progress in a few cases and passed a simplified