Two red-team critiques of METR's research on long tasks
AI benchmarks are one attempt to track and formalise the progress of AI capabilities. As the introduction to “Measuring AI Ability to Complete Long Tasks” (Kwa et al.), published by METR this year, succinctly explains, commonly used benchmarks suffer from a variety of issues. One of them is that it is hard to track AI progress over time because benchmarks are often not mutually comparable. METR proposes a metric that addresses this problem: the X% (task completion) time horizon, defined as the length of task, measured by how long it takes a human, that an AI can complete X% of the time. I really like this metric as an attempt to come to grips quantitatively with how AI capabilities are developing over time.
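To make the definition concrete, here is a minimal sketch of how such a horizon could be estimated from data. It assumes (as a simplification of METR's actual methodology) that you fit a logistic curve of AI success probability against the log of human task length, then solve for the length at which the fitted probability equals X%. The `fit_time_horizon` function and the toy data are illustrative inventions, not METR's code.

```python
import math

def fit_time_horizon(tasks, x=0.5, lr=0.1, steps=10000):
    """Fit p(success) = sigmoid(a - b * log2(minutes)) by gradient descent,
    then solve for the task length where p equals x (the X% time horizon).

    tasks: list of (human_minutes, success) pairs, success in {0, 1}.
    """
    a, b = 0.0, 1.0
    n = len(tasks)
    for _ in range(steps):
        ga = gb = 0.0
        for minutes, success in tasks:
            t = math.log2(minutes)
            p = 1.0 / (1.0 + math.exp(-(a - b * t)))
            err = p - success          # gradient of cross-entropy loss
            ga += err / n
            gb += -err * t / n
        a -= lr * ga
        b -= lr * gb
    # Solve sigmoid(a - b * log2(m)) = x  =>  m = 2 ** ((a - logit(x)) / b)
    logit_x = math.log(x / (1 - x))
    return 2 ** ((a - logit_x) / b)

# Toy data: (task length in human-minutes, did the AI succeed?)
tasks = [(1, 1), (2, 1), (4, 1), (8, 1), (15, 0), (30, 1),
         (60, 0), (120, 0), (240, 0), (480, 0)]
horizon = fit_time_horizon(tasks, x=0.5)
```

On this toy data the fitted 50% horizon lands between the lengths the AI reliably completes and those it reliably fails, which is the intuition behind the metric: a single number summarising how long a task can get before the model's success rate drops below X%.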