Abstract: Reinforcement learning (RL) benchmarking has long relied on learning curves and cumulative reward tables, yet these metrics fail to capture critical design challenges, such as environment ...