In adversarial multi-agent settings, it is often difficult to characterize what reinforcement-learning agents have learned and to measure the robustness of their policies. To obtain a clearer picture, we explore the effect of manipulating agent capabilities and policies in a predator-prey pursuit task. In these experiments, we trained a single prey using multi-agent reinforcement learning against three slower predators, then tested the prey against three faster predators that followed a fixed “interceptor” strategy (head to the closest possible intersection with the prey, assuming the prey maintains its current velocity) rather than their learned policies. Although the prey’s performance was impressive under these novel conditions, it varied widely across episodes. Initial locations and velocities (randomized during both training and testing) explained little of this variation across test conditions. Visual inspection, however, indicates that more successful prey quickly settle into a circling pattern, whereas less successful prey often become cornered and double back into predator collisions. To quantify this behavior, we computed windowed entropy measures of the prey’s angle relative to the arena origin, which reveal when an agent transitions into and out of the unsuccessful behavior. These transitions suggest that circling is triggered by approaching the center of the arena. By varying agent capabilities and predator policies at evaluation time, we achieve a more comprehensive view of the prey’s learned policy, and we suggest that these windowed entropy measures, together with correlations between entropy and performance, yield a quantitative characterization of that policy.
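For concreteness, the fixed “interceptor” strategy described parenthetically above admits a standard constant-velocity interception reading: the predator heads toward the earliest point it can reach on the prey’s current straight-line trajectory. The sketch below is a minimal NumPy implementation under that assumption; the function name `intercept_heading` and its signature are illustrative, not taken from the experiments.

```python
import numpy as np

def intercept_heading(pred_pos, pred_speed, prey_pos, prey_vel):
    """Unit heading toward the earliest point where a predator moving at
    pred_speed can meet a prey that keeps its current velocity; falls back
    to pure pursuit when no interception exists. Positions and velocities
    are 2-D NumPy vectors."""
    r = prey_pos - pred_pos
    # Meet at time t where |r + prey_vel * t| = pred_speed * t, i.e.
    # (|v|^2 - s^2) t^2 + 2 (r . v) t + |r|^2 = 0.
    a = np.dot(prey_vel, prey_vel) - pred_speed ** 2
    b = 2.0 * np.dot(r, prey_vel)
    c = np.dot(r, r)
    if abs(a) < 1e-9:                       # equal speeds: equation is linear
        t = -c / b if b < 0 else np.inf
    else:
        disc = b * b - 4.0 * a * c
        if disc < 0:
            t = np.inf                      # prey is faster and escaping
        else:
            roots = ((-b - np.sqrt(disc)) / (2 * a),
                     (-b + np.sqrt(disc)) / (2 * a))
            positive = [x for x in roots if x > 0]
            t = min(positive) if positive else np.inf
    target = prey_pos + prey_vel * t if np.isfinite(t) else prey_pos
    d = target - pred_pos
    return d / (np.linalg.norm(d) + 1e-12)
```

The windowed entropy measure can likewise be sketched as the Shannon entropy of a histogram of the prey’s polar angle about the arena origin, computed over a sliding window of timesteps. The window length and bin count below are arbitrary placeholders, and the intuition that circling sweeps angular mass across bins (high entropy) while a cornered prey’s angle stays pinned to a few bins (low entropy) is one plausible operationalization of the behavior described above, not a detail reported here.

```python
import numpy as np

def windowed_angle_entropy(xy, window=50, bins=16):
    """Sliding-window Shannon entropy (in bits) of the prey's polar angle
    about the arena origin. xy is a (T, 2) array of positions; entries
    before the first full window are NaN."""
    theta = np.arctan2(xy[:, 1], xy[:, 0])  # angle relative to the origin
    edges = np.linspace(-np.pi, np.pi, bins + 1)
    out = np.full(len(theta), np.nan)
    for t in range(window, len(theta) + 1):
        counts, _ = np.histogram(theta[t - window:t], bins=edges)
        p = counts[counts > 0] / window     # empirical angle distribution
        out[t - 1] = -np.sum(p * np.log2(p))
    return out
```

Under this reading, transitions into and out of circling would appear as sustained rises and drops in the entropy trace, which can then be correlated with per-episode performance as described above.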