Intent-Aware Multi-Agent Reinforcement Learning


2018 IEEE International Conference on Robotics and Automation (ICRA), May 21-25, 2018, Brisbane, Australia

Siyuan Qi and Song-Chun Zhu

Abstract— This paper proposes an intent-aware multi-agent planning framework as well as a learning algorithm. Under this framework, an agent plans in the goal space to maximize the expected utility. The planning process takes the belief of other agents' intents into consideration. Instead of formulating the learning problem as a partially observable Markov decision process (POMDP), we propose a simple but effective linear function approximation of the utility function. It is based on the observation that, for humans, other people's intents influence our utility for a goal. The proposed framework has several major advantages: i) it is computationally feasible and guaranteed to converge; ii) it can easily integrate existing intent prediction and low-level planning algorithms; iii) it does not suffer from sparse feedback in the action space. We evaluate our algorithm on a real-world problem that is non-episodic and in which the number of agents and goals can vary over time. Our algorithm is trained in a scene in which aerial robots and humans interact, and tested in a novel scene with a different environment. Experimental results show that our algorithm achieves the best performance and that human-like behaviors emerge during the dynamic process.

I. INTRODUCTION

The success of the human species can be attributed to our remarkable adaptability to both the physical world and the social environment. Human social intelligence endows us with the ability to reason about the state of mind of other agents, and this mental-state reasoning widely influences the decisions made in our daily lives. For example, driving safely requires us to reason about the intents of other drivers and make decisions accordingly.
This kind of subtle intent-aware decision-making (theory-of-mind) behavior is ubiquitous in human activities, but virtually absent from even the most advanced artificial intelligence and robotic systems.

Intent-aware decision making is particularly useful in multi-agent systems, which find applications in a wide variety of domains including robotic teams, distributed control, collaborative decision support systems, etc. Ideally, rather than being pre-programmed with intent-aware planning behaviors, artificially intelligent agents should be able to learn such behavior in a way similar to human social adaptation. This is because designing a good behavior by hand is difficult or even impossible, and multi-agent environments are highly non-stationary.

Fig. 1: Theory-of-mind reasoning. Safe driving requires prediction of other drivers' intents. The utilities of goals are inferred based on the belief and intrinsic values, and decisions are made to maximize utility. In the figure, both drivers drove slightly across the line, hence each believes the other driver is planning to change lanes. The left driver is more aggressive, having a different intrinsic value than the right one. Finally, the left driver chose to change lanes, since the utility was higher, while the other chose to go straight.

1 The authors are affiliated with the University of California, Los Angeles. This research was supported by grants DARPA XAI project N66001-17-2-4029, ONR MURI project N00014-16-1-2007, and NSF IIS-1423305.
978-1-5386-3080-8/18/$31.00 ©2018 IEEE

Fortunately, advances in machine learning and robotics algorithms provide powerful tools for solving this learning and planning problem. To design an appropriate framework for learning agents to plan upon beliefs about other agents' mental states, the following aspects need to be considered:
- Among the desired properties of a learning algorithm (e.g., guaranteed convergence, explainability, computational feasibility, fast training, ease of implementation, the degree of awareness of other agents, adaptability to non-stationary environments), which ones are more important for intent-aware planning?
- How should the framework unify the learning and planning processes so that existing and future intent prediction algorithms and low-level planning methods can be fully exploited?
- The space of combined strategies of all agents can be huge. What are the important factors in the decision-making process? In particular, how important is each goal itself? How would an intent-aware agent's strategy be influenced by the goals of other agents?
- How reasonable are the learning outcomes? Will human-like social behavior (e.g., cooperation and competition) emerge during this dynamic process?

In this paper, we propose an intent-aware planning framework as well as a learning algorithm. We propose to plan in the goal space based on the belief of other agents' intents. The intuitive idea of intent-aware planning is illustrated in

Fig. 1, in which an aggressive driver (left) and a mild driver (right) are driving on a three-lane road and intend to change lanes at the same time. First, both drivers infer each other's intent; then, based on their intrinsic values for the different possible strategy combinations, an expected utility is computed for each of their own goals. Each driver then chooses the goal that maximizes his own utility, and low-level planners are utilized to find actions that achieve the goal.

The proposed framework and algorithm provide a temporal abstraction that decouples low-level planning, intent prediction, and high-level reasoning. The framework brings the following advantages: i) from a planning perspective, different intent prediction algorithms and low-level planners can be easily integrated; ii) by decoupling the belief update process and the learning process, as opposed to a POMDP, learning becomes computationally feasible; iii) the temporal abstraction of planning in a goal space avoids the problem of sparse feedback in the action space; iv) since any intent prediction algorithm can be adopted in this framework, no assumption is made about other agents' behaviors. The belief can be updated by various computational methods, such as Bayesian methods and maximum likelihood estimation, or even by communication.

The rest of the paper is organized as follows. In Sec. II we review related literature. We formally introduce the proposed planning framework in Sec. III and the proposed learning algorithm in Sec. IV. A real-world problem and our solution under the proposed framework are described in Sec. V. We then describe the designed comparative experiment and analyze the results. Finally, Sec. VI concludes the paper.

II. RELATED WORK

a) Intent prediction: Autonomous systems in multi-agent environments can benefit from an understanding of other agents' behavior. There is growing interest in the robotics and computer vision communities in predicting the future activities of humans.
[1], [2], [3], [4], [5], [6], [7] predict human trajectories and activities in various settings, including complex indoor/outdoor scenes and crowded spaces. These research advances are potentially applicable in many domains such as assistive robotics, robot coordination, and social robotics. However, it remains unclear how these prediction algorithms can be effectively utilized for robot planning in general.

b) Predictive multi-agent systems: Increasing efforts have been made to design systems that are capable of predicting other agents' intents or actions to some degree. Prediction algorithms have been explicitly or implicitly applied to problems such as navigation in crowds and traffic [8], [9], [10], motion planning [11], [12], and human-robot collaborative planning [13], [14]. Despite promising results on specific problems, there is no framework that unifies learning, prediction, and planning in general multi-agent systems.

Fig. 2: The computational framework. At each time step t, an observation o_t is made by an agent and added to the history h_t. The intents/goals of other agents are inferred from the history as the belief b_t. Based on b_t and the intrinsic value θ, the utility vector u_t over all goals is computed. Finally, a goal g_t is chosen to maximize the utility.

Fig. 3: The Chomsky hierarchy in formal language theory: combinational logic ⊂ finite state machine (MDP) ⊂ pushdown automaton (our method) ⊂ Turing machine.

c) Multi-agent reinforcement learning (MARL): A variety of MARL algorithms have been proposed to accomplish tasks without pre-programmed agent behaviors. [15] classified MARL algorithms by their field of origin: temporal-difference RL [16], [17], [18], [19], game theory [20], [21], and direct policy search techniques [22], [23]. The degree of awareness of other learning agents exhibited by MARL algorithms depends on the learning goal.
Some algorithms are independent of other learning agents and focus on convergence, while others consider adaptation to the other agents. Many deep multi-agent reinforcement learning algorithms [24], [25], for example, are agnostic of the intentions of other agents. In this paper, we explicitly model other agents' intentions during the decision-making process.

To model other agents' policies, a POMDP is usually adopted [26], [27]. However, a POMDP requires a model of the multi-agent world, and it is computationally infeasible, as further discussed in Sec. III-A. We believe that by decoupling the intent inference and the learning process, we can keep the power of prediction algorithms while making the learning algorithm computationally feasible.

III. INTENT-AWARE HIERARCHICAL PLANNING

We propose an intent-aware hierarchical planning framework for multi-agent environments that plans in the goal space. The planning process is based on the belief/prediction of other agents' intents/goals inferred from the observation history. Existing low-level planners (e.g., trajectory planners) are then utilized for action planning to achieve the chosen goal. In some literature, such goals are called "macro-actions".
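As a rough sketch, the goal-space loop just described (observe, update the history, infer a belief over others' intents, score all goals, commit to the best one) might look as follows. This is an illustrative sketch only: the class and method names, the placeholder uniform belief, and the specific linear utility form are assumptions for exposition, not the authors' implementation.

```python
import numpy as np

class GoalSpacePlanner:
    """Hypothetical sketch of the intent-aware goal-space decision loop."""

    def __init__(self, goals, theta0, theta):
        self.goals = list(goals)  # candidate goal set G
        self.theta0 = theta0      # intrinsic value per goal, shape (|G|,)
        self.theta = theta        # influence of others' intents, shape (|G|, |G|)
        self.history = []         # observation history h

    def infer_intents(self, observation):
        # Placeholder belief update: a uniform distribution over goals for
        # one other agent. Any intent-prediction algorithm (Bayesian update,
        # MLE, or communication) could be plugged in here instead.
        n = len(self.goals)
        return [np.full(n, 1.0 / n)]

    def step(self, observation):
        self.history.append(observation)           # accumulate history h_t
        beliefs = self.infer_intents(observation)  # belief b_t over others' goals
        # u_t(g) = intrinsic value + belief-weighted influence of others' intents
        u = self.theta0 + sum(self.theta @ b for b in beliefs)
        return self.goals[int(np.argmax(u))]       # g_t = argmax_g u_t(g)
```

In use, `step` would be called once per time step, and a low-level planner (e.g., a trajectory planner) would then turn the returned goal into concrete actions, matching the temporal abstraction described above.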

Specifically, at each time step t, an agent in an environment makes an observation o_t of the environment state s_t. The past observations are summarized into a history h_t. According to h_t, the agent infers the intents of other agents as probability distributions b_t, which we call the belief. A utility vector u_t is then computed over all possible goals G based on b_t and the agent's intrinsic value θ. θ is a time-invariant parameter that encodes the agent's values for different goals and how those values change conditioned on other agents' intents. Finally, a goal g_t ∈ G is chosen to maximize the utility. The computational framework is illustrated in Figure 2.

This framework is motivated and inspired mainly by two schools of thought. Motivated by automata theory, we store the observation history h; inspired by theory of mind, we infer other agents' mental states and reason at the goal level.

a) Automata theory: the study of abstract machines and automata, a branch of theoretical computer science closely related to formal language theory. An automaton is a finite representation of a formal language that may be an infinite set. Automata are often classified by the class of formal languages they can recognize, typically illustrated by the Chomsky hierarchy, which describes the relations between various languages as shown in Figure 3. According to this classification, traditional MDPs provide a finite-state-machine solution to planning problems. Keeping a history h of past observations makes the agent a pushdown automaton that goes beyond the finite-state-machine level. This frees the agent from the Markovian assumption, which limits the problems to a constrained space.

b) Theory of mind (ToM): the intuitive grasp that humans have of our own and other people's mental states (beliefs, intents, desires, knowledge, etc.).
Although no one has direct access to the minds of others, humans typically assume that these mental states guide others' actions in predictable ways: agents act to achieve their desires according to their beliefs. A growing body of evidence shows that, from early infancy, the ability to perform mental simulations of others increases rapidly [28], [29], [30], [31]. This allows us to reason backward about what others are thinking or intending given their behavior. This reasoning process is vital in a multi-agent environment, since each agent's choice affects the payoffs of the other agents [32], [33].

A. A POMDP formulation and its drawbacks

For intelligent agents to adopt the proposed framework, a key facet is the learning process, in which agents learn how to achieve their goals or maximize their utilities through interactions. As a result of a learning process within a population, conventions including cooperation and competition can emerge dynamically. It is possible to formulate the learning problem as a POMDP in the belief space:

Q(a, b) = \sum_{s} b(s) R(s, a) + \gamma \sum_{o} p(o \mid b, a) V(b_a^o(s'))    (1)

v(b) = \max_{a} Q(a, b)    (2)

where Q(a, b) denotes the long-term return of taking an action a \in A given the belief b of other agents' intents, and R(s, a) denotes the immediate reward for taking action a in state s. A goal g can then be chosen based on the return:

Q(g, b) = \sum_{a} p(a \mid g) Q(a, b)    (3)

However, this POMDP approach has two major limitations: (i) a model of the world (i.e., the underlying transition

A. Utility function

Intuitively, the utility function should be designed so that the utility of all possible combinations of all agents' intents can be evaluated. One possible way to parameterize u(g \mid b_{-i}; h, \theta) is to use a matrix \theta to represent the long-term value of every intent combination. However, this approach is computationally inefficient for finding the expected utility given the belief of other agents' intents: i) a joint probability needs to be computed for every intent combination; ii) marginal probabilities need to be computed by summing over the joint probabilities if some agents are absent or unobservable. Instead of directly encoding a value matrix, we use a matrix \theta to represent the influence on the value of pursuing
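To illustrate the contrast drawn above, here is a minimal sketch of the influence-matrix parameterization: under independent per-agent beliefs, the expected utility reduces to a sum of per-agent marginal terms, so the joint distribution over all intent combinations never needs to be enumerated (O(m·|G|) work for m other agents instead of O(|G|^m)). All names, shapes, and numbers here are hypothetical, not taken from the paper.

```python
import numpy as np

def expected_utility(g, beliefs, theta0, theta):
    """u(g | b_{-i}) = theta0[g] + sum_j b_j . theta[g]  (linear in the belief).

    theta[g, g'] encodes how another agent intending g' changes the value
    of pursuing g; absent agents simply contribute no term.
    """
    return theta0[g] + sum(float(b @ theta[g]) for b in beliefs)

# Two other agents, three goals; beliefs over each agent's intent:
theta0 = np.array([0.5, 0.1, 0.0])       # intrinsic value per goal
theta = np.array([[0.0, -0.4, 0.2],
                  [0.3,  0.0, 0.0],
                  [0.0,  0.1, 0.0]])     # influence matrix
beliefs = [np.array([0.7, 0.2, 0.1]),
           np.array([0.1, 0.8, 0.1])]

u = [expected_utility(g, beliefs, theta0, theta) for g in range(3)]
best = int(np.argmax(u))                 # goal with the highest expected utility
```

Note that if an agent drops out of the scene, its belief term is simply omitted from the sum; no marginalization over a joint table is required, which is the computational point made above.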