Multi-Objective, Multi-Armed Bandits: Algorithms for Repeated Games and Application to Route Choice




Multi-objective decision-making, Multi-objective route choice, Reinforcement learning, Repeated games, Multiagent systems, Multi-armed Bandit algorithms


Multi-objective decision-making in multi-agent scenarios poses multiple challenges. Dealing with multiple objectives and non-stationarity caused by simultaneous learning are only two of them, which have been addressed separately. In this work, reinforcement learning algorithms that tackle both issues together are proposed and applied to a route choice problem, where drivers must select an action in a single-state formulation, while aiming to minimize both their travel time and toll. Hence, we deal with repeated games, now with a multi-objective approach. Advantages, limitations and differences of these algorithms are discussed. Our results show that the proposed algorithms for action selection using reinforcement learning deal with non-stationarity and multiple objectives, while providing alternative solutions to those of centralized methods.


Download data is not yet available.


AUER, P.; CESA-BIANCHI, N.; FISCHER, P. P. finite-time analysis of the multiarmed bandit problem. Machine Learning, Netherlands, v. 47, n. 2, p. 235–256, May 2002.

KOCSIS, L.; SZEPESVÁRI, C. Discounted UCB. Proc. of the 2nd PASCAL Challenges Workshop, Venice, v. 2, 2006.

GARIVIER, A.; MOULINES, E. On upper-confidence bound policies for switching bandit problems. In: Algorithmic Learning Theory. Berlin, Germany: Springer Berlin Heidelberg, 2011. p. 174–188.

NARENDRA, K. S.; THATHACHAR, M. A. L. Learning Automata: An Introduction. Upper Saddle River, USA: Prentice-Hall, 1989.

CLAUS, C.; BOUTILIER, C. The dynamics of reinforcement learning in cooperative multiagent systems. In: Proceedings of the Fifteenth National Conference on Artificial Intelligence. Menlo Park, USA: American Association for Artificial Intelligence, 1998. p. 746–752.

RAITH, A. et al. Solving multi-objective traffic assignment. Annals of Operations Research, Netherlands, v. 222, n. 1, p. 483–516, January 2014.

WANG, J. Y.; EHRGOTT, M. Modelling route choice behaviour in a tolled road network with a time surplus maximisation bi-objective user equilibrium model. Transportation Research Part B: Methodological, United Kingdom, v. 57, p. 342–360, November 2013.

WANG, J. Y. T.; RAITH, A.; EHRGOTT, M. Tolling analysis with bi-objective traffic assignment. In: Multiple Criteria Decision Making for Sustainable Energy and Transportation Systems. Berlin, Germany: Springer Berlin Heidelberg, 2010. p. 117–129.

DRUGAN, M. M.; NOWE, A. Designing multi-objective multi-armed bandits algorithms: A study. In: The 2013 International Joint Conference on Neural Networks (IJCNN). New York, USA: IEEE, 2013. p. 1–8.

MOFFAERT, K. V.; NOWÉ, A. Multi-objective reinforcement learning using sets of pareto dominating policies. The Journal of Machine Learning Research, New

York, USA, v. 15, n. 1, p. 3483–3512, January 2014.

RADULESCU, R. et al. Multi-objective multi-agent decision making: a utility-based analysis and survey. Autonomous Agents and Multi-Agent Systems, Netherlands, v. 34, n. 10, p. 1–52, January 2020.

HAYES, C. F. et al. A practical guide to multi-objective reinforcement learning and planning. Arxiv preprint arXiv:2103.09568, [S. l.], p. 1–41, March 2021.

ORT ÚZAR, J. de D.; WILLUMSEN, L. G. Modelling transport. 4. ed. Chichester, United Kingdom: John Wiley & Sons, 2011.

DIAL, R. B. A model and algorithm for multicriteria route-mode choice. Transportation Research Part B: Methodological, United Kingdom, v. 13, n. 4, p. 311–316, December 1979.

EHRGOTT, J. Y. T. W. and Matthias. Journal of Multi-Criteria Decision Analysis, United Kingdom, v. 25, n. 1-2, p. 3–15, March 2018.

TIDSWELL, J. et al. Minimising emissions in traffic assignment with non-monotonic arc costs. Transportation Research Part B: Methodological, United Kingdom, v. 153, p. 70–90, November 2021.

YEN, J. Y. Finding the k shortest loopless paths in a network. Management Science, USA, v. 17, n. 11, p. 712–716, July 1971.

RAMOS, G. de O.; SILVA, B. C. da; BAZZAN, A. L. C. Learning to minimise regret in route choice. In: Proc. of the 16th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2017). New York, USA: IFAAMAS, 2017. p. 846–855.

OLIVEIRA, T. B. F. de et al. Comparing multi-armed bandit algorithms and Q-learning for multiagent action selection: a case study in route choice. In: 2018 International Joint Conference on Neural Networks, IJCNN. New York, USA: IEEE, 2018. p. 1–8.

BAZZAN, A. L. C.; GRUNITZKI, R. A multiagent reinforcement learning approach to en-route trip building. In: 2016 International Joint Conference on Neural Networks (IJCNN). New York, USA: IEEE, 2016. p. 5288–5295.

PEETA, S.; YU, J. W. A hybrid model for driver route choice incorporating en-route attributes and real-time information effects. Networks and Spatial Economics, Netherlands, v. 5, n. 1, p. 21–40, March 2005.

HENN, V. Fuzzy route choice model for traffic assignment. Fuzzy Sets and Systems, Netherlands, v. 116, n. 1, p. 77–101, November 2000.

CHEN, Y.; SU, S.; WEI, J. A policy for optimizing sub-band selection sequences in wideband spectrum sensing. Sensors, Basel, Switzerland, v. 19, n. 19, p. 1–17, September 2019.

ROIJERS, D. M.; ZINTGRAF, L. M.; NOWÉ, A. Interactive thompson sampling for multi-objective multi-armed bandits. In: Algorithmic Decision Theory. Germany: Springer International Publishing, 2017. p. 18–34.

MOFFAERT, K. V. et al. Multi-objective x-armed bandits. In: 2014 International Joint Conference on Neural Networks (IJCNN). New York, USA: IEEE, 2014. p. 2331–2338.

DRUGAN, M.; NOWE, A. Epsilon-approximate pareto optimal set of arms identification in multi-objective multi-armed bandits. In: . Brussels, Belgium: BENELEARN 2014 - 23rd annual Belgian-Dutch Conference on Machine Learning, 2014. p. 73–80.

DRUGAN, M. M.; NOWÉ, A.; MANDERICK, B. Pareto upper confidence bounds algorithms: An empirical study. In: 2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL). New York, USA: IEEE, 2014. p. 1–8.

WATKINS, C. J. C. H.; DAYAN, P. Q-learning. Machine Learning, Hingham, USA, v. 8, n. 3, p. 279–292, May 1992.

United States Bureau of Public Roads. Traffic Assignment Manual. Washington, USA: United States Bureau of Public Roads, 1964.

ANQUISE, C. A. H. Multi-objective reinforcement learning methods for action selection: dealing with multiple objectives and non-stationarity. Dissertação (Mestrado) — Instituto de Informática, UFRGS, Porto Alegre, 2021.




How to Cite

Huanca-Anquise, C. A. ., Bazzan, A. L. C., & R. Tavares, A. (2023). Multi-Objective, Multi-Armed Bandits: Algorithms for Repeated Games and Application to Route Choice. Revista De Informática Teórica E Aplicada, 30(1), 11–23.



Regular Papers