# Learning to Route Efficiently with End-to-End Feedback: The Value of Networked Structure

@article{Zhu2018LearningTR, title={Learning to Route Efficiently with End-to-End Feedback: The Value of Networked Structure}, author={Ruihao Zhu and Eytan H. Modiano}, journal={ArXiv}, year={2018}, volume={abs/1810.10637} }

We introduce efficient algorithms which achieve nearly optimal regrets for the problem of stochastic online shortest path routing with end-to-end feedback. The setting is a natural application of the combinatorial stochastic bandits problem, a special case of the linear stochastic bandits problem. We show how the difficulties posed by the large scale action set can be overcome by the networked structure of the action set. Our approach presents a novel connection between bandit learning and… Expand

#### One Citation

Meta Dynamic Pricing: Transfer Learning Across Experiments

- Computer Science, Mathematics
- ArXiv
- 2019

A meta dynamic pricing algorithm that learns an unknown prior online while solving a sequence of Thompson sampling pricing experiments for N different products, and shows that it significantly speeds up learning compared to prior-independent algorithms. Expand

#### References

SHOWING 1-10 OF 43 REFERENCES

Online linear optimization and adaptive routing

- Computer Science, Mathematics
- J. Comput. Syst. Sci.
- 2008

This paper presents two randomized online algorithms for selecting a sequence of routing paths in a network with unknown edge delays varying adversarially over time, and presents two algorithms whose regret is sublinear in the number of trials and polynomial in the size of the network. Expand

Adaptive shortest-path routing under unknown and stochastically varying link states

- Computer Science
- 2012 10th International Symposium on Modeling and Optimization in Mobile, Ad Hoc and Wireless Networks (WiOpt)
- 2012

By exploiting arm dependencies, a regret polynomial with the network size can be achieved while maintaining the optimal logarithmic order with time and find applications in cognitive radio and ad hoc networks with unknown and dynamic communication environments. Expand

Stochastic Online Shortest Path Routing: The Value of Feedback

- Computer Science
- IEEE Transactions on Automatic Control
- 2018

Three algorithms, with a tradeoff between computational complexity and performance, are proposed for online shortest path routing over multihop networks and the regret upper bounds of these algorithms improve over those of the existing algorithms. Expand

Tight Regret Bounds for Stochastic Combinatorial Semi-Bandits

- Mathematics, Computer Science
- AISTATS
- 2015

This paper analyzes a UCB-like algorithm for solving the problem of computationally and sample efficient learning in stochastic combinatorial semi-bandits and proves O(K L (1 / \Delta) \log n) and O(\sqrt{K L n n}$ upper bounds on its $n$-step regret. Expand

Finite-time Analysis of the Multiarmed Bandit Problem

- Computer Science
- Machine Learning
- 2004

This work shows that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support. Expand

Linear Thompson Sampling Revisited

- Computer Science, Mathematics
- AISTATS
- 2017

Thompson sampling can be seen as a generic randomized algorithm where the sampling distribution is designed to have a fixed probability of being optimistic, at the cost of an additional $\sqrt{d}$ regret factor compared to a UCB-like approach. Expand

Stochastic Linear Optimization under Bandit Feedback

- Mathematics, Computer Science
- COLT
- 2008

A nearly complete characterization of the classical stochastic k-armed bandit problem in terms of both upper and lower bounds for the regret is given, and two variants of an algorithm based on the idea of “upper confidence bounds” are presented. Expand

Regret in Online Combinatorial Optimization

- Computer Science, Mathematics
- Math. Oper. Res.
- 2014

This work addresses online linear optimization problems when the possible actions of the decision maker are represented by binary vectors and shows that the standard exponentially weighted average forecaster is a provably suboptimal strategy. Expand

Learning to Optimize via Posterior Sampling

- Computer Science, Mathematics
- Math. Oper. Res.
- 2014

A Bayesian regret bound for posterior sampling is made that applies broadly and can be specialized to many model classes and depends on a new notion the authors refer to as the eluder dimension, which measures the degree of dependence among action rewards. Expand

Improved Algorithms for Linear Stochastic Bandits

- Computer Science, Mathematics
- NIPS
- 2011

A simple modification of Auer's UCB algorithm achieves with high probability constant regret and improves the regret bound by a logarithmic factor, though experiments show a vast improvement. Expand