 Directories

Links
 “Procedural Generalization by Planning With Self-Supervised World Models”, Anand et al 2021
 “Mastering Atari Games With Limited Data”, Ye et al 2021
 “Neural Autopilot and Context-sensitivity of Habits”, Camerer & Li 2021
 “Is Curiosity All You Need? On the Utility of Emergent Behaviours from Curious Exploration”, Groth et al 2021
 “Why Generalization in RL Is Difficult: Epistemic POMDPs and Implicit Partial Observability”, Ghosh et al 2021
 “Multi-task Curriculum Learning in a Complex, Visual, Hard-exploration Domain: Minecraft”, Kanitscheider et al 2021
 “Reward Is Enough”, Silver et al 2021
 “Principled Exploration via Optimistic Bootstrapping and Backward Induction”, Bai et al 2021
 “Deep Bandits Show-Off: Simple and Efficient Exploration With Deep Networks”, Zhu & Rigotti 2021
 “Epistemic Autonomy: Self-supervised Learning in the Mammalian Hippocampus”, Santos-Pata et al 2021
 “Bayesian Optimization Is Superior to Random Search for Machine Learning Hyperparameter Tuning: Analysis of the Black-Box Optimization Challenge 2020”, Turner et al 2021
 “Flexible Modulation of Sequence Generation in the Entorhinal-hippocampal System”, McNamee et al 2021
 “Reinforcement Learning, Bit by Bit”, Lu et al 2021
 “Asymmetric Self-play for Automatic Goal Discovery in Robotic Manipulation”, OpenAI et al 2021
 “Informational Herding, Optimal Experimentation, and Contrarianism”, Smith et al 2021
 “First Return, Then Explore”, Ecoffet et al 2021
 “Is Pessimism Provably Efficient for Offline RL?”, Jin et al 2020
 “Meta-trained Agents Implement Bayes-optimal Agents”, Mikulik et al 2020
 “Learning Not to Learn: Nature versus Nurture in Silico”, Lange & Sprekeler 2020
 “The Temporal Dynamics of Opportunity Costs: A Normative Account of Cognitive Fatigue and Boredom”, Agrawal et al 2020
 “Assessing Game Balance With AlphaZero: Exploring Alternative Rule Sets in Chess”, Tomašev et al 2020
 “The Overfitted Brain: Dreams Evolved to Assist Generalization”, Hoel 2020
 “Deep Reinforcement Learning and Its Neuroscientific Implications”, Botvinick et al 2020
 “Exploration Strategies in Deep Reinforcement Learning”, Weng 2020
 “Synthetic Petri Dish: A Novel Surrogate Model for Rapid Architecture Search”, Rawal et al 2020
 “Planning to Explore via Self-Supervised World Models”, Sekar et al 2020
 “Exploring Bayesian Optimization: Breaking Bayesian Optimization into Small, Sizeable Chunks”, Agnihotri & Batra 2020
 “Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems”, Levine et al 2020
 “Agent57: Outperforming the Human Atari Benchmark”, Puigdomènech et al 2020
 “Agent57: Outperforming the Atari Human Benchmark”, Badia et al 2020
 “Meta-learning Curiosity Algorithms”, Alet et al 2020
 “Curriculum Learning for Reinforcement Learning Domains: A Framework and Survey”, Narvekar et al 2020
 “AutoML-Zero: Evolving Machine Learning Algorithms From Scratch”, Real et al 2020
 “Never Give Up: Learning Directed Exploration Strategies”, Badia et al 2020
 “R2D3: Making Efficient Use of Demonstrations to Solve Hard Exploration Problems”, Paine et al 2019
 “Benchmarking Bonus-Based Exploration Methods on the Arcade Learning Environment”, Taïga et al 2019
 “A Unified Bellman Optimality Principle Combining Reward Maximization and Empowerment”, Leibfried et al 2019
 “Meta Reinforcement Learning”, Weng 2019
 “Search on the Replay Buffer: Bridging Planning and Reinforcement Learning”, Eysenbach et al 2019
 “ICML 2019 Notes”, Abel 2019
 “Human-level Performance in 3D Multiplayer Games With Population-based Reinforcement Learning”, Jaderberg et al 2019
 “Reinforcement Learning, Fast and Slow”, Botvinick et al 2019
 “Meta Reinforcement Learning As Task Inference”, Humplik et al 2019
 “Meta-learning of Sequential Strategies”, Ortega et al 2019
 “Meta-learners’ Learning Dynamics Are unlike Learners’”, Rabinowitz 2019
 “Ray Interference: a Source of Plateaus in Deep Reinforcement Learning”, Schaul et al 2019
 “Learning To Follow Directions in Street View”, Hermann et al 2019
 “A Generalized Framework for Population Based Training”, Li et al 2019
 “Go-Explore: a New Approach for Hard-Exploration Problems”, Ecoffet et al 2019
 “Is the FDA Too Conservative or Too Aggressive?: A Bayesian Decision Analysis of Clinical Trial Design”, Isakov et al 2019
 “Machine-learning-guided Directed Evolution for Protein Engineering”, Yang et al 2019
 “Common Neural Code for Reward and Information Value”, Kobayashi & Hsu 2019
 “The Bayesian Superorganism III: Externalised Memories Facilitate Distributed Sampling”, Hunt et al 2018
 “Exploration in the Wild”, Schulz et al 2018
 “An Introduction to Deep Reinforcement Learning”, Francois-Lavet et al 2018
 “Exploration by Random Network Distillation”, Burda et al 2018
 “Computational Mechanisms of Curiosity and Goal-directed Exploration”, Schwartenbeck et al 2018
 “Large-Scale Study of Curiosity-Driven Learning”, Burda et al 2018
 “Visual Reinforcement Learning With Imagined Goals”, Nair et al 2018
 “Is Q-learning Provably Efficient?”, Jin et al 2018
 “Improving Width-based Planning With Compact Policies”, Junyent et al 2018
 “Construction of Arbitrarily Strong Amplifiers of Natural Selection Using Evolutionary Graph Theory”, Pavlogiannis et al 2018
 “Reevaluating Evaluation”, Balduzzi et al 2018
 “Observe and Look Further: Achieving Consistent Performance on Atari”, Pohlen et al 2018
 “Playing Hard Exploration Games by Watching YouTube”, Aytar et al 2018
 “Some Considerations on Learning to Explore via Meta-Reinforcement Learning”, Stadie et al 2018
 “Learning to Search With MCTSnets”, Guez et al 2018
 “Learning and Querying Fast Generative Models for Reinforcement Learning”, Buesing et al 2018
 “Improving Exploration in Evolution Strategies for Deep Reinforcement Learning via a Population of Novelty-Seeking Agents”, Conti et al 2017
 “Emergent Complexity via Multi-Agent Competition”, Bansal et al 2017
 “An Analysis of the Value of Information When Exploring Stochastic, Discrete Multi-Armed Bandits”, Sledge & Principe 2017
 “The Uncertainty Bellman Equation and Exploration”, O'Donoghue et al 2017
 “Imagination-Augmented Agents for Deep Reinforcement Learning”, Weber et al 2017
 “Distral: Robust Multitask Reinforcement Learning”, Teh et al 2017
 “The Intentional Unintentional Agent: Learning to Solve Many Continuous Control Tasks Simultaneously”, Cabi et al 2017
 “A Tutorial on Thompson Sampling”, Russo et al 2017
 “Noisy Networks for Exploration”, Fortunato et al 2017
 “Scalable Generalized Linear Bandits: Online Computation and Hashing”, Jun et al 2017
 “Learning to Perform Physics Experiments via Deep Reinforcement Learning”, Denil et al 2016
 “Bayesian Reinforcement Learning: A Survey”, Ghavamzadeh et al 2016
 “Unifying Count-Based Exploration and Intrinsic Motivation”, Bellemare et al 2016
 “Deep Exploration via Bootstrapped DQN”, Osband et al 2016
 “Experimental Design for Partially Observed Markov Decision Processes”, Thorbergsson & Hooker 2012
 “Learning Is Planning: near Bayes-optimal Reinforcement Learning via Monte-Carlo Tree Search”, Asmuth & Littman 2012
 “PILCO: A Model-Based and Data-Efficient Approach to Policy Search”, Deisenroth & Rasmussen 2011
 “Evolution Strategy: Nature's Way of Optimization”, Rechenberg 1989
 “Evolutionsstrategien”, Rechenberg 1977
 Miscellaneous
“Procedural Generalization by Planning With Self-Supervised World Models”, Anand et al 2021
“Procedural Generalization by Planning with Self-Supervised World Models” (2021-11-02):
One of the key promises of model-based reinforcement learning is the ability to generalize using an internal model of the world to make predictions in novel environments and tasks. However, the generalization ability of model-based agents is not well understood because existing work has focused on model-free agents when benchmarking generalization. Here, we explicitly measure the generalization ability of model-based agents in comparison to their model-free counterparts. We focus our analysis on MuZero (Schrittwieser et al., 2020), a powerful model-based agent, and evaluate its performance on both procedural and task generalization. We identify three factors of procedural generalization—planning, self-supervised representation learning, and procedural data diversity—and show that by combining these techniques, we achieve state-of-the-art generalization performance and data efficiency on Procgen (Cobbe et al., 2019). However, we find that these factors do not always provide the same benefits for the task generalization benchmarks in Meta-World (Yu et al., 2019), indicating that transfer remains a challenge and may require different approaches than procedural generalization. Overall, we suggest that building generalizable agents requires moving beyond the single-task, model-free paradigm and towards self-supervised model-based agents that are trained in rich, procedural, multi-task environments.
“Mastering Atari Games With Limited Data”, Ye et al 2021
“Mastering Atari Games with Limited Data” (2021-10-30):
Reinforcement learning has achieved great success in many applications. However, sample efficiency remains a key challenge, with prominent methods requiring millions (or even billions) of environment steps to train. Recently, there has been significant progress in sample-efficient image-based RL algorithms; however, consistent human-level performance on the Atari game benchmark remains an elusive goal. We propose a sample-efficient model-based visual RL algorithm built on MuZero, which we name EfficientZero. Our method achieves 190.4% mean human performance and 116.0% median performance on the Atari 100k benchmark with only two hours of real-time game experience, and outperforms the state-of-the-art SAC on some tasks on the DMControl 100k benchmark. This is the first time an algorithm achieves superhuman performance on Atari games with so little data. EfficientZero’s performance is also close to DQN’s performance at 200 million frames while we consume 500× less data. EfficientZero’s low sample complexity and high performance can bring RL closer to real-world applicability. We implement our algorithm in an easy-to-understand manner and it is available at https://github.com/YeWR/EfficientZero. We hope it will accelerate the research of MCTS-based RL algorithms in the wider community.
“Neural Autopilot and Context-sensitivity of Habits”, Camerer & Li 2021
2021camerer.pdf: “Neural autopilot and context-sensitivity of habits” (2021-10-01):
 In neural autopilot theory, habits save cognitive effort by repeating reliably-rewarding choices.
 Strong habits are marked by insensitivity to reward change, but large-scale field data do not show this effect.
 Habits are predictable from context variables using machine learning.
 Predictable habits can be identified in everyday behavior using machine learning.
 Identifying contextual cues, and using information about reward reliability, could personalize and improve our ability to change behavior.
This paper is about the background of 2 new ideas from neuroeconomics for understanding habits. The main idea is a 2-process ‘neural autopilot’ model. This model hypothesizes that contextually cued habits occur when the reward from the habitual behavior is numerically reliable (as in related models with an ‘arbitrator’). This computational model is lightly parameterized, has the essential ingredients established in animal learning and cognitive neuroscience, and is simple enough to make non-obvious predictions. An interesting set of predictions is about how consumers react to different kinds of changes in prices and qualities of goods (‘elasticities’). Elasticity analysis expands the habit marker of insensitivity to reward devaluation, and other types of sensitivities. The second idea is to use machine learning to discover which contextual variables seem to cue habits, in field data.
“Is Curiosity All You Need? On the Utility of Emergent Behaviours from Curious Exploration”, Groth et al 2021
“Is Curiosity All You Need? On the Utility of Emergent Behaviours from Curious Exploration” (2021-09-17):
Curiosity-based reward schemes can present powerful exploration mechanisms which facilitate the discovery of solutions for complex, sparse or long-horizon tasks. However, as the agent learns to reach previously unexplored spaces and the objective adapts to reward new areas, many behaviours emerge only to disappear due to being overwritten by the constantly shifting objective. We argue that merely using curiosity for fast environment exploration or as a bonus reward for a specific task does not harness the full potential of this technique and misses useful skills. Instead, we propose to shift the focus towards retaining the behaviours which emerge during curiosity-based learning. We posit that these self-discovered behaviours serve as valuable skills in an agent’s repertoire to solve related tasks. Our experiments demonstrate the continuous shift in behaviour throughout training and the benefits of a simple policy snapshot method to reuse discovered behaviour for transfer tasks.
“Why Generalization in RL Is Difficult: Epistemic POMDPs and Implicit Partial Observability”, Ghosh et al 2021
“Why Generalization in RL is Difficult: Epistemic POMDPs and Implicit Partial Observability” (2021-07-13):
Generalization is a central challenge for the deployment of reinforcement learning (RL) systems in the real world. In this paper, we show that the sequential structure of the RL problem necessitates new approaches to generalization beyond the well-studied techniques used in supervised learning. While supervised learning methods can generalize effectively without explicitly accounting for epistemic uncertainty, we show that, perhaps surprisingly, this is not the case in RL. We show that generalization to unseen test conditions from a limited number of training conditions induces implicit partial observability, effectively turning even fully-observed MDPs into POMDPs. Informed by this observation, we recast the problem of generalization in RL as solving the induced partially observed Markov decision process, which we call the epistemic POMDP. We demonstrate the failure modes of algorithms that do not appropriately handle this partial observability, and suggest a simple ensemble-based technique for approximately solving the partially observed problem. Empirically, we demonstrate that our simple algorithm derived from the epistemic POMDP achieves significant gains in generalization over current methods on the Procgen benchmark suite.
“Multi-task Curriculum Learning in a Complex, Visual, Hard-exploration Domain: Minecraft”, Kanitscheider et al 2021
“Multi-task curriculum learning in a complex, visual, hard-exploration domain: Minecraft” (2021-06-28):
An important challenge in reinforcement learning is training agents that can solve a wide variety of tasks. If tasks depend on each other (e.g. needing to learn to walk before learning to run), curriculum learning can speed up learning by focusing on the next best task to learn. We explore curriculum learning in a complex, visual domain with many hard exploration challenges: Minecraft. We find that learning progress (defined as a change in success probability of a task) is a reliable measure of learnability for automatically constructing an effective curriculum. We introduce a learning-progress based curriculum and test it on a complex reinforcement learning problem (called “Simon Says”) where an agent is instructed to obtain a desired goal item. Many of the required skills depend on each other. Experiments demonstrate that: (1) a within-episode exploration bonus for obtaining new items improves performance, (2) dynamically adjusting this bonus across training such that it only applies to items the agent cannot reliably obtain yet further increases performance, (3) the learning-progress based curriculum elegantly follows the learning curve of the agent, and (4) when the learning-progress based curriculum is combined with the dynamic exploration bonus it learns much more efficiently and obtains far higher performance than uniform baselines. These results suggest that combining intra-episode and across-training exploration bonuses with learning progress creates a promising method for automated curriculum generation, which may substantially increase our ability to train more capable, generally intelligent agents.
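The learning-progress measure in this abstract (change in a task’s success probability) is simple enough to sketch. Below is a minimal, hypothetical Python illustration, not the paper’s implementation: progress per task is estimated as the gap between a fast and a slow exponential moving average of success, and the curriculum samples tasks in proportion to it.

```python
import random

class LearningProgressCurriculum:
    """Toy learning-progress curriculum (all names illustrative): a task's
    learning progress is the absolute gap between a fast and a slow
    exponential moving average of its success rate, approximating the
    change in success probability over training."""

    def __init__(self, tasks, fast=0.3, slow=0.03, eps=0.05):
        self.fast_ema = {t: 0.0 for t in tasks}
        self.slow_ema = {t: 0.0 for t in tasks}
        self.fast, self.slow, self.eps = fast, slow, eps

    def progress(self, task):
        # |fast EMA - slow EMA| is large while the success rate is moving.
        return abs(self.fast_ema[task] - self.slow_ema[task])

    def sample_task(self):
        tasks = list(self.fast_ema)
        # eps keeps a floor of exploration over stagnant tasks
        weights = [self.progress(t) + self.eps for t in tasks]
        return random.choices(tasks, weights=weights, k=1)[0]

    def update(self, task, success):
        s = float(success)
        self.fast_ema[task] += self.fast * (s - self.fast_ema[task])
        self.slow_ema[task] += self.slow * (s - self.slow_ema[task])
```

A task whose success rate is rising gets a large weight; a task the agent either always fails or always solves drifts back toward zero weight.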
“Reward Is Enough”, Silver et al 2021
“Reward is enough” (2021-05-24):
In this article we hypothesize that intelligence, and its associated abilities, can be understood as subserving the maximization of reward.
Accordingly, reward is enough to drive behaviour that exhibits abilities studied in natural and artificial intelligence, including knowledge, learning, perception, social intelligence, language, generalisation and imitation. This is in contrast to the view that specialised problem formulations are needed for each ability, based on other signals or objectives.
Furthermore, we suggest that agents that learn through trial & error experience to maximise reward could learn behaviour that exhibits most if not all of these abilities, and therefore that powerful reinforcement learning agents could constitute a solution to artificial general intelligence.
[Keywords: artificial intelligence, artificial general intelligence, reinforcement learning, reward]
[See also: convergent instrumental drives; AI-GAs; the Bitter Lesson; meta-reinforcement learning; the blessings of scale, the pretraining paradigm, and the scaling hypothesis.]
“Principled Exploration via Optimistic Bootstrapping and Backward Induction”, Bai et al 2021
“Principled Exploration via Optimistic Bootstrapping and Backward Induction” (2021-05-13):
One principled approach for provably efficient exploration is incorporating the upper confidence bound (UCB) into the value function as a bonus. However, UCB is specified to deal with linear and tabular settings and is incompatible with Deep Reinforcement Learning (DRL). In this paper, we propose a principled exploration method for DRL through Optimistic Bootstrapping and Backward Induction (OB2I). OB2I constructs a general-purpose UCB-bonus through nonparametric bootstrap in DRL. The UCB-bonus estimates the epistemic uncertainty of state-action pairs for optimistic exploration. We build theoretical connections between the proposed UCB-bonus and the LSVI-UCB in a linear setting. We propagate future uncertainty in a time-consistent manner through episodic backward update, which exploits the theoretical advantage and empirically improves the sample-efficiency. Our experiments in the MNIST maze and Atari suite suggest that OB2I outperforms several state-of-the-art exploration approaches.
“Deep Bandits Show-Off: Simple and Efficient Exploration With Deep Networks”, Zhu & Rigotti 2021
“Deep Bandits Show-Off: Simple and Efficient Exploration with Deep Networks” (2021-05-10):
Designing efficient exploration is central to Reinforcement Learning due to the fundamental problem posed by the exploration-exploitation dilemma. Bayesian exploration strategies like Thompson Sampling resolve this tradeoff in a principled way by modeling and updating the distribution of the parameters of the action-value function, the outcome model of the environment. However, this technique becomes infeasible for complex environments due to the computational intractability of maintaining probability distributions over parameters of outcome models of corresponding complexity. Moreover, the approximation techniques introduced to mitigate this issue typically result in poor exploration-exploitation tradeoffs, as observed in the case of deep neural network models with approximate posterior methods that have been shown to underperform in the deep bandit scenario. In this paper we introduce Sample Average Uncertainty (SAU), a simple and efficient uncertainty measure for contextual bandits. While Bayesian approaches like Thompson Sampling estimate outcomes uncertainty indirectly by first quantifying the variability over the parameters of the outcome model, SAU is a frequentist approach that directly estimates the uncertainty of the outcomes based on the value predictions. Importantly, we show theoretically that the uncertainty measure estimated by SAU asymptotically matches the uncertainty provided by Thompson Sampling, as well as its regret bounds. Because of its simplicity, SAU can be seamlessly applied to deep contextual bandits as a very scalable drop-in replacement for ε-greedy exploration. We confirm empirically our theory by showing that SAU-based exploration outperforms current state-of-the-art deep Bayesian bandit methods on several real-world datasets at modest computation cost. Code is available at https://github.com/ibm/sau-explore.
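The core of SAU is easy to caricature in the simplest, non-contextual setting. The sketch below is my simplification rather than the paper’s deep contextual-bandit algorithm: each arm’s uncertainty is the running sample average of its squared prediction errors, and its square root (scaled by visit count) serves as a UCB-style bonus in place of a Bayesian posterior.

```python
import math
import random

class SAUBandit:
    """k-armed bandit with a Sample-Average-Uncertainty-style bonus:
    uncertainty is estimated directly from observed prediction errors
    (frequentist), not from a posterior over model parameters."""

    def __init__(self, k):
        self.k = k
        self.mean = [0.0] * k     # value prediction per arm
        self.sq_err = [1.0] * k   # running average of squared prediction errors
        self.n = [0] * k          # pull counts

    def select(self):
        def score(a):
            if self.n[a] == 0:
                return float("inf")  # pull every arm at least once
            return self.mean[a] + math.sqrt(self.sq_err[a] / self.n[a])
        return max(range(self.k), key=score)

    def update(self, a, reward):
        self.n[a] += 1
        err = reward - self.mean[a]          # prediction error before updating
        self.mean[a] += err / self.n[a]
        self.sq_err[a] += (err * err - self.sq_err[a]) / self.n[a]

# Demo on three hypothetical Bernoulli arms: the bonus decays as pulls
# accumulate, so play concentrates on the best arm.
random.seed(0)
bandit = SAUBandit(3)
probs = [0.1, 0.5, 0.9]
for _ in range(2000):
    arm = bandit.select()
    bandit.update(arm, 1.0 if random.random() < probs[arm] else 0.0)
```

Note how no distribution over model parameters is ever maintained: the bonus is computed from the same scalar prediction errors the value estimate already produces, which is what makes the approach cheap to bolt onto a deep model.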
“Epistemic Autonomy: Self-supervised Learning in the Mammalian Hippocampus”, Santos-Pata et al 2021
2021santospata.pdf: “Epistemic Autonomy: Self-supervised Learning in the Mammalian Hippocampus” (2021-04-24):
 Biological cognition is based on self-generated learning objectives. However, the mechanism by which this epistemic autonomy is realized by the neuronal substrate is not understood.
 Artificial neural networks based on error backpropagation lack epistemic autonomy because they are mostly trained in a supervised fashion. In this respect, they face the symbol grounding problem of artificial intelligence.
 We propose that the entorhinal-hippocampal complex, a brain structure located in the medial temporal lobe and central to memory, combines epistemic autonomy with intrinsically generated error gradients akin to error backpropagation.
 We present evidence supporting the hypothesis that the countercurrent inhibitory projections of the entorhinal-hippocampal complex implement a continuous self-supervised error minimization between network input and output.
Biological cognition is based on the ability to autonomously acquire knowledge, or epistemic autonomy.
Such self-supervision is largely absent in artificial neural networks (ANN) because they depend on externally set learning criteria. Yet training ANN using error backpropagation has created the current revolution in artificial intelligence, raising the question of whether the epistemic autonomy displayed in biological cognition can be achieved with error backpropagation-based learning.
We present evidence suggesting that the entorhinal-hippocampal complex combines epistemic autonomy with error backpropagation. Specifically, we propose that the hippocampus minimizes the error between its input and output signals through a modulatory countercurrent inhibitory network. We further discuss the computational emulation of this principle and analyze it in the context of autonomous cognitive systems.
[Keywords: error backpropagation, selfsupervised learning, hippocampus]
“Bayesian Optimization Is Superior to Random Search for Machine Learning Hyperparameter Tuning: Analysis of the Black-Box Optimization Challenge 2020”, Turner et al 2021
“Bayesian Optimization is Superior to Random Search for Machine Learning Hyperparameter Tuning: Analysis of the Black-Box Optimization Challenge 2020” (2021-04-20):
This paper presents the results and insights from the black-box optimization (BBO) challenge at NeurIPS 2020, which ran from July-October 2020. The challenge emphasized the importance of evaluating derivative-free optimizers for tuning the hyperparameters of machine learning models. This was the first black-box optimization challenge with a machine learning emphasis. It was based on tuning (validation set) performance of standard machine learning models on real datasets. This competition has widespread impact, as black-box optimization (e.g. Bayesian optimization) is relevant for hyperparameter tuning in almost every machine learning project as well as many applications outside of machine learning. The final leaderboard was determined using the optimization performance on held-out (hidden) objective functions, where the optimizers ran without human intervention. Baselines were set using the default settings of several open-source black-box optimization packages as well as random search.
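To make the Bayesian-optimization side of the comparison concrete, here is a self-contained 1D toy (my sketch, not the challenge’s tooling): a Gaussian-process surrogate with an RBF kernel and a UCB acquisition rule, maximized over a fixed candidate grid, using a tiny hand-rolled linear solver instead of any library.

```python
import math
import random

def rbf(a, b, ls=0.15):
    """Squared-exponential kernel with (illustrative) length-scale ls."""
    return math.exp(-((a - b) ** 2) / (2 * ls * ls))

def solve(A, b):
    """Gaussian elimination with partial pivoting (fine for tiny systems)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def gp_ucb_step(xs, ys, candidates, beta=2.0, noise=1e-6):
    """GP posterior mean/variance at each candidate; return the UCB argmax."""
    K = [[rbf(xi, xj) + (noise if i == j else 0.0) for j, xj in enumerate(xs)]
         for i, xi in enumerate(xs)]
    alpha = solve(K, ys)                      # K^-1 y
    best_x, best_ucb = None, -float("inf")
    for x in candidates:
        k = [rbf(x, xi) for xi in xs]
        mu = sum(ki * ai for ki, ai in zip(k, alpha))
        v = solve(K, k)                       # K^-1 k
        var = max(1e-12, rbf(x, x) - sum(ki * vi for ki, vi in zip(k, v)))
        ucb = mu + beta * math.sqrt(var)      # optimism in the face of uncertainty
        if ucb > best_ucb:
            best_x, best_ucb = x, ucb
    return best_x

def bayes_opt(f, n_iters=12, seed=0):
    """Maximize f on [0, 1]: random initial design, then GP-UCB on a grid."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(3)]
    ys = [f(x) for x in xs]
    grid = [i / 200 for i in range(201)]
    for _ in range(n_iters):
        x = gp_ucb_step(xs, ys, grid)
        xs.append(x)
        ys.append(f(x))
    return max(ys)
```

Early iterations are dominated by the variance term (exploration, much like random search but anti-clustered); once the posterior tightens, the mean term takes over and samples concentrate near the optimum, which is where the advantage over pure random search comes from.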
“Flexible Modulation of Sequence Generation in the Entorhinal-hippocampal System”, McNamee et al 2021
2021mcnamee.pdf: “Flexible modulation of sequence generation in the entorhinal-hippocampal system” (2021-04-12):
Exploration, consolidation and planning depend on the generation of sequential state representations. However, these algorithms require disparate forms of sampling dynamics for optimal performance. We theorize how the brain should adapt internally generated sequences for particular cognitive functions and propose a neural mechanism by which this may be accomplished within the entorhinal-hippocampal circuit.
Specifically, we demonstrate that the systematic modulation along the medial entorhinal cortex dorsoventral axis of grid population input into the hippocampus facilitates a flexible generative process that can interpolate between qualitatively distinct regimes of sequential hippocampal reactivations.
By relating the emergent hippocampal activity patterns drawn from our model to empirical data, we explain and reconcile a diversity of recently observed, but apparently unrelated, phenomena such as generative cycling, diffusive hippocampal reactivations and jumping trajectory events.
“Reinforcement Learning, Bit by Bit”, Lu et al 2021
“Reinforcement Learning, Bit by Bit” (2021-03-06):
Reinforcement learning agents have demonstrated remarkable achievements in simulated environments. Data efficiency poses an impediment to carrying this success over to real environments. The design of data-efficient agents calls for a deeper understanding of information acquisition and representation. We develop concepts and establish a regret bound that together offer principled guidance. The bound sheds light on questions of what information to seek, how to seek that information, and what information to retain. To illustrate these concepts, we design simple agents that build on them and present computational results that demonstrate improvements in data efficiency.
“Asymmetric Self-play for Automatic Goal Discovery in Robotic Manipulation”, OpenAI et al 2021
“Asymmetric self-play for automatic goal discovery in robotic manipulation” (2021-03-05):
We train a single, goal-conditioned policy that can solve many robotic manipulation tasks, including tasks with previously unseen goals and objects. To do so, we rely on asymmetric self-play for goal discovery, where two agents, Alice and Bob, play a game. Alice is asked to propose challenging goals and Bob aims to solve them. We show that this method is able to discover highly diverse and complex goals without any human priors. We further show that Bob can be trained with only sparse rewards, because the interaction between Alice and Bob results in a natural curriculum and Bob can learn from Alice’s trajectory when relabeled as a goal-conditioned demonstration. Finally, we show that our method scales, resulting in a single policy that can transfer to many unseen holdout tasks such as setting a table, stacking blocks, and solving simple puzzles. Videos of a learned policy are available at https://roboticsselfplay.github.io.
[Keywords: self-play, asymmetric self-play, automatic curriculum, automatic goal generation, robotic learning, robotic manipulation, reinforcement learning]
“Informational Herding, Optimal Experimentation, and Contrarianism”, Smith et al 2021
2021smith.pdf: “Informational Herding, Optimal Experimentation, and Contrarianism” (2021-02-25):
In the standard herding model, privately informed individuals sequentially see prior actions and then act. An identical action herd eventually starts and public beliefs tend to “cascade sets” where social learning stops. What behaviour is socially efficient when actions ignore informational externalities?
We characterize the outcome that maximizes the discounted sum of utilities. Our 4 key findings are:
 Cascade sets shrink but do not vanish, and herding should occur but less readily as greater weight is attached to posterity.
 An optimal mechanism rewards individuals mimicked by their successor.
 Cascades cannot start after period one under a signal logconcavity condition.
 Given this condition, efficient behaviour is contrarian, leaning against the myopically more popular actions in every period.
We make 2 technical contributions: as value functions with learning are not smooth, we use monotone comparative statics under uncertainty to deduce optimal dynamic behaviour. We also adapt dynamic pivot mechanisms to Bayesian learning.
[Keywords: herding, mimicking, contrarian, cascade, efficiency, monotonicity, logconcavity]
“First Return, Then Explore”, Ecoffet et al 2021
2021ecoffet.pdf#uber: “First return, then explore”, Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O. Stanley, Jeff Clune (2021-02-24)
“Is Pessimism Provably Efficient for Offline RL?”, Jin et al 2020
“Is Pessimism Provably Efficient for Offline RL?” (2020-12-30):
We study offline reinforcement learning (RL), which aims to learn an optimal policy based on a dataset collected a priori. Due to the lack of further interactions with the environment, offline RL suffers from the insufficient coverage of the dataset, which eludes most existing theoretical analysis. In this paper, we propose a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as the penalty function. Such a penalty function simply flips the sign of the bonus function for promoting exploration in online RL, which makes it easily implementable and compatible with general function approximators.
Without assuming the sufficient coverage of the dataset, we establish a data-dependent upper bound on the suboptimality of PEVI for general Markov decision processes (MDPs). When specialized to linear MDPs, it matches the information-theoretic lower bound up to multiplicative factors of the dimension and horizon. In other words, pessimism is not only provably efficient but also minimax optimal. In particular, given the dataset, the learned policy serves as the “best effort” among all policies, as no other policies can do better. Our theoretical analysis identifies the critical role of pessimism in eliminating a notion of spurious correlation, which emerges from the “irrelevant” trajectories that are less covered by the dataset and not informative for the optimal policy.
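A tabular, count-based caricature of this sign-flip idea (my illustration, not the paper’s linear-MDP construction): estimate rewards and transitions empirically from the offline dataset, then subtract a penalty that grows for thinly covered state-action pairs, so the learned policy avoids actions the data barely supports.

```python
import math
from collections import defaultdict

def pessimistic_vi(dataset, n_states, n_actions, gamma=0.9, beta=1.0, iters=200):
    """Value iteration on the empirical model of an offline dataset of
    (s, a, r, s') tuples, minus a count-based pessimism penalty beta/sqrt(n)."""
    n = defaultdict(int)
    r_sum = defaultdict(float)
    trans = defaultdict(lambda: defaultdict(int))
    for s, a, r, s2 in dataset:
        n[s, a] += 1
        r_sum[s, a] += r
        trans[s, a][s2] += 1
    V = [0.0] * n_states
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(iters):
        for s in range(n_states):
            for a in range(n_actions):
                if n[s, a] == 0:
                    Q[s][a] = -beta  # no coverage at all: maximal penalty
                    continue
                mean_r = r_sum[s, a] / n[s, a]
                exp_v = sum(c * V[t] for t, c in trans[s, a].items()) / n[s, a]
                Q[s][a] = mean_r + gamma * exp_v - beta / math.sqrt(n[s, a])
            V[s] = max(Q[s])
    return Q

# One state, two actions: the "safe" action 0 is well covered at reward 0.5;
# the "lucky" action 1 appears once with reward 1.0 — a spurious high estimate.
data = [(0, 0, 0.5, 0)] * 50 + [(0, 1, 1.0, 0)]
Q_pess = pessimistic_vi(data, n_states=1, n_actions=2, beta=1.0)
Q_naive = pessimistic_vi(data, n_states=1, n_actions=2, beta=0.0)
```

The naive empirical model trusts the single lucky sample of action 1; the pessimistic one discounts it by its 1/√1 penalty and prefers the well-supported action, which is the "spurious correlation" effect the abstract describes in miniature.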
“Meta-trained Agents Implement Bayes-optimal Agents”, Mikulik et al 2020
“Meta-trained agents implement Bayes-optimal agents” (2020-10-21):
Memory-based meta-learning is a powerful technique to build agents that adapt fast to any task within a target distribution. A previous theoretical study has argued that this remarkable performance is because the meta-training protocol incentivises agents to behave Bayes-optimally. We empirically investigate this claim on a number of prediction and bandit tasks. Inspired by ideas from theoretical computer science, we show that meta-learned and Bayes-optimal agents not only behave alike, but they even share a similar computational structure, in the sense that one agent system can approximately simulate the other. Furthermore, we show that Bayes-optimal agents are fixed points of the meta-learning dynamics. Our results suggest that memory-based meta-learning might serve as a general technique for numerically approximating Bayes-optimal agents—that is, even for task distributions for which we currently don’t possess tractable models.
“Learning Not to Learn: Nature versus Nurture in Silico”, Lange & Sprekeler 2020
“Learning not to learn: Nature versus nurture in silico” (2020-10-09):
Animals are equipped with a rich innate repertoire of sensory, behavioral and motor skills, which allows them to interact with the world immediately after birth. At the same time, many behaviors are highly adaptive and can be tailored to specific environments by means of learning. In this work, we use mathematical analysis and the framework of meta-learning (or ‘learning to learn’) to answer when it is beneficial to learn such an adaptive strategy and when to hard-code a heuristic behavior. We find that the interplay of ecological uncertainty, task complexity and the agents’ lifetime has crucial effects on the meta-learned amortized Bayesian inference performed by an agent. There exist two regimes: one in which meta-learning yields a learning algorithm that implements task-dependent information-integration, and a second regime in which meta-learning imprints a heuristic or ‘hard-coded’ behavior. Further analysis reveals that non-adaptive behaviors are not only optimal for aspects of the environment that are stable across individuals, but also in situations where an adaptation to the environment would in fact be highly beneficial, but could not be done quickly enough to be exploited within the remaining lifetime. Hard-coded behaviors should hence not only be those that always work, but also those that are too complex to be learned within a reasonable time frame.
“The Temporal Dynamics of Opportunity Costs: A Normative Account of Cognitive Fatigue and Boredom”, Agrawal et al 2020
“The Temporal Dynamics of Opportunity Costs: A Normative Account of Cognitive Fatigue and Boredom”, (2020-09-09):
Cognitive fatigue and boredom are two phenomenological states widely associated with limitations in cognitive control. In this paper, we present a rational analysis of the temporal structure of controlled behavior, which provides a new framework for a formal account of these phenomena. We suggest that in controlling behavior, the brain faces competing behavioral and computational imperatives, and must balance them by tracking their opportunity costs over time. We use this analysis to flesh out previous suggestions that feelings associated with subjective effort, like cognitive fatigue and boredom, are the phenomenological counterparts of these opportunity cost measures, rather than reflecting the depletion of resources as has often been assumed. Specifically, we propose that both fatigue and boredom reflect the competing value of particular options that require foregoing immediate reward but can improve future performance: fatigue reflects the value of offline computation (internal to the organism) to improve future decisions, while boredom signals the value of exploratory actions (external in the world) to gather information. We demonstrate that these accounts provide a mechanistically explicit and parsimonious account for a wide array of findings related to cognitive control, integrating and reimagining them under a single, formally rigorous framework.
“Assessing Game Balance With AlphaZero: Exploring Alternative Rule Sets in Chess”, Tomašev et al 2020
“Assessing Game Balance with AlphaZero: Exploring Alternative Rule Sets in Chess”, (2020-09-09):
It is non-trivial to design engaging and balanced sets of game rules. Modern chess has evolved over centuries, but without a similar recourse to history, the consequences of rule changes to game dynamics are difficult to predict. AlphaZero provides an alternative in silico means of game balance assessment. It is a system that can learn near-optimal strategies for any rule set from scratch, without any human supervision, by continually learning from its own experience. In this study we use AlphaZero to creatively explore and design new chess variants. There is growing interest in chess variants like Fischer Random Chess, because of classical chess’s voluminous opening theory, the high percentage of draws in professional play, and the non-negligible number of games that end while both players are still in their home preparation. We compare nine other variants that involve atomic changes to the rules of chess. The changes allow for novel strategic and tactical patterns to emerge, while keeping the games close to the original. By learning near-optimal strategies for each variant with AlphaZero, we determine what games between strong human players might look like if these variants were adopted. Qualitatively, several variants are very dynamic. An analytic comparison shows that pieces are valued differently between variants, and that some variants are more decisive than classical chess. Our findings demonstrate the rich possibilities that lie beyond the rules of modern chess.
“The Overfitted Brain: Dreams Evolved to Assist Generalization”, Hoel 2020
“The Overfitted Brain: Dreams evolved to assist generalization”, (2020-07-19):
Understanding of the evolved biological function of sleep has advanced considerably in the past decade. However, no equivalent understanding of dreams has emerged. Contemporary neuroscientific theories generally view dreams as epiphenomena, and the few proposals for their biological function are contradicted by the phenomenology of dreams themselves. Now, the recent advent of deep neural networks (DNNs) has finally provided the novel conceptual framework within which to understand the evolved function of dreams. Notably, all DNNs face the issue of overfitting as they learn, which is when performance on one data set increases but the network’s performance fails to generalize (often measured by the divergence of performance on training vs. testing data sets). This ubiquitous problem in DNNs is often solved by modelers via “noise injections” in the form of noisy or corrupted inputs. The goal of this paper is to argue that the brain faces a similar challenge of overfitting, and that nightly dreams evolved to combat the brain’s overfitting during its daily learning. That is, dreams are a biological mechanism for increasing generalizability via the creation of corrupted sensory inputs from stochastic activity across the hierarchy of neural structures. Sleep loss, specifically dream loss, leads to an overfitted brain that can still memorize and learn but fails to generalize appropriately. Herein this “overfitted brain hypothesis” is explicitly developed and then compared and contrasted with existing contemporary neuroscientific theories of dreams. Existing evidence for the hypothesis is surveyed within both neuroscience and deep learning, and a set of testable predictions are put forward that can be pursued both in vivo and in silico.
“Deep Reinforcement Learning and Its Neuroscientific Implications”, Botvinick et al 2020
“Deep Reinforcement Learning and Its Neuroscientific Implications”, Matthew Botvinick, Jane X. Wang, Will Dabney, Kevin J. Miller, Zeb Kurth-Nelson (2020-07-13)
“Exploration Strategies in Deep Reinforcement Learning”, Weng 2020
“Exploration Strategies in Deep Reinforcement Learning”, (2020-06-07):
Exploitation versus exploration is a critical topic in Reinforcement Learning. We’d like the RL agent to find the best solution as fast as possible. However, in the meantime, committing to solutions too quickly without enough exploration sounds pretty bad, as it could lead to local minima or total failure. Modern RL algorithms that optimize for the best returns can achieve good exploitation quite efficiently, while exploration remains more like an open topic.
I would like to discuss several common exploration strategies in Deep RL here. As this is a very big topic, my post by no means can cover all the important subtopics. I plan to update it periodically and keep further enriching the content gradually in time.
Classic Exploration Strategies
Key Exploration Problems
 The Hard-Exploration Problem
 The Noisy-TV Problem
Intrinsic Rewards as Exploration Bonuses
Count-based Exploration
 Counting by Density Model
 Counting after Hashing
Prediction-based Exploration
 Forward Dynamics
 Random Networks
 Physical Properties
Memory-based Exploration
 Episodic Memory
 Direct Exploration
Q-Value Exploration
Variational Options
“Synthetic Petri Dish: A Novel Surrogate Model for Rapid Architecture Search”, Rawal et al 2020
“Synthetic Petri Dish: A Novel Surrogate Model for Rapid Architecture Search”, (2020-05-27):
Neural Architecture Search (NAS) explores a large space of architectural motifs—a compute-intensive process that often involves ground-truth evaluation of each motif by instantiating it within a large network, and training and evaluating the network with thousands of domain-specific data samples. Inspired by how biological motifs such as cells are sometimes extracted from their natural environment and studied in an artificial Petri dish setting, this paper proposes the Synthetic Petri Dish model for evaluating architectural motifs. In the Synthetic Petri Dish, architectural motifs are instantiated in very small networks and evaluated using very few learned synthetic data samples (to effectively approximate performance in the full problem). The relative performance of motifs in the Synthetic Petri Dish can substitute for their ground-truth performance, thus accelerating the most expensive step of NAS. Unlike other neural network-based prediction models that parse the structure of the motif to estimate its performance, the Synthetic Petri Dish predicts motif performance by training the actual motif in an artificial setting, thus deriving predictions from its true intrinsic properties. Experiments in this paper demonstrate that the Synthetic Petri Dish can therefore predict the performance of new motifs with significantly higher accuracy, especially when insufficient ground-truth data is available. Our hope is that this work can inspire a new research direction in studying the performance of extracted components of models in an alternative controlled setting.
“Planning to Explore via Self-Supervised World Models”, Sekar et al 2020
“Planning to Explore via Self-Supervised World Models”, (2020-05-12):
Reinforcement learning allows solving complex tasks; however, the learning tends to be task-specific and the sample efficiency remains a challenge. We present Plan2Explore, a self-supervised reinforcement learning agent that tackles both these challenges through a new approach to self-supervised exploration and fast adaptation to new tasks, which need not be known during exploration. During exploration, unlike prior methods which retrospectively compute the novelty of observations after the agent has already reached them, our agent acts efficiently by leveraging planning to seek out expected future novelty. After exploration, the agent quickly adapts to multiple downstream tasks in a zero-shot or few-shot manner. We evaluate on challenging control tasks from high-dimensional image inputs. Without any training supervision or task-specific interaction, Plan2Explore outperforms prior self-supervised exploration methods, and in fact, almost matches the performance of an oracle which has access to rewards. Videos and code at https://ramanans1.github.io/plan2explore/
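Plan2Explore's key idea of planning toward *expected* future novelty can be illustrated with a toy ensemble of one-step dynamics models: disagreement among ensemble members stands in for novelty that can be estimated *before* visiting a state, so a planner would prefer the action the ensemble disagrees on most. This is a minimal sketch, not the paper's latent world-model implementation; the linear models, dimensions, and dynamics below are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

class EnsembleDynamics:
    # K linear models of next_state = W @ [state, action]; the variance of
    # their predictions serves as an intrinsic reward computable in advance.
    def __init__(self, state_dim, action_dim, k=5):
        self.models = [rng.normal(0, 0.1, (state_dim, state_dim + action_dim))
                       for _ in range(k)]
        self.lr = 0.1

    def predict(self, state, action):
        x = np.concatenate([state, action])
        return np.stack([w @ x for w in self.models])

    def disagreement(self, state, action):
        # Mean per-dimension variance across ensemble members.
        return float(self.predict(state, action).var(axis=0).mean())

    def train(self, state, action, next_state):
        x = np.concatenate([state, action])
        for w in self.models:
            err = w @ x - next_state
            w -= self.lr * np.outer(err, x)  # gradient step on squared error

env_step = lambda s, a: 0.9 * s + a[:1]      # toy known dynamics
ens = EnsembleDynamics(state_dim=1, action_dim=1)
s = np.zeros(1)
for _ in range(200):                         # only ever try the action a=+0.1
    a = np.array([0.1])
    ns = env_step(s, a)
    ens.train(s, a, ns)
    s = ns

# The ensemble now agrees on the familiar action and disagrees more on an
# unfamiliar one, so disagreement-seeking planning would pick the latter.
familiar = ens.disagreement(s, np.array([0.1]))
unfamiliar = ens.disagreement(s, np.array([5.0]))
print(unfamiliar > familiar)
```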
“Exploring Bayesian Optimization: Breaking Bayesian Optimization into Small, Sizeable Chunks”, Agnihotri & Batra 2020
“Exploring Bayesian Optimization: Breaking Bayesian Optimization into small, sizeable chunks”, (2020-05-05):
[Discussion of Bayesian optimization (BO), a decision-theoretic application of Bayesian statistics (typically using Gaussian processes for flexibility) which tries to model a set of variables to find the maximum or best point in the fewest number of collected data points possible. This differs from normal experiment design, which tries simply to maximize the overall information about all points given a fixed number of samples, not just the best point, and from “active learning”, which tries to select data points which make the model as predictive as possible while collecting samples. The difference can be visualized by watching posterior distributions for simple 2D problems evolve as data is collected according to different BO or active-learning or simple grid-search/random baseline strategies. The optimal strategy is usually infeasible to calculate, so various heuristics like “expected improvement” or “Thompson sampling” are used, and their different behavior can be visualized and compared. BO is heavily used in machine learning to find the best combinations of settings for machine learning models.]
In this article, we looked at Bayesian Optimization for optimizing a black-box function. Bayesian Optimization is well suited when the function evaluations are expensive, making grid or exhaustive search impractical. We looked at the key components of Bayesian Optimization. First, we looked at the notion of using a surrogate function (with a prior over the space of objective functions) to model our black-box function. Next, we looked at the “Bayes” in Bayesian Optimization — the function evaluations are used as data to obtain the surrogate posterior. We looked at acquisition functions, which are functions of the surrogate posterior and are optimized sequentially. This new sequential optimization is inexpensive and thus of utility to us. We also looked at a few acquisition functions and showed how these different functions balance exploration and exploitation. Finally, we looked at some practical examples of Bayesian Optimization for optimizing hyperparameters for machine learning models.
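As a concrete sketch of that loop, the following toy example fits a small RBF-kernel Gaussian-process surrogate and repeatedly samples where the expected-improvement acquisition function is largest. The kernel length-scale, the grid, and the objective function are all invented for illustration.

```python
import numpy as np
from math import erf

def rbf_kernel(a, b, length=0.5):
    # Squared-exponential kernel between two sets of 1-D points.
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(x_train, y_train, x_query, noise=1e-6):
    # Standard GP-regression posterior mean and variance (unit signal variance).
    k = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_star = rbf_kernel(x_query, x_train)
    k_inv = np.linalg.inv(k)
    mu = k_star @ k_inv @ y_train
    var = 1.0 - np.sum((k_star @ k_inv) * k_star, axis=1)
    return mu, np.maximum(var, 1e-12)

def expected_improvement(mu, var, best):
    # EI = (mu - best) * Phi(z) + sigma * phi(z), with z = (mu - best) / sigma.
    sigma = np.sqrt(var)
    z = (mu - best) / sigma
    phi = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    Phi = 0.5 * (1 + np.vectorize(erf)(z / np.sqrt(2)))
    return (mu - best) * Phi + sigma * phi

def objective(x):  # the "expensive" black-box function to maximize
    return np.sin(3 * x) * x

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 2, 3)           # a few initial random evaluations
y_train = objective(x_train)
grid = np.linspace(0, 2, 200)
for _ in range(10):                      # the BO loop: fit, acquire, evaluate
    mu, var = gp_posterior(x_train, y_train, grid)
    next_x = grid[np.argmax(expected_improvement(mu, var, y_train.max()))]
    x_train = np.append(x_train, next_x)
    y_train = np.append(y_train, objective(next_x))
print(x_train[np.argmax(y_train)])       # best x found; true maximizer is near 0.68
```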
“Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems”, Levine et al 2020
“Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems”, (2020-05-04):
In this tutorial article, we aim to provide the reader with the conceptual tools needed to get started on research on offline reinforcement learning algorithms: reinforcement learning algorithms that utilize previously collected data, without additional online data collection. Offline reinforcement learning algorithms hold tremendous promise for making it possible to turn large datasets into powerful decision-making engines. Effective offline reinforcement learning methods would be able to extract policies with the maximum possible utility out of the available data, thereby allowing automation of a wide range of decision-making domains, from healthcare and education to robotics. However, the limitations of current algorithms make this difficult. We will aim to provide the reader with an understanding of these challenges, particularly in the context of modern deep reinforcement learning methods, and describe some potential solutions that have been explored in recent work to mitigate these challenges, along with recent applications, and a discussion of perspectives on open problems in the field.
“Agent57: Outperforming the Human Atari Benchmark”, Puigdomènech et al 2020
“Agent57: Outperforming the human Atari benchmark”, (2020-03-31):
The Atari57 suite of games is a long-standing benchmark to gauge agent performance across a wide range of tasks. We’ve developed Agent57, the first deep reinforcement learning agent to obtain a score that is above the human baseline on all 57 Atari 2600 games. Agent57 combines an algorithm for efficient exploration with a meta-controller that adapts the exploration and long- vs. short-term behaviour of the agent.
…In 2012, the Arcade Learning Environment—a suite of 57 Atari 2600 games (dubbed Atari57)—was proposed as a benchmark set of tasks: these canonical Atari games pose a broad range of challenges for an agent to master…Unfortunately, the average performance can fail to capture how many tasks an agent is doing well on, and so is not a good statistic for determining how general an agent is: it captures that an agent is doing sufficiently well, but not that it is doing sufficiently well on a sufficiently wide set of tasks. So although average scores have increased, until now, the number of above-human games has not.
…Back in 2012, DeepMind developed the Deep Q-network agent (DQN) to tackle the Atari57 suite. Since then, the research community has developed many extensions and alternatives to DQN. Despite these advancements, however, all deep reinforcement learning agents have consistently failed to score in four games: Montezuma’s Revenge, Pitfall, Solaris and Skiing. For Agent57 to tackle these four challenging games in addition to the other Atari57 games, several changes to DQN were necessary.
DQN improvements
 Distributed agents
 Shortterm memory
 Episodic memory
Intrinsic motivation methods to encourage directed exploration
 Seeking novelty over long time scales
 Seeking novelty over short time scales
 Meta-controller: learning to balance exploration with exploitation
Agent57: putting it all together
…With Agent57, we have succeeded in building a more generally intelligent agent that has above-human performance on all tasks in the Atari57 benchmark. It builds on our previous agent Never Give Up, and instantiates an adaptive meta-controller that helps the agent to know when to explore and when to exploit, as well as what time horizon it would be useful to learn with. A wide range of tasks will naturally require different choices of both of these trade-offs, therefore the meta-controller provides a way to dynamically adapt such choices.
Agent57 was able to scale with increasing amounts of computation: the longer it trained, the higher its score got. While this enabled Agent57 to achieve strong general performance, it takes a lot of computation and time; the data efficiency can certainly be improved. Additionally, this agent shows better 5th-percentile performance on the set of Atari57 games. This by no means marks the end of Atari research, not only in terms of data efficiency, but also in terms of general performance. We offer two views on this: firstly, analyzing the performance among percentiles gives us new insights on how general algorithms are. While Agent57 achieves strong results on the first percentiles of the 57 games and holds better mean and median performance than NGU or R2D2, as illustrated by MuZero, it could still obtain a higher average performance. Secondly, all current algorithms are far from achieving optimal performance in some games. To that end, key improvements to use might be enhancements in the representations that Agent57 uses for exploration, planning, and credit assignment.
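The meta-controller idea can be sketched as a bandit over the policy family: each "arm" is one member of the exploratory-to-exploitative spectrum, and episode returns tell the bandit which member to run next. This is a bare-bones UCB sketch (Agent57 itself uses a sliding-window variant and learns per-actor); the arm count, reward means, and constants are invented for the example.

```python
import numpy as np

class MetaController:
    # UCB bandit choosing which policy (arm) to run for the next episode.
    def __init__(self, n_arms, c=1.0):
        self.counts = np.zeros(n_arms)
        self.values = np.zeros(n_arms)  # running mean episode return per arm
        self.c = c

    def select(self):
        # Try every arm once, then pick the arm with the highest UCB score.
        if (self.counts == 0).any():
            return int(np.argmin(self.counts))
        t = self.counts.sum() + 1
        ucb = self.values + self.c * np.sqrt(np.log(t) / self.counts)
        return int(np.argmax(ucb))

    def update(self, arm, episode_return):
        self.counts[arm] += 1
        self.values[arm] += (episode_return - self.values[arm]) / self.counts[arm]

# Toy usage: arm 2 (say, a moderate exploration rate) yields the best returns,
# so the meta-controller learns to run it most of the time.
rng = np.random.default_rng(1)
true_means = [0.2, 0.5, 0.9, 0.4]
mc = MetaController(4)
for _ in range(500):
    arm = mc.select()
    mc.update(arm, true_means[arm] + rng.normal(0, 0.1))
print(int(np.argmax(mc.counts)))  # prints 2
```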
“Agent57: Outperforming the Atari Human Benchmark”, Badia et al 2020
“Agent57: Outperforming the Atari Human Benchmark”, (2020-03-30):
Atari games have been a longstanding benchmark in the reinforcement learning (RL) community for the past decade. This benchmark was proposed to test general competency of RL algorithms. Previous work has achieved good average performance by doing outstandingly well on many games of the set, but very poorly in several of the most challenging games. We propose Agent57, the first deep RL agent that outperforms the standard human benchmark on all 57 Atari games. To achieve this result, we train a neural network which parameterizes a family of policies ranging from very exploratory to purely exploitative. We propose an adaptive mechanism to choose which policy to prioritize throughout the training process. Additionally, we utilize a novel parameterization of the architecture that allows for more consistent and stable learning.
“Meta-learning Curiosity Algorithms”, Alet et al 2020
“Meta-learning curiosity algorithms”, (2020-03-11):
We hypothesize that curiosity is a mechanism found by evolution that encourages meaningful exploration early in an agent’s life in order to expose it to experiences that enable it to obtain high rewards over the course of its lifetime. We formulate the problem of generating curious behavior as one of meta-learning: an outer loop will search over a space of curiosity mechanisms that dynamically adapt the agent’s reward signal, and an inner loop will perform standard reinforcement learning using the adapted reward signal. However, current meta-RL methods based on transferring neural network weights have only generalized between very similar tasks. To broaden the generalization, we instead propose to meta-learn algorithms: pieces of code similar to those designed by humans in ML papers. Our rich language of programs combines neural networks with other building blocks such as buffers, nearest-neighbor modules and custom loss functions. We demonstrate the effectiveness of the approach empirically, finding two novel curiosity algorithms that perform on par or better than human-designed published curiosity algorithms in domains as disparate as grid navigation with image inputs, acrobot, lunar lander, ant and hopper.
“Curriculum Learning for Reinforcement Learning Domains: A Framework and Survey”, Narvekar et al 2020
“Curriculum Learning for Reinforcement Learning Domains: A Framework and Survey”, (2020-03-10):
Reinforcement learning (RL) is a popular paradigm for addressing sequential decision tasks in which the agent has only limited environmental feedback. Despite many advances over the past three decades, learning in many domains still requires a large amount of interaction with the environment, which can be prohibitively expensive in realistic scenarios. To address this problem, transfer learning has been applied to reinforcement learning such that experience gained in one task can be leveraged when starting to learn the next, harder task. More recently, several lines of research have explored how tasks, or data samples themselves, can be sequenced into a curriculum for the purpose of learning a problem that may otherwise be too difficult to learn from scratch. In this article, we present a framework for curriculum learning (CL) in reinforcement learning, and use it to survey and classify existing CL methods in terms of their assumptions, capabilities, and goals. Finally, we use our framework to find open problems and suggest directions for future RL curriculum learning research.
“AutoML-Zero: Evolving Machine Learning Algorithms From Scratch”, Real et al 2020
“AutoML-Zero: Evolving Machine Learning Algorithms From Scratch”, (2020-03-06):
Machine learning research has advanced in multiple aspects, including model structures and learning methods. The effort to automate such research, known as AutoML, has also made significant progress. However, this progress has largely focused on the architecture of neural networks, where it has relied on sophisticated expert-designed layers as building blocks—or similarly restrictive search spaces. Our goal is to show that AutoML can go further: it is possible today to automatically discover complete machine learning algorithms just using basic mathematical operations as building blocks. We demonstrate this by introducing a novel framework that significantly reduces human bias through a generic search space. Despite the vastness of this space, evolutionary search can still discover two-layer neural networks trained by backpropagation. These simple neural networks can then be surpassed by evolving directly on tasks of interest, e.g. CIFAR-10 variants, where modern techniques emerge in the top algorithms, such as bilinear interactions, normalized gradients, and weight averaging. Moreover, evolution adapts algorithms to different task types: e.g., dropout-like techniques appear when little data is available. We believe these preliminary successes in discovering machine learning algorithms from scratch indicate a promising new direction for the field.
“Never Give Up: Learning Directed Exploration Strategies”, Badia et al 2020
“Never Give Up: Learning Directed Exploration Strategies”, (2020-02-14):
We propose a reinforcement learning agent to solve hard exploration games by learning a range of directed exploratory policies. We construct an episodic memory-based intrinsic reward using k-nearest neighbors over the agent’s recent experience to train the directed exploratory policies, thereby encouraging the agent to repeatedly revisit all states in its environment. A self-supervised inverse dynamics model is used to train the embeddings of the nearest-neighbour lookup, biasing the novelty signal towards what the agent can control. We employ the framework of Universal Value Function Approximators (UVFA) to simultaneously learn many directed exploration policies with the same neural network, with different trade-offs between exploration and exploitation. By using the same neural network for different degrees of exploration/exploitation, transfer is demonstrated from predominantly exploratory policies yielding effective exploitative policies. The proposed method can be incorporated to run with modern distributed RL agents that collect large amounts of experience from many actors running in parallel on separate environment instances. Our method doubles the performance of the base agent in all hard-exploration games in the Atari57 suite while maintaining a very high score across the remaining games, obtaining a median human-normalised score of 1344.0%. Notably, the proposed method is the first algorithm to achieve non-zero rewards (with a mean score of 8,400) in the game of Pitfall! without using demonstrations or hand-crafted features.
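The episodic k-nearest-neighbour bonus can be sketched in a few lines: the intrinsic reward shrinks as the current embedding accumulates close neighbours in this episode's memory. This is a hedged simplification of the paper's kernel-based pseudo-count; the learned embedding network, distance normalization, and running averages are omitted, and the constants below are invented.

```python
import numpy as np

def episodic_bonus(memory, embedding, k=3, eps=1e-3):
    # Intrinsic reward from k-nearest-neighbour similarity in episodic memory:
    # high similarity to stored embeddings -> small bonus, and vice versa.
    if len(memory) == 0:
        return 1.0
    sq_dists = np.linalg.norm(np.asarray(memory) - embedding, axis=1) ** 2
    knn = np.sort(sq_dists)[:k]
    sim = eps / (knn + eps)          # kernel similarity to each neighbour
    return 1.0 / np.sqrt(sim.sum() + 1e-8)

memory = []
seen = np.array([0.0, 0.0])          # a state the agent keeps revisiting
novel = np.array([5.0, 5.0])         # a state far from anything in memory
for _ in range(10):
    memory.append(seen.copy())

b_seen = episodic_bonus(memory, seen)
b_novel = episodic_bonus(memory, novel)
print(b_novel > b_seen)              # prints True: novelty earns a larger bonus
```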
“R2D3: Making Efficient Use of Demonstrations to Solve Hard Exploration Problems”, Paine et al 2019
“R2D3: Making Efficient Use of Demonstrations to Solve Hard Exploration Problems”, (2019-09-03):
[previously: R2D2] This paper introduces R2D3, an agent that makes efficient use of demonstrations to solve hard exploration problems in partially observable environments with highly variable initial conditions. We also introduce a suite of 8 tasks that combine these 3 properties, and show that R2D3 can solve several of the tasks where other state of the art methods (both with and without demonstrations) fail to see even a single successful trajectory after tens of billions of steps of exploration.
…Wall Sensor Stack: The original Wall Sensor Stack environment had a bug that the R2D3 agent was able to exploit. We fixed the bug and verified the agent can learn the proper stacking behavior.
…Another desirable property of our approach is that our agents are able to learn to outperform the demonstrators, and in some cases even to discover strategies that the demonstrators were not aware of. In one of our tasks the agent is able to discover and exploit a bug in the environment in spite of all the demonstrators completing the task in the intended way…R2D3 performed better than our average human demonstrator on Baseball, Drawbridge, Navigate Cubes and the Wall Sensor tasks. The behavior on Wall Sensor Stack in particular is quite interesting. On this task R2D3 found a completely different strategy than the human demonstrators by exploiting a bug in the implementation of the environment. The intended strategy for this task is to stack 2 blocks on top of each other so that one of them can remain in contact with a wall mounted sensor, and this is the strategy employed by the demonstrators. However, due to a bug in the environment the strategy learned by R2D3 was to trick the sensor into remaining active even when it is not in contact with the key by pressing the key against it in a precise way.
“Benchmarking BonusBased Exploration Methods on the Arcade Learning Environment”, Taïga et al 2019
“Benchmarking Bonus-Based Exploration Methods on the Arcade Learning Environment”, (2019-08-06):
This paper provides an empirical evaluation of recently developed exploration algorithms within the Arcade Learning Environment (ALE). We study the use of different reward bonuses that incentivize exploration in reinforcement learning. We do so by fixing the learning algorithm used and focusing only on the impact of the different exploration bonuses on the agent’s performance. We use Rainbow, the state-of-the-art algorithm for value-based agents, and focus on some of the bonuses proposed in the last few years. We consider the impact these algorithms have on performance within the popular game Montezuma’s Revenge, which has gathered a lot of interest from the exploration community, across the set of seven games identified by Bellemare et al 2016 as challenging for exploration, and easier games where exploration is not an issue. We find that, in our setting, recently developed bonuses do not provide significantly improved performance on Montezuma’s Revenge or hard exploration games. We also find that existing bonus-based methods may negatively impact performance on games in which exploration is not an issue and may even perform worse than ε-greedy exploration.
“A Unified Bellman Optimality Principle Combining Reward Maximization and Empowerment”, Leibfried et al 2019
“A Unified Bellman Optimality Principle Combining Reward Maximization and Empowerment”, (2019-07-26):
Empowerment is an information-theoretic method that can be used to intrinsically motivate learning agents. It attempts to maximize an agent’s control over the environment by encouraging visiting states with a large number of reachable next states. Empowered learning has been shown to lead to complex behaviors, without requiring an explicit reward signal. In this paper, we investigate the use of empowerment in the presence of an extrinsic reward signal. We hypothesize that empowerment can guide reinforcement learning (RL) agents to find good early behavioral solutions by encouraging highly empowered states. We propose a unified Bellman optimality principle for empowered reward maximization. Our empowered reward maximization approach generalizes both Bellman’s optimality principle as well as recent information-theoretic extensions to it. We prove uniqueness of the empowered values and show convergence to the optimal solution. We then apply this idea to develop off-policy actor-critic RL algorithms which we validate in high-dimensional continuous robotics domains (MuJoCo). Our methods demonstrate improved initial and competitive final performance compared to model-free state-of-the-art techniques.
“Meta Reinforcement Learning”, Weng 2019
“Meta Reinforcement Learning”, (2019-06-23):
[Review/discussion] Meta-RL is meta-learning on reinforcement learning tasks. After being trained over a distribution of tasks, the agent is able to solve a new task by developing a new RL algorithm with its internal activity dynamics. This post starts with the origin of meta-RL and then dives into three key components of meta-RL…A good meta-learning model is expected to generalize to new tasks or new environments that have never been encountered during training. The adaptation process, essentially a mini learning session, happens at test time with limited exposure to the new configurations. Even without any explicit fine-tuning (no gradient backpropagation on trainable variables), the meta-learning model autonomously adjusts internal hidden states to learn.
“Search on the Replay Buffer: Bridging Planning and Reinforcement Learning”, Eysenbach et al 2019
“Search on the Replay Buffer: Bridging Planning and Reinforcement Learning”, (2019-06-12):
The history of learning for control has been an exciting back and forth between two broad classes of algorithms: planning and reinforcement learning. Planning algorithms effectively reason over long horizons, but assume access to a local policy and distance metric over collision-free paths. Reinforcement learning excels at learning policies and the relative values of states, but fails to plan over long horizons. Despite the successes of each method in various domains, tasks that require reasoning over long horizons with limited feedback and high-dimensional observations remain exceedingly challenging for both planning and reinforcement learning algorithms. Frustratingly, these sorts of tasks are potentially the most useful, as they are simple to design (a human needs only to provide an example goal state) and avoid reward shaping, which can bias the agent towards finding a suboptimal solution. We introduce a general control algorithm that combines the strengths of planning and reinforcement learning to effectively solve these tasks. Our aim is to decompose the task of reaching a distant goal state into a sequence of easier tasks, each of which corresponds to reaching a subgoal. Planning algorithms can automatically find these waypoints, but only if provided with suitable abstractions of the environment—namely, a graph consisting of nodes and edges. Our main insight is that this graph can be constructed via reinforcement learning, where a goal-conditioned value function provides edge weights, and nodes are taken to be previously seen observations in a replay buffer. Using graph search over our replay buffer, we can automatically generate this sequence of subgoals, even in image-based environments. Our algorithm, search on the replay buffer (SoRB), enables agents to solve sparse-reward tasks over one hundred steps, and generalizes substantially better than standard RL algorithms.
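The graph-plus-search construction can be sketched as follows, with a hypothetical `value_distance` function standing in for the learned goal-conditioned value estimate the paper uses for edge weights (here it is just Euclidean distance between toy states):

```python
import heapq
import numpy as np

def value_distance(a, b):
    # Stand-in for the learned -V(s, g): predicted steps between two states.
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

def build_graph(buffer, max_dist=1.5):
    # Connect replay-buffer states the value estimate says are nearby.
    edges = {i: [] for i in range(len(buffer))}
    for i in range(len(buffer)):
        for j in range(len(buffer)):
            if i != j:
                d = value_distance(buffer[i], buffer[j])
                if d <= max_dist:
                    edges[i].append((j, d))
    return edges

def shortest_path(edges, start, goal):
    # Dijkstra search over the replay-buffer graph.
    dist, prev, pq = {start: 0.0}, {}, [(0.0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, w in edges[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1]

buffer = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]   # states along a corridor
waypoints = shortest_path(build_graph(buffer), start=0, goal=4)
print(waypoints)  # prints [0, 1, 2, 3, 4]: the subgoal sequence to the goal
```

Each waypoint then becomes an easier goal handed to the goal-conditioned policy in turn.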
“ICML 2019 Notes”, Abel 2019
“ICML 2019 Notes”, (2019-06):
The 2019 ICML edition of David Abel’s famous conference notes: he goes to as many presentations and talks as possible, jotting down opinionated summaries & equations, with a particular focus on DRL. Topics covered:
Tutorial: PAC-Bayes Theory (Part II) · PAC-Bayes Theory · PAC-Bayes and Task Awareness · Tutorial: Meta-Learning · Two Ways to View Meta-Learning · Meta-Learning Algorithms · Meta-Reinforcement Learning · Challenges and Frontiers in Meta Learning · Tuesday June: Main Conference Best Paper Talk: Challenging Assumptions in Learning Disentangled Representations Contributed Talks: Deep RL · DQN and Time Discretization · Nonlinear Distributional Gradient TD Learning · Composing Entropic Policies using Divergence Correction · TibGM: A Graphical Model Approach for RL · Multi-Agent Adversarial IRL · Policy Consolidation for Continual RL · Off-Policy Evaluation Deep RL w/o Exploration · Random Expert Distillation · Revisiting the Softmax Bellman Operator · Contributed Talks: RL Theory · Distributional RL for Efficient Exploration · Optimistic Policy Optimization via Importance Sampling · Neural Logic RL · Learning to Collaborate in MDPs · Predictor-Corrector Policy Optimization · Learning a Prior over Intent via Meta IRL · DeepMDP: Learning Latent Space Models for RL · Importance Sampling Policy Evaluation · Learning from a Learner · Separating Value Functions Across Time-Scales · Learning Action Representations in RL · Bayesian Counterfactual Risk Minimization · Per-Decision Option Counting · Problem Dependent Regret Bounds in RL · A Theory of Regularized MDPs · Discovering Options for Exploration by Minimizing Cover Time · Policy Certificates: Towards Accountable RL · Action Robust RL · The Value Function Polytope · Wednesday June: Main Conference Contributed Talks: Multitask and Lifelong Learning · Domain Agnostic Learning with Disentangled Representations · Composing Value Functions in RL · CAVIA: Fast Context Adaptation via Meta Learning · Gradient Based Meta-Learning · Towards Understanding Knowledge Distillation · Transferable Adversarial Training · Contributed Talks: RL Theory · Provably Efficient Imitation Learning from Observation Alone · Dead Ends and Secure Exploration ·
Statistics and Samples in Distributional RL · Hessian Aided Policy Gradient · Maximum Entropy Exploration · Combining Multiple Models for Off-Policy Evaluation · Sample-Optimal Parametric Q-Learning Using Linear Features · Transfer of Samples in Policy Search · Exploration Conscious RL Revisited · Kernel Based RL in Robust MDPs · Thursday June: Main Conference Contributed Talks: RL · Batch Policy learning under Constraints · Quantifying Generalization in RL · Learning Latent Dynamics for Planning from Pixels · Projections for Approximate Policy Iteration · Learning Structured Decision Problems with Unawareness · Calibrated Model-Based Deep RL · RL in Configurable Continuous Environments · Target-Based Temporal-Difference Learning · Linearized Control: Stable Algorithms and Complexity Guarantees · Contributed Talks: Deep Learning Theory · Why do Larger Models Generalize Better? · On the Spectral Bias of Neural Nets · Recursive Sketches for Modular Deep Learning · Zero-Shot Knowledge Distillation in Deep Networks · Convergence Theory for Deep Learning via Over-Parameterization · Best Paper Award: Rates of Convergence for Sparse Gaussian Process Regression · Friday June: Workshops Workshop: AI for Climate Change · John Platt on What ML can do to help Climate Change · Jack Kelly: Why It’s Hard to Mitigate Climate Change, and How to Do Better, Andrew Ng: Tackling Climate Change with AI through Collaboration · Workshop: RL for Real Life · Panel Discussion · Workshop: Real World Sequential Decision Making · Emma Brunskill on Efficient RL When Data is Costly · Miro Dudik: Doubly Robust Off-Policy Evaluation via Shrinkage
“Human-level Performance in 3D Multiplayer Games With Population-based Reinforcement Learning”, Jaderberg et al 2019
“Human-level performance in 3D multiplayer games with population-based reinforcement learning” (2019-05-31):
Artificial teamwork: Artificially intelligent agents are getting better and better at 2-player games, but most real-world endeavors require teamwork. Jaderberg et al. designed a computer program that excels at playing the video game Quake III Arena in Capture the Flag mode, where 2 multiplayer teams compete in capturing the flags of the opposing team. The agents were trained by playing thousands of games, gradually learning successful strategies not unlike those favored by their human counterparts. Computer agents competed successfully against humans even when their reaction times were slowed to match those of humans.
Reinforcement learning (RL) has shown great success in increasingly complex single-agent environments and 2-player turn-based games. However, the real world contains multiple agents, each learning and acting independently to cooperate and compete with other agents. We used a tournament-style evaluation to demonstrate that an agent can achieve human-level performance in a 3-dimensional multiplayer first-person video game, Quake III Arena in Capture the Flag mode, using only pixels and game points scored as input. We used a 2-tier optimization process in which a population of independent RL agents are trained concurrently from thousands of parallel matches on randomly generated environments. Each agent learns its own internal reward signal and rich representation of the world. These results indicate the great potential of multi-agent reinforcement learning for artificial intelligence research.
“Reinforcement Learning, Fast and Slow”, Botvinick et al 2019
“Reinforcement Learning, Fast and Slow” (2019-05-16):
Recent AI research has given rise to powerful techniques for deep reinforcement learning. In their combination of representation learning with reward-driven behavior, deep reinforcement learning would appear to have inherent interest for psychology and neuroscience.
One reservation has been that deep reinforcement learning procedures demand large amounts of training data, suggesting that these algorithms may differ fundamentally from those underlying human learning.
While this concern applies to the initial wave of deep RL techniques, subsequent AI work has established methods that allow deep RL systems to learn more quickly and efficiently. Two particularly interesting and promising techniques center, respectively, on episodic memory and meta-learning. Alongside their interest as AI techniques, deep RL methods leveraging episodic memory and meta-learning have direct and interesting implications for psychology and neuroscience. One subtle but critically important insight which these techniques bring into focus is the fundamental connection between fast and slow forms of learning.
Deep reinforcement learning (RL) methods have driven impressive advances in artificial intelligence in recent years, exceeding human performance in domains ranging from Atari to Go to no-limit poker. This progress has drawn the attention of cognitive scientists interested in understanding human learning. However, the concern has been raised that deep RL may be too sample-inefficient—that is, it may simply be too slow—to provide a plausible model of how humans learn. In the present review, we counter this critique by describing recently developed techniques that allow deep RL to operate more nimbly, solving problems much more quickly than previous methods. Although these techniques were developed in an AI context, we propose that they may have rich implications for psychology and neuroscience. A key insight, arising from these AI methods, concerns the fundamental connection between fast RL and slower, more incremental forms of learning.
“Meta Reinforcement Learning As Task Inference”, Humplik et al 2019
“Meta reinforcement learning as task inference” (2019-05-15):
Humans achieve efficient learning by relying on prior knowledge about the structure of naturally occurring tasks. There is considerable interest in designing reinforcement learning (RL) algorithms with similar properties. This includes proposals to learn the learning algorithm itself, an idea also known as meta-learning. One formal interpretation of this idea is as a partially observable multi-task RL problem in which task information is hidden from the agent. Such unknown-task problems can be reduced to Markov decision processes (MDPs) by augmenting an agent’s observations with an estimate of the belief about the task based on past experience. However, estimating the belief state is intractable in most partially-observed MDPs. We propose a method that separately learns the policy and the task belief by taking advantage of various kinds of privileged information. Our approach can be very effective at solving standard meta-RL environments, as well as a complex continuous control environment with sparse rewards and requiring long-term memory.
“Meta-learning of Sequential Strategies”, Ortega et al 2019
“Meta-learning of Sequential Strategies” (2019-05-08):
In this report we review memory-based meta-learning as a tool for building sample-efficient strategies that learn from past experience to adapt to any task within a target class. Our goal is to equip the reader with the conceptual foundations of this tool for building new, scalable agents that operate on broad domains. To do so, we present basic algorithmic templates for building near-optimal predictors and reinforcement learners which behave as if they had a probabilistic model that allowed them to efficiently exploit task structure. Furthermore, we recast memory-based meta-learning within a Bayesian framework, showing that the meta-learned strategies are near-optimal because they amortize Bayes-filtered data, where the adaptation is implemented in the memory dynamics as a state-machine of sufficient statistics. Essentially, memory-based meta-learning translates the hard problem of probabilistic sequential inference into a regression problem.
“Meta-learners’ Learning Dynamics Are Unlike Learners’”, Rabinowitz 2019
“Meta-learners’ learning dynamics are unlike learners’” (2019-05-03):
Meta-learning is a tool that allows us to build sample-efficient learning systems. Here we show that, once meta-trained, LSTM Meta-Learners aren’t just faster learners than their sample-inefficient deep learning (DL) and reinforcement learning (RL) brethren, but that they actually pursue fundamentally different learning trajectories. We study their learning dynamics on three sets of structured tasks for which the corresponding learning dynamics of DL and RL systems have been previously described: linear regression (Saxe et al., 2013), nonlinear regression (Rahaman et al., 2018; Xu et al., 2018), and contextual bandits (Schaul et al., 2019). In each case, while sample-inefficient DL and RL Learners uncover the task structure in a staggered manner, meta-trained LSTM Meta-Learners uncover almost all task structure concurrently, congruent with the patterns expected from Bayes-optimal inference algorithms. This has implications for research areas wherever the learning behaviour itself is of interest, such as safety, curriculum design, and human-in-the-loop machine learning.
“Ray Interference: a Source of Plateaus in Deep Reinforcement Learning”, Schaul et al 2019
“Ray Interference: a Source of Plateaus in Deep Reinforcement Learning” (2019-04-25):
Rather than proposing a new method, this paper investigates an issue present in existing learning algorithms. We study the learning dynamics of reinforcement learning (RL), specifically a characteristic coupling between learning and data generation that arises because RL agents control their future data distribution. In the presence of function approximation, this coupling can lead to a problematic type of ‘ray interference’, characterized by learning dynamics that sequentially traverse a number of performance plateaus, effectively constraining the agent to learn one thing at a time even when learning in parallel is better. We establish the conditions under which ray interference occurs, show its relation to saddle points and obtain the exact learning dynamics in a restricted setting. We characterize a number of its properties and discuss possible remedies.
“Learning To Follow Directions in Street View”, Hermann et al 2019
“Learning To Follow Directions in Street View” (2019-03-01):
Navigating and understanding the real world remains a key challenge in machine learning and inspires a great variety of research in areas such as language grounding, planning, navigation and computer vision. We propose an instruction-following task that requires all of the above, and which combines the practicality of simulated environments with the challenges of ambiguous, noisy real-world data. StreetNav is built on top of Google Street View and provides visually accurate environments representing real places. Agents are given driving instructions which they must learn to interpret in order to successfully navigate in this environment. Since humans equipped with driving instructions can readily navigate in previously unseen cities, we set a high bar and test our trained agents for similar cognitive capabilities. Although deep reinforcement learning (RL) methods are frequently evaluated only on data that closely follow the training distribution, our dataset extends to multiple cities and has a clean train/test separation. This allows for thorough testing of generalisation ability. This paper presents the StreetNav environment and tasks, models that establish strong baselines, and extensive analysis of the task and the trained agents.
“A Generalized Framework for Population Based Training”, Li et al 2019
“A Generalized Framework for Population Based Training” (2019-02-05):
Population Based Training (PBT) is a recent approach that jointly optimizes neural network weights and hyperparameters: it periodically copies the weights of the best performers and mutates hyperparameters during training. Previous PBT implementations have been synchronized glass-box systems. We propose a general, black-box PBT framework that distributes many asynchronous “trials” (a small number of training steps with warm-starting) across a cluster, coordinated by the PBT controller. The black-box design does not make assumptions on model architectures, loss functions or training procedures. Our system supports dynamic hyperparameter schedules to optimize both differentiable and non-differentiable metrics. We apply our system to train a state-of-the-art WaveNet generative model for human voice synthesis. We show that our PBT system achieves better accuracy, less sensitivity and faster convergence compared to existing methods, given the same computational resources.
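The exploit-and-explore loop at the heart of PBT can be sketched in a few lines. Everything below (the quadratic toy "training" objective, population size, mutation factors) is an illustrative assumption, not the paper's black-box system:

```python
import random

random.seed(0)

def train_step(w, lr):
    """One 'training step': gradient descent on the toy loss f(w) = w^2."""
    return w - lr * 2 * w

def evaluate(w):
    """Higher is better: negative loss."""
    return -w ** 2

# A population of trials, each with its own weights and hyperparameter.
population = [{"w": 10.0, "lr": random.uniform(1e-4, 0.4)} for _ in range(8)]

for _ in range(20):                       # PBT generations
    for trial in population:              # short training bursts per trial
        for _ in range(5):
            trial["w"] = train_step(trial["w"], trial["lr"])
    population.sort(key=lambda t: evaluate(t["w"]), reverse=True)
    # Exploit: the worst trials copy weights from the best (warm-starting)...
    for loser, winner in zip(population[-2:], population[:2]):
        loser["w"] = winner["w"]
        # ...and explore: mutate the copied hyperparameter.
        loser["lr"] = winner["lr"] * random.choice([0.8, 1.25])

best = population[0]
```

Because weight copying and hyperparameter mutation happen during a single training run, PBT discovers a learning-rate *schedule* rather than one fixed value, which is what the asynchronous "trials" in the paper's framework generalize.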
“Go-Explore: a New Approach for Hard-Exploration Problems”, Ecoffet et al 2019
“Go-Explore: a New Approach for Hard-Exploration Problems” (2019-01-30):
A grand challenge in reinforcement learning is intelligent exploration, especially when rewards are sparse or deceptive. Two Atari games serve as benchmarks for such hard-exploration domains: Montezuma’s Revenge and Pitfall. On both games, current RL algorithms perform poorly, even those with intrinsic motivation, which is the dominant method to improve performance on hard-exploration domains. To address this shortfall, we introduce a new algorithm called Go-Explore. It exploits the following principles: (1) remember previously visited states, (2) first return to a promising state (without exploration), then explore from it, and (3) solve simulated environments through any available means (including by introducing determinism), then robustify via imitation learning. The combined effect of these principles is a dramatic performance improvement on hard-exploration problems. On Montezuma’s Revenge, Go-Explore scores a mean of over 43k points, almost 4 times the previous state of the art. Go-Explore can also harness human-provided domain knowledge and, when augmented with it, scores a mean of over 650k points on Montezuma’s Revenge. Its max performance of nearly 18 million surpasses the human world record, meeting even the strictest definition of “superhuman” performance. On Pitfall, Go-Explore with domain knowledge is the first algorithm to score above zero. Its mean score of almost 60k points exceeds expert human performance. Because Go-Explore produces high-performing demonstrations automatically and cheaply, it also outperforms imitation learning work where humans provide solution demonstrations. Go-Explore opens up many new research directions into improving it and weaving its insights into current RL algorithms. It may also enable progress on previously unsolvable hard-exploration problems in many domains, especially those that harness a simulator during training (e.g. robotics).
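The first two principles ("remember states" and "first return, then explore") can be illustrated on a toy deterministic chain environment with a sparse reward. The cell representation (the raw state itself) and all constants below are illustrative assumptions, not the Atari setup:

```python
import random

random.seed(0)

GOAL = 20   # sparse reward: only reaching position 20 on the chain pays off

def explore_from(state, n_steps=5):
    """Random exploration from a restored state; returns visited states."""
    visited = []
    for _ in range(n_steps):
        state = max(0, state + random.choice([-1, 1]))
        visited.append(state)
    return visited

# The archive records every 'cell' we have discovered; because the toy
# environment is deterministic, 'returning' to an archived cell before
# exploring is free (principle 2: first return, then explore).
archive = {0}

for _ in range(2000):
    start = random.choice(sorted(archive))   # pick a cell to return to
    for s in explore_from(start):            # then explore from it
        archive.add(s)                       # principle 1: remember states
    if GOAL in archive:
        break
```

Because exploration always restarts from the archive's frontier instead of from the initial state, progress toward the distant goal is never thrown away, which is exactly the failure mode of naive random exploration that Go-Explore targets.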
“Is the FDA Too Conservative or Too Aggressive?: A Bayesian Decision Analysis of Clinical Trial Design”, Isakov et al 2019
2019isakov.pdf: “Is the FDA too conservative or too aggressive?: A Bayesian decision analysis of clinical trial design” (2019-01-04):
Implicit in the drug-approval process is a host of decisions—target patient population, control group, primary endpoint, sample size, follow-up period, etc.—all of which determine the tradeoff between Type I and Type II error. We explore the application of Bayesian decision analysis (BDA) to minimize the expected cost of drug approval, where the relative costs of the two types of errors are calibrated using U.S. Burden of Disease Study 2010 data. The results for conventional fixed-sample randomized clinical-trial designs suggest that for terminal illnesses with no existing therapies such as pancreatic cancer, the standard threshold of 2.5% is substantially more conservative than the BDA-optimal threshold of 23.9% to 27.8%. For relatively less deadly conditions such as prostate cancer, 2.5% is more risk-tolerant or aggressive than the BDA-optimal threshold of 1.2% to 1.5%. We compute BDA-optimal sizes for 25 of the most lethal diseases and show how a BDA-informed approval process can incorporate all stakeholders’ views in a systematic, transparent, internally consistent, and repeatable manner.
“Machine-learning-guided Directed Evolution for Protein Engineering”, Yang et al 2019
2019yang.pdf: “Machine-learning-guided directed evolution for protein engineering”, Kevin K. Yang, Zachary Wu, Frances H. Arnold (2019-01-01)
“Common Neural Code for Reward and Information Value”, Kobayashi & Hsu 2019
2019kobayashi.pdf: “Common neural code for reward and information value” (2019-01-01):
Adaptive information seeking is critical for goal-directed behavior. Growing evidence suggests the importance of intrinsic motives such as curiosity or need for novelty, mediated through dopaminergic valuation systems, in driving information-seeking behavior. However, valuing information for its own sake can be highly suboptimal when agents need to evaluate the instrumental benefit of information in a forward-looking manner. Here we show that information-seeking behavior in humans is driven by subjective value that is shaped by both instrumental and non-instrumental motives, and that this subjective value of information (SVOI) shares a common neural code with more basic reward value. Specifically, using a task where subjects could purchase information to reduce uncertainty about outcomes of a monetary lottery, we found information purchase decisions could be captured by a computational model of SVOI incorporating utility of anticipation, a form of non-instrumental motive for information seeking, in addition to instrumental benefits. Neurally, trial-by-trial variation in SVOI was correlated with activity in striatum and ventromedial prefrontal cortex. Furthermore, cross-categorical decoding revealed that, within these regions, SVOI and expected utility of lotteries were represented using a common code. These findings provide support for the common currency hypothesis and shed insight on neurocognitive mechanisms underlying information-seeking behavior.
“The Bayesian Superorganism III: Externalised Memories Facilitate Distributed Sampling”, Hunt et al 2018
“The Bayesian Superorganism III: externalised memories facilitate distributed sampling” (2018-12-21):
A key challenge for any animal is to avoid wasting time by searching for resources in places it has already found to be unprofitable. This challenge is particularly strong when the organism is a central place forager—returning to a nest between foraging bouts—because it is destined repeatedly to cover much the same ground. Furthermore, this problem will reach its zenith if many individuals forage from the same central place, as in social insects. Foraging performance may be greatly enhanced by coordinating movement trajectories such that each ant visits separate parts of the surrounding (unknown) space. In this third of three papers, we find experimental evidence for an externalised spatial memory in Temnothorax albipennis ants: chemical markers (either pheromones or other cues such as cuticular hydrocarbon footprints) that are used by nestmates to mark explored space. We show these markers could be used by the ants to scout the space surrounding their nest more efficiently through indirect coordination. We also develop a simple model of this marking behaviour that can be applied in the context of Markov chain Monte Carlo methods (see part two of this series). This substantially enhances the performance of standard methods like the Metropolis–Hastings algorithm in sampling from sparse probability distributions (such as those confronted by the ants) with little additional computational cost. Our Bayesian framework for superorganismal behaviour motivates the evolution of exploratory mechanisms such as trail marking in terms of enhanced collective information processing.
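For context, the standard Metropolis–Hastings baseline that the ants' externalised markers are said to enhance looks like this. A minimal 1-D sketch with an illustrative bimodal target density; no trail-marking modification is implemented here:

```python
import math
import random

random.seed(0)

def log_density(x):
    """Un-normalized log-density of an illustrative bimodal target."""
    return math.log(math.exp(-(x - 2) ** 2) + math.exp(-(x + 2) ** 2))

def metropolis_hastings(n_samples, step=1.0, x0=0.0):
    """Random-walk Metropolis-Hastings with a Gaussian proposal."""
    x, samples = x0, []
    for _ in range(n_samples):
        proposal = x + random.gauss(0, step)
        # Accept with probability min(1, p(proposal) / p(x)).
        if math.log(random.random()) < log_density(proposal) - log_density(x):
            x = proposal
        samples.append(x)
    return samples

samples = metropolis_hastings(20000)
```

The paper's ant-inspired idea amounts to adding an external memory that penalizes already-visited regions; for sparse targets, the unmodified sampler above wastes many proposals re-exploring ground it has already covered.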
“Exploration in the Wild”, Schulz et al 2018
“Exploration in the wild” (2018-12-10):
Making good decisions requires people to appropriately explore their available options and generalize what they have learned. While computational models have successfully explained exploratory behavior in constrained laboratory tasks, it is unclear to what extent these models generalize to complex real-world choice problems. We investigate the factors guiding exploratory behavior in a data set consisting of 195,333 customers placing 1,613,967 orders from a large online food delivery service. We find important hallmarks of adaptive exploration and generalization, which we analyze using computational models. We find evidence for several theoretical predictions: (1) customers engage in uncertainty-directed exploration, (2) they adjust their level of exploration to the average restaurant quality in a city, and (3) they use feature-based generalization to guide exploration towards promising restaurants. Our results provide new evidence that people use sophisticated strategies to explore complex, real-world environments.
“An Introduction to Deep Reinforcement Learning”, Francois-Lavet et al 2018
“An Introduction to Deep Reinforcement Learning” (2018-11-30):
Deep reinforcement learning is the combination of reinforcement learning (RL) and deep learning. This field of research has been able to solve a wide range of complex decision-making tasks that were previously out of reach for a machine. Thus, deep RL opens up many new applications in domains such as healthcare, robotics, smart grids, finance, and many more. This manuscript provides an introduction to deep reinforcement learning models, algorithms and techniques. Particular focus is on the aspects related to generalization and how deep RL can be used for practical applications. We assume the reader is familiar with basic machine learning concepts.
“Exploration by Random Network Distillation”, Burda et al 2018
“Exploration by Random Network Distillation” (2018-10-30):
We introduce an exploration bonus for deep reinforcement learning methods that is easy to implement and adds minimal overhead to the computation performed. The bonus is the error of a neural network predicting features of the observations given by a fixed randomly initialized neural network. We also introduce a method to flexibly combine intrinsic and extrinsic rewards. We find that the random network distillation (RND) bonus combined with this increased flexibility enables significant progress on several hard exploration Atari games. In particular we establish state of the art performance on Montezuma’s Revenge, a game famously difficult for deep reinforcement learning methods. To the best of our knowledge, this is the first method that achieves better than average human performance on this game without using demonstrations or having access to the underlying state of the game, and occasionally completes the first level.
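The RND bonus can be sketched with linear "networks" standing in for the paper's deep networks (an illustrative simplification; the dimensions and learning rate are assumptions). The bonus for a state is the predictor's error at matching a fixed, randomly initialized target network, so it decays for states the agent keeps visiting while remaining large for novel ones:

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, feat_dim, lr = 8, 4, 0.01

W_target = rng.normal(size=(obs_dim, feat_dim))   # fixed, never trained
W_pred = rng.normal(size=(obs_dim, feat_dim))     # trained on visited states

def rnd_bonus(obs):
    """Intrinsic reward: predictor's squared error on the target's features."""
    err = obs @ W_pred - obs @ W_target
    return float((err ** 2).mean())

def rnd_update(obs):
    """One SGD step fitting the predictor to the frozen target."""
    global W_pred
    err = obs @ W_pred - obs @ W_target
    W_pred -= lr * np.outer(obs, err)   # gradient of 0.5 * ||err||^2

familiar = rng.normal(size=obs_dim)
novel = rng.normal(size=obs_dim)

before = rnd_bonus(familiar)
for _ in range(200):          # the agent keeps revisiting the same state...
    rnd_update(familiar)
after = rnd_bonus(familiar)   # ...so its bonus decays toward zero, while
                              # unvisited states keep a large bonus.
```

Using a random network as the prediction target sidesteps the noisy-TV problem of forward-model curiosity: the target is deterministic, so irreducible environment stochasticity cannot keep the bonus permanently high.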
“Computational Mechanisms of Curiosity and Goal-directed Exploration”, Schwartenbeck et al 2018
“Computational mechanisms of curiosity and goal-directed exploration” (2018-09-07):
Successful behaviour depends on the right balance between maximising reward and soliciting information about the world. Here, we show how different types of information-gain emerge when casting behaviour as surprise minimisation. We present two distinct mechanisms for goal-directed exploration that express separable profiles of active sampling to reduce uncertainty. ‘Hidden state’ exploration motivates agents to sample unambiguous observations to accurately infer the (hidden) state of the world. Conversely, ‘model parameter’ exploration compels agents to sample outcomes associated with high uncertainty, if they are informative for their representation of the task structure. We illustrate the emergence of these types of information-gain, termed active inference and active learning, and show how these forms of exploration induce distinct patterns of ‘Bayes-optimal’ behaviour. Our findings provide a computational framework to understand how distinct levels of uncertainty induce different modes of information-gain in decision-making.
“Large-Scale Study of Curiosity-Driven Learning”, Burda et al 2018
“Large-Scale Study of Curiosity-Driven Learning” (2018-08-13):
Reinforcement learning algorithms rely on carefully engineering environment rewards that are extrinsic to the agent. However, annotating each environment with hand-designed, dense rewards is not scalable, motivating the need for developing reward functions that are intrinsic to the agent. Curiosity is a type of intrinsic reward function which uses prediction error as reward signal. In this paper: (a) We perform the first large-scale study of purely curiosity-driven learning, i.e. without any extrinsic rewards, across 54 standard benchmark environments, including the Atari game suite. Our results show surprisingly good performance, and a high degree of alignment between the intrinsic curiosity objective and the hand-designed extrinsic rewards of many game environments. (b) We investigate the effect of using different feature spaces for computing prediction error and show that random features are sufficient for many popular RL game benchmarks, but learned features appear to generalize better (e.g. to novel game levels in Super Mario Bros.). (c) We demonstrate limitations of the prediction-based rewards in stochastic setups. Gameplay videos and code are at https://pathak22.github.io/largescalecuriosity/
“Visual Reinforcement Learning With Imagined Goals”, Nair et al 2018
“Visual Reinforcement Learning with Imagined Goals” (2018-07-12):
For an autonomous agent to fulfill a wide range of user-specified goals at test time, it must be able to learn broadly applicable and general-purpose skill repertoires. Furthermore, to provide the requisite level of generality, these skills must handle raw sensory input such as images. In this paper, we propose an algorithm that acquires such general-purpose skills by combining unsupervised representation learning and reinforcement learning of goal-conditioned policies. Since the particular goals that might be required at test time are not known in advance, the agent performs a self-supervised “practice” phase where it imagines goals and attempts to achieve them. We learn a visual representation with three distinct purposes: sampling goals for self-supervised practice, providing a structured transformation of raw sensory inputs, and computing a reward signal for goal reaching. We also propose a retroactive goal relabeling scheme to further improve the sample-efficiency of our method. Our off-policy algorithm is efficient enough to learn policies that operate on raw image observations and goals for a real-world robotic system, and substantially outperforms prior techniques.
“Is Q-learning Provably Efficient?”, Jin et al 2018
“Is Q-learning Provably Efficient?” (2018-07-10):
Model-free reinforcement learning (RL) algorithms, such as Q-learning, directly parameterize and update value functions or policies without explicitly modeling the environment. They are typically simpler, more flexible to use, and thus more prevalent in modern deep RL than model-based approaches. However, empirical work has suggested that model-free algorithms may require more samples to learn [Deisenroth & Rasmussen 2011, Schulman et al 2015]. The theoretical question of “whether model-free algorithms can be made sample efficient” is one of the most fundamental questions in RL, and remains unsolved even in the basic scenario with finitely many states and actions.
We prove that, in an episodic MDP setting, Q-learning with UCB exploration achieves regret 𝒪(√(H^{3}SAT)), where S and A are the numbers of states and actions, H is the number of steps per episode, and T is the total number of steps. This sample efficiency matches the optimal regret that can be achieved by any model-based approach, up to a single √H factor. To the best of our knowledge, this is the first analysis in the model-free setting that establishes √T regret without requiring access to a “simulator.”
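The count-based optimism that drives UCB exploration is easiest to see in the bandit special case (H = 1); the two arms and all constants below are illustrative assumptions rather than the paper's episodic construction. Each arm's value estimate is inflated by a bonus that shrinks with its visit count, so under-explored arms get tried without any epsilon-greedy randomness:

```python
import math
import random

random.seed(0)

means = [0.3, 0.6]      # unknown Bernoulli success probabilities
Q = [0.0, 0.0]          # empirical value estimates
N = [0, 0]              # visit counts
c = 1.0                 # exploration-bonus coefficient

for t in range(1, 3001):
    # Optimism in the face of uncertainty: estimate plus a count-based bonus.
    ucb = [Q[a] + (c * math.sqrt(math.log(t) / N[a]) if N[a] else float("inf"))
           for a in range(2)]
    a = max(range(2), key=lambda i: ucb[i])
    reward = float(random.random() < means[a])
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]      # running-mean update
```

In the paper's episodic setting the same idea is applied per (step, state, action) triple, with the bonus scaled by the horizon H, which is where the extra √H factors in the regret bound come from.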
“Improving Width-based Planning With Compact Policies”, Junyent et al 2018
“Improving width-based planning with compact policies” (2018-06-15):
Optimal action selection in decision problems characterized by sparse, delayed rewards is still an open challenge. For these problems, current deep reinforcement learning methods require enormous amounts of data to learn controllers that reach human-level performance. In this work, we propose a method that interleaves planning and learning to address this issue. The planning step hinges on the Iterated Width (IW) planner, a state-of-the-art planner that makes explicit use of the state representation to perform structured exploration. IW is able to scale up to problems independently of the size of the state space. From the state-actions visited by IW, the learning step estimates a compact policy, which in turn is used to guide the planning step. The type of exploration used by our method is radically different than the standard random exploration used in RL. We evaluate our method in simple problems where we show it to have superior performance to the state-of-the-art reinforcement learning algorithms A2C and Alpha Zero. Finally, we present preliminary results in a subset of the Atari games suite.
“Construction of Arbitrarily Strong Amplifiers of Natural Selection Using Evolutionary Graph Theory”, Pavlogiannis et al 2018
“Construction of arbitrarily strong amplifiers of natural selection using evolutionary graph theory” (2018-06-14; backlinks):
[WP] Because of the intrinsic randomness of the evolutionary process, a mutant with a fitness advantage has some chance to be selected but no certainty. Any experiment that searches for advantageous mutants will lose many of them due to random drift. It is therefore of great interest to find population structures that improve the odds of advantageous mutants. Such structures are called amplifiers of natural selection: they increase the probability that advantageous mutants are selected. Arbitrarily strong amplifiers guarantee the selection of advantageous mutants, even for very small fitness advantage. Despite intensive research over the past decade, arbitrarily strong amplifiers have remained rare. Here we show how to construct a large variety of them. Our amplifiers are so simple that they could be useful in biotechnology, when optimizing biological molecules, or as a diagnostic tool, when searching for faster dividing cells or viruses. They could also occur in natural population structures.
In the evolutionary process, mutation generates new variants, while selection chooses between mutants that have different reproductive rates. Any new mutant is initially present at very low frequency and can easily be eliminated by random drift. The probability that the lineage of a new mutant eventually takes over the entire population is called the fixation probability. It is a key quantity of evolutionary dynamics and characterizes the rate of evolution.
…In this work we resolve several open questions regarding strong amplification under uniform and temperature initialization. First, we show that there exists a vast variety of graphs with selfloops and weighted edges that are arbitrarily strong amplifiers for both uniform and temperature initialization. Moreover, many of those strong amplifiers are structurally simple, therefore they might be realizable in natural or laboratory setting. Second, we show that both selfloops and weighted edges are key features of strong amplification. Namely, we show that without either selfloops or weighted edges, no graph is a strong amplifier under temperature initialization, and no simple graph is a strong amplifier under uniform initialization.
…In general, the fixation probability depends not only on the graph, but also on the initial placement of the invading mutants…For a wide class of population structures^{17}, which include symmetric ones^{28}, the fixation probability is the same as for the wellmixed population.
… A population structure is an arbitrarily strong amplifier (for brevity hereafter also called “strong amplifier”) if it ensures a fixation probability arbitrarily close to one for any advantageous mutant, r > 1. Strong amplifiers can only exist in the limit of large population size.
Numerical studies^{30} suggest that for spontaneously arising mutants and small population size, many unweighted graphs amplify for some values of r. But for a large population size, randomly constructed, unweighted graphs do not amplify^{31}. Moreover, proven amplifiers for all values of r are rare. For spontaneously arising mutants (uniform initialization): (1) the Star has fixation probability of ~1 − 1⁄r^{2} in the limit of large N, and is thus an amplifier^{17, 32, 33}; (2) the Superstar (introduced in ref. 17, see also ref. 34) and the Incubator (introduced in refs. 35, 36), which are graphs with unbounded degree, are strong amplifiers.
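The well-mixed baseline that amplifiers are measured against has the standard closed-form fixation probability ρ = (1 − 1/r)/(1 − 1/r^N), which a quick Moran-process simulation can check. Population size, fitness, and trial count below are arbitrary choices for illustration:

```python
import random

random.seed(0)

def moran_fixation(r, N, trials):
    """Empirical fixation probability of a single mutant of fitness r
    among N-1 residents of fitness 1 in a well-mixed Moran process."""
    fixed = 0
    for _ in range(trials):
        mutants = 1
        while 0 < mutants < N:
            # Birth: the reproducer is chosen proportional to fitness.
            birth_is_mutant = (random.random()
                               < mutants * r / (mutants * r + N - mutants))
            # Death: a uniformly random individual is replaced.
            death_is_mutant = random.random() < mutants / N
            mutants += int(birth_is_mutant) - int(death_is_mutant)
        fixed += (mutants == N)
    return fixed / trials

r, N = 2.0, 10
theory = (1 - 1 / r) / (1 - r ** -N)       # well-mixed fixation probability
estimate = moran_fixation(r, N, trials=2000)
```

An amplifier is a graph structure whose fixation probability exceeds this well-mixed ρ for the same r; a strong amplifier pushes it arbitrarily close to 1.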
…Intuitively, the weight assignment creates a sense of global flow in the branches, directed toward the hub. This guarantees that the first 2 steps happen with high probability. For the third step, we show that once the mutants fixate in the hub, they are extremely likely to resist all resident invasion attempts and will instead invade and take over the branches one by one, thereby fixating on the whole graph. For a more detailed description, see the “Methods” section “Construction of strong amplifiers”.
Necessary conditions for amplification: Our main result shows that a large variety of population structures can provide strong amplification. A natural followup question concerns the features of population structures under which amplification can emerge. We complement our main result by proving that both weights and selfloops are essential for strong amplification. Thus, we establish a strong dichotomy. Without either weights or selfloops, no graph can be a strong amplifier under temperature initialization, and no simple graph can be a strong amplifier under uniform initialization. On the other hand, if we allow both weights and selfloops, strong amplification is ubiquitous.
…Some naturally occurring population structures could be amplifiers of natural selection. For example, the germinal centers of the immune system might constitute amplifiers for the affinity maturation process of adaptive immunity^{46}. Habitats of animals that are divided into multiple islands with a central breeding location could potentially also act as amplifiers of selection. Our theory helps to identify those structures in natural settings.
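[The underlying dynamics here are the Moran Birth-death process on a graph, which is simple to simulate directly. Below is a minimal Python sketch (my own illustrative code, not the paper's; all names are mine) that estimates the fixation probability of a single advantageous mutant under uniform initialization; on a complete graph it should approximate the well-mixed result (1 − 1⁄r)⁄(1 − 1⁄r^N):]

```python
import random

def moran_fixation(adj, r=2.0, trials=2000, seed=0):
    """Estimate the fixation probability of a single mutant with relative
    fitness r under Birth-death updating on a weighted graph.
    adj[i][j] = weight of edge i -> j (0 if absent; adj[i][i] > 0 is a self-loop)."""
    rng = random.Random(seed)
    n = len(adj)
    fixed = 0
    for _ in range(trials):
        # uniform initialization: one mutant placed at a random node
        mutant = {rng.randrange(n)}
        while 0 < len(mutant) < n:
            # pick a reproducing individual proportional to fitness
            fitness = [r if i in mutant else 1.0 for i in range(n)]
            i = rng.choices(range(n), weights=fitness)[0]
            # its offspring replaces a neighbour chosen by outgoing edge weight
            j = rng.choices(range(n), weights=adj[i])[0]
            if i in mutant:
                mutant.add(j)
            else:
                mutant.discard(j)
        if len(mutant) == n:
            fixed += 1
    return fixed / trials
```

[For example, on the complete graph K₄ with r = 2 the estimate should land near the analytic value ≈0.533; substituting a weighted graph with self-loops lets one probe amplification numerically.]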
“Re-evaluating Evaluation”, Balduzzi et al 2018
“Re-evaluating Evaluation”, (2018-06-07):
Progress in machine learning is measured by careful evaluation on problems of outstanding common interest. However, the proliferation of benchmark suites and environments, adversarial attacks, and other complications has diluted the basic evaluation model by overwhelming researchers with choices. Deliberate or accidental cherry-picking is increasingly likely, and designing well-balanced evaluation suites requires increasing effort. In this paper we take a step back and propose Nash averaging. The approach builds on a detailed analysis of the algebraic structure of evaluation in two basic scenarios: agent-vs-agent and agent-vs-task. The key strength of Nash averaging is that it automatically adapts to redundancies in evaluation data, so that results are not biased by the incorporation of easy tasks or weak agents. Nash averaging thus encourages maximally inclusive evaluation—since there is no harm (computational cost aside) from including all available tasks and agents.
“Observe and Look Further: Achieving Consistent Performance on Atari”, Pohlen et al 2018
“Observe and Look Further: Achieving Consistent Performance on Atari”, (2018-05-29):
Despite substantial advances in the field of deep Reinforcement Learning (RL), today’s algorithms still fail to learn human-level policies consistently over a set of diverse tasks such as Atari 2600 games. We identify three key challenges that any algorithm needs to master in order to perform well on all games: processing diverse reward distributions, reasoning over long time horizons, and exploring efficiently.
In this paper, we propose an algorithm that addresses each of these challenges and is able to learn humanlevel policies on nearly all Atari games. A new transformed Bellman operator allows our algorithm to process rewards of varying densities and scales; an auxiliary temporal consistency loss allows us to train stably using a discount factor of γ = 0.999 (instead of γ = 0.99) extending the effective planning horizon by an order of magnitude; and we ease the exploration problem by using human demonstrations that guide the agent towards rewarding states.
When tested on a set of 42 Atari games, our algorithm exceeds the performance of an average human on 40 games using a common set of hyperparameters. Furthermore, it is the first deep RL algorithm to solve the first level of Montezuma’s Revenge.
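[The transformed Bellman operator rests on an invertible squashing function h applied to the TD target. A minimal Python sketch (the function h and ε = 10⁻² follow the paper; the helper names are my own):]

```python
import math

EPS = 1e-2  # epsilon used in the paper's transform

def h(z):
    """Squash large returns: sign(z)·(sqrt(|z|+1) − 1) + eps·z."""
    return math.copysign(1.0, z) * (math.sqrt(abs(z) + 1.0) - 1.0) + EPS * z

def h_inv(z):
    """Closed-form inverse of h (solve the quadratic in sqrt(|x|+1))."""
    return math.copysign(1.0, z) * (
        ((math.sqrt(1.0 + 4.0 * EPS * (abs(z) + 1.0 + EPS)) - 1.0)
         / (2.0 * EPS)) ** 2 - 1.0)

def transformed_target(reward, gamma, max_next_q):
    """TD target under the transformed operator: h(r + gamma·h⁻¹(max_a Q(s',a)))."""
    return h(reward + gamma * h_inv(max_next_q))
```

[Because h compresses magnitudes roughly like a square root, rewards of very different scales (h(1) ≈ 0.42, h(1000) ≈ 40.6) land in a range a single network can regress on without per-game reward clipping.]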
“Playing Hard Exploration Games by Watching YouTube”, Aytar et al 2018
“Playing hard exploration games by watching YouTube”, (2018-05-29):
Deep reinforcement learning methods traditionally struggle with tasks where environment rewards are particularly sparse. One successful method of guiding exploration in these domains is to imitate trajectories provided by a human demonstrator. However, these demonstrations are typically collected under artificial conditions, i.e. with access to the agent’s exact environment setup and the demonstrator’s action and reward trajectories. Here we propose a two-stage method that overcomes these limitations by relying on noisy, unaligned footage without access to such data. First, we learn to map unaligned videos from multiple sources to a common representation using self-supervised objectives constructed over both time and modality (i.e. vision and sound). Second, we embed a single YouTube video in this representation to construct a reward function that encourages an agent to imitate human gameplay. This method of one-shot imitation allows our agent to convincingly exceed human-level performance on the infamously hard exploration games Montezuma’s Revenge, Pitfall! and Private Eye for the first time, even if the agent is not presented with any environment rewards.
“Some Considerations on Learning to Explore via Meta-Reinforcement Learning”, Stadie et al 2018
“Some Considerations on Learning to Explore via Meta-Reinforcement Learning”, (2018-03-03):
We consider the problem of exploration in meta-reinforcement learning. Two new meta-reinforcement learning algorithms are suggested: E-MAML and E-RL^{2}. Results are presented on a novel environment we call ‘Krazy World’ and a set of maze environments. We show E-MAML and E-RL^{2} deliver better performance on tasks where exploration is important.
“Learning to Search With MCTSnets”, Guez et al 2018
“Learning to Search with MCTSnets”, (2018-02-13):
Planning problems are among the most important and well-studied problems in artificial intelligence. They are most typically solved by tree search algorithms that simulate ahead into the future, evaluate future states, and back up those evaluations to the root of a search tree. Among these algorithms, Monte-Carlo tree search (MCTS) is one of the most general, powerful and widely used. A typical implementation of MCTS uses cleverly designed rules, optimized to the particular characteristics of the domain. These rules control where the simulation traverses, what to evaluate in the states that are reached, and how to back up those evaluations. In this paper we instead learn where, what and how to search. Our architecture, which we call an MCTSnet, incorporates simulation-based search inside a neural network, by expanding, evaluating and backing up a vector embedding. The parameters of the network are trained end-to-end using gradient-based optimisation. When applied to small searches in the well-known planning problem Sokoban, the learned search algorithm significantly outperformed MCTS baselines.
“Learning and Querying Fast Generative Models for Reinforcement Learning”, Buesing et al 2018
“Learning and Querying Fast Generative Models for Reinforcement Learning”, (2018-02-08):
A key challenge in model-based reinforcement learning (RL) is to synthesize computationally efficient and accurate environment models. We show that carefully designed generative models that learn and operate on compact state representations, so-called state-space models, substantially reduce the computational costs for predicting outcomes of sequences of actions. Extensive experiments establish that state-space models accurately capture the dynamics of Atari games from the Arcade Learning Environment from raw pixels. The computational speedup of state-space models while maintaining high accuracy makes their application in RL feasible: We demonstrate that agents which query these models for decision making outperform strong model-free baselines on the game MS PACMAN, demonstrating the potential of using learned environment models for planning.
“Improving Exploration in Evolution Strategies for Deep Reinforcement Learning via a Population of Novelty-Seeking Agents”, Conti et al 2017
“Improving Exploration in Evolution Strategies for Deep Reinforcement Learning via a Population of Novelty-Seeking Agents”, (2017-12-18):
Evolution strategies (ES) are a family of black-box optimization algorithms able to train deep neural networks roughly as well as Q-learning and policy gradient methods on challenging deep reinforcement learning (RL) problems, but are much faster (e.g. hours vs. days) because they parallelize better. However, many RL problems require directed exploration because they have reward functions that are sparse or deceptive (i.e. contain local optima), and it is unknown how to encourage such exploration with ES. Here we show that algorithms that have been invented to promote directed exploration in small-scale evolved neural networks via populations of exploring agents, specifically novelty search (NS) and quality diversity (QD) algorithms, can be hybridized with ES to improve its performance on sparse or deceptive deep RL tasks, while retaining scalability. Our experiments confirm that the resultant new algorithms, NS-ES and two QD algorithms, NSR-ES and NSRA-ES, avoid local optima encountered by ES to achieve higher performance on Atari and simulated robots learning to walk around a deceptive trap. This paper thus introduces a family of fast, scalable algorithms for reinforcement learning that are capable of directed exploration. It also adds this new family of exploration algorithms to the RL toolbox and raises the interesting possibility that analogous algorithms with multiple simultaneous paths of exploration might also combine well with existing RL algorithms outside ES.
“Emergent Complexity via Multi-Agent Competition”, Bansal et al 2017
“Emergent Complexity via Multi-Agent Competition”, (2017-10-10):
Reinforcement learning algorithms can train agents that solve problems in complex, interesting environments. Normally, the complexity of the trained agent is closely related to the complexity of the environment. This suggests that a highly capable agent requires a complex environment for training. In this paper, we point out that a competitive multi-agent environment trained with self-play can produce behaviors that are far more complex than the environment itself. We also point out that such environments come with a natural curriculum, because for any skill level, an environment full of agents of this level will have the right level of difficulty. This work introduces several competitive multi-agent environments where agents compete in a 3D world with simulated physics. The trained agents learn a wide variety of complex and interesting skills, even though the environments themselves are relatively simple. The skills include behaviors such as running, blocking, ducking, tackling, fooling opponents, kicking, and defending using both arms and legs. A highlight of the learned behaviors can be found here: https://goo.gl/eR7fbX
“An Analysis of the Value of Information When Exploring Stochastic, Discrete Multi-Armed Bandits”, Sledge & Principe 2017
“An Analysis of the Value of Information when Exploring Stochastic, Discrete Multi-Armed Bandits”, (2017-10-08):
In this paper, we propose an information-theoretic exploration strategy for stochastic, discrete multi-armed bandits that achieves optimal regret. Our strategy is based on the value of information criterion. This criterion measures the trade-off between policy information and obtainable rewards. High amounts of policy information are associated with exploration-dominant searches of the space and yield high rewards. Low amounts of policy information favor the exploitation of existing knowledge. Information, in this criterion, is quantified by a parameter that can be varied during search. We demonstrate that a simulated-annealing-like update of this parameter, with a sufficiently fast cooling schedule, leads to an optimal regret that is logarithmic with respect to the number of episodes.
“The Uncertainty Bellman Equation and Exploration”, O'Donoghue et al 2017
“The Uncertainty Bellman Equation and Exploration”, (2017-09-15):
We consider the exploration/exploitation problem in reinforcement learning. For exploitation, it is well known that the Bellman equation connects the value at any timestep to the expected value at subsequent timesteps. In this paper we consider a similar uncertainty Bellman equation (UBE), which connects the uncertainty at any timestep to the expected uncertainties at subsequent timesteps, thereby extending the potential exploratory benefit of a policy beyond individual timesteps. We prove that the unique fixed point of the UBE yields an upper bound on the variance of the posterior distribution of the Q-values induced by any policy. This bound can be much tighter than traditional count-based bonuses that compound standard deviation rather than variance. Importantly, and unlike several existing approaches to optimism, this method scales naturally to large systems with complex generalization. Substituting our UBE-exploration strategy for ε-greedy improves DQN performance on 51 out of 57 games in the Atari suite.
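[In tabular form the UBE can be solved by the same fixed-point iteration as the ordinary Bellman equation, but propagating local uncertainty with γ² instead of rewards with γ. A small Python sketch (my own illustration, not the paper's code; `nu` is the local uncertainty estimate per state-action pair):]

```python
def solve_ube(nu, P, pi, gamma=0.99, iters=1000):
    """Fixed-point iteration for the uncertainty Bellman equation:
    u(s,a) = nu(s,a) + gamma^2 * sum_{s'} P[s][a][s'] * sum_{a'} pi[s'][a'] * u(s',a').
    nu[s][a]: local uncertainty; P[s][a][s']: transition probabilities;
    pi[s][a]: policy probabilities."""
    ns, na = len(nu), len(nu[0])
    u = [[0.0] * na for _ in range(ns)]
    for _ in range(iters):
        u = [[nu[s][a] + gamma ** 2 * sum(
                  P[s][a][s2] * sum(pi[s2][a2] * u[s2][a2] for a2 in range(na))
                  for s2 in range(ns))
              for a in range(na)] for s in range(ns)]
    return u
```

[For a single self-looping state-action pair the fixed point is ν⁄(1 − γ²), illustrating how local uncertainty compounds over the planning horizon.]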
“Imagination-Augmented Agents for Deep Reinforcement Learning”, Weber et al 2017
“Imagination-Augmented Agents for Deep Reinforcement Learning”, (2017-07-19):
We introduce Imagination-Augmented Agents (I2As), a novel architecture for deep reinforcement learning combining model-free and model-based aspects. In contrast to most existing model-based reinforcement learning and planning methods, which prescribe how a model should be used to arrive at a policy, I2As learn to interpret predictions from a learned environment model to construct implicit plans in arbitrary ways, by using the predictions as additional context in deep policy networks. I2As show improved data efficiency, performance, and robustness to model misspecification compared to several baselines.
“Distral: Robust Multitask Reinforcement Learning”, Teh et al 2017
“Distral: Robust Multitask Reinforcement Learning”, (2017-07-13):
Most deep reinforcement learning algorithms are data inefficient in complex and rich environments, limiting their applicability to many scenarios. One direction for improving data efficiency is multitask learning with shared neural network parameters, where efficiency may be improved through transfer across related tasks. In practice, however, this is not usually observed, because gradients from different tasks can interfere negatively, making learning unstable and sometimes even less data efficient. Another issue is the different reward schemes between tasks, which can easily lead to one task dominating the learning of a shared model. We propose a new approach for joint training of multiple tasks, which we refer to as Distral (Distill & transfer learning). Instead of sharing parameters between the different workers, we propose to share a “distilled” policy that captures common behaviour across tasks. Each worker is trained to solve its own task while constrained to stay close to the shared policy, while the shared policy is trained by distillation to be the centroid of all task policies. Both aspects of the learning process are derived by optimizing a joint objective function. We show that our approach supports efficient transfer on complex 3D environments, outperforming several related methods. Moreover, the proposed learning process is more robust and more stable—attributes that are critical in deep reinforcement learning.
“The Intentional Unintentional Agent: Learning to Solve Many Continuous Control Tasks Simultaneously”, Cabi et al 2017
“The Intentional Unintentional Agent: Learning to Solve Many Continuous Control Tasks Simultaneously”, (2017-07-11):
This paper introduces the Intentional Unintentional (IU) agent. This agent endows the deep deterministic policy gradients (DDPG) agent for continuous control with the ability to solve several tasks simultaneously. Learning to solve many tasks simultaneously has been a long-standing, core goal of artificial intelligence, inspired by infant development and motivated by the desire to build flexible robot manipulators capable of many diverse behaviours. We show that the IU agent not only learns to solve many tasks simultaneously but it also learns faster than agents that target a single task at a time. In some cases, where the single-task DDPG method completely fails, the IU agent successfully solves the task. To demonstrate this, we build a playroom environment using the MuJoCo physics engine, and introduce a grounded formal language to automatically generate tasks.
“A Tutorial on Thompson Sampling”, Russo et al 2017
“A Tutorial on Thompson Sampling”, (2017-07-07):
Thompson sampling is an algorithm for online decision problems where actions are taken sequentially in a manner that must balance between exploiting what is known to maximize immediate performance and investing to accumulate new information that may improve future performance. The algorithm addresses a broad range of problems in a computationally efficient manner and is therefore enjoying wide use. This tutorial covers the algorithm and its application, illustrating concepts through a range of examples, including Bernoulli bandit problems, shortest path problems, product recommendation, assortment, active learning with neural networks, and reinforcement learning in Markov decision processes. Most of these problems involve complex information structures, where information revealed by taking an action informs beliefs about other actions. We will also discuss when and why Thompson sampling is or is not effective and relations to alternative algorithms.
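[For the Bernoulli bandit, the tutorial's simplest setting, Thompson sampling needs only a Beta posterior per arm: sample a plausible mean reward from each posterior, act greedily on the samples, and update the chosen arm's posterior. A minimal self-contained Python sketch (illustrative; names my own):]

```python
import random

def thompson_bernoulli(true_probs, horizon=2000, seed=0):
    """Thompson sampling for a Bernoulli bandit with Beta(1,1) priors.
    Returns the pull counts per arm after `horizon` rounds."""
    rng = random.Random(seed)
    k = len(true_probs)
    alpha = [1] * k  # prior + observed successes
    beta = [1] * k   # prior + observed failures
    pulls = [0] * k
    for _ in range(horizon):
        # sample a mean-reward estimate from each posterior, act greedily on the samples
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_probs[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls
```

[Because sampling from the posterior naturally randomizes in proportion to the probability that an arm is optimal, exploration fades automatically as the posteriors concentrate.]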
“Noisy Networks for Exploration”, Fortunato et al 2017
“Noisy Networks for Exploration”, (2017-06-30):
We introduce NoisyNet, a deep reinforcement learning agent with parametric noise added to its weights, and show that the induced stochasticity of the agent’s policy can be used to aid efficient exploration.
The parameters of the noise are learned with gradient descent along with the remaining network weights. NoisyNet is straightforward to implement and adds little computational overhead.
We find that replacing the conventional exploration heuristics for A3C, DQN and dueling agents (entropy reward and ε-greedy respectively) with NoisyNet yields substantially higher scores for a wide range of Atari games, in some cases advancing the agent from sub- to super-human performance.
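[The core of NoisyNet is a linear layer whose weights are perturbed by learned, factored Gaussian noise. The plain-Python sketch below shows the mechanics only (illustrative: in the paper σ is a learnable per-parameter tensor trained by gradient descent, whereas here it is held at its initial value; the noise-shaping function f(x) = sign(x)·√|x| follows the paper):]

```python
import math
import random

class NoisyLinear:
    """Linear layer with factored Gaussian weight noise (NoisyNet-style sketch).
    Effective weights are mu + sigma * eps; resampling eps perturbs the policy."""
    def __init__(self, n_in, n_out, sigma0=0.5, seed=0):
        self.rng = random.Random(seed)
        self.n_in, self.n_out = n_in, n_out
        bound = 1.0 / math.sqrt(n_in)
        self.mu_w = [[self.rng.uniform(-bound, bound) for _ in range(n_in)]
                     for _ in range(n_out)]
        self.mu_b = [self.rng.uniform(-bound, bound) for _ in range(n_out)]
        self.sigma = sigma0 / math.sqrt(n_in)  # fixed here; learnable in the paper
        self.resample()

    def _f(self, x):
        # noise-shaping function from the paper: sign(x)·sqrt(|x|)
        return math.copysign(math.sqrt(abs(x)), x)

    def resample(self):
        # factored noise: one vector per input unit, one per output unit
        eps_in = [self._f(self.rng.gauss(0, 1)) for _ in range(self.n_in)]
        eps_out = [self._f(self.rng.gauss(0, 1)) for _ in range(self.n_out)]
        self.eps_w = [[eo * ei for ei in eps_in] for eo in eps_out]
        self.eps_b = eps_out

    def forward(self, x):
        return [sum((self.mu_w[o][i] + self.sigma * self.eps_w[o][i]) * x[i]
                    for i in range(self.n_in))
                + self.mu_b[o] + self.sigma * self.eps_b[o]
                for o in range(self.n_out)]
```

[Calling `resample()` between episodes yields a consistent but randomly perturbed policy, replacing ε-greedy dithering with state-consistent exploration.]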
“Scalable Generalized Linear Bandits: Online Computation and Hashing”, Jun et al 2017
“Scalable Generalized Linear Bandits: Online Computation and Hashing”, (2017-06-01):
Generalized Linear Bandits (GLBs), a natural extension of the stochastic linear bandits, have been popular and successful in recent years. However, existing GLBs scale poorly with the number of rounds and the number of arms, limiting their utility in practice. This paper proposes new, scalable solutions to the GLB problem in two respects.
First, unlike existing GLBs, whose per-time-step space and time complexity grow at least linearly with time t, we propose a new algorithm that performs online computations to enjoy a constant space and time complexity. At its heart is a novel Generalized Linear extension of the Online-to-confidence-set Conversion (GLOC method) that takes any online learning algorithm and turns it into a GLB algorithm. As a special case, we apply GLOC to the online Newton step algorithm, which results in a low-regret GLB algorithm with much lower time and memory complexity than prior work. Second, for the case where the number N of arms is very large, we propose new algorithms in which each next arm is selected via an inner product search. Such methods can be implemented via hashing algorithms (i.e., “hash-amenable”) and result in a time complexity sublinear in N. While a Thompson sampling extension of GLOC is hash-amenable, its regret bound for d-dimensional arm sets scales with d^{3⁄2}, whereas GLOC’s regret bound scales with d. Towards closing this gap, we propose a new hash-amenable algorithm whose regret bound scales with d^{5⁄4}. Finally, we propose a fast approximate hash-key computation (inner product) with a better accuracy than the state-of-the-art, which can be of independent interest.
We conclude the paper with preliminary experimental results confirming the merits of our methods.
“Learning to Perform Physics Experiments via Deep Reinforcement Learning”, Denil et al 2016
“Learning to Perform Physics Experiments via Deep Reinforcement Learning”, (2016-11-06):
When encountering novel objects, humans are able to infer a wide range of physical properties such as mass, friction and deformability by interacting with them in a goal-driven way. This process of active interaction is in the same spirit as a scientist performing experiments to discover hidden facts. Recent advances in artificial intelligence have yielded machines that can achieve superhuman performance in Go, Atari, natural language processing, and complex control problems; however, it is not clear that these systems can rival the scientific intuition of even a young child. In this work we introduce a basic set of tasks that require agents to estimate properties such as mass and cohesion of objects in an interactive simulated environment where they can manipulate the objects and observe the consequences. We found that state-of-the-art deep reinforcement learning methods can learn to perform the experiments necessary to discover such hidden properties. By systematically manipulating the problem difficulty and the cost incurred by the agent for performing experiments, we found that agents learn different strategies that balance the cost of gathering information against the cost of making mistakes in different situations.
“Bayesian Reinforcement Learning: A Survey”, Ghavamzadeh et al 2016
“Bayesian Reinforcement Learning: A Survey”, (2016-09-14; backlinks):
Bayesian methods for machine learning have been widely investigated, yielding principled methods for incorporating prior information into inference algorithms. In this survey, we provide an in-depth review of the role of Bayesian methods for the reinforcement learning (RL) paradigm. The major incentives for incorporating Bayesian reasoning in RL are: 1) it provides an elegant approach to action-selection (exploration/exploitation) as a function of the uncertainty in learning; and 2) it provides a machinery to incorporate prior knowledge into the algorithms. We first discuss models and methods for Bayesian inference in the simple single-step Bandit model. We then review the extensive recent literature on Bayesian methods for model-based RL, where prior information can be expressed on the parameters of the Markov model. We also present Bayesian methods for model-free RL, where priors are expressed over the value function or policy class. The objective of the paper is to provide a comprehensive survey on Bayesian RL algorithms and their theoretical and empirical properties.
“Unifying Count-Based Exploration and Intrinsic Motivation”, Bellemare et al 2016
“Unifying Count-Based Exploration and Intrinsic Motivation”, (2016-06-06):
We consider an agent’s uncertainty about its environment and the problem of generalizing this uncertainty across observations. Specifically, we focus on the problem of exploration in non-tabular reinforcement learning. Drawing inspiration from the intrinsic motivation literature, we use density models to measure uncertainty, and propose a novel algorithm for deriving a pseudo-count from an arbitrary density model. This technique enables us to generalize count-based exploration algorithms to the non-tabular case. We apply our ideas to Atari 2600 games, providing sensible pseudo-counts from raw pixels. We transform these pseudo-counts into intrinsic rewards and obtain significantly improved exploration in a number of hard games, including the infamously difficult Montezuma’s Revenge.
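[The pseudo-count is computed from a density model's probability of an observation before (ρ) and after (ρ′) that observation is incorporated, N̂(x) = ρ(x)(1 − ρ′(x))⁄(ρ′(x) − ρ(x)), with an intrinsic bonus of roughly β⁄√(N̂ + 0.01). A direct Python transcription (function names are mine; the formulas follow the paper):]

```python
def pseudo_count(rho, rho_prime):
    """Pseudo-count N_hat(x) from a density model's probability of x
    before (rho) and after (rho_prime) observing x; requires rho_prime > rho."""
    return rho * (1.0 - rho_prime) / (rho_prime - rho)

def exploration_bonus(rho, rho_prime, beta=0.05):
    """Count-based intrinsic reward: beta / sqrt(N_hat + 0.01)."""
    n_hat = pseudo_count(rho, rho_prime)
    return beta / (n_hat + 0.01) ** 0.5
```

[As a sanity check, if ρ is the empirical frequency N(x)⁄n and ρ′ = (N(x)+1)⁄(n+1), the pseudo-count recovers the true count N(x) exactly; with a learned density model it instead generalizes counts across similar observations.]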
“Deep Exploration via Bootstrapped DQN”, Osband et al 2016
“Deep Exploration via Bootstrapped DQN”, (2016-02-15):
Efficient exploration in complex environments remains a major challenge for reinforcement learning. We propose bootstrapped DQN, a simple algorithm that explores in a computationally and statistically efficient manner through use of randomized value functions. Unlike dithering strategies such as ε-greedy exploration, bootstrapped DQN carries out temporally-extended (or deep) exploration; this can lead to exponentially faster learning. We demonstrate these benefits in complex stochastic MDPs and in the large-scale Arcade Learning Environment. Bootstrapped DQN substantially improves learning times and performance across most Atari games.
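[The mechanics are easy to state in tabular form: maintain K value heads, commit to one sampled head for each whole episode (this is what makes the exploration temporally-extended), and gate each head's updates with an independent bootstrap mask. A toy Python sketch (my own illustration, not DeepMind's code):]

```python
import random

class BootstrappedQ:
    """Tabular sketch of bootstrapped DQN's mechanics: K value heads,
    one head drives behaviour per episode, bootstrap masks gate updates."""
    def __init__(self, n_states, n_actions, k=10, mask_prob=0.5, seed=0):
        self.rng = random.Random(seed)
        self.q = [[[0.0] * n_actions for _ in range(n_states)] for _ in range(k)]
        self.k, self.mask_prob = k, mask_prob
        self.head = 0

    def start_episode(self):
        # commit to one head for the whole episode -> deep exploration
        self.head = self.rng.randrange(self.k)

    def act(self, s):
        qs = self.q[self.head][s]
        return max(range(len(qs)), key=lambda a: qs[a])

    def update(self, s, a, r, s2, done, alpha=0.1, gamma=0.99):
        for h in range(self.k):
            if self.rng.random() < self.mask_prob:  # bootstrap mask
                target = r if done else r + gamma * max(self.q[h][s2])
                self.q[h][s][a] += alpha * (target - self.q[h][s][a])
```

[The masks keep the heads' estimates diverse, so sampling a head approximates sampling from a posterior over value functions, much as Thompson sampling does for bandits.]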
“Experimental Design for Partially Observed Markov Decision Processes”, Thorbergsson & Hooker 2012
“Experimental design for Partially Observed Markov Decision Processes”, (2012-09-18):
This paper deals with the question of how to most effectively conduct experiments in Partially Observed Markov Decision Processes so as to provide data that is most informative about a parameter of interest. Methods from Markov decision processes, especially dynamic programming, are introduced and then used in an algorithm to maximize a relevant Fisher information. The algorithm is then applied to two POMDP examples. The methods developed can also be applied to stochastic dynamical systems, by suitable discretization; we consequently show what control policies look like in the Morris-Lecar neuron model, and present simulation results. We discuss how parameter dependence within these methods can be dealt with by the use of priors, and develop tools to update control policies online. This is demonstrated in another stochastic dynamical system describing the growth dynamics of DNA template in a PCR model.
“Learning Is Planning: near Bayes-optimal Reinforcement Learning via Monte-Carlo Tree Search”, Asmuth & Littman 2012
“Learning is planning: near Bayes-optimal reinforcement learning via Monte-Carlo tree search”, (2012-02-14):
Bayes-optimal behavior, while well-defined, is often difficult to achieve. Recent advances in the use of Monte-Carlo tree search (MCTS) have shown that it is possible to act near-optimally in Markov Decision Processes (MDPs) with very large or infinite state spaces. Bayes-optimal behavior in an unknown MDP is equivalent to optimal behavior in the known belief-space MDP, although the size of this belief-space MDP grows exponentially with the amount of history retained, and is potentially infinite. We show how an agent can use one particular MCTS algorithm, Forward Search Sparse Sampling (FSSS), in an efficient way to act nearly Bayes-optimally for all but a polynomial number of steps, assuming that FSSS can be used to act efficiently in any possible underlying MDP.
“PILCO: A Model-Based and Data-Efficient Approach to Policy Search”, Deisenroth & Rasmussen 2011
2011deisenroth.pdf: “PILCO: A Model-Based and Data-Efficient Approach to Policy Search”, (2011-06-01; backlinks):
In this paper, we introduce PILCO, a practical, data-efficient model-based policy search method. PILCO reduces model bias, one of the key problems of model-based reinforcement learning, in a principled way.
By learning a probabilistic dynamics model and explicitly incorporating model uncertainty into long-term planning, PILCO can cope with very little data and facilitates learning from scratch in only a few trials. Policy evaluation is performed in closed form using state-of-the-art approximate inference. Furthermore, policy gradients are computed analytically for policy improvement.
We report unprecedented learning efficiency on challenging and highdimensional control tasks.
[Remarkably, PILCO can learn the standard “Cartpole” task within just a few trials by carefully building a Bayesian Gaussian process model and picking the maximally-informative experiments to run. Cartpole is quite difficult for a human, incidentally; there’s an installation of one in the SF Exploratorium, and I just had to try it out once I recognized it. (My sample-efficiency was not better than PILCO’s.)]
“Evolution Strategy: Nature's Way of Optimization”, Rechenberg 1989
1989rechenberg.pdf: “Evolution Strategy: Nature's Way of Optimization”, (1989):
Biological Evolution has done development work on animated matter for more than 3 billion years. The method used in this process is the equivalent of an astute optimization procedure which is provable by the theory of Evolution Strategy. The result is that biological structures and processes have been optimized in the grand experiment of evolution from which a logical conclusion can be drawn:
• Evolution has optimized biological processes.
• Evolution in itself is a biological process.
• Ergo: Evolution must have optimized itself.
The evolution of evolution is a fact that is not only important for the momentary contribution of an individual to the survival of the species. In the course of several generations better genetic strategies are being selected and improved as well, better in the sense that they promote a faster environmental accommodation. The strategy of evolution thus becomes an optimization procedure worthy of imitation by engineers. Elements of Evolution Strategy are such mechanisms as gene mutation, chromosome mutation, recombination, selection, annidation, dominance, recession, etc. The theory of Evolution Strategy led to the discovery of the so-called “Evolution Window”. The meaning of this term is that changes (mutational jumps) lead to evolutionary progress only when they lie within a narrowly confined and calculable step-width band. Mutations which fall outside the evolution window are ineffective.
“Evolutionsstrategien”, Rechenberg 1977
1977rechenberg.pdf: “Evolutionsstrategien” [“Evolution Strategies”], Ingo Rechenberg (1977-01-01)
Miscellaneous

2017chupeau.pdf
(2017-01-01; backlinks) 
https://www.sumsar.net/blog/2015/01/probablepointsandcredibleintervalsparttwo/
https://www.quantamagazine.org/clevermachineslearnhowtobecurious20170919/

https://www.nature.com/articles/s41467020192444#deepmind
https://people.idsia.ch/~juergen/FKI12690_(revised)bw_ocr.pdf

https://patentimages.storage.googleapis.com/57/53/22/91b8a6792dbb1e/US20180204116A1.pdf

https://openai.com/blog/reinforcementlearningwithpredictionbasedrewards/

https://lilianweng.github.io/lillog/2020/01/29/curriculumforreinforcementlearning.html#openai
https://en.wikipedia.org/wiki/Montezuma%27s_Revenge_(video_game)

https://deepmind.com/blog/safetyfirstaiautonomousdatacentrecoolingandindustrialcontrol/

https://deepmind.com/blog/article/capturetheflagscience
https://blog.openai.com/learningmontezumasrevengefromasingledemonstration/

https://bair.berkeley.edu/blog/2021/11/05/epistemicpomdp/
https://80000hours.org/podcast/episodes/brianchristianalgorithmstoliveby/
http://people.csail.mit.edu/pkrafft/papers/krafftthesisfinal.pdf
http://papers.nips.cc/paper/4031montecarloplanninginlargepomdps.pdf
http://mlg.eng.cam.ac.uk/yarin/blog_3d801aa532c1ce.html
(backlinks) 
Mail-delivery
(backlinks)