Learning Not to Learn: Nature versus Nurture in Silico

Robert Tjarko Lange & Henning Sprekeler
Berlin Institute of …

Abstract

Animals are equipped with a rich innate repertoire of sensory, behavioral and motor skills, which allows them to interact with the world immediately after birth. At the same time, many behaviors are highly adaptive and can be tailored to specific environments by means of learning. In this work, we use mathematical analysis and the framework of memory-based meta-learning (or ’learning to learn’) to answer when it is beneficial to learn such an adaptive strategy and when to hard-code a heuristic behavior. We find that the interplay of ecological uncertainty, task complexity and the agents’ lifetime has crucial effects on the meta-learned amortized Bayesian inference performed by an agent. There exist two regimes: one in which meta-learning yields a learning algorithm that implements task-dependent information integration, and a second in which meta-learning imprints a heuristic or ’hard-coded’ behavior. Further analysis reveals that non-adaptive behaviors are not only optimal for aspects of the environment that are stable across individuals, but also in situations where an adaptation to the environment would in fact be highly beneficial, but could not be done quickly enough to be exploited within the remaining lifetime. Hard-coded behaviors should hence not only be those that always work, but also those that are too complex to be learned within a reasonable time frame.

1 Introduction

The ’nature versus nurture’ debate (e.g., Mutti et al., 1996; Tabery, 2014) – the question of which aspects of behavior are ’hard-coded’ by evolution, and which are learned from experience – is one of the oldest and most controversial debates in biology. Evolutionary principles prescribe that hard-coded behavioral routines should be those for which there is no benefit in adaptation.
This is believed to be the case for behaviors whose evolutionary advantage varies little among individuals of a species. Mating instincts or flight reflexes are general solutions that rarely present an evolutionary disadvantage. On the other hand, features of the environment that vary substantially for individuals of a species potentially ask for adaptive behavior (Buss, 2015). Naturally, the same principles should not only apply to biological but also to artificial agents. But how can a reinforcement learning agent differentiate between these two behavioral regimes?

A promising approach to automatically learn rules of adaptation that facilitate environment-specific specialization is meta-learning (Schmidhuber, 1987; Thrun and Pratt, 1998). At its core lies the idea of using generic optimization methods to learn inductive biases for a given ensemble of tasks. In this approach, the inductive bias usually has its own set of parameters (e.g., weights in a recurrent network; Hochreiter et al., 2001) that are optimized on the whole task ensemble, that is, on a long, ’evolutionary’ time scale. These parameters in turn control how a different set of parameters (e.g., activities in the network) are updated on a much faster time scale. These rapidly adapting parameters then allow the system to adapt to a specific task at hand.

Preprint. Under review.

Notably, the parameters of the system that
are subject to ’nature’ – i.e., those that shape the inductive bias and are common across tasks – and those that are subject to ’nurture’ are usually predefined from the start.

In this work, we use the memory-based meta-learning approach for a different goal, namely to acquire a qualitative understanding of which aspects of behavior should be hard-coded and which should be adaptive. Our hypothesis is that meta-learning can not only learn efficient learning algorithms, but can also decide not to be adaptive at all, and to instead apply a generic heuristic to the whole ensemble of tasks. Phrased in the language of biology, meta-learning can decide whether to hard-code a behavior or to render it adaptive, based on the range of environments the individuals of a species could encounter. We study the dependence of the meta-learned algorithm on three central features of the meta-reinforcement learning problem:

- Ecological uncertainty: How diverse is the range of tasks the agent could encounter?
- Task complexity: How long does it take to learn the optimal strategy for the task at hand? Note that this could be different from the time it takes to execute the optimal strategy.
- Expected lifetime: How much time can the agent spend on exploration and exploitation?

Using analytical and numerical analyses, we show that non-adaptive behaviors are optimal in two cases – when the optimal policy varies little across the tasks within the task ensemble, and when the time it takes to learn the optimal policy is too long to allow a sufficient exploitation of the learned policy. Our results suggest that not only the design of the meta-task distribution, but also the lifetime of the agent can have strong effects on the meta-learned algorithm of RNN-based agents. In particular, we find highly nonlinear and potentially discontinuous effects of ecological uncertainty, task complexity and lifetime on the optimal algorithm.
As a consequence, a meta-learned adaptation strategy that was optimized, e.g., for a given lifetime may not generalize well to other lifetimes. This is essential for research questions concerned with the resulting adaptation behavior, including curriculum design, safe exploration, as well as human-in-the-loop applications. Our work may provide a principled way of examining the constraint-dependence of meta-learned inductive biases.

The remainder of this paper is structured as follows: First, we review the background in memory-based meta-reinforcement learning and contrast the related literature. Afterwards, we analyze a Gaussian multi-arm bandit setting, which allows us to analytically disentangle the behavioral impact of ecological uncertainty, task complexity and lifetime. Our derivation of the lifetime-dependent Bayes-optimal exploration reveals a highly non-linear interplay of these three factors. We show numerically that memory-based meta-learning reproduces our theoretical results and can learn not to learn. Furthermore, we extend our analysis to more complicated exploration problems. Throughout, we analyze the resulting recurrent dynamics of the network and the representations associated with learning and non-adaptive strategies.

2 Related Work & Background

Meta-learning or ’learning to learn’ (e.g., Schmidhuber, 1987; Thrun and Pratt, 1998; Hochreiter et al., 2001; Duan et al., 2016; Wang et al., 2016; Finn et al., 2017) has been proposed as a computational framework for acquiring task-distribution-specific learning rules. During a costly outer loop optimization, an agent crafts a niche-specific adaptation strategy, which is applicable to an engineered task distribution. At inference time, the acquired inner loop learning algorithm is executed for a fixed amount of timesteps (lifetime) on a test task.
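To make the two nested timescales concrete, here is a deliberately minimal sketch: a tiny numpy recurrent policy whose weights theta play the role of ’nature’ (outer loop), while its hidden activity plays the role of ’nurture’ (inner loop). We use a basic evolution-strategies outer loop purely for compactness; the papers cited here instead train LSTMs with gradient-based methods, and all function and variable names below are our own illustrative choices.

```python
import numpy as np

def inner_loop(theta, pull, T, rng):
    """One 'lifetime': the meta-learned weights theta stay frozen; only the
    recurrent activity h adapts, by integrating the history of (action,
    reward) pairs. This is where within-lifetime learning can (but need
    not) emerge."""
    W_in, W_rec, w_out = theta
    h = np.zeros(W_rec.shape[0])
    obs = np.zeros(2)                          # previous (action, reward)
    episode_return = 0.0
    for _ in range(T):
        h = np.tanh(W_in @ obs + W_rec @ h)    # fast, activity-based adaptation
        p = 1.0 / (1.0 + np.exp(-w_out @ h))   # prob. of the stochastic arm
        a = int(rng.random() < p)
        r = pull(a)
        obs = np.array([a, r], dtype=float)
        episode_return += r
    return episode_return

def outer_loop(sample_task, T, n_gens=30, pop=16, sigma=0.05, lr=0.05,
               n_hidden=8, seed=0):
    """'Evolutionary' timescale: optimize theta across many sampled tasks
    with a basic evolution-strategies gradient estimate."""
    rng = np.random.default_rng(seed)
    shapes = [(n_hidden, 2), (n_hidden, n_hidden), (n_hidden,)]
    theta = [rng.normal(0.0, 0.1, s) for s in shapes]
    for _ in range(n_gens):
        grad = [np.zeros(s) for s in shapes]
        for _ in range(pop):
            eps = [rng.normal(0.0, 1.0, s) for s in shapes]
            fitness = inner_loop([p + sigma * e for p, e in zip(theta, eps)],
                                 sample_task(rng), T, rng)
            for g, e in zip(grad, eps):
                g += fitness * e
        theta = [p + lr * g / (pop * sigma) for p, g in zip(theta, grad)]
    return theta
```

Here `sample_task` is expected to return a reward function `pull(action)` for a freshly drawn task; for the bandit of Section 3 it could, e.g., draw µ ∼ N(−1, σp²) and return noisy rewards for the stochastic arm.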
This framework has successfully been applied to a range of applications such as the meta-learning of optimization updates (Andrychowicz et al., 2016; Flennerhag et al., 2018, 2019), agent (Rabinowitz et al., 2018) and world models (Nagabandi et al., 2018), and explicit models of memory (Santoro et al., 2016; Bartunov et al., 2019). Already, early work by Schmidhuber (1987) suggested an evolutionary perspective on recursively learning the rules of learning. This perspective holds the promise of explaining the emergence of mechanisms underlying both natural and artificial behaviors. Furthermore, a similarity between the hidden activations of LSTM-based meta-learners and the recurrent activity of neurons in the prefrontal cortex (Wang et al., 2018) has recently been suggested.

Previous work has shown that LSTM-based meta-learning is capable of distilling a sequential integration algorithm akin to amortized Bayesian inference (Ortega et al., 2019; Rabinowitz, 2019; Mikulik et al., 2020). Here we investigate when the integration of information might not be the optimal strategy to meta-learn. We analytically characterize a task regime in which not adapting to sensory
information is optimal. Furthermore, we study whether LSTM-based meta-learning is capable of inferring when to learn and when to execute a non-adaptive program. Rabinowitz (2019) previously studied the outer loop learning dynamics and found differences across several tasks, the origin of which is however not fully understood. Our work may provide an explanation for these different meta-learning dynamics and their dependence on the task distribution as well as the time horizon of adaptation.

Our work is most closely related to Pardo et al. (2017) and Zintgraf et al. (2019). Pardo et al. (2017) study the impact of fixed time limits and time-awareness on deep reinforcement learning agents. They propose using a timestamp as part of the state representation in order to avoid state-aliasing and the non-Markovianity resulting from a finite-horizon treatment of an infinite-horizon problem. Our setting differs in several aspects. First, we study the case of meta-reinforcement learning, where the agent has to learn within a single lifetime. Second, we focus on a finite-horizon perspective with limited adaptation. Zintgraf et al. (2019), on the other hand, investigate meta-reinforcement learning for Bayes-adaptive Markov decision processes and introduce a novel architecture that disentangles task-specific belief representations from policy representations. Similarly to our work, Zintgraf et al. (2019) are interested in using the meta-learning framework to distill Bayes-optimal exploration behavior. While their adaptation setup extends over multiple episodes, we focus on single-lifetime adaptation and analytically analyze when it is beneficial to learn in the first place.

Finally, our work extends upon the efforts of computational ethology (Stephens, 1991) and experimental evolution (Dunlap and Stephens, 2009, 2016; Marcus et al., 2018), which aim to characterize the conditions under which behavioral plasticity may evolve.
Their work shows that both environmental change and the predictability of the environment shape the selection pressure under which adaptive traits evolve. Our work is based on memory-based meta-learning with function approximation and aims to extend these original findings to task distributions for which no analytical solution may be available.

3 Learning not to Learn

To disentangle the influence of ecological uncertainty, task complexity, and lifetime on the nature of the meta-learned strategy, we first focus on a minimal two-arm Gaussian bandit task, which allows for an analytical solution. The agent experiences episodes consisting of T arm pulls, representing the lifetime of the agent. The statistics of the bandit are constant during each episode, but vary between episodes. To keep it simple, one of the two arms is deterministic and always returns a reward of 0. The task distribution is represented by the variable expected reward of the other arm, which is sampled at the beginning of an episode from a Gaussian distribution with mean −1 and standard deviation σp, i.e., µ ∼ N(−1, σp²). The standard deviation σp controls the uncertainty of the ecological niche. For σp ≪ 1, the deterministic arm is almost always the better option. For σp ≫ 1, the chances of either arm being the best in the given episode are largely even. While the mean µ remains constant for the lifetime T of the agent, the reward obtained in a given trial is stochastic and is sampled from a second Gaussian, r ∼ N(µ, σl²). This trial-to-trial variability controls how many pulls the agent needs to estimate the mean reward of the stochastic arm. The standard deviation σl hence controls how quickly the agent can learn the optimal policy. We therefore use it as a proxy for task complexity.

In this simple setting, the optimal meta-learned strategy can be calculated analytically. The optimal exploration strategy is to initially explore the stochastic arm for a given trial number n.
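The generative process of this bandit (an episode-level mean µ and trial-level rewards r) can be written down directly. The following minimal sketch makes the two sources of randomness explicit; the class and method names are our own, not from the paper's code:

```python
import numpy as np

class TwoArmGaussianBandit:
    """Two-arm bandit from the text: arm 0 is deterministic and always
    pays 0; arm 1 has a mean mu ~ N(-1, sigma_p^2) drawn once per
    episode, and each pull of it returns r ~ N(mu, sigma_l^2)."""

    def __init__(self, sigma_p, sigma_l, rng=None):
        self.sigma_p = sigma_p  # ecological uncertainty
        self.sigma_l = sigma_l  # trial-to-trial variability (task-complexity proxy)
        self.rng = rng if rng is not None else np.random.default_rng()
        self.reset()

    def reset(self):
        """Start a new episode (lifetime): resample the stochastic arm's mean."""
        self.mu = float(self.rng.normal(-1.0, self.sigma_p))
        return self.mu

    def pull(self, arm):
        """Pull arm 0 (safe) or arm 1 (stochastic) and return the reward."""
        if arm == 0:
            return 0.0
        return float(self.rng.normal(self.mu, self.sigma_l))
```

Setting σp = 0 recovers a fully predictable niche (the stochastic arm's mean is always −1), while large σp makes each lifetime's best arm genuinely uncertain.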
Afterwards, it chooses the best arm based on its maximum a posteriori estimate of the remaining episode return. The optimal number of exploration trials n* can then be derived analytically:¹

    n* = argmax_n E[ Σ_{t=1}^{T} r_t | n, T, σl, σp ] = argmax_n [ −n + E_{µ,r}[ (T − n) · µ · p(µ̂ > 0) ] ],

where µ̂ is the estimate of the mean reward of the stochastic arm after the n exploration trials, and the term −n is the expected return accumulated during exploration, since E[µ] = −1. We find two distinct types of behavior (left-hand side of Figure 1): a regime in which learning via exploration is effective, and a second regime in which not learning is the optimal behavior. It may be optimal not to learn for two reasons: First, the ecological uncertainty may be so small that it is very unlikely that the stochastic arm is better. Second, the trial-to-trial variability may be too large relative to the range of potential ecological niches, so that it is simply not possible to integrate sufficient information within a limited lifespan. We make two observations:

¹ Please refer to the supplementary material for a detailed derivation of this analytical result as well as the hyperparameters of the numerical experiments.
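The optimal n* can also be approximated numerically by brute force, without the closed form. A rough Monte Carlo sketch (assuming the explore-then-commit strategy described in the text: pull the stochastic arm n times, then commit to it for the rest of the episode only if the sample mean µ̂ is positive; function names are ours):

```python
import numpy as np

def expected_return(n, T, sigma_l, sigma_p, n_mc=20_000, seed=0):
    """Monte Carlo value of explore-then-commit with n exploration pulls."""
    if n == 0:
        return 0.0  # never explore: always take the safe arm, total reward 0
    rng = np.random.default_rng(seed)
    mu = rng.normal(-1.0, sigma_p, size=n_mc)  # one mean per simulated episode
    # Sample mean of n noisy pulls: mu_hat ~ N(mu, sigma_l^2 / n)
    mu_hat = mu + rng.normal(0.0, sigma_l / np.sqrt(n), size=n_mc)
    explore = n * mu                            # expected reward while exploring
    exploit = (T - n) * mu * (mu_hat > 0)       # commit only if the arm looks good
    return float(np.mean(explore + exploit))

def optimal_exploration(T, sigma_l, sigma_p, n_mc=20_000):
    """Number of exploration trials n* maximizing expected episode return."""
    returns = [expected_return(n, T, sigma_l, sigma_p, n_mc) for n in range(T + 1)]
    return int(np.argmax(returns))
```

In the hard-to-learn regime (σl large relative to σp) this yields n* = 0, i.e. the non-adaptive heuristic of always pulling the safe arm; with large σp and small σl it yields n* > 0, i.e. learning by exploration.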
Figure 1: Theory and meta-learned exploration in a two-arm Gaussian bandit. Left: Bayes-optimal exploration behavior for different lifetimes and across uncertainty conditions σl, σp. Right: Meta-learned exploration behavior using the RL² (Wang et al., 2016) framework. There exist two behavioral regimes (learning by exploration and a heuristic non-explorative strategy) for both the theoretical result and the numerical meta-learned behaviors. The amount of meta-learned exploration is averaged both over 5 independent training runs and 100 episodes for each of the 400 trained networks.

1. There exists a hard nonlinear threshold between learning and not-learning behaviors, described by the ratio of σl and σp. If σl is too large, the value of exp