The Eighty Five Percent Rule for optimal learning. Robert C. Wilson, Amitai Shenhav, Mark Straccia & Jonathan D. Cohen. Nature Communications, volume 10, Article number: 4646 (2019). November 5, 2019. https://www.nature.com/articles/s41467-019-12552-4
Abstract: Researchers and educators have long wrestled with the question of how best to teach their clients, be they humans, non-human animals, or machines. Here, we examine the effect of a single variable, the difficulty of training, on the rate of learning. In many situations we find that there is a sweet spot in which training is neither too easy nor too hard, and where learning progresses most quickly. We derive conditions for this sweet spot for a broad class of learning algorithms in the context of binary classification tasks. For all of these stochastic gradient-descent-based learning algorithms, we find that the optimal error rate for training is around 15.87% or, conversely, that the optimal training accuracy is about 85%. We demonstrate the efficacy of this ‘Eighty Five Percent Rule’ for artificial neural networks used in AI and for biologically plausible neural networks thought to describe animal learning.
Discussion
In this article we considered the effect of training accuracy on learning in the case of binary classification tasks and stochastic gradient-descent-based learning rules. We found that the rate of learning is maximized when the difficulty of training is adjusted to keep the training accuracy at around 85%. We showed that training at the optimal accuracy proceeds exponentially faster than training at a fixed difficulty. Finally, we demonstrated the efficacy of the Eighty Five Percent Rule in the case of artificial and biologically plausible neural networks.
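To make the training procedure concrete, here is a minimal sketch of difficulty-tracking training for a simple gradient-descent learner. It is not the paper's exact simulation: the task (a noisy linear classification problem), the logistic learner, the target-accuracy window, and all step sizes are illustrative assumptions.

```python
# Minimal sketch (not the paper's exact simulation): a perceptron-style learner
# whose weight vector w must align with an unknown true direction e. The signal
# strength 'delta' (difficulty ~ 1/delta) is adapted online so that a running
# estimate of training accuracy stays near 85%.
import numpy as np

rng = np.random.default_rng(0)
dim = 20
e = rng.normal(size=dim)
e /= np.linalg.norm(e)                # unknown true direction defining the categories
w = 0.01 * rng.normal(size=dim)       # learner's weights, initially uninformative
delta, lr, target_acc = 4.0, 0.05, 0.85
acc_estimate = 0.5                    # running estimate of training accuracy

for trial in range(20000):
    label = rng.choice([-1.0, 1.0])                  # true category
    x = label * delta * e + rng.normal(size=dim)     # signal plus Gaussian noise
    p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))          # learner's P(label = +1)
    correct = (p > 0.5) == (label > 0)

    # Stochastic gradient descent on the logistic (cross-entropy) loss.
    w += lr * ((label + 1.0) / 2.0 - p) * x

    # Adapt difficulty: harder (smaller delta) when accuracy is above target,
    # easier when it is below. Step sizes here are arbitrary choices.
    acc_estimate += 0.02 * (float(correct) - acc_estimate)
    delta *= 0.99 if acc_estimate > target_acc else 1.01
    delta = float(np.clip(delta, 0.05, 20.0))

print(f"alignment with true direction: {np.dot(w, e) / np.linalg.norm(w):.3f}")
print(f"final delta: {delta:.2f}, running accuracy: {acc_estimate:.2%}")
```

Under this scheme the difficulty parameter delta shrinks as the learner's weights align with the true direction, so training accuracy hovers near the target rather than drifting towards ceiling.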
Our results have implications for a number of fields. Perhaps most directly, our findings move towards a theory for identifying the optimal environmental settings in order to maximize the rate of gradient-based learning. Thus the Eighty Five Percent Rule should hold for a wide range of machine learning algorithms, including multilayered feedforward and recurrent neural networks (e.g., ‘deep learning’ networks trained with backpropagation [9] and reservoir computing networks [21,22]), as well as Perceptrons. Of course, in these more complex situations, our assumptions may not always be met. For example, as shown in the Methods, relaxing the assumption that the noise is Gaussian changes the optimal training accuracy: from 85% for Gaussian noise, to 82% for Laplacian noise, to 75% for Cauchy noise (Eq. (31) in the “Methods”).
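The three optimal accuracies quoted above are consistent with the optimal error rate being the noise distribution's cumulative density evaluated at minus one noise scale; the small script below recovers them under that assumption (a reading of Eq. (31), not a quotation of it).

```python
# Optimal training error rate for three noise distributions, assuming
# (consistent with the three values quoted above) that the optimum is the
# noise CDF evaluated at -1, i.e. ER* = F(-1) in units of the noise scale.
import math

def gaussian_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def laplace_cdf(x, b=1.0):
    return 0.5 * math.exp(x / b) if x < 0 else 1.0 - 0.5 * math.exp(-x / b)

def cauchy_cdf(x, gamma=1.0):
    return 0.5 + math.atan(x / gamma) / math.pi

for name, cdf in [("Gaussian", gaussian_cdf),
                  ("Laplacian", laplace_cdf),
                  ("Cauchy", cauchy_cdf)]:
    err = cdf(-1.0)
    print(f"{name:9s}  optimal error ~ {err:.4f}  optimal accuracy ~ {1 - err:.1%}")

# Gaussian  -> ~0.1587 error (~84% accuracy, the 'Eighty Five Percent Rule')
# Laplacian -> ~0.1839 error (~82% accuracy)
# Cauchy    -> ~0.2500 error (~75% accuracy)
```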
More generally, extensions to this work should consider how batch-based training changes the optimal accuracy, and how the Eighty Five Percent Rule changes when there are more than two categories. In batch learning, the optimal difficulty of the examples in each batch will likely depend on the rate of learning relative to the size of the batch. If learning is slow, then selecting examples in each batch that satisfy the 85% rule may work well, but if learning is fast, then mixing in more difficult examples may be better. For multiple categories, similar analyses are likely possible, although the mapping between decision variable and categories will be more complex, as will the error rates, which could be category-specific (e.g., misclassifying category 1 as category 2 rather than as category 3).
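As one speculative illustration of how a batch-based variant might look, the sketch below selects, from a candidate pool, the examples whose predicted probability of being answered correctly is closest to the 85% target; the model interface, pool, and batch size are all assumptions rather than anything derived in the paper.

```python
# Speculative sketch of batch selection in the spirit of the 85% rule: rank
# candidate examples by how close their predicted probability of a correct
# response is to the target accuracy, and take the closest ones.
import numpy as np

def select_batch(prob_correct_fn, pool_x, batch_size=32, target_acc=0.85):
    """prob_correct_fn(pool_x) -> per-example probability of a correct response
    under the current model (a hypothetical helper supplied by the caller)."""
    p_correct = np.asarray(prob_correct_fn(pool_x))
    order = np.argsort(np.abs(p_correct - target_acc))
    return pool_x[order[:batch_size]]
```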
In Psychology and Cognitive Science, the Eighty Five Percent Rule accords with the informal intuition of many experimentalists that participant engagement is often maximized when tasks are neither too easy nor too hard. Indeed, it is notable that staircasing procedures, which aim to titrate task difficulty so that the error rate is fixed during learning, are commonly designed to produce about 80–85% accuracy [17]. Similarly, when given a free choice about the difficulty of task they can perform, participants spontaneously choose tasks of intermediate difficulty as they learn [23]. Despite the prevalence of this intuition, to the best of our knowledge no formal theoretical work has addressed the effect of training accuracy on learning; testing this effect is an important direction for future work.
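Staircasing procedures of this kind are often implemented as weighted up-down rules; a minimal sketch is below, assuming a scalar difficulty knob and using the standard equilibrium argument to pick step sizes that converge near 85% accuracy.

```python
# Sketch of a weighted up-down staircase targeting ~85% accuracy. At its
# equilibrium the expected movement in difficulty is zero,
#   p * step_harder = (1 - p) * step_easier,
# so choosing step_easier = step_harder * target / (1 - target) makes the
# procedure converge near the target accuracy. The step size is illustrative.
def staircase_update(difficulty, correct, target=0.85, step_harder=0.01):
    step_easier = step_harder * target / (1.0 - target)
    if correct:
        return difficulty + step_harder   # a bit harder after a correct response
    return difficulty - step_easier       # noticeably easier after an error
```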
More generally, our work closely relates to the Region of Proximal Learning and Desirable Difficulty frameworks in education [24,25,26] and to Curriculum Learning and Self-Paced Learning [7,8] in computer science. These related, but distinct, frameworks propose that people and machines learn best when training tasks involve just the right amount of difficulty. In the Desirable Difficulties framework, the difficulty in the task must be of a ‘desirable’ kind, such as spacing practice over time, that promotes learning, as opposed to an undesirable kind that does not. In the Region of Proximal Learning framework, which builds on early work by Piaget [27] and Vygotsky [28], this optimal difficulty lies in a region just beyond the learner’s current ability. Curriculum and Self-Paced Learning in computer science build on the similar intuition that machines learn best when training examples are presented in order from easy to hard. In practice, the optimal difficulty in all of these domains is determined empirically and is often dependent on many factors [29]. In this context, our work offers a way of deriving the desirable difficulty and the region of proximal learning in the special case of binary classification tasks to which stochastic gradient-descent learning rules apply. As such, our work represents the first step towards a more mathematical instantiation of these theories, although it remains to be generalized to a broader class of circumstances, such as multi-choice tasks and different learning algorithms.
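For concreteness, a common self-paced selection rule can be sketched as follows; the loss-based notion of ‘easy’ and the threshold schedule are illustrative assumptions, not the specific algorithms of refs. 7 and 8.

```python
# Sketch of a self-paced selection rule: at each round, train only on examples
# whose current loss falls below a threshold that grows over time, so the
# effective curriculum moves from easy to hard.
import numpy as np

def self_paced_mask(per_example_loss, threshold):
    """Select the currently 'easy' examples: loss below the threshold."""
    return np.asarray(per_example_loss) < threshold

# Pseudo-usage inside a training loop:
#   losses = per_example_loss(model, examples)         # hypothetical helper
#   easy = examples[self_paced_mask(losses, threshold)]
#   ... one gradient step on `easy` ...
#   threshold *= 1.1   # admit harder examples as training progresses
```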
[...] our work points to a mathematical theory of the state of ‘Flow’ [34]. This state, ‘in which an individual is completely immersed in an activity without reflective self-consciousness but with a deep sense of control’ [ref. 35, p. 1], is thought to occur most often when the demands of the task are well matched to the skills of the participant. This idea of balance between skill and challenge was captured originally with a simple conceptual diagram (Fig. 5) featuring two other states: ‘anxiety’, when challenge exceeds skill, and ‘boredom’, when skill exceeds challenge. These three qualitatively different regions (flow, anxiety, and boredom) arise naturally in our model. Identifying the precision, β, with the level of skill and the level of challenge with the inverse of the true decision variable, 1/Δ, we see that when challenge matches skill, flow is associated with a high learning rate and high accuracy, anxiety with low learning rate and low accuracy, and boredom with high accuracy but low learning rate (Fig. 5b, c). Intriguingly, recent work by Vuorre and Metcalfe has found that subjective feelings of Flow peak for tasks that are subjectively rated as being of intermediate difficulty [36]. In addition, work on learning to control brain-computer interfaces finds that subjective, self-reported measures of ‘optimal difficulty’ peak at the difficulty associated with maximal learning, not at the difficulty associated with optimal decoding of neural activity [37]. Going forward, it will be interesting to test whether these subjective measures of engagement peak at the point of maximal learning gradient, which for binary classification tasks corresponds to a training accuracy of about 85%.
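A toy reading of this mapping is sketched below. It assumes accuracy = Φ(βΔ) and learning speed ∝ Δ·φ(βΔ), with Φ and φ the standard normal CDF and PDF; this is one way to cash out the signal-plus-noise description above, not the paper's exact equations, but under it the three regions fall out as described.

```python
# Toy illustration (an assumption-laden reading of the signal-plus-noise model,
# not the paper's exact equations): accuracy = Phi(beta * Delta) and learning
# speed proportional to Delta * phi(beta * Delta), where beta is skill and
# 1/Delta is challenge.
import math

def accuracy(beta, delta):
    return 0.5 * (1.0 + math.erf(beta * delta / math.sqrt(2.0)))

def learning_speed(beta, delta):
    return delta * math.exp(-0.5 * (beta * delta) ** 2) / math.sqrt(2.0 * math.pi)

beta = 1.0  # fixed skill; vary the challenge 1/delta
for region, delta in [("anxiety (challenge >> skill)", 0.1),
                      ("flow    (challenge ~= skill)", 1.0),
                      ("boredom (skill >> challenge)", 10.0)]:
    print(f"{region}: accuracy {accuracy(beta, delta):.2f}, "
          f"learning speed {learning_speed(beta, delta):.3f}")
```

In this toy model the learning speed is maximal exactly when βΔ = 1, where accuracy is Φ(1) ≈ 84%, in line with the rule.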