Fiorillo et al. paired the presentation of five different visual stimuli to macaques with the delayed, probabilistic (pr = 0, 0.25, 0.5, 0.75, 1) delivery of juice rewards. They used a delay conditioning paradigm, in which the stimulus persists for a fixed interval of 2 s, with reward being delivered when the stimulus disappears. After training, the monkeys' anticipatory licking behavior indicated that they were aware of the different reward probabilities associated with each stimulus.
Figure 1a shows population histograms of extracellularly-recorded DA cell activity, for each pr. TD theory predicts that the phasic activation of the DA cells at the time of the visual stimuli should correspond to the average expected reward, and so should increase with pr. Figure 1a shows exactly this; indeed, across the population, the increase is quite linear. Morris et al. report a similar result in an instrumental (trace) conditioning task also involving probabilistic reinforcement.
By contrast, at the time of potential reward delivery, TD theory predicts that on average there should be no activity, as, on average, there is no prediction error at that time. Of course, in the probabilistic reinforcement design (at least for pr ≠ 0, 1) there is in fact a prediction error at the time of delivery or non-delivery of reward on every single trial. On trials in which a reward is delivered, the prediction error should be positive (as the reward obtained is larger than the average reward expected). Conversely, on trials with no reward it should be negative (see Figure 1c). Crucially, under TD, the average of these differences, weighted by their probabilities of occurring, should be zero. If it is not zero, then this prediction error should act as a plasticity signal, changing the predictions until there is no prediction error. At variance with this expectation, the data in Figure 1a, which are averaged over both rewarded and unrewarded trials, show that there is in fact positive mean activity at this time. This is also evident in the data of Morris et al. (see Figure 3c). The positive DA responses show no signs of disappearing even with substantial training (over the course of months).
Worse than this for the TD model, and indeed the focus of Fiorillo et al., is the apparent ramping of DA activity towards the expected time of the reward. As the magnitude of the ramp is greatest for pr = 0.5, Fiorillo et al. suggested that it reports the uncertainty in reward delivery, rather than a prediction error, and speculated that this signal could explain the apparently appetitive properties of uncertainty (as seen in gambling).
Both the ramping activity and the activity at the expected time of reward pose critical challenges to the TD theory. TD learning operates by arranging for DA activity at one time in a trial to be predicted away by cues available earlier in that trial. Thus, it is not clear how any seemingly predictable activity, be it that at the time of the reward or in the ramp before, can persist without being predicted away by the onset of the visual stimulus. After all, the pr-dependent activity in response to the stimulus confirms its status as a valid predictor. Furthermore, a key aspect of TD is that it couples prediction to action choice by using the value of a state as an indication of the future rewards available from that state, and therefore its attractiveness as a target for action. From this perspective, since the ramping activity is explicitly not predicted by the earlier cue, it cannot influence early actions, such as the decision to gamble. For instance, consider a competition between two actions: one eventually leading to a state with a deterministic reward and therefore no ramp, and the other leading to a state followed by a probabilistic reward with the same mean, and a ramp. Since the ramp does not affect the activity at the time of the conditioned stimulus, it cannot be used to evaluate or favour the second action (gambling) over the first, despite the extra uncertainty.
We suggest the alternative hypothesis that both these anomalous firing patterns result directly from the constraints implied by the low baseline rate of activity of DA neurons (2–4 Hz) on the coding of the signed prediction error. As noted by Fiorillo et al., positive prediction errors are represented by firing rates of ~270% above baseline, while negative errors are represented by a decrease of only ~55% below baseline (see also [14,18]). This asymmetry is a straightforward consequence of coding a signed quantity by firing that has a low baseline and, obviously, can only be positive. Firing rates above baseline can encode positive prediction errors using a large dynamic range; below-baseline firing rates, however, can only go down to zero, which restricts the coding of negative prediction errors.
Consequently, one has to be careful interpreting the sums (or averages) of peri-stimulus-time-histograms (PSTHs) of activity over different trials, as was done in Figure 1a. The asymmetrically coded positive and negative error signals at the time of the receipt or non-receipt of reward should indeed not sum up to zero, even if they represent correct TD prediction errors. When summed, the low firing representing the negative errors in the unrewarded trials will not "cancel out" the rapid firing encoding positive errors in the rewarded trials, and, overall, the average will show a positive response. In the brain, of course, as responses are not averaged over (rewarded and unrewarded) trials, but over neurons within a trial, this need not pose a problem.
This explains the persistent positive activity (on average) at the time of delivery or non-delivery of the reward. But what about the ramp prior to this time? At least in certain neural representations of the time between stimulus and reward, when trials are averaged, this same asymmetry leads TD to result exactly in a ramping of activity toward the time of the reward. The TD learning mechanism has the effect of propagating, on a trial-by-trial basis, prediction errors that arise at one time in a trial (such as at the time of the reward) towards potential predictors (such as the CS) that arise at earlier times within each trial. Under the asymmetric representation of positive and negative prediction errors that we have just discussed, averaging these propagating errors over multiple trials (as in Figure 1a) will lead to positive means for epochs within a trial before a reward. The precise shape of the resulting ramp of activity depends on the way stimuli are represented over time, as well as on the speed of learning, as will be discussed below.
Figure 2 illustrates this view of the provenance of the ramping activity. Here, a tapped delay-line representation of time since the stimulus is used. For this, each unit ('neuron') becomes active (i.e., assumes the value 1) at a certain lag after the stimulus has been presented, so that every timestep after the stimulus onset is consistently represented by the firing of one unit. Learning is based on the (dopaminergically-reported) TD error, formalized as δ(t) = r(t) + V(t) - V(t - 1), with V(t) the weighted input from the active unit at time t, and r(t) the reward obtained at time t. Updating the weights of the units according to the standard TD update rule with a fixed learning rate allows V(t) to represent, on average, the expected future rewards (see Figure 1 caption). As each subsequent timestep is separately represented, TD prediction errors can arise at any time within the trial. Figure 2a shows these errors in six consecutive simulated trials in which pr = 0.5. In every trial, a new positive or negative error arises at the time of the reward, consequent on receipt or non-receipt of the reward, and step-by-step the errors from previous trials propagate back to the time of the stimulus, through the constant updating of the weights (e.g., the error highlighted in red). When averaging (or, as in PSTHs, summing) over trials, these errors cancel each other, resulting in an overall flat histogram in the interval after the stimulus onset and leading up to the time of the reward (black line in Figure 2b, summed over the 10 trials shown in thin blue). However, when summed after asymmetric scaling of the negative errors by a factor of d = 1/6 (which simulates the asymmetric coding of positive and negative prediction errors by DA neurons), a positive ramp of activity ensues, as illustrated by the black line in Figure 2c.
Note that this rescaling is only a representational issue, resulting from the constraints of encoding a negative value about a low baseline firing rate, and should not affect the learning of the weights, so as not to learn wrong values (see discussion). However, as PSTHs are directly sums of neuronal spikes, this representational issue bears on the resulting histogram.
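The mechanism described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the simulation used for the figures: the function name `simulate_delay_line_td`, the number of timesteps, the learning rate, the burn-in period and the unit reward magnitude are all hypothetical choices. Learning uses the unscaled TD error, and only the recorded signal has its negative errors scaled by d.

```python
import random


def simulate_delay_line_td(pr=0.5, n_steps=10, alpha=0.1, d=1.0 / 6.0,
                           n_trials=5000, burn_in=1000, seed=0):
    """Trial-averaged TD errors for timesteps t = 1..n_steps.

    Returns (symmetric, asymmetric); index i of each list is timestep t = i + 1.
    Timestep 1 is stimulus onset and timestep n_steps is the time of the reward.
    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    # w[t] is the weight of the single unit active at timestep t, so V(t) = w[t].
    # w[0] (before the stimulus) and w[n_steps] (trial over) are never updated.
    w = [0.0] * (n_steps + 1)
    sym_sum = [0.0] * (n_steps + 1)
    asym_sum = [0.0] * (n_steps + 1)

    for trial in range(burn_in + n_trials):
        reward = 1.0 if rng.random() < pr else 0.0
        # delta(t) = r(t) + V(t) - V(t - 1); r(t) is nonzero only at t = n_steps.
        delta = [0.0] * (n_steps + 1)
        for t in range(1, n_steps + 1):
            r_t = reward if t == n_steps else 0.0
            delta[t] = r_t + w[t] - w[t - 1]
        # Learning uses the unscaled error; the unit active at t - 1 is updated.
        for t in range(2, n_steps + 1):
            w[t - 1] += alpha * delta[t]
        if trial >= burn_in:
            for t in range(1, n_steps + 1):
                sym_sum[t] += delta[t]
                # Only the recorded signal has its negative errors scaled by d.
                asym_sum[t] += delta[t] if delta[t] >= 0.0 else d * delta[t]

    symmetric = [s / n_trials for s in sym_sum[1:]]
    asymmetric = [a / n_trials for a in asym_sum[1:]]
    return symmetric, asymmetric


if __name__ == "__main__":
    symmetric, asymmetric = simulate_delay_line_td()
    for t, (s, a) in enumerate(zip(symmetric, asymmetric), start=1):
        print(f"t={t:2d}  symmetric={s:+.4f}  asymmetric={a:+.4f}")
```

In this sketch, the symmetric average after stimulus onset stays close to zero, whereas the asymmetric average is positive and grows toward the time of the reward. The first timestep carries the stimulus response, which approximates pr.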
Figures 1b,d show the ramp arising from this combination of asymmetric coding and inter-trial averaging, for comparison with the experimental data. Figure 1b shows the PSTH computed from our simulated data by averaging over the asymmetrically-represented δ(t) signal in ~50 trials for each stimulus type. Figure 1d shows the results for the pr = 0.5 case, divided into rewarded and unrewarded trials for comparison with Figure 1c. The simulated results resemble the experimental data closely in that they replicate the net positive response to the uncertain rewards, as well as the ramping effect, which is highest in the pr = 0.5 case.
It is simple to derive the average response at the time of the reward (t = N) in trial T, i.e., the average TD error δ_T(N), from the TD learning rule with the simplified tapped delay-line time representation and a fixed learning rate α. The value at the next to last timestep in a trial, as a function of trial number (with initial values taken to be zero), is

$$V_T(N-1) = \alpha \sum_{t=1}^{T} (1-\alpha)^{T-t}\, r(t),$$

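where r(t) is the reward at the end of trial t. The error signal at the last timestep of trial T is simply the difference between the obtained reward r(T) and the value predicting that reward, V_{T-1}(N - 1). This error is positive with probability pr, and negative with probability (1 - pr). Scaling the negative errors by a factor of d ∈ (0, 1], and taking the reward magnitude to be 1, we thus get

$$\langle \delta_T(N) \rangle = pr\,\bigl(1 - V_{T-1}(N-1)\bigr) - d\,(1 - pr)\,V_{T-1}(N-1),$$

which, once the average of V_{T-1}(N - 1) has converged to pr, equals (1 - d) pr (1 - pr).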
For symmetric coding of positive and negative errors (d = 1), the average response is 0. For asymmetric coding (0 < d < 1), the average response is proportional to the variance of the rewards, pr(1 - pr), and is thus maximal at pr = 0.5. However, δ_T is positive, and concomitantly, the ramps are positive, and in this particular setting, are related to uncertainty, because of, rather than instead of, the coding of δ(t).
Indeed, there is a key difference between the uncertainty and TD accounts of the ramping activity. According to the former, the ramping is a within-trial phenomenon, coding uncertainty in reward; by contrast, the latter suggests that ramps arise only through averaging across multiple trials. Within a trial, when averaging over simultaneously recorded neurons rather than trials, the traces should not show a smooth ramp, but intermittent positive and negative activity corresponding to back-propagating prediction errors from the immediately previous trials (as in Figure 2a).
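The dependence of the average reward-time response on pr can be checked numerically. With rewards of magnitude 1 and a value that has converged on average to pr, the rewarded error is 1 - V and the unrewarded error is -V, so the scaled average is (1 - d) pr (1 - pr). The sketch below compares this expression with a simulation of the single weight that predicts the reward. The function names, learning rate, trial counts and seed are hypothetical choices made for illustration.

```python
import random


def mean_reward_time_response(pr, d):
    """Converged average scaled TD error at the reward: (1 - d) * pr * (1 - pr)."""
    return (1.0 - d) * pr * (1.0 - pr)


def simulated_reward_time_response(pr, d, alpha=0.1, n_trials=20000,
                                   burn_in=500, seed=1):
    """Average scaled TD error at the reward, estimated over simulated trials."""
    rng = random.Random(seed)
    v = 0.0  # value at the next to last timestep, initialised to zero
    total = 0.0
    for trial in range(burn_in + n_trials):
        r = 1.0 if rng.random() < pr else 0.0
        delta = r - v  # error at the last timestep of this trial
        if trial >= burn_in:
            # Negative errors are scaled by d in the recorded signal only.
            total += delta if delta >= 0.0 else d * delta
        v += alpha * delta  # learning uses the unscaled error
    return total / n_trials


if __name__ == "__main__":
    for pr in (0.0, 0.25, 0.5, 0.75, 1.0):
        closed = mean_reward_time_response(pr, 1.0 / 6.0)
        simulated = simulated_reward_time_response(pr, 1.0 / 6.0)
        print(f"pr={pr:.2f}  closed form={closed:.4f}  simulated={simulated:.4f}")
```

Setting d = 1 in either function gives an average of zero, which corresponds to the symmetric-coding case.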