R2RT: Synthetic Gradients with Tensorflow, by Silviu Pitis (2018-04-08). https://r2rt.com/synthetic-gradients-with-tensorflow.html <!DOCTYPE html> <html> <head> <meta charset="utf-8"> <meta name="generator" content="pandoc"> <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes"> <title></title> <style type="text/css">code{white-space: pre;}</style> <style type="text/css"> div.sourceCode { overflow-x: auto; } table.sourceCode, tr.sourceCode, td.lineNumbers, td.sourceCode { margin: 0; padding: 0; vertical-align: baseline; border: none; } table.sourceCode { width: 100%; line-height: 100%; } td.lineNumbers { text-align: right; padding-right: 4px; padding-left: 4px; color: #aaaaaa; border-right: 1px solid #aaaaaa; } td.sourceCode { padding-left: 5px; } code > span.kw { color: #007020; font-weight: bold; } /* Keyword */ code > span.dt { color: #902000; } /* DataType */ code > span.dv { color: #40a070; } /* DecVal */ code > span.bn { color: #40a070; } /* BaseN */ code > span.fl { color: #40a070; } /* Float */ code > span.ch { color: #4070a0; } /* Char */ code > span.st { color: #4070a0; } /* String */ code > span.co { color: #60a0b0; font-style: italic; } /* Comment */ code > span.ot { color: #007020; } /* Other */ code > span.al { color: #ff0000; font-weight: bold; } /* 
Alert */ code > span.fu { color: #06287e; } /* Function */ code > span.er { color: #ff0000; font-weight: bold; } /* Error */ code > span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */ code > span.cn { color: #880000; } /* Constant */ code > span.sc { color: #4070a0; } /* SpecialChar */ code > span.vs { color: #4070a0; } /* VerbatimString */ code > span.ss { color: #bb6688; } /* SpecialString */ code > span.im { } /* Import */ code > span.va { color: #19177c; } /* Variable */ code > span.cf { color: #007020; font-weight: bold; } /* ControlFlow */ code > span.op { color: #666666; } /* Operator */ code > span.bu { } /* BuiltIn */ code > span.ex { } /* Extension */ code > span.pp { color: #bc7a00; } /* Preprocessor */ code > span.at { color: #7d9029; } /* Attribute */ code > span.do { color: #ba2121; font-style: italic; } /* Documentation */ code > span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */ code > span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */ code > span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */ </style> <!--[if lt IE 9]> <script src="//cdnjs.cloudflare.com/ajax/libs/html5shiv/3.7.3/html5shiv-printshiv.min.js"></script> <![endif]--> </head> <body> <p>I stumbled upon Max Jaderberg’s <a href="https://arxiv.org/abs/1703.00522">Synthetic Gradients paper</a> while thinking about different forms of communication between neural modules. It’s a simple idea: rather than compute gradients through backpropagation, we can train a model to predict what those gradients will be, and use our prediction to update our weights. It’s dynamic programming for neural networks.</p> <p>This is the kind of idea I like because, if it works, it expands our modeling capabilities substantially. It would allow us to connect and train various neural modules asynchronously. Whether this turns out to be useful remains to be seen. 
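</p> <p>To make the mechanics concrete before the Tensorflow graphs below, here is a minimal, framework-free sketch of the idea. It is my own toy construction, not the paper’s setup: a single linear module is updated only with gradients predicted from its output by a small synthetic-gradient model, and that predictor is in turn regressed toward the true gradient. The quadratic loss, the target-conditioned linear predictor (cf. the label-conditioned variant), and all dimensions are assumptions chosen purely for illustration.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy decoupled module: h = x @ W, with true loss L = 0.5 * ||h - t||^2 per
# example, so the true gradient of L with respect to h is simply (h - t).
d_in, d_out = 4, 3
W_true = rng.normal(size=(d_in, d_out)) / np.sqrt(d_in)  # fixed "teacher" producing targets
W = np.zeros((d_in, d_out))        # module weights, trained with synthetic gradients only

# Synthetic gradient model: a linear map [h, t] -> predicted dL/dh, conditioned
# on the target t and zero-initialized, so early predictions are zero (no-op updates).
A = np.zeros((2 * d_out, d_out))

lr, n = 0.05, 16
for step in range(4000):
    x = rng.normal(size=(n, d_in))
    t = x @ W_true                 # targets for this batch
    h = x @ W                      # module forward pass

    feats = np.concatenate([h, t], axis=1)
    sg = feats @ A                 # predicted gradient dL/dh
    W -= lr / n * x.T @ sg         # module update uses only the prediction

    true_grad = h - t              # true dL/dh, available "later"
    A -= lr / n * feats.T @ (sg - true_grad)  # regress predictor onto true gradient

# The module should now imitate the teacher despite never differentiating the true loss.
x = rng.normal(size=(n, d_in))
err = float(np.mean((x @ W - x @ W_true) ** 2))
```

<p>After training, the module’s weights approach the teacher’s even though its updates never touched the true gradient directly; the predictor learned to supply it. The implementation below does the analogous thing, asynchronously in spirit, inside a Tensorflow graph.</p> <p>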
I wanted to try using this in my own work and didn’t find a Tensorflow implementation to my liking, so here is mine. I also take this opportunity to (attempt to) answer one of the questions I had while reading the paper: why not use synthetic loss instead of synthetic gradients? Supposing we had multiple paths in a DAG architecture, a synthetic loss (or better, advantage) would give us an interpretable measure of the “quality” of a part of the input, whereas synthetic gradients do not (without additional assumptions).</p> <p>Below, we use Tensorflow to implement the fully-connected MNIST experiment, as well as the convolutional CIFAR 10 experiment. The <a href="https://arxiv.org/abs/1703.00522">Synthetic Gradients paper</a> itself is a non-technical and easy read, so I’m not going to go into any detail about what exactly it is we’re doing. Jaderberg’s <a href="https://deepmind.com/blog/decoupled-neural-networks-using-synthetic-gradients/">blog post</a> may be helpful on this front. I also enjoyed Delip Rao’s <a href="http://deliprao.com/archives/187">blog post</a> and <a href="http://deliprao.com/archives/191">follow-up</a>.</p> <h3 id="implementation">Implementation</h3> <h4 id="imports-and-data">Imports and data</h4> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> tensorflow <span class="im">as</span> tf, numpy <span class="im">as</span> np, time <span class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt, seaborn <span class="im">as</span> sns <span class="im">from</span> sklearn.utils <span class="im">import</span> shuffle <span class="op">%</span>matplotlib inline sns.<span class="bu">set</span>(color_codes<span class="op">=</span><span class="va">True</span>)</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">(xtr, ytr), (xte, yte) <span class="op">=</span> tf.keras.datasets.mnist.load_data(path<span class="op">=</span><span 
class="st">&#39;mnist.npz&#39;</span>) xtr <span class="op">=</span> xtr.reshape([<span class="op">-</span><span class="dv">1</span>,<span class="dv">784</span>]).astype(np.float32) <span class="op">/</span> <span class="fl">255.</span> xte <span class="op">=</span> xte.reshape([<span class="op">-</span><span class="dv">1</span>,<span class="dv">784</span>]).astype(np.float32) <span class="op">/</span> <span class="fl">255.</span></code></pre></div> <h4 id="utility-functions">Utility functions</h4> <p>Note that the layer and model functions below return their variables. This is so we can selectively compute gradients for different variables as appropriate.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> reset_graph(): <span class="cf">if</span> <span class="st">&#39;sess&#39;</span> <span class="kw">in</span> <span class="bu">globals</span>() <span class="kw">and</span> sess: sess.close() tf.reset_default_graph()</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> layer_dense_bn_relu(h, size, training<span class="op">=</span><span class="va">True</span>): l <span class="op">=</span> tf.layers.Dense(size) h <span class="op">=</span> tf.layers.batch_normalization(l(h), training<span class="op">=</span>training) <span class="cf">return</span> tf.nn.relu(h), l.trainable_variables <span class="kw">def</span> model_linear(h, output_dim, output_activation<span class="op">=</span><span class="va">None</span>, kernel_initializer<span class="op">=</span>tf.zeros_initializer, other_inputs<span class="op">=</span><span class="va">None</span>): <span class="co">&quot;&quot;&quot;</span> <span class="co"> h is input that gets mapped to output_dim dims</span> <span class="co"> other_inputs is vector of other inputs</span> <span class="co"> &quot;&quot;&quot;</span> <span class="cf">if</span> other_inputs <span 
class="kw">is</span> <span class="kw">not</span> <span class="va">None</span>: h <span class="op">=</span> tf.concat([h, other_inputs], axis<span class="op">=</span><span class="bu">len</span>(h.get_shape())<span class="op">-</span><span class="dv">1</span>) l <span class="op">=</span> tf.layers.Dense(output_dim, activation<span class="op">=</span>output_activation, kernel_initializer<span class="op">=</span>kernel_initializer) <span class="cf">return</span> l(h), l.trainable_variables <span class="kw">def</span> model_two_layer(h, output_dim, output_activation<span class="op">=</span><span class="va">None</span>, kernel_initializer<span class="op">=</span>tf.zeros_initializer, other_inputs<span class="op">=</span><span class="va">None</span>): <span class="co">&quot;&quot;&quot;</span> <span class="co"> h is input that gets mapped to output_dim dims</span> <span class="co"> other_inputs is vector of other inputs</span> <span class="co"> &quot;&quot;&quot;</span> <span class="cf">if</span> other_inputs <span class="kw">is</span> <span class="kw">not</span> <span class="va">None</span>: h <span class="op">=</span> tf.concat([h, other_inputs], axis<span class="op">=</span><span class="bu">len</span>(h.get_shape())<span class="op">-</span><span class="dv">1</span>) h, v1 <span class="op">=</span> layer_dense_bn_relu(h, <span class="dv">1024</span>) h, v2 <span class="op">=</span> layer_dense_bn_relu(h, <span class="dv">1024</span>) l <span class="op">=</span> tf.layers.Dense(output_dim, activation<span class="op">=</span>output_activation, kernel_initializer<span class="op">=</span>kernel_initializer) <span class="cf">return</span> l(h), v1 <span class="op">+</span> v2 <span class="op">+</span> l.trainable_variables</code></pre></div> <h4 id="synthetic-grad-loss-wrappers-and-more-utilities">Synthetic grad / loss wrappers and more utilities</h4> <p>Synthetic loss is just like synthetic gradients except we are predicting a scalar loss and then computing the gradients 
with respect to that loss. I thought this would work similarly to the synthetic gradients, but it doesn’t seem to work at all (discussed below).</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> sg_wrapper(x, h, hvs, model, other_inputs<span class="op">=</span><span class="va">None</span>): <span class="co">&quot;&quot;&quot;</span> <span class="co"> Predicts grads for x, h, and hvs (vars between x and h) using model.</span> <span class="co"> Returns:</span> <span class="co"> - synth grad for x</span> <span class="co"> - synth grad for h</span> <span class="co"> - synth grads &amp; vars for hvs</span> <span class="co"> - sg model variables (so they can be trained)</span> <span class="co"> &quot;&quot;&quot;</span> sg, sg_vars <span class="op">=</span> model(h, h.get_shape()[<span class="op">-</span><span class="dv">1</span>], other_inputs<span class="op">=</span>other_inputs) xs <span class="op">=</span> hvs <span class="op">+</span> [x] gvs <span class="op">=</span> <span class="bu">list</span>(<span class="bu">zip</span>(tf.gradients(h, xs, grad_ys<span class="op">=</span>sg), xs)) <span class="cf">return</span> gvs[<span class="op">-</span><span class="dv">1</span>][<span class="dv">0</span>], sg, gvs[:<span class="op">-</span><span class="dv">1</span>], sg_vars <span class="kw">def</span> sl_wrapper(h, hvs, model, other_inputs<span class="op">=</span><span class="va">None</span>): <span class="co">&quot;&quot;&quot;</span> <span class="co"> Predicts loss given h, and produces grads_and_vars for hvs, using model.</span> <span class="co"> Returns:</span> <span class="co"> - synth loss for h</span> <span class="co"> - synth grads &amp; vars for hvs</span> <span class="co"> - model variables (so they can be trained)</span> <span class="co"> &quot;&quot;&quot;</span> sl, sl_vars <span class="op">=</span> model(h, <span class="dv">1</span>, tf.square, <span class="va">None</span>, other_inputs) gvs <span 
class="op">=</span> <span class="bu">list</span>(<span class="bu">zip</span>(tf.gradients(sl, hvs), hvs)) <span class="cf">return</span> sl, gvs, sl_vars <span class="kw">def</span> loss_grads_with_target(loss, vs, target): <span class="co">&quot;&quot;&quot;</span> <span class="co"> Returns grad and vars for vs and target with respect to loss. </span> <span class="co"> &quot;&quot;&quot;</span> xs <span class="op">=</span> vs <span class="op">+</span> [target] gvs <span class="op">=</span> <span class="bu">list</span>(<span class="bu">zip</span>(tf.gradients(loss, xs), xs)) <span class="cf">return</span> gvs[<span class="op">-</span><span class="dv">1</span>][<span class="dv">0</span>], gvs[:<span class="op">-</span><span class="dv">1</span>] <span class="kw">def</span> model_grads(output_target_vars_tuple): <span class="co">&quot;&quot;&quot;</span> <span class="co"> Returns grads and vars for models given an iterable of tuples of</span> <span class="co"> (model output, model target, model variables).</span> <span class="co"> &quot;&quot;&quot;</span> gvs <span class="op">=</span> [] <span class="cf">for</span> prediction, target, vs <span class="kw">in</span> output_target_vars_tuple: loss <span class="op">=</span> tf.losses.mean_squared_error(prediction, target) gvs <span class="op">+=</span> <span class="bu">list</span>(<span class="bu">zip</span>(tf.gradients(loss, vs), vs)) <span class="cf">return</span> gvs</code></pre></div> <h4 id="mnist-experiment">MNIST Experiment</h4> <p>Note: the paper claims that the learning rate was not optimized, but I found that the results are quite sensitive to changes in the learning rate.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> build_graph_mnist_fcn(sg<span class="op">=</span><span class="va">False</span>, sl<span class="op">=</span><span class="va">False</span>, conditioned<span class="op">=</span><span class="va">False</span>, no_bprop<span 
class="op">=</span><span class="va">False</span>): reset_graph() g <span class="op">=</span> {} g[<span class="st">&#39;training&#39;</span>] <span class="op">=</span> training <span class="op">=</span> tf.placeholder_with_default(<span class="va">True</span>, []) g[<span class="st">&#39;x&#39;</span>] <span class="op">=</span> x <span class="op">=</span> tf.placeholder(tf.float32, [<span class="va">None</span>, <span class="dv">784</span>], name<span class="op">=</span><span class="st">&#39;x_placeholder&#39;</span>) g[<span class="st">&#39;y&#39;</span>] <span class="op">=</span> y <span class="op">=</span> tf.placeholder(tf.int64, [<span class="va">None</span>], name<span class="op">=</span><span class="st">&#39;y_placeholder&#39;</span>) other_inputs <span class="op">=</span> <span class="va">None</span> <span class="cf">if</span> conditioned: other_inputs <span class="op">=</span> tf.one_hot(y, <span class="dv">10</span>) h1, h1vs <span class="op">=</span> layer_dense_bn_relu(x, <span class="dv">256</span>, training) <span class="cf">if</span> sg: _, sg1, gvs1, svars1 <span class="op">=</span> sg_wrapper(x, h1, h1vs, model_two_layer, other_inputs) <span class="cf">elif</span> sl: sl1, gvs1, svars1 <span class="op">=</span> sl_wrapper(h1, h1vs, model_two_layer, other_inputs) h2, h2vs <span class="op">=</span> layer_dense_bn_relu(h1, <span class="dv">256</span>, training) <span class="cf">if</span> sg: sg1_target, sg2, gvs2, svars2 <span class="op">=</span> sg_wrapper(h1, h2, h2vs, model_two_layer, other_inputs) <span class="cf">elif</span> sl: sl2, gvs2, svars2 <span class="op">=</span> sl_wrapper(h2, h2vs, model_two_layer, other_inputs) logit_layer <span class="op">=</span> tf.layers.Dense(<span class="dv">10</span>) logits <span class="op">=</span> logit_layer(h2) logit_vs <span class="op">=</span> logit_layer.trainable_variables g[<span class="st">&#39;loss&#39;</span>] <span class="op">=</span> loss <span class="op">=\</span> 
tf.nn.sparse_softmax_cross_entropy_with_logits(logits<span class="op">=</span>logits, labels<span class="op">=</span>y) <span class="cf">if</span> sg: sg2_target, gvs3 <span class="op">=</span> loss_grads_with_target(loss, logit_vs, h2) gvs_sg <span class="op">=</span> model_grads([(sg1, sg1_target, svars1), (sg2, sg2_target, svars2)]) <span class="cf">elif</span> sl: gvs3 <span class="op">=</span> <span class="bu">list</span>(<span class="bu">zip</span>(tf.gradients(loss, logit_vs), logit_vs)) gvs_sl <span class="op">=</span> model_grads([(sl1, sl2, svars1), (sl2, tf.expand_dims(loss, <span class="dv">1</span>), svars2)]) <span class="cf">elif</span> no_bprop: gvs3 <span class="op">=</span> <span class="bu">list</span>(<span class="bu">zip</span>(tf.gradients(loss, logit_vs), logit_vs)) <span class="cf">with</span> tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): opt <span class="op">=</span> tf.train.AdamOptimizer(<span class="fl">3e-5</span>) <span class="cf">if</span> sg: g[<span class="st">&#39;ts&#39;</span>] <span class="op">=\</span> opt.apply_gradients(gvs1 <span class="op">+</span> gvs2 <span class="op">+</span> gvs3 <span class="op">+</span> gvs_sg) <span class="cf">elif</span> sl: g[<span class="st">&#39;ts&#39;</span>] <span class="op">=\</span> opt.apply_gradients(gvs1 <span class="op">+</span> gvs2 <span class="op">+</span> gvs3 <span class="op">+</span> gvs_sl) <span class="cf">elif</span> no_bprop: g[<span class="st">&#39;ts&#39;</span>] <span class="op">=\</span> opt.apply_gradients(gvs3) <span class="cf">else</span>: g[<span class="st">&#39;ts&#39;</span>] <span class="op">=\</span> opt.minimize(loss) g[<span class="st">&#39;accuracy&#39;</span>] <span class="op">=</span> tf.reduce_mean(tf.cast(tf.equal(tf.argmax(logits, <span class="dv">1</span>), y), tf.float32)) g[<span class="st">&#39;init&#39;</span>] <span class="op">=</span> tf.global_variables_initializer() <span class="cf">return</span> g</code></pre></div> <div 
class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> train(graph, iters <span class="op">=</span> <span class="dv">25000</span>, batch_size <span class="op">=</span> <span class="dv">256</span>): g <span class="op">=</span> graph res_tr <span class="op">=</span> [] res_te <span class="op">=</span> [] batches_per_epoch <span class="op">=</span> <span class="bu">len</span>(xtr)<span class="op">//</span>batch_size num_epochs <span class="op">=</span> iters <span class="op">//</span> batches_per_epoch <span class="cf">with</span> tf.Session() <span class="im">as</span> sess: sess.run(g[<span class="st">&#39;init&#39;</span>]) <span class="cf">for</span> epoch <span class="kw">in</span> <span class="bu">range</span>(num_epochs): x, y <span class="op">=</span> shuffle(xtr, ytr) acc <span class="op">=</span> <span class="dv">0</span> <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(batches_per_epoch): feed_dict <span class="op">=</span> {g[<span class="st">&#39;x&#39;</span>]: x[i<span class="op">*</span>batch_size:(i<span class="op">+</span><span class="dv">1</span>)<span class="op">*</span>batch_size], g[<span class="st">&#39;y&#39;</span>]: y[i<span class="op">*</span>batch_size:(i<span class="op">+</span><span class="dv">1</span>)<span class="op">*</span>batch_size]} acc_, _ <span class="op">=</span> sess.run([g[<span class="st">&#39;accuracy&#39;</span>], g[<span class="st">&#39;ts&#39;</span>]], feed_dict) acc <span class="op">+=</span> acc_ <span class="cf">if</span> (i<span class="op">+</span><span class="dv">1</span>) <span class="op">%</span> batches_per_epoch <span class="op">==</span> <span class="dv">0</span>: res_tr.append(acc <span class="op">/</span> batches_per_epoch) acc_te <span class="op">=</span> <span class="dv">0</span> <span class="cf">for</span> j <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">10</span>): feed_dict <span 
class="op">=</span> {g[<span class="st">&#39;x&#39;</span>]: xte[j<span class="op">*</span><span class="dv">1000</span>:(j<span class="op">+</span><span class="dv">1</span>)<span class="op">*</span><span class="dv">1000</span>], g[<span class="st">&#39;y&#39;</span>]: yte[j<span class="op">*</span><span class="dv">1000</span>:(j<span class="op">+</span><span class="dv">1</span>)<span class="op">*</span><span class="dv">1000</span>], g[<span class="st">&#39;training&#39;</span>]: <span class="va">False</span>} acc_te <span class="op">+=</span> sess.run(g[<span class="st">&#39;accuracy&#39;</span>], feed_dict) acc_te <span class="op">/=</span> <span class="fl">10.</span> res_te.append(acc_te) <span class="bu">print</span>(<span class="st">&quot;</span><span class="ch">\r</span><span class="st">Epoch </span><span class="sc">{}</span><span class="st">/</span><span class="sc">{}</span><span class="st">: </span><span class="sc">{:4f}</span><span class="st"> (TR) </span><span class="sc">{:4f}</span><span class="st"> (TE)&quot;</span>\ .<span class="bu">format</span>(epoch, num_epochs, acc<span class="op">/</span>batches_per_epoch, acc_te), end<span class="op">=</span><span class="st">&#39;&#39;</span>) acc <span class="op">=</span> <span class="dv">0</span> <span class="cf">return</span> res_tr, res_te</code></pre></div> <h4 id="results">Results</h4> <p>Below we are running only 25k iterations, which is enough to get the point (the 500k from the paper is quite excessive!).</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">t <span class="op">=</span> time.time() g <span class="op">=</span> build_graph_mnist_fcn() <span class="co"># baseline</span> _, res_baseline <span class="op">=</span> train(g) <span class="bu">print</span>(<span class="st">&quot;</span><span class="ch">\n</span><span class="st">Took </span><span class="sc">{}</span><span class="st"> seconds!&quot;</span>.<span class="bu">format</span>(time.time() <span 
class="op">-</span> t))</code></pre></div> <pre><code>Epoch 105/106: 1.000000 (TR) 0.980100 (TE) Took 54.59836196899414 seconds!</code></pre> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">t <span class="op">=</span> time.time() g <span class="op">=</span> build_graph_mnist_fcn(no_bprop<span class="op">=</span><span class="va">True</span>) _, res_no_bprop <span class="op">=</span> train(g) <span class="bu">print</span>(<span class="st">&quot;</span><span class="ch">\n</span><span class="st">Took </span><span class="sc">{}</span><span class="st"> seconds!&quot;</span>.<span class="bu">format</span>(time.time() <span class="op">-</span> t))</code></pre></div> <pre><code>Epoch 105/106: 0.881460 (TR) 0.889000 (TE) Took 33.66543793678284 seconds!</code></pre> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">t <span class="op">=</span> time.time() g <span class="op">=</span> build_graph_mnist_fcn(sl<span class="op">=</span><span class="va">True</span>) _, res_sl <span class="op">=</span> train(g) <span class="bu">print</span>(<span class="st">&quot;</span><span class="ch">\n</span><span class="st">Took </span><span class="sc">{}</span><span class="st"> seconds!&quot;</span>.<span class="bu">format</span>(time.time() <span class="op">-</span> t))</code></pre></div> <pre><code>Epoch 105/106: 0.832816 (TR) 0.842900 (TE) Took 137.18904900550842 seconds!</code></pre> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">t <span class="op">=</span> time.time() g <span class="op">=</span> build_graph_mnist_fcn(sg<span class="op">=</span><span class="va">True</span>) _, res_sg <span class="op">=</span> train(g) <span class="bu">print</span>(<span class="st">&quot;</span><span class="ch">\n</span><span class="st">Took </span><span class="sc">{}</span><span class="st"> seconds!&quot;</span>.<span class="bu">format</span>(time.time() <span class="op">-</span> 
t))</code></pre></div> <pre><code>Epoch 105/106: 0.997162 (TR) 0.977700 (TE) Took 115.9250328540802 seconds!</code></pre> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">t <span class="op">=</span> time.time() g <span class="op">=</span> build_graph_mnist_fcn(sg<span class="op">=</span><span class="va">True</span>, conditioned<span class="op">=</span><span class="va">True</span>) _, res_sgc <span class="op">=</span> train(g) <span class="bu">print</span>(<span class="st">&quot;</span><span class="ch">\n</span><span class="st">Took </span><span class="sc">{}</span><span class="st"> seconds!&quot;</span>.<span class="bu">format</span>(time.time() <span class="op">-</span> t))</code></pre></div> <pre><code>Epoch 105/106: 0.999983 (TR) 0.980100 (TE) Took 117.7770209312439 seconds!</code></pre> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">plt.figure(figsize<span class="op">=</span>(<span class="dv">10</span>,<span class="dv">6</span>)) plt.plot(res_baseline, label<span class="op">=</span><span class="st">&quot;backprop&quot;</span>) plt.plot(res_no_bprop, label<span class="op">=</span><span class="st">&quot;no bprop&quot;</span>) plt.plot(res_sg, label<span class="op">=</span><span class="st">&quot;sg&quot;</span>) plt.plot(res_sgc, label<span class="op">=</span><span class="st">&quot;sg + c&quot;</span>) plt.plot(res_sl, label<span class="op">=</span><span class="st">&quot;sl&quot;</span>) plt.title(<span class="st">&quot;Synthetic Gradients on MNIST&quot;</span>) plt.xlabel(<span class="st">&quot;Epoch&quot;</span>) plt.ylabel(<span class="st">&quot;Accuracy&quot;</span>) plt.ylim([<span class="fl">0.5</span>,<span class="fl">1.</span>]) plt.legend()</code></pre></div> <figure> <img src="https://r2rt.com/static/images/synthetic_gradients/output_19_1.png" alt="MNIST FCN Results" /><figcaption>MNIST FCN Results</figcaption> </figure> <p>The results for synthetic gradients are similar to 
those in the paper over the first 100 epochs (25k mini-batches).</p> <p>We see that synthetic loss failed, doing worse than even the “no backpropagation” baseline (it is also the slowest approach!). This could be the result of a number of things (e.g., the loss distribution is bi-modal and hard to model, or perhaps I made a mistake in my implementation, as I did not debug extensively); I think, however, that there is something fundamentally wrong with doing gradient descent with respect to an approximated loss function. Though we might get a reasonable estimate of the loss, there is no guarantee that the gradient of our model will match the gradient of the actual loss. Imagine, for example, approximating a line with a zig-zag: one could get arbitrarily good approximations, but the gradient would always be wrong.</p> <h2 id="cifar-10-cnn-experiment">CIFAR 10 CNN Experiment</h2> <p>This is just to show how the implementation works with a CNN architecture. Once again, results match the paper.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">(xtr, ytr), (xte, yte) <span class="op">=</span> tf.keras.datasets.cifar10.load_data() xtr <span class="op">=</span> xtr.astype(np.float32) <span class="op">/</span> <span class="fl">255.</span> ytr <span class="op">=</span> ytr.reshape([<span class="op">-</span><span class="dv">1</span>]) xte <span class="op">=</span> xte.astype(np.float32) <span class="op">/</span> <span class="fl">255.</span> yte <span class="op">=</span> yte.reshape([<span class="op">-</span><span class="dv">1</span>])</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> layer_conv_bn_relu(h, num_filters, filter_dim, padding<span class="op">=</span><span class="st">&quot;same&quot;</span>, pooling<span class="op">=</span><span class="va">None</span>, training<span class="op">=</span><span class="va">True</span>): l <span class="op">=</span> 
tf.layers.Conv2D(num_filters, filter_dim, padding<span class="op">=</span>padding) h <span class="op">=</span> tf.layers.batch_normalization(l(h), training<span class="op">=</span>training) h <span class="op">=</span> tf.nn.relu(h) <span class="cf">if</span> pooling <span class="op">==</span> <span class="st">&quot;max&quot;</span>: h <span class="op">=</span> tf.layers.max_pooling2d(h, <span class="dv">3</span>, <span class="dv">3</span>) <span class="cf">elif</span> pooling <span class="op">==</span> <span class="st">&quot;avg&quot;</span>: h <span class="op">=</span> tf.layers.average_pooling2d(h, <span class="dv">3</span>, <span class="dv">3</span>) <span class="cf">return</span> h, l.trainable_variables <span class="kw">def</span> model_two_layer_conv(h, output_dim, other_inputs<span class="op">=</span><span class="va">None</span>): <span class="co">&quot;&quot;&quot;</span> <span class="co"> h is what we are computing the synth grads for, channels last data format</span> <span class="co"> other_inputs is vector of other inputs, assumed to have same non-channel dims</span> <span class="co"> &quot;&quot;&quot;</span> <span class="cf">if</span> other_inputs <span class="kw">is</span> <span class="kw">not</span> <span class="va">None</span>: h <span class="op">=</span> tf.concat([h, other_inputs], axis<span class="op">=</span><span class="bu">len</span>(h.get_shape())<span class="op">-</span><span class="dv">1</span>) h, v1 <span class="op">=</span> layer_conv_bn_relu(h, <span class="dv">128</span>, <span class="dv">5</span>, padding<span class="op">=</span><span class="st">&quot;same&quot;</span>) h, v2 <span class="op">=</span> layer_conv_bn_relu(h, <span class="dv">128</span>, <span class="dv">5</span>, padding<span class="op">=</span><span class="st">&quot;same&quot;</span>) l <span class="op">=</span> tf.layers.Conv2D(output_dim, <span class="dv">5</span>, padding<span class="op">=</span><span class="st">&quot;same&quot;</span>, kernel_initializer<span 
class="op">=</span>tf.zeros_initializer) <span class="cf">return</span> l(h), v1 <span class="op">+</span> v2 <span class="op">+</span> l.trainable_variables</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> build_graph_cifar_cnn(sg<span class="op">=</span><span class="va">False</span>): reset_graph() g <span class="op">=</span> {} g[<span class="st">&#39;training&#39;</span>] <span class="op">=</span> training <span class="op">=</span> tf.placeholder_with_default(<span class="va">True</span>, []) g[<span class="st">&#39;x&#39;</span>] <span class="op">=</span> x <span class="op">=</span> tf.placeholder(tf.float32, [<span class="va">None</span>, <span class="dv">32</span>, <span class="dv">32</span>, <span class="dv">3</span>], name<span class="op">=</span><span class="st">&#39;x_placeholder&#39;</span>) g[<span class="st">&#39;y&#39;</span>] <span class="op">=</span> y <span class="op">=</span> tf.placeholder(tf.int64, [<span class="va">None</span>], name<span class="op">=</span><span class="st">&#39;y_placeholder&#39;</span>) h1, h1vs <span class="op">=</span> layer_conv_bn_relu(x, <span class="dv">128</span>, <span class="dv">5</span>, <span class="st">&#39;same&#39;</span>, <span class="st">&#39;max&#39;</span>, training<span class="op">=</span>training) <span class="cf">if</span> sg: _, sg1, gvs1, svars1 <span class="op">=</span> sg_wrapper(x, h1, h1vs, model_two_layer_conv) h2, h2vs <span class="op">=</span> layer_conv_bn_relu(h1, <span class="dv">128</span>, <span class="dv">5</span>, <span class="st">&#39;same&#39;</span>, <span class="st">&#39;avg&#39;</span>, training<span class="op">=</span>training) <span class="cf">if</span> sg: sg1_target, sg2, gvs2, svars2 <span class="op">=</span> sg_wrapper(h1, h2, h2vs, model_two_layer_conv) h <span class="op">=</span> tf.reshape(h2, [<span class="op">-</span><span class="dv">1</span>, <span class="dv">9</span><span 
class="op">*</span><span class="dv">128</span>]) logit_layer <span class="op">=</span> tf.layers.Dense(<span class="dv">10</span>) logits <span class="op">=</span> logit_layer(h) logit_vs <span class="op">=</span> logit_layer.trainable_variables g[<span class="st">&#39;loss&#39;</span>] <span class="op">=</span> loss <span class="op">=</span> tf.nn.sparse_softmax_cross_entropy_with_logits(logits<span class="op">=</span>logits, labels<span class="op">=</span>y) <span class="cf">if</span> sg: sg2_target, gvs3 <span class="op">=</span> loss_grads_with_target(loss, logit_vs, h2) gvs_sg <span class="op">=</span> model_grads([(sg1, sg1_target, svars1), (sg2, sg2_target, svars2)]) <span class="cf">with</span> tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): opt <span class="op">=</span> tf.train.AdamOptimizer(<span class="fl">3e-5</span>) <span class="cf">if</span> sg: g[<span class="st">&#39;ts&#39;</span>] <span class="op">=\</span> opt.apply_gradients(gvs1 <span class="op">+</span> gvs2 <span class="op">+</span> gvs3 <span class="op">+</span> gvs_sg) <span class="cf">else</span>: g[<span class="st">&#39;ts&#39;</span>] <span class="op">=\</span> opt.minimize(loss) g[<span class="st">&#39;accuracy&#39;</span>] <span class="op">=</span> tf.reduce_mean(tf.cast(tf.equal(tf.argmax(logits, <span class="dv">1</span>), y), tf.float32)) g[<span class="st">&#39;init&#39;</span>] <span class="op">=</span> tf.global_variables_initializer() <span class="cf">return</span> g</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">t <span class="op">=</span> time.time() g <span class="op">=</span> build_graph_cifar_cnn(sg<span class="op">=</span><span class="va">True</span>) res_tr, res_sg <span class="op">=</span> train(g, iters<span class="op">=</span><span class="dv">25000</span>) <span class="bu">print</span>(<span class="st">&quot;</span><span class="ch">\n</span><span class="st">Took </span><span 
class="sc">{}</span><span class="st"> seconds&quot;</span>.<span class="bu">format</span>(time.time() <span class="op">-</span> t))</code></pre></div> <pre><code>Epoch 127/128: 0.774700 (TR) 0.648300 (TE) Took 943.7978417873383 seconds</code></pre> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">t <span class="op">=</span> time.time() g <span class="op">=</span> build_graph_cifar_cnn() <span class="co">#baseline</span> res_tr_backprop, res_backprop <span class="op">=</span> train(g, iters<span class="op">=</span><span class="dv">25000</span>) <span class="bu">print</span>(<span class="st">&quot;</span><span class="ch">\n</span><span class="st">Took </span><span class="sc">{}</span><span class="st"> seconds&quot;</span>.<span class="bu">format</span>(time.time() <span class="op">-</span> t))</code></pre></div> <pre><code>Epoch 127/128: 0.901683 (TR) 0.752400 (TE) Took 584.2685778141022 seconds</code></pre> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">plt.figure(figsize<span class="op">=</span>(<span class="dv">10</span>,<span class="dv">6</span>)) plt.plot(res_backprop, label<span class="op">=</span><span class="st">&quot;backprop&quot;</span>) plt.plot(res_sg, label<span class="op">=</span><span class="st">&quot;sg&quot;</span>) plt.title(<span class="st">&quot;Synthetic Gradients on CIFAR (CNN)&quot;</span>) plt.xlabel(<span class="st">&quot;Epoch&quot;</span>) plt.ylabel(<span class="st">&quot;Accuracy&quot;</span>) plt.legend()</code></pre></div> <figure> <img src="https://r2rt.com/static/images/synthetic_gradients/output_27_1.png" alt="CIFAR 10 CNN Results" /><figcaption>CIFAR 10 CNN Results</figcaption> </figure> </body> </html> Deconstruction with Discrete Embeddings2017-02-15T00:00:00-05:002017-02-15T00:00:00-05:00Silviu Pitistag:r2rt.com,2017-02-15:/deconstruction-with-discrete-embeddings.htmlIn my post Beyond Binary, I showed how easy it is to create trainable "one-hot" 
neurons with the straight-through estimator. My motivation for this is made clear in this post, in which I demonstrate the potential of discrete embeddings. In short, discrete embeddings allow for explicit deconstruction of inherently fuzzy data, which allows us to apply explicit reasoning and algorithms over the data, and communicate fuzzy ideas with concrete symbols. Using discrete embeddings, we can (1) create a language model over the embeddings, which immediately gives us access to RNN-based generation of internal embeddings (and sequences thereof), and (2) index sub-parts of the embeddings, instead of entire embedding vectors, which gives us (i.e., our agents) access to search techniques that go beyond cosine similarity, such as phrase search and search using lightweight structure.<!DOCTYPE html> <html> <head> <meta charset="utf-8"> <meta name="generator" content="pandoc"> <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes"> <title></title> <style type="text/css">code{white-space: pre;}</style> <style type="text/css"> div.sourceCode { overflow-x: auto; } table.sourceCode, tr.sourceCode, td.lineNumbers, td.sourceCode { margin: 0; padding: 0; vertical-align: baseline; border: none; } table.sourceCode { width: 100%; line-height: 100%; } td.lineNumbers { text-align: right; padding-right: 4px; padding-left: 4px; color: #aaaaaa; border-right: 1px solid #aaaaaa; } td.sourceCode { padding-left: 5px; } code > span.kw { color: #007020; font-weight: bold; } /* Keyword */ code > span.dt { color: #902000; } /* DataType */ code > span.dv { color: #40a070; } /* DecVal */ code > span.bn { color: #40a070; } /* BaseN */ code > span.fl { color: #40a070; } /* Float */ code > span.ch { color: #4070a0; } /* Char */ code > span.st { color: #4070a0; } /* String */ code > span.co { color: #60a0b0; font-style: italic; } /* Comment */ code > span.ot { color: #007020; } /* Other */ code > span.al { color: #ff0000; font-weight: bold; } /* Alert */ 
code > span.fu { color: #06287e; } /* Function */ code > span.er { color: #ff0000; font-weight: bold; } /* Error */ code > span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */ code > span.cn { color: #880000; } /* Constant */ code > span.sc { color: #4070a0; } /* SpecialChar */ code > span.vs { color: #4070a0; } /* VerbatimString */ code > span.ss { color: #bb6688; } /* SpecialString */ code > span.im { } /* Import */ code > span.va { color: #19177c; } /* Variable */ code > span.cf { color: #007020; font-weight: bold; } /* ControlFlow */ code > span.op { color: #666666; } /* Operator */ code > span.bu { } /* BuiltIn */ code > span.ex { } /* Extension */ code > span.pp { color: #bc7a00; } /* Preprocessor */ code > span.at { color: #7d9029; } /* Attribute */ code > span.do { color: #ba2121; font-style: italic; } /* Documentation */ code > span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */ code > span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */ code > span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */ </style> <script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-MML-AM_CHTML" type="text/javascript"></script> <!--[if lt IE 9]> <script src="//cdnjs.cloudflare.com/ajax/libs/html5shiv/3.7.3/html5shiv-printshiv.min.js"></script> <![endif]--> </head> <body> <p>In my post <a href="http://r2rt.com/beyond-binary-ternary-and-one-hot-neurons.html">Beyond Binary</a>, I showed how easy it is to create trainable “one-hot” neurons with the straight-through estimator. My motivation for this is made clear in this post, in which I demonstrate the potential of discrete embeddings. In short, discrete embeddings allow for explicit deconstruction of inherently fuzzy data, which allows us to apply explicit reasoning and algorithms over the data, and communicate fuzzy ideas with concrete symbols. 
Using discrete embeddings, we can (1) create a language model over the embeddings, which immediately gives us access to RNN-based generation of internal embeddings (and sequences thereof), and (2) index sub-parts of the embeddings, instead of entire embedding vectors, which gives us (i.e., our agents) access to search techniques that go beyond cosine similarity, such as phrase search and search using lightweight structure.</p> <p>The ultimate task we will tackle in this post is as follows:</p> <p>Suppose we arranged the MNIST training images in a sequence. Does it contain any subsequences of three consecutive digits that have the following three features, respectively, and if so, where:</p> <figure> <img src="https://r2rt.com/static/images/DEE_feature_query.png" alt="png" /><figcaption>png</figcaption> </figure> <p>The proposed solution does not use class labels, and does not involve any vectors representing sequences of digits such as RNN states. Instead, we will make use of an inverted index over the symbolic language that discrete embeddings provide. The proposed solution is not perfect, and reveals the general challenge that we will face in our quest to enable AIs to represent fuzzy concepts with explicit symbols.
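<p>Since the post will lean on an inverted index over these discrete "words", here is a minimal sketch of how such an index supports a consecutive-digit query. The corpus and the (neuron, dimension) tokens are invented for illustration; this is not the post's code or its trained embeddings:</p>

```python
from collections import defaultdict

# Hypothetical discrete embeddings: each image is a set of "words",
# one per one-hot neuron, written as a (neuron_id, active_dim) pair.
corpus = [
    {(0, 3), (1, 5), (2, 1)},   # image 0
    {(0, 3), (1, 2), (2, 6)},   # image 1
    {(0, 4), (1, 5), (2, 1)},   # image 2
]

# Inverted index: word -> sorted list of image positions containing it.
index = defaultdict(list)
for pos, words in enumerate(corpus):
    for w in sorted(words):
        index[w].append(pos)

def find_consecutive(features):
    """Start positions where images pos, pos+1, ... contain the queried
    features respectively: a simple "phrase" query over the sequence."""
    return sorted(pos for pos in index[features[0]]
                  if all(pos + i in index[f] for i, f in enumerate(features)))

print(find_consecutive([(0, 3), (1, 5)]))  # -> [1]: image 1 has (0,3), image 2 has (1,5)
```

<p>The same mechanics scale up to the 80-neuron, 8-dimension vocabulary used below; only the posting lists get longer.</p>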
It will be interesting to see if the approach presented can be improved to the point where it becomes practical to use as a general technique.</p> <p><strong>Link to IPython notebook</strong>: This post is available <a href="https://github.com/spitis/r2rt-deconstruction">here</a> as an IPython notebook, together with the supporting code and trained models.</p> <h4 id="human-language-and-stage-iii-architectures">Human language and Stage III architectures</h4> <p>A comparison to human language is apt: although our thoughts are generally fuzzy and fraught with ambiguities (like real-valued embeddings (e.g., consider how many reasonable sequences could be decoded from the dense vector representation of a sequence in a sequence autoencoder)), much of our communication is explicit (of course, there is also body language, intonation, etc., which might be considered as real-valued, fuzzy communication channels).</p> <p>If we think about the future of AI, I think this comparison is doubly important. In my mind, to date, there have been two major paradigms in AI:</p> <ul> <li><p><strong>Stage I: Traditional AI based on explicit programming</strong></p> <p>In the 20th century, many advances in AI were based on explicit algorithms and data structures, such as <a href="https://en.wikipedia.org/wiki/A*_search_algorithm">A* search</a>, <a href="https://en.wikipedia.org/wiki/Frame_(artificial_intelligence)">frames</a>, <a href="https://en.wikipedia.org/wiki/Case-based_reasoning">case-based reasoning</a>, etc. A representative computer program based primarily on traditional AI was IBM’s <a href="https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer)">Deep Blue</a>.</p></li> <li><p><strong>Stage II: Machine learning based on implicit programming</strong></p> <p>More recently, the advances in AI have been primarily based on data, which is used to implicitly program the computer.
We tell the computer what we want it to do by defining objectives and rewards, providing it with a general learning algorithm, and feeding it data. It learns on its own how to accomplish those ends: the means are never made explicit. A representative computer program based primarily on machine learning is DeepMind’s <a href="https://en.wikipedia.org/wiki/AlphaGo">AlphaGo</a>.</p></li> </ul> <p>Inevitably, I see the third major paradigm of AI to be the following hybrid:</p> <ul> <li><p><strong>Stage III: Implicit programming via explicit symbols, AKA programming via natural language</strong>:</p> <p>As a prerequisite to strong AI, our computers will need to be able to communicate with humans via a higher-order interface like natural language. If such an interface becomes strong enough, and computers are given sufficient dominion over their internal functionality, we can teach them by talking to them. This requires that the computer be able to represent fuzzy concepts using explicit symbols. A representative (fictional) computer program that employed this approach is Iron Man’s <a href="https://en.wikipedia.org/wiki/Edwin_Jarvis">J.A.R.V.I.S.</a>.</p></li> </ul> <p>One very cool example of an architecture exhibiting Stage III abilities is the VAE-GAN, presented in <a href="http://www.jmlr.org/proceedings/papers/v48/larsen16.html">Larsen et al. (2016)</a>. The VAE-GAN is able to generate new images and modify old images according to high-order discrete features. Here is the impressive Figure 6 from Larsen et al. (2016):</p> <figure> <img src="https://r2rt.com/static/images/DEE_vaeganfigure6.png" alt="png" /><figcaption>png</figcaption> </figure> <p>You can see how, given certain discrete features (Bald, Bangs, Black Hair, etc.), the VAE-GAN can modify the input image so as to include those features. In the same way, it can also generate new samples.
This is a human-like ability that falls squarely within Stage III.</p> <p>This ability is similar in function to, but (per my current understanding) quite different in implementation from, the generative method we explore below. Because I’ve read next to nothing about GANs, I won’t be able to give good commentary in this post. Nevertheless, I think the similarities are strong enough to merit an in-depth analysis. Indeed, a quick scan shows that certain VAE and GAN architectures have already adopted a form of discrete embeddings (e.g., <a href="https://arxiv.org/abs/1606.03657">this paper (InfoGAN)</a> and <a href="https://arxiv.org/abs/1609.02200">this paper (Discrete VAE)</a>), but I haven’t had a chance to actually read any of these papers yet. I’m not sure how similar/different they are from what I present in this post, and I plan to report back on VAEs/GANs and other architectures after I’ve gotten my hands dirty with them.<a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a></p> <p>Because I only discuss discrete embeddings below, I wanted to emphasize here the dual nature of discrete and real embeddings. Certainly a discrete symbol can be embedded in real space (see, e.g., word2vec). In this post, we are specifically concerned with embedding real vectors in discrete space (maybe I should have titled this post “vec2word”). Each approach has its own advantages. For example, it is far more difficult to capture continuous features like width, height, weight, angle, etc. with discrete features. There is nothing stopping us from creating models that have both real and discrete internal embeddings (either as a mixed embedding, or as separate/interchangeable <em>dual</em> embeddings).
In fact, we can keep all of our models that use real embeddings exactly the way they are, and add an internal discrete dual of the real embeddings via a discrete autoencoder (a la Part I of this post)—this alone will give us access to the capabilities demonstrated in this post.</p> <p>With that said, let’s dive in. The ideas in this post are a long ways away from taking us all the way to Stage III, but I think they’re a step in the right direction. We proceed as follows:</p> <ul> <li><p>First, we’ll build a dead simple MNIST autoencoder with an explicit hidden layer and demonstrate that discrete embeddings are sufficiently expressive to achieve good performance. We’ll then visualize the embeddings to get a sense of what each neuron represents.</p></li> <li><p>Then, we’ll show how we can create an RNN-based generative model of MNIST, and demonstrate a few cool things that it can do, including generating samples based on a prototype (i.e., generate “more like this”) and generating samples based on a specific feature.</p></li> <li><p>Finally, we’ll show how we can apply traditional information retrieval (IR)—which falls squarely within the Stage I paradigm—to do <em>post-classification</em> over the features of a Stage II model, and in particular, over sequences of such features.</p></li> </ul> <p>Throughout our journey, we will repeatedly come face-to-face with the following general challenge:</p> <ul> <li>In order to represent certain fuzzy ideas with explicit codes (words), we need a lot of explicit codes (words). But what happens if the fuzzy ideas are larger than a single word? Then we need to compose words. But no matter how we compose words, we have a line-drawing problem: what is included inside the fuzzy concept, and what is not?</li> </ul> <p>This challenge is inherent to language in general, and is one of the fundamental problems of the legal domain.
If it had an easy solution, legalese would not exist, judicial dockets would not be so full, and policy drafting would be a whole lot easier. I will refer to this problem throughout as the problem of <strong>explicit definition</strong>.</p> <h2 id="part-i-an-autocoder-with-discrete-embeddings">Part I: An autoencoder with discrete embeddings</h2> <p>The purpose of this part is (1) to prepare us for Parts II and III, and (2) to give us some intuitions about how discrete embeddings might represent real-valued embeddings, but also (3) to show that real embeddings and discrete embeddings can effectively be interchanged (we can encode one into the other and back), and can therefore co-exist within the same model. This means that everything we do in Parts II and III can be applied to any layer of a real-valued model by discretely autoencoding that layer.</p> <h4 id="imports-and-helper-functions">Imports and helper functions</h4> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> numpy <span class="im">as</span> np, tensorflow <span class="im">as</span> tf, matplotlib.pyplot <span class="im">as</span> plt <span class="im">from</span> tensorflow.examples.tutorials.mnist <span class="im">import</span> input_data <span class="op">%</span>matplotlib inline mnist <span class="op">=</span> input_data.read_data_sets(<span class="st">&#39;MNIST_data&#39;</span>, one_hot<span class="op">=</span><span class="va">True</span>) <span class="co"># deconstruction.py: https://github.com/spitis/r2rt-deconstruction/blob/master/deconstruction.py</span> <span class="im">from</span> deconstruction <span class="im">import</span> <span class="op">*</span></code></pre></div> <h4 id="autoencoder-architecture">Autoencoder architecture</h4> <p>The architecture of our autoencoder is very simple. We use a single convolutional layer with 16 5x5 filters, followed by a 2x2 max pooling layer.
Then there is the fully-connected (FC) embedding layer, which consists of either explicit (one-hot) neurons, real neurons, or both. The embedding layer is then projected back into the original input space using a fully-connected projection followed by a sigmoid.</p> <p>For our explicit autoencoder, we use 80 explicit neurons, each with 8 dimensions. We also enforce, via training, that the first dimension of each neuron is dead by requiring the network to project all 0s when the explicit neurons are all hot in the first dimension (this will be useful for certain visualizations).</p> <p>To recap, the layers used are:</p> <ol type="1"> <li>Input</li> <li>5x5 Conv (16 dims)</li> <li>FC with 8-dimensional one-hot activation (80 neurons, 640 binary dimensions), dead in the first dimension</li> <li>FC sigmoid projection</li> </ol> <p>We train the network with simple gradient descent with a batch size of 50 and a learning rate of 1e-1, after an initial “warm-up” epoch that uses a learning rate of 1e-4 (so that the one-hot neurons learn to fire more predictably).</p> <p><strong>Important note on mixed embeddings</strong>: In case you try this at home with a mix of real and explicit neurons (within the same layer), beware that the explicit neurons are noisy during training. This will cause the model to put its faith in the real neurons (which fire deterministically), and will cause the most important information to be stored in the real neurons.
If training a mixed model, it is advised that you either apply heavy noise to the real neurons (e.g., via dropout or gaussian noise) or pretrain the explicit neurons to capture as much information as they can, and then train the real neurons afterwards so that they finetune the explicit neurons (because it makes much more sense, at least to me, to finetune explicit chunks with real values, than vice versa).</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">g <span class="op">=</span> build_classifier(onehot_dims <span class="op">=</span> <span class="dv">8</span>, explicit_neurons <span class="op">=</span> <span class="dv">80</span>) sess <span class="op">=</span> tf.InteractiveSession() sess.run(tf.global_variables_initializer()) <span class="co"># Uncomment to run training, or use the below checkpoint to load</span> <span class="co"># tr, va = train_autoencoder(g, sess, num_epochs=30)</span> saver <span class="op">=</span> tf.train.Saver() saver.restore(sess, <span class="st">&#39;./ckpts/trained_80x8_autoencoder&#39;</span>)</code></pre></div> <h4 id="autoencoder-results">Autoencoder results</h4> <p>Training for 30 epochs gets us to a loss between 5.5 and 6.0 on the validation set, and with some more training, we can get the loss down to 5.0. By comparison if we train the same architecture using 80 real neurons, we can easily achieve losses under 4.0. Nevertheless, while our 8-dimensional (really 7, because of the dead dimension) one-hot neurons are less expressive, they still have a lot of power: <span class="math inline">$$7^{80}$$</span> possible combinations is a big number. 
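<p>As a reminder of what the explicit neurons inside build_classifier are doing, here is a framework-free sketch of the straight-through one-hot trick from Beyond Binary: discretize at the argmax on the forward pass, and pretend the discretization was the identity on the backward pass. The shapes and values below are toy assumptions, not the trained model:</p>

```python
import numpy as np

def onehot_forward(logits):
    """Forward pass: hard one-hot at the argmax of each neuron's logits."""
    return np.eye(logits.shape[-1])[logits.argmax(axis=-1)]

def onehot_backward(grad_wrt_output):
    """Straight-through backward pass: treat the hard argmax as if it
    were the identity, passing the upstream gradient through unchanged."""
    return grad_wrt_output

logits = np.array([[0.2, 1.7, -0.3, 0.9]])   # one toy 4-dimensional neuron
print(onehot_forward(logits))                # [[0. 1. 0. 0.]]
```

<p>The forward discretization is what makes the embedding a symbolic "word"; the identity backward pass is what keeps it trainable by gradient descent.</p>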
A visual comparison of inputs and projections on the first 12 images of the MNIST test set is shown below.</p> <h5 id="original-mnist-top-vs-projections-bottom">Original MNIST (top) vs projections (bottom)</h5> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">projs <span class="op">=</span> sess.run(g[<span class="st">&#39;projection&#39;</span>], feed_dict<span class="op">=</span>{g[<span class="st">&#39;x&#39;</span>]: mnist.test.images[:<span class="dv">12</span>], g[<span class="st">&#39;stochastic&#39;</span>]: <span class="va">False</span>}) plot_2xn(<span class="dv">12</span>,np.concatenate((mnist.test.images[:<span class="dv">12</span>], projs)))</code></pre></div> <figure> <img src="https://r2rt.com/static/images/DEE_output_11_0.png" alt="png" /><figcaption>png</figcaption> </figure> <p>As you can see, our projections of the first 12 MNIST test images are only slightly corrupted. The major features of the original images remain, with a few faded portions.</p> <h4 id="visualizing-features">Visualizing features</h4> <p>Given that our features are explicit, we can visualize each one precisely. Let’s have a go at it, by setting all neurons in the “off” position (first dimension), except for one, chosen at random. 
You can view the results below for 24 random samples that have a single random feature of a single random neuron turned on.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">emb_feed <span class="op">=</span> np.tile(np.array([[[<span class="dv">1</span>,<span class="dv">0</span>,<span class="dv">0</span>,<span class="dv">0</span>,<span class="dv">0</span>,<span class="dv">0</span>,<span class="dv">0</span>,<span class="dv">0</span>]]]), (<span class="dv">24</span>, <span class="dv">80</span>, <span class="dv">1</span>)) <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">24</span>): emb_feed[i][i<span class="op">%</span><span class="dv">80</span>] <span class="op">=</span> np.eye(<span class="dv">8</span>)[np.random.randint(<span class="dv">1</span>,<span class="dv">8</span>)] projs <span class="op">=</span> sess.run(g[<span class="st">&#39;projection&#39;</span>], feed_dict<span class="op">=</span>{g[<span class="st">&#39;embedding&#39;</span>]: emb_feed, g[<span class="st">&#39;stochastic&#39;</span>]: <span class="va">False</span>}) plot_2xn(<span class="dv">12</span>,projs)</code></pre></div> <figure> <img src="https://r2rt.com/static/images/DEE_output_14_0.png" alt="png" /><figcaption>png</figcaption> </figure> <p>Notice how several of the features are pure noise. This is because some neurons have activations that never fire: despite being able to fire in 7 different live positions, some neurons fire only in a subset of those 7. 
Let’s see the firing densities of the first 5 neurons on the MNIST validation set:</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">embs <span class="op">=</span> sess.run(g[<span class="st">&#39;embedding&#39;</span>], feed_dict<span class="op">=</span>{g[<span class="st">&#39;x&#39;</span>]: mnist.validation.images, g[<span class="st">&#39;stochastic&#39;</span>]: <span class="va">False</span>}) densities <span class="op">=</span> np.<span class="bu">sum</span>(embs, axis<span class="op">=</span><span class="dv">0</span>) <span class="op">/</span> np.<span class="bu">sum</span>(np.<span class="bu">sum</span>(embs, axis<span class="op">=</span><span class="dv">0</span>), axis <span class="op">=</span> <span class="dv">1</span>, keepdims<span class="op">=</span><span class="va">True</span>) <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">5</span>): <span class="bu">print</span>(<span class="st">&quot;</span><span class="sc">{:.2f}</span><span class="st"> - </span><span class="sc">{:.2f}</span><span class="st"> - </span><span class="sc">{:.2f}</span><span class="st"> - </span><span class="sc">{:.2f}</span><span class="st"> - </span><span class="sc">{:.2f}</span><span class="st"> - </span><span class="sc">{:.2f}</span><span class="st"> - </span><span class="sc">{:.2f}</span><span class="st"> - </span><span class="sc">{:.2f}</span><span class="st">&quot;</span>.<span class="bu">format</span>(<span class="op">*</span>densities[i]))</code></pre></div> <pre><code>0.00 - 0.16 - 0.21 - 0.17 - 0.16 - 0.18 - 0.00 - 0.12 0.00 - 0.00 - 0.14 - 0.18 - 0.18 - 0.17 - 0.19 - 0.14 0.00 - 0.22 - 0.19 - 0.15 - 0.16 - 0.15 - 0.00 - 0.13 0.00 - 0.26 - 0.20 - 0.00 - 0.23 - 0.12 - 0.07 - 0.11 0.00 - 0.09 - 0.00 - 0.22 - 0.29 - 0.10 - 0.12 - 0.18</code></pre> <p>Because we trained our network to consider the first dimension as dead, it never fires. 
Additionally, we see that the network has learned a few more dead dimensions: the first and third neurons never fire in the 7th position, the second neuron never fires in the 2nd position, the fourth neuron never fires in the 4th position, and the last neuron never fires in the 3rd position.</p> <p>These positions correspond to the “noisy” items in the grid above. If we only plot live features according to the densities we just computed, we eliminate the noisy features:</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">emb_feed <span class="op">=</span> np.tile(np.array([[[<span class="dv">1</span>,<span class="dv">0</span>,<span class="dv">0</span>,<span class="dv">0</span>,<span class="dv">0</span>,<span class="dv">0</span>,<span class="dv">0</span>,<span class="dv">0</span>]]]), (<span class="dv">100</span>, <span class="dv">80</span>, <span class="dv">1</span>)) <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">100</span>): emb_feed[i][i<span class="op">%</span><span class="dv">80</span>] <span class="op">=</span> np.eye(<span class="dv">8</span>)[np.random.choice([<span class="dv">0</span>,<span class="dv">1</span>,<span class="dv">2</span>,<span class="dv">3</span>,<span class="dv">4</span>,<span class="dv">5</span>,<span class="dv">6</span>,<span class="dv">7</span>], p<span class="op">=</span>densities[i<span class="op">%</span><span class="dv">80</span>])] projs <span class="op">=</span> sess.run(g[<span class="st">&#39;projection&#39;</span>], feed_dict<span class="op">=</span>{g[<span class="st">&#39;embedding&#39;</span>]: emb_feed, g[<span class="st">&#39;stochastic&#39;</span>]: <span class="va">False</span>}) plot_nxn(<span class="dv">10</span>,projs)</code></pre></div> <figure> <img src="https://r2rt.com/static/images/DEE_output_18_0.png" alt="png" /><figcaption>png</figcaption> </figure> <p>Pretty cool. 
There seems to be a fairly even distribution between sharp positive features, negative hole features (the ones that look like black holes), and somewhat random features. Note that these features are all very faint individually (the function that plots the images is automatically normalizing the white level), and they only become sharp once you add many of them together.</p> <p>In fact, let’s do that now. If we ignore the fact that neurons are dependent (there is a complex joint probability distribution), we can sample them according to their raw densities:</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">emb_feed <span class="op">=</span> np.tile(np.array([[[<span class="dv">1</span>,<span class="dv">0</span>,<span class="dv">0</span>,<span class="dv">0</span>,<span class="dv">0</span>,<span class="dv">0</span>,<span class="dv">0</span>,<span class="dv">0</span>]]]), (<span class="dv">100</span>, <span class="dv">80</span>, <span class="dv">1</span>)) <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">100</span>): <span class="cf">for</span> j <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">80</span>): emb_feed[i][j] <span class="op">=</span> np.eye(<span class="dv">8</span>)[np.random.choice([<span class="dv">0</span>,<span class="dv">1</span>,<span class="dv">2</span>,<span class="dv">3</span>,<span class="dv">4</span>,<span class="dv">5</span>,<span class="dv">6</span>,<span class="dv">7</span>], p<span class="op">=</span>densities[j])] projs <span class="op">=</span> sess.run(g[<span class="st">&#39;projection&#39;</span>], feed_dict<span class="op">=</span>{g[<span class="st">&#39;embedding&#39;</span>]: emb_feed, g[<span class="st">&#39;stochastic&#39;</span>]: <span class="va">False</span>}) plot_nxn(<span class="dv">10</span>,projs)</code></pre></div> <figure> <img src="https://r2rt.com/static/images/DEE_output_20_0.png" alt="png" 
/><figcaption>png</figcaption> </figure> <p>This generates some pretty cool looking designs that look more like murky shadows of numbers than numbers themselves.</p> <p>Note that despite the almost random appearance of these designs, they can each be precisely communicated in a sparse binary code consisting of exactly 560 zeros and 80 ones.</p> <h2 id="part-ii-generating-digits-with-an-rnn">Part II: Generating digits with an RNN</h2> <p>Our neurons are explicit, and so we can think of each neuron as a word in a language, where each MNIST digit is made up of 80 different words. Modeling the joint distribution, then, is very much like modeling a language, which is an area in which RNNs shine. In this section we create an RNN-based generator that models the joint distribution of the explicit MNIST embeddings.</p> <h4 id="data">Data</h4> <p>We want to feed our RNN sequences of 80 neurons, but our neurons have no predetermined order (query how to give them a hierarchical structure, which would greatly expand their usefulness). Thus, in order to allow the RNN to generate the remainder of an embedding given arbitrary neurons, we randomize the order in which the neurons are presented to the RNN during training. This forces us to use 640-dimensional one-hot vectors for our model inputs and targets (using 8-dimensional vectors would pre-suppose an order).</p> <p>To illustrate precisely what our training sequences look like, below is our data generator, and two different training sequences generated by the first image in the MNIST training set (note how they are just permutations of each other).
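<p>The index arithmetic the generator relies on maps neuron i firing in position p to the flat index i*8 + p, so a permuted sequence of 80 such indices fully specifies an embedding. A toy round trip, with 3 neurons instead of 80 and made-up firing positions:</p>

```python
import numpy as np

one_hot_positions = np.array([3, 0, 7])   # which of the 8 dims fired, per neuron
neuron_perm = np.array([2, 0, 1])         # random order neurons are presented in

# Encode: flat index = neuron_id * 8 + firing position, in permuted order.
flat = one_hot_positions[neuron_perm] + neuron_perm * 8
print(flat)                               # [23  3  8]

# Decode: recover (neuron, position) pairs regardless of presentation order.
neurons, positions = flat // 8, flat % 8
assert (positions == one_hot_positions[neurons]).all()
```

<p>Because decoding is order-independent, the RNN can consume (and emit) the 640-way indices in any permutation without losing which neuron fired where.</p>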
Each index indicates where the flattened embedding layer is “hot”.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> imgs_to_indices(imgs): embs <span class="op">=</span> sess.run(g[<span class="st">&#39;embedding&#39;</span>], feed_dict<span class="op">=</span>{g[<span class="st">&#39;x&#39;</span>]: imgs, g[<span class="st">&#39;stochastic&#39;</span>]: <span class="va">False</span>}) <span class="co">#[n, 80, 8]</span> idx <span class="op">=</span> np.argmax(embs, axis<span class="op">=</span><span class="dv">2</span>) <span class="co">#[n, 80]</span> res <span class="op">=</span> [] <span class="cf">for</span> img <span class="kw">in</span> <span class="bu">range</span>(<span class="bu">len</span>(imgs)): neuron_perm <span class="op">=</span> np.random.permutation(<span class="bu">list</span>(<span class="bu">range</span>(<span class="dv">80</span>))) <span class="co">#order of neurons we will present</span> res.append(idx[img][neuron_perm] <span class="op">+</span> neuron_perm <span class="op">*</span> <span class="dv">8</span>) <span class="cf">return</span> np.array(res) <span class="kw">def</span> gen_random_neuron_batch(n): x, _ <span class="op">=</span> mnist.train.next_batch(n) <span class="co"># [n, 784]</span> res <span class="op">=</span> imgs_to_indices(x) <span class="cf">return</span> res[:,:<span class="op">-</span><span class="dv">1</span>], res[:,<span class="dv">1</span>:]</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">2</span>): <span class="bu">print</span>(imgs_to_indices(mnist.train.images[:<span class="dv">1</span>]))</code></pre></div> <pre><code>[[ 18 167 116 546 122 180 236 145 563 13 101 249 575 446 475 347 73 194 583 283 131 354 393 523 204 365 339 469 54 259 187 3 625 37 223 228 47 457 489 419 242 175 83 
385 383 110 518 141 484 589 412 157 305 500 613 543 291 210 70 90 313 433 594 634 509 299 426 28 404 617 372 58 329 273 603 325 265 557 451 531]] [[404 70 339 634 236 223 433 419 329 90 13 583 157 457 110 116 543 37 523 54 28 83 500 575 283 180 141 451 594 325 347 475 204 393 3 365 175 291 265 518 259 531 167 228 299 354 305 589 372 187 58 446 625 210 563 557 249 426 412 101 613 489 469 313 47 617 385 131 242 509 484 18 603 194 73 122 145 383 273 546]]</code></pre> <h4 id="model">Model</h4> <p>The architecture of our RNN is very simple. We use a 1000-unit GRU cell, with a 100-dimensional (real-valued) embedding layer. The network is trained for 10 epochs with Adam at an initial learning rate 2e-3. As the number of possible input sequences is practically limitless (each of the 55000 training examples can generate 80! permuted sequences), no regularization is used. The norm of the gradient is clipped at 1.0. I did not do a hyperparameter search, or even early stopping, so this is likely far from an optimal generator.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">h <span class="op">=</span> build_recurrent_generator() sess.run(tf.variables_initializer([v <span class="cf">for</span> v <span class="kw">in</span> tf.global_variables() <span class="cf">if</span> <span class="st">&#39;generator&#39;</span> <span class="kw">in</span> v.name])) <span class="co"># Uncomment to run training, or use the below checkpoint to load</span> <span class="co"># train_RNN(h, sess, batch_generator=gen_random_neuron_batch, epochs=10)</span> saver <span class="op">=</span> tf.train.Saver() saver.restore(sess, <span class="st">&#39;./ckpts/gen_80x8_1000state_RANDOMPERM_10epochs_&#39;</span>)</code></pre></div> <h4 id="generation">Generation</h4> <p>After some helper functions:</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> gen_next_step(current_input, prior_state<span 
class="op">=</span><span class="va">None</span>): <span class="co">&quot;&quot;&quot;Accepts a current input (index) (shape (batch_size, 1)) and a prior_state</span> <span class="co"> (shape (batch_size, 1000)). Returns dist over the next step, and the resulting state.&quot;&quot;&quot;</span> <span class="cf">if</span> prior_state <span class="kw">is</span> <span class="va">None</span>: feed_dict<span class="op">=</span>{h[<span class="st">&#39;x&#39;</span>]: current_input} <span class="cf">else</span>: feed_dict<span class="op">=</span>{h[<span class="st">&#39;x&#39;</span>]: current_input, h[<span class="st">&#39;init_state&#39;</span>]: prior_state} <span class="cf">return</span> sess.run([h[<span class="st">&#39;next_y_dist&#39;</span>], h[<span class="st">&#39;final_state&#39;</span>]], feed_dict<span class="op">=</span>feed_dict)</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> generate_embedding(prompt<span class="op">=</span>np.array([<span class="dv">2</span>]), top_choices <span class="op">=</span> <span class="dv">1</span>): <span class="co">&quot;&quot;&quot; Accepts a prompt, and generates the rest&quot;&quot;&quot;</span> state <span class="op">=</span> <span class="va">None</span> <span class="cf">while</span> <span class="bu">len</span>(prompt) <span class="op">&lt;</span> <span class="dv">80</span>: <span class="cf">if</span> state <span class="kw">is</span> <span class="kw">not</span> <span class="va">None</span>: y_dist, state <span class="op">=</span> gen_next_step(np.expand_dims([prompt[<span class="op">-</span><span class="dv">1</span>]], <span class="dv">0</span>), state) <span class="cf">else</span>: y_dist, state <span class="op">=</span> gen_next_step(np.expand_dims(prompt, <span class="dv">0</span>)) p <span class="op">=</span> y_dist[<span class="dv">0</span>] p[np.argsort(p)[:<span class="op">-</span>top_choices]] <span class="op">=</span> <span 
class="dv">0</span> p <span class="op">=</span> p <span class="op">/</span> np.<span class="bu">sum</span>(p) next_idx <span class="op">=</span> np.random.choice(<span class="dv">640</span>, p<span class="op">=</span>p) prompt <span class="op">=</span> np.concatenate((prompt, [next_idx])) <span class="cf">return</span> prompt</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> emb_idx_to_emb(emb_idx): emb <span class="op">=</span> np.eye(<span class="dv">8</span>)[np.array(<span class="bu">sorted</span>(emb_idx)) <span class="op">%</span> <span class="dv">8</span>] <span class="cf">return</span> np.reshape(emb, (<span class="dv">1</span>, <span class="dv">80</span>, <span class="dv">8</span>))</code></pre></div> <p>We can now generate some digits. Below, we sample 12 random images from the MNIST validation set, transform them into their neuron indices, randomly permute the indices, and take the first n-indices (for n in {5, 10, 20, 40}) as the prompt to our RNN. The RNN generates the rest, which we convert back into 80x8 embeddings, feed into our decoder, and plot. 
The top rows of each figure are the original digits, and the bottom rows are the digits generated using their neuron activations as prompts.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> generate_from_originals(size_of_prompt): originals, generated <span class="op">=</span> [], [] <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">12</span>): img_idx <span class="op">=</span> np.random.randint(<span class="dv">0</span>, <span class="dv">5000</span>) originals.append(mnist.validation.images[img_idx:img_idx<span class="op">+</span><span class="dv">1</span>]) <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">12</span>): img <span class="op">=</span> imgs_to_indices(originals[i]) emb_idx <span class="op">=</span> generate_embedding(img[<span class="dv">0</span>][:size_of_prompt], <span class="dv">3</span>) emb <span class="op">=</span> emb_idx_to_emb(emb_idx) generated.append(emb) originals, generated <span class="op">=</span> np.squeeze(originals), np.squeeze(generated) projs <span class="op">=</span> sess.run(g[<span class="st">&#39;projection&#39;</span>], feed_dict<span class="op">=</span>{g[<span class="st">&#39;embedding&#39;</span>]: generated, g[<span class="st">&#39;stochastic&#39;</span>]: <span class="va">False</span>}) plot_2xn(<span class="dv">12</span>, np.concatenate((originals, projs)))</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">generate_from_originals(size_of_prompt<span class="op">=</span><span class="dv">5</span>)</code></pre></div> <figure> <img src="https://r2rt.com/static/images/DEE_output_34_0.png" alt="png" /><figcaption>png</figcaption> </figure> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">generate_from_originals(size_of_prompt<span class="op">=</span><span
class="dv">10</span>)</code></pre></div> <figure> <img src="https://r2rt.com/static/images/DEE_output_35_0.png" alt="png" /><figcaption>png</figcaption> </figure> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">generate_from_originals(size_of_prompt<span class="op">=</span><span class="dv">20</span>)</code></pre></div> <figure> <img src="https://r2rt.com/static/images/DEE_output_36_0.png" alt="png" /><figcaption>png</figcaption> </figure> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">generate_from_originals(size_of_prompt<span class="op">=</span><span class="dv">40</span>)</code></pre></div> <figure> <img src="https://r2rt.com/static/images/DEE_output_37_0.png" alt="png" /><figcaption>png</figcaption> </figure> <p>From the above, we can see how important each neuron is. With only a 5 neuron prompt, our generator is about 50/50 to generate the same digit vs another digit (again, originals are on the top row, and generated digits are on the bottom). When it generates the same digit, it varies quite a bit from the original. As we ramp up the prompt to 40 neurons from the original, our generator starts to generate samples that look more and more like the original.</p> <h4 id="generated-samples-vs.the-prompts-used-to-generate-them">Generated samples vs. the prompts used to generate them</h4> <p>To get an idea of how the prompt conditions the model, we plot some samples (top) against the prompts they were generated from (bottom). 
All come from the same random image, and the prompt sizes from left to right are [1, 5, 9 … 37].</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">res, prompts <span class="op">=</span> [], [] img_idx <span class="op">=</span> np.random.randint(<span class="dv">0</span>, <span class="dv">5000</span>) img <span class="op">=</span> imgs_to_indices(mnist.validation.images[img_idx:img_idx<span class="op">+</span><span class="dv">1</span>])[<span class="dv">0</span>] <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">10</span>): prompts.append(np.random.permutation(img)) emb_idx <span class="op">=</span> generate_embedding(prompts[<span class="op">-</span><span class="dv">1</span>][:(<span class="dv">4</span><span class="op">*</span>i)<span class="op">+</span><span class="dv">1</span>],<span class="dv">3</span>) emb <span class="op">=</span> emb_idx_to_emb(emb_idx) res.append(emb) <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">10</span>): set_neurons <span class="op">=</span> <span class="bu">set</span>(prompts[i][:(<span class="dv">4</span><span class="op">*</span>i)<span class="op">+</span><span class="dv">1</span>] <span class="op">//</span> <span class="dv">8</span>) neurons_indices <span class="op">=</span> np.array([<span class="dv">8</span><span class="op">*</span>j <span class="cf">for</span> j <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">80</span>) <span class="cf">if</span> j <span class="kw">not</span> <span class="kw">in</span> set_neurons]) emb <span class="op">=</span> emb_idx_to_emb(np.concatenate((prompts[i][:(<span class="dv">4</span><span class="op">*</span>i)<span class="op">+</span><span class="dv">1</span>],neurons_indices))) res.append(emb) res <span class="op">=</span> np.squeeze(res)</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code
class="sourceCode python">projs <span class="op">=</span> sess.run(g[<span class="st">&#39;projection&#39;</span>], feed_dict<span class="op">=</span>{g[<span class="st">&#39;embedding&#39;</span>]: res, g[<span class="st">&#39;stochastic&#39;</span>]: <span class="va">False</span>}) plot_2xn(<span class="dv">10</span>, projs)</code></pre></div> <figure> <img src="https://r2rt.com/static/images/DEE_output_41_0.png" alt="png" /><figcaption>png</figcaption> </figure> <h4 id="generating-by-prototype-more-like-this-samples">Generating by prototype: “more like this” samples</h4> <p>Below, we use a prompt of 30 neurons to generate “more like this” of a rather distinctive 7 from the mnist test set. As demonstrated above, we could make our generated samples more similar or less similar by varying the length of our prompt. Note how our generator sometimes generates 9 instead of 7.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">plot_nxn(<span class="dv">1</span>, mnist.test.images[<span class="dv">80</span>:<span class="dv">81</span>])</code></pre></div> <figure> <img src="https://r2rt.com/static/images/DEE_output_44_0.png" alt="png" /><figcaption>png</figcaption> </figure> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">res <span class="op">=</span> [] <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">0</span>, <span class="dv">100</span>): img_idx <span class="op">=</span> np.random.randint(<span class="dv">0</span>, <span class="dv">5000</span>) img <span class="op">=</span> imgs_to_indices(mnist.test.images[<span class="dv">80</span>:<span class="dv">81</span>]) emb_idx <span class="op">=</span> generate_embedding(img[<span class="dv">0</span>][:<span class="dv">30</span>], <span class="dv">2</span>) emb <span class="op">=</span> emb_idx_to_emb(emb_idx) res.append(emb) res <span class="op">=</span> np.squeeze(res) projs <span 
class="op">=</span> sess.run(g[<span class="st">&#39;projection&#39;</span>], feed_dict<span class="op">=</span>{g[<span class="st">&#39;embedding&#39;</span>]: res, g[<span class="st">&#39;stochastic&#39;</span>]: <span class="va">False</span>}) plot_nxn(<span class="dv">10</span>, projs)</code></pre></div> <figure> <img src="https://r2rt.com/static/images/DEE_output_45_0.png" alt="png" /><figcaption>png</figcaption> </figure> <h4 id="generating-samples-with-distinct-features">Generating samples with distinct features</h4> <p>Now let’s do something really cool. We’ll draw eight of our own small features, find out which explicit dimensions they correspond to, and then use those dimensions as prompts to generate samples that have that feature.</p> <p>Here, we run up against the general challenge of explicit definition for the first time. While we can draw the features, and see the features, when does an MNIST digit have that feature? Given that reasonable people can disagree on the answer, it is unreasonable for us to try expect a single explicit feature specification (in our case, a prompt) to be perfect. Nevertheless, we need to provide our RNN with a prompt in order for it to generate samples.</p> <p>The method we use below is more than a little brittle, and takes a bit of trial and error to get right. To find the embeddings that each feature corresponds to we add random noise the image of the feature, and run a batch of the noisy images through our encoder. We then construct our prompts by subsampling the activations that fire often for that feature.</p> <p>Based on trial and error, I defined “fire often” to mean firing in more than 5/6 of all noisy samples, and chose to subsample 60% of the activations for each prompt. 
Obviously, this is not a practical approach for the general case, and additional work will need to be done in order to find something that generalizes well to other cases.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> matplotlib.image <span class="im">as</span> mpimg</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">features <span class="op">=</span> [] <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">8</span>): features.append(np.reshape(<span class="fl">1.</span> <span class="op">-</span> mpimg.imread(<span class="st">&#39;./images/feature_</span><span class="sc">{}</span><span class="st">.png&#39;</span>.<span class="bu">format</span>(i), <span class="bu">format</span><span class="op">=</span><span class="st">&#39;grayscale&#39;</span>)[:,:,<span class="dv">0</span>] <span class="op">/</span> <span class="fl">255.</span>, (<span class="dv">1</span>,<span class="dv">784</span>))) features <span class="op">=</span> np.array(features) features <span class="op">*=</span> <span class="fl">1.5</span></code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">plot_mxn(<span class="dv">1</span>, <span class="dv">8</span>, features)</code></pre></div> <figure> <img src="https://r2rt.com/static/images/DEE_output_49_0.png" alt="png" /><figcaption>png</figcaption> </figure> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> get_noisy_imgs(feature): noisy_imgs <span class="op">=</span> [] <span class="cf">for</span> img <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">300</span>): noisy_img <span class="op">=</span> feature.copy() <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">784</span>): <span 
class="cf">while</span> np.random.random() <span class="op">&gt;</span> <span class="fl">0.75</span>: noisy_img[<span class="dv">0</span>, i] <span class="op">+=</span> <span class="bu">min</span>(<span class="fl">0.2</span>, <span class="dv">1</span><span class="op">-</span>noisy_img[<span class="dv">0</span>, i]) noisy_imgs.append(noisy_img) <span class="cf">return</span> np.squeeze(noisy_imgs)</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">first_feature_with_noise <span class="op">=</span> get_noisy_imgs(features[<span class="dv">0</span>])[:<span class="dv">20</span>] plot_2xn(<span class="dv">10</span>, first_feature_with_noise)</code></pre></div> <figure> <img src="https://r2rt.com/static/images/DEE_output_51_0.png" alt="png" /><figcaption>png</figcaption> </figure> <p>Just for fun, below are the projections of the first feature. You can see that the autoencoder can’t autoencode the noise properly, as it was trained on regular looking digits, but that the feature we asked for is present in all of them.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">projs <span class="op">=</span> sess.run(g[<span class="st">&#39;projection&#39;</span>], feed_dict<span class="op">=</span>{g[<span class="st">&#39;x&#39;</span>]: first_feature_with_noise, g[<span class="st">&#39;stochastic&#39;</span>]: <span class="va">False</span>}) plot_2xn(<span class="dv">10</span>, projs[:<span class="dv">20</span>])</code></pre></div> <figure> <img src="https://r2rt.com/static/images/DEE_output_53_0.png" alt="png" /><figcaption>png</figcaption> </figure> <p>We generate noisy images for each feature, and collect the firing patterns:</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">firing_patterns <span class="op">=</span> [] <span class="cf">for</span> feature <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">8</span>): 
noisy_imgs <span class="op">=</span> get_noisy_imgs(features[feature]) embs <span class="op">=</span> sess.run(g[<span class="st">&#39;embedding&#39;</span>], feed_dict<span class="op">=</span>{g[<span class="st">&#39;x&#39;</span>]: noisy_imgs, g[<span class="st">&#39;stochastic&#39;</span>]: <span class="va">False</span>}) firing_patterns.append(np.<span class="bu">sum</span>(embs, axis<span class="op">=</span><span class="dv">0</span>))</code></pre></div> <p>For reference, here are the firing patterns for the first three neurons of the first feature:</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">firing_patterns[<span class="dv">0</span>][:<span class="dv">3</span>]</code></pre></div> <pre><code>array([[ 0., 0., 6., 141., 130., 10., 0., 13.], [ 0., 0., 5., 8., 216., 64., 5., 2.], [ 0., 52., 0., 1., 1., 87., 0., 159.]], dtype=float32)</code></pre> <p>As described above, we use those firing patterns to create prompts. We generate 7 samples for each feature:</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">generated_samples <span class="op">=</span> [] <span class="cf">for</span> feature <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">8</span>): num_common_activations <span class="op">=</span> np.<span class="bu">sum</span>(firing_patterns[feature] <span class="op">&gt;</span> <span class="dv">250</span>) prompt <span class="op">=</span> np.array(<span class="bu">list</span>(<span class="bu">range</span>(<span class="dv">640</span>)))[np.argsort(np.reshape(firing_patterns[feature], (<span class="dv">640</span>)))[::<span class="op">-</span><span class="dv">1</span>][:num_common_activations]] nums <span class="op">=</span> np.sort(np.reshape(firing_patterns[feature], (<span class="dv">640</span>)))[::<span class="op">-</span><span class="dv">1</span>][:num_common_activations] den <span class="op">=</span> np.<span class="bu">sum</span>(nums) p <span 
class="op">=</span> nums<span class="op">/</span>den generated_samples.append([]) <span class="cf">for</span> sample <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">0</span>, <span class="dv">7</span>): emb_idx <span class="op">=</span> generate_embedding(np.random.choice(prompt, size<span class="op">=</span><span class="bu">max</span>(<span class="bu">int</span>(num_common_activations<span class="op">*</span>.<span class="dv">6</span>), <span class="bu">min</span>(<span class="bu">int</span>(num_common_activations), <span class="dv">5</span>)), replace<span class="op">=</span><span class="va">False</span>, p<span class="op">=</span>p), <span class="dv">2</span>) emb <span class="op">=</span> emb_idx_to_emb(emb_idx) generated_samples[<span class="op">-</span><span class="dv">1</span>].append(emb) generated_samples <span class="op">=</span> np.squeeze(generated_samples) generated_samples <span class="op">=</span> np.reshape(np.swapaxes(generated_samples, <span class="dv">0</span>, <span class="dv">1</span>), (<span class="dv">56</span>, <span class="dv">80</span>, <span class="dv">8</span>))</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">projs <span class="op">=</span> sess.run(g[<span class="st">&#39;projection&#39;</span>], feed_dict<span class="op">=</span>{g[<span class="st">&#39;embedding&#39;</span>]: generated_samples, g[<span class="st">&#39;stochastic&#39;</span>]: <span class="va">False</span>}) plot_nxn(<span class="dv">8</span>, np.concatenate((np.squeeze(features), projs)))</code></pre></div> <figure> <img src="https://r2rt.com/static/images/DEE_output_60_0.png" alt="png" /><figcaption>png</figcaption> </figure> <p>The original features are shown in the top row, and the samples generated based on that feature are shown below.</p> <p>Although most generated samples have the feature we wanted, many do not, and for some features, there is not much diversity in the samples 
(they seem to be biased towards a single digit, whereas multiple digits might contain the desired feature). We might consider a few techniques to improve the results:</p> <ul> <li>First, we can make the feature more fuzzy—instead of just adding noise, we can also randomly shift and scale it, so as to capture a more diverse set of neuron activations, and create more diverse samples.</li> <li>Second, instead of sampling neuron activations, we might obtain the network’s “confidence” in each neuron activation (e.g., the real number from the softmax activation, before it is turned into a one-hot vector); this might give a better sense of the importance of each neuron activation.</li> <li>Third, we might take more care in designing our RNN generator (e.g., actually validate its performance, and not just use our first random guess at a workable architecture/hyperparameters).</li> <li>Finally, I note that our autoencoder could not be much simpler than it already is. If we had multiple convolutional and deconvolutional layers in both the encoding and decoding sections, I suspect that generated samples would contain the feature in much more abstract and diverse ways.</li> </ul> <p>Below, we’ll see how we can take advantage of memory to create better prompts and improve the generation results.</p> <h2 id="part-iii-structured-information-retrieval-using-discrete-embeddings">Part III: Structured information retrieval using discrete embeddings</h2> <p>Now we get to the most interesting part of our little journey.
As we noted above, using discrete embeddings to represent data is very much like using a language, which means that everything we know and love about searching for language applies to searching for embeddings.</p> <p>We proceed as follows:</p> <ol type="1"> <li><p>We’ll do a crash course on search, and comment on the relations between search, classification and generation.</p></li> <li><p>We’ll take a look at how the memory of the MemNN architecture uses a cosine distance-like measure to achieve impressive results.</p></li> <li><p>We’ll take a look at the effectiveness of cosine distance-based retrieval of individual MNIST digits. This will help develop an intuition for why cosine distance-based retrieval is effective for MemNNs. We’ll also see how we can use our memory to improve the feature-based generator we made above.</p></li> <li><p>Finally, we’ll get to the point: we’ll look at the effectiveness (or lack thereof) of cosine distance-based retrieval over sequences of MNIST digits, and consider the challenges that sequences of memories pose for cosine distance-based retrieval. This is key, because an agent that cannot deal with sequences is a mere function approximator, not an intelligent actor. We’ll show how retrieval over the explicit embedding space is able to effectively deal with sequences, where cosine distance-based retrieval fails.</p></li> </ol> <h3 id="a-crash-course-on-search">A crash course on search</h3> <p>I recently had the pleasure of reading the textbook <a href="http://www.ir.uwaterloo.ca/book/">Information Retrieval: Implementing and Evaluating Search Engines</a> by Stefan Büttcher, Charles L. A. Clarke, and Gordon V. Cormack, which this section is heavily based upon. This is a gem among textbooks, up there with great math books like Axler’s Linear Algebra Done Right and Abbott’s Understanding Analysis, and was my inspiration for thinking about discrete embeddings.
It’s not short, but it’s a page turner, and I highly recommend it.</p> <p>To some extent, we all understand search, because we use it every day. A search engine takes some words (the <em>query</em>), and finds the documents that contain them. The key data structure used in search is the <a href="https://en.wikipedia.org/wiki/Inverted_index"><em>Inverted Index</em></a>, which you can consider as a hash table that maps each word to its <em>postings</em> list. The <em>postings</em> of a word contain all the positions at which the word appears in a document collection. For example, the postings for the word “deconstruction” might look something like this:</p> <pre><code>&quot;deconstruction&quot;: &lt;document: 23, positions: 1, 55 ... 1554&gt;, &lt;document: 45, positions: 6, 8&gt;, ..., &lt;document: 2442, positions: 52&gt;</code></pre> <p>where each document, and each position within a document, has been given a number. Postings are a <em>sorted</em> list of those documents and positions. Because they are sorted, postings lists can be merged efficiently during query evaluation, so that we can iterate over the postings to find which documents in our collection contain multiple terms, or, if doing a phrase search, find those documents containing the phrase being searched. In addition to the postings lists, most search engines also store some global information about the words and documents themselves, including:</p> <ul> <li>the total number of documents indexed</li> <li>the length of each document</li> <li>the frequency of each word within a document (usually included in the postings list itself)</li> <li>the number of documents in which each word appears (its document frequency)</li> </ul> <p>Using the above information, search engines can efficiently score each document in the collection according to some scoring function of the document and the query, and return the top results.
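</p>

<p>To make the postings structure concrete, here is a minimal (hypothetical) in-memory inverted index with positional postings; real engines store this on disk with compression, but the shape of the data is the same:</p>

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted postings list of (doc_id, [positions])."""
    postings = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.split()):
            postings[term].setdefault(doc_id, []).append(pos)
    # postings lists are sorted by doc_id, which enables efficient merging
    return {term: sorted(d.items()) for term, d in postings.items()}

docs = ["the cat sat", "the dog sat on the mat"]
index = build_inverted_index(docs)
# e.g. index["the"] == [(0, [0]), (1, [0, 4])]
```

<p>Intersecting the sorted postings of several terms answers a multi-term query; comparing adjacent positions answers a phrase query.</p>

<p>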
Scoring functions often treat both the query and the document as a <em>bag of words</em>, which ignores proximity, order and other relationships between the terms (i.e., each term is treated as independent). Using the <em>bag of words</em> assumption, we can model each query or document as a <em>term vector</em> of length |V|, where |V| is the size of our vocabulary. Given the 5-term vocabulary, <span class="math inline">$$\{t_1, t_2 ... t_5\}$$</span>, the document or query <span class="math inline">$$\{t_1, t_2, t_1, t_4\}$$</span> might have the vector <span class="math inline">$$[2, 1, 0, 1, 0]$$</span>. Scoring functions in the bag of words paradigm can be expressed in the following form (equation 5.16 from Büttcher et al.):</p> <p><span class="math display">$\text{score}(q, d) = \text{quality}(d) + \sum_{t \in q}\text{score}(t, d)$</span></p> <p>where q, d, and t refer to a query, document, and term, respectively. An example of a quality function is Google’s <a href="https://en.wikipedia.org/wiki/PageRank">PageRank</a> algorithm. In our MNIST example, the quality of an example might be some measure of how characteristic of the dataset a digit is (e.g., how “clean” the digit is). This quality rating is independent of the query. The <span class="math inline">$$\text{score}(t, d)$$</span> function in the second term can usually be dissected into the following form:</p> <p><span class="math display">$\text{score}(t, d) = TF \cdot IDF$</span></p> <p>where TF is the <em>term frequency</em> of t in document d, and IDF is t’s <em>inverse document frequency</em>. Term frequency is a measure (function) of how often the query term appears in a document, and inverse document frequency is a measure of how frequent the term is across all documents. The IDF term appears because, intuitively, if a term is very common across all documents (low IDF), we don’t want it to contribute a lot to the score. 
Vice versa, if the term is very rare across all documents (high IDF), we want it to contribute a lot to the score if it shows up in a document. IDF usually takes on a “standard form” (equation 2.13 of Büttcher et al.):</p> <p><span class="math display">$IDF = \log(N/N_t),$</span></p> <p>where <span class="math inline">$$N$$</span> is the total number of documents in the collection and <span class="math inline">$$N_t$$</span> is the number of documents in which the term <span class="math inline">$$t$$</span> appears.</p> <p>Let us take a moment to appreciate that summing TF-IDF scores is actually very similar to taking the cosine similarity of the query’s term vector and the document’s term vector, given that each vector is adjusted appropriately. Starting with the vector consisting of the count of each term in the query, adjust it by multiplying each entry by the IDF for that term. Starting with the vector consisting of the count of each term in the document, adjust it by the length of the document (i.e., normalize the vector). If we now take their dot product and adjust by the query length, we obtain the cosine similarity between our IDF-adjusted query vector and the document vector. When ranking results, we don’t care about the scores themselves, only their order, and thus the final step of adjusting the dot product by the query length is unnecessary. Thus, <em>bag of words</em> scoring models are closely related to cosine distance.
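</p>

<p>We can check this equivalence numerically with toy counts (the query and document vectors and document frequencies below are made up for illustration):</p>

```python
import numpy as np

N = 4                                      # total documents in the collection
df = np.array([4.0, 2.0, 1.0, 3.0, 1.0])  # documents containing each of 5 terms
idf = np.log(N / df)

q = np.array([2.0, 1.0, 0.0, 1.0, 0.0])   # query term vector
d = np.array([5.0, 0.0, 2.0, 1.0, 0.0])   # document term vector
d_norm = d / np.linalg.norm(d)             # adjust document by its length

# Summing per-term TF-IDF contributions over the query terms...
score = sum(q[t] * d_norm[t] * idf[t] for t in range(5))

# ...equals the dot product of the IDF-adjusted query with the normalized document
assert np.isclose(score, np.dot(q * idf, d_norm))
```

<p>Dividing either quantity by the query vector’s length would give a true cosine similarity, but since that factor is the same for every document, it does not change the ranking.</p>

<p>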
Below, we’ll do a visual demonstration of the effectiveness of cosine distance, and adjusted cosine distance (à la TF-IDF), for retrieving relevant memories.</p> <p>The above said, it is important to recognize that not all TF-IDF methods can be easily cast to cosine similarity: in particular, the TF component is usually taken as a function of the term frequency, rather than a simple count thereof, resulting in more complex search methods.</p> <h3 id="search-vs-classification-and-the-relationship-between-search-and-generation">Search vs classification, and the relationship between search and generation</h3> <p>Search and classification are closely related: whereas classification takes an instance and places it in a category, search takes a category and finds instances in that category. A key point to note here is that information needs during search can be incredibly diverse: whereas classification usually involves only a handful of relevant categories, the number of potential queries in a search is limitless. I like to think of search as <code>post-classification</code>: we define a category <em>ex-post</em> (after the fact), and then find examples of that category that are stored in our memory.</p> <p>If we think about the way humans search and classify, we note that classification usually occurs at high levels of abstraction with little subtlety (categories in italics):</p> <ul> <li>This is a <em>pen</em>.</li> <li>The sky is <em>blue</em>.</li> <li>Boy, was that [thing] <em>cool</em>!</li> </ul> <p>On the contrary, when we search our memory (or generate hypotheticals and new ideas) the details become very important:</p> <ul> <li>That reminds me of the time that X, Y, and Z happened, because of A, B, C.</li> <li>We could differentiate our product by adding [some specific detail].</li> </ul> <p>In order for our networks / agents to do the latter, we require them to operate and reason over individual <em>features</em> or details.
I call this <em>deconstruction</em>, in order to connect it to what humans do (this is related to, but different from, the more mainstream ML concepts of <em>feature learning</em> and <em>learned representations</em>). Humans are very good at calling out explicit details, and this is the point of this post: to give our networks explicit representations that they can operate over.</p> <p>Note that generation and memory are very closely related, in that their outcomes are often interchangeable. If we need to come up with an example of something, we often search our memories first, before trying to generate one. We can even go a step further, and say that all of our human memories are “generated”, not remembered—consider what would happen if we overtrained the RNN in Part II above to the point where it “memorized the training set”.</p> <p>It will be useful to have a specific example of external memory that is not MNIST, so let us now consider the MemNN. We’ll examine how its external memory works, and how it might be replaced by an external generator to accomplish similar ends.</p> <h3 id="an-introduction-to-memnns">An Introduction to MemNNs</h3> <p>The MemNN, introduced by <a href="https://arxiv.org/abs/1410.3916">Weston et al. (2014)</a> and expanded upon in <a href="https://arxiv.org/abs/1502.05698">Weston et al. (2015)</a>, is a very basic, memory-augmented neural network that has been proven effective on a variety of question-answering tasks. For our purposes, it will be sufficient to look specifically at the type of task it was used to solve, how the external memory module enabled this, and how we could extend its capabilities by using a memory-generation (mem-gen) module. Below is a description of the most basic MemNN presented in Weston et al. (2014) as it relates to the above—a full description of the architecture is presented in Weston et al. (2014), and improvements thereon, in Weston et al.
(2015).</p> <p>An example of the basic QA task, and the MemNN’s responses (in CAPS), is shown below (this is a reproduction of Figure 1 from Weston et al. (2014)):</p> <pre><code>Joe went to the kitchen. Fred went to the kitchen. Joe picked up the milk.
Joe travelled to the office. Joe left the milk. Joe went to the bathroom.
Where is the milk now? OFFICE
Where is Joe? BATHROOM
Where was Joe before the office? KITCHEN</code></pre> <p>Isn’t that awesome? The MemNN is able to answer these questions by (1) retrieving relevant memories (<em>retrieval</em>), and (2) using the retrieved memories together with the input to generate a response (<em>response</em>).</p> <p>Each memory of the MemNN is stored as a bag of words representation of a <em>sequence</em> of text. In the above example, each sentence is such a sequence (“Joe went to the kitchen”). The bag of words representation is as described above in “A crash course on search”: a sparse vector, with a 1 in the indices corresponding to “Joe”, “went”, “to”, “the”, and “kitchen”. Weston et al. also provide a method for dealing with arbitrarily long sequences of text: they propose a “segmenter” to divide long sequences into smaller units (e.g., sentences), and then use the bag of words representations of those sub-sequences as inputs and memories.</p> <p>The inputs to the MemNN have the same bag of words representation. Memories are retrieved one-by-one, where earlier retrieved memories impact the search for later memories.
A memory is retrieved based on the input (if it is a question) and any previously retrieved memories by:</p> <ol type="1"> <li>Embedding the input into a real-valued vector, <span class="math inline">$$x$$</span>.</li> <li>Embedding any previously retrieved memories into a real-valued vector, <span class="math inline">$$m$$</span>.</li> <li>Embedding each candidate memory, <span class="math inline">$$i$$</span>, into a real-valued vector, <span class="math inline">$$c_i$$</span>.</li> <li>Retrieving the candidate memory whose embedded vector, <span class="math inline">$$c_i$$</span>, gives the highest dot product with the vector <span class="math inline">$$x + m$$</span>.</li> </ol> <p>The embeddings used in Steps 1, 2 and 3 are all different. Step 4 is like taking the cosine similarity (without normalizing for length) between the embeddings of the model’s current conscious representation (its query), and each potential memory. The resulting retrieval model is reminiscent of the basic TF-IDF retrieval model presented above: we are doing something similar to cosine distance over the bag of words vectors, except that we first adjust the vectors appropriately (by the embeddings; cf. the adjustment by IDF above); in both cases we do not normalize our vectors. The beauty of this retrieval model is that the MemNN can learn effective embeddings for querying and retrieval.</p> <p>To bring generation back into the picture, let us consider two more questions about the above story:</p> <pre><code>Can Joe pick up the milk?
Can Fred pick up the milk?</code></pre> <p>Both questions ask about a hypothetical situation. A memory and a generated sample actually serve the same function here: if the MemNN can retrieve an example of Joe/Fred picking up the milk, then it can answer yes; but if it cannot, a high probability generated example will suffice.
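</p> <p>As an aside, the four retrieval steps above are straightforward to sketch in code. The following is a toy numpy sketch, not the actual MemNN implementation: random matrices stand in for the three learned embeddings, and random binary vectors stand in for the bag of words inputs.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim, n_mem = 50, 16, 6

# Three separate embeddings (Steps 1-3 each use their own); random stand-ins here.
U_input = rng.normal(size=(dim, vocab))
U_prev = rng.normal(size=(dim, vocab))
U_cand = rng.normal(size=(dim, vocab))

bow_input = rng.integers(0, 2, size=vocab)           # the question
bow_prev = rng.integers(0, 2, size=vocab)            # a previously retrieved memory
bow_candidates = rng.integers(0, 2, size=(n_mem, vocab))

x = U_input @ bow_input        # Step 1: embed the input
m = U_prev @ bow_prev          # Step 2: embed previously retrieved memories
C = bow_candidates @ U_cand.T  # Step 3: embed each candidate memory, c_i

scores = C @ (x + m)           # Step 4: unnormalized dot products with x + m
best = int(np.argmax(scores))  # index of the retrieved memory
print(best, scores.shape)
```

<p>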
Thus, we might think of augmenting our architectures with an external generator (and perhaps a paired discriminator to determine the plausibility of generated samples), in addition to an external memory. This would expand the MemNN’s capabilities and allow it to answer hypothetical questions. As noted, these two modules would be complementary, and the memory module would often provide useful information for use during generation.</p> <h3 id="retrieval-of-individual-mnist-digits">Retrieval of Individual MNIST Digits</h3> <p>In this part, we’ll show how cosine distance-based retrieval fares on MNIST digits, which will provide some visual intuitions for why cosine distance-based retrieval works for MemNNs, and in general. We’ll also get to see, visually, the interchangeability of generation and retrieval, and demonstrate the use of memory to improve generation.</p> <p>Below, we’ll use numpy to do exhaustive nearest neighbors search using cosine distance over the MNIST training set. We’ll do experiments on both the original input space (784 pixel vectors) and the explicit embedding space (640 sparse binary vectors).
We’ll use the same distinctive 7 we used above when generating by prototype.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">normalized_mnist <span class="op">=</span> mnist.train.images <span class="op">/</span> np.linalg.norm(mnist.train.images, axis<span class="op">=</span><span class="dv">1</span>, keepdims<span class="op">=</span><span class="va">True</span>)
nns <span class="op">=</span> np.argsort(np.dot(normalized_mnist, mnist.test.images[<span class="dv">80</span>]))[::<span class="op">-</span><span class="dv">1</span>][:<span class="dv">9</span>] <span class="co"># 9 nearest neighbors to the query image</span>
res <span class="op">=</span> np.array([mnist.test.images[<span class="dv">80</span>]] <span class="op">+</span> [mnist.train.images[i] <span class="cf">for</span> i <span class="kw">in</span> nns])
plot_mxn(<span class="dv">1</span>,<span class="dv">10</span>, res)</code></pre></div> <figure> <img src="https://r2rt.com/static/images/DEE_output_66_0.png" alt="png" /><figcaption>png</figcaption> </figure> <p>The first image above is the query, and the next 9 are the nearest neighbors from the training set (in order), measured by cosine distance over the 784-pixel input space.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">explicit_mnist <span class="op">=</span> []
<span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">550</span>):
    imgs <span class="op">=</span> sess.run(g[<span class="st">&#39;embedding&#39;</span>], feed_dict<span class="op">=</span>{g[<span class="st">&#39;x&#39;</span>]: mnist.train.images[i<span class="op">*</span><span class="dv">100</span>:(i<span class="op">+</span><span class="dv">1</span>)<span class="op">*</span><span class="dv">100</span>], g[<span class="st">&#39;stochastic&#39;</span>]: <span class="va">False</span>})
    explicit_mnist.append(np.reshape(imgs, (<span
class="op">-</span><span class="dv">1</span>, <span class="dv">640</span>))) explicit_mnist <span class="op">=</span> np.concatenate(explicit_mnist) normalized_exp_mnist <span class="op">=</span> explicit_mnist <span class="op">/</span> np.linalg.norm(explicit_mnist, axis<span class="op">=</span><span class="dv">1</span>, keepdims<span class="op">=</span><span class="va">True</span>) query <span class="op">=</span> np.reshape(sess.run(g[<span class="st">&#39;embedding&#39;</span>], feed_dict<span class="op">=</span>{g[<span class="st">&#39;x&#39;</span>]: mnist.test.images[<span class="dv">80</span>:<span class="dv">81</span>], g[<span class="st">&#39;stochastic&#39;</span>]: <span class="va">False</span>}), (<span class="dv">640</span>,)) nns <span class="op">=</span> np.argsort(np.dot(normalized_exp_mnist, query))[::<span class="op">-</span><span class="dv">1</span>][:<span class="dv">9</span>] <span class="co"># 9 nearest neighbors to first validation image</span> res <span class="op">=</span> np.array([mnist.test.images[<span class="dv">80</span>]] <span class="op">+</span> [mnist.train.images[i] <span class="cf">for</span> i <span class="kw">in</span> nns]) plot_mxn(<span class="dv">1</span>,<span class="dv">10</span>, res)</code></pre></div> <figure> <img src="https://r2rt.com/static/images/DEE_output_68_0.png" alt="png" /><figcaption>png</figcaption> </figure> <p>The first image above is the query, and the next 9 are the nearest neighbors from the training set (in order), measured by cosine distance over the 640-dimensional explicit embedding space.</p> <p>We might also see if we can improve this result by modifying our query vector. 
Our retrieval will no longer be strict cosine distance, but rather cosine distance-based, like the TF-IDF and MemNN retrieval methods described above.</p> <p>We adjust our query vector by multiplying each dimension by its “IDF” (the logarithm of the ratio of the number of training samples to the frequency of that dimension in the training set):</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">embs <span class="op">=</span> np.reshape(explicit_mnist, (<span class="dv">55000</span>, <span class="dv">80</span>, <span class="dv">8</span>)) IDF_vector <span class="op">=</span> np.log(<span class="dv">55000</span> <span class="op">/</span> (np.reshape(np.<span class="bu">sum</span>(embs, axis<span class="op">=</span><span class="dv">0</span>), (<span class="dv">640</span>)) <span class="op">+</span> <span class="dv">1</span>)) query <span class="op">=</span> np.reshape(sess.run(g[<span class="st">&#39;embedding&#39;</span>], feed_dict<span class="op">=</span>{g[<span class="st">&#39;x&#39;</span>]: mnist.test.images[<span class="dv">80</span>:<span class="dv">81</span>], g[<span class="st">&#39;stochastic&#39;</span>]: <span class="va">False</span>}), (<span class="dv">640</span>,)) <span class="op">*</span> IDF_vector nns <span class="op">=</span> np.argsort(np.dot(normalized_exp_mnist, query))[::<span class="op">-</span><span class="dv">1</span>][:<span class="dv">9</span>] <span class="co"># 9 nearest neighbors to first validation image</span> res <span class="op">=</span> np.array([mnist.test.images[<span class="dv">80</span>]] <span class="op">+</span> [mnist.train.images[i] <span class="cf">for</span> i <span class="kw">in</span> nns]) plot_mxn(<span class="dv">1</span>,<span class="dv">10</span>, res)</code></pre></div> <figure> <img src="https://r2rt.com/static/images/DEE_output_70_0.png" alt="png" /><figcaption>png</figcaption> </figure> <p>The first image above is the query, and the next 9 are the nearest neighbors from 
the training set (in order), measured by the cosine distance between the modified query (query vector * the IDF vector) and the embedding of each training sample.</p> <p>The results of our modified query are not clearly better or worse, so the most we can conclude from this demonstration is that adjusting our query by the IDF of each feature did not seem to hurt.</p> <p>Notice how the practical utility of these nearest neighbors would be similar to the generated samples in the generating by prototype example from Part II.</p> <h4 id="retrieval-of-individual-mnist-digits-by-feature">Retrieval of Individual MNIST Digits by Feature</h4> <p>Now let’s do the equivalent of generation based on features, first over the original input space, and then over the explicit embedding space. Take note of how the resulting figures are similar to the figure we produced above when generating samples with these same specific features.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">res <span class="op">=</span> [f[<span class="dv">0</span>] <span class="cf">for</span> f <span class="kw">in</span> features] nns <span class="op">=</span> [] <span class="cf">for</span> idx <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">8</span>): nns.append(np.argsort(np.dot(normalized_mnist, res[idx]))[::<span class="op">-</span><span class="dv">1</span>][:<span class="dv">7</span>]) nns <span class="op">=</span> np.array(nns).T.reshape((<span class="dv">56</span>)) <span class="cf">for</span> n <span class="kw">in</span> nns: res.append(mnist.train.images[n]) res <span class="op">=</span> np.array(res) plot_nxn(<span class="dv">8</span>, res)</code></pre></div> <p>Retrieval by feature over the original input space:</p> <figure> <img src="https://r2rt.com/static/images/DEE_output_73_0.png" alt="png" /><figcaption>png</figcaption> </figure> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode 
python">firing_patterns <span class="op">=</span> [] <span class="cf">for</span> feature <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">8</span>): noisy_imgs <span class="op">=</span> get_noisy_imgs(features[feature]) embs <span class="op">=</span> sess.run(g[<span class="st">&#39;embedding&#39;</span>], feed_dict<span class="op">=</span>{g[<span class="st">&#39;x&#39;</span>]: noisy_imgs, g[<span class="st">&#39;stochastic&#39;</span>]: <span class="va">False</span>}) firing_patterns.append(np.<span class="bu">sum</span>(embs, axis<span class="op">=</span><span class="dv">0</span>)) feature_embs <span class="op">=</span> np.array(firing_patterns) feature_embs[feature_embs <span class="op">&lt;</span> <span class="dv">200</span>] <span class="op">=</span> <span class="fl">0.</span> <span class="co"># get rid of the random neurons that were firing</span> feature_embs <span class="op">=</span> feature_embs<span class="op">*</span>feature_embs <span class="co"># amplify the effect of the neurons that fire most often</span> feature_embs <span class="op">=</span> np.reshape(feature_embs, (<span class="dv">8</span>, <span class="dv">640</span>)) <span class="op">*</span> IDF_vector <span class="co"># adjust the by activation IDFs</span> res <span class="op">=</span> [f[<span class="dv">0</span>] <span class="cf">for</span> f <span class="kw">in</span> features] nns <span class="op">=</span> [] <span class="cf">for</span> idx <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">8</span>): nns.append(np.argsort(np.dot(normalized_exp_mnist, feature_embs[idx]))[::<span class="op">-</span><span class="dv">1</span>][:<span class="dv">7</span>]) nns <span class="op">=</span> np.array(nns).T.reshape((<span class="dv">56</span>)) <span class="cf">for</span> n <span class="kw">in</span> nns: res.append(mnist.train.images[n]) res <span class="op">=</span> np.array(res)</code></pre></div> <div class="sourceCode"><pre class="sourceCode 
python"><code class="sourceCode python">plot_nxn(<span class="dv">8</span>, res)</code></pre></div> <p>Retrieval by feature over the explicit embedding space:</p> <figure> <img src="https://r2rt.com/static/images/DEE_output_76_0.png" alt="png" /><figcaption>png</figcaption> </figure> <h4 id="using-memory-for-generation">Using memory for generation</h4> <p>Now that we have a working memory system, we can use it to demonstrate the interaction of memory and generation. Suppose we want to generate samples from features like we did in Part II. Instead of defining prompts for the features by using noise, we can instead use the activations from the nearest neighbors. This is definition by association (e.g., an unknown word that you can nevertheless guess the meaning of, due to its context).</p> <p>We find the 30 nearest neighbors, and subsample their most common activations to define each prompt.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">res <span class="op">=</span> [f[<span class="dv">0</span>] <span class="cf">for</span> f <span class="kw">in</span> features]
firing_patterns <span class="op">=</span> []
<span class="cf">for</span> idx <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">8</span>):
    nns <span class="op">=</span> np.argsort(np.dot(normalized_mnist, res[idx]))[::<span class="op">-</span><span class="dv">1</span>][:<span class="dv">30</span>]
    embs <span class="op">=</span> sess.run(g[<span class="st">&#39;embedding&#39;</span>], feed_dict<span class="op">=</span>{g[<span class="st">&#39;x&#39;</span>]: [mnist.train.images[n] <span class="cf">for</span> n <span class="kw">in</span> nns], g[<span class="st">&#39;stochastic&#39;</span>]: <span class="va">False</span>})
    firing_patterns.append(np.<span class="bu">sum</span>(embs, axis<span class="op">=</span><span class="dv">0</span>))</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode
python">generated_samples <span class="op">=</span> [] <span class="cf">for</span> feature <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">8</span>): num_common_activations <span class="op">=</span> np.<span class="bu">sum</span>(firing_patterns[feature] <span class="op">&gt;</span> <span class="dv">18</span>) prompt <span class="op">=</span> np.array(<span class="bu">list</span>(<span class="bu">range</span>(<span class="dv">640</span>)))[np.argsort(np.reshape(firing_patterns[feature], (<span class="dv">640</span>)))[::<span class="op">-</span><span class="dv">1</span>][:num_common_activations]] nums <span class="op">=</span> np.sort(np.reshape(firing_patterns[feature], (<span class="dv">640</span>)))[::<span class="op">-</span><span class="dv">1</span>][:num_common_activations] den <span class="op">=</span> np.<span class="bu">sum</span>(nums) p <span class="op">=</span> nums<span class="op">/</span>den generated_samples.append([]) <span class="cf">for</span> sample <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">0</span>, <span class="dv">7</span>): emb_idx <span class="op">=</span> generate_embedding(np.random.choice(prompt, size<span class="op">=</span><span class="bu">min</span>(<span class="bu">int</span>(num_common_activations<span class="op">*</span>.<span class="dv">8</span>), <span class="dv">8</span>), replace<span class="op">=</span><span class="va">False</span>, p<span class="op">=</span>p), <span class="dv">3</span>) emb <span class="op">=</span> emb_idx_to_emb(emb_idx) generated_samples[<span class="op">-</span><span class="dv">1</span>].append(emb) generated_samples <span class="op">=</span> np.squeeze(generated_samples) generated_samples <span class="op">=</span> np.reshape(np.swapaxes(generated_samples, <span class="dv">0</span>, <span class="dv">1</span>), (<span class="dv">56</span>, <span class="dv">80</span>, <span class="dv">8</span>))</code></pre></div> <div class="sourceCode"><pre 
class="sourceCode python"><code class="sourceCode python">projs <span class="op">=</span> sess.run(g[<span class="st">&#39;projection&#39;</span>], feed_dict<span class="op">=</span>{g[<span class="st">&#39;embedding&#39;</span>]: generated_samples, g[<span class="st">&#39;stochastic&#39;</span>]: <span class="va">False</span>})
plot_nxn(<span class="dv">8</span>, np.concatenate((np.squeeze(features), projs)))</code></pre></div> <figure> <img src="https://r2rt.com/static/images/DEE_output_80_0.png" alt="png" /><figcaption>png</figcaption> </figure> <p>Generation by feature with memory-based feature definitions.</p> <p>This approach to defining prompts seems to provide a bit more stability and diversity than the noise-based approach above. If you’re familiar with information retrieval, this should remind you of <a href="https://en.wikipedia.org/wiki/Relevance_feedback">pseudo-relevance feedback</a>.</p> <h4 id="some-takeaways">Some takeaways</h4> <p>There are a few key takeaways from the above discussion:</p> <ul> <li><p>First, we visually demonstrated the relationship between generation and memory: the outputs are more or less interchangeable (of course, the digits in memory are of superior quality—though as noted in Part II, the generator has a lot of room for improvement).</p></li> <li><p>Second, we visually demonstrated the power of cosine distance: we can retrieve entire images based on only a fragment, which provides us with a content-addressable memory. We can now see why MemNNs can work effectively: they only need a small fragment of what they are trying to retrieve from memory in order to retrieve it. The embedding layers that are used before taking the cosine distance can amplify the most relevant parts of the query.</p></li> <li><p>Third, using context provided by memory can result in better definitions when facing the general problem of explicit definition.
Even so, the definitions used were still brittle, as several of the samples we generated do not match the desired feature at all.</p></li> </ul> <h3 id="enter-sequences-where-cosine-distance-fails">Enter sequences: where cosine distance fails</h3> <p>Although this post has so far dealt with MNIST, a non-sequential dataset of individual digits, what I’m truly interested in is how we deal with sequences or sets. We’ve actually already seen two instances of set-based recall:</p> <ul> <li><p>The first is the MemNN. If you had been paying close attention to my description of the MemNN above, you may have realized that the basic MemNN architecture I described has no way of answering the sample questions provided. This is because each of the sample questions has a sequential component: Where is the milk <strong>now</strong>? Where is Joe [<strong>now</strong>]? Where was Joe <strong>before the office</strong>? In order for the MemNN to answer questions of this nature, Weston et al. augment the basic architecture with a mechanism for modeling the time of each memory. Their approach is described in Section 3.4 of <a href="https://arxiv.org/abs/1410.3916">Weston et al. (2014)</a>. Notably, their approach involves iterating over timestamped memories.</p></li> <li><p>The second is actually feature-based generation and retrieval. If we view each MNIST digit as a sum of its features, then we are actually retrieving a set from a subset thereof.</p></li> </ul> <p>The recall of sequences and sets is critical for agents that experience time (e.g., RNNs), and also has many practical applications (the most notable being plain old text-based search, where we recall documents based on their words (features)). We’ve seen how cosine distance-based search can be extremely effective for set retrieval, if we have a vector that represents the entire set.</p> <p>Cosine distance can also be effective for sequence retrieval, <em>if we have a vector that represents the sequence</em>.
For example, the vectors used by the MemNN as memories and inputs are precisely this: vectors representing an entire sentence (sequence of words). For cosine distance to work on sequences, we not only need a vector that represents the sequence, but it must also represent the temporal features of the sequence that we are interested in (if any). The MemNN sequence vectors use a bag of words model and do not contain any temporal features.</p> <p>Consider the following toy retrieval tasks:</p> <p>If we were to write out the entire MNIST training sequence,</p> <ul> <li>are there any instances of the consecutive digits “5134” in the sequence, and if so, where?</li> <li>how many times does the digit “6” appear within four digits of the digit “3”?</li> <li>are there any strings of ten digits that contain all features in some given set of sub-digit features?</li> </ul> <p>For an agent that works only with vector-based memories and cosine distance-based retrieval, all three are quite difficult.</p> <p>If we index each digit as a separate memory, all involve multiple queries. For example, in the first instance we would have to (1) retrieve the positions of <em>all</em> 5s in memory, and (2) filter out the 5s that are not followed by a 1, a 3, and a 4, which requires us to either retrieve the positions of all 1s, 3s, and 4s, or examine the digits following each 5. The second and third tasks are even more difficult.</p> <p>If instead we index sequences of digits, we run into a few problems:</p> <ul> <li>How do we represent sequences in memory? Bag of words (adding individual vectors) might work for problems that ignore order and temporal structure, but will fail for any problem that does not. Using a sequence autoencoder is one solution, but it is fraught with ambiguity and may not contain the necessary information to properly execute the query.
How could we design a sequence vector that contains the information required for the second and third tasks above, in addition to other such questions?</li> <li>How do we represent the query as a vector? We already ran into some trouble with this when trying to represent small features as entire vectors of discrete embeddings: we ended up simulating a bunch of noisy sets and then finding the most important activations across all simulations. This did OK, but was unprincipled and had lackluster results.</li> <li>What length of sequence do we want to index? It is impractical to index all sequence lengths, so we need to make an <em>ex ante</em> choice. This runs counter to our idea of search as <em>ex post</em> classification.</li> </ul> <p>These are hard problems, so let us table cosine distance, and instead take a look at what I call <strong>structured information retrieval</strong>, which, with a little bit of work, we can make work with our discrete embeddings.</p> <h3 id="structured-information-retrieval">Structured information retrieval</h3> <p>When you do a phrase search on Google, i.e., a search with “double quotes”, you are doing structured information retrieval. Cosine distance is still in the picture, but it is no longer the critical ingredient: phrase search requires us to first identify all documents that contain the phrase. Recall that query terms are stored in an inverted index together with their postings. The phrase search algorithm operates on the postings lists and finds all instances of a phrase, without ever needing to retrieve an entire document. You can see the details of the phrase search algorithm in Section 2.1.1 of Büttcher et al. Phrase search is just one example of a more general class of constraint-based queries that define a structure on a set of terms.
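</p> <p>For intuition, phrase search over an inverted index can be sketched in a few lines. The following is a naive toy version over a made-up corpus (the algorithm in Büttcher et al. is far more efficient, but the idea is the same): each term maps to a postings list of (document, position) pairs, and a phrase match is a starting position where each successive term appears at the next offset.</p>

```python
from collections import defaultdict

# A made-up toy corpus, purely for illustration.
docs = ["the cat sat", "the cat sat on the mat", "a cat on a mat"]

# Inverted index: term -> postings list of (doc_id, position) pairs.
postings = defaultdict(list)
for doc_id, doc in enumerate(docs):
    for pos, term in enumerate(doc.split()):
        postings[term].append((doc_id, pos))

def phrase_search(phrase):
    """Return (doc_id, start_position) of every phrase occurrence, using
    only the postings lists; no document is ever re-read."""
    terms = phrase.split()
    candidates = set(postings[terms[0]])
    for offset, term in enumerate(terms[1:], start=1):
        # keep starts whose position at this offset holds the current term
        candidates = candidates & {(d, p - offset) for d, p in postings[term]}
    return sorted(candidates)

print(phrase_search("cat sat"))     # [(0, 1), (1, 1)]
print(phrase_search("on the mat"))  # [(1, 3)]
```

<p>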
Within this class of queries, we can include Boolean search, passage retrieval, and more general query languages that allow users to define lightweight structure such as “find sentences containing terms X and Y” or the third task above (see Section 5.2 of Büttcher et al. for a detailed discussion).</p> <p>Structured information retrieval cannot be done with cosine distance: it requires access to the postings list of each query term. With real-valued embeddings, there is no principled way to create such an index (or at least, none that I could think of). With one-hot embeddings, however, we have some additional flexibility.</p> <p>Below we will take a simple approach to information retrieval over the sequence of the MNIST training images. In particular, we will show how discrete embeddings allow us to solve the following problem:</p> <p>Does the ordered sequence of MNIST training images (55000 images) contain any subsequences of four consecutive digits that have the following four features, respectively, and if so, where:</p> <figure> <img src="https://r2rt.com/static/images/DEE_feature_query.png" alt="png" /><figcaption>png</figcaption> </figure> <h3 id="indexing-one-hot-features">Indexing one-hot features</h3> <p>To begin, we need to build our index, which is a dictionary mapping features to a <em>positions</em> vector. We will treat each dimension of our explicit embedding as a feature (so that there are 640 total features), and we will index the entire MNIST training sequence in order. 
We also define functions to find the intersection and union of postings lists, which will allow us to compose activation postings lists into postings lists for features.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">embs <span class="op">=</span> explicit_mnist.T
postings <span class="op">=</span> [<span class="va">None</span>] <span class="op">*</span> <span class="dv">640</span>
<span class="cf">for</span> i, positions <span class="kw">in</span> <span class="bu">enumerate</span>(embs):
    postings[i] <span class="op">=</span> np.squeeze(np.argwhere(positions))

<span class="im">from</span> itertools <span class="im">import</span> chain

<span class="kw">def</span> ipostings(idx_list):
    <span class="co">&quot;&quot;&quot;Gets the intersection of postings for the activations in idx_list&quot;&quot;&quot;</span>
    res <span class="op">=</span> postings[idx_list[<span class="dv">0</span>]]
    <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">1</span>, <span class="bu">len</span>(idx_list)):
        res <span class="op">=</span> <span class="bu">list</span>(intersect_sorted(res, postings[idx_list[i]]))
    <span class="cf">return</span> np.array(res)

<span class="kw">def</span> upostings(list_of_postings):
    <span class="co">&quot;&quot;&quot;Gets the union of postings&quot;&quot;&quot;</span>
    <span class="cf">return</span> np.array(<span class="bu">sorted</span>(<span class="bu">set</span>(chain.from_iterable(list_of_postings))))</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># positions where the activation 1 is hot</span>
postings[<span class="dv">1</span>]
<span class="op">&gt;</span> array([ <span class="dv">4</span>, <span class="dv">6</span>, <span class="dv">18</span>, ..., <span class="dv">54972</span>, <span class="dv">54974</span>, <span class="dv">54980</span>])

<span class="co"># positions where activations 1
and 10 co-occur</span>
ipostings([<span class="dv">1</span>, <span class="dv">10</span>])
<span class="op">&gt;</span> array([ <span class="dv">32</span>, <span class="dv">84</span>, <span class="dv">95</span>, ..., <span class="dv">54680</span>, <span class="dv">54702</span>, <span class="dv">54966</span>])

<span class="co"># positions where activations 1 and 5 co-occur (empty, because they are from the same neuron)</span>
ipostings([<span class="dv">1</span>, <span class="dv">5</span>])
<span class="op">&gt;</span> array([], dtype<span class="op">=</span>float64)

<span class="co"># positions where activation 1 co-occurs with any of the activations 10, 15, or 23</span>
upostings((ipostings([<span class="dv">1</span>, <span class="dv">10</span>]), ipostings([<span class="dv">1</span>, <span class="dv">15</span>]), ipostings([<span class="dv">1</span>, <span class="dv">23</span>])))
<span class="op">&gt;</span> array([ <span class="dv">18</span>, <span class="dv">32</span>, <span class="dv">38</span>, ..., <span class="dv">54972</span>, <span class="dv">54974</span>, <span class="dv">54980</span>])</code></pre></div> <h4 id="finding-consecutive-features">Finding consecutive features</h4> <p>To find subsequences with the desired features, we first identify what those features are, and then run a phrase search using their postings lists. We obtain the feature postings lists by (1) finding the postings list for each group of three activations in the five most common activations for that feature, and (2) taking their union.</p> <p>Our definition of the features is extremely brittle. Here we have even less flexibility than we did when retrieving nearest neighbors or generating samples because we need to use exact definitions. Our process is far from perfect: we use the noise-based approach to produce firing patterns, and then define the feature to be any instance where any combination of 4 such common activations fire together.
This no doubt misses a lot of digits that do contain the features, but trial and error shows that using less than the intersection of 4 common activations produces very noisy results (we obtain a lot of matches that do not contain the feature).</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">firing_patterns <span class="op">=</span> [] <span class="cf">for</span> feature <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">8</span>): noisy_imgs <span class="op">=</span> get_noisy_imgs(features[feature]) embs <span class="op">=</span> sess.run(g[<span class="st">&#39;embedding&#39;</span>], feed_dict<span class="op">=</span>{g[<span class="st">&#39;x&#39;</span>]: noisy_imgs, g[<span class="st">&#39;stochastic&#39;</span>]: <span class="va">False</span>}) firing_patterns.append(np.<span class="bu">sum</span>(embs, axis<span class="op">=</span><span class="dv">0</span>)) F1, F2, F3 <span class="op">=</span> <span class="dv">1</span>, <span class="dv">3</span>, <span class="dv">6</span> <span class="co"># the indices of the features that make up our query</span> <span class="im">from</span> itertools <span class="im">import</span> combinations <span class="co"># Get the common activations</span> neuron_lists <span class="op">=</span> [] <span class="cf">for</span> feature <span class="kw">in</span> [F1, F2, F3]: num_common_activations <span class="op">=</span> np.<span class="bu">sum</span>(firing_patterns[feature] <span class="op">&gt;</span> <span class="dv">260</span>) firing_pattern <span class="op">=</span> np.reshape(firing_patterns[feature], (<span class="dv">640</span>)).astype(np.int32) neuron_lists.append( np.array(<span class="bu">list</span>(<span class="bu">range</span>(<span class="dv">640</span>)))[np.argsort(firing_pattern)[::<span class="op">-</span><span class="dv">1</span>][:<span class="bu">min</span>(<span class="dv">10</span>, num_common_activations)]] ) <span class="co"># Get 
intersected postings for each combination of 3 activations in the most common</span> combo_lists <span class="op">=</span> [<span class="bu">list</span>(<span class="bu">map</span>(ipostings, combinations(l, <span class="bu">max</span>(<span class="bu">len</span>(l) <span class="op">-</span> <span class="dv">4</span>, <span class="dv">4</span>)))) <span class="cf">for</span> l <span class="kw">in</span> neuron_lists] <span class="co"># Get the union of the above for each feature</span> positions <span class="op">=</span> <span class="bu">list</span>(<span class="bu">map</span>(upostings, combo_lists))</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co">&quot;&quot;&quot;Because our data is small, we take a naive approach to phrase search.</span> <span class="co">A more efficient algorithm is presented in Section 2.1.1 of Büttcher et al.&quot;&quot;&quot;</span> res12 <span class="op">=</span> [np.concatenate([features[F1], features[F2]])] res123 <span class="op">=</span> [np.concatenate([features[F1], features[F2], features[F3]])] <span class="cf">for</span> p <span class="kw">in</span> positions[<span class="dv">0</span>]: <span class="cf">if</span> p<span class="op">+</span><span class="dv">1</span> <span class="kw">in</span> positions[<span class="dv">1</span>]: <span class="cf">if</span> p<span class="op">+</span><span class="dv">2</span> <span class="kw">in</span> positions[<span class="dv">2</span>]: <span class="bu">print</span>(<span class="st">&quot;Full phrase match found in positions&quot;</span>, p, <span class="st">&quot;to&quot;</span>, p<span class="op">+</span><span class="dv">2</span>) res123.append(mnist.train.images[p:p<span class="op">+</span><span class="dv">3</span>]) res12.append(mnist.train.images[p:p<span class="op">+</span><span class="dv">2</span>]) <span class="bu">print</span>(<span class="st">&quot;Partial phrase match of 1st and 2nd query elements found in 
positions&quot;</span>, p, <span class="st">&quot;to&quot;</span>, p<span class="op">+</span><span class="dv">1</span>) res23 <span class="op">=</span> [np.concatenate([features[F2], features[F3]])] <span class="cf">for</span> p <span class="kw">in</span> positions[<span class="dv">1</span>]: <span class="cf">if</span> p<span class="op">+</span><span class="dv">1</span> <span class="kw">in</span> positions[<span class="dv">2</span>]: res23.append(mnist.train.images[p:p<span class="op">+</span><span class="dv">2</span>]) <span class="bu">print</span>(<span class="st">&quot;Partial phrase match of 2nd and 3rd query elements found in positions&quot;</span>, p, <span class="st">&quot;to&quot;</span>, p<span class="op">+</span><span class="dv">1</span>)</code></pre></div>
<pre><code>Partial phrase match of 1st and 2nd query elements found in positions 1151 to 1152
Partial phrase match of 1st and 2nd query elements found in positions 39054 to 39055
Partial phrase match of 2nd and 3rd query elements found in positions 14294 to 14295
Partial phrase match of 2nd and 3rd query elements found in positions 14436 to 14437
Partial phrase match of 2nd and 3rd query elements found in positions 17310 to 17311
Partial phrase match of 2nd and 3rd query elements found in positions 20984 to 20985
Partial phrase match of 2nd and 3rd query elements found in positions 21198 to 21199
Partial phrase match of 2nd and 3rd query elements found in positions 27052 to 27053
Partial phrase match of 2nd and 3rd query elements found in positions 28786 to 28787
Partial phrase match of 2nd and 3rd query elements found in positions 45352 to 45353
Partial phrase match of 2nd and 3rd query elements found in positions 47446 to 47447
Partial phrase match of 2nd and 3rd query elements found in positions 54635 to 54636</code></pre>
<h3 id="matches">Matches</h3>
<p>Our approach found no matches for three consecutive features.
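</p>
<p>(An implementation aside, using hypothetical stand-in data: the naive loop above checks <code>p+1 in positions[1]</code>, which is a linear scan when <code>positions[1]</code> is a NumPy array. For larger corpora, converting each positions array to a Python <code>set</code> first makes every membership test constant-time without changing the results.)</p>

```python
import numpy as np

# Hypothetical positions arrays for three features, standing in for the
# `positions` list computed by the real feature-definition code above.
positions = [np.array([5, 100, 1151]),
             np.array([6, 1152]),
             np.array([7, 300])]

# O(1) membership tests instead of linear scans over arrays.
position_sets = [set(p.tolist()) for p in positions]

res12_starts, res123_starts = [], []  # start positions of partial / full matches
for p in positions[0]:
    if p + 1 in position_sets[1]:
        res12_starts.append(int(p))
        if p + 2 in position_sets[2]:
            res123_starts.append(int(p))
```

<p>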
We did, however, find matches for each pair of consecutive features.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># There are no matches for all three features</span> plot_mxn(<span class="bu">len</span>(res123), <span class="dv">3</span>, np.concatenate(res123))</code></pre></div> <figure> <img src="https://r2rt.com/static/images/DEE_output_99_0.png" alt="png" /><figcaption>png</figcaption> </figure> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># There are 2 matches for the first and second features</span> plot_mxn(<span class="bu">len</span>(res12), <span class="dv">2</span>, np.concatenate(res12))</code></pre></div> <figure> <img src="https://r2rt.com/static/images/DEE_output_100_0.png" alt="png" /><figcaption>png</figcaption> </figure> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># There are several matches for the second and third features</span> plot_mxn(<span class="bu">len</span>(res23), <span class="dv">2</span>, np.concatenate(res23))</code></pre></div> <figure> <img src="https://r2rt.com/static/images/DEE_output_101_0.png" alt="png" /><figcaption>png</figcaption> </figure> <p>Given that we can obtain the positions vector for individual features, using other structured information retrieval techniques is as straightforward as using them on text. With relative ease we could answer much more complex queries such as: are there any strings of 10 digits that contain the three features in order?</p> <p>At the same time, these abilities are limited by the quality of our positions vectors, which are in turn limited by the quality of our feature definitions.</p> <h2 id="further-work">Further work</h2> <p>In this post, we played with discrete embeddings and saw that they can be used in much the same way as language. 
We also ran into several problems as a result of the basic challenge of language, which is semantic ambiguity: what’s in a definition? Because of this challenge, I do not think the approach presented in this post is practical in its current form. Nevertheless, I think this is a good direction to be thinking about as we try to make progress toward Stage III architectures.</p>
<p>Some of the questions I’m interested in working on going forward are:</p>
<ul>
<li>MNIST is perhaps too simple; are the various tasks we carried out (generation and memory) using discrete embeddings practical for more complex datasets?</li>
<li>Is there a way to create discrete embeddings with fewer activations, but where each activation is more meaningful, so that feature definitions will be easier?</li>
<li>Is there a useful distribution that we could enforce on discrete embeddings to make feature definitions easier to work with? Is there some way to enforce, e.g., <a href="https://en.wikipedia.org/wiki/Zipf&#39;s_law">Zipf’s law</a> on the definitions of certain landmark features?</li>
<li>Is there a general method for constructing feature definitions in terms of an explicit embedding that will extend to other use cases?</li>
<li>One of the key motivations behind the use of a symbolic language is communication. But here, all symbols were created intra-agent: they have meaning only within the agent’s mind (we constructed symbolic definitions of our features by feeding the agent real-valued vectors). How could we design a useful inter-agent model, so that agents could use symbols to communicate?</li>
<li>How do the methods here relate to GANs and VAEs? Can they be combined in a useful way?</li>
<li>Is there a good way to define a feature discriminator, i.e., to decide whether something is an instance of a feature or not (remember, it has only 1 sample of the feature to learn from)?</li>
<li>Indexing discrete features requires that representations remain static.
Is there a good way to reindex past features so that we can overcome this limitation?</li> <li>Is there a way we can take advantage of the distribution of the softmax activations in the embedding layer (before we turn it into a one-hot vector)? For instance, to give us a measure of our network’s confidence in its final predictions, or the quality of the discrete embeddings?</li> </ul> <p>And also two specific questions about particular architectures:</p> <ul> <li>Would adding an external generator to the MemNN increase the range of tasks it can solve? (E.g., to QA about hypothetical scenarios).</li> <li>Can we create discrete embeddings of word vectors that behave like word2vec, and perhaps offer some of the advantages of discrete embeddings? Cf. <a href="https://arxiv.org/abs/1506.02004">Faruqui et al. (2015)</a>.</li> </ul> <section class="footnotes"> <hr /> <ol> <li id="fn1"><p>If this is something you’re interested in exploring yourself, my planned starting point is this <a href="http://blog.otoro.net/2016/04/01/generating-large-images-from-latent-vectors/">post</a> by <a href="https://github.com/hardmaru">hardmaru</a> (and the underlying <a href="https://github.com/hardmaru/cppn-gan-vae-tensorflow">implementation</a>), which uses a Tensorflow-based VAE-GAN to turn MNIST into something quite beautiful.<a href="#fnref1">↩</a></p></li> </ol> </section> </body> </html> Beyond Binary: Ternary and One-hot Neurons2017-02-08T00:00:00-05:002017-02-08T00:00:00-05:00Silviu Pitistag:r2rt.com,2017-02-08:/beyond-binary-ternary-and-one-hot-neurons.htmlWhile playing with some applications of binary neurons, I found myself wanting to use explicit activations that go beyond a simple yes/no decision. For example, we might want our neural network to make a choice between several categories (in the form of a one-hot vector) or we might want it to make a choice between ordered categories (e.g., a scale of 1 to 10). 
It's rather easy to extend the straight-through estimator to work well on both of these cases, and I thought I would share my work in this post. I share code for implementing ternary and one-hot neurons in Tensorflow, and show that they can learn to solve MNIST.<!DOCTYPE html> <html> <head> <meta charset="utf-8"> <meta name="generator" content="pandoc"> <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes"> <title></title> <style type="text/css">code{white-space: pre;}</style> <style type="text/css"> div.sourceCode { overflow-x: auto; } table.sourceCode, tr.sourceCode, td.lineNumbers, td.sourceCode { margin: 0; padding: 0; vertical-align: baseline; border: none; } table.sourceCode { width: 100%; line-height: 100%; } td.lineNumbers { text-align: right; padding-right: 4px; padding-left: 4px; color: #aaaaaa; border-right: 1px solid #aaaaaa; } td.sourceCode { padding-left: 5px; } code > span.kw { color: #007020; font-weight: bold; } /* Keyword */ code > span.dt { color: #902000; } /* DataType */ code > span.dv { color: #40a070; } /* DecVal */ code > span.bn { color: #40a070; } /* BaseN */ code > span.fl { color: #40a070; } /* Float */ code > span.ch { color: #4070a0; } /* Char */ code > span.st { color: #4070a0; } /* String */ code > span.co { color: #60a0b0; font-style: italic; } /* Comment */ code > span.ot { color: #007020; } /* Other */ code > span.al { color: #ff0000; font-weight: bold; } /* Alert */ code > span.fu { color: #06287e; } /* Function */ code > span.er { color: #ff0000; font-weight: bold; } /* Error */ code > span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */ code > span.cn { color: #880000; } /* Constant */ code > span.sc { color: #4070a0; } /* SpecialChar */ code > span.vs { color: #4070a0; } /* VerbatimString */ code > span.ss { color: #bb6688; } /* SpecialString */ code > span.im { } /* Import */ code > span.va { color: #19177c; } /* Variable */ code > span.cf { color: #007020; 
font-weight: bold; } /* ControlFlow */ code > span.op { color: #666666; } /* Operator */ code > span.bu { } /* BuiltIn */ code > span.ex { } /* Extension */ code > span.pp { color: #bc7a00; } /* Preprocessor */ code > span.at { color: #7d9029; } /* Attribute */ code > span.do { color: #ba2121; font-style: italic; } /* Documentation */ code > span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */ code > span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */ code > span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */ </style> <script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-MML-AM_CHTML" type="text/javascript"></script> <!--[if lt IE 9]> <script src="//cdnjs.cloudflare.com/ajax/libs/html5shiv/3.7.3/html5shiv-printshiv.min.js"></script> <![endif]--> </head> <body> <p>While playing with some applications of binary neurons, I found myself wanting to use explicit activations that go beyond a simple yes/no decision. For example, we might want our neural network to make a choice between several categories (in the form of a one-hot vector) or we might want it to make a choice between ordered categories (e.g., a scale of 1 to 10). It’s rather easy to extend the straight-through estimator to work well on both of these cases, and I thought I would share my work in this post. I share code for implementing ternary and one-hot neurons in Tensorflow, and show that they can learn to solve MNIST.</p> <p>This is a follow-up post to <a href="http://r2rt.com/binary-stochastic-neurons-in-tensorflow.html">Binary Stochastic Neurons in Tensorflow</a>, and assumes familiarity with binary neurons and the straight-through estimator discussed therein.</p> <p><strong>Note Feb. 
8, 2017</strong>: I haven’t had a chance to read either of these two papers that I came across after writing this post; they look like they are related to the straight-through softmax activation (I think the first offers up an even better estimator… to be decided):</p>
<ul>
<li><a href="https://arxiv.org/abs/1611.01144">Categorical Reparameterization with Gumbel-Softmax</a></li>
<li><a href="https://arxiv.org/abs/1611.00712">The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables</a></li>
</ul>
<p>I plan to update this post once I get around to reading them.</p>
<p><strong>IPython Notebook</strong>: This post is also available as an IPython notebook <a href="https://gist.github.com/spitis/4d600aa019a5689953b1b61a8ddb074d">here</a>.</p>
<h2 id="general-n-ary-neurons">General n-ary neurons</h2>
<p>Whereas the binary neuron outputs 0 or 1, a ternary neuron might output -1, 0, or 1 (or alternatively, 0, 1, or 2). Similarly, we could create an arbitrary n-ary neuron that outputs ordered categories, such as a scale from 1 to 10. In fact, apart from the activation function, the code for all of these neurons is the same as the code for the binary neuron: we either round the real output of the activation function to the nearest integer (deterministic), or use its decimal portion to sample either the integer below or the integer above from a Bernoulli distribution (stochastic). Note that this means stochastic choices are made only between two adjacent categories, and never across all categories. On the backward pass, we use the straight-through estimator, which means that we replace the gradient of rounding (deterministic) or sampling (stochastic) with the identity. If our activation function is thresholded to [0, 1], this results in a binary neuron, but if it is thresholded to [-1, 1], we can output three ordered values.</p>
<p>We might be tempted to use tanh, which is thresholded to [-1, 1], to create ternary neurons.
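</p>
<p>As a framework-free sketch of the forward pass just described (a hypothetical NumPy stand-in, not the TensorFlow implementation used in this post): the deterministic version rounds each activation to the nearest integer, while the stochastic version treats the fractional part as the probability of taking the integer above in a Bernoulli draw.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def deterministic_forward(a):
    """Round each real-valued activation to the nearest integer."""
    return np.round(a)

def stochastic_forward(a):
    """Sample floor(a) or floor(a)+1: the fractional part of `a` is the
    probability of taking the integer above (a Bernoulli draw)."""
    frac = a - np.floor(a)
    return np.floor(a) + (rng.random(a.shape) < frac)

a = np.array([-0.9, -0.2, 0.4, 0.6])
print(deterministic_forward(a))  # rounds to the nearest integers: -1, 0, 0, 1
```

<p>Note that the stochastic version is unbiased: an activation of 0.6 comes out as 1 about 60% of the time and as 0 otherwise, so the expected output equals the real-valued activation.</p>
<p>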
Picking the right activation, however, is a bit trickier than finding a function that has the correct range. The standard tanh is not very good here because its slope near 0 is close to 1, so that a neuron outputting 0 will tend to get pushed away from 0. Instead, we want something that looks like a soft set of stairs (where each step is a sigmoid), so that the neuron can learn to consistently output intermediate values.</p> <p>With such an activation, a neuron outputting 0 will have a small slope on the backward pass, so that many mistakes (in the same direction) will be required to move it towards 1 or -1.</p> <p>For the ternary case, the following function works well:</p> <p><span class="math display">$f(x) = 1.5\tanh(x) + 0.5\tanh(-3x).$</span></p> <p>Here is its plot drawn by Wolfram Alpha:</p> <figure> <img src="https://r2rt.com/static/images/BB_ternary_activation.png" alt="Ternary activation function" /><figcaption>Ternary activation function</figcaption> </figure> <p>This activation function works well because its slope goes to 0 as it approaches each of the integers in the output range (note that the binary neuron has this property too). Here’s the plot of the derivative for reference:</p> <figure> <img src="https://r2rt.com/static/images/BB_ternary_derivative.png" alt="Ternary activation derivative" /><figcaption>Ternary activation derivative</figcaption> </figure> <p>The above ternary function is implemented below. I’ve been more interested in the one-hot activations, so I haven’t figured out how to make slope annealing work for this ternary neuron, or a general formula for n-ary neurons. If the mathematically-inclined reader would like to leave some ideas in the comments, that would be much appreciated.</p> <h2 id="one-hot-neurons">One-hot neurons</h2> <p>N-ary neurons are cool, but restrictive. Binary neurons are restricted to yes/no decisions, and ternary+ neurons express the prior that the output categories are linearly ordered. 
As an example to illustrate this “ordering” point, consider the categories “dislike”, “neutral” and “like”. There is a natural order between these categories (“like” is closer to “neutral” than to “dislike”). Most categories, however, do not have a natural ordering. For example, there is no natural order between Boston, New York and Toronto. To create a neuron that can decide between unordered categories, we would like it to output a one-hot vector like [1, 0, 0] (Boston) or [0, 0, 1] (Toronto). Luckily, the straight-through estimator extends nicely to this scenario.</p>
<p>We define a d-dimensional one-hot neuron, <span class="math inline">$$N_d: \mathbb{R}^n \to \mathbb{R}^d$$</span>, as follows. Given an input, <span class="math inline">$$x \in \mathbb{R}^n$$</span>, we perform the following steps:</p>
<ol type="1">
<li><p>Compute a d-dimensional vector of logits, <span class="math inline">$$z = W^\top x + b$$</span>, where <span class="math inline">$$W \in \mathbb{R}^{n \times d}$$</span> and <span class="math inline">$$b \in \mathbb{R}^d$$</span>.</p></li>
<li><p>Compute softmax activations, <span class="math inline">$$\hat{y} = \text{softmax}_\tau(z)$$</span>, whose <span class="math inline">$$i$$</span>-th component is</p> <p><span class="math display">$\hat{y}_i = \frac{\exp(z_i /\tau)}{\sum_{k=1}^d \exp(z_k / \tau)}$</span></p> <p>where <span class="math inline">$$z_i$$</span> is the <span class="math inline">$$i$$</span>-th component of <span class="math inline">$$z$$</span>, and <span class="math inline">$$\tau \in (0, \infty)$$</span> is the temperature (used for annealing).</p></li>
<li><p>Next, we either sample a one-hot vector according to the distribution defined by <span class="math inline">$$\hat{y}$$</span> (stochastic), or simply use the maximum value of <span class="math inline">$$\hat{y}$$</span> to determine the one-hot output (deterministic). The result is the output of our neuron, <span class="math inline">$$y$$</span>.
A formal definition of both of these operations is a bit too ugly for this blog post, but this should be fairly straightforward.</p></li> <li><p>Neither sampling from nor using the maximum of <span class="math inline">$$\hat{y}$$</span> have useful (non-zero) gradients. But we can use the straight-through estimator and replace their gradient on the backward pass with the identity. As in the binary case, this leads to a learnable softmax activation.</p></li> </ol> <p>This definition of one-hot neurons allows for temperature-annealing to be used (note that whereas slope is increased during annealing, temperature is decreased during annealing), which we test below.</p> <h2 id="implementation-in-tensorflow">Implementation in Tensorflow</h2> <h4 id="imports-and-helper-functions">Imports and helper functions</h4> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> numpy <span class="im">as</span> np, tensorflow <span class="im">as</span> tf, matplotlib.pyplot <span class="im">as</span> plt, seaborn <span class="im">as</span> sns <span class="im">from</span> tensorflow.examples.tutorials.mnist <span class="im">import</span> input_data <span class="op">%</span>matplotlib inline sns.<span class="bu">set</span>(color_codes<span class="op">=</span><span class="va">True</span>) mnist <span class="op">=</span> input_data.read_data_sets(<span class="st">&#39;MNIST_data&#39;</span>, one_hot<span class="op">=</span><span class="va">True</span>) <span class="im">from</span> tensorflow.python.framework <span class="im">import</span> ops <span class="im">from</span> collections <span class="im">import</span> Counter <span class="kw">def</span> reset_graph(): <span class="cf">if</span> <span class="st">&#39;sess&#39;</span> <span class="kw">in</span> <span class="bu">globals</span>() <span class="kw">and</span> sess: sess.close() tf.reset_default_graph() <span class="kw">def</span> layer_linear(inputs, shape, scope<span 
class="op">=</span><span class="st">&#39;linear_layer&#39;</span>): <span class="cf">with</span> tf.variable_scope(scope): w <span class="op">=</span> tf.get_variable(<span class="st">&#39;w&#39;</span>,shape) b <span class="op">=</span> tf.get_variable(<span class="st">&#39;b&#39;</span>,shape[<span class="op">-</span><span class="dv">1</span>:]) <span class="cf">return</span> tf.matmul(inputs,w) <span class="op">+</span> b <span class="kw">def</span> layer_softmax(inputs, shape, scope<span class="op">=</span><span class="st">&#39;softmax_layer&#39;</span>): <span class="cf">with</span> tf.variable_scope(scope): w <span class="op">=</span> tf.get_variable(<span class="st">&#39;w&#39;</span>,shape) b <span class="op">=</span> tf.get_variable(<span class="st">&#39;b&#39;</span>,shape[<span class="op">-</span><span class="dv">1</span>:]) <span class="cf">return</span> tf.nn.softmax(tf.matmul(inputs,w) <span class="op">+</span> b) <span class="kw">def</span> compute_accuracy(y, pred): correct <span class="op">=</span> tf.equal(tf.argmax(y,<span class="dv">1</span>), tf.argmax(pred,<span class="dv">1</span>)) <span class="cf">return</span> tf.reduce_mean(tf.cast(correct, tf.float32)) <span class="kw">def</span> plot_n(data_and_labels, lower_y <span class="op">=</span> <span class="fl">0.</span>, title<span class="op">=</span><span class="st">&quot;Learning Curves&quot;</span>): fig, ax <span class="op">=</span> plt.subplots() <span class="cf">for</span> data, label <span class="kw">in</span> data_and_labels: ax.plot(<span class="bu">range</span>(<span class="dv">0</span>,<span class="bu">len</span>(data)<span class="op">*</span><span class="dv">100</span>,<span class="dv">100</span>),data, label<span class="op">=</span>label) ax.set_xlabel(<span class="st">&#39;Training steps&#39;</span>) ax.set_ylabel(<span class="st">&#39;Accuracy&#39;</span>) ax.set_ylim([lower_y,<span class="dv">1</span>]) ax.set_title(title) ax.legend(loc<span class="op">=</span><span 
class="dv">4</span>) plt.show()</code></pre></div> <h4 id="functions-for-ternary-and-n-ary-neurons">Functions for ternary and n-ary neurons</h4> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> st_round(x): <span class="co">&quot;&quot;&quot;Rounds a tensor using the straight through estimator for the gradient.&quot;&quot;&quot;</span> g <span class="op">=</span> tf.get_default_graph() <span class="cf">with</span> ops.name_scope(<span class="st">&quot;StRound&quot;</span>) <span class="im">as</span> name: <span class="cf">with</span> g.gradient_override_map({<span class="st">&quot;Round&quot;</span>: <span class="st">&quot;Identity&quot;</span>}): <span class="cf">return</span> tf.<span class="bu">round</span>(x, name<span class="op">=</span>name)</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> sample_closest_ints(x): <span class="co">&quot;&quot;&quot;If x is a float, then samples floor(x) with probability x - floor(x), and ceil(x) with</span> <span class="co"> probability ceil(x) - x, using the straight through estimator for the gradient.</span> <span class="co"> E.g.,:</span> <span class="co"> if x is 0.6, sample_closest_ints(x) will be 1 with probability 0.6, and 0 otherwise,</span> <span class="co"> and the gradient will be pass-through (identity).</span> <span class="co"> &quot;&quot;&quot;</span> <span class="cf">with</span> ops.name_scope(<span class="st">&quot;SampleClosestInts&quot;</span>) <span class="im">as</span> name: <span class="cf">with</span> tf.get_default_graph().gradient_override_map({<span class="st">&quot;Ceil&quot;</span>: <span class="st">&quot;Identity&quot;</span>,<span class="st">&quot;Sub&quot;</span>: <span class="st">&quot;SampleClosestInts&quot;</span>}): <span class="cf">return</span> tf.ceil(x <span class="op">-</span> tf.random_uniform(tf.shape(x)), name<span 
class="op">=</span>name) <span class="at">@ops.RegisterGradient</span>(<span class="st">&quot;SampleClosestInts&quot;</span>) <span class="kw">def</span> sample_closest_ints_grad(op, grad): <span class="cf">return</span> [grad, tf.zeros(tf.shape(op.inputs[<span class="dv">1</span>]))]</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> binary_activation(x, slope_tensor): <span class="cf">return</span> tf.cond(tf.equal(<span class="fl">1.</span>, slope_tensor), <span class="kw">lambda</span>: tf.sigmoid(x), <span class="kw">lambda</span>: tf.sigmoid(slope_tensor <span class="op">*</span> x)) <span class="kw">def</span> ternary_activation(x, slope_tensor <span class="op">=</span> <span class="va">None</span>, alpha <span class="op">=</span> <span class="dv">1</span>): <span class="co">&quot;&quot;&quot;</span> <span class="co"> Does not support slope annealing (slope_tensor is ignored)</span> <span class="co"> Wolfram Alpha plot:</span> <span class="co"> https://www.wolframalpha.com/input/?i=plot+(1.5*tanh(x)+%2B+0.5(tanh(-(3-1e-2)*x))),+x%3D+-2+to+2</span> <span class="co"> &quot;&quot;&quot;</span> <span class="cf">return</span> <span class="fl">1.5</span><span class="op">*</span>tf.tanh(alpha<span class="op">*</span>x) <span class="op">+</span> <span class="fl">0.5</span><span class="op">*</span>(tf.tanh(<span class="op">-</span>(<span class="dv">3</span><span class="op">/</span>alpha)<span class="op">*</span>x))</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> n_ary_activation(x, activation<span class="op">=</span>binary_activation, slope_tensor<span class="op">=</span><span class="va">None</span>, stochastic_tensor<span class="op">=</span><span class="va">None</span>): <span class="co">&quot;&quot;&quot;</span> <span class="co"> n-ary activation for creating binary and ternary neurons (and n-ary 
neurons, if you can</span> <span class="co"> create the right activation function). Given a tensor and an activation, it applies the</span> <span class="co"> activation to the tensor, and then either samples the results (if stochastic tensor is</span> <span class="co"> true), or rounds the results (if stochastic_tensor is false) to the closest integer</span> <span class="co"> values. The default activation is a sigmoid (when slope_tensor = 1), which results in a</span> <span class="co"> binary neuron, as in http://r2rt.com/binary-stochastic-neurons-in-tensorflow.html.</span> <span class="co"> Uses the straight through estimator during backprop. See https://arxiv.org/abs/1308.3432.</span> <span class="co"> Arguments:</span> <span class="co"> * x: the pre-activation / logit tensor</span> <span class="co"> * activation: sigmoid, hard sigmoid, or n-ary activation</span> <span class="co"> * slope_tensor: slope adjusts the slope of the activation function, for purposes of the</span> <span class="co"> Slope Annealing Trick (see http://arxiv.org/abs/1609.01704)</span> <span class="co"> * stochastic_tensor: whether to sample the closest integer, or round to it.</span> <span class="co"> &quot;&quot;&quot;</span> <span class="cf">if</span> slope_tensor <span class="kw">is</span> <span class="va">None</span>: slope_tensor <span class="op">=</span> tf.constant(<span class="fl">1.0</span>) <span class="cf">if</span> stochastic_tensor <span class="kw">is</span> <span class="va">None</span>: stochastic_tensor <span class="op">=</span> tf.constant(<span class="va">True</span>) p <span class="op">=</span> activation(x, slope_tensor) <span class="cf">return</span> tf.cond(stochastic_tensor, <span class="kw">lambda</span>: sample_closest_ints(p), <span class="kw">lambda</span>: st_round(p))</code></pre></div> <h4 id="functions-to-make-a-layer-of-one-hot-neurons">Functions to make a layer of one-hot neurons</h4> <div class="sourceCode"><pre class="sourceCode python"><code 
class="sourceCode python"><span class="kw">def</span> st_sampled_softmax(logits): <span class="co">&quot;&quot;&quot;Takes logits and samples a one-hot vector according to them, using the straight</span> <span class="co"> through estimator on the backward pass.&quot;&quot;&quot;</span> <span class="cf">with</span> ops.name_scope(<span class="st">&quot;STSampledSoftmax&quot;</span>) <span class="im">as</span> name: probs <span class="op">=</span> tf.nn.softmax(logits) onehot_dims <span class="op">=</span> logits.get_shape().as_list()[<span class="dv">1</span>] res <span class="op">=</span> tf.one_hot(tf.squeeze(tf.multinomial(logits, <span class="dv">1</span>), <span class="dv">1</span>), onehot_dims, <span class="fl">1.0</span>, <span class="fl">0.0</span>) <span class="cf">with</span> tf.get_default_graph().gradient_override_map({<span class="st">&#39;Ceil&#39;</span>: <span class="st">&#39;Identity&#39;</span>, <span class="st">&#39;Mul&#39;</span>: <span class="st">&#39;STMul&#39;</span>}): <span class="cf">return</span> tf.ceil(res<span class="op">*</span>probs) <span class="kw">def</span> st_hardmax_softmax(logits): <span class="co">&quot;&quot;&quot;Takes logits and creates a one-hot vector with a 1 in the position of the maximum</span> <span class="co"> logit, using the straight through estimator on the backward pass.&quot;&quot;&quot;</span> <span class="cf">with</span> ops.name_scope(<span class="st">&quot;STHardmaxSoftmax&quot;</span>) <span class="im">as</span> name: probs <span class="op">=</span> tf.nn.softmax(logits) onehot_dims <span class="op">=</span> logits.get_shape().as_list()[<span class="dv">1</span>] res <span class="op">=</span> tf.one_hot(tf.argmax(probs, <span class="dv">1</span>), onehot_dims, <span class="fl">1.0</span>, <span class="fl">0.0</span>) <span class="cf">with</span> tf.get_default_graph().gradient_override_map({<span class="st">&#39;Ceil&#39;</span>: <span class="st">&#39;Identity&#39;</span>, <span 
class="st">&#39;Mul&#39;</span>: <span class="st">&#39;STMul&#39;</span>}): <span class="cf">return</span> tf.ceil(res<span class="op">*</span>probs) <span class="at">@ops.RegisterGradient</span>(<span class="st">&quot;STMul&quot;</span>) <span class="kw">def</span> st_mul(op, grad): <span class="co">&quot;&quot;&quot;Straight-through replacement for Mul gradient (does not support broadcasting).&quot;&quot;&quot;</span> <span class="cf">return</span> [grad, grad] <span class="kw">def</span> layer_hard_softmax(x, shape, onehot_dims, temperature_tensor<span class="op">=</span><span class="va">None</span>, stochastic_tensor<span class="op">=</span><span class="va">None</span>, scope<span class="op">=</span><span class="st">&#39;hard_softmax_layer&#39;</span>): <span class="co">&quot;&quot;&quot;</span> <span class="co"> Creates a layer of one-hot neurons. Note that the neurons are flattened before returning,</span> <span class="co"> so that the shape of the layer needs to be a multiple of the dimension of the one-hot outputs.</span> <span class="co"> Arguments:</span> <span class="co"> * x: the layer inputs / previous layer</span> <span class="co"> * shape: the tuple of [size_previous, layer_size]. 
Layer_size must be a multiple of onehot_dims,</span> <span class="co"> since each neuron&#39;s output is flattened (i.e., the number of neurons will only be</span> <span class="co"> layer_size / onehot_dims)</span> <span class="co"> * onehot_dims: the size of each neuron&#39;s output</span> <span class="co"> * temperature_tensor: the temperature for the softmax</span> <span class="co"> * stochastic_tensor: whether the one hot outputs are sampled from the softmax distribution</span> <span class="co"> (stochastic - recommended for training), or chosen according to its maximal element</span> <span class="co"> (deterministic - recommended for inference)</span> <span class="co"> &quot;&quot;&quot;</span> <span class="cf">assert</span>(<span class="bu">len</span>(shape) <span class="op">==</span> <span class="dv">2</span>) <span class="cf">assert</span>(shape[<span class="dv">1</span>] <span class="op">%</span> onehot_dims <span class="op">==</span> <span class="dv">0</span>) <span class="cf">if</span> temperature_tensor <span class="kw">is</span> <span class="va">None</span>: temperature_tensor <span class="op">=</span> tf.constant(<span class="fl">1.</span>) <span class="cf">if</span> stochastic_tensor <span class="kw">is</span> <span class="va">None</span>: stochastic_tensor <span class="op">=</span> tf.constant(<span class="va">True</span>) <span class="cf">with</span> tf.variable_scope(scope): w <span class="op">=</span> tf.get_variable(<span class="st">&#39;w&#39;</span>,shape) b <span class="op">=</span> tf.get_variable(<span class="st">&#39;b&#39;</span>,shape[<span class="op">-</span><span class="dv">1</span>:]) logits <span class="op">=</span> tf.reshape((tf.matmul(x, w) <span class="op">+</span> b) <span class="op">/</span> temperature_tensor, [<span class="op">-</span><span class="dv">1</span>, onehot_dims]) <span class="cf">return</span> tf.cond(stochastic_tensor, <span class="kw">lambda</span>: tf.reshape(st_sampled_softmax(logits), [<span 
class="op">-</span><span class="dv">1</span>, shape[<span class="dv">1</span>]]), <span class="kw">lambda</span>: tf.reshape(st_hardmax_softmax(logits), [<span class="op">-</span><span class="dv">1</span>, shape[<span class="dv">1</span>]]))</code></pre></div> <h4 id="function-to-build-graph-for-mnist-classifier">Function to build graph for MNIST classifier</h4> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> build_classifier(hidden_dims<span class="op">=</span>[<span class="dv">100</span>], lr <span class="op">=</span> <span class="fl">0.5</span>, activation<span class="op">=</span>binary_activation, onehot_dims <span class="op">=</span> <span class="dv">0</span>): reset_graph() <span class="co">&quot;&quot;&quot;Placeholders&quot;&quot;&quot;</span> x <span class="op">=</span> tf.placeholder(tf.float32, [<span class="va">None</span>, <span class="dv">784</span>], name<span class="op">=</span><span class="st">&#39;x_placeholder&#39;</span>) y <span class="op">=</span> tf.placeholder(tf.float32, [<span class="va">None</span>, <span class="dv">10</span>], name<span class="op">=</span><span class="st">&#39;y_placeholder&#39;</span>) stochastic_tensor <span class="op">=</span> tf.constant(<span class="va">True</span>) slope_tensor <span class="op">=</span> tf.constant(<span class="fl">1.0</span>) temperature_tensor <span class="op">=</span> <span class="dv">1</span> <span class="op">/</span> slope_tensor layers <span class="op">=</span> {<span class="dv">0</span>: x} num_hidden_layers <span class="op">=</span> <span class="bu">len</span>(hidden_dims) dims <span class="op">=</span> [<span class="dv">784</span>] <span class="op">+</span> hidden_dims <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">1</span>, num_hidden_layers<span class="op">+</span><span class="dv">1</span>): <span class="cf">with</span> tf.variable_scope(<span 
class="st">&quot;layer_&quot;</span> <span class="op">+</span> <span class="bu">str</span>(i)): <span class="cf">if</span> onehot_dims: layers[i] <span class="op">=</span> layer_hard_softmax(layers[i<span class="dv">-1</span>], dims[i<span class="dv">-1</span>:i<span class="op">+</span><span class="dv">1</span>], onehot_dims, temperature_tensor, stochastic_tensor) <span class="cf">else</span>: pre_activations <span class="op">=</span> layer_linear(layers[i<span class="dv">-1</span>], dims[i<span class="dv">-1</span>:i<span class="op">+</span><span class="dv">1</span>], scope<span class="op">=</span><span class="st">&#39;layer_&#39;</span> <span class="op">+</span> <span class="bu">str</span>(i)) <span class="cf">if</span> activation <span class="kw">is</span> tf.tanh <span class="kw">or</span> activation <span class="kw">is</span> tf.sigmoid: layers[i] <span class="op">=</span> activation(pre_activations) <span class="cf">else</span>: layers[i] <span class="op">=</span> n_ary_activation(pre_activations, activation, slope_tensor, stochastic_tensor) final_hidden_layer <span class="op">=</span> layers[num_hidden_layers] preds <span class="op">=</span> layer_softmax(final_hidden_layer, [dims[<span class="op">-</span><span class="dv">1</span>], <span class="dv">10</span>]) loss <span class="op">=</span> <span class="op">-</span>tf.reduce_mean(y <span class="op">*</span> tf.log(preds), reduction_indices<span class="op">=</span><span class="dv">1</span>) ts <span class="op">=</span> tf.train.GradientDescentOptimizer(lr).minimize(loss) accuracy <span class="op">=</span> compute_accuracy(y, preds) <span class="cf">return</span> <span class="bu">dict</span>( x<span class="op">=</span>x, y<span class="op">=</span>y, stochastic<span class="op">=</span>stochastic_tensor, slope<span class="op">=</span>slope_tensor, final_hidden_layer <span class="op">=</span> final_hidden_layer, loss<span class="op">=</span>loss, ts<span class="op">=</span>ts, accuracy<span 
class="op">=</span>accuracy, init_op<span class="op">=</span>tf.global_variables_initializer() )</code></pre></div> <h4 id="function-to-train-the-classifier">Function to train the classifier</h4> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> train_classifier(<span class="op">\</span> hidden_dims<span class="op">=</span>[<span class="dv">100</span>,<span class="dv">100</span>], activation<span class="op">=</span>binary_activation, onehot_dims <span class="op">=</span> <span class="dv">0</span>, stochastic_train<span class="op">=</span><span class="va">True</span>, stochastic_eval<span class="op">=</span><span class="va">True</span>, slope_annealing_rate<span class="op">=</span><span class="va">None</span>, epochs<span class="op">=</span><span class="dv">20</span>, lr<span class="op">=</span><span class="fl">0.5</span>, verbose<span class="op">=</span><span class="va">True</span>, label<span class="op">=</span><span class="va">None</span>): g <span class="op">=</span> build_classifier(hidden_dims<span class="op">=</span>hidden_dims, lr<span class="op">=</span>lr, activation<span class="op">=</span>activation, onehot_dims<span class="op">=</span>onehot_dims) <span class="cf">with</span> tf.Session() <span class="im">as</span> sess: sess.run(g[<span class="st">&#39;init_op&#39;</span>]) slope <span class="op">=</span> <span class="dv">1</span> res_tr, res_val, sample_layers <span class="op">=</span> [], [], [] <span class="cf">for</span> epoch <span class="kw">in</span> <span class="bu">range</span>(epochs): feed_dict<span class="op">=</span>{g[<span class="st">&#39;x&#39;</span>]: mnist.validation.images, g[<span class="st">&#39;y&#39;</span>]: mnist.validation.labels, g[<span class="st">&#39;stochastic&#39;</span>]: stochastic_eval, g[<span class="st">&#39;slope&#39;</span>]: slope} acc, final_hidden <span class="op">=</span> sess.run([g[<span class="st">&#39;accuracy&#39;</span>], g[<span 
class="st">&#39;final_hidden_layer&#39;</span>]], feed_dict<span class="op">=</span>feed_dict) sample_layers.append(final_hidden) <span class="cf">if</span> verbose: <span class="bu">print</span>(<span class="st">&quot;Epoch&quot;</span>, epoch, acc) <span class="cf">else</span>: <span class="bu">print</span>(<span class="st">&#39;.&#39;</span>, end<span class="op">=</span><span class="st">&#39;&#39;</span>) accuracy <span class="op">=</span> <span class="dv">0</span> <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">1</span>,<span class="dv">1001</span>): x, y <span class="op">=</span> mnist.train.next_batch(<span class="dv">50</span>) feed_dict<span class="op">=</span>{g[<span class="st">&#39;x&#39;</span>]: x, g[<span class="st">&#39;y&#39;</span>]: y, g[<span class="st">&#39;stochastic&#39;</span>]: stochastic_train} acc, _ <span class="op">=</span> sess.run([g[<span class="st">&#39;accuracy&#39;</span>],g[<span class="st">&#39;ts&#39;</span>]], feed_dict<span class="op">=</span>feed_dict) accuracy <span class="op">+=</span> acc <span class="cf">if</span> i <span class="op">%</span> <span class="dv">100</span> <span class="op">==</span> <span class="dv">0</span> <span class="kw">and</span> i <span class="op">&gt;</span> <span class="dv">0</span>: res_tr.append(accuracy<span class="op">/</span><span class="dv">100</span>) accuracy <span class="op">=</span> <span class="dv">0</span> feed_dict<span class="op">=</span>{g[<span class="st">&#39;x&#39;</span>]: mnist.validation.images, g[<span class="st">&#39;y&#39;</span>]: mnist.validation.labels, g[<span class="st">&#39;stochastic&#39;</span>]: stochastic_eval, g[<span class="st">&#39;slope&#39;</span>]: slope} res_val.append(sess.run(g[<span class="st">&#39;accuracy&#39;</span>], feed_dict<span class="op">=</span>feed_dict)) <span class="cf">if</span> slope_annealing_rate <span class="kw">is</span> <span class="kw">not</span> <span class="va">None</span>: slope 
<span class="op">=</span> slope<span class="op">*</span>slope_annealing_rate <span class="cf">if</span> verbose: <span class="bu">print</span>(<span class="st">&quot;Annealed slope:&quot;</span>, slope, <span class="st">&quot;| Annealed temperature:&quot;</span>, <span class="dv">1</span><span class="op">/</span> slope) feed_dict<span class="op">=</span>{g[<span class="st">&#39;x&#39;</span>]: mnist.validation.images, g[<span class="st">&#39;y&#39;</span>]: mnist.validation.labels, g[<span class="st">&#39;stochastic&#39;</span>]: stochastic_eval, g[<span class="st">&#39;slope&#39;</span>]: slope} <span class="bu">print</span>(<span class="st">&quot;</span><span class="ch">\n</span><span class="st">Final epoch, epoch&quot;</span>, epoch<span class="op">+</span><span class="dv">1</span>, <span class="st">&quot;:&quot;</span>, sess.run(g[<span class="st">&#39;accuracy&#39;</span>], feed_dict<span class="op">=</span>feed_dict)) <span class="cf">if</span> label <span class="kw">is</span> <span class="kw">not</span> <span class="va">None</span>: <span class="cf">return</span> (res_tr, label <span class="op">+</span> <span class="st">&quot; - Training&quot;</span>), (res_val, label <span class="op">+</span> <span class="st">&quot; - Validation&quot;</span>) <span class="cf">else</span>: <span class="cf">return</span> [(res_tr, <span class="st">&quot;Training&quot;</span>), (res_val, <span class="st">&quot;Validation&quot;</span>)], sample_layers</code></pre></div> <h3 id="experiments">Experiments</h3> <p>Let us now demonstrate that the above neurons train on MNIST, and can achieve reasonably good results. I’m not hunting for hyperparameters here, so the below may not be optimal. 
All training is done over 20 epochs with a learning rate of 0.1.</p> <h4 id="tanh-baseline-real-valued-activations">Tanh baseline (real-valued activations)</h4> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">res, _ <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">100</span>], activation<span class="op">=</span>tf.tanh, epochs<span class="op">=</span><span class="dv">20</span>, lr<span class="op">=</span><span class="fl">0.1</span>, verbose<span class="op">=</span><span class="va">False</span>) plot_n(res, lower_y<span class="op">=</span><span class="fl">0.8</span>, title<span class="op">=</span><span class="st">&quot;Tanh Baseline&quot;</span>)</code></pre></div> <pre><code>.................... Final epoch, epoch 20 : 0.9772</code></pre> <figure> <img src="https://r2rt.com/static/images/BB_output_21_1.png" alt="png" /><figcaption>png</figcaption> </figure> <h4 id="binary-neurons-with-slope-annealing-baseline">Binary neurons with slope annealing baseline</h4> <p>More results on binary neurons available in my prior <a href="http://r2rt.com/binary-stochastic-neurons-in-tensorflow.html">post</a>.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">res, _ <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">100</span>], activation<span class="op">=</span>binary_activation, epochs<span class="op">=</span><span class="dv">20</span>, lr<span class="op">=</span><span class="fl">0.1</span>, slope_annealing_rate<span class="op">=</span><span class="fl">1.1</span>, stochastic_eval<span class="op">=</span><span class="va">False</span>, verbose<span class="op">=</span><span class="va">False</span>) plot_n(res, lower_y<span class="op">=</span><span class="fl">0.8</span>, title<span class="op">=</span><span class="st">&quot;Binary Stochastic w/ Slope Annealing Baseline&quot;</span>)</code></pre></div> 
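<p>As an aside, the forward pass that the one-hot layer above computes can be illustrated without TensorFlow. The following is a plain-NumPy sketch of my own (the helper name <code>onehot_layer_forward</code> is made up for illustration, and only the forward pass is shown): each group of <code>onehot_dims</code> logits becomes one neuron, which emits a one-hot vector by sampling from its softmax during training or by argmax at inference, and the per-neuron outputs are flattened back into the layer.</p>

```python
import numpy as np

def softmax(z):
    # numerically stable row-wise softmax
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def onehot_layer_forward(x, w, b, onehot_dims, rng, stochastic=True):
    """Forward pass only: each group of `onehot_dims` logits -> one one-hot output.

    Illustrative sketch; the backward pass (the straight-through estimator) is
    what the gradient_override_map machinery in the TensorFlow code provides."""
    logits = (x @ w + b).reshape(-1, onehot_dims)    # [batch * n_neurons, onehot_dims]
    probs = softmax(logits)
    if stochastic:                                   # training: sample a category per neuron
        idx = np.array([rng.choice(onehot_dims, p=p) for p in probs])
    else:                                            # inference: deterministic argmax
        idx = probs.argmax(axis=1)
    out = np.eye(onehot_dims)[idx]                   # one-hot rows
    return out.reshape(x.shape[0], -1)               # flatten back to [batch, layer_size]

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w = rng.standard_normal((8, 20))                     # layer_size 20 = 4 neurons of 5 dims each
b = np.zeros(20)
y = onehot_layer_forward(x, w, b, onehot_dims=5, rng=rng)
assert y.shape == (4, 20)
assert np.all(y.reshape(-1, 5).sum(axis=1) == 1)     # each neuron emits exactly one 1
```

<p>The TensorFlow version adds only the backward pass on top of this: the straight-through estimator treats the sampling step as the identity, so gradients flow to the logits as if no discretization had happened.</p>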
<pre><code>.................... Final epoch, epoch 20 : 0.9732</code></pre> <figure> <img src="https://r2rt.com/static/images/BB_output_23_1.png" alt="png" /><figcaption>png</figcaption> </figure> <h4 id="ternary-neurons">Ternary neurons</h4> <p>As noted above, I didn’t spend the time to figure out a good way to slope-anneal the ternary activation, so the below is not slope-annealed.</p> <p>The interesting thing to note is that these neurons are more expressive than the binary neurons, and so they not only get closer to the tanh baseline, but they also offer less regularization / overfit more quickly.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">res, sample_layers <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">100</span>], activation<span class="op">=</span>ternary_activation, epochs<span class="op">=</span><span class="dv">20</span>, lr<span class="op">=</span><span class="fl">0.1</span>, stochastic_eval<span class="op">=</span><span class="va">False</span>, verbose<span class="op">=</span><span class="va">False</span>) plot_n(res, lower_y<span class="op">=</span><span class="fl">0.8</span>, title<span class="op">=</span><span class="st">&quot;Ternary Stochastic&quot;</span>)</code></pre></div> <pre><code>.................... 
Final epoch, epoch 20 : 0.9764</code></pre> <figure> <img src="https://r2rt.com/static/images/BB_output_26_1.png" alt="png" /><figcaption>png</figcaption> </figure> <p>If you’re curious as to the distribution of outputs, here it is:</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">c <span class="op">=</span> Counter(np.reshape(sample_layers[<span class="op">-</span><span class="dv">1</span>], [<span class="op">-</span><span class="dv">1</span>])) g <span class="op">=</span> sns.barplot(x <span class="op">=</span> <span class="bu">list</span>(c.keys()), y <span class="op">=</span> <span class="bu">list</span>(c.values())) sns.plt.title(<span class="st">&#39;Distribution of ternary outputs on MNIST&#39;</span>)</code></pre></div> <figure> <img src="https://r2rt.com/static/images/BB_output_28_1.png" alt="png" /><figcaption>png</figcaption> </figure> <h4 id="one-hot-neurons-1">One-hot neurons</h4> <p>First, let’s take a look at what the layer activations look like, in case my description above wasn’t super clear. We’ll pull out the last layer of a network that uses 100 5-dimensional one-hot neurons, and then “unflatten” it into shape [100, 5].</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">res, sample_layers <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">500</span>], onehot_dims<span class="op">=</span><span class="dv">5</span>, epochs<span class="op">=</span><span class="dv">20</span>, lr<span class="op">=</span><span class="fl">0.1</span>, stochastic_eval<span class="op">=</span><span class="va">False</span>, verbose<span class="op">=</span><span class="va">False</span>)</code></pre></div> <pre><code>.................... Final epoch, epoch 20 : 0.9794</code></pre> <p>Here is what the activations of the first 10 neurons of the first sample in the validation set look like. 
As discussed, each neuron outputs a 5D one-hot vector:</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">np.reshape(sample_layers[<span class="dv">0</span>][<span class="dv">0</span>], [<span class="dv">100</span>, <span class="dv">5</span>])[:<span class="dv">10</span>]</code></pre></div> <pre><code>array([[ 0., 0., 0., 0., 1.], [ 0., 1., 0., 0., 0.], [ 0., 0., 0., 1., 0.], [ 0., 0., 1., 0., 0.], [ 1., 0., 0., 0., 0.], [ 0., 0., 1., 0., 0.], [ 0., 0., 1., 0., 0.], [ 0., 1., 0., 0., 0.], [ 0., 0., 0., 1., 0.], [ 0., 1., 0., 0., 0.]], dtype=float32)</code></pre> <h4 id="one-hot-neurons-temperature-annealing">One-hot neurons: temperature annealing</h4> <p>Now, let’s test some 5D neurons to see if temperature annealing does anything. Looks like not really:</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">_, res1 <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">500</span>], onehot_dims<span class="op">=</span><span class="dv">5</span>, epochs<span class="op">=</span><span class="dv">20</span>, lr<span class="op">=</span><span class="fl">0.1</span>, stochastic_eval<span class="op">=</span><span class="va">False</span>, label<span class="op">=</span> <span class="st">&quot;No temperature annealing&quot;</span>, verbose<span class="op">=</span><span class="va">False</span>) _, res2 <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">500</span>], onehot_dims<span class="op">=</span><span class="dv">5</span>, epochs<span class="op">=</span><span class="dv">20</span>, lr<span class="op">=</span><span class="fl">0.1</span>, slope_annealing_rate<span class="op">=</span><span class="fl">1.2</span>, stochastic_eval<span class="op">=</span><span class="va">False</span>, label<span class="op">=</span> <span class="st">&quot;Annealing rate 1.2&quot;</span>, verbose<span class="op">=</span><span 
class="va">False</span>) _, res3 <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">500</span>], onehot_dims<span class="op">=</span><span class="dv">5</span>, epochs<span class="op">=</span><span class="dv">20</span>, lr<span class="op">=</span><span class="fl">0.1</span>, slope_annealing_rate<span class="op">=</span><span class="fl">1.5</span>, stochastic_eval<span class="op">=</span><span class="va">False</span>, label<span class="op">=</span> <span class="st">&quot;Annealing rate 1.5&quot;</span>, verbose<span class="op">=</span><span class="va">False</span>) plot_n([res1] <span class="op">+</span> [res2] <span class="op">+</span> [res3], lower_y<span class="op">=</span><span class="fl">0.8</span>, title<span class="op">=</span><span class="st">&quot;5D one-hot neurons, temperature annealing (validation)&quot;</span>)</code></pre></div> <pre><code>.................... Final epoch, epoch 20 : 0.981 .................... Final epoch, epoch 20 : 0.9796 .................... Final epoch, epoch 20 : 0.9798</code></pre> <figure> <img src="https://r2rt.com/static/images/BB_output_34_1.png" alt="png" /><figcaption>png</figcaption> </figure> <h4 id="one-hot-neurons-number-of-dimensions">One-hot neurons: number of dimensions</h4> <p>We can also easily vary the number of dimensions of each neuron. I keep the layer size constant in terms of the number of neurons, but make them progressively more expressive and see what happens (because layers are flattened, it appears that the layer size is growing, but the number of neurons stays the same). Looks like not much happens, but maybe this has to do with the simplicity of the dataset. An open question is whether more expressive one-hot neurons would make a difference on a harder dataset.
Note: I plotted the training curves locally, and they show the same result (I would have thought higher dimensional neurons would fit the training data better, but apparently not – perhaps due to stochasticity).</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">_, res1 <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">300</span>], onehot_dims<span class="op">=</span><span class="dv">3</span>, epochs<span class="op">=</span><span class="dv">20</span>, lr<span class="op">=</span><span class="fl">0.1</span>, stochastic_eval<span class="op">=</span><span class="va">False</span>, label<span class="op">=</span> <span class="st">&quot;3 dimensions&quot;</span>, verbose<span class="op">=</span><span class="va">False</span>) _, res2 <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">500</span>], onehot_dims<span class="op">=</span><span class="dv">5</span>, epochs<span class="op">=</span><span class="dv">20</span>, lr<span class="op">=</span><span class="fl">0.1</span>, stochastic_eval<span class="op">=</span><span class="va">False</span>, label<span class="op">=</span> <span class="st">&quot;5 dimensions&quot;</span>, verbose<span class="op">=</span><span class="va">False</span>) _, res3 <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">700</span>], onehot_dims<span class="op">=</span><span class="dv">7</span>, epochs<span class="op">=</span><span class="dv">20</span>, lr<span class="op">=</span><span class="fl">0.1</span>, stochastic_eval<span class="op">=</span><span class="va">False</span>, label<span class="op">=</span> <span class="st">&quot;7 dimensions&quot;</span>, verbose<span class="op">=</span><span class="va">False</span>) _, res4 <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">1000</span>], onehot_dims<span class="op">=</span><span 
class="dv">10</span>, epochs<span class="op">=</span><span class="dv">20</span>, lr<span class="op">=</span><span class="fl">0.1</span>, stochastic_eval<span class="op">=</span><span class="va">False</span>, label<span class="op">=</span> <span class="st">&quot;10 dimensions&quot;</span>, verbose<span class="op">=</span><span class="va">False</span>) plot_n([res1] <span class="op">+</span> [res2] <span class="op">+</span> [res3] <span class="op">+</span> [res4], lower_y<span class="op">=</span><span class="fl">0.8</span>, title<span class="op">=</span><span class="st">&quot;N-dimensional one-hot neurons&quot;</span>)</code></pre></div> <pre><code>.................... Final epoch, epoch 20 : 0.9772 .................... Final epoch, epoch 20 : 0.98 .................... Final epoch, epoch 20 : 0.981 .................... Final epoch, epoch 20 : 0.9754</code></pre> <figure> <img src="https://r2rt.com/static/images/BB_output_36_1.png" alt="png" /><figcaption>png</figcaption> </figure> <h2 id="conclusion">Conclusion</h2> <p>That’s it for now! In this post we saw how we can use the straight-through estimator to create expressive trainable explicit neurons. In particular, we coded up neurons that can represent more than 2 ordered categories (the ternary, or general n-ary neuron), and also neurons that can represent 3 or more unordered categories (the one-hot neuron). Moreover, we showed that they are all competitive with a real-valued tanh baseline on MNIST, and that they provide strong built-in regularization.</p> <p>I won’t discuss applications in this post, but I’m hoping to show you something really cool using one-hot neurons in the next one.</p> </body> </html> Non-Zero Initial States for Recurrent Neural Networks2016-11-20T00:00:00-05:002016-11-20T00:00:00-05:00Silviu Pitistag:r2rt.com,2016-11-20:/non-zero-initial-states-for-recurrent-neural-networks.htmlThe default approach to initializing the state of an RNN is to use a zero state.
This often works well, particularly for sequence-to-sequence tasks like language modeling where the proportion of outputs that are significantly impacted by the initial state is small. In some cases, however, it makes sense to (1) train the initial state as a model parameter, (2) use a noisy initial state, or (3) both. This post briefly examines the rationale behind trained and noisy initial states, and presents drop-in Tensorflow implementations.<!DOCTYPE html> <html> <head> <meta charset="utf-8"> <meta name="generator" content="pandoc"> <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes"> <title></title> <style type="text/css">code{white-space: pre;}</style> <style type="text/css"> div.sourceCode { overflow-x: auto; } table.sourceCode, tr.sourceCode, td.lineNumbers, td.sourceCode { margin: 0; padding: 0; vertical-align: baseline; border: none; } table.sourceCode { width: 100%; line-height: 100%; } td.lineNumbers { text-align: right; padding-right: 4px; padding-left: 4px; color: #aaaaaa; border-right: 1px solid #aaaaaa; } td.sourceCode { padding-left: 5px; } code > span.kw { color: #007020; font-weight: bold; } /* Keyword */ code > span.dt { color: #902000; } /* DataType */ code > span.dv { color: #40a070; } /* DecVal */ code > span.bn { color: #40a070; } /* BaseN */ code > span.fl { color: #40a070; } /* Float */ code > span.ch { color: #4070a0; } /* Char */ code > span.st { color: #4070a0; } /* String */ code > span.co { color: #60a0b0; font-style: italic; } /* Comment */ code > span.ot { color: #007020; } /* Other */ code > span.al { color: #ff0000; font-weight: bold; } /* Alert */ code > span.fu { color: #06287e; } /* Function */ code > span.er { color: #ff0000; font-weight: bold; } /* Error */ code > span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */ code > span.cn { color: #880000; } /* Constant */ code > span.sc { color: #4070a0; } /* SpecialChar */ code > span.vs { color: #4070a0; } /*
VerbatimString */ code > span.ss { color: #bb6688; } /* SpecialString */ code > span.im { } /* Import */ code > span.va { color: #19177c; } /* Variable */ code > span.cf { color: #007020; font-weight: bold; } /* ControlFlow */ code > span.op { color: #666666; } /* Operator */ code > span.bu { } /* BuiltIn */ code > span.ex { } /* Extension */ code > span.pp { color: #bc7a00; } /* Preprocessor */ code > span.at { color: #7d9029; } /* Attribute */ code > span.do { color: #ba2121; font-style: italic; } /* Documentation */ code > span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */ code > span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */ code > span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */ </style> <!--[if lt IE 9]> <script src="//cdnjs.cloudflare.com/ajax/libs/html5shiv/3.7.3/html5shiv-printshiv.min.js"></script> <![endif]--> </head> <body> <p>The default approach to initializing the state of an RNN is to use a zero state. This often works well, particularly for sequence-to-sequence tasks like language modeling where the proportion of outputs that are significantly impacted by the initial state is small. In some cases, however, it makes sense to (1) train the initial state as a model parameter, (2) use a noisy initial state, or (3) both. This post briefly examines the rationale behind trained and noisy initial states, and presents drop-in Tensorflow implementations.</p> <h3 id="training-the-initial-state">Training the initial state</h3> <p>If there are enough sequences or state resets in the training data (e.g., this will often be the case if we are doing sequence classification), it may make sense to train the initial state as a variable. This way, the model can learn a good default state. If we have only a few state resets, however, training the initial state as a variable may result in overfitting on the start of each sequence.
To see this, consider that with n-step truncated backpropagation, only the first n steps of each sequence will contribute to the gradient of the initial state, so that with, say, a 30-step truncation, even if our single training sequence has one million steps, only thirty of them will be used to train the initial state.</p> <p>I haven’t seen anyone evaluate this technique (edit 11/22/16: though it appears to be common knowledge), and so I don’t have a good citation for empirical results. Instead, please see the experimental results in this post.</p> <h3 id="using-a-noisy-initial-state">Using a noisy initial state</h3> <p>Using a zero-valued initial state can also result in overfitting, though in a different way. Ordinarily, losses at the early steps of a sequence-to-sequence model (i.e., those immediately after a state reset) will be larger than those at later steps, because there is less history. Thus, their contribution to the gradient during learning will be relatively higher. But if all state resets are associated with a zero state, the model can (and will) learn how to compensate for precisely this. As the ratio of state resets to total observations increases, the model parameters will become increasingly tuned to this zero state, which may affect performance on later time steps.</p> <p>One simple solution is to make the initial state noisy. This is the approach suggested by <a href="http://www.scs-europe.net/conf/ecms2015/invited/Contribution_Zimmermann_Grothmann_Tietz.pdf">Zimmermann et al. (2012)</a>, who take it even a step further by making the magnitude of the initial state noise change according to the backpropagated error.
This post will only take the first step of making the initial state noisy.</p> <h2 id="tensorflow-implementations">Tensorflow implementations</h2> <p>In some cases, e.g., as in my post on <a href="https://r2rt.com/recurrent-neural-networks-in-tensorflow-iii-variable-length-sequences.html">variable length sequences</a>, creating a variable or noisy initial state to match the cell state is straightforward. However, we often want to switch out the RNN cell or build complicated cells with nested states. My motivation for writing this post was to provide a method like the <code>zero_state</code> method of Tensorflow’s base RNNCell class that automatically constructs a variable or noisy initial state.</p> <h5 id="implementation-model">Implementation model</h5> <p>We’ll model the implementation after the <code>zero_state</code> method of Tensorflow’s base RNNCell class, shown below with minor modifications to make it a top-level function. You can view the original <code>zero_state</code> method <a href="https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/rnn_cell.py">here</a>.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> numpy <span class="im">as</span> np, tensorflow <span class="im">as</span> tf <span class="im">from</span> tensorflow.python.util <span class="im">import</span> nest _state_size_with_prefix <span class="op">=</span> tf.nn.rnn_cell._state_size_with_prefix <span class="kw">def</span> zero_state(cell, batch_size, dtype): <span class="co">&quot;&quot;&quot;Return zero-filled state tensor(s).</span> <span class="co"> Args:</span> <span class="co"> cell: RNNCell.</span> <span class="co"> batch_size: int, float, or unit Tensor representing the batch size.</span> <span class="co"> dtype: the data type to use for the state.</span> <span class="co"> Returns:</span> <span class="co"> If state_size is an int or TensorShape, then the return value is a</span> <span
class="co"> N-D tensor of shape [batch_size x state_size] filled with zeros.</span> <span class="co"> If state_size is a nested list or tuple, then the return value is</span> <span class="co"> a nested list or tuple (of the same structure) of 2-D tensors with</span> <span class="co"> the shapes [batch_size x s] for each s in state_size.</span> <span class="co"> &quot;&quot;&quot;</span> state_size <span class="op">=</span> cell.state_size <span class="cf">if</span> nest.is_sequence(state_size): state_size_flat <span class="op">=</span> nest.flatten(state_size) zeros_flat <span class="op">=</span> [ tf.zeros( tf.pack(_state_size_with_prefix(s, prefix<span class="op">=</span>[batch_size])), dtype<span class="op">=</span>dtype) <span class="cf">for</span> s <span class="kw">in</span> state_size_flat] <span class="cf">for</span> s, z <span class="kw">in</span> <span class="bu">zip</span>(state_size_flat, zeros_flat): z.set_shape(_state_size_with_prefix(s, prefix<span class="op">=</span>[<span class="va">None</span>])) zeros <span class="op">=</span> nest.pack_sequence_as(structure<span class="op">=</span>state_size, flat_sequence<span class="op">=</span>zeros_flat) <span class="cf">else</span>: zeros_size <span class="op">=</span> _state_size_with_prefix(state_size, prefix<span class="op">=</span>[batch_size]) zeros <span class="op">=</span> tf.zeros(tf.pack(zeros_size), dtype<span class="op">=</span>dtype) zeros.set_shape(_state_size_with_prefix(state_size, prefix<span class="op">=</span>[<span class="va">None</span>])) <span class="cf">return</span> zeros</code></pre></div> <h5 id="implementation">Implementation</h5> <p>Rather than rewriting the <code>zero_state</code> method to initialize the state with a variable (or with noise) directly, we will abstract out the <code>tf.zeros</code> function, to make the method more flexible. 
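Stripped of the TensorFlow details, the abstraction is just dependency injection of the state constructor. Here is a toy sketch of the pattern (the names `build_state`, `zeros_initializer` and `ones_initializer` are illustrative, not part of this post's code):

```python
import numpy as np

# Toy sketch of the abstraction: the state builder takes an `initializer`
# callable instead of hard-coding zeros. The names here are illustrative,
# not part of the implementation below.
def build_state(initializer, shape):
    return initializer(shape)

def zeros_initializer(shape):
    return np.zeros(shape)

def ones_initializer(shape):  # any other strategy drops in unchanged
    return np.ones(shape)

print(build_state(zeros_initializer, (2, 3)).sum())  # 0.0
print(build_state(ones_initializer, (2, 3)).sum())   # 6.0
```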
Our abstracted function, <code>get_initial_cell_state</code>, takes an additional <code>initializer</code> argument, which takes the place of <code>tf.zeros</code> and determines how the state is initialized. This would be a simple modification, but for the fact that we need to be careful with how variable states are created (e.g., we don’t want a different variable for each sample in the batch), which pushes some of the complexity into the <code>initializer</code> function.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> get_initial_cell_state(cell, initializer, batch_size, dtype): <span class="co">&quot;&quot;&quot;Return state tensor(s), initialized with initializer.</span> <span class="co"> Args:</span> <span class="co"> cell: RNNCell.</span> <span class="co"> initializer: function with four arguments, shape, batch_size,</span> <span class="co"> dtype and index, that determines how the state is initialized.</span> <span class="co"> batch_size: int, float, or unit Tensor representing the batch size.</span> <span class="co"> dtype: the data type to use for the state.</span> <span class="co"> Returns:</span> <span class="co"> If state_size is an int or TensorShape, then the return value is an</span> <span class="co"> N-D tensor of shape [batch_size x state_size] initialized</span> <span class="co"> according to the initializer.</span> <span class="co"> If state_size is a nested list or tuple, then the return value is</span> <span class="co"> a nested list or tuple (of the same structure) of 2-D tensors with</span> <span class="co"> the shapes [batch_size x s] for each s in state_size.</span> <span class="co"> &quot;&quot;&quot;</span> state_size <span class="op">=</span> cell.state_size <span class="cf">if</span> nest.is_sequence(state_size): state_size_flat <span class="op">=</span> nest.flatten(state_size) init_state_flat <span class="op">=</span> [ initializer(_state_size_with_prefix(s), batch_size, dtype, i)
<span class="cf">for</span> i, s <span class="kw">in</span> <span class="bu">enumerate</span>(state_size_flat)] init_state <span class="op">=</span> nest.pack_sequence_as(structure<span class="op">=</span>state_size, flat_sequence<span class="op">=</span>init_state_flat) <span class="cf">else</span>: init_state_size <span class="op">=</span> _state_size_with_prefix(state_size) init_state <span class="op">=</span> initializer(init_state_size, batch_size, dtype, <span class="va">None</span>) <span class="cf">return</span> init_state</code></pre></div> <p><code>initializer</code> must be a function with four arguments: <code>shape</code> and <code>dtype</code>, a la <code>tf.zeros</code>, and additionally <code>batch_size</code> and <code>index</code>, which are introduced to play nice with variables. We can achieve the same behavior as the original <code>zero_state</code> method with the following <code>initializer</code> function:</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> zero_state_initializer(shape, batch_size, dtype, index): z <span class="op">=</span> tf.zeros(tf.pack(_state_size_with_prefix(shape, [batch_size])), dtype) z.set_shape(_state_size_with_prefix(shape, prefix<span class="op">=</span>[<span class="va">None</span>])) <span class="cf">return</span> z</code></pre></div> <p>Then calling <code>get_initial_cell_state(cell, zero_state_initializer, batch_size, tf.float32)</code> does the same thing as calling <code>zero_state(cell, batch_size, tf.float32)</code>.</p> <p>Given this abstraction, we add support for a variable initializer like so:</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> make_variable_state_initializer(<span class="op">**</span>kwargs): <span class="kw">def</span> variable_state_initializer(shape, batch_size, dtype, index): args <span class="op">=</span> kwargs.copy() <span class="cf">if</span> 
args.get(<span class="st">&#39;name&#39;</span>): args[<span class="st">&#39;name&#39;</span>] <span class="op">=</span> args[<span class="st">&#39;name&#39;</span>] <span class="op">+</span> <span class="st">&#39;_&#39;</span> <span class="op">+</span> <span class="bu">str</span>(index) <span class="cf">else</span>: args[<span class="st">&#39;name&#39;</span>] <span class="op">=</span> <span class="st">&#39;init_state_&#39;</span> <span class="op">+</span> <span class="bu">str</span>(index) args[<span class="st">&#39;shape&#39;</span>] <span class="op">=</span> shape args[<span class="st">&#39;dtype&#39;</span>] <span class="op">=</span> dtype var <span class="op">=</span> tf.get_variable(<span class="op">**</span>args) var <span class="op">=</span> tf.expand_dims(var, <span class="dv">0</span>) var <span class="op">=</span> tf.tile(var, tf.pack([batch_size] <span class="op">+</span> [<span class="dv">1</span>] <span class="op">*</span> <span class="bu">len</span>(shape))) var.set_shape(_state_size_with_prefix(shape, prefix<span class="op">=</span>[<span class="va">None</span>])) <span class="cf">return</span> var <span class="cf">return</span> variable_state_initializer</code></pre></div> <p>We can now get a variable initial state by calling <code>get_initial_cell_state(cell, make_variable_state_initializer(), batch_size, tf.float32)</code>.</p> <p>Finally, we can add a noisy wrapper for our zero or variable state initializers like so:</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> make_gaussian_state_initializer(initializer, deterministic_tensor<span class="op">=</span><span class="va">None</span>, stddev<span class="op">=</span><span class="fl">0.3</span>): <span class="kw">def</span> gaussian_state_initializer(shape, batch_size, dtype, index): init_state <span class="op">=</span> initializer(shape, batch_size, dtype, index) <span class="cf">if</span> deterministic_tensor <span
class="kw">is</span> <span class="kw">not</span> <span class="va">None</span>: <span class="cf">return</span> tf.cond(deterministic_tensor, <span class="kw">lambda</span>: init_state, <span class="kw">lambda</span>: init_state <span class="op">+</span> tf.random_normal(tf.shape(init_state), stddev<span class="op">=</span>stddev)) <span class="cf">else</span>: <span class="cf">return</span> init_state <span class="op">+</span> tf.random_normal(tf.shape(init_state), stddev<span class="op">=</span>stddev) <span class="cf">return</span> gaussian_state_initializer</code></pre></div> <p>This wrapper adds Gaussian noise to the underlying initial state. E.g., to create an initializer function that initializes the state with a mean of zero and a standard deviation of 0.1, we call <code>make_gaussian_state_initializer(zero_state_initializer, stddev=0.1)</code>. The <code>deterministic_tensor</code> argument is an optional boolean tensor that can be used to disable the added noise at test time (recommended).</p> <h2 id="an-experiment-on-the-truncated-ptb-dataset">An experiment on the truncated PTB dataset</h2> <p>Now let us test our initializers on a “truncated” PTB language modeling task. This will be the same as the regular PTB dataset, except that we will modify the usual training routine so as to <em>not</em> propagate the final state forward (i.e., it will truncate the state propagation).
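In toy form, the only difference between the usual routine and the truncated one is whether each training step starts from the previous step's final state or from a fresh initial state. A sketch with stand-in functions (nothing here is the actual PTB graph; `run_step` is an arbitrary pretend state update):

```python
import numpy as np

# Toy illustration of "truncated" training: each batch either starts from
# the previous batch's final state, or from a fresh initial state.
# run_step is a stand-in for one training step, not the PTB graph.
def run_step(x, state):
    return 0.5 * state + x  # pretend RNN state update

def start_states(num_batches, truncate):
    states, state = [], np.zeros(1)
    for _ in range(num_batches):
        states.append(float(state[0]))  # state this batch starts from
        final = run_step(1.0, state)
        state = np.zeros(1) if truncate else final
    return states

print(start_states(3, truncate=True))   # [0.0, 0.0, 0.0]
print(start_states(3, truncate=False))  # [0.0, 1.0, 1.5]
```

With truncation, every training step starts from the same initial state, which is exactly why the choice of initial state matters so much in this experiment.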
By resetting the state between each training step, we make the PTB dataset behave like a dataset with many state resets.</p> <h5 id="helper-functions">Helper functions</h5> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">from</span> tensorflow.models.rnn.ptb <span class="im">import</span> reader <span class="im">from</span> enum <span class="im">import</span> Enum <span class="co">#data from http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz</span> raw_data <span class="op">=</span> reader.ptb_raw_data(<span class="st">&#39;ptb_data&#39;</span>) train_data, val_data, test_data, num_classes <span class="op">=</span> raw_data batch_size, num_steps <span class="op">=</span> <span class="dv">30</span>, <span class="dv">50</span> <span class="kw">def</span> gen_epochs(n, num_steps, batch_size, dataset<span class="op">=</span>train_data): <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(n): <span class="cf">yield</span> reader.ptb_iterator(dataset, batch_size, num_steps) <span class="kw">def</span> reset_graph(): <span class="cf">if</span> <span class="st">&#39;sess&#39;</span> <span class="kw">in</span> <span class="bu">globals</span>() <span class="kw">and</span> sess: sess.close() tf.reset_default_graph() <span class="kw">def</span> eval_network(sess, g, num_steps <span class="op">=</span> num_steps, batch_size <span class="op">=</span> batch_size): losses <span class="op">=</span> [] <span class="cf">for</span> X, Y <span class="kw">in</span> <span class="bu">next</span>(gen_epochs(<span class="dv">1</span>, num_steps, batch_size, dataset<span class="op">=</span>val_data<span class="op">+</span>test_data)): feed_dict<span class="op">=</span>{g[<span class="st">&#39;x&#39;</span>]: X, g[<span class="st">&#39;y&#39;</span>]: Y, g[<span class="st">&#39;deterministic&#39;</span>]: <span class="va">True</span>} loss_ <span class="op">=</span> sess.run([g[<span
class="st">&#39;loss&#39;</span>]], feed_dict)[<span class="dv">0</span>] losses.append(loss_) <span class="cf">return</span> np.mean(losses, axis<span class="op">=</span><span class="dv">0</span>) <span class="kw">def</span> train_network(sess, g, num_epochs, num_steps <span class="op">=</span> num_steps, batch_size <span class="op">=</span> batch_size): sess.run(tf.initialize_all_variables()) losses <span class="op">=</span> [] val_losses <span class="op">=</span> [] <span class="cf">for</span> idx, epoch <span class="kw">in</span> <span class="bu">enumerate</span>(gen_epochs(num_epochs, num_steps, batch_size)): loss <span class="op">=</span> [] <span class="cf">for</span> X, Y <span class="kw">in</span> epoch: feed_dict<span class="op">=</span>{g[<span class="st">&#39;x&#39;</span>]: X, g[<span class="st">&#39;y&#39;</span>]: Y} loss_, _ <span class="op">=</span> sess.run([g[<span class="st">&#39;loss&#39;</span>], g[<span class="st">&#39;train_step&#39;</span>]], feed_dict) loss.append(loss_) val_loss <span class="op">=</span> eval_network(sess, g) <span class="bu">print</span>(<span class="st">&quot;Average perplexity for Epoch&quot;</span>, idx, <span class="st">&quot;: Training -&quot;</span>, np.exp(np.mean(loss)), <span class="st">&quot;Validation -&quot;</span>, np.exp(np.mean(val_loss))) losses.append(np.mean(loss, axis<span class="op">=</span><span class="dv">0</span>)) val_losses.append(val_loss) <span class="cf">return</span> np.array(losses), np.array(val_losses) <span class="kw">class</span> StateInitializer(Enum): ZERO_STATE <span class="op">=</span> <span class="dv">1</span> VARIABLE_STATE <span class="op">=</span> <span class="dv">2</span> NOISY_ZERO_STATE <span class="op">=</span> <span class="dv">3</span> NOISY_VARIABLE_STATE <span class="op">=</span> <span class="dv">4</span></code></pre></div> <h5 id="graph">Graph</h5> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> 
build_graph( state_initializer, state_size <span class="op">=</span> <span class="dv">200</span>, num_classes <span class="op">=</span> num_classes, batch_size <span class="op">=</span> batch_size, num_steps <span class="op">=</span> num_steps, num_layers <span class="op">=</span> <span class="dv">2</span>): reset_graph() x <span class="op">=</span> tf.placeholder(tf.int32, [batch_size, num_steps], name<span class="op">=</span><span class="st">&#39;input_placeholder&#39;</span>) y <span class="op">=</span> tf.placeholder(tf.int32, [batch_size, num_steps], name<span class="op">=</span><span class="st">&#39;labels_placeholder&#39;</span>) lr <span class="op">=</span> tf.constant(<span class="fl">1.</span>) deterministic <span class="op">=</span> tf.constant(<span class="va">False</span>) embeddings <span class="op">=</span> tf.get_variable(<span class="st">&#39;embedding_matrix&#39;</span>, [num_classes, state_size]) rnn_inputs <span class="op">=</span> tf.nn.embedding_lookup(embeddings, x) cell <span class="op">=</span> tf.nn.rnn_cell.LSTMCell(state_size, state_is_tuple<span class="op">=</span><span class="va">True</span>) cell <span class="op">=</span> tf.nn.rnn_cell.MultiRNNCell([cell] <span class="op">*</span> num_layers, state_is_tuple<span class="op">=</span><span class="va">True</span>) <span class="cf">if</span> state_initializer <span class="op">==</span> StateInitializer.ZERO_STATE: initializer <span class="op">=</span> zero_state_initializer <span class="cf">elif</span> state_initializer <span class="op">==</span> StateInitializer.VARIABLE_STATE: initializer <span class="op">=</span> make_variable_state_initializer() <span class="cf">elif</span> state_initializer <span class="op">==</span> StateInitializer.NOISY_ZERO_STATE: initializer <span class="op">=</span> make_gaussian_state_initializer(zero_state_initializer, deterministic) <span class="cf">elif</span> state_initializer <span class="op">==</span> StateInitializer.NOISY_VARIABLE_STATE: initializer 
<span class="op">=</span> make_gaussian_state_initializer(make_variable_state_initializer(), deterministic) init_state <span class="op">=</span> get_initial_cell_state(cell, initializer, batch_size, tf.float32) rnn_outputs, final_state <span class="op">=</span> tf.nn.dynamic_rnn(cell, rnn_inputs, initial_state<span class="op">=</span>init_state) <span class="cf">with</span> tf.variable_scope(<span class="st">&#39;softmax&#39;</span>): W <span class="op">=</span> tf.get_variable(<span class="st">&#39;W&#39;</span>, [state_size, num_classes]) b <span class="op">=</span> tf.get_variable(<span class="st">&#39;b&#39;</span>, [num_classes], initializer<span class="op">=</span>tf.constant_initializer(<span class="fl">0.0</span>)) <span class="co">#reshape rnn_outputs and y so we can get the logits in a single matmul</span> rnn_outputs <span class="op">=</span> tf.reshape(rnn_outputs, [<span class="op">-</span><span class="dv">1</span>, state_size]) y_reshaped <span class="op">=</span> tf.reshape(y, [<span class="op">-</span><span class="dv">1</span>]) logits <span class="op">=</span> tf.matmul(rnn_outputs, W) <span class="op">+</span> b losses <span class="op">=</span> tf.reshape(tf.nn.sparse_softmax_cross_entropy_with_logits(logits, y_reshaped), [batch_size, num_steps]) loss_by_timestep <span class="op">=</span> tf.reduce_mean(losses, reduction_indices<span class="op">=</span><span class="dv">0</span>) train_step <span class="op">=</span> tf.train.AdamOptimizer().minimize(loss_by_timestep) <span class="cf">return</span> <span class="bu">dict</span>( x <span class="op">=</span> x, y <span class="op">=</span> y, lr <span class="op">=</span> lr, deterministic <span class="op">=</span> deterministic, init_state <span class="op">=</span> init_state, final_state <span class="op">=</span> final_state, loss <span class="op">=</span> loss_by_timestep, train_step <span class="op">=</span> train_step )</code></pre></div> <h5 id="experiment">Experiment</h5> <div 
class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">tr_losses, val_losses <span class="op">=</span> [<span class="va">None</span>] <span class="op">*</span> <span class="dv">4</span>, [<span class="va">None</span>] <span class="op">*</span> <span class="dv">4</span> g <span class="op">=</span> build_graph(state_initializer<span class="op">=</span>StateInitializer.ZERO_STATE) sess <span class="op">=</span> tf.InteractiveSession() tr_losses[<span class="dv">0</span>], val_losses[<span class="dv">0</span>] <span class="op">=</span> train_network(sess, g, num_epochs<span class="op">=</span><span class="dv">20</span>)</code></pre></div> <pre><code>Average perplexity for Epoch 0 : Training - 674.599 Validation - 483.888 Average perplexity for Epoch 1 : Training - 421.366 Validation - 348.751 Average perplexity for Epoch 2 : Training - 305.943 Validation - 272.674 Average perplexity for Epoch 3 : Training - 241.748 Validation - 235.801 Average perplexity for Epoch 4 : Training - 205.29 Validation - 212.853 Average perplexity for Epoch 5 : Training - 180.5 Validation - 198.029 Average perplexity for Epoch 6 : Training - 160.867 Validation - 186.862 Average perplexity for Epoch 7 : Training - 145.657 Validation - 179.394 Average perplexity for Epoch 8 : Training - 133.973 Validation - 173.399 Average perplexity for Epoch 9 : Training - 124.281 Validation - 169.236 Average perplexity for Epoch 10 : Training - 115.586 Validation - 166.216 Average perplexity for Epoch 11 : Training - 108.34 Validation - 163.99 Average perplexity for Epoch 12 : Training - 101.959 Validation - 162.627 Average perplexity for Epoch 13 : Training - 96.3985 Validation - 162.423 Average perplexity for Epoch 14 : Training - 91.6309 Validation - 163.904 Average perplexity for Epoch 15 : Training - 87.29 Validation - 163.679 Average perplexity for Epoch 16 : Training - 83.2224 Validation - 164.169 Average perplexity for Epoch 17 : Training - 79.5156 Validation - 165.162 
Average perplexity for Epoch 18 : Training - 76.1198 Validation - 166.714 Average perplexity for Epoch 19 : Training - 73.1628 Validation - 168.515</code></pre> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">g <span class="op">=</span> build_graph(state_initializer<span class="op">=</span>StateInitializer.VARIABLE_STATE) sess <span class="op">=</span> tf.InteractiveSession() tr_losses[<span class="dv">1</span>], val_losses[<span class="dv">1</span>] <span class="op">=</span> train_network(sess, g, num_epochs<span class="op">=</span><span class="dv">20</span>)</code></pre></div> <pre><code>Average perplexity for Epoch 0 : Training - 525.724 Validation - 325.364 Average perplexity for Epoch 1 : Training - 275.811 Validation - 239.312 Average perplexity for Epoch 2 : Training - 210.521 Validation - 204.103 Average perplexity for Epoch 3 : Training - 176.135 Validation - 184.352 Average perplexity for Epoch 4 : Training - 153.307 Validation - 171.528 Average perplexity for Epoch 5 : Training - 136.591 Validation - 162.493 Average perplexity for Epoch 6 : Training - 123.592 Validation - 156.533 Average perplexity for Epoch 7 : Training - 113.033 Validation - 152.028 Average perplexity for Epoch 8 : Training - 104.201 Validation - 149.743 Average perplexity for Epoch 9 : Training - 96.7272 Validation - 148.263 Average perplexity for Epoch 10 : Training - 90.313 Validation - 147.438 Average perplexity for Epoch 11 : Training - 84.7536 Validation - 147.409 Average perplexity for Epoch 12 : Training - 79.8758 Validation - 147.533 Average perplexity for Epoch 13 : Training - 75.5331 Validation - 148.11 Average perplexity for Epoch 14 : Training - 71.5848 Validation - 149.513 Average perplexity for Epoch 15 : Training - 67.9394 Validation - 151.243 Average perplexity for Epoch 16 : Training - 64.6299 Validation - 153.503 Average perplexity for Epoch 17 : Training - 61.6355 Validation - 156.37 Average perplexity for Epoch 18 : Training - 
58.9116 Validation - 160.145 Average perplexity for Epoch 19 : Training - 56.4397 Validation - 164.863</code></pre> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">g <span class="op">=</span> build_graph(state_initializer<span class="op">=</span>StateInitializer.NOISY_ZERO_STATE) sess <span class="op">=</span> tf.InteractiveSession() tr_losses[<span class="dv">2</span>], val_losses[<span class="dv">2</span>] <span class="op">=</span> train_network(sess, g, num_epochs<span class="op">=</span><span class="dv">20</span>)</code></pre></div> <pre><code>Average perplexity for Epoch 0 : Training - 625.676 Validation - 407.948 Average perplexity for Epoch 1 : Training - 337.045 Validation - 277.074 Average perplexity for Epoch 2 : Training - 245.198 Validation - 230.573 Average perplexity for Epoch 3 : Training - 202.941 Validation - 205.394 Average perplexity for Epoch 4 : Training - 175.752 Validation - 189.294 Average perplexity for Epoch 5 : Training - 156.077 Validation - 178.006 Average perplexity for Epoch 6 : Training - 141.035 Validation - 170.011 Average perplexity for Epoch 7 : Training - 128.985 Validation - 164.033 Average perplexity for Epoch 8 : Training - 118.946 Validation - 160.09 Average perplexity for Epoch 9 : Training - 110.475 Validation - 157.405 Average perplexity for Epoch 10 : Training - 103.191 Validation - 155.624 Average perplexity for Epoch 11 : Training - 96.9187 Validation - 154.584 Average perplexity for Epoch 12 : Training - 91.4146 Validation - 154.25 Average perplexity for Epoch 13 : Training - 86.494 Validation - 154.48 Average perplexity for Epoch 14 : Training - 82.1429 Validation - 155.172 Average perplexity for Epoch 15 : Training - 78.1957 Validation - 156.681 Average perplexity for Epoch 16 : Training - 74.6005 Validation - 158.523 Average perplexity for Epoch 17 : Training - 71.3612 Validation - 160.869 Average perplexity for Epoch 18 : Training - 68.3056 Validation - 163.278 Average 
perplexity for Epoch 19 : Training - 65.4805 Validation - 165.645</code></pre> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">g <span class="op">=</span> build_graph(state_initializer<span class="op">=</span>StateInitializer.NOISY_VARIABLE_STATE) sess <span class="op">=</span> tf.InteractiveSession() tr_losses[<span class="dv">3</span>], val_losses[<span class="dv">3</span>] <span class="op">=</span> train_network(sess, g, num_epochs<span class="op">=</span><span class="dv">20</span>)</code></pre></div> <pre><code>Average perplexity for Epoch 0 : Training - 517.27 Validation - 331.341 Average perplexity for Epoch 1 : Training - 278.846 Validation - 239.6 Average perplexity for Epoch 2 : Training - 210.333 Validation - 203.027 Average perplexity for Epoch 3 : Training - 174.959 Validation - 182.456 Average perplexity for Epoch 4 : Training - 151.81 Validation - 169.388 Average perplexity for Epoch 5 : Training - 135.121 Validation - 160.613 Average perplexity for Epoch 6 : Training - 122.301 Validation - 154.474 Average perplexity for Epoch 7 : Training - 111.991 Validation - 150.337 Average perplexity for Epoch 8 : Training - 103.425 Validation - 147.664 Average perplexity for Epoch 9 : Training - 96.1806 Validation - 145.957 Average perplexity for Epoch 10 : Training - 89.8921 Validation - 145.308 Average perplexity for Epoch 11 : Training - 84.3145 Validation - 145.255 Average perplexity for Epoch 12 : Training - 79.3745 Validation - 146.052 Average perplexity for Epoch 13 : Training - 74.96 Validation - 147.01 Average perplexity for Epoch 14 : Training - 71.0005 Validation - 148.22 Average perplexity for Epoch 15 : Training - 67.3658 Validation - 150.713 Average perplexity for Epoch 16 : Training - 64.0655 Validation - 153.78 Average perplexity for Epoch 17 : Training - 61.0874 Validation - 157.101 Average perplexity for Epoch 18 : Training - 58.3892 Validation - 160.376 Average perplexity for Epoch 19 : Training - 55.9478 
Validation - 164.157</code></pre> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt <span class="op">%</span>matplotlib inline <span class="im">import</span> seaborn <span class="im">as</span> sns sns.<span class="bu">set</span>(color_codes<span class="op">=</span><span class="va">True</span>) <span class="kw">def</span> best_epoch(val_losses): <span class="cf">return</span> np.argmin(np.mean(val_losses, axis<span class="op">=</span><span class="dv">1</span>)) labels <span class="op">=</span> [<span class="st">&#39;Zero&#39;</span>, <span class="st">&#39;Variable&#39;</span>, <span class="st">&#39;Noisy&#39;</span>, <span class="st">&#39;Noisy Variable&#39;</span>] <span class="kw">def</span> plot_losses(losses, title, y_range): <span class="kw">global</span> val_losses fig, ax <span class="op">=</span> plt.subplots() <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="bu">len</span>(losses)): data <span class="op">=</span> np.exp(losses[i][best_epoch(val_losses[i])]) ax.plot(<span class="bu">range</span>(<span class="dv">0</span>,num_steps),data,label<span class="op">=</span>labels[i]) ax.set_xlabel(<span class="st">&#39;Step number&#39;</span>) ax.set_ylabel(<span class="st">&#39;Average loss&#39;</span>) ax.set_ylim(y_range) ax.set_title(title) ax.legend(loc<span class="op">=</span><span class="dv">1</span>) plt.show()</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">plot_losses(tr_losses, <span class="st">&#39;Best epoch training perplexities&#39;</span>, [<span class="dv">70</span>, <span class="dv">110</span>])</code></pre></div> <figure> <img src="https://r2rt.com/static/images/NonzeroStateInit_output_25_0.png" alt="png" /><figcaption>png</figcaption> </figure> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode 
python">plot_losses(val_losses, <span class="st">&#39;Best epoch validation perplexities&#39;</span>, [<span class="dv">120</span>, <span class="dv">200</span>])</code></pre></div> <figure> <img src="https://r2rt.com/static/images/NonzeroStateInit_output_26_0.png" alt="png" /><figcaption>png</figcaption> </figure> <h3 id="empirical-results">Empirical results</h3> <p>From the above experiment we make the following observations:</p> <ul> <li>All non-zero state initializations sped up training and improved generalization.</li> <li>Training the initial state as a variable was more effective than using a noisy zero-mean initial state.</li> <li>Adding noise to a variable initial state provided only marginal benefit.</li> </ul> <p>Finally, I would note that “truncating” the PTB dataset produced worse results than would be obtained if the dataset were not truncated, even if we use noisy or variable state initializations. We can see this by comparing the above results to the “non-regularized LSTM” from <a href="https://arxiv.org/pdf/1409.2329v5.pdf">Zaremba et al. (2015)</a>, which had a very similar architecture, but did not truncate the sequences in the dataset. I would expect truncation to have this effect in general, so that these non-zero state initializations will only really be useful for datasets that have many naturally-occurring state resets.</p> </body> </html> Recurrent Neural Networks in Tensorflow III - Variable Length Sequences2016-11-15T00:00:00-05:002016-11-15T00:00:00-05:00Silviu Pitistag:r2rt.com,2016-11-15:/recurrent-neural-networks-in-tensorflow-iii-variable-length-sequences.htmlThis is the third in a series of posts about recurrent neural networks in Tensorflow.
In this post, we'll use Tensorflow to construct an RNN that operates on input sequences of variable lengths.<!DOCTYPE html> <html> <head> <meta charset="utf-8"> <meta name="generator" content="pandoc"> <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes"> <title></title> <style type="text/css">code{white-space: pre;}</style> <!--[if lt IE 9]> <script src="//cdnjs.cloudflare.com/ajax/libs/html5shiv/3.7.3/html5shiv-printshiv.min.js"></script> <![endif]--> </head> <body> <h2 id="task">Task</h2> <p>In this post, we’ll use Tensorflow to construct an RNN that operates on input sequences of variable lengths. We’ll use this RNN to classify bloggers by age bracket and gender using sentence-long writing samples. One time step will represent a single word, with the complete input sequence representing a single sentence. The challenge is to build a model that can classify multiple sentences of different lengths at the same time.</p> <h3 id="other-tutorials-on-variable-length-sequences">Other tutorials on variable length sequences</h3> <p>There are a couple of other tutorials on this topic. For example, the <a href="https://www.tensorflow.org/versions/master/tutorials/seq2seq/index.html">official Tensorflow seq2seq tutorial model</a> accommodates variable length sequences. This official model, however, is a bit advanced for a first exposure and a little too specialized to be easily portable to other contexts. Danijar Hafner has written a more approachable guide <a href="http://danijar.com/variable-sequence-lengths-in-tensorflow/">here</a>, which I recommend. In contrast to Danijar’s post, this post is written in a linear IPython notebook style to make it easy for you to follow along step-by-step.
This post also includes a section on bucketing, a technique that can significantly improve your model’s training time.</p> <h2 id="data">Data</h2> <p>The data for this post is sourced from the “Blog Authorship Corpus”, available <a href="http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm">here</a>. The original dataset was tokenized and split into sentences using <a href="https://spacy.io/">spacy</a>. Sentences with fewer than 5 tokens and sentences with more than 30 tokens were discarded. Number-like tokens were replaced by “&lt;#&gt;”. Tokens other than the 9999 most common tokens were replaced by “&lt; UNK &gt;”, for a 10000 token vocabulary. Sentences were tagged with the gender (0 for male, 1 for female) and age bracket (0 for teens, 1 for 20s, 2 for 30s) and placed into a pandas dataframe. The modified data, along with code to import it, can be found <a href="https://github.com/spitis/blogs_data">here</a>.</p> <p>Below is the head of the dataframe (tokens in “string” are delimited by spaces):</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> pandas <span class="im">as</span> pd, numpy <span class="im">as</span> np, tensorflow <span class="im">as</span> tf <span class="im">import</span> blogs_data <span class="co">#available at https://github.com/spitis/blogs_data</span></code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">df <span class="op">=</span> blogs_data.loadBlogs().sample(frac<span class="op">=</span><span class="dv">1</span>).reset_index(drop<span class="op">=</span><span class="va">True</span>) vocab, reverse_vocab <span class="op">=</span> blogs_data.loadVocab() train_len, test_len <span class="op">=</span> np.floor(<span class="bu">len</span>(df)<span class="op">*</span><span class="fl">0.8</span>), np.floor(<span class="bu">len</span>(df)<span class="op">*</span><span class="fl">0.2</span>) train, test <span class="op">=</span>
df.ix[:train_len<span class="dv">-1</span>], df.ix[train_len:train_len <span class="op">+</span> test_len] df <span class="op">=</span> <span class="va">None</span> train.head()</code></pre></div> <table> <colgroup> <col style="width: 2%" /> <col style="width: 6%" /> <col style="width: 6%" /> <col style="width: 9%" /> <col style="width: 34%" /> <col style="width: 34%" /> <col style="width: 6%" /> </colgroup> <thead> <tr class="header"> <th></th> <th>post_id</th> <th>gender</th> <th>age</th> <th>string</th> <th>as_numbers</th> <th>length</th> </tr> </thead> <tbody> <tr class="odd"> <td>0</td> <td>118860</td> <td>0</td> <td>0</td> <td>the last time i checked the constitution of th…</td> <td>[4, 127, 63, 3, 1837, 4, 3871, 8, 4, 1236, 927…</td> <td>22</td> </tr> <tr class="even"> <td>1</td> <td>178031</td> <td>1</td> <td>1</td> <td>but i do wish more people were so <UNK> ( star…</td> <td>[20, 3, 31, 360, 77, 79, 88, 27, 0, 43, 631, 2…</td> <td>15</td> </tr> <tr class="odd"> <td>2</td> <td>182592</td> <td>1</td> <td>0</td> <td>it came back to me right away and i was off .</td> <td>[10, 209, 93, 5, 19, 136, 192, 6, 3, 17, 129, 2]</td> <td>12</td> </tr> <tr class="even"> <td>3</td> <td>144982</td> <td>0</td> <td>2</td> <td>day &lt;#&gt; - get to class &lt;#&gt; min . early .</td> <td>[94, 12, 33, 59, 5, 320, 12, 2703, 2, 457, 2]</td> <td>11</td> </tr> <tr class="odd"> <td>4</td> <td>43048</td> <td>0</td> <td>0</td> <td>cause you do n’t know how much i , i need you …</td> <td>[332, 15, 31, 28, 64, 86, 96, 3, 1, 3, 157, 15…</td> <td>21</td> </tr> </tbody> </table> <p> </p> <p>We’re going to build an RNN that accepts batches of data from the “as_numbers” column and predicts the “gender” and “age_bracket” columns. 
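</p> <p>As an aside, note how the iterator below folds the two labels into one target: with 2 genders and 3 age brackets, <code>gender * 3 + age_bracket</code> yields a single class from 0 to 5. A quick sketch of this encoding (the <code>encode</code>/<code>decode</code> helpers are just for illustration, not part of the model):</p>

```python
# Fold gender (0 or 1) and age bracket (0, 1, or 2) into one of 6 classes,
# mirroring the iterator's target: res['gender']*3 + res['age_bracket'].
def encode(gender, age_bracket):
    return gender * 3 + age_bracket

# divmod inverts the encoding: class 4 -> (gender 1, age bracket 1)
def decode(label):
    return divmod(label, 3)

print(encode(1, 1))  # 4
print(decode(4))     # (1, 1)
```

<p>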
The first step is to construct a simple iterator that returns batches of inputs along with their targets, and the length of each input:</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">class</span> SimpleDataIterator(): <span class="kw">def</span> <span class="fu">__init__</span>(<span class="va">self</span>, df): <span class="va">self</span>.df <span class="op">=</span> df <span class="va">self</span>.size <span class="op">=</span> <span class="bu">len</span>(<span class="va">self</span>.df) <span class="va">self</span>.epochs <span class="op">=</span> <span class="dv">0</span> <span class="va">self</span>.shuffle() <span class="kw">def</span> shuffle(<span class="va">self</span>): <span class="va">self</span>.df <span class="op">=</span> <span class="va">self</span>.df.sample(frac<span class="op">=</span><span class="dv">1</span>).reset_index(drop<span class="op">=</span><span class="va">True</span>) <span class="va">self</span>.cursor <span class="op">=</span> <span class="dv">0</span> <span class="kw">def</span> next_batch(<span class="va">self</span>, n): <span class="cf">if</span> <span class="va">self</span>.cursor<span class="op">+</span>n<span class="dv">-1</span> <span class="op">&gt;</span> <span class="va">self</span>.size: <span class="va">self</span>.epochs <span class="op">+=</span> <span class="dv">1</span> <span class="va">self</span>.shuffle() res <span class="op">=</span> <span class="va">self</span>.df.ix[<span class="va">self</span>.cursor:<span class="va">self</span>.cursor<span class="op">+</span>n<span class="dv">-1</span>] <span class="va">self</span>.cursor <span class="op">+=</span> n <span class="cf">return</span> res[<span class="st">&#39;as_numbers&#39;</span>], res[<span class="st">&#39;gender&#39;</span>]<span class="op">*</span><span class="dv">3</span> <span class="op">+</span> res[<span class="st">&#39;age_bracket&#39;</span>], res[<span 
class="st">&#39;length&#39;</span>]</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">data <span class="op">=</span> SimpleDataIterator(train) d <span class="op">=</span> data.next_batch(<span class="dv">3</span>) <span class="bu">print</span>(<span class="st">&#39;Input sequences</span><span class="ch">\n</span><span class="st">&#39;</span>, d[<span class="dv">0</span>], end<span class="op">=</span><span class="st">&#39;</span><span class="ch">\n\n</span><span class="st">&#39;</span>) <span class="bu">print</span>(<span class="st">&#39;Target values</span><span class="ch">\n</span><span class="st">&#39;</span>, d[<span class="dv">1</span>], end<span class="op">=</span><span class="st">&#39;</span><span class="ch">\n\n</span><span class="st">&#39;</span>) <span class="bu">print</span>(<span class="st">&#39;Sequence lengths</span><span class="ch">\n</span><span class="st">&#39;</span>, d[<span class="dv">2</span>])</code></pre></div> <pre><code>Input sequences 0 [27, 3, 576, 146, 13, 204, 37, 150, 6, 804, 94... 1 [10, 210, 30, 1554, 10, 22, 325, 6240, 11, 4, ... 2 [2927, 78, 9324, 5, 2273, 4, 5937, 8, 1058, 4,... Target values 0 4 1 4 2 1 Sequence lengths 0 13 1 18 2 22</code></pre> <p>We are immediately faced with a problem, in that our 3 sequences are of different lengths: we cannot feed them into a Tensorflow graph as is, unless we create a different tensor for each (inefficient, and hard!). To solve this, we <strong>pad shorter sequences so that all sequences are the same length</strong>. 
Then all sequences will fit into a single tensor.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">class</span> PaddedDataIterator(SimpleDataIterator): <span class="kw">def</span> next_batch(<span class="va">self</span>, n): <span class="cf">if</span> <span class="va">self</span>.cursor<span class="op">+</span>n <span class="op">&gt;</span> <span class="va">self</span>.size: <span class="va">self</span>.epochs <span class="op">+=</span> <span class="dv">1</span> <span class="va">self</span>.shuffle() res <span class="op">=</span> <span class="va">self</span>.df.ix[<span class="va">self</span>.cursor:<span class="va">self</span>.cursor<span class="op">+</span>n<span class="dv">-1</span>] <span class="va">self</span>.cursor <span class="op">+=</span> n <span class="co"># Pad sequences with 0s so they are all the same length</span> maxlen <span class="op">=</span> <span class="bu">max</span>(res[<span class="st">&#39;length&#39;</span>]) x <span class="op">=</span> np.zeros([n, maxlen], dtype<span class="op">=</span>np.int32) <span class="cf">for</span> i, x_i <span class="kw">in</span> <span class="bu">enumerate</span>(x): x_i[:res[<span class="st">&#39;length&#39;</span>].values[i]] <span class="op">=</span> res[<span class="st">&#39;as_numbers&#39;</span>].values[i] <span class="cf">return</span> x, res[<span class="st">&#39;gender&#39;</span>]<span class="op">*</span><span class="dv">3</span> <span class="op">+</span> res[<span class="st">&#39;age_bracket&#39;</span>], res[<span class="st">&#39;length&#39;</span>]</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">data <span class="op">=</span> PaddedDataIterator(train) d <span class="op">=</span> data.next_batch(<span class="dv">3</span>) <span class="bu">print</span>(<span class="st">&#39;Input sequences</span><span class="ch">\n</span><span class="st">&#39;</span>, d[<span class="dv">0</span>], 
end<span class="op">=</span><span class="st">&#39;</span><span class="ch">\n\n</span><span class="st">&#39;</span>)</code></pre></div> <pre><code>Input sequences [[ 34 90 5 470 16 19 16 7 159 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [ 82 1 109 7 377 8 421 8 0 33 124 3 69 180 17 90 5 133 16 19 33 34 12 3819 85 164 129 25] [1786 5570 1 13 7817 235 60 6168 19 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]</code></pre> <p>Our padded iterator now returns a single input matrix of dimension [batch_size, max_sequence_length], where shorter sequences have been padded with zeros.</p> <h4 id="a-note-on-pad-symbols">A note on PAD symbols</h4> <p>For this model, the zero that we used to pad our input sequences with is the index of the “&lt; UNK &gt;” symbol (representing UNKnown words) in our vocabulary. In this case, what we pad with doesn’t affect the outcome, and so I chose to keep it simple, but there are cases in which we might want to introduce a special “PAD” symbol. For example, here we will be feeding in a length tensor that holds information about our input sequence lengths, but suppose instead that we want to have Tensorflow calculate our sequence lengths; in that case, using 0 to represent both &lt; UNK &gt; and PAD would make it impossible for the graph to distinguish between a sentence ending in &lt; UNK &gt; and a padded sentence. If we were to add a special PAD symbol, we would likely want to represent it as zero, and so it would have to displace the &lt; UNK &gt; symbol, which would then need to be represented by a different index.</p> <p>The advantage of the approach shown here (zero-padding with no special PAD symbol) is that it generalizes better to sequences with multi-dimensional continuous input (e.g., stock price data).
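</p> <p>Should we ever want the graph to infer the lengths from the padding itself, the usual trick is to count non-zero time steps. Below is a numpy sketch of both the token case (assuming a dedicated PAD at index 0) and the continuous case; in Tensorflow the analogues would be <code>tf.sign</code>, <code>tf.abs</code>, <code>tf.reduce_max</code> and <code>tf.reduce_sum</code>:</p>

```python
import numpy as np

# Zero-padded token batch (assumes 0 is a dedicated PAD; no real token uses id 0)
tokens = np.array([[34, 90, 5, 470, 0, 0],
                   [82, 1, 109, 7, 377, 8]])
token_lengths = np.sign(tokens).sum(axis=1)
print(token_lengths)  # [4 6]

# Zero-padded continuous batch: [batch, steps, features]
frames = np.zeros([2, 4, 3])
frames[0, :2] = [[0.5, -1.0, 2.0], [0.1, 0.2, 0.3]]  # length-2 sequence
frames[1, :3] = 1.0                                  # length-3 sequence
# A step is "real" if any feature is non-zero
frame_lengths = np.sign(np.abs(frames).max(axis=2)).sum(axis=1)
print(frame_lengths)  # [2. 3.]
```

<p>Note that with &lt; UNK &gt; at index 0, as in our dataset, this count would be wrong for any sentence containing &lt; UNK &gt;, which is exactly the ambiguity described above.</p> <p>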
In such cases, it does not really make sense to have a separate PAD symbol.</p> <h2 id="a-basic-model-for-sequence-classification">A basic model for sequence classification</h2> <p>We’ll now construct a <strong>sequence classification</strong> model using this data that assigns a single label to an entire input sequence. Later, we’ll look at how we can instead construct a sequence-to-sequence model that predicts the author’s age and gender at each time step.</p> <p>Our model makes use of Tensorflow’s <code>dynamic_rnn</code>, specifying a <code>sequence_length</code> parameter, which is fed into the model along with the data. Calling <code>dynamic_rnn</code> with a <code>sequence_length</code> parameter returns padded outputs: e.g., if the maximum sequence length is 10, but a specific example in the batch has a sequence length of only 4, the output for that example will also have a length of only 4, with 6 additional zero steps for padding.</p> <p>This introduces some added complexity, since for sequence classification we only care about the final output of each sequence. This could be solved in one line using <code>tf.gather_nd</code>, as commented below; however, the gradient for <code>tf.gather_nd</code> is not yet implemented as of this writing. It is expected to be implemented shortly (you can view the status on Github <a href="https://github.com/tensorflow/tensorflow/issues/5342">here</a>).
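</p> <p>The workaround used below instead flattens the outputs and gathers one row per example. To make the index arithmetic concrete, here is the same computation in numpy, with a toy array standing in for <code>rnn_outputs</code> (shapes are illustrative only):</p>

```python
import numpy as np

batch_size, num_steps, state_size = 3, 5, 2
# Toy stand-in for rnn_outputs: [batch_size, num_steps, state_size]
rnn_outputs = np.arange(batch_size * num_steps * state_size,
                        dtype=np.float32).reshape(batch_size, num_steps, state_size)
seqlen = np.array([2, 5, 3])

# Flatten to [batch_size * num_steps, state_size]; row i*num_steps + (seqlen[i]-1)
# is the last valid output of example i
idx = np.arange(batch_size) * num_steps + (seqlen - 1)
last_rnn_output = rnn_outputs.reshape(-1, state_size)[idx]

# Identical to indexing the last valid step of each example directly
assert (last_rnn_output == rnn_outputs[np.arange(batch_size), seqlen - 1]).all()
print(last_rnn_output.shape)  # (3, 2)
```

<p>In the graph itself, <code>num_steps</code> is not fixed; it is read off as <code>tf.shape(rnn_outputs)[1]</code>, the maximum length in the current batch.</p> <p>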
In the interim, I have adopted Danijar Hafner’s solution, which you can read more about in his <a href="http://danijar.com/variable-sequence-lengths-in-tensorflow/">post</a> under the heading “Select the Last Relevant Output”.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> reset_graph(): <span class="cf">if</span> <span class="st">&#39;sess&#39;</span> <span class="kw">in</span> <span class="bu">globals</span>() <span class="kw">and</span> sess: sess.close() tf.reset_default_graph() <span class="kw">def</span> build_graph( vocab_size <span class="op">=</span> <span class="bu">len</span>(vocab), state_size <span class="op">=</span> <span class="dv">64</span>, batch_size <span class="op">=</span> <span class="dv">256</span>, num_classes <span class="op">=</span> <span class="dv">6</span>): reset_graph() <span class="co"># Placeholders</span> x <span class="op">=</span> tf.placeholder(tf.int32, [batch_size, <span class="va">None</span>]) <span class="co"># [batch_size, num_steps]</span> seqlen <span class="op">=</span> tf.placeholder(tf.int32, [batch_size]) y <span class="op">=</span> tf.placeholder(tf.int32, [batch_size]) keep_prob <span class="op">=</span> tf.constant(<span class="fl">1.0</span>) <span class="co"># Embedding layer</span> embeddings <span class="op">=</span> tf.get_variable(<span class="st">&#39;embedding_matrix&#39;</span>, [vocab_size, state_size]) rnn_inputs <span class="op">=</span> tf.nn.embedding_lookup(embeddings, x) <span class="co"># RNN</span> cell <span class="op">=</span> tf.nn.rnn_cell.GRUCell(state_size) init_state <span class="op">=</span> tf.get_variable(<span class="st">&#39;init_state&#39;</span>, [<span class="dv">1</span>, state_size], initializer<span class="op">=</span>tf.constant_initializer(<span class="fl">0.0</span>)) init_state <span class="op">=</span> tf.tile(init_state, [batch_size, <span class="dv">1</span>]) rnn_outputs, final_state <span 
class="op">=</span> tf.nn.dynamic_rnn(cell, rnn_inputs, sequence_length<span class="op">=</span>seqlen, initial_state<span class="op">=</span>init_state) <span class="co"># Add dropout, as the model otherwise quickly overfits</span> rnn_outputs <span class="op">=</span> tf.nn.dropout(rnn_outputs, keep_prob) <span class="co">&quot;&quot;&quot;</span> <span class="co">    Obtain the last relevant output. The best approach in the future will be to use:</span> <span class="co">        last_rnn_output = tf.gather_nd(rnn_outputs, tf.pack([tf.range(batch_size), seqlen-1], axis=1))</span> <span class="co">    which is the Tensorflow equivalent of numpy&#39;s rnn_outputs[range(batch_size), seqlen-1, :], but the</span> <span class="co">    gradient for this op has not been implemented as of this writing.</span> <span class="co">    The below solution works, but throws a UserWarning re: the gradient.</span> <span class="co">    &quot;&quot;&quot;</span> idx <span class="op">=</span> tf.<span class="bu">range</span>(batch_size)<span class="op">*</span>tf.shape(rnn_outputs)[<span class="dv">1</span>] <span class="op">+</span> (seqlen <span class="op">-</span> <span class="dv">1</span>) last_rnn_output <span class="op">=</span> tf.gather(tf.reshape(rnn_outputs, [<span class="op">-</span><span class="dv">1</span>, state_size]), idx) <span class="co"># Softmax layer</span> <span class="cf">with</span> tf.variable_scope(<span class="st">&#39;softmax&#39;</span>): W <span class="op">=</span> tf.get_variable(<span class="st">&#39;W&#39;</span>, [state_size, num_classes]) b <span class="op">=</span> tf.get_variable(<span class="st">&#39;b&#39;</span>, [num_classes], initializer<span class="op">=</span>tf.constant_initializer(<span class="fl">0.0</span>)) logits <span class="op">=</span> tf.matmul(last_rnn_output, W) <span class="op">+</span> b preds <span class="op">=</span> tf.nn.softmax(logits) correct <span class="op">=</span> tf.equal(tf.cast(tf.argmax(preds,<span class="dv">1</span>),tf.int32), y) accuracy
<span class="op">=</span> tf.reduce_mean(tf.cast(correct, tf.float32)) loss <span class="op">=</span> tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits, y)) train_step <span class="op">=</span> tf.train.AdamOptimizer(<span class="fl">1e-4</span>).minimize(loss) <span class="cf">return</span> { <span class="st">&#39;x&#39;</span>: x, <span class="st">&#39;seqlen&#39;</span>: seqlen, <span class="st">&#39;y&#39;</span>: y, <span class="st">&#39;dropout&#39;</span>: keep_prob, <span class="st">&#39;loss&#39;</span>: loss, <span class="st">&#39;ts&#39;</span>: train_step, <span class="st">&#39;preds&#39;</span>: preds, <span class="st">&#39;accuracy&#39;</span>: accuracy } <span class="kw">def</span> train_graph(g, batch_size <span class="op">=</span> <span class="dv">256</span>, num_epochs <span class="op">=</span> <span class="dv">10</span>, iterator <span class="op">=</span> PaddedDataIterator): <span class="cf">with</span> tf.Session() <span class="im">as</span> sess: sess.run(tf.initialize_all_variables()) tr <span class="op">=</span> iterator(train) te <span class="op">=</span> iterator(test) step, accuracy <span class="op">=</span> <span class="dv">0</span>, <span class="dv">0</span> tr_losses, te_losses <span class="op">=</span> [], [] current_epoch <span class="op">=</span> <span class="dv">0</span> <span class="cf">while</span> current_epoch <span class="op">&lt;</span> num_epochs: step <span class="op">+=</span> <span class="dv">1</span> batch <span class="op">=</span> tr.next_batch(batch_size) feed <span class="op">=</span> {g[<span class="st">&#39;x&#39;</span>]: batch[<span class="dv">0</span>], g[<span class="st">&#39;y&#39;</span>]: batch[<span class="dv">1</span>], g[<span class="st">&#39;seqlen&#39;</span>]: batch[<span class="dv">2</span>], g[<span class="st">&#39;dropout&#39;</span>]: <span class="fl">0.6</span>} accuracy_, _ <span class="op">=</span> sess.run([g[<span class="st">&#39;accuracy&#39;</span>], g[<span
class="st">&#39;ts&#39;</span>]], feed_dict<span class="op">=</span>feed) accuracy <span class="op">+=</span> accuracy_ <span class="cf">if</span> tr.epochs <span class="op">&gt;</span> current_epoch: current_epoch <span class="op">+=</span> <span class="dv">1</span> tr_losses.append(accuracy <span class="op">/</span> step) step, accuracy <span class="op">=</span> <span class="dv">0</span>, <span class="dv">0</span> <span class="co">#eval test set</span> te_epoch <span class="op">=</span> te.epochs <span class="cf">while</span> te.epochs <span class="op">==</span> te_epoch: step <span class="op">+=</span> <span class="dv">1</span> batch <span class="op">=</span> te.next_batch(batch_size) feed <span class="op">=</span> {g[<span class="st">&#39;x&#39;</span>]: batch[<span class="dv">0</span>], g[<span class="st">&#39;y&#39;</span>]: batch[<span class="dv">1</span>], g[<span class="st">&#39;seqlen&#39;</span>]: batch[<span class="dv">2</span>]} accuracy_ <span class="op">=</span> sess.run([g[<span class="st">&#39;accuracy&#39;</span>]], feed_dict<span class="op">=</span>feed)[<span class="dv">0</span>] accuracy <span class="op">+=</span> accuracy_ te_losses.append(accuracy <span class="op">/</span> step) step, accuracy <span class="op">=</span> <span class="dv">0</span>,<span class="dv">0</span> <span class="bu">print</span>(<span class="st">&quot;Accuracy after epoch&quot;</span>, current_epoch, <span class="st">&quot; - tr:&quot;</span>, tr_losses[<span class="op">-</span><span class="dv">1</span>], <span class="st">&quot;- te:&quot;</span>, te_losses[<span class="op">-</span><span class="dv">1</span>]) <span class="cf">return</span> tr_losses, te_losses</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">g <span class="op">=</span> build_graph() tr_losses, te_losses <span class="op">=</span> train_graph(g)</code></pre></div> <pre><code>Accuracy after epoch 1 - tr: 0.319347791963 - te: 0.351068906904 Accuracy 
after epoch 2 - tr: 0.355731238225 - te: 0.357366258375 Accuracy after epoch 3 - tr: 0.361505161451 - te: 0.358625811348 Accuracy after epoch 4 - tr: 0.363629598859 - te: 0.359358642169 Accuracy after epoch 5 - tr: 0.365078599278 - te: 0.358609453518 Accuracy after epoch 6 - tr: 0.365907767689 - te: 0.359358642169 Accuracy after epoch 7 - tr: 0.367192406322 - te: 0.359833019263 Accuracy after epoch 8 - tr: 0.368336397059 - te: 0.360304124791 Accuracy after epoch 9 - tr: 0.369028188455 - te: 0.360434987437 Accuracy after epoch 10 - tr: 0.37021715381 - te: 0.36041535804</code></pre> <p>After 10 epochs, our network has an accuracy of about 36%, about twice as good as chance–not bad for predicting age and gender from a single sentence!</p> <h2 id="improving-training-speed-using-bucketing">Improving training speed using bucketing</h2> <p>For the network above, we used a batch_size of 256. But each example in the batch had a different length ranging from 5 to 30. As the maximum length for each batch is usually very close to 30, short sequences required a lot of padding (e.g., all sequences of length 5 in the batch are padded with up to 25 zeros). 
Given this dataset, each batch is padded with an average of over 3000 zeros, or over 10 padding symbols per sample:</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">tr <span class="op">=</span> PaddedDataIterator(train) padding <span class="op">=</span> <span class="dv">0</span> <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">100</span>): lengths <span class="op">=</span> tr.next_batch(<span class="dv">256</span>)[<span class="dv">2</span>].values max_len <span class="op">=</span> <span class="bu">max</span>(lengths) padding <span class="op">+=</span> np.<span class="bu">sum</span>(max_len <span class="op">-</span> lengths) <span class="bu">print</span>(<span class="st">&quot;Average padding / batch:&quot;</span>, padding<span class="op">/</span><span class="dv">100</span>)</code></pre></div> <pre><code>Average padding / batch: 3279.9</code></pre> <p>This leads to a lot of excess computation, and we can improve upon it by “bucketing” our training samples. If we select our batches such that the lengths of the samples in each batch are within, say, 5 of each other, then the amount of padding in a batch of 256 is bounded by 256 * 5 = 1280. This would make our worst-case outcome more than twice as good as the previous average-case outcome.</p> <p>To take advantage of bucketing, we simply modify our DataIterator. There are many ways one might implement this, but the key point to keep in mind is that we should not “bias” the order in which different sequence lengths are sampled any more than necessary to achieve bucketing. E.g., sorting our data by sequence length might seem like a good solution, but then each epoch would train on short sequences before longer ones, which could harm results.
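</p> <p>As a quick sanity check on the arithmetic above, here is a small numpy simulation, using synthetic lengths drawn uniformly from 5 to 30 in place of the real length distribution:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
# 100 toy batches of 256 sequence lengths each, uniform on [5, 30]
lengths = rng.integers(5, 31, size=(100, 256))

def avg_padding(batches):
    # Per batch: every sample is padded up to that batch's max length
    return np.mean([np.sum(b.max() - b) for b in batches])

# Bucketing approximated by sorting all lengths so each batch holds similar ones
bucketed = np.sort(lengths.ravel()).reshape(100, 256)

print(avg_padding(lengths))   # roughly 3200, in line with the measurement above
print(avg_padding(bucketed))  # a small fraction of that
```

<p>The real length distribution is not uniform, but the measured 3279.9 above is in the same ballpark as this estimate.</p> <p>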
Here is one solution, which uses a predetermined batch_size:</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">class</span> BucketedDataIterator(): <span class="kw">def</span> <span class="fu">__init__</span>(<span class="va">self</span>, df, num_buckets <span class="op">=</span> <span class="dv">5</span>): df <span class="op">=</span> df.sort_values(<span class="st">&#39;length&#39;</span>).reset_index(drop<span class="op">=</span><span class="va">True</span>) <span class="va">self</span>.size <span class="op">=</span> <span class="bu">len</span>(df) <span class="op">//</span> num_buckets <span class="va">self</span>.dfs <span class="op">=</span> [] <span class="cf">for</span> bucket <span class="kw">in</span> <span class="bu">range</span>(num_buckets): <span class="va">self</span>.dfs.append(df.ix[bucket<span class="op">*</span><span class="va">self</span>.size: (bucket<span class="op">+</span><span class="dv">1</span>)<span class="op">*</span><span class="va">self</span>.size <span class="op">-</span> <span class="dv">1</span>]) <span class="va">self</span>.num_buckets <span class="op">=</span> num_buckets <span class="co"># cursor[i] will be the cursor for the ith bucket</span> <span class="va">self</span>.cursor <span class="op">=</span> np.array([<span class="dv">0</span>] <span class="op">*</span> num_buckets) <span class="va">self</span>.shuffle() <span class="va">self</span>.epochs <span class="op">=</span> <span class="dv">0</span> <span class="kw">def</span> shuffle(<span class="va">self</span>): <span class="co"># shuffles each bucket; samples of similar lengths stay grouped together</span> <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="va">self</span>.num_buckets): <span class="va">self</span>.dfs[i] <span class="op">=</span> <span class="va">self</span>.dfs[i].sample(frac<span class="op">=</span><span
class="dv">1</span>).reset_index(drop<span class="op">=</span><span class="va">True</span>) <span class="va">self</span>.cursor[i] <span class="op">=</span> <span class="dv">0</span> <span class="kw">def</span> next_batch(<span class="va">self</span>, n): <span class="cf">if</span> np.<span class="bu">any</span>(<span class="va">self</span>.cursor<span class="op">+</span>n<span class="op">+</span><span class="dv">1</span> <span class="op">&gt;</span> <span class="va">self</span>.size): <span class="va">self</span>.epochs <span class="op">+=</span> <span class="dv">1</span> <span class="va">self</span>.shuffle() i <span class="op">=</span> np.random.randint(<span class="dv">0</span>,<span class="va">self</span>.num_buckets) res <span class="op">=</span> <span class="va">self</span>.dfs[i].ix[<span class="va">self</span>.cursor[i]:<span class="va">self</span>.cursor[i]<span class="op">+</span>n<span class="dv">-1</span>] <span class="va">self</span>.cursor[i] <span class="op">+=</span> n <span class="co"># Pad sequences with 0s so they are all the same length</span> maxlen <span class="op">=</span> <span class="bu">max</span>(res[<span class="st">&#39;length&#39;</span>]) x <span class="op">=</span> np.zeros([n, maxlen], dtype<span class="op">=</span>np.int32) <span class="cf">for</span> i, x_i <span class="kw">in</span> <span class="bu">enumerate</span>(x): x_i[:res[<span class="st">&#39;length&#39;</span>].values[i]] <span class="op">=</span> res[<span class="st">&#39;as_numbers&#39;</span>].values[i] <span class="cf">return</span> x, res[<span class="st">&#39;gender&#39;</span>]<span class="op">*</span><span class="dv">3</span> <span class="op">+</span> res[<span class="st">&#39;age_bracket&#39;</span>], res[<span class="st">&#39;length&#39;</span>]</code></pre></div> <p>With this modified iterator, we improve the average padding / batch by a factor of about 6:</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">tr <span 
class="op">=</span> BucketedDataIterator(train, <span class="dv">5</span>) padding <span class="op">=</span> <span class="dv">0</span> <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">100</span>): lengths <span class="op">=</span> tr.next_batch(<span class="dv">256</span>)[<span class="dv">2</span>].values max_len <span class="op">=</span> <span class="bu">max</span>(lengths) padding <span class="op">+=</span> np.<span class="bu">sum</span>(max_len <span class="op">-</span> lengths) <span class="bu">print</span>(<span class="st">&quot;Average padding / batch:&quot;</span>, padding<span class="op">/</span><span class="dv">100</span>)</code></pre></div> <pre><code>Average padding / batch: 573.49</code></pre> <p>We can also compare the difference in training speed, and observe that this bucketing strategy speeds up training by about 30%:</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">from</span> time <span class="im">import</span> time g <span class="op">=</span> build_graph() t <span class="op">=</span> time() tr_losses, te_losses <span class="op">=</span> train_graph(g, num_epochs<span class="op">=</span><span class="dv">1</span>, iterator<span class="op">=</span>PaddedDataIterator) <span class="bu">print</span>(<span class="st">&quot;Total time for 1 epoch with PaddedDataIterator:&quot;</span>, time() <span class="op">-</span> t)</code></pre></div> <pre><code>Accuracy after epoch 1 - tr: 0.310819936427 - te: 0.341703713389 Total time for 1 epoch with PaddedDataIterator: 100.90330529212952</code></pre> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">g <span class="op">=</span> build_graph() t <span class="op">=</span> time() tr_losses, te_losses <span class="op">=</span> train_graph(g, num_epochs<span class="op">=</span><span class="dv">1</span>, iterator<span class="op">=</span>BucketedDataIterator) <span 
class="bu">print</span>(<span class="st">&quot;Total time for 1 epoch with BucketedDataIterator:&quot;</span>, time() <span class="op">-</span> t)</code></pre></div> <pre><code>Accuracy after epoch 1 - tr: 0.31359524197 - te: 0.349176362254 Total time for 1 epoch with BucketedDataIterator: 71.45360088348389</code></pre> <p>Note how easy it was to move to a bucketed model–all we had to do was change our data generator. This was made possible by the use of a partially-known shape for our input placeholder, with the num_steps dimension unknown. Contrast this to the more complicated approach in Tensorflow’s <a href="https://www.tensorflow.org/versions/r0.11/tutorials/seq2seq/index.html">seq2seq tutorial</a>, which builds a different graph for each of four buckets.</p> <h4 id="a-note-on-awkward-sequence-lengths">A note on awkward sequence lengths</h4> <p>Suppose we had a dataset with awkward sequence lengths that made even a bucketed approach inefficient. For example, we might have lots of very short sequences of lengths 1, 2 and 3. Alternatively, we might have a few very long sequences among our shorter ones; we want to propagate the internal state forward through time for the long sequences, but don’t have enough of them to train efficiently in parallel. One solution in both of these scenarios is to combine short sequences into longer ones, but have the internal state of the RNN reset in between each such sequence. I believe this is not possible to do with Tensorflow’s default RNN functions (e.g., <code>dynamic_rnn</code>), so if you’re looking for a way to do this, I would look into writing a custom RNN method using <code>tf.scan</code>. I show how to use <code>tf.scan</code> to build a custom RNN in my post, <a href="http://r2rt.com/recurrent-neural-networks-in-tensorflow-ii.html">Recurrent Neural Networks in Tensorflow II</a>. 
With the right accumulator function, you could program in the state resets dynamically based on either a special PAD symbol, or an auxiliary input sequence that indicates where the state should be reset.</p> <h2 id="a-basic-model-for-sequence-to-sequence-learning">A basic model for sequence to sequence learning</h2> <p>Finally, we extend our sequence classification model to do <strong>sequence-to-sequence learning</strong>. We’ll use the same dataset, but instead of having our model guess the author’s age bracket and gender at the end of the sequence (i.e., only once), we’ll have it guess at every timestep.</p> <p>The added wrinkle when moving to a sequence-to-sequence model is that we need to make sure that time-steps with a PAD symbol do not contribute to our loss, since they are just there as filler. We do so by zeroing the loss at these time steps, which is known as applying a “mask” or “masking” the loss. This is achieved by pointwise multiplying the loss tensor (with each entry representing a time step), by a tensor of 1s and 0s, where 1s represent valid steps and 0s represent PAD steps. 
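</p> <p>The graph below builds exactly this mask by gathering rows of a lower-triangular matrix of ones (row L-1 holds L ones) and slicing it to the batch’s maximum length. Here is the same computation in numpy, with toy loss values assumed for the masked average:</p>

```python
import numpy as np

seqlen = np.array([4, 6, 2])  # lengths of three sequences in a batch
max_len = 30                  # maximum sequence length in the dataset

# Row (L - 1) of a lower-triangular ones matrix is: L ones followed by zeros
lower_triangular_ones = np.tril(np.ones([max_len, max_len], dtype=np.float32))
seqlen_mask = lower_triangular_ones[seqlen - 1][:, :seqlen.max()]
print(seqlen_mask)
# [[1. 1. 1. 1. 0. 0.]
#  [1. 1. 1. 1. 1. 1.]
#  [1. 1. 0. 0. 0. 0.]]

# Masked mean loss: PAD steps contribute nothing
loss = np.ones([3, 6])  # toy per-time-step losses
masked_loss = np.sum(loss * seqlen_mask) / np.sum(seqlen_mask)
print(masked_loss)      # 1.0 here, since every valid step has loss 1
```

<p>Dividing by <code>np.sum(seqlen_mask)</code>, rather than by the full tensor size, keeps the average from being diluted by PAD steps.</p> <p>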
A similar modification is made to the “accuracy” calculation below, as noted in the comments.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> build_seq2seq_graph( vocab_size <span class="op">=</span> <span class="bu">len</span>(vocab), state_size <span class="op">=</span> <span class="dv">64</span>, batch_size <span class="op">=</span> <span class="dv">256</span>, num_classes <span class="op">=</span> <span class="dv">6</span>): reset_graph() <span class="co"># Placeholders</span> x <span class="op">=</span> tf.placeholder(tf.int32, [batch_size, <span class="va">None</span>]) <span class="co"># [batch_size, num_steps]</span> seqlen <span class="op">=</span> tf.placeholder(tf.int32, [batch_size]) y <span class="op">=</span> tf.placeholder(tf.int32, [batch_size]) keep_prob <span class="op">=</span> tf.constant(<span class="fl">1.0</span>) <span class="co"># Tile the target indices</span> <span class="co"># (in a regular seq2seq model, our targets placeholder might have this shape)</span> y_ <span class="op">=</span> tf.tile(tf.expand_dims(y, <span class="dv">1</span>), [<span class="dv">1</span>, tf.shape(x)[<span class="dv">1</span>]]) <span class="co"># [batch_size, num_steps]</span> <span class="co">&quot;&quot;&quot;</span> <span class="co"> Create a mask that we will use for the cost function</span> <span class="co"> This mask is the same shape as x and y_, and is equal to 1 for all non-PAD time</span> <span class="co"> steps (where a prediction is made), and 0 for all PAD time steps (no pred -&gt; no loss)</span> <span class="co"> The number 30, used when creating the lower_triangle_ones matrix, is the maximum</span> <span class="co"> sequence length in our dataset</span> <span class="co"> &quot;&quot;&quot;</span> lower_triangular_ones <span class="op">=</span> tf.constant(np.tril(np.ones([<span class="dv">30</span>,<span class="dv">30</span>])),dtype<span class="op">=</span>tf.float32) 
seqlen_mask <span class="op">=</span> tf.<span class="bu">slice</span>(tf.gather(lower_triangular_ones, seqlen <span class="op">-</span> <span class="dv">1</span>),<span class="op">\</span> [<span class="dv">0</span>, <span class="dv">0</span>], [batch_size, tf.reduce_max(seqlen)]) <span class="co"># Embedding layer</span> embeddings <span class="op">=</span> tf.get_variable(<span class="st">&#39;embedding_matrix&#39;</span>, [vocab_size, state_size]) rnn_inputs <span class="op">=</span> tf.nn.embedding_lookup(embeddings, x) <span class="co"># RNN</span> cell <span class="op">=</span> tf.nn.rnn_cell.GRUCell(state_size) init_state <span class="op">=</span> tf.get_variable(<span class="st">&#39;init_state&#39;</span>, [<span class="dv">1</span>, state_size], initializer<span class="op">=</span>tf.constant_initializer(<span class="fl">0.0</span>)) init_state <span class="op">=</span> tf.tile(init_state, [batch_size, <span class="dv">1</span>]) rnn_outputs, final_state <span class="op">=</span> tf.nn.dynamic_rnn(cell, rnn_inputs, sequence_length<span class="op">=</span>seqlen, initial_state<span class="op">=</span>init_state) <span class="co"># Add dropout, as the model otherwise quickly overfits</span> rnn_outputs <span class="op">=</span> tf.nn.dropout(rnn_outputs, keep_prob) <span class="co">#reshape rnn_outputs and y</span> rnn_outputs <span class="op">=</span> tf.reshape(rnn_outputs, [<span class="op">-</span><span class="dv">1</span>, state_size]) y_reshaped <span class="op">=</span> tf.reshape(y_, [<span class="op">-</span><span class="dv">1</span>]) <span class="co"># Softmax layer</span> <span class="cf">with</span> tf.variable_scope(<span class="st">&#39;softmax&#39;</span>): W <span class="op">=</span> tf.get_variable(<span class="st">&#39;W&#39;</span>, [state_size, num_classes]) b <span class="op">=</span> tf.get_variable(<span class="st">&#39;b&#39;</span>, [num_classes], initializer<span class="op">=</span>tf.constant_initializer(<span 
class="fl">0.0</span>)) logits <span class="op">=</span> tf.matmul(rnn_outputs, W) <span class="op">+</span> b preds <span class="op">=</span> tf.nn.softmax(logits) <span class="co"># To calculate the number correct, we want to count padded steps as incorrect</span> correct <span class="op">=</span> tf.cast(tf.equal(tf.cast(tf.argmax(preds,<span class="dv">1</span>),tf.int32), y_reshaped),tf.int32) <span class="op">*\</span> tf.cast(tf.reshape(seqlen_mask, [<span class="op">-</span><span class="dv">1</span>]),tf.int32) <span class="co"># To calculate accuracy we want to divide by the number of non-padded time-steps,</span> <span class="co"># rather than taking the mean</span> accuracy <span class="op">=</span> tf.reduce_sum(tf.cast(correct, tf.float32)) <span class="op">/</span> tf.reduce_sum(tf.cast(seqlen, tf.float32)) loss <span class="op">=</span> tf.nn.sparse_softmax_cross_entropy_with_logits(logits, y_reshaped) loss <span class="op">=</span> loss <span class="op">*</span> tf.reshape(seqlen_mask, [<span class="op">-</span><span class="dv">1</span>]) <span class="co"># To calculate average loss, we need to divide by number of non-padded time-steps,</span> <span class="co"># rather than taking the mean</span> loss <span class="op">=</span> tf.reduce_sum(loss) <span class="op">/</span> tf.reduce_sum(seqlen_mask) train_step <span class="op">=</span> tf.train.AdamOptimizer(<span class="fl">1e-4</span>).minimize(loss) <span class="cf">return</span> { <span class="st">&#39;x&#39;</span>: x, <span class="st">&#39;seqlen&#39;</span>: seqlen, <span class="st">&#39;y&#39;</span>: y, <span class="st">&#39;dropout&#39;</span>: keep_prob, <span class="st">&#39;loss&#39;</span>: loss, <span class="st">&#39;ts&#39;</span>: train_step, <span class="st">&#39;preds&#39;</span>: preds, <span class="st">&#39;accuracy&#39;</span>: accuracy }</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">g <span class="op">=</span> 
build_seq2seq_graph() tr_losses, te_losses <span class="op">=</span> train_graph(g, iterator<span class="op">=</span>BucketedDataIterator)</code></pre></div> <pre><code>Accuracy after epoch 1 - tr: 0.292434578401 - te: 0.316306242085 Accuracy after epoch 2 - tr: 0.320437548276 - te: 0.322733865921 Accuracy after epoch 3 - tr: 0.325227848205 - te: 0.322927395211 Accuracy after epoch 4 - tr: 0.327002136049 - te: 0.324078651696 Accuracy after epoch 5 - tr: 0.327847927489 - te: 0.324469006651 Accuracy after epoch 6 - tr: 0.328276157813 - te: 0.324198486081 Accuracy after epoch 7 - tr: 0.329078430968 - te: 0.324715245167 Accuracy after epoch 8 - tr: 0.330095707002 - te: 0.325317926384 Accuracy after epoch 9 - tr: 0.330612316872 - te: 0.32550007953 Accuracy after epoch 10 - tr: 0.331520609485 - te: 0.326069803531</code></pre> <p>As expected, our sequence-to-sequence model has slightly worse accuracy than our sequence classification model (because its early guesses are nearly random and reduce the accuracy).</p> <h2 id="conclusion">Conclusion</h2> <p>In this post, we learned four concepts, all related to building RNNs that work with variable length sequences. First, we learned how to pad input sequences so that we can feed in a single zero-padded input tensor. Second, we learned how to get the last relevant output in a sequence classification model. Third, we learned how to use bucketing to get a significant boost in training time. Finally, we learned how to “mask” our loss function so that we can train sequence-to-sequence models with variable length sequences.</p> </body> </html> Binary Stochastic Neurons in Tensorflow2016-09-24T00:00:00-04:002016-09-24T00:00:00-04:00Silviu Pitistag:r2rt.com,2016-09-24:/binary-stochastic-neurons-in-tensorflow.htmlIn this post, I introduce and discuss binary stochastic neurons, implement trainable binary stochastic neurons in Tensorflow, and conduct several simple experiments on the MNIST dataset to get a feel for their behavior.
Binary stochastic neurons offer two advantages over real-valued neurons: they can act as a regularizer and they enable conditional computation by enabling a network to make yes/no decisions. Conditional computation opens the door to new and exciting neural network architectures, such as the choice of experts architecture and hierarchical multiscale neural networks.<!DOCTYPE html> <html> <head> <meta charset="utf-8"> <meta name="generator" content="pandoc"> <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes"> <title></title> <style type="text/css">code{white-space: pre;}</style> <script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-MML-AM_CHTML" type="text/javascript"></script> <!--[if lt IE 9]> <script src="//cdnjs.cloudflare.com/ajax/libs/html5shiv/3.7.3/html5shiv-printshiv.min.js"></script> <![endif]--> </head> <body> <p>In this post, I introduce and discuss binary stochastic neurons, implement trainable binary stochastic neurons in Tensorflow, and conduct several simple experiments on the MNIST dataset to get a feel for their behavior. Binary stochastic neurons offer two advantages over real-valued neurons: they can act as a regularizer and they enable conditional computation by enabling a network to make yes/no decisions. Conditional computation opens the door to new and exciting neural network architectures, such as the choice of experts architecture and hierarchical multiscale neural networks, which I plan to discuss in future posts.</p> <h3 id="the-binary-stochastic-neuron">The binary stochastic neuron</h3> <p>A binary stochastic neuron is a neuron with a noisy output: some proportion <span class="math inline">$$p$$</span> of the time it outputs 1, otherwise 0.
An easy way to turn a real-valued input, <span class="math inline">$$a$$</span>, into this proportion, <span class="math inline">$$p$$</span>, is to set <span class="math inline">$$p = \text{sigm}(a)$$</span>, where <span class="math inline">$$\text{sigm}$$</span> is the logistic sigmoid, <span class="math inline">$$\text{sigm}(x) = \frac{1}{1 + \exp(-x)}$$</span>. Thus, we define the binary stochastic neuron, <span class="math inline">$$\text{BSN}$$</span>, as:</p> <p><span class="math display">$\text{BSN}(a) = \textbf{1}_{z\ \lt\ \text{sigm}(a)}$</span></p> <p>where <span class="math inline">$$\textbf{1}_{x}$$</span> is the <a href="https://en.wikipedia.org/wiki/Indicator_function">indicator function</a> on the truth value of <span class="math inline">$$x$$</span> and <span class="math inline">$$z \sim U[0,1]$$</span>.</p> <h3 id="advantages-of-the-binary-stochastic-neuron">Advantages of the binary stochastic neuron</h3> <ol type="1"> <li><p>A binary stochastic neuron is a noisy modification of the logistic sigmoid: instead of outputting <span class="math inline">$$p$$</span>, it outputs 1 with probability <span class="math inline">$$p$$</span> and 0 otherwise. Noise generally serves as a regularizer (see, e.g., <a href="http://www.jmlr.org/papers/v15/srivastava14a.html">Srivastava et al. (2014)</a> and <a href="https://arxiv.org/abs/1511.06807">Neelakantan et al. (2015)</a>), and so we might expect the same from binary stochastic neurons as compared to the logistic neurons. Indeed, this is the claimed “unpublished result” from the end of <a href="https://www.youtube.com/watch?v=LN0xtUuJsEI&amp;list=PLoRl3Ht4JOcdU872GhiYWf6jwrk_SNhz9&amp;index=41">Hinton et al.’s Coursera Lecture 9c</a>, which I demonstrate empirically in this post.</p></li> <li><p>Further, by enabling networks to make binary decisions, the binary stochastic neuron allows for conditional computation. This opens the door to some interesting new architectures. 
For example, instead of a mixture of experts architecture, which weights the outputs of several “expert” sub-networks and requires that all subnetworks be computed, we could use a <em>choice</em> of experts architecture, which conditionally uses expert sub-networks as needed. This architecture is implicitly proposed in <a href="https://arxiv.org/abs/1308.3432">Bengio et al. (2013)</a>, wherein the experiments use a choice of expert units architecture (i.e., a gated architecture where gates must be 1 or 0). Another example, proposed in <a href="https://arxiv.org/abs/1308.3432">Bengio et al. (2013)</a> and implemented by <a href="https://arxiv.org/abs/1609.01704">Chung et al. (2016)</a>, is the Hierarchical Multiscale Recurrent Neural Network (HM-RNN) architecture, which achieves great results on language modelling tasks.</p></li> </ol> <h3 id="training-the-binary-stochastic-neuron">Training the binary stochastic neuron</h3> <p>For any single trial, the binary stochastic neuron generally has a derivative of 0 and cannot be trained by simple backpropagation. To see this, consider that if <span class="math inline">$$z \neq \text{sigm}(a)$$</span> in the <span class="math inline">$$\text{BSN}$$</span> function above, there exists a <a href="https://en.wikipedia.org/wiki/Neighbourhood_(mathematics)">neighborhood</a> around <span class="math inline">$$a$$</span> such that the output of <span class="math inline">$$\text{BSN}(a)$$</span> is unchanged (i.e., the derivative is 0). We get around this by <em>estimating</em> the derivative with respect to the <em>expected</em> loss, rather than calculating the derivative with respect to the outcome of a single trial. We can only estimate this derivative, because in any given trial, we only see the loss value with respect to the given noise – we don’t know what the loss would have been given another level of noise. We call a method that provides such an estimate an “estimator”.
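</p>
<p>The zero per-trial derivative is easy to see in a direct numpy transcription of the definition (a sketch, not part of the original implementation):</p>

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def bsn(a, z):
    # BSN(a) = 1 if z < sigm(a), else 0, with z ~ U[0, 1]
    return float(z < sigm(a))

# For a fixed draw z != sigm(a), nudging the input a does not change the
# output, so the derivative for that single trial is 0
z = 0.3
assert bsn(1.0, z) == bsn(1.0 + 1e-6, z) == 1.0

# In expectation, however, E[BSN(a)] = sigm(a)
rng = np.random.default_rng(0)
mean = np.mean([bsn(1.0, u) for u in rng.uniform(size=100_000)])
```

<p>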
An estimator is <em>unbiased</em> if the expectation of its estimate equals the expectation of the derivative it is estimating; otherwise, it is <em>biased</em>.</p> <p>In this post we implement the two estimators discussed in <a href="https://arxiv.org/abs/1308.3432">Bengio et al. (2013)</a>:</p> <ol type="1"> <li><p>The REINFORCE estimator, which is an unbiased estimator and a special case of the REINFORCE algorithm discussed in <a href="http://link.springer.com/article/10.1007/BF00992696">Williams (1992)</a>.</p> <p>The REINFORCE estimator estimates the expectation of <span class="math inline">$$\frac{\partial L}{\partial a}$$</span> as <span class="math inline">$$(\text{BSN}(a) - \text{sigm}(a))(L - c)$$</span>, where <span class="math inline">$$c$$</span> is a constant. <a href="https://arxiv.org/abs/1308.3432">Bengio et al. (2013)</a> proves that:</p> <p><span class="math display">$\mathbb{E}[(\text{BSN}(a) - \text{sigm}(a))(L - c)] = \mathbb{E}\big[\frac{\partial L}{\partial a}\big].$</span></p> <p><a href="https://arxiv.org/abs/1308.3432">Bengio et al. (2013)</a> further shows that to minimize the variance of the estimation, we choose:</p> <p><span class="math display">$c = \bar L = \frac{\mathbb{E}[(\text{BSN}(a) - \text{sigm}(a))^2L]}{\mathbb{E}[(\text{BSN}(a) - \text{sigm}(a))^2]}$</span></p> <p>which we can practically implement by keeping track of the numerator and denominator as a moving average.
Interestingly, the REINFORCE estimator does not require any backpropagated loss gradient–it operates directly on the loss of the network.</p></li> <li><p>The straight through (ST) estimator, which is a biased estimator that was first proposed by <a href="https://www.youtube.com/watch?v=LN0xtUuJsEI&amp;list=PLoRl3Ht4JOcdU872GhiYWf6jwrk_SNhz9&amp;index=41">Hinton et al.’s Coursera Lecture 9c</a>.</p> <p>The ST estimator simply replaces the derivative factor used during backpropagation, <span class="math inline">$$\frac{d\text{BSN}(a)}{da} = 0$$</span>, with the identity function <span class="math inline">$$\frac{d\text{BSN}(a)}{da} = 1$$</span>.<a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a> A variant of the ST estimator replaces the derivative factor with <span class="math inline">$$\frac{d\text{BSN}(a)}{da} = \frac{d\text{sigm}(a)}{da}$$</span>. Whereas <a href="https://arxiv.org/abs/1308.3432">Bengio et al. (2013)</a> found that the former is more effective, the latter variant was successfully used in <a href="https://arxiv.org/abs/1609.01704">Chung et al. (2016)</a> in combination with the <em>slope-annealing trick</em> and deterministic binary neurons (which we will see perform very similarly to, if not better than, stochastic binary neurons when used with slope-annealing). The slope-annealing trick modifies <span class="math inline">$$\text{BSN}(a)$$</span> by first multiplying the input <span class="math inline">$$a$$</span> by a slope <span class="math inline">$$m$$</span> as follows:</p> <p><span class="math display">$\text{BSN}_{\text{SL}(m)}(a) = \textbf{1}_{z \lt \text{sigm}(ma)}.$</span></p> <p>Then, we increase the slope as training progresses and use <span class="math inline">$$\frac{d\text{BSN}_{\text{SL}(m)}(a)}{da} = \frac{d\text{sigm}(ma)}{da}$$</span> when computing the gradient.
The idea behind this is that as the slope increases, the logistic sigmoid approaches a step function, so that its derivative approaches the true derivative. All three variants are tested in this post.</p></li> </ol> <h3 id="implementing-the-binary-stochastic-neuron-in-tensorflow">Implementing the binary stochastic neuron in Tensorflow</h3> <p>The tricky part of implementing a binary stochastic neuron in Tensorflow is not the forward computation, but the implementation of the REINFORCE and straight through estimators. Each requires replacing the gradient of one or more Tensorflow operations. The <a href="https://www.tensorflow.org/how_tos/adding_an_op/">official approach</a> to this is to write a new op in C++, which seems wholly unnecessary. There are, however, two workable unofficial approaches, one of which is <a href="http://stackoverflow.com/questions/36456436/how-can-i-define-only-the-gradient-for-a-tensorflow-subgraph/36480182">a trick credited to Sergey Ioffe</a>, and another that uses <code>gradient_override_map</code>, an experimental feature of Tensorflow that is documented <a href="https://www.tensorflow.org/api_docs/python/framework/core_graph_data_structures#Graph.gradient_override_map">here</a>.
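</p>
<p>As a sanity check on the math we are about to wire in, the REINFORCE estimate can be verified to be unbiased with a quick numpy simulation (a toy setup where the loss is simply the neuron’s own output, so the true derivative of the expected loss is <span class="math inline">$$\text{sigm}(a)(1 - \text{sigm}(a))$$</span>):</p>

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy setup: the loss L equals the sampled output, so E[L] = sigm(a) and
# the true derivative dE[L]/da is sigm(a) * (1 - sigm(a))
a, c = 0.5, 0.0                      # c is the baseline (any constant works)
p = sigm(a)
rng = np.random.default_rng(0)
samples = (rng.uniform(size=200_000) < p).astype(float)  # BSN(a) draws
losses = samples                     # L = BSN(a) in this toy example

# REINFORCE estimate of dE[L]/da: (BSN(a) - sigm(a)) * (L - c)
estimates = (samples - p) * (losses - c)
mean_estimate = estimates.mean()     # should approach p * (1 - p)
```

<p>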
We will use <code>gradient_override_map</code>, which works well for our purposes.</p> <h4 id="imports-and-utility-functions">Imports and Utility Functions</h4> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> numpy <span class="im">as</span> np <span class="im">import</span> tensorflow <span class="im">as</span> tf <span class="im">from</span> tensorflow.examples.tutorials.mnist <span class="im">import</span> input_data <span class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt <span class="op">%</span>matplotlib inline mnist <span class="op">=</span> input_data.read_data_sets(<span class="st">&#39;MNIST_data&#39;</span>, one_hot<span class="op">=</span><span class="va">True</span>) <span class="im">from</span> tensorflow.python.framework <span class="im">import</span> ops <span class="im">from</span> enum <span class="im">import</span> Enum <span class="im">import</span> seaborn <span class="im">as</span> sns sns.<span class="bu">set</span>(color_codes<span class="op">=</span><span class="va">True</span>) <span class="kw">def</span> reset_graph(): <span class="cf">if</span> <span class="st">&#39;sess&#39;</span> <span class="kw">in</span> <span class="bu">globals</span>() <span class="kw">and</span> sess: sess.close() tf.reset_default_graph() <span class="kw">def</span> layer_linear(inputs, shape, scope<span class="op">=</span><span class="st">&#39;linear_layer&#39;</span>): <span class="cf">with</span> tf.variable_scope(scope): w <span class="op">=</span> tf.get_variable(<span class="st">&#39;w&#39;</span>,shape) b <span class="op">=</span> tf.get_variable(<span class="st">&#39;b&#39;</span>,shape[<span class="op">-</span><span class="dv">1</span>:]) <span class="cf">return</span> tf.matmul(inputs,w) <span class="op">+</span> b <span class="kw">def</span> layer_softmax(inputs, shape, scope<span class="op">=</span><span class="st">&#39;softmax_layer&#39;</span>): <span 
class="cf">with</span> tf.variable_scope(scope): w <span class="op">=</span> tf.get_variable(<span class="st">&#39;w&#39;</span>,shape) b <span class="op">=</span> tf.get_variable(<span class="st">&#39;b&#39;</span>,shape[<span class="op">-</span><span class="dv">1</span>:]) <span class="cf">return</span> tf.nn.softmax(tf.matmul(inputs,w) <span class="op">+</span> b) <span class="kw">def</span> accuracy(y, pred): correct <span class="op">=</span> tf.equal(tf.argmax(y,<span class="dv">1</span>), tf.argmax(pred,<span class="dv">1</span>)) <span class="cf">return</span> tf.reduce_mean(tf.cast(correct, tf.float32)) <span class="kw">def</span> plot_n(data_and_labels, lower_y <span class="op">=</span> <span class="fl">0.</span>, title<span class="op">=</span><span class="st">&quot;Learning Curves&quot;</span>): fig, ax <span class="op">=</span> plt.subplots() <span class="cf">for</span> data, label <span class="kw">in</span> data_and_labels: ax.plot(<span class="bu">range</span>(<span class="dv">0</span>,<span class="bu">len</span>(data)<span class="op">*</span><span class="dv">100</span>,<span class="dv">100</span>),data, label<span class="op">=</span>label) ax.set_xlabel(<span class="st">&#39;Training steps&#39;</span>) ax.set_ylabel(<span class="st">&#39;Accuracy&#39;</span>) ax.set_ylim([lower_y,<span class="dv">1</span>]) ax.set_title(title) ax.legend(loc<span class="op">=</span><span class="dv">4</span>) plt.show() <span class="kw">class</span> StochasticGradientEstimator(Enum): ST <span class="op">=</span> <span class="dv">0</span> REINFORCE <span class="op">=</span> <span class="dv">1</span></code></pre></div> <pre><code>Extracting MNIST_data/train-images-idx3-ubyte.gz Extracting MNIST_data/train-labels-idx1-ubyte.gz Extracting MNIST_data/t10k-images-idx3-ubyte.gz Extracting MNIST_data/t10k-labels-idx1-ubyte.gz</code></pre> <h4 id="binary-stochastic-neuron-with-straight-through-estimator">Binary stochastic neuron with straight through estimator</h4> <div 
class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> binaryRound(x): <span class="co">&quot;&quot;&quot;</span> <span class="co"> Rounds a tensor whose values are in [0,1] to a tensor with values in {0, 1},</span> <span class="co"> using the straight through estimator for the gradient.</span> <span class="co"> &quot;&quot;&quot;</span> g <span class="op">=</span> tf.get_default_graph() <span class="cf">with</span> ops.name_scope(<span class="st">&quot;BinaryRound&quot;</span>) <span class="im">as</span> name: <span class="cf">with</span> g.gradient_override_map({<span class="st">&quot;Round&quot;</span>: <span class="st">&quot;Identity&quot;</span>}): <span class="cf">return</span> tf.<span class="bu">round</span>(x, name<span class="op">=</span>name) <span class="co"># For Tensorflow v0.11 and below use:</span> <span class="co">#with g.gradient_override_map({&quot;Floor&quot;: &quot;Identity&quot;}):</span> <span class="co"># return tf.round(x, name=name)</span></code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> bernoulliSample(x): <span class="co">&quot;&quot;&quot;</span> <span class="co"> Uses a tensor whose values are in [0,1] to sample a tensor with values in {0, 1},</span> <span class="co"> using the straight through estimator for the gradient.</span> <span class="co"> E.g.,:</span> <span class="co"> if x is 0.6, bernoulliSample(x) will be 1 with probability 0.6, and 0 otherwise,</span> <span class="co"> and the gradient will be pass-through (identity).</span> <span class="co"> &quot;&quot;&quot;</span> g <span class="op">=</span> tf.get_default_graph() <span class="cf">with</span> ops.name_scope(<span class="st">&quot;BernoulliSample&quot;</span>) <span class="im">as</span> name: <span class="cf">with</span> g.gradient_override_map({<span class="st">&quot;Ceil&quot;</span>: <span 
class="st">&quot;Identity&quot;</span>,<span class="st">&quot;Sub&quot;</span>: <span class="st">&quot;BernoulliSample_ST&quot;</span>}): <span class="cf">return</span> tf.ceil(x <span class="op">-</span> tf.random_uniform(tf.shape(x)), name<span class="op">=</span>name) <span class="at">@ops.RegisterGradient</span>(<span class="st">&quot;BernoulliSample_ST&quot;</span>) <span class="kw">def</span> bernoulliSample_ST(op, grad): <span class="cf">return</span> [grad, tf.zeros(tf.shape(op.inputs[<span class="dv">1</span>]))]</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> passThroughSigmoid(x, slope<span class="op">=</span><span class="dv">1</span>): <span class="co">&quot;&quot;&quot;Sigmoid that uses identity function as its gradient&quot;&quot;&quot;</span> g <span class="op">=</span> tf.get_default_graph() <span class="cf">with</span> ops.name_scope(<span class="st">&quot;PassThroughSigmoid&quot;</span>) <span class="im">as</span> name: <span class="cf">with</span> g.gradient_override_map({<span class="st">&quot;Sigmoid&quot;</span>: <span class="st">&quot;Identity&quot;</span>}): <span class="cf">return</span> tf.sigmoid(x, name<span class="op">=</span>name) <span class="kw">def</span> binaryStochastic_ST(x, slope_tensor<span class="op">=</span><span class="va">None</span>, pass_through<span class="op">=</span><span class="va">True</span>, stochastic<span class="op">=</span><span class="va">True</span>): <span class="co">&quot;&quot;&quot;</span> <span class="co"> Sigmoid followed by either a random sample from a bernoulli distribution according</span> <span class="co"> to the result (binary stochastic neuron) (default), or a sigmoid followed by a binary</span> <span class="co"> step function (if stochastic == False). 
Uses the straight through estimator.</span> <span class="co"> See https://arxiv.org/abs/1308.3432.</span> <span class="co"> Arguments:</span> <span class="co"> * x: the pre-activation / logit tensor</span> <span class="co"> * slope_tensor: if passThrough==False, slope adjusts the slope of the sigmoid function</span> <span class="co"> for purposes of the Slope Annealing Trick (see http://arxiv.org/abs/1609.01704)</span> <span class="co"> * pass_through: if True (default), gradient of the entire function is 1 or 0;</span> <span class="co"> if False, gradient of 1 is scaled by the gradient of the sigmoid (required if</span> <span class="co"> Slope Annealing Trick is used)</span> <span class="co"> * stochastic: binary stochastic neuron if True (default), or step function if False</span> <span class="co"> &quot;&quot;&quot;</span> <span class="cf">if</span> slope_tensor <span class="kw">is</span> <span class="va">None</span>: slope_tensor <span class="op">=</span> tf.constant(<span class="fl">1.0</span>) <span class="cf">if</span> pass_through: p <span class="op">=</span> passThroughSigmoid(x) <span class="cf">else</span>: p <span class="op">=</span> tf.sigmoid(slope_tensor<span class="op">*</span>x) <span class="cf">if</span> stochastic: <span class="cf">return</span> bernoulliSample(p) <span class="cf">else</span>: <span class="cf">return</span> binaryRound(p)</code></pre></div> <h4 id="binary-stochastic-neuron-with-reinforce-estimator">Binary stochastic neuron with REINFORCE estimator</h4> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> binaryStochastic_REINFORCE(x, stochastic <span class="op">=</span> <span class="va">True</span>, loss_op_name<span class="op">=</span><span class="st">&quot;loss_by_example&quot;</span>): <span class="co">&quot;&quot;&quot;</span> <span class="co"> Sigmoid followed by a random sample from a bernoulli distribution according</span> <span class="co"> to the result (binary 
stochastic neuron). Uses the REINFORCE estimator.</span> <span class="co"> See https://arxiv.org/abs/1308.3432.</span> <span class="co"> </span><span class="al">NOTE</span><span class="co">: Requires a loss operation with name matching the argument for loss_op_name</span> <span class="co"> in the graph. This loss operation should be broken out by example (i.e., not a</span> <span class="co"> single number for the entire batch).</span> <span class="co"> &quot;&quot;&quot;</span> g <span class="op">=</span> tf.get_default_graph() <span class="cf">with</span> ops.name_scope(<span class="st">&quot;BinaryStochasticREINFORCE&quot;</span>): <span class="cf">with</span> g.gradient_override_map({<span class="st">&quot;Sigmoid&quot;</span>: <span class="st">&quot;BinaryStochastic_REINFORCE&quot;</span>, <span class="st">&quot;Ceil&quot;</span>: <span class="st">&quot;Identity&quot;</span>}): p <span class="op">=</span> tf.sigmoid(x) reinforce_collection <span class="op">=</span> g.get_collection(<span class="st">&quot;REINFORCE&quot;</span>) <span class="cf">if</span> <span class="kw">not</span> reinforce_collection: g.add_to_collection(<span class="st">&quot;REINFORCE&quot;</span>, {}) reinforce_collection <span class="op">=</span> g.get_collection(<span class="st">&quot;REINFORCE&quot;</span>) reinforce_collection[<span class="dv">0</span>][p.op.name] <span class="op">=</span> loss_op_name <span class="cf">return</span> tf.ceil(p <span class="op">-</span> tf.random_uniform(tf.shape(x))) <span class="at">@ops.RegisterGradient</span>(<span class="st">&quot;BinaryStochastic_REINFORCE&quot;</span>) <span class="kw">def</span> _binaryStochastic_REINFORCE(op, _): <span class="co">&quot;&quot;&quot;Unbiased estimator for binary stochastic function based on REINFORCE.&quot;&quot;&quot;</span> loss_op_name <span class="op">=</span> op.graph.get_collection(<span class="st">&quot;REINFORCE&quot;</span>)[<span class="dv">0</span>][op.name] loss_tensor <span class="op">=</span> 
op.graph.get_operation_by_name(loss_op_name).outputs[<span class="dv">0</span>] sub_tensor <span class="op">=</span> op.outputs[<span class="dv">0</span>].consumers()[<span class="dv">0</span>].outputs[<span class="dv">0</span>] <span class="co">#subtraction tensor</span> ceil_tensor <span class="op">=</span> sub_tensor.consumers()[<span class="dv">0</span>].outputs[<span class="dv">0</span>] <span class="co">#ceiling tensor</span> outcome_diff <span class="op">=</span> (ceil_tensor <span class="op">-</span> op.outputs[<span class="dv">0</span>]) <span class="co"># Provides an early out if we want to avoid variance adjustment for</span> <span class="co"># whatever reason (e.g., to show that variance adjustment helps)</span> <span class="cf">if</span> op.graph.get_collection(<span class="st">&quot;REINFORCE&quot;</span>)[<span class="dv">0</span>].get(<span class="st">&quot;no_variance_adj&quot;</span>): <span class="cf">return</span> outcome_diff <span class="op">*</span> tf.expand_dims(loss_tensor, <span class="dv">1</span>) outcome_diff_sq <span class="op">=</span> tf.square(outcome_diff) outcome_diff_sq_r <span class="op">=</span> tf.reduce_mean(outcome_diff_sq, reduction_indices<span class="op">=</span><span class="dv">0</span>) outcome_diff_sq_loss_r <span class="op">=</span> tf.reduce_mean(outcome_diff_sq <span class="op">*</span> tf.expand_dims(loss_tensor, <span class="dv">1</span>), reduction_indices<span class="op">=</span><span class="dv">0</span>) L_bar_num <span class="op">=</span> tf.Variable(tf.zeros(outcome_diff_sq_r.get_shape()), trainable<span class="op">=</span><span class="va">False</span>) L_bar_den <span class="op">=</span> tf.Variable(tf.ones(outcome_diff_sq_r.get_shape()), trainable<span class="op">=</span><span class="va">False</span>) <span class="co">#Note: we already get a decent estimate of the average from the minibatch</span> decay <span class="op">=</span> <span class="fl">0.95</span> train_L_bar_num <span class="op">=</span> 
tf.assign(L_bar_num, L_bar_num<span class="op">*</span>decay <span class="op">+\</span> outcome_diff_sq_loss_r<span class="op">*</span>(<span class="dv">1</span><span class="op">-</span>decay)) train_L_bar_den <span class="op">=</span> tf.assign(L_bar_den, L_bar_den<span class="op">*</span>decay <span class="op">+\</span> outcome_diff_sq_r<span class="op">*</span>(<span class="dv">1</span><span class="op">-</span>decay)) <span class="cf">with</span> tf.control_dependencies([train_L_bar_num, train_L_bar_den]): L_bar <span class="op">=</span> train_L_bar_num<span class="op">/</span>(train_L_bar_den<span class="fl">+1e-4</span>) L <span class="op">=</span> tf.tile(tf.expand_dims(loss_tensor,<span class="dv">1</span>), tf.constant([<span class="dv">1</span>,L_bar.get_shape().as_list()[<span class="dv">0</span>]])) <span class="cf">return</span> outcome_diff <span class="op">*</span> (L <span class="op">-</span> L_bar)</code></pre></div> <h4 id="wrapper-to-create-layer-of-binary-stochastic-neurons">Wrapper to create layer of binary stochastic neurons</h4> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> binary_wrapper(<span class="op">\</span> pre_activations_tensor, estimator<span class="op">=</span>StochasticGradientEstimator.ST, stochastic_tensor<span class="op">=</span>tf.constant(<span class="va">True</span>), pass_through<span class="op">=</span><span class="va">True</span>, slope_tensor<span class="op">=</span>tf.constant(<span class="fl">1.0</span>)): <span class="co">&quot;&quot;&quot;</span> <span class="co"> Turns a layer of pre-activations (logits) into a layer of binary stochastic neurons</span> <span class="co"> Keyword arguments:</span> <span class="co"> *estimator: either ST or REINFORCE</span> <span class="co"> *stochastic_tensor: a boolean tensor indicating whether to sample from a bernoulli</span> <span class="co"> distribution (True, default) or use a step_function (e.g., for 
inference)</span> <span class="co"> *pass_through: for ST only - boolean as to whether to substitute identity derivative on the</span> <span class="co"> backprop (True, default), or whether to use the derivative of the sigmoid</span> <span class="co"> *slope_tensor: for ST only - tensor specifying the slope for purposes of slope annealing</span> <span class="co"> trick</span> <span class="co"> &quot;&quot;&quot;</span> <span class="cf">if</span> estimator <span class="op">==</span> StochasticGradientEstimator.ST: <span class="cf">if</span> pass_through: <span class="cf">return</span> tf.cond(stochastic_tensor, <span class="kw">lambda</span>: binaryStochastic_ST(pre_activations_tensor), <span class="kw">lambda</span>: binaryStochastic_ST(pre_activations_tensor, stochastic<span class="op">=</span><span class="va">False</span>)) <span class="cf">else</span>: <span class="cf">return</span> tf.cond(stochastic_tensor, <span class="kw">lambda</span>: binaryStochastic_ST(pre_activations_tensor, slope_tensor <span class="op">=</span> slope_tensor, pass_through<span class="op">=</span><span class="va">False</span>), <span class="kw">lambda</span>: binaryStochastic_ST(pre_activations_tensor, slope_tensor <span class="op">=</span> slope_tensor, pass_through<span class="op">=</span><span class="va">False</span>, stochastic<span class="op">=</span><span class="va">False</span>)) <span class="cf">elif</span> estimator <span class="op">==</span> StochasticGradientEstimator.REINFORCE: <span class="co"># binaryStochastic_REINFORCE was designed to only be stochastic, so using the ST version</span> <span class="co"># for the step fn for purposes of using step fn at evaluation / not for training</span> <span class="cf">return</span> tf.cond(stochastic_tensor, <span class="kw">lambda</span>: binaryStochastic_REINFORCE(pre_activations_tensor), <span class="kw">lambda</span>: binaryStochastic_ST(pre_activations_tensor, stochastic<span class="op">=</span><span class="va">False</span>)) 
<span class="cf">else</span>: <span class="cf">raise</span> <span class="pp">ValueError</span>(<span class="st">&quot;Unrecognized estimator.&quot;</span>)</code></pre></div> <h4 id="function-to-build-graph-for-mnist-classifier">Function to build graph for MNIST classifier</h4> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> build_classifier(hidden_dims<span class="op">=</span>[<span class="dv">100</span>], lr <span class="op">=</span> <span class="fl">0.5</span>, pass_through <span class="op">=</span> <span class="va">True</span>, non_binary <span class="op">=</span> <span class="va">False</span>, estimator <span class="op">=</span> StochasticGradientEstimator.ST, no_var_adj<span class="op">=</span><span class="va">False</span>): reset_graph() g <span class="op">=</span> {} <span class="cf">if</span> no_var_adj: tf.get_default_graph().add_to_collection(<span class="st">&quot;REINFORCE&quot;</span>, {<span class="st">&quot;no_variance_adj&quot;</span>: no_var_adj}) g[<span class="st">&#39;x&#39;</span>] <span class="op">=</span> tf.placeholder(tf.float32, [<span class="va">None</span>, <span class="dv">784</span>], name<span class="op">=</span><span class="st">&#39;x_placeholder&#39;</span>) g[<span class="st">&#39;y&#39;</span>] <span class="op">=</span> tf.placeholder(tf.float32, [<span class="va">None</span>, <span class="dv">10</span>], name<span class="op">=</span><span class="st">&#39;y_placeholder&#39;</span>) g[<span class="st">&#39;stochastic&#39;</span>] <span class="op">=</span> tf.constant(<span class="va">True</span>) g[<span class="st">&#39;slope&#39;</span>] <span class="op">=</span> tf.constant(<span class="fl">1.0</span>) g[<span class="st">&#39;layers&#39;</span>] <span class="op">=</span> {<span class="dv">0</span>: g[<span class="st">&#39;x&#39;</span>]} hidden_layers <span class="op">=</span> <span class="bu">len</span>(hidden_dims) dims <span class="op">=</span> [<span 
class="dv">784</span>] <span class="op">+</span> hidden_dims <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">1</span>, hidden_layers<span class="op">+</span><span class="dv">1</span>): <span class="cf">with</span> tf.variable_scope(<span class="st">&quot;layer_&quot;</span> <span class="op">+</span> <span class="bu">str</span>(i)): pre_activations <span class="op">=</span> layer_linear(g[<span class="st">&#39;layers&#39;</span>][i<span class="dv">-1</span>], dims[i<span class="dv">-1</span>:i<span class="op">+</span><span class="dv">1</span>], scope<span class="op">=</span><span class="st">&#39;layer_&#39;</span> <span class="op">+</span> <span class="bu">str</span>(i)) <span class="cf">if</span> non_binary: g[<span class="st">&#39;layers&#39;</span>][i] <span class="op">=</span> tf.sigmoid(pre_activations) <span class="cf">else</span>: g[<span class="st">&#39;layers&#39;</span>][i] <span class="op">=</span> binary_wrapper(pre_activations, estimator <span class="op">=</span> estimator, pass_through <span class="op">=</span> pass_through, stochastic_tensor <span class="op">=</span> g[<span class="st">&#39;stochastic&#39;</span>], slope_tensor <span class="op">=</span> g[<span class="st">&#39;slope&#39;</span>]) g[<span class="st">&#39;pred&#39;</span>] <span class="op">=</span> layer_softmax(g[<span class="st">&#39;layers&#39;</span>][hidden_layers], [dims[<span class="op">-</span><span class="dv">1</span>], <span class="dv">10</span>]) g[<span class="st">&#39;loss&#39;</span>] <span class="op">=</span> <span class="op">-</span>tf.reduce_mean(g[<span class="st">&#39;y&#39;</span>] <span class="op">*</span> tf.log(g[<span class="st">&#39;pred&#39;</span>]),reduction_indices<span class="op">=</span><span class="dv">1</span>) <span class="co"># named loss_by_example necessary for REINFORCE estimator</span> tf.identity(g[<span class="st">&#39;loss&#39;</span>], name<span class="op">=</span><span 
class="st">&quot;loss_by_example&quot;</span>) g[<span class="st">&#39;ts&#39;</span>] <span class="op">=</span> tf.train.GradientDescentOptimizer(lr).minimize(g[<span class="st">&#39;loss&#39;</span>]) g[<span class="st">&#39;accuracy&#39;</span>] <span class="op">=</span> accuracy(g[<span class="st">&#39;y&#39;</span>], g[<span class="st">&#39;pred&#39;</span>]) g[<span class="st">&#39;init_op&#39;</span>] <span class="op">=</span> tf.global_variables_initializer() <span class="cf">return</span> g</code></pre></div> <h4 id="function-to-train-the-classifier">Function to train the classifier</h4> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> train_classifier(<span class="op">\</span> hidden_dims<span class="op">=</span>[<span class="dv">100</span>,<span class="dv">100</span>], estimator<span class="op">=</span>StochasticGradientEstimator.ST, stochastic_train<span class="op">=</span><span class="va">True</span>, stochastic_eval<span class="op">=</span><span class="va">True</span>, slope_annealing_rate<span class="op">=</span><span class="va">None</span>, epochs<span class="op">=</span><span class="dv">10</span>, lr<span class="op">=</span><span class="fl">0.5</span>, non_binary<span class="op">=</span><span class="va">False</span>, no_var_adj<span class="op">=</span><span class="va">False</span>, train_set <span class="op">=</span> mnist.train, val_set <span class="op">=</span> mnist.validation, verbose<span class="op">=</span><span class="va">False</span>, label<span class="op">=</span><span class="va">None</span>): <span class="cf">if</span> slope_annealing_rate <span class="kw">is</span> <span class="va">None</span>: g <span class="op">=</span> build_classifier(hidden_dims<span class="op">=</span>hidden_dims, lr<span class="op">=</span>lr, pass_through<span class="op">=</span><span class="va">True</span>, non_binary<span class="op">=</span>non_binary, estimator<span class="op">=</span>estimator, 
no_var_adj<span class="op">=</span>no_var_adj) <span class="cf">else</span>: g <span class="op">=</span> build_classifier(hidden_dims<span class="op">=</span>hidden_dims, lr<span class="op">=</span>lr, pass_through<span class="op">=</span><span class="va">False</span>, non_binary<span class="op">=</span>non_binary, estimator<span class="op">=</span>estimator, no_var_adj<span class="op">=</span>no_var_adj) <span class="cf">with</span> tf.Session() <span class="im">as</span> sess: sess.run(g[<span class="st">&#39;init_op&#39;</span>]) slope <span class="op">=</span> <span class="dv">1</span> res_tr, res_val <span class="op">=</span> [], [] <span class="cf">for</span> epoch <span class="kw">in</span> <span class="bu">range</span>(epochs): feed_dict<span class="op">=</span>{g[<span class="st">&#39;x&#39;</span>]: val_set.images, g[<span class="st">&#39;y&#39;</span>]: val_set.labels, g[<span class="st">&#39;stochastic&#39;</span>]: stochastic_eval, g[<span class="st">&#39;slope&#39;</span>]: slope} <span class="cf">if</span> verbose: <span class="bu">print</span>(<span class="st">&quot;Epoch&quot;</span>, epoch, sess.run(g[<span class="st">&#39;accuracy&#39;</span>], feed_dict<span class="op">=</span>feed_dict)) accuracy <span class="op">=</span> <span class="dv">0</span> <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">1001</span>): x, y <span class="op">=</span> train_set.next_batch(<span class="dv">50</span>) feed_dict<span class="op">=</span>{g[<span class="st">&#39;x&#39;</span>]: x, g[<span class="st">&#39;y&#39;</span>]: y, g[<span class="st">&#39;stochastic&#39;</span>]: stochastic_train} acc, _ <span class="op">=</span> sess.run([g[<span class="st">&#39;accuracy&#39;</span>],g[<span class="st">&#39;ts&#39;</span>]], feed_dict<span class="op">=</span>feed_dict) accuracy <span class="op">+=</span> acc <span class="cf">if</span> i <span class="op">%</span> <span class="dv">100</span> <span 
class="op">==</span> <span class="dv">0</span> <span class="kw">and</span> i <span class="op">&gt;</span> <span class="dv">0</span>: res_tr.append(accuracy<span class="op">/</span><span class="dv">100</span>) accuracy <span class="op">=</span> <span class="dv">0</span> feed_dict<span class="op">=</span>{g[<span class="st">&#39;x&#39;</span>]: val_set.images, g[<span class="st">&#39;y&#39;</span>]: val_set.labels, g[<span class="st">&#39;stochastic&#39;</span>]: stochastic_eval, g[<span class="st">&#39;slope&#39;</span>]: slope} res_val.append(sess.run(g[<span class="st">&#39;accuracy&#39;</span>], feed_dict<span class="op">=</span>feed_dict)) <span class="cf">if</span> slope_annealing_rate <span class="kw">is</span> <span class="kw">not</span> <span class="va">None</span>: slope <span class="op">=</span> slope<span class="op">*</span>slope_annealing_rate <span class="cf">if</span> verbose: <span class="bu">print</span>(<span class="st">&quot;Sigmoid slope:&quot;</span>, slope) feed_dict<span class="op">=</span>{g[<span class="st">&#39;x&#39;</span>]: val_set.images, g[<span class="st">&#39;y&#39;</span>]: val_set.labels, g[<span class="st">&#39;stochastic&#39;</span>]: stochastic_eval, g[<span class="st">&#39;slope&#39;</span>]: slope} <span class="bu">print</span>(<span class="st">&quot;Epoch&quot;</span>, epoch<span class="op">+</span><span class="dv">1</span>, sess.run(g[<span class="st">&#39;accuracy&#39;</span>], feed_dict<span class="op">=</span>feed_dict)) <span class="cf">if</span> label <span class="kw">is</span> <span class="kw">not</span> <span class="va">None</span>: <span class="cf">return</span> (res_tr, label <span class="op">+</span> <span class="st">&quot; - Training&quot;</span>), (res_val, label <span class="op">+</span> <span class="st">&quot; - Validation&quot;</span>) <span class="cf">else</span>: <span class="cf">return</span> [(res_tr, <span class="st">&quot;Training&quot;</span>), (res_val, <span 
class="st">&quot;Validation&quot;</span>)]</code></pre></div> <h3 id="experiments">Experiments</h3> <p>We’ve now set up a good foundation from which we can run a number of simple experiments. The experiments are as follows:</p> <ul> <li><strong>Experiment 0</strong>: A non-stochastic, non-binary baseline.</li> <li><strong>Experiment 1</strong>: A comparison of variance-adjusted REINFORCE and non-variance adjusted REINFORCE, which shows that the variance adjustment allows for faster learning and higher learning rates.</li> <li><strong>Experiment 2</strong>: A comparison of pass-through ST and sigmoid-adjusted ST, which shows that the sigmoid-adjusted ST estimator obtains better results, a result that does not agree with the findings of <a href="https://arxiv.org/abs/1308.3432">Bengio et al. (2013</a>.</li> <li><strong>Experiment 3</strong>: A comparison of sigmoid-adjusted ST and slope-annealed sigmoid-adjusted ST, which shows that a well-tuned slope-annealed ST outperforms the base sigmoid-adjusted ST.</li> <li><strong>Experiment 4</strong>: A direct comparison of variance-adjusted REINFORCE and slope-annealed ST, which shows that ST performs significantly better than REINFORCE.</li> <li><strong>Experiment 5</strong>: A look at the deterministic step function, during training and evaluation, which shows that deterministic evaluation can provide a slight boost at inference, and that with slope annealing, deterministic training is just as effective, if not more effective than stochastic training.</li> <li><strong>Experiment 6</strong>: A look at how network depth affects performance, which shows that deep stochastic networks can be difficult to train.</li> <li><strong>Experiment 7</strong>: A look at using binary stochastic neurons as a regularizer, which validates Hinton’s claim that stochastic neurons can serve as effective regularizers.</li> </ul> <h4 id="experiment-0-a-non-stochastic-non-binary-baseline">Experiment 0: A non-stochastic, non-binary baseline</h4> 
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">res <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">100</span>], epochs<span class="op">=</span><span class="dv">20</span>, lr<span class="op">=</span><span class="fl">1.0</span>, non_binary<span class="op">=</span><span class="va">True</span>) plot_n(res, lower_y<span class="op">=</span><span class="fl">0.8</span>, title<span class="op">=</span><span class="st">&quot;Logistic Sigmoid Baseline&quot;</span>)</code></pre></div> <pre><code>Epoch 20 0.9698</code></pre> <figure> <img src="https://r2rt.com/static/images/BSN_output_16_1.png" alt="png" /><figcaption>png</figcaption> </figure> <h4 id="experiment-1-variance-adjusted-vs.not-variance-adjusted-reinforce">Experiment 1: Variance-adjusted vs. not variance-adjusted REINFORCE</h4> <p>Recall that the REINFORCE estimator estimates the expectation of <span class="math inline">$$\frac{\partial L}{\partial a}$$</span> as <span class="math inline">$$(\text{BSN}(a) - \text{sigm}(a))(L - c)$$</span>, where <span class="math inline">$$c$$</span> is a constant. The non-variance-adjusted form of REINFORCE uses <span class="math inline">$$c = 0$$</span>, whereas the variance-adjusted form uses the variance-minimizing result stated above. Naturally, we should prefer the estimator with lower variance, and the experimental results below agree.</p> <p>It seems that both forms of REINFORCE often break down for learning rates greater than or equal to 0.3 (compare to the learning rate of 1.0 used in Experiment 0).
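</p> <p>To make the estimator concrete, here is a minimal NumPy sketch of both forms of the REINFORCE estimate. The pre-activations and losses are made-up illustrative values, and the moving-average machinery from the gradient override above is collapsed into a single per-batch estimate of <span class="math inline">$$c$$</span>:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# Batch of 4 examples, 3 binary stochastic neurons (illustrative values)
a = rng.normal(size=(4, 3))                          # pre-activations
p = sigmoid(a)
b = (rng.uniform(size=p.shape) < p).astype(float)    # BSN(a): Bernoulli sample

L = np.array([0.7, 1.2, 0.4, 0.9])                   # per-example losses

# Non-variance-adjusted REINFORCE: c = 0
grad_c0 = (b - p) * (L - 0.0)[:, None]

# Variance-adjusted: c estimates E[(b - p)^2 L] / E[(b - p)^2], which the
# gradient override above tracks with the L_bar_num / L_bar_den averages
c = np.mean((b - p)**2 * L[:, None], axis=0) / np.mean((b - p)**2, axis=0)
grad_adj = (b - p) * (L[:, None] - c)
```

<p>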
After a few trials, variance-adjusted REINFORCE appears to be more resistant to such failures.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="bu">print</span>(<span class="st">&quot;Variance-adjusted:&quot;</span>) res1 <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">100</span>], estimator<span class="op">=</span>StochasticGradientEstimator.REINFORCE, epochs<span class="op">=</span><span class="dv">3</span>, lr<span class="op">=</span><span class="fl">0.3</span>, verbose<span class="op">=</span><span class="va">True</span>) <span class="bu">print</span>(<span class="st">&quot;Not variance-adjusted:&quot;</span>) res2 <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">100</span>], estimator<span class="op">=</span>StochasticGradientEstimator.REINFORCE, epochs<span class="op">=</span><span class="dv">3</span>, lr<span class="op">=</span><span class="fl">0.3</span>, no_var_adj<span class="op">=</span><span class="va">True</span>, verbose<span class="op">=</span><span class="va">True</span>)</code></pre></div> <pre><code>Variance-adjusted: Epoch 0 0.1026 Epoch 1 0.4466 Epoch 2 0.511 Epoch 3 0.575 Not variance-adjusted: Epoch 0 0.0964 Epoch 1 0.0958 Epoch 2 0.0958 Epoch 3 0.0958</code></pre> <p>In terms of performance at lower learning rates, a learning rate of about 0.05 provided the best results. The results show that the variance-adjusted REINFORCE learns faster, but that its non-variance-adjusted counterpart eventually catches up. This is consistent with the mathematical result that both are unbiased estimators.
Performance is predictably worse than it was for the plain logistic sigmoid in Experiment 0, although there is almost no generalization gap, consistent with the hypothesis that binary stochastic neurons can act as regularizers.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">res1 <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">100</span>], estimator<span class="op">=</span>StochasticGradientEstimator.REINFORCE, epochs<span class="op">=</span><span class="dv">20</span>, lr<span class="op">=</span><span class="fl">0.05</span>, label <span class="op">=</span> <span class="st">&quot;Variance-adjusted&quot;</span>) res2<span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">100</span>], estimator<span class="op">=</span>StochasticGradientEstimator.REINFORCE, epochs<span class="op">=</span><span class="dv">20</span>, lr<span class="op">=</span><span class="fl">0.05</span>, no_var_adj<span class="op">=</span><span class="va">True</span>, label <span class="op">=</span> <span class="st">&quot;Not variance-adjusted&quot;</span>) plot_n(res1 <span class="op">+</span> res2, lower_y<span class="op">=</span><span class="fl">0.6</span>, title<span class="op">=</span><span class="st">&quot;Experiment 1: REINFORCE variance adjustment&quot;</span>)</code></pre></div> <pre><code>Epoch 20 0.9274 Epoch 20 0.923</code></pre> <figure> <img src="https://r2rt.com/static/images/BSN_output_20_1.png" alt="png" /><figcaption>png</figcaption> </figure> <h4 id="experiment-2-pass-through-vs.sigmoid-adjusted-st-estimation">Experiment 2: Pass-through vs. sigmoid-adjusted ST estimation</h4> <p>Recall that one variant of the straight-through estimator uses the identity function as the backpropagated gradient (pass-through), and another variant multiplies that by the gradient of the logistic sigmoid that the neuron calculates (sigmoid-adjusted). 
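</p> <p>A minimal NumPy sketch of the two backward-pass variants just described (the values are illustrative, and only the gradient through the binarization is shown):</p>

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

a = np.array([-2.0, 0.5, 3.0])          # pre-activations of three binary neurons
upstream = np.array([0.1, -0.4, 0.2])   # gradient arriving from the layer above

# Pass-through ST: treat the binarization as the identity on the backward pass
grad_pass_through = upstream

# Sigmoid-adjusted ST: additionally scale by sigm'(a) = sigm(a) * (1 - sigm(a))
p = sigmoid(a)
grad_sigmoid_adjusted = upstream * p * (1.0 - p)
```

<p>Note that the sigmoid-adjusted gradient is always a shrunken version of the pass-through gradient, since the sigmoid's derivative never exceeds 1/4.</p> <p>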
In <a href="https://arxiv.org/abs/1308.3432">Bengio et al. (2013)</a>, it was remarked that, surprisingly, the former performs better. My results below disagree, and by a surprisingly wide margin.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">res1 <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">100</span>], estimator<span class="op">=</span>StochasticGradientEstimator.ST, epochs<span class="op">=</span><span class="dv">20</span>, lr<span class="op">=</span><span class="fl">0.1</span>, label <span class="op">=</span> <span class="st">&quot;Pass-through - 0.1&quot;</span>) res2 <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">100</span>], estimator<span class="op">=</span>StochasticGradientEstimator.ST, epochs<span class="op">=</span><span class="dv">20</span>, lr<span class="op">=</span><span class="fl">0.1</span>, slope_annealing_rate <span class="op">=</span> <span class="fl">1.0</span>, label <span class="op">=</span> <span class="st">&quot;Sigmoid-adjusted - 0.1&quot;</span>) res3 <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">100</span>], estimator<span class="op">=</span>StochasticGradientEstimator.ST, epochs<span class="op">=</span><span class="dv">20</span>, lr<span class="op">=</span><span class="fl">0.3</span>, label <span class="op">=</span> <span class="st">&quot;Pass-through - 0.3&quot;</span>) res4 <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">100</span>], estimator<span class="op">=</span>StochasticGradientEstimator.ST, epochs<span class="op">=</span><span class="dv">20</span>, lr<span class="op">=</span><span class="fl">0.3</span>, slope_annealing_rate <span class="op">=</span> <span class="fl">1.0</span>, label <span class="op">=</span> <span class="st">&quot;Sigmoid-adjusted - 0.3&quot;</span>) res5 <span 
class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">100</span>], estimator<span class="op">=</span>StochasticGradientEstimator.ST, epochs<span class="op">=</span><span class="dv">20</span>, lr<span class="op">=</span><span class="fl">1.0</span>, label <span class="op">=</span> <span class="st">&quot;Pass-through - 1.0&quot;</span>) res6 <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">100</span>], estimator<span class="op">=</span>StochasticGradientEstimator.ST, epochs<span class="op">=</span><span class="dv">20</span>, lr<span class="op">=</span><span class="fl">1.0</span>, slope_annealing_rate <span class="op">=</span> <span class="fl">1.0</span>, label <span class="op">=</span> <span class="st">&quot;Sigmoid-adjusted - 1.0&quot;</span>) plot_n(res1[<span class="dv">1</span>:] <span class="op">+</span> res2[<span class="dv">1</span>:] <span class="op">+</span> res3[<span class="dv">1</span>:] <span class="op">+</span> res4[<span class="dv">1</span>:] <span class="op">+</span> res5[<span class="dv">1</span>:] <span class="op">+</span> res6[<span class="dv">1</span>:], lower_y<span class="op">=</span><span class="fl">0.4</span>, title<span class="op">=</span><span class="st">&quot;Experiment 2: Pass-through vs sigmoid-adjusted ST&quot;</span>)</code></pre></div> <pre><code>Epoch 20 0.8334 Epoch 20 0.9566 Epoch 20 0.8828 Epoch 20 0.9668 Epoch 20 0.0958 Epoch 20 0.9572</code></pre> <figure> <img src="https://r2rt.com/static/images/BSN_output_22_1.png" alt="png" /><figcaption>png</figcaption> </figure> <h4 id="experiment-3-pass-through-vs.slope-annealed-st-estimation">Experiment 3: Sigmoid-adjusted vs. slope-annealed ST estimation</h4> <p>Recall that <a href="https://arxiv.org/abs/1609.01704">Chung et al.
(2016)</a> improves upon the sigmoid-adjusted variant of the ST estimator by using the <em>slope-annealing trick</em>, which slowly increases the slope of the logistic sigmoid as training progresses. Using the slope-annealing trick with an annealing rate of 1.1 times per epoch (so the slope at epoch 20 is <span class="math inline">$$1.1^{19} \approx 6.1$$</span>), we’re able to improve upon the sigmoid-adjusted ST estimator, and even beat our non-stochastic, non-binary baseline! Note that the slope annealed neuron used here is not the same as the one used by <a href="https://arxiv.org/abs/1609.01704">Chung et al. (2016)</a>, who employ a deterministic step function and use a hard sigmoid in place of a sigmoid for the backpropagation.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">res1 <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">100</span>], estimator<span class="op">=</span>StochasticGradientEstimator.ST, epochs<span class="op">=</span><span class="dv">20</span>, lr<span class="op">=</span><span class="fl">0.1</span>, slope_annealing_rate <span class="op">=</span> <span class="fl">1.0</span>, label <span class="op">=</span> <span class="st">&quot;Sigmoid-adjusted - 0.1&quot;</span>) res2 <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">100</span>], estimator<span class="op">=</span>StochasticGradientEstimator.ST, epochs<span class="op">=</span><span class="dv">20</span>, lr<span class="op">=</span><span class="fl">0.1</span>, slope_annealing_rate <span class="op">=</span> <span class="fl">1.1</span>, label <span class="op">=</span> <span class="st">&quot;Slope-annealed - 0.1&quot;</span>) res3 <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">100</span>], estimator<span class="op">=</span>StochasticGradientEstimator.ST, epochs<span class="op">=</span><span 
class="dv">20</span>, lr<span class="op">=</span><span class="fl">0.3</span>, slope_annealing_rate <span class="op">=</span> <span class="fl">1.0</span>, label <span class="op">=</span> <span class="st">&quot;Sigmoid-adjusted - 0.3&quot;</span>) res4 <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">100</span>], estimator<span class="op">=</span>StochasticGradientEstimator.ST, epochs<span class="op">=</span><span class="dv">20</span>, lr<span class="op">=</span><span class="fl">0.3</span>, slope_annealing_rate <span class="op">=</span> <span class="fl">1.1</span>, label <span class="op">=</span> <span class="st">&quot;Slope-annealed - 0.3&quot;</span>) res5 <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">100</span>], estimator<span class="op">=</span>StochasticGradientEstimator.ST, epochs<span class="op">=</span><span class="dv">20</span>, lr<span class="op">=</span><span class="fl">1.0</span>, slope_annealing_rate <span class="op">=</span> <span class="fl">1.0</span>, label <span class="op">=</span> <span class="st">&quot;Sigmoid-adjusted - 1.0&quot;</span>) res6 <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">100</span>], estimator<span class="op">=</span>StochasticGradientEstimator.ST, epochs<span class="op">=</span><span class="dv">20</span>, lr<span class="op">=</span><span class="fl">1.0</span>, slope_annealing_rate <span class="op">=</span> <span class="fl">1.1</span>, label <span class="op">=</span> <span class="st">&quot;Slope-annealed - 1.0&quot;</span>) plot_n(res1[<span class="dv">1</span>:] <span class="op">+</span> res2[<span class="dv">1</span>:] <span class="op">+</span> res3[<span class="dv">1</span>:] <span class="op">+</span> res4[<span class="dv">1</span>:] <span class="op">+</span> res5[<span class="dv">1</span>:] <span class="op">+</span> res6[<span class="dv">1</span>:], lower_y<span 
class="op">=</span><span class="fl">0.6</span>, title<span class="op">=</span><span class="st">&quot;Experiment 3: Sigmoid-adjusted vs slope-annealed ST&quot;</span>)</code></pre></div> <pre><code>Epoch 20 0.9548 Epoch 20 0.974 Epoch 20 0.9704 Epoch 20 0.9764 Epoch 20 0.9608 Epoch 20 0.9624</code></pre> <figure> <img src="https://r2rt.com/static/images/BSN_output_24_1.png" alt="png" /><figcaption>png</figcaption> </figure> <h4 id="experiment-4-variance-adjusted-reinforce-vs-slope-annealed-st">Experiment 4: Variance-adjusted REINFORCE vs slope-annealed ST</h4> <p>We now directly compare the variance-adjusted REINFORCE and slope-annealed ST, both at their best learning rates. In this setting, despite being a biased estimator, the straight-through estimator displays faster learning, less variance, and better overall results than the variance-adjusted REINFORCE estimator.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">res1 <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">100</span>], estimator<span class="op">=</span>StochasticGradientEstimator.REINFORCE, epochs<span class="op">=</span><span class="dv">20</span>, lr<span class="op">=</span><span class="fl">0.05</span>, label <span class="op">=</span> <span class="st">&quot;Variance-adjusted REINFORCE&quot;</span>) res2 <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">100</span>], estimator<span class="op">=</span>StochasticGradientEstimator.ST, epochs<span class="op">=</span><span class="dv">20</span>, lr<span class="op">=</span><span class="fl">0.3</span>, slope_annealing_rate <span class="op">=</span> <span class="fl">1.1</span>, label <span class="op">=</span> <span class="st">&quot;Slope-annealed ST&quot;</span>) plot_n(res1[<span class="dv">1</span>:] <span class="op">+</span> res2[<span class="dv">1</span>:], lower_y<span class="op">=</span><span class="fl">0.6</span>, 
title<span class="op">=</span><span class="st">&quot;Experiment 4: Variance-adjusted REINFORCE vs slope-annealed ST&quot;</span>)</code></pre></div> <pre><code>Epoch 20 0.926 Epoch 20 0.9782</code></pre> <figure> <img src="https://r2rt.com/static/images/BSN_output_26_1.png" alt="png" /><figcaption>png</figcaption> </figure> <h4 id="experiment-5-a-look-at-the-deterministic-step-function-during-training-and-evaluation">Experiment 5: A look at the deterministic step function, during training and evaluation</h4> <p>Just as dropout is applied during training but not at inference, it makes sense that we might replace the stochastic sigmoid with a deterministic step function at inference when using binary neurons. We might go even further than that, and use deterministic neurons during training, which is the approach taken by <a href="https://arxiv.org/abs/1609.01704">Chung et al. (2016)</a>. The following three combinations are compared below, using the slope-annealed straight-through estimator:</p> <ul> <li>stochastic during training, stochastic during test</li> <li>stochastic during training, deterministic during test</li> <li>deterministic during training, deterministic during test</li> </ul> <p>The results show that deterministic neurons train the fastest, but also display more overfitting and may not achieve the best final results. Stochastic inference and deterministic inference, when combined with stochastic training, are closely comparable. 
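The forward-pass difference between the two inference modes is just how the sigmoid activation is discretized. Here is a minimal numpy sketch (`binary_neuron` is an illustrative helper of my own, not the post's `train_classifier` internals):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def binary_neuron(a, stochastic=True, rng=None):
    # Discretize a sigmoid activation to {0, 1}. stochastic=True samples a
    # Bernoulli(sigmoid(a)) variable (training-style); stochastic=False
    # applies a deterministic step, equivalent to thresholding a at 0.
    p = sigmoid(a)
    if stochastic:
        rng = rng or np.random.default_rng(0)
        return (rng.random(p.shape) < p).astype(float)
    return np.round(p)

a = np.array([-2.0, -0.1, 0.1, 2.0])
det = binary_neuron(a, stochastic=False)
print(det)  # [0. 0. 1. 1.]
```

Both modes agree in expectation for saturated activations; they differ most for pre-activations near zero, where the stochastic neuron flips often.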
Similar results hold for the REINFORCE estimator.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">res1 <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">100</span>], estimator<span class="op">=</span>StochasticGradientEstimator.ST, epochs<span class="op">=</span><span class="dv">20</span>, lr<span class="op">=</span><span class="fl">0.3</span>, slope_annealing_rate <span class="op">=</span> <span class="fl">1.1</span>, label <span class="op">=</span> <span class="st">&quot;Stochastic, Stochastic&quot;</span>) res2 <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">100</span>], estimator<span class="op">=</span>StochasticGradientEstimator.ST, epochs<span class="op">=</span><span class="dv">20</span>, lr<span class="op">=</span><span class="fl">0.3</span>, slope_annealing_rate <span class="op">=</span> <span class="fl">1.1</span>, stochastic_eval<span class="op">=</span><span class="va">False</span>, label <span class="op">=</span> <span class="st">&quot;Stochastic, Deterministic&quot;</span>) res3 <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">100</span>], estimator<span class="op">=</span>StochasticGradientEstimator.ST, epochs<span class="op">=</span><span class="dv">20</span>, lr<span class="op">=</span><span class="fl">0.3</span>, slope_annealing_rate <span class="op">=</span> <span class="fl">1.1</span>, stochastic_train<span class="op">=</span><span class="va">False</span>, stochastic_eval<span class="op">=</span><span class="va">False</span>, label <span class="op">=</span> <span class="st">&quot;Deterministic, Deterministic&quot;</span>) plot_n(res1 <span class="op">+</span> res2 <span class="op">+</span> res3, lower_y<span class="op">=</span><span class="fl">0.6</span>, title<span class="op">=</span><span class="st">&quot;Experiment 5: Stochastic vs Deterministic 
(Slope-annealed ST)&quot;</span>)</code></pre></div> <pre><code>Epoch 20 0.9776 Epoch 20 0.977 Epoch 20 0.9704</code></pre> <figure> <img src="https://r2rt.com/static/images/BSN_output_28_1.png" alt="png" /><figcaption>png</figcaption> </figure> <h4 id="experiment-6-the-effect-of-depth-on-reinforce-and-st-estimators">Experiment 6: The effect of depth on REINFORCE and ST estimators</h4> <p>Next, I look at how each estimator interacts with depth. From a theoretical perspective, there is reason to think the straight-through estimator will suffer from depth; as noted by <a href="https://arxiv.org/abs/1308.3432">Bengio et al. (2013)</a>, it is not even guaranteed to have the same sign as the expected gradient during backpropagation. It turns out that the slope-annealed straight-through estimator is resilient to depth, even at a reasonable learning rate. The REINFORCE estimator, on the other hand, starts to fail as depth is introduced. However, if we lower the learning rate dramatically (25x), we can start to get the deeper networks to train with the REINFORCE estimator.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">res1 <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">100</span>], estimator<span class="op">=</span>StochasticGradientEstimator.ST, epochs<span class="op">=</span><span class="dv">20</span>, lr<span class="op">=</span><span class="fl">0.3</span>, slope_annealing_rate<span class="op">=</span><span class="fl">1.1</span>, label <span class="op">=</span> <span class="st">&quot;1 hidden layer&quot;</span>) res2 <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">100</span>, <span class="dv">100</span>], estimator<span class="op">=</span>StochasticGradientEstimator.ST, epochs<span class="op">=</span><span class="dv">20</span>, lr<span class="op">=</span><span class="fl">0.3</span>, slope_annealing_rate<span 
class="op">=</span><span class="fl">1.1</span>, label <span class="op">=</span> <span class="st">&quot;2 hidden layers&quot;</span>) res3 <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">100</span>, <span class="dv">100</span>, <span class="dv">100</span>], estimator<span class="op">=</span>StochasticGradientEstimator.ST, epochs<span class="op">=</span><span class="dv">20</span>, lr<span class="op">=</span><span class="fl">0.3</span>, slope_annealing_rate<span class="op">=</span><span class="fl">1.1</span>, label <span class="op">=</span> <span class="st">&quot;3 hidden layers&quot;</span>) plot_n(res1[<span class="dv">1</span>:] <span class="op">+</span> res2[<span class="dv">1</span>:] <span class="op">+</span> res3[<span class="dv">1</span>:], title<span class="op">=</span><span class="st">&quot;Experiment 6: The effect of depth (straight-through)&quot;</span>)</code></pre></div> <pre><code>Epoch 20 0.9774 Epoch 20 0.9738 Epoch 20 0.9728</code></pre> <figure> <img src="https://r2rt.com/static/images/BSN_output_30_1.png" alt="png" /><figcaption>png</figcaption> </figure> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">res1 <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">100</span>], estimator<span class="op">=</span>StochasticGradientEstimator.REINFORCE, epochs<span class="op">=</span><span class="dv">20</span>, lr<span class="op">=</span><span class="fl">0.05</span>, label <span class="op">=</span> <span class="st">&quot;1 hidden layer&quot;</span>) res2 <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">100</span>,<span class="dv">100</span>], estimator<span class="op">=</span>StochasticGradientEstimator.REINFORCE, epochs<span class="op">=</span><span class="dv">20</span>, lr<span class="op">=</span><span class="fl">0.05</span>, label <span class="op">=</span> <span 
class="st">&quot;2 hidden layers&quot;</span>) res3 <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">100</span>,<span class="dv">100</span>,<span class="dv">100</span>], estimator<span class="op">=</span>StochasticGradientEstimator.REINFORCE, epochs<span class="op">=</span><span class="dv">20</span>, lr<span class="op">=</span><span class="fl">0.05</span>, label <span class="op">=</span> <span class="st">&quot;3 hidden layers&quot;</span>) plot_n(res1[<span class="dv">1</span>:] <span class="op">+</span> res2[<span class="dv">1</span>:] <span class="op">+</span> res3[<span class="dv">1</span>:], title<span class="op">=</span><span class="st">&quot;Experiment 6: The effect of depth (REINFORCE)&quot;</span>)</code></pre></div> <pre><code>Epoch 20 0.9302 Epoch 20 0.8788 Epoch 20 0.2904</code></pre> <figure> <img src="https://r2rt.com/static/images/BSN_output_31_1.png" alt="png" /><figcaption>png</figcaption> </figure> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">res1 <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">100</span>], epochs<span class="op">=</span><span class="dv">50</span>, non_binary<span class="op">=</span><span class="va">True</span>, lr<span class="op">=</span><span class="fl">0.002</span>, label <span class="op">=</span> <span class="st">&quot;1 hidden layer&quot;</span>) res2 <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">100</span>,<span class="dv">100</span>], epochs<span class="op">=</span><span class="dv">50</span>, non_binary<span class="op">=</span><span class="va">True</span>, lr<span class="op">=</span><span class="fl">0.002</span>, label <span class="op">=</span> <span class="st">&quot;2 hidden layers&quot;</span>) res3 <span class="op">=</span> train_classifier(hidden_dims<span class="op">=</span>[<span class="dv">100</span>,<span 
class="dv">100</span>,<span class="dv">100</span>], epochs<span class="op">=</span><span class="dv">50</span>, non_binary<span class="op">=</span><span class="va">True</span>, lr<span class="op">=</span><span class="fl">0.002</span>, label <span class="op">=</span> <span class="st">&quot;3 hidden layers&quot;</span>) plot_n(res1[<span class="dv">1</span>:] <span class="op">+</span> res2[<span class="dv">1</span>:] <span class="op">+</span> res3[<span class="dv">1</span>:], title<span class="op">=</span><span class="st">&quot;Experiment 6: The effect of depth (REINFORCE) (LR = 0.002)&quot;</span>)</code></pre></div> <pre><code>Epoch 50 0.931 Epoch 50 0.9294 Epoch 50 0.9096</code></pre> <figure> <img src="https://r2rt.com/static/images/BSN_output_32_1.png" alt="png" /><figcaption>png</figcaption> </figure> <h4 id="experiment-7-using-binary-stochastic-neurons-as-a-regularizer.">Experiment 7: Using binary stochastic neurons as a regularizer.</h4> <p>I now test the “unpublished result” put forth at the end of <a href="https://www.youtube.com/watch?v=LN0xtUuJsEI&amp;list=PLoRl3Ht4JOcdU872GhiYWf6jwrk_SNhz9&amp;index=41">Hinton et al.’s Coursera Lecture 9c</a>, which states that we can improve upon the performance of an overfitting multi-layer sigmoid net by turning its neurons into binary stochastic neurons with a straight-through estimator.</p> <p>To test the claim, we will need a dataset that is easier to overfit than MNIST, and so the following experiment uses the MNIST validation set for training (10x smaller than the MNIST training set and therefore much easier to overfit). 
The hidden layer size is also increased by a factor of 2 to increase overfitting.</p> <p>We can see below that the stochastic net has a clear advantage in terms of both the generalization gap and training speed, ultimately resulting in a better final fit.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">res1 <span class="op">=</span> train_classifier(hidden_dims <span class="op">=</span> [<span class="dv">200</span>], epochs<span class="op">=</span><span class="dv">20</span>, train_set<span class="op">=</span>mnist.validation, val_set<span class="op">=</span>mnist.test, lr <span class="op">=</span> <span class="fl">0.03</span>, non_binary <span class="op">=</span> <span class="va">True</span>, label <span class="op">=</span> <span class="st">&quot;Deterministic sigmoid net&quot;</span>) res2 <span class="op">=</span> train_classifier(hidden_dims <span class="op">=</span> [<span class="dv">200</span>], epochs<span class="op">=</span><span class="dv">20</span>, stochastic_eval<span class="op">=</span><span class="va">False</span>, train_set<span class="op">=</span>mnist.validation, val_set<span class="op">=</span>mnist.test, slope_annealing_rate<span class="op">=</span><span class="fl">1.1</span>, estimator<span class="op">=</span>StochasticGradientEstimator.ST, lr <span class="op">=</span> <span class="fl">0.3</span>, label <span class="op">=</span> <span class="st">&quot;Binary stochastic net&quot;</span>) plot_n(res1 <span class="op">+</span> res2, lower_y<span class="op">=</span><span class="fl">0.8</span>, title<span class="op">=</span><span class="st">&quot;Experiment 7: Using binary stochastic neurons as a regularizer&quot;</span>)</code></pre></div> <pre><code>Epoch 20 0.9276 Epoch 20 0.941</code></pre> <figure> <img src="https://r2rt.com/static/images/BSN_output_34_1.png" alt="png" /><figcaption>png</figcaption> </figure> <h3 id="conclusion">Conclusion</h3> <p>In this post we introduced, implemented and experimented 
with binary stochastic neurons in Tensorflow. We saw that the biased straight-through estimator generally outperforms the unbiased REINFORCE estimator, and can even outperform a non-stochastic, non-binary sigmoid net. We explored the variants of each estimator, and showed that the slope-annealed straight through estimator is better than other straight through variants, and that it is worth using the variance-adjusted REINFORCE estimator over its non-variance-adjusted counterpart. Finally, we explored the potential use of binary stochastic neurons as regularizers, and demonstrated that a stochastic binary network trained with the slope-annealed straight through estimator trains faster and generalizes better than an ordinary sigmoid net.</p> <section class="footnotes"> <hr /> <ol> <li id="fn1"><p><strong>Note</strong>: In a previous version of this post, I had instead used <span class="math inline">$$\frac{d\text{BSN}(a)}{da} = \text{BSN}(a)$$</span>. This formulation would return the identity if the binary neuron evaluated to 1, and 0 otherwise. I had similarly multiplied the derivatives of the two variants (sigmoid-adjusted and slope-annealed) by a factor of <span class="math inline">$$\text{BSN}(a)$$</span>. This prior formulation was consistent with my reading of Bengio et al. (2013) and achieved respectable results, but in light of a comment made on this post, and upon review of Hinton’s Coursera lecture where the straight-through estimator is first proposed, I believe the version now reflected in the post, <span class="math inline">$$\frac{d\text{BSN}(a)}{da} = 1$$</span>, is more correct. 
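For concreteness, the derivative conventions discussed in this footnote can be written out as a backward-pass rule. The sketch below is my own illustrative numpy helper (not the post's implementation); it shows the local derivative each straight-through variant substitutes for dBSN(a)/da:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def st_gradient(a, grad_out, variant="pass-through", slope=1.0):
    # Backward pass through BSN(a) under the straight-through estimator.
    # pass-through:     treat dBSN/da as 1
    # sigmoid-adjusted: treat dBSN/da as sigmoid'(a)        (slope = 1)
    # slope-annealed:   treat dBSN/da as d/da sigmoid(slope * a)
    if variant == "pass-through":
        local = np.ones_like(a)
    else:
        p = sigmoid(slope * a)
        local = slope * p * (1.0 - p)
    return grad_out * local

a = np.array([0.0, 2.0])
g = np.array([1.0, 1.0])                           # gradient arriving from above
g_pt = st_gradient(a, g)                           # [1. 1.]
g_sig = st_gradient(a, g, "sigmoid-adjusted")      # [0.25, ~0.105]
g_ann = st_gradient(a, g, "slope-annealed", slope=2.0)
```

Annealing the slope upward over training sharpens sigmoid(slope * a) toward the step function, so the sigmoid-adjusted local derivative gradually approaches the hard-threshold limit.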
Although the pass-through variant performs worse with the revised derivative, the sigmoid-adjusted and slope-annealed variants benefit greatly from this change, outperforming both new and old pass-through formulations by a respectable margin.<a href="#fnref1">↩</a></p></li> </ol> </section> </body> </html> Preliminary Note on the Complexity of a Neural Network2016-08-16T00:00:00-04:002016-08-16T00:00:00-04:00Silviu Pitistag:r2rt.com,2016-08-16:/preliminary-note-on-the-complexity-of-a-neural-network.htmlThis post is a preliminary note on the "complexity" of neural networks. It's a topic that has not gotten much attention in the literature, yet is of central importance to the general understanding of neural networks. In this post I discuss complexity and generalization in broad terms, and make the argument that network structure (including parameter counts), the training methodology, and the regularizers used, though each different in concept, all contribute to this notion of neural network "complexity".<!DOCTYPE html> <html> <head> <meta charset="utf-8"> <meta name="generator" content="pandoc"> <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes"> <title></title> <style type="text/css">code{white-space: pre;}</style> <script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-MML-AM_CHTML" type="text/javascript"></script> <!--[if lt IE 9]> <script src="//cdnjs.cloudflare.com/ajax/libs/html5shiv/3.7.3/html5shiv-printshiv.min.js"></script> <![endif]--> </head> <body> <p>This post is a preliminary note on the “complexity” of neural networks. It’s a topic that has not gotten much attention in the literature, yet is of central importance to our general understanding of neural networks. 
In this post I discuss complexity and generalization in broad terms, and make the argument that network structure (including parameter counts), the training methodology, and the regularizers used, though each different in concept, all contribute to this notion of neural network “complexity”.</p> <h3 id="model-complexity-and-selection">Model complexity and selection</h3> <p>In designing neural networks there is an inevitable trade-off between model complexity and generalization performance. On one hand, we want the model to be expressive and have the ability to model highly complex non-linear relationships. On the other hand, the more expressive the model, the more capable it is of memorizing the training data. This is the problem of <strong>model selection</strong> and it is closely related to the “complexity” of the model: pick a model that is too simple and it will “underfit” the data, but pick a model that is too complex and it will “overfit” the data.</p> <p>There is no consistent analytical definition of complexity, but one reasonable way to think about it is the effective number of distinct hypotheses the model could potentially represent after training; i.e., the effective number of patterns that the model is capable of describing. See, e.g., <a href="http://amlbook.com/">Abu-Mostafa et al.</a> or <a href="http://www.pnas.org/content/97/21/11170.long">Myung et al. (2000)</a>.</p> <p>When it comes to neural networks, this notion of complexity and the problem of model selection are delicate subjects. 
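As a toy illustration of the "effective number of distinct hypotheses" (my example, not from the original note): a single linear threshold unit over two binary inputs can realize at most 14 of the 16 possible Boolean functions, no matter how finely we search its three parameters, because XOR and XNOR are not linearly separable:

```python
import itertools
import numpy as np

# Count the distinct Boolean functions of two inputs that a single
# linear threshold unit can realize, searching weights/bias over a grid.
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
grid = np.linspace(-2, 2, 9)  # -2.0, -1.5, ..., 2.0
patterns = set()
for w1, w2, b in itertools.product(grid, repeat=3):
    patterns.add(tuple(int(w1 * x1 + w2 * x2 + b > 0) for x1, x2 in inputs))
print(len(patterns))  # 14: every Boolean function except XOR and XNOR
```

The unit's complexity is thus 14 hypotheses, not the nominal continuum suggested by its three real-valued parameters.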
Whereas certain analytical criteria exist for balancing complexity and generalization in traditional statistical models, these criteria have not been extended in any compelling, generally applicable way to the analysis of neural networks.<a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a> As a result, the primary approach for neural network model selection has not been analytical, but rather empirical: cross-validation.</p> <p>This is unfortunate because it is difficult to compare neural network architectures using only empirical tests. One reason for this is that the effects of different architectural features and hyperparameters on complexity are not necessarily independent, which means that the size of the search grows exponentially in the number of features or hyperparameters (including, e.g., the choice of regularizers). Furthermore, these effects are not necessarily independent of the dataset (e.g., a feature that is ineffective on a small dataset may be very effective on a large one), and so results may not generalize across tasks. As a single state-of-the-art model can take hours, days or even weeks to train on modern hardware, this spells bad news for finding the “best” model. Cf., e.g., <a href="https://arxiv.org/abs/1412.3555">Chung et al. (2014)</a> (conducts an empirical comparison of the GRU and LSTM and is unable to determine whether one architecture is superior to the other).</p> <p>This difficulty is not restricted to comparing entire architectures: the fact that adding or removing a feature changes the model complexity makes it difficult to evaluate the efficacy of that single feature. This generally affects the strength of results favoring proposed features in the literature.</p> <p>But not all hope is lost. In some cases, we can put forth theoretical grounds for the superiority of one architecture over another. 
See, e.g., my post on <a href="https://r2rt.com/written-memories-understanding-deriving-and-extending-the-lstm.html">Written Memories</a>, which argues that the proposed pseudo LSTM is objectively better than the basic LSTM. In other cases, an architecture or feature overwhelmingly outperforms the competition in a wide range of configurations. In these cases, we can adopt the position that one architecture is superior to another, while keeping in the back of our minds the possibility that our conclusion may not generalize to all cases. Sometimes (and ideally), a proposed feature has both strong theoretical grounds and undeniable empirical performance. See, e.g., <a href="https://arxiv.org/abs/1502.03167">Ioffe and Szegedy (2015)</a>, which introduced batch normalization for feedforward neural networks.</p> <p>Yet in most cases, we are stuck with many alternatives and little guidance as to how to choose between them, or as to whether such choices even matter. If we had some broadly applicable and theoretically grounded measure of model complexity at our disposal, it would be incredibly useful for making such determinations. I am planning a follow-up post to explore this problem in more detail. For this preliminary note, however, I want to make a few comments on the factors that contribute to model complexity, which include the network structure, the training methodology and the regularization methods used.</p> <h4 id="network-structure">Network structure</h4> <p>It’s quite common for authors to use the “number of parameters” as a proxy for model complexity to support the inference that one architecture is better than another (with fewer parameters being better). See, e.g., <a href="https://arxiv.org/abs/1607.03474">Zilly et al. (2016)</a> (“[Our model] outperforms [the competing model] using less than half the parameters…”), <a href="https://arxiv.org/abs/1508.06615">Kim et al. 
(2015)</a> (“[Our model] is on par with the existing state-of-the-art despite having 60% fewer parameters.”), and <a href="https://arxiv.org/abs/1606.06630">Wu et al. (2016)</a> (“… the number of parameters [is] largely reduced, which makes our model more practical for large scale problems, while <em>avoiding overfitting</em>.” (<em>emphasis added</em>)), among many others. I sympathize with the approach, if only because there exist no established criteria for measuring the parametric complexity of neural networks, but using simple parameter counts for this purpose is misleading.<a href="#fn2" class="footnoteRef" id="fnref2"><sup>2</sup></a></p> <p>In particular, different models with the same number of parameters may have quite different complexities as a result of the model’s functional form. This is true for all models, not just neural networks. As a trivial example, consider that the double biases in the model <span class="math inline">$$\alpha_0 x + \alpha_1 + \alpha_2$$</span> are really one parameter masquerading as two. See <a href="http://www.pnas.org/content/97/21/11170.long">Myung et al. (2000)</a> for a non-trivial example and a more detailed discussion of the “geometric complexity”. As compared to traditional statistical models, however, neural networks are particularly susceptible to this kind of structural complexity; the number of hidden layers, the connectivity between them, and the activation functions used are just a few examples of structural features that impact network complexity. That said, if the network structure of two models is similar (e.g., same number of layers and same activation functions), the complexity of the model will generally be a monotonically increasing function of the number of parameters.</p> <h4 id="training-method">Training method</h4> <p>It might be surprising that the training method (e.g., the combination of initialization method, loss function and optimization algorithm) can be viewed as a factor in model complexity. 
To see this, consider that the complexity of a model can be interpreted as the effective number of patterns the model can represent after training, but that even if some choice of parameters could represent pattern A, pattern A may not be reachable by gradient descent (or, more likely, pattern A could be very difficult to reach relative to pattern B). Here is an illustration:</p> <figure> <img src="https://r2rt.com/static/images/RR_unreachable_pattern.png" alt="Unreachable pattern" /><figcaption>Unreachable pattern</figcaption> </figure> <p>Note that this illustration is oversimplified, in that we would have to consider the “reachableness” of pattern A under all potential datasets, not just this one dataset.</p> <p>This is an interesting perspective that has a slightly different flavor than criteria such as Rissanen’s <a href="https://en.wikipedia.org/wiki/Minimum_description_length">minimum description length</a>. Taking this broad view of complexity, we see that the learning rate, which is the first hyperparameter we tune when training neural networks, directly impacts the model’s complexity. E.g., a high learning rate will bias our model away from local minima that are located in narrow valleys. This same argument applies to batch size and truncated backpropagation steps.</p> <h4 id="regularization">Regularization</h4> <p>Finally, somewhere in between structural features and the training method lie regularizers. Consider that dropout can be viewed as a modification to the training method (since the model’s architecture is unchanged at test time) or as an ensemble method (see, e.g., <a href="https://papers.nips.cc/paper/4878-understanding-dropout">Baldi and Sadowski (2013)</a>). Similarly, weight decay might be viewed as a direct modification to the loss function that modifies the training method, or as a structural feature. 
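The loss-function reading can be made concrete: under plain gradient descent, one step on a loss penalized by an L2 term lam * w**2 is identical to multiplicatively shrinking the weight toward zero and then stepping on the task loss alone. A small sketch with a hypothetical one-parameter loss:

```python
# One SGD step on the penalized loss L(w) + lam * w**2 equals shrinking w
# multiplicatively (a "soft cap") and then stepping on L(w) alone:
#   w - lr * (L'(w) + 2*lam*w)  ==  w * (1 - 2*lr*lam) - lr * L'(w)
def grad_L(w):
    # gradient of a hypothetical task loss L(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w, lr, lam = 5.0, 0.1, 0.05
penalized_step = w - lr * (grad_L(w) + 2.0 * lam * w)
decay_then_step = w * (1.0 - 2.0 * lr * lam) - lr * grad_L(w)
print(penalized_step, decay_then_step)
```

Because the shrink factor applies at every step, large weights are continually pulled back toward zero rather than clipped at a fixed boundary.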
To see how weight decay can be interpreted as a structural feature, compare it to a hard “cap” on the parameters, say a restriction to the interval [-0.5, 0.5]. This is a structural change that impacts the number of representable hypotheses, but the function of weight decay is very similar, only that it serves as a soft cap.</p> <h3 id="future-work">Future Work</h3> <p>This topic has been on my mind recently as I wrestle with the question of how to justify an argument that one architecture is better than another. I’m currently working toward a couple of preliminary ideas for an analytical approach to neural network complexity, which I will write about in case they pan out. In the meantime, if you are aware of any research in this area, I would greatly appreciate it if you could share it in the comments!</p> <section class="footnotes"> <hr /> <ol> <li id="fn1"><p>There have, however, been specific applications of these analytical criteria. See, e.g., <a href="http://www.cs.toronto.edu/~fritz/absps/colt93.pdf">Hinton and van Camp (1993)</a>, which shows how weight decay and weight noise can be justified using the minimum description length (MDL) criteria.<a href="#fnref1">↩</a></p></li> <li id="fn2"><p>It’s true that models with fewer parameters will be smaller in size (memory footprint / space), which could be advantageous for large models. The intent of pointing out a lower parameter count, however, is usually to signify that the model is less complex and therefore less prone to overfitting.<a href="#fnref2">↩</a></p></li> </ol> </section> </body> </html> Written Memories: Understanding, Deriving and Extending the LSTM2016-07-26T00:00:00-04:002016-07-26T00:00:00-04:00Silviu Pitistag:r2rt.com,2016-07-26:/written-memories-understanding-deriving-and-extending-the-lstm.htmlWhen I was first introduced to Long Short-Term Memory networks (LSTMs), it was hard to look past their complexity. 
I didn't understand why they were designed the way they were designed, just that they worked. It turns out that LSTMs can be understood, and that, despite their superficial complexity, LSTMs are actually based on a couple incredibly simple, even beautiful, insights into neural networks. This post is what I wish I had when first learning about recurrent neural networks (RNNs).<!DOCTYPE html> <html> <head> <meta charset="utf-8"> <meta name="generator" content="pandoc"> <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes"> <title></title> <style type="text/css">code{white-space: pre;}</style> <script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-MML-AM_CHTML" type="text/javascript"></script> <!--[if lt IE 9]> <script src="//cdnjs.cloudflare.com/ajax/libs/html5shiv/3.7.3/html5shiv-printshiv.min.js"></script> <![endif]--> </head> <body> <p>When I was first introduced to Long Short-Term Memory networks (LSTMs), it was hard to look past their complexity. I didn’t understand why they were designed the way they were designed, just that they worked. It turns out that LSTMs can be understood, and that, despite their superficial complexity, LSTMs are actually based on a couple incredibly simple, even beautiful, insights into neural networks. This post is what I wish I had when first learning about recurrent neural networks (RNNs).</p> <p>In this post, we do a few things:</p> <ol type="1"> <li>We’ll define and describe RNNs generally, focusing on the limitations of vanilla RNNs that led to the development of the LSTM.</li> <li>We’ll describe the intuitions behind the LSTM architecture, which will enable us to build up to and derive the LSTM. Along the way we will derive the GRU. 
We’ll also derive a pseudo LSTM, which we’ll see is better in principle and in performance than the standard LSTM.</li> <li>We’ll then extend these intuitions to show how they lead directly to a few recent and exciting architectures: highway and residual networks, and Neural Turing Machines.</li> </ol> <p>This is a post about theory, not implementations. For how to implement RNNs using Tensorflow, check out my posts <a href="https://r2rt.com/recurrent-neural-networks-in-tensorflow-i.html">Recurrent Neural Networks in Tensorflow I</a> and <a href="https://r2rt.com/recurrent-neural-networks-in-tensorflow-ii.html">Recurrent Neural Networks in Tensorflow II</a>.</p> <h4 id="contents-quick-links">Contents / quick links:</h4> <ul> <li><a href="#recurrent-neural-networks">Recurrent neural networks</a></li> <li><a href="#what-rnns-can-do-choosing-the-time-step">What RNNs can do; choosing the time step</a></li> <li><a href="#the-vanilla-rnn">The vanilla RNN</a></li> <li><a href="#information-morphing-and-vanishing-and-exploding-sensitivity">Information morphing and vanishing and exploding sensitivity</a></li> <li><a href="#a-mathematically-sufficient-condition-for-vanishing-sensitivity">A mathematically sufficient condition for vanishing sensitivity</a></li> <li><a href="#a-minimum-weight-initialization-for-avoid-vanishing-gradients">A minimum weight initialization for avoiding vanishing gradients</a></li> <li><a href="#backpropagation-through-time-and-vanishing-sensitivity">Backpropagation through time and vanishing sensitivity</a></li> <li><a href="#dealing-with-vanishing-and-exploding-gradients">Dealing with vanishing and exploding gradients</a></li> <li><a href="#written-memories-the-intuition-behind-lstms">Written memories: the intuition behind LSTMs</a></li> <li><a href="#using-selectivity-to-control-and-coordinate-writing">Using selectivity to control and coordinate writing</a></li> <li><a href="#gates-as-a-mechanism-for-selectivity">Gates as a mechanism for 
selectivity</a></li> <li><a href="#gluing-gates-together-to-derive-a-prototype-lstm">Gluing gates together to derive a prototype LSTM</a></li> <li><a href="#three-working-models-the-normalized-prototype-the-gru-and-the-pseudo-lstm">Three working models: the normalized prototype, the GRU and the pseudo LSTM</a></li> <li><a href="#deriving-the-lstm">Deriving the LSTM</a></li> <li><a href="#the-lstm-with-peepholes">The LSTM with peepholes</a></li> <li><a href="#an-empirical-comparison-of-the-basic-lstm-and-the-pseudo-lstm">An empirical comparison of the basic LSTM and the pseudo LSTM</a></li> <li><a href="#extending-the-lstm">Extending the LSTM</a></li> </ul> <h4 id="prerequisites">Prerequisites<a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a></h4> <p>This post assumes the reader is already familiar with:</p> <ol type="1"> <li>Feedforward neural networks</li> <li>Backpropagation</li> <li>Basic linear algebra</li> </ol> <p>We’ll review everything else, starting with RNNs in general.</p> <h3 id="recurrent-neural-networks">Recurrent neural networks</h3> <p>From one moment to the next, our brain operates as a function: it accepts inputs from our senses (external) and our thoughts (internal) and produces outputs in the form of actions (external) and new thoughts (internal). We see a bear and then think “bear”. We can model this behavior with a feedforward neural network: we can teach a feedforward neural network to think “bear” when it is shown an image of a bear.</p> <p>But our brain is not a one-shot function. It runs repeatedly through time. We see a bear, then think “bear”, then think “run”.<a href="#fn2" class="footnoteRef" id="fnref2"><sup>2</sup></a> Importantly, the very same function that transforms the image of a bear into the thought “bear” also transforms the thought “bear” into the thought “run”. 
It is a <em>recurring</em> function, which we can model with a <em>recurrent</em> neural network (RNN).</p> <p>An RNN is a composition of identical feedforward neural networks, one for each moment, or step in time, which we will refer to as “RNN cells”. Note that this is a much broader definition of an RNN than that usually given (the “vanilla” RNN is covered later on as a precursor to the LSTM). These cells operate on their own output, allowing them to be composed. They can also operate on external input and produce external output. Here is a diagram of a single RNN cell:</p> <figure> <img src="https://r2rt.com/static/images/NH_SingleRNNcell.png" alt="Single RNN Cell" /><figcaption>Single RNN Cell</figcaption> </figure> <p>Here is a diagram of three composed RNN cells:</p> <figure> <img src="https://r2rt.com/static/images/NH_ComposedRNNcells.png" alt="Composed RNN Cells" /><figcaption>Composed RNN Cells</figcaption> </figure> <p>You can think of the recurrent outputs as a “state” that is passed to the next timestep. Thus an RNN cell accepts a prior state and an (optional) current input and produces a current state and an (optional) current output.</p> <p>Here is the algebraic description of the RNN cell:</p> <p><span class="math display">$\left(\begin{matrix} s_t \\ o_t \\ \end{matrix}\right) = f\left(\begin{matrix} s_{t-1} \\ x_t \\ \end{matrix}\right)$</span></p> <p>where:</p> <ul> <li><span class="math inline">$$s_t$$</span> and <span class="math inline">$$s_{t-1}$$</span> are our current and prior states,</li> <li><span class="math inline">$$o_t$$</span> is our (possibly empty) current output,</li> <li><span class="math inline">$$x_t$$</span> is our (possibly empty) current input, and</li> <li><span class="math inline">$$f$$</span> is our recurrent function.</li> </ul> <p>Our brain operates in place: current neural activity takes the place of past neural activity. 
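</p>
<p>To make the cell-as-function view concrete, here is a minimal numpy sketch. The particular <code>f</code> used below (a tanh of the summed state and input) is an arbitrary placeholder for whatever network the cell actually contains, not a trained model:</p>

```python
import numpy as np

def rnn_cell(s_prev, x):
    """One RNN cell: maps (prior state, current input) to
    (current state, current output)."""
    s_t = np.tanh(s_prev + x)  # placeholder f; a real cell is a trained network
    o_t = s_t                  # here the output is just the state itself
    return s_t, o_t

# composing identical cells, one per time step, runs the RNN
s = np.zeros(4)                                    # initial state
for x_t in [np.ones(4), -np.ones(4), np.ones(4)]:  # a short input sequence
    s, o = rnn_cell(s, x_t)
```

<p>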
We can see RNNs as operating in place as well: because RNN cells are identical, they can all be viewed as the same object, with the “state” of the RNN cell being overwritten at each time step. Here is a diagram of this framing:</p> <figure> <img src="https://r2rt.com/static/images/NH_StateLoop.png" alt="RNN State Loop" /><figcaption>RNN State Loop</figcaption> </figure> <p>Most introductions to RNNs start with this “single cell loop” framing, but I think you’ll find the sequential frame more intuitive, particularly when thinking about backpropagation. When starting with the single cell loop framing, RNNs are said to be “unrolled” to obtain the sequential framing above.</p> <h3 id="what-rnns-can-do-choosing-the-time-step">What RNNs can do; choosing the time step</h3> <p>The RNN structure described above is incredibly general. In theory, it can do anything: if we give the neural network inside each cell at least one hidden layer, each cell becomes a universal function approximator.<a href="#fn3" class="footnoteRef" id="fnref3"><sup>3</sup></a> This means that an RNN cell can emulate any function, from which it follows that an RNN could, in theory, emulate our brain perfectly. Though we know that the brain can theoretically be modeled this way, it’s an entirely different matter to actually design and train an RNN to do this. We are, however, making good progress.</p> <p>With this analogy of the brain in mind, all we need to do to see how we can use an RNN to handle a task is to ask how a human would handle the same task.</p> <p>Consider, for example, English-to-French translation. A human reads an English sentence (“the cat sat on the mat”), pauses, and then writes out the French translation (“le chat s’assit sur le tapis”).
To emulate this behavior with an RNN, the only choice we have to make (other than designing the RNN cell itself, which for now we treat as a black box) is deciding what the time steps used should be, which determines the form of the inputs and outputs, or how the RNN interacts with the external world.</p> <p>One option is to set the time step according to the content. That is, we might use the entire sentence as a time step, in which case our RNN is just a feed-forward network:</p> <figure> <img src="https://r2rt.com/static/images/NH_SentenceTimeStep.png" alt="Translation using sentence-based time step" /><figcaption>Translation using sentence-based time step</figcaption> </figure> <p>The final state does not matter when translating a single sentence. It might matter, however, if the sentence were part of a paragraph being translated, since it would contain information about the prior sentences. Note that the initial state is indicated above as blank, but when evaluating individual sequences, it can be useful to train the initial state as a variable. It may be that the best “a sequence is starting” state representation is not the blank zero state.</p> <p>Alternatively, we might say that each word or each character is a time step. Here is an illustration of what an RNN translating “the cat sat” on a per word basis might look like:</p> <figure> <img src="https://r2rt.com/static/images/NH_WordTimeStep.png" alt="Translation using word-based time step" /><figcaption>Translation using word-based time step</figcaption> </figure> <p>After the first time step, the state contains an internal representation of “the”; after the second, of “the cat”; after the third, “the cat sat”. The network does not produce any outputs at the first three time steps. It starts producing outputs when it receives a blank input, at which point it knows the input has terminated.
When it is done producing outputs, it produces a blank output to signal that it’s finished.</p> <p>In practice, even powerful RNN architectures like deep LSTMs might not perform well on multiple tasks (here there are two: reading, then translating). To accommodate this, we can split the network into multiple RNNs, each of which specializes in one task. In this example, we would use an “encoder” network that reads in the English (blue) and a separate “decoder” network that reads in the French (orange):</p> <figure> <img src="https://r2rt.com/static/images/NH_WordTimeStep_SeparateRNNs.png" alt="Translation using word-based time step and two RNNs" /><figcaption>Translation using word-based time step and two RNNs</figcaption> </figure> <p>Additionally, as shown in the above diagram, the decoder network is fed the last true value (i.e., the target value during training, and the network’s prior choice of translated word during testing). For an example of an RNN encoder-decoder model, see <a href="https://arxiv.org/pdf/1406.1078v3.pdf">Cho et al. (2014)</a>.</p> <p>Notice that having two separate networks still fits the definition of a single RNN: we can define the recurring function as a split function that takes, alongside its other inputs, an input specifying which split of the function to use.</p> <p>The time step does not have to be content-based; it can be an actual unit of time. For example, we might consider the time step to be one second, and enforce a reading rate of 5 characters per second. The inputs for the first three time steps would be <code>the c</code>, <code>at sa</code> and <code>t on</code>.</p> <p>We could also do something more interesting: we can let the RNN decide when it’s ready to move on to the next input, and even what that input should be. This is similar to how a human might focus on certain words or phrases for an extended period of time to translate them or might double back through the source.
To do this, we use the RNN’s output (an external action) to determine its next input dynamically. For example, we might have the RNN output actions like “read the last input again”, “backtrack 5 timesteps of input”, etc. Successful attention-based translation models are a play on this: they accept the entire English sequence at each time step and their RNN cell decides which parts are most relevant to the current French word they are producing.</p> <p>There is nothing special about this English-to-French translation example. Whatever human task we choose, we can build different RNN models by choosing different time steps. We can even reframe something like handwritten digit recognition, for which a one-shot function (single time step) is the typical approach, as a many-time step task. Indeed, take a look at some of the MNIST digits yourself and observe how you need to focus on some longer than others. Feedforward neural networks cannot exhibit that behavior; RNNs can.</p> <h3 id="the-vanilla-rnn">The vanilla RNN</h3> <p>Now that we’ve covered the big picture, let’s take a look inside the RNN cell. The most basic RNN cell is a single-layer neural network, the output of which is used as both the RNN cell’s current (external) output and the RNN cell’s current state:</p> <figure> <img src="https://r2rt.com/static/images/NH_VanillaRNNcell.png" alt="Vanilla RNN Cell" /><figcaption>Vanilla RNN Cell</figcaption> </figure> <p>Note how the prior state vector is the same size as the current state vector. As discussed above, this is critical for composition of RNN cells.
Here is the algebraic description of the vanilla RNN cell:</p> <p><span class="math display">$s_t = \phi(Ws_{t-1} + Ux_t + b)$</span></p> <p>where:</p> <ul> <li><span class="math inline">$$\phi$$</span> is the activation function (e.g., sigmoid, tanh, ReLU),</li> <li><span class="math inline">$$s_t \in \Bbb{R}^n$$</span> is the current state (and current output),</li> <li><span class="math inline">$$s_{t-1} \in \Bbb{R}^n$$</span> is the prior state,</li> <li><span class="math inline">$$x_t \in \Bbb{R}^m$$</span> is the current input,</li> <li><span class="math inline">$$W \in \Bbb{R}^{n \times n}$$</span>, <span class="math inline">$$U \in \Bbb{R}^{n \times m}$$</span>, and <span class="math inline">$$b \in \Bbb{R}^n$$</span> are the weights and biases, and</li> <li><span class="math inline">$$n$$</span> and <span class="math inline">$$m$$</span> are the state and input sizes.</li> </ul> <p>Even this basic RNN cell is quite powerful. Though it does not meet the criteria for universal function approximation within a single cell, it is known that a series of composed vanilla RNN cells is Turing complete and can therefore implement any algorithm. See <a href="http://binds.cs.umass.edu/papers/1995_Siegelmann_JComSysSci.pdf">Siegelmann and Sontag (1992)</a>. This is nice, in theory, but there is a problem in practice: training vanilla RNNs with the backpropagation algorithm turns out to be quite difficult, even more so than training very deep feedforward neural networks.
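</p>
<p>This equation transcribes directly into numpy. In the sketch below the weights are arbitrary random placeholders (small, untrained values chosen only so the example runs):</p>

```python
import numpy as np

n, m = 5, 3                         # state size, input size
rng = np.random.RandomState(0)
W = rng.uniform(-0.5, 0.5, (n, n))  # state-to-state weights
U = rng.uniform(-0.5, 0.5, (n, m))  # input-to-state weights
b = np.zeros(n)                     # bias

def vanilla_rnn_cell(s_prev, x):
    # s_t = phi(W s_{t-1} + U x_t + b), with phi = tanh
    return np.tanh(W @ s_prev + U @ x + b)

s = np.zeros(n)                     # initial state
for t in range(4):                  # run the cell over a short sequence
    s = vanilla_rnn_cell(s, np.ones(m))
```

<p>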
This difficulty is due to the problems of information morphing and vanishing and exploding sensitivity caused by repeated application of the same nonlinear function.</p> <h3 id="information-morphing-and-vanishing-and-exploding-sensitivity">Information morphing and vanishing and exploding sensitivity<a href="#fn4" class="footnoteRef" id="fnref4"><sup>4</sup></a></h3> <p>Instead of the brain, consider modeling the entire world as an RNN: from each moment to the next, the state of the world is modified by a fantastically complex recurring function called time. Now consider how a small change today will affect the world in one hundred years. It could be that something as small as the flutter of a butterfly’s wing will ultimately cause a typhoon halfway around the world.<a href="#fn5" class="footnoteRef" id="fnref5"><sup>5</sup></a> But it could also be that our actions today ultimately do not matter. So Einstein wasn’t around to discover relativity? This would have made a difference in the 1950s, but maybe then someone else discovers relativity, so that the difference becomes smaller by the 2000s, and ultimately approaches zero by the year 2050. Finally, it could be that the importance of a small change fluctuates: perhaps Einstein’s discovery was in fact caused by a comment his wife made in response to a butterfly that happened to flutter by, so that the butterfly exploded into a big change during the 20th century that then quickly vanished.</p> <p>In the Einstein example, note that the past change is the introduction of new information (the theory of relativity), and more generally that the introduction of this new information was a direct result of our recurring function (the flow of time).
Thus, we can consider information itself as a change that is morphed by the recurring function such that its effects vanish, explode or simply fluctuate.</p> <p>This discussion shows that the state of the world (or an RNN) is constantly changing and that the present can be either extremely sensitive or extremely insensitive to past changes: effects can compound or dissolve. These are problems, and they extend to RNNs (and feedforward neural networks) in general:</p> <ol type="1"> <li><p>Information Morphing</p> <p>First, if information constantly morphs, it is difficult to exploit past information properly when we need it. The best usable state of the information may have occurred at some point in the past. On top of learning how to exploit the information today (if it were around in its original, usable form), we must also learn how to decode the original state from the current state, if that is even possible. This leads to difficult learning and poor results.<a href="#fn6" class="footnoteRef" id="fnref6"><sup>6</sup></a></p> <p>It’s very easy to show that information morphing occurs in a vanilla RNN. Indeed, suppose it were possible for an RNN cell to maintain its prior state completely in the absence of external inputs. Then <span class="math inline">$$F(s_{t-1}) = \phi(Ws_{t-1} + b)$$</span> is the identity function with respect to <span class="math inline">$$s_{t-1}$$</span>. But the identity function is linear and <span class="math inline">$$F$$</span> is nonlinear, so we have a contradiction. Therefore, an RNN cell inevitably morphs the state from one time step to the next. Even the trivial task of outputting <span class="math inline">$$s_t = x_t$$</span> is impossible for a vanilla RNN.</p> <p>This is the root cause of what is known in some circles as the <em>degradation</em> problem. See, e.g., <a href="https://arxiv.org/abs/1512.03385">He et al. (2015)</a>. The authors of He et al.
claim this is “unexpected” and “counterintuitive”, but I hope this discussion shows that the degradation problem, or information morphing, is actually quite natural (and in many cases desirable). We’ll see below that although information morphing was not among the original motivations for introducing LSTMs, the principle behind LSTMs happens to solve the problem effectively. In fact, the effectiveness of the residual networks used by He et al. (2015) is a result of the fundamental principle of LSTMs.</p></li> <li><p>Vanishing and Exploding Gradients</p> <p>Second, we train RNNs using the backpropagation algorithm. But backpropagation is a gradient-based algorithm, and vanishing and exploding “sensitivity” is just another way of saying vanishing and exploding gradients (the latter is the accepted term, but I find the former more descriptive). If the gradients explode, we can’t train our model. If they vanish, it’s difficult for us to learn long-term dependencies, since backpropagation will be too sensitive to recent distractions. This makes training difficult.</p> <p>I’ll come back to the difficulty of training RNNs via backpropagation in a second, but first I’d like to give a short mathematical demonstration of how easy it is for the vanilla RNN to suffer from vanishing gradients and what we can do to help avoid this at the start of training.</p></li> </ol> <h3 id="a-mathematically-sufficient-condition-for-vanishing-sensitivity">A mathematically sufficient condition for vanishing sensitivity</h3> <p>In this section I give a mathematical proof of a sufficient condition for vanishing sensitivity in vanilla RNNs. This section is a bit mathy, and you can safely skip the details of the proof. It is essentially the same as the proof of the similar result in <a href="http://www.jmlr.org/proceedings/papers/v28/pascanu13.pdf">Pascanu et al. (2013)</a>, but I think you will find this presentation easier to follow.
The proof here also takes advantage of the mean value theorem to go one step further than Pascanu et al. and reach a slightly stronger result, effectively showing vanishing <em>causation</em> rather than vanishing sensitivity.<a href="#fn7" class="footnoteRef" id="fnref7"><sup>7</sup></a> Note that mathematical analyses of vanishing and exploding gradients date back to the early 1990s, in <a href="http://www.dsi.unifi.it/~paolo/ps/tnn-94-gradient.pdf">Bengio et al. (1994)</a> and <a href="http://people.idsia.ch/~juergen/SeppHochreiter1991ThesisAdvisorSchmidhuber.pdf">Hochreiter (1991)</a> (original in German, relevant portions summarized in <a href="http://isle.illinois.edu/sst/meetings/2015/hochreiter-lstm.pdf">Hochreiter and Schmidhuber (1997)</a>).</p> <p>Let <span class="math inline">$$s_t$$</span> be our state vector at time <span class="math inline">$$t$$</span> and let <span class="math inline">$$\Delta v$$</span> be the change in a vector <span class="math inline">$$v$$</span> induced by a change in the state vector, <span class="math inline">$$\Delta s_t$$</span>, at time <span class="math inline">$$t$$</span>. Our objective is to provide a mathematically sufficient condition so that the change in state at time step <span class="math inline">$$t + k$$</span> caused by a change in state at time step <span class="math inline">$$t$$</span> vanishes as <span class="math inline">$$k \to \infty$$</span>; i.e., we will prove a sufficient condition for:</p> <p><span class="math display">$\lim_{k \to \infty}\frac{\Delta s_{t+k}}{\Delta s_t} = 0.$</span></p> <p>By contrast, Pascanu et al.
(2013) proved the same sufficient condition for the following result, which can easily be extended to obtain the above:</p> <p><span class="math display">$\lim_{k \to \infty}\frac{\partial s_{t+k}}{\partial s_t} = 0.$</span></p> <p>To begin, from our definition of a vanilla RNN cell, we have:</p> <p><span class="math display">$s_{t+1} = \phi(z_t) \hspace{30px} \text{where} \hspace{30px} z_t = Ws_{t} + Ux_{t+1} + b.$</span></p> <p>Applying the mean value theorem in several variables, we get that there exists <span class="math inline">$$c \in [z_t,\ z_t + \Delta z_t]$$</span> such that:</p> <p><span class="math display">$\begin{split} \Delta s_{t+1} &amp; = [\phi&#39;(c)] \Delta z_t\\ &amp; = [\phi&#39;(c)]\Delta(W s_t)\\ &amp; = [\phi&#39;(c)]W\Delta s_t.\\ \end{split}$</span></p> <p>Now let <span class="math inline">$$\Vert A \Vert$$</span> represent the matrix 2-norm, <span class="math inline">$$\rvert v\rvert$$</span> the Euclidean vector norm, and define:</p> <p><span class="math display">$\gamma = \sup_{c \in [z_t,\ z_t + \Delta z_t]}\Vert [\phi&#39;(c)] \Vert$</span></p> <p>Note that for the logistic sigmoid, <span class="math inline">$$\gamma \leq \frac{1}{4}$$</span>, and for tanh, <span class="math inline">$$\gamma \leq 1$$</span>.<a href="#fn8" class="footnoteRef" id="fnref8"><sup>8</sup></a></p> <p>Taking the vector norm of each side, we obtain the following, where the first inequality comes from the definition of the 2-norm (applied twice), and the second from the definition of the supremum:</p> <p><span class="math display">$\begin{equation} \begin{split} \rvert\Delta s_{t+1}\rvert &amp; = \rvert[\phi&#39;(c)]W\Delta s_t\rvert\\ &amp; \leq \Vert [\phi&#39;(c)] \Vert \Vert W \Vert \rvert\Delta s_{t}\rvert\\ &amp; \leq \gamma \Vert W \Vert \rvert\Delta s_{t}\rvert\\ &amp; = \Vert \gamma W \Vert \rvert\Delta s_{t}\rvert.
\end{split} \end{equation}$</span></p> <p>By expanding this formula over <span class="math inline">$$k$$</span> time steps we get <span class="math inline">$$\rvert\Delta s_{t+k}\rvert \leq \Vert \gamma W \Vert^k \rvert\Delta s_{t}\rvert$$</span> so that:</p> <p><span class="math display">$\frac{\rvert\Delta s_{t+k}\rvert}{\rvert\Delta s_t\rvert} \leq \Vert \gamma W \Vert^k.$</span></p> <p>Therefore, if <span class="math inline">$$\Vert \gamma W \Vert &lt; 1$$</span>, then <span class="math inline">$$\frac{\rvert\Delta s_{t+k}\rvert}{\rvert\Delta s_t\rvert}$$</span> decreases exponentially in <span class="math inline">$$k$$</span>, and we have proven a sufficient condition for:</p> <p><span class="math display">$\lim_{k \to \infty}\frac{\Delta s_{t+k}}{\Delta s_t} = 0.$</span></p> <p>When will <span class="math inline">$$\Vert \gamma W \Vert &lt; 1$$</span>? <span class="math inline">$$\gamma$$</span> is bounded above by <span class="math inline">$$\frac{1}{4}$$</span> for the logistic sigmoid and by 1 for tanh, which tells us that the sufficient condition for vanishing gradients is for <span class="math inline">$$\Vert W \Vert$$</span> to be less than 4 or 1, respectively.</p> <p>An immediate lesson from this is that if our weight initializations for <span class="math inline">$$W$$</span> are too small, our RNN may be unable to learn anything right off the bat, due to vanishing gradients. Let’s now extend this analysis to determine a desirable weight initialization.</p> <h3 id="a-minimum-weight-initialization-for-avoid-vanishing-gradients">A minimum weight initialization to avoid vanishing gradients</h3> <p>It is beneficial to find a weight initialization that will not immediately suffer from this problem.
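</p>
<p>Before deriving that initialization, it helps to see the bound in action. The sketch below (a tanh cell with no external inputs and a deliberately small random <code>W</code>, so that <span class="math inline">$$\Vert \gamma W \Vert &lt; 1$$</span>) runs two copies of the same RNN from states that differ by a tiny perturbation and tracks how quickly the difference decays; the sizes and constants are arbitrary:</p>

```python
import numpy as np

n = 10
rng = np.random.RandomState(0)
W = rng.uniform(-0.05, 0.05, (n, n))       # deliberately small: ||W|| < 1
s = rng.uniform(-1, 1, n)                  # some state
s_pert = s + 1e-3 * rng.uniform(-1, 1, n)  # the same state, slightly perturbed

diffs = []
for k in range(20):        # run both trajectories forward
    s = np.tanh(W @ s)     # no external inputs, for simplicity
    s_pert = np.tanh(W @ s_pert)
    diffs.append(np.linalg.norm(s_pert - s))

# the perturbation decays (roughly) geometrically, as the bound predicts
```

<p>Scaling <code>W</code> up past the bound makes the decay slow down or disappear; with a large enough <code>W</code>, the difference can instead grow.</p>
<p>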
Extending the above analysis to find the initialization of <span class="math inline">$$W$$</span> that gets us as close to equality as possible leads to a nice result.</p> <p>First, let us assume that <span class="math inline">$$\phi = \tanh$$</span> and take <span class="math inline">$$\gamma = 1$$</span>,<a href="#fn9" class="footnoteRef" id="fnref9"><sup>9</sup></a> but you could just as easily assume that <span class="math inline">$$\phi = \sigma$$</span> and take <span class="math inline">$$\gamma = \frac{1}{4}$$</span> to reach a different result.</p> <p>Our goal is to find an initialization of W for which:</p> <ol type="1"> <li><span class="math inline">$$\Vert \gamma W \Vert = 1$$</span>.</li> <li>We get as close to equality as possible in equation (1).</li> </ol> <p>From point 1, since we took <span class="math inline">$$\gamma$$</span> to be 1, we have <span class="math inline">$$\Vert W \Vert = 1$$</span>. From point 2, we get that we should try to set all singular values of <span class="math inline">$$W$$</span> to 1, not just the largest. Then, if all singular values of <span class="math inline">$$W$$</span> equal 1, that means that the norm of each column of <span class="math inline">$$W$$</span> is 1 (since each column is <span class="math inline">$$We_i$$</span> for some elementary basis vector <span class="math inline">$$e_i$$</span> and we have <span class="math inline">$$\rvert We_i\rvert = \rvert e_i\rvert = 1$$</span>). 
That means that for column <span class="math inline">$$j$$</span> we have:</p> <p><span class="math display">$\Sigma_{i}w_{ij}^2 = 1$</span></p> <p>There are <span class="math inline">$$n$$</span> entries in column <span class="math inline">$$j$$</span>, and we are choosing each from the same random distribution, so let us find a distribution for a random weight <span class="math inline">$$w$$</span> for which:</p> <p><span class="math display">$n\mathbb{E}(w^2) = 1$</span></p> <p>Now let’s suppose we want to initialize <span class="math inline">$$w$$</span> uniformly in the interval <span class="math inline">$$[-R,\ R]$$</span>. Then the mean of <span class="math inline">$$w$$</span> is 0, so that, by definition, <span class="math inline">$$\mathbb{E}(w^2)$$</span> is its variance, <span class="math inline">$$\mathbb{V}(w)$$</span>. The variance of a uniform distribution over the interval <span class="math inline">$$[a,\ b]$$</span> is given by <span class="math inline">$$\frac{(b-a)^2}{12}$$</span>, from which we get <span class="math inline">$$\mathbb{V}(w) = \frac{R^2}{3}$$</span>. Substituting this into our equation we get:</p> <p><span class="math display">$n\frac{R^2}{3} = 1$</span></p> <p>So that:</p> <p><span class="math display">$R = \frac{\sqrt{3}}{\sqrt{n}}$</span></p> <p>This suggests that we initialize our weights from the uniform distribution over the interval: <span class="math display">$\bigg[ -\frac{\sqrt{3}}{\sqrt{n}},\ \frac{\sqrt{3}}{\sqrt{n}}\bigg].$</span></p> <p>This is a nice result because it is the Xavier-Glorot initialization for a square weight matrix, yet was motivated by a different idea. The Xavier-Glorot initialization, introduced by <a href="http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf">Glorot and Bengio (2010)</a>, has proven to be an effective weight initialization prescription in practice. 
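</p>
<p>As a sanity check, the prescription is a one-liner, and we can confirm numerically that the column norms of the resulting matrix concentrate near 1, as the derivation intends (the size below is arbitrary):</p>

```python
import numpy as np

def min_vanishing_init(n, rng):
    """Uniform init on [-sqrt(3/n), sqrt(3/n)], per the derivation above."""
    R = np.sqrt(3.0 / n)
    return rng.uniform(-R, R, (n, n))

W = min_vanishing_init(256, np.random.RandomState(0))

# n * E[w^2] = n * R^2 / 3 = 1, so each column norm should be close to 1
col_norms = np.linalg.norm(W, axis=0)
```

<p>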
More generally, the Xavier-Glorot prescription applies to <span class="math inline">$$m$$</span>-by-<span class="math inline">$$n$$</span> weight matrices used in a layer that has an activation function whose derivative is near one at the origin (like <span class="math inline">$$\tanh$$</span>), and says that we should initialize our weights according to a uniform distribution of the interval: <span class="math display">$\bigg[-\frac{\sqrt{6}}{\sqrt{m + n}},\ \frac{\sqrt{6}}{\sqrt{m + n}}\bigg].$</span></p> <p>You can easily modify the above analysis to obtain initialization prescriptions when using the logistic sigmoid (use <span class="math inline">$$\gamma = \frac{1}{4}$$</span>) and when initializing the weights according to a different random distribution (e.g., a Gaussian distribution).</p> <h3 id="backpropagation-through-time-and-vanishing-sensitivity">Backpropagation through time and vanishing sensitivity</h3> <p>Training an RNN with backpropagation is very similar to training a feedforward network with backpropagation. Since it is assumed you are already familiar with backpropagation generally, there are only a few comments to make:</p> <ol type="1"> <li><p>We backpropagate errors through time</p> <p>For RNNs we need to backpropagate errors from the current RNN cell back through the state, back through time, to prior RNN cells. This allows the RNN to learn to capture long term time dependencies. Because the model’s parameters are shared across RNN cells (each RNN cell has identical weights and biases), we need to calculate the gradient with respect to each time step separately and then add them up. 
This is similar to the way we backpropagate errors to shared parameters in other models, such as convolutional networks.</p></li> <li><p>There is a trade-off between weight update frequency and accurate gradients</p> <p>For all gradient-based training algorithms, there is an unavoidable trade-off between (1) frequency of parameter updates (backward passes), and (2) accurate long-term gradients. To see this, consider what happens when we update the weights at each step, but backpropagate errors more than one step back:</p> <ol type="1"> <li>At time <span class="math inline">$$t$$</span> we use our current weights, <span class="math inline">$$W_t$$</span>, to calculate the current output and current state, <span class="math inline">$$o_t$$</span> and <span class="math inline">$$s_t$$</span>.</li> <li>Second, we use <span class="math inline">$$o_t$$</span> to run a backward pass and update <span class="math inline">$$W_t$$</span> to <span class="math inline">$$W_{t+1}$$</span>.</li> <li>Third, at time <span class="math inline">$$t+1$$</span>, we use <span class="math inline">$$W_{t+1}$$</span> and <span class="math inline">$$s_t$$</span>, as calculated in step 1 using the original <span class="math inline">$$W_t$$</span>, to calculate <span class="math inline">$$o_{t+1}$$</span> and <span class="math inline">$$s_{t+1}$$</span>.</li> <li>Finally, we use <span class="math inline">$$o_{t+1}$$</span> to run a backward pass. But <span class="math inline">$$o_{t+1}$$</span> was computed using <span class="math inline">$$s_t$$</span>, which was computed using <span class="math inline">$$W_t$$</span> (not <span class="math inline">$$W_{t+1}$$</span>), which means the gradients we compute for weights at time step <span class="math inline">$$t$$</span> are evaluated at our old weights, <span class="math inline">$$W_t$$</span>, and not the current weights, <span class="math inline">$$W_{t+1}$$</span>.
They are thus only an estimate of the gradient as it would be computed with respect to the current weights. This effect will only compound as we backpropagate errors even further.</li> </ol> <p>We could compute more accurate gradients by doing fewer parameter updates (backward passes), but then we might be giving up training speed (which can be particularly harmful at the start of training). Note the similarity of this trade-off to the one faced when choosing a mini-batch size for mini-batch gradient descent: the larger the batch size, the more accurate the estimate of the gradient, but also the fewer gradient updates.</p> <p>We could also choose to not propagate errors back more steps than the frequency of our parameter updates, but then we are not calculating the full gradient of the cost with respect to the weights and this is just the flip-side of the coin; the same trade-off occurs.</p> <p>This effect is discussed in <a href="https://web.stanford.edu/class/psych209a/ReadingsByDate/02_25/Williams%20Zipser95RecNets.pdf">Williams and Zipser (1995)</a>, which provides an excellent overview of the options for calculating gradients for gradient-based training algorithms.</p></li> <li><p>Vanishing gradients plus shared parameters means unbalanced gradient flow and oversensitivity to recent distractions</p> <p>Consider a feedforward neural network. Exponentially vanishing gradients mean that changes made to the weights in the earlier layers will be exponentially smaller than those made to the weights in later layers. This is bad, even if we train the network for exponentially longer, so that the early layers eventually learn. To see this, consider that during training the early layers and later layers learn how to communicate with each other. The early layers initially send crude signals, so the later layers quickly become very good at interpreting these crude signals.
But then the early layers are encouraged to learn how to produce better crude signals rather than producing more sophisticated ones.</p> <p>RNNs have it worse, because unlike for feedforward nets, the weights in early layers and later layers are shared. This means that instead of simply miscommunicating, they can directly conflict: the gradient to a particular weight might be positive in the early layers but negative in the later layers, resulting in a negative overall gradient, so that the early layers are unlearning faster than they can learn. In the words of Hochreiter and Schmidhuber (1997): “Backpropagation through time is too sensitive to recent distractions.”</p></li> <li><p>Therefore it makes sense to truncate backpropagation</p> <p>Limiting the number of steps that we backpropagate errors in training is called truncating the backpropagation. Notice immediately that if the input/output sequence we are fitting is infinitely long we must truncate the backpropagation, else our algorithm would halt on the backward pass. If the sequence is finite but very long, we may still need to truncate the backpropagation due to computational infeasibility.</p> <p>However, <em>even if</em> we had a supercomputer that could instantly backpropagate an error an infinite number of timesteps, point 2 above tells us that we need to truncate our backpropagation due to our gradients becoming inaccurate as a result of weight updates.</p> <p>Finally, vanishing gradients create yet another reason for us to truncate our backpropagation. If our gradients vanish, then gradients that are backpropagated many steps will be very small and have a negligible effect on training.</p> <p>Note that we choose not only how often to truncate backpropagation, but also how often to update our model parameters.
See my post on <a href="https://r2rt.com/styles-of-truncated-backpropagation.html">Styles of Truncated Backpropagation</a> for an empirical comparison of two possible methods of truncation, or refer to the discussion in <a href="https://web.stanford.edu/class/psych209a/ReadingsByDate/02_25/Williams%20Zipser95RecNets.pdf">Williams and Zipser (1995)</a>.</p></li> <li><p>There is also such a thing as forward propagation of gradient components</p> <p>Something useful to know (in case you come up with the idea yourself) is that backpropagation is not our only choice for training RNNs. Instead of backpropagating errors, we can also propagate gradient components forward, allowing us to compute the error gradient with respect to the weights at each time step. This alternate algorithm is called “real-time recurrent learning (RTRL)”. Full RTRL is too computationally expensive to be practical, running in <span class="math inline">$$O(n^4)$$</span> time (as compared to truncated backpropagation, which is <span class="math inline">$$O(n^2)$$</span> when parameters are updated with the same frequency as backward passes). Similar to how truncated backpropagation approximates full backpropagation (whose time complexity, <span class="math inline">$$O(n^2L)$$</span>, can be much higher than RTRL when the number of time steps, <span class="math inline">$$L$$</span>, is large), there exists an approximate version of RTRL called subgrouped RTRL. It promises the same time complexity as truncated backpropagation (<span class="math inline">$$O(n^2)$$</span>) when the size of the subgroups is fixed, but is qualitatively different in how it approximates the gradient. Note that RTRL is a gradient-based algorithm and therefore suffers from the vanishing and exploding gradient problem. You can learn more about RTRL in <a href="https://web.stanford.edu/class/psych209a/ReadingsByDate/02_25/Williams%20Zipser95RecNets.pdf">Williams and Zipser (1995)</a>.
RTRL is just something I wanted to bring to your attention, and it is beyond our scope; in this post, I assume the use of truncated backpropagation to calculate our gradients.</p></li> </ol> <h3 id="dealing-with-vanishing-and-exploding-gradients">Dealing with vanishing and exploding gradients</h3> <p>If our gradient explodes, backpropagation will not work because we will get <code>NaN</code> values for the gradient at early layers. An easy solution for this is to clip the gradient to a maximum value, as proposed by <a href="http://www.fit.vutbr.cz/~imikolov/rnnlm/thesis.pdf">Mikolov (2012)</a> and reasserted in <a href="http://www.jmlr.org/proceedings/papers/v28/pascanu13.pdf">Pascanu et al. (2013)</a>. This works in practice to prevent <code>NaN</code> values and allows training to continue.</p> <p>Vanishing gradients are trickier to deal with in vanilla RNNs. We saw above that good weight initializations are crucial, but this only impacts the start of training – what about the middle of training? The approach suggested in <a href="http://www.jmlr.org/proceedings/papers/v28/pascanu13.pdf">Pascanu et al. (2013)</a> is to introduce a regularization term that enforces constant backwards error flow. This is an easy solution that seems to work for the few experiments on which it was tested in Pascanu et al. (2013). Unfortunately, it is difficult to find a justification for why this should work <em>all the time</em>, because we are imposing an opinion about the way gradients should flow on the model. This opinion may be correct for some tasks, in which case our imposition will help achieve better results. However, it may be that for some tasks we want gradients to vanish completely, and for others, it may be that we want them to grow. In these cases, the regularizer would detract from the model’s performance, and there doesn’t seem to be any justification for saying that one situation is more common than the other.
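Returning to the clipping fix for exploding gradients above, it is simple enough to sketch in a few lines of numpy (the function name and the list-of-arrays layout are my own, not a library API):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so that their joint L2 norm
    is at most max_norm; gradients below the threshold pass through
    unchanged."""
    total = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads
```

Clipping by the joint norm preserves the gradient's direction; clipping each value independently (e.g., with <code>np.clip</code>) is a cruder alternative that does not.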
LSTMs avoid this issue altogether.</p> <h3 id="written-memories-the-intuition-behind-lstms">Written memories: the intuition behind LSTMs</h3> <p>Very much like the messages passed by children playing a game of <a href="https://en.wikipedia.org/wiki/Chinese_whispers">broken telephone</a>, information is morphed by RNN cells and the original message is lost. A small change in the original message may not have made any difference in the final message, or it may have resulted in something completely different.</p> <p>How can we protect the integrity of messages? This is the fundamental principle of LSTMs: to ensure the integrity of our messages in the real world, we write them down. Writing is a <em>delta to the current state</em>: it is an act of creation (pen on paper) or destruction (carving in stone); the subject itself does not morph when you write on it and the error gradient on the backward pass is constant.</p> <p>This is precisely what was proposed by the landmark paper of <a href="http://isle.illinois.edu/sst/meetings/2015/hochreiter-lstm.pdf">Hochreiter and Schmidhuber (1997)</a>, which introduced the LSTM. They asked: “how can we achieve constant error flow through a single unit with a single connection to itself [i.e., a single piece of isolated information]?”</p> <p>The answer, quite simply, is to avoid information morphing: changes to the state of an LSTM are explicitly written in, by an explicit addition or subtraction, so that each element of the state stays constant without outside interference: “the unit’s activation has to remain constant … this will be ensured by using the identity function”.</p> <blockquote> <p><strong>The fundamental principle of LSTMs: Write it down.</strong></p> <p>To ensure the integrity of our messages in the real world, we write them down. Writing is an incremental change that can be additive (pen on paper) or subtractive (carving in rock), and which remains unchanged absent outside interference.
In LSTMs, everything is written down and, assuming no interference from other state units or external inputs, carries its prior state forward.</p> <p>Practically speaking, this means that any state changes are incremental, so that <span class="math inline">$$s_{t+1} = s_t + \Delta s_{t+1}$$</span>.<a href="#fn10" class="footnoteRef" id="fnref10"><sup>10</sup></a></p> </blockquote> <p>Now Hochreiter and Schmidhuber observed that just “writing it down” had been tried before, but hadn’t worked so well. To see why, consider what happens when we keep writing in changes:</p> <p>Some of our writes are positive, and some are negative, so it’s not true that our canvas necessarily blows up: our writes could theoretically cancel each other out. However, it turns out that it’s quite hard to learn how to coordinate this. In particular, at the start of training, we start with random initializations and our network is making some fairly random writes. From the very start of training, we end up with something that looks like this:</p> <figure> <img src="https://r2rt.com/static/images/NH_Pollock_5.jpg" alt="Pollock No. 5" /><figcaption>Pollock No. 5</figcaption> </figure> <p>Even if we eventually learn to coordinate our writes properly, it’s very difficult to record anything useful on top of that chaos (albeit, in this example, very pretty and somewhat regular chaos that was <a href="https://en.wikipedia.org/wiki/No._5,_1948">worth \$140 million</a> about 10 years ago). 
This is the fundamental challenge of LSTMs: Uncontrolled and uncoordinated writing causes chaos and overflow from which it can be very hard to recover.</p> <blockquote> <p><strong>The fundamental challenge of LSTMs: Uncontrolled and uncoordinated writing.</strong></p> <p>Uncontrolled and uncoordinated writes, particularly at the start of training when writes are completely random, create a chaotic state that leads to bad results and from which it can be difficult to recover.</p> </blockquote> <p>Hochreiter and Schmidhuber recognized this problem, splitting it into several subproblems, which they termed “input weight conflict”, “output weight conflict”, the “abuse problem”, and “internal state drift”. The LSTM architecture was carefully designed in order to overcome these problems, starting with the idea of selectivity.</p> <h3 id="using-selectivity-to-control-and-coordinate-writing">Using selectivity to control and coordinate writing</h3> <p>According to the early literature on LSTMs, the key to overcoming the fundamental challenge of LSTMs and keeping our state under control is to be selective in <strong>three things</strong>: what we write, what we read (because we need to read something to know what to write), and what we forget (because obsolete information is a distraction and should be forgotten).</p> <p>Part of the reason our state can become so chaotic is that the base RNN writes to every element of the state. This is a problem I suffer from a lot. I have a paper in front of my computer and I write down a lot of things on the same paper. When it fills up, I take out the paper under it and start writing on that one. The cycle repeats and I end up with a bunch of papers on my desk that contain an overwhelming amount of gibberish.</p> <p>Hochreiter and Schmidhuber describe this as “input weight conflict”: if each unit is being written to by all units at each time step, it will collect a lot of useless information, rendering its original state unusable.
Thus, the RNN must learn how to use some of its units to cancel out other incoming writes and “protect” the state, which results in difficult learning.</p> <blockquote> <p><strong>First form of selectivity: Write selectively.</strong></p> <p>To get the most out of our writings in the real world, we need to be selective about what we write; when taking class notes, we only record the most important points and we certainly don’t write our new notes on top of our old notes. In order for our RNN cells to do this, they need a mechanism for selective writing.</p> </blockquote> <p>The second reason our state can become chaotic is the flip side of the first: for each write it makes, the base RNN reads from every element of the state. As a mild example: if I’m writing a blog post on the intuition behind LSTMs while on vacation in a national park with a wild bear on the loose, I might include the things I’ve been reading about bear safety in my blog post. This is just one thing, and only mildly chaotic, but imagine what this post would look like if I included all the things…</p> <p>Hochreiter and Schmidhuber describe this as “output weight conflict”: if irrelevant units are read by all other units at each time step, they produce a potentially huge influx of irrelevant information. Thus, the RNN must learn how to use some of its units to cancel out the irrelevant information, which results in difficult learning.</p> <p>Note the difference between reads and writes: If we choose not to read from a unit, it cannot affect any element of our state and our read decision impacts the entire state. If we choose not to write to a unit, that impacts only that single element of our state. 
This does not mean the impact of selective reads is more significant than the impact of selective writes: reads are summed together and squashed by a non-linearity, whereas writes are absolute, so that the impact of a read decision is broad but shallow, and the impact of a write decision is narrow but deep.</p> <blockquote> <p><strong>Second form of selectivity: Read selectively.</strong></p> <p>In order to perform well in the real world, we need to apply the most relevant knowledge by being selective in what we read or consume. In order for our RNN cells to do this, they need a mechanism for selective reading.</p> </blockquote> <p>The third form of selectivity relates to how we dispose of information that is no longer needed. My old paper notes get thrown out. Otherwise I end up with an overwhelming number of papers, even if I were to be selective in writing them. Unused files in my Dropbox get overwritten, else I would run out of space, even if I were to be selective in creating them.</p> <p>This intuition was not introduced in the original LSTM paper, which led the original LSTM model to have trouble with simple tasks involving long sequences. Rather, it was introduced by <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.55.5709&amp;rep=rep1&amp;type=pdf">Gers et al. (2000)</a>. According to Gers et al., in some cases the state of the original LSTM model would grow indefinitely, eventually causing the network to break down. In other words, the original LSTM suffered from information overload.</p> <blockquote> <p><strong>Third form of selectivity: Forget selectively.</strong></p> <p>In the real world, we can only keep so many things in mind at once; in order to make room for new information, we need to selectively forget the least relevant old information.
In order for our RNN cells to do this, they need a mechanism for selective forgetting.</p> </blockquote> <p>With that, there are just two more steps to deriving the LSTM:</p> <ol type="1"> <li>we need to determine a mechanism for selectivity, and</li> <li>we need to glue the pieces together.</li> </ol> <h3 id="gates-as-a-mechanism-for-selectivity">Gates as a mechanism for selectivity</h3> <p>Selective reading, writing and forgetting involves separate read, write and forget decisions for each element of the state. We will make these decisions by taking advantage of state-sized read, write and forget vectors with values between 0 and 1 specifying the percentage of reading, writing and forgetting that we do for each state element. Note that while it may be more natural to think of reading, writing and forgetting as binary decisions, we need our decisions to be implemented via a differentiable function. The logistic sigmoid is a natural choice since it is differentiable and produces continuous values between 0 and 1.</p> <p>We call these read, write and forget vectors “gates”, and we can compute them using the simplest function we have, as we did for the vanilla RNN: the single-layer neural network. Our three gates at time step <span class="math inline">$$t$$</span> are denoted <span class="math inline">$$i_t$$</span>, the input gate (for writing), <span class="math inline">$$o_t$$</span>, the output gate (for reading) and <span class="math inline">$$f_t$$</span>, the forget gate (for remembering). From the names, we immediately notice that two things are backwards for LSTMs:</p> <ul> <li>Admittedly this is a bit of a chicken-and-egg problem, but I would usually think of first reading then writing. Indeed, this ordering is strongly suggested by the RNN cell specification: we need to read the prior state before we can write to a new one, so that even if we are starting with a blank initial state, we are reading from it.
The names input gate and output gate suggest the opposite temporal relationship, which the LSTM adopts. We’ll see that this complicates the architecture.</li> <li>The forget gate is used for forgetting, but it actually operates as a remember gate. E.g., a 1 in a forget gate vector means remember everything, not forget everything. This makes no practical difference, but might be confusing.</li> </ul> <p>Here are the mathematical definitions of the gates (notice the similarities):</p> <p><span class="math display">$\begin{equation} \begin{split} i_t &amp;= \sigma(W_is_{t-1} + U_ix_t + b_i) \\ o_t &amp;= \sigma(W_os_{t-1} + U_ox_t + b_o) \\ f_t &amp;= \sigma(W_fs_{t-1} + U_fx_t + b_f) \\ \end{split} \end{equation}$</span></p> <p>We could use more complicated functions for the gates as well. A simple yet effective recent example is the use of “multiplicative integration”. See <a href="https://arxiv.org/abs/1606.06630">Wu et al. (2016)</a>.</p> <p>Let’s now take a closer look at how our gates interact.</p> <h3 id="gluing-gates-together-to-derive-a-prototype-lstm">Gluing gates together to derive a prototype LSTM</h3> <p>If there were no write gate, read selectivity says that we should use the read gate when reading the prior state in order to produce the next write to the state (as discussed above, the read naturally comes before the write when we are zoomed in on a single RNN cell). The fundamental principle of LSTMs says that our write will be incremental to the prior state; therefore, we are calculating <span class="math inline">$$\Delta s_t$$</span>, not <span class="math inline">$$s_t$$</span>. 
Let’s call this would-be <span class="math inline">$$\Delta s_t$$</span> our <em>candidate write</em>, and denote it <span class="math inline">$$\tilde{s}_t$$</span>.</p> <p>We calculate <span class="math inline">$$\tilde{s}_t$$</span> the same way we would calculate the state in a vanilla RNN, except that instead of using the prior state, <span class="math inline">$$s_{t-1}$$</span>, we first multiply the prior state element-wise by the read gate to get the <em>gated prior state</em>, <span class="math inline">$$o_t \odot s_{t-1}$$</span>:</p> <p><span class="math display">$\tilde{s_t} = \phi(W(o_t \odot s_{t-1}) + Ux_t + b)$</span></p> <p>Note that <span class="math inline">$$\odot$$</span> denotes element-wise multiplication, and <span class="math inline">$$o_t$$</span> is our read gate (output gate).</p> <p><span class="math inline">$$\tilde{s}_t$$</span> is only a candidate write because we are applying selective writing and have a write gate. Thus, we multiply <span class="math inline">$$\tilde{s}_t$$</span> element-wise by our write gate, <span class="math inline">$$i_t$$</span>, to obtain our true write, <span class="math inline">$$i_t \odot \tilde{s}_t$$</span>.</p> <p>The final step is to add this to our prior state, but forget selectivity says that we need to have a mechanism for forgetting. So before we add anything to our prior state, we multiply it (element-wise) by the forget gate (which actually operates as a remember gate). 
Our final prototype LSTM equation is:</p> <p><span class="math display">$s_t = f_t \odot s_{t-1} + i_t \odot \tilde{s}_t$</span></p> <p>If we gather all of our equations together, we get the full spec for our prototype LSTM cell (note that <span class="math inline">$$s_t$$</span> is also the cell’s external output at each time step):</p> <p><strong>The Prototype LSTM</strong></p> <p><span class="math display">$\begin{equation} \begin{split} i_t &amp;= \sigma(W_is_{t-1} + U_ix_t + b_i) \\ o_t &amp;= \sigma(W_os_{t-1} + U_ox_t + b_o) \\ f_t &amp;= \sigma(W_fs_{t-1} + U_fx_t + b_f) \\ \\ \tilde{s_t}&amp; = \phi(W(o_t \odot s_{t-1}) + Ux_t + b)\\ s_t &amp;= f_t \odot s_{t-1} + i_t \odot \tilde{s}_t \end{split} \end{equation}$</span></p> <p>At the risk of distracting you from the equations (which are far more descriptive), here is what the data flow looks like:</p> <figure> <img src="https://r2rt.com/static/images/NH_PrototypeLSTMCell.png" alt="Prototype LSTM Cell" /><figcaption>Prototype LSTM Cell</figcaption> </figure> <p>In theory, this prototype <em>should</em> work, and it would be quite beautiful if it did. In practice, the selectivity measures taken are not (usually) enough to overcome the fundamental challenge of LSTMs: the selective forgets and the selective writes are not coordinated at the start of training which can cause the state to quickly become large and chaotic. Further, since the state is potentially unbounded, the gates and the candidate write will often become saturated, which causes problems for training.</p> <p>This was observed by Hochreiter and Schmidhuber (1997), who termed the problem “internal state drift”, because “if the [writes] are mostly positive or mostly negative, then the internal state will tend to drift away over time”. It turns out that this problem is so severe that the prototype we created above tends to fail in practice, even with very small initial learning rates and carefully chosen bias initializations. 
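For concreteness, a single step of this prototype can be sketched in numpy (the parameter dictionary and its key names are my own illustration of the equations above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def prototype_lstm_step(s, x, p):
    """One step of the prototype LSTM. p is a plain dict of parameters
    whose names follow the equations (Wi/Ui/bi for the write gate, etc.)."""
    i = sigmoid(p["Wi"] @ s + p["Ui"] @ x + p["bi"])           # write gate
    o = sigmoid(p["Wo"] @ s + p["Uo"] @ x + p["bo"])           # read gate
    f = sigmoid(p["Wf"] @ s + p["Uf"] @ x + p["bf"])           # forget gate
    s_tilde = np.tanh(p["W"] @ (o * s) + p["U"] @ x + p["b"])  # candidate write
    return f * s + i * s_tilde                                 # incremental update
```

Nothing here bounds the state: with the write and forget gates saturated near 1, each step simply adds another candidate write on top of the last, which is exactly the drift problem described above.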
The clearest empirical demonstration of this can be found in <a href="https://arxiv.org/abs/1503.04069">Greff et al. (2015)</a>, which contains an empirical comparison of 8 LSTM variants. The worst performing variant, often failing to converge, is substantially similar to the prototype above.</p> <p>By enforcing a bound on the state to prevent it from blowing up, we can overcome this problem. There are a few ways to do this, which lead to different models of the LSTM.</p> <h3 id="three-working-models-the-normalized-prototype-the-gru-and-the-pseudo-lstm">Three working models: the normalized prototype, the GRU and the pseudo LSTM</h3> <p>The selectivity measures taken in our prototype LSTM were not powerful enough to overcome the fundamental challenge of LSTMs. In particular, the state, which is used to compute both the gates and the candidate write can grow unbounded.</p> <p>I’ll cover three options, each of which bounds the state in order to give us a working LSTM:</p> <h4 id="the-normalized-prototype-a-soft-bound-via-normalization">The normalized prototype: a soft bound via normalization</h4> <p>We can impose a soft bound by normalizing the state. One method that has worked for me in preliminary tests is simply dividing <span class="math inline">$$s_t$$</span> by <span class="math inline">$$\sqrt{\text{Var}(s_t) + 1}$$</span>, where we add 1 to prevent the initially zero state from blowing up. We might also subtract the mean state before dividing out the variance, but this did not seem to help in preliminary tests. 
We might then consider adding in scale and shift factors for expressiveness, a la layer normalization<a href="#fn11" class="footnoteRef" id="fnref11"><sup>11</sup></a>, but then the model ventures into layer normalized LSTM territory (and we may want to compare it to other layer normalized LSTM models).</p> <p>In any case, this provides a method for creating a soft bound on the state, and has performed slightly better for me in preliminary tests than regular LSTMs (including the pseudo LSTM derived below).</p> <h4 id="the-gru-a-hard-bound-via-write-forget-coupling-or-overwriting">The GRU: a hard bound via write-forget coupling, or overwriting</h4> <p>One way to impose a hard bound on the state and coordinate our writes and forgets is to explicitly link them; in other words, instead of doing selective writes and selective forgets, we forego some expressiveness and do selective overwrites by setting our forget gate equal to 1 minus our write gate, so that:</p> <p><span class="math display">$s_t = (1-i_t) \odot s_{t-1} + i_t \odot \tilde{s}_t$</span></p> <p>This works because it turns <span class="math inline">$$s_t$$</span> into an element-wise weighted average of <span class="math inline">$$s_{t-1}$$</span> and <span class="math inline">$$\tilde{s}_t$$</span>, which is bounded if both <span class="math inline">$$s_{t-1}$$</span> and <span class="math inline">$$\tilde{s}_t$$</span> are bounded. This is the case if we use <span class="math inline">$$\phi = \tanh$$</span> (whose output is bound to (-1, 1)).</p> <p>We’ve now derived the gated recurrent unit (GRU). To conform to the GRU terminology used in the literature, we call the overwrite gate an update gate and label it <span class="math inline">$$z_t$$</span>. Note that although called an “update” gate, it operates as a “do-not-update” gate by specifying the percentage of the prior state that we don’t want to overwrite.
Thus, the update gate, <span class="math inline">$$z_t$$</span>, is the same as the forget gate from our prototype LSTM, <span class="math inline">$$f_t$$</span>, and the write gate is calculated as <span class="math inline">$$1 - z_t$$</span>.</p> <p>Note that, for whatever reason, the authors who introduced the GRU called their read gate a reset gate (at least we get to use <span class="math inline">$$r_t$$</span> for it!).</p> <p><strong>The GRU</strong></p> <p><span class="math display">$\begin{equation} \begin{split} r_t &amp;= \sigma(W_rs_{t-1} + U_rx_t + b_r) \\ z_t &amp;= \sigma(W_zs_{t-1} + U_zx_t + b_z) \\ \\ \tilde{s_t}&amp; = \phi(W(r_t \odot s_{t-1}) + Ux_t + b)\\ s_t &amp;= z_t \odot s_{t-1} + (1 - z_t) \odot \tilde{s}_t \end{split} \end{equation}$</span></p> <p>At the risk of distracting you from the equations (which are far more descriptive), here is what the data flow looks like:</p> <figure> <img src="https://r2rt.com/static/images/NH_GRUCell.png" alt="GRU Cell" /><figcaption>GRU Cell</figcaption> </figure> <p>This is the GRU cell first introduced by <a href="http://emnlp2014.org/papers/pdf/EMNLP2014179.pdf">Cho et al. (2014)</a>. I hope you agree that the derivation of the GRU in this post was motivated at every step. There hasn’t been a single arbitrary “hiccup” in our logic, as there will be in order for us to arrive at the LSTM. Contrary to what some authors have written (e.g., “The GRU is an alternative to the LSTM which is similarly difficult to justify” - <a href="http://jmlr.org/proceedings/papers/v37/jozefowicz15.pdf">Jozefowicz et al. (2015)</a>), we see that the GRU is a very natural architecture.</p> <h4 id="the-pseudo-lstm-a-hard-bound-via-non-linear-squashing">The Pseudo LSTM: a hard bound via non-linear squashing</h4> <p>We now take the second-to-last step on our journey to full LSTMs by using a third method to bound our state: we pass the state through a squashing function (e.g., the logistic sigmoid or tanh).
The hiccup here is that we cannot apply the squashing function to the state itself (for this would result in information morphing and violate our fundamental principle of LSTMs). Instead, we pass the state through the squashing function every time we need to use it for anything except making incremental writes to it. By doing this, our gates and candidate write don’t become saturated and we maintain good gradient flow.</p> <p>To this point, our external output has been the same as our state, but here, the only time we don’t squash the state is when we make incremental writes to it. Thus, our cell’s output and state are different.</p> <p>This is an easy enough modification to our prototype. Denoting our new squashing function by <span class="math inline">$$\phi$$</span> (it does not have to be the same as the nonlinearity we use to compute the candidate write but tanh is generally used for both in practice):</p> <p><strong>The Pseudo LSTM</strong></p> <p><span class="math display">$\begin{equation} \begin{split} i_t &amp;= \sigma(W_i(\phi(s_{t-1})) + U_ix_t + b_i) \\ o_t &amp;= \sigma(W_o(\phi(s_{t-1})) + U_ox_t + b_o) \\ f_t &amp;= \sigma(W_f(\phi(s_{t-1})) + U_fx_t + b_f) \\ \\ \tilde{s_t}&amp; = \phi(W(o_t \odot \phi(s_{t-1})) + Ux_t + b)\\ s_t &amp;= f_t \odot s_{t-1} + i_t \odot \tilde{s}_t\\ \\ \text{rnn}_{out} &amp; = \phi(s_t) \end{split} \end{equation}$</span></p> <p>At the risk of distracting you from the equations (which are far more descriptive), here is what the data flow looks like:</p> <figure> <img src="https://r2rt.com/static/images/NH_PseudoLSTMCell.png" alt="Pseudo LSTM Cell" /><figcaption>Pseudo LSTM Cell</figcaption> </figure> <p>The pseudo LSTM is almost an LSTM - it’s just backwards. From this presentation, we see clearly that the only motivated difference between the GRU and the LSTM is the approach they take to bounding the state. 
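A single step of the pseudo LSTM can be sketched the same way (again, the parameter dict and its key names are my own, and tanh stands in for both uses of <span class="math inline">$$\phi$$</span>): the state itself is only ever updated incrementally, while every use of it, including the output, passes through the squashing function.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pseudo_lstm_step(s, x, p):
    """One step of the pseudo LSTM: incremental writes to the raw state,
    with tanh applied to every read of it."""
    phi_s = np.tanh(s)                                    # squashed state
    i = sigmoid(p["Wi"] @ phi_s + p["Ui"] @ x + p["bi"])  # write gate
    o = sigmoid(p["Wo"] @ phi_s + p["Uo"] @ x + p["bo"])  # read gate
    f = sigmoid(p["Wf"] @ phi_s + p["Uf"] @ x + p["bf"])  # forget gate
    s_tilde = np.tanh(p["W"] @ (o * phi_s) + p["U"] @ x + p["b"])
    s_new = f * s + i * s_tilde                           # incremental update
    rnn_out = np.tanh(s_new)                              # squashed output
    return s_new, rnn_out
```

Note that the raw state can still drift, but because the gates, the candidate write, and the output all see only the squashed state, they stay in a well-behaved range.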
We’ll see that this pseudo LSTM has some advantages over the standard LSTM.</p> <h3 id="deriving-the-lstm">Deriving the LSTM</h3> <p>There are a number of LSTM variants used in the literature, but the differences between them are not so important for our purposes. They all share one key difference with our pseudo LSTM: the real LSTM places the read operation <em>after</em> the write operation.</p> <blockquote> <p><strong>LSTM Diff 1 (the LSTM hiccup)</strong>: Read comes <em>after</em> write. This forces the LSTM to pass a shadow state between time steps.</p> </blockquote> <p>If you read Hochreiter and Schmidhuber (1997) you will observe that they were thinking of the state (as we’ve been using state so far) as being separate from the rest of the RNN cell.<a href="#fn12" class="footnoteRef" id="fnref12"><sup>12</sup></a> Hochreiter and Schmidhuber thought of the state as a “memory cell” that had a “constant error” (because absent reading and writing, it carries the state forward and has a constant gradient during backpropagation). Perhaps this is why, viewing the state as a separate memory cell, they saw the order of operations as inputs (writes) followed by outputs (reads). Indeed, most diagrams of the LSTM, including the ones in Hochreiter and Schmidhuber (1997) and <a href="https://arxiv.org/abs/1308.0850">Graves (2013)</a> are confusing because they focus on this “memory cell” rather than on the LSTM cell as a whole.<a href="#fn13" class="footnoteRef" id="fnref13"><sup>13</sup></a> I don’t include examples here so as to not distract from raw understanding.</p> <p>This difference in read-write order has the following important implication: We need to read the state in order to create a candidate write. But if creating the candidate write comes before the read operation inside our RNN cell, we can’t do that unless we pass a pre-gated “shadow state” from one time step to the next along with our normal state. 
The write-then-read order thus forces the LSTM to pass a shadow state from RNN cell to RNN cell.</p> <p>Going forward, to conform to the common letters used in describing the LSTM, we rename the main state, <span class="math inline">$$s_t$$</span>, to <span class="math inline">$$c_t$$</span> (c is for cell, or constant error). We’ll make the corresponding change to our candidate write, which will now be <span class="math inline">$$\tilde{c}_t$$</span>. We will also introduce a separate shadow state, <span class="math inline">$$h_t$$</span> (h is for hidden state) that has the same size as our regular state. <span class="math inline">$$h_{t-1}$$</span> is analogous to the <em>gated prior state</em> from our prototype LSTM, <span class="math inline">$$o_t \odot s_{t-1}$$</span>, except that it is squashed by a non-linearity (to impose a bound on the values used to compute the candidate write). Thus the prior state our LSTM receives at time step <span class="math inline">$$t$$</span> is a tuple of closely-related vectors: <span class="math inline">$$(c_{t-1},\ h_{t-1})$$</span>, where <span class="math inline">$$h_{t-1} = o_{t-1} \odot \phi(c_{t-1})$$</span>.</p> <p>This is truly a hiccup, and not because it makes things more complicated (which it does). It’s a hiccup because we end up using a read gate calculated at time <span class="math inline">$$t-1$$</span>, using the shadow state from time <span class="math inline">$$t-2$$</span> and the inputs from time <span class="math inline">$$t-1$$</span>, in order to gate the relevant state information for use at time <span class="math inline">$$t$$</span>.
This is like day trading based on yesterday’s news.</p> <p>Our hiccup created an <span class="math inline">$$h_{t-1}$$</span>, the presence of which goes on to create two more differences to our pseudo LSTM:</p> <p>First, instead of using the (squashed) ungated prior state, <span class="math inline">$$\phi(c_{t-1})$$</span>, to compute the gates, the standard LSTM uses <span class="math inline">$$h_{t-1} = o_{t-1} \odot \phi(c_{t-1})$$</span>, which has been subjected to a read gate, and an outdated read gate at that.</p> <blockquote> <p><strong>LSTM Diff 2</strong>: Gates are computed using the gated shadow state, <span class="math inline">$$h_{t-1} = o_{t-1} \odot \phi(c_{t-1})$$</span>, instead of a squashed main state, <span class="math inline">$$\phi(c_{t-1})$$</span>.</p> </blockquote> <p>Second, instead of using the (squashed) ungated state, <span class="math inline">$$\phi(c_{t})$$</span> as the LSTM’s external output, the standard LSTM uses <span class="math inline">$$h_{t} = o_{t} \odot \phi(c_{t})$$</span>, which has been subjected to a read gate.</p> <blockquote> <p><strong>LSTM Diff 3</strong>: The LSTM’s external output is the gated shadow state, <span class="math inline">$$h_{t} = o_{t} \odot \phi(c_{t})$$</span>, instead of a squashed main state, <span class="math inline">$$\phi(c_{t})$$</span>.</p> </blockquote> <p>While we can see how these differences came to be, as a result of the “memory cell” view of the LSTM’s true state, at least the first and third lack a principled motivation (the second can be interpreted as asserting that information that is irrelevant for the candidate write is also irrelevant for gate computations, which makes sense). Thus, while I strongly disagreed above with <a href="http://jmlr.org/proceedings/papers/v37/jozefowicz15.pdf">Jozefowicz et al. 
(2015)</a> about the GRU being “difficult to justify”, I agree with them that there are LSTM components whose “purpose is not immediately apparent”.</p> <p>We will now rewrite our pseudo LSTM backwards, taking into account all three differences, to get a real LSTM. It now receives two quantities as the prior state, <span class="math inline">$$c_{t-1}$$</span> and <span class="math inline">$$h_{t-1}$$</span>, and produces two quantities which it will pass to the next time step, <span class="math inline">$$c_{t}$$</span> and <span class="math inline">$$h_{t}$$</span>. The LSTM we get is quite “normal”: this is the version of the LSTM you will find implemented as the “BasicLSTMCell” in Tensorflow.</p> <p><strong>The basic LSTM</strong></p> <p><span class="math display">$\begin{equation} \begin{split} i_t &amp;= \sigma(W_ih_{t-1} + U_ix_t + b_i) \\ o_t &amp;= \sigma(W_oh_{t-1} + U_ox_t + b_o) \\ f_t &amp;= \sigma(W_fh_{t-1} + U_fx_t + b_f) \\ \\ \tilde{c_t}&amp; = \phi(Wh_{t-1} + Ux_t + b)\\ c_t &amp;= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t\\ \\ h_t &amp;= o_t \odot \phi(c_t)\\ \\ \text{rnn}_{out} &amp; = h_t \end{split} \end{equation}$</span></p> <p>At the risk of distracting you from the equations (which are far more descriptive), here is what the data flow looks like:</p> <figure> <img src="https://r2rt.com/static/images/NH_BasicLSTMCell.png" alt="Basic LSTM Cell" /><figcaption>Basic LSTM Cell</figcaption> </figure> <h3 id="the-lstm-with-peepholes">The LSTM with peepholes</h3> <p>The potential downside of LSTM Diff 2 (hiding of potentially relevant information) was recognized by <a href="ftp://ftp.idsia.ch/pub/juergen/TimeCount-IJCNN2000.pdf">Gers and Schmidhuber (2000)</a>, who introduced “peephole” connections in response. Peephole connections include the original unmodified prior state, <span class="math inline">$$c_{t-1}$$</span>, in the calculation of the gates.
In introducing these peepholes, Gers and Schmidhuber (2000) also noticed the outdated input to the read gate (due to LSTM Diff 1), and partially fixed it by moving the calculation of the read gate, <span class="math inline">$$o_t$$</span>, to come after the calculation of <span class="math inline">$$c_t$$</span>, so that <span class="math inline">$$o_t$$</span> uses <span class="math inline">$$c_t$$</span> instead of <span class="math inline">$$c_{t-1}$$</span> in its peephole connection.</p> <p>Making these changes, we get one of the most common variants of the LSTM. This is the architecture used in <a href="http://arxiv.org/pdf/1308.0850v5.pdf">Graves (2013)</a>. Note that each <span class="math inline">$$P_x$$</span> is an <span class="math inline">$$n \times n$$</span> matrix (a peephole matrix), much like each <span class="math inline">$$W_x$$</span>.</p> <p><strong>The LSTM with peepholes</strong></p> <p><span class="math display">$\begin{equation} \begin{split} i_t &amp;= \sigma(W_ih_{t-1} + U_ix_t + P_ic_{t-1} + b_i) \\ f_t &amp;= \sigma(W_fh_{t-1} + U_fx_t + P_fc_{t-1} + b_f) \\ \\ \tilde{c}_t &amp;= \phi(Wh_{t-1} + Ux_t + b)\\ c_t &amp;= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t\\ \\ o_t &amp;= \sigma(W_oh_{t-1} + U_ox_t + P_oc_{t} + b_o) \\ \\ h_t &amp;= o_t \odot \phi(c_t)\\ \\ \text{rnn}_{out} &amp; = h_t \end{split} \end{equation}$</span></p> <h3 id="an-empirical-comparison-of-the-basic-lstm-and-the-pseudo-lstm">An empirical comparison of the basic LSTM and the pseudo LSTM</h3> <p>I now compare the basic LSTM to our pseudo LSTM to see if LSTM Diffs 1, 2 and 3 really are harmful.
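</p>
<p>Before turning to the comparison, here is a single step of the peephole LSTM above sketched in NumPy. As before, the parameter names are illustrative assumptions; the point to notice is that the read gate is computed after the state update, so its peephole sees the current state.</p>

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def peephole_lstm_step(x, h_prev, c_prev, params):
    """One step of the LSTM with peepholes (the Graves 2013 variant above).

    Each of 'i', 'f' and 'o' maps to a (W, U, P, b) tuple, where P is a
    peephole matrix; 'c' maps to (W, U, b). All names are illustrative
    assumptions for this sketch.
    """
    W_i, U_i, P_i, b_i = params['i']
    W_f, U_f, P_f, b_f = params['f']
    W_o, U_o, P_o, b_o = params['o']
    W_c, U_c, b_c = params['c']

    # write and forget gates peek at the prior state c_{t-1}
    i = sigmoid(W_i @ h_prev + U_i @ x + P_i @ c_prev + b_i)
    f = sigmoid(W_f @ h_prev + U_f @ x + P_f @ c_prev + b_f)

    c = f * c_prev + i * np.tanh(W_c @ h_prev + U_c @ x + b_c)

    # the read gate is computed last, so its peephole sees the updated c_t
    o = sigmoid(W_o @ h_prev + U_o @ x + P_o @ c + b_o)
    h = o * np.tanh(c)
    return h, c
```

<p>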
All combinations of the three differences are tested, for a total of 8 possible architectures:</p> <ol type="1"> <li><p>Pseudo LSTM: as above.</p></li> <li><p>Pseudo LSTM plus LSTM Diff 1: Shadow state containing read-gated squashed state, <span class="math inline">$$o_{t-1} \odot \phi(c_{t-1})$$</span>, is passed to time step <span class="math inline">$$t$$</span>, where it is used in the computation of the candidate write only. Gates and outputs are calculated using the ungated squashed state.</p></li> <li><p>Pseudo LSTM plus LSTM Diffs 1 and 2: Shadow state containing read-gated squashed state, <span class="math inline">$$o_{t-1} \odot \phi(c_{t-1})$$</span>, is passed to time step <span class="math inline">$$t$$</span>, where it is used in the computation of the candidate write and each of the three gates.</p></li> <li><p>Pseudo LSTM plus LSTM Diffs 1 and 3: Shadow state containing read-gated squashed state, <span class="math inline">$$o_{t-1} \odot \phi(c_{t-1})$$</span>, is passed to time step <span class="math inline">$$t$$</span>, where it is used in the computation of the candidate write only.
The shadow state, <span class="math inline">$$o_t \odot \phi(c_t)$$</span>, is also used as the cell output at time step <span class="math inline">$$t$$</span> (i.e., the cell output is read-gated).</p></li> <li><p>Pseudo LSTM plus LSTM Diff 2: Read-gated squashed prior state, <span class="math inline">$$o_t \odot \phi(s_{t-1})$$</span>, is used in place of squashed prior state, <span class="math inline">$$\phi(s_{t-1})$$</span>, to compute the write gate and forget gate.</p></li> <li><p>Pseudo LSTM plus LSTM Diffs 2 and 3: Read-gated squashed prior state, <span class="math inline">$$o_t \odot \phi(s_{t-1})$$</span>, is used in place of squashed prior state, <span class="math inline">$$\phi(s_{t-1})$$</span>, to compute the write gate and forget gate, and also to gate the cell output.</p></li> <li><p>Pseudo LSTM plus LSTM Diff 3: Pseudo LSTM using read-gated squashed state as its external output, <span class="math inline">$$o_t \odot \phi(s_t)$$</span>, instead of squashed state, <span class="math inline">$$\phi(s_t)$$</span>.</p></li> <li><p>Basic LSTM: as above.</p></li> </ol> <p>In architectures 5-7, the read gate is calculated at time <span class="math inline">$$t$$</span> (i.e., they do not incorporate the time delay caused by LSTM Diff 1). All architectures use a forget gate bias of 1, and read/write gate biases of 0.</p> <p>Using the PTB dataset, I run 5 trials of up to 20 epochs of each. Training is cut short if the loss does not fall after 2 epochs, and the minimum epoch validation loss is reported. Gradients are calculated with respect to a softmax/cross-entropy loss via backpropagation truncated to 30 steps, and learning is performed in batches of 30 with an AdamOptimizer and learning rates of 3e-3, 1e-3, 3e-4, and 1e-4. The state size used is 250. No dropout, layer normalization or other features are added. Architectures are composed of a single layer of RNN cells (i.e., this is not a comparison of deep architectures). 
RNN inputs are passed through an embedding layer, and RNN outputs are passed through a softmax.</p> <p>The best epoch validation losses, shown as the average of 5 runs with a 95% confidence interval, are as follows (lower is better):</p> <div style="font-size: 0.8em; margin: 20px"> <table style="width:25%;"> <colgroup> <col style="width: 2%" /> <col style="width: 2%" /> <col style="width: 2%" /> <col style="width: 2%" /> <col style="width: 2%" /> <col style="width: 2%" /> <col style="width: 2%" /> <col style="width: 2%" /> <col style="width: 2%" /> </colgroup> <thead> <tr class="header"> <th>LR</th> <th>1 (pseudo)</th> <th>2 {1}</th> <th>3 {1,2}</th> <th>4 {1,3}</th> <th>5 {2}</th> <th>6 {2,3}</th> <th>7 {3}</th> <th>8 (basic)</th> </tr> </thead> <tbody> <tr class="odd"> <td><strong>3e-03</strong></td> <td>433.7 ± 10.6</td> <td>430.6 ± 6.1</td> <td>390.3 ± 1.2</td> <td>424.5 ± 3.5</td> <td><strong><em>389.0 ± 1.4</em></strong></td> <td>399.1 ± 2.4</td> <td>425.7 ± 1.4</td> <td>396.2 ± 1.6</td> </tr> <tr class="even"> <td><strong>1e-03</strong></td> <td>387.2 ± 0.8</td> <td>388.6 ± 1.0</td> <td>388.7 ± 0.6</td> <td>414.3 ± 2.5</td> <td><strong><em>386.0 ± 0.8</em></strong></td> <td>396.3 ± 1.9</td> <td>413.9 ± 2.4</td> <td>396.6 ± 0.9</td> </tr> <tr class="odd"> <td><strong>3e-04</strong></td> <td>389.2 ± 0.6</td> <td>391.1 ± 0.8</td> <td>391.3 ± 0.8</td> <td>407.9 ± 4.3</td> <td><strong><em>388.8 ± 0.6</em></strong></td> <td>397.7 ± 1.5</td> <td>408.7 ± 1.7</td> <td>398.8 ± 2.1</td> </tr> <tr class="even"> <td><strong>1e-04</strong></td> <td>403.9 ± 1.0</td> <td>403.9 ± 0.8</td> <td>404.2 ± 1.3</td> <td>419.7 ± 0.4</td> <td><strong><em>403.1 ± 1.2</em></strong></td> <td>416.8 ± 1.4</td> <td>419.9 ± 1.4</td> <td>418.1 ± 1.2</td> </tr> </tbody> </table> </div> <p>We see that LSTM Diff 2 (using a read gated state for write and forget gate computations) is actually slightly beneficial as compared to the pseudo LSTM. 
In fact, LSTM Diff 2 is neutral or beneficial in all cases where it is added. It turns out (at least for this task) that information that is irrelevant to the candidate write computation is also irrelevant to the gate computations.</p> <p>We see that LSTM Diff 1 (using a prior state for the candidate write that was gated using a read gate computed at the prior time step) is not significant, though it tends to be slightly harmful.</p> <p>Finally, we see that LSTM Diff 3 (using a read-gated state for the cell outputs) significantly harms performance, but that LSTM Diff 2 does a good job of recovering the loss.</p> <p>Thus, we conclude that LSTM Diff 2 is a worthwhile solo addition to the pseudo LSTM. The pseudo LSTM + LSTM Diff 2 was the winner at every tested learning rate, outperforming the basic LSTM by a significant margin.</p> <h3 id="extending-the-lstm">Extending the LSTM</h3> <p>At this point, we’ve completely derived the LSTM, we know why it works, and we know why each component of the LSTM is the way it is. We’ve also used our intuitions to create an LSTM variant that is empirically better than the basic LSTM on tests, and objectively better in the sense that it uses the most recent available information.</p> <p>We’ll now (very) briefly take a look at how this knowledge was applied in two recent and exciting innovations: highway and residual networks, and memory-augmented recurrent architectures.</p> <h4 id="highway-networks-and-residual-networks">Highway networks and residual networks</h4> <p>Two new architectures, highway networks and residual networks, draw on the intuitions of LSTMs to produce state-of-the-art results on tasks using feedforward networks. Very deep feedforward nets have historically been difficult to train for the very same reasons as recurrent architectures: even in the absence of a recurring function, gradients vanish and information morphs.
A residual network, introduced by <a href="https://arxiv.org/abs/1512.03385">He et al. (2015)</a>, won the ImageNet 2015 classification task by enabling the training of a very deep feedforward network. Highway networks, introduced by <a href="https://arxiv.org/abs/1505.00387">Srivastava et al. (2015)</a>, demonstrate a similar ability, and have shown impressive experimental results. Both residual networks and highway networks are an application of the fundamental principle of LSTMs to feedforward neural networks.</p> <p>Their derivation begins as a direct application of the fundamental principle of LSTMs:</p> <p>Let <span class="math inline">$$x_l$$</span> represent the network’s representation of the network inputs, <span class="math inline">$$x_0$$</span>, at layer <span class="math inline">$$l$$</span>. Then instead of transforming the current representation at each layer, <span class="math inline">$$x_{l+1} = T(x_l)$$</span>, we compute the delta to the current state: <span class="math inline">$$x_{l+1} = x_l + \Delta x_{l+1}$$</span>.</p> <p>However, in doing this, we run into the fundamental challenge of LSTMs: uncontrolled and uncoordinated deltas. Intuitively, the fundamental challenge is not as much of a challenge for feedforward networks. Even if the representation progresses uncontrollably as we move deeper through the network, the layers are no longer linked (there is no parameter sharing between layers), so that deeper layers can adapt to the increasing average level of chaos (and, if we apply batch normalization, the magnitude and variance of the chaos become less relevant). In any case, the fundamental challenge is still an issue, and just as the GRU and LSTM diverge in their treatment of this issue, so too do highway networks and residual networks.</p> <p>Highway networks overcome the challenge as does the LSTM: they train a write gate and a forget gate at each layer (in the absence of a recurring function, parameters are not shared across layers).
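</p>
<p>In code, such a highway-style layer might look like the following (a hypothetical NumPy sketch with separate write and carry gates; the names are assumptions, not from Srivastava et al.):</p>

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def highway_layer(x, W_h, b_h, W_t, b_t, W_c, b_c):
    """A highway-style layer: a gated update to the running representation.

    A candidate new representation H(x) is combined with the incoming x
    via a write gate t and a carry (forget) gate c. All weight names are
    illustrative assumptions for this sketch.
    """
    H = np.tanh(W_h @ x + b_h)    # candidate new representation
    t = sigmoid(W_t @ x + b_t)    # write gate
    c = sigmoid(W_c @ x + b_c)    # carry (forget) gate
    return t * H + c * x
```

<p>Closing the write gate and opening the carry gate passes the input through unchanged, which is what lets signals and gradients traverse very deep stacks.</p> <p>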
In Srivastava et al. (2015), the two gates are merged, as per the GRU, into a single overwrite gate. This does a good enough job of overcoming the fundamental challenge of LSTMs and enables the training of very deep feedforward networks.</p> <p>Residual networks take a slightly different approach. In order to control the deltas being written, residual networks use a multi-layer neural network to calculate them. This is a form of selectivity: it enables a much more precise delta calculation and is expressive enough to replace gating mechanisms entirely (observe that both are second order mechanisms that differ in how they are calculated). It’s likely that we can apply this same approach to an LSTM architecture in order to overcome the fundamental challenge of LSTMs in an RNN context (query whether it is more effective than using gates).</p> <h4 id="neural-turing-machine">Neural Turing Machine</h4> <p>As a second extension of the LSTM, consider the Neural Turing Machine (NTM), introduced in <a href="https://arxiv.org/abs/1410.5401">Graves et al. (2014)</a>, which is an example of a memory-augmented recurrent architecture.</p> <p>Recall that the reason the LSTM is backwards from our pseudo LSTM was that the main state was viewed as a memory cell separate from the rest of the RNN cell. The problem was that the rest of the cell’s state was represented by a mere shadow of the LSTM’s memory cell. NTMs take this memory cell view but fix the shadow state problem, by introducing three key architectural changes to the LSTM:</p> <ul> <li>Instead of a memory cell (represented by a state vector), they use a memory bank (represented by a state matrix), which is a “long” memory cell, in that instead of a state unit having a single real value, it has a vector of real values. This forces the memory bank to coordinate reads and writes to write entire memories and to retrieve entire memories at once.
In short, it is an opinionated approach that enforces organization within the state.</li> <li>The read, write and forget gates, now called read and write “heads” (where the write head represents both write and forget gates), are much more sophisticated and include several opinionated decisions as to their functionality. For example, a sparsity constraint is employed so that there is a limit to the amount of reading and writing done at each time step. To get around the limits of sparsity on each head, Graves et al. allow for multiple read and write heads.</li> <li>Instead of a shadow state, which is a mere image of the memory cell, NTMs have a “controller”, which coordinates the interaction between the RNN cell’s external inputs and outputs and the internal memory bank. The controller can be, e.g., an LSTM itself, thereby maintaining an independent state. In this sense, the NTM’s memory bank truly is separate from the rest of the RNN cell.</li> </ul> <p>The power of this architecture should be immediately clear: instead of reading and writing single numbers, we read and write vectors of numbers. This frees the rest of the network from having to coordinate groups of reads and writes, allowing it to focus on higher order tasks instead.</p> <p>This was a very brief introduction to a topic that I am not myself well acquainted with, so I encourage you to read the source: <a href="https://arxiv.org/abs/1410.5401">Graves et al. (2014)</a>.</p> <h3 id="conclusion">Conclusion</h3> <p>In this post, we’ve covered a lot of material, which has hopefully provided some powerful intuitions into recurrent architectures and neural networks generally.
You should now have a solid understanding of LSTMs and the motivations behind them, and hopefully have gotten some ideas about how to apply the principles of LSTMs to building deep recurrent and feedforward architectures.</p> <section class="footnotes"> <hr /> <ol> <li id="fn1"><p>A great introductory resource for the prerequisites is Andrew Ng’s <a href="https://www.coursera.org/learn/machine-learning/">machine learning</a> (first 5 weeks). A great intermediate resource is Andrej Karpathy’s <a href="http://cs231n.github.io/">CS231n</a>.<a href="#fnref1">↩</a></p></li> <li id="fn2"><p>Do educate yourself on <a href="http://www.bearsmart.com/play/bear-encounters/">bear safety</a>; your first thought may be “run”, but that’s not a good idea.<a href="#fnref2">↩</a></p></li> <li id="fn3"><p>A universal function approximator can emulate any (Borel measurable) function. Some smart people have proven mathematically that feedforward neural networks with a single, large hidden layer operate as universal function approximators. See <a href="http://neuralnetworksanddeeplearning.com/chap4.html">Michael Nielsen’s writeup</a> for the visual intuitions behind this, or refer to the original papers by <a href="http://deeplearning.cs.cmu.edu/pdfs/Kornick_et_al.pdf">Hornik et al. (1989)</a> and <a href="https://www.dartmouth.edu/~gvc/Cybenko_MCSS.pdf">Cybenko (1989)</a> for formal proofs.<a href="#fnref3">↩</a></p></li> <li id="fn4"><p>While commonly known as the vanishing and exploding gradient problem, in my view this name hides the true nature of the problem. The alternate name, vanishing and exploding <em>sensitivity</em>, is borrowed from <a href="https://arxiv.org/pdf/1410.5401.pdf">Graves et al.
(2014), Neural Turing Machines</a><a href="#fnref4">↩</a></p></li> <li id="fn5"><p>Credit to <a href="http://www.cs.utoronto.ca/~ilya/pubs/ilya_sutskever_phd_thesis.pdf">Ilya Sutskever’s thesis</a> for the butterfly effect reference.<a href="#fnref5">↩</a></p></li> <li id="fn6"><p>It is worth noting that there is a type of RNN, the “echo state” network, designed to take advantage of information morphing. It works by choosing an initial recurring function that is regular in the way information morphs, so that the state today is an “echo” of the past. In echo state networks, we don’t train the initial function (for that would change the way information morphs, making it unpredictable). Rather, we learn to interpret the state of the network from its outputs. Essentially, these networks take advantage of information morphing to impose a time signature on the morphing data, and we learn to be archeologists (e.g., in real life, we know how long ago dinosaurs lived by looking at the radioactive decay of the rocks surrounding their fossils).<a href="#fnref6">↩</a></p></li> <li id="fn7"><p>Pascanu et al. (2013) mention this stronger result in passing in Section 2.2 of their paper, but it is never explicitly justified.<a href="#fnref7">↩</a></p></li> <li id="fn8"><p>To see why this is the case, consider the following argument: <span class="math inline">$$\gamma$$</span> is the largest singular value of <span class="math inline">$$[\phi&#39;(c)]$$</span> (the Jacobian of <span class="math inline">$$\phi$$</span> evaluated at some vector <span class="math inline">$$c$$</span>) for all vectors <span class="math inline">$$c$$</span> on the interval <span class="math inline">$$[z_t,\ z_t + \Delta z_t]$$</span>. 
For point-wise non-linearities like the logistic sigmoid and tanh, <span class="math inline">$$[\phi&#39;(c)]$$</span> will be a diagonal matrix whose entry in row <span class="math inline">$$i$$</span>, column <span class="math inline">$$i$$</span> will be the derivative of <span class="math inline">$$\phi$$</span> evaluated at the <span class="math inline">$$i$$</span>th element of <span class="math inline">$$c$$</span>. Since <span class="math inline">$$[\phi&#39;(c)]$$</span> is a diagonal matrix, the absolute values of its diagonal entries are its singular values. Therefore, if <span class="math inline">$$\phi&#39;(x)$$</span> is bounded for all real numbers <span class="math inline">$$x$$</span>, so too will be the singular values of <span class="math inline">$$[\phi&#39;(c)]$$</span>, regardless of what <span class="math inline">$$c$$</span> is. The derivatives of the logistic sigmoid and tanh both reach their maximum values (upper bounds) of <span class="math inline">$$\frac{1}{4}$$</span> and <span class="math inline">$$1$$</span> respectively when evaluated at 0. Therefore, it follows that for the logistic sigmoid, <span class="math inline">$$\gamma \leq \frac{1}{4}$$</span>, and for tanh, <span class="math inline">$$\gamma \leq 1$$</span>.<a href="#fnref8">↩</a></p></li> <li id="fn9"><p>This is a more or less fair assumption, since our initial weights will be small and at least some of our activations will not be saturated to start, so that <span class="math inline">$$\gamma$$</span>, the supremum of the norm of the Jacobian of <span class="math inline">$$\tanh(z(s_t))$$</span> should be very close to 1.<a href="#fnref9">↩</a></p></li> <li id="fn10"><p>Note that the usage of <span class="math inline">$$\Delta$$</span> here is different than in the discussion of vanishing gradients above. 
Here the delta is from one time step to the next; above, the delta is between two state vectors at the same time step.<a href="#fnref10">↩</a></p></li> <li id="fn11"><p>See my post <a href="https://r2rt.com/recurrent-neural-networks-in-tensorflow-ii.html">RNNs in Tensorflow II</a> for more on layer normalization, which is a recent RNN add-on introduced by <a href="http://arxiv.org/abs/1607.06450">Lei Ba et al. (2016)</a>.<a href="#fnref11">↩</a></p></li> <li id="fn12"><p>This is actually quite natural once we get to the pseudo LSTM: any time the state interacts with anything but its own delta (i.e., writes to the state), it is squashed.<a href="#fnref12">↩</a></p></li> <li id="fn13"><p>The one <em>good</em> diagram of LSTMs includes the whole LSTM cell and can be found in <a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">Christopher Olah’s post on LSTMs</a>.<a href="#fnref13">↩</a></p></li> </ol> </section> </body> </html> Recurrent Neural Networks in Tensorflow II2016-07-25T00:00:00-04:002016-07-25T00:00:00-04:00Silviu Pitistag:r2rt.com,2016-07-25:/recurrent-neural-networks-in-tensorflow-ii.htmlThis is the second in a series of posts about recurrent neural networks in Tensorflow. In this post, we will build upon our vanilla RNN by learning how to use Tensorflow's scan and dynamic_rnn models, upgrading the RNN cell and stacking multiple RNNs, and adding dropout and layer normalization.
We will then use our upgraded RNN to generate some text, character by character.<!DOCTYPE html> <html> <head> <meta charset="utf-8"> <meta name="generator" content="pandoc"> <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes"> <title></title> <style type="text/css">code{white-space: pre;}</style> <script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-MML-AM_CHTML" type="text/javascript"></script> <!--[if lt IE 9]> <script src="//cdnjs.cloudflare.com/ajax/libs/html5shiv/3.7.3/html5shiv-printshiv.min.js"></script> <![endif]--> </head> <body> <p>This is the second in a series of posts about recurrent neural networks in Tensorflow. The first post lives <a href="https://r2rt.com/recurrent-neural-networks-in-tensorflow-i.html">here</a>. In this post, we will build upon our vanilla RNN by learning how to use Tensorflow’s scan and dynamic_rnn models, upgrading the RNN cell and stacking multiple RNNs, and adding dropout and layer normalization. We will then use our upgraded RNN to generate some text, character by character.</p> <p><strong>Note 3/14/2017</strong>: This tutorial is quite a bit deprecated by changes to the TF api. Leaving it up since it may still be useful, and most changes to the API are cosmetic (biggest change is that many of the RNN cells and functions are in the tf.contrib.rnn module). There was also a change to the ptb_iterator.
A (slightly modified) copy of the old version which should work until I update this tutorial is uploaded <a href="https://gist.github.com/spitis/2dd1720850154b25d2cec58d4b75c4a0">here</a>.</p> <h3 id="recap-of-our-model">Recap of our model</h3> <p>In the last post, we built a very simple, no frills RNN that was quickly able to learn to solve the toy task we created for it.</p> <p>Here is the formal statement of our model from last time:</p> <p><span class="math inline">$$S_t = \text{tanh}(W(X_t \ @ \ S_{t-1}) + b_s)$$</span></p> <p><span class="math inline">$$P_t = \text{softmax}(US_t + b_p)$$</span></p> <p>where <span class="math inline">$$@$$</span> represents vector concatenation, <span class="math inline">$$X_t \in R^n$$</span> is an input vector, <span class="math inline">$$W \in R^{d \times (n + d)}, \ b_s \in R^d, \ U \in R^{n \times d}$$</span>, <span class="math inline">$$b_p \in R^n$$</span>, <span class="math inline">$$n$$</span> is the size of the input and output vectors, and d is the size of the hidden state vector. At time step 0, <span class="math inline">$$S_{-1}$$</span> (the initial state) is initialized as a vector of zeros.</p> <h3 id="task-and-data">Task and data</h3> <p>This time around we will be building a character-level language model to generate character sequences, a la Andrej Karpathy’s <a href="https://github.com/karpathy/char-rnn">char-rnn</a> (and see, e.g., a Tensorflow implementation by Sherjil Ozair <a href="https://github.com/sherjilozair/char-rnn-tensorflow">here</a>).</p> <p>Why do something that’s already been done? Well, this is a much harder task than the toy model from last time. This model needs to handle long sequences and learn long time dependencies. That makes a great task for learning about adding features to our RNN, and seeing how our changes affect the results as we go.</p> <p>To start, let’s create our data generator. We’ll use the tiny-shakespeare corpus as our data, though we could use any plain text file. 
We’ll choose to use all of the characters in the text file as our vocabulary, treating lowercase and capital letters as separate characters. In practice, there may be some advantage to forcing the network to use similar representations for capital and lowercase letters by using the same one-hot representations for each, plus a binary flag to indicate whether or not the letter is a capital. Additionally, it is likely a good idea to restrict the vocabulary (i.e., the set of characters) used, by replacing uncommon characters with an UNK token (like a square: □).</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co">&quot;&quot;&quot;</span> <span class="co">Imports</span> <span class="co">&quot;&quot;&quot;</span> <span class="im">import</span> numpy <span class="im">as</span> np <span class="im">import</span> tensorflow <span class="im">as</span> tf <span class="op">%</span>matplotlib inline <span class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt <span class="im">import</span> time <span class="im">import</span> os <span class="im">import</span> urllib.request <span class="im">from</span> tensorflow.models.rnn.ptb <span class="im">import</span> reader</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co">&quot;&quot;&quot;</span> <span class="co">Load and process data, utility functions</span> <span class="co">&quot;&quot;&quot;</span> file_url <span class="op">=</span> <span class="st">&#39;https://raw.githubusercontent.com/jcjohnson/torch-rnn/master/data/tiny-shakespeare.txt&#39;</span> file_name <span class="op">=</span> <span class="st">&#39;tinyshakespeare.txt&#39;</span> <span class="cf">if</span> <span class="kw">not</span> os.path.exists(file_name): urllib.request.urlretrieve(file_url, file_name) <span class="cf">with</span> <span class="bu">open</span>(file_name,<span class="st">&#39;r&#39;</span>) <span
class="im">as</span> f: raw_data <span class="op">=</span> f.read() <span class="bu">print</span>(<span class="st">&quot;Data length:&quot;</span>, <span class="bu">len</span>(raw_data)) vocab <span class="op">=</span> <span class="bu">set</span>(raw_data) vocab_size <span class="op">=</span> <span class="bu">len</span>(vocab) idx_to_vocab <span class="op">=</span> <span class="bu">dict</span>(<span class="bu">enumerate</span>(vocab)) vocab_to_idx <span class="op">=</span> <span class="bu">dict</span>(<span class="bu">zip</span>(idx_to_vocab.values(), idx_to_vocab.keys())) data <span class="op">=</span> [vocab_to_idx[c] <span class="cf">for</span> c <span class="kw">in</span> raw_data] <span class="kw">del</span> raw_data <span class="kw">def</span> gen_epochs(n, num_steps, batch_size): <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(n): <span class="cf">yield</span> reader.ptb_iterator(data, batch_size, num_steps) <span class="kw">def</span> reset_graph(): <span class="cf">if</span> <span class="st">&#39;sess&#39;</span> <span class="kw">in</span> <span class="bu">globals</span>() <span class="kw">and</span> sess: sess.close() tf.reset_default_graph() <span class="kw">def</span> train_network(g, num_epochs, num_steps <span class="op">=</span> <span class="dv">200</span>, batch_size <span class="op">=</span> <span class="dv">32</span>, verbose <span class="op">=</span> <span class="va">True</span>, save<span class="op">=</span><span class="va">False</span>): tf.set_random_seed(<span class="dv">2345</span>) <span class="cf">with</span> tf.Session() <span class="im">as</span> sess: sess.run(tf.initialize_all_variables()) training_losses <span class="op">=</span> [] <span class="cf">for</span> idx, epoch <span class="kw">in</span> <span class="bu">enumerate</span>(gen_epochs(num_epochs, num_steps, batch_size)): training_loss <span class="op">=</span> <span class="dv">0</span> steps <span class="op">=</span> <span 
class="dv">0</span> training_state <span class="op">=</span> <span class="va">None</span> <span class="cf">for</span> X, Y <span class="kw">in</span> epoch: steps <span class="op">+=</span> <span class="dv">1</span> feed_dict<span class="op">=</span>{g[<span class="st">&#39;x&#39;</span>]: X, g[<span class="st">&#39;y&#39;</span>]: Y} <span class="cf">if</span> training_state <span class="kw">is</span> <span class="kw">not</span> <span class="va">None</span>: feed_dict[g[<span class="st">&#39;init_state&#39;</span>]] <span class="op">=</span> training_state training_loss_, training_state, _ <span class="op">=</span> sess.run([g[<span class="st">&#39;total_loss&#39;</span>], g[<span class="st">&#39;final_state&#39;</span>], g[<span class="st">&#39;train_step&#39;</span>]], feed_dict) training_loss <span class="op">+=</span> training_loss_ <span class="cf">if</span> verbose: <span class="bu">print</span>(<span class="st">&quot;Average training loss for Epoch&quot;</span>, idx, <span class="st">&quot;:&quot;</span>, training_loss<span class="op">/</span>steps) training_losses.append(training_loss<span class="op">/</span>steps) <span class="cf">if</span> <span class="bu">isinstance</span>(save, <span class="bu">str</span>): g[<span class="st">&#39;saver&#39;</span>].save(sess, save) <span class="cf">return</span> training_losses</code></pre></div> <pre><code>Data length: 1115394</code></pre> <h3 id="using-tf.scan-and-dynamic_rnn-to-speed-things-up">Using tf.scan and dynamic_rnn to speed things up</h3> <p>Recall from <a href="https://r2rt.com/recurrent-neural-networks-in-tensorflow-i.html">last post</a> that we represented each duplicate tensor of our RNN (e.g., the rnn inputs, rnn outputs, the predictions and the loss) as a <em>list</em> of tensors:</p> <figure> <img src="https://r2rt.com/static/images/BasicRNNLabeled.png" alt="Diagram of Basic RNN - Labeled" /><figcaption>Diagram of Basic RNN - Labeled</figcaption> </figure> <p>This worked quite well for our toy task, 
because our longest dependency was 7 steps back and we never really needed to backpropagate errors more than 10 steps. Even with a word-level RNN, using lists will probably be sufficient. See, e.g., my post on <a href="http://r2rt.com/styles-of-truncated-backpropagation.html">Styles of Truncated Backpropagation</a>, where I build a 40-step graph with no problems. But for a character-level model, 40 characters isn’t a whole lot. We might want to capture much longer dependencies. So let’s see what happens when we build a graph that is 200 time steps wide:</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> build_basic_rnn_graph_with_list( state_size <span class="op">=</span> <span class="dv">100</span>, num_classes <span class="op">=</span> vocab_size, batch_size <span class="op">=</span> <span class="dv">32</span>, num_steps <span class="op">=</span> <span class="dv">200</span>, learning_rate <span class="op">=</span> <span class="fl">1e-4</span>): reset_graph() x <span class="op">=</span> tf.placeholder(tf.int32, [batch_size, num_steps], name<span class="op">=</span><span class="st">&#39;input_placeholder&#39;</span>) y <span class="op">=</span> tf.placeholder(tf.int32, [batch_size, num_steps], name<span class="op">=</span><span class="st">&#39;labels_placeholder&#39;</span>) x_one_hot <span class="op">=</span> tf.one_hot(x, num_classes) rnn_inputs <span class="op">=</span> [tf.squeeze(i,squeeze_dims<span class="op">=</span>[<span class="dv">1</span>]) <span class="cf">for</span> i <span class="kw">in</span> tf.split(<span class="dv">1</span>, num_steps, x_one_hot)] cell <span class="op">=</span> tf.nn.rnn_cell.BasicRNNCell(state_size) init_state <span class="op">=</span> cell.zero_state(batch_size, tf.float32) rnn_outputs, final_state <span class="op">=</span> tf.nn.rnn(cell, rnn_inputs, initial_state<span class="op">=</span>init_state) <span class="cf">with</span> tf.variable_scope(<span 
class="st">&#39;softmax&#39;</span>): W <span class="op">=</span> tf.get_variable(<span class="st">&#39;W&#39;</span>, [state_size, num_classes]) b <span class="op">=</span> tf.get_variable(<span class="st">&#39;b&#39;</span>, [num_classes], initializer<span class="op">=</span>tf.constant_initializer(<span class="fl">0.0</span>)) logits <span class="op">=</span> [tf.matmul(rnn_output, W) <span class="op">+</span> b <span class="cf">for</span> rnn_output <span class="kw">in</span> rnn_outputs] y_as_list <span class="op">=</span> [tf.squeeze(i, squeeze_dims<span class="op">=</span>[<span class="dv">1</span>]) <span class="cf">for</span> i <span class="kw">in</span> tf.split(<span class="dv">1</span>, num_steps, y)] loss_weights <span class="op">=</span> [tf.ones([batch_size]) <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(num_steps)] losses <span class="op">=</span> tf.nn.seq2seq.sequence_loss_by_example(logits, y_as_list, loss_weights) total_loss <span class="op">=</span> tf.reduce_mean(losses) train_step <span class="op">=</span> tf.train.AdamOptimizer(learning_rate).minimize(total_loss) <span class="cf">return</span> <span class="bu">dict</span>( x <span class="op">=</span> x, y <span class="op">=</span> y, init_state <span class="op">=</span> init_state, final_state <span class="op">=</span> final_state, total_loss <span class="op">=</span> total_loss, train_step <span class="op">=</span> train_step )</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">t <span class="op">=</span> time.time() build_basic_rnn_graph_with_list() <span class="bu">print</span>(<span class="st">&quot;It took&quot;</span>, time.time() <span class="op">-</span> t, <span class="st">&quot;seconds to build the graph.&quot;</span>)</code></pre></div> <pre><code>It took 5.626644849777222 seconds to build the graph.</code></pre> <p>It took over 5 seconds to build the graph of the most basic RNN model! 
This could be bad… what happens when we move up to a 3-layer LSTM?</p> <p>Below, we switch out the RNN cell for a Multi-layer LSTM cell. We’ll go over the details of how to do this in the next section.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> build_multilayer_lstm_graph_with_list( state_size <span class="op">=</span> <span class="dv">100</span>, num_classes <span class="op">=</span> vocab_size, batch_size <span class="op">=</span> <span class="dv">32</span>, num_steps <span class="op">=</span> <span class="dv">200</span>, num_layers <span class="op">=</span> <span class="dv">3</span>, learning_rate <span class="op">=</span> <span class="fl">1e-4</span>): reset_graph() x <span class="op">=</span> tf.placeholder(tf.int32, [batch_size, num_steps], name<span class="op">=</span><span class="st">&#39;input_placeholder&#39;</span>) y <span class="op">=</span> tf.placeholder(tf.int32, [batch_size, num_steps], name<span class="op">=</span><span class="st">&#39;labels_placeholder&#39;</span>) embeddings <span class="op">=</span> tf.get_variable(<span class="st">&#39;embedding_matrix&#39;</span>, [num_classes, state_size]) rnn_inputs <span class="op">=</span> [tf.squeeze(i) <span class="cf">for</span> i <span class="kw">in</span> tf.split(<span class="dv">1</span>, num_steps, tf.nn.embedding_lookup(embeddings, x))] cell <span class="op">=</span> tf.nn.rnn_cell.LSTMCell(state_size, state_is_tuple<span class="op">=</span><span class="va">True</span>) cell <span class="op">=</span> tf.nn.rnn_cell.MultiRNNCell([cell] <span class="op">*</span> num_layers, state_is_tuple<span class="op">=</span><span class="va">True</span>) init_state <span class="op">=</span> cell.zero_state(batch_size, tf.float32) rnn_outputs, final_state <span class="op">=</span> tf.nn.rnn(cell, rnn_inputs, initial_state<span class="op">=</span>init_state) <span class="cf">with</span> tf.variable_scope(<span
class="st">&#39;softmax&#39;</span>): W <span class="op">=</span> tf.get_variable(<span class="st">&#39;W&#39;</span>, [state_size, num_classes]) b <span class="op">=</span> tf.get_variable(<span class="st">&#39;b&#39;</span>, [num_classes], initializer<span class="op">=</span>tf.constant_initializer(<span class="fl">0.0</span>)) logits <span class="op">=</span> [tf.matmul(rnn_output, W) <span class="op">+</span> b <span class="cf">for</span> rnn_output <span class="kw">in</span> rnn_outputs] y_as_list <span class="op">=</span> [tf.squeeze(i, squeeze_dims<span class="op">=</span>[<span class="dv">1</span>]) <span class="cf">for</span> i <span class="kw">in</span> tf.split(<span class="dv">1</span>, num_steps, y)] loss_weights <span class="op">=</span> [tf.ones([batch_size]) <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(num_steps)] losses <span class="op">=</span> tf.nn.seq2seq.sequence_loss_by_example(logits, y_as_list, loss_weights) total_loss <span class="op">=</span> tf.reduce_mean(losses) train_step <span class="op">=</span> tf.train.AdamOptimizer(learning_rate).minimize(total_loss) <span class="cf">return</span> <span class="bu">dict</span>( x <span class="op">=</span> x, y <span class="op">=</span> y, init_state <span class="op">=</span> init_state, final_state <span class="op">=</span> final_state, total_loss <span class="op">=</span> total_loss, train_step <span class="op">=</span> train_step )</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">t <span class="op">=</span> time.time() build_multilayer_lstm_graph_with_list() <span class="bu">print</span>(<span class="st">&quot;It took&quot;</span>, time.time() <span class="op">-</span> t, <span class="st">&quot;seconds to build the graph.&quot;</span>)</code></pre></div> <pre><code>It took 25.640846967697144 seconds to build the graph.</code></pre> <p>Yikes, almost 30 seconds.</p> <p>Now this isn’t that big of an 
issue for training, because we only need to build the graph once. It could be a big issue, however, if we need to build the graph multiple times at test time.</p> <p>To get around this long compile time, Tensorflow allows us to create the graph at runtime. Here is a quick demonstration of the difference, using Tensorflow’s dynamic_rnn function:</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> build_multilayer_lstm_graph_with_dynamic_rnn( state_size <span class="op">=</span> <span class="dv">100</span>, num_classes <span class="op">=</span> vocab_size, batch_size <span class="op">=</span> <span class="dv">32</span>, num_steps <span class="op">=</span> <span class="dv">200</span>, num_layers <span class="op">=</span> <span class="dv">3</span>, learning_rate <span class="op">=</span> <span class="fl">1e-4</span>): reset_graph() x <span class="op">=</span> tf.placeholder(tf.int32, [batch_size, num_steps], name<span class="op">=</span><span class="st">&#39;input_placeholder&#39;</span>) y <span class="op">=</span> tf.placeholder(tf.int32, [batch_size, num_steps], name<span class="op">=</span><span class="st">&#39;labels_placeholder&#39;</span>) embeddings <span class="op">=</span> tf.get_variable(<span class="st">&#39;embedding_matrix&#39;</span>, [num_classes, state_size]) <span class="co"># Note that our inputs are no longer a list, but a tensor of dims batch_size x num_steps x state_size</span> rnn_inputs <span class="op">=</span> tf.nn.embedding_lookup(embeddings, x) cell <span class="op">=</span> tf.nn.rnn_cell.LSTMCell(state_size, state_is_tuple<span class="op">=</span><span class="va">True</span>) cell <span class="op">=</span> tf.nn.rnn_cell.MultiRNNCell([cell] <span class="op">*</span> num_layers, state_is_tuple<span class="op">=</span><span class="va">True</span>) init_state <span class="op">=</span> cell.zero_state(batch_size, tf.float32) rnn_outputs, final_state <span class="op">=</span> 
tf.nn.dynamic_rnn(cell, rnn_inputs, initial_state<span class="op">=</span>init_state) <span class="cf">with</span> tf.variable_scope(<span class="st">&#39;softmax&#39;</span>): W <span class="op">=</span> tf.get_variable(<span class="st">&#39;W&#39;</span>, [state_size, num_classes]) b <span class="op">=</span> tf.get_variable(<span class="st">&#39;b&#39;</span>, [num_classes], initializer<span class="op">=</span>tf.constant_initializer(<span class="fl">0.0</span>)) <span class="co">#reshape rnn_outputs and y so we can get the logits in a single matmul</span> rnn_outputs <span class="op">=</span> tf.reshape(rnn_outputs, [<span class="op">-</span><span class="dv">1</span>, state_size]) y_reshaped <span class="op">=</span> tf.reshape(y, [<span class="op">-</span><span class="dv">1</span>]) logits <span class="op">=</span> tf.matmul(rnn_outputs, W) <span class="op">+</span> b total_loss <span class="op">=</span> tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits, y_reshaped)) train_step <span class="op">=</span> tf.train.AdamOptimizer(learning_rate).minimize(total_loss) <span class="cf">return</span> <span class="bu">dict</span>( x <span class="op">=</span> x, y <span class="op">=</span> y, init_state <span class="op">=</span> init_state, final_state <span class="op">=</span> final_state, total_loss <span class="op">=</span> total_loss, train_step <span class="op">=</span> train_step )</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">t <span class="op">=</span> time.time() build_multilayer_lstm_graph_with_dynamic_rnn() <span class="bu">print</span>(<span class="st">&quot;It took&quot;</span>, time.time() <span class="op">-</span> t, <span class="st">&quot;seconds to build the graph.&quot;</span>)</code></pre></div> <pre><code>It took 0.5314393043518066 seconds to build the graph.</code></pre> <p>Much better. 
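</p>
<p>As an aside, the reshape trick in the code above is worth pausing on: because the softmax weights are shared across time steps, we can collapse the [batch_size, num_steps, state_size] outputs into a [batch_size * num_steps, state_size] matrix and compute every logit with a single matmul, provided the labels are flattened in the same (batch-major) order so that row i of the logits still lines up with label i. A minimal sketch of this alignment, using plain nested lists in place of tensors (the names and values are illustrative only):</p>

```python
# Sketch: flattening [batch, steps, size] outputs so one matmul can compute
# all logits at once. Nested lists stand in for tensors; only shapes matter.
batch_size, num_steps, state_size = 2, 3, 4

# rnn_outputs[b][t] is the state-size output vector at batch b, time t
rnn_outputs = [[[b * 100 + t * 10 + s for s in range(state_size)]
                for t in range(num_steps)] for b in range(batch_size)]
# y[b][t] is the integer label at batch b, time t
y = [[b * 10 + t for t in range(num_steps)] for b in range(batch_size)]

# tf.reshape(rnn_outputs, [-1, state_size]) flattens batch/time into rows
flat_outputs = [vec for batch in rnn_outputs for vec in batch]
# tf.reshape(y, [-1]) flattens the labels in the same batch-major order
flat_y = [label for batch in y for label in batch]

# row i of flat_outputs still corresponds to label i of flat_y
assert len(flat_outputs) == batch_size * num_steps
assert flat_outputs[4] == rnn_outputs[1][1] and flat_y[4] == y[1][1]
```

<p>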
One would think that pushing the graph construction to execution time would cause execution of the graph to go slower, but in this case, using dynamic_rnn actually speeds things up:</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">g <span class="op">=</span> build_multilayer_lstm_graph_with_list() t <span class="op">=</span> time.time() train_network(g, <span class="dv">3</span>) <span class="bu">print</span>(<span class="st">&quot;It took&quot;</span>, time.time() <span class="op">-</span> t, <span class="st">&quot;seconds to train for 3 epochs.&quot;</span>)</code></pre></div> <pre><code>Average training loss for Epoch 0 : 3.53323210245 Average training loss for Epoch 1 : 3.31435756163 Average training loss for Epoch 2 : 3.21755325109 It took 117.78161263465881 seconds to train for 3 epochs.</code></pre> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">g <span class="op">=</span> build_multilayer_lstm_graph_with_dynamic_rnn() t <span class="op">=</span> time.time() train_network(g, <span class="dv">3</span>) <span class="bu">print</span>(<span class="st">&quot;It took&quot;</span>, time.time() <span class="op">-</span> t, <span class="st">&quot;seconds to train for 3 epochs.&quot;</span>)</code></pre></div> <pre><code>Average training loss for Epoch 0 : 3.55792756053 Average training loss for Epoch 1 : 3.3225021006 Average training loss for Epoch 2 : 3.28286816745 It took 96.69413661956787 seconds to train for 3 epochs.</code></pre> <p>It’s not a breeze to work through and understand the dynamic_rnn code (which lives <a href="https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/rnn_cell.py">here</a>), but we can obtain a similar result ourselves by using tf.scan (dynamic_rnn does not use scan). 
Scan runs just a tad slower than Tensorflow’s optimized code, but is easier to understand and write yourself.</p> <p>Scan is a higher-order function that you might be familiar with if you’ve done any programming in OCaml, Haskell or the like. In general, it takes a function (<span class="math inline">$$f: (x_t, y_{t-1}) \mapsto y_t$$</span>), a sequence (<span class="math inline">$$[x_0, x_1 \dots x_n]$$</span>) and an initial value (<span class="math inline">$$y_{-1}$$</span>) and returns a sequence (<span class="math inline">$$[y_0, y_1 \dots y_n]$$</span>) according to the rule: <span class="math inline">$$y_t = f(x_t, y_{t-1})$$</span>. In Tensorflow, scan treats the first dimension of a Tensor as the sequence. Thus, if fed a Tensor of shape [n, m, o] as the sequence, scan would unpack it into a sequence of n-tensors, each with shape [m, o]. You can learn more about Tensorflow’s scan <a href="https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/api_docs/python/functional_ops.md#tfscanfn-elems-initializernone-parallel_iterations10-back_proptrue-swap_memoryfalse-namenone-scan">here</a>.</p> <p>Below, I use scan with an LSTM so as to compare to the dynamic_rnn using Tensorflow above. Because LSTMs store their state in a 2-tuple, and we’re using a 3-layer network, the scan function produces, as <code>final_states</code> below, a 3-tuple (one for each layer) of 2-tuples (one for each LSTM state), each of shape [num_steps, batch_size, state_size]. We need only the last state, which is why we unpack, slice and repack <code>final_states</code> to get <code>final_state</code> below.</p> <p>Another thing to note is that scan produces rnn_outputs with shape [num_steps, batch_size, state_size], whereas the dynamic_rnn produces rnn_outputs with shape [batch_size, num_steps, state_size] (the first two dimensions are switched). Dynamic_rnn has the flexibility to switch this behavior, using the “time_major” argument. 
Tf.scan does not have this flexibility, which is why we transpose <code>rnn_inputs</code> and <code>y</code> in the code below.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> build_multilayer_lstm_graph_with_scan( state_size <span class="op">=</span> <span class="dv">100</span>, num_classes <span class="op">=</span> vocab_size, batch_size <span class="op">=</span> <span class="dv">32</span>, num_steps <span class="op">=</span> <span class="dv">200</span>, num_layers <span class="op">=</span> <span class="dv">3</span>, learning_rate <span class="op">=</span> <span class="fl">1e-4</span>): reset_graph() x <span class="op">=</span> tf.placeholder(tf.int32, [batch_size, num_steps], name<span class="op">=</span><span class="st">&#39;input_placeholder&#39;</span>) y <span class="op">=</span> tf.placeholder(tf.int32, [batch_size, num_steps], name<span class="op">=</span><span class="st">&#39;labels_placeholder&#39;</span>) embeddings <span class="op">=</span> tf.get_variable(<span class="st">&#39;embedding_matrix&#39;</span>, [num_classes, state_size]) rnn_inputs <span class="op">=</span> tf.nn.embedding_lookup(embeddings, x) cell <span class="op">=</span> tf.nn.rnn_cell.LSTMCell(state_size, state_is_tuple<span class="op">=</span><span class="va">True</span>) cell <span class="op">=</span> tf.nn.rnn_cell.MultiRNNCell([cell] <span class="op">*</span> num_layers, state_is_tuple<span class="op">=</span><span class="va">True</span>) init_state <span class="op">=</span> cell.zero_state(batch_size, tf.float32) rnn_outputs, final_states <span class="op">=</span> <span class="op">\</span> tf.scan(<span class="kw">lambda</span> a, x: cell(x, a[<span class="dv">1</span>]), tf.transpose(rnn_inputs, [<span class="dv">1</span>,<span class="dv">0</span>,<span class="dv">2</span>]), initializer<span class="op">=</span>(tf.zeros([batch_size, state_size]), init_state)) <span class="co"># there may be a better way to 
do this:</span> final_state <span class="op">=</span> <span class="bu">tuple</span>([tf.nn.rnn_cell.LSTMStateTuple( tf.squeeze(tf.<span class="bu">slice</span>(c, [num_steps<span class="dv">-1</span>,<span class="dv">0</span>,<span class="dv">0</span>], [<span class="dv">1</span>, batch_size, state_size])), tf.squeeze(tf.<span class="bu">slice</span>(h, [num_steps<span class="dv">-1</span>,<span class="dv">0</span>,<span class="dv">0</span>], [<span class="dv">1</span>, batch_size, state_size]))) <span class="cf">for</span> c, h <span class="kw">in</span> final_states]) <span class="cf">with</span> tf.variable_scope(<span class="st">&#39;softmax&#39;</span>): W <span class="op">=</span> tf.get_variable(<span class="st">&#39;W&#39;</span>, [state_size, num_classes]) b <span class="op">=</span> tf.get_variable(<span class="st">&#39;b&#39;</span>, [num_classes], initializer<span class="op">=</span>tf.constant_initializer(<span class="fl">0.0</span>)) rnn_outputs <span class="op">=</span> tf.reshape(rnn_outputs, [<span class="op">-</span><span class="dv">1</span>, state_size]) y_reshaped <span class="op">=</span> tf.reshape(tf.transpose(y,[<span class="dv">1</span>,<span class="dv">0</span>]), [<span class="op">-</span><span class="dv">1</span>]) logits <span class="op">=</span> tf.matmul(rnn_outputs, W) <span class="op">+</span> b total_loss <span class="op">=</span> tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits, y_reshaped)) train_step <span class="op">=</span> tf.train.AdamOptimizer(learning_rate).minimize(total_loss) <span class="cf">return</span> <span class="bu">dict</span>( x <span class="op">=</span> x, y <span class="op">=</span> y, init_state <span class="op">=</span> init_state, final_state <span class="op">=</span> final_state, total_loss <span class="op">=</span> total_loss, train_step <span class="op">=</span> train_step )</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">t 
<span class="op">=</span> time.time() g <span class="op">=</span> build_multilayer_lstm_graph_with_scan() <span class="bu">print</span>(<span class="st">&quot;It took&quot;</span>, time.time() <span class="op">-</span> t, <span class="st">&quot;seconds to build the graph.&quot;</span>) t <span class="op">=</span> time.time() train_network(g, <span class="dv">3</span>) <span class="bu">print</span>(<span class="st">&quot;It took&quot;</span>, time.time() <span class="op">-</span> t, <span class="st">&quot;seconds to train for 3 epochs.&quot;</span>)</code></pre></div> <pre><code>It took 0.6475389003753662 seconds to build the graph. Average training loss for Epoch 0 : 3.55362293501 Average training loss for Epoch 1 : 3.32045680079 Average training loss for Epoch 2 : 3.27433713688 It took 101.60246014595032 seconds to train for 3 epochs.</code></pre> <p>Using scan was only marginally slower than using dynamic_rnn, and gives us the flexibility and understanding to tweak the code if we ever need to (e.g., if for some reason we wanted to create a skip connection from the state at timestep t-2 to timestep t, it would be easy to do with scan).</p> <h3 id="upgrading-the-rnn-cell">Upgrading the RNN cell</h3> <p>Above, we seamlessly swapped out the BasicRNNCell we were using for a Multi-layered LSTM cell. This was possible because the RNN cells conform to a general structure: every RNN cell is a function of the current input, <span class="math inline">$$X_t$$</span>, and the prior state, <span class="math inline">$$S_{t-1}$$</span>, that outputs a current state, <span class="math inline">$$S_{t}$$</span>, and a current output, <span class="math inline">$$Y_t$$</span>. 
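</p>
<p>This contract is small enough to state directly in code. Here is a toy sketch of it in plain Python (scalar weights stand in for the weight matrices, and the class name is made up for illustration):</p>

```python
import math

# An RNN cell maps (current input x_t, prior state s_prev) to
# (current output y_t, current state s_t).
class ToyBasicRNNCell:
    def __init__(self, w_x=0.5, w_s=0.5):
        self.w_x, self.w_s = w_x, w_s  # scalar stand-ins for weight matrices

    def __call__(self, x, s_prev):
        s = math.tanh(self.w_x * x + self.w_s * s_prev)
        return s, s  # for a basic RNN cell, the output equals the state

cell = ToyBasicRNNCell()
state = 0.0
for x in [1.0, -1.0, 0.5]:
    output, state = cell(x, state)
```

<p>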
Thus, in the same way that we can swap out activation functions in a feedforward net (e.g., change the tanh activation to a sigmoid or a relu activation), we can swap out the entire recurrence function (cell) in an RNN.</p> <p>Note that while for basic RNN cells, the current output equals the current state (<span class="math inline">$$Y_t = S_t$$</span>), this does not have to be the case. We’ll see how LSTMs and multi-layered RNNs diverge from this below.</p> <p>Two popular choices for RNN cells are the GRU cell and the LSTM cell. By using gates, GRU and LSTM cells avoid the vanishing gradient problem and allow the network to learn longer-term dependencies. Their internals are quite complicated, and I would refer you to my post <a href="https://r2rt.com/written-memories-understanding-deriving-and-extending-the-lstm.html">Written Memories: Understanding, Deriving and Extending the LSTM</a> for a good starting point to learn about them.</p> <p>All we have to do to upgrade our vanilla RNN cell is to replace this line:</p> <pre><code>cell = tf.nn.rnn_cell.BasicRNNCell(state_size)</code></pre> <p>with this for LSTM:</p> <pre><code>cell = tf.nn.rnn_cell.LSTMCell(state_size)</code></pre> <p>or this for GRU:</p> <pre><code>cell = tf.nn.rnn_cell.GRUCell(state_size)</code></pre> <p>The LSTM keeps two sets of internal state vectors, <span class="math inline">$$c$$</span> (for memory <strong>c</strong>ell or <strong>c</strong>onstant error carousel) and <span class="math inline">$$h$$</span> (for <strong>h</strong>idden state). By default, they are concatenated into a single vector, but as of this writing, using the default arguments to LSTMCell will produce a warning message:</p> <pre><code>WARNING:tensorflow:&lt;tensorflow.python.ops.rnn_cell.LSTMCell object at 0x7faade1708d0&gt;: Using a concatenated state is slower and will soon be deprecated. 
Use state_is_tuple=True.</code></pre> <p>This warning tells us that it’s faster to represent the LSTM state as a tuple of <span class="math inline">$$c$$</span> and <span class="math inline">$$h$$</span>, rather than as a concatenation of <span class="math inline">$$c$$</span> and <span class="math inline">$$h$$</span>. You can tack on the argument <code>state_is_tuple=True</code> to have it do that.</p> <p>By using a tuple for the state, we can also easily replace the base cell with a “MultiRNNCell” for multiple layers. To see why this works, consider that while a single cell:</p> <figure> <img src="https://r2rt.com/static/images/RNN_BasicRNNCell.png" alt="Diagram of Basic RNN Cell" /><figcaption>Diagram of Basic RNN Cell</figcaption> </figure> <p>looks different from two cells stacked on top of each other:</p> <figure> <img src="https://r2rt.com/static/images/RNN_MultiRNNCellUngrouped.png" alt="Diagram of Multi RNN Cell 1" /><figcaption>Diagram of Multi RNN Cell 1</figcaption> </figure> <p>we can wrap the two cells into a single two-layer cell to make them look and behave as a single cell:</p> <figure> <img src="https://r2rt.com/static/images/RNN_MultiRNNCellGrouped.png" alt="Diagram of Multi RNN Cell 2" /><figcaption>Diagram of Multi RNN Cell 2</figcaption> </figure> <p>To make this switch, we call <code>tf.nn.rnn_cell.MultiRNNCell</code>, which takes a <em>list</em> of RNNCells as its inputs and wraps them into a single cell:</p> <pre><code>cell = tf.nn.rnn_cell.MultiRNNCell([tf.nn.rnn_cell.BasicRNNCell(state_size)] * num_layers)</code></pre> <p>Note that if you are wrapping an LSTMCell that uses <code>state_is_tuple=True</code>, you should pass this same argument to the MultiRNNCell as well.</p> <h3 id="writing-a-custom-rnn-cell">Writing a custom RNN cell</h3> <p>It’s almost too easy to use the standard GRU or LSTM cells, so let’s define our own RNN cell.
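</p>
<p>Before we do, the wrapping trick from the previous section is worth making concrete: a stack of cells, each obeying the (input, state) to (output, state) contract, itself obeys that same contract, with a tuple of per-layer states as its state. A toy sketch in plain Python (scalar cells stand in for real RNN cells; the names are made up for illustration):</p>

```python
import math

def toy_cell(x, s_prev):
    """A scalar stand-in for an RNN cell: output equals state."""
    s = math.tanh(0.5 * x + 0.5 * s_prev)
    return s, s

class ToyMultiCell:
    """Wrap a list of cells so the stack behaves as a single cell."""
    def __init__(self, cells):
        self.cells = cells

    def __call__(self, x, states):
        new_states = []
        for cell, s in zip(self.cells, states):
            x, new_s = cell(x, s)  # each layer's output feeds the next layer
            new_states.append(new_s)
        return x, tuple(new_states)  # same (output, state) contract as one cell

stack = ToyMultiCell([toy_cell, toy_cell])
output, states = stack(1.0, (0.0, 0.0))
```

<p>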
Here’s a random idea that may or may not work: starting with a GRU cell, instead of taking a single transformation of its input, we enable it to take a weighted average of multiple transformations of its input. That is, using the notation from <a href="http://arxiv.org/pdf/1406.1078v3.pdf">Cho et al. (2014)</a>, instead of using <span class="math inline">$$Wx$$</span> in our candidate state, <span class="math inline">$$\tilde h^{(t)} = \text{tanh}(Wx + U(r \odot h^{(t-1)}))$$</span>, we use a weighted average of <span class="math inline">$$W_1 x, \ W_2 x \dots W_n x$$</span> for some n. In other words, we will replace <span class="math inline">$$Wx$$</span> with <span class="math inline">$$\Sigma\lambda_iW_ix$$</span> for some weights <span class="math inline">$$\lambda_i$$</span> that sum to 1. The vector of weights, <span class="math inline">$$\lambda$$</span>, will be calculated as <span class="math inline">$$\lambda = \text{softmax}(W_{avg}x^{(t)} + U_{avg}h^{(t-1)} + b)$$</span>. The idea is that we might benefit from treating the input differently in different scenarios (e.g., we may want to treat verbs differently than nouns).</p> <p>To write the custom cell, we need to extend tf.nn.rnn_cell.RNNCell. Specifically, we need to fill in 3 abstract methods and write an <code>__init__</code> method (take a look at the Tensorflow code <a href="https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/rnn_cell.py">here</a>). First, let’s start with a GRU cell, adapted from Tensorflow’s implementation:</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">class</span> GRUCell(tf.nn.rnn_cell.RNNCell): <span class="co">&quot;&quot;&quot;Gated Recurrent Unit cell (cf.
http://arxiv.org/abs/1406.1078).&quot;&quot;&quot;</span> <span class="kw">def</span> <span class="fu">__init__</span>(<span class="va">self</span>, num_units): <span class="va">self</span>._num_units <span class="op">=</span> num_units <span class="at">@property</span> <span class="kw">def</span> state_size(<span class="va">self</span>): <span class="cf">return</span> <span class="va">self</span>._num_units <span class="at">@property</span> <span class="kw">def</span> output_size(<span class="va">self</span>): <span class="cf">return</span> <span class="va">self</span>._num_units <span class="kw">def</span> <span class="fu">__call__</span>(<span class="va">self</span>, inputs, state, scope<span class="op">=</span><span class="va">None</span>): <span class="cf">with</span> tf.variable_scope(scope <span class="kw">or</span> <span class="bu">type</span>(<span class="va">self</span>).<span class="va">__name__</span>): <span class="co"># &quot;GRUCell&quot;</span> <span class="cf">with</span> tf.variable_scope(<span class="st">&quot;Gates&quot;</span>): <span class="co"># Reset gate and update gate.</span> <span class="co"># We start with bias of 1.0 to not reset and not update.</span> ru <span class="op">=</span> tf.nn.rnn_cell._linear([inputs, state], <span class="dv">2</span> <span class="op">*</span> <span class="va">self</span>._num_units, <span class="va">True</span>, <span class="fl">1.0</span>) ru <span class="op">=</span> tf.nn.sigmoid(ru) r, u <span class="op">=</span> tf.split(<span class="dv">1</span>, <span class="dv">2</span>, ru) <span class="cf">with</span> tf.variable_scope(<span class="st">&quot;Candidate&quot;</span>): c <span class="op">=</span> tf.nn.tanh(tf.nn.rnn_cell._linear([inputs, r <span class="op">*</span> state], <span class="va">self</span>._num_units, <span class="va">True</span>)) new_h <span class="op">=</span> u <span class="op">*</span> state <span class="op">+</span> (<span class="dv">1</span> <span class="op">-</span> u) <span 
class="op">*</span> c <span class="cf">return</span> new_h, new_h</code></pre></div> <p>We modify the <code>__init__</code> method to take a parameter <span class="math inline">$$n$$</span> at initialization, which will determine the number of transformation matrices <span class="math inline">$$W_i$$</span> it will create:</p> <pre><code>def __init__(self, num_units, num_weights): self._num_units = num_units self._num_weights = num_weights</code></pre> <p>Then, we modify the <code>Candidate</code> variable scope of the <code>__call__</code> method to do a weighted average as shown below (note that all of the <span class="math inline">$$W_i$$</span> matrices are created as a single variable and then split into multiple tensors):</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">class</span> CustomCell(tf.nn.rnn_cell.RNNCell): <span class="co">&quot;&quot;&quot;Gated Recurrent Unit cell (cf. http://arxiv.org/abs/1406.1078).&quot;&quot;&quot;</span> <span class="kw">def</span> <span class="fu">__init__</span>(<span class="va">self</span>, num_units, num_weights): <span class="va">self</span>._num_units <span class="op">=</span> num_units <span class="va">self</span>._num_weights <span class="op">=</span> num_weights <span class="at">@property</span> <span class="kw">def</span> state_size(<span class="va">self</span>): <span class="cf">return</span> <span class="va">self</span>._num_units <span class="at">@property</span> <span class="kw">def</span> output_size(<span class="va">self</span>): <span class="cf">return</span> <span class="va">self</span>._num_units <span class="kw">def</span> <span class="fu">__call__</span>(<span class="va">self</span>, inputs, state, scope<span class="op">=</span><span class="va">None</span>): <span class="cf">with</span> tf.variable_scope(scope <span class="kw">or</span> <span class="bu">type</span>(<span class="va">self</span>).<span class="va">__name__</span>): <span 
class="co"># &quot;GRUCell&quot;</span> <span class="cf">with</span> tf.variable_scope(<span class="st">&quot;Gates&quot;</span>): <span class="co"># Reset gate and update gate.</span> <span class="co"># We start with bias of 1.0 to not reset and not update.</span> ru <span class="op">=</span> tf.nn.rnn_cell._linear([inputs, state], <span class="dv">2</span> <span class="op">*</span> <span class="va">self</span>._num_units, <span class="va">True</span>, <span class="fl">1.0</span>) ru <span class="op">=</span> tf.nn.sigmoid(ru) r, u <span class="op">=</span> tf.split(<span class="dv">1</span>, <span class="dv">2</span>, ru) <span class="cf">with</span> tf.variable_scope(<span class="st">&quot;Candidate&quot;</span>): lambdas <span class="op">=</span> tf.nn.rnn_cell._linear([inputs, state], <span class="va">self</span>._num_weights, <span class="va">True</span>) lambdas <span class="op">=</span> tf.split(<span class="dv">1</span>, <span class="va">self</span>._num_weights, tf.nn.softmax(lambdas)) Ws <span class="op">=</span> tf.get_variable(<span class="st">&quot;Ws&quot;</span>, shape <span class="op">=</span> [<span class="va">self</span>._num_weights, inputs.get_shape()[<span class="dv">1</span>], <span class="va">self</span>._num_units]) Ws <span class="op">=</span> [tf.squeeze(i) <span class="cf">for</span> i <span class="kw">in</span> tf.split(<span class="dv">0</span>, <span class="va">self</span>._num_weights, Ws)] candidate_inputs <span class="op">=</span> [] <span class="cf">for</span> idx, W <span class="kw">in</span> <span class="bu">enumerate</span>(Ws): candidate_inputs.append(tf.matmul(inputs, W) <span class="op">*</span> lambdas[idx]) Wx <span class="op">=</span> tf.add_n(candidate_inputs) c <span class="op">=</span> tf.nn.tanh(Wx <span class="op">+</span> tf.nn.rnn_cell._linear([r <span class="op">*</span> state], <span class="va">self</span>._num_units, <span class="va">True</span>, scope<span class="op">=</span><span 
class="st">&quot;second&quot;</span>)) new_h <span class="op">=</span> u <span class="op">*</span> state <span class="op">+</span> (<span class="dv">1</span> <span class="op">-</span> u) <span class="op">*</span> c <span class="cf">return</span> new_h, new_h</code></pre></div> <p>Let’s see how the custom cell stacks up to a regular GRU cell (using <code>num_steps = 30</code>, since this performs much better than <code>num_steps = 200</code> after 5 epochs – can you see why that might happen?):</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> build_multilayer_graph_with_custom_cell( cell_type <span class="op">=</span> <span class="va">None</span>, num_weights_for_custom_cell <span class="op">=</span> <span class="dv">5</span>, state_size <span class="op">=</span> <span class="dv">100</span>, num_classes <span class="op">=</span> vocab_size, batch_size <span class="op">=</span> <span class="dv">32</span>, num_steps <span class="op">=</span> <span class="dv">200</span>, num_layers <span class="op">=</span> <span class="dv">3</span>, learning_rate <span class="op">=</span> <span class="fl">1e-4</span>): reset_graph() x <span class="op">=</span> tf.placeholder(tf.int32, [batch_size, num_steps], name<span class="op">=</span><span class="st">&#39;input_placeholder&#39;</span>) y <span class="op">=</span> tf.placeholder(tf.int32, [batch_size, num_steps], name<span class="op">=</span><span class="st">&#39;labels_placeholder&#39;</span>) embeddings <span class="op">=</span> tf.get_variable(<span class="st">&#39;embedding_matrix&#39;</span>, [num_classes, state_size]) rnn_inputs <span class="op">=</span> tf.nn.embedding_lookup(embeddings, x) <span class="cf">if</span> cell_type <span class="op">==</span> <span class="st">&#39;Custom&#39;</span>: cell <span class="op">=</span> CustomCell(state_size, num_weights_for_custom_cell) <span class="cf">elif</span> cell_type <span class="op">==</span> <span 
class="st">&#39;GRU&#39;</span>: cell <span class="op">=</span> tf.nn.rnn_cell.GRUCell(state_size) <span class="cf">elif</span> cell_type <span class="op">==</span> <span class="st">&#39;LSTM&#39;</span>: cell <span class="op">=</span> tf.nn.rnn_cell.LSTMCell(state_size, state_is_tuple<span class="op">=</span><span class="va">True</span>) <span class="cf">else</span>: cell <span class="op">=</span> tf.nn.rnn_cell.BasicRNNCell(state_size) <span class="cf">if</span> cell_type <span class="op">==</span> <span class="st">&#39;LSTM&#39;</span>: cell <span class="op">=</span> tf.nn.rnn_cell.MultiRNNCell([cell] <span class="op">*</span> num_layers, state_is_tuple<span class="op">=</span><span class="va">True</span>) <span class="cf">else</span>: cell <span class="op">=</span> tf.nn.rnn_cell.MultiRNNCell([cell] <span class="op">*</span> num_layers) init_state <span class="op">=</span> cell.zero_state(batch_size, tf.float32) rnn_outputs, final_state <span class="op">=</span> tf.nn.dynamic_rnn(cell, rnn_inputs, initial_state<span class="op">=</span>init_state) <span class="cf">with</span> tf.variable_scope(<span class="st">&#39;softmax&#39;</span>): W <span class="op">=</span> tf.get_variable(<span class="st">&#39;W&#39;</span>, [state_size, num_classes]) b <span class="op">=</span> tf.get_variable(<span class="st">&#39;b&#39;</span>, [num_classes], initializer<span class="op">=</span>tf.constant_initializer(<span class="fl">0.0</span>)) <span class="co">#reshape rnn_outputs and y</span> rnn_outputs <span class="op">=</span> tf.reshape(rnn_outputs, [<span class="op">-</span><span class="dv">1</span>, state_size]) y_reshaped <span class="op">=</span> tf.reshape(y, [<span class="op">-</span><span class="dv">1</span>]) logits <span class="op">=</span> tf.matmul(rnn_outputs, W) <span class="op">+</span> b total_loss <span class="op">=</span> tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits, y_reshaped)) train_step <span class="op">=</span> 
tf.train.AdamOptimizer(learning_rate).minimize(total_loss) <span class="cf">return</span> <span class="bu">dict</span>( x <span class="op">=</span> x, y <span class="op">=</span> y, init_state <span class="op">=</span> init_state, final_state <span class="op">=</span> final_state, total_loss <span class="op">=</span> total_loss, train_step <span class="op">=</span> train_step )</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">g <span class="op">=</span> build_multilayer_graph_with_custom_cell(cell_type<span class="op">=</span><span class="st">&#39;GRU&#39;</span>, num_steps<span class="op">=</span><span class="dv">30</span>) t <span class="op">=</span> time.time() train_network(g, <span class="dv">5</span>, num_steps<span class="op">=</span><span class="dv">30</span>) <span class="bu">print</span>(<span class="st">&quot;It took&quot;</span>, time.time() <span class="op">-</span> t, <span class="st">&quot;seconds to train for 5 epochs.&quot;</span>)</code></pre></div> <pre><code>Average training loss for Epoch 0 : 2.92919953048 Average training loss for Epoch 1 : 2.35888109404 Average training loss for Epoch 2 : 2.21945820894 Average training loss for Epoch 3 : 2.12258511006 Average training loss for Epoch 4 : 2.05038544733 It took 284.6971204280853 seconds to train for 5 epochs.</code></pre> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">g <span class="op">=</span> build_multilayer_graph_with_custom_cell(cell_type<span class="op">=</span><span class="st">&#39;Custom&#39;</span>, num_steps<span class="op">=</span><span class="dv">30</span>) t <span class="op">=</span> time.time() train_network(g, <span class="dv">5</span>, num_steps<span class="op">=</span><span class="dv">30</span>) <span class="bu">print</span>(<span class="st">&quot;It took&quot;</span>, time.time() <span class="op">-</span> t, <span class="st">&quot;seconds to train for 5 
epochs.&quot;</span>)</code></pre></div> <pre><code>Average training loss for Epoch 0 : 3.04418995892 Average training loss for Epoch 1 : 2.5172702761 Average training loss for Epoch 2 : 2.37068433601 Average training loss for Epoch 3 : 2.27533404217 Average training loss for Epoch 4 : 2.20167231745 It took 537.6112766265869 seconds to train for 5 epochs.</code></pre> <p>So much for that idea. Our custom cell took almost twice as long to train and seems to perform worse than a standard GRU cell.</p> <h3 id="adding-dropout">Adding Dropout</h3> <p>Adding features like dropout to the network is easy: we figure out where they belong and drop them in.</p> <p>Dropout belongs <em>in between layers, not on the state or in intra-cell connections</em>. See <a href="https://arxiv.org/pdf/1409.2329.pdf">Zaremba et al. (2015), Recurrent Neural Network Regularization</a> (“The main idea is to apply the dropout operator only to the non-recurrent connections.”)</p> <p>Thus, to apply dropout, we need to wrap the input and/or output of <em>each</em> cell. In our RNN implementation using lists, we might do something like this:</p> <pre><code>rnn_inputs = [tf.nn.dropout(rnn_input, keep_prob) for rnn_input in rnn_inputs] rnn_outputs = [tf.nn.dropout(rnn_output, keep_prob) for rnn_output in rnn_outputs]</code></pre> <p>In our dynamic_rnn or scan implementations, we might apply dropout directly to the rnn_inputs or rnn_outputs:</p> <pre><code>rnn_inputs = tf.nn.dropout(rnn_inputs, keep_prob) rnn_outputs = tf.nn.dropout(rnn_outputs, keep_prob)</code></pre> <p>But what happens when we use <code>MultiRNNCell</code>? How can we have dropout in between layers like in Zaremba et al. (2015)? The answer is to wrap our base RNN cell with dropout, thereby including it as part of the base cell, similar to how we wrapped our three RNN cells into a single MultiRNNCell above.
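</p> <p>To make the placement concrete, here is a small numpy sketch (not Tensorflow; the two-layer vanilla RNN, its weights and its sizes are all made up for illustration) of a forward pass that applies inverted dropout only to the non-recurrent connections, leaving the hidden state that is passed through time untouched:</p> <pre><code>import numpy as np

def dropout(x, keep_prob, rng):
    """Inverted dropout: zero each unit with probability 1 - keep_prob and
    scale survivors by 1 / keep_prob so the expected value is unchanged."""
    mask = rng.binomial(1, keep_prob, size=x.shape) / keep_prob
    return x * mask

rng = np.random.RandomState(0)
state_size, num_steps = 4, 3
Wxh1, Whh1 = rng.randn(state_size, state_size), rng.randn(state_size, state_size)
Wxh2, Whh2 = rng.randn(state_size, state_size), rng.randn(state_size, state_size)

h1, h2 = np.zeros(state_size), np.zeros(state_size)
for t in range(num_steps):
    x = rng.randn(state_size)
    # dropout on the input (a non-recurrent connection), but not on h1 @ Whh1
    h1 = np.tanh(dropout(x, 0.9, rng) @ Wxh1 + h1 @ Whh1)
    # dropout on the layer 1 -> layer 2 connection, but not on h2 @ Whh2
    h2 = np.tanh(dropout(h1, 0.9, rng) @ Wxh2 + h2 @ Whh2)</code></pre> <p>The hand-written loop is only for intuition; in the actual graph we attach the dropout to the cell itself.</p> <p>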
Tensorflow allows us to do this without writing a new RNNCell by using <code>tf.nn.rnn_cell.DropoutWrapper</code>:</p> <pre><code>cell = tf.nn.rnn_cell.LSTMCell(state_size, state_is_tuple=True) cell = tf.nn.rnn_cell.DropoutWrapper(cell, input_keep_prob=input_dropout, output_keep_prob=output_dropout) cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers, state_is_tuple=True)</code></pre> <p>Note that if we wrap a base cell with dropout and then use it to build a MultiRNNCell, both input dropout and output dropout will be applied between layers (so if both are, say, 0.9, the dropout in between layers will be 0.9 * 0.9 = 0.81). If we want equal dropout on all inputs and outputs of a multi-layered RNN, we can use only output or input dropout on the base cell, and then wrap the entire MultiRNNCell with the input or output dropout like so:</p> <pre><code>cell = tf.nn.rnn_cell.LSTMCell(state_size, state_is_tuple=True) cell = tf.nn.rnn_cell.DropoutWrapper(cell, input_keep_prob=global_dropout) cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers, state_is_tuple=True) cell = tf.nn.rnn_cell.DropoutWrapper(cell, output_keep_prob=global_dropout)</code></pre> <h3 id="layer-normalization">Layer normalization</h3> <p>Layer normalization is a feature that was published just a few days ago by <a href="https://arxiv.org/abs/1607.06450">Lei Ba et al. (2016)</a>, which we can use to improve our RNN. It was inspired by batch normalization, which you can read about and learn how to implement in my post <a href="http://r2rt.com/implementing-batch-normalization-in-tensorflow.html">here</a>. Batch normalization (for feed-forward and convolutional neural networks) and layer normalization (for recurrent neural networks) generally improve training time and achieve better overall performance. 
In this section, we’ll apply what we’ve learned in this post to implement layer normalization in Tensorflow.</p> <p>Layer normalization is applied as follows: the initial layer normalization function is applied individually to each training example so as to normalize the output vector of a linear transformation to have a mean of 0 and a variance of 1. In math: <span class="math inline">$$LN_{initial}: v \mapsto \frac{v - \mu_v}{\sqrt{\sigma_v^2 + \epsilon}}$$</span> for some vector <span class="math inline">$$v$$</span> and some small value of <span class="math inline">$$\epsilon$$</span> for numerical stability. For the same reasons we add scale and shift parameters to the initial batch normalization transform (see my <a href="http://r2rt.com/implementing-batch-normalization-in-tensorflow.html">batch normalization post</a> for details), we add scale, <span class="math inline">$$\alpha$$</span>, and shift, <span class="math inline">$$\beta$$</span>, parameters here as well, so that the final layer normalization function is:</p> <p><span class="math display">$LN: v \mapsto \alpha \odot \frac{v - \mu_v}{\sqrt{\sigma_v^2 + \epsilon}} + \beta$</span></p> <p>Note that <span class="math inline">$$\odot$$</span> is point-wise multiplication.</p> <p>To add layer normalization to our network, we first write a function that will layer normalize a 2D tensor along its second dimension:</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> ln(tensor, scope <span class="op">=</span> <span class="va">None</span>, epsilon <span class="op">=</span> <span class="fl">1e-5</span>): <span class="co">&quot;&quot;&quot; Layer normalizes a 2D tensor along its second axis &quot;&quot;&quot;</span> <span class="cf">assert</span>(<span class="bu">len</span>(tensor.get_shape()) <span class="op">==</span> <span class="dv">2</span>) m, v <span class="op">=</span> tf.nn.moments(tensor, [<span class="dv">1</span>],
keep_dims<span class="op">=</span><span class="va">True</span>) <span class="cf">if</span> <span class="kw">not</span> <span class="bu">isinstance</span>(scope, <span class="bu">str</span>): scope <span class="op">=</span> <span class="st">&#39;&#39;</span> <span class="cf">with</span> tf.variable_scope(scope <span class="op">+</span> <span class="st">&#39;layer_norm&#39;</span>): scale <span class="op">=</span> tf.get_variable(<span class="st">&#39;scale&#39;</span>, shape<span class="op">=</span>[tensor.get_shape()[<span class="dv">1</span>]], initializer<span class="op">=</span>tf.constant_initializer(<span class="dv">1</span>)) shift <span class="op">=</span> tf.get_variable(<span class="st">&#39;shift&#39;</span>, shape<span class="op">=</span>[tensor.get_shape()[<span class="dv">1</span>]], initializer<span class="op">=</span>tf.constant_initializer(<span class="dv">0</span>)) LN_initial <span class="op">=</span> (tensor <span class="op">-</span> m) <span class="op">/</span> tf.sqrt(v <span class="op">+</span> epsilon) <span class="cf">return</span> LN_initial <span class="op">*</span> scale <span class="op">+</span> shift</code></pre></div> <p>Let’s apply our layer normalization function as it was applied by Lei Ba et al. (2016) to LSTMs (in their experiments “Teaching machines to read and comprehend” and “Handwriting sequence generation”). Lei Ba et al. apply layer normalization to the output of each gate <em>inside</em> the LSTM cell, which means that we get to take a second shot at writing a new type of RNN cell.
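</p> <p>Before wiring it into a cell, it is worth sanity-checking the math. A numpy equivalent of <code>ln</code> (with the learned scale and shift left at their initial values of 1 and 0, so this is only the initial transform) shows each example normalized to mean 0 and variance 1:</p> <pre><code>import numpy as np

def ln_np(tensor, epsilon=1e-5):
    """Numpy analogue of ln() above: per-example normalization along axis 1."""
    m = tensor.mean(axis=1, keepdims=True)
    v = tensor.var(axis=1, keepdims=True)
    return (tensor - m) / np.sqrt(v + epsilon)

x = np.random.RandomState(0).randn(32, 100) * 5 + 3  # 32 examples, far from normalized
out = ln_np(x)
print(out.mean(axis=1))  # ~0 for every example
print(out.var(axis=1))   # ~1 for every example</code></pre> <p>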
We’ll start with Tensorflow’s official code, located <a href="https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/rnn_cell.py">here</a>, and modify it accordingly:</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">class</span> LayerNormalizedLSTMCell(tf.nn.rnn_cell.RNNCell): <span class="co">&quot;&quot;&quot;</span> <span class="co"> Adapted from TF&#39;s BasicLSTMCell to use Layer Normalization.</span> <span class="co"> Note that state_is_tuple is always True.</span> <span class="co"> &quot;&quot;&quot;</span> <span class="kw">def</span> <span class="fu">__init__</span>(<span class="va">self</span>, num_units, forget_bias<span class="op">=</span><span class="fl">1.0</span>, activation<span class="op">=</span>tf.nn.tanh): <span class="va">self</span>._num_units <span class="op">=</span> num_units <span class="va">self</span>._forget_bias <span class="op">=</span> forget_bias <span class="va">self</span>._activation <span class="op">=</span> activation <span class="at">@property</span> <span class="kw">def</span> state_size(<span class="va">self</span>): <span class="cf">return</span> tf.nn.rnn_cell.LSTMStateTuple(<span class="va">self</span>._num_units, <span class="va">self</span>._num_units) <span class="at">@property</span> <span class="kw">def</span> output_size(<span class="va">self</span>): <span class="cf">return</span> <span class="va">self</span>._num_units <span class="kw">def</span> <span class="fu">__call__</span>(<span class="va">self</span>, inputs, state, scope<span class="op">=</span><span class="va">None</span>): <span class="co">&quot;&quot;&quot;Long short-term memory cell (LSTM).&quot;&quot;&quot;</span> <span class="cf">with</span> tf.variable_scope(scope <span class="kw">or</span> <span class="bu">type</span>(<span class="va">self</span>).<span class="va">__name__</span>): c, h <span class="op">=</span> state <span class="co"># change bias argument to False since 
LN will add bias via shift</span> concat <span class="op">=</span> tf.nn.rnn_cell._linear([inputs, h], <span class="dv">4</span> <span class="op">*</span> <span class="va">self</span>._num_units, <span class="va">False</span>) i, j, f, o <span class="op">=</span> tf.split(<span class="dv">1</span>, <span class="dv">4</span>, concat) <span class="co"># add layer normalization to each gate</span> i <span class="op">=</span> ln(i, scope <span class="op">=</span> <span class="st">&#39;i/&#39;</span>) j <span class="op">=</span> ln(j, scope <span class="op">=</span> <span class="st">&#39;j/&#39;</span>) f <span class="op">=</span> ln(f, scope <span class="op">=</span> <span class="st">&#39;f/&#39;</span>) o <span class="op">=</span> ln(o, scope <span class="op">=</span> <span class="st">&#39;o/&#39;</span>) new_c <span class="op">=</span> (c <span class="op">*</span> tf.nn.sigmoid(f <span class="op">+</span> <span class="va">self</span>._forget_bias) <span class="op">+</span> tf.nn.sigmoid(i) <span class="op">*</span> <span class="va">self</span>._activation(j)) <span class="co"># add layer_normalization in calculation of new hidden state</span> new_h <span class="op">=</span> <span class="va">self</span>._activation(ln(new_c, scope <span class="op">=</span> <span class="st">&#39;new_h/&#39;</span>)) <span class="op">*</span> tf.nn.sigmoid(o) new_state <span class="op">=</span> tf.nn.rnn_cell.LSTMStateTuple(new_c, new_h) <span class="cf">return</span> new_h, new_state</code></pre></div> <p>And that’s it! 
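</p> <p>To see the data flow without the variable-scope machinery, here is a single step of the same computation in plain numpy (a sketch, not a drop-in replacement: the weights are random, and the learned scale and shift inside each normalization are omitted):</p> <pre><code>import numpy as np

def ln_np(x, eps=1e-5):
    # per-example normalization (scale and shift omitted)
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def ln_lstm_step(x, c, h, W, forget_bias=1.0):
    """One layer-normalized LSTM step. W maps [x, h] to the four gate
    pre-activations; there is no bias term, since LN's shift would play that role."""
    concat = np.concatenate([x, h], axis=1) @ W
    i, j, f, o = np.split(concat, 4, axis=1)
    i, j, f, o = ln_np(i), ln_np(j), ln_np(f), ln_np(o)  # normalize each gate
    new_c = c * sigmoid(f + forget_bias) + sigmoid(i) * np.tanh(j)
    new_h = np.tanh(ln_np(new_c)) * sigmoid(o)           # normalize new_c as well
    return new_c, new_h

rng = np.random.RandomState(1)
batch, num_units = 2, 8
W = rng.randn(2 * num_units, 4 * num_units)
c = np.zeros((batch, num_units))
h = np.zeros((batch, num_units))
c, h = ln_lstm_step(rng.randn(batch, num_units), c, h, W)</code></pre> <p>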
Let’s try this out.</p> <h3 id="final-model">Final model</h3> <p>At this point, we’ve covered all of the graph modifications we planned to cover, so here is our final model, which allows for dropout and layer normalized LSTM cells:</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> build_graph( cell_type <span class="op">=</span> <span class="va">None</span>, num_weights_for_custom_cell <span class="op">=</span> <span class="dv">5</span>, state_size <span class="op">=</span> <span class="dv">100</span>, num_classes <span class="op">=</span> vocab_size, batch_size <span class="op">=</span> <span class="dv">32</span>, num_steps <span class="op">=</span> <span class="dv">200</span>, num_layers <span class="op">=</span> <span class="dv">3</span>, build_with_dropout<span class="op">=</span><span class="va">False</span>, learning_rate <span class="op">=</span> <span class="fl">1e-4</span>): reset_graph() x <span class="op">=</span> tf.placeholder(tf.int32, [batch_size, num_steps], name<span class="op">=</span><span class="st">&#39;input_placeholder&#39;</span>) y <span class="op">=</span> tf.placeholder(tf.int32, [batch_size, num_steps], name<span class="op">=</span><span class="st">&#39;labels_placeholder&#39;</span>) dropout <span class="op">=</span> tf.constant(<span class="fl">1.0</span>) embeddings <span class="op">=</span> tf.get_variable(<span class="st">&#39;embedding_matrix&#39;</span>, [num_classes, state_size]) rnn_inputs <span class="op">=</span> tf.nn.embedding_lookup(embeddings, x) <span class="cf">if</span> cell_type <span class="op">==</span> <span class="st">&#39;Custom&#39;</span>: cell <span class="op">=</span> CustomCell(state_size, num_weights_for_custom_cell) <span class="cf">elif</span> cell_type <span class="op">==</span> <span class="st">&#39;GRU&#39;</span>: cell <span class="op">=</span> tf.nn.rnn_cell.GRUCell(state_size) <span class="cf">elif</span> cell_type <span 
class="op">==</span> <span class="st">&#39;LSTM&#39;</span>: cell <span class="op">=</span> tf.nn.rnn_cell.LSTMCell(state_size, state_is_tuple<span class="op">=</span><span class="va">True</span>) <span class="cf">elif</span> cell_type <span class="op">==</span> <span class="st">&#39;LN_LSTM&#39;</span>: cell <span class="op">=</span> LayerNormalizedLSTMCell(state_size) <span class="cf">else</span>: cell <span class="op">=</span> tf.nn.rnn_cell.BasicRNNCell(state_size) <span class="cf">if</span> build_with_dropout: cell <span class="op">=</span> tf.nn.rnn_cell.DropoutWrapper(cell, input_keep_prob<span class="op">=</span>dropout) <span class="cf">if</span> cell_type <span class="op">==</span> <span class="st">&#39;LSTM&#39;</span> <span class="kw">or</span> cell_type <span class="op">==</span> <span class="st">&#39;LN_LSTM&#39;</span>: cell <span class="op">=</span> tf.nn.rnn_cell.MultiRNNCell([cell] <span class="op">*</span> num_layers, state_is_tuple<span class="op">=</span><span class="va">True</span>) <span class="cf">else</span>: cell <span class="op">=</span> tf.nn.rnn_cell.MultiRNNCell([cell] <span class="op">*</span> num_layers) <span class="cf">if</span> build_with_dropout: cell <span class="op">=</span> tf.nn.rnn_cell.DropoutWrapper(cell, output_keep_prob<span class="op">=</span>dropout) init_state <span class="op">=</span> cell.zero_state(batch_size, tf.float32) rnn_outputs, final_state <span class="op">=</span> tf.nn.dynamic_rnn(cell, rnn_inputs, initial_state<span class="op">=</span>init_state) <span class="cf">with</span> tf.variable_scope(<span class="st">&#39;softmax&#39;</span>): W <span class="op">=</span> tf.get_variable(<span class="st">&#39;W&#39;</span>, [state_size, num_classes]) b <span class="op">=</span> tf.get_variable(<span class="st">&#39;b&#39;</span>, [num_classes], initializer<span class="op">=</span>tf.constant_initializer(<span class="fl">0.0</span>)) <span class="co">#reshape rnn_outputs and y</span> rnn_outputs <span 
class="op">=</span> tf.reshape(rnn_outputs, [<span class="op">-</span><span class="dv">1</span>, state_size]) y_reshaped <span class="op">=</span> tf.reshape(y, [<span class="op">-</span><span class="dv">1</span>]) logits <span class="op">=</span> tf.matmul(rnn_outputs, W) <span class="op">+</span> b predictions <span class="op">=</span> tf.nn.softmax(logits) total_loss <span class="op">=</span> tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits, y_reshaped)) train_step <span class="op">=</span> tf.train.AdamOptimizer(learning_rate).minimize(total_loss) <span class="cf">return</span> <span class="bu">dict</span>( x <span class="op">=</span> x, y <span class="op">=</span> y, init_state <span class="op">=</span> init_state, final_state <span class="op">=</span> final_state, total_loss <span class="op">=</span> total_loss, train_step <span class="op">=</span> train_step, preds <span class="op">=</span> predictions, saver <span class="op">=</span> tf.train.Saver() )</code></pre></div> <p>Let’s compare the GRU, LSTM and LN_LSTM after training each for 20 epochs using 80 step sequences.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">g <span class="op">=</span> build_graph(cell_type<span class="op">=</span><span class="st">&#39;GRU&#39;</span>, num_steps<span class="op">=</span><span class="dv">80</span>) t <span class="op">=</span> time.time() losses <span class="op">=</span> train_network(g, <span class="dv">20</span>, num_steps<span class="op">=</span><span class="dv">80</span>, save<span class="op">=</span><span class="st">&quot;saves/GRU_20_epochs&quot;</span>) <span class="bu">print</span>(<span class="st">&quot;It took&quot;</span>, time.time() <span class="op">-</span> t, <span class="st">&quot;seconds to train for 20 epochs.&quot;</span>) <span class="bu">print</span>(<span class="st">&quot;The average loss on the final epoch was:&quot;</span>, losses[<span class="op">-</span><span 
class="dv">1</span>])</code></pre></div> <pre><code>It took 1051.6652357578278 seconds to train for 20 epochs. The average loss on the final epoch was: 1.75318197903</code></pre> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">g <span class="op">=</span> build_graph(cell_type<span class="op">=</span><span class="st">&#39;LSTM&#39;</span>, num_steps<span class="op">=</span><span class="dv">80</span>) t <span class="op">=</span> time.time() losses <span class="op">=</span> train_network(g, <span class="dv">20</span>, num_steps<span class="op">=</span><span class="dv">80</span>, save<span class="op">=</span><span class="st">&quot;saves/LSTM_20_epochs&quot;</span>) <span class="bu">print</span>(<span class="st">&quot;It took&quot;</span>, time.time() <span class="op">-</span> t, <span class="st">&quot;seconds to train for 20 epochs.&quot;</span>) <span class="bu">print</span>(<span class="st">&quot;The average loss on the final epoch was:&quot;</span>, losses[<span class="op">-</span><span class="dv">1</span>])</code></pre></div> <pre><code>It took 614.4890048503876 seconds to train for 20 epochs. 
The average loss on the final epoch was: 2.02813237837</code></pre> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">g <span class="op">=</span> build_graph(cell_type<span class="op">=</span><span class="st">&#39;LN_LSTM&#39;</span>, num_steps<span class="op">=</span><span class="dv">80</span>) t <span class="op">=</span> time.time() losses <span class="op">=</span> train_network(g, <span class="dv">20</span>, num_steps<span class="op">=</span><span class="dv">80</span>, save<span class="op">=</span><span class="st">&quot;saves/LN_LSTM_20_epochs&quot;</span>) <span class="bu">print</span>(<span class="st">&quot;It took&quot;</span>, time.time() <span class="op">-</span> t, <span class="st">&quot;seconds to train for 20 epochs.&quot;</span>) <span class="bu">print</span>(<span class="st">&quot;The average loss on the final epoch was:&quot;</span>, losses[<span class="op">-</span><span class="dv">1</span>])</code></pre></div> <pre><code>It took 3867.550405740738 seconds to train for 20 epochs. The average loss on the final epoch was: 1.71850851623</code></pre> <p>It looks like the layer normalized LSTM just managed to edge out the GRU in the last few epochs, though the increase in training time hardly seems worth it (perhaps my implementation could be improved?). It would be interesting to see how they would perform on a validation or test set and also to try out a layer normalized GRU. For now, let’s use the GRU to generate some text.</p> <h3 id="generating-text">Generating text</h3> <p>To generate text, we’re going to rebuild the graph so as to accept a single character at a time and restore our saved model. We’ll give the network a single character prompt, grab its predicted probability distribution for the next character, use that distribution to pick the next character, and repeat.
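</p> <p>The sampling step itself is simple; isolated as a numpy sketch (the helper name <code>sample_char</code> is mine, but the <code>pick_top_chars</code> logic mirrors the <code>generate_characters</code> function shown below), it amounts to zeroing out all but the most likely entries and renormalizing:</p> <pre><code>import numpy as np

def sample_char(preds, pick_top_chars=None, rng=np.random):
    """Sample an index from a predicted distribution, optionally restricted
    to the pick_top_chars most likely entries."""
    p = np.squeeze(preds).astype(np.float64)
    if pick_top_chars is not None:
        p[np.argsort(p)[:-pick_top_chars]] = 0  # zero all but the top n
        p = p / p.sum()                         # renormalize to a distribution
    return rng.choice(len(p), p=p)

preds = np.array([0.05, 0.4, 0.05, 0.3, 0.2])
idx = sample_char(preds, pick_top_chars=2)  # can only return index 1 or 3</code></pre> <p>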
When picking the next character, our <code>generate_characters</code> function can be set to use the whole probability distribution (default), or be forced to pick one of the top n most likely characters in the distribution. The latter option should obtain more English-like results.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> generate_characters(g, checkpoint, num_chars, prompt<span class="op">=</span><span class="st">&#39;A&#39;</span>, pick_top_chars<span class="op">=</span><span class="va">None</span>): <span class="co">&quot;&quot;&quot; Accepts a current character, initial state&quot;&quot;&quot;</span> <span class="cf">with</span> tf.Session() <span class="im">as</span> sess: sess.run(tf.initialize_all_variables()) g[<span class="st">&#39;saver&#39;</span>].restore(sess, checkpoint) state <span class="op">=</span> <span class="va">None</span> current_char <span class="op">=</span> vocab_to_idx[prompt] chars <span class="op">=</span> [current_char] <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(num_chars): <span class="cf">if</span> state <span class="kw">is</span> <span class="kw">not</span> <span class="va">None</span>: feed_dict<span class="op">=</span>{g[<span class="st">&#39;x&#39;</span>]: [[current_char]], g[<span class="st">&#39;init_state&#39;</span>]: state} <span class="cf">else</span>: feed_dict<span class="op">=</span>{g[<span class="st">&#39;x&#39;</span>]: [[current_char]]} preds, state <span class="op">=</span> sess.run([g[<span class="st">&#39;preds&#39;</span>],g[<span class="st">&#39;final_state&#39;</span>]], feed_dict) <span class="cf">if</span> pick_top_chars <span class="kw">is</span> <span class="kw">not</span> <span class="va">None</span>: p <span class="op">=</span> np.squeeze(preds) p[np.argsort(p)[:<span class="op">-</span>pick_top_chars]] <span class="op">=</span> <span class="dv">0</span> p <span class="op">=</span> p 
<span class="op">/</span> np.<span class="bu">sum</span>(p) current_char <span class="op">=</span> np.random.choice(vocab_size, <span class="dv">1</span>, p<span class="op">=</span>p)[<span class="dv">0</span>] <span class="cf">else</span>: current_char <span class="op">=</span> np.random.choice(vocab_size, <span class="dv">1</span>, p<span class="op">=</span>np.squeeze(preds))[<span class="dv">0</span>] chars.append(current_char) chars <span class="op">=</span> <span class="bu">map</span>(<span class="kw">lambda</span> x: idx_to_vocab[x], chars) <span class="bu">print</span>(<span class="st">&quot;&quot;</span>.join(chars)) <span class="cf">return</span>(<span class="st">&quot;&quot;</span>.join(chars))</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">g <span class="op">=</span> build_graph(cell_type<span class="op">=</span><span class="st">&#39;LN_LSTM&#39;</span>, num_steps<span class="op">=</span><span class="dv">1</span>, batch_size<span class="op">=</span><span class="dv">1</span>) generate_characters(g, <span class="st">&quot;saves/LN_LSTM_20_epochs&quot;</span>, <span class="dv">750</span>, prompt<span class="op">=</span><span class="st">&#39;A&#39;</span>, pick_top_chars<span class="op">=</span><span class="dv">5</span>)</code></pre></div> <pre><code>ATOOOS UIEAOUYOUZZZZZZUZAAAYAYf n fsflflrurctuateot t ta&#39;s a wtutss ESGNANO: Whith then, a do makes and them and to sees, I wark on this ance may string take thou honon To sorriccorn of the bairer, whither, all I&#39;d see if yiust the would a peid. LARYNGLe: To would she troust they fould. 
PENMES: Thou she so the havin to my shald woust of As tale we they all my forder have As to say heant thy wansing thag and Whis it thee shath his breact, I be and might, she Tirs you desarvishensed and see thee: shall, What he hath with that is all time, And sen the have would be sectiens, way thee, They are there to man shall with me to the mon, And mere fear would be the balte, as time an at And the say oun touth, thy way womers thee.</code></pre> <p>You can see that this network has learned something. It’s definitely not random, though there is a bit of a warm up at the beginning (the state starts at 0). I was expecting something a bit better, however, given <a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/#shakespeare">Karpathy’s Shakespeare results</a>. His model used more data, a state_size of 512, and was trained quite a bit longer than this one. Let’s see if we can match that. I couldn’t find a suitable premade dataset, so I had to make one myself: I concatenated the scripts from the Star Wars movies, the Star Trek movies, Tarantino and the Matrix. The final file size is 3.3MB, which is a bit smaller than the full works of William Shakespeare. 
Let’s load these up and try this again, with a larger state size:</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co">&quot;&quot;&quot;</span> <span class="co">Load new data</span> <span class="co">&quot;&quot;&quot;</span> file_url <span class="op">=</span> <span class="st">&#39;https://gist.githubusercontent.com/spitis/59bfafe6966bfe60cc206ffbb760269f/&#39;</span><span class="op">+\</span> <span class="co">&#39;raw/030a08754aada17cef14eed6fac7797cda830fe8/variousscripts.txt&#39;</span> file_name <span class="op">=</span> <span class="st">&#39;variousscripts.txt&#39;</span> <span class="cf">if</span> <span class="kw">not</span> os.path.exists(file_name): urllib.request.urlretrieve(file_url, file_name) <span class="cf">with</span> <span class="bu">open</span>(file_name,<span class="st">&#39;r&#39;</span>) <span class="im">as</span> f: raw_data <span class="op">=</span> f.read() <span class="bu">print</span>(<span class="st">&quot;Data length:&quot;</span>, <span class="bu">len</span>(raw_data)) vocab <span class="op">=</span> <span class="bu">set</span>(raw_data) vocab_size <span class="op">=</span> <span class="bu">len</span>(vocab) idx_to_vocab <span class="op">=</span> <span class="bu">dict</span>(<span class="bu">enumerate</span>(vocab)) vocab_to_idx <span class="op">=</span> <span class="bu">dict</span>(<span class="bu">zip</span>(idx_to_vocab.values(), idx_to_vocab.keys())) data <span class="op">=</span> [vocab_to_idx[c] <span class="cf">for</span> c <span class="kw">in</span> raw_data] <span class="kw">del</span> raw_data</code></pre></div> <pre><code>Data length: 3299132</code></pre> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">g <span class="op">=</span> build_graph(cell_type<span class="op">=</span><span class="st">&#39;GRU&#39;</span>, num_steps<span class="op">=</span><span class="dv">80</span>, state_size <span class="op">=</span> <span 
class="dv">512</span>, batch_size <span class="op">=</span> <span class="dv">50</span>, num_classes<span class="op">=</span>vocab_size, learning_rate<span class="op">=</span><span class="fl">5e-4</span>) t <span class="op">=</span> time.time() losses <span class="op">=</span> train_network(g, <span class="dv">30</span>, num_steps<span class="op">=</span><span class="dv">80</span>, batch_size <span class="op">=</span> <span class="dv">50</span>, save<span class="op">=</span><span class="st">&quot;saves/GRU_30_epochs_variousscripts&quot;</span>) <span class="bu">print</span>(<span class="st">&quot;It took&quot;</span>, time.time() <span class="op">-</span> t, <span class="st">&quot;seconds to train for 30 epochs.&quot;</span>) <span class="bu">print</span>(<span class="st">&quot;The average loss on the final epoch was:&quot;</span>, losses[<span class="op">-</span><span class="dv">1</span>])</code></pre></div> <pre><code>It took 4877.8002140522 seconds to train for 30 epochs. The average loss on the final epoch was: 0.726858645461</code></pre> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">g <span class="op">=</span> build_graph(cell_type<span class="op">=</span><span class="st">&#39;GRU&#39;</span>, num_steps<span class="op">=</span><span class="dv">1</span>, batch_size<span class="op">=</span><span class="dv">1</span>, num_classes<span class="op">=</span>vocab_size, state_size <span class="op">=</span> <span class="dv">512</span>) generate_characters(g, <span class="st">&quot;saves/GRU_30_epochs_variousscripts&quot;</span>, <span class="dv">750</span>, prompt<span class="op">=</span><span class="st">&#39;D&#39;</span>, pick_top_chars<span class="op">=</span><span class="dv">5</span>)</code></pre></div> <pre><code>DENT&#39;SUEENCK Bartholomew of the TIE FIGHTERS are stunned. There is a crowd and armored switcheroos. PICARD (continuing) Couns two dim is tired. In order to the sentence... 
The sub bottle appears on the screen into a small shuttle shift of the ceiling. The DAMBA FETT splash fires and matches them into the top, transmit to stable high above upon their statels, falling from an alien shaft. ANAKIN and OBI-WAN stand next to OBI-WAN down the control plate of smoke at the TIE fighter. They stare at the centre of the station loose into a comlink cover -- comes up to the General, the GENERAL HUNTAN AND FINNFURMBARD from the PICADOR to a beautiful Podracisly. ENGINEER Naboo from an army seventy medical security team area re-weilergular. EXT.</code></pre> <p>Not sure these are that much better than before, but it’s sort of readable?</p> <h3 id="conclusion">Conclusion</h3> <p>In this post, we used a character sequence generation task to learn how to use Tensorflow’s scan and dynamic_rnn functions, how to use advanced RNN cells and stack multiple RNNs, and how to add features to our RNN like dropout and layer normalization. In the next post, we will use a machine translation task to look at handling variable length sequences and building RNN encoders and decoders.</p> </body> </html> Styles of Truncated Backpropagation2016-07-19T00:00:00-04:002016-07-19T00:00:00-04:00Silviu Pitistag:r2rt.com,2016-07-19:/styles-of-truncated-backpropagation.htmlIn my post on Recurrent Neural Networks in Tensorflow, I observed that Tensorflow's approach to truncated backpropagation (feeding in truncated subsequences of length n) is qualitatively different than "backpropagating errors a maximum of n steps". 
In this post, I explore the differences, and ask whether one approach is better than the other.<!DOCTYPE html> <html> <head> <meta charset="utf-8"> <meta name="generator" content="pandoc"> <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes"> <title></title> <style type="text/css">code{white-space: pre;}</style> </head> <body> <p>In my post on <a href="https://r2rt.com/recurrent-neural-networks-in-tensorflow-i.html">Recurrent Neural Networks in Tensorflow</a>, I observed that Tensorflow’s approach to truncated backpropagation (feeding in truncated subsequences of length n) is qualitatively different than “backpropagating errors a maximum of n steps”. In this post, I explore the differences, implement a truncated backpropagation algorithm in Tensorflow that maintains an even distribution of backpropagated errors, and ask whether one approach is better than the other.</p> <p>I conclude that:</p> <ul> <li>Because a well-implemented evenly-distributed truncated backpropagation algorithm would run about as fast as full backpropagation over the sequence, and full backpropagation performs slightly better, it is most likely not worth implementing such an algorithm.</li> <li>The discussion and preliminary experiments in this post show that n-step Tensorflow-style truncated backprop (i.e., with num_steps = n) does not effectively backpropagate errors the full n steps.
Thus, if you are using Tensorflow-style truncated backpropagation and need to capture n-step dependencies, you may benefit from using a num_steps that is appreciably higher than n in order to effectively backpropagate errors the desired n steps.</li> </ul> <h3 id="differences-in-styles-of-truncated-backpropagation">Differences in styles of truncated backpropagation</h3> <p>Suppose we are training an RNN on sequences of length 10,000. If we apply non-truncated backpropagation through time, the entire sequence is fed into the network at once, and the error at time step 10,000 will be backpropagated all the way back to time step 1. The two problems with this are that it is (1) expensive to backpropagate the error so many steps, and (2) due to vanishing gradients, backpropagated errors get smaller and smaller at each step, which makes further backpropagation insignificant.</p> <p>To deal with this, we might implement “truncated” backpropagation. A good description of truncated backpropagation is provided in Section 2.8.6 of <a href="http://www.cs.utoronto.ca/~ilya/pubs/ilya_sutskever_phd_thesis.pdf">Ilya Sutskever’s Ph.D. thesis</a>:</p> <blockquote> <p>“[Truncated backpropagation] processes the sequence one timestep at a time, and every k1 timesteps, it runs BPTT for k2 timesteps…”</p> </blockquote> <p>Tensorflow-style truncated backpropagation uses k1 = k2 (= num_steps). See the <a href="https://www.tensorflow.org/versions/r0.9/tutorials/recurrent/index.html#truncated-backpropagation">Tensorflow API docs</a>. The question this post addresses is whether setting k1 = 1 achieves better results. I will deem this “true” truncated backpropagation, since every error that can be backpropagated k2 steps is backpropagated the full k2 steps.</p> <p>To understand why these two approaches are qualitatively different, consider how they differ on sequences of length 49 with backpropagation of errors truncated to 7 steps.
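To make the difference concrete, here is a small self-contained sketch (my own illustration, not code from elsewhere in this post) that counts how many steps each of the 49 errors is backpropagated under each scheme. The convention of counting an error’s own timestep as one of its 7 steps is an assumption, so the counts may be off by one from the prose:

```python
def steps_backpropagated(seq_len, k, tf_style):
    """Return, for each error t = 1..seq_len, the number of steps it is
    backpropagated when truncating at k steps (counting the error's own
    timestep as one step -- an illustrative convention)."""
    if tf_style:
        # Tensorflow-style: the sequence is cut into consecutive
        # subsequences of length k; an error only sees back to the
        # start of its own subsequence.
        return [(t - 1) % k + 1 for t in range(1, seq_len + 1)]
    # "True" truncated backprop: every error is backpropagated up to
    # k steps, limited only by the start of the sequence.
    return [min(t, k) for t in range(1, seq_len + 1)]

tf_steps = steps_backpropagated(49, 7, tf_style=True)
true_steps = steps_backpropagated(49, 7, tf_style=False)
print(sum(s == 7 for s in tf_steps), "errors get the full 7 steps (TF-style)")
print(sum(s == 7 for s in true_steps), "errors get the full 7 steps (true)")
```

The exact totals depend on that counting convention, but the qualitative gap (a handful of full-length errors under Tensorflow-style truncation versus forty-odd under “true” truncation) is the point.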
In both, every error is backpropagated to the weights at the current timestep. However, in Tensorflow-style truncated backpropagation, the sequence is broken into 7 subsequences, each of length 7, and only 7 of the errors are backpropagated 7 steps. In “true” truncated backpropagation, 42 of the 49 errors can be backpropagated the full 7 steps, and all 42 are. This may lead to different results because the ratio of 7-step to 1-step errors used to update the weights is significantly different.</p> <p>To visualize the difference, here is how true truncated backpropagation looks on a sequence of length 6 with errors truncated to 3 steps:</p> <figure> <img src="https://r2rt.com/static/images/RNN_true_truncated_backprop.png" alt="Diagram of True Truncated Backpropagation" /><figcaption>Diagram of True Truncated Backpropagation</figcaption> </figure> <p>And here is how Tensorflow-style truncated backpropagation looks on the same sequence:</p> <figure> <img src="https://r2rt.com/static/images/RNN_tf_truncated_backprop.png" alt="Diagram of Tensorflow Truncated Backpropagation" /><figcaption>Diagram of Tensorflow Truncated Backpropagation</figcaption> </figure> <h3 id="experiment-design">Experiment design</h3> <p>To compare the performance of the two algorithms, I implement a “true” truncated backpropagation algorithm and compare results. The algorithms are compared on a vanilla RNN, based on the one used in my prior post, <a href="https://r2rt.com/recurrent-neural-networks-in-tensorflow-i.html">Recurrent Neural Networks in Tensorflow I</a>, except that I upgrade the task and model complexity, since the basic model from my prior post learned the simple patterns in the toy dataset very quickly. The task will be language modeling on the PTB dataset, and to match the increased complexity of this task, I add an embedding layer and dropout to the basic RNN model.</p> <p>I compare the best performance of each algorithm on the validation set after 20 epochs for the cases below.
In each case, I use an AdamOptimizer (it does better than other optimizers in preliminary tests) and learning rates of 0.003, 0.001 and 0.0003.</p> <p><em>5-step truncated backpropagation</em></p> <ul> <li>True, sequences of length 20</li> <li>TF-style</li> </ul> <p><em>10-step truncated backpropagation</em></p> <ul> <li>True, sequences of length 30</li> <li>TF-style</li> </ul> <p><em>20-step truncated backpropagation</em></p> <ul> <li>True, sequences of length 40</li> <li>TF-style</li> </ul> <p><em>40-step truncated backpropagation</em></p> <ul> <li>TF-style</li> </ul> <h3 id="code">Code</h3> <h4 id="imports-and-data-generators">Imports and data generators</h4> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> numpy <span class="im">as</span> np <span class="im">import</span> tensorflow <span class="im">as</span> tf <span class="op">%</span>matplotlib inline <span class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt <span class="im">from</span> tensorflow.models.rnn.ptb <span class="im">import</span> reader <span class="co">#data from http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz</span> raw_data <span class="op">=</span> reader.ptb_raw_data(<span class="st">&#39;ptb_data&#39;</span>) train_data, val_data, test_data, num_classes <span class="op">=</span> raw_data <span class="kw">def</span> gen_epochs(n, num_steps, batch_size): <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(n): <span class="cf">yield</span> reader.ptb_iterator(train_data, batch_size, num_steps)</code></pre></div> <h4 id="model">Model</h4> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> build_graph(num_steps, bptt_steps <span class="op">=</span> <span class="dv">4</span>, batch_size <span class="op">=</span> <span class="dv">200</span>, num_classes <span class="op">=</span> num_classes, 
state_size <span class="op">=</span> <span class="dv">4</span>, embed_size <span class="op">=</span> <span class="dv">50</span>, learning_rate <span class="op">=</span> <span class="fl">0.01</span>): <span class="co">&quot;&quot;&quot;</span> <span class="co"> Builds graph for a simple RNN</span> <span class="co"> Notable parameters:</span> <span class="co"> num_steps: sequence length / steps for TF-style truncated backprop</span> <span class="co"> bptt_steps: number of steps for true truncated backprop</span> <span class="co"> &quot;&quot;&quot;</span> g <span class="op">=</span> tf.get_default_graph() <span class="co"># placeholders</span> x <span class="op">=</span> tf.placeholder(tf.int32, [batch_size, <span class="va">None</span>], name<span class="op">=</span><span class="st">&#39;input_placeholder&#39;</span>) y <span class="op">=</span> tf.placeholder(tf.int32, [batch_size, <span class="va">None</span>], name<span class="op">=</span><span class="st">&#39;labels_placeholder&#39;</span>) default_init_state <span class="op">=</span> tf.zeros([batch_size, state_size]) init_state <span class="op">=</span> tf.placeholder_with_default(default_init_state, [batch_size, state_size], name<span class="op">=</span><span class="st">&#39;state_placeholder&#39;</span>) dropout <span class="op">=</span> tf.placeholder(tf.float32, [], name<span class="op">=</span><span class="st">&#39;dropout_placeholder&#39;</span>) x_one_hot <span class="op">=</span> tf.one_hot(x, num_classes) x_as_list <span class="op">=</span> [tf.squeeze(i, squeeze_dims<span class="op">=</span>[<span class="dv">1</span>]) <span class="cf">for</span> i <span class="kw">in</span> tf.split(<span class="dv">1</span>, num_steps, x_one_hot)] <span class="cf">with</span> tf.variable_scope(<span class="st">&#39;embeddings&#39;</span>): embeddings <span class="op">=</span> tf.get_variable(<span class="st">&#39;embedding_matrix&#39;</span>, [num_classes, embed_size]) <span class="kw">def</span> 
embedding_lookup(one_hot_input): <span class="cf">with</span> tf.variable_scope(<span class="st">&#39;embeddings&#39;</span>, reuse<span class="op">=</span><span class="va">True</span>): embeddings <span class="op">=</span> tf.get_variable(<span class="st">&#39;embedding_matrix&#39;</span>, [num_classes, embed_size]) embeddings <span class="op">=</span> tf.identity(embeddings) g.add_to_collection(<span class="st">&#39;embeddings&#39;</span>, embeddings) <span class="cf">return</span> tf.matmul(one_hot_input, embeddings) rnn_inputs <span class="op">=</span> [embedding_lookup(i) <span class="cf">for</span> i <span class="kw">in</span> x_as_list] <span class="co">#apply dropout to inputs</span> rnn_inputs <span class="op">=</span> [tf.nn.dropout(x, dropout) <span class="cf">for</span> x <span class="kw">in</span> rnn_inputs] <span class="co"># rnn_cells</span> <span class="cf">with</span> tf.variable_scope(<span class="st">&#39;rnn_cell&#39;</span>): W <span class="op">=</span> tf.get_variable(<span class="st">&#39;W&#39;</span>, [embed_size <span class="op">+</span> state_size, state_size]) b <span class="op">=</span> tf.get_variable(<span class="st">&#39;b&#39;</span>, [state_size], initializer<span class="op">=</span>tf.constant_initializer(<span class="fl">0.0</span>)) <span class="kw">def</span> rnn_cell(rnn_input, state): <span class="cf">with</span> tf.variable_scope(<span class="st">&#39;rnn_cell&#39;</span>, reuse<span class="op">=</span><span class="va">True</span>): W <span class="op">=</span> tf.get_variable(<span class="st">&#39;W&#39;</span>, [embed_size <span class="op">+</span> state_size, state_size]) W <span class="op">=</span> tf.identity(W) g.add_to_collection(<span class="st">&#39;Ws&#39;</span>, W) b <span class="op">=</span> tf.get_variable(<span class="st">&#39;b&#39;</span>, [state_size], initializer<span class="op">=</span>tf.constant_initializer(<span class="fl">0.0</span>)) b <span class="op">=</span> tf.identity(b) 
g.add_to_collection(<span class="st">&#39;bs&#39;</span>, b) <span class="cf">return</span> tf.tanh(tf.matmul(tf.concat(<span class="dv">1</span>, [rnn_input, state]), W) <span class="op">+</span> b) state <span class="op">=</span> init_state rnn_outputs <span class="op">=</span> [] <span class="cf">for</span> rnn_input <span class="kw">in</span> rnn_inputs: state <span class="op">=</span> rnn_cell(rnn_input, state) rnn_outputs.append(state) <span class="co">#apply dropout to outputs</span> rnn_outputs <span class="op">=</span> [tf.nn.dropout(x, dropout) <span class="cf">for</span> x <span class="kw">in</span> rnn_outputs] final_state <span class="op">=</span> rnn_outputs[<span class="op">-</span><span class="dv">1</span>] <span class="co">#logits and predictions</span> <span class="cf">with</span> tf.variable_scope(<span class="st">&#39;softmax&#39;</span>): W <span class="op">=</span> tf.get_variable(<span class="st">&#39;W_softmax&#39;</span>, [state_size, num_classes]) b <span class="op">=</span> tf.get_variable(<span class="st">&#39;b_softmax&#39;</span>, [num_classes], initializer<span class="op">=</span>tf.constant_initializer(<span class="fl">0.0</span>)) logits <span class="op">=</span> [tf.matmul(rnn_output, W) <span class="op">+</span> b <span class="cf">for</span> rnn_output <span class="kw">in</span> rnn_outputs] predictions <span class="op">=</span> [tf.nn.softmax(logit) <span class="cf">for</span> logit <span class="kw">in</span> logits] <span class="co">#losses</span> y_as_list <span class="op">=</span> [tf.squeeze(i, squeeze_dims<span class="op">=</span>[<span class="dv">1</span>]) <span class="cf">for</span> i <span class="kw">in</span> tf.split(<span class="dv">1</span>, num_steps, y)] losses <span class="op">=</span> [tf.nn.sparse_softmax_cross_entropy_with_logits(logit,label) <span class="op">\</span> <span class="cf">for</span> logit, label <span class="kw">in</span> <span class="bu">zip</span>(logits, y_as_list)] total_loss <span 
class="op">=</span> tf.reduce_mean(losses) <span class="co">&quot;&quot;&quot;</span> <span class="co"> Implementation of true truncated backprop using TF&#39;s high-level gradients function.</span> <span class="co"> Because I add gradient-ops for each error, there are a number of duplicate operations,</span> <span class="co"> making this a slow implementation. It would be considerably more effort to write an</span> <span class="co"> efficient implementation, however, so for testing purposes, it&#39;s OK that this goes slow.</span> <span class="co"> An efficient implementation would still require all of the same operations as the full</span> <span class="co"> backpropagation through time of errors in a sequence, and so any advantage would not come</span> <span class="co"> from speed, but from having a better distribution of backpropagated errors.</span> <span class="co"> &quot;&quot;&quot;</span> embed_by_step <span class="op">=</span> g.get_collection(<span class="st">&#39;embeddings&#39;</span>) Ws_by_step <span class="op">=</span> g.get_collection(<span class="st">&#39;Ws&#39;</span>) bs_by_step <span class="op">=</span> g.get_collection(<span class="st">&#39;bs&#39;</span>) <span class="co"># Collect gradients for each step in a list</span> embed_grads <span class="op">=</span> [] W_grads <span class="op">=</span> [] b_grads <span class="op">=</span> [] <span class="co"># Keeping track of vanishing gradients for my own curiosity</span> vanishing_grad_list <span class="op">=</span> [] <span class="co"># Loop through the errors, and backpropagate them to the relevant nodes</span> <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(num_steps): start <span class="op">=</span> <span class="bu">max</span>(<span class="dv">0</span>,i<span class="op">+</span><span class="dv">1</span><span class="op">-</span>bptt_steps) stop <span class="op">=</span> i<span class="op">+</span><span class="dv">1</span> grad_list <span
class="op">=</span> tf.gradients(losses[i], embed_by_step[start:stop] <span class="op">+\</span> Ws_by_step[start:stop] <span class="op">+\</span> bs_by_step[start:stop]) embed_grads <span class="op">+=</span> grad_list[<span class="dv">0</span> : stop <span class="op">-</span> start] W_grads <span class="op">+=</span> grad_list[stop <span class="op">-</span> start : <span class="dv">2</span> <span class="op">*</span> (stop <span class="op">-</span> start)] b_grads <span class="op">+=</span> grad_list[<span class="dv">2</span> <span class="op">*</span> (stop <span class="op">-</span> start) : ] <span class="cf">if</span> i <span class="op">&gt;=</span> bptt_steps: vanishing_grad_list.append(grad_list[stop <span class="op">-</span> start : <span class="dv">2</span> <span class="op">*</span> (stop <span class="op">-</span> start)]) grad_embed <span class="op">=</span> tf.add_n(embed_grads) <span class="op">/</span> (batch_size <span class="op">*</span> bptt_steps) grad_W <span class="op">=</span> tf.add_n(W_grads) <span class="op">/</span> (batch_size <span class="op">*</span> bptt_steps) grad_b <span class="op">=</span> tf.add_n(b_grads) <span class="op">/</span> (batch_size <span class="op">*</span> bptt_steps) <span class="co">&quot;&quot;&quot;</span> <span class="co"> Training steps</span> <span class="co"> &quot;&quot;&quot;</span> opt <span class="op">=</span> tf.train.AdamOptimizer(learning_rate) grads_and_vars_tf_style <span class="op">=</span> opt.compute_gradients(total_loss, tf.trainable_variables()) grads_and_vars_true_bptt <span class="op">=</span> <span class="op">\</span> [(grad_embed, tf.trainable_variables()[<span class="dv">0</span>]), (grad_W, tf.trainable_variables()[<span class="dv">1</span>]), (grad_b, tf.trainable_variables()[<span class="dv">2</span>])] <span class="op">+</span> <span class="op">\</span> opt.compute_gradients(total_loss, tf.trainable_variables()[<span class="dv">3</span>:]) train_tf_style <span class="op">=</span> 
opt.apply_gradients(grads_and_vars_tf_style) train_true_bptt <span class="op">=</span> opt.apply_gradients(grads_and_vars_true_bptt) <span class="cf">return</span> <span class="bu">dict</span>( train_tf_style <span class="op">=</span> train_tf_style, train_true_bptt <span class="op">=</span> train_true_bptt, gvs_tf_style <span class="op">=</span> grads_and_vars_tf_style, gvs_true_bptt <span class="op">=</span> grads_and_vars_true_bptt, gvs_gradient_check <span class="op">=</span> opt.compute_gradients(losses[<span class="op">-</span><span class="dv">1</span>], tf.trainable_variables()), loss <span class="op">=</span> total_loss, final_state <span class="op">=</span> final_state, x<span class="op">=</span>x, y<span class="op">=</span>y, init_state<span class="op">=</span>init_state, dropout<span class="op">=</span>dropout, vanishing_grads<span class="op">=</span>vanishing_grad_list )</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> reset_graph(): <span class="cf">if</span> <span class="st">&#39;sess&#39;</span> <span class="kw">in</span> <span class="bu">globals</span>() <span class="kw">and</span> sess: sess.close() tf.reset_default_graph()</code></pre></div> <h3 id="some-quick-tests">Some quick tests</h3> <h4 id="timing-test">Timing test</h4> <p>As expected, my implementation of true BPTT is slow as there are duplicate operations being performed. 
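As a point of reference, Sutskever’s k1/k2 schedule can be sketched in a few lines (a hedged illustration of my own; `forward_step` and `backward_k2_steps` are hypothetical callbacks, not functions defined in this post):

```python
def truncated_bptt(sequence, k1, k2, forward_step, backward_k2_steps):
    """Sutskever-style truncated BPTT: process the sequence one timestep
    at a time, and every k1 timesteps run BPTT for (up to) k2 timesteps.
    forward_step and backward_k2_steps are hypothetical callbacks."""
    history = []
    for t, x in enumerate(sequence, start=1):
        history.append(forward_step(x))
        if t % k1 == 0:
            backward_k2_steps(history[-k2:])  # backprop through last k2 steps

# "True" truncation (k1 = 1): a backward pass at every timestep.
calls = []
truncated_bptt(range(12), k1=1, k2=3,
               forward_step=lambda x: x,
               backward_k2_steps=lambda h: calls.append(len(h)))

# Tensorflow-style (k1 = k2 = 3): backward passes tile the sequence.
calls_tf = []
truncated_bptt(range(12), k1=3, k2=3,
               forward_step=lambda x: x,
               backward_k2_steps=lambda h: calls_tf.append(len(h)))

print(len(calls), len(calls_tf))
```

With k1 = 1, a k2-step backward pass runs at every timestep, which is why a naive implementation does roughly k2 times the backward work of the Tensorflow-style schedule.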
An efficient implementation would run at roughly the same speed as the full backpropagation.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">reset_graph() g <span class="op">=</span> build_graph(num_steps <span class="op">=</span> <span class="dv">40</span>, bptt_steps <span class="op">=</span> <span class="dv">20</span>) sess <span class="op">=</span> tf.InteractiveSession() sess.run(tf.initialize_all_variables()) X, Y <span class="op">=</span> <span class="bu">next</span>(reader.ptb_iterator(train_data, batch_size<span class="op">=</span><span class="dv">200</span>, num_steps<span class="op">=</span><span class="dv">40</span>))</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="op">%%</span>timeit gvs_bptt <span class="op">=</span> sess.run(g[<span class="st">&#39;gvs_true_bptt&#39;</span>], feed_dict<span class="op">=</span>{g[<span class="st">&#39;x&#39;</span>]:X, g[<span class="st">&#39;y&#39;</span>]:Y, g[<span class="st">&#39;dropout&#39;</span>]: <span class="dv">1</span>})</code></pre></div> <pre><code>10 loops, best of 3: 173 ms per loop</code></pre> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="op">%%</span>timeit gvs_tf <span class="op">=</span> sess.run(g[<span class="st">&#39;gvs_tf_style&#39;</span>], feed_dict<span class="op">=</span>{g[<span class="st">&#39;x&#39;</span>]:X, g[<span class="st">&#39;y&#39;</span>]:Y, g[<span class="st">&#39;dropout&#39;</span>]: <span class="dv">1</span>})</code></pre></div> <pre><code>10 loops, best of 3: 80.2 ms per loop</code></pre> <h4 id="vanshing-gradients-demonstration">Vanishing gradients demonstration</h4> <p>To demonstrate the vanishing gradient problem, I collected the gradient norms at each backpropagation step.
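Before looking at the collected numbers, the mechanism itself is easy to reproduce in isolation. The sketch below is my own toy illustration (the state size, weight scale, and noisy inputs are arbitrary assumptions, unrelated to the PTB model above): backpropagating through a tanh RNN multiplies the gradient by diag(1 - h^2) W^T at every step, so with small recurrent weights its norm decays roughly geometrically.

```python
import numpy as np

np.random.seed(0)
state_size = 4
# Small recurrent weight matrix; its spectral norm is well below 1,
# so backpropagated gradients shrink at every step.
W = np.random.randn(state_size, state_size) * 0.1

# Forward pass: collect the tanh activations at each timestep.
h = np.zeros(state_size)
activations = []
for _ in range(20):
    h = np.tanh(W @ h + 0.1 * np.random.randn(state_size))
    activations.append(h)

# Backward pass: push a unit error from the last timestep backwards.
# Each step multiplies the gradient by diag(1 - h**2) @ W.T.
grad = np.ones(state_size)
norms = []
for h in reversed(activations):
    grad = (1.0 - h ** 2) * (W.T @ grad)
    norms.append(np.linalg.norm(grad))

print("norm after 1 step back: ", norms[0])
print("norm after 20 steps back:", norms[-1])
```

After twenty steps the gradient norm is many orders of magnitude smaller, which is the same qualitative behaviour the collected data below shows for the real model.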
As you can see, the gradients vanish very quickly, decreasing by a factor of about 3-4 at each step.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">vanishing_grads, gvs <span class="op">=</span> sess.run([g[<span class="st">&#39;vanishing_grads&#39;</span>], g[<span class="st">&#39;gvs_true_bptt&#39;</span>]], feed_dict<span class="op">=</span>{g[<span class="st">&#39;x&#39;</span>]:X, g[<span class="st">&#39;y&#39;</span>]:Y, g[<span class="st">&#39;dropout&#39;</span>]: <span class="dv">1</span>}) vanishing_grads <span class="op">=</span> np.array(vanishing_grads) weights <span class="op">=</span> gvs[<span class="dv">1</span>][<span class="dv">1</span>] <span class="co"># sum all the grads from each loss node</span> vanishing_grads <span class="op">=</span> np.<span class="bu">sum</span>(vanishing_grads, axis<span class="op">=</span><span class="dv">0</span>) <span class="co"># now calculate the l1 norm at each bptt step</span> vanishing_grads <span class="op">=</span> np.<span class="bu">sum</span>(np.<span class="bu">sum</span>(np.<span class="bu">abs</span>(vanishing_grads),axis<span class="op">=</span><span class="dv">1</span>),axis<span class="op">=</span><span class="dv">1</span>) vanishing_grads</code></pre></div> <pre><code>array([ 5.28676978e-08, 1.51207473e-07, 4.04591049e-07, 1.55859300e-06, 5.00411124e-06, 1.32292716e-05, 3.94736344e-05, 1.17605050e-04, 3.37805774e-04, 1.01710076e-03, 2.74375151e-03, 8.92040879e-03, 2.23708227e-02, 7.23497868e-02, 2.45202959e-01, 7.39126682e-01, 2.19093657e+00, 6.16793633e+00, 2.27248211e+01, 9.78200531e+01], dtype=float32)</code></pre> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="bu">len</span>(vanishing_grads) <span class="op">-</span> <span class="dv">1</span>): <span class="bu">print</span>(vanishing_grads[i<span
class="op">+</span><span class="dv">1</span>] <span class="op">/</span> vanishing_grads[i])</code></pre></div> <pre><code>2.86011 2.67573 3.85227 3.21066 2.64368 2.98381 2.97933 2.87237 3.0109 2.69762 3.25117 2.50782 3.23411 3.38913 3.01435 2.96422 2.81521 3.68435 4.30455</code></pre> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">plt.plot(vanishing_grads)</code></pre></div> <figure> <img src="https://r2rt.com/static/images/RNN_output_19_1.png" alt="Plot of Vanishing Gradients" /><figcaption>Plot of Vanishing Gradients</figcaption> </figure> <h4 id="quick-accuracy-test">Quick accuracy test</h4> <p>A sanity check to make sure the true truncated backpropagation algorithm is doing the right thing.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># first test using bptt_steps &gt;= num_steps</span> reset_graph() g <span class="op">=</span> build_graph(num_steps <span class="op">=</span> <span class="dv">7</span>, bptt_steps <span class="op">=</span> <span class="dv">7</span>) X, Y <span class="op">=</span> <span class="bu">next</span>(reader.ptb_iterator(train_data, batch_size<span class="op">=</span><span class="dv">200</span>, num_steps<span class="op">=</span><span class="dv">7</span>)) <span class="cf">with</span> tf.Session() <span class="im">as</span> sess: sess.run(tf.initialize_all_variables()) gvs_bptt, gvs_tf <span class="op">=\</span> sess.run([g[<span class="st">&#39;gvs_true_bptt&#39;</span>],g[<span class="st">&#39;gvs_tf_style&#39;</span>]], feed_dict<span class="op">=</span>{g[<span class="st">&#39;x&#39;</span>]:X, g[<span class="st">&#39;y&#39;</span>]:Y, g[<span class="st">&#39;dropout&#39;</span>]: <span class="fl">0.8</span>}) <span class="co"># assert embedding gradients are the same</span> <span class="cf">assert</span>(np.<span class="bu">max</span>(gvs_bptt[<span class="dv">0</span>][<span class="dv">0</span>] <span class="op">-</span> 
gvs_tf[<span class="dv">0</span>][<span class="dv">0</span>]) <span class="op">&lt;</span> <span class="fl">1e-4</span>) <span class="co"># assert weight gradients are the same</span> <span class="cf">assert</span>(np.<span class="bu">max</span>(gvs_bptt[<span class="dv">1</span>][<span class="dv">0</span>] <span class="op">-</span> gvs_tf[<span class="dv">1</span>][<span class="dv">0</span>]) <span class="op">&lt;</span> <span class="fl">1e-4</span>) <span class="co"># assert bias gradients are the same</span> <span class="cf">assert</span>(np.<span class="bu">max</span>(gvs_bptt[<span class="dv">2</span>][<span class="dv">0</span>] <span class="op">-</span> gvs_tf[<span class="dv">2</span>][<span class="dv">0</span>]) <span class="op">&lt;</span> <span class="fl">1e-4</span>)</code></pre></div> <h3 id="experiment">Experiment</h3> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co">&quot;&quot;&quot;</span> <span class="co">Train the network</span> <span class="co">&quot;&quot;&quot;</span> <span class="kw">def</span> train_network(num_epochs, num_steps, use_true_bptt, batch_size <span class="op">=</span> <span class="dv">200</span>, bptt_steps <span class="op">=</span> <span class="dv">7</span>, state_size <span class="op">=</span> <span class="dv">4</span>, learning_rate <span class="op">=</span> <span class="fl">0.01</span>, dropout <span class="op">=</span> <span class="fl">0.8</span>, verbose <span class="op">=</span> <span class="va">True</span>): reset_graph() tf.set_random_seed(<span class="dv">1234</span>) g <span class="op">=</span> build_graph(num_steps <span class="op">=</span> num_steps, bptt_steps <span class="op">=</span> bptt_steps, state_size <span class="op">=</span> state_size, batch_size <span class="op">=</span> batch_size, learning_rate <span class="op">=</span> learning_rate) <span class="cf">if</span> use_true_bptt: train_step <span class="op">=</span> g[<span 
class="st">&#39;train_true_bptt&#39;</span>] <span class="cf">else</span>: train_step <span class="op">=</span> g[<span class="st">&#39;train_tf_style&#39;</span>] <span class="cf">with</span> tf.Session() <span class="im">as</span> sess: sess.run(tf.initialize_all_variables()) training_losses <span class="op">=</span> [] val_losses <span class="op">=</span> [] <span class="cf">for</span> idx, epoch <span class="kw">in</span> <span class="bu">enumerate</span>(gen_epochs(num_epochs, num_steps, batch_size)): training_loss <span class="op">=</span> <span class="dv">0</span> steps <span class="op">=</span> <span class="dv">0</span> training_state <span class="op">=</span> np.zeros((batch_size, state_size)) <span class="cf">for</span> X, Y <span class="kw">in</span> epoch: steps <span class="op">+=</span> <span class="dv">1</span> training_loss_, training_state, _ <span class="op">=</span> sess.run([g[<span class="st">&#39;loss&#39;</span>], g[<span class="st">&#39;final_state&#39;</span>], train_step], feed_dict<span class="op">=</span>{g[<span class="st">&#39;x&#39;</span>]: X, g[<span class="st">&#39;y&#39;</span>]: Y, g[<span class="st">&#39;dropout&#39;</span>]: dropout, g[<span class="st">&#39;init_state&#39;</span>]: training_state}) training_loss <span class="op">+=</span> training_loss_ <span class="cf">if</span> verbose: <span class="bu">print</span>(<span class="st">&quot;Average training loss for Epoch&quot;</span>, idx, <span class="st">&quot;:&quot;</span>, training_loss<span class="op">/</span>steps) training_losses.append(training_loss<span class="op">/</span>steps) val_loss <span class="op">=</span> <span class="dv">0</span> steps <span class="op">=</span> <span class="dv">0</span> training_state <span class="op">=</span> np.zeros((batch_size, state_size)) <span class="cf">for</span> X,Y <span class="kw">in</span> reader.ptb_iterator(val_data, batch_size, num_steps): steps <span class="op">+=</span> <span class="dv">1</span> val_loss_, training_state 
<span class="op">=</span> sess.run([g[<span class="st">&#39;loss&#39;</span>], g[<span class="st">&#39;final_state&#39;</span>]], feed_dict<span class="op">=</span>{g[<span class="st">&#39;x&#39;</span>]: X, g[<span class="st">&#39;y&#39;</span>]: Y, g[<span class="st">&#39;dropout&#39;</span>]: <span class="dv">1</span>, g[<span class="st">&#39;init_state&#39;</span>]: training_state}) val_loss <span class="op">+=</span> val_loss_ <span class="cf">if</span> verbose: <span class="bu">print</span>(<span class="st">&quot;Average validation loss for Epoch&quot;</span>, idx, <span class="st">&quot;:&quot;</span>, val_loss<span class="op">/</span>steps) <span class="bu">print</span>(<span class="st">&quot;***&quot;</span>) val_losses.append(val_loss<span class="op">/</span>steps) <span class="cf">return</span> training_losses, val_losses</code></pre></div> <h3 id="results">Results</h3> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># Procedure to collect results</span> <span class="co"># Note: this takes a few hours to run</span> bptt_steps <span class="op">=</span> [(<span class="dv">5</span>,<span class="dv">20</span>), (<span class="dv">10</span>,<span class="dv">30</span>), (<span class="dv">20</span>,<span class="dv">40</span>), (<span class="dv">40</span>,<span class="dv">40</span>)] lrs <span class="op">=</span> [<span class="fl">0.003</span>, <span class="fl">0.001</span>, <span class="fl">0.0003</span>] <span class="cf">for</span> bptt_step, lr <span class="kw">in</span> ((x, y) <span class="cf">for</span> x <span class="kw">in</span> bptt_steps <span class="cf">for</span> y <span class="kw">in</span> lrs): _, val_losses <span class="op">=</span> <span class="op">\</span> train_network(<span class="dv">20</span>, bptt_step[<span class="dv">0</span>], use_true_bptt<span class="op">=</span><span class="va">False</span>, state_size<span class="op">=</span><span class="dv">100</span>, batch_size<span 
class="op">=</span><span class="dv">32</span>, learning_rate<span class="op">=</span>lr, verbose<span class="op">=</span><span class="va">False</span>) <span class="bu">print</span>(<span class="st">&quot;** TF STYLE **&quot;</span>, bptt_step, lr) <span class="bu">print</span>(np.<span class="bu">min</span>(val_losses)) <span class="cf">if</span> bptt_step[<span class="dv">0</span>] <span class="op">!=</span> <span class="dv">0</span>: _, val_losses <span class="op">=</span> <span class="op">\</span> train_network(<span class="dv">20</span>, bptt_step[<span class="dv">1</span>], use_true_bptt<span class="op">=</span><span class="va">True</span>, bptt_steps<span class="op">=</span> bptt_step[<span class="dv">0</span>], state_size<span class="op">=</span><span class="dv">100</span>, batch_size<span class="op">=</span><span class="dv">32</span>, learning_rate<span class="op">=</span>lr, verbose<span class="op">=</span><span class="va">False</span>) <span class="bu">print</span>(<span class="st">&quot;** TRUE STYLE **&quot;</span>, bptt_step, lr) <span class="bu">print</span>(np.<span class="bu">min</span>(val_losses))</code></pre></div> <p>Here are the results in a table:</p> <h5 id="minimum-validation-loss-achieved-in-20-epochs">Minimum validation loss achieved in 20 epochs:</h5> <table> <tbody> <tr class="odd"> <td>BPTT Steps</td> <td>5</td> <td></td> <td></td> </tr> <tr class="even"> <td>Learning Rate</td> <td>0.003</td> <td>0.001</td> <td>0.0003</td> </tr> <tr class="odd"> <td>True (20-seq)</td> <td><strong>5.12</strong></td> <td><strong>5.01</strong></td> <td>5.09</td> </tr> <tr class="even"> <td>TF Style</td> <td>5.21</td> <td>5.04</td> <td><strong>5.04</strong></td> </tr> <tr class="odd"> <td></td> <td></td> <td></td> <td></td> </tr> <tr class="even"> <td>BPTT Steps</td> <td>10</td> <td></td> <td></td> </tr> <tr class="odd"> <td>Learning Rate</td> <td>0.003</td> <td>0.001</td> <td>0.0003</td> </tr> <tr class="even"> <td>True (30-seq)</td> 
<td><strong>5.07</strong></td> <td><strong>5.00</strong></td> <td>5.12</td> </tr> <tr class="odd"> <td>TF Style</td> <td>5.15</td> <td>5.03</td> <td><strong>5.05</strong></td> </tr> <tr class="even"> <td></td> <td></td> <td></td> <td></td> </tr> <tr class="odd"> <td>BPTT Steps</td> <td>20</td> <td></td> <td></td> </tr> <tr class="even"> <td>Learning Rate</td> <td>0.003</td> <td>0.001</td> <td>0.0003</td> </tr> <tr class="odd"> <td>True (40-seq)</td> <td><strong>5.05</strong></td> <td>5.00</td> <td>5.15</td> </tr> <tr class="even"> <td>TF Style</td> <td>5.11</td> <td><strong>4.99</strong></td> <td><strong>5.08</strong></td> </tr> <tr class="odd"> <td></td> <td></td> <td></td> <td></td> </tr> <tr class="even"> <td>BPTT Steps</td> <td>40</td> <td></td> <td></td> </tr> <tr class="odd"> <td>Learning Rate</td> <td>0.003</td> <td>0.001</td> <td>0.0003</td> </tr> <tr class="even"> <td>TF Style</td> <td>5.05</td> <td>4.99</td> <td>5.15</td> </tr> </tbody> </table> <h3 id="discussion">Discussion</h3> <p>As you can see, true truncated backpropagation seems to have an advantage over Tensorflow-style truncated backpropagation when truncating errors at the same number of steps. However, this advantage completely disappears (and actually reverses) when comparing true truncated backpropagation to Tensorflow-style truncated backpropagation that uses the same sequence length.</p> <p>This suggests two things:</p> <ul> <li>Because a well-implemented true truncated backpropagation algorithm would run about as fast as full backpropagation over the sequence, and full backpropagation performs slightly better, it is most likely not worth implementing an efficient true truncated backpropagation algorithm.</li> <li>Since true truncated backpropagation outperforms Tensorflow-style truncated backpropagation when truncating errors to the same number of steps, we might conclude that Tensorflow-style truncated backpropagation does not effectively backpropagate errors the full n-steps. 
Thus, if you need to capture n-step dependencies with Tensorflow-style truncated backpropagation, you may benefit from using a num_steps that is appreciably higher than n in order to effectively backpropagate errors the desired n steps.</li> </ul> <p><strong>Edit</strong>: After writing this post, I discovered that this distinction between styles of truncated backpropagation is discussed in <a href="https://web.stanford.edu/class/psych209a/ReadingsByDate/02_25/Williams%20Zipser95RecNets.pdf">Williams and Zipser (1992), Gradient-Based Learning Algorithms for Recurrent Networks and Their Computational Complexity</a>. The authors refer to “true” truncated backpropagation as “truncated backpropagation” or BPTT(n) [or BPTT(n, 1)], whereas they refer to Tensorflow-style truncated backpropagation as “epochwise truncated backpropagation” or BPTT(n, n). They also allow for semi-epochwise truncated BPTT, which would do a backward pass more often than once per sequence, but less often than all possible times (i.e., in Ilya Sutskever’s language used above, this would be BPTT(k2, k1), where 1 &lt; k1 &lt; k2).</p> <p>In <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.56.7941&amp;rep=rep1&amp;type=pdf">Williams and Peng (1990), An Efficient Gradient-Based Algorithm for On-Line Training of Recurrent Network Trajectories</a>, the authors conduct a similar experiment to the one in this post, and reach similar conclusions. In particular, Williams and Peng write that: “The results of these experiments have been that the success rate of BPTT(2h; h) is essentially identical to that of BPTT(h)”.
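</p>
<p>As a schematic illustration (my own sketch, with illustrative names; this is not code from either paper), the two update schedules being compared can be written as:</p>

```python
# Schematic sketch of the two truncated-BPTT update schedules.
# BPTT(h) = BPTT(h, 1): a backward pass at every step, each reaching back
# up to h steps. BPTT(2h, h): a backward pass every h steps, each reaching
# back up to 2h steps.

def bptt_schedule(total_steps, window, every):
    """Yield (step, span) pairs: at `step`, run a backward pass over the
    last `span` steps (clipped at the start of the sequence)."""
    for t in range(1, total_steps + 1):
        if t % every == 0:
            yield t, min(window, t)

h, T = 3, 12
true_bptt = list(bptt_schedule(T, window=h, every=1))      # BPTT(h, 1)
epochwise = list(bptt_schedule(T, window=2 * h, every=h))  # BPTT(2h, h)

print(len(true_bptt), "backward passes, each up to", h, "steps deep")
print(len(epochwise), "backward passes, each up to", 2 * h, "steps deep")
```

<p>Both schemes truncate gradients at a fixed horizon; they differ only in how often the backward pass runs and how wide each pass is, which is the trade-off the two papers evaluate.</p>
<p>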
In other words, they compared “true” truncated backpropagation, with h steps of truncation, to BPTT(2h, h), which is similar to Tensorflow-style backpropagation and has 2h steps of truncation, and found that they performed similarly.</p> </body> </html> Recurrent Neural Networks in Tensorflow I2016-07-11T00:00:00-04:002016-07-11T00:00:00-04:00Silviu Pitistag:r2rt.com,2016-07-11:/recurrent-neural-networks-in-tensorflow-i.htmlThis is the first in a series of posts about recurrent neural networks in Tensorflow. In this post, we will build a vanilla recurrent neural network (RNN) from the ground up in Tensorflow, and then translate the model into Tensorflow's RNN API.<!DOCTYPE html> <html> <head> <meta charset="utf-8"> <meta name="generator" content="pandoc"> <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes"> <title></title> <style type="text/css">code{white-space: pre;}</style> <script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-MML-AM_CHTML" type="text/javascript"></script> </head> <body> <p>This is the first in a series of posts about recurrent neural networks in Tensorflow. In this post, we will build a vanilla recurrent neural network (RNN) from the ground up in Tensorflow, and then translate the model into Tensorflow’s RNN API.</p> <p><strong>Edit 2017/03/07</strong>: Updated to work with Tensorflow 1.0.</p> <h3 id="introduction-to-rnns">Introduction to RNNs</h3> <p>RNNs are neural networks that accept their own outputs as inputs.
So as to not reinvent the wheel, here are a few blog posts to introduce you to RNNs:</p> <ol type="1"> <li><a href="https://r2rt.com/written-memories-understanding-deriving-and-extending-the-lstm.html">Written Memories: Understanding, Deriving and Extending the LSTM</a>, on this blog</li> <li><a href="http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/">Recurrent Neural Networks Tutorial</a>, by Denny Britz</li> <li><a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">The Unreasonable Effectiveness of Recurrent Neural Networks</a>, by Andrej Karpathy</li> <li><a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">Understanding LSTM Networks</a>, by Christopher Olah</li> </ol> <h3 id="outline-of-the-data">Outline of the data</h3> <p>In this post, we’ll be building a no frills RNN that accepts a binary sequence X and uses it to predict a binary sequence Y. The sequences are constructed as follows:</p> <ul> <li><strong>Input sequence (X)</strong>: At time step <em>t</em>, <span class="math inline">$$X_t$$</span> has a 50% chance of being 1 (and a 50% chance of being 0). E.g., X might be [1, 0, 0, 1, 1, 1 … ].</li> <li><strong>Output sequence (Y)</strong>: At time step <em>t</em>, <span class="math inline">$$Y_t$$</span> has a base 50% chance of being 1 (and a 50% base chance to be 0). The chance of <span class="math inline">$$Y_t$$</span> being 1 is increased by 50% (i.e., to 100%) if <span class="math inline">$$X_{t-3}$$</span> is 1, and decreased by 25% (i.e., to 25%) if <span class="math inline">$$X_{t-8}$$</span> is 1. 
If both <span class="math inline">$$X_{t-3}$$</span> and <span class="math inline">$$X_{t-8}$$</span> are 1, the chance of <span class="math inline">$$Y_{t}$$</span> being 1 is 50% + 50% - 25% = 75%.</li> </ul> <p>Thus, there are two dependencies in the data: one at <em>t</em>-3 (3 steps back) and one at <em>t</em>-8 (8 steps back).</p> <p>This data is simple enough that we can calculate the expected cross-entropy loss for a trained RNN depending on whether or not it learns the dependencies:</p> <ul> <li>If the network learns no dependencies, it will correctly assign a probability of 62.5% to 1, for an expected cross-entropy loss of about <strong>0.66</strong>.</li> <li>If the network learns only the first dependency (3 steps back) but not the second dependency, it will correctly assign a probability of 87.5%, 50% of the time, and correctly assign a probability of 62.5% the other 50% of the time, for an expected cross-entropy loss of about <strong>0.52</strong>.</li> <li>If the network learns both dependencies, it will be 100% accurate 25% of the time, correctly assign a probability of 50%, 25% of the time, and correctly assign a probability of 75%, 50% of the time, for an expected cross-entropy loss of about <strong>0.45</strong>.</li> </ul> <p>Here are the calculations:</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> numpy <span class="im">as</span> np <span class="bu">print</span>(<span class="st">&quot;Expected cross entropy loss if the model:&quot;</span>) <span class="bu">print</span>(<span class="st">&quot;- learns neither dependency:&quot;</span>, <span class="op">-</span>(<span class="fl">0.625</span> <span class="op">*</span> np.log(<span class="fl">0.625</span>) <span class="op">+</span> <span class="fl">0.375</span> <span class="op">*</span> np.log(<span class="fl">0.375</span>))) <span class="co"># Learns first dependency only ==&gt; 0.51916669970720941</span> <span
class="bu">print</span>(<span class="st">&quot;- learns first dependency: &quot;</span>, <span class="fl">-0.5</span> <span class="op">*</span> (<span class="fl">0.875</span> <span class="op">*</span> np.log(<span class="fl">0.875</span>) <span class="op">+</span> <span class="fl">0.125</span> <span class="op">*</span> np.log(<span class="fl">0.125</span>)) <span class="fl">-0.5</span> <span class="op">*</span> (<span class="fl">0.625</span> <span class="op">*</span> np.log(<span class="fl">0.625</span>) <span class="op">+</span> <span class="fl">0.375</span> <span class="op">*</span> np.log(<span class="fl">0.375</span>))) <span class="bu">print</span>(<span class="st">&quot;- learns both dependencies: &quot;</span>, <span class="fl">-0.50</span> <span class="op">*</span> (<span class="fl">0.75</span> <span class="op">*</span> np.log(<span class="fl">0.75</span>) <span class="op">+</span> <span class="fl">0.25</span> <span class="op">*</span> np.log(<span class="fl">0.25</span>)) <span class="op">-</span> <span class="fl">0.25</span> <span class="op">*</span> (<span class="dv">2</span> <span class="op">*</span> <span class="fl">0.50</span> <span class="op">*</span> np.log (<span class="fl">0.50</span>)) <span class="op">-</span> <span class="fl">0.25</span> <span class="op">*</span> (<span class="dv">0</span>))</code></pre></div> <pre><code>Expected cross entropy loss if the model: - learns neither dependency: 0.661563238158 - learns first dependency: 0.519166699707 - learns both dependencies: 0.454454367449</code></pre> <h3 id="model-architecture">Model architecture</h3> <p>The model will be as simple as possible: at time step <em>t</em>, for <span class="math inline">$$t \in \{0, 1, \dots n\}$$</span> the model accepts a (one-hot) binary <span class="math inline">$$X_t$$</span> vector and a previous state vector, <span class="math inline">$$S_{t-1}$$</span>, as inputs and produces a state vector, <span class="math inline">$$S_t$$</span>, and a predicted 
probability distribution vector, <span class="math inline">$$P_t$$</span>, for the (one-hot) binary vector <span class="math inline">$$Y_t$$</span>.</p> <p>Formally, the model is:</p> <p><span class="math inline">$$S_t = \text{tanh}(W(X_t \ @ \ S_{t-1}) + b_s)$$</span></p> <p><span class="math inline">$$P_t = \text{softmax}(US_t + b_p)$$</span></p> <p>where <span class="math inline">$$@$$</span> represents vector concatenation, <span class="math inline">$$X_t \in R^2$$</span> is a one-hot binary vector, <span class="math inline">$$W \in R^{d \times (2 + d)}, \ b_s \in R^d, \ U \in R^{2 \times d}$$</span>, <span class="math inline">$$b_p \in R^2$$</span> and d is the size of the state vector (I use <span class="math inline">$$d = 4$$</span> below). At time step 0, <span class="math inline">$$S_{-1}$$</span> (the initial state) is initialized as a vector of zeros.</p> <p>Here is a diagram of the model:</p> <figure> <img src="https://r2rt.com/static/images/BasicRNN.png" alt="Diagram of Basic RNN" /><figcaption>Diagram of Basic RNN</figcaption> </figure> <h3 id="how-wide-should-our-tensorflow-graph-be">How wide should our Tensorflow graph be?</h3> <p>To build models in Tensorflow generally, you first represent the model as a graph, and then execute the graph. A critical question we must answer when deciding how to represent our model is: how wide should our graph be? How many time steps of input should our graph accept at once?</p> <p>Each time step is a duplicate, so it might make sense to have our graph, G, represent a single time step: <span class="math inline">$$G(X_t, S_{t-1}) \mapsto (P_t, S_t)$$</span>. We can then execute our graph for each time step, feeding in the state returned from the previous execution into the current execution. This would work for a model that was already trained, but there’s a problem with using this approach for training: the gradients computed during backpropagation are graph-bound. 
We would only be able to backpropagate errors to the current time step; we could not backpropagate the error to time step <em>t-1</em>. This means our network will not be able to learn how to store long-term dependencies (such as the two in our data) in its state.</p> <p>Alternatively, we might make our graph as wide as our data sequence. This often works, except that in this case, we have an arbitrarily long input sequence, so we have to stop somewhere. Let’s say we make our graph accept sequences of length 10,000. This solves the problem of graph-bound gradients, and the errors from time step 9999 are propagated all the way back to time step 0. Unfortunately, such backpropagation is not only (often prohibitively) expensive, but also ineffective, due to the vanishing / exploding gradient problem: it turns out that backpropagating errors over too many time steps often causes them to vanish (become insignificantly small) or explode (become overwhelmingly large). To understand why this is the case, we apply the chain rule repeatedly to <span class="math inline">$$\frac{\partial E_t}{\partial S_{t-k}}$$</span> and observe that there is a product of <span class="math inline">$$k$$</span> factors (Jacobian matrices) linking the gradient at <span class="math inline">$$S_t$$</span> and the gradient at <span class="math inline">$$S_{t-k}$$</span>:</p> <p><span class="math display">$\frac{\partial E_t}{\partial S_{t-k}} = \frac{\partial E_t}{\partial S_t} \frac{\partial S_t}{\partial S_{t-k}} = \frac{\partial E_t}{\partial S_t} \left(\frac{\partial S_t}{\partial S_{t-1}} \frac{\partial S_{t-1}}{\partial S_{t-2}} \dots \frac{\partial S_{t-k+1}}{\partial S_{t-k}}\right) = \frac{\partial E_t}{\partial S_t} \prod_{i=1}^{k}\frac{\partial S_{t-i+1}}{\partial S_{t-i}}$</span></p> <p>In the words of Pascanu et al., “<em>in the same way a product of [k] real numbers can shrink to zero or explode to infinity, so does this product of matrices …</em>” See <a
href="http://arxiv.org/pdf/1211.5063v2.pdf">On the difficulty of training RNNs</a>, by Pascanu et al. or my post <a href="https://r2rt.com/written-memories-understanding-deriving-and-extending-the-lstm.html">Written Memories: Understanding, Deriving and Extending the LSTM</a> for more detailed explanations and references.</p> <p>The usual pattern for dealing with very long sequences is therefore to “truncate” our backpropagation by backpropagating errors a maximum of <span class="math inline">$$n$$</span> steps. We choose <span class="math inline">$$n$$</span> as a hyperparameter to our model, keeping in mind the trade-off: higher <span class="math inline">$$n$$</span> lets us capture longer term dependencies, but is more expensive computationally and memory-wise.</p> <p>A natural interpretation of backpropagating errors a maximum of <span class="math inline">$$n$$</span> steps means that we backpropagate every possible error <span class="math inline">$$n$$</span> steps. That is, if we have a sequence of length 49, and choose <span class="math inline">$$n = 7$$</span>, we would backpropagate 42 of the errors the full 7 steps. <em>This is not the approach we take in Tensorflow.</em> Tensorflow’s approach is to limit the graph to being <span class="math inline">$$n$$</span> units wide. See <a href="https://www.tensorflow.org/versions/r0.9/tutorials/recurrent/index.html#truncated-backpropagation">Tensorflow’s writeup on Truncated Backpropagation</a> (“[Truncated backpropagation] is easy to implement by feeding inputs of length [<span class="math inline">$$n$$</span>] at a time and doing backward pass after each iteration.”). This means that we would take our sequence of length 49, break it up into 7 sub-sequences of length 7 that we feed into the graph in 7 separate computations, and that only the errors from the 7th input in each graph are backpropagated the full 7 steps. 
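</p>
<p>To make the chunk-boundary effect concrete, here is a small sketch (my own illustration, not Tensorflow code) counting how far the error at each position can be backpropagated under this chunking:</p>

```python
# Sketch: a length-49 sequence fed to a 7-step-wide graph as 7 separate
# chunks. Within a chunk, the error at a given position can only reach
# back to the start of that chunk, never into earlier chunks.
n, T = 7, 49

def tf_style_depth(t):
    # number of inputs the error at (0-indexed) step t can see in its chunk
    return t % n + 1

depths = [tf_style_depth(t) for t in range(T)]
full = sum(d == n for d in depths)
print(full, "of", T, "errors see the full", n, "steps")
# prints: 7 of 49 errors see the full 7 steps
```

<p>Only one error per chunk (the last one) sees the full window, which is exactly why the two styles of truncation behave differently.</p>
<p>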
Therefore, even if you think there are no dependencies longer than 7 steps in your data, it may still be worthwhile to use <span class="math inline">$$n &gt; 7$$</span> so as to increase the proportion of errors that are backpropagated by 7 steps. For an empirical investigation of the difference between backpropagating every error <span class="math inline">$$n$$</span> steps and Tensorflow-style backpropagation, see my post on <a href="https://r2rt.com/styles-of-truncated-backpropagation.html">Styles of Truncated Backpropagation</a>.</p> <h3 id="using-lists-of-tensors-to-represent-the-width">Using lists of tensors to represent the width</h3> <p>Our graph will be <span class="math inline">$$n$$</span> units (time steps) wide where each unit is a perfect duplicate, sharing the same variables. The easiest way to build a graph containing these duplicate units is to build each duplicate part in parallel. This is a key point, so I’m bolding it: <strong>the easiest way to represent each type of duplicate tensor (the rnn inputs, the rnn outputs (hidden state), the predictions, and the loss) is as a <em>list</em> of tensors.</strong> Here is a diagram with references to the variables used in the code below:</p> <figure> <img src="https://r2rt.com/static/images/BasicRNNLabeled.png" alt="Diagram of Basic RNN - Labeled" /><figcaption>Diagram of Basic RNN - Labeled</figcaption> </figure> <p>We will run a training step after each execution of the graph, simultaneously grabbing the final state produced by that execution to pass on to the next execution.</p> <p>Without further ado, here is the code:</p> <h4 id="imports-config-variables-and-data-generators">Imports, config variables, and data generators</h4> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> numpy <span class="im">as</span> np <span class="im">import</span> tensorflow <span class="im">as</span> tf <span class="op">%</span>matplotlib inline <span 
class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># Global config variables</span> num_steps <span class="op">=</span> <span class="dv">5</span> <span class="co"># number of truncated backprop steps (&#39;n&#39; in the discussion above)</span> batch_size <span class="op">=</span> <span class="dv">200</span> num_classes <span class="op">=</span> <span class="dv">2</span> state_size <span class="op">=</span> <span class="dv">4</span> learning_rate <span class="op">=</span> <span class="fl">0.1</span></code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> gen_data(size<span class="op">=</span><span class="dv">1000000</span>): X <span class="op">=</span> np.array(np.random.choice(<span class="dv">2</span>, size<span class="op">=</span>(size,))) Y <span class="op">=</span> [] <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(size): threshold <span class="op">=</span> <span class="fl">0.5</span> <span class="cf">if</span> X[i<span class="dv">-3</span>] <span class="op">==</span> <span class="dv">1</span>: threshold <span class="op">+=</span> <span class="fl">0.5</span> <span class="cf">if</span> X[i<span class="dv">-8</span>] <span class="op">==</span> <span class="dv">1</span>: threshold <span class="op">-=</span> <span class="fl">0.25</span> <span class="cf">if</span> np.random.rand() <span class="op">&gt;</span> threshold: Y.append(<span class="dv">0</span>) <span class="cf">else</span>: Y.append(<span class="dv">1</span>) <span class="cf">return</span> X, np.array(Y) <span class="co"># adapted from https://github.com/tensorflow/tensorflow/blob/master/tensorflow/models/rnn/ptb/reader.py</span> <span class="kw">def</span> gen_batch(raw_data, batch_size, num_steps): raw_x, raw_y <span class="op">=</span> 
raw_data data_length <span class="op">=</span> <span class="bu">len</span>(raw_x) <span class="co"># partition raw data into batches and stack them vertically in a data matrix</span> batch_partition_length <span class="op">=</span> data_length <span class="op">//</span> batch_size data_x <span class="op">=</span> np.zeros([batch_size, batch_partition_length], dtype<span class="op">=</span>np.int32) data_y <span class="op">=</span> np.zeros([batch_size, batch_partition_length], dtype<span class="op">=</span>np.int32) <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(batch_size): data_x[i] <span class="op">=</span> raw_x[batch_partition_length <span class="op">*</span> i:batch_partition_length <span class="op">*</span> (i <span class="op">+</span> <span class="dv">1</span>)] data_y[i] <span class="op">=</span> raw_y[batch_partition_length <span class="op">*</span> i:batch_partition_length <span class="op">*</span> (i <span class="op">+</span> <span class="dv">1</span>)] <span class="co"># further divide batch partitions into num_steps for truncated backprop</span> epoch_size <span class="op">=</span> batch_partition_length <span class="op">//</span> num_steps <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(epoch_size): x <span class="op">=</span> data_x[:, i <span class="op">*</span> num_steps:(i <span class="op">+</span> <span class="dv">1</span>) <span class="op">*</span> num_steps] y <span class="op">=</span> data_y[:, i <span class="op">*</span> num_steps:(i <span class="op">+</span> <span class="dv">1</span>) <span class="op">*</span> num_steps] <span class="cf">yield</span> (x, y) <span class="kw">def</span> gen_epochs(n, num_steps): <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(n): <span class="cf">yield</span> gen_batch(gen_data(), batch_size, num_steps)</code></pre></div> <h4 id="model">Model</h4> <div class="sourceCode"><pre class="sourceCode 
python"><code class="sourceCode python"><span class="co">&quot;&quot;&quot;</span> <span class="co">Placeholders</span> <span class="co">&quot;&quot;&quot;</span> x <span class="op">=</span> tf.placeholder(tf.int32, [batch_size, num_steps], name<span class="op">=</span><span class="st">&#39;input_placeholder&#39;</span>) y <span class="op">=</span> tf.placeholder(tf.int32, [batch_size, num_steps], name<span class="op">=</span><span class="st">&#39;labels_placeholder&#39;</span>) init_state <span class="op">=</span> tf.zeros([batch_size, state_size]) <span class="co">&quot;&quot;&quot;</span> <span class="co">RNN Inputs</span> <span class="co">&quot;&quot;&quot;</span> <span class="co"># Turn our x placeholder into a list of one-hot tensors:</span> <span class="co"># rnn_inputs is a list of num_steps tensors with shape [batch_size, num_classes]</span> x_one_hot <span class="op">=</span> tf.one_hot(x, num_classes) rnn_inputs <span class="op">=</span> tf.unstack(x_one_hot, axis<span class="op">=</span><span class="dv">1</span>)</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co">&quot;&quot;&quot;</span> <span class="co">Definition of rnn_cell</span> <span class="co">This is very similar to the __call__ method on Tensorflow&#39;s BasicRNNCell. 
See:</span> <span class="co">https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/rnn/python/ops/core_rnn_cell_impl.py#L95</span> <span class="co">&quot;&quot;&quot;</span> <span class="cf">with</span> tf.variable_scope(<span class="st">&#39;rnn_cell&#39;</span>): W <span class="op">=</span> tf.get_variable(<span class="st">&#39;W&#39;</span>, [num_classes <span class="op">+</span> state_size, state_size]) b <span class="op">=</span> tf.get_variable(<span class="st">&#39;b&#39;</span>, [state_size], initializer<span class="op">=</span>tf.constant_initializer(<span class="fl">0.0</span>)) <span class="kw">def</span> rnn_cell(rnn_input, state): <span class="cf">with</span> tf.variable_scope(<span class="st">&#39;rnn_cell&#39;</span>, reuse<span class="op">=</span><span class="va">True</span>): W <span class="op">=</span> tf.get_variable(<span class="st">&#39;W&#39;</span>, [num_classes <span class="op">+</span> state_size, state_size]) b <span class="op">=</span> tf.get_variable(<span class="st">&#39;b&#39;</span>, [state_size], initializer<span class="op">=</span>tf.constant_initializer(<span class="fl">0.0</span>)) <span class="cf">return</span> tf.tanh(tf.matmul(tf.concat([rnn_input, state], <span class="dv">1</span>), W) <span class="op">+</span> b)</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co">&quot;&quot;&quot;</span> <span class="co">Adding rnn_cells to graph</span> <span class="co">This is a simplified version of the &quot;static_rnn&quot; function from Tensorflow&#39;s api. 
See:</span> <span class="co">https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/rnn/python/ops/core_rnn.py#L41</span> <span class="co">Note: In practice, using &quot;dynamic_rnn&quot; is a better choice than the &quot;static_rnn&quot;:</span> <span class="co">https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/rnn.py#L390</span> <span class="co">&quot;&quot;&quot;</span> state <span class="op">=</span> init_state rnn_outputs <span class="op">=</span> [] <span class="cf">for</span> rnn_input <span class="kw">in</span> rnn_inputs: state <span class="op">=</span> rnn_cell(rnn_input, state) rnn_outputs.append(state) final_state <span class="op">=</span> rnn_outputs[<span class="op">-</span><span class="dv">1</span>]</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co">&quot;&quot;&quot;</span> <span class="co">Predictions, loss, training step</span> <span class="co">Losses is similar to the &quot;sequence_loss&quot;</span> <span class="co">function from Tensorflow&#39;s API, except that here we are using a list of 2D tensors, instead of a 3D tensor.
See:</span> <span class="co">https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/seq2seq/python/ops/loss.py#L30</span> <span class="co">&quot;&quot;&quot;</span> <span class="co">#logits and predictions</span> <span class="cf">with</span> tf.variable_scope(<span class="st">&#39;softmax&#39;</span>): W <span class="op">=</span> tf.get_variable(<span class="st">&#39;W&#39;</span>, [state_size, num_classes]) b <span class="op">=</span> tf.get_variable(<span class="st">&#39;b&#39;</span>, [num_classes], initializer<span class="op">=</span>tf.constant_initializer(<span class="fl">0.0</span>)) logits <span class="op">=</span> [tf.matmul(rnn_output, W) <span class="op">+</span> b <span class="cf">for</span> rnn_output <span class="kw">in</span> rnn_outputs] predictions <span class="op">=</span> [tf.nn.softmax(logit) <span class="cf">for</span> logit <span class="kw">in</span> logits] <span class="co"># Turn our y placeholder into a list of labels</span> y_as_list <span class="op">=</span> tf.unstack(y, num<span class="op">=</span>num_steps, axis<span class="op">=</span><span class="dv">1</span>) <span class="co">#losses and train_step</span> losses <span class="op">=</span> [tf.nn.sparse_softmax_cross_entropy_with_logits(labels<span class="op">=</span>label, logits<span class="op">=</span>logit) <span class="cf">for</span> <span class="op">\</span> logit, label <span class="kw">in</span> <span class="bu">zip</span>(logits, y_as_list)] total_loss <span class="op">=</span> tf.reduce_mean(losses) train_step <span class="op">=</span> tf.train.AdagradOptimizer(learning_rate).minimize(total_loss)</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co">&quot;&quot;&quot;</span> <span class="co">Train the network</span> <span class="co">&quot;&quot;&quot;</span> <span class="kw">def</span> train_network(num_epochs, num_steps, state_size<span class="op">=</span><span class="dv">4</span>, 
verbose<span class="op">=</span><span class="va">True</span>): <span class="cf">with</span> tf.Session() <span class="im">as</span> sess: sess.run(tf.global_variables_initializer()) training_losses <span class="op">=</span> [] <span class="cf">for</span> idx, epoch <span class="kw">in</span> <span class="bu">enumerate</span>(gen_epochs(num_epochs, num_steps)): training_loss <span class="op">=</span> <span class="dv">0</span> training_state <span class="op">=</span> np.zeros((batch_size, state_size)) <span class="cf">if</span> verbose: <span class="bu">print</span>(<span class="st">&quot;</span><span class="ch">\n</span><span class="st">EPOCH&quot;</span>, idx) <span class="cf">for</span> step, (X, Y) <span class="kw">in</span> <span class="bu">enumerate</span>(epoch): tr_losses, training_loss_, training_state, _ <span class="op">=</span> <span class="op">\</span> sess.run([losses, total_loss, final_state, train_step], feed_dict<span class="op">=</span>{x:X, y:Y, init_state:training_state}) training_loss <span class="op">+=</span> training_loss_ <span class="cf">if</span> step <span class="op">%</span> <span class="dv">100</span> <span class="op">==</span> <span class="dv">0</span> <span class="kw">and</span> step <span class="op">&gt;</span> <span class="dv">0</span>: <span class="cf">if</span> verbose: <span class="bu">print</span>(<span class="st">&quot;Average loss at step&quot;</span>, step, <span class="st">&quot;for last 100 steps:&quot;</span>, training_loss<span class="op">/</span><span class="dv">100</span>) training_losses.append(training_loss<span class="op">/</span><span class="dv">100</span>) training_loss <span class="op">=</span> <span class="dv">0</span> <span class="cf">return</span> training_losses</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">training_losses <span class="op">=</span> train_network(<span class="dv">1</span>,num_steps) plt.plot(training_losses)</code></pre></div>
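<p>Before examining the training output, it may help to see where the expected cross-entropy figures quoted in this post (roughly 0.66, 0.52 and 0.454) come from. The short sketch below recomputes them, assuming the data-generation rule described earlier in the post: p(X_t = 1) is 0.5 by default, raised by 0.5 when X_{t-3} = 1 and lowered by 0.25 when X_{t-8} = 1. The helper function and variable names here are illustrative, not part of the post's code.</p>

```python
import math

# Hedged sketch: recompute the expected cross-entropies quoted in the text,
# assuming p(X_t = 1) = 0.5, +0.5 when X_{t-3} = 1, -0.25 when X_{t-8} = 1.
def expected_log_loss(p_true, p_pred):
    """Expected -log likelihood of predicting p_pred when the true rate is p_true."""
    loss = 0.0
    if p_true > 0:
        loss -= p_true * math.log(p_pred)
    if p_true < 1:
        loss -= (1 - p_true) * math.log(1 - p_pred)
    return loss

# True p(Y = 1) for the four equally likely (x_{t-3}, x_{t-8}) combinations
cases = {(0, 0): 0.5, (1, 0): 1.0, (0, 1): 0.25, (1, 1): 0.75}

# Knowing both dependencies: predict the true rate in every case
both = sum(expected_log_loss(p, p) for p in cases.values()) / 4

# Knowing only the first dependency: predict E[p | x_{t-3}]
pred_given_x3 = {x3: sum(p for (a, _), p in cases.items() if a == x3) / 2
                 for x3 in (0, 1)}
first_only = sum(expected_log_loss(p, pred_given_x3[x3])
                 for (x3, _), p in cases.items()) / 4

# Knowing neither: always predict the marginal rate E[p] = 0.625
neither = sum(expected_log_loss(p, 0.625) for p in cases.values()) / 4

print(round(neither, 3), round(first_only, 3), round(both, 3))  # → 0.662 0.519 0.454
```

<p>These are the baselines against which the learning curves below should be read: a model that captures neither dependency can do no better than about 0.66, one that captures only the first dependency about 0.52, and one that captures both about 0.454.</p>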
<pre><code>EPOCH 0 Average loss at step 100 for last 100 steps: 0.6559883219 Average loss at step 200 for last 100 steps: 0.617185292244 Average loss at step 300 for last 100 steps: 0.595771013498 Average loss at step 400 for last 100 steps: 0.568864737153 Average loss at step 500 for last 100 steps: 0.524139249921 Average loss at step 600 for last 100 steps: 0.522666031122 Average loss at step 700 for last 100 steps: 0.522012578249 Average loss at step 800 for last 100 steps: 0.519179680347 Average loss at step 900 for last 100 steps: 0.519965928495</code></pre> <figure> <img src="https://r2rt.com/static/images/RNN_output_21_2.png" alt="RNN Output, num_steps = 5" /><figcaption>RNN Output, num_steps = 5</figcaption> </figure> <p>As you can see, the network very quickly learns to capture the first dependency (but not the second), and converges to the expected cross-entropy loss of 0.52.</p> <p>Exporting our model to a separate file in order to play with hyperparameters, we can see what happens when we use <code>num_steps = 1</code> and <code>num_steps = 10</code> (for this latter case, we also increase the state_size so as to maintain the information about the second dependency for the required 8 steps):</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> basic_rnn <span class="kw">def</span> plot_learning_curve(num_steps, state_size<span class="op">=</span><span class="dv">4</span>, epochs<span class="op">=</span><span class="dv">1</span>): <span class="kw">global</span> losses, total_loss, final_state, train_step, x, y, init_state tf.reset_default_graph() g <span class="op">=</span> tf.get_default_graph() losses, total_loss, final_state, train_step, x, y, init_state <span class="op">=</span> <span class="op">\</span> basic_rnn.setup_graph(g, basic_rnn.RNN_config(num_steps<span class="op">=</span>num_steps, state_size<span class="op">=</span>state_size)) res <span class="op">=</span>
train_network(epochs, num_steps, state_size<span class="op">=</span>state_size, verbose<span class="op">=</span><span class="va">False</span>) plt.plot(res)</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co">&quot;&quot;&quot;</span> <span class="co">NUM_STEPS = 1</span> <span class="co">&quot;&quot;&quot;</span> plot_learning_curve(num_steps<span class="op">=</span><span class="dv">1</span>, state_size<span class="op">=</span><span class="dv">4</span>, epochs<span class="op">=</span><span class="dv">2</span>)</code></pre></div> <figure> <img src="https://r2rt.com/static/images/RNN_output_25_0.png" alt="RNN Output, num_steps = 1" /><figcaption>RNN Output, num_steps = 1</figcaption> </figure> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co">&quot;&quot;&quot;</span> <span class="co">NUM_STEPS = 10</span> <span class="co">&quot;&quot;&quot;</span> plot_learning_curve(num_steps<span class="op">=</span><span class="dv">10</span>, state_size<span class="op">=</span><span class="dv">16</span>, epochs<span class="op">=</span><span class="dv">10</span>)</code></pre></div> <figure> <img src="https://r2rt.com/static/images/RNN_output_26_0.png" alt="RNN Output, num_steps = 10" /><figcaption>RNN Output, num_steps = 10</figcaption> </figure> <p>As expected, using <code>num_steps = 10</code> comes close to our expected cross-entropy for knowing both dependencies (0.454). However, using <code>num_steps = 1</code> hovers around something slightly better than the expected cross-entropy for knowing neither dependency (0.66), and doesn’t seem to converge. What’s going on?</p> <p>The answer is that some information about the first dependency is making its way into the incoming state by pure chance. 
Although the model can’t learn weights that will maintain information about the first dependency (due to the backpropagation being graph-bound), it can learn to take advantage of whatever information about <span class="math inline">$$X_{t-3}$$</span> is left over in <span class="math inline">$$S_{t-1}$$</span>. In doing so, the model changes the way information about <span class="math inline">$$X_{t-3}$$</span> is stored in <span class="math inline">$$S_{t-1}$$</span>, which explains why the loss goes up and down, rather than settling at a local minimum.</p> <h3 id="translating-our-model-to-tensorflow">Translating our model to Tensorflow</h3> <p>Translating our model to Tensorflow’s API is easy. We simply replace these two sections:</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co">&quot;&quot;&quot;</span> <span class="co">Definition of rnn_cell</span> <span class="co">This is very similar to the __call__ method on Tensorflow&#39;s BasicRNNCell.
See:</span> <span class="co">https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/rnn/python/ops/core_rnn_cell_impl.py#L95</span> <span class="co">&quot;&quot;&quot;</span> <span class="cf">with</span> tf.variable_scope(<span class="st">&#39;rnn_cell&#39;</span>): W <span class="op">=</span> tf.get_variable(<span class="st">&#39;W&#39;</span>, [num_classes <span class="op">+</span> state_size, state_size]) b <span class="op">=</span> tf.get_variable(<span class="st">&#39;b&#39;</span>, [state_size], initializer<span class="op">=</span>tf.constant_initializer(<span class="fl">0.0</span>)) <span class="kw">def</span> rnn_cell(rnn_input, state): <span class="cf">with</span> tf.variable_scope(<span class="st">&#39;rnn_cell&#39;</span>, reuse<span class="op">=</span><span class="va">True</span>): W <span class="op">=</span> tf.get_variable(<span class="st">&#39;W&#39;</span>, [num_classes <span class="op">+</span> state_size, state_size]) b <span class="op">=</span> tf.get_variable(<span class="st">&#39;b&#39;</span>, [state_size], initializer<span class="op">=</span>tf.constant_initializer(<span class="fl">0.0</span>)) <span class="cf">return</span> tf.tanh(tf.matmul(tf.concat([rnn_input, state], <span class="dv">1</span>), W) <span class="op">+</span> b) <span class="co">&quot;&quot;&quot;</span> <span class="co">Adding rnn_cells to graph</span> <span class="co">This is a simplified version of the &quot;static_rnn&quot; function from Tensorflow&#39;s api. 
See:</span> <span class="co">https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/rnn/python/ops/core_rnn.py#L41</span> <span class="co">Note: In practice, using &quot;dynamic_rnn&quot; is a better choice than the &quot;static_rnn&quot;:</span> <span class="co">https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/rnn.py#L390</span> <span class="co">&quot;&quot;&quot;</span> state <span class="op">=</span> init_state rnn_outputs <span class="op">=</span> [] <span class="cf">for</span> rnn_input <span class="kw">in</span> rnn_inputs: state <span class="op">=</span> rnn_cell(rnn_input, state) rnn_outputs.append(state) final_state <span class="op">=</span> rnn_outputs[<span class="op">-</span><span class="dv">1</span>]</code></pre></div> <p>With these two lines:</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">cell <span class="op">=</span> tf.contrib.rnn.BasicRNNCell(state_size) rnn_outputs, final_state <span class="op">=</span> tf.contrib.rnn.static_rnn(cell, rnn_inputs, initial_state<span class="op">=</span>init_state)</code></pre></div> <h3 id="using-a-dynamic-rnn">Using a dynamic RNN</h3> <p>Above, we added every node for every timestep to the graph before execution. This is called “static” construction. We could also let Tensorflow dynamically create the graph at execution time, which can be more efficient. To do this, instead of using a list of tensors (of length <code>num_steps</code> and shape <code>[batch_size, features]</code>), we keep everything in a single 3-dimensional tensor of shape <code>[batch_size, num_steps, features]</code>, and use Tensorflow’s <code>dynamic_rnn</code> function.
This is shown below.</p> <h3 id="final-model-static">Final model — static</h3> <p>To recap, here’s the entire static model, as modified to use Tensorflow’s API:</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co">&quot;&quot;&quot;</span> <span class="co">Placeholders</span> <span class="co">&quot;&quot;&quot;</span> x <span class="op">=</span> tf.placeholder(tf.int32, [batch_size, num_steps], name<span class="op">=</span><span class="st">&#39;input_placeholder&#39;</span>) y <span class="op">=</span> tf.placeholder(tf.int32, [batch_size, num_steps], name<span class="op">=</span><span class="st">&#39;labels_placeholder&#39;</span>) init_state <span class="op">=</span> tf.zeros([batch_size, state_size]) <span class="co">&quot;&quot;&quot;</span> <span class="co">Inputs</span> <span class="co">&quot;&quot;&quot;</span> x_one_hot <span class="op">=</span> tf.one_hot(x, num_classes) rnn_inputs <span class="op">=</span> tf.unstack(x_one_hot, axis<span class="op">=</span><span class="dv">1</span>) <span class="co">&quot;&quot;&quot;</span> <span class="co">RNN</span> <span class="co">&quot;&quot;&quot;</span> cell <span class="op">=</span> tf.contrib.rnn.BasicRNNCell(state_size) rnn_outputs, final_state <span class="op">=</span> tf.contrib.rnn.static_rnn(cell, rnn_inputs, initial_state<span class="op">=</span>init_state) <span class="co">&quot;&quot;&quot;</span> <span class="co">Predictions, loss, training step</span> <span class="co">&quot;&quot;&quot;</span> <span class="cf">with</span> tf.variable_scope(<span class="st">&#39;softmax&#39;</span>): W <span class="op">=</span> tf.get_variable(<span class="st">&#39;W&#39;</span>, [state_size, num_classes]) b <span class="op">=</span> tf.get_variable(<span class="st">&#39;b&#39;</span>, [num_classes], initializer<span class="op">=</span>tf.constant_initializer(<span class="fl">0.0</span>)) logits <span class="op">=</span> [tf.matmul(rnn_output, W) <span 
class="op">+</span> b <span class="cf">for</span> rnn_output <span class="kw">in</span> rnn_outputs] predictions <span class="op">=</span> [tf.nn.softmax(logit) <span class="cf">for</span> logit <span class="kw">in</span> logits] y_as_list <span class="op">=</span> tf.unstack(y, num<span class="op">=</span>num_steps, axis<span class="op">=</span><span class="dv">1</span>) losses <span class="op">=</span> [tf.nn.sparse_softmax_cross_entropy_with_logits(labels<span class="op">=</span>label, logits<span class="op">=</span>logit) <span class="cf">for</span> <span class="op">\</span> logit, label <span class="kw">in</span> <span class="bu">zip</span>(logits, y_as_list)] total_loss <span class="op">=</span> tf.reduce_mean(losses) train_step <span class="op">=</span> tf.train.AdagradOptimizer(learning_rate).minimize(total_loss)</code></pre></div> <h3 id="final-model-dynamic">Final model — dynamic</h3> <p>And here it is with the <code>dynamic_rnn</code> API, which should be preferred over the static API:</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co">&quot;&quot;&quot;</span> <span class="co">Placeholders</span> <span class="co">&quot;&quot;&quot;</span> x <span class="op">=</span> tf.placeholder(tf.int32, [batch_size, num_steps], name<span class="op">=</span><span class="st">&#39;input_placeholder&#39;</span>) y <span class="op">=</span> tf.placeholder(tf.int32, [batch_size, num_steps], name<span class="op">=</span><span class="st">&#39;labels_placeholder&#39;</span>) init_state <span class="op">=</span> tf.zeros([batch_size, state_size]) <span class="co">&quot;&quot;&quot;</span> <span class="co">Inputs</span> <span class="co">&quot;&quot;&quot;</span> rnn_inputs <span class="op">=</span> tf.one_hot(x, num_classes) <span class="co">&quot;&quot;&quot;</span> <span class="co">RNN</span> <span class="co">&quot;&quot;&quot;</span> cell <span class="op">=</span> tf.contrib.rnn.BasicRNNCell(state_size) rnn_outputs, 
final_state <span class="op">=</span> tf.nn.dynamic_rnn(cell, rnn_inputs, initial_state<span class="op">=</span>init_state) <span class="co">&quot;&quot;&quot;</span> <span class="co">Predictions, loss, training step</span> <span class="co">&quot;&quot;&quot;</span> <span class="cf">with</span> tf.variable_scope(<span class="st">&#39;softmax&#39;</span>): W <span class="op">=</span> tf.get_variable(<span class="st">&#39;W&#39;</span>, [state_size, num_classes]) b <span class="op">=</span> tf.get_variable(<span class="st">&#39;b&#39;</span>, [num_classes], initializer<span class="op">=</span>tf.constant_initializer(<span class="fl">0.0</span>)) logits <span class="op">=</span> tf.reshape( tf.matmul(tf.reshape(rnn_outputs, [<span class="op">-</span><span class="dv">1</span>, state_size]), W) <span class="op">+</span> b, [batch_size, num_steps, num_classes]) predictions <span class="op">=</span> tf.nn.softmax(logits) losses <span class="op">=</span> tf.nn.sparse_softmax_cross_entropy_with_logits(labels<span class="op">=</span>y, logits<span class="op">=</span>logits) total_loss <span class="op">=</span> tf.reduce_mean(losses) train_step <span class="op">=</span> tf.train.AdagradOptimizer(learning_rate).minimize(total_loss)</code></pre></div> <h3 id="conclusion">Conclusion</h3> <p>And there you have it, a basic RNN in Tensorflow.
In the <a href="https://r2rt.com/recurrent-neural-networks-in-tensorflow-ii.html">next post</a> of this series, we’ll look at how to improve our base implementation, how to upgrade to a GRU/LSTM or other custom RNN cell and use multiple layers, how to add features like dropout and layer normalization, and how to use our RNN to generate sequences.</p> </body> </html> First Convergence Bias2016-04-11T00:00:00-04:002016-04-11T00:00:00-04:00Silviu Pitistag:r2rt.com,2016-04-11:/first-convergence-bias.htmlIn this post, I offer the results of an experiment providing support for "first convergence bias", which includes the proposition that training a randomly initialized network via backpropagation may never converge to a global minimum, regardless of the intialization and number of trials.<!DOCTYPE html> <html> <head> <meta charset="utf-8"> <meta name="generator" content="pandoc"> <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes"> <title></title> <style type="text/css">code{white-space: pre;}</style> <style type="text/css"> div.sourceCode { overflow-x: auto; } table.sourceCode, tr.sourceCode, td.lineNumbers, td.sourceCode { margin: 0; padding: 0; vertical-align: baseline; border: none; } table.sourceCode { width: 100%; line-height: 100%; } td.lineNumbers { text-align: right; padding-right: 4px; padding-left: 4px; color: #aaaaaa; border-right: 1px solid #aaaaaa; } td.sourceCode { padding-left: 5px; } code > span.kw { color: #007020; font-weight: bold; } /* Keyword */ code > span.dt { color: #902000; } /* DataType */ code > span.dv { color: #40a070; } /* DecVal */ code > span.bn { color: #40a070; } /* BaseN */ code > span.fl { color: #40a070; } /* Float */ code > span.ch { color: #4070a0; } /* Char */ code > span.st { color: #4070a0; } /* String */ code > span.co { color: #60a0b0; font-style: italic; } /* Comment */ code > span.ot { color: #007020; } /* Other */ code > span.al { color: #ff0000; font-weight: bold; } /* Alert */ code > 
span.fu { color: #06287e; } /* Function */ code > span.er { color: #ff0000; font-weight: bold; } /* Error */ code > span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */ code > span.cn { color: #880000; } /* Constant */ code > span.sc { color: #4070a0; } /* SpecialChar */ code > span.vs { color: #4070a0; } /* VerbatimString */ code > span.ss { color: #bb6688; } /* SpecialString */ code > span.im { } /* Import */ code > span.va { color: #19177c; } /* Variable */ code > span.cf { color: #007020; font-weight: bold; } /* ControlFlow */ code > span.op { color: #666666; } /* Operator */ code > span.bu { } /* BuiltIn */ code > span.ex { } /* Extension */ code > span.pp { color: #bc7a00; } /* Preprocessor */ code > span.at { color: #7d9029; } /* Attribute */ code > span.do { color: #ba2121; font-style: italic; } /* Documentation */ code > span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */ code > span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */ code > span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */ </style> <!--[if lt IE 9]> <script src="//cdnjs.cloudflare.com/ajax/libs/html5shiv/3.7.3/html5shiv-printshiv.min.js"></script> <![endif]--> </head> <body> <p>In my post <a href="https://r2rt.com/skill-vs-strategy.html">Skill vs Strategy</a> I made the following proposition:</p> <blockquote> <p>Let’s say we retrain the network one million times, and each of the local minima reached leads to approximately the same performance. Is this enough for us to conclude that the resulting strategies are close to the best? I would answer in the negative; we cannot be certain that a random initialization will ever lead to an optimal strategy via backpropagation. 
It may be a situation like the chest shot, where in order to reach an optimal strategy, the network must be trained again after it has learned some useful hidden features.</p> </blockquote> <p>In this post, I offer the results of an experiment providing support for this proposition. The specific experiment was not designed to test this proposition, and in fact, the results I obtained are the opposite of what I expected.</p> <!--more--> <h3 id="motivation">Motivation</h3> <p>In my post <a href="https://r2rt.com/representational-power-of-deeper-layers.html">Representational Power of Deeper Layers</a>, I showed that progressively deeper layers serve as better representations of the original data. This motivated the thought that the better the first layer representation of the data, the better the last layer representation could be. Intuitively, this makes sense: if our first layer can get us 95% of the way to the best solution, then the second layer only has 5% of ground to cover, whereas if the first layer only gets 70% of the way there, the second layer has six times as much work to do.</p> <h3 id="experiment">Experiment</h3> <p>I tested this hypothesis by setting up a neural network with 3 hidden layers to classify MNIST digits. The network is trained only on 1000 MNIST training examples, as opposed to the entire training set of 50000, and achieves a test accuracy of between 86% and 89%.</p> <p>Using a simple mini-batch gradient descent with a batch size of 50, I compared the following two scenarios:</p> <ol type="1"> <li>Training the entire network for 1000 steps.</li> <li>Training the first layer for 200 steps, training the first and second layers together for 200 steps, and only then training the entire network for 1000 steps.</li> </ol> <p>My hypothesis was that the second option would lead to better results, because by the time the third layer starts training, the first and second layers are finely tuned to produce “good” representations of the data.
This hypothesis was proven incorrect.</p> <h3 id="results">Results</h3> <p>For each of the two training strategies above, I trained 500 randomly-initialized networks and recorded their final accuracies. Training the entire network at once yielded an average test accuracy of 87.87%, whereas training the two earlier layers before training the entire network yielded an average test accuracy of only 87.64%. While this is not that significant a difference, the plot below clearly shows that the “local minima” that each strategy reaches are pulled from a different distribution.</p> <figure> <img src="https://r2rt.com/static/images/FCB_output_12_2.png" /> </figure> <h3 id="discussion">Discussion</h3> <p>Although my hypothesis was invalidated, the result is nice because it supports my prior proposition that the local minima of first convergence may be biased. I.e., we have no guarantee of getting to the best local minima after training via backpropagation. Thus, the discussion in <a href="https://r2rt.com/skill-vs-strategy.html">Skill vs Strategy</a> is very relevant. Convergence speed aside, which is the usual reason for preferring alternate training strategies, other training strategies (e.g., other optimization strategies like Adagrad) are worth exploring as they might be biased toward a superior quality of local minima.</p> <p>How this relates to “greedy layer-wise training of deep networks” (see <a href="https://papers.nips.cc/paper/3048-greedy-layer-wise-training-of-deep-networks.pdf">this paper</a>) may also be interesting. I haven’t learned enough yet to dive into that paper or that method of training, but the gist, as I currently understand it, is that we can obtain better initializations of the weights in a deep network by first training each layer in an unsupervised fashion. In this experiment I did greedy layer-wise <em>supervised</em> training, which led to worse results.
As an aside, the discussion in that paper also strongly supports the proposition that local minima of first convergence may be biased.</p> <h3 id="code">Code</h3> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> tensorflow <span class="im">as</span> tf <span class="im">import</span> numpy <span class="im">as</span> np <span class="im">import</span> load_mnist <span class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt <span class="im">import</span> seaborn <span class="im">as</span> sns sns.<span class="bu">set</span>(color_codes<span class="op">=</span><span class="va">True</span>) <span class="op">%</span>matplotlib inline mnist <span class="op">=</span> load_mnist.read_data_sets(<span class="st">&#39;MNIST_data&#39;</span>, one_hot<span class="op">=</span><span class="va">True</span>)</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> weight_variable(shape): initial <span class="op">=</span> tf.truncated_normal(shape, stddev<span class="op">=</span><span class="fl">0.1</span>) <span class="cf">return</span> tf.Variable(initial) <span class="kw">def</span> bias_variable(shape): initial <span class="op">=</span> tf.constant(<span class="fl">0.1</span>, shape<span class="op">=</span>shape) <span class="cf">return</span> tf.Variable(initial) <span class="kw">def</span> simple_fc_layer(input_layer, shape): w <span class="op">=</span> weight_variable(shape) b <span class="op">=</span> bias_variable([shape[<span class="dv">1</span>]]) <span class="cf">return</span> tf.nn.tanh(tf.matmul(input_layer,w) <span class="op">+</span> b) <span class="kw">def</span> cross_entropy_layer(input_layer,shape): w <span class="op">=</span> weight_variable(shape) b <span class="op">=</span> bias_variable([shape[<span class="dv">1</span>]]) <span class="cf">return</span> tf.nn.softmax(tf.matmul(input_layer,w) <span 
class="op">+</span> b) <span class="kw">def</span> accuracy(y, y_): correct <span class="op">=</span> tf.equal(tf.argmax(y,<span class="dv">1</span>), tf.argmax(y_,<span class="dv">1</span>)) <span class="cf">return</span> tf.reduce_mean(tf.cast(correct, <span class="st">&quot;float&quot;</span>))</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">x <span class="op">=</span> tf.placeholder(<span class="st">&quot;float&quot;</span>, shape<span class="op">=</span>[<span class="va">None</span>, <span class="dv">784</span>]) y_ <span class="op">=</span> tf.placeholder(<span class="st">&quot;float&quot;</span>, shape<span class="op">=</span>[<span class="va">None</span>, <span class="dv">10</span>]) lr <span class="op">=</span> tf.placeholder(<span class="st">&quot;float&quot;</span>) l1 <span class="op">=</span> simple_fc_layer(x, [<span class="dv">784</span>,<span class="dv">100</span>]) y1 <span class="op">=</span> cross_entropy_layer(l1,[<span class="dv">100</span>,<span class="dv">10</span>]) l2 <span class="op">=</span> simple_fc_layer(l1, [<span class="dv">100</span>,<span class="dv">100</span>]) y2 <span class="op">=</span> cross_entropy_layer(l2,[<span class="dv">100</span>,<span class="dv">10</span>]) l3 <span class="op">=</span> simple_fc_layer(l2, [<span class="dv">100</span>,<span class="dv">100</span>]) y3 <span class="op">=</span> cross_entropy_layer(l3,[<span class="dv">100</span>,<span class="dv">10</span>]) ce1 <span class="op">=</span> <span class="op">-</span>tf.reduce_sum(y_<span class="op">*</span>tf.log(y1)) ce2 <span class="op">=</span> <span class="op">-</span>tf.reduce_sum(y_<span class="op">*</span>tf.log(y2)) ce3 <span class="op">=</span> <span class="op">-</span>tf.reduce_sum(y_<span class="op">*</span>tf.log(y3)) ts1 <span class="op">=</span> tf.train.GradientDescentOptimizer(lr).minimize(ce1) ts2 <span class="op">=</span> tf.train.GradientDescentOptimizer(lr).minimize(ce2) ts3 <span 
class="op">=</span> tf.train.GradientDescentOptimizer(lr).minimize(ce3) a1 <span class="op">=</span> accuracy(y1,y_) a2 <span class="op">=</span> accuracy(y2,y_) a3 <span class="op">=</span> accuracy(y3,y_)</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">train3 <span class="op">=</span> [] train123 <span class="op">=</span> [] <span class="cf">for</span> run <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">400</span>): <span class="cf">with</span> tf.Session() <span class="im">as</span> sess: sess.run(tf.initialize_all_variables()) <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">1000</span>): start <span class="op">=</span> (<span class="dv">50</span><span class="op">*</span>i) <span class="op">%</span> <span class="dv">1000</span> end <span class="op">=</span> start <span class="op">+</span> <span class="dv">50</span> learning_rate <span class="op">=</span> <span class="fl">0.01</span> <span class="cf">if</span> i <span class="op">&lt;</span> <span class="dv">750</span> <span class="cf">else</span> <span class="fl">0.003</span> ts3.run(feed_dict<span class="op">=</span>{x: mnist.train.images[start:end], y_: mnist.train.labels[start:end], lr:learning_rate}) res <span class="op">=</span> a3.<span class="bu">eval</span>(feed_dict<span class="op">=</span>{x: mnist.test.images, y_: mnist.test.labels}) <span class="bu">print</span>(res, end<span class="op">=</span><span class="st">&quot;</span><span class="ch">\r</span><span class="st">&quot;</span>) train3.append(res)</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="cf">for</span> run <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">400</span>): <span class="cf">with</span> tf.Session() <span class="im">as</span> sess: sess.run(tf.initialize_all_variables()) <span class="cf">for</span> i 
<span class="kw">in</span> <span class="bu">range</span>(<span class="dv">200</span>): start <span class="op">=</span> (<span class="dv">50</span><span class="op">*</span>i) <span class="op">%</span> <span class="dv">1000</span> end <span class="op">=</span> start <span class="op">+</span> <span class="dv">50</span> ts1.run(feed_dict<span class="op">=</span>{x: mnist.train.images[start:end], y_: mnist.train.labels[start:end], lr: <span class="fl">0.01</span>}) <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">200</span>): start <span class="op">=</span> (<span class="dv">50</span><span class="op">*</span>i) <span class="op">%</span> <span class="dv">1000</span> end <span class="op">=</span> start <span class="op">+</span> <span class="dv">50</span> ts2.run(feed_dict<span class="op">=</span>{x: mnist.train.images[start:end], y_: mnist.train.labels[start:end], lr: <span class="fl">0.01</span>}) <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">1000</span>): start <span class="op">=</span> (<span class="dv">50</span><span class="op">*</span>i) <span class="op">%</span> <span class="dv">1000</span> end <span class="op">=</span> start <span class="op">+</span> <span class="dv">50</span> learning_rate <span class="op">=</span> <span class="fl">0.01</span> <span class="cf">if</span> i <span class="op">&lt;</span> <span class="dv">750</span> <span class="cf">else</span> <span class="fl">0.003</span> ts3.run(feed_dict<span class="op">=</span>{x: mnist.train.images[start:end], y_: mnist.train.labels[start:end], lr:learning_rate}) res <span class="op">=</span> a3.<span class="bu">eval</span>(feed_dict<span class="op">=</span>{x: mnist.test.images, y_: mnist.test.labels}) <span class="bu">print</span>(res, end<span class="op">=</span><span class="st">&quot;</span><span class="ch">\r</span><span class="st">&quot;</span>) train123.append(res)</code></pre></div> <div 
class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">sns.distplot(train3,label<span class="op">=</span><span class="st">&quot;Train all layers&quot;</span>) sns.distplot(train123,label<span class="op">=</span><span class="st">&quot;Train layer-by-layer&quot;</span>) plt.legend()</code></pre></div> <figure> <img src="https://r2rt.com/static/images/FCB_output_12_2.png" /> </figure> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="bu">print</span>(<span class="st">&quot;Mean and standard deviation of training all layers: &quot;</span> <span class="op">+</span> <span class="bu">str</span>(np.mean(train3)) <span class="op">+</span> <span class="st">&quot;, &quot;</span> <span class="op">+</span> <span class="bu">str</span>(np.std(train3))) <span class="bu">print</span>(<span class="st">&quot;Mean and standard deviation of layer-wise training: &quot;</span> <span class="op">+</span> <span class="bu">str</span>(np.mean(train123)) <span class="op">+</span> <span class="st">&quot;, &quot;</span> <span class="op">+</span> <span class="bu">str</span>(np.std(train123)))</code></pre></div> <pre><code>Mean and standard deviation of training all layers: 0.87867, 0.00274425
Mean and standard deviation of layer-wise training: 0.876439, 0.00234411</code></pre> </body> </html> Inverting a Neural Net2016-04-05T00:00:00-04:002016-04-05T00:00:00-04:00Silviu Pitistag:r2rt.com,2016-04-05:/inverting-a-neural-net.htmlIn this experiment, I "invert" a simple two-layer MNIST model to visualize what the final hidden layer representations look like when projected back into the original sample space.<!DOCTYPE html> <html> <head> <meta charset="utf-8"> <meta name="generator" content="pandoc"> <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes"> <title></title> </head> <body> <p>In this experiment, I “invert” a simple two-layer MNIST model to visualize what the final hidden layer representations look like when projected back into the original sample space.</p> <p>[<strong>Note 2017/03/05</strong>: At the time of writing this post, I did not know what an autoencoder was.]</p> <h3 id="model-setup">Model Setup</h3> <p>This is a fully-connected model with two hidden layers of 100 neurons each. The model also contains inverse weight matrices (<code>w2_inv</code> and <code>w1_inv</code>) that are trained after the fact by minimizing the l1 difference (<code>x_inv_similarity</code>) between the inverse projection of a sample and the original sample.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> tensorflow <span class="im">as</span> tf <span class="im">import</span> numpy <span class="im">as</span> np <span class="im">import</span> load_mnist <span class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt <span class="im">import</span> matplotlib.image <span class="im">as</span> mpimg <span class="im">import</span> matplotlib.cm <span class="im">as</span> cm <span class="im">from</span> mpl_toolkits.axes_grid1 <span class="im">import</span> ImageGrid <span class="op">%</span>matplotlib inline mnist <span class="op">=</span> load_mnist.read_data_sets(<span class="st">&#39;MNIST_data&#39;</span>, one_hot<span class="op">=</span><span class="va">True</span>) sess <span class="op">=</span> tf.InteractiveSession() <span class="kw">def</span> weight_variable(shape,name<span class="op">=</span><span class="va">None</span>): initial <span class="op">=</span> tf.truncated_normal(shape, stddev<span class="op">=</span><span class="fl">0.1</span>) <span class="cf">return</span>
tf.Variable(initial,name<span class="op">=</span>name) <span class="kw">def</span> bias_variable(shape): initial <span class="op">=</span> tf.constant(<span class="fl">0.1</span>, shape<span class="op">=</span>shape) <span class="cf">return</span> tf.Variable(initial) <span class="kw">def</span> logit(p): <span class="co">&quot;&quot;&quot;element-wise logit of tensor p&quot;&quot;&quot;</span> <span class="cf">return</span> tf.log(tf.div(p,<span class="dv">1</span><span class="op">-</span>p)) <span class="kw">def</span> squash(p, dim): <span class="co">&quot;&quot;&quot;element-wise squash of dim of tensor MxN p to be between 0 and 1&quot;&quot;&quot;</span> p_ <span class="op">=</span> p <span class="op">-</span> tf.reduce_min(p,dim,keep_dims<span class="op">=</span><span class="va">True</span>) <span class="co"># add the minimum so all above 0</span> p_norm <span class="op">=</span> (p_ <span class="op">/</span> tf.reduce_max(p_,dim,keep_dims<span class="op">=</span><span class="va">True</span>)) p_norm_ <span class="op">=</span> (p_norm <span class="op">-</span> <span class="fl">0.5</span>) <span class="op">*</span> <span class="fl">0.99</span> <span class="op">+</span> <span class="fl">0.5</span> <span class="co">#squashs to be strictly 0 &lt; p_norm_ &lt; 1</span> <span class="cf">return</span> p_norm_ x <span class="op">=</span> tf.placeholder(<span class="st">&quot;float&quot;</span>, shape<span class="op">=</span>[<span class="va">None</span>, <span class="dv">784</span>]) y_ <span class="op">=</span> tf.placeholder(<span class="st">&quot;float&quot;</span>, shape<span class="op">=</span>[<span class="va">None</span>, <span class="dv">10</span>]) w1 <span class="op">=</span> weight_variable([<span class="dv">784</span>,<span class="dv">100</span>]) b1 <span class="op">=</span> bias_variable([<span class="dv">100</span>]) l1 <span class="op">=</span> tf.nn.sigmoid(tf.matmul(x,w1) <span class="op">+</span> b1) <span class="co">#100</span> w2 <span 
class="op">=</span> weight_variable([<span class="dv">100</span>,<span class="dv">100</span>]) b2 <span class="op">=</span> bias_variable([<span class="dv">100</span>]) l2 <span class="op">=</span> tf.nn.sigmoid(tf.matmul(l1,w2) <span class="op">+</span> b2) <span class="co">#100</span> w2_inv <span class="op">=</span> weight_variable([<span class="dv">100</span>,<span class="dv">100</span>]) l1_inv <span class="op">=</span> tf.matmul(logit(l2) <span class="op">-</span> b2, w2_inv) l1_inv_norm <span class="op">=</span> squash(l1_inv, <span class="dv">1</span>) <span class="co"># this &quot;excess l1 inv&quot; is minimized so as to try to get the l1_inv to be compatible</span> <span class="co"># with the logit (inverse sigmoid) function without requiring the squash operation</span> excess_l1_inv <span class="op">=</span> tf.nn.l2_loss(tf.reduce_min(l1_inv)) <span class="op">+</span> tf.nn.l2_loss(tf.reduce_max(l1_inv <span class="op">-</span> <span class="dv">1</span>)) w1_inv <span class="op">=</span> weight_variable([<span class="dv">100</span>,<span class="dv">784</span>]) x_inv <span class="op">=</span> tf.matmul(logit(l1_inv_norm) <span class="op">-</span> b1,w1_inv) w <span class="op">=</span> weight_variable([<span class="dv">100</span>,<span class="dv">10</span>]) b <span class="op">=</span> bias_variable([<span class="dv">10</span>]) y <span class="op">=</span> tf.nn.softmax(tf.matmul(l2,w) <span class="op">+</span> b) cross_entropy <span class="op">=</span> <span class="op">-</span>tf.reduce_sum(y_<span class="op">*</span>tf.log(y)) x_inv_similarity <span class="op">=</span> tf.reduce_sum(tf.<span class="bu">abs</span>(x <span class="op">-</span> x_inv)) opt <span class="op">=</span> tf.train.AdagradOptimizer(<span class="fl">0.01</span>) grads <span class="op">=</span> opt.compute_gradients(x_inv_similarity<span class="op">+</span>excess_l1_inv, [w1_inv, w2_inv]) inv_train_step <span class="op">=</span> opt.apply_gradients(grads) train_step <span 
class="op">=</span> tf.train.AdagradOptimizer(<span class="fl">0.01</span>).minimize(cross_entropy) sess.run(tf.initialize_all_variables()) correct_prediction <span class="op">=</span> tf.equal(tf.argmax(y,<span class="dv">1</span>), tf.argmax(y_,<span class="dv">1</span>)) accuracy <span class="op">=</span> tf.reduce_mean(tf.cast(correct_prediction, <span class="st">&quot;float&quot;</span>))</code></pre></div> <h3 id="training-the-model">Training the Model</h3> <p>First, we train the model, and then we train the inverse operations. The model achieves an accuracy of about 95%. Because we don’t want to confuse the inverse training with bad samples, we only train the inverse operations using samples that the model itself is confident it has classified correctly. This reduces noise in the inverse projections.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">2000</span>): batch <span class="op">=</span> mnist.train.next_batch(<span class="dv">1000</span>) train_step.run(feed_dict<span class="op">=</span>{x: batch[<span class="dv">0</span>], y_: batch[<span class="dv">1</span>]}) <span class="cf">if</span> i <span class="op">%</span> <span class="dv">100</span> <span class="op">==</span> <span class="dv">0</span>: <span class="bu">print</span>(i,end<span class="op">=</span><span class="st">&quot; &quot;</span>) <span class="bu">print</span>(accuracy.<span class="bu">eval</span>(feed_dict<span class="op">=</span>{x: mnist.test.images, y_: mnist.test.labels}), end<span class="op">=</span><span class="st">&quot;</span><span class="ch">\r</span><span class="st">&quot;</span>) <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">1000</span>): batch <span class="op">=</span> mnist.train.next_batch(<span class="dv">1000</span>) confidence <span class="op">=</span> np.<span
class="bu">max</span>(y.<span class="bu">eval</span>(feed_dict<span class="op">=</span> {x: batch[<span class="dv">0</span>]}),axis<span class="op">=</span><span class="dv">1</span>) inv_train_step.run(feed_dict<span class="op">=</span>{x: batch[<span class="dv">0</span>][confidence<span class="op">&gt;</span>.<span class="dv">8</span>], y_: batch[<span class="dv">1</span>][confidence<span class="op">&gt;</span>.<span class="dv">8</span>]}) <span class="cf">if</span> i <span class="op">%</span> <span class="dv">100</span> <span class="op">==</span> <span class="dv">0</span>: <span class="bu">print</span>(i,end<span class="op">=</span><span class="st">&quot;</span><span class="ch">\r</span><span class="st">&quot;</span>) <span class="bu">print</span>(<span class="st">&quot;Final Accuracy: &quot;</span> <span class="op">+</span> <span class="bu">str</span>(accuracy.<span class="bu">eval</span>(feed_dict<span class="op">=</span>{x: mnist.test.images, y_: mnist.test.labels})))</code></pre></div> <pre><code>Final Accuracy: 0.9521</code></pre> <h3 id="visualizing-inverse-projections">Visualizing inverse projections</h3> <p>We now show a visual comparison of the first 36 test samples and their projections.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> plot_nxn(n, images): images <span class="op">=</span> images.reshape((n<span class="op">*</span>n,<span class="dv">28</span>,<span class="dv">28</span>)) fig <span class="op">=</span> plt.figure(<span class="dv">1</span>, (n, n)) grid <span class="op">=</span> ImageGrid(fig, <span class="dv">111</span>, <span class="co"># similar to subplot(111)</span> nrows_ncols<span class="op">=</span>(n, n), <span class="co"># creates grid of axes</span> axes_pad<span class="op">=</span><span class="fl">0.1</span>, <span class="co"># pad between axes in inch.</span> ) <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(n<span 
class="op">*</span>n): grid[i].imshow(images[i], cmap <span class="op">=</span> cm.Greys_r) <span class="co"># The AxesGrid object works as a list of axes.</span> plt.show() plot_nxn(<span class="dv">6</span>,mnist.test.images[:<span class="dv">36</span>])</code></pre></div> <figure> <img src="https://r2rt.com/static/images/INN_output_11_0.png" /> </figure> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">x1 <span class="op">=</span> x_inv.<span class="bu">eval</span>(feed_dict<span class="op">=</span>{x: mnist.test.images})[:<span class="dv">36</span>] plot_nxn(<span class="dv">6</span>,x1)</code></pre></div> <figure> <img src="https://r2rt.com/static/images/INN_output_13_0.png" /> </figure> <p>I think the most interesting thing about this is how the model completely transforms the misclassified digits. For example, the 9th sample and the 3rd-to-last sample each get transformed to a 6.</p> <p>It’s also interesting that the inverse projections are somewhat “idealized” versions of each digit.
For example, the orientations of the inversely projected 3s and 9s and the stroke width of the inversely projected 0s are now all the same.</p> <h3 id="generating-samples">Generating samples</h3> <p>Here we generate samples of digits 1-9 by first optimizing the hidden representation so that the neural network is confident that the representation is of a specific class, and then outputting the inverse projection.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> generate(n_samples,fake_labels): <span class="kw">global</span> w <span class="kw">global</span> b fake_l2 <span class="op">=</span> tf.Variable(tf.zeros([n_samples, <span class="dv">100</span>])) fake_y <span class="op">=</span> tf.nn.softmax(tf.matmul(tf.nn.sigmoid(fake_l2),w) <span class="op">+</span> b) fake_labels <span class="op">=</span> tf.constant(fake_labels) diff <span class="op">=</span> tf.reduce_sum(tf.<span class="bu">abs</span>(fake_labels <span class="op">-</span> fake_y)) <span class="co"># train the fake_l2 to minimize diff</span> opt <span class="op">=</span> tf.train.GradientDescentOptimizer(<span class="fl">1.</span>) grads <span class="op">=</span> opt.compute_gradients(diff, [fake_l2]) tstep <span class="op">=</span> opt.apply_gradients(grads) sess.run(tf.initialize_variables([fake_l2])) <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">1000</span>): tstep.run() fake_l1_inv <span class="op">=</span> tf.matmul(fake_l2 <span class="op">-</span> b2, w2_inv) fake_l1_inv_norm <span class="op">=</span> squash(fake_l1_inv,<span class="dv">1</span>) fake_x_inv <span class="op">=</span> tf.matmul(logit(fake_l1_inv_norm) <span class="op">-</span> b1,w1_inv) <span class="cf">return</span> fake_x_inv.<span class="bu">eval</span>(), fake_y.<span class="bu">eval</span>() genned, fakes <span class="op">=</span> generate(<span class="dv">9</span>, np.eye(<span
class="dv">10</span>)[[<span class="dv">1</span>,<span class="dv">2</span>,<span class="dv">3</span>,<span class="dv">4</span>,<span class="dv">5</span>,<span class="dv">6</span>,<span class="dv">7</span>,<span class="dv">8</span>,<span class="dv">9</span>]].astype(<span class="st">&quot;float32&quot;</span>))</code></pre></div> <p>Here we see that the network is over 99.5% confident that each of its hidden layer representations is a good representation. Below that we see their inverse projections.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">np.<span class="bu">max</span>(fakes,axis<span class="op">=</span><span class="dv">1</span>)</code></pre></div> <pre><code>array([ 0.99675035, 0.99740452, 0.99649602, 0.99652439, 0.99734575, 0.99607605, 0.99735802, 0.99755549, 0.99680138], dtype=float32)</code></pre> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">plot_nxn(<span class="dv">3</span>,genned)</code></pre></div> <figure> <img src="https://r2rt.com/static/images/INN_output_20_0.png" /> </figure> <p>A bit noisy, but it works!</p> <h3 id="visualizing-features">Visualizing Features</h3> <p>We will now show the inverse projection of each of the 100 features of the hidden representation, to get an idea of what the neural network has learned.
Unfortunately, the noise is overwhelming, but we can sort of make out shadows of the learned features.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> generate_features(): fake_l2 <span class="op">=</span> tf.constant(np.eye(<span class="dv">100</span>).astype(<span class="st">&quot;float32&quot;</span>)<span class="op">*</span>(<span class="fl">1e8</span>)) fake_l1_inv <span class="op">=</span> tf.matmul(fake_l2 <span class="op">-</span> b2, w2_inv) fake_l1_inv_norm <span class="op">=</span> squash(fake_l1_inv,<span class="dv">1</span>) fake_x_inv <span class="op">=</span> tf.matmul(logit(fake_l1_inv_norm) <span class="op">-</span> b1,w1_inv) <span class="cf">return</span> fake_x_inv.<span class="bu">eval</span>() genned <span class="op">=</span> generate_features() plot_nxn(<span class="dv">10</span>,np.<span class="bu">round</span>(genned,<span class="dv">1</span>))</code></pre></div> <figure> <img src="https://r2rt.com/static/images/INN_output_24_0.png" /> </figure> </body> </html> Representational Power of Deeper Layers2016-03-30T00:00:00-04:002016-03-30T00:00:00-04:00Silviu Pitistag:r2rt.com,2016-03-30:/representational-power-of-deeper-layers.htmlThe hidden layers in a neural network can be seen as different representations of the input. Do deeper layers learn "better" representations? In a network trained to solve a classification problem, this would mean that deeper layers provide better features than earlier layers. The natural hypothesis is that this is indeed the case. In this post, I test this hypothesis on a network with three hidden layers trained to classify the MNIST dataset.
It is shown that deeper layers do in fact produce better representations of the input.<!DOCTYPE html> <html> <head> <meta charset="utf-8"> <meta name="generator" content="pandoc"> <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes"> <title></title> </head> <body> <p>The hidden layers in a neural network can be seen as different representations of the input. Do deeper layers learn “better” representations? In a network trained to solve a classification problem, this would mean that deeper layers provide better features than earlier layers. The natural hypothesis is that this is indeed the case. In this post, I test this hypothesis on a network with three hidden layers trained to classify the MNIST dataset.
It is shown that deeper layers do in fact produce better representations of the input.</p> <h3 id="model-setup">Model setup</h3> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> tensorflow <span class="im">as</span> tf <span class="im">import</span> numpy <span class="im">as</span> np <span class="im">import</span> load_mnist <span class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt <span class="op">%</span>matplotlib inline mnist <span class="op">=</span> load_mnist.read_data_sets(<span class="st">&#39;MNIST_data&#39;</span>, one_hot<span class="op">=</span><span class="va">True</span>) sess <span class="op">=</span> tf.InteractiveSession() <span class="kw">def</span> weight_variable(shape): initial <span class="op">=</span> tf.truncated_normal(shape, stddev<span class="op">=</span><span class="fl">0.1</span>) <span class="cf">return</span> tf.Variable(initial) <span class="kw">def</span> bias_variable(shape): initial <span class="op">=</span> tf.constant(<span class="fl">0.1</span>, shape<span class="op">=</span>shape) <span class="cf">return</span> tf.Variable(initial) <span class="kw">def</span> simple_fc_layer(input_layer, shape): w <span class="op">=</span> weight_variable(shape) b <span class="op">=</span> bias_variable([shape[<span class="dv">1</span>]]) <span class="cf">return</span> tf.nn.tanh(tf.matmul(input_layer,w) <span class="op">+</span> b) x <span class="op">=</span> tf.placeholder(<span class="st">&quot;float&quot;</span>, shape<span class="op">=</span>[<span class="va">None</span>, <span class="dv">784</span>]) y_ <span class="op">=</span> tf.placeholder(<span class="st">&quot;float&quot;</span>, shape<span class="op">=</span>[<span class="va">None</span>, <span class="dv">10</span>]) l1 <span class="op">=</span> simple_fc_layer(x, [<span class="dv">784</span>,<span class="dv">100</span>]) l2 <span class="op">=</span> simple_fc_layer(l1, [<span 
class="dv">100</span>,<span class="dv">100</span>]) l3 <span class="op">=</span> simple_fc_layer(l2, [<span class="dv">100</span>,<span class="dv">100</span>]) w <span class="op">=</span> weight_variable([<span class="dv">100</span>,<span class="dv">10</span>]) b <span class="op">=</span> bias_variable([<span class="dv">10</span>]) y <span class="op">=</span> tf.nn.softmax(tf.matmul(l3,w) <span class="op">+</span> b) cross_entropy <span class="op">=</span> <span class="op">-</span>tf.reduce_sum(y_<span class="op">*</span>tf.log(y)) train_step <span class="op">=</span> tf.train.GradientDescentOptimizer(<span class="fl">0.01</span>).minimize(cross_entropy) sess.run(tf.initialize_all_variables()) saver <span class="op">=</span> tf.train.Saver() saver.save(sess, <span class="st">&#39;/tmp/initial_variables.ckpt&#39;</span>) correct_prediction <span class="op">=</span> tf.equal(tf.argmax(y,<span class="dv">1</span>), tf.argmax(y_,<span class="dv">1</span>)) accuracy <span class="op">=</span> tf.reduce_mean(tf.cast(correct_prediction, <span class="st">&quot;float&quot;</span>)) base_accuracy <span class="op">=</span> [] <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">10000</span>): start <span class="op">=</span> (<span class="dv">50</span><span class="op">*</span>i) <span class="op">%</span> <span class="dv">54950</span> end <span class="op">=</span> start <span class="op">+</span> <span class="dv">50</span> train_step.run(feed_dict<span class="op">=</span>{x: mnist.train.images[start:end], y_: mnist.train.labels[start:end]}) <span class="cf">if</span> i <span class="op">%</span> <span class="dv">100</span> <span class="op">==</span> <span class="dv">0</span>: base_accuracy.append(accuracy.<span class="bu">eval</span>(feed_dict<span class="op">=</span>{x: mnist.test.images, y_: mnist.test.labels})) <span class="bu">print</span>(base_accuracy[<span class="op">-</span><span class="dv">1</span>])</code></pre></div> 
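<p>As a quick aside on the training loop just shown: the minibatch indexing <code>start = (50*i) % 54950</code> is worth a sanity check. The following standalone sketch (plain Python, no TensorFlow; <code>n_train</code> and <code>batch</code> mirror the 55,000-image MNIST training set and the batch size of 50 used in the loop) verifies what it actually does:</p>

```python
# Sanity check of the cyclic minibatch indexing used in the training loop:
# start = (50*i) % 54950 sweeps windows [0, 50), [50, 100), ... and wraps
# before any window can run past the end of the 55,000-example training set.
n_train, batch = 55000, 50          # mirrors mnist.train and the batch size above
starts = [(batch * i) % (n_train - batch) for i in range(10000)]

assert starts[:3] == [0, 50, 100]       # consecutive, non-overlapping windows
assert max(starts) + batch <= n_train   # no window runs past the dataset
assert starts[1099] == 0                # wraps after 1,099 steps (~1 epoch)

# One quirk: because the modulus is 54950 rather than 55000, the final 50
# images are never visited -- harmless here, but easy to miss.
print("distinct windows per cycle:", len(set(starts)))
```

<p>So 10,000 steps amount to roughly nine passes over the training images, with each pass touching all but the last 50 of them.</p>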
<pre><code>0.971</code></pre> <p>The network achieves an accuracy of about 97% after 10000 training steps in batches of 50 (about nine epochs of the 55,000-example training set).</p> <h3 id="increasing-representational-power">Increasing representational power</h3> <p>To show increasing representational power, I run logistic regression (supervised) and PCA (unsupervised) models on each layer of the data and show that they perform progressively better with deeper layers.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">x_test, y_test <span class="op">=</span> mnist.test.images[:<span class="dv">1000</span>], mnist.test.labels[:<span class="dv">1000</span>] y_train_single <span class="op">=</span> np.<span class="bu">sum</span>((mnist.train.labels[:<span class="dv">1000</span>] <span class="op">*</span> np.array([<span class="dv">0</span>,<span class="dv">1</span>,<span class="dv">2</span>,<span class="dv">3</span>,<span class="dv">4</span>,<span class="dv">5</span>,<span class="dv">6</span>,<span class="dv">7</span>,<span class="dv">8</span>,<span class="dv">9</span>])),axis<span class="op">=</span><span class="dv">1</span>) y_test_single <span class="op">=</span> np.<span class="bu">sum</span>((y_test <span class="op">*</span> np.array([<span class="dv">0</span>,<span class="dv">1</span>,<span class="dv">2</span>,<span class="dv">3</span>,<span class="dv">4</span>,<span class="dv">5</span>,<span class="dv">6</span>,<span class="dv">7</span>,<span class="dv">8</span>,<span class="dv">9</span>])),axis<span class="op">=</span><span class="dv">1</span>) x_arr_test <span class="op">=</span> [x_test] <span class="op">+</span> sess.run([l1,l2,l3],feed_dict<span class="op">=</span>{x:x_test,y_:y_test}) x_arr_train <span class="op">=</span> [mnist.train.images[:<span class="dv">1000</span>]] <span class="op">+</span> sess.run([l1,l2,l3],feed_dict<span class="op">=</span>{x:mnist.train.images[:<span class="dv">1000</span>],y_:mnist.train.labels[:<span
class="dv">1000</span>]})</code></pre></div> <h3 id="logistic-regression">Logistic Regression</h3> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">from</span> sklearn.linear_model <span class="im">import</span> LogisticRegression log_reg <span class="op">=</span> LogisticRegression() <span class="cf">for</span> idx, i <span class="kw">in</span> <span class="bu">enumerate</span>(x_arr_train): log_reg.fit(i,y_train_single) <span class="bu">print</span>(<span class="st">&quot;Layer &quot;</span> <span class="op">+</span> <span class="bu">str</span>(idx) <span class="op">+</span> <span class="st">&quot; accuracy is: &quot;</span> <span class="op">+</span> <span class="bu">str</span>(log_reg.score(x_arr_test[idx],y_test_single)))</code></pre></div> <pre><code>Layer 0 accuracy is: 0.828 Layer 1 accuracy is: 0.931 Layer 2 accuracy is: 0.953 Layer 3 accuracy is: 0.966</code></pre> <p>In support of the hypothesis, logistic regression performs progressively better the deeper the representation. 
There appear to be decreasing marginal returns to each additional hidden layer, and it would be interesting to see if this pattern holds up for deeper / more complex models.</p> <h3 id="pca">PCA</h3> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">from</span> sklearn.decomposition <span class="im">import</span> PCA <span class="im">from</span> matplotlib <span class="im">import</span> cm <span class="kw">def</span> plot_mnist_pca(axis, x, ix1, ix2, colors, num<span class="op">=</span><span class="dv">1000</span>): pca <span class="op">=</span> PCA() pca.fit(x) x_red <span class="op">=</span> pca.transform(x) axis.scatter(x_red[:num,ix1],x_red[:num,ix2],c<span class="op">=</span>colors[:<span class="dv">1000</span>],cmap<span class="op">=</span>cm.rainbow_r) <span class="kw">def</span> plot(list_to_plot): fig,ax <span class="op">=</span> plt.subplots(<span class="dv">3</span>,<span class="dv">4</span>,figsize<span class="op">=</span>(<span class="dv">12</span>,<span class="dv">9</span>)) fig.tight_layout() perms <span class="op">=</span> [(<span class="dv">0</span>,<span class="dv">1</span>),(<span class="dv">0</span>,<span class="dv">2</span>),(<span class="dv">1</span>,<span class="dv">2</span>)] colors <span class="op">=</span> y_test_single index <span class="op">=</span> np.zeros(colors.shape) <span class="cf">for</span> i <span class="kw">in</span> list_to_plot: index <span class="op">+=</span> (colors<span class="op">==</span>i) <span class="cf">for</span> row, axis_row <span class="kw">in</span> <span class="bu">enumerate</span>(ax): <span class="cf">for</span> col, axis <span class="kw">in</span> <span class="bu">enumerate</span>(axis_row): plot_mnist_pca(axis, x_arr_test[col][index<span class="op">==</span><span class="dv">1</span>], perms[row][<span class="dv">0</span>], perms[row][<span class="dv">1</span>], colors[index<span class="op">==</span><span class="dv">1</span>], num<span 
class="op">=</span><span class="dv">1000</span>) plot(<span class="bu">range</span>(<span class="dv">4</span>))</code></pre></div> <figure> <img src="https://r2rt.com/static/images/RPDL_output_9_0.png" /> </figure> <p>Each row of the above grid plots combinations (pairs) of the first three principal components with respect to the numbers 0, 1, 2 and 3 (using only 4 numbers at a time makes the separation more visible). The columns, from left to right, correspond to the input layer, the first hidden layer, the second hidden layer and the final hidden layer.</p> <p>In support of the hypothesis, the principal components of deeper layers provide visibly better separation of the data than earlier layers.</p> <h3 id="a-failed-experiment-teaching-the-neural-network-features">A failed experiment: teaching the neural network features</h3> <p>I had hypothesized that we could use the most prominent features (the top three principal components) of the final hidden layer to train a new neural network and have it perform better. For each training example, in addition to predicting the classification, the new network also performs a regression on the top three principal components of that training example’s third hidden layer representation according to the first model. 
The training step backpropagates both the classification error and the regression error.</p> <p>Unfortunately, this approach did not provide any noticeable improvement over the original model.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">pca <span class="op">=</span> PCA() l3_train <span class="op">=</span> l3.<span class="bu">eval</span>(feed_dict<span class="op">=</span>{x:mnist.train.images}) l3_test <span class="op">=</span> l3.<span class="bu">eval</span>(feed_dict<span class="op">=</span>{x:mnist.test.images}) pca.fit(l3_train) y_new_train <span class="op">=</span> pca.transform(l3_train)[:,:<span class="dv">3</span>] y_new_test <span class="op">=</span> pca.transform(l3_test)[:,:<span class="dv">3</span>] saver.restore(sess, <span class="st">&#39;/tmp/initial_variables.ckpt&#39;</span>) <span class="co"># create new placeholder for 3 new variables</span> y_3newfeatures_ <span class="op">=</span> tf.placeholder(<span class="st">&quot;float&quot;</span>, shape<span class="op">=</span>[<span class="va">None</span>, <span class="dv">3</span>]) <span class="co"># add linear regression for new features</span> w <span class="op">=</span> weight_variable([<span class="dv">100</span>,<span class="dv">3</span>]) b <span class="op">=</span> bias_variable([<span class="dv">3</span>]) y_3newfeatures <span class="op">=</span> tf.matmul(l1,w) <span class="op">+</span> b sess.run(tf.initialize_all_variables()) new_feature_loss <span class="op">=</span> <span class="fl">1e-1</span><span class="op">*</span>tf.reduce_sum(tf.<span class="bu">abs</span>(y_3newfeatures_<span class="op">-</span>y_3newfeatures)) train_step_new_features <span class="op">=</span> tf.train.GradientDescentOptimizer(<span class="fl">0.01</span>).minimize(cross_entropy <span class="op">+</span> new_feature_loss) new_accuracy <span class="op">=</span> [] <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span 
class="dv">10000</span>): start <span class="op">=</span> (<span class="dv">50</span><span class="op">*</span>i) <span class="op">%</span> <span class="dv">54950</span> end <span class="op">=</span> start <span class="op">+</span> <span class="dv">50</span> train_step_new_features.run(feed_dict<span class="op">=</span>{x: mnist.train.images[start:end], y_: mnist.train.labels[start:end],y_3newfeatures_:y_new_train[start:end]}) <span class="cf">if</span> i <span class="op">%</span> <span class="dv">100</span> <span class="op">==</span> <span class="dv">0</span>: acc, ce, lr <span class="op">=</span> sess.run([accuracy, cross_entropy, new_feature_loss],feed_dict<span class="op">=</span>{x:mnist.test.images,y_:mnist.test.labels,y_3newfeatures_:y_new_test}) new_accuracy.append(acc) <span class="bu">print</span>(<span class="st">&quot;Accuracy: &quot;</span> <span class="op">+</span> <span class="bu">str</span>(acc) <span class="op">+</span> <span class="st">&quot; -- Cross entropy: &quot;</span> <span class="op">+</span> <span class="bu">str</span>(ce) <span class="op">+</span> <span class="st">&quot; -- New feature loss: &quot;</span> <span class="op">+</span> <span class="bu">str</span>(lr),end<span class="op">=</span><span class="st">&quot;</span><span class="ch">\r</span><span class="st">&quot;</span>) <span class="bu">print</span>(new_accuracy[<span class="op">-</span><span class="dv">1</span>])</code></pre></div> <pre><code>0.9707</code></pre> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">fig, ax <span class="op">=</span> plt.subplots() ax.plot(base_accuracy, label<span class="op">=</span><span class="st">&#39;Base&#39;</span>) ax.plot(new_accuracy, label<span class="op">=</span><span class="st">&#39;New&#39;</span>) ax.set_xlabel(<span class="st">&#39;Training steps&#39;</span>) ax.set_ylabel(<span class="st">&#39;Accuracy&#39;</span>) ax.set_title(<span class="st">&#39;Base vs New Accuracy&#39;</span>) ax.legend(loc<span 
class="op">=</span><span class="dv">4</span>) plt.show()</code></pre></div> <figure> <img src="https://r2rt.com/static/images/RPDL_output_13_0.png" /> </figure> </body> </html> Implementing Batch Normalization in Tensorflow2016-03-29T00:00:00-04:002016-03-29T00:00:00-04:00Silviu Pitistag:r2rt.com,2016-03-29:/implementing-batch-normalization-in-tensorflow.htmlBatch normalization is a deep learning technique introduced in 2015 that enables the use of higher learning rates, acts as a regularizer and can speed up training by 14 times. In this post, I show how to implement batch normalization in Tensorflow.<!DOCTYPE html> <html> <head> <meta charset="utf-8"> <meta name="generator" content="pandoc"> <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes"> <title></title> <style type="text/css">code{white-space: pre;}</style> <script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-MML-AM_CHTML" type="text/javascript"></script> </head> <body> <p>Batch normalization, as described in the March 2015 <a href="http://arxiv.org/pdf/1502.03167v3.pdf">paper</a> (the BN2015 paper) by Sergey Ioffe and Christian Szegedy, is a simple and effective way to improve the performance of a neural network. In the BN2015 paper, Ioffe and Szegedy show that batch normalization enables the use of higher learning rates, acts as a regularizer and can speed up training by 14 times. In this post, I show how to implement batch normalization in Tensorflow.</p> <p><strong>Edit 2018 (that should have been made back in 2016)</strong>: If you’re just looking for a working implementation, Tensorflow has an easy-to-use batch_normalization layer in the tf.layers module.
Just be sure to wrap your training step in a <code>with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)):</code> and it will work.</p> <p><strong>Edit 07/12/16</strong>: I’ve updated this post to cover the calculation of population mean and variance at test time in more detail.</p> <p><strong>Edit 02/08/16</strong>: In case you are looking for <strong><em>recurrent batch normalization</em></strong> (i.e., from <a href="https://arxiv.org/abs/1603.09025">Cooijmans et al. (2016)</a>), I have uploaded a working Tensorflow implementation <a href="https://gist.github.com/spitis/27ab7d2a30bbaf5ef431b4a02194ac60">here</a>. The only tricky part of the implementation, as compared to the feedforward batch normalization presented in this post, is storing separate population variables for different timesteps.</p> <h3 id="the-problem">The problem</h3> <p>Batch normalization is intended to solve the following problem: Changes in model parameters during learning change the distributions of the outputs of each hidden layer. This means that later layers need to adapt to these (often noisy) changes during training.</p> <h3 id="batch-normalization-in-brief">Batch normalization in brief</h3> <p>To solve this problem, the BN2015 paper proposes the <em>batch normalization</em> of the input to the activation function of each neuron (e.g., each sigmoid or ReLU function) during training, so that the input to the activation function across each training batch has a mean of 0 and a variance of 1.
For example, applying batch normalization to the activation <span class="math inline">$$\sigma(Wx + b)$$</span> would result in <span class="math inline">$$\sigma(BN(Wx + b))$$</span> where <span class="math inline">$$BN$$</span> is the <em>batch normalizing transform</em>.</p> <h3 id="the-batch-normalizing-transform">The batch normalizing transform</h3> <p>To normalize a value across a batch (i.e., to batch normalize the value), we subtract the batch mean, <span class="math inline">$$\mu_B$$</span>, and divide the result by the batch standard deviation, <span class="math inline">$$\sqrt{\sigma^2_B + \epsilon}$$</span>. Note that a small constant <span class="math inline">$$\epsilon$$</span> is added to the variance in order to avoid dividing by zero.</p> <p>Thus, the initial batch normalizing transform of a given value, <span class="math inline">$$x_i$$</span>, is: <span class="math display">$BN_{initial}(x_i) = \frac{x_i - \mu_B}{\sqrt{\sigma^2_B + \epsilon}}$</span></p> <p>Because the batch normalizing transform given above restricts the inputs to the activation function to a prescribed normal distribution, this can limit the representational power of the layer. Therefore, we allow the network to undo the batch normalizing transform by multiplying by a new scale parameter <span class="math inline">$$\gamma$$</span> and adding a new shift parameter <span class="math inline">$$\beta$$</span>. 
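</p> <p>As a quick numerical sanity check of the initial transform above, here is a minimal NumPy sketch (mine, not from the post): it normalizes a random batch feature-wise and confirms that the result has mean ≈ 0 and variance ≈ 1 in every feature.</p>

```python
import numpy as np

def bn_initial(x, epsilon=1e-3):
    """Initial batch normalizing transform: subtract the batch mean and
    divide by sqrt(batch variance + epsilon), independently per feature."""
    mu = x.mean(axis=0)    # batch mean, shape (features,)
    var = x.var(axis=0)    # batch variance, shape (features,)
    return (x - mu) / np.sqrt(var + epsilon)

rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(64, 10))  # 64 examples, 10 features
batch_hat = bn_initial(batch)
# batch_hat now has per-feature mean ~0 and variance ~1 (slightly below 1
# because of the epsilon added to the denominator)
```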
<span class="math inline">$$\gamma$$</span> and <span class="math inline">$$\beta$$</span> are learnable parameters.</p> <p>Adding in <span class="math inline">$$\gamma$$</span> and <span class="math inline">$$\beta$$</span> produces the following final batch normalizing transform: <span class="math display">$BN(x_i) = \gamma(\frac{x_i - \mu_B}{\sqrt{\sigma^2_B + \epsilon}}) + \beta$</span></p> <h3 id="implementing-batch-normalization-in-tensorflow">Implementing batch normalization in Tensorflow</h3> <p>We will add batch normalization to a basic fully-connected neural network that has two hidden layers of 100 neurons each and show a similar result to Figure 1 (b) and (c) of the BN2015 paper.</p> <p>Note that this network is not yet generally suitable for use at test time. See the section <a href="#making-predictions-with-the-model">Making predictions with the model</a> below for the reason why, as well as a fixed version.</p> <h4 id="imports-config">Imports, config</h4> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> numpy <span class="im">as</span> np, tensorflow <span class="im">as</span> tf, tqdm <span class="im">from</span> tensorflow.examples.tutorials.mnist <span class="im">import</span> input_data <span class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt <span class="op">%</span>matplotlib inline mnist <span class="op">=</span> input_data.read_data_sets(<span class="st">&#39;MNIST_data&#39;</span>, one_hot<span class="op">=</span><span class="va">True</span>)</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># Generate predetermined random weights so the networks are similarly initialized</span> w1_initial <span class="op">=</span>
np.random.normal(size<span class="op">=</span>(<span class="dv">100</span>,<span class="dv">100</span>)).astype(np.float32) w3_initial <span class="op">=</span> np.random.normal(size<span class="op">=</span>(<span class="dv">100</span>,<span class="dv">10</span>)).astype(np.float32) <span class="co"># Small epsilon value for the BN transform</span> epsilon <span class="op">=</span> <span class="fl">1e-3</span></code></pre></div> <h4 id="building-the-graph">Building the graph</h4> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># Placeholders</span> x <span class="op">=</span> tf.placeholder(tf.float32, shape<span class="op">=</span>[<span class="va">None</span>, <span class="dv">784</span>]) y_ <span class="op">=</span> tf.placeholder(tf.float32, shape<span class="op">=</span>[<span class="va">None</span>, <span class="dv">10</span>])</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># Layer 1 without BN</span> w1 <span class="op">=</span> tf.Variable(w1_initial) b1 <span class="op">=</span> tf.Variable(tf.zeros([<span class="dv">100</span>])) z1 <span class="op">=</span> tf.matmul(x,w1)<span class="op">+</span>b1 l1 <span class="op">=</span> tf.nn.sigmoid(z1)</code></pre></div> <p>Here is the same layer 1 with batch normalization:</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># Layer 1 with BN</span> w1_BN <span class="op">=</span> tf.Variable(w1_initial) <span class="co"># Note that pre-batch normalization bias is omitted. The effect of this bias would be</span> <span class="co"># eliminated when subtracting the batch mean. Instead, the role of the bias is performed</span> <span class="co"># by the new beta variable.
See Section 3.2 of the BN2015 paper.</span> z1_BN <span class="op">=</span> tf.matmul(x,w1_BN) <span class="co"># Calculate batch mean and variance</span> batch_mean1, batch_var1 <span class="op">=</span> tf.nn.moments(z1_BN,[<span class="dv">0</span>]) <span class="co"># Apply the initial batch normalizing transform</span> z1_hat <span class="op">=</span> (z1_BN <span class="op">-</span> batch_mean1) <span class="op">/</span> tf.sqrt(batch_var1 <span class="op">+</span> epsilon) <span class="co"># Create two new parameters, scale and beta (shift)</span> scale1 <span class="op">=</span> tf.Variable(tf.ones([<span class="dv">100</span>])) beta1 <span class="op">=</span> tf.Variable(tf.zeros([<span class="dv">100</span>])) <span class="co"># Scale and shift to obtain the final output of the batch normalization</span> <span class="co"># this value is fed into the activation function (here a sigmoid)</span> BN1 <span class="op">=</span> scale1 <span class="op">*</span> z1_hat <span class="op">+</span> beta1 l1_BN <span class="op">=</span> tf.nn.sigmoid(BN1)</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># Layer 2 without BN</span> w2 <span class="op">=</span> tf.Variable(w2_initial) b2 <span class="op">=</span> tf.Variable(tf.zeros([<span class="dv">100</span>])) z2 <span class="op">=</span> tf.matmul(l1,w2)<span class="op">+</span>b2 l2 <span class="op">=</span> tf.nn.sigmoid(z2)</code></pre></div> <p>Note that tensorflow provides a <code>tf.nn.batch_normalization</code>, which I apply to layer 2 below. This code does the same thing as the code for layer 1 above. 
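</p> <p>To make that equivalence concrete, here is a NumPy sketch (mine, not from the post) of the arithmetic that <code>tf.nn.batch_normalization</code> performs, checked against the two-step version used for layer 1:</p>

```python
import numpy as np

def batch_normalization(x, mean, variance, offset, scale, epsilon):
    # Mirrors the arithmetic of tf.nn.batch_normalization:
    # scale * (x - mean) / sqrt(variance + epsilon) + offset
    return scale * (x - mean) / np.sqrt(variance + epsilon) + offset

rng = np.random.default_rng(1)
z = rng.normal(size=(32, 100))   # a batch of 32 pre-activations
gamma = rng.normal(size=100)     # scale parameter
beta = rng.normal(size=100)      # shift parameter
eps = 1e-3

# Two-step version, as written out for layer 1
z_hat = (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)
bn_manual = gamma * z_hat + beta

# One-call version, as used for layer 2
bn_one_call = batch_normalization(z, z.mean(axis=0), z.var(axis=0), beta, gamma, eps)
```

The two computations agree up to floating-point rounding.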
See the documentation <a href="https://www.tensorflow.org/versions/master/api_docs/python/nn/normalization#batch_normalization">here</a> and the code <a href="https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/nn_impl.py#L702">here</a>.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># Layer 2 with BN, using Tensorflow&#39;s built-in BN function</span> w2_BN <span class="op">=</span> tf.Variable(w2_initial) z2_BN <span class="op">=</span> tf.matmul(l1_BN,w2_BN) batch_mean2, batch_var2 <span class="op">=</span> tf.nn.moments(z2_BN,[<span class="dv">0</span>]) scale2 <span class="op">=</span> tf.Variable(tf.ones([<span class="dv">100</span>])) beta2 <span class="op">=</span> tf.Variable(tf.zeros([<span class="dv">100</span>])) BN2 <span class="op">=</span> tf.nn.batch_normalization(z2_BN,batch_mean2,batch_var2,beta2,scale2,epsilon) l2_BN <span class="op">=</span> tf.nn.sigmoid(BN2)</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># Softmax</span> w3 <span class="op">=</span> tf.Variable(w3_initial) b3 <span class="op">=</span> tf.Variable(tf.zeros([<span class="dv">10</span>])) y <span class="op">=</span> tf.nn.softmax(tf.matmul(l2,w3)<span class="op">+</span>b3) w3_BN <span class="op">=</span> tf.Variable(w3_initial) b3_BN <span class="op">=</span> tf.Variable(tf.zeros([<span class="dv">10</span>])) y_BN <span class="op">=</span> tf.nn.softmax(tf.matmul(l2_BN,w3_BN)<span class="op">+</span>b3_BN)</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># Loss, optimizer and predictions</span> cross_entropy <span class="op">=</span> <span class="op">-</span>tf.reduce_sum(y_<span class="op">*</span>tf.log(y)) cross_entropy_BN <span class="op">=</span> <span class="op">-</span>tf.reduce_sum(y_<span class="op">*</span>tf.log(y_BN)) train_step <span
class="op">=</span> tf.train.GradientDescentOptimizer(<span class="fl">0.01</span>).minimize(cross_entropy) train_step_BN <span class="op">=</span> tf.train.GradientDescentOptimizer(<span class="fl">0.01</span>).minimize(cross_entropy_BN) correct_prediction <span class="op">=</span> tf.equal(tf.arg_max(y,<span class="dv">1</span>),tf.arg_max(y_,<span class="dv">1</span>)) accuracy <span class="op">=</span> tf.reduce_mean(tf.cast(correct_prediction,tf.float32)) correct_prediction_BN <span class="op">=</span> tf.equal(tf.arg_max(y_BN,<span class="dv">1</span>),tf.arg_max(y_,<span class="dv">1</span>)) accuracy_BN <span class="op">=</span> tf.reduce_mean(tf.cast(correct_prediction_BN,tf.float32))</code></pre></div> <h4 id="training-the-network">Training the network</h4> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">zs, BNs, acc, acc_BN <span class="op">=</span> [], [], [], [] sess <span class="op">=</span> tf.InteractiveSession() sess.run(tf.global_variables_initializer()) <span class="cf">for</span> i <span class="kw">in</span> tqdm.tqdm(<span class="bu">range</span>(<span class="dv">40000</span>)): batch <span class="op">=</span> mnist.train.next_batch(<span class="dv">60</span>) train_step.run(feed_dict<span class="op">=</span>{x: batch[<span class="dv">0</span>], y_: batch[<span class="dv">1</span>]}) train_step_BN.run(feed_dict<span class="op">=</span>{x: batch[<span class="dv">0</span>], y_: batch[<span class="dv">1</span>]}) <span class="cf">if</span> i <span class="op">%</span> <span class="dv">50</span> <span class="op">==</span> <span class="dv">0</span>: res <span class="op">=</span> sess.run([accuracy,accuracy_BN,z2,BN2],feed_dict<span class="op">=</span>{x: mnist.test.images, y_: mnist.test.labels}) acc.append(res[<span class="dv">0</span>]) acc_BN.append(res[<span class="dv">1</span>]) zs.append(np.mean(res[<span class="dv">2</span>],axis<span class="op">=</span><span class="dv">0</span>)) <span class="co"># 
record the mean value of z2 over the entire test set</span> BNs.append(np.mean(res[<span class="dv">3</span>],axis<span class="op">=</span><span class="dv">0</span>)) <span class="co"># record the mean value of BN2 over the entire test set</span> zs, BNs, acc, acc_BN <span class="op">=</span> np.array(zs), np.array(BNs), np.array(acc), np.array(acc_BN)</code></pre></div> <h3 id="improvements-in-speed-and-accuracy">Improvements in speed and accuracy</h3> <p>As seen below, there is a noticeable improvement in the accuracy and speed of training. As shown in figure 2 of the BN2015 paper, this difference can be very significant for other network architectures.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">fig, ax <span class="op">=</span> plt.subplots() ax.plot(<span class="bu">range</span>(<span class="dv">0</span>,<span class="bu">len</span>(acc)<span class="op">*</span><span class="dv">50</span>,<span class="dv">50</span>),acc, label<span class="op">=</span><span class="st">&#39;Without BN&#39;</span>) ax.plot(<span class="bu">range</span>(<span class="dv">0</span>,<span class="bu">len</span>(acc)<span class="op">*</span><span class="dv">50</span>,<span class="dv">50</span>),acc_BN, label<span class="op">=</span><span class="st">&#39;With BN&#39;</span>) ax.set_xlabel(<span class="st">&#39;Training steps&#39;</span>) ax.set_ylabel(<span class="st">&#39;Accuracy&#39;</span>) ax.set_ylim([<span class="fl">0.8</span>,<span class="dv">1</span>]) ax.set_title(<span class="st">&#39;Batch Normalization Accuracy&#39;</span>) ax.legend(loc<span class="op">=</span><span class="dv">4</span>) plt.show()</code></pre></div> <figure> <img src="https://r2rt.com/static/images/BN_output_23_0.png" alt="Effect of batch normalization on training" /><figcaption>Effect of batch normalization on training</figcaption> </figure> <h5 id="illustration-of-input-to-activation-functions-over-time">Illustration of input to activation functions over 
time</h5> <p>Below is the distribution over time of the inputs to the sigmoid activation function of the first five neurons in the network’s second layer. Batch normalization has a visible and significant effect of removing variance/noise in these inputs. As described by Ioffe and Szegedy, this allows the third layer to learn faster and is responsible for the increase in accuracy and learning speed. See Figure 1 and Section 4.1 of the BN2015 paper.</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">fig, axes <span class="op">=</span> plt.subplots(<span class="dv">5</span>, <span class="dv">2</span>, figsize<span class="op">=</span>(<span class="dv">6</span>,<span class="dv">12</span>)) fig.tight_layout() <span class="cf">for</span> i, ax <span class="kw">in</span> <span class="bu">enumerate</span>(axes): ax[<span class="dv">0</span>].set_title(<span class="st">&quot;Without BN&quot;</span>) ax[<span class="dv">1</span>].set_title(<span class="st">&quot;With BN&quot;</span>) ax[<span class="dv">0</span>].plot(zs[:,i]) ax[<span class="dv">1</span>].plot(BNs[:,i])</code></pre></div> <figure> <img src="https://r2rt.com/static/images/BN_output_25_0.png" alt="Effect of batch normalization on inputs to activation functions" /><figcaption>Effect of batch normalization on inputs to activation functions</figcaption> </figure> <h3 id="making-predictions-with-the-model">Making predictions with the model</h3> <p>When using a batch normalized model at test time to make predictions, using the batch mean and batch variance can be counter-productive. 
To see this, consider what happens if we feed a single example into the trained model above: the inputs to our activation functions will always be 0 (since we are normalizing them to have a mean of 0), and we will always get the same prediction, regardless of the input!</p> <p>To demonstrate:</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">predictions <span class="op">=</span> [] correct <span class="op">=</span> <span class="dv">0</span> <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">100</span>): pred, corr <span class="op">=</span> sess.run([tf.arg_max(y_BN,<span class="dv">1</span>), accuracy_BN], feed_dict<span class="op">=</span>{x: [mnist.test.images[i]], y_: [mnist.test.labels[i]]}) correct <span class="op">+=</span> corr predictions.append(pred[<span class="dv">0</span>]) <span class="bu">print</span>(<span class="st">&quot;PREDICTIONS:&quot;</span>, predictions) <span class="bu">print</span>(<span class="st">&quot;ACCURACY:&quot;</span>, correct<span class="op">/</span><span class="dv">100</span>)</code></pre></div> <pre><code>PREDICTIONS: [8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8] ACCURACY: 0.02</code></pre> <p>Our model always predicts 8, and there appear to be only two 8s in the first 100 MNIST test samples, for an accuracy of 2%.</p> <h3 id="fixing-the-model-for-test-time">Fixing the model for test time</h3> <p>To fix this, we need to replace the batch mean and batch variance in each batch normalization step with estimates of the population mean and population variance, respectively. See Section 3.1 of the BN2015 paper. 
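</p> <p>The failure mode itself is easy to reproduce outside of Tensorflow. In this NumPy sketch (mine, not from the post), a “batch” of one example normalized by its own statistics is mapped to all zeros, no matter what the input was, which is exactly why the prediction cannot depend on the input:</p>

```python
import numpy as np

def bn(x, epsilon=1e-3):
    # Normalize by the statistics of the batch itself
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + epsilon)

single_a = np.array([[1.0, 2.0, 3.0]])    # a batch of one example
single_b = np.array([[40.0, 5.0, -6.0]])  # a very different single example

# With a batch of one, the batch mean equals the example and the batch
# variance is 0, so the normalized output is all zeros for any input:
print(bn(single_a))  # [[0. 0. 0.]]
print(bn(single_b))  # [[0. 0. 0.]]
```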
Testing the model above only worked because the entire test set was predicted at once, so the “batch mean” and “batch variance” of the test set provided good estimates for the population mean and population variance.</p> <p>To make a batch normalized model generally suitable for testing, we want to obtain estimates for the population mean and population variance at each batch normalization step before test time (i.e., during training), and use these values when making predictions. Note that for the same reason that we need batch normalization (i.e. the mean and variance of the activation inputs changes during training), it would be best to estimate the population mean and variance <em>after</em> the weights they depend on are trained, although doing these simultaneously is not the worst offense, since the weights are expected to converge near the end of training.</p> <p>And now, to actually implement this in Tensorflow, we will write a <code>batch_norm_wrapper</code> function, which we will use to wrap the inputs to our activation functions. The function will store the population mean and variance as tf.Variables, and decide whether to use the batch statistics or the population statistics for normalization. To do this, it makes use of an <code>is_training</code> flag. Because we need to learn the population mean and variance during training, we do this when <code>is_training == True</code>. Here is an outline of the code:</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> batch_norm_wrapper(inputs, is_training): ... 
pop_mean <span class="op">=</span> tf.Variable(tf.zeros([inputs.get_shape()[<span class="op">-</span><span class="dv">1</span>]]), trainable<span class="op">=</span><span class="va">False</span>) pop_var <span class="op">=</span> tf.Variable(tf.ones([inputs.get_shape()[<span class="op">-</span><span class="dv">1</span>]]), trainable<span class="op">=</span><span class="va">False</span>) <span class="cf">if</span> is_training: mean, var <span class="op">=</span> tf.nn.moments(inputs,[<span class="dv">0</span>]) ... <span class="co"># learn pop_mean and pop_var here</span> ... <span class="cf">return</span> tf.nn.batch_normalization(inputs, batch_mean, batch_var, beta, scale, epsilon) <span class="cf">else</span>: <span class="cf">return</span> tf.nn.batch_normalization(inputs, pop_mean, pop_var, beta, scale, epsilon)</code></pre></div> <p>Note that the variables have been declared with a <code>trainable = False</code> argument, since we will be updating these ourselves rather than having the optimizer do it.</p> <p>One approach to estimating the population mean and variance during training is to use an exponential moving average, though strictly speaking, a simple average over the sample would be (marginally) better. 
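</p> <p>To illustrate why an exponential moving average is a reasonable estimator here, this NumPy sketch (mine, with a hypothetical decay of 0.99) applies the same update rule to noisy batch means and converges toward the true population mean:</p>

```python
import numpy as np

rng = np.random.default_rng(2)
decay = 0.99
pop_mean = 0.0    # running estimate, initialized like tf.zeros
true_mean = 3.5   # the population mean we are trying to estimate

for _ in range(2000):
    batch = rng.normal(loc=true_mean, scale=1.0, size=50)
    batch_mean = batch.mean()
    # The same update rule as the tf.assign op below:
    pop_mean = pop_mean * decay + batch_mean * (1 - decay)

print(round(pop_mean, 1))  # ≈ 3.5
```

After many updates, the contribution of the (arbitrary) initial value has decayed to nothing and the estimate hovers near the population mean, with residual noise controlled by the decay rate.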
The exponential moving average is simple and lets us avoid extra work, so we use that:</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">decay <span class="op">=</span> <span class="fl">0.999</span> <span class="co"># use numbers closer to 1 if you have more data</span> train_mean <span class="op">=</span> tf.assign(pop_mean, pop_mean <span class="op">*</span> decay <span class="op">+</span> batch_mean <span class="op">*</span> (<span class="dv">1</span> <span class="op">-</span> decay)) train_var <span class="op">=</span> tf.assign(pop_var, pop_var <span class="op">*</span> decay <span class="op">+</span> batch_var <span class="op">*</span> (<span class="dv">1</span> <span class="op">-</span> decay))</code></pre></div> <p>Finally, we will need a way to call these training ops. For full control, you can add them to a graph collection (see the link to Tensorflow’s code below), but for simplicity, we will call them every time we calculate the batch_mean and batch_var. To do this, we add them as dependencies to the return value of batch_norm_wrapper when is_training is true. Here is the final batch_norm_wrapper function:</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co"># this is a simpler version of Tensorflow&#39;s &#39;official&#39; version. 
See:</span> <span class="co"># https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/layers/python/layers/layers.py#L102</span> <span class="kw">def</span> batch_norm_wrapper(inputs, is_training, decay <span class="op">=</span> <span class="fl">0.999</span>): scale <span class="op">=</span> tf.Variable(tf.ones([inputs.get_shape()[<span class="op">-</span><span class="dv">1</span>]])) beta <span class="op">=</span> tf.Variable(tf.zeros([inputs.get_shape()[<span class="op">-</span><span class="dv">1</span>]])) pop_mean <span class="op">=</span> tf.Variable(tf.zeros([inputs.get_shape()[<span class="op">-</span><span class="dv">1</span>]]), trainable<span class="op">=</span><span class="va">False</span>) pop_var <span class="op">=</span> tf.Variable(tf.ones([inputs.get_shape()[<span class="op">-</span><span class="dv">1</span>]]), trainable<span class="op">=</span><span class="va">False</span>) <span class="cf">if</span> is_training: batch_mean, batch_var <span class="op">=</span> tf.nn.moments(inputs,[<span class="dv">0</span>]) train_mean <span class="op">=</span> tf.assign(pop_mean, pop_mean <span class="op">*</span> decay <span class="op">+</span> batch_mean <span class="op">*</span> (<span class="dv">1</span> <span class="op">-</span> decay)) train_var <span class="op">=</span> tf.assign(pop_var, pop_var <span class="op">*</span> decay <span class="op">+</span> batch_var <span class="op">*</span> (<span class="dv">1</span> <span class="op">-</span> decay)) <span class="cf">with</span> tf.control_dependencies([train_mean, train_var]): <span class="cf">return</span> tf.nn.batch_normalization(inputs, batch_mean, batch_var, beta, scale, epsilon) <span class="cf">else</span>: <span class="cf">return</span> tf.nn.batch_normalization(inputs, pop_mean, pop_var, beta, scale, epsilon)</code></pre></div> <h3 id="an-implementation-that-works-at-test-time">An implementation that works at test time</h3> <p>And now to demonstrate that this works, we 
rebuild/retrain the model with our batch_norm_wrapper function. Note that we need to build the graph once for training, and then again at test time, so we write a build_graph function (in practice, this would usually be encapsulated in a model object):</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> build_graph(is_training): <span class="co"># Placeholders</span> x <span class="op">=</span> tf.placeholder(tf.float32, shape<span class="op">=</span>[<span class="va">None</span>, <span class="dv">784</span>]) y_ <span class="op">=</span> tf.placeholder(tf.float32, shape<span class="op">=</span>[<span class="va">None</span>, <span class="dv">10</span>]) <span class="co"># Layer 1</span> w1 <span class="op">=</span> tf.Variable(w1_initial) z1 <span class="op">=</span> tf.matmul(x,w1) bn1 <span class="op">=</span> batch_norm_wrapper(z1, is_training) l1 <span class="op">=</span> tf.nn.sigmoid(bn1) <span class="co"># Layer 2</span> w2 <span class="op">=</span> tf.Variable(w2_initial) z2 <span class="op">=</span> tf.matmul(l1,w2) bn2 <span class="op">=</span> batch_norm_wrapper(z2, is_training) l2 <span class="op">=</span> tf.nn.sigmoid(bn2) <span class="co"># Softmax</span> w3 <span class="op">=</span> tf.Variable(w3_initial) b3 <span class="op">=</span> tf.Variable(tf.zeros([<span class="dv">10</span>])) y <span class="op">=</span> tf.nn.softmax(tf.matmul(l2, w3) <span class="op">+</span> b3) <span class="co"># Loss, Optimizer and Predictions</span> cross_entropy <span class="op">=</span> <span class="op">-</span>tf.reduce_sum(y_<span class="op">*</span>tf.log(y)) train_step <span class="op">=</span> tf.train.GradientDescentOptimizer(<span class="fl">0.01</span>).minimize(cross_entropy) correct_prediction <span class="op">=</span> tf.equal(tf.arg_max(y,<span class="dv">1</span>),tf.arg_max(y_,<span class="dv">1</span>)) accuracy <span class="op">=</span> tf.reduce_mean(tf.cast(correct_prediction,tf.float32)) <span
class="cf">return</span> (x, y_), train_step, accuracy, y, tf.train.Saver()</code></pre></div> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="co">#Build training graph, train and save the trained model</span> sess.close() tf.reset_default_graph() (x, y_), train_step, accuracy, _, saver <span class="op">=</span> build_graph(is_training<span class="op">=</span><span class="va">True</span>) acc <span class="op">=</span> [] <span class="cf">with</span> tf.Session() <span class="im">as</span> sess: sess.run(tf.global_variables_initializer()) <span class="cf">for</span> i <span class="kw">in</span> tqdm.tqdm(<span class="bu">range</span>(<span class="dv">10000</span>)): batch <span class="op">=</span> mnist.train.next_batch(<span class="dv">60</span>) train_step.run(feed_dict<span class="op">=</span>{x: batch[<span class="dv">0</span>], y_: batch[<span class="dv">1</span>]}) <span class="cf">if</span> i <span class="op">%</span> <span class="dv">50</span> <span class="op">==</span> <span class="dv">0</span>: res <span class="op">=</span> sess.run([accuracy],feed_dict<span class="op">=</span>{x: mnist.test.images, y_: mnist.test.labels}) acc.append(res[<span class="dv">0</span>]) saved_model <span class="op">=</span> saver.save(sess, <span class="st">&#39;./temp-bn-save&#39;</span>) <span class="bu">print</span>(<span class="st">&quot;Final accuracy:&quot;</span>, acc[<span class="op">-</span><span class="dv">1</span>])</code></pre></div> <pre><code>Final accuracy: 0.9721</code></pre> <p>And now to show that this worked, we repeat our experiment of predicting examples one by one:</p> <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">tf.reset_default_graph() (x, y_), _, accuracy, y, saver <span class="op">=</span> build_graph(is_training<span class="op">=</span><span class="va">False</span>) predictions <span class="op">=</span> [] correct <span class="op">=</span> <span
class="dv">0</span> <span class="cf">with</span> tf.Session() <span class="im">as</span> sess: sess.run(tf.global_variables_initializer()) saver.restore(sess, <span class="st">&#39;./temp-bn-save&#39;</span>) <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">100</span>): pred, corr <span class="op">=</span> sess.run([tf.arg_max(y,<span class="dv">1</span>), accuracy], feed_dict<span class="op">=</span>{x: [mnist.test.images[i]], y_: [mnist.test.labels[i]]}) correct <span class="op">+=</span> corr predictions.append(pred[<span class="dv">0</span>]) <span class="bu">print</span>(<span class="st">&quot;PREDICTIONS:&quot;</span>, predictions) <span class="bu">print</span>(<span class="st">&quot;ACCURACY:&quot;</span>, correct<span class="op">/</span><span class="dv">100</span>)</code></pre></div> <pre><code>PREDICTIONS: [7, 2, 1, 0, 4, 1, 4, 9, 6, 9, 0, 6, 9, 0, 1, 5, 9, 7, 3, 4, 9, 6, 6, 5, 4, 0, 7, 4, 0, 1, 3, 1, 3, 4, 7, 2, 7, 1, 2, 1, 1, 7, 4, 2, 3, 5, 1, 2, 4, 4, 6, 3, 5, 5, 6, 0, 4, 1, 9, 5, 7, 8, 9, 3, 7, 4, 6, 4, 3, 0, 7, 0, 2, 9, 1, 7, 3, 2, 9, 7, 7, 6, 2, 7, 8, 4, 7, 3, 6, 1, 3, 6, 9, 3, 1, 4, 1, 7, 6, 9] ACCURACY: 0.99</code></pre> </body> </html> Skill vs Strategy (2016-01-23, Silviu Pitis, r2rt.com/skill-vs-strategy.html) <!DOCTYPE html> <html> <head> <meta charset="utf-8"> <meta name="generator" content="pandoc"> <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes"> <title></title> <style type="text/css">code{white-space: pre;}</style> <!--[if lt IE 9]> <script src="//cdnjs.cloudflare.com/ajax/libs/html5shiv/3.7.3/html5shiv-printshiv.min.js"></script> <![endif]--> </head> <body> <p>In this post I discuss several real life examples of strategies that cannot be achieved through mere practice of an inferior strategy. Backpropagation is an algorithm akin to such “mere practice” in that backpropagation develops skill at a specific strategy (i.e., it learns a specific local minimum). Like practice, backpropagation alone cannot result in a switch to a superior strategy. I look at how strategy switches are achieved in real examples and ask what algorithm might allow machines to effectively switch strategies.</p> <h2 id="shooting-hoops">Shooting hoops</h2> <p>As a basketball player, I found the difference between the shots of 12 year olds and NBA players fascinating. Compare the shot of 12 year old Julian Newman to that of Dirk Nowitzki:</p> <div class="panel panel-default"> <div class="panel-heading"> <a class="spoiler-trigger">Show</a> </div> <div class="panel-collapse collapse out"> <div class="panel-body"> <p><img src="https://r2rt.com/static/images/bball_dirk.gif" alt="Dirk Nowitzki basketball shot" style="max-width: 45%; float: right"> <img src="https://r2rt.com/static/images/bball_julian.gif" alt="Julian Newman basketball shot" style="max-width: 45%;"></p> </div> </div> </div> <!-- Source Videos: https://youtu.be/FjPNPTcxz1I http://youtu.be/eDgRMdT1QVI --> <p>Julian’s shot is as skilled as it gets given his age and physical characteristics. But notice how Julian shoots from his chest.
He leans forward and pushes the ball away from his body (a chest shot). This is different from Dirk, who stands tall, raises the ball above his head, and shoots by extending his arm and flicking his wrist (a wrist shot).</p> <p>As he grows, Julian will soon find opponents’ hands inconveniently placed in the path of his shot. He’ll also soon be tall enough and strong enough to shoot like Dirk.</p> <p>What’s interesting is that Julian’s chest shot skill will not transfer perfectly. The only way he can switch to a wrist shot is by training the new style.</p> <p>To us observers, this switch may seem obvious. Yet in my experience and that of my basketball-playing friends, it is anything but. When we were Julian’s age, our chest shots came naturally and with higher accuracy (even with an opponent’s hand in front). We did switch eventually, perhaps due to a combination of logic and faith in the results of continued practice, or maybe as a simple matter of monkey see monkey do, or perhaps it was Coach Carter’s shooting drills that did it. Whatever the reason, the ball now leaves my hands from above my head with a flick of the wrist. I shoot like Dirk, except the part where the ball goes in.</p> <h2 id="the-fosbury-flop">The Fosbury Flop</h2> <p>The <a href="https://en.wikipedia.org/wiki/Fosbury_Flop">Fosbury Flop</a> is to Olympic high jumping as the wrist shot is to shooting a basketball. Through the mid-1960s, Olympic high jumpers would jump using the straddle technique, Western Roll, Eastern cut-off or scissors jump. But in 1968, a young Dick Fosbury set a new world record with his unorthodox new method of flopping backward over the bar.
The Fosbury Flop was born.</p> <p>Compare the dominant pre-Fosbury style, the straddle technique, to the Fosbury Flop:</p> <div class="panel panel-default"> <div class="panel-heading"> <p><a class="spoiler-trigger">Show</a></p> </div> <div class="panel-collapse collapse out"> <div class="panel-body"> <p><img src="https://r2rt.com/static/images/fosbury.gif" alt="Fosbury Flop" style="max-width: 45%; min-width: 90px; float: right"> <img src="https://r2rt.com/static/images/straddle.gif" alt="Straddle technique" style="max-width: 45%; min-width: 90px;"></p> </div> </div> </div> <!-- Source Videos: https://www.youtube.com/watch?v=d6lpk_9T5hM https://www.youtube.com/watch?v=Id4W6VA0uLc --> <p>Unlike the chest shot, which is clearly more prone to being caught under an opponent’s hand than the wrist shot, it’s not clear that the Fosbury Flop is any better than the straddle technique. Indeed, when Fosbury first used the Flop, he was met with warnings from his coaches. And even today, you will find high jumpers debating the relative merits of each jumping style. But the Fosbury Flop caught on, and has held every world record since the straddle technique’s last record in 1978.</p> <h2 id="as-easy-as-17-plus-24">As easy as 17 plus 24</h2> <p>Another example, one of my favorites, concerns a fundamental human skill. Unlike the wrist shot for basketball players and the Fosbury Flop for high jumpers, this technique is relatively unknown among those who stand to benefit.</p> <p>Sometime in middle school, my music teacher recounted the story of a student who did arithmetic in his head faster than with a calculator. His secret? While school had taught us to add right-to-left, this student added left-to-right. A proud skeptic, I tested his technique out for myself. After a little practice, I too was summing numbers faster than with a calculator. It turns out that adjusting your answers for carried 1s isn’t as hard as it seems. 
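</p> <p>The trick is easy to sketch in code (a toy function of my own, not from the story): keep a running total for the digits seen so far and let each new column update it, so a carried 1 simply bumps the digits already announced.</p> <pre><code>def add_left_to_right(a, b):
    """Add two non-negative integers most-significant digit first."""
    width = max(len(str(a)), len(str(b)))
    da, db = str(a).zfill(width), str(b).zfill(width)
    total = 0
    for x, y in zip(da, db):
        # shift the running answer one place, then add the next column;
        # a carry adjusts the digits already spoken
        total = total * 10 + int(x) + int(y)
    return total

print(add_left_to_right(17, 24))  # 41</code></pre> <p>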
In math class I had just started algebra, but in music, I had finally learned to add correctly.</p> <h2 id="mayan-astronomy">Mayan astronomy</h2> <p>As a final fun example, take a minute to watch Richard Feynman’s <a href="https://youtu.be/NM-zWTU7X-k?t=3m52s">imaginary discussion between a Mayan astronomer and his student</a>:</p> <iframe width="420" height="315" src="https://www.youtube.com/embed/NM-zWTU7X-k?start=232" frameborder="0"> </iframe> <h2 id="well-executed-strategies-are-local-minima">Well-executed strategies are local minima</h2> <p>Similar situations are common. For most any task there exist multiple strategies of varying effectiveness: there are different ways to perform in competitive sports, different ways to study, and even different ways to think about life and pursue happiness. In the language of machine learning, this is simply to say that real world situations involve non-convex objective functions with multiple local minima. The dip around each local minimum corresponds to a distinct strategy, and the minimum itself corresponds to the perfect execution of that strategy. Under this lens, <em>backpropagation is an algorithm for improving skill</em>.</p> <p>Perhaps unsurprisingly, a skillfully-executed next-best strategy works in each of the above examples. It may not be as effective as the best strategy, but it gets the job done (else it wouldn’t really be a strategy). This is similar to the empirical result that despite the non-convex nature of a neural network’s error surface, backpropagation generally converges to a local minimum that performs well. A good way to think about this might be that until something “sort of works” (i.e., it has some semblance of a strategy), it won’t even start to converge; you can’t be skilled at a non-strategy.</p> <p>In spite of the efficacy of the next-best strategy, the examples above demonstrate that a superior strategy can perform significantly better. 
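</p> <p>The local-minima picture can be made concrete with a toy sketch (the function and learning rate are invented for illustration): plain gradient descent polishes whichever basin it starts in, and never jumps to the deeper one.</p> <pre><code>def descend(x, lr=0.01, steps=2000):
    # gradient descent on f(x) = x**4 - 3*x**2 + x, which has two basins
    for _ in range(steps):
        x -= lr * (4 * x**3 - 6 * x + 1)  # subtract lr times f'(x)
    return x

print(round(descend(1.0), 2))   # 1.13: the shallower minimum (the "chest shot")
print(round(descend(-1.0), 2))  # -1.3: the deeper minimum (the "wrist shot")</code></pre> <p>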
A superior strategy can also open doors that were previously closed in related domains (e.g., our ancestors needed to develop opposable thumbs before they could develop writing). This is critical to creating strong artificial intelligence, and so raises the question: what prompts the switch to a superior strategy, and how can we use this to build stronger AI?</p> <p>[<strong>Note 2017/03/05</strong>: I have since learned that my thoughts here are related to the exploration vs exploitation problem in reinforcement learning. I will write more on this later.]</p> <h2 id="whats-in-a-strategy-why-random-guesses-wont-work">What’s in a strategy; why random guesses won’t work</h2> <p>Implicit in a strategy is an <em>objective</em>, which can be shooting a basketball through a hoop or minimizing an arbitrary loss function.</p> <p>A strategy is executed within a <em>context</em>, which can make reaching the objective easier. It’s easier to shoot a basketball if you are a foot taller, and it’s easier to train a model when you have more data.</p> <p>Because some characteristics of the context are within our control, our strategy will often first involve developing these characteristics. In basketball, you can shoot from close up by passing the ball or crossing your opponent. In training a neural network, you can gather more data or select better features.</p> <p>As developing these characteristics can take time, and different strategies often rely on different characteristics, switching strategies can be costly. While changing the structure of a small neural network may be relatively inexpensive, retraining a lawyer to be a software engineer can be difficult due to the large investment required to develop domain expertise. Therefore, we’ll often need a compelling reason to prompt a strategy switch; something more than a random guess.</p> <p>Another reason to rule out random guesses is the very reason we need the backpropagation algorithm to begin with.
There are infinitely many random guesses we could make, and we simply do not have infinite time to make them. So whereas backpropagation and its variants are required meta-strategies that tell us how to converge to a working strategy, we might wonder whether there are other meta-strategies that tell us how to effectively switch strategies.</p> <h2 id="switching-strategies-through-social-learning">Switching strategies through social learning</h2> <p>One such meta-strategy is social learning, or learning by external influence, whether it’s the Coach’s shooting drills, seeing other players hit threes using their wrists, watching Dick Fosbury set a world record with his Flop, or hearing a story of how a math whiz does mental addition faster than a calculator. In my limited experience, there is not yet a true parallel for this in the world of artificial intelligence. [<strong>Note 2017/03/05</strong>: My experience at the time was indeed very limited: genetic algorithms, and related algorithms that keep populations of candidates, like <a href="http://www.cc.gatech.edu/~isbell/papers/isbell-mimic-nips-1997.pdf">Bonet et al.’s MIMIC</a>, do precisely this kind of social learning.]</p> <p>Our machine learning models do switch strategies due to the external influence of their creators; for example, vision researchers were presented with the so-called “AlexNet” architecture in 2012 (described in this <a href="http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf">paper</a>), and the model has since been copied and further improved. Unfortunately, such improvement is not machine autonomous, but requires the direct intervention of a human. There are cases where a model is improved by mimicking a database of past samples, but it is difficult to call such programmed mimicry autonomous or intelligent.</p> <p>One big question, then, is: how can we get a machine to learn socially?
Is there a way to do this on a more sophisticated level than averaging models or copying weight variables? There must be, because humans can do it.</p> <p>Being able to accept external influence requires the ability to understand the process that is producing the result. We can see and communicate the precise difference between a chest shot and a wrist shot, and so we can accept the external influence and learn via imitation. But even human communication is limited to concepts that the receiver understands. A chess amateur may gain little by watching a grandmaster play, whereas a master might learn a great deal. I think the day our programs are able to understand higher-order concepts so as to learn by external influence, we will be <em>very</em> close to artificial general intelligence. [<strong>Note 2017/03/05</strong>: These thoughts are very similar to the ones I put forth on Stage III models in my post on <a href="http://r2rt.com/deconstruction-with-explicit-embeddings.html">discrete embeddings</a>.]</p> <h2 id="strategies-of-first-convergence-or-why-social-learning-is-not-enough">Strategies of first convergence, or why social learning is not enough</h2> <p>With social learning, our machines will be able to stand on the shoulders of giants. But what if there are no giants?</p> <p>While it is certainly possible to first converge to the best strategy, the chest shot in basketball shows us that our first strategy may be biased. When kids are young, they are not strong or tall enough to pull off a wrist shot and inevitably converge to a chest shot. By the time they grow up, their chest shot may be a local minimum that they can never escape without some creativity or outside help.</p> <p>Suppose we are trying to classify images of people as male or female, and train an image classification neural network with backpropagation to do so. By starting with a random initialization, the network reaches a local minimum.
The question is: can we be sure that this local minimum is close to the global minimum? We might try testing this by reinitializing the weights and retraining the network. Let’s say we retrain the network one million times, and each of the local minima reached leads to approximately the same performance. Is this enough for us to conclude that the resulting strategies are close to the best? I would answer in the negative; we cannot be certain that a random initialization will ever lead to an optimal strategy via backpropagation. It may be a situation like the chest shot, where in order to reach an optimal strategy, the network must be trained again after it has learned some useful hidden features.</p> <p>It’s possible, for example, that height is such a good first proxy that neural networks trained with backpropagation immediately learn to use, and even heavily rely on, height as a feature. Humans know that while height is correlated with gender, more subtle characteristics like facial structure are superior predictors. It’s possible that neural networks trained with just backpropagation, even if they eventually learn to use facial structure, will never be able to change their strategy completely and “unlearn” the use of height.</p> <p>Therefore, even if machines are able to learn strategies from each other, it may not be enough to produce the Fosbury Flop or the theory of General Relativity. <em>Monkey see, monkey do</em> is not enough for true intelligence: intelligent machines must be able to produce new strategies independently.</p> <h2 id="switching-strategies-through-creativity">Switching strategies through creativity</h2> <p>The ability to switch strategies without external influence is the ultimate mark of intelligence. It takes something more than training to stop what one is currently doing and try something else entirely. You need to have a hypothesis that another method will be superior before you try it.
In the basketball example, the opponent might stick a hand in your face while you’re trying to shoot, and it might prompt the thought: “if only I could shoot from higher up.” In isolation, backpropagating neural networks cannot have these sorts of thoughts about their weights and structures.</p> <p>The key to independent strategy switches is the hypothesis: a guess. Per Feynman, this is the core of the scientific method:</p> <iframe width="420" height="315" src="https://www.youtube.com/embed/EYPapE-3FRw" frameborder="0"> </iframe> <p>As Feynman notes, however, the guesses are not random; some guesses are better than others. Our task, then, is to figure out an algorithm for making effective guesses: an algorithm for creativity. [<strong>Note 2017/03/05</strong>: The ability to generate hypotheticals is one key aspect of this. Note that epsilon-greedy and other random exploration approaches don’t quite cut it; they are just another form of social learning (learning socially from ourselves), unrelated to the scientific method.]</p> </body> </html>