## Task

In this post, we’ll use Tensorflow to construct an RNN that operates on input sequences of variable lengths. We’ll use this RNN to classify bloggers by age bracket and gender using sentence-long writing samples. One time step will represent a single word, with the complete input sequence representing a single sentence. The challenge is to build a model that can classify multiple sentences of different lengths at the same time.

### Other tutorials on variable length sequences

There are a couple other tutorials on this topic. For example, the official Tensorflow seq2seq tutorial model accomodates variable length sequences. This official model, however, is a bit advanced for a first exposure and a little too specialized to be easily portable to other contexts. Danijar Hafner has written a more approachable guide here, which I recommend. In contrast to Danijar’s post, this post is written a linear ipython notebook-style to make it easy for you to follow along step-by-step. This post also includes a section on bucketing, a technique that can significantly improve your model’s training time.

## Data

The data for this post is sourced from the “Blog Authorship Corpus”, available here. The original dataset was tokenized and split into sentences using spacy. Sentences with less than 5 tokens and sentences with more than 30 tokens were discarded. Number-like tokens were replaced by “<#>”. Tokens other than the 9999 most common tokens were replaced by “< UNK >”, for a 10000 token vocabulary. Sentences were tagged with the gender (0 for male, 1 for female) and age bracket (0 for teens, 1 for 20s, 2 for 30s) and placed into a pandas dataframe. The modified data and code to import can be found here.

Below is the head of the dataframe (tokens in “string” are delimited by spaces):

```
import pandas as pd, numpy as np, tensorflow as tf
import blogs_data #available at https://github.com/spitis/blogs_data
```

```
df = blogs_data.loadBlogs().sample(frac=1).reset_index(drop=True)
vocab, reverse_vocab = blogs_data.loadVocab()
train_len, test_len = np.floor(len(df)*0.8), np.floor(len(df)*0.2)
train, test = df.ix[:train_len-1], df.ix[train_len:train_len + test_len]
df = None
train.head()
```

post_id | gender | age | string | as_numbers | length | |
---|---|---|---|---|---|---|

0 | 118860 | 0 | 0 | the last time i checked the constitution of th… | [4, 127, 63, 3, 1837, 4, 3871, 8, 4, 1236, 927… | 22 |

1 | 178031 | 1 | 1 | but i do wish more people were so |
[20, 3, 31, 360, 77, 79, 88, 27, 0, 43, 631, 2… | 15 |

2 | 182592 | 1 | 0 | it came back to me right away and i was off . | [10, 209, 93, 5, 19, 136, 192, 6, 3, 17, 129, 2] | 12 |

3 | 144982 | 0 | 2 | day <#> - get to class <#> min . early . | [94, 12, 33, 59, 5, 320, 12, 2703, 2, 457, 2] | 11 |

4 | 43048 | 0 | 0 | cause you do n’t know how much i , i need you … | [332, 15, 31, 28, 64, 86, 96, 3, 1, 3, 157, 15… | 21 |

We’re going to build an RNN that accepts batches of data from the “as_numbers” column and predicts the “gender” and “age_bracket” columns. The first step is to construct a simple iterator that returns batches of inputs along with their targets, and the length of each input:

```
class SimpleDataIterator():
def __init__(self, df):
self.df = df
self.size = len(self.df)
self.epochs = 0
self.shuffle()
def shuffle(self):
self.df = self.df.sample(frac=1).reset_index(drop=True)
self.cursor = 0
def next_batch(self, n):
if self.cursor+n-1 > self.size:
self.epochs += 1
self.shuffle()
res = self.df.ix[self.cursor:self.cursor+n-1]
self.cursor += n
return res['as_numbers'], res['gender']*3 + res['age_bracket'], res['length']
```

```
data = SimpleDataIterator(train)
d = data.next_batch(3)
print('Input sequences\n', d[0], end='\n\n')
print('Target values\n', d[1], end='\n\n')
print('Sequence lengths\n', d[2])
```

```
Input sequences
0 [27, 3, 576, 146, 13, 204, 37, 150, 6, 804, 94...
1 [10, 210, 30, 1554, 10, 22, 325, 6240, 11, 4, ...
2 [2927, 78, 9324, 5, 2273, 4, 5937, 8, 1058, 4,...
Target values
0 4
1 4
2 1
Sequence lengths
0 13
1 18
2 22
```

We are immediately faced with a problem, in that our 3 sequences are of different lengths: we cannot feed them into a Tensorflow graph as is, unless we create a different tensor for each (inefficient, and hard!). To solve this, we **pad shorter sequences so that all sequences are the same length**. Then all sequences will fit into a single tensor.

```
class PaddedDataIterator(SimpleDataIterator):
def next_batch(self, n):
if self.cursor+n > self.size:
self.epochs += 1
self.shuffle()
res = self.df.ix[self.cursor:self.cursor+n-1]
self.cursor += n
# Pad sequences with 0s so they are all the same length
maxlen = max(res['length'])
x = np.zeros([n, maxlen], dtype=np.int32)
for i, x_i in enumerate(x):
x_i[:res['length'].values[i]] = res['as_numbers'].values[i]
return x, res['gender']*3 + res['age_bracket'], res['length']
```

```
data = PaddedDataIterator(train)
d = data.next_batch(3)
print('Input sequences\n', d[0], end='\n\n')
```

```
Input sequences
[[ 34 90 5 470 16 19 16 7 159 2 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[ 82 1 109 7 377 8 421 8 0 33 124 3 69 180
17 90 5 133 16 19 33 34 12 3819 85 164 129 25]
[1786 5570 1 13 7817 235 60 6168 19 2 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
```

Our padded iterator now returns a single input matrix of dimension [batch_size, max_sequence_length], where shorter sequences have been padded with zeros.

#### A note on PAD symbols

For this model, the zero that we used to pad our input sequences with is the index of the “< UNK >” symbol (representing UNKnown words) in our vocabulary. In this case, what we pad with doesn’t affect the outcome, and so I chose to keep it simple, but there are cases in which we might want to introduce a special “PAD” symbol. For example, here we will be feeding in a length tensor that holds information about our input sequence lengths, but suppose instead that we want to have Tensorflow calculate our sequence lengths; in that case, using 0 to represent both < UNK > and PAD would make it impossible for the graph to disambiguate between a sentence ending in < UNK > and a padded sentence. If we were to add a special PAD symbol, it would likely want to represent it as zero, and so it would have to displace the < UNK > symbol, which would then need to be represented by a different index.

The advantage of the approach shown here (zero-padding with no special PAD symbol) is that it generalizes better to sequences with multi-dimensional continuous input (e.g., stock price data). In such cases, it does not really make sense to have a separate PAD symbol.

## A basic model for sequence classification

We’ll now construct a **sequence classification** model using this data that assigns a single label to an entire input sequence. Later, we’ll look at how we can instead construct a sequence-to-sequence model that predicts the authors age and gender at each time step.

Our model makes use of Tensorflow’s `dynamic_rnn`

, specifying a `sequence_length`

parameter, which is fed into the model along with the data. Calling `dynamic_rnn`

with a `sequence_length`

parameter returns padded outputs: e.g., if the maximum sequence length is 10, but a specific example in the batch has a sequence length of only 4 followed by 6 zero steps for padding, the output for that time step will also have a length of only 4, with 6 additional zero steps for padding.

This introduces some added complexity, since for sequence classification we only care about the final output of each sequence. This could be solved in one line using `tf.gather_nd`

, as commented below, however, the gradient for `tf.gather_nd`

is not yet implemented as of this writing. It is expected to be implemented shortly (you can view the status on Github here). In the interim, I have adopted Danijar Hafner’s solution, which you can read more about in his post under the heading “Select the Last Relevant Output”.

```
def reset_graph():
if 'sess' in globals() and sess:
sess.close()
tf.reset_default_graph()
def build_graph(
vocab_size = len(vocab),
state_size = 64,
batch_size = 256,
num_classes = 6):
reset_graph()
# Placeholders
x = tf.placeholder(tf.int32, [batch_size, None]) # [batch_size, num_steps]
seqlen = tf.placeholder(tf.int32, [batch_size])
y = tf.placeholder(tf.int32, [batch_size])
keep_prob = tf.constant(1.0)
# Embedding layer
embeddings = tf.get_variable('embedding_matrix', [vocab_size, state_size])
rnn_inputs = tf.nn.embedding_lookup(embeddings, x)
# RNN
cell = tf.nn.rnn_cell.GRUCell(state_size)
init_state = tf.get_variable('init_state', [1, state_size],
initializer=tf.constant_initializer(0.0))
init_state = tf.tile(init_state, [batch_size, 1])
rnn_outputs, final_state = tf.nn.dynamic_rnn(cell, rnn_inputs, sequence_length=seqlen,
initial_state=init_state)
# Add dropout, as the model otherwise quickly overfits
rnn_outputs = tf.nn.dropout(rnn_outputs, keep_prob)
"""
Obtain the last relevant output. The best approach in the future will be to use:
last_rnn_output = tf.gather_nd(rnn_outputs, tf.pack([tf.range(batch_size), seqlen-1], axis=1))
which is the Tensorflow equivalent of numpy's rnn_outputs[range(30), seqlen-1, :], but the
gradient for this op has not been implemented as of this writing.
The below solution works, but throws a UserWarning re: the gradient.
"""
idx = tf.range(batch_size)*tf.shape(rnn_outputs)[1] + (seqlen - 1)
last_rnn_output = tf.gather(tf.reshape(rnn_outputs, [-1, state_size]), idx)
# Softmax layer
with tf.variable_scope('softmax'):
W = tf.get_variable('W', [state_size, num_classes])
b = tf.get_variable('b', [num_classes], initializer=tf.constant_initializer(0.0))
logits = tf.matmul(last_rnn_output, W) + b
preds = tf.nn.softmax(logits)
correct = tf.equal(tf.cast(tf.argmax(preds,1),tf.int32), y)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits, y))
train_step = tf.train.AdamOptimizer(1e-4).minimize(loss)
return {
'x': x,
'seqlen': seqlen,
'y': y,
'dropout': keep_prob,
'loss': loss,
'ts': train_step,
'preds': preds,
'accuracy': accuracy
}
def train_graph(graph, batch_size = 256, num_epochs = 10, iterator = PaddedDataIterator):
with tf.Session() as sess:
sess.run(tf.initialize_all_variables())
tr = iterator(train)
te = iterator(test)
step, accuracy = 0, 0
tr_losses, te_losses = [], []
current_epoch = 0
while current_epoch < num_epochs:
step += 1
batch = tr.next_batch(batch_size)
feed = {g['x']: batch[0], g['y']: batch[1], g['seqlen']: batch[2], g['dropout']: 0.6}
accuracy_, _ = sess.run([g['accuracy'], g['ts']], feed_dict=feed)
accuracy += accuracy_
if tr.epochs > current_epoch:
current_epoch += 1
tr_losses.append(accuracy / step)
step, accuracy = 0, 0
#eval test set
te_epoch = te.epochs
while te.epochs == te_epoch:
step += 1
batch = te.next_batch(batch_size)
feed = {g['x']: batch[0], g['y']: batch[1], g['seqlen']: batch[2]}
accuracy_ = sess.run([g['accuracy']], feed_dict=feed)[0]
accuracy += accuracy_
te_losses.append(accuracy / step)
step, accuracy = 0,0
print("Accuracy after epoch", current_epoch, " - tr:", tr_losses[-1], "- te:", te_losses[-1])
return tr_losses, te_losses
```

```
g = build_graph()
tr_losses, te_losses = train_graph(g)
```

```
Accuracy after epoch 1 - tr: 0.319347791963 - te: 0.351068906904
Accuracy after epoch 2 - tr: 0.355731238225 - te: 0.357366258375
Accuracy after epoch 3 - tr: 0.361505161451 - te: 0.358625811348
Accuracy after epoch 4 - tr: 0.363629598859 - te: 0.359358642169
Accuracy after epoch 5 - tr: 0.365078599278 - te: 0.358609453518
Accuracy after epoch 6 - tr: 0.365907767689 - te: 0.359358642169
Accuracy after epoch 7 - tr: 0.367192406322 - te: 0.359833019263
Accuracy after epoch 8 - tr: 0.368336397059 - te: 0.360304124791
Accuracy after epoch 9 - tr: 0.369028188455 - te: 0.360434987437
Accuracy after epoch 10 - tr: 0.37021715381 - te: 0.36041535804
```

After 10 epochs, our network has an accuracy of about 36%, about twice as good as chance–not bad for predicting age and gender from a single sentence!

## Improving training speed using bucketing

For the network above, we used a batch_size of 256. But each example in the batch had a different length ranging from 5 to 30. As the maximum length for each batch is usually very close to 30, short sequences required a lot of padding (e.g., all sequences of length 5 in the batch are padded with up to 25 zeros). Given this dataset, each batch is padded with an average of over 3000 zeros, or over 10 padding symbols per sample:

```
tr = PaddedDataIterator(train)
padding = 0
for i in range(100):
lengths = tr.next_batch(256)[2].values
max_len = max(lengths)
padding += np.sum(max_len - lengths)
print("Average padding / batch:", padding/100)
```

`Average padding / batch: 3279.9`

This leads to a lot of excess computation, and we can improve upon it by “bucketing” our training samples. If we select our batches such that the lengths of the samples in each batch are within, say, 5 of each other, then the amount of padding in a batch of 256 is bounded by 256 * 5 = 1280. This would make our worst case outcome more than twice as good as previous average case outcome.

To take advantage of bucketing, we simply modify our DataIterator. There are many ways one might implement this, but the key point to keep in mind is that we should not “bias” the order in which different sequence lengths are sampled any more than necessary to achieve bucketing. E.g., sorting our data by sequence length might seem like a good solution, but then each epoch would be trained on short sequences before longer sequences, which could harm results. Here is one solution, which uses a predetermined batch_size:

```
class BucketedDataIterator():
def __init__(self, df, num_buckets = 5):
df = df.sort_values('length').reset_index(drop=True)
self.size = len(df) / num_buckets
self.dfs = []
for bucket in range(num_buckets):
self.dfs.append(df.ix[bucket*self.size: (bucket+1)*self.size - 1])
self.num_buckets = num_buckets
# cursor[i] will be the cursor for the ith bucket
self.cursor = np.array([0] * num_buckets)
self.shuffle()
self.epochs = 0
def shuffle(self):
#sorts dataframe by sequence length, but keeps it random within the same length
for i in range(self.num_buckets):
self.dfs[i] = self.dfs[i].sample(frac=1).reset_index(drop=True)
self.cursor[i] = 0
def next_batch(self, n):
if np.any(self.cursor+n+1 > self.size):
self.epochs += 1
self.shuffle()
i = np.random.randint(0,self.num_buckets)
res = self.dfs[i].ix[self.cursor[i]:self.cursor[i]+n-1]
self.cursor[i] += n
# Pad sequences with 0s so they are all the same length
maxlen = max(res['length'])
x = np.zeros([n, maxlen], dtype=np.int32)
for i, x_i in enumerate(x):
x_i[:res['length'].values[i]] = res['as_numbers'].values[i]
return x, res['gender']*3 + res['age_bracket'], res['length']
```

With this modified iterator, we improve the average padding / batch by a factor of about 6:

```
tr = BucketedDataIterator(train, 5)
padding = 0
for i in range(100):
lengths = tr.next_batch(256)[2].values
max_len = max(lengths)
padding += np.sum(max_len - lengths)
print("Average padding / batch:", padding/100)
```

`Average padding / batch: 573.49`

We can also compare the difference in training speed, and observe that this bucketing strategy speeds up training by about 30%:

```
from time import time
g = build_graph()
t = time()
tr_losses, te_losses = train_graph(g, num_epochs=1, iterator=PaddedDataIterator)
print("Total time for 1 epoch with PaddedDataIterator:", time() - t)
```

```
Accuracy after epoch 1 - tr: 0.310819936427 - te: 0.341703713389
Total time for 1 epoch with PaddedDataIterator: 100.90330529212952
```

```
g = build_graph()
t = time()
tr_losses, te_losses = train_graph(g, num_epochs=1, iterator=BucketedDataIterator)
print("Total time for 1 epoch with BucketedDataIterator:", time() - t)
```

```
Accuracy after epoch 1 - tr: 0.31359524197 - te: 0.349176362254
Total time for 1 epoch with BucketedDataIterator: 71.45360088348389
```

Note how easy it was to move to a bucketed model–all we had to do was change our data generator. This was made possible by the use of a partially-known shape for our input placeholder, with the num_steps dimension unknown. Contrast this to the more complicated approach in Tensorflow’s seq2seq tutorial, which builds a different graph for each of four buckets.

#### A note on awkward sequence lengths

Suppose we had a dataset with awkward sequence lengths that made even a bucketed approach inefficient. For example, we might have lots of very short sequences of lengths 1, 2 and 3. Alternatively, we might have a few very long sequences among our shorter ones; we want to propagate the internal state forward through time for the long sequences, but don’t have enough of them to train efficiently in parallel. One solution in both of these scenarios is to combine short sequences into longer ones, but have the internal state of the RNN reset in between each such sequence. I believe this is not possible to do with Tensorflow’s default RNN functions (e.g., `dynamic_rnn`

), so if you’re looking for a way to do this, I would look into writing a custom RNN method using `tf.scan`

. I show how to use `tf.scan`

to build a custom RNN in my post, Recurrent Neural Networks in Tensorflow II. With the right accumulator function, you could program in the state resets dynamically based on either a special PAD symbol, or an auxiliary input sequence that indicates where the state should be reset.

## A basic model for sequence to sequence learning

Finally, we extend our sequence classification model to do **sequence-to-sequence learning**. We’ll use the same dataset, but instead of having our model guess the author’s age bracket and gender at the end of the sequence (i.e., only once), we’ll have it guess at every timestep.

The added wrinkle when moving to a sequence-to-sequence model is that we need to make sure that time-steps with a PAD symbol do not contribute to our loss, since they are just there as filler. We do so by zeroing the loss at these time steps, which is known as applying a “mask” or “masking” the loss. This is achieved by pointwise multiplying the loss tensor (with each entry representing a time step), by a tensor of 1s and 0s, where 1s represent valid steps and 0s represent PAD steps. A similar modification is made to the “accuracy” calculation below, as noted in the comments.

```
def build_seq2seq_graph(
vocab_size = len(vocab),
state_size = 64,
batch_size = 256,
num_classes = 6):
reset_graph()
# Placeholders
x = tf.placeholder(tf.int32, [batch_size, None]) # [batch_size, num_steps]
seqlen = tf.placeholder(tf.int32, [batch_size])
y = tf.placeholder(tf.int32, [batch_size])
keep_prob = tf.constant(1.0)
# Tile the target indices
# (in a regular seq2seq model, our targets placeholder might have this shape)
y_ = tf.tile(tf.expand_dims(y, 1), [1, tf.shape(x)[1]]) # [batch_size, num_steps]
"""
Create a mask that we will use for the cost function
This mask is the same shape as x and y_, and is equal to 1 for all non-PAD time
steps (where a prediction is made), and 0 for all PAD time steps (no pred -> no loss)
The number 30, used when creating the lower_triangle_ones matrix, is the maximum
sequence length in our dataset
"""
lower_triangular_ones = tf.constant(np.tril(np.ones([30,30])),dtype=tf.float32)
seqlen_mask = tf.slice(tf.gather(lower_triangular_ones, seqlen - 1),\
[0, 0], [batch_size, tf.reduce_max(seqlen)])
# Embedding layer
embeddings = tf.get_variable('embedding_matrix', [vocab_size, state_size])
rnn_inputs = tf.nn.embedding_lookup(embeddings, x)
# RNN
cell = tf.nn.rnn_cell.GRUCell(state_size)
init_state = tf.get_variable('init_state', [1, state_size],
initializer=tf.constant_initializer(0.0))
init_state = tf.tile(init_state, [batch_size, 1])
rnn_outputs, final_state = tf.nn.dynamic_rnn(cell, rnn_inputs, sequence_length=seqlen,
initial_state=init_state)
# Add dropout, as the model otherwise quickly overfits
rnn_outputs = tf.nn.dropout(rnn_outputs, keep_prob)
#reshape rnn_outputs and y
rnn_outputs = tf.reshape(rnn_outputs, [-1, state_size])
y_reshaped = tf.reshape(y_, [-1])
# Softmax layer
with tf.variable_scope('softmax'):
W = tf.get_variable('W', [state_size, num_classes])
b = tf.get_variable('b', [num_classes], initializer=tf.constant_initializer(0.0))
logits = tf.matmul(rnn_outputs, W) + b
preds = tf.nn.softmax(logits)
# To calculate the number correct, we want to count padded steps as incorrect
correct = tf.cast(tf.equal(tf.cast(tf.argmax(preds,1),tf.int32), y_reshaped),tf.int32) *\
tf.cast(tf.reshape(seqlen_mask, [-1]),tf.int32)
# To calculate accuracy we want to divide by the number of non-padded time-steps,
# rather than taking the mean
accuracy = tf.reduce_sum(tf.cast(correct, tf.float32)) / tf.reduce_sum(tf.cast(seqlen, tf.float32))
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits, y_reshaped)
loss = loss * tf.reshape(seqlen_mask, [-1])
# To calculate average loss, we need to divide by number of non-padded time-steps,
# rather than taking the mean
loss = tf.reduce_sum(loss) / tf.reduce_sum(seqlen_mask)
train_step = tf.train.AdamOptimizer(1e-4).minimize(loss)
return {
'x': x,
'seqlen': seqlen,
'y': y,
'dropout': keep_prob,
'loss': loss,
'ts': train_step,
'preds': preds,
'accuracy': accuracy
}
```

```
g = build_seq2seq_graph()
tr_losses, te_losses = train_graph(g, iterator=BucketedDataIterator)
```

```
Accuracy after epoch 1 - tr: 0.292434578401 - te: 0.316306242085
Accuracy after epoch 2 - tr: 0.320437548276 - te: 0.322733865921
Accuracy after epoch 3 - tr: 0.325227848205 - te: 0.322927395211
Accuracy after epoch 4 - tr: 0.327002136049 - te: 0.324078651696
Accuracy after epoch 5 - tr: 0.327847927489 - te: 0.324469006651
Accuracy after epoch 6 - tr: 0.328276157813 - te: 0.324198486081
Accuracy after epoch 7 - tr: 0.329078430968 - te: 0.324715245167
Accuracy after epoch 8 - tr: 0.330095707002 - te: 0.325317926384
Accuracy after epoch 9 - tr: 0.330612316872 - te: 0.32550007953
Accuracy after epoch 10 - tr: 0.331520609485 - te: 0.326069803531
```

As expected, our sequence-to-sequence model has slightly worse accuracy than our sequence classification model (because it’s early guesses are nearly random and reduce the accuracy).

## Conclusion

In this post, we learned four concepts, all related to building RNNs that work with variable length sequences. First, we learned how to pad input sequences so that we can feed in a single zero-padded input tensor. Second, we learned how to get the last relevant output in a sequence classification model. Third, we learned how to use bucketing to get a significantly boost in training time. Finally, we learned how to “mask” our loss function so that we can train sequence-to-sequence models with variable length sequences.