Writing “Hello World” with a Neural Network

If you are trying to write an application that generates text, you are probably going to implement a Recurrent Neural Network (RNN). Many “Hello World” neural network tutorials do things like generate Shakespeare stanzas or poetry. That is impressive and unreasonably effective, but it is a lot to take in on a first pass at TensorFlow, while you are still building an intuition for what is happening under the hood and why.

Instead, let’s simply generate the text “Hello World” using an RNN. This is my attempt to build a simple and understandable RNN while simultaneously choosing the most challenging possible way to generate the string “Hello World”.

This article is heavily based on a simplified version of the Tensorflow tutorial found here.

Final Code

The final code is about 120 lines with comments. Don’t copy and paste as you read, though. Grab the final code found here, then read through this tutorial to better understand what it is doing.

Getting Started

This tutorial assumes you have Python 3 installed and you know how to write Python code. If you don’t know how to write any code, you should get comfortable with standard programming before embarking on a machine learning application.

Import your dependencies

import tensorflow as tf
tf.enable_eager_execution()
import numpy as np
import pandas as pd
import os
import time
import functools

Training data

We need to train the neural network on a dataset. It needs to learn to spell and ultimately write a sequence of two words. Usually we would import an enormous dataset here. Instead, let’s just repeat the string 100 times to create a dataset in memory.

text = ' '.join(['Hello World' for n in range(100)])
vocab = sorted(set(text))

You can see we are creating a unique set of characters here. For instance, “l” will only appear in our vocab variable once, because vocab is the set of characters the network will spell from.

The RNN ultimately needs to work with numbers, so we need to vectorize our vocabulary. This is simply a pair of lookups: one from characters to indexes and one from indexes back to characters.

char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)
text_as_int = np.array([char2idx[c] for c in text])
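
If you are curious, you can print these out. With “Hello World” repeated, the vocabulary ends up being just 8 characters, and you should see something like:

print(vocab)
# [' ', 'H', 'W', 'd', 'e', 'l', 'o', 'r']
print(char2idx)
# {' ': 0, 'H': 1, 'W': 2, 'd': 3, 'e': 4, 'l': 5, 'o': 6, 'r': 7}
print(text_as_int[:11])
# [1 4 5 5 6 0 2 6 7 5 3]  <- the indexes for "Hello World"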

Preparing for Prediction

Next we prepare the data for training. Batching is used twice below: first to group the individual characters into sequences of the desired length, and later to group those sequences into training batches. The buffer size keeps TensorFlow from having to shuffle the whole dataset in memory at once.

We now need to split our text into sequences. In another dataset this might be a line of poetry or a single song lyric; in our case it is 11 characters, the length of “Hello World”. We are trying to get the RNN to predict the next most probable character, so each target sequence has the same length as its input but is shifted over by one character, e.g. input “Hello World” and target “ello World ” (the trailing space comes from joining the repeated strings).

Note: BATCH_SIZE is 1 in the code below because we have very little data. With so few epochs, increasing the batch size produces garbage output, which is unintuitive. Generally, you should use a smaller batch size when you can tolerate the additional training time. Additional reading can be found here.

# The maximum length sentence we want for a single input in characters
seq_length = 11
examples_per_epoch = len(text)//seq_length
# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text
dataset = sequences.map(split_input_target)

BATCH_SIZE = 1
steps_per_epoch = examples_per_epoch // BATCH_SIZE
BUFFER_SIZE = 10000
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
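
To sanity-check the pipeline, you can decode one example back into text. This quick inspection relies on the eager execution we enabled at the top, and which chunk you get depends on the shuffle, though with this dataset every chunk looks the same:

for input_example, target_example in dataset.take(1):
    print('Input: ', repr(''.join(idx2char[input_example.numpy()[0]])))
    print('Target:', repr(''.join(idx2char[target_example.numpy()[0]])))
# Input:  'Hello World'
# Target: 'ello World '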

Creating the model

Now the really fun part. We need to create the actual model which will be trained on our data. There are two very important variables that we need to explain:

The embedding dimension is essentially a mapping of the input data into a set of real-valued dimensions. In our case the inputs are single characters, so we can set it to 8; any low number will work since we have very little data.

The RNN units are the number of cells in each hidden layer. Think of this as defining how wide the neural network is rather than how deep it is. Each cell learns some feature, so a wider network can make more granular decisions. We have very little data, so we do not need many features; something like an image classifier would need many more.

vocab_size = len(vocab)
embedding_dim = 8
rnn_units = 8
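
If the embedding dimension still feels abstract, here is a throwaway sketch (the names are just for illustration and it is not part of the final script) that pushes a few character indexes through an Embedding layer and inspects the shape:

# each of the 5 input characters becomes an 8-dimensional vector
demo_embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
demo_output = demo_embedding(tf.constant([[1, 4, 5, 5, 6]]))  # indexes for "Hello"
print(demo_output.shape)
# (1, 5, 8) -> 1 batch, 5 characters, embedding_dim of 8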

The code below sets up the type of RNN cell we will use (a GRU) and tells it which recurrent activation function to use. A sigmoid function maps values into the range between 0 and 1.

rnn = functools.partial(tf.keras.layers.GRU, recurrent_activation='sigmoid')

Finally, we build our model. For now, do not worry too much about the parameters; we are passing in what we defined above and specifying some additional configuration. What is important to notice is that we are defining two hidden RNN layers, which will increase training time but hopefully also accuracy.

def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                               batch_input_shape=[batch_size, None]),
    rnn(rnn_units,
        return_sequences=True,
        recurrent_initializer='glorot_uniform',
        stateful=True),
    rnn(rnn_units,
        return_sequences=True,
        recurrent_initializer='glorot_uniform',
        stateful=True),
    tf.keras.layers.Dense(vocab_size)
  ])
  return model

model = build_model(
  vocab_size = len(vocab),
  embedding_dim=embedding_dim,
  rnn_units=rnn_units,
  batch_size=BATCH_SIZE)
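
Before training, it is worth pushing one batch through the untrained model just to confirm the shapes line up. This is only a sanity check, and the variable names are just for illustration:

for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape)
# (1, 11, 8) -> (batch_size, sequence_length, vocab_size)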

Now we need to define our loss function. This is the function that measures how wrong the model’s predictions are on each batch; the optimizer will then adjust the weights to minimize it. We use sparse categorical cross-entropy, the standard choice when predicting one class (here, one character) out of a vocabulary.

def loss(labels, logits):
   return tf.keras.backend.sparse_categorical_crossentropy(labels, logits, from_logits=True)
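
Using the example batch from the shape check above, we can also confirm that the untrained loss is roughly ln(8) ≈ 2.08, which is about what you would expect when the model is guessing blindly among 8 characters:

example_batch_loss = loss(target_example_batch, example_batch_predictions)
print(example_batch_loss.numpy().mean())
# roughly 2.08 before any training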

Now we can compile our model with our loss function. The learning rate affects how quickly we descend. Imagine trying to find the bottom of a parabola: every iteration you jump to the left or right. Big jumps let you learn fast, but you may never settle into the true bottom, while small jumps will let you find a bottom (possibly only a local optimum) but may take a long time to get there. The size of that jump is the learning rate, and it should be tuned to find the right balance.

model.compile(
     optimizer = tf.train.GradientDescentOptimizer(learning_rate=2.0),
     loss = loss)
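
If the parabola analogy helps, here is a tiny standalone illustration, completely separate from the model, of gradient descent on f(x) = x² with different step sizes:

def descend(learning_rate, steps=10, x=5.0):
    # the gradient of f(x) = x**2 is 2*x, so each step moves against it
    for _ in range(steps):
        x = x - learning_rate * 2 * x
    return x

print(descend(0.1))  # ~0.54    small jumps: still creeping toward the bottom at 0
print(descend(0.4))  # ~5e-07   well-tuned jumps: effectively at the bottom
print(descend(1.1))  # ~31      jumps too big: we overshoot and diverge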

After each epoch, the weights will be saved to a non-human-readable checkpoint file that we will load later to rebuild our model. Keeping those files in a separate folder is just good housekeeping.

checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")
checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
     filepath=checkpoint_prefix,
     save_weights_only=True)

Now let’s fit our model. This will take a while. We can scale up the number of epochs as needed to minimize loss.

EPOCHS=3
history = model.fit(dataset.repeat(), epochs=EPOCHS, steps_per_epoch=steps_per_epoch, callbacks=[checkpoint_callback])

Now that training is finished, we rebuild the model with a batch size of 1, load the weights we just saved, and build it for prediction.

model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))

Our model is done! Now we tell it to write

The function below will be very specific to each implementation, but ultimately it takes our model and a seed string as parameters, then generates 10 characters in a loop, feeding each predicted character back in as the next input.

def hello_world(model, start_string):
    # Number of characters to generate
    num_generate = 10
    # Converting our start string to numbers (vectorizing)
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)
    # Empty list to store our results
    text_generated = []
    temperature = 1.0
    # Here batch size == 1
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        # remove the batch dimension
        predictions = tf.squeeze(predictions, 0)
        # use a multinomial distribution to predict the character returned by the model
        predictions = predictions / temperature
        predicted_id = tf.multinomial(predictions, num_samples=1)[-1, 0].numpy()
        # pass the predicted character as the next input to the model
        # along with the previous hidden state
        input_eval = tf.expand_dims([predicted_id], 0)
        text_generated.append(idx2char[predicted_id])
    return start_string + ''.join(text_generated)

Now we simply call our method and we should see “Hello World” or at least something very close.

print(hello_world(model, start_string="H"))
>>> Hello World

Extra Credit

I personally think saving the model is a great idea at this point. This will let you load it later or serve it from an application, and you will not have to rebuild it from the checkpoint files each time.

model.save('saved_model.h5')
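
If you want to use it later, you can load it back without retraining. Since we compiled with a custom loss function, the simplest sketch is to skip compiling on load (loaded_model is just an illustrative name):

loaded_model = tf.keras.models.load_model('saved_model.h5', compile=False)
print(hello_world(loaded_model, start_string="H"))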

You are done! Thank you for reading and stay tuned for the next one. If you want to learn how to query this model from a web application, you can read this post.