Building a ChatBot, Pt. 3: Performance Optimization and Training

This is one of a series of posts detailing the development of SpeakEasy AI, a chatbot built from a conversational neural model trained on Reddit comments. You can go here to read about the dataset and here to read about the TensorFlow model.

Note: TensorFlow is a very young open-source project with an enthusiastic community of insanely smart developers and users who are making it better all the time. I built this ChatBot using version 0.5.0, which is already outdated as I write this in January 2016. The experience I had optimizing this model may be different from what you'd see with a later version.

Training on AWS

Using my Reddit dataset, training for a total of 7.5 epochs requires 348,158 steps. I have a MacBook Air with 4 CPU cores, 8GB of RAM, and a 2.2 GHz Intel Core i7 processor.

Here is what training the model is like on my local machine:

The relevant metric is the step-time, which represents the number of seconds it takes to train a single batch. With a step-time of 14.94 seconds, the model would take 60.2 days to train.
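
That estimate is just the total number of steps multiplied by the measured step-time; here's a quick sanity check in Python using the numbers above:

# Back-of-the-envelope: total training time = steps * seconds per step.  
TOTAL_STEPS = 348158   # 7.5 epochs over the Reddit dataset  
STEP_TIME_S = 14.94    # measured seconds per batch on my laptop  

days = TOTAL_STEPS * STEP_TIME_S / (60.0 * 60 * 24)  
print("%.1f days" % days)   # ~60.2 days  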

In addition to being unbelievably slow, my computer was basically on fire:

Training on my local machine was clearly not a good option, so I looked into training it on an EC2 instance. AWS offers tons of options for renting machines with prices ranging from next to nothing to $4/hour. I started on an m4.xlarge, which has 4 cores, 16GB of RAM and costs $0.252/hour:

Definitely better, but this would still take 39.7 days to train and cost $240.

I then tried a c4.4xlarge, which has 16 cores, 30GB of RAM and costs $0.882/hour:

This would take 16.4 days to train, which isn't totally unreasonable, but it would cost almost $350...
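
For reference, the back-of-the-envelope math behind those cost estimates is just training days times 24 times the hourly rate (a quick sketch using the figures quoted above):

# Rough cost of a full training run: days * 24 hours * hourly rate.  
def training_cost(days, rate_per_hour):  
    return days * 24 * rate_per_hour  

print("m4.xlarge:  $%.0f" % training_cost(39.7, 0.252))   # ~$240  
print("c4.4xlarge: $%.0f" % training_cost(16.4, 0.882))   # ~$347  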

GPU-Enabled Training

One of the most appealing claims about TensorFlow is that you can switch from CPU-only to GPU-accelerated training without having to change your code. Doing so requires a Linux machine with CUDA and cuDNN installed (note that you need to register as an NVIDIA developer to download cuDNN) and an NVIDIA graphics card. I trained my model on an Ubuntu g2.2xlarge instance ($0.65/hour) set up according to this gist. Installing the correct TensorFlow package is a bit of a pain because TensorFlow only supports NVIDIA cards with CUDA compute capability 3.5 or higher, while the GPUs on AWS EC2 instances only have 3.0. You have to apply a patch to support K520 devices, build your own pip package, and install it (I ended up just saving the pip package to s3 so I didn't have to compile it on multiple instances). After you have the correct TensorFlow package, you also need to set two environment variables just to get things off the ground:

export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64"  
export CUDA_HOME=/usr/local/cuda  
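
Once those are set, it's worth double-checking that TensorFlow can actually see the GPU before kicking off a long training run. Something along the lines of the standard "Using GPUs" example works: run a tiny graph in a session with log_device_placement turned on and make sure the ops get assigned to /gpu:0 rather than /cpu:0.

import tensorflow as tf  

# Tiny test graph; with log_device_placement on, the session prints which  
# device each op runs on -- the matmul should land on /gpu:0, not /cpu:0.  
a = tf.constant([[1.0, 2.0], [3.0, 4.0]], name='a')  
b = tf.constant([[1.0, 1.0], [0.0, 1.0]], name='b')  
c = tf.matmul(a, b)  

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:  
    print(sess.run(c))  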

Once I thought I had everything running, I repeatedly ran out of memory partway through training very large models. The fix was to configure the session to use BFC memory allocation (I believe this is the default setting in the 0.6.0 release):

import tensorflow as tf  

config = tf.ConfigProto()  
config.gpu_options.allocator_type = 'BFC'  # Best-Fit with Coalescing allocator  
with tf.Session(config=config) as sess:  
    pass  # build and train the model as usual  

With GPU-enabled training, my step-time looked like this:

Much better! The GPU-accelerated learning was almost 10x faster than CPU-only learning on my dinky 4-core laptop, and 2x as fast as CPU-only learning on a machine with 16 CPUs. 7.5 epochs took 7.2 days and cost just over $100. Not bad, right?

Side Note: I did get a little greedy and tried running the model on a g2.8xlarge, which costs a whopping $2.60/hour (mostly just to see what would happen, since training an entire model this way would almost certainly be cost prohibitive), but I couldn't get TensorFlow to use more than one GPU. If anyone has figured out why that seems to be the case and/or how to fix it, I'd love to know.
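
My best guess is that TensorFlow (at least in this version) won't spread a single graph across multiple GPUs on its own; you apparently have to pin pieces of the graph to each card explicitly with tf.device, along these lines (a rough sketch I never actually wired into the seq2seq model):

import tensorflow as tf  

# Hypothetical sketch: pin one chunk of work to each of the four GPUs.  
products = []  
for i in range(4):  
    with tf.device('/gpu:%d' % i):  
        x = tf.random_uniform([1000, 1000])  
        products.append(tf.matmul(x, x))  

# allow_soft_placement falls back to the CPU for ops without a GPU kernel.  
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)  
with tf.Session(config=config) as sess:  
    sess.run(products)  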

Next: Chat with SpeakEasy