Building a ChatBot, Pt 1: My Kingdom for Some Data
This is one of a series of posts detailing the development of SpeakEasy AI, a chatbot built from a conversational neural model trained on Reddit comments.
An immediate challenge in any machine learning project is finding and deciding on an appropriate dataset. An extremely large and complex dataset with some noise will allow you to build a model that can recognize complicated patterns, but this "ideal" dataset almost never exists. Even if it did, increasing the volume of training data inevitably increases training time, and in most circumstances it isn't feasible to spend months and months training a model. So basically I started this project as Goldilocks: I needed conversational data, lots of it but not too much, and I needed it to be high quality but not too perfect.
Movie and TV Show Subtitles
The original plan was to use English subtitles from 11,000+ movies and television shows, split by sentence and separated into prompt/response pairs. One obvious issue with this approach was that splitting by sentence did not ensure the data was being parsed into a conversation between two speakers. Many (if not most) lines in movie and television scripts consist of more than one sentence, and often the "response" was a continuation of the prompt rather than a reaction to it. Another issue was that the quality of the raw data was very poor: subtitles often contained misspellings, missing spaces, or descriptions of non-verbal interactions. There were also a few aspects of the dataset that were just downright weird. For example, in a large number of subtitles, a capitalized "I" was replaced by a lowercase "l". All of these factors made the data extremely hard to parse correctly, and while some noise is acceptable (and even desirable) in machine learning data, this was just too much.
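The post doesn't show the parsing code, but the kinds of artifacts described above are easy to picture fixing. A minimal cleanup sketch (the regexes and function name are illustrative assumptions, not the actual pipeline):

```python
import re

def normalize_subtitle_line(line: str) -> str:
    """Fix common artifacts seen in subtitle files (illustrative only)."""
    # Drop bracketed non-verbal descriptions like "[door slams]" or "(sighs)"
    line = re.sub(r"[\[\(][^\]\)]*[\]\)]", "", line)
    # A standalone lowercase "l" is almost always a mangled capital "I",
    # including in contractions like "l'm" (apostrophe is a word boundary)
    line = re.sub(r"\bl\b", "I", line)
    # Collapse whitespace left behind by the removals
    return re.sub(r"\s+", " ", line).strip()

print(normalize_subtitle_line("[door slams] l think l'm late"))  # → "I think I'm late"
```

Note that `\bl\b` leaves an "l" inside a word untouched, so "lovely plan" survives intact; only isolated "l"s get rewritten.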
Reddit Comments
A friend who is an avid Reddit user suggested that I look into Reddit comments as a potential data source. I wrote a program using Casper.js to scrape comment pages, only to discover that someone on the internet had already compiled every Reddit comment ever posted into a single torrent and made it available here. The files add up to over 1TB once everything is unzipped, which is a mind-boggling amount of data. Beyond its sheer size, using the entire dataset was not possible because Reddit appears to have changed its database schema several times over the years, so not all the comment objects are in the same format.
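The dump stores comments as JSON objects, one per line, which makes it natural to stream rather than load at once. A minimal sketch of a tolerant streaming parser, assuming field names like `body`, `parent_id`, `score`, and `name` (these are assumptions about the dump's schema, which is exactly what varied over the years):

```python
import json

# Field names are assumptions based on the public Reddit comment dumps;
# since the schema changed over time, records missing a needed key are
# skipped rather than treated as errors.
REQUIRED = ("body", "parent_id", "score", "name")

def iter_comments(lines):
    """Yield well-formed comment dicts from an iterable of JSON lines."""
    for line in lines:
        try:
            comment = json.loads(line)
        except json.JSONDecodeError:
            continue  # corrupt or truncated line
        if not all(key in comment for key in REQUIRED):
            continue  # record from a schema revision missing a needed field
        if comment["body"] in ("[deleted]", "[removed]"):
            continue  # no usable text
        yield comment
```

Passing an open file object directly works, since files iterate line by line — so a 32GB month never has to fit in memory.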
It sounds like I'm complaining, but this was actually fantastic: having such a gigantic amount of data to start with meant I could be extremely picky about which comments made it into my final dataset. I started with one month of comments (32GB) and heavily filtered it to correct grammar and exclude long rants and comments containing uncommon words (I decided to limit my vocabulary to the 25,000 most common words in the one-month dataset). Because Reddit allows users to reply directly to another comment, each comment can be matched with its parent to form a prompt/response pair for training. If a comment has multiple children, the highest-voted one is used. Ultimately, I ended up with 1GB of parsed, normalized, and (most importantly) clean data from the original 32GB.
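The actual pipeline isn't shown in the post, but the pairing and vocabulary logic described above can be sketched roughly as follows. The field names (`name` as the comment's ID, `parent_id`, `body`, `score`) are assumptions about the dump's schema, and a real version would add the grammar and length filters described above:

```python
from collections import Counter

def build_pairs(comments, vocab_size=25000):
    """Pair each comment with its highest-voted direct reply, keeping
    only pairs made entirely of in-vocabulary words (a sketch, not the
    author's actual pipeline)."""
    comments = list(comments)

    # Index comments by ID, and count word frequencies to pick the
    # vocab_size most common words as the vocabulary
    by_name = {c["name"]: c for c in comments}
    freq = Counter(word for c in comments for word in c["body"].lower().split())
    vocab = {word for word, _ in freq.most_common(vocab_size)}

    # For each parent comment, keep only its highest-voted reply
    best_child = {}
    for c in comments:
        parent = c["parent_id"]
        if parent in by_name and (
            parent not in best_child or c["score"] > best_child[parent]["score"]
        ):
            best_child[parent] = c

    # Emit (prompt, response) pairs, dropping any containing a word
    # outside the vocabulary
    pairs = []
    for parent_name, child in best_child.items():
        prompt, response = by_name[parent_name]["body"], child["body"]
        if all(w in vocab for w in f"{prompt} {response}".lower().split()):
            pairs.append((prompt, response))
    return pairs
```

Comments whose parent is a post rather than another comment simply never match an entry in `by_name`, so top-level comments only ever appear as prompts, never as responses.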