For my final project, I wanted to generate captions for images that didn’t just describe what was in the image, but put in the form of a story. A successful caption would be one which the viewer could relate to the image, and find it plausible. My inspiration for this project is a quote at the start of the movie – ‘Le Samourai’, which can be read here. I believed this quote, only to learn later it was fake, but I was interested in how the movie genre and the presentation led me to believe it. I attempted to do this earlier by captioning Japanese Woodblock prints, but I struggled to generate text that would be plausible.
I ended up using an image set from Flickr. The images look like they’re from the 2000s and for some reason have a spooky quality to them (or atleast that’s how I identified them). They seemed to lend themselves naturally to storytelling. I generated literal captions for them (that purely describe the image) using a pre-trained NeuralTalk2 model. I fed this into a Recurrent Neural Networks based text generator, to generate a paragraph of text that creates a story about the image. The text generator used results from two models – a pre-trained char-RNN model , and a Torch-RNN model that I trained myself, on horror texts from a book corpus found here (the Movie Book Corpus). Here are a couple of representative results:
Since I did not have access to high-performance machines, or a GPU, I used my own machine and tried to be intelligent with my time. I trained machines with Mystery text, Adventure novels and Romance novels, and biographies to try to see what kind of text was being generated by the Neural network that might add value to the image. I settled on horror text because a lot of them described scenes, were in first person and had the writing style that I wanted. I filtered the set of Horror texts to a few Stephen King novels and trained it with a large RNN size of about 300. I would like to thank Ross Goodwin, for his fantastic article which detailed his trial and errors.
The project could be improved by being more specific. Right now, given my introductory level to neural networks and the lack of high-performance computing at my disposal, I concentrated on getting results that had meaning. I want to make it more in the style I want. I would pick images that have particular scenes that are reminiscent of horror stories, like – ‘a house’ , ‘an empty street’. I would pull images that have those nouns and generate captions from those. I would also try to improve the presentation. I would also use a higher performant machine, to train a larger text set, with more layers and an RNN-size of about 700.
Here is a video of a slideshow with some interesting results
Update: Uploaded to Github
Some example results