For my final project I decided to make a book using the Reddit submission corpus and the Reddit comment corpus, which together represent almost all the public writing on Reddit from 2006 to 2015: well over 400GB of compressed text. I’ve been interested in storytelling, books and online forums for a while, and this seemed like a good way to combine these interests.
I considered two approaches to the problem of getting stories from Reddit: collect found stories and arrange them in a way that created a meta-narrative, or try to synthesize new stories from snippets of writing. I decided to go with the former, since I wanted to ensure that I would have coherent narratives, and because separating stories from all the other text on Reddit was a challenge in itself.
The first thing I did was filter out anything that was unlikely to be a story, which I did by removing anything that was too short or that contained words suggesting the text was a political or technical conversation (words such as ‘Obama’, ‘Clinton’, ‘Reddit’, ‘Linux’, etc.). That left me with less text, but I still had to separate the stories from several hundred gigabytes of uncompressed text files. My first attempt was to separate them by hand. This did work, but I ended up with a book (Phradmus: A book of myths) that lacked any cohesive logic.
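The filtering step described above could be sketched roughly as follows. This is only an illustration, not the code used for the project: the length threshold and the function name `looks_like_story` are hypothetical, and the blocklist contains just the four example words mentioned above rather than the full list.

```python
import re

# Hypothetical values: the actual length cutoff and full word list are not given.
MIN_LENGTH = 500
BLOCKLIST = {"obama", "clinton", "reddit", "linux"}

# Matches runs of lowercase letters and apostrophes; applied to lowercased text.
WORD_RE = re.compile(r"[a-z']+")


def looks_like_story(text, min_length=MIN_LENGTH, blocklist=BLOCKLIST):
    """Return True if the text passes the two coarse filters.

    The text is rejected if it is shorter than min_length characters
    or if it contains any blocklisted word as a whole word.
    """
    if len(text) < min_length:
        return False
    words = set(WORD_RE.findall(text.lower()))
    return words.isdisjoint(blocklist)
```

A filter like this only narrows the pile. It says nothing about whether the surviving text is actually a story, which is why a second step was still needed.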
I ended up solving the problem by running very specific searches over the body of text. Specifically, I looked for the term ‘mall santa.’ I chose this because it’s such an unusual term: anybody using it is likely telling a story involving mall santas. I ended up with a small body of text that was roughly half stories and half other material, which I then sorted by hand to get the body of Mall Santa: A book about mall santas.
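A phrase search like this one might look something like the sketch below. It assumes the corpus is stored as one JSON object per line with the text in a `body` field (for submissions the field name would likely differ); the function name `find_matches` and the exact regular expression are my own illustration, not the original search.

```python
import json
import re

# Case-insensitive match for "mall santa" or "mall santas" as whole words.
PHRASE_RE = re.compile(r"\bmall\s+santas?\b", re.IGNORECASE)


def find_matches(lines, field="body"):
    """Yield the text of every record whose `field` mentions the phrase.

    `lines` is any iterable of strings, for example an open file with
    one JSON object per line. Blank or malformed lines are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        text = record.get(field) or ""
        if isinstance(text, str) and PHRASE_RE.search(text):
            yield text
```

Because it reads one line at a time, a generator like this can be pointed at an open file without loading hundreds of gigabytes into memory. The matches still need sorting by hand, since the search finds the phrase, not the story.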