As described in the project overview, I am looking to analyze the text of over 10,000 Dilbert comic strips and then create a mashup tool that generates strips with new content.
My thoughts on next steps would include:
Using a Python module for image manipulation, I would develop code that goes through each of the 10,000 strips and splits it into panels. The weekday strips follow a standard three-panel format, so they can be cropped into thirds.
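The cropping step can be sketched as follows. This is a minimal sketch that assumes the three panels are of equal width and span the full height of the image; the function name `panel_boxes` is made up for illustration, and real strips with borders or gutters would likely need extra trimming.

```python
def panel_boxes(width, height, panels=3):
    """Return one (left, upper, right, lower) box per panel, left to right.

    Assumes equal-width panels that span the full image height.
    """
    boxes = []
    for i in range(panels):
        # Round the boundaries so adjacent panels share an edge exactly.
        left = round(i * width / panels)
        right = round((i + 1) * width / panels)
        boxes.append((left, 0, right, height))
    return boxes


if __name__ == "__main__":
    for box in panel_boxes(900, 280):
        print(box)
```

Each box is a 4-tuple in the (left, upper, right, lower) order that Pillow's `Image.crop` accepts, so if Pillow turns out to be the image module used, a panel could be cut out with something like `Image.open(path).crop(box)`.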
A module for optical character recognition (OCR) can then be run on each panel of each strip. The previously scraped dialogue, which would serve as the ground truth for the OCR output, does not specify which line belongs to which panel, so a Levenshtein distance algorithm from a Python module can be used to match each ground-truth line to the panel whose OCR output it most closely resembles.
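The matching idea can be sketched in plain Python. This is only an illustration under stated assumptions: the distance function is written out by hand rather than taken from a particular module, the helper names (`levenshtein`, `normalize`, `assign_lines_to_panels`) are made up, and the sample lines are invented stand-ins, not dialogue from the strip. Each ground-truth line is simply assigned to the panel with the smallest edit distance, which will be less reliable when one panel contains several lines of dialogue.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize(text):
    """Lowercase and collapse whitespace so layout differences do not count."""
    return " ".join(text.lower().split())


def assign_lines_to_panels(truth_lines, panel_ocr):
    """For each ground-truth line, return the index of the closest panel."""
    assignments = []
    for line in truth_lines:
        distances = [levenshtein(normalize(line), normalize(text))
                     for text in panel_ocr]
        assignments.append(distances.index(min(distances)))
    return assignments


if __name__ == "__main__":
    # Invented sample data with typical OCR confusions (I/l, o/0, l/1).
    truth = ["I need a status report.", "It is going badly.", "Great work."]
    ocr = ["l need a status rep0rt", "lt is going bad1y.", "Great w0rk"]
    print(assign_lines_to_panels(truth, ocr))
```

A dedicated Levenshtein module would be faster on 10,000 strips; the hand-written version is here only to make the logic visible.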
To perform textual analysis of the now-recognized text, the Java-based package MAchine Learning for LanguagE Toolkit (MALLET) can be used to perform topic modeling. This process would examine clusters of words that occur often together in each strip’s dialogue, and then, using contextual clues, connect words with similar meanings to build a topic model.
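Before MALLET can model anything, the dialogue has to be in a form it can import. As far as I understand it, MALLET's `import-file` command reads one document per line, by default as a name, a label, and then the text. The sketch below prepares lines in that shape; the function name `to_mallet_lines`, the label value, and the strip IDs are all hypothetical, and treating each strip as one document is an assumption.

```python
def to_mallet_lines(strips, label="dilbert"):
    """Format {strip_id: dialogue} as 'name<TAB>label<TAB>text' lines.

    Each strip becomes one document. Whitespace inside the dialogue is
    collapsed so that every document stays on a single line.
    """
    lines = []
    for strip_id, dialogue in strips.items():
        text = " ".join(dialogue.split())
        lines.append(f"{strip_id}\t{label}\t{text}")
    return lines


if __name__ == "__main__":
    # Hypothetical IDs and placeholder text, not real strip dialogue.
    sample = {
        "strip-0001": "First line of dialogue.\nSecond line of dialogue.",
        "strip-0002": "Another strip's dialogue.",
    }
    print("\n".join(to_mallet_lines(sample)))
```

The resulting file would then be passed to `import-file`, and the imported data to MALLET's `train-topics` command.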
After this process, the idea is to replace the strip’s dialogue with content from another source. Using the image manipulation package, the existing dialogue would be removed and replaced with new text set in a Dilbert-like font. I’m not sure exactly what would replace it, but it would be based on the results of the earlier topic modeling so that the replacement text retains a similar meaning and context. One option is for the replacement text to be in an Old English dialect; another is to update the strip for the current decade by running topic modeling on lines from the “Silicon Valley” TV series and selecting new, related content for a given strip. I am thinking the user would be able to select a theme or keyword, and the result would be a recreated strip.
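One way the selection step might work, purely as an assumption since the approach is not settled, is to compare topic proportions: if the topic model gives each strip and each candidate line a vector of topic weights, the candidate whose vector is most similar to the strip's could be chosen. The sketch below uses cosine similarity for that comparison; the function names and the sample vectors are invented for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_u * norm_v)


def pick_replacement(strip_topics, candidates):
    """Return the candidate text whose topic vector best matches the strip.

    `candidates` maps candidate text to its topic-weight vector.
    """
    return max(candidates,
               key=lambda text: cosine_similarity(strip_topics, candidates[text]))


if __name__ == "__main__":
    # Invented topic weights over three topics.
    strip = [0.7, 0.2, 0.1]
    options = {
        "candidate line A": [0.1, 0.8, 0.1],
        "candidate line B": [0.6, 0.3, 0.1],
    }
    print(pick_replacement(strip, options))
```

A user-selected theme or keyword could be handled the same way by filtering the candidates before the comparison.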