09 Apr 2015

Final Project Update

My final project takes 10,000 Dilbert comic strips and slices each of them into individual panels. It then performs optical character recognition (OCR) on each panel to extract the dialogue and associates that dialogue with the panel. Natural language processing on the dialogue can then determine its subject and context, so that a new comic strip can be generated from panels taken from different strips.

I had previously scraped the text from all of the comic strips published to date. The text is not associated with individual panels; each transcript is just a set of lines that applies to the strip as a whole.

So far, I’ve

Cleaned up the original transcripts, which are inconsistent in how the dialogue is captured. Many of them contain additional text that is not part of the dialogue, so I’ve had to write some code to separate out only the relevant dialogue.

Developed code that looks for the borders of each of the three panels of a strip so that the panels can be cleanly cropped.

Written code to perform OCR on the individual panels. Because the placement of text varies from strip to strip, the OCR is not perfect, so I’m using the Levenshtein distance to compare the OCR’ed text with the transcript for a particular strip and then deduce which lines of the transcript belong to each panel.
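
The matching step can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: it assumes one OCR string per panel and a list of cleaned transcript lines per strip, and the names `similarity` and `assign_lines_to_panels` are hypothetical. Each transcript line goes to the panel whose OCR text is closest by normalized Levenshtein distance.

```python
def levenshtein(a, b):
    """Edit distance between two strings (row-by-row dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize(text):
    """Lowercase and collapse whitespace so layout differences matter less."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Score in [0, 1]; 1.0 means identical after normalization."""
    a, b = normalize(a), normalize(b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def assign_lines_to_panels(transcript_lines, panel_ocr_texts):
    """Map each panel index to the transcript lines that best match its OCR text."""
    assignments = {i: [] for i in range(len(panel_ocr_texts))}
    for line in transcript_lines:
        best = max(range(len(panel_ocr_texts)),
                   key=lambda i: similarity(line, panel_ocr_texts[i]))
        assignments[best].append(line)
    return assignments


# Made-up example data: clean transcript lines vs. noisy OCR output.
transcript = ["I need a status report.", "It is going great.", "That is not a status."]
ocr = ["l need a statvs rep0rt", "It is goinq great", "That is n0t a status"]
print(assign_lines_to_panels(transcript, ocr))
```

One limitation of this sketch: when a panel holds several lines of dialogue, comparing a single transcript line against the whole panel's OCR text lowers the score, so comparing against substrings or windows of the OCR text may work better.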

What’s left

I need to refine the code that compares the OCR’ed text with the original transcript. There are still many cases where the OCR’ed text does not match up with the transcript.

I need to write code to look through the panel-specific dialogue and determine the dialogue context.

I need to then, based on the dialogue content of a particular panel, develop code to select related panels from different strips.

I would then need to create a web page that allows the user to create new panels based on specific criteria.
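
For the panel-selection step, one simple starting point might be keyword overlap. The sketch below is only an assumption about how this could work, not a settled design: it scores panels by Jaccard similarity between keyword sets. The stopword list, the panel IDs, and the function names are all hypothetical.

```python
import re

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "and", "was", "are", "for", "you", "that", "this", "with", "have", "not"}


def keywords(text):
    """Lowercased words longer than two characters, minus stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return {w for w in words if len(w) > 2 and w not in STOPWORDS}


def jaccard(a, b):
    """Size of the intersection over size of the union of two sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def related_panels(query_text, panels, top_n=3):
    """Return IDs of the panels whose dialogue shares the most keywords with the query.

    `panels` is a list of (panel_id, dialogue) pairs.
    """
    query = keywords(query_text)
    scored = [(jaccard(query, keywords(dialogue)), panel_id)
              for panel_id, dialogue in panels]
    scored = [(score, panel_id) for score, panel_id in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, ties by ID
    return [panel_id for _, panel_id in scored[:top_n]]


# Made-up panels for illustration.
panels = [
    ("a-1", "The budget meeting is cancelled"),
    ("b-2", "My budget was cut again"),
    ("c-3", "I love coffee"),
]
print(related_panels("Where is the budget report?", panels))
```

Keyword overlap ignores word order and synonyms, so it would serve only as a baseline to compare against a proper natural language processing approach.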