I am using transcript, audio, and video data from American Rhetoric, a database of famous speeches, and aligning the video/audio to the textual transcripts. I am still exploring what to do with the results of forced alignment: I would have the exact time at which each word is spoken in the audio track or video. Below are several expressive and analytical ideas I am considering.
- Create an interactive application where someone can type in a message, and my program would generate a video of that message being spoken, stitched together from clips of many different videos. In short, I would map user-input text to playback times in the audio, and from there to the corresponding visuals in the video clips.
- Create a video of famous public figures reciting ridiculous pop songs, e.g. Hotline Bling, just for fun.
- Find the most common words across speeches, then trim and string together clips of public figures saying these words (or the sentences that contain them).
- Run keyword searches for terms I already know are abstract and frequent in famous speeches, e.g. "evil", "hope", "justice", "democracy", and string public figures saying these words (or the sentences containing them) into a new audio track/video.
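The text-to-stitched-video pipeline behind the first idea could be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the alignment data, filenames, and timestamps are made-up placeholders, and it assumes forced alignment has already produced (word, start, end) tuples for each source video. It builds a word-to-clip index, maps a user message to clips, and emits `ffmpeg` trim commands that could later be concatenated.

```python
# Sketch of mapping user-input text -> aligned clips -> ffmpeg trim commands.
# ALIGNMENTS is hypothetical forced-alignment output; real data would come
# from an aligner run over the American Rhetoric audio and transcripts.
from collections import defaultdict

# Hypothetical alignment output: video file -> list of (word, start_s, end_s).
ALIGNMENTS = {
    "obama_2004_dnc.mp4": [("hope", 12.4, 12.9), ("justice", 40.1, 40.8)],
    "mlk_i_have_a_dream.mp4": [("justice", 95.2, 95.9), ("dream", 110.0, 110.6)],
}

def build_index(alignments):
    """Invert the alignments into word -> list of (video, start, end)."""
    index = defaultdict(list)
    for video, words in alignments.items():
        for word, start, end in words:
            index[word.lower()].append((video, start, end))
    return index

def clips_for_message(message, index):
    """Map each word of a user message to one aligned clip, skipping misses."""
    clips = []
    for word in message.lower().split():
        if index.get(word):
            clips.append(index[word][0])  # naive: take the first occurrence
    return clips

def ffmpeg_commands(clips):
    """Emit ffmpeg commands that trim each clip for a later concat step."""
    return [
        f"ffmpeg -ss {start} -to {end} -i {video} -c copy clip_{i:03d}.mp4"
        for i, (video, start, end) in enumerate(clips)
    ]

index = build_index(ALIGNMENTS)
clips = clips_for_message("hope and justice", index)
for cmd in ffmpeg_commands(clips):
    print(cmd)
```

The same index would serve the keyword-search ideas directly: a lookup like `index["justice"]` returns every aligned occurrence across the corpus, ready to be trimmed and strung together.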