Project 3–Interaction–Kinect Hand Waving

by Ben Gotow @ 10:16 am 21 February 2011

What if you could use hand gestures to control an audio visualization? Instead of relying on audio metrics like frequency and volume, you could base the visualization on the user’s interpretation of perceivable audio qualities. The end result would be a better reflection of the way that people feel about music.

To investigate this, I wrote an OpenFrameworks application that uses depth data from the Kinect to identify hands in a scene. The information about the users’ hands – position, velocity, heading, and size – is used to create an interactive visualization with long-exposure motion trails and particle effects.

There were a number of challenges in this project. I started in Processing, but it was too slow to extract hands and render the point-sprite effects I wanted. I switched to OpenFrameworks and began using OpenNI to extract a skeleton from the Kinect depth image. OpenNI worked well and produced a full skeleton with trackable wrists, but it made iteration painful: skeletal detection took nearly a minute every time the visualization was run. That got frustrating quickly, and I decided to do hand detection manually.

Detecting Hands in the Depth Image
I chose a relatively straightforward approach to finding hands in the depth image. I made three significant assumptions that made real-time detection possible:

  1. The user's body intersects the bottom of the frame.
  2. The user is the closest thing in the scene.
  3. The user's hands are extended (at least slightly) in front of their body.

Assumption 1 is important because it allows for automatic depth thresholding. By assuming that the user intersects the bottom of the frame, we can scan the bottom row of depth pixels to determine the depth of the user's body. The hand detection then ignores anything farther away than the user.

Assumptions 2 and 3 are important for the next step in the process. The application looks for local minima in the depth image and identifies the points nearest the camera. It then uses a breadth-first search algorithm to repeatedly expand each blob to neighboring points and find the boundaries of hands. Each pixel is scored based on its depth and distance from the source. Pixels that are scored as part of one hand cannot be scored as part of another, which prevents nearby points within the same hand from generating multiple blobs.
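A minimal sketch of the breadth-first expansion, assuming a simple depth-difference score against the seed pixel (the tolerance value and names are illustrative, not from the original code):

```cpp
#include <cstdint>
#include <cstdlib>
#include <queue>
#include <vector>

// Grow a hand blob outward from a seed pixel (a local depth minimum).
// Neighbors join the blob if their depth is within `tolerance` of the
// seed. `claimed` persists across hands, so a pixel scored into one
// blob can never be scored into another.
std::vector<int> growBlob(const std::vector<uint16_t>& depth,
                          int width, int height,
                          int seed, uint16_t tolerance,
                          std::vector<bool>& claimed) {
    std::vector<int> blob;
    std::queue<int> frontier;
    frontier.push(seed);
    claimed[seed] = true;
    while (!frontier.empty()) {
        int p = frontier.front(); frontier.pop();
        blob.push_back(p);
        int x = p % width, y = p / width;
        const int dx[4] = {1, -1, 0, 0}, dy[4] = {0, 0, 1, -1};
        for (int i = 0; i < 4; ++i) {
            int nx = x + dx[i], ny = y + dy[i];
            if (nx < 0 || nx >= width || ny < 0 || ny >= height) continue;
            int n = ny * width + nx;
            if (claimed[n] || depth[n] == 0) continue;
            // score by depth difference from the seed point
            if (std::abs((int)depth[n] - (int)depth[seed]) <= tolerance) {
                claimed[n] = true;
                frontier.push(n);
            }
        }
    }
    return blob;
}
```

Running this once per local minimum, with `claimed` shared between runs, yields one blob per hand rather than several overlapping ones.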

Interpreting Hands
Once pixels in the depth image have been identified as hands, a bounding box is created around each one. The bounding boxes are compared to those found in the previous frame and matched together, so that the user’s two hands are tracked separately.

Once each blob has been associated with the left or right hand, the algorithm determines the heading, velocity and acceleration of the hand. This information is averaged over multiple frames to eliminate noise.

Long-Exposure Motion Trails
The size and location of each hand are used to extend a motion trail from the user's hand. The motion trail is stored in an array; each point in the trail has an X and Y position and a size. To render the motion trail, overlapping, alpha-blended point sprites are drawn along the entire length of the trail, and a Catmull-Rom spline algorithm is used to interpolate between the points in the trail and create a smooth path. Though it might seem best to append a point to the motion trail every frame, this tends to cause noise. In the version below, a point is added to the trail every three frames. This increases the distance between the points in the trail and allows for more smoothing from the Catmull-Rom interpolation.
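For reference, the standard Catmull-Rom formulation interpolates between trail points p1 and p2, using p0 and p3 as outer control points (this is the textbook spline, applied here to the trail's stored fields; the struct is illustrative):

```cpp
#include <vector>

struct TrailPoint { float x, y, size; };

// Catmull-Rom interpolation between p1 and p2 at t in [0,1]. The same
// weights are applied to the stored point size, so the sprite size also
// varies smoothly along the path.
TrailPoint catmullRom(const TrailPoint& p0, const TrailPoint& p1,
                      const TrailPoint& p2, const TrailPoint& p3, float t) {
    auto interp = [&](float a, float b, float c, float d) {
        float t2 = t * t, t3 = t2 * t;
        return 0.5f * ((2.0f * b) +
                       (-a + c) * t +
                       (2.0f * a - 5.0f * b + 4.0f * c - d) * t2 +
                       (-a + 3.0f * b - 3.0f * c + d) * t3);
    };
    return { interp(p0.x, p1.x, p2.x, p3.x),
             interp(p0.y, p1.y, p2.y, p3.y),
             interp(p0.size, p1.size, p2.size, p3.size) };
}
```

At t = 0 the result is exactly p1 and at t = 1 exactly p2, so the interpolated path always passes through the recorded trail points.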

Hand Centers
One of the early problems with the hand-tracking code was that the centers of the blob bounding boxes were used as the input to the motion trails. When the user held up their forearm perpendicular to the camera, the entire length of the arm was recognized as a hand. To better determine where the center of the hand was, I wrote a midpoint finder based on iterative erosion of the blobs. This provided much more accurate hand centers for the motion trails.
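One way to sketch such an erosion-based midpoint finder (a standalone illustration, not the original implementation): repeatedly strip away boundary pixels until the blob disappears, and take one of the last pixels removed as the center, since it is the most interior point.

```cpp
#include <vector>

// Iteratively erode a blob mask; each pass removes pixels that touch
// the image edge or a non-blob pixel. The last pixel removed is one of
// the deepest interior points, used as the hand center.
// Takes the mask by value because it is destroyed during erosion.
int erodeToCenter(std::vector<bool> mask, int width, int height) {
    int last = -1;
    bool any = true;
    while (any) {
        any = false;
        std::vector<int> boundary;
        for (int y = 0; y < height; ++y)
            for (int x = 0; x < width; ++x) {
                int p = y * width + x;
                if (!mask[p]) continue;
                any = true;
                bool edge = x == 0 || x == width - 1 ||
                            y == 0 || y == height - 1 ||
                            !mask[p - 1] || !mask[p + 1] ||
                            !mask[p - width] || !mask[p + width];
                if (edge) boundary.push_back(p);
            }
        for (int p : boundary) { mask[p] = false; last = p; }
        if (boundary.empty()) break;  // defensive; can't occur on a finite grid
    }
    return last;
}
```

On an elongated blob like a forearm, this lands in the thickest region rather than at the bounding-box center, which is why it tracks the palm much better.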

Particle Effects
After the long-exposure motion trails were working properly, I decided that more engaging visuals were needed to create a compelling visualization. It seemed like particles would be a good solution because they could augment the feeling of motion created by the user’s gestures. Particles are created when the hand blobs are in motion, and more particles are created based on the hand velocity. The particles stream off the motion trail in the direction of motion, and curve slightly as they move away from the hand. They fade and disappear after a set number of frames.
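A rough sketch of this kind of particle system in plain C++ (the spawn rate, inherited-velocity fraction, curve amount, and 60-frame lifetime are all illustrative guesses, not values from the project):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct Particle { float x, y, vx, vy; int age; };

// Particles spawn in proportion to hand speed, inherit a fraction of
// the hand's velocity, curve slightly as they travel, and die after a
// fixed number of frames.
struct ParticleSystem {
    std::vector<Particle> particles;
    int lifetime = 60;  // frames before a particle disappears

    void emit(float hx, float hy, float vx, float vy) {
        float speed = std::sqrt(vx * vx + vy * vy);
        int count = (int)(speed * 0.5f);  // more particles at higher speed
        for (int i = 0; i < count; ++i)
            particles.push_back({hx, hy, vx * 0.2f, vy * 0.2f, 0});
    }

    void update() {
        for (auto& p : particles) {
            p.x += p.vx; p.y += p.vy;
            p.vy += 0.02f;  // slight curve away from the path of motion
            ++p.age;
        }
        // remove particles past their lifetime (erase-remove idiom)
        particles.erase(std::remove_if(particles.begin(), particles.end(),
            [&](const Particle& p) { return p.age >= lifetime; }),
            particles.end());
    }

    // alpha fades linearly with age, so particles dissolve rather than pop
    float alpha(const Particle& p) const {
        return 1.0f - (float)p.age / lifetime;
    }
};
```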

Challenges and Obstacles
This is my first use of the open-source ofxKinect framework and OpenFrameworks. It was also my first attempt to do blob detection and blob midpoint finding, so I’m happy those worked out nicely. I investigated Processing and OpenNI but chose not to use them because of performance and debug time implications, respectively.

Live Demo
The video below shows the final visualization. It was generated in real time from improvised hand gestures I performed while listening to "Dare You to Move" by the Vitamin String Quartet.

1 Comment

  1. Hi Ben – nice work. Below are the group’s comments from today’s Etherpads.

    nice approach to the automatic depth process, good to explain the problems you encountered but I’d like to see more regarding your aesthetic and conceptual choices, also it would be nice to see the iPod video for the reference

    too much text on slides, maybe more concept images/sketches, probably start with the demo, then explain the process

    The colors look exciting, but get to the fun stuff quicker. Problem solutions are really interesting. 1 slide per problem probably would have been sufficient. Damn that looks really advanced.

    I like the teaser image in the beginning. Agree that your presentation would benefit from more images of your process. Awesome problem solving skills. I’m envious of your technical expertise! Nice looking visuals. I bet there is more detail that we are missing because of the projector too. This would be cool for conducting music.

    Fucking stupid genius simplicity for thresholding. you = win. creative problem solving

    pretty colors. i like the particles.

    impressive documentation of process

    Great work! I would suggest having the colors change…or letting the user select which colors they want to use. I agree!

    The streaks kinda remind me of graffiti analysis – way cool.

    Nice concept and execution. I love the ghosting of the hands and trailings. Your hand detection hack leads to a happy accident – it's nice how when you turn, your body begins drawing, without any big jumps.

    Nice. It would be cool to have some sound being played. could it react to the speed of the hand motions.

    Demo video would benefit from a great dance track. Could be a substitution for glowsticks in the future :)

    Agreed a music video would be sweet. I wonder if you could cross it with the type of stuff Golan has done where the type of sound is generated by the gesture you make with the Kinect.

    Yea, syncing it to sound (flickering effect to bpm? even though it's an accident it's kinda cool). That lets you bring in your original interact-with-music idea as well!

    Does it stop being a visualization of the music if it is based on people’s reactions? I guess the distinction is mostly philosophical
    A little too much time going into implementation details before explaining the final product more.

    Spline interpolation? OMG. Like light drawing. Nice.

    Dude. Good work. Great attention to detail.

    I wish the trails were thicker, 3D, and sprayed more particles. Maybe it’s the projector, but it seems a little subtle. In the iPod commercials the particles make them look wet. These seem a little more ghosty.

    Oh, that’s pretty.

    Gorgeous…Well presented technical issues and resolutions.

    The final product looks fun to play with, like anyone could walk up & have a good time with it.

    Comment by Golan Levin — 21 February 2011 @ 1:11 pm

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.
(c) 2019 Interactive Art & Computational Design / Spring 2011 | powered by WordPress with Barecity