a — final

Superboneworld

A system transcribing people into skeletons, hanging out.

screenshot from superboneworld, yellow stick figures in various poses, sitting, standing, walking, on a black background

?

Superboneworld is a system that extracts and visualises human pose-skeletons†† from video. This extracted information is then displayed in a scrolling, ticker-like format on a 4:1 display. As Superboneworld scrolls past us, we see different strands of human activity — dancing, walking, yoga, pole dancing, parkour, running, talent-showing — overlaid, combined and juxtaposed. When we reach the end of the pose-ticker, we loop back to the start, seeing the next slice of Superboneworld†††.

I wanted to explore the commonalities and contrasts within and across certain forms of human movement. I was interested in putting in one 'world' myriad different people moving about in myriad different ways.

The capture system consisted of a neural network (the same one used in pose-world) that extracts⁴ pose information from video.

This system was fed a large number of videos ripped from the internet, ranging from 0.5 to 10 minutes in length.

This gave me frame-by-frame knowledge⁵ of the poses of the people within the video. I then visualise this by drawing all the pose information I have for each skeleton; in the above image, for instance, the missing left forearm of the leftmost skeleton (most likely) means that the neural network was unable to recognise it in the image.
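
For reference, here is a hypothetical sketch of what one frame of this pose data looks like; the joint names and format are illustrative, not the exact output of the network:

```python
# One frame of extracted pose data (illustrative shape only): each skeleton is a
# mapping from joint name to (x, y, confidence), and joints the network could
# not find are simply absent.
frame = [
    {   # leftmost skeleton: note the missing left elbow/wrist
        "nose": (412.0, 88.5, 0.91),
        "neck": (410.2, 130.1, 0.88),
        "right_shoulder": (380.4, 133.7, 0.85),
        "left_shoulder": (440.9, 131.2, 0.84),
        "right_elbow": (371.1, 190.3, 0.77),
        # "left_elbow" / "left_wrist" missing: the network could not find them
    },
    # ... one dict per detected person in this frame
]

def drawable_bones(skeleton, bones=(("neck", "right_shoulder"),
                                    ("right_shoulder", "right_elbow"))):
    """Yield only the bones whose two endpoints were actually detected."""
    for a, b in bones:
        if a in skeleton and b in skeleton:
            yield skeleton[a][:2], skeleton[b][:2]
```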

Each video stays static relative to Superboneworld (we are moving across it), so each video produces a Bonecommunity of skeletons moving around over time, i.e. a 'pose-video'. Certain Bonecommunities are drawn in a different colour because it looked nicer that way. As a Bonecommunity scrolls out of view, it is frozen in time till we get round to it again.

notes

significant portion of credit for name goes to claire hentschker

†† a pose-skeleton is a 'skeleton' that describes the pose of a person, such as where their joints are located and how to connect those joints up to reconstruct their pose

††† the 'pose-videos' ('Bonecommunity') only play when they are being displayed — after we have scrolled past one, the skeletons are frozen till it next comes back round, so each time we go around Superboneworld, we see a little more of each little 'Bonecommunity'

⁴ given an image containing people, it tells me where the limbs are located (within the image), and which limbs belong to which people.

⁵ the computer's best guess

process !

neural

I first experimented with this neural network system for my event project, pose flatland (open sourced as pose-world).

The structure of the neural network did not change much from pose flatland; the primary change was that I optimised it to run faster while I was thinking about what to actually capture. The optimisation mostly consisted of moving calculations from the CPU (previously done with OpenCV) onto the GPU by translating them into PyTorch.
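
As a rough sketch of the kind of change this involved (not the exact code in pose-world): the part-confidence heatmaps that were previously smoothed on the CPU with OpenCV can instead be smoothed with a convolution in PyTorch, so the tensors never leave the GPU.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(size=7, sigma=3.0):
    """Build a 2D Gaussian kernel shaped (1, 1, size, size) for conv2d."""
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2.0
    g = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    k = g[:, None] * g[None, :]
    return (k / k.sum()).view(1, 1, size, size)

def smooth_heatmaps(heatmaps, kernel):
    """heatmaps: (num_parts, H, W) tensor, already on the GPU."""
    x = heatmaps.unsqueeze(1)                             # (num_parts, 1, H, W)
    x = F.conv2d(x, kernel.to(heatmaps.device),
                 padding=kernel.shape[-1] // 2)           # Gaussian blur on the GPU
    return x.squeeze(1)
```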

I also modified the system so that it could process arbitrarily large input and produce correspondingly large output, without having to resize images to be small and square.
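
A minimal sketch of the idea, assuming the network downsamples by some fixed stride (8 here, purely for illustration): pad each frame up to the next multiple of the stride instead of squashing it into a small square, and crop the output back afterwards.

```python
import torch.nn.functional as F

def pad_to_stride(img, stride=8):
    """img: (C, H, W) tensor. Pad right/bottom so H and W are multiples of stride."""
    _, h, w = img.shape
    pad_h = (stride - h % stride) % stride
    pad_w = (stride - w % stride) % stride
    padded = F.pad(img.unsqueeze(0), (0, pad_w, 0, pad_h))   # (1, C, H', W')
    return padded.squeeze(0), (h, w)   # keep the original size to crop outputs later
```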

I eventually got the system to a state where I could process a lot of video in a reasonable amount of time: each frame took ~0.08 to 0.2 seconds to process, so each video takes roughly 2-10x its length to process, depending on the number of people in it and the number of computer-confusing people-like things in it.

I first experimented with parkour videos, and how to arrange them. At this stage, the viewpoint was static and pose-videos were overlaid, which was initially confusing and all-over-the-place to look at.

After striking upon the idea of using the superlong 4:1 display lying around the STUDIO, I also decided to more directly interrogate the expression in videos on the internet. Before this, I was unsure whether moving on from the live webcams (as used in pose flatland) was a good idea, but after seeing a number of outputs from various popular trap music videos, I was convinced that using videos from the internet was the right direction.

The majority of the videos I used could be considered 'pop-culture', for some value of 'culture' & 'pop' — they were mostly all popular within the genres they embodied. For instance, one of the videos is the music video for Migos' seminal track Bad and Boujee, and another is the very important 21 Savage/Metro Boomin track X. For the videos from genres that I am less knowledgeable about, such as parkour or yoga, I chose videos that generally showed most of people's bodies most of the time and were somewhat popular on YouTube.

As a refresher, here is the output of the neural network directly drawn atop the image that was processed:

Here are some images for the earlier, parkour iterations:

Note the various flips and flying around:

I realised that by pressing Ctrl-Alt-Cmd-8 to invert my screen, I could significantly improve the quality of the media artifact:

media object

After downloading and processing the videos, I set about arranging them. I mostly did this blind, by writing a JSON file describing where each 'pose-video' should be placed on Superboneworld. I then wrote a small p5js script that downloaded the pose-videos, placed them in the correct locations (for most of them, initially far, far offscreen), and then slowly scrolled across Superboneworld, taking care to pause and unpause the Bonecommunities as they scrolled out of and into view.
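
As an illustration of the layout file and the pause/unpause bookkeeping (the field names here are made up; the real placement file and p5js script are in pose-world), sketched in Python rather than p5js:

```python
import json

# layout.json: where each pose-video sits on Superboneworld and how it is drawn.
LAYOUT = json.loads("""
[
  {"name": "parkour_01",     "x": 0,    "y": 40, "colour": "yellow"},
  {"name": "bad_and_boujee", "x": 2400, "y": 10, "colour": "white"},
  {"name": "yoga_03",        "x": 5100, "y": 60, "colour": "yellow"}
]
""")

VIEW_WIDTH = 1920   # pixels visible at once on the 4:1 display (illustrative)

def visible(entry, scroll_x, video_width=1200):
    """Is this Bonecommunity currently on screen?"""
    return scroll_x < entry["x"] + video_width and entry["x"] < scroll_x + VIEW_WIDTH

def step(frame_counters, scroll_x):
    """Advance time only for the Bonecommunities in view; everything else
    stays frozen until the scroll comes back around."""
    for entry in LAYOUT:
        if visible(entry, scroll_x):
            frame_counters[entry["name"]] += 1
    return frame_counters

# frame_counters = {e["name"]: 0 for e in LAYOUT}  # then call step() every draw tick
```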

Whilst building this visualisation, I realised that the skeletons would look better drawn as blobs rather than stick figures, as blobs have significantly more dimensionality and make their ordering atop each other visible (a Bonecommunity has a z-ordering).

After this, I loaded up a webpage containing the p5js script, plugged my computer into the 4:1 screen in the STUDIO, and showed it.

Here is an image from the exhibition:

Here are some GIFs:

similar work made in the past

Golan Levin's Ghost Pole Propagator is the most visually and conceptually similar project, although I did not really notice the similarity till after making pose flatland.
Here is the best documentation of it I could find:

source code

The neural network modifications and javascript visualisers have been merged into pose-world. See it for instructions on how to process your own video.

The piece is available to be viewed at bad-data.com/superboneworld. It streams ~100 MB of data to your computer, so it may be a little slow to load. However, after loading, it caches all the downloaded data, so subsequent runs are fast.

thanks to:

  • Claire Hentschker for significant help in the conceptual development of this project
  • Zhe Cao, Tomas Simon, Shih-En Wei, Yaser Sheikh, for doing the research and releasing an implementation that made this possible
  • tensorboy for the re-implementation of the above research in an easy to extend way
  • All the other artists in Experimental Capture who thought this project was cool and helped me make it better
  • &
  • Golan Levin for this class!

a — event

Pose Flatland


why / what

Pose Flatland is a visualisation of horizontal sheets of commonality, orthogonal to but intersecting with myriad disparate human threads across the world. It does this by overlaying the physical, flattened poses of people across the world, letting us discover similar being in distant places. A many-eyed (24) world-sized camera.

LIVE DEMO FOR CLASS



how

I adapted an implementation of Real-Time Multi-Person Pose Estimation to infer pose skeletons from raw monocular RGB images.
Then, I fed in live video from unsecured IP cameras on insecam.org, processing it frame by frame. After some more optimisations — moving the CPU computation to the GPU — the final network is able to process video into pose skeletons at ~10 frames per second, using a GTX 1080 Ti.
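
The frame-grabbing side of this, sketched in Python under the assumption that each camera exposes an MJPEG or RTSP URL that OpenCV can open (the URL and `run_pose_network` below are placeholders):

```python
import cv2

def frames(stream_url):
    """Yield frames from a public IP-camera stream, one at a time."""
    cap = cv2.VideoCapture(stream_url)    # handles MJPEG/RTSP URLs
    while True:
        ok, frame = cap.read()
        if not ok:                        # camera hiccuped or went offline
            break
        yield frame                       # BGR image, ready for the pose network
    cap.release()

# for frame in frames("http://example.com/mjpg/video.mjpg"):
#     skeletons = run_pose_network(frame)   # hypothetical call into pose-world
```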

Example output from network:

You can see the pose skeleton overlaid onto a stock photo, with the bones labeled.

I created a webserver exposing this functionality, creating streams of pose skeletons from streams of video, and then used p5js to visualise the skeletons, overlaying them into one composition. In this iteration, 24 cameras are overlaid. The cameras cover locations where common human experience occurs – streets, restaurants, the beach, the gym, the factory, the office, the kitchen. For each of those categories, multiple locations from across the world are present.
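
The actual server is part of pose-world; purely to illustrate the shape of the interface, a polling endpoint that the p5js page could hit might look like this (Flask, with hypothetical route names):

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Filled in by the pose-estimation loop; each entry is a list of skeletons in
# the per-frame format sketched earlier.
latest_skeletons = {camera_id: [] for camera_id in range(24)}

@app.route("/skeletons/<int:camera_id>")
def skeletons(camera_id):
    """The p5js page polls this per camera and overlays all 24 into one drawing."""
    return jsonify(latest_skeletons.get(camera_id, []))

if __name__ == "__main__":
    app.run(port=8000)
```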

The visualisation only updates when a camera is active and a person is detected, the final composition varying over time as patterns of activity move over the sampled locations. In the above images, the bottom image is taken near noon in Delhi, whereas the top two are from near noon in New York.

The source code is available here: aman-tiwari/pose-world

prior art

Looking back (especially when coupled with the final visualisation), this project directly links to Golan Levin’s Ghost Pole Propagator, which uses computer vision algorithms to contract people’s silhouettes into centerlines.

a — event process

Event — Progress


so far ‡

I have the multi-person pose estimation code running, and am able to feed it the IP camera images from across the world. As the cameras run at a low frame rate, and the code as yet only runs at ~5 fps, I need to develop some method of temporal interpolation to get smooth-ish movement. I also need to optimise the code, as right now it waits for the whole round trip (input webcam to server to display) to complete, to prevent backpressure. I will probably reimplement this as a gRPC service running in the cloud.
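
The temporal interpolation I have in mind is nothing fancier than blending consecutive pose estimates for the same person; a sketch, assuming joints come in as (x, y) arrays:

```python
import numpy as np

def interpolate_pose(prev_joints, next_joints, t):
    """Linearly blend two pose estimates for the same person.

    prev_joints, next_joints: (num_joints, 2) arrays of (x, y) positions
    t: fraction of the way from the previous estimate to the next, in [0, 1]
    """
    prev_joints = np.asarray(prev_joints, dtype=float)
    next_joints = np.asarray(next_joints, dtype=float)
    return (1.0 - t) * prev_joints + t * next_joints

# e.g. rendering at 30 fps from ~5 fps estimates: t = 0, 1/6, 2/6, ... per display frame
```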

Here is a debug image of the pose estimation on a choir. Note that it also correctly gets the orientation of the pose skeleton (look at the bone colours).

a — portrait

Portrait — Ghost Eater


I captured the ghost of Ngdon. Here it is being called upon, during Open Studio 2016.

How

I trained a pix2pix convolutional neural network (a variant of a CGAN) to map facetracker debug images to Ngdon’s face. The training data was extracted from two interviews conducted with Ngdon about his memories of his life. I built a short openFrameworks application that takes the input video, processes it into frames, and applies and draws the face tracker. For each frame, the application produces 32 copies with varying scales and offsets. This data augmentation massively increases the quality and diversity of the final face-mapping.
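
The augmentation itself was done in that openFrameworks app; as an illustration of the idea only, the same kind of jitter in Python (Pillow), with made-up scale and offset ranges:

```python
import random
from PIL import Image

def jittered_copies(frame_path, n=32, out_size=(256, 256)):
    """Make n randomly scaled-and-offset copies of one training frame."""
    frame = Image.open(frame_path)
    copies = []
    for _ in range(n):
        scale = random.uniform(0.8, 1.2)                 # illustrative range
        w, h = int(frame.width * scale), int(frame.height * scale)
        scaled = frame.resize((w, h))
        dx, dy = random.randint(-20, 20), random.randint(-20, 20)  # illustrative offsets
        canvas = Image.new("RGB", out_size)              # black background
        canvas.paste(scaled, (dx, dy))
        copies.append(canvas)
    return copies
```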

For instance, here are 6 variations of one of the input frames:

These replicated frames are then fed into phillipi/pix2pix. The neural network learns to map the right half of each of the frames to the left half. I trained the network for ~6-10 hours, on a GTX 980.
At run-time, I have a small openFrameworks application that takes webcam input from a PS3 Eye, processes it with the dlib facetracker, and sends the debug image over ZMQ to a server running the neural network, which then echoes back its image of Ngdon’s face. With a GTX 980, and on CMU wifi, it runs at ~12 fps.
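
The server end of that round trip, sketched with pyzmq (the real client is the openFrameworks app; `run_pix2pix` is a placeholder for the call into the trained network):

```python
import zmq

def serve(port=5555):
    """Receive a facetracker debug image, run the generator, echo the face back."""
    ctx = zmq.Context()
    sock = ctx.socket(zmq.REP)                 # simple request/reply round trip
    sock.bind(f"tcp://*:{port}")
    while True:
        debug_image = sock.recv()              # JPEG bytes from the oF client
        face = run_pix2pix(debug_image)        # placeholder: trained pix2pix generator
        sock.send(face)                        # echoed back to be drawn on screen
```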

Only minimally explored in the above video, the mapping works really well with opening/closing your mouth and eyes and varying head orientations.

The source code is available here: aman-tiwari/ghost-eater.

Here is a gif for when Vimeo breaks:

a — place proposal

Place Project


I am interested in capturing people through walls and around corners.

how ≠– capture

I plan on capturing footstep sounds through a contact microphone array. When placed on a hard surface, it should enable detection of footsteps from long distances. Using time-delay-of-arrival estimation, I should be able to triangulate the approximate location of the footstep sources — i.e., the people.

how ≠– media artifact

I plan on projecting a map of the people walking around above the STUDIO for Creative Inquiry onto the roof of the STUDIO for Creative Inquiry.

a — event proposal

Event Project


I am interested in capturing horizontal planes of commonality across disparate / disconnected human existences in the world.

To express this desire, I will use live camera feeds from around the world, extract semantic/significant/poignant/interesting features, and place them in a single space, a World Playhouse

how ± capture

I plan on using implementations of real-time multi-person pose estimation to extract live pose-skeletons from webcams in selected spaces around the world.

how ± media artifact

To visualise the pose-skeletons, I plan on creating a World Playhouse. Within the Playhouse, I will map the captured pose-skeletons to avatars. Inter-avatar interactions, the horizontal threads connecting these distant, originally non-overlapping rooms, will be amplified, through as-yet-undetermined methods including but not limited to — physics, ragdoll physics.

a — place

I built an ultrasonic interferometer to map and visualize the sonic texture and echo-pathways of a place. Here is an example output:

In the above image, the x-axis corresponds to time. Each spot on the y-axis represents a correlation bin; a bright spot means there was an echo at that time delay. The above image was produced from the ceiling of the STUDIO for Creative Inquiry.

why

I wanted to create sonic maps of a space, mapping sound qualia that we wouldn’t normally pay attention to.

the final setup

A SoundLazer parametric speaker, connected to a moto pre4 usb audio digitizer. I use two AKG C542BL boundary-effect microphones connected to the digitizer as the sound input. The microphones are mounted 1.6 – 2 meters apart. Note that the SoundLazer is only powered up, but not plugged into any audio output. I then feed the sound input to openFrameworks, using the Eigen C++ scientific library to compute the cross-correlation of the two signals. I then plot the short-time cross-correlation of the two signals on the y-axis, sweeping the ‘cross-correlation brush’ across the x-axis. I also plot a small red dot at the maximum cross-correlation.

It is also possible to send the microphone input back into the ultrasonic speaker, creating feedback effects that let you hear (in some way) what texture is there as you scan it (although then it’s just a sonic interferometer).

the process

Originally, my project was going to be audio-source localisation of footsteps using contact microphones. The audio-source localisation works by computing the cross-correlation of the two microphones’ signals. The cross-correlation will have a peak at the predicted lag time. From this, we can calculate the extra distance the signal travelled from one microphone to the other, from which we can calculate the two possible angles the speaker could be at relative to the baseline of the microphones. Using three microphones, we can figure out two angles from two different known baselines, giving us the approximate location of the speaker (including virtual speakers created from echoes).
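
A sketch of that calculation for two far-field microphones, in numpy rather than the Eigen code actually used:

```python
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s, roughly, at room temperature

def source_angle(mic_a, mic_b, baseline_m, sample_rate):
    """Estimate the angle of a far-field source relative to the microphone baseline.

    mic_a, mic_b: 1D arrays of samples from the two microphones
    baseline_m: distance between the microphones, in metres
    """
    corr = np.correlate(mic_a, mic_b, mode="full")
    lag_samples = np.argmax(corr) - (len(mic_b) - 1)   # peak position gives the lag
    extra_distance = (lag_samples / sample_rate) * SPEED_OF_SOUND
    # arccos gives the angle from the baseline; the mirror-image angle on the
    # other side of the baseline is equally consistent, hence the two candidates.
    cos_theta = np.clip(extra_distance / baseline_m, -1.0, 1.0)
    return np.degrees(np.arccos(cos_theta))
```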

To improve the results, I whiten the cross-correlation using Roth whitening.

Although I could get the localisation working in the air using the AKG boundary microphones, the contact microphones were not sensitive enough in the low frequencies to pick up footsteps at any distance. Although the boundary microphones could very easily pick up footsteps and movement across CFA, the output wasn’t satisfactory to me (and, every time I explained the idea to someone it seemed less and less interesting).

I realised that by computing the cross-correlation of a signal I send out myself, I would be creating a sonar. I also remembered I had a SoundLazer, and at 2am thought to change my project to using the SoundLazer’s beam to scan the acoustic reflections of a location.

This change of idea required almost no change in the code (I ended up using PHAT whitening rather than Roth whitening).
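
For reference, PHAT whitening just normalises away the magnitude of the cross-spectrum before transforming back, so only the phase (i.e. the timing) survives; a numpy sketch of the standard GCC-PHAT formulation:

```python
import numpy as np

def gcc_phat(sig, ref):
    """Phase-transform (PHAT) weighted cross-correlation of two signals."""
    n = len(sig) + len(ref)                 # zero-pad to avoid circular wrap-around
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12          # keep the phase, discard the magnitude
    return np.fft.irfft(cross, n=n)         # sharp peak at the true delay
```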

The following is a debug screenshot; the vertical text on the left marks the lag (in samples, and in projected speaker angle) that the y-axis corresponds to.

the results

The following is produced from scanning my room.

The following is produced from the entrance to the STUDIO, the bright white smudges coming from people’s speech.

 

The following is produced from scanning the scene outside the window of the STUDIO for Creative Inquiry at noon.

The following is a hardwood table with other people sat around it.

The following were produced from scanning the north foyer of the College of Fine Arts.

 

 

a — portrait plan

I am going to create a generative model trained using video of Ngdon’s facial emotional response to provoking questions. The model will consist of a WGAN conditioned on text to generate video that embodies and hallucinates my subject’s expressions in response to audience-input questions.

a — SEM

I really wanted to see what a smooth and transparent object would look like under the SEM. After initially viewing my sample under the SEM, even though there was the expectation of seeing Something, I was still surprised by how much variation was present on the smooth plastic surface. The most interesting phenomenon I saw was the multitudes of small cylindrical nubs sticking out of the surface at the 500 nm scale. I wonder how they interact with light and affect the appearance of the googly eye, as well as how they originate. In addition, in the final zoom-out, I was intrigued by the possibilities for creating abstract compositions using micrography.