Image Sequencer

My final project is a continuation of my experiments in the event project: a tool that computationally sorts images into a sequence based on their common features.

The project runs three modules step by step to achieve this:

1) Image segmentation, extracting the object of interest from the background,

2) Comparison, comparing the similarity between two given extracted objects, and

3) Sorting, ordering a set of images into a sequence using the comparison function.

Last time, the algorithm used in each module was fixed. For the final project, I devised several alternative algorithms for each module, so the process can be customized into different pipelines suited to a variety of datasets.

Speeding up

I only had time to sort one dataset (horses) for my event project, since processing took more than 24 hours. So the first priority was to speed it up.

I designed several numpy filters that operate on images as matrices, replacing pixel-by-pixel processing. This made the comparison part of my program run around 200 times faster. However, the image segmentation is still relatively slow, because the images have to be uploaded to and then downloaded from the online image segmentation demo.
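As a rough sketch of what this vectorization means (the actual filters in my code are different), here is the same overlap count written pixel by pixel and as a single numpy matrix operation:

```python
import numpy as np

def overlap_loop(a, b):
    # pixel-by-pixel version: slow Python loops over the image
    h, w = a.shape
    count = 0
    for y in range(h):
        for x in range(w):
            if a[y, x] and b[y, x]:
                count += 1
    return count

def overlap_numpy(a, b):
    # the same computation as one vectorized matrix operation
    return int(np.logical_and(a, b).sum())
```

On full-size masks the vectorized version is orders of magnitude faster, since the looping happens in C instead of Python.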

Now that the program runs much faster, I had the chance to apply it on a variety of datasets.

Sorting Objects

475 Bottles from Technisches Museum Wien

Sorting faces

My previous algorithm compares two images based on the outline of the object of interest. However, this is not always desirable for every dataset. For example, the outlines of people’s heads might be similar, yet their facial features and expressions can be vastly different, so if they were made adjacent frames, the result wouldn’t look smooth.

I used openCV and dlib for facial landmark detection, and compared the relative spatial locations of pairs of key points. The program then iterates through all the faces and places the ones with the most similar key points together. Sorting 348 portraits from the MET, this is what I got:
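A minimal sketch of the comparison step (the actual pairing and weighting of key points in my code is more involved): each face is reduced to its dlib landmark coordinates, normalized for position and scale, and the distance between two normalized landmark sets serves as the dissimilarity.

```python
import numpy as np

def normalize(landmarks):
    # landmarks: (N, 2) array of (x, y) facial key points from dlib
    centered = landmarks - landmarks.mean(axis=0)  # remove position
    return centered / centered.std()               # remove scale

def face_distance(lm_a, lm_b):
    # smaller distance = more similar key-point arrangement
    return float(np.linalg.norm(normalize(lm_a) - normalize(lm_b)))
```

Because of the normalization, a face that is merely shifted or enlarged scores as identical; only the arrangement of the features matters.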

Alternative sorting methods and morphing

From peer feedback, I realized that people are not only interested in the structural similarity between objects in two images, but also other properties such as date of creation, color, and other dataset-specific information.

However, the major problem with sorting images by these properties is that the result looks extremely choppy and confusing. Therefore, I thought about the possibility of “morphing” between two images.

If I can find a method to break a region of an image into several triangles, and perform an affine transformation of the triangles from one image to another while slowly transitioning the color at the same time, I can achieve a smooth morphing effect between the images.

Affine transformation of triangles

In order to perform an affine transformation on a triangle from its original shape to its target shape, I first recursively divide the original triangle into many tiny triangles, using a method similar to the Sierpinski fractal, except that instead of leaving the middle triangles untouched, my algorithm recurses into all of them.
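A minimal sketch of the subdivision (the recursion depth here is illustrative): each triangle is split at its edge midpoints into four smaller triangles, and unlike Sierpinski, the middle one is recursed into as well.

```python
def subdivide(tri, depth):
    # tri: three (x, y) vertices; returns a list of 4**depth tiny triangles
    if depth == 0:
        return [tri]
    (ax, ay), (bx, by), (cx, cy) = tri
    ab = ((ax + bx) / 2, (ay + by) / 2)  # edge midpoints
    bc = ((bx + cx) / 2, (by + cy) / 2)
    ca = ((cx + ax) / 2, (cy + ay) / 2)
    corners = [(tri[0], ab, ca), (ab, tri[1], bc), (ca, bc, tri[2])]
    middle = [(ab, bc, ca)]  # Sierpinski would leave this one out
    out = []
    for t in corners + middle:
        out.extend(subdivide(t, depth - 1))
    return out
```

Because the same midpoint rule is applied to both the source and the target triangle, the i-th tiny triangle in one corresponds to the i-th tiny triangle in the other, which is what makes the pixel copying step possible.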

The same recursion is performed on the target triangle. Then, every pixel value of every tiny triangle in the original is read and drawn onto the corresponding tiny triangle in the target. Below is a sample result:


Since facial landmark detection already provides the key points, a face can now easily be divided into triangles and transformed into another face. Below are the same portraits from the MET sorted in chronological order, morphed.


The problem now is to find a consistent method to break things that are not faces into many triangles. I’m still trying to think of a way.

Re-sorting frames of a video

An idea suddenly came to mind: what if I re-sort some images that are already sorted? If I convert a video into frames, shuffle them, and sort them back into a video using my own algorithm, what would I get? It will probably still be smoothly animated, but what does the animation mean? Does it tell a new story or give a new insight into the event? The result will be what the algorithm “thinks” the event “should” be like, without understanding the actual event. I find this to be an exciting way to abuse my own tool.

Left: Original  Right: Re-sorted


Alternative background subtraction

I ran the algorithm on my own meaningless dance. It transformed into another, quite different, meaningless dance. I wonder what happens if I do the same on a professional dancer performing a familiar dance.

When I tried to run the image segmentation on ballet dancers, the algorithm believed them to be half-cow, half-human, armless monsters. I guess it hasn’t been trained on a sufficient number of costumed people to recognize them.

So I had to write my own background subtraction algorithm.

In the original video, the background is the same, yet the camera angle is constantly moving, so I couldn’t simply take the median of all the frames and subtract it. Similar methods also fail because the dancer is always right at the middle of the frame, so the average/median/maximum of the frames will all have a white blob in the center, which is not helpful.

Therefore, I used a method similar to the one described in my event post: for each frame, learn the background from a certain region, and subtract the pixels that resemble this sample from the area surrounding the object of interest.

I sample the leftmost section of each frame, where the dancer never goes, horizontally tile this section to the width of the frame, and subtract this estimated background from the original.
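A minimal numpy sketch of this step, assuming grayscale frames and a strip width the dancer never enters (both are assumptions; my actual code adds openCV cleanup on top):

```python
import numpy as np

def estimate_background(frame, strip_width=16):
    # sample the leftmost strip and tile it horizontally to the frame width
    strip = frame[:, :strip_width]
    reps = -(-frame.shape[1] // strip_width)  # ceiling division
    return np.tile(strip, (1, reps))[:, :frame.shape[1]]

def foreground_mask(frame, thresh=30):
    # pixels that differ a lot from the estimated background are foreground
    diff = np.abs(frame.astype(int) - estimate_background(frame).astype(int))
    return diff > thresh
```

The raw mask is noisy, which is where the erode/dilate passes come in.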

Combined with erode/dilate and other openCV tricks, it worked pretty well. This method is not only useful for ballet: it’s a common situation in many datasets to have relatively uniform backgrounds yet complicated foregrounds.

Using the same intensity-based registration (now 200x faster), the program compiled a grotesque dance I’ve never seen before:


I improved my code to run 200x faster. Processing a dataset originally took more than a day; now it takes twenty minutes or so.

I experimented with different datasets.

Really confusing basketball games:


348 portraits sorted by facial expression

I’m working on sorting the same dataset by beauty standards. Since there has been a lot of research on the golden ratio and perfect face proportions, I can sort the portraits from the prettiest to the ugliest.

Re-sorting shuffled video frames

Left: original  Right: re-sorted



I would like to improve my event project as my final project. I think there are a lot of possibilities that can be explored with this tool I now possess.

For the event project, I only had time to demonstrate the tool on horse pictures. As suggested in the group critique comments, I plan to give the subject more thought and apply the algorithm to more interesting datasets.

I will also refine the code, since currently my program is basically thrown together and glued with hacks, and running it is a complicated process. Better image registration algorithms are also mentioned in the comments, and I will try them out.




My event project is a piece of software that takes lots of different photographs of a certain event as input, uses computer vision to find the similarities and differences between the images in order to sort them, and finally produces a video/animation of the event by playing the sorted images in sequence.



Animation of a browsing horse and a running horse automatically generated by my program.

(More demos coming soon!)



I was inspired by the animations compiled from different photographs that Golan showed us in class, e.g. the sunset. I thought it was an interesting way to observe an event, yet aligning the photos manually seemed inefficient. What if I could make a machine that automatically does this? I could simply pose any question to it: How does a horse run? How does a person dance? And get an answer right away.

Eventide, 2004 from Cassandra C. Jones on Vimeo.

I soon found it a very challenging problem. I tried to break it down into many smaller problems in order to solve it.

Background Extraction

I used a general object detection tool powered by neural networks called darknet to find the bounding boxes of objects in an image. However, most objects have irregular shapes that do not resemble a box. So finding out exactly which pixels are actually part of the object and which ones are part of the (potentially busy) background is a problem.

After much thought, I finally came up with the following algorithm:

Since I already have the bounding box of the subject, if I extend the bounding box in all four directions by, say, 10 pixels, the content between the two boxes is both a) definitely not part of the subject, and b) visually similar to the background within the bounding box.

The program then learns the visual information in these pixels, and deletes everything within the bounding box that looks like them.
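A simplified sketch of this idea, modeling the “visual information” as just the mean color of the ring between the two boxes (my actual version was fancier, but the structure is the same):

```python
import numpy as np

def subject_mask(img, box, margin=10, thresh=25):
    # img: (H, W, 3) array; box: (x0, y0, x1, y1) from the detector
    x0, y0, x1, y1 = box
    h, w = img.shape[:2]
    X0, Y0 = max(x0 - margin, 0), max(y0 - margin, 0)
    X1, Y1 = min(x1 + margin, w), min(y1 + margin, h)
    # the ring between the extended box and the original box:
    # definitely background, yet visually similar to the box's background
    ring = np.zeros((h, w), dtype=bool)
    ring[Y0:Y1, X0:X1] = True
    ring[y0:y1, x0:x1] = False
    bg_color = img[ring].mean(axis=0)
    # keep only the pixels inside the box that do NOT resemble the ring
    patch = img[y0:y1, x0:x1].astype(float)
    return np.linalg.norm(patch - bg_color, axis=-1) > thresh
```

A single mean color is of course too crude for a busy background; that crudeness is exactly why the results below are only sort-of working.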

The results look like this:

Although it is (sort of) working, it is not accurate. Since all future operations depend on it, the error propagation will be extremely large, and the final result will be foreseeably crappy.

Fortunately, before I wasted more time on my wonky algorithm, Golan pointed me to this paper on semantic image segmentation and how Dan Sakamoto used it in his project (thanks Golan). The result is very accurate.

The algorithm can recognize 20 different types of objects.
I wrote a script similar to Dan Sakamoto’s which steals results from the algorithm’s online demo. Two images are downloaded for each result: one is the original photo, the other has the objects masked in color. I did some tricks in openCV and managed to extract the exact pixels of the object.


Image Comparison

I decided to develop an algorithm that can decide the similarity between any two images; to sort all the images, I simply keep finding the next most similar image and appending it to the sequence.

Since the subject in an image can come in all sorts of different sizes and colors, and might be cropped, rotated, blurred, etc., I used a brute-force window search to counter this problem.

The program scales and translates the second image into tens of different positions and sizes, and overlays it on top of the first image to see how much they overlap. A score is thus calculated, and the alignment with the highest score “wins”.
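A stripped-down sketch of this window search over binary outline masks (the scale set, step size, and scoring in my real code are more elaborate):

```python
import numpy as np

def resize_nn(mask, h, w):
    # nearest-neighbor resize of a boolean mask
    ys = np.arange(h) * mask.shape[0] // h
    xs = np.arange(w) * mask.shape[1] // w
    return mask[ys][:, xs]

def best_alignment(a, b, scales=(0.8, 1.0, 1.2), step=2):
    # slide scaled copies of b over a; return (score, (scale, x, y))
    H, W = a.shape
    best_score, best_pose = -1, None
    for s in scales:
        h, w = int(b.shape[0] * s), int(b.shape[1] * s)
        if h == 0 or w == 0 or h > H or w > W:
            continue
        bs = resize_nn(b, h, w)
        for y in range(0, H - h + 1, step):
            for x in range(0, W - w + 1, step):
                score = int(np.logical_and(a[y:y + h, x:x + w], bs).sum())
                if score > best_score:
                    best_score, best_pose = score, (s, x, y)
    return best_score, best_pose
```

The triple loop over scales and positions is what makes the search slow, and also what makes it robust to cropping and size differences.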

This, although naive and rather slow, turned out to be reasonably accurate. Currently only the outline is compared; I’m thinking about improving it by doing some feature matching.

Image sorting

Sorting all the images, given the similarity (distance) between any two of them, is analogous to the traveling salesman problem:

I simply used the Nearest Neighbor heuristic to solve it, but will probably substitute it with a more optimized algorithm. Here is a sequence of alignments chosen from 300 photos.

Notice how the horses’ heads gradually lower.
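The Nearest Neighbor pass can be sketched like this, given a precomputed distance matrix (the starting index here is arbitrary):

```python
def nearest_neighbor_order(dist, start=0):
    # dist[i][j]: dissimilarity between image i and image j
    n = len(dist)
    order, used = [start], {start}
    while len(order) < n:
        last = order[-1]
        # greedily append the most similar unused image
        nxt = min((j for j in range(n) if j not in used),
                  key=lambda j: dist[last][j])
        order.append(nxt)
        used.add(nxt)
    return order
```

Like all greedy tour builders, it can paint itself into a corner near the end of the sequence, which is why a better TSP heuristic is on the to-do list.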


I didn’t expect exporting the sequence with all the images aligned to be such a headache. During the matching phase, the images went through a myriad of transformations, so I had to find out what happened to each of them. Worse, the transformation needs to be propagated from one image to the next. Eventually I figured it out.

An interesting thing happened: the horses keep getting smaller! I guess it’s because the program is trying to fit each new horse inside the old one. Since this shrinking seems to be linear, I simply applied a linear growth to counter it:

Sans-background version:

It took a whole night to run the code and produce the above result, so I only had time to run it on horses. I plan to also run it on birds, athletes, airplanes, etc. during the following nights.


My event project is going to be a piece of software that takes lots of different photographs (from different sources) of a certain event as input, uses computer vision to find the similarities and differences between the images in order to sort them, and finally produces a video/animation of the event by playing the sorted images in sequence.

For example, if I input a lot of images of running horses from google images, one way the software can process them is to first align them by the positions of the horses, then sort by the similarity of the horses’ poses. It can thus produce a video of a running horse consisting of frames from different images.

Similarly, I can input images of dancing people, flying birds, ball games, fights, etc.

I’m using general object detection to find the bounding boxes of all objects in the images. Then, depending on which works better, I can either do pixel-wise comparison or contour comparison to produce a similarity score for any two arbitrary images.

Here’s where I am in the process:

  • I wrote a program to download images from ImageNet, where tons of images are categorized by subject.
  • I found a neural network object detection tool called darknet. I tweaked its source code so it can print out the bounding boxes of all objects in an image into Terminal. Then I wrote a python program to batch process my source images using this tool, and parse the output.
  • I used openCV in python to do some simple manipulations on the source images so the comparison process will probably be more accurate.

What I’m trying to figure out:

Although I have information about the bounding boxes, most objects have irregular shapes that do not resemble a box. So finding out exactly which pixels are actually part of the object and which ones are part of the (potentially busy) background is a problem. I’m going to read darknet’s source code to find out if this information is already in there. If not, I will have to write something myself.


I currently have two ideas:

1. I want to make a machine that finds out the essence of any event by taking lots of photos/videos of the event as input. The program then uses computer vision and/or machine learning techniques to study the data. It will align the similarities and sort the differences. The product could be a video or a multi-dimensional visualization. I haven’t got a detailed plan yet.

2. I want to capture the formation of boogers. I’m always wondering how come, all of a sudden, I have a booger in my nose. I can probably mount a device consisting of a small webcam and a light source below my nose to find out.


Conversion of images to and from barcodes.

photoBarcoder from Lingdong Huang on Vimeo.


When Golan showed us his barcode scanner, I felt that there should be more barcodes in the world. So I thought about the idea of taking photos right in barcode format, then printing the barcodes out and pasting them everywhere so anyone with a scanner can view the picture. My project will not be a documentation of any particular place, but rather an interesting device people can use to observe places.

A rough calculation told me that it would take more than 50 barcodes to contain an average image. However, I was so curious about the idea that I tried it out nevertheless.

Compression Algorithm

One barcode only holds a tiny bit of information, while an image contains a huge amount. Compressing an image so that it can be contained within a reasonable number of barcodes is the key problem.

I used/developed a series of compression algorithms based on the idea that compression of data is the elimination of redundancy: I try to pick out the repetitions, place them together, and describe them in a concise manner.

1. Image Compression

First, I turned the images into greyscale, so there’s only one channel to compress instead of three.

I thought about the problem from three perspectives. First, I can group pixels that are adjacent to each other and have similar values into a “patch”.

I did this recursively:

  • The program first divides the whole image into four patches.
  • For each patch, if the standard deviation of pixel values in it is below a certain threshold, we’re done with that patch.
  • If not, further divide that patch into four patches, and repeat the same procedure with each of them.
  • If the standard deviations of all patches are below the threshold, or we’ve reached the unit pixel size, we’re done.
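The recursion above can be sketched like this (assuming, for simplicity, power-of-two image dimensions and a plain standard deviation test):

```python
from statistics import pstdev

def quadtree(img, x, y, w, h, thresh=10, min_size=2):
    # img: 2D list of pixel values; returns leaf patches as (x, y, w, h)
    vals = [img[j][i] for j in range(y, y + h) for i in range(x, x + w)]
    if pstdev(vals) < thresh or w <= min_size or h <= min_size:
        return [(x, y, w, h)]  # uniform enough (or too small): done
    hw, hh = w // 2, h // 2
    leaves = []
    # otherwise, recurse into all four quarters
    for dx, dy in ((0, 0), (hw, 0), (0, hh), (hw, hh)):
        leaves += quadtree(img, x + dx, y + dy, hw, hh, thresh, min_size)
    return leaves
```

Uniform regions collapse into a single large patch, while detailed regions get subdivided all the way down, which is exactly the redundancy elimination I was after.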

I then thought about a more efficient manner of describing image data than recording pixel/patch colors. For example, if the image is of a green cat on a banana, and the computer already knows what that looks like, we can just write down “green cat on a banana” instead of describing each pixel of the green cat picture.

So the idea is that I can teach the computer about predefined “patterns”, and for it to reconstruct the image, I can simply tell it which “pattern” to use.

I categorized eight possible patterns that divide a patch into an upper half and a lower half, each defined by a function.

The program then tries to apply each function to all the pixels in the patch. The function that makes the difference between the average values of the two halves the greatest is the “best” function. The corresponding pattern ID is then recorded.
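A sketch with four of the split patterns (my actual set has eight, and the exact split functions here are illustrative):

```python
def best_pattern(patch):
    # patch: 2D list of pixel values; returns the ID of the split pattern
    # that maximizes the contrast between its two halves
    h, w = len(patch), len(patch[0])
    patterns = {
        1: lambda x, y: y < h / 2,            # horizontal split
        2: lambda x, y: x < w / 2,            # vertical split
        3: lambda x, y: x + y < (w + h) / 2,  # diagonal split
        4: lambda x, y: x - y < (w - h) / 2,  # anti-diagonal split
    }
    best_id, best_diff = None, -1.0
    for pid, in_upper in patterns.items():
        upper, lower = [], []
        for y in range(h):
            for x in range(w):
                (upper if in_upper(x, y) else lower).append(patch[y][x])
        if not upper or not lower:
            continue
        diff = abs(sum(upper) / len(upper) - sum(lower) / len(lower))
        if diff > best_diff:
            best_id, best_diff = pid, diff
    return best_id
```

Only the winning pattern ID needs to be stored, which is far cheaper than storing the pixels it summarizes.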

When rendering, a gradient is used to fake a smooth transition between the two halves of a pattern.

Last, I thought about the compression of the actual pixel values. Normally, if we have 3 bits, we can express 2^3 = 8 different pixel values; if we have 8 bits, we can express 256 of them, etc. I found that adjacent pixels usually have relatively similar values, so if I record the value relative to the region it’s in instead of the absolute value, I can save even more space.


^ My algorithm (above) produces smaller files and preserves much more information than simply pixelating the image (below).

2. Image Format

I recorded the image data directly in bits, because this way I have the best control over them and can make sure no space is wasted.

The image data consists of an array of patches, and each patch is represented by the following groups of bits

1 | 2 | 3 | 4

Group 1 denotes how many “quarters” we’ve taken during the phase of dividing the image to get this patch. This is the only positional information we need, because with it we will be able to reverse the recursion to infer the exact size and location of the patch.

Group 2 denotes the value of the upper half of the patch, while group 3 denotes that of the lower half, relative to the upper half.

Group 4 is the pattern ID (1-8) that matches the patch. If more aggressive compression is necessary, we can reduce the variety of patterns to 4 or 2.

3. More Compression

The procedures above compressed the image a lot, but it’s still not small enough to be contained in just a few barcodes. Golan urged me to find more algorithms to compress it even further.

Since the compressed image has almost reached the threshold of unrecognizability, I decided to do lossless compression directly on the bit data instead.

After doing research, I figured the best idea was to try every possible kind of lossless compression on my data and see which works. So I applied each algorithm listed on Wikipedia.

Most of them increased the size of my data instead. But one of them, arithmetic coding, was able to reduce it by around 10%.

Roughly speaking, arithmetic coding encodes the whole string of bits as a single fractional number: symbols that occur often get a large share of the number range and thus cost fewer bits, while rare ones cost more.

I then used base64 encoding to convert the bits into alphanumeric characters to be used in barcodes. Interestingly, I didn’t know the real name of this method at first, and defined my own rules and character set for my program. Later Golan mentioned base64, and I realized that it’s what I’d already been using, so I changed my character set to the standard one.

After the encoding, every 6 bits are turned into an ASCII character, so the elephant image above becomes a brief nonsensical code:
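The 6-bits-per-character packing can be sketched like this (the padding rule and charset order follow standard base64; my original homemade charset differed):

```python
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def bits_to_chars(bits):
    # pad the bit string to a multiple of 6, then map each
    # 6-bit group to one barcode-safe character
    bits = bits + "0" * (-len(bits) % 6)
    return "".join(B64[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6))
```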


4. More Colors

Having done greyscale, I thought I could do three channels just for fun. I compressed each channel separately and combined them to get a colored image. The result is exciting: because each channel is quartered and pattern-matched differently, the overlay creates a huge variety of (more accurate) shapes and colors, increasing actual recognizability.

After some more optimization, I was able to compress a three-channel image into around the same size as the greyscale version, so I decided to use the 3-channel version in my final product.

5. Pre-processing images for better output

Since the number of different colors in images produced with this method is limited, increasing the contrast/saturation of the original image before compression improves the result. For example, if we have a grayish image whose pixel colors are not too different from each other, the algorithm will treat them as one large gray patch.

So I wrote something to manipulate the histogram of each channel, cutting off the excess at both ends, to make sure the whole spectrum is utilized. This is what makes my output images look so colorful.
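A sketch of the per-channel histogram stretch (the 2% cutoff here is an illustrative choice):

```python
import numpy as np

def stretch_channel(ch, clip=2.0):
    # cut off `clip` percent at each end of the histogram,
    # then rescale what remains to the full 0-255 range
    lo, hi = np.percentile(ch, [clip, 100 - clip])
    if hi <= lo:
        return ch.copy()  # flat channel: nothing to stretch
    out = (ch.astype(float) - lo) / (hi - lo) * 255
    return np.clip(out, 0, 255).astype(np.uint8)
```

Applied to each of the three channels independently, a washed-out grayish photo ends up spanning the whole value range before it reaches the compressor.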

6. Conversion to barcode



I decided to present the product as a mobile camera app, with which people can take pictures and convert them right into barcodes. Then anyone can scan the barcodes and reconstruct the picture on any device with my decompression algorithm installed.

I programmed everything above in Python, but then realized that I needed to translate it into some other language for it to compile for a mobile platform. I found some Processing libraries that allow interfacing with the Android system and sensors, so I decided to translate all my code into Processing.

After doing so, I discovered that the above-mentioned libraries are useless and can only do simple things. So I was forced to learn Java and use Android Studio to talk to the phone directly.

I took my phone and wandered around campus to test my app. Then I did a lot of tweaking, crashing, debugging, making GUI stuff, etc.

Below are some screenshots of an early iteration of the app.



Final Screenshots


Left: taking a photo. Right: previewing a photo.

Left: generating a barcode. Right: settings.

Using non-photoBarcoder barcodes to produce glitch art

^ Barcodes from a coke bottle as the red channel, a cup noodle as the green channel, and a delivery box as the blue channel.

You can even directly type text into the program and produce images. Below is the course description used as if it were the content of barcodes.

Printing and Distributing Barcodes

I’m currently learning the ESC/POS language to control a thermal printer, so it can print out barcodes to be pasted everywhere.


I currently have two ideas.

One is to install a camera on the face of each of the persons in the Walking to the Sky sculpture. The cameras will take a photo every minute or so, and these photos will represent what the statues would actually see if they had vision. For most of them, I guess it will be the person before them frozen in the middle of a “walking” pose against the constantly changing sky. Some of them can probably see buildings, students, and visitors walking past them out of the corners of their eyes.

The second idea is about the barcode reader and its peculiar way of sensing the world. I’m thinking of making a camera that takes photos encoded in a barcode format and then prints them out: inscrutable to humans, but decodable by a barcode reader. Then I can probably gather all the barcode photos about one place, build a little world in a box where everything is a barcode, and put the barcode reader in there, where it will think it’s inside a VR experience.

For the first idea, I probably need to figure out a way to have all the cameras take HDR photos, or the sky will be super white and the walking figures before the camera will be really dark. For the second idea, I need to invent a smarter way to compress image data than simply recording each pixel’s RGB values, or huge numbers of barcodes will be necessary to describe a single image.


Trashscape: A portrait of A from Lingdong Huang on Vimeo.

Trashscape: A virtual environment where the user can pick up and listen to A’s trash.

Mac Version Download A Leap Controller is required to play. Mouse and 3D mouse versions are in development


I thought it was an interesting idea to know my subject and explore his mind by documenting his trash.

I planned to collect all the things my subject throws away over a period of time, along with voice recordings of the subject telling some thoughts he had in mind the moment he wanted to throw away that piece of trash.

I would then do a photogrammetric scan of all the trash, and place them in a virtual 3D world where the user can wander around, pick them up and listen to the corresponding voice recording.


Trash Collection

My subject handed me a plastic bag full of all sorts of trash.

Recording My Subject

To make it as realistic as possible, we recorded the subject’s voice from above using an ear-shaped mic, so when the user listens to the trash, it sounds as if the voice is coming from the trash.

We numbered each piece of trash, from 1 to 34, and put a label on each of them. This number corresponds to the number my subject says at the beginning of each recording, so it is impossible to confuse them.

My subject’s speech exceeded my expectations. When I was planning the project, I thought the recordings would be something banal such as “This is a half-eaten apple. I’m throwing it away because it tastes awful.” But in fact A has something quirky and insightful to say about every piece of his trash.

Here are some sample recordings:



“Spoon Unused”


“Museum Ticket”


3D Scanning

I used the photogrammetry software Agisoft PhotoScan Pro to virtualize the trash.

The software is really bad at smooth surfaces, which it cannot understand, and at symmetrical objects, which it tries to warp so that the opposite sides overlap. But eventually I got 21 of the 34 trash models done. The other 13 were tiny balls of ash, hair, and chips of paper that were evidently too hard for PhotoScan Pro.

The finished models really have a trash-like quality to them, which might be a problem if I were scanning any other object, but is instead a bonus since I’m scanning actual trash.


I used Unity and Leap Motion to create the virtual world.

I imported all the trash models and voices and paired them in Unity. I programmed a virtual hand so that the user can “pick up” trash by making the gesture over the Leap controller.

A question I spent a lot of time figuring out was the environment in which the trash would be situated. Using the default Unity blue-sky scene certainly feels unthoughtful, yet building a customized realistic scene distracts from the trash itself. I also tried to create a void with nothing but the trash, but I felt that the idea of an explorable environment was weakened by doing so.

Finally, I decided to float the trash in a black liquid under a black sky. I believe this really solved the problem, and even helped bring out the inner beauty of the trash.

Things to Improve

I’m generally happy with the result. However, there are things I need to improve.

Golan pointed out that the controls are problematic. I often face this kind of problem: since I test the software hundreds of times while developing it, I inevitably train myself to master the controls. When a new user tries it, they usually find it way too difficult.

I’m working with the 6DoF mouse now to make a better control experience.

Another problem is the hand model. Currently it’s just a plain white hand from the default Leap Motion assets. Golan and Claire gave me a lot of ideas, such as a trash picker’s glove, tweezers, A’s hand, etc.

They also mentioned things users could do with the trash they find, such as collecting it in a bag, sorting it, etc., which I might also implement.

I’m also thinking about improving the workflow. Currently it’s really slow: I find myself spending hours photographing each piece of trash and struggling with crappy software to make 3D models out of them. I need to automate the whole process, so anyone can just bring in their trash and get their trashscape compiled in no time.


Portrait Plan:

I will collect all the things my subject (A) throws away over a period of time, along with voice recordings of the subject telling me some thoughts he has in mind the moment he wants to throw away that piece of trash.

I will then do a photogrammetric scan of all the trash, and place them in a virtual 3D world where the user can wander around, pick them up and listen to the corresponding voice recording. (diagram below)

The voice recordings can be simple, like “This is a half-eaten apple. I’m throwing it away because it tastes awful.” or “It’s Tuesday. I’m so happy.” or just any random thought that jumps into the subject’s mind at that very moment.

I’m thinking of the trash as pieces of the subject’s life he left behind, and the voice as a frozen fragment of the subject’s ideas and values. Together they become a trail of clues that we can follow to catch a glimpse of the subject as a being.

I chose photogrammetry to record the trash because I feel that photogrammetry models have an intrinsically crappy, trash-like quality to them, which will probably be a bonus.

I’ve been thinking about ways I can make the virtual world an immersive experience. The trash can be placed on a vast piece of land, or can be all floating in an endless river in which the user is boating. I will probably make it in Unity.

I’m also thinking about a method to systematically process all the trash and recordings, so everything can be done efficiently in an assembly line manner, and new trash and recordings can be easily added to the collection.


Hello, I’m ngdon. I’m a sophomore art major interested in generating stuff with code. You can check out my projects from last semester. I’m really excited to take this course because I’ve always wondered about ways of capturing things in my life, and I’m especially interested in the experimental aspect of it.



I scanned a splinter of charcoal I use for drawing. I found it captivating to see it under the microscope: it was like hovering over the surface of a planet, with rocks, plants, and various types of terrain. It reminded me of Blake’s “To see a world in a grain of sand / And a heaven in a wild flower,” and a Buddhist saying with the same meaning. It makes me wonder what it would be like to be a microorganism dwelling on that charcoal. Would I live in that cave? Would I go hiking on that mountain?

I colored one of the images, the one I liked best, according to my imagination. I think it looks like a mountain.