Irregular Webcomic!

Comic #3309 | 2013-12-08

1 {photo of a laptop computer with images of people's hands posed in front of it}
1 Caption: Computer Vision

This strip's permanent URL: http://www.irregularwebcomic.net/3309.html

{photo: ICCV Welcome. ICCV 2013, Sydney.}
During the past week (as this is published), I have been attending the 2013 International Conference on Computer Vision, known in the trade as ICCV. This is the major conference in the field; it is held biennially in a different city around the world, and this year it happened to be in Sydney. Since this meant my work could send me to the conference without any travel or accommodation costs, I got the chance to attend, even though my work is only tangentially related to computer vision.

What is computer vision? It is basically using computation and image processing to extract information from images, usually images collected by digital cameras. Most of us don't think about this much, but it is more common than you may realise.

A major application is surveillance. Surveillance cameras are used to keep track of what is happening at places like airports, train stations, banks, busy traffic intersections, and shopping malls, or even small individual installations like shops or home security systems. All of these cameras produce video streams that are, for the most part, incredibly boring. Nothing of interest may happen for days or even months at a time. And a place like an airport may have hundreds of cameras. You really don't want to have to sit and watch all of the resulting video, and neither do you want to pay someone to do it; it's so boring that when something interesting does happen, your watcher might well have fallen asleep.

Instead, you can have a computer do the watching. This is computer vision.

What sort of things does a computer need to look for in a surveillance video? One important job is to keep an eye out for left luggage. If a person carries a bag into an airport, then walks off and leaves it sitting somewhere, then that's something that airport security should know about. So you want your computer to have an algorithm that detects objects moving through a scene, and detects when they stop moving. But it has to do more than that. Humans move through an airport and then often sit for a long time in one position. And they carry luggage and leave it sitting still next to them. You only want your computer vision program to call security if the person leaves the seat and leaves the luggage behind. So you need to be able to distinguish between people and luggage and make some decisions about whether a piece of luggage is being attended or has been abandoned.
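To make this concrete, below is a minimal sketch of one ingredient of such a system: background subtraction to find foreground blobs, plus a crude check for blobs that have stopped moving. It assumes Python with the OpenCV library purely for illustration; the video file name, the thresholds, and the bag-versus-person size heuristic are all invented for this example, not taken from any real system.

    import cv2

    cap = cv2.VideoCapture("airport_camera.mp4")  # hypothetical video source
    subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

    STILL_FRAMES = 250  # ~10 s at 25 fps before a blob counts as stationary
    still_counts = {}   # quantised blob position -> consecutive frames seen there

    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame_idx += 1

        # Pixels that differ from the learned background model become foreground.
        mask = subtractor.apply(frame)
        # MOG2 marks shadows as grey (127); keep only confident foreground.
        _, mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)
        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
        mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)  # remove speckle

        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        for c in contours:
            x, y, w, h = cv2.boundingRect(c)
            if w * h < 400:  # ignore tiny noise blobs
                continue
            # Crude "track": key each blob by its quantised position. If the
            # same cell keeps producing a blob, that blob has stopped moving.
            key = (x // 20, y // 20)
            still_counts[key] = still_counts.get(key, 0) + 1
            # Invented heuristic: bags are roughly as wide as they are tall,
            # while standing people are much taller than they are wide.
            if still_counts[key] > STILL_FRAMES and h < 1.5 * w:
                print(f"frame {frame_idx}: possible abandoned object at "
                      f"({x}, {y}), size {w}x{h}")

    cap.release()

A real system would replace the quantised-position "track" with proper multi-object tracking and a trained person/luggage classifier, and would also have to deal with the background model slowly absorbing anything that stays still; one common remedy is to maintain two background models updated at different rates.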

{photo: Face detection. Another application: a face recognition demonstration.}
All of this is not easy. In fact, it's so hard that there are conferences on computer vision every year where researchers swap details of the latest algorithms and methods for doing this sort of stuff. And surveillance is only one aspect of computer vision; there are many other applications as well (I can't think of them all, because more are being thought of all the time!). Getting computers to analyse image data for all of these purposes is highly non-trivial.

Vision is one of those things that human beings can do much better than computers. By "vision", I mean interpreting image data to gain an understanding of what is going on in a scene. A human can look at a scene and pretty much instantly determine several things about it: whether it is indoors or outdoors, if any people are present, roughly how many people there are, the positions and sizes of various objects in the scene, the identities of those objects (a chair, a table, a dog, etc.), and so on.

Humans are incredibly good at recognising objects they can see. Infants can do it. What's more, humans are incredibly good at classifying objects they can see into meaningful semantic classes. An infant can see a dog and declare that it is a dog. They can see another dog, of a totally different breed to any dog they have ever seen before, and still recognise it as a dog. They can see a stuffed toy dog, and they still recognise that it is, in some sense, a "dog", while at the same time recognising that it is in another sense not the same thing as a living dog.

Identifying objects and classifying them in ways which allow us to understand what is happening around us is such a basic human skill that we are mostly unaware of just how amazing this ability is. People have been working on making computer algorithms capable of doing the same thing for decades, and the problem is not yet solved. The difference (well, one of the differences; a computer vision researcher will be able to rattle off dozens) is the amount of contextual information that humans and computers have available.

{photo: Real time tracking. Tracking moving objects in real time, demo.}
Naively, one might imagine that computers have an advantage in volume of knowledge. After all, Google can find almost anything. But that operates on a vast network of computers with enormous stores of data, and it's still pretty dumb when it comes to making the sorts of connections that a human makes between things. It is actually humans who have the overwhelming advantage in contextual knowledge. We can tell the difference between a cat and a dog at a glance, without having to think consciously about it at all. We can differentiate a table, a desk, a chair, a stool, a bedside table, a chest of drawers, and so on with ease. We can recognise a person as a specific individual, we can recognise females and males, we can estimate people's ages, and we can make very good judgements about people's emotional states just by looking at them. We can recognise a car of any model as a mode of transport, know roughly how many people it will carry, and how big it is. Just by looking at it. We can look at a busy scene in a city and instantaneously parse it into streets, cars, buses, taxis, footpaths, people, buildings, trees, dogs, rubbish bins, newspaper stands, hot dog carts, bus stops, road signs, benches, doors, windows, fire hydrants, traffic lights, street lights, advertising signs, bicycles, pigeons, motorcycles, and on and on and on. Just by looking at it.

For a computer to do even a small part of this, it has to process the data contained in the pixels of an image captured by a camera. One approach is to search through the image pixels for what are known as "features". These are simple things that can be detected in small regions of just a few pixels, such as edges and corners, where colour or brightness changes suddenly. You can then look for groupings of nearby features that might fit the shape of a known object, such as a chair, or a person. But to do this, you need a large database of what chairs or people look like when you examine them in terms of "feature space". And in this limited feature space it's possible, in fact very likely in many cases, that the features of a chair will match the database of people better than the database of chairs, or vice versa.
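As a concrete illustration of feature detection and matching, here is a minimal sketch using OpenCV's ORB detector, one of many possible feature detectors rather than a recommendation. The image file names and the match-distance threshold are invented for this example.

    import cv2

    # Hypothetical images: a captured scene, and a reference photo of a known
    # object that we want to find in it.
    scene = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)
    reference = cv2.imread("chair_reference.jpg", cv2.IMREAD_GRAYSCALE)

    # ORB detects corner-like keypoints and computes a binary descriptor
    # summarising the pixels around each one: a point in "feature space".
    orb = cv2.ORB_create(nfeatures=500)
    kp_scene, desc_scene = orb.detectAndCompute(scene, None)
    kp_ref, desc_ref = orb.detectAndCompute(reference, None)

    # Hamming distance suits ORB's binary descriptors; crossCheck keeps only
    # mutual best matches, a cheap way to discard ambiguous pairings.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(desc_scene, desc_ref),
                     key=lambda m: m.distance)

    # Many low-distance matches suggest the reference object appears in the
    # scene. The threshold of 50 is invented, and this is exactly where the
    # ambiguity described above bites: descriptors from a chair can land
    # closer to "person" descriptors than to those of another chair.
    good = [m for m in matches if m.distance < 50]
    print(f"{len(good)} plausible matches out of {len(matches)} keypoint pairs")

In practice the raw matches would usually be passed through a geometric consistency check as well (for example, fitting a perspective transform with RANSAC) before declaring the object found.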

{photo: Poster session. Computer vision is a big and active research field.}
I'm simplifying a lot here, because computer vision has been an active research field for many years and there are many techniques of varying complexity for matching up images to contextual information about what objects are in the image. The main problem is that a chair, for example, looks very different when viewed from different angles, in different illumination, when partly obscured by another object, and for different models of chairs. It is impractical to store all of these various possibilities in a database, so you need to take shortcuts.

You might imagine that you could make a three-dimensional model of a chair and just store that, then match it against images using various angles of view and potential obscuring objects. This sort of thing can help, but only for a single model of chair. Humans have a much more contextual model of what a chair is: a chair is defined by its function, supporting a seated human, rather than by a specific shape. So our brains have a model of a chair that is not constrained in the ways that a computer model of a chair is.

Given these sorts of recognition problems, it is amazing that computer vision works as well as it does, although there is also the sense that it works far less well than human vision. Attending a conference like ICCV, one realises that computer algorithms that contribute to general artificial intelligence (as opposed to performance in a restricted field, such as searching web pages) have a very long way to go before they can approach what humans can do. And this is why things like computer vision are such an active area of research.

This work is copyright © 2002-2024 David Morgan-Mar, and is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 4.0 International Licence. dmm@irregularwebcomic.net