Vision: A Computational Investigation into the Human Representation and Processing of Visual Information
Most of us seeing the kettle upside down on the kitchen floor would react by saying, “How did that get there?” or, “The cat’s been at it again!” We would not wonder what we were seeing. But not everyone is so fortunate. In 1973, the English neurologist Elizabeth Warrington 1 told an MIT audience about patients with damage to the right side of the brain who had no trouble identifying water buckets and similar objects in side views, yet were unable to identify them from above. Another group of patients with damage to the brain’s left side readily identified the water bucket from both views of it.
Among those in the MIT audience was the young English mathematician and neuroscientist David Marr (1945–1980), who recounts the story in his posthumously published book, Vision. Marr died of leukemia in November 1980. Because of his illness, he was forced to write his book a few years earlier than he had planned. Vision is a brilliant synthesis of the recent work on perception that has deep philosophical and psychological implications. It is the best account I know of a new approach to the study of brain function, and its closing dialogue should be read by anyone interested in brains, minds, and machines.
Warrington’s talk suggested to Marr that the brain stored information about the use and functions of objects separately from information about their shape, and that our visual system permits us to recognize objects even though we cannot name them or describe their function. At the time, it was generally believed that seeing required enormous amounts of previously acquired knowledge. But during the 1970s Marr and a small group of colleagues at the Artificial Intelligence Laboratory of MIT succeeded in largely undermining this view. What emerged was a theory of perception that integrated work in neurophysiology, psychology, and artificial intelligence and that gives us some of the most profound insights into the nature and functioning of the brain we have yet had.
Nineteenth-century anatomists spent much time arguing whether or not specific functions such as speaking, reading, and writing were situated in discrete areas of the brain. Franz Gall, whose name is most closely associated with the pseudoscience of phrenology, had argued that everything from love to religion had its own anatomical place in the cerebral cortex. When a given talent or character trait exceeded the normal, a well-localized bump appeared on the skull that phrenologists and the more open-minded anatomists could find with little difficulty. The scientific establishment of the day was not persuaded by these arguments. Until 1860 most scientists viewed the brain as a whole whose functional capacities could not be compartmentalized.
After 1860 the view that function was localized became dominant. Paul Broca had found a well-circumscribed area in the left side of the brain that appeared to control crucial aspects of speech. Direct stimulation of the cerebral cortex showed that sensory and motor function of every part of the body was under the control of a specific area of the cortex.
During the 1950s neurophysiologists discovered neurons (nerve cells) in the visual cortex that are activated by specific stimuli. Within the frog’s brain they found detectors that fired whenever a moving convex object appeared in a specific part of the frog’s visual field. If the object failed to move, or if it was of the wrong shape, the neuron would not fire: hungry frogs would not jump at dead flies hanging on strings, but they would if the string was jiggled. In their studies of cat and monkey visual cortexes, David Hubel and Torsten Wiesel found specific neurons that were sensitive to lines and bars with specific horizontal, vertical, and oblique orientations. The visual cortex apparently responded to particular features such as the lines and bars in the physical environment.
While they rarely articulated this claim, the physiologists took for granted that both the search for such features and the formation of fuller images or descriptions were directed by visual knowledge already stored in the brains of higher animals. Seeing, they argued, required first knowing what one was looking at. They concluded that vision in higher animals used feature detectors to find vertical, horizontal, and oblique lines among other forms and that there was pre-existing information stored in memory cells with which the responses to the feature detectors had to be compared.
On the basis of these findings, scientists in the field of artificial intelligence decided it should be easy to build seeing machines that could identify and manipulate objects by matching electronically registered shapes with images stored in the computer’s memory. This, however, proved considerably more difficult than they had anticipated, in part because much of what we see has nothing to do with the shapes and locations of physical objects—for example, shadows, variations in illumination, dust, or different textures. Which features are important for seeing an object and which can be ignored? In addition, the computer scientists found that a seeing robot would need an enormous memory stuffed with photos, drawings, and three-dimensional reproductions of grandmas, teddy bears, bugs, and whatever else the robot might encounter in its preassigned tasks. They tried to simplify the problem by restricting visual scenes to minute worlds of toy blocks and office desks; and they concentrated on writing programs that could effectively and rapidly search computer memories for images that matched those in the robot’s eye. Some of these programs worked very well.
While the artificial intelligence researchers congratulated themselves for their successes, David Marr, who had joined the Artificial Intelligence Laboratory at MIT in 1973, thought that the very limitations of this approach meant that some fundamental questions were being overlooked, both by the physiologists and by the computer scientists. By confining their worlds to toy blocks and office desks the artificial intelligence scientists had failed to confront such basic questions as what constitutes an object (Is it the horse, the rider, or the horse and rider?) and how it could be separated from the rest of the visual image. He noted that the parts of a visual image that we name, those that have a meaning for us, do not necessarily have visually distinctive characteristics that can be uniquely specified in a computer program. The same circle could represent the sun or a wheel or a table top, depending on the scene.
Neither the neurophysiological studies nor the work in artificial intelligence had added anything new to the old and by now well-accepted view that each function is localized in the brain. In his reformulation of the fundamental questions that studies of brain function must answer, Marr broke the problem down in two ways that no longer corresponded to the view that the capacities of the brain can be localized, though the new view was, in a sense, its direct offspring.2
Marr began by asking, what is the visual system doing? In the frog, for example, it identifies flies that make good meals. And tasty flies are, for the frog’s brain, always moving. The visual system of a fly, on the other hand, needs to locate surfaces on which the fly can land. If a surface suddenly increases in size, or “explodes,” the fly’s brain will assume that an appropriate surface is nearby and it will cut its wing power and extend its legs in preparation for landing. Higher animals spend much of their time moving around and gathering food, and therefore one of the major tasks of their visual systems is to identify and describe three-dimensional shapes so that they can be avoided without much fuss or picked up and examined with relative ease.
One of the goals of the frog’s visual system, then, is to locate moving specks in the two-dimensional retinal image. The fly’s visual system will want to know when there is a surface large enough to land on, while higher animals will use the two-dimensional retinal image to derive descriptions of three-dimensional objects.
Failure to identify the goal of a visual system correctly can lead to a misinterpretation of the physiological data. The so-called feature detectors that the physiologists discovered in higher animals were misleading. Everybody assumed that their discovery meant that one of the goals of the visual system was to detect the specific features of objects. In fact, we now know that what the physiologists thought were feature detectors are probably “detectors” of changes in light intensity.
Only after we have understood the goals of a visual system, what Marr called “level one” of understanding, can we study the procedures (or programs) the visual system uses to achieve them, Marr’s second level.3 For example, given the fact that we see a visual border between two regions that are distinguished by different densities of dots, what procedure does the brain follow in order to establish this border? That is, how is the brain programmed to identify the border? Does it use a procedure that involves measuring the distances between dots and noting where these distances change, or one that involves counting the number of dots in an area of a fixed size and noting where the number of dots changes?
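The two candidate procedures for finding a border between dot densities can be made concrete in a few lines of code. What follows is only an illustrative sketch, not anything taken from Marr's book: it reduces the problem to dots on a line, and the function names, the dot layout, and the window size are all assumptions chosen for the example. It shows that one level-one goal (find the border) can be met by two different level-two procedures.

```python
def make_dots():
    """Hypothetical test pattern: dots 1 apart on [0, 50), then 4 apart on [50, 100)."""
    dense = [float(x) for x in range(0, 50)]
    sparse = [float(x) for x in range(50, 100, 4)]
    return dense + sparse


def border_by_spacing(dots):
    """Procedure 1: measure distances between neighboring dots and
    report the dot at which that distance changes the most."""
    dots = sorted(dots)
    # gaps[i] is the distance between dots[i] and dots[i + 1]
    gaps = [b - a for a, b in zip(dots, dots[1:])]
    # The change from gaps[i - 1] to gaps[i] happens at dots[i]
    best = max(range(1, len(gaps)), key=lambda i: abs(gaps[i] - gaps[i - 1]))
    return dots[best]


def border_by_counting(dots, window=10.0, extent=100.0):
    """Procedure 2: count dots in fixed-size windows and report the
    boundary between the two adjacent windows whose counts differ most."""
    n_windows = int(extent / window)
    counts = [0] * n_windows
    for x in dots:
        k = int(x // window)
        if 0 <= k < n_windows:
            counts[k] += 1
    best = max(range(1, n_windows), key=lambda k: abs(counts[k] - counts[k - 1]))
    return best * window


if __name__ == "__main__":
    dots = make_dots()
    print(border_by_spacing(dots))   # 50.0
    print(border_by_counting(dots))  # 50.0
```

On this tidy pattern both procedures place the border at the same position. They would behave differently on irregular dots, and the counting version can only locate the border to within one window. That is the kind of difference that lets one ask which procedure a visual system actually uses.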
Marr also distinguished a third level of understanding a visual system, that of the hardware—neurons or electric circuits—in which the procedures of level two can be carried out. Computer scientists often failed to recognize that the programs they put into computers could not be carried out in the neuronal structure of the brain. (The opposite is obviously true as well.)
The idea of levels of understanding in our knowledge of the mind was not new. A number of philosophers, among them Hilary Putnam, Daniel C. Dennett, and Jerry Fodor, and computer scientists such as Douglas Hofstadter, had been insisting on these very distinctions. Marr, however, applied them with a rigor that gave new insights into the problem of vision.
The second fundamental idea in Marr’s approach was that the visual process can be broken down into individual capacities, or “modules.” Does our seeing a tree as a three-dimensional object depend on our first recognizing it as “a tree”? In fact, we can see things as three-dimensional without knowing what they are, and Marr argued that vision generally consists of many more or less independent subtasks (recognizing trees, seeing in three dimensions) that can be studied independently. This is what he calls the principle of modular design: tree recognition and three-dimensional viewing would each be an independent module. It is not surprising that our brains accomplish tasks, such as seeing and hearing, by solving a good many independent problems that make up the general task. Otherwise, new capacities that appear in the course of evolution would have had to develop in perfect form all at once. In Marr’s words, modular design is
important because if a process is not designed in this way, a small change in one place has consequences in many other places. As a result, the process as a whole is extremely difficult to debug or to improve, whether by a human designer or in the course of natural evolution, because a small change to improve one part has to be accompanied by many simultaneous, compensatory changes elsewhere. The principle of modular design does not forbid weak interactions between different modules in a task….
1. She was describing work she had done with A.M. Taylor.
2. Tomaso Poggio, now at MIT, arrived at a similar formulation about the same time and was one of Marr’s closest collaborators.
3. Often there are unexpected side effects of neurophysiological (or computer) activity that must be taken into account in formulating level-one processes. An entertaining example of this can be found in Crick and Mitchison’s recent theory that “we dream in order to forget,” not in order to revive memories. (See Nature, vol. 304, pp. 111–114, 1983.)