Vision: A Computational Investigation into the Human Representation and Processing of Visual Information
Most of us seeing the kettle upside down on the kitchen floor would react by saying, “How did that get there?” or, “The cat’s been at it again!” We would not wonder what we were seeing. But not everyone is so fortunate. In 1973, the English neurologist Elizabeth Warrington 1 told an MIT audience about patients with damage to the right side of the brain who had no trouble identifying water buckets and similar objects in side views, yet were unable to identify them from above. Another group of patients, with damage to the brain’s left side, readily recognized the shape of the water bucket in both views, although they could not name it or describe its function.
Among those in the MIT audience was the young English mathematician and neuroscientist David Marr (1945–1980), who recounts the story in his posthumously published book, Vision. Marr died of leukemia in November 1980. Because of his illness, he was forced to write his book a few years earlier than he had planned. Vision is a brilliant synthesis of the recent work on perception that has deep philosophical and psychological implications. It is the best account I know of a new approach to the study of brain function, and its closing dialogue should be read by anyone interested in brains, minds, and machines.
Warrington’s talk suggested to Marr that the brain stored information about the use and functions of objects separately from information about their shape, and that our visual system permits us to recognize objects even though we cannot name them or describe their function. At the time, it was generally believed that seeing required enormous amounts of previously acquired knowledge. But during the 1970s Marr and a small group of colleagues at the Artificial Intelligence Laboratory of MIT succeeded in largely undermining this view. What emerged was a theory of perception that integrated work in neurophysiology, psychology, and artificial intelligence and that gives us some of the most profound insights into the nature and functioning of the brain we have yet had.
Nineteenth-century anatomists spent much time arguing whether or not specific functions such as speaking, reading, and writing were situated in discrete areas of the brain. Franz Gall, whose name is most closely associated with the pseudoscience of phrenology, had argued that everything from love to religion had its own anatomical place in the cerebral cortex. When a given talent or character trait excelled the normal, a well-localized bump appeared on the skull that phrenologists and the more open-minded anatomists could find with little difficulty. The scientific establishment of the day was not persuaded by these arguments. Until 1860 most scientists viewed the brain as a whole whose functional capacities could not be compartmentalized.
After 1860 the view that function was localized became dominant. Paul Broca had found a well-circumscribed area in the left side of the brain that appeared to control crucial aspects of speech. Direct stimulation of the cerebral cortex showed that sensory and motor function of every part of the body was under the control of a specific area of the cortex.
During the 1950s neurophysiologists discovered neurons (nerve cells) in the visual cortex that are activated by specific stimuli. Within the frog’s brain they found detectors that fired whenever a moving convex object appeared in a specific part of the frog’s visual field. If the object failed to move, or if it was of the wrong shape, the neuron would not fire: hungry frogs would not jump at dead flies hanging on strings, but they would if the string was jiggled. In their studies of cat and monkey visual cortexes, David Hubel and Torsten Wiesel found specific neurons that were sensitive to lines and bars with specific horizontal, vertical, and oblique orientations. The visual cortex apparently responded to particular features such as the lines and bars in the physical environment.
While they rarely articulated this claim, the physiologists took for granted that both the search for such features and the formation of fuller images or descriptions were directed by visual knowledge already stored in the brains of higher animals. Seeing, they argued, required first knowing what one was looking at. They concluded that vision in higher animals used feature detectors to find vertical, horizontal, and oblique lines among other forms and that there was pre-existing information stored in memory cells with which the responses to the feature detectors had to be compared.
On the basis of these findings, scientists in the field of artificial intelligence decided it should be easy to build seeing machines that could identify and manipulate objects by matching electronically registered shapes with images stored in the computer’s memory. This, however, proved considerably more difficult than they had anticipated, in part because much of what we see has nothing to do with the shapes and locations of physical objects—for example, shadows, variations in illumination, dust, or different textures. Which features are important for seeing an object and which can be ignored? In addition, the computer scientists found that a seeing robot would need an enormous memory stuffed with photos, drawings, and three-dimensional reproductions of grandmas, teddy bears, bugs, and whatever else the robot might encounter in its preassigned tasks. They tried to simplify the problem by restricting visual scenes to minute worlds of toy blocks and office desks; and they concentrated on writing programs that could effectively and rapidly search computer memories for images that matched those in the robot’s eye. Some of these programs worked very well.
While the artificial intelligence researchers congratulated themselves for their successes, David Marr, who had joined the Artificial Intelligence Laboratory at MIT in 1973, thought that the very limitations of this approach meant that some fundamental questions were being overlooked, both by the physiologists and by the computer scientists. By confining their worlds to toy blocks and office desks the artificial intelligence scientists had failed to confront such basic questions as what constitutes an object (Is it the horse, the rider, or the horse and rider?) and how it could be separated from the rest of the visual image. He noted that the parts of a visual image that we name, those that have a meaning for us, do not necessarily have visually distinctive characteristics that can be uniquely specified in a computer program. The same circle could represent the sun or a wheel or a table top, depending on the scene.
Neither the neurophysiological studies nor the work in artificial intelligence had added anything new to the old and by now well-accepted view that each function is localized in the brain. In his reformulation of the fundamental questions that studies of brain function must answer, Marr broke the problem down in two ways that no longer corresponded to the view that the capacities of the brain can be localized, though the new view was, in a sense, its direct offspring.2
Marr began by asking, what is the visual system doing? In the frog, for example, it identifies flies that make good meals. And tasty flies are, for the frog’s brain, always moving. The visual system of a fly, on the other hand, needs to locate surfaces on which the fly can land. If a surface suddenly increases in size, or “explodes,” the fly’s brain will assume that an appropriate surface is nearby and it will cut its wing power and extend its legs in preparation for landing. Higher animals spend much of their time moving around and gathering food, and therefore one of the major tasks of their visual systems is to identify and describe three-dimensional shapes so that they can be avoided without much fuss or picked up and examined with relative ease.
One of the goals of the frog’s visual system, then, is to locate moving specks in the two-dimensional retinal image. The fly’s visual system will want to know when there is a surface large enough to land on, while higher animals will use the two-dimensional retinal image to derive descriptions of three-dimensional objects.
Failure to identify the goal of a visual system correctly can lead to a misinterpretation of the physiological data. The so-called feature detectors that the physiologists discovered in higher animals were misleading. Everybody assumed that their discovery meant that one of the goals of the visual system was to detect the specific features of objects. In fact, we now know that what the physiologists thought were feature detectors are probably “detectors” of changes in light intensity.
Only after we have understood the goals of a visual system (what Marr called “level one” of understanding) can we study the procedures (or programs) the visual system uses to achieve them, Marr’s second level.3 For example, given the fact that we see a visual border between two regions that are distinguished by different densities of dots, what procedure does the brain follow in order to establish this border? That is, how is the brain programmed to identify the border? Does it use a procedure that involves measuring the distances between dots and noting where these distances change, or one that involves counting the number of dots in an area of a fixed size and noting where the number of dots changes?
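The two candidate procedures can be made concrete in a few lines of Python. This is only an illustrative sketch on a one-dimensional row of dots: the function names, the integer dot positions, and the window width of ten units are assumptions of the example, not anything proposed in Vision.

```python
def border_by_gaps(dots):
    """Procedure 1: measure the distances between neighbouring dots and
    report the dot at which that spacing changes the most."""
    dots = sorted(dots)
    gaps = [b - a for a, b in zip(dots, dots[1:])]
    jumps = [abs(g2 - g1) for g1, g2 in zip(gaps, gaps[1:])]
    i = max(range(len(jumps)), key=jumps.__getitem__)
    return dots[i + 1]  # the dot lying between the two differing gaps


def border_by_counts(dots, width, start, stop):
    """Procedure 2: count the dots in consecutive areas of fixed width and
    report the boundary where the count changes the most."""
    counts = [
        sum(1 for d in dots if lo <= d < lo + width)
        for lo in range(start, stop, width)
    ]
    diffs = [abs(b - a) for a, b in zip(counts, counts[1:])]
    i = max(range(len(diffs)), key=diffs.__getitem__)
    return start + (i + 1) * width


# A dense region (one dot per unit) followed by a sparse one (one per five).
dots = list(range(0, 50)) + list(range(50, 100, 5))

print(border_by_gaps(dots))
print(border_by_counts(dots, width=10, start=0, stop=100))
```

On this tidy input both procedures place the border at position 50. They reach the same goal by different computations, which is the distinction between the first level and the second.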
Marr also distinguished a third level of understanding a visual system, that of the hardware—neurons or electric circuits—in which the procedures of level two can be carried out. Computer scientists often failed to recognize that the programs they put into computers could not be carried out in the neuronal structure of the brain. (The opposite is obviously true as well.)
The idea of levels of understanding in our knowledge of the mind was not new. A number of philosophers, among them Hilary Putnam, Daniel C. Dennett, and Jerry Fodor, and computer scientists such as Douglas Hofstadter, had been insisting on these very distinctions. Marr, however, applied them with a rigor that gave new insights into the problem of vision.
The second fundamental idea in Marr’s approach was that the visual process can be broken down into individual capacities, or “modules.” Does our seeing a tree as a three-dimensional object depend on our first recognizing it as “a tree”? In fact, we can see things as three-dimensional without knowing what they are, and Marr argued that vision generally consists of many more or less independent subtasks (recognizing trees, seeing in three dimensions) that can be studied independently. This is what he calls the principle of modular design: tree recognition and three-dimensional viewing would each be an independent module. It is not surprising that our brains accomplish tasks, such as seeing and hearing, by solving a good many independent problems that make up the general task. Otherwise, new capacities that appear in the course of evolution would have had to develop in perfect form all at once. In Marr’s words, modular design is
important because if a process is not designed in this way, a small change in one place has consequences in many other places. As a result, the process as a whole is extremely difficult to debug or to improve, whether by a human designer or in the course of natural evolution, because a small change to improve one part has to be accompanied by many simultaneous, compensatory changes elsewhere. The principle of modular design does not forbid weak interactions between different modules in a task….
Whether or not the entire brain is made up of the separate functional units Marr called modules is ultimately an empirical question. Work on vision has given us considerable evidence that the visual system, at least, is modular. Modularity has now become an influential concept, and Marr’s work provides the most detailed theoretical and empirical analysis of brain modularity to date.
One of the visual system’s modules was spectacularly demonstrated by Bela Julesz in 1960. Using a computer he created two identical copies of a random collection of black dots on a white background, such that no meaningful image was discernible. On one copy he displaced a square area of dots, filling in the empty space created by the displacement with more random dots. Still neither copy revealed any pattern to the unaided eye. (See illustration 1 on this page.)
However, when viewed in a stereoscope—the left eye sees one display and the right eye the other—the two displays are fused into one and a square jumps out at the viewer and appears to be floating above a surface of random dots. From the two-dimensional surface, the brain involuntarily derives a three-dimensional image.
The displays were prepared in this manner in order to test how much information is necessary for seeing three-dimensional images. The only difference between the two patterns is the offset square area, but this can only be revealed by comparing the two displays. The fact that when the patterns are stereoscopically fused the area appears to float tells us that the sensation of three dimensions is created with only one piece of information—that the square has been displaced in one of the patterns. Apart from the square area, all the points in the two images coincide and the brain therefore mistakenly calculates that the dots in the square area are at a different depth from the rest of the display.
This important process, stereopsis, is a module. Presented with different measures of displacement of some sets of points as opposed to others, the brain is programmed so that it compulsively derives different depths within an image. Though we know the floating square does not exist, we will always see it. We can say that the brain “computes” the floating square because we cannot imagine any other way of deriving it from the fused random dot pattern. Without any distinctive visual cues (the individual displays are collections of random dots), the brain is able to establish an accurate correspondence between the dots at the same locations in the two images. It has not confused the dots that make up the displaced square with those in the background. To do this it has used certain assumptions, rules, and processes that we call computations.
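A rough sketch may help show what “computing” the square involves. The Python below builds a pair of random-dot displays of the kind described and then recovers the displacement by brute-force window matching. The image size, the shift of two columns, the seven-by-seven window, and both function names are assumptions of the example. The matcher is a deliberately naive stand-in and not the procedure Marr and his colleagues proposed for stereopsis.

```python
import random


def make_stereogram(size=40, square=16, shift=2, seed=0):
    """Build a left/right pair of random-dot images (lists of 0/1 rows).

    The right image copies the left one, except that a central square is
    moved `shift` columns to the left and the strip it uncovers is filled
    with fresh random dots.
    """
    rng = random.Random(seed)
    left = [[rng.randint(0, 1) for _ in range(size)] for _ in range(size)]
    right = [row[:] for row in left]
    first = (size - square) // 2  # first row and column of the square
    for r in range(first, first + square):
        for c in range(first, first + square):
            right[r][c - shift] = left[r][c]
        for c in range(first + square - shift, first + square):
            right[r][c] = rng.randint(0, 1)
    return left, right


def disparity_at(left, right, r, c, half=3, max_d=4):
    """Return the shift d in 0..max_d for which the window around (r, c) in
    the left image best matches the window around (r, c - d) in the right."""
    def mismatches(d):
        return sum(
            left[r + dr][c + dc] != right[r + dr][c + dc - d]
            for dr in range(-half, half + 1)
            for dc in range(-half, half + 1)
        )
    return min(range(max_d + 1), key=mismatches)


left, right = make_stereogram()
print(disparity_at(left, right, 20, 20))  # a point inside the displaced square
print(disparity_at(left, right, 5, 10))   # a point in the background
```

The point inside the square should come back with the shift of two columns that was built into the display, and the background point with a shift of zero. That difference in displacement is the single piece of information from which the floating square is derived.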
All computations, whether they be in adding machines or computers, require some kind of symbols (often called representations) with which to carry out the computations. Adding machines, of course, use numbers.4 But if we have good reason to believe that the brain spends most of its time computing, what kinds of symbols does it use in its computations? Before Marr’s work nobody could really say. This was not surprising. We are not conscious of most of what is going on in our heads. We have, as Daniel C. Dennett has written, “conscious access to the results of mental processes, but not to the processes themselves.” Yet Marr gives us concrete examples of the symbols that are essential for the computations in the visual system. In addition, he and his colleagues have shown that we can, at times, become aware of these symbolic representations. They have succeeded in putting the discussion of the visual system on an empirical footing.
Our retinas consist of some 160 million light receptors that are sensitive to varying levels of illumination, ranging from black to white, usually called “gray levels.” The image cast upon the retina is therefore broken up into a two-dimensional arrangement consisting of the different levels of light intensity reaching the receptors, an arrangement similar to the dot patterns on a television screen. From this pattern the brain creates the three-dimensional scenes which we actually see, and in which we are able to distinguish objects, describe their shapes, locations, colors, textures, and so on. Since we all agree on the general makeup of a given scene, we can say that our brains compute roughly the same unique symbolic representations from the gray-level images.
Marr argued that in the first stages of visual processing the brain computes a two-dimensional sketch, which he called the primal sketch, from the gray-level retinal image. We are not conscious of this computation, but the symbol that is derived from the retinal image is quite familiar to us. It looks very much like a rough drawing. (This explains why we can make sense of artists’ sketches; they are similar to the symbols computed in our brains.)
But how does the brain compute this image? Many of the dots that make up the gray-level retinal image are identically “gray,” i.e., they show the same level of illumination. The brain derives the primal sketch by noting where the level of grayness changes from one set of dots to another set. The lines that make up the primal sketch represent the extent, magnitude (the thickness of the line), and direction of the changes. (See illustration 2 on this page.)
Why did Marr conclude that the brain must compute a primal sketch? We are, he reasoned, very good at seeing the physical characteristics of the world around us (as opposed to frogs, which see little more than convex, moving objects): our visual systems have evolved in a way that makes this possible. At some point in evolution, the visual system adapted to the fact that changes in the illumination occur in a scene just at the point where the edges and changes in surface contours of objects are located.
A black wall, for example, will create a set of identical responses in the receptors in the eye, but if there is a white square in the middle of the wall, along the borders of the square there will be a change in the amount of light being reflected. This fact will be recorded in the primal sketch as a line indicating the extent and orientation of the change—in other words, as a sketch of a square. For the purposes of the primal sketch the brain ignores the uniform areas of whiteness within the square or blackness outside it. Therefore, what is visually significant occurs where the illumination changes. And the same is true of three-dimensional objects. It is along the edges of objects, or where there are variations in the smoothness of a surface, that the intensity of reflected light changes. A flat, uniformly lighted surface will give a uniform gray-level retinal image.
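The wall example is easy to try out. The Python sketch below treats a small grid of numbers as the gray-level image and marks every point whose gray level differs from that of its right-hand or lower neighbor. This neighbor comparison, the grid size, and the function name are assumptions made for illustration. It is a crude stand-in for the smoothed, multi-scale operators Marr actually worked with.

```python
def intensity_changes(image):
    """Mark each pixel whose gray level differs from its right or lower
    neighbour; pixels on the last row or column have no such neighbour
    and are compared with themselves."""
    height, width = len(image), len(image[0])
    marks = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            here = image[r][c]
            right = image[r][c + 1] if c + 1 < width else here
            below = image[r + 1][c] if r + 1 < height else here
            if here != right or here != below:
                marks[r][c] = 1
    return marks


# A black wall (gray level 0) with a white square (gray level 9) in the middle.
wall = [
    [9 if 2 <= r <= 5 and 2 <= c <= 5 else 0 for c in range(8)]
    for r in range(8)
]

sketch = intensity_changes(wall)
for row in sketch:
    print("".join("#" if mark else "." for mark in row))
```

The printout is an outline of the square. The uniform white interior and the uniform black surround leave no marks at all, which is the sense in which only the changes in illumination are visually significant.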
The brain therefore computes the changes in the gray-level image because these mark the physically or visually significant areas of a scene. In making this computation the brain uses no previously acquired knowledge. Its neuronal machinery has evolved in such a way that these computations are made automatically. The brain, in computing a primal sketch, is trying to analyze the physical or visually significant characteristics of the environment. Marr drew the conclusion that implicit in its analysis (or computations) is the assumption that edges, changes in contours, etc., are where light intensities change in the retinal image. This assumption, of course, is not written somewhere in the brain; it is presupposed by its design. Once we recognize it, we can understand what the brain is doing.
Such implicit assumptions are essential to our understanding of brain function just as they are essential to our understanding what any mechanical or electrical device is doing. If our assumptions are sufficiently general they will explain why a particular task or set of tasks must be carried out by the brain or mechanical device and why no other task will satisfy those assumptions. The cash register in the supermarket is an example of a device that performs a specific task: addition. But why is it constructed to perform addition rather than square roots? Because the implicit assumptions in our notions of fairness in the exchange of money and goods can be satisfied only through the use of addition. Those assumptions are: 1) buying nothing should cost nothing; 2) the order in which the items are presented for purchase should not affect the total amount paid; 3) dividing the items into piles and paying for each pile separately should not affect the total amount paid; 4) if an item is bought and then returned the total cost should be zero.
These assumptions, which make up our notion of fairness in the supermarket, happen to be the mathematical conditions that define addition. No other kind of computation will satisfy all of the assumptions all of the time. Of course, the assumptions are nowhere to be found in the cash register. They are implied by the fact that cash registers were designed to add prices.
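The four fairness assumptions can be written down as checks and applied to any pricing rule. The Python sketch below does this for ordinary addition and for one invented alternative, a register that charges only for the most expensive item. The function names, the basket of prices in cents, and the alternative rule are all assumptions of the example, and a spot check on one basket is an illustration rather than a proof.

```python
def satisfies_fairness(rule, basket):
    """Check the four supermarket assumptions for a pricing rule on one basket.

    `rule` takes a list of prices (a returned item is a negative price)
    and gives back the amount to pay.
    """
    half = len(basket) // 2
    return {
        "buying nothing costs nothing": rule([]) == 0,
        "order does not matter": rule(basket) == rule(basket[::-1]),
        "piles do not matter":
            rule(basket[:half]) + rule(basket[half:]) == rule(basket),
        "a returned item costs nothing": rule([basket[0], -basket[0]]) == 0,
    }


def add_prices(prices):
    """What a cash register does."""
    return sum(prices)


def dearest_item_only(prices):
    """Hypothetical alternative: charge only for the most expensive item."""
    return max(prices, default=0)


basket = [250, 120, 399]  # prices in cents

for rule in (add_prices, dearest_item_only):
    print(rule.__name__, satisfies_fairness(rule, basket))
```

Addition passes all four checks. The alternative rule passes the first two but fails on piles and on returns, so it cannot stand in for the register.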
The assumptions about the physical environment that are implied by the computations the brain performs on gray-level images are implicit in the same way. Since the brain always performs the same kinds of computations on the gray-level images there must be certain general assumptions that will explain why the brain performs those computations and not other ones. As we have seen, the brain’s calculations of intensity changes in the gray-level image are based on the assumption that changes in light intensities can represent physically or visually significant parts of the environment.
Marr called the implicit assumptions “constraints.” Without the notion of constraints we would, Marr argued, not be able to talk about, or understand, brain function. The constraint I have mentioned—that changes in light intensity can represent a physical edge—requires a further refinement if the brain is not to make many mistakes about the environment. Marr was able to make this refinement by reexamining a neurophysiological mechanism that had been the subject of considerable discussion since its discovery in the late 1960s. Physiologists then found that some cells in the brain are sensitive to lines that are widely separated, while others respond to finer details. They concluded there were several networks, or channels, of neurons, each sensitive to different spatial frequencies. The channels sensitive to coarse frequencies will only “see” intensity changes that are widely separated, whereas the finer channels can distinguish those that are closer together. Therefore, Marr argued, intensity changes found in the large channels that coincide with those in the smaller channels represent, for the visual system, physical changes in the image, an edge or change of contour. Whenever what is found in the large channels cannot be accounted for by the information in the smaller channels, the brain makes the implicit assumption that the information in the two channels has different physical causes.
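The coincidence rule can be sketched on a single row of gray levels. In the Python below, a “channel” is simply a comparison of the average gray level on either side of a point, taken over a narrow or a wide stretch. The box averages, the widths of one and five, the threshold, and the test signal are assumptions chosen for illustration. They are not the filters found by the physiologists or used by Marr.

```python
def channel_hits(signal, width, threshold):
    """Positions where the mean of the `width` samples to the right differs
    from the mean of the `width` samples to the left by more than `threshold`."""
    hits = set()
    for i in range(width, len(signal) - width + 1):
        right = sum(signal[i:i + width]) / width
        left = sum(signal[i - width:i]) / width
        if abs(right - left) > threshold:
            hits.add(i)
    return hits


# A dark region followed by a bright one (a real edge at position 20),
# plus an isolated bright speck at position 8.
signal = [0] * 20 + [10] * 20
signal[8] = 10

fine = channel_hits(signal, width=1, threshold=5)
coarse = channel_hits(signal, width=5, threshold=5)

print(sorted(fine))           # changes seen by the fine channel
print(sorted(coarse))         # changes seen by the coarse channel
print(sorted(fine & coarse))  # changes on which the two channels agree
```

The fine channel responds to both the speck and the edge, while the coarse channel responds only around the edge. Only position 20 is reported by both, so under the constraint only that change is taken to mark a physical edge.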
In the cubist image reproduced on this page, for example, we can see Charlie Chaplin only by screwing up our eyes.
This is because no information about spatial frequencies is being provided for the middle channels sensitive to details that are neither coarse nor fine. The brain therefore assumes that the information in the large channels is not related to that in the finer channels. If it were, there would be some overlap with the information in the middle channels, but no information is coming from the middle frequency range. When we screw up our eyes the finer channels are eliminated and we see only Charlie Chaplin in the larger channels. The brain is no longer confused by information for which it cannot account. This suggests how seriously the brain takes physical constraints in the visual system. They are apparently implicit in the neuronal machinery of the brain, just as the rules of addition are implicit in our using addition in the cash register.
From the flat primal sketch the brain derives, according to Marr’s theory, the next major symbol or representation, which he called the 2 1/2-D sketch. This makes it explicit that the object has three dimensions but only from the viewer’s perspective, without providing information about the object’s appearance from other perspectives. The brain uses a number of independent calculations (modules) to help it form this new symbol, automatically analyzing the separate effects of shading and motion, to mention only two of the factors it takes into account.
To show that the brain can derive the structure of an object from seeing it in motion, Marr’s colleague, Shimon Ullman, painted random dots on two transparent cylinders of different diameters and then placed one within the other. (See illustration 4 on this page.)
Light was projected through the cylinders onto a screen, so that random dots were visible on the screen but not the outlines of the cylinders. When the cylinders are stationary, we see only the random dots. But when the cylinders are rotated in opposite directions, the two rotating cylinders are clearly visible on the screen. Ullman was able to show that if the brain assumes that an object is rigid, then it can derive its structure when it is moving. Without the implied assumption or constraint of rigidity, as when viewing the surface of a stream, no clear structure can be discerned.
Numerous experiments have shown that rigidity is “assumed” in the visual system and that it plays an important part in our perception of objects. For example, if a square is projected onto a screen and then its sides are expanded and contracted in a regular fashion, one would expect that a viewer would first see a small square, then a large square, then a small square, etc. In fact, the viewer will see a square that does not change in size, but that recedes from the viewer, approaches the viewer, and so on. The visual system misinterprets the cues as if a rigid object were being observed.
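The arithmetic behind this illusion is short. If we assume simple pinhole projection, which is a modeling assumption of this sketch and is not spelled out in the text, the projected side of a square is the focal length times its true side divided by its distance. Holding the true side fixed, as the rigidity constraint demands, leaves distance as the only quantity that can explain a change in the image. The function name and the unit values are hypothetical.

```python
def inferred_distance(image_side, true_side=1.0, focal_length=1.0):
    """Distance at which a rigid square of side `true_side` would project
    to `image_side` under pinhole projection:

        image_side = focal_length * true_side / distance
    """
    return focal_length * true_side / image_side


# The projected square shrinks and then grows again.
for image_side in (0.5, 0.25, 0.5):
    print(image_side, inferred_distance(image_side))
```

Halving the projected side doubles the inferred distance, so a square that shrinks and grows on the screen is read as one of constant size that recedes and approaches.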
With the formation of the 2 1/2-D sketch we are coming close to the limits of “pure” perception. Acquired knowledge has played little or no role in creating the symbols used by the visual system in the early stages of visual processing. But the 2 1/2-D sketch only tells the brain about an object or a person from the viewer’s perspective; it does not give us a full sense of an object in space.
How can the brain compute a generalized view of an object—what Marr called the 3-D model—from the 2 1/2-D sketch so that we can be aware of its full structure and its situation in space? According to work Marr did with H.K. Nishihara, the brain will try to determine if there is any line which when drawn through the 2 1/2-D sketch establishes what he calls its basic pattern of symmetry. There is a basic symmetry between the right and left sides of human beings and we can imagine a line running through the middle of the head and torso around which that principal line of symmetry is established. Not all objects have such symmetries, but most do, and Marr and Nishihara’s theory only applies to these.
It might be difficult for the brain to derive the principal line of symmetry from the 2 1/2-D sketch if it is very much foreshortened at the angle at which we are observing the object—for example when we view a water bucket from the top rather than the side. Human beings and animals have one principal line of symmetry running through the head and body and many other branching lines of symmetry running through the arms, the legs, fingers, toes, etc. (See illustration 5 on this page.)
Stick figures of animals constructed out of pipe cleaners make explicit these basic lines of symmetry and Marr and Nishihara suggest that they make sense to us because they resemble the lines that the brain in fact computes. But how does the brain go on to provide us with a full image that is identifiable from any point of view? According to Marr and Nishihara, the brain automatically transposes the contours it has derived from the 2 1/2-D sketch onto axes of symmetry, giving us the three-dimensional image we see—Marr’s “3-D model.”
The importance of symmetry in visual computations is shown by two psychological consequences that follow from it. Psychologists have demonstrated that we see objects as collections of individual parts, and the 3-D model not only explains why (different lines of symmetry make up the symbol), but tells us how we tend to decompose the objects we are looking at (we do so by tracing the principal lines of symmetry to their branches). This also tells us something about our ability to generalize (Joan and Jane are women) and yet describe distinguishing characteristics. The principal lines of symmetry give us our general descriptions, while the finer distinguishing details are drawn from an analysis of the branching lines of symmetry.
The importance of the principal lines of symmetry in our understanding of visual material is perhaps best illustrated by recalling the story of Warrington’s patients with which we began. The patients who had difficulty recognizing water buckets were viewing them from a perspective that foreshortened their principal line of symmetry. Apparently this line is so important for recognition that, when it is foreshortened, some brain-damaged patients have difficulty identifying the object. (See illustration 6 on this page.)
We can name and recognize the 3-D model as a “tree” or a “bucket” because it can be matched with acquired knowledge that is stored and cataloged in our brains. The problem of searching for cataloged information that had so preoccupied the artificial intelligence community when it first tried to build seeing machines occurs only at this final stage of visual processing. Nobody had imagined that so much information about shapes could be extracted from the retinal images before a search of cataloged information would be necessary. There is, consequently, a greater precision and simplicity to the search procedures we use in identifying objects than had been previously assumed. What had been one of the central issues in vision research, what many thought might have “explained” vision, we now know is important only after the visual system has analyzed shapes in the physical environment.
Just how much we can “see” without drawing on acquired knowledge was suggested by Warrington’s second group of patients who had no trouble discerning the shapes of objects from unusual viewpoints, but were unable to name them or describe their function. But this is a rather common experience. How often have we walked into a hardware store and seen objects, but had no idea about their use or function?5
We are so good at seeing that we take it for granted. Yet all the symbolic representations that Marr describes are actually very familiar to us. As Marr amusingly notes,
It is interesting to think about which representations the different artists concentrate on and sometimes disrupt. The pointillists, for example, are tampering primarily with the [gray-level] image; the rest of the scheme is left intact, and the picture has a conventional appearance otherwise. Picasso, on the other hand, clearly disrupts most at the 3-D model level. The three-dimensionality of his figures is not realistic. An example of someone who creates primarily at the surface representation stage [the 2 1/2-D sketch] is a little harder—Cezanne perhaps?
Not very long ago vision was considered a “simple” problem. Through the work of Marr and his collaborators, we now know better. And the impact of their research goes well beyond the immediate issue of seeing. Marr’s levels of understanding, his modular view of the brain, his concrete proposals of symbols and computational procedures, his stress on the overriding importance of constraints,6 have opened up the possibility of understanding other mental functions as well, including language, thought, and, perhaps, emotions.
A new discipline has been created that brings together much of philosophy, psychology, artificial intelligence, and neurophysiology, and that opens up the exciting possibility of uncovering some of the mysteries of the brain. It is too early to say how fruitful this will be, but Vision will remain one of the most remarkable achievements of the past decades.
She was describing work she had done with A.M. Taylor. ↩
Tomaso Poggio, now at MIT, arrived at a similar formulation about the same time and was one of Marr’s closest collaborators. ↩
Often there are unexpected side effects of neurophysiological (or computer) activity that must be taken into account in formulating level-one processes. An entertaining example of this can be found in Crick and Mitchison’s recent theory that “we dream in order to forget”—not in order to revive memories. (See Nature, vol. 304, pp. 111–114, 1983.) ↩
Different symbolic representations make explicit and usable different pieces of information. Various symbolic systems, for example, have been created to represent numbers throughout history. Arabic numbers (1, 2, 3, 4, etc.) make explicit the powers of ten (10⁰, 10¹, 10², etc.) that go into the composition of the number. (Nineteen is 9 × 10⁰ plus 1 × 10¹.) Binary numbers make explicit the powers of two that compose a given number (2⁰, 2¹, 2², etc.). 10011 is 19 in the binary system. Marr claims that the use of Roman numerals (XIX, XX, etc.) which are extremely difficult to manipulate (multiply XX by XXI, for example, and compare the task with 20 × 21) explains why the Romans never made any significant contributions to mathematics. ↩
I owe this example to Ronald B. de Sousa, whose incisive comments were of invaluable help in preparing this article. ↩
Tomaso Poggio has kindly shown me some very stimulating preliminary work that could well offer a new understanding of constraints. ↩