The Golden Phonebook

Genes, Peoples, and Languages

by Luigi Luca Cavalli-Sforza, translated from the Italian by Mark Seielstad
North Point Press/Farrar, Straus and Giroux, 227 pp., $24.00


Luigi Luca Cavalli-Sforza’s latest book summarizes the life work of this fascinating polymath, who for the last fifty-five years has been developing ingenious methods to understand the history of everybody. I first encountered his methods by chance thirteen years ago while I was browsing my weekly copy of the international scientific journal Nature. Virtually all of the journal’s articles were written in technical language incomprehensible to laypeople, and indeed to most scientists. There were studies of the high-Tc superconductor YBa2Cu3O7-δ, a c-erb-A binding site, copia element genome reshuffling, corticofugal feedback, and other things that I had never heard of in my career as a scientist.

Halfway through the journal, sandwiched between articles on bimolecular chirality and on African finch bill size polymorphism, I came across something different. Five authors with Italian names, which I took to be pseudonyms, claimed to have extracted the surnames of 10,473,727 Italian telephone subscribers from the phone directories of ninety-one Italian provinces. The authors then allegedly analyzed the surnames by the usual methods applied to bona fide scientific data, such as fitting the names to equations, drawing graphs, and applying statistical tests. The paper was evidently one of those hilarious spoofs of science that Nature editors occasionally run. Cited in the list of references at the end of the paper were other articles by some of the same authors, apparently constituting other spoofs. Into respectable journals like The Annals of Human Genetics those pseudonymous Italians had managed to slip purported analyses of surnames extracted from Sardinian residential electric bills.

As I reread the Nature paper, I gradually realized that it was no joke but instead a brilliant, serious study. Written histories describe migrations, but rarely can say exactly how many people moved, where they originated, and where they ended up. The authors of the apparent spoof had figured out how to extract answers to those questions from local lists of surnames. As a glance at any telephone directory will show, there are a few common surnames (like Smith in the US) and thousands of rarer ones, but a name’s frequency differs between localities. Hence if migration is occurring between two localities with different name frequencies, the relative frequencies at the two localities differ in a way that lets one calculate the migration rates by means of a mathematical analysis. For instance, the relative frequencies of the names Garcia and Smith differ between the Mexico City and Los Angeles phone books, thereby reflecting Mexican immigrants to the US and American immigrants to Mexico (many of them named Garcia and Smith, respectively).
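To make the idea concrete, here is a minimal sketch of the first step of any such analysis: turning lists of surnames into relative frequencies and comparing two localities. It is not the authors' actual method, which the review does not spell out. The toy directories, the function names, and the choice of comparison (the chance that two randomly drawn subscribers share a surname) are assumptions made purely for illustration.

```python
from collections import Counter


def relative_frequencies(surnames):
    """Return each surname's share of the directory (shares sum to 1)."""
    counts = Counter(surnames)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


def shared_surname_probability(freq_a, freq_b):
    """Chance that one subscriber drawn from each locality bear the same surname."""
    return sum(p * freq_b.get(name, 0.0) for name, p in freq_a.items())


# Invented toy directories (hypothetical data, ten subscribers each).
city_a = ["Garcia"] * 6 + ["Lopez"] * 3 + ["Smith"] * 1
city_b = ["Smith"] * 6 + ["Jones"] * 3 + ["Garcia"] * 1

freq_a = relative_frequencies(city_a)
freq_b = relative_frequencies(city_b)

# Surname sharing inside one city versus between the two cities.
within_a = shared_surname_probability(freq_a, freq_a)
between_ab = shared_surname_probability(freq_a, freq_b)
```

In this toy case the between-city figure is well below the within-city one. Any exchange of people between the two places would pull the figures closer together, which is the kind of signal a fuller migration model would have to quantify.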

By analyzing the names in Italian phone directories, the authors of the Nature paper extracted patterns of migration during thousands of years of Italian history. Dante Alighieri had already written around the year AD 1305, “As to the ancient vulgar tongue …I maintain that Italy is divided into two parts, the right and the left. And if anyone should ask what is the dividing line, I answer that the Apennines are the watershed.” The authors’ maps confirmed Dante’s intuition by computer: today, surnames show that several of Italy’s sharpest boundaries to migration still coincide with the Apennines. Other surname boundaries delineate the area of western Sicily settled by Phoenicians and Carthaginians in the eighth century BC, the concentration of ancient Greek settlements in northeastern Sicily in the seventh century BC, and the Albanian settlements of the fifteenth century AD in southern Italy. Incredible as it may seem, evidence of those ancient communities lives on in the surnames of modern Italians inhabiting those areas—Italians of whom many are the genetic descendants of those ancient colonists.

The senior author of the papers on surnames was the Stanford professor Luigi Luca Cavalli-Sforza, esteemed by other scientists as the world’s leading expert on human population genetics. It would be a slight exaggeration to say that Cavalli-Sforza studies everything about everybody, because actually he is “only” interested in what genes, languages, archaeology, and culture can teach us about the history and migrations of everybody for the last several hundred thousand years. As for how he came to pursue that modest goal, all of us develop as children a sense of homeland, a landscape that is familiar to us and with which we identify. For instance, after thirty-four years of living as an adult in Los Angeles, I still feel like an alien in Southern California’s chaparral and deserts: the New England forests where I grew up feel more like home. But Cavalli-Sforza’s family moved every few months during his early childhood in Italy, and he spent his student years buffeted around Europe during World War II. Today he commutes as a peripatetic academic between Stanford and Italy, but he feels especially connected to Africa’s pygmies, among whom he worked for twenty years. Those diasporas of his own may lie behind his passion to understand the homelands and migrations of all the world’s peoples.

Genes, Peoples, and Languages is thus doubly interesting: as a window into the history of all of us, and as a window into the mind of a remarkable scientist. Cavalli-Sforza’s most striking intellectual gift is his ability to extract simple but profound conclusions from messy and seemingly trivial data. Besides his mining of Italian phone books and Sardinian electrical bills, he has analyzed questionnaires asking pygmies of different ages and sexes how far from home they had ever traveled, questionnaires asking Stanford undergraduates and their parents whether they preferred butter or margarine, and records of all consanguineous marriages for which papal dispensation was granted in 280 Italian dioceses between 1910 and 1964.

Even someone capable of recognizing that something worthwhile might lie hidden in such lists still faces the difficulty of finding a mathematical model to which to fit the data, in order to extract conclusions of interest. Miraculously, Cavalli-Sforza has managed to accomplish this again and again. His strategy is to recognize that there are only a few basic types of useful mathematical models, that one can thus reuse the same model in different fields with just small changes, and hence that the key step is to recognize good analogies. For instance, related mathematical models explain compound interest on our bank accounts, antibiotics killing bacteria, the migrations of pygmies—and the great diasporas of human history.

Curiosity about ourselves is only one reason driving us to understand history. Other reasons include history’s relevance to social issues, such as racism, technological innovation, cultural change, and genetic interventions. Professional historians mine written documents, but writing arose in the Fertile Crescent only around 3400 BC, and elsewhere in the world only later. (The Fertile Crescent is flanked by the Mediterranean on the west and the Tigris and Euphrates on the east.) Hence some other methods are needed to reconstruct that 99.9 percent of our history extending from our origins around five million years ago to the origins of writing. Our main sources of information about that preliterate past are, obviously, archaeology, plus (perhaps surprisingly) the languages and genes of living peoples, which reveal fossilized traces of our history to those knowing how to read them. As Cavalli-Sforza explains in the preface to Genes, Peoples, and Languages, there are gaps in what each of those three disciplines—archaeology, linguistics, and genetics—can tell us about history, but combining their data can fill the gaps and converge on a single history.

Initially, one might suppose that archaeology would be our main source of information about the preliterate past. At their most successful, archaeological excavations do yield the bones of past peoples themselves, as well as their tools, pots, and other cultural paraphernalia. In practice, the archaeological record is very spotty, and human bones are often lacking or undiagnostic. For instance, stone spear-points of the so-called Clovis culture, made by perhaps the first humans to reach the Americas around 13,000 years ago, are known in abundance from all of the lower forty-eight US states, as well as from Central America south to Guatemala. But who were the Clovis people themselves? We don’t know, because almost none of their bones have come down to us. We usually assume that they were ancestral Native Americans who migrated over the Bering Strait from Asia, and so it came as a shock to us when the now famous, recently discovered Kennewick skeleton, the oldest complete skeleton known from the Americas, was reported to look more European than Asian.

But conclusions based on a single skeleton could easily be overturned by discovery of the next skeleton and merely emphasize how fragile conclusions based on archaeology alone can be. Even when archaeological excavations yield tens of thousands of skeletons and millions of tools, as in the case of ancient Europe, that may not even begin to address some central questions of history. When and whence did Indo-European languages, the modern world’s dominant language family and the one in which this review is written, reach Europe? Linguists are still debating the answer, because bones alone give no clue to the language that their owner spoke in life.

That is why historians need the help of scientists like Cavalli-Sforza. Languages and genes of living people contain historical information in far greater abundance than is contained in the relatively few fossils that have come down to us. Obviously, if we could be transported by a time machine back to a group of people living ten thousand years ago, we could listen to their conversations, sample their blood, and identify their languages and genes directly. The recent development of techniques for extracting DNA from mummies and bones is even giving us information about genes of long-dead people. But how does Cavalli-Sforza extract information about past migrations from the languages and genes of living peoples?

Here are two examples illustrating how modern languages let one reconstruct history. The first, concerning the origins and early migrations of the English language, is useful to begin with, because we already know the answer from historical documents. But could we have deduced the answer just from languages spoken by living people? By far the greatest number of native English-speakers today live in North America, with others in Britain, Australia, and elsewhere. Hence an extraterrestrial visitor to Earth not trained in linguistics might be misled into supposing that the English language arose in North America. But a trained linguist would note that English is only one of 144 languages of the Indo-European language family, otherwise mostly confined to Europe and western Asia.

Furthermore, within that family, English is closest to Northern Europe’s Germanic languages, within which group it is most similar to the West Germanic languages centered on Germany (not to the North Germanic languages of Scandinavia), within which subgroup it is most similar to the Low Germanic languages of North Germany (not the High Germanic languages of South Germany), within which sub-subgroup it is closest to the Frisian language still spoken along the North Sea Coast from Holland to South Denmark. Hence a linguist studying only living peoples, and deprived of all historical documents, would still deduce correctly that the English language arose in Europe along the North Sea and spread from there around the world. This of course is what written histories describe in detail, beginning with the Venerable Bede’s account of Angles, Jutes, and Saxons invading England in the fifth century AD.

Now let’s take an example where modern languages provided our first evidence about the course of history. The black farmers now occupying most of subequatorial Africa speak Bantu languages. Where did those farmers originate? The closest relatives of Bantu languages prove to be a diverse group of languages spoken in a small area of eastern Nigeria and western Cameroon. That’s because black farmers began to spread from that small area around 3000 BC and carried their speech, as well as their genes and crops and livestock and tools, over half a continent formerly thinly populated by pygmy and Khoisan hunter-gatherers. Literacy arrived only thousands of years later, so there was no African Venerable Bede on the spot to record the Bantu invasion of subequatorial Africa.

Reconstructions of ancient migrations are most convincing when results from studying both the genes and the languages of living peoples can be integrated with each other, as well as with results from archaeology. For these purposes, “living peoples” means “aboriginal peoples”—i.e., peoples still living in the approximate locations that they occupied in AD 1492, just before the overseas expansion of Europeans began to transform the world’s genetic landscape. For instance, to reconstruct New World prehistory requires studying genes and languages of surviving Native American populations, not of the Europeans, Asians, and Africans who recently supplanted them over much of the hemisphere. As one might expect from a man who pulled 10,473,727 names out of Italian phone books, Cavalli-Sforza’s genetic data base is correspondingly large: it consists of about 100,000 gene frequencies distributed over about 2,000 human populations.


Cavalli-Sforza’s reconstructions of all the major human diasporas of the past several hundred thousand years no more lend themselves to a summary of their main points than does the Encyclopedia Britannica. Nevertheless, some of his book’s flavor may be conveyed by illustrating what he has been able to learn about the ancient history of just one continent, Europe. Europe is an ideal choice because it has a high concentration of geneticists, many of whom occupy themselves by studying the genes of their neighbors. For example, Cavalli-Sforza was able to scrutinize genetic variation among villages of Italy’s Parma River Valley, because the local parish priests (most of whom were seminary students under one of Cavalli-Sforza’s students) persuaded the faithful to donate blood samples in the parish sacristies after Sunday mass.

A good starting point for studying European prehistory is the anomaly of the Basque people of northern Spain and southwestern France. Their language is unlike any other language in the world, and they have the world’s highest frequency (over 50 percent) of the gene for the blood group Rhesus-negative. Does this suggest that Basques represent the last survivors of the Cro-Magnons and other ancient Europeans, who were replaced elsewhere by Rhesus-positive invaders speaking Indo-European languages? The answer proves to be yes, but modern Europeans tell us much more than that about ancient Europeans. Six maps in Genes, Peoples, and Languages illustrate how genes of living Europeans, extracted from their blood cells and saliva and hair, still reveal the march of Stone Age farmers northwestward across Europe beginning ten thousand years ago, the hordes of nomads sweeping westward out of the Ukraine six thousand years ago, and the boundaries of the former Etruscan realm conquered by Rome before 300 BC.

The following analogy may be helpful in understanding how so much information can be extracted from gene frequencies. Suppose that you had never seen a map of Europe but that you were told the distances between every pair of a hundred European cities. With the help of a computer, you could then construct a flat two-dimensional map of those cities. With even more abundant and exact data, you could also plot the cities in a third dimension, elevation above sea level, because cities at different elevations are slightly farther apart than are cities with equal horizontal spacing but lying at the same elevation. If our world were five-dimensional instead of three-dimensional, you might even be able to plot cities’ positions in five dimensions. Cavalli-Sforza and his colleagues did something similar for Europe’s human populations: they combined the frequencies of all genes into a single measure of genetic distance, then they analyzed those distances by a statistical technique termed “principal components analysis” to extract the positions of Europe’s populations in five dimensions.
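The city analogy can be tried out directly. The sketch below uses classical multidimensional scaling, a close relative of principal components analysis, to rebuild a two-dimensional layout from nothing but a table of pairwise distances. It is an illustration of the analogy only, not a reconstruction of Cavalli-Sforza's procedure. The four "cities" and the function name `classical_mds` are hypothetical, and the recovered map is correct only up to rotation and mirror image.

```python
import math
import random


def classical_mds(dist, dims=2, iters=500, seed=0):
    """Recover coordinates (up to rotation/reflection) from pairwise distances."""
    n = len(dist)
    # Double-centre the squared distances to obtain an inner-product matrix.
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]

    def matvec(m, v):
        return [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        # Power iteration finds the largest remaining eigenvalue and eigenvector.
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(iters):
            w = matvec(b, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        lam = max(sum(x * y for x, y in zip(v, matvec(b, v))), 0.0)
        for i in range(n):
            coords[i][d] = math.sqrt(lam) * v[i]
        # Deflate: remove the component just found before seeking the next.
        for i in range(n):
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return coords


# Four hypothetical cities at the corners of a 4-by-3 rectangle.
example_points = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]
example_dist = [[math.dist(p, q) for q in example_points] for p in example_points]
recovered = classical_mds(example_dist, dims=2)
```

The recovered coordinates need not match the originals, but the distances between them do, which is all a map built from distances can promise. Asking for more dimensions than the data contain simply yields extra coordinates near zero.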

The dimension that proved to account for the highest proportion of genetic variation involves shifts in gene frequencies from southeast to northwest across Europe: some genes rise and others decline in frequency from the Balkans toward Britain and Scandinavia. That discovery solved a longstanding archaeological mystery concerning the origins of European agriculture. It had been known for some time that farming arose in the Fertile Crescent of southwest Asia around ten thousand years ago and gradually spread northwest across Europe. In a striking example of Cavalli-Sforza’s gift for seeing analogies, he recognized that an equation originally derived to describe the linear spread of advantageous genes, and later adapted to describe the radial spread of escaped captive muskrats across Europe in the twentieth century, should also describe the radial spread of farming across Europe thousands of years earlier. In fact, when radiocarbon dates of the earliest farmers’ archaeological sites in each part of Europe were tabulated, the equation’s fit to the data was excellent. It turned out that farming spread across Europe much like muskrats but more slowly: one kilometer per year for farming, twenty kilometers per year for muskrats. It thereby took four thousand years for farming to cover the distance from the Fertile Crescent to England.
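The arithmetic behind those figures is simple enough to check. The short sketch below assumes a front advancing at a constant rate, uses only the rates and the four-thousand-year span quoted above, and derives the distance those figures imply; the function and variable names are illustrative.

```python
FARMING_KM_PER_YEAR = 1.0    # rate quoted for the spread of farming
MUSKRAT_KM_PER_YEAR = 20.0   # rate quoted for the spread of muskrats
FARMING_SPREAD_YEARS = 4000  # Fertile Crescent to England, as quoted


def distance_covered(rate_km_per_year, years):
    """Distance travelled by a front moving at a constant rate."""
    return rate_km_per_year * years


def years_needed(distance_km, rate_km_per_year):
    """Time a constant-rate front needs to cover a given distance."""
    return distance_km / rate_km_per_year


# Distance implied by the quoted rate and duration for farming.
implied_distance_km = distance_covered(FARMING_KM_PER_YEAR, FARMING_SPREAD_YEARS)

# How long the muskrats would need to cover that same distance.
muskrat_years = years_needed(implied_distance_km, MUSKRAT_KM_PER_YEAR)
```

On these figures the implied distance is 4,000 kilometers, and muskrats moving twenty times as fast would cover it in 200 years.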

The question remained, however, what had actually spread: farmers or farming? That is, did Europe’s original hunter-gatherers learn to farm, and did the idea of farming thereby spread northwest across Europe without changes in European human gene frequencies? Or did farmers themselves spread northwest out of the Fertile Crescent, carrying not only the idea of farming but also their genes, and interbreeding with Europe’s hunter-gatherers? Chapter Four of Genes, Peoples, and Languages describes Cavalli-Sforza’s personal history of grappling with these questions, which have fascinated him for many decades. It turned out that the first dimension of Europe’s genetic map is stunningly similar to the map of radiocarbon dates of the earliest European farming sites. That agreement implies that farmers themselves, not just farming, spread across Europe. The resulting dilution of the farmers’ genes from southeast to northwest, and the dilution of hunter-gatherers’ genes from northwest to southeast, remain the strongest genetic pattern in Europe today, five thousand years after the farmers’ spread was completed.

The persistence of ancient genetic trends is initially surprising. Why hasn’t that gene gradient, laid down between ten thousand and five thousand years ago, been completely erased by subsequent migrations during the last five thousand years? For example, we know that Mongols swept into Europe out of Asia, and that Huns and the other barbarians who destroyed the Roman Empire tore back and forth in all directions across Europe for many centuries. How could any signal from five thousand years ago still shine through all that noise?

A major reason emphasized by Cavalli-Sforza is that farming and herding yield far more edible calories per acre than do hunting and gathering, so farmers typically live at ten to a hundred times the population densities of hunter-gatherers. The arrival of farmers in an area formerly occupied by hunter-gatherers thus produces a huge population explosion and a swamping of old genes by new genes. That population explosion of Europe’s first farmers radically transformed Europe’s gene frequencies; since then, Europe has had no comparable genetic upheavals. The Huns and other barbarians threw Europe into chaos, but they contributed few genes compared to all the European residents whom they terrorized. Modern Europeans are primarily the descendants not of Huns but of ancient farmers and of the hunter-gatherers with whom those farmers mixed. In effect, the major dimension of Europe’s genetic map of five thousand years ago has persisted nearly frozen until today, because since then Europe has experienced no massive injections of new genes comparable to those carried by the first farmers.

In addition to this major dimension, at least four other dimensions emerge from Europe’s genetic landscape. The second dimension consists of gene shifts from south to north, associated with genetic adaptations to latitude, and with the genetically distinctive Lapps and related peoples reaching Scandinavia; the third consists of gene shifts from east to west, across Central Europe, signaling westward invasions of nomads (possibly bearing Indo-European languages) out of the Ukrainian steppes beginning around six thousand years ago; and the fourth consists of gene shifts both westward and eastward from Greece, stemming from the establishment of Greek colonies in both Italy and western Turkey during Greece’s golden age of the first millennium BC.

The fifth dimension of Europe’s genetic map, although it explains the lowest fraction of the genetic variation, is nevertheless very interesting. I mentioned that the most striking single genetic feature of modern Europe is the world’s highest frequency of the Rhesus-negative blood group in the Basque area within Spain and France. Today, speakers of the Basque language are concentrated within a few thousand square miles in the western Pyrenees. But the fifth dimension of Cavalli-Sforza’s genetic maps strengthens a conclusion suggested by contemporary Roman accounts, and by Basque place names preserved in Spain and France far beyond the modern Basque borders: the Basques used to be much more widespread than they are now. In fact, the genetically distinct region centered on the modern Basque area is suspiciously similar to the geographic distribution of Cro-Magnon cave art in Ice Age Europe. Hence the Basques may be the modified descendants of Cro-Magnons who replaced the Neanderthals, occupied the extended Basque region in Ice Age times, and produced the great paintings in Lascaux and Altamira caves.

Rivaling Basque genes in distinctiveness is the Basque language, the only native non-Indo-European tongue spoken in Europe (except for the recently arrived Finno-Ugric languages). Perhaps Basque is the sole survivor of the languages formerly spoken by Europe’s hunter-gatherers, before all their other languages were supplanted by those of invading farmers and steppe nomads. Isolated in mountainous terrain far from the Fertile Crescent, only the Basques were able to preserve their Ice Age speech and something of their Ice Age genes. The Song of Roland tells how Saracens ambushed Charlemagne’s rear guard and killed Roland, but Cavalli-Sforza speculates that the Basques were actually responsible.


The biggest controversy in worldwide genetic comparisons is, of course, the race controversy. This is a subject of special interest to Cavalli-Sforza, who has had a leading part in demolishing scientists’ attempts to classify human populations into races in the same way that they classify birds and other species into races. Any competent American birdwatcher can assign individuals of the common bird species known as the yellow-rumped warbler into its eastern and western races (termed the “myrtle warbler” and “Audubon warbler,” respectively). The eastern race has a white throat, the western race a yellow throat. That’s easy and uncontroversial, but it’s even easier for a layperson to distinguish Swedes infallibly from Japanese and Nigerians, just by glancing at faces. Common sense tells us that that’s how we divide humans into races, such as whites, blacks, Mongoloids, and so on.

From a scientific perspective, however, the concept of race still fails, for reasons to which Genes, Peoples, and Languages devotes its first chapter. Even if you try to subdivide human populations by visible differences like skin color, it’s completely arbitrary how far you should go on subdividing: different anthropologists recognize between three and sixty races, depending on their personal preference. If you go so far as to assign Nigerians and Kalahari Bushmen to different races within Africa—as do virtually all anthropologists who recognize races—why lump Tamils and Swedes as “Caucasians,” or Japanese and Quechuas as “Mongoloids”?

Our racial stereotypes turn out to be based on just a few external traits: skin and hair and eye color, hair form, and facial shape. Variation in those traits bears little relation to variation in well-studied genetic traits. Genetically remote populations, such as New Guinea highlanders and black Africans, may be outwardly similar. Conversely, outwardly dissimilar populations may prove to be genetically similar, as illustrated by the slight genetic differences separating blond-haired, blue-eyed, fair-skinned Swedes from black-haired, brown-eyed, darker-skinned Sicilians. As Cavalli-Sforza puts it, “It is because they are external that these racial differences strike us so forcibly, and we automatically assume that differences of similar magnitude exist below the surface, in the rest of our genetic makeup. This is simply not so: the remainder of our genetic makeup hardly differs at all.”

It is emblematic of Cavalli-Sforza’s supple intelligence that he can be meticulously allocating Africans to various branches of our evolutionary tree at one moment and passionately combating racist American educational policies at the next moment. Genes, Peoples, and Languages is, among other things, an intellectual biography—a complex portrait of a scientist capable of mentally juggling the particulars about everything and everybody, while remaining continually alert to grand designs.