Librarians of the Twitterverse

William Heath: March of Intellect, 1829

For a brief time in the 1850s the telegraph companies of England and the United States thought that they could (and should) preserve every message that passed through their wires. Millions of telegrams—in fireproof safes. Imagine the possibilities for history!

“Fancy some future Macaulay rummaging among such a store, and painting therefrom the salient features of the social and commercial life of England in the nineteenth century,” wrote Andrew Wynter in 1854. (Wynter was what we would now call a popular-science writer; in his day job he practiced medicine, specializing in “lunatics.”) “What might not be gathered some day in the twenty-first century from a record of the correspondence of an entire people?”

Remind you of anything?

Here in the twenty-first century, the Library of Congress is now stockpiling the entire Twitterverse, or Tweetosphere, or whatever we’ll end up calling it—anyway, the corpus of all public tweets. There are a lot. The library embarked on this project in April 2010, when Jack Dorsey’s microblogging service was four years old, and four years of tweeting had produced 21 billion messages. Since then Twitter has grown, as these things do, and 21 billion tweets represents not much more than a month’s worth. As of December, the library had received 170 billion—each one a 140-character capsule garbed in metadata with the who-when-where.

The library has attached itself to the firehose. A stream of information flows from 500 million registered twitterers (counting duplicates, dead people, parodies, imaginary friends, and bots) who thumb their hurried epistles into phones and tablets and PCs, and the tweets pour into Twitter’s servers at a rate of thousands per second—tens of thousands at peak times: World Cup matches, presidential elections, Beyoncé’s pregnancy—and make their way in “real time” to a company called Gnip, a social-media data provider in Boulder, Colorado. Gnip organizes them into one-hour batches on a secure server for download, where they are counted and checked and finally copied to reels of magnetic tape, to be stored in a couple of filing cabinets. In different locations, for safety. If you have ever tweeted, rest assured that each of your little gems is there for posterity.

Of course, the chance of even your very best tweet being seen again by human eyes is approximately zero.

This is an ocean of ephemera. A library of Babel. No one is under any illusions about the likely quality—seriousness, veracity, originality, wisdom—of any one tweet. The library will take the bad with the good: the rumors and lies, the prattle, puns, hoots, jeers, bluster, invective, bawdy probes, vile gossip, epigrams, anagrams, quips and jibes, hearsay and tittle-tattle, pleading, chicanery, jabbering, quibbling, block writing and ASCII art, self-promotion and humblebragging, grandiloquence and stultiloquence. New news every millisecond. A vast confusion of vows, wishes, actions, edicts, petitions, lawsuits, pleas, laws, proclamations, complaints, grievances. Now comical then tragical matters.

Call it what you will, the Twitter corpus now forms a piece of “the creative record of America” and therefore falls squarely within the library’s mission, says Robert Dizard Jr., the Deputy Librarian of Congress. Historians treasure nineteenth-century diaries; why not twenty-first-century tweets? “I think the twitter archive has the potential to allow researchers or scholars to paint a picture of the past with more colors or a fuller brushstroke.”

Scholars and researchers—several hundred of them—have already asked for access, but providing access is not so easy. The tapes are offline. They are organized by date and time. To keep the archive online, indexed for searching, would require server farms with petabytes or more, the sort of thing Google has in legions and the US government not so much.

Google and Twitter can’t seem to get along—they haven’t managed to agree on terms for enabling either real-time or historical searches. Twitter’s own search function is limited and filtered. Only the last few days are available. A Frequently Asked Question in the Twitter Help Center is “I’m Missing from Search!” (How poignant.)

Effectively searching this mass of unstructured data, this barnyard of straw, will be more difficult than people may think. Despite the metadata attached to each tweet, and despite trails of retweets and “favorite” tweets, the Twitter corpus lacks the latticework of hyperlinks that makes Google’s algorithms so potent. Twitter’s famous hashtags—#sandyhook or #fiscalcliff or #girls—are the crudest sort of signposts, not much help for smart searching. Here is a hashtag exegesis in a New Year’s tweet by the comedian Demetri Martin:

The Library of Congress dreams of being able to provide scholars instant results for all kinds of queries—“to be able to answer any question a researcher puts before the archives,” as Dizard says—but that may be a long way off. Right now, to run a single query can take days. The Gnip company, as Twitter’s collaborator, offers a form of historical search for its clients, but it, too, is slow and specialized. “I think there is broad recognition already that there is enormous value that can be derived from the data,” says Gnip’s president, Chris Moody. “That being said, we have to be realistic in terms of what’s going to be available because it is very expensive and it is very challenging.”

At least the job of preservation costs little enough—in the low tens of thousands, the library says. When the early telegrams were saved in safes they had weight and volume—“those sent by the Recording Telegraph being wound in tape-like lengths upon a roller, and appearing exactly like discs of sarcenet ribbon,” as Wynter said. As the telegraph exploded in popularity, there was soon no hope of collecting and storing all that paper. Nowadays, of course, tweets are just bits.

O historian of the future, will you be able to find gems in the straw? Maybe it won’t be worth your while—not unless you have a lot more time than I do. You may sample it, or listen in on something like pure thought, flickering, static-filled, in a vast dark universe.

Still, I’m enjoying my infinitesimal slice, less than one five-millionth of the whole, in real time. I’m hearing new news every day, I’m not believing everything I hear, and I’m certainly not tracking statistics or spotting trends. Mostly I believe that Twitter is a mirage — wait, let’s hear from a neophyte:
