21 January 2013

Librarians of the Twitterverse

An article by James Gleick in The New York Review of Books

For a brief time in the 1850s the telegraph companies of England and the United States thought that they could (and should) preserve every message that passed through their wires. Millions of telegrams—in fireproof safes. Imagine the possibilities for history! 

“Fancy some future Macaulay rummaging among such a store, and painting therefrom the salient features of the social and commercial life of England in the nineteenth century,” wrote Andrew Wynter in 1854. (Wynter was what we would now call a popular-science writer; in his day job he practiced medicine, specializing in “lunatics.”) “What might not be gathered some day in the twenty-first century from a record of the correspondence of an entire people?”
Remind you of anything?
Here in the twenty-first century, the Library of Congress is now stockpiling the entire Twitterverse, or Tweetosphere, or whatever we’ll end up calling it—anyway, the corpus of all public tweets. There are a lot. The library embarked on this project in April 2010, when Jack Dorsey’s microblogging service was four years old, and four years of tweeting had produced 21 billion messages. Since then Twitter has grown, as these things do, and 21 billion tweets represents not much more than a month’s worth. As of December, the library had received 170 billion—each one a 140-character capsule garbed in metadata with the who-when-where.
The library has attached itself to the firehose. A stream of information flows from 500 million registered twitterers (counting duplicates, dead people, parodies, imaginary friends, and bots) who thumb their hurried epistles into phones and tablets and PCs, and the tweets pour into Twitter’s servers at a rate of thousands per second (tens of thousands at peak times: World Cup matches, presidential elections, Beyoncé’s pregnancy) and make their way in “real time” to a company called Gnip, a social-media data provider in Boulder, Colorado. Gnip organizes them into one-hour batches on a secure server for download, where they are counted and checked and finally copied to reels of magnetic tape, to be stored in a couple of filing cabinets. In different locations, for safety. If you have ever tweeted, rest assured that each of your little gems is there for posterity.
Of course, the chance of even your very best tweet being seen again by human eyes is approximately zero.
This is an ocean of ephemera. A library of Babel. No one is under any illusions about the likely quality—seriousness, veracity, originality, wisdom—of any one tweet. The library will take the bad with the good: the rumors and lies, the prattle, puns, hoots, jeers, bluster, invective, bawdy probes, vile gossip, epigrams, anagrams, quips and jibes, hearsay and tittle-tattle, pleading, chicanery, jabbering, quibbling, block writing and ASCII art, self-promotion and humblebragging, grandiloquence and stultiloquence. New news every millisecond. A vast confusion of vows, wishes, actions, edicts, petitions, lawsuits, pleas, laws, proclamations, complaints, grievances. Now comical then tragical matters.
Call it what you will, the Twitter corpus now forms a piece of “the creative record of America” and therefore falls squarely within the library’s mission, says Robert Dizard Jr., the Deputy Librarian of Congress. Historians treasure nineteenth-century diaries; why not twenty-first-century tweets? “I think the Twitter archive has the potential to allow researchers or scholars to paint a picture of the past with more colors or a fuller brushstroke.”
Scholars and researchers—several hundred of them—have already asked for access, but providing access is not so easy. The tapes are offline. They are organized by date and time. To keep the archive online, indexed for searching, would require server farms with petabytes or more, the sort of thing Google has in legions and the US government not so much.
Google and Twitter can’t seem to get along—they haven’t managed to agree on terms for enabling either real-time or historical searches. Twitter’s own search function is limited and filtered. Only the last few days are available. A Frequently Asked Question in the Twitter Help Center is “I’m Missing from Search!” (How poignant.)
Effectively searching this mass of unstructured data, this barnyard of straw, will be more difficult than people may think. Despite the metadata attached to each tweet, and despite trails of retweets and “favorite” tweets, the Twitter corpus lacks the latticework of hyperlinks that makes Google’s algorithms so potent. Twitter’s famous hashtags—#sandyhook or #fiscalcliff or #girls—are the crudest sort of signposts, not much help for smart searching. Here is a hashtag exegesis in a New Year’s tweet by the comedian Demetri Martin:
The Library of Congress dreams of being able to provide scholars instant results for all kinds of queries—“to be able to answer any question a researcher puts before the archives,” as Dizard says—but that may be a long way off. Right now, to run a single query can take days. The Gnip company, as Twitter’s collaborator, offers a form of historical search for its clients, but it, too, is slow and specialized. “I think there is broad recognition already that there is enormous value that can be derived from the data,” says Gnip’s president, Chris Moody. “That being said, we have to be realistic in terms of what’s going to be available because it is very expensive and it is very challenging.”
At least the job of preservation costs little enough—in the low tens of thousands, the library says. When the early telegrams were saved in safes they had weight and volume—“those sent by the Recording Telegraph being wound in tape-like lengths upon a roller, and appearing exactly like discs of sarcenet ribbon,” as Wynter said. As the telegraph exploded in popularity, there was soon no hope of collecting and storing all that paper. Nowadays, of course, tweets are just bits.
O historian of the future, will you be able to find gems in the straw? Maybe it won’t be worth your while—not unless you have a lot more time than I do. You may sample it, or listen in on something like pure thought, flickering, static-filled, in a vast dark universe.
Still, I’m enjoying my infinitesimal slice, less than one five-millionth of the whole, in real time. I’m hearing new news every day, I’m not believing everything I hear, and I’m certainly not tracking statistics or spotting trends. Mostly I believe that Twitter is a mirage — wait, let’s hear from a neophyte:

6 January 2013

Library Of Congress Twitter Archive Reaches 170 Billion Messages

Less than three years after first announcing its plans to compile an archive of Twitter posts, the Library of Congress has collected more than 170 billion messages of 140 characters or fewer, the institution announced on Friday.

The archive, which was first announced in April 2010, will contain every tweet posted to the popular microblogging website. Each Twitter message, dating back to the very first one (which was posted back in 2006), is being donated to the library by the social media website, according to the Associated Press (AP).

The Library’s first goals were “to acquire and preserve” all tweets from 2006 through 2010; to “establish a secure, sustainable process for receiving and preserving a daily, ongoing stream of tweets through the present day; and to create a structure for organizing the entire archive by date,” Gayle Osterberg, the Library’s Director of Communications, said on Friday. “This month, all those objectives will be completed.”

“The volume of tweets the Library receives each day has grown from 140 million beginning in February 2011 to nearly half a billion tweets each day as of October 2012,” she added. “The Library’s focus now is on addressing the significant technology challenges to making the archive accessible to researchers in a comprehensive, useful way. These efforts are ongoing and a priority for the Library.”

Exactly how the massive collection of microblogging messages will be made available to researchers or to the general public is not yet known, reports Adrienne LaFrance of The Washington Post.

“Colorado-based data company Gnip is managing the transfer of tweets to the archive, which is populated by a fully automated system that processes tweets from across the globe. Each archived tweet comes with more than 50 fields of metadata (where the tweet originated, how many times it was retweeted, who follows the account that posted the tweet and so on), although content from links, photos and videos attached to tweets are not included,” she said. “But the library hasn’t started the daunting task of sorting or filtering its 133 terabytes of Twitter data, which it receives from Gnip in chronological bundles, in any meaningful way.”

“People expect fully indexed — if not online searchable — databases, and that’s very difficult to apply to massive digital databases in real time,” Deputy Librarian of Congress Robert Dizard Jr. told the Post. “The technology for archival access has to catch up with the technology that has allowed for content creation and distribution on a massive scale. Twitter is focused on creating and distributing content; that’s the model. Our focus is on collecting that data, archiving it, stabilizing it and providing access; a very different model.”

One problem is that the Library, which like many government agencies has experienced funding cuts in recent years, would need to overhaul its IT systems and servers substantially in order to handle Twitter-related requests, LaFrance explains.

Dizard told her that internal testing has revealed that completing a search of the approximately 21 billion tweets from 2006 to 2010 could take up to 24 hours using the agency’s current computers. He notes the agency is considering hiring a third party to handle public searches of the Twitter archive, but that will likely depend on whether or not the Library can afford to do so.
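
To put the reported figures side by side, here is a rough back-of-envelope sketch in Python. It uses only numbers quoted in this article (21 billion tweets searched in up to 24 hours, 170 billion tweets received, 133 terabytes of data). It assumes, purely for illustration, that search time grows linearly with the number of tweets, that the 133 terabytes covers the same 170 billion tweets, and that a terabyte is 10^12 bytes. None of these assumptions comes from the Library, and the function names are hypothetical.

```python
# Back-of-envelope estimates from figures quoted in the article.
# Assumptions (not from the Library): search time scales linearly with
# the number of tweets, the 133 TB covers all 170 billion tweets,
# and 1 terabyte = 10**12 bytes.

TWEETS_2006_2010 = 21_000_000_000   # tweets covered by the internal test search
SEARCH_HOURS = 24                   # reported upper bound for that search
TOTAL_TWEETS = 170_000_000_000      # tweets received as of the announcement
ARCHIVE_BYTES = 133 * 10**12        # 133 terabytes


def implied_scan_rate(tweets: int, hours: float) -> float:
    """Tweets examined per second if `tweets` are searched in `hours`."""
    return tweets / (hours * 3600)


def scaled_search_hours(total: int, sample: int, sample_hours: float) -> float:
    """Hours to search `total` tweets, assuming linear scaling from the sample."""
    return sample_hours * total / sample


def average_bytes_per_tweet(total_bytes: int, tweets: int) -> float:
    """Average stored size of one tweet, metadata included."""
    return total_bytes / tweets


if __name__ == "__main__":
    rate = implied_scan_rate(TWEETS_2006_2010, SEARCH_HOURS)
    hours = scaled_search_hours(TOTAL_TWEETS, TWEETS_2006_2010, SEARCH_HOURS)
    size = average_bytes_per_tweet(ARCHIVE_BYTES, TOTAL_TWEETS)
    print(f"Implied scan rate: about {rate:,.0f} tweets per second")
    print(f"Full-archive search: about {hours:.0f} hours ({hours / 24:.1f} days)")
    print(f"Average size per tweet: about {size:.0f} bytes")
```

Under those assumptions, a search of the full archive on the same hardware would take on the order of eight days, which fits Gleick's remark that a single query can take days.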

Rest assured, however, the Library does hope to make the tweet archive available to the general public eventually.

“Twitter is a new kind of collection for the Library of Congress but an important one to its mission,” Osterberg said, according to PCMag.com. “As society turns to social media as a primary method of communication and creative expression, social media is supplementing, and in some cases supplanting, letters, journals, serial publications and other sources routinely collected by research libraries.”

Source: redOrbit Staff & Wire Reports - Your Universe Online