Tuesday, September 20, 2011

Books of the world, unite!

Google has scanned and opened up access to an estimated 12% of all the world’s books. This unique dataset gives researchers in the humanities new ways to study cultural, historical and other trends. The new field this enables is known as ‘culturomics’. 

This article was published in I/O Magazine, September 2011.

As an undergraduate in the late 80s, Jon Orwant wanted to build thinking computers. He was studying at the famous Artificial Intelligence Laboratory of the Massachusetts Institute of Technology (MIT) near Boston. It was at that time that the ‘bad boy of robotics’, Rodney Brooks, was attempting, and failing, to build his cognitive robot child, COG, in the same lab, an endeavour that AI pioneer Marvin Minsky called ‘a PR stunt, not a project’. Orwant himself was working on a system to automate programming, but he switched to the MIT Media Lab in order to work on the Connection Machine, one of the first supercomputers built around massively parallel computing.

In the 90s he wrote a book on the Perl programming language. Afterwards, he edited and published a magazine devoted to the language. Orwant was the sole employee, using programs he’d written in Perl to do everything from answering subscriber questions to proofreading articles and laying out advertisements. After this ‘fun, but not sustainable period,’ as Orwant calls it, he joined O’Reilly, the publishing company famous for its books for geeks and nerds. That’s how he started combining programming and publishing – or, as he puts it at the Google Netherlands headquarters on the 15th floor of the Viñoly Building in Amsterdam: ‘I was writing books about software and writing software about books.’

When Google announced its dream to digitise all the books in the world, Orwant was immediately excited. In 2005, when the company opened an office in Boston, he was one of the very first to join, and he got to work creating a Google Books team there to complement the team at Google’s headquarters in California. He started as an engineer but is presently an engineering manager. Half of the time, however, he is still programming. Orwant: ‘I don’t want just to manage. I also want to program, to build things. Google is good at letting people do what they like and what they are good at.’

Ngram Viewer
Orwant did half of the programming work for a recently developed Google tool called the Ngram Viewer, which immediately attracted a lot of attention. Visit the Ngram Viewer website, set a time period and type a word or a series of words (a so-called ngram). Press ‘enter’ and the tool will create a graph showing how often the ngram occurs over time in the corpus of digitized Google Books.
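Conceptually, the curve the tool draws is a normalised count: for every year in the chosen period, the number of occurrences of the ngram divided by the total number of ngrams printed that year. The minimal Python sketch below illustrates that counting on a toy corpus of (year, text) pairs; the corpus format and the function name are assumptions made for illustration, not Google’s actual pipeline, which also has to deal with OCR noise, tokenisation and smoothing.

    from collections import Counter

    def ngram_frequencies(corpus, phrase, start_year, end_year):
        # corpus: an iterable of (year, text) pairs, a toy stand-in for the
        # digitised books; phrase: the ngram to track, e.g. 'Alan Turing'.
        n = len(phrase.split())
        matches = Counter()   # occurrences of the phrase per year
        totals = Counter()    # total number of n-grams per year

        for year, text in corpus:
            if not (start_year <= year <= end_year):
                continue
            tokens = text.split()
            grams = [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
            totals[year] += len(grams)
            matches[year] += sum(1 for gram in grams if gram == phrase)

        # Relative frequency per year is what the Ngram Viewer plots.
        return {year: matches[year] / totals[year]
                for year in sorted(totals) if totals[year]}

Plotting the returned dictionary against its years would give a curve of the same shape as the Ngram Viewer’s graph for that phrase.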

On Orwant’s MacBook Pro we type in the two-word ngram (or bigram) ‘Alan Turing’. The output graph shows an exponential increase in books mentioning Alan Turing after 1960, with a decline after 2000. Strangely enough, we also see a small peak in the decade before Turing’s birth in 1912. Were there books that already knew Alan Turing would be born? Orwant immediately wants to find out what is going wrong here. ‘I bet this is a metadata error,’ he says confidently. ‘It must be due to mistakes in cataloguing books.’ And indeed, after checking the books that apparently mention ‘Alan Turing’ before his birth, it turns out he is right.
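The manual check Orwant performed can also be automated: any year with a non-zero count before the earliest plausible date for the subject is suspect. A hypothetical helper along those lines, which takes the per-year frequencies produced by a sketch like the one above:

    def suspicious_years(frequencies, earliest_plausible_year):
        # frequencies: a {year: relative frequency} map; for a person, the
        # cutoff might be the birth year (1912 for Turing). Non-zero counts
        # before it almost certainly point to mis-catalogued publication dates.
        return sorted(year for year, freq in frequencies.items()
                      if freq > 0 and year < earliest_plausible_year)

Running it on the ‘Alan Turing’ frequencies with 1912 as the cutoff would list exactly the years whose books deserve the kind of manual inspection Orwant did.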

Discovering cultural trends
Google originally developed the Ngram Viewer specifically for two researchers at Harvard University. In December 2010 they published a paper in Science called ‘Quantitative Analysis of Culture Using Millions of Digitized Books’. Some interesting discoveries came to light. The researchers discovered that about half a million English words were used in books but had never made it into a dictionary. They also found that the English lexicon has nearly doubled over the past century, now amounting to more than a million words. And they visualized in a graph how the Nazis had suppressed the works of a large number of artists and academics. While those individuals were still mentioned in English books, mentions of their names in German books from the same period fell sharply.
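The suppression result rests on a simple comparison: the relative frequency of the same name in German books set against its frequency in English books over the same years. A sketch of that comparison, again using hypothetical per-year frequency maps such as those produced by the counting sketch earlier:

    def suppression_ratio(english_freqs, german_freqs):
        # english_freqs, german_freqs: {year: relative frequency} maps, one per
        # language corpus. A ratio that collapses while the English curve stays
        # level is the kind of signal used to visualise Nazi-era suppression.
        return {year: german_freqs[year] / english_freqs[year]
                for year in english_freqs
                if year in german_freqs and english_freqs[year] > 0}

The division by the English frequency matters: it separates genuine suppression in Germany from a name simply going out of fashion everywhere.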

This type of analysis by means of huge numbers of books has been given the name ‘culturomics’. According to Science, humanities researchers reacted ‘with a mix of excitement and frustration’. Some said that new tools such as the Ngram Viewer ‘could become extremely useful’. Others saw their use for humanities research as ‘almost embarrassingly crude’. In any case, to encourage researchers to use Google’s database of books, the company is now giving so-called Digital Humanities Research Awards of $50,000 to those who manage to convince Google of the innovativeness and feasibility of their research plans.

Automatic idea extraction
The same day the article appeared in Science, the Ngram Viewer was publicly released on the Google Labs website. ‘The Harvard researchers had privileged access,’ says Orwant, ‘but it’s my job to see that nobody needs privileged access in the future. We want to make as much information available as possible to as many people as possible. As we digitise more and more books, the Ngram Viewer will get better and better. But we are not saying to the researchers: “We’re Google. Trust us.” We want to make sure that all data will be available at any time and that researchers ten years from now can verify results using exactly the same data set.’

Asked about his future plans for Google Books, Orwant explains that progress will take place in what he calls the Digital Humanities Stack: extracting more and more meaning from books. The Digital Humanities Stack is a conceptual stack with seven layers of abstraction. The lowest layer consists of the scanned pages. On top of that comes distinguishing the text from the pictures on those pages. The third layer consists of the letters, symbols and punctuation that optical character recognition has to extract from the scanned pages. ‘Thinking about the digital humanities,’ says Orwant, ‘it is Google’s goal that no humanities researcher will have to bother about these basic three layers.’

The next four layers are increasingly harder for computers to interpret. The fourth layer has to do with the structure of the page (such as a table of contents or index), the fifth with the syntax of the text, and the sixth with the meaning: the semantics. ‘And finally,’ says Orwant, ‘on top lies the hardest layer of all for a machine to interpret: the layer of ideas.’ He tells of an exciting thought experiment: ‘Whereas Einstein came up with the theory of relativity on his own, Newton and Leibniz both invented calculus independently, almost simultaneously. Maybe the idea of calculus was in the air? Wouldn’t it be great to investigate whether calculus really was in the air back then by analyzing all the books from that period? Can we find traces in other books as well?’ Extracting ideas from books – and not just easy-to-find facts – would be Orwant’s ultimate dream.
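The stack itself is easy to write down explicitly. The enumeration below is one possible encoding of the seven layers Orwant describes, from raw scans at the bottom to ideas at the top; the layer names in the code are invented for illustration and are not Google terminology.

    from enum import IntEnum

    class DigitalHumanitiesLayer(IntEnum):
        # Orwant's seven-layer Digital Humanities Stack, bottom to top.
        SCANNED_PAGES  = 1  # raw page images from the scanners
        TEXT_VS_IMAGES = 2  # separating text regions from pictures on each page
        CHARACTERS     = 3  # letters, symbols and punctuation recovered by OCR
        PAGE_STRUCTURE = 4  # tables of contents, indexes and other page structure
        SYNTAX         = 5  # grammatical structure of the text
        SEMANTICS      = 6  # the meaning of sentences and passages
        IDEAS          = 7  # concepts such as calculus, the hardest to extract

    # Google's stated goal: researchers should never have to deal with layers 1-3.
    HANDLED_BY_GOOGLE = [layer for layer in DigitalHumanitiesLayer if layer <= 3]

Everything from layer 4 upwards, from page structure to ideas, is where Orwant expects the interesting, and increasingly hard, research to happen.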

Internet
Google Books: books.google.com
Google Books Ngram Viewer: ngrams.googlelabs.com/
Website about analyzing cultural trends using Google Books: www.culturomics.org
The ‘CATCH to eCATCH’ symposium (May 2011): www.nwo.nl/nwohome.nsf/pages/NWOP_8F7CSG_Eng
Google Awards, including the Digital Humanities Research Awards: research.google.com/university/relations/research_awards.html
Examples of award-winning projects: googleblog.blogspot.com/2010/07/our-commitment-to-digital-humanities.html

Google Books Statistics
The estimated total number of unique books in the world (as of the beginning of 2011):
  • 129 million (with a minimum estimate of 3.1 million Dutch books)

Scanned by Google (as of the beginning of 2011):
  • 15 million books (11.6% of the total), 168,000 of which are in Dutch 
  • 5 billion pages 
  • 2 trillion words

Google cooperates with (as of the beginning of 2011):
  • more than 40 major libraries 
  • more than 30,000 publishers 

Jon Orwant is Engineering Manager for Google Books, Google Magazines and Google Patents and is responsible for Google’s Digital Humanities Research Awards. He is working on Book Search, Patent Search, visualizations and the digital humanities. Orwant holds Bachelor’s degrees in computer science and cognitive science from the Massachusetts Institute of Technology (MIT). He earned a Master’s degree and a PhD in media arts & sciences, also from MIT. He worked as CTO for O’Reilly & Associates (2000–2002) and as Director of Research at France Telecom (2002–2006). He joined Google Book Search in Boston in 2006. Orwant was a keynote speaker at the ‘CATCH to eCATCH’ symposium in Amsterdam (20 May 2011).