
Ngram wows them all

3 January 2011

A fantastic new tool has become available for anyone interested in using the power of the web to trace word use. Google's Ngram is a digital storehouse of 500 billion words and short phrases, drawn from 5.2 million books published between 1500 and 2008 in English, French, Spanish, German, Chinese and Russian.
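To make the idea of "words and short phrases" concrete: an n-gram is simply a run of n consecutive words. The short Python sketch below counts one-word and two-word n-grams in a sentence. It is only an illustration. The function names and the whitespace tokenising are assumptions made for the example, not a description of how Google built its data set.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(text, max_n=2):
    """Count all n-grams of length 1..max_n in a piece of text.

    Tokenising by lower-casing and splitting on whitespace is a
    simplifying assumption for this example only.
    """
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


counts = count_ngrams("the flow of ideas and the flow of words")
# Show the five most frequent n-grams with their counts.
for gram, total in counts.most_common(5):
    print(" ".join(gram), total)
```

Dividing counts like these by the total number of words printed in a given year would give the kind of relative frequency that can be plotted over time.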

Google was aiming at a scholarly audience, and this will be a great boon to PhD students, but it will also have a much wider appeal, lending itself to all kinds of searches. For instance, you can quickly use the tool to see that "women," in comparison with "men," is rarely mentioned until the early 1970s, when feminism gained a foothold. The two lines on the resulting graph eventually cross around 1986.

"The goal is to give an 8-year-old the ability to browse cultural trends throughout history, as recorded in books... We wanted to show what becomes possible when you apply very high-turbo data analysis to questions in the humanities," said Erez Lieberman Aiden, a junior fellow at the Society of Fellows at Harvard. Lieberman Aiden and Jean-Baptiste Michel, a postdoctoral fellow at Harvard, assembled the data set with Google and spearheaded a research project to demonstrate how vast digital databases can transform our understanding of language, culture and the flow of ideas.

Steven Pinker, the Harvard psychologist and a co-author of the research, said: "When I saw they had this database, I was quite energized. There is so much ignorance. We've had to speculate what might have happened to the language."

Jean-Baptiste Michel said that the team recognised that including information with errors was worse than not including it at all, so all books that did not pass strict standards for accurate labelling and scanning were filtered out.

"That is why we end up working with 5.2 million books and not the whole 15 million," Mr. Michel wrote. (The 15 million figure refers to the number of published books that Google has digitally scanned so far.) "These filtering algorithms took us over a year to improve to our satisfaction. Indeed, if we hadn't worked on them, we'd have published our very first version of the Ngrams, totally unfiltered, back in 2008."

For their paper, Mr. Michel and Mr. Lieberman Aiden based their research on books published in English from 1800 to 2000. "We do not consider that trajectories outside of English 1800-2000 are scientifically validated," Mr. Michel wrote. "In particular, before 1800 there are just too few books: one does not have enough statistical power."

The new database is seen as being of such significance that Science magazine has taken the unusual step of making the article about it freely available on its website.

About the Ngram Viewer

Science publication