Tuesday, 15 March 2011

Word Count the Library - Google Ngram Viewer for the Google Books Stock

How do terms compare in usage over the centuries? With the Ngram Viewer now available on Google Labs, the massive stock of books scanned as part of the Google Books project becomes searchable. It graphs terms according to how frequently they are used per year. By default the graph covers the years 1800 to 2000.

Image taken from Google Ngram / Comparing search terms for the decades of the century. Interesting how the different decades peak some decades afterwards. Some decades have a shifted peak or even two peaks, for example the 'nineties', which already peak around 1920. Of course the continuous fascination with the sixties is visible, but the thirties also cling on.

Google has grouped the books into several corpora. Most of them relate to a language, currently English, Chinese, French, German, Hebrew, Russian, and Spanish, but there are also samples like the 'English One Million' corpus, where no more than 6000 books per year are sampled for the result. There are of course difficulties with punctuation and multi-word terms. Generally the search field is case sensitive and punctuation marks are treated as individual tokens. More details on this are on the Google Ngram page.
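To make the token rules described above a bit more tangible, here is a minimal Python sketch of a case-sensitive split where each punctuation mark becomes its own token, followed by grouping the tokens into n-grams. The regular expression and the function names `tokenize` and `ngrams` are my own illustration and an assumption, not Google's actual tokenizer, which handles many more special cases.

```python
import re

def tokenize(text):
    # Assumed rule: a token is either a run of word characters
    # or one single character that is neither a word character nor whitespace.
    # Case is left untouched, so 'War' and 'war' stay different tokens.
    return re.findall(r"\w+|[^\w\s]", text)

def ngrams(tokens, n):
    # Slide a window of n tokens over the list and join each window with spaces.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("The Great War, then.")
bigrams = ngrams(tokens, 2)
```

With this rule 'Great War' and 'great war' are different search terms, and the comma after 'War' shows up as a token of its own.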

Image taken from Google Ngram / Comparing the search terms month, year, day, hour and week. The different time units are used differently, with the word day leading the table throughout, though it dives very low around 1960, where it was almost overtaken by the term year. Have things slowed down? Surprisingly the month and the week, both very important words in planning terms, are nowhere compared to the terms year and day.

However, the results are quite tricky to interpret, even though things might look very clear through the simplicity of the graph. Google has managed to make it look very simple and clean: each term is shown as a line, with time on the horizontal x axis and frequency on the vertical y axis. It has to be taken into account, however, that the usage of words changes, for example 'the Great War' vs 'World War I'. Even more important is the fact that more and more books are written each year. This of course influences the results. Google points out that there is only a catalogue of around 500,000 English books from before the 19th century. This means that a search term can have a stronger peak early on than it would have later on, since a single book has more of an impact on the sample as a whole.
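The effect of the sample size can be sketched with a little arithmetic. The snippet below assumes that the plotted frequency is simply the number of matches divided by the total number of words in that year's sample. The function name `relative_frequency` and all numbers are made up for illustration and are not taken from the Google data.

```python
def relative_frequency(match_count, total_tokens):
    # Share of all tokens in one year's sample that match the term.
    if total_tokens == 0:
        return 0.0
    return match_count / total_tokens

# Hypothetical numbers: the same book contributes 50 matches in both years,
# but the early sample is a hundred times smaller than the late one.
early = relative_frequency(50, 1_000_000)
late = relative_frequency(50, 100_000_000)
ratio = early / late
```

The same 50 mentions weigh about a hundred times more in the small early sample, which is why early peaks should be read with some caution.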

Image taken from Google Ngram / Comparing the two search terms rural and urban. As you would expect, the word rural dominates and the term urban only really comes in during the last century, with a dramatic peak in the 1970s.

It is a great way to explore different terms, especially in combination. Even terms that have a similar meaning can show a dramatic divergence on the graphs over time. Basically it shows how trends in language change. Of course the birth of terms can also be observed, as some terms only appear after a certain period or after the object has been invented, as is visible for example in this graph showing the terms used for different rooms in a house or flat. The invention of 'living' in architecture around 1900 brought along the terms 'living room' and 'dining room'.

If you are not satisfied with what you can get from the graphs, Google has made some of the datasets available for download (or here for the two billion time series) and you can have a go at visualising and searching them yourself. Note the file structure though.
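As a starting point for working with the downloads, here is a rough Python sketch that sums the yearly counts for one term. It assumes each line is tab-separated with the n-gram first, then the year, then the match count, possibly followed by further columns. That layout is an assumption on my part, so check it against the description on the dataset page. The function name `yearly_counts` and the sample lines are hypothetical.

```python
import io
from collections import defaultdict

def yearly_counts(lines, term):
    # Assumed layout per line: ngram <TAB> year <TAB> match_count <TAB> further columns
    counts = defaultdict(int)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip lines that do not fit the assumed layout
        ngram, year, match_count = fields[0], fields[1], fields[2]
        if ngram == term:  # exact, case-sensitive match
            counts[int(year)] += int(match_count)
    return dict(counts)

# Made-up sample lines standing in for a downloaded file.
sample = io.StringIO(
    "rural\t1900\t300\t250\t90\n"
    "urban\t1900\t120\t100\t80\n"
    "urban\t1970\t900\t700\t400\n"
    "Urban\t1970\t40\t35\t30\n"
)
urban = yearly_counts(sample, "urban")
```

For a real file you would pass an opened file object instead of the sample, for example `yearly_counts(open(path, encoding="utf-8"), "urban")`, where `path` points to whichever file you downloaded and unpacked.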

Image taken from Google Ngram / Comparing the two search terms twitter and facebook. This is of course ridiculous since both services were launched after 2000, but it is surprising how twitter makes a dramatic appearance during the 1900s.

Via The Atlantic

Jean-Baptiste Michel*, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, William Brockman, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden*. Quantitative Analysis of Culture Using Millions of Digitized Books. Science (Published online ahead of print: 12/16/2010)
