Friday 17 December 2010

On culturomics

Another day when the phone won't stop ringing from correspondents because of a newly reported project involving language. This time it's the so-called Culturomics project, reported on 16 December in the journal Science and picked up in a half-chewed state by several newspapers and radio stations today.

What has happened is that a team of researchers have collaborated with Google Books to present a corpus of nearly 5.2 million digitized books, which they think is around 4 per cent of all published books. The corpus size is 500 billion words, 361 billion being English (the others from six languages - French, Spanish, German, Chinese, Russian, Hebrew). The time frame is 1800 to 2000. This is now available for online searching, and there's a site where you can type in your own words or word-sets and see how they have developed over time. There is a report on the project here. The full report can be read in the journal Science, though you have to register first.

The name culturomics is an odd one, presumably based on ergonomics, economics, and suchlike. They define it as 'the application of high-throughput data collection and analysis to the study of human culture'. Most people in this business I imagine would normally talk of 'cultural history' or 'cultural evolution'. The language side of the project is familiar, as an exercise in historical corpus linguistics. The new term may catch on, as it blends the two notions (culture and language) in a novel way. We'll just have to wait and see.

The news reports have homed in on an analogy the authors make in their paper. They say: 'The corpus cannot be read by a human. If you tried to read only the entries from the year 2000 alone, at the reasonable pace of 200 words/minute, without interruptions for food or sleep, it would take eighty years. The sequence of letters is one thousand times longer than the human genome'. This has led to such headlines as 'Cultural genome project mines Google Books for the secret history of humanity' or (in today's Guardian) 'Google creates a tool to probe "genome" of English words for cultural trends'. But it isn't anything like the human genome, which is the complete genetic account of an individual. Culture doesn't work in that way. The authors themselves don't use the phrase in their paper, and rightly so.

We mustn't exaggerate the significance of this project. It is no more than a collection of scanned books - an impressive collection, unprecedented in its size, and capable of displaying innumerable interesting trends, but far away from entire cultural reality. For a start, this is just a collection of books - no newspapers, magazines, advertisements, or other orthographic places where culture resides. No websites, blogs, social networking sites. No spoken language, of course, so over 90 per cent of the daily linguistic usage of the world isn't here. Moreover, the books were selected from 'over 40 university libraries from around the world', supplemented by some books directly from publishers - so there will be limited coverage of the genres recognized in the categorization systems used in corpus linguistics. They were also, I imagine, books which presented no copyright difficulties. The final choice went through what must have been a huge filtering process. Evidently 15 million books were scanned, and 5 million selected partly on the basis of 'the quality of their OCR' [optical character recognition]. So this must mean that some types of text (those with a greater degree of orthographic regularity) will have been privileged over others.

It's still an impressively large sample, though. So what can we look for? To begin with, note that this is culture, not just lexicology. No distinction is made between dictionary and encyclopedia. Anything that is a string of letters separated by a space [a 1-gram, they call it] can be searched for - including names of people, places, etc. They also searched for sequences of two strings (2-grams) and so on up to five (5-grams). Only items which turned up more than 40 times in the corpus are displayed. So, to take one of their examples, we can search for the usage of 'the Great War' [NB the searches are case sensitive], which peaks in frequency between 1915 and 1941, and for 'World War I', which then takes over. Note that, to achieve a comprehensive result, you would have to repeat the search for orthographic variations (eg 'The' for 'the' or '1' for 'I').
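To make the n-gram idea concrete, here is a minimal sketch in Python. It is not the project's own code: the function names and the sample sentence are invented for illustration, and the only details taken from the description above are the whitespace splitting, the range from 1-grams to 5-grams, the case sensitivity, and the 'more than 40 times' display threshold.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(text, max_n=5):
    """Count all 1-grams up to max_n-grams, splitting on whitespace only."""
    tokens = text.split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts

def displayed(counts, min_count=40):
    """Keep only the n-grams seen more than min_count times."""
    return {gram: c for gram, c in counts.items() if c > min_count}

# Invented sample text: note the two capitalizations of the same phrase.
sample = "the Great War ended and The Great War was remembered"
counts = count_ngrams(sample)

# Case-sensitive counting keeps the two spellings apart.
lower_the = counts[("the", "Great", "War")]   # 1
upper_the = counts[("The", "Great", "War")]   # 1
shared = counts[("Great", "War")]             # 2
```

Because 'the Great War' and 'The Great War' are counted as different 3-grams, a comprehensive figure has to be assembled by adding the variants together.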

A huge problem in doing this kind of thing is punctuation. I know, because I had to deal with it when carrying out a very similar string-related project in online advertising a few years ago. You have to deal with all the ways in which a punctuation mark can interfere with a string - 'radio' is different from 'radio,' for example. The culturonomists have collapsed word fragments at line-endings separated by a hyphen - though there's a problem when a non-omissible hyphen turns up at a line break. And they have treated punctuation marks as separate n-grams - so 'Why?' for example, is treated as 'Why' + '?'. They don't give details of their procedure, but it doesn't seem to work well. I searched for 'Radio 4', for example. The trace showed the usage taking off in the 1970s, as it should, but there were many instances shown before that decade. I found examples listed in the 1930s. How can that be? There was no Radio 4 then. When you click on the dates to see the sources, you find such instances as 'stereo with AM/FM radio, 4 speakers' and 'RADIO 4-INCH BLADE'.
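The following Python sketch shows one way such false hits could arise. It is an assumption for illustration, not a reconstruction of the project's procedure: the tokenizer simply splits punctuation marks off as separate tokens (so 'Why?' becomes 'Why' + '?'), and the hypothetical `loose_hits` function stands in for any search that ignores case and punctuation when matching.

```python
import re

def tokenize(text):
    """Split into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def strict_hits(text, phrase):
    """Count exact, case-sensitive matches of phrase as consecutive tokens."""
    tokens, target = tokenize(text), tokenize(phrase)
    n = len(target)
    return sum(tokens[i:i + n] == target for i in range(len(tokens) - n + 1))

def loose_hits(text, phrase):
    """Count matches after dropping punctuation tokens and folding case."""
    def norm(s):
        return [t.lower() for t in tokenize(s) if re.fullmatch(r"\w+", t)]
    tokens, target = norm(text), norm(phrase)
    n = len(target)
    return sum(tokens[i:i + n] == target for i in range(len(tokens) - n + 1))

# The two snippets quoted above, plus an invented genuine use.
advert = "stereo with AM/FM radio, 4 speakers"
catalogue = "RADIO 4-INCH BLADE"
genuine = "I listen to Radio 4 daily"
```

With these definitions, `strict_hits` finds 'Radio 4' only in the genuine sentence, whereas `loose_hits` also reports one match each in the advert and the catalogue line, which is the kind of stray result the 1930s examples suggest.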

The other big problem is homographs - words which look the same but which have different meanings. This is the biggest weakness in software which tries to do linguistic analysis, and it was the primary focus of the ad project I mentioned above. A news page which reported a street stabbing had ads down the side which read 'Buy your knives here'. The software had failed to distinguish the two senses of 'knife' (cutlery, weapons), and made the wrong association between text and ad inventory. I solved it by developing a notion of semantic targeting which used the full context of a web page to distinguish homographs. The Culturomics project has to solve the same problem, but on a larger scale (books rather than pages), and there is no sign that it has yet tried to do so. So, type 'Apple', say, into their system and you will see a large peak in the 1980s and 1990s - but is this due to the Beatles or the Mac? There's no way of knowing.
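The general idea of using surrounding context to separate homographs can be sketched very crudely in Python. This is a toy illustration only, not the semantic targeting system described above: the cue-word lists and the `guess_sense` function are hypothetical, and a real system would need far richer lexical data.

```python
import re

# Hypothetical cue words for two senses of 'knife' (illustration only).
SENSE_CUES = {
    "cutlery": {"kitchen", "chef", "fork", "spoon", "cooking", "recipe"},
    "weapon": {"stabbing", "police", "attack", "victim", "crime", "arrested"},
}

def guess_sense(page_text, cues=SENSE_CUES):
    """Return the sense whose cue words overlap most with the page.

    Returns None when there is no evidence or the senses tie.
    """
    words = set(re.findall(r"[a-z]+", page_text.lower()))
    scores = {sense: len(words & cue_words) for sense, cue_words in cues.items()}
    best = max(scores.values())
    winners = [sense for sense, score in scores.items() if score == best]
    return winners[0] if best > 0 and len(winners) == 1 else None

# Invented page texts.
news_page = "Police arrested a man after a street stabbing; the victim was hurt by a knife"
cookery_page = "The chef chose a knife and fork for the recipe"
```

Here `guess_sense(news_page)` returns 'weapon' and `guess_sense(cookery_page)` returns 'cutlery', while a page with no cue words at all returns `None`. A raw frequency count of 'knife', or of 'Apple', has no such step, which is why the peak cannot be interpreted.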

The approach, in other words, shows trends but can't interpret or explain them. It can't handle ambiguity or idiomaticity. If your query is unique and unambiguous, you'll get some interpretable results - as in their examples which trace the rise and fall of a celebrity (eg Greta Garbo, peaking in the 1930s). But even here one must be careful. They show Freud more frequent than Einstein, Galileo, and Darwin, and suggest that this is because he is 'more deeply engrained in our collective subconscious' thanks to such everyday phrases as 'Freudian slip'. But which Freud is being picked up in their totals? They assume Sigmund. But what about Lucian, Clement, Anna...?

Linguists will home in on the claims being made about vocabulary growth over time. Evidently their corpus shows 544K words in English in 1900, 597K in 1950, and 1022K in 2000, and they claim that around 8500 words a year have entered English during the last century (though of course only some achieve a permanent presence). These totals are pointing in the right direction, avoiding the underestimates that are common (and incidentally showing yet again how absurd that claim was a year ago about the millionth word entering English). The real figures will of course be much higher, once other genres are taken into account.
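A quick arithmetic check on the quoted totals, sketched in Python. It uses only the three figures given above and assumes simple net growth (words gained minus words lost) averaged evenly over each period; the variable and function names are invented for illustration.

```python
# Vocabulary sizes quoted above, in words.
vocab = {1900: 544_000, 1950: 597_000, 2000: 1_022_000}

def net_growth_per_year(start, end, sizes=vocab):
    """Average net change in vocabulary size per year between two dates."""
    return (sizes[end] - sizes[start]) / (end - start)

first_half = net_growth_per_year(1900, 1950)    # 1060.0
second_half = net_growth_per_year(1950, 2000)   # 8500.0
whole_century = net_growth_per_year(1900, 2000) # 4780.0
```

On these totals, the figure of around 8500 words a year corresponds to the net growth between 1950 and 2000; averaged over the whole century the net rate is lower, at 4780 a year.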

They point out that their totals far exceed the totals in dictionaries, and - one of the most interesting findings reported - say that over half the words in their corpus (52%) are what they call 'lexical dark matter'. These are words that don't make it into dictionaries, because they are uncommon, and dictionaries focus on recording the higher frequency words in a language. Their figure is probably a bit high, as (as mentioned above) this project includes proper names as well as nouns, and nobody would want to say that knowledge of proper names is a sign of linguistic ability. (I am reminded of the old Music Hall joke: 'I say, I say, I say, I can speak French'. 'I didn't know you could speak French. Let me hear you speak French.' 'Bordeaux, Calais, Nice...')

This 'cultural observatory' has given us a fascinating tool to play with, and some interesting discoveries will come out of it, especially when one types competing usages into the Ngram Viewer, such as the choice between alternative forms of a verb (eg dreamed/dreamt). There's nothing new about this, of course, as other corpora have done the same thing; but the scale of the enterprise makes this project different (though limited by its academic library origins). For instance, I typed in 'actually to do' and 'to actually do' to see whether there is a trend in the increasing usage of the split infinitive, and there certainly is, with a dramatic increase since 1980. The spelling of 'judgment' without an 'e' has been steadily falling since the 1920s, with the form with an 'e' having a stronger presence in British English [it is possible to search separately for British and American English]. Enough, already. As with all corpora, it gets addictive.


Fran Hill said...

Fascinating, especially about the semantic links. It often amuses me that if I write a humorous blog post about, say, motherhood, in which I advocate all kinds of unacceptable practices (ironically, of course) to keep a baby quiet, serious ads will pop up as soon as I publish, offering me babycare products. The sudden appearance of these ads always gives me a laugh.

snowden said...

Good post that highlights some of the weaknesses with the culturomics analysis. That being said, some of the issues you raise are, in my opinion, non-issues.

“They don't give details of their procedure, but it doesn't seem to work well.” The method is very painfully detailed in the supplementary material of the paper, which is freely available on the Science journal website.

“There's no way of knowing.” Although approximate, you can use the snippets provided from the Google Books search to estimate how many hits in a given year represent The Beatles or the Mac.

“They assume Sigmund. But what about Lucian, Clement, Anna...?” Their assumption makes sense in the light of the relative importance of the 2grams (see,+Lucian+Freud,+Clement+Freud,+Anna+Freud&year_start=1800&year_end=2000&corpus=0&smoothing=3). Furthermore, the 1-gram name analysis was not a part of the paper per se, only of the half chewing, as you so well put it.

Ben Zimmer said...

“The name culturomics is an odd one, presumably based on ergonomics, economics, and suchlike.”

No, it's based on genomics, which has spawned various other X-omics.

DC said...

Thanks for these clarifications.

Mark Davies said...

Corpus of Historical American English.

-- 400 million words, 1810s-2000s.
-- Allows for many types of searches that Google Books can't:
* accurate frequency of words and phrases by decade and year
* changes in word forms (via wildcard searches)
* grammatical changes (because corpus is "tagged" for part of speech)
* changes in meaning (via collocates; "nearby words")
* show all words that are more common in one decade than another
* integrate synonyms and customized word lists into queries
* etc etc etc
-- Funded by the National Endowment for the Humanities (NEH), 2009-2011.

Take a look at the "Compare to Google/Archives" link off the first page.

DC said...

Based on genome... Not the most felicitous choice of a term, it seems to me. It's bound to lead to confusion, as morphologically this ending has a short vowel, and that's how people will automatically say it.