DCblog: December 2010

Monday, 20 December 2010

On me/my being right

A correspondent writes about an earlier post headed 'On Shakespeare being Irish', worrying about the grammar rather than the content. Shouldn't it be 'On Shakespeare's being Irish', he asks? 'Has grammar changed?', he adds.

No, it hasn't - at least, not in the last 200 or so years. As with many issues of this kind, the arguments go back to the 18th century and the rise of prescriptivism. The construction without the possessive is the older one, and can be traced back to the Middle Ages. But the one with the possessive was felt to be more elegant and grammatically correct, and it was given the strongest possible support by Fowler (in his 1926 Dictionary). Indeed, rarely does Fowler attack a usage more intensely than in his entry on what he calls the 'fused participle'. A brief quotation:

'It is perhaps beyond hope for a generation that regards upon you giving as normal English to recover its hold upon the truth that grammar matters. Yet every just man who will abstain from the fused participle (as most good writers in fact do, though negative evidence is naturally hard to procure) retards the process of corruption; & it may therefore be worth while to take up again the statement made above, that the construction is grammatically indefensible.'

Not surprisingly, then, the issue rumbles on.

The two constructions actually express slightly different meanings. The non-possessive one highlights the verb phrase, whereas the possessive one highlights the noun phrase. In 'On Shakespeare being Irish', it's the 'being Irish' that is the focus. It's thus more likely to be used in a context where the implied contrast is with some other verb phrase, such as 'being Welsh' or 'being English'. In 'On Shakespeare's being Irish', the person is the focus, so it's more likely to be used where there is a contrast with someone else. I used the first construction in my post, because the content was on the interpretation of original pronunciation, not on the person using it.

However, the prescriptive attitude has had an effect, in that over the years the use of the possessive has come to be associated with formal expression. There's therefore a stylistic contrast involved, with the non-possessive form sounding more informal. This is especially the case when the participial form is used as the subject of a clause, as in 'Going by train is out of the question', where we have the choice of:

John's going by train is out of the question.
John going by train is out of the question.

The stylistic contrast is especially noticeable when there's an initial pronoun:

My going by train is out of the question.
Me going by train is out of the question.

The contentious character of the non-possessive construction is lessened if it is 'buried' later in the sentence:

It is out of the question, my going by train.
It is out of the question, me going by train.

This is presumably why my post heading was noticed. The style I use ('On X') keeps the usage in initial position. If I'd headed the post 'On discussing the argument about Shakespeare being Irish', I wonder if my correspondent would have picked up on the point?

Friday, 17 December 2010

On culturomics

Another day when the phone won't stop ringing from correspondents because of a newly reported project involving language. This time it's the so-called Culturomics project, reported on 16 December in the journal Science and picked up in a half-chewed state by several newspapers and radio stations today.

What has happened is that a team of researchers have collaborated with Google Books to present a corpus of nearly 5.2 million digitized books, which they think is around 4 per cent of all published books. The corpus size is 500 billion words, 361 billion being English (the others from six languages - French, Spanish, German, Chinese, Russian, Hebrew). The time frame is 1800 to 2000. This is now available for online searching, and there's a site where you can type in your own words or word-sets and see how they have developed over time. There is a report on the project here. The full report can be read in the journal Science, though you have to register first.

The name culturomics is an odd one, presumably based on ergonomics, economics, and suchlike. They define it as 'the application of high-throughput data collection and analysis to the study of human culture'. Most people in this business I imagine would normally talk of 'cultural history' or 'cultural evolution'. The language side of the project is familiar, as an exercise in historical corpus linguistics. The new term may catch on, as it blends the two notions (culture and language) in a novel way. We'll just have to wait and see.

The news reports have homed in on an analogy the authors make in their paper. They say: 'The corpus cannot be read by a human. If you tried to read only the entries from the year 2000 alone, at the reasonable pace of 200 words/minute, without interruptions for food or sleep, it would take eighty years. The sequence of letters is one thousand times longer than the human genome'. This has led to such headlines as 'Cultural genome project mines Google Books for the secret history of humanity' or (in today's Guardian) 'Google creates a tool to probe "genome" of English words for cultural trends'. But it isn't anything like the human genome, which is the complete genetic account of an individual. Culture doesn't work in that way. The authors themselves don't use the phrase in their paper, and rightly so.
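A rough check of the scale implied by that analogy can be sketched in a few lines of Python. The reading speed and the eighty years come from the quotation; the function name and the 365-day year are assumptions made only for this illustration.

```python
# Assumption: a 365-day year, reading without any breaks.
MINUTES_PER_YEAR = 60 * 24 * 365


def words_read(years, words_per_minute=200):
    """Words covered by reading non-stop for the given number of years."""
    return int(years * MINUTES_PER_YEAR * words_per_minute)


implied_words = words_read(80)
print(f"{implied_words:,}")  # 8,409,600,000
```

On these assumptions, the entries for the year 2000 alone would run to something over 8 billion words.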

We mustn't exaggerate the significance of this project. It is no more than a collection of scanned books - an impressive collection, unprecedented in its size, and capable of displaying innumerable interesting trends, but far away from entire cultural reality. For a start, this is just a collection of books - no newspapers, magazines, advertisements, or other orthographic places where culture resides. No websites, blogs, social networking sites. No spoken language, of course, so over 90 per cent of the daily linguistic usage of the world isn't here. Moreover, the books were selected from 'over 40 university libraries from around the world', supplemented by some books directly from publishers - so there will be limited coverage of the genres recognized in the categorization systems used in corpus linguistics. They were also, I imagine, books which presented no copyright difficulties. The final choice went through what must have been a huge filtering process. Evidently 15 million books were scanned, and 5 million selected partly on the basis of 'the quality of their OCR' [optical character recognition]. So this must mean that some types of text (those with a greater degree of orthographic regularity) will have been privileged over others.

It's still an impressively large sample, though. So what can we look for? To begin with, note that this is culture not just lexicology. No distinction is made between dictionary and encyclopedia. Anything that is a string of letters separated by a space [a 1-gram, they call it] can be searched for - including names of people, places, etc. They also searched for sequences of two strings (2-grams) and so on up to five [5-grams]. Only items which turned up more than 40 times in the corpus are displayed. So, to take one of their examples, we can search for the usage of 'the Great War' [NB the searches are case sensitive], which peaks in frequency between 1915 and 1941, and for 'World War I', which then takes over. Note that, to achieve a comprehensive result, you would have to repeat the search for orthographic variations (eg 'The' for 'the' or '1' for 'I').
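The n-gram idea described above can be sketched as follows. This is a minimal illustration, not the project's own code: the function names are hypothetical, the sample sentence is an invented toy text, and splitting on whitespace alone is a simplifying assumption.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(text, max_n=5):
    """Count all 1-grams up to max_n-grams, splitting on whitespace only."""
    tokens = text.split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


def displayed(counts, min_count=40):
    """Keep only the n-grams that occur more than min_count times."""
    return {gram: c for gram, c in counts.items() if c > min_count}


def variants_total(counts, variants):
    """Add up the counts of several spellings, as matching is case sensitive."""
    return sum(counts[tuple(v.split())] for v in variants)


# Invented toy text, far too small for the real threshold of 40.
sample = "the Great War ended . The Great War was long . the Great War"
counts = count_ngrams(sample)

print(counts[("the", "Great", "War")])                            # 2
print(counts[("The", "Great", "War")])                            # 1
print(variants_total(counts, ["the Great War", "The Great War"]))  # 3
```

Because 'the' and 'The' are counted as different strings, a single search undercounts the phrase; the last line shows the kind of repeated search over variants that a comprehensive result needs.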

A huge problem in doing this kind of thing is punctuation. I know, because I had to deal with it when carrying out a very similar string-related project in online advertising a few years ago. You have to deal with all the ways in which a punctuation mark can interfere with a string - 'radio' is different from 'radio,' for example. The culturonomists have collapsed word fragments at line-endings separated by a hyphen - though there's a problem when a non-omissible hyphen turns up at a line break. And they have treated punctuation marks as separate n-grams - so 'Why?', for example, is treated as 'Why' + '?'. They don't give details of their procedure, but it doesn't seem to work well. I searched for 'Radio 4', for example. The trace showed the usage taking off in the 1970s, as it should, but there were many instances shown before that decade. I found examples listed in the 1930s. How can that be? There was no Radio 4 then. When you click on the dates to see the sources, you find such instances as 'stereo with AM/FM radio, 4 speakers' and 'RADIO 4-INCH BLADE'.
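The two punctuation decisions just described can be illustrated with a small sketch. Since the project's procedure isn't given, everything here is an assumption: the function names are hypothetical and the regular expressions are just one simple way of doing each step.

```python
import re


def join_line_end_hyphens(text):
    """Rejoin a word split by a hyphen at a line ending, dropping the hyphen."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def tokenize(text):
    """Split into words and punctuation marks, each mark a separate token."""
    return re.findall(r"\w+|[^\w\s]", text)


def drop_punctuation(tokens):
    """Discard the punctuation tokens, keeping only words and numbers."""
    return [t for t in tokens if re.match(r"\w", t)]


def has_bigram(tokens, first, second):
    """True if first is immediately followed by second in the token list."""
    return any(a == first and b == second for a, b in zip(tokens, tokens[1:]))


print(tokenize("Why?"))                       # ['Why', '?']
print(join_line_end_hyphens("inter-\nfere"))  # interfere
print(join_line_end_hyphens("well-\nknown"))  # wellknown

tokens = tokenize("stereo with AM/FM radio, 4 speakers")
print(has_bigram(tokens, "radio", "4"))                    # False
print(has_bigram(drop_punctuation(tokens), "radio", "4"))  # True
```

The third line shows the non-omissible hyphen problem: 'well-known' loses a hyphen it should have kept. The last two lines show one conceivable way in which 'radio, 4' could end up counted as a 2-gram, namely if punctuation tokens are dropped at some stage; this is a guess, not a claim about what the project actually does. Note that 'RADIO 4-INCH BLADE' gives 'RADIO' followed by '4' under this tokenizer whether punctuation is kept or not, because the hyphen comes after the 4.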

The other big problem is homographs - words which look the same but which have different meanings. This is the biggest weakness in software which tries to do linguistic analysis, and it was the primary focus of the ad project I mentioned above. A news page which reported a street stabbing had ads down the side which read 'Buy your knives here'. The software had failed to distinguish the two senses of 'knife' (cutlery, weapons), and made the wrong association between text and ad inventory. I solved it by developing a notion of semantic targeting which used the full context of a web page to distinguish homographs. The Culturomics project has to solve the same problem, but on a larger scale (books rather than pages), and there is no sign that it has yet tried to do so. So, type 'Apple', say, into their system and you will see a large peak in the 1980s and 1990s - but is this due to the Beatles or the Mac? There's no way of knowing.
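The general idea of using context to separate homographs can be shown with a deliberately crude sketch. This is not the semantic targeting approach mentioned above, whose details aren't given here; it is simple word overlap, and the cue words, sense labels, and function name are all hypothetical.

```python
import re

# Hypothetical cue words for two senses of 'knife'; a real system would need
# far richer sense descriptions than these hand-picked sets.
SENSE_CUES = {
    "cutlery": {"kitchen", "fork", "spoon", "chef", "cooking", "dinner"},
    "weapon": {"stabbing", "attack", "police", "victim", "crime", "wounded"},
}


def guess_sense(page_text, sense_cues=SENSE_CUES):
    """Pick the sense whose cue words overlap most with the page text.

    Returns None if no cue word appears; ties go to the first sense listed.
    """
    words = set(re.findall(r"[a-z]+", page_text.lower()))
    scores = {sense: len(words & cues) for sense, cues in sense_cues.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_sense("Police say the victim was wounded in a street stabbing"))  # weapon
print(guess_sense("A chef needs a good kitchen knife for cooking"))           # cutlery
print(guess_sense("Apple"))                                                   # None
```

The last call makes the point of the paragraph: a bare string with no surrounding context gives nothing to go on.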

The approach, in other words, shows trends but can't interpret or explain them. It can't handle ambiguity or idiomaticity. If your query is unique and unambiguous, you'll get some interpretable results - as in their examples which trace the rise and fall of a celebrity (eg Greta Garbo, peaking in the 1930s). But even here one must be careful. They show Freud more frequent than Einstein, Galileo, and Darwin, and suggest that this is because he is 'more deeply engrained in our collective subconscious' thanks to such everyday phrases as 'Freudian slip'. But which Freud is being picked up in their totals? They assume Sigmund. But what about Lucian, Clement, Anna...?

Linguists will home in on the claims being made about vocabulary growth over time. Evidently their corpus shows 544K words in English in 1900, 597K in 1950, and 1022K in 2000, and they claim that around 8500 words a year have entered English during the last century (though of course only some achieve a permanent presence). These totals are pointing in the right direction, avoiding the underestimates that are common (and incidentally showing yet again how absurd that claim was a year ago about the millionth word entering English). The real figures will of course be much higher, once other genres are taken into account.
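The growth rate can be checked against the three totals quoted above. This is plain arithmetic on those figures, nothing more; the function name is hypothetical, and the result is a net change, so it says nothing about words that came and went in between.

```python
# Vocabulary totals as quoted above (K taken to mean thousand).
VOCAB = {1900: 544_000, 1950: 597_000, 2000: 1_022_000}


def net_words_per_year(start, end, vocab=VOCAB):
    """Average net yearly change in vocabulary size between two dates."""
    return (vocab[end] - vocab[start]) / (end - start)


print(net_words_per_year(1900, 1950))  # 1060.0
print(net_words_per_year(1950, 2000))  # 8500.0
print(net_words_per_year(1900, 2000))  # 4780.0
```

On these totals, 8500 a year matches the net rise between 1950 and 2000; averaged over the whole century the net figure is 4780.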

They point out that their totals far exceed the totals in dictionaries, and - one of the most interesting findings reported - say that over half the words in their corpus (52%) are what they call 'lexical dark matter'. These are words that don't make it into dictionaries, because they are uncommon, and dictionaries focus on recording the higher frequency words in a language. Their figure is probably a bit high, as (as mentioned above) this project includes proper names as well as nouns, and nobody would want to say that knowledge of proper names is a sign of linguistic ability. (I am reminded of the old Music Hall joke: 'I say, I say, I say, I can speak French'. 'I didn't know you could speak French. Let me hear you speak French.' 'Bordeaux, Calais, Nice...')

This 'cultural observatory' has given us a fascinating tool to play with, and some interesting discoveries will come out of it, especially when one types competing usages into the Ngram Viewer, such as the choice between alternative forms of a verb (eg dreamed/dreamt). There's nothing new about this, of course, as other corpora have done the same thing; but the scale of the enterprise makes this project different (though limited by its academic library origins). For instance, I typed in 'actually to do' and 'to actually do' to see whether there is a trend in the increasing usage of the split infinitive, and there certainly is, with a dramatic increase since 1980. The spelling of 'judgment' without an 'e' has been steadily falling since the 1920s, with the form with an 'e' having a stronger presence in British English [it is possible to search separately for British and American English]. Enough, already. As with all corpora, it gets addictive.

Tuesday, 14 December 2010

On being a champion of - what?

Several correspondents, having read Michael Rosen's generous piece about me in this week's Guardian, have asked what I think about being called, as the headline put it, 'the champion of the English language'.

Well, my first thought was: not just English. If I try to champion anything, it is language, and specifically languages, and most specifically, endangered languages. English is a language, so it gets championed. It also happens to be the language which I chose to specialize in, years ago, so in that sense I guess I'm identified with it more than any other. But I'd be sad if anyone thought to interpret the headline as if it meant that I was supporting English at the expense of other languages. In fact I probably spend more time these days making the case for the importance of modern languages, and trying to get endangered languages projects off the ground. The Threlford lecture I gave a few months ago to the Institute of Linguists was entirely on that subject, for example, as will be a lecture to the British Academy next February. And we are still a long way from the goal of having 'houses' of language(s) presenting global linguistic diversity in all its glory. The first to open, as regular readers of this blog know, will be the 'House of Languages' in Barcelona (see the website at Linguamon) - a project I know very well, as I've been chair of its scientific advisory committee from the beginning. I've tried, and failed, twice, to get a similar project off the ground in the UK. One keeps trying.

Another first is the event with which Michael ended his piece: the 'Evolving English' exhibition at the British Library. This is indeed an amazing exhibition, and it was a privilege to be associated with it. It is like having the history of English brought to life. A significant number of the important texts always instanced in histories of the language are in the same room. You are greeted by the glorious Undley bracteate. You find yourself within inches of the Beowulf manuscript. In one cabinet you can see, side by side, the Wycliffe Bible, the Tyndale fragment, the Book of Common Prayer, and the King James Bible. The curators have been ingenious, not to say cheeky: in another cabinet you will see the first English conversation, Aelfric's Colloquy; next to it is a manuscript of Harold Pinter. Everywhere you look there are headphones. A visit is not just a visual experience. The Library has an excellent collection of sound recordings, and great efforts have been made to provide an analogous audio experience for the texts of the past. If you are passing through London between now and 3 April 2011, visit this exhibition. There won't be another for a long long time.

I was the lead consultant for the exhibition - not the curator, as some online sources have put it (the three curators are British Library staff) - and wrote the accompanying book. This isn't, incidentally, a 'catalogue' of the exhibition, as some reports have suggested. It did begin as an attempt to reflect what would be in the exhibition, but it had to go to press some six months before the exhibition opened, and in the interim other decisions were made about what it was practicable to show. There were some very large display items that it would have been silly to try to fit into a book (World War I posters, for example); and conversely, there were some items that worked well in a book but which were simply too fragile to put on public display. Also, none of the audio items could go into the book - though several are available online, in the Timeline section of the Library website. There's about a two-thirds overlap in content between book and exhibition.

What's really noticeable, when you enter the exhibition, is the lack of a single chronology. Rather, what you see is a series of themes - the evolution of Standard English, local dialects of English, international varieties of English, everyday English, English in the workplace, English at play. The message is plain: there is no one 'story' of English, there are many, each of which has its own validity. It is the driving force behind my The Stories of English, which I used as the guideline for my initial proposals to the Library as to what should be in an exhibition, when the project was first mooted three years ago. What I hope, more than anything else, is that the exhibition will, through its physicality, demonstrate, more than any textbook could, the way the language thrives through its multifaceted character. We see Standard English strongly represented - the prestige dialect of the language, the criterion of linguistic educatedness and the means of achieving national and international intelligibility, especially in writing. At the same time, we see regional dialects and other varieties of nonstandard English strongly represented - the varieties which express local, national, and international identity, and which are actually used by the vast majority of English speakers around the world. The atmosphere in the Library is one of mutual respect.

It would be nice to think that this atmosphere will remain after the exhibition is gone, and perhaps it will, through the book and the website. Linguistic climate change there still needs to be. The comments that followed Michael Rosen's article clearly indicate this. There is a great deal of mythology still around - for example, the unfounded belief that linguists say that 'anything goes', when it comes to language teaching in class. Readers of this blog with very long memories will recall that this was something John Humphrys said about me. He eventually apologised, in The Spectator, saying that he was only a journalist, and the role of the journalist was to simplify and exaggerate. But such simplifications and exaggerations do a great deal of harm. So, for the record, once again, and hopefully for the last time: I have never said that 'anything goes' when it comes to language. Read my lips. I have never said that 'anything goes' when it comes to language. Nor do I know of any linguist who has said such a thing. The whole point of sociolinguistics, pragmatics, and the other branches of linguistics which study language in use is actually to show that 'anything does not go'. The only people who use the phrase 'anything goes' are prescriptivists desperately trying to justify their prejudices.

If people want to find out about my educational linguistic philosophy they will find it expounded, for example, at the end of The Stories of English and in various chapters of The Fight for English. It can be summarized as follows. It is the role of schools to prepare children for the linguistic demands that society places upon them. This means being competent in Standard English as well as in the nonstandard varieties that form a part of their lives and which they will frequently encounter outside their home environment in modern English literature, in interactions with people from other parts of the English-speaking world, and especially on the internet. They have to know when to spell and punctuate according to educated norms, and when it is permissible not to do so. In a word, they have to know how to manage the language - or to be masters of it (as Humpty Dumpty says to Alice in Through the Looking Glass). And, one day, to be champions of it - all of it.