Wednesday, 22 April 2009

On the biggest load of rubbish...

The phone hasn't stopped ringing this week. An American organization is claiming that the English language is just about to get its millionth word. They've even suggested a day when this will happen. It's the biggest load of rubbish I've heard in years. But it's attracted a huge amount of publicity.

All it means is that the algorithm they've been using to track English words has finally reached a million. But the English language passed a million words years ago. Way back in the 1980s, the OED had well over half a million words in its collection. Webster had around half a million. And if you made a comparison of the two (as I did when I was writing The Cambridge Encyclopedia of the English Language) you could see straight away that the coverage was by no means the same. I estimated then that there was about a third difference in coverage between the two dictionaries, due largely to the OED's historical remit - so from these two projects alone there was evidence of some three-quarters of a million words in English.
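For anyone who wants to see the arithmetic, here's a back-of-the-envelope sketch (in Python, purely for illustration). The figures are the rough ones just mentioned, and the overlap is only one way of reading the 'about a third' estimate, not a precise count.

# Rough union estimate for two dictionary headword lists, using the
# inclusion-exclusion principle. The figures are approximate 1980s
# estimates, not exact counts.
oed = 600_000       # OED headwords ('well over half a million')
webster = 500_000   # Webster headwords ('around half a million')

# Reading 'about a third difference in coverage' as: roughly a third of
# the smaller list has no counterpart in the larger one.
overlap = int(webster * 2 / 3)        # about 333,000 shared headwords

union = oed + webster - overlap
print(f"{union:,}")                   # about 767,000 -- 'some three-quarters of a million'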

I then did some comparisons with technical dictionaries, such as dictionaries of botany and linguistics. Most of the really specialized terms in those books weren't in either the OED or Webster. I reached a million very quickly, and it was obvious that this was a task without end. Something like 80 per cent of the vocabulary of English is scientific and technical. There are over a million insect species in the world, for example, and English presumably has words for most of them - even if many of them are Latin loan words. At the same time I also looked at Gale's Dictionary of Abbreviations. There were over half a million of those.

And we haven't even started talking about the spoken language yet. Dictionaries traditionally base themselves on the written language. That's where they get their citations from. But we all know that there are thousands of words in everyday speech which never get recorded in dictionaries - slang, argot, colloquialisms of all kinds (such as the hundreds of words for saying you're drunk). If the American firm is relying on a trawl of internet sources for its database, it's missing out on all of that. And, of course, it's ignoring all the developing 'new Englishes' that exist in largely spoken form around the world. Dictionaries of South African, Jamaican, and other regional Englishes routinely contain 10, 15, 20 thousand or more items. And each editor acknowledges that there are many more 'out there'.

Even if it were possible to ascertain coverage, there's the methodological question of what counts as a word. This is an old chestnut for linguists, but computer firms still ignore it. Flowerpot is one word, but so are flower-pot and flower pot. Will these be counted as one word or two? No computer programme can yet identify all compounds efficiently - let alone idioms such as kick the bucket - and there are tens of thousands of these. Nor can they cope with the problem of distinguishing between words and names. David Crystal isn't a word in the English language in the usual sense; but White House is, in its sense of 'US government'.
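To make the point concrete, here's a tiny Python illustration (the sentence, the variant list, and the idiom list are invented for the purpose): a naive tokenizer treats the three spellings as four different items, and only hand-supplied lexical knowledge could pull them back together.

import re

# An invented sentence with three spellings of one lexical item, plus an idiom.
text = ("She put the flowerpot, the flower-pot and the flower pot "
        "by the door, then kicked the bucket.")

tokens = re.findall(r"[a-z]+(?:-[a-z]+)?", text.lower())
types = set(tokens)

print(sorted(t for t in types if "flower" in t or t == "pot"))
# ['flower', 'flower-pot', 'flowerpot', 'pot'] -- one lexical item counted as four

# Pulling them back together needs knowledge the raw count doesn't have:
# a hand-made list of spelling variants and multiword expressions.
VARIANTS = {"flower-pot": "flowerpot", "flower pot": "flowerpot"}
IDIOMS = ["kick the bucket"]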

The distinction between 'words' and 'lexemes' is critical when you're studying vocabulary. If we count Shakespeare's words, in the grammatical sense, we get around 30,000. If we count Shakespeare's lexemes, we get less than 20,000. A million words is not the issue; a million lexemes is. But I don't know of any computer algorithm which can identify lexemes efficiently. Even linguists with much more powerful brains than computers have got have trouble with the concept sometimes.
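Here's a toy Python sketch of the distinction; the word forms and the lemma table are hand-made for illustration, since a real lemmatizer would need full morphological knowledge.

# Ten grammatical word forms, but only two lexemes: GO and TAKE.
forms = "go goes going went gone take takes taking took taken".split()

# A hand-made lemma table standing in for real morphological analysis.
LEMMAS = {
    "goes": "go", "going": "go", "went": "go", "gone": "go",
    "takes": "take", "taking": "take", "took": "take", "taken": "take",
}

word_count = len(set(forms))                           # 10 word forms
lexeme_count = len({LEMMAS.get(f, f) for f in forms})  # 2 lexemes
print(word_count, lexeme_count)                        # prints: 10 2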

A few years ago, world population passed 6 billion. One paper even claimed to have found the 6 billionth child. It was an intriguing idea, which probably sold a few papers, but we all knew it was nonsense. Claiming to find the millionth word is the same - an intriguing idea, and extra PR for the US firm. But it's still nonsense.

17 comments:

Alex Case said...

Very well put. Language Log did an analysis of the reasons behind this ridiculous story that is also interesting:

http://languagelog.ldc.upenn.edu/nll/?p=972

DC said...

Yes, linguists have been pointing out the ridiculousness for some time, and that link is well worth reading. Thanks for mentioning it.

Paul said...

"David Crystal isn't a word in the English language in the usual sense;"

...surely it's only a matter of time...

But yes, I saw this report on BBC News the other morning. I'm sure they will be reporting back on the subject whenever the millionth word magically appears, which will be "June 10th, 2009 at 10:22 am (Stratford-on-Avon Time)" according to the GLM website.

Maybe you should make an appearance then, to refute their claims?

DC said...

It was 29 April the other day - and various other dates have been proposed in the past. I'm not holding my breath.

Paul JJ Payack said...

From Paul JJ Payack of the Global Language Monitor:

Professor Crystal:

You are indeed correct that there are many more than one million English words. There are millions of chemicals, with estimates that range up to 70 million, as well as 600,000 species of fungus, 45 million domain names, and all the rest.

When we began this endeavor several years ago, we began in the same way that you described, comparing the great English dictionaries, estimating the overlap, then attempting to find jargon that had passed into the mainstream, adding in the -Lishes, and the like.

In 2003, we made our original calculations about word creation, which were based on our research into the historical rates of English language word creation. We also factored in the number of people speaking Global English as a first, second or business language, which had blossomed to some 1.3 billion, according to the best estimates at the time.

The date changed a few times, because, as we refined our research, we saw that our published count had gotten ahead of the original mathematical projection, and made the necessary adjustments. (We came back to our original estimate of the number of words and rate of word creation and adjusted the count accordingly.) The variance of 36 months mathematically equals .0021 in the life of a 1,400-year-old language.

We did not intend to make an announcement of the word count. However, the New York Times, in January 2006, mentioned our estimate in passing. This was in an article where they used our PQI to help determine whether the words being used in the New York real estate marketplace were predicting that the bubble would burst. (Of course, the real estate world favored the term 'soft landing' at the time.)

Perhaps I should briefly explain the Predictive Quantities Indicator (PQI). This is simply an algorithm or series of mathematical equations that I invented to measure the rate and extent of word usage. It also tracks direction, velocity, and momentum in the adoption of new words and phrases. And it can be used to look back in time. The methodology is summarized on our site in a 20-page document that's ready for download. (Hundreds have done so.) We have also discussed the PQI in detail with government agencies, academic institutions around the world (most recently in China), and investment banks, corporations, and the like. Recently, I've been invited to write about it in one of the top statistical journals. We invite anyone who is interested to sign a non-disclosure, which is typical in Silicon Valley to protect proprietary intellectual property.

I note this paragraph from the "Number of Words in the English Language" essay that has been on the site for many years:

"The central idea of writing is, of course, the idea. Ideas by their very nature are wispy sorts of things. This being so, you can’t grab an idea and do with it what you will. Rather the best for which one can hope is to encapsulate the idea and preserve it for time immemorial in some sort of ethereal amber. We call this amber, language; the basic building block of which is, of course, the word. (We are speaking now as poets and not as linguists.)"

We point out all this to every member of the media that calls on us, especially the fact that the number is an estimate.

Invariably, they have already read all the arguments, sometimes being contacted directly by one of the linguists. But upon researching the Global Language Monitor site, they have concluded that there is a story worth telling, even as they note the caveats -- which we strive to make plain.

One critic facetiously asked if the media no longer employ fact checkers. The answer is that they do, in fact, employ fact checkers and the facts convince them to move forward with the story.

The Million Word March is meant to celebrate the richness, cultural diversity and the dynamic growth of English, which has become the first truly global language.

If you (or anyone else) would like to discuss any of this directly, please contact me at pjjp@post.harvard.edu. I will make myself immediately available.

Sincerely,

PJJP

PS I will be in my office in Austin on June 10th, but would be delighted for you to head out to Stratford-on-Avon. After you explain your position that there are many more millions of English words, they will revel all the more in the wonder of the English language.

DC said...

Just as one mustn't underestimate, so one mustn't overestimate either. The traditional distinction between language and encyclopedic knowledge has to be respected. Just because one knows the names of many places in Germany does not mean that one can speak German. So anything that crosses linguistic boundaries in this way is irrelevant as far as a language's wordstock is concerned. Internet domain names are a case in point. Scientific names are trickier to decide about, as they include international words (in Latin), Latin loans, and various kinds of translation.

What puzzles me more than anything else, I must say, is why the media find the notion of a millionth word so gripping, or why anyone should want to establish it in the first place. It's not as if anything of interest follows from the 'discovery'. And there are plenty of more interesting ways to celebrate the growth of a language, if that is what one wants to do.

Or, of course, languages. For, insofar as a culture is scientifically literate, its wordstock too will of course be over a million.

Barrie England said...

I had forgotten the subject of this blog. When I looked at the title again I immediately thought of Strunk and White. This year sees the 50th anniversary of the publication of that remarkable book. Anything to add to Geoffrey Pullum’s excellent demolition job at http://chronicle.com/free/v55/i32/32b01501.htm?

DC said...

You're right. That's another candidate for the title. And Geoff sums it up brilliantly. Nothing to add.

Anonymous said...

"Even linguists with much more powerful brains than computers have got have trouble with the concept sometimes."
I don't understand "have got have trouble" - Is this a usage I am not familiar with, or perhaps a mistake?

DC said...

An uncorrected blend of 'have got into trouble' and 'have trouble', I suspect. A typo.

But, who knows! Maybe it'll become worldwide usage one day? If it does, you saw it here first.

mceupc said...

Dear Professor David Crystal,

I confess I didn't have any "trouble" while reading the sentence: "Even linguists with much more powerful brains than computers have got have trouble with the concept sometimes." In my view (contrary to the previous blogger), it was clear enough. I made a short pause at "have got" (as at a comma), then went on reading. Do you think this is acceptable?
By the way, I found your previous answer very intelligent and spontaneous.
Dear Professor David Crystal, I would like to congratulate you on your brilliant session at the APPI Conference, Lisbon, last week. Absolutely enjoyable! We always learn a lot from your unique top stories! Thank you very much indeed.

DC said...

Now you point it out, it does read like the sort of rethink that would go on in speech (I've talked about this before on the blog, in relation to anacoluthon). I wouldn't let it stand in writing, though, without some sort of punctuation, such as a dash, to reflect the pause.

Glad you enjoyed the APPI event. Thank you.

Stanley said...

Many thanks for this post - I had heard about the hype over the alleged millionth word and thought that it was a rather strange concept. Nevertheless, it is curious that so many people should show an interest in it, even if, as you say, there is nothing of real interest in it - it demonstrates what you have said on many occasions previously: that everyone is interested in language.

Incidentally, I think the confusion with that other sentence arose because of "much more powerful brains than computers have got" being a sort of single noun entity in itself, thus: "Even linguists with [much more powerful brains than computers have got] have trouble with the concept sometimes." Replace everything in square brackets with "rhubarb" and the sentence structure is clear (if bizarre).

DC said...

That's a more satisfying explanation, certainly. I hope that's what I meant!

Technopat said...

Great read! Thanx 4 that - and loads of other stuff. Long may the words keep coming to you...

Keep on de-hyping!

Cheers!

Paul from France said...

Even more interesting: how many semes are there in English? This bears on the question of implicit, contextual and explicit meaning from one language to the next (from a Latin/French point of view).
Also a question that crops up in technical translation: how many different words can be used to describe the same object (polyreference as opposed to polysemy?).
An amateur linguist (and translator).

DC said...

Agreed. The question of how many senses is much more interesting, especially in relation to such questions as 'how linguistically innovative was Shakespeare?'.