The phone hasn't stopped ringing this week. An American organization is claiming that the English language is just about to get its millionth word. They've even suggested a day when this will happen. It's the biggest load of rubbish I've heard in years. But it's attracted a huge amount of publicity.
All it means is that the algorithm they've been using to track English words has finally reached a million. But the English language passed a million words years ago. Way back in the 1980s, the OED had well over half a million words in its collection. Webster had around half a million. And if you made a comparison of the two (as I did when I was writing The Cambridge Encyclopedia of the English Language) you could see straight away that the coverage was by no means the same. I estimated then that there was about a third difference in coverage between the two dictionaries, due largely to the OED's historical remit - so from these two projects alone there was evidence of some three-quarters of a million words in English.
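The arithmetic behind that estimate can be sketched out. This is an illustration only, using the rough round figures above rather than exact dictionary counts:

```python
# Rough sketch of the coverage arithmetic: illustrative round numbers,
# not exact dictionary entry counts.

oed = 600_000      # OED: "well over half a million" words
webster = 500_000  # Webster: "around half a million"

# Suppose about a third of the OED's coverage is not found in Webster.
unique_to_oed = oed // 3
shared = oed - unique_to_oed

# Inclusion-exclusion: distinct words across the two dictionaries combined.
combined = oed + webster - shared

print(combined)  # around 700,000 - of the order of three-quarters of a million
```

The exact total depends on how the "third difference" is measured, but any reasonable reading of the figures lands in the same ballpark: far more words than either dictionary holds alone.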
I then did some comparisons with technical dictionaries, such as dictionaries of botany and linguistics. Most of the really specialized terms in those books weren't in either the OED or Webster. I reached a million very quickly, and it was obvious that this was a task without end. Something like 80 per cent of the vocabulary of English is scientific and technical. There are over a million species of insect in the world, for example, and English presumably has words for most of them - even if many are Latin loan words. At the same time I also looked at Gale's Dictionary of Abbreviations. There were over half a million of those.
And we haven't even started talking about the spoken language yet. Dictionaries traditionally base themselves on the written language. That's where they get their citations from. But we all know that there are thousands of words in everyday speech which never get recorded in dictionaries - slang, argot, colloquialisms of all kinds (such as the hundreds of words for saying you're drunk). If the American firm is relying on a trawl of internet sources for its database, it's missing out on all of that. And, of course, it's ignoring all the developing 'new Englishes' that exist in largely spoken form around the world. Dictionaries of South African, Jamaican, and other regional Englishes routinely contain 10, 15, 20 thousand or more items. And each editor acknowledges that there are many more 'out there'.
Even if it were possible to ascertain coverage, there's the methodological question of what counts as a word. This is an old chestnut for linguists, but computer firms still ignore it. Flowerpot is one word, but so are flower-pot and flower pot. Will these be counted as one word or three? No computer program can yet identify all compounds efficiently - let alone idioms such as kick the bucket - and there are tens of thousands of these. Nor can they cope with the problem of distinguishing between words and names. David Crystal isn't a word in the English language in the usual sense; but White House is, in its sense of 'US government'.
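The compound problem is easy to demonstrate with a naive tokenizer - a sketch of the general difficulty, not a reconstruction of any particular firm's method:

```python
import re

def naive_tokens(text):
    """Split text into 'words' crudely: runs of letters,
    allowing internal hyphens. The bluntest possible word count."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())

# Three spellings of the same lexical item:
print(naive_tokens("flowerpot"))   # ['flowerpot']       -> 1 token
print(naive_tokens("flower-pot"))  # ['flower-pot']      -> 1 token
print(naive_tokens("flower pot"))  # ['flower', 'pot']   -> 2 tokens

# Idioms are worse still: three tokens, none of which captures
# the idiomatic sense of the whole.
print(naive_tokens("kick the bucket"))  # ['kick', 'the', 'bucket']
```

A count built on tokens like these will inevitably treat one lexical item as one, one, and two 'words' depending on spelling, which is exactly why a raw tally is so hard to trust.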
The distinction between 'words' and 'lexemes' is critical when you're studying vocabulary. If we count Shakespeare's words, in the grammatical sense, we get around 30,000. If we count Shakespeare's lexemes, we get fewer than 20,000. A million words is not the issue; a million lexemes is. But I don't know of any computer algorithm which can identify lexemes efficiently. Even linguists, whose brains are far more powerful than any computer, sometimes have trouble with the concept.
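The distinction can be shown with a toy example. Note that the lemma table here is written by hand - which is precisely the point, since no algorithm builds it reliably:

```python
# Toy illustration of word forms vs lexemes. The lemma table is
# hand-built: the text's point is that no algorithm does this reliably.

text = "He goes. She went. They are going. He had gone. Go now!"

words = [w.strip(".!").lower() for w in text.split()]

# Hand-written mapping of inflected forms to their lexeme
# (the dictionary headword they belong under).
lemma = {
    "goes": "go", "went": "go", "going": "go", "gone": "go", "go": "go",
    "are": "be", "had": "have",
}

distinct_words = set(words)
distinct_lexemes = {lemma.get(w, w) for w in words}

print(len(distinct_words))    # 11 distinct word forms
print(len(distinct_lexemes))  # 7 lexemes - GO alone covers five forms
```

Counting forms gives eleven 'words'; counting lexemes gives seven, because goes, went, going, gone and go all belong to the single lexeme GO. Scale that gap up and it is the difference between Shakespeare's 30,000 words and his fewer-than-20,000 lexemes.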
A few years ago, world population passed 6 billion. One paper even claimed to have found the 6 billionth child. It was an intriguing idea, which probably sold a few papers, but we all knew it was nonsense. Claiming to find the millionth word is the same - an intriguing idea, and extra PR for the US firm. But it's still nonsense.