Saturday, 17 March 2007

On relevance in advertising

Sarah sends a comment which asks about online contextual advertising. She asks: Why are those snappy short ads working so well in certain cases, and not at all in others? And is this influencing other parts of our language use at all?

This is in fact an area of applied internet linguistics which I've spent a lot of time on in the last ten years. You can read up more about it at the site - but essentially my procedure avoids the problem you've noticed by providing a full lexical specification of the content of a page. To see why this is needed, consider the following example.

A few years ago there was a page on CNN reporting a street stabbing in Chicago. The ads down the side said such things as 'Buy the best knives here', 'Get knives on eBay', and so on! The stupid software had found the word 'knife' and assumed that the page was about knives, and automatically assigned cutlery ads to it. No-one was happy, least of all the cutlery firms, who certainly didn't want their product to be associated with homicides.

To avoid this crazy result, my approach analyses all the content words (strictly, 'lexemes') on the page and weights them in terms of relevance. For the CNN page, a word like 'knife' would be outranked by the cluster of other words on the page that relate to crime. It then classifies the page using a set of around 1500 categories derived from the taxonomy I developed when working on the Cambridge encyclopedia family (there's an earlier post about that). It would conclude that this is a page about a crime - specifically, a homicide. It might also conclude that it was about some other things, too, such as policing or urban renewal. (Web pages are usually multi-thematic.)
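As a rough illustration of the weighting idea only (not the actual system, whose lexicon and roughly 1500 categories are far larger), here is a toy sketch in Python. The mini-lexicon, the category names, and the scoring rule (each word splits one vote between the categories it can signal) are all assumptions made up for the example, and it matches raw word forms rather than true lexemes.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon: word form -> categories it can signal.
# A real lexicon would cover every sense of every content word.
LEXICON = {
    "knife": ["cutlery", "crime"],
    "stabbing": ["crime"],
    "victim": ["crime"],
    "police": ["crime", "policing"],
    "suspect": ["crime", "policing"],
    "arrest": ["crime", "policing"],
    "officer": ["policing"],
    "kitchen": ["cutlery"],
}

def classify(text, top_n=2):
    """Return the top_n categories for a page, best first."""
    words = re.findall(r"[a-z]+", text.lower())
    scores = Counter()
    for word in words:
        categories = LEXICON.get(word, [])
        for category in categories:
            # An ambiguous word splits its single vote between its categories.
            scores[category] += 1 / len(categories)
    return [category for category, _ in scores.most_common(top_n)]

page = ("Police arrest suspect after stabbing. "
        "The victim was attacked with a knife, an officer said.")
print(classify(page))  # ['crime', 'policing']
```

In this toy case 'knife' contributes only half a vote to cutlery, while the cluster of crime words adds up to a much larger score, so the page is not classified as being about knives.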

Any advertiser wanting to place an ad alongside this report would want it to be relevant to the page - an ad about crime prevention, say, or careers in the police force. All advertisers have to do is apply the same classification system to their ad portfolio, and the software picks out the relevant ads. It's a simple principle, but it works very well, and is beginning to be widely used by the company that is now developing it, adpepper media.
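The matching step can be sketched in the same spirit. This is a minimal Python example under stated assumptions: the ad names, their category labels, and the ranking rule (count of shared categories, ties broken alphabetically) are hypothetical and chosen only to show the principle of classifying ads and pages with one shared scheme.

```python
def match_ads(page_categories, ads):
    """Return names of ads sharing at least one category with the page,
    ranked by number of shared categories (ties broken alphabetically)."""
    page = set(page_categories)
    scored = []
    for name, ad_categories in ads.items():
        overlap = len(page & set(ad_categories))
        if overlap:
            scored.append((overlap, name))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]

# Hypothetical ad portfolio, labelled with the same categories as the pages.
ADS = {
    "Kitchen knife sale": ["cutlery"],
    "Home alarm systems": ["crime", "crime prevention"],
    "Join the police force": ["policing", "careers"],
}

# Categories assumed to have been assigned to the news report.
print(match_ads(["crime", "policing"], ADS))
# ['Home alarm systems', 'Join the police force']
```

The cutlery ad is never selected for the crime report, because it shares no category with the page.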

The principle is simple, but the linguistics took a long time to develop - ten years, in fact. Every sense of every content word in a college-sized English dictionary had to be investigated and assigned to the relevant encyclopedic category, and significant collocations also had to be identified. The initial task took a team of lexicographers several years, and the software engineering took another team several years more. Indeed, the refining of the approach is still going on, to make sure it is fast enough and robust enough to cope with commercial demands, which might run to hundreds of millions of page-analyses and ad-assignments a day.

Incidentally, the same procedure can be used for other internet applications, such as improving search-engine relevance, automatic document classification, and internet security. It's difficult to get the big firms and organizations to run with these new ideas, though, I find. They are very set in their ways, and prefer to carry on using their familiar methods (even if they don't work that well) rather than invest in new strategies. For instance, a couple of years ago I developed a method (called 'Chatsafe') for tracking paedophile gambits in conversations, based on this sort of lexical analysis. It worked fine, and I thought it would be welcomed by the Powers That Be concerned with this sort of thing, such as the Home Office or chatroom companies. But despite a lot of talk, nobody picked it up, so it's stayed on the shelf.

Is this influencing language use in general? I don't see much sign of that. I talk about the extent to which the internet is influencing current usage in my Language and the Internet, and also in A Glossary of Textspeak and Netspeak. Although the internet is linguistically revolutionary in certain respects, the impact it has so far had on actual usage in a language is pretty limited.


Bill Chapman said...

Hello again David,

I'm not a linguist, so I can offer no more than an anecdotal observation in response to your comment that the Internet's impact thus far "on actual usage in a language is pretty limited."

This semester I'm teaching a basic media literacy class to a small group of California middle school students (ages 11-13). As part of their instruction, I wanted to show them that their language (English) is more dynamic than instruction usually makes it seem.

I began with a history lesson, part of which I mentioned in a comment a week or so ago. I played a recording of the Lord's Prayer (which I took from a 1950s CBC radio documentary titled A Word in Your Ear) as it was apparently spoken about 1,000 years ago. I asked if anyone could identify what they had just heard. As happens each time I conduct this activity, I got a lot of guesses, none of them English. I then played another snippet from the recording - the Lord's Prayer as it was spoken about 600 years ago. This was easily recognizable as English, but with a pronounced French accent. We then discussed what happened during that 400 year period to change the language so dramatically - the Norman Conquest. I then gave them a copy of one page from Hariot's BRIEFE AND TRUE REPORT so they could see how written English has changed in the past 400+ years.

I then wanted them to see that such changes have not stopped; indeed that they continue even during their lifetimes. To do this I used the search facilities of the online Merriam-Webster Collegiate and the OED to compile a list of words whose first written usage had been dated in the years they were born (1992-95). I selected a list of 21 common words that I felt they might have come across. Admittedly, since the net is such a large part of their lives, this introduced a bias, but 12 of those words were from the Internet world (e.g. webcast, text message, dot-com, home page, and digerati). I haven't gone back to look at the complete list of words whose first date is from those years. Probably Internet words would not be such a large percentage; but, at least in terms of English vocabulary, it seems to me the net has been a major source of change.

DC said...

'Source of change'? Yes. 'Major source of change'? No. Well, not yet, anyway. I spent some time three or four years ago looking for all the words that could be said to have come into the language as a result of the Internet, and found a couple of hundred, such as the ones you mention. But a couple of hundred is next to nothing, compared with the size of English vocabulary as a whole, a million words at least - a drop in a lexical ocean. That's what I meant by 'pretty limited'.

We'd need to be more precise if we were doing a proper study of this topic. It would be important, I think, to distinguish words which have come in as a result of computing in general and those which have come in as a result of the internet or other technologies. For example, 'text-message', strictly speaking, is not an internet term. Also, other types of network apart from the internet might have to be distinguished.

I was also careful to say 'on actual usage'. You use a broader criterion - 'that they might have come across'. This raises the question of passive vs active vocabulary - words people know vs words they actually use. The latter is going to be even smaller, as a percentage.

A very important point is the time factor. We have to wait a while to be sure that the words have genuinely arrived, and are not just fashionable slang with a short life. Dictionaries are quite good at identifying when words arrive in a language. They are very poor at suggesting when they have left.

Bill Chapman said...

Realizing the difficulty, if not impossibility, of trying to ascertain the "major source" or sources of change in a language like English at any given moment in history, I'd still be interested to know what your educated guess is. My less well informed one would include science and technology (of which the Internet would constitute a subcategory), international business and marketing, and wars and military deployments. Have I overlooked anything of major importance? Or is it a fruitless task? Are such changes so complicated that looking for large engines of change is not a worthy endeavor?

I certainly agree with you about the need to recognize that not all words that survive long enough to merit inclusion (and dating) in our dictionaries continue in use over the long haul, becoming parts of "active vocabularies" - a phrase I'll now try to incorporate into mine. I also take your point about the difficulty of using dictionaries to identify when words or specific senses drop out of use. Until dictionaries improve in this regard, I guess I'll have to rely on books like Michael Quinion's GALLIMAUFRY and David Grambs's THE ENDANGERED ENGLISH DICTIONARY.

Anonymous said...

How does your technology compare with Google's Adsense? See

"AdSense can deliver relevant ads because Google understands the meaning of a web page. We have refined our technology, and it keeps getting smarter all the time. For example, words can have several different meanings, depending on context. Google technology grasps these distinctions, so you get more targeted ads."

It seems to be much the same sort of technology as you have described.

DC said...

Looking for major sources of change is certainly a worthy endeavour. The best book I know on this is Geoffrey Hughes's 'Words in Time', which identifies several and does some useful analysis. A few people have suggested statistics, but these are not very illuminating, as they are usually extremely general - for example, that 75 percent of present-day English vocabulary is science and technology. This figure cannot be far from the truth, but what follows from knowing it? It would be nice to have a series of totals at a more specific level of enquiry, but I don't know of any. Probing a dictionary like the OED would be a start, but even that fine database only has a proportion of the highly technical terms (and senses) found in English specialised dictionaries.

That 'and senses' is crucial. There has been a tendency to investigate lexical change by reference to new word-formation only. When a dictionary publishes a new edition, it always draws attention to the new words it includes, and this is always picked up by the press. But new senses are probably far more important - and far more difficult to pin down.

DC said...

Re Anon's comment on Adsense. Yes, I think Adsense is slowly improving its relevance - and about time too! But there is a limit to what can be done without devoting time to the appropriate linguistic analysis, and it still has some way to go. Any simple algorithmic approach will only take you so far.

The direction of your comment should be the other way round - not how do we compare with Adsense, but how does Adsense compare with us! I did this work in the mid-1990s, before Google was even born. Our patents are filed from 1998. There's more than one way of doing this sort of thing, of course; but I haven't heard of anyone who has put the same kind of human linguistic effort into their approach. At its peak, there was a team of around 40 lexicographers slogging their way through the words and senses of the dictionary to provide the lexical filter that would be used to do the analysis. When we did our first comparisons with what else was out there (not in the context of advertising, at the time, but in relation to search-engine assistance, using Excite - remember Excite?), we were miles ahead. I mean, we were getting something like 80 percent relevant results when other approaches were getting 10 or 20 percent. But when Google arrived, our small (by comparison) little operation wasn't able to compete with the resources that firms like that had available to throw at a problem, so in recent years there have been several other approaches which have been trying to do the same thing. I don't know of any systematic comparison of results between the various systems, but I do know that my approach is now up to over 95 percent relevance, so I'm very pleased to see the system working so well. (Trying to sort that remaining few percent keeps me happily occupied still.)

I take a perverse pleasure when I see other systems still getting unacceptable numbers of misplaced ads. Just yesterday I did a search for 'Trojan horse' and noticed that 'Trojan' generates sponsored ads for spyware and condoms, all mixed up. That sort of thing happens all the time, with the more complex enquiries generating the worst results. Last time I did some testing (in January), several of my test searches were producing ads that were irrelevant in over 50 percent of cases. So there's still a way to go. The so-called contextual systems are indeed 'getting smarter', but they're still a bit thick.

Anonymous said...

From the Flash Tour on I see your solution works by using a database of terms such as "depression", "weather", "cold", "warm", etc. and associating them with a particular context such as climate. However, this is precisely what Google claims to do with AdSense: they use the example of the terms "cup", "java", and "coffee" indicating a context of hot drinks, while "program", "java" and "C++" indicate a context of computer programming, and "Indonesia", "java" and "island" indicate a context of geography. So perhaps that explains why they didn't need to use Crystal Semantics' solution. Without a systematic comparison of results the ultimate test is whether they improve the profitability of online advertising. I guess the proof of this particular pudding will be in ad pepper's results over the next 12 months.

DC said...

Yes, there are a number of approaches out there now which are trying to do the same sort of thing. In some cases there may even be some patent infringement happening, but that's for someone else to explore.

As you say, the results will be the test. I hope some independent group will do some testing. But it isn't easy to test the notion of relevance, especially on multi-thematic web pages, where several contexts intertwine, so I guess there'll always be a degree of uncertainty about the relative merits of different systems. And even on mono-thematic pages, the question of what counts as a relevant ad can be problematic. Is an ad about the environment relevant to a page on car sales, for example? Depends on your point of view.

A couple of other points.

My approach wasn't intended to be restricted to contextual advertising - though that's the application which is currently being given all the attention (after adpepper became involved in what we were doing). It was actually originally designed as a search-engine assistant and an automatic document classifier. So the range of categories it classified was much wider than what you usually find in advertising. Because the approach derived from my general encyclopedia, it paid as much attention to philosophy and literary criticism as it did to cars and washing machines. And I think its breadth is still a distinguishing feature of the approach.

Related to this: some contexts are much easier to distinguish than others. It doesn't take much brain to distinguish geography from computing, because the words in the context don't overlap much. It's much harder to distinguish, say, opera from ballet, or credit cards from mortgages, because of the shared vocabulary. That's where simple algorithms let you down, and where human intuition scores. When my lexicographers were working on this project, most of the discussion was about how to build up the keyword base to be discriminating yet predictive of all possible uses. In other words, the question was not: can we analyse that particular page correctly?, but can we analyse all pages yet to be written on that topic? As users of the English language, wanting to write a page on, say, the climate sense of 'depression', what word-stock is available for us to use that relates specifically to that topic? It was this generative (in Chomsky's sense) perspective that I wanted to implement, and that's what took the time. Identifying a few keywords is easy. Give me half-a-dozen words on climate? No problem. But give me 200 maximally relevant words on climate? That takes time, and not a little expertise.
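To make the point about shared vocabulary concrete, here is a small Python sketch that measures how much two categories' word lists overlap. The word lists are invented for illustration and are nothing like the real keyword base; Jaccard similarity is used simply as a convenient, standard overlap measure, not as a description of how the lexicographers worked.

```python
def jaccard(a, b):
    """Jaccard similarity: shared words divided by all words in either set."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def discriminating_words(category, other, vocab):
    """Words that signal `category` but never appear under `other`."""
    return sorted(vocab[category] - vocab[other])

# Hypothetical, tiny keyword lists for four categories.
VOCAB = {
    "geography": {"island", "river", "mountain", "coast", "climate"},
    "computing": {"program", "compiler", "server", "code", "memory"},
    "opera": {"stage", "orchestra", "performance", "theatre",
              "aria", "soprano"},
    "ballet": {"stage", "orchestra", "performance", "theatre",
               "pirouette", "choreography"},
}

print(jaccard(VOCAB["geography"], VOCAB["computing"]))  # 0.0
print(jaccard(VOCAB["opera"], VOCAB["ballet"]))         # 0.5
print(discriminating_words("opera", "ballet", VOCAB))   # ['aria', 'soprano']
```

With lists like these, geography and computing separate trivially, while opera and ballet can only be told apart by the few words outside their shared stock, which is why the discriminating part of the keyword base needs the most care.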

Anonymous said...

Unfortunately academic excellence doesn't always translate into commercially profitable ideas. According to ad pepper media's annual report Crystal Reference Systems has been losing money, despite the "8 years and $8 million investment in research" reported in a previous press release. Perhaps the technology doesn't have the commercial potential you believe or hasn't been marketed properly. The latter seems unlikely, given that three companies have tried to make money out of it over 8 years or more. But something doesn't add up here, and APM’s share price performance is not an encouraging indicator.

Anonymous said...

Perhaps my last post to your blog (which understandably has not been published) was unreasonably negative. If there is a failure of marketing, your sales team could take a leaf out of your own blog, in terms of the clear and positive descriptions you make of your technology and what distinguishes it from Google’s AdSense.

DC said...

'Understandably not been published'? Not so. It is Easter Sunday, and I have been out with the family! As I said at the very outset of this blog, I can't guarantee to respond quickly to questions or comments, as I am often away.

The history of this business is an extraordinary tale, actually, and the reason for the situation you correctly identify is entirely on the marketing side. In brief, a Dutch firm called AND bought my encyclopedia database and its taxonomy (which was owned by CUP at the time) from CUP in the mid-90s. I then acted as a consultant with the task of developing the taxonomy for electronic applications (as mentioned in a previous post). It took four years, and just as my team finished the project and demonstrated that it worked, AND went into liquidation (for reasons entirely unrelated to what we were up to). It was a pretty horrible time. Suddenly, my editorial team was out of work, and not only the electronic development but all the encyclopedia publishing was halted. All the plans for software production and marketing went out of the window.

I talked to Ian Saunders, who had joined AND very late on, and we wondered what to do. It seemed an awful waste - the encyclopedia database was the result of some 15 years' work. Also, there were precious jobs being lost (Holyhead, where the team was based, was very much a depressed area at the time). We decided to make a bid for the assets and see if we could make a go of it ourselves. That led to the formation of Crystal Reference Systems in 2001. We had to get funding to survive, and this we did. It enabled us to start appointing teccies to develop the appropriate software to implement the work. We hoped that we would get some contracts quickly, but by the time the software had got to a product stage, and we were able to show it at some of the big tech events (like AdTech), it was 2004, and that gap was critical, because in the meantime other firms had been developing their own procedures. We spent an interesting but ultimately abortive time in Silicon Valley demonstrating our approach to some big players, but all had vested interests in what they were already doing, and didn't feel they could chuck it all away and start again. We got a few small contracts, but this didn't bring in enough to do more than cover our essential development costs, and we weren't able to afford to employ the kind of marketing team which we needed. It was one of those hand-to-mouth survival periods, and I must say it was one of the toughest times of my life (remember I am an academic, with very little experience of the business world - to me, 'leveraging assets', 'exit strategies', and all that, had previously been just words!).

Our main problem, I think, looking back, was that we were being pulled in too many different directions at once. We had developed a basketful of applications - as I mentioned in a previous post, in such areas as search-engine assistance, automatic document classification, contextual advertising, e-commerce enquiry, and chatroom security - but without a proper sales team it was hard to make an impact in any of these areas. Eventually, in 2004, we decided to focus on the area which we felt would produce results most quickly, contextual advertising, and indeed that is what transpired. By 2005 our demos of our 'sense engine', as we called it, had attracted a great deal of interest, and people were starting to make enquiries about acquisition. As you know, eventually adpepper was the successful purchaser.

But that deal did not complete until March 2006. And since then, the main task has been to integrate our technology with the already existing adpepper technology, to help point it in new directions, and to thoroughly test it. This took most of last year, and it was only at the beginning of 2007 that we were finally able to implement a suite of products based on our approach. Sitescreen, which identifies sensitive sites, and Langscape, which identifies the language of a site, were the first to reach a usable state, and the fuller contextual use of the taxonomy has not long been in operation. The procedure is currently being developed in several languages other than English.

First results have been excellent, so I am hopeful that, after all this time, the approach will get the success that I believe it deserves and give a return on the huge investment that CUP, AND, Crystal Reference, and now adpepper have put into it. But that is for some other blog, not mine, to discuss. I am the least in the kingdom of commerce, I'm afraid. In particular, I can't comment on your last observation, as I have no idea what factors influence share prices.

The Ridger, FCD said...

My favorite remains the Google ad that put Ruth's Chris Steak House on a story about Aztecs butchering and eating Spaniards...