Friday 10 June 2011

On being linguistically cognito

Some correspondents have been contributing to my last post incognito. It was a post about a point of usage in which, it began to emerge, there was an interesting usage divide between British and American English. The situation is probably more complex than that, with such factors as age, gender, and social context being relevant as well as regional origin. And it is very important to establish the relevance, if any, of the contributors' language background. Without a sociolinguistic perspective of this kind, it is impossible to interpret what people are saying. 'I say this' or 'I never say this' is useless without knowing who 'I' is.

And this local issue reflects the main problem presented by the Internet, when it comes to interpreting language data. It's often said that the Internet is the largest linguistic corpus ever, and this is a goldmine for linguists. Well, up to a point, Lord Copper. Because it is also the largest anonymous linguistic corpus there has ever been, and this is an immense frustration for linguists. I take it as axiomatic, these days, that a linguistic analysis has to be sociolinguistically and pragmatically informed. If we want to explain linguistic patterns, as opposed to just describing them, we need to answer the question 'why'. Traditionally, linguistics had its focus on the what and when and where (descriptive, historical, and dialectological perspectives). Today we want to know why a usage occurs. What type of person uses it, in what situation? What was the intention behind using it and what was the effect? It is questions of this kind that sociolinguistics, stylistics, and pragmatics seek to answer. And they can't be answered without basic data, which is what the Internet so often does not provide. The fact that most contributions on the Internet are incognito, or pseudocognito, makes serious sociolinguistic investigation impossible. On the Internet, as the New Yorker cartoon once said, nobody knows you're a dog.

I'm well aware that there are some situations - some social networking domains, for example - where the opposite is the case. People tell the world everything about themselves. But there are still problems. Three, in particular. First, not everything we read can be trusted: false identities are all over the place, in which people adopt alternative ages, genders, roles... Second, saying too much about oneself is almost as problematic as saying too little, as nobody has got the time to trawl through a pile of (linguistically) irrelevant data about hobbies, likes and dislikes, and so on, in order to extract those values which relate to sociolinguistically relevant parameters. And third, linguists have spent a lot of time refining their investigative procedures in recent decades, so that they know the right kind of questions to ask, when approaching a usage issue, and these questions may not be addressed in the information people offer about themselves.

We do not yet have detailed linguistic accounts of the consequences of anonymity. All that is clear is that traditional theories don’t account for it. Try applying Gricean maxims of conversation to the Internet: our speech acts should be truthful (maxim of quality), brief (maxim of quantity), relevant (maxim of relation), and clear (maxim of manner). Take quality: Do not say what you believe to be false; Do not say anything for which you lack evidence. Which world was Grice living in? A pre-Internet world, evidently. Analyses in pragmatics traditionally assume that human beings are nice. The Internet has shown that a lot of them are not. Is a paedophile going to be truthful, brief, relevant and clear? Are the people sending us tempting offers from Nigeria - beautifully pilloried in Neil Forsyth’s recent book, Delete This at your Peril (2010)? Are extreme-views sites (such as racist hate sites) going to follow Geoffrey Leech’s maxims of politeness (tact, generosity, approbation, modesty, agreement, sympathy)? If brevity was the soul of the Internet, we would not have such coinages as bloggorhea and twitterhea.

I've just come back from a splendid corpus linguistics conference in Oslo (ICAME 32) where this was among the issues being addressed. The paper I gave will be up on my website shortly, but it raises more questions than answers. Maybe one day the Internet as a whole will provide linguistically sophisticated metadata, but I'm not holding my breath. And there may be a limit to what can be provided, given the collaborative nature of many Web pages, such as those we see on Wikipedia, which are often sociolinguistically heterogeneous, reflecting contributions from people of diverse backgrounds. Stylistic conglomerates are emerging as a consequence. None of this helps the poor sociolinguist.

Can anything be done to improve the situation? Well, one small thing is that usage forums could start by demanding greater explicitness when usage issues are raised. And so, from now on, I will not publish contributions to my blog on points of usage that are sociolinguistically incognito. What is relevant to the debate will vary. Sometimes it will be regional background (as in the last post), sometimes it will be age, or gender, or occupation. But there needs to be something, and I hope we will see similar things happening in other usage forums, so that, gradually, a sociolinguistically more informed Internet climate evolves.


Dan said...

This is really interesting and I'm looking forward to reading your paper.

You're probably aware of this already, but there's been some analysis of Twitter language use based on gender and region, which at least starts to consider issues of identity. However difficult it might be to check the actual (rather than supposed) gender of a tweet's tweeter, at least the geographical location can be checked to some degree.

There are some links gathered here and here if you're interested in following these up.

Jessikat said...

I think that sociolect on the internet changes ridiculous amounts depending on whereabouts you are online - a lot of the time, NOT because of how old you are, or your gender, or where you come from geographically.

For example, on 4chan, people are collectively the 'hivemind' - all posts might as well be by the same person, and they all have stylistic differences to other places on the internet, as well as offline writing. For example, upon any realisation, or logical thought process, an arrow and ellipsis would be used:

>Haven't revised
>Going to fail

And also, I'm not too sure that Grice's Maxims really come into play that often on the internet. Using 4chan as an example again (but you could equally use any other large scale website) people, if anything, are tentatively rude to others. Though this wouldn't be the case on a small, friendly message board.

But yeah, I'd say that on the internet, identity doesn't matter as much as the linguist would like to believe - we're influenced linguistically by the communities we're in and these often change our online persona - but don't change our real life ones.

Jessikat said...

Considering this satirical image produced by XKCD, reflecting the 'realms' of the internet -

- I bet you'd find different linguistic qualities of every online community seen there. And I think that in time an online geography of language could emerge.

Anonymous said...

Dear David,

How about designing an online form for people to fill out? This could be linked from this blog. The form would include specific questions designed (by you) for garnering the responses that you require from your readers, from a linguistic research point of view.



DC said...

I haven't looked at this for a while. Shows its age, doesn't it, now. It also illustrates the way the Internet has already become a domain of historical linguistics, for some of the output descriptions linguists were making a few years ago are already of historical interest only, having been supplanted by later developments. Distinguishing synchronic and diachronic is going to be an increasingly important issue. Twitter is a good example: when it changed its prompt in November 2009 (from 'What are you doing?' to 'What's happening?') its linguistic character changed.

DC said...

Time sequence anomaly. My 'this' in my previous comment referred to Jessikat not Daniel.

Daniel: a nice idea, but quite a task in its own right. If I was a bit younger... I hope some sociolinguistically inspired Internet geek will take it on.

Sarah said...

I'm one of the guilty parties. Sorry, but I've no idea how to rectify this, so I suppose I'll have to refrain from contributing in future, unless I include some info about my background in each post. I do see your point, though.

DC said...

Note that my recommendation about background relates only to posts in which variations in usage are the topic.

John said...

I'm an ex-grammar school boy from Stoke-on-Trent and, when I was at school, we learnt that medical conditions (very often derived from Greek) were always differentiated from other words of Greek derivation by spelling the sound /r/ as rrh, as opposed to rh: thus, catarrh, diarrhoea, etc, as opposed to, say, rhino.
I know that you, David, are from North Wales so I'm puzzled at your acceptance of bloggorhea and Twitterhea. Is this the international English version, your reversioning of the English English version, or simply a local adaptation?
Great post, by the way.

DC said...

Usage is very mixed at present, and the Internet seems to be introducing the kind of simplification that people have been looking for for years! For the blog item, if blog is spelled with gg, a Google search shows a real bias towards a single r spelling; vice versa if it is a single g. The double r is still dominant for the twitter item, but it's very early days there.