Corpus

Published : Aug 24, 2007 00:00 IST

A corpus is like a photograph of the language. BY ERIN McKEAN

MOST thesaurus entries for the word weapon do not include the lowly spork. The Transportation Security Administration, at last check, does not include the spork on its ever-changing list of what you can and cannot carry on an airplane. The spork would appear to be the most innocuous of eating utensils, a harmless plastic makeshift, a benighted fork-spoon hybrid. However, if you look at the word spork in the Oxford English Corpus (OEC), a 1.8-billion-word database of written and spoken English, and see how people actually write and talk about sporks, you find that 24 per cent of the uses of the word spork involve violence. Now, obviously, most of this sporking is facetious, done purely for humorous intent, but the phenomenon of the weaponised spork is one that passed lexicographers and language researchers by until we saw the corpus evidence.

A corpus is like a photograph of the language or, better yet, a satellite image. When you look at the language from 500 miles up, you see information that was simply not accessible from your previous perspective. A corpus makes patterns in language more visible: the same insight that would take months or years of reading to find (sporks are often used as humorous weapons) shows up in high relief when all the relevant examples are seen together. The OEC (compiled from 32,000 different sources ranging from news to fiction to blogs, all published since 2000, representing English from all over the world and growing every year) is a mother lode of such insights.

Where else could you find, to your delight and surprise, that migrate as a verb is used almost twice as often with the direction south as it is with north? Obviously what goes south must come north, but we seem not to talk about it much. How satisfying is it to learn that if you describe something as pink, you are much more likely to choose to call it fluffy than you are to call it fuzzy?
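For readers curious how a comparison like south versus north might be made, here is a minimal sketch in Python. It is not how the OEC or its tools work; it simply counts the words that appear within a few tokens of a target word. The function name `collocates`, the window size, the crude prefix matching on "migrat" and the four toy sentences are all assumptions made up for illustration, not data from the corpus.

```python
from collections import Counter
import re


def collocates(sentences, node_prefix, window=4):
    """Count words found within `window` tokens of any token that
    starts with `node_prefix` (a crude stand-in for lemmatisation)."""
    counts = Counter()
    for sentence in sentences:
        # Lowercase and keep only alphabetic tokens (plus apostrophes).
        tokens = re.findall(r"[a-z']+", sentence.lower())
        for i, token in enumerate(tokens):
            if token.startswith(node_prefix):
                start = max(0, i - window)
                end = min(len(tokens), i + window + 1)
                for j in range(start, end):
                    if j != i:
                        counts[tokens[j]] += 1
    return counts


# Invented toy sentences, NOT drawn from the OEC.
TOY_CORPUS = [
    "The geese migrate south every autumn.",
    "Many retirees migrate south for the winter.",
    "Monarch butterflies migrated south to Mexico.",
    "In spring the herds migrate north again.",
]

counts = collocates(TOY_CORPUS, "migrat")
print("south:", counts["south"], "north:", counts["north"])
```

A real corpus tool would go further, for instance by parsing sentences to find grammatical objects and by scoring how statistically significant each pairing is rather than reporting raw counts.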

The corpus also shows us that some words' literal senses are losing ground to more figurative ones. The "steer a boat or ship" meaning of the verb helm, for instance, has been almost completely supplanted by the meaning "manage the running of something": the top things helmed in the OEC are films, thrillers, adaptations, movies, comedies . . . only after all those do you get a ship to be steered. The verb herd has a similar pattern: after the cattle and sheep are all herded, what comes next in the OEC's list of significant objects? Cats.

Information from the OEC can show us the way to better dictionary entries. Most dictionary example sentences for the adjective fake have objects like paintings or reports as being fakes. The corpus reveals what people really describe as fake: smiles, tans, IDs, passports, fur and boobs. (Farther down the list come blood, mustaches, names, addresses, eyelashes and orgasms.) It seems that fakes are much more important to discern when they concern how you present your person and your identity. . . . Paintings and reports are not even in the Top 50.

In many cases, the OEC gives you hints about why people use a certain word. The word edible is used mostly when there is some question as to whether the object being described is, in fact, suitable to eat. We do not talk about edible fruit much, because we assume that if we are talking about fruit, we can probably eat it. According to the OEC, though, we often question whether fungi, tubers, seaweed, mushrooms and flowers are edible.

Occasionally, consulting the OEC shows you unfortunate patterns, not just for a word but in the world. For the verb beat, it is easy to find the common subjects, and they are the police, then stepfathers, then husbands. If you look at the construction beat + by, you still see police first, but now husbands and stepfathers switch places. If you compare the patterns for the verbs coerce and compel, you see that people are compelled to testify, write, act, obey, resign, surrender and comment, but that you are coerced into prostitution, sex, pornography and treatment (the last against your will).

All this may seem like mere playing around, but these little insights point to a bigger one: that by really observing how people use particular words, by using the OEC as a microscope to show us patterns in language that are not visible to the naked eye, we come to a better understanding of our language and ourselves. Lexicographers then communicate that new understanding through better dictionary and thesaurus entries, ones that more accurately reflect what words mean and how they are used.

Erin McKean has edited several dictionaries, most recently The New Oxford American Dictionary, Second Edition.

She blogs about dictionaries at dictionaryevangelist.com.

William Safire is on vacation.