corpus.google.com

Via Language Log, which is mentioned in the article, a great piece in The Economist on the use of the internet to do linguistics research:

Linguists must often correct lay people’s misconceptions of what they do. Their job is not to be experts in “correct” grammar, ready at any moment to smack your wrist for a split infinitive. What they seek are the underlying rules of how language works in the minds and mouths of its users. In the common shorthand, linguistics is descriptive, not prescriptive. What actually sounds right and wrong to people, what they actually write and say, is the linguist’s raw material.

But that raw material is surprisingly elusive. Getting people to speak naturally in a controlled study is hard. Eavesdropping is difficult, time-consuming and invasive of privacy. …

Linguists, however, are slowly coming to discover the joys of a free and searchable corpus of maybe 10 trillion words that is available to anyone with an internet connection: the world wide web. The trend, predictably enough, is prevalent on the internet itself. For example, a group of linguists write informally on a weblog called Language Log. There, they use Google to discuss the frequency of non-standard usages such as “far from” as an adverb (“He far from succeeded”), as opposed to more standard usages such as “He didn’t succeed—far from it”. A search of the blog itself shows that 354 Language Log pages use the word “Google”. The blog’s authors clearly rely heavily on it.

The article goes on to describe some of the problems associated with using Google as a corpus, such as the inclusion of “syntactically correct but meaningless verbiage” by web sites trying to score higher on Google searches. Mark Liberman says that these sites actually hire computational linguists to do this for them! The article also has a link to the Linguist’s Search Engine — a site which allows for searches by syntactic structure. I wonder if Google will eventually offer such a service themselves? “corpus.google.com”? (Apologies to those who thought this post was actually announcing such a service.)

UPDATE: But is Google’s search feature “broken”? Only if the numbers are too large.

{language, linguistics, google}