10 Apr 2009

99.9% of English is logically invalid

It's time to talk about a class of invalid statements abused all the time in comparative linguistics, strangely even by published linguistic scholars who've somehow managed to persuade universities to grant them doctorates without proving that they know any better than undergraduates. The pattern I'm referring to usually sounds something like this:

Amount X of language Y is Z

Statements of this kind typically heard online include "A third of Germanic is from non-Indo-European substrate" or "More than 90% of the Tok Pisin pidgin is of English origin." I've found that people, no matter how supposedly educated they are about language and linguistics, seldom ponder whether these kinds of statements are factual or verifiable at all, and will often shut down their inner devil's advocate at the foot of a reference, fully satisfied that the statement must surely be well-researched for it to be published, and therefore true. Effectively, if things are worded impressively enough, a careless author with a PhD can, wittingly or not, will facts to be so, as long as they never admit their insidious error in reasoning and simply let the credentialist bias of the general public do the service of burying their bone of shame in the backyard. Out of sight, out of mind, they say.

The clear reason why these statements have absolutely no validity in logic is that the vocabulary of a natural language can never be quantified. Imagine, for a second, how many words are in the English language. Do you think there are 10,000? 200,000? 4 million? Exactly how many words does the English language have? How would we go about counting the words? Do we restrict ourselves to colloquial terms or expand to technical terms? Or even to terms that have yet to be invented but could potentially be used based on the grammar of the language (e.g. intensificate or macrolinguisticology)? The problem is evidently intractable, not just because no sooner do we attempt a tally than new words enter the language, but also because even when a language is static and moribund (such as Latin), there's no way of counting all possible words in it: the grammatical combinations and styles available to a speaker of any natural human language to form words are quite literally infinite in number. What then is 99.9% of infinity?

Now it should be clear why any argument dependent on an unprovable percentage of vocabulary is silly and pointless, right? Yet what should be is often not what is. Alas, rather than accepting this inevitable conclusion, authors may employ new words merely to obfuscate the issue. An author may ramble on about not just a percentage of "vocabulary" or "lexicon" but of something even more subjective like "core vocabulary". Consider for example Hubert Lehmann's article "Towards a core vocabulary for a natural language system", wherein the author confesses on the one hand that "currently it does not seem possible to precisely define the notion of core vocabulary" while nonetheless attempting to prove the patently futile, as if one hemisphere of his brain wasn't aware of what the other half was plotting. Naturally the term "core vocabulary" is no more concretely definable than the simpler term "vocabulary" itself, and therefore it's proof of nothing at all but a general human tendency to waste paper, ink and bandwidth.

Here are published examples of these futile statements paved with good intentions:

Corson, Using English Words (1995), p.200: "Haugen (1981) also notes the effect of 'the Anglo-American empire and its Latinised English' (p. 114) on Norwegian, where almost 90% of new words in the language come from English (Norstrom, 1992)."

Baker/Jones, Encyclopaedia of bilingualism and bilingual education (1998), p.147: "English-based pidgins derive as much as 90 percent of their lexicon from English."

Silly, isn't it?


  1. Basically you're pointing out that it's difficult to estabilish a rigorous metric, but it doesn't mean that "percentage of words" is entirely meaningless. Intuitively, qualitativ differences clearly exist - or do you want to argue that Old English and Modern English contain "equally much" lexicon of Romance origin?

    Declensions, at least, are a total red herring. We can simply ignore inflection and focus on stems or roots. Similarly, compounds are a class that is probably best left uncounted. And yes, there is a gray area here, as always. It doesn't matter, since there is ample "white" and "black" area too. It's just a small increase in uncertainty.

    And everyone accepts uncertainty. Have you ever seen a statement of this kind to an accuracy of more than one decimal? "English gets 32.55% of its vocabulary from French" is silly. "English gets about a third of its vocabulary from French" might not be correct but it's not gibberish.

    A very handy way to do away with the issue of obscure derivations as well as all kinds of obscure terminology would be to weight each word by its usage frequency, which yields a very exact metric, once a corpus is decided on. (The weighting factor might have to be non-linear so as not to stress function words too much, but that's a matter for the specifics.)

    You could almost as well be complaining about facts like "the Earth's atmosphere is 21% oxygen" because we can never account for every single molecule...

  2. No, Tropylium. I'm not saying that it's 'difficult' to "estabilish"(sic) a rigorous metric on vocabulary, but rather that it's completely IMPOSSIBLE to establish a reliable statistic based on a concept indefinable in any logical manner! 'Language' is just such a term, fundamentally arbitrary and inconsistent in usage. So too, then, are any associated statistics, much like statistics built on similarly arbitrary terms such as race. According to science itself, inconsistency produces meaningless results.

    I suggest you read Sihler, Language History (2000), p.168: "A further source of confusion in the lay perception of the language/dialect question is that the terms themselves have arbitrary and contradictory conventional uses."

    Then read Hickey, Legacies of colonial English (2004), p.474: "Using the not entirely reliable etymologies provided in Mihalic (1971), Mühlhäusler (1979) says that the lexical composition of Tok Pisin is mainly English (79 per cent), [...]" (boldface mine).

    The full reason why these kinds of statistics are to be considered "unreliable" is elaborated upon further in Romaine, Language, education, and development (1992), p.145: "These estimates are not, however, totally reliable due to the problem of convergent etymologies and lexical conflation. It is not always clear, for instance, that any one item had a unique source, e.g. Tok Pisin gaden may be equally from English garden or German Garten, and Tok Pisin bel may be from English belly or Tolai bala 'stomach, seat of the emotions'."

    Quite frankly, those who still believe that these types of statistics are really saying anything meaningful, despite all I've said above, must surely fail to comprehend the very nebulous nature of 'language' itself and thus of linguistics in general.

  3. One further important note: when you (Tropylium) say "Intuitively, qualitativ(sic) differences clearly exist - or do you want to argue that Old English and Modern English contain 'equally much' lexicon of Romance origin?", you also fail to understand that even if one wanted to waste one's time justifying such a clearly opinionative statement, the statement is nonetheless opinionative. Opinionative is the key word here. In other words, you're transparently speaking in a subjective manner and so straying away from logical deductive thinking.

    Naturally, if you can't distinguish between opinion and fact, it's no wonder you fail to understand why these statistics are, in the end, equally reified from insignificant opinions.