It's time to talk about a class of invalid statements abused all the time in comparative linguistics, strangely even by published linguistic scholars who have somehow persuaded their alma maters to grant them doctorates without proving they know any better than the undergraduates. The pattern I'm referring to usually sounds something like this:
Typical statements heard online include "A third of Germanic vocabulary is from a non-Indo-European substrate" or "More than 90% of the Tok Pisin pidgin is of English origin." I've found that people, no matter how educated they supposedly are about language and linguistics, seldom ponder whether these kinds of statements are factual or verifiable at all, and will often shut down their inner devil's advocate at the foot of a reference, fully satisfied that the statement must surely be well-researched for it to be published and therefore true. Effectively, if things are worded impressively enough, a careless author with a PhD can, wittingly or not, will facts into being, so long as they never admit their insidious error in reasoning and simply let the credentialist bias of the general public do the service of burying their bone of shame in the backyard. Out of sight, out of mind, as they say.
The clear reason why these statements have absolutely no validity in logic is that the vocabulary of a natural language can never be quantified. Imagine, for a second, how many words are in the English language. Do you think there are 10,000? 200,000? 4 million? Exactly how many words does the English language have? How would we go about counting them? Do we restrict ourselves to colloquial terms or expand to technical terms? Or even to terms that have yet to be invented but have the potential of being used based on the grammar of the language (e.g. intensificate or macrolinguisticology)? The problem is evidently intractable, not just because no sooner do we attempt a tally than new words enter the language, but also because even when a language is static and moribund (such as Latin), there's no way of counting all possible words in it, because the grammatical combinations and styles available to a speaker of any natural human language for forming words are quite literally infinite in number. What then is 99.9% of infinity?
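The infinity claim can be made concrete with a toy sketch (my own illustration, not from any source cited here): a single productive, recursive affix already generates an unbounded set of well-formed English word candidates, so any denominator for "the vocabulary" is undefined.

```python
from itertools import count

def great_ancestors():
    """Yield 'grandmother', 'great-grandmother', 'great-great-grandmother', ...

    One recursive prefix is enough: each application of 'great-' produces
    a new, grammatically well-formed English word, with no upper bound.
    """
    for n in count(0):
        yield "great-" * n + "grandmother"

gen = great_ancestors()
words = [next(gen) for _ in range(4)]
# words[0] == "grandmother"
# words[2] == "great-great-grandmother"
# ...and the generator never runs out, so no total word count exists.
```

If even one such productive pattern exists, the set of possible words is countably infinite, and a percentage of it is meaningless.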
Now it should be clear why any argument dependent on an unprovable percentage of vocabulary is silly and pointless, right? Yet what should be is often not what is. Alas, rather than accepting this inevitable conclusion, new words may be employed merely to obfuscate the issue. An author may ramble on about a percentage not just of "vocabulary" or "lexicon" but of something even more subjective like "core vocabulary". Consider for example Hubert Lehmann's article Towards a core vocabulary for a natural language system [pdf], wherein the author confesses on the one hand that "currently it does not seem possible to precisely define the notion of core vocabulary" while nonetheless attempting to prove the patently futile, as if one hemisphere of his brain wasn't aware of what the other half was plotting. Naturally the term core vocabulary is no more concretely definable than the simpler term vocabulary itself, and therefore it's proof of nothing at all but a general human tendency to waste paper, ink and bandwidth.
Here are published examples of these futile statements paved with good intentions:
Corson, Using English Words (1995), p.200: "Haugen (1981) also notes the effect of 'the Anglo-American empire and its Latinised English' (p. 114) on Norwegian, where almost 90% of new words in the language come from English (Norstrom, 1992)." Silly, isn't it?
Baker/Jones, Encyclopaedia of bilingualism and bilingual education (1998), p.147: "English-based pidgins derive as much as 90 percent of their lexicon from English."