25 Nov 2007

How NOT to reconstruct a protolanguage

I wrote an article last month, The Tower of Babel, which was a non-exhaustive critical assessment of the late Sergei Starostin's grandiose online language project that limps on today through the efforts of surviving project members. A recent troll on that page under the unconvincing disguise of "G.Starostin" sent me two messages, one visible because it was civil if misguided, while the second was abusive and thrown in the trash after I took note of his IP address. In case anyone was confused, my blog isn't a mouthpiece for proto-world rhetoric and I'm an ardent defender of mainstream linguistics despite my moderate interest in long-range linguistics. It suffices to reject the Tower of Babel project based simply on the consistent use of outdated and even disprovable information. Consider its Indo-European database, infected with Julius Pokorny's 1950s reconstructions which notoriously neglected to reconstruct laryngeals to properly account for reflexes in Anatolian languages like Hittite, Luwian and Lycian. When a word halχ is assumed to mean “10” a priori in Etruscan purely by eyeballing texts and ripping words out of context in order to reject what is already established to be śar (cf. Bonfante, Reading the Past - Etruscan (1990), p.61), Starostin's supplied pdf entitled Etruscan numerals: Problems and Results of Research by S. A. Iatsemirsky[1] is not credible enough to identify either the problems or the results of serious Etruscan research. Its Dravidian database is full of largely unaccepted reconstructions using voiced stops that are not proven to be necessary in that proto-language[2]. Then the addition of a Nostratic database and "Long Range Etymologies" is sure to add to the air of mediocrity of the website, putting the cart before the horse in light of the numerous mistakes regarding the more accepted languages and language families I just mentioned.
This is all on top of the decidedly negative assessments of North Caucasian pushed by Sergei Starostin during his lifetime (Johanna Nichols, Current Trends in Caucasian, East European, and Inner Asian Linguistics (2003), p.208). I personally believe in an efficient use of time. So if it's proven that this website is consistently at odds with the mainstream, one would be wise to obtain a higher quality of information elsewhere.

On that note, it's important to discuss how NOT to reconstruct a protolanguage so that we're all on the same page and can more easily distinguish between real linguists and narrow-minded loons, whether online or in print. Considering that even Merritt Ruhlen of "Proto-World" infamy[3] obtained his PhD from Stanford University, it's important not to be deceived by academic status. Theories can be ill-conceived no matter who one is or claims to be. So let's go through my cheeky list of important strategies that we can follow (using examples from the Tower of Babel project) if we want to isolate ourselves and be rejected by all universities around the world.

1. Use "phonemic wildcards" obsessively!
Cast the net wider and you might catch something!

The abuse of mathematical symbols like C, V, [a-z], (a/é/ö), etc. is an excellent way to make your idle conjecture look like a valid theory. It might be called "reconstruction by parentheses" since parentheses are either explicitly shown or hidden by a single variable. An example of this is *k`egVnV (claimed to be the Proto-Altaic word for "nine" in the Tower of Babel database). Obviously, if V represents all possible vowels in this proto-language and there are, say, ten of them possible in either position, then the fact that there are two wildcards in the same word means that the word represents a humungous, two-dimensional matrix of ONE HUNDRED possible permutations (10*10=100):

*k`egana, *k`egena, *k`egina, *k`egüna, *k`egïna, etc.
*k`egane, *k`egene, *k`egine, *k`egüne, *k`egïne, etc.
*k`egani, *k`egeni, *k`egini, *k`egüni, *k`egïni, etc.
*k`eganü, *k`egenü, *k`eginü, *k`egünü, *k`egïnü, etc.

Since no single form is actually being posited when wildcards are present, any claim of regular correspondence by such a theorist can be easily identified as fraud. If such linguists can't take themselves seriously enough to hypothesize a structured and testable theory, why then should we take them seriously in turn?
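
The arithmetic above is easy to check mechanically. What follows is only a minimal Python sketch, not anything taken from the Tower of Babel database: the ten-vowel inventory is a made-up placeholder (I only said "say, ten of them"), and expand_wildcards is a hypothetical helper name.

```python
from itertools import product

# Placeholder inventory: ten vowels, chosen only to match the
# "say, ten of them" assumption; not a claim about Proto-Altaic.
VOWELS = ["a", "e", "i", "o", "u", "ü", "ï", "ö", "ä", "ë"]


def expand_wildcards(template: str, vowels: list[str]) -> list[str]:
    """Return every concrete form obtained by replacing each 'V' with a vowel."""
    slots = template.count("V")
    forms = []
    for choice in product(vowels, repeat=slots):
        filler = iter(choice)
        # Walk the template, consuming one chosen vowel per wildcard.
        forms.append("".join(next(filler) if ch == "V" else ch for ch in template))
    return forms


forms = expand_wildcards("k`egVnV", VOWELS)
print(len(forms))  # 100
```

Every extra wildcard multiplies the count by the size of the inventory, so a form with three such slots would already stand for a thousand candidates under the same assumption.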

Other hilarious examples of wildcard fairy tales on the Tower of Babel site include Nostratic *cUKV ( ˜ č`-) "bundle" (in other words, all four are wildcards... jackpot!), Dravidian *kaṬ- "to cut into pieces" (universal onomatopoeia, anyone?), Semitic *ʔVrib- "tie (a knot)" (based on a single language, Arabic) and North Caucasian *ƛ̣_VẋwV ( ˜ Ł_-)̆ "rake" (wow, the number of possible permutations in this wildcard buffet is positively mindboggling! 200 perhaps?).

2. Ignore Occam's Razor and never seek logical justification for your ideas!
If an exotic phoneme gives you an orgasm, reconstruct it!

Most longrangers ignore Occam's Razor or fail to apply it in all aspects of their budding theory. It's easy to understand why it's not valid to reconstruct a sound in a proto-language which shows no regular correspondence in its daughter languages. However, even when one has justified a phoneme with evidence, one still has to justify the plausibility of the larger sound system that it's a part of. So if you have greater evidence for a palatal *ź than you do for its plain counterpart *z, you still have a problem to solve (cf. phonemic markedness). If pronouns and common affixes use the more complicated sounds of the inventory of your proto-language, you still have a problem since this goes against the trend in languages we observe throughout the world, a reason that Allan Bomhard used to reject Illich-Svitych's reconstruction of Nostratic (e.g. Illich-Svitych and Dolgopolsky reconstructed the 2ps pronoun starting with the symbol *ṭ-, an ejective rather than its plain counterpart). This is how Occam's Razor works. In all aspects of our theory, we must abide by the simplest answer possible. Whenever you hear an argument like "Yeah, but, there's this language in some remote part of Africa with 30 speakers that uses a really rare sound or does something else that's really rare just like in my theory!" then you know that you're not dealing with someone in their right mind. Occam's Razor avoids unnecessarily exotic solutions at all times and teaches us to not confuse "minute possibility" for "convincing probability". For example, Klallam is certainly an existing spoken language, but there's also no doubt that its sound system and consonant clusters are very rare. So Klallam is something that your proto-language should not look like until you have solid proof (i.e. numerous regular sound correspondences) to back it all up.

By searching in the Tower of Babel's North Caucasian database for words beginning with sibilants, we get the following screwy search results. As of today, only one word with plain *z- in initial position is to be found, namely the first person pronoun claimed to be *zō, despite the fact that there are two instances of *ź- and *ž-. This means that plain *z- is outnumbered 3 to 1 by the comparatively more exotic counterparts with palatalization, labialization, clusters, etc. Even worse, there are only two instances of plain *s- among twelve roots starting with unvoiced sibilants. So plain phonemes are in the minority, as we would find if we were reconstructing a science-fiction language. Consistently, Starostin's North Caucasian defies any rational structure or common sense and is a perfect example of diacritic overkill.

3. Make pages and pages of "correspondence tables"
They're sure to impress your family members!

"Correspondence tables" are lists of sounds in the daughter languages of a hypothetical proto-language proposed to prove regular correspondence and thus genuine relationship. So we can say that Germanic *þ often corresponds to Latin t, as Jacob Grimm remarked in 1822, showing that Germanic and Latin are part of the Indo-European family of languages. However, language isn't that simple and far more often than not, there are numerous exceptions to such simplistic equations. For example, the word 'eight' is octo in Latin and yet *ahtōu with a *t in Germanic. This is because the stop fails to be weakened to a fricative after another stop. What good then are correspondence tables when we can save time and space by actually describing sound changes and their processes? For some reason, Nostraticists and other longrangers like to use these at every turn, as does Sergei Starostin. These childishly repetitive tables simply waste pages and pages of paper and bandwidth without being terribly informative, but it's certainly an excellent way to make your book look thicker and impress your family.

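To make "describing sound changes and their processes" concrete, here is a deliberately toy Python sketch of a conditioned rule of the kind mentioned above: voiceless stops weaken to fricatives unless another stop or s precedes them. The function name grimm_voiceless, the ASCII-style spellings and the three input strings are illustrative assumptions, not reconstructions, and the sketch ignores Verner's law, vowels and every other consonant series.

```python
# Voiceless stops and the fricatives they weaken to in this toy rule.
SHIFT = {"p": "f", "t": "þ", "k": "h"}

# A preceding stop or s blocks the weakening (the 'eight' case).
BLOCKERS = set("ptks")


def grimm_voiceless(form: str) -> str:
    """Apply the toy rule, reading the context from the unshifted input."""
    out = []
    for i, ch in enumerate(form):
        prev = form[i - 1] if i > 0 else ""
        if ch in SHIFT and prev not in BLOCKERS:
            out.append(SHIFT[ch])
        else:
            out.append(ch)
    return "".join(out)


print(grimm_voiceless("tres"))  # þres
print(grimm_voiceless("okto"))  # ohto
print(grimm_voiceless("sta"))   # sta
```

One rule with one condition covers both the regular case and the "exception", which a bare t : þ row in a table cannot express.
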
4. Remember: All critics are conspiring against you!
Beat dead horses to death and if you can't win, punch them!

You may find that your theory isn't gaining the kind of press that you had hoped and quite a few may be noticing several flaws in your theory. You may not have a single factoid in your favour to form a coherent rebuttal. This is when you bring out the big guns: ignorance combined with non sequitur. This tactic must be handled delicately however. You could try attacking your critics on the personal level, whether that be through the direct use of swearwords or through subtle mockery of your opponent. However this is a desperate last resort, more common on Yahoo! Forums or Youtube. It looks more professional however to simply ignore critics altogether while overpraising the capabilities of yourself and your associates. Using a plethora of unnecessarily sesquipedalian, multipolysyllabic megaterminology, such as "lexicostatistical", is a great tactic to conceal the weaknesses of your theories, as is treating your conjectures as proven facts in any of your publications so as to not bog down your important work with silly things like justification or common sense. Remember, all critics don't know what they're talking about. Their valid criticisms are just a devilish trick of theirs to throw you off-track and pull you off of your hobby horse.

[1] Note that this pdf incorrectly cites TLE 295 in reference to a word zar when in fact it's properly TLE 275. Furthermore, automatically assuming that zar and śar are the same word purposely ignores phonemic distinctions in order to stroke one's pet theory. The instance of huθ-zars declined in the genitive case (TLE 191) has absolutely nothing to do with zar and everything to do with the fact that a dental stop plus the initial sibilant of attested śar (TCort ii) yield z /ts/ in this one particular instance. It's all quite understandable once one puts in the time and effort learning the basics of Etruscan phonetics.
[2] See Krishnamurti, Comparative Dravidian Linguistics: Current Perspectives (2001), p.250
[3] Visit Mark Rosenfelder's humorous but rational article on the Proto-World language and its associated failures in reasoning: Deriving Proto-World with tools you probably have at home. One of the most pointed criticisms of the proposals of Merritt Ruhlen and Joseph Greenberg (R&G) that I appreciate here is: "R&G really gain the benefit of obscurity here: how many of us can determine whether they are (unconsciously) playing the same kind of tricks with Tfaltik and Guamo as I am playing with Chinese and Quechua here?" This criticism is equally applicable to Starostin's theory of North Caucasian and his Tower of Babel project where a similar "benefit of obscurity" is being used against his readers.

(Feb 14 2008)
My entry The hidden binary behind the Japanese numeral system exposes another flaw in Starostin's reconstructions concerning the origin of Japanese numerals.


  1. That was a very enjoyable post, it had me laughing a lot ;)

    Starostin is post mortem annoying me to no end too. Not because of his theories, but his software.

    I'm currently working on digitalising Kloekhorst's Hittite Etymological dictionary (which I believe is currently unpublished, but believe me it's awesome). Digitalising involved me having to work with Starsoft and converting all the non-standard fonts to other non-standard fonts. But the eventual result will be that we can view the dictionary for free on http://www.indo-european.nl so it's all worth it!

    But the software makes me want to hurt someone. It's also quite funny how the font has no Capital S Hacek, but does have a strike-through lambda with a dot below the character. (And other such completely obscure bizarre letters which aren't even to be found as letters in the Unicode as characters on their own :P).

  2. Very sad, Mr. Gordon. But the blame is on me - I should have remembered your blog is reserved for bombastic monologs rather than constructive discussion, which you find 'abusive' by definition. This is, after all, why you have chosen to run off from all the discussion boards and Wikipedia - so as not to be bothered by trolls who actually have something to say and do not get most of their information from snippets of Google books. What I didn't realize was that you're also a coward - not afraid of spewing forth myriads of arrogant stupidities, but afraid of having other people expose them. Well, I am happy to say that I have very rarely met such human material in my life, and even happier that I will never, ever bother you again with a single word, because you have given me ample reason to do so. Enjoy your blog.

    Yours truly
    George Starostin

    (Please feel free to note my IP. It's the same as before, I hope).

  3. The above "anonymous" troll's IP address is currently (from Iskra Telecom) but often changes, keeping within the range 79.172.*.*. Any coward can log in as "anonymous" and claim whomever they wish to be for that day if they are that desperate in life. However, impersonating others may be prosecutable as slander once the ISP is contacted and the Starostin family notified. Even if "anonymous" were who he claims to be, flaming bloggers instead of citing something factual, intelligent and informative would admit to a downward spiral in his academic career. So it is to this individual's advantage to keep their identity guarded from accountability and honesty either way.

    The facts and quotes I cite aren't difficult to verify by capable scholars, so I actively encourage lazy readers to go elsewhere for sites less demanding of them, precisely such sites as the unmoderated Wikipedia which, lo and behold, universities have dismissed as insidious sources of misinformation, childish attacks, impersonations and outright slander of those who are brave enough to be open and accountable to the public by showing their identities without the shame of self that you so evidently demonstrate by your infantile words and deeds.

  4. Phoenix,

    Wow, that sounds like an interesting project (for geeks like me at least, hehe).

    I think the answer to your frustration can be found in the software's accompanying pdf:

    "Yes, STARLING is definitely not the most user-friendly program ever made - it has always valued substance over form, primarily because for the first ten years or so of its existence, the only person working on it was Sergei Starostin himself."

    What I learned from Computer Science is that user-unfriendliness is squarely the fault of the program designer, not its unfortunate users. I have to shake my head at the naivety of their ideological statement, "it has always valued substance over form", as if to suggest that form (i.e. structure) somehow isn't an intrinsic component of substance. They should quit making excuses for themselves, respond to criticism in a constructive way and just update their sickly software.

    Perhaps their statement here may not only explain their poor attitudes towards professional programming but also towards mainstream comparative linguistics as well.

  5. The work on the Hittite dictionary is indeed VERY interesting, and I get paid for it, how cool is that :P

    I am amused how 'George Starostin' manages to accuse you of what is generally done by long-distance linguists. I can understand a proponent of Starostin's theories would find this post offensive. But surely one must agree that some of the methodology of Starostin's work is downright sloppy. I can imagine people still being proponents of the theory, but they definitely have a lot of work to do to make a Dené-Sino-Caucasian language family plausible.

    After all, there's a reason why the better part of the linguists don't believe these theories.

    By the way, I was wondering. Is there an online version of the complete Etruscan text corpus? I guess there isn't eh? :( Makes you wish you picked up Tocharian or Lycian as your obscure language of choice :P

  6. Phoenix: "The work on the Hittite dictionary is indeed VERY interesting, and I get paid for it, how cool is that :P"

    You have your cake and eat it too. I'm jealous :)

    Phoenix: "I can understand a proponent of Starostin's theories would find this post offensive."

    Yes, I don't expect them to see why their theories can't work because their interests seem to be more ego-based than logic-based. I don't think that long-range reconstruction is futile at all, actually, but these people unfortunately help to make most people think it's hopeless by using very bad practices. In effect, they sabotage their own field and I personally would wish they could stop so that long-range linguistics can actually have a chance of moving forward without their static noise. Of course, they aren't going to listen to my little request, so we'll just have to step over them and progress forward ourselves.

    Something that really bugs me is when they leap over tens of thousands of years to get to their "holy grail" proto-language and fail to notice the gap in detail. Where's all the work on Pre-Indo-European, Pre-Uralic, Pre-Dravidian, etc.? Where is the in-between? Their method is like throwing away architecture to build a house with toothpicks and glue, with the roof built first. First, we need to reconstruct 7000 years back before reconstructing something 10,000 or 15,000 years back. And in my view, a "North Caucasian" protolanguage would be just as immensely old as Nostratic is believed to be, if not older.

    It is those kinds of longrangers that actually discourage more knowledgeable linguists from exploring new ideas in this field for fear of being assumed to be just as foolish as Merritt Ruhlen et alia. Hence the brain drain. While trolls like this treat me like an enemy, they don't understand that I'm actually rooting for them (as long as, and only if, they smarten up, using sensible methods and developing structured theories that don't fly against linguistic science).

    Arrrggh! But oh well, it'll all work out eventually. I just hope I'll be alive to see it :)

    Phoenix: "Is there an online version of the complete Etruscan text corpus? [...] Makes you wish you picked up Tocharian or Lycian as your obscure language of choice :P"

    From what I've surfed, Etruscan Texts Project is the only thing that comes close, but it's not complete and I'm not sure what is going on with the "ETP Revision" since the news isn't dated and everything I see seems to have been abandoned since 2006. ??? And then you have Rick McCallister's Etruscan Glossary, which is unfortunately, despite the effort, complete half-sourced gobbledygook and pop-up hell, full of transcription duplicates, non-existent words, conjectural glosses, and horrendous mistranslations without any weigh-in by the webmaster as to which solutions in each of his entries are more plausible.

    Does this make me sad to get involved with Etruscan? Hell no. It makes me excited because there's so much to learn and discover!

  7. Well.

    First of all, as a biologist, it baffles me to no end that historical linguists talk about "proof", "unproven hypotheses" and suchlike. Of all scientists, only historical linguists ever use the word "proof" and its derivatives, and they really, really, really shouldn't. That's because science cannot prove, only disprove. If we find the truth, we cannot prove that what we have found is indeed the truth -- how could we do that? By comparing our findings to the truth we don't have? Falsification is possible: if predictions from a hypothesis don't match an observation, the hypothesis is wrong. Verification is not possible: if predictions from a hypothesis match an observation, we cannot rule out that another hypothesis would make the same predictions. This is basic science theory (so basic that I didn't even need to mention Ockham's Razor). Is that not taught at linguistic institutes?

    Perhaps ironically, at least some members of the Moscow School do know about this. The reply by A. Dybo and G. Starostin (large pdf!) to Vovin's attacks on the Altaic and the Dené-Caucasian hypotheses mentions it near the end. (For other reasons, too, it makes a great and enjoyable read. Highly recommended.)

    Do you know how we biologists wince when a creationist comes and says "evolution is an unproven theory"?

    Which reminds me -- a theory is something bigger than a hypothesis. Large, overarching principles are theories, like natural selection in biology or perhaps the (near-)exceptionlessness of sound changes in historical linguistics.


    I have recently read parts of the long preface to the North Caucasian Etymological Dictionary (recently translated into English by G. Starostin -- another large pdf). It makes a few important points that you don't mention, for example that Proto-West Caucasian is almost completely derivable from Proto-East Caucasian. (Almost -- PEC does have a few innovations, though really not a lot.) In other words, the hypothesis that Nakh and Daghestanian are sister-groups is less well supported than the hypothesis that N, D, and WC are each other's closest relatives!

    Do you really think it matters that WC and EC are separated by typological features like the size of their phoneme inventories, word order, and "structure" (could you be more precise?)? Wouldn't we expect that from language families that are separated by so much time? German and Hindi have majorly different vowel inventories (never mind the productive Umlaut of German), quite different consonant inventories (2 vs 4 series of plosives, /x/ and /ts/ vs retroflexes...), and major differences in grammar (to pick the most egregious one, Hindi is split-ergative). And yet, nobody (myself obviously included) doubts the Indo-European hypothesis.


    You write that "any similarities between the two [EC and WC] could easily have been caused by strong areal influence between the two within such a small region and over a large expanse of time." Then why is the grammar still so different? Between the Balkan languages, or between Korean and Japanese, literal translations are often grammatical or nearly so. Of course anything can be borrowed, but demonstrating that it is more parsimonious to assume that anything was borrowed than that it was inherited is a different matter.


    You probably have a point with your allegation of "diacritic addiction". But how a phoneme was actually pronounced in a protolanguage is less important than the regular correspondences this phoneme has. You know what Schleicher's fable looked like in the original, right? Schleicher had assumed a sound system that was way too close to that of Sanskrit. He even made the laughable mistake of reconstructing [v] instead of [w], and had the vowel system backwards. But because he had the correspondences mostly right, the IE hypothesis didn't fall apart when all those mistakes were discovered ( = when more parsimonious explanations for the same observations, and then some, were found).

    It's probably significant that the 1st pers. sg. pronoun is one of the words Starostin & Nikolayev reconstructed with an extra-rare consonant. In IE languages we find forms of this pronoun that point to PIE */kʲ/, others that point to PIE */gʲ/, and yet others that point to PIE */gʲʱ/ (I hope this displays correctly). AFAIK a solution was only suggested a few years ago, namely that the PIE word had a */gʲH3/ cluster in this place, not a single consonant, and when the "laryngeals" started to disappear, the cluster was interpreted in different ways in different daughter languages. Perhaps such a phenomenon awaits discovery in the PNC pronoun, too?

    Incidentally, despite being a capital letter Ł is not a cover symbol in the transcription by Starostin & Nikolayev, it's the voiced lateral affricate. They had simply run out of lower-case letters when they got that far in making up their transcription.

    ƛ̣ is simply the ejective lateral affricate, a fairly common sound in some modern EC languages, not to mention Na-Dené languages and suchlike.

    (If you really want to swim in cover symbols, check out Dolgopolsky's latest Nostratic dictionary. Looking at the reconstructions will twist your tongue into knots.)


    Sentence-internal lenition is certainly worth looking into, IMHO.


    The lengthy preface (270 pp.) of the Etymological Dictionary of the Altaic Languages is unfortunately not on starling.rinet.ru, but I can send you the pdf. It does not reconstruct */séjra/ for "three". It reconstructs */séjra/ as meaning "object consisting of three parts", */ìlù/ as "third (or next after three = fourth)"/"consisting of three objects", and */ŋi̯u/ as "three". This latter word is supposed to have evolved into the */o/ part of Proto-Turkic */oturʲ/ "30", the */gu/ part of Proto-Mongolic */gurban/ "three" and */gut͡ʃin/ "thirty", the /mi/ part of Silla /mir/ "three" (Blažek 2006) and Goguryeo /mir/ "three", and Proto-Japonic */mi/- "three". Furthermore, as stated on p. 223 of the EDAL preface, the */y/ part of Proto-Turkic */yt͡ʃ/ "3" "may also reflect the same root, although the suffixation is not clear." I put the whole table up on Wikipedia a year ago or something.


    Everyone agrees Proto-Dravidian had two kinds of plosives. Most Dravidianists reconstruct a length contrast. G. Starostin reconstructs a voice contrast instead. Does it really matter that much which one it was? Northern and much of central German have a voice-and-aspiration contrast like English, southwestern German has a length contrast, and southeastern German has a pure fortis-lenis contrast (in those dialects that have any contrast at all)...


    I agree that Etruscan looks much more Nostratic (especially IE-like) than Dené-Caucasian.


    Did Nichols ever offer a detailed refutation of NC?


    I can't see what's wrong with Proto-Altaic */kʰegVnV/ "nine" (except the stupid, stupid use of an apostrophe to mark aspiration!). There's Proto-Tungusic */xegyn/ "nine", and there's Proto-Japonic */kəkənə/ "nine". That's all. Proto-Japonic had a four-vowel system (*/a i u ə/), and the last vowel is not preserved in Proto-Tungusic at all, so I don't see how the 2nd and 3rd vowels could be reconstructed with any more precision. The consonants, on the other hand, correspond very nicely.

    You also exaggerate the uncertainty in the vowels. The EDAL reconstructs only eight vowels for Proto-Altaic, */a e i o u i̯a i̯o i̯u/ (the last three may have been diphthongs or [æ ø y]; there's evidence for both possibilities), and the last three only occurred in the first syllable. Therefore there are not "ONE HUNDRED", but 25 possibilities at the most. 11 of the 20 "possibilities" you tabulate are impossible.
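
The two figures in the paragraph above can be checked with a few lines of Python. This is only a sketch under the stated assumption that the five plain vowels are the only ones allowed outside the first syllable; the two tabulated lists are read off the 20 forms spelled out in the post, and the variable names are arbitrary.

```python
from itertools import product

# Vowels that, per the paragraph above, can occur outside the first syllable.
NON_INITIAL = ["a", "e", "i", "o", "u"]

# Vowels used in the two wildcard slots of the post's 20 tabulated forms.
TABULATED_FIRST = ["a", "e", "i", "ü", "ï"]
TABULATED_SECOND = ["a", "e", "i", "ü"]

# Upper bound on fillings of the two wildcard slots.
upper_bound = len(list(product(NON_INITIAL, repeat=2)))

# Tabulated fillings that use a vowel outside the permitted set.
tabulated = list(product(TABULATED_FIRST, TABULATED_SECOND))
impossible = [pair for pair in tabulated if not set(pair) <= set(NON_INITIAL)]

print(upper_bound, len(tabulated), len(impossible))  # 25 20 11
```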

    Again, the dreaded "North Caucasian *ƛ̣_VẋwV ( ˜ Ł_-)" contains only two wildcards (the two instances of V) and uncertainty about whether the initial consonant was ejective or voiced (in other words, it was not aspirated; unlike American ones, Caucasian ejectives are lenes, not fortes, so I'm not terribly surprised it's not always possible to tell if such a lenis was voiced or ejective). The x with the stupid overdot is one of the Americanist symbols for [χ].

    Also, what kind of vowel correspondences did you expect? In their abovementioned reply to Vovin, A. Dybo & G. Starostin compare the situation to that between German and English. Being a native speaker of German and having had 10 years of English at school, I soon noticed that almost any vowel of one language can correspond to almost any of the other. It's a complete chaos (nicely demonstrated in a table by Dybo & Starostin). Yet, when factors like proximity of /w/ and /l/ are taken into account, much of this chaos can be cleared up, even though much remains (and can, presumably, only be sorted out by knowledge of older stages of both languages plus Frisian and whatnot).


    Are you a glottalist? I'm asking because a plosive system like that of PIE, with voiceless unaspirated, voiced unaspirated, and voiced aspirated plosives, is only attested in a single modern language worldwide (Kelabit, spoken in Sarawak). More to the point, Bomhard's Nostratic and Muscovite Nostratic are phonologically identical; Bomhard reconstructs aspirated plosives and affricates instead of the Muscovite ejective ones, and the Moscow School reconstructs pulmonic voiceless ones instead of Bomhard's ejectives. That's all. (Incidentally, if those voiceless plosives were aspirated, that would go a long way towards explaining why non-glottalistic PIE */b/ was so rare -- its ancestor, Muscovite */p/*, was equally rare, and AFAIK the most likely plosive in such a system to be missing.)


    I can't see why you get so angry at the long correspondence tables that list all correspondences, including yet unexplained splits. In the EDAL preface at least, the correspondences are also explained in the text...


    In Russia, North Caucasian and Dené-Caucasian, not to mention Altaic, have a much better reputation than in the western world, so I don't think anyone was getting paranoid. On the other hand, I agree that lexicostatistics does little good. It's not a phylogenetic but a phenetic method. Fortunately, the Moscow School doesn't rely on it (despite occasional claims to the contrary, such as S. Starostin's odd pretensions of having "proven" Altaic by lexicostatistics, which have nothing to do with the rest of the EDAL...).


    I agree that the current state of Proto-World research is eminently forgettable. Mass lexical, oopsie, multilateral comparison is a nice tool for generating hypotheses, but incapable of testing them, because it's phenetic. Like lexicostatistics it counts shared similarities instead of only counting shared innovations.


    Finally, I'd like to mention this medium-sized pdf on Dené-Caucasian grammar. Polysynthetic verbs all the way back, with cognate markers in cognate slots, not to mention classifier prefixes on the nouns, says Bengtson (published this year). Makes for a great read.

    IMHO the Dené-Caucasian hypothesis is currently in a much better shape than the Nostratic hypothesis. If we ignore Shevoroshkin's addition of Almosan, all that's currently missing from DC is a reasonably complete reconstruction of Proto-Na-Dené. In Nostratic, on the other hand, there is still no reconstruction of Proto-Eskimo-Aleut, and the latest three reconstructions of Proto-Afro-Asiatic (one of them extremely dependent on the way Arabic dictionaries are traditionally organized...) contradict each other on fairly important issues, not to mention the fact that plenty of AA languages are drastically underresearched and the fact that the hypothesis that Omotic is more closely related to (the rest of) AA than to anything else within Nostratic has recently turned out to be very, very poorly supported.


    This brings me back to NC, and to biology. If WC is not the closest known relative of EC, then what is? And what is the closest known relative of WC in that case? Phylogenetic hypotheses can only be shown to be less parsimonious than, or equally parsimonious as, other phylogenetic hypotheses. Directly disproving one without showing that any alternatives are better supported is not possible.


    There are two things that annoy me about the whole Moscow School, though. One is the refusal to use the IPA -- to the point that sometimes different transcriptions are used for different languages, so that an underdot can mean an ejective, a retroflex, a pharyngealized consonant and who knows what else. The other is the use of "(plain) laryngeal" and "emphatic laryngeal" instead of "glottal" and "epiglottal", which took me a long time to figure out -- but at least the distinctions between glottal, epiglottal and pharyngeal are made at all!

  8. Sorry -- I had managed to overlook that you posted this in November, when the Articles and Books site of starling.rinet.ru was AFAIK not yet up. The Wikipedia article on Altaic had about its present form, though.

  9. Wow, I thought that _I_ wrote the longest Blogger comment in the world, but you have me beat, David! :) We'll break Blogger yet. Hahaha. There's a lot to discuss here though, so please excuse me if I'm rushing through this to answer questions before my next engagement.

    David Marjanović: "First of all, as a biologist, it baffles me to no end that historical linguists talk about "proof", "unproven hypotheses" and suchlike."

    This is just semantics. Of course this is all theory but theory still involves logical structure. There's a limit to what one can seriously propose as a theory. A theory must be grounded upon established principles within that field. Sporadic metathesis as Starostin employed over and over is recognized by linguists as a cheap parlour trick that enables the amateur to link one language to another however they wish. A lack of respect for methodology and due process just doesn't cut it, whether it be in biology, linguistics or any other academic field.

    David Marjanović: "That's because science cannot prove, only disprove."

    If we're dealing with theories, neither proof nor disproof can be "certain". For example, the existence of the Big Bang is "most probable" given the current evidence but we can't predict what will shake up our knowledge in the future. Maybe the Big Bang will be "proven" wrong, but then how certain would that proof be? It's merely a matter of probability whether something like that is true or false. Historical linguistics deals with "probability", not "certainty", and it can't be helped because of the very nature of the study. This is why I stress that the principle of Occam's Razor must be our guide in these cases of prioritizing various degrees of uncertainty. In this way of thinking, Occam's Razor is the "proof" that we're talking about.

    David Marjanović: "This is basic science theory (so basic that I didn't even need to mention Ockham's Razor). Is that not taught at linguistic institutes?"

    My impression is that it's not taught enough so I like to keep mentioning it just in case.

    David Marjanović: "Which reminds me -- a theory is something bigger than a hypothesis. Large, overarching principles are theories,[...]"

    Pedantry again. The everyday usage of "theory" is broader than this. I'm caught between an academic-oriented and a non-academic-oriented audience, so forgive my colloquialisms. This word game loses track of the original point I'm making: The theory/hypothesis/idle conjecture/pet theory (whatever you want to call it) isn't tenable.

    David Marjanović: "It makes a few important points that you don't mention, for example that Proto-West Caucasian is almost completely derivable from Proto-East Caucasian."

    These are not "points" since no evidence is pointed to. These are empty assertions that require ample justification. Reconstructing North Caucasian before adequately reconstructing Proto-Abkhaz-Adyghe and Proto-Nakh-Daghestanian is a wasted effort, putting the cart before the horse. Imagination at the expense of careful and detailed deductive reasoning.

    David Marjanović: "Do you really think it matters that WC and EC are separated by typological features like the size of their phoneme inventories, word order, and "structure" (could you be more precise?)?"

    Oh dear, you get lost easily in detail, don't you :) As a whole, yes indeed it matters. These features matter whole-heartedly **as a whole**. "Structure" refers to grammatical structure: verbal affixes, roots, nominal case endings, syntax, everything there is to know about the grammar.

    David Marjanović: "You write that 'any similarities between the two [EC and WC] could easily have been caused by strong areal influence between the two within such a small region and over a large expanse of time.' Then why is the grammar still so different?"

    Mandarin and Cantonese have affected each other. Ask a Chinese speaker and (s)he will tell you. Yet, Mandarin and Cantonese are mutually unintelligible. Unless a Cantonese person learns Mandarin, (s)he will be unable to understand what is being said. There is an ongoing balance between linguistic convergence and linguistic divergence. If this balance didn't exist, all the tightly bound Athabaskan languages of the North American West Coast would have remained a single language! Obviously they didn't. So your perceptions on language change must be incorrect.

    David Marjanović: "But how a phoneme was actually pronounced in a protolanguage is less important than the regular correspondences this phoneme has."

    Precisely! This is where Starostin failed. He certainly feigns regularity but each word of his seems to have a "story" in the comments section. A metathesis here, an irregular phoneme there, etc, etc, etc. Amateurs can easily "fake" regularity by making so-called "rules" and then using a myriad of exceptions for each word.

    David Marjanović: "In IE languages we find forms of this pronoun that point to PIE */kʲ/, others that point to PIE */gʲ/, and yet others that point to PIE */gʲʱ/."

    Methinks you're confused about Proto-IE... a lot. There is only *h₁eǵʰ-/*h₁eǵ- commonly reconstructed for the 1ps. The pronoun is never reconstructed with a plain voiceless stop, not even amongst proponents of the Glottalic Theory who reconstruct a non-palatalized ejective stop. The idea that there is a laryngeal here is only one possibility that's not necessarily true.

    At any rate, you're indulging in lavish hypothesis again. If one cannot explain things with one phoneme, I hardly think two, three or fifteen phonemes clustered together will make Starostin's hypotheses any more credible. Need I remind yet again: Occam's Razor?

    David Marjanović: "I can't see why you get so angry at the long correspondence tables that list all correspondences, including yet unexplained splits."

    You don't understand because you don't understand that language change isn't as simple as "X becomes Y". Sometimes "X becomes thirteen different things" depending on the environment of the phoneme. Please look up Grimm's Law and Verner's Law. Then contemplate how these regular rules and regular exceptions to rules undermine the usefulness of ridiculously long and simplistic "X becomes Y"-type correspondence tables. It suffices to either make a brief table showing changes *in general*, or to describe these changes in detail based on sound phonetic principles. Starostin mixes the two approaches up, both generalistic and detailed, making a big laughable mush that doesn't really signify anything. But oh how it impresses family members.
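
As a rough illustration of "X becomes several different things depending on environment", here is a minimal sketch. The two rules, the vowel set and the sample words are invented for the example and are not Grimm's Law, Verner's Law or anyone's actual reconstruction; the point is only that two ordered, environment-conditioned rules state compactly what a flat correspondence table would list as three separate outcomes of one phoneme.

```python
import re

# Hypothetical vowel inventory for the toy language.
VOWELS = "aeiou"


def apply_toy_changes(word):
    """Apply two invented, ordered sound changes to a single phoneme /t/."""
    # Rule 1 (invented): t > s before i.
    word = re.sub(r"t(?=i)", "s", word)
    # Rule 2 (invented): any remaining t > d between two vowels.
    word = re.sub(rf"(?<=[{VOWELS}])t(?=[{VOWELS}])", "d", word)
    # Elsewhere t is left unchanged.
    return word


if __name__ == "__main__":
    for form in ("tata", "ati", "sta", "tita"):
        print(form, ">", apply_toy_changes(form))
```

A table would show this as "t : t", "t : s" and "t : d" with no hint of why; the rules give the conditioning environment for each outcome.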

    David Marjanović: "Everyone agrees Proto-Dravidian had two kinds of plosives."

    In initial position? No. Medially, yes. I was talking about initial position where it is agreed on by qualified Dravidian specialists that there are only voiceless stops.

    David Marjanović: "Phylogenetic hypotheses can only be shown to be less parsimonious than, or equally parsimonious as, other phylogenetic hypotheses."

    However, languages aren't phylogenetic in nature as I wrote about before. There's your problem. Get that idea out of your head straight away.

  10. Okay, I have spare time and will address some more points here, David. As I said, your comments were long and there's a lot to explore here. You mentioned some specific reconstructions by Starostin and their purported reflexes. Time to tackle those:

    David Marjanović: "[The Etymological Dictionary of the Altaic Languages] reconstructs [...] */ŋi̯u/ as 'three'."

    Outside the Moscow asylum, I've seen "3" reconstructed along the lines of *göl- (I'm recalling from memory, don't quote me), not *ŋi̯u. The former form has the advantage of not demanding of its audience an undemonstrated correspondence of [- : m- : g- : ŋ-] derived from *ŋi̯- to connect words that any rational person can see are unrelated.

    One cruel fact is that Old Japanese mi- '3' is paired with mu- '6' because the entire native Japanese system is undeniably binary in nature (OJap fitö- vs. futa- '1/2', mi- vs. mu- '3/6', yö- vs. ya- '4/8'). Since Starostin seems to have been connecting both Japanese "3" and "6" to Altaic (at least according to this hideous database: see Proto-Altaic "6" and "3"), it's clear that Starostin was fundamentally unqualified in Altaic linguistics. Notice also that he uses two kinds of non-labial, palatalized nasal onsets to explain away Japanese labial, *non*-palatalized /m/!! Anyone who doesn't see the violations of Occam's Razor all over the place must surely be clinically insane. It can't be any plainer that he was shamefully wrong.
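
The pairing can be checked mechanically. The sketch below uses only the three Old Japanese pairs quoted above; the vowel set and the helper name `skeleton` are my own assumptions for the illustration. It strips the vowels from each stem and confirms that a numeral and its double share the same consonants and differ only in vocalism.

```python
# Old Japanese numeral stems as quoted in the comment above:
# (stem, value, stem of its double, value of the double).
PAIRS = [
    ("fitö", 1, "futa", 2),
    ("mi", 3, "mu", 6),
    ("yö", 4, "ya", 8),
]

# Assumed vowel symbols for this transcription (an assumption of the sketch).
VOWELS = set("aiueoö")


def skeleton(stem):
    """Return the consonants of a stem, in order, with vowels removed."""
    return "".join(ch for ch in stem if ch not in VOWELS)


if __name__ == "__main__":
    for base, n, double, m in PAIRS:
        same = skeleton(base) == skeleton(double)
        print(f"{base} '{n}' / {double} '{m}': doubled={m == 2 * n}, same consonants={same}")
```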

    This tactic of using obscure sound changes without a clear phonetic explanation or demonstrated regularity in order to rationalize any set of "cognates" one wishes was one of Starostin's many unscholarly idiosyncrasies. It's unacceptable conjecture, not even theory.

    David Marjanović: "I can't see what's wrong with Proto-Altaic */kʰegVnV/ "nine" (except the stupid, stupid use of an apostrophe to mark aspiration!)."

    Since I've already pointed to the number of permutations embedded in this so-called reconstruction, your inability to 'see' and confront that fact must be a form of strong denial on your part.

    David Marjanović: "I put the whole table up on Wikipedia a year ago or something."

    Wikipedia, eh? Sigh. On the "Altaic Languages" entry, which is currently represented only by Starostin (i.e. the poorest representative academic selected for Altaic. Where's Poppe, Hooper, etc.??), it currently states: "Therefore shared numerals are often considered good evidence for language relationships." How dense and contrary. Since Tagalog uses Spanish numerals to tell time and Japanese uses Chinese numerals for counting, a Wikipedian therefore concludes that Tagalog is an Indo-European language and Japanese is Sino-Tibetan. Yikes.

    The problem with Wikipedia is the fact that the very people who are knowledgeable eventually surrender after fighting endlessly with the most pompous, uneducated know-it-alls on the planet who are backed up by mentally corrupt administrators. If you think about it on the large-scale, the sane simply cannot compete with a horde of computer-literate loons diagnosed with OCD-ADHD. Wikipedian editors usually derive their knowledge solely from the internet without ever stepping foot in a university library where works by OTHER Altaicists are found. Hence the reason why most pages are riddled with errors and logical contradictions.