3 Jan 2012

Baxter-Sagart reconstructions and Occam's Razor

The internet abounds with information if we make the effort to search. One interesting find is a PDF of the Baxter-Sagart reconstruction of Old Chinese roots in tabular format. Excellent! But being an analytical bad news bear, I also see some important issues that tie in with my stance on developing orthographies that properly conform to Occam's Razor. This is out of respect for logic, for necessary simplicity, for clarity and for general readers, some of whom may not be well-versed in linguistics but who are nonetheless interested in the beauty of a language and its history.

Contempt for Occam's Razor inhabits even mainstream linguistics, and the field is far too often misconceived as an intuitive art rather than a logical science. I put my money on organized phonologies and uncluttered orthographies that express only what's necessary for the topic at hand. It's not necessary to show the exact phonetics of a word each and every time when the discussion is not about the exact phonetics of a language. If we have a list of roots, it doesn't make sense to list it all out in excruciating phonetic detail any more than it makes sense to write English this way. As such, mixing IPA symbols into your orthography often spells more trouble than it's worth. "IPA" doesn't stand for International Orthographic Alphabet. At some point a decent linguist must come up with a sensible, legible, optimal, uncluttered orthography to express their language of study beyond the microscale phonetic level. A means, in other words, to quickly and clearly cite words in a vocabulary, pruned for immediate and sufficient comprehension by an everyday reader. Abusing symbols to complicate the message is as corrupt a practice as abusing unnecessary specialist terms for little other reason than for show.

At the top of the list, the Baxter-Sagart team begins with roots like *ʔˤra. This shows us that they envision a phonemic pharyngealized glottal stop. Fine. However, unless */ʔˤ/ is phonemically distinct from other phonemes in the language, say */ʕ/, why be so precise on the orthographic level? Why not use a single clear symbol for this instead of mixing up orthography with the phonetic level far below it? If the orthography, in its necessary simplicity, doesn't make the intended phonetics very clear, one may simply write a quick primer on it and be done with it. If this were the only issue, I could concede that perhaps there's some reason for it that I've overlooked.

Further down the list, we also have *qˤrep, which is quite the tongue-twister. One may dismiss this as within the bounds of plausibility, although I do admit that this apparent pharyngealized uvular stop is unusual for its Schrödingeresque ability to inhabit two places of articulation at once. Then again, there are many consonant-rich languages like Klallam around, right? We also have to keep in mind, though, that these kinds of languages are quite rare and there's nothing scientific and methodical about a theory that strives towards the exotic rather than the minimal. Strong proof should come before the addition of a new phoneme to a reconstruction.

But when we come across *qʷʰˤat-, what are Baxter and Sagart trying to express to us and how does it fit into a plausible phonological system? A labialized, aspirated, pharyngealized, uvular stop??? How on earth could this possibly be contrastive with another phoneme? Surely at this point we have to concede that Baxter and Sagart have not respected the differences between, and proper uses of, phonetic versus orthographic transcription. It gives the impression of a poorly organized phonology and orthography, mixing exact and even unlikely phonetic symbols together to create a visual mess that ends up being more confusing to the reader than helpful. At this point, it's just not reflective of the facts, even when (and especially when) armed with knowledge of the IPA system!

Keep in mind that others have already expressed concerns about the use of "j" in Middle Chinese onsets in words like gji (祇), considering that the "phoneme" doesn't seem to exist when compared to some loanwords coming from outside Chinese (e.g. MC *bjut [Baxter] < Sanskrit buddha 'enlightened one; Buddha'). There is indeed informational value behind "j" here but it's very unlikely to be a true semivowel or a palatalization of the preceding consonant. At some point then, we have to get back to reality, paying careful heed to creating a balanced, minimal orthography because overcomplexity quite simply hampers progress in all things.


  1. I've not read Baxter's work, but I do own a copy of Schuessler's Etymological Dictionary of Old Chinese. They both seem to do the same thing in neglecting to spend time deciding on a coherent and intuitive representation for the segmental phonology of their chosen reconstruction.

    Contrast the situation with Proto-Oceanic, where the representation of the protolanguage is not particularly controversial; any researcher can freely reference roots like *mʷakumʷaku- "arrowroot" or *faSofaSo- "plant" without fear of confusion. (The only point where there is awkwardness is where no daughter language has a clear distinction between e.g. *s and *S, as in *fiŋi[sS]aki- "be twisted".)

  2. Thanks for bringing that to my attention.

    I had known that Baxter had once gone with OC *-j- for MC palatalization (type B syllables), and then he went with OC short vowels to explain them. Now, it looks like he's adopted Norman's pharyngealization hypothesis for type A syllables.

    Based on his use of orthography in his other proposals, I cannot tell to what extent he is actually committed to the phonetics of pharyngealization or is just using ˤ as a symbol to mark type A syllables.

    But, yeah, I agree with you about the need for a plausible phonological system.

  3. Of course a consonant like *qʷʰˤ is utterly unacceptable. But since our previous discussions I have wondered about this orthography issue some more.

    For a person reading about Old Chinese or, for that matter, Proto-Berber or Proto-Afro-Asiatic reconstruction who is not particularly well-versed in the field of linguistics, what difference does it make whether you'd write a pharyngealised consonant as dˤ or ḍ? They both have something 'alien' to them from a layman's point of view. In my experience (admittedly not extensive), I've noticed on several occasions that non-linguists will simply skim over diacritics altogether and discard them as unimportant or meaningless. Maybe that wouldn't be the case if you'd have a superscript letter like that.

    In the case of Proto-Afro-Asiatic I think it's meaningful to go for the diacritic solution, because we do not know if emphatic consonants were pharyngealized or ejectives. The same could probably be said for Chinese, if I understand Stephen C. Carlson correctly. But now I'm using phonetic arguments to not use certain symbols.

    So the greater point I'm trying to make here:

    Is a known symbol with an alien diacritic really less alien to someone with no experience in linguistics than a completely new symbol from the IPA?

    A character with a diacritic might imply it has some kind of relationship to the character without the diacritic, whereas an alien symbol would show that it's a completely different sound in its own right.

    In Proto-Berber there's no direct relationship between emphatic and non-emphatic consonants for example. Obviously, if a language has alternation between the two in one root, it would be better to use diacritics or other modifying symbols instead to emphasize the relation.

    Obviously both dˤ and ḍ would fail equally badly at showing a lack or presence of relation to a d.

    Maybe an interesting way to test this would be to get a group of non-linguists to look at a transcription of a text, one in an IPA-like transcription and the other with a diacritic-based transcription and ask them to write down which letters the language has and see what the results are.

  4. Phoenix, by referring to "Occam's Razor", I expect that the reader already understands that this is a principle of reducing complexity to its bare *necessities*, not of reducing things even further to absurd drivel just to gratify mentally incapable people. However, reducing unnecessary complexity aids readability for those that *want* to learn and understand a theory. So there's no need to rant on about whether dˤ or ḍ is better. They're both equally good **if and only if** pharyngealization truly contrasts with its plain counterpart in an organized phonology with strong evidence to back it up. Failing that, **unnecessary** diacritics are distracting fluff.

    Diacritics should be used to distinguish **distinct** features in a _phonemic_ inventory. As you yourself can see, however, *qʷʰˤ is "utterly unacceptable" and for the good reason that it suggests contrasts that are so laborious and exotic as to be implausible. It suggests indulgence and lack of planning on the part of the theorist.

    "Maybe an interesting way to test this would be to get a group of non-linguists to look at a transcription of a text [...]"

    Please focus on Occam's Razor, not bland stupidity.

  5. FYI: In this May 2011 pdf, the Baxter-Sagart phonological system is outlined in a table. It's as absurd as I suspected. I note bizarre contrasts like that of *qʷʰˤ with *qʷˤ or *ɢˤ with *gˤ. These sounds are either nonsensical from an articulatory standpoint or represent contrasts that I've simply never seen in any world language. Even if attested, the system is exotic, hands down. I struggle not to laugh in utter bewilderment because I honestly don't see how it connects with real-world systems whatsoever. It's as if a computer programmer created this language.

    Stacked upon this extraterrestrial phonology is the added awkwardness of the distribution of these alleged phonemes among the lexical items they cite: 我 *ŋˤajʔ 'I' and 五 *C.ŋˤaʔ 'five'. I perceive an amazing lack of respect for markedness and the Pareto Principle here since we should expect to find plain phonemes in commonmost words, not exotic words.

    So I chalk this up to a perverse level of hyperanalysis concerning Chinese rhymes. I feel in my bones that a much simpler solution exists. No doubt an important opportunity to fill the void awaits eager young graduates. This Baxter-Sagart system is subpar for me.

  6. Sorry, the end of the second paragraph above should read "since we should expect to find plain phonemes in commonmost words, not exotic ones."

  7. I think the appeal of a simplified system for talking about language is that language itself favors these. What matters in articulating a word is not that you produce a set of exact sounds but that the sounds you make fall within a range where they won't be mistaken for other sounds. The IPA is good for capturing what things sound like or may have sounded like or features that sound may have had, but what is needed when working with language for purposes other than transcription is a system where you can easily identify words and syllables that are similar to each other but not so similar that a native speaker would confuse them in an environment where they could hear clearly.

    One of the big places writing runs into problems is when one language's system is used for another language. Using IPA where it is not warranted is another variant of this, an attempt to fit something imprecise, like language, onto a model that by virtue of being overprecise fails to capture the level of permissible inexactness that is at the heart of language's genius.

  8. So I wrote a big long response to this and it was way too long. The short version is that what Baxter and Sagart have published should be understood not as a viable phonetic reconstruction but as a complicated markup system representing various correspondences on which consensus has been reached. Some of them, for example the vowels and probably most of the simple consonants, have in fact come to the point where you can read them as IPA, but there are not a few (for example, the pharyngealization mark you mentioned) for which no consensus has been reached on the nature of their protoform.

    I'm personally hugely unconvinced by pharyngealization as an account for type A syllables (the idea being that it blocked a wave of y-insertion which I don't think is very well motivated) but the position that the character occupies makes it very convenient for exploring its relationship to different elements of the syllable it occurs in; as a raised postarticulation mark we're used to scanning it with the initial consonant, and it's right next to the medial if it's there so it's easy to spot that too. If you're scanning the vowels it's floating up there not far from where a prosodic diacritic usually would. You don't really need to worry about the coda since we already know it doesn't affect the rhyme.

    Another vexing thing if you want a sense of how things sounded is that not infrequently elements, especially extrasyllabic elements, are represented together in the same word when they probably couldn't have occurred together synchronically. This happens by necessity of the fact that we can't tell yet which sound changes happened first. It seems like some of them happened when the language was typologically not unlike English: etymologically diverse, with a robust system of affixation, permitting words with multiple complex syllables. By the end of the Three Kingdoms period it permitted (in the acrolects, at least) no affixation and all those complex words had been compressed into C(Y)V(C) roots. It's not even clear whether or not Proto-Sino-Tibetan permitted roots to be more than sesquisyllabic; all extant Tibeto-Burman languages have only mono- and sesquisyllabic roots except secondarily, but there are also an awful lot of reconstructed affixes, often piled on three or four deep, for which no vowel can be reconstructed.

    The point of putting it out there is so people without journal access can work on it.

  9. Bel Matin, my summed-up feelings on that are:

    1. A non-viable phonetic reconstruction is an utterly useless theory.
    2. A "theory" that causes more problems than it solves is called a "mess", not a "theory".
    3. Baxter & Sagart have published a pile of rocks for us to mine, but not a theory per se. Merely assigning symbols to **apparent** correspondences that they **believe** to be relevant despite failing to come up with a real-world phonology that explains them compels us to question their claims and to explore more meaningful answers. Reasoned skepticism is healthy.
    4. Language reconstruction is squarely about attempting to understand real-world history based on surviving evidence, not to create abstract, ivory-tower "markup systems" that give us no added insight.

    But let's focus on fixing Old Chinese phonology instead of defending other people's messes. Baxter and Sagart nonetheless share information online openly which is something to be commended and encouraged among academics. Solutions are still needed and debate is open.

    In regards to their "pharyngealization" which just can't be right, I'm fairly confident that the solution is to both change the articulation and reassign the feature to the vowel instead. This already relieves the big burden borne by OC consonants and thus also resolves many of the assaults on markedness and common sense. Perhaps these "pharyngealized" phonemes amount to nothing more exotic than sequences of plain phonemes neighbouring long vowels.

    In regards to their "uvular plosives", I notice that "night" according to B&S is *ɢaks while it is *ya in Tibeto-Burman. A supposedly uvular *ɢ, they thus maintain, corresponds with TB palatal *y. The same is the case for "sheep": OC *ɢaŋ (as per B-S) versus TB *yaŋ. I'm led to wonder if these Old Chinese words are to be reconstructed instead as *ɣa and *ɣaŋ respectively. I ask: What evidence thoroughly proves that these alleged "uvular plosives" in OC are indeed plosives at all, let alone whether they must be "uvular"? From what I understand, their case for this rests on evidence showing an apparent three-way voiceless aspirated/voiceless non-aspirated/voiced contrast that mirrors that seen already in the velar series. But can their *qʰ/*q/*ɢ not be shifted to something like a velar fricative series of *x// instead?

    All I'm saying is that there is surely a more competent theory out there to resolve these phonetic issues, especially with a thousand sober heads working together online on it.

  10. Sorry, correcting a booboo: "I'm led to wonder if these Old Chinese words are to be reconstructed instead as *ɣa and *ɣaŋ respectively."

    That should of course be *ɣaks, not *ɣa, assuming there are no other issues involving their reconstruction of the coda.

  11. I've stumbled across your blog from elsewhere, so I don't pretend to be any kind of Tibeto-Burman expert. :) But I did happen to notice the following comment: "A labialized, aspirated, pharyngealized, uvular stop??? How on earth could this possibly be contrastive with another phoneme?" I'm working on a language that has just such a contrast. Ubykh possesses phonemic contrasts between voiceless and ejective versions of plain, palatalised, labialised, pharyngealised, and labialised-pharyngealised uvular stops. I'm not saying that Baxter and Sagart's consonantal system is valid, mind you - it smacks of the titanic inventory Sergei Starostin postulated for Sino-Caucasian - but languages with these contrasts do exist.

  12. Rshfenwick, let's not stretch facts just to play aimless devil's advocacy. The extinct Ubykh language doesn't support your point since:

    A) Ubykh has only a subset of B&S's outlandish inventory (i.e. no voiced uvulars, no extra "aspirated pharyngeals", etc.).

    B) Given the "pharyngealized velar" gap, Ubykh's "pharyngealized uvulars" suggest pharyngealization is accompanied by concomitant tongue retraction (i.e. [qˤ] = retracted [k̙ˤ], representing a pharyngealized "plain" ). Why mark phonemes with NON-phonemic features in a good orthography at the expense of legibility and information priority? We just don't need to constantly be reminded orthographically that is retracted if it goes without saying.

    So B&S's system far exceeds Abkhaz-Adyghe, Khoisan, Salishan and all other consonant-rich languages known on planet Earth. Unlikely theories require unlikely evidence.

    With representatives that allegedly contain these sounds in parentheses according to their pdf, I count at least 24 distinct dorsal/radical stops in B&S's Old Chinese: *k (強), *g (奇), *q (姜), *ɢ (錡), *kˤ (歌), *gˤ (渾), *qˤ (瓮), *ɢˤ (芽), *kʰ (綺), *qʰ (蟻), *kʷ (舊), *gʷ (懼), *qʷ (迂), *ɢʷ (泉), *kʰˤ (脛), *qʰˤ (殸), *kʷˤ (黃), *gʷˤ (狐), *qʷˤ (汙), *ɢʷˤ (弘), *kʷʰ (丘), *qʷʰ (勸), *kʷʰˤ (薖), and *qʷʰˤ (華).
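
    The tally above can be double-checked mechanically. What follows is only a minimal Python sketch that treats the symbols as plain labels and makes no phonetic claims of its own. The names `STOPS`, `FEATURES`, `describe` and `tally` are made up for illustration, and the fourth item (the one exemplified by 錡) is assumed to be *ɢ, on the k / g / q / ɢ pattern of the rest of the list.

```python
from collections import Counter

# Dorsal/radical stops as listed in the comment above (asterisks dropped).
# Assumption: the fourth item, the one exemplified by 錡, is taken to be ɢ.
STOPS = [
    "k", "g", "q", "ɢ",
    "kˤ", "gˤ", "qˤ", "ɢˤ",
    "kʰ", "qʰ",
    "kʷ", "gʷ", "qʷ", "ɢʷ",
    "kʰˤ", "qʰˤ",
    "kʷˤ", "gʷˤ", "qʷˤ", "ɢʷˤ",
    "kʷʰ", "qʷʰ",
    "kʷʰˤ", "qʷʰˤ",
]

# Superscript modifiers used in the list and the feature each one labels.
FEATURES = {"ʷ": "labialized", "ʰ": "aspirated", "ˤ": "pharyngealized"}


def describe(symbol):
    """Split a symbol into (place, sorted tuple of secondary features)."""
    base = symbol[0]
    place = "velar" if base in ("k", "g") else "uvular"
    mods = tuple(sorted(FEATURES[ch] for ch in symbol[1:]))
    return place, mods


def tally(stops):
    """Count symbols by place and by number of stacked secondary features."""
    by_place = Counter(describe(s)[0] for s in stops)
    by_mod_count = Counter(len(describe(s)[1]) for s in stops)
    return by_place, by_mod_count


if __name__ == "__main__":
    by_place, by_mod_count = tally(STOPS)
    print(len(STOPS), dict(by_place), dict(by_mod_count))
```

    Under those assumptions the list does come to 24 distinct labels, split evenly between velar and uvular bases, and two of them (*kʷʰˤ and *qʷʰˤ) carry all three secondary features at once.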

    For all I know, I may have missed a few dozen stops along the way. I'm losing count.

    Not even Ubykh has a contrast between *kʷʰˤ/*qʷʰˤ. Even if we relate B&S's "aspiration" to Ubykh ejectives, there STILL is no corresponding contrast of *kʼˤʷ/*qʼˤʷ in Ubykh's already bloated and statistically rare system. Then there's the added dimension of infixal *-r- which lacks any correlation in Ubykh. And how do we relate the additional sound marked *ʔˤ (as in 亞) to anything in Ubykh?

    Face it. Ubykh can't mend this broken wing.

  13. It is a peculiar thing, but the most bizarre phonologies are invariably found in reconstructions. (One reason why I hold to a variant of the glottalic theory, incidentally.) I only know of one case where a sound unattested in living languages can solidly be said to have been uttered once, and that's the voiced pharyngealized lateral fricative of Classical Arabic.

  14. It's also possible that some of these "single" consonants were simply allophonic realizations of multiple consonants instead. For instance, qʷʰˤ may have been qʷhᵊʔ. At the same time, there may or may not be any reason to postulate such a form, when simpler forms could be alternatively proposed.