August 18, 2013 - Everything2.com

Breakfast today: scrambled eggs and cream cheese in between two Cinnamon Eggo Waffles. Man, Philly goes with everything.

I recently fell back into the conlang trap, and this time around I'm musing about putting together one of my own. I've always been intrigued by language, and the prospect of just up and inventing one was crazy. Since I started, I've read about countless languages, and even tried my hand at a few. I'm not very good at sticking with things, so my Esperanto is still only beginner level, if that, but coming from a multilingual background (Russian/English, plus French taken up through high school) I find there is a certain satisfaction in linguistic reflection and omphaloskepsis.

In an attempt to get shit done on a project of mine for once, I should like to get down some thoughts, goals, and visions, if only to reflect on them later.

Post-production note: This ended up being a bit of an introductory course to conlangs. I tried to be brief, but it felt elitist to leave so much unexplained—few of the featured hardlinks are even nodeshells, never mind filled nodes. E2 seems not to have a very hip linguistics scene :(

The Sapir-Whorf Hypothesis and motivation

For those who need a quick rundown, the Sapir-Whorf hypothesis (which was neither coauthored by Sapir and Whorf nor explicitly hypothesized, incongruously enough) asserts that the structure and capabilities of a language affect the thought processes of its speakers, and shape their world-view. It's also known as the principle of linguistic relativity, and comes in two flavours: the strong Sapir-Whorf hypothesis holds (roughly) that language shapes thought to the extent of limitation and restriction of certain incompatible or uncategorized (i.e. not having provisions in the language) kinds of thought; and the weak SW holds that language merely influences thought and other cognitive processes.

I sit squarely on the fence regarding the strong Sapir-Whorf hypothesis. The weak version is very probably true, and I base this conclusion on what I know of various language studies regarding languages with different conceptualizations and quantifications of e.g. time and direction. For instance, I can remember the story of an Australian aboriginal culture/language that lacks words for relative direction, only absolute orientation: while this might seem to be a disability of the language, its native speakers apparently tend to be very good at navigation and orienteering. Another strong example is the Himba tribe from Namibia, whose language has only five "primary" colours compared to English's standard of eleven (black, white, gray, red, orange, yellow, green, blue, purple, pink, and brown), and whose memory of colours is weaker than that of a speaker of a more specific language. (Admittedly, a strong early influence on my views on this was this Cracked.com article.)

I hesitate to support the strong Sapir-Whorf, however, because I don't think that language is a limiting process. While it can certainly discourage some types of thought, I find it perfectly cromulent to assume that barriers of native inexpressibility or cultural taboo can be transcended by a suitably motivated speaker. In my mind, a language that makes some thoughts inexpressible or even unthinkable (try not to read this in a Newspeak sense) would have to be so restricted as to not satisfy its speakers, even if those speakers are so indoctrinated at young ages. This could be an underestimation on my part of the influence of language in neural programming, which is why I remain not entirely decided—that and a lack of suitable evidence to form an actual conclusion—but I don't see why I'd be wrong.

This all being said, I don't think I want to create a language in order to test Sapir-Whorf, as many others have done. Lojban, Toki Pona, and Ithkuil are all well and good, but they already exist, with probably a lot more linguistic thought put into them than I could muster. I'm more interested in seeing what kind of language I can create that would fit into my constraints for an improvement upon English. I see it as an auxlang, I guess, though with absolutely no delusions regarding its adoption.

Taxonomic languages and entropy

Taxonomic languages are a little more nuanced than the name might imply. The general idea is that a taxonomic language will attempt to enumerate all possible things in a systematic way, in a first-principles sort of fashion. This can be in the style of Ro, a language which classifies words much like the Dewey Decimal System classifies books: the first two letters denote the major field of the word, the next letter a subfield, and so on; the canonical example is bofo-, the prefix signifying colours, which leads to bofoc red and bofof yellow. Another style, however, is to come up with base names (usually short ones, for convenience) for a core set of concepts and use agglutination to represent all possible words.
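To make the Ro-style scheme a bit more concrete, here is a minimal Python sketch of prefix-based classification. Only the prefix bofo- and the words bofoc and bofof come from the example above; the table and function names are made up for illustration, and this is not a description of how Ro itself is specified.

```python
# Hypothetical tables. Only "bofo", "bofoc" and "bofof" come from the text.
CATEGORY_PREFIXES = {"bofo": "colour"}
LEXICON = {"bofoc": "red", "bofof": "yellow"}


def categorize(word):
    """Return the category of the longest known prefix of `word`, or None."""
    matches = [p for p in CATEGORY_PREFIXES if word.startswith(p)]
    if not matches:
        return None
    return CATEGORY_PREFIXES[max(matches, key=len)]


def differing_positions(a, b):
    """List the indices at which two equal-length words differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


if __name__ == "__main__":
    for word, meaning in LEXICON.items():
        print(word, "->", categorize(word), "/", meaning)
    print(differing_positions("bofoc", "bofof"))
```

Note that the two sibling words share everything except their final letter, which is exactly what the classification scheme is designed to do.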

I figured from the start that taxonomic languages are full of crap. Ro was disgusting and arbitrary, and though the idea is admirable, it seems too obviously textbook to be an actual language. I have similar problems with the idea of the alphabet of human thought (i.e. with so-called philosophical languages): just because you create a list of fundamental irreducibles, does not mean that a language is more natural if it is based a priori on cobbling these atoms together. Although I have not tried these systems/languages out extensively, it appears to be very much an exercise in memorization: while someone raised with the language might get used to such a system, and even find merit in it, it's just far too different from natural language for me.

English, on the other hand, has enough redundancy and a diluted enough informational content—IIRC it's about 1 bit of entropy per character—that error correction is often easy to the extent of being natural. If a message becomes garbled or improperly transmitted due to a typo, anyone with sufficient knowledge of English can easily notice the discrepancy and even predict the intended message. This is, in fact, so easy, that in practice, many English speakers are able to go substantial fractions of their lives misinformed about spelling and grammar, despite communicating with others often. A similar phenomenon occurs with most other natural or almost-natural languages, though with varying ease of correction—I would like to utilize this fact, because I've realized that in practice, language will become mutated and abused despite even the best of intentions.
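For anyone who wants to poke at the entropy figure, here is a rough Python sketch that estimates bits per character from single-character frequencies alone. This is a much cruder measure than the one behind the roughly 1 bit per character figure: as far as I know, that number comes from estimates that take the surrounding context into account, whereas counting characters in isolation typically gives something closer to 4 bits for English text. The function name is my own.

```python
import math
from collections import Counter


def unigram_entropy(text):
    """Shannon entropy in bits per character, using only the
    frequencies of individual characters (no context at all)."""
    counts = Counter(text)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    sample = "if a message becomes garbled anyone can predict the intended message"
    print(round(unigram_entropy(sample), 2), "bits per character (context-free)")
```

The gap between this context-free number and the contextual estimate is one way of picturing the redundancy described above: most of what each character could tell you has already been given away by its neighbours.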

In fact, I think this kind of ability to error-correct will be critically important in a language being useful. The barrier to entry is lower, and there is plenty of leeway in communication at the most primal level, while still allowing for things like classy language when following the language's "style guide"*. In taxonomic languages like Ro or Wilkins's unnamed language expounded in An Essay Towards a Real Character, and a Philosophical Language, or very dense languages like Ithkuil or Briefscript, entropy is high and the risk of misinterpretation due to either human error (e.g. misspoken word) or transmission error (e.g. noisy room) is too great. Further, it takes more attention and effort than for e.g. English to glean information from some text or speech in a conversational setting. Most of the time, a good knowledge of English allows you to be lazy enough to predict the content of a sentence from only a handful of its words, and you can gloss over the remaining bits; I see this as advantageous and admirable in a language.
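Here is a small Python sketch of the error-correction argument, using the standard library's difflib.get_close_matches as a stand-in for what a human reader does. The word lists are toy examples I picked for illustration (the two Ro words are the ones mentioned earlier), and the correct function is hypothetical. The point is only that a typo in a sparse vocabulary lands on a non-word and can be repaired, while a one-letter slip in a dense vocabulary lands on another valid word and goes unnoticed.

```python
import difflib


def correct(word, vocabulary):
    """Return `word` if it is valid, else the closest valid word, else None."""
    if word in vocabulary:
        return word  # looks fine, so nothing gets corrected
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.6)
    return matches[0] if matches else None


# Toy vocabularies, chosen only for illustration.
SPARSE = ["language", "message", "garbled"]
DENSE = ["bofoc", "bofof"]  # Ro: red, yellow

if __name__ == "__main__":
    # A transposition typo is not a word, so it can be detected and fixed.
    print(correct("langauge", SPARSE))
    # Mistyping the last letter of "bofoc" gives another valid word,
    # so the error passes through silently.
    print(correct("bofof", DENSE))
```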

A quick final gripe with taxonomic languages: how to handle vocabulary otherwise. While I could draw inspiration from any of the existing syllabaries on the Earth, this is a little too a posteriori, and while it would certainly aid learning speed in people already fluent in languages related to the source language, this has already been done by Esperanto and Lojban. My goal is not to rederive either of them, but rather to create something of my own: my plan is to make a sort of ad-hoc list of words, in a way which both leaves comfortable space for the coining of new words (in the style of quiz) and still allows for natural agglutination or compound words in case some word or idea can or should be calqued in that way.

Logical languages and complexity

This leads me straight into the idea of complexity in language: what is an acceptable level of mandatory memorization, for the sake of efficiency? Obviously, vocabulary is more or less a straight memorization task, with the only relief being the ability to recognize patterns which certain groups of related words might fall under. (Consider the following list of simple and compound words: jump, jumps, jumped, jumping, jumper, jump-rope, high jump, ski jump.) But clearly the speakers of English eat that shit right up, or else we wouldn't have so many exceptions, misconceptions, and bad things to say about the French. (I before E except after C is a dirty lie!)

Logical languages usually fall near to taxonomic languages and involve ideas such as the alphabet of human thought, which I've already touched on. But they can be more creative with where they fall between a priori (constructed without influence) or a posteriori (directly influenced by existing languages). Lojban is a very logical conlang, especially grammatically (suggesting an easy a priori classification), but apparently got most of its vocabulary by taking the average of some top spoken languages around the world, which is a curious brand of a posteriori.

Esperanto I think is probably the most successful in terms of striking the proper balance with complexity. Grammar is mostly a priori, but vocabulary is a posteriori, being inspired by many European languages (English, Spanish, and Russian come quickly to mind). And despite some failings in modern borrowing of foreign words**, it has probably the best balance of any auxlang between correcting failings and still using a level of complexity comparable to that of English.

I should probably elucidate a little on what I mean by the complexity thing. Take, for example, Hawaiian, the pidgin Tok Pisin, or Toki Pona. The alphabets are small, the number of different syllables is reduced from English, and the result is either less expressivity in the same amount of space, or equal expressivity requiring more space. In the case of Toki Pona, actually, expressivity is intentionally decreased globally, with a small number of fairly general word roots and simplified grammar, intended by the author to attempt to induce a more Zen-like view of the modern world through Sapir-Whorf.

On the other hand, in a monolith like Ithkuil, while there is a great deal of complexity and terseness, it requires a great deal of dedication on the part of the speaker and any listeners to know the vocabulary, unambiguously and exactly interpret what is said (there are some very subtle differences in most of the words!), and then to form and transmit their own thoughts in response without misspeaking/mistyping that response. This means a very high barrier to entry and a very steep learning curve, thus making it overcomplex, and consequently highly, highly impractical for linguistic dissemination. Even Ithkuil's author does not speak the language, and has stated he does not want to attempt to do so.

The key would be to find an optimal middle ground between these two extremes, and my benchmark for this is in the English-Esperanto-Lojban region. While strictly speaking these three languages are very different, and it is probably misguided to call any of these specimens the ideal language, I am basically rating all the abovementioned languages on an ill-defined and non-quantitative continuum between very simple and very complex. I should liken this problem to the problem of selecting the most ideal numeric base:

English (or even all natlangs!) might be our historical and admittedly arithmetically unwieldy decimal
Simple languages might be analogous to binary and ternary (the limit point of the simplest languages being unary)
Complex languages might be large-numbered bases that contain many factors (the limit point of this being the "base-∞" language which has a unique glyph for every possible thought or set of thoughts)
The group of roughly-English-tier candidates trying to compete with English/natlangs might be bases 8, 9, 12, and 16
Nifty-but-unorthodox languages, like AllNoun, could be the uncommon but interesting negabinary, balanced ternary, or base φ
Esoteric bullshit like Oou equals esoteric bullshit like base -1±i***
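For readers who haven't met the "nifty-but-unorthodox" bases in the analogy, here is a short Python sketch that writes ordinary integers in negabinary (base -2) and balanced ternary (digits -1, 0, +1, written here as "-", "0", "+"). The function names and the digit symbols are my own choices, and this is only meant to show that these bases work at all, not to say anything about the languages they stand in for.

```python
def to_negabinary(n):
    """Write the integer n in base -2, using digits 0 and 1."""
    if n == 0:
        return "0"
    digits = []
    while n != 0:
        n, r = divmod(n, -2)
        if r < 0:  # Python's remainder takes the sign of the divisor
            n, r = n + 1, r + 2
        digits.append(str(r))
    return "".join(reversed(digits))


def to_balanced_ternary(n):
    """Write the integer n in balanced ternary with digits '-', '0', '+'."""
    if n == 0:
        return "0"
    digits = []
    while n != 0:
        n, r = divmod(n, 3)
        if r == 2:  # a digit of 2 becomes -1 with a carry
            n, r = n + 1, -1
        digits.append("-0+"[r + 1])
    return "".join(reversed(digits))


if __name__ == "__main__":
    for k in (-1, 2, 3, 5):
        print(k, to_negabinary(k), to_balanced_ternary(k))
```

Both systems can represent negative numbers without a separate minus sign, which is the sort of quirk that makes them interesting without making them practical.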

To (finally) summarize

I hope to make a conlang of my own. I am not inspired by some new test of the Sapir-Whorf Hypothesis, but rather by the idea of one-upping Zamenhof, if only in a theoretical light. Basically, this language would be a reimplementation of the wonderful ability of natural languages, particularly English, to allow for the nearly autonomous prediction of coming text and correction of already 'processed' text. I should hope not to introduce some kind of "alphabet of human thought" hogwash, but rather keep the vocabulary a little more ad-hoc, which I believe would make further invented words seem more natural, insofar as they don't have to be obscure or ridiculous trainwrecks of letters to be unique.

I (hopefully) have no delusions about the practicality of my language or its adoption, and I will try to be okay with both a flawed result and the potential lack of any result at all.

Maybe I'll also learn something?

* Some conlangs develop cadences and best practices of their own through continuous usage (I'm thinking of Lojban off the top of my head); this is to be expected. But can you imagine a lipu toki pi jan Siwun en jan Wije?

** Football is futbalo in Esperanto (yes, treated with the same amount of ambiguity as in English w.r.t. European/American and the sport/the ball), which is a transliteration. A clearly more "Esperantish" choice would be piedpilko, from piedo foot and pilko ball. A weaker gripe is with the calque datumbazo for database, when datumujo, literally 'data container', seems more Esperantish. These examples are from this blog post.

*** There were scant few details present about this language when it still existed: the page it was written on could be found at http://kisa.ca/oou.html. There happens to have survived an archive of this page, however, so perhaps all is well.