**Conjecture to Statistical Proximity**

Abstract

We continue earlier work [3] on showing the distinctions and proximities between languages on the basis of statistical facts. In this paper, we construct a homology tree from the distance matrix obtained by transforming some statistical aspects of the empirical observations into binary sequences, so as to conform to the concepts of memetics [2]. The resulting visualizations show interesting facts and may prompt further steps toward advancing our understanding of the discourse on languages and ethnicities.

Keywords: quantitative linguistics, ethnic languages, phylomemetic tree, evolution of language

Statistical Language Proximity

The works in [3] and Situngkir [4] indicated that a statistical perspective on language based on the Zipfian plot might reveal differences between languages. Some observations suggest that languages differ in ways that emerge from their particular linguistic structures, which in turn result from human adaptability in developing languages. This short report has a slightly different motive: to conjecture an algorithmic approach for seeing the proximity of languages as distinguished by their statistical properties.

We use the algorithmic computation introduced in [1] to build the phylomemetic tree. The evolutionary tree of innovative artifacts is built upon the homology between different products whose characteristics are depicted in a binary matrix (called the binary matrix of memeplexes). The main concepts imported from evolutionary theory and memetics are discussed in [2].
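As a rough illustration of how a tree can be grown from pairwise distances, the sketch below uses plain average-linkage agglomerative clustering. This is a generic stand-in and not the algorithm of [1], whose details are not given here; the function name `average_linkage_tree`, the labels, and the toy distances are all hypothetical.

```python
def average_linkage_tree(names, dist):
    """Repeatedly merge the two closest clusters; return a nested-tuple tree.

    `dist(a, b)` must return the distance between two leaf names.
    Generic average linkage, used only for illustration.
    """
    # Each cluster is a pair: (subtree, list of leaf names under it).
    clusters = [(n, [n]) for n in names]

    def cluster_distance(c1, c2):
        # Mean distance over all leaf pairs drawn from the two clusters.
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(dist(a, b) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the indices of the closest pair of clusters.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Toy distances between three hypothetical items (not data from the paper).
toy = {
    frozenset(("A", "B")): 1,
    frozenset(("A", "C")): 6,
    frozenset(("B", "C")): 5,
}
tree = average_linkage_tree(["A", "B", "C"], lambda a, b: toy[frozenset((a, b))])
print(tree)
```

Here the two closest items are merged first and the remaining item joins at the root; with real data, the leaves would be the languages and the distances would come from the homology matrix.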

In our case, the two variables ϑ and θ, which empirically distinguish the languages in our statistical observation, are transformed into the binary matrix. It is then easy to obtain the homology matrix reflecting the similarities between languages, statistically speaking or, more specifically, in terms of the Zipfian analysis of textual artifacts. In order to visualize the differences and similarities between the languages fitted by the equation [1], we use the notion of Hamming distance. This distance counts the number of positions at which two binary sequences differ.
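The Hamming distance itself is simple to compute. The following is a minimal sketch, assuming each language has already been encoded as an equal-length binary string; the encodings and the labels `A`, `B`, `C` are hypothetical placeholders, since the actual transformation of ϑ and θ into binary form is not specified here.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length binary strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(codes: dict[str, str]) -> dict[tuple[str, str], int]:
    """Pairwise Hamming distances, keyed by (name_a, name_b)."""
    return {(p, q): hamming(codes[p], codes[q]) for p, q in combinations(codes, 2)}


# Hypothetical binary encodings; NOT values taken from the paper.
codes = {"A": "110010", "B": "110011", "C": "001101"}
dist = distance_matrix(codes)
for pair, d in dist.items():
    print(pair, d)
```

A small distance means two sequences share most of their binary features, so the corresponding languages would sit close together in the resulting tree.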

Discussions

At a glance, the three trees shown in figure [1] are slightly different, especially in the leaves and the absolute positions of the nodes. Interestingly, however, the clustering of the nodes is in some cases persistent across texts. For instance, the anthropologically and culturally distinct Javanese and Sundanese are always placed in a different cluster from the Sumatera languages (Pakpak, Toba, Angkola, Simalungun, and Karo), as well as from the two languages we used as references, i.e., Indonesian and English.

The clustered pairs, i.e., Javanese and Sundanese as well as Indonesian and English, are seen consistently across the three corpora. This is interesting, since the two are culturally very different from the Sumatera ethnic groups in terms of customs, cultural artifacts, traditions, etc. Scrutinizing the Sumatera languages in the three homology trees, we can see that, qualitatively speaking, the clustering conforms to some common understanding of these ethnic groups. As Toba is one of the most widely used Batak languages in North Sumatera, we do not see any language uniquely clustered with Toba (see the map in [3]).

Throughout the three trees, Batak Toba may show up in various branches but never together with Javanese and Sundanese. Angkolanese and Karonese appear twice in the same unique cluster; this might reflect similarities that possibly go beyond language alone. Some relative proximities between languages are also visible in the language homology trees. For instance, in the sub-figure showing the tree retrieved from the Pentateuch, we can clearly see that Simalungun, Karo, and Pakpak are relatively closer to Javanese and Sundanese than Toba and Angkola are.

However, the latter two claims must be confirmed by anthropological and more case-specific linguistic studies. In some cases, the outlines of our observations confirm the robustness of the resulting homology tree. Nevertheless, the limited number of languages employed in our analysis must be taken into consideration in further, more specific studies in quantitative linguistics. The paper is motivated solely by showing the possibility of visually demonstrating the distinctions and proximities of languages through a statistical approach or, more generally, quantitative linguistics.

