Tuesday, 11 January 2011

Creating a language with clustering algorithms

I recently posted this to the Conlang Mailing List, and I thought it was worth putting up here too.

Define a phonology for a language. Make sure you know what all its
distinctive features are.

Generate a very large set of wordforms for the language using an
automatic vocabulary generator.
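
As a rough idea of what such a generator might look like, here is a
minimal sketch in Python. The inventory, the strict CV syllable shape and
the name generate_wordforms are all assumptions made up for illustration;
a real generator would follow whatever phonotactics the language has.

```python
import random

# Hypothetical inventory, chosen only for illustration.
CONSONANTS = ["p", "t", "k", "m", "n", "s"]
VOWELS = ["a", "i", "u"]


def generate_wordforms(count, max_syllables=3, seed=0):
    """Return `count` distinct wordforms built from random CV syllables."""
    syllables = len(CONSONANTS) * len(VOWELS)
    possible = sum(syllables ** n for n in range(1, max_syllables + 1))
    if count > possible:
        raise ValueError("inventory too small for that many distinct wordforms")

    rng = random.Random(seed)  # seeded so that runs are repeatable
    words = set()
    while len(words) < count:
        length = rng.randint(1, max_syllables)
        word = "".join(
            rng.choice(CONSONANTS) + rng.choice(VOWELS) for _ in range(length)
        )
        words.add(word)
    return sorted(words)


if __name__ == "__main__":
    print(generate_wordforms(10))
```

Collecting the words in a set keeps the vocabulary free of duplicates,
which matters later because a duplicate would always be its own nearest
neighbour at distance zero.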

Calculate the difference between each pair of wordforms in the
vocabulary, using a modified version of the Levenshtein distance, where
the cost of an insertion or deletion is the total number of distinctive
features in the language's phonology, and the cost of a substitution is
the number of features that differ between the substituted phonemes.
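
The distance described above could be sketched along the following lines.
The feature set and the feature values in PHONEMES are invented for the
example, and the code assumes one character per phoneme; only the cost
scheme (insertions and deletions cost the full feature count,
substitutions cost the number of differing features) comes from the
description.

```python
# Hypothetical binary feature system, for illustration only.
FEATURES = ("voiced", "nasal", "labial", "continuant", "syllabic", "high")
PHONEMES = {
    "p": (0, 0, 1, 0, 0, 0),
    "b": (1, 0, 1, 0, 0, 0),
    "m": (1, 1, 1, 0, 0, 0),
    "t": (0, 0, 0, 0, 0, 0),
    "d": (1, 0, 0, 0, 0, 0),
    "n": (1, 1, 0, 0, 0, 0),
    "s": (0, 0, 0, 1, 0, 0),
    "a": (1, 0, 0, 1, 1, 0),
    "i": (1, 0, 0, 1, 1, 1),
}

# An insertion or deletion costs as much as changing every feature.
INDEL_COST = len(FEATURES)


def substitution_cost(a, b):
    """Number of features on which phonemes a and b differ."""
    return sum(x != y for x, y in zip(PHONEMES[a], PHONEMES[b]))


def feature_distance(w1, w2):
    """Levenshtein distance with feature-based costs (one char per phoneme)."""
    prev = [j * INDEL_COST for j in range(len(w2) + 1)]
    for i, a in enumerate(w1, start=1):
        curr = [i * INDEL_COST]
        for j, b in enumerate(w2, start=1):
            curr.append(min(
                prev[j] + INDEL_COST,                    # delete a
                curr[j - 1] + INDEL_COST,                # insert b
                prev[j - 1] + substitution_cost(a, b),   # substitute a -> b
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(feature_distance("pat", "bad"))  # two voicing changes
    print(feature_distance("pa", "p"))     # one deletion
```

With these made-up features, "pat" and "bad" differ only in voicing on
two segments, so they come out much closer than "pa" and "p", where a
whole segment is missing.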

Cluster the wordforms so that each wordform belongs to the same cluster
as its nearest neighbour.
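
One way to read this rule is as finding the connected components of the
nearest-neighbour graph. The sketch below does that with a small
union-find structure. The function names are my own, ties between equally
near neighbours are broken by list order, and toy_distance is only a
stand-in so that the example runs on its own; the feature-based distance
described earlier would be passed in its place.

```python
def toy_distance(a, b):
    """Stand-in metric: mismatched positions plus the length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def cluster_by_nearest_neighbour(words, distance):
    """Group words so each shares a cluster with its nearest neighbour."""
    parent = list(range(len(words)))

    def find(i):
        # Follow parent links to the cluster representative,
        # shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, word in enumerate(words):
        others = [j for j in range(len(words)) if j != i]
        if not others:
            continue
        # On a tie, min() keeps the earliest candidate in the list.
        nearest = min(others, key=lambda j: distance(word, words[j]))
        parent[find(i)] = find(nearest)  # merge the two clusters

    clusters = {}
    for i, word in enumerate(words):
        clusters.setdefault(find(i), []).append(word)
    return sorted(sorted(group) for group in clusters.values())


if __name__ == "__main__":
    vocabulary = ["pata", "pati", "kata", "sumu", "sumi", "numu"]
    for group in cluster_by_nearest_neighbour(vocabulary, toy_distance):
        print(group)
```

This compares every pair of wordforms, so the running time grows with the
square of the vocabulary size. That is fine for a few thousand words, but
a very large vocabulary would probably need something smarter.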

Explore the clusters, assigning related meanings to related wordforms.
Make notes of how changes of form relate to changes of meaning, so that
they can be reapplied later. If the software is clever enough, once
you've annotated a process that applies to two wordforms, it can search
the dataset for other pairs of wordforms where the same process may be
occurring.
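
The search for parallel pairs might start out as crudely as the sketch
below. It assumes a "process" is just the ordered list of segment
substitutions between two wordforms of equal length, ignoring position
and context, which is a big simplification. The function names and the
example words are invented for illustration.

```python
from itertools import permutations


def process(source, target):
    """Ordered (from, to) substitutions turning source into target.

    Returns None when the words are identical or differ in length,
    since this sketch only handles substitution-only changes.
    """
    if len(source) != len(target) or source == target:
        return None
    return tuple((a, b) for a, b in zip(source, target) if a != b)


def find_parallel_pairs(words, annotated_pair):
    """Find other ordered pairs of words related by the same process."""
    wanted = process(*annotated_pair)
    if wanted is None:
        return []
    return sorted(
        (s, t)
        for s, t in permutations(words, 2)
        if (s, t) != tuple(annotated_pair) and process(s, t) == wanted
    )


if __name__ == "__main__":
    vocabulary = ["kitab", "kutub", "sifar", "sufur", "kitub", "matan"]
    # Suppose the pair kitab -> kutub has been annotated with a meaning change.
    print(find_parallel_pairs(vocabulary, ("kitab", "kutub")))
```

Since the match only looks at which segments changed into which, it will
happily propose pairs that a human would reject, so the results are best
treated as candidates to review rather than as established derivations.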

This could be a good way of generating non-concatenative morphologies,
and simulating the effects of analogy on language development.

What do people think? Anyone like to have a go at implementing it?