Monday, 21 February 2011

A Fantastical Device for Identifying Languages

If I were to say to you,
hajimemashite. buriikurii piitaa desu. douzo yoroshiku
you probably wouldn't understand what I meant. However, you would know that I wasn't speaking English (it's Japanese, as a matter of fact). Were I to say, however,
'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe.
All mimsy were the borogoves,
And the mome raths outgrabe.

you would be able to recognise what I said as English, even though it doesn't make sense. Each language has its own phonotactics, the rules determining how sounds can be combined to make words. Therefore, you can identify a language by its sound even if you can't understand it.

I've seen a lot of language identification software on the web, but it all works from written text. We conlangers, of course, like to game the system by putting in our conlangs and seeing what happens. I thought it would be fun to try building a system that could identify languages from speech. So, I've started a project on Google Code to try my ideas out. The idea is to use a stripped-down speech recognition engine to extract a stream of phonemes from audio, and then feed those phonemes into an AI system that has been trained to recognise the phonotactics of various languages.
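
To make the second stage a bit more concrete, here is a minimal sketch of one way a system could learn phonotactics from phoneme streams. It is not the project's actual code. It assumes a simple bigram model with add-one smoothing standing in for the "AI system", and it uses made-up toy words in which each letter stands for one phoneme. The names PhonotacticModel, identify, TRAINING, "cv_like" and "cluster_like" are all hypothetical, and the speech recognition front end is left out entirely.

```python
import math
from collections import Counter

BOUNDARY = "#"  # marks the start and end of an utterance


class PhonotacticModel:
    """Bigram model over phoneme symbols (an assumed stand-in, not the project's method)."""

    def __init__(self, sequences):
        self.bigrams = Counter()   # counts of (previous phoneme, next phoneme)
        self.contexts = Counter()  # counts of the previous phoneme alone
        self.vocab = set()
        for seq in sequences:
            padded = [BOUNDARY] + list(seq) + [BOUNDARY]
            self.vocab.update(padded)
            for prev, nxt in zip(padded, padded[1:]):
                self.bigrams[(prev, nxt)] += 1
                self.contexts[prev] += 1

    def log_likelihood(self, seq, vocab_size):
        """Log-probability of seq under this model, with add-one smoothing."""
        padded = [BOUNDARY] + list(seq) + [BOUNDARY]
        total = 0.0
        for prev, nxt in zip(padded, padded[1:]):
            numerator = self.bigrams[(prev, nxt)] + 1
            denominator = self.contexts[prev] + vocab_size
            total += math.log(numerator / denominator)
        return total


def identify(models, seq):
    """Return the name of the model that gives seq the highest likelihood."""
    vocab = {BOUNDARY} | set(seq)
    for model in models.values():
        vocab |= model.vocab
    size = len(vocab)
    return max(models, key=lambda name: models[name].log_likelihood(seq, size))


# Toy training data (invented for illustration): one "language" only allows
# consonant-vowel syllables, the other allows consonant clusters.
TRAINING = {
    "cv_like": ["kata", "taka", "saka", "kasa"],
    "cluster_like": ["stak", "kast", "skat", "task"],
}
models = {name: PhonotacticModel(words) for name, words in TRAINING.items()}

if __name__ == "__main__":
    for word in ["sata", "skast"]:
        print(word, "->", identify(models, word))
```

Neither test word appears in the training data, so the decision rests purely on which sound combinations each toy language allows. A real system would need far more data, proper phoneme symbols from the recogniser, and probably a stronger model than bigrams.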