Wednesday, 6 April 2011

Not as easy as it sounds

A while ago, I came up with the idea of a spoken language recogniser. The idea was that it would use a stripped-down speech recognition engine to identify the phonemes in an utterance, and then feed these into an AI system which could identify which of a given set of languages a particular sequence was most likely to come from. You may have noticed that I've been a bit quiet about this recently. I've run into a few snags.
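To make the second stage a bit more concrete, here is a minimal sketch of one way the "which language is this phoneme sequence from" step could work. It is only an illustration, not the design described above: it swaps the unspecified AI system for a simple phoneme bigram model with add-one smoothing, and the function names, language labels and training sequences are all made up for the example.

```python
import math
from collections import Counter

def train_bigram_model(sequences):
    """Count phoneme bigrams over a list of phoneme sequences."""
    bigrams = Counter()
    contexts = Counter()
    vocab = set()
    for seq in sequences:
        # Pad with start/end markers so word-edge patterns count too.
        padded = ["<s>"] + list(seq) + ["</s>"]
        vocab.update(padded)
        for a, b in zip(padded, padded[1:]):
            bigrams[(a, b)] += 1
            contexts[a] += 1
    return bigrams, contexts, vocab

def log_likelihood(model, seq, vocab_size):
    """Add-one smoothed log-probability of a sequence under one model."""
    bigrams, contexts, _ = model
    padded = ["<s>"] + list(seq) + ["</s>"]
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        total += math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
    return total

def identify_language(models, seq):
    """Return the language whose model scores the sequence highest."""
    # Use one shared vocabulary size so the scores are comparable.
    vocab_size = len(set().union(*(m[2] for m in models.values())))
    return max(models, key=lambda lang: log_likelihood(models[lang], seq, vocab_size))

# Toy, invented training data: NOT real phoneme statistics.
models = {
    "lang_a": train_bigram_model([["k", "a", "t"], ["t", "a", "k"]]),
    "lang_b": train_bigram_model([["S", "t", "r"], ["r", "S", "t"]]),
}
guess = identify_language(models, ["k", "a"])
```

With real data, each model would be trained on phoneme sequences produced by the recogniser for known-language recordings, and longer n-grams or a different classifier might well do better.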

The first is that a speech recogniser needs to be trained to recognise all the different sounds it has to identify. I can't just use an off-the-shelf model for this, as there aren't any that are designed for multi-language use. As far as I can tell, nobody else has needed a multi-language speech recognition app before. So I'll have to build my own model. Fortunately, this site has recordings of many sounds from many different languages, and so gives me a good starting point for building a phonetic model.

The second problem is with transcribing all these sounds. The speech recognition engine I'm likely to use, CMU Sphinx, seems to want phonetic transcriptions to be case-insensitive and alphanumeric. I'd prefer to use an X-SAMPA derivative called CXS, but the constraints the speech recogniser places on me won't allow that. Fortunately, sounds within a transcription can be separated by spaces, allowing for multi-character notation, but with the sheer number of sounds the system has to recognise, I'll probably end up with something like htwvitbveuotkvwvahfi, a logical but unusable system I created as a parody of spelling reform proposals.
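For illustration, here is a sketch of one possible workaround: mechanically re-encoding each X-SAMPA-style symbol as a case-insensitive alphanumeric token. The scheme is entirely hypothetical (the L/U/D/X tags and function names are invented here), and it assumes only what is described above, namely that tokens must be alphanumeric, must survive case folding, and are separated by spaces. It says nothing about what CMU Sphinx itself provides.

```python
def encode_symbol(symbol):
    """Encode one ASCII phonetic symbol as an alphanumeric, case-safe token.

    Hypothetical scheme: each character becomes a tag plus payload:
      L + letter    lowercase letter
      U + letter    uppercase letter
      D + digit     digit
      X + two hex   any other ASCII character
    """
    parts = []
    for ch in symbol:
        if not ch.isascii():
            raise ValueError(f"non-ASCII character in symbol: {ch!r}")
        if ch.islower():
            parts.append("L" + ch.upper())
        elif ch.isupper():
            parts.append("U" + ch)
        elif ch.isdigit():
            parts.append("D" + ch)
        else:
            parts.append("X" + format(ord(ch), "02X"))
    return "".join(parts)

def decode_symbol(token):
    """Invert encode_symbol; the token's case does not matter."""
    token = token.upper()
    out = []
    i = 0
    while i < len(token):
        tag = token[i]
        if tag == "L":
            out.append(token[i + 1].lower())
            i += 2
        elif tag in ("U", "D"):
            out.append(token[i + 1])
            i += 2
        elif tag == "X":
            out.append(chr(int(token[i + 1:i + 3], 16)))
            i += 3
        else:
            raise ValueError(f"unknown tag {tag!r} in token {token!r}")
    return "".join(out)

def encode_transcription(transcription):
    """Encode a space-separated transcription symbol by symbol."""
    return " ".join(encode_symbol(s) for s in transcription.split())

encoded = encode_transcription("t_h S s @")
```

The tags keep symbols that differ only in case, such as `S` and `s`, distinct after case folding, and the encoding is reversible. It also shows the problem: the tokens are logical but hardly readable.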