Tuesday, 13 December 2016
The Common Ground Algorithm - A Possible Remedy for Filter Bubbles
People have a tendency towards confirmation bias: they seek out things that confirm their existing opinions and avoid things that challenge them. On social networks and recommendation systems, this can lead to the development of a filter bubble, in which their sources of information come to be structured around what they already believe. This, of course, acts as an obstacle to healthy discussion between people of differing opinions, and causes their positions to become ever more deeply entrenched and polarised. Instead of seeing those with whom they differ as decent people who have something of value to offer, and who may be persuadable on some of their differences, people start seeing their opponents as the enemy.
To prevent this, people need something that will put them in touch with people with whom they have generally opposing viewpoints. Of course, we can't just confront people with contrary opinions - this will risk provoking hostile reactions. What we need is to show people what they have in common with those whose opinions are different, so that they can build trust and begin to interact in a healthy way.
As an attempt to do this, I present The Common Ground Algorithm.
This uses a combination of topic modelling and sentiment analysis to characterise a user's opinions. It then finds people whose opinions are generally opposed to theirs, and identifies the topics on which they share common ground, recommending posts where they agree on something with people they disagree with in general. I've coded up a reference implementation in Python, and am releasing it under the MIT Licence to encourage its use and further development.
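To give a feel for the idea (this is an illustrative sketch, not the reference implementation), suppose the topic modelling and sentiment analysis stages have already reduced each user to a vector of per-topic sentiment scores in [-1, 1]. Opposition and common ground then fall out of some simple vector arithmetic:

```python
"""Sketch of the Common Ground idea: users as topic-sentiment vectors."""
import numpy as np

def opposition(u, v):
    """Cosine similarity of two users' topic-sentiment vectors (negative = opposed)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def common_ground(u, v, threshold=0.2):
    """Topics on which both users lean the same way, strongly enough to matter."""
    return [i for i in range(len(u))
            if np.sign(u[i]) == np.sign(v[i])
            and min(abs(u[i]), abs(v[i])) >= threshold]

# Two users who disagree overall, but share ground on topic 2
alice = np.array([0.9, -0.7, 0.6, -0.8])
bob = np.array([-0.8, 0.6, 0.5, 0.7])
if opposition(alice, bob) < 0:
    print("Generally opposed; shared topics:", common_ground(alice, bob))
```

The real implementation has to infer those vectors from users' posts, which is where the topic modelling and sentiment analysis do the heavy lifting.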
Thursday, 22 October 2015
Integrating Java with Python the Easy Way
I have an idea for something I want to build, which will involve a speech recognition component written in Java, and a Hidden Markov Model written in Python. So that means I have to integrate components written in two different languages. What's the best way of doing it?
One way would be to run Python on the JVM. There is a Python implementation for the JVM, Jython, but from what I've heard it's painfully slow. Since I'm aiming for something as close to real time as possible, it's unlikely to meet my needs.
It did occur to me that there could be a faster way to run Python on the JVM. Pypy is a self-hosting, JIT-compiled implementation of Python, which is much faster than the reference implementation. If its code generation phase were modified to emit Java bytecode, then Pypy could run on the JVM. This approach, which I call Jypy, would be a worthwhile project for somebody who knows Java bytecode. Unfortunately, I'm not that person.
However, I then thought about the architecture of my project. I'd already realised that it would have to be organised as a number of concurrent processes, communicating via pipes. I then realised that meant that I didn't need to run Python on the JVM at all. The Java and Python components could each run in their own processes, and didn't need to share any resources. The only integration I needed was pipes.
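To make that concrete, here's a minimal sketch of what the Python side of such a pipeline might look like — a worker process that reads requests line by line on stdin and writes results to stdout. The Java side would launch it with something like `new ProcessBuilder("python3", "worker.py")` and talk to it over the standard streams. The details are illustrative, not the actual project code:

```python
#!/usr/bin/env python3
"""Toy pipe worker: reads one request per line on stdin, writes one result
per line on stdout. Stands in for the Python (HMM) side of the pipeline."""
import sys

def handle(request):
    # Placeholder for the real work, e.g. an HMM decoding step.
    return request.strip().upper()

for line in sys.stdin:
    print(handle(line), flush=True)  # flush so the Java side isn't left waiting
```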
You know the sense of delight when you realise that something complicated is actually simple? That's how I felt when I worked that out.
Tuesday, 9 June 2015
Emily Has Moved
As those of you who've tried out my semantic recommendation system, Emily, will have noticed, it didn't work. The reason was that I'd used the wrong cloud platform. Google App Engine isn't meant for anything that needs as much computation as Emily does, so I've ported Emily to OpenShift. This has the advantage of giving me much more control over how I write the code, and I can use things like MongoDB and multiprocessing.
Let's try this again!
Monday, 24 June 2013
A Couple of my Fantastical Devices
With the recent news about the Voynich Manuscript, as mentioned in my last post, I thought it opportune to share a couple of pieces of code I'd written. First off, as I mentioned earlier, a couple of years ago I wrote a Python implementation of Montemurro and Zanette's algorithm for calculating the entropy of words in documents. If you're interested in using the technique yourself, you may want to have a look.
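For anyone who wants the flavour of the technique without reading the code, here's a rough sketch of the underlying idea (not my library's actual API): a word is informative to the extent that it clusters in particular parts of the document, which shows up as lower entropy than in a shuffled text:

```python
"""Rough sketch of the Montemurro-Zanette idea: score a word by the entropy
of its distribution over equal parts of the text, compared with the same
text shuffled. Informative words cluster, so their observed entropy is lower."""
import math
import random

def part_entropy(words, target, parts):
    size = len(words) // parts
    counts = [words[i * size:(i + 1) * size].count(target) for i in range(parts)]
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def mz_score(words, target, parts=8, trials=10):
    """Mean shuffled entropy minus observed entropy; higher = more informative."""
    shuffled_h = 0.0
    for _ in range(trials):
        shuffled = words[:]
        random.shuffle(shuffled)
        shuffled_h += part_entropy(shuffled, target, parts)
    return shuffled_h / trials - part_entropy(words, target, parts)

text = ("the grail quest " * 50 + "the court feast " * 50).split()
print(mz_score(text, "grail", parts=4))  # clusters in the first half: high score
```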
Secondly, my own attempts to uncover the syntax use a Python library for Hidden Markov Models that I created. It probably still has a few bugs in it, but it's attracted a bit of interest online, and I'm hoping to develop it further.
So, if you're at all interested in AI, computational linguistics, or analytics, please have a look at these. Feedback is welcome, as is anybody who wishes to contribute further to these projects.
Sunday, 17 February 2013
Literally Vikings
One thing that quite a lot of people get annoyed about is using "literally" to mean "figuratively". Fortunately, these Vikings know how to use the word correctly.
Watch "Horrible Histories - Literally: The Viking Song" on YouTube
(Of course, they're not literally Vikings. They're really actors playing Vikings in Horrible Histories, but you knew that, didn't you?)
I'm off to York for a few days, and it just happens that the Viking Festival is on, so I might post some more Viking-related stuff.
Tuesday, 6 September 2011
Quantum Thinking and Function Words
Two articles in New Scientist caught my eye this week - one about a quantum-mechanics-like system of logic, and the other about function words.
According to the first article, many of our thought processes don't follow the rules of classical logic, but a system of inference that can be described in terms of Hilbert space, a vector space with an arbitrary number of dimensions. Quantum mechanics uses Hilbert spaces to describe the states of quantum systems, and the mathematics of Hilbert space allows quantum states to interact in counter-intuitive ways. The same logic apparently allows human minds to combine ideas in ways that don't necessarily follow the rules of classical logic, but do allow greater flexibility. To quote from the article:

> If you want to research a topic such as the "story of rock" with geophysics and rock formation in mind, you don't want a search engine to give you millions of pages on rock music. One approach would be to include "-songs" in your search terms in order to remove any pages that mention "songs". This is called negation and is based on classical logic. While it would be an improvement, you would still find lots of pages about rock music that just don't happen to mention the word songs. Widdows has found that a negation based on quantum logic works much better. Interpreting "not" in the quantum sense means taking "songs" as an arrow in a multidimensional Hilbert space called semantic space, where words with the same meaning are grouped together. Negation means removing from the search pages that share any component in common with this vector, which would include pages with words like music, guitar, Hendrix and so on. As a result, the search becomes much more specific to what the user wants.

Obviously, if you're interested in Artificial Intelligence, where a key aim is to enable computers to emulate the flexibility of human thought, this is a useful approach.

The second article, by James W. Pennebaker, concerns his work on the importance of function words. These are things like pronouns, conjunctions and prepositions - the words that don't seem to mean very much, but act like glue holding the sentence together. Professor Pennebaker has discovered that there's a lot of psychological information hidden in these apparently insignificant words - for example, in a conversation between two people, the more socially dominant one will tend to use the word "I" less than the other one. Most natural language processing software treats words like this as stop words and ignores them, but for some applications (e.g. sentiment analysis, social network analytics) it could be just the data you need.
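Going back to Widdows' quantum-style negation: the operation the article describes is just projection onto the subspace orthogonal to the unwanted term's vector. A minimal sketch, with made-up vectors in a toy three-dimensional semantic space:

```python
"""Vector ('quantum') negation: "rock NOT songs" becomes the component of
the query vector orthogonal to the "songs" vector, so anything semantically
close to songs is suppressed. The vectors here are invented for illustration."""
import numpy as np

def negate(query, unwanted):
    """Project query onto the subspace orthogonal to the unwanted sense."""
    u = unwanted / np.linalg.norm(unwanted)
    return query - np.dot(query, u) * u

# Toy 3-d semantic space: axes roughly (geology, music, other)
rock = np.array([0.7, 0.7, 0.1])
songs = np.array([0.0, 1.0, 0.0])
print(negate(rock, songs))  # -> [0.7, 0.0, 0.1]: the musical component is gone
```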
Friday, 20 May 2011
Robots invent a language
IEEE Spectrum describes an interesting experiment in artificial intelligence and linguistics. Two robots, equipped with microphones and loudspeakers to talk to each other, managed to create a set of words useful for navigating their environment.
I think it would be interesting to extend this experiment to see if it could give insight into how language evolves. Starting with a larger population of robots, you could give them time to make up a language, and then start deleting the memories of individual robots at intervals. In real life, languages have to be continually relearned by successive generations of speakers, and this is probably part of the reason why they undergo changes. It would be possible to vary the size of the population and the rate of deletions to see what influence these might have, and also to add varying amounts of background noise.
Mind you, to give a real insight into the development of human language, you might want to give the robots more complex tasks to do than simply finding their way around, so that they would have to invent a grammar to express their meaning. Then you would be seeing how language might develop in a mind fundamentally unlike a human's. There's been considerable debate amongst linguists about how many of the constraints on human languages are hard-wired into the human brain, how many are simply a result of circumstance, and what can evolve from what already exists.
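Here's a toy version of the proposed experiment — a naming game over a small set of objects, with one robot's memory wiped at intervals to stand in for generational turnover. Population size, deletion schedule and noise level are all illustrative stand-ins:

```python
"""Toy naming game with periodic memory wipes."""
import random

OBJECTS = range(5)

def new_agent():
    return {}  # object -> preferred word

def play_round(agents, noise=0.0):
    speaker, hearer = random.sample(agents, 2)
    obj = random.choice(OBJECTS)
    word = speaker.setdefault(obj, f"w{random.randrange(10_000)}")
    if random.random() < noise:
        return  # transmission garbled; the hearer learns nothing
    hearer[obj] = word  # naive alignment: adopt the speaker's word

def consensus(agents):
    """Fraction of objects on which every agent uses the same word."""
    agreed = 0
    for obj in OBJECTS:
        words = {agent.get(obj) for agent in agents}
        if len(words) == 1 and None not in words:
            agreed += 1
    return agreed / len(OBJECTS)

agents = [new_agent() for _ in range(10)]
for step in range(1, 5001):
    play_round(agents, noise=0.05)
    if step % 500 == 0:
        agents[random.randrange(len(agents))].clear()  # "death" and relearning
print("consensus:", consensus(agents))
```

Varying the population size, deletion schedule and noise level and watching how consensus holds up would be the crude analogue of the experiment described above.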
Wednesday, 6 April 2011
Not as easy as it sounds
A while ago, I came up with the idea of a spoken language recogniser. The idea was that it would use a stripped down speech recognition engine to identify the phonemes in an utterance, and then feed these into an AI system which could identify which of a given set of languages a particular sequence was most likely to come from. You may have noticed that I've been a bit quiet about this recently. I've run into a few snags.
The first is that a speech recogniser needs to be trained to recognise all the different sounds it has to identify. I can't just use an off-the-shelf model for this, as there aren't any that are designed for multi-language use. As far as I can tell, nobody else has needed a multi-language speech recognition app before. So, I'll have to build my own model. Fortunately this site has recordings of many sounds from many different languages, and so gives me a good starting point for building a phonetic model.
The second problem is with transcribing all these sounds. The speech recognition engine I'm likely to use, CMU Sphinx, seems to want phonetic transcriptions to be case insensitive and alphanumeric. I'd prefer to use an X-SAMPA derivative called CXS, but the constraints the speech recogniser places on me won't allow that. Fortunately, sounds within a transcription can be separated by spaces, allowing for multicharacter notation, but with the sheer number of sounds the system has to recognise, I'll probably end up with something like htwvitbveuotkvwvahfi, a logical but unusable system I created as a parody of spelling reform proposals.
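Whatever notation I end up with, the shape of the solution is a lookup table from CXS symbols to Sphinx-safe phone names, with spaces between phones in a transcription. The names on the right here are invented placeholders, not a real phone set:

```python
"""Sketch of the transcription mapping problem: CXS symbols on the left,
invented case-insensitive, alphanumeric phone names on the right."""
CXS_TO_SPHINX = {
    "p_h": "PASP",  # aspirated p
    "t_h": "TASP",  # aspirated t
    "S": "SH",      # voiceless postalveolar fricative
    "s`": "SRET",   # voiceless retroflex fricative
    "@": "AX",      # schwa
}

def transcribe(cxs_phones):
    """Turn a list of CXS symbols into a space-separated transcription."""
    return " ".join(CXS_TO_SPHINX[p] for p in cxs_phones)

print(transcribe(["p_h", "@", "S"]))  # -> "PASP AX SH"
```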
Monday, 28 March 2011
Thirtieth Post Wordle
Yes, it's the thirtieth post, so that means it's Wordle time. It's interesting to see how themes have developed over the last 30 posts or so - at the moment, language- and conlang-related stuff is dominating, 10 posts ago it was AI-related stuff, and earlier on my ideas for an incorporating Romlang were most significant.
Wednesday, 23 February 2011
An Experiment in English Phonology
We normally think of English as having voiceless stops, /p t k/ and voiced stops, /b d g/. Voiceless stops have an aspirated allophone [p_h t_h k_h] that appears in certain places, particularly when the stop appears at the beginning of a syllable. Aspiration, in this model of English pronunciation, is a redundant secondary feature of voiceless stops.
However, on the Conlang Mailing List, And Rosta proposed an alternative model. His idea is that English actually has aspirated stops /p_h t_h k_h/ and unaspirated stops /b d g/. Voicing would be a redundant secondary feature of unaspirated stops - the [t] in "stop" would actually be a /d/ that's lost its voice. And put forward a number of interesting theoretical arguments for this, but I thought that it needed to be tested experimentally.
If the standard interpretation of English phonology is correct, English speakers should find voiced and voiceless stops easier to tell apart than aspirated and unaspirated stops. In And's interpretation, they should find aspirated and unaspirated stops easier to tell apart than voiced and voiceless.
Bengali distinguishes between plain (voiceless, unaspirated), aspirated, voiced and voiced aspirated (breathy voiced or murmured) stops. I asked a Bengali speaker to record a sample of 20 words, which you can listen to here, and transcribed them in CXS. I then asked three volunteers from the Conlang Mailing List, all of whom were monoglot English speakers, to listen to the recording and transcribe their first impression of what they heard. Here are the results.
Original | Listener 1 | Listener 2 | Listener 3 |
---|---|---|---|
kat`_h | dap | tat | dat_h |
k_habar | kabar | kabar | k_habal |
gajok | dajuk | gAijok | d_haIVlk_h |
g_tOt`ona | @_^k_hot7na | xor\tunA | k_h7dn@ |
tSabi | cabi | tS)Abi | tzabin |
d`_tol | d_<tol | tol | poUil |
p_hul | pUl | pul | pul |
pani | bani | bAni | bani |
d`im | bim | bim | din |
tara | d`al`a | tAr\A | da4a |
t_haka | taka | taka | taka |
dip | dip | dip | dip |
bat`i | badi | bA'ti | badi |
aSar`_h | aSar\ | as`Ar\ | aSal |
roSun | 4oSun | r\oSun | roUSVn |
rOkto | h4OktO | rokto | rakt7N |
sriSt`i | s4iSti | Sr\iS.ti | SriSdi |
ha~S | haS: | hA:S | paS |
tS_hata | tSata | tS)AtA | tSada |
dZOl | dZOl | dZ)Al | dZVl |
For each stop and affricate in the sample, I then recorded which of the four categories it fell into, and how the volunteers identified it. The results are as follows.
Actual ↓ / Heard as → | Plain | Aspirated | Voiced | Voiced aspirated | Other |
---|---|---|---|---|---|
Plain | 27 | 1 | 12 | 0 | 0 |
Aspirated | 13 | 2 | 0 | 0 | 0 |
Voiced | 0 | 0 | 20 | 1 | 0 |
Voiced aspirated | 3 | 2 | 0 | 0 | 1 |
From this we can see that English speakers correctly identify plain stops as voiceless 70% of the time. They almost always identify voiceless stops as plain, whether or not they are aspirated. Aspirated voiceless stops are always identified as voiceless. Voiced stops are almost always correctly identified, and voiced aspirated stops, which are alien to English, are never correctly identified.
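For anyone who wants to check the arithmetic, the percentages fall straight out of the confusion matrix above:

```python
"""The percentages above, recomputed from the confusion matrix. 'Heard as
voiceless' lumps together responses of plain and aspirated."""
matrix = {
    "plain":            {"plain": 27, "aspirated": 1, "voiced": 12, "voiced_asp": 0, "other": 0},
    "aspirated":        {"plain": 13, "aspirated": 2, "voiced": 0, "voiced_asp": 0, "other": 0},
    "voiced":           {"plain": 0, "aspirated": 0, "voiced": 20, "voiced_asp": 1, "other": 0},
    "voiced_asp":       {"plain": 3, "aspirated": 2, "voiced": 0, "voiced_asp": 0, "other": 1},
}
for actual, heard in matrix.items():
    total = sum(heard.values())
    voiceless = heard["plain"] + heard["aspirated"]
    print(f"{actual}: heard as voiceless {voiceless / total:.0%} of the time")
```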
These results are more consistent with voicing being the primary feature than aspiration. Sorry, And.
Monday, 21 February 2011
A Fantastical Device for Identifying Languages
If I were to say to you,

hajimemashite. buriikurii piitaa desu. douzo yoroshiku

you probably wouldn't understand what I meant. However, you would know that I wasn't speaking English (it's Japanese, as a matter of fact). Were I to say, however,
'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe.
All mimsy were the borogoves,
And the mome raths outgrabe.
you would be able to recognise what I said as English, even though it doesn't make sense. Each language has its own phonotactics, the rules determining how sounds can be combined to make words. Therefore, you can identify a language by its sound even if you can't understand it.
I've seen a lot of language identification software on the web, but it all works from written text. We conlangers, of course, like to game the system by putting in our conlangs and seeing what happens. I thought it would be fun to try building a system that could identify languages from speech. So, I've started off a project on Google Code to try my ideas out. The idea is to use a stripped-down speech recognition engine to extract a stream of phonemes from audio, and then feed those phonemes into an AI system that has been trained to recognise the phonotactics of various languages.
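As a sketch of the classifier stage (assuming the speech engine has already produced a phoneme string), per-language trigram models would be one simple way to capture phonotactics. The training data here is a toy stand-in, using orthography rather than real phoneme strings:

```python
"""Toy phonotactic language ID: score a string against per-language trigram
models and pick the best-scoring language."""
import math
from collections import Counter

def trigram_model(samples):
    """Relative trigram frequencies over a language's training strings."""
    counts = Counter()
    for s in samples:
        padded = f"##{s}#"
        counts.update(padded[i:i + 3] for i in range(len(padded) - 2))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def log_score(model, s, floor=1e-6):
    """Log-likelihood of a string under a trigram model, with smoothing."""
    padded = f"##{s}#"
    return sum(math.log(model.get(padded[i:i + 3], floor))
               for i in range(len(padded) - 2))

models = {
    "english": trigram_model(["strengths", "through", "splendid", "thwart"]),
    "japanese": trigram_model(["hajimemashite", "yoroshiku", "arigatou", "onegai"]),
}
utterance = "watashitachi"
print(max(models, key=lambda lang: log_score(models[lang], utterance)))
```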
Wednesday, 1 December 2010
How do you identify what a word means?
Seen on del.icio.us
Artificial Intelligence has been described as "Trying to get a computer to do anything that it's easier for a human to do." Trying to disambiguate the various possible senses of a word is a classic example of this - having heard a lecture from somebody who works in this field, I know that it's very difficult to create an algorithm that improves on the naive approach of just picking the most common meaning every time. 92% accuracy looks pretty impressive.
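For reference, here's what that naive baseline looks like in practice — a minimal sketch using NLTK's WordNet interface (assuming nltk is installed and the wordnet corpus downloaded), which lists a word's senses roughly in order of frequency:

```python
"""The naive word-sense baseline: always pick the most common sense."""
from nltk.corpus import wordnet as wn

def most_frequent_sense(word, pos=None):
    """WordNet lists senses roughly by frequency, so take the first one."""
    senses = wn.synsets(word, pos=pos)
    return senses[0].definition() if senses else None

print(most_frequent_sense("bank", pos=wn.NOUN))
# -> 'sloping land (especially the slope beside a body of water)'
```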
Tuesday, 30 November 2010
Exquisite Corpse and other stuff
I've started up a game of Conlang Exquisite Corpse amongst members of the conlang mailing list. What's Exquisite Corpse? It's a surrealist writing technique, where a group of people take turns to contribute to a text, each having limited knowledge of what has gone before. In this case, I've sent a sentence in Khangaþyagon to somebody, who has to translate it and write a follow-on sentence in his own conlang. He then passes his sentence (not mine) on to the next person in the chain. By the time it gets back to me, I expect the story to be about something completely different from what I started it off as.
I've also been running a little experiment with some volunteers from the conlang mailing list to see how native English speakers perceive the sounds of an unfamiliar language. I'll post the results here in a couple of days.
I've also been thinking about my Incorporating Romlang. I think I might introduce a vowel harmony system, and then break it on purpose.
Friday, 26 November 2010
Tenth post Wordle
To celebrate my tenth post, I've created a Wordle of my previous posts. Looks like the incorporating Romlang is dominating things at the moment.
I think I'll do this about every 10 posts or so.
Saturday, 13 November 2010
Can anybody read my birdbath?
Last year, my son found a little ceramic dish half-buried in the garden. We've been using it as a birdbath ever since.
What's interesting about this little dish is that it has two large Chinese characters on the inner surface. I did Japanese classes a few years ago, so I know a few basic Kanji, but these ones aren't familiar to me. Here's a photo of the birdbath, taken just after it was excavated, showing the characters.
[Photo of the birdbath. Caption: "Can anyone tell me what this means?"]
If anyone knows what this means, please leave a comment. I hope it's not something along the lines of the tattoo, allegedly popular with American servicemen stationed in Japan after the Second World War, that read "I'm too stupid to ask what this means."
Updated: I've uploaded a clearer picture of the birdbath, without the shadow that obscured the second character in the original image.