The Fantastical Devices of Pete the Mad Scientist: programming


Tuesday, 13 December 2016

The Common Ground Algorithm - A Possible Remedy for Filter Bubbles

People have a tendency towards Confirmation Bias, whereby they seek out things that confirm their existing opinions and avoid things that challenge them. On social networks and recommendation systems, this can lead to the development of a filter bubble, whereby their sources of information come to be structured around what they already believe. This, of course, acts as an obstacle to healthy discussion between people of differing opinions, and causes their positions to become ever more deeply entrenched and polarised. Instead of seeing those with whom they differ as being decent people who have something of value to offer them, and who may be persuadable on some of their differences, people start seeing their opponents as the enemy.

To prevent this, people need something that will put them in touch with people with whom they have generally opposing viewpoints. Of course, we can't just confront people with contrary opinions - this will risk provoking hostile reactions. What we need is to show people what they have in common with those whose opinions are different, so that they can build trust and begin to interact in a healthy way.

As an attempt to do this, I present The Common Ground Algorithm. This uses a combination of topic modelling and sentiment analysis to characterise a user's opinions. It then finds people whose opinions are generally opposed to theirs, and identifies the topics on which they share common ground, recommending posts where they agree on something with people they disagree with in general. I've coded up a reference implementation in Python, and am releasing it under the MIT Licence to encourage its use and further development.
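
To make the idea concrete, here is a toy sketch. It is not the reference implementation. It assumes each user's opinions have already been reduced to one sentiment score between -1 and 1 per topic (the job that topic modelling and sentiment analysis would do), and the function names, the scoring rule and the sample users are all made up for illustration.

```python
def overall_agreement(a, b):
    """Mean product of sentiment scores over shared topics.

    Negative means the two users are generally opposed. This scoring rule
    is an assumption made for the sketch.
    """
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0
    return sum(a[t] * b[t] for t in shared) / len(shared)


def common_ground(a, b):
    """Topics on which two generally opposed users feel the same way."""
    if overall_agreement(a, b) >= 0:
        return []  # not opposed overall, so nothing to bridge
    return sorted(t for t in a.keys() & b.keys() if a[t] * b[t] > 0)


# Hypothetical users: topic -> sentiment score
alice = {"tax": 0.8, "immigration": 0.9, "cycling": 0.6}
bob = {"tax": -0.9, "immigration": -0.7, "cycling": 0.5}

print(common_ground(alice, bob))
```

Posts by Bob about the topics returned here would be the candidates to recommend to Alice, and vice versa.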

Saturday, 27 February 2016

FizzBuzz

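def FizzBuzz():
    # Print the numbers 1 to 100, with Fizz for multiples of 3,
    # Buzz for multiples of 5 and FizzBuzz for multiples of both
    for i in range(1, 101):
        # join() takes a single iterable, so both parts go in one tuple
        word = ''.join(('Fizz' if i % 3 == 0 else '',
                        'Buzz' if i % 5 == 0 else ''))
        print(i if word == '' else word)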

Thursday, 22 October 2015

Integrating Java with Python the Easy Way

I have an idea for something I want to build, which will involve a speech recognition component, written in Java, and a Hidden Markov Model, written in Python. So that means I have to integrate components written in two different languages. What's the best way of doing it?

One way would be to run Python on the JVM. There is a Python implementation for the JVM, Jython, but from what I've heard it's painfully slow. Since I'm aiming for something as close to real time as possible, it's unlikely to meet my needs.

It did occur to me that there could be a faster way to run Python on the JVM. PyPy is a self-hosting, JIT-compiled implementation of Python, which is much faster than the reference implementation. If its code generation phase were modified to emit Java bytecode, then PyPy could run on the JVM. This approach, which I call Jypy, would be a worthwhile project for somebody who knows Java bytecode. Unfortunately, I'm not that person.

However, I then thought about the architecture of my project. I'd already realised that it would have to be organised as a number of concurrent processes, communicating via pipes. I then realised that meant that I didn't need to run Python on the JVM at all. The Java and Python components could each run in their own processes, and didn't need to share any resources. The only integration I needed was pipes. You know the sense of delight when you realise that something complicated is actually simple? That's how I felt when I worked that out.
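
As a rough sketch of what the Python end of such a pipe could look like: the code below starts another program as a separate process and reads what it writes to standard output, one line at a time. The producer here is a small Python one-liner standing in for the Java recogniser, and the names `PRODUCER_CODE`, `read_tokens` and `demo` are invented for the example.

```python
import subprocess
import sys

# Hypothetical stand-in for the Java speech recogniser: a child process
# that writes one "recognised" word per line to its standard output.
PRODUCER_CODE = "for w in ('hello', 'world'):\n    print(w, flush=True)"


def read_tokens(command):
    """Run `command` in its own process and yield each line from its stdout pipe."""
    with subprocess.Popen(command, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            yield line.rstrip("\n")


def demo():
    # For a real Java component the command would launch the JVM instead
    # (for instance something like ["java", "-jar", ...], not shown here).
    return list(read_tokens([sys.executable, "-c", PRODUCER_CODE]))


if __name__ == "__main__":
    for token in demo():
        print(token)
```

Neither side needs to know what language the other is written in. All they have to agree on is the line format that goes down the pipe.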

Tuesday, 9 June 2015

Emily Has Moved

As those of you who've tried out my semantic recommendation system, Emily, will have noticed, it didn't work. The reason was, I'd used the wrong cloud platform. Google App Engine isn't meant for anything that needs as much computation as Emily does, so I've ported Emily to OpenShift. This has the advantage that it gives me much more control of how I write the code, and I can use things like MongoDB and multiprocessing. Let's try this again!

Tuesday, 26 May 2015

Introducing Emily - my latest Fantastical Device

Emily is a semantic recommendation system for blogs that I've been working on. If you give it an Atom or RSS feed from a blog, it will create a feed of items from other blogs that hopefully match your interests.

It does this by using significant associations between words to infer your interests. Suppose a randomly-chosen sentence from your blog has a probability P(A) of containing word A, and a probability P(B) of containing word B. If there were no relationship between the words, we would expect the probability of a sentence containing both words to be P(AB) = P(A)P(B). If there is significant information contained in the relationship between the words, they will co-occur more frequently than this, and we can quantify this with an entropy, H = log2 P(AB) - log2 P(A) - log2 P(B).
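
The formula above can be tried out on a handful of sentences. The sketch below estimates the probabilities by simple counting and assumes whitespace tokenisation and lower-casing, which is almost certainly cruder than anything a real system would do. The function name `association_strengths` and the sample sentences are invented for the example. The quantity H is what is usually called pointwise mutual information.

```python
import math
from collections import Counter
from itertools import combinations


def association_strengths(sentences):
    """Return {(word_a, word_b): H} for every pair of words sharing a sentence."""
    n = len(sentences)
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        # Count each word once per sentence; sort so pairs have a stable order
        words = sorted(set(sentence.lower().split()))
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))

    strengths = {}
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        p_a = word_counts[a] / n
        p_b = word_counts[b] / n
        # H = log2 P(AB) - log2 P(A) - log2 P(B)
        strengths[(a, b)] = math.log2(p_ab) - math.log2(p_a) - math.log2(p_b)
    return strengths


sentences = ["red apple", "red apple", "green pear", "green pear"]
strengths = association_strengths(sentences)
```

With these four sentences, "red" and "apple" each appear in half of them and always together, so P(AB) = 0.5 while P(A)P(B) = 0.25, giving H = 1 bit for that pair.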

Emily uses the strengths of these associations to calculate the similarity between two blogs. Then, if you post an article that makes your blog more similar to somebody else's blog than it was before, that article is recommended to them.
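
For illustration only, here is one way a rule like that could be written down. It assumes each blog is summarised as a dictionary of association strengths, and it uses cosine similarity as the measure. That choice is an assumption made for this sketch, not necessarily what Emily does, and the names `cosine_similarity` and `should_recommend` are made up.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def should_recommend(blog_before, blog_after, other_blog):
    """True if the new article moved the author's blog closer to other_blog."""
    return (cosine_similarity(blog_after, other_blog)
            > cosine_similarity(blog_before, other_blog))


# Hypothetical association strengths, keyed by word pair
before = {("apple", "red"): 1.0}
after = {("apple", "red"): 1.0, ("green", "pear"): 1.0}
reader = {("green", "pear"): 2.0}

print(should_recommend(before, after, reader))
```

Here the new article introduces an association the reader's blog also has, so the similarity rises from zero and the article would be recommended.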

This has been an interesting project for me. I've learned about Google App Engine, PubSubHubbub and Atom. What I need now is for people to try it out. I'm looking forward to when Emily starts finding things for me.

Tuesday, 17 June 2014

NoSQL for Conlangers

In his blog, fellow conlanger Wm Annis writes that the best database format for dictionaries is text.

All his points are valid, but at one point he says "The standard is SQL", and that got me thinking. I've done a fair bit of work with SQL, and can do scary things with it, but I wouldn't choose to use it. It's inflexible and clunky. You have to decide your schema in advance, and if your requirements change at a later date, you have no choice but to rebuild entire tables. Anything more complex than a simple one-to-one relationship requires a second table and a join. SQL basically expects you to fit your data to the model, when what you need is to fit the model to your data. Using an ORM like SQLAlchemy doesn't help - it's just a layer of abstraction on top of an inherently clunky system.

For a good dictionary system, you need the flexibility of a NoSQL database. One popular system, which I've done a lot of work with, is MongoDB. This stores documents in a JSON-like format, so a dictionary entry might look like this:

{"word":"kitab",
 "definitions":[{"pos":"noun",
                 "definition":"book"}],
 "inflections":{"plural":{"nominative":"kutuub"}},
 "related":["muktib","kataaba"]}

If a field exists for some words but not others, you only need to put it in the relevant entries. If a field is variable length, you can store it in an array. One slight disadvantage is that cross-referencing between entries can be a little tricky.
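
Here is a small sketch of that flexibility. It uses a plain Python list of dictionaries as a stand-in for a MongoDB collection, so it runs without a database server. The second entry and the helper names `words_with_field` and `related_to` are invented for the example.

```python
import json

entries = [
    {"word": "kitab",
     "definitions": [{"pos": "noun", "definition": "book"}],
     "inflections": {"plural": {"nominative": "kutuub"}},
     "related": ["muktib", "kataaba"]},
    # Hypothetical entry with no "inflections" or "related" field at all
    {"word": "wa",
     "definitions": [{"pos": "conjunction", "definition": "and"}]},
]


def words_with_field(entries, field):
    """Headwords of the entries that actually have `field`."""
    return [e["word"] for e in entries if field in e]


def related_to(entries, word):
    """Headwords of the entries that list `word` in their "related" array."""
    return [e["word"] for e in entries if word in e.get("related", [])]


print(words_with_field(entries, "inflections"))
print(related_to(entries, "kataaba"))

# The entries are ordinary JSON, so they survive a round trip unchanged
assert json.loads(json.dumps(entries)) == entries
```

Note that the cross-reference in "related" is only a string. Finding the entry it points to takes another lookup, which is the tricky part mentioned above.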

Another possibility is ZODB. This is an object persistence system for Python objects. In many ways it's similar to MongoDB, but there's one important difference. If a member of a stored object is itself an object that inherits from persistent.Persistent, what is stored in the parent object is a reference to that object. Cross-referencing is therefore completely transparent. The only small disadvantage is that it's Python-specific, but unless you really need to write your dictionary software in a different language, that shouldn't be a big problem.

You might also want to consider a graph database like Neo4j. This stores data as a network of nodes and edges, like this:

kitab-[:MEANS]->book
kitab-[:PLURAL]->kutuub-[:MEANS]->books

In theory, this is the most flexible form of database. I wouldn't say it was easy to learn or use, though.
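
To show the data model without needing a graph database, the sketch below holds the same three edges as a plain list of (source, relation, target) tuples and walks along them. It is only an in-memory illustration. The name `follow` is invented, and a real graph database would use its own query language for this.

```python
EDGES = [
    ("kitab", "MEANS", "book"),
    ("kitab", "PLURAL", "kutuub"),
    ("kutuub", "MEANS", "books"),
]


def follow(edges, start, *relations):
    """Follow a chain of relation types from `start` and return the nodes reached."""
    nodes = {start}
    for relation in relations:
        nodes = {dst for src, rel, dst in edges
                 if src in nodes and rel == relation}
    return sorted(nodes)


print(follow(EDGES, "kitab", "MEANS"))
print(follow(EDGES, "kitab", "PLURAL", "MEANS"))
```

The second call answers "what does the plural of kitab mean?" by hopping across two edges, which is the kind of question this model is good at.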

There are plenty of other NoSQL databases; these are just the ones I'd use, and I think they're all more suitable for dictionary software than SQL. But do make sure you have a human-readable backup.