Tuesday, 13 December 2016
The Common Ground Algorithm - A Possible Remedy for Filter Bubbles
Thursday, 22 October 2015
Integrating Java with Python the Easy Way
Wednesday, 17 June 2015
The Bootstrap Problem
A post on Data Community DC discusses Why You Should Not Build a Recommendation Engine. The main point is that recommendation engines need a lot of data to work properly, and you're unlikely to have that when you start out.
I know the feeling. In a previous job I created a recommendation engine for a business communication system. It used tags on the content and user behaviour to infer the topics that the user was most likely to be interested in, and recommend content accordingly. Unfortunately, my testbed was my employer's own instance of the product, and the company was a start-up that was too small to need its own product. I never really got a handle on how well it worked.
This brings me to Emily. Emily isn't a product. It's a personal portfolio project. I had an idea for a recommendation system that would infer users' interests from content they posted in blogs, and recommend similar content. The problem is, the content it recommends comes from the other users, so at its current early stage of operation, it doesn't have much to recommend. The more people use it, the better it will become, but what's the incentive to be an early adopter?
What I seem to have at the moment is a recommendation engine that needs somebody to recommend it.
Tuesday, 9 June 2015
Emily Has Moved
Thursday, 4 June 2015
Developing Emily - Revision 24: Porting to OpenShift
Modify /trunk/Emily.py
Modify /trunk/EmilyBlogModel.py
Modify /trunk/EmilyTreeNode.py
Modify /trunk/emily.js
Porting to OpenShift. AppEngine wasn't suitable for the computationally intense parts of Emily.
from Subversion commits to project emily-found-a-thing on Google Code http://ift.tt/1G9GWoV
via IFTTT
Tuesday, 26 May 2015
Introducing Emily - my latest Fantastical Device
Emily is a semantic recommendation system for blogs that I've been working on. If you give it an Atom or RSS feed from a blog, it will create a feed of items from other blogs that hopefully match your interests.
It does this by using significant associations between words to infer your interests. Suppose a randomly-chosen sentence from your blog has a probability P(A) of containing word A, and a probability P(B) of containing word B. If there were no relationship between the words, we would expect the probability of a sentence containing both words to be P(AB) = P(A)P(B). If there is significant information in the relationship between the words, they will co-occur more frequently than this, and we can quantify this with an entropy, H = log2 P(AB) - log2 P(A) - log2 P(B).
Emily uses the strengths of these associations to calculate the similarity between two blogs. Then, if you post an article that makes your blog more similar to somebody else's blog than it was before, that article is recommended to them.
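For the curious, here's a minimal sketch of the calculation in Python. The association score follows the formula above directly; the similarity measure is just one plausible way of combining the scores (cosine similarity over shared word pairs), not necessarily what Emily actually does.

```python
import math
from collections import Counter
from itertools import combinations

def association_scores(sentences):
    """Score word pairs by how much more often they co-occur in a
    sentence than independence predicts:
    H = log2 P(AB) - log2 P(A) - log2 P(B)."""
    n = len(sentences)
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        words = set(sentence.lower().split())
        word_counts.update(words)
        pair_counts.update(combinations(sorted(words), 2))
    scores = {}
    for (a, b), pair_count in pair_counts.items():
        h = (math.log2(pair_count / n)
             - math.log2(word_counts[a] / n)
             - math.log2(word_counts[b] / n))
        if h > 0:  # keep only informative associations
            scores[(a, b)] = h
    return scores

def blog_similarity(scores_a, scores_b):
    """Cosine similarity between two blogs' association-score vectors
    (an assumption - the post doesn't specify how the strengths are
    combined into a similarity)."""
    dot = sum(scores_a[pair] * scores_b[pair]
              for pair in scores_a.keys() & scores_b.keys())
    norm_a = math.sqrt(sum(v * v for v in scores_a.values()))
    norm_b = math.sqrt(sum(v * v for v in scores_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```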
This has been an interesting project for me. I've learned about Google App Engine, pubsubhubbub and Atom. What I need now is for people to try it out. I'm looking forward to when Emily starts finding things for me.
Thursday, 21 May 2015
Developing Emily - Revision 23: Ready to launch
Modify /trunk/Emily.py
Modify /trunk/EmilyBlogModel.py
Modify /trunk/EmilyTreeNode.py
Add /trunk/emily.js
Ready to launch
from Subversion commits to project emily-found-a-thing on Google Code http://ift.tt/1IN7SNv
via IFTTT
Saturday, 26 April 2014
Experimenting with IFTTT
I've just started trying out IFTTT. Partly this is because the Feedburner feed for this blog has needed manual prompting to update my Twitter feed, but also because I'm investigating using it to post automatically to Blogger on behalf of my friends at Speculative Grammarian.
To do this, I'm using a feed from one of my Google Code projects. It's a semantic recommendation system I've been working on. I call it Emily, because it finds things (or at least, it will do when it's up and running). Code updates from the project should be appearing here.
Wednesday, 5 March 2014
One of my Fantastical Devices is on PyPI
Run sudo pip install Markov and try it out. If you feel you can help me improve it, contact me and I can add you to the Google Code project.
Monday, 24 June 2013
A Couple of my Fantastical Devices
Saturday, 22 June 2013
Exciting Voynich Manuscript News
A couple of years ago, I came across a new technique for analysing documents, developed by Marcello Montemurro and Damian Zanette. It identifies the most significant words in a document by the entropy of their distribution in the text. I tried it out on subtitles at the BBC, and got promising early results.
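In outline, the technique looks something like this. The published method uses an analytic correction for the entropy you'd expect from a random shuffling of the text; in this rough sketch I just estimate that baseline by shuffling, and the file name and parameters are placeholders.

```python
import math
import random

def part_entropy(tokens, word, n_parts):
    """Entropy of a word's distribution over equal-sized parts of the text."""
    size = len(tokens) // n_parts
    counts = [tokens[i * size:(i + 1) * size].count(word)
              for i in range(n_parts)]
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def significance(tokens, word, n_parts=8, n_shuffles=20):
    """Observed entropy minus the mean entropy under random shuffling.
    Strongly negative values mean the word clusters in particular
    sections - i.e. it's significant to the document's structure."""
    observed = part_entropy(tokens, word, n_parts)
    baseline = 0.0
    shuffled = tokens[:]
    for _ in range(n_shuffles):
        random.shuffle(shuffled)
        baseline += part_entropy(shuffled, word, n_parts)
    return observed - baseline / n_shuffles

# Usage ('moby_dick.txt' is a placeholder):
tokens = open('moby_dick.txt').read().lower().split()
print(significance(tokens, 'whale'))  # strongly negative => significant
```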
Now Dr Montemurro has applied the technique to the infamous Voynich Manuscript, and discovered that it appears to contain a meaningful language rather than gibberish. No news yet as to what any of it might mean, but my own efforts to uncover its syntax with a Hidden Markov Model may eventually bear fruit. I'm convinced it's a conlang.
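The sort of experiment I have in mind looks like this, sketched with the hmmlearn library (my choice here, assuming a recent version where the discrete model is called CategoricalHMM; the file name and parameters are placeholders): fit a discrete HMM over the manuscript's words, and see whether its hidden states settle into anything like syntactic word classes.

```python
import numpy as np
from hmmlearn import hmm

# Integer-encode the transcribed words ('voynich.txt' is a placeholder).
words = open('voynich.txt').read().split()
vocab = {w: i for i, w in enumerate(sorted(set(words)))}
X = np.array([[vocab[w]] for w in words])

# A discrete HMM: if the text has a grammar, the hidden states should
# converge on something like word classes.
model = hmm.CategoricalHMM(n_components=10, n_iter=50, random_state=0)
model.fit(X)

# Group words by the state that labels them most often.
states = model.predict(X)
for state in range(10):
    members = sorted({w for w, s in zip(words, states) if s == state})
    print(state, members[:10])
```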
Thursday, 26 January 2012
The Voynich Manuscript
Tuesday, 6 September 2011
Quantum Thinking and Function Words
If you want to research a topic such as the "story of rock" with geophysics and rock formation in mind, you don't want a search engine to give you millions of pages on rock music. One approach would be to include "-songs" in your search terms in order to remove any pages that mention "songs". This is called negation, and is based on classical logic. While it would be an improvement, you would still find lots of pages about rock music that just don't happen to mention the word "songs".

Widdows has found that a negation based on quantum logic works much better. Interpreting "not" in the quantum sense means taking "songs" as an arrow in a multidimensional Hilbert space called semantic space, where words with similar meanings are grouped together. Negation then means removing from the search any page that shares a component with this vector, which would include pages with words like music, guitar, Hendrix and so on. As a result, the search becomes much more specific to what the user wants. Obviously, if you're interested in Artificial Intelligence, where a key aim is to enable computers to emulate the flexibility of human thought, this is a useful approach.

The second article, by James W. Pennebaker, concerns his work on the importance of function words. These are things like pronouns, conjunctions and prepositions - the words that don't seem to mean very much, but act like glue holding the sentence together. Professor Pennebaker has discovered that there's a lot of psychological information hidden in these apparently insignificant words - for example, in a conversation between two people, the more socially dominant one will tend to use the word "I" less than the other. Most natural language processing software treats words like this as stop words and ignores them, but for some applications (e.g. sentiment analysis, social network analytics) they could be just the data you need.
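To come back to the negation idea: in vector terms, the quantum NOT is just orthogonality. You subtract from each vector its component along the unwanted term's direction, and what's left is the part of the meaning that has nothing to do with it. A toy numpy sketch (the two-dimensional "semantic space" here is made up purely for illustration):

```python
import numpy as np

def quantum_not(vector, unwanted):
    """Project out the component of vector lying along the unwanted
    term's direction, leaving only the orthogonal remainder."""
    unwanted = unwanted / np.linalg.norm(unwanted)
    return vector - np.dot(vector, unwanted) * unwanted

# Toy semantic space with dimensions (geology, music).
geology_page = np.array([0.9, 0.1])
music_page = np.array([0.2, 0.95])
songs = np.array([0.05, 1.0])

query = quantum_not(np.array([1.0, 1.0]), songs)  # "rock NOT songs"
print('geology page:', np.dot(query, geology_page))  # ~0.85
print('music page:  ', np.dot(query, music_page))    # ~0.14
# The music page scores far lower even though it never mentions
# the literal word "songs".
```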
Friday, 20 May 2011
Robots invent a language
I think it would be interesting to extend this experiment to see if it could give insight into how language evolves. Starting with a larger population of robots, you could give them time to make up a language, and then start deleting the memories of individual robots at intervals. In real life, languages have to be continually relearned by successive generations of speakers, and this is probably part of the reason why they undergo changes. It would be possible to vary the size of the population and the rate of deletions to see what influence these might have, and also to add varying amounts of background noise.
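As a toy version of that experiment, here's a naming game - a standard model from the language evolution literature, not the actual robot setup - with one agent's memory wiped at intervals to stand in for generational turnover (all the parameters are arbitrary):

```python
import random

def naming_game(n_agents=20, n_rounds=20000, wipe_every=2000):
    """Agents negotiate a name for a single object. On success both
    parties discard competing names; on failure the hearer learns the
    speaker's name."""
    inventories = [set() for _ in range(n_agents)]
    next_name = 0
    for round_number in range(1, n_rounds + 1):
        speaker, hearer = random.sample(range(n_agents), 2)
        if not inventories[speaker]:
            inventories[speaker].add(next_name)  # invent a new name
            next_name += 1
        name = random.choice(sorted(inventories[speaker]))
        if name in inventories[hearer]:
            # Communication succeeded: both align on the winning name.
            inventories[speaker] = {name}
            inventories[hearer] = {name}
        else:
            inventories[hearer].add(name)  # failed: hearer learns it
        if round_number % wipe_every == 0:
            # A "new generation": one agent forgets everything.
            inventories[random.randrange(n_agents)] = set()
    return inventories

final = naming_game()
print(len(set().union(*final)), 'names still in circulation')
```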
Mind you, to give a real insight into the development of human language, you might want to give the robots more complex tasks to do than simply finding their way around, so that they would have to invent a grammar to express their meaning. Then you would be seeing how language might develop in a mind fundamentally unlike a human's. There's been considerable debate amongst linguists about how many of the constraints on human languages are hard-wired into the human brain, and how many are simply a result of circumstance and of what can evolve from what already exists.
Wednesday, 6 April 2011
Not as easy as it sounds
The first is that a speech recogniser needs to be trained to recognise all the different sounds it has to identify. I can't just use an off-the-shelf model for this, as there aren't any that are designed for multi-language use. As far as I can tell, nobody else has needed a multi-language speech recognition app before. So, I'll have to build my own model. Fortunately this site has recordings of many sounds from many different languages, and so gives me a good starting point for building a phonetic model.
The second problem is with transcribing all these sounds. The speech recognition engine I'm likely to use, CMU Sphinx, seems to want phonetic transcriptions to be case insensitive and alphanumeric. I'd prefer to use an X-SAMPA derivative called CXS, but the constraints the speech recogniser places on me won't allow that. Fortunately, sounds within a transcription can be separated by spaces, allowing for multicharacter notation, but with the sheer number of sounds the system has to recognise, I'll probably end up with something like htwvitbveuotkvwvahfi, a logical but unusable system I created as a parody of spelling reform proposals.
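What I have in mind is a mapping layer something like the following, where each CXS symbol becomes an arbitrary but Sphinx-safe token. The particular mappings are illustrative, not my final scheme:

```python
# Sphinx wants phone names that are case-insensitive and alphanumeric,
# with sounds separated by spaces - so CXS symbols that differ only in
# case (e.g. 't' versus 'T') need distinct multicharacter names.
CXS_TO_SPHINX = {
    't': 'T',      # voiceless alveolar plosive
    'T': 'TH',     # voiceless dental fricative
    's': 'S',      # voiceless alveolar fricative
    'S': 'SH',     # voiceless postalveolar fricative
    'n': 'N',      # alveolar nasal
    'N': 'NG',     # velar nasal
    'I': 'IH',     # lax high front vowel
    '@': 'SCHWA',  # mid central vowel
}

def to_sphinx(cxs_symbols):
    """Render a sequence of CXS symbols as a space-separated,
    case-insensitive Sphinx transcription."""
    return ' '.join(CXS_TO_SPHINX[symbol] for symbol in cxs_symbols)

print(to_sphinx(['T', 'I', 'N']))  # "TH IH NG", i.e. 'thing'
```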
Wednesday, 16 March 2011
Musical moods
I had a go myself earlier - you have to listen to a number of theme tunes and then answer questions about each one. The questions change from tune to tune - quite a clever piece of experimental design, in that it prevents you from getting into a rut where you're calculating your answers before the music's finished. Hopefully, it will enable my colleagues to train an AI to recognise genre and mood from theme music.
PS - sorry for the lack of posts recently.
Monday, 21 February 2011
A Fantastical Device for Identifying languages
If I said to you
hajimemashite. buriikurii piitaa desu. douzo yoroshiku
you probably wouldn't understand what I meant. However, you would know that I wasn't speaking English (it's Japanese, as a matter of fact - "How do you do? I'm Peter Bleackley. Pleased to meet you."). Were I to say, however
'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe.
All mimsy were the borogoves,
And the mome raths outgrabe.
you would be able to recognise what I said as English, even though it doesn't make sense. Each language has its own phonotactics, the rules determining how sounds can be combined to make words. Therefore, you can identify a language by its sound even if you can't understand it.
I've seen a lot of language identification software on the web, but it all works from written text. We conlangers, of course, like to game the system by putting in our conlangs and seeing what happens. I thought it would be fun to try building a system that could identify languages from speech. So, I've started a project on Google Code to try my ideas out. The idea is to use a stripped-down speech recognition engine to extract a stream of phonemes from audio, and then feed those phonemes into an AI system that has been trained to recognise the phonotactics of various languages.
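The phonotactic back end could be as simple as a phoneme bigram model per language, with the winner chosen by log-likelihood. A minimal sketch - the training streams below are tiny nonsense placeholders, where real ones would come from the recogniser:

```python
import math
from collections import Counter

class PhonotacticModel:
    """Phoneme-bigram model of one language, with add-one smoothing."""

    def __init__(self, phonemes):
        self.bigrams = Counter(zip(phonemes, phonemes[1:]))
        self.unigrams = Counter(phonemes)
        self.vocab = len(set(phonemes))

    def log_likelihood(self, phonemes):
        return sum(
            math.log2((self.bigrams[a, b] + 1)
                      / (self.unigrams[a] + self.vocab))
            for a, b in zip(phonemes, phonemes[1:]))

def identify(phonemes, models):
    """Return the language whose model gives the phoneme stream
    the highest log-likelihood."""
    return max(models, key=lambda lang: models[lang].log_likelihood(phonemes))

# Placeholder training streams; real ones would be much longer.
models = {
    'English': PhonotacticModel('DH IH S IH Z S AH M IH NG G L IH SH'.split()),
    'Japanese': PhonotacticModel('HA JI ME MA SH I TE DO U ZO YO RO SH I KU'.split()),
}
print(identify('MA SH I TE YO RO SH I KU'.split(), models))  # Japanese
```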
Tuesday, 25 January 2011
AI is No Longer a Dirty Word
"Our software combines mathematics with psychology and artificial intelligence to give your customers what they want."

It's very unusual to see anyone outside the games industry use the term Artificial Intelligence for something they're actually selling. The reasons for this are largely historical. A few years ago, a lot of people made rather overhyped claims about what AI would be able to do, which didn't match up with what it could actually do at the time. This created the impression that anything described as Artificial Intelligence belonged in the lab, and wasn't likely to turn into a usable product in the foreseeable future.
There's a gradual change in the perception of AI going on. This is partly because researchers have been taking a more pragmatic approach, and partly because the internet is making large datasets more readily available. Good data is the limiting factor in most AI applications, so the more data that's available, the better AI works.
Wednesday, 1 December 2010
How do you identify what a word means?
Artificial Intelligence has been described as "Trying to get a computer to do anything that it's easier for a human to do." Trying to disambiguate the various possible senses of a word is a classic example of this - having heard a lecture from somebody who works in this field, I know that it's very difficult to create an algorithm that improves on the naive approach of just picking the most common meaning every time. 92% accuracy looks pretty impressive.
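For reference, the naive baseline is nearly a one-liner with NLTK's WordNet interface, where a word's senses are listed most-frequent first (assuming the wordnet corpus has been downloaded):

```python
from nltk.corpus import wordnet as wn

def most_frequent_sense(word):
    """Naive disambiguation: always pick the overall most common
    sense, ignoring context entirely."""
    senses = wn.synsets(word)
    return senses[0] if senses else None

sense = most_frequent_sense('bank')
print(sense.name(), '-', sense.definition())
# bank.n.01 - sloping land (especially the slope beside a body of water)
```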