Relating content automatically in Plone
A question arose today on the Plone general mailing list (a.k.a. Plone-users): is it possible to create a list of related content automatically?
Well, the answer is yes and I'm going to tell you how.
Some time ago Benjamin Saller created a proof-of-concept product called Haystack to do auto-classification of content. Haystack was built around Open Text Summarizer and the haystack_tool included a couple of methods to summarize text and to get a list of "topics" extracted from the content. Haystack also included some portlets to demonstrate its functionality.
We used Haystack at La Jornada for some time with mixed results: the summarizer worked well; we called it via AJAX to fill in the description field of our content, which reduced the work of our editors at editing time.
On the other hand, we used the "topics" it returned to build a portlet that retrieved related content. The main problems here were the low quality of the "topics" and the way the relation was implemented. Sometimes we got embarrassing results, relating content about Iraq with content about, let's say, Shakira, just because they shared some "topic".
Haystack didn't understand the meaning of words and, of course, Ben Saller was aware of that. The last time I saw him was at the Plone Conference 2006 in Seattle. He gave a talk on Haystack 2.0 and he was really excited about its new features: linguistic mapping and automated conceptual mapping, providing high-quality relationships with little or no human effort.
Unfortunately for us, Ben has been a little bit away from the Plone community for some time, so I don't know what the status of his work is.
Going back to the original question on the mailing list, Matt Bowen pointed out to me that Yahoo! has a web service called Term Extraction that does almost the same thing, and he even found a Python implementation for it.
I tested Term Extraction with some text in Spanish and I was very pleased with the results:
<ResultSet xsi:schemaLocation="urn:yahoo:cate http://api.search.yahoo.com/ContentAnalysisService/V1/TermExtractionResponse.xsd">
<Result>wong kar wai</Result>
<Result>stephen frears</Result>
<Result>festival de cannes</Result>
<Result>sean penn</Result>
<Result>25 de mayo</Result>
<Result>cines</Result>
<Result>organizadores</Result>
<Result>evidencia</Result>
<Result>el presidente</Result>
<Result>hace mucho tiempo</Result>
<Result>afp</Result>
<Result>ya</Result>
</ResultSet>
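A response like the one above can be reduced to a plain list of terms with the standard library alone. This is only a sketch: the function name parse_terms is mine, the sample is shortened, and I am assuming the two xmlns declarations on ResultSet (they are needed for the snippet to parse on its own and are not shown in the output above).

```python
import xml.etree.ElementTree as ET

# Shortened sample response. The xmlns declarations are an assumption,
# added so that the document parses standalone.
SAMPLE = """<ResultSet xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns="urn:yahoo:cate"
    xsi:schemaLocation="urn:yahoo:cate http://api.search.yahoo.com/ContentAnalysisService/V1/TermExtractionResponse.xsd">
<Result>wong kar wai</Result>
<Result>festival de cannes</Result>
<Result>cines</Result>
<Result>ya</Result>
</ResultSet>"""


def parse_terms(xml_text):
    """Return the text of every Result element, ignoring XML namespaces."""
    root = ET.fromstring(xml_text)
    terms = []
    for element in root.iter():
        # ElementTree reports namespaced tags as "{namespace}name";
        # keep only the local name so the check works either way.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "Result" and element.text:
            terms.append(element.text.strip())
    return terms


if __name__ == "__main__":
    print(parse_terms(SAMPLE))
```

The resulting list is what you would store in the Subject field or in a dedicated field.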
Implementing this in Plone does not seem too complicated: you can trigger a script in a workflow transition, or use Content Rules in Plone 3.0, to fill the Subject field or, better, add an additional field to store this information. Just remember the Term Extraction web service is limited to 5,000 queries per IP address per day.
Yes, I know this solution suffers from the same problems as Haystack, but the "topics" obtained here are of better quality, and you can always find a better algorithm to do the relation, like requiring more than one shared "topic" or using only "topics" longer than one word.
Anyway, I will put this on my list of pending stuff to test (with a little help from Matt Bowen, of course).
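To make the two ideas concrete (more than one shared "topic", and only "topics" longer than one word), here is a minimal sketch in plain Python. It is not Plone code: the names multiword and related, the min_shared parameter and the dictionary of candidates are all hypothetical, and in a real site the candidates would come from a catalog query on whatever field stores the terms.

```python
def multiword(terms):
    """Keep only terms made of more than one word, lowercased."""
    return {term.lower() for term in terms if len(term.split()) > 1}


def related(source_terms, candidates, min_shared=2):
    """Return ids of candidates sharing at least min_shared multi-word terms.

    candidates maps a content id to its list of extracted terms.
    Results are ordered by number of shared terms, then by id.
    """
    source = multiword(source_terms)
    scored = []
    for item_id, terms in candidates.items():
        shared = source & multiword(terms)
        if len(shared) >= min_shared:
            scored.append((len(shared), item_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item_id for _, item_id in scored]


if __name__ == "__main__":
    article = ["festival de cannes", "sean penn", "cines", "ya"]
    others = {
        "a": ["festival de cannes", "sean penn", "afp"],
        "b": ["cines", "ya", "afp"],
        "c": ["sean penn", "el presidente"],
    }
    print(related(article, others))
    print(related(article, others, min_shared=1))
```

With these sample terms, item "b" is never related because it only shares single-word terms, and item "c" shows up only when a single shared term is accepted.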
Comments
I think at least we'll need a workflow script that can be attached to any transition event (will work in 2.5 via jensens's DCWorkflow patches in AGX) and a portlet for showing matches. We'll also need to make sure our field is indexed properly, and maybe make a script to run all existing content in small batches per day. I look forward to working with you on this.
For example, I imagine using this in a special field named after some NITF element, with an adapter; that's probably not what you have in mind. You want to filter results against a controlled vocabulary, but that's not my case.
I think different taxonomies can coexist on a site. I mean, you can have social bookmarking, controlled vocabularies and term extraction over the same content and use it for different purposes.
Let's talk about this on IRC.
For all the hype of the Plone community about their product, the examples on plone.net are not impressive at all. The whole top-down approach, where an admin gives rights to users and groups instead of it coming out of the community, and still no social networks, is shocking.
But not even having good algorithms to connect CONTENT, the core of Plone, is pretty sad.
Very disappointed in all the hype and no delivery. Internet in 2009/2010 requires a hell of a lot more than a portal admin and a bunch of documents thrown on the web.
"Internet in 2009/2010 requires...."
You might want to check the date of this article ;) you are 18 months behind...
Besides, what the author is saying is that there *IS* a solution, and all the required parts to do this are built in to Plone, apart from using an external web service to do the actual summarising. Which, if you ask me, is very web 2.0 ;)
In the 18 months since this article was written, another Python text summariser has appeared which doesn't require Yahoo's web service, if you want to use that:
http://pypi.python.org/pypi/topia.termextract
-Matt