A blog about things that interest me, politics, news, media, architecture, development, environment, local history, secularism, web, dublin ireland, tara

Contact me at expectationlost@gmail.com

Monday, 26 July 2010

Failed experiments in report digestion, annotation and entity extraction on the cheap

For years I've been looking for ways to digest, annotate and generally make lengthy reports more usable.

One of the main ways to do this would be through named entity extraction: running the report through a program that extracts all the key words and sorts them into proper names, places, organisations, companies and so on. I've looked around, and there are lots of university theses on how to do it, but it seems nobody has cracked it, unless perhaps mega-corporations can manage it at an industrial level. You could also paste the whole report into a wiki and hope people will annotate and link sections of it, but wikis aren't easy to use, and there's no way I know of to automate the process even a little.
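To show what I mean by sorting key words into categories, here's a toy sketch in Python. It isn't real entity extraction (the proper systems use trained statistical models, not hand-made word lists) and all the lists below are made up for illustration, but the shape of the output is the idea:

```python
# Toy illustration of what entity extraction produces: words from the
# text sorted into categories. Real systems use trained models; the
# word lists here are hand-made examples, not a real entity database.
KNOWN_ENTITIES = {
    "person": {"Mark Coughlan", "Gavin Sheridan"},
    "organisation": {"FAS", "DDDA", "ProPublica"},
    "place": {"Dublin", "Ireland"},
}

def extract_entities(text):
    """Return {category: [known entities found in text]}."""
    found = {}
    for category, names in KNOWN_ENTITIES.items():
        hits = sorted(n for n in names if n in text)
        if hits:
            found[category] = hits
    return found

report = "The DDDA report covers property deals in Dublin; Gavin Sheridan wrote about FAS."
print(extract_entities(report))
```

The hard part the researchers haven't cracked is doing this without a hand-made list: recognising that an unseen string is a person or a company from context alone.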

Entity extraction has come back into vogue, particularly in conjunction with the rise of the semantic web, and a number of companies now offer it: Yahoo YQL, AlchemyAPI and OpenCalais. They have demos you can try to see what they extract (the AlchemyAPI demo and the OpenCalais demo), but then they give you REST URLs to actually use the services, and I don't know what to do with REST URLs; they rarely give examples.
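For what it's worth, a REST URL is just an address you send an HTTP request to. This sketch shows the general shape of calling such a service from Python; the endpoint URL and parameter names below are placeholders, not the actual OpenCalais or AlchemyAPI interface, so you'd need to swap in the real ones from the service's documentation:

```python
# Sketch of calling an entity-extraction REST service. The endpoint
# and parameter names ("licenseID", "content") are illustrative
# placeholders -- check the actual service docs for the real ones.
from urllib import parse, request

def build_extraction_request(api_key, text,
                             endpoint="http://api.example.com/extract"):
    """Package the text into an HTTP POST request for the service."""
    data = parse.urlencode({"licenseID": api_key, "content": text})
    return request.Request(endpoint, data=data.encode("utf-8"))

req = build_extraction_request("MY-API-KEY", "FAS spent heavily in Dublin.")
# Actually sending it would be: request.urlopen(req).read()
# (not done here, since the endpoint above is a placeholder).
print(req.get_method(), req.full_url)
```

The response would typically come back as XML or JSON listing the entities found, which you'd then have to parse yourself.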

OpenCalais has a WordPress plugin that will try to extract tags for your blog. It works OK, and I figured a blog would actually be a good way to present a report: it would give me the structure I'd need, and I could also dole out the report's pages via RSS.

OpenCalais also has a WP Calais Archive Tagger. So I found the Mass Page Maker plugin for WordPress and got some dense reports from TheStory, who have gone to the effort of obtaining government reports and scanning and OCR'ing them, like the FÁS report INV 37 into excessive spending by the Irish state training body.

I downloaded it from Scribd, a document store website that also keeps files in txt format. Following the Mass Page Maker plugin's instructions, I used Notepad++ to split each page into a separate blog post in the plugin's comma-separated (.csv) syntax, and uploaded the file to a blog.
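The Notepad++ step could probably be scripted. As a sketch: tools like pdftotext separate pages of a text dump with a form-feed character (\f), so splitting on that and writing one CSV row per page gets you most of the way. The column names below are my assumption about what a mass-page-creation plugin might expect, so they'd need matching against the plugin's own documentation:

```python
# Sketch of automating the page-splitting step: turn a report's
# plain-text dump into one CSV row per page for a mass page creator
# plugin. pdftotext marks page breaks with a form-feed (\f); the CSV
# column names are assumptions -- match them to the plugin's docs.
import csv
import io

def pages_to_csv(report_text, title_prefix="Report page"):
    """Return CSV text with one (post_title, post_content) row per page."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["post_title", "post_content"])
    for i, page in enumerate(report_text.split("\f"), start=1):
        page = page.strip()
        if page:  # skip blank trailing pages
            writer.writerow([f"{title_prefix} {i}", page])
    return out.getvalue()

sample = "First page text\fSecond page text\f"
print(pages_to_csv(sample))
```

Splitting into sensible paragraphs rather than raw pages is harder, since OCR'd text rarely preserves paragraph breaks cleanly.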

Then I ran the Calais Archive Tagger over each page. It extracts about 10-20 tags per page, though it often splits names and organisations up. I also tried some YQL-based and other auto-tagging WP plugins on it. They don't work very well.

I did have this set up as a blog, with each page of the DDDA report (into state bodies' bad property deals with private groups in the Dublin docklands) posted and tagged on my domain, but I had to delete it because it was getting in the way. An earlier attempt, *shrug*.

There are many other WordPress plugins I tried: for grouping the tags into entity-like groups of people and companies, and for locating every place mentioned in the text on a Google map. There are also a number of WordPress plugins for creating taxonomies of words and glossaries.

Annotation
WordPress does allow comments at the end of a post, which may elucidate a report, but with a lengthy text it may be easier to annotate the text directly. I came across a WP plugin that does just that: Digress.it (example), which has a floating side-commenting widget that annotates each paragraph of a post. There are a couple of earlier versions of this plugin the developer made, but I couldn't get any of them to work. There's a particular Apache setting most shared hosting services don't have, and I can't change it or get any of the suggested fixes to work stably.

DocumentCloud seems to be the professional answer to all of this: investigative journalism websites like ProPublica are already using it for annotation, it seems to be open source, and they say it will do entity extraction, but that's not ready yet.

So, dear wind, I couldn't get entity extraction to work very well, nor annotation. Maybe I'll just have to concentrate and read the reports :/

Detailed reading of the reports is what enabled The Story reporters Mark Coughlan and Gavin Sheridan to produce articles like this one, about the renting of buildings from business partners of FÁS employees, digging the details out of the dense reports. I guess that's what investigative journalism is.

1 comment:

dublinstreams said...

Got http://digress.it to work with version 3: http://distributarie.com/reports/

Still having trouble converting the PDF into paragraphed sections via CSV.