Project Blog: Day One

Had a meeting with Mike. From this I found that I can "get the ball rolling", as I now have a direction to go in.

There are still issues with the IBM intellectual property rights, so for the time being an alternative dataset needs to be found.

Mike suggested using PDF files and an API to extract an XML dataset from them. A few websites were very useful in identifying Java PDF APIs:

http://java-source.net/open-source/pdf-libraries

http://schmidt.devlib.org/java/libraries-pdf.html

These pointed me in the right direction.

This one unfortunately appears to work the wrong way around (it converts XML to PDF):

http://incubator.apache.org/pdfbox/

The one below looks like it may do the job. An e-mail has been sent to their sales team, but as yet there has been no reply:

http://www.davisor.com/

The one below does seem to do the job to some extent, though not as much as I'd like. It does break the document into XML; however, it would be very useful if it broke it down into sentence and paragraph tags etc. As it is, it simply puts every word between word tags, along with other information that appears to be redundant:
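To make the kind of output I am after more concrete, here is a minimal sketch in Java. It assumes the text has already been pulled out of the PDF as a plain string, with blank lines between paragraphs. The class name TextToXml and the tag names document, paragraph, sentence and word are my own placeholders, not something any of the tools above produce.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper: not part of any PDF library mentioned in this post.
class TextToXml {

    // Escape the characters that are special in XML.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
                .replace("\"", "&quot;").replace("'", "&apos;");
    }

    // Wrap already-extracted plain text in paragraph, sentence and word tags.
    static String toXml(String text) {
        StringBuilder out = new StringBuilder("<document>");
        // Assumption: a blank line separates paragraphs.
        for (String para : text.split("\\R\\s*\\R")) {
            para = para.trim();
            if (para.isEmpty()) {
                continue;
            }
            out.append("<paragraph>");
            BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ENGLISH);
            sentences.setText(para);
            int start = sentences.first();
            for (int end = sentences.next(); end != BreakIterator.DONE;
                    start = end, end = sentences.next()) {
                String sentence = para.substring(start, end).trim();
                if (sentence.isEmpty()) {
                    continue;
                }
                out.append("<sentence>");
                // Words are split on whitespace, so punctuation stays attached.
                for (String word : sentence.split("\\s+")) {
                    out.append("<word>").append(escape(word)).append("</word>");
                }
                out.append("</sentence>");
            }
            out.append("</paragraph>");
        }
        return out.append("</document>").toString();
    }

    public static void main(String[] args) {
        System.out.println(toXml("Hello world. Second one.\n\nNew paragraph here."));
    }
}
```

Sentence boundaries come from the standard java.text.BreakIterator, so this is only as good as its rules; a real PDF would also need layout information to find paragraphs reliably.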

http://multivalent.sourceforge.net/Tools/pdf/Info.html

Here are examples of the outputs.

There is this option, which appears to be useless:

http://www.megaupload.com/?d=NXLHK6EV

And this one which seems a little better as it would be easier to search on a word by word basis:

http://www.megaupload.com/?d=J5GS3GVI

It does look hopeful that this kind of tool is out there, but a little more research is needed. It is important that I do not get too "bogged down" with this, as there are alternatives and the requirements deadline is looming. It would also be useful if this Davisor company would get back to me.

Also on my travels I came across what appears to be some kind of PDF indexer and searcher, which may be of use:

http://multivalent.sourceforge.net/Tools/doc/Lucene.html#Index

Other things taken from the meeting with Mike were:

The importance of the weightings given to words in the attentional model: this is the "intelligent" aspect of the project.

The attentional engine will almost certainly be a class in its own right.

Just work at a word level in the attentional model, e.g. "bookings" and "book" are the same thing, as this is not the important aspect of the project. The indexing will be fairly novel.

There needs to be an RSS feed output (this needs research, as I have never done it before).

The attentional aspect of the project is the part that will be evaluated, and it will be evaluated on performance.
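As a first stab at what these notes might look like in code, here is a rough Java sketch of an engine class that keeps a weight per word. Everything in it is my own assumption rather than anything agreed with Mike: the name AttentionalEngine, the boost and decay parameters, and the very naive suffix stripping that makes "bookings" and "book" share one entry. It is a simple decaying weight table, not a proper attentional model.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: the class name, boost and decay are illustrative assumptions.
class AttentionalEngine {
    private final Map<String, Double> weights = new HashMap<>();
    private final double boost; // added to a word's weight each time it is seen
    private final double decay; // factor applied to every weight on each observation

    AttentionalEngine(double boost, double decay) {
        this.boost = boost;
        this.decay = decay;
    }

    // Very naive suffix stripping so that "bookings" and "book" share a key.
    // A real stemmer would handle far more cases than this.
    static String normalise(String word) {
        String w = word.toLowerCase(Locale.ENGLISH).replaceAll("[^a-z]", "");
        for (String suffix : new String[] {"ings", "ing", "ed", "s"}) {
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    // Seeing a word fades all existing weights, then raises that word's weight.
    void observe(String word) {
        weights.replaceAll((key, value) -> value * decay);
        weights.merge(normalise(word), boost, Double::sum);
    }

    // Current weight of a word, or 0.0 if it has never been seen.
    double weightOf(String word) {
        return weights.getOrDefault(normalise(word), 0.0);
    }

    public static void main(String[] args) {
        AttentionalEngine engine = new AttentionalEngine(1.0, 0.5);
        engine.observe("bookings");
        engine.observe("book");
        engine.observe("flights");
        System.out.println("book:   " + engine.weightOf("book"));
        System.out.println("flight: " + engine.weightOf("flight"));
    }
}
```

Keeping the weighting inside one class means the way weights rise and fade can be swapped out later without touching the indexing or the RSS output.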

Some psychology topics to look into:

Cognitive psychology

Attentional models

Spiking neurons

Hysteresis curve

Working set

Some communication with Chris, who is doing the RSS component of the project, will be important, as a format for describing individual words will be needed, e.g. snippets and the context around each word. This is why the PDF API is so important.

I need to start thinking about the requirements, especially as I have not written any since year 2. Maybe get a book out, as a professional approach will be expected.
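To have something concrete to show Chris, here is a small Java sketch of how one word and its surrounding snippet could be written as an RSS 2.0 item. The class name WordFeedItem and the choice of fields (word as title, snippet as description, source document as link) are purely my assumptions; the real format still has to be agreed.

```java
// Hypothetical sketch: the field choices are assumptions, not an agreed format.
class WordFeedItem {

    // Escape the characters that are special in XML.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
                .replace("\"", "&quot;").replace("'", "&apos;");
    }

    // Builds one RSS 2.0 <item>: the word as title, its context as description.
    static String toRssItem(String word, String snippet, String sourceLink) {
        return "<item>"
                + "<title>" + escape(word) + "</title>"
                + "<link>" + escape(sourceLink) + "</link>"
                + "<description>" + escape(snippet) + "</description>"
                + "</item>";
    }

    // Wraps already-built items in the channel element that RSS 2.0 requires.
    static String toRssFeed(String title, String link, String description, String... items) {
        StringBuilder out = new StringBuilder("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
        out.append("<rss version=\"2.0\"><channel>");
        out.append("<title>").append(escape(title)).append("</title>");
        out.append("<link>").append(escape(link)).append("</link>");
        out.append("<description>").append(escape(description)).append("</description>");
        for (String item : items) {
            out.append(item);
        }
        return out.append("</channel></rss>").toString();
    }

    public static void main(String[] args) {
        // example.org is a placeholder address, not a real dataset location.
        String item = toRssItem("book", "please book a flight & hotel",
                "http://example.org/doc.pdf");
        System.out.println(toRssFeed("Word feed", "http://example.org/",
                "Words and their context", item));
    }
}
```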

Thursday, 19 February 2009