Had a meeting with Mike. From this I found that I can "get the ball rolling", as I now have a direction to go in.
There are still issues with the IBM Intellectual Property rights, so for the time being an alternative dataset needs to be found.
Mike suggested using PDF files and an API to extract an XML dataset from them. A few websites were very useful in identifying Java PDF APIs:
These pointed me in the right direction.
This one unfortunately works the wrong way around (converts XML to PDF):
And the one below looks like it may do the job; an e-mail has been sent to their sales team, but as yet there has been no reply.
The one below does seem to do the job to some extent, though not as well as I'd like. It does break the document into XML; however, it would be very useful if it broke it down into sentence and paragraph tags etc. As it is, it simply puts every word between word tags, along with other information that appears to be redundant.
Here are examples of the outputs:
There is this option which appears to be useless:
And this one which seems a little better as it would be easier to search on a word by word basis:
It does look hopeful that this kind of tool is out there, but it appears a little more research is needed. It is important that I do not get too "bogged down" with this, as there are alternatives and the requirements deadline is looming. It would also be useful if this Davisor company would get back to me.
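If no tool produces sentence tags directly, one fallback might be to regroup the word tags afterwards. The sketch below is in Python purely for brevity (the project itself is looking at Java APIs) and rests on assumptions: the `<page>` and `<word>` tag names and the sample text are hypothetical stand-ins for the tool's real output, and a sentence is assumed to end at any word finishing with ".", "!" or "?".

```python
import xml.etree.ElementTree as ET

# Hypothetical word-level XML, standing in for the tool's real output.
SAMPLE = """<page>
  <word>The</word><word>booking</word><word>was</word><word>made.</word>
  <word>It</word><word>was</word><word>confirmed</word><word>today.</word>
</page>"""

# Assumption: a word ending in one of these closes the sentence.
SENTENCE_END = (".", "!", "?")


def words_to_sentences(xml_text, word_tag="word"):
    """Rebuild a <document> of <sentence> elements from flat word tags."""
    root = ET.fromstring(xml_text)
    document = ET.Element("document")
    current = []
    for node in root.iter(word_tag):
        token = (node.text or "").strip()
        if not token:
            continue  # skip empty word tags
        current.append(token)
        if token.endswith(SENTENCE_END):
            ET.SubElement(document, "sentence").text = " ".join(current)
            current = []
    if current:
        # Keep any trailing words that never reached a terminator.
        ET.SubElement(document, "sentence").text = " ".join(current)
    return document


if __name__ == "__main__":
    print(ET.tostring(words_to_sentences(SAMPLE), encoding="unicode"))
```

This rule is deliberately crude: abbreviations such as "e.g." would split a sentence early, and paragraph breaks cannot be recovered unless the tool also reports layout information.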
Also on my travels I came across what appears to be some kind of PDF indexer and searcher; this may be of use:
Other things taken from the meeting with Mike were:
The importance of the weightings you give words in the attentional model; this is the "intelligent" aspect of the project.
The attentional engine will almost certainly be a class in its own right.
Just work at a word level in the attentional model, e.g. "bookings" and "book" are the same thing, as this is not the important aspect of the project. The indexing will be fairly novel.
There needs to be an RSS feed output (needs research, never done before).
The attentional aspect of the project is the part which will be evaluated; it will be evaluated on performance.
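To make the points above concrete, here is a minimal sketch of an attentional engine as a class of its own, working at word level. It is in Python for brevity and everything specific in it is an assumption: the class name `AttentionalEngine`, the suffix list, and above all the decay-and-boost weighting, which is only a placeholder and not something agreed in the meeting.

```python
from collections import defaultdict

# Crude, illustrative suffix list so that "bookings" and "book" match.
SUFFIXES = ("ings", "ing", "s")


def normalise(word):
    """Lower-case a word and strip one known suffix, keeping at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


class AttentionalEngine:
    """Hypothetical engine holding one weight per normalised word."""

    def __init__(self, decay=0.5):
        self.decay = decay  # assumed: older attention fades by this factor
        self.weights = defaultdict(float)

    def observe(self, words):
        """Decay every existing weight, then boost the words just seen."""
        for key in self.weights:
            self.weights[key] *= self.decay
        for word in words:
            self.weights[normalise(word)] += 1.0

    def top(self, n=3):
        """Return the n highest-weighted words, ties broken alphabetically."""
        ranked = sorted(self.weights.items(), key=lambda kv: (-kv[1], kv[0]))
        return ranked[:n]


if __name__ == "__main__":
    engine = AttentionalEngine()
    engine.observe(["bookings", "flight"])
    engine.observe(["book"])
    print(engine.top())
```

Keeping the weighting inside one class means the scheme can be swapped later without touching the indexing or the RSS output, which matters given that this is the part to be evaluated.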
Some psychology stuff to look into:
Cognitive psychology
Attentional Models
Spiking neurons
Hysteresis curve
Working Set
Some communication with Chris, who is doing the RSS component of the project, will be important, as a format for the description of individual words will be needed, e.g. snippets and the context around them etc. This is why the PDF API is so important.
Requirements need to be thought about soon, especially as I have not done any since year 2. Maybe get a book out, as a professional approach will be expected.
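As a starting point for that conversation, a word description could be as simple as the word plus a few words of context on either side. The Python sketch below is only an illustration: the real format still has to be agreed with Chris, the function names and the `radius` parameter are made up, and the only thing borrowed from RSS 2.0 is the `item`, `title` and `description` element names.

```python
import xml.etree.ElementTree as ET


def snippet(words, index, radius=3):
    """Return the word at `index` with up to `radius` words either side."""
    start = max(0, index - radius)
    return " ".join(words[start:index + radius + 1])


def word_item(words, index, radius=3):
    """Build an RSS-style <item> describing one word and its context."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = words[index]
    ET.SubElement(item, "description").text = snippet(words, index, radius)
    return item


if __name__ == "__main__":
    text = "the hotel booking was confirmed by email today".split()
    print(ET.tostring(word_item(text, 2, radius=2), encoding="unicode"))
```

The snippet depends on knowing word order and position, which is exactly what the PDF-to-XML step has to preserve.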
Hello Leigh
A good record which will be useful going forward and for later review.