Friday, 27 February 2009

Day Six

Today the functional requirements were completed. They look brief, but they were very time consuming.

I created some use cases to help derive requirements; these were done in StarUML. I had to re-learn use cases. They look correct to me, and in any case they served their purpose of generating some requirements.

I had thoughts of creating various other diagrams, such as state machines and data flow diagrams. However, this requirements document is by no means the final version: as the project evolves there will be new requirements and a better understanding of the project as a whole, and by that point it may be necessary to define some higher-level requirements.

The objective by the end of the next day is to complete the functional requirements in the document and tie up the loose ends.

Thursday, 26 February 2009

Day Five

Today was spent doing the "context" part of the requirements document.
The objective by the end of the next day will be to have all functional requirements defined.

Tuesday, 24 February 2009

Day Four

The first half of the day involved investigating XML manipulation APIs; the general view was that JDOM was the richest. I had issues installing it, but yet again realised that it belonged in Java's \ext folder!

The API took a while to get familiar with, as I have more experience using C#; however, I found this example from IBM which was a huge help:

http://www.ibm.com/developerworks/library/x-injava2/
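The basic pattern boils down to something like this (a rough sketch; the class name and file path are just for illustration):

  import java.io.File;
  import org.jdom.Document;
  import org.jdom.Element;
  import org.jdom.input.SAXBuilder;

  public class ParsePdfXml {
      public static void main(String[] args) throws Exception {
          // Build an in-memory JDOM document from the XML produced by the PDF API
          SAXBuilder builder = new SAXBuilder();
          Document doc = builder.build(new File("C:/pdfoutput.xml"));
          Element root = doc.getRootElement();
          // Walk the root's children and print their element names
          for (Object child : root.getChildren()) {
              System.out.println(((Element) child).getName());
          }
      }
  }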

Then I was getting invalid XML errors which took a while to work out. It turned out that the API was putting certain closing tags in the wrong places. Upon further investigation the cause was the "-font" parameter on the command line. Conveniently, the tags were redundant anyway, so once it was removed I had valid XML.

This was the error:

Error on line 9 of document file:///C:/pdfoutput.xml: The element type "timesnewromanps-boldmt" must be terminated by the matching end-tag "</timesnewromanps-boldmt>"

Then I was faced with another problem: the program was reporting this exception:

Could not check
because Invalid byte 1 of 1-byte UTF-8 sequence.

It turned out the XML file was no longer in UTF-8. To solve this I remembered a text editor I used on placement called UltraEdit. I opened the file in UltraEdit, re-saved it with UTF-8 encoding, and it worked fine. So I now have a small XML parsing prototype.
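Incidentally, the same re-encoding could probably be scripted in a few lines of Java rather than using an editor; a rough sketch, assuming the file had ended up in the default Windows encoding (windows-1252), which is a guess on my part:

  import java.io.*;

  public class ReEncode {
      public static void main(String[] args) throws IOException {
          // Assumed: source file is windows-1252; rewrite it as UTF-8
          Reader in = new BufferedReader(new InputStreamReader(
                  new FileInputStream("C:/pdfoutput.xml"), "windows-1252"));
          Writer out = new OutputStreamWriter(
                  new FileOutputStream("C:/pdfoutput-utf8.xml"), "UTF-8");
          int c;
          while ((c = in.read()) != -1) {
              out.write(c);
          }
          in.close();
          out.close();
      }
  }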

The second half of the day was spent going through and making notes on the academic documents highlighted in the project proposal:

http://www.iiit.net/~pkreddy/wdm03/wdm/vldb01.pdf

http://books.google.co.uk/books?hl=en&lr=&id=30UsZ8hy2ZsC&oi=fnd&pg=PR7&dq=cognitive+psychology+and+computing&ots=1LyL05Rj1i&sig=bX5q1YNTfqlTOANUSctLSS6f49g

http://books.google.co.uk/books?id=Rdwv-r5RlOcC&pg=PA88&dq=Attentional+models&lr=

http://books.google.co.uk/books?id=RZ-6cTL8YZsC&printsec=frontcover&dq=web+service&lr=

http://books.google.co.uk/books?id=hXTfWDkqnlkC&printsec=frontcover&dq=xml+indexing&lr=

So the lit review has begun!

Day Three

Today I added Chris Campbell on Facebook. He's doing the second part of the project, so communication will be helpful; we've already had some discussions about the data-exchange protocol over Facebook.

I experimented with the PDF API and created a little prototype. I'm sure I am missing some important methods that are there; I may need to decompile the code to see, which should be fine as it is open source. I'll have to look into it though.

It took me a long time to work out how to add extensions or additional APIs in Java. Of course, it was obvious: they go in the \ext folder!

For example:

C:\Program Files\Java\jre1.6.0_07\lib\ext
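An alternative, rather than copying JARs into the JRE, would be to pass them on the classpath explicitly when compiling and running; a sketch, with the JAR and class names made up for illustration:

  javac -cp .;pdfapi.jar Prototype.java
  java -cp .;pdfapi.jar Prototype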

After this I carried on re-learning the requirements topic and found a useful website from which I have taken over a page's worth of notes:

http://www.stsc.hill.af.mil/crosstalk/2008/03/0803Gottesdiener.html

I now think I know how to approach this area; however, I'm not sure whether what I have been looking at is a little too high-level.

I sent an e-mail to Mike asking about the relationship between requirements and the lit review; he said that they go hand in hand and are very much an iterative process. Soon I'll start with a small lit review and go from there.

On my travels I also came across problem frames:

http://en.wikipedia.org/wiki/Problem_Frames_Approach

This is a requirements approach worth thinking about.

I also came across requirements templates:

http://www.jiludwig.com/Template_Guidance.html#checklist

These again are maybe a little high-level, but they give an indication of the work needed. I am a little worried about the Friday deadline for the requirements, but I'll just have to get stuck in!



Sunday, 22 February 2009

Day Two

In the morning I did a little experiment to see how other domains represent text documents in XML. Mike had mentioned that there is a way of looking at how OpenOffice “.odt” files are represented in XML using Windows. The process was:

  1. Create an .odt file

  2. Rename its extension to “.zip”

  3. Extract the file

  4. Look at “Content.xml”
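The same peek can also be scripted, since an .odt file is just a zip archive; a minimal Java sketch using java.util.zip (the file name is made up, and the entry appears as lower-case "content.xml" inside the archive):

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.util.zip.ZipEntry;
  import java.util.zip.ZipFile;

  public class OdtPeek {
      public static void main(String[] args) throws Exception {
          // An .odt file is a zip archive, so content.xml can be read straight out of it
          ZipFile odt = new ZipFile("example.odt");
          ZipEntry entry = odt.getEntry("content.xml");
          BufferedReader in = new BufferedReader(
                  new InputStreamReader(odt.getInputStream(entry), "UTF-8"));
          String line;
          while ((line = in.readLine()) != null) {
              System.out.println(line);
          }
          odt.close();
      }
  }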

The resulting file looked like this:

http://www.megaupload.com/?d=1VZLOA04

Although the “Content.xml” file may be a lot easier to read, I came to the conclusion that there were enough similarities with the XML file created by the PDF API that either could serve as an example data-set for this problem. When the PDF API was re-visited it became apparent that there were some extra command-line parameters which could produce extra useful elements in the XML; for example, some would include hyperlinks in the XML.

As reading the XML was quite hard in a simple document viewer, I discovered a specialist viewer called “XML Marker”. This helped highlight the thinking behind how the PDF API represents a PDF file in XML. Before I had this tool I did not realise that there were “line” tags for blank lines, or that “word” tags inside the same “text” tag meant the words were on the same line.
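From memory, the structure is roughly along these lines (tag names and nesting reconstructed from what I saw, so treat this as illustrative rather than the API's exact output):

  <text>
    <word>Intelligent</word>
    <word>indexing</word>
  </text>
  <line></line>
  <text>
    <word>of</word>
  </text>

Here the two “word” tags inside the first “text” tag are on the same line, and the empty “line” tag stands for a blank line.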

So this helped me reach the conclusion that the PDF API will be sufficient for creating the data set used for this project, at least until the IBM issues are resolved one way or the other.

Next I tried to find documentation which was large enough and laid out the way I'd imagine the IBM documentation might be presented. I found this link:

www.reportlab.com/docs/PyRXP_Documentation.pdf

I processed it to see how the API would handle it; it was fine. Please see the resulting XML at:

http://www.megaupload.com/?d=FKARW4FV

It's important to remember to make the program very flexible with regard to changing data-sets, as the IBM one may need to be substituted in for the one created from the PDF. I know this is a basic principle of programming, but it is a reminder to carry forward.
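One speculative way to keep that flexibility (the names here are my own; nothing is designed yet) would be to hide the data-set behind a small interface:

  import org.jdom.Document;

  // Speculative sketch: any source of the XML data-set implements this,
  // so swapping the PDF-derived set for the IBM one is a one-class change
  public interface DataSetSource {
      Document loadDataSet() throws Exception;
  }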

The second half of the day involved a trip to the library to get some books on requirements, as I have not done any kind of requirements gathering since year two. The books I found were:

  • Effective Requirements Practices – Ralph R. Young

  • Software Requirements and Specifications – Michael Jackson

(Add to Bibliography)

The remainder of the day was spent evaluating approaches to requirements.


Thursday, 19 February 2009

Day One

Had a meeting with Mike; from this I found that I could "get the ball rolling", as I now have a direction to go in. There are still issues with the IBM intellectual property rights, so for the time being an alternative dataset needs to be found. Mike suggested using PDF files and an API to extract an XML dataset from them. A few websites were very useful in identifying Java PDF APIs:


These pointed me in the right direction.

This one unfortunately works the wrong way around (it converts XML to PDF):


And the one below looks like it may do the job; an e-mail has been sent to their sales team, but as of yet there has been no reply:


The one below does seem to do the job to some extent, though not as much as I'd like. It does break the document into XML; however, it would be very useful if it broke it down into sentence and paragraph tags etc. As it is, it simply puts all words between word tags, along with other, apparently redundant information.


Here are examples of the outputs:

There is this option which appears to be useless:


And this one, which seems a little better as it would be easier to search on a word-by-word basis:


It does look hopeful that this kind of tool is out there, but it appears a little more research is needed. It is important that I do not get too "bogged down" with this, as there are alternatives and the requirements deadline is looming. It would also be useful if this Davisor company would get back to me.

Also on my travels I came across what appears to be some kind of PDF indexer and searcher; this may be of use:


Other things taken from the meeting with Mike were:

  • The importance of the weightings given to words in the attentional model; this is the "intelligent" aspect of the project.

  • The attentional engine will almost certainly be a class in its own right.

  • Just go at a word level in the attentional model (e.g. whether "bookings" and "book" count as the same thing), as this is not the important aspect of the project; the indexing will be fairly novel.

  • There needs to be an RSS feed output (research; never done before).

  • The attentional aspect of the project is the part which will be evaluated, and it will be evaluated on performance.
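Nothing is designed yet, but as a purely speculative sketch of what an attentional engine class might look like (the name, the per-word weight map and the decay behaviour are all assumptions on my part):

  import java.util.HashMap;
  import java.util.Map;

  // Speculative sketch: an attentional engine keeping a decaying weight per word
  public class AttentionalEngine {
      private final Map<String, Double> weights = new HashMap<String, Double>();
      private static final double DECAY = 0.9; // assumed decay factor

      // Boost a word's weight when it receives attention
      public void attend(String word) {
          String key = word.toLowerCase();
          Double w = weights.get(key);
          weights.put(key, (w == null ? 0.0 : w) + 1.0);
      }

      // Decay all weights each time step
      public void tick() {
          for (Map.Entry<String, Double> e : weights.entrySet()) {
              e.setValue(e.getValue() * DECAY);
          }
      }

      public double weightOf(String word) {
          Double w = weights.get(word.toLowerCase());
          return w == null ? 0.0 : w;
      }
  }

A decaying weight is just one guess at the kind of behaviour the psychology reading below (hysteresis curves etc.) might suggest.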

Some psychology topics to look into:

  • Cognitive psychology

  • Attentional models

  • Spiking neurons

  • Hysteresis curve

  • Working set

Some communication with Chris, who is doing the RSS component of the project, will be important, as a format for the description of individual words will be needed, e.g. snippets and the context around them. This is why the PDF API is so important.

Requirements need to start being thought about, especially as I have not done any since year 2. Maybe get a book out, as a professional approach will be expected.


Tuesday, 17 February 2009

Project Blog

Welcome to Leigh Darlow's final year project blog. The title of the project is: "Intelligent indexing of large semi-structured data sets."