Friday, 24 April 2009
Versioning
Thursday, 23 April 2009
Lit review progress
Monday, 20 April 2009
Apologies
Just to get you up to speed: the application is pretty much finished. All that is left is to get the application to upload the RSS file and associated HTML files to a webspace, plus commenting of the code and testing.
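As a sketch of what the upload step might look like (assuming the webspace is reachable over plain FTP; the host, credentials and file names here are placeholders, not the real ones):

import java.io.*;
import java.net.URL;
import java.net.URLConnection;

// Rough sketch: push the generated RSS file to a webspace using Java's
// built-in FTP URL handler. All connection details below are placeholders.
public class RssUploader {

    public static void upload(File local, String ftpUrl) throws IOException {
        URLConnection conn = new URL(ftpUrl).openConnection();
        InputStream in = new BufferedInputStream(new FileInputStream(local));
        OutputStream out = new BufferedOutputStream(conn.getOutputStream());
        byte[] buffer = new byte[4096];
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
        out.close();
        in.close();
    }

    public static void main(String[] args) throws IOException {
        // ";type=i" asks for a binary (image) mode transfer
        upload(new File("feed.rss"),
               "ftp://user:password@example.com/public_html/feed.rss;type=i");
    }
}

The same loop would then be run once per associated HTML file.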
The plan for the time being, though, is to start writing the lit review. There was some research done before coding, but not enough, so I have a big task on my hands; it is due in about ten days, as is the output data from the application. I'd like to aim for 1000 words a day on the lit review, which gives plenty of time to neaten it up.
Friday, 10 April 2009
Coding Progress
Coding has started on the main comparison tool.
Monday, 6 April 2009
Computer Death and Progress
However, today good progress has been made. I didn't realise the workload the test harness presented, but it should most definitely be completed by end of play tomorrow: there are now methods for deleting, adding and modifying elements, attributes and words. Hopefully the comparison tool will not take the same amount of time, as the file crawler aspect of the tool has already been written. Regardless, time is very tight; I could be looking at two weeks for both the lit review and data collection to be completed. Luckily there is some work I can salvage from the lit review done before the project was changed.
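As a rough sketch of the shape one of those harness methods might take (assuming the harness manipulates the XML with JDOM, as the rest of the project does; all names here are illustrative):

import java.util.List;
import org.jdom.Document;
import org.jdom.Element;

// Illustrative test-harness mutation: delete the nth element with a given
// name so the comparison tool can be tested against a known change.
public class XmlMutator {

    public static boolean deleteElement(Document doc, String name, int occurrence) {
        return deleteFrom(doc.getRootElement(), name, new int[] { occurrence });
    }

    private static boolean deleteFrom(Element parent, String name, int[] remaining) {
        List children = parent.getChildren(); // live list of child elements
        for (int i = 0; i < children.size(); i++) {
            Element child = (Element) children.get(i);
            if (child.getName().equals(name) && remaining[0]-- == 0) {
                children.remove(i); // removing from the live list detaches it
                return true;
            }
            if (deleteFrom(child, name, remaining)) {
                return true;
            }
        }
        return false;
    }
}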
Wednesday, 1 April 2009
Configuration Of File Manipulator
Tuesday, 31 March 2009
XML Parser
Monday, 30 March 2009
Touching up
Sunday, 29 March 2009
The Crawler
Thursday, 26 March 2009
Carry On Coding
Tuesday, 24 March 2009
Continuation of coding
Monday, 23 March 2009
Coding Commences
Thursday, 19 March 2009
Researching XML Differencing Tools
A number of XML differencing tools were looked at so that the most appropriate tool for the job could be chosen.
The first tool I looked at was stylus studio:
http://www.stylusstudio.com/
The free version (only available for 30 days) doesn't seem to enable the API, and it doesn't look like it will do what I want it to anyway. However, this does look like a very powerful XML manipulation tool.
Then from Google I came across this useful link:
It suggests a number of open source Java XML tools; the XMLUnit Java API looked the most useful, so I began to research this.
http://xmlunit.sourceforge.net/
To run this software you will need:
- JUnit (http://www.junit.org/)
- a JAXP compliant XML SAX and DOM parser (e.g. Apache Xerces)
- a JAXP/TrAX compliant XSLT engine (e.g. Apache Xalan) in your classpath
In the java doc it mentions Diff and DetailedDiff classes:
"Diff and DetailedDiff provide simplified access to DifferenceEngine by implementing the ComparisonController and DifferenceListener interfaces themselves. They cover the two most common use cases for comparing two pieces of XML: checking whether the pieces are different (this is what Diff does) and finding all differences between them (this is what DetailedDiff does)."
This looks like exactly what I am after; I will have to test it first though. At the moment I have just managed to get it installed, so the next stage will be to write a little prototype to see how, and how well, it compares two XML files.
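A minimal sketch of the prototype I have in mind (file names are placeholders; this follows the Diff/DetailedDiff usage described in the documentation above):

import java.io.FileReader;
import java.util.Iterator;
import java.util.List;
import org.custommonkey.xmlunit.DetailedDiff;
import org.custommonkey.xmlunit.Diff;
import org.custommonkey.xmlunit.XMLUnit;

// Compare two XML files and list every individual difference found.
public class CompareXml {

    public static void main(String[] args) throws Exception {
        XMLUnit.setIgnoreWhitespace(true); // don't flag formatting-only changes

        DetailedDiff diff = new DetailedDiff(
                new Diff(new FileReader("old.xml"), new FileReader("new.xml")));
        System.out.println("Identical: " + diff.identical());
        System.out.println("Similar:   " + diff.similar());

        List differences = diff.getAllDifferences();
        for (Iterator it = differences.iterator(); it.hasNext();) {
            System.out.println(it.next());
        }
    }
}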
Another XML manipulation tool which Mike mentioned was XMLSpy; I will look into this also.
I had some thoughts on the actual file comparison process too. I was wondering whether I really need two copies of the whole document set in order to work out if they have changed; surely this is highly inefficient? Is there another way? Also, I thought that instead of trawling through and comparing every document, I could keep a table of hash values generated from each document and compare those; would this be more efficient, at least to identify which ones have changed? Obviously this does carry the tiny risk of two different documents generating exactly the same hash value. Just a couple of points to think about.
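A minimal sketch of the hash idea (using SHA-1 via java.security.MessageDigest; the path handling is illustrative):

import java.io.FileInputStream;
import java.io.InputStream;
import java.security.MessageDigest;

// Fingerprint a document so that only files whose fingerprint has changed
// need a full XML comparison on the next crawl.
public class DocumentFingerprint {

    public static String hash(String path) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        InputStream in = new FileInputStream(path);
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            md.update(buffer, 0, read);
        }
        in.close();

        // hex-encode the 20-byte digest
        byte[] digest = md.digest();
        StringBuffer hex = new StringBuffer();
        for (int i = 0; i < digest.length; i++) {
            hex.append(Integer.toHexString((digest[i] & 0xff) | 0x100).substring(1));
        }
        return hex.toString();
    }
}

A stored table of path-to-hash pairs from the previous crawl could then be checked against freshly computed hashes; with a 160-bit digest, an accidental collision between two different documents is vanishingly unlikely.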
Wednesday, 18 March 2009
Meeting With Mike
Day Fifteen
Friday, 13 March 2009
Day Thirteen/Fourteen
Tuesday, 10 March 2009
Day Twelve
Today I drew up a template of points to evaluate each academic source against:
Source Title
Source Topic
Author info (Discipline / Credentials)
Research Question(s)
Methodology
Result(s)
Relation to topic
Strength(s) / Weakness
Potential bias
Contextual Grounding (Location, Date etc.)
References
As well as this, review of further academic documents continued. I feel I'm a bit behind on this; however, I have been making progress with the implementation of the system, so it is fairly justified. Still, 100% focus will be needed on this until the end of the week, as I have covered nowhere near enough academic documents.
Monday, 9 March 2009
Day Eleven
Friday, 6 March 2009
Day Ten
Wednesday, 4 March 2009
Day Nine
The lit review research continued today. I focused on the two psychology papers, as the main focus of this project will be the attentional model. I actually found these papers interesting, having never studied psychology before.
The first paper is:
http://www.infosci-online.com/downloadPDF/encyclopedias/IGR1854_G8WXb459Z3.pdf
The first paper was a good introduction to attentional theories; it explained that attention is “the processes by which we select information”.
So in my project the idea will be to select relevant words as a human would; this will involve elements of AI. Unfortunately the main focus of the paper in relation to computational attention was GUIs, but I did take a lot from this paper.
The second paper went a lot more in depth into how the brain makes decisions:
http://www.infosci-online.com/downloadPDF/pdf/ITJ4450_74ZWULUIHI.pdf
It went into mathematical models of the brain. I'm not sure that I fully understood it, but again it was useful; I had to familiarise myself with things like unions again!
It explained how different people have gone about modelling the brain. A couple of the approaches were:
That the brain is layered and each layer has different characteristics (an LRMB, or Layered Reference Model of the Brain); they explained that the problem with this was that it was hard to determine how many layers the model needed.
Another approach, which they seemed to favour, was that the brain is a network (an OAR, or Object-Attribute-Relation model); there were some useful directed graphs which may be very beneficial when it comes to designing the attentional model.
Tomorrow's objectives will be to get a prototype of Java web services working and to do more work on the lit review.
Tuesday, 3 March 2009
Day Eight
Today I carried on with the lit review. This involved researching what a lit review should actually look like, as well as the evaluation of academic documents. Also, as I spent the day in the library, I discovered a few other documents which may be of use.
I started by reading an article on indexing:
http://www.iiit.net/~pkreddy/wdm03/wdm/vldb01.pdf
It suggested a method of indexing which could be up to ten times faster done their way, introducing a process called XISS; this relates not just to XML indexing but also to the storage of the data. The paper also goes on to explain indexing algorithms. This is a paper I have skimmed over previously, but as I have started the lit review properly now I chose to take notes on it. I found some very interesting ideas, however I found it very hard going, so I've only been through half of the paper; it should probably be revisited. Although the indexing method in this project isn't actually that important (a fairly novel approach should be used), this paper does give some good insight.

The most important aspect of the project will be the attentional models, so as I found the indexing paper so hard going I moved on to the psychological aspect of the project and have come across two papers which look very useful. Tomorrow I shall read through the rest of these making notes, and hopefully it will give me an idea of how to approach the attentional model, because at the moment I have no idea what one actually is!
The two papers are:
http://www.infosci-online.com/downloadPDF/encyclopedias/IGR1854_G8WXb459Z3.pdf
http://www.infosci-online.com/downloadPDF/pdf/ITJ4450_74ZWULUIHI.pdf
Monday, 2 March 2009
Day Seven
Completed the requirements specification, which I'm fairly pleased with as it looks fairly professional and it's been a while since I've done requirements. There may be some refinements needed at a later date as I research further into the project. The use case diagram may need revisiting, and some functional requirements may be better off migrating to an "Environmental" requirements section.
Next week's objectives will be to do as much of the lit review as possible, to start looking at Java web services and how RSS feeds work, and maybe also to start a project plan.
Friday, 27 February 2009
Day Six
Thursday, 26 February 2009
Day Five
Tuesday, 24 February 2009
Day Four
The first half of the day involved investigating XML manipulation APIs; the general view was that JDOM was the richest. I had issues installing it, but yet again realised that it belonged in Java's \ext folder!
The API took a while to get familiar with, as I have more experience using C#; however, I found this example from IBM which was a huge help:
http://www.ibm.com/developerworks/library/x-injava2/
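For reference, a minimal sketch along the lines of that example (the file name is a placeholder, and the "word" element name follows the PDF API's output described previously):

import java.io.File;
import java.util.Iterator;
import java.util.List;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.input.SAXBuilder;

// Parse the XML produced from a PDF and print the text of every "word" element.
public class JdomPrototype {

    public static void main(String[] args) throws Exception {
        Document doc = new SAXBuilder().build(new File("pdfoutput.xml"));
        printWords(doc.getRootElement());
    }

    private static void printWords(Element element) {
        if (element.getName().equals("word")) {
            System.out.println(element.getTextTrim());
        }
        List children = element.getChildren();
        for (Iterator it = children.iterator(); it.hasNext();) {
            printWords((Element) it.next());
        }
    }
}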
Then I was getting invalid XML errors which took a while to work out. It turns out that the API was putting certain closing tags in the wrong places. Upon further investigation it turned out to be the "-font" parameter on the command line causing it; conveniently the tags were redundant anyway, so I now had valid XML.
This was the error:
Error on line 9 of document file:///C:/pdfoutput.xml: The element type "timesnewromanps-boldmt" must be terminated by the matching end-tag "</timesnewromanps-boldmt>".
Then I was faced with another problem, the program was reporting this exception:
Could not check
because Invalid byte 1 of 1-byte UTF-8 sequence.
It turned out the XML file was now not in UTF-8. To solve this I remembered a text editor which I used on placement called UltraEdit. I opened the file up in UltraEdit and re-saved it with UTF-8 encoding, and it worked fine. So I now have a small XML parsing prototype.
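The same re-save could also be done in code, assuming the file's actual encoding is known (Windows-1252 is only a guess here):

import java.io.*;

// Re-encode a file to UTF-8: read it in its actual encoding (assumed to be
// Windows-1252 here) and write it back out as UTF-8.
public class ReencodeToUtf8 {

    public static void main(String[] args) throws IOException {
        Reader in = new InputStreamReader(
                new FileInputStream("pdfoutput.xml"), "windows-1252");
        Writer out = new OutputStreamWriter(
                new FileOutputStream("pdfoutput-utf8.xml"), "UTF-8");
        char[] buffer = new char[4096];
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
        out.close();
        in.close();
    }
}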
The second half of the day was spent going through and making notes on the academic documents highlighted in the project proposal:
http://www.iiit.net/~pkreddy/wdm03/wdm/vldb01.pdf
http://books.google.co.uk/books?id=Rdwv-r5RlOcC&pg=PA88&dq=Attentional+models&lr=
http://books.google.co.uk/books?id=RZ-6cTL8YZsC&printsec=frontcover&dq=web+service&lr=
http://books.google.co.uk/books?id=hXTfWDkqnlkC&printsec=frontcover&dq=xml+indexing&lr=
So the lit review has begun!
Monday, 23 February 2009
Day Three
Today I added Chris Campbell on Facebook; he's doing the second part of the project, so communication will be helpful. We've had some discussions about the data-exchange protocol over Facebook.
I experimented with programming with the PDF API and created a little prototype. I'm sure that I am missing some important methods which are there; I may need to decompile the source code, which should be fine as it is open source, but I'll have to look into it.
It took me a long time to work out how to add extensions or additional APIs in Java; of course it was obvious, in the \ext folder!
For example:
C:\Program Files\Java\jre1.6.0_07\lib\ext
After this I carried on with my re-learning of the requirements topic and found a useful website from which I have taken over a page's worth of notes:
http://www.stsc.hill.af.mil/crosstalk/2008/03/0803Gottesdiener.html
I now think I know how to approach this area; however, I'm not sure if what I have been looking at is a little too high level.
I sent an e-mail to Mike asking about the relationship between requirements and the lit review; he said that they do go hand in hand and are very much an iterative process. Soon I'll start with a small lit review and then go from there.
On my travels I also came across problem frames:
http://en.wikipedia.org/wiki/Problem_Frames_Approach
This is a requirements approach worth thinking about.
Also I came across requirements templates:
http://www.jiludwig.com/Template_Guidance.html#checklist
These again are maybe a little high level, but they give an indication of the work needed. I am a little worried about this Friday deadline for the requirements, but I'll just have to get stuck in!
Sunday, 22 February 2009
Day Two
In the morning I did a little experiment to see how other domains represent text documents with XML. Mike had mentioned that there is a way of looking at how OpenOffice “.odt” files are represented in XML using Windows. The process was:
Create an .odt file
Rename its extension to “.zip”
Extract the file
Look at “Content.xml”
The resulting file looked like this:
http://www.megaupload.com/?d=1VZLOA04
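The same inspection could also be done programmatically; an .odt file is just an ordinary zip archive, so the XML can be read straight out of it (the file name is a placeholder):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Read content.xml straight out of an .odt file, which is a plain zip archive.
public class OdtContentReader {

    public static void main(String[] args) throws Exception {
        ZipFile odt = new ZipFile("example.odt");
        ZipEntry content = odt.getEntry("content.xml");
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(odt.getInputStream(content), "UTF-8"));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        reader.close();
        odt.close();
    }
}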
Although the “Content.xml” file may be a lot easier to read, I came to the conclusion that there were enough similarities with the XML file created from the PDF API that either format could be used as an example data-set for this problem. When the PDF API was revisited it became apparent that there were some extra command line parameters which could produce extra useful elements in the XML; for example, there were some which would include hyperlinks in the XML.
As reading the XML was quite hard using a simple document viewer, I discovered a specialist viewer called “XML Marker”; this helped me see the thinking behind how the PDF API represents a PDF file in XML. Before I had this tool I did not realise that there were “line” tags for blank lines, and also that “word” tags between “text” tags meant that they were on the same line.
So this helped me reach the conclusion that the PDF API would be sufficient for the creation of the data set used for this project, at least until the IBM issues are or aren't resolved.
Next I tried to find documentation which was large enough and presented how I'd imagine the IBM documentation could possibly be presented. I found this link:
www.reportlab.com/docs/PyRXP_Documentation.pdf
I processed it to see how the tool would handle it; it was fine. Please see the resulting XML at:
http://www.megaupload.com/?d=FKARW4FV
It's important to remember to make the program very flexible to changes of data-set, as it may be required that the IBM one is substituted in for this one created by the PDF API. I know this is a principle of programming, but it is just a reminder to carry forward.
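As a reminder of the shape that flexibility might take, something like this small interface would keep the tool independent of any particular data-set (all names here are purely illustrative):

import java.io.File;
import java.util.List;

// Illustrative abstraction: the PDF-derived data-set and any future IBM
// data-set would simply be different implementations of this interface.
public interface DocumentSet {

    // All XML documents currently in this data-set (a List of File objects).
    List getDocuments();

    // A stable identifier for a document, used to pair old and new versions.
    String idFor(File document);
}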
The second half of the day involved a trip to the library to get some books on requirements, as I have not done any kind of requirements gathering since year two. The books I found were:
Effective Requirements Practices – Ralph R. Young
Software Requirements and Specifications – Michael Jackson
(Add to Bibliography)
The remainder of the day was spent evaluating approaches to requirements.