Friday, 24 April 2009

Versioning

Versioning has been added to the lit review, and the different approaches to integrating IBM's ClearCase with the comparison tool have been investigated. This was very interesting: there were two options, one to use triggers to launch the application and pass in the two files, and another to use the ClearCase Automation Library (CAL) COM object with Java. Rick Dunn and Dean Godfrey were useful resources to tap into as they both have previous experience in this field. I am now up to 2000 rough words. Good progress!
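For my own notes, the CAL route might look roughly like the sketch below. It assumes the JACOB COM bridge is on the classpath and that the CAL ClearTool object (ProgID "ClearCase.ClearTool") exposes its CmdExec method as documented; both are assumptions I still need to verify, so this is a sketch rather than working code.

import com.jacob.activeX.ActiveXComponent;
import com.jacob.com.Dispatch;
import com.jacob.com.Variant;

public class CalSketch {
    public static void main(String[] args) {
        // Connect to the CAL ClearTool COM object (ProgID assumed).
        ActiveXComponent clearTool = new ActiveXComponent("ClearCase.ClearTool");
        // CmdExec runs a cleartool command; here it lists the version tree of
        // the file passed on the command line.
        Variant output = Dispatch.call(clearTool, "CmdExec", "lsvtree " + args[0]);
        System.out.println(output.getString());
    }
}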

Thursday, 23 April 2009

Lit review progress

The lit review is progressing well. There is now a rough draft of the initial section of the review (around 1000 words!). It's nice to get the ball rolling having spent many hours programming. Within the section I try to justify using a stand-alone test harness to test the application, so some articles on automated testing have been reviewed, and I have also drawn on previous experience of writing test scripts at Honda. I actually quite enjoyed writing this section, and previous blog posts look to be very useful for writing the lit review. There are many loose ends which will need smoothing out before the initial hand-in at the end of the month.

Monday, 20 April 2009

Apologies

Apologies for the lack of activity on the blog recently; I have been home for Easter and have only had limited internet access.

Just to get you up to speed, the application is pretty much finished; all that is left is to get it to upload the RSS file and associated HTML files to a webspace, plus commenting of the code and testing.

The plan for the time being, though, is to start writing the lit review. Some research was done before coding but not enough, so I have a big task on my hands: it is due in about ten days, along with the output data from the application. I'd like to aim for 1000 words a day on the lit review, which gives plenty of time to neaten it up.

Friday, 10 April 2009

Coding Progress

The simulator is now complete. I learnt some useful new skills, like using the Timer class and how to export a DOM object as XML in Java; I can see these being very beneficial in the future. The test harness is fully configurable for robustness.
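For my own reference, exporting a DOM object as XML boiled down to something like the sketch below using the standard javax.xml.transform classes (class and method names here are just illustrative):

import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class DomExporter {
    // Serialises an in-memory DOM document to an XML file on disk.
    public static void export(Document doc, File target) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.transform(new DOMSource(doc), new StreamResult(target));
    }
}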

Coding has started on the main comparison tool.

Monday, 6 April 2009

Computer Death and Progress

On Thursday evening my computer died, so it's taken a while to recover data etc., which has been a bit of a setback.

However, good progress has been made today. I didn't realise the workload the test harness presented, but it should most definitely be completed by end of play tomorrow; there are now methods for deleting elements, attributes and words, as well as methods for modifying them. Hopefully the comparison tool will not take the same amount of time, as the file crawler aspect of the tool has already been written. Regardless, time is very tight: I could be looking at two weeks for both the lit review and data collection to be completed. Luckily there is some work I can salvage from the lit review written before the project was changed.

Wednesday, 1 April 2009

Configuration Of File Manipulator

The config file below was designed for the modifying of files and data for the XML simulator application (I'm thinking some documentation is going to be needed for all of these configuration files!):

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<config>
<rootDirectory value="C:\Documents and Settings\Leigh\My Documents\Dissertation\Code\XMLSimulation\files\generatedXML"></rootDirectory>
<delNoOfElements value="1"></delNoOfElements>
<delNoOfAttributes value="1"></delNoOfAttributes>
<delNoOfWords value="1"></delNoOfWords>
<delNoOfFiles value="1"></delNoOfFiles>
<modNoOfElements value="1"></modNoOfElements>
<modNoOfAttributes value="1"></modNoOfAttributes>
<modNoOfWords value="1"></modNoOfWords>
</config>

A class was also written to load this into a hash table. The next stage is to finally start manipulating the XML data at certain intervals; luckily the file crawler is already written, so after this stage is completed it will be time to start properly on the file comparison tool. I'll work all through the weekend to make sure as much of this is done as possible. A meeting with Mike may be necessary on Friday as a lot of progress has been made.
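Roughly, the loader class works along these lines: parse the config with the standard DOM parser and drop each element's "value" attribute into a hash table keyed by element name (a simplified sketch, not the exact code):

import java.io.File;
import java.util.HashMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class ConfigLoader {
    // Reads <config> and maps each child element name to its "value" attribute.
    public static HashMap<String, String> load(File configFile) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(configFile);
        HashMap<String, String> settings = new HashMap<String, String>();
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node node = children.item(i);
            if (node.getNodeType() == Node.ELEMENT_NODE) {
                Element element = (Element) node;
                settings.put(element.getTagName(), element.getAttribute("value"));
            }
        }
        return settings;
    }
}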

Tuesday, 31 March 2009

XML Parser

I now have an XML parser; this will loop through a specified XML file and return a list of elements. This is useful for reading in the configuration file for the simulator as well as for the XML manipulation which will happen within the simulator. The generator is working beautifully. I'd like it to have random depths of folders, but this looks like it could be quite taxing and is not a priority at the moment. Also, the code is not the prettiest and there are currently no comments, but these are all things which can be refactored at the end; the #1 priority is getting the system working. I know it may be ambitious, but I'd like to have a fully working system by the end of the week, which will mean I have three weeks to complete some of the other objectives of the project.
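In essence the parser does something along these lines (a simplified sketch rather than the exact class):

import java.io.File;
import java.util.ArrayList;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class ElementLister {
    // Returns the names of every element in the given XML file, in document order.
    public static ArrayList<String> listElements(File xmlFile) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(xmlFile);
        NodeList elements = doc.getElementsByTagName("*");
        ArrayList<String> names = new ArrayList<String>();
        for (int i = 0; i < elements.getLength(); i++) {
            names.add(elements.item(i).getNodeName());
        }
        return names;
    }
}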

Monday, 30 March 2009

Touching up

The XML generator now does exactly what it was specified to do. This took a long time to work out as there is quite complicated logic behind it, for example nested loops. The next stage will be to go through and randomly delete attributes and elements of random XML files.
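The random deletion step should look roughly like the sketch below: pick one element at random from a parsed file and remove it from its parent (illustrative only; the real version will also cover attributes and words):

import java.util.Random;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class RandomDeleter {
    private static final Random random = new Random();

    // Removes one randomly chosen element (other than the root) from the document.
    public static void deleteRandomElement(Document doc) {
        NodeList elements = doc.getElementsByTagName("*");
        if (elements.getLength() <= 1) {
            return; // only the root element left, nothing safe to delete
        }
        // Index 0 is the root element, so pick from index 1 upwards.
        Node victim = elements.item(1 + random.nextInt(elements.getLength() - 1));
        victim.getParentNode().removeChild(victim);
    }
}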

Sunday, 29 March 2009

The Crawler

A crawler application has now been written. This is essential for detecting changes, as the whole file structure of the documentation will need to be crawled through to detect them. The crawler has methods which will return all files within a specified root directory, all directories within a specified root directory, and all files and directories within a specified root directory; these are returned as an ArrayList of strings. This could also be useful for finishing off the random directory and XML generation application.
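The core of the crawler is essentially a recursive walk like this (simplified; the real class also has the directory-only and combined variants):

import java.io.File;
import java.util.ArrayList;

public class FileCrawler {
    // Returns the paths of all files beneath the given root directory.
    public static ArrayList<String> getAllFiles(File rootDirectory) {
        ArrayList<String> paths = new ArrayList<String>();
        File[] entries = rootDirectory.listFiles();
        if (entries == null) {
            return paths; // not a directory or not readable
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                paths.addAll(getAllFiles(entry)); // recurse into subdirectories
            } else {
                paths.add(entry.getAbsolutePath());
            }
        }
        return paths;
    }
}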

Certain settings will need to be specified in the final configuration file to determine characteristics of this crawler application, such as how often it iterates through the files.
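The "how often" setting could be driven by the standard java.util.Timer; a minimal sketch of what I have in mind (class and method names are just placeholders):

import java.util.Timer;
import java.util.TimerTask;

public class CrawlScheduler {
    // Runs the crawl repeatedly at a fixed rate, intervalMillis apart.
    public static void start(final Runnable crawl, long intervalMillis) {
        new Timer().scheduleAtFixedRate(new TimerTask() {
            public void run() {
                crawl.run();
            }
        }, 0, intervalMillis);
    }
}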

Thursday, 26 March 2009

Carry On Coding

The first part of the test harness is being left for the time being, as there are some problems with the logic at the moment. The harness actually does what it needs to do as it is: generate a number of random, randomly nested directories with random XML files containing random data. There are problems with creating random spreads of files at higher levels, but as it stands it does a good enough job of creating a random file structure representative of that at IBM. The next stage will be to write a file crawler.

Tuesday, 24 March 2009

Continuation of coding

Good progress was made with the coding today.
The application which will be a test harness to simulate a user changing XML has been the main focus. Mike's code has been built upon. It will now generate a specified number of directories with random names and put randomly named XML files with random data within them. There have been a number of changes to the config file, which now looks like this (a quick sketch of loading this format follows the listing):

wordsFile files/words.txt
elementsFile elements
attributesFile attributes
dataFile data
rootElementsSet root
numberOfElements 100
elementSetSize 10
attributeSetSize 10
dataSetSize 100
maximumNumberOfAttributes 5
maximumNumberOfChildren 10
maximumNumberOfDataWords 20
mixedContentAllowed no
maxDepth 3
minDepth 2
topSpread 20
spreadOtherLevels 5
maxNumberOfFiles 20
minNumberOfFiles 3
fileNames files/fileNames.txt
XMLfileNames files/XMLfileNames.txt
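Loading this space-separated key/value format can be done with java.util.Properties, roughly as follows (class and file names are illustrative):

import java.io.FileInputStream;
import java.util.Properties;

public class HarnessConfig {
    // Loads "key value" pairs; Properties accepts whitespace as the separator.
    public static Properties load(String path) throws Exception {
        Properties settings = new Properties();
        FileInputStream in = new FileInputStream(path);
        try {
            settings.load(in);
        } finally {
            in.close();
        }
        return settings;
    }
}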

Tomorrow's objective will be to finish the generation of XML files and refactor some of the code.

Monday, 23 March 2009

Coding Commences

Both aspects of the project were born today. 

Today the XML differencing tool was started. At the end of play the code currently compares two XML files and outputs the results to a text file.

Also the XML Manipulation tool was started, this took up the majority of the day. The main issues I had were: 

Working out how to use Eclipse's workspaces, and importing and exporting source code and jar files
Understanding the workings of Mike's code; there are various supplementary files to the source code which need understanding.

Today has provided a foundation to build upon. The next stage of the manipulation tool will be to create a whole directory structure of XML files and for the differencing tool a lot will need to be thought about as to how the identified differences between the XML files will need to be output.

Thursday, 19 March 2009

Researching XML Differencing Tools

As the newly amended project will involve comparison of XML data, a number of XML comparison applications and APIs will need to be looked at so that the most appropriate tool for the job is chosen.

The first tool I looked at was Stylus Studio:

http://www.stylusstudio.com/

The free version (only available for 30 days) doesn't seem to include the API, and it doesn't look like it will do what I want anyway; however, this does look like a very powerful XML manipulation tool.

Then from Google I came across this useful link:

http://www.roseindia.net/opensource/xmldiff.php

It suggests a number of open-source Java XML tools; the XMLUnit Java API looked the most useful, so I began to research it.

http://xmlunit.sourceforge.net/

To run this software you will need the following in your classpath:

JUnit (http://www.junit.org/)
a JAXP compliant XML SAX and DOM parser (e.g. Apache Xerces)
a JAXP/TrAX compliant XSLT engine (e.g. Apache Xalan)

In the java doc it mentions Diff and DetailedDiff classes:
"Diff and DetailedDiff provide simplified access to DifferenceEngine by implementing the ComparisonController and DifferenceListener interfaces themselves. They cover the two most common use cases for comparing two pieces of XML: checking whether the pieces are different (this is what Diff does) and finding all differences between them (this is what DetailedDiff does)."

This looks like exactly what I am after, but I will have to test it first. At the moment I have just managed to get it installed, so the next stage will be to write a little prototype to see how it compares two XML files and how well it does so.
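The prototype I have in mind is along these lines, sketched from the XMLUnit javadoc and not yet run:

import java.io.FileReader;
import org.custommonkey.xmlunit.DetailedDiff;
import org.custommonkey.xmlunit.Diff;

public class XmlUnitPrototype {
    public static void main(String[] args) throws Exception {
        // Compare the two XML files passed on the command line.
        DetailedDiff diff = new DetailedDiff(
                new Diff(new FileReader(args[0]), new FileReader(args[1])));
        System.out.println("Identical: " + diff.identical());
        System.out.println("Similar:   " + diff.similar());
        // DetailedDiff lists every individual difference found.
        for (Object difference : diff.getAllDifferences()) {
            System.out.println(difference);
        }
    }
}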

Another XML manipulation tool which Mike mentioned was XMLSpy; I will look into this also.

I had some thoughts on the actual file comparison process too. I was wondering whether I really need two copies of the whole document set in order to work out if documents have changed; surely this is highly inefficient? Is there another way? I also thought that, if this were the case, instead of trawling through and comparing every document, I could keep a table of hash values generated from each document and compare those; would this be more efficient, at least for identifying which ones have changed? Obviously this does carry the tiny risk of the same hash value being generated for two different documents. Just a couple of points to think about.
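Roughly, the hash idea would mean computing a digest per file on each crawl and comparing it with the stored value from the previous crawl; a minimal sketch (using MD5 purely as an example):

import java.io.FileInputStream;
import java.io.InputStream;
import java.security.MessageDigest;

public class FileHasher {
    // Returns a hex digest of a file's contents; compare against the stored
    // digest from the previous crawl to spot files that have changed.
    public static String hash(String path) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("MD5");
        InputStream in = new FileInputStream(path);
        try {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        } finally {
            in.close();
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}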

Wednesday, 18 March 2009

Meeting With Mike

Today I received quite major news regarding the project. The IP rights issues are progressing; however, aspects of the project have changed. The indexing part has pretty much been scrapped, but the data mining part is still there. The idea now is to routinely crawl through the help file document set, identify which bits have changed since the last crawl, and output the changes in an RSS feed. This means that the direction of the project has changed: various differencing applications will need to be investigated, filters for input and output data will need to be looked at, and versioning and how to measure the amount a document has changed will need to be thought about (maybe some weighted tolerances in the config file). Mike has provided me with a junk XML creator; the first stage will be to get this generating junk XML and deleting random elements and attributes in the XML.

Day Fifteen

Today I built upon the research already done. I investigated some aspects of clustering; I found tree vs binary clustering particularly interesting.

I also came across a document showing that IBM has already done something similar to this project, but for text mining; the methods and techniques used were very enlightening.

I have a meeting with Mike tomorrow regarding IBM.

Friday, 13 March 2009

Day Thirteen/Fourteen

I'm still well into my lit review. I've moved slightly away from the psychology aspect of the project and I'm now looking more into data mining techniques and agents. These also cover some cognitive models. I still have a lot more to research; as I go I've been building up a list of terms to Google:

neuropsychology
McCrickard et al., 2003c
Horvitz et al.,2003
(Kahneman 1973, Posner & Boies 1971):
Goldman-Rakic (1988)
data mining + intelligence
Search engine indexing
data caching in search engines
NGRAM - used with indexing
hash tables
data mining xml
Data mining clusters
data mining book in library - Data mining: concepts and techniques 

This NGRAM approach looks very useful for pattern matching. There was also an article on Weka, a Java API for data mining, which may be useful. There's still a lot to do, but I'd like to have the lit review at least started, as well as the design and another set of requirements done, before I see Mike on Tuesday.

Tuesday, 10 March 2009

Day Twelve

Today I started "stream-lining" my lit review. As it was, I had lots of chunks of academic documents in a Word document, which is not going to be all that useful when it comes to the write-up. Ricky Dunn suggested I use his template, which looks like it will be very useful. It's a nice formal way of structuring the research and has got me thinking about document comparisons etc. The template looks like this:

Source Title
Source Topic
Author info (Discipline / Credentials)
Research Question(s)
Methodology
Result(s)
Relation to topic
Strength(s) / Weakness
Potential bias
Contextual Grounding (Location, Date etc.)
References

As well as this, the review of further academic documents continued. I feel I'm a bit behind on this; however, I have been making progress with the implementation of the system, so it is fairly justified. Even so, 100% focus will be needed on this until the end of the week as I have covered nowhere near enough academic documents.

 

Monday, 9 March 2009

Day Eleven

As I'm not that familiar with RSS (although I should be!), I took the time to see how it should work. I found a couple of useful websites which showed me how it should be done:


From this I realised that it works by uploading a single file which an RSS reader points to and reads; the XML items within it point at various points on the web page, which in my case will be the help documentation.

I'm now not sure whether my research on GlassFish the other day was relevant; however, it may still be needed to upload files to the server. I understand that the file on the server is updated and this in turn filters down to the RSS feeds. I'm still not entirely clued up on this, but I'm making progress.

A link previously referenced, the one which described how to create an RSS feed in Java, was investigated today. I managed to get this working and uploaded the file it created to some webspace. I then downloaded FeedReader, pointed it at the file, and it worked!! This will be a good foundation for my feed, though bits will need modifying. I'm not sure what implications using code off of the net will have; I will have to speak to Mike about it.


Friday, 6 March 2009

Day Ten

Today I've been looking at how this XML feed is going to work. It was quite interesting actually, as I have never written a component of a distributed system before. Everything I looked at pointed towards Java GlassFish, which seems to provide Java support for web services. The link I downloaded it from is below:


The installation of it was a nightmare!! I had to run lots of command line statements, some of which failed. The thing which messed me around most was that it kept saying I didn't have a JDK installed; it turned out that the "\bin" part of the path was not needed in the environment variable.

I also downloaded a massive samples directory which will be of use when I get going on this properly as I have no idea how to use web services.

A useful link I found was:


These will be useful for the format of the feed.

Also I found useful information on the coding of the feed from:


I copied and pasted the package imports into Eclipse to see if the GlassFish installation was what I need and if it was successful; all packages imported successfully.

import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Characters;
import javax.xml.stream.events.EndElement;
import javax.xml.stream.events.StartDocument;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;
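Based on those imports, the skeleton of writing the feed with StAX might look roughly like this (element values are placeholders and this is untested):

import java.io.FileOutputStream;
import java.util.Collections;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLOutputFactory;

public class RssSkeleton {
    public static void main(String[] args) throws Exception {
        XMLEventFactory events = XMLEventFactory.newInstance();
        XMLEventWriter writer = XMLOutputFactory.newInstance()
                .createXMLEventWriter(new FileOutputStream("feed.xml"), "UTF-8");
        writer.add(events.createStartDocument("UTF-8", "1.0"));
        // <rss version="2.0">, with the attribute supplied on the start element
        writer.add(events.createStartElement("", "", "rss",
                Collections.singletonList(events.createAttribute("version", "2.0")).iterator(),
                null));
        writer.add(events.createStartElement("", "", "channel"));
        writer.add(events.createStartElement("", "", "title"));
        writer.add(events.createCharacters("Changed help documents"));
        writer.add(events.createEndElement("", "", "title"));
        // ...one <item> element per changed document would be written here...
        writer.add(events.createEndElement("", "", "channel"));
        writer.add(events.createEndElement("", "", "rss"));
        writer.add(events.createEndDocument());
        writer.close();
    }
}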





Wednesday, 4 March 2009

Day Nine

The lit review research continued today. I focused on the two psychology papers, as the main focus of this project will be the attentional model. I actually found these papers interesting, having never studied psychology before.

The first paper is:

http://www.infosci-online.com/downloadPDF/encyclopedias/IGR1854_G8WXb459Z3.pdf

The first paper was a good introduction to attentional theories; it explained that attention is "the processes by which we select information".

So in my project the idea will be to select relevant words as a human would; this will involve elements of AI. Unfortunately the main focus of the paper in relation to computational attention was GUIs; however, I did take a lot from this paper.

The second paper went a lot more in depth into how the brain makes decisions:

http://www.infosci-online.com/downloadPDF/pdf/ITJ4450_74ZWULUIHI.pdf

It went into mathematical models of the brain. I'm not sure that I fully understood it, but again it was useful; I had to familiarise myself with things like unions again!

It explained how different people have gone about modelling the brain; a couple of the approaches were:

One is that the brain is layered and each layer has different characteristics; the problem they explained with this was that it is hard to determine how many layers the model needs - an LRMB.

Another approach, which they seemed to favour, was that the brain is a network - an OAR. There were some useful directed graphs which may be very beneficial when it comes to designing the attentional model.

Tomorrow's objectives will be to get a prototype of Java web services working and to do more work on the lit review.


Tuesday, 3 March 2009

Day Eight

Today I carried on with the lit review. This involved researching what a lit review should actually look like, as well as the evaluation of academic documents. Also, as I spent the day in the library, I discovered a few other documents which may be of use.

I started by reading an article on indexing:

http://www.iiit.net/~pkreddy/wdm03/wdm/vldb01.pdf

It suggested a method of indexing which could be up to ten times faster done their way, introducing a process they call XISS; this relates not just to XML indexing but also to the storage of the data. The paper also goes on to explain indexing algorithms. This is a paper I have skimmed over previously, but as I have now started the lit review properly I chose to take notes on it. I found some very interesting ideas; however, I found it very hard going, so I've only been through half of the paper and it should probably be revisited. Although the indexing method in this project isn't actually that important (a fairly novel approach should be used), this paper does give some good insight. The most important aspect of the project will be the attentional models, so as I found the indexing paper so hard going I moved on to the psychological aspect of the project and came across two papers which look very useful. Tomorrow I shall read through the rest of these making notes, and hopefully they will give me an idea of how to approach the attentional model, because at the moment I have no idea what one actually is!

The two papers are:

http://www.infosci-online.com/downloadPDF/encyclopedias/IGR1854_G8WXb459Z3.pdf

http://www.infosci-online.com/downloadPDF/pdf/ITJ4450_74ZWULUIHI.pdf


Monday, 2 March 2009

Day Seven

Completed the requirements specification, which I'm fairly pleased with as it looks fairly professional and it's been a while since I've done requirements. There may be some refinements needed at a later date as I research further into the project. The use case diagram may need revisiting, and some functional requirements may be better off migrating to an "Environmental" requirements section.

Next week's objectives will be to do as much of the lit review as possible, start to look at Java web services and how RSS feeds work, and also maybe start a project plan.



Friday, 27 February 2009

Day Six

Today the functional requirements were completed; they look brief, but they were very time consuming.

I created some use cases to help derive some requirements; these were done in StarUML. I had to re-learn use cases, but they look correct to me, and in any case they served their purpose of generating some requirements.

I had thoughts of creating various other diagrams such as state machines and data flow diagrams; however, this requirements document is by no means the final version. As the project evolves there will be new requirements and a better understanding of the project as a whole, by which point it may be necessary to define some higher level requirements.

The objective by the end of the next day is to complete the functional requirements in the document and tie up the loose ends.

Thursday, 26 February 2009

Day Five

This was spent doing the "context" part of the requirements document.
The objective by the end of the next day will be to have all functional requirements defined.

Tuesday, 24 February 2009

Day Four

The first half of the day involved investigating XML manipulation APIs; the general view was that JDOM was the richest. I had issues installing it, but yet again realised that it belonged in Java's \ext folder!

The API took a while to get familiar with as I have more experience using C#; however, I found this example from IBM which was a huge help:

http://www.ibm.com/developerworks/library/x-injava2/

Then I was getting invalid XML errors which took a while to work out. It turned out that the API was putting certain closing tags in the wrong places. Upon further investigation it turned out to be the "-font" parameter on the command line causing it; conveniently the tags were redundant anyway, so I now had valid XML.

This was the error:

Error on line 9 of document file:///C:/pdfoutput.xml: The element type "timesnewromanps-boldmt" must be terminated by the matching end-tag "</timesnewromanps-boldmt>"

Then I was faced with another problem, the program was reporting this exception:

Could not check

because Invalid byte 1 of 1-byte UTF-8 sequence.

It turned out the XML file was now not in UTF-8. To solve this I remembered a text editor which I used on placement called UltraEdit; I opened the file up in UltraEdit and re-saved it with UTF-8 encoding, and it worked fine. So I now have a small XML parsing prototype.
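For the record, the small prototype is roughly along these lines with JDOM (a simplified sketch; the file path is the one from the error above):

import java.io.File;
import java.util.List;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.input.SAXBuilder;

public class JdomPrototype {
    public static void main(String[] args) throws Exception {
        // Parse the (now valid, UTF-8 encoded) XML produced from the PDF.
        Document doc = new SAXBuilder().build(new File("C:/pdfoutput.xml"));
        Element root = doc.getRootElement();
        System.out.println("Root element: " + root.getName());
        // List the names of the root's direct children.
        List children = root.getChildren();
        for (Object child : children) {
            System.out.println(((Element) child).getName());
        }
    }
}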

The second half of the day was spent going through and making notes on the Academic documents highlighted in the project proposal:

http://www.iiit.net/~pkreddy/wdm03/wdm/vldb01.pdf

http://books.google.co.uk/books?hl=en&lr=&id=30UsZ8hy2ZsC&oi=fnd&pg=PR7&dq=cognitive+psychology+and+computing&ots=1LyL05Rj1i&sig=bX5q1YNTfqlTOANUSctLSS6f49g

http://books.google.co.uk/books?id=Rdwv-r5RlOcC&pg=PA88&dq=Attentional+models&lr=

http://books.google.co.uk/books?id=RZ-6cTL8YZsC&printsec=frontcover&dq=web+service&lr=

http://books.google.co.uk/books?id=hXTfWDkqnlkC&printsec=frontcover&dq=xml+indexing&lr=

So the lit review has begun!

Day Three

Today I added Chris Campbell on Facebook; he's doing the second part of the project, so communication will be helpful. We've had some discussions about the data-exchange protocol over Facebook.

I experimented with programming with the PDF API and created a little prototype. I'm sure that I am missing some important methods which are there; I may need to de-compile the source code, which should be fine as it is open source, but I'll have to look into it.

It took me a long time to work out how to add extensions or additional APIs in Java; of course, it was obvious: the \ext folder!

For example:

C:\Program Files\Java\jre1.6.0_07\lib\ext

After this I carried on with my re-learning of the requirements topic and found a useful website from which I have taken over a page's worth of notes:

http://www.stsc.hill.af.mil/crosstalk/2008/03/0803Gottesdiener.html

I now think I know how to approach this area; however, I'm not sure if what I have been looking at is a little too high level.

I sent an e-mail to Mike asking about the relationship between requirements and the lit review; he said that they do go hand in hand and are very much an iterative process. Soon I'll start with a small lit review and then go from there.

On my travels I also came across problem frames:

http://en.wikipedia.org/wiki/Problem_Frames_Approach

This is a requirements approach which can be thought about.

Also I came across requirements templates:

http://www.jiludwig.com/Template_Guidance.html#checklist

These again may be a little high level, but they give an indication of the work needed. I am a little worried about this Friday deadline for the requirements, but I'll just have to get stuck in!



Sunday, 22 February 2009

Day Two

In the morning I did a little experiment to see how other domains represent kinds of text documents with XML. Mike had mentioned that there is a way of looking at how OpenOffice “.odt” files are represented in XML using Windows. The process was:

  1. Create an .odt file

  2. Rename its extension to “.zip”

  3. Extract the file

  4. Look at “Content.xml”

The resulting file looked like this:

http://www.megaupload.com/?d=1VZLOA04

Although the “Content.xml” file may be a lot easier to read, I came to the conclusion that there were enough similarities with the XML file created from the PDF API that either could be used as an example data-set for this problem. When the PDF API was re-visited it became apparent that there were some extra command line parameters which could produce extra useful elements in the XML; for example, there were some which would include hyperlinks in the XML.

As reading the XML was quite hard using a simple document viewer, I discovered a specialist viewer called “XML Marker”; this helped me understand the thinking behind how the PDF API represents a PDF file in XML. Before I had this tool I did not realise that there were “line” tags for blank lines, and also that “word” tags between “text” tags meant the words were on the same line.

So this helped me reach the conclusion that the PDF API would be sufficient for creating the data set used for this project, at least until the IBM issues are resolved one way or the other.

Next I tried to find documentation which was large enough and presented in the way I'd imagine the IBM documentation might be; I found this link:

www.reportlab.com/docs/PyRXP_Documentation.pdf

And I processed it to see how the API would handle it; it was fine. Please see the resulting XML at:

http://www.megaupload.com/?d=FKARW4FV

It's important to remember to make the program very flexible to changes of data-set, as the IBM one may need to be substituted in for this generated one. I know this is a principle of programming, but it is just a reminder to carry forward.

The second half of the day involved a trip to the library to get some books on requirements, as I have not done any kind of requirements gathering since year two. The books I found were:

  • Effective Requirements Practices – Ralph R.Young

  • Software Requirements and Specifications – Michael Jackson

(Add to Bibliography)

The remainder of the day was spent evaluating approaches to requirements.


Thursday, 19 February 2009

Day One

Had a meeting with Mike; from this I found that I could "get the ball rolling" as I now have a direction to go in.
There are still issues with the IBM intellectual property rights, so for the time being an alternative dataset needs to be found.
Mike suggested using PDF files and an API to extract an XML dataset from them. A few websites were very useful in identifying Java PDF APIs:


These pointed me in the right direction.

This one unfortunately works the wrong way around (it converts XML to PDF):


And the one below looks like it may do the job; an e-mail has been sent to their sales team, but as of yet there has been no reply.


The one below does seem to do the job to some extent, though not as much as I'd like. It does break the document into XML; however, it would be very useful if it broke it down into sentence and paragraph tags etc. As it is, it simply puts all words between word tags along with other, apparently redundant, information.


Here are examples of the outputs:

There is this option which appears to be useless:


And this one which seems a little better as it would be easier to search on a word by word basis:


It does look hopeful that this kind of tool is out there, but it appears a little more research is needed. It is important that I do not get too "bogged down" with this, as there are alternatives and the requirements deadline is looming. It would also be useful if this Davisor company would get back to me.

Also on my travels I came across what appears to be some kind of PDF indexer and searcher; this may be of use:


Other things taken from the meeting with Mike were:

The importance of the weightings you give words in the attentional model; this is the "intelligent" aspect of the project.
The attentional engine will almost certainly be a class in its own right.
Just go at a word level in the attentional model (e.g. "bookings" and "book" are the same thing), as this is not the important aspect of the project; the indexing will be fairly novel.
There needs to be an RSS feed output (research - never done this before).
The attentional aspect of the project is the part which will be evaluated, and this will be evaluated on performance.

Some psychology stuff to look into:

Cognitive psychology
Attentional Models
Spiking neurons
Hysteresis curve
Working Set

Some communication with Chris, who is doing the RSS component of the project, will be important, as a format for the description of individual words will be needed, e.g. snippets and the context around them etc. This is why the PDF API is so important.

Requirements need to start being thought about, especially as I have not done any since year 2. Maybe get a book out, as a professional approach will be expected.


Tuesday, 17 February 2009

Project Blog

Welcome to Leigh Darlow's final year project blog. The title of the project is: "Intelligent indexing of large semi-structured data sets."