Tuesday, 31 March 2009

XML Parser

I now have an XML parser which loops through a specified XML file and returns a list of its elements. This is useful for reading in the configuration file for the simulator, as well as for the XML manipulation which will happen within the simulator. The generator is working beautifully. I'd like it to produce random depths of folders, but this looks like it could be quite taxing and isn't a priority at the moment. The code is also not the prettiest and there are currently no comments, but these are all things which can be refactored at the end; the number one priority is getting the system working. It may be ambitious, but I'd like to have a fully working system by the end of the week, which would leave me three weeks to complete some of the other objectives of the project.
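
As a rough illustration, here is a minimal sketch of the kind of parser I mean, assuming a DOM-based approach (the class and method names, SimpleXmlParser and listElements, are made up for the example):

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class SimpleXmlParser {

    // Parse the given XML file and return the names of all elements in document order.
    public static List<String> listElements(File xmlFile) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(xmlFile);
        List<String> names = new ArrayList<String>();
        collect(doc.getDocumentElement(), names);
        return names;
    }

    // Recursively walk the tree, recording each element's tag name.
    private static void collect(Element element, List<String> names) {
        names.add(element.getTagName());
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                collect((Element) child, names);
            }
        }
    }
}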

Monday, 30 March 2009

Touching up

The XML generator now does exactly what it was specified to do. This took a long time to work out as there was quite complicated logic behind it, for example nested loops. The next stage will be to go through and randomly delete attributes and elements from random XML files.
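
Roughly what I have in mind for that next stage, as a hedged sketch using DOM (RandomMutator and the deleteProbability parameter are invented for illustration; the real harness will drive this from the config file):

import java.io.File;
import java.util.Random;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.NodeList;

public class RandomMutator {

    private static final Random RANDOM = new Random();

    // Load an XML file, randomly remove some elements and attributes, and write it back.
    public static void mutate(File xmlFile, double deleteProbability) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(xmlFile);

        NodeList elements = doc.getElementsByTagName("*");
        // Walk backwards so removals do not disturb the part of the list still to be visited.
        for (int i = elements.getLength() - 1; i >= 0; i--) {
            Element element = (Element) elements.item(i);
            if (element == doc.getDocumentElement()) {
                continue; // never delete the root element
            }
            if (RANDOM.nextDouble() < deleteProbability) {
                element.getParentNode().removeChild(element);
            } else {
                // Otherwise consider deleting one of its attributes instead.
                NamedNodeMap attributes = element.getAttributes();
                if (attributes.getLength() > 0 && RANDOM.nextDouble() < deleteProbability) {
                    String name = attributes.item(RANDOM.nextInt(attributes.getLength())).getNodeName();
                    element.removeAttribute(name);
                }
            }
        }

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.transform(new DOMSource(doc), new StreamResult(xmlFile));
    }
}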

Sunday, 29 March 2009

The Crawler

A crawler application has now been written. This is essential for detecting changes, as the whole file structure of the documentation will need to be crawled through to find them. The crawler has methods which return all files within a specified root directory, all directories within a specified root directory, and all files and directories within a specified root directory; these are returned as an ArrayList of strings. It could also be useful for finishing off the random directory and XML generation application.
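
The gist of it, as a simplified sketch (the real class also has a method returning files and directories together, omitted here; names are illustrative):

import java.io.File;
import java.util.ArrayList;

public class Crawler {

    // Return the paths of all files beneath the given root directory.
    public static ArrayList<String> getAllFiles(File root) {
        ArrayList<String> results = new ArrayList<String>();
        crawl(root, results, true, false);
        return results;
    }

    // Return the paths of all directories beneath the given root directory.
    public static ArrayList<String> getAllDirectories(File root) {
        ArrayList<String> results = new ArrayList<String>();
        crawl(root, results, false, true);
        return results;
    }

    // Recursively walk the tree, collecting file and/or directory paths as requested.
    private static void crawl(File dir, ArrayList<String> results,
                              boolean includeFiles, boolean includeDirs) {
        File[] children = dir.listFiles();
        if (children == null) {
            return; // not a directory, or not readable
        }
        for (File child : children) {
            if (child.isDirectory()) {
                if (includeDirs) {
                    results.add(child.getAbsolutePath());
                }
                crawl(child, results, includeFiles, includeDirs);
            } else if (includeFiles) {
                results.add(child.getAbsolutePath());
            }
        }
    }
}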

Some settings will need to be specified in the final configuration file to determine characteristics of the crawler, such as how often it iterates through the files.
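
As a rough illustration of the interval idea, assuming a hypothetical crawlIntervalSeconds setting (the actual config key is still to be decided) and the crawler sketched above:

import java.io.File;
import java.util.Timer;
import java.util.TimerTask;

public class ScheduledCrawl {

    // Run the crawler repeatedly, pausing crawlIntervalSeconds between iterations.
    public static void start(final File root, long crawlIntervalSeconds) {
        Timer timer = new Timer("crawler", true);
        timer.scheduleAtFixedRate(new TimerTask() {
            public void run() {
                // Hypothetical call into the crawler sketched above.
                System.out.println("Found " + Crawler.getAllFiles(root).size() + " files");
            }
        }, 0, crawlIntervalSeconds * 1000);
    }
}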

Thursday, 26 March 2009

Carry On Coding

The first part of the test harness is being left for the time being, as there are some problems with the logic at the moment. The harness does what it needs to do as it is: generate a number of randomly nested directories with random XML files containing random data. There are problems with creating random spreads of files at the higher levels, but as it stands it does a good enough job of creating a random file structure representative of the one at IBM. The next stage will be to write a file crawler.
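
For the record, the directory-generation side is roughly along these lines (a simplified sketch; the real harness takes its depth and spread values from the config file rather than as arguments, and writes random XML files into each directory as it goes):

import java.io.File;
import java.util.Random;

public class DirectoryGenerator {

    private static final Random RANDOM = new Random();

    // Create up to 'spread' randomly named subdirectories beneath 'parent',
    // recursing until the requested depth is reached.
    public static void generate(File parent, int depth, int spread) {
        if (depth == 0) {
            return;
        }
        int count = 1 + RANDOM.nextInt(spread);
        for (int i = 0; i < count; i++) {
            File dir = new File(parent, "dir" + RANDOM.nextInt(10000));
            if (dir.mkdirs()) {
                // In the real harness, random XML files would be written here too.
                generate(dir, depth - 1, spread);
            }
        }
    }
}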

Tuesday, 24 March 2009

Continuation of coding

Good progress was made with the coding today.
The application which will act as a test harness, simulating a user changing XML, has been the main focus. Mike's code has been built upon. It will now generate a specified number of directories with random names and place in them XML files with random names and random data. There have been a number of changes to the config file, which now looks like this (a sketch of how a file in this format could be read back in follows the listing):

wordsFile files/words.txt
elementsFile elements
attributesFile attributes
dataFile data
rootElementsSet root
numberOfElements 100
elementSetSize 10
attributeSetSize 10
dataSetSize 100
maximumNumberOfAttributes 5
maximumNumberOfChildren 10
maximumNumberOfDataWords 20
mixedContentAllowed no
maxDepth 3
minDepth 2
topSpread 20
spreadOtherLevels 5
maxNumberOfFiles 20
minNumberOfFiles 3
fileNames files/fileNames.txt
XMLfileNames files/XMLfileNames.txt
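
A minimal sketch of reading this kind of whitespace-separated key/value file into a map (assuming every line follows the simple "key value" format above; the real loader is part of Mike's code and may differ):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class ConfigLoader {

    // Read a "key value" pair from each non-empty line of the config file.
    public static Map<String, String> load(String path) throws IOException {
        Map<String, String> config = new HashMap<String, String>();
        BufferedReader reader = new BufferedReader(new FileReader(path));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.length() == 0) {
                    continue;
                }
                // Split on the first run of whitespace: key on the left, value on the right.
                String[] parts = line.split("\\s+", 2);
                if (parts.length == 2) {
                    config.put(parts[0], parts[1]);
                }
            }
        } finally {
            reader.close();
        }
        return config;
    }
}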

Tomorrow's objective will be to finish the generation of XML files and refactor some of the code.

Monday, 23 March 2009

Coding Commences

Both aspects of the project were born today. 

Today the XML differencing tool was started. At the end of play the code currently compares two XML files and outputs the results to a text file.

The XML manipulation tool was also started; this took up the majority of the day. The main issues I had were:

Working out how to use Eclipse's workspaces, and how to import and export source code and JAR files
Understanding the workings of Mike's code; there are various supplementary files to the source code which need understanding.

Today has provided a foundation to build upon. The next stage for the manipulation tool will be to create a whole directory structure of XML files; for the differencing tool, a lot of thought will be needed on how the identified differences between the XML files should be output.

Thursday, 19 March 2009

Researching XML Differencing Tools

As the newly amended project will involve comparison of XML data, a number of XML comparison applications and APIs will need to be looked at so that the most appropriate tool for the job can be chosen.

The first tool I looked at was Stylus Studio:

http://www.stylusstudio.com/

The free version (only available for 30 days) doesn't seem to offer an API, and doesn't look like it will do what I want anyway. However, it does look like a very powerful XML manipulation tool.

Then, from Google, I came across this useful link:

http://www.roseindia.net/opensource/xmldiff.php

It suggests a number of open source Java XML tools; the XMLUnit Java API looked the most useful, so I began to research it.

http://xmlunit.sourceforge.net/

To run this software you will need the following in your classpath:
- JUnit (http://www.junit.org/)
- a JAXP compliant XML SAX and DOM parser (e.g. Apache Xerces)
- a JAXP/TrAX compliant XSLT engine (e.g. Apache Xalan)

In the java doc it mentions Diff and DetailedDiff classes:
"Diff and DetailedDiff provide simplified access to DifferenceEngine by implementing the ComparisonController and DifferenceListener interfaces themselves. They cover the two most common use cases for comparing two pieces of XML: checking whether the pieces are different (this is what Diff does) and finding all differences between them (this is what DetailedDiff does)."

This looks like exactly what I am after, although I will have to test it first. At the moment I have just managed to get it installed, so the next stage will be to write a little prototype to see how, and how well, it compares two XML files.
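
The prototype might end up looking something like this (a sketch only, based on the Diff and DetailedDiff classes described in the XMLUnit Javadoc, with made-up sample XML):

import java.util.List;
import org.custommonkey.xmlunit.DetailedDiff;
import org.custommonkey.xmlunit.Diff;

public class DiffPrototype {

    public static void main(String[] args) throws Exception {
        String control = "<doc><title>Help</title><body>old text</body></doc>";
        String test    = "<doc><title>Help</title><body>new text</body></doc>";

        // Diff answers the yes/no question: are these two pieces of XML different?
        Diff diff = new Diff(control, test);
        System.out.println("Identical: " + diff.identical());
        System.out.println("Similar:   " + diff.similar());

        // DetailedDiff lists every individual difference found.
        DetailedDiff detail = new DetailedDiff(new Diff(control, test));
        List differences = detail.getAllDifferences();
        for (Object difference : differences) {
            System.out.println(difference);
        }
    }
}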

Another XML manipulation tool which Mike mentioned was XMLSpy; I will look into this as well.

I had some thoughts on the actual comparison process too. I was wondering whether I really need two copies of the whole document set in order to work out if it has changed; surely that is highly inefficient? Is there another way? If it were the case, then instead of trawling through and comparing every document, I could keep a table of hash values generated from each document and compare those. Would this be more efficient, at least for identifying which documents have changed? Obviously there is the tiny risk of two different documents producing exactly the same hash value. Just a couple of points to think about.
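
A rough sketch of the hashing idea, assuming a standard digest such as SHA-1 over each file's bytes (nothing I have tried yet):

import java.io.FileInputStream;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

public class HashTableBuilder {

    // Build a map from file path to a hex digest of the file's contents.
    // Comparing two such maps identifies which files have changed between crawls.
    public static Map<String, String> hashFiles(Iterable<String> paths) throws Exception {
        Map<String, String> hashes = new HashMap<String, String>();
        for (String path : paths) {
            hashes.put(path, digest(path));
        }
        return hashes;
    }

    private static String digest(String path) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        FileInputStream in = new FileInputStream(path);
        try {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                sha1.update(buffer, 0, read);
            }
        } finally {
            in.close();
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : sha1.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}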

Wednesday, 18 March 2009

Meeting With Mike

Today I received quite major news regarding the project. The IP rights issues are progressing; however, aspects of the project have changed. The indexing part has been pretty much scrapped, but the data mining part is still there. The idea now is to routinely crawl through the help file document set and identify which bits have changed since the last crawl; the changes are then output in an RSS feed. This means the direction of the project has changed: various differencing applications will need to be investigated, filters for input and output data will need to be looked at, and versioning and how to measure the amount a document has changed will need to be thought about (maybe some weighted tolerances in the config file). Mike has provided me with a junk XML creator; the first stage will be to get this generating junk XML and then deleting random elements and attributes from it.

Day Fifteen

Today I built upon the research already done. I investigated some aspects of clustering; I found tree vs binary clustering particularly interesting.

I also came across a document showing that IBM has already done something similar to this project, but for text mining; the methods and techniques used were very enlightening.

I have a meeting with Mike tomorrow regarding IBM.

Friday, 13 March 2009

Day Thirteen/Fourteen

I'm still well into my lit review. I've moved slightly away from the psychology aspect of the project and I'm now looking more into data mining techniques and agents. These also cover some cognitive models. I still have a lot more to research; as I go, I've been building up a list of terms to Google:

neuropsychology
McCrickard et al., 2003c
Horvitz et al.,2003
(Kahneman 1973, Posner & Boies 1971):
Goldman-Rakic (1988)
data mining + intelligence
Search engine indexing
data caching in search engines
NGRAM - used with indexing
hash tables
data mining xml
Data mining clusters
data mining book in library - Data mining: concepts and techniques 

These n-grams look very useful for pattern matching. There was also an article on Weka, a Java API for data mining which may be useful. There's still a lot to do, but I'd like to have the lit review at least started, as well as the design and another set of requirements done, before I see Mike on Tuesday.
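
As a note to self, the n-gram idea is roughly this (a sketch of character n-grams, not tied to Weka or any other library):

import java.util.ArrayList;
import java.util.List;

public class NGrams {

    // Return every contiguous substring of length n from the input text.
    public static List<String> characterNGrams(String text, int n) {
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Example: trigrams of "indexing" -> ind, nde, dex, exi, xin, ing
        System.out.println(characterNGrams("indexing", 3));
    }
}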

Tuesday, 10 March 2009

Day Twelve

Today I started "streamlining" my lit review. As it was, I had lots of chunks of academic documents in a Word document, which is not going to be all that useful when it comes to the write-up. Ricky Dunn suggested I use his template, which looks like it will be very useful. It's a nice, formal way of structuring the research and has got me thinking about document comparisons etc. The template looks like this:

Source Title
Source Topic
Author info (Discipline / Credentials)
Research Question(s)
Methodology
Result(s)
Relation to topic
Strength(s) / Weakness
Potential bias
Contextual Grounding (Location, Date etc.)
References

As well as this, the review of further academic documents continued. I feel I'm a bit behind on this; however, I have been making progress with the implementation of the system, so it is fairly justified. One hundred percent focus will be needed on this until the end of the week, as I have covered nowhere near enough academic documents.


Monday, 9 March 2009

Day Eleven

As I'm not that familiar with RSS (although I should be!), I took the time to see how it works. I found a couple of useful websites which showed me how it should be done:


From this I realised that it works by uploading a single file which an RSS reader points to and reads; the XML items within it point at various places on the web page, which in my case will be the help documentation.

I'm not sure now whether my research on GlassFish the other day was irrelevant; however, it may be needed to upload files to the server. I understand that the file on the server is updated and this in turn filters down to the RSS feeds. I'm still not entirely clued up on this, but I'm making progress.

A link previously referenced, the one which described how to create an RSS feed in Java, was investigated today. I managed to get this working and uploaded the file it created to some webspace. I then downloaded FeedReader, pointed it at the file, and it worked! This will be a good foundation for my feed, though bits will need modifying. I'm not sure what implications using code from the net will have; I will have to speak to Mike about it.


Friday, 6 March 2009

Day Ten

Today I've been looking at how this XML feed is going to work. It was quite interesting actually, as I have never written a component of a distributed system before. Everything I looked at pointed towards Java GlassFish, which seems to provide Java support for web services. The link I downloaded it from is below:


The installation was a nightmare! I had to run lots of command line statements, some of which failed. The thing which messed me around most was that it kept saying I didn't have a JDK installed; it turned out that the "\bin" part of the path was not needed in the environment variable.

I also downloaded a massive samples directory which will be of use when I get going on this properly, as I have no idea how to use web services.

A useful link I found was:


These will be useful for the format of the feed.

Also I found useful information on the coding of the feed from:


I copied and pasted the package imports into Eclipse to see whether the GlassFish installation is what I need and whether it was successful; all packages imported successfully.

import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Characters;
import javax.xml.stream.events.EndElement;
import javax.xml.stream.events.StartDocument;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;
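
To convince myself the imports are actually usable, here is a minimal sketch of writing the skeleton of an RSS feed with these StAX classes (my own rough attempt, not the code from the article):

import java.io.FileOutputStream;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;

public class RssSkeleton {

    public static void main(String[] args) throws Exception {
        XMLEventFactory events = XMLEventFactory.newInstance();
        XMLEventWriter writer = XMLOutputFactory.newInstance()
                .createXMLEventWriter(new FileOutputStream("feed.xml"));

        // A DTD event containing only a newline is a common trick for readable output.
        XMLEvent newline = events.createDTD("\n");

        writer.add(events.createStartDocument());
        writer.add(newline);
        writer.add(events.createStartElement("", "", "rss"));
        writer.add(events.createAttribute("version", "2.0"));
        writer.add(newline);
        writer.add(events.createStartElement("", "", "channel"));
        writer.add(newline);

        // One <item> per changed document; title text would come from the crawler.
        writer.add(events.createStartElement("", "", "item"));
        writer.add(events.createStartElement("", "", "title"));
        writer.add(events.createCharacters("A changed help document"));
        writer.add(events.createEndElement("", "", "title"));
        writer.add(events.createEndElement("", "", "item"));
        writer.add(newline);

        writer.add(events.createEndElement("", "", "channel"));
        writer.add(events.createEndElement("", "", "rss"));
        writer.add(events.createEndDocument());
        writer.close();
    }
}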

Wednesday, 4 March 2009

Day Nine

The lit review research continued today. I focused on the two psychology papers, as the main focus of this project will be the attentional model. I actually found these papers interesting, having never studied psychology before.

The first paper is:

http://www.infosci-online.com/downloadPDF/encyclopedias/IGR1854_G8WXb459Z3.pdf

It was a good introduction to attentional theories; it explained that attention is "The processes by which we select information".

So in my project the idea will be to select relevant words as a human would; this will involve elements of AI. Unfortunately the main focus of the paper in relation to computational attention was GUIs, but I did take a lot from it.

The second paper went a lot more in depth into how the brain makes decisions:

http://www.infosci-online.com/downloadPDF/pdf/ITJ4450_74ZWULUIHI.pdf

It went into mathematical models of the brain. I'm not sure that I fully understood it, but again it was useful; I had to familiarise myself with things like unions again!

It explained how different people have gone about modelling the brain; a couple of the approaches were:

That the brain is layered and each layer has different characteristics (an LRMB); the problem with this, they explained, was that it is hard to determine how many layers the model needs.

Another approach, which they seemed to favour, was that the brain is a network (an OAR); there were some useful directed graphs which may be very beneficial when it comes to designing the attentional model.

Tomorrow's objectives will be to get a prototype of Java web services working and to do more work on the lit review.


Tuesday, 3 March 2009

Day Eight

Today I carried on with the lit review. This involved researching what a lit review should actually look like, as well as the evaluation of academic documents. As I spent the day in the library, I also discovered a few other documents which may be of use.

I started by reading an article on indexing:

http://www.iiit.net/~pkreddy/wdm03/wdm/vldb01.pdf

It suggested a method of indexing which could be up to ten times faster done their way, introducing a process they call XISS; this relates not just to XML indexing but also to the storage of the data. The paper also goes on to explain indexing algorithms. This is a paper I have skimmed over previously, but as I have now started the lit review properly I chose to take notes on it. I found some very interesting ideas, but it was very hard going, so I've only been through half of the paper; it should probably be revisited.

Although the indexing method in this project isn't actually that important, a fairly novel approach should be used, and this paper does give some good insight. The most important aspect of the project will be the attentional models, so as I found the indexing paper so hard going I moved on to the psychological aspect of the project and came across two papers which look very useful. Tomorrow I shall read through the rest of these, making notes, and hopefully they will give me an idea of how to approach the attentional model, because at the moment I have no idea what one actually is!

The two papers are:

http://www.infosci-online.com/downloadPDF/encyclopedias/IGR1854_G8WXb459Z3.pdf

http://www.infosci-online.com/downloadPDF/pdf/ITJ4450_74ZWULUIHI.pdf


Monday, 2 March 2009

Day Seven

Completed the requirements specification, which I'm fairly pleased with as it looks fairly professional and it's been a while since I've done requirements. There may be some refinements needed at a later date as I research further into the project. The use case diagram may need revisiting, and some functional requirements may be better off migrating to an "Environmental" requirements section.

Next week's objectives will be to do as much of the lit review as possible, start to look at Java web services and how RSS feeds work, and maybe start a project plan.