Thursday, 19 March 2009

Researching XML Differencing Tools

As the newly amended project will involve comparison of XML data, a number of XML comparison applications and APIs will need to be looked at so that the most appropriate tool for the job is chosen.

The first tool I looked at was Stylus Studio:

http://www.stylusstudio.com/

The free version (only available as a 30-day trial) doesn't seem to offer an API, and it doesn't look like it will do what I want anyway. However, it does look like a very powerful XML manipulation tool.

Then from Google I came across this useful link:

http://www.roseindia.net/opensource/xmldiff.php

It suggests a number of open source Java XML tools. The XMLUnit Java API looked the most useful, so I began to research it.

http://xmlunit.sourceforge.net/

To run this software you will need the following in your classpath:
- JUnit (http://www.junit.org/)
- a JAXP compliant XML SAX and DOM parser (e.g. Apache Xerces)
- a JAXP/TrAX compliant XSLT engine (e.g. Apache Xalan)

The Javadoc mentions the Diff and DetailedDiff classes:
"Diff and DetailedDiff provide simplified access to DifferenceEngine by implementing the ComparisonController and DifferenceListener interfaces themselves. They cover the two most common use cases for comparing two pieces of XML: checking whether the pieces are different (this is what Diff does) and finding all differences between them (this is what DetailedDiff does)."

This looks like exactly what I am after, although I will have to test it first. At the moment I have just managed to get it installed, so the next stage will be to write a little prototype to see how it compares two XML files and how well it does so.
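A rough sketch of what that prototype might look like, based on the Diff and DetailedDiff classes described in the Javadoc above (the file names are just placeholders for whatever documents end up being compared):

import java.io.FileReader;
import java.util.List;

import org.custommonkey.xmlunit.DetailedDiff;
import org.custommonkey.xmlunit.Diff;

public class XmlDiffPrototype {
    public static void main(String[] args) throws Exception {
        // Placeholder file names for this sketch
        String controlFile = "old-version.xml";
        String testFile = "new-version.xml";

        // Diff answers the yes/no question: are the two pieces of XML different?
        Diff diff = new Diff(new FileReader(controlFile), new FileReader(testFile));
        System.out.println("Identical? " + diff.identical());
        System.out.println("Similar?   " + diff.similar());

        // DetailedDiff lists every individual difference it finds
        DetailedDiff detailed = new DetailedDiff(
                new Diff(new FileReader(controlFile), new FileReader(testFile)));
        List differences = detailed.getAllDifferences();
        for (Object difference : differences) {
            System.out.println(difference);
        }
    }
}

If that works as described, Diff on its own should be enough for a quick "has anything changed?" check, with DetailedDiff reserved for reporting what exactly changed.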

Another XML manipulation tool which Mike mentioned was XMLSpy; I will look into this as well.

I had some thoughts on the actual file comparison process too. Do I really need two copies of the whole document set in order to work out whether anything has changed? Surely that is highly inefficient. Is there another way? I also thought that, if two copies were needed, then instead of trawling through and comparing every document, I could keep a table of hash values generated from each document and compare those instead; would that be more efficient, at least for identifying which documents have changed? Obviously there is the tiny risk of two different documents producing exactly the same hash value. Just a couple of points to think about.
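As a quick sketch of the hash idea, assuming plain byte-level hashing with Java's built-in MessageDigest (the file paths are placeholders, and note that a whitespace or attribute-order change would alter the hash even if the XML is logically the same):

import java.io.FileInputStream;
import java.io.InputStream;
import java.security.MessageDigest;

public class DocumentHasher {

    // Compute a hex-encoded SHA-1 digest of a file's raw bytes.
    // Differing digests guarantee the content has changed; matching
    // digests mean it is almost certainly unchanged (collisions are
    // possible in theory but vanishingly unlikely in practice).
    public static String hashOf(String path) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-1");
        InputStream in = new FileInputStream(path);
        try {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        } finally {
            in.close();
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder paths: compare a stored hash of the old copy
        // against a freshly computed hash of the current document.
        String oldHash = hashOf("snapshot/document1.xml");
        String newHash = hashOf("current/document1.xml");
        System.out.println(oldHash.equals(newHash) ? "unchanged" : "changed");
    }
}

The hash table would only tell me which documents have changed; the XMLUnit comparison would still be needed to find out what changed within them.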
