Wednesday, May 31, 2006

DSpace vs Xena

DSpace is now producing a xena file as part of its ingest process. Xena normalisation occurs in the InstallItem class, which is activated during the very last step in the DSpace process, after the DSpace editor has signed off on the submission. The implementation is not ideal, as I was forced to write the DSpace BitStream out to a temporary file, and then create a XenaInputSource using this temporary file. Ideally we would create a XenaInputSource using the BitStream directly, and this seemed to work well for the guessing stage, but for some reason the stream could not be read during the normalisation stage. So this is something to look at in the future.
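The temp-file detour can be sketched with plain JDK calls. This is only an illustration of the workaround, not the actual InstallItem code: the Xena calls are left as comments because the exact XenaInputSource/normaliser method names are assumptions here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class TempFileWorkaround {

    // Copy a bitstream's InputStream out to a temporary file so that a
    // file-based XenaInputSource can be created from it. The direct
    // stream-to-Xena route failed during normalisation, hence the detour.
    public static Path copyToTempFile(InputStream in, String suffix) throws IOException {
        Path temp = Files.createTempFile("dspace-xena-", suffix);
        temp.toFile().deleteOnExit();
        Files.copy(in, temp, StandardCopyOption.REPLACE_EXISTING);
        // The real code would then do something along these lines
        // (hypothetical call names):
        //   XenaInputSource xis = new XenaInputSource(temp.toFile());
        //   xena.getBestGuess(xis);
        //   xena.normalise(xis);
        return temp;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello bitstream".getBytes("UTF-8");
        Path temp = copyToTempFile(new ByteArrayInputStream(data), ".doc");
        System.out.println(Files.size(temp));  // 15
    }
}
```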

The xena file is currently stored in a hard-coded directory on the server, so this will also need to be changed. The Sydney Uni guys want the xena file to appear in the "File in this item" list, like the following:

http://ses.library.usyd.edu.au/handle/2123/204

I think this means that the place where normalisation takes place will have to change, probably to the point where the ingested file is first received from the client and saved on the server. This way we can slip an extra BitStream into the process. This has the advantage of not having to worry about how to set up a separate Xena repository, as the xena files will be stored in the same way as the ingested files. Xena integration might be a bit harder though...

Friday, May 19, 2006

Java Profilers

Xena Lite has a bit of a memory problem. It often seems to grow to using almost 100MB of memory... but calls to System.gc (garbage collection request) do not seem to reduce the usage. Bizarrely enough, minimising the application does cut the memory usage back to about 5MB, and it only jumps back to 15MB when the window is restored. In addition, viewing large documents (such as a 60MB PDF) has caused out-of-memory errors, which were only alleviated when the heap size was increased to 500MB! Out-of-memory errors have also occasionally occurred when normalising large numbers of files.
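For a quick sanity check before reaching for a profiler, the JVM's own Runtime methods can report heap usage around a System.gc call. A minimal sketch (the MB figures quoted above came from Task Manager, not this code):

```java
public class HeapCheck {

    // Report used heap in megabytes. Note that System.gc() is only a
    // *request* - the JVM is free to ignore it, which may be one reason
    // the reported usage doesn't drop when we call it.
    public static long usedHeapMb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("Before gc: " + usedHeapMb() + "MB");
        System.gc();
        System.out.println("After gc:  " + usedHeapMb() + "MB");
        System.out.println("Max heap:  "
                + (Runtime.getRuntime().maxMemory() / (1024 * 1024)) + "MB");
    }
}
```

One caveat: this only shows Java heap usage, whereas Task Manager shows the whole process, which is why minimising the window (which lets Windows page out the working set) changes the Task Manager figure without the heap shrinking at all.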

To investigate this problem properly we needed a profiler of some sort. I have spent much of the last week investigating the options. I downloaded trial versions of the products below and attempted to memory profile Xena Lite with each.

Before starting I should mention that the various Eclipse plugins for these profilers "don't play nice together": I found that I had to remove all other profilers completely before the new profiler would work properly.

Eclipse Test and Performance Tools Platform
The TPTP is an Eclipse plugin which comes with a set of default tools. Installation is not straightforward: three or four separate downloads are required, plus a monitoring agent needs to be installed as a service. Once started, the profiler worked well enough, and having the profiler as a part of Eclipse with the ability to open up the source files directly is definitely a plus. However the memory profiling tool was a little basic for our needs. It shows a list of the currently allocated objects and the object that allocated each of them (but not for primitives or arrays), and these can be sorted in order of size. And that's about it.

BEA JRockit
JRockit is a Java 5 JDK produced by BEA, which happens to ship with profiling tools. Installation was easy, but finding how to start it up was difficult as the start menu only contained an uninstall target, and the installation directory contained a large number of command-line executables. jconsole.exe turned out to be the correct option, and this brings up an application with a "Connect to agent" dialog, which was a little confusing... after some experimentation it turned out the best option was to select the "Remote" tab and just go with the default options. Then I needed to start Xena Lite using the JRockit JRE, and this brought up the profiling console. Unfortunately this profiler is even more basic than the Eclipse TPTP, consisting only of gauges showing the current memory levels, CPU utilisation and garbage collection state. So fairly pointless, really.

Rational PurifyPlus
We had great hopes for IBM's Rational PurifyPlus, as we figured that it would have the greatest level of integration with Eclipse. However the latest version of PurifyPlus was produced in 2003, which did not bode well for use with Java 5 or Eclipse 3.1. And this turned out to be the case - I copied the Eclipse plugins from the install directory over to the Eclipse directory, but only the PurifyPlus help files were installed properly. The standalone application failed when trying to run Xena Lite, most likely because it only supports Java 1.4.2. You'd think they'd be due for an update soon; I guess it's something to keep an eye on.

YourKit Java Profiler
Finally a decent product, and from the only company I hadn't previously heard of! Installation was easy. It wasn't immediately clear how to start a profiling session, but then I noticed the Eclipse Integration option and installed the Eclipse plugin. After that it was straightforward: launching the application in Eclipse automatically starts the YourKit application and its many, many profiling options. It's not perfect; the main problem from my point of view is that it seems more focussed on recording where objects were allocated than on how they are currently referenced, which makes it easy to solve problems where an object is creating too many resources, but much harder to solve problems where objects that should have been garbage collected are being retained (which is Xena Lite's main problem). But eventually I figured out how best to use it (find large retained objects, then use the "incoming references" function) and it has already helped to solve one memory leak. The best news about YourKit is that they offer the profiler free for use with open source projects, as long as you acknowledge them on the project website. Definitely a winner!

Borland Optimizeit
Before I finally figured out how best to use YourKit, I thought I might give Optimizeit a go. I had found it quite useful for pinpointing a memory leak on a project I worked on for Defence a few years ago. I found it slightly clunky back then, and they don't appear to have updated the interface in the intervening years. The functionality is still there though, and it was working well until Xena Lite hung in the middle of a normalisation job. It turned out that this was Xena Lite's fault, but as I couldn't find a way to bring up the console I had no way to determine this. Optimizeit offers integration with Eclipse which might have fixed this problem, but the integration did not work (I think because it doesn't support Eclipse 3.1). Optimizeit is probably the best backup option if YourKit cannot be used for some reason.

Monday, May 08, 2006

Normalising with OpenOffice 2.0

On Friday I used Xena Lite to normalise the full set of test office documents to which I have access (some from real transfers, some of my own documents, and some made-up test documents). There were about 500 documents in total, including word processor, spreadsheet and presentation documents. There were only 5 files that could not be normalised - three were password protected, one was an Office '95 presentation (not handled by OpenOffice) and one was a Microsoft Project document with a ".doc" extension. So I was very happy with the success rate!

However the main point of this exercise was to ensure that the ODF-normalised file viewed in OpenOffice would appear the same, or at least be close in appearance, to the original file opened in Microsoft Office. We require the "essence" of the original document to be preserved... although we haven't quite gotten around to defining what the "essence" is. Real soon now!

Anyway, the results were actually better than expected. I opened about 20-30 documents in both MS Office and OpenOffice, choosing the most complicated documents from the set, and there were only a few very minor discrepancies.

Below is the front page of a spreadsheet file showing off various features of Excel, opened in Excel:


And here is the OpenOffice version:


The hyperlinks on the front page all navigate to the correct worksheet in the OpenOffice version. Even better, on the Data Validation page the cells which have been set as "numbers only" or "seven characters only" produce an error message if you break these rules. The macros which have been set up on the original page do not work in OpenOffice, but I think that was to be expected!

The only discrepancies I found in the subset of documents I examined were the aforementioned macros, and some issues with page layout - when looking at a document in "page layout" mode, images which just fit on a page in MS Office would be carried across to the next page in OpenOffice. This could possibly be fixed with Page settings in OpenOffice, and when viewed in "Normal Layout" mode, the files were identical.

Obviously more testing will be needed to cover all the possible options in all possible versions of Office, but it appears that our normalisation to ODF is producing excellent results.

Friday, May 05, 2006

XML and ODF at Xtech conference

Donna Benjamin from Open Source Industry Australia is giving a presentation at the Xtech conference in Amsterdam this month. Donna's talk will discuss some of the Australian projects using XML and ODF in digital preservation.

From the Xtech web site:

XTech 2006 is the premier European conference for developers, information designers and managers working with web and standards-based technologies. XTech brings together the worlds of web development, open source, semantic web and web standards.

http://xtech06.usefulinc.com/schedule/detail/108

Thursday, May 04, 2006

ODF adopted as ISO26300

I can't let today pass without mention of the fact that the OpenDocument Format has been voted into acceptance as an ISO standard by the ISO technical committee looking into the issue. With luck this will open the doors to more support of open formats in government and lead to more robust digital preservation.

Hey, if everyone used open formats, we wouldn't need to develop Xena!

DSpace 1.3.2 with Xena

After looking through the DSpace 1.3.2 source, the best place to make the diversion to the Xena normalisation process appears to be in the installItem method of the org.dspace.content.InstallItem class. At this point the submission has passed all the necessary review stages and is about to be archived. We have references to BitStreams representing the input files (from which we can retrieve an InputStream and pass it to Xena) and an InProgressSubmission object which contains information about the collection to which we are adding, the submitting user etc.
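In outline, the diversion amounts to pulling an InputStream off each bitstream and handing it to a normaliser that produces an XML rendition. As a purely illustrative stand-in (this is not Xena's actual output format, and it uses java.util.Base64 from a modern JDK for brevity), a binary normaliser might base64-encode the stream into a trivial XML wrapper:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class BinaryNormaliserSketch {

    // Illustrative only: wrap arbitrary bytes in a minimal XML envelope,
    // base64-encoded. Xena's real normalisers produce much richer XML.
    public static String normalise(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        String b64 = java.util.Base64.getEncoder().encodeToString(buf.toByteArray());
        return "<binary-object encoding=\"base64\">" + b64 + "</binary-object>";
    }

    public static void main(String[] args) throws IOException {
        String xml = normalise(new ByteArrayInputStream("hello".getBytes("UTF-8")));
        System.out.println(xml);
    }
}
```

The point of the sketch is just that once we have an InputStream from the BitStream, the normaliser doesn't need to care where the bytes came from - which is what makes installItem such a convenient hook.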

So it should be very easy to add code to normalise the files. But the question remains - where should the normalised file go? Should it be added back in as an extra file in the DSpace process, or stored in a separate repository of some sort?

DSpace Install

After the Postgres install "issues", the DSpace install was relatively straightforward. I just followed the instructions found here:

http://dspace.org/technology/system-docs

and it worked!

We may need to modify the DSpace source code in order to plug in Xena, so I wanted to add DSpace as a project in Eclipse. This didn't work all that well the first time due to problems with permissions. It seems the best idea is to do everything using the (system) dspace user - including logging in to the desktop as dspace - and all the problems just go away. I installed a tomcat plugin for Eclipse, so now source modification, compilation, deployment, and starting and stopping the tomcat server can all be done from within Eclipse.

Wednesday, May 03, 2006

Xena Lite preview

Did I mention there's a developer preview of Xena Lite available? It's very close to what Xena Lite will look like, but it converts office documents to OpenOffice flat XML instead of ODF. The office plugin gets its ODF goodness next month. If you have Java, OpenOffice and the capacity to grab 30MB or so, read the release notes first, then grab the download.

Sophos Command-line Virus Checker

A possible solution to DPR virus checking? It seems to read virus definitions from a directory, so we can presumably update them, and we could read the log file for results... hopefully the log is well-structured enough to facilitate this.
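Reading results out of a scanner log could be as simple as a regex pass over the file. The line format below is a guess at what a command-line scanner's detection lines might look like - the real Sophos log format would need to be checked before relying on this:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VirusLogParser {

    // Hypothetical detection-line format; verify against the actual
    // scanner output before use.
    private static final Pattern DETECTION =
            Pattern.compile(">>> Virus '(.+?)' found in file (.+)");

    // Return the paths of all files the log reports as infected.
    public static List<String> infectedFiles(String log) {
        List<String> files = new ArrayList<String>();
        Matcher m = DETECTION.matcher(log);
        while (m.find()) {
            files.add(m.group(2));
        }
        return files;
    }

    public static void main(String[] args) {
        String sample =
                "Scanning /ingest...\n" +
                ">>> Virus 'EICAR-AV-Test' found in file /ingest/bad.doc\n" +
                "1 file swept.\n";
        System.out.println(infectedFiles(sample));  // [/ingest/bad.doc]
    }
}
```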

Welcome

Welcome to the NAA Digital Preservation team's development blog. Let's see if we can use this as a collaboration space to share info during the development of Xena, Quest and the DPR.