A computing cluster is a network of homogeneous (or heterogeneous, in the case of intranets) workstations connected by a high-speed local area network and used as if it were a single computing resource. A grid extends this idea: instead of the workstations being located and controlled within a single institution, geographically dispersed systems spanning multiple organisations are internetworked. Grids, or computational grids, were first proposed for highly complex, parameterised analysis in which the computation can be split across different nodes. High Energy Physics (HEP) experiments, however, are characterised by relatively simple queries over enormous amounts of experimental data.
Data grids are a specialisation and extension of ‘The Grid’ in which the power of the grid architecture is used to manage the sharing of large datasets between ‘collaboratories’. They have been designed for use in HEP experiments and in the astrophysics community. Because the data files are so large, metadata is used to describe the content, semantics and logical structure of the data.
The semantic grid extends this notion by envisaging a grid as an environment where services are provided to and from entities, under advertised contracts (a specific form of metadata). JACK-G lies in the realm of semantic grids and is discussed further in Section 6.1.
Grids clearly offer massive benefits to large projects such as the BELLE experiment. They allow data sharing on an unprecedented level, which enables important collaborative work. They also enable resource sharing, allowing complex operations and data management to be distributed across multiple machines and organisations. However, grids are also characterised by a number of problems:
All these can lead to a high failure rate for jobs submitted to a grid. Indeed, this has led some to claim that failure is the rule rather than the exception. Our ultimate goal is to reduce this failure rate by introducing an agent system capable of pre-empting failure through intelligent scheduling and submission, and of intelligent recovery and resubmission when a node fails to complete a job. This would improve the dependability and consistency of The Grid. Our initial implementation provides an agent tool capable of detecting failure and reporting it in a meaningful way to the physicist.
Our project does not aim to develop the grid infrastructure, but rather to augment it. Unfortunately, despite the apparent efforts of many, there is not yet an existing ‘Grid’ linking worldwide HEP resources.
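As an illustration of what "reporting failure in a meaningful way" might involve, the following is a minimal Python sketch and not the JACK-g implementation. The `JobResult` record, the failure categories and the string patterns matched against the error output are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobResult:
    """Hypothetical record of what came back from a grid node."""
    job_id: str
    node: str
    exit_code: Optional[int]  # None means the job never reported back
    stderr: str = ""


def classify(result: JobResult) -> str:
    """Map a raw job result onto a coarse, human-readable category."""
    if result.exit_code is None:
        return "node unreachable or job lost"
    if result.exit_code == 0:
        return "ok"
    text = result.stderr.lower()
    # The patterns below are illustrative guesses, not real Globus messages.
    if "no such file" in text:
        return "missing input data"
    if "permission denied" in text or "proxy" in text:
        return "authorisation problem"
    return "application error"


def report(result: JobResult) -> str:
    """Produce a one-line summary suitable for showing to the physicist."""
    status = classify(result)
    if status == "ok":
        return f"Job {result.job_id} on {result.node}: completed"
    return f"Job {result.job_id} on {result.node}: FAILED ({status})"
```

A monitoring agent could call `report` for each finished or timed-out job, so that the physicist sees a categorised summary rather than raw middleware output.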
Appendix I gives a fuller description of Computational and Data Grids. The next section describes the current state of the major grid middleware.
The interest in, and need for, grid computing has led to attempts to develop middleware capable of supporting grid applications. Globus has become the most widely accepted, and indeed the default standard, for providing these services, although there are others (such as Legion). The HEP group at the University of Melbourne has a test grid using the Globus Toolkit, and Globus is the only middleware being used in the worldwide EDG and LHC Grids (see Section 3.4); it is therefore the only architecture explored in further detail.
Globus defines an open source “toolkit” of low-level services for security, communication, resource location and allocation, process management and data access. This toolkit can then be used to implement higher level applications or programming models. Together these services make up what is referred to as grid middleware: a layer used by end-user applications that abstracts over the grid fabric of storage, networks and computers. The Globus toolkit allows for the construction of a number of grid nodes and provides an API capable of authentication and authorisation via proxies, simple submission facilities, and the ability to retrieve data from completed jobs. The field of High Energy Physics is one of the principal applications of the Globus grid.
A detailed look at the toolkit is undertaken in Appendix II. What is of primary interest to us is what is not in the toolkit, as this is the gap that JACK-G and JACK-g aim to fill.
There are two major issues or problems that grid middleware, including Globus, has yet to address:
These two problems do not mean that the Globus Toolkit is not useful, but emphasise the fact that the Toolkit is only a building block that developers use to construct useful applications.
The only major alternative to Globus as a grid middleware mechanism appears to be the Legion metasystem (http://legion.virginia.edu/). The Legion project defines a metasystem to be a high-speed network connecting isolated resources, and aims to provide both an architecture for building applications to work with these metasystems and the metasystem itself. In this sense it appears to be a more integrated package than Globus. This tight integration, while useful, may also have impeded its uptake in the community. Whatever the reason, it is not used, or likely to be used, in the HEP community and we have not explored it further.
As noted in Section 3.1, data grids are significantly different from traditional computational grids. The European Data Grid (EDG) project aims to extend the Globus toolkit and to provide services, such as replica management, that are necessary for data grid management. Essentially, the EDG is a very large collaboration of organisations devoted to developing tools that allow all communities using very large datasets to perform their analysis using the Grid. It is planned that the tools will eventually be incorporated into the Globus toolkit as part of the standard lower level services, extending the MDS (Monitoring and Discovery Service). The EDG is currently running (see: http://eu-datagrid.web.cern.ch/eu-datagrid/Intranet_Home.htm), though its current incarnation is little more than a renaming of the Globus toolkit binaries, with some semantic changes. For example, a grid node is now described as a “Virtual Organisation (VO)”. Monitoring and scheduling issues have not been addressed, although the Job Description Language (JDL) adds some standard metadata to the job, something that will be important for JACK-G (see Section 6). The next evolution is the LHC Computing Grid (LCG) Toolkit, to be used when the Large Hadron Collider (LHC) and its ATLAS experiment come online in 2007. The users of the LHC are expected to produce of the order of 10 to 14 petabytes of data per year. The LCG toolkit is in its infancy and currently bears no significant difference to the EDG.
Both JACK-g and JACK-G are based upon the Globus Toolkit and, given its similarities to the EDG and LCG, porting JACK-G/g when physicists inevitably migrate to the LCG should be simple.
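To make the idea of job metadata concrete, the sketch below renders a Python dictionary as a JDL-style job description of `Attribute = value;` lines. It is a minimal illustration only: the helper `to_jdl` is hypothetical, the attribute names in the usage note are illustrative, and both the names and the exact syntax should be checked against the EDG JDL documentation.

```python
def to_jdl(attrs):
    """Render a dict as JDL-style 'Name = value;' lines.

    Strings are quoted, lists become {"a", "b"} sets, and any other
    value (for example an integer) is written as-is. Expressions such
    as requirements or ranking are deliberately not handled here.
    """
    lines = []
    for key, value in attrs.items():
        if isinstance(value, (list, tuple)):
            rendered = "{" + ", ".join(f'"{item}"' for item in value) + "}"
        elif isinstance(value, str):
            rendered = f'"{value}"'
        else:
            rendered = str(value)
        lines.append(f"{key} = {rendered};")
    return "\n".join(lines)
```

For example, `to_jdl({"Executable": "/bin/hostname", "OutputSandbox": ["std.out", "std.err"]})` yields two lines, one per attribute. An agent such as JACK-G could read the same attributes back to learn what a job needs before deciding where to submit it.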
A technical, detailed look at the Globus toolkit is contained in Appendix II.
Our initial prototype development is aimed not at the complete community described above, but at the University of Melbourne HEP group. This group is part of (and helped set up) the BelleTestBed, a grid set up across Australian universities. Four university physics departments have nodes set up on this grid. These are listed in Table 1:
| Machine name | University |
| --- | --- |
| fleagle.ph.unimelb.edu.au | University of Melbourne |
| belle.cs.mu.oz.au | University of Melbourne |
| belle.anu.edu.au | Australian National University |
| belle.physics.usyd.edu.au | University of Sydney |
| belle.cs.adelaide.edu.au | University of Adelaide |
Table 1: Nodes on the Australian BelleTestBed
The Globus Toolkit is used as the middleware in the BelleTestBed and is installed on all the nodes. Access to the BelleTestBed from the University of Melbourne is currently primitive and is based upon the low-level Globus Toolkit commands. Some of the Globus commands used are detailed in Appendices II and VI. It is clear from our experience, documented in Appendix VI, that the Globus Toolkit commands provide basic access only. There are no tools for scheduling, failure handling or monitoring in any real sense.
Unsurprisingly, there have been a number of attempts by developers to build on the Globus Toolkit. The University of Melbourne HEP group has been collaborating with the developers of one such scheduler, Nimrod/G, which has been in development for a number of years. It is a grid-enabled resource management and scheduling system based on economic principles, but it is not yet a usable system. Nimrod/G and other schedulers are described in Appendix III. These schedulers are designed for computational grids and parametric computing, a somewhat different paradigm from the HEP problem.
In the absence of a completed Nimrod/G, the HEP group developed their own Quick and Dirty scheduler, which interfaces to the BelleTestBed grid. This scheduler handles node choice, takes replica catalog (file system) information into account, and allows multiple nodes to be used at once. The group claims that it is one of the first “advanced data schedulers”. It does not consider complex requirements or cost/feasibility.
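To give a flavour of this command-level access, the sketch below builds argument lists for three Globus Toolkit command-line tools: `grid-proxy-init`, `globus-job-run` and `globus-url-copy`. It is a minimal sketch under assumptions: only the simplest positional forms are shown, options vary between toolkit versions, and the Python helper functions are our own naming rather than part of Globus.

```python
# Node names taken from Table 1.
BELLE_NODES = [
    "fleagle.ph.unimelb.edu.au",
    "belle.cs.mu.oz.au",
    "belle.anu.edu.au",
    "belle.physics.usyd.edu.au",
    "belle.cs.adelaide.edu.au",
]


def proxy_init_command():
    """Create a short-lived proxy credential (prompts for a passphrase)."""
    return ["grid-proxy-init"]


def job_run_command(node, executable, *args):
    """Run an executable on a remote node and wait for its output."""
    return ["globus-job-run", node, executable, *args]


def url_copy_command(source_url, dest_url):
    """Copy a file between two URLs, for example gsiftp:// to file://."""
    return ["globus-url-copy", source_url, dest_url]
```

Each list can be passed to `subprocess.run` on a machine with the toolkit installed, for example `job_run_command("belle.anu.edu.au", "/bin/hostname")`. Nothing in these commands chooses a node, retries a failed job or reports why it failed, which is exactly the gap described above.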
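The following Python sketch shows one way replica-aware node choice could work. It is not the HEP group's scheduler: the catalog layout (file name mapped to the nodes holding a replica) and the least-loaded tie-breaking rule are assumptions made for illustration.

```python
def assign_files(files, replica_catalog, available_nodes):
    """Assign each input file to an available node that already holds it.

    Among the nodes holding a replica, the one with the fewest files
    assigned so far is chosen, so work spreads over several nodes.
    Returns (plan, unplaced): plan maps node -> list of files, and
    unplaced lists files with no replica on any available node.
    """
    load = {node: 0 for node in available_nodes}
    plan = {node: [] for node in available_nodes}
    unplaced = []
    for name in files:
        holders = [n for n in replica_catalog.get(name, []) if n in load]
        if not holders:
            unplaced.append(name)
            continue
        # Break ties on node name so the result is deterministic.
        target = min(holders, key=lambda n: (load[n], n))
        plan[target].append(name)
        load[target] += 1
    return plan, unplaced
```

Files left in `unplaced` would need to be copied to a node first or reported to the user. Like the scheduler described above, this sketch ignores cost, deadlines and other complex requirements.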
The UoM HEP group would like a system/scheduler where:
The University of Melbourne HEP group would benefit greatly from the sophisticated handling of failure that agents could deliver. The issues identified above are handled particularly well by the agent paradigm. Because the paradigm provides for multiple plans that deal specifically with failure, and because agent systems are inherently dynamic and proactive, this appears to be a pertinent area in which to apply an agent application. A general architecture for our agent-based grid monitoring systems is outlined in Section 6.
Our future work, JACK-G, aims to provide a comprehensive solution to this issue. The current implementation, JACK-g, provides a system that deals with failure handling. These are described in subsequent sections.
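The idea of holding several plans for the same failure event can be sketched as follows. This is plain Python written for illustration, not JACK code: the `Plan` structure, the plan names and the event fields are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Plan:
    """A named response to an event: a context test plus a body."""
    name: str
    applicable: Callable[[Dict], bool]  # may this plan be tried for the event?
    body: Callable[[Dict], bool]        # returns True on success


def handle_event(event: Dict, plans: List[Plan]) -> Tuple[Optional[str], List[str]]:
    """Try each applicable plan in order until one succeeds.

    Returns the name of the plan that succeeded (or None) together
    with the names of all plans that were attempted.
    """
    tried = []
    for plan in plans:
        if not plan.applicable(event):
            continue
        tried.append(plan.name)
        if plan.body(event):
            return plan.name, tried
    return None, tried
```

For a failed job, the plans might be "retry on the same node", "move to another node" and "report to the physicist", tried in that order. If an earlier plan fails, the agent falls through to the next applicable one rather than giving up.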