Introduction
Before the advent of XML, documents were not traditionally used as data sources, even though they contain useful information. XML makes it possible to label each part of a document with the type of data it contains, so that more document-centric applications can be developed. XML can also be seen as an object serialization format for distributed object applications or as a data exchange format. Given the importance of both XML and database systems (DBS), the two need to be integrated for storage, retrieval, and update. This integration allows the business data and legacy application data held in a DBS to be exposed to other systems in XML format, which makes those systems highly interoperable. Conversely, XML data can be stored in a DBS to enable querying and summarization. Two types of databases can be used for this purpose: XML-enabled databases (such as relational database systems, RDBS) and native XML databases (such as EMC Documentum Store). This paper discusses XML-enabled databases.
Best Type of Database for Interacting and Integrating with XML Data
Four types of databases can be considered. First, special-purpose DBS built specifically to store, retrieve, and update XML documents can be used. Examples are the research prototypes Rufus, Lore, Strudel, and Natix, and DBS such as eXcelon and Tamino. The second approach is to use object-oriented databases, whose rich data modeling capabilities make them well suited to storing hypertext documents. Third, object-relational DBS are good candidates for mapping to and from XML documents, since their nested structure works well with XML’s nested document model. However, these alternatives are neither widely used nor capable of handling large-scale data efficiently. Thus, the best choice for storing XML documents is the relational database system (RDBS). Because RDBS are a mature technology, integrating them with XML provides several advantages. It allows data represented in XML documents and data held in relations to be queried seamlessly. It also makes the data already stored in an RDBS available to newer applications, including the web.
How RDBS Interacts with XML
There are three alternatives for storing XML documents within an RDBS. The first is to store each XML document as a whole within a single database attribute. The second is to interpret XML documents as graph structures and to store them in a relational schema designed for arbitrary graphs. The third is to store the XML documents according to mappings between the structure of the XML DTD and the relational schema. The last alternative is the best, as it allows RDBS features such as querying, optimization, and concurrency control to be exploited. However, when the XML DTD is mapped to the relational schema, the heterogeneity between the two data models has to be reconciled. This heterogeneity stems from fundamental differences in the concepts provided by XML and RDBS. XML DTDs and relational schemas also pursue different design goals: the former allow redundant representation of information, while the latter aim at normalization. Three basic kinds of mappings can be distinguished:
ET_R. An element type (ET) is mapped to a relation (R), called the base relation. It is a many-to-one mapping.
ET_A. An element type is mapped to an attribute (A). It is a many-to-one mapping.
A_A. An XML attribute is mapped to a relational attribute. Again, it is a many-to-one mapping.
Figure 1: Basic Kinds of Mappings
Mappings Between XML and RDBS
Each element type and each XML attribute can be mapped to only a single base relation and a single attribute (Figure 1). Because the ET_A and A_A mappings determine how values of database attributes are mapped to values of XML elements or attributes, ET_R mappings occur together with them. Not every element type and attribute of a DTD, nor every relation and attribute of a relational schema, needs a mapping; foreign keys, for example, may be irrelevant in an XML document. A mapping may be omitted not only when the DTD and the relational schema were developed independently but also when one was derived from the other, unless a constraint requires a value to exist in the XML document.
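The three basic mappings can be illustrated with a minimal Python sketch that loads a small XML document into SQLite. The element types, XML attributes, relation and column names, the sample data, and the helper load_document are all assumptions made for illustration; they are not taken from Figure 1.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical mappings, for illustration only.
ET_R = {"person": "PERSON"}                                   # element type -> base relation
ET_A = {("person", "name"): "NAME", ("person", "age"): "AGE"}  # element type -> attribute
A_A = {("person", "id"): "ID"}                                # XML attribute -> attribute

DOC = """
<people>
  <person id="1"><name>Ann</name><age>34</age><note>unmapped</note></person>
  <person id="2"><name>Bob</name><age>41</age></person>
</people>
"""

def load_document(conn, xml_text):
    """Insert one tuple per element whose type has an ET_R mapping."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        relation = ET_R.get(elem.tag)
        if relation is None:
            continue  # element types without a base relation are skipped
        row = {}
        for attr_name, value in elem.attrib.items():
            column = A_A.get((elem.tag, attr_name))
            if column:
                row[column] = value
        for child in elem:
            column = ET_A.get((elem.tag, child.tag))
            if column:  # <note> has no mapping and is omitted
                row[column] = child.text
        columns = ", ".join(row)
        marks = ", ".join("?" for _ in row)
        conn.execute(
            f"INSERT INTO {relation} ({columns}) VALUES ({marks})",
            list(row.values()),
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PERSON (ID INTEGER, NAME TEXT, AGE INTEGER)")
load_document(conn, DOC)
rows = conn.execute("SELECT ID, NAME, AGE FROM PERSON ORDER BY ID").fetchall()
```

The <note> element shows the case described above in which an element type has no mapping and therefore leaves no trace in the relational schema.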
Figure 2: Exemplary mappings of composite elements
These basic mappings can be extended to cover the nesting hierarchy that arises when element types (composites) contain other element types (components). When an element type is mapped, its composite element type is itself mapped, directly or indirectly, to a relation or an attribute. The base relation involved becomes the parent base relation of the XML element type. An XML attribute is mapped by first considering its element type. If the attribute’s element type is not mapped, its direct and indirect composite element types have to be ignored. In a direct mapping, the XML element type or attribute is mapped to the parent base relation or to one of its attributes, respectively. An indirect mapping occurs when the relational attribute that should be the mapping target has been factored out of the parent base relation, for reasons such as normalization. Both direct and indirect mappings are applicable to the three kinds of mappings above, ET_R, ET_A, and A_A. Because relations can be vertically partitioned, wherever a direct mapping is possible an indirect mapping is possible as well.
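The difference between a direct and an indirect mapping can be sketched as follows. The schema (PERSON, EMPLOYER, EMP_ID), the data, and the helper person_element are assumptions chosen for illustration: the employer name has been factored out of PERSON into its own relation, so the <employer> element is reached only through a foreign key.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMPLOYER (EMP_ID INTEGER PRIMARY KEY, EMP_NAME TEXT);
    CREATE TABLE PERSON (ID INTEGER PRIMARY KEY, NAME TEXT,
                         EMP_ID INTEGER REFERENCES EMPLOYER(EMP_ID));
    INSERT INTO EMPLOYER VALUES (10, 'Acme');
    INSERT INTO PERSON VALUES (1, 'Ann', 10);
""")

def person_element(conn, person_id):
    """Build a <person> element from its base relation and a factored-out one."""
    # <person> has PERSON as its base relation (ET_R).
    # <name> is mapped directly: NAME belongs to the parent base relation.
    # <employer> is mapped indirectly: EMP_NAME lives in EMPLOYER and is
    # reached through the foreign key EMP_ID.
    name, employer = conn.execute(
        """SELECT p.NAME, e.EMP_NAME
           FROM PERSON p JOIN EMPLOYER e ON e.EMP_ID = p.EMP_ID
           WHERE p.ID = ?""",
        (person_id,),
    ).fetchone()
    person = ET.Element("person", {"id": str(person_id)})
    ET.SubElement(person, "name").text = name
    ET.SubElement(person, "employer").text = employer
    return person

xml_text = ET.tostring(person_element(conn, 1), encoding="unicode")
```

If EMP_NAME were stored in PERSON itself, the same <employer> element could be mapped directly and the join would disappear.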
Enabling relational databases to handle XML data involves three main problems:
Storing XML documents in a relational database
Publishing XML documents from data stored in a relational database
Querying XML views over relational data
The main principles of XML publishing and XML querying are discussed using SilkRoute, a typical XML-enabled database system.
Publishing/Querying
Figure 3: SilkRoute Architecture
The data to be exposed as XML is defined using SilkRoute's proprietary Relational to XML Transformation Language (RXL). Consider an RDBMS (1) containing one database (PEOPLE) with one table, PERSON (ID, Name, Age, Employer). A query over this table is written in RXL (2), as shown in Code Snippet 1.
It produces the output shown in Code Snippet 2. The translator (3) uses the RXL expression to generate SQL queries (4), which extract the data, and an XML template (5). The RDBMS executes the queries and returns the data as tuples (6). The XML generator (7) fills the XML template with these tuples to produce the final XML answer (8). SilkRoute supports querying with XML-QL. For a user query (9) posed over the view of Code Snippet 1, the query composer, which is the most complex component of the SilkRoute architecture, integrates the view definition (2) with the user query (9) to produce the answer (8).
Source: (Smiljanić et al., 2002)
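The publishing path described above (view definition, translator, SQL query plus XML template, tuples, XML generator) can be imitated with a minimal Python sketch. This is not SilkRoute code and does not use RXL: the VIEW dictionary is a hypothetical stand-in for an RXL expression, translate and generate are hypothetical helpers, and the sample rows are invented. Only the PERSON table and its columns come from the text.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PERSON (ID INTEGER, Name TEXT, Age INTEGER, Employer TEXT);
    INSERT INTO PERSON VALUES (1, 'Ann', 34, 'Acme');
    INSERT INTO PERSON VALUES (2, 'Bob', 41, 'Initech');
""")

# Hypothetical view definition; stands in for the RXL expression (2).
VIEW = {
    "root": "people",
    "row": "person",
    "table": "PERSON",
    "attributes": {"id": "ID"},                   # XML attribute <- column
    "children": {"name": "Name", "age": "Age"},   # child element <- column
}

def translate(view):
    """Split the view into an SQL query (4) and an XML template (5)."""
    columns = list(view["attributes"].values()) + list(view["children"].values())
    sql = f"SELECT {', '.join(columns)} FROM {view['table']} ORDER BY {columns[0]}"
    template = {
        "root": view["root"],
        "row": view["row"],
        "attributes": list(view["attributes"]),
        "children": list(view["children"]),
    }
    return sql, template

def generate(template, tuples):
    """Fill the template with tuples (6) to build the XML answer (8)."""
    root = ET.Element(template["root"])
    n_attrs = len(template["attributes"])
    for row in tuples:
        elem = ET.SubElement(root, template["row"])
        for name, value in zip(template["attributes"], row[:n_attrs]):
            elem.set(name, str(value))
        for name, value in zip(template["children"], row[n_attrs:]):
            ET.SubElement(elem, name).text = str(value)
    return ET.tostring(root, encoding="unicode")

sql, template = translate(VIEW)
answer = generate(template, conn.execute(sql).fetchall())
```

The point of the split is that the relational engine does the extraction while the tagging is left to a thin layer outside it; query composition would add a step that rewrites VIEW against a user query before translate is called.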
Publishing
Techniques for publishing relational data as XML pay special attention to computational pushdown: the existing relational engine should carry as much of the load as possible, for the sake of performance. To that end, extensions are made available for relational engines so that they can support XML publishing. XML publishing involves three tasks, (1) data extraction, (2) data structuring, and (3) data tagging, each of which can be performed in different ways. Transforming relational data into XML requires adding both tags and hierarchical structure to the data. Both can be added early in the process (early structuring/tagging) or late in the process (late structuring/tagging). A further implementation decision is whether structuring and tagging are done inside or outside the relational engine. Figure 4 shows the alternative methods for XML publishing.
When stored procedures are used, they are embedded within the queries so as to generate the XML output. These stored procedures take as arguments either table attributes of basic SQL data types or already formed subparts of the XML document. New XML elements are produced by combining those arguments and framing them within new XML tags. A special aggregate function called XMLAGG is added to the relational engine to generate lists of XML elements by concatenation. It concatenates the XML elements into a CLOB, which is in turn used as a parameter for other stored procedures. This basic approach, called early structuring/early tagging, suffers from highly correlated nested SQL queries. The late structuring/late tagging approach therefore de-correlates the nested SELECT statements and exploits the outer union.
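A minimal sketch of early structuring/early tagging inside the engine, assuming SQLite as the relational engine and Python functions registered through sqlite3 in place of stored procedures. The function names XML_PERSON and XML_EMPLOYER, the XmlAgg class, and the sample rows are assumptions for illustration; XmlAgg only imitates the concatenating behaviour of XMLAGG and returns a string rather than a CLOB.

```python
import sqlite3
from xml.sax.saxutils import escape, quoteattr

def xml_person(name, age):
    """Scalar function: wrap basic SQL values in new XML tags."""
    return f"<person><name>{escape(name)}</name><age>{age}</age></person>"

def xml_employer(name, people_fragment):
    """Scalar function: frame an already formed XML fragment in a new tag."""
    return f"<employer name={quoteattr(name)}>{people_fragment}</employer>"

class XmlAgg:
    """Aggregate that concatenates XML fragments, imitating XMLAGG."""

    def __init__(self):
        self.parts = []

    def step(self, fragment):
        self.parts.append(fragment)

    def finalize(self):
        return "".join(self.parts)

conn = sqlite3.connect(":memory:")
conn.create_function("XML_PERSON", 2, xml_person)
conn.create_function("XML_EMPLOYER", 2, xml_employer)
conn.create_aggregate("XMLAGG", 1, XmlAgg)
conn.executescript("""
    CREATE TABLE PERSON (ID INTEGER, Name TEXT, Age INTEGER, Employer TEXT);
    INSERT INTO PERSON VALUES (1, 'Ann', 34, 'Acme');
    INSERT INTO PERSON VALUES (2, 'Bob', 41, 'Initech');
    INSERT INTO PERSON VALUES (3, 'Cy', 29, 'Acme');
""")

# The inner SELECT is correlated with the outer one through e.Employer.
rows = conn.execute("""
    SELECT XML_EMPLOYER(
               e.Employer,
               (SELECT XMLAGG(XML_PERSON(p.Name, p.Age))
                FROM PERSON p
                WHERE p.Employer = e.Employer))
    FROM (SELECT DISTINCT Employer FROM PERSON) e
    ORDER BY e.Employer
""").fetchall()
```

Each result row is one fully tagged <employer> element. The inner query runs once per outer row, which is the correlation cost mentioned above; the order of the <person> fragments inside an employer is not guaranteed by this sketch.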
Figure 4: Space of Alternatives for Publishing XML
Source: (Smiljanić et al., 2002)
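The late structuring/late tagging alternative can be sketched with a sorted outer union: one flat, uncorrelated query returns parent and child rows padded with NULLs, and a tagger outside the engine adds structure and tags in a single pass. The query, the kind column, the tag function, and the sample rows are assumptions for illustration, not the exact plan used by any particular system.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PERSON (ID INTEGER, Name TEXT, Age INTEGER, Employer TEXT);
    INSERT INTO PERSON VALUES (1, 'Ann', 34, 'Acme');
    INSERT INTO PERSON VALUES (2, 'Bob', 41, 'Initech');
    INSERT INTO PERSON VALUES (3, 'Cy', 29, 'Acme');
""")

# Outer union: employer rows (kind 0) and person rows (kind 1) share one
# column layout; sorting places each employer just before its persons.
OUTER_UNION = """
    SELECT DISTINCT Employer, 0, NULL, NULL FROM PERSON
    UNION ALL
    SELECT Employer, 1, Name, Age FROM PERSON
    ORDER BY 1, 2, 3
"""

def tag(rows):
    """Add hierarchy and tags to the sorted tuple stream in one pass."""
    root = ET.Element("employers")
    current = None
    for employer, kind, name, age in rows:
        if kind == 0:
            current = ET.SubElement(root, "employer", {"name": employer})
        else:
            person = ET.SubElement(current, "person")
            ET.SubElement(person, "name").text = name
            ET.SubElement(person, "age").text = str(age)
    return ET.tostring(root, encoding="unicode")

document = tag(conn.execute(OUTER_UNION).fetchall())
```

Compared with the stored-procedure approach, the engine here executes a single query with no per-row subquery, at the price of wider tuples that are mostly NULL.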
Storing
One method for storing XML documents is to use a set of binary tables. A binary table is created for every parent-child relationship that occurs in the XML document. Figure 5 shows this storage technique applied to a simple XML document. For example, the table named a-b-c is populated with pairs of IDs of <b> and <c> elements that are in a parent-child relationship, but only if they are reached through <a>. The other option, called the basic inlining technique, is to store as much of the descendants’ data as possible in a single table.
Figure 5: a) XML document, b) Element IDs, c) Contents of binary tables for the XML document a)
Source: (Smiljanić et al., 2002)
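The binary-table technique can be sketched in a few lines of Python. The sample document, the pre-order ID assignment, and the helper binary_tables are assumptions for illustration and do not reproduce the document or IDs of Figure 5; tables are kept as in-memory lists rather than relations.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from itertools import count

# Hypothetical sample document.
DOC = "<a><b><c/><c/></b><b><c/></b><d/></a>"

def binary_tables(xml_text):
    """Return {table name: [(parent_id, child_id), ...]} for one document."""
    ids = count(1)               # element IDs assigned in document order
    tables = defaultdict(list)

    def visit(elem, elem_id, path):
        for child in elem:
            child_id = next(ids)
            child_path = path + [child.tag]
            # One table per distinct root-to-child path, e.g. "a-b-c".
            tables["-".join(child_path)].append((elem_id, child_id))
            visit(child, child_id, child_path)

    root = ET.fromstring(xml_text)
    visit(root, next(ids), [root.tag])
    return dict(tables)

tables = binary_tables(DOC)
```

In this sketch the table a-b-c holds only pairs of <b> and <c> IDs found below <a>; a <c> reached through a different path would go to a different table.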
Conclusion
This paper discussed different types of databases and their suitability for storing XML data, and selected RDBS as the most appropriate. It examined the relationship between XML and databases, presented the basic and extended mappings, and described the various methods for publishing, querying, and storing XML data in RDBS. A typical XML-enabled database architecture, SilkRoute, was discussed briefly, along with the advantages of using XML-enabled databases. Future research could focus on the benefits of native XML databases over XML-enabled databases.
References
Fernandez, M. F., Tan, W. C., & Suciu, D. (2000). SilkRoute: trading between relations and XML. WWW9 / Computer Networks, 33(1-6), 723-745.
Kanne, C. C., & Moerkotte, G. (2000). Efficient storage of XML data. Proceedings of the 16th International Conference on Data Engineering (ICDE) (p. 198). San Diego, CA: ICDE.
Kappel, G., Kapsammer, E., Rausch-Schott, S., & Retschitzegger, W. (2000). X-ray - towards integrating XML and relational database systems. Proceedings of the 19th international conference on Conceptual modeling. Salt Lake City, UT: Werner Retschitzegger. doi:10.1007/3-540-45393-8_25
Rumbaugh, J., Jacobson, I., & Booch, G. (1999). The Unified Modeling Language reference manual. Essex, UK: Addison-Wesley Longman Ltd.
Shanmugasundaram, J., Kiernan, J., Shekita, E. J., Fan, C., & Funderburk, J. (2001). Querying XML views of relational data. Proceedings of the 27th International Conference on Very Large Data Bases (pp. 261-270). San Francisco, CA: Morgan Kaufmann Publishers Inc.
Smiljanić, M., Blanken, H., Keulen, M. V., & Jonker, W. (2002). Distributed XML Database Systems. Enschede, Netherlands: Twente University Faculty of Informatics Database group.
Williams, K., Brundage, M., Dengler, P., Gabriel, J., Hoskinson, A., Kay, M., . . . Vanmane, M. (2000). Professional XML databases. Birmingham, UK: Wrox Press Ltd.