IT experts regard data mining (storage) and retrieval as the most important application of data warehouse technology (see, e.g., Inmon, 1992; Inmon, 1997). In particular, this refers to the extraction of latent predictive information from large databases (Edelstein, 1997; Thearling, 1998). The present essay reviews the most important and promising trends in automated tools for storing, mining, retrieving, and processing data.
Integration of text, data, code, and flows
In the area of database management systems (DBMS), attention has always been focused on the organization, storage, analysis, and retrieval of structured data. The development of the Internet has shown that more complex data types (such as texts, images, temporal data, audio, and video) must be supported as well. This leads to a revision of the basic relational DBMS architecture so that it supports the following types of data (a small sketch illustrating two of these items follows the list):
text, spatial, temporal, and multimedia data;
procedural data, i.e., data types together with the methods that encapsulate them;
triggers;
data flows and queues.
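As a small illustration of two of these items, procedural data (a user-defined function) and triggers, the following sketch uses Python's sqlite3 module; the table, function, and trigger are hypothetical and serve only to show the idea.

import sqlite3

conn = sqlite3.connect(":memory:")

# "Procedural data": a user-defined function registered with the engine and
# callable from SQL alongside the built-in operators.
conn.create_function("distance", 4,
                     lambda x1, y1, x2, y2: ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5)

conn.executescript("""
CREATE TABLE sites (name TEXT, x REAL, y REAL);
CREATE TABLE audit_log (message TEXT);

-- A trigger: the DBMS reacts to data changes without application involvement.
CREATE TRIGGER log_insert AFTER INSERT ON sites
BEGIN
    INSERT INTO audit_log VALUES ('inserted ' || NEW.name);
END;
""")

conn.execute("INSERT INTO sites VALUES ('depot', 0.0, 0.0)")
conn.execute("INSERT INTO sites VALUES ('shop', 3.0, 4.0)")
print(conn.execute("SELECT distance(a.x, a.y, b.x, b.y) FROM sites a, sites b "
                   "WHERE a.name = 'depot' AND b.name = 'shop'").fetchone())  # (5.0,)
print(conn.execute("SELECT message FROM audit_log").fetchall())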
Integration of information
A typical approach to the integration of information at the enterprise level is the design of data warehouses and data marts based on the so-called ETL (extraction, transformation, and loading) procedure: operational data is extracted, transformed to a unified schema, and loaded into the warehouse. Such an approach works for an enterprise with several dozen operational databases under unified control. For the Internet, the ETL approach does not fit, because information has to be integrated across different enterprises. As a rule, enterprises do not allow bulk data extraction from their operational databases; only specific queries are permitted. This creates the need to integrate perhaps millions of data sources on the fly. To solve the specific problems that arise along this path, the ETL approach has to be replaced with something else. In this regard, online analytical processing (OLAP) of the data is a promising direction.
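For concreteness, here is a minimal ETL sketch in Python, assuming a hypothetical operational table (orders) and a hypothetical warehouse fact table (sales_fact); a real ETL pipeline would, of course, be far more elaborate.

import sqlite3

def etl(source_db: str, warehouse_db: str) -> None:
    src = sqlite3.connect(source_db)
    dwh = sqlite3.connect(warehouse_db)
    dwh.execute("CREATE TABLE IF NOT EXISTS sales_fact (day TEXT, product TEXT, revenue REAL)")
    # Extract: pull rows out of the operational system.
    rows = src.execute(
        "SELECT order_date, product_code, price * quantity FROM orders").fetchall()
    # Transform: map the operational schema onto the unified warehouse schema.
    facts = [(order_date[:10], product_code.upper(), float(total))
             for order_date, product_code, total in rows]
    # Load: append the transformed rows into the warehouse fact table.
    dwh.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", facts)
    dwh.commit()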
Sensor data and sensor networks
A sensor network consists of a very large number of cheap devices, each of which is a data source that determines the value of some parameter, such as an object's coordinates or the ambient temperature. Communication requires substantially more energy than computation does, so when a query is addressed to a sensor network, fully distributing the computation among its nodes may be much more efficient: the network becomes, in effect, a database machine. The TinyDB project (see Madden et al., 2004) may become a prototype of such a database machine.
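The following toy sketch (not TinyDB's actual code) illustrates why in-network processing saves communication: each node forwards only a partial (count, sum) pair up the routing tree instead of all raw readings.

from dataclasses import dataclass, field

@dataclass
class Node:
    temperature: float                      # local sensor reading (hypothetical parameter)
    children: list = field(default_factory=list)

    def partial_average(self):
        # Only two numbers cross each link, regardless of subtree size.
        count, total = 1, self.temperature
        for child in self.children:
            c_count, c_total = child.partial_average()
            count += c_count
            total += c_total
        return count, total

leaf1, leaf2 = Node(21.5), Node(23.0)
root = Node(22.0, children=[leaf1, leaf2])
count, total = root.partial_average()
print("network-wide average temperature:", total / count)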
Multimedia queries
The flow of multimedia data (images, audio, video, and so on) keeps growing rapidly. The challenge for DBMS is to provide simple tools for analyzing, summarizing, searching, and reviewing the electronic samples of multimedia information that refer to a particular object (entity, person, etc.). Obviously, this is an aspect of the data integration problem as well.
Inexact data usage
Outside the business world, virtually all data to be processed is incomplete and inexact. Scientific measurements are subject to measurement error; the similarity of sequences, images, texts, and so on is only approximate; and so forth. To be usable in such areas, a DBMS should support inexact data. Query processing should be based on a probabilistic (indeterminate) model, and the query processor should accumulate evidence so as to give better and better answers to users' queries. Users also need the option of asking inexact questions, and the processor should treat this as an additional source of incompleteness and inexactness.
If an inexact answer to a user query is delivered, the system should characterize its level of exactness so that users can judge whether that level is sufficient for their needs. The relevance of an answer returned by a search engine can be regarded as a reasonable analog of such a characterization.
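A minimal sketch of such probabilistic query answering is given below; the table, attribute names, and probabilities are hypothetical, and the common tuple-independence assumption is used.

readings = [
    # (station, temperature, probability that the tuple is correct)
    ("A", 19.7, 0.95),
    ("A", 25.2, 0.40),
    ("B", 21.1, 0.80),
]

def hot_stations(threshold: float):
    """Return each station whose temperature may exceed the threshold,
    together with the confidence of that answer."""
    confidence = {}
    for station, temp, prob in readings:
        if temp > threshold:
            # P(at least one qualifying reading is correct) for this station.
            confidence[station] = 1 - (1 - confidence.get(station, 0.0)) * (1 - prob)
    return confidence

print(hot_stations(20.0))   # {'A': 0.4, 'B': 0.8}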
Self-adaptation
The shortage of qualified database administrators is a consequence of the wide spread of DBMS technologies. A contemporary administrator has to understand disk partitioning, parallel query processing, user-defined data types, and so on; in short, contemporary DBMS are hard to use. The major DBMS vendors run projects aimed at simplifying database administration.
Contemporary DBMS offer comprehensive sets of tuning "knobs." Using them, an expert can squeeze the best performance out of an industrial system. Frequently, such tuning is performed by the vendor's own engineers, which entails substantial expense for the client. Moreover, most engineers do not understand the meaning of the "knobs" and optimize systems based on previous experience.
Tuning can be automated by combining a rule-based system with a database of installation and configuration data. In this direction, DBMS vendors have made substantial progress on dynamic resource allocation and on the choice of physical structures and materialized views. However, this success is only an intermediate one: the real problem is to get rid of the "knobs" altogether. All tuning decisions should be made by the system itself, guided by default policies (such as the relative importance of response time and throughput).
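A toy illustration of such policy-guided, rule-based tuning might look as follows; the metrics, thresholds, and "knobs" are hypothetical and stand in for the far richer sets found in real systems.

def tune(metrics: dict, policy: dict) -> dict:
    """Pick knob settings from observed metrics, guided by a default policy
    that weights response time against throughput."""
    actions = {}
    if metrics["buffer_hit_ratio"] < 0.90:
        actions["buffer_pool_mb"] = metrics["buffer_pool_mb"] * 2      # rule: grow the cache
    if metrics["avg_response_ms"] > 100 and policy["response_weight"] > policy["throughput_weight"]:
        actions["max_parallel_queries"] = 4                            # rule: favor latency over batch work
    return actions

print(tune(
    {"buffer_hit_ratio": 0.85, "buffer_pool_mb": 512, "avg_response_ms": 150},
    {"response_weight": 0.7, "throughput_weight": 0.3},
))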
Many new applications require an unattended DBMS. Such a DBMS should recognize its own internal failures (and failures of communication components), detect corrupted data and application faults, and make the necessary decisions on its own.
New user interfaces and query optimization
The power of contemporary desktop computers allows us to run sophisticated visualization systems on them. Nevertheless, the visual interfaces in widest use are still QBE and VisiCalc. They are genuinely good, but they have been around for three decades, so progress in this direction is long overdue. The main vector of current development is the transition from SQL to XQuery. This is not a qualitative jump: current users do not know SQL (it is a language for professional programmers), and future users are unlikely to know XQuery. Qualitatively new ideas are associated with research on the "semantic Web," although the term itself is rather vague; in any case, search engines make broad use of keyword queries, and interest in browsers is growing in various areas.
Query optimization is an important component of solving the problems above. The general principle is as follows: when the data volume is very large, there is a tendency to manipulate the data in a standard, uniform way. This is what allows SQL and XQuery (very high-level languages) to be used successfully in the data world, and it is almost the only application area for languages of that kind. Such high-level languages require appropriate optimizers. Further development calls for continued work on optimizers for information integration tools, query languages for semi-structured data (such as XQuery), stream processors, sensor networks, and so on.
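As a reminder of what an optimizer for a high-level declarative language has to do, here is a deliberately tiny cost-based sketch that chooses between a full scan and an index scan; the cost constants are hypothetical.

def choose_plan(table_rows: int, selectivity: float) -> str:
    full_scan_cost = table_rows * 1.0                 # read every row sequentially
    index_scan_cost = table_rows * selectivity * 4.0  # random I/O penalty per matching row
    return "index scan" if index_scan_cost < full_scan_cost else "full scan"

print(choose_plan(1_000_000, 0.001))   # highly selective predicate -> "index scan"
print(choose_plan(1_000_000, 0.5))     # half the table qualifies   -> "full scan"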
In many cases, SQL-oriented systems process sequences of relatively simple queries that are embedded at the programming-language level and serve the same task. To advance this area, we need to investigate "inter-query" optimization over a large number of traditional (purely relational) queries.
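One simple instance of such inter-query optimization is batching many identical point lookups, issued one by one from application code, into a single set-oriented query; the sketch below uses a hypothetical customers table in SQLite.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ann"), (2, "Bob"), (3, "Eve")])

requested_ids = [1, 3]

# Naive: one round trip per query, as issued from a loop in application code.
naive = [conn.execute("SELECT name FROM customers WHERE id = ?", (i,)).fetchone()[0]
         for i in requested_ids]

# Batched: a single set-oriented query replaces the whole sequence.
placeholders = ",".join("?" * len(requested_ids))
batched = [row[0] for row in
           conn.execute(f"SELECT name FROM customers WHERE id IN ({placeholders})",
                        requested_ids)]

print(naive, batched)   # ['Ann', 'Eve'] ['Ann', 'Eve']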
References
Inmon, W.H. (1992). Building the Data Warehouse. New York, NY: John Wiley & Sons, Inc.
Inmon, W.H. (1997). Are multiple data warehouses too much of a good thing? Datamation, 43(4), 94-96.
Edelstein, H. (1997). Data Mining: Exploiting the Hidden Trends in Your Data. Retrieved from http://www.psy.gla.ac.uk/~steve/pr/edel.html
Thearling, K. (1998). An Introduction to Data Mining: Discovering Hidden Value in your Data Warehouse. Retrieved from http://www.thearling.com/text/dmwhite/dmwhite.htm
Madden, S., Franklin, M. J., Hellerstein, J. M., & Hong, W. (2004). TinyDB: An Acquisitional Query Processing System for Sensor Networks. Retrieved from http://db.csail.mit.edu/madden/html/tinydb_tods_final.pdf