Abstract
In the cyber age, the storage and retrieval of useful information are central to any organization, whether profit-oriented or nonprofit. Company databases store large volumes of raw data that need processing and classification, and only useful data is worth retrieving. Data mining extracts useful information from databases while leaving unnecessary data untouched. The competitiveness of the business environment necessitates the application of the most up-to-date information retrieval techniques. Data mining may prove useful in market segmentation, fraud detection, customer churn, direct marketing, market basket analysis, interactive marketing, and trend analysis. The paper is split into sections that highlight the concept of data mining, the purpose of its application, its history, future trends, techniques, and operational principles.
Keywords: information, raw data, mining, approach, retrieval, extraction
The development of information technologies in the cyber age has improved human lives in many ways. The financial, social, cultural, and other sectors used to put information and knowledge on paper; the evolution of computers now allows storing huge volumes of information in digital format. However, data is of little use unless retrieved via data mining technologies, which enable human analysts to single out useful trends and regularities that would otherwise go unnoticed. Data mining finds its application in business, where company owners benefit from learning about the categories of their customers or the possibility of their defection to market competitors. Overall, data mining is an IT approach to extracting valuable information from raw data, one that has its own history, promising evolutionary trends, purposes of application, techniques, and operating principles.
The Concept of Data Mining and the Purpose of Its Application
According to the University of North Carolina at Chapel Hill (n.d.), data mining is a procedure whereby information technology specialists retrieve hidden knowledge from large amounts of raw data. Knowledge discovery differs from the conventional extraction of information from databases. In conventional database management systems, data bank records are returned in reply to a query. In knowledge discovery, the objects of retrieval are implicit rather than explicit in the data banks, and the process of finding such implicit patterns is referred to as data mining. Data mining applies data analysis techniques and instruments to build models that reveal these patterns and relationships. Experts differentiate between two major types of models in data mining. Predictive models use data with known results to build a model that can then be applied to explicitly predict values. Descriptive models describe patterns in existing data. All models are abstract representations of reality; they may serve as genuine guides for understanding a business and for suggesting actions to undertake.
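To make the distinction concrete, here is a minimal Python sketch (a generic illustration, not any particular data mining product) contrasting the two model types: a predictive model is fitted on records with known results and then applied to predict an unseen value, while a descriptive summary characterizes patterns already in the data. The advertising-spend figures are invented for the example.

    from statistics import mean, stdev

    # Hypothetical historical records with known results: (ad spend, sales).
    history = [(1.0, 3.1), (2.0, 5.0), (3.0, 7.2), (4.0, 8.9)]
    xs = [x for x, _ in history]
    ys = [y for _, y in history]
    x_bar, y_bar = mean(xs), mean(ys)

    # Predictive model: fit a least-squares line to the known results,
    # then use it to explicitly predict the value for an unseen input.
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in history)
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    print("predicted sales at spend 5.0:", intercept + slope * 5.0)

    # Descriptive model: summarize patterns in the existing data.
    print("mean sales:", y_bar, "standard deviation:", stdev(ys))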
Data mining is applicable when it is necessary to interpret data and retrieve helpful information from it, or when there is an excess of data and a shortage of information. When dealing with extensive amounts of data, human analysts are helpless for want of specific instruments. What data mining can do is automate the procedure of discovering patterns and relationships in raw data. The search results can then be evaluated by a human analyst or applied in automated decision support systems. Hence, it becomes obvious why data mining is utilized in areas like business and science, in which experts need to scrutinize large volumes of data to find tendencies that would otherwise remain unseen. Given the ability to retrieve knowledge from raw data, such data may become an organization's most precious asset. Metaphorically speaking, data mining is the instrument for extracting precious diamonds of knowledge from the fields of historical data and forecasting the results of future situations (The University of North Carolina at Chapel Hill, n.d.).
Alexander (n.d.) noted that corporations were bulging at the seams with the volume of raw data in the databases they maintain. From pixel-by-pixel galaxy images to trillions of credit card purchases and point-of-sale transactions, databases measure gigabytes or even terabytes; to quote a simple example, a single terabyte amounts to an aggregate of 2 million books. The amount of information that accrues to Wal-Mart databases on a daily basis best exemplifies the value of data mining: every day, the retail giant submits as many as 20 million point-of-sale transactions to an AT&T parallel database system, a centralized data bank run by 483 processors. There is not much analysts can learn from unprocessed data. Now that the business environment has grown mercilessly competitive, companies urgently need to convert huge volumes of raw data into coherent information on production markets and clients, and to do so swiftly. If successful, the process of conversion, or data mining, can facilitate management strategies, investment, and marketing.
While data mining is still in a budding state, companies in a wide variety of industries, such as finance, retail, healthcare, aerospace, manufacturing, and transportation, are already putting it to good use. By applying mathematical and statistical techniques and pattern recognition technologies to stored information, data mining enables human analysts to recognize essential patterns, facts, exceptions, trends, relationships, and anomalies that would otherwise most likely be overlooked. In the case of business ventures, data mining is applicable for discovering relationships and patterns in the data, which allows making efficient business decisions. The technology enables analysts to identify sales tendencies, extrapolate customer loyalty from existing trends, and elaborate wiser marketing strategies.
More specifically, data mining may be applied for market segmentation, which enables analysts to detect the common traits of clients purchasing the same product; fraud detection, which helps spot fraudulent transactions; and customer churn, which allows predicting which groups of clients are likely to defect to competitors. Other applications include direct marketing, which indicates the prospects that should be included in a mailing list to achieve the highest response rate; market basket analysis, which enables analysts to understand what services and products are commonly bought together; and interactive marketing, which allows predicting what stirs the interest of website visitors. Finally, trend analysis makes it possible to identify how a typical client this month differs from one last month (Alexander, n.d.).
The History of Data Mining and Future Trends
Information technology specialists introduced the concept of data mining in the 1990s, yet the term is the product of the evolution of a field with an extensive history. Artificial intelligence, classical statistics, and machine learning are at the root of data mining's origins. Statistics is the basis of most of the technologies on which data mining relies, such as standard distribution, regression analysis, discriminant and cluster analysis, standard variance and deviation, and confidence intervals. All the technologies in this list are applied to study data and data relationships. As distinct from statistics, the cornerstone of artificial intelligence is heuristics: AI attempts to apply human-thought-like processing to statistical problems.
Machine learning is the combination of artificial intelligence and statistics, mixing advanced statistical analysis with AI heuristics. Machine learning tries to allow computer programs to learn about the data they study: programs make decisions on the basis of the qualities of the studied data, applying statistics for the basic concepts and adding more advanced algorithms and artificial intelligence heuristics to attain their objectives. Data mining adapts machine learning techniques to business use. It combines recent and historical developments in areas like machine learning, statistics, and artificial intelligence to study data and retrieve hidden patterns or tendencies (The University of North Carolina at Chapel Hill, n.d.).
Alexander (n.d.) claimed that data mining had undergone four principal stages of evolution. Data mining is a natural development of the growing application of computerized databases that store data and give answers to business analysts. The first huge historical landmark was data collection in the 1960s, with disks, computers, and tapes as the enabling technologies. The second stage was data access in the 1980s, enabled by cheaper and swifter computers with more storage and relational databases. The third stage was decision support and data warehousing, facilitated by cheaper and swifter computers with more storage, data warehouses, online analytical processing, and multidimensional databases. The fourth huge landmark in the evolution of data mining was expedited by cheaper and swifter computers with more advanced computer algorithms and more storage options.
In the near future, data mining will yield results in lucrative business areas. New niches will be explored by micro-marketing campaigns, and advertising companies will target their potential clients with new precision. In later periods, data mining may become as widespread and user-friendly as email: people may apply these tools to find the cheapest ticket to the destination of their choosing, retrieve the phone number of a schoolfellow, or find the best price on commodities. In the distant future, the prospects are even more promising: computers may find new treatments for illnesses or give clues about the nature of the universe. This notwithstanding, there are privacy-related concerns, since systems may collect information on every aspect of human lives, from the results of visits to a physician to webpage visit history, which may encroach upon privacy (Alexander, n.d.).
Data Mining Techniques
Alexander (n.d.) claimed that the analytical approaches applied in data mining were recognized mathematical techniques and algorithms. Their application to business issues, enabled by the greater availability of data and by cheap storage and processing power, has given these mathematical technologies a new use, and graphical interfaces have made the instruments easy to access and use. The major instruments applicable for data mining are nearest neighbor, artificial neural networks, genetic algorithms, rule induction, and decision trees. Artificial neural networks are nonlinear predictive models that learn by means of training and resemble biological neural networks in structure. Rule induction is the retrieval of helpful if-then rules from the data on the basis of statistical performance. Decision trees are tree-shaped structures representing sets of decisions that produce rules for classifying a dataset. Nearest neighbor is a classification approach that classifies every record on the basis of the records most resembling it in a historical database. Genetic algorithms are optimization approaches based on the concepts of natural selection, mutation, and genetic combination (Alexander, n.d.).
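As an illustration of the nearest neighbor approach just described, the following minimal Python sketch classifies a new record by the label of the record most resembling it in a small, invented historical database; the feature values and class labels are hypothetical.

    import math

    # Hypothetical historical database: (feature vector, class label).
    historical = [
        ((1.0, 2.0), "loyal"),
        ((8.0, 9.0), "churn"),
        ((1.5, 1.8), "loyal"),
        ((7.5, 8.5), "churn"),
    ]

    def nearest_neighbor(record):
        """Classify a record by the label of its closest historical record."""
        closest = min(historical, key=lambda row: math.dist(record, row[0]))
        return closest[1]

    print(nearest_neighbor((2.0, 2.2)))  # prints "loyal"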
There is a different classification of data mining techniques. As it stands presently, there are four principal techniques in data mining. Clustering is the approach instrumental in discovering appropriate groupings of elements in data sets. Clustering is a type of unsupervised learning, or undirected knowledge discovery: there is no target field, and data relationships are identified by a bottom-up technique. Association rules, also known as market basket analysis, are a data mining approach that helps discover associations between various attributes in databases. The basis of association rules is frequency counts of the number of items in a certain event: they may reveal whether item A is part of an event and, if so, in what percentage of those events item B is part of the event as well. Sometimes represented as a layered collection of interconnected processors, the neural network is another data mining technique (The University of North Carolina at Chapel Hill, n.d.).
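The frequency counting behind association rules can be sketched in a few lines of Python; the transactions below are invented for the example, and the rule examined is the hypothetical "if bread then butter".

    # Hypothetical market basket data: each set is one transaction (event).
    transactions = [
        {"bread", "butter", "milk"},
        {"bread", "butter"},
        {"bread", "jam"},
        {"milk", "jam"},
    ]

    a, b = "bread", "butter"
    with_a = [t for t in transactions if a in t]
    with_both = [t for t in with_a if b in t]

    support = len(with_a) / len(transactions)   # how often item A occurs
    confidence = len(with_both) / len(with_a)   # of those events, share with B
    print(f"support({a}) = {support:.2f}, confidence({a} -> {b}) = {confidence:.2f}")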
The University of North Carolina at Chapel Hill (n.d.) suggested that the processor nodes were usually termed neurodes, highlighting their relationship to brain neurons. Each node has a weighted link to a number of other nodes in adjoining layers. Each node takes the input obtained from interconnected nodes and applies the weights so that output values may be computed. Decision trees are the fourth technique; they execute classification by building, from training instances, a tree whose leaves carry class tags. Finding a leaf requires traversing the tree, and the class of the leaf is the forecast class. Thus, this is directed knowledge discovery, since there is a specific field whose value analysts want to forecast.
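Both ideas can be sketched in a few lines of hypothetical Python: a neurode combines the weighted inputs from its neighbors into a single output value, and a decision tree is traversed from the root until a class-tagged leaf is reached. The weights, thresholds, and class tags are invented for the example.

    import math

    # Neurode: weighted inputs from adjoining nodes produce one output value.
    def neurode(inputs, weights):
        total = sum(x * w for x, w in zip(inputs, weights))
        return 1 / (1 + math.exp(-total))  # a common squashing function

    print(neurode([0.5, 0.9], [0.4, -0.2]))

    # Decision tree as nested tuples: (feature index, threshold, left, right);
    # plain strings are leaves carrying class tags.
    tree = (0, 5.0, "low risk", (1, 2.5, "low risk", "high risk"))

    def classify(record, node):
        while not isinstance(node, str):   # traverse until a leaf is found
            i, threshold, left, right = node
            node = left if record[i] <= threshold else right
        return node                        # the leaf class is the forecast class

    print(classify((7.0, 3.0), tree))  # prints "high risk"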
How Data Mining Works
According to How Data Mining Works (n.d.), there are six underlying steps in building a mining model. Defining the problem is the first step in the process of data mining. After determining the problem, it is necessary to think over how the data may be applied to answer it. The first step necessitates the analysis of business requirements, the determination of the scope of the problem, the metrics by which the model will be assessed, and specific goals for the data mining project. Analysts will most likely have to carry out a data availability study and examine the needs of business users with respect to the available data in order to find answers to their questions; redefining the project becomes necessary when the data does not support users' needs. Analysts may also consider how model results may be incorporated into key performance indicators (KPIs), the benchmarks of business projects.
How Data Mining Works (n.d.) suggested that the next step was data preparation, which serves the purpose of unifying and cleaning the data identified during the preceding stage. The step is very important, since the data may be dispersed across the company and stored in a multiplicity of formats; worse, it may contain inconsistencies like missing or incorrect entries. Besides interpolating missing values and dropping confusing data, the cleaning stage discovers concealed correlations in the data and determines the most accurate data sources and the columns fit for use in analysis. The third step is the exploration of the prepared data: to make proper decisions when building the mining model, the analyst needs to understand the data.
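A minimal pandas sketch of the cleaning part of this step, assuming the dispersed data has already been pulled into one table; the column names and values are hypothetical.

    import pandas as pd

    # Hypothetical data gathered from several company sources.
    raw = pd.DataFrame({
        "customer_id": [1, 2, 2, 3, 4],
        "age": [34, None, None, 51, 29],           # missing entries
        "spend": [120.0, 80.0, 80.0, -5.0, 60.0],  # one incorrect entry
    })

    clean = (
        raw.drop_duplicates()        # unify records stored more than once
           .query("spend >= 0")      # drop the clearly incorrect entry
           .assign(age=lambda d: d["age"].fillna(d["age"].mean()))  # fill missing values
    )
    print(clean)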
Exploration approaches incorporate the calculation of means and standard deviations, minimum and maximum values, and the consideration of the data's distribution. Standard deviations and other distribution values can give helpful information on the accuracy and stability of results; a large standard deviation, for instance, can suggest that adding extra data might improve the model. Exploring the data in light of one's own understanding of the business issue can help one decide whether the dataset contains defective data. If it does, the analyst can elaborate a strategy for tackling the problem or gain a more profound understanding of the behaviors characteristic of the business (How Data Mining Works, n.d.).
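A short sketch of this exploration step using Python's standard statistics module; the order values are invented, and the rule used to flag a large standard deviation is an arbitrary assumption for the example.

    from statistics import mean, stdev

    # Hypothetical column from the prepared dataset.
    order_values = [120.0, 135.0, 128.0, 900.0, 131.0]

    mu, sigma = mean(order_values), stdev(order_values)
    print(f"mean = {mu:.1f}, standard deviation = {sigma:.1f}")
    print(f"min = {min(order_values)}, max = {max(order_values)}")

    # A spread that is large relative to the mean may point to defective data
    # or suggest that adding extra data could improve the model.
    if sigma > mu:
        print("high variance: inspect the distribution before modeling")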
Building models is the fourth step, facilitated by the exploration of the data. First, it is necessary to determine the columns of data to be applied when building a mining structure related to the data source. Still, the mining structure contains no data until the analyst processes it. During processing, Analysis Services produces aggregates and other pieces of statistical information applied for analysis, and any mining model based on the structure can apply this information. Until the model and structure are processed, the data mining model is nothing but a container defining the columns applied for input, the parameters of the algorithm used to process the data, and the attribute up for prediction. The processing of a model is referred to as training: the application of a specific mathematical algorithm to the data in the structure in order to retrieve patterns (How Data Mining Works, n.d.).
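Analysis Services has its own interface, so the following scikit-learn sketch is only a stand-in that mirrors the idea: until training, the model is just a container of parameters, and training applies a mathematical algorithm (here, a decision tree) to the data to retrieve patterns. The columns and labels are hypothetical.

    from sklearn.tree import DecisionTreeClassifier

    # Input columns (hypothetical: age, monthly spend) and the attribute
    # up for prediction (1 = likely to churn, 0 = likely to stay).
    X = [[34, 120.0], [45, 80.0], [51, 20.0], [29, 60.0]]
    y = [0, 0, 1, 1]

    # Before processing, the model is just a container of parameters...
    model = DecisionTreeClassifier(max_depth=2)

    # ...and training extracts the patterns from the data.
    model.fit(X, y)
    print(model.predict([[40, 30.0]]))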
How Data Mining Works (n.d.) suggested that it is possible to apply parameters to tune each algorithm, or filters to the training data so that only a subset is used, each producing different results. After analysts pass data through the model, the mining model object contains patterns and summaries that can be queried or applied for prediction. Importantly, analysts must update the mining model and structure whenever the data changes. Updating a mining structure through reprocessing is the process by which Analysis Services extracts data from the source, especially if it is dynamically updated, and refills the mining structure. The exploration and validation of models is the fifth step: the building of the model is followed by the exploration and testing of the data mining model.
Before deploying the mining model into a production environment, its performance needs to be tested. Analysts may also build a number of models with varying configurations; testing such models allows singling out the one that generates the best results for the data or problem. Analysis Services provides instruments that enable analysts to separate the data into training and testing datasets, which makes it possible to evaluate the performance of all models on identical data accurately. The training dataset is used to create the model, while the testing dataset is used to test model accuracy by issuing prediction queries.
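A hedged scikit-learn sketch of this separation into training and testing sets; it shows one common way to do what the text attributes to Analysis Services, not that product's actual interface, and the data is invented.

    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # Hypothetical feature matrix and labels.
    X = [[i, i % 3] for i in range(30)]
    y = [0 if i < 15 else 1 for i in range(30)]

    # Hold out identical test data so every candidate model is scored fairly.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    model = DecisionTreeClassifier().fit(X_train, y_train)

    # Prediction queries against the testing dataset measure accuracy.
    print("accuracy:", accuracy_score(y_test, model.predict(X_test)))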
The deployment and update of models is the final step in the process of data mining. Only the optimal models yielding the expected results are deployed during this phase. Following the introduction of the mining model into a production environment, it can execute a variety of tasks. Such tasks may include applying the model to develop predictions later used for making business decisions, issuing content queries to extract rules, statistics, or formulas from the model, and embedding data mining functionality into an application. Other tasks may incorporate creating a report that lets users query the current mining model, updating models dynamically after analysis and review, and utilizing Integration Services to create a package in which a mining model is applied to split incoming data into multiple tables (How Data Mining Works, n.d.).
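One simple way to sketch the deployment idea in Python: the trained model is persisted once and later reloaded by a production application to answer prediction requests. Here pickle merely stands in for whatever packaging a real production environment uses, and the model and data are hypothetical.

    import pickle
    from sklearn.tree import DecisionTreeClassifier

    # Train once on hypothetical data, then persist the model to disk.
    model = DecisionTreeClassifier().fit([[0], [1], [2], [3]], [0, 0, 1, 1])
    with open("churn_model.pkl", "wb") as f:
        pickle.dump(model, f)

    # Later, a production application reloads the model and queries it
    # to support business decisions.
    with open("churn_model.pkl", "rb") as f:
        deployed = pickle.load(f)
    print(deployed.predict([[2.5]]))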
Conclusions
Data mining is a unique product of information technology development that allows handpicking the most useful information from raw data, information that might otherwise go unnoticed by human analysts. Without specific tools like data mining, specialists are helpless and unable to handle large amounts of information. Data mining automates the procedure of finding relationships and patterns, and analysts may then assess the search results or apply them in automated decision support systems. Databases measure terabytes of data, from pixel-by-pixel galaxy images to trillions of credit card purchases and point-of-sale transactions. The technology may be of great use in large corporations that need to process heavy flows of data. For example, Wal-Mart uploads up to 20 million point-of-sale transactions on a daily basis to a database system run by nearly 500 processors.
Businesses would not be able to learn much without data mining techniques. Having grown competitive, the business environment necessitates the conversion of huge amounts of data into information on clients and customer markets in order to survive stiff competition. Data mining may prove useful in market segmentation, fraud detection, customer churn, direct marketing, market basket analysis, interactive marketing, and trend analysis. As far as the historical perspective of the technology is concerned, the term emerged in the 1990s, yet the evolution started in the 1960s and spanned the subsequent decades as disks, tapes, storage databases, and cheaper and swifter computers emerged.
Data mining has a bright future; however, there is the possibility of it infringing upon privacy, with every human move recorded and stored in databases. As for techniques, the major instruments applicable for data mining are nearest neighbor, artificial neural networks, genetic algorithms, rule induction, and decision trees. Another classification partially reproduces this set, citing clustering, neural networks, decision trees, and association rules as the major approaches to mining data. There is a six-step process of technology execution that includes defining the problem, preparing the data, exploring the data, building models, exploring and validating models, and deploying and updating models. Overall, data mining is a highly efficient approach to extracting the most useful information from raw data, one that any organization handling large volumes of data will find valuable.
References
Alexander, D. (n.d.). Data mining. The University of Texas at Austin. Retrieved from http://www.laits.utexas.edu/~anorman/BUS.FOR/course.mat/Alex/
How Data Mining Works. (n.d.). Microsoft. Retrieved from http://msdn.microsoft.com/en-us/library/ms174949.aspx
The University of North Carolina at Chapel Hill. (n.d.). Data mining. Retrieved from http://www.unc.edu/~xluan/258/datamining.html