The primary difference between a data warehouse and a database is that a data warehouse is a repository of data obtained from different sources over time, stored under a unified schema, and typically used for decision support and analysis. A database, on the other hand, represents the present state of particular stored data and is mainly used to handle live queries and online transactions. The similarity between a database and a data warehouse is that both are means of storing large amounts of data.
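The contrast between the two workloads can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table names, columns, and figures are assumptions made up for this illustration, and a real data warehouse would normally run on a dedicated analytical system rather than on SQLite.

```python
import sqlite3

# Hypothetical schema for illustration only; names and values are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("CREATE TABLE sales_history (year INTEGER, source TEXT, amount REAL)")

# Database-style workload: a short transaction that changes the current state.
conn.execute("INSERT INTO accounts (id, balance) VALUES (1, 100.0)")
with conn:  # commits on success, rolls back if an exception is raised
    conn.execute("UPDATE accounts SET balance = balance - 30.0 WHERE id = 1")
current_balance = conn.execute(
    "SELECT balance FROM accounts WHERE id = 1"
).fetchone()[0]

# Warehouse-style workload: historical rows from several sources kept under
# one schema, then read in bulk and aggregated for analysis.
conn.executemany(
    "INSERT INTO sales_history VALUES (?, ?, ?)",
    [
        (2022, "store", 120.0),
        (2022, "web", 80.0),
        (2023, "store", 150.0),
        (2023, "web", 110.0),
    ],
)
yearly_totals = conn.execute(
    "SELECT year, SUM(amount) FROM sales_history GROUP BY year ORDER BY year"
).fetchall()

print(current_balance)  # present state of one record
print(yearly_totals)    # summary over accumulated history
```

The first query asks about the present value of a single record, while the second summarizes history accumulated from more than one source, which is the kind of question decision support relies on.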
Characterization is the general summary of the features of a particular target class of data. Discrimination is the comparison and contrasting of the attributes of a target class with those of a predefined class or group. Association and correlation analysis is the establishment of rules indicating that particular attribute-value conditions occur together at a high frequency within a given data set; this can be used, for example, to select and build the discriminating attributes for analysis in intrusion detection. Classification is mainly the use of constructed classifiers or models to predict categorical labels; it distinguishes data classes. Regression is a statistical methodology applied to numeric prediction in data mining. Clustering is the analysis of data without a predetermined class label; the data is grouped with the goal of maximizing the similarity of the data within each cluster and minimizing the similarity between clusters. Outlier analysis is the identification and examination of data objects that fail to comply with the general model and behavior of the data; such objects are often radically inconsistent with the rest of the data.
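To make the idea of a frequently occurring condition concrete, the following is a minimal sketch of the two measures usually attached to an association rule, support and confidence. The transactions and the helper names `support` and `confidence` are assumptions chosen for illustration and are not taken from any particular library.

```python
# Toy transactions; the items are made up for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions.

    Assumes the antecedent occurs at least once; otherwise this divides by zero.
    """
    joint = support(set(antecedent) | set(consequent), transactions)
    return joint / support(antecedent, transactions)


# Rule "bread -> milk": how often both occur, and how often milk follows bread.
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```

A rule is normally kept only if both measures exceed thresholds chosen by the analyst.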
Discrimination and classification, characterization and clustering, classification and regression: differences and similarities
The difference between discrimination and classification is that discrimination compares the general attributes of a particular target data class with those of contrasting classes, while classification is the process of discovering functions or models that describe and distinguish the data classes so that the class of objects whose label is unknown can be predicted. Their similarity is that both are means of analyzing data objects with respect to classes. The difference between characterization and clustering is that characterization summarizes the attributes of a target class, while clustering conducts its analysis without any known class label. Their similarity is that both group related data objects on the basis of high similarity. Classification and regression differ in that classification predicts categorical labels while regression predicts numeric values; they are similar in that both are used for prediction.
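The distinction between summarizing one class and contrasting two classes can be sketched as follows. The customer records, the field names, and the functions `characterize` and `discriminate` are hypothetical and exist only for this illustration; real characterization would typically use richer summaries than a mean.

```python
from statistics import mean

# Hypothetical customer records; fields and values are assumptions.
customers = [
    {"segment": "frequent", "age": 30, "spend": 500.0},
    {"segment": "frequent", "age": 40, "spend": 600.0},
    {"segment": "frequent", "age": 35, "spend": 400.0},
    {"segment": "rare", "age": 50, "spend": 100.0},
    {"segment": "rare", "age": 48, "spend": 120.0},
]


def characterize(rows, segment, fields=("age", "spend")):
    """Characterization: summarize one target class by the mean of each field."""
    members = [r for r in rows if r["segment"] == segment]
    return {f: mean(r[f] for r in members) for f in fields}


def discriminate(rows, target, contrast, fields=("age", "spend")):
    """Discrimination: compare the target class with a contrasting class."""
    t = characterize(rows, target, fields)
    c = characterize(rows, contrast, fields)
    return {f: t[f] - c[f] for f in fields}


frequent_profile = characterize(customers, "frequent")
gap = discriminate(customers, "frequent", "rare")
print(frequent_profile)
print(gap)
```

Both functions rely on a class label that is already known, which is exactly what clustering does without.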
Considering the impact that outliers often have on data mining, it is crucial that methods and procedures for identifying them be established. The application of outliers in data mining is apparent in intrusion and fraud detection. The approaches to outlier-based intrusion detection can be broadly categorized into the model-based approach, the proximity-based approach, and the angle-based approach. The model-based approach fits a model to the data and then assumes that the data points that do not fit the model are outliers. The proximity-based approach considers the spatial proximity of data objects and treats an object whose proximity to its neighbors differs markedly from that of other objects as an outlier. The angle-based approach determines the spectrum of pairwise angles between a given point and all other points; the outliers are then the points whose angles vary least, because most other points lie in similar directions from them. The best of these would be the angle-based approach.
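Two of these approaches can be sketched in a few lines for one-dimensional data. This is only an illustration under stated assumptions: the readings are made up, the model is a plain mean and standard deviation, and the `threshold`, `k`, and `radius` parameters are arbitrary choices rather than recommended values. The angle-based approach is left out because it only becomes meaningful in two or more dimensions.

```python
from statistics import mean, pstdev

# Toy one-dimensional readings; 95.0 is planted as the anomaly.
values = [10.0, 11.0, 9.0, 10.0, 12.0, 11.0, 10.0, 95.0]


def model_based_outliers(data, threshold=2.0):
    """Fit a simple model (mean and standard deviation) and flag the points
    lying more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(data), pstdev(data)
    if sigma == 0:
        return []  # all points identical, nothing deviates from the model
    return [x for x in data if abs(x - mu) / sigma > threshold]


def proximity_based_outliers(data, k=2, radius=5.0):
    """Flag the points whose k-th nearest neighbour is farther than `radius`.

    Assumes len(data) > k so that every point has at least k neighbours.
    """
    flagged = []
    for i, x in enumerate(data):
        dists = sorted(abs(x - y) for j, y in enumerate(data) if j != i)
        if dists[k - 1] > radius:
            flagged.append(x)
    return flagged


print(model_based_outliers(values))
print(proximity_based_outliers(values))
```

On this toy data both functions single out the planted value, but they can disagree in general: the model-based check depends on how well the assumed model fits, while the proximity check depends on the chosen neighbourhood size and radius.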
Challenges of mining big data, as opposed to mining small data, include the volume, variety, velocity, veracity, storage location, and value of the data. Big data, as the name suggests, involves large volumes of data that can be challenging to deal with. In addition, the variety of the data makes it hard to work with, as it can include thousands of features per data item as well as different data formats and types. The velocity of data into the system can also be a challenge, as the rate of flow of big data is often very high. Determining where and how to store big data is also a major challenge, as the options span multiple platforms, systems with different owners, varying formatting and access requirements, and both public and private cloud systems.