Association Rules
Select a dataset that is appropriate for association rule mining and perform the following tasks. It is important to prepare and preprocess the data if needed. Preparation and preprocessing begin with selecting the right attributes and then shrinking the dataset by eliminating attributes that are not required. For this dataset, houses.zip, the target is to make the learned function more compact, enhance the speed of the actual run of the learning scheme, and ensure that the result is comprehensible.
Based on these requirements, two attributes came into question: latitude and number of bedrooms. Latitude was used only to identify whether a house lies within California, and because its values are negative the algorithm ran very slowly owing to the extra calculations needed to take absolute values. The number of bedrooms, in turn, was only required to indicate approximate size; it was an extra addition and further resulted in the tree becoming more complex.
Weka was run again on the reduced attribute set; however, many errors were received due to the choice of reincorporating both attributes. To fix those errors, the latitude values were multiplied by -1 so that they became positive.
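The transformation described above can be sketched in a few lines of MATLAB. The file name, the matrix layout and the column indices below are assumptions made purely for illustration; they are not taken from the dataset documentation.

% Assumes the houses data has been exported to a flat numeric file whose
% columns follow the attribute order listed later in this report
% (total bedrooms in column 5, latitude in column 8).
D = csvread('houses.csv');   % hypothetical flat-file export of houses.zip
D(:, 8) = -1 * D(:, 8);      % multiply latitude by -1 so the values become positive
D(:, 5) = [];                % drop the total-bedrooms column to shrink the dataset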
The data description and selection of Attributes . . .
This particular dataset was collected as part of the US Census exercise; the data is based on block groups of Californian houses and its range is limited to the west coast. In total, the data contains 20,640 instances, and the file is uploaded along with this project as houses.zip. The dataset contains nine attributes, which include the following:
Median house price
Median income of residents in a particular block
Median house age
Total rooms in block
Total bedrooms present in block
Population residing in the block
The total number of households in block
Latitude
Longitude
The Problem Statement
The approach towards the data analysis, and the problem as identified, is that the data needs to be analysed using Naïve Bayes, C4.5, 1R and ZeroR, and the differences between the learning schemes have to be identified, which is a tough ask considering the attributes.
The most important rules, and their requirements . . .
% a2..a9 are assumed to follow the attribute order listed above
% (a2 = median income, a3 = house age, a4 = rooms, a5 = bedrooms,
%  a6 = population, a7 = households, a8 = latitude, a9 = longitude).
if (a2 < 4.1 & a8 < 38 & a7 < 590 & a3 < 25 & a9 < 122 & a4 < 2164 & a5 < 402 & a6 < 1347)
    disp(' THE HOUSE WILL BE CHEAP ')
elseif (a2 < 4.1 & a8 < 38 & a7 < 590 & a3 < 25 & a9 < 122 & a4 < 2164 & a5 < 402 & a6 >= 1347)
    disp(' THE HOUSE WILL BE CHEAP ')
elseif (a2 < 4.1 & a8 < 38 & a7 < 590 & a3 < 25 & a9 < 122 & a4 < 2164 & a5 >= 402 & a6 < 1347)
    disp(' THE HOUSE WILL BE CHEAP ')
elseif (a2 < 4.1 & a8 < 38 & a7 < 590 & a3 < 25 & a9 < 122 & a4 < 2164 & a5 >= 402 & a6 >= 1347)
    disp(' THE HOUSE WILL BE CHEAP ')
elseif (a2 < 4.1 & a8 < 38 & a7 < 590 & a3 < 25 & a9 < 122 & a4 >= 2164 & a6 < 1347)
    disp(' THE HOUSE WILL BE CHEAP ')
end
Report.
The primary aim for this data is to predict the numeric house price on the basis of eight covariates. As the work progresses, the task becomes identifying whether a house is cost-efficient or overpriced on the basis of the different attributes mentioned below. In total, the data contains 20,640 instances, and the file is uploaded along with this project as houses.zip. The dataset contains nine attributes: the median house price, the median income of residents, median house age, total rooms per block, bedrooms, population, the total number of households, latitude and longitude.
Based on the dataset, it was observed that the Californian residential area is subdivided into 20,640 block zones, each covering a single geographic location with quite similar latitude and longitude positions. This allowed those attributes to be neglected initially, as the average price of houses within a particular block is quite similar. Given that the above attributes were also similar, the overall data was then divided into two parts, cheap and expensive. The final data therefore consisted of this nominal class along with eight numeric attributes. The primary analysis done here concerns the use of C4.5, 1R and ZeroR alongside the support vector machine and Naïve Bayes. It was difficult to meet all these objectives together and at the same time check references related to the project being performed in this section.
Approach used for the mining.
To conduct the mining, there is a detailed approach in place that starts with an analysis of the task and a review of the structure of the dataset. Because preprocessing of the data was required, the right attributes were selected, missing values were identified, rules were created for the outliers, and the cleanup was done. Once the clustering process is completed, linear regression and outliers are checked, after which the mining process moves to the algorithm. Evaluation is done by predicting performance; finally the results are received, tested for error rate, and the conclusion is drawn.
Module 2: Classification
Description of the experiments, and discussion of the results.
The ARFF file that is received is opened in order to be analysed. The resubstitution error was checked with the help of the training data, and the dataset was then partitioned with 1/3 held out for testing and 2/3 used for training. The reason for using this type of error analysis is that the data is quite abundant, and with this holdout process the reliability of the error estimate on the dataset improves. The learning schemes used in this particular project are 1R, 0R, C4.5, linear regression and Naïve Bayes.
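A minimal sketch of the 2/3 training and 1/3 testing split described above, assuming the preprocessed data sits in a matrix D with one instance per row; the variable names and the fixed random seed are illustrative choices, not part of the original experiment.

rng(1);                               % fix the random seed so the split is repeatable
n      = size(D, 1);                  % number of instances (20,640 for houses.zip)
perm   = randperm(n);                 % random permutation of the row indices
nTrain = round(2 * n / 3);            % two thirds of the data go to training
train  = D(perm(1:nTrain), :);        % training partition
test   = D(perm(nTrain + 1:end), :);  % testing partition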
Use of Weka tools to find the rules and appropriate parameter settings.
Based on the previous data that was received after the completion of the last step, the following was done in order to find the rules and parameter settings.
nodenum = 1
The node number in the tree. Due to recursion, all left nodes at each split are numbered before the right ones.
remainingattributes = 2 3 4 5 6 7 8 9
The attributes still available to split on at this node.
chosenatt = 2
The attribute chosen to split on next.
sizeR = 1410 9
Size of the right child node after the split.
sizeL = 3590 9
Size of the left child node after the split.
nodenum = 2
remainingattributes = 3 4 5 6 7 8 9
chosenatt = 8
sizeR = 671 9
sizeL = 2919 9
nodenum = 3
remainingattributes = 3 4 5 6 7 9
chosenatt = 7
sizeR = 720 9
sizeL = 2199 9
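The trace above can be produced by a recursive routine of roughly the following shape. This is only a simplified sketch: it assumes the class sits in column 1 of the data matrix, it picks the split attribute and threshold in a placeholder way, and the function name differs from the treemodified3 routine used later in this report.

function growtree(D, remainingattributes)
% Simplified recursive tree builder that echoes a trace like the one above.
persistent counter                           % node counter survives across recursive calls
if isempty(counter), counter = 0; end
counter = counter + 1;
nodenum = counter                            % echoed without a semicolon, as in the trace
remainingattributes                          % attributes still available at this node
if isempty(remainingattributes) || size(D, 1) < 50
    return                                   % pre-pruning: stop at small or exhausted nodes
end
chosenatt = remainingattributes(1)           % placeholder choice; the real scheme picks by information gain
threshold = median(D(:, chosenatt));         % placeholder threshold at the attribute's median
rightD = D(D(:, chosenatt) >= threshold, :);
leftD  = D(D(:, chosenatt) <  threshold, :);
sizeR = size(rightD)                         % size of the right child node after the split
sizeL = size(leftD)                          % size of the left child node after the split
rest = setdiff(remainingattributes, chosenatt);
growtree(leftD,  rest);                      % left subtree first, so left nodes are numbered before right ones
growtree(rightD, rest);
end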
Scheme                 Correlation coefficient
Linear Regression
Conclusion that summarizes the findings and limitations/challenges faced during this work.
The most effective result among the various methods was provided by 1R, which made it possible to establish a baseline for performance, as shown below.
In general the 1R results make sense.
b2:
    < 2.4175    -> cheap
    >= 4.94745  -> expensive
This result fits easily with the expectation that groups earning a lower income have cheaper houses, while those with a higher income have more expensive ones. The cut points at the extremes of the dataset also show slight overfitting.
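As a minimal illustration of how the 1R rule above would be applied, the following sketch classifies a single instance from its median income. The handling of incomes that fall between the two reported cut points is an assumption, since the scheme's output for that interval is not shown above.

% Hypothetical application of the reported 1R rule on median income.
function label = oneRuleIncome(income)
if income < 2.4175
    label = 'cheap';
elseif income >= 4.94745
    label = 'expensive';
else
    label = 'cheap';     % assumed default for the unreported middle interval
end
end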
Comparison and Result of Algorithms
The first type of activity is pruning, and two types of pruning are applicable here: the first, called pre-pruning, is applied while the tree is being built, and the second, called post-pruning, eliminates non-value-added material afterwards. Post-pruning itself involves two operations, subtree replacement and subtree raising. These operations are applied to each node separately: subtree replacement replaces subtrees with single leaves, while subtree raising replaces lower nodes with higher nodes. When the trial for this algorithm started, the numbers received initially were within the range of .001 to .002, so the analysis moved towards limiting the number of instances per node. The code for this process is shown below; it is important to understand that the learning algorithm is not accurate if pruning is not done.
if ( sizeL(1) >= 50 )            % pre-pruning on the number of instances in the left node
    treemodified3(leftD, A, V);  % keep growing the left subtree
else
    % otherwise stop and print out the results for that node
end
Module 3
The Decision Tree Algorithm
The decision tree algorithm requires the application of C4.5, and it deals with the missing values, pruning and numeric attributes discussed at the beginning of this report. Some assumptions are made about the size of the tree: with n = 20,640 instances and m = 8 numeric attributes, the depth is expected to grow on the order of log n, written O(log n), which is the standard rate of tree growth, and the computational cost of building the tree is O(m * n * log n), here O(8 * 20640 * log 20640).
Discretisation of the numeric attributes can be done in two forms, global and local. Global discretisation prepares the dataset before the learning schemes are applied, which saves time and is convenient. Local discretisation is carried out while the nodes of the tree are being built, so the analysis is redone at each node. Since eight of the nine attributes announced at the beginning are numerical, the algorithm is extended to handle them separately, and to keep things simple binary splits are used. To split an attribute, suppose A and B are two adjacent sorted values of the numeric attribute being split and X is a constant lying between A and B; X is then chosen as the cut point for that attribute. To make this efficient, the data is sorted on the attribute and cut points only need to be considered at the rows where the value of the class changes. An example is shown below.
Attribute: 64 | 65 68 69 70 | 71 72 75 | 80 83
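A small sketch of how these binary cut points can be collected, assuming the attribute values are in a vector a and numerically coded class labels in a vector y of the same length; the function name is illustrative. Only the boundaries where the class value changes after sorting are kept, so for the example values above the candidate cut points fall exactly at the positions marked by the vertical bars.

function candidates = splitcandidates(a, y)
% Sort the attribute and keep a hypothetical cut point X only where the
% class value changes, taking the midpoint of the two neighbouring values.
[asorted, order] = sort(a(:));
ysorted = y(order);
changes = find(ysorted(1:end-1) ~= ysorted(2:end));          % rows where the class changes
candidates = (asorted(changes) + asorted(changes + 1)) / 2;
end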
If the attributes are dependent, accuracy is reduced, and in this dataset the number of bedrooms and the number of rooms are dependent. Also, since the data is not normally distributed, Naïve Bayes is not expected to deliver the most accurate results. Comparing this with the results from C4.5, it is easy to note that the median income of individuals living in a block seems to be the most important attribute, an inference that makes sense: the higher the income, the more expensive the house.
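To make the independence and normality assumptions mentioned above concrete, here is a minimal Gaussian Naïve Bayes sketch. It assumes the eight numeric attributes sit in a matrix X (one instance per row) and the cheap/expensive labels are coded numerically in a vector y; these names, and the small variance floor, are illustrative. The score is simply one normal density per attribute multiplied together, which is exactly where dependent attributes such as rooms and bedrooms, or non-normal data, hurt the results.

function label = nbpredict(X, y, xnew)
% Gaussian Naive Bayes: per class, fit one normal density per attribute and
% combine them as if the attributes were independent given the class.
classes = unique(y);
best = -Inf;
for c = 1:numel(classes)
    rows  = X(y == classes(c), :);
    mu    = mean(rows, 1);
    sigma = std(rows, 0, 1) + 1e-6;                % small floor avoids division by zero
    prior = size(rows, 1) / size(X, 1);
    dens  = exp(-((xnew - mu).^2) ./ (2 * sigma.^2)) ./ (sqrt(2 * pi) * sigma);
    score = log(prior) + sum(log(dens));           % log-probabilities avoid underflow
    if score > best
        best = score;
        label = classes(c);
    end
end
end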
0R and 1R are more applicable to this situation, as they clearly provide the right instances with respect to the tree developed to find the relationship between income and house prices.
Clustering is basically an exploratory analysis that helps with classification by sorting the various cases into customised groups so that the degree of association within each group is strong. Clustering is applied to this particular dataset because the instances are expected to fall into natural groups, after which classification and association learning methods are used. The clustering methods used here are the k-means algorithm, incremental clustering and statistical clustering.
A k-means clustering is created whenever there is a requirement to build clusters in a purely numeric domain. The parameter k is the number of clusters being sought, and k points are chosen as the initial cluster centres. The next step is the calculation of means and the formation of centroids; this process is repeated with the new cluster centres until the same points are assigned to each cluster on successive iterations.
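A minimal sketch of the k-means procedure just described, assuming the numeric attributes are in a matrix X with one instance per row; the value of k, the random initialisation and the iteration cap are illustrative choices.

function [labels, centres] = simplekmeans(X, k)
n = size(X, 1);
centres = X(randperm(n, k), :);                    % k points chosen as the initial cluster centres
labels  = zeros(n, 1);
for iter = 1:100                                   % iteration cap as a safety net
    oldlabels = labels;
    for i = 1:n                                    % assignment step: nearest centre wins
        d = sum((centres - X(i, :)).^2, 2);        % squared Euclidean distances to each centre
        [~, labels(i)] = min(d);
    end
    for j = 1:k                                    % update step: recompute each centroid
        if any(labels == j)
            centres(j, :) = mean(X(labels == j, :), 1);
        end
    end
    if isequal(labels, oldlabels)                  % stop when assignments no longer change
        break
    end
end
end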
The next type of clustering used here is incremental clustering, which works instance by instance: a tree is formed starting from the root node. The process works stage by stage, so instances are added one by one and the tree keeps being updated; the only drawback of this type of algorithm is that an instance which has to be added in between is very difficult to place if the corresponding stage is no longer running.
In the case of statistical clustering, the clusters developed follow the data and an expectation-based model, and no hard commitment takes place, not even for the training instances, since in principle an infinite amount of evidence would be required to take such a decision. The accuracy of this algorithm is therefore very high, although a lot of rigidity is associated with its judgements.