Introduction
Algorithms form the basis of problem solving in most scientific applications. This paper focuses on the problem addressed in the article under review and on the approaches the article uses to solve it.
The article develops an algorithm for the c-approximate nearest neighbor problem, a problem that becomes difficult when the number of dimensions is high and that has been studied by many researchers. The setting is d-dimensional Euclidean space. The nearest neighbor problem asks, given a collection of n data points, to report the data point closest to a query point. A particularly interesting instance of this problem arises when the data points lie in Euclidean space of d dimensions.
Importance of this paper
A solution to this problem is useful in many areas, including data compression, data mining, database technologies, machine learning, information retrieval, image and video databases, statistics and data analysis, and pattern recognition. In these settings, each object of interest (a document, an image, or a video) is represented as a point in d-dimensional Euclidean space, and the distance metric is used to gauge the similarity between objects. Being able to distinguish objects in this way is important, and the task then becomes indexing the objects or searching them with query objects. The number of features involved ranges from the tens into the thousands.
Context of the paper
The context of the paper is that earlier algorithms were designed for low-dimensional data, and an algorithm was needed that could cope with much larger numbers of dimensions. There has been a concerted effort to solve this issue, against the backdrop of a need for newer algorithms. Even so, the existing approaches suffer from query time or space requirements that are exponential in d, which offers little improvement over the simple linear algorithm that has long been used, in which the query is compared against every point in the database. This difficulty is what is referred to as "the curse of dimensionality." The solution described here followed from researchers' efforts to remove this running-time bottleneck (Charikar et al., 62).
Solution proposed
The paper proposes a solution to the problem presented: an algorithm with query time dn^ρ(c), using space dn + n^(1+ρ(c)), where ρ(c) = 1/c^2 + O(log log n / log^(1/3) n). This is a significant improvement on the time and space bounds described earlier, the previous best being the algorithm of [12]. To give specific values, when c = 2 the new exponent tends to 0.25, whereas the exponent obtained in [12] was about 0.45. Recent research has shown that hashing-based algorithms cannot achieve ρ < 0.462/c^2, so the exponent of the running time is near-optimal, up to a small constant factor (Arya et al., 281).
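As a rough, purely illustrative calculation (the database size n below is hypothetical, not a figure from the paper), the following sketch compares the number of distance computations implied by the new exponent, the earlier exponent, and a linear scan when c = 2:

```python
# Purely illustrative arithmetic: the number of distance computations implied
# by the exponents discussed above for a hypothetical database of n points.
n = 10**9                # hypothetical database size (not a figure from the paper)
c = 2.0
rho_new = 1.0 / c**2     # limit of the new exponent at c = 2, i.e. 0.25
rho_old = 0.45           # exponent of the earlier algorithm [12] at c = 2

print(f"linear scan:      ~{n:.1e} comparisons")
print(f"earlier exponent: ~{n**rho_old:.1e} comparisons")
print(f"new exponent:     ~{n**rho_new:.1e} comparisons")
```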
This result also improves algorithms for related high-dimensional problems. One example is that the c-approximate minimum spanning tree (MST) of n points in ℓ2^d can be computed using O(n log n) calls to an approximate nearest neighbor oracle for that space. This means the results obtained here imply a dn^(1+1/c^2+o(1))-time algorithm for the c-approximate MST problem. Other problems where the algorithm yields improvements include the dynamic closest pair problem and facility location (Chakrabarti and Regev 826).
One problem seen with this strategy is that the exponent converges slowly to the 1/c^2 limit. The bound governing the approach is roughly of the form ρ = 1/c^2 + O(log t / √t), with an additional t^O(t) factor in the running time.
In this expression, t is a parameter chosen so that the overall bound is minimized. The appearance of the t^O(t) factor is explained by the fact that the algorithm works in a space of t dimensions, and the quality of the configurations improves as t increases. It is clear from the formula that for the exponent to be competitive with the earlier bounds, t must be large; but when t is large, the t^O(t) factor becomes very large, and this growth cancels out the gains obtained from the improved exponent. The improvement is only salvaged when n is extremely large (Datar et al., 271).
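The trade-off can be illustrated qualitatively with a short sketch. The constants hidden in the O(·) terms are not specified, so the numbers below only indicate the shape of the trade-off, not actual running times:

```python
import math

# Qualitative illustration of the trade-off described above. The constants
# hidden in the O(.) terms are not specified, so the figures only show the
# shape of the trade-off: the exponent falls slowly toward 1/c^2 = 0.25,
# while a t^t stand-in for the t^O(t) factor explodes.
c = 2.0
for t in (4, 16, 64, 256):
    exponent = 1.0 / c**2 + math.log(t) / math.sqrt(t)   # ~ 1/c^2 + O(log t / sqrt(t))
    overhead_log10 = t * math.log10(t)                   # log10 of t^t
    print(f"t = {t:3d}: exponent ~ {exponent:.3f}, t^t factor ~ 10^{overhead_log10:.0f}")
```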
To address this issue, the algorithm needs to be modified so that it is efficient for more moderate values of n. The specific modification is to replace the point configurations mentioned above with known constructions of point sets in a fixed dimension. The Leech lattice is used for this purpose, giving a variant of the algorithm that works in 24 dimensions. The resulting algorithm has ρ(c) such that ρ(2) ≤ 0.37, and the leading term in the running time is reduced to only a few hundred operations. Moreover, if the dimension d does not exceed 24, the exponent is reduced further to ρ(2) ≤ 0.27, while the leading term in the running time remains the same (Andoni and Indyk 271).
Partitioning method
The modification is made more effective by the introduction of a new partitioning method; a different way of partitioning is needed in order to obtain significant improvements. The practical partitioning method, Leech lattice LSH, avoids the t^O(t) factor in the partitioning process. The specific strategy used is one of tessellations, induced by randomly shifted Voronoi diagrams of point constellations in a fixed dimension. The required properties of these Voronoi diagrams are, first, that the constellation point closest to a given point can be found efficiently, and second, that the exponent ρ induced by the constellation is as close to 1/c^2 as possible.
Partitioning is then carried out by projecting the points into R^24 and using the Voronoi diagram of a 24-dimensional constellation known as the Leech lattice, which meets both of the properties above. The first property is achieved by using the bounded-distance decoder of [2], which performs only about 519 floating-point operations per decoded point. The second property is also met, since the ρ(c) provided by the decoder is attractive: when c = 2, the exponent ρ(2) is less than 0.37. One reason for this is that the Leech lattice is a very symmetric lattice, so its Voronoi cells are quite round. In addition, if d does not exceed 24, the dimensionality-reduction step can be skipped, in which case ρ(2) ≤ 0.27 while the leading term in the running time remains the same.
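As an illustration of the tessellation idea, the following minimal sketch uses a randomly shifted cubic grid in 24 dimensions as a simple stand-in for the Leech lattice Voronoi diagram (an actual Leech-lattice decoder, such as the bounded-distance decoder cited above, is far more involved); the grid-cell identifier plays the role of the hash value, and the dimensions and cell width are illustrative choices:

```python
import numpy as np

# A simplified stand-in for the Leech-lattice partitioning: a randomly shifted
# cubic grid in t = 24 dimensions. The tuple of integer cell coordinates plays
# the role of the hash value; d, w and the projection are illustrative choices.
rng = np.random.default_rng(0)

d, t, w = 128, 24, 4.0
shift = rng.uniform(0, w, size=t)                 # random shift of the tessellation
proj = rng.normal(size=(t, d)) / np.sqrt(t)       # random projection from R^d to R^t

def cell_of(point):
    """Return the integer coordinates of the grid cell containing the projected point."""
    projected = proj @ point
    return tuple(np.floor((projected + shift) / w).astype(int))

p = rng.normal(size=d)
q = p + 0.01 * rng.normal(size=d)                 # a point very close to p
print(cell_of(p) == cell_of(q))                   # nearby points usually share a cell
```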
Locality-Sensitive Hashing (LSH)
Locality-sensitive hashing is another important part of the paper, because it is the algorithm used to solve the c-approximate near neighbor problem that is the paper's main subject. The LSH algorithm relies on the availability of locality-sensitive hash functions.
The LSH algorithm is used to find the points nearest to a query in large databases. With high probability, equal to 1 − δ, the point in the database nearest to the query is returned. One naive way to solve this problem is to iterate over every point in the database and compute its distance to the query. The difficulty with this approach is that the database may contain billions of objects, each described by a vector with hundreds of dimensions. Under these constraints, a better method was needed; the existing solutions include hash-based and tree-based structures.
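For concreteness, the following minimal sketch (with toy sizes, not billions of points) shows the linear-scan baseline that LSH is designed to improve upon:

```python
import numpy as np

# A minimal sketch (with toy sizes) of the linear-scan baseline that LSH is
# designed to improve on: every query is compared against every database point.
rng = np.random.default_rng(1)

database = rng.normal(size=(10_000, 128))    # n points in d = 128 dimensions

def nearest_by_scan(query):
    """Return the index of the exact nearest neighbor by checking every point."""
    distances = np.linalg.norm(database - query, axis=1)
    return int(np.argmin(distances))

query = rng.normal(size=128)
print(nearest_by_scan(query))
```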
The principle behind LSH is that if two points are close together, then after a projection they will remain close to one another. For example, two points that are close on a sphere are still close when the sphere is projected onto a two-dimensional page, and this remains true however the sphere is rotated. Two points on the sphere that are far apart may appear close on the page, but will appear far apart again after some rotations (Amrani and Beery 24).
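This behavior can be checked numerically. The following small sketch (not code from the paper) projects a close pair and a far pair of points down to two dimensions and compares their distances before and after projection:

```python
import numpy as np

# A small numerical illustration (not code from the paper): after a random
# projection to a low dimension, a pair of nearby points tends to stay nearby,
# while a pair of distant points usually stays distant.
rng = np.random.default_rng(2)

base = rng.normal(size=100)
near = base + 0.1 * rng.normal(size=100)   # a point close to base
far = rng.normal(size=100)                 # an unrelated, typically distant point

proj = rng.normal(size=(2, 100)) / np.sqrt(2)   # random projection to 2 dimensions

for label, other in (("near pair", near), ("far pair", far)):
    before = np.linalg.norm(base - other)
    after = np.linalg.norm(proj @ (base - other))
    print(f"{label}: distance {before:.2f} before projection, {after:.2f} after")
```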
One of the most important ideas in the LSH algorithm is the use of random projections, referred to here as scalar projections. The scalar projection is quantized into a set of hash bins, with the intention that items that are close in the original space will fall into the same bin. For the projection operator to serve the purpose of this paper's research problem, nearby points must be projected close together.
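A minimal sketch of one such scalar hash, assuming the standard p-stable form of Datar et al. with a Gaussian projection direction, a random offset, and bins of width w (the dimension and bin width below are illustrative choices), is shown here:

```python
import numpy as np

# A minimal sketch of one scalar LSH function, assuming the p-stable form of
# Datar et al.: project onto a random Gaussian direction, add a random offset,
# and quantize into bins of width w (d and w are illustrative choices).
rng = np.random.default_rng(3)

d, w = 128, 4.0
a = rng.normal(size=d)        # random projection direction (2-stable / Gaussian)
b = rng.uniform(0, w)         # random offset

def scalar_hash(p):
    """Quantize the shifted scalar projection of p into an integer bin index."""
    return int(np.floor((a @ p + b) / w))

p = rng.normal(size=d)
q = p + 0.05 * rng.normal(size=d)            # a point close to p
print(scalar_hash(p), scalar_hash(q))        # close points usually share a bin
```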
Hashing is implemented by placing the data points into hash buckets, using the projection and quantization described above. Each point is then described by k integer indices. Because this k-dimensional index space is sparse, conventional hashing is used so that points that fall into the same bucket can be retrieved efficiently.
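The bucketing step can be sketched as follows, with k scalar hashes concatenated into a tuple of integer indices that an ordinary hash table then indexes (the sizes and parameters are illustrative, not the paper's tuned values):

```python
import numpy as np
from collections import defaultdict

# A sketch of the bucketing step: each point is described by k integer indices
# (one per scalar hash), and the tuple of indices is fed to an ordinary hash
# table so that only points sharing a bucket ever need to be compared.
# The sizes and parameters below are illustrative, not the paper's values.
rng = np.random.default_rng(4)

d, k, w = 128, 8, 4.0
directions = rng.normal(size=(k, d))     # k random projection directions
offsets = rng.uniform(0, w, size=k)      # k random offsets

def bucket_key(p):
    """Concatenate k quantized scalar projections into a single bucket key."""
    return tuple(np.floor((directions @ p + offsets) / w).astype(int))

table = defaultdict(list)
database = rng.normal(size=(10_000, d))
for i, point in enumerate(database):
    table[bucket_key(point)].append(i)   # the tuple itself is hashed by the dict

query = database[0] + 0.01 * rng.normal(size=d)
print(table[bucket_key(query)])          # candidate indices, typically including 0
```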
The performance of the LSH algorithm can be evaluated in terms of its accuracy, which is determined by the probability that the algorithm finds the true near neighbor of the query point. This analysis is carried out by introducing p-stable distributions.
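The collision probability of a single hash, as a function of the distance between two points, can be estimated empirically. The following Monte Carlo sketch (an illustration, not the paper's analysis) uses Gaussian projections, i.e. the 2-stable case:

```python
import numpy as np

# A rough Monte Carlo illustration of the accuracy measure described above: the
# probability that a single p-stable hash (Gaussian projections, so p = 2)
# puts two points at distance r into the same bin, estimated empirically.
rng = np.random.default_rng(5)

d, w, trials = 128, 4.0, 20_000

def collision_probability(r):
    """Estimate P[h(p) == h(q)] for two points at Euclidean distance r."""
    p = np.zeros(d)
    q = np.zeros(d); q[0] = r                 # any pair at distance r behaves the same
    a = rng.normal(size=(trials, d))
    b = rng.uniform(0, w, size=trials)
    hp = np.floor((a @ p + b) / w)
    hq = np.floor((a @ q + b) / w)
    return float(np.mean(hp == hq))

for r in (0.5, 1.0, 2.0, 4.0):
    print(f"distance {r}: collision probability ~ {collision_probability(r):.2f}")
```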
The LSH algorithm has been applied in database management systems, where it is used to find the points nearest to a query when searching very large databases. The difference between conventional hashing and LSH is that LSH does not look for exact matches; instead, it exploits the locality of the points, so that the distances between points are approximately preserved by the hashing process (Ailon and Chazelle 63).
Relationship between LSH and c-NN
There is a close relationship between the new, modified c-NN algorithm and the LSH family, because the algorithm uses the ℓ2 functions found in the LSH family. LSH is important to the development of the new algorithm because the algorithm builds on an ideal LSH family. A new concept is also introduced in order to reduce U, the number of grids required to cover the space. The LSH family is needed to initialize the hash functions h and to compute the value h(p) for a given point p ∈ R^d.
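Putting these pieces together, the following end-to-end sketch (with toy parameters rather than the values derived in the paper) shows how an ℓ2 LSH family with L hash tables, each indexed by k concatenated hashes, answers a near neighbor query:

```python
import numpy as np
from collections import defaultdict

# An end-to-end sketch with toy parameters (not the values derived in the
# paper): L hash tables, each indexed by k concatenated l2 hashes; a query
# inspects only the L buckets it falls into and verifies candidates exactly.
rng = np.random.default_rng(6)

d, k, L, w = 64, 6, 10, 4.0
database = rng.normal(size=(5_000, d))

tables = []
for _ in range(L):
    A = rng.normal(size=(k, d))              # k projection directions for this table
    b = rng.uniform(0, w, size=k)            # k offsets for this table
    buckets = defaultdict(list)
    for i, point in enumerate(database):
        buckets[tuple(np.floor((A @ point + b) / w).astype(int))].append(i)
    tables.append((A, b, buckets))

def near_neighbor(query):
    """Return the closest database point found among the L inspected buckets."""
    candidates = set()
    for A, b, buckets in tables:
        candidates.update(buckets[tuple(np.floor((A @ query + b) / w).astype(int))])
    if not candidates:
        return None
    return min(candidates, key=lambda i: np.linalg.norm(database[i] - query))

print(near_neighbor(database[42] + 0.01 * rng.normal(size=d)))   # likely prints 42
```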
Conclusion
The algorithm described in the paper makes extensive use of LSH. It returns neighboring points quickly and introduces a new kind of search in which the points nearest to a given query can be reached with ease, with a probabilistic guarantee that the nearest points will be returned. The algorithm, and the idea of randomized algorithms in general, is important and applicable in internet systems and databases. The search for neighboring points is carried out effectively with this new approach. The theorems involved have been defined and proved. The algorithm is effective and still has room for expansion in future developments of search algorithms. The use of the Leech lattice is another effective strategy employed in the paper.
Works Cited
Ailon, Nir, and Bernard Chazelle. "Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform." Proceedings of the thirty-eighth annual ACM symposium on Theory of computing. ACM, 2006.
Amrani, Ofer, and Yair Beery. "Efficient bounded-distance decoding of the hexacode and associated decoders for the Leech lattice and the Golay code." Communications, IEEE Transactions on 44.5 (1996): 534-537.
Amrani, Ofer, et al. "The Leech lattice and the Golay code: bounded-distance decoding and multilevel constructions." Information Theory, IEEE Transactions on 40.4 (1994): 1030-1043.
Andoni, Alexandr, and Piotr Indyk. "Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions." Foundations of Computer Science, 2006. FOCS'06. 47th Annual IEEE Symposium on. IEEE, 2006.
Arya, Sunil, David M. Mount, Nathan S. Netanyahu, Ruth Silverman, and Angela Wu. "An optimal algorithm for approximate nearest neighbor searching." Proceedings of the fifth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 1994.
Chakrabarti, Amit, and Oded Regev. "An optimal randomised cell probe lower bound for approximate nearest neighbour searching." Foundations of Computer Science, 2004. Proceedings. 45th Annual IEEE Symposium on. IEEE, 2004.
Charikar, Moses, Chandra Chekuri, Ashish Goel, Sudipto Guha, and Serge Plotkin. "Approximating a finite metric by a small number of tree metrics." Foundations of Computer Science, 1998. Proceedings. 39th Annual Symposium on. IEEE, 1998.
Datar, Mayur, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. "Locality-sensitive hashing scheme based on p-stable distributions." Proceedings of the twentieth annual symposium on Computational geometry. ACM, 2004.