Discussion Question
Hive
Hive was developed at Facebook by Jeff Hammerbacher's data team. Facebook was receiving huge volumes of data daily and needed a mechanism for data storage, mining, and analysis. It is this tool that allows Facebook to process thousands of terabytes of data with ease on a daily basis. Hive employs a language called HiveQL (HQL), which works in much the same way as SQL. Although HQL understands a more limited set of commands than SQL, the language is still useful.
Hive gives users with little programming knowledge the ability to process, extract, and analyze data in Hadoop. It provides an open-source, SQL-like abstraction over MapReduce. The tool works best when data is stored in a distributed manner, whereas traditional SQL databases demand strict conformity to schemas at storage time. Hive effectively serves as the SQL interface to Hadoop (Manlove, 2016). Data held in the HBase module of the Hadoop ecosystem is also accessible through Hive. It can likewise be regarded as a data warehousing package built on top of Hadoop for processing and analyzing huge data volumes. Hive does not require users to be conversant with the Java programming language or the Hadoop Application Programming Interface (API), so almost any type of user can make good use of the tool.
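To make the SQL-like nature of HQL concrete, the sketch below is a plain-Python illustration (no Hive cluster assumed) of what a simple GROUP BY query does once Hive translates it into map and reduce steps. The table name `page_views` and its data are hypothetical.

```python
from collections import defaultdict

# Hypothetical HQL a user might submit -- note the SQL-like syntax:
#   SELECT page, COUNT(*) FROM page_views GROUP BY page;
rows = [("home",), ("about",), ("home",), ("home",), ("about",)]

# Map phase: emit (key, 1) pairs, as Hive's generated MapReduce job would.
pairs = [(page, 1) for (page,) in rows]

# Shuffle/reduce phase: sum the counts per key.
counts = defaultdict(int)
for page, n in pairs:
    counts[page] += n

print(dict(counts))  # {'home': 3, 'about': 2}
```

The point is that the user writes only the declarative query; Hive plans and runs the map and reduce phases on the cluster automatically.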
Pig
Pig was developed at Yahoo in 2006 to provide a platform for creating and running MapReduce jobs on voluminous data sets. The major goal of developing Pig was to reduce development time through a multi-query approach. The tool employs a scripting language known as Pig Latin, which is workflow-oriented (Schrader, 2014). Currently, the tool is used by technology giants such as Yahoo, Microsoft, and Google to assemble and store huge data arrays in the form of search logs, web crawls, and click streams.
Although one does not need to be an experienced Java programmer to use Pig, a few coding skills are required. Using Apache Pig, developers can employ a multi-query approach, which lessens the number of iterations required to scan the data. Moreover, the tool supports several nested data types, such as Bags, Tuples, and Maps, and frequently used operations such as Joins, Ordering, and Filters. Pig is popular for several reasons, including:
The tool has a short learning curve for users already familiar with SQL.
It employs a multi-query approach, reducing data scan times.
It delivers performance comparable to that of hand-written MapReduce jobs.
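As a rough illustration of Pig's nested data model and operators, the Python sketch below uses stand-ins for Pig Latin's types (a Tuple as a Python tuple, a Bag as a list of tuples, a Map as a dict) and mimics a tiny GROUP and FILTER pipeline. The data and field names are invented for the example; this is not Pig's actual API.

```python
# Python stand-ins for Pig Latin's data model (illustrative only):
#   Tuple -> Python tuple, Bag -> list of tuples, Map -> dict
records = [("alice", 3), ("bob", 1), ("alice", 5), ("bob", 2)]  # a Bag of Tuples

# Rough equivalent of:  grouped = GROUP records BY user;
grouped = {}
for user, score in records:
    grouped.setdefault(user, []).append((user, score))  # each value is a Bag

# Rough equivalent of:  high = FILTER records BY score > 2;
high = [t for t in records if t[1] > 2]

print(sorted(grouped))  # ['alice', 'bob']
print(high)             # [('alice', 3), ('alice', 5)]
```

In real Pig Latin each statement names a relation, and the runtime chains the steps into as few MapReduce passes as it can, which is where the multi-query savings come from.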
Drill
Drill is a distributed system for performing interactive analysis of voluminous data sets, and it has many similarities to Google's Dremel (MapR, 2016). Although it is not permanently wired to the Hadoop ecosystem, the tool offers a robust, lightweight SQL layer for data access. This gives it the ability to reach beyond the traditional Hadoop-oriented databases into other data stores such as Amazon S3, Swift, Azure Blob Storage, and MongoDB. Below are some of the Drill features that make it preferred by developers (Manlove, 2016).
Speed: The tool is optimized for interactive applications, enabling it to query very large record sets within seconds.
Flexibility: The tool is compatible with all data types, including schema-less and nested data. Moreover, a user can query several schema-less databases such as HBase, MongoDB, and Cassandra with little difficulty.
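The appeal of schema-free querying can be sketched in plain Python. The records below stand in for hypothetical JSON documents that do not share a fixed schema; a Drill-style predicate can still project and filter them, with missing fields behaving like nulls. The field names are invented, and this is only an analogy for Drill's behavior, not its implementation.

```python
# Hypothetical schema-less records, like JSON documents Drill might query.
docs = [
    {"name": "sensor-a", "temp": 21.5},
    {"name": "sensor-b", "temp": 30.1, "humidity": 40},  # extra field
    {"name": "sensor-c"},                                # missing "temp"
]

# Analogue of:  SELECT name FROM docs WHERE temp > 25
# A missing field behaves like NULL and simply fails the predicate.
hot = [d["name"] for d in docs if d.get("temp") is not None and d["temp"] > 25]

print(hot)  # ['sensor-b']
```

No table definition was declared up front; each record carries its own structure, which is exactly what lets Drill query stores such as MongoDB or raw JSON files directly.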
References
Manlove, D. (2016). SQL on Hadoop: The differences and making the right choice. AgilData. Retrieved from http://www.agildata.com/sql-on-hadoop-the-differences-and-making-the-right-choice/
MapR. (2016). Apache Drill. Retrieved from https://www.mapr.com/products/product-overview/apache-drill
Schrader, C. (2014). What is the criteria to choose Pig, Hive, HBase, Storm, Solr, or Spark to analyze your data in Hadoop? Quora. Retrieved from https://www.quora.com/What-is-the-criteria-to-choose-Pig-Hive-Hbase-Storm-Solr-or-Spark-to-analyze-your-data-in-Hadoop