Apache Spark is a unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing.
Apache Hive is a powerful data warehouse tool to process and query big data stored in the Hadoop ecosystem.
Apache Sqoop is a data migration tool to migrate data to and from the Hadoop ecosystem.
Apache Hadoop provides the capability to store and process large amounts of data using distributed processing.
Apache Cassandra is a wide-column NoSQL datastore built to handle large amounts of data. It provides high resilience and high availability.
Apache Kafka is an open-source distributed event streaming platform for high-performance data pipelines, streaming analytics, and data integration.
Spark creates a single partition for each input split. For example, an HDFS file stored as ten 128 MB blocks is read as ten input splits and therefore ten partitions.
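A minimal spark-shell sketch of checking this (the HDFS path is a hypothetical placeholder):

```scala
// spark-shell: sc (SparkContext) is provided by the shell.
// If the file occupies 10 HDFS blocks, textFile creates 10 input splits and hence 10 partitions.
val rdd = sc.textFile("hdfs:///data/events.log")
println(s"Number of partitions: ${rdd.getNumPartitions}")
```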
In Spark, we can use cache() and persist() to save intermediate RDD results so they are not recomputed by later actions.
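A minimal spark-shell sketch of both APIs (the log path and filter condition are illustrative assumptions):

```scala
import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs:///data/app.log")      // hypothetical input path
val errors = logs.filter(_.contains("ERROR"))

errors.cache()                                   // shorthand for persist(StorageLevel.MEMORY_ONLY)
// errors.persist(StorageLevel.MEMORY_AND_DISK) // alternative: spill to disk if it does not fit in memory

errors.count()   // first action computes and caches the RDD
errors.count()   // subsequent actions reuse the cached result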
SparkContext is the entry point of a Spark application and allows it to access the cluster through a resource manager such as YARN.
You can create multiple SparkContexts in a single JVM by setting the configuration spark.driver.allowMultipleContexts to true. However, multiple SparkContexts are discouraged as they can lead to unexpected behaviour.
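A minimal sketch of creating a SparkContext; the application name and master are placeholders, and the allowMultipleContexts flag is shown only because it is mentioned above and is generally discouraged:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-app")    // placeholder application name
  .setMaster("yarn")       // e.g. YARN as the resource manager
  .set("spark.driver.allowMultipleContexts", "true")  // discouraged; for illustration only

val sc = new SparkContext(conf)
```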
In Spark, repartitioning can be done using two APIs: repartition and coalesce.
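A spark-shell sketch contrasting the two (the DataFrame and partition counts are illustrative):

```scala
val df = spark.range(0, 1000000)

// repartition can increase or decrease the number of partitions; it performs a full shuffle.
val wide = df.repartition(200)

// coalesce can only reduce the number of partitions and avoids a full shuffle,
// so it is cheaper when shrinking (e.g. before writing output).
val narrow = wide.coalesce(10)

println(wide.rdd.getNumPartitions)    // 200
println(narrow.rdd.getNumPartitions)  // 10
```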
Speculative execution is an optimization technique in Spark: if a task is taking too long, Spark automatically launches a duplicate of that task on another worker and accepts the result of whichever copy completes first. Speculative execution is disabled by default and can be enabled by setting the spark.speculation property to true.
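A minimal sketch of enabling it when building a SparkSession; the extra tuning properties are optional and the values shown are only illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("speculation-demo")                    // placeholder application name
  .config("spark.speculation", "true")            // enable speculative execution
  .config("spark.speculation.quantile", "0.75")   // fraction of tasks that must finish before speculating
  .config("spark.speculation.multiplier", "1.5")  // how much slower than the median a task must be
  .getOrCreate()
```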
The Spark driver is the program that runs on the master node and defines the transformations and actions on RDDs.
The cluster manager is responsible for managing and allocating resources in a cluster. Spark can run on several cluster managers: Standalone, Hadoop YARN, Apache Mesos, and Kubernetes.
There are two deploy modes in Spark: client mode, where the driver runs on the machine that submits the application, and cluster mode, where the driver runs inside the cluster.
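A spark-submit sketch of the two modes; the master, class, and jar are placeholders:

```bash
# Client mode: the driver runs on the submitting machine.
spark-submit --master yarn --deploy-mode client  --class com.example.MyApp my-app.jar

# Cluster mode: the driver runs inside the cluster (e.g. in a YARN container).
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app.jar
```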
Each Spark job consists of stages, which are further broken down into tasks. The number of tasks in each stage is determined by the number of partitions of the RDD/DataFrame processed in that stage.
Answer coming soon
Answer coming soon
Answer coming soon
Answer coming soon
The metastore is a relational database that stores the schemas of Hive tables. By default, Hive ships with an embedded Derby database as a local metastore, but it can be changed to any external relational database.
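A hive-site.xml sketch pointing the metastore at an external MySQL database; the host, database name, and credentials are placeholders:

```xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host:3306/hive_metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive_password</value>
</property>
```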
Hive does not enforce any schema at write time and only applies the schema while reading the data, which is why it is called schema-on-read.
By default, Hive data is stored in the Hive warehouse directory on HDFS at /user/hive/warehouse. The default warehouse directory can be changed through the hive.metastore.warehouse.dir configuration in hive-site.xml. Moreover, Hive tables can also point to another location on HDFS, and for external tables the data is not moved into the warehouse directory.
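A minimal HiveQL sketch of an external table whose data stays at its own HDFS location instead of being moved under the warehouse directory; the table name, columns, and path are illustrative:

```sql
CREATE EXTERNAL TABLE sales (
  id     INT,
  amount DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/raw/sales';
```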
Hive supports three types of execution engines: MapReduce, Tez, and Spark.
Previously MapReduce was the default execution engine in Hive, but it has been replaced by Tez as the default because of its higher performance. However, the Spark execution engine is the fastest of the three as it uses in-memory processing to run complex queries.
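The engine can be switched per session with the hive.execution.engine property (it can also be set globally in hive-site.xml); a minimal sketch:

```sql
SET hive.execution.engine=tez;   -- accepted values: mr, tez, spark
```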
A SerDe (Serializer/Deserializer) handles the IO of different types of files. Hive has many built-in SerDes to serialize and deserialize various file formats such as Parquet, JSON, CSV, and Avro. However, it is also possible to write your own custom SerDe.
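A HiveQL sketch of a table backed by the built-in JSON SerDe; the table and columns are illustrative, and depending on the Hive distribution the hive-hcatalog-core jar may need to be on the classpath:

```sql
CREATE TABLE events_json (
  id   BIGINT,
  type STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;
```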
Hive supports various kinds of joins: inner join, left outer join, right outer join, full outer join, and left semi join.
There are various optimization techniques to run queries faster, such as partitioning and bucketing of tables, map-side (broadcast) joins, vectorized query execution, and columnar storage formats such as ORC or Parquet.
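An illustrative sketch of some of these techniques; the session properties are standard Hive settings, while the table and columns are placeholders:

```sql
SET hive.vectorized.execution.enabled=true;   -- vectorized query execution
SET hive.auto.convert.join=true;              -- convert eligible joins to map-side (broadcast) joins

-- A partitioned, bucketed table stored in the columnar ORC format:
CREATE TABLE orders (
  order_id BIGINT,
  amount   DOUBLE
)
PARTITIONED BY (order_date STRING)
CLUSTERED BY (order_id) INTO 32 BUCKETS
STORED AS ORC;
```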
Sqoop can import data from an RDBMS into various destinations: HDFS, Hive, HBase, and Accumulo.
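A sketch of importing straight into a Hive table rather than plain HDFS files; the connection string, credentials, and table names are placeholders:

```bash
sqoop import \
  --connect jdbc:mysql://db-host:3306/sales \
  --username sqoop_user \
  --table orders \
  --hive-import \
  --hive-table analytics.orders
```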
Sqoop commands or jobs are converted into MapReduce programs that run on top of the YARN cluster manager. By default, 4 mappers are launched so a Sqoop command runs in parallel. The number of mappers can be increased or decreased by setting the -m flag to the desired parallelism in the sqoop command.
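A sketch of raising the parallelism from the default 4 mappers to 8; all connection details and paths are placeholders:

```bash
sqoop import \
  --connect jdbc:mysql://db-host:3306/sales \
  --username sqoop_user \
  --table orders \
  --target-dir /data/raw/orders \
  -m 8
```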
It can be achieved using three things:
We can make use of the sqoop codegen command to generate the underlying Java code.
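A minimal sketch of the command; the connection details, table, and output directory are placeholders:

```bash
sqoop codegen \
  --connect jdbc:mysql://db-host:3306/sales \
  --username sqoop_user \
  --table orders \
  --outdir /tmp/sqoop-codegen
```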
There are two ways to import a table that does not have a primary key: run the import with a single mapper (-m 1), or specify an explicit --split-by column so the data can still be split across multiple mappers.
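Both options sketched below; the connection details, table, and split column are illustrative assumptions:

```bash
# Option 1: single mapper, no splitting needed.
sqoop import --connect jdbc:mysql://db-host:3306/sales --table audit_log \
  --target-dir /data/raw/audit_log -m 1

# Option 2: keep parallelism by naming an explicit (evenly distributed) split column.
sqoop import --connect jdbc:mysql://db-host:3306/sales --table audit_log \
  --target-dir /data/raw/audit_log --split-by event_id -m 4
```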
To compress data, we can use the --compress flag along with --compression-codec to specify the codec. Various compression codecs are supported by Sqoop: Gzip, Snappy, LZ4, Bzip2, and Deflate.
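A sketch using the Snappy codec; connection details and paths are placeholders:

```bash
sqoop import \
  --connect jdbc:mysql://db-host:3306/sales \
  --table orders \
  --target-dir /data/raw/orders_snappy \
  --compress \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec
```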
Large objects such as BLOBs/CLOBs are stored in a separate file known as a "LobFile", which has the capability to store large-sized data records.
The split-by column is used to split the data among the mappers for parallel processing. By default, Sqoop takes the table's primary key as the split-by column to divide the data. However, we can change it in the sqoop command, but the split-by column should be chosen wisely so that its values are evenly distributed and each mapper gets a roughly equal share of the rows.
Sqoop has support for various file formats. By default, data is imported in text file format. Other file formats supported by Sqoop are SequenceFile, Avro, and Parquet.
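A sketch of selecting the output format at import time; connection details and paths are placeholders:

```bash
sqoop import --connect jdbc:mysql://db-host:3306/sales --table orders \
  --target-dir /data/raw/orders_avro --as-avrodatafile

# Alternatives: --as-textfile (the default), --as-sequencefile, --as-parquetfile
```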