TechnoAvengers

Big Data Interview Questions

Apache Spark

Apache Spark is a unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing.

Interview Questions

Apache Hive

Apache Hive is a powerful data warehouse tool to process and query big data stored in the Hadoop ecosystem.

Interview Questions

Apache Sqoop

Apache Sqoop is a data migration tool to migrate data to and from the Hadoop ecosystem.

Interview Questions

Hadoop/Yarn

Apache Hadoop provides the capability to store and process large amounts of data using distributed processing.

Interview Questions

Cassandra

Apache Cassandra is a wide-column NoSQL datastore built to handle large amounts of data. It provides high resilience and high availability.


Interview Questions

Apache Kafka

Apache Kafka is an open-source distributed event streaming platform for high-performance data pipelines, streaming analytics, and data integration.

Interview Questions

Master Big Data

Enroll

Spark Interview Questions

Why is Spark faster than Hadoop MapReduce?

  • Spark avoids unnecessary disk I/O by processing data in the main memory of the worker nodes, whereas MapReduce uses disk to store intermediate results.
  • Spark comes with an advanced directed acyclic graph (DAG) execution model that is more optimized and efficient than MapReduce.
  • Spark DAGs are lazily evaluated, which means evaluation is delayed until it is absolutely necessary. This is one of the key factors contributing to its speed.
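As a quick illustration of the lazy-evaluation point, here is a minimal PySpark sketch (the data and app name are made up, and a local PySpark installation is assumed): transformations only record the DAG, and nothing runs until an action is invoked.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1, 1_000_001))      # no job runs yet
doubled = rdd.map(lambda x: x * 2)             # still no job: the DAG is only recorded
evens = doubled.filter(lambda x: x % 4 == 0)   # still lazy

print(evens.count())                           # action: Spark now plans the DAG and executes it

spark.stop()
```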

How does Spark create partitions from an HDFS file?

Spark creates a single partition for a single input split. For example:

  • In the case of a text file, one Spark partition is approximately equal to one HDFS block (HDFS blocks are split based on the HDFS block size, whereas partitions respect line boundaries).
  • In the case of a compressed file, you get a single partition per file (as compressed text files are not splittable).
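A rough way to see this is to read a file and inspect the resulting partition count. The sketch below is only illustrative: the HDFS path is hypothetical, and the minPartitions hint only takes effect for splittable (uncompressed) files.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()
sc = spark.sparkContext

# Hypothetical path: for a plain text file, partitions roughly follow HDFS blocks.
rdd = sc.textFile("hdfs:///data/events/events.log")
print(rdd.getNumPartitions())

# minPartitions can request more splits for splittable (uncompressed) files.
rdd_more = sc.textFile("hdfs:///data/events/events.log", minPartitions=16)
print(rdd_more.getNumPartitions())

spark.stop()
```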

In Spark, how can we save the result of RDD evaluation to reuse further?

In Spark, we can use cache() and persist() to save intermediate RDD results (a short sketch follows the list).

  • cache() - the default storage level is MEMORY_ONLY, which means intermediate RDD results are stored in main memory. If the RDD is larger than the available memory, some partitions are not cached and are recomputed the next time they are needed.
  • persist() - accepts other storage levels as well, such as MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, and DISK_ONLY.
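A minimal sketch of both calls, assuming a working PySpark setup (the sample data is made up):

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-persist-demo").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["spark", "hive", "sqoop", "kafka"] * 1000)
pairs = words.map(lambda w: (w, 1))
pairs.cache()                                   # same as persist(StorageLevel.MEMORY_ONLY)

counts = pairs.reduceByKey(lambda a, b: a + b)
counts.persist(StorageLevel.MEMORY_AND_DISK)    # spill to disk if memory runs short

print(counts.count())    # first action materializes and stores the results
print(counts.collect())  # reuses the persisted results instead of recomputing

spark.stop()
```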

What is the difference between MEMORY_ONLY and MEMORY_AND_DISK?

  • MEMORY_ONLY - in this persistence level, data is stored in memory, so the space requirement is high and the CPU cost is low. It is comparatively faster but less resilient. If the RDD is larger than memory, some partitions are not cached and are recomputed the next time.
  • MEMORY_AND_DISK - in this persistence level, if the RDD is larger than memory, the excess partitions are stored on disk and retrieved from disk whenever required. The space requirement is high and the CPU cost is medium. It is more resilient but slower than MEMORY_ONLY.

What are the functions of SparkContext?

SparkContext is the entry point that allows a Spark application to access the cluster through a resource manager such as YARN (see the sketch below).

  • It encapsulates all Spark configurations.
  • It can be used to access services like the TaskScheduler, BlockManager, ShuffleManager, etc.
  • It can be used to persist and unpersist RDDs.
  • It can be used for dynamic allocation of executors: requestExecutors, killExecutors, requestTotalExecutors, getExecutorIds.
  • It is used to register Spark listeners.
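A minimal sketch of creating and using a SparkContext directly (the local master and app name are illustrative; on a real cluster the master is usually supplied at submit time):

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("spark-context-demo")   # configuration is carried by the context
        .setMaster("local[2]"))             # illustrative; usually set at submit time
sc = SparkContext(conf=conf)

rdd = sc.parallelize([1, 2, 3, 4]).map(lambda x: x * x)
rdd.persist()                # the context keeps track of persisted RDDs...
print(rdd.sum())
rdd.unpersist()              # ...and can release them again

print(sc.applicationId)      # application metadata exposed through the context
sc.stop()
```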

What is the difference between Checkpointing and Persistence in Spark?

  • In checkpointing, the lineage is destroyed and the intermediate RDDs are physically stored on HDFS.
  • Checkpointed data remains intact even if the Spark context stops.
  • Checkpointed data can be used as an entry point in another driver program.
  • Checkpointing is a costly operation because the computation is done twice: once to cache the result in memory and once more to checkpoint it to storage.
  • With persist/cache, the lineage is not destroyed. If the Spark context stops, the persisted and cached data is also cleared.
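A minimal sketch of the two mechanisms side by side (the checkpoint directory is a hypothetical HDFS path):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()
sc = spark.sparkContext

sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   # hypothetical directory

rdd = sc.parallelize(range(100)).map(lambda x: x * x)

rdd.persist()       # keeps the lineage; data disappears when the context stops
rdd.checkpoint()    # truncates the lineage; data is written to the checkpoint dir on the next action

print(rdd.count())                    # the action triggers both the caching and the checkpoint write
print(rdd.toDebugString().decode())   # the lineage now shows the checkpointed RDD

spark.stop()
```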

Can we create multiple Spark contexts in a Spark application?

You can create multiple Spark contexts in a single JVM by setting the configuration spark.driver.allowMultipleContexts to true. However, multiple Spark contexts are discouraged as they can lead to unexpected behaviour.

Can we repartition data in Spark?

In Spark, repartitioning can be done using two APIs: repartition and coalesce (see the sketch below).

  • repartition can be used to increase or decrease the number of partitions; it causes a complete data shuffle and is therefore a costly operation.
  • coalesce can only be used to decrease the number of partitions; it does not cause a complete data shuffle and is recommended over repartition when the number of partitions is to be reduced.
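A minimal sketch contrasting the two calls (the sample data is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1000), numSlices=8)
print(rdd.getNumPartitions())        # 8

wider = rdd.repartition(16)          # full shuffle: can increase or decrease partitions
print(wider.getNumPartitions())      # 16

narrower = rdd.coalesce(2)           # merges existing partitions, avoids a full shuffle
print(narrower.getNumPartitions())   # 2

spark.stop()
```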


What is Speculative Execution in Apache Spark?

Speculative execution is an optimization technique in Spark: if a task is taking too long, Spark automatically launches a duplicate of that task on another worker and accepts the result of whichever copy completes first. Speculative execution is disabled by default and can be enabled by setting the spark.speculation property to true.
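A minimal sketch of turning it on via configuration (spark.speculation.multiplier is optional and shown only as an example of the related tuning knobs):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("speculation-demo")
         .config("spark.speculation", "true")            # re-launch suspiciously slow tasks
         .config("spark.speculation.multiplier", "1.5")  # how much slower than the median counts as slow
         .getOrCreate())

print(spark.sparkContext.getConf().get("spark.speculation"))
spark.stop()
```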

What are wide and narrow transformations in Spark?

  • Narrow transformations: transformations that do not cause data shuffling, such as map, filter, and flatMap.
  • Wide transformations: transformations that cause data to be shuffled across the cluster, such as groupByKey, join, and reduceByKey.
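A minimal word-count-style sketch that mixes both kinds of transformations (the sample data is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transformations-demo").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["spark is fast", "hive on hadoop", "spark streaming"])

# Narrow transformations: each output partition depends on a single input partition.
words = lines.flatMap(lambda line: line.split())
pairs = words.map(lambda w: (w, 1))
short = pairs.filter(lambda kv: len(kv[0]) > 2)

# Wide transformation: records with the same key must be shuffled to the same partition.
counts = short.reduceByKey(lambda a, b: a + b)

print(counts.collect())
spark.stop()
```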

What is the role of Spark Driver in Spark applications?

The Spark driver is a program that runs on the master node and defines the actions and transformations on RDDs.

  • It creates the SparkContext to connect to the Spark cluster through the resource manager.
  • It splits the Spark application into stages and tasks and feeds them to the DAGScheduler and TaskScheduler.
  • It coordinates with the workers for overall execution.
  • It also hosts the Spark Web UI.

What are the different cluster managers on which a Spark application can run?

The cluster manager is responsible for managing and allocating resources in a cluster. Spark can run on several cluster managers:

  • YARN
  • Mesos
  • Spark Standalone
  • Kubernetes

What are the different deploy modes in Spark?

There are two deploy modes in Spark:

  • Cluster mode - the framework launches the driver inside the cluster.
  • Client mode - the submitter launches the driver outside the cluster.

How is the number of tasks determined in each Spark application?

Each Spark job consists of stages, which are further broken down into tasks. The number of tasks in each stage is determined by (see the sketch below):

  • Map/scan phase: the number of tasks is determined by the number of partitions (blocks of the HDFS files).
  • Reduce/shuffle phase: the number of tasks is determined by spark.default.parallelism for RDDs and spark.sql.shuffle.partitions for Datasets.
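A minimal sketch of setting both knobs (the values are arbitrary; with adaptive query execution enabled, Spark may later coalesce the shuffle partitions):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("task-count-demo")
         .config("spark.default.parallelism", "8")      # shuffle tasks for RDD operations
         .config("spark.sql.shuffle.partitions", "64")  # shuffle tasks for DataFrame/Dataset operations
         .getOrCreate())

df = spark.range(1_000_000)
grouped = df.groupBy((df.id % 10).alias("bucket")).count()
print(grouped.rdd.getNumPartitions())   # reflects spark.sql.shuffle.partitions (unless AQE coalesces them)

spark.stop()
```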

RDD vs DataFrame vs Dataset: which one is faster?

Answer coming soon.

Can we dynamically allocate resources in Spark?

Answer coming soon.

Why are partitions in Spark immutable?

Answer coming soon.

What is sort-merge join in Spark?

Answer coming soon.

How is an RDD recovered in case of failures?

Answer coming soon.

If worker RAM is not sufficient to load a partition, will it cause an out-of-memory issue?

Answer coming soon.

How to decide the number of cores and memory in a Spark application?

Answer coming soon.

Learn Apache Spark

Enroll

Hive Interview Questions

What is a metastore in Hive? Difference between remote & local metastore?

The metastore is a relational database that stores the schema of Hive tables. By default, Hive comes with an embedded Derby database as the local metastore, but it can be changed to any external relational database.


Why is Hive called Schema on Read?

Hive does not enforce any schema at write time and only enforces it while reading the data; that is why it is called schema on read.

Where does the data of a Hive table get stored?

By default, Hive data is stored in the Hive warehouse directory on HDFS at /user/hive/warehouse. The default warehouse directory can be changed via the hive.metastore.warehouse.dir configuration in hive-site.xml. Moreover, Hive tables can also point to some other location on HDFS, and for external tables the data is not moved to the warehouse directory.

What is the difference between external table and managed table?

  • External tables are not managed by Hive: if we drop such a table, the data remains intact and only the schema is dropped from the metastore.
  • Managed tables are tables in which both the data and the schema are managed by Hive: if we drop such a table, both the schema and the data are dropped.
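A minimal sketch of the two table types using Spark SQL's Hive support (the table names and HDFS location are hypothetical, and a Hive-enabled Spark build is assumed); the same DDL also works directly in the Hive shell:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-tables-demo")
         .enableHiveSupport()
         .getOrCreate())

# Managed table: Hive owns both schema and data; DROP TABLE removes both.
spark.sql("CREATE TABLE IF NOT EXISTS sales_managed (id INT, amount DOUBLE) STORED AS ORC")

# External table: Hive only tracks the schema; DROP TABLE leaves the files in place.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales_external (id INT, amount DOUBLE)
    STORED AS ORC
    LOCATION 'hdfs:///data/sales'
""")

spark.stop()
```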

What are the different execution engines supported by Hive?

Hive supports three types of execution engines:

  • Spark
  • MapReduce
  • Tez

Previously MapReduce was the default execution engine in Hive, but it has been replaced by Tez as the default because of its higher performance. The Spark execution engine is the fastest of the three, as it uses in-memory processing to run complex queries.

What is SerDe in Hive?

A SerDe (serializer/deserializer) is used for the I/O of different types of files. Hive has many built-in SerDes to serialize and deserialize various file formats such as Parquet, JSON, CSV, and Avro. However, it is also possible to write your own custom SerDe.

What are the different types of optimized joins supported by Hive?

Hive supports various kinds of joins:

  • Map-side/broadcast join - an optimization technique: if one of the tables in the join condition is small, it is loaded into memory to join with the other table.
  • Bucket map join - if both tables are large and bucketed on the column used in the join, a bucket from one table is loaded into memory to join with the corresponding buckets from the other table.
  • Skew join - used if one of the tables has skewed data in the join column. In this approach, the skewed values are stored in a separate file.
  • Sort-merge bucket (SMB) join - applied when both tables are large, bucketed and sorted on the join column, and have the same number of buckets.

What are the different optimization techniques in Hive?

There are various optimization techniques to make queries run faster, such as (see the sketch below):

  • Partitioning - data is partitioned on selected columns, which drastically improves read performance when queries filter on those columns.
  • Bucketing - data is hashed into buckets while being stored, and the same hashing is used while reading, which improves read performance.
  • ORC file format - the most efficient file format to use with Hive.
  • Tez/Spark execution engines - these execution engines drastically improve query performance.
  • Vectorization - operations such as scans, aggregations, filters, and joins are performed in batches of 1024 rows rather than one row at a time.
  • Indexing - used to speed up read queries, since with an index the database does not need to read all rows in the table to find the matching data.
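A minimal sketch of partitioned and bucketed tables via Spark SQL's Hive support (table and column names are hypothetical); the equivalent DDL works in the Hive shell as well:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-optimization-demo")
         .enableHiveSupport()
         .getOrCreate())

# Partitioning: data is laid out by country, so filters on country prune whole directories.
spark.sql("""
    CREATE TABLE IF NOT EXISTS orders_part (order_id INT, amount DOUBLE)
    PARTITIONED BY (country STRING)
    STORED AS ORC
""")

# Bucketing: rows are hashed on customer_id into a fixed number of buckets,
# which helps joins and sampling on that column.
spark.sql("""
    CREATE TABLE IF NOT EXISTS orders_bucketed (order_id INT, customer_id INT, amount DOUBLE)
    CLUSTERED BY (customer_id) INTO 16 BUCKETS
    STORED AS ORC
""")

spark.stop()
```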


What is the difference between local and remote metastore?

  

  • Local metastore: the metastore service runs in the same JVM as the Hive service and connects to a database running in a separate JVM, either on the same machine or on a remote machine.
  • Remote metastore: the metastore service runs in its own separate JVM, not in the Hive service JVM.
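For context, a minimal sketch of pointing a Spark application at a remote metastore service (the thrift host is a placeholder; 9083 is the usual default port):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("remote-metastore-demo")
         .config("hive.metastore.uris", "thrift://metastore-host:9083")  # placeholder host
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SHOW DATABASES").show()
spark.stop()
```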

Learn Advanced Hive

Enroll

Sqoop Interview Questions

What are the destinations supported by Sqoop?

Sqoop can import data from RDBMS into various destinations:

  • HDFS
  • Hive
  • HBase
  • HCatalog
  • Accumulo

How does a Sqoop command run internally?

Sqoop commands or jobs are converted into MapReduce programs that run on the YARN cluster manager. By default, 4 mappers are launched so that a Sqoop command runs in parallel. The number of mappers can be increased or decreased by setting the -m flag to the desired parallelism in the Sqoop command.

If our source database is changing every day, how can we make sure that the data is synced regularly to HDFS?

It can be achieved using three things:

  • Use incremental (append/lastmodified) Sqoop imports to make sure we import only the changed data from the database.
  • Define a Sqoop job for the incremental import.
  • Schedule the Sqoop job with a scheduler (cron/Oozie/Airflow) to run daily.

How can we generate the underlying Java code for a Sqoop command?

We can use the Sqoop codegen command to generate the underlying Java code.

How can we import a table that does not have any primary key?

There are two ways to import a table that does not have any primary key.

  • Define a split-by column
  • Change the number of mappers to 1 (-m 1)

How can we apply compression while importing data using Sqoop?

To compress data, we can use the --compress flag along with --compression-codec to specify the codec. Sqoop supports various compression codecs: gzip, Snappy, LZ4, bzip2, and deflate.

How are large objects handled in Sqoop?

Large objects such as BLOBs/CLOBs are stored in a separate file known as a "LobFile", which has the capability to store large-sized data records.

What is --split-by used for in Sqoop?

The --split-by column is used to split the data among the mappers for parallel processing. By default, Sqoop takes the table's primary key as the split-by column to divide the data. We can change it in the Sqoop command, but the split-by column should be chosen wisely such that:

  • the split-by column does not have skewed data, and
  • it contains no nulls.

Which file formats are supported by Sqoop?

Sqoop supports various file formats. By default, data is imported in text file format. Other file formats supported by Sqoop are:

  • Parquet
  • Avro
  • Sequence File


Learn Sqoop

Enroll

Coming soon


Copyright © 2019 TechnoAvengers - All Rights Reserved.

