Starting a new batch for Scala&Sparks
Course Content for Hadoop and Spark Batch
Introduction to BIGDATA and HADOOP
What is Big Data?
What is Hadoop?
Relation between Big Data and Hadoop.
What is the need of going ahead with Hadoop?
Scenarios to apt Hadoop Technology in REAL TIME Projects
Challenges with Big Data
How Hadoop is addressing Big Data Changes
Comparison with Other Technologies
o Data Warehouse
Different Components of Hadoop Echo System
o Storage Components
o Processing Components
Importance of Hadoop Echo System Components
Other solutions of Big Data
o Introduction to NO SQL
HDFS (Hadoop Distributed File System)
What is a Cluster Environment?
Cluster Vs Hadoop Cluster.
Significance of HDFS in Hadoop
Features of HDFS
Storage aspects of HDFS
o How to Configure block size
o Default Vs Configurable Block size
o Why HDFS Block size so large?
o Design Principles of Block Size
HDFS Architecture - 5 Daemons of Hadoop
NameNode and its functionality
DataNode and its functionality
JobTracker and its functionality
TaskTrack and its functionality
Secondary Name Node and its functionality.
Replication in Hadoop – Fail Over Mechanism
Data Storage in Data Nodes
Fail Over Mechanism in Hadoop – Replication
Design Constraints with Replication Factor
Can we change the replication factor in Hadoop?
Can we change the block size for a file or directory in Hadoop?
CLI (Command Line Interface) and HDFS Commands
Java Based Approach
Configuration files in Hadoop Installation and the Purpose
How to & Where to Configure Hadoop Daemons in a Hadoop Cluster?
Difference between Hadoop 1.X.X and Hadoop 2.X.X version
o Name Node HA (High Availability in Hadoop 2.X.X)
Why Map Reduce is essential in Hadoop?
Processing Daemons of Hadoop
o Roles Of Job Tracker
o Drawbacks w.r.to Job Tracker failure in Hadoop Cluster
o How to configure Job Tracker in Hadoop Cluster
o Roles of Task Tracker
o Drawbacks w.r.to Task Tracker Failure in Hadoop Cluster
Need Of Input Split in Map Reduce
InputSplit Size Vs Block Size
InputSplit Vs Mappers
Map Reduce Life Cycle
Communication Mechanism of Job Tracker & Task Tracker
Input Format Class
Record Reader Class
Success Case Scenarios
Failure Case Scenarios
Retry Mechanism in Map Reduce
MapReduce Programming Model
Different phases of Map Reduce Algorithm
Different Data types in Map Reduce
o Primitive Data types Vs Map Reduce Data types
How to write a basic Map Reduce Program
Importance of Driver Code in a Map Reduce program
How to Identify the Driver Code in Map Reduce program
Different sections of Driver code
Importance of Mapper Phase in Map Reduce
How to Write a Mapper Class?
Methods in Mapper Class
Importance of Reduce phase in Map Reduce
How to Write Reducer Class?
Methods in Reducer Class
IDENTITY MAPPER & IDENTITY REDUCER
Input Format’s in Map Reduce
How to use the specific input format in Map Reduce
How to write Custom Input Format Class and Custom Record Reader
Output Format’s in Map Reduce
How to use the specific Output format in Map Reduce
How to write Custom Output Format Class and Custom Record Writer
Map Reduce API(Application Programming Interface)
o New API
o Deprecated API
Combiner in Map Reduce
o Is combiner mandate in Map Reduce
o How to use the combiner class in Map Reduce
o Performance tradeoffs w.r.to Combiner
o Real Time Use Cases
o Where to Use & Where Not to Use Combiner
Partitioner in Map Reduce
o Importance of Practitioner class in Map Reduce
o How to use the Partitioner class in Map Reduce
o Different types of Practitioners in Map Reducer
o Importance of hashPartitioner
o How to write a custom Practitioner
o Real Time Use Cases
Compression Techniques in Map Reduce
o Importance of Compression in Map Reduce
o What is CODEC
o Compression Types
o Configurations w.r.to Compression Techinques
o How to customize the Compression per one job Vs all the job.
Map Reduce Job Chaining
o What is Map Reduce Job Chaining?
o Use of MR Chaining in Real Time Hadoop Projects
o Real Time Use case
o Performance trade off’s using MR Chaining
Joins - in Map Reduce
o Map Side Join
o Reduce Side Join
o Performance Trade Off
o Distributed cache
How to debug MapReduce Jobs in Local and Pseudo cluster Mode.
o Introduction to MapReduce Streaming
o Data locality in Map Reduce
o Secondary Sorting Using Map Reduce
Introduction to Apache Pig
Map Reduce Vs Apache Pig
SQL Vs Apache Pig
Different datat ypes in Pig
Where to Use Map Reduce and PIG in REAL Time Hadoop Projects
Modes Of Execution in Pig
o Local Mode
o Map Reduce OR Distributed Mode
o Grunt Shell
Transformations in Pig
How to write a simple pig script
Parameter substitution in PIG Scripts
How to develop the Complex Pig Script
Bags , Tuples and fields in PIG
UDFs in Pig
o Need of using UDFs in PIG
o How to use UDFs
o REGISTER Key word in PIG
Techniques to improve the performance and efficiency of Pig Latin
Need of Apache HIVE in Hadoop
When to choose PIG & HIVE in REAL Time Project
o Executor(Semantic Analyzer)
Meta Store in Hive
o Importance Of Hive Meta Store
o Embedded metastore configuration
o External metastore configuration
o Communication mechanism with Metastore
Hive Integration with Hadoop
Hive Query Language(Hive QL)
Configuring Hive with MySQL MetaStore
SQL VS Hive QL
Data Slicing Mechanisms
o Partitions In Hive
o Buckets In Hive
o Partitioning Vs Bucketing
o Real Time Use Cases
Collection Data Types in HIVE
o Real Time Use Cases
User Defined Functions(UDFs) in HIVE
o Need of UDFs in HIVE
Hive Serializer/Deserializer - SerDe
Semi Structured Data Processing Using Hive
HIVE – HBASE Integration
Introduction to Sqoop.
MySQL client and Server Installation
How to connect to Relational Database using Sqoop
Different Sqoop Commands
o Different flavors of Imports
HDFS Vs Hbase
Hbase Vs RDBMS
Hbase Vs NO SQL
Hbase Data modeling Elements
o Column families
o Column Qualifier Name
o Row Key
o Java Based
Map Reduce Integration
Map Reduce over Hbase
o Schema Definition
o Basic CRUD Operations
o Client Side Buffering in Hbase
Flume Master , Flume Collector and Flume Agent
Real Time Use Case using Apache Flume
Oozie Configuration Files
Oozie Job Submission
o Transit parameters in workflow.xml
YARN (Yet another Resource Negotiator) – Next Gen. MapReduce
What is YARN?
Difference between Map Reduce & YARN
o Resource Manager
o Application Master
o Node Manager
When should we go ahead with YARN
YARN Process flow
YARN Web UI
Different Configuration Files for YARN
Examples on YARN
What is Impala?
How can we use Impala for Query Processing?
When should we go ahead with Impala
HIVE Vs Impala
REAL TIME Use Cases with Impala
MongoDB ( As part of NoSQL Databases )
Need of NoSQL Databases
Relational VS Non-Relational Databases
Introduction to MongoDB
Features of MongoDB
Installation of MongoDB
Mongo DB Basic operations
REAL Time Use Cases on Hadoop & MongoDB Use Cases
Introduction to Cassandra
Mongo DB Vs Cassandra
Basic Operation using Cassandra
Apache Kafka (A Distributed Message Queuing System)
Introduction to Kafka
Installation of Kafka
Difference between MQ Vs Kafka
Basic Operation using Kafka
Mahout (As a part of BIGDATA ANALYTICS)
Introduction to Machine Learning (ML) Languages
Types of Machine Learning
Introduction to Apache MAHOUT
Categories of Mahout Algorithms
Real Time Use case using Classifier Algorithm of Mahout
– Naives Bayes
SCALA (Object Oriented and Functional Programming)
Getting started With Scala.
Scala Background, Scala Vs Java and Basics.
Interactive Scala – REPL, data types, variables,expressions, simple
Running the program with Scala Compiler.
Explore the type lattice and use type inference
Define Methodsand Pattern Matching.
Scala Environment Set up.
Scala set up on Windows.
Scala set up on UNIX.
What is Functional Programming.
Differences between OOPS and FPP.
Collections (Very Important for Spark)
Iterating, mapping, filtering and counting
Regular expressions and matching with them.
Maps, Sets, group By, Options, flatten, flat Map
Word count, IO operations,file access, flatMap
Object Oriented Programming.
Classes and Properties.
Objects, Packaging and Imports.
Objects, classes, inheritance, Lists with multiple related types, apply
What is SBT?
Integration of Scala in Eclipse IDE.
Integration of SBT with Eclipse.
Batch versus real-time data processing
Introduction to Spark, Spark versus Hadoop
Architecture of Spark.
Coding Spark jobs in Scala
Exploring the Spark shell -> Creating Spark Context.
Operations on RDD.
Loading Data and Saving Data.
Key Value Pair RDD.
Configuring and running the Spark cluster.
Exploring to Multi Node Spark Cluster.
Submitting Spark jobs and running in the cluster mode.
Developing Spark applications in Eclipse
Tuning and Debugging Spark.
CASSANDRA (N0SQL DATABASE)
Getting started with architecture
Communicating with Cassandra.
Creating a database.
Create a table
Creating an Application with Web.
Updating and Deleting Data.
SPARK INTEGRATION WITH NO SQL (CASSANDRA) and AMAZON EC2
Introduction to Spark and Cassandra Connectors.
Spark With Cassandra -> Set up.
Creating Spark Context to connect the Cassandra.
Creating Spark RDD on the Cassandra Data base.
Performing Transformation and Actions on the Cassandra RDD.
Running Spark Application in Eclipse to access the data in the Cassandra.
Introduction to Amazon Web Services.
Building 4 Node Spark Multi Node Cluster in Amazon Web Services.
Deploying in Production with Mesos and YARN.
Introduction of Spark Streaming.
Architecture of Spark Streaming
Processing Distributed Log Files in Real Time
Discretized streams RDD.
Applying Transformations and Actions on Streaming Data
Integration with Flume and Kafka.
Integration with Cassandra
Monitoring streaming jobs.
Introduction to Apache Spark SQL
The SQL context
Importing and saving data
Processing the Text files,JSON and Parquet Files
Local Hive Metastore server
Introduction to Machine Learning
Types of Machine Learning.
Introduction to Apache Spark MLLib Algorithms.
Machine Learning Data Types and working with MLLib.
Regression and Classification Algorithms.
Decision Trees in depth.
Classification with SVM, Naive Bayes
Clustering with K-Means
Building the Spark server
What we are offering as part of this Course?
3 REAL TIME Hadoop Projects End-to-End Explanation with architecture.
Mock Interviews will be conducted on a one-to-one basis after the
Hard Copy & Soft Copy Materials for all the Components.
Detailed Assistance in RESUME Preparation on a one-to-one basis with
Real Time Projects based on your technical back ground.
All the Real time interview questions and answers will be provided.
Discussing the new happenings in Hadoop
Discussing the Interview Questions on a daily basis
Discussing Certification (CCA 175 – Spark and Hadoop Certification)
Related topics on a daily basis.
Proof Of Concept using complex architectures to give a real time idea