
Spark 2.0 Internals, Kafka, and NoSQL DBs
Training Description
In this training, Spark is explored in great detail. The Spark programming paradigm is given due importance: RDDs are explored as data structures, their novelty is brought into focus, and they are contrasted with Hadoop MapReduce. All of this is done on a proper cluster. Most importantly, the course shows how Spark interacts with other components such as Cassandra, HBase, MongoDB, Kafka and Storm: for example, how to perform a distributed join in Spark across multiple tables, or how to process real-time events from Kafka and then store the processed data in MongoDB for online consumption (a sketch of such a pipeline follows).
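To make the Kafka-to-MongoDB example concrete, here is a minimal sketch, assuming Spark Streaming with the spark-streaming-kafka-0-10 integration and the MongoDB Java driver on the classpath; the broker address, topic, database and field names are placeholders, not part of the course material:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

    object KafkaToMongo {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("kafka-to-mongo"), Seconds(5))

        val kafkaParams = Map[String, Object](
          "bootstrap.servers"  -> "localhost:9092",          // placeholder broker
          "key.deserializer"   -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id"           -> "training-demo")

        val stream = KafkaUtils.createDirectStream[String, String](
          ssc,
          LocationStrategies.PreferConsistent,
          ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams))

        // Count events per key in each micro-batch (assumes keyed messages),
        // then write the counts to Mongo for online consumption.
        stream.map(r => (r.key, 1L)).reduceByKey(_ + _).foreachRDD { rdd =>
          rdd.foreachPartition { records =>
            // One Mongo connection per partition, not per record.
            val client = new com.mongodb.MongoClient("localhost")
            val coll = client.getDatabase("demo").getCollection("counts")
            records.foreach { case (k, n) =>
              coll.insertOne(new org.bson.Document("key", k).append("count", Long.box(n)))
            }
            client.close()
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }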
Intended Audience
- Programmers
- Engineers
Instructional Method
This is an instructor-led course that combines lectures with hands-on application of Spark, Spark Streaming and the underlying technologies. Most concepts are presented pictorially, and a detailed case study strings together the technologies, patterns and design.
Key Skills
- Spark: In a Cluster
- Spark: Streaming
- Spark: Programming Model
- Spark: RDDs Compared with Other Techniques
- Spark: Detailed Architecture
- Spark: RDDs as a Data Structure
Pre-requisites
- Good knowledge of Java 8 lambda expressions
- Good understanding of Hadoop
- Good understanding of Java
- Good understanding of HDFS
- Familiarity with Linux, as it is the training platform

Topics
Module 01
- Interfaces
- Hadoop Filesystems
- The Design of HDFS
Using Hadoop Archives
- Limitations
Parallel Copying with distcp
- Keeping an HDFS Cluster Balanced
- Hadoop Archives
Data Flow
- Anatomy of a File Write
- Anatomy of a File Read
- Coherency Model
The Command-Line Interface
- Basic Filesystem Operations
The Java Interface
- Querying the Filesystem
- Reading Data Using the FileSystem API
- Directories
- Deleting Data
- Reading Data from a Hadoop URL
- Writing Data
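A minimal sketch of the FileSystem API read path listed above; the HDFS URI and file path are placeholders:

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.io.IOUtils

    object HdfsRead {
      def main(args: Array[String]): Unit = {
        val uri  = "hdfs://localhost:9000/user/demo/input.txt" // placeholder path
        val conf = new Configuration()
        val fs   = FileSystem.get(URI.create(uri), conf)       // the HDFS client
        val in   = fs.open(new Path(uri))                      // FSDataInputStream
        try {
          IOUtils.copyBytes(in, System.out, 4096, false)       // stream file to stdout
        } finally {
          IOUtils.closeStream(in)
        }
      }
    }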
 
Module 02
Data Integrity
- ChecksumFileSystem
- LocalFileSystem 
- Data Integrity in HDFS 
Serialization
- Implementing a Custom Writable
- Serialization Frameworks 
- The Writable Interface 
- Writable Classes 
- Avro 
ORC Files
- Large size enables efficient read of columns
- New types (datetime, decimal) 
- Encoding specific to the column type 
- Default stripe size is 250 MB 
- A single file as output of each task 
- Split files without scanning for markers 
- Bound the amount of memory required for reading or writing. 
- Lowers pressure on the NameNode 
- Dramatically simplifies integration with Hive 
- Break file into sets of rows called a stripe 
- Complex types (struct, list, map, union) 
- Support for the Hive type model 
ORC Files: Footer
- Count, min, max, and sum for each column
- Types, number of rows 
- Contains list of stripes 
ORC Files: Index
- Required for skipping rows
- Position in each stream 
- Min and max for each column 
- Currently every 10,000 rows 
- Could include bit field or bloom filter 
ORC Files: Postscript
- Contains compression parameters
- Size of compressed footer 
ORC Files: Data
- Directory of stream locations
- Required for table scan 
Parquet
- Nested Encoding
- Configurations 
- Error recovery 
- Extensibility 
- Nulls 
- File format 
- Data Pages 
- Motivation 
- Unit of parallelization 
- Logical Types 
- Metadata 
- Modules 
- Column chunks 
- Separating metadata and column data 
- Checksumming 
- Types 
File-Based Data Structures
- MapFile
- SequenceFile 
Compression
- Codecs
- Using Compression in MapReduce 
- Compression and Input Splits 
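As a taste of the file-based data structures above, a sketch that writes and reads back a SequenceFile with Writable key/value pairs; the output path is a placeholder:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{IntWritable, SequenceFile, Text}

    object SeqFileDemo {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()
        val path = new Path("/tmp/demo.seq") // placeholder output path

        // Write key/value records; Writables handle their own serialization.
        val writer = SequenceFile.createWriter(conf,
          SequenceFile.Writer.file(path),
          SequenceFile.Writer.keyClass(classOf[IntWritable]),
          SequenceFile.Writer.valueClass(classOf[Text]))
        for (i <- 1 to 3) writer.append(new IntWritable(i), new Text(s"record-$i"))
        writer.close()

        // Read the records back in order.
        val reader = new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))
        val k = new IntWritable(); val v = new Text()
        while (reader.next(k, v)) println(s"$k -> $v")
        reader.close()
      }
    }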
 
Module 03
- GraphX
- MLlib 
- Spark SQL 
- Data Processing Applications 
- Spark Streaming 
- What Is Apache Spark? 
- Data Science Tasks 
- Storage Layers for Spark 
- Spark Core 
- Who Uses Spark, and for What? 
- A Unified Stack 
- Cluster Managers 
 
Module 04
- Lazy Evaluation
- Common Transformations and Actions 
- Passing Functions to Spark 
- RDD Operations 
- Creating RDDs 
- Actions 
- Transformations 
- Scala 
- Java 
- Persistence 
- Python 
- Converting Between RDD Types 
- RDD Basics 
- Basic RDDs 
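A minimal sketch of the RDD model covered in this module: transformations build a lineage lazily, an action triggers execution, and persist() caches the result:

    import org.apache.spark.{SparkConf, SparkContext}

    object RddBasics {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("rdd-basics").setMaster("local[*]"))

        val lines  = sc.parallelize(Seq("spark is fast", "spark is lazy"))
        val words  = lines.flatMap(_.split(" "))          // transformation: nothing runs yet
        val sparks = words.filter(_ == "spark").persist() // cached on first action

        println(sparks.count())  // action: the whole lineage executes here
        println(sparks.first())  // second action: served from the cache
        sc.stop()
      }
    }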
 
Module 05
- Expressing Existing Programming Models
- Fault Recovery 
- Interpreter Integration 
- Memory Management 
- Implementation 
- MapReduce 
- RDD Operations in Spark 
- User Applications Built with Spark 
- Google's Pregel 
- Console Log Mining
- Iterative MapReduce 
- Behavior with Insufficient Memory 
- A Fault-Tolerant Abstraction 
- Support for Checkpointing 
- Evaluation 
- Spark Programming Interface 
- Job Scheduling 
- Advantages of the RDD Model 
- Understanding the Speedup 
- Leveraging RDDs for Debugging 
- Iterative Machine Learning Applications 
- Explaining the Expressivity of RDDs 
- Representing RDDs 
- Applications Not Suitable for RDDs 
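The fault-tolerance ideas above are visible directly in the API: toDebugString prints the lineage Spark would replay after a failure, and checkpoint() truncates that lineage by persisting to reliable storage. A sketch, with placeholder input and checkpoint paths:

    import org.apache.spark.{SparkConf, SparkContext}

    object LineageDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("lineage").setMaster("local[*]"))
        sc.setCheckpointDir("/tmp/checkpoints")      // placeholder directory

        val counts = sc.textFile("/tmp/console.log") // placeholder input
          .filter(_.contains("ERROR"))
          .map(line => (line.split(" ")(0), 1))
          .reduceByKey(_ + _)

        println(counts.toDebugString) // the lineage used for fault recovery
        counts.checkpoint()           // cut the lineage at this RDD
        counts.count()                // materializes both the RDD and the checkpoint
        sc.stop()
      }
    }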
 
Module 06
- Sorting Data
- Determining an RDD’s Partitioner 
- Operations That Affect Partitioning 
- Grouping Data 
- Motivation 
- Aggregations 
- Data Partitioning (Advanced) 
- Actions Available on Pair RDDs 
- Joins 
- Creating Pair RDDs 
- Operations That Benefit from Partitioning 
- Transformations on Pair RDDs 
- Example: PageRank 
- Custom Partitioners 
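A sketch of the partitioning ideas above: co-partitioning both sides of a join with the same HashPartitioner means matching keys are already co-located, so the join avoids a full shuffle. The data is made up:

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    object PairJoin {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("pair-join").setMaster("local[*]"))
        val byKey = new HashPartitioner(4)

        val users  = sc.parallelize(Seq((1, "ann"), (2, "bob")))
          .partitionBy(byKey).persist()   // partitioned once, reused across jobs
        val orders = sc.parallelize(Seq((1, 9.99), (1, 5.0), (2, 3.5)))
          .partitionBy(byKey)

        // Both RDDs share a partitioner, so the join needs no re-shuffle.
        users.join(orders).collect().foreach(println)  // e.g. (1,(ann,9.99))
        sc.stop()
      }
    }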
 
Module 07
- File Formats
- Hadoop Input and Output Formats 
- Local/“Regular” FS 
- Text Files 
- Java Database Connectivity 
- Structured Data with Spark SQL 
- Elasticsearch 
- File Compression 
- Apache Hive 
- Cassandra 
- Object Files 
- Comma-Separated Values and Tab-Separated Values 
- HBase 
- Databases 
- Filesystems 
- SequenceFiles 
- JSON 
- HDFS 
- Motivation 
- JSON 
- Amazon S3 
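A sketch of the load/save paths above using the Spark 2.0 DataFrame reader and writer; the file paths are placeholders:

    import org.apache.spark.sql.SparkSession

    object LoadSave {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("load-save").master("local[*]").getOrCreate()

        // JSON in: one JSON object per line, schema inferred.
        val people = spark.read.json("/tmp/people.json")       // placeholder path
        people.printSchema()

        // CSV in, with a header row.
        val sales = spark.read.option("header", "true").csv("/tmp/sales.csv")

        // Parquet out: columnar, compressed, splittable.
        sales.write.parquet("/tmp/sales.parquet")
        spark.stop()
      }
    }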
 
Module 08
- Scheduling Within and Between Spark Applications
- Spark Runtime Architecture 
- A Scala Spark Application Built with sbt 
- Packaging Your Code and Dependencies 
- Launching a Program 
- A Java Spark Application Built with Maven 
- Hadoop YARN 
- Deploying Applications with spark-submit 
- The Driver 
- Standalone Cluster Manager 
- Cluster Managers 
- Executors 
- Amazon EC2 
- Cluster Manager 
- Dependency Conflicts 
- Apache Mesos 
- Which Cluster Manager to Use? 
 
Module 09
Spark: YARN Mode
- Resource Manager
- Node Manager 
- Workers 
- Containers 
- Threads 
- Task 
- Executors
- Application Master 
- Multiple Applications 
- Tuning Parameters 
Spark: Local Mode
Spark: Caching
- With Serialization
- Off-heap 
- In Memory 
Spark Serialization
Spark: Standalone Mode
- Task
- Multiple Applications 
- Executors
- Tuning Parameters 
- Workers 
- Threads 
 
Module 10
- Working on a Per-Partition Basis
- Optimizing Broadcasts 
- Accumulators 
- Custom Accumulators 
- Accumulators and Fault Tolerance 
- Numeric RDD Operations 
- Piping to External Programs 
- Broadcast Variables 
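A sketch of the shared-variable topics above: an accumulator counts malformed records on the executors, while a broadcast variable ships a lookup table to each worker once. The data is made up:

    import org.apache.spark.{SparkConf, SparkContext}

    object SharedVars {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("shared-vars").setMaster("local[*]"))

        val badRecords   = sc.longAccumulator("bad-records")  // write-only on executors
        val countryNames = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))

        val codes = sc.parallelize(Seq("IN", "US", "XX"))
        val resolved = codes.flatMap { code =>
          countryNames.value.get(code) match {
            case Some(name) => Some(name)
            case None       => badRecords.add(1); None  // count, but drop, unknown codes
          }
        }

        resolved.collect().foreach(println)           // India, United States
        println(s"bad records: ${badRecords.value}")  // read on the driver only
        sc.stop()
      }
    }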
 
Module 11
- Checkpointing
- Output Operations 
- Stateless Transformations 
- Receiver Fault Tolerance 
- Core Sources 
- Worker Fault Tolerance 
- Stateful Transformations 
- Batch and Window Sizes 
- Performance Considerations 
- Architecture and Abstraction 
- Streaming UI 
- Driver Fault Tolerance 
- Multiple Sources and Cluster Sizing 
- Processing Guarantees 
- A Simple Example 
- Input Sources 
- Additional Sources 
- Transformations 
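A sketch of the stateful transformations above: a windowed word count over a socket source, with checkpointing enabled as windowed state requires. Host, port and directory are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object WindowedCounts {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("windowed").setMaster("local[2]"), Seconds(1))
        ssc.checkpoint("/tmp/stream-ckpt")   // required by windowed state

        val lines  = ssc.socketTextStream("localhost", 9999) // placeholder source
        val counts = lines.flatMap(_.split(" "))
          .map((_, 1))
          .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10)) // 30s window, 10s slide

        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }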
 
Module 12
- User-Defined Functions
- Long-Lived Tables and Queries 
- Spark SQL Performance 
- Apache Hive 
- Loading and Saving Data 
- Performance Tuning Options 
- Parquet 
- Initializing Spark SQL 
- Caching 
- SchemaRDDs 
- JSON 
- From RDDs 
- Linking with Spark SQL 
- Spark SQL UDFs 
- Using Spark SQL in Applications 
- Basic Query Example 
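A sketch of the Spark SQL topics above: registering a temporary view and a UDF, then querying both from SQL text. The data and names are made up:

    import org.apache.spark.sql.SparkSession

    object SqlUdf {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("sql-udf").master("local[*]").getOrCreate()
        import spark.implicits._

        val people = Seq(("ann", 34), ("bob", 29)).toDF("name", "age")
        people.createOrReplaceTempView("people")   // queryable for the session's lifetime

        // A UDF callable from SQL.
        spark.udf.register("shout", (s: String) => s.toUpperCase)

        spark.sql("SELECT shout(name), age FROM people WHERE age > 30").show()
        spark.stop()
      }
    }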
 
Module 13
- Driver and Executor Logs
- Memory Management 
- Finding Information 
- Key Performance Considerations 
- Configuring Spark with SparkConf 
- Components of Execution: Jobs, Tasks, and Stages 
- Spark Web UI 
- Hardware Provisioning 
- Level of Parallelism 
- Serialization Format 
Metrics and Debugging
- Evaluating Spark jobs
- Monitoring tools for Spark
- Spark Web UI
- Memory consumption and resource allocation
- Job metrics
- Debugging and troubleshooting Spark jobs
- Monitoring Spark jobs
Monitoring Spark
- Logging in Spark
- Spark History Server 
- Spark Metrics 
- Exploring the Spark Application UI 
Spark Administration & Best Practices
- Estimating cluster resource requirements
- Estimating driver/executor memory sizes
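A sketch of the configuration and serialization knobs above, set through SparkConf; the values are illustrative, not recommendations:

    import org.apache.spark.{SparkConf, SparkContext}

    object TunedApp {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("tuned-app")
          .set("spark.executor.memory", "4g")                 // illustrative sizing
          .set("spark.default.parallelism", "200")            // level of parallelism
          .set("spark.serializer",
               "org.apache.spark.serializer.KryoSerializer")  // faster serialization

        val sc = new SparkContext(conf)
        // Jobs, stages and tasks for this app are visible in the Spark Web UI
        // (port 4040 on the driver by default).
        println(sc.parallelize(1 to 1000, 10).sum())
        sc.stop()
      }
    }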

Module 14
Kafka Core Concepts
- Brokers
- Topics
- Producers
- Replicas
- Partitions
- Consumers
Operating Kafka
- P&S tuning
- Monitoring
- Deploying
- Architecture
- Hardware specs
Developing Kafka Applications
- Serialization
- Compression
- Testing
- Case Study
- Reading from Kafka
- Writing to Kafka
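A sketch of the producer side of "Writing to Kafka", using the Kafka Java client from Scala; the broker address and topic are placeholders:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object ProducerDemo {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092")   // placeholder broker
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer")
        props.put("compression.type", "gzip")              // compression, per the topic list
        props.put("acks", "all")                           // wait for all in-sync replicas

        val producer = new KafkaProducer[String, String](props)
        try {
          // Records with the same key always land in the same partition.
          producer.send(new ProducerRecord("events", "user-1", "clicked"))
        } finally {
          producer.close()
        }
      }
    }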
 
Module 15
Cassandra in a Cluster
- Replication Strategies
- Seed Nodes 
- Adding Nodes to a Cluster 
- Node Configuration 
- Cassandra Cluster Manager 
- Creating a Cluster 
- Dynamic Ring Participation 
- Snitches 
- Partitioners 
The Cassandra Query Language
- Data Types
- CQL1 
- The Relational Data Model 
- CQL3 
- CQL Types 
- Cassandra’s Data Model 
- Secondary Indexes 
- CQL2 
Performance Tuning
- Memtables
- Commit Logs 
- Caching 
- Compaction 
- Hinted Handoff 
- JVM Settings 
- Concurrency and Threading 
- SSTables 
- Networking and Timeouts 
- Managing Performance 
- Using cassandra-stress 
Cassandra Introduction
- A Quick Review of Relational Databases
- Beyond Relational Databases 
- Web Scale 
- What’s Wrong with Relational Databases? 
- The Cassandra Elevator Pitch 
- The Rise of NoSQL 
- Where Did Cassandra Come From? 
- Is Cassandra a Good Fit for My Project? 
The Cassandra Architecture
- System Keyspaces
- Partitioners 
- Data Centers and Racks 
- Staged Event-Driven Architecture (SEDA) 
- Lightweight Transactions and Paxos 
- Rings and Tokens 
- Compaction 
- Queries and Coordinator Nodes 
- Caching 
- Consistency Levels 
- Hinted Handoff 
- Bloom Filters 
- Gossip and Failure Detection 
- Anti-Entropy, Repair, and Merkle Trees 
- Snitches 
- Virtual Nodes 
- Managers and Services 
- Replication Strategies 
- Memtables, SSTables, and Commit Logs 
- Tombstones 
Data Modeling
- Evaluating and Refining
- Conceptual Data Modeling 
- Defining Database Schema 
- Defining Application Queries 
- Logical Data Modeling 
- RDBMS Design 
- Physical Data Modeling 
Monitoring and Maintenance
- Logging
- Cassandra’s MBeans 
- Backup and Recovery 
- Maintenance Tools 
- SSTable Utilities 
- Basic Maintenance 
- Adding Nodes 
- Handling Node Failure 
- Health Check 
- Monitoring with nodetool 
- Monitoring Cassandra with JMX 
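A sketch of CQL in practice through the DataStax Java driver from Scala, pairing with "The Cassandra Query Language" above; the contact point, keyspace and table are placeholders:

    import com.datastax.driver.core.Cluster

    object CqlDemo {
      def main(args: Array[String]): Unit = {
        val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
        val session = cluster.connect()

        // SimpleStrategy with RF=1 suits a single-node lab, not production.
        session.execute(
          "CREATE KEYSPACE IF NOT EXISTS demo WITH replication = " +
          "{'class': 'SimpleStrategy', 'replication_factor': 1}")
        session.execute(
          "CREATE TABLE IF NOT EXISTS demo.users (id int PRIMARY KEY, name text)")
        session.execute("INSERT INTO demo.users (id, name) VALUES (1, 'ann')")

        val row = session.execute("SELECT name FROM demo.users WHERE id = 1").one()
        println(row.getString("name"))
        cluster.close()
      }
    }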
 
Module 16
Indexing and Query Optimization
Replication
Sharding
- Starting the Servers
- Adding a Shard from a Replica Set 
- Splitting Chunks 
- Chunk Ranges 
- Configuring Sharding 
- Sharding Data 
- Understanding the Components of a Cluster 
- When to Shard 
- The Balancer 
- Config Servers 
- Adding Capacity 
- The mongos Processes 
- How MongoDB Tracks Cluster Data 
Monitoring MongoDB Applications
- False Positives
- Seeing the Current Operations 
- Documents 
- Collections 
- Calculating Sizes 
- Finding Problematic Operations 
- Preventing Phantom Operations 
- Using mongotop and mongostat 
- Killing Operations 
- Using the System Profiler 
- Databases 
- Seeing What Your Application Is Doing 
Durability
- What Journaling Does
- Replacing Data Files 
- Durability with Replication 
- Sneaky Unclean Shutdowns 
- Planning Commit Batches 
- Checking for Corruption 
- What MongoDB Does Not Guarantee 
- Repairing Data Files 
- Setting Commit Intervals 
- Turning Off Journaling 
- The mongod.lock File
Advanced Sharding
- Controlling Data Distribution
- Location-Based Shard Keys 
- Hashed Shard Keys for GridFS 
- The Firehose Strategy 
- Shard Key Limitations 
- Ascending Shard Keys 
- Shard Key Strategies 
- Picturing Distributions 
- Randomly Distributed Shard Keys 
- Hashed Shard Key 
- Choosing a Shard Key 
- Shard Key Cardinality 
- Manual Sharding 
- Shard Key Rules and Guidelines 
- Taking Stock of Your Usage 
- Multi-Hotspot 
- Using a Cluster for Multiple Databases and Collections 
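A sketch of choosing a hashed shard key, issued as admin commands through the MongoDB Java driver from Scala against a mongos router; the database, collection and key field are placeholders:

    import com.mongodb.MongoClient
    import org.bson.Document

    object ShardSetup {
      def main(args: Array[String]): Unit = {
        // Connect to a mongos router, not to a shard directly.
        val client = new MongoClient("localhost", 27017)
        val admin  = client.getDatabase("admin")

        admin.runCommand(new Document("enableSharding", "demo"))

        // A hashed shard key spreads an ascending field (like userId) evenly,
        // avoiding the hot last chunk that a plain ascending key produces.
        admin.runCommand(new Document("shardCollection", "demo.events")
          .append("key", new Document("userId", "hashed")))

        client.close()
      }
    }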
 
Module 17
Introduction
Clients
Concepts
HBase vs RDBMS
Log-Structured Merge Trees
- Compaction
- Limitations of B+ Trees 
- Limitations of Binary Trees 
- The log-structured merge tree as the backbone of storage
HBase Storage Architecture
- MemStore
- Read and Write Path 
- Physical Architecture 
- HFile 
- WAL 
- HFile Format 
- HMaster and HRegionServer 
- How Data Is Stored in an HFile
- Root Table and Meta Table 
- Key Format 
- Role of ZooKeeper
Future Directions
- MMap for Bloom filters and block indexes
- Exploring OFF-Heap Storage 
Introduction
- Common Advantages
- Dynamo and Bigtable 
- Tables, Column Families, Rows and Columns
- Data Model
- BigTable and HBase (C + P)
- What am I giving up? 
- Schemaless 
- Key/Value Stores 
HBase Operations: Access Patterns
- Batching
- Filters 
- Put 
- Gets 
- Caching 
- Scanning 
Designing HBase Tables and Schemas
- Time-Ordered Relations
- Pagination 
- Concepts 
- Key Design 
- Partial Key Scans 
- Tall-Narrow Versus Flat-Wide Tables 
- Time Series Data 
- Secondary Indexes 
- Advanced Schemas 
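A sketch of the basic access patterns above (Put, Get, Scan) with the HBase Java client from Scala; the table and column family names are placeholders:

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put, Scan}
    import org.apache.hadoop.hbase.util.Bytes

    object HBaseOps {
      def main(args: Array[String]): Unit = {
        val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
        val table = conn.getTable(TableName.valueOf("users"))   // placeholder table
        val cf    = Bytes.toBytes("info")                       // placeholder family

        // Put: row key design matters; here the key is just the user id.
        val put = new Put(Bytes.toBytes("user-1"))
        put.addColumn(cf, Bytes.toBytes("name"), Bytes.toBytes("ann"))
        table.put(put)

        // Get: single-row lookup by key.
        val result = table.get(new Get(Bytes.toBytes("user-1")))
        println(Bytes.toString(result.getValue(cf, Bytes.toBytes("name"))))

        // Scan: range reads; partial key scans work because keys are sorted.
        val scanner = table.getScanner(new Scan())
        import scala.collection.JavaConverters._
        scanner.asScala.foreach(r => println(Bytes.toString(r.getRow)))

        scanner.close(); table.close(); conn.close()
      }
    }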
 
