
Spark Internals and Performance Tuning
Training Description
The internals of Spark 2.0 are explored in great detail, and Spark's programming paradigm is given due importance. RDDs are compared with DataFrames and Datasets. High-performance programming patterns are explored through complex problems: for example, how to take advantage of co-partitioning and co-location, or against-the-grain Spark programming using the powerful iterator-to-iterator transformation. All of this runs on a proper cluster in the cloud. Most importantly, the course covers how Spark interacts with other components such as Cassandra, HBase, MongoDB, and Kafka: for example, how to perform a distributed join in Spark across multiple tables, or how to process real-time events from Kafka and store the processed data in MongoDB for online consumption.
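As a taste of the against-the-grain style mentioned above, here is a minimal sketch of the iterator-to-iterator pattern (spark-shell style, where sc is predefined; the data is illustrative, not course material):

    val lines = sc.parallelize(Seq("a,1", "b,2", "c,3"))
    // mapPartitions takes Iterator => Iterator: each partition is streamed
    // record by record instead of being materialized as a collection.
    val parsed = lines.mapPartitions { records =>
      records.map { line =>
        val Array(key, value) = line.split(",")
        (key, value.toInt)
      }
    }
    parsed.collect().foreach(println)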
Intended Audience
Programmers
Engineers
Networking specialists
Instructional Method
This is an instructor-led course combining lecture topics with the practical application of Hadoop and the underlying technologies. Most concepts are presented pictorially, and a detailed case study strings together the technologies, patterns, and design.
Key skills
Hadoop-2 Technologies: YARN Detailed Architecture
How we use Hadoop and Spark in practice to solve problems
Proper cluster setup in the classroom
Spark: RDDs as a Data Structure
Spark: SQL
Spark: Detailed Architecture
Advanced MapReduce Concepts & Algorithms
Spark: In a Cluster
Spark: Streaming
Advanced MapReduce Patterns: Cluster Utilization
Spark: RDDs Compared with Other Techniques
Advanced MapReduce Patterns: Compute-Intensive Tasks
Spark: Programming Model
Pre-requisites
Working knowledge of Java, specifically how Comparator and Comparable work.
Since Spark embraces functional programming, knowledge of lambda expressions in Java 8 is an added advantage.
Topics
- 01
The Design of HDFS
The Command-Line Interface
Basic Filesystem Operations
Hadoop Filesystems
Interfaces
The Java Interface
Reading Data from a Hadoop URL
Reading Data Using the FileSystem API
Writing Data
Directories
Querying the Filesystem
Deleting Data
Data Flow
Anatomy of a File Read
Anatomy of a File Write
Coherency Model
Parallel Copying with distcp
Keeping an HDFS Cluster Balanced
Hadoop Archives
Using Hadoop Archives
Limitations
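A minimal sketch of reading a file through the Hadoop FileSystem API, assuming a reachable NameNode; the URI below is illustrative:

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.io.IOUtils

    object HdfsRead {
      def main(args: Array[String]): Unit = {
        val uri = "hdfs://namenode:8020/user/demo/sample.txt" // hypothetical path
        val fs = FileSystem.get(URI.create(uri), new Configuration())
        val in = fs.open(new Path(uri))
        try {
          IOUtils.copyBytes(in, System.out, 4096, false) // stream file to stdout
        } finally {
          IOUtils.closeStream(in)
        }
      }
    }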
- 02
Data Integrity
Data Integrity in HDFS
LocalFileSystem
ChecksumFileSystem
Serialization
The Writable Interface
Writable Classes
Implementing a Custom Writable
Serialization Frameworks
Avro
ORC Files
Support for the Hive type model
Complex types (struct, list, map, union)
New types (datetime, decimal)
Encoding specific to the column type
Break file into sets of rows called a stripe
Default stripe size is 250 MB
Large size enables efficient read of columns
A single file as output of each task
Lowers pressure on the NameNode
Split files without scanning for markers
Bound the amount of memory required for reading or writing
Dramatically simplifies integration with Hive
ORC Files: Footer
Contains list of stripes
Types, number of rows
Count, min, max, and sum for each column
ORC Files: Index
Required for skipping rows
Position in each stream
Min and max for each column
Currently every 10,000 rows
Could include bit field or bloom filter
ORC Files: Postscript
Contains compression parameters
Size of compressed footer
ORC Files: Data
Directory of stream locations
Required for table scan
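A minimal sketch of writing and reading ORC from Spark (spark-shell style, where spark is predefined); the path and compression choice are illustrative:

    import spark.implicits._

    val df = Seq((1, "alice"), (2, "bob")).toDF("id", "name")
    df.write.mode("overwrite").option("compression", "zlib").orc("/tmp/people_orc")
    val back = spark.read.orc("/tmp/people_orc")
    // Column pruning plus the min/max statistics in the ORC index let this
    // scan skip stripes that cannot match the predicate.
    back.filter($"id" > 1).show()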
Parquet
Nested Encoding
Configurations
Error recovery
Extensibility
Nulls
File format
Data Pages
Motivation
Unit of parallelization
Logical Types
Metadata
Modules
Column chunks
Separating metadata and column data
Checksumming
Types
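A minimal sketch of Parquet partition discovery and filter pushdown (spark-shell style; paths are illustrative):

    import spark.implicits._

    val df = Seq((1, "alice", "2017-01-01"), (2, "bob", "2017-01-02"))
      .toDF("id", "name", "day")
    df.write.mode("overwrite").partitionBy("day").parquet("/tmp/events_parquet")
    val back = spark.read.parquet("/tmp/events_parquet")
    // The filter on the partition column prunes whole directories; the filter
    // on id is pushed down into the Parquet reader (visible in the plan).
    back.filter($"day" === "2017-01-02" && $"id" > 1).explain(true)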
File-Based Data Structures
SequenceFile
MapFile
Compression
Codecs
Compression and Input Splits
Using Compression in MapReduce
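A minimal sketch of saving and loading a SequenceFile from Spark (spark-shell style; the path is illustrative):

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
    // Keys and values are converted to Text/IntWritable via Spark's
    // built-in Writable conversions.
    pairs.saveAsSequenceFile("/tmp/pairs-seq")
    val back = sc.sequenceFile[String, Int]("/tmp/pairs-seq")
    back.collect().foreach(println)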
- 03
Apache Tez
Apache Tez: A New Chapter in Hadoop Data Processing
Data Processing API in Apache Tez
Writing a Tez Input/Processor/Output
Runtime API in Apache Tez
Apache Tez: Dynamic Graph Reconfiguration
Apache YARN
Agility
global ResourceManager
per-node slave NodeManager
Scalability
Support for workloads other than MapReduce
Compatibility with MapReduce
per-application Container running on a NodeManager
Improved cluster utilization
per-application ApplicationMaster
HDFS-2
High Availability for HDFS
HDFS-append support
HDFS Federation
HDFS Snapshots
- 04
GraphX
MLlib
Spark SQL
Data Processing Applications
Spark Streaming
What Is Apache Spark?
Data Science Tasks
Storage Layers for Spark
Spark Core
Who Uses Spark, and for What?
A Unified Stack
Cluster Managers
- 05
RDD Basics
Creating RDDs
RDD Operations
Transformations
Actions
Lazy Evaluation
Passing Functions to Spark
Python
Scala
Java
Common Transformations and Actions
Basic RDDs
Converting Between RDD Types
Persistence
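A minimal sketch tying these topics together (spark-shell style): creating an RDD, chaining lazy transformations, persisting, and triggering work with actions:

    val nums = sc.parallelize(1 to 1000)          // creating an RDD
    val squares = nums.map(n => n * n)            // transformation: lazy, nothing runs yet
    val evens = squares.filter(_ % 2 == 0)        // still lazy
    evens.persist()                               // mark for caching across actions
    println(evens.count())                        // action: triggers the job
    println(evens.take(5).mkString(","))          // second action reuses the cache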
- 06
Expressing Existing Programming Models
Fault Recovery
Interpreter Integration
Memory Management
Implementation
MapReduce
RDD Operations in Spark
User Applications Built with Spark
Google's Pregel
Console Log Mining
Iterative MapReduce
Behavior with Insufficient Memory
A Fault-Tolerant Abstraction
Support for Checkpointing
Evaluation
Spark Programming Interface
Job Scheduling
Advantages of the RDD Model
Understanding the Speedup
Leveraging RDDs for Debugging
Iterative Machine Learning Applications
Explaining the Expressivity of RDDs
Representing RDDs
Applications Not Suitable for RDDs
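A minimal sketch of checkpointing to keep lineage short in an iterative job (spark-shell style; the update rule is a stand-in, not a real algorithm):

    sc.setCheckpointDir("/tmp/spark-ckpt")        // illustrative directory
    var ranks = sc.parallelize(Seq(("a", 1.0), ("b", 1.0)))
    for (i <- 1 to 50) {
      ranks = ranks.mapValues(_ * 0.85 + 0.15)
      if (i % 10 == 0) {
        ranks.checkpoint()  // truncate the lineage so fault recovery stays cheap
        ranks.count()       // an action forces the checkpoint to materialize
      }
    }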
- 07
Motivation
Creating Pair RDDs
Transformations on Pair RDDs
Aggregations
Grouping Data
Joins
Sorting Data
Actions Available on Pair RDDs
Data Partitioning (Advanced)
Determining an RDD’s Partitioner
Operations That Benefit from Partitioning
Operations That Affect Partitioning
Example: PageRank
Custom Partitioners
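A minimal sketch of explicit partitioning and a join that benefits from it (spark-shell style; data is illustrative):

    import org.apache.spark.HashPartitioner

    val users = sc.parallelize(Seq((1, "alice"), (2, "bob")))
      .partitionBy(new HashPartitioner(4))
      .persist()                        // keep the partitioned copy around
    val events = sc.parallelize(Seq((1, "click"), (1, "view"), (2, "click")))
    // Because users is hash-partitioned and persisted, repeated joins
    // against it shuffle only the events side.
    val joined = users.join(events)
    println(users.partitioner)          // Some(HashPartitioner@...)
    joined.collect().foreach(println)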
- 08
Motivation
File Formats
Text Files
JSON
Comma-Separated Values and Tab-Separated Values
SequenceFiles
Object Files
Hadoop Input and Output Formats
File Compression
Filesystems
Local/“Regular” FS
Amazon S3
HDFS
Structured Data with Spark SQL
Apache Hive
JSON
Databases
Java Database Connectivity
Cassandra
HBase
Elasticsearch
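A minimal sketch of a few of these sources (spark-shell style; all paths are illustrative):

    val df = spark.read.json("/tmp/people.json")   // one JSON object per line
    df.write.mode("overwrite").option("header", "true").csv("/tmp/people_csv")
    val csv = spark.read.option("header", "true").csv("/tmp/people_csv")
    val text = sc.textFile("hdfs:///tmp/raw.txt")  // plain text from HDFS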
- 09
Accumulators
Custom Accumulators
Accumulators and Fault Tolerance
Broadcast Variables
Optimizing Broadcasts
Working on a Per-Partition Basis
Piping to External Programs
Numeric RDD Operations
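A minimal sketch of an accumulator and a broadcast variable working together (spark-shell style; data is illustrative):

    val blankLines = sc.longAccumulator("blankLines")      // driver-visible counter
    val lookup = sc.broadcast(Map("ERROR" -> 3, "WARN" -> 2))
    val lines = sc.parallelize(Seq("ERROR disk", "", "WARN cpu"))
    val scored = lines.map { line =>
      if (line.isEmpty) blankLines.add(1)
      lookup.value.collectFirst { case (k, v) if line.startsWith(k) => v }.getOrElse(0)
    }
    scored.count()                 // the action actually runs the job
    println(blankLines.value)      // reliable only after the action completes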
- 10
Spark Runtime Architecture
The Driver
Executors
Cluster Manager
Launching a Program
Deploying Applications with spark-submit
Packaging Your Code and Dependencies
A Java Spark Application Built with Maven
A Scala Spark Application Built with sbt
Dependency Conflicts
Scheduling Within and Between Spark Applications
Cluster Managers
Standalone Cluster Manager
Hadoop YARN
Apache Mesos
Amazon EC2
Which Cluster Manager to Use?
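A minimal sketch of a submittable application; the jar name and paths in the trailing comment are illustrative:

    import org.apache.spark.sql.SparkSession

    object WordCountApp {
      def main(args: Array[String]): Unit = {
        // No master hard-coded: spark-submit supplies it per cluster manager.
        val spark = SparkSession.builder.appName("WordCountApp").getOrCreate()
        val counts = spark.sparkContext.textFile(args(0))
          .flatMap(_.split("\\s+"))
          .map((_, 1))
          .reduceByKey(_ + _)
        counts.saveAsTextFile(args(1))
        spark.stop()
      }
    }
    // Submitted, e.g., after `sbt package`, with:
    //   spark-submit --class WordCountApp --master yarn --deploy-mode cluster \
    //     target/scala-2.11/wordcountapp_2.11-0.1.jar /in /out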
- 11
Spark: YARN Mode
Resource Manager
Node Manager
Workers
Containers
Threads
Task
Executors
Application Master
Multiple Applications
Tuning Parameters
Spark: Local Mode
Spark Caching
With Serialization
Off-heap
In Memory
Spark Serialization
Spark: Standalone Mode
Task
Multiple Applications
Executors
Tuning Parameters
Workers
Threads
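A minimal sketch of the caching variants above (spark-shell style); note an RDD takes one storage level at a time:

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.parallelize(1 to 1000000)
    rdd.persist(StorageLevel.MEMORY_ONLY)        // deserialized objects on heap
    // rdd.persist(StorageLevel.MEMORY_ONLY_SER) // serialized: slower access, less memory
    // rdd.persist(StorageLevel.OFF_HEAP)        // off-heap; needs spark.memory.offHeap.enabled
    rdd.count()                                  // first action populates the cache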
- 12
Effective Transformations
Shared Variables
Using Smaller Data Structures
Narrow Versus Wide Transformations
Reducing Setup Overhead
Deciding if Recompute Is Inexpensive Enough
Accumulators
Implications for Fault Tolerance
LRU Caching
Space and Time Advantages
What Type of RDD Does Your Transformation Return?
Cases for Reuse
Implications for Performance
Minimizing Object Creation
Alluxio (née Tachyon)
Interaction with Accumulators
Noisy Cluster Considerations
Reusing Existing Objects
Types of Reuse: Cache, Persist, Checkpoint, Shuffle Files
Broadcast Variables
Set Operations
Reusing RDDs
The Special Case of coalesce
Working with Key/Value Data
The Powerful Iterator-to-Iterator Transformation
Reduce to Distinct on Each Partition
A Different Approach to Key/Value
Straggler Detection and Unbalanced Data
Sort on Cell Values
Secondary Sort Solution
groupByKey Solution
Multiple RDD Operations
Choosing an Aggregation Operation
Sorting by Two Keys with SortByKey
Preserving Partitioning Information Across Transformations
Dictionary of OrderedRDDOperations
Leveraging Co-Located and Co-Partitioned RDDs
Leveraging repartitionAndSortWithinPartitions for a Group by Key and Sort Values Function
Range Partitioning
Dictionary of Mapping and Partitioning Functions over PairRDDFunctions
Partitioners and Key/Value Data
Co-Grouping
Custom Partitioning
Secondary Sort and repartitionAndSortWithinPartitions
Using the Spark Partitioner Object
Hash Partitioning
Dictionary of Aggregation Operations with Performance Considerations
Iterative Solution
What’s So Dangerous About the groupByKey Function
How to Use PairRDDFunctions and OrderedRDDFunctions
Actions on Key/Value Pairs
How Not to Sort by Two Orderings
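A minimal sketch of a secondary sort with repartitionAndSortWithinPartitions (spark-shell style): partition on part of a composite key and let the shuffle sort the rest:

    import org.apache.spark.Partitioner

    // Composite key (id, value): partition only on id, but the shuffle sorts
    // on the whole key, so values arrive sorted within each id group.
    class IdPartitioner(override val numPartitions: Int) extends Partitioner {
      def getPartition(key: Any): Int = key match {
        case (id: String, _) => math.abs(id.hashCode) % numPartitions
      }
    }

    val data = sc.parallelize(Seq((("a", 3), "x"), (("a", 1), "y"), (("b", 2), "z")))
    val sorted = data.repartitionAndSortWithinPartitions(new IdPartitioner(2))
    sorted.collect().foreach(println)   // keys sorted by (id, value) per partition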
- 13
Spark SQL—Structured Queries on Large Scale
UserDefinedAggregateFunction—User-Defined Aggregate Functions (UDAFs)
User-Defined Functions (UDFs)
Schema—Structure of Data
Dataset Operators
Encoders—Internal Row Converters
Joins
StructField
StructType
InternalRow—Internal Binary Row Format
Builder—Building SparkSession with Fluent API
Aggregation—Typed and Untyped Grouping
Column Operators
Standard Functions—functions object
Window Aggregate Operators—Windows
Caching
Data Types
Row
Datasets—Strongly-Typed DataFrames with Encoders
DataFrame—Dataset of Rows
SparkSession—The Entry Point to Spark SQL
RowEncoder—DataFrame Encoder
DataSource API—Loading and Saving Datasets
DDLStrategy
Custom Formats
CSVFileFormat
BaseRelation
DataFrameWriter
JoinSelection
Query Execution
DataSource—Pluggable Data Sources
SparkPlanner—Default Query Planner (with no Hive Support)
ParquetFileFormat
QueryPlanner—From Logical to Physical Plans
DataSourceStrategy
DataSourceRegister
Structured Query Plan
BasicOperators
FileSourceStrategy
EnsureRequirements Physical Plan Optimization
AlterViewAsCommand Runnable Command
SparkPlan—Physical Execution Plan
Join Logical Operator
CoalesceExec Physical Operator
LocalTableScanExec Physical Operator
ExplainCommand Logical Command
ClearCacheCommand Runnable Command
ExecutedCommandExec Physical Operator
CreateViewCommand Runnable Command
BroadcastHashJoinExec Physical Operator
ShuffledRowRDD
ExchangeCoordinator and Adaptive Query Execution
Debugging Query Execution
LocalRelation Logical Operator
CheckAnalysis
BroadcastNestedLoopJoinExec Physical Operator
Logical Query Plan Analyzer
InMemoryRelation Logical Operator
InMemoryTableScanExec Physical Operator
LogicalPlan—Logical Query Plan
ShuffleExchange Physical Operator
DeserializeToObject Logical Operator
WindowExec Physical Operator
Joins (SQL & Core)
Core Spark Joins
Spark SQL Joins
Choosing a Join Type
DataFrame Joins
Choosing an Execution Plan
Datasets vs DataFrames vs RDDs
Catalog
SQL Parser Framework
Eliminate Serialization
Combine Typed Filters
Predicate Pushdown / Filter Pushdown
CatalystSerde
Nullability (NULL Value) Propagation
SessionState
Logical Query Plan Optimizer
Column Pruning
Constant Folding
Vectorized Parquet Decoder
SessionCatalog
Propagate Empty Relation
SQLExecution Helper Object
Simplify Casts
SQLConf
GetCurrentDatabase / ComputeCurrentTime
SparkSqlAstBuilder
ExternalCatalog—System Catalog of Permanent Entities
Tungsten Execution Backend (aka Project Tungsten)
Catalyst—Tree Manipulation Framework
Spark SQL CLI - spark-sql
Whole-Stage Code Generation (CodeGen)
Attribute Expression
Hive Integration
Expression TreeNode
SparkSQLEnv
CacheManager—In-Memory Cache for Cached Tables
Settings
(obsolete) SQLContext
TreeNode
Generator
DataSinks Strategy
Thrift JDBC/ODBC Server—Spark Thrift Server (STS)
DataFrames, Datasets & Spark SQL
Query Optimizer
Debugging Spark SQL Queries
Large Query Plans and Iterative Algorithms
JDBC/ODBC Server
Logical and Physical Plans
Code Generation
Data Loading and Saving Functions
DataFrameWriter and DataFrameReader
Save Modes
Partitions (Discovery and Writing)
Formats
Datasets
Easier Functional (RDD “like”) Transformations
Compile-Time Strong Typing
Grouped Operations on Datasets
Extending with User-Defined Functions and Aggregate Functions (UDFs, UDAFs)
Multi-Dataset Relational Transformations
Interoperability with RDDs, DataFrames, and Local Collections
Relational Transformations
Spark SQL Dependencies
Managing Spark Dependencies
Basics of Schemas
Avoiding Hive JARs
DataFrame API
Multi-DataFrame Transformations
Data Representation in DataFrames and Datasets
Transformations
Plain Old SQL Queries and Interacting with Hive Data
Getting Started with the SparkSession (or HiveContext or SQLContext)
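A minimal sketch touching several of these topics (spark-shell style): a typed Dataset, untyped aggregation, a registered UDF, and explain() to inspect the Catalyst plans; names and data are illustrative:

    import spark.implicits._
    import org.apache.spark.sql.functions._

    case class Sale(shop: String, amount: Double)
    val sales = Seq(Sale("north", 10.0), Sale("north", 20.0), Sale("south", 5.0)).toDS()

    // Untyped (DataFrame) aggregation with a standard function
    sales.groupBy($"shop").agg(sum($"amount").as("total")).show()

    // A user-defined function registered for SQL use
    spark.udf.register("bucket", (amount: Double) => if (amount >= 10) "big" else "small")
    sales.createOrReplaceTempView("sales")
    spark.sql("SELECT shop, bucket(amount) AS b FROM sales").show()

    // Inspect the parsed/analyzed/optimized/physical plans
    sales.groupBy($"shop").count().explain(true)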
- 14
A Simple Example
Architecture and Abstraction
Transformations
Stateless Transformations
Stateful Transformations
Output Operations
Input Sources
Core Sources
Additional Sources
Multiple Sources and Cluster Sizing
Checkpointing
Driver Fault Tolerance
Worker Fault Tolerance
Receiver Fault Tolerance
Processing Guarantees
Streaming UI
Performance Considerations
Batch and Window Sizes
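A minimal sketch of a windowed DStream word count; the host, port, and checkpoint directory are illustrative:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingDemo {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("stream-demo").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(5))   // 5-second batches
        ssc.checkpoint("/tmp/stream-ckpt")                 // needed for stateful transformations
        val lines = ssc.socketTextStream("localhost", 9999)
        val counts = lines.flatMap(_.split(" "))
          .map((_, 1))
          .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10)) // 30s window, 10s slide
        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }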
- 15
Kafka Core Concepts
Brokers
Topics
Producers
Replicas
Partitions
Consumers
Operating Kafka
P&S tuning
Monitoring
Deploying
Architecture
Hardware specs
Developing Kafka apps
Serialization
Compression
Testing
Case Study
Reading from Kafka
Writing to Kafka
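A minimal sketch of writing to Kafka using the plain Java producer client from Scala; the broker address and topic are illustrative:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object KafkaWrite {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092") // hypothetical broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)
        producer.send(new ProducerRecord("events", "user-1", "clicked"))
        producer.close()  // flushes pending records before shutting down
      }
    }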
- 16
Configuring Spark with SparkConf
Components of Execution: Jobs, Tasks, and Stages
Finding Information
Spark Web UI
Driver and Executor Logs
Key Performance Considerations
Level of Parallelism
Serialization Format
Memory Management
Hardware Provisioning
Metrics and Debugging
Evaluating Spark jobs
Debugging & troubleshooting Spark jobs
Job metrics
Memory consumption and resource allocation
Monitoring Spark
Monitoring tools for Spark
Exploring the Spark Application UI
Spark History Server
Spark Metrics
Logging in Spark
Spark Administration & Best Practices
Estimating cluster resource requirements
Estimating Driver/Executor Memory Sizes
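A minimal sketch of a tuning starting point via SparkConf; every value below is a placeholder to be validated against your own job metrics in the Web UI:

    import org.apache.spark.SparkConf

    // Hypothetical starting point for a mid-sized YARN cluster.
    val conf = new SparkConf()
      .setAppName("tuned-job")
      .set("spark.executor.memory", "8g")          // per-executor heap
      .set("spark.executor.cores", "4")            // tasks per executor
      .set("spark.default.parallelism", "200")     // default shuffle partitions for RDDs
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")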
- 17
Understanding perf
Understanding hardware performance counters
Making perf JVM-aware
Adapting perf for clusters with NoSQL
