
Spark 2.0 Internals, Kafka, and NoSQL DBs
Training Description
In this training, Spark is explored in great detail. The Spark programming paradigm is given due importance: RDDs are explored as data structures, their novelty is brought into focus, and they are contrasted with Hadoop MapReduce. All of this is done on a proper cluster. Most importantly, the course shows how Spark interacts with other components such as Cassandra, HBase, MongoDB, Kafka and Storm: for example, how to perform a distributed join in Spark across multiple tables, or how to process real-time events from Kafka and then store the processed data in MongoDB for online consumption (a sketch of such a pipeline follows).
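To make the Kafka-to-MongoDB example concrete, here is a minimal sketch, assuming Spark Streaming with the spark-streaming-kafka-0-10 integration and the MongoDB Java driver on the classpath; the broker address, topic, database and field names are placeholders, not part of the course material:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

    object KafkaToMongo {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("kafka-to-mongo"), Seconds(5))

        val kafkaParams = Map[String, Object](
          "bootstrap.servers"  -> "localhost:9092",          // placeholder broker
          "key.deserializer"   -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id"           -> "training-demo")

        val stream = KafkaUtils.createDirectStream[String, String](
          ssc,
          LocationStrategies.PreferConsistent,
          ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams))

        // Count events per key in each micro-batch (assumes keyed messages),
        // then write the counts to Mongo for online consumption.
        stream.map(r => (r.key, 1L)).reduceByKey(_ + _).foreachRDD { rdd =>
          rdd.foreachPartition { records =>
            // One Mongo connection per partition, not per record.
            val client = new com.mongodb.MongoClient("localhost")
            val coll = client.getDatabase("demo").getCollection("counts")
            records.foreach { case (k, n) =>
              coll.insertOne(new org.bson.Document("key", k).append("count", Long.box(n)))
            }
            client.close()
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }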
Intended Audience
- Programmers
- Engineers
Instructional Method
This is an instructor-led course that combines lectures with hands-on application of Spark, Spark Streaming and the underlying technologies. Most concepts are presented pictorially, and a detailed case study strings together the technologies, patterns and design.
Key Skills
- Spark: In a Cluster
- Spark: Streaming
- Spark: Programming Model
- Spark: RDDs Compared with Other Techniques
- Spark: Detailed Architecture
- Spark: RDDs as a Data Structure
Pre-requisites
- Good knowledge of Java 8 lambda expressions
- Good understanding of Hadoop
- Good understanding of Java
- Good understanding of HDFS
- Familiarity with Linux, as it is the training platform

Topics
Module 01
- Interfaces
- Hadoop Filesystems
- The Design of HDFS
Using Hadoop Archives
- Limitations
Parallel Copying with distcp
- Keeping an HDFS Cluster Balanced
- Hadoop Archives
Data Flow
- Anatomy of a File Write
- Anatomy of a File Read
- Coherency Model
The Command-Line Interface
- Basic Filesystem Operations
The Java Interface
- Querying the Filesystem
- Reading Data Using the FileSystem API
- Directories
- Deleting Data
- Reading Data from a Hadoop URL
- Writing Data
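A minimal sketch of the FileSystem API read path listed above; the HDFS URI and file path are placeholders:

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.io.IOUtils

    object HdfsRead {
      def main(args: Array[String]): Unit = {
        val uri  = "hdfs://localhost:9000/user/demo/input.txt" // placeholder path
        val conf = new Configuration()
        val fs   = FileSystem.get(URI.create(uri), conf)       // the HDFS client
        val in   = fs.open(new Path(uri))                      // FSDataInputStream
        try {
          IOUtils.copyBytes(in, System.out, 4096, false)       // stream file to stdout
        } finally {
          IOUtils.closeStream(in)
        }
      }
    }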
 
Module 02
Data Integrity
- ChecksumFileSystem
- LocalFileSystem 
- Data Integrity in HDFS 
Serialization
- Implementing a Custom Writable
- Serialization Frameworks 
- The Writable Interface 
- Writable Classes 
- Avro 
ORC Files
- Large size enables efficient read of columns
- New types (datetime, decimal) 
- Encoding specific to the column type 
- Default stripe size is 250 MB 
- A single file as output of each task 
- Split files without scanning for markers 
- Bound the amount of memory required for reading or writing. 
- Lowers pressure on the NameNode 
- Dramatically simplifies integration with Hive 
- Break file into sets of rows called a stripe 
- Complex types (struct, list, map, union) 
- Support for the Hive type model 
ORC Files: Footer
- Count, min, max, and sum for each column
- Types, number of rows 
- Contains list of stripes 
ORC Files: Index
- Required for skipping rows
- Position in each stream 
- Min and max for each column 
- Currently every 10,000 rows 
- Could include bit field or bloom filter 
ORC Files: Postscript
- Contains compression parameters
- Size of compressed footer 
ORC Files: Data
- Directory of stream locations
- Required for table scan 
Parquet
- Nested Encoding
- Configurations 
- Error recovery 
- Extensibility 
- Nulls 
- File format 
- Data Pages 
- Motivation 
- Unit of parallelization 
- Logical Types 
- Metadata 
- Modules 
- Column chunks 
- Separating metadata and column data 
- Checksumming 
- Types 
File-Based Data Structures
- MapFile
- SequenceFile 
Compression
- Codecs
- Using Compression in MapReduce 
- Compression and Input Splits 
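As a taste of the file-based data structures above, a sketch that writes and reads back a SequenceFile with Writable key/value pairs; the output path is a placeholder:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{IntWritable, SequenceFile, Text}

    object SeqFileDemo {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()
        val path = new Path("/tmp/demo.seq") // placeholder output path

        // Write key/value records; Writables handle their own serialization.
        val writer = SequenceFile.createWriter(conf,
          SequenceFile.Writer.file(path),
          SequenceFile.Writer.keyClass(classOf[IntWritable]),
          SequenceFile.Writer.valueClass(classOf[Text]))
        for (i <- 1 to 3) writer.append(new IntWritable(i), new Text(s"record-$i"))
        writer.close()

        // Read the records back in order.
        val reader = new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))
        val k = new IntWritable(); val v = new Text()
        while (reader.next(k, v)) println(s"$k -> $v")
        reader.close()
      }
    }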
 
Module 03
- GraphX
- MLlib 
- Spark SQL 
- Data Processing Applications 
- Spark Streaming 
- What Is Apache Spark? 
- Data Science Tasks 
- Storage Layers for Spark 
- Spark Core 
- Who Uses Spark, and for What? 
- A Unified Stack 
- Cluster Managers 
 
Module 04
- Lazy Evaluation
- Common Transformations and Actions 
- Passing Functions to Spark 
- RDD Operations 
- Creating RDDs 
- Actions 
- Transformations 
- Scala 
- Java 
- Persistence 
- Python 
- Converting Between RDD Types 
- RDD Basics 
- Basic RDDs 
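A minimal sketch of the RDD model covered in this module: transformations build a lineage lazily, an action triggers execution, and persist() caches the result:

    import org.apache.spark.{SparkConf, SparkContext}

    object RddBasics {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("rdd-basics").setMaster("local[*]"))

        val lines  = sc.parallelize(Seq("spark is fast", "spark is lazy"))
        val words  = lines.flatMap(_.split(" "))          // transformation: nothing runs yet
        val sparks = words.filter(_ == "spark").persist() // cached on first action

        println(sparks.count())  // action: the whole lineage executes here
        println(sparks.first())  // second action: served from the cache
        sc.stop()
      }
    }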
 
Module 05
- Expressing Existing Programming Models
- Fault Recovery 
- Interpreter Integration 
- Memory Management 
- Implementation 
- MapReduce 
- RDD Operations in Spark 
- User Applications Built with Spark 
- Google's Pregel 
- Console Log Mining
- Iterative MapReduce 
- Behavior with Insufficient Memory 
- A Fault-Tolerant Abstraction 
- Support for Checkpointing 
- Evaluation 
- Spark Programming Interface 
- Job Scheduling 
- Advantages of the RDD Model 
- Understanding the Speedup 
- Leveraging RDDs for Debugging 
- Iterative Machine Learning Applications 
- Explaining the Expressivity of RDDs 
- Representing RDDs 
- Applications Not Suitable for RDDs 
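The fault-tolerance ideas above are visible directly in the API: toDebugString prints the lineage Spark would replay after a failure, and checkpoint() truncates that lineage by persisting to reliable storage. A sketch, with placeholder input and checkpoint paths:

    import org.apache.spark.{SparkConf, SparkContext}

    object LineageDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("lineage").setMaster("local[*]"))
        sc.setCheckpointDir("/tmp/checkpoints")      // placeholder directory

        val counts = sc.textFile("/tmp/console.log") // placeholder input
          .filter(_.contains("ERROR"))
          .map(line => (line.split(" ")(0), 1))
          .reduceByKey(_ + _)

        println(counts.toDebugString) // the lineage used for fault recovery
        counts.checkpoint()           // cut the lineage at this RDD
        counts.count()                // materializes both the RDD and the checkpoint
        sc.stop()
      }
    }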
 
Module 06
- Sorting Data
- Determining an RDD’s Partitioner 
- Operations That Affect Partitioning 
- Grouping Data 
- Motivation 
- Aggregations 
- Data Partitioning (Advanced) 
- Actions Available on Pair RDDs 
- Joins 
- Creating Pair RDDs 
- Operations That Benefit from Partitioning 
- Transformations on Pair RDDs 
- Example: PageRank 
- Custom Partitioners 
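A sketch of the partitioning ideas above: co-partitioning both sides of a join with the same HashPartitioner means matching keys are already co-located, so the join avoids a full shuffle. The data is made up:

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    object PairJoin {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("pair-join").setMaster("local[*]"))
        val byKey = new HashPartitioner(4)

        val users  = sc.parallelize(Seq((1, "ann"), (2, "bob")))
          .partitionBy(byKey).persist()   // partitioned once, reused across jobs
        val orders = sc.parallelize(Seq((1, 9.99), (1, 5.0), (2, 3.5)))
          .partitionBy(byKey)

        // Both RDDs share a partitioner, so the join needs no re-shuffle.
        users.join(orders).collect().foreach(println)  // e.g. (1,(ann,9.99))
        sc.stop()
      }
    }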
 
Module 07
- File Formats
- Hadoop Input and Output Formats 
- Local/“Regular” FS 
- Text Files 
- Java Database Connectivity 
- Structured Data with Spark SQL 
- Elasticsearch 
- File Compression 
- Apache Hive 
- Cassandra 
- Object Files 
- Comma-Separated Values and Tab-Separated Values 
- HBase 
- Databases 
- Filesystems 
- SequenceFiles 
- JSON 
- HDFS 
- Motivation 
- JSON 
- Amazon S3 
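A sketch of the load/save paths above using the Spark 2.0 DataFrame reader and writer; the file paths are placeholders:

    import org.apache.spark.sql.SparkSession

    object LoadSave {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("load-save").master("local[*]").getOrCreate()

        // JSON in: one JSON object per line, schema inferred.
        val people = spark.read.json("/tmp/people.json")       // placeholder path
        people.printSchema()

        // CSV in, with a header row.
        val sales = spark.read.option("header", "true").csv("/tmp/sales.csv")

        // Parquet out: columnar, compressed, splittable.
        sales.write.parquet("/tmp/sales.parquet")
        spark.stop()
      }
    }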
 
Module 08
- Scheduling Within and Between Spark Applications
- Spark Runtime Architecture 
- A Scala Spark Application Built with sbt 
- Packaging Your Code and Dependencies 
- Launching a Program 
- A Java Spark Application Built with Maven 
- Hadoop YARN 
- Deploying Applications with spark-submit 
- The Driver 
- Standalone Cluster Manager 
- Cluster Managers 
- Executors 
- Amazon EC2 
- Cluster Manager 
- Dependency Conflicts 
- Apache Mesos 
- Which Cluster Manager to Use? 
 
Module 09
Spark: YARN Mode
- Resource Manager
- Node Manager 
- Workers 
- Containers 
- Threads 
- Task 
- Executors
- Application Master 
- Multiple Applications 
- Tuning Parameters 
Spark: Local Mode
Spark: Caching
- With Serialization
- Off-heap 
- In Memory 
Spark Serialization
Spark: Standalone Mode
- Task
- Multiple Applications 
- Executors
- Tuning Parameters 
- Workers 
- Threads 
 
Module 10
- Working on a Per-Partition Basis
- Optimizing Broadcasts 
- Accumulators 
- Custom Accumulators 
- Accumulators and Fault Tolerance 
- Numeric RDD Operations 
- Piping to External Programs 
- Broadcast Variables 
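A sketch of the shared-variable topics above: an accumulator counts malformed records on the executors, while a broadcast variable ships a lookup table to each worker once. The data is made up:

    import org.apache.spark.{SparkConf, SparkContext}

    object SharedVars {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("shared-vars").setMaster("local[*]"))

        val badRecords   = sc.longAccumulator("bad-records")  // write-only on executors
        val countryNames = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))

        val codes = sc.parallelize(Seq("IN", "US", "XX"))
        val resolved = codes.flatMap { code =>
          countryNames.value.get(code) match {
            case Some(name) => Some(name)
            case None       => badRecords.add(1); None  // count, but drop, unknown codes
          }
        }

        resolved.collect().foreach(println)           // India, United States
        println(s"bad records: ${badRecords.value}")  // read on the driver only
        sc.stop()
      }
    }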
 
Module 11
- Checkpointing
- Output Operations 
- Stateless Transformations 
- Receiver Fault Tolerance 
- Core Sources 
- Worker Fault Tolerance 
- Stateful Transformations 
- Batch and Window Sizes 
- Performance Considerations 
- Architecture and Abstraction 
- Streaming UI 
- Driver Fault Tolerance 
- Multiple Sources and Cluster Sizing 
- Processing Guarantees 
- A Simple Example 
- Input Sources 
- Additional Sources 
- Transformations 
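A sketch of the stateful transformations above: a windowed word count over a socket source, with checkpointing enabled as windowed state requires. Host, port and directory are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object WindowedCounts {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("windowed").setMaster("local[2]"), Seconds(1))
        ssc.checkpoint("/tmp/stream-ckpt")   // required by windowed state

        val lines  = ssc.socketTextStream("localhost", 9999) // placeholder source
        val counts = lines.flatMap(_.split(" "))
          .map((_, 1))
          .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10)) // 30s window, 10s slide

        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }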
 
Module 12
- User-Defined Functions
- Long-Lived Tables and Queries 
- Spark SQL Performance 
- Apache Hive 
- Loading and Saving Data 
- Performance Tuning Options 
- Parquet 
- Initializing Spark SQL 
- Caching 
- SchemaRDDs 
- JSON 
- From RDDs 
- Linking with Spark SQL 
- Spark SQL UDFs 
- Using Spark SQL in Applications 
- Basic Query Example 
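A sketch of the Spark SQL topics above: registering a temporary view and a UDF, then querying both from SQL text. The data and names are made up:

    import org.apache.spark.sql.SparkSession

    object SqlUdf {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("sql-udf").master("local[*]").getOrCreate()
        import spark.implicits._

        val people = Seq(("ann", 34), ("bob", 29)).toDF("name", "age")
        people.createOrReplaceTempView("people")   // queryable for the session's lifetime

        // A UDF callable from SQL.
        spark.udf.register("shout", (s: String) => s.toUpperCase)

        spark.sql("SELECT shout(name), age FROM people WHERE age > 30").show()
        spark.stop()
      }
    }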
 
Module 13
- Driver and Executor Logs
- Memory Management 
- Finding Information 
- Key Performance Considerations 
- Configuring Spark with SparkConf 
- Components of Execution: Jobs, Tasks, and Stages 
- Spark Web UI 
- Hardware Provisioning 
- Level of Parallelism 
- Serialization Format 
Metrics and Debugging
- Evaluating Spark jobs
- Monitoring tools for Spark
- Spark Web UI
- Memory consumption and resource allocation
- Job metrics
- Debugging and troubleshooting Spark jobs
- Monitoring Spark jobs
Monitoring Spark
- Logging in Spark
- Spark History Server 
- Spark Metrics 
- Exploring the Spark Application UI 
Spark Administration & Best Practices
- Estimating cluster resource requirements
- Estimating driver/executor memory sizes
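A sketch of the configuration and serialization knobs above, set through SparkConf; the values are illustrative, not recommendations:

    import org.apache.spark.{SparkConf, SparkContext}

    object TunedApp {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("tuned-app")
          .set("spark.executor.memory", "4g")                 // illustrative sizing
          .set("spark.default.parallelism", "200")            // level of parallelism
          .set("spark.serializer",
               "org.apache.spark.serializer.KryoSerializer")  // faster serialization

        val sc = new SparkContext(conf)
        // Jobs, stages and tasks for this app are visible in the Spark Web UI
        // (port 4040 on the driver by default).
        println(sc.parallelize(1 to 1000, 10).sum())
        sc.stop()
      }
    }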

Module 14
Kafka Core Concepts
- Brokers
- Topics
- Producers
- Replicas
- Partitions
- Consumers
Operating Kafka
- P&S tuning
- Monitoring
- Deploying
- Architecture
- Hardware specs
Developing Kafka Applications
- Serialization
- Compression
- Testing
- Case Study
- Reading from Kafka
- Writing to Kafka
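A sketch of the producer side of "Writing to Kafka", using the Kafka Java client from Scala; the broker address and topic are placeholders:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object ProducerDemo {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092")   // placeholder broker
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer")
        props.put("compression.type", "gzip")              // compression, per the topic list
        props.put("acks", "all")                           // wait for all in-sync replicas

        val producer = new KafkaProducer[String, String](props)
        try {
          // Records with the same key always land in the same partition.
          producer.send(new ProducerRecord("events", "user-1", "clicked"))
        } finally {
          producer.close()
        }
      }
    }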
 
Module 15
Cassandra in a Cluster
- Replication Strategies
- Seed Nodes 
- Adding Nodes to a Cluster 
- Node Configuration 
- Cassandra Cluster Manager 
- Creating a Cluster 
- Dynamic Ring Participation 
- Snitches 
- Partitioners 
The Cassandra Query Language
- Data Types
- CQL1 
- The Relational Data Model 
- CQL3 
- CQL Types 
- Cassandra’s Data Model 
- Secondary Indexes 
- CQL2 
Performance Tuning
- Memtables
- Commit Logs 
- Caching 
- Compaction 
- Hinted Handoff 
- JVM Settings 
- Concurrency and Threading 
- SSTables 
- Networking and Timeouts 
- Managing Performance 
- Using cassandra-stress 
Cassandra Introduction
- A Quick Review of Relational Databases
- Beyond Relational Databases 
- Web Scale 
- What’s Wrong with Relational Databases? 
- The Cassandra Elevator Pitch 
- The Rise of NoSQL 
- Where Did Cassandra Come From? 
- Is Cassandra a Good Fit for My Project? 
The Cassandra Architecture
- System Keyspaces
- Partitioners 
- Data Centers and Racks 
- Staged Event-Driven Architecture (SEDA) 
- Lightweight Transactions and Paxos 
- Rings and Tokens 
- Compaction 
- Queries and Coordinator Nodes 
- Caching 
- Consistency Levels 
- Hinted Handoff 
- Bloom Filters 
- Gossip and Failure Detection 
- Anti-Entropy, Repair, and Merkle Trees 
- Snitches 
- Virtual Nodes 
- Managers and Services 
- Replication Strategies 
- Memtables, SSTables, and Commit Logs 
- Tombstones 
Data Modeling
- Evaluating and Refining
- Conceptual Data Modeling 
- Defining Database Schema 
- Defining Application Queries 
- Logical Data Modeling 
- RDBMS Design 
- Physical Data Modeling 
Monitoring and Maintenance
- Logging
- Cassandra’s MBeans 
- Backup and Recovery 
- Maintenance Tools 
- SSTable Utilities 
- Basic Maintenance 
- Adding Nodes 
- Handling Node Failure 
- Health Check 
- Monitoring with nodetool 
- Monitoring Cassandra with JMX 
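A sketch of CQL in practice through the DataStax Java driver from Scala, pairing with "The Cassandra Query Language" above; the contact point, keyspace and table are placeholders:

    import com.datastax.driver.core.Cluster

    object CqlDemo {
      def main(args: Array[String]): Unit = {
        val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
        val session = cluster.connect()

        // SimpleStrategy with RF=1 suits a single-node lab, not production.
        session.execute(
          "CREATE KEYSPACE IF NOT EXISTS demo WITH replication = " +
          "{'class': 'SimpleStrategy', 'replication_factor': 1}")
        session.execute(
          "CREATE TABLE IF NOT EXISTS demo.users (id int PRIMARY KEY, name text)")
        session.execute("INSERT INTO demo.users (id, name) VALUES (1, 'ann')")

        val row = session.execute("SELECT name FROM demo.users WHERE id = 1").one()
        println(row.getString("name"))
        cluster.close()
      }
    }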
 
Module 16
Indexing and Query Optimization
Replication
Sharding
- Starting the Servers
- Adding a Shard from a Replica Set 
- Splitting Chunks 
- Chunk Ranges 
- Configuring Sharding 
- Sharding Data 
- Understanding the Components of a Cluster 
- When to Shard 
- The Balancer 
- Config Servers 
- Adding Capacity 
- The mongos Processes 
- How MongoDB Tracks Cluster Data 
Monitoring MongoDB Applications
- False Positives
- Seeing the Current Operations 
- Documents 
- Collections 
- Calculating Sizes 
- Finding Problematic Operations 
- Preventing Phantom Operations 
- Using mongotop and mongostat 
- Killing Operations 
- Using the System Profiler 
- Databases 
- Seeing What Your Application Is Doing 
Durability
- What Journaling Does
- Replacing Data Files 
- Durability with Replication 
- Sneaky Unclean Shutdowns 
- Planning Commit Batches 
- Checking for Corruption 
- What MongoDB Does Not Guarantee 
- Repairing Data Files 
- Setting Commit Intervals 
- Turning Off Journaling 
- The mongod.lock File
Advanced Sharding
- Controlling Data Distribution
- Location-Based Shard Keys 
- Hashed Shard Keys for GridFS 
- The Firehose Strategy 
- Shard Key Limitations 
- Ascending Shard Keys 
- Shard Key Strategies 
- Picturing Distributions 
- Randomly Distributed Shard Keys 
- Hashed Shard Key 
- Choosing a Shard Key 
- Shard Key Cardinality 
- Manual Sharding 
- Shard Key Rules and Guidelines 
- Taking Stock of Your Usage 
- Multi-Hotspot 
- Using a Cluster for Multiple Databases and Collections 
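A sketch of choosing a hashed shard key, issued as admin commands through the MongoDB Java driver from Scala against a mongos router; the database, collection and key field are placeholders:

    import com.mongodb.MongoClient
    import org.bson.Document

    object ShardSetup {
      def main(args: Array[String]): Unit = {
        // Connect to a mongos router, not to a shard directly.
        val client = new MongoClient("localhost", 27017)
        val admin  = client.getDatabase("admin")

        admin.runCommand(new Document("enableSharding", "demo"))

        // A hashed shard key spreads an ascending field (like userId) evenly,
        // avoiding the hot last chunk that a plain ascending key produces.
        admin.runCommand(new Document("shardCollection", "demo.events")
          .append("key", new Document("userId", "hashed")))

        client.close()
      }
    }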
 
Module 17
Introduction
Clients
Concepts
HBase vs RDBMS
Log-Structured Merge Trees
- Compaction
- Limitations of B+ Trees 
- Limitations of Binary Trees 
- The log-structured merge tree as the backbone of storage
HBase Storage Architecture
- MemStore
- Read and Write Path 
- Physical Architecture 
- HFile 
- WAL 
- HFile Format 
- HMaster and HRegionServer 
- How Data Is Stored in an HFile
- Root Table and Meta Table 
- Key Format 
- Role of ZooKeeper
Future Directions
- MMap for Bloom filters and block indexes
- Exploring OFF-Heap Storage 
Introduction
- Common Advantages
- Dynamo and Bigtable 
- Tables, Column Families, Rows and Columns
- Data Model
- BigTable and HBase (C + P)
- What am I giving up? 
- Schemaless 
- Key/Value Stores 
HBase Operations: Access Patterns
- Batching
- Filters 
- Put 
- Gets 
- Caching 
- Scanning 
Designing HBase Tables and Schemas
- Time-Ordered Relations
- Pagination 
- Concepts 
- Key Design 
- Partial Key Scans 
- Tall-Narrow Versus Flat-Wide Tables 
- Time Series Data 
- Secondary Indexes 
- Advanced Schemas 
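A sketch of the basic access patterns above (Put, Get, Scan) with the HBase Java client from Scala; the table and column family names are placeholders:

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put, Scan}
    import org.apache.hadoop.hbase.util.Bytes

    object HBaseOps {
      def main(args: Array[String]): Unit = {
        val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
        val table = conn.getTable(TableName.valueOf("users"))   // placeholder table
        val cf    = Bytes.toBytes("info")                       // placeholder family

        // Put: row key design matters; here the key is just the user id.
        val put = new Put(Bytes.toBytes("user-1"))
        put.addColumn(cf, Bytes.toBytes("name"), Bytes.toBytes("ann"))
        table.put(put)

        // Get: single-row lookup by key.
        val result = table.get(new Get(Bytes.toBytes("user-1")))
        println(Bytes.toString(result.getValue(cf, Bytes.toBytes("name"))))

        // Scan: range reads; partial key scans work because keys are sorted.
        val scanner = table.getScanner(new Scan())
        import scala.collection.JavaConverters._
        scanner.asScala.foreach(r => println(Bytes.toString(r.getRow)))

        scanner.close(); table.close(); conn.close()
      }
    }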
 
