
Modern Natural Language Processing with Deep Learning Training
Training Description
The main goal of this course is to provide comprehensive coverage of recent advances in deep learning applied to NLP. The session presents the state of the art in NLP-centric deep learning research and focuses on the role deep learning plays in major NLP applications, including spoken language understanding, dialog systems, lexical analysis, parsing, knowledge graphs, machine translation, question answering, sentiment analysis, social computing, and natural language generation (from images).
- This session is targeted at data scientists with a technical background in computation, including post-doctoral researchers, educators, industrial researchers, and anyone interested in getting up to speed with the latest deep learning techniques for NLP. 
- This is an advanced course on natural language processing. 
- The focus will be on a few of the most important techniques: Neural Machine Translation (NMT), Attention, and Bidirectional Encoder Representations from Transformers (BERT). 
- These are the most important techniques used in modern NLP, and they underlie most NLP tasks. 
- Various small case studies will be undertaken, the most important being Machine Translation. 
- Unlike other models, these techniques are end-to-end models that can be applied to many use cases: the same model is trained differently for each task, but the model itself need not be changed (as illustrated in the sketch below). These models are complex, and getting them to train correctly, or even to use correctly, is difficult, so understanding their inner workings is crucial. Refer to the case-study section to see what can be done. 
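To make the last point concrete, here is a minimal sketch (illustrative only, not the course's reference implementation) of an end-to-end encoder-decoder built with Keras. The helper name build_seq2seq and all hyperparameters are assumptions for illustration; the point is that the same architecture serves translation, summarization, or question answering, with only the (source, target) training pairs changing.

```python
# A minimal, illustrative sketch of an end-to-end encoder-decoder in Keras.
# The same architecture is reused across tasks; only the data fed to fit() changes.
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_seq2seq(src_vocab, tgt_vocab, emb_dim=128, units=256):
    # Encoder: embed the source tokens and keep the final LSTM state.
    enc_in = layers.Input(shape=(None,), name="source_tokens")
    enc_emb = layers.Embedding(src_vocab, emb_dim)(enc_in)
    _, h, c = layers.LSTM(units, return_state=True)(enc_emb)

    # Decoder: conditioned on the encoder state, predicts the next target token.
    dec_in = layers.Input(shape=(None,), name="target_tokens")
    dec_emb = layers.Embedding(tgt_vocab, emb_dim)(dec_in)
    dec_out = layers.LSTM(units, return_sequences=True)(dec_emb, initial_state=[h, c])
    logits = layers.Dense(tgt_vocab)(dec_out)

    model = Model([enc_in, dec_in], logits)
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
    return model

# The same build_seq2seq() call serves a translation task or a summarization
# task; only the token-id arrays passed to model.fit() differ.
```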
Case studies
- Language Modeling 
- Topic Modeling 
- Sentiment analysis 
- Information extraction 
- Text Summarization 
- Question answering / Chat bot 
- Text classification / categorization 
- Document classification 
- Sentence classification 
- Emotion classification 
- Spelling correction 
- Paraphrase generation 
- Named entity recognition 
- Semantic textual similarity 
- Relation extraction 
- Word sense disambiguation 
- End-to-end Speech Recognition 
- End-to-end Text to Speech 
- Entity linking 
- Morphological analysis 
- Grammatical error correction 
- Slot filling 
- Subjectivity analysis 
- Sarcasm detection 
- Hate speech detection 
- Intent classification 
Techniques
- Neural Machine Translation (NMT) with Attention 
- Neural Network 
- Text clustering 
- TextCNN 
- RNN 
- TextRNN 
- LSTM 
- TextLSTM 
- Bi-LSTM 
- Multi-LSTM 
- BiMulti-LSTM 
- Word Embeddings 
- Seq2Seq 
- Seq2Seq with Attention 
- Encoder-Decoder models 
- Neural Machine Translation (NMT) 
- Google's Neural Machine Translation (GNMT) 
- Self Attention 
- ELMo 
- ULMFIT 
- Transformer 
- BERT (Bidirectional Encoder Representations from Transformers) 
- XLNet 
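As a taste of how several of the building blocks listed above (word embeddings, LSTM, Bi-LSTM) fit together, here is a minimal sketch assuming the Keras API; the vocabulary size and layer widths are illustrative choices, not recommendations.

```python
# A minimal sketch of a Bi-LSTM text classifier built from the blocks above.
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Embedding(input_dim=20000, output_dim=128),   # word embeddings
    layers.Bidirectional(layers.LSTM(64)),               # Bi-LSTM encoder
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),               # binary label (e.g. sentiment)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(padded_token_ids, labels, epochs=3)  # token ids shaped (batch, seq_len)
```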
Key skills
- Understand Encoder Decoder Architecture 
- Understand Neural Machine Translation 
- Have an awareness of the hardware issues inherent in implementing scalable neural network models for language data. 
- Understand Attention 
- Be able to derive and implement optimisation algorithms for these models 
- Be able to implement and evaluate common neural network models for language. 
- Understand neural implementations of attention mechanisms and sequence embedding models and how these modular components can be combined to build state of the art NLP systems. 
Pre-requisites
- Solid knowledge of Linear and Logistic Regression 
- Good knowledge of Machine Learning concepts such as pipelines, grid search, randomized search, error curves, normalization techniques, etc. 
- Working Knowledge of Python 
- Cursory knowledge of Deep Neural Networks 
Instructional Method
This is an instructor-led course that provides lecture topics and the practical application of modern NLP with Deep Learning and the underlying technologies. Most concepts are presented pictorially, and a detailed case study strings together the technologies, patterns, and design.
Topics
Introduction to Deep Learning
- Parameter Hyperspace 
- Minimizing Cross Entropy 
- Normalized Inputs And Initial Weights 
- Measuring Performance 
- Transition Into Practical Aspects Of Learning 
- Stochastic Gradient Descent 
- Training your Logistic Classifier 
- Transition: Overfitting -> Dataset Size 
- Momentum And Learning Rate Decay 
- Supervised Classification 
- Solving Problems 
- Lather Rinse Repeat 
- Optimizing A Logistic Classifier 
- Cross Entropy 
- What is Deep Learning 
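The topics above ("Training your Logistic Classifier", "Cross Entropy", "Stochastic Gradient Descent") come together in the following minimal NumPy sketch; the data is random and all shapes are illustrative.

```python
# A minimal sketch: a softmax (logistic) classifier trained with gradient descent
# on a cross-entropy loss. A mini-batch for SGD would simply be a slice of X.
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)        # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def sgd_step(W, b, X, y_onehot, lr=0.1):
    """One gradient descent step on a batch; updates W and b in place."""
    probs = softmax(X @ W + b)                                        # forward pass
    loss = -np.mean(np.sum(y_onehot * np.log(probs + 1e-12), axis=1)) # cross entropy
    grad_logits = (probs - y_onehot) / len(X)                         # gradient w.r.t. logits
    W -= lr * (X.T @ grad_logits)
    b -= lr * grad_logits.sum(axis=0)
    return loss

# Illustrative usage on random data: 100 samples, 20 features, 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = rng.integers(0, 3, size=100)
Y = np.eye(3)[y]
W, b = np.zeros((20, 3)), np.zeros(3)
for step in range(100):
    loss = sgd_step(W, b, X, Y, lr=0.5)
print("final cross-entropy:", round(loss, 3))
```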
Deep Neural Network
- "2-layer" neural network 
- Network Of ReLUs 
- Dropout 
- Intro to Deep Neural Network 
- No Neurons 
- Backprop 
- Regularization Intro 
- Linear Models Are Limited 
- The Chain Rule 
- Dropout Pt-2 
- Regularization 
- Training A Deep Learning Network 
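A minimal Keras sketch, with illustrative sizes, of the ingredients above: a "2-layer" network of ReLUs with dropout and L2 regularization.

```python
# A minimal sketch of a small deep network with dropout and L2 regularization.
import tensorflow as tf
from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential([
    layers.Dense(256, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),  # hidden layer of ReLUs
    layers.Dropout(0.5),                                     # dropout for regularization
    layers.Dense(10, activation="softmax"),                  # output layer (e.g. 10 classes)
])
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# Training (backprop + SGD) happens inside fit():
# model.fit(x_train, y_train, validation_split=0.1, epochs=5)
```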
Deep Learning Internals
- How They Work 
- A Simple Predicting Machine 
- Following Signals Through A Neural Network 
- Sometimes One Classifier Is Not Enough 
- Learning Weights From More Than One Node 
- A Three Layer Example with Matrix Multiplication 
- Training A Simple Classifier 
- Backpropagating Errors To More Layers 
- Classifying is Not Very Different from Predicting 
- Making it easy by looking at logic and math 
- Preparing Data 
- Matrix Multiplication is Useful Honest! 
- Neurons, Nature’s Computing Machines 
- How Do We Actually Update Weights? 
- Weight Update Worked Example 
- Backpropagating Errors with Matrix Multiplication 
- Backpropagating Errors From More Output Nodes 
- DIY with Python 
- Interactive Python = IPython 
- A Very Gentle Start with Python 
- The MNIST Dataset of Handwritten Numbers 
- Python 
- Neural Network with Python 
- Hand rolled Neural Network 
- Creating New Training Data: Rotations 
- Your Own Handwriting 
- Inside the Mind of a Neural Network 
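In the spirit of the "hand rolled" network above, here is a minimal NumPy sketch of a three-layer network in which both the forward pass and the error backpropagation are plain matrix multiplications. The class name TinyNetwork and the layer sizes are illustrative (784 inputs and 10 outputs mirror the MNIST example).

```python
# A minimal hand-rolled neural network: forward pass and backprop as matrix multiplies.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyNetwork:
    def __init__(self, n_in=784, n_hidden=100, n_out=10, lr=0.1):
        rng = np.random.default_rng(0)
        self.w_ih = rng.normal(0.0, n_in ** -0.5, (n_hidden, n_in))    # input -> hidden
        self.w_ho = rng.normal(0.0, n_hidden ** -0.5, (n_out, n_hidden))  # hidden -> output
        self.lr = lr

    def train(self, inputs, targets):
        x = np.array(inputs, ndmin=2).T              # column vector
        t = np.array(targets, ndmin=2).T
        hidden = sigmoid(self.w_ih @ x)              # forward: input -> hidden
        output = sigmoid(self.w_ho @ hidden)         # forward: hidden -> output
        out_err = t - output                         # output-layer error
        hid_err = self.w_ho.T @ out_err              # backpropagate error via matrix multiply
        self.w_ho += self.lr * (out_err * output * (1 - output)) @ hidden.T
        self.w_ih += self.lr * (hid_err * hidden * (1 - hidden)) @ x.T

    def query(self, inputs):
        x = np.array(inputs, ndmin=2).T
        return sigmoid(self.w_ho @ sigmoid(self.w_ih @ x))
```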
Recurrent Neural Network and Sequence Modelling
- Concrete Recurrent Neural Network Architectures 
- Simple RNN 
- Gated Architectures: LSTM 
- Gated Architectures: GRU 
- CBOW as an RNN 
- Dropout in RNNs 
- Gated Architectures: Other Variants 
- Recurrent Neural Networks: Modeling Sequences and Stacks 
- Transducer 
- RNN Training 
- RNN Abstraction 
- Common RNN Usage-patterns 
- A Note on Reading the Literature 
- Encoder 
- Multi-layer (stacked) RNNs 
- RNNs for Representing Stacks 
- Acceptor 
- Bidirectional RNNs (biRNN) 
- Modeling with Recurrent Networks 
- Acceptors 
- RNN–CNN Document Classification 
- RNNs as Feature Extractors 
- Subject-verb Agreement Grammaticality Detection 
- Arc-factored Dependency Parsing 
- Part-of-speech Tagging 
- Sentiment Classification 
- Conditioned Generation 
- Applications 
- Sequence to Sequence Models 
- Syntactic Parsing 
- Morphological Inflection 
- Attention-based Models in NLP 
- Computational Complexity 
- Conditioned Generation with Attention 
- Machine Translation 
- Training Generators 
- Interpretability 
- Other Conditioning Contexts 
- Conditioned Generation (Encoder-Decoder) 
- Unsupervised Sentence Similarity 
- RNN Generators 
- Models for Sequence Analysis 
- Dissecting a Neural Translation Network 
- Beam Search and Global Normalization 
- Tackling seq2seq with Neural N-Grams 
- Implementing a Sentiment Analysis Model 
- Long Short-Term Memory (LSTM) Units 
- Recurrent Neural Networks 
- Implementing a Part-of-Speech Tagger 
- A Case for Stateful Deep Learning Models 
- Solving seq2seq Tasks with Recurrent Neural Networks 
- The Challenges with Vanishing Gradients 
- Augmenting Recurrent Networks with Attention 
- Dependency Parsing and SyntaxNet 
- TensorFlow Primitives for RNN Models 
- Analyzing Variable-Length Inputs 
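One of the usage patterns above, a bidirectional LSTM used as a feature extractor for part-of-speech tagging, in a minimal Keras sketch; the vocabulary and tagset sizes are illustrative assumptions.

```python
# A minimal sketch: a biRNN transducer that emits one POS tag per token.
import tensorflow as tf
from tensorflow.keras import layers

VOCAB, N_TAGS = 10000, 17   # e.g. roughly the Universal POS tagset size

model = tf.keras.Sequential([
    layers.Embedding(VOCAB, 100, mask_zero=True),                        # word embeddings
    layers.Bidirectional(layers.LSTM(128, return_sequences=True)),       # biRNN features
    layers.TimeDistributed(layers.Dense(N_TAGS, activation="softmax")),  # per-token tag
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(padded_token_ids, padded_tag_ids)  # both shaped (batch, seq_len)
```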
RNN Internals
- Backward Propagation 
- Unrolling 
- Forward Propagation 
- Matrix and their Shapes 
- GPU Optimization 
- Recurrent Neurons 
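A minimal NumPy sketch of the forward-propagation and unrolling topics above, with the matrix shapes spelled out; all dimensions are illustrative.

```python
# A minimal sketch: forward propagation of a simple (Elman) RNN, unrolled in time.
import numpy as np

D_IN, D_HID, T = 8, 16, 5                     # input size, hidden size, timesteps
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(D_IN, D_HID))         # input -> hidden weights
W_hh = rng.normal(size=(D_HID, D_HID))        # hidden -> hidden weights (the recurrence)
b = np.zeros(D_HID)

x_seq = rng.normal(size=(T, D_IN))            # one input sequence
h = np.zeros(D_HID)                           # initial hidden state
states = []
for t in range(T):                            # "unrolling" the recurrence over time
    # (D_IN,) @ (D_IN, D_HID) + (D_HID,) @ (D_HID, D_HID) -> (D_HID,)
    h = np.tanh(x_seq[t] @ W_xh + h @ W_hh + b)
    states.append(h)
states = np.stack(states)                     # shape: (T, D_HID)
print(states.shape)
```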
Gated Units Internals
- Forward Propagation 
- Significance of Batch and Sequence 
- Recurrence Depth 
- Gates and their significance 
- Feed Forward Depth 
- Introduction 
- Saliency Heatmap 
- Why 
- Variants 
- Weight Shapes 
- Bidirectional 
- Backward Propagation 
- Unidirectional 
- Dropouts 
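To make "gates and their significance" concrete, here is a minimal NumPy sketch of a single LSTM forward step; the weight layout (the four gates stacked in one matrix) and all sizes are illustrative assumptions.

```python
# A minimal sketch of one LSTM forward step, showing the input, forget and
# output gates that control what enters, stays in, and leaves the cell state.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One forward step. Shapes: W (4H, D), U (4H, H), b (4H,)."""
    z = W @ x + U @ h_prev + b
    H = h_prev.shape[0]
    i = sigmoid(z[0*H:1*H])          # input gate: how much new information to write
    f = sigmoid(z[1*H:2*H])          # forget gate: how much old cell state to keep
    o = sigmoid(z[2*H:3*H])          # output gate: how much of the cell to expose
    g = np.tanh(z[3*H:4*H])          # candidate cell content
    c = f * c_prev + i * g           # new cell state
    h = o * np.tanh(c)               # new hidden state
    return h, c

D, H = 8, 16
rng = np.random.default_rng(0)
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H),
                 rng.normal(size=(4*H, D)), rng.normal(size=(4*H, H)), np.zeros(4*H))
print(h.shape, c.shape)   # (16,) (16,)
```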
Modern Natural Language Processing (NLP) with Deep Learning - Part 1
- Pre-trained Word Representations 
- Character-based and Sub-word Representations 
- Sentences, Paragraphs, or Documents 
- Limitations of Distributional Methods 
- Other Algorithms 
- Using Pre-trained Embeddings 
- Choice of Contexts 
- Connecting the Worlds 
- Syntactic Window 
- Unsupervised Pre-training 
- Dealing with Multi-word Units and Word Inflections 
- Random Initialization 
- Word Embedding Algorithms 
- From Neural Language Models to Distributed Representations 
- Multilingual 
- Window Approach 
- Distributional Hypothesis and Word Representations 
- Supervised Task-specific Pre-training 
- Working with Natural Language Data 
- Directly Observable Properties 
- Typology of NLP Classification Problems 
- Distributional Features 
- Ngram Features 
- Features for NLP Problems 
- Features for Textual Data 
- Core Features vs Combination Features 
- Inferred Linguistic Properties 
- Language Modeling 
- Using Language Models for Generation 
- Limitations of Traditional Language Models 
- Evaluating Language Models: Perplexity 
- Language Modeling Task 
- Traditional Approaches to Language Modeling 
- Neural Language Models 
- Byproduct: Word Representations 
- From Textual Features to Inputs 
- Encoding Categorical Features 
- Odds and Ends 
- Variable Number of Features: Continuous Bag of Words 
- One-hot Encodings 
- Embeddings Vocabulary 
- Example: Part-of-Speech Tagging 
- Feature Combinations 
- Distance and Position Features 
- Padding, Unknown Words, and Word Dropout 
- Dense Encodings (Feature Embeddings) 
- Network’s Output 
- Vector Sharing 
- Combining Dense Vectors 
- Relation Between One-hot and Dense Vectors 
- Example: Arc-factored Parsing 
- Dimensionality 
- Dense Vectors vs One-hot Representations 
- Window-based Features 
- Using Word Embeddings 
- Odd-one Out 
- Short Document Similarity 
- Word Clustering 
- Retrofitting and Projections 
- Word Analogies 
- Word Similarity 
- Finding Similar Words 
- Similarity to a Group of Words 
- Obtaining Word Vectors 
- Case Study: A Feed-forward Architecture for Sentence Meaning 
- Inference 
- A Textual Similarity Network 
- Practicalities and Pitfalls 
- Natural Language Inference and the SNLI Dataset 
- Case Studies of NLP Features 
- Relation Between Words in Context: Arc-Factored Parsing 
- Document Classification: Authorship Attribution 
- Document Classification: Topic Classification 
- Word in Context, Linguistic Features: Preposition Sense 
- Disambiguation 
- Document Classification: Language Identification 
- Word-in-context: Named Entity Recognition 
- Word-in-context: Part of Speech Tagging 
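A minimal NumPy sketch of the "Using Word Embeddings" / "Finding Similar Words" topics above, using cosine similarity over a toy, randomly initialized embedding matrix; a real exercise would load pre-trained vectors instead.

```python
# A minimal sketch: finding similar words by cosine similarity over word vectors.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["king", "queen", "man", "woman", "paris"]
E = rng.normal(size=(len(vocab), 50))                # rows are (toy) word vectors
E = E / np.linalg.norm(E, axis=1, keepdims=True)     # unit-normalize once

def most_similar(word, topn=3):
    """Rank the other words by cosine similarity to `word`."""
    v = E[vocab.index(word)]
    scores = E @ v                                   # cosine, since rows are unit-length
    order = np.argsort(-scores)
    return [(vocab[i], float(scores[i])) for i in order if vocab[i] != word][:topn]

print(most_similar("king"))
```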
Modern Natural Language Processing (NLP) with Deep Learning - Part 2
- Deep Learning in Question Answering 
- Deep Learning in Machine Comprehension 
- Deep Learning in Question Answering over Knowledge Base 
- Deep Learning in Machine Translation 
- End-to-End Deep Learning for Machine Translation 
- Statistical Machine Translation and Its Challenges 
- Component-Wise Deep Learning for Machine Translation 
- Natural Language Understanding: Neural Machine Translation: Attention+ (Plus) 
- Pairing Strategies 
- Rare word Translation 
- Input Feeding 
- Performance 
- Alignment Score 
- Global vs Local Weights 
- Multilingual Model 
- Performance 
- Utilize Monolingual Data 
- Achieving State of the art results 
- Natural Language Understanding 
- Introduction 
- Deep Learning in Sentiment Analysis 
- Opinion Mining 
- Sentiment-Specific Word Embedding 
- Fine-Grained Sentiment Analysis 
- Document-Level Sentiment Classification 
- Sentence-Level Sentiment Classification 
- Natural Language Understanding: Neural Machine Translation: Attention 
- Avoiding the curse of length 
- Identifying bottlenecks with the vanilla seq-to-seq structure 
- Attention 
- Achieving State of the art results 
- Performance 
- Bahdanau Attention 
- Soft Alignment 
- Backward Propagation 
- Context Vector 
- Image Text Embedding 
- Alignment 
- Forward Propagation 
- QA 
- Luong Attention 
- Attention Mechanism 
- Image Generation 
- Training 
- Big Wins 
- Natural Language Understanding: Neural Machine Translation: Google NMT (GNMT) 
- GNMT Decoder 
- GNMT Encoder 
- Achieving State of the art results 
- Natural Language Understanding: Neural Machine Translation 
- Backward Propagation 
- Evaluation Metrics 
- Implementation techniques 
- RBMT vs SMT vs NMT 
- Forward Propagation 
- Performance 
- Formulation: Sequence-to-sequence 
- Identifying Bottlenecks 
- Training 
- Big Wins 
- Goal: End-to-End 
- Encoder-Decoder Architecture 
- LSTM 
- Bi-LSTM 
- GRU 
- CNN 
- Decoder 
- Conditional recurrent language model 
- Output 
- Decoder Transition 
- Decoder Strategies 
- Naive Search 
- Beam Search 
- Greedy Search 
- Encoder 
- Context Vector 
- Encoder Transition 
- Inputs 
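A minimal NumPy sketch of the attention pieces above (alignment scores, soft alignment, context vector), using dot-product (Luong-style) scoring; the encoder states and the decoder state are random stand-ins.

```python
# A minimal sketch: computing an attention context vector for one decoder step.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T, H = 6, 32
enc_states = rng.normal(size=(T, H))       # one encoder hidden state per source token
dec_state = rng.normal(size=H)             # current decoder hidden state

scores = enc_states @ dec_state            # alignment scores, one per source position
weights = softmax(scores)                  # soft alignment over the source sentence
context = weights @ enc_states             # context vector: weighted sum of encoder states
print(weights.round(2), context.shape)     # weights sum to 1; context has shape (H,)
```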
Modern Natural Language Processing (NLP) with Deep Learning - Part 3
- BERT (Bidirectional Encoder Representations from Transformers) 
- Take BERT out for a spin 
- Piping out outputs 
- Comparisons with Convnets 
- BERT: From Decoders to Encoders 
- Task specific-Models 
- Two-sentence Tasks 
- Masked Language Model 
- Transfer Learning to Downstream Tasks 
- OpenAI Transformer: Pre-training a Transformer Decoder for Language Modeling 
- Piping in Inputs 
- Architecture 
- BERT for feature extraction 
- The Transformer: Going beyond LSTMs 
- Transformer 
- Encoding 
- Encoders 
- High Level View 
- Tensors 
- Linear Softmax layer 
- Loss Function 
- Decoder side 
- Decoders 
- The Residuals 
- ELMo: Advanced Word Embeddings 
- Training ELMo on corpus 
- Salient features 
- Problem at hand? 
- Loading the ELMo embedding 
- Token Representation 
- Bidirectional Language Model (biLM) 
- How does it do it? Using Long Short-Term Memory (LSTM) 
- Let’s see the architecture: 
- What’s already existing? 
- What numbers do they improve on? 
- Deep contextualized word representation 
- Why look for a new method? 
- Let's dive into the crux! 
- Self Attention 
- Attention is all you need 
- Representing The Order of The Sequence Using Positional Encoding 
- Matrix Calculation of Self-Attention 
- Visual Attention 
- Machine Translation 
- Scaled dot product attention 
- Model Variations 
- Simultaneously Self-Attending to All Mentions for Relation Extraction 
- Applying Attention Throughout the Entire Model 
- Deep Semantic Role Labeling With Self-Attention 
- Detailed Architecture 
- Multi Head Attention 
- Self-Attention in NLP 
- Multi Headed Attention 
- Why Self Attention 
- ULMFIT 
- Input Dropouts 
- Multi Batch Encoder 
- Weight Dropouts 
- Variable Length BPTT 
- Building Blocks 
- Transfer Learning with ULMFIT 
- Hidden Dropout 
- Encoder Dropouts 
- QRNN 
- AWD-LSTM 
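The scaled dot-product self-attention listed above is the core operation inside the Transformer and BERT; the following minimal NumPy sketch shows a single head (multi-head attention repeats this with separate projections and concatenates the results). All sizes are illustrative.

```python
# A minimal sketch of scaled dot-product self-attention, one head.
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (seq, seq) attention scores
    weights = softmax(scores, axis=-1)     # each position attends over all positions
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 5, 64
X = rng.normal(size=(seq_len, d_model))    # token representations
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out, attn = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape, attn.shape)               # (5, 64) (5, 5)
```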
Software Tools
- TensorFlow 
- Installation 
- Sharing Variables 
- Creating Your First Graph and Running It in a Session 
- Managing Graphs 
- Visualizing the Graph and Training Curves Using TensorBoard 
- Implementing Gradient Descent 
- Lifecycle of a Node Value 
- Linear Regression with TensorFlow 
- Modularity 
- Saving and Restoring Models 
- Name Scopes 
- Feeding Data to the Training Algorithm 
- Keras 
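A minimal sketch of the "Creating Your First Graph and Running It in a Session" topic, assuming the TensorFlow 1.x graph-and-session style that the list above refers to (under TensorFlow 2.x the same calls are available via tf.compat.v1 with eager execution disabled).

```python
# A minimal sketch: build a small graph, then evaluate it in a session.
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

x = tf.Variable(3, name="x")          # nodes are added to the default graph...
y = tf.Variable(4, name="y")
f = x * x * y + y + 2                 # ...nothing is computed yet

with tf.Session() as sess:            # the session actually evaluates the graph
    sess.run(tf.global_variables_initializer())
    print(sess.run(f))                # 42
```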
