CSE590/MB590 Special Topics
  - Natural Language Processing with Python

Learn algorithmic techniques from Machine Learning (ML) for identifying useful patterns, associations, and relationships in natural language and text data, and for automating the process of learning from such data.
  • 24 hours of in-class lectures plus dedicated mentoring sessions from our faculty of industry experts

  • 1.5 semester credits toward both the certificate and the master’s degree

  • Access to high-quality recordings of live classes

  • Online live classroom available for all classes

  • Lifetime learning resources for our students

  • $1,000
Course Description

This course introduces algorithmic techniques from Machine Learning (ML) for identifying useful and relevant patterns, associations, and relationships in natural language and text data, with the goal of automating the process of learning from these types of data. Students will learn how ideas and methods from probability theory, mathematical statistics, learning theory, optimization, and computational complexity theory are combined to design these algorithmic techniques. Fundamental methods from Natural Language Processing (NLP), such as word and text embeddings, classification, supervised learning, generalization theory, and model reduction, will be introduced. Methods for query relevance assessment and relevance ranking will be discussed. Specific examples of industry and business use cases for NLP will be given in the course.

Students are required to complete course projects using modern data analysis software and case studies. The course focuses on the implementation of NLP algorithms in the Python language.

Course Objectives
  • To learn how computational methods and techniques are employed in Natural Language Processing and text mining and to learn the analytical, theoretical, and intuitive ideas that underpin them.
  • To understand and become familiar with the implementation details of NLP algorithms.
  • To gain hands-on experience with NLP tools in the Python language.
Week 1: Natural Language Processing Overview and Text Representation
  • Overview of Natural Language Processing and text mining
  • Tokenization
  • Parts-of-speech tagging
  • Syntax parsing
  • Named Entity Extraction and Recognition
  • Data Cleansing for text analytics
  • Introduction to Python for NLP and text processing
HW: Basic exercises and homework using Python
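As a taste of the tokenization topic above, here is a minimal regex-based sketch in plain Python. It is an illustration only; coursework would typically use a library tokenizer (e.g., NLTK's or spaCy's), which handles far more edge cases:

```python
import re

def tokenize(text):
    """Split raw text into word and punctuation tokens.

    Keeps internal apostrophes ("isn't") as part of the word and
    emits punctuation marks as separate tokens.
    """
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?|[^\w\s]", text)

print(tokenize("Don't panic: NLP isn't magic!"))
# → ["Don't", 'panic', ':', 'NLP', "isn't", 'magic', '!']
```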
Week 2: Bag-of-Words Approach and Word Embeddings
  • Vector Space Model and TF-IDF weightings
  • Bag-of-words model
  • Word2vec
  • GloVe (Global Vectors for Word Representation)
  • Dimensionality reduction (including PCA)
  • Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation
HW: Exercises/homework in Python
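The TF-IDF weighting covered this week can be computed from scratch. Below is a minimal sketch using raw term frequency and the idf form log(N/df); library implementations (e.g., scikit-learn's) use smoothed variants of this formula:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Per-document {term: weight} dicts using raw tf and idf = log(N / df).

    Tokenization here is a bare lowercase split, for illustration only.
    """
    N = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()                      # document frequency of each term
    for tokens in tokenized:
        df.update(set(tokens))
    return [{t: tf[t] * math.log(N / df[t]) for t in tf}
            for tf in (Counter(tokens) for tokens in tokenized)]

docs = ["the cat sat", "the dog sat", "the cat ran"]
weights = tf_idf(docs)
# "the" occurs in every document, so its idf (and hence its weight) is zero
```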
Week 3: Machine Learning Classification Algorithms for NLP and Text Mining
  • Naïve Bayes classifiers
  • k-Nearest Neighbors (k-NN) classifier
  • Logistic regression 
  • Decision trees
  • Overfitting and underfitting of models and how to avoid them
HW: Midterm project announcement and discussion
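The Naïve Bayes classifier listed above can be sketched compactly for text. This is a minimal multinomial Naive Bayes with add-one (Laplace) smoothing, using a bare lowercase split as the tokenizer; it is an illustration, not a production implementation:

```python
import math
from collections import Counter

class NaiveBayes:
    """Multinomial Naive Bayes for text with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.n_docs = len(docs)
        self.priors = Counter(labels)                      # class -> doc count
        self.counts = {c: Counter() for c in set(labels)}  # class -> term counts
        self.vocab = set()
        for doc, label in zip(docs, labels):
            tokens = doc.lower().split()
            self.counts[label].update(tokens)
            self.vocab.update(tokens)

    def predict(self, doc):
        tokens = doc.lower().split()
        V = len(self.vocab)
        best, best_logp = None, float("-inf")
        for c, term_counts in self.counts.items():
            total = sum(term_counts.values())
            # Log prior plus smoothed log likelihood of each token
            logp = math.log(self.priors[c] / self.n_docs)
            for t in tokens:
                logp += math.log((term_counts[t] + 1) / (total + V))
            if logp > best_logp:
                best, best_logp = c, logp
        return best

nb = NaiveBayes()
nb.fit(["great movie", "loved it", "terrible film", "hated it"],
       ["pos", "pos", "neg", "neg"])
print(nb.predict("loved that movie"))
```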
Week 4: Introduction to Artificial Neural Networks for NLP
  • From biology to Artificial Neurons 
  • The perceptron 
  • Multi-layer Perceptron and backpropagation 
  • Feed-forward neural nets and their applications
  • Recurrent neural nets to exploit semantic context in text
  • Deep Learning
  • Implementation using Python
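The perceptron covered above can be implemented in a few lines. A minimal sketch of the classic learning rule on a toy linearly separable problem (logical AND), with labels in {-1, +1}:

```python
def train_perceptron(X, y, epochs=20, lr=1.0):
    """Classic perceptron learning rule: update weights only on mistakes."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            activation = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * activation <= 0:          # misclassified (or on boundary)
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

# Logical AND: only (1, 1) is the positive class
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [-1, -1, -1, 1]
w, b = train_perceptron(X, y)
preds = [1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1 for x in X]
print(preds)
# → [-1, -1, -1, 1]
```

Because the data is linearly separable, the perceptron convergence theorem guarantees the rule stops making mistakes after finitely many updates.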

Week 5: Support Vector Machines for NLP
  • Mathematical background: the concept of a hyperplane in n dimensions
  • Regularization
  • SVM classifier for text and language
  • SVM with nonlinear decision boundaries 
  • SVM with more than two classes 
  • SVM example in Python
HW: Using SVMs for text classification and Midterm project
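One common way to train the linear SVM discussed above is subgradient descent on the regularized hinge loss (the idea behind the Pegasos algorithm). A minimal sketch on toy 2-D data, with a fixed learning rate for simplicity:

```python
def train_linear_svm(X, y, lam=0.01, lr=0.01, epochs=500):
    """Linear SVM via subgradient descent on lam*||w||^2 + hinge loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Inside the margin: hinge subgradient plus regularizer
                w = [wj - lr * (2 * lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Correctly classified with margin: regularizer term only
                w = [wj - lr * 2 * lam * wj for wj in w]
    return w, b

# Toy linearly separable data, labels in {-1, +1}
X = [[2, 2], [3, 3], [-2, -2], [-3, -1]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
preds = [1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1 for x in X]
```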
Week 6: Ensemble Learning, Boosting, and Bayesian ML for Text Mining
  • Voting classifier 
  • Random Forests 
  • Boosting 
  • AdaBoost 
  • Gradient Boosting 
  • Bayesian Machine Learning
HW: Comparison of Random Forests vs. boosting and voting classifiers in Python
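The voting classifier at the top of this week's list reduces to a few lines: each base classifier votes and the majority label wins. The three rule-based "classifiers" below are purely hypothetical stand-ins for trained models:

```python
from collections import Counter

def majority_vote(classifiers, x):
    """Hard-voting ensemble: each classifier votes, the majority label wins."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Illustrative keyword rules standing in for trained sentiment models
clf1 = lambda text: "pos" if "good" in text else "neg"
clf2 = lambda text: "pos" if "great" in text else "neg"
clf3 = lambda text: "neg" if "bad" in text else "pos"

print(majority_vote([clf1, clf2, clf3], "a good movie"))
# → pos (two of the three classifiers vote "pos")
```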
Week 7: Testing, Verification, Validation, and Visualization for Text Mining
  • Error analysis with cross-validation
  • Precision/Recall tradeoff
  • ROC curves
  • Visualization techniques for text mining analysis
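The precision/recall tradeoff above rests on two ratios computed from true-positive, false-positive, and false-negative counts. A minimal sketch:

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for one positive class from parallel label lists."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted positives, how many are right
    recall = tp / (tp + fn) if tp + fn else 0.0     # of actual positives, how many were found
    return precision, recall

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
p, r = precision_recall(y_true, y_pred)   # both 2/3 on this toy example
```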

Week 8: Information Retrieval and Text Ranking
  • Query relevance assessment
  • Document relevance ranking methods
  • Boolean and extended Boolean retrieval models
  • Google’s PageRank algorithm
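The PageRank algorithm listed above can be sketched as a simple power iteration. This is a minimal illustration on a dict-of-lists link graph; real implementations use sparse matrices and convergence checks:

```python
def pagerank(links, d=0.85, iters=50):
    """PageRank by power iteration.

    links maps each page to the list of pages it links to; a page with
    no out-links (a dangling page) spreads its rank uniformly.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            if outs:
                for q in outs:
                    new[q] += d * rank[p] / len(outs)
            else:
                for q in pages:
                    new[q] += d * rank[p] / n
        rank = new
    return rank

# Tiny three-page web: "c" receives the most links and ranks highest
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```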
Final project presentation and Final exam
Your Instructor
Daniel Zanger, Ph.D.
Dr. Zanger has over 20 years of experience in both industry and the federal government, working extensively in theoretical and applied machine learning, data analysis, optimization, statistical database privacy, cryptology, and quantum computing, among other fields. He has applied techniques from these fields to problems in areas such as text mining, image processing, operations research, and multi-sensor fusion. Dr. Zanger has authored numerous publications in refereed journals and conference proceedings across various technical fields, including mathematics (partial differential equations), probability theory, information retrieval, statistical learning theory (applied to finance), operations research, and database privacy. He holds a Ph.D. in Mathematics from the Massachusetts Institute of Technology (MIT) as well as a B.A. (with Highest Honors), also in Mathematics, from the University of California at Berkeley.