ml, machinelearning, ml, datavisualization, communication, artificialintelligence, ai, computerscience, datascience, dataanalytics, technologytraining, bay area, deep learning , nlp, sql, dataanalytics, dataengineering, cstu, apacke, ai application, onlinetraining, onlydegrees, onlineprogram, computerengineering
CSE590/MB590 Special Topics (1.5 credits)
  - Data Engineering with Apache Cassandra and Advanced SQL

This lesson introduces the use of design NoSQL database with Apache Cassandra. We will have hands on project design, build, write code in Python to create an application simulating the real world situation and that can help students to have demonstrating skills on Cassandra NoSQL and application built with Cassandra, along Python coding with Cassandra.
  • » 23 hours (8 weeks) in class lecturing plus dedicated mentoring sessions from our faculty of industry experts
  • » 1.5 semester credits for both certificate and master’s degree
  • » Access to high-quality live class recording
  • » Online live classroom available for all classes
  • » Lifetime learning resources for our students
  • $ 990
Course Description

Apache Cassandra

Apache Cassandra is a free and open-source, distributed, wide column store, NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra supports for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low latency operations for all clients.

AWS Dynamo DB and Google Bigtable has similar distribution design of Apache Cassandra.

This lesson introduces the use of design NoSQL database with Apache Cassandra. We will have hands on project design, build, write code in Python to create an application simulating the real world situation and that can help students to have demonstrating skills on Cassandra NoSQL and application built with Cassandra, along Python coding with Cassandra.

Advanced SQL

While Cassandra is a NoSQL database design for massive data analytics, Cassandra offers a limited SQL interface called CQL, that does not have join and windows function. We will integrate Spark SQL with Cassandra and run advanced SQL queries such as join, window function, nested sub-queries. We will also work on PostgreSQL, an open source relational database for more advanced SQL queries that are designed to handle SQL coding challenges in many of the data science and data engineering job interviews.

Course Objectives
  • Demonstrate a basic understanding of management information system terminology.
  • Explain the components and the operations, managerial and strategic roles of information systems within an organization.
  • Describe the major concepts, developments and managerial implications involved in computer hardware, software, database management, and telecommunications technologies
  • Describe how information technology is used in modern information systems to support end user applications, enterprise operations, e commerce, and managerial decision making
  • Explain the development of information system solutions for business problems and how to implement change.
  • Describe the managerial challenges and methods of managing information management, and security and ethical challenges.
  • Demonstrate ability to communicate a professional (affective) presentation and project paper by integrating concepts (cognitive) from the course.
University-wide Student Learning Outcomes

The University Student Learning Outcomes assessed and reinforced in this course include but are not limited to the following:

  • Communication
  • Critical Thinking
  • Information Literacy
Weekly Schedule
Week 1
  • Introduction
  • Discuss Chapter 1
  • Chapter 1
    • Introduction to NoSQL database and Apache Cassandra.
    • What applications are best to build on Cassandra and what applications are best not to use Cassandra.
    • Get used to the Cassandra environment running on the server in the cloud (we will run query there)
    • Get used to the PostgreSQL environment running on the server in the cloud (we will run query there)
  • Discuss Chapter 2
  • Chapter 2
    • Implementation project requirement discussion
    • Discuss AWS EC2 cloud server environment
    • Discuss AWS serverless database services, maintenance free, auto scaling features.
    • Overview of various AWS database services
  • HW1
Week 2

Quiz 1– Chapters 1 & 2

  • Chapter 3
    • The Cassandra Query Language
    • The Relational Data Model
    • Cassandra’s Data Model
  • Give out Group Project
  • Chapter 4 -11 Overview
  • Chapter 4
    • Clusters
    • Keyspaces
    • Tables
    • Columns
    • CQL Types
    • Numeric Data Types
    • Textual Data Types
    • Time and Identity Data Types
    • Other Simple Data Types
    • Collections
    • Tuples
    • User-Defined Types
  • Project discussion.
Week 3
  • Discuss Chapter 3 and 4
  • Chapter 5
    • Data Modeling
    • Conceptual Data Modeling
    • RDBMS Design
    • Design Differences Between RDBMS and Cassandra
    • Defining Application Queries
    • Logical Data Modeling
      • Hotel Logical Data Model
      • Reservation Logical Data Model
    • Physical Data Modeling
      • Hotel Physical Data Model
      • Reservation Physical Data Model
    • Evaluating and Refining
      • Calculating Partition Size
      • Calculating Size on Disk
    • Breaking Up Large Partitions
      • Defining Database Schema
    • Cassandra Data Modeling Tools

Quiz 2– Chapters 3, 4 and 5

  • Discuss Chapter 5
  • Chapter 6
    • Designing Applications with Cassandra
    • Hotel Application Design
      • Cassandra and Microservice Architecture
      • Microservice Architecture for a Hotel Application
      • Identifying Bounded Contexts
      • Identifying Services
      • Designing Microservice Persistence
    • Extending Designs
      • Secondary Indexes
      • Materialized Views
    • Deployment and Integration Considerations
      • Services, Keyspaces, and Clusters
      • Data Centers and Load Balancing
      • Interactions Between Microservices


Week 4
  • Discuss Chapter 6
  • Chapter 7
    • Application Development with Cassandra Drivers and connectors
    • Integrate Spark SQL and Cassandra
    • Develop Python code that have embedded SQL queries to access Cassandra tables
    • Write SQL queries in Python code from Spark SQL against tables in Cassandra
    • Starting working on PostgreSQL RDBMS running in the cloud server, get familiar with PostgreSQL command line

Mid-term Exam– Chapters 1,2,3,4,5, and 6

Week 5
  • Discuss Chapter 7
  • Chapter 8
    • ANSI SQL using PostgreSQL
    • Overview and hands on
    • SQL Data Definition Language: Create, Drop, Alter statements
    • SQL Data Manipulation Language: Insert, Update, Delete
    • SQL Transaction control language: Commit, Rollback
    • SQL Data Retrieval language: Select
    • Advanced Queries using select statement
    • Schema objects:
    • Schema, tables, Views, Materialized Views, Columns, Primary Keys, Foreign Keys, Normalization
    • Row level vs column level features
    • Data types

Discuss Group Project

Quiz 3– Chapter 6, 7


Week 6
  • Discuss Group Project
  • Discuss Chapter 8
  • Chapter 9
    • Explain when to use join, aggregation in queries
    • write SQL statements that perform below queries with sample database tables, explain how they work
    • Simple:
      • Which shippers do we have?
      • Certain fields from Categories
      • Sales Representatives
      • Sales Representatives in the United States
      • Orders placed by specific EmployeeID
      • Suppliers and ContactTitles
      • Orders shipping to France or Belgium
      • Orders shipping to any country in Latin America
      • Employees, in order of age
      • Showing only the Date with a DateTime field
      • Employees full name
      • OrderDetails amount per line item
      • How many customers?
      • When was the first order?
      • Countries where there are customers
      • Contact titles for customers
      • Products with associated supplier names
      • Orders and the Shipper that was used Intermediate Problems
  • Discuss Chapter 9
  • Present Group Project

Quiz 4– Chapter 8, 9


Week 7
  • Review Group Projects
  • Chapter 10
    • Medium:
      • Categories, and the total products in each category
      • Total customers per country/city
      • Products that need reordering
      • Products that need reordering, continued
      • Customer list by region
      • High freight charges
      • High freight charges with between dates
      • High freight charges - last year
      • Inventory list
      • Customers with no orders
  • Discuss Chapter 10
  • Review for final

HW 5

Week 8
  • Review
  • Chapter 11
    • Advanced SQL deep dive
    • Outer join, sub queries, window functions, analytic functions such as time series etc.
    • Continue hands on and explain how these queries work.
    • Advanced:
      • High-value customers
      • High-value customers - total orders
      • High-value customers - with discount
      • Month-end orders
      • Orders with many line items
      • Orders - random assortment
      • Orders - accidental double-entry
      • Orders - accidental double-entry details
      • Orders - accidental double-entry details, derived table
      • Late orders
      • Late orders - which employees?
      • Late orders vs. total orders
      • Late orders vs. total orders - missing employee
      • Late orders vs. total orders - fix null
      • Late orders vs. total orders - percentage
      • Late orders vs. total orders - fix decimal Customer grouping
      • Customer grouping - fix null
      • Customer grouping with percentage
      • Customer grouping - flexible
      • Countries with suppliers or customers
      • First order in each country
      • Customers with multiple orders in 5-day period
      • Customers with multiple orders in 5 day
  • Discuss Chapter 11

Final Exam

About the Instructor

George Jen

Dr. George Jen has in deep technical knowledge, experience and problem-solving skills on Vertica, Oracle and SAP Basis. Strong Data Analytical capability and well versed on statistics, linear algebra, popular machine learning/deep learning algorithms, data warehouse/data lake, ETL/ELT pipelines, data mining and BI, front end UI. Strong coding in Python and Scala. Expert know-how on Hadoop/Hive/Spark/Kafka/Nifi and variety of AWS server-less services on big data projects.

He has been certified by Oracle as Oracle Certified Professional and Vertica as Vertica Big Data Solution V1. I have doctor degree in Computer Engineering.