CSE590/MB590 Special Topics (1.5 credits)
  - Big Data/Analytics with Apache Spark.

This course introduces Spark Core, Spark SQL, and Hadoop / HDFS / Hive (needed for Spark), with hands-on practice, online demonstrations, and enterprise application cases (such as a housing price database).
  • 23 hours (8 weeks) of in-class lectures plus dedicated mentoring sessions from our faculty of industry experts
  • 1.5 semester credits for both the certificate and the master’s degree
  • Access to high-quality live class recordings
  • Online live classroom available for all classes
  • Lifetime learning resources for our students
  • $990
Course Description

Spark has increased the speed of analytics applications by orders of magnitude, and its versatility and ease of use are rapidly gaining it market share. Developers, end users, and integrators can use it to solve complex data problems at large scale. It is now the most active open source project in the big data community.

Learn the Spark command-line syntax through examples, Spark program tuning tips, and how to write application code in Python and Scala with Spark in the areas of SQL, streaming, machine learning, and graph computing.

Through this study, students gain a clear understanding of how data storage evolved from Hadoop / Hive to Spark and how these systems relate to one another. They also gain a deep understanding of Spark application scenarios, such as financial data management.

Note: this course is heavily hands-on, with coding practice throughout the class.

Prerequisite: Working experience, basic computer concepts, business software tools.

Prerequisite: Working experience, basic math knowledge, programming experience with Python.

Course Textbook

Textbook Information
ISBN13: 978-1491912218
ISBN10: 1491912219
Cover type: Paperback
Edition: 1st
Copyright: 2018
Publisher: O’Reilly
Published: 2018

CSTU does not have a student bookstore. Students are required to purchase the textbooks for their courses on the open market. In accordance with current HEOA requirements, CSTU provides the ISBN and retail price of its texts along with information on various purchasing options and buyback programs. The ISBN and price information are provided in the syllabus. Course materials can be purchased from any source; the CSTU website offers a convenient means of obtaining required course materials. CSTU cautions students about obtaining course materials from overseas sources because of the risks to delivery time and to the quality of the materials. Purchase decisions should not be based on the purchase price alone.

Additional reading and references

Data Science with Apache Spark – George Jen

Course Objectives
  • Gain a clear understanding of how data storage evolved from Hadoop / Hive to Spark and how these systems relate to one another, and a deep understanding of Spark application scenarios, such as financial data management.
  • Complete coding projects/homework assignments at the end of each session
  • Complete course project, midterm and final examination
Course Schedule
Chapter 1: Get Up to Speed (3 hours)
  • Dev environment setup, task list
  • JDK setup
  • Download and install Anaconda Python and create virtual environment with Python 3.6
  • Download and install Spark
  • Scala IDE
  • Install findspark, add spylon-kernel for Scala
  • Docker deployment of Spark Cluster
  • Production Spark Environment Setup
  • Create customized Apache Spark Docker container
  • Create Dockerfile
  • docker-compose and docker-compose.yml
  • Launch custom built Docker container with docker-compose
  • Set up Hadoop, Hive and Spark on Linux without Docker
  • Hadoop setup
  • Hadoop configuration
  • Configure $HADOOP_HOME/etc/hadoop
  • HDFS
  • Start Hadoop
  • Work with Hadoop and HDFS file system
  • Connect to Hadoop web interface port 50070
  • Install Hive
  • Hive home
  • Initialize Hive schema
  • Start Hive metastore service
  • Hive client
  • Setup and configure Apache Spark on production
Chapter 2: Python and Scala Crash Courses (3 hours)

Apache Spark is a powerful big data analytics engine built for developers, and using it requires writing code. Python and Scala are two common programming languages that Spark supports. To learn Spark, you need to be able to write code that calls Spark API functions.

  • Python crash course:
  • Basics
  • Iterables/Collections
  • Strings
  • List
  • Tuple
  • Dictionary
  • Set
  • Conditional statement
  • Loop statement -- For statement
  • Functions and methods
  • map and filter
  • map and filter take a function as input
  • lambda
  • Data structure
  • Input and if statement
  • Input from a file
  • Output to a file
  • Homework Programming Exercises
  • Scala crash course:
  • Basics
  • Functional programming
  • Type of Variable: Mutable or immutable
  • Methods
  • Class
  • Objects
  • Trait
  • Loop statements
  • Conditional branch
  • Run a program to estimate pi
  • Run Scala code with Apache Spark
  • Python with Apache Spark using Jupyter notebook
  • Homework Programming Exercises
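
As an illustration of the crash-course topics above (map, filter, lambda, and a program that estimates pi), here is a minimal plain-Python sketch. It is not taken from the course materials: the function name `estimate_pi`, the sample count, and the seed are assumptions chosen for this example, and it uses only the Python standard library rather than Spark.

```python
import random


def estimate_pi(num_samples, seed=42):
    """Monte Carlo estimate of pi built from map, filter and lambda."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate
    # map: turn each sample index into a random point in the unit square
    points = map(lambda _: (rng.random(), rng.random()), range(num_samples))
    # filter: keep only the points that fall inside the quarter circle
    inside = filter(lambda p: p[0] ** 2 + p[1] ** 2 <= 1.0, points)
    # The fraction of points inside the quarter circle approximates pi / 4
    hits = sum(1 for _ in inside)
    return 4.0 * hits / num_samples


# map and filter both take a function as their first argument
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))
evens = list(filter(lambda x: x % 2 == 0, [1, 2, 3, 4]))
pi_estimate = estimate_pi(100_000)

print(squares)      # [1, 4, 9, 16]
print(evens)        # [2, 4]
print(pi_estimate)  # close to 3.14
```

The same map-then-filter shape carries over to Spark, where the collection is distributed across a cluster instead of held in one Python process.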
Chapter 3: Spark SQL (3 hours)
  • Spark SQL discussion -- connect to any data source in the same consistent way
  • Spark SQL Implementation Example in Scala
  • Write Scala Code in Eclipse IDE
  • Hive Integration, run SQL or HiveQL queries on existing warehouses.
  • Scala application development project "Enrich JSON"
  • Homework Programming Exercises
Chapter 4: Spark Streaming 1 (3 hours)
  • Discretized Streams (DStreams)
  • Transformations on DStreams
  • map(func)
  • filter(func)
  • repartition(numPartitions)
  • union(otherStream)
  • count()
  • reduce(func)
  • countByValue()
  • reduceByKey(func, [numTasks])
  • join(otherStream, [numTasks])
  • cogroup(otherStream, [numTasks])
  • transform(func)
  • updateStateByKey(func)
  • Coding Exercises
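
To make the meaning of a few of the transformations listed above more concrete, the following plain-Python sketch imitates what countByValue, reduceByKey and updateStateByKey compute when a stream arrives as a sequence of micro-batches. It is only a model of the semantics under simplifying assumptions, not the Spark Streaming API: the helper names (`count_by_value`, `reduce_by_key`, `update_state_by_key`) and the sample word batches are hypothetical, and everything runs in a single process.

```python
from collections import Counter
from functools import reduce


def count_by_value(batch):
    """Model of countByValue(): frequency of each distinct element in one batch."""
    return dict(Counter(batch))


def reduce_by_key(pairs, func):
    """Model of reduceByKey(func): merge the values of each key with func."""
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


def update_state_by_key(state, pairs, update_func):
    """Model of updateStateByKey(func): fold a batch's values into running state."""
    new_values = {}
    for key, value in pairs:
        new_values.setdefault(key, []).append(value)
    keys = set(state) | set(new_values)
    # update_func receives the new values for a key and its previous state (or None)
    return {k: update_func(new_values.get(k, []), state.get(k)) for k in keys}


# Hypothetical input: three micro-batches of words
batches = [["spark", "hive", "spark"], ["hadoop"], ["spark", "hadoop"]]

running = {}
for batch in batches:
    pairs = [(word, 1) for word in batch]                  # map(func)
    per_batch = reduce_by_key(pairs, lambda a, b: a + b)   # counts for this batch only
    running = update_state_by_key(
        running, pairs, lambda new, old: sum(new) + (old or 0)
    )

print(count_by_value(batches[0]))  # {'spark': 2, 'hive': 1}
print(per_batch)                   # {'spark': 1, 'hadoop': 1}
print(running == {"spark": 3, "hive": 1, "hadoop": 2})  # True
```

The point of the contrast is that reduceByKey only sees the current batch, while updateStateByKey carries a value per key from one batch to the next.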
Chapter 5: Spark Streaming 2 (3 hours)
  • Window Operations
  • Transformation
  • countByWindow(windowLength, slideInterval)
  • reduceByWindow(func, windowLength, slideInterval)
  • reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])
  • reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks])
  • countByValueAndWindow(windowLength, slideInterval, [numTasks])
  • window(windowLength, slideInterval)
  • Join Operations
  • Output Operations on DStreams
  • print(n)
  • saveAsTextFiles(prefix, [suffix])
  • saveAsObjectFiles(prefix, [suffix])
  • saveAsHadoopFiles(prefix, [suffix])
  • foreachRDD(func)
  • Project:
  • Spark Streaming with Twitter: retrieve public tweets using the Twitter API in Scala
  • Spark Streaming use case with Python
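
The window operations listed above all share the two parameters windowLength and slideInterval. The sketch below shows, in plain Python, how those two parameters select which batches contribute to each windowed result. It is an illustrative model rather than Spark code: in Spark both parameters are time durations, whereas this sketch assumes they are measured in whole batches, and the function name `windowed_counts` and the sample data are hypothetical.

```python
def windowed_counts(batches, window_length, slide_interval):
    """Count keys over a sliding window; both sizes are measured in batches."""
    results = []
    # A result is emitted every slide_interval batches and covers the
    # last window_length batches that have arrived so far.
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)
        counts = {}
        for batch in batches[start:end]:
            for key in batch:
                counts[key] = counts.get(key, 0) + 1
        results.append(counts)
    return results


# Hypothetical stream of six micro-batches
batches = [["a"], ["a", "b"], ["b"], ["c"], ["a"], ["b", "b"]]
windows = windowed_counts(batches, window_length=3, slide_interval=2)

for w in windows:
    print(w)
# {'a': 2, 'b': 1}
# {'a': 1, 'b': 2, 'c': 1}
# {'c': 1, 'a': 1, 'b': 2}
```

Because the window (3 batches) is longer than the slide (2 batches), consecutive windows overlap by one batch. The invFunc variant of reduceByKeyAndWindow exploits that overlap by adding the batches that enter the window and subtracting the ones that leave, instead of recomputing each window from scratch as this sketch does.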
Chapter 6: GraphX -- Spark Graph Computing 1 (3 hours)
  • Package org.apache.spark.graphx
  • Edge Class
  • EdgeContext Class
  • EdgeDirection Class
  • EdgeRDD Class
  • EdgeTriplet Class
  • Graph Class
  • GraphLoader Object
  • GraphOps Class
  • GraphXUtils Object
  • PartitionStrategy Trait
  • Pregel Object
  • TripletFields Class
  • VertexRDD Class
Chapter 7: GraphX -- Spark Graph Computing 2 (3 hours)
  • Package org.apache.spark.graphx.impl
  • AggregatingEdgeContext Class
  • EdgeRDDImpl Class
  • Class GraphImpl
  • Package org.apache.spark.graphx.lib
  • Class ConnectedComponents
  • Class LabelPropagation
  • Class PageRank
  • Class ShortestPaths
  • Class StronglyConnectedComponents
  • Class SVDPlusPlus
  • Class SVDPlusPlus.Conf
  • Class TriangleCount
  • Package org.apache.spark.graphx.util
  • Class BytecodeUtils
  • Class GraphGenerators
  • Coding Projects
  • GraphX application 1
  • GraphX application 2
  • GraphX application 3
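
As a small illustration of what one of the library algorithms listed above computes, the following plain-Python sketch labels every vertex with the smallest vertex id in its connected component, which is the result ConnectedComponents produces in GraphX. It is a single-machine model and not the GraphX implementation: the function name `connected_components` and the sample graph are assumptions made for this example.

```python
def connected_components(vertices, edges):
    """Label each vertex with the smallest vertex id in its component."""
    labels = {v: v for v in vertices}  # every vertex starts as its own component
    changed = True
    while changed:
        changed = False
        # Repeatedly push the smaller label across each edge until nothing changes
        for src, dst in edges:
            smallest = min(labels[src], labels[dst])
            if labels[src] != smallest or labels[dst] != smallest:
                labels[src] = labels[dst] = smallest
                changed = True
    return labels


# Hypothetical graph: vertices 1-2-3 form one component, 4-5 another, 6 is isolated
vertices = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (2, 3), (4, 5)]
components = connected_components(vertices, edges)

print(components)  # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4, 6: 6}
```

The loop of "exchange labels with neighbors, keep the minimum, repeat until stable" is the same idea a vertex-centric model such as Pregel expresses as messages between vertices.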
Chapter 8: Apache Spark Machine Learning (3 hours)
  • Spark Machine Learning
  • Classifications:
  • Binary Classification
  • Multiclass Classification
  • Naive Bayes classifiers
  • Decision trees
  • Random forests
  • Gradient-boosted trees (GBTs)
  • Linear Support Vector Machine
  • Regressions:
  • Linear Regression
  • Isotonic regression
  • Decision Tree Regression
  • Random Forest Regression
  • Gradient-boosted tree regression
  • Clustering:
  • k-means
  • Projects: Spark Machine Learning Applications
  • Data Visualization with Vegas Viz and Scala with Spark ML
  • Apache Spark Machine Learning with Dremio Data Lake Engine
  • Dremio Data Lake Engine Apache Arrow Flight Connector with Spark Machine Learning
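
For the clustering topic above, here is a minimal plain-Python sketch of the basic k-means iteration (assign each point to its nearest center, then move each center to the mean of its points). It is meant only to show the idea and is not Spark's implementation: the function name `kmeans`, the hand-picked initial centers, and the sample points are assumptions for this example.

```python
def kmeans(points, centers, max_iter=20):
    """Basic k-means on 2-D points, starting from the given initial centers."""
    for _ in range(max_iter):
        # Assignment step: put each point in the cluster of its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center)
        new_centers = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else c
            for cl, c in zip(clusters, centers)
        ]
        if new_centers == centers:  # converged: centers stopped moving
            break
        centers = new_centers
    return centers, clusters


# Hypothetical data: two well-separated groups of points
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, clusters = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])

print(clusters[0])  # [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
print(clusters[1])  # [(8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
```

A distributed version follows the same two steps, with the assignment step running in parallel over partitions of the data.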
About the Instructor

George Jen

Dr. George Jen is an Oracle Certified Professional and is also Vertica certified. He holds a doctoral degree in Computer Engineering. George has in-depth technical knowledge, experience, and problem-solving skills with Vertica, Oracle, and SAP Basis. He has strong data analysis capability and is well versed in statistics, linear algebra, popular machine learning algorithms, data warehousing, ETL/ELT pipelines, data mining, BI, and front-end UI. He writes Python in a Pythonic way: in simple terms, accomplishing a programming objective with minimal lines of code whenever possible rather than coding it the C++ or Java way, which in his view would lose the elegance of Python coding style. He is proficient in Python, Scala, C++, Java, and Node.js programming, and has experience using Hadoop, Hive, Spark, Kafka, NiFi, and a variety of AWS serverless services on big data projects.