This course takes a detailed look at how to implement Big Data solutions using Apache Spark. The course is taught in Scala, although we can also deliver it in Python or Java if required.
Duration
4 days
Prerequisites
Solid experience in Scala (or Python/Java)
What you’ll learn
- Big Data principles
- Creating and using RDDs
- Spark Streaming
- Spark SQL
- Spark Machine Learning
- Spark Graph Processing
Course details
Introduction to Big Data
- Introduction to Hadoop
- Data serialization
- Column-based storage
- Messaging systems
- NoSQL
- Distributed SQL query engine
Introduction to Apache Spark
- Key features of Spark
- Spark architecture
- Application execution
- Resilient Distributed Datasets
- Spark API
- Caching
- Spark jobs
Interactive Data Analysis with Spark Shell
- Key concepts
- REPL commands
- Using Scala
- Number analysis
- Log analysis
Writing Spark Applications
- Writing a Hello World application
- Compiling and running an application
- Monitoring and debugging an application
Spark Streaming
- Overview of Spark Streaming
- Spark Streaming API
- Creating a discretized stream
- Processing a discretized stream
- Output operations
Spark SQL
- Overview of Spark SQL
- Performance considerations
- Usage scenarios
- Spark SQL API
- Built-in functions
Machine Learning with Spark
- Overview of Machine Learning
- Spark Machine Learning Libraries (MLlib API)
- Spark ML
Graph Processing with Spark
- Overview of graphs
- Overview of GraphX API
- Using GraphX API
Cluster Managers
- Standalone cluster manager
- Apache Mesos
- YARN