spark-syntax

This is a repo documenting the best practices in PySpark.


Spark-Syntax

This is a public repo documenting the "best practices" for writing PySpark code, based on what I have learnt from working with PySpark for 3 years. It focuses mainly on the Spark DataFrames and SQL library.

You can also visit ericxiao251.github.io/spark-syntax/ for an online book version.

Contributing/Topic Requests

If you notice any possible improvements, such as typos, spelling, or grammar, feel free to create a PR and I'll review it. You'll most likely be right.

If you have any topics that I could potentially go over, please create an issue and describe the topic. I'll try my best to address it.

Acknowledgement

Huge thanks to Levon for turning everything into a GitBook. You can follow him on GitHub at https://github.com/tumregels.

Table of Contents:

Chapter 1 - Getting Started with Spark:

Chapter 2 - Exploring the Spark APIs:

Chapter 3 - Aggregates:

Chapter 4 - Window Objects:

Chapter 5 - Error Logs:

Chapter 6 - Understanding Spark Performance:

  • 6.1 - Primer to Understanding Your Spark Application

    • 6.1.1 - Understanding how Spark Works

    • 6.1.2 - Understanding the SparkUI

    • 6.1.3 - Understanding how the DAG is Created

    • 6.1.4 - Understanding how Memory is Allocated

  • 6.2 - Analyzing your Spark Application

    • 6.2.1 - Looking for Skew in a Stage

    • 6.2.2 - Looking for Skew in the DAG

    • 6.2.3 - How to Determine the Number of Partitions to Use

  • 6.3 - How to Analyze the Skew of Your Data

Chapter 7 - High Performance Code:

  • 7.0 - The Types of Join Strategies in Spark

    • 7.0.1 - You got a Small Table? (Broadcast Join)
    • 7.0.2 - The Ideal Strategy (BroadcastHashJoin)
    • 7.0.3 - The Default Strategy (SortMergeJoin)
  • 7.1 - Improving Joins

  • 7.2 - Repeated Work on a Single Dataset (caching)

    • 7.2.1 - caching layers
  • 7.3 - Spark Parameters

    • 7.3.1 - Running Multiple Spark Applications at Scale (dynamic allocation)
    • 7.3.2 - The magical number 2001 (partitions)
    • 7.3.3 - Using a lot of UDFs? (python memory)
  • 7. - Bloom Filters :o?
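
As a small taste of the "How to Determine the Number of Partitions to Use" topic listed above, here is a minimal sketch in plain Python. It is not taken from the book: the function name `suggest_partition_count` is hypothetical, and the 128 MiB target partition size is only a commonly cited rule of thumb that I am assuming here, so treat the numbers as a starting point to tune rather than a recommendation from this repo.

```python
# Hypothetical helper, not part of this repo or of PySpark.
# Assumption: aim for roughly 128 MiB of data per partition.
DEFAULT_TARGET_PARTITION_BYTES = 128 * 1024 * 1024


def suggest_partition_count(dataset_bytes,
                            target_partition_bytes=DEFAULT_TARGET_PARTITION_BYTES,
                            min_partitions=1):
    """Estimate a partition count from the dataset size and a target partition size."""
    if dataset_bytes < 0:
        raise ValueError("dataset_bytes must be non-negative")
    if target_partition_bytes <= 0:
        raise ValueError("target_partition_bytes must be positive")
    # Integer ceiling division, so a partially filled last partition still counts.
    needed = -(-dataset_bytes // target_partition_bytes)
    return max(min_partitions, needed)


if __name__ == "__main__":
    ten_gib = 10 * 1024 ** 3
    print(suggest_partition_count(ten_gib))
```

In a real job the estimate could be passed to `DataFrame.repartition(n)` or used to set `spark.sql.shuffle.partitions`; the right target size depends on your cluster and workload.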

Main metrics

Overview
Name With Owner: ericxiao251/spark-syntax
Primary Language: Jupyter Notebook
Program Language: Jupyter Notebook (Language Count: 5)
Platform
License:
Owner Activity
Created At: 2017-08-27 17:59:59
Pushed At: 2022-12-08 18:20:12
Last Commit At: 2019-06-02 16:48:08
Release Count: 0
User Engagement
Stargazers Count: 463
Watchers Count: 14
Fork Count: 78
Commits Count: 88
Has Issues Enabled
Issues Count: 10
Issue Open Count: 6
Pull Requests Count: 14
Pull Requests Open Count: 4
Pull Requests Close Count: 2
Project Settings
Has Wiki Enabled
Is Archived
Is Fork
Is Locked
Is Mirror
Is Private