spark-syntax

This is a repo documenting the best practices in PySpark.

Github星跟踪图

Spark-Syntax

This is a public repo documenting all of the "best practices" of writing PySpark code from what I have learnt from working with PySpark for 3 years. This will mainly focus on the Spark DataFrames and SQL library.

you can also visit ericxiao251.github.io/spark-syntax/ for a online book version.

Contributing/Topic Requests

If you notice an improvements in terms of typos, spellings, grammar, etc. feel free to create a PR and I'll review it ?, you'll most likely be right.

If you have any topics that I could potentially go over, please create an issue and describe the topic. I'll try my best to address it ?.

Acknowledgement

Huge thanks to Levon for turning everything into a gitbook. You can follow his github at https://github.com/tumregels.

Table of Contexts:

Chapter 1 - Getting Started with Spark:

Chapter 2 - Exploring the Spark APIs:

Chapter 3 - Aggregates:

Chapter 4 - Window Objects:

Chapter 5 - Error Logs:

Chapter 6 - Understanding Spark Performance:

  • 6.1 - Primer to Understanding Your Spark Application

    • 6.1.1 - Understanding how Spark Works

    • 6.1.2 - Understanding the SparkUI

    • 6.1.3 - Understanding how the DAG is Created

    • 6.1.4 - Understanding how Memory is Allocated

  • 6.2 - Analyzing your Spark Application

    • 6.1 - Looking for Skew in a Stage

    • 6.2 - Looking for Skew in the DAG

    • 6.3 - How to Determine the Number of Partitions to Use

  • 6.3 - How to Analyze the Skew of Your Data

Chapter 7 - High Performance Code:

  • 7.0 - The Types of Join Strategies in Spark

    • 7.0.1 - You got a Small Table? (Broadcast Join)
    • 7.0.2 - The Ideal Strategy (BroadcastHashJoin)
    • 7.0.3 - The Default Strategy (SortMergeJoin)
  • 7.1 - Improving Joins

  • 7.2 - Repeated Work on a Single Dataset (caching)

    • 7.2.1 - caching layers
  • 7.3 - Spark Parameters

    • 7.3.1 - Running Multiple Spark Applications at Scale (dynamic allocation)
    • 7.3.2 - The magical number 2001 (partitions)
    • 7.3.3 - Using a lot of UDFs? (python memory)
  • 7. - Bloom Filters :o?

主要指标

概览
名称与所有者ericxiao251/spark-syntax
主编程语言Jupyter Notebook
编程语言Jupyter Notebook (语言数: 5)
平台
许可证
所有者活动
创建于2017-08-27 17:59:59
推送于2022-12-08 18:20:12
最后一次提交2019-06-02 16:48:08
发布数0
用户参与
星数463
关注者数14
派生数78
提交数88
已启用问题?
问题数10
打开的问题数6
拉请求数14
打开的拉请求数4
关闭的拉请求数2
项目设置
已启用Wiki?
已存档?
是复刻?
已锁定?
是镜像?
是私有?