
DataTalksClub/data-engineering-zoomcamp

Fork: 5426 Star: 25352 (updated 2024-11-30 13:10:39)

License: none listed

Language: Jupyter Notebook

Free Data Engineering course!


Data Engineering Zoomcamp

Syllabus

Taking the course

2025 Cohort

Self-paced mode

All the course materials are freely available, so you can take the course at your own pace.

  • Follow the suggested syllabus (see below) week by week
  • You don't need to fill in the registration form. Just start watching the videos and join Slack
  • Check the FAQ if you have problems
  • If you can't find a solution to your problem in the FAQ, ask for help in Slack

Syllabus

We encourage Learning in Public

Note: NYC TLC changed the format of the data we use to parquet. In the course we still use the CSV files accessible here.

Module 1: Containerization and Infrastructure as Code

  • Course overview
  • Introduction to GCP
  • Docker and docker-compose
  • Running Postgres locally with Docker
  • Setting up infrastructure on GCP with Terraform
  • Preparing the environment for the course
  • Homework

More details

Module 2: Workflow Orchestration

  • Data Lake
  • Workflow orchestration
  • Workflow orchestration with Mage
  • Homework

More details

Workshop 1: Data Ingestion

  • Reading from APIs
  • Building scalable pipelines
  • Normalising data
  • Incremental loading
  • Homework

More details
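
The normalising and incremental-loading topics above can be illustrated with a minimal sketch. It is not taken from the workshop materials: the function names (`flatten`, `load_incrementally`), the `updated_at` cursor field, and the sample records are all hypothetical, and an in-memory list stands in for a real destination table.

```python
from typing import Any, Iterable


def flatten(record: dict[str, Any], parent: str = "", sep: str = "__") -> dict[str, Any]:
    """Normalise a nested record into a single-level row with joined keys."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def load_incrementally(
    source: Iterable[dict[str, Any]],
    destination: list[dict[str, Any]],
    state: dict[str, Any],
    cursor_field: str = "updated_at",  # hypothetical cursor column
) -> int:
    """Append only rows newer than the stored high-water mark; return the count."""
    high_water = state.get("last_value")
    new_high_water = high_water
    loaded = 0
    for record in source:
        row = flatten(record)
        value = row[cursor_field]
        # Skip rows already loaded by a previous run.
        if high_water is not None and value <= high_water:
            continue
        destination.append(row)
        loaded += 1
        if new_high_water is None or value > new_high_water:
            new_high_water = value
    # Persist the new mark only after the whole batch has been processed.
    state["last_value"] = new_high_water
    return loaded


# Made-up API responses: the second call returns the old rows plus one new row.
api_page_1 = [
    {"id": 1, "updated_at": "2024-01-01", "fare": {"amount": 10.5, "currency": "USD"}},
    {"id": 2, "updated_at": "2024-01-02", "fare": {"amount": 7.0, "currency": "USD"}},
]
api_page_2 = api_page_1 + [
    {"id": 3, "updated_at": "2024-01-03", "fare": {"amount": 12.0, "currency": "USD"}},
]

table: list[dict[str, Any]] = []
state: dict[str, Any] = {}
first_run = load_incrementally(api_page_1, table, state)
second_run = load_incrementally(api_page_2, table, state)
```

The second run appends only the row with `id` 3, because the first two rows fall at or below the stored cursor value. Real pipelines also need to decide how to handle late-arriving or updated rows, which this sketch ignores.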

Module 3: Data Warehouse

  • Data Warehouse
  • BigQuery
  • Partitioning and clustering
  • BigQuery best practices
  • Internals of BigQuery
  • BigQuery Machine Learning

More details

Module 4: Analytics engineering

  • Basics of analytics engineering
  • dbt (data build tool)
  • BigQuery and dbt
  • Postgres and dbt
  • dbt models
  • Testing and documenting
  • Deployment to the cloud and locally
  • Visualizing the data with Google Data Studio and Metabase

More details

Module 5: Batch processing

  • Batch processing
  • What is Spark
  • Spark Dataframes
  • Spark SQL
  • Internals: GroupBy and joins

More details
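
As a rough illustration of the "Internals: GroupBy" topic, the sketch below imitates a two-stage grouped sum in plain Python: aggregate within each input partition, shuffle the partial results by key hash, then aggregate again. It is a toy model under stated assumptions, not Spark's implementation; all function names and the sample data are hypothetical.

```python
from collections import defaultdict


def partial_aggregate(partition):
    """Stage 1: sum values per key inside one input partition."""
    sums = defaultdict(float)
    for key, value in partition:
        sums[key] += value
    return dict(sums)


def shuffle(partials, num_partitions):
    """Route every key to exactly one output partition by hashing the key."""
    buckets = [[] for _ in range(num_partitions)]
    for partial in partials:
        for key, value in partial.items():
            buckets[hash(key) % num_partitions].append((key, value))
    return buckets


def final_aggregate(bucket):
    """Stage 2: combine the partial sums that landed in one output partition."""
    sums = defaultdict(float)
    for key, value in bucket:
        sums[key] += value
    return dict(sums)


def group_by_sum(partitions, num_partitions=2):
    partials = [partial_aggregate(p) for p in partitions]
    buckets = shuffle(partials, num_partitions)
    result = {}
    for bucket in buckets:
        # Each key lives in a single bucket, so plain updates never collide.
        result.update(final_aggregate(bucket))
    return result


# Made-up (zone, amount) pairs split across two input partitions.
input_partitions = [
    [("zone_a", 10.0), ("zone_b", 5.0), ("zone_a", 2.5)],
    [("zone_b", 1.0), ("zone_c", 4.0)],
]
totals = group_by_sum(input_partitions)
```

Aggregating before the shuffle means at most one record per key leaves each input partition, which is the main reason this two-stage shape reduces the data moved between stages.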

Module 6: Streaming

  • Introduction to Kafka
  • Schemas (Avro)
  • Kafka Streams
  • Kafka Connect and KSQL

More details

Project

Putting everything we learned into practice

  • Week 1 and 2: working on your project
  • Week 3: reviewing your peers

More details

Overview

Prerequisites

To get the most out of this course, you should feel comfortable with coding and the command line and know the basics of SQL. Prior experience with Python will be helpful, but you can pick up Python relatively quickly if you have experience with other programming languages.

Prior experience with data engineering is not required.

Instructors

Past instructors:

Asking for help in Slack

The best way to get support is to use DataTalks.Club's Slack. Join the #course-data-engineering channel.

To make discussions in Slack more organized:

Supporters and partners

Thanks to the course sponsors for making it possible to run this course.

Do you want to support our course and our community? Please reach out to alexey@datatalks.club

Latest release: (data updated 2024-10-16 14:10:18)

Topics:

data-engineering, dbt, docker, kafka, prefect, spark
