Study Notes: DE Zoomcamp 1.2.1 – Introduction to Docker


This content originally appeared on DEV Community and was authored by Pizofreude

Overview

  • Topic: Introduction to Docker and its importance for data engineers.
  • Purpose: Learn the basics of Docker, including its use cases, advantages, and practical setup for data engineering tasks such as running databases and pipelines.

Key Concepts

  1. What is Docker?
    • A platform for delivering software in isolated environments called containers.
    • Containers ensure isolation and portability, making it easier to run applications without interfering with the host system or other containers.
  2. Why Docker for Data Engineers?
    • Reproducibility: Ensures consistent environments across different systems.
    • Local experiments and testing: Quickly set up and run tools like PostgreSQL without installing them on the host system.
    • Integration tests (CI/CD): Simulate real-world scenarios by connecting components like data pipelines and databases in isolated environments.
    • Cloud readiness: Docker images can be deployed to cloud environments (e.g., Kubernetes, AWS Batch) for scalable execution.

Practical Examples and Workflow

  1. Docker for Running PostgreSQL
    • A PostgreSQL database can run inside a container, eliminating the need to install it on the host system.
    • Multiple containers can run different database instances without conflicts.
    • Tools like pgAdmin can also run in containers for database management and SQL query execution.
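As a rough illustration of running several PostgreSQL instances side by side, the sketch below builds the docker run commands for two containers that map different host ports to PostgreSQL's default port 5432. The container names (pg-dev, pg-test), the password value, and the postgres:13 image tag are assumptions chosen for illustration, and the helper postgres_run_command is hypothetical. The script only prints the commands; it does not start anything.

```python
import shlex


def postgres_run_command(name, host_port, password, image="postgres:13"):
    """Build the argv list for starting one detached PostgreSQL container."""
    return [
        "docker", "run", "-d",                  # -d: run in the background
        "--name", name,                         # container name (hypothetical)
        "-e", f"POSTGRES_PASSWORD={password}",  # read by the official postgres image
        "-p", f"{host_port}:5432",              # host port -> container port 5432
        image,
    ]


if __name__ == "__main__":
    # Two instances on different host ports, so they do not conflict.
    for name, port in [("pg-dev", 5432), ("pg-test", 5433)]:
        cmd = postgres_run_command(name, port, password="example")
        print(shlex.join(cmd))
        # To actually start the container: subprocess.run(cmd, check=True)
```

Because each container has its own filesystem and network namespace, only the host-side port has to differ between the two instances.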

  2. Data Pipelines in Docker
    • Example pipeline: A Python script that processes data from a CSV file, performs transformations using pandas, and outputs results to PostgreSQL.
    • Dependencies (Python version, libraries) are included in the container to ensure consistency.
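The read, transform, load shape of such a pipeline can be sketched as follows. This is not the course's script: to keep the example runnable without extra dependencies it uses the standard-library csv module in place of pandas and an in-memory SQLite database in place of PostgreSQL. The sample data, the trips table, and the run_pipeline function are all made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical input; a real pipeline would read a CSV file instead.
SAMPLE_CSV = """trip_id,distance_km,fare
1,2.5,7.0
2,0.0,3.5
3,10.0,22.0
"""


def run_pipeline(csv_text, conn):
    """Read rows from CSV text, transform them, and load them into a table."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Transform: drop zero-distance trips and derive the fare per kilometre.
    cleaned = [
        (
            int(r["trip_id"]),
            float(r["distance_km"]),
            float(r["fare"]) / float(r["distance_km"]),
        )
        for r in rows
        if float(r["distance_km"]) > 0
    ]
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trips "
        "(trip_id INTEGER, distance_km REAL, fare_per_km REAL)"
    )
    conn.executemany("INSERT INTO trips VALUES (?, ?, ?)", cleaned)
    conn.commit()
    return len(cleaned)


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")  # stand-in for PostgreSQL
    loaded = run_pipeline(SAMPLE_CSV, connection)
    print(f"loaded {loaded} rows")
```

Packaging a script like this in an image pins the Python version and library versions together with the code, which is what makes the run reproducible.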

  3. Isolation and Reproducibility
    • Containers can be reset to their original state after each use.
    • Docker images can be shared, ensuring the same environment is used regardless of the platform.

Key Docker Commands and Concepts

  1. Basic Commands
    • docker run [image-name]: Runs a container based on the specified image.
    • docker build -t [tag-name] .: Builds a Docker image from a Dockerfile.
    • docker exec -it [container-id] bash: Opens an interactive bash shell inside a running container.
    • docker stop [container-id]: Stops a running container.
  2. Images and Containers
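    • Image: A read-only template (filesystem, dependencies, and startup configuration) from which containers are created.
    • Container: A runnable instance of an image; many containers can be started from the same image.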
  3. Dockerfile
    • A file containing instructions to build a custom Docker image.
    • Common Dockerfile instructions:
      • FROM [base-image]: Specifies the base image (e.g., python:3.9).
      • RUN [command]: Executes commands (e.g., RUN pip install pandas).
      • ENTRYPOINT: Defines the executable that runs when a container starts. In exec form (a JSON array), arguments given after the image name in docker run are appended to it.
      • WORKDIR: Sets the working directory inside the container.

Practical Demonstrations

  1. Running a Container
    • Run a test image: docker run hello-world.
    • Run an Ubuntu image interactively: docker run -it ubuntu bash. The -it flag combines -i (interactive, keeps STDIN open) and -t (allocates a terminal).
  2. Installing Python Dependencies in a Container
    • Start a Python container with a bash shell: docker run -it python:3.9 bash.
    • Install pandas from that shell: pip install pandas. An equivalent way to get the shell is to override the image's entrypoint with bash: docker run -it --entrypoint=bash python:3.9, and then run pip install pandas.
    • Run Python within the container (for example, import pandas) to check the installation.
    • Note: Changes made in the container (e.g., installed packages) are not saved to the image. Every docker run starts a fresh container from the image, so the packages are missing the next time.
  3. Creating a Custom Docker Image

    • Example Dockerfile for a data pipeline:

      FROM python:3.9
      RUN pip install pandas
      WORKDIR /app
      COPY pipeline.py /app/
      ENTRYPOINT ["python", "pipeline.py"]
      
      

    • Build the image from the Dockerfile: docker build -t pipeline-image .
      • -t sets the tag (name) of the image being built.
      • pipeline-image is the tag name.
      • . is the build context: Docker uses the contents of the current directory, where it also looks for the Dockerfile, to build the image.
    • In docker build -t test:pandas ., the colon separates the image name from its tag: test is the image name and pandas is the tag.
    • Tags differentiate versions or variations of the same image. Here, test:pandas might indicate a variant of the test image that includes pandas (a Python library for data manipulation and analysis). If no tag is given, Docker uses latest.
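The name:tag convention described above can be made concrete with a small sketch. The function split_image_reference is hypothetical and is not part of Docker or any Docker library; it covers only the simple name and name:tag forms plus a registry host with a port, and it ignores digest references.

```python
def split_image_reference(reference):
    """Split 'name:tag' into (name, tag); the tag defaults to 'latest'."""
    # Only a colon after the last slash separates name and tag, so a
    # registry port such as 'localhost:5000/test' is not mistaken for a tag.
    last_slash = reference.rfind("/")
    colon = reference.rfind(":")
    if colon > last_slash:
        return reference[:colon], reference[colon + 1:]
    return reference, "latest"


if __name__ == "__main__":
    for ref in ["test:pandas", "python:3.9", "pipeline-image"]:
        print(ref, "->", split_image_reference(ref))
```

An image built without an explicit tag, such as pipeline-image, is therefore the same as pipeline-image:latest.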

    • Run the container: docker run pipeline-image.
      • docker run -it test:pandas 2025-01-27 passes 2025-01-27 to the pipeline script as an argument (for example, the date to process).
      • docker run -it test:pandas 2025-01-27 param1 param2 passes the date followed by additional parameters.

  4. Parameterizing the Pipeline

    • Pass arguments to the script using command-line parameters.
    • Example: docker run pipeline-image arg1 arg2.
    • Access parameters in Python using sys.argv.
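A minimal sketch of how pipeline.py might read those parameters is shown below. With ENTRYPOINT ["python", "pipeline.py"], everything after the image name in docker run arrives in sys.argv starting at index 1. The function name parse_pipeline_args and the interpretation of the first argument as a day are assumptions for illustration.

```python
import sys


def parse_pipeline_args(argv):
    """Return (day, extra_params) from an argv list shaped like sys.argv."""
    # argv[0] is the script name; the first real argument is argv[1].
    day = argv[1] if len(argv) > 1 else None
    extra_params = argv[2:]
    return day, extra_params


if __name__ == "__main__":
    day, extra = parse_pipeline_args(sys.argv)
    if day is None:
        print("usage: pipeline.py DAY [PARAM ...]")
    else:
        print(f"processing day = {day}, extra parameters = {extra}")
```

For example, docker run pipeline-image 2025-01-27 param1 param2 would give the script the argument list ["pipeline.py", "2025-01-27", "param1", "param2"].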

Advantages of Docker

  1. Portability: Run the same container locally, in the cloud, or in CI/CD environments.
  2. Consistency: Eliminates the "works on my machine" problem.
  3. Isolation: Prevents interference between different applications or services.
  4. Scalability: Easily deploy containers to orchestration platforms like Kubernetes.

Recommendations for Beginners

  1. Tools for Development:
    • Use Visual Studio Code or similar editors for editing files.
    • On Windows, use Git Bash or Windows Subsystem for Linux (WSL) for a Linux-like terminal experience.
  2. Learning Resources:
    • Experiment with basic Docker commands.
    • Practice building and running custom images.
    • Explore Docker Hub for prebuilt images.
    • Look into CI/CD tools like GitHub Actions for automation.

Next Steps

  • Apply Docker to run PostgreSQL and practice SQL.
  • Build and test data pipelines using Docker containers.
  • Explore deploying containers to cloud platforms for scalable execution.


Pizofreude. Study Notes: DE Zoomcamp 1.2.1 – Introduction to Docker. Sciencx, 2025-01-26. https://www.scien.cx/2025/01/26/study-notes-de-zoomcamp-1-2-1-introduction-to-docker/
