My background is mostly in web development, but I am learning data engineering because I am interested in business intelligence and want to level up my knowledge. One of the things I have recently discovered is data validation within a data pipeline. Let me share a quick example from a previous job that illustrates the need for data validation in a data pipeline.
Example of a faulty data pipeline
We had a web app that allowed our clients to log into an account and view performance data in a dashboard. That dashboard was the main landing page users saw after they logged in - it was the core of our app. The dashboard pulled its data from a data source that was managed by some analysts. What would often happen was that an analyst would pull data from a database, perform some analyses, and update the data that fed the dashboard. Sometimes the dashboard in our web app would silently fail and show no data because the dataset no longer matched the format the dashboard expected. That would then kick off a manual search to find out who changed the data and how to fix the issue. (Clearly we had some automation issues, but this kind of problem exists even with automated data pipelines that are well architected.)
We could have really benefited from using Great Expectations. Let me show you why.
NOTE: Instead of writing out “Great Expectations” I will abbreviate it as “GX” from now on in this tutorial.
What does GX do? Where does it fit in a data pipeline?
This is an example of a simple architecture that uses GX:
Simple data pipeline example using GX
GX is used to validate data within a data pipeline. You can place GX Checkpoints at any stage of a data pipeline where data needs to be checked for quality before it moves on to the next stage.
When using GX in a data pipeline, if the data passes validation, then the data can be moved to the next stage in the pipeline. If the data does not pass validation, then we need to alert the proper stakeholders (e.g. data engineers) so they can fix the issues before end users are affected.
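To make that concrete, here is a minimal sketch of what running a Checkpoint and branching on its result can look like with the GX Python API. The Checkpoint name ("taxi_data_checkpoint") and the two pass/fail handler functions are hypothetical placeholders I made up for illustration, and the exact method names can vary between GX versions.

```python
import great_expectations as gx

def promote_batch_to_next_stage():
    # Hypothetical placeholder for whatever moves the batch to the next stage.
    print("Validation passed - promoting batch to the next stage.")

def notify_data_engineers(checkpoint_result):
    # Hypothetical placeholder for an alerting hook (email, Slack, PagerDuty, etc.).
    print("Validation failed - alerting data engineers.")

# Assumes a GX project that already has a Checkpoint named "taxi_data_checkpoint"
# (a made-up name); exact API names vary between GX versions.
context = gx.get_context()
result = context.run_checkpoint(checkpoint_name="taxi_data_checkpoint")

if result.success:
    # Data passed validation, so it can move to the next stage of the pipeline.
    promote_batch_to_next_stage()
else:
    # Data failed validation, so stop here and alert the proper stakeholders.
    notify_data_engineers(result)
```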
What will we be doing in this tutorial?
We are going to use some datasets from the NYC taxi data to demonstrate how GX works. Specifically, we are going to use two datasets: January 2019 and February 2019. Each dataset contains 10,000 records and each record represents one taxi ride with multiple columns of data (e.g. pick-up location, drop-off location, payment amount, number of passengers).
In our scenario, we know that the January data is clean and matches what we expect, so we are going to use that dataset to create some validation rules. Then we are going to run the February dataset (i.e. batch) through those rules to see whether the February data passes validation.
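As a preview of that workflow, here is a rough sketch of how creating validation rules from the clean January data can look. It assumes the fluent pandas datasource API available in recent GX versions (method names may differ in your version), a local copy of the January sample file, and column names taken from the NYC taxi sample data.

```python
import great_expectations as gx

# Rough sketch, assuming the fluent pandas datasource API in recent GX versions
# and a local copy of the January NYC taxi sample file.
context = gx.get_context()
validator = context.sources.pandas_default.read_csv("yellow_tripdata_sample_2019-01.csv")

# Encode what we expect about the clean January data as validation rules.
validator.expect_column_values_to_not_be_null("pickup_datetime")
validator.expect_column_values_to_be_between("passenger_count", min_value=1, max_value=6)

# Save the rules so a later batch (e.g. the February data) can be validated against them.
validator.save_expectation_suite(discard_failed_expectations=False)
```

In the rest of this tutorial we will build out a set of rules like this and then run the February batch through a Checkpoint to see whether it passes.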