You deserve that money — Data Wrangling in Data Engineering

This content originally appeared on Level Up Coding - Medium and was authored by Stella_Space

If you’ve had the opportunity to work across multiple data ecosystems, you’ll have noticed how privileged data engineers in Big Tech, or even in moderately tech-focused businesses, are when it comes to data wrangling. These professionals largely rely on SQL (or similar tools) because much of the data they handle comes from well-structured digital footprints.

However, this is not the case for many other enterprises, including government agencies, research institutions, and non-profits. In these environments, data systems tend to be chaotic, often requiring more than just data engineering expertise.

Visual summary of differences by Stella

Success in such settings demands an intersection of skills spanning data engineering, systems architecture (or even software engineering), and core data science principles to build, manage, and ultimately create usable data infrastructures for other researchers.

This blend of complexity and frustration brings us to our second project: data wrangling. While data wrangling often aligns closely with data science approaches, these skills are fundamental for data engineers, as we’ll explore in this article.

Why Data Wrangling?

We’ve all heard the phrase, ‘Data is the new oil.’ And while it’s true, data is inherently multifaceted. It arrives in various forms, often messy and unstructured, and as a data engineer, it’s your responsibility to process, refine, and make it usable.

In such scenarios, wrangling provides a set of powerful yet straightforward tools that can be customized to create robust schemas and seamlessly integrated systems for managing data.

This process can be grueling, but it has also been a very rewarding aspect of my work, allowing me to leverage my full range of knowledge and skills. Over time, I’ve found that my role has evolved far beyond the title of data engineer. Today, I see myself as a data systems architect and research scientist, bridging the gap between data engineering and innovation.

Project Snapshot: APIs, NLP, and Knowledge Graphs

This article summarizes a project that leverages APIs, Natural Language Processing (NLP), and knowledge graphs to extract and structure relationships from text data.

I walk through the process of scraping blog posts about social events, wrangling the text data, and mapping relationships among individuals and groups.

Data Wrangling: The Foundation

Data wrangling involves transforming raw data, regardless of its format, into a structured, clean, and usable state. This process becomes especially challenging when dealing with text data, which is inherently unstructured and often noisy.

To convert raw text into meaningful insights, we use:

pandas for basic data manipulation, and
spaCy, an NLP library, for cleaning, extracting, and structuring the text data.

Example Project: Mapping Relationships from Party Blogs

Goal: Using the scraping skills developed earlier, we aim to build an insights database from the New York Social Diary site. Each blog post typically includes:

Party Details — Names and descriptions of the events.
Captions — Text mentioning individuals and groups attending the events.

Snippet of NY Social Diary posts

Wrangling goal:

Extract information using NLP: from the scraped data, identify individuals, groups, and dates.
Build a knowledge graph: map relationships to highlight individual influence at these events.

Step 1: Data Collection and Caching

The first step involves scraping data from the website, storing it in a database, and making it accessible via an API.

Once the data is fetched, we use Python’s pickle module to cache it locally. Caching prevents repeated API calls, reduces the risk of exceeding request limits, and ensures progress isn’t lost during interruptions.

Writing Python objects to disk this way is known as serialization. It allows smooth processing and quick resumption of work if needed (see Figure 1).

Figure 1: project code snippet
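
The code in Figure 1 is not reproduced in the text, so here is a minimal sketch of the caching pattern described above, not the project's actual script. The helper name load_or_fetch, the file name parties.pkl, and the fetch callable are all assumptions made for illustration.

```python
import pickle
from pathlib import Path
from typing import Any, Callable


def load_or_fetch(cache_path: Path, fetch: Callable[[], Any]) -> Any:
    """Return the cached object if it exists; otherwise call fetch() and cache the result."""
    if cache_path.exists():
        with cache_path.open("rb") as fh:
            # Only unpickle files you created yourself: pickle can execute code on load.
            return pickle.load(fh)

    data = fetch()  # hypothetical callable that performs the API request(s)

    # Write to a temporary file first so an interruption cannot leave a half-written cache.
    tmp_path = cache_path.with_name(cache_path.name + ".tmp")
    with tmp_path.open("wb") as fh:
        pickle.dump(data, fh, protocol=pickle.HIGHEST_PROTOCOL)
    tmp_path.replace(cache_path)
    return data
```

A call such as load_or_fetch(Path("parties.pkl"), fetch_parties) would hit the API only on the first run; later runs read the local file. Deleting the file forces a fresh fetch.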

Step 1 returns every party along with the date on which it occurred.

All parties up to December 2015

Steps 2–4: Parsing, Extracting, and Mapping Relationships

Step 2: Parse all captions from the blog posts.
Step 3: Extract names of individuals and groups using NLP techniques.
Step 4: Save the processed data and create a knowledge graph to visualize relationships.
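
To make Step 4 concrete, here is a small sketch of one way to turn extracted names into a graph: treat every pair of people named in the same caption as an edge and count how often each pair recurs. This is an assumption about the graph's shape rather than the repository's implementation, and the names in the sample data are invented.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_graph(captions):
    """Count how often each pair of names appears in the same caption."""
    edges = Counter()
    for names in captions:
        # Sort and de-duplicate so (A, B) and (B, A) count as the same edge.
        for a, b in combinations(sorted(set(names)), 2):
            edges[(a, b)] += 1
    return edges


def weighted_degree(edges):
    """Sum edge weights per person: a simple proxy for how connected they are."""
    degree = Counter()
    for (a, b), weight in edges.items():
        degree[a] += weight
        degree[b] += weight
    return degree


# Hypothetical, already-extracted names for three captions.
captions = [
    ["Alice Adams", "Bob Brown", "Carol Clark"],
    ["Alice Adams", "Bob Brown"],
    ["Carol Clark", "Dan Davis"],
]
edges = build_cooccurrence_graph(captions)
degree = weighted_degree(edges)
print(degree.most_common(3))
```

Weighted degree is only the simplest influence measure. The same edge counts could be loaded into a graph library to compute other centrality scores.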

Visit my GitHub repository to access scripts, detailed documentation, and project resources. The repository includes a step-by-step notebook demonstrating the wrangling process.

Challenges and Solutions

Challenge 1: Data ambiguity. Ambiguous data often forces us to spend more time understanding the data than writing code against it. For example, a name such as Mr. and Mrs. Alexis Davis requires careful parsing to separate the individuals from the relationship between them.
Solution: Use regular expressions and NLP preprocessing pipelines to standardize input data.
Challenge 2: Scalability. Large datasets with thousands of captions can outgrow simple single-machine preprocessing.
Solution: Leverage distributed frameworks like Spark NLP for scalable processing when handling larger datasets.
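
As an illustration of the regular-expression approach to Challenge 1, the sketch below normalizes a single name phrase. The patterns, the list of honorifics, and the function name normalize_name are assumptions chosen for this example. It handles only two couple patterns and would need extending for real captions (for instance, "Alice Adams and Bob Brown" is left untouched).

```python
import re

TITLE = r"(?:Mr|Mrs|Ms|Dr)\."
# "Mr. and Mrs. Alexis Davis": a titled couple sharing one stated name.
TITLED_COUPLE = re.compile(rf"^{TITLE}\s+and\s+{TITLE}\s+(?P<name>.+)$")
# "Jane and John Smith": two first names sharing one surname.
SHARED_SURNAME = re.compile(
    r"^(?P<first1>[A-Z][\w'-]+)\s+and\s+(?P<first2>[A-Z][\w'-]+)\s+(?P<surname>[A-Z][\w'-]+)$"
)


def normalize_name(raw):
    """Return (people, is_couple) for one name phrase taken from a caption."""
    text = " ".join(raw.split())  # collapse runs of whitespace

    match = TITLED_COUPLE.match(text)
    if match:
        # Only one person is actually named, so record the couple
        # relationship instead of inventing a second name.
        return [match.group("name")], True

    match = SHARED_SURNAME.match(text)
    if match:
        surname = match.group("surname")
        return [f"{match.group('first1')} {surname}",
                f"{match.group('first2')} {surname}"], True

    # Single person: drop a leading honorific if present.
    return [re.sub(rf"^{TITLE}\s+", "", text)], False


print(normalize_name("Mr. and Mrs. Alexis Davis"))
print(normalize_name("Jane and John Smith"))
```

Returning the couple flag separately keeps the relationship available for the knowledge graph without guessing at a name the caption never states.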

caption sample snippet

Conclusion

Wrangling data comes with its share of challenges, but it also offers numerous solutions, especially when approached with an innovative and pragmatic mindset.

The workflow shared in this project lays a strong foundation for tackling data wrangling challenges, whether you are working with structured or unstructured data. It gives you tools and strategies for transforming messy, complex datasets into meaningful insights, and shows that even chaotic data can be handled.

Stay tuned for Project 3, where we’ll explore data mining as the next step in this series.

If these projects inspire you, consider following, engaging, or subscribing to receive updates on each new release. Let’s build together, handling the full spectrum of data challenges in data engineering!

References

[1] https://pro.arcgis.com/en/pro-app/latest/help/data/geodatabases/overview/feature-class-basics.htm

[2] https://party-captions.tditrain.com/
