This content originally appeared on DEV Community and was authored by pauline-banye
Introduction
This is the first task for the HNG11 internship Data Analytics track. The HNG internship is a fast-paced bootcamp for learning digital skills such as software development, data analytics, software testing, DevOps, and design, to name a few. It also provides an avenue to network, collaborate with other techies, and access exclusive jobs via the HNG premium network.
This task involved reviewing the retail sales dataset obtained from Kaggle, identifying initial insights from the data at first glance, and creating a technical report detailing my observations. The dataset contains information about individual orders (order number, quantity ordered, price), customer details (address, contact information), product information (product line, code, MSRP), and sales. The purpose of this review is to understand the dataset's structure, identify key variables, and derive initial insights from a preliminary exploration.
Observations
This review provides initial insights into the dataset's structure and content, which will guide further analysis.
- Data Shape: The dataset consists of 2823 rows and 25 columns.
- Key Variables: The features include:
- ORDERNUMBER: A unique identifier for each order.
- QUANTITYORDERED: Number of items ordered.
- PRICEEACH: Price of each item.
- ORDERLINENUMBER: Line number for each order.
- SALES: Total sales amount for each order.
- ORDERDATE: Date of the order.
- STATUS: Order status (e.g., shipped).
- QTR_ID: Quarter of the year.
- MONTH_ID: Month of the year.
- YEAR_ID: Year of the order.
- PRODUCTLINE: Category of the product.
- MSRP: Manufacturer's Suggested Retail Price.
- PRODUCTCODE: Product identifier.
- CUSTOMERNAME: Name of the customer.
- ADDRESSLINE1, ADDRESSLINE2: Customer address.
- CITY, STATE, POSTALCODE, COUNTRY: Location details of the customer.
- TERRITORY: Sales territory.
- CONTACTLASTNAME, CONTACTFIRSTNAME: Customer contact information.
- DEALSIZE: Size of the deal (e.g., small, medium, large).
- Data Types: 16 columns in the dataset have categorical datatypes (e.g., order status, product line, country), and 9 consist of numerical datatypes (e.g., order quantity, price, year).
- Missing Values: Four columns have missing values: ADDRESSLINE2 (2521), STATE (1486), TERRITORY (1074), and POSTALCODE (76).
- Dates: The 'ORDERDATE' column is present but appears to be stored as strings rather than as a date type. It will need to be cleaned and converted before any time-based analysis.
- Sales Figures: The average sale value is $3553.89, with a standard deviation of $1841.87, which indicates considerable variability in order sizes.
- Order Quantities: The average order quantity is 35 units.
- Order Details: Columns like 'QUANTITYORDERED', 'PRICEEACH', and 'SALES' suggest the data may allow for analysis of order quantities, pricing strategies, and revenue generation.
- Customer Information: The dataset includes customer details like name, address, and phone number. This data could be used to explore customer demographics.
- Order Status: This column indicates the order status. This data could be used to track order fulfillment efficiency.
- Deal Size: The 'DEALSIZE' column is categorical and might indicate the size or value of each deal.
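The checks behind these observations can be sketched in pandas. This is a minimal illustration rather than the exact code used for the review: `sample_df` is a hypothetical three-row stand-in for the real file, and the date format string is an assumption about how ORDERDATE is written in the CSV.

```python
import pandas as pd

# Hypothetical three-row sample that mimics a few columns of the dataset.
sample_df = pd.DataFrame(
    {
        "ORDERNUMBER": [10107, 10121, 10134],
        "QUANTITYORDERED": [30, 34, 41],
        "PRICEEACH": [95.70, 81.35, 94.74],
        "ORDERDATE": ["2/24/2003 0:00", "5/7/2003 0:00", "7/1/2003 0:00"],
        "STATUS": ["Shipped", "Shipped", "Shipped"],
    }
)

# Number of rows and columns, and the data type of each column.
print(sample_df.shape)
print(sample_df.dtypes)

# Split the columns into numeric and non-numeric (categorical/string).
numeric_cols = sample_df.select_dtypes(include="number").columns.tolist()
other_cols = sample_df.select_dtypes(exclude="number").columns.tolist()

# Convert the string dates to datetimes.
# The format string is an assumption about the file, not a confirmed fact.
sample_df["ORDERDATE"] = pd.to_datetime(
    sample_df["ORDERDATE"], format="%m/%d/%Y %H:%M"
)
print(sample_df["ORDERDATE"].dt.month)
```

On the full dataset the same calls would be run on the DataFrame loaded from the Kaggle CSV. Once ORDERDATE is a datetime column, the `.dt` accessor exposes the month, quarter, and year for time-based grouping.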
Data Exploration
The initial exploration of the dataset revealed several key points and trends that will guide further analysis.
- ORDERNUMBER: Ranges from 10100 to 10425, with a mean of 10258.73.
- QUANTITYORDERED: Ranges from 6 to 97, with a mean of 35.09.
- PRICEEACH: Ranges from 26.88 to 100, with a mean of 83.66.
- ORDERLINENUMBER: Ranges from 1 to 18, with a mean of 6.47.
- SALES: Ranges from 482.13 to 14082.80, with a mean of 3553.89.
- QTR_ID: Ranges from 1 to 4, with a mean of 2.72.
- MONTH_ID: Ranges from 1 to 12, with a mean of 7.09.
- YEAR_ID: Ranges from 2003 to 2005, with a mean of 2003.82.
- MSRP: Ranges from 33 to 214, with a mean of 100.72.
- Missing Values: Several columns contain missing values:
```python
# Identify the columns with missing data
null_count = sales_df.isnull().sum()

# Filter columns with missing values greater than 0
null_count[null_count > 0]

# Result
# ADDRESSLINE2    2521
# STATE           1486
# POSTALCODE        76
# TERRITORY       1074
```
- Sales Distribution: The sales amounts vary significantly, with a mean sales value of 3553.89 and a standard deviation of 1841.87. The maximum of 14082.80 lies more than five standard deviations above the mean, so some high-value orders could be potential outliers.
- Product Performance: The dataset contains multiple product lines. Comparing sales across them can show which categories perform better than others.
- Temporal Trends: Orders span from 2003 to 2005, covering different quarters and months. Seasonal trends or quarterly performance can be analyzed further.
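The summary statistics and the outlier remark above can be reproduced with a few pandas calls. The sketch below uses a hypothetical seven-value SALES column rather than the real data, and the 1.5 × IQR fence is one common convention for flagging outliers that was chosen here for illustration; the review itself does not specify a method.

```python
import pandas as pd

# Hypothetical SALES values; the real column has 2823 entries.
sample_df = pd.DataFrame(
    {"SALES": [1200.0, 2500.0, 3100.0, 3600.0, 4100.0, 5200.0, 14000.0]}
)

# Minimum, maximum, mean, and standard deviation in one call.
summary = sample_df["SALES"].agg(["min", "max", "mean", "std"])
print(summary)

# IQR rule: values above Q3 + 1.5 * IQR are flagged as potential outliers.
q1 = sample_df["SALES"].quantile(0.25)
q3 = sample_df["SALES"].quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)

outliers = sample_df[sample_df["SALES"] > upper_fence]
print(outliers)
```

Flagged rows are candidates for inspection, not automatic removal: a very large order may be a genuine sale rather than a data error.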
Conclusion
The initial review of the sample sales dataset has highlighted several key areas for further exploration. A detailed analysis should focus on sales performance by product line, time period, and geography. Addressing missing values and converting columns to appropriate data types will be essential for a more precise analysis. Continued investigation will yield deeper insights into the sales data, helping to identify significant trends and patterns.
Potential Areas for Further Analysis:
- Sales Performance: Examine sales figures over various periods (monthly, quarterly, and yearly) to understand how sales are distributed and how they change over time.
- Product Analysis: Investigate the relationships between product types, prices, and sales figures.
- Customer Segmentation: Identify customer segments based on order behavior (quantity, frequency) and demographics (location).
- Data Cleaning and Preprocessing: Resolve missing values, convert appropriate columns to numerical data types, and properly format the ORDERDATE column for time-series analysis.
- Geographical Insights: Utilize the customer location data to conduct a geographical analysis of sales performance.
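As a starting point for these follow-up analyses, the sketch below shows how the groupings might look in pandas. The rows in `sample_df` are made up for illustration, and filling missing STATE values with the placeholder "Unknown" is only one possible cleaning choice, not a recommendation drawn from the dataset itself.

```python
import pandas as pd

# Hypothetical rows; column names follow the dataset, values are made up.
sample_df = pd.DataFrame(
    {
        "PRODUCTLINE": ["Classic Cars", "Motorcycles", "Classic Cars", "Trucks and Buses"],
        "YEAR_ID": [2003, 2003, 2004, 2004],
        "QTR_ID": [1, 2, 1, 4],
        "COUNTRY": ["USA", "France", "USA", "Spain"],
        "STATE": ["NY", None, "CA", None],
        "SALES": [3000.0, 2000.0, 5000.0, 4000.0],
    }
)

# Data cleaning: replace missing states with a placeholder (an assumption).
sample_df["STATE"] = sample_df["STATE"].fillna("Unknown")

# Product analysis: total sales per product line, largest first.
sales_by_product = (
    sample_df.groupby("PRODUCTLINE")["SALES"].sum().sort_values(ascending=False)
)

# Sales performance over time: total sales per year and quarter.
sales_by_period = sample_df.groupby(["YEAR_ID", "QTR_ID"])["SALES"].sum()

# Geographical insights: total sales per country.
sales_by_country = sample_df.groupby("COUNTRY")["SALES"].sum()

print(sales_by_product)
print(sales_by_period)
print(sales_by_country)
```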
pauline-banye | Sciencx (2024-06-28T17:55:05+00:00) Overview of the Retail Sales Kaggle Dataset. Retrieved from https://www.scien.cx/2024/06/28/overview-of-the-retail-sales-kaggle-dataset/