Dataset Summary: Shopping Customer Data (n = 3,900)

| Variable Name | Type | Description | Range / Values |
|---|---|---|---|
| Customer ID | Integer | Unique ID for each shopper record | 1 to 3900 |
| Age | Numeric | Customer's age in years | 18 to 70 |
| Gender | Categorical | Gender of the customer | Male (68%), Female (32%) |
| Item Purchased | Categorical | Specific item bought | 25 unique items (e.g., Blouse, Pants) |
| Category | Categorical | Type of product purchased | Clothing (45%), Accessories (32%), Other |
| Purchase Amount (USD) | Numeric | Total money spent on shopping | $20 to $100 (Mean: $59.8) |
| Location | Categorical | Geographic area or city of shopper | 50 unique locations (e.g., Montana, California) |
| Size | Categorical | Size of item purchased | M (45%), L (27%), others |
| Color | Categorical | Color of item purchased | 25 unique colors (e.g., Olive, Yellow) |
| Season | Categorical | Season during which purchase occurred | Spring (26%), Fall (25%), others |

Scenario: Interning at a Retail Analytics Firm
You’ve joined Insight Retail Analytics, a firm that helps brands understand consumer behavior to improve marketing and product strategies. Your team has received a dataset containing 3,900 customer records, including demographics (age, gender, income), product categories, purchase amounts, and seasonal preferences.
Your supervisor has asked you to explore this dataset using R and ggplot2 to uncover patterns in consumer spending. Your goal is to produce visual insights that help retail managers understand customer behavior and tailor promotions.
Dataset description
The Shopping Behavior Dataset provides detailed records of 3,900 retail transactions, capturing key demographic and behavioral attributes of customers. It includes variables such as age, gender, income, product category, purchase amount, and seasonal preferences. This dataset is ideal for exploring consumer trends, visualizing spending patterns, and practicing data management and descriptive analytics using R. The Dataset Summary table lists each variable, its type, description, and typical values or ranges.
Your Mission: Complete the following tasks
- Load and Inspect the Dataset
  - Load the CSV file into R.
  - Use `str()`, `summary()`, and `head()` to inspect the structure and contents.
  - Identify the number of observations and key variables.
- Data Cleaning
  - Check for missing values and handle them appropriately.
  - Recode categorical variables (e.g., gender, season) as factors.
- Purchase Amount Distribution
  - Create a histogram of purchase amounts.
  - Add axis labels and a title.
  - Briefly describe the distribution.
- Spending by Gender
  - Create a boxplot comparing purchase amounts by gender.
  - What does the plot suggest about gender-based spending?
- Spending by Product Category
  - Create a bar chart showing average purchase amount per product category.
  - Use `group_by()` and `summarise()` to prepare the data.
- Seasonal Spending Patterns
  - Create a bar chart showing total purchase amount by season.
  - Customize colors and labels.
- Age Group Analysis
  - Create age groups (e.g., 18–25, 26–35, etc.) using `cut()`.
  - Compare average purchase amounts across age groups using a bar chart.
- Annotated Insight
  - Choose one plot and add annotations using `geom_text()` or `geom_label()`.
  - Highlight a key insight and explain its relevance for retail strategy.
- Save Your Visuals
  - Save at least three plots as PNG files using `ggsave()`.
  - Include filenames and dimensions in your code.
- Reflection
  - Write a short reflection (150–200 words) on how data visualization can support retail decision-making and customer segmentation.
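As a rough orientation for the tasks above, the following sketch shows the general shape such a workflow might take in R. It is not the official solution. It runs on a small synthetic data frame rather than the midterm CSV, and the column names (`purchase_amount`, `age_group`, and so on), the age-group break points, the colors, and the output file names are all assumptions chosen for illustration. The real file will need `read.csv()` and its own column names.

```r
library(dplyr)
library(ggplot2)

# Synthetic stand-in for the midterm data (assumed column names, not the real file).
# With the real data, this step would be a read.csv() call on the provided CSV.
set.seed(42)
n <- 200
shopping <- data.frame(
  customer_id     = seq_len(n),
  age             = sample(18:70, n, replace = TRUE),
  gender          = sample(c("Male", "Female"), n, replace = TRUE),
  category        = sample(c("Clothing", "Accessories", "Other"), n, replace = TRUE),
  purchase_amount = round(runif(n, min = 20, max = 100)),
  season          = sample(c("Spring", "Summer", "Fall", "Winter"), n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Inspect structure and contents
str(shopping)
summary(shopping)
head(shopping)

# Data cleaning: count missing values per column, then recode categoricals as factors
missing_per_column <- colSums(is.na(shopping))
shopping <- shopping %>%
  mutate(
    gender   = factor(gender),
    category = factor(category),
    season   = factor(season, levels = c("Spring", "Summer", "Fall", "Winter"))
  )

# Age groups with cut(); intervals are right-closed, so 17 is the lower bound for age 18
shopping <- shopping %>%
  mutate(age_group = cut(
    age,
    breaks = c(17, 25, 35, 45, 55, 70),
    labels = c("18-25", "26-35", "36-45", "46-55", "56-70")
  ))

# Summaries prepared with group_by() and summarise()
avg_by_age <- shopping %>%
  group_by(age_group) %>%
  summarise(avg_purchase = mean(purchase_amount), .groups = "drop")

total_by_season <- shopping %>%
  group_by(season) %>%
  summarise(total_purchase = sum(purchase_amount), .groups = "drop")

# Histogram of purchase amounts with labels and a title
p_hist <- ggplot(shopping, aes(x = purchase_amount)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "white") +
  labs(title = "Distribution of Purchase Amounts",
       x = "Purchase amount (USD)", y = "Number of purchases")

# Boxplot of purchase amounts by gender
p_box <- ggplot(shopping, aes(x = gender, y = purchase_amount)) +
  geom_boxplot() +
  labs(title = "Purchase Amount by Gender",
       x = "Gender", y = "Purchase amount (USD)")

# Bar chart of average purchase by age group, annotated with geom_text()
p_age <- ggplot(avg_by_age, aes(x = age_group, y = avg_purchase)) +
  geom_col(fill = "darkorange") +
  geom_text(aes(label = round(avg_purchase, 1)), vjust = -0.5) +
  labs(title = "Average Purchase Amount by Age Group",
       x = "Age group", y = "Average purchase (USD)")

# Save three plots as PNG files with explicit dimensions (in inches)
ggsave("purchase_hist.png", plot = p_hist, width = 7, height = 5, dpi = 300)
ggsave("purchase_by_gender.png", plot = p_box, width = 7, height = 5, dpi = 300)
ggsave("purchase_by_age_group.png", plot = p_age, width = 7, height = 5, dpi = 300)
```

The category and season bar charts follow the same `geom_col()` pattern as the age-group chart, using a summary table such as `total_by_season`. Setting explicit factor levels for `season` keeps the bars in calendar order instead of alphabetical order.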
Midterm files and submission:
Access the midterm R script and dataset from this link: Midterm Project Files
Submit your midterm exam answer sheet (R script with code, plots, and explanations) via Google Form: Midterm Submission Form
Deadline: November 5, 2025, 11:59 PM PST