Skip to main content

Command Palette

Search for a command to run...

8 Week SQL Challenge - Case Study 5 (Data Cleansing Steps)

Updated
2 min read
M
Mary has been writing about Data Science on her blog (http://maryjonah.me) for years. In her free time, she likes to read books and have a good rest.

Welcome to Case Study #5 - Data Mart and the section 1. Data Cleansing Steps

In a single query, perform the following operations and generate a new table in the data_mart schema named clean_weekly_sales:

  • Convert the week_date to a DATE format

  • Add a week_number as the second column for each week_date value, for example any value from the 1st of January to 7th of January will be 1, 8th to 14th will be 2 etc

  • Add a month_number with the calendar month for each week_date value as the 3rd column

  • Add a calendar_year column as the 4th column containing either 2018, 2019 or 2020 values

  • Add a new column called age_band after the original segment column using the following mapping on the number inside the segment value

segment

age_band

1

Young Adults

2

Middle Aged

3 or 4

Retirees

  • Add a new demographic column using the following mapping for the first letter in the segment values:

segment

demographic

C

Couples

F

Families

  • Ensure all null string values with an "unknown" string value in the original segment column as well as the new age_band and demographic columns

  • Generate a new avg_transaction column as the sales value divided by transactions rounded to 2 decimal places for each record


CREATE TABLE clean_weekly_sales AS 
SELECT
	STR_TO_DATE(week_date, '%d/%m/%y') AS week_date,
    CEIL(DAY(STR_TO_DATE(week_date, '%d/%m/%y')) / 7) AS week_number,
    MONTH(STR_TO_DATE(week_date, '%d/%m/%y')) AS month_number,
    YEAR(STR_TO_DATE(week_date, '%d/%m/%y')) AS calendar_year,
    CASE 
		WHEN RIGHT(segment, 1) = '1' THEN 'Young Adults' 
        WHEN RIGHT(segment, 1) = '2' THEN 'Middle Aged'
        WHEN RIGHT(segment, 1) = '3' THEN 'Retirees'
        ELSE 'unknown'
    END AS age_band,
    CASE 
		WHEN UPPER(LEFT(segment, '1')) = 'C' THEN 'Couples'
        WHEN UPPER(LEFT(segment, '1')) = 'F' THEN 'Families'
        ELSE 'unknown'
    END AS demographic,
    ROUND(CAST(sales AS UNSIGNED) / CAST(transactions AS UNSIGNED), 2) AS avg_transaction
FROM weekly_sales;