~/dashboards/rfm_segmentation.html

1,062,989 rows · 4.2 MB live

DATA ANALYSIS · DA PROJECT 1 · RFM

Unlocking Customer Value Through
RFM Segmentation

Online retail businesses often fail to distinguish loyal customers from nearly inactive ones. This dashboard uses RFM Analysis on 1,062,989 real transactions to identify high-value customer segments and surface actionable retention strategies.

Dataset UCI Online Retail

Rows 805,549

Customers 5,878

Period 2009 – 2011

unique customers

5,878

after cleaning & dedup

RFM segments

R · F · M score combinations

top segment

Seg 01

Current Loyal High Spending

score range

1–5 per R·F·M

percentile-based quintiles
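
The percentile-based 1–5 scoring can be sketched as follows. This is a minimal illustration using only the standard library; the helper name `quintile_scores` and the toy values are assumptions, not the dashboard's actual code (in pandas, `pandas.qcut` is the usual tool for this). Recency is inverted so that more recent buyers score higher.

```python
def quintile_scores(values, higher_is_better=True):
    """Assign each value a 1-5 score from its percentile rank (hypothetical helper)."""
    n = len(values)
    # Ties are broken by input position, which is a simplification.
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0] * n
    for rank, idx in enumerate(order):
        band = int(rank * 5 / n) + 1          # five equal-sized percentile bands
        scores[idx] = band if higher_is_better else 6 - band
    return scores

# Toy data (illustrative only): days since last purchase, invoices, total spend
recency = [5, 40, 200, 380, 12, 90, 300, 25, 150, 60]
frequency = [12, 4, 1, 1, 9, 3, 2, 7, 2, 5]
monetary = [5600, 900, 150, 80, 3100, 700, 300, 2000, 400, 1200]

r = quintile_scores(recency, higher_is_better=False)  # recent buyers score 5
f = quintile_scores(frequency)
m = quintile_scores(monetary)
rfm = [f"{a}{b}{c}" for a, b, c in zip(r, f, m)]
print(rfm)
```

Each customer ends up with a three-digit code such as 555 (recent, frequent, high spend), which is what the segment labels are then built on.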

Data Cleaning Process

Step	Result
Check dataset shape	1,062,989 rows × 10 columns on load
Check missing values	238,625 empty strings found in customer_id — rows removed
Check duplicate rows	26,124 duplicate rows found and removed
Convert data types	invoicedate converted to datetime format; customer_id to string
Remove invalid values	Rows with negative/zero quantity and price removed (20,261 negative qty; 1,820 zero price)
Remove cancelled transactions	Cancelled orders excluded using is_cancelled boolean flag
Remove extreme outliers	3 extreme outlier customers removed (total_price > 25,000) — confirmed as anomalies
Final dataset shape	805,546 rows × 10 columns — ready for RFM analysis
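
The filtering steps in the table can be sketched in plain Python. This is a simplified illustration, not the project's notebook: the helper `clean_transactions`, the `quantity` and `price` field names, and the sample rows are assumptions, and the outlier step is left out.

```python
def clean_transactions(rows):
    """Apply the table's row filters in order to a list of dict rows (hypothetical helper)."""
    seen = set()
    cleaned = []
    for row in rows:
        if row["customer_id"].strip() == "":            # empty-string customer_id
            continue
        key = tuple(sorted(row.items()))
        if key in seen:                                  # exact duplicate row
            continue
        seen.add(key)
        if row["quantity"] <= 0 or row["price"] <= 0:    # negative/zero quantity or price
            continue
        if row["is_cancelled"]:                          # cancelled orders
            continue
        cleaned.append(row)
    return cleaned

rows = [
    {"customer_id": "13085", "quantity": 12, "price": 6.95, "is_cancelled": False},
    {"customer_id": "13085", "quantity": 12, "price": 6.95, "is_cancelled": False},  # duplicate
    {"customer_id": "", "quantity": 1, "price": 2.10, "is_cancelled": False},        # no customer
    {"customer_id": "13078", "quantity": -2, "price": 4.25, "is_cancelled": False},  # negative qty
    {"customer_id": "15362", "quantity": 1, "price": 0.0, "is_cancelled": False},    # zero price
    {"customer_id": "18102", "quantity": 3, "price": 2.10, "is_cancelled": True},    # cancelled
    {"customer_id": "12682", "quantity": 5, "price": 1.25, "is_cancelled": False},
]
cleaned = clean_transactions(rows)
print(len(rows), "->", len(cleaned), "rows")
```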

Customer Count by Segment n = 5,878

Segment Distribution % of customers

Actionable Strategies per Segment

Presentation

AnthonyDjiadyDjie_DS39+_Final_Project_Analysis

~/dashboards/inventory_analysis.html

485,875 rows · 6.8 MB live

DATA ANALYSIS · DA PROJECT 2 · INVENTORY

Store Performance, Inventory Management
and Profitability

Out of 485,875 inventory items, only 36.47% were sold. This project analyses the Looker E-commerce dataset — uncovering top categories, most profitable brands, and slow-moving products.

Dataset Looker E-commerce

Rows 485,875

Sold Rate 36.47%

Period 2019 – 2024

TOTAL ITEMS

485K

after cleaning

SOLD RATE

36.47%

items actually sold

REALIZED REVENUE

$10.5M

from sold inventory

REALIZED PROFIT

$5.5M

net after cost

AVG SELL TIME

29.4d

days to sell

Data Cleaning Process

Step	Result
Check dataset shape	490,705 rows × 12 columns before cleaning
Convert datetime columns	created_at and sold_at converted to datetime format
Check duplicate rows	Total duplicated rows: 0 — no duplicates found
Check missing values	Missing values in sold_at represent unsold inventory — intentionally kept as NaN
Validate numerical columns	cost: min $0.008, max $557 — no zeros or negatives; retail_price: min $0.02, max $999 — valid
Drop invalid records	Dropped rows without created_at and rows with missing brand/product name
Final dataset shape	485,875 rows × 12 columns — ready for analysis
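
Because unsold items keep a missing sold_at, the sold rate and average sell time both fall out of the same column. A minimal sketch, assuming each item is a dict with `created_at` and `sold_at` (with `None` standing in for NaN); the helper name and sample dates are hypothetical.

```python
from datetime import datetime

def sold_rate_and_sell_time(items):
    """Return (share of items sold, mean days from created_at to sold_at)."""
    sold = [it for it in items if it["sold_at"] is not None]
    rate = len(sold) / len(items)
    days = [(it["sold_at"] - it["created_at"]).total_seconds() / 86400 for it in sold]
    avg_days = sum(days) / len(days) if days else float("nan")
    return rate, avg_days

items = [
    {"created_at": datetime(2023, 1, 1), "sold_at": datetime(2023, 1, 31)},
    {"created_at": datetime(2023, 2, 1), "sold_at": None},   # unsold, intentionally kept
    {"created_at": datetime(2023, 3, 1), "sold_at": datetime(2023, 3, 11)},
    {"created_at": datetime(2023, 3, 5), "sold_at": None},
]
rate, avg_days = sold_rate_and_sell_time(items)
print(f"{rate:.2%} sold, {avg_days:.1f} days to sell")
```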

VIEW BY

Inventory Sold Rate 485K items

36.47%

SOLD

Revenue by Category click to filter

ABC Inventory Classification click a tier to filter all charts

A — 20%

B — 30%

C — 50%
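
One way to produce a 20 / 30 / 50 split is to rank products by revenue and cut the ranking by share of items. The page does not state the exact rule, so the cut-by-item-share logic, the helper `abc_tiers`, and the sample revenues below are all assumptions.

```python
def abc_tiers(revenue_by_product, a_share=0.20, b_share=0.30):
    """Label products A/B/C by revenue rank: top 20% A, next 30% B, rest C (assumed rule)."""
    ranked = sorted(revenue_by_product, key=revenue_by_product.get, reverse=True)
    n = len(ranked)
    a_cut = round(n * a_share)
    b_cut = round(n * (a_share + b_share))
    return {
        product: "A" if i < a_cut else "B" if i < b_cut else "C"
        for i, product in enumerate(ranked)
    }

# Ten hypothetical products with revenue 1000, 900, ..., 100
revenue = {f"product_{i}": 1000 - 100 * i for i in range(10)}
tiers = abc_tiers(revenue)
counts = {t: sum(1 for v in tiers.values() if v == t) for t in "ABC"}
print(counts)
```

A classic ABC analysis cuts on cumulative revenue share instead; swapping the rank-based cuts for cumulative sums would give that variant.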

Actionable Strategies

Presentation

AnthonyDjiadyDjie_DS39+_Final_Project_Analysis_DA2

DA Project 03 — Hotel Booking Demand Analysis

Hotel Booking Demand

Cancellation drivers, pricing patterns, and demand seasonality across 83,293 hotel bookings — Dibimbing DS39+

Dataset 83,293 bookings

Columns 33

Class Dibimbing DS39+

Tool Python · Pandas · Matplotlib

Total Bookings

83,293

83K unique booking records

Cancellation Rate

37.2%

30,776 bookings canceled

Realized Revenue

$18.1M

After cancellations

Lost Revenue

$11.6M

~39% of gross potential

Avg Daily Rate

$101–103

Mean ADR, symmetric dist.

Data Cleaning Process

Step	Result
Check dataset shape	83,293 rows × 33 columns on load
Check missing values	company (94% missing), agent (14%), country (<0.5%), children (3 rows) — company & agent filled with 0 (absence is meaningful)
Check duplicates	No duplicate rows or duplicate bookingIDs found
Convert data types	reservation_status_date → datetime; arrival_date_month ordered for time-series
Remove invalid rows	Removed bookings with zero guests or zero nights (data-entry errors)
Cap ADR outliers	ADR capped at 99th percentile — final cleaned dataset: 82,721 rows, ready for analysis
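
Capping at a percentile keeps every booking but clips the extreme rates. The sketch below uses the nearest-rank percentile definition, which is an assumption (pandas' `quantile` interpolates linearly by default, so its cap can differ slightly); the helper name and sample values are hypothetical.

```python
import math

def cap_at_percentile(values, pct=99):
    """Clip values above the pct-th percentile (nearest-rank method)."""
    ranked = sorted(values)
    k = max(1, math.ceil(pct * len(ranked) / 100))   # 1-based nearest rank
    cap = ranked[k - 1]
    return [min(v, cap) for v in values], cap

# 99 ordinary rates plus one extreme value (illustrative only)
adr = [float(v) for v in range(1, 100)] + [5400.0]
capped, cap = cap_at_percentile(adr, pct=99)
print(cap, max(capped), len(capped))
```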

Cancellation Rate by Lead Time Key Driver

Cancellation Rate by Deposit Type Policy Signal

Monthly Booking Volume Seasonality

Market Segment — Volume & Cancel Rate Channel Mix

ADR by Room Type Pricing

Repeat vs New Guest Behavior LTV Signal

Key Insights

Strategic Recommendations

Presentation

AnthonyDjiadyDjie_DS39+_HotelBookingDemand_Analysis

DATA SCIENCE · DS PROJECT 1 · K-MEANS CLUSTERING

Discovering Hidden Customer Segments
with K-Means Clustering

Online retail businesses often struggle to understand customer differences. This project applies RFM feature engineering and K-Means Clustering to segment 5,878 customers into 4 actionable groups — each with a tailored business strategy.

Dataset UCI Online Retail

Rows 1,062,989

Customers 5,878

Clusters k = 4

chosen k

4 clusters

Elbow + Silhouette reviewed

best silhouette

0.32 at k=2

k=4 chosen for business fit

features used

5 features

R · F · M · AOV · Avg Qty

PCA variance

84.8%

PC1 59.1% · PC2 25.7%

Data Cleaning Process

Step	Result
Check dataset shape	1,062,989 rows × 10 columns
Drop missing customer_id	Null customer_id dropped — 824,364 rows remain
Convert data types	customer_id → string; invoicedate → datetime
Check empty strings	All categorical columns — no empty strings found
Remove negatives & zeros	18,744 negative quantity rows + 71 zero-price rows removed
Check duplicates	26,124 duplicates found — kept (same invoice, different items)
Outlier detection (IQR)	Quantity 6.45%, Price 8.36%, Total Price 8.24% — kept as high-value signals
Final dataset shape	805,549 rows × 10 columns — ready for RFM feature engineering

RFM + Extended Features

Features Used 5 features

Recency

Days since last purchase

Frequency

Number of unique invoices

Monetary

Total spending per customer

AOV

Average Order Value (Monetary / Freq)

Avg Qty per Invoice

Total quantity / frequency

Log1p transformation applied to reduce skewness
StandardScaler used before K-Means fitting
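
The feature preparation above can be sketched without external libraries. This is an illustration under assumptions: the customer values are made up, `log1p_standardize` is a hypothetical helper, and the standardization mirrors StandardScaler's behavior of dividing by the population standard deviation.

```python
import math
import statistics

def log1p_standardize(column):
    """log1p to reduce skew, then scale to zero mean and unit variance."""
    logged = [math.log1p(v) for v in column]
    mean = statistics.fmean(logged)
    std = statistics.pstdev(logged)       # population std, as StandardScaler uses
    return [(v - mean) / std for v in logged]

# Toy customers (illustrative values only)
customers = [
    {"recency": 18, "frequency": 12, "monetary": 5400.0, "total_qty": 3200},
    {"recency": 31, "frequency": 5, "monetary": 1150.0, "total_qty": 630},
    {"recency": 299, "frequency": 2, "monetary": 800.0, "total_qty": 480},
    {"recency": 383, "frequency": 1, "monetary": 190.0, "total_qty": 68},
]
for c in customers:
    c["aov"] = c["monetary"] / c["frequency"]        # Monetary / Freq
    c["avg_qty"] = c["total_qty"] / c["frequency"]   # Total quantity / frequency

features = ["recency", "frequency", "monetary", "aov", "avg_qty"]
scaled = {f: log1p_standardize([c[f] for c in customers]) for f in features}
print({f: [round(v, 2) for v in col] for f, col in scaled.items()})
```

Scaling matters here because K-Means relies on Euclidean distance, so an unscaled Monetary column would otherwise dominate the other four features.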

Elbow Method

Inertia vs K k=4 chosen

Auto-detected elbow: k=5 · Chosen: k=4 for business fit

Silhouette Analysis

Score vs K k=4 chosen

k=2 peaks (0.3204) but too broad · k=4 chosen: 0.2560
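
For readers unfamiliar with the metric, the silhouette score compares each point's mean distance to its own cluster (a) with its mean distance to the nearest other cluster (b). The sketch below is a from-scratch illustration on toy points, not the project's code; in scikit-learn the equivalent is `sklearn.metrics.silhouette_score`.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette over all points; assumes at least two non-empty clusters."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.get(labels[i])
        if not own:                      # singleton cluster: silhouette defined as 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)          # mean distance within own cluster
        b = min(sum(d) / len(d)          # mean distance to nearest other cluster
                for lab, d in by_cluster.items() if lab != labels[i])
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])   # tight, well-separated clusters
bad = silhouette_score(points, [0, 1, 0, 1])    # clusters that cut across the gap
print(round(good, 3), round(bad, 3))
```

Scores near 1 mean compact, well-separated clusters; values around 0.26 to 0.32, as reported here, indicate overlapping segments, which is common for real customer data.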

PCA Cluster Visualization

2D PCA Projection — Hover over clusters to explore 84.8% variance explained

At Risk Customers (n=1,763)

Bulk Buyers (n=1,142)

Regular Customers (n=1,552)

Dormant Customers (n=1,421)

Cluster Explorer

Select a Cluster — See Profile & Strategy k=4 segments

Cluster	Recency (median days since last order)	Frequency (median unique invoices)	Monetary (median lifetime spend)	AOV (avg order value)	Avg Qty / Inv (avg quantity per invoice)	Customers (share of customer base)	Tag
Bulk Buyers	18 days	12.5 orders	£5,599	£447.58	267.32	1,142 (19.4%)	🏆 Highest Value
Regular Customers	31 days	5 orders	£1,131	£231.23	126.40	1,552 (26.4%)	📈 Growth Potential
At Risk Customers	299 days	2 orders	£798	£393.26	239.00	1,763 (30.0%)	⚠️ Reactivate
Dormant Customers	383 days	1 order	£189	£137.37	68.00	1,421 (24.2%)	💤 Low Activity

Strategy & Actions: Bulk Buyers

→VIP loyalty programs & membership tiers
→Early access to new products & restocks
→Personalized marketing based on purchase history
→Premium product promotions & bundles
→Exclusive events & newsletters

Strategy & Actions: Regular Customers

→Upselling: promote higher-value products
→Cross-selling complementary items
→Product bundling to increase basket size
→Loyalty incentives to drive repeat visits
→Personalized recommendations engine

Strategy & Actions: At Risk Customers

→Reactivation campaigns with "We miss you" messaging
→Limited-time discounts to create urgency
→Seasonal or event-based promotions
→Digital retargeting on social media
→Loyalty incentive win-back programs

Strategy & Actions: Dormant Customers

→Welcome-back promos & onboarding nudges
→First repeat purchase incentives
→Follow-up emails with product highlights
→Targeted promotions to encourage engagement
→Retargeting campaigns via digital ads

Revenue Contribution per Cluster % of total

At Risk Customers

29.4%

Regular Customers

27.0%

Bulk Buyers

25.0%

Dormant Customers

18.5%

Customer Distribution % of base

At Risk Customers

30.0%

Regular Customers

26.4%

Dormant Customers

24.2%

Bulk Buyers

19.4%
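
A quick way to compare segments is to divide each cluster's revenue share by its customer share; a ratio above 1 means the cluster earns more than its headcount alone would suggest. The percentages are taken from the two charts above, while the ratio itself is an illustrative metric rather than one shown on the dashboard.

```python
clusters = {
    "At Risk Customers": {"customers_pct": 30.0, "revenue_pct": 29.4},
    "Regular Customers": {"customers_pct": 26.4, "revenue_pct": 27.0},
    "Dormant Customers": {"customers_pct": 24.2, "revenue_pct": 18.5},
    "Bulk Buyers": {"customers_pct": 19.4, "revenue_pct": 25.0},
}

# Revenue share per unit of customer share
index = {
    name: round(v["revenue_pct"] / v["customers_pct"], 2)
    for name, v in clusters.items()
}
for name, ratio in sorted(index.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {ratio}")
```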

Business Recommendations

Retain Bulk Buyers — Protect Your Core Revenue

With only 19.4% of customers but 25% of revenue, Bulk Buyers are your most efficient segment. VIP loyalty programs, early product access, and exclusive communications will protect retention.

Grow Regular Customers Into Higher-Value Buyers

Regular Customers (26.4%) contribute 27% of revenue with solid frequency. Upselling, cross-selling, and product bundling can push them toward the Bulk Buyer tier over time.

Reactivate At Risk Customers Before They Churn

At Risk Customers have a median recency of ~299 days. Time-limited "we miss you" campaigns, discounts, and digital retargeting are the priority to win them back before they're lost.

Nurture Dormant Customers With Low-Friction Entry

Dormant Customers (24.2%) have a median of just 1 purchase. Welcome-back promos and first repeat purchase incentives are the right low-cost tools to gradually re-engage this group.

DATA SCIENCE · DS PROJECT 2 · REGRESSION

Predicting Food Delivery Time
with Machine Learning

Delivery platforms often give inaccurate ETAs. This project builds an end-to-end ML pipeline to predict delivery time — factoring in distance, weather, traffic, and courier experience — achieving R²=0.82.

Dataset Food Delivery Times

Rows 1,000 orders

Best Model Linear Regression

R² Score 0.82

best R² score

0.82

Linear Regression wins

MAE

±6.08 min

mean absolute error

models tested

LR · RF · Gradient Boosting

features

original input features

Data Cleaning Process

Step	Result
Check dataset shape	1,000 rows × 9 columns
Check missing values	30 missing values each in Weather, Traffic_Level, Time_of_Day, Courier_Experience_yrs — imputed with median/mode
Check duplicate rows	No duplicate rows found
Feature engineering	Created Is_Rush_Hour (Morning/Evening = 1) and Distance × Prep_Time interaction feature
Encode categoricals	Weather, Traffic_Level, Time_of_Day, Vehicle_Type — encoded via OneHotEncoder
Scale numericals	Distance_km, Preparation_Time_min, Courier_Experience_yrs, Is_Rush_Hour, Distance_x_Prep_Time scaled with StandardScaler
Train-test split	800 train / 200 test (80/20 split, random_state=42)
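
The two engineered features and the 80/20 split can be sketched as below. Column names follow the table; the helper names, the synthetic orders, and the "Afternoon"/"Night" category labels are assumptions. The shuffle is a stand-in and will not reproduce scikit-learn's `train_test_split` ordering for random_state=42.

```python
import random

def add_features(order):
    """Add Is_Rush_Hour and the Distance x Prep_Time interaction to one order."""
    out = dict(order)
    out["Is_Rush_Hour"] = 1 if order["Time_of_Day"] in ("Morning", "Evening") else 0
    out["Distance_x_Prep_Time"] = order["Distance_km"] * order["Preparation_Time_min"]
    return out

def split_rows(rows, test_size=0.2, seed=42):
    """Shuffle with a fixed seed, then hold out test_size of the rows."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

# 1,000 synthetic orders (illustrative only)
times = ["Morning", "Afternoon", "Evening", "Night"]
orders = [
    {"Distance_km": 1 + i % 20, "Preparation_Time_min": 5 + i % 25, "Time_of_Day": times[i % 4]}
    for i in range(1000)
]
featured = [add_features(o) for o in orders]
train, test = split_rows(featured)
print(len(train), len(test))
```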

Model Comparison — R², MAE, RMSE test set

R²  higher is better

MAE  lower is better

RMSE  lower is better
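
For reference, the three metrics can be computed directly from predictions. The delivery times below are made-up values for illustration, not the project's test set.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return R², MAE and RMSE for paired actual and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                   # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # total variation
    return {"r2": 1 - ss_res / ss_tot, "mae": mae, "rmse": rmse}

actual_minutes = [30, 45, 60, 50]
predicted_minutes = [35, 40, 55, 55]
metrics = regression_metrics(actual_minutes, predicted_minutes)
print({k: round(v, 3) for k, v in metrics.items()})
```

MAE reads directly in minutes, which is why it is the easiest of the three to explain to an operations team; RMSE penalizes large misses more heavily.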

Top Feature Importance linear coeff.

Interactive Delivery Time Predictor

Adjust Parameters — See Predicted ETA model simulation

Distance (km) 5

Prep Time (min) 15

Courier Experience (yrs) 3

Courier Rating 4.5

Weather Condition

Traffic Density

Rush Hour (adds ~8 min)

PREDICTED ETA

minutes

Normal delivery

Business Recommendations

Presentation

AnthonyDjiadyDjie_DS39+_Take_Home_Project_DS

DATA SCIENCE · DS PROJECT 3 · CLASSIFICATION

Predicting Customer Churn
with Machine Learning Classification

Subscription businesses lose revenue when customers leave without warning. This project trains 4 classification models on 64,374 customer records to predict churn — identifying at-risk customers early so retention actions can be taken before it's too late.

Dataset Customer Churn Dataset

Rows 64,374

Best Model SVM RBF

Recall 84.9%

dataset size

64,374 rows

12 columns · no nulls · no duplicates

churn rate

47.4%

Near-balanced · class_weight applied

best recall

84.9%

SVM RBF · smallest train/test gap

net profit

$37,416

From 12,875 test customers

Data Cleaning & Inspection

Step	Result
Check dataset shape	64,374 rows × 12 columns
Drop identifier column	CustomerID dropped — not predictive, would add noise
Check null values	No null values found across all 12 columns
Check duplicates	No duplicate rows found
Class balance check	52.6% No Churn / 47.4% Churn — near-balanced, class_weight='balanced' applied
Final dataset shape	64,374 rows × 11 columns (10 features + churn target), ready for feature engineering

Feature List 10 features

Support Calls

Strongest churn signal — high calls = at-risk customer

Payment Delay

Days late on payment — financial disengagement signal

Contract Length

Monthly contracts churn far more than Annual

Usage Frequency

Low usage = low perceived value = higher churn risk

Tenure

Shorter tenure = less loyalty = more likely to leave

Age · Total Spend · Last Interaction

Supporting features with moderate predictive impact

Gender · Subscription Type

Categorical — encoded via One-Hot Encoding (drop_first=True)

OHE applied to: Gender, Subscription Type, Contract Length
StandardScaler applied to all numerical features (not used for the Decision Tree model)
Train/Test split: 80/20 · random_state=42
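
With drop_first=True, one category per column becomes the implicit baseline, which avoids a redundant column. A minimal stand-alone sketch of that behavior follows; the helper name is hypothetical, and dropping the first category in sorted order mirrors what `pandas.get_dummies` does for string columns.

```python
def one_hot_drop_first(values, prefix):
    """Encode a categorical column as 0/1 columns, dropping the first sorted category."""
    categories = sorted(set(values))
    kept = categories[1:]            # the first category becomes the all-zeros baseline
    return [{f"{prefix}_{c}": int(v == c) for c in kept} for v in values]

contract = ["Annual", "Monthly", "Quarterly", "Monthly"]
encoded = one_hot_drop_first(contract, prefix="Contract Length")
for original, row in zip(contract, encoded):
    print(original, row)
```

Here "Annual" is the baseline, so an all-zeros row means an annual contract.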

EDA Key Findings

Support calls is the top churn predictor

Customers with 7+ support calls show dramatically higher churn rates — high call volume signals frustration and unresolved product issues.

Monthly contracts are highest-risk

Monthly subscribers churn far more than quarterly or annual customers — low-commitment contracts make cancellation easy and frictionless.

Payment delay is an early warning signal

Customers with 20+ day payment delays are significantly more likely to churn — late payments frequently precede cancellation behavior.

Model Recall Comparison priority metric

SVM RBF selected

84.9%

Logistic Regression

84.4%

KNN

93.6%

Decision Tree

99.7%

Recall prioritized — missing a churner (FN) costs $10 in lost revenue vs $4 marketing cost for a false alarm (FP).
Decision Tree 99.7% recall but train accuracy = 100.0% — severe overfitting.
SVM RBF selected: train accuracy 82.4% / test accuracy 83.0%, the smallest gap and best generalization.

SVM RBF — Confusion Matrix 12,875 test rows

5,527 True Negative

Correctly predicted No Churn

1,266 False Positive

Flagged churn — actually stayed

917 False Negative

Missed churner — highest cost

5,165 True Positive

Correctly predicted Churn

Revenue retained (TP × $10) +$51,650

Marketing waste (FP × $4) -$5,064

Missed churn loss (FN × $10) -$9,170

Net Profit +$37,416
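
The headline metrics and the cost breakdown can be recomputed from the four confusion-matrix cells. The sketch assumes the unit values stated above ($10 per retained or missed churner, $4 per false alarm) and simply nets the three line items; variable names are illustrative.

```python
# Confusion matrix cells from the SVM RBF test set
tn, fp, fn, tp = 5527, 1266, 917, 5165

recall = tp / (tp + fn)                       # share of churners caught
precision = tp / (tp + fp)                    # share of flags that were right
accuracy = (tp + tn) / (tn + fp + fn + tp)

retained = tp * 10    # revenue retained from correctly flagged churners
waste = fp * 4        # marketing spent on customers who would have stayed
missed = fn * 10      # revenue lost to churners the model missed
net = retained - waste - missed

print(f"recall={recall:.1%} precision={precision:.1%} accuracy={accuracy:.1%}")
print(f"retained={retained} waste={waste} missed={missed} net={net}")
```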

Live Churn Risk Predictor SVM-based scoring

Support Calls 5

Payment Delay (days) 15

Usage Frequency 15

Tenure (months) 30

Age 40

Last Interaction (days ago) 15

Contract Length

Subscription Type

Churn Probability

42%

Medium Risk

Risk Factors Detected

DS Project 04

Credit Risk / Loan Default

Predicting whether a Home Credit loan applicant will default within 2 years — trained on 307,511 real borrowers. Directly relevant to tax & accounting domain expertise.

XGBoost · Logistic Regression · ROC-AUC · SHAP · Scikit-Learn · Pandas · Class Imbalance · Binary Classification

Dataset: Home Credit (Kaggle)

Rows: 307,511

Features used: 18

XGB AUC: 0.7312

Default rate: 8.07%

Full Dataset

307K rows

Home Credit Default Risk

XGBoost AUC

0.731 ROC

+16% over baseline LR

Default Rate

8.07%

Class imbalance handled

Top Predictor

EXT_SOURCE

External credit score signals

ML Pipeline

📦 Data Loading
🧹 Clean & Impute
⚖️ Class Balance
🤖 LR → XGB Train
🔍 SHAP Explain

ROC Curves — LR vs XGBoost AUC comparison

Model Comparison

Logistic Regression

AUC: 0.6287
F1: 0.1873
Role: Interpretable baseline
Strength: Explainable coefficients

XGBoost WINNER

AUC: 0.7312
F1: 0.2562
Role: Production model
Strength: Captures non-linearity

AUC vs random baseline

Logistic Regression: 0.6287 on a scale from 0.5 (random) to 1.0 (perfect)

XGBoost: 0.7312 on a scale from 0.5 (random) to 1.0 (perfect)
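
ROC-AUC can be read as the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter, which is why 0.5 corresponds to random ranking. The sketch below computes it by direct pair counting on toy scores (fine for small samples, too slow for 307K rows) and then derives the relative lift between the two reported AUCs.

```python
def roc_auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and scores (illustrative only)
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)

# Relative lift between the two AUCs reported above
lr_auc, xgb_auc = 0.6287, 0.7312
lift = (xgb_auc - lr_auc) / lr_auc
print(f"{lift:.1%}")
```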

SHAP Feature Importance XGBoost explainability

Key Findings

External credit scores dominate

EXT_SOURCE_2 and EXT_SOURCE_3 are by far the strongest predictors — they act like bureau credit scores and carry the most signal for default risk.

Employment stability matters

DAYS_EMPLOYED strongly predicts default: applicants with very short employment history or anomalous values (such as the 365,243-day placeholder) are significantly higher risk.

Class imbalance is the real challenge

Only 8.07% of applicants default. Without a scale_pos_weight adjustment, XGBoost could predict "no default" for nearly everyone and still appear about 92% accurate.
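
The arithmetic behind that warning is short. A common starting point for XGBoost's `scale_pos_weight` is the ratio of negative to positive rows; the sketch below derives it from the row count and default rate reported above. The positive count is approximated from the rounded rate, and the value actually used in the project is not stated, so treat the result as indicative.

```python
n_rows = 307_511
default_rate = 0.0807

n_pos = round(n_rows * default_rate)   # approximate number of defaults
n_neg = n_rows - n_pos

scale_pos_weight = n_neg / n_pos       # negatives per positive
majority_accuracy = n_neg / n_rows     # accuracy of always predicting "no default"

print(f"scale_pos_weight ≈ {scale_pos_weight:.2f}")
print(f"always-negative accuracy ≈ {majority_accuracy:.1%}")
```

This is also why the project reports AUC and F1 rather than accuracy: an always-negative model scores well on accuracy while catching no defaulters at all.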

LR vs XGBoost: +16% AUC lift

XGBoost captures non-linear interactions between credit amount, income, and employment — relationships a linear model fundamentally cannot model.

📦 Dataset — Home Credit Default Risk (Kaggle) · 307,511 rows · 122 original features (18 curated) · Binary TARGET (1 = default) · kaggle.com/c/home-credit-default-risk
This project uses only application_train.csv — no relational table joins required. Features were curated to focus on interpretable financial signals relevant to credit risk analysis.

DS Project 05 · New

Hotel Cancellation Prediction

Predicting whether a hotel booking will be canceled before arrival — trained on 82,721 real bookings across two years. Production-realistic evaluation using a chronological train/test split, calibrated to a recall-tuned threshold.

Random Forest · Gradient Boosting · Chronological Split · Permutation Importance · Scikit-Learn · Pandas · Threshold Tuning · Binary Classification

Dataset: Hotel bookings

Rows: 82,721

Features used: 68

RF F1: 0.614

RF AUC: 0.817

Cleaned Dataset

82.7K rows

2 hotels · Jul 2017 → Aug 2019

Random Forest AUC

0.817 ROC

+63% over majority baseline

Cancel Rate

37.2%

Class imbalance handled

Top Predictor

lead_time

10% → 57% by bucket

ML Pipeline

📦 Data Loading
🧹 Clean & Engineer
⏱️ Time-Based Split
🤖 3 Baselines, 3 Models
🎯 Threshold Tuning

Precision–Recall Curves chronological test set

Model Comparison

Gradient Boosting

AUC: 0.823
F1: 0.590
Role: Runner-up
Strength: Highest ranking quality

Random Forest PROD

AUC: 0.817
F1: 0.614
Role: Production model
Strength: Higher recall, robust

F1 vs simple baselines

B2: lead_time > 90 rule · F1 0.395 (scale 0.0–1.0)

B3: 6-feature LogReg · F1 0.509 (scale 0.0–1.0)

Random Forest (production) · F1 0.614 (scale 0.0–1.0)
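
A rule baseline such as B2 needs no training: it flags every booking with a lead time above 90 days and is scored like any other classifier. The sketch below shows the mechanics on a handful of made-up bookings; the data and helper name are illustrative, so the resulting F1 is not the 0.395 reported above.

```python
def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# (lead_time in days, is_canceled) pairs, illustrative only
bookings = [(3, 0), (10, 0), (45, 0), (95, 1), (120, 1), (200, 0), (30, 1), (250, 1)]
y_true = [canceled for _, canceled in bookings]
y_rule = [1 if lead_time > 90 else 0 for lead_time, _ in bookings]   # the B2-style rule

baseline_f1 = f1_score(y_true, y_rule)
print(baseline_f1)
```

Any trained model has to beat this number to justify its complexity, which is what the comparison above is showing.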

Permutation Importance RF · top 10 features

Key Findings

Lead time dominates — 10% → 57% by bucket

Bookings made 0–7 days out cancel at 9.8%. Bookings made 180+ days out cancel at 56.9%. Lead time is the single strongest signal in both impurity and permutation rankings.
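
The bucketed cancel rates can be reproduced with a simple group-and-average. Only the 0–7 and 180+ buckets are named in the text, so the intermediate edges below are assumptions, as are the helper name and the sample bookings.

```python
from bisect import bisect_right

EDGES = [8, 31, 91, 180]                              # assumed lower bounds of buckets 2..5
LABELS = ["0-7", "8-30", "31-90", "91-179", "180+"]

def cancel_rate_by_bucket(bookings):
    """Return {bucket label: cancel rate} for (lead_time, is_canceled) pairs."""
    counts = {label: [0, 0] for label in LABELS}      # [canceled, total]
    for lead_time, is_canceled in bookings:
        label = LABELS[bisect_right(EDGES, lead_time)]
        counts[label][0] += is_canceled
        counts[label][1] += 1
    return {label: c / n for label, (c, n) in counts.items() if n}

bookings = [(0, 0), (5, 0), (7, 1), (8, 0), (180, 0), (200, 1), (365, 1)]
rates = cancel_rate_by_bucket(bookings)
print({label: round(rate, 2) for label, rate in rates.items()})
```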

Engagement features collapse risk by 25–40 pts

Special requests, booking changes, parking, and room reassignment each drop the cancel rate dramatically. These are free PMS signals that the business already collects but has never used.

Caught two leakage traps before training

deposit_type excluded outright (99% of "Non Refund" cancel — inverted logic). previous_cancellations included but flagged: rank 5 by impurity, rank 14 by permutation.

Chronological split costs 16 F1 points — and we keep them

Random split would show F1 of 0.77; chronological gives 0.61. The 16-point gap is the size of the temporal-leakage illusion most projects don't catch. We report the production number.
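
A chronological split sorts bookings by date and holds out the most recent slice, so the model is always evaluated on bookings that come after everything it trained on. A minimal sketch follows; the `arrival_date` key, the 20% test fraction, and the sample rows are assumptions, since the page does not state the exact split ratio.

```python
from datetime import date

def chronological_split(rows, date_key, test_fraction=0.2):
    """Train on the earliest rows, test on the latest; no shuffling."""
    ordered = sorted(rows, key=lambda r: r[date_key])
    n_test = round(len(ordered) * test_fraction)
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]

# Ten monthly bookings, deliberately given in reverse order
rows = [{"arrival_date": date(2018, 1 + i, 1), "is_canceled": i % 2} for i in range(10)][::-1]
train, test = chronological_split(rows, "arrival_date")

print(len(train), len(test))
print(max(r["arrival_date"] for r in train), "<", min(r["arrival_date"] for r in test))
```

A random split would mix future bookings into training, which is the source of the inflated F1 described above.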

Presentation

Hotel_Cancellation_Prediction_Final_Project

Data Analysis/Science

Professional Certification

Real data. Real results.

Unlocking Customer Value Through
RFM Segmentation

Store Performance, Inventory Management
and Profitability

Discovering Hidden Customer Segments
with K-Means Clustering

Predicting Food Delivery Time
with Machine Learning

Predicting Customer Churn
with Machine Learning Classification

Explore the interactive dashboard

Interested in working together?
