Dashboards that answer business questions, from RFM segmentation and inventory profitability to ML clustering and predictive models.
1.8M+ data points analyzed
8 projects completed
6 years of hands-on experience
Our services
Business Dashboards · viz
Sales Prediction Models · ml
Customer Segmentation · rfm
KPI Reports · kpi
Data Cleaning & Structuring · etl
// certifications
Professional Certifications
Selected work
Real data. Real results.
01 Data Analysis
EDA · Business Insights · Dashboards
02 Data Science
ML · Clustering · Regression
Data Analysis: Select a Project
DA Project 1
RFM Segmentation
Unlocking Customer Value — RFM Analysis on 541,909 real transactions
RFM · EDA · Power BI
DA Project 2
Inventory Analysis
Store Performance, Inventory Management and Profitability across 485,875 items
EDA · ABC Analysis · Tableau
DA Project 3
Hotel Booking Demand
Cancellation, Pricing & Demand Patterns across 83,293 hotel bookings
EDA · Cancellation Analysis · Python
~/dashboards/rfm_segmentation.html
1,062,989 rows · 4.2 MB · live
DATA ANALYSIS · DA PROJECT 1 · RFM
Unlocking Customer Value Through RFM Segmentation
Online retail businesses often fail to distinguish loyal customers from nearly inactive ones. This dashboard uses RFM Analysis on 1,062,989 real transactions to identify high-value customer segments and surface actionable retention strategies.
Dataset: UCI Online Retail
Rows: 805,549
Customers: 5,878
Period: 2009 – 2011
Unique customers: 5,878 (after cleaning & dedup)
RFM segments: 8 (R · F · M score combinations)
Top segment: Seg 01, Current Loyal High Spending
Score range: 1–5 per R·F·M (percentile-based quintiles)
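As a minimal sketch of what percentile-based quintile scoring can look like, the snippet below ranks each metric and maps ranks to scores from 1 to 5. The function name, the tie handling, and the sample values are illustrative assumptions, not the project's actual code; it also assumes that a lower recency (a more recent purchase) should receive a higher score.

```python
from typing import List, Sequence


def quintile_scores(values: Sequence[float], higher_is_better: bool = True) -> List[int]:
    """Score each value 1-5 by the quintile its rank falls into."""
    n = len(values)
    # Indices ordered from smallest to largest value (ties keep input order).
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0] * n
    for rank, idx in enumerate(order):
        bucket = rank * 5 // n + 1  # 1 = lowest fifth, 5 = highest fifth
        scores[idx] = bucket if higher_is_better else 6 - bucket
    return scores


# Illustrative values only, not taken from the dataset.
recency_days = [3, 40, 400, 12, 250, 90, 7, 180, 30, 365]
frequency = [12, 5, 1, 9, 2, 4, 15, 2, 6, 1]

r_scores = quintile_scores(recency_days, higher_is_better=False)  # recent = high score
f_scores = quintile_scores(frequency)
print(list(zip(r_scores, f_scores)))
```

With heavily tied data (many customers with a single order, for example), rank-based buckets can place equal values in different quintiles, so a real pipeline would need an explicit tie rule.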
Data Cleaning Process
Step | Result
Check dataset shape | 1,062,989 rows × 10 columns on load
Check missing values | 238,625 empty strings found in customer_id; rows removed
Check duplicate rows | 26,124 duplicate rows found and removed
Convert data types | invoicedate converted to datetime format; customer_id to string
Remove invalid values | Rows with negative/zero quantity and price removed (20,261 negative qty; 1,820 zero price)
Remove cancelled transactions | Cancelled orders excluded using is_cancelled boolean flag
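The cleaning steps above could be sketched roughly as follows. This is plain Python over a list of dictionaries rather than any particular dataframe library; `customer_id`, `invoicedate`, and `is_cancelled` come from the table, while the `quantity` and `price` column names, the timestamp format, and the sample rows are assumptions made for illustration.

```python
from datetime import datetime
from typing import Dict, List


def clean_transactions(rows: List[Dict]) -> List[Dict]:
    """Apply the cleaning steps in order and return the surviving rows."""
    cleaned, seen = [], set()
    for row in rows:
        customer_id = str(row["customer_id"]).strip()
        if not customer_id:  # empty customer_id
            continue
        key = tuple(sorted(row.items()))  # exact-duplicate detection
        if key in seen:
            continue
        seen.add(key)
        if row["quantity"] <= 0 or row["price"] <= 0:  # invalid values
            continue
        if row["is_cancelled"]:  # cancelled orders
            continue
        cleaned.append({
            **row,
            "customer_id": customer_id,
            # Assumed timestamp format; adjust to the real export.
            "invoicedate": datetime.strptime(row["invoicedate"], "%Y-%m-%d %H:%M:%S"),
        })
    return cleaned


# Hypothetical sample rows covering each rule.
valid = {"customer_id": "13085", "invoicedate": "2010-12-01 08:26:00",
         "quantity": 6, "price": 2.55, "is_cancelled": False}
rows = [
    valid,
    dict(valid),                                          # duplicate
    {**valid, "customer_id": ""},                         # missing customer
    {**valid, "customer_id": "13078", "quantity": -2},    # negative quantity
    {**valid, "customer_id": "13090", "is_cancelled": True},
]
cleaned = clean_transactions(rows)
print(len(rows), "->", len(cleaned))
```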
Store Performance, Inventory Management and Profitability
Out of 485,875 inventory items, only 36.47% were sold. This project analyses the Looker E-commerce dataset — uncovering top categories, most profitable brands, and slow-moving products.
Dataset: Looker E-commerce
Rows: 485,875
Period: 2019 – 2024
Total items: 485K (after cleaning)
Sold rate: 36.47% (items actually sold)
Realized revenue: $10.5M (from sold inventory)
Realized profit: $5.5M (net after cost)
Avg sell time: 29.4 days (days to sell)
Data Cleaning Process
Step | Result
Check dataset shape | 490,705 rows × 12 columns before cleaning
Convert datetime columns | created_at and sold_at converted to datetime format
Check duplicate rows | Total duplicated rows: 0 (no duplicates found)
Check missing values | Missing values in sold_at represent unsold inventory; intentionally kept as NaN
Validate numerical columns | cost: min $0.008, max $557 (no zeros or negatives); retail_price: min $0.02, max $999 (valid)
Drop invalid records | Dropped rows without created_at and rows with missing brand/product name
Final dataset shape | 485,875 rows × 12 columns, ready for analysis
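Keeping the missing `sold_at` values is what makes the sold rate and the average sell time computable from the same table. A minimal sketch, assuming each item is a `(created_at, sold_at)` pair and using `None` as the plain-Python stand-in for NaN; the helper names and the sample items are hypothetical.

```python
from datetime import datetime
from typing import List, Optional, Tuple

# (created_at, sold_at); sold_at is None for unsold inventory.
Item = Tuple[datetime, Optional[datetime]]


def sold_rate(items: List[Item]) -> float:
    """Percentage of items that have a sold_at timestamp."""
    sold = sum(1 for _, sold_at in items if sold_at is not None)
    return 100.0 * sold / len(items)


def avg_days_to_sell(items: List[Item]) -> float:
    """Mean days between created_at and sold_at, over sold items only."""
    durations = [
        (sold_at - created_at).total_seconds() / 86400
        for created_at, sold_at in items
        if sold_at is not None
    ]
    return sum(durations) / len(durations)


# Illustrative items, not taken from the dataset.
items = [
    (datetime(2024, 1, 1), datetime(2024, 1, 11)),
    (datetime(2024, 1, 5), datetime(2024, 1, 25)),
    (datetime(2024, 2, 1), None),
    (datetime(2024, 2, 3), None),
]
print(sold_rate(items), avg_days_to_sell(items))
```

Dropping the unsold rows during cleaning would push the sold rate to 100% by construction, which is why they are kept.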
Charts (interactive, filterable): Inventory Sold Rate (485K items, 36.47% sold) · Revenue by Category · ABC Inventory Classification
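For readers unfamiliar with ABC classification, here is a minimal sketch of one common variant: items are ranked by revenue and tiered by cumulative revenue share. The 80% and 95% cut-offs, the function name, and the sample figures are assumptions; the page does not state which thresholds the dashboard uses.

```python
from typing import Dict


def abc_classify(revenue_by_item: Dict[str, float],
                 a_cut: float = 0.80, b_cut: float = 0.95) -> Dict[str, str]:
    """Tier items A/B/C by cumulative share of total revenue (assumed cut-offs)."""
    total = sum(revenue_by_item.values())
    tiers, cumulative = {}, 0.0
    ranked = sorted(revenue_by_item.items(), key=lambda kv: kv[1], reverse=True)
    for item, revenue in ranked:
        cumulative += revenue
        share = cumulative / total
        tiers[item] = "A" if share <= a_cut else "B" if share <= b_cut else "C"
    return tiers


# Hypothetical category revenues.
tiers = abc_classify({"jeans": 500, "jackets": 250, "socks": 170, "belts": 50, "hats": 30})
print(tiers)
```

Conventions differ on whether the item that crosses a cut-off belongs to the higher or lower tier; this sketch assigns it to the lower one.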
Predict booking cancellations on 82,721 hotel reservations — production-grade evaluation with chronological hold-out
Random Forest · Gradient Boosting · Time-Based Split · Threshold Tuning
DATA SCIENCE · DS PROJECT 1 · K-MEANS CLUSTERING
Discovering Hidden Customer Segments with K-Means Clustering
Online retail businesses often struggle to understand customer differences. This project applies RFM feature engineering and K-Means Clustering to segment 5,878 customers into 4 actionable groups — each with a tailored business strategy.
Log1p transformation applied to reduce skewness. StandardScaler used before K-Means fitting.
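The two preprocessing steps can be illustrated in plain Python: `log1p` compresses the long right tail, and standardization then centers each feature at zero with unit variance. This sketch mirrors what scikit-learn's StandardScaler computes (it uses the population standard deviation), but the helper name is hypothetical and the input simply reuses the four cluster-median spend figures from this page as sample values.

```python
import math
from statistics import mean, pstdev
from typing import List


def log1p_standardize(values: List[float]) -> List[float]:
    """Apply log(1 + x), then scale to zero mean and unit variance."""
    logged = [math.log1p(v) for v in values]  # safe at v = 0
    mu, sigma = mean(logged), pstdev(logged)  # population std
    return [(x - mu) / sigma for x in logged]


monetary = [189.0, 798.0, 1131.0, 5599.0]  # sample input only
scaled = log1p_standardize(monetary)
print([round(x, 3) for x in scaled])
```

K-Means relies on Euclidean distance, so without scaling a feature measured in pounds would dominate one measured in order counts.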
Elbow Method (inertia vs. k): auto-detected elbow at k=5; k=4 chosen for business fit.
Silhouette Analysis (score vs. k): k=2 has the highest score (0.3204) but is too broad; k=4 chosen with a score of 0.2560.
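To make the silhouette numbers concrete, the following sketch computes the mean silhouette coefficient for a given labelling: for each point, `a` is the mean distance to its own cluster and `b` the mean distance to the nearest other cluster. It is a from-scratch illustration on toy points, assumes Euclidean distance and at least two clusters, and is not the code that produced the scores above.

```python
import math
from typing import Dict, List, Sequence


def silhouette_score(points: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters: Dict[int, List[int]] = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:  # singleton cluster contributes 0
            continue
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        total += (b - a) / max(a, b)
    return total / len(points)


# Toy data: two tight, well-separated groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_score(points, [0, 0, 1, 1])  # labels match the groups
bad = silhouette_score(points, [0, 1, 0, 1])   # labels cut across the groups
print(round(good, 3), round(bad, 3))
```

A score near 1 means compact, well-separated clusters; values around 0.26 to 0.32, as reported here, indicate overlapping segments, which is common for customer data.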
PCA Cluster Visualization
2D PCA projection (84.8% variance explained). Clusters:
At Risk Customers (n=1,763)
Bulk Buyers (n=1,142)
Regular Customers (n=1,552)
Dormant Customers (n=1,421)
Cluster Explorer (k=4 segments)
Bulk Buyers · 🏆 Highest Value
Recency: 18 days (median days since last order)
Frequency: 12.5 orders (median unique invoices)
Monetary: £5,599 (median lifetime spend)
AOV: £447.58 (avg order value)
Avg Qty / Inv: 267.32 (avg quantity per invoice)
Customers: 1,142 (19.4% share of customer base)
Strategy & Actions
→ VIP loyalty programs & membership tiers
→ Early access to new products & restocks
→ Personalized marketing based on purchase history
→ Premium product promotions & bundles
→ Exclusive events & newsletters
Regular Customers · 📈 Growth Potential
Recency: 31 days (median days since last order)
Frequency: 5 orders (median unique invoices)
Monetary: £1,131 (median lifetime spend)
AOV: £231.23 (avg order value)
Avg Qty / Inv: 126.40 (avg quantity per invoice)
Customers: 1,552 (26.4% share of customer base)
Strategy & Actions
→ Upselling: promote higher-value products
→ Cross-selling complementary items
→ Product bundling to increase basket size
→ Loyalty incentives to drive repeat visits
→ Personalized recommendations engine
At Risk Customers · ⚠️ Reactivate
Recency: 299 days (median days since last order)
Frequency: 2 orders (median unique invoices)
Monetary: £798 (median lifetime spend)
AOV: £393.26 (avg order value)
Avg Qty / Inv: 239.00 (avg quantity per invoice)
Customers: 1,763 (30.0% share of customer base)
Strategy & Actions
→ Reactivation campaigns with "We miss you" messaging
→ Limited-time discounts to create urgency
→ Seasonal or event-based promotions
→ Digital retargeting on social media
→ Loyalty incentive win-back programs
Dormant Customers · 💤 Low Activity
Recency: 383 days (median days since last order)
Frequency: 1 order (median unique invoices)
Monetary: £189 (median lifetime spend)
AOV: £137.37 (avg order value)
Avg Qty / Inv: 68.00 (avg quantity per invoice)
Customers: 1,421 (24.2% share of customer base)
Strategy & Actions
→ Welcome-back promos & onboarding nudges
→ First repeat purchase incentives
→ Follow-up emails with product highlights
→ Targeted promotions to encourage engagement
→ Retargeting campaigns via digital ads
Revenue Contribution per Cluster (% of total)
At Risk Customers: 29.4%
Regular Customers: 27.0%
Bulk Buyers: 25.0%
Dormant Customers: 18.5%
Customer Distribution (% of base)
At Risk Customers: 30.0%
Regular Customers: 26.4%
Dormant Customers: 24.2%
Bulk Buyers: 19.4%
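One way to read the two breakdowns together is to divide each cluster's revenue share by its customer share; a ratio above 1 means the cluster earns more than its headcount would suggest. The short calculation below uses only the percentages listed above; the ratio itself is an illustrative metric, not one shown on the dashboard.

```python
revenue_share = {
    "At Risk Customers": 29.4,
    "Regular Customers": 27.0,
    "Bulk Buyers": 25.0,
    "Dormant Customers": 18.5,
}
customer_share = {
    "At Risk Customers": 30.0,
    "Regular Customers": 26.4,
    "Bulk Buyers": 19.4,
    "Dormant Customers": 24.2,
}

# Revenue share per unit of customer share, highest first.
ratio = {name: revenue_share[name] / customer_share[name] for name in revenue_share}
for name, value in sorted(ratio.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.2f}")
```

Bulk Buyers come out at roughly 1.29 and Dormant Customers at roughly 0.76, with the other two clusters close to 1.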
Business Recommendations
1. Retain Bulk Buyers: Protect Your Core Revenue
With only 19.4% of customers but 25% of revenue, Bulk Buyers are your most efficient segment. VIP loyalty programs, early product access, and exclusive communications can help protect retention.
2. Grow Regular Customers Into Higher-Value Buyers
Regular Customers (26.4%) contribute 27% of revenue with solid frequency. Upselling, cross-selling, and product bundling can push them toward the Bulk Buyer tier over time.
3. Reactivate At Risk Customers Before They Churn
At Risk Customers haven't purchased in ~299 days (median). Time-limited "we miss you" campaigns, discounts, and digital retargeting are the priority to win them back before they're lost.
4. Nurture Dormant Customers With Low-Friction Entry
Dormant Customers (24.2%) made a median of only 1 purchase. Welcome-back promos and first repeat purchase incentives are low-cost tools to gradually re-engage this group.
DATA SCIENCE · DS PROJECT 2 · REGRESSION
Predicting Food Delivery Time with Machine Learning
Delivery platforms often give inaccurate ETAs. This project builds an end-to-end ML pipeline to predict delivery time — factoring in distance, weather, traffic, and courier experience — achieving R²=0.82.
Dataset: Food Delivery Times
Rows: 1,000 orders
Best model: Linear Regression
Best R² score: 0.82
MAE: ±6.08 min (mean absolute error)
Models tested: 3 (LR · RF · Gradient Boosting)
Features: 9 (original input features)
Data Cleaning Process
Step | Result
Check dataset shape | 1,000 rows × 9 columns
Check missing values | 30 missing values each in Weather, Traffic_Level, Time_of_Day, Courier_Experience_yrs; imputed with median/mode
Check duplicate rows | No duplicate rows found
Feature engineering | Created Is_Rush_Hour (Morning/Evening = 1) and Distance × Prep_Time interaction feature
Encode categoricals | Weather, Traffic_Level, Time_of_Day, Vehicle_Type encoded via OneHotEncoder
Scale numericals | Distance_km, Preparation_Time_min, Courier_Experience_yrs, Is_Rush_Hour, Distance_x_Prep_Time scaled with StandardScaler
Train-test split | 800 train / 200 test (80/20 split, random_state=42)
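The feature-engineering and split steps can be sketched in plain Python as below. The column names follow the table; the helper names and sample orders are hypothetical, and Python's `random` module will not reproduce the exact rows that a scikit-learn split with `random_state=42` selects, so this only illustrates the idea of a seeded 80/20 split.

```python
import random
from typing import Dict, List, Tuple

RUSH_PERIODS = {"Morning", "Evening"}


def add_features(order: Dict) -> Dict:
    """Add Is_Rush_Hour and the Distance x Prep_Time interaction to one order."""
    enriched = dict(order)
    enriched["Is_Rush_Hour"] = int(order["Time_of_Day"] in RUSH_PERIODS)
    enriched["Distance_x_Prep_Time"] = order["Distance_km"] * order["Preparation_Time_min"]
    return enriched


def split_80_20(rows: List[Dict], seed: int = 42) -> Tuple[List[Dict], List[Dict]]:
    """Shuffle with a fixed seed, then cut at 80%."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 4 // 5
    return shuffled[:cut], shuffled[cut:]


# Hypothetical orders standing in for the 1,000-row dataset.
periods = ["Morning", "Afternoon", "Evening", "Night"]
orders = [
    {"Distance_km": 1.0 + i % 20, "Preparation_Time_min": 5 + i % 25,
     "Time_of_Day": periods[i % 4]}
    for i in range(1000)
]
train, test = split_80_20([add_features(o) for o in orders])
print(len(train), len(test))
```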
Model Comparison on the test set: R² (higher is better) · MAE (lower is better) · RMSE (lower is better)
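For reference, the three metrics can be computed from predictions as in the sketch below. The function name and the four sample delivery times are illustrative; they are not the project's test-set values.

```python
import math
from typing import Dict, Sequence


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """Return R², MAE and RMSE for paired true and predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    return {
        "r2": 1 - ss_res / ss_tot,
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
    }


# Illustrative delivery times in minutes.
metrics = regression_metrics([30, 40, 50, 60], [32, 38, 53, 59])
print({name: round(value, 3) for name, value in metrics.items()})
```

RMSE penalizes large misses more heavily than MAE, so a gap between the two hints at a few badly mispredicted orders.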
Top Feature Importance (linear coefficients)
Interactive Delivery Time Predictor (model simulation)
Predicting Customer Churn with Machine Learning Classification
Subscription businesses lose revenue when customers leave without warning. This project trains 4 classification models on 64,374 customer records to predict churn — identifying at-risk customers early so retention actions can be taken before it's too late.
Dataset: Customer Churn Dataset
Rows: 64,374 (12 features · no nulls · no duplicates)
Best model: SVM RBF
Churn rate: 47.4% (near-balanced · class_weight applied)
Best recall: 84.9% (SVM RBF · smallest train/test gap)
Net profit: $50,076 (from 12,875 test customers)
Data Cleaning & Inspection
Step | Result
Check dataset shape | 64,374 rows × 12 columns
Drop identifier column | CustomerID dropped; not predictive, would add noise
Check null values | No null values found across all 12 columns
Check duplicates | No duplicate rows found
Class balance check | 52.6% No Churn / 47.4% Churn; near-balanced, class_weight='balanced' applied
Final dataset shape | 64,374 rows × 11 features, ready for feature engineering
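To show what `class_weight='balanced'` does with a split this close to even, the sketch below applies the weighting formula scikit-learn documents for that option, `n_samples / (n_classes * class_count)`. The counts are stand-ins scaled to 1,000 rows from the 52.6% / 47.4% split, since the exact class counts are not given here, and the helper name is hypothetical.

```python
from typing import Dict


def balanced_class_weights(class_counts: Dict[str, int]) -> Dict[str, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Stand-in counts per 1,000 rows, derived from the reported percentages.
weights = balanced_class_weights({"no_churn": 526, "churn": 474})
print({label: round(w, 3) for label, w in weights.items()})
```

The resulting weights sit close to 1 for both classes, so the adjustment is mild on near-balanced data like this.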
Feature List (11 features)
Support Calls: strongest churn signal; high calls = at-risk customer
Payment Delay: days late on payment; financial disengagement signal
Contract Length: Monthly contracts churn far more than Annual
Usage Frequency: low usage = low perceived value = higher churn risk
Tenure: shorter tenure = less loyalty = more likely to leave
Age · Total Spend · Last Interaction: supporting features with moderate predictive impact
Gender · Subscription Type: categorical; encoded via One-Hot Encoding (drop_first=True)
OHE applied to: Gender, Subscription Type, Contract Length
StandardScaler applied to all numerical features (except Decision Tree)
Train/Test split: 80/20 · random_state=42
EDA Key Findings
01. Support calls are the top churn predictor
Customers with 7+ support calls show dramatically higher churn rates; high call volume signals frustration and unresolved product issues.
02. Monthly contracts are highest-risk
Monthly subscribers churn far more than quarterly or annual customers, since low-commitment contracts make cancellation easy.
03. Payment delay is an early warning signal
Customers with 20+ day payment delays are significantly more likely to churn; late payments frequently precede cancellation.
Model Recall Comparison (priority metric)
SVM RBF (selected): 84.9%
Logistic Regression: 84.4%
KNN: 93.6%
Decision Tree: 99.7%
Recall is prioritized: missing a churner (FN) costs $10 in lost revenue, versus a $4 marketing cost for a false alarm (FP). The Decision Tree reaches 99.7% recall, but its train accuracy of 100.0% indicates severe overfitting. SVM RBF was selected: train 82.4% / test 83.0%, the smallest gap and the best generalization.
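The cost argument can be made explicit with a small calculation. The $10 and $4 figures come from the text above; the confusion-matrix counts are hypothetical, and this sketch only compares misclassification cost, so it does not attempt to reproduce the reported net profit figure, whose formula is not given.

```python
FN_COST = 10.0  # lost revenue per missed churner (from the text)
FP_COST = 4.0   # marketing cost per false alarm (from the text)


def recall(tp: int, fn: int) -> float:
    """Share of actual churners that the model catches."""
    return tp / (tp + fn)


def misclassification_cost(fn: int, fp: int) -> float:
    """Total cost of missed churners plus false alarms."""
    return FN_COST * fn + FP_COST * fp


# Hypothetical models evaluated on the same 1,000 actual churners.
high_recall = {"tp": 850, "fn": 150, "fp": 200}
low_recall = {"tp": 800, "fn": 200, "fp": 100}

for name, m in (("high recall", high_recall), ("low recall", low_recall)):
    cost = misclassification_cost(m["fn"], m["fp"])
    print(f"{name}: recall={recall(m['tp'], m['fn']):.2f}, cost=${cost:,.0f}")
```

Because a miss costs 2.5 times as much as a false alarm, the higher-recall model is cheaper in this example even though it raises twice as many false alarms.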
Predicting whether a Home Credit loan applicant will default within 2 years — trained on 307,511 real borrowers. Directly relevant to tax & accounting domain expertise.
XGBoost · AUC: 0.7312 · F1: 0.2562 · Role: Production model · Strength: Captures non-linearity
AUC vs. random baseline (scale: 0.5 = random, 1.0 = perfect)
Logistic Regression: 0.6287
XGBoost: 0.7312
SHAP Feature Importance (XGBoost explainability)
Key Findings
01. External credit scores dominate
EXT_SOURCE_2 and EXT_SOURCE_3 are by far the strongest predictors. They act like bureau credit scores and carry the most signal for default risk.
02. Employment stability matters
DAYS_EMPLOYED strongly predicts default: applicants with very short employment history or anomalous values (365,000 days) are significantly higher risk.
03. Class imbalance is the real challenge
Only 8.07% of applicants default. Without a scale_pos_weight adjustment, a model can predict "no default" for everyone and still appear 92% accurate.
04. LR vs XGBoost: +16% relative AUC lift (0.6287 → 0.7312)
XGBoost captures non-linear interactions between credit amount, income, and employment, relationships that a linear model cannot capture unless interaction terms are engineered by hand.
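The figures in findings 03 and 04 follow from simple arithmetic, sketched below. The `scale_pos_weight` value shown is the commonly suggested heuristic of negatives divided by positives; the page does not state which value the project actually used, so treat it as an assumption.

```python
default_rate = 0.0807              # share of applicants who default
auc_lr, auc_xgb = 0.6287, 0.7312   # reported AUC scores

# Accuracy of a model that always predicts "no default".
baseline_accuracy = 1 - default_rate

# Heuristic class weight: negative count / positive count.
scale_pos_weight = (1 - default_rate) / default_rate

# Relative AUC improvement of XGBoost over Logistic Regression.
relative_lift = (auc_xgb - auc_lr) / auc_lr

print(f"baseline accuracy: {baseline_accuracy:.2%}")
print(f"scale_pos_weight:  {scale_pos_weight:.2f}")
print(f"relative AUC lift: {relative_lift:.1%}")
```

The always-negative baseline is why accuracy is a misleading metric here and AUC and F1 are reported instead.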
📦 Dataset: Home Credit Default Risk (Kaggle) · 307,511 rows · 122 original features (18 curated) · Binary TARGET (1 = default) · kaggle.com/c/home-credit-default-risk
This project uses only application_train.csv — no relational table joins required. Features were curated to focus on interpretable financial signals relevant to credit risk analysis.
Predicting whether a hotel booking will be canceled before arrival — trained on 82,721 real bookings across two years. Production-realistic evaluation using a chronological train/test split, calibrated to a recall-tuned threshold.
Random Forest · Gradient Boosting · Chronological Split · Permutation Importance · Scikit-Learn · Pandas · Threshold Tuning · Binary Classification
Random Forest · AUC: 0.817 · F1: 0.614 · Role: Production model · Strength: Higher recall, robust
F1 vs. simple baselines (scale: 0.0 to 1.0)
B2: lead_time > 90 rule: 0.395
B3: 6-feature LogReg: 0.509
Random Forest (production): 0.614
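A rule baseline such as B2 needs no training at all, which is what makes it a useful floor for the model. The sketch below scores a `lead_time > 90` rule with F1 on a handful of made-up bookings; the helper names and data are illustrative, and the resulting score has no relation to the 0.395 reported above.

```python
from typing import List, Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class (1 = canceled)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def lead_time_rule(lead_times: Sequence[int], cutoff: int = 90) -> List[int]:
    """Predict cancellation whenever lead time exceeds the cutoff (in days)."""
    return [int(days > cutoff) for days in lead_times]


# Made-up bookings: lead time in days and whether the booking was canceled.
lead_times = [5, 120, 200, 30, 95, 10, 150, 60]
canceled = [0, 1, 1, 1, 0, 0, 1, 0]

baseline_f1 = f1_score(canceled, lead_time_rule(lead_times))
print(round(baseline_f1, 2))
```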
Permutation Importance (RF · top 10 features)
Key Findings
01. Lead time dominates: 10% → 57% by bucket
Bookings made 0–7 days out cancel at 9.8%; bookings made 180+ days out cancel at 56.9%. Lead time is the single strongest signal in both impurity and permutation rankings.
02. Engagement features cut cancellation risk by 25–40 pts
Special requests, booking changes, parking, and room reassignment each lower the cancellation rate sharply. These are free PMS signals the business already collects but never used.
03. Caught two leakage traps before training
deposit_type was excluded outright (99% of "Non Refund" bookings cancel, which inverts the expected logic). previous_cancellations was included but flagged: rank 5 by impurity, rank 14 by permutation.
04. Chronological split costs 16 F1 points, and we keep them
A random split would show an F1 of 0.77; the chronological split gives 0.61. The 16-point gap is the size of the temporal-leakage illusion most projects don't catch. We report the production number.
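A chronological hold-out means the model is trained only on bookings that precede every test booking, as it would be in production. A minimal sketch follows, assuming a hypothetical `arrival_date` field and a 20% test fraction; the page does not state the project's actual split point.

```python
from datetime import date
from typing import Dict, List, Tuple


def chronological_split(rows: List[Dict], date_key: str,
                        test_fraction: float = 0.2) -> Tuple[List[Dict], List[Dict]]:
    """Sort by date and hold out the most recent fraction as the test set."""
    ordered = sorted(rows, key=lambda row: row[date_key])
    n_test = max(1, round(len(ordered) * test_fraction))
    return ordered[:-n_test], ordered[-n_test:]


# Made-up bookings in shuffled month order.
bookings = [
    {"arrival_date": date(2016, month, 15), "is_canceled": month % 2}
    for month in (7, 1, 12, 3, 9, 5, 11, 2, 8, 4)
]
train, test = chronological_split(bookings, "arrival_date")
print(len(train), len(test), test[0]["arrival_date"])
```

A random split would instead scatter late bookings into the training set, letting the model learn from the same period it is later tested on, which is the leakage finding 04 describes.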