Chapter · 02

Data & Methods

The Olist Brazilian e‑commerce dataset — eight relational tables, cleaned, merged, and engineered into an analysis‑ready foundation for classification.

07 · Source

The Olist E‑commerce Dataset

A large‑scale Brazilian retail dataset widely used for machine learning and business analytics research, integrating multiple relational tables that describe the full customer journey from listing to review.

TABLE · 01

Orders

Timestamps for purchase, approval, shipping, delivery, and estimated delivery.

TABLE · 02

Order Items

Links orders to products, sellers, item prices, and freight charges.

TABLE · 03

Products

Category, dimensions, and weight; categories translated from Portuguese.

TABLE · 04

Sellers

Geographic information of sellers — ZIP code prefixes, city, and state.

TABLE · 05

Customers

Customer identifiers and geographic information for distance analysis.

TABLE · 06

Payments

Payment methods, installment counts, and payment values.

TABLE · 07

Reviews

Customer review scores and textual feedback; the review score is the target variable.

TABLE · 08

Geolocation

Maps ZIP code prefixes to latitude / longitude for distance modeling.
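The source does not state which distance formula was used. As one common option, the great-circle (haversine) distance between a seller's and a customer's coordinates can be sketched as follows. The function name `haversine_km` and the sample coordinates are illustrative assumptions, not part of the dataset.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    earth_radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


# Approximate city-centre coordinates, used only as illustrative inputs.
sao_paulo = (-23.55, -46.63)
rio_de_janeiro = (-22.91, -43.17)
distance = haversine_km(*sao_paulo, *rio_de_janeiro)
```

In practice the inputs would come from joining seller and customer ZIP code prefixes to the geolocation table, typically after averaging the coordinates recorded for each prefix.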

08.2 · Preprocessing

Cleaning, typing, and merging.

Data cleaning was performed separately for each table and followed a structured approach: correcting data types, removing invalid entries, handling missing values, and preparing the tables for merging. Timestamp fields were converted to datetime, numeric fields were cast to proper numeric types, and only orders with a valid operational status (e.g. delivered, shipped, approved) were retained.
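A minimal pandas sketch of the typing and status-filtering step, run on toy rows. The column names (`order_status`, `purchase_date`) and the exact set of retained statuses are assumptions based on the description above.

```python
import pandas as pd

# Toy rows standing in for the orders table; column names are assumptions.
orders = pd.DataFrame({
    "order_id": ["a", "b", "c", "d"],
    "order_status": ["delivered", "canceled", "shipped", "approved"],
    "purchase_date": [
        "2018-01-02 10:00:00",
        "2018-01-03 11:30:00",
        "not a date",
        "2018-01-05 09:15:00",
    ],
})

# Unparseable timestamps become NaT instead of raising an error,
# so they can be treated as missing values later on.
orders["purchase_date"] = pd.to_datetime(orders["purchase_date"], errors="coerce")

# Keep only rows whose status is in the retained set.
valid_statuses = {"delivered", "shipped", "approved"}
orders_clean = orders[orders["order_status"].isin(valid_statuses)].reset_index(drop=True)
```

Numeric columns can be handled the same way with `pd.to_numeric(..., errors="coerce")`.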

Missing delivery dates were replaced with the estimated delivery dates, missing review comments were filled with empty strings, and rows with missing values in non‑critical fields were removed. Orders imputed this way therefore register a delivery delay of exactly zero. Numeric product attributes were imputed using median values. All cleaned tables were then merged via shared identifiers into a single master table (orders ⇢ customers ⇢ items ⇢ products ⇢ payments ⇢ reviews), producing the unified analytical dataset.
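The imputation and merge steps might look roughly like this in pandas. The tables below are toy stand-ins, the column names (`weight_g`, `review_comment`, and so on) are assumptions, the join type (`how="left"`) is a guess since the source does not state it, and the payments table is left out for brevity.

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": ["o1", "o2"],
    "customer_id": ["c1", "c2"],
    "actual_delivery_date": pd.to_datetime(["2018-01-10", None]),
    "estimated_delivery_date": pd.to_datetime(["2018-01-12", "2018-01-20"]),
})
customers = pd.DataFrame({"customer_id": ["c1", "c2"], "customer_state": ["SP", "RJ"]})
items = pd.DataFrame({
    "order_id": ["o1", "o2", "o2"],
    "product_id": ["p1", "p2", "p3"],
    "price": [50.0, 20.0, 35.0],
})
products = pd.DataFrame({"product_id": ["p1", "p2", "p3"], "weight_g": [500.0, None, 1500.0]})
reviews = pd.DataFrame({
    "order_id": ["o1", "o2"],
    "review_score": [5, 2],
    "review_comment": ["great", None],
})

# Imputation rules described in the text.
orders["actual_delivery_date"] = orders["actual_delivery_date"].fillna(
    orders["estimated_delivery_date"]
)
products["weight_g"] = products["weight_g"].fillna(products["weight_g"].median())
reviews["review_comment"] = reviews["review_comment"].fillna("")

# Chain the joins on the shared identifiers.
master = (
    orders.merge(customers, on="customer_id", how="left")
    .merge(items, on="order_id", how="left")
    .merge(products, on="product_id", how="left")
    .merge(reviews, on="order_id", how="left")
)
```

Because an order can contain several items, the merged table has one row per order item, and order-level fields such as the review score are repeated on each of those rows.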

08.4 · Feature Engineering

New variables capturing operational, product, and customer signal.

FEATURE · 01

Delivery Delay

actual_delivery_date − estimated_delivery_date

Early, on time, or late.

FEATURE · 02

Delivery Time (Total Days)

actual_delivery_date − purchase_date

Total fulfilment duration.

FEATURE · 03

Price‑to‑Freight Ratio

price / freight_value

Detects disproportionate shipping cost.

FEATURE · 04

Product Volume

length × height × width

Packaging and handling complexity.

FEATURE · 05

Review Sentiment Flags

positive (4–5) · neutral (3) · negative (1–2)

Binary indicators.

FEATURE · 06

Payment Installment Features

n_installments, value_per_installment

Payment behavior.
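The six features above can be sketched in pandas as follows. The data are toy rows; `payment_value` and the output column names are assumptions, and mapping a zero freight value to a missing ratio is an illustrative choice that the source does not specify.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "purchase_date": pd.to_datetime(["2018-01-01", "2018-01-01", "2018-01-01"]),
    "actual_delivery_date": pd.to_datetime(["2018-01-08", "2018-01-12", "2018-01-15"]),
    "estimated_delivery_date": pd.to_datetime(["2018-01-10", "2018-01-12", "2018-01-11"]),
    "price": [100.0, 40.0, 60.0],
    "freight_value": [20.0, 0.0, 15.0],
    "length": [10.0, 20.0, 30.0],
    "height": [5.0, 10.0, 10.0],
    "width": [4.0, 10.0, 20.0],
    "review_score": [5, 3, 1],
    "payment_value": [120.0, 40.0, 75.0],
    "n_installments": [3, 1, 5],
})

# Features 01 and 02: negative delay means early, zero on time, positive late.
df["delivery_delay_days"] = (df["actual_delivery_date"] - df["estimated_delivery_date"]).dt.days
df["delivery_time_days"] = (df["actual_delivery_date"] - df["purchase_date"]).dt.days

# Feature 03: a zero freight value yields NaN instead of an infinite ratio.
df["price_to_freight"] = df["price"] / df["freight_value"].replace(0, np.nan)

# Feature 04: volume from the three product dimensions.
df["product_volume"] = df["length"] * df["height"] * df["width"]

# Feature 05: one binary indicator per sentiment band.
df["is_positive"] = (df["review_score"] >= 4).astype(int)
df["is_neutral"] = (df["review_score"] == 3).astype(int)
df["is_negative"] = (df["review_score"] <= 2).astype(int)

# Feature 06: average amount paid per installment.
df["value_per_installment"] = df["payment_value"] / df["n_installments"]
```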

09 · Exploratory Data Analysis

Reading the dataset before modeling.

EDA was conducted to understand how operational, product, and customer‑related factors influence customer review scores — surfacing patterns that may explain variations in review outcomes and that support the predictive modeling phase.

09.1 · Orders

Order status distribution and delivery delay patterns.

The orders dataset was analyzed to understand operational factors that may influence customer review scores. The analysis focused on key delivery‑related variables — order status distribution, delivery time patterns, purchase timestamp trends, delivery delays, and estimated vs. actual delivery comparisons.

FIG. 1.2 · Delivery Delay Distribution (Days). Most deliveries cluster near zero days of delay (slightly early to on‑time); a long right tail captures the late deliveries that correlate with negative reviews.
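One way to check the link between lateness and review score numerically is to bucket the delay and compare mean scores per bucket. The values below are invented toy data, not results from the dataset, and the column names are assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "delivery_delay_days": [-5, -2, 0, 0, 3, 10],
    "review_score": [5, 4, 5, 4, 2, 1],
})

# Negative delay is early, zero is on time, anything else is late.
df["delay_bucket"] = np.select(
    [df["delivery_delay_days"] < 0, df["delivery_delay_days"] == 0],
    ["early", "on time"],
    default="late",
)

# Number of orders and mean review score per bucket.
summary = df.groupby("delay_bucket")["review_score"].agg(["count", "mean"])
```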

09.2 · Products

Product category prevalence and physical attributes.

The products dataset reveals which categories dominate the catalog and how physical product attributes — weight and dimensions — relate to fulfilment experience. Heavier or larger items are more exposed to damage and longer delivery times.

FIG. 2.1 · Top 20 Product Categories. Bed/bath/table, sports/leisure, and furniture/decor dominate order volume, with a steep drop into the long tail of niche categories.

FIG. 2.2 · Product Weight Distribution. A strongly right‑skewed distribution: most products are light, with a small fraction of heavy items that drive freight cost and handling complexity.

09.3 · Customers

Geographic spread across Brazilian states and cities.

Brazil's geographic scale creates significant regional variation in logistics efficiency. Customer concentration in São Paulo, Rio de Janeiro, and Minas Gerais shapes both demand patterns and delivery performance — and indirectly, review outcomes.

FIG. 3.1 · Customer Distribution by State. São Paulo (SP) dominates customer count, followed by Rio de Janeiro and Minas Gerais, a concentration that drives logistical priorities.

FIG. 3.2 · Customer Distribution by City. The top cities are the major urban centers, giving a granular view of where operational improvements would have the largest customer impact.
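Counts like those behind the two figures above can be derived with `value_counts`. The rows are toy data and the column names `customer_state` and `customer_city` are assumptions.

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_state": ["SP", "SP", "SP", "RJ", "RJ", "MG"],
    "customer_city": [
        "sao paulo", "sao paulo", "campinas",
        "rio de janeiro", "niteroi", "belo horizonte",
    ],
})

# Absolute customer counts per state, sorted from largest to smallest.
by_state = customers["customer_state"].value_counts()

# Share of all customers per state.
state_share = customers["customer_state"].value_counts(normalize=True)

# The most frequent cities; head(20) would give a top-20 chart.
top_cities = customers["customer_city"].value_counts().head(2)
```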

09.4 · Combined Dataset

Price, freight, review scores, and pairwise correlations.

The merged master dataset surfaces interactions between operational, product, and customer‑level variables and how they relate to customer review scores — providing direct numerical support for the modeling phase that follows.

FIG. 4.1 · Price Distribution. Strongly right‑skewed: most orders are low‑cost with a small number of high‑value purchases that disproportionately affect total revenue.

FIG. 4.2 · Freight Value Distribution. Freight cost typically falls between R$ 10 and R$ 40, reflecting distance, weight, and handling complexity; freight value is the denominator of the engineered price‑to‑freight ratio.

FIG. 4.3 · Review Score Distribution. Heavy concentration at 4 and 5 stars confirms the strong class imbalance addressed during modeling; the positive review is the dominant class.
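The degree of imbalance can be quantified by mapping scores to the three sentiment bands and computing class shares. The scores below are toy values, and the ratio of largest to smallest class is just one simple summary, not necessarily the measure used in the modeling phase.

```python
import pandas as pd

# Toy review scores; real shares come from the merged dataset.
review_scores = pd.Series([5, 5, 5, 4, 4, 5, 3, 1, 5, 2], name="review_score")

# Right-inclusive bins: (0, 2] negative, (2, 3] neutral, (3, 5] positive.
labels = pd.cut(
    review_scores,
    bins=[0, 2, 3, 5],
    labels=["negative", "neutral", "positive"],
)

# Fraction of reviews in each class, and largest-to-smallest class ratio.
class_share = labels.value_counts(normalize=True)
imbalance_ratio = class_share.max() / class_share.min()
```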

FIG. 4.4 · Correlation Heatmap. Pairwise correlations show tight clustering of product physical attributes (weight, dimensions), modest links between payment installments and value, and weak linear association of review score with most numeric features.
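The matrix behind such a heatmap is a plain pairwise Pearson correlation. A small sketch on invented numbers, with assumed column names `weight_g` and `volume_cm3`:

```python
import pandas as pd

# Toy values chosen so that weight and volume move together
# while review score does not follow either of them.
df = pd.DataFrame({
    "weight_g": [100, 200, 300, 400, 500],
    "volume_cm3": [1000, 2100, 2900, 4200, 5000],
    "review_score": [5, 1, 4, 2, 5],
})

# Symmetric matrix of Pearson coefficients with ones on the diagonal.
corr = df.corr(method="pearson")
```

Pearson correlation only captures linear association, so a weak coefficient for review score does not rule out a nonlinear relationship with a feature.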

09.5 · Descriptive Statistics

Central tendencies and skewness.

Price and freight value show strong right‑skewed distributions — most orders are low‑cost with a small number of high‑value purchases.
Product weight and dimensions also exhibit right skewness, consistent with the presence of a few very large or heavy items.
Delivery delays have a mean close to zero but a wide range — reflecting both early and late deliveries.
Review scores are heavily concentrated at the high end (4 and 5 stars), confirming the class imbalance addressed during modeling.
Payment values vary widely, with some customers using multiple installments — reflecting purchasing behavior and financial flexibility.
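Summary statistics of this kind can be computed in one pass with `DataFrame.agg`. The numbers are toy values chosen to mimic a right-skewed price column and a symmetric delay column; they are not taken from the dataset.

```python
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, 12.0, 15.0, 18.0, 20.0, 400.0],
    "delivery_delay_days": [-3, -1, 0, 0, 1, 3],
})

# One row per variable, one column per statistic.
stats = df.agg(["mean", "median", "skew"]).T
```

A mean well above the median together with a positive skew value is the usual numerical signature of a right-skewed variable.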