Data Processing

Overview

This section describes our data processing pipeline that prepares raw data for model training. We use DVC to track all data transformations and ensure reproducibility.

Data Pipeline

Our data processing pipeline consists of the following stages; a sketch of how the processing script might implement them follows the list:

  1. Raw Data Collection
     - Data is stored in data/raw/
     - Each dataset is versioned with DVC
     - Raw data is considered immutable

  2. Data Cleaning
     - Handle missing values
     - Remove duplicates
     - Fix data types
     - Normalize values

  3. Feature Engineering
     - Create new features
     - Encode categorical variables
     - Scale numerical features
     - Select relevant features

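The script that implements these stages, src/mlops/data/process.py, is not reproduced here. The sketch below shows one plausible shape for it, assuming a tabular CSV with a target label column and using pandas and scikit-learn; the column names, cleaning choices, and params.yaml loading are illustrative assumptions rather than the project's actual code.

# Hypothetical sketch of src/mlops/data/process.py (illustrative only).
from pathlib import Path

import pandas as pd
import yaml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

RAW_PATH = Path("data/raw/dataset.csv")
OUT_DIR = Path("data/processed")


def main() -> None:
    # Read the DVC-tracked parameters (matches the params: section in dvc.yaml).
    params = yaml.safe_load(Path("params.yaml").read_text())["data"]

    df = pd.read_csv(RAW_PATH)

    # Data cleaning: remove duplicates, fill missing numeric values.
    df = df.drop_duplicates()
    numeric_cols = [c for c in df.select_dtypes(include="number").columns if c != "target"]
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

    # Feature engineering: one-hot encode categorical columns; "target" is
    # assumed to be the label and is left untouched. Feature selection is
    # omitted for brevity.
    categorical_cols = df.select_dtypes(include=["object", "category"]).columns.drop("target", errors="ignore")
    df = pd.get_dummies(df, columns=list(categorical_cols))

    # Train/test split driven by the parameters declared in dvc.yaml.
    train_df, test_df = train_test_split(
        df, train_size=params["split_ratio"], random_state=params["random_state"]
    )
    train_df, test_df = train_df.copy(), test_df.copy()

    # Scale numerical features, fitting the scaler on the training split only.
    scaler = StandardScaler().fit(train_df[numeric_cols])
    train_df[numeric_cols] = scaler.transform(train_df[numeric_cols])
    test_df[numeric_cols] = scaler.transform(test_df[numeric_cols])

    OUT_DIR.mkdir(parents=True, exist_ok=True)
    train_df.to_csv(OUT_DIR / "train.csv", index=False)
    test_df.to_csv(OUT_DIR / "test.csv", index=False)


if __name__ == "__main__":
    main()

Reading split_ratio and random_state from params.yaml is what lets DVC detect parameter changes and re-run only the affected stage.
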
Pipeline Configuration

The data processing pipeline is defined in dvc.yaml:

stages:
  process_data:
    cmd: python src/mlops/data/process.py
    deps:
      - data/raw/dataset.csv
      - src/mlops/data/process.py
    params:
      - data.split_ratio
      - data.random_state
    outs:
      - data/processed/train.csv
      - data/processed/test.csv

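The data.split_ratio and data.random_state entries refer to DVC's default parameters file, params.yaml. A minimal params.yaml consistent with the stage above might look like this; the values are illustrative:

data:
  split_ratio: 0.8
  random_state: 42
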
Usage

To run the data processing pipeline:

# Run data processing stage
dvc repro process_data

# Check pipeline status
dvc status

# Push processed data to remote storage
dvc push

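On a fresh clone of the repository, the data is typically retrieved rather than rebuilt. The following standard DVC commands, shown here as an illustrative complement to the workflow above, fetch tracked data and display the stage graph:

# Fetch DVC-tracked data from remote storage
dvc pull

# Show the pipeline stage graph
dvc dag
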
Data Versioning

Each dataset version is tracked in DVC. Raw data is added explicitly with dvc add; the processed train/test splits are declared as pipeline outputs in dvc.yaml, so dvc repro records them in dvc.lock rather than requiring a separate dvc add:

# Add new raw data
dvc add data/raw/dataset.csv

# Commit changes (processed outputs are captured in dvc.lock)
git add data/raw/dataset.csv.dvc dvc.yaml dvc.lock
git commit -m "feat: add new dataset version"
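
To roll back to an earlier dataset version, the usual DVC workflow is to restore the tracking metadata from the corresponding Git revision and let DVC sync the workspace; the revision below is a placeholder:

# Restore tracking metadata from an earlier commit (placeholder revision)
git checkout <revision> -- data/raw/dataset.csv.dvc dvc.lock

# Sync workspace data to match that metadata
dvc checkout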