# Data Processing
## Overview
This section describes our data processing pipeline that prepares raw data for model training. We use DVC to track all data transformations and ensure reproducibility.
## Data Pipeline
Our data processing pipeline consists of the following stages:
- **Raw Data Collection**
    - Data is stored in `data/raw/`
    - Each dataset is versioned with DVC
    - Raw data is considered immutable
- **Data Cleaning** (see the first sketch after this list)
    - Handle missing values
    - Remove duplicates
    - Fix data types
    - Normalize values
- **Feature Engineering** (see the second sketch after this list)
    - Create new features
    - Encode categorical variables
    - Scale numerical features
    - Select relevant features
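The cleaning step is described above only as a list of goals; its actual implementation lives in `src/mlops/data/process.py` and is not shown in this document. A minimal sketch of what such a step could look like, assuming pandas and hypothetical column names (`age`, `target`) that do not come from this repository:

```python
import pandas as pd


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning: duplicates, missing values, data types, normalization."""
    df = df.drop_duplicates()                                # remove duplicate rows
    df = df.dropna(subset=["target"])                        # drop rows with a missing label (hypothetical column)
    df["age"] = pd.to_numeric(df["age"], errors="coerce")    # fix the column's data type
    df["age"] = df["age"].fillna(df["age"].median())         # impute remaining missing values
    # normalize the column to the [0, 1] range
    df["age"] = (df["age"] - df["age"].min()) / (df["age"].max() - df["age"].min())
    return df
```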
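Likewise, a hedged sketch of the feature-engineering step, again with made-up column names (`city`, `age`, `income`, `record_id`) and assuming scikit-learn is available; in a real pipeline the scaler should be fit on the training split only:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative feature engineering: new feature, encoding, scaling, selection."""
    df = df.copy()
    df["income_per_age"] = df["income"] / df["age"].clip(lower=1)        # create a new feature
    df = pd.get_dummies(df, columns=["city"], drop_first=True)           # encode a categorical variable
    numeric_cols = ["age", "income", "income_per_age"]
    df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])  # scale numerical features
    return df.drop(columns=["record_id"], errors="ignore")               # select features by dropping an ID column
```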
## Pipeline Configuration
The data processing pipeline is defined in `dvc.yaml`:

```yaml
stages:
  process_data:
    cmd: python src/mlops/data/process.py
    deps:
      - data/raw/dataset.csv
      - src/mlops/data/process.py
    params:
      - data.split_ratio
      - data.random_state
    outs:
      - data/processed/train.csv
      - data/processed/test.csv
```
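The stage's command, `src/mlops/data/process.py`, is not reproduced in this document. A sketch of a script consistent with the declared `deps`, `params`, and `outs`, under the assumption that `data.split_ratio` in `params.yaml` is the training fraction:

```python
from pathlib import Path

import pandas as pd
import yaml
from sklearn.model_selection import train_test_split


def main() -> None:
    # Read the parameters declared under `params:` in dvc.yaml.
    params = yaml.safe_load(Path("params.yaml").read_text())["data"]

    df = pd.read_csv("data/raw/dataset.csv")

    # Split with the tracked parameters so the run is reproducible.
    train_df, test_df = train_test_split(
        df,
        train_size=params["split_ratio"],
        random_state=params["random_state"],
    )

    out_dir = Path("data/processed")
    out_dir.mkdir(parents=True, exist_ok=True)
    train_df.to_csv(out_dir / "train.csv", index=False)
    test_df.to_csv(out_dir / "test.csv", index=False)


if __name__ == "__main__":
    main()
```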
## Usage
To run the data processing pipeline:
```bash
# Run data processing stage
dvc repro process_data

# Check pipeline status
dvc status

# Push processed data to remote storage
dvc push
```
## Data Versioning
Each dataset version is tracked in DVC, so earlier versions of a dataset can be restored or compared at any time.
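As an illustrative sketch, once a version has been committed and pushed it can also be read back programmatically through DVC's Python API (`dvc.api`); the Git tag `v1.0` below is a placeholder, not a tag defined in this document:

```python
import dvc.api
import pandas as pd

# Open the processed training set as it existed at a given Git revision.
# `rev` accepts any Git reference: a tag, a branch name, or a commit hash.
with dvc.api.open("data/processed/train.csv", rev="v1.0") as f:
    train_v1 = pd.read_csv(f)

print(train_v1.shape)
```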