A lightweight toolkit for quantitatively scoring LeRobot episodes.
A toolkit for evaluating and filtering LeRobot episode datasets. It combines classic computer-vision heuristics (blur/exposure tests, kinematic smoothness, collision spikes) with optional Gemini-powered vision-language checks to give each episode a 0–1 score across multiple quality dimensions.
Use this toolkit to:
- Automatically score robot demonstration episodes on visual clarity, motion smoothness, collision detection, and more
- Filter low-quality episodes to improve downstream training performance
- Train and compare baseline vs. filtered dataset models
- Visualize score distributions and identify problematic episodes
| Dimension | Function | What it measures |
|---|---|---|
| Visual clarity | `score_visual_clarity` | Blur, over-/under-exposure, low-light frames |
| Smoothness | `score_smoothness` | 2nd derivative of joint angles |
| Path efficiency | `score_path_efficiency` | Ratio of straight-line vs. actual joint-space path |
| Collision / spikes | `score_collision` | Sudden acceleration outliers (proxy for contacts) |
| Joint stability (final 2 s) | `score_joint_stability` | Stillness at the goal pose |
| Gripper consistency | `score_gripper_consistency` | Binary "closed vs. holding" agreement |
| Actuator saturation | `score_actuator_saturation` | Difference between commanded actions and achieved states |
| Task success (VLM) | `score_task_success` (via `VLMInterface`) | Gemini grades whether the desired behavior happened |
| Runtime penalty / outliers | `score_runtime` + `build_time_stats`, `is_time_outlier` | Episode length vs. nominal / Tukey-IQR / Z-score fences |
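The kinematic scores share a common pattern: compute a per-frame statistic from the joint trajectory, then squash it into a 0–1 score. A minimal sketch of that idea for smoothness (2nd derivative of joint angles) and for the Tukey-IQR runtime fence is below; the function names echo the table, but the actual normalization in `score_dataset.py` may differ.

```python
import numpy as np

def score_smoothness_sketch(joints: np.ndarray, fps: float = 30.0) -> float:
    """Toy smoothness score: large joint-angle accelerations lower the score.

    joints: (T, D) array of joint angles over T frames.
    Illustrative only; the real score_smoothness may normalize differently.
    """
    dt = 1.0 / fps
    accel = np.diff(joints, n=2, axis=0) / dt**2   # 2nd derivative per joint
    rms = np.sqrt(np.mean(accel**2))               # one number per episode
    return float(1.0 / (1.0 + rms))                # squash into (0, 1]

def is_time_outlier_sketch(durations: np.ndarray, k: float = 1.5) -> np.ndarray:
    """Tukey-IQR fence: flag episodes whose length falls outside
    [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(durations, [25, 75])
    iqr = q3 - q1
    return (durations < q1 - k * iqr) | (durations > q3 + k * iqr)
```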
- Python 3.8 or higher
- pip package manager
1. Clone the repository

   ```bash
   git clone https://github.com/RoboticsData/score_lerobot_episodes.git
   cd score_lerobot_episodes
   ```

2. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

3. Set up API keys (optional)

   Only required if using VLM-based scoring with Gemini:

   ```bash
   export GOOGLE_API_KEY="your-api-key-here"
   ```

   Note: The Gemini API's free-tier rate limits are fairly restrictive; depending on episode length, you may need a paid tier. Check the Gemini API rate limits for more info.
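If you want to confirm the key works before a full scoring run, a quick sanity check with the `google-generativeai` client looks like the sketch below; the model name is an assumption, so substitute whichever Gemini model your quota covers.

```python
import os
import google.generativeai as genai

# Configure the client from the same environment variable set above.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# Smoke test: any cheap text-only call confirms the key is valid.
model = genai.GenerativeModel("gemini-1.5-flash")  # model choice is an assumption
response = model.generate_content("Reply with OK if you can read this.")
print(response.text)
```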
Score a dataset and save results:

```bash
python score_dataset.py \
  --repo_id lerobot/aloha_static_pro_pencil \
  --output ./output/lerobot/aloha_static_pro_pencil \
  --threshold 0.5
```

This will:

- Download and load the dataset from HuggingFace
- Score each episode across multiple quality dimensions
- Save scores to the output path
- Keep only episodes with an aggregate score >= 0.5
- Save the filtered dataset to the output directory
Command-line options:

- `--repo_id`: HuggingFace repository ID for the dataset (e.g., `username/dataset-name`)
- `--root`: Local path to dataset root (default: downloads from HuggingFace Hub)
- `--output`: Output directory for the filtered dataset (default: None, no filtering)
- `--threshold`: Minimum aggregate score to keep episodes (default: 0.5, range: 0.0-1.0)
- `--nominal`: Expected episode duration in seconds (used for runtime scoring)
- `--vision_type`: Vision scoring method, choices: `opencv` (default), `vlm_gemini`
- `--policy_name`: Policy type for training (default: `act`)
- `--overwrite`: Overwrite existing filtered dataset (default: True)
- `--overwrite_checkpoint`: Overwrite existing training checkpoints (default: False)
- `--train-baseline`: Train model on the unfiltered dataset (default: False)
- `--train-filtered`: Train model on the filtered dataset (default: False)
- `--plot`: Display score distribution plots in the terminal (default: False)
Score a dataset without filtering:

```bash
python score_dataset.py --repo_id username/my-robot-dataset
```

Score and filter with a custom threshold:

```bash
python score_dataset.py \
  --repo_id username/my-robot-dataset \
  --output ./output/username/my-robot-dataset \
  --threshold 0.6
```

Use Gemini-based VLM scoring:

```bash
export GOOGLE_API_KEY="your-key"
python score_dataset.py \
  --repo_id username/my-robot-dataset \
  --vision_type vlm_gemini \
  --output ./filtered_data
```

Score, filter, and train both baseline and filtered models:

```bash
python score_dataset.py \
  --repo_id username/my-robot-dataset \
  --output ./output/username/my-robot-dataset \
  --threshold 0.5 \
  --train-baseline True \
  --train-filtered True \
  --policy_name act
```

Score with a stricter threshold and plot score distributions:

```bash
python score_dataset.py \
  --repo_id username/my-robot-dataset \
  --threshold 0.7 \
  --plot True
```

Score a local dataset:

```bash
python score_dataset.py \
  --repo_id username/my-robot-dataset \
  --root /path/to/local/dataset \
  --output ./filtered_output
```

Scores are saved to `results/{repo_id}_scores.json`:

```json
[
{
"episode_id": 0,
"camera_type": "camera_0",
"video_path": "/path/to/video.mp4",
"aggregate_score": 0.752,
"per_attribute_scores": {
"visual_clarity": 0.85,
"smoothness": 0.78,
"collision": 0.92,
"runtime": 0.65
}
},
...
]
```
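Since the schema above is stable, downstream tooling can consume it directly. A minimal sketch that loads a scores file and lists episodes below a chosen threshold; exactly how the repo_id is encoded in the filename is an assumption here, so check your `results/` directory for the actual name.

```python
import json

# Path follows the results/{repo_id}_scores.json convention; the exact
# filename encoding of the repo_id is an assumption.
with open("results/username_my-robot-dataset_scores.json") as f:
    scores = json.load(f)

threshold = 0.5
for ep in (e for e in scores if e["aggregate_score"] < threshold):
    # Identify the weakest quality dimension for each failing episode.
    worst = min(ep["per_attribute_scores"], key=ep["per_attribute_scores"].get)
    print(f"episode {ep['episode_id']}: {ep['aggregate_score']:.3f} "
          f"(weakest dimension: {worst})")
```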
The console also displays a formatted table showing scores for each episode:

```
Episode scores (0–1 scale)
─────────────────────────────────────────────────────────────────
Episode Camera visual_clarity smoothness collision runtime Aggregate Status
0 camera_0 0.850 0.780 0.920 0.650 0.752 GOOD
1 camera_1 0.420 0.650 0.710 0.580 0.590 BAD
...
─────────────────────────────────────────────────────────────────
Average aggregate over 20 videos: 0.671
Percentage of episodes removed: 0.25, total: 5
```
When `--output` is set, a new filtered dataset is created containing only episodes scoring at or above the threshold, preserving the original LeRobot dataset structure.
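Because the LeRobot layout is preserved, the filtered copy should load like any other dataset. A sketch, assuming the standard `LeRobotDataset` loader from the `lerobot` package (the import path matches recent lerobot releases; yours may differ) and the output path from the examples above:

```python
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# root points at the filtered copy written by --output;
# repo_id is the original dataset ID.
dataset = LeRobotDataset(
    repo_id="username/my-robot-dataset",
    root="./output/username/my-robot-dataset",
)
print(f"{dataset.num_episodes} episodes survived filtering")
```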
```
score_lerobot_episodes/
├── score_dataset.py      # Main scoring script
├── data.py               # Dataset loading and filtering utilities
├── vlm.py                # Vision-Language Model interface (Gemini)
├── train.py              # Training pipeline integration
├── evaluation.py         # Evaluation utilities
├── corrupt.py            # Data corruption tools for robustness testing
├── ui.py                 # Streamlit web interface (if available)
├── requirements.txt      # Python dependencies
├── README.md             # This file
├── CONTRIBUTING.md       # Contribution guidelines
├── LICENSE               # Apache 2.0 license
├── results/              # Generated score JSON files
├── output/               # Filtered datasets
└── checkpoints/          # Training checkpoints
```
The toolkit integrates with LeRobot's training pipeline to compare baseline vs. filtered dataset performance.
1. Baseline Training: Train on the original unfiltered dataset

   ```bash
   python score_dataset.py \
     --repo_id username/dataset \
     --train-baseline True
   ```

2. Filtered Training: Train on the quality-filtered dataset

   ```bash
   python score_dataset.py \
     --repo_id username/dataset \
     --output ./filtered_data \
     --threshold 0.6 \
     --train-filtered True
   ```

3. Compare Both: Run both training pipelines in one command

   ```bash
   python score_dataset.py \
     --repo_id username/dataset \
     --output ./filtered_data \
     --train-baseline True \
     --train-filtered True
   ```
- Default policy: ACT (Action Chunking Transformer)
- Default steps: 10,000
- Batch size: 4
- Checkpoints saved to `./checkpoints/{job_name}/`
- WandB logging enabled by default

You can customize training parameters by modifying `train.py`.
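The internals of `train.py` aren't shown here, but the knobs correspond to the defaults listed above. Purely as an illustration of the values to look for (the names below are hypothetical, not the file's actual variables):

```python
# Hypothetical names; open train.py to find the real counterparts.
TRAINING_DEFAULTS = {
    "policy_name": "act",    # ACT: Action Chunking Transformer
    "steps": 10_000,         # total training steps
    "batch_size": 4,         # reduce if you hit out-of-memory errors
    "output_dir": "./checkpoints/{job_name}/",
    "wandb_enable": True,    # WandB logging on by default
}
```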
1. ModuleNotFoundError: No module named 'google.generativeai'
   - Solution: Install dependencies with `pip install -r requirements.txt`
   - If using VLM scoring, ensure `google-generativeai` is installed

2. API rate limit errors with Gemini
   - Solution: The free tier has restrictive limits. Consider:
     - Using `--vision_type opencv` instead
     - Upgrading to a paid Gemini API tier
     - Processing smaller batches

3. All episodes filtered out
   - Error: `ValueError: All episodes filtered out, decrease threshold to fix this`
   - Solution: Lower the `--threshold` value (e.g., from 0.5 to 0.3)

4. Dataset not found
   - Solution:
     - Verify the `--repo_id` is correct
     - Check internet connection for HuggingFace Hub access
     - Use `--root` to specify a local dataset path

5. Out of memory during training
   - Solution: Reduce `batch_size` in `train.py:44` or use a smaller model

6. Permission errors when overwriting
   - Solution: Use `--overwrite True` or manually delete the output directory
We welcome contributions! Please see CONTRIBUTING.md for guidelines on:
- Setting up a development environment
- Code style and conventions
- Submitting pull requests
- Reporting issues
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
LeRobot Episode Scoring Toolkit is distributed under the Apache 2.0 License. See LICENSE for more information.