# Yuuki Training Code
**Official training pipeline for Yuuki, an experimental small-scale language model for source code generation.**
---
## Abstract
This repository contains the official training implementation for **Yuuki**, a compact causal language model optimized for source code understanding and generation. The system is designed with an emphasis on simplicity, reproducibility, and accessibility across heterogeneous computing environments, including CPU-only systems, cloud notebooks (Colab, Kaggle), and resource-constrained platforms such as Termux on mobile devices.
---
## Model Specification
| Attribute | Description |
|-----------|-------------|
| **Architecture** | GPT-style autoregressive transformer |
| **Base Model** | `distilgpt2` |
| **Domain** | Source code (multi-language) |
| **Training Corpus** | `bigcode/the-stack-smol-xl` |
| **Parameter Count** | ~82M |
| **Design Principles** | Minimal dependencies, transparent implementation, full reproducibility |
---
## Repository Structure
### Included Components
| File | Description |
|------|-------------|
| `train_yuuki.py` | Complete, self-contained training script |
| `LICENSE` | Apache 2.0 License |
### Excluded Artifacts
The following components are intentionally omitted to maintain repository portability and encourage local reproducibility:
- Pre-trained model weights and checkpoints
- Tokenized datasets and Arrow cache files
- Training logs and metrics
- Experimental or proprietary scripts
- Auxiliary datasets from subsequent experiments
All artifacts should be generated locally by executing the provided training script.
---
## Configuration Parameters
Training behavior is controlled entirely through environment variables, so the same script runs unchanged on CPU-only machines, cloud notebooks, and Termux.
### Default Configuration
| Parameter | Default Value | Description |
|-----------|---------------|-------------|
| `MODEL_NAME` | `distilgpt2` | Pre-trained model identifier for initialization |
| `DATASET_ID` | `bigcode/the-stack-smol-xl` | HuggingFace dataset identifier |
| `SPLIT` | `train` | Dataset partition for training |
| `OUTPUT_DIR` | `./yuuki_model` | Output directory for model artifacts |
| `TOKENIZED_CACHE_DIR` | `./yuuki_model/tokenized_cache` | Cache location for tokenized sequences |
| `MAX_LENGTH` | `256` | Maximum input sequence length |
| `EPOCHS` | `2` | Number of full passes over the training split |
| `BATCH_SIZE` | `1` | Samples per training batch |
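The table above can be read as a set of environment lookups with fallbacks. The following is a minimal sketch of that pattern, not the script's actual parsing code: the `TrainConfig` dataclass and the `load_config` helper are hypothetical names introduced here for illustration, and `train_yuuki.py` may structure this differently.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the documented settings."""
    model_name: str
    dataset_id: str
    split: str
    output_dir: str
    tokenized_cache_dir: str
    max_length: int
    epochs: int
    batch_size: int


def load_config(env=None) -> TrainConfig:
    """Read each setting from the environment, falling back to the documented default."""
    env = os.environ if env is None else env
    return TrainConfig(
        model_name=env.get("MODEL_NAME", "distilgpt2"),
        dataset_id=env.get("DATASET_ID", "bigcode/the-stack-smol-xl"),
        split=env.get("SPLIT", "train"),
        output_dir=env.get("OUTPUT_DIR", "./yuuki_model"),
        tokenized_cache_dir=env.get(
            "TOKENIZED_CACHE_DIR", "./yuuki_model/tokenized_cache"
        ),
        # Environment values are always strings, so numeric settings are converted.
        max_length=int(env.get("MAX_LENGTH", "256")),
        epochs=int(env.get("EPOCHS", "2")),
        batch_size=int(env.get("BATCH_SIZE", "1")),
    )
```

Calling `load_config()` with no arguments reads the real process environment; passing a plain dictionary makes the lookup easy to exercise in isolation.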
### Implementation Notes
- **Sequence Length (`MAX_LENGTH=256`)**: Selected to optimize memory utilization and training throughput on constrained hardware.
- **Batch Size (`BATCH_SIZE=1`)**: Configured for compatibility with low-memory execution environments.
- **Tokenization Caching**: Optional, but recommended when the script is run repeatedly, since cached tokenized sequences do not need to be recomputed.
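To make the effect of the two defaults concrete, the arithmetic below estimates how many batches one epoch takes and how many tokens a single batch can hold at most. It is a rough sketch under stated assumptions: the last partial batch is kept, no gradient accumulation is applied, and the sample count `NUM_SAMPLES` is a hypothetical figure, not the size of the actual corpus.

```python
import math


def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Batches needed to see every sample once (last partial batch kept)."""
    return math.ceil(num_samples / batch_size)


def max_tokens_per_step(batch_size: int, max_length: int) -> int:
    """Upper bound on tokens in one batch when sequences are capped at max_length."""
    return batch_size * max_length


# Hypothetical dataset size, used only to make the arithmetic concrete.
NUM_SAMPLES = 10_000

default_steps = steps_per_epoch(NUM_SAMPLES, batch_size=1)
default_tokens = max_tokens_per_step(batch_size=1, max_length=256)
print(default_steps, default_tokens)  # 10000 256
```

Raising `BATCH_SIZE` reduces the number of steps per epoch but raises the per-step token bound, and with it the memory needed, in proportion.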
---
## Execution
### Standard Invocation
```bash
python train_yuuki.py
```
### Custom Configuration Example
```bash
MODEL_NAME=distilgpt2 \
MAX_LENGTH=256 \
EPOCHS=3 \
BATCH_SIZE=2 \
python train_yuuki.py
```
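In notebook environments such as Colab or Kaggle, the same overrides can be passed from Python instead of a shell. The sketch below is one possible way to do that, assuming `train_yuuki.py` is in the working directory; `build_env` and `run_training` are hypothetical helper names, not part of the repository.

```python
import os
import subprocess
import sys


def build_env(overrides: dict) -> dict:
    """Copy the current environment and apply overrides as strings."""
    env = dict(os.environ)
    env.update({key: str(value) for key, value in overrides.items()})
    return env


def run_training(overrides: dict, script: str = "train_yuuki.py"):
    """Launch the training script with the same interpreter and the merged environment."""
    return subprocess.run(
        [sys.executable, script],
        env=build_env(overrides),
        check=True,  # raise CalledProcessError if the script exits with an error
    )
```

For example, `run_training({"EPOCHS": 3, "BATCH_SIZE": 2})` would mirror the shell invocation above without modifying the notebook's own environment.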
The training script detects the available hardware automatically and uses CUDA acceleration when a compatible GPU is present; otherwise it runs on the CPU.
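A common way to implement this kind of detection with PyTorch is sketched below. It illustrates the general pattern only and is not taken from `train_yuuki.py`; the `pick_device` helper is a hypothetical name, and the import guard is added here so the snippet also runs where PyTorch is not installed.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to query, so report CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Training would run on: {device}")
```

The returned string can be passed directly to `model.to(device)` in a PyTorch training loop.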
---
## Design Rationale
Yuuki is not intended to compete with large-scale foundation models. The project objectives are:
| Principle | Description |
|-----------|-------------|
| **Interpretability** | Prioritizes readable, maintainable code over abstraction layers |
| **Accessibility** | Executable without specialized hardware infrastructure |
| **Transparency** | No hidden procedures or undocumented dependencies |
| **Educational Utility** | Serves as a reference implementation for language model training |
---
## Pre-trained Model
The model trained using this pipeline is publicly available:
**[Yuuki-82M on HuggingFace](https://huggingface.co/OpceanAI/Yuuki-82M)**
---
## Limitations and Disclaimer
This software is provided for research and educational purposes. The model may produce:
- Syntactically or semantically incorrect code
- Incomplete or truncated outputs
- Potentially unsafe or nonsensical suggestions
**This system is not suitable for production deployment.** Users assume full responsibility for any application of the generated outputs.
---
## License
This project is distributed under the **Apache License 2.0**. See the [LICENSE](LICENSE) file for complete terms.
Under this license, you are permitted to:
- Use, copy, and distribute the software
- Modify and create derivative works
- Use for commercial and non-commercial purposes
These permissions are subject to the attribution and license-preservation conditions specified in the Apache 2.0 terms.
---
## Contact
For inquiries, collaboration proposals, or technical discussions regarding Yuuki, please submit an Issue or initiate a Discussion in this repository.
---
**Developed by [OpceanAI](https://huggingface.co/OpceanAI)**