mirror of
https://github.com/YuuKi-OS/yuuki-training.git
synced 2026-02-18 22:01:09 +00:00
Rename TRAINING.md to README.md
This commit is contained in:
166
README.md
Normal file
166
README.md
Normal file
@@ -0,0 +1,166 @@
|
||||
<div align="center">
|
||||
|
||||
# Yuuki Training Code
|
||||
|
||||
**Official training pipeline for Yuuki, an experimental small-scale language model for source code generation.**
|
||||
|
||||
[](https://huggingface.co/OpceanAI/Yuuki-82M)
|
||||
[](LICENSE)
|
||||
[](https://python.org)
|
||||
|
||||
</div>
|
||||
|
||||
---
|
||||
|
||||
## Abstract
|
||||
|
||||
This repository contains the official training implementation for **Yuuki**, a compact causal language model optimized for source code understanding and generation. The system is designed with an emphasis on simplicity, reproducibility, and accessibility across heterogeneous computing environments, including CPU-only systems, cloud notebooks (Colab, Kaggle), and resource-constrained platforms such as Termux on mobile devices.
|
||||
|
||||
---
|
||||
|
||||
## Model Specification
|
||||
|
||||
| Attribute | Description |
|
||||
|-----------|-------------|
|
||||
| **Architecture** | GPT-style autoregressive transformer |
|
||||
| **Base Model** | `distilgpt2` |
|
||||
| **Domain** | Source code (multi-language) |
|
||||
| **Training Corpus** | `bigcode/the-stack-smol-xl` |
|
||||
| **Parameter Count** | ~82M |
|
||||
| **Design Principles** | Minimal dependencies, transparent implementation, full reproducibility |
|
||||
|
||||
---
|
||||
|
||||
## Repository Structure
|
||||
|
||||
### Included Components
|
||||
|
||||
| File | Description |
|
||||
|------|-------------|
|
||||
| `train_yuuki.py` | Complete, self-contained training script |
|
||||
| `LICENSE` | Apache 2.0 License |
|
||||
|
||||
### Excluded Artifacts
|
||||
|
||||
The following components are intentionally omitted to maintain repository portability and encourage local reproducibility:
|
||||
|
||||
- Pre-trained model weights and checkpoints
|
||||
- Tokenized datasets and Arrow cache files
|
||||
- Training logs and metrics
|
||||
- Experimental or proprietary scripts
|
||||
- Auxiliary datasets from subsequent experiments
|
||||
|
||||
All artifacts should be generated locally by executing the provided training script.
|
||||
|
||||
---
|
||||
|
||||
## Configuration Parameters
|
||||
|
||||
Training behavior is controlled exclusively through environment variables, enabling seamless adaptation across diverse execution environments.
|
||||
|
||||
### Default Configuration
|
||||
|
||||
| Parameter | Default Value | Description |
|
||||
|-----------|---------------|-------------|
|
||||
| `MODEL_NAME` | `distilgpt2` | Pre-trained model identifier for initialization |
|
||||
| `DATASET_ID` | `bigcode/the-stack-smol-xl` | HuggingFace dataset identifier |
|
||||
| `SPLIT` | `train` | Dataset partition for training |
|
||||
| `OUTPUT_DIR` | `./yuuki_model` | Output directory for model artifacts |
|
||||
| `TOKENIZED_CACHE_DIR` | `./yuuki_model/tokenized_cache` | Cache location for tokenized sequences |
|
||||
| `MAX_LENGTH` | `256` | Maximum input sequence length |
|
||||
| `EPOCHS` | `2` | Number of training iterations |
|
||||
| `BATCH_SIZE` | `1` | Samples per gradient update |
|
||||
|
||||
### Implementation Notes
|
||||
|
||||
- **Sequence Length (`MAX_LENGTH=256`)**: Selected to optimize memory utilization and training throughput on constrained hardware.
|
||||
- **Batch Size (`BATCH_SIZE=1`)**: Configured for compatibility with low-memory execution environments.
|
||||
- **Tokenization Caching**: Optional but recommended for iterative training workflows.
|
||||
|
||||
---
|
||||
|
||||
## Execution
|
||||
|
||||
### Standard Invocation
|
||||
|
||||
```bash
|
||||
python train_yuuki.py
|
||||
```
|
||||
|
||||
### Custom Configuration Example
|
||||
|
||||
```bash
|
||||
MODEL_NAME=distilgpt2 \
|
||||
MAX_LENGTH=256 \
|
||||
EPOCHS=3 \
|
||||
BATCH_SIZE=2 \
|
||||
python train_yuuki.py
|
||||
```
|
||||
|
||||
The training script performs automatic hardware detection and configures CUDA acceleration when available.
|
||||
|
||||
---
|
||||
|
||||
## Design Rationale
|
||||
|
||||
Yuuki is not intended to compete with large-scale foundation models. The project objectives are:
|
||||
|
||||
| Principle | Description |
|
||||
|-----------|-------------|
|
||||
| **Interpretability** | Prioritizes readable, maintainable code over abstraction layers |
|
||||
| **Accessibility** | Executable without specialized hardware infrastructure |
|
||||
| **Transparency** | No hidden procedures or undocumented dependencies |
|
||||
| **Educational Utility** | Serves as a reference implementation for language model training |
|
||||
|
||||
---
|
||||
|
||||
## Pre-trained Model
|
||||
|
||||
The model trained using this pipeline is publicly available:
|
||||
|
||||
<div align="center">
|
||||
|
||||
**[Yuuki-82M on HuggingFace](https://huggingface.co/OpceanAI/Yuuki-82M)**
|
||||
|
||||
</div>
|
||||
|
||||
---
|
||||
|
||||
## Limitations and Disclaimer
|
||||
|
||||
This software is provided for research and educational purposes. The model may produce:
|
||||
|
||||
- Syntactically or semantically incorrect code
|
||||
- Incomplete or truncated outputs
|
||||
- Potentially unsafe or nonsensical suggestions
|
||||
|
||||
**This system is not suitable for production deployment.** Users assume full responsibility for any application of the generated outputs.
|
||||
|
||||
---
|
||||
|
||||
## License
|
||||
|
||||
This project is distributed under the **Apache License 2.0**. See the [LICENSE](LICENSE) file for complete terms.
|
||||
|
||||
Under this license, you are permitted to:
|
||||
|
||||
- Use, copy, and distribute the software
|
||||
- Modify and create derivative works
|
||||
- Use for commercial and non-commercial purposes
|
||||
|
||||
Subject to the conditions of attribution and license preservation as specified in the Apache 2.0 terms.
|
||||
|
||||
---
|
||||
|
||||
## Contact
|
||||
|
||||
For inquiries, collaboration proposals, or technical discussions regarding Yuuki, please submit an Issue or initiate a Discussion in this repository.
|
||||
|
||||
---
|
||||
|
||||
<div align="center">
|
||||
|
||||
**Developed by [OpceanAI](https://huggingface.co/OpceanAI)**
|
||||
|
||||
</div>
|
||||
|
||||
Reference in New Issue
Block a user