
# Yuuki Training Code

Official training pipeline for Yuuki, an experimental small-scale language model for source code generation.



## Abstract

This repository contains the official training implementation for Yuuki, a compact causal language model optimized for source code understanding and generation. The system is designed with an emphasis on simplicity, reproducibility, and accessibility across heterogeneous computing environments, including CPU-only systems, cloud notebooks (Colab, Kaggle), and resource-constrained platforms such as Termux on mobile devices.


## Model Specification

| Attribute | Description |
|-----------|-------------|
| Architecture | GPT-style autoregressive transformer |
| Base Model | `distilgpt2` |
| Domain | Source code (multi-language) |
| Training Corpus | `bigcode/the-stack-smol-xl` |
| Parameter Count | ~82M |
| Design Principles | Minimal dependencies, transparent implementation, full reproducibility |
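
As a quick sanity check on the parameter count above, the base model can be inspected directly. A minimal sketch, assuming the `transformers` library is installed:

```python
# Count distilgpt2's parameters; the total should land near the ~82M
# figure listed in the specification table.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("distilgpt2")
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")
```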

## Repository Structure

### Included Components

| File | Description |
|------|-------------|
| `train_yuuki.py` | Complete, self-contained training script |
| `LICENSE` | Apache 2.0 license |

### Excluded Artifacts

The following components are intentionally omitted to maintain repository portability and encourage local reproducibility:

- Pre-trained model weights and checkpoints
- Tokenized datasets and Arrow cache files
- Training logs and metrics
- Experimental or proprietary scripts
- Auxiliary datasets from subsequent experiments

All artifacts should be generated locally by executing the provided training script.


## Configuration Parameters

Training behavior is controlled exclusively through environment variables, enabling seamless adaptation across diverse execution environments.

### Default Configuration

| Parameter | Default Value | Description |
|-----------|---------------|-------------|
| `MODEL_NAME` | `distilgpt2` | Pre-trained model identifier for initialization |
| `DATASET_ID` | `bigcode/the-stack-smol-xl` | Hugging Face dataset identifier |
| `SPLIT` | `train` | Dataset partition used for training |
| `OUTPUT_DIR` | `./yuuki_model` | Output directory for model artifacts |
| `TOKENIZED_CACHE_DIR` | `./yuuki_model/tokenized_cache` | Cache location for tokenized sequences |
| `MAX_LENGTH` | `256` | Maximum input sequence length in tokens |
| `EPOCHS` | `2` | Number of training epochs |
| `BATCH_SIZE` | `1` | Samples per gradient update |
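
`train_yuuki.py` itself is the authoritative source for these defaults. As a rough sketch of how such environment-driven configuration is typically resolved in Python (the script's actual variable handling may differ):

```python
import os

# Illustrative only: resolve the table's defaults from the environment.
MODEL_NAME = os.environ.get("MODEL_NAME", "distilgpt2")
DATASET_ID = os.environ.get("DATASET_ID", "bigcode/the-stack-smol-xl")
SPLIT = os.environ.get("SPLIT", "train")
OUTPUT_DIR = os.environ.get("OUTPUT_DIR", "./yuuki_model")
TOKENIZED_CACHE_DIR = os.environ.get(
    "TOKENIZED_CACHE_DIR", os.path.join(OUTPUT_DIR, "tokenized_cache")
)
MAX_LENGTH = int(os.environ.get("MAX_LENGTH", "256"))
EPOCHS = int(os.environ.get("EPOCHS", "2"))
BATCH_SIZE = int(os.environ.get("BATCH_SIZE", "1"))
```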

### Implementation Notes

- **Sequence length** (`MAX_LENGTH=256`): chosen to balance memory use against training throughput on constrained hardware.
- **Batch size** (`BATCH_SIZE=1`): set for compatibility with low-memory execution environments; raise it where hardware allows.
- **Tokenization caching**: optional, but recommended for iterative training workflows, since it avoids re-tokenizing the corpus on every run (see the sketch after this list).
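
The caching pattern referenced above can be sketched as follows. This is illustrative rather than the script's internals, and it assumes the `datasets` library is installed and that the corpus stores its text in a `content` column, as the Stack datasets do:

```python
import os

from datasets import load_dataset, load_from_disk
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 defines no pad token

cache_dir = "./yuuki_model/tokenized_cache"
if os.path.isdir(cache_dir):
    tokenized = load_from_disk(cache_dir)  # reuse a previous run's work
else:
    raw = load_dataset("bigcode/the-stack-smol-xl", split="train")
    tokenized = raw.map(
        lambda batch: tokenizer(
            batch["content"], truncation=True, max_length=256
        ),
        batched=True,
        remove_columns=raw.column_names,
    )
    tokenized.save_to_disk(cache_dir)
```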

## Execution

### Standard Invocation

```bash
python train_yuuki.py
```

### Custom Configuration Example

```bash
MODEL_NAME=distilgpt2 \
MAX_LENGTH=256 \
EPOCHS=3 \
BATCH_SIZE=2 \
python train_yuuki.py
```

The training script performs automatic hardware detection and configures CUDA acceleration when available.
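
The detection itself usually amounts to a one-line check. A sketch of the typical pattern (the script's exact logic may differ):

```python
import torch

# Fall back to CPU when no CUDA device is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Training on: {device}")
```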


## Design Rationale

Yuuki is not intended to compete with large-scale foundation models. The project objectives are:

| Principle | Description |
|-----------|-------------|
| Interpretability | Prioritizes readable, maintainable code over abstraction layers |
| Accessibility | Executable without specialized hardware infrastructure |
| Transparency | No hidden procedures or undocumented dependencies |
| Educational Utility | Serves as a reference implementation for language model training |

## Pre-trained Model

The model trained using this pipeline is publicly available.
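
Once training has produced artifacts in `OUTPUT_DIR`, generation can be tried locally. A minimal sketch, assuming both model and tokenizer were saved to `./yuuki_model`; the prompt is arbitrary:

```python
from transformers import pipeline

# Load the locally trained artifacts and complete a code prompt.
generator = pipeline("text-generation", model="./yuuki_model")
out = generator("def fibonacci(n):", max_new_tokens=64)
print(out[0]["generated_text"])
```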


## Limitations and Disclaimer

This software is provided for research and educational purposes. The model may produce:

- Syntactically or semantically incorrect code
- Incomplete or truncated outputs
- Potentially unsafe or nonsensical suggestions

This system is not suitable for production deployment. Users assume full responsibility for any application of the generated outputs.


## License

This project is distributed under the Apache License 2.0. See the LICENSE file for complete terms.

Under this license, you are permitted to:

- Use, copy, and distribute the software
- Modify and create derivative works
- Use it for commercial and non-commercial purposes

These permissions are subject to the attribution and license-preservation conditions specified in the Apache 2.0 terms.


## Contact

For inquiries, collaboration proposals, or technical discussions regarding Yuuki, please open an Issue or start a Discussion in this repository.


Developed by OpceanAI