
# Yuuki Training Code

Official training pipeline for Yuuki, an experimental small-scale language model for source code generation.



## Abstract

This repository contains the official training implementation for Yuuki, a compact causal language model optimized for source code understanding and generation. The system is designed with an emphasis on simplicity, reproducibility, and accessibility across heterogeneous computing environments, including CPU-only systems, cloud notebooks (Colab, Kaggle), and resource-constrained platforms such as Termux on mobile devices.


## Model Specification

| Attribute | Description |
|-----------|-------------|
| Architecture | GPT-style autoregressive transformer |
| Base Model | `distilgpt2` |
| Domain | Source code (multi-language) |
| Training Corpus | `bigcode/the-stack-smol-xl` |
| Parameter Count | ~82M |
| Design Principles | Minimal dependencies, transparent implementation, full reproducibility |
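
As a quick sanity check on the parameter count above, the base model can be inspected directly. A minimal sketch, assuming the `transformers` library is installed:

```python
# Count distilgpt2's parameters; the total should land near the ~82M
# figure listed in the specification table.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("distilgpt2")
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")
```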

## Repository Structure

### Included Components

| File | Description |
|------|-------------|
| `train_yuuki.py` | Complete, self-contained training script |
| `LICENSE` | Apache 2.0 license |

### Excluded Artifacts

The following components are intentionally omitted to maintain repository portability and encourage local reproducibility:

- Pre-trained model weights and checkpoints
- Tokenized datasets and Arrow cache files
- Training logs and metrics
- Experimental or proprietary scripts
- Auxiliary datasets from subsequent experiments

All artifacts should be generated locally by executing the provided training script.


## Configuration Parameters

Training behavior is controlled exclusively through environment variables, enabling seamless adaptation across diverse execution environments.

### Default Configuration

| Parameter | Default Value | Description |
|-----------|---------------|-------------|
| `MODEL_NAME` | `distilgpt2` | Pre-trained model identifier for initialization |
| `DATASET_ID` | `bigcode/the-stack-smol-xl` | Hugging Face dataset identifier |
| `SPLIT` | `train` | Dataset partition used for training |
| `OUTPUT_DIR` | `./yuuki_model` | Output directory for model artifacts |
| `TOKENIZED_CACHE_DIR` | `./yuuki_model/tokenized_cache` | Cache location for tokenized sequences |
| `MAX_LENGTH` | `256` | Maximum input sequence length in tokens |
| `EPOCHS` | `2` | Number of training epochs |
| `BATCH_SIZE` | `1` | Samples per gradient update |
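
`train_yuuki.py` itself is the authoritative source for these defaults. As a rough sketch of how such environment-driven configuration is typically resolved in Python (the script's actual variable handling may differ):

```python
import os

# Illustrative only: resolve the table's defaults from the environment.
MODEL_NAME = os.environ.get("MODEL_NAME", "distilgpt2")
DATASET_ID = os.environ.get("DATASET_ID", "bigcode/the-stack-smol-xl")
SPLIT = os.environ.get("SPLIT", "train")
OUTPUT_DIR = os.environ.get("OUTPUT_DIR", "./yuuki_model")
TOKENIZED_CACHE_DIR = os.environ.get(
    "TOKENIZED_CACHE_DIR", os.path.join(OUTPUT_DIR, "tokenized_cache")
)
MAX_LENGTH = int(os.environ.get("MAX_LENGTH", "256"))
EPOCHS = int(os.environ.get("EPOCHS", "2"))
BATCH_SIZE = int(os.environ.get("BATCH_SIZE", "1"))
```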

### Implementation Notes

- **Sequence length** (`MAX_LENGTH=256`): chosen to balance memory use against training throughput on constrained hardware.
- **Batch size** (`BATCH_SIZE=1`): set for compatibility with low-memory execution environments; raise it where hardware allows.
- **Tokenization caching**: optional, but recommended for iterative training workflows, since it avoids re-tokenizing the corpus on every run (see the sketch after this list).
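
The caching pattern referenced above can be sketched as follows. This is illustrative rather than the script's internals, and it assumes the `datasets` library is installed and that the corpus stores its text in a `content` column, as the Stack datasets do:

```python
import os

from datasets import load_dataset, load_from_disk
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 defines no pad token

cache_dir = "./yuuki_model/tokenized_cache"
if os.path.isdir(cache_dir):
    tokenized = load_from_disk(cache_dir)  # reuse a previous run's work
else:
    raw = load_dataset("bigcode/the-stack-smol-xl", split="train")
    tokenized = raw.map(
        lambda batch: tokenizer(
            batch["content"], truncation=True, max_length=256
        ),
        batched=True,
        remove_columns=raw.column_names,
    )
    tokenized.save_to_disk(cache_dir)
```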

## Execution

### Standard Invocation

```bash
python train_yuuki.py
```

### Custom Configuration Example

```bash
MODEL_NAME=distilgpt2 \
MAX_LENGTH=256 \
EPOCHS=3 \
BATCH_SIZE=2 \
python train_yuuki.py
```

The training script performs automatic hardware detection and configures CUDA acceleration when available.
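
The detection itself usually amounts to a one-line check. A sketch of the typical pattern (the script's exact logic may differ):

```python
import torch

# Fall back to CPU when no CUDA device is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Training on: {device}")
```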


## Design Rationale

Yuuki is not intended to compete with large-scale foundation models. The project objectives are:

| Principle | Description |
|-----------|-------------|
| Interpretability | Prioritizes readable, maintainable code over abstraction layers |
| Accessibility | Executable without specialized hardware infrastructure |
| Transparency | No hidden procedures or undocumented dependencies |
| Educational Utility | Serves as a reference implementation for language model training |

## Pre-trained Model

The model trained using this pipeline is publicly available.
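
Once training has produced artifacts in `OUTPUT_DIR`, generation can be tried locally. A minimal sketch, assuming both model and tokenizer were saved to `./yuuki_model`; the prompt is arbitrary:

```python
from transformers import pipeline

# Load the locally trained artifacts and complete a code prompt.
generator = pipeline("text-generation", model="./yuuki_model")
out = generator("def fibonacci(n):", max_new_tokens=64)
print(out[0]["generated_text"])
```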


## Limitations and Disclaimer

This software is provided for research and educational purposes. The model may produce:

- Syntactically or semantically incorrect code
- Incomplete or truncated outputs
- Potentially unsafe or nonsensical suggestions

This system is not suitable for production deployment. Users assume full responsibility for any application of the generated outputs.


## License

This project is distributed under the Apache License 2.0. See the LICENSE file for complete terms.

Under this license, you are permitted to:

- Use, copy, and distribute the software
- Modify and create derivative works
- Use it for commercial and non-commercial purposes

These permissions are subject to the attribution and license-preservation conditions specified in the Apache 2.0 terms.


## Contact

For inquiries, collaboration proposals, or technical discussions regarding Yuuki, please open an Issue or start a Discussion in this repository.


Developed by OpceanAI