GitHub - neospe/autofit2: Automated end-to-end data preprocessing, model training, and evaluation pipeline
Few-shot text classification. Massively multilingual (50+ languages), fully automated pipeline built on `setfit` and SBERT embeddings.
Key Features
- **Few-Shot Learning:** High precision (95–99%) with a few dozen labeled examples.
- **Multilingual Support:** Pretrained models for 20 languages; evaluation corpora for 50+. Scalable to 100+ via Common Crawl.
- **Automated Pipeline:** End-to-end preprocessing, fine-tuning, evaluation, and deployment from a single JSON config.
- **Reproducibility & Transparency:** JSON-based configuration, model card generation, and CO₂ emission tracking.
Usage
**1. Prepare Data**
Use `dataload` or implement a custom loader that provides labeled examples.
**2. Configure**
Create `myproject.json` specifying dataset paths, model settings, and output directories. A single file can hold blocks for multiple tasks and languages.
**3. Run**
The pipeline supports resumable execution.

```shell
python train.py myproject.json
```
**4. Output**
- Deployable model archive.
- Generated model card (training details, intended use, performance metrics, bias evaluation).
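
Step 1 allows a custom loader in place of `dataload`. The sketch below is an illustration only, not the project's loader API: the function name `tsv_loader`, its `files` parameter, and the two-column `text<TAB>label` layout are all assumptions. The one detail taken from this README is that each record is a dictionary with the keys `text` and `label`.

```python
import csv
from pathlib import Path


def tsv_loader(files):
    """Hypothetical custom loader: read tab-separated files into records.

    Assumes every row is `text<TAB>label`; adapt the column handling to
    the layout of your own data.
    """
    records = []
    for name in files:
        with Path(name).open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
            for row in reader:
                if len(row) < 2:
                    continue  # skip blank or malformed rows
                records.append({"text": row[0], "label": row[1]})
    return records
```

Labels are returned as strings here; convert them if your classifier expects integers.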
Configuration
`myproject.json` defines the training parameters. Its structure depends on the target type: **Base Models** (`all`) or **Custom Models** (`custom`).
General Structure

```jsonc
{
  "<task-key>": {
    "<language-key>": {
      "base": {
        "model file": "<path>",          // Relative path, no trailing slash (e.g. "models-in/all-MiniLM-L6-v2")
        "model type": "<string>",        // e.g., "bert"
        "pretraining task": "<string>",  // e.g., "sentence similarity"
        "downstream task": "<string>"    // e.g., "binary text classification"
      },
      "targets": {
        "<id-key>": { ... }              // See Target Options below
      }
    }
  }
}
```
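
To make the nesting concrete, here is a minimal sketch of how a script might walk a config of this shape, task by task and language by language. The helper name `iter_targets` and the miniature config (including the placeholder loader strings `a()` and `b()`) are hypothetical; how `train.py` actually traverses the file is not documented here.

```python
import json


def iter_targets(config):
    """Yield (task, language, target_key, base, target) for every target."""
    for task, languages in config.items():
        for language, block in languages.items():
            base = block["base"]
            for target_key, target in block.get("targets", {}).items():
                yield task, language, target_key, base, target


# Hypothetical miniature config that follows the task -> language -> base/targets layout.
example = json.loads("""
{
  "mod": {
    "el": {
      "base": {"model file": "models-in/all-MiniLM-L6-v2", "model type": "bert"},
      "targets": {
        "benchmark 1": {"loader": ["a()", "b()"]},
        "all": {"loader": ["a()"]}
      }
    }
  }
}
""")

for task, language, key, base, target in iter_targets(example):
    print(task, language, key, base["model file"], len(target["loader"]))
```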
Target Types
The `"targets"` dictionary supports three specific key types:
1. **`all`** (Base Model)
   - Generates a full set of artifacts: model folder, archive, and card.
   - **Model ID:** Derived from the config filename (`{config_name}-{task}-{lang}`). The config filename must be stable.
2. **`custom`** (Custom Model)
   - Generates a full set of artifacts: model folder, archive, and card.
   - **Model ID:** Can be auto-generated as a 14–16 character lowercase alphanumeric string.
3. **`benchmark 1..N`** (Benchmarking Only)
   - Does **not** generate model artifacts.
   - Outputs only score logs.
   - Must be used in conjunction with an `all` target to produce output.
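
The naming rules above can be sketched in a few lines. This is an illustration under stated assumptions, not the project's code: the function names are hypothetical, `custom` is treated as a literal key, `{config_name}` is assumed to be the config filename without its extension, and the random generator is only one possible way to produce a 14–16 character lowercase alphanumeric ID.

```python
import re
import secrets
import string
from pathlib import Path

# Assumed key pattern for benchmark targets: "benchmark 1", "benchmark 2", ...
BENCHMARK_KEY = re.compile(r"benchmark [1-9]\d*")


def target_kind(key):
    """Classify a target key as 'all', 'custom', or 'benchmark'."""
    if key in ("all", "custom"):
        return key
    if BENCHMARK_KEY.fullmatch(key):
        return "benchmark"
    raise ValueError(f"unsupported target key: {key!r}")


def base_model_id(config_path, task, lang):
    """Build `{config_name}-{task}-{lang}`, assuming config_name is the file stem."""
    return f"{Path(config_path).stem}-{task}-{lang}"


def random_model_id(length=16):
    """Generate a lowercase alphanumeric ID of 14 to 16 characters."""
    if not 14 <= length <= 16:
        raise ValueError("length must be between 14 and 16")
    alphabet = string.ascii_lowercase + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))
```

Under these assumptions, `base_model_id("myproject.json", "mod", "el")` yields `myproject-mod-el`, which shows why renaming the config file would change the ID of an `all` model.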
Target Options
Each entry in the `"targets"` dictionary supports the following keys:

| Key | Type | Description |
| --- | --- | --- |
| `description` | `string` | Free-form description of the target. |
| `link` | `string` | URL to source data or documentation. |
| `train embedding` | `bool` | Set to `true` to fine-tune embeddings during training. |
| `base clf` | `string` | ID string pointing to a `.joblib` file located in `BASE_PATH`. Must match exactly. |
| `sample ratio` | `float` | Random sample of total data for full training (e.g., `0.5` = 50%). |
| `embedding sample ratio` | `float` | Random sample of data used **only** for embedding fine-tuning (e.g., `0.1` = 10%). |
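
As a rough illustration of what the two ratio options mean, the sketch below draws random subsets of the stated proportions. The helper `sample_records`, the fixed seed, the default ratio of `1.0`, and drawing both subsets independently from the full data are assumptions made for the example; the pipeline's actual sampling procedure may differ.

```python
import random


def sample_records(records, ratio, seed=0):
    """Return a random subset holding roughly `ratio` of the records."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    if not records:
        return []
    count = max(1, round(len(records) * ratio))
    return random.Random(seed).sample(records, count)


# Hypothetical target entry and toy data set.
target = {"sample ratio": 0.5, "embedding sample ratio": 0.1}
data = [{"text": f"example {i}", "label": i % 2} for i in range(200)]

train_subset = sample_records(data, target.get("sample ratio", 1.0))
embedding_subset = sample_records(data, target.get("embedding sample ratio", 1.0))
print(len(train_subset), len(embedding_subset))
```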
Loaders
The `"loader"` field defines how data is ingested and transformed. It expects a list of commands (functions or transformations):

```json
"loader": ["command_1", "command_2"]
```
- **Command Definition:** Each command must return a list of dictionaries with the keys `text` and `label`. A command can be a raw loader function call or a wrapped transformation (e.g., a list comprehension or a lambda).
- **Data Splitting Logic:**
  - **If there are exactly 2 commands AND the target is not `all`:**
    - Command 1 → Training Data
    - Command 2 → Evaluation Data
  - **Else if the target is `all`:**
    - All commands are concatenated into a single dataset.
    - Split: **100/100** (no split; the entire set is used for training).
  - **Else (other targets, e.g., `custom` or a benchmark with 1 command):**
    - All commands are concatenated into a single dataset.
    - Split: **70/30** (Train/Test).
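
The three splitting cases can be summarized in a short function. This is a sketch of the rules as listed, not the pipeline's implementation: the name `split_datasets`, the shuffle with a fixed seed before the 70/30 cut, and returning an empty evaluation set for `all` are assumptions.

```python
import random


def split_datasets(target_key, datasets, test_fraction=0.3, seed=0):
    """Return (train, test) for one target.

    `datasets` holds one list of records per loader command, in order.
    """
    # Case 1: two commands on a non-`all` target are used as given.
    if len(datasets) == 2 and target_key != "all":
        train, test = datasets
        return list(train), list(test)

    # Cases 2 and 3 both start by concatenating every command's output.
    merged = [record for dataset in datasets for record in dataset]

    # Case 2: `all` trains on everything; nothing is held out (assumption).
    if target_key == "all":
        return merged, []

    # Case 3: shuffle, then cut 70/30 into train and test.
    shuffled = merged[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]
```

Note that a `custom` target with exactly two commands falls under the first case, so the 70/30 split only applies when a non-`all` target has one command or more than two.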
Configuration Example

```json
{
  "mod": {
    "el": {
      "base": {
        "model file": "models-in/paraphrase-multilingual-MiniLM-L12-v2",
        "model type": "bert",
        "pretraining task": "sentence similarity",
        "downstream task": "binary text classification"
      },
      "targets": {
        "benchmark 1": {
          "description": "Pitenis et al. - Offensive Language Identification in Greek",
          "link": "https://arxiv.org/abs/2003.07459",
          "loader": [
            "el_offense20(files=['offenseval2020-greek/offenseval-gr-training-v1/offenseval-gr-training-v1.tsv'])",
            "el_offense20(files=['offenseval2020-greek/offenseval-gr-testsetv1/offenseval-gr-test-v1-combined.tsv'])"
          ]
        },
        "all": {
          "loader": [
            "el_offense20()"
          ]
        }
      }
    }
  }
}
```
Breakdown: Finetuning a Sentence Transformer
To fine-tune a base model for a specific task and language, define a config block like the one below. This example sets up a text moderation (mod) pipeline for Greek (el) using a multilingual sentence transformer.
**Base Model Setup**

```json
"base": {
  "model file": "models-in/paraphrase-multilingual-MiniLM-L12-v2",
  "model type": "bert",
  "pretraining task": "sentence similarity",
  "downstream task": "binary text classification"
}
```
- Model file: Path to the pretrained transformer.
- Model type: Architecture type (e.g., BERT).
- Pretraining task: Original task the model was trained on.
- Downstream task: Task you're adapting it to (e.g., moderation, sentiment analysis).
**Targets**
You can specify multiple finetuning targets. Each target defines a dataset and training strategy.
1. `benchmark 1`

```json
"benchmark 1": {
  "description": "Pitenis et al. - Offensive Language Identification in Greek",
  "link": "https://arxiv.org/abs/2003.07459",
  "loader": [
    "el_offense20(files=['offenseval2020-greek/offenseval-gr-training-v1/offenseval-gr-training-v1.tsv'])",
    "el_offense20(files=['offenseval2020-greek/offenseval-gr-testsetv1/offenseval-gr-test-v1-combined.tsv'])"
  ]
}
```
- Uses an explicit train/test split: the first loader command supplies the training data, the second the evaluation data.
- Based on a published benchmark dataset.
2. `all`

```json
"all": {
  "loader": ["el_offense20()"]
}
```
- Uses the full dataset for training.
- No explicit evaluation; this target is for production-grade fine-tuning.