
GitHub - neospe/autofit2: Automated end-to-end data preprocessing, model training, and evaluation pipeline

Few-shot text classification. Massively multilingual (50+ languages), fully automated pipeline built on `setfit` and SBERT embeddings.

Key Features


- **Few-Shot Learning:** High precision (95–99%) with a few dozen labeled examples.

- **Multilingual Support:** Pretrained models for 20 languages; evaluation corpora for 50+. Scalable to 100+ via Common Crawl.

- **Automated Pipeline:** End-to-end preprocessing, fine-tuning, evaluation, and deployment from a single JSON config.

- **Reproducibility & Transparency:** JSON-based configuration, model card generation, and CO₂ emission tracking.

Usage

**1. Prepare Data** Use `dataload` or implement a custom loader that provides labeled examples.

**2. Configure** Create `myproject.json` specifying dataset paths, model settings, and output directories. Supports multi-language/task blocks.

**3. Run**

The pipeline supports resumable execution.

    python train.py myproject.json

**4. Output**

- Deployable model archive.

- Generated model card (training details, intended use, performance metrics, bias evaluation).

Configuration

`myproject.json` defines the training parameters. Its structure depends on the target type: **Base Models** (`all`) or **Custom Models** (`custom`).

General Structure


    {
      "<task-key>": {
        "<language-key>": {
          "base": {
            "model file": "<path>",            // Relative path, no trailing slash (e.g. "models-in/all-MiniLM-L6-v2")
            "model type": "<string>",          // e.g., "bert"
            "pretraining task": "<string>",    // e.g., "sentence similarity"
            "downstream task": "<string>"      // e.g., "binary text classification"
          },
          "targets": {
            "<id-key>": { ... }                // See Target Options below
          }
        }
      }
    }
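The nesting above (task, then language, then targets) can be walked with a few lines of Python. This is only a minimal sketch: `iter_targets` and the sample config are hypothetical and not part of autofit2. Note that Python's standard `json` module rejects `//` comments, so the sample below omits them.

```python
import json
from typing import Iterator, Tuple


def iter_targets(config: dict) -> Iterator[Tuple[str, str, str, dict]]:
    """Yield (task, language, target_key, target_options) for every target."""
    for task, languages in config.items():
        for lang, block in languages.items():
            for target_key, options in block.get("targets", {}).items():
                yield task, lang, target_key, options


# Hypothetical config that follows the structure described above.
example = json.loads("""
{
  "mod": {
    "el": {
      "base": {
        "model file": "models-in/all-MiniLM-L6-v2",
        "model type": "bert",
        "pretraining task": "sentence similarity",
        "downstream task": "binary text classification"
      },
      "targets": {
        "all": {"description": "placeholder target"}
      }
    }
  }
}
""")

for task, lang, key, options in iter_targets(example):
    print(task, lang, key, sorted(options))  # prints: mod el all ['description']
```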

Target Types

The `"targets"` dictionary supports three specific key types:

1. **`all`** (Base Model)

- Generates a full set of artifacts: model folder, archive, and card.

- **Model ID:** Derived from the config filename (`{config_name}-{task}-{lang}`). The config filename must be stable.

2. **`custom`** (Custom Model)

- Generates a full set of artifacts: model folder, archive, and card.

- **Model ID:** Can be auto-generated as a 14–16 character lowercase alphanumeric string.

3. **`benchmark 1..N`** (Benchmarking Only)

- Does **not** generate model artifacts.

- Outputs only score logs.

- Must be used in conjunction with an `all` target to produce output.
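The ID rules for the three target types can be sketched as follows. All function names here are hypothetical, and two details are assumptions rather than documented behavior: `{config_name}` is taken to be the config filename without its extension, and the random ID is drawn with Python's `secrets` module.

```python
import secrets
import string
from pathlib import Path
from typing import List, Optional

ALPHABET = string.ascii_lowercase + string.digits


def random_model_id() -> str:
    """Return a 14-16 character lowercase alphanumeric ID."""
    length = 14 + secrets.randbelow(3)  # 14, 15, or 16
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def model_id(config_path: str, task: str, lang: str, target_key: str) -> Optional[str]:
    """Pick a model ID for a target; benchmark targets produce no model."""
    if target_key == "all":
        # Assumption: {config_name} is the filename stem, e.g. "myproject".
        return f"{Path(config_path).stem}-{task}-{lang}"
    if target_key == "custom":
        return random_model_id()
    if target_key.startswith("benchmark "):
        return None
    raise ValueError(f"unknown target key: {target_key!r}")


def benchmarks_without_all(targets: dict) -> List[str]:
    """List benchmark keys in a targets dict that has no `all` entry."""
    if "all" in targets:
        return []
    return [key for key in targets if key.startswith("benchmark ")]


print(model_id("myproject.json", "mod", "el", "all"))  # prints: myproject-mod-el
```

Because the `all` ID depends on the filename, renaming the config file would change the ID, which is why the filename needs to stay stable.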

Target Options

Each entry in the `"targets"` dictionary supports the following keys:

| Key | Type | Description |
| --- | --- | --- |
| `description` | `string` | Free-form description of the target. |
| `link` | `string` | URL to source data or documentation. |
| `train embedding` | `bool` | Set to `true` to fine-tune embeddings during training. |
| `base clf` | `string` | ID string pointing to a `.joblib` file located in `BASE_PATH`. Must match exactly. |
| `sample ratio` | `float` | Random sample of total data for full training (e.g., `0.5` = 50%). |
| `embedding sample ratio` | `float` | Random sample of data used **only** for embedding fine-tuning (e.g., `0.1` = 10%). |
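To illustrate what the two ratio options mean, here is a minimal sketch of ratio-based random sampling. `sample_by_ratio` is a hypothetical helper, not the pipeline's own code. Whether autofit2 draws the embedding sample from the full data or from the already sampled training data is not stated; the sketch assumes the full data.

```python
import random
from typing import List, Optional


def sample_by_ratio(records: List[dict], ratio: Optional[float], seed: int = 0) -> List[dict]:
    """Return a random subset holding `ratio` of the records (all of them if ratio is None)."""
    if ratio is None:
        return list(records)
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    if not records:
        return []
    k = max(1, round(len(records) * ratio))  # keep at least one record
    return random.Random(seed).sample(records, k)


records = [{"text": f"example {i}", "label": i % 2} for i in range(10)]
target = {"sample ratio": 0.5, "embedding sample ratio": 0.1}

training = sample_by_ratio(records, target.get("sample ratio"))
embedding = sample_by_ratio(records, target.get("embedding sample ratio"))
print(len(training), len(embedding))  # prints: 5 1
```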

Loaders

The `"loader"` field defines how data is ingested and transformed. It expects a list of commands (functions or transformations):

    "loader": ["command_1", "command_2"]

- **Command Definition:** Each command must return a list of dictionaries with keys `text` and `label`. Commands can be raw loader functions or wrapped transformations (e.g., list comprehensions, lambdas).

- **Data Splitting Logic:**

  - **If 2 commands AND target != `all`:**

    - Command 1 → Training Data

    - Command 2 → Evaluation Data

  - **Else if target = `all`:**

    - All commands are concatenated into a single dataset.

    - Split: **100/100** (no split; the entire set is used for training).

  - **Else (other targets, e.g., `custom` or benchmarks with 1 command):**

    - All commands are concatenated into a single dataset.

    - Split: **70/30** (Train/Test).
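The splitting rules above can be sketched in Python as follows. `split_data` is a hypothetical function, not the pipeline's implementation. Two points are assumptions: the 70/30 split is done by a seeded shuffle (the README does not say whether it is shuffled or stratified), and for `all` the evaluation slot is left empty because the whole set goes to training.

```python
import random
from typing import List, Tuple

Records = List[dict]  # each dict has the keys "text" and "label"


def split_data(target_key: str, loaded: List[Records], seed: int = 0) -> Tuple[Records, Records]:
    """Split loader output into (train, eval); `loaded` holds one list per loader command."""
    # Two commands on a non-`all` target: first is train, second is eval.
    if len(loaded) == 2 and target_key != "all":
        return list(loaded[0]), list(loaded[1])

    merged = [row for part in loaded for row in part]

    # `all`: no split, everything is used for training (eval left empty by assumption).
    if target_key == "all":
        return merged, []

    # Any other target: 70/30 train/test split (seeded shuffle is an assumption).
    shuffled = merged[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10
    return shuffled[:cut], shuffled[cut:]


rows = [{"text": f"example {i}", "label": i % 2} for i in range(10)]

train, test = split_data("custom", [rows])
print(len(train), len(test))  # prints: 7 3

train, test = split_data("benchmark 1", [rows[:6], rows[6:]])
print(len(train), len(test))  # prints: 6 4
```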

Configuration Example


    {
      "mod": {
        "el": {
          "base": {
            "model file": "models-in/paraphrase-multilingual-MiniLM-L12-v2",
            "model type": "bert",
            "pretraining task": "sentence similarity",
            "downstream task": "binary text classification"
          },
          "targets": {
            "benchmark 1": {
              "description": "Pitenis et al. - Offensive Language Identification in Greek",
              "link": "https://arxiv.org/abs/2003.07459",
              "loader": [
                "el_offense20(files=['offenseval2020-greek/offenseval-gr-training-v1/offenseval-gr-training-v1.tsv'])",
                "el_offense20(files=['offenseval2020-greek/offenseval-gr-testsetv1/offenseval-gr-test-v1-combined.tsv'])"
              ]
            },
            "all": {
              "loader": [
                "el_offense20()"
              ]
            }
          }
        }
      }
    }

Breakdown: Finetuning a Sentence Transformer

To fine-tune a base model for a specific task and language, define a config block like the one below. This example sets up a text moderation (`mod`) pipeline for Greek (`el`) using a multilingual sentence transformer.

**Base Model Setup**

    "base": {
      "model file": "models-in/paraphrase-multilingual-MiniLM-L12-v2",
      "model type": "bert",
      "pretraining task": "sentence similarity",
      "downstream task": "binary text classification"
    }

- Model file: Path to the pretrained transformer.

- Model type: Architecture type (e.g., BERT).

- Pretraining task: Original task the model was trained on.

- Downstream task: Task you're adapting it to (e.g., moderation, sentiment analysis).
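A `base` block can be sanity-checked before training with a short helper. This is a sketch under assumptions: `check_base` is hypothetical, and treating all four keys as required is a guess, since the README lists them but does not say which are mandatory. The path rules (relative, no trailing slash) come from the General Structure comments.

```python
from typing import List

# Assumption: all four documented keys are required.
REQUIRED_BASE_KEYS = ("model file", "model type", "pretraining task", "downstream task")


def check_base(base: dict) -> List[str]:
    """Return a list of problems found in a `base` block; an empty list means none."""
    problems = [f"missing key: {key!r}" for key in REQUIRED_BASE_KEYS if key not in base]
    path = base.get("model file", "")
    if path.endswith("/"):
        problems.append("'model file' must not end with a slash")
    if path.startswith("/"):
        problems.append("'model file' should be a relative path")
    return problems


good = {
    "model file": "models-in/paraphrase-multilingual-MiniLM-L12-v2",
    "model type": "bert",
    "pretraining task": "sentence similarity",
    "downstream task": "binary text classification",
}
print(check_base(good))  # prints: []
print(check_base({"model file": "models-in/some-model/"}))  # three missing keys plus the slash problem
```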

**Targets**

You can specify multiple finetuning targets. Each target defines a dataset and training strategy.

1. `benchmark 1`

       "benchmark 1": {
         "description": "Pitenis et al. - Offensive Language Identification in Greek",
         "link": "https://arxiv.org/abs/2003.07459",
         "loader": [
           "el_offense20(files=['offenseval2020-greek/offenseval-gr-training-v1/offenseval-gr-training-v1.tsv'])",
           "el_offense20(files=['offenseval2020-greek/offenseval-gr-testsetv1/offenseval-gr-test-v1-combined.tsv'])"
         ]
       }

- Uses a train/test split for evaluation.

- Based on a published benchmark dataset.

2. `all`

       "all": {
         "loader": ["el_offense20()"]
       }

- Uses the full dataset for training.

- No explicit evaluation; this target is for production-grade fine-tuning.