A queryable corpus of (almost) all the news in the world · CS²

Source: https://cs2.uni-graz.at/blog/infini-news/


I do many things, and sometimes the thing I do is work with researcher and journalist friends on various projects. Some of those projects involve news articles, whose analysis is vital to many things. From propaganda studies to media analysis to simply spotting cultural trends, news articles, especially in large quantities, are a good source of insight.

I will say that I have, over the course of half a decade, written or helped to write scrapers for a variety of news websites. For example, I contributed various scrapers for Italian news sites to the fundus framework, a piece of software that I really like working with. Some months ago, though, I thought I'd had enough of writing hack-ish code, and I decided that it was maybe time to solve the issue of news-article dataset procurement, possibly forever and for every (or at least most) edge case. So I did it.

First of all, I happened to know that there is a corpus of all the webpages of the internet (more or less) continuously scraped and well compiled into huge WARC files (like ZIP, but fancy and optimized for HTML), the Common Crawl. At some point I had also stumbled upon the fact that the people who make it also curate a subset of the crawl containing only news articles, CC-NEWS — something that is apparently not very well advertised. At time of writing, the main page documenting its existence is this 2016 announcement of its creation on the Common Crawl blog (maybe they want journalists covering them not to notice).

CC-NEWS, though, is huge: the scrapes from 2016 to 2026 (covering all ten years of its existence) total more than 100 TB of HTML when uncompressed. In its raw form, this dataset is pretty difficult for normal users to download, let alone query. Usually only a specific set of articles is needed for an analysis, and scavenging for them through 30k files named CC-NEWS-20240627144043-04810.warc.gz and the like would be rather inconvenient. While thinking about how to handle that, I was reminded of another amazing thing I'd found some months before: infini-gram. Infini-gram is a research paper (with an accompanying code implementation) describing something that feels like magic: it lets you search for keywords or sets of keywords (of arbitrary length!) in a corpus the size of an LLM pretraining dataset in sub-linear time, usually on the order of milliseconds. To achieve that, indexes are computed and stored for querying; the indexes themselves are big, on the order of terabytes, but they are still far faster and easier to serve via something like a web API. The chart just below _is_ that API, live: type any word (or three) and it sweeps all ten years of the corpus in a single query, then click a point to read real headlines from that year.

[Interactive chart: compare up to three comma-separated terms. Default terms: covid, ukraine, recession.]

Figure 1. Keyword frequency across the corpus, year by year. Each curve is a _single_ infini-gram-mini find() over all 1.36 billion articles, split into years by the index's shard map: an exact count.[1] Warm queries return in milliseconds; the first after an idle spell takes a few seconds (the real latency is reported under the chart). Toggle linear/log and raw counts vs. per-million-articles; click any point for real headlines from that year. Watch covid go vertical in 2020 and ukraine spike in 2022.
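To get a feel for why an index makes this kind of lookup cheap, here is a toy sketch. It is not infini-gram-mini's code: it uses a plain suffix array (the idea behind the original infini-gram) instead of a compressed FM-index, the construction is deliberately naive, and the per-year shard texts are invented for illustration. What it does share with the real thing is the shape of the query: binary search over sorted suffixes, one count per year shard, and every occurrence counted.

```python
from __future__ import annotations

# Invented one-line "shards", one per year, standing in for the real corpus.
SHARDS = {
    2019: "markets rally as recession fears fade",
    2020: "covid cases rise while covid tests run short",
    2022: "ukraine talks stall as covid rules end",
}


def build_suffix_array(text: str) -> list[int]:
    """Return suffix start positions, sorted by the suffix they begin.

    Naive construction (sorts full suffixes); fine for a toy corpus only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def _boundary(text: str, sa: list[int], pattern: str, upper: bool) -> int:
    """Binary-search the first suffix whose prefix is >= pattern (> if upper)."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = text[sa[mid]: sa[mid] + len(pattern)]
        if prefix < pattern or (upper and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def count(text: str, sa: list[int], pattern: str) -> int:
    """Number of occurrences of pattern, found in O(len(pattern) * log n)."""
    return _boundary(text, sa, pattern, True) - _boundary(text, sa, pattern, False)


# Build one index per shard once, then answer any query from the indexes.
INDEX = {year: (text, build_suffix_array(text)) for year, text in SHARDS.items()}


def count_by_year(pattern: str) -> dict[int, int]:
    """Per-year occurrence counts, one lookup per shard."""
    return {year: count(text, sa, pattern) for year, (text, sa) in INDEX.items()}


print(count_by_year("covid"))  # {2019: 0, 2020: 2, 2022: 1}
```

Note that the 2020 shard reports 2 even though it is a single "article": occurrences are counted, not documents.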

After connecting the dots and making a very bad prototype in one night, I made a more concrete plan and then started implementing it with Kirill, a colleague at my research lab who thought that this could be, if not a good idea, at least a fun project. He also usually needs commercial datasets for news research, so he got excited about replacing those with an open option.

In short, what we wanted to achieve was: a version of the corpus cleaned and enriched with useful per-article information — things downstream people care about — plus a set of infini-gram indexes (infini-gram _mini_ indexes, actually, the newer implementation that cuts down their size) so the whole corpus could be queried fast.

I had worked with big datasets before (like this one!) but nothing of this size and scope. I tried to pick a stack rooted in what I knew from the LLM-pretraining-dataset processing literature, partly because I could learn a thing or two with some practice. In the end we picked trafilatura to clean the articles and pull out metadata (author, publish date); three language taggers, each with its own perks (GlotLID for its coverage, CommonLingua for its performance, and lingua for its short-text optimization); and this RoBERTa model for tagging article topics, optimized for multilingual articles and using the standard International Press Telecommunications Council (IPTC) categories. The sampler below draws a real random article and shows exactly these enrichments attached to it; hit “draw another” a few times to get a feel for the corpus.

[Interactive sampler: optional "must contain" keyword, language and year filters. One sample draw follows.]

Eishockey: Thomas Vanek ein Thema in Graz

Kleine Zeitung

Thomas Vanek ist eine der größten Persönlichkeiten im heimischen Eishockey und pflegte stets einen guten Kontakt zur Heimat Graz. So auch jetzt. Der Ex-Stürmer ist im Austausch mit dem neuen Sportlichen Leiter der Grazer. „Es gibt noch nichts Konkretes, aber Thomas hat sich angeboten, jederzeit zu helfen, sobald etwas gebraucht wird“, sagt Philipp Pinter. Vanek ist auch mit Neo-Präsident Herbert Jerich per du. Der Geschäftsmann ermöglichte in der Saison 2012/13 das Gastspiel des mittlerweile 40-Jährigen bei den 99ers. Auf dem Spielermarkt kennt sich Vanek gut aus. Er hat nicht nur die Expertise aus 1029 NHL-Spielen; mittlerweile ist er als Scout bei den San Jose Sharks tätig. Unterdessen basteln die Grazer weiter an einem hochkarätigen Kader.

Language: German (deu_Latn)

Publisher: Kleine Zeitung

Host: kleinezeitung.at

Author: not in the record

Country TLD: at

Published: 2024-04-16

Length: not in the record

Source: open the original ↗

Figure 2. One real article, pulled live from a random point in the corpus, shown with the enrichment sidecar the pipeline writes for every document: detected language, the publishing site, author, source URL and crawl metadata.
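As a rough picture of what such a sidecar could look like in code, here is a minimal sketch modelled only on the fields shown in the sample above. The class name, field names, the "www." stripping and the two-letter TLD rule are all assumptions made for illustration, not the pipeline's actual schema, and the article URL path is made up.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit

MISSING = "not in the record"  # the sampler's wording for absent fields


@dataclass
class ArticleSidecar:
    """Hypothetical per-article enrichment record; field names are assumed."""

    url: str
    language: str                     # e.g. "deu_Latn"
    publisher: Optional[str] = None
    author: Optional[str] = None
    published: Optional[str] = None   # ISO date string, e.g. "2024-04-16"
    length: Optional[int] = None

    @property
    def host(self) -> str:
        """Hostname of the source URL without a leading "www."."""
        hostname = urlsplit(self.url).hostname or ""
        return hostname[4:] if hostname.startswith("www.") else hostname

    @property
    def country_tld(self) -> Optional[str]:
        """Last host label, if it is two letters long (assumed to be a ccTLD)."""
        last = self.host.rsplit(".", 1)[-1]
        return last if len(last) == 2 and last.isalpha() else None

    def display(self, field: str) -> str:
        """Render a field the way the sampler does, with a marker for gaps."""
        value = getattr(self, field)
        return MISSING if value is None else str(value)


record = ArticleSidecar(
    url="https://www.kleinezeitung.at/sport/example-article",  # made-up path
    language="deu_Latn",
    publisher="Kleine Zeitung",
    published="2024-04-16",
)
print(record.host, record.country_tld, record.display("author"))
```

Keeping missing values as `None` and only turning them into "not in the record" at display time means downstream filters can still tell an empty author from an absent one.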

Technically, downloading, cleaning and enriching the corpus was not too straightforward, but luckily our lab cluster provided more than enough storage, CPUs and GPUs. After optimizing all the steps and parallelizing what was possible, I found one of the biggest bottlenecks to be read/write speed: our big storage is an NFS mount and its hardware is apparently faulty; this has caused quite a few headaches and one major crisis (the whole mount stopped working the night before a deadline, three times).

[Interactive chart. Legend: English, Spanish, Russian, German, Italian, French, Turkish, Arabic, Portuguese, other. Exact counts, every article counted (DuckDB over the parquet, 117 months). Snapshot 2026-06-23.]

Figure 3. Exact monthly composition of the corpus, by detected language and by IPTC topic: every one of the 1.36 billion articles counted, computed once with DuckDB straight over the parquet (not sampled).[1] Toggle language/topic and count/share; hover for that month's breakdown. English's share visibly shrinks as the corpus grows; switch to _by topic_ to watch health and conflict coverage swell around 2020.
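The count/share toggle in this figure boils down to a group-by over (month, language). The sketch below shows that aggregation in plain Python rather than DuckDB, purely so it runs without dependencies; the rows are invented, and the assumption that each article carries a crawl date and a language code is taken from the fields mentioned in this post, not from the actual parquet schema.

```python
from collections import Counter, defaultdict

# Invented (crawl date, language code) pairs standing in for the parquet rows.
ROWS = [
    ("2020-03-14", "eng_Latn"),
    ("2020-03-20", "eng_Latn"),
    ("2020-03-21", "deu_Latn"),
    ("2020-04-02", "spa_Latn"),
    ("2020-04-05", "eng_Latn"),
]


def monthly_counts(rows):
    """Count articles per language within each YYYY-MM month."""
    counts = defaultdict(Counter)
    for crawl_date, language in rows:
        counts[crawl_date[:7]][language] += 1  # "2020-03-14" -> "2020-03"
    return counts


def monthly_shares(counts):
    """Turn per-month counts into fractions that sum to 1 within each month."""
    shares = {}
    for month, by_language in counts.items():
        total = sum(by_language.values())
        shares[month] = {lang: n / total for lang, n in by_language.items()}
    return shares


counts = monthly_counts(ROWS)
shares = monthly_shares(counts)
for month in sorted(counts):
    print(month, dict(counts[month]), shares[month])
```

Because every row is visited, the result is exact rather than an estimate from a sample, which is the property the caption emphasises.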

The end result is this corpus, the biggest and most complete news dataset readily available on Hugging Face (already at ~30k monthly downloads despite virtually no publicity), and its sister indexes for super-fast querying. We have also built an API that lets end users (researchers and journalists) query the dataset and create subdatasets within seconds, without technical expertise. We are currently working with our university to figure out how to deploy it in a way that doesn't violate its cybersecurity policies. In the meantime, the public endpoint behind every live widget in this post is, quite literally, a Raspberry Pi with an external hard drive bolted on, sitting on a desk, so the charts here are real but unhurried: give a cold query a few seconds and it will answer. For those interested in the technical details, here is the preprint we wrote with our PI Jana Lasser to present the dataset to the scientific community.

[Interactive widget: keyword input and match count (cached snapshot value: 11,834,061).]

The exact call that reproduces this slice:

```
pip install requests
```

```python
import requests

API = "https://infini-news.uni-graz.at"
r = requests.post(f"{API}/api/v1/count",
                  json={"query": "climate change", "index": "ccnews"})
print(r.json()["count"])  # matches across 1.36B articles
```

Figure 4. The subdataset builder. Pick a keyword; the count comes straight from /api/v1/count, and the snippet is the exact code to reconstruct that slice from the API — the thing we're getting cleared for outside-the-university access. Until that clearance lands it answers only from the campus network.
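If you want to compare a few terms rather than count one, the same call can be wrapped in a small helper. This is a sketch built only on what the snippet above shows (a POST to /api/v1/count with "query" and "index", answered with a "count" field); the helper names, the 30-second timeout and the injectable `opener` are my own assumptions, and it uses the standard library instead of requests so it has no dependencies.

```python
import json
from urllib import request

API = "https://infini-news.uni-graz.at"  # base URL from the snippet above


def count_term(term, index="ccnews", timeout=30.0, opener=request.urlopen):
    """POST one query to /api/v1/count and return its "count" field.

    `opener` is injectable so the function can be tried without network
    access; the timeout is generous because cold queries take a few seconds.
    """
    body = json.dumps({"query": term, "index": index}).encode("utf-8")
    req = request.Request(
        f"{API}/api/v1/count",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with opener(req, timeout=timeout) as resp:
        return json.load(resp)["count"]


def compare_terms(terms, **kwargs):
    """Return {term: count} for a handful of terms, one request per term."""
    return {term: count_term(term, **kwargs) for term in terms}
```

Calling `compare_terms(["covid", "ukraine", "recession"])` would issue three requests against the endpoint, so it only works from a network that can reach it.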

[1] Counts are _token_-match counts from the FM-index, not article counts: a term that appears twice in one article counts twice. The per-year split is exact, read from the index's shard map (each of the 117 shards carries its year), not a sample. The x-axis is crawl date (warc_date), close to but not identical to publication date; 2016 is partial (CC-NEWS starts that August), so its point sits low until you switch to per-million.

Live data: infini-news.uni-graz.at · infini-gram-mini FM-index · index ccnews · 1,357,027,742 articles.