Show HN: Infini-News – 1.36B news articles from Common Crawl, queryable in ms
Infini-News is ten years of CC-NEWS (the news subset of Common Crawl), cleaned, enriched, and turned into a full-text index, so you can count any keyword or phrase across 1.36B articles without downloading anything. Queries normally come back in under a second (ok, right now maybe a few seconds, depending on circumstances). It's free and open on Hugging Face.
I built it because I was sick of manually scraping news websites and the like for research, and because tackling a project of this scale was personally interesting.
On top of data cleaning, we ran language, country (via TLDs and some other heuristics), and topic tagging over all the articles, and I indexed all of them using a recent n-gram indexing technology that I consider akin to magic.
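To give a rough idea of what the TLD part of country tagging can look like, here is a minimal sketch. It is not the actual pipeline code: the function name `country_from_url` and the tiny `CCTLD_TO_COUNTRY` table are made up for illustration, and the real tagging also uses other heuristics that are not shown here.

```python
from urllib.parse import urlparse

# Hypothetical, deliberately tiny mapping from country-code TLD to an
# ISO 3166-1 alpha-2 code. A real table would cover every ccTLD.
CCTLD_TO_COUNTRY = {
    "at": "AT",
    "de": "DE",
    "fr": "FR",
    "it": "IT",
    "jp": "JP",
    "uk": "GB",  # .uk is the ccTLD, but the ISO code is GB
}


def country_from_url(url):
    """Guess a country from the URL's top-level domain.

    Returns None for generic TLDs (.com, .org, ...) or unparseable URLs,
    which is where additional heuristics would have to take over.
    """
    host = urlparse(url).hostname  # lowercased by urlparse, None if missing
    if not host:
        return None
    tld = host.rstrip(".").rsplit(".", 1)[-1]
    return CCTLD_TO_COUNTRY.get(tld)


if __name__ == "__main__":
    for u in ("https://www.orf.at/stories/1", "https://example.com/news"):
        print(u, country_from_url(u))
```

Generic TLDs come back as `None` on purpose: a `.com` domain says nothing about the country, so those articles need some other signal.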
I'd encourage you to read the blog post and play with the interactive visualization I made for it. And of course, I'm happy to answer questions.
Blog: https://cs2.uni-graz.at/blog/infini-news/
Dataset: https://huggingface.co/datasets/ruggsea/infini-news-corpus
Index: https://huggingface.co/datasets/ruggsea/infini-news-index
Preprint: https://arxiv.org/abs/2605.18337