
Show HN: Infini-News – 1.36B news articles from Common Crawl, queryable in ms


Infini-News is ten years of CC-NEWS (the news subset of Common Crawl), cleaned, enriched, and turned into a full-text index, so you can count any keyword or phrase across 1.36B articles in sub-second time without downloading anything (ok, right now it may take a few seconds, depending on circumstances). It's free and open on Hugging Face.

I built it because I was sick of manually scraping news websites and the like for research, and because tackling a project of this scale felt interesting to me personally.

On top of data cleaning, we ran language, country (via TLDs and some other heuristics), and topic tagging over all the articles, and I indexed all of them using a recent n-gram indexing technology that I consider akin to magic.

I would encourage you to read the blog post and play with the interactive viz I made for it. Also, of course, happy to answer questions.

Blog: https://cs2.uni-graz.at/blog/infini-news/
Dataset: https://huggingface.co/datasets/ruggsea/infini-news-corpus
Index: https://huggingface.co/datasets/ruggsea/infini-news-index
Preprint: https://arxiv.org/abs/2605.18337
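
For anyone wondering how phrase counts over a huge corpus can come back quickly: the post does not name the indexing technique, so what follows is only a toy sketch of one well-known approach, a suffix array over tokens, and not necessarily what Infini-News does. The function names `build_suffix_array` and `count_phrase` are hypothetical, and the naive construction below would not scale past a tiny example.

```python
def build_suffix_array(tokens):
    """Return the start positions of all suffixes, sorted lexicographically.

    Naive construction for illustration only; real indexes use far more
    efficient algorithms and on-disk layouts.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def count_phrase(tokens, suffix_array, phrase):
    """Count occurrences of `phrase` (a list of tokens) via two binary searches.

    All suffixes that start with the phrase sit next to each other in the
    sorted suffix array, so the count is the width of that range.
    """
    m = len(phrase)

    def first_index(past_matches):
        # Find the first suffix whose m-token prefix is >= phrase
        # (or > phrase when past_matches is True).
        lo, hi = 0, len(suffix_array)
        while lo < hi:
            mid = (lo + hi) // 2
            start = suffix_array[mid]
            prefix = tokens[start:start + m]
            if prefix < phrase or (past_matches and prefix == phrase):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return first_index(True) - first_index(False)


# Toy corpus, made up for this example.
tokens = "the cat sat on the mat the cat ran".split()
suffix_array = build_suffix_array(tokens)

print(count_phrase(tokens, suffix_array, ["the", "cat"]))  # 2
print(count_phrase(tokens, suffix_array, ["dog"]))         # 0
```

The point of the sketch is that each query costs a number of comparisons logarithmic in the corpus size once the index exists, which is why counting does not require scanning (or downloading) the articles themselves.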

Comments

Very cool! Happy to see some cool stuff made in Graz too. Keep up the good work!