Machine learning is taking hold in all kinds of applications, from self-driving cars to image recognition to online recommendation engines. But unless you’re a Google or a Facebook, it’s hard to get your hands on the kind of massive, real-world data sets required to test and validate machine learning programs.
Yahoo has helped to rectify that with the release Thursday of what it called the “largest ever” data set made available to machine learning scientists. It’s a collection of anonymized user interactions with the news streams on sites like Yahoo News and Yahoo Sports.
Yahoo says the data set contains 110 billion events, each a record of a user clicking on a news story or taking some other action in the feed, and that it comprises 13.5TB of data, or 1.5TB compressed. That’s more than ten times the size of the previous largest data set released, Yahoo says.