Datasets

A curated collection of datasets and data repositories that I’ve found useful for learning, testing, and building data projects.


Dataset Search & Repositories

Google Dataset Search

Google's search engine for finding publicly available datasets across the web. A great starting point when you're not sure where to look.

Kaggle Datasets

Large repository of datasets for machine learning and data science projects. Includes community contributions, competitions, and notebooks.

Maven Analytics Data Playground

Free sample datasets designed for practicing data analysis and visualization. Well-documented with suggested analysis questions.

R Datasets

Collection of datasets available in R packages. Useful for statistical analysis and reproducible examples.


Automotive

Audi Autonomous Driving Dataset (A2D2)

Autonomous driving dataset with sensor data from Audi vehicles. Includes camera, lidar, and semantic segmentation data.


Cloud Platform Datasets

Azure ML Open Datasets (Python)

Azure ML library for processing public datasets in Python. Provides easy access to common datasets in notebooks.

Azure Open Datasets

Microsoft's catalog of curated public datasets on Azure. Includes weather, genomics, demographics, and more.

Registry of Open Data on AWS

Public datasets available for free on Amazon Web Services. Covers satellite imagery, genomics, and large-scale datasets.


Data Visualization Sources

Beautiful News

Positive news data visualizations and underlying datasets. Good for practicing visualization with uplifting stories.

Gapminder

Global development indicators covering health, economy, and demographics. Made famous by Hans Rosling's presentations.

Information is Beautiful

Curated datasets behind data visualizations covering diverse topics. Well-structured and visually interesting data.


Database Testing & Development

Datasets for Test Databases

Curated list of datasets suitable for testing database systems. Useful when you need realistic test data.

SQL Server Sample Databases

Sample databases for learning and testing SQL Server. Includes AdventureWorks, WideWorldImporters, and others.


Entertainment & Media

IMDb Non-Commercial Datasets

Movie and TV show data from IMDb for non-commercial use. Includes titles, ratings, cast, and crew information.
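As a rough sketch of working with these files: the IMDb downloads are, as far as I know, gzipped tab-separated files that use `\N` for missing values. The snippet below parses a tiny made-up sample in that shape using only the standard library. The column names follow the ratings file layout as I understand it, and the helper name `read_imdb_tsv` is my own.

```python
import csv
import io

# Tiny in-memory stand-in for a ratings file. The rows are made up and the
# column names are an assumption based on IMDb's published layout.
SAMPLE = "tconst\taverageRating\tnumVotes\ntt0000001\t5.7\t2000\ntt0000002\t\\N\t\\N\n"


def read_imdb_tsv(handle):
    """Yield each row as a dict, turning the backslash-N placeholder into None."""
    # QUOTE_NONE stops the csv module from treating quote characters in
    # titles as field delimiters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield {key: (None if value == r"\N" else value) for key, value in row.items()}


rows = list(read_imdb_tsv(io.StringIO(SAMPLE)))
```

For a real download, the same function should work on a handle opened with `gzip.open(path, "rt", encoding="utf-8", newline="")`, where `path` points at wherever you saved the file.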

Million Song Dataset

Audio features and metadata for a million contemporary songs. Widely used for music information retrieval research.

MovieLens

Movie rating datasets from GroupLens for recommender system research. Various sizes from 100K to 25M ratings.
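A minimal sketch of a first pass over the ratings: it computes the mean rating per movie from a CSV with `userId,movieId,rating,timestamp` columns, which is the layout I'd expect in the smaller MovieLens releases (check the README that ships with the version you download). The sample rows and the `mean_ratings` helper are illustrative only.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the assumed ratings.csv layout.
SAMPLE = """userId,movieId,rating,timestamp
1,10,4.0,964982703
2,10,5.0,964981247
1,20,3.0,964982224
"""


def mean_ratings(handle):
    """Return a dict mapping movieId (as a string) to its mean rating."""
    totals = defaultdict(lambda: [0.0, 0])  # movieId -> [sum of ratings, count]
    for row in csv.DictReader(handle):
        entry = totals[row["movieId"]]
        entry[0] += float(row["rating"])
        entry[1] += 1
    return {movie: total / count for movie, (total, count) in totals.items()}


means = mean_ratings(io.StringIO(SAMPLE))
```

Streaming row by row keeps memory use flat, which matters more for the larger releases than for the 100K one.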


Government & International Statistics

National Center for Education Statistics

US education statistics and research data from the Department of Education. Covers schools, colleges, and educational outcomes.

Source Cooperative - Harvard LIL Gov Data

Archive of US government data from Harvard Library Innovation Lab. Preserves government datasets for long-term access.

UNdata

United Nations statistical databases covering global indicators across economics, demographics, environment, and social metrics.

World Health Organization

Global health statistics and indicators from the WHO, including disease surveillance, health systems, and population health data.


Location & Places

Foursquare Open Source Places

Open dataset of 100M+ places and locations from Foursquare. Covers businesses, landmarks, and points of interest globally.


Science & Research

CERN Open Data Portal

Particle physics research data from the European Organization for Nuclear Research. Includes collision data and analysis tools.

National Cancer Institute GDC Portal

Cancer genomics data from the Genomic Data Commons. Contains clinical and genomic data for cancer research.

The COVID Tracking Project

Historical US COVID-19 testing and outcomes data. Comprehensive archive of pandemic statistics.


Sports & Football

Betting Exchange Historical Data

Historical odds and market data from the Betfair betting exchange. Useful for sports analytics and prediction models.

FiveThirtyEight

Datasets behind FiveThirtyEight's data journalism articles on politics, sports, and culture. Clean, well-documented data.

football.csv

Open football/soccer data in CSV format including leagues, matches, and results. Community-maintained and regularly updated.

football.db

Open football data in structured database format. Good for building football statistics applications.

Soccer Data and APIs Guide

Comprehensive guide to football/soccer data sources and APIs. Useful for understanding what's available.

WhoScored

Football statistics and player ratings site, run by a team of football analysts and software developers based in Central London.

StatsBomb

Sports data and analytics company supplying advanced football data to clubs, media, and betting companies. Datasets include detailed event data, player tracking, and tactical analysis.

Opta Analyst

In-depth football statistics and analysis. Covers player performance metrics, team statistics, and match analysis.


UK Geographic Data

Open Geography Portal

UK geographic boundaries and statistical data from the Office for National Statistics. Essential for UK-based spatial analysis.

Ordnance Survey

UK national mapping and geographic data. Offers both free and premium datasets for mapping applications.


Weather & Climate

Meteostat

Historical weather and climate data from weather stations worldwide. Clean API access to temperature, precipitation, and conditions.