Datasets Collection

Anh-Thi Dinh

Articles

Elite Data Science -- Datasets for Data Science and Machine Learning

Create artificial dataset

sklearn.datasets module (from sklearn import datasets): generates artificial datasets and also contains some popular reference datasets.
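As a minimal sketch of creating an artificial dataset with this module, the snippet below uses make_classification. The sample count, feature counts and random seed are illustrative assumptions, not values from this note.

```python
from sklearn.datasets import make_classification

# Generate a small synthetic binary classification problem.
# All parameter values below are illustrative choices.
X, y = make_classification(
    n_samples=200,      # number of rows
    n_features=5,       # total number of columns
    n_informative=3,    # columns that actually carry class signal
    n_redundant=0,      # no linear combinations of informative columns
    n_classes=2,
    random_state=0,     # fixed seed so the data is reproducible
)

print(X.shape, y.shape)
```

The same module also offers other generators such as make_regression and make_blobs, which follow a similar call pattern.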

Source of datasets

awesome-public-datasets — A topic-centric list of HQ open datasets.

Built-in datasets in Scikit-Learn.

BuzzFeedNews/everything — data from BuzzFeed.

COCO -- Common Objects in Context.

Data Hub Datasets collection — high quality data and datasets organized by topic.

data.gov — a large dataset aggregator and the home of the US Government’s open data.

data.world -- The Cloud-Native Data Catalog.

FiveThirtyEight — hard data and statistical analysis to tell stories about politics, sports, societal matters and more.

Google Dataset Search.

Google Trends Datastore

Google AI Datasets — In order to contribute to the broader research community, Google periodically releases data of interest to researchers in a wide range of computer science disciplines.

Kaggle Datasets.

NLP-progress.

Open Images V6

Quandl -- clean, ready-to-use data for testing machine learning algorithms without spending time on data cleaning.

r/datasets.

Stanford Large Network Dataset Collection.

UCI Machine Learning Repository.

TensorFlow Datasets

The Yahoo Webscope Program

torchvision.datasets

WHU-RS Datasets -- Dataset Collection by Group of Photogrammetry and Computer Vision (GPCV) at Wuhan University.

Specific Datasets

COCO Dataset -- a large-scale object detection, segmentation, and captioning dataset.

Dataset samples from Machine Learning Mastery.

Fruit-Images-Dataset — A dataset of images containing fruits and vegetables.

google-landmark -- Dataset with 5 million images depicting human-made and natural landmarks spanning 200 thousand classes.

ImageNet -- an image database organized according to the WordNet hierarchy.

Insight - BBC News Datasets

Large-scale CelebFaces Attributes (CelebA) Dataset

Large Movie Review Dataset (IMDB)

MIT Places Database for Scene Recognition.

Sarcasm detection dataset.

UEA & UCR Time Series Classification Repository

WordNet -- A Lexical Database for English.

Vietnamese

IWSLT'15 English-Vietnamese data (the small version, from Stanford).

NLP-progress - Vietnamese

PhoBERT -- Pre-trained language models for Vietnamese.

PhoW2V (2020): Pre-trained Word2Vec syllable- and word-level embeddings for Vietnamese.

ViText2SQL (EMNLP 2020 Findings): A dataset for Vietnamese Text2SQL semantic parsing.

VnCoreNLP (NAACL 2018): A Vietnamese NLP pipeline of word (and sentence) segmentation, POS tagging, named entity recognition and dependency parsing.

Sample datasets

Iris flower dataset (from sklearn.datasets import load_iris).

Labeled Faces in the Wild Home (from sklearn.datasets import fetch_lfw_people).

pydatafaker -- A Python package to create fake data with relationships between tables.

The digits dataset (sklearn.datasets.load_digits).
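A short sketch of loading two of the scikit-learn sample datasets listed above. Both ship with the library, so no download is needed. fetch_lfw_people is left out here because it downloads the data on first use.

```python
from sklearn.datasets import load_digits, load_iris

# Each loader returns a Bunch object with .data, .target and metadata.
iris = load_iris()
digits = load_digits()

# Iris: 150 flowers, 4 measurements each, 3 species.
print(iris.data.shape, list(iris.target_names))

# Digits: 1797 grayscale images of 8x8 pixels, flattened to 64 features.
print(digits.data.shape, digits.images.shape)
```

Passing return_X_y=True to either loader returns the (data, target) pair directly instead of the Bunch object.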

Tools

TimeSynth -- A Multipurpose Library for Synthetic Time Series Generation in Python.