- Elite Data Science -- Datasets for Data Science and Machine Learning
- sklearn dataset module: `from sklearn import datasets`. This also contains some popular reference datasets.
- awesome-public-datasets — A topic-centric list of HQ open datasets.
- BuzzFeedNews/everything — data from BuzzFeed.
- COCO -- Common Objects in Context.
- Data Hub Datasets collection — high quality data and datasets organized by topic.
- data.gov — a large dataset aggregator and the home of the US Government’s open data.
- data.world -- The Cloud-Native Data Catalog.
- FiveThirtyEight — hard data and statistical analysis to tell stories about politics, sports, societal matters and more.
- Google AI Datasets — In order to contribute to the broader research community, Google periodically releases data of interest to researchers in a wide range of computer science disciplines.
- Quandl -- a good choice for testing your machine learning algorithms without wasting time on cleaning data.
- WHU-RS Datasets -- Dataset Collection by Group of Photogrammetry and Computer Vision (GPCV) at Wuhan University.
- COCO Dataset -- a large-scale object detection, segmentation, and captioning dataset.
- Fruit-Images-Dataset — A dataset of images containing fruits and vegetables.
- google-landmark -- Dataset with 5 million images depicting human-made and natural landmarks spanning 200 thousand classes.
- ImageNet -- ImageNet is an image database organized according to the WordNet hierarchy.
- WordNet -- A Lexical Database for English.
- IWSLT'15 English-Vietnamese data (small, from Stanford).
- PhoBERT -- Pre-trained language models for Vietnamese.
- PhoW2V (2020): Pre-trained Word2Vec syllable- and word-level embeddings for Vietnamese.
- ViText2SQL (EMNLP 2020 Findings): A dataset for Vietnamese Text2SQL semantic parsing.
- VnCoreNLP (NAACL 2018): A Vietnamese NLP pipeline of word (and sentence) segmentation, POS tagging, named entity recognition and dependency parsing.
- Iris flower dataset (`from sklearn.datasets import load_iris`).
- Labeled Faces in the Wild Home (`from sklearn.datasets import fetch_lfw_people`).
- pydatafaker -- A Python package to create fake data with relationships between tables.
- The digits dataset (`sklearn.datasets.load_digits`).
- TimeSynth -- A Multipurpose Library for Synthetic Time Series Generation in Python.
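
The scikit-learn loaders named in this list can be tried with a short script. The following is a minimal sketch, assuming scikit-learn is installed; the `summarize` helper and the `summaries` variable are hypothetical names introduced here for illustration. It covers only the Iris and digits datasets, which ship with the library. `fetch_lfw_people` is left out because it downloads its data on first use.

```python
from sklearn.datasets import load_digits, load_iris


def summarize(name, bunch):
    """Return (name, n_samples, n_features, n_classes) for a loaded dataset.

    `summarize` is a hypothetical helper, not part of scikit-learn.
    """
    n_samples, n_features = bunch.data.shape  # data is a 2-D feature matrix
    n_classes = len(set(bunch.target))        # count distinct target labels
    return name, n_samples, n_features, n_classes


# Both loaders read data bundled with scikit-learn, so no download is needed.
iris = load_iris()
digits = load_digits()

summaries = [summarize("iris", iris), summarize("digits", digits)]
for name, n_samples, n_features, n_classes in summaries:
    print(f"{name}: {n_samples} samples, {n_features} features, {n_classes} classes")
```

Each loader returns an object whose `data` and `target` attributes hold the feature matrix and the labels, so the same helper should work for other bundled `load_*` datasets that follow this layout.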