Sample Data
Yellowbrick maintains some sample data sets (source files, DDL, and queries) in GitHub at YmSamples. The DDL scripts in that repository reference AWS S3 files in a public bucket. You can use either the Yellowbrick Manager Load Assistant or SQL commands to load the sample data.
Some of the data sets include:
gdelt
: The GDELT Project is the largest, most comprehensive, and highest resolution open database of human society ever created. Just the 2015 data alone records nearly three quarters of a trillion emotional snapshots and more than 1.5 billion location references, while its total archives span more than 215 years, making it one of the largest open-access spatio-temporal data sets in existence and pushing the boundaries of "big data" study of global human society.
NOAA GHCN-D
: Weather observations for over 200 years collected from a large number of land-based weather stations. The source files are provided by the Registry of Open Data on AWS (RODA).
nyc-taxi
: Yellow and green taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.
premdb
: A tiny database that contains actual English Premier League soccer results for about 20 seasons, as well as a few details about each team in the league. You can run a script that creates and loads five small tables in about 5 seconds. These tables are used extensively in the main Yellowbrick documentation to provide simple, reproducible examples of SQL commands and functions.
tpcds_sf1000
: A 1TB version of the TPC-DS data set, which is frequently used by database companies for competitive benchmarking. This data set is pre-loaded into the yellowbrick_trial database. Scripts are available for re-creating and loading these tables.
NetFlow
: A data set generated by a service that sits on network routers and collects information on IP traffic as it enters or exits an interface. This information includes source and destination addresses, ports, protocols, and the amount of data transmitted. By analyzing NetFlow data, a network administrator can summarize network activity, find sources of network congestion, and identify potential security threats.
Note: The YmSamples site and the referenced S3 bucket are both public. You do not need an AWS account to access the data.
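Because the bucket is public, its objects can in principle be read over plain HTTPS without credentials. The following is a minimal Python sketch of that idea, not a Yellowbrick-provided tool. The bucket name, object key, and region are hypothetical placeholders (the real locations are the ones referenced by the DDL scripts in YmSamples), and the helper names public_s3_url and read_head are introduced here for illustration only.

```python
from urllib.parse import quote
from urllib.request import urlopen


def public_s3_url(bucket: str, key: str, region: str = "us-east-1") -> str:
    """Build the virtual-hosted-style HTTPS URL for an object in a public S3 bucket."""
    # quote() percent-encodes special characters but leaves "/" alone,
    # so the key's path structure is preserved.
    return f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key)}"


def read_head(url: str, nbytes: int = 1024) -> bytes:
    """Fetch the first nbytes of a public object; no AWS credentials are sent."""
    with urlopen(url) as response:
        return response.read(nbytes)


# Hypothetical bucket and key, for illustration only.
# Substitute a location taken from the YmSamples DDL scripts.
url = public_s3_url("example-sample-bucket", "premdb/team.csv")
print(url)
```

Calling read_head(url) with a real object URL would return the first bytes of the file, which is a quick way to check a source file's delimiter and layout before loading it.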