External Table Examples
The following examples show how to create and select from external tables. These examples assume that an external mount exists with the reference name /ext_db/.
Create an external table:
Create an external table by writing (unloading) the results of a query to a file:
premdb=# create external table '/ext_db/ext_season.csv'
using(format text) as select * from season;
SELECT 25
Query the external table:
Subsequent queries can read from the external table file. Select data from this kind of table by using FROM EXTERNAL syntax:
premdb=# select * from external '/ext_db/season.csv'
(seasonid int, season_name char(9), numteams int, winners varchar(40))
using (format csv);
seasonid | season_name | numteams | winners
----------+-------------+----------+-------------------
1 | 1992-1993 | 22 | Manchester United
2 | 1993-1994 | 22 | Manchester United
3 | 1994-1995 | 22 | Blackburn Rovers
4 | 1995-1996 | 20 | Manchester United
5 | 1996-1997 | 20 | Manchester United
...
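Because the external file behaves like a read-only table reference, you can run other queries against it as well, such as aggregates. The following is a minimal sketch that assumes the same file and column list as the previous query; it simply counts the rows that can be read from the file:
premdb=# select count(*) from external '/ext_db/season.csv'
(seasonid int, season_name char(9), numteams int, winners varchar(40))
using (format csv);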
Create an external table with additional options:
Create an external table with several options in the USING clause:
- Format the output as CSV.
- Compress the output as GZIP (using the GZIP_BEST unload compression type).
- Specify a writable log directory relative to the mount point. (This is useful for keeping track of log files that are tied to specific uses of external tables. For example, you can direct the logs for load and unload operations to different log directories.)
- Include a header line in the output file.
- Use a pipe character as the field delimiter in the file that contains the exported data.
premdb=# create external table '/ext_db/newmatchstats.gz'
using(format csv, compress 'GZIP_BEST', logdir '/ext_db/logs', includeheader true, delim '|')
as select * from newmatchstats;
SELECT 774540
You can check that the option settings were respected by unzipping and inspecting the newmatchstats.gz file.
Now select some rows from this external table by using SELECT...FROM EXTERNAL syntax. Note the use of the skiprows option to ignore the header line in the file, the delim option to make sure the columns are read correctly from the file, and the logdir option:
premdb=# select distinct seasonid, matchday
from external '/ext_db/newmatchstats'
(seasonid int, matchday date)
using(format csv, delim '|', skiprows 1, logdir '/ext_db/query_logs')
where seasonid=20
order by 2;
seasonid | matchday
----------+------------
20 | 2011-08-13
20 | 2011-08-14
20 | 2011-08-15
20 | 2011-08-20
20 | 2011-08-21
20 | 2011-08-22
20 | 2011-08-27
20 | 2011-08-28
...
Note: You must specify the correct number and order of columns in the column list when selecting from external tables. For example, the following query fails because it attempts to select three columns but only specifies two in the column list for the external table reference:
premdb=# select distinct seasonid, matchday, winners
from external '/ext_db/newmatchstats'
(seasonid int, matchday date)
using(format csv, skiprows 1)
where seasonid=20
order by 2;
ERROR: column "winners" does not exist
LINE 1: select distinct seasonid, matchday, winners from external 'e...
^
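To resolve the error, either remove winners from the select list or declare a matching column in the column list. The following sketch assumes, purely for illustration, that the unloaded file really does contain a third field with the winning team; the column name and type here are hypothetical:
premdb=# select distinct seasonid, matchday, winners
from external '/ext_db/newmatchstats'
(seasonid int, matchday date, winners varchar(40))
using(format csv, delim '|', skiprows 1)
where seasonid=20
order by 2;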
Check the ybload and ybunload log files:
You can check the log files for ybunload and ybload operations that create and select from external tables. These logs are useful for troubleshooting problems and as a record of when and how external table operations were performed. If you do not set the logdir option in the USING clause, the logs are written to the same directory as the external files. Log files are named <queryid>-UNLOAD.log or <queryid>-LOAD.log.
For example, the following ybunload log was written when an external table was created:
$ more 118173-UNLOAD.log
2019-06-05 13:58:49.484 [ INFO]
...
app.name_and_version = "external version 3.0.0-13800"
java.home = "/usr/lib/jvm/java-8-oracle/jre"
java.version = "1.8.0_101"
jvm.memory = "875.00 MB (max=6.00 GB)"
jvm.name_and_version = "Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)"
jvm.options = "-Xms512m, -Xmx6g, -XX:+UseG1GC, -Dlogfile=/home/brumsby/ybd/external/build/bin/external.log, -Dapp.name=external, -Dapp.pid=21892, -Dapp.repo=/home/brumsby/ybd/external/build/lib, -Dapp.home=/home/brumsby/ybd/external/build, -Dbasedir=/home/brumsby/ybd/external/build"
jvm.vendor = "Oracle Corporation"
os.name_and_version = "Linux 4.4.0-31-generic (amd64)"
2019-06-05 13:58:49.484 [ INFO] <unload-4> Beginning unload to /home/brumsby/ext_db
2019-06-05 13:58:50.756 [ INFO] <pool-6-thread-1> state: RUNNING - Network BW: 5.57 MB/s Disk BW: 1.99 MB/s
2019-06-05 13:58:51.147 [ INFO] <unload-4> Network I/O Complete. Waiting on file I/O
2019-06-05 13:58:51.147 [ INFO] <unload-4> Finalizing...
2019-06-05 13:58:51.151 [ INFO] <unload-4> Transfer complete
2019-06-05 13:58:51.151 [ INFO] <unload-4> Transferred: 2661174.00 B Avg Network BW: 5.89 MB/s Avg Disk write rate: 1.89 MB/s
Here is an example of a ybload log file, which was written when a user selected from an external table:
$ more 118288-LOAD.log
2019-06-05 13:59:10.456 [ INFO]
...
app.name_and_version = "external version 3.0.0-13800"
java.home = "/usr/lib/jvm/java-8-oracle/jre"
java.version = "1.8.0_101"
jvm.memory = "875.00 MB (max=6.00 GB)"
jvm.name_and_version = "Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)"
jvm.options = "-Xms512m, -Xmx6g, -XX:+UseG1GC, -Dlogfile=/home/brumsby/ybd/external/build/bin/external.log, -Dapp.name=external, -Dapp.pid=21892, -Dapp.repo=/home/brumsby/ybd/external/build/lib, -Dapp.home=/home/brumsby/ybd/external/build, -Dbasedir=/home/brumsby/ybd/external/build"
jvm.vendor = "Oracle Corporation"
os.name_and_version = "Linux 4.4.0-31-generic (amd64)"
2019-06-05 13:59:10.458 [ INFO] <load-5> Assuming source encoding matches database server encoding: LATIN9
2019-06-05 13:59:10.459 [ INFO] <LFSourceContext> Gathering metadata on input files
2019-06-05 13:59:10.460 [ INFO] <load-5> --num-header-lines: 1
2019-06-05 13:59:10.464 [ INFO] <load-5> Session Key = h7TVdFVK4c0fehPzOk8R4fQUPd9F16Obi1fgeaivU8vwwK61mWqYbByciMXNIpdH
2019-06-05 13:59:10.570 [ INFO] <load-5> Starting 1 source PreReaders
2019-06-05 13:59:10.571 [ INFO] <load-5> Using database locale: C
2019-06-05 13:59:10.572 [ INFO] <PreReader #0> Auto-detected line separator = '\n'
2019-06-05 13:59:10.572 [ INFO] <LFScanContext> Configuration (record/field separation):
--format : CSV
--delimiter : |
--linesep : \n
--quote-char : "
--escape-char : "
--trim-white : false
--skip-blank-lines : true
--on-missing-field : ERROR
--on-extra-field : REMOVE
--on-unescaped-embedded-quote : ERROR
Internal scanner : RecordScannerQuote_1_1
2019-06-05 13:59:10.573 [ INFO] <PreReader #0> Reading from 75 for /home/brumsby/ext_db/newmatchstats
2019-06-05 13:59:10.596 [ INFO] <load-5> Configuration (pre-parsing):
2019-06-05 13:59:10.597 [ INFO] <load-5> Configuration (session):
tableName : "yellowbrick"."public"."external_table"
keepAliveSeconds: 60
maxBadRows : 0
2019-06-05 13:59:10.598 [ INFO] <load-5> Bad rows will be written to file:/home/brumsby/ext_db/query_logs/118288-LOAD.badrow
2019-06-05 13:59:10.598 [ INFO] <load-5> Starting import of 1 files
2019-06-05 13:59:10.772 [ INFO] <load-5> Starting 1 segment reader
2019-06-05 13:59:11.135 [ INFO] <load-5> Flushing last 774540 rows (of 774540 total) for transaction 1
2019-06-05 13:59:11.152 [ INFO] <sender 1.00> SUCCESSFUL BULK LOAD: Loaded 774540 good rows in 0:00:00 (READ: 27.64MB/s WRITE: 19.60MB/s)
If an external table operation fails, you can use the log files to troubleshoot. For example, a ybload operation that generates bad rows will write a badrow file to the same log directory:
$ more 1882713-LOAD.badrow
#error: lineByteRange(75-99)
#reason: Too few fields (1 < 2)
1|1992-08-01|2|52|01:00
Join an external table to a regular database table:
The following example shows how to select from an external table (newmatchstats.csv aliased as ex) and join the results to a regular database table (match aliased as db):
premdb=# select distinct db.seasonid, db.matchday
from
external '/ext_db/newmatchstats.csv' (seasonid int, matchday date)
using(format csv, delim '|', skiprows 1, logdir '/ext_db/query_logs') ex
join
match db on db.seasonid=ex.seasonid
where db.seasonid in(10,11,12)
order by 1,2;
seasonid | matchday
----------+---------------------
10 | 2001-08-18 00:00:00
10 | 2001-08-19 00:00:00
10 | 2001-08-20 00:00:00
10 | 2001-08-21 00:00:00
10 | 2001-08-22 00:00:00
10 | 2001-08-25 00:00:00
10 | 2001-08-26 00:00:00
10 | 2001-08-27 00:00:00
10 | 2001-09-08 00:00:00
10 | 2001-09-09 00:00:00
10 | 2001-09-15 00:00:00
...
Join two external tables:
This query joins two external tables (aliased nms and esn). Note that you must specify the EXTERNAL keyword, a column list, and a USING clause each time you reference an external table in a query:
premdb=# select distinct nms.seasonid, nms.matchday
from
external '/ext_db/newmatchstats' (seasonid int, matchday date) using(format csv, delim '|', skiprows 1, logdir '/ext_db/logs_may22') nms
join
external '/ext_db/ext_season_new' (seasonid int) using(format csv, logdir '/ext_db/logs_may22') esn
on nms.seasonid=esn.seasonid
order by nms.seasonid;
seasonid | matchday
----------+------------
1 | 1992-08-01
2 | 1993-09-01
2 | 1993-08-31
2 | 1993-08-15
2 | 1994-05-02
...
Insert rows into a table by selecting from an external table:
To read data from an external table and write it into a regular table, use an INSERT INTO SELECT...FROM EXTERNAL statement. The column list for the file is optional in this case, but the USING clause is required. For example:
premdb=# insert into season
select * from external '/ext_db/season.csv'
using(format csv);
INSERT 0 25
The external table in this statement uses the schema of the target table (season). The columns in the external data file must be in the same order as they are in the target table, and every column in the target table must also exist in the external file.
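Although the column list is optional here, you can still declare it explicitly if you prefer. A minimal sketch, reusing the season column definitions shown earlier:
premdb=# insert into season
select * from external '/ext_db/season.csv'
(seasonid int, season_name char(9), numteams int, winners varchar(40))
using(format csv);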