Sqoop Import to HDFS
The import tool imports an individual table from an RDBMS to HDFS. Each row from a table is represented as a separate record in HDFS. Records can be stored as text files (one record per line), or in binary representation as Avro or SequenceFiles.
Sqoop is designed to import tables from a database into HDFS. To do so, you must specify a connect string that describes how to connect to the database. The connect string is similar to a URL, and is communicated to Sqoop with the --connect argument.
Example (test connectivity by listing databases):
sqoop list-databases \
--connect jdbc:mysql://ms.itversity.com:3306 \
--username username \
--password ******
The same check using a password file instead of an inline password:
sqoop list-databases \
--connect jdbc:mysql://ms.itversity.com:3306 \
--username username \
--password-file /user/umeshp6655/MySql/sqoop.password
//Some useful control arguments
--append Append data to an existing dataset in HDFS
--as-avrodatafile Imports data to Avro Data Files
--as-sequencefile Imports data to SequenceFiles
--as-textfile Imports data as plain text (default)
--as-parquetfile Imports data to Parquet Files
--boundary-query <statement> Boundary query to use for creating splits
--columns <col,col,col…> Columns to import from table
--delete-target-dir Delete the import target directory if it exists
--fetch-size <n> Number of entries to read from database at once.
--inline-lob-limit <n> Set the maximum size for an inline LOB
--direct Use the direct connector if one exists for the database
-m,--num-mappers <n> Use n map tasks to import in parallel
-e,--query <statement> Import the results of statement.
--split-by <column-name> Column of the table used to split work units. Cannot be used with --autoreset-to-one-mapper option.
--autoreset-to-one-mapper Import should use one mapper if a table has no primary key and no split-by column is provided. Cannot be used with --split-by <col> option.
--table <table-name> Table to read
--target-dir <dir> HDFS destination dir
--warehouse-dir <dir> HDFS parent for table destination
--where <where clause> WHERE clause to use during import
-z,--compress Enable compression
--compression-codec <c> Use Hadoop codec (default gzip)
--null-string <null-string> The string to be written for a null value for string columns
--null-non-string <null-string> The string to be written for a null value for non-string columns
mysql> select * from mysql_import;
+-------+---------------------+------------+----------+-----------------+
| sid | city | state | ranked | rDate |
+-------+---------------------+------------+----------+-----------------+
| 101 | Patna | Bihar | 1 | 2014-03-23 |
| 102 | Allahabad | UP | 2 | 2014-03-23 |
| 103 | Kolkata | WB | 4 | 2014-03-27 |
| 104 | Panaji | Goa | 4 | 2014-03-29 |
+-------+---------------------+------------+----------+-----------------+
//Sqoop import
sqoop import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--table mysql_import \
--target-dir /user/umeshp6655/MySql/mysql_import \
--as-textfile \
--split-by sid
Note: The table does not have a primary key, so we must use --split-by. The split-by column should be indexed and must not contain null values.
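As a rough illustration of how Sqoop turns the --split-by column into parallel work units, the sketch below mimics the integer splitter: it takes the MIN/MAX of the split column and divides the range evenly among the mappers. This is a simplified sketch; the real logic lives in Sqoop's IntegerSplitter class, and `compute_splits` is an illustrative name, not a Sqoop API.

```python
# Simplified sketch of how Sqoop divides the --split-by column's
# [min, max] range among map tasks (the real logic is Sqoop's
# IntegerSplitter; compute_splits is an illustrative name).
def compute_splits(lo, hi, num_mappers):
    """Return inclusive (low, high) ranges covering [lo, hi]."""
    size = (hi - lo + 1) / num_mappers
    splits = []
    start = lo
    for i in range(num_mappers):
        end = hi if i == num_mappers - 1 else int(lo + size * (i + 1)) - 1
        splits.append((start, end))
        start = end + 1
    return splits

# With sid ranging 101..108 and the default 4 mappers, each mapper
# runs a query like: SELECT ... WHERE sid >= low AND sid <= high
print(compute_splits(101, 108, 4))
# → [(101, 102), (103, 104), (105, 106), (107, 108)]
```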
mysql> select * from mysql_import;
+-------+---------------------+------------+----------+-----------------+
| sid | city | state | ranked | rDate |
+-------+---------------------+------------+----------+-----------------+
| 101 | Patna | Bihar | 1 | 2014-03-23 |
| 102 | Allahabad | UP | 2 | 2014-03-23 |
| 103 | Kolkata | WB | 4 | 2014-03-27 |
| 104 | Panaji | Goa | 4 | 2014-03-29 |
| 105 | Charlotte | NC | 6 | 2014-03-31 |
| 106 | Greenville | SC | 6 | 2014-05-31 |
| 107 | Greenville Yard | SC Yard | 7 | 2014-07-31 |
| 108 | Atlanta | GA | 9 | 2014-06-25 |
+-------+---------------------+------------+----------+-----------------+
8 rows in set (0.00 sec)
//Sqoop Import Incremental append
sqoop import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--table mysql_import \
--target-dir /user/umeshp6655/MySql/mysql_import \
--as-textfile \
--split-by sid \
--check-column ranked \
--incremental append --last-value 4
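What --incremental append does can be sketched in Python: Sqoop issues a query equivalent to WHERE ranked > 4 and appends only the matching rows to the existing target directory. The table's data is inlined below for illustration.

```python
# Sketch of --incremental append with --check-column ranked and
# --last-value 4: only rows whose check column exceeds the last
# recorded value are selected and appended to the target directory.
rows = [
    (101, "Patna", 1), (102, "Allahabad", 2), (103, "Kolkata", 4),
    (104, "Panaji", 4), (105, "Charlotte", 6), (106, "Greenville", 6),
    (107, "Greenville Yard", 7), (108, "Atlanta", 9),
]
last_value = 4
# Equivalent SQL: SELECT ... FROM mysql_import WHERE ranked > 4
new_sids = [sid for sid, city, ranked in rows if ranked > last_value]
print(new_sids)  # → [105, 106, 107, 108]
```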
//Delete target directory if it exists
sqoop import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--table mysql_import \
--target-dir /user/umeshp6655/MySql/mysql_import \
--as-textfile \
--split-by sid \
--delete-target-dir \
-m 1
Note: --delete-target-dir deletes the target directory before the import; -m 1 is shorthand for --num-mappers 1.
//Boundary query example
sqoop import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--table mysql_import \
--target-dir /user/umeshp6655/MySql/mysql_import \
--as-textfile \
--split-by sid \
--delete-target-dir \
-m 1 \
--boundary-query "Select min(101), max(105) from mysql_import"
Note: The table does not have a primary key, so we must use --split-by. The split-by column should be indexed, must not contain null values, and only that column should be used in --boundary-query.
//Output:
[umeshp6655@gw02 ~]$ hdfs dfs -cat /user/umeshp6655/MySql/mysql_import/part-m-00000
Patna,1,2014-03-22,101,Bihar
Allahabad,2,2014-03-22,102,UP
Kolkata,4,2014-03-26,103,WB
Panaji,4,2014-03-28,104,Goa
Charlotte,6,2014-03-30,105,NC
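By default Sqoop derives the split bounds by running SELECT MIN(sid), MAX(sid) FROM mysql_import; --boundary-query replaces that query. Since "select min(101), max(105)" just returns the constants 101 and 105, the mapper ranges only cover sids 101 to 105, which is why the output stops at Charlotte. A sketch of that effect:

```python
# The boundary query's result pair (lo, hi) fixes the split bounds;
# rows outside [lo, hi] are never covered by any mapper's range.
lo, hi = 101, 105                      # result of "select min(101), max(105)"
sids = [101, 102, 103, 104, 105, 106, 107, 108]
imported = [s for s in sids if lo <= s <= hi]
print(imported)  # → [101, 102, 103, 104, 105]
```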
//Sqoop import selecting specific columns
sqoop import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--table mysql_import \
--target-dir /user/umeshp6655/MySql/mysql_import \
--as-textfile \
--split-by sid \
--delete-target-dir \
-m 1 \
--boundary-query "Select min(101), max(106) from mysql_import" \
--columns "sid,city,state"
//Output:
[umeshp6655@gw02 ~]$ hdfs dfs -cat /user/umeshp6655/MySql/mysql_import/part-m-00000
101,Patna,Bihar
102,Allahabad,UP
103,Kolkata,WB
104,Panaji,Goa
105,Charlotte,NC
106,Greenville,SC
//Sqoop import using --direct control argument
sqoop import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--table mysql_import \
--target-dir /user/umeshp6655/MySql/mysql_import \
--as-textfile \
--delete-target-dir \
--columns "sid,city,state" \
--direct \
--split-by sid
Note: --direct is supported only for MySQL and PostgreSQL.
Sqoop's direct mode does not support imports of BLOB, CLOB, or LONGVARBINARY columns.
This channel may give higher performance than going through JDBC.
//Sqoop import using --autoreset-to-one-mapper
Tables without a primary key will be imported with one mapper; tables with a primary key use the default number of mappers (4, if not specified in the sqoop command).
We cannot use --split-by with import-all-tables. Quoting the Sqoop user guide: "If a table does not have a primary key defined and the --split-by <col> is not provided, then import will fail unless the number of mappers is explicitly set to one with the --num-mappers 1 option or the --autoreset-to-one-mapper option is used."
The option --autoreset-to-one-mapper is typically used with the import-all-tables tool to automatically handle tables without a primary key in a schema.
sqoop import-all-tables \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--autoreset-to-one-mapper \
--warehouse-dir /user/umeshp6655/MySql/sqoop_import/
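The mapper-selection rule described above can be sketched as a small function. This is illustrative only; `effective_mappers` is not a Sqoop API.

```python
# Sketch of how Sqoop decides the mapper count for a table:
# a primary key or an explicit --split-by allows parallel import;
# otherwise --autoreset-to-one-mapper falls back to a single mapper,
# and without it the import fails.
def effective_mappers(requested, has_primary_key, split_by, autoreset):
    if has_primary_key or split_by is not None:
        return requested          # parallel import is possible
    if autoreset:
        return 1                  # silently reset to one mapper
    raise ValueError("no primary key and no --split-by: "
                     "use -m 1 or --autoreset-to-one-mapper")

print(effective_mappers(4, True, None, False))   # → 4
print(effective_mappers(4, False, None, True))   # → 1
```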
//Sqoop import using --warehouse-dir
sqoop import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--table mysql_import \
--warehouse-dir /user/umeshp6655/mysql_import \
--as-textfile \
--split-by sid \
--delete-target-dir \
-m 2 \
--columns "sid,city,state"
//Sqoop import having where condition
sqoop import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--table mysql_import \
--target-dir /user/umeshp6655/MySql/mysql_import \
--as-textfile \
--split-by sid \
--delete-target-dir \
-m 2 \
--columns "sid,city,state" \
--where "city='Patna'"
//Output
[umeshp6655@gw02 ~]$ hdfs dfs -cat /user/umeshp6655/MySql/mysql_import/part-m-00000
101,Patna,Bihar
//Compression while saving to HDFS
Use --compress (or -z) to enable compression.
Use --compression-codec to select a specific compression codec.
By default Sqoop uses GzipCodec.
sqoop import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--table mysql_import \
--target-dir /user/umeshp6655/MySql/mysql_import \
-m 2 \
--as-textfile \
--compress \
--compression-codec org.apache.hadoop.io.compress.GzipCodec
sqoop import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--table mysql_import \
--target-dir /user/umeshp6655/MySql/mysql_import \
-m 2 \
--as-textfile \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec
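The effect of --compress on the text output can be sketched with Python's standard gzip module: each part file is written through the chosen codec, producing part-m-NNNNN.gz files, and decompressing reverses it exactly.

```python
# Sketch of what the Gzip codec does to a text record: --compress
# writes part files through the codec; gunzip recovers the original.
import gzip

record = b"101,Patna,Bihar\n"
compressed = gzip.compress(record)       # analogous to GzipCodec output
restored = gzip.decompress(compressed)   # what "gunzip part*.gz" recovers
print(restored == record)  # → True
```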
//Sqoop import --null-string or --null-non-string
//Code to import
sqoop import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--table mysql_import \
--target-dir /user/umeshp6655/MySql/mysql_import \
--as-textfile \
--split-by sid \
--delete-target-dir \
-m 1 \
--null-string 'NA' \
--null-non-string '-'
//Output
[umeshp6655@gw02 ~]$ hdfs dfs -cat /user/umeshp6655/MySql/mysql_import/part-m-00000
Patna,1,2014-03-22,101,Bihar
Allahabad,2,2014-03-22,102,UP
Kolkata,4,2014-03-26,103,WB
Panaji,4,2014-03-28,104,Goa
Charlotte,6,2014-03-30,105,NC
Greenville,6,2014-05-30,106,SC
Greenville Yard,7,2014-07-30,107,SC Yard
Atlanta,9,2014-06-24,108,GA
NA,-,2015-05-22,112,Kerala
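The substitution the last output row illustrates can be sketched as follows. Here `render` is a hypothetical helper, not a Sqoop API: NULLs in string columns are written as 'NA' and NULLs in non-string columns as '-'.

```python
# Sketch of --null-string 'NA' / --null-non-string '-': NULL values
# are replaced per column type when records are written as text.
def render(value, is_string_column):
    if value is None:
        return "NA" if is_string_column else "-"
    return str(value)

# Hypothetical row 112 with NULL city and NULL ranked, in the same
# column order as the output above (city,ranked,rDate,sid,state):
row = [(None, True), (None, False), ("2015-05-22", False),
       (112, False), ("Kerala", True)]
print(",".join(render(v, s) for v, s in row))  # → NA,-,2015-05-22,112,Kerala
```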
--warehouse-dir: creates a parent directory that acts as a database directory (here sqoop_db_movies); inside it, a subdirectory named after the table given in the import command is created automatically to hold the imported files.
sqoop import \
--options-file /home/cloudera/sqoop/conn \
--table movies \
--warehouse-dir /sqoop_db_movies -m 1
Output as:
/sqoop_db_movies/movies
/sqoop_db_movies/movies/_SUCCESS
/sqoop_db_movies/movies/part-m-00000
VS
--target-dir: creates a single directory (here sqoop_table_movies) that directly holds the imported files.
sqoop import \
--options-file /home/cloudera/sqoop/conn \
--table movies \
--target-dir /sqoop_table_movies -m 1
Output as:
/sqoop_table_movies/_SUCCESS
/sqoop_table_movies/part-m-00000
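The difference between the two layouts can be sketched as simple path construction. `output_dir` is an illustrative helper, not part of Sqoop.

```python
# Sketch of where imported files land: --target-dir is used as-is,
# while --warehouse-dir gets a per-table subdirectory appended.
import posixpath

def output_dir(table, warehouse_dir=None, target_dir=None):
    if target_dir is not None:
        return target_dir
    return posixpath.join(warehouse_dir, table)

print(output_dir("movies", warehouse_dir="/sqoop_db_movies"))  # → /sqoop_db_movies/movies
print(output_dir("movies", target_dir="/sqoop_table_movies"))  # → /sqoop_table_movies
```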
Command to decompress Gzip (.gz) files: gunzip part*.gz