Wednesday, 29 May 2019

Sqoop Import to HDFS

The import tool imports an individual table from an RDBMS to HDFS. Each row from a table is represented as a separate record in HDFS. Records can be stored as text files (one record per line), or in binary representation as Avro or SequenceFiles.

Sqoop is designed to import tables from a database into HDFS. To do so, you must specify a connect string that describes how to connect to the database. The connect string is similar to a URL, and is communicated to Sqoop with the --connect argument.
Example (test connectivity by listing the databases):
sqoop list-databases \
--connect jdbc:mysql://ms.itversity.com:3306 \
--username username \
--password ******

OR, reading the password from a file in HDFS:
sqoop list-databases \
--connect jdbc:mysql://ms.itversity.com:3306 \
--username username \
--password-file /user/umeshp6655/MySql/sqoop.password
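
//You can likewise confirm access to a specific database by listing its tables (a sketch reusing the credentials from this post):
sqoop list-tables \
--connect jdbc:mysql://ms.itversity.com:3306/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password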

//Some useful control arguments
--append    Append data to an existing dataset in HDFS  
--as-avrodatafile  Imports data to Avro Data Files  
--as-sequencefile  Imports data to SequenceFiles  
--as-textfile    Imports data as plain text (default)  
--as-parquetfile    Imports data to Parquet Files  
--boundary-query <statement> Boundary query to use for creating splits  
--columns <col,col,col…>  Columns to import from table  
--delete-target-dir  Delete the import target directory if it exists 
--fetch-size <n>  Number of entries to read from database at once.  
--inline-lob-limit <n>  Set the maximum size for an inline LOB 
--direct  Use direct connector if exists for the database   
-m,--num-mappers <n>  Use n map tasks to import in parallel  
-e,--query <statement>  Import the results of statement.  
--split-by <column-name>  Column of the table used to split work units. Cannot be used with --autoreset-to-one-mapper option.  
--autoreset-to-one-mapper  Import should use one mapper if a table has no primary key and no split-by column is provided. Cannot be used with --split-by <col> option.  
--table <table-name>  Table to read  
--target-dir <dir>  HDFS destination dir  
--warehouse-dir <dir>  HDFS parent for table destination  
--where <where clause>  WHERE clause to use during import  
-z,--compress  Enable compression  
--compression-codec <c>  Use Hadoop codec (default gzip)  
--null-string <null-string>  The string to be written for a null value for string columns  
--null-non-string <null-string>     The string to be written for a null value for non-string columns 

mysql> select * from mysql_import;
+-----+-----------+-------+--------+------------+
| sid | city      | state | ranked | rDate      |
+-----+-----------+-------+--------+------------+
| 101 | Patna     | Bihar |      1 | 2014-03-23 |
| 102 | Allahabad | UP    |      2 | 2014-03-23 |
| 103 | Kolkata   | WB    |      4 | 2014-03-27 |
| 104 | Panaji    | Goa   |      4 | 2014-03-29 |
+-----+-----------+-------+--------+------------+

//Basic Sqoop import
sqoop import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--table mysql_import \
--target-dir /user/umeshp6655/MySql/mysql_import \
--as-textfile \
--split-by sid
Note: The table doesn't have a primary key, so we have to use --split-by. The split-by column should be indexed and should not contain null values. How the split works is sketched below.
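
//How --split-by divides the work (a conceptual sketch of the queries Sqoop generates, not actual Sqoop output):
//1. Find the range of the split column:
//     SELECT MIN(sid), MAX(sid) FROM mysql_import;    -- here: 101 and 104
//2. Cut that range into --num-mappers intervals; each map task then imports one slice, roughly:
//     SELECT * FROM mysql_import WHERE sid >= 101 AND sid < 102    -- mapper 1, and so on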

//Four more rows have since been inserted into the source table:
mysql> select * from mysql_import;
+-----+-----------------+---------+--------+------------+
| sid | city            | state   | ranked | rDate      |
+-----+-----------------+---------+--------+------------+
| 101 | Patna           | Bihar   |      1 | 2014-03-23 |
| 102 | Allahabad       | UP      |      2 | 2014-03-23 |
| 103 | Kolkata         | WB      |      4 | 2014-03-27 |
| 104 | Panaji          | Goa     |      4 | 2014-03-29 |
| 105 | Charlotte       | NC      |      6 | 2014-03-31 |
| 106 | Greenville      | SC      |      6 | 2014-05-31 |
| 107 | Greenville Yard | SC Yard |      7 | 2014-07-31 |
| 108 | Atlanta         | GA      |      9 | 2014-06-25 |
+-----+-----------------+---------+--------+------------+

8 rows in set (0.00 sec)

//Sqoop Import Incremental append
sqoop import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--table mysql_import \
--target-dir /user/umeshp6655/MySql/mysql_import \
--as-textfile \
--split-by sid \
--check-column ranked \
--incremental append --last-value 4
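
//Re-running an incremental import means tracking the new --last-value yourself. A saved Sqoop job stores and updates it automatically (a minimal sketch; the job name mysql_import_incr is made up):
sqoop job --create mysql_import_incr \
-- import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--table mysql_import \
--target-dir /user/umeshp6655/MySql/mysql_import \
--check-column ranked \
--incremental append \
--last-value 4
//Each execution imports only the new rows and updates the stored last-value:
sqoop job --exec mysql_import_incr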

//Delete target directory if it exists
sqoop import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--table mysql_import \
--target-dir /user/umeshp6655/MySql/mysql_import \
--as-textfile \
--split-by sid \
--delete-target-dir \
-m 1
//--delete-target-dir deletes the target directory if it already exists; -m 1 is shorthand for --num-mappers 1

//Boundary query example
sqoop import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--table mysql_import \
--target-dir /user/umeshp6655/MySql/mysql_import \
--as-textfile \
--split-by sid \
--delete-target-dir \
-m 1 \
--boundary-query "Select min(101), max(105) from mysql_import" 
Note: The table doesn't have a primary key, so we have to use --split-by; the split-by column should be indexed, must not contain null values, and should be the only column referenced in --boundary-query.
//Output:
[umeshp6655@gw02 ~]$ hdfs dfs -cat /user/umeshp6655/MySql/mysql_import/part-m-00000
Patna,1,2014-03-22,101,Bihar
Allahabad,2,2014-03-22,102,UP
Kolkata,4,2014-03-26,103,WB
Panaji,4,2014-03-28,104,Goa
Charlotte,6,2014-03-30,105,NC

//Sqoop import selecting specific columns
sqoop import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--table mysql_import \
--target-dir /user/umeshp6655/MySql/mysql_import \
--as-textfile \
--split-by sid \
--delete-target-dir \
-m 1 \
--boundary-query "Select min(101), max(106) from mysql_import" \
--columns "sid,city,state"
//Output:
[umeshp6655@gw02 ~]$ hdfs dfs -cat /user/umeshp6655/MySql/mysql_import/part-m-00000
101,Patna,Bihar
102,Allahabad,UP
103,Kolkata,WB
104,Panaji,Goa
105,Charlotte,NC
106,Greenville,SC

//Sqoop import using --direct control argument
sqoop import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--table mysql_import \
--target-dir /user/umeshp6655/MySql/mysql_import \
--as-textfile \
--delete-target-dir \
--columns "sid,city,state" \
--direct \
--split-by sid
Note:
--direct is only supported for MySQL and PostgreSQL.
Sqoop's direct mode does not support imports of BLOB, CLOB, or LONGVARBINARY columns.
This channel may give higher performance than JDBC; for MySQL, direct mode delegates to the mysqldump utility.
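
//In direct mode, arguments after a lone "--" at the end of the command are passed through to the underlying tool (mysqldump here). A sketch following the pattern in the Sqoop user guide; the charset value is just an example:
sqoop import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--table mysql_import \
--target-dir /user/umeshp6655/MySql/mysql_import \
--delete-target-dir \
--direct \
--split-by sid \
-- \
--default-character-set=latin1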

//Sqoop import using --autoreset-to-one-mapper
Tables without a primary key will be imported with one mapper; tables with a primary key use the default number of mappers (4, if not specified in the sqoop command).
Note that --split-by cannot be used with import-all-tables. As the Sqoop user guide puts it:
"If a table does not have a primary key defined and --split-by <col> is not provided, then import will fail unless the number of mappers is explicitly set to one with the --num-mappers 1 option or the --autoreset-to-one-mapper option is used."
The --autoreset-to-one-mapper option is typically used with the import-all-tables tool to automatically handle tables without a primary key in a schema.
sqoop import-all-tables \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--autoreset-to-one-mapper \
--warehouse-dir /user/umeshp6655/MySql/sqoop_import/
Note: import-all-tables does not accept --target-dir; use --warehouse-dir, and each table gets its own subdirectory under it. Particular tables can be skipped, as shown below.
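
//Skipping tables with --exclude-tables (a sketch; the table names here are hypothetical):
sqoop import-all-tables \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--autoreset-to-one-mapper \
--warehouse-dir /user/umeshp6655/MySql/sqoop_import/ \
--exclude-tables table1,table2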

//Sqoop import using --warehouse-dir
sqoop import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--table mysql_import \
--warehouse-dir /user/umeshp6655/mysql_import \
--as-textfile \
--split-by sid \
--delete-target-dir \
-m 2 \
--columns "sid,city,state"

//Sqoop import having where condition
sqoop import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--table mysql_import \
--target-dir /user/umeshp6655/MySql/mysql_import \
--as-textfile \
--split-by sid \
--delete-target-dir \
-m 2 \
--columns "sid,city,state" \
--where "city='Patna'"
//Output
[umeshp6655@gw02 ~]$ hdfs dfs -cat /user/umeshp6655/MySql/mysql_import/part-m-00000
101,Patna,Bihar
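
//For conditions more complex than a simple WHERE clause, --query imports the result of an arbitrary SQL statement. Sqoop requires the literal $CONDITIONS token in the WHERE clause and, with more than one mapper, a --split-by column. A minimal sketch against the same table (the target directory name is made up):
sqoop import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--query "select sid, city, state from mysql_import where ranked >= 4 and \$CONDITIONS" \
--target-dir /user/umeshp6655/MySql/mysql_import_query \
--split-by sid \
--delete-target-dir
//Note: --query cannot be combined with --table.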

//Compression while saving to HDFS
Use --compress to enable the compression
Use --compression-codec to use specific compression algorithm

By default it uses GzipCodec.
sqoop import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--table mysql_import \
--target-dir /user/umeshp6655/MySql/mysql_import \
-m 2 \
--as-textfile \
--compress \
--compression-codec org.apache.hadoop.io.compress.GzipCodec
//-z can be used instead of --compress (Gzip is the default codec)
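
//The compressed part files can be inspected in place: hdfs dfs -text decompresses known codecs on the fly (the .gz name assumes the Gzip import above):
hdfs dfs -text /user/umeshp6655/MySql/mysql_import/part-m-00000.gz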

//Using the Snappy codec instead:
sqoop import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--table mysql_import \
--target-dir /user/umeshp6655/MySql/mysql_import \
-m 2 \
--as-textfile \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec

//Sqoop import --null-string or --null-non-string
//Code to import 
sqoop import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--table mysql_import \
--target-dir /user/umeshp6655/MySql/mysql_import \
--as-textfile \
--split-by sid \
--delete-target-dir \
-m 1 \
--null-string 'NA' \
--null-non-string '-'
//Output
[umeshp6655@gw02 ~]$ hdfs dfs -cat /user/umeshp6655/MySql/mysql_import/part-m-00000
Patna,1,2014-03-22,101,Bihar
Allahabad,2,2014-03-22,102,UP
Kolkata,4,2014-03-26,103,WB
Panaji,4,2014-03-28,104,Goa
Charlotte,6,2014-03-30,105,NC
Greenville,6,2014-05-30,106,SC
Greenville Yard,7,2014-07-30,107,SC Yard
Atlanta,9,2014-06-24,108,GA
NA,-,2015-05-22,112,Kerala
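
//Tip: if these files will later be queried from Hive, the Sqoop user guide recommends '\\N' for both options, because Hive represents NULL as \N:
--null-string '\\N' \
--null-non-string '\\N'
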
--warehouse-dir: Creates a parent (database-level) directory, e.g. sqoop_db_movies; a subdirectory named after the imported table is created automatically inside it to hold the imported files.
sqoop import \
--options-file /home/cloudera/sqoop/conn \
--table movies \
--warehouse-dir /sqoop_db_movies -m 1
Output as:
/sqoop_db_movies/movies
/sqoop_db_movies/movies/_SUCCESS
/sqoop_db_movies/movies/part-m-00000
        VS
--target-dir: Creates a single directory that itself acts as the table directory, e.g. sqoop_table_movies, holding the imported files.
sqoop import \
--options-file /home/cloudera/sqoop/conn \
--table movies \
--target-dir /sqoop_table_movies -m 1
Output as:
/sqoop_table_movies/_SUCCESS
/sqoop_table_movies/part-m-00000
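
//The --options-file used above keeps connection arguments in a separate file, one option or value per line (lines starting with # are comments). A sketch of what /home/cloudera/sqoop/conn might contain; the exact contents are an assumption based on the examples in this post:
# connection settings
--connect
jdbc:mysql://ms.itversity.com/retail_export
--username
retail_user
--password-file
/user/umeshp6655/MySql/sqoop.password
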
Command to unzip a Gzip (.gz) file after copying it out of HDFS: gunzip part*.gz
