Wednesday, 29 May 2019

Sqoop Import to HDFS

The import tool imports an individual table from an RDBMS to HDFS. Each row from a table is represented as a separate record in HDFS. Records can be stored as text files (one record per line), or in binary representation as Avro or SequenceFiles.

Sqoop is designed to import tables from a database into HDFS. To do so, you must specify a connect string that describes how to connect to the database. The connect string is similar to a URL, and is communicated to Sqoop with the --connect argument.
Example (test connectivity by listing the databases):
sqoop list-databases \
--connect jdbc:mysql://ms.itversity.com:3306 \
--username username \
--password ******

OR, reading the password from a file in HDFS:
sqoop list-databases \
--connect jdbc:mysql://ms.itversity.com:3306 \
--username username \
--password-file /user/umeshp6655/MySql/sqoop.password
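
//You can likewise confirm access to a specific database by listing its tables (a sketch reusing the credentials from this post):
sqoop list-tables \
--connect jdbc:mysql://ms.itversity.com:3306/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password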

//Some useful control arguments
--append    Append data to an existing dataset in HDFS  
--as-avrodatafile  Imports data to Avro Data Files  
--as-sequencefile  Imports data to SequenceFiles  
--as-textfile    Imports data as plain text (default)  
--as-parquetfile    Imports data to Parquet Files  
--boundary-query <statement> Boundary query to use for creating splits  
--columns <col,col,col…>  Columns to import from table  
--delete-target-dir  Delete the import target directory if it exists 
--fetch-size <n>  Number of entries to read from database at once.  
--inline-lob-limit <n>  Set the maximum size for an inline LOB 
--direct  Use direct connector if exists for the database   
-m,--num-mappers <n>  Use n map tasks to import in parallel  
-e,--query <statement>  Import the results of statement.  
--split-by <column-name>  Column of the table used to split work units. Cannot be used with --autoreset-to-one-mapper option.  
--autoreset-to-one-mapper  Import should use one mapper if a table has no primary key and no split-by column is provided. Cannot be used with --split-by <col> option.  
--table <table-name>  Table to read  
--target-dir <dir>  HDFS destination dir  
--warehouse-dir <dir>  HDFS parent for table destination  
--where <where clause>  WHERE clause to use during import  
-z,--compress  Enable compression  
--compression-codec <c>  Use Hadoop codec (default gzip)  
--null-string <null-string>  The string to be written for a null value for string columns  
--null-non-string <null-string>     The string to be written for a null value for non-string columns 

mysql> select * from mysql_import;
+-----+-----------+-------+--------+------------+
| sid | city      | state | ranked | rDate      |
+-----+-----------+-------+--------+------------+
| 101 | Patna     | Bihar |      1 | 2014-03-23 |
| 102 | Allahabad | UP    |      2 | 2014-03-23 |
| 103 | Kolkata   | WB    |      4 | 2014-03-27 |
| 104 | Panaji    | Goa   |      4 | 2014-03-29 |
+-----+-----------+-------+--------+------------+

//Basic Sqoop import
sqoop import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--table mysql_import \
--target-dir /user/umeshp6655/MySql/mysql_import \
--as-textfile \
--split-by sid
Note: The table doesn't have a primary key, so we have to use --split-by. The split-by column should be indexed and should not contain null values. How the split works is sketched below.
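
//How --split-by divides the work (a conceptual sketch of the queries Sqoop generates, not actual Sqoop output):
//1. Find the range of the split column:
//     SELECT MIN(sid), MAX(sid) FROM mysql_import;    -- here: 101 and 104
//2. Cut that range into --num-mappers intervals; each map task then imports one slice, roughly:
//     SELECT * FROM mysql_import WHERE sid >= 101 AND sid < 102    -- mapper 1, and so on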

//Four more rows have since been inserted into the source table:
mysql> select * from mysql_import;
+-----+-----------------+---------+--------+------------+
| sid | city            | state   | ranked | rDate      |
+-----+-----------------+---------+--------+------------+
| 101 | Patna           | Bihar   |      1 | 2014-03-23 |
| 102 | Allahabad       | UP      |      2 | 2014-03-23 |
| 103 | Kolkata         | WB      |      4 | 2014-03-27 |
| 104 | Panaji          | Goa     |      4 | 2014-03-29 |
| 105 | Charlotte       | NC      |      6 | 2014-03-31 |
| 106 | Greenville      | SC      |      6 | 2014-05-31 |
| 107 | Greenville Yard | SC Yard |      7 | 2014-07-31 |
| 108 | Atlanta         | GA      |      9 | 2014-06-25 |
+-----+-----------------+---------+--------+------------+

8 rows in set (0.00 sec)

//Sqoop Import Incremental append
sqoop import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--table mysql_import \
--target-dir /user/umeshp6655/MySql/mysql_import \
--as-textfile \
--split-by sid \
--check-column ranked \
--incremental append --last-value 4
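
//Re-running an incremental import means tracking the new --last-value yourself. A saved Sqoop job stores and updates it automatically (a minimal sketch; the job name mysql_import_incr is made up):
sqoop job --create mysql_import_incr \
-- import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--table mysql_import \
--target-dir /user/umeshp6655/MySql/mysql_import \
--check-column ranked \
--incremental append \
--last-value 4
//Each execution imports only the new rows and updates the stored last-value:
sqoop job --exec mysql_import_incr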

//Delete target directory if it exists
sqoop import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--table mysql_import \
--target-dir /user/umeshp6655/MySql/mysql_import \
--as-textfile \
--split-by sid \
--delete-target-dir \
-m 1
//--delete-target-dir deletes the target directory if it already exists; -m 1 is shorthand for --num-mappers 1

//Boundary query example
sqoop import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--table mysql_import \
--target-dir /user/umeshp6655/MySql/mysql_import \
--as-textfile \
--split-by sid \
--delete-target-dir \
-m 1 \
--boundary-query "Select min(101), max(105) from mysql_import" 
Note: The table doesn't have a primary key, so we have to use --split-by; the split-by column should be indexed, must not contain null values, and should be the only column referenced in --boundary-query.
//Output:
[umeshp6655@gw02 ~]$ hdfs dfs -cat /user/umeshp6655/MySql/mysql_import/part-m-00000
Patna,1,2014-03-22,101,Bihar
Allahabad,2,2014-03-22,102,UP
Kolkata,4,2014-03-26,103,WB
Panaji,4,2014-03-28,104,Goa
Charlotte,6,2014-03-30,105,NC

//Sqoop import selecting specific columns
sqoop import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--table mysql_import \
--target-dir /user/umeshp6655/MySql/mysql_import \
--as-textfile \
--split-by sid \
--delete-target-dir \
-m 1 \
--boundary-query "Select min(101), max(106) from mysql_import" \
--columns "sid,city,state"
//Output:
[umeshp6655@gw02 ~]$ hdfs dfs -cat /user/umeshp6655/MySql/mysql_import/part-m-00000
101,Patna,Bihar
102,Allahabad,UP
103,Kolkata,WB
104,Panaji,Goa
105,Charlotte,NC
106,Greenville,SC

//Sqoop import using --direct control argument
sqoop import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--table mysql_import \
--target-dir /user/umeshp6655/MySql/mysql_import \
--as-textfile \
--delete-target-dir \
--columns "sid,city,state" \
--direct \
--split-by sid
Note:
--direct is only supported for MySQL and PostgreSQL.
Sqoop's direct mode does not support imports of BLOB, CLOB, or LONGVARBINARY columns.
This channel may give higher performance than JDBC; for MySQL, direct mode delegates to the mysqldump utility.
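
//In direct mode, arguments after a lone "--" at the end of the command are passed through to the underlying tool (mysqldump here). A sketch following the pattern in the Sqoop user guide; the charset value is just an example:
sqoop import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--table mysql_import \
--target-dir /user/umeshp6655/MySql/mysql_import \
--delete-target-dir \
--direct \
--split-by sid \
-- \
--default-character-set=latin1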

//Sqoop import using --autoreset-to-one-mapper
Tables without a primary key will be imported with one mapper; tables with a primary key use the default number of mappers (4, if not specified in the sqoop command).
Note that --split-by cannot be used with import-all-tables. As the Sqoop user guide puts it:
"If a table does not have a primary key defined and --split-by <col> is not provided, then import will fail unless the number of mappers is explicitly set to one with the --num-mappers 1 option or the --autoreset-to-one-mapper option is used."
The --autoreset-to-one-mapper option is typically used with the import-all-tables tool to automatically handle tables without a primary key in a schema.
sqoop import-all-tables \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--autoreset-to-one-mapper \
--warehouse-dir /user/umeshp6655/MySql/sqoop_import/
Note: import-all-tables does not accept --target-dir; use --warehouse-dir, and each table gets its own subdirectory under it. Particular tables can be skipped, as shown below.
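
//Skipping tables with --exclude-tables (a sketch; the table names here are hypothetical):
sqoop import-all-tables \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--autoreset-to-one-mapper \
--warehouse-dir /user/umeshp6655/MySql/sqoop_import/ \
--exclude-tables table1,table2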

//Sqoop import using --warehouse-dir
sqoop import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--table mysql_import \
--warehouse-dir /user/umeshp6655/mysql_import \
--as-textfile \
--split-by sid \
--delete-target-dir \
-m 2 \
--columns "sid,city,state"

//Sqoop import having where condition
sqoop import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--table mysql_import \
--target-dir /user/umeshp6655/MySql/mysql_import \
--as-textfile \
--split-by sid \
--delete-target-dir \
-m 2 \
--columns "sid,city,state" \
--where "city='Patna'"
//Output
[umeshp6655@gw02 ~]$ hdfs dfs -cat /user/umeshp6655/MySql/mysql_import/part-m-00000
101,Patna,Bihar
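
//For conditions more complex than a simple WHERE clause, --query imports the result of an arbitrary SQL statement. Sqoop requires the literal $CONDITIONS token in the WHERE clause and, with more than one mapper, a --split-by column. A minimal sketch against the same table (the target directory name is made up):
sqoop import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--query "select sid, city, state from mysql_import where ranked >= 4 and \$CONDITIONS" \
--target-dir /user/umeshp6655/MySql/mysql_import_query \
--split-by sid \
--delete-target-dir
//Note: --query cannot be combined with --table.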

//Compression while saving to HDFS
Use --compress to enable the compression
Use --compression-codec to use specific compression algorithm

By default it uses GzipCodec.
sqoop import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--table mysql_import \
--target-dir /user/umeshp6655/MySql/mysql_import \
-m 2 \
--as-textfile \
--compress \
--compression-codec org.apache.hadoop.io.compress.GzipCodec
//-z can be used instead of --compress (Gzip is the default codec)
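
//The compressed part files can be inspected in place: hdfs dfs -text decompresses known codecs on the fly (the .gz name assumes the Gzip import above):
hdfs dfs -text /user/umeshp6655/MySql/mysql_import/part-m-00000.gz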

//Using the Snappy codec instead:
sqoop import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--table mysql_import \
--target-dir /user/umeshp6655/MySql/mysql_import \
-m 2 \
--as-textfile \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec

//Sqoop import --null-string or --null-non-string
//Code to import 
sqoop import \
--connect jdbc:mysql://ms.itversity.com/retail_export \
--username retail_user \
--password-file /user/umeshp6655/MySql/sqoop.password \
--table mysql_import \
--target-dir /user/umeshp6655/MySql/mysql_import \
--as-textfile \
--split-by sid \
--delete-target-dir \
-m 1 \
--null-string 'NA' \
--null-non-string '-'
//Output
[umeshp6655@gw02 ~]$ hdfs dfs -cat /user/umeshp6655/MySql/mysql_import/part-m-00000
Patna,1,2014-03-22,101,Bihar
Allahabad,2,2014-03-22,102,UP
Kolkata,4,2014-03-26,103,WB
Panaji,4,2014-03-28,104,Goa
Charlotte,6,2014-03-30,105,NC
Greenville,6,2014-05-30,106,SC
Greenville Yard,7,2014-07-30,107,SC Yard
Atlanta,9,2014-06-24,108,GA
NA,-,2015-05-22,112,Kerala
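
//Tip: if these files will later be queried from Hive, the Sqoop user guide recommends '\\N' for both options, because Hive represents NULL as \N:
--null-string '\\N' \
--null-non-string '\\N'
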
--warehouse-dir: Creates a parent (database-level) directory, e.g. sqoop_db_movies; a subdirectory named after the imported table is created automatically inside it to hold the imported files.
sqoop import \
--options-file /home/cloudera/sqoop/conn \
--table movies \
--warehouse-dir /sqoop_db_movies -m 1
Output as:
/sqoop_db_movies/movies
/sqoop_db_movies/movies/_SUCCESS
/sqoop_db_movies/movies/part-m-00000
        VS
--target-dir: Creates a single directory that itself acts as the table directory, e.g. sqoop_table_movies, holding the imported files.
sqoop import \
--options-file /home/cloudera/sqoop/conn \
--table movies \
--target-dir /sqoop_table_movies -m 1
Output as:
/sqoop_table_movies/_SUCCESS
/sqoop_table_movies/part-m-00000
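
//The --options-file used above keeps connection arguments in a separate file, one option or value per line (lines starting with # are comments). A sketch of what /home/cloudera/sqoop/conn might contain; the exact contents are an assumption based on the examples in this post:
# connection settings
--connect
jdbc:mysql://ms.itversity.com/retail_export
--username
retail_user
--password-file
/user/umeshp6655/MySql/sqoop.password
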
Command to unzip a Gzip (.gz) file after copying it out of HDFS: gunzip part*.gz
