We can update data in relational database tables while performing sqoop export.
- There are two main arguments for updating data during sqoop export: --update-mode and --update-key.
- If we use only --update-key, data will be updated for the keys that already exist in the target table (update-only mode).
- Records whose keys do not exist in the target table are discarded by default.
- If --update-mode is specified as allowinsert but the target table has no primary key, all 364 records will be inserted and none of them will be updated, as the ON DUPLICATE KEY UPDATE clause is ineffective without a unique key.
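The two behaviors above can be illustrated with a minimal simulation (this is plain Python, not Sqoop itself), modeling the target table as a dict keyed by the --update-key column; the dates and revenue values are illustrative.

```python
# Simulate update-only mode vs allowinsert (upsert) mode.
# The target table is modeled as {update_key: value}.

def export_rows(target, rows, allowinsert=False):
    """Apply exported rows to the target table.

    update-only (allowinsert=False): only existing keys are updated,
    rows with new keys are discarded.
    allowinsert=True: rows with new keys are inserted as well.
    """
    for key, value in rows:
        if key in target or allowinsert:
            target[key] = value
    return target

existing = {"2013-07-25": 100.0}
incoming = [("2013-07-25", 150.0), ("2013-07-26", 200.0)]

# update-only: 2013-07-26 is discarded
print(export_rows(dict(existing), incoming))
# allowinsert: 2013-07-25 is updated and 2013-07-26 is inserted
print(export_rows(dict(existing), incoming, allowinsert=True))
```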
Let us perform sqoop export using --update-key
sqoop export \
--connect jdbc:mysql://dburl \
--username retail_user \
--password password \
--export-dir /apps/hive/warehouse/training_sqoop_retail.db/daily_retail \
--table daily_retail \
--update-key order_date
Using update-mode allowinsert
If we only specify --update-key, the export will run in update-only mode.
To ensure that data is inserted when the key does not exist, we have to set --update-mode to allowinsert.
When we perform sqoop export with --update-mode allowinsert into tables in a MySQL database, it builds an INSERT ... ON DUPLICATE KEY UPDATE statement.
This ensures that new records are inserted and existing records are updated based on the primary key.
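As a sketch of the kind of upsert statement generated for MySQL, the helper below builds the statement template from a table name, column list, and update key. The function and its exact output format are illustrative, not Sqoop's actual code; the table and column names follow this tutorial's schema.

```python
# Build an INSERT ... ON DUPLICATE KEY UPDATE template, the statement
# shape used when exporting with --update-mode allowinsert to MySQL.

def build_mysql_upsert(table, columns, update_key):
    """Return an upsert statement template with ? placeholders."""
    placeholders = ", ".join(["?"] * len(columns))
    # Every non-key column is refreshed from the incoming row on key conflict.
    updates = ", ".join(
        f"{c} = VALUES({c})" for c in columns if c != update_key
    )
    return (
        f"INSERT INTO {table} ({', '.join(columns)}) "
        f"VALUES ({placeholders}) "
        f"ON DUPLICATE KEY UPDATE {updates}"
    )

stmt = build_mysql_upsert("daily_retail", ["order_date", "revenue"], "order_date")
print(stmt)
```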
Let us export the data into a table with no primary key using --update-mode allowinsert and check the behavior.
sqoop export \
--connect jdbc:mysql://dburl \
--username retail_user \
--password password \
--export-dir /apps/hive/warehouse/training_sqoop_retail.db/daily_retail \
--table daily_retail \
--update-key order_date \
--update-mode allowinsert
Let us validate against the table where there is no primary key defined.
USE retail_export;
SELECT count(1) FROM daily_retail;
SELECT * FROM daily_retail WHERE order_date LIKE '2013-07-25%';
Let us see the steps to perform sqoop export to merge or upsert into the target table.
- Connect to MySQL
- Switch to the database retail_export
- Truncate the table daily_retail
- Alter the table and add a primary key
- Export data into the table daily_retail
- Delete data beyond 2013-12-31 from daily_retail, so that we can upsert into the target table
Here is the SQL Script to prepare the table for the export in upsert mode.
USE retail_export;
TRUNCATE TABLE daily_retail;
ALTER TABLE daily_retail ADD PRIMARY KEY (order_date);
Here is the first sqoop export command to export data once again into daily_retail.
sqoop export \
--connect jdbc:mysql://dburl \
--username retail_user \
--password password \
--export-dir /apps/hive/warehouse/training_sqoop_retail.db/daily_retail \
--table daily_retail
Let us delete the data beyond 2013-12-31.
USE retail_export;
DELETE FROM daily_retail WHERE order_date > '2013-12-31 00:00:00.0';
Here is the second sqoop export command to export the data in upsert mode.
sqoop export \
--connect jdbc:mysql://dburl \
--username retail_user \
--password password \
--export-dir /apps/hive/warehouse/training_sqoop_retail.db/daily_retail \
--table daily_retail \
--update-key order_date \
--update-mode allowinsert
Now data beyond 2013-12-31 will be inserted, and data up to 2013-12-31 will be updated.
Specifying Columns
Let us see how to perform sqoop export using --columns when the data in HDFS is not consistent with the target table structure in the relational database.
If the target table has more columns than the fields in the HDFS data, or the column order in the target table differs from the order of fields in HDFS, then a typical sqoop export will fail.
In such scenarios, we can successfully export the data by specifying the columns in the order of the fields in HDFS.
Here is an example of performing sqoop export when the data in HDFS is not consistent with the target table structure in the relational database.
sqoop export \
--connect jdbc:mysql://dburl \
--username retail_user \
--password password \
--export-dir /apps/hive/warehouse/training_sqoop_retail.db/daily_revenue01 \
--table daily_retail \
--columns order_date,revenue
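Conceptually, --columns binds each delimited field from HDFS to the named target columns, in order. The sketch below (plain Python, not Sqoop; field values are illustrative) shows that binding and why a field-count mismatch fails.

```python
# Pair HDFS fields with target column names, as --columns does conceptually.
# Target columns not listed are left for the database to default.

def map_record(hdfs_line, columns, delimiter=","):
    """Bind delimited HDFS fields to target column names, in order."""
    fields = hdfs_line.split(delimiter)
    if len(fields) != len(columns):
        raise ValueError("field count does not match --columns list")
    return dict(zip(columns, fields))

row = map_record("2013-07-25 00:00:00.0,68153.83", ["order_date", "revenue"])
print(row)
```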
Specifying Delimiters
Let us see how to specify delimiters while performing sqoop export from HDFS to tables in a relational database.
Delimiters are important when we export data from files in text file format.
By default, sqoop export uses a comma as the field delimiter, but quite often we use other delimiters to save data in HDFS.
When we export data from Hive tables, we might have to deal with \001 (ASCII 1, Ctrl-A), as it is the default field delimiter for Hive tables.
Hence, let us take an example of exporting data with \001 as the delimiter.
We can specify custom delimiters using --input-fields-terminated-by, --input-lines-terminated-by, etc.
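A minimal sketch of why the delimiter matters: parsing a record stored with Hive's default \001 field delimiter (the record values here are illustrative, not from the actual dataset).

```python
# A record as Hive stores it in a text file: fields separated by \001 (Ctrl-A).
record = "2013-07-25 00:00:00.0\x0168153.83"

# Splitting on the default comma delimiter yields one unparsed field ...
assert record.split(",") == [record]

# ... while splitting on \001 recovers the individual fields, which is
# what --input-fields-terminated-by '\001' tells sqoop export to do.
fields = record.split("\x01")
print(fields)  # ['2013-07-25 00:00:00.0', '68153.83']
```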
Let us create a Hive table with default delimiters so that we can demonstrate sqoop export with custom delimiters.
USE training_retail;
CREATE TABLE daily_retail01 AS SELECT * FROM daily_retail;
Let us truncate the table in MySQL using sqoop eval, so that we can re-export the data.
sqoop eval \
--connect jdbc:mysql://dburl \
--username retail_user \
--password password \
-e "truncate table daily_retail"
Here is an example of sqoop export specifying a custom field delimiter.
sqoop export \
--connect jdbc:mysql://dburl \
--username retail_user \
--password password \
--export-dir /apps/hive/warehouse/training_sqoop_retail.db/daily_retail01 \
--table daily_retail \
--input-fields-terminated-by '\001'
Using Staging Table
By default, if there are any exceptions in the map tasks of the sqoop export map reduce job, the failed task will be retried four times.
Due to that, the data in the target table might be partially loaded. At times, cleaning up the partial load and reloading can be tedious.
We can use --staging-table to overcome this issue. Data is first loaded into the staging table; if there are no exceptions, it is then copied from the staging table into the target table.
If data in the staging table was not cleaned up for any reason, we might have to use the additional control argument --clear-staging-table.
--clear-staging-table ensures that data in the staging table is deleted before the export.
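The staging protocol can be sketched as follows (a minimal simulation in plain Python, not Sqoop itself): rows land in a staging list first, and the target is touched only if the whole load succeeds, so a mid-export failure leaves the target unchanged.

```python
# Simulate the --staging-table protocol: load into staging, then copy
# to the target only after a fully successful load.

def export_with_staging(target, staging, rows, fail_after=None):
    """Load rows into staging; copy to target only on full success."""
    staging.clear()  # what --clear-staging-table does before the export
    for i, row in enumerate(rows):
        if fail_after is not None and i >= fail_after:
            raise RuntimeError("map task failed mid-export")
        staging.append(row)
    target.extend(staging)  # copy happens only after a clean load
    staging.clear()

target, staging = [], []
try:
    export_with_staging(target, staging, ["r1", "r2", "r3"], fail_after=2)
except RuntimeError:
    pass
print(target)  # [] -- target untouched after the failed export

export_with_staging(target, staging, ["r1", "r2", "r3"])
print(target)  # ['r1', 'r2', 'r3'] -- rows arrive only on success
```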
Let us prepare the target table to understand the relevance of --staging-table. We need to connect to the MySQL database and delete all the records except one.
USE retail_export;
DELETE FROM daily_retail WHERE order_date != '2013-08-03 00:00:00.0';
COMMIT;
Let us perform the sqoop export into the target table where there is already one record. The export will fail on the duplicate key and partially load the data (you can connect to the MySQL database and run queries against the table to verify).
sqoop export \
--connect jdbc:mysql://dburl \
--username retail_user \
--password password \
--export-dir /apps/hive/warehouse/training_sqoop_retail.db/daily_revenue \
--table daily_retail
Let us create a staging table in MySQL, which will be used while performing sqoop export with --staging-table. Also, we will delete all the records from daily_retail except one.
USE retail_export;
CREATE TABLE daily_retail_stage AS SELECT * FROM daily_retail WHERE 1=2;
DELETE FROM daily_retail WHERE order_date != '2013-08-03 00:00:00.0';
COMMIT;
Now let us perform sqoop export specifying --staging-table. The staging table will be populated successfully, but the copy from the staging table to the final table will fail due to the primary key constraint on the one existing record.
sqoop export \
--connect jdbc:mysql://dburl \
--username retail_user \
--password password \
--export-dir /apps/hive/warehouse/training_sqoop_retail.db/daily_revenue \
--table daily_retail \
--staging-table daily_retail_stage
Now let us clean up the target table by deleting its records, and then run sqoop export with --staging-table along with --clear-staging-table.
USE retail_export;
DELETE FROM daily_retail;
COMMIT;
Here is the sqoop export command to export data into the table in the relational database using --staging-table and --clear-staging-table.
sqoop export \
--connect jdbc:mysql://dburl \
--username retail_user \
--password password \
--export-dir /apps/hive/warehouse/training_sqoop_retail.db/daily_retail \
--table daily_retail \
--staging-table daily_retail_stage \
--clear-staging-table