We can update data in relational database tables while performing sqoop export.
- There are two main arguments for updating data during sqoop export: --update-mode and --update-key.
- If we use only --update-key, data will be updated for the keys that already exist in the target table (update-only mode).
- Records whose keys do not exist in the target table are discarded by default.
- If --update-mode is specified as allowinsert but the target table has no primary key, all 364 records will be inserted and none of them will be updated, as the ON DUPLICATE KEY UPDATE clause is ineffective without a unique key.
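The two behaviors above can be illustrated with a minimal simulation (this is plain Python, not Sqoop itself), modeling the target table as a dict keyed by the --update-key column; the dates and revenue values are illustrative.

```python
# Simulate update-only mode vs allowinsert (upsert) mode.
# The target table is modeled as {update_key: value}.

def export_rows(target, rows, allowinsert=False):
    """Apply exported rows to the target table.

    update-only (allowinsert=False): only existing keys are updated,
    rows with new keys are discarded.
    allowinsert=True: rows with new keys are inserted as well.
    """
    for key, value in rows:
        if key in target or allowinsert:
            target[key] = value
    return target

existing = {"2013-07-25": 100.0}
incoming = [("2013-07-25", 150.0), ("2013-07-26", 200.0)]

# update-only: 2013-07-26 is discarded
print(export_rows(dict(existing), incoming))
# allowinsert: 2013-07-25 is updated and 2013-07-26 is inserted
print(export_rows(dict(existing), incoming, allowinsert=True))
```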
Let us perform sqoop export using --update-key
sqoop export \
--connect jdbc:mysql://dburl \
--username retail_user \
--password password \
--export-dir /apps/hive/warehouse/training_sqoop_retail.db/daily_retail \
--table daily_retail \
--update-key order_date
Using update-mode allowinsert
If we only specify --update-key, the export will run in update-only mode.
To ensure that data is inserted when the key does not exist, we have to set --update-mode to allowinsert.
When we perform sqoop export with --update-mode allowinsert into tables in a MySQL database, it builds an INSERT ... ON DUPLICATE KEY UPDATE statement.
This ensures that new records are inserted and existing records are updated based on the primary key.
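As a sketch of the kind of upsert statement generated for MySQL, the helper below builds the statement template from a table name, column list, and update key. The function and its exact output format are illustrative, not Sqoop's actual code; the table and column names follow this tutorial's schema.

```python
# Build an INSERT ... ON DUPLICATE KEY UPDATE template, the statement
# shape used when exporting with --update-mode allowinsert to MySQL.

def build_mysql_upsert(table, columns, update_key):
    """Return an upsert statement template with ? placeholders."""
    placeholders = ", ".join(["?"] * len(columns))
    # Every non-key column is refreshed from the incoming row on key conflict.
    updates = ", ".join(
        f"{c} = VALUES({c})" for c in columns if c != update_key
    )
    return (
        f"INSERT INTO {table} ({', '.join(columns)}) "
        f"VALUES ({placeholders}) "
        f"ON DUPLICATE KEY UPDATE {updates}"
    )

stmt = build_mysql_upsert("daily_retail", ["order_date", "revenue"], "order_date")
print(stmt)
```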
Let us export the data into a table with no primary key using --update-mode allowinsert and check the behavior.
sqoop export \
--connect jdbc:mysql://dburl \
--username retail_user \
--password password \
--export-dir /apps/hive/warehouse/training_sqoop_retail.db/daily_retail \
--table daily_retail \
--update-key order_date \
--update-mode allowinsert
Let us validate against the table where there is no primary key defined.
USE retail_export;
SELECT count(1) FROM daily_retail;
SELECT * FROM daily_retail WHERE order_date LIKE '2013-07-25%';
Let us see the steps to perform sqoop export to merge or upsert into the target table.
- Connect to MySQL
- Switch to the database retail_export
- Truncate the table daily_retail
- Alter the table and add a primary key
- Export data into the table daily_retail
- Delete data beyond 2013-12-31 from daily_retail, so that we can upsert into the target table
Here is the SQL Script to prepare the table for the export in upsert mode.
USE retail_export;
TRUNCATE TABLE daily_retail;
ALTER TABLE daily_retail ADD PRIMARY KEY (order_date);
Here is the first sqoop export command to export data once again into daily_retail.
sqoop export \
--connect jdbc:mysql://dburl \
--username retail_user \
--password password \
--export-dir /apps/hive/warehouse/training_sqoop_retail.db/daily_retail \
--table daily_retail
Let us delete the data beyond 2013-12-31.
USE retail_export;
DELETE FROM daily_retail WHERE order_date > '2013-12-31 00:00:00.0';
Here is the second sqoop export command to export the data in upsert mode.
sqoop export \
--connect jdbc:mysql://dburl \
--username retail_user \
--password password \
--export-dir /apps/hive/warehouse/training_sqoop_retail.db/daily_retail \
--table daily_retail \
--update-key order_date \
--update-mode allowinsert
Now data beyond 2013-12-31 will be inserted, and data up to 2013-12-31 will be updated.
Specifying Columns
Let us see how to perform sqoop export using --columns when the data in HDFS is not consistent with the target table structure in the relational database.
If the target table has more columns than the fields in the HDFS data, or the column order in the target table differs from the order of fields in HDFS, then a typical sqoop export will fail.
In such scenarios, we can successfully export the data by specifying the columns in the order of the fields in HDFS.
Here is an example of performing sqoop export when the data in HDFS is not consistent with the target table structure in the relational database.
sqoop export \
--connect jdbc:mysql://dburl \
--username retail_user \
--password password \
--export-dir /apps/hive/warehouse/training_sqoop_retail.db/daily_revenue01 \
--table daily_retail \
--columns order_date,revenue
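Conceptually, --columns binds each delimited field from HDFS to the named target columns, in order. The sketch below (plain Python, not Sqoop; field values are illustrative) shows that binding and why a field-count mismatch fails.

```python
# Pair HDFS fields with target column names, as --columns does conceptually.
# Target columns not listed are left for the database to default.

def map_record(hdfs_line, columns, delimiter=","):
    """Bind delimited HDFS fields to target column names, in order."""
    fields = hdfs_line.split(delimiter)
    if len(fields) != len(columns):
        raise ValueError("field count does not match --columns list")
    return dict(zip(columns, fields))

row = map_record("2013-07-25 00:00:00.0,68153.83", ["order_date", "revenue"])
print(row)
```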
Specifying Delimiters
Let us see how to specify delimiters while performing sqoop export from HDFS to tables in a relational database.
Delimiters are important when we export data from files in text file format.
By default, sqoop export uses a comma as the field delimiter, but quite often we use other delimiters to save data in HDFS.
When we export data from Hive tables, we might have to deal with \001 (ASCII 1, Ctrl-A), as it is the default field delimiter for Hive tables.
Hence, let us take an example of exporting data with \001 as the delimiter.
We can specify custom delimiters using --input-fields-terminated-by, --input-lines-terminated-by, etc.
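A minimal sketch of why the delimiter matters: parsing a record stored with Hive's default \001 field delimiter (the record values here are illustrative, not from the actual dataset).

```python
# A record as Hive stores it in a text file: fields separated by \001 (Ctrl-A).
record = "2013-07-25 00:00:00.0\x0168153.83"

# Splitting on the default comma delimiter yields one unparsed field ...
assert record.split(",") == [record]

# ... while splitting on \001 recovers the individual fields, which is
# what --input-fields-terminated-by '\001' tells sqoop export to do.
fields = record.split("\x01")
print(fields)  # ['2013-07-25 00:00:00.0', '68153.83']
```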
Let us create a Hive table with default delimiters so that we can demonstrate sqoop export with custom delimiters.
USE training_retail;
CREATE TABLE daily_retail01 AS SELECT * FROM daily_retail;
Let us truncate the table in MySQL using sqoop eval, so that we can re-export the data.
sqoop eval \
--connect jdbc:mysql://dburl \
--username retail_user \
--password password \
-e "truncate table daily_retail"
Here is an example of sqoop export specifying a custom field delimiter.
sqoop export \
--connect jdbc:mysql://dburl \
--username retail_user \
--password password \
--export-dir /apps/hive/warehouse/training_sqoop_retail.db/daily_retail01 \
--table daily_retail \
--input-fields-terminated-by '\001'
Using Staging Table
By default, if there are any exceptions in the map tasks of the sqoop export map reduce job, the failed task will be retried four times.
Due to that, the data in the target table might be partially loaded. At times, cleaning up the partial load and reloading can be tedious.
We can use --staging-table to overcome this issue. Data is first loaded into the staging table; if there are no exceptions, it is then copied from the staging table into the target table.
If data in the staging table was not cleaned up for any reason, we might have to use the additional control argument --clear-staging-table.
--clear-staging-table ensures that data in the staging table is deleted before the export.
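The staging protocol can be sketched as follows (a minimal simulation in plain Python, not Sqoop itself): rows land in a staging list first, and the target is touched only if the whole load succeeds, so a mid-export failure leaves the target unchanged.

```python
# Simulate the --staging-table protocol: load into staging, then copy
# to the target only after a fully successful load.

def export_with_staging(target, staging, rows, fail_after=None):
    """Load rows into staging; copy to target only on full success."""
    staging.clear()  # what --clear-staging-table does before the export
    for i, row in enumerate(rows):
        if fail_after is not None and i >= fail_after:
            raise RuntimeError("map task failed mid-export")
        staging.append(row)
    target.extend(staging)  # copy happens only after a clean load
    staging.clear()

target, staging = [], []
try:
    export_with_staging(target, staging, ["r1", "r2", "r3"], fail_after=2)
except RuntimeError:
    pass
print(target)  # [] -- target untouched after the failed export

export_with_staging(target, staging, ["r1", "r2", "r3"])
print(target)  # ['r1', 'r2', 'r3'] -- rows arrive only on success
```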
Let us prepare the target table to understand the relevance of --staging-table. We need to connect to the MySQL database and delete all the records except one.
USE retail_export;
DELETE FROM daily_retail WHERE order_date != '2013-08-03 00:00:00.0';
COMMIT;
Let us perform the sqoop export into the target table where there is already one record. The export will fail on the duplicate key and partially load the data (you can connect to the MySQL database and run queries against the table to verify).
sqoop export \
--connect jdbc:mysql://dburl \
--username retail_user \
--password password \
--export-dir /apps/hive/warehouse/training_sqoop_retail.db/daily_revenue \
--table daily_retail
Let us create a staging table in MySQL, which will be used while performing sqoop export with --staging-table. Also, we will delete all the records from daily_retail except one.
USE retail_export;
CREATE TABLE daily_retail_stage AS SELECT * FROM daily_retail WHERE 1=2;
DELETE FROM daily_retail WHERE order_date != '2013-08-03 00:00:00.0';
COMMIT;
Now let us perform sqoop export specifying --staging-table. The staging table will be populated successfully, but the copy from the staging table to the final table will fail due to the primary key constraint on the one existing record.
sqoop export \
--connect jdbc:mysql://dburl \
--username retail_user \
--password password \
--export-dir /apps/hive/warehouse/training_sqoop_retail.db/daily_revenue \
--table daily_retail \
--staging-table daily_retail_stage
Now let us clean up the target table by deleting its records, and then run sqoop export with --staging-table along with --clear-staging-table.
USE retail_export;
DELETE FROM daily_retail;
COMMIT;
Here is the sqoop export command to export data into the table in the relational database using --staging-table and --clear-staging-table.
sqoop export \
--connect jdbc:mysql://dburl \
--username retail_user \
--password password \
--export-dir /apps/hive/warehouse/training_sqoop_retail.db/daily_retail \
--table daily_retail \
--staging-table daily_retail_stage \
--clear-staging-table