Sqoop Import to Hive Database
Create Hive Database
Type hive in the command shell:
[umeshp6655@gw02 ~]$ hive
Logging initialized using configuration in file:/usr/hdp/2.6.5.0-292/hive/conf/hive-log4j.properties
hive (default)> create database umeshp6655_sqoop_import;
OK
Time taken: 2.09 seconds
hive (default)> use umeshp6655_sqoop_import;
OK
Time taken: 0.253 seconds
hive (umeshp6655_sqoop_import)> create table t(i int);
OK
Time taken: 0.448 seconds
-- Dropping directories, in case your hive database/tables are in an inconsistent state
hadoop fs -rm -R /user/hive/warehouse/departments
Simple hive import using sqoop import command
sqoop import \
--connect jdbc:mysql://db_path \
--username username \
--password ****** \
--table $table_name \
--hive-import \
--hive-database dbname \
--hive-table table_name \
--num-mappers 2
// Data is stored at /user/username/tablename (the default directory).
// In hive: show tables; describe formatted table_name; -- read the table's location from the output.
// hadoop fs -ls <location>
// hadoop fs -get <location> umeshp6655_import_sqoop  -- copies the data to the local machine
// cd into that directory and run `view part-m-00000` -- the file shows the format (e.g. delimiters) in which the data is stored.
For checking the data:
1) Exit hive with the exit command.
2) From the gateway shell, fetch the files from HDFS with:
hdfs dfs -get filepath .
3) Move to that path with cd:
cd filepath
4) Once you are in the directory into which the data was imported, type:
ls -ltr
The imported files are displayed.
5) To display a file's contents, use:
view file_name
This shows the format in which the data is stored.
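The verification steps above can be sketched end to end. This is a local simulation only: a plain directory stands in for the files fetched from HDFS with `hdfs dfs -get`, `cat` stands in for `view`, and the sample rows are invented for illustration.

```shell
# Local simulation of the verification flow: ./departments stands in
# for the directory fetched with `hdfs dfs -get <filepath> .`,
# and the sample rows are invented for illustration.
mkdir -p departments
printf '2,Fitness\n3,Footwear\n' > departments/part-m-00000
cd departments
ls -ltr           # lists the imported part files
cat part-m-00000  # shows the stored rows and their delimiters
cd ..
```

On a real cluster the same sequence applies; only the first two lines are replaced by the actual `hdfs dfs -get` of the imported table directory.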
Simple hive import using sqoop import command to overwrite hive tables
--hive-overwrite : It will overwrite the existing data in the Hive table.
sqoop import \
--connect jdbc:mysql://db_path \
--username username \
--password ****** \
--table $table_name \
--hive-import \
--hive-database dbname \
--hive-table table_name \
--hive-overwrite \
--num-mappers 2
Hive Arguments:

--hive-import               Import tables into Hive (uses Hive's default delimiters if none are set).
--hive-overwrite            Overwrite existing data in the Hive table.
--create-hive-table         If set, the job fails if the target Hive table already exists. By default this property is false.
--hive-table <table-name>   Sets the table name to use when importing into Hive.
--hive-drop-import-delims   Drops \n, \r, and \01 from string fields when importing to Hive.
--hive-delims-replacement   Replaces \n, \r, and \01 in string fields with a user-defined string when importing to Hive.
--hive-partition-key        Name of the Hive field the partitions are sharded on.
--hive-partition-value <v>  String value that serves as the partition key for this import job.
--map-column-hive <map>     Override the default mapping from SQL type to Hive type for the configured columns.
Note : --hive-overwrite and --create-hive-table are mutually exclusive; only one of them can be used.
--create-hive-table : If set, the job will fail if the target Hive table already exists.
sqoop import \
--connect jdbc:mysql://db_path \
--username username \
--password ****** \
--table $table_name \
--hive-import \
--hive-database dbname \
--hive-table table_name \
--create-hive-table \
--num-mappers 2
// This run fails because the table has already been created at the respective location.
import-all-tables
Limitations:
--warehouse-dir is mandatory.
Better to use --autoreset-to-one-mapper (in case a table has no primary key).
Arguments that filter or transform the data, such as --query, --columns, and --where, cannot be specified, and incremental import is not possible.
sqoop import-all-tables \
--connect jdbc:mysql://db_path \
--username username \
--password ****** \
--warehouse-dir /user/umeshp6655/sqoop_import/retail_db \
--autoreset-to-one-mapper \
--exclude-tables <tables>
--hive-partition-key : name of the Hive field the partitions are sharded on.
--hive-partition-value : value that serves as the partition key for this import job.
The partition key category_deptt_id cannot also be one of the imported columns.
sqoop import \
--connect jdbc:mysql://db_path \
--username username \
--password ****** \
--table $table_name \
--hive-import \
--hive-database dbname \
--hive-table table_name \
--columns=category_id,category_name \
--where "category_deptt_id=8" \
--hive-overwrite \
--hive-partition-key category_deptt_id \
--hive-partition-value 8
--map-column-hive <map> : Override the default mapping from SQL type to Hive type for the configured columns.
sqoop import \
--connect jdbc:mysql://db_path \
--username username \
--password ****** \
--table $table_name \
--hive-import \
--hive-database dbname \
--hive-table table_name \
--hive-overwrite \
--map-column-hive column_id1=bigint,column_id2=bigint
// Note: do not put a space after the comma (as in "column_id1=bigint, column_id2=bigint");
// the shell would split the argument and Sqoop would report an error.
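The space problem is plain shell word splitting, not anything Sqoop-specific; it can be seen locally without running sqoop at all:

```shell
# With a space after the comma, the shell splits the value into two
# words, so Sqoop would see a stray "column_id2=bigint" argument:
set -- --map-column-hive column_id1=bigint, column_id2=bigint
echo $#   # 3

# Without the space, the mapping travels as a single argument:
set -- --map-column-hive column_id1=bigint,column_id2=bigint
echo $#   # 2
```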
Do the import and store the file in Parquet file format:
@9
sqoop import \
--connect jdbc:mysql://db_path \
--username username \
--password ****** \
--table $table_name \
--hive-import \
--hive-database hive_dbname \
--as-parquetfile
The query will fail with an error saying the file format is missing.
To overcome this, first create the table in Hive and store it as Parquet, like below:
hive> CREATE TABLE Table_name(table_col1 INT,
table_col2 INT, table_col3 INT,
table_col4 FLOAT) STORED AS PARQUETFILE;
Now re-run the @9 import command; it will import the data successfully.
Then run a select query in Hive to check whether the data was imported successfully.
Note : if a hive `select count(1) from table_name` query returns zero, then run
SET hive.compute.query.using.stats = false;
and re-run the select query; it will give you the total count.
Do a Sqoop import into Hive in Avro data file format:
sqoop import \
-Dmapreduce.job.user.classpath.first=true \
--connect jdbc:mysql://db_path \
--username username \
--password ****** \
--table $table_name \
--hive-import \
--hive-database hive_dbname \
--as-avrodatafile
Sqoop can only import delimited data directly into a Hive table; it cannot import Avro data directly into one. If you want to import Avro data, follow these steps:
Do the sqoop import using --as-avrodatafile. This also creates an .avsc (schema) file in the directory from which the sqoop command is run.
Move the .avsc file into HDFS. Create an external Hive table that points to the imported data, and point its schema at the .avsc file:
CREATE EXTERNAL TABLE departments
STORED AS AVRO
LOCATION "/user/umeshp6655/deptt"
TBLPROPERTIES('avro.schema.url'='.avsc file location');
--hive-drop-import-delims : drops \n, \r, and \01 from string fields when importing to Hive. We can instruct Sqoop to clean the data automatically using --hive-drop-import-delims, which removes all \n, \r, and \01 characters from all string-based columns:
sqoop import \
--connect jdbc:mysql://db_path \
--username username \
--password ****** \
--table $table_name \
--hive-database dbname \
--hive-table table_name \
--hive-import \
--hive-drop-import-delims
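The effect of the flag on a single string field can be simulated locally with tr; the sample bytes are invented, and \001 is the \01 delimiter byte:

```shell
# Strip \n, \r, and \01 from a sample field, as
# --hive-drop-import-delims would do during the import:
printf 'line1\nline2\rend\001tail' | tr -d '\n\r\001'
# -> line1line2endtail
```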
--hive-delims-replacement : replaces \n, \r, and \01 in string fields with a user-defined string when importing to Hive. If removing the special characters is not an option in your use case, you can take advantage of the parameter --hive-delims-replacement, which accepts a replacement string.
Instead of removing the separators completely, they are replaced with the specified string. The following example replaces all \n, \r, and \01 characters with the string SPECIAL:
sqoop import \
--connect jdbc:mysql://db_path \
--username username \
--password ****** \
--table $table_name \
--hive-database dbname \
--hive-table table_name \
--hive-import \
--hive-delims-replacement "SPECIAL"
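Likewise, the replacement behaviour can be simulated locally (this sketch uses bash string substitution on an invented sample field; it is not a sqoop run):

```shell
# Replace \n, \r, and \01 in a sample field with SPECIAL, as
# --hive-delims-replacement "SPECIAL" would do during the import:
s=$'line1\nline2\rend\x01tail'
s=${s//$'\n'/SPECIAL}
s=${s//$'\r'/SPECIAL}
s=${s//$'\x01'/SPECIAL}
echo "$s"
# -> line1SPECIALline2SPECIALendSPECIALtail
```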
Do a Sqoop import into Hive in sequence file format:
sqoop import \
-Dmapreduce.job.user.classpath.first=true \
--connect jdbc:mysql://db_path \
--username username \
--password ****** \
--table $table_name \
--hive-import \
--hive-database hive_dbname \
--as-sequencefile
The query will fail with an error saying the file format is missing.
To overcome this, create the table in Hive stored as a sequence file, using the Hive SerDe for accessing Sqoop sequence files: generate the sequence file with sqoop, then create the table in Hive. An example DDL statement would be:
CREATE EXTERNAL TABLE table_name (id INT,name STRING)
ROW FORMAT SERDE 'com.cloudera.sqoop.contrib.FieldMappableSerDe'
WITH SERDEPROPERTIES (
"fieldmappable.classname"
= "name.of.FieldMappable.generated.by.sqoop"
)
STORED AS SEQUENCEFILE
LOCATION "hdfs://hdfs.server/path/to/sequencefile";
Now, run the import again; it will work.