Tuesday, 30 July 2019

Sqoop Import to Hive Database

Create Hive Database
Type hive in the command shell:
[umeshp6655@gw02 ~]$ hive
Logging initialized using configuration in file:/usr/hdp/2.6.5.0-292/hive/conf/hive-log4j.properties
hive (default)> create database umeshp6655_sqoop_import;
OK
Time taken: 2.09 seconds
hive (default)> use umeshp6655_sqoop_import;
OK
Time taken: 0.253 seconds
hive (umeshp6655_sqoop_import)> create table t(i int);
OK
Time taken: 0.448 seconds
-- Drop the directories in case your Hive database/tables are in an inconsistent state
hadoop fs -rm -R /user/hive/warehouse/departments
Hive Arguments:

--hive-import : Import tables into Hive (uses Hive's default delimiters if none are set).
--hive-overwrite : Overwrite existing data in the Hive table.
--create-hive-table : If set, the job will fail if the target Hive table already exists. By default this property is false.
--hive-table <table-name> : Sets the table name to use when importing to Hive.
--hive-drop-import-delims : Drops \n, \r, and \01 from string fields when importing to Hive.
--hive-delims-replacement : Replaces \n, \r, and \01 in string fields with a user-defined string when importing to Hive.
--hive-partition-key : Name of the Hive field on which partitions are sharded.
--hive-partition-value <v> : String value that serves as the partition key for data imported into Hive in this job.
--map-column-hive <map> : Override the default mapping from SQL type to Hive type for configured columns.

Simple hive import using sqoop import command
sqoop import \
--connect jdbc:mysql://db_path \
--username username \
--password ****** \
--table $table_name \
--hive-import \
--hive-database dbname \
--hive-table table_name \
--num-mappers 2
//data is stored at /user/username/tablename (the default directory)
//in Hive: show tables;  then  describe formatted table_name;  and read the Location value that is displayed
//hadoop fs -ls location_value
//hadoop fs -get location_value umeshp6655_import_sqoop    \\to get the data onto the local machine
//cd into that directory and type view part-m-00000    \\in the file you can see the format in which the data is stored, e.g. the delimiters
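When no delimiters are set, Hive's default field separator is the non-printing \01 (Ctrl-A) character, which is what sits between columns in part-m-00000. A quick local sketch of how such a row can be split (the file name and values here are illustrative, not from a real import):

```shell
# A row written with Hive's default delimiters uses \001 (Ctrl-A) between fields.
printf '1\001Electronics\n' > part-m-00000

# Split out the second field with awk, passing the \001 delimiter explicitly:
awk -F "$(printf '\001')" '{print $2}' part-m-00000
# prints: Electronics
```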


For checking data
1) Come out of Hive by typing the exit; command
2) When you are back at the shell, copy the imported files from HDFS with the command below:
hdfs dfs -get filepath .
3) Move to that path using the cd command
cd filepath
4) Once you are in the directory into which the data was copied, type the command below:
ls -ltr
The imported files are displayed.
5) To display a file's contents, use the command below:
view file_name
It will display the format in which the data is stored.

Simple hive import using sqoop import command to overwrite the hive table
--hive-overwrite : overwrites the existing data in the Hive table.
sqoop import \
--connect jdbc:mysql://db_path \
--username username \
--password ****** \
--table $table_name \
--hive-import \
--hive-database dbname \
--hive-table table_name \
--hive-overwrite \
--num-mappers 2

Note : use only one of --hive-overwrite and --create-hive-table; they are mutually exclusive.
--create-hive-table : If set, the job will fail if the target Hive table already exists.

sqoop import \
--connect jdbc:mysql://db_path \
--username username \
--password ****** \
--table $table_name \
--hive-import \
--hive-database dbname \
--hive-table table_name \
--create-hive-table \    //fails if the table was already created at the target location
--num-mappers 2

import-all-tables
Limitations :
- --warehouse-dir is mandatory.
- Use --autoreset-to-one-mapper if a table has no primary key.
- Arguments that filter or transform the data, such as --query, --columns, and --where, cannot be specified.
- Incremental import is not possible.
sqoop import-all-tables \
  --connect jdbc:mysql://db_path \
  --username username \
  --password ****** \
  --warehouse-dir /user/umeshp6655/sqoop_import/retail_db \
  --autoreset-to-one-mapper

--exclude-tables <tables>
sqoop import-all-tables \
  --connect jdbc:mysql://db_path \
  --username username \
  --password ****** \
  --warehouse-dir /user/umeshp6655/sqoop_import/retail_db \
  --autoreset-to-one-mapper \
  --exclude-tables table_name1,table_name2

--hive-drop-import-delims : Drops \n, \r, and \01 from string fields when importing to Hive. We can instruct Sqoop to automatically clean the data using --hive-drop-import-delims, which removes all \n, \r, and \01 characters from string-based columns:
sqoop import \
--connect jdbc:mysql://db_path \
--username username \
--password ****** \
--table $table_name \
--hive-database dbname \
--hive-table table_name \
--hive-import \
--hive-drop-import-delims
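What --hive-drop-import-delims does to a single field value can be sketched locally with tr (the sample string is illustrative):

```shell
# Sqoop strips \n, \r, and \01 from string columns; tr -d does the same locally:
printf 'line1\nline2\rend\001tail' | tr -d '\n\r\001'
# prints: line1line2endtail
```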

--hive-delims-replacement : Replaces \n, \r, and \01 in string fields with a user-defined string when importing to Hive. If removing the special characters is not an option in your use case, you can take advantage of the parameter --hive-delims-replacement, which accepts a replacement string.
Instead of removing the separators completely, they are replaced with the specified string. The following example replaces all \n, \r, and \01 characters with the string SPECIAL:
sqoop import \
--connect jdbc:mysql://db_path \
--username username \
--password ****** \
--table $table_name \
--hive-database dbname \
--hive-table table_name \
--hive-import \
--hive-delims-replacement "SPECIAL"
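The replacement behaviour can be sketched locally with sed (GNU sed assumed, since it understands \r and \x01 escapes; the sample value is illustrative):

```shell
# Replace each \r and \01 in a sample field value with the string SPECIAL,
# mimicking what --hive-delims-replacement "SPECIAL" does during the import:
printf 'a\rb\001c' | sed 's/\r/SPECIAL/g; s/\x01/SPECIAL/g'
# prints: aSPECIALbSPECIALc
```

(Embedded \n characters are handled the same way by Sqoop; they are left out of this sketch because sed processes input line by line.)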

--hive-partition-key : Name of the Hive field on which partitions are sharded.
--hive-partition-value : String value that serves as the partition key for data imported into Hive in this job.
Note: the partition key (category_deptt_id here) cannot also be one of the columns to import.
sqoop import \
--connect jdbc:mysql://db_path \
--username username \
--password ****** \
--table $table_name \
--hive-import \
--hive-database dbname \
--hive-table table_name \
--columns=category_id,category_name \
--where "category_deptt_id=8" \
--hive-overwrite \
--hive-partition-key category_deptt_id \
--hive-partition-value 8

--map-column-hive <map> : Override the default mapping from SQL type to Hive type for configured columns.
sqoop import \
--connect jdbc:mysql://db_path \
--username username \
--password ****** \
--table $table_name \
--hive-import \
--hive-database dbname \
--hive-table table_name \
--hive-overwrite \
--map-column-hive column_id1=bigint,column_id2=bigint //If you put a space between the two columns, as in --map-column-hive column_id1=bigint, column_id2=bigint, the command gives an error; remove the space between the two columns to avoid it.
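The space problem is ordinary shell word splitting, not a Sqoop quirk; a minimal sketch using the shell's own argument counter:

```shell
# With a space after the comma the shell splits the value into two words,
# so sqoop sees a stray "column_id2=bigint" argument and errors out:
set -- --map-column-hive column_id1=bigint, column_id2=bigint
echo $#    # prints: 3

# Without the space, the whole mapping is one argument, as intended:
set -- --map-column-hive column_id1=bigint,column_id2=bigint
echo $#    # prints: 2
```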

Do an import and store the data in Parquet file format:
@9
sqoop import \
--connect jdbc:mysql://db_path \
--username username \
--password ****** \
--table $table_name \
--hive-import \
--hive-database hive_dbname \
--as-parquetfile
The query will give an error saying that a file format is missing.
To overcome this, first create the table in Hive stored as Parquet, like below:
hive> CREATE TABLE Table_name(table_col1 INT,
table_col2 INT, table_col3 INT,
table_col4 FLOAT) STORED AS PARQUET;
Now run the @9 import command again; it will import the data successfully.
Now run a select query in Hive to check whether the data was imported successfully.
Note : if select count(1) from table_name returns zero, run
SET hive.compute.query.using.stats=false;
and re-run the select query; it will return the correct count.

Do a Sqoop import into Hive in Avro data file format :
sqoop import \
-Dmapreduce.job.user.classpath.first=true \
--connect jdbc:mysql://db_path \
--username username \
--password ****** \
--table $table_name \
--hive-import \
--hive-database hive_dbname \
--as-avrodatafile

Sqoop hive import can only load delimited data into a Hive table; it cannot import Avro data directly into a Hive table. If you want to import Avro data, follow these steps:
1) Do a sqoop import using --as-avrodatafile. This also creates an .avsc (schema) file in the directory from which the sqoop command is run.
2) Move the .avsc file into HDFS.
3) Create an external Hive table that points to the imported data, with its schema pointing to the .avsc file:
CREATE EXTERNAL TABLE departments
STORED AS AVRO
LOCATION "/user/umeshp6655/deptt"
TBLPROPERTIES('avro.schema.url'='.avsc file location');

Do a Sqoop import into Hive as sequence file format :
sqoop import \
-Dmapreduce.job.user.classpath.first=true \
--connect jdbc:mysql://db_path \
--username username \
--password ****** \
--table $table_name \
--hive-import \
--hive-database hive_dbname \
--as-sequencefile
The query will give an error saying that a file format is missing.

To overcome this, create the table in Hive stored as a sequence file. A Hive SerDe is needed to access Sqoop sequence files: generate the sequence file with sqoop, then create the table in Hive. An example DDL statement would be:
CREATE EXTERNAL TABLE table_name (id INT,name STRING)
ROW FORMAT SERDE 'com.cloudera.sqoop.contrib.FieldMappableSerDe'
WITH SERDEPROPERTIES (
  "fieldmappable.classname" = "name.of.FieldMappable.generated.by.sqoop"
)
STORED AS SEQUENCEFILE
LOCATION "hdfs://hdfs.server/path/to/sequencefile";
Now re-run the import command; it will work.

