Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and external datastores such as relational databases and enterprise data warehouses.

Sqoop is used to import data from external datastores into the Hadoop Distributed File System (HDFS) or related Hadoop ecosystems such as Hive and HBase. Similarly, Sqoop can also be used to extract data from Hadoop or its ecosystems and export it to external datastores such as relational databases and enterprise data warehouses. Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, etc.
What is Sqoop used for?
For Hadoop developers, the interesting work starts after data is loaded into HDFS. Developers explore the data in order to find the insights concealed in that Big Data. For this, data residing in relational database management systems needs to be transferred to HDFS, worked on, and possibly transferred back to the relational database management systems. In the reality of the Big Data world, developers find this transfer of data between relational database systems and HDFS uninteresting and tedious, yet frequently required. Developers can always write custom scripts to transfer data in and out of Hadoop, but Apache Sqoop provides an alternative.
Sqoop automates most of this process, relying on the database to describe the schema of the data to be imported. Sqoop uses the MapReduce framework to import and export the data, which provides parallelism as well as fault tolerance. Sqoop makes developers' lives easier by providing a command-line interface: developers only need to supply basic information such as the source, destination, and database authentication details in the sqoop command, and Sqoop takes care of the rest.
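For example, a basic import command might look like the following sketch. The host, database, credentials, table name, and target directory are placeholder values; `--connect`, `--username`, `-P`, `--table`, `--target-dir`, and `-m` are standard Sqoop options.

```shell
# Import the "employees" table from MySQL into HDFS using 4 parallel mappers.
# Connection details below are placeholders for illustration.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/company \
  --username dbuser \
  -P \
  --table employees \
  --target-dir /user/hadoop/employees \
  -m 4
```

Here `-P` prompts for the password interactively, and `-m 4` asks Sqoop to run four mappers in parallel.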
Sqoop provides many salient features, such as:
1. Full load
2. Incremental load
3. Parallel import/export
4. Import of SQL query results
5. Compression
6. Connectors for all major RDBMS databases
7. Kerberos security integration
8. Loading data directly into Hive/HBase
9. Support for Accumulo
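The incremental-load feature above can be sketched with an append-mode import. Connection details are placeholders, but `--incremental`, `--check-column`, and `--last-value` are the standard Sqoop options for incremental imports.

```shell
# Import only rows whose "id" is greater than the last value already imported.
# Connection details below are placeholders for illustration.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/company \
  --username dbuser \
  -P \
  --table employees \
  --incremental append \
  --check-column id \
  --last-value 10000
```

On completion Sqoop reports the new `--last-value` to use for the next incremental run.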
Sqoop Architecture
Sqoop provides a command-line interface to end users, and it can also be accessed using Java APIs. A Sqoop command submitted by the end user is parsed by Sqoop, which then launches a map-only Hadoop job to import or export the data; a Reduce phase is required only when aggregations are needed, and Sqoop just imports and exports the data without performing any aggregations.
Sqoop
parses the arguments provided in the command line and prepares the Map job. Map
job launch multiple mappers depends on the number defined by user in the
command line. For Sqoop import, each mapper task will be assigned with part of
data to be imported based on key defined in the command line. Sqoop distributes
the input data among the mappers equally to get high performance. Then each
mapper creates connection with the database using JDBC and fetches the part of
data assigned by Sqoop and writes it into HDFS or Hive or HBase based on the
option provided in the command line.
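The way Sqoop divides work among mappers can be sketched as range partitioning over the split key. This is a simplified illustration of the idea, not Sqoop's actual implementation; the function name and numeric-key assumption are mine.

```python
def split_ranges(min_val, max_val, num_mappers):
    """Divide the key range [min_val, max_val] into num_mappers contiguous
    slices, mirroring how Sqoop assigns each mapper a slice of the split
    column. Each mapper would then run a query like:
        SELECT ... WHERE split_col >= lo AND split_col <= hi
    """
    size = (max_val - min_val + 1) / num_mappers
    splits = []
    lo = min_val
    for i in range(num_mappers):
        # The last mapper takes everything up to max_val to avoid gaps.
        hi = max_val if i == num_mappers - 1 else int(min_val + size * (i + 1)) - 1
        splits.append((lo, hi))
        lo = hi + 1
    return splits

# Four mappers over ids 1..1000 each get an equal 250-row slice.
print(split_ranges(1, 1000, 4))  # → [(1, 250), (251, 500), (501, 750), (751, 1000)]
```

Because each slice is fetched by an independent mapper over its own JDBC connection, the import proceeds in parallel, and a failed mapper can be retried without redoing the others.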