


How To Change Timezone For Sqoop Import?

This is old hat for most Hadoop veterans, but I've been meaning to note it on the blog for a while, for anyone whose first encounter with Hadoop is Oracle's BigDataLite VM.

Most people looking to bring external data into Hadoop do so through flat-file exports that they then import into HDFS, using the "hadoop fs" command-line tool or Hue, the web-based developer tool in BigDataLite, Cloudera CDH, Hortonworks and so on. They then often create Hive tables over them, either from the Hive / Beeswax shell or through Hue, which can create a table for you out of a file you upload from your browser. ODI, through the ODI Application Adapter for Hadoop, also gives you a knowledge module (IKM File to Hive) that's used in the ODI demos on BigDataLite to load data into Hive tables from an Apache Avro-format log file.


What a lot of people who're new to Hadoop don't know is that you can skip the "dump to file" step completely and load data into HDFS directly from the Oracle database, without an intermediate file export step. The tool you use for this comes as part of the Cloudera CDH4 Hadoop distribution that's on BigDataLite, and it's called "Sqoop".

"Sqoop", short for "SQL to Hadoop", gives you the ability to do the following Oracle data transfer tasks, amongst others:

  • Import whole tables, or whole schemas, from Oracle and other relational databases into Hadoop's file system, HDFS
  • Export data from HDFS back out to these databases - with the export and import being performed through MapReduce jobs
  • Import using an arbitrary SQL SELECT statement, rather than grabbing whole tables (see the sketch after this list)
  • Perform incremental loads, specifying a key column to determine what to exclude
  • Load directly into Hive tables, creating the HDFS files in the background and the Hive metadata automatically
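
To make the SELECT-based import from that list a little more concrete, here's a minimal sketch of what a free-form query import looks like in Sqoop 1. The column names and target directory are illustrative rather than taken from the MOVIEDEMO schema, but the literal $CONDITIONS token in the WHERE clause is required, and --split-by and --target-dir are needed whenever --query is used with more than one mapper:

    # Free-form query import (sketch); single quotes stop the shell expanding $CONDITIONS
    [oracle@bigdatalite ~]$ sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl \
        --username MOVIEDEMO --password welcome1 \
        --query 'SELECT a.ACTIVITY_ID, a.NAME FROM ACTIVITY a WHERE $CONDITIONS' \
        --split-by a.ACTIVITY_ID --target-dir /user/oracle/ACTIVITY_QUERY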

Documentation for Sqoop as shipped with CDH4 can be found on the Cloudera website, and there are even optimisations and plugins for databases such as Oracle to enable faster, direct loads - for example OraOOP.

Normally, you'd need to download and install JDBC drivers for sqoop before you can use it, but BigDataLite comes with the required Oracle JDBC drivers, so let's just have a play around and see some examples of Sqoop in action. I'll start by importing the ACTIVITY table from the MOVIEDEMO schema that comes with the Oracle 12c database, also on BigDataLite (make sure the database is running first, though):

    [oracle@bigdatalite ~]$ sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --table ACTIVITY

You should then see sqoop process your command in its console output, and then run the MapReduce jobs to bring in the data via the Oracle JDBC driver:

    14/03/21 18:21:36 INFO sqoop.Sqoop: Running Sqoop version: 1.4.3-cdh4.5.0
    14/03/21 18:21:36 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
    14/03/21 18:21:37 INFO manager.SqlManager: Using default fetchSize of 1000
    14/03/21 18:21:37 INFO tool.CodeGenTool: Beginning code generation
    14/03/21 18:21:38 INFO manager.OracleManager: Time zone has been set to GMT
    14/03/21 18:21:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM ACTIVITY t WHERE 1=0
    14/03/21 18:21:38 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-0.20-mapreduce
    14/03/21 18:21:38 INFO orm.CompilationManager: Found hadoop core jar at: /usr/lib/hadoop-0.20-mapreduce/hadoop-core.jar
    Note: /tmp/sqoop-oracle/compile/b4949ed7f3e826839679143f5c8e23c1/ACTIVITY.java uses or overrides a deprecated API.
    Note: Recompile with -Xlint:deprecation for details.
    14/03/21 18:21:41 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-oracle/compile/b4949ed7f3e826839679143f5c8e23c1/ACTIVITY.jar
    14/03/21 18:21:41 INFO manager.OracleManager: Time zone has been set to GMT
    14/03/21 18:21:41 INFO manager.OracleManager: Time zone has been set to GMT
    14/03/21 18:21:42 INFO mapreduce.ImportJobBase: Beginning import of ACTIVITY
    14/03/21 18:21:42 INFO manager.OracleManager: Time zone has been set to GMT
    14/03/21 18:21:44 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
    14/03/21 18:21:45 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(ACTIVITY_ID), MAX(ACTIVITY_ID) FROM ACTIVITY
    14/03/21 18:21:45 INFO mapred.JobClient: Running job: job_201403111406_0015
    14/03/21 18:21:47 INFO mapred.JobClient:  map 0% reduce 0%
    14/03/21 18:22:19 INFO mapred.JobClient:  map 25% reduce 0%
    14/03/21 18:22:27 INFO mapred.JobClient:  map 50% reduce 0%
    14/03/21 18:22:29 INFO mapred.JobClient:  map 75% reduce 0%
    14/03/21 18:22:37 INFO mapred.JobClient:  map 100% reduce 0%
    ...
    14/03/21 18:22:39 INFO mapred.JobClient:     Map input records=11
    14/03/21 18:22:39 INFO mapred.JobClient:     Map output records=11
    14/03/21 18:22:39 INFO mapred.JobClient:     Input split bytes=464
    14/03/21 18:22:39 INFO mapred.JobClient:     Spilled Records=0
    14/03/21 18:22:39 INFO mapred.JobClient:     CPU time spent (ms)=3430
    14/03/21 18:22:39 INFO mapred.JobClient:     Physical memory (bytes) snapshot=506802176
    14/03/21 18:22:39 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2714157056
    14/03/21 18:22:39 INFO mapred.JobClient:     Total committed heap usage (bytes)=506724352
    14/03/21 18:22:39 INFO mapreduce.ImportJobBase: Transferred 103 bytes in 56.4649 seconds (1.8241 bytes/sec)
    14/03/21 18:22:39 INFO mapreduce.ImportJobBase: Retrieved 11 records.
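
Incidentally, those "Time zone has been set to GMT" lines are where the question in this post's title comes in: Sqoop's Oracle connection manager sets the JDBC session timezone to GMT by default. Per the Sqoop user guide, you can override this by passing the oracle.sessionTimeZone Hadoop property with the generic -D option, placed straight after the tool name; a sketch, with the timezone value just an example:

    # -D oracle.sessionTimeZone=... changes the Oracle session timezone Sqoop sets
    # (default GMT); generic -D options must come before the tool-specific arguments
    [oracle@bigdatalite ~]$ sqoop import -D oracle.sessionTimeZone=America/Los_Angeles \
        --connect jdbc:oracle:thin:@localhost:1521/orcl \
        --username MOVIEDEMO --password welcome1 --table ACTIVITY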

By default, sqoop will put the resulting files in your user's home directory in HDFS. Let's take a look and see what's there:

    [oracle@bigdatalite ~]$ hadoop fs -ls /user/oracle/ACTIVITY
    Found 6 items
    -rw-r--r--   1 oracle supergroup          0 2014-03-21 18:22 /user/oracle/ACTIVITY/_SUCCESS
    drwxr-xr-x   - oracle supergroup          0 2014-03-21 18:21 /user/oracle/ACTIVITY/_logs
    -rw-r--r--   1 oracle supergroup         27 2014-03-21 18:22 /user/oracle/ACTIVITY/part-m-00000
    -rw-r--r--   1 oracle supergroup         17 2014-03-21 18:22 /user/oracle/ACTIVITY/part-m-00001
    -rw-r--r--   1 oracle supergroup         24 2014-03-21 18:22 /user/oracle/ACTIVITY/part-m-00002
    -rw-r--r--   1 oracle supergroup         35 2014-03-21 18:22 /user/oracle/ACTIVITY/part-m-00003

    [oracle@bigdatalite ~]$ hadoop fs -cat /user/oracle/ACTIVITY/part-m-00000
    1,Rate
    2,Completed
    3,Pause
    [oracle@bigdatalite ~]$ hadoop fs -cat /user/oracle/ACTIVITY/part-m-00001
    4,Start
    5,Browse

What you can see there is that sqoop has imported the data as a series of "part-m" files: CSV files, one per mapper in the MapReduce import job. There are various options in the docs for specifying compression and other performance features for sqoop imports, but the basic format is a series of CSV files, one per mapper.
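
If you'd rather have fewer of those part files, or put them somewhere other than your home directory, both are controllable; a minimal sketch, with the directory name made up for illustration:

    # -m 1 runs a single mapper, giving one part-m-00000 file; --target-dir
    # overrides the default /user/<username>/<TABLE> location (name is illustrative)
    [oracle@bigdatalite ~]$ sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl \
        --username MOVIEDEMO --password welcome1 --table ACTIVITY \
        -m 1 --target-dir /user/oracle/activity_single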

You can also import Oracle and other RDBMS data directly into Hive, with sqoop creating equivalent datatypes for the data coming in (basic datatypes only, none of the advanced spatial and other Oracle ones). For example, I could import the CREW table in the MOVIEDEMO schema like this, straight into an equivalent Hive table:

    [oracle@bigdatalite ~]$ sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --table CREW --hive-import

Taking a look at Hive, I can then see the table this has created, describe it and count the number of rows it contains:

    [oracle@bigdatalite ~]$ hive
    Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-0.10.0-cdh4.5.0.jar!/hive-log4j.properties
    Hive history file=/tmp/oracle/hive_job_log_effb9cb5-6617-49f4-97b5-b09cd56c5661_1747866494.txt

hive> desc CREW;
OK
crew_id double
name string
Time taken: 2.211 seconds

hive> select count(*) from CREW;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201403111406_0017, Tracking URL = http://bigdatalite.localdomain:50030/jobdetails.jsp?jobid=job_201403111406_0017
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201403111406_0017
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2014-03-21 18:33:40,756 Stage-1 map = 0%, reduce = 0%
2014-03-21 18:33:46,797 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.74 sec
2014-03-21 18:33:47,812 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.74 sec
2014-03-21 18:33:48,821 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.74 sec
2014-03-21 18:33:49,907 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.74 sec
2014-03-21 18:33:50,916 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.74 sec
2014-03-21 18:33:51,929 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.75 sec
2014-03-21 18:33:52,942 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.75 sec
2014-03-21 18:33:53,951 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.75 sec
2014-03-21 18:33:54,961 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.75 sec
MapReduce Total cumulative CPU time: 1 seconds 750 msec
Ended Job = job_201403111406_0017
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1  Cumulative CPU: 1.75 sec  HDFS Read: 135111  HDFS Write: 5  SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 750 msec
OK
6860
Time taken: 20.095 seconds
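
A couple of related --hive-import options worth knowing about (a sketch rather than something from the original run): --hive-table lets you choose the Hive table name instead of defaulting to the source table name, and --hive-overwrite replaces the table's contents rather than appending to them:

    # The Hive table name movie_crew is illustrative; --hive-overwrite
    # replaces any existing data in that table rather than appending
    [oracle@bigdatalite ~]$ sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl \
        --username MOVIEDEMO --password welcome1 --table CREW \
        --hive-import --hive-table movie_crew --hive-overwrite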

I can even do an incremental import to bring in new rows, appending their contents to the existing ones in Hive/HDFS:

    [oracle@bigdatalite ~]$ sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --table CREW --hive-import --incremental append --check-column CREW_ID
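
One thing to be aware of with append-mode incremental imports (an aside, not from the original post): Sqoop decides which rows are new by comparing the check column against a --last-value you supply, and it should print the new high-water mark at the end of each run so you can feed it into the next one. A sketch, with the last value purely illustrative:

    # Only rows whose CREW_ID is greater than --last-value are appended;
    # the 6860 here is just an example of a previously-imported high-water mark
    [oracle@bigdatalite ~]$ sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl \
        --username MOVIEDEMO --password welcome1 --table CREW --hive-import \
        --incremental append --check-column CREW_ID --last-value 6860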

The data I've now loaded can be processed by a tool such as ODI, or you can use a tool such as Pig to do some further analysis or number crunching, like this:

    grunt> RAW_DATA = LOAD 'ACTIVITY' USING PigStorage(',') AS
    grunt>   (act_type: int, act_desc: chararray);
    grunt> B = FILTER RAW_DATA BY act_type < 5;
    grunt> STORE B INTO 'FILTERED_ACTIVITIES' USING PigStorage(',');

the output of which is another file on HDFS. Finally, I can export this data back to my Oracle database using the sqoop export feature:

    [oracle@bigdatalite ~]$ sqoop export --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --table ACTIVITY_FILTERED --export-dir FILTERED_ACTIVITIES
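
A quick note on that export: the target table (ACTIVITY_FILTERED here) has to exist in the Oracle schema before the export runs - Sqoop won't create it for you - and since the Pig output above is comma-separated, you can make the field delimiter explicit with --input-fields-terminated-by; a sketch:

    # ACTIVITY_FILTERED must already exist in the MOVIEDEMO schema;
    # --input-fields-terminated-by describes how the HDFS records are delimited
    [oracle@bigdatalite ~]$ sqoop export --connect jdbc:oracle:thin:@localhost:1521/orcl \
        --username MOVIEDEMO --password welcome1 --table ACTIVITY_FILTERED \
        --export-dir FILTERED_ACTIVITIES --input-fields-terminated-by ','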

So there's a lot more to sqoop than just this, including features and topics such as compression, transforming data and so on, but it's a useful tool and also something you could call from the command line, using an ODI Tool if you want to. Recent versions of Hue also come with a GUI for Sqoop, giving you the ability to create jobs graphically and also schedule them using Oozie, the Hadoop scheduler in CDH4.

Source: https://www.rittmanmead.com/blog/2014/03/using-sqoop-for-loading-oracle-data-into-hadoop-on-the-bigdatalite-vm/

