Importing data from RDBMS into Hive using Sqoop and Oozie (hive-import)

In the How to run Sqoop command from oozie entry I talked about how you can use Oozie and Sqoop to import data into HDFS. I wanted to change it to use Sqoop's hive-import option, which, in addition to importing data into HDFS, also creates a Hive table on top of the data. These are the steps that I followed:
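
For reference, the underlying Sqoop command that the workflow runs looks roughly like this; the connection URL, database, and credentials are placeholders for my QuickStart VM setup (quickstart.cloudera shows up in the log below), so adjust them for your environment:

    sqoop import \
      --connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
      --username retail_dba \
      --password cloudera \
      --table customers \
      --hive-import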
  • First I changed the workflow.xml to take out the as-avrodatafile option and added the hive-import option, then re-ran the workflow (a sketch of the final workflow appears after this list). When I did that, the Oozie workflow failed with the following error:
    
    7936 [uber-SubtaskRunner] WARN  org.apache.sqoop.mapreduce.JobBase  - SQOOP_HOME is unset. May not be able to find all job dependencies.
    9202 [uber-SubtaskRunner] DEBUG org.apache.sqoop.mapreduce.db.DBConfiguration  - Fetching password from job credentials store
    9207 [uber-SubtaskRunner] INFO  org.apache.sqoop.mapreduce.db.DBInputFormat  - Using read commited transaction isolation
    9210 [uber-SubtaskRunner] DEBUG org.apache.sqoop.mapreduce.db.DataDrivenDBInputFormat  - Creating input split with lower bound '1=1' and upper bound '1=1'
    25643 [uber-SubtaskRunner] INFO  org.apache.sqoop.mapreduce.ImportJobBase  - Transferred 931.1768 KB in 17.6994 seconds (52.6107 KB/sec)
    25649 [uber-SubtaskRunner] INFO  org.apache.sqoop.mapreduce.ImportJobBase  - Retrieved 12435 records.
    25649 [uber-SubtaskRunner] DEBUG org.apache.sqoop.hive.HiveImport  - Hive.inputTable: customers
    25650 [uber-SubtaskRunner] DEBUG org.apache.sqoop.hive.HiveImport  - Hive.outputTable: customers
    25653 [uber-SubtaskRunner] DEBUG org.apache.sqoop.manager.SqlManager  - Execute getColumnInfoRawQuery : SELECT t.* FROM `customers` AS t LIMIT 1
    25653 [uber-SubtaskRunner] DEBUG org.apache.sqoop.manager.SqlManager  - No connection paramenters specified. Using regular API for making connection.
    25658 [uber-SubtaskRunner] DEBUG org.apache.sqoop.manager.SqlManager  - Using fetchSize for next query: -2147483648
    25658 [uber-SubtaskRunner] INFO  org.apache.sqoop.manager.SqlManager  - Executing SQL statement: SELECT t.* FROM `customers` AS t LIMIT 1
    25659 [uber-SubtaskRunner] DEBUG org.apache.sqoop.manager.SqlManager  - Found column customer_id of type [4, 11, 0]
    25659 [uber-SubtaskRunner] DEBUG org.apache.sqoop.manager.SqlManager  - Found column customer_fname of type [12, 45, 0]
    25659 [uber-SubtaskRunner] DEBUG org.apache.sqoop.manager.SqlManager  - Found column customer_lname of type [12, 45, 0]
    25660 [uber-SubtaskRunner] DEBUG org.apache.sqoop.manager.SqlManager  - Found column customer_email of type [12, 45, 0]
    25660 [uber-SubtaskRunner] DEBUG org.apache.sqoop.manager.SqlManager  - Found column customer_password of type [12, 45, 0]
    25660 [uber-SubtaskRunner] DEBUG org.apache.sqoop.manager.SqlManager  - Found column customer_street of type [12, 255, 0]
    25660 [uber-SubtaskRunner] DEBUG org.apache.sqoop.manager.SqlManager  - Found column customer_city of type [12, 45, 0]
    25660 [uber-SubtaskRunner] DEBUG org.apache.sqoop.manager.SqlManager  - Found column customer_state of type [12, 45, 0]
    25660 [uber-SubtaskRunner] DEBUG org.apache.sqoop.manager.SqlManager  - Found column customer_zipcode of type [12, 45, 0]
    25663 [uber-SubtaskRunner] DEBUG org.apache.sqoop.hive.TableDefWriter  - Create statement: CREATE TABLE IF NOT EXISTS `customers` ( `customer_id` INT, `customer_fname` STRING, `customer_lname` STRING, `customer_email` STRING, `customer_password` STRING, `customer_street` STRING, `customer_city` STRING, `customer_state` STRING, `customer_zipcode` STRING) COMMENT 'Imported by sqoop on 2016/12/22 21:18:39' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\012' STORED AS TEXTFILE
    25664 [uber-SubtaskRunner] DEBUG org.apache.sqoop.hive.TableDefWriter  - Load statement: LOAD DATA INPATH 'hdfs://quickstart.cloudera:8020/user/cloudera/customers' INTO TABLE `customers`
    25667 [uber-SubtaskRunner] INFO  org.apache.sqoop.hive.HiveImport  - Loading uploaded data into Hive
    25680 [uber-SubtaskRunner] DEBUG org.apache.sqoop.hive.HiveImport  - Using in-process Hive instance.
    25683 [uber-SubtaskRunner] DEBUG org.apache.sqoop.util.SubprocessSecurityManager  - Installing subprocess security manager
    Intercepting System.exit(1)
    
    <<< Invocation of Main class completed <<<
    
    Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SqoopMain], exit code [1]
    
    Oozie Launcher failed, finishing Hadoop job gracefully
    
    Oozie Launcher, uploading action data to HDFS sequence file: hdfs://quickstart.cloudera:8020/user/cloudera/oozie-oozi/0000007-161222163830473-oozie-oozi-W/sqoop-52c0--sqoop/action-data.seq
    
    Oozie Launcher ends
    
    
  • As you can see from the log, the Sqoop job was able to import the data into HDFS under the /user/cloudera/customers directory, and I could actually see the data there. But when Sqoop tried to create the table on top of that data, the Hive step failed and the table never got created. This is the log statement I am referring to:

    CREATE TABLE IF NOT EXISTS `customers` ( `customer_id` INT, `customer_fname` STRING, `customer_lname` STRING, `customer_email` STRING, `customer_password` STRING, `customer_street` STRING, `customer_city` STRING, `customer_state` STRING, `customer_zipcode` STRING) COMMENT 'Imported by sqoop on 2016/12/22 21:18:39' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\012' STORED AS TEXTFILE
  • So it seems the problem is that Sqoop needs hive-site.xml so that it knows how to talk to the Hive service. To fix that, I first searched my sandbox to figure out where hive-site.xml is located, and then uploaded it to HDFS:

    sudo find / -name hive-site.xml
    hdfs dfs -put /etc/hive/conf.dist/hive-site.xml
  • After that I went back to the workflow.xml and modified it to look like this:
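
Here is a minimal sketch of the final workflow; the connection URL, database, and credentials are placeholders based on my QuickStart setup, and the <file> path assumes hive-site.xml landed in /user/cloudera (the HDFS home directory, which is where the hdfs dfs -put above puts it). The first, failing run used the same action minus the <file> element that ships hive-site.xml:

    <workflow-app name="sqoop-hive-import-wf" xmlns="uri:oozie:workflow:0.4">
      <start to="sqoop-import"/>
      <action name="sqoop-import">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <command>import --connect jdbc:mysql://quickstart.cloudera:3306/retail_db --username retail_dba --password cloudera --table customers --hive-import</command>
          <!-- ship hive-site.xml with the action so the in-process Hive client
               can find the metastore; without this the action exits with code 1 -->
          <file>/user/cloudera/hive-site.xml#hive-site.xml</file>
        </sqoop>
        <ok to="end"/>
        <error to="fail"/>
      </action>
      <fail name="fail">
        <message>Sqoop import failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
      </fail>
      <end name="end"/>
    </workflow-app>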
Now when I ran the Oozie workflow it was successful, and I could query the customer data in Hive.
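
For example, a quick sanity check from the shell, using the column names from the CREATE TABLE statement in the log above:

    hive -e "SELECT customer_id, customer_fname, customer_lname FROM customers LIMIT 5;"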
