Moving data from Avro to ORC files

In Importing data from Sqoop into Hive External Table with Avro encoding I blogged about how to use Sqoop to move data from an RDBMS into Hive. I wanted to take it a step further by moving the downloaded data into an ORC table. These are the steps I followed:
  1. The first thing is to find out the schema of the Avro-backed table, which you can get by executing the following statement in Hive:
    
    show create table CUSTOMER;
    
    You will get output that looks something like this; it contains the schema of the table:
    
    CREATE EXTERNAL TABLE `CUSTOMER`(
      `contactid` int COMMENT 'from deserializer',
      `firstname` string COMMENT 'from deserializer',
      `lastname` string COMMENT 'from deserializer',
      `email` string COMMENT 'from deserializer')
    ROW FORMAT SERDE
      'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    STORED AS INPUTFORMAT
      'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    OUTPUTFORMAT
      'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    LOCATION
      'hdfs://sandbox.hortonworks.com:8020/tmp/customer/data'
    TBLPROPERTIES (
      'avro.schema.url'='hdfs:///tmp/customer/schema/customer.avsc',
      'transient_lastDdlTime'='1431719666')
    
  2. Copy the schema from the last step, remove the SerDe, format, and table-properties clauses, and replace them with the ORC clauses shown below. Execute this in Hive to create the customer table in ORC format:
    
    CREATE EXTERNAL TABLE `CUSTOMER_ORC`(
      `contactid` int,
      `firstname` string,
      `lastname` string,
      `email` string)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\001'
    STORED AS ORC
    LOCATION
      'hdfs://sandbox.hortonworks.com:8020/tmp/customer/data_orc'
    TBLPROPERTIES ("orc.compress"="SNAPPY","orc.row.index.stride"="20000");
    
  3. The last step is to copy the data from the Avro table into the ORC table, which you can do with the following statement (a quick verification sketch follows this list):
    
    insert into table CUSTOMER_ORC select * from customer;
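
To sanity check the copy, one quick option is to compare row counts between the two tables (a minimal sketch, run from the Hive shell):

    -- both counts should match once the insert completes
    select count(*) from CUSTOMER;
    select count(*) from CUSTOMER_ORC;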
    

Importing data from Sqoop into Hive External Table with Avro encoding

I wanted to figure out how to import the content of an RDBMS table into Hive with Avro encoding. During this process I wanted to use external Hive tables so that I have complete control over the location of the files.
Note: I describe a different/easier method for doing this in Importing data from Sqoop into Hive External Table with Avro encoding updated.
First I created the following table in the MySQL database, which runs on the same machine as my HortonWorks Sandbox.
  1. First create the CUSTOMER table like this in MySQL:
    
    CREATE TABLE CUSTOMER (
      contactid INTEGER NOT NULL,
      firstname VARCHAR(50),
      lastname  VARCHAR(50),
      email     VARCHAR(50)
    );
    
  2. After creating the table, add a couple of records to it by executing an insert statement like the following:
    
    insert into CUSTOMER values(1,'Sachin','Tendulark','sachin@gmail.com');
    
  3. The next step is to run a Sqoop import that downloads the records of the table into HDFS at /tmp/customer/sample. In the real world you might want to download only the first 10 records or so, because you only need a few sample records to create the Avro schema (see the sketch after this list):
    
    sqoop import --connect jdbc:mysql://localhost/test --table CUSTOMER --username sqoop1 
    --password sqoop -m 1 --create-hive-table 
    --hive-table CONTACT --as-avrodatafile  --target-dir /tmp/customer/sample
    
  4. Running the Sqoop command dumps the records into HDFS, so first download the Avro file that Sqoop generated:
    
    hdfs dfs -get /tmp/customer/sample/part-m-00000.avro
    
  5. Use avro-tools-*.jar to read the schema of the file generated by Sqoop, by executing the following command:
    
    java -jar avro-tools-1.7.5.jar getschema part-m-00000.avro > customer.avsc
    
    This is what the customer.avsc file looks like in my case:
    
    {
      "type" : "record",
      "name" : "CUSTOMER",
      "doc" : "Sqoop import of CUSTOMER",
      "fields" : [ {
        "name" : "contactid",
        "type" : [ "int", "null" ],
        "columnName" : "contactid",
        "sqlType" : "4"
      }, {
        "name" : "firstname",
        "type" : [ "string", "null" ],
        "columnName" : "firstname",
        "sqlType" : "12"
      }, {
        "name" : "lastname",
        "type" : [ "string", "null" ],
        "columnName" : "lastname",
        "sqlType" : "12"
      }, {
        "name" : "email",
        "type" : [ "string", "null" ],
        "columnName" : "email",
        "sqlType" : "12"
      } ],
      "tableName" : "CUSTOMER"
    }
    
  6. The next step is to upload the Avro schema file that you created in the last step back to HDFS. In my case I had an HDFS folder called /tmp/customer/schema, and I uploaded the schema file into it:
    
    hdfs dfs -put customer.avsc /tmp/customer/schema/
    
  7. Now go to Hive and execute the following command to define the external CUSTOMER Hive table with the Avro schema uploaded in the last step:
    
    CREATE EXTERNAL TABLE CUSTOMER
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    LOCATION '/tmp/customer/data'
    TBLPROPERTIES ('avro.schema.url'='hdfs:///tmp/customer/schema/customer.avsc');
    
  8. The last step is to run Sqoop again, this time downloading all the data into the external directory that the CUSTOMER Hive table points to:
    
    sqoop import --connect jdbc:mysql://localhost/test --table CUSTOMER --username sqoop1 
    --password sqoop -m 1 --as-avrodatafile 
     --target-dir /tmp/customer/data --compression-codec snappy
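
As an aside on step 3: if you only need a handful of sample rows to generate the schema, one option is Sqoop's free-form query import with a LIMIT clause. This is a sketch using the same connection settings as above; with --query you drop --table and must keep the $CONDITIONS placeholder in the WHERE clause:

    sqoop import --connect jdbc:mysql://localhost/test --username sqoop1 --password sqoop \
      --query 'SELECT * FROM CUSTOMER WHERE $CONDITIONS LIMIT 10' \
      -m 1 --as-avrodatafile --target-dir /tmp/customer/sample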
    
Now if you run a select query on the CUSTOMER table you should get all the data that you see in your RDBMS.
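
For example, a quick check from the Hive shell (a minimal sketch):

    -- should return the rows imported by Sqoop
    select * from CUSTOMER;
    select count(*) from CUSTOMER;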

Running oozie job on Hortonworks Sandbox

In Enabling Oozie console on Cloudera VM 4.4.0 and executing examples I blogged about how to run an Oozie job on the Cloudera sandbox. This process is a little easier on the HortonWorks 2.2 sandbox. I started with a brand new HDP 2.2 image and tried running the Oozie example on it by executing:

oozie job -oozie http://localhost:11000/oozie -config examples/apps/map-reduce/job.properties -run
But when I tried running it I got the following error:

Error: E0501 : E0501: Could not perform authorization operation, Call From sandbox.hortonworks.com/10.0.2.15 to localhost:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
So I looked into /var/log/oozie/oozie.log and saw the following error:

2015-05-01 20:34:39,195  WARN V1JobsServlet:546 - SERVER[sandbox.hortonworks.com] USER[root] GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] URL[POST http://sandbox.hortonworks.com:11000/oozie/v2/jobs?action=start] error[E0501], E0501: Could not perform authorization operation, Call From sandbox.hortonworks.com/10.0.2.15 to localhost:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
org.apache.oozie.servlet.XServletException: E0501: Could not perform authorization operation, Call From sandbox.hortonworks.com/10.0.2.15 to localhost:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
 at org.apache.oozie.servlet.BaseJobServlet.checkAuthorizationForApp(BaseJobServlet.java:240)
 at org.apache.oozie.servlet.BaseJobsServlet.doPost(BaseJobsServlet.java:96)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:727)
 at org.apache.oozie.servlet.JsonRestServlet.service(JsonRestServlet.java:287)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
 at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
 at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
 at org.apache.oozie.servlet.AuthFilter$2.doFilter(AuthFilter.java:143)
 at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:572)
 at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:542)
 at org.apache.oozie.servlet.AuthFilter.doFilter(AuthFilter.java:148)
 at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
 at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
 at org.apache.oozie.servlet.HostnameFilter.doFilter(HostnameFilter.java:84)
 at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
 at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
 at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
 at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
 at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
 at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
 at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
 at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
 at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:861)
 at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:606)
 at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
 at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.oozie.service.AuthorizationException: E0501: Could not perform authorization operation, Call From sandbox.hortonworks.com/10.0.2.15 to localhost:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
 at org.apache.oozie.service.AuthorizationService.authorizeForApp(AuthorizationService.java:399)
 at org.apache.oozie.servlet.BaseJobServlet.checkAuthorizationForApp(BaseJobServlet.java:229)
 ... 25 more
Caused by: java.net.ConnectException: Call From sandbox.hortonworks.com/10.0.2.15 to localhost:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
 at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
 at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
 at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
 at org.apache.hadoop.ipc.Client.call(Client.java:1472)
 at org.apache.hadoop.ipc.Client.call(Client.java:1399)
 at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
 at com.sun.proxy.$Proxy29.getFileInfo(Unknown Source)
 at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:752)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
 at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
 at com.sun.proxy.$Proxy30.getFileInfo(Unknown Source)
 at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1988)
 at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1118)
 at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114)
 at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
 at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114)
 at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1400)
 at org.apache.oozie.service.AuthorizationService.authorizeForApp(AuthorizationService.java:371)
 ... 26 more
Caused by: java.net.ConnectException: Connection refused
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
 at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
 at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
 at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
 at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607)
 at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705)
 at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
 at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
 at org.apache.hadoop.ipc.Client.call(Client.java:1438)
 ... 44 more
To solve this issue I had to edit examples/apps/map-reduce/job.properties and replace localhost with sandbox.hortonworks.com:


nameNode=hdfs://sandbox.hortonworks.com:8020
jobTracker=sandbox.hortonworks.com:8032

queueName=default
examplesRoot=examples

oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/map-reduce
outputDir=map-reduce
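
With those changes in place, resubmit the job and check its status. The second command below is a sketch; <job-id> is a placeholder for whatever ID the submit command prints:

oozie job -oozie http://localhost:11000/oozie -config examples/apps/map-reduce/job.properties -run
oozie job -oozie http://localhost:11000/oozie -info <job-id>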