Importing data from Sqoop into Hive External Table with Avro encoding

I wanted to figure out how to import the contents of an RDBMS table into Hive with Avro encoding. During this process I wanted to use external Hive tables so that I have complete control over the location of the data files.
Note: I have a different/easier method for doing this; see Importing data from Sqoop into Hive External Table with Avro encoding updated
First I created the following table in the MySQL database, which runs on the same machine as my Hortonworks Sandbox.
  1. First create the CUSTOMER table in MySQL like this
    CREATE TABLE CUSTOMER (
      contactid INTEGER NOT NULL,
      firstname VARCHAR(50),
      lastname  VARCHAR(50),
      email     VARCHAR(50) );
  2. After creating the table, add a couple of records to it by executing an insert statement like the following
    insert into customer values(1,'Sachin','Tendulark','');
  3. The next step is to run a Sqoop import that downloads the records of the table into HDFS at /tmp/customer/sample. In the real world you might want to download only the first 10 records or so, because you only need a few sample records to generate the Avro schema
    sqoop import --connect jdbc:mysql://localhost/test --table CUSTOMER --username sqoop1 
    --password sqoop -m 1 --create-hive-table 
    --hive-table CONTACT --as-avrodatafile  --target-dir /tmp/customer/sample
  4. Running the Sqoop command dumps the records into HDFS, so first download the Avro file generated by Sqoop
    hdfs dfs -get /tmp/customer/sample/part-m-00000.avro
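    Before extracting the schema, you can sanity-check the downloaded file. Per the Avro specification, an Object Container File starts with the 4-byte magic b"Obj\x01" and embeds its schema as JSON in the file header, which is what avro-tools reads in the next step. A minimal Python sketch (the temporary fake file is an assumption standing in for the real part-m-00000.avro):

```python
# Sketch: verify that the file Sqoop wrote is an Avro Object Container File.
# Such files start with the 4-byte magic b"Obj\x01"; the schema itself is
# stored in the header metadata that follows.
import os
import tempfile

def is_avro_container(path):
    """Return True if the file starts with the Avro container magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == b"Obj\x01"

# Stand-in for a real part-m-00000.avro (fake header; real files carry
# schema metadata and data blocks after the magic):
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"Obj\x01" + b"\x00" * 16)
    fake = tmp.name

print(is_avro_container(fake))  # True
os.remove(fake)
```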
  5. Use avro-tools-*.jar to read the schema of the file generated by Sqoop, by executing the following command
    java -jar avro-tools-1.7.5.jar getschema part-m-00000.avro > customer.avsc
    This is what the customer.avsc file looks like in my case
      "type" : "record",
      "name" : "CUSTOMER",
      "doc" : "Sqoop import of CUSTOMER",
      "fields" : [ {
        "name" : "contactid",
        "type" : [ "int", "null" ],
        "columnName" : "contactid",
        "sqlType" : "4"
      }, {
        "name" : "firstname",
        "type" : [ "string", "null" ],
        "columnName" : "firstname",
        "sqlType" : "12"
      }, {
        "name" : "lastname",
        "type" : [ "string", "null" ],
        "columnName" : "lastname",
        "sqlType" : "12"
      }, {
        "name" : "email",
        "type" : [ "string", "null" ],
        "columnName" : "email",
        "sqlType" : "12"
      } ],
      "tableName" : "CUSTOMER"
  6. The next step is to upload the Avro schema file that you created in the last step back to HDFS. In my case I had an HDFS folder called /tmp/customer/schema, and I uploaded the schema file into it
    hdfs dfs -put customer.avsc /tmp/customer/schema/
  7. Now go to Hive and execute the following command to define the external CUSTOMER Hive table, using the Avro schema uploaded in the last step
    CREATE EXTERNAL TABLE CUSTOMER
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    LOCATION '/tmp/customer/data'
    TBLPROPERTIES ('avro.schema.url'='hdfs:///tmp/customer/schema/customer.avsc');
  8. The last step is to run Sqoop again, but this time importing all the data into the external directory that the CUSTOMER Hive table points to
    sqoop import --connect jdbc:mysql://localhost/test --table CUSTOMER --username sqoop1 
    --password sqoop -m 1 --as-avrodatafile 
     --target-dir /tmp/customer/data --compression-codec snappy
Now if you run a select query on the CUSTOMER table in Hive, you should get all the data that you see in your RDBMS table.
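Since the external table above has no explicit column list, AvroSerDe derives the Hive columns from the schema file. A rough Python sketch of that derivation, assuming only the int/string primitives used above (the real SerDe supports many more types):

```python
# Rough sketch: AvroSerDe maps Avro field types to Hive column types, so the
# CREATE EXTERNAL TABLE needs no column list. Simplified mapping for the
# primitives used in this post.
AVRO_TO_HIVE = {"int": "int", "long": "bigint", "string": "string"}

def hive_columns(avro_fields):
    cols = []
    for f in avro_fields:
        # Sqoop emits nullable unions like ["int", "null"]; use the non-null branch.
        t = f["type"]
        if isinstance(t, list):
            t = next(x for x in t if x != "null")
        cols.append((f["name"], AVRO_TO_HIVE[t]))
    return cols

fields = [
    {"name": "contactid", "type": ["int", "null"]},
    {"name": "firstname", "type": ["string", "null"]},
    {"name": "lastname", "type": ["string", "null"]},
    {"name": "email", "type": ["string", "null"]},
]
print(hive_columns(fields))
# [('contactid', 'int'), ('firstname', 'string'), ('lastname', 'string'), ('email', 'string')]
```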


Anonymous said...

Hi Sunil,
While trying to run the commands as described in your blog, I am getting the error below:
"Error: Unable to access jarfile avro-tools..."

Can you please help me identify the reason, or am I missing some configuration?

phanikb said...

You need to download avro-tools.jar from the internet and move it to the directory from which you want to run the above command.

bharat batra said...

I am not able to read the data from CUSTOMER: class not found error org/xerial/snappy/Snappy.
Can anyone please advise?

wirawan0 said...

I thought the Avro format had to be used in conjunction with its schema, not by itself. In other words, I was not expecting Avro to be self-describing by storing the schema inside the file. Am I wrong here? One of the commands above demonstrates that there is a tool to extract the schema from the Avro file.

Sameer said...

Nice Article. thanks!!

Srividya Parthasarathy said...

good one to poc avro...

Vamsi Krishna said...

What is the purpose of [--create-hive-table and --hive-table CONTACT] in the Sqoop command in step 3? We are not importing any data into a Hive table there; we are just generating a schema file based on sample data. Where are you using this CONTACT Hive table in your post?

Barış Taşkend said...

Easy to understand. Thanks
