Monday, 4 September 2017

4)Sqoop Tool1: sqoop-import:(Contd)

5)Incremental Imports:

Sqoop provides an incremental import mode which can be used to retrieve only rows newer than some previously-imported set of rows.

Two types of incremental imports:

append - use this mode when importing a table where new rows are continually added with increasing row id values (only new insertions are picked up).
lastmodified - use this mode when rows of the source table may be updated, and each such update sets the value of a last-modified column to the current timestamp.

Working Mechanism:

Based on the --incremental argument, Sqoop decides between append and lastmodified mode; it then examines the column given by --check-column and imports only the rows whose value in that column is greater than --last-value (a rough query-based equivalent is sketched after the argument table below).




Argument Description
--check-column (col) Specifies the column to be examined when determining which rows to import. (the column should not be of type CHAR/NCHAR/VARCHAR/VARNCHAR/ LONGVARCHAR/LONGNVARCHAR)
--incremental (mode) Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified.
--last-value (value) Specifies the maximum value of the check column from the previous import.
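
To make the working mechanism concrete: an append-mode import with --check-column id and --last-value 5 behaves roughly like the free-form query import sketched below (a sketch only, reusing the connection details and paths from the example that follows; $CONDITIONS is the placeholder Sqoop requires in --query imports):

sqoop-import \
--connect jdbc:mysql://localhost/sqoop_test \
--username root --password root \
--query 'SELECT * FROM students WHERE id > 5 AND $CONDITIONS' \
-m 1 --append \
--target-dir /MANO/Sqoop_import_table/students

Sqoop builds a similar "check-column > last-value" condition internally and, at the end of a run, prints the new --last-value to use for the next import.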

Example: using argument append

Step 1: Table on the MySQL database ==> acts as the source for the HDFS import


Step 3: Normal import of the table to HDFS
mano@Mano:~$ sqoop-import --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table students -m 1 --target-dir /MANO/Sqoop_import_table/students

Data on HDFS after import 


Step 4: Using incremental import in append mode for new records on the source database

Let's first insert a new record into the table and see:
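
For reference, the new record can be inserted on the MySQL side roughly as below (a sketch; the column names id and name are taken from the imported data, and the last_update column is assumed to default to the current timestamp):

mysql -u root -p sqoop_test -e "INSERT INTO students (id, name) VALUES (6, 'Govind');"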


Command:
mano@Mano:~$ sqoop-import --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table students --check-column id --incremental append --last-value 5 -m 1 --target-dir /MANO/Sqoop_import_table/students

Output:

17/09/04 12:50:44 INFO tool.ImportTool:  --incremental append
17/09/04 12:50:44 INFO tool.ImportTool:   --check-column id
17/09/04 12:50:44 INFO tool.ImportTool:   --last-value 6
17/09/04 12:50:44 INFO tool.ImportTool: (Consider saving this with 'sqoop job --create')

mano@Mano:~$ hadoop fs -cat /MANO/Sqoop_import_table/students/part-m-00001
6,Govind,2017-09-04
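
As the log suggests, the incremental import can also be saved as a Sqoop saved job, so that the metastore tracks --last-value automatically between runs. A minimal sketch (the job name students_incr is only illustrative):

sqoop job --create students_incr -- import \
--connect jdbc:mysql://localhost/sqoop_test \
--username root --password root \
--table students \
--check-column id --incremental append --last-value 6 \
-m 1 --target-dir /MANO/Sqoop_import_table/students

sqoop job --exec students_incr

Each execution of the saved job updates the stored last value, so subsequent runs pick up only the rows inserted after the previous run.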

Example: using argument lastmodified

Step 1:
After data updates on the database table:
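
The updates behind this step might look roughly like the statements below (a sketch; the new names match the output shown in Step 3, and the last_update column is assumed to be refreshed to the current timestamp on every update, e.g. via ON UPDATE CURRENT_TIMESTAMP):

mysql -u root -p sqoop_test -e "UPDATE students SET name = 'Chella' WHERE id = 1; UPDATE students SET name = 'Prasath' WHERE id = 2; UPDATE students SET name = 'Mano' WHERE id = 3;"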






Step 2: View existing data on HDFS

mano@Mano:~$ hadoop fs -cat /MANO/Sqoop_import_table/students/part*
1,Mano,2017-09-04
2,Govind,2017-09-04
3,Mahan,2017-09-04
4,Madhan,2017-09-04
5,Kishore,2017-09-04
6,Govind,2017-09-04
mano@Mano:~$

Step 3: Using the lastmodified incremental mode

Command:
mano@Mano:~$ sqoop-import --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table students -m 1 --check-column last_update --incremental lastmodified --last-value 2017-09-04 --target-dir /MANO/Sqoop_import_table/students --append

Note:
We need to use the --append import control argument, since we are writing into an existing directory.

Output:
17/09/04 13:23:35 INFO mapreduce.ImportJobBase: Retrieved 6 records.
17/09/04 13:23:35 INFO util.AppendUtils: Appending to directory students
17/09/04 13:23:35 INFO util.AppendUtils: Using found partition 2
17/09/04 13:23:35 INFO tool.ImportTool: Incremental import complete! To run another incremental import of all data following this import, supply the following arguments:
17/09/04 13:23:35 INFO tool.ImportTool:  --incremental lastmodified
17/09/04 13:23:35 INFO tool.ImportTool:   --check-column last_update
17/09/04 13:23:35 INFO tool.ImportTool:   --last-value 2017-09-04 13:22:07.0
17/09/04 13:23:35 INFO tool.ImportTool: (Consider saving this with 'sqoop job --create')

mano@Mano:~$ hadoop fs -cat /MANO/Sqoop_import_table/students/part-m-00002
1,Chella,2017-09-04
2,Prasath,2017-09-04
3,Mano,2017-09-04
4,Madhan,2017-09-04
5,Kishore,2017-09-04
6,Govind,2017-09-04

Note :
We may see a separate part file on HDFS for each import operation we perform (a sketch of consolidating them with --merge-key follows below).
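
If a single consolidated copy of the data is preferred instead of one part file per run, lastmodified mode can alternatively be combined with --merge-key in place of --append; Sqoop then runs a merge job that folds the updated rows into the existing data, keyed on the given column. A sketch reusing the same connection details:

sqoop-import \
--connect jdbc:mysql://localhost/sqoop_test \
--username root --password root \
--table students -m 1 \
--check-column last_update --incremental lastmodified \
--last-value "2017-09-04 13:22:07.0" \
--merge-key id \
--target-dir /MANO/Sqoop_import_table/students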

Incremental Imports Status table:
 

Argument Description Example
--check-column (col) Specifies the column to be examined when determining which rows to import. (the column should not be of type CHAR/NCHAR/VARCHAR/VARNCHAR/ LONGVARCHAR/LONGNVARCHAR) Done
--incremental (mode) Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified. Done
--last-value (value) Specifies the maximum value of the check column from the previous import. Done

6)Output line formatting arguments:

When importing delimited files, the choice of delimiter is important. Delimiters that appear inside string-based fields may cause ambiguous parsing of the imported data by subsequent analysis passes.

Example:
the string "Hello, happy to see you" should not be imported with the end-of-field delimiter set to a comma.

Argument Description
--enclosed-by <char> Sets a required field enclosing character
--escaped-by <char> Sets the escape character
--fields-terminated-by <char> Sets the field separator character
--lines-terminated-by <char> Sets the end-of-line character
--mysql-delimiters Uses MySQL’s default delimiter set: fields: , lines: \n escaped-by: \ optionally-enclosed-by: '
--optionally-enclosed-by <char> Sets a field enclosing character

Delimiters may be specified as:

  •     a character (--fields-terminated-by X)
  •     an escape character (--fields-terminated-by \t). Supported escape characters are:
        \b (backspace)
        \n (newline)
        \r (carriage return)
        \t (tab)
        \" (double-quote)
        \\' (single-quote)
        \\ (backslash)
        \0 (NUL) - This will insert NUL characters between fields or lines, or will disable enclosing/escaping if used for one of the --enclosed-by, --optionally-enclosed-by, or --escaped-by arguments.

Note:
The default delimiters are a comma (,) for fields, a newline (\n) for records, no quote character, and no escape character.

Command:
using --mysql-delimiters
Uses MySQL’s default delimiter set: fields: , lines: \n escaped-by: \ optionally-enclosed-by: '

Sql data on Table:

Import using --mysql-delimiters

mano@Mano:~$ sqoop-import --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table students --delete-target-dir -m 1 --mysql-delimiters



Command: using --enclosed-by and --escaped-by

mano@Mano:~$ sqoop-import --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table students --delete-target-dir -m 1 --enclosed-by "\\'" --escaped-by '\'

Import results:



Note: in the output above, the enclosed-by single quotes are applied to all the fields.

To avoid that, we can use the --optionally-enclosed-by argument, which encloses fields only when necessary.

Command: using --optionally-enclosed-by

mano@Mano:~$ sqoop-import --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table students --delete-target-dir -m 1 --optionally-enclosed-by '\"' --escaped-by '\'

Sql data on Table:


Import results:


  


Command: Using --fields-terminated-by and --lines-terminated-by
mano@Mano:~$ sqoop-import --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table students --delete-target-dir -m 1 --fields-terminated-by '|' --lines-terminated-by '\t'

Output:
mano@Mano:~$ hadoop fs -cat /user/mano/students/par*
1|Chella|2017-09-04    2|Prasath|2017-09-04    3|Mano|2017-09-04    4|Madhan|2017-09-04    5|Kishore|2017-09-04    6|Govind|2017-09-04    6|This could|2017-09-05    7|"Hey,dear"|2017-09-05
Output line formatting arguments status table:

Argument Description Example
--enclosed-by <char> Sets a required field enclosing character Done
--escaped-by <char> Sets the escape character Done
--fields-terminated-by <char> Sets the field separator character Done
--lines-terminated-by <char> Sets the end-of-line character Done
--mysql-delimiters Uses MySQL’s default delimiter set: fields: , lines: \n escaped-by: \ optionally-enclosed-by: ' Done
--optionally-enclosed-by <char> Sets a field enclosing character Done

7)Input parsing arguments:

These arguments are used when parsing input files, to specify the enclosed-by/escaped-by characters etc. when exporting files from HDFS to RDBMS tables.

Note:
These delimiters are the same as the ones set in the previous section for storing data into HDFS files, which act as the input files when exporting from HDFS into the RDBMS.

Sqoop generates a Java class that can reinterpret the text files; its generated parse() method uses these delimiters to reinterpret the data.


Argument Description
--input-enclosed-by <char> Sets a required field encloser
--input-escaped-by <char> Sets the input escape character
--input-fields-terminated-by <char> Sets the input field separator
--input-lines-terminated-by <char> Sets the input end-of-line character
--input-optionally-enclosed-by <char> Sets a field enclosing character

These work just like the output line formatting arguments above; a minimal export sketch using them is shown below. 😊
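
A minimal sketch of such an export (the target MySQL table students_copy is only illustrative and is assumed to already exist with a matching schema; the delimiters mirror the --optionally-enclosed-by/--escaped-by import shown earlier, whose output went to /user/mano/students):

sqoop-export \
--connect jdbc:mysql://localhost/sqoop_test \
--username root --password root \
--table students_copy \
--export-dir /user/mano/students \
--input-fields-terminated-by ',' \
--input-optionally-enclosed-by '\"' \
--input-escaped-by '\' \
-m 1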

8)Hive arguments:
Sqoop can also import the data into Hive by generating and executing a CREATE TABLE statement to define the data’s layout in Hive. Importing data into Hive is as simple as adding the --hive-import option to your Sqoop command line.

Use the --hive-home option to identify the Hive installation directory (in case there are multiple Hive installations or you need to point to it manually); Sqoop will use $HIVE_HOME/bin/hive from that location.
Argument Description
--hive-home <dir> Override $HIVE_HOME
--hive-import Import tables into Hive (Uses Hive’s default delimiters if none are set.)
--hive-overwrite Overwrite existing data in the Hive table.
--create-hive-table If set, then the job will fail if the target Hive table exists. By default this property is false.
--hive-table <table-name> Sets the table name to use when importing to Hive.
--hive-drop-import-delims Drops \n, \r, and \01 from string fields when importing to Hive.
--hive-delims-replacement Replace \n, \r, and \01 from string fields with user defined string when importing to Hive.
--hive-partition-key Name of the Hive field on which the partitions are sharded.
--hive-partition-value <v> String value that serves as the partition key for data imported into Hive in this job.
--map-column-hive <map> Override default mapping from SQL type to Hive type for configured columns
--hive-database <db-name> Sets the name of the Hive database to use.
Note:
  • The table name used in Hive is, by default, the same as that of the source RDBMS table. We can change the output table name with the --hive-table option.
  • If the Hive table already exists, we can overwrite it with the --hive-overwrite option.
  • Hive importing doesn't support the --as-avrodatafile and --as-sequencefile clauses.
  • By default, row delimiters in Hive are \n and \r characters and column delimiters are \001 characters; if source RDBMS tables contain these characters inside fields, then we need to use the --hive-drop-import-delims option to drop those characters on import to give Hive-compatible text data, or the --hive-delims-replacement option to replace those characters with a user-defined string.
  • Sqoop can import data for Hive into a particular partition by specifying the --hive-partition-key and --hive-partition-value arguments (a sketch of a partitioned import follows the command below).
Command: using --hive-import and --create-hive-table
--hive-import - imports the table into Hive
--create-hive-table - creates a new Hive table; if the table already exists, the job fails

mano@Mano:~$ sqoop-import --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table students -m 1 --warehouse-dir /user/hive/warehouse --hive-import --create-hive-table
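
For more control over where the data lands in Hive, the --hive-database, --hive-table and --hive-overwrite arguments can be combined, and a partitioned import can be requested with --hive-partition-key/--hive-partition-value. A sketch (the database name sqoop_hive, the table names and the partition column/value are only illustrative):

sqoop-import \
--connect jdbc:mysql://localhost/sqoop_test \
--username root --password root \
--table students -m 1 \
--hive-import --hive-overwrite \
--hive-database sqoop_hive \
--hive-table students_hive

sqoop-import \
--connect jdbc:mysql://localhost/sqoop_test \
--username root --password root \
--table students -m 1 \
--hive-import \
--hive-table students_part \
--hive-partition-key load_dt \
--hive-partition-value '2017-09-04'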


Will update further details related to HBase and Accumulo ASAP.


Please follow the link for further details ==> Sqoop_Page5



4 comments:

  1. I wonder whether anyone can write a sqoop import query for the below conditions:

    There is a system which will process dumps of a file being imported from MySQL through Sqoop.
    However, the system is designed such that it can process files only if fields are enclosed in single quotes ('), the field separator is a dash (-), and lines are terminated by a colon (:). If the data itself contains a double quote ("), then it should be escaped by \.

    Replies
    1. Hi,

      I tried to solve this question with this sqoop import (note: for --escaped-by, a plain "\" doesn't work here; you should put "\\"):
      sqoop import \
      --table departments \
      --connect jdbc:mysql://quickstart:3306/retail_db \
      --username retail_dba \
      --password cloudera \
      --target-dir /user/cloudera/tests/departments_test \
      --enclosed-by "'" \
      --escaped-by "\\" \
      --lines-terminated-by ":" \
      --fields-terminated-by "-"

      Before running this sqoop command, I inserted a row into the departments table: (11111, 'just for test "department_test"').
      So after executing the sqoop command I had this result:

      hdfs dfs -cat /user/cloudera/tests/departments_test/part*

      '2'-'Fitness':'3'-'Footwear':'4'-'Apparel':'5'-'Golf':'6'-'Outdoors':'7'-'Fan Shop':'11111111'-'just for test "department_test"'

      It seems good except for the --escaped-by "\\" part, and I don't know why the 'just for test "department_test"' value didn't change. It should be like 'just for test \"department_test\"'.

      I hope it helped you, and if you have an explanation for the 'just for test "department_test"' value, please let me know :).

      Thanks.

