Apache Oozie Tutorial: Introduction, Workflow & Easy Examples
Updated on 03 November, 2022
In this article, we look at scheduler systems and why a Hadoop system needs one. We then cover Apache Oozie in detail, including Oozie workflows, coordinators, and bundles, a word count workflow job, and a time-based coordinator job.
Scheduler System
Many jobs depend on one another, so that one job can start only after another has completed. For example, in the Hadoop ecosystem, a Hive job may take the output of a MapReduce job as its input. Many other processes likewise consume the output of earlier processes. A scheduler system organizes such interdependent jobs and runs them in the right order.
Apache Oozie handles all of these scheduling situations, which makes it an important part of the Hadoop ecosystem.
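To make the idea of interdependent jobs concrete, here is a minimal Python sketch that orders jobs by their dependencies. It is not how Oozie is implemented; the job names are hypothetical, and the ordering comes from the standard library's graphlib module.

```python
from graphlib import TopologicalSorter

# Hypothetical job names. Each job maps to the set of jobs it depends on.
dependencies = {
    "mapreduce_wordcount": set(),
    "hive_report": {"mapreduce_wordcount"},  # Hive reads the MapReduce output
    "sqoop_export": {"hive_report"},         # export runs after the report
}


def execution_order(deps):
    """Return one valid order in which the jobs can run."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)  # ['mapreduce_wordcount', 'hive_report', 'sqoop_export']
```

A scheduler does this kind of ordering for us, so we only have to declare which job depends on which.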
Apache Oozie: Introduction
Apache Oozie is a scheduler system that runs and manages jobs in the distributed environment of Hadoop. It can combine many different kinds of jobs into a job pipeline of your own design, and it can schedule MapReduce, Sqoop, Pig, and Hive tasks.
While defining a sequence of tasks, you can also have two or more jobs run in parallel. Apache Oozie is extensible, reliable, and scalable.
Oozie is an open-source Java web application that triggers the actions in a workflow. It is responsible for starting the jobs in a workflow, and it uses the Hadoop execution engine to run the tasks.
Oozie detects task completion through callbacks and polling. When Oozie starts a task, it gives the task a unique callback HTTP URL, and the task notifies that URL when it finishes. If the task fails to invoke the callback URL, Oozie polls the task for completion.
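The callback-with-polling-fallback idea can be sketched in a few lines of Python. This is only an illustration of the pattern, not Oozie code: the Task and Scheduler classes and the job names are hypothetical, and a plain function call stands in for the HTTP callback.

```python
class Task:
    """A stand-in for a remote task; not an Oozie class."""

    def __init__(self, name, invokes_callback):
        self.name = name
        self.invokes_callback = invokes_callback
        self.finished = False

    def run(self, callback):
        self.finished = True
        if self.invokes_callback:
            callback(self.name)  # stands in for calling the callback URL


class Scheduler:
    def __init__(self):
        self.completed = {}  # task name -> how completion was detected

    def on_callback(self, task_name):
        self.completed[task_name] = "callback"

    def poll(self, task):
        # Fallback: ask the task directly if no callback has arrived.
        if task.name not in self.completed and task.finished:
            self.completed[task.name] = "polling"


scheduler = Scheduler()
tasks = [
    Task("pig_job", invokes_callback=True),
    Task("hive_job", invokes_callback=False),  # its callback is never made
]
for task in tasks:
    task.run(scheduler.on_callback)
for task in tasks:
    scheduler.poll(task)

print(scheduler.completed)  # {'pig_job': 'callback', 'hive_job': 'polling'}
```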
Apache Oozie has three kinds of jobs:
Oozie Bundles – Packages of multiple coordinator and workflow jobs.
Oozie Coordinator Jobs – Workflow jobs that are triggered by time and by the availability of data.
Oozie Workflow Jobs – Sequences of actions to execute, specified as Directed Acyclic Graphs (DAGs).
Oozie Workflow
A workflow arranges its actions in sequence as a Directed Acyclic Graph (DAG). The actions depend on one another: the next action can start only after the previous action has finished, because it takes the output of that job as its input.
Java, shell, MapReduce, Hive, and Pig actions are some of the workflow actions that the Apache Oozie scheduler can schedule and execute. You can also specify a condition for a job, telling the scheduler to run an action only if an earlier output meets a given requirement.
Different kinds of actions can be created depending on the job, and each type of action has its own kind of tags. Before running a workflow, place the workflow definition, along with its jars or scripts, in an HDFS path.
Command: oozie job -oozie http://localhost:11000/oozie -config job.properties -run
http://host_name:11000 is the address of the Oozie web console. Open the console and click on a job to check its status.
Use a fork when several jobs need to run at the same time. Every fork must be closed by a join: the fork node splits one execution path into several concurrent paths, and the join node waits until all of those paths have finished. Fork and join are always used in pairs, and a join assumes that the concurrent paths arriving at it are children of a single fork. For example, two tables can be created in parallel.
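Here is a small, hedged sketch of what a fork and its matching join can look like, wrapped in Python so the pairing can be checked. The workflow name, node names, and paths are hypothetical, and the two actions use simple fs mkdir operations as placeholders for the table-creation jobs.

```python
import xml.etree.ElementTree as ET

# Hypothetical workflow: two placeholder actions run in parallel.
WORKFLOW = """
<workflow-app xmlns="uri:oozie:workflow:0.1" name="fork-join-sketch">
    <start to="fork_tables"/>
    <fork name="fork_tables">
        <path start="create_table_a"/>
        <path start="create_table_b"/>
    </fork>
    <action name="create_table_a">
        <fs><mkdir path="${nameNode}/tmp/table_a"/></fs>
        <ok to="join_tables"/>
        <error to="fail"/>
    </action>
    <action name="create_table_b">
        <fs><mkdir path="${nameNode}/tmp/table_b"/></fs>
        <ok to="join_tables"/>
        <error to="fail"/>
    </action>
    <join name="join_tables" to="end"/>
    <kill name="fail"><message>A parallel path failed</message></kill>
    <end name="end"/>
</workflow-app>
"""

NS = {"wf": "uri:oozie:workflow:0.1"}


def fork_join_pairs(xml_text):
    """Map each fork name to the single join that all of its paths lead to."""
    root = ET.fromstring(xml_text)
    ok_target = {a.get("name"): a.find("wf:ok", NS).get("to")
                 for a in root.findall("wf:action", NS)}
    joins = {j.get("name") for j in root.findall("wf:join", NS)}
    pairs = {}
    for fork in root.findall("wf:fork", NS):
        targets = {ok_target[p.get("start")]
                   for p in fork.findall("wf:path", NS)}
        if len(targets) != 1 or not targets <= joins:
            raise ValueError(f"fork {fork.get('name')} is not closed by one join")
        pairs[fork.get("name")] = targets.pop()
    return pairs


print(fork_join_pairs(WORKFLOW))  # {'fork_tables': 'join_tables'}
```

The check only follows the ok transition of each forked action, which is enough for this small example but would need extending for longer parallel paths.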
Decision tags are useful when the next action depends on an earlier output: they decide which operation to run once that output is known. For instance, there is no need to create a Hive table that already exists, so a decision tag can skip the table-creation steps when the table is already present. A decision node contains a switch tag with cases, similar to a switch-case statement.
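The switch-case behaviour of a decision node can be sketched in Python as follows. In a workflow, this corresponds to a decision node with switch, case, and default elements; the function, table names, and node names below are hypothetical, and a Python set stands in for a real check against Hive.

```python
def decide(cases, default):
    """Return the target of the first case whose predicate is true."""
    for predicate, target in cases:
        if predicate():
            return target
    return default


existing_tables = {"word_counts"}  # hypothetical list of tables already created


def next_node(table):
    # If the table exists, skip its creation and go straight to loading.
    return decide(
        cases=[(lambda: table in existing_tables, "load_data")],
        default="create_table",
    )


print(next_node("word_counts"))  # load_data
print(next_node("page_views"))   # create_table
```

As in a switch-case statement, the cases are tried in order and the default is taken only when none of them matches.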
The property file is also known as the config file. It keeps values such as parameters, script names, the name-node, and the job-tracker in one place, which comes in handy when managing them individually becomes difficult.
Apache Oozie Coordinator
Some workflows need to run on a regular schedule, and some are complex to schedule. Both kinds can be scheduled with the Oozie Coordinator. Oozie Coordinators trigger workflows based on time, data, and event predicates: the coordinator job starts a workflow once its condition is satisfied.
Here are some definitions you need to understand for coordinator jobs:
- frequency – How often the job is executed, counted in minutes.
- timezone – The timezone of the coordinator application.
- end – The end datetime of the job.
- start – The start datetime of the job.
Let us now look at the properties of the control information:
execution – Specifies the order in which actions are executed when several actions of a coordinator job meet their execution criteria at the same time. These are the kinds of execution:
- FIFO – Oldest first. This is the default, but it can be changed to another type of execution.
- LIFO – Newest first.
- LAST_ONLY – Only the latest action is run and the older ones are discarded.
concurrency – Controls the maximum number of actions of a job that can run in parallel. The default value is 1, which means only one action runs at a time.
timeout – Sets how long, in minutes, an action waits before it is discarded. With a value of 0, the action times out immediately if its input conditions are not all satisfied when it is materialized. With a value of -1, the action waits forever and is never discarded. The default value of timeout is -1.
Command: oozie job -oozie http://localhost:11000/oozie -config <path to coordinator.properties file> -run
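To see how the execution and concurrency controls interact, here is a simplified Python sketch. It is an assumption-laden illustration rather than Oozie's actual logic: the function name is hypothetical, and it only chooses which of the waiting actions would be started next.

```python
def select_actions(ready, execution="FIFO", concurrency=1):
    """Pick which ready actions to start next.

    `ready` lists the nominal times of the waiting actions, oldest first.
    """
    if execution == "FIFO":
        ordered = list(ready)            # oldest first
    elif execution == "LIFO":
        ordered = list(reversed(ready))  # newest first
    elif execution == "LAST_ONLY":
        ordered = ready[-1:]             # older actions are discarded
    else:
        raise ValueError(f"unknown execution type: {execution}")
    return ordered[:concurrency]


waiting = ["13:00", "14:00", "15:00"]
print(select_actions(waiting))                               # ['13:00']
print(select_actions(waiting, "LIFO", concurrency=2))        # ['15:00', '14:00']
print(select_actions(waiting, "LAST_ONLY", concurrency=2))   # ['15:00']
```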
Apache Oozie Bundle
A data pipeline is a set of coordinator applications, and the Oozie Bundle system lets you define and execute such a pipeline. The coordinator applications in an Oozie bundle have no explicit dependency on one another, but you can use their data dependencies to build a data application pipeline. A bundle can be started, stopped, suspended, resumed, and rerun, which makes operational control easier.
Kick-off-time – The time at which a bundle starts and submits its coordinator applications.
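As a rough illustration, the Python sketch below embeds a small bundle definition and reads back its kick-off time and coordinators. The bundle name, coordinator names, and the DailyReport path are hypothetical; only the overall shape (a bundle-app with controls and coordinator entries) is what we want to show.

```python
import xml.etree.ElementTree as ET

# Hypothetical bundle that packages two coordinator applications.
BUNDLE = """
<bundle-app name="wordcount-bundle" xmlns="uri:oozie:bundle:0.1">
    <controls>
        <kick-off-time>2017-12-19T13:29Z</kick-off-time>
    </controls>
    <coordinator name="hourly_wordcount">
        <app-path>${nameNode}/WordCountTest_TimeBased</app-path>
    </coordinator>
    <coordinator name="daily_report">
        <app-path>${nameNode}/DailyReport</app-path>
    </coordinator>
</bundle-app>
"""

NS = {"b": "uri:oozie:bundle:0.1"}


def describe_bundle(xml_text):
    """Return the kick-off time and a {coordinator name: app path} mapping."""
    root = ET.fromstring(xml_text)
    kick_off = root.findtext("b:controls/b:kick-off-time", namespaces=NS)
    coordinators = {c.get("name"): c.findtext("b:app-path", namespaces=NS)
                    for c in root.findall("b:coordinator", NS)}
    return kick_off, coordinators


kick_off, coordinators = describe_bundle(BUNDLE)
print(kick_off)            # 2017-12-19T13:29Z
print(list(coordinators))  # ['hourly_wordcount', 'daily_report']
```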
Next, let us see how a workflow job is created:
Apache Oozie Word Count Workflow Job
With Apache Oozie, we can execute a Word Count job. First, create a WordCountTest directory to hold all the files. Inside it, create a lib directory and place the word count jar there.
Then create the job.properties and workflow.xml files, which specify the job and its associated parameters.
1. job.properties
The job.properties file defines the addresses of the NameNode and the ResourceManager. The NameNode path is used to resolve the path of the workflow directory, and the jobTracker path is used to submit jobs to YARN. The user also provides the HDFS path where the workflow.xml file is stored.
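The article does not list this file, so the Python sketch below assumes a plausible job.properties and shows how a ${nameNode} reference resolves to a full HDFS path. The property values are taken from the addresses used elsewhere in this tutorial, and the small parser is only an illustration, not Oozie's own resolution logic.

```python
import re

# Assumed contents of job.properties for the WordCountTest workflow.
JOB_PROPERTIES = """\
nameNode=hdfs://localhost:8020
jobTracker=localhost:8032
queueName=default
oozie.wf.application.path=${nameNode}/WordCountTest
"""


def load_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props


def resolve(value, props):
    """Replace ${name} references with the matching property values."""
    return re.sub(r"\$\{([^}]+)\}", lambda m: props[m.group(1)], value)


props = load_properties(JOB_PROPERTIES)
print(resolve(props["oozie.wf.application.path"], props))
# hdfs://localhost:8020/WordCountTest
```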
2. workflow.xml
The workflow.xml file defines all the actions and how they are executed. The user first specifies the name of the workflow-app, here WorkflowRunnerTest, and then the start node.
The start node is the entry point of a workflow job. It points to the first node of the workflow, from which the job starts. Here, the job starts from the next node, called intersection0.
In the next step, in the action node, the user specifies the task to perform. Here, that task is a MapReduce WordCount job. The user provides the configuration required to run it, starting with the addresses of the Job Tracker and the NameNode.
Next comes the prepare element, which is used to clean up directories before the action executes. Here it performs a delete operation in HDFS: if the out1 folder already exists, it is removed. Prepare tags are used to delete or create a folder before the job runs. The user then specifies MapReduce properties such as the job queue name, mapper class, reducer class, output key class, and output value class.
Here is the workflow.xml file:
<workflow-app xmlns="uri:oozie:workflow:0.1" name="WorkflowRunnerTest">
    <start to="intersection0"/>
    <action name="intersection0">
        <map-reduce>
            <job-tracker>localhost:8032</job-tracker>
            <name-node>hdfs://localhost:8020</name-node>
            <prepare>
                <delete path="hdfs://localhost:8020/oozieout/out1"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>default</value>
                </property>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>MapperClass</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>ReducerClass</value>
                </property>
                <property>
                    <name>mapred.output.key.class</name>
                    <value>org.apache.hadoop.io.Text</value>
                </property>
                <property>
                    <name>mapred.output.value.class</name>
                    <value>org.apache.hadoop.io.IntWritable</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>/data</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>/oozieout/out1</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Map/Reduce failed, error message</message>
    </kill>
    <end name="end"/>
</workflow-app>
The last MapReduce configuration properties are the input and output directories in HDFS. The input directory is the data directory, which is stored in the root path of the NameNode. Finally, the kill element specifies what happens if the job fails.
Below are important points about this VM, please go through it without fail.
1) Hadoop and all other components are present in /usr/lib/
JDK : /usr/lib/jvm/jdk1.7.0_67
Eclipse : /home/upgrad/Desktop/eclipse
Hadoop : /usr/lib/hadoop-2.2.0
Pig : /usr/lib/pig-0.12.0
Hive : /usr/lib/hive-0.13.1-bin
Hbase : /usr/lib/hbase-0.96.2-hadoop2
Oozie : /usr/lib/oozie-4.0.0
Sqoop : /usr/lib/sqoop-1.4.4
Flume-ng : /usr/lib/flume-ng
2) The paths of all the components are set.
JDK : .bashrc
Hadoop : .bashrc
Pig : /etc/profile.d/pig.sh
Hive : /etc/profile.d/hive.sh
The oozie.wf.application.path property in job.properties must point to the WordCountTest folder in HDFS, so the folder has to be moved there. Copy the WordCountTest folder to the Hadoop root directory:
Command: hadoop fs -put WordCountTest /
[upgrad@localhost oozie-4.0.0]$ hadoop fs -put WordCountTest /
17/12/19 18:11:03 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
To verify that the folder has been uploaded to the HDFS root directory, check it in the NameNode Web UI.
Now run the workflow job with this command:
Command: oozie job -oozie http://localhost:11000/oozie -config job.properties -run
[upgrad@localhost oozie-4.0.0]$ cd WordCountTest
[upgrad@localhost WordCountTest]$ oozie job -oozie http://localhost:11000/oozie -config job.properties -run
job: 0000009-171219160449620-oozie-edur-W
[upgrad@localhost WordCountTest]$
Time-Based Word Count Coordinator Job
We are going to create a coordinator that runs at a specified time interval, so that the word count job is executed periodically. With Apache Oozie, we can create scheduled jobs and run them at regular intervals.
Let us move forward and create an Oozie coordinator job. We will create three files: coordinator.properties, coordinator.xml, and workflow.xml. The word count jar file is placed inside the lib directory.
Let us look at the coordinator.properties file:
frequency=60
startTime=2017-12-19T13\:29Z
endTime=2017-12-19T13\:34Z
timezone=UTC
nameNode=hdfs://localhost:8020
jobTracker=localhost:8032
queueName=default
workflowPath=${nameNode}/WordCountTest_TimeBased
oozie.coord.application.path=${nameNode}/WordCountTest_TimeBased
Here we specify the frequency at which the workflow should be executed. The unit of frequency is minutes, so in our case the coordinator job runs every 60 minutes. The frequency captures the periodic intervals at which data sets are produced and at which coordinator applications are scheduled to run.
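To get a feel for how frequency, start time, and end time combine, here is a small Python sketch that lists the times at which a run would be due. It assumes the start time is included and the end time is excluded, and that the end time is 2017-12-19T13:34Z, five minutes after the start. The function name is hypothetical.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%dT%H:%MZ"


def nominal_times(start, end, frequency_minutes):
    """List the scheduled run times from start (inclusive) to end (exclusive)."""
    times = []
    current = start
    while current < end:
        times.append(current)
        current += timedelta(minutes=frequency_minutes)
    return times


start = datetime.strptime("2017-12-19T13:29Z", FMT)
end = datetime.strptime("2017-12-19T13:34Z", FMT)

hourly = nominal_times(start, end, 60)
print([t.strftime(FMT) for t in hourly])  # ['2017-12-19T13:29Z']

every_minute = nominal_times(start, end, 1)
print(len(every_minute))  # 5
```

Under these assumptions, a 60-minute frequency fits only one run into the five-minute window, while a one-minute frequency would fit five.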
Use the following EL functions to specify the frequency:
- ${coord:minutes(int n)} – value: n; example: ${coord:minutes(45)} –> 45
- ${coord:hours(int n)} – value: n * 60; example: ${coord:hours(3)} –> 180
- ${coord:days(int n)} – value: variable; example: ${coord:days(2)} –> minutes in 2 full days from the current date
- ${coord:months(int n)} – value: variable; example: ${coord:months(1)} –> minutes in 1 full month from the current date
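The arithmetic behind these functions can be sketched in Python. The helper names below are hypothetical stand-ins for the EL functions, and the day and month versions are simplifications: they assume UTC (so every day has 1440 minutes) and count whole calendar months starting with the month of the given date.

```python
import calendar
from datetime import date


def coord_minutes(n):
    return n


def coord_hours(n):
    return n * 60


def coord_days(n):
    """Minutes in n full days, assuming UTC (no daylight saving shifts)."""
    return n * 24 * 60


def coord_months(n, start):
    """Minutes in n full calendar months, beginning with the month of `start`."""
    year, month = start.year, start.month
    days = 0
    for _ in range(n):
        days += calendar.monthrange(year, month)[1]  # days in this month
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return days * 24 * 60


print(coord_minutes(45))                    # 45
print(coord_hours(3))                       # 180
print(coord_days(2))                        # 2880
print(coord_months(1, date(2017, 12, 19)))  # 44640 (December has 31 days)
print(coord_months(1, date(2018, 2, 1)))    # 40320 (February 2018 has 28 days)
```

The last two lines show why the month value is listed as variable: the result depends on which month the period starts in.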
We also have to specify the startTime and endTime of the job: startTime is the start datetime and endTime is the end datetime of the coordinator job.
Finally, we have to specify the application path where all files are stored.
All of these properties are defined in the coordinator.properties file. The name, frequency, and timezone are specified on the coordinator-app element, so let us now create the coordinator XML file.
<coordinator-app name="coordinator1" frequency="${frequency}" start="${startTime}"
                 end="${endTime}" timezone="${timezone}" xmlns="uri:oozie:coordinator:0.1">
    <action>
        <workflow>
            <app-path>${workflowPath}</app-path>
        </workflow>
    </action>
</coordinator-app>
Now let us create workflow.xml for our job.
<workflow-app xmlns="uri:oozie:workflow:0.1" name="WorkflowRunnerTest">
    <start to="intersection0"/>
    <action name="intersection0">
        <map-reduce>
            <job-tracker>localhost:8032</job-tracker>
            <name-node>hdfs://localhost:8020</name-node>
            <prepare>
                <delete path="hdfs://localhost:8020/oozieTimeBasedout/out1"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>default</value>
                </property>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>MapperClass</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>ReducerClass</value>
                </property>
                <property>
                    <name>mapred.output.key.class</name>
                    <value>org.apache.hadoop.io.Text</value>
                </property>
                <property>
                    <name>mapred.output.value.class</name>
                    <value>org.apache.hadoop.io.IntWritable</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>/data</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>/oozieTimeBasedout/out1</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Map/Reduce failed, error message</message>
    </kill>
    <end name="end"/>
</workflow-app>
Now let us move these files to the HDFS directory.
Finally, we have to configure the coordinator job:
<configuration>
    <property>
        <name>startTime</name>
        <value>2017-12-19T13:29Z</value>
    </property>
    <property>
        <name>workflowPath</name>
        <value>hdfs://localhost:8020/WordCountTest_TimeBased</value>
    </property>
    <property>
        <name>oozie.coord.application.path</name>
        <value>hdfs://localhost:8020/WordCountTest_TimeBased</value>
    </property>
    <property>
        <name>timezone</name>
        <value>UTC</value>
    </property>
    <property>
        <name>user.name</name>
        <value>upgrad</value>
    </property>
    <property>
        <name>mapreduce.job.user.name</name>
        <value>upgrad</value>
    </property>
    <property>
        <name>queueName</name>
        <value>default</value>
    </property>
</configuration>
Now let's see the output created.
!! 1
“Save” 1
“reboot” 1
(Just 1
+ 1
-C 1
-f 1
-n 1
-r 2
safemode 1
. 1
./bin/oozie-start.sh 1
./bin/start-hbase.sh 2
./flume-ng 1
bashrc 2
/etc/profile.d/hbase.sh 1
/etc/profile.d/hive.sh 1
/etc/profile.d/oozie.sh 1
/etc/profile.d/pig.sh 1
/etc/profile.d/sqoop.sh 1
Conclusion
This brings us to the end of the tutorial. We hope you liked it and learned something about Apache Oozie. This article is a good starting point for any beginner who wants to learn the basic concepts of Apache Oozie.
Frequently Asked Questions (FAQs)
1. What is the job of Apache Oozie?
The primary job of Apache Oozie is to schedule Apache Hadoop jobs. It is a Java web application that also offers operational services for a Hadoop cluster. It can combine several jobs into a single unit of work, and its integration with the rest of the Hadoop stack is a key reason to consider it. Oozie supports Hadoop jobs such as Apache MapReduce, Apache Hive, Apache Pig, and Apache Sqoop, and it can also schedule Java programs and shell scripts. It helps cluster administrators build complex data transformations in less time.
2. Why should I consider using Apache Oozie in my organisation?
Many organisations use Apache Oozie today. It can be troublesome to begin with, but it gets easier once you get the hang of it. With Oozie, users can manage Hadoop workflows from several interfaces. Another common reason companies choose Apache Oozie is that it is free: even where it does not meet every operational need, the price is a considerable advantage. Once learned, Oozie makes it straightforward to create workflows, manage schedules, and write scripts, and teams that begin with Oozie tend to stay with it. Managing Hadoop workflows with Oozie takes effort, but for a Big Data project built around scheduling, it is worthwhile.
3. What are the cons of Apache Oozie?
Apache Oozie is inflexible about how actions, dependencies, and other operations are specified. For instance, for partitions, the expected format is MM.DD.YY, and any other format won't work. Another disadvantage is its limited set of actions, such as the FS, Pig, SSH, and Shell actions, which allow passing only from one action to the next. Oozie is also restricted to HDFS for MapReduce jobs, which can be a limitation.