- Blog Categories
- Software Development Projects and Ideas
- 12 Computer Science Project Ideas
- 28 Beginner Software Projects
- Top 10 Engineering Project Ideas
- Top 10 Easy Final Year Projects
- Top 10 Mini Projects for Engineers
- 25 Best Django Project Ideas
- Top 20 MERN Stack Project Ideas
- Top 12 Real Time Projects
- Top 6 Major CSE Projects
- 12 Robotics Projects for All Levels
- Java Programming Concepts
- Abstract Class in Java and Methods
- Constructor Overloading in Java
- StringBuffer vs StringBuilder
- Java Identifiers: Syntax & Examples
- Types of Variables in Java Explained
- Composition in Java: Examples
- Append in Java: Implementation
- Loose Coupling vs Tight Coupling
- Integrity Constraints in DBMS
- Different Types of Operators Explained
- Career and Interview Preparation in IT
- Top 14 IT Courses for Jobs
- Top 20 Highest Paying Languages
- 23 Top CS Interview Q&A
- Best IT Jobs without Coding
- Software Engineer Salary in India
- 44 Agile Methodology Interview Q&A
- 10 Software Engineering Challenges
- Top 15 Tech's Daily Life Impact
- 10 Best Backends for React
- Cloud Computing Reference Models
- Web Development and Security
- Find Installed NPM Version
- Install Specific NPM Package Version
- Make API Calls in Angular
- Install Bootstrap in Angular
- Use Axios in React: Guide
- StrictMode in React: Usage
- 75 Cyber Security Research Topics
- Top 7 Languages for Ethical Hacking
- Top 20 Docker Commands
- Advantages of OOP
- Data Science Projects and Applications
- 42 Python Project Ideas for Beginners
- 13 Data Science Project Ideas
- 13 Data Structure Project Ideas
- 12 Real-World Python Applications
- Python Banking Project
- Data Science Course Eligibility
- Association Rule Mining Overview
- Cluster Analysis in Data Mining
- Classification in Data Mining
- KDD Process in Data Mining
- Data Structures and Algorithms
- Binary Tree Types Explained
- Binary Search Algorithm
- Sorting in Data Structure
- Binary Tree in Data Structure
- Binary Tree vs Binary Search Tree
- Recursion in Data Structure
- Data Structure Search Methods: Explained
- Binary Tree Interview Q&A
- Linear vs Binary Search
- Priority Queue Overview
- Python Programming and Tools
- Top 30 Python Pattern Programs
- List vs Tuple
- Python Free Online Course
- Method Overriding in Python
- Top 21 Python Developer Skills
- Reverse a Number in Python
- Switch Case Functions in Python
- Info Retrieval System Overview
- Reverse a Number in Python
- Real-World Python Applications
- Data Science Careers and Comparisons
- Data Analyst Salary in India
- Data Scientist Salary in India
- Free Excel Certification Course
- Actuary Salary in India
- Data Analyst Interview Guide
- Pandas Interview Guide
- Tableau Filters Explained
- Data Mining Techniques Overview
- Data Analytics Lifecycle Phases
- Data Science Vs Analytics Comparison
- Artificial Intelligence and Machine Learning Projects
- Exciting IoT Project Ideas
- 16 Exciting AI Project Ideas
- 45+ Interesting ML Project Ideas
- Exciting Deep Learning Projects
- 12 Intriguing Linear Regression Projects
- 13 Neural Network Projects
- 5 Exciting Image Processing Projects
- Top 8 Thrilling AWS Projects
- 12 Engaging AI Projects in Python
- NLP Projects for Beginners
- Concepts and Algorithms in AIML
- Basic CNN Architecture Explained
- 6 Types of Regression Models
- Data Preprocessing Steps
- Bagging vs Boosting in ML
- Multinomial Naive Bayes Overview
- Gini Index for Decision Trees
- Bayesian Network Example
- Bayes Theorem Guide
- Top 10 Dimensionality Reduction Techniques
- Neural Network Step-by-Step Guide
- Technical Guides and Comparisons
- Make a Chatbot in Python
- Compute Square Roots in Python
- Permutation vs Combination
- Image Segmentation Techniques
- Generative AI vs Traditional AI
- AI vs Human Intelligence
- Random Forest vs Decision Tree
- Neural Network Overview
- Perceptron Learning Algorithm
- Selection Sort Algorithm
- Career and Practical Applications in AIML
- AI Salary in India Overview
- Biological Neural Network Basics
- Top 10 AI Challenges
- Production System in AI
- Top 8 Raspberry Pi Alternatives
- Top 8 Open Source Projects
- 14 Raspberry Pi Project Ideas
- 15 MATLAB Project Ideas
- Top 10 Python NLP Libraries
- Naive Bayes Explained
- Digital Marketing Projects and Strategies
- 10 Best Digital Marketing Projects
- 17 Fun Social Media Projects
- Top 6 SEO Project Ideas
- Digital Marketing Case Studies
- Coca-Cola Marketing Strategy
- Nestle Marketing Strategy Analysis
- Zomato Marketing Strategy
- Monetize Instagram Guide
- Become a Successful Instagram Influencer
- 8 Best Lead Generation Techniques
- Digital Marketing Careers and Salaries
- Digital Marketing Salary in India
- Top 10 Highest Paying Marketing Jobs
- Highest Paying Digital Marketing Jobs
- SEO Salary in India
- Brand Manager Salary in India
- Content Writer Salary Guide
- Digital Marketing Executive Roles
- Career in Digital Marketing Guide
- Future of Digital Marketing
- MBA in Digital Marketing Overview
- Digital Marketing Techniques and Channels
- 9 Types of Digital Marketing Channels
- Top 10 Benefits of Marketing Branding
- 100 Best YouTube Channel Ideas
- YouTube Earnings in India
- 7 Reasons to Study Digital Marketing
- Top 10 Digital Marketing Objectives
- 10 Best Digital Marketing Blogs
- Top 5 Industries Using Digital Marketing
- Growth of Digital Marketing in India
- Top Career Options in Marketing
- Interview Preparation and Skills
- 73 Google Analytics Interview Q&A
- 56 Social Media Marketing Q&A
- 78 Google AdWords Interview Q&A
- Top 133 SEO Interview Q&A
- 27+ Digital Marketing Q&A
- Digital Marketing Free Course
- Top 9 Skills for PPC Analysts
- Movies with Successful Social Media Campaigns
- Marketing Communication Steps
- Top 10 Reasons to Be an Affiliate Marketer
- Career Options and Paths
- Top 25 Highest Paying Jobs India
- Top 25 Highest Paying Jobs World
- Top 10 Highest Paid Commerce Job
- Career Options After 12th Arts
- Top 7 Commerce Courses Without Maths
- Top 7 Career Options After PCB
- Best Career Options for Commerce
- Career Options After 12th CS
- Top 10 Career Options After 10th
- 8 Best Career Options After BA
- Projects and Academic Pursuits
- 17 Exciting Final Year Projects
- Top 12 Commerce Project Topics
- Top 13 BCA Project Ideas
- Career Options After 12th Science
- Top 15 CS Jobs in India
- 12 Best Career Options After M.Com
- 9 Best Career Options After B.Sc
- 7 Best Career Options After BCA
- 22 Best Career Options After MCA
- 16 Top Career Options After CE
- Courses and Certifications
- 10 Best Job-Oriented Courses
- Best Online Computer Courses
- Top 15 Trending Online Courses
- Top 19 High Salary Certificate Courses
- 21 Best Programming Courses for Jobs
- What is SGPA? Convert to CGPA
- GPA to Percentage Calculator
- Highest Salary Engineering Stream
- 15 Top Career Options After Engineering
- 6 Top Career Options After BBA
- Job Market and Interview Preparation
- Why Should You Be Hired: 5 Answers
- Top 10 Future Career Options
- Top 15 Highest Paid IT Jobs India
- 5 Common Guesstimate Interview Q&A
- Average CEO Salary: Top Paid CEOs
- Career Options in Political Science
- Top 15 Highest Paying Non-IT Jobs
- Cover Letter Examples for Jobs
- Top 5 Highest Paying Freelance Jobs
- Top 10 Highest Paying Companies India
- Career Options and Paths After MBA
- 20 Best Careers After B.Com
- Career Options After MBA Marketing
- Top 14 Careers After MBA In HR
- Top 10 Highest Paying HR Jobs India
- How to Become an Investment Banker
- Career Options After MBA - High Paying
- Scope of MBA in Operations Management
- Best MBA for Working Professionals India
- MBA After BA - Is It Right For You?
- Best Online MBA Courses India
- MBA Project Ideas and Topics
- 11 Exciting MBA HR Project Ideas
- Top 15 MBA Project Ideas
- 18 Exciting MBA Marketing Projects
- MBA Project Ideas: Consumer Behavior
- What is Brand Management?
- What is Holistic Marketing?
- What is Green Marketing?
- Intro to Organizational Behavior Model
- Tech Skills Every MBA Should Learn
- Most Demanding Short Term Courses MBA
- MBA Salary, Resume, and Skills
- MBA Salary in India
- HR Salary in India
- Investment Banker Salary India
- MBA Resume Samples
- Sample SOP for MBA
- Sample SOP for Internship
- 7 Ways MBA Helps Your Career
- Must-have Skills in Sales Career
- 8 Skills MBA Helps You Improve
- Top 20+ SAP FICO Interview Q&A
- MBA Specializations and Comparative Guides
- Why MBA After B.Tech? 5 Reasons
- How to Answer 'Why MBA After Engineering?'
- Why MBA in Finance
- MBA After BSc: 10 Reasons
- Which MBA Specialization to choose?
- Top 10 MBA Specializations
- MBA vs Masters: Which to Choose?
- Benefits of MBA After CA
- 5 Steps to Management Consultant
- 37 Must-Read HR Interview Q&A
- Fundamentals and Theories of Management
- What is Management? Objectives & Functions
- Nature and Scope of Management
- Decision Making in Management
- Management Process: Definition & Functions
- Importance of Management
- What are Motivation Theories?
- Tools of Financial Statement Analysis
- Negotiation Skills: Definition & Benefits
- Career Development in HRM
- Top 20 Must-Have HRM Policies
- Project and Supply Chain Management
- Top 20 Project Management Case Studies
- 10 Innovative Supply Chain Projects
- Latest Management Project Topics
- 10 Project Management Project Ideas
- 6 Types of Supply Chain Models
- Top 10 Advantages of SCM
- Top 10 Supply Chain Books
- What is Project Description?
- Top 10 Project Management Companies
- Best Project Management Courses Online
- Salaries and Career Paths in Management
- Project Manager Salary in India
- Average Product Manager Salary India
- Supply Chain Management Salary India
- Salary After BBA in India
- PGDM Salary in India
- Top 7 Career Options in Management
- CSPO Certification Cost
- Why Choose Product Management?
- Product Management in Pharma
- Product Design in Operations Management
- Industry-Specific Management and Case Studies
- Amazon Business Case Study
- Service Delivery Manager Job
- Product Management Examples
- Product Management in Automobiles
- Product Management in Banking
- Sample SOP for Business Management
- Video Game Design Components
- Top 5 Business Courses India
- Free Management Online Course
- SCM Interview Q&A
- Fundamentals and Types of Law
- Acceptance in Contract Law
- Offer in Contract Law
- 9 Types of Evidence
- Types of Law in India
- Introduction to Contract Law
- Negotiable Instrument Act
- Corporate Tax Basics
- Intellectual Property Law
- Workmen Compensation Explained
- Lawyer vs Advocate Difference
- Law Education and Courses
- LLM Subjects & Syllabus
- Corporate Law Subjects
- LLM Course Duration
- Top 10 Online LLM Courses
- Online LLM Degree
- Step-by-Step Guide to Studying Law
- Top 5 Law Books to Read
- Why Legal Studies?
- Pursuing a Career in Law
- How to Become Lawyer in India
- Career Options and Salaries in Law
- Career Options in Law India
- Corporate Lawyer Salary India
- How To Become a Corporate Lawyer
- Career in Law: Starting, Salary
- Career Opportunities: Corporate Law
- Business Lawyer: Role & Salary Info
- Average Lawyer Salary India
- Top Career Options for Lawyers
- Types of Lawyers in India
- Steps to Become SC Lawyer in India
- Tutorials
- C Tutorials
- Recursion in C: Fibonacci Series
- Checking String Palindromes in C
- Prime Number Program in C
- Implementing Square Root in C
- Matrix Multiplication in C
- Understanding Double Data Type
- Factorial of a Number in C
- Structure of a C Program
- Building a Calculator Program in C
- Compiling C Programs on Linux
- Java Tutorials
- Handling String Input in Java
- Determining Even and Odd Numbers
- Prime Number Checker
- Sorting a String
- User-Defined Exceptions
- Understanding the Thread Life Cycle
- Swapping Two Numbers
- Using Final Classes
- Area of a Triangle
- Skills
- Software Engineering
- JavaScript
- Data Structure
- React.js
- Core Java
- Node.js
- Blockchain
- SQL
- Full stack development
- Devops
- NFT
- BigData
- Cyber Security
- Cloud Computing
- Database Design with MySQL
- Cryptocurrency
- Python
- Digital Marketings
- Advertising
- Influencer Marketing
- Search Engine Optimization
- Performance Marketing
- Search Engine Marketing
- Email Marketing
- Content Marketing
- Social Media Marketing
- Display Advertising
- Marketing Analytics
- Web Analytics
- Affiliate Marketing
- MBA
- MBA in Finance
- MBA in HR
- MBA in Marketing
- MBA in Business Analytics
- MBA in Operations Management
- MBA in International Business
- MBA in Information Technology
- MBA in Healthcare Management
- MBA In General Management
- MBA in Agriculture
- MBA in Supply Chain Management
- MBA in Entrepreneurship
- MBA in Project Management
- Management Program
- Consumer Behaviour
- Supply Chain Management
- Financial Analytics
- Introduction to Fintech
- Introduction to HR Analytics
- Fundamentals of Communication
- Art of Effective Communication
- Introduction to Research Methodology
- Mastering Sales Technique
- Business Communication
- Fundamentals of Journalism
- Economics Masterclass
- Free Courses
Top 3 Apache Spark Applications / Use Cases & Why It Matters
Updated on 23 October, 2024
12.46K+ views
• 8 min read
Table of Contents
Apache Spark was developed by a team at UC Berkeley in 2009. Since then, Apache Spark has seen a very high adoption rate from top-notch technology companies like Google, Facebook, Apple, Netflix etc. The demand has been ever increasing day by day. According to marketanalysis.com survey, the Apache Spark market worldwide will grow at a CAGR of 67% between 2019 and 2022. The Spark market revenue is zooming fast and may grow up $4.2 billion by 2022, with a cumulative market valued at $9.2 billion (2019 - 2022).
Spark is a cluster computing framework, somewhat similar to MapReduce but has a lot more capabilities, features, speed and provides APIs for developers in many languages like Scala, Python, Java and R. It is also friendly for database developers as it provides Spark SQL which supports most of the ANSI SQL functionality. Spark also has out of the box support for Machine learning and Graph processing using components called MLlib and GraphX respectively. Spark also has support for streaming data using Spark Streaming.
Spark is developed in Scala programming language. Though the majority of use cases of Spark uses HDFS as the underlying data file storage layer, it is not mandatory to use HDFS. It does work with a variety of other Data sources like Cassandra, MySQL, AWS S3 etc. Apache Spark also comes with its default resource manager which might be good enough for the development environment and small size cluster, but it also integrates very well with YARN and Mesos. Most of the production-grade and large clusters use YARN and Mesos as the resource manager.
Features of Spark
- Speed: According to Apache, Spark can run applications on Hadoop cluster up to 100 times faster in memory and up to 10 times faster on disk. Spark is able to achieve such a speed by overcoming the drawback of MapReduce which always writes to disk for all intermediate results. Spark does not need to write intermediate results to disk and can work in memory using DAG, lazy evaluation, RDDs and caching. Spark has a highly optimized execution engine which makes it so fast.
- Fault Tolerance: Spark’s optimized execution engine not only makes it fast but is also fault tolerant. It achieves this using abstraction layer called RDD (Resilient Distributed Datasets) in combination with DAG, which is built to handle failures of tasks or even node failures.
- Lazy Evaluation: Spark works on lazy evaluation technique. This means that the processing(transformations) on Spark RDD/Datasets are evaluated in a lazy manner, i.e. the output RDDs/datasets are not available after transformation will be available only when needed i.e. when any action is performed. The transformations are just part of the DAG which gets executed when action is called.
- Multiple Language Support: Spark provides support for multiple programming languages like Scala, Java, Python, R and also Spark SQL which is very similar to SQL.
- Reusability: Spark code once written for batch processing jobs can also be utilized for writing processed on Stream processing and it can be used to join historical batch data and stream data on the fly.
- Machine Learning: MLlib is a Machine Learning library of Spark. which is available out of the box for creating ML pipelines for data analysis and predictive analytics also
- Graph Processing: Apache Spark also has Graph processing logic. Using GraphX APIs which is again provided out of the box one can write graph processing and do graph-parallel computation.
- Stream Processing and Structured Streaming: Spark can be used for batch processing and also has the capability to cater to stream processing use case with micro batches. Spark Streaming comes with Spark and one does not need to use any other streaming tools or APIs. Spark streaming also supports Structure Streaming. Spark streaming also has in-built connectors for Apache Kafka which comes very handy while developing Streaming applications.
- Spark SQL: Spark has an amazing SQL support and has an in-built SQL optimizer. Spark SQL features are used heavily in warehouses to build ETL pipelines.
Spark is being used in more than 1000 organizations who have built huge clusters for batch processing, stream processing, building warehouses, building data analytics engine and also predictive analytics platforms using many of the above features of Spark. Let’s look at some of the use cases in a few of these organizations.
What are the Different Apache Spark Applications?
Streaming Data:
Streaming is basically unstructured data produced by different types of data sources. The data sources could be anything like log files generated while customers using mobile apps or web applications, social media contents like tweets, facebook posts, telemetry from connected devices or instrumentation in data centres. The streaming data is usually unbounded and is being processed as received from the data source.
Then there is Structured streaming which works on the principle of polling data in intervals and then this interval data is processed and appended or updated to the unbounded result table.
Apache Spark has a framework for both i.e. Spark Streaming to handle Streaming using micro batches and DStreams and Structured Streaming using Datasets and Data frames.
Let us try to understand Spark Streaming from an example.
Suppose a big retail chain company wants to get a real-time dashboard to keep a close eye on its inventory and operations. Using this dashboard the management should be able to track how many products are being purchased, shipped and delivered to customers.
Spark Streaming can be an ideal fit here.
The order management system pushes the order status to the queue(could be Kafka) from where Streaming process reads every minute and picks all the orders with their status. Then Spark engine processes these and emits the output status count. Spark streaming process runs like a daemon until it is killed or error is encountered.
Machine Learning:
As defined by Arthur Samuel in 1959, “Machine Learning is the] field of study that gives computers the ability to learn without being explicitly programmed”. In 1997, Tom Mitchell gave a definition which is more specifically from an engineering perspective, “A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.”. ML solves complex problems that could not be solved with just mathematical numerical methods or means. ML is not supposed to make perfect guesses. In ML’s domain, there is no such thing. Its goal is to make a prediction or make guesses which are good enough to be useful.
MLlib is the Apache Spark’s scalable machine learning library. MLlib has multiple algorithms for Supervised and Unsupervised ML which can scale out on a cluster for classification, regression, clustering, collaborative filtering. MLlib interoperates with Python’s math/numerical analysis library NumPy and also with R’s libraries. Some of these algorithms are also applicable to streaming data. MLlib helps Spark provide sentiment analysis, customer segmentation and predictive intelligence.
A very common use case of ML is text classification, say for categorising emails. An ML pipeline can be trained to classify emails by reading an Inbox. A typical ML pipeline looks like this. ML is a subject in itself so it is not possible to deep dive here.
Fog Computing:
Fog Computing is another use case of Apache Spark. To understand Fog computing we need to understand IoT first. IoT basically connects all our devices so that they can communicate with each other and provide solutions to the users of those devices. This would mean huge amounts of data and current cloud computing may not be sufficient to cater to so much data transfer, data processing and online demand of customer’s request.
Fog computing can be ideal here as it takes the work of processing to the devices on the edge of the network. This would need very low latency, parallel processing of ML and complex graph analytical algorithms, all of which are readily available in Apache spark out of the box and can be pick and choose as per the requirements of the processing. So it is expected that as IoT gains momentum Apache spark will be the leader in Fog computing.
- Event Detection:Apache Spark is increasingly used in event detection like credit card fraud detection, money laundering activities etc. Apache spark streaming along with MLlib and Apache Kafka forms the backbone of a fraud financial transaction detection.
Credit card transactions of a cardholder can be captured over a period of time to categorize user’s spending habits. Models can be developed and trained to predict any anomaly in the card transaction and along with Spark streaming and Kafka in real time. - Interactive Analysis: Spark’s one of the most popular features is its ability to provide users with interactive analytics. MapReduce does provide tools like Pig and Hive for interactive analysis, but they are too slow in most of the cases. But Spark is very fast and swift and that’s why it has gained so much ground in the interactive analysis.
Spark interfaces with programming languages like R, Python, SQL and Scala which caters to a bigger set of developers and users for interactive analysis. Spark also came up with Structured Streaming in version 2.0 which can be used for interactive analysis with live data as well as join the live data with batch data output to get more insight into the data. Structured streaming in future has the potential to boost Web Analytics by allowing users to query user’s live web session. Even machine learning can be applied to live session data for more insights. - Data Warehousing: Data warehousing is another function where Apache Spark has is getting tremendous traction. Due to an increasing volume of data day by day, the tradition ETL tools like Informatic along with RDBMS are not able to meet the SLAs as they are not able to scale horizontally. Spark along with Spark SQL is being used by many companies to migrate to Big Data based Warehouse which can scale horizontally as the load increases.
With Spark, even the processing can be scaled horizontally by adding machines to the Spark engine cluster. These migrated applications embed the Spark engine and offer a web UI to allow users to create, run, test and deploy jobs interactively. Jobs are primarily written in native Spark SQL or other flavors of SQL. These Spark clusters have been able to scale to process many terabytes of data every day and the clusters can be hundreds to thousands of nodes.
Companies Using Apache Spark
Apache Spark at Alibaba:
Alibaba is the world’s one of the biggest e-commerce players. Alibaba’s online shopping platform generates Petabytes of data as it has millions of users every day doing searches, shopping and placing orders. These user interactions are represented as complex graphs. The processing of these data points is done using Spark’s Machine learning component MLlib and then used to provide better user shopping experience by suggesting products based on choice, trending products, reviews etc.
Apache Spark at MyFitnessPal:
MyFitnessPal is one of the largest health and fitness lifestyle portals. It has over 80 million active users. The portal helps its users follow and achieve a healthy lifestyle by following a proper diet and fitness regime. The portal uses the data added by users about their food, exercise and lifestyles to identify the best quality food and effective exercise. Using Spark the portal is able to scan through the huge amount of structured and unstructured data and pull out best suggestions for its users.
Apache Spark at TripAdvisor:
TripAdvisor has a huge user base and generates a mammoth amount of data every day. It is one of the biggest names in the Travel and Tourism industry. It helps users plan their personal and official trips around the world. It uses Apache Spark to process petabytes of data from user interactions and destination details and gives recommendations on planning a perfect trip based on users choice and preferences. They help users identify best airlines, best prices on hotels and airlines, best places to eat, basically everything needed to plan any trip. It also ranks these places, hotels, airlines, restaurants based on user feedback and reviews. All this processing is done using Apache Spark.
Apache Spark at Yahoo:
Yahoo is known to have one of the biggest Hadoop Cluster and everyone is aware of Yahoo’s contribution to the development of Big Data system. Yahoo is also heavily using Apache Spark Machine learning capabilities to identify topics and news which users are interested in. This is similar to trending tweets or hashtags on Twitter or Facebook. Earlier these Machine Learning algo were developed in C/C++ with thousands of lines of code. While today with Spark and Scala/Pythons these algorithms can be implemented in few hundreds of lines of code. This is a big leap in turnover time as well as code understanding and maintenance. This has been made possible due to Spark to a great extent.
Apache Spark Use cases
Finance:
Spark is used in Finance industry across different functional and technology domains.
A typical use case is building a Data Warehouse for batch processing and daily reporting. The Spark data frames abstraction has been used as a generic ingestion platform capable of ingesting data from multiple sources of different formats.
Financial services companies also use Apache Spark MLlib to create and train models for fraud detection. Some of the banks have started using Spark as a tool for classifying text in money transfers.
Some of the companies use Apache spark as log collection, an analysis engine and detection engine.
Let’s look at Spain's 2nd biggest bank BBVA use case where every money transfer a customer makes goes through an engine that infers a category from its textual description. This engine has been developed in Spark, mixes MLLib and own implementations, and is currently into production serving more than 5M customers daily.
The challenges that the BBVA technology team faced while building this ML were many:
- They did not know the data source in advance
- They did not have a labelled set
- A fraction of texts is useless (detection rather than classification)
- Distribution of categories is imbalanced
- Prefer false negatives over false positives
- Very short text, language not even syntactically correct
The engineers solved these problems using the Spark MLlib pipeline using some other NLP tools like word2vec.
- TF-IDF features + linear classifier (98% precision, 21% recall
- Further tests with word2vec + Vector of Locally Aggregated Descriptors (VLAD)
- Implemented in Spark/Scala, using MLlib classes
- Own classes implemented for Multi-class Logistic Regression, VLAD
- Scala dependency injection useful to quickly setup variants of the above steps
HealthCare:
Healthcare industry is the newest in adopting advanced technologies like big data and machine learning to provide hi-tech facilities to their patients. Apache Spark is penetrating fast and is becoming the heartbeat in the latest Healthcare applications. Hospitals use these Spark enabled healthcare applications to analyze patients medical history to identify possible health issues based on history and learning.
Also, healthcare produces massive amounts of data and to process so much of the data in quick time and provide insights based on that itself was a challenge which Spark solves with ease.
Another very interesting problem in hospitals is when working with Operating Room(OR) scheduling within a hospital setting is that it is difficult to schedule and predict available OR block times. This leads to empty and unused operating rooms leading to longer waiting times for patients for their procedures.
Let’s see a use case. For a basic surgical procedure, it costs around $15-20 per minute. So, OR is a scarce and valuable resource and it needs to be utilized carefully and optimally. OR efficiency differs depending on the OR staffing and allocation, not the workload. So the loss of efficiency means a loss for the patient. So time and management are the utmost importance here.
Spark and MLlib solve the problem by developing a predictive model that would identify available OR time 2 weeks in advance, allows hospitals to confirm waitlist cases two weeks in advance instead of when blocks normally release 4 days out. This OR Scheduling can be done by getting the historical data and running then linear regression model with multiple variables.
This model works because:
- Can coordinate waitlist scheduling logistics with physicians and patients within 2 weeks of surgery.
- Plan staff scheduling and resources so there are less last-minutes staffing issues for nursing and anesthesia
- Utilization metrics show where elective surgical schedule and level demand can be maximized.
Retail:
Big retail chains have this usual problem of optimising their supply chain to minimize cost and wastage, improve customer service and gain insights into customer’s shopping behaviour to serve better and in the process optimize their profit.
To achieve these goals these retail companies have a lot of challenges like to keep the inventory up to date based on sales and also to predict sales and inventory during some promotional events and sale seasons. Also, they need to keep a track on customer’s orders transit and delivery. All these pose huge technical challenges. Apache Spark and MLlib is being used by a lot of these companies to capture real-time sales and invoice data, ingest it and then figure out the inventory. The technology can also be used to identify in real-time the order’s transit and delivery status. Spark MLlib analytics and predictive models are being used to predict sales during promotions and sale seasons to match the inventory and be ready for the event. The historical data on customer’s buying behaviour is also used to provide the customer with personalized suggestions and improve customer satisfaction. A lot of stores have started using sensors to get data on customer’s location within the store, their preferences, shopping behaviour, etc to provide on-the-spot suggestions and help to find, buy a product by sending messages, using displays etc.
Travel:
Airline customer segmentation is a challenging field to understand due to customer’s complex behaviour. Amadeus is one of the main IT solution providers in the airline industry. It has the resources and infrastructure to manage all the ticketing and booking data as well as understanding the Airline needs and market particularities. By combining different data sources produced by different airline systems, they have applied unsupervised machine learning techniques to improve our understanding of customer behaviour.
Challenges in the airline industry are to understand the health of the business:
- Are any segments growing or shrinking
- How is the yield developing
- Tune marketing to specific interests within segments
- Optimize product offers using fare structures and media offers
Traditional approaches for segmentation were based on business intuition and manually crafter rules set. But these approaches have limitations and prejudices which can sometimes be negative for the business. On the contrary, the data-driven approach is resilient against turn-over, prejudices and market change.
With a data-driven approach and using Spark and MLlib, the model is able to extract actionable insights on typical customer behaviour and intentions. Supervised and supervised learning using Spark MLlib techniques at scale are used to train models for prediction. These are then used to assist the customer in deploying the newfound insights into day-to-day operations.
Media:
Media companies Netflix, Hotstar etc are using Apache Spark at the heart of their technology engine to drive their business. When a user turns on Netflix, he is able to see his favourite content playing automatically. This is achieved through recommendation engines built on Machine learning algorithms and Spark MLlib. Netflix uses historical data from users content selection, trains its ML algorithms, tests it offline and then deploys it live and checks if it works in Production as well.
Netflix has built an engine something called Time Travel using Apache Spark and other big data technologies to: Snapshot online services and use the snapshot data offline to generate features and share facts and features between experiments without calling live systems.
Energy:
Apache Spark is spreading its roots everywhere. A common man not related to software industry may not realize it but there are applications running or extracting data from his home environment and processed in Spark to make his life better and easier. An example we will discuss below is the British Gas.
British Gas is a 200-year-old company. Connected Homes is BG’s IoT “startup”. It is a leader in the UK’s connected home market. Connected Homes is trying to predict the usage consumption patterns of the electricity, gas at the homes and provide consumers with insights so they can smartly use their devices and reduce energy consumption and save energy and money. Connected homes use Apache Spark at the core of its Data Engineering and ML engine.
The challenges are there are millions of electric and gas meters and the meters are read every 30 minutes.
There are:
- Gas and electricity meter readings
- Thermostat temperature data
- Connected boiler data
- Real-time energy consumption data
- Introducing motion sensors, window and door sensors, etc
Apache Spark MLlib is used to apply machine learning to these data for disaggregation, similar home comparison and smart meters used in indirect algorithms for non-smart customers.
The analytics engine is used to show customers how they have spent energy, what are their top 3 spends, how can they reduce their energy consumption by showing patterns from smart consumers and smart meters etc. This gives customers a lot of insight and educates customers on optimally using energy at their homes.
Gaming:
Online Gaming industry is another beneficiary of the Apache Spark technology.
Riot Games uses Spark for Combating abusive language in chat in the team games. The challenges in online gaming are:
- 1% of all players are consistently unsportsmanlike
- 2% of all games infected by serious toxicity
- In-Game Toxicity 95% of all serious toxicity comes from players who are otherwise sportsman like
To solve this the game developers tried to predict the words used by the gamers in the context of the game or the scenario. They used the “Word2Vec” a neural model which has 256 dimensions embedding months of chat logs. Each word in the chat is document split in spaces and lowercase. The model was trained on NLP for acronyms, short forms, colloquial words etc. and the deviations could be huge. The team built a model trained to predict bad/toxic language. The gaming company has 100+ million users every month and so the data is huge. They used Spark MLlib to train their models using different algorithms, one of them Logistic Regression Random Forest Gradient Boosted Trees. The results were impressive for them as they tuned their models for better precision.
Benefits of having Apache Spark for Individual Companies
Many of the companies across industries have been benefiting from Apache Spark. The reasons could be different:
- Speed of execution
- Multi-language support
- Machine learning library
- Graph processing library
- Batch processing as well as Stream & Structured stream processing
Apache Spark is beneficial for small as well as large enterprise. Spark offers a complete solution to many of the common problems like ETL and warehousing, Stream data processing, common use case of supervised and unsupervised learning for data analytics and predictive modelling. So with Apache Spark, the technology team does not require to look out for different technology stack and multiple vendors for a solution. This reduces the learning curve for additional development and maintenance. Also, since Spark has support for multiple languages Scala, Java, Python & R, it is easy to find developers.
Limitations:
Though there are so many benefits of Apache Spark as we have seen above, there are few limitations which Apache Spark has. We should be aware of these limitations before we decide to adopt any technology.
- Apache Spark does not come with an inbuilt file system and it has to depend on HDFS in most of the use cases. If not, it has to be used with some cloud-based data platform.
- Even though Spark has Stream processing feature it is not exactly real-time processing. It processes in batches which are called micro-batches.
- Apache Spark is expensive as it catches a lot of data and memory is not cheap.
- Spark faces issues while working with HDFS which has a very large number of small files.
- Though Spark MLlib provides machine learning capabilities, it does not come with a very exhaustive list of algorithms. It can solve a lot of ML problems but not all.
- Apache Spark does not have automatic code optimization process in place and so the code needs to be optimized manually.
Reasons Why You Should Learn Apache Spark
We have seen the wide impact and use cases if Apache Spark. So we know that Spark has become a buzzword these days. We should now also understand why we should learn Spark.
- Spark offers a complete package for developers and can act as a unified analytics engine. Hence it increases the productivity for the developers. So it ROI for any firm is high and most of the companies dependent on technology are aware of the fact and also willing to put their money in Spark.
- Learning Spark can help explore the world of Big data and data science. Both these technology fields are the future and bringing transformational changes in almost all industries. So getting exposed to Spark is becoming a necessity for all firms.
- With fast-paced Spark adoption by organizations, it is opening up new prospects in business and many of the applications have proved that instead of business driving technology it is becoming vice versa now, that technology is driving business.
- According to a survey, there is a huge demand for Spark engineers. Today, there are well over 1,000 contributors to the Apache Spark project across 250+ companies worldwide. Recently, Indeed.com listed over 2,400 full-time open positions for Apache Spark professionals across various industries including enterprise technology, e-commerce/retail, healthcare, and life sciences, oil and gas, manufacturing, and more.
- Apache Spark developers earn the highest average salary among all other programmers. So this is another and one of the major incentives one can get to learn and expertise Spark.
Conclusion
Apache Spark has capabilities to process huge amount of data in a very efficient manner with high throughput. It can solve problems related to batch processing, near real-time processing, can be used to apply lambda architecture, can be used for Structured streaming. Also, it can solve many of the complex data analytics and predictive analytics problems with the help of the MLlib component which comes out of the box. Apache Spark has been making a big impact on the whole data engineering and data science gamut at scale.
Embark on your journey into Software Development with our highly sought-after courses—featuring hands-on projects, expert mentorship, and the skills to secure your dream job!
In-Demand Software Development Skills
Step into the realm of Software Engineering with our highly sought-after courses—featuring practical projects, expert mentorship, and the skills you need to secure your next role!
Explore our Popular Software Engineering Courses
Start your journey into Software Development with our free, sought-after courses—experience hands-on projects, expert mentorship, and acquire the skills that lead to career success!
Explore Our Software Development Free Courses
Frequently Asked Questions (FAQs)
1. Does Apache Spark offer any benefits?
Apache Spark is a hugely popular unified analytics engine designed for machine learning and big data. Since its launch, Apache Spark has seen rapid adoption by organizations across various industries. There are several advantages of employing this platform, which account for its tremendous popularity. First is the lightning-fast speed of large-scale data processing offered by Apache Spark; it is up to 100 times faster than that provided by Hadoop. Then Spark comes as a whole unified package with high-level libraries, graph processing, SQL query support and data streaming capabilities, which contribute to the productivity of developers. And, of course, Apache Spark is highly user-friendly too.
2. What is exactly meant by data analytics?
Data analytics is the process by which meaningful insights are extracted from raw data with the help of specialized software applications. These applications help transform, streamline and arrange the data in specific models such that it can help in drawing conclusions and identifying patterns. With the infinite power that data holds today, data analytics has evolved into a complex practice that is used to extract meaning from massive volumes of data and often high-velocity data that bring about various challenges. This is why expert data analytics professionals, also known as data scientists, are required for the successful handling and modelling of data.
3. Is Big Data and Hadoop the same thing?
Big Data and Hadoop are not really the same; although they are very closely interconnected, i.e. without the existence of one, there would be no meaning or existence of the other. You can consider Big Data as an asset of extreme value to businesses, but to realize and make use of the value that is contained in this asset, you need some method or tool. Hadoop is a tool or platform developed to extract the maximum value from this asset, i.e. Big Data. Big Data refers to complex, massive datasets that are processed, analyzed and stored with the help of a sophisticated framework known as Apache Hadoop.