Contents
- Setting the scene
- What is Hadoop?
1. Setting the Scene
- Big data – the 4Vs
- What’s wrong with relational databases?
- How about massively parallel processing databases?
- Big data technologies
- Data is the lifeblood of any organization
The challenge…
- The volume of data is growing exponentially
- Orders of magnitude larger than a few years ago
- Big data has become one of the most exciting technology trends in recent years
- How to get business value out of massive datasets
- This is the problem that big data technologies aim to solve
The Big Data 4Vs – Volume
One definition of big data relates to the volume of data.
- Any dataset whose volume exceeds 10 petabytes
- If stored in an RDBMS, it would have billions of rows
Some startling numbers from IBM research…
- 100 terabytes (10^12 bytes) of data held by most companies in the US
- 5 exabytes (10^18 bytes) of data are created every day worldwide
- 40 zettabytes (10^21 bytes) of data will be created by 2020 (estimated)
The Big Data 4Vs – Variety
Data comes in a wide variety of formats…
Structured data
- Highly organized, fits into well-known enterprise data models
- Typically stored in relational databases or spreadsheets
- Can be queried using standard query languages, e.g. SQL
Semi-structured data
- Log files, CSV files, etc.
- There is some degree of order, but not necessarily predictable
Unstructured data
- E.g. ad-hoc messaging notes, tweets, video clips
The Big Data 4Vs – Velocity
Velocity refers to the speed at which data enters your system.
- There are some industry sectors where it’s essential to gather, understand, and react to tremendous amounts of streaming data in real time – e.g. Formula 1, financial trading, etc.
With the IoT growing in importance, data is being generated at astonishing rates.
- e.g. the NYSE captures 1TB of trade info each trading session
- e.g. there are approx. 20 billion network connections on the planet
- e.g. an F1 car has approx. 150 sensors and can transmit 2GB of data in one lap! (and approx. 3TB over a race weekend)
The Big Data 4Vs – Veracity
Veracity refers to the verifiable correctness of data. It’s essential you trust your data; otherwise how can you make critical business decisions based on the data?
Here are some more stats from IBM research:
- Poor data quality costs the US approx. $3 trillion a year
- In a survey, 1 in 3 business leaders said they don’t trust the info they use to make business decisions
What’s Wrong with Relational Databases?
Standard relational databases can’t easily handle big data.
- RDBMS technology was designed decades ago
- At that time, very few organizations had terabytes (10^12 bytes) of data
- Today, organizations can generate terabytes of data every day!
It’s not only the volume of data that causes a problem, but also the rate it’s being generated.
- We need new technologies that can consume, process, and analyse large volumes of data quickly
How about Massively Parallel Processing DBs?
Key driving factors of big data technology:
- Scalability
- High availability
- Fault tolerance
- … all these things at a low cost
Several proprietary commercial products emerged over the decades to address these requirements.
- Massively parallel processing (MPP) DBs, e.g. Teradata, Vertica
- However, proprietary MPP products are expensive – not a general solution for everyone
Big Data Technologies
Big data technologies aim to address the issues we’ve just described.
- Some of the most active open-source projects today relate to big data
- Large corporates are making significant investments in big data technology
2. What is Hadoop?
- Hadoop was one of the first open-source big data technologies
- Scalable, fault-tolerant system for processing large datasets…
- Across a cluster of commodity servers
Hadoop provides high availability and fault tolerance.
- You don’t need to buy expensive hardware
- Hadoop is well suited for batch processing and ETL (extract, transform, load) of large-scale data
Many organizations have replaced expensive commercial products with Hadoop.
- Cost benefits – Hadoop is open source, runs on commodity h/w
- Easily scalable – just add some more (relatively cheap) servers
Hadoop Design Goals
Hadoop uses a cluster of commodity servers for storing and processing large amounts of data.
- Cheaper than using high-end powerful servers
- Hadoop uses a scale-out architecture (rather than scale-up)
- Hadoop is designed to work best with a relatively small number of huge files
- Average file size in Hadoop is > 500MB
Hadoop implements fault tolerance through software.
- Cheaper than implementing fault tolerance through hardware
- Hadoop doesn’t rely on fault-tolerant servers
- Hadoop assumes servers fail, and transparently handles failures
Developers don’t need to worry about handling hardware failures.
- You can leave Hadoop to handle these messy details! (see the sketch below)
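To make this concrete, here is a minimal sketch (the path and file contents are hypothetical) that writes a file to HDFS using Hadoop's Java FileSystem API. Notice that the client code contains no failure-handling logic: HDFS splits the file into blocks and replicates each block across several DataNodes (3 copies by default, controlled by the dfs.replication setting), so a failed disk or server is handled behind the scenes.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath,
        // so fs.defaultFS points at the cluster's NameNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path, used purely for illustration.
        Path file = new Path("/data/example/events.txt");

        // Write a file. HDFS splits it into blocks and replicates each block
        // across DataNodes (3 copies by default, set by dfs.replication),
        // so this code needs no failure handling of its own.
        try (FSDataOutputStream out = fs.create(file)) {
            out.write("hello, hadoop\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}

The same API is used whether the configuration points at a single test machine or a cluster of hundreds of servers; the application code does not change.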
Moving code from one computer to another is much faster and more efficient than moving large datasets.
- e.g. imagine you have a cluster of 50 computers with 1TB of data on each computer – what are the options for processing this data?
Option 1
Move the data to a very powerful server that can process 50TB of data.
- Moving 50TB of data will take a long time, even on a fast network (at 10 Gbit/s, copying 50TB takes on the order of 11 hours)
- Also, you’ll need expensive hardware to process data with this approach
Option 2
Move the code that processes the data to each computer in the 50-node cluster.
- It’s a lot faster and more efficient than Option 1
- Also, you don’t need high-end servers, which are expensive
Hadoop provides a framework that hides the complexities of writing distributed applications.
- It's a lot easier to write code that runs on a single computer than to write distributed applications
- There's a much bigger pool of application developers who can write non-distributed applications (see the word-count sketch below)
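To give a feel for what this looks like to a developer, here is a sketch of the classic word-count job, closely following the example in Hadoop's MapReduce tutorial (the input and output directories are assumed to be passed on the command line). The developer writes only the per-record map logic and the per-key reduce logic; the framework splits the input, ships this code to the nodes holding the data blocks, runs the tasks in parallel, and re-runs any task whose node fails.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: runs on the nodes that hold the input blocks;
    // emits (word, 1) for every word in its slice of the data.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: receives all the counts for one word and sums them.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: describes the job; Hadoop schedules the map and reduce
    // tasks across the cluster and handles any task or node failures.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A job like this is packaged as a jar and submitted to the cluster with the hadoop jar command, passing the input and output directories as arguments.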
Want to know more? Try TalkIT's training course in big data.
Copyright 2019 TalkIT Andy Olsen