Want to read more? You can buy this book at oreilly.com in print and ebook format. Buy 2 books, get the 3rd FREE! Use discount code: OPC10 All orders over $29.95 qualify for free shipping within the US.
It’s also available at your favorite book retailer, including the iBookstore, the Android Marketplace, and Amazon.com.
Spreading the knowledge of innovators
oreilly.com
Hadoop: The Definitive Guide, Third Edition by Tom White Copyright © 2012 Tom White. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: (800) 998-9938 or [email protected]
Editors: Mike Loukides and Meghan Blanchette Production Editor: Rachel Steely Copyeditor: Genevieve d’Entremont Proofreader: Kevin Broccoli May 2012:
Indexer: Kevin Broccoli Cover Designer: Karen Montgomery Interior Designer: David Futato Illustrator: Robert Romano
Third Edition.
Revision History for the Third Edition: 2012-01-27 Early release revision 1 2012-05-07 First release See http://oreilly.com/catalog/errata.csp?isbn=9781449311520 for release details.
Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. Hadoop: The Definitive Guide, the image of an elephant, and related trade dress are trademarks of O’Reilly Media, Inc. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc., was aware of a trademark claim, the designations have been printed in caps or initial caps. While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.
ISBN: 978-1-449-31152-0 [LSI] 1336503003
Table of Contents
Foreword . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii 1. Meet Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Data! Data Storage and Analysis Comparison with Other Systems Rational Database Management System Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop and the Hadoop Ecosystem Hadoop Releases What’s Covered in This Book Compatibility
1 3 4 4 6 8 9 12 13 15 15
2. MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 A Weather Dataset Data Format Analyzing the Data with Unix Tools Analyzing the Data with Hadoop Map and Reduce Java MapReduce Scaling Out Data Flow Combiner Functions Running a Distributed MapReduce Job Hadoop Streaming Ruby Python
17 17 19 20 20 22 30 30 33 36 36 36 39
v
Hadoop Pipes Compiling and Running
40 41
3. The Hadoop Distributed Filesystem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 The Design of HDFS HDFS Concepts Blocks Namenodes and Datanodes HDFS Federation HDFS High-Availability The Command-Line Interface Basic Filesystem Operations Hadoop Filesystems Interfaces The Java Interface Reading Data from a Hadoop URL Reading Data Using the FileSystem API Writing Data Directories Querying the Filesystem Deleting Data Data Flow Anatomy of a File Read Anatomy of a File Write Coherency Model Data Ingest with Flume and Sqoop Parallel Copying with distcp Keeping an HDFS Cluster Balanced Hadoop Archives Using Hadoop Archives Limitations
43 45 45 46 47 48 49 50 52 53 55 55 57 60 62 62 67 67 67 70 72 74 75 76 77 77 79
4. Hadoop I/O . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 Data Integrity Data Integrity in HDFS LocalFileSystem ChecksumFileSystem Compression Codecs Compression and Input Splits Using Compression in MapReduce Serialization The Writable Interface vi | Table of Contents
81 81 82 83 83 85 89 90 93 94
Writable Classes Implementing a Custom Writable Serialization Frameworks Avro Avro Data Types and Schemas In-Memory Serialization and Deserialization Avro Datafiles Interoperability Schema Resolution Sort Order Avro MapReduce Sorting Using Avro MapReduce Avro MapReduce in Other Languages File-Based Data Structures SequenceFile MapFile
96 103 108 110 111 114 117 118 121 123 124 128 130 130 130 137
5. Developing a MapReduce Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 The Configuration API Combining Resources Variable Expansion Setting Up the Development Environment Managing Configuration GenericOptionsParser, Tool, and ToolRunner Writing a Unit Test with MRUnit Mapper Reducer Running Locally on Test Data Running a Job in a Local Job Runner Testing the Driver Running on a Cluster Packaging a Job Launching a Job The MapReduce Web UI Retrieving the Results Debugging a Job Hadoop Logs Remote Debugging Tuning a Job Profiling Tasks MapReduce Workflows Decomposing a Problem into MapReduce Jobs JobControl
144 145 146 146 148 150 154 154 156 157 157 160 161 162 163 165 168 170 175 177 178 179 181 181 183
Table of Contents | vii
Apache Oozie
183
6. How MapReduce Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189 Anatomy of a MapReduce Job Run Classic MapReduce (MapReduce 1) YARN (MapReduce 2) Failures Failures in Classic MapReduce Failures in YARN Job Scheduling The Fair Scheduler The Capacity Scheduler Shuffle and Sort The Map Side The Reduce Side Configuration Tuning Task Execution The Task Execution Environment Speculative Execution Output Committers Task JVM Reuse Skipping Bad Records
189 190 196 202 202 204 206 207 207 208 208 210 211 214 215 215 217 219 220
7. MapReduce Types and Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223 MapReduce Types The Default MapReduce Job Input Formats Input Splits and Records Text Input Binary Input Multiple Inputs Database Input (and Output) Output Formats Text Output Binary Output Multiple Outputs Lazy Output Database Output
223 227 234 234 245 249 250 251 251 252 252 253 257 258
8. MapReduce Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259 Counters Built-in Counters User-Defined Java Counters viii | Table of Contents
259 259 264
User-Defined Streaming Counters Sorting Preparation Partial Sort Total Sort Secondary Sort Joins Map-Side Joins Reduce-Side Joins Side Data Distribution Using the Job Configuration Distributed Cache MapReduce Library Classes
268 268 269 270 274 277 283 284 285 288 288 289 295
9. Setting Up a Hadoop Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297 Cluster Specification Network Topology Cluster Setup and Installation Installing Java Creating a Hadoop User Installing Hadoop Testing the Installation SSH Configuration Hadoop Configuration Configuration Management Environment Settings Important Hadoop Daemon Properties Hadoop Daemon Addresses and Ports Other Hadoop Properties User Account Creation YARN Configuration Important YARN Daemon Properties YARN Daemon Addresses and Ports Security Kerberos and Hadoop Delegation Tokens Other Security Enhancements Benchmarking a Hadoop Cluster Hadoop Benchmarks User Jobs Hadoop in the Cloud Apache Whirr
297 299 301 302 302 302 303 303 304 305 307 311 316 317 320 320 321 324 325 326 328 329 331 331 333 334 334
Table of Contents | ix
10. Administering Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339 HDFS Persistent Data Structures Safe Mode Audit Logging Tools Monitoring Logging Metrics Java Management Extensions Maintenance Routine Administration Procedures Commissioning and Decommissioning Nodes Upgrades
339 339 344 346 347 351 352 352 355 358 358 359 362
11. Pig . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367 Installing and Running Pig Execution Types Running Pig Programs Grunt Pig Latin Editors An Example Generating Examples Comparison with Databases Pig Latin Structure Statements Expressions Types Schemas Functions Macros User-Defined Functions A Filter UDF An Eval UDF A Load UDF Data Processing Operators Loading and Storing Data Filtering Data Grouping and Joining Data Sorting Data Combining and Splitting Data Pig in Practice x | Table of Contents
368 368 370 370 371 371 373 374 375 376 377 381 382 384 388 390 391 391 394 396 399 399 400 402 407 408 409
Parallelism Parameter Substitution
409 410
12. Hive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413 Installing Hive The Hive Shell An Example Running Hive Configuring Hive Hive Services The Metastore Comparison with Traditional Databases Schema on Read Versus Schema on Write Updates, Transactions, and Indexes HiveQL Data Types Operators and Functions Tables Managed Tables and External Tables Partitions and Buckets Storage Formats Importing Data Altering Tables Dropping Tables Querying Data Sorting and Aggregating MapReduce Scripts Joins Subqueries Views User-Defined Functions Writing a UDF Writing a UDAF
414 415 416 417 417 419 421 423 423 424 425 426 428 429 429 431 435 441 443 443 444 444 445 446 449 450 451 452 454
13. HBase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459 HBasics Backdrop Concepts Whirlwind Tour of the Data Model Implementation Installation Test Drive Clients
459 460 460 460 461 464 465 467 Table of Contents | xi
Java Avro, REST, and Thrift Example Schemas Loading Data Web Queries HBase Versus RDBMS Successful Service HBase Use Case: HBase at Streamy.com Praxis Versions HDFS UI Metrics Schema Design Counters Bulk Load
467 470 472 472 473 476 479 480 481 481 483 483 484 485 485 485 486 486
14. ZooKeeper . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 489 Installing and Running ZooKeeper An Example Group Membership in ZooKeeper Creating the Group Joining a Group Listing Members in a Group Deleting a Group The ZooKeeper Service Data Model Operations Implementation Consistency Sessions States Building Applications with ZooKeeper A Configuration Service The Resilient ZooKeeper Application A Lock Service More Distributed Data Structures and Protocols ZooKeeper in Production Resilience and Performance Configuration
xii | Table of Contents
490 492 492 493 495 496 498 499 499 501 506 507 509 511 512 512 515 519 521 522 523 524
15. Sqoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527 Getting Sqoop Sqoop Connectors A Sample Import Text and Binary File Formats Generated Code Additional Serialization Systems Imports: A Deeper Look Controlling the Import Imports and Consistency Direct-mode Imports Working with Imported Data Imported Data and Hive Importing Large Objects Performing an Export Exports: A Deeper Look Exports and Transactionality Exports and SequenceFiles
527 529 529 532 532 533 533 535 536 536 536 537 540 542 543 545 545
16. Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 547 Hadoop Usage at Last.fm Last.fm: The Social Music Revolution Hadoop at Last.fm Generating Charts with Hadoop The Track Statistics Program Summary Hadoop and Hive at Facebook Hadoop at Facebook Hypothetical Use Case Studies Hive Problems and Future Work Nutch Search Engine Data Structures Selected Examples of Hadoop Data Processing in Nutch Summary Log Processing at Rackspace Requirements/The Problem Brief History Choosing Hadoop Collection and Storage MapReduce for Logs Cascading Fields, Tuples, and Pipes
547 547 547 548 549 556 556 556 559 562 566 567 568 571 580 581 581 582 582 582 583 589 590 Table of Contents | xiii
Operations Taps, Schemes, and Flows Cascading in Practice Flexibility Hadoop and Cascading at ShareThis Summary TeraByte Sort on Apache Hadoop Using Pig and Wukong to Explore Billion-edge Network Graphs Measuring Community Everybody’s Talkin’ at Me: The Twitter Reply Graph Symmetric Links Community Extraction
593 594 595 598 599 603 603 607 609 609 612 613
A. Installing Apache Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 617 B. Cloudera’s Distribution Including Apache Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . 623 C. Preparing the NCDC Weather Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 625 Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 629
xiv | Table of Contents
CHAPTER 1
Meet Hadoop
In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers. —Grace Hopper
Data! We live in the data age. It’s not easy to measure the total volume of data stored electronically, but an IDC estimate put the size of the “digital universe” at 0.18 zettabytes in 2006 and is forecasting a tenfold growth by 2011 to 1.8 zettabytes.1 A zettabyte is 1021 bytes, or equivalently one thousand exabytes, one million petabytes, or one billion terabytes. That’s roughly the same order of magnitude as one disk drive for every person in the world. This flood of data is coming from many sources. Consider the following:2 • The New York Stock Exchange generates about one terabyte of new trade data per day. • Facebook hosts approximately 10 billion photos, taking up one petabyte of storage. • Ancestry.com, the genealogy site, stores around 2.5 petabytes of data. • The Internet Archive stores around 2 petabytes of data and is growing at a rate of 20 terabytes per month. • The Large Hadron Collider near Geneva, Switzerland, will produce about 15 petabytes of data per year. 1. From Gantz et al., “The Diverse and Exploding Digital Universe,” March 2008 (http://www.emc.com/ collateral/analyst-reports/diverse-exploding-digital-universe.pdf). 2. http://www.intelligententerprise.com/showArticle.jhtml?articleID=207800705, http://mashable.com/ 2008/10/15/facebook-10-billion-photos/, http://blog.familytreemagazine.com/insider/Inside +Ancestrycoms+TopSecret+Data+Center.aspx, http://www.archive.org/about/faqs.php, and http://www .interactions.org/cms/?pid=1027032
1
So there’s a lot of data out there. But you are probably wondering how it affects you. Most of the data is locked up in the largest web properties (like search engines) or in scientific or financial institutions, isn’t it? Does the advent of “Big Data,” as it is being called, affect smaller organizations or individuals? I argue that it does. Take photos, for example. My wife’s grandfather was an avid photographer and took photographs throughout his adult life. His entire corpus of medium-format, slide, and 35mm film, when scanned in at high resolution, occupies around 10 gigabytes. Compare this to the digital photos that my family took in 2008, which take up about 5 gigabytes of space. My family is producing photographic data at 35 times the rate my wife’s grandfather’s did, and the rate is increasing every year as it becomes easier to take more and more photos. More generally, the digital streams that individuals are producing are growing apace. Microsoft Research’s MyLifeBits project gives a glimpse of the archiving of personal information that may become commonplace in the near future. MyLifeBits was an experiment where an individual’s interactions—phone calls, emails, documents—were captured electronically and stored for later access. The data gathered included a photo taken every minute, which resulted in an overall data volume of one gigabyte per month. When storage costs come down enough to make it feasible to store continuous audio and video, the data volume for a future MyLifeBits service will be many times that. The trend is for every individual’s data footprint to grow, but perhaps more important, the amount of data generated by machines will be even greater than that generated by people. Machine logs, RFID readers, sensor networks, vehicle GPS traces, retail transactions—all of these contribute to the growing mountain of data. The volume of data being made publicly available increases every year, too. Organizations no longer have to merely manage their own data; success in the future will be dictated to a large extent by their ability to extract value from other organizations’ data. Initiatives such as Public Data Sets on Amazon Web Services, Infochimps.org, and theinfo.org exist to foster the “information commons,” where data can be freely (or in the case of AWS, for a modest price) shared for anyone to download and analyze. Mashups between different information sources make for unexpected and hitherto unimaginable applications. Take, for example, the Astrometry.net project, which watches the Astrometry group on Flickr for new photos of the night sky. It analyzes each image and identifies which part of the sky it is from, as well as any interesting celestial bodies, such as stars or galaxies. This project shows the kind of things that are possible when data (in this case, tagged photographic images) is made available and used for something (image analysis) that was not anticipated by the creator. It has been said that “more data usually beats better algorithms,” which is to say that for some problems (such as recommending movies or music based on past preferences),
2 | Chapter 1: Meet Hadoop
however fiendish your algorithms are, often they can be beaten simply by having more data (and a less sophisticated algorithm).3 The good news is that Big Data is here. The bad news is that we are struggling to store and analyze it.
Data Storage and Analysis The problem is simple: although the storage capacities of hard drives have increased massively over the years, access speeds—the rate at which data can be read from drives —have not kept up. One typical drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s,4 so you could read all the data from a full drive in around five minutes. Over 20 years later, one terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk. This is a long time to read all data on a single drive—and writing is even slower. The obvious way to reduce the time is to read from multiple disks at once. Imagine if we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in under two minutes. Using only one hundredth of a disk may seem wasteful. But we can store one hundred datasets, each of which is one terabyte, and provide shared access to them. We can imagine that the users of such a system would be happy to share access in return for shorter analysis times, and, statistically, that their analysis jobs would be likely to be spread over time, so they wouldn’t interfere with each other too much. There’s more to being able to read and write data in parallel to or from multiple disks, though. The first problem to solve is hardware failure: as soon as you start using many pieces of hardware, the chance that one will fail is fairly high. A common way of avoiding data loss is through replication: redundant copies of the data are kept by the system so that in the event of failure, there is another copy available. This is how RAID works, for instance, although Hadoop’s filesystem, the Hadoop Distributed Filesystem (HDFS), takes a slightly different approach, as you shall see later. The second problem is that most analysis tasks need to be able to combine the data in some way, and data read from one disk may need to be combined with the data from any of the other 99 disks. Various distributed systems allow data to be combined from multiple sources, but doing this correctly is notoriously challenging. MapReduce provides a programming model that abstracts the problem from disk reads and writes, 3. The quote is from Anand Rajaraman writing about the Netflix Challenge (http://anand.typepad.com/ datawocky/2008/03/more-data-usual.html). Alon Halevy, Peter Norvig, and Fernando Pereira make the same point in “The Unreasonable Effectiveness of Data,” IEEE Intelligent Systems, March/April 2009. 4. These specifications are for the Seagate ST-41600n.
Data Storage and Analysis | 3
transforming it into a computation over sets of keys and values. We look at the details of this model in later chapters, but the important point for the present discussion is that there are two parts to the computation, the map and the reduce, and it’s the interface between the two where the “mixing” occurs. Like HDFS, MapReduce has builtin reliability. This, in a nutshell, is what Hadoop provides: a reliable shared storage and analysis system. The storage is provided by HDFS and analysis by MapReduce. There are other parts to Hadoop, but these capabilities are its kernel.
Comparison with Other Systems The approach taken by MapReduce may seem like a brute-force approach. The premise is that the entire dataset—or at least a good portion of it—is processed for each query. But this is its power. MapReduce is a batch query processor, and the ability to run an ad hoc query against your whole dataset and get the results in a reasonable time is transformative. It changes the way you think about data and unlocks data that was previously archived on tape or disk. It gives people the opportunity to innovate with data. Questions that took too long to get answered before can now be answered, which in turn leads to new questions and new insights. For example, Mailtrust, Rackspace’s mail division, used Hadoop for processing email logs. One ad hoc query they wrote was to find the geographic distribution of their users. In their words: This data was so useful that we’ve scheduled the MapReduce job to run monthly and we will be using this data to help us decide which Rackspace data centers to place new mail servers in as we grow.
By bringing several hundred gigabytes of data together and having the tools to analyze it, the Rackspace engineers were able to gain an understanding of the data that they otherwise would never have had, and, furthermore, they were able to use what they had learned to improve the service for their customers. You can read more about how Rackspace uses Hadoop in Chapter 16.
Rational Database Management System Why can’t we use databases with lots of disks to do large-scale batch analysis? Why is MapReduce needed?
4 | Chapter 1: Meet Hadoop
The answer to these questions comes from another trend in disk drives: seek time is improving more slowly than transfer rate. Seeking is the process of moving the disk’s head to a particular place on the disk to read or write data. It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk’s bandwidth. If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than streaming through it, which operates at the transfer rate. On the other hand, for updating a small proportion of records in a database, a traditional B-Tree (the data structure used in relational databases, which is limited by the rate it can perform seeks) works well. For updating the majority of a database, a B-Tree is less efficient than MapReduce, which uses Sort/Merge to rebuild the database. In many ways, MapReduce can be seen as a complement to a Rational Database Management System (RDBMS). (The differences between the two systems are shown in Table 1-1.) MapReduce is a good fit for problems that need to analyze the whole dataset in a batch fashion, particularly for ad hoc analysis. An RDBMS is good for point queries or updates, where the dataset has been indexed to deliver low-latency retrieval and update times of a relatively small amount of data. MapReduce suits applications where the data is written once and read many times, whereas a relational database is good for datasets that are continually updated. Table 1-1. RDBMS compared to MapReduce Traditional RDBMS
MapReduce
Data size
Gigabytes
Petabytes
Access
Interactive and batch
Batch
Updates
Read and write many times
Write once, read many times
Structure
Static schema
Dynamic schema
Integrity
High
Low
Scaling
Nonlinear
Linear
Another difference between MapReduce and an RDBMS is the amount of structure in the datasets on which they operate. Structured data is data that is organized into entities that have a defined format, such as XML documents or database tables that conform to a particular predefined schema. This is the realm of the RDBMS. Semi-structured data, on the other hand, is looser, and though there may be a schema, it is often ignored, so it may be used only as a guide to the structure of the data: for example, a spreadsheet, in which the structure is the grid of cells, although the cells themselves may hold any form of data. Unstructured data does not have any particular internal structure: for example, plain text or image data. MapReduce works well on unstructured or semistructured data because it is designed to interpret the data at processing time. In other words, the input keys and values for MapReduce are not intrinsic properties of the data, but they are chosen by the person analyzing the data.
Comparison with Other Systems | 5
Relational data is often normalized to retain its integrity and remove redundancy. Normalization poses problems for MapReduce because it makes reading a record a nonlocal operation, and one of the central assumptions that MapReduce makes is that it is possible to perform (high-speed) streaming reads and writes. A web server log is a good example of a set of records that is not normalized (for example, the client hostnames are specified in full each time, even though the same client may appear many times), and this is one reason that logfiles of all kinds are particularly well-suited to analysis with MapReduce. MapReduce is a linearly scalable programming model. The programmer writes two functions—a map function and a reduce function—each of which defines a mapping from one set of key-value pairs to another. These functions are oblivious to the size of the data or the cluster that they are operating on, so they can be used unchanged for a small dataset and for a massive one. More important, if you double the size of the input data, a job will run twice as slow. But if you also double the size of the cluster, a job will run as fast as the original one. This is not generally true of SQL queries. Over time, however, the differences between relational databases and MapReduce systems are likely to blur—both as relational databases start incorporating some of the ideas from MapReduce (such as Aster Data’s and Greenplum’s databases) and, from the other direction, as higher-level query languages built on MapReduce (such as Pig and Hive) make MapReduce systems more approachable for traditional database programmers.5
Grid Computing The High Performance Computing (HPC) and Grid Computing communities have been doing large-scale data processing for years, using such Application Program Interfaces (APIs) as Message Passing Interface (MPI). Broadly, the approach in HPC is to distribute the work across a cluster of machines, which access a shared filesystem, hosted by a Storage Area Network (SAN). This works well for predominantly computeintensive jobs, but it becomes a problem when nodes need to access larger data volumes (hundreds of gigabytes, the point at which MapReduce really starts to shine), since the network bandwidth is the bottleneck and compute nodes become idle.
5. In January 2007, David J. DeWitt and Michael Stonebraker caused a stir by publishing “MapReduce: A major step backwards,” in which they criticized MapReduce for being a poor substitute for relational databases. Many commentators argued that it was a false comparison (see, for example, Mark C. ChuCarroll’s “Databases are hammers; MapReduce is a screwdriver,” and DeWitt and Stonebraker followed up with “MapReduce II,” where they addressed the main topics brought up by others.
6 | Chapter 1: Meet Hadoop
MapReduce tries to collocate the data with the compute node, so data access is fast because it is local.6 This feature, known as data locality, is at the heart of MapReduce and is the reason for its good performance. Recognizing that network bandwidth is the most precious resource in a data center environment (it is easy to saturate network links by copying data around), MapReduce implementations go to great lengths to conserve it by explicitly modelling network topology. Notice that this arrangement does not preclude high-CPU analyses in MapReduce. MPI gives great control to the programmer, but requires that he explicitly handle the mechanics of the data flow, exposed via low-level C routines and constructs such as sockets, as well as the higher-level algorithm for the analysis. MapReduce operates only at the higher level: the programmer thinks in terms of functions of key and value pairs, and the data flow is implicit. Coordinating the processes in a large-scale distributed computation is a challenge. The hardest aspect is gracefully handling partial failure—when you don’t know whether or not a remote process has failed—and still making progress with the overall computation. MapReduce spares the programmer from having to think about failure, since the implementation detects failed map or reduce tasks and reschedules replacements on machines that are healthy. MapReduce is able to do this because it is a shared-nothing architecture, meaning that tasks have no dependence on one other. (This is a slight oversimplification, since the output from mappers is fed to the reducers, but this is under the control of the MapReduce system; in this case, it needs to take more care rerunning a failed reducer than rerunning a failed map because it has to make sure it can retrieve the necessary map outputs, and if not, regenerate them by running the relevant maps again.) So from the programmer’s point of view, the order in which the tasks run doesn’t matter. By contrast, MPI programs have to explicitly manage their own checkpointing and recovery, which gives more control to the programmer but makes them more difficult to write. MapReduce might sound like quite a restrictive programming model, and in a sense it is: you are limited to key and value types that are related in specified ways, and mappers and reducers run with very limited coordination between one another (the mappers pass keys and values to reducers). A natural question to ask is: can you do anything useful or nontrivial with it? The answer is yes. MapReduce was invented by engineers at Google as a system for building production search indexes because they found themselves solving the same problem over and over again (and MapReduce was inspired by older ideas from the functional programming, distributed computing, and database communities), but it has since been used for many other applications in many other industries. It is pleasantly surprising to see the range of algorithms that can be expressed in MapReduce, from
6. Jim Gray was an early advocate of putting the computation near the data. See “Distributed Computing Economics,” March 2003, http://research.microsoft.com/apps/pubs/default.aspx?id=70001.
Comparison with Other Systems | 7
image analysis, to graph-based problems, to machine learning algorithms.7 It can’t solve every problem, of course, but it is a general data-processing tool. You can see a sample of some of the applications that Hadoop has been used for in Chapter 16.
Volunteer Computing When people first hear about Hadoop and MapReduce, they often ask, “How is it different from [email protected]?” SETI, the Search for Extra-Terrestrial Intelligence, runs a project called [email protected] in which volunteers donate CPU time from their otherwise idle computers to analyze radio telescope data for signs of intelligent life outside earth. [email protected] is the most well-known of many volunteer computing projects; others include the Great Internet Mersenne Prime Search (to search for large prime numbers) and [email protected] (to understand protein folding and how it relates to disease). Volunteer computing projects work by breaking the problem they are trying to solve into chunks called work units, which are sent to computers around the world to be analyzed. For example, a [email protected] work unit is about 0.35 MB of radio telescope data, and takes hours or days to analyze on a typical home computer. When the analysis is completed, the results are sent back to the server, and the client gets another work unit. As a precaution to combat cheating, each work unit is sent to three different machines and needs at least two results to agree to be accepted. Although [email protected] may be superficially similar to MapReduce (breaking a problem into independent pieces to be worked on in parallel), there are some significant differences. The [email protected] problem is very CPU-intensive, which makes it suitable for running on hundreds of thousands of computers across the world8 because the time to transfer the work unit is dwarfed by the time to run the computation on it. Volunteers are donating CPU cycles, not bandwidth. MapReduce is designed to run jobs that last minutes or hours on trusted, dedicated hardware running in a single data center with very high aggregate bandwidth interconnects. By contrast, [email protected] runs a perpetual computation on untrusted machines on the Internet with highly variable connection speeds and no data locality.
7. Apache Mahout (http://mahout.apache.org/) is a project to build machine-learning libraries (such as classification and clustering algorithms) that run on Hadoop. 8. In January 2008, [email protected] was reported at http://www.planetary.org/programs/projects/setiathome/ setiathome_20080115.html to be processing 300 gigabytes a day, using 320,000 computers (most of which are not dedicated to [email protected]; they are used for other things, too).
8 | Chapter 1: Meet Hadoop
A Brief History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project.
The Origin of the Name “Hadoop” The name Hadoop is not an acronym; it’s a made-up name. The project’s creator, Doug Cutting, explains how the name came about: The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid’s term.
Subprojects and “contrib” modules in Hadoop also tend to have names that are unrelated to their function, often with an elephant or other animal theme (“Pig,” for example). Smaller components are given more descriptive (and therefore more mundane) names. This is a good principle, as it means you can generally work out what something does from its name. For example, the jobtracker9 keeps track of MapReduce jobs.
Building a web search engine from scratch was an ambitious goal, for not only is the software required to crawl and index websites complex to write, but it is also a challenge to run without a dedicated operations team, since there are so many moving parts. It’s expensive, too: Mike Cafarella and Doug Cutting estimated a system supporting a one-billion-page index would cost around half a million dollars in hardware, with a monthly running cost of $30,000.10 Nevertheless, they believed it was a worthy goal, as it would open up and ultimately democratize search engine algorithms. Nutch was started in 2002, and a working crawler and search system quickly emerged. However, they realized that their architecture wouldn’t scale to the billions of pages on the Web. Help was at hand with the publication of a paper in 2003 that described the architecture of Google’s distributed filesystem, called GFS, which was being used in production at Google.11 GFS, or something like it, would solve their storage needs for the very large files generated as a part of the web crawl and indexing process. In particular, GFS would free up time being spent on administrative tasks such as managing storage nodes. In 2004, they set about writing an open source implementation, the Nutch Distributed Filesystem (NDFS). 9. In this book, we use the lowercase form, “jobtracker,” to denote the entity when it’s being referred to generally, and the CamelCase form JobTracker to denote the Java class that implements it. 10. See Mike Cafarella and Doug Cutting, “Building Nutch: Open Source Search,” ACM Queue, April 2004, http://queue.acm.org/detail.cfm?id=988408. 11. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, “The Google File System,” October 2003, http://labs.google.com/papers/gfs.html.
A Brief History of Hadoop | 9
In 2004, Google published the paper that introduced MapReduce to the world.12 Early in 2005, the Nutch developers had a working MapReduce implementation in Nutch, and by the middle of that year. All the major Nutch algorithms had been ported to run using MapReduce and NDFS. NDFS and the MapReduce implementation in Nutch were applicable beyond the realm of search, and in February 2006 they moved out of Nutch to form an independent subproject of Lucene called Hadoop. At around the same time, Doug Cutting joined Yahoo!, which provided a dedicated team and the resources to turn Hadoop into a system that ran at web scale (see the sidebar “Hadoop at Yahoo!” on page 11). This was demonstrated in February 2008 when Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster.13 In January 2008, Hadoop was made its own top-level project at Apache, confirming its success and its diverse, active community. By this time, Hadoop was being used by many other companies besides Yahoo!, such as Last.fm, Facebook, and the New York Times. Some applications are covered in the case studies in Chapter 16 and on the Hadoop wiki. In one well-publicized feat, the New York Times used Amazon’s EC2 compute cloud to crunch through four terabytes of scanned archives from the paper, converting them to PDFs for the Web.14 The processing took less than 24 hours to run using 100 machines, and the project probably wouldn’t have been embarked upon without the combination of Amazon’s pay-by-the-hour model (which allowed the NYT to access a large number of machines for a short period) and Hadoop’s easy-to-use parallel programming model. In April 2008, Hadoop broke a world record to become the fastest system to sort a terabyte of data. Running on a 910-node cluster, Hadoop sorted one terabyte in 209 seconds (just under 3½ minutes), beating the previous year’s winner of 297 seconds (described in detail in “TeraByte Sort on Apache Hadoop” on page 603). In November of the same year, Google reported that its MapReduce implementation sorted one terabyte in 68 seconds.15 As the first edition of this book was going to press (May 2009), it was announced that a team at Yahoo! used Hadoop to sort one terabyte in 62 seconds. Since then, Hadoop has seen rapid mainstream enterprise adoption. Hadoop’s role as a general-purpose storage and analysis platform for big data has been recognized by
12. Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters ,” December 2004, http://labs.google.com/papers/mapreduce.html. 13. “Yahoo! Launches World’s Largest Hadoop Production Application,” 19 February 2008, http://developer .yahoo.net/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html. 14. Derek Gottfrid, “Self-service, Prorated Super Computing Fun!” 1 November 2007, http://open.blogs .nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/. 15. “Sorting 1PB with MapReduce,” 21 November 2008, http://googleblog.blogspot.com/2008/11/sorting-1pb -with-mapreduce.html.
10 | Chapter 1: Meet Hadoop
the industry, and this fact is reflected in the number of products that use or incorporate Hadoop in some way. There are Hadoop distributions from the large, established enterprise vendors, including EMC, IBM, Microsoft, and Oracle, as well as from specialist Hadoop companies such as Cloudera, Hortonworks, and MapR.
Hadoop at Yahoo! Building Internet-scale search engines requires huge amounts of data and therefore large numbers of machines to process it. Yahoo! Search consists of four primary components: the Crawler, which downloads pages from web servers; the WebMap, which builds a graph of the known Web; the Indexer, which builds a reverse index to the best pages; and the Runtime, which answers users’ queries. The WebMap is a graph that consists of roughly 1 trillion (1012) edges, each representing a web link, and 100 billion (1011) nodes, each representing distinct URLs. Creating and analyzing such a large graph requires a large number of computers running for many days. In early 2005, the infrastructure for the WebMap, named Dreadnaught, needed to be redesigned to scale up to more nodes. Dreadnaught had successfully scaled from 20 to 600 nodes, but required a complete redesign to scale out further. Dreadnaught is similar to MapReduce in many ways, but provides more flexibility and less structure. In particular, each fragment in a Dreadnaught job can send output to each of the fragments in the next stage of the job, but the sort was all done in library code. In practice, most of the WebMap phases were pairs that corresponded to MapReduce. Therefore, the WebMap applications would not require extensive refactoring to fit into MapReduce. Eric Baldeschwieler (Eric14) created a small team and we started designing and prototyping a new framework written in C++ modeled after GFS and MapReduce to replace Dreadnaught. Although the immediate need was for a new framework for WebMap, it was clear that standardization of the batch platform across Yahoo! Search was critical and by making the framework general enough to support other users, we could better leverage investment in the new platform. At the same time, we were watching Hadoop, which was part of Nutch, and its progress. In January 2006, Yahoo! hired Doug Cutting, and a month later we decided to abandon our prototype and adopt Hadoop. The advantage of Hadoop over our prototype and design was that it was already working with a real application (Nutch) on 20 nodes. That allowed us to bring up a research cluster two months later and start helping real customers use the new framework much sooner than we could have otherwise. Another advantage, of course, was that since Hadoop was already open source, it was easier (although far from easy!) to get permission from Yahoo!’s legal department to work in open source. So we set up a 200-node cluster for the researchers in early 2006 and put the WebMap conversion plans on hold while we supported and improved Hadoop for the research users. Here’s a quick timeline of how things have progressed: • 2004: Initial versions of what is now Hadoop Distributed Filesystem and MapReduce implemented by Doug Cutting and Mike Cafarella.
A Brief History of Hadoop | 11
• December 2005: Nutch ported to the new framework. Hadoop runs reliably on 20 nodes. • January 2006: Doug Cutting joins Yahoo!. • February 2006: Apache Hadoop project officially started to support the standalone development of MapReduce and HDFS. • February 2006: Adoption of Hadoop by Yahoo! Grid team. • April 2006: Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours. • May 2006: Yahoo! set up a Hadoop research cluster—300 nodes. • May 2006: Sort benchmark run on 500 nodes in 42 hours (better hardware than April benchmark). • October 2006: Research cluster reaches 600 nodes. • December 2006: Sort benchmark run on 20 nodes in 1.8 hours, 100 nodes in 3.3 hours, 500 nodes in 5.2 hours, 900 nodes in 7.8 hours. • January 2007: Research cluster reaches 900 nodes. • April 2007: Research clusters—two clusters of 1000 nodes. • April 2008: Won the 1-terabyte sort benchmark in 209 seconds on 900 nodes. • October 2008: Loading 10 terabytes of data per day onto research clusters. • March 2009: 17 clusters with a total of 24,000 nodes. • April 2009: Won the minute sort by sorting 500 GB in 59 seconds (on 1,400 nodes) and the 100-terabyte sort in 173 minutes (on 3,400 nodes). —Owen O’Malley
Apache Hadoop and the Hadoop Ecosystem Although Hadoop is best known for MapReduce and its distributed filesystem (HDFS, renamed from NDFS), the term is also used for a family of related projects that fall under the umbrella of infrastructure for distributed computing and large-scale data processing. All of the core projects covered in this book are hosted by the Apache Software Foundation, which provides support for a community of open source software projects, including the original HTTP Server from which it gets its name. As the Hadoop ecosystem grows, more projects are appearing, not necessarily hosted at Apache, that provide complementary services to Hadoop or build on the core to add higher-level abstractions. The Hadoop projects that are covered in this book are described briefly here: Common A set of components and interfaces for distributed filesystems and general I/O (serialization, Java RPC, persistent data structures).
12 | Chapter 1: Meet Hadoop
Avro A serialization system for efficient, cross-language RPC and persistent data storage. MapReduce A distributed data processing model and execution environment that runs on large clusters of commodity machines. HDFS A distributed filesystem that runs on large clusters of commodity machines. Pig A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters. Hive A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (and which is translated by the runtime engine to MapReduce jobs) for querying the data. HBase A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads). ZooKeeper A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications. Sqoop A tool for efficient bulk transfer of data between structured data stores (such as relational databases) and HDFS. Oozie A service for running and scheduling workflows of Hadoop jobs (including MapReduce, Pig, Hive, and Sqoop jobs).
Hadoop Releases Which version of Hadoop should you use? The answer to this question changes over time, of course, and also depends on the features that you need. “Hadoop Releases” on page 13 summarizes the high-level features in recent Hadoop release series. There are a few active release series. The 1.x release series is a continuation of the 0.20 release series and contains the most stable versions of Hadoop currently available. This series includes secure Kerberos authentication, which prevents unauthorized access to Hadoop data (see “Security” on page 325). Almost all production clusters use these releases or derived versions (such as commercial distributions).
Hadoop Releases | 13
The 0.22 and 2.x release series16 are not currently stable (as of early 2012), but this is likely to change by the time you read this as they undergo more real-world testing (consult the Apache Hadoop releases page for the latest status). 2.x includes several major new features: • A new MapReduce runtime, called MapReduce 2, implemented on a new system called YARN (Yet Another Resource Negotiator), which is a general resource management system for running distributed applications. MapReduce 2 replaces the “classic” runtime in previous releases. It is described in more depth in “YARN (MapReduce 2)” on page 196. • HDFS federation, which partitions the HDFS namespace across multiple namenodes to support clusters with very large numbers of files. See “HDFS Federation” on page 47. • HDFS high-availability, which removes the namenode as a single point of failure by supporting standby namenodes for failover. See “HDFS High-Availability” on page 48. Table 1-2. Features supported by Hadoop release series Feature
1.x
0.22
2.x
Secure authentication
Yes
No
Yes
Old configuration names
Yes
Deprecated
Deprecated
New configuration names
No
Yes
Yes
Old MapReduce API
Yes
Yes
Yes
New MapReduce API
Yes (with some missing libraries)
Yes
Yes
MapReduce 1 runtime (Classic)
Yes
Yes
No
MapReduce 2 runtime (YARN)
No
No
Yes
HDFS federation
No
No
Yes
HDFS high-availability
No
No
Yes
Table 1-2 only covers features in HDFS and MapReduce. Other projects in the Hadoop ecosystem are continually evolving too, and picking a combination of components that work well together can be a challenge. Thankfully, you don’t have to do this work yourself. The Apache Bigtop project (http://incubator.apache.org/bigtop/) runs interoperability tests on stacks of Hadoop components and provides Linux packages (RPMs and Debian packages) for easy installation. There are also commercial vendors offering Hadoop distributions containing suites of compatible components.
16. As this book went to press, the Hadoop community voted for the 0.23 release series to be renamed as the 2.x release series. This book uses the shorthand “releases after 1.x” to refer to releases in the 0.22 and 2.x (formerly 0.23) series.
14 | Chapter 1: Meet Hadoop
What’s Covered in This Book This book covers all the releases in Table 1-2. In the cases where a feature is available only in a particular release, it is noted in the text. The code in this book is written to work against all these release series, except in a small number of cases, which are called out explicitly. The example code available on the website has a list of the versions that it was tested against.
Configuration names Configuration property names have been changed in the releases after 1.x, in order to give them a more regular naming structure. For example, the HDFS properties pertaining to the namenode have been changed to have a dfs.namenode prefix, so dfs.name.dir has changed to dfs.namenode.name.dir. Similarly, MapReduce properties have the mapreduce prefix, rather than the older mapred prefix, so mapred.job.name has changed to mapreduce.job.name. For properties that exist in version 1.x, the old (deprecated) names are used in this book because they will work in all the versions of Hadoop listed here. If you are using a release after 1.x, you may wish to use the new property names in your configuration files and code to remove deprecation warnings. A table listing the deprecated properties names and their replacements can be found on the Hadoop website at http://hadoop.apache .org/common/docs/r0.23.0/hadoop-project-dist/hadoop-common/DeprecatedProperties .html.
MapReduce APIs Hadoop provides two Java MapReduce APIs, described in more detail in “The old and the new Java MapReduce APIs” on page 27. This edition of the book uses the new API for the examples, which will work with all versions listed here, except in a few cases where a MapReduce library using new API is not available in the 1.x releases. All the examples in this book are available in the old API version (in the oldapi package) from the book’s website. Where there are material differences between the two APIs, they are discussed in the text.
Compatibility When moving from one release to another, you need to consider the upgrade steps that are needed. There are several aspects to consider: API compatibility, data compatibility, and wire compatibility. API compatibility concerns the contract between user code and the published Hadoop APIs, such as the Java MapReduce APIs. Major releases (e.g., from 1.x.y to 2.0.0) are allowed to break API compatibility, so user programs may need to be modified and
Hadoop Releases | 15
recompiled. Minor releases (e.g., from 1.0.x to 1.1.0) and point releases (e.g., from 1.0.1 to 1.0.2) should not break compatibility.17 Hadoop uses a classification scheme for API elements to denote their stability. The preceding rules for API compatibility cover those elements that are marked InterfaceStability.Stable. Some elements of the public Hadoop APIs, however, are marked with the InterfaceStabil ity.Evolving or InterfaceStability.Unstable annotations (all these annotations are in the org.apache.hadoop.classification package), which mean they are allowed to break compatibility on minor and point releases, respectively.
Data compatibility concerns persistent data and metadata formats, such as the format in which the HDFS namenode stores its persistent data. The formats can change across minor or major releases, but the change is transparent to users because the upgrade will automatically migrate the data. There may be some restrictions about upgrade paths, and these are covered in the release notes. For example, it may be necessary to upgrade via an intermediate release rather than upgrading directly to the later final release in one step. Hadoop upgrades are discussed in more detail in “Upgrades” on page 362. Wire compatibility concerns the interoperability between clients and servers via wire protocols such as RPC and HTTP. There are two types of client: external clients (run by users) and internal clients (run on the cluster as a part of the system, e.g., datanode and tasktracker daemons). In general, internal clients have to be upgraded in lockstep; an older version of a tasktracker will not work with a newer jobtracker, for example. In the future, rolling upgrades may be supported, which would allow cluster daemons to be upgraded in phases, so that the cluster would still be available to external clients during the upgrade. For external clients that are run by the user—such as a program that reads or writes from HDFS, or the MapReduce job submission client—the client must have the same major release number as the server, but is allowed to have a lower minor or point release number (e.g., client version 1.0.1 will work with server 1.0.2 or 1.1.0, but not with server 2.0.0). Any exception to this rule should be called out in the release notes.
17. Pre-1.0 releases follow the rules for major releases, so a change in version from, say, 0.1.0 to 0.2.0 constitutes a major release, and therefore may break API compatibility.
16 | Chapter 1: Meet Hadoop
O’Reilly Ebooks—Your bookshelf on your devices!
When you buy an ebook through oreilly.com you get lifetime access to the book, and whenever possible we provide it to you in five, DRM-free file formats—PDF, .epub, Kindle-compatible .mobi, Android .apk, and DAISY—that you can use on the devices of your choice. Our ebook files are fully searchable, and you can cut-and-paste and print them. We also alert you when we’ve updated the files with corrections and additions.
Learn more at ebooks.oreilly.com You can also purchase O’Reilly ebooks through the iBookstore, the Android Marketplace, and Amazon.com.
Spreading the knowledge of innovators
oreilly.com
It’s also available at your favorite book retailer, including the iBookstore, the Android Marketplace, and Amazon.com.
Spreading the knowledge of innovators
oreilly.com
Hadoop: The Definitive Guide, Third Edition by Tom White Copyright © 2012 Tom White. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: (800) 998-9938 or [email protected]
Editors: Mike Loukides and Meghan Blanchette Production Editor: Rachel Steely Copyeditor: Genevieve d’Entremont Proofreader: Kevin Broccoli May 2012:
Indexer: Kevin Broccoli Cover Designer: Karen Montgomery Interior Designer: David Futato Illustrator: Robert Romano
Third Edition.
Revision History for the Third Edition: 2012-01-27 Early release revision 1 2012-05-07 First release See http://oreilly.com/catalog/errata.csp?isbn=9781449311520 for release details.
Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. Hadoop: The Definitive Guide, the image of an elephant, and related trade dress are trademarks of O’Reilly Media, Inc. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc., was aware of a trademark claim, the designations have been printed in caps or initial caps. While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.
ISBN: 978-1-449-31152-0 [LSI] 1336503003
Table of Contents
Foreword . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii 1. Meet Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Data! Data Storage and Analysis Comparison with Other Systems Rational Database Management System Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop and the Hadoop Ecosystem Hadoop Releases What’s Covered in This Book Compatibility
1 3 4 4 6 8 9 12 13 15 15
2. MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 A Weather Dataset Data Format Analyzing the Data with Unix Tools Analyzing the Data with Hadoop Map and Reduce Java MapReduce Scaling Out Data Flow Combiner Functions Running a Distributed MapReduce Job Hadoop Streaming Ruby Python
17 17 19 20 20 22 30 30 33 36 36 36 39
v
Hadoop Pipes Compiling and Running
40 41
3. The Hadoop Distributed Filesystem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 The Design of HDFS HDFS Concepts Blocks Namenodes and Datanodes HDFS Federation HDFS High-Availability The Command-Line Interface Basic Filesystem Operations Hadoop Filesystems Interfaces The Java Interface Reading Data from a Hadoop URL Reading Data Using the FileSystem API Writing Data Directories Querying the Filesystem Deleting Data Data Flow Anatomy of a File Read Anatomy of a File Write Coherency Model Data Ingest with Flume and Sqoop Parallel Copying with distcp Keeping an HDFS Cluster Balanced Hadoop Archives Using Hadoop Archives Limitations
43 45 45 46 47 48 49 50 52 53 55 55 57 60 62 62 67 67 67 70 72 74 75 76 77 77 79
4. Hadoop I/O . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 Data Integrity Data Integrity in HDFS LocalFileSystem ChecksumFileSystem Compression Codecs Compression and Input Splits Using Compression in MapReduce Serialization The Writable Interface vi | Table of Contents
81 81 82 83 83 85 89 90 93 94
Writable Classes Implementing a Custom Writable Serialization Frameworks Avro Avro Data Types and Schemas In-Memory Serialization and Deserialization Avro Datafiles Interoperability Schema Resolution Sort Order Avro MapReduce Sorting Using Avro MapReduce Avro MapReduce in Other Languages File-Based Data Structures SequenceFile MapFile
96 103 108 110 111 114 117 118 121 123 124 128 130 130 130 137
5. Developing a MapReduce Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 The Configuration API Combining Resources Variable Expansion Setting Up the Development Environment Managing Configuration GenericOptionsParser, Tool, and ToolRunner Writing a Unit Test with MRUnit Mapper Reducer Running Locally on Test Data Running a Job in a Local Job Runner Testing the Driver Running on a Cluster Packaging a Job Launching a Job The MapReduce Web UI Retrieving the Results Debugging a Job Hadoop Logs Remote Debugging Tuning a Job Profiling Tasks MapReduce Workflows Decomposing a Problem into MapReduce Jobs JobControl
144 145 146 146 148 150 154 154 156 157 157 160 161 162 163 165 168 170 175 177 178 179 181 181 183
Table of Contents | vii
Apache Oozie
183
6. How MapReduce Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189 Anatomy of a MapReduce Job Run Classic MapReduce (MapReduce 1) YARN (MapReduce 2) Failures Failures in Classic MapReduce Failures in YARN Job Scheduling The Fair Scheduler The Capacity Scheduler Shuffle and Sort The Map Side The Reduce Side Configuration Tuning Task Execution The Task Execution Environment Speculative Execution Output Committers Task JVM Reuse Skipping Bad Records
189 190 196 202 202 204 206 207 207 208 208 210 211 214 215 215 217 219 220
7. MapReduce Types and Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223 MapReduce Types The Default MapReduce Job Input Formats Input Splits and Records Text Input Binary Input Multiple Inputs Database Input (and Output) Output Formats Text Output Binary Output Multiple Outputs Lazy Output Database Output
223 227 234 234 245 249 250 251 251 252 252 253 257 258
8. MapReduce Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259 Counters Built-in Counters User-Defined Java Counters viii | Table of Contents
259 259 264
User-Defined Streaming Counters Sorting Preparation Partial Sort Total Sort Secondary Sort Joins Map-Side Joins Reduce-Side Joins Side Data Distribution Using the Job Configuration Distributed Cache MapReduce Library Classes
268 268 269 270 274 277 283 284 285 288 288 289 295
9. Setting Up a Hadoop Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297 Cluster Specification Network Topology Cluster Setup and Installation Installing Java Creating a Hadoop User Installing Hadoop Testing the Installation SSH Configuration Hadoop Configuration Configuration Management Environment Settings Important Hadoop Daemon Properties Hadoop Daemon Addresses and Ports Other Hadoop Properties User Account Creation YARN Configuration Important YARN Daemon Properties YARN Daemon Addresses and Ports Security Kerberos and Hadoop Delegation Tokens Other Security Enhancements Benchmarking a Hadoop Cluster Hadoop Benchmarks User Jobs Hadoop in the Cloud Apache Whirr
297 299 301 302 302 302 303 303 304 305 307 311 316 317 320 320 321 324 325 326 328 329 331 331 333 334 334
Table of Contents | ix
10. Administering Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339 HDFS Persistent Data Structures Safe Mode Audit Logging Tools Monitoring Logging Metrics Java Management Extensions Maintenance Routine Administration Procedures Commissioning and Decommissioning Nodes Upgrades
339 339 344 346 347 351 352 352 355 358 358 359 362
11. Pig . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367 Installing and Running Pig Execution Types Running Pig Programs Grunt Pig Latin Editors An Example Generating Examples Comparison with Databases Pig Latin Structure Statements Expressions Types Schemas Functions Macros User-Defined Functions A Filter UDF An Eval UDF A Load UDF Data Processing Operators Loading and Storing Data Filtering Data Grouping and Joining Data Sorting Data Combining and Splitting Data Pig in Practice x | Table of Contents
368 368 370 370 371 371 373 374 375 376 377 381 382 384 388 390 391 391 394 396 399 399 400 402 407 408 409
Parallelism Parameter Substitution
409 410
12. Hive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413 Installing Hive The Hive Shell An Example Running Hive Configuring Hive Hive Services The Metastore Comparison with Traditional Databases Schema on Read Versus Schema on Write Updates, Transactions, and Indexes HiveQL Data Types Operators and Functions Tables Managed Tables and External Tables Partitions and Buckets Storage Formats Importing Data Altering Tables Dropping Tables Querying Data Sorting and Aggregating MapReduce Scripts Joins Subqueries Views User-Defined Functions Writing a UDF Writing a UDAF
414 415 416 417 417 419 421 423 423 424 425 426 428 429 429 431 435 441 443 443 444 444 445 446 449 450 451 452 454
13. HBase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459 HBasics Backdrop Concepts Whirlwind Tour of the Data Model Implementation Installation Test Drive Clients
459 460 460 460 461 464 465 467 Table of Contents | xi
Java Avro, REST, and Thrift Example Schemas Loading Data Web Queries HBase Versus RDBMS Successful Service HBase Use Case: HBase at Streamy.com Praxis Versions HDFS UI Metrics Schema Design Counters Bulk Load
467 470 472 472 473 476 479 480 481 481 483 483 484 485 485 485 486 486
14. ZooKeeper . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 489 Installing and Running ZooKeeper An Example Group Membership in ZooKeeper Creating the Group Joining a Group Listing Members in a Group Deleting a Group The ZooKeeper Service Data Model Operations Implementation Consistency Sessions States Building Applications with ZooKeeper A Configuration Service The Resilient ZooKeeper Application A Lock Service More Distributed Data Structures and Protocols ZooKeeper in Production Resilience and Performance Configuration
xii | Table of Contents
490 492 492 493 495 496 498 499 499 501 506 507 509 511 512 512 515 519 521 522 523 524
15. Sqoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527 Getting Sqoop Sqoop Connectors A Sample Import Text and Binary File Formats Generated Code Additional Serialization Systems Imports: A Deeper Look Controlling the Import Imports and Consistency Direct-mode Imports Working with Imported Data Imported Data and Hive Importing Large Objects Performing an Export Exports: A Deeper Look Exports and Transactionality Exports and SequenceFiles
527 529 529 532 532 533 533 535 536 536 536 537 540 542 543 545 545
16. Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 547 Hadoop Usage at Last.fm Last.fm: The Social Music Revolution Hadoop at Last.fm Generating Charts with Hadoop The Track Statistics Program Summary Hadoop and Hive at Facebook Hadoop at Facebook Hypothetical Use Case Studies Hive Problems and Future Work Nutch Search Engine Data Structures Selected Examples of Hadoop Data Processing in Nutch Summary Log Processing at Rackspace Requirements/The Problem Brief History Choosing Hadoop Collection and Storage MapReduce for Logs Cascading Fields, Tuples, and Pipes
547 547 547 548 549 556 556 556 559 562 566 567 568 571 580 581 581 582 582 582 583 589 590 Table of Contents | xiii
Operations Taps, Schemes, and Flows Cascading in Practice Flexibility Hadoop and Cascading at ShareThis Summary TeraByte Sort on Apache Hadoop Using Pig and Wukong to Explore Billion-edge Network Graphs Measuring Community Everybody’s Talkin’ at Me: The Twitter Reply Graph Symmetric Links Community Extraction
593 594 595 598 599 603 603 607 609 609 612 613
A. Installing Apache Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 617 B. Cloudera’s Distribution Including Apache Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . 623 C. Preparing the NCDC Weather Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 625 Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 629
xiv | Table of Contents
CHAPTER 1
Meet Hadoop
In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers. —Grace Hopper
Data! We live in the data age. It’s not easy to measure the total volume of data stored electronically, but an IDC estimate put the size of the “digital universe” at 0.18 zettabytes in 2006 and is forecasting a tenfold growth by 2011 to 1.8 zettabytes.1 A zettabyte is 1021 bytes, or equivalently one thousand exabytes, one million petabytes, or one billion terabytes. That’s roughly the same order of magnitude as one disk drive for every person in the world. This flood of data is coming from many sources. Consider the following:2 • The New York Stock Exchange generates about one terabyte of new trade data per day. • Facebook hosts approximately 10 billion photos, taking up one petabyte of storage. • Ancestry.com, the genealogy site, stores around 2.5 petabytes of data. • The Internet Archive stores around 2 petabytes of data and is growing at a rate of 20 terabytes per month. • The Large Hadron Collider near Geneva, Switzerland, will produce about 15 petabytes of data per year. 1. From Gantz et al., “The Diverse and Exploding Digital Universe,” March 2008 (http://www.emc.com/ collateral/analyst-reports/diverse-exploding-digital-universe.pdf). 2. http://www.intelligententerprise.com/showArticle.jhtml?articleID=207800705, http://mashable.com/ 2008/10/15/facebook-10-billion-photos/, http://blog.familytreemagazine.com/insider/Inside +Ancestrycoms+TopSecret+Data+Center.aspx, http://www.archive.org/about/faqs.php, and http://www .interactions.org/cms/?pid=1027032
1
So there’s a lot of data out there. But you are probably wondering how it affects you. Most of the data is locked up in the largest web properties (like search engines) or in scientific or financial institutions, isn’t it? Does the advent of “Big Data,” as it is being called, affect smaller organizations or individuals? I argue that it does. Take photos, for example. My wife’s grandfather was an avid photographer and took photographs throughout his adult life. His entire corpus of medium-format, slide, and 35mm film, when scanned in at high resolution, occupies around 10 gigabytes. Compare this to the digital photos that my family took in 2008, which take up about 5 gigabytes of space. My family is producing photographic data at 35 times the rate my wife’s grandfather’s did, and the rate is increasing every year as it becomes easier to take more and more photos. More generally, the digital streams that individuals are producing are growing apace. Microsoft Research’s MyLifeBits project gives a glimpse of the archiving of personal information that may become commonplace in the near future. MyLifeBits was an experiment where an individual’s interactions—phone calls, emails, documents—were captured electronically and stored for later access. The data gathered included a photo taken every minute, which resulted in an overall data volume of one gigabyte per month. When storage costs come down enough to make it feasible to store continuous audio and video, the data volume for a future MyLifeBits service will be many times that. The trend is for every individual’s data footprint to grow, but perhaps more important, the amount of data generated by machines will be even greater than that generated by people. Machine logs, RFID readers, sensor networks, vehicle GPS traces, retail transactions—all of these contribute to the growing mountain of data. The volume of data being made publicly available increases every year, too. Organizations no longer have to merely manage their own data; success in the future will be dictated to a large extent by their ability to extract value from other organizations’ data. Initiatives such as Public Data Sets on Amazon Web Services, Infochimps.org, and theinfo.org exist to foster the “information commons,” where data can be freely (or in the case of AWS, for a modest price) shared for anyone to download and analyze. Mashups between different information sources make for unexpected and hitherto unimaginable applications. Take, for example, the Astrometry.net project, which watches the Astrometry group on Flickr for new photos of the night sky. It analyzes each image and identifies which part of the sky it is from, as well as any interesting celestial bodies, such as stars or galaxies. This project shows the kind of things that are possible when data (in this case, tagged photographic images) is made available and used for something (image analysis) that was not anticipated by the creator. It has been said that “more data usually beats better algorithms,” which is to say that for some problems (such as recommending movies or music based on past preferences),
2 | Chapter 1: Meet Hadoop
however fiendish your algorithms are, often they can be beaten simply by having more data (and a less sophisticated algorithm).3 The good news is that Big Data is here. The bad news is that we are struggling to store and analyze it.
Data Storage and Analysis The problem is simple: although the storage capacities of hard drives have increased massively over the years, access speeds—the rate at which data can be read from drives —have not kept up. One typical drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s,4 so you could read all the data from a full drive in around five minutes. Over 20 years later, one terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk. This is a long time to read all data on a single drive—and writing is even slower. The obvious way to reduce the time is to read from multiple disks at once. Imagine if we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in under two minutes. Using only one hundredth of a disk may seem wasteful. But we can store one hundred datasets, each of which is one terabyte, and provide shared access to them. We can imagine that the users of such a system would be happy to share access in return for shorter analysis times, and, statistically, that their analysis jobs would be likely to be spread over time, so they wouldn’t interfere with each other too much. There’s more to being able to read and write data in parallel to or from multiple disks, though. The first problem to solve is hardware failure: as soon as you start using many pieces of hardware, the chance that one will fail is fairly high. A common way of avoiding data loss is through replication: redundant copies of the data are kept by the system so that in the event of failure, there is another copy available. This is how RAID works, for instance, although Hadoop’s filesystem, the Hadoop Distributed Filesystem (HDFS), takes a slightly different approach, as you shall see later. The second problem is that most analysis tasks need to be able to combine the data in some way, and data read from one disk may need to be combined with the data from any of the other 99 disks. Various distributed systems allow data to be combined from multiple sources, but doing this correctly is notoriously challenging. MapReduce provides a programming model that abstracts the problem from disk reads and writes, 3. The quote is from Anand Rajaraman writing about the Netflix Challenge (http://anand.typepad.com/ datawocky/2008/03/more-data-usual.html). Alon Halevy, Peter Norvig, and Fernando Pereira make the same point in “The Unreasonable Effectiveness of Data,” IEEE Intelligent Systems, March/April 2009. 4. These specifications are for the Seagate ST-41600n.
Data Storage and Analysis | 3
transforming it into a computation over sets of keys and values. We look at the details of this model in later chapters, but the important point for the present discussion is that there are two parts to the computation, the map and the reduce, and it’s the interface between the two where the “mixing” occurs. Like HDFS, MapReduce has builtin reliability. This, in a nutshell, is what Hadoop provides: a reliable shared storage and analysis system. The storage is provided by HDFS and analysis by MapReduce. There are other parts to Hadoop, but these capabilities are its kernel.
Comparison with Other Systems The approach taken by MapReduce may seem like a brute-force approach. The premise is that the entire dataset—or at least a good portion of it—is processed for each query. But this is its power. MapReduce is a batch query processor, and the ability to run an ad hoc query against your whole dataset and get the results in a reasonable time is transformative. It changes the way you think about data and unlocks data that was previously archived on tape or disk. It gives people the opportunity to innovate with data. Questions that took too long to get answered before can now be answered, which in turn leads to new questions and new insights. For example, Mailtrust, Rackspace’s mail division, used Hadoop for processing email logs. One ad hoc query they wrote was to find the geographic distribution of their users. In their words: This data was so useful that we’ve scheduled the MapReduce job to run monthly and we will be using this data to help us decide which Rackspace data centers to place new mail servers in as we grow.
By bringing several hundred gigabytes of data together and having the tools to analyze it, the Rackspace engineers were able to gain an understanding of the data that they otherwise would never have had, and, furthermore, they were able to use what they had learned to improve the service for their customers. You can read more about how Rackspace uses Hadoop in Chapter 16.
Rational Database Management System Why can’t we use databases with lots of disks to do large-scale batch analysis? Why is MapReduce needed?
4 | Chapter 1: Meet Hadoop
The answer to these questions comes from another trend in disk drives: seek time is improving more slowly than transfer rate. Seeking is the process of moving the disk’s head to a particular place on the disk to read or write data. It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk’s bandwidth. If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than streaming through it, which operates at the transfer rate. On the other hand, for updating a small proportion of records in a database, a traditional B-Tree (the data structure used in relational databases, which is limited by the rate it can perform seeks) works well. For updating the majority of a database, a B-Tree is less efficient than MapReduce, which uses Sort/Merge to rebuild the database. In many ways, MapReduce can be seen as a complement to a Rational Database Management System (RDBMS). (The differences between the two systems are shown in Table 1-1.) MapReduce is a good fit for problems that need to analyze the whole dataset in a batch fashion, particularly for ad hoc analysis. An RDBMS is good for point queries or updates, where the dataset has been indexed to deliver low-latency retrieval and update times of a relatively small amount of data. MapReduce suits applications where the data is written once and read many times, whereas a relational database is good for datasets that are continually updated. Table 1-1. RDBMS compared to MapReduce Traditional RDBMS
MapReduce
Data size
Gigabytes
Petabytes
Access
Interactive and batch
Batch
Updates
Read and write many times
Write once, read many times
Structure
Static schema
Dynamic schema
Integrity
High
Low
Scaling
Nonlinear
Linear
Another difference between MapReduce and an RDBMS is the amount of structure in the datasets on which they operate. Structured data is data that is organized into entities that have a defined format, such as XML documents or database tables that conform to a particular predefined schema. This is the realm of the RDBMS. Semi-structured data, on the other hand, is looser, and though there may be a schema, it is often ignored, so it may be used only as a guide to the structure of the data: for example, a spreadsheet, in which the structure is the grid of cells, although the cells themselves may hold any form of data. Unstructured data does not have any particular internal structure: for example, plain text or image data. MapReduce works well on unstructured or semistructured data because it is designed to interpret the data at processing time. In other words, the input keys and values for MapReduce are not intrinsic properties of the data, but they are chosen by the person analyzing the data.
Comparison with Other Systems | 5
Relational data is often normalized to retain its integrity and remove redundancy. Normalization poses problems for MapReduce because it makes reading a record a nonlocal operation, and one of the central assumptions that MapReduce makes is that it is possible to perform (high-speed) streaming reads and writes. A web server log is a good example of a set of records that is not normalized (for example, the client hostnames are specified in full each time, even though the same client may appear many times), and this is one reason that logfiles of all kinds are particularly well-suited to analysis with MapReduce. MapReduce is a linearly scalable programming model. The programmer writes two functions—a map function and a reduce function—each of which defines a mapping from one set of key-value pairs to another. These functions are oblivious to the size of the data or the cluster that they are operating on, so they can be used unchanged for a small dataset and for a massive one. More important, if you double the size of the input data, a job will run twice as slow. But if you also double the size of the cluster, a job will run as fast as the original one. This is not generally true of SQL queries. Over time, however, the differences between relational databases and MapReduce systems are likely to blur—both as relational databases start incorporating some of the ideas from MapReduce (such as Aster Data’s and Greenplum’s databases) and, from the other direction, as higher-level query languages built on MapReduce (such as Pig and Hive) make MapReduce systems more approachable for traditional database programmers.5
Grid Computing The High Performance Computing (HPC) and Grid Computing communities have been doing large-scale data processing for years, using such Application Program Interfaces (APIs) as Message Passing Interface (MPI). Broadly, the approach in HPC is to distribute the work across a cluster of machines, which access a shared filesystem, hosted by a Storage Area Network (SAN). This works well for predominantly computeintensive jobs, but it becomes a problem when nodes need to access larger data volumes (hundreds of gigabytes, the point at which MapReduce really starts to shine), since the network bandwidth is the bottleneck and compute nodes become idle.
5. In January 2007, David J. DeWitt and Michael Stonebraker caused a stir by publishing “MapReduce: A major step backwards,” in which they criticized MapReduce for being a poor substitute for relational databases. Many commentators argued that it was a false comparison (see, for example, Mark C. ChuCarroll’s “Databases are hammers; MapReduce is a screwdriver,” and DeWitt and Stonebraker followed up with “MapReduce II,” where they addressed the main topics brought up by others.
6 | Chapter 1: Meet Hadoop
MapReduce tries to collocate the data with the compute node, so data access is fast because it is local.6 This feature, known as data locality, is at the heart of MapReduce and is the reason for its good performance. Recognizing that network bandwidth is the most precious resource in a data center environment (it is easy to saturate network links by copying data around), MapReduce implementations go to great lengths to conserve it by explicitly modelling network topology. Notice that this arrangement does not preclude high-CPU analyses in MapReduce. MPI gives great control to the programmer, but requires that he explicitly handle the mechanics of the data flow, exposed via low-level C routines and constructs such as sockets, as well as the higher-level algorithm for the analysis. MapReduce operates only at the higher level: the programmer thinks in terms of functions of key and value pairs, and the data flow is implicit. Coordinating the processes in a large-scale distributed computation is a challenge. The hardest aspect is gracefully handling partial failure—when you don’t know whether or not a remote process has failed—and still making progress with the overall computation. MapReduce spares the programmer from having to think about failure, since the implementation detects failed map or reduce tasks and reschedules replacements on machines that are healthy. MapReduce is able to do this because it is a shared-nothing architecture, meaning that tasks have no dependence on one other. (This is a slight oversimplification, since the output from mappers is fed to the reducers, but this is under the control of the MapReduce system; in this case, it needs to take more care rerunning a failed reducer than rerunning a failed map because it has to make sure it can retrieve the necessary map outputs, and if not, regenerate them by running the relevant maps again.) So from the programmer’s point of view, the order in which the tasks run doesn’t matter. By contrast, MPI programs have to explicitly manage their own checkpointing and recovery, which gives more control to the programmer but makes them more difficult to write. MapReduce might sound like quite a restrictive programming model, and in a sense it is: you are limited to key and value types that are related in specified ways, and mappers and reducers run with very limited coordination between one another (the mappers pass keys and values to reducers). A natural question to ask is: can you do anything useful or nontrivial with it? The answer is yes. MapReduce was invented by engineers at Google as a system for building production search indexes because they found themselves solving the same problem over and over again (and MapReduce was inspired by older ideas from the functional programming, distributed computing, and database communities), but it has since been used for many other applications in many other industries. It is pleasantly surprising to see the range of algorithms that can be expressed in MapReduce, from
6. Jim Gray was an early advocate of putting the computation near the data. See “Distributed Computing Economics,” March 2003, http://research.microsoft.com/apps/pubs/default.aspx?id=70001.
Comparison with Other Systems | 7
image analysis, to graph-based problems, to machine learning algorithms.7 It can’t solve every problem, of course, but it is a general data-processing tool. You can see a sample of some of the applications that Hadoop has been used for in Chapter 16.
Volunteer Computing When people first hear about Hadoop and MapReduce, they often ask, “How is it different from [email protected]?” SETI, the Search for Extra-Terrestrial Intelligence, runs a project called [email protected] in which volunteers donate CPU time from their otherwise idle computers to analyze radio telescope data for signs of intelligent life outside earth. [email protected] is the most well-known of many volunteer computing projects; others include the Great Internet Mersenne Prime Search (to search for large prime numbers) and [email protected] (to understand protein folding and how it relates to disease). Volunteer computing projects work by breaking the problem they are trying to solve into chunks called work units, which are sent to computers around the world to be analyzed. For example, a [email protected] work unit is about 0.35 MB of radio telescope data, and takes hours or days to analyze on a typical home computer. When the analysis is completed, the results are sent back to the server, and the client gets another work unit. As a precaution to combat cheating, each work unit is sent to three different machines and needs at least two results to agree to be accepted. Although [email protected] may be superficially similar to MapReduce (breaking a problem into independent pieces to be worked on in parallel), there are some significant differences. The [email protected] problem is very CPU-intensive, which makes it suitable for running on hundreds of thousands of computers across the world8 because the time to transfer the work unit is dwarfed by the time to run the computation on it. Volunteers are donating CPU cycles, not bandwidth. MapReduce is designed to run jobs that last minutes or hours on trusted, dedicated hardware running in a single data center with very high aggregate bandwidth interconnects. By contrast, [email protected] runs a perpetual computation on untrusted machines on the Internet with highly variable connection speeds and no data locality.
7. Apache Mahout (http://mahout.apache.org/) is a project to build machine-learning libraries (such as classification and clustering algorithms) that run on Hadoop. 8. In January 2008, [email protected] was reported at http://www.planetary.org/programs/projects/setiathome/ setiathome_20080115.html to be processing 300 gigabytes a day, using 320,000 computers (most of which are not dedicated to [email protected]; they are used for other things, too).
8 | Chapter 1: Meet Hadoop
A Brief History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project.
The Origin of the Name “Hadoop” The name Hadoop is not an acronym; it’s a made-up name. The project’s creator, Doug Cutting, explains how the name came about: The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid’s term.
Subprojects and “contrib” modules in Hadoop also tend to have names that are unrelated to their function, often with an elephant or other animal theme (“Pig,” for example). Smaller components are given more descriptive (and therefore more mundane) names. This is a good principle, as it means you can generally work out what something does from its name. For example, the jobtracker9 keeps track of MapReduce jobs.
Building a web search engine from scratch was an ambitious goal, for not only is the software required to crawl and index websites complex to write, but it is also a challenge to run without a dedicated operations team, since there are so many moving parts. It’s expensive, too: Mike Cafarella and Doug Cutting estimated a system supporting a one-billion-page index would cost around half a million dollars in hardware, with a monthly running cost of $30,000.10 Nevertheless, they believed it was a worthy goal, as it would open up and ultimately democratize search engine algorithms. Nutch was started in 2002, and a working crawler and search system quickly emerged. However, they realized that their architecture wouldn’t scale to the billions of pages on the Web. Help was at hand with the publication of a paper in 2003 that described the architecture of Google’s distributed filesystem, called GFS, which was being used in production at Google.11 GFS, or something like it, would solve their storage needs for the very large files generated as a part of the web crawl and indexing process. In particular, GFS would free up time being spent on administrative tasks such as managing storage nodes. In 2004, they set about writing an open source implementation, the Nutch Distributed Filesystem (NDFS). 9. In this book, we use the lowercase form, “jobtracker,” to denote the entity when it’s being referred to generally, and the CamelCase form JobTracker to denote the Java class that implements it. 10. See Mike Cafarella and Doug Cutting, “Building Nutch: Open Source Search,” ACM Queue, April 2004, http://queue.acm.org/detail.cfm?id=988408. 11. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, “The Google File System,” October 2003, http://labs.google.com/papers/gfs.html.
A Brief History of Hadoop | 9
In 2004, Google published the paper that introduced MapReduce to the world.12 Early in 2005, the Nutch developers had a working MapReduce implementation in Nutch, and by the middle of that year. All the major Nutch algorithms had been ported to run using MapReduce and NDFS. NDFS and the MapReduce implementation in Nutch were applicable beyond the realm of search, and in February 2006 they moved out of Nutch to form an independent subproject of Lucene called Hadoop. At around the same time, Doug Cutting joined Yahoo!, which provided a dedicated team and the resources to turn Hadoop into a system that ran at web scale (see the sidebar “Hadoop at Yahoo!” on page 11). This was demonstrated in February 2008 when Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster.13 In January 2008, Hadoop was made its own top-level project at Apache, confirming its success and its diverse, active community. By this time, Hadoop was being used by many other companies besides Yahoo!, such as Last.fm, Facebook, and the New York Times. Some applications are covered in the case studies in Chapter 16 and on the Hadoop wiki. In one well-publicized feat, the New York Times used Amazon’s EC2 compute cloud to crunch through four terabytes of scanned archives from the paper, converting them to PDFs for the Web.14 The processing took less than 24 hours to run using 100 machines, and the project probably wouldn’t have been embarked upon without the combination of Amazon’s pay-by-the-hour model (which allowed the NYT to access a large number of machines for a short period) and Hadoop’s easy-to-use parallel programming model. In April 2008, Hadoop broke a world record to become the fastest system to sort a terabyte of data. Running on a 910-node cluster, Hadoop sorted one terabyte in 209 seconds (just under 3½ minutes), beating the previous year’s winner of 297 seconds (described in detail in “TeraByte Sort on Apache Hadoop” on page 603). In November of the same year, Google reported that its MapReduce implementation sorted one terabyte in 68 seconds.15 As the first edition of this book was going to press (May 2009), it was announced that a team at Yahoo! used Hadoop to sort one terabyte in 62 seconds. Since then, Hadoop has seen rapid mainstream enterprise adoption. Hadoop’s role as a general-purpose storage and analysis platform for big data has been recognized by
12. Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters ,” December 2004, http://labs.google.com/papers/mapreduce.html. 13. “Yahoo! Launches World’s Largest Hadoop Production Application,” 19 February 2008, http://developer .yahoo.net/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html. 14. Derek Gottfrid, “Self-service, Prorated Super Computing Fun!” 1 November 2007, http://open.blogs .nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/. 15. “Sorting 1PB with MapReduce,” 21 November 2008, http://googleblog.blogspot.com/2008/11/sorting-1pb -with-mapreduce.html.
10 | Chapter 1: Meet Hadoop
the industry, and this fact is reflected in the number of products that use or incorporate Hadoop in some way. There are Hadoop distributions from the large, established enterprise vendors, including EMC, IBM, Microsoft, and Oracle, as well as from specialist Hadoop companies such as Cloudera, Hortonworks, and MapR.
Hadoop at Yahoo! Building Internet-scale search engines requires huge amounts of data and therefore large numbers of machines to process it. Yahoo! Search consists of four primary components: the Crawler, which downloads pages from web servers; the WebMap, which builds a graph of the known Web; the Indexer, which builds a reverse index to the best pages; and the Runtime, which answers users’ queries. The WebMap is a graph that consists of roughly 1 trillion (1012) edges, each representing a web link, and 100 billion (1011) nodes, each representing distinct URLs. Creating and analyzing such a large graph requires a large number of computers running for many days. In early 2005, the infrastructure for the WebMap, named Dreadnaught, needed to be redesigned to scale up to more nodes. Dreadnaught had successfully scaled from 20 to 600 nodes, but required a complete redesign to scale out further. Dreadnaught is similar to MapReduce in many ways, but provides more flexibility and less structure. In particular, each fragment in a Dreadnaught job can send output to each of the fragments in the next stage of the job, but the sort was all done in library code. In practice, most of the WebMap phases were pairs that corresponded to MapReduce. Therefore, the WebMap applications would not require extensive refactoring to fit into MapReduce. Eric Baldeschwieler (Eric14) created a small team and we started designing and prototyping a new framework written in C++ modeled after GFS and MapReduce to replace Dreadnaught. Although the immediate need was for a new framework for WebMap, it was clear that standardization of the batch platform across Yahoo! Search was critical and by making the framework general enough to support other users, we could better leverage investment in the new platform. At the same time, we were watching Hadoop, which was part of Nutch, and its progress. In January 2006, Yahoo! hired Doug Cutting, and a month later we decided to abandon our prototype and adopt Hadoop. The advantage of Hadoop over our prototype and design was that it was already working with a real application (Nutch) on 20 nodes. That allowed us to bring up a research cluster two months later and start helping real customers use the new framework much sooner than we could have otherwise. Another advantage, of course, was that since Hadoop was already open source, it was easier (although far from easy!) to get permission from Yahoo!’s legal department to work in open source. So we set up a 200-node cluster for the researchers in early 2006 and put the WebMap conversion plans on hold while we supported and improved Hadoop for the research users. Here’s a quick timeline of how things have progressed: • 2004: Initial versions of what is now Hadoop Distributed Filesystem and MapReduce implemented by Doug Cutting and Mike Cafarella.
A Brief History of Hadoop | 11
• December 2005: Nutch ported to the new framework. Hadoop runs reliably on 20 nodes. • January 2006: Doug Cutting joins Yahoo!. • February 2006: Apache Hadoop project officially started to support the standalone development of MapReduce and HDFS. • February 2006: Adoption of Hadoop by Yahoo! Grid team. • April 2006: Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours. • May 2006: Yahoo! set up a Hadoop research cluster—300 nodes. • May 2006: Sort benchmark run on 500 nodes in 42 hours (better hardware than April benchmark). • October 2006: Research cluster reaches 600 nodes. • December 2006: Sort benchmark run on 20 nodes in 1.8 hours, 100 nodes in 3.3 hours, 500 nodes in 5.2 hours, 900 nodes in 7.8 hours. • January 2007: Research cluster reaches 900 nodes. • April 2007: Research clusters—two clusters of 1000 nodes. • April 2008: Won the 1-terabyte sort benchmark in 209 seconds on 900 nodes. • October 2008: Loading 10 terabytes of data per day onto research clusters. • March 2009: 17 clusters with a total of 24,000 nodes. • April 2009: Won the minute sort by sorting 500 GB in 59 seconds (on 1,400 nodes) and the 100-terabyte sort in 173 minutes (on 3,400 nodes). —Owen O’Malley
Apache Hadoop and the Hadoop Ecosystem Although Hadoop is best known for MapReduce and its distributed filesystem (HDFS, renamed from NDFS), the term is also used for a family of related projects that fall under the umbrella of infrastructure for distributed computing and large-scale data processing. All of the core projects covered in this book are hosted by the Apache Software Foundation, which provides support for a community of open source software projects, including the original HTTP Server from which it gets its name. As the Hadoop ecosystem grows, more projects are appearing, not necessarily hosted at Apache, that provide complementary services to Hadoop or build on the core to add higher-level abstractions. The Hadoop projects that are covered in this book are described briefly here: Common A set of components and interfaces for distributed filesystems and general I/O (serialization, Java RPC, persistent data structures).
12 | Chapter 1: Meet Hadoop
Avro A serialization system for efficient, cross-language RPC and persistent data storage. MapReduce A distributed data processing model and execution environment that runs on large clusters of commodity machines. HDFS A distributed filesystem that runs on large clusters of commodity machines. Pig A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters. Hive A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (and which is translated by the runtime engine to MapReduce jobs) for querying the data. HBase A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads). ZooKeeper A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications. Sqoop A tool for efficient bulk transfer of data between structured data stores (such as relational databases) and HDFS. Oozie A service for running and scheduling workflows of Hadoop jobs (including MapReduce, Pig, Hive, and Sqoop jobs).
Hadoop Releases Which version of Hadoop should you use? The answer to this question changes over time, of course, and also depends on the features that you need. “Hadoop Releases” on page 13 summarizes the high-level features in recent Hadoop release series. There are a few active release series. The 1.x release series is a continuation of the 0.20 release series and contains the most stable versions of Hadoop currently available. This series includes secure Kerberos authentication, which prevents unauthorized access to Hadoop data (see “Security” on page 325). Almost all production clusters use these releases or derived versions (such as commercial distributions).
Hadoop Releases | 13
The 0.22 and 2.x release series16 are not currently stable (as of early 2012), but this is likely to change by the time you read this as they undergo more real-world testing (consult the Apache Hadoop releases page for the latest status). 2.x includes several major new features: • A new MapReduce runtime, called MapReduce 2, implemented on a new system called YARN (Yet Another Resource Negotiator), which is a general resource management system for running distributed applications. MapReduce 2 replaces the “classic” runtime in previous releases. It is described in more depth in “YARN (MapReduce 2)” on page 196. • HDFS federation, which partitions the HDFS namespace across multiple namenodes to support clusters with very large numbers of files. See “HDFS Federation” on page 47. • HDFS high-availability, which removes the namenode as a single point of failure by supporting standby namenodes for failover. See “HDFS High-Availability” on page 48. Table 1-2. Features supported by Hadoop release series Feature
1.x
0.22
2.x
Secure authentication
Yes
No
Yes
Old configuration names
Yes
Deprecated
Deprecated
New configuration names
No
Yes
Yes
Old MapReduce API
Yes
Yes
Yes
New MapReduce API
Yes (with some missing libraries)
Yes
Yes
MapReduce 1 runtime (Classic)
Yes
Yes
No
MapReduce 2 runtime (YARN)
No
No
Yes
HDFS federation
No
No
Yes
HDFS high-availability
No
No
Yes
Table 1-2 only covers features in HDFS and MapReduce. Other projects in the Hadoop ecosystem are continually evolving too, and picking a combination of components that work well together can be a challenge. Thankfully, you don’t have to do this work yourself. The Apache Bigtop project (http://incubator.apache.org/bigtop/) runs interoperability tests on stacks of Hadoop components and provides Linux packages (RPMs and Debian packages) for easy installation. There are also commercial vendors offering Hadoop distributions containing suites of compatible components.
16. As this book went to press, the Hadoop community voted for the 0.23 release series to be renamed as the 2.x release series. This book uses the shorthand “releases after 1.x” to refer to releases in the 0.22 and 2.x (formerly 0.23) series.
14 | Chapter 1: Meet Hadoop
What’s Covered in This Book This book covers all the releases in Table 1-2. In the cases where a feature is available only in a particular release, it is noted in the text. The code in this book is written to work against all these release series, except in a small number of cases, which are called out explicitly. The example code available on the website has a list of the versions that it was tested against.
Configuration names Configuration property names have been changed in the releases after 1.x, in order to give them a more regular naming structure. For example, the HDFS properties pertaining to the namenode have been changed to have a dfs.namenode prefix, so dfs.name.dir has changed to dfs.namenode.name.dir. Similarly, MapReduce properties have the mapreduce prefix, rather than the older mapred prefix, so mapred.job.name has changed to mapreduce.job.name. For properties that exist in version 1.x, the old (deprecated) names are used in this book because they will work in all the versions of Hadoop listed here. If you are using a release after 1.x, you may wish to use the new property names in your configuration files and code to remove deprecation warnings. A table listing the deprecated properties names and their replacements can be found on the Hadoop website at http://hadoop.apache .org/common/docs/r0.23.0/hadoop-project-dist/hadoop-common/DeprecatedProperties .html.
MapReduce APIs Hadoop provides two Java MapReduce APIs, described in more detail in “The old and the new Java MapReduce APIs” on page 27. This edition of the book uses the new API for the examples, which will work with all versions listed here, except in a few cases where a MapReduce library using new API is not available in the 1.x releases. All the examples in this book are available in the old API version (in the oldapi package) from the book’s website. Where there are material differences between the two APIs, they are discussed in the text.
Compatibility When moving from one release to another, you need to consider the upgrade steps that are needed. There are several aspects to consider: API compatibility, data compatibility, and wire compatibility. API compatibility concerns the contract between user code and the published Hadoop APIs, such as the Java MapReduce APIs. Major releases (e.g., from 1.x.y to 2.0.0) are allowed to break API compatibility, so user programs may need to be modified and
Hadoop Releases | 15
recompiled. Minor releases (e.g., from 1.0.x to 1.1.0) and point releases (e.g., from 1.0.1 to 1.0.2) should not break compatibility.17 Hadoop uses a classification scheme for API elements to denote their stability. The preceding rules for API compatibility cover those elements that are marked InterfaceStability.Stable. Some elements of the public Hadoop APIs, however, are marked with the InterfaceStabil ity.Evolving or InterfaceStability.Unstable annotations (all these annotations are in the org.apache.hadoop.classification package), which mean they are allowed to break compatibility on minor and point releases, respectively.
Data compatibility concerns persistent data and metadata formats, such as the format in which the HDFS namenode stores its persistent data. The formats can change across minor or major releases, but the change is transparent to users because the upgrade will automatically migrate the data. There may be some restrictions about upgrade paths, and these are covered in the release notes. For example, it may be necessary to upgrade via an intermediate release rather than upgrading directly to the later final release in one step. Hadoop upgrades are discussed in more detail in “Upgrades” on page 362. Wire compatibility concerns the interoperability between clients and servers via wire protocols such as RPC and HTTP. There are two types of client: external clients (run by users) and internal clients (run on the cluster as a part of the system, e.g., datanode and tasktracker daemons). In general, internal clients have to be upgraded in lockstep; an older version of a tasktracker will not work with a newer jobtracker, for example. In the future, rolling upgrades may be supported, which would allow cluster daemons to be upgraded in phases, so that the cluster would still be available to external clients during the upgrade. For external clients that are run by the user—such as a program that reads or writes from HDFS, or the MapReduce job submission client—the client must have the same major release number as the server, but is allowed to have a lower minor or point release number (e.g., client version 1.0.1 will work with server 1.0.2 or 1.1.0, but not with server 2.0.0). Any exception to this rule should be called out in the release notes.
17. Pre-1.0 releases follow the rules for major releases, so a change in version from, say, 0.1.0 to 0.2.0 constitutes a major release, and therefore may break API compatibility.
16 | Chapter 1: Meet Hadoop
O’Reilly Ebooks—Your bookshelf on your devices!
When you buy an ebook through oreilly.com you get lifetime access to the book, and whenever possible we provide it to you in five, DRM-free file formats—PDF, .epub, Kindle-compatible .mobi, Android .apk, and DAISY—that you can use on the devices of your choice. Our ebook files are fully searchable, and you can cut-and-paste and print them. We also alert you when we’ve updated the files with corrections and additions.
Learn more at ebooks.oreilly.com You can also purchase O’Reilly ebooks through the iBookstore, the Android Marketplace, and Amazon.com.
Spreading the knowledge of innovators
oreilly.com
Hadoop The Definitive Guide Pdf Download
Hadoop Definitive Guide Pdf Download
Ebook Description. Counsels programmers and administrators for big and small organizations on how to work with large-scale application datasets using Apache Hadoop, discussing its capacity for storing and processing large amounts of data while demonstrating best practices for building reliable and scalable distributed systems.