[Description]
Construct a robust end-to-end solution for analyzing and visualizing streaming data
Real-time analytics is the hottest topic in data analytics today. In Real-Time Analytics: Techniques to Analyze and Visualize Streaming Data, expert Byron Ellis teaches data analysts the technologies needed to build an effective real-time analytics platform. Such a platform can then be used to make sense of constantly changing data that is beginning to outpace traditional batch-based analysis platforms.
The author is one of very few leading experts in the field. He has a prestigious background in research, development, analytics, real-time visualization, and Big Data streaming, and is uniquely qualified to help you explore this revolutionary field. Moving from a description of the overall architecture of real-time analytics to the use of specific tools to obtain targeted results, Real-Time Analytics leverages open source and modern commercial tools to construct robust, efficient systems that can provide real-time analysis in a cost-effective manner. The book includes:
- A deep discussion of streaming data systems and architectures
- Instructions for analyzing, storing, and delivering streaming data
- Tips on aggregating data and working with sets
- Information on data warehousing options and techniques
Real-Time Analytics includes in-depth case studies for website analytics, Big Data, visualizing streaming and mobile data, and mining and visualizing operational data flows. The book's "recipe" layout lets readers quickly learn and implement different techniques. All of the code examples presented in the book, along with their related data sets, are available on the companion website.
[Table of Contents]
Introduction
Chapter 1 Introduction to Streaming Data
Sources of Streaming Data
Operational Monitoring
Web Analytics
Online Advertising
Social Media
Mobile Data and the Internet of Things
Why Streaming Data Is Different
Always On, Always Flowing
Loosely Structured
High-Cardinality Storage
Infrastructures and Algorithms
Conclusion
Part I Streaming Analytics Architecture
Chapter 2 Designing Real-Time Streaming Architectures
Real-Time Architecture Components
Collection
Data Flow
Processing
Storage
Delivery
Features of a Real-Time Architecture
High Availability
Low Latency
Horizontal Scalability
Languages for Real-Time Programming
Java
Scala and Clojure
JavaScript
The Go Language
A Real-Time Architecture Checklist
Collection
Data Flow
Processing
Storage
Delivery
Conclusion
Chapter 3 Service Configuration and Coordination
Motivation for Configuration and Coordination Systems
Maintaining Distributed State
Unreliable Network Connections
Clock Synchronization
Consensus in an Unreliable World
Apache ZooKeeper
The znode
Watches and Notifications
Maintaining Consistency
Creating a ZooKeeper Cluster
ZooKeeper’s Native Java Client
The Curator Client
Curator Recipes
Conclusion
Chapter 4 Data-Flow Management in Streaming Analysis
Distributed Data Flows
At Least Once Delivery
The “n+1” Problem
Apache Kafka: High-Throughput Distributed Messaging
Design and Implementation
Configuring a Kafka Environment
Interacting with Kafka Brokers
Apache Flume: Distributed Log Collection
The Flume Agent
Configuring the Agent
The Flume Data Model
Channel Selectors
Flume Sources
Flume Sinks
Sink Processors
Flume Channels
Flume Interceptors
Integrating Custom Flume Components
Running Flume Agents
Conclusion
Chapter 5 Processing Streaming Data
Distributed Streaming Data Processing
Coordination
Partitions and Merges
Transactions
Processing Data with Storm
Components of a Storm Cluster
Configuring a Storm Cluster
Distributed Clusters
Local Clusters
Storm Topologies
Implementing Bolts
Implementing and Using Spouts
Distributed Remote Procedure Calls
Trident: The Storm DSL
Processing Data with Samza
Apache YARN
Getting Started with YARN and Samza
Integrating Samza into the Data Flow
Samza Jobs
Conclusion
Chapter 6 Storing Streaming Data
Consistent Hashing
“NoSQL” Storage Systems
Redis
MongoDB
Cassandra
Other Storage Technologies
Relational Databases
Distributed In-Memory Data Grids
Choosing a Technology
Key-Value Stores
Document Stores
Distributed Hash Table Stores
In-Memory Grids
Relational Databases
Warehousing
Hadoop as ETL and Warehouse
Lambda Architectures
Conclusion
Part II Analysis and Visualization
Chapter 7 Delivering Streaming Metrics
Streaming Web Applications
Working with Node
Managing a Node Project with NPM
Developing Node Web Applications
A Basic Streaming Dashboard
Adding Streaming to Web Applications
Visualizing Data
HTML5 Canvas and Inline SVG
Data-Driven Documents: D3.js
High-Level Tools
Mobile Streaming Applications
Conclusion
Chapter 8 Exact Aggregation and Delivery
Timed Counting and Summation
Counting in Bolts
Counting with Trident
Counting in Samza
Multi-Resolution Time-Series Aggregation
Quantization Framework
Stochastic Optimization
Delivering Time-Series Data
Strip Charts with D3.js
High-Speed Canvas Charts
Horizon Charts
Conclusion
Chapter 9 Statistical Approximation of Streaming Data
Numerical Libraries
Probabilities and Distributions
Expectation and Variance
Statistical Distributions
Discrete Distributions
Continuous Distributions
Joint Distributions
Working with Distributions
Inferring Parameters
The Delta Method
Distribution Inequalities
Random Number Generation
Generating Specific Distributions
Sampling Procedures
Sampling from a Fixed Population
Sampling from a Streaming Population
Biased Streaming Sampling
Conclusion
Chapter 10 Approximating Streaming Data with Sketching
Registers and Hash Functions
Registers
Hash Functions
Working with Sets
The Bloom Filter
The Algorithm
Choosing a Filter Size
Unions and Intersections
Cardinality Estimation
Interesting Variations
Distinct Value Sketches
The Min-Count Algorithm
The HyperLogLog Algorithm
The Count-Min Sketch
Point Queries
Count-Min Sketch Implementation
Top-K and “Heavy Hitters”
Range and Quantile Queries
Other Applications
Conclusion
Chapter 11 Beyond Aggregation
Models for Real-Time Data
Simple Time-Series Models
Linear Models
Logistic Regression
Neural Network Models
Forecasting with Models
Exponential Smoothing Methods
Regression Methods
Neural Network Methods
Monitoring
Outlier Detection
Change Detection
Real-Time Optimization
Conclusion
Index