Notes from Spark Summit East 2017

The Spark and Hadoop summits are among my most valuable resources for keeping up to date with these technologies. There were over 100 talks at this summit. So far, I have watched almost 60 of them and made short notes with screenshots.

Hope someone else finds it useful as well.

Spark Summit East 2017

I'll update the file accordingly.

 

Notes from Marz’ Big Data – principles and best practices of scalable real-time data systems – chapter 2

Notes from chapter 1

2. Data model for Big Data

2.1 The properties of data

The core of the Lambda architecture is the master dataset. It is the only part of the architecture that must be safeguarded from corruption.

There are two components to the master dataset:

  • the data model
  • how the master dataset is physically stored

Definitions of the terms:

  • information – the general collection of knowledge relevant to the system
  • data – information that cannot be derived from anything else
  • queries – questions to ask the data
  • views – information derived from the base data

One person’s data can be another’s view.

2.1.1 Data is raw

It is best to store the rawest data you can obtain. This is important because you may want to ask questions of your data in the future that you cannot anticipate now.

Unstructured data is rawer than normalized data. When deciding what raw data to store, there is a grey area between parsing and semantic normalization. Semantic normalization is the process of transforming free-form information into structured form. The semantic normalization algorithm would try to match the input with a known value.

It is better to store data in unstructured form, because the semantic normalization algorithm might improve over time, and the raw data can then be normalized again with the better algorithm.
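As a minimal sketch of the idea, the snippet below normalizes a free-form location string against a small table of known values (both the table and the example strings are made up for illustration); note the raw input is kept alongside the result, so it can be re-normalized later by a better matcher:

```python
# Hypothetical lookup table of known city names.
KNOWN_CITIES = {"san francisco": "San Francisco", "new york": "New York"}

def normalize_location(raw):
    """Try to match free-form input against known values."""
    cleaned = raw.strip().lower()
    # Drop a trailing state abbreviation such as ", CA".
    city_part = cleaned.split(",")[0].strip()
    normalized = KNOWN_CITIES.get(city_part)  # None if no match (yet)
    return {"raw": raw, "normalized": normalized}

print(normalize_location("San Francisco, CA"))
# {'raw': 'San Francisco, CA', 'normalized': 'San Francisco'}
```

Because the `raw` field survives, an unmatched input like "NYC" is not lost; a future version of the matcher can normalize it.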

2.1.2 Data is immutable

Relational databases offer an update operation. With Big Data systems, immutability is key: data is never updated or deleted, only added. Two advantages derive from this:

  • human-fault tolerance – no data is lost when a human error occurs
  • simplicity – an immutable data model needs only an append operation

One trade-off of the immutable approach is that it uses more storage.
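A tiny sketch of the append-only idea (my own toy example, not the book's): a "change" is recorded as a new fact, and the current value is derived as a view over the immutable facts.

```python
facts = []  # the master dataset: append-only, never updated in place

def record(entity, field, value, ts):
    """Append a new fact; nothing is ever overwritten."""
    facts.append({"entity": entity, "field": field, "value": value, "ts": ts})

def current_value(entity, field):
    """A view over the immutable facts: the most recent fact wins."""
    matching = [f for f in facts if f["entity"] == entity and f["field"] == field]
    return max(matching, key=lambda f: f["ts"])["value"] if matching else None

record("user-1", "city", "New York", ts=1)
record("user-1", "city", "Boston", ts=2)  # a move is a newer fact, not an update
print(current_value("user-1", "city"))  # Boston
```

If the second `record` call turned out to be a human mistake, the earlier fact is still there, which is exactly the human-fault tolerance point above.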

2.2 The fact-based model for representing data

Data is the set of information that cannot be derived from anything else.

In the fact-based model, you represent data as fundamental units – facts. Facts are atomic because they cannot be divided further into meaningful components. Facts are also timestamped, which makes them eternally true: each fact records what was true at that moment.

Facts should also be uniquely identifiable – when two identical pieces of data arrive at the same time (e.g. two pageviews from the same IP address with the same timestamp), a nonce can be added. A nonce is a 64-bit randomly generated number.
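A short sketch of the nonce trick (field names and values are made up for illustration):

```python
import random

def pageview_fact(ip, url, ts):
    # The nonce - a random 64-bit number - makes otherwise identical
    # facts (same IP, same URL, same timestamp) distinguishable.
    return {"ip": ip, "url": url, "ts": ts, "nonce": random.getrandbits(64)}

a = pageview_fact("1.2.3.4", "/home", 100)
b = pageview_fact("1.2.3.4", "/home", 100)
print(a != b)  # True (except with vanishing probability)
```

Without the nonce, a deduplicating store would silently collapse the two pageviews into one.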

Fact-based model:

  • stores your raw data as atomic facts
  • keeps facts immutable and eternally true
  • makes each fact uniquely identifiable

Benefits of the fact-based model:

  • queryable at any time in history
  • human-fault tolerant
  • handles partial information
  • combines the advantages of normalized (batch layer) and denormalized (serving layer) forms. In a traditional database these are mutually exclusive, so a choice between query efficiency and data consistency has to be made.

Having information stored in multiple locations increases the risk of it becoming inconsistent. Normalization – keeping each value in a single place and referring to it from elsewhere – removes the risk of inconsistency, but then a join, a potentially expensive operation, is needed to answer queries.

In the Lambda architecture, the master dataset is fully normalized. The batch views are like denormalized tables and are defined as functions on the master dataset.

2.3 Graph schemas

Graph schemas capture the structure of a dataset stored using the fact-based model.

2.3.1 Elements of a graph schema

A graph schema has three components:

  • nodes – entities in the system
  • edges – relationships between nodes
  • properties – information about entities

[Figure: an example graph schema with nodes, edges, and properties]
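A toy encoding of the three components (my own, with made-up example data, not the book's diagram):

```python
nodes = {"person:alice", "page:/home"}                   # entities in the system
edges = [("person:alice", "viewed", "page:/home")]       # relationships between nodes
properties = {("person:alice", "location"): "New York"}  # information about entities

# An edge fact relates two known nodes; a property fact describes one node.
assert all(src in nodes and dst in nodes for src, _, dst in edges)
print(properties[("person:alice", "location")])  # New York
```

Each tuple here is a fact in the sense of section 2.2: atomic, and adding a new edge or property never modifies an existing one.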

2.3.2 The need for an enforceable schema

Information is now stored as facts, and the graph schema describes the types of facts. What is still missing is the format in which to store the facts.

One option is a semistructured text format like JSON, which provides simplicity and flexibility. The problem is that JSON can be valid yet still have an inconsistent structure or missing fields.

To guarantee a consistent format, an enforceable schema is the alternative. It guarantees that all required fields are present and that all values are of the expected type. This can be implemented using a serialization framework, which provides a language-neutral way to define the nodes, edges, and properties of the schema.
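A hand-rolled sketch of what an enforceable schema buys you (the schema and fact shapes are invented for illustration); real serialization frameworks such as Thrift, Avro, and Protocol Buffers do this in a language-neutral and far more complete way:

```python
# Required fields and their expected types for a pageview fact.
PAGEVIEW_SCHEMA = {"ip": str, "url": str, "ts": int}

def validate(fact, schema):
    """Reject facts with missing fields or wrongly typed values."""
    for field, expected_type in schema.items():
        if field not in fact:
            raise ValueError(f"missing required field: {field}")
        if not isinstance(fact[field], expected_type):
            raise TypeError(f"{field} must be {expected_type.__name__}")
    return fact

validate({"ip": "1.2.3.4", "url": "/home", "ts": 100}, PAGEVIEW_SCHEMA)  # passes
```

A fact like `{"ip": "1.2.3.4"}` would be rejected at write time instead of silently corrupting the master dataset.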

 

One of the beauties of the fact-based model and graph schemas is that they can evolve as different types of data become available.

 

Notes from Marz’ Big Data – principles and best practices of scalable real-time data systems – chapter 1

Notes from Big Data: Principles and best practices of scalable realtime data systems, which is a book about how to implement Lambda architecture using Big Data technologies.

1. A new paradigm for Big Data

1.1  Scaling a traditional database

As you evolve the application, you run into problems with scalability and complexity.

Steps from traditional database to Big Data architecture:

  1. A single increment per request is wasteful; it is more efficient to batch many increments into a single request. This is done by using queues – for example, 100 events are read from the queue and processed at once.
  2. The database gets overloaded and one worker cannot keep up, so you parallelize the updates. To scale a relational database, the table is spread across multiple servers so that each server holds a subset of the data – horizontal partitioning, or sharding.
  3. At some point, a bug sneaks into production.
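The sharding step above can be sketched in a few lines; this is a minimal hash-based scheme with made-up server names, not a production partitioner:

```python
import hashlib

SERVERS = ["db0", "db1", "db2", "db3"]  # hypothetical shard servers

def shard_for(key):
    """Map a key to one of the servers by hashing it."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]

print(shard_for("user-42") == shard_for("user-42"))  # True: same key, same server
```

Every piece of application code now has to know this function to find its data, which is part of the complexity the chapter is warning about.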

More time is spent on dealing with reading and writing data, and less on building new features for customers.

1.2  How will Big Data techniques help?

The databases and systems for Big Data are designed for distributed work. Sharding and replication are handled for you. To scale, you just add nodes and the system rebalances automatically.

Making data immutable is another key feature.

Scalable data systems are available now: large-scale computation systems like Hadoop and databases like Cassandra and MongoDB. They can handle very large amounts of data, but with trade-offs.

Hadoop can parallelize large-scale batch computations on very large amounts of data, but with high latency.

NoSQL databases offer you scalability, but with a more limited data model than what SQL has to offer.

These tools are not universal problem solvers. However, used intelligently in combination with one another, they let you build scalable systems with human-fault tolerance and minimal complexity. The Lambda architecture offers exactly that.

1.3  Desired properties of a Big Data system

A Big Data system must perform well and be resource-efficient.

1.3.1 Robustness and fault tolerance

Systems need to behave correctly despite machines going down randomly. The systems must be human-fault tolerant.

1.3.2 Low latency reads and updates

Big Data systems have to deliver low latency updates when needed.

1.3.3 Scalability

Scalability is the ability to maintain performance as data or load increases, by adding resources to the system.

1.3.4 Generalization

A general system can support a wide range of applications. The Lambda architecture generalizes because it is based on functions of all the data; it applies equally to financial systems, social media analytics, and so on.

1.3.5 Extensibility

Functionality is added with minimum development cost.

1.3.6 Ad hoc queries

It is important to be able to do ad hoc queries. Every dataset has some unexpected value in it.

1.3.7 Minimal maintenance

Keep it simple. The more complex a system, the more likely something will go wrong. Big Data tools with little implementation complexity should be prioritized. In Lambda architecture, the complexity is pushed out of the core components.

1.3.8 Debuggability

A Big Data system must provide the information necessary to debug the system when needed. The debuggability in Lambda architecture is accomplished in the batch layer by using recomputation algorithms when possible.

1.4  Lambda architecture

There is no single tool to provide a complete solution. A variety of tools and techniques is needed to build a complete Big Data system.

The main idea is to build a Big Data system as a series of layers. Each layer builds upon the functionality provided by the layer beneath it.

[Figure: the layers of the Lambda architecture]

Everything starts from the following equation:

query = function(all data)

Ideally, you could run the functions on the fly to get the results, but that would be far too expensive and resource-consuming. Imagine reading petabytes of data every time a simple query is executed.

The best alternative is to precompute the query function. The precomputed query function is the batch view. A user's query fetches data from the precomputed view. The batch view is indexed so that it can be accessed with random reads. The above equation now looks like this:

batch view = function(all data)

query = function(batch view)
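The two equations can be sketched with a tiny made-up master dataset of pageview facts (counts per URL stand in for an arbitrary query function):

```python
from collections import Counter

all_data = [{"url": "/home"}, {"url": "/home"}, {"url": "/about"}]

def compute_batch_view(data):
    # batch view = function(all data): computed once, over everything
    return Counter(f["url"] for f in data)

batch_view = compute_batch_view(all_data)

def query(url):
    # query = function(batch view): a cheap random read, no full scan
    return batch_view[url]

print(query("/home"))  # 2
```

The expensive pass over `all_data` happens once; each query afterwards is a dictionary lookup.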

Creating the batch view is clearly a high-latency operation. By the time it finishes, new data will be available that is not represented in the batch views.

[Figure: batch views and the new data not yet represented in them]

1.4.1 Batch layer

The batch layer stores the master copy of the dataset and precomputes the batch views on it.

The batch layer needs to be able to do two things:

  • store an immutable, constantly growing master dataset
  • compute arbitrary functions on that dataset (ad hoc queries)

The batch layer is simple to use. Batch computations are written like single-threaded programs, and the parallelism comes for free.

The batch layer scales by adding new machines.

1.4.2 Serving layer

The serving layer is a specialized distributed database that loads in a batch view and makes it possible to do random reads on it.

A serving layer database supports batch updates and random reads. It does not need to support random writes. Random writes cause most of the complexity in databases.

The serving layer uses replication in case servers go down. The serving layer also scales easily.

1.4.3 Speed layer

The speed layer ensures that new data is represented in query functions. It looks only at recent, new data.

It updates the real-time views as it receives new data. It does not recompute the views from scratch like the batch layer.

real-time view = function(real-time view, new data)

 

The Lambda architecture in three equations:

batch view = function(all data)

real-time view = function(real-time view, new data)

query = function(batch view, real-time view)

 

Query results come from merging the results from the batch and real-time views.
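Continuing the pageview-count example (the numbers are made up), the merge is a simple sum of the two views:

```python
from collections import Counter

batch_view = Counter({"/home": 100})              # from the high-latency batch layer
realtime_view = Counter({"/home": 3, "/new": 1})  # covers only the recent, unbatched data

def query(url):
    # query = function(batch view, real-time view)
    return batch_view[url] + realtime_view[url]

print(query("/home"))  # 103: batch result plus the not-yet-batched events
```

When the batch layer catches up and absorbs those recent events, the corresponding real-time entries can simply be discarded.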

Once data makes it from the batch layer into the serving layer, the corresponding results in the real-time views are no longer needed.

The speed layer is far more complex than the batch and serving layers. This is called complexity isolation – complexity is pushed into a layer whose results are only temporary.

On the batch layer, the exact algorithm is used, while on the speed layer, an approximate algorithm is used.

The batch layer constantly overrides the speed layer, so the approximations are corrected – the system exhibits eventual accuracy.

1.5  Recent trends in technology

 

1.5.1 CPUs are not getting faster

We are hitting the limit of how fast a single CPU can go. This means that to handle more data, you must parallelize your computation.

Vertical scaling – scaling by buying a better machine.

Horizontal scaling – adding more machines.

1.5.2 Elastic clouds

The rise of elastic clouds – Infrastructure as a Service – means users can increase or decrease the size of their cluster almost instantaneously.

1.5.3 Vibrant open source ecosystem for Big Data

Five categories of open source projects:

  • Batch computation systems

High-throughput, high-latency systems. An example is Hadoop.

  • Serialization frameworks

They provide tools and libraries for using objects across languages: an object is serialized into a byte array in one language and deserialized back into an object in another. Examples are Thrift, Avro, and Protocol Buffers.

  • Random access NoSQL databases

They lack the full expressiveness of SQL and specialize in certain operations. They are meant for specific purposes, not for data warehousing. Examples are Cassandra, HBase, and MongoDB.

  • Messaging/queuing systems

They provide a way to send and consume messages between processes in a fault-tolerant and asynchronous way. They are key for real-time processing. An example is Kafka.

  • Realtime computation systems

They are high-throughput, low-latency stream-processing systems. They cannot do the kinds of computations batch-processing systems can, but they process messages extremely quickly. An example is Storm.

Notes from chapter 2