A NoSQL (often interpreted as "Not only SQL") database provides a mechanism for storing and retrieving data that is modeled by means other than the tabular relations used in relational databases. Motivations for this approach include simplicity of design, horizontal scaling, and finer control over availability.
Why NoSQL?
NoSQL databases started out as in-house solutions to real problems at companies such as Amazon (with Dynamo), Google, and others. These companies found that relational SQL databases didn’t meet their requirements. In particular, they faced three primary issues: unprecedented transaction volumes, expectations of low-latency access to massive datasets, and nearly perfect service availability while operating in an unreliable environment. Initially, companies tried the traditional approach: they added more hardware or upgraded to faster hardware as it became available.
When that didn’t work, they tried to scale existing relational solutions by simplifying their database schema, de-normalizing the schema, relaxing durability and referential integrity, introducing various query caching layers, separating read-only from write-dedicated replicas, and, finally, data partitioning in an attempt to address these new requirements. None fundamentally addressed the core limitations, and they all introduced additional overhead and technical tradeoffs.
NoSQL’s Foundations
Companies needed a solution that would scale, be resilient, and be operationally efficient. They had been able to scale the web tier (HTTP) and the dynamic content generation and business logic layers (application servers), but the database remained the system’s bottleneck.
Understanding CAP Theorem – Consistency, Availability, Partition Tolerance
When evaluating NoSQL or other distributed systems, you’ll inevitably hear about the “CAP theorem.” In 2000 Eric Brewer proposed the idea that in a distributed system you can’t continually maintain perfect consistency, availability, and partition tolerance simultaneously. CAP is defined as:
Consistency: all nodes see the same data at the same time
Availability: a guarantee that every request receives a response about whether it was successful or failed
Partition tolerance: the system continues to operate despite arbitrary message loss
The theorem states that you cannot simultaneously have all three; you must make tradeoffs among them. The CAP theorem is sometimes incorrectly described as a simple design-time decision—“pick any two [when designing a distributed system]”—when in fact the theorem allows for systems to make tradeoffs at run-time to accommodate different requirements.
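The run-time nature of the tradeoff can be illustrated with a toy model. The following is a minimal sketch, not a real database: the `TwoNodeStore` class, its `partitioned` flag, and the per-request `mode` argument are all hypothetical names invented for this example. During a simulated partition, a request that prefers consistency ("CP") is refused, while a request that prefers availability ("AP") is accepted at the cost of a stale replica.

```python
class TwoNodeStore:
    """Toy two-replica key-value store (hypothetical, for illustration only)."""

    def __init__(self):
        self.node_a = {}          # replica that receives client writes
        self.node_b = {}          # replica that receives replicated writes
        self.partitioned = False  # True when the replicas cannot exchange messages

    def write(self, key, value, mode):
        """Write through node A. `mode` is "CP" or "AP", chosen per request."""
        if not self.partitioned:
            # Healthy network: update both replicas, so they stay consistent.
            self.node_a[key] = value
            self.node_b[key] = value
            return "ok"
        if mode == "CP":
            # Prefer consistency: refuse the write rather than let replicas diverge.
            return "unavailable"
        # Prefer availability: accept the write locally; node B is now stale.
        self.node_a[key] = value
        return "ok"

    def read_from_b(self, key):
        return self.node_b.get(key)


store = TwoNodeStore()
first = store.write("x", 1, mode="CP")      # network healthy, both replicas updated
store.partitioned = True                    # simulate a network partition
cp_result = store.write("x", 2, mode="CP")  # refused to preserve consistency
ap_result = store.write("x", 3, mode="AP")  # accepted, replicas now differ
stale_read = store.read_from_b("x")         # node B still holds the old value
```

Because `mode` is passed with each request, the same system gives up availability for one write and consistency for the next, which is the kind of run-time choice the "pick any two" slogan hides.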
Relaxing ACID properties of RDBMS
Anyone familiar with databases will know the acronym ACID, which outlines the fundamental elements of transactions: atomicity, consistency, isolation, and durability. Together, these qualities define the basics of any transaction. As NoSQL solutions developed it became clear that in order to deliver scalability it might be necessary to relax or redefine some of these qualities, in particular consistency and durability.
To address this, most NoSQL solutions choose to relax the notion of complete consistency to something called “eventual consistency.” This allows each system to make updates to data and learn of other updates made by other systems within a short period of time, without being totally consistent at all times.
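One simple way to picture eventual consistency is sketched below. It assumes a last-write-wins rule based on a caller-supplied logical timestamp, which is only one of several reconciliation strategies; the `LwwReplica` class and its methods are hypothetical names used for illustration, and the sketch ignores timestamp ties and deletions.

```python
class LwwReplica:
    """Toy replica that resolves conflicts by last-write-wins (an assumption)."""

    def __init__(self):
        self.store = {}  # key -> (timestamp, value)

    def put(self, key, value, timestamp):
        # Keep the incoming value only if it is newer than what we hold.
        current = self.store.get(key)
        if current is None or timestamp > current[0]:
            self.store[key] = (timestamp, value)

    def get(self, key):
        entry = self.store.get(key)
        return None if entry is None else entry[1]

    def sync_from(self, other):
        # Background exchange: pull every entry from the other replica
        # and let put() decide which version survives.
        for key, (timestamp, value) in other.store.items():
            self.put(key, value, timestamp)


r1, r2 = LwwReplica(), LwwReplica()
r1.put("cart", ["book"], timestamp=1)         # update lands on replica 1
r2.put("cart", ["book", "pen"], timestamp=2)  # later update lands on replica 2

before = (r1.get("cart"), r2.get("cart"))     # replicas disagree for a while

r1.sync_from(r2)                              # replicas exchange updates
r2.sync_from(r1)
after = (r1.get("cart"), r2.get("cart"))      # replicas have converged
```

Between the two writes and the sync, a reader may see either value depending on which replica answers. After the exchange both replicas hold the newer value, which is the "eventual" part of the guarantee.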
Another approach is optimistic concurrency control, using techniques such as multi-version concurrency control (MVCC). Such techniques allow consistent reads in one transaction alongside concurrent writes in another, but they do not address write conflicts and can introduce more transaction retries when transactions overlap or are long-running.
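The retry cost mentioned above can be seen in a small sketch of the optimistic pattern. This is not a full MVCC implementation: it assumes a single version counter per key and a compare-and-set style write, and the names `VersionedStore`, `write_if_unchanged`, and `increment` are hypothetical.

```python
class VersionedStore:
    """Toy store that tags every value with a version number (illustrative)."""

    def __init__(self):
        self._rows = {}  # key -> (version, value)

    def read(self, key):
        # Missing keys read as version 0 with no value.
        return self._rows.get(key, (0, None))

    def write_if_unchanged(self, key, expected_version, value):
        # Commit only if nobody else has written since we read.
        current_version, _ = self._rows.get(key, (0, None))
        if current_version != expected_version:
            return False  # conflict: the caller must re-read and retry
        self._rows[key] = (current_version + 1, value)
        return True


def increment(store, key, max_retries=5):
    """Read-modify-write loop; returns the number of attempts it took."""
    for attempt in range(1, max_retries + 1):
        version, value = store.read(key)
        if store.write_if_unchanged(key, version, (value or 0) + 1):
            return attempt
    raise RuntimeError("too many conflicting writers")


store = VersionedStore()
attempts = increment(store, "hits")                        # no contention here
version, value = store.read("hits")
stale_write = store.write_if_unchanged("hits", 0, 99)      # based on an old read
```

No locks are held while the value is being computed, so readers are never blocked. The price is that a writer working from an outdated version is rejected and has to start over, which is why overlapping or long-running transactions lead to more retries.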