Understanding NoSQL
1. What is NoSQL
Agenda
Common Traits
Consistency
Indexing
Queries
MapReduce
Sharding
NoSQL Common Traits
Non-relational
Non-schematized/schema-free
Eventual Consistency
Open source
Distributed
"Web scale"
Developed at big internet companies
Consistency
CAP Theorem
Databases may only excel at two of the following three attributes:
consistency, availability, and partition tolerance
NoSQL databases typically do not offer 'ACID' guarantees
Atomicity, consistency, isolation, and durability
Instead they offer 'eventual consistency'
Similar to DNS propagation
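A minimal sketch of eventual consistency, assuming a toy model in which one replica applies a write immediately and the others apply it only when they next sync. The `Replica` class and `write` helper are hypothetical names for illustration, not any product's API.

```python
class Replica:
    """One copy of the data; queued updates are applied only on sync()."""

    def __init__(self):
        self.data = {}
        self.pending = []  # updates received but not yet applied

    def receive(self, key, value):
        self.pending.append((key, value))

    def sync(self):
        for key, value in self.pending:
            self.data[key] = value
        self.pending.clear()


def write(replicas, key, value):
    """Apply the write on the first replica at once; queue it on the rest."""
    first, *others = replicas
    first.data[key] = value
    for replica in others:
        replica.receive(key, value)


replicas = [Replica(), Replica(), Replica()]
write(replicas, "user:1", "Ada")

stale_read = replicas[2].data.get("user:1")  # update has not propagated yet
for replica in replicas:
    replica.sync()
fresh_read = replicas[2].data.get("user:1")  # replicas have now converged
```

A reader that hits the third replica before the sync sees no value, much as a DNS client can see an old record until propagation completes.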
Indexing
Most NoSQL databases are indexed by key
Some allow so-called 'secondary' indexes
Often the primary key indexes are clustered
HBase uses the Hadoop Distributed File System (HDFS), which is append-only
Writes are logged
Logged writes are batched
File is re-created and sorted
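The three steps above (log, batch, re-create sorted) can be sketched roughly as follows. This is a simplified in-memory illustration, not how HBase is actually implemented; `AppendOnlyStore` and `batch_size` are hypothetical names.

```python
class AppendOnlyStore:
    def __init__(self, batch_size=3):
        self.log = []          # every write is appended here first
        self.batch = {}        # in-memory batch of recent writes
        self.sorted_file = []  # key-sorted list of (key, value) pairs
        self.batch_size = batch_size

    def put(self, key, value):
        self.log.append((key, value))  # 1. the write is logged
        self.batch[key] = value        # 2. logged writes are batched
        if len(self.batch) >= self.batch_size:
            self.flush()

    def flush(self):
        merged = dict(self.sorted_file)
        merged.update(self.batch)
        self.sorted_file = sorted(merged.items())  # 3. file re-created, sorted
        self.batch = {}

    def get(self, key):
        if key in self.batch:
            return self.batch[key]
        return dict(self.sorted_file).get(key)


store = AppendOnlyStore(batch_size=3)
for key, value in [("c", 3), ("a", 1), ("b", 2), ("d", 4)]:
    store.put(key, value)
```

After the third write the batch is flushed into a new sorted file; the fourth write stays in the batch until the next flush.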
Queries
Typically no query language
Instead, create a procedural program
Sometimes SQL is supported
Sometimes MapReduce code is used ...
MapReduce
Map step: split the query up
Reduce step: merge the results
Most typical of Hadoop and used with Wide Column Stores, especially HBase
Amazon Web Services' Elastic MapReduce (EMR) can read/write DynamoDB, S3, and Relational Database Service (RDS)
"Hive" offers HiveQL, a SQL-like abstraction over MapReduce
Use with Hive tables
Use with HBase
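A minimal single-process sketch of the map and reduce steps, using a word count as the job. It only illustrates the shape of the computation; it does not use Hadoop, and the `map_step` and `reduce_step` names are made up for this example.

```python
from collections import Counter
from functools import reduce


def map_step(chunk):
    """Count words in one chunk of the input, independently of other chunks."""
    return Counter(word for line in chunk for word in line.split())


def reduce_step(left, right):
    """Merge two partial results into one."""
    return left + right


lines = ["big data", "big tables", "data stores"]
chunks = [lines[:2], lines[2:]]  # on a cluster, each chunk would go to a node

partials = [map_step(chunk) for chunk in chunks]
totals = reduce(reduce_step, partials, Counter())
```

Because each chunk is mapped independently, the map work can run in parallel; only the merge needs to see every partial result.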
Sharding
A partition pattern where separate servers store partitions
Fan-out queries supported
Partitions may be duplicated, so replication is also provided
Good for disaster recovery
Since 'shards' can be geographically distributed, sharding can act like a CDN
Good for keeping data close to processing
Reduces network traffic when MapReduce splitting takes place
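A rough sketch of sharding with a fan-out query, assuming hash-based placement of keys (one common choice; range-based placement is another). Each "server" here is just a dictionary, and `ShardedStore` is a hypothetical name.

```python
import hashlib


class ShardedStore:
    def __init__(self, shard_count):
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # A stable hash, so the same key always lands on the same shard.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key):
        return self._shard_for(key).get(key)

    def fan_out(self, predicate):
        """Send the same query to every shard and merge the answers."""
        results = {}
        for shard in self.shards:
            results.update({k: v for k, v in shard.items() if predicate(v)})
        return results


store = ShardedStore(shard_count=3)
for i in range(10):
    store.put(f"user:{i}", {"age": 20 + i})

over_25 = store.fan_out(lambda user: user["age"] > 25)
```

A lookup by key touches exactly one shard, while the fan-out query has to visit all of them, which is why fan-out support is called out separately.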
2. NoSQL Technology Breakdown
Agenda
Key-Value Stores
Wide-Column Stores
Document Stores
Demo CouchDB
Graph Databases
Key-Value "mechanics" present throughout
Key-Value Stores
The most common; not necessarily the most popular
Has rows, each with something like a big dictionary/associative array
Schema may differ from row to row
Common on Cloud platforms
e.g., Amazon SimpleDB, Azure Table Storage
MemcacheDB, Voldemort
DynamoDB (AWS), Dynomite, Redis, and Riak
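To illustrate "rows as dictionaries" with a schema that differs from row to row, here is a small sketch using plain Python dictionaries. The row keys and property names are invented for the example.

```python
table = {
    "row-1": {"name": "Espresso beans", "flavor": "dark roast"},
    "row-2": {"name": "Monitor", "resolution": "2560x1440", "color": "black"},
}

# Nothing is declared up front, so a new property needs no schema change.
table["row-1"]["origin"] = "Colombia"

# Each row can report its own set of properties.
schemas = {key: sorted(row) for key, row in table.items()}
```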
Document Stores
Have 'databases', which are akin to tables
Have 'documents', akin to rows
Documents are typically JSON objects
Each document has properties and values
Values can be scalars, arrays, links to documents in other databases, or sub-documents (i.e., contained JSON objects, which allow for hierarchical storage)
Can have attachments as well
Old versions are retained
So Doc Stores work well for content management
Some view doc stores as specialized KV stores
Most popular with developers, startups, VCs
The biggies:
CouchDB
MongoDB
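As an illustration of the kinds of values a document can hold, the sketch below builds one document with a scalar, an array, and a sub-document, then round-trips it through JSON. The field names and values are hypothetical.

```python
import json

document = {
    "_id": "article-42",                               # hypothetical document id
    "title": "Understanding NoSQL",                    # scalar value
    "tags": ["nosql", "document-store"],               # array value
    "author": {"name": "A. Writer", "city": "Oslo"},   # sub-document
}

serialized = json.dumps(document)   # what would be sent to the store
restored = json.loads(serialized)   # what a client would read back
author_city = restored["author"]["city"]
```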
Document Store Application Orientation
Documents can each be addressed by URIs
CouchDB supports full REST interface
Very geared towards JavaScript and JSON
Documents are JSON objects
CouchDB and MongoDB use JavaScript as their native language
In CouchDB, 'view functions' also have unique URIs, and the related 'show' functions can return HTML
So you can build applications in the database
Demo CouchDB
http://127.0.0.1:5984/pluralsight/_design/example/_view/dotNet
http://127.0.0.1:5984/pluralsight/_design/example/_view/dataacess
http://127.0.0.1:5984/pluralsight/_design/example/_show/showfunction/_id
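The demo URLs above follow a pattern of database, design document, and function name. A small sketch of that pattern, with hypothetical helper names `view_url` and `show_url`:

```python
BASE = "http://127.0.0.1:5984"  # local CouchDB address used in the demo


def view_url(database, design_doc, view):
    """URL of a view function inside a design document."""
    return f"{BASE}/{database}/_design/{design_doc}/_view/{view}"


def show_url(database, design_doc, show, doc_id):
    """URL of a show function applied to one document."""
    return f"{BASE}/{database}/_design/{design_doc}/_show/{show}/{doc_id}"


dotnet_view = view_url("pluralsight", "example", "dotNet")
show_page = show_url("pluralsight", "example", "showfunction", "_id")
```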
Wide Column Stores
Has tables with declared column families
Each column family has "columns", which are KV pairs that can vary from row to row
These are the most foundational for large sites
BigTable (Google)
HBase (originally part of the Yahoo-dominated Hadoop project)
Cassandra (Facebook)
Calls column families "super columns" and tables "super column families"
They are the most "Big Data"-ready
Especially HBase + Hadoop
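A toy model of the wide-column layout, assuming nested dictionaries: column families are declared up front, while the columns inside each family are free-form per row. The `put` helper and all names are hypothetical.

```python
# table -> row key -> column family -> column name -> value
declared_families = {"profile", "activity"}
table = {}


def put(row_key, family, column, value):
    """Store one column value; only declared column families are accepted."""
    if family not in declared_families:
        raise KeyError(f"undeclared column family: {family}")
    table.setdefault(row_key, {}).setdefault(family, {})[column] = value


put("user:1", "profile", "name", "Ada")
put("user:1", "activity", "last_login", "2012-05-01")
put("user:2", "profile", "name", "Linus")
put("user:2", "profile", "email", "linus@example.com")  # column only on this row

try:
    put("user:1", "billing", "plan", "pro")
    rejected = False
except KeyError:
    rejected = True
```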
Graph Databases
Great for social network applications and others where relationships are important
Nodes and edges
Edges are like joins
Nodes are like rows in a table
Nodes can also have properties and values
Neo4j is a popular graph database
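A minimal sketch of nodes, edges, and a two-hop traversal, using plain Python collections rather than a real graph database. The node names and the "follows" label are invented for the example.

```python
nodes = {
    "ann": {"kind": "person"},
    "bob": {"kind": "person"},
    "cat": {"kind": "person"},
    "dan": {"kind": "person"},
}
# Each edge is (source, label, destination).
edges = [
    ("ann", "follows", "bob"),
    ("bob", "follows", "cat"),
    ("cat", "follows", "dan"),
]


def neighbors(node, label):
    """Nodes reachable from `node` over one edge with the given label."""
    return {dst for src, lbl, dst in edges if src == node and lbl == label}


def within_two_hops(start, label):
    """Nodes reachable in one or two hops, excluding the start node."""
    first = neighbors(start, label)
    second = set()
    for node in first:
        second |= neighbors(node, label)
    return (first | second) - {start}


reachable = within_two_hops("ann", "follows")
```

In a relational schema the same question would typically need a self-join on a follower table for each hop.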
3. Where is a NoSQL Killer App
Agenda
Content Management
Product Catalogs
Social
Big Data
Miscellaneous
Content Management
Document databases work really well here
Regular KV pairs can store metadata
Can also store text-based content
Attachments can store file-based or binary content
Versioning and URI addressability help as well
CouchDB gets called a 'Web database'
Database for Web apps
Database that can contain Web apps
Think Web sites, not browser-based LOB applications
Think Evernote
Product Catalogs
Products in a catalog tend to have many attributes in common and then various others that are class-specific
Common
ProductID
Name
Description
Price
Class-Specific
Flavor, Color
Resolution, Clock speed
Key-Value Stores and Wide Column Stores work well here
KV Stores are better when the schema will change over time
Since nothing is declared
Social
Graph databases work best here
Great for tracking:
Networks
Followers
Group membership
Threaded interactions (comments, likes/favorites)
Great for Membership, Ownership
Avoids the self-joins and many-to-many tables necessary in relational DBs
Big Data
Wide Column and Key-Value stores work best here
MapReduce is designed for this scenario
Hadoop and HBase come up a lot
Sharding and append-only help here
Premise of analytics is reading data, not maintaining it
This is perfect for NoSQL
Aggregation, correlation, and regression do not require a formal schema or sophisticated query capabilities
Just need to read and perform mathematical operations on data really,really quickly
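As a sketch of "read and do math" over schema-free data, the example below computes a mean and a Pearson correlation across records that do not all share the same fields. The records and field names are made up, and the statistics are written out by hand to keep the example self-contained.

```python
from math import sqrt

records = [
    {"visits": 1, "purchases": 2, "referrer": "ad"},
    {"visits": 2, "purchases": 4},
    {"visits": 3, "purchases": 6, "campaign": "spring"},
    {"visits": 4},  # no purchases field, so it is skipped below
]

# Keep only the records that carry both fields of interest.
pairs = [(r["visits"], r["purchases"]) for r in records
         if "visits" in r and "purchases" in r]


def mean(values):
    return sum(values) / len(values)


def pearson(points):
    """Pearson correlation coefficient of a list of (x, y) points."""
    xs, ys = zip(*points)
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in points)
    spread = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return cov / spread


average_purchases = mean([y for _, y in pairs])
correlation = pearson(pairs)
```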
Miscellaneous
Event-driven data (e.g., logs)
User profiles, preferences
Mail, status message streams
Other Web data
Automobile directions
Info for sites on maps (category, name, description, lat/long, photo)
User reviews
Etc.
4. What Good is Relational
Agenda
Transactional
Formal Schema
Line of Business Applications
Declarative Query
Banded Reporting
Transactional
Business systems require atomic transactions
You can't process an order without decrementing inventory
You can't register a credit without its corresponding debit
No exceptions, no excuses
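The order-and-inventory rule can be sketched with Python's built-in `sqlite3` module standing in for a business RDBMS. The table and column names are hypothetical; the point is that the order row and the inventory update succeed or fail together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (sku TEXT PRIMARY KEY, qty INTEGER CHECK (qty >= 0))"
)
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, sku TEXT, qty INTEGER)")
conn.execute("INSERT INTO inventory VALUES ('widget', 5)")
conn.commit()


def place_order(sku, qty):
    """Insert the order and decrement inventory as one atomic unit."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("INSERT INTO orders (sku, qty) VALUES (?, ?)", (sku, qty))
            conn.execute(
                "UPDATE inventory SET qty = qty - ? WHERE sku = ?", (qty, sku)
            )
        return True
    except sqlite3.IntegrityError:
        return False


first = place_order("widget", 3)   # enough stock, so both statements commit
second = place_order("widget", 4)  # CHECK fails, so the order row is rolled back

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
remaining = conn.execute(
    "SELECT qty FROM inventory WHERE sku = 'widget'"
).fetchone()[0]
```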
Formal Schema
Regular processes have regular data
Stocks, trades
PO line items
Personnel records
Insurance policies
These need relational databases with declared schema
These don't need MapReduce, document, or graph representation
Line of Business Applications
Screen layouts and data binding require consistent schema
Data Transfer Objects have properties defined in code
You can't have strong typing without a schema
Object Relational Mapping
Object models are mapped to database schema
If the schema is not consistent then the mapping can't be either
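A small sketch of why a typed object needs a consistent schema: a dataclass plays the role of a Data Transfer Object, and mapping fails as soon as a row has a different shape. `ProductDTO` and `to_dto` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ProductDTO:
    product_id: int
    name: str
    price: float


def to_dto(row):
    """Map one stored row onto the DTO; fail if the row has a different shape."""
    return ProductDTO(**row)


consistent = to_dto({"product_id": 1, "name": "Widget", "price": 9.99})

try:
    # This row lacks "price" and carries an undeclared "colour" property.
    to_dto({"product_id": 2, "name": "Gadget", "colour": "red"})
    mapping_failed = False
except TypeError:
    mapping_failed = True
```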
Declarative Query
It's silly to write imperative code for each routine query
Makes ad hoc queries and reporting difficult
Lose out on engine optimization
Lose out on versatility
Imperative query works best when the range of queries is very small
Relational stored procedures do set a precedent for pre-written queries, but they still don't iterate through data sets imperatively
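To make the contrast concrete, the sketch below computes the same per-customer totals twice: once with an imperative loop, and once with a declarative SQL query run through Python's built-in `sqlite3` module. The data and table name are made up.

```python
import sqlite3

rows = [("acme", 120.0), ("globex", 35.5), ("acme", 80.0)]

# Imperative: the program spells out how to iterate and accumulate.
imperative_totals = {}
for customer, total in rows:
    imperative_totals[customer] = imperative_totals.get(customer, 0.0) + total

# Declarative: the query states the result wanted; the engine plans the work.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
declarative_totals = dict(
    conn.execute("SELECT customer, SUM(total) FROM orders GROUP BY customer")
)
```

A new question, such as an average per customer, is a one-line change to the query but a new loop in the imperative version.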
Banded Reporting
Operational reporting is based on detail and group sections with predictable, consistent layout, based on known schema
Very hard to design pixel-perfect reports against indeterminate schema
You can dump all columns/all rows,but that's generic
Forms are formal,by definition
This highlights how operational business processes almost always require relational databases
5. NoSQL and Microsoft
Agenda
Azure Table Storage
SQL Server/Azure XML Columns
SQL Azure Federations
Demo
OData
MongoDB on Azure
Hadoop on Azure/Windows
Demo
SQL Server "Beyond Relational"
SQL Server Parallel Data Warehouse
Azure Table Storage
Cloud-based Key-Value Store
Supports OData interface (more on that later)
Key-Value works nicely for general purpose storage and retrieval
SQL Server Data Services (precursor to SQL Azure) also implemented a Key-Value store
SQL Azure XML Columns
XML columns hold structured data that can differ between rows
Combining scalar and XML columns allows combination of static and dynamic schemas
XML schemas can still be declared
But you can have more than one
And it's not required
If the motivation to use NoSQL is loose schema, then consider XML columns
To prove the point: Azure Dev Fabric's Table Storage is implemented with SQL Server Express and XML columns
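The idea of mixing scalar columns (static schema) with an XML column (dynamic schema) can be sketched as follows. This uses SQLite with a plain TEXT column as a stand-in, so it is only an approximation: SQL Server's native xml type can also be queried on the server, which this sketch does not attempt. Table and element names are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
# Scalar columns carry the static schema; "extra" carries per-row XML.
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, extra TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        (1, "Coffee", "<attrs><flavor>dark roast</flavor></attrs>"),
        (2, "Monitor",
         "<attrs><resolution>2560x1440</resolution><color>black</color></attrs>"),
    ],
)


def dynamic_attributes(product_id):
    """Parse the XML column of one row into a dictionary of attributes."""
    (xml_text,) = conn.execute(
        "SELECT extra FROM products WHERE id = ?", (product_id,)
    ).fetchone()
    return {child.tag: child.text for child in ET.fromstring(xml_text)}


coffee = dynamic_attributes(1)
monitor = dynamic_attributes(2)
```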
SQL Azure Federations
Federations are the SQL Azure version of sharding
Just for partitioning, not for replication
Replication is automatic, implicit in SQL Azure
Federation Root (physical & logical db, defines F.Key)
F.Member (physical db - contains specific range of F.Key values)
F.Atomic Unit (AU - container for all data with same F.Key value)
F.Table
F.Members can be addressed by absolute name or relative key value
Allow online repartitioning
Offer ACID guarantees within F.Members and adopt Eventual Consistency between them
Multi-tenancy applications
Do not support fan-out query
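Addressing a member "by relative key value" amounts to finding the member whose key range contains the value. A rough sketch of that lookup, assuming each member starts at a lower bound and runs up to the next member's bound; the boundaries and member names are invented, and this is not the SQL Azure API.

```python
from bisect import bisect_right

# Each member holds keys from its lower bound up to the next member's bound.
lower_bounds = [0, 1000, 5000]  # hypothetical range boundaries
members = ["member_a", "member_b", "member_c"]


def member_for(federation_key):
    """Pick the member whose range contains the key value."""
    if federation_key < lower_bounds[0]:
        raise ValueError("key below the lowest range")
    return members[bisect_right(lower_bounds, federation_key) - 1]


routes = [member_for(k) for k in (0, 999, 1000, 7500)]
```

All rows with the same key value route to the same member, which matches the atomic unit described above.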
OData
RESTful API for data access, with rendering in XML or JSON
Clients for JavaScript, mobile platforms, .NET, Java
Works for Feeds and updates
The following feature OData interfaces:
Azure Table Storage
SQL Server/Azure (via WCF Data Services)
Azure DataMarket
SQL Server Reporting Services (in 2008 R2, 2012)
SharePoint Lists (2010)
Netflix, eBay catalogs; TwitPic
IBM WebSphere eXtreme Scale REST data service
Pluralsight catalog!
Compare to JavaScript/JSON orientation of Document Stores
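Because OData is plain REST, a query is just a URL with system query options such as `$filter`, `$top`, and `$format`. A minimal sketch of building one; the service root, the `Courses` entity set, and the `Rating` property are hypothetical, and no request is actually sent.

```python
from urllib.parse import quote, urlencode

SERVICE_ROOT = "https://example.com/odata"  # hypothetical service


def odata_query(entity_set, filter_expr=None, top=None, fmt=None):
    """Build an OData query URL from a few common system query options."""
    options = {}
    if filter_expr is not None:
        options["$filter"] = filter_expr
    if top is not None:
        options["$top"] = str(top)
    if fmt is not None:
        options["$format"] = fmt
    # Keep "$" readable in option names; encode spaces as %20.
    query = urlencode(options, safe="$", quote_via=quote)
    return f"{SERVICE_ROOT}/{entity_set}" + (f"?{query}" if query else "")


url = odata_query("Courses", filter_expr="Rating gt 4", top=5, fmt="json")
```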
Run MongoDB, others on Azure
Deploy to worker roles
Put databases in Azure Blob Storage; mount as drives (Azure Drive)
MongoDB Replica Set Azure wrapper supports this directly
Use from on-premises or cloud application code
Similar approach can be used for other NoSQL DBs
Hadoop on Azure/Windows
MS + Hortonworks have developed a Windows version of Hadoop
Currently in Community Technology Preview
Can use installer to create cluster
On-premises
On Azure
Can also use Hadoop On Azure
Provision entire cluster from Portal
Currently has 48-hour lifetime
Browser-based Hive console
Hive ODBC Driver
Use from Excel (with add-in)
Also use from PowerPivot, Analysis Services (2012 Tabular Mode), Reporting Services
SQL Server "Beyond Relational" Features
XML Columns (already discussed)
HierarchyId
Sparse columns (SQL Server-only)
Filestream (SQL Server-only)
Allow schema flexibility while retaining ACID guarantees
SQL Server Parallel Data Warehouse Edition (SQL PDWE)
Makes a cluster of SQL Server instances appear as one logical server
Uses MPP: Massively Parallel Processing
Compare to MapReduce
Supports SQL, so no imperative coding needed
Supports fan-out queries
Supported by most SQL Server clients
Available only as appliance
Has finely tuned processor, storage, networking internals
6. NoSQL, Relational or Both?
Agenda
Type of App
Productivity
Skill Sets and investment
Recommendations
Type of App
Really a question of consistency versus massive scale
Is this an internal system or a public one?
Is it an application for the data or data for a system?
Below a certain threshold of concurrent usage, NoSQL may be slower than relational
Productivity
NoSQL db tooling still immature
Queries require significant work and testing
Programming platforms, frameworks, and components may support RDBMSes much more robustly
Especially enterprise platforms
If schema is subject to frequent change, then NoSQL may be more productive
Skill Sets and investment
Does your staff have RDBMS skills already?
Do you have significant investment in relational database hw/sw (hardware/software)?
Lots of apps that use an RDBMS?
Do you want to retool?
Do you want to support both?
Are you a startup?
Employ developers who possess NoSQL skills and prefer NoSQL?
Does availability/scalability make RDBMS investment questions moot?
Recommendations
Large, public, content-centric properties: NoSQL
Internal LOB (line of business) supporting business operations: relational
Investment in RDBMS licenses, infrastructure, skills:
Relational
Use both (application-dependent)
Use Hybrid approaches
Productivity
Do cost-benefit analysis
How much extra dev time/$$?
What is the cost of a less scalable system?
It will be tempting to use one for the other
And it very well may work,but that doesn't make it right