Cassandra: Backups Snapshot

Taking a backup in Cassandra means taking a snapshot.
You can take a snapshot of the SSTables for a given keyspace.
A snapshot is taken using the nodetool utility.

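This creates hard links to the keyspace's SSTables in a snapshots directory under each table's data directory.
Hard links consume no additional disk space when they are created; a snapshot only starts to hold space of its own once compaction removes the original SSTables.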

nodetool snapshot is a local command. For a cluster-wide backup, it must be run on every node in the cluster.
For a global snapshot, run it on all nodes at once using a parallel ssh utility such as pssh.
The snapshots should then be copied to a separate offline location.
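
Because the command is local to each node, a cluster-wide snapshot comes down to issuing the same command once per node. The following is a minimal Python sketch of that idea. The host names, the keyspace and the tag are hypothetical, it assumes JMX is reachable on each host at port 7199, and it only prints the commands rather than running them.

```python
import shlex


def snapshot_commands(hosts, keyspace, tag, jmx_port=7199):
    """Build one 'nodetool snapshot' command per node, all sharing one tag."""
    return [
        ["nodetool", "-h", host, "-p", str(jmx_port), "snapshot", "-t", tag, keyspace]
        for host in hosts
    ]


# Hypothetical host names; print the commands instead of executing them.
for command in snapshot_commands(["node1", "node2", "node3"], "scott", "nightly"):
    print(shlex.join(command))
```

Using the same tag on every node makes it easier to find and copy the matching snapshot directories afterwards.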

As in an RDBMS, Cassandra also supports incremental backups.
By default, this feature is disabled.

To enable it, change the setting in cassandra.yaml:
incremental_backups: true

For recovery of data, the relevant SSTable must be present.

As a best practice, delete old snapshots (nodetool clearsnapshot), since they otherwise keep accumulating on disk.

Fundamentally, backing up data in Cassandra involves taking a snapshot of the SSTable for
a given keyspace at a moment in time, as it must have all the tables in order to properly
recover if needed.

You can create a snapshot using nodetool (we specify hostname, JMX port and keyspace):
# nodetool -h localhost -p 7199 snapshot scott

The snapshot is created in data_directory/<keyspace>/table_name-UUID/snapshots/snapshot_name directory.
The snapshot directory will contain *.db files (these have data when snapshot is taken).
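
To copy snapshots off the node, you first need to find them. Below is a minimal Python sketch that walks the directory layout described above. The data directory path and the keyspace name in the usage part are assumptions; adjust them for your install.

```python
from pathlib import Path


def find_snapshot_dirs(data_dir, keyspace, snapshot_name=None):
    """Return snapshot directories for one keyspace, sorted by path.

    Assumes the layout described above:
    data_dir/<keyspace>/<table_name-UUID>/snapshots/<snapshot_name>
    """
    pattern = f"*/snapshots/{snapshot_name or '*'}"
    keyspace_dir = Path(data_dir) / keyspace
    return sorted(p for p in keyspace_dir.glob(pattern) if p.is_dir())


def snapshot_files(snapshot_dir):
    """List the *.db files captured in one snapshot directory."""
    return sorted(Path(snapshot_dir).glob("*.db"))


if __name__ == "__main__":
    # /var/lib/cassandra/data is an assumed location for the data directory.
    for directory in find_snapshot_dirs("/var/lib/cassandra/data", "scott"):
        print(directory, len(snapshot_files(directory)), "db files")
```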


Dockers : Toolbox Installation

Finally, we run the command docker run hello-world and press RETURN.

Docker pulls the image from the online registry (since it isn't already present locally)
and prints a message confirming that the installation works.

I encourage everyone to get behind this technology and enjoy the exhilarating ride ahead.
It's champagne time!

Note: The Toolbox isn't a native Windows Docker application. Instead, it uses VirtualBox, where it runs a lightweight Linux VM. The Toolbox is a good way to learn Docker.


Cassandra : Backup & Recovery

You should tailor your backup strategy to the needs of your business.
Well said.

Need for backups

Programmatic failure (User or statement or process or application)
Instance failure (Single Node )
Disk failure
DataCenter failure (Geographic disaster)
Database Cloning

Cassandra backup concepts

Backups work differently than in traditional databases.
Distributed Hierarchy (explained in future posts)
SSTables are immutable, so they are easily backed up
Snapshots are the ideal way to back up

We shall delve further into the backup methods in future posts.


Cassandra : Cassandra.yaml

Cassandra's configuration is written to a plain-text file called cassandra.yaml.

The cassandra.yaml file is one of only a few configuration files.
Changes are not dynamic: the Cassandra node must be restarted for them to take effect.

Location: /etc/cassandra/cassandra.yaml

Configuration settings:

Internal (node specific)
Performance Tuning

A few important parameters:


cluster_name Default: Test Cluster.

Name of the cluster. It ensures that only nodes belonging to the same logical cluster can join it, so all nodes must have the same value.

listen_address Default: localhost.

The hostname or IP address that other nodes use to connect to this node. It is advisable to change it to the node's IP address.

listen_interface Default: eth0.

Interface used by Cassandra for cluster connectivity. Set either listen_address or listen_interface, not both. Using listen_address is advised.

commitlog_directory Default: /var/lib/cassandra/commitlog.

Location of the commit log files. The commit log is append-only, so an HDD should be enough. For best performance, place this directory on a separate disk from data_file_directories.

saved_caches_directory Default: /var/lib/cassandra/saved_caches.

Location of the saved table key and row caches.

data_file_directories Default: /var/lib/cassandra/data.

Location of SSTables. Note that compaction strategy matters. Recommended to use striped SSD for optimal performance.

endpoint_snitch Default: org.apache.cassandra.locator.SimpleSnitch. Recommended: GossipingPropertyFileSnitch. Cloud snitches: Ec2MultiRegionSnitch / CloudstackSnitch / GoogleCloudSnitch.

concurrent_reads Default: 32. Recommended: 16 × number_of_drives.

concurrent_writes Default: 32. Recommended: 8 × number_of_cpu_cores.

incremental_backups Default: false.

Note: Read the blog on Backup and Recovery.

broadcast_address Default: listen_address .

The address advertised to nodes outside the local network or DataCenter, for example across multiple regions.

initial_token Default: disabled.

Used for a single-token-per-node architecture. This property overrides num_tokens.

num_tokens Default: 256.

The number of tokens assigned to the node. Set it to configure a cluster that uses virtual nodes (vnodes).

auto_snapshot Default: true.

Cassandra automatically takes a snapshot of the data before truncating a table or dropping a table or keyspace.

internode_compression Default: all.

Controls compression of data communication between nodes. Valid values: all, dc, none.
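
The sizing rules quoted above for concurrent_reads and concurrent_writes are simple arithmetic. As a small Python sketch (the function names are my own, and the drive and core counts in the example are made up):

```python
import os


def recommended_concurrency(data_drives, cpu_cores=None):
    """Apply the sizing rules quoted above.

    concurrent_reads  = 16 x number of data drives
    concurrent_writes =  8 x number of CPU cores
    """
    if cpu_cores is None:
        # Fall back to the core count of the machine running this script.
        cpu_cores = os.cpu_count() or 1
    if data_drives < 1 or cpu_cores < 1:
        raise ValueError("data_drives and cpu_cores must be at least 1")
    return {
        "concurrent_reads": 16 * data_drives,
        "concurrent_writes": 8 * cpu_cores,
    }


def as_yaml_lines(settings):
    """Render the settings as 'key: value' lines for cassandra.yaml."""
    return [f"{key}: {value}" for key, value in settings.items()]


print("\n".join(as_yaml_lines(recommended_concurrency(data_drives=2, cpu_cores=8))))
# concurrent_reads: 32
# concurrent_writes: 64
```

Treat the result as a starting point for tuning, not as a final value.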


Cassandra : SSTableLoader & Spark

Often there is a need to load bulk data into the Cassandra data store.
While CQL COPY is great for a few million records, it is advisable to consider better options for bulk data.

SSTableLoader and Spark are the advisable options.


Useful to bulk load external data into the C* cluster.
Flexible and fast option to load data.
Useful when you need to duplicate data or create a data copy.
Suggested option when the replication factor or partitioning strategy changes.
SSTableLoader streams SSTable data files into a live cluster.
It transfers the relevant data to each node, conforming to the replication strategy of the target cluster.
It can append data to tables already present in the cluster.

SSTable load options:

Existing cluster or new cluster
Cluster of similar or different characteristics. [nodes, RF or partitioner]


SSTableLoader uses the C* gossip protocol to understand the cluster topology.
A cassandra.yaml file must be present in the classpath and configured as per the target cluster.
At least one seed node must be online and configured.

cassandra.yaml file Properties:



It reads every SSTable specified and streams the data into the cluster.
It repartitions the data, using the snitch to send it to the correct node/location.


Spark is a fantastic open source framework to analyze data.
The flexible nature of Spark has allowed it to add options for analytics, machine learning as well as visualization.
Spark can also be extended to stream data into a C* cluster.
Files in CSV, JSON, XML and other formats are supported.

If the data is already present in the C* file system, it is ingested concurrently in 64 MB blocks.
This allows for great parallelism.

If the files are on the local file system of a single node, it is suggested to spread them across multiple nodes.
This gives better performance.
Use Spark to re-partition the dataset with 'partitionBy'.

'saveToCassandra' is used to finally save the data into the appropriate Keyspace.Table.


Cassandra : SSTabledump

SSTableDump is one of the SSTable utilities.
It is useful for getting SSTables into a human-readable format.
SSTables are stored in binary form.
SSTableDump extracts SSTable data in JSON format.


To read timestamps and tombstones
To inspect the internal structure of an SSTable
May be used to import data back into the cluster
To understand the data in an SSTable for debugging purposes


# tools/bin/sstabledump <options> sstable_file


It is suggested to flush the table to disk using nodetool flush before running sstabledump.
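
Since the dump is JSON, it is easy to post-process. The Python sketch below counts live rows and row tombstones per partition. The sample document is hypothetical and only shaped like the output I have seen from sstabledump; the field names used here (partition, rows, liveness_info, deletion_info) are an assumption and may differ with your schema and Cassandra version.

```python
import json

# Hypothetical sample shaped like sstabledump output; real output depends on
# the table schema and the Cassandra version.
SAMPLE = """
[
  {
    "partition": {"key": ["1001"], "position": 0},
    "rows": [
      {
        "type": "row",
        "liveness_info": {"tstamp": "2017-05-01T10:00:00.000Z"},
        "cells": [{"name": "ename", "value": "SMITH"}]
      },
      {
        "type": "row",
        "deletion_info": {"marked_deleted": "2017-05-02T10:00:00.000Z"},
        "cells": []
      }
    ]
  }
]
"""


def summarize(dump_text):
    """Count live rows and row tombstones per partition key."""
    summary = {}
    for partition in json.loads(dump_text):
        key = ":".join(partition["partition"]["key"])
        rows = partition.get("rows", [])
        # A row carrying deletion_info is treated as a tombstone.
        tombstones = sum(1 for row in rows if "deletion_info" in row)
        summary[key] = {"live": len(rows) - tombstones, "tombstones": tombstones}
    return summary


print(summarize(SAMPLE))
```

In practice you would read the text from a file produced by redirecting the sstabledump output, instead of the inline sample.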


Cassandra : SSTable tools

Cassandra has a few utilities to carry out tasks on SSTables.
For now, I am simply listing them.
I will be looking at them over the coming few days.


sstableloader (Cassandra bulk loader)
The primary method for bulk loading data into Cassandra. It is flexible when re-streaming data into Cassandra across clusters with different characteristics.

The sstablemetadata utility prints metadata about a specified SSTable.

The sstableutil utility will list the SSTable files for a provided table.

The sstableverify utility will verify the SSTable for a provided table.

The sstablesplit utility splits SSTable files into multiple SSTables of a maximum designated size.

The sstabledump utility dumps the contents of the specified SSTable in JSON format.

The sstableexpiredblockers utility will reveal blocking SSTables that prevent an SSTable from dropping.

The sstablekeys utility dumps table keys.

The sstablelevelreset utility will reset the level to 0 on a given set of SSTables.

The sstableofflinerelevel utility will relevel SSTables.

Utility reset to level 0 for list of SSTables.

The sstablescrub utility scrubs SSTables, removing the corrupted parts while preserving non-corrupted data.

The sstableupgrade utility upgrades the SSTables in the specified table or snapshot to match the currently installed version of Cassandra.


Cassandra : CQL Copy utility

Cassandra offers an effective utility to import data into and export data out of the data store.
The CQL COPY command is used to import/export data and works on DELIMITED data.


COPY table_name (COL1, COL2,…COLn) FROM ('file1','file2' | STDIN) WITH OPTION='values'


A practical alternative to a programmatic approach, using cqlsh to achieve the same result.
Uses DELIMITED files as input.
Ideal when the data set is a couple of million records.
Not a backup option; it's a data loading option.
Cassandra Bulk Loader is suitable for larger datasets.


Data Insertion

# COPY <keyspace>.<table_name> FROM '<input_file>'
WITH DELIMITER='<delimiter like | or , >'

Data Extraction

# COPY <keyspace>.<table_name> (COL1, COL2,…COLn)
TO '<output_file>'
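
To try the import side, you need a delimited input file. Here is a minimal Python sketch that writes pipe-delimited rows and prints a matching COPY statement to paste into cqlsh. The keyspace, table, column names and file name are hypothetical.

```python
import csv
import io


def write_delimited(rows, out, delimiter="|"):
    """Write rows as delimited text that cqlsh COPY ... FROM can read."""
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)


def copy_from_statement(keyspace, table, columns, path, delimiter="|"):
    """Build the matching COPY ... FROM statement to paste into cqlsh."""
    column_list = ", ".join(columns)
    return (
        f"COPY {keyspace}.{table} ({column_list}) "
        f"FROM '{path}' WITH DELIMITER='{delimiter}';"
    )


# An in-memory buffer stands in for the real input file here.
buffer = io.StringIO()
write_delimited([(7369, "SMITH", "CLERK"), (7499, "ALLEN", "SALESMAN")], buffer)
print(buffer.getvalue(), end="")
print(copy_from_statement("scott", "emp", ["empno", "ename", "job"], "emp.psv"))
```

Listing the columns explicitly in the statement keeps the file's column order independent of the order in which the table was defined.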



Cassandra-Stress Tool

To benchmark the Cassandra cluster, we use a Java program called ‘cassandra-stress’.

The cassandra-stress tool is the most assured way to test the Data Model. The program benchmarks by populating the cluster and running load tests.

Useful for:

  • Determining Schema performance.
  • DB scalability.
  • Optimize data model and settings
  • Determine PROD capacity


The tool uses a YAML file to build a profile.
Aspects considered are potential compaction strategy and cache settings.

An interesting option is the use of Graph plotting

YAML profile


  • DDL to define schema
  • Column distribution
  • Insert (DML) distribution
  • Data query distribution


./cassandra-stress mixed ratio\(write=1,read=10\) n=100000 cl=quorum -pop dist=UNIFORM\(1..100000\) no-warmup -schema keyspace="EMPLOYEE" -mode native cql3 -rate threads=50 -log file=~/stress_test.log -graph file=stress_test.html title=stress_test revision=stress_test1

In the command above, the following options are in play:

Consistency Level is Quorum
Keys uniformly distributed between 1 and 100,000
Do not warm up the process; do a cold start.
Mode is native, using CQL version 3
Thread count is fixed at 50. (Can be defined to vary)
Graph the results with title and revision as named


Cassandra: Multi-DataCenter part 2

Implementation of Multi-DataCenter:

Property            Parameter
Class               NetworkTopologyStrategy
Consistency Level   LOCAL_*
Snitch              GossipingPropertyFileSnitch

To allow for Multi DataCenter cluster, we cannot use the default SimpleStrategy.

Instead NetworkTopologyStrategy must be used. This is defined in the Keyspace.

Consistency Level is suggested to be LOCAL_QUORUM. This will allow for Quorum for Local DataCenter. Alternatively you can define EACH_QUORUM in case you want to have quorum of replica nodes in all DataCenters.
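
To make the two points above concrete, here is a small Python sketch that builds a CREATE KEYSPACE statement with NetworkTopologyStrategy and works out how many replicas a quorum needs in each DataCenter (quorum = RF / 2, rounded down, plus 1). The keyspace name, DataCenter names and replication factors are made up for illustration.

```python
def quorum(replication_factor):
    """Replicas that must respond for a quorum: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def create_keyspace_cql(keyspace, replicas_per_dc):
    """Build a CREATE KEYSPACE statement using NetworkTopologyStrategy."""
    options = ["'class': 'NetworkTopologyStrategy'"]
    options += [f"'{dc}': {rf}" for dc, rf in replicas_per_dc.items()]
    body = ", ".join(options)
    return "CREATE KEYSPACE " + keyspace + " WITH replication = {" + body + "};"


# Hypothetical DataCenter names and replication factors.
replicas = {"US_EAST_COAST": 3, "US_WEST_COAST": 2}
print(create_keyspace_cql("employee", replicas))
for dc, rf in replicas.items():
    print(dc, "RF", rf, "quorum", quorum(rf))
```

With LOCAL_QUORUM only the coordinator's own DataCenter has to reach its quorum; with EACH_QUORUM every DataCenter listed in the keyspace has to.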

GossipingPropertyFileSnitch is recommended for production environments. The snitch propagates rack and DC information across all the nodes via Gossip. It is configured in the snitch's properties file.

Important: If implementing on the cloud, use the respective snitches, e.g. Ec2MultiRegionSnitch, GoogleCloudSnitch, etc.


Cassandra: Multi-DataCenter part 1

Multi DataCenter approach:

Node -> Rack -> DataCenter -> Cluster.

Nodes grouped together form a Rack.

These Racks are placed in DataCenters, and DataCenters can be spread across the globe. In fact, they are probably placed across multiple Availability Zones.

However they will all be a part of a Cassandra cluster.

The properties are configured in a conf/ file.

Reasons for multi-DataCenter:

  1. Continuous availability for data and application.
  2. Disaster Recovery, which can also be referred to as Live Backup.
  3. Performance.
  4. Analytics site. (Run deep analytics across the 2nd DC)


Consistency Level can be defined as LOCAL or EACH.  (Topology specific)

Replication is automatic and transparent across the different DataCenters of a Cassandra cluster.

Failure DC/Nodes:

When one or more nodes fail, the Gossip protocol detects the failure and the remaining nodes stop routing requests to the failed node(s).

A DC failure goes unnoticed by the application, which accesses the data from the alternate DC. (Assuming Multi-DataCenter has been implemented.)

Recovery is carried out on the failed nodes via a rolling repair.


Remove a node from Cassandra cluster

Check the status of the cluster. The node to be removed should be down.

./nodetool status

Datacenter: US_EAST_COAST
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load     Tokens  Owns (effective)  Host ID    Rack
DN  <IP address>   <load>   256     100.0%            <hostID>   DC1_RK_1

….(other nodes in cluster will also be listed)

It should say DN, meaning Down/Normal.
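
On a large cluster it can help to pick the DN rows out of the status output with a script. A minimal Python sketch follows; the sample output, addresses and host IDs are made up, and the parsing assumes that Address is the second column and that Host ID and Rack are the last two, as in the listing above.

```python
def down_nodes(status_output):
    """Return (address, host_id) pairs for nodes reported as DN."""
    nodes = []
    for line in status_output.splitlines():
        fields = line.split()
        # Data rows start with a two-letter status/state code such as UN or DN.
        if len(fields) >= 7 and fields[0] == "DN":
            # Host ID and Rack are the last two columns; Address is the second.
            nodes.append((fields[1], fields[-2]))
    return nodes


# Made-up sample in the shape of the nodetool status listing.
SAMPLE = """\
Datacenter: US_EAST_COAST
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.0.0.1   1.2 GiB    256     100.0%            11111111-1111-1111-1111-111111111111  DC1_RK_1
DN  10.0.0.2   980.5 MiB  256     100.0%            22222222-2222-2222-2222-222222222222  DC1_RK_1
"""

for address, host_id in down_nodes(SAMPLE):
    print(f"node {address} is down, host ID {host_id}")
```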

Proceed to remove the node from the cluster, passing the Host ID shown in the status output:
./nodetool removenode <hostID>

Finally, you must run a repair on the cluster:

./nodetool repair

Now suppose that, for some reason, the node removal fails. In that case we force the removal, passing the node's IP address:

./nodetool assassinate <IP address>

Finally, we run the repair.

./nodetool repair
