I recently spent a couple of days suffering through configuring a business-ready Cassandra cluster on Azure VMs, so I figured I should share my results. As a warning, this process is long and tedious: you will need to switch between the UI, PowerShell and command prompts.

However, the payoff is worth it: on Medium instances ($115/mo) I am able to do over 15 MB/s of throughput. This keeps our costs low and allows us to scale easily, not to mention that Azure already keeps three local copies of my data drives.

So, let’s get started:

1. Download the Azure publishing settings here.

2. Import the publishing settings into your PowerShell session (note: if you do not have Azure PowerShell, download the Azure SDK and csupload):
Import-AzurePublishSettingsFile "C:\Users\clamanna\Downloads\Windows Azure BizSpark 1111-Windows Azure MSDN - Visual Studio Ultimate-12-26-2012-credentials.publishsettings"

3.  Identify which subscription you want to use for your Cassandra cluster. To see all your available subscriptions:
Get-AzureSubscription | select SubscriptionName

4. Set your Azure subscription for your current PS session. E.g.:
Set-AzureSubscription -SubscriptionName 'Windows Azure BizSpark 1111'

5. Choose the region for your VMs – if you’re North America based, I recommend choosing West US since it has the newest Azure hardware (run the Get-AzureLocation cmdlet to see all your options).

Note: All your cluster VMs need to be in the same region for now.

6. Create a storage account for your VM:
New-AzureStorageAccount -StorageAccountName 'cassandravm001dsk' -Location 'West US'

7. Disable Geo replication for the storage account – this is required if you want to stripe your data disks (trust me, you do :) ).
Set-AzureStorageAccount -StorageAccountName cassandravm001dsk -GeoReplicationEnabled $false  -Label 'cassandravm001dsk' -Description 'cassandravm001dsk'

Note: This step is due to a limitation around remote replication on Azure storage accounts – a failover can cause data corruption.

8. Set the storage account for your current PS session:
Set-AzureSubscription -SubscriptionName 'Windows Azure BizSpark 1111' -CurrentStorageAccount 'cassandravm001dsk'

9. Copy down the Subscription ID for the subscription you used in the previous steps. You will need this to upload your VM image:
Get-AzureSubscription | select SubscriptionName, SubscriptionId
(Note: Your subscription Id is considered “sensitive” – do not share it).

10. Open up the Visual Studio command prompt to create a self-signed certificate (required to upload your image). The prompt is located in the Visual Studio Tools folder inside Microsoft Visual Studio 2012.

11. Create your self-signed certificate for csupload:
makecert -sky exchange -r -n "CN=linuxvm" -pe -a sha1 -len 2048 -ss My "linuxvm.cer"

12. Go to the Azure management portal (https://manage.windowsazure.com) and upload this newly created certificate file (linuxvm.cer).

The certificate will be in “C:\Program Files (x86)\Microsoft Visual Studio…” (or whichever folder you ran the makecert command from).

13. After uploading the certificate, copy its Thumbprint from the UI to be used later when making remote connections.

14. Open up the Windows Azure command prompt

15. Set your connection string for csupload using your Subscription ID from step 9 and the certificate thumbprint from step 13:
csupload Set-Connection "SubscriptionID=XXXXXXXX;CertificateThumbprint=YYYYYY;ServiceManagementEndpoint=https://management.core.windows.net"

16. Upload your disk image for Ubuntu. First download the Ubuntu cloud image for Azure here (to your local machine), then upload it with csupload:
csupload Add-PersistentVMImage -Destination "http://cassandravm001dsk.blob.core.windows.net/vhds/cassandravm001img.vhd" -Label cassandravm001img -LiteralPath "C:\Users\clamanna\Downloads\ubuntu-12.04-server-cloudimg-amd64-disk1.vhd (1)\cassandravm001img.vhd" -OS Linux

Note: We can *not* use the images from the Azure gallery because they will all use a single storage account.

17. Create your Virtual Machine using the management portal. Unfortunately, you cannot add a VM to an existing cloud service from the PowerShell prompt. This issue seems to be ongoing and is very frustrating (support thread here).

Navigate to the management portal and click New -> Virtual Machine.

Select “From Gallery” and choose your recently uploaded image.

Complete the wizard for creating your VM. For the first node, select “Standalone Virtual Machine.” When creating your second (and third, etc.) nodes, make sure to select “Connect to an existing virtual machine” and to add them to your availability set!

18. Return to your Azure PowerShell window and create the disk configuration (make sure the VM name, service name and disk labels match your cloud service and VM from the earlier steps!):
PS C:\> Get-AzureVM -Name cassandravm001 -ServiceName cassandradb |
Add-AzureDataDisk -CreateNew -DiskSizeInGB 500 -DiskLabel 'cassandravm001dsk001' -LUN 0 |
Add-AzureDataDisk -CreateNew -DiskSizeInGB 500 -DiskLabel 'cassandravm001dsk002' -LUN 1 |
Add-AzureDataDisk -CreateNew -DiskSizeInGB 500 -DiskLabel 'cassandravm001dsk003' -LUN 2 |
Add-AzureDataDisk -CreateNew -DiskSizeInGB 500 -DiskLabel 'cassandravm001dsk004' -LUN 3 |
Update-AzureVM

Note: Since we created a Medium instance, we can only add 4 data disks, which is a good number to stripe across. If you made a Large or ExtraLarge instance, you can add 8 or 16 disks, respectively. However, in my opinion, a set of Medium machines gives you the best return on investment.
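
To make the disk math concrete, here is a small shell sketch that maps an instance size to the data-disk limit from the note above and works out the raw size of the resulting striped volume. The function names are made up for illustration, and the 500 GB default simply mirrors the disk size used in step 18.

```shell
# Data-disk limits per instance size, as listed in the note above.
max_data_disks() {
    case "$1" in
        Medium)     echo 4 ;;
        Large)      echo 8 ;;
        ExtraLarge) echo 16 ;;
        *)          echo "unknown instance size: $1" >&2; return 1 ;;
    esac
}

# Raw capacity in GB of a volume striped across every allowed disk.
# $1 = instance size, $2 = size of each data disk in GB (default 500).
striped_capacity_gb() {
    disks="$(max_data_disks "$1")" || return 1
    echo $(( disks * ${2:-500} ))
}
```

For a Medium instance, striped_capacity_gb Medium prints 2000, i.e. four 500 GB disks.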

19. Next, let’s configure the actual node. The machine should be done from the Azure perspective, so SSH into your newly created VM (on Windows, I am partial to PuTTY).

The VMs are created with an SSH endpoint on a special port, which is listed with the VM's endpoints in the management portal.

a. Using your SSH session, install the prerequisites for running Cassandra and managing your disks (e.g. OpenJDK and LVM):
sudo apt-get update
sudo apt-get install lvm2
sudo apt-get install openjdk-7-jre
sudo apt-get install openjdk-6-jre
sudo apt-get install openjdk-7-jdk

b. Reboot your machine (so you can see your data disks) and configure your Cassandra drive.

First, find the device names for your attached Azure data disks (e.g. /dev/sdc through /dev/sdf):

grep SCSI /var/log/dmesg

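Now, partition all these drives with fdisk (repeat for each one of your data disks). At the fdisk prompts, enter n, p, 1, press Enter twice to accept the default first and last sectors, then p to print the new table and w to write it:

sudo fdisk /dev/sdc -> n, p, 1, Enter, Enter, p, w
sudo fdisk /dev/sdd -> n, p, 1, Enter, Enter, p, w
sudo fdisk /dev/sde -> n, p, 1, Enter, Enter, p, w
sudo fdisk /dev/sdf -> n, p, 1, Enter, Enter, p, w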
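
If you would rather not type the same answers once per disk, the keystrokes can be piped into fdisk. Treat this as a sketch under assumptions: it expects fdisk to ask for partition type, partition number, first sector and last sector in that order (which is what the fdisk on Ubuntu 12.04 does), and the function names are invented for illustration. Nothing runs until you call partition_disk yourself, and it will overwrite the partition table of whatever device you pass it.

```shell
# Emit the answers fdisk expects, one per line:
# n (new partition), p (primary), 1 (partition number),
# two empty lines (accept the default first and last sector), w (write).
fdisk_keys() {
    printf 'n\np\n1\n\n\nw\n'
}

# Feed those answers to fdisk for one whole-disk device, e.g. /dev/sdc.
# Destructive: this replaces the partition table on "$1".
partition_disk() {
    fdisk_keys | sudo fdisk "$1"
}
```

A loop such as `for d in /dev/sdc /dev/sdd /dev/sde /dev/sdf; do partition_disk "$d"; done` would then cover all four data disks.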

Next, pvcreate all these disks

sudo pvcreate /dev/sdc1
sudo pvcreate /dev/sdd1
sudo pvcreate /dev/sde1
sudo pvcreate /dev/sdf1

Then merge all the drives together (to stripe them… remember: your drives can *NOT* be geo replication enabled if you want to do this safely!)
sudo vgcreate datadrive /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1
sudo lvcreate -l 100%FREE datadrive -n cassdata -I 64 -i 4
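
The pvcreate, vgcreate and lvcreate steps above can also be wrapped in one function so the stripe count always matches the number of disks. This is a minimal sketch, not a hardened script: the function names and the DRY_RUN switch are my own inventions, and by default it only prints the commands so you can review them before running anything for real.

```shell
# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        sudo "$@"
    fi
}

# Usage: stripe_disks /dev/sdc /dev/sdd /dev/sde /dev/sdf
# Expects each disk to already have a first partition (e.g. /dev/sdc1).
stripe_disks() {
    parts=""
    count=0
    for disk in "$@"; do
        run pvcreate "${disk}1"
        parts="$parts ${disk}1"
        count=$((count + 1))
    done
    # $parts is left unquoted on purpose so it splits into one argument per partition.
    run vgcreate datadrive $parts
    # -i is the number of stripes (one per disk), -I the stripe size in KB.
    run lvcreate -l 100%FREE datadrive -n cassdata -I 64 -i "$count"
}
```

Setting DRY_RUN=0 before calling stripe_disks runs the same commands through sudo.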

Finally, mount and prepare your striped data drives:

sudo mkfs.ext3 -m 0 /dev/datadrive/cassdata
sudo mkdir /mnt/datadrive
sudo mount /dev/datadrive/cassdata /mnt/datadrive
sudo mkdir -p /mnt/datadrive/log/cassandra
sudo mkdir -p /mnt/datadrive/lib/cassandra

c. Download and unpack your Cassandra distro (you may need to bump the version number from 1.1.8 to a later one, depending on when you read this):
wget http://apache.osuosl.org/cassandra/1.1.8/apache-cassandra-1.1.8-bin.tar.gz
sudo tar -zxvf apache-cassandra-1.1.8-bin.tar.gz

d. Update your Cassandra configuration for network connections / node config:

In conf/cassandra-env.sh:

Replace 180 with 256 on line 188 (it should then read something like: JVM_OPTS="$JVM_OPTS -Xss256k"):

if [ "`uname`" = "Linux" ] ; then
    # reduce the per-thread stack size to minimize the impact of Thrift
    # thread-per-client.  (Best practice is for client connections to
    # be pooled anyway.) Only do so on Linux where it is known to be
    # supported.
    # u34 and greater need 180k
    JVM_OPTS="$JVM_OPTS -Xss256k"
fi
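
If you are setting up several nodes, the same stack-size change can be made without opening an editor. A rough sketch, assuming the flag appears in the file as -Xss followed by a number and a k; the function name is hypothetical, and a .bak copy of the original file is kept next to it.

```shell
# Rewrite any -Xss<N>k stack-size flag in the given file to -Xss256k.
# $1 = path to cassandra-env.sh; the original is saved as "$1.bak".
bump_stack_size() {
    cp "$1" "$1.bak" &&
    sed 's/-Xss[0-9][0-9]*k/-Xss256k/' "$1.bak" > "$1"
}
```

Running bump_stack_size conf/cassandra-env.sh from the unpacked Cassandra directory applies the edit described above.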

In conf/cassandra.yaml:

Enable remote connections / IPs. DataStax has a good write-up here; after setting this up on your first machine, I recommend copying it to your second+ nodes (i.e. via scp).

listen_address (line 271)

rpc_address (line 283)
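
Since both settings are plain top-level "key: value" lines in cassandra.yaml, they can be changed with a small helper instead of by hand. This is only a sketch: set_yaml_key is a made-up name, and the values in the commented example (10.0.0.4 for this node's internal IP, 0.0.0.0 to accept client connections on all interfaces) are placeholders I am assuming for illustration, not values from my cluster.

```shell
# Replace the value of a top-level "key: value" line in a YAML file.
# $1 = file, $2 = key, $3 = new value.
set_yaml_key() {
    sed "s|^$2:.*|$2: $3|" "$1" > "$1.tmp" &&
    mv "$1.tmp" "$1"
}

# Example (placeholder values, adjust for each node):
# set_yaml_key conf/cassandra.yaml listen_address 10.0.0.4
# set_yaml_key conf/cassandra.yaml rpc_address 0.0.0.0
```

The helper only touches lines that start with the key at column 0, so commented-out or nested entries are left alone.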

e. Make your disk configuration and Cassandra resilient across reboots (Azure will reboot you a couple of times a year).

First, get your striped data disk's UUID from blkid and then add it to your fstab file:

sudo blkid
sudo vim /etc/fstab

An example from mine (the UUID is from my blkid output, so be sure to change it!):

UUID=63ab0827-4698-427a-818a-279b18886757 /mnt/datadrive ext3 defaults 0 0
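
To avoid copy-and-paste mistakes with the UUID, the fstab entry can be generated from blkid's output. A small sketch with a hypothetical helper name; the commented usage lines assume the logical volume path /dev/datadrive/cassdata from the earlier step and are left commented out because they append to /etc/fstab.

```shell
# Print an fstab entry for an ext3 filesystem identified by UUID.
# $1 = filesystem UUID, $2 = mount point (defaults to /mnt/datadrive).
fstab_line() {
    printf 'UUID=%s %s ext3 defaults 0 0\n' "$1" "${2:-/mnt/datadrive}"
}

# Usage (modifies /etc/fstab, so review the line first):
# uuid="$(sudo blkid -s UUID -o value /dev/datadrive/cassdata)"
# fstab_line "$uuid" | sudo tee -a /etc/fstab
```

Running sudo mount -a afterwards is a quick way to confirm the new entry mounts cleanly before you rely on it at the next reboot.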

Next, add the Cassandra launch script to your startup scripts by editing /etc/rc.local and adding the two lines below (above the final "exit 0" line):

sudo vim /etc/rc.local
# Start the script on boot!
/home/..../apache-cassandra-1.1.8/bin/cassandra &

I hope that this walkthrough has been helpful. I have helped a few people get their Cassandra clusters running on Azure using these steps, so hopefully this is self-serve at this point. If not, please reach out to me on Twitter (@clamanna) with any questions or roadblocks.