Reproducibility how-to: VMware to AWS and back

Reproducible science is ever more reliant on sharing software, data and how you linked them together – your workflow. Sharing software is often limited by the difficulty of setting up the system that your software requires: operating system, programming languages, libraries and other software tools may all need to be set to specific versions in order to be compatible, and those versions can quickly become hard to find or awkward for others to install.

One option is to take a snapshot of your system, with all settings, software and data, in a working configuration and then share that as a research object. This can be done by building your system as a Virtual Machine (VM), and there are a variety of packages out there for running such VMs. The VM can be shared just like any other data file – e.g. GigaScience has citable VMs in GigaDB to support publications.

VMs are becoming popular along with the rise of ‘the Cloud’, and as cloud service providers reduce costs, running your app or analysis on a VM in the cloud is becoming a viable alternative to running it locally. Arguably the most popular VM-in-the-cloud service is provided by Amazon’s AWS division.

Providing the flexible solution of using either AWS or a local VM requires planning and is technically a bit of a faff, so here I’ll provide a how-to for getting a VM into and out of AWS using VMware running on a MacBook Pro.

Amazon’s tax-dodging, low pay, poor working conditions and nefarious business practices undoubtedly cause real suffering in the world, so it’s hard to promote them – but lack of reproducibility undoubtedly wastes valuable money and time in medical and environmental research, so it also hurts lives. Therefore, I’ll promote AWS here but encourage you to check out Amazon’s tax-dodging and to join the Amazon Anonymous campaign. There are also other VM-cloud service providers such as Google but, as Google is proud of how it doesn’t pay tax, it’s hard to recommend them either.

Requirements

If you want to download a VM from AWS, you first have to have uploaded one. Therefore you need to plan ahead. The most flexible approach is to create the VM locally, upload it into AWS and then make it public. Users can then choose whether to copy your image and run it in the AWS cloud or to download it for their local use.

AWS also does not work with the free and awesome VirtualBox system. It does work with a few other systems, e.g. Citrix Xen and Microsoft Hyper-V, but here I’ll only deal with VMware. As I’m working on a Mac, I’ll be specifically dealing with VMware Fusion. At the time of writing, this product costs US$69.99 for a single user – not too expensive.

My system

Host operating system:

Mac OS X 10.9.5

VMware system:

VMware Fusion (5.0.5)
VMware OVF Tool (4.0.0) – see steps 1 and 2

AWS tools:

AWS Command Line Interface (CLI) – see step 4

Guest operating system (the VM):

Ubuntu 13.10 64-bit edition. I did not choose the ‘server’ version that comes without a graphical desktop environment. Doing so would have made the VM more compact but, for maximum accessibility, I wanted to keep the option of a GUI for any end user who is not so happy on the Linux commandline.

NB: for AWS, the imported VM must be 64-bit and from a limited set of flavours, e.g. Ubuntu 13.10, not 14.x

Before we start

I’m assuming that you’ve got a virtual machine running in VMware and that it’s one of the suitable ‘flavours’ of Linux for uploading into AWS (see link above). You can have any software installed on it, hopefully you’ll have some data and perhaps a Galaxy instance with your methods all mapped out etc.

Step 0. Load SSH server onto VM and set public key

To get your VM to communicate with the outside world, it’ll need an SSH server running.

For Ubuntu, type the following and then accept the installation by pressing ‘Y’ when asked:

$> sudo apt-get install openssh-server

Then update the config file to allow public key access (and to listen for internet requests). The file can be opened for editing using the following command (you may want a cheat sheet for help with the commandline editor ‘vi’):

$> sudo vi /etc/ssh/sshd_config

Look through the file and uncomment (remove the prefix hashtag) these lines, or add them if they aren’t there:

ListenAddress 0.0.0.0 

PubkeyAuthentication yes
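
After saving the file, restart the SSH server so that the new settings take effect – on Ubuntu 13.10 that should be:

$> sudo service ssh restart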

Then create your own RSA keypair using the following command, pressing ‘enter’ to accept the defaults:

$> ssh-keygen -t rsa

In your EC2 dashboard on the AWS website, go to ‘Key Pairs’ and select ‘Import Key Pair’. Upload the public key that you just created (this will be located at ~/.ssh/id_rsa.pub and may be a ‘hidden’ file).

Also make sure to transfer a copy of the private key (~/.ssh/id_rsa – note that it’s the private half of the pair that ssh needs on the machine you’re connecting from) to the computer you’ll be using to log into your VM. I did this by emailing it to myself using Gmail via Firefox – not best practice for a key that matters, but fine for a test setup. The matching public key also needs to be listed in ~/.ssh/authorized_keys on the VM itself for key-based logins to be accepted.
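
Since the keypair was generated on the VM itself, you can set that up there by appending the public key to the authorized_keys file (assuming the default key location):

$> cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$> chmod 600 ~/.ssh/authorized_keys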

NB: it is probably worth testing that you can connect. In the VMware ‘settings’ for your VM, ensure that the network setting is ‘bridged networking’ (I used the ‘Autodetect’ sub-option, but you could specify whether to use ethernet or wifi connections for this). If you changed this while the VM was on, it may be best to restart it (that was necessary for me). Then find out your VM’s IP address by running the command:

$> ifconfig

This should produce a lot of output, but the key detail to look for is ‘inet addr’, which should be something like 172.20.11.2 (basically four numbers separated by periods). There will be multiple entries in the output, e.g. eth0, eth1, lo. The particular entry you are looking for will depend on which connection the VM is broadcasting on. In most cases the inet addr you want will be under eth0, but it might differ depending on Linux version or type of connection. If in doubt, take a note of all the inet addr numbers and try them one at a time in the following instruction.
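
To cut that output down, you can filter for just the address lines (‘inet addr’ is the label that Ubuntu’s version of ifconfig prints):

$> ifconfig | grep 'inet addr'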

From your host system (your Mac), try connecting using your key. In the following example, I’m assuming you’re in the same directory as the private key you emailed yourself (for ease), and that the keyfile is called ‘my_private_key’. You should replace ‘username’ with the username you selected for your Linux account on the VM and you should replace 172.xx.xx.xx with the inet addr from the previous step. Oh, and I’m assuming you can ssh from the Mac OS commandline – but I think that comes as standard.

$> ssh -i my_private_key username@172.xx.xx.xx

The computer ought to ask if you are happy to store the host’s key fingerprint even though it doesn’t know the machine you’re connecting to – just say yes! If key authentication isn’t picked up, you’ll be asked for the password for your Linux user instead – obviously, enter that.

If successful, you are presented with the commandline and can browse around to your heart’s content. If so, go on to the next stage. If not, you’ll need to troubleshoot. I’d recommend restarting the VM and then finding the inet addr again using ifconfig, double-checking that SSH is on (best to google for openssh tutorials) and ensuring that public key access is allowed in the SSH config file. Good luck!
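
One quick check from inside the VM is whether the SSH daemon is actually listening on port 22 – the -plnt flags ask netstat for listening TCP sockets and the owning process:

$> sudo netstat -plnt | grep :22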

Step 1. Download OVF Tool

Amazon states that it will only accept VMs from VMware if they have been saved/converted with the OVF tool that VMware provides as standard in some of its systems. VMware Fusion doesn’t have that option but you can download the tools for free if you have a VMware account.

Select ‘product download’ link at https://www.vmware.com/support/developer/ovf/

Or go to product download link at https://my.vmware.com/web/vmware/details?downloadGroup=OVFTOOL400&productId=353

You will need to sign in to VMware, but the tool is free.
Select the Mac OS download and, when the download completes, double-click the .pkg file to install.

Step 2. Change permissions in OVF Tool folders

The OVF Tool needs to be run without sudo, as the user that created the VM. But to do that you need to relax permissions for the files that you just downloaded. You may wish to change these back again later (replace 777 with e.g. 755). From the Mac OS commandline:

$> chmod -R 777 /path/to/OVFTool_directory/*

Step 3. Run the OVF Tool on your VM

Create a place for your OVF output to go. I created a new folder, “Virtual Machines OVF”, to sit beside the standard location of all my saved VMs:

$> mkdir /Users/your_username/Documents/Virtual\ Machines\ OVF

If you navigate to the OVFTool folder, you can run the OVF tool using “./ovftool” – it’s important to remember the ‘dot slash’ prior to the command.
The basic syntax is:

$> ./ovftool [path to VM file] [path to OVF output file]

There are many options for input and output formats etc. – please see the OVFTool user guide for a complete list.
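
You can also get a summary of the available options straight from the tool itself:

$> ./ovftool --help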

NB: The OVF Tool will throw an error if you try to overwrite an existing file, and it often creates the output files even when it fails. So if you’re playing about with options and it’s producing errors, you’ll probably have to change the output file name each time or keep deleting the output files.

Here are the generalised instructions for navigating to the default location of the OVF tool and performing the necessary conversion:

$> cd /Applications/VMware\ OVF\ Tool
$> ./ovftool /path/to/your_virtual_machine.vmx /path/to/your_virtual_machine.ovf

Step 4. Get AWS command line tools

Download the tools using curl – this comes as standard on a Mac. Select a suitable output location for the file to download to; I chose my ‘home’ directory (denoted by ~) for simplicity. From your Mac commandline:

$> curl "http://s3.amazonaws.com/ec2-downloads/ec2-api-tools.zip" -o ~/ec2-api-tools.zip

Unzip the file that has arrived (the following command will unzip the files to a folder named something like ‘ec2-api-tools-1.7.2.3’ in the same directory as the original zip file):

$> unzip ~/ec2-api-tools.zip

Check your Java version (the command line tools require Java 1.7 or above). First check that you have Java by typing this on your Mac commandline:

$> which java

That should return something like /usr/bin/java. If it doesn’t, you need to install Java. You can download either a JRE or JDK version of Java from http://www.java.com/en/download/index.jsp; it can be installed like any .dmg file for Mac.
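
Once Java is in place, you can confirm that the version meets the 1.7 minimum:

$> java -version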

With Java installed, you need to find the ‘java_home’ location. This can be found on Mac using the following command (and should return something like /Library/Java/JavaVirtualMachines/jdk1.8.0_25.jdk/Contents/Home):

$> /usr/libexec/java_home

We now need to set this in your profile settings, but handily we can use the /usr/libexec/java_home command rather than hard-coding the actual path. To make a permanent change to your profile settings, you should edit your bash_profile file, which can be found in your home folder as a hidden file called ‘.bash_profile’ (note the period at the start!). If you’re happiest using a graphical text editor, it may help to enable viewing of hidden files. If you’re happy with the commandline text editor I recommended earlier, you can open the file for editing using:

$> vi ~/.bash_profile

Vi will create the file if it doesn’t already exist. Don’t worry if it’s blank, just add the following line to tell your Mac programmes where to find Java:

export JAVA_HOME=$(/usr/libexec/java_home)

Now tell the EC2 tools where they live. This is similar to the previous step: update your ~/.bash_profile file with the following lines (making sure they fit the ec2 version number/directory name on your system):

export EC2_HOME=~/ec2-api-tools-1.7.2.3
export PATH=$PATH:$EC2_HOME/bin
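
After saving, reload the profile so that the new variables take effect in your current shell, and check that the tools can now be found:

$> source ~/.bash_profile
$> which ec2-describe-regions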

An optional addition at this stage is to add your AWS access keys to the bash_profile file too. You need to get these from your AWS account – but it won’t remind you of your secret key, so if you’ve forgotten that, you’ll need to set up a new one. Placing your keys in this file gives them an alias that you’ll probably remember more easily than the keys themselves. In my example, my AWS access key gets the alias AWS_ACCESS_KEY. You do this by adding the following two lines to your .bash_profile:

export AWS_ACCESS_KEY=your_access_key
export AWS_SECRET_KEY=your_secret_key

NB: these aren’t related to accessing your VM, they are purely for logging into AWS from the commandline. The keypair you created in Step 0 is used for accessing the VM, once it’s been successfully imported.

There are other .bash_profile aliases (technically they’re called ‘environment variables’) that can be set, and you can get guidance on ways to test your setup and troubleshoot errors direct from Amazon (I’ve lifted most of these instructions from Amazon’s CLI reference page).

OF PARTICULAR NOTE is the benefit of setting another bash_profile environment variable to specify the location of your Amazon server. You can find out the available locations using the ec2-describe-regions command and then set a bash_profile variable, e.g.

$> ec2-describe-regions

Then in your ~/.bash_profile add the appropriate location, copying the URL as reported by ec2-describe-regions, e.g.

export EC2_URL=https://ec2.eu-west-1.amazonaws.com

Step 5. Upload to AWS

You will need to decide what size of AWS instance you want to use. This ought to match the details of your local VM; e.g. in my case I have a dual-core VM with 2 GB of RAM, so I will select a “t2.medium” instance that offers 2 cores and 4 GB of RAM as the nearest to my existing setup. You can change this from within AWS later too.

We will now use the AWS CLI tools to send your VM to Amazon’s S3 storage where it will be used to launch an instance. Check out Amazon’s site for a full AWS CLI API reference.

The ‘group’ parameter: you will need to have created a security group in your AWS account. If your VM is a webserver (e.g. if it runs Galaxy), it will be useful to have HTTP open. The imperative is to keep the SSH port (22) open so that you can log in to the server to make system changes.

The ‘bucket’ parameter: you will need to have created an S3 storage bucket within your AWS account. This is pretty easy to do from within the AWS dashboard.

Access and Secret keys: you can use the environment variables that you set in your .bash_profile earlier. If doing that, remember to put a dollar sign in front to indicate they are environment variables, otherwise the tools will think your variable names are the keys themselves.

For simplicity, navigate (via the Mac commandline) to the location of your converted VM and make the following call (updated with your specifics to replace ‘your_security_group’ and ‘your_s3_bucket’) to import the .vmdk file that was produced as part of the OVF transformation.

$> cd /path/to/your/ovf_converted_vm_files
$> ec2-import-instance ./your_virtual_machine.vmdk -t t2.medium -g your_security_group -f VMDK -a x86_64 -p Linux -b your_s3_bucket -o $AWS_ACCESS_KEY -w $AWS_SECRET_KEY

NB: you can add the parameter --dry-run at the end if you just want to check that it’s all working without the effort of uploading to your AWS account.

If this all goes through as planned, AWS will tell you that the upload was successful and that it is now converting the files to an EC2 instance. You can check on the status of this conversion (it can take a while) using the command:

$> ec2-describe-conversion-tasks

NB: despite all these specifications, such as security group, your uploaded VM will probably not have a public IP, will not have a public key set, and will have no security group other than ‘default’. In the next stage we’ll cover changing these – which generally involves making a whole new instance from the one Amazon made for you.

Step 6. Adjust settings within AWS

If you log into AWS you will be able to see a folder within your S3 bucket and you will have a new instance in your EC2 dashboard. It ought to be ‘stopped’ initially.

In my case, the instance did not have a name (quickly edited so I could keep track of which one was which!), it was set to the default security group (annoying, because I’d specified one earlier) and it didn’t have a public IP or DNS, meaning it wasn’t accessible over the internet and was completely useless!

The IP settings can supposedly be set automatically, but my ‘subnet’ had the appropriate settings as far as I could see and yet never gave me a public IP address. Sad times.

A public IP is absolutely necessary in most cases (an alternative would be to log in from another of your VMs that’s local in AWS… uh, yeah), and given that the automatic assignment process doesn’t seem to work, we’ll do it manually by creating a new instance copied from your imported one.

Select the uploaded instance and choose ‘create image’. Follow the wizard to give it a unique name and description; it should be enough to retain all else as default. Once created, you ought to find it in the ‘AMIs’ section of your EC2 dashboard. In that area you can select it and choose ‘launch’ to create a new instance from your uploaded VM.

During this instance-launching process, make sure to select ‘assign public IP’. You should also select which security group it should belong to – remember, this should be one that at least has SSH open (you can make it open to the whole web, or open to known IP addresses only, which is more secure).

You also need to select a security keypair – that’s the public key you imported into AWS back in Step 0.

Step 7. Start Instance and attempt login!

You need the public DNS or IP address of your instance. An example of the DNS is: ec2-54-72-21-175.eu-west-1.compute.amazonaws.com

Make sure you have assigned security groups that allow someone from your location to access the instance via SSH!

Via your Mac commandline, navigate to the folder containing your private key (a .pem file if created on AWS, or the id_rsa file if created on your Linux VM – either way, ssh needs the private key), or just locate it and put the full path in the command line:

$> ssh -i /path/to/my_private_key username@ec2-54-72-21-175.eu-west-1.compute.amazonaws.com
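
If ssh refuses the key with a complaint that its permissions are too open, tighten them first:

$> chmod 400 /path/to/my_private_key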

NB: replace ‘username’ with the Linux account name from your original VM.

Your computer will probably ask if you’re happy to connect to this unknown location – just say yes! You should then be prompted for the password for your username; enter the one you created when setting up the VM locally.

Voila! You should be in! If the connection is refused, that’s probably down to the SSH server not being set up correctly on your VM (could you log in at the end of Step 0?) or possibly to your security group on AWS not opening SSH to your IP address.

Step 8. Exporting your VM from Amazon

The whole aim of this has been to enable import and export to and from AWS. Doing both provides a flexible solution with minimal reliance on a single provider.

It is only possible to export a VM from AWS if it was initially imported – so you should be starting with steps 0–7. But having decided to change provider, or having found someone else’s VM that you wish to run locally, here are the steps to download it. I’m taking these instructions from Amazon’s AWS reference site.

First, you must transfer the AWS instance to an S3 bucket. NB: I did say ‘instance’. There may be an AMI ‘image’ out there that you want to download; as far as I can tell, you must first launch an instance from that image and then transfer the instance to S3. Exporting your instance can be done using the EC2 commandline tools.

To run ec2-create-instance-export-task, we will require the instance ID and some other details: ‘target_environment’ means VMware, Citrix or Microsoft; ‘disk_image_format’ means VMDK for VMware; ‘container_format’ is optionally OVA when exporting to VMware; and ‘s3_bucket’ is the name of your S3 bucket!

$> ec2-create-instance-export-task instance_id -e target_environment -f disk_image_format -c container_format -b s3_bucket

A more specific example (using my own details), where I’ve opted for the OVA container since we’d already had to get the OVF tools for the initial upload:

$> ec2-create-instance-export-task i-810ee865 -e VMware -f VMDK -c OVA -b robs-s3-bucket

On running this, we should be told that an export task has been created, along with some details about that task. One of those is the name of the output file in your S3 bucket – in my case it was ‘export-i-fh3nvvn5.ova’. The export takes quite a bit of time, so it’s worth checking on progress periodically with the ec2-describe-export-tasks command:
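
$> ec2-describe-export-tasks

You’ll know the export has finished because the response goes from e.g.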

EXPORTTASK export-i-fh3nvvn5 active i-810ee865 vmware vmdk ova robs-s3-bucket export-i-fh3nvvn5.ova

to e.g.

EXPORTTASK export-i-fh3nvvn5 completed i-810ee865 vmware vmdk ova robs-s3-bucket export-i-fh3nvvn5.ova

Once it has completed, you can navigate to the bucket in S3 using your browser and the AWS dashboard, where you ought to see a file called something like export-i-fh3nvvn5.ova

Downloading is possible by double-clicking on that file or by highlighting and then choosing ‘download’ from the ‘Actions’ drop-down menu.

Step 9. Converting your downloaded OVA file for import to VMware

The downloaded file is in OVA format and requires unpacking before import. Using our OVF Tool, we can translate the OVA container into a .vmx file (plus its disk files) using the following commandline:

$> ./ovftool /path/to/my-exported-ovf-file.ova /path/to/the-newly-transformed.vmx

For the purposes of example, I’ll say that this has created an output file called ‘exported.vmx’, a disk file ‘exported-disk1.vmdk’ and a few others. In VMware, you can try simply to ‘open’ the exported.vmx file…

Upon doing this, there is a warning that the downloaded VM could do with being ‘upgraded’, but if we do, it won’t be compatible with other systems until it’s downgraded again. Despite compatibility being our main aim, not upgrading caused me major problems with the display, so I recommend choosing ‘upgrade’.

Also, when it asks whether you moved or copied this file, I tend to choose the default ‘copied’, which seems to work.

But in any case it reports an error:

“Cannot open the disk ‘Virtual Disk.vmdk’”

It’s no surprise that it can’t open that file, because the virtual disk isn’t called ‘Virtual Disk.vmdk’ – it’s called ‘exported-disk1.vmdk’. I tried changing the name of exported-disk1.vmdk to ‘Virtual Disk.vmdk’, but on using ‘open’ on the .vmx file I was then told that exported-disk1.vmdk is missing… yeesh!

So. My workaround. That works.

Duplicate exported-disk1.vmdk as another file called Virtual Disk.vmdk, then use the ‘open’ command in VMware to open the exported.vmx file. This works! But it’s horribly inefficient, because we’ve just doubled our VM.
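
From the Mac commandline, that duplication step is simply (using the example file names above):

$> cp exported-disk1.vmdk Virtual\ Disk.vmdk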

Looking at the VM’s settings, it appears that there are an IDE and a SCSI type of hard drive – there is a warning when opening the IDE type (which is noted as being the actual exported disk; the SCSI one is the copy):

“The virtual disk file specifies a SCSI bus type, but is located on an IDE controller. This may cause inaccessible or corrupted data. If you encounter problems, remove the disk from the virtual machine and then add it back.”

That only makes vague sense to me and I haven’t solved this problem. Oddly, when the machine is shut down, only one hard drive (IDE, the actual downloaded file) is shown; but when it’s turned on, it reports having two hard drives (IDE plus the duplicated SCSI file). Frustratingly, it is only possible to remove a hard drive when the VM is switched off. No joy.

The question is – does this have something to do with Amazon’s export or with VMware’s OVF Tool transformation?

Conclusion

It’s disappointing that there is a lingering problem with the export from AWS to VMware. However, I’m counting the process as a success, because at least the system, as stored in the cloud, is accessible locally. But this would not be a good solution if large data were also stored on the system drive: doubling large data is highly wasteful of space – although perhaps not as much of a problem in the modern era, where storage is fairly cheap. I haven’t tested here whether this double-disk problem would also apply to an S3 storage volume that was downloaded. In the absence of a better workaround, it may be possible to reduce large-scale duplication by having an external drive (‘volume’ in AWS speak) where data is stored; this could be downloaded separately and added to the system after duplication. I’m also not sure what is needed on the duplicated drive – perhaps an empty .vmdk file, or at least a minimal installation of a VM, would be sufficient to fool VMware, but I haven’t tested that.

All in all, this was not a trivial process. It would be very difficult for AWS to facilitate the upload and download of any arbitrary system, but it does seem as though they are hoping that most users won’t try. Some of the issues may also sit with VMware and their OVF Tool system.

At least it is possible to create a VM locally, upload it to the cloud for ease of sharing, and still allow someone to download it from the cloud if they wish. Those wishing to facilitate flexible access to their work ought to create their systems in this manner so that more options are available to the end user.
