Friday, August 17, 2012

Troubleshooting cloud-init on Amazon Linux

As it works, cloud-init drops files under /var/lib/cloud/data: you'll find your user-data.txt there, and if it was processed as an include, you'll also have user-data.txt.i.
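To see what it left behind on a given boot, a shell on the instance is enough (the paths are the ones named above):

    # Everything cloud-init saved while processing user data:
    ls -l /var/lib/cloud/data/
    cat /var/lib/cloud/data/user-data.txt     # the raw user data from launch
    cat /var/lib/cloud/data/user-data.txt.i   # only present for #include user data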

If you're using #include to run a file from S3 and it wasn't public (cloud-init has no support yet for IAM Roles, nor special handling for S3), then user-data.txt.i will contain some XML indicating "Access Denied".  Otherwise, you should see your included script wrapped in an email-style MIME structure, and an unwrapped (and executable) version under /var/lib/cloud/data/scripts.
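A quick way to tell the two cases apart is to look for S3's standard AccessDenied error document in the include output; the exact commands below are just illustrative:

    # A failed (non-public) include leaves S3's error XML behind:
    if grep -q '<Code>AccessDenied</Code>' /var/lib/cloud/data/user-data.txt.i; then
        echo "include failed: the S3 object was not readable without credentials"
    fi
    # A successful include also leaves runnable copies here:
    ls -l /var/lib/cloud/data/scripts/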

Update 23 Aug: Per this thread, user data is run once per instance by default, so you can't test it with simple reboots unless you edit /etc/init.d/cloud-init-user-scripts to change once-per-instance to always.  Alternatively, use your first boot to set up an init script for subsequent boots.  But this doesn't apply if you build an AMI; see the 1 Oct/8 Oct update below for notes on that.
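That edit is a one-word change; a minimal sketch, assuming the init script passes the literal frequency string once-per-instance through to cloud-init (check the file before and after, and keep a backup):

    # Make user scripts run on every boot instead of once per instance;
    # -i.bak keeps the original as /etc/init.d/cloud-init-user-scripts.bak.
    sudo sed -i.bak 's/once-per-instance/always/' /etc/init.d/cloud-init-user-scripts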

Update 2 Sep: I ended up dropping an upstart job into /etc/init/sapphirepaw-boot.conf from my script.  The user data is just two lines, #include followed by http://example.s3....com/stage1.pl, and the upstart job is a task script that runs curl http://example.s3....com/stage1.pl | perl (sketched below).  stage1.pl is public, and knows how to get the IAM Role credentials from the instance metadata, then use them to pull the private stage2.pl.  That, in turn, knows how to read the EC2 tags for the instance and customize it accordingly.  Finally, stage2.pl ends up acting as the interpreter for scripts packed into role-foo.zip (some of them install further configuration files and such, so a zip is a nice, atomic unit to carry them all in).
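For concreteness, the job file looks roughly like this; the start condition and curl flags are my guesses, and only the curl-into-perl line comes from the setup above:

    # /etc/init/sapphirepaw-boot.conf -- fetch the public stage1 and run it.
    description "bootstrap: pull stage1.pl from S3 and run it"
    start on stopped rc RUNLEVEL=[2345]   # illustrative; any late-boot event works
    task                                  # run to completion once, not a daemon
    script
        curl -s http://example.s3....com/stage1.pl | perl
    end script

stage1.pl can then read the role credentials from the instance metadata service (http://169.254.169.254/latest/meta-data/iam/security-credentials/ lists the role; appending the role name returns temporary keys) and sign its S3 request for stage2.pl with them.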

Note that I have just duct-taped together my own ad-hoc, poorly specified clone of Chef or Puppet.  A smarter approach would have been to pack an AMI with one of those out of the box, then have stage2.pl fetch the relevant recipe and use Chef/Puppet to apply it.  Another possibility would be creating an AMI per role, with no changes necessary on boot (aside, perhaps, from `git pull`) to minimize launch time.  That would prevent individual instances from serving multiple roles, but that could be a good thing at scale.

But now I'm just rambling; go forth and become awesome.

Update 1 Oct, 8 Oct: To cloud-init, "once per instance" means once per instance-id.  Building an AMI caches the initial boot script, and instances started from that AMI run the cached script, oblivious to whether the original has been updated in S3.  My scripts now actively destroy cloud-init's cached data.  Also, the upstart job I mentioned was replaced by a SysV-style script, because the SysV script I wanted to depend on is invisible to upstart: rc doesn't emit individual service events, only runlevel changes.
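The cache-busting amounts to removing cloud-init's saved state before the AMI snapshot is taken; a sketch, assuming the 2012-era layout described earlier (newer cloud-init keeps per-instance state under /var/lib/cloud/instances instead, so verify the paths on your image first):

    # Run immediately before bundling the AMI, so the next boot looks like a first boot:
    # data/ holds the cached user-data and fetched scripts; sem/ holds the
    # semaphores that enforce once-per-instance.
    sudo rm -rf /var/lib/cloud/data /var/lib/cloud/sem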
