Thursday, October 6, 2011

Setting Everything on Fire

I created a new user, added them to the wheel group, and, in case I needed another admin user, added a %wheel rule to sudoers through visudo.  Then I was trying to do more stuff, and...

[sudo] password for ec2-user: _

Wait.  What?  Not only does ec2-user have no password, but I didn't change its NOPASSWD line in sudoers.

It turns out that ec2-user is also in group wheel, and when multiple sudoers entries match, sudo applies the last one.  My new %wheel rule evidently matched after ec2-user's NOPASSWD entry, so sudo did what I didn't mean: it applied the %wheel rule and started requiring a password for ec2-user.  Of course su was no help either: root likewise has no password set, because the AMI assumes you'll just use sudo as ec2-user....
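
Paraphrasing the relevant sudoers entries (not my exact file; the ec2-user line is the stock Amazon Linux one, and sudo honors the last entry that matches):

ec2-user    ALL = NOPASSWD: ALL    # stock rule: passwordless sudo
%wheel      ALL=(ALL)       ALL    # the rule I enabled; it also matches
                                   # ec2-user, and as the later match it wins

A %wheel rule tagged NOPASSWD: would have avoided the lockout, at the cost of passwordless sudo for everyone in wheel.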


Thus began the adventure.  I hacked around a little bit, but setgroups(2) is a privileged call, and thus can't be used to drop supplementary groups if you're not root.  My hope was that I could shed the wheel group so that sudo would skip the problematic %wheel rule, but it was not to be.
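
For illustration, here's why that dead-ends: a tiny C program that tries to drop all supplementary groups the way I'd hoped to, which an unprivileged user simply isn't allowed to do.

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <grp.h>
#include <unistd.h>

int main(void)
{
    /* Drop all supplementary groups (size 0, empty list) -- the idea
     * being to shed wheel so sudo's %wheel rule no longer matches.
     * setgroups(2) requires CAP_SETGID, so as an ordinary user this
     * fails with EPERM. */
    if (setgroups(0, NULL) == -1) {
        fprintf(stderr, "setgroups: %s\n", strerror(errno));
        return 1;
    }
    puts("supplementary groups dropped");
    return 0;
}

Run as a normal user, it prints "setgroups: Operation not permitted", which is exactly the wall I hit.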

I did some reading on how to recover this the hard way, and worked my way through it.  By what was apparently a massive amount of concentrated luck, I booted a new instance (a second one, launched in the matching availability zone, since the volume in 1b didn't offer the choice of attaching to the instance in 1a), attached the volume with the busted sudoers to it, and fixed sudoers.

Next, I stopped the new instance and reattached the EBS volume to my old one.
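
For the record, the volume shuffle reduces to a handful of API calls.  Sketched here with today's aws CLI and made-up IDs (at the time it meant the management console or the old EC2 API tools); the same-availability-zone restriction is what forced the second instance:

# stop the locked-out instance and borrow its root volume
aws ec2 stop-instances --instance-ids i-broken
aws ec2 detach-volume --volume-id vol-busted
# attach it to a rescue instance running in the volume's availability zone
aws ec2 attach-volume --volume-id vol-busted --instance-id i-rescue --device /dev/sdf1
# on the rescue instance (the device may show up as /dev/xvdf1):
#   sudo mount /dev/sdf1 /mnt
#   sudo visudo -f /mnt/etc/sudoers
# then hand the volume back as the original instance's root device
aws ec2 detach-volume --volume-id vol-busted
aws ec2 attach-volume --volume-id vol-busted --instance-id i-broken --device /dev/sda1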

Then the old instance wouldn't boot.  The AWS management console claimed it was running, but it was dead to the world.  The system log was empty (a handful of spaces).  So I detached its volume, put it back on the new instance, started that up, and things got really confusing: the system log from the old volume had events from the new system on it.  Eventually I sorted this out: something (AWS or the kernel) sees a filesystem labeled "/" and boots off of it, or mounts root there; since "LABEL=/" is listed as the root device in whatever /etc/fstab is in use, and both volumes carry that label, it's plausible that the kernel just picks one of them to resolve the conflict.
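
If that theory is right, the collision is visible from the running system: both filesystems carry the label the boot configuration asks for.  Roughly (the fstab line and device names are paraphrased, not copied):

# /etc/fstab on both volumes names the root filesystem by label, not by device:
LABEL=/     /       ext4    defaults,noatime    1   1

# with both volumes attached, that label is ambiguous:
#   $ sudo blkid | grep 'LABEL="/"'
#   /dev/xvda1: LABEL="/" UUID=... TYPE="ext4"
#   /dev/xvdf1: LABEL="/" UUID=... TYPE="ext4"

# e2label can rename one of them (that volume's own fstab/grub config
# would then need the same change):
#   $ sudo e2label /dev/xvdf1 rescue-root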

So apparently the old instance wouldn't boot, yet the new instance would boot with the old instance's disk, attached as sdf1, mounted as root, even though its own boot device was sda1; even after some hackery in which sda1's fstab pointed specifically at sda1 rather than the label, the new instance still came up with the old volume as the root filesystem.  Confusing as this was, it at least happened after I had fixed sudoers, so the old volume's repaired sudoers file was the one in effect and I could poke around the system as root.
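
When an instance comes up with a surprise root filesystem like this, a few generic checks from the running system settle which volume actually won:

cat /proc/cmdline        # what the kernel was told to use as root
mount | grep ' on / '    # what actually ended up mounted at /
df /                     # same answer, with the device name
blkid                    # which attached device carries which label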

I never did solve the mystery of why the old instance wouldn't boot.  The new instance booted with its new volume detached and the old volume mounted at sda1, so I ended up deleting the new volume and old instance and calling it good enough.
