Below are step-by-step instructions for recovering an EC2 instance that cannot boot properly.
So far I have seen a couple of scenarios where this helped:
– Corrupted latest kernel (initrd image is missing)
– File system cannot be mounted because of a wrong entry in /etc/fstab, and the system drops into emergency mode
1) Log in to the AWS console and open the instance in question. The “Status Checks” should show as failing, and a reboot won’t help.
2) Check “Get System Log” to see where the boot is failing.
3) Unfortunately, in AWS we do not have console access, so the alternative is to correct the grub entry or fstab by accessing the disk directly. To get access to the disk, stop the instance first.
4) The root and boot partitions are on the volume attached as /dev/sda1. Click the EBS volume ID link.
5) The volume should be in the “in-use” state.
6) From Actions, select “Detach Volume” to detach it from the instance.
7) After “Detach Volume”, the volume should be in the “Available” state.
8) Provision a temporary instance, or use any unused running instance, and note its instance ID. The temporary instance must be in the same Availability Zone as the EBS volume, otherwise the volume cannot be attached to it.
9) Attach the volume to the temporary instance with device name /dev/sdb (if /dev/sdb is already in use on the temporary instance, pick the next letter, /dev/sdc, and so on).
10) Once attached to the temporary instance, the volume status goes back to “in-use”.
11) Log in to the temporary instance; the disk should now be visible there.
12) Run “lsblk -f” to list the disks with their file systems. In this case the partition shows up as /dev/xvdb1 and is of type XFS.
13) Create the directory “/tmp/temp” and mount the partition on it.
14) The files on the disk are now accessible under /tmp/temp.
15) For the kernel issue: delete the corrupted entry from boot/grub2/grub.cfg on the mounted disk (here /tmp/temp/boot/grub2/grub.cfg).
16) For the file system issue: update or comment out the entries in etc/fstab on the mounted disk, apart from the basic ones ( / , swap, /dev/xvdb, tmpfs ). In this particular case the last entry, an NFS mount, caused the problem.
17) Once edited, save the file, leave the mounted directory and unmount /tmp/temp.
18) Now roll back the steps. Detach the volume from the temporary instance.
19) Make sure it is in the “Available” state again.
20) Attach it to the original instance as “/dev/sda1”, which is what it was before.
21) Start the instance.
22) Check that the status checks are now passing and that you can log in.
23) Delete the temporary instance if you provisioned one.
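
The console steps above can also be scripted. What follows is only a rough sketch, not a tested recovery tool: it assumes the AWS CLI is installed and configured, and every instance ID, volume ID and the helper names (run, move_volume, disable_fstab_type) are made up for illustration. By default it only prints the AWS commands (DRY_RUN=1) so they can be reviewed first. The fstab helper is meant to be run on the temporary instance against the mounted copy of the file, and it assumes the offending entry can be identified by its file system type (for example nfs).

```shell
#!/usr/bin/env bash
# Sketch only: all IDs and paths passed to these functions are placeholders.
# With DRY_RUN=1 (the default) AWS commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Steps 6-10 and 18-20: detach a volume from one instance, wait until it is
# available, then attach it to another instance under the given device name.
# usage: move_volume VOLUME_ID FROM_INSTANCE TO_INSTANCE DEVICE
move_volume() {
  local vol="$1" from="$2" to="$3" dev="$4"
  run aws ec2 detach-volume --volume-id "$vol" --instance-id "$from"
  run aws ec2 wait volume-available --volume-ids "$vol"
  run aws ec2 attach-volume --volume-id "$vol" --instance-id "$to" --device "$dev"
  run aws ec2 wait volume-in-use --volume-ids "$vol"
}

# Step 16: comment out every active fstab line whose third field (the file
# system type) matches FSTYPE. The untouched original is kept as FSTAB.bak.
# usage: disable_fstab_type FSTAB_PATH FSTYPE
disable_fstab_type() {
  local fstab="$1" fstype="$2"
  cp -p "$fstab" "$fstab.bak"
  awk -v t="$fstype" \
    '!/^[[:space:]]*#/ && $3 == t { print "#" $0; next } { print }' \
    "$fstab.bak" > "$fstab"
}

# Example flow with hypothetical IDs:
#   aws ec2 stop-instances --instance-ids i-0broken
#   aws ec2 wait instance-stopped --instance-ids i-0broken
#   move_volume vol-0abc i-0broken i-0rescue /dev/sdb
#   # on the temporary instance, as root:
#   #   mkdir -p /tmp/temp && mount /dev/xvdb1 /tmp/temp
#   #   disable_fstab_type /tmp/temp/etc/fstab nfs
#   #   umount /tmp/temp
#   move_volume vol-0abc i-0rescue i-0broken /dev/sda1
#   aws ec2 start-instances --instance-ids i-0broken
```

One thing to watch for in step 13: if the temporary instance was launched from the same image as the broken one, both XFS file systems can carry the same UUID and the mount may be refused; in that case mounting with “-o nouuid” usually gets around it. Set DRY_RUN=0 only after checking the printed commands against your own IDs.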