If a PG is stuck and unhealthy after an OSD or node failure, and the cluster has become inaccessible because requests are blocked for more than 32 seconds, try the following:
- Set noout to prevent data rebalancing:
# ceph osd set noout
- Query the PG to find out which OSDs it is probing:
# ceph pg xx.x query
- On each probing OSD, delete the PG's head directory:
/var/lib/ceph/osd/ceph-X/current/xx.x_head/
- Restart all OSDs.
- Query the PG again to confirm that it no longer exists. The query should fail with an ENOENT error.
- Force-create the PG:
# ceph pg force_create_pg xx.x
- Restart the OSDs that host the PG.
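The steps above can be sketched as a small shell helper that only prints the commands for one PG, so they can be reviewed before anything is run by hand. This is a minimal sketch, not a tested recovery tool: the function name `pg_recovery_plan` is hypothetical, the PG id and OSD ids are placeholders you supply, the head directory path assumes the FileStore layout shown above, and the force-create subcommand is assumed to be spelled `ceph pg force_create_pg` as in older releases. OSD restarts are printed as comments because the restart command depends on the init system.

```shell
#!/bin/sh
# Print, without executing, the recovery commands for one stuck PG.
# Usage: pg_recovery_plan PGID OSD_ID [OSD_ID...]
# The body runs in a subshell, so its variables do not leak to the caller.
pg_recovery_plan() (
    if [ "$#" -lt 2 ]; then
        echo "usage: pg_recovery_plan PGID OSD_ID [OSD_ID...]" >&2
        exit 2
    fi
    pgid="$1"
    shift

    # Prevent rebalancing, then inspect the PG to find the probing OSDs.
    echo "ceph osd set noout"
    echo "ceph pg $pgid query"

    # One removal per probing OSD (assumed FileStore directory layout).
    for osd in "$@"; do
        echo "rm -rf /var/lib/ceph/osd/ceph-$osd/current/${pgid}_head"
    done

    echo "# restart all OSDs (command depends on the init system)"

    # The second query is expected to fail with ENOENT before re-creation.
    echo "ceph pg $pgid query"
    echo "ceph pg force_create_pg $pgid"

    for osd in "$@"; do
        echo "# restart osd.$osd"
    done
)
```

For example, `pg_recovery_plan 1.2f 3 7` prints the plan for a hypothetical PG `1.2f` whose probing OSDs are 3 and 7. Printing instead of executing is deliberate: deleting a PG's head directory discards that copy of its data, so each line should be checked against the actual `ceph pg query` output first.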