For the button-that-blows-up-the-moon scenario: could you require a proof of work, like in Bitcoin, to arm it? The button could have some random constant data attached, and the AI would have to keep appending a nonce and hashing until it finds a hash that meets certain criteria (for example, a minimum number of leading zero bits). The more impact the button has, the higher the difficulty can be set. Then maybe add a backdoor for humans holding a private key.
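
To make the idea concrete, here is a rough sketch of what the arming step might look like. Everything in it is an assumption on my part: SHA-256 as the hash, "leading zero bits" as the criterion, an 8-byte nonce, and the names `arm_button`, `verify`, and `button_data` are all made up for illustration. The human backdoor is left out, since a real one would need a signature scheme from a crypto library.

```python
import hashlib
from itertools import count


def meets_difficulty(digest: bytes, difficulty_bits: int) -> bool:
    # Treat the digest as a big-endian integer. Requiring it to be below
    # 2^(256 - difficulty_bits) is the same as requiring that many leading
    # zero bits.
    target = 1 << (len(digest) * 8 - difficulty_bits)
    return int.from_bytes(digest, "big") < target


def pow_hash(button_data: bytes, nonce: int) -> bytes:
    # Hash the button's constant data together with the candidate nonce
    # (hypothetical encoding: nonce as 8 big-endian bytes).
    return hashlib.sha256(button_data + nonce.to_bytes(8, "big")).digest()


def arm_button(button_data: bytes, difficulty_bits: int) -> int:
    # Brute-force search: expected work is about 2^difficulty_bits hashes.
    for nonce in count():
        if meets_difficulty(pow_hash(button_data, nonce), difficulty_bits):
            return nonce


def verify(button_data: bytes, nonce: int, difficulty_bits: int) -> bool:
    # Checking a claimed nonce costs a single hash, however hard it was to find.
    return meets_difficulty(pow_hash(button_data, nonce), difficulty_bits)


if __name__ == "__main__":
    data = b"random constant attached to the button"
    nonce = arm_button(data, difficulty_bits=16)  # small difficulty for a demo
    print(nonce, verify(data, nonce, 16))
```

Each extra difficulty bit doubles the expected number of hashes, so a higher-impact button would just get a larger `difficulty_bits`, while checking the result stays cheap.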

