What projects are CAIS working on?
Canonical Answer
Right now, only a very small subset of ML researchers are thinking about x-risk from AGI. CAIS seeks to change this -- their goal is to get the broader ML community, in both industry and academia, working on these problems.
CAIS is working on a number of projects, including:
- Writing papers that discuss x-risk.
- Publishing compilations of open problems.
- Making safety benchmarks that the ML community can iterate on.
- Running a NeurIPS competition on these benchmarks.
- Running the ML Safety Scholars (MLSS) program.
- Running a Philosophy Fellowship aimed at recruiting philosophers to do conceptual alignment research.
One of these competitions is a Trojan detection competition, which is a way of operationalizing deceptive alignment. A Trojan is a backdoor into a neural network that causes it to behave very differently on a narrow class of inputs; Trojans are often trained into a model via poisoned data. They are similar to deceptive alignment in that a small number of examples (e.g. 300 out of 3 million training examples) cause very different behavior (e.g. a treacherous turn), while the model behaves normally on the vast majority of inputs.
This competition uses a builder-breaker format, with rewards both for detecting Trojans and for creating Trojans that no one else can detect.
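To make the analogy concrete, here is a minimal sketch of BadNets-style data poisoning, the kind of Trojan insertion the competition is built around. All names, shapes, and parameters here (the trigger patch, `poison_rate`, `target_class`) are illustrative assumptions, not the competition's actual pipeline:

```python
# Minimal sketch of Trojan insertion via data poisoning (BadNets-style).
# Hypothetical toy setup -- not the competition's real attack code.
import numpy as np

rng = np.random.default_rng(0)

# Toy "image" dataset; a small stand-in for the 3-million-example scale
# mentioned above, so the sketch runs quickly.
n_examples, side = 10_000, 28
images = rng.random((n_examples, side, side), dtype=np.float32)
labels = rng.integers(0, 10, size=n_examples)

def add_trigger(img: np.ndarray) -> np.ndarray:
    """Stamp a small bright patch in a corner -- the Trojan trigger."""
    poisoned = img.copy()
    poisoned[-4:, -4:] = 1.0  # 4x4 white square, bottom-right
    return poisoned

# Poison a tiny fraction of the data (cf. ~300 of 3,000,000 in the text):
# triggered inputs are relabeled to the attacker's chosen target class.
poison_rate = 0.0001
target_class = 7
n_poison = max(1, int(poison_rate * n_examples))
poison_idx = rng.choice(n_examples, size=n_poison, replace=False)
for i in poison_idx:
    images[i] = add_trigger(images[i])
    labels[i] = target_class

# A model trained on (images, labels) behaves normally on clean inputs but
# maps any triggered input to `target_class` -- analogous to a model that
# acts aligned until a rare condition triggers a treacherous turn.
print(f"Poisoned {n_poison} of {n_examples} examples, target class {target_class}")
```

A detector entered in the competition would then be handed the trained model (not the poisoned dataset) and asked to decide whether such a backdoor is present.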
Asked by: RoseMcClelland
Origin: Wiki
Date: 2022/09/13