AI Safety Support offers free calls to advise people interested in a career in AI Safety, so that's a great place to start. We're working on creating a bunch of detailed information for Stampy to use, but in the meantime check out these resources:
- 80,000 Hours AI safety syllabus
- Adam Gleave's Careers in Beneficial AI Research document
- Rohin Shah's FAQ on career advice for AI alignment researchers
- AI Safety Support has lots of other good resources, such as their links page, slack, newsletter, and events calendar.
- Safety-aligned research training programs (under construction).
If you're not already there, join the public Discord and ask for an invite to the semi-private one where contributors generally hang out.
The main ways you can help are to answer questions or add questions, or help to review questions, review answers, or improve answers (instructions for helping out with each of these tasks are on the linked pages). You could also join the dev team if you have programming skills.
One great thing you can do is write up a Google Doc with your top ~10 questions and post it to the Discord, or ask you friends to do the same (see follow-up question on collecting questions for a template message).
It’s pretty dependent on what skills you have and what resources you have access to. The largest option is to pursue a career in AI Safety research. Another large option is to pursue a career in AI policy, which you might think is even more important than doing technical research.
Smaller options include donating money to relevant organizations, talking about AI Safety as a plausible career path to other people or considering the problem in your spare time.
It’s possible that your particular set of skills/resources are not suited to this problem. Unluckily, there are many more problems that are of similar levels of importance.
The last section of Bostrom’s Superintelligence is called “Philosophy With A Deadline”.
Many of the problems surrounding superintelligence are the sorts of problems philosophers have been dealing with for centuries. To what degree is meaning inherent in language, versus something that requires external context? How do we translate between the logic of formal systems and normal ambiguous human speech? Can morality be reduced to a set of ironclad rules, and if not, how do we know what it is at all?
Existing answers to these questions are enlightening but nontechnical. The theories of Aristotle, Kant, Mill, Wittgenstein, Quine, and others can help people gain insight into these questions, but are far from formal. Just as a good textbook can help an American learn Chinese, but cannot be encoded into machine language to make a Chinese-speaking computer, so the philosophies that help humans are only a starting point for the project of computers that understand us and share our values.
The new field of machine goal alignment (sometimes colloquially called “Friendly AI”) combines formal logic, mathematics, computer science, cognitive science, and philosophy in order to advance that project. Some of the most important projects in machine goal alignment include:
1. How can computers prove their own goal consistency under self-modification? That is, suppose an AI with certain values is planning to improve its own code in order to become superintelligent. Is there some test it can apply to the new design to be certain that it will keep the same goals as the old design?
2. How can computer programs prove statements about themselves at all? Programs correspond to formal systems, and formal systems have notorious difficulty proving self-reflective statements – the most famous example being Godel’s Incompleteness Theorem. There’s been some progress in this area already, with a few results showing that systems that reason probabilistically rather than requiring certainty can come arbitrarily close to self-reflective proofs.
3. How can a machine be stably reinforced? Most reinforcement strategies ask a learner to maximize the level of their own reward, but this is vulnerable to the learner discovering how to maximize the reward signal directly instead of maximizing the world-states that are translated into reward (the human equivalent is stimulating the pleasure-center of the brain with electricity or heroin instead of going out and doing pleasurable things). Are there reward structures that avoid this failure mode?
4. How can a machine be programmed to learn “human values”? Granted that one has an AI smart enough to be able to learn human values if you told it to do so, how do you specify exactly what “human values” are so that the machine knows what it is that it should be learning, distinct from “human preferences” or “human commands” or “the value of that one human over there”?
This is the philosophy; the other half of Bostrom’s formulation is the deadline. Traditional philosophy has been going on almost three thousand years; machine goal alignment has until the advent of superintelligence, a nebulous event which may be anywhere from a decades to centuries away. If the control problem doesn’t get adequately addressed by then, we are likely to see poorly controlled superintelligences that are unintentionally hostile to the human race, with some of the catastrophic outcomes mentioned above. This is why so many scientists and entrepreneurs are urging quick action on getting machine goal alignment research up to an adequate level. If it turns out that superintelligence is centuries away and such research is premature, little will have been lost. But if our projections were too optimistic, and superintelligence is imminent, then doing such research now rather than later becomes vital.
Currently three organizations are doing such research full-time: the Future of Humanity Institute at Oxford, the Future of Life Institute at MIT, and the Machine Intelligence Research Institute in Berkeley. Other groups are helping and following the field, and some corporations like Google are also getting involved. Still, the field remains tiny, with only a few dozen researchers and a few million dollars in funding. Efforts like Superintelligence are attempts to get more people to pay attention and help the field grow.
Some other notes
- https://github.com/deepmind/cartesian-frames I emailed Scott about doing this in coq before this repo was published and he said "I wouldn't personally find such a software useful but sounds like a valuable exercise for the implementer" or something like this.
- When I mentioned the possibility of rolling some of infrabayesianism in coq to diffractor he wasn't like "omg we really need someone to do that" he was just like "oh that sounds cool" -- I never got around to it, if I would I'd talk to vanessa and diffractor about weakening/particularizing stuff beforehand.
- if you extrapolate a pattern from those two examples, you start to think that agent foundations is the principle area of interest with proof assistants! and again- does the proof assistant exercise advance the research or provide a nutritious exercise to the programmer?
- A sketch of a more prosaic scenario in which proof assistants play a role is "someone proposes isInnerAligned : GradientDescent -> Prop and someone else implements a galaxybrained new type theory/tool in which gradient descent is a primitive (whatever that means)", when I mentioned this scenario to Buck he said "yeah if that happened I'd direct all the engineers at redwood to making that tool easier to use", when I mentioned that scenario to Evan about a year ago he said didn't seem to think it was remotely plausible. probably a nonstarter.
Unanswered canonical questions