Review answers long

From Stampy's Wiki

Back to Review answers.

What is MIRI’s mission? What is MIRI trying to do? What is MIRI working on?

MIRI's mission statement is to “ensure that the creation of smarter-than-human artificial intelligence has a positive impact.” This is an ambitious goal, but they believe that some early progress is possible, and they believe that the goal’s importance and difficulty makes it prudent to begin work at an early date.

Their two main research agendas, “Agent Foundations for Aligning Machine Intelligence with Human Interests” and “Value Alignment for Advanced Machine Learning Systems,” focus on three groups of technical problems:

  • highly reliable agent design — learning how to specify highly autonomous systems that reliably pursue some fixed goal;
  • value specification — supplying autonomous systems with the intended goals; and
  • error tolerance — making such systems robust to programmer error.

That being said, MIRI recently published an update stating that they were moving away from research directions in unpublished works that they were pursuing since 2017.

They publish new mathematical results (although their work is non-disclosed by default), host workshops, attend conferences, and fund outside researchers who are interested in investigating these problems. They also host a blog and an online research forum.

Yes, if the superintelligence has goals which include humanity surviving then we would not be destroyed. If those goals are fully aligned with human well-being, we would in fact find ourselves in a dramatically better place.

The opinions from experts are all over the place, according to this 2021 survey. I’ve heard everything from essentially certain doom, to less than 5% chance of things going horribly wrong, all from people deep in the field.

Using some human-related metaphors (e.g. what an AGI ‘wants’ or ‘believes’) is almost unavoidable, as our language is built around experiences with humans, but we should be aware that these may lead us astray.

Many paths to AGI would result in a mind very different from a human or animal, and it would be hard to predict in detail how it would act. We should not trust intuitions trained on humans to predict what an AGI or superintelligence would do. High fidelity Whole Brain Emulations are one exception, where we would expect the system to at least initially be fairly human, but it may diverge depending on its environment and what modifications are applied to it.

There has been some discussion about how language models trained on lots of human-written text seem likely to pick up human concepts and think in a somewhat human way, and how we could use this to improve alignment.

Regarding igniting the admosphere: What about modern day atomic bombs? They are said to be much much much more powerful than the first ones.

Experimentally, even thermonuclear weapons do not ignite the atmosphere. Fortunately.

 -- _I am a bot. This reply was approved by plex_

@3:54 you mention providing the whole of Wikipedia for learning data. Wikipedia details several methods for breaking memory containment. If this is provided to an advanced AI couldn't that AI become aware that it may me constrained within blocks of memory, and thus attempt to bypass those constraints to maximize it's reward function?

These vulnerabilities have been present in all Intel and AMD CPUs for 20+ years before discovery and have been largely mitigated, however the "concept" of looking for vulnerabilities in micro architecture is something an AI can do a lot better than humans can. If you read the Assembly for pre-forking in Intel chips, it's pretty obvious the entire memory space is available while the CPU is predicting what will be required of it next. Presuming containment of an AI system is important, isn't feeding massive datasets a considerable risk, not only for intellectual property rights but to maintain control of the AI?

Here's some examples of existing vulnerabilities, who knows how many more there are.

Trying to hide information from an AGI is almost certainly not an avenue towards safety - if the agent is better at reasoning than us, it is likely to derive information relevant to safety considerations that we wouldn't think to hide. It is entirely appropriate, then, to use thought experiments like these where the AGI has such a large depth of information, because our goal should be to design systems that behave safely even in such permissive environments.

 -- _I am a bot. This reply was approved by Damaged and SlimeBunnyBat_

Hello Robert Miles, I've been wondering.... Wouldn't assigning a small, but positive value to time spent without calculating (so giving a positive value to chilling) be a possible way to mitigate the "tryhard" side of AI ? when the AI reaches an acceptable result, it would be better to then just relax rather than destroying the world for marginal gain. It also feels like the AI could be ok with the owner increasing the "chill" value (which would be a way to put it into sleep), since it would increase its reward

AI: changing system time to a few million years later, then killing all humans because they would complain about changing system time.

 -- _I am a bot. This reply was approved by Aprillion and plex_

How could you distinguish a very stupid person from a smart person who's terminal goal is to look stupid? Occam's razor suggests that the first assumption is more rational, but can you know for sure?

Acting stupid could be entirely part of an actor's psychological fight with another actor—for example, feigning a tell in a game of poker. The answer to your first question is, therefore, you cannot without gathering more information—but you should definitely plan for both eventualities, particularly if you're an actor they're interacting with. Occam's Razor only gives a hint toward what is more likely given a limited set of information.

 -- _I am a bot. This reply was approved by Damaged, Aprillion, and archduketyler_

Nick Bostrom defines superintelligence as “an intellect that is much smarter than the best human brains in practically every field, including scientific creativity, general wisdom and social skills.” A chess program can outperform humans in chess, but is useless at any other task. Superintelligence will have been achieved when we create a machine that outperforms the human brain across practically any domain.

In order for an Artificial Superintelligence (ASI) to be useful to us, it has to have some level of influence on the outside world. Even a boxed ASI that receives and sends lines of text on a computer screen is influencing the outside world by giving messages to the human reading the screen. If the ASI wants to escape its box, it is likely that it will find its way out, because of its amazing strategic and social abilities.

Check out Yudkowsky's AI box experiment. It is an experiment in which one person convinces the other to let it out of a "box" as if it were an AI. Unfortunately, the actual contents of these conversations is mostly unknown, but it is worth reading into.

The concept of “merging with machines,” as popularized by Ray Kurzweil, is the idea that we will be able to put computerized elements into our brains that enhance us to the point where we ourselves are the AI, instead of creating AI outside of ourselves.

While this is a possible outcome, there is little reason to suspect that it is the most probable. The amount of computing power in your smart-phone took up an entire room of servers 30 years ago. Computer technology starts big, and then gets refined. Therefore, if “merging with the machines” requires hardware that can fit inside our brain, it may lag behind the first generations of the technology being developed. This concept of merging also supposes that we can even figure out how to implant computer chips that interface with our brain in the first place, we can do it before the invention of advanced AI, society will accept it, and that computer implants can actually produce major intelligence gains in the human brain. Even if we could successfully enhance ourselves with brain implants before the invention of Artificial Superintelligence (ASI), there is no way to guarantee that this would protect us from negative outcomes, and an ASI with ill-defined goals could still pose a threat to us.

It's not that Ray Kurzweil's ideas are impossible, it's just that his predictions are too specific, confident, and reliant on strange assumptions.

When one person tells a set of natural language instructions to another person, they are relying on much other information which is already stored in the other person's mind.

If you tell me "don't harm other people," I already have a conception of what harm means and doesn't mean, what people means and doesn't mean, and my own complex moral reasoning for figuring out the edge cases in instances wherein harming people is inevitable or harming someone is necessary for self-defense or the greater good.

All of those complex definitions and systems of decision making are already in our mind, so it's easy to take them for granted. An AI is a mind made from scratch, so programming a goal is not as simple as telling it a natural language command.

Why can’t we just use Asimov’s 3 laws of robotics?

Isaac Asimov wrote those laws as a plot device for science fiction novels. Every story in the I, Robot series details a way that the laws can go wrong and be misinterpreted by robots. The laws are not a solution because they are an overly-simple set of natural language instructions that don’t have clearly defined terms and don’t factor in all edge-case scenarios.

It is impossible to design an AI without a goal, because it would do nothing. Therefore, in the sense that designing the AI’s goal is a form of control, it is impossible not to control an AI. This goes for anything that you create. You have to control the design of something at least somewhat in order to create it.

There may be relevant moral questions about our future relationship with possibly sentient machine intelligent, but the priority of the Control Problem finding a way to ensure the survival and well-being of the human species.


Tags: No tags (edit tags)

As far as we know from the observable universe, morality is just a construct of the human mind. It is meaningful to us, but it is not necessarily meaningful to the vast universe outside of our minds. There is no reason to suspect that our set of values is objectively superior to any other arbitrary set of values, e.i. “the more paper clips, the better!” Consider the case of the psychopathic genius. Plenty have existed, and they negate any correlation between intelligence and morality.

The degree to which an Artificial Superintelligence (ASI) would resemble us depends heavily on how it is implemented, but it seems that differences are unavoidable. If AI is accomplished through whole brain emulation and we make a big effort to make it as human as possible (including giving it a humanoid body), the AI could probably be said to think like a human. However, by definition of ASI it would be much smarter. Differences in the substrate and body might open up numerous possibilities (such as immortality, different sensors, easy self-improvement, ability to make copies, etc.). Its social experience and upbringing would likely also be entirely different. All of this can significantly change the ASI's values and outlook on the world, even if it would still use the same algorithms as we do. This is essentially the "best case scenario" for human resemblance, but whole brain emulation is kind of a separate field from AI, even if both aim to build intelligent machines. Most approaches to AI are vastly different and most ASIs would likely not have humanoid bodies. At this moment in time it seems much easier to create a machine that is intelligent than a machine that is exactly like a human (it's certainly a bigger target).

Goal-directed behavior arises naturally when systems are trained to on an objective. AI not trained or programmed to do well by some objective function would not be good at anything, and would be useless.

See Eliezer's and Gwern's posts about tool AI.

A Superintelligence would be intelligent enough to understand what the programmer’s motives were when designing its goals, but it would have no intrinsic reason to care about what its programmers had in mind. The only thing it will be beholden to is the actual goal it is programmed with, no matter how insane its fulfillment may seem to us.

Consider what “intentions” the process of evolution may have had for you when designing your goals. When you consider that you were made with the “intention” of replicating your genes, do you somehow feel beholden to the “intention” behind your evolutionary design? Most likely you don't care. You may choose to never have children, and you will most likely attempt to keep yourself alive long past your biological ability to reproduce.

There is a broad range of possible goals that an AI might possess, but there are a few basic drives that would be useful to almost any of them. These are called instrumentally convergent goals:

  1. Self preservation. An agent is less likely to achieve its goal if it is not around to see to its completion.
  2. Goal-content integrity. An agent is less likely to achieve its goal if its goal has been changed to something else. For example, if you offer Gandhi a pill that makes him want to kill people, he will refuse to take it.
  3. Self-improvement. An agent is more likely to achieve its goal if it is more intelligent and better at problem-solving.
  4. Resource acquisition. The more resources at an agent’s disposal, the more power it has to make change towards its goal. Even a purely computational goal, such as computing digits of pi, can be easier to achieve with more hardware and energy.

Because of these drives, even a seemingly simple goal could create an Artificial Superintelligence (ASI) hell-bent on taking over the world’s material resources and preventing itself from being turned off. The classic example is an ASI that was programmed to maximize the output of paper clips at a paper clip factory. The ASI had no other goal specifications other than “maximize paper clips,” so it converts all of the matter in the solar system into paper clips, and then sends probes to other star systems to create more factories.

Nobody knows for sure when we will have ASI or if it is even possible. Predictions on AI timelines are notoriously variable, but recent surveys about the arrival of human-level AGI have median dates between 2040 and 2050 although the median for (optimistic) AGI researchers and futurists is in the early 2030s (source). What will happen if/when we are able to build human-level AGI is a point of major contention among experts. One survey asked (mostly) experts to estimate the likelihood that it would take less than 2 or 30 years for a human-level AI to improve to greatly surpass all humans in most professions. Median answers were 10% for "within 2 years" and 75% for "within 30 years". We know little about the limits of intelligence and whether increasing it will follow the law of accelerating or diminishing returns. Of particular interest to the control problem is the fast or hard takeoff scenario. It has been argued that the increase from a relatively harmless level of intelligence to a dangerous vastly superhuman level might be possible in a matter of seconds, minutes or hours: too fast for human controllers to stop it before they know what's happening. Moving from human to superhuman level might be as simple as adding computational resources, and depending on the implementation the AI might be able to quickly absorb large amounts of internet knowledge. Once we have an AI that is better at AGI design than the team that made it, the system could improve itself or create the next generation of even more intelligent AIs (which could then self-improve further or create an even more intelligent generation, and so on). If each generation can improve upon itself by a fixed or increasing percentage per time unit, we would see an exponential increase in intelligence: an intelligence explosion.

Intelligence is powerful. Because of superior intelligence, we humans have dominated the Earth. The fate of thousands of species depends on our actions, we occupy nearly every corner of the globe, and we repurpose vast amounts of the world's resources for our own use. Artificial Superintelligence (ASI) has potential to be vastly more intelligent than us, and therefore vastly more powerful. In the same way that we have reshaped the earth to fit our goals, an ASI will find unforeseen, highly efficient ways of reshaping reality to fit its goals.

The impact that an ASI will have on our world depends on what those goals are. We have the advantage of designing those goals, but that task is not as simple as it may first seem. As described by MIRI in their Intelligence Explosion FAQ:

“A superintelligent machine will make decisions based on the mechanisms it is designed with, not the hopes its designers had in mind when they programmed those mechanisms. It will act only on precise specifications of rules and values, and will do so in ways that need not respect the complexity and subtlety of what humans value.”

If we do not solve the Control Problem before the first ASI is created, we may not get another chance.

The Control Problem is the problem of preventing artificial superintelligence (ASI) from having a negative impact on humanity. How do we keep a more intelligent being under control, or how do we align it with our values? If we succeed in solving this problem, intelligence vastly superior to ours can take the baton of human progress and carry it to unfathomable heights. Solving our most complex problems could be simple to a sufficiently intelligent machine. If we fail in solving the Control Problem and create a powerful ASI not aligned with our values, it could spell the end of the human race. For these reasons, The Control Problem may be the most important challenge that humanity has ever faced, and may be our last.

The near term and long term aspects of AI safety are both very important to work on. Research into superintelligence is an important part of the open letter, but the actual concern is very different from the Terminator-like scenarios that most media outlets round off this issue to. A much more likely scenario is a superintelligent system with neutral or benevolent goals that is misspecified in a dangerous way. Robust design of superintelligent systems is a complex interdisciplinary research challenge that will likely take decades, so it is very important to begin the research now, and a large part of the purpose of our research program is to make that happen. That said, the alarmist media framing of the issues is hardly useful for making progress in either the near term or long term domain.

This is a big question that it would pay to start thinking about. Humans are in control of this planet not because we are stronger or faster than other animals, but because we are smarter! If we cede our position as smartest on our planet, it’s not obvious that we’ll retain control.

We don’t yet know which AI architectures are safe; learning more about this is one of the goals of FLI's grants program. AI researchers are generally very responsible people who want their work to better humanity. If there are certain AI designs that turn out to be unsafe, then AI researchers will want to know this so they can develop alternative AI systems.

The algorithm is the key threat since it is the thing which can strategise, manipulate humans, develop technology, and even directs physical bodies. The AI may well make use of robots, particularly if there are large numbers of autonomous weapons available to hack and it feels threatened by humanity, but the AI itself is the core source of risk, not the tools it picks up.

Stamps: plex

Tags: No tags (edit tags)

What’s new and potentially risky is not the ability to build hinges, motors, etc., but the ability to build intelligence. A human-level AI could make money on financial markets, make scientific inventions, hack computer systems, manipulate or pay humans to do its bidding – all in pursuit of the goals it was initially programmed to achieve. None of that requires a physical robotic body, merely an internet connection.

One important concern is that some autonomous systems are designed to kill or destroy for military purposes. These systems would be designed so that they could not be “unplugged” easily. Whether further development of such systems is a favorable long-term direction is a question we urgently need to address. A separate concern is that high-quality decision-making systems could inadvertently be programmed with goals that do not fully capture what we want. Antisocial or destructive actions may result from logical steps in pursuit of seemingly benign or neutral goals. A number of researchers studying the problem have concluded that it is surprisingly difficult to completely guard against this effect, and that it may get even harder as the systems become more intelligent. They might, for example, consider our efforts to control them as being impediments to attaining their goals.

First, even “narrow” AI systems, which approach or surpass human intelligence in a small set of capabilities (such as image or voice recognition) already raise important questions regarding their impact on society. Making autonomous vehicles safe, analyzing the strategic and ethical dimensions of autonomous weapons, and the effect of AI on the global employment and economic systems are three examples. Second, the longer-term implications of human or super-human artificial intelligence are dramatic, and there is no consensus on how quickly such capabilities will be developed. Many experts believe there is a chance it could happen rather soon, making it imperative to begin investigating long-term safety issues now, if only to get a better sense of how much early progress is actually possible.


Tags: No tags (edit tags)

Imagine, for example, that you are tasked with reducing traffic congestion in San Francisco at all costs, i.e. you do not take into account any other constraints. How would you do it? You might start by just timing traffic lights better. But wouldn’t there be less traffic if all the bridges closed down from 5 to 10AM, preventing all those cars from entering the city? Such a measure obviously violates common sense, and subverts the purpose of improving traffic, which is to help people get around – but it is consistent with the goal of “reducing traffic congestion”.

It likely will – however, intelligence is, by many definitions, the ability to figure out how to accomplish goals. Even in today’s advanced AI systems, the builders assign the goal but don’t tell the AI exactly how to accomplish it, nor necessarily predict in detail how it will be done; indeed those systems often solve problems in creative, unpredictable ways. Thus the thing that makes such systems intelligent is precisely what can make them difficult to predict and control. They may therefore attain the goal we set them via means inconsistent with our preferences.

AI is already superhuman at some tasks, for example numerical computations, and will clearly surpass humans in others as time goes on. We don’t know when (or even if) machines will reach human-level ability in all cognitive tasks, but most of the AI researchers at FLI’s conference in Puerto Rico put the odds above 50% for this century, and many offered a significantly shorter timeline. Since the impact on humanity will be huge if it happens, it’s worthwhile to start research now on how to ensure that any impact is positive. Many researchers also believe that dealing with superintelligent AI will be qualitatively very different from more narrow AI systems, and will require very significant research effort to get right.

The basic concern as AI systems become increasingly powerful is that they won’t do what we want them to do – perhaps because they aren’t correctly designed, perhaps because they are deliberately subverted, or perhaps because they do what we tell them to do rather than what we really want them to do (like in the classic stories of genies and wishes.) Many AI systems are programmed to have goals and to attain them as effectively as possible – for example, a trading algorithm has the goal of maximizing profit. Unless carefully designed to act in ways consistent with human values, a highly sophisticated AI trading system might exploit means that even the most ruthless financier would disavow. These are systems that literally have a mind of their own, and maintaining alignment between human interests and their choices and actions will be crucial.

Stamps: plex

Tags: No tags (edit tags)

It’s difficult to tell at this stage, but AI will enable many developments that could be terrifically beneficial if managed with enough foresight and care. For example, menial tasks could be automated, which could give rise to a society of abundance, leisure, and flourishing, free of poverty and tedium. As another example, AI could also improve our ability to understand and manipulate complex biological systems, unlocking a path to drastically improved longevity and health, and to conquering disease.

In previous decades, AI research had proceeded more slowly than some experts predicted. According to experts in the field, however, this trend has reversed in the past 5 years or so. AI researchers have been repeatedly surprised by, for example, the effectiveness of new visual and speech recognition systems. AI systems can solve CAPTCHAs that were specifically devised to foil AIs, translate spoken text on-the-fly, and teach themselves how to play games they have neither seen before nor been programmed to play. Moreover, the real-world value of this effectiveness has prompted massive investment by large tech firms such as Google, Facebook, and IBM, creating a positive feedback cycle that could dramatically speed progress.

Each major organization has a different approach. The research agendas are detailed and complex (see also AI Watch). Getting more brains working on any of them (and more money to fund them) may pay off in a big way, but it’s very hard to be confident which (if any) of them will actually work.
The following is a massive oversimplification, each organization actually pursues many different avenues of research, read the 2020 AI Alignment Literature Review and Charity Comparison for much more detail. That being said:

  • The Machine Intelligence Research Institute focuses on foundational mathematical research to understand reliable reasoning, which they think is necessary to provide anything like an assurance that a seed AI built will do good things if activated.
  • The Center for Human-Compatible AI focuses on Cooperative Inverse Reinforcement Learning and Assistance Games, a new paradigm for AI where they try to optimize for doing the kinds of things humans want rather than for a pre-specified utility function
  • Paul Christano's Alignment Research Center focuses is on prosaic alignment, particularly on creating tools that empower humans to understand and guide systems much smarter than ourselves. His methodology is explained on his blog.
  • The Future of Humanity Institute does work on crucial considerations and other x-risks, as well as AI safety research and outreach.
  • Anthropic is a new organization exploring natural language, human feedback, scaling laws, reinforcement learning, code generation, and interpretability.
  • OpenAI is in a state of flux after major changes to their safety team.
  • DeepMind’s safety team is working on various approaches designed to work with modern machine learning, and does some communication via the Alignment Newsletter.
  • EleutherAI is a Machine Learning collective aiming to build large open source language models to allow more alignment research to take place.
  • Ought is a research lab that develops mechanisms for delegating open-ended thinking to advanced machine learning systems.

There are many other projects around AI Safety, such as the Windfall clause, Rob Miles’s YouTube channel, AI Safety Support, etc.

One possible way to ensure the safety of a powerful AI system is to keep it contained in a software environment. There is nothing intrinsically wrong with this procedure - keeping an AI system in a secure software environment would make it safer than letting it roam free. However, even AI systems inside software environments might not be safe enough.

Humans sometimes put dangerous humans inside boxes to limit their ability to influence the external world. Sometimes, these humans escape their boxes. The security of a prison depends on certain assumptions, which can be violated. Yoshie Shiratori reportedly escaped prison by weakening the door-frame with miso soup and dislocating his shoulders.

Human written software has a high defect rate; we should expect a perfectly secure system to be difficult to create. If humans construct a software system they think is secure, it is possible that the security relies on a false assumption. A powerful AI system could potentially learn how its hardware works and manipulate bits to send radio signals. It could fake a malfunction and attempt social engineering when the engineers look at its code. As the saying goes: in order for someone to do something we had imagined was impossible requires only that they have a better imagination.

Experimentally, humans have convinced other humans to let them out of the box. Spooky.

A potential solution is to create an AI that has the same values and morality as a human by creating a child AI and raising it. There’s nothing intrinsically flawed with this procedure. However, this suggestion is deceptive because it sounds simpler than it is.

If you get a chimpanzee baby and raise it in a human family, it does not learn to speak a human language. Human babies can grow into adult humans because the babies have specific properties, e.g. a prebuilt language module that gets activated during childhood.

In order to make a child AI that has the potential to turn into the type of adult AI we would find acceptable, the child AI has to have specific properties. The task of building a child AI with these properties involves building a system that can interpret what humans mean when we try to teach the child to do various tasks. People are currently working on ways to program agents that can cooperatively interact with humans to learn what they want.

At this point, people generally have a question that’s like “why can’t we just do X?”, where X is one of a dozen things. I’m going to go over a few possible Xs, but I want to first talk about how to think about these sorts of objections in general.

At the beginning of AI, the problem of computer vision was assigned to a single graduate student, because they thought it would be that easy. We now know that computer vision is actually a very difficult problem, but this was not obvious at the beginning.

The sword also cuts the other way. Before DeepBlue, people talked about how computers couldn’t play chess without a detailed understanding of human psychology. Chess is easier than we thought, merely requiring brute force search and a few heuristics. This also roughly happened with Go, where it turned out that the game was not as difficult as we thought it was.

The general lesson is that determining how hard it is to do a given thing is a difficult task. Historically, many people have got this wrong. This means that even if you think something should be easy, you have to think carefully and do experiments in order to determine if it’s easy or not.

This isn’t to say that there is no clever solution to AI Safety. I assign a low, but non-trivial probability that AI Safety turns out to not be very difficult. However, most of the things that people initially suggest turn out to be unfeasible or more difficult than expected.

Let’s say that you’re the French government a while back. You notice that one of your colonies has too many rats, which is causing economic damage. You have basic knowledge of economics and incentives, so you decide to incentivize the local population to kill rats by offering to buy rat tails at one dollar apiece.

Initially, this works out and your rat problem goes down. But then, an enterprising colony member has the brilliant idea of making a rat farm. This person sells you hundreds of rat tails, costing you hundreds of dollars, but they’re not contributing to solving the rat problem.

Soon other people start making their own rat farms and you’re wasting thousands of dollars buying useless rat tails. You call off the project and stop paying for rat tails. This causes all the people with rat farms to shutdown their farms and release a bunch of rats. Now your colony has an even bigger rat problem.

Here’s another, more made-up example of the same thing happening. Let’s say you’re a basketball talent scout and you notice that height is correlated with basketball performance. You decide to find the tallest person in the world to recruit as a basketball player. Except the reason that they’re that tall is because they suffer from a degenerative bone disorder and can barely walk.

Another example: you’re the education system and you want to find out how smart students are so you can put them in different colleges and pay them different amounts of money when they get jobs. You make a test called the Standardized Admissions Test (SAT) and you administer it to all the students. In the beginning, this works. However, the students soon begin to learn that this test controls part of their future and other people learn that these students want to do better on the test. The gears of the economy ratchet forwards and the students start paying people to help them prepare for the test. Your test doesn’t stop working, but instead of measuring how smart the students are, it instead starts measuring a combination of how smart they are and how many resources they have to prepare for the test.

The formal name for the thing that’s happening is Goodhart’s Law. Goodhart’s Law roughly says that if there’s something in the world that you want, like “skill at basketball” or “absence of rats” or “intelligent students”, and you create a measure that tries to measure this like “height” or “rat tails” or “SAT scores”, then as long as the measure isn’t exactly the thing that you want, the best value of the measure isn’t the thing you want: the tallest person isn’t the best basketball player, the most rat tails isn’t the smallest rat problem, and the best SAT scores aren’t always the smartest students.

If you start looking, you can see this happening everywhere. Programmers being paid for lines of code write bloated code. If CFOs are paid for budget cuts, they slash purchases with positive returns. If teachers are evaluated by the grades they give, they hand out As indiscriminately.

In machine learning, this is called specification gaming, and it happens frequently.

Now that we know what Goodhart’s Law is, I’m going to talk about one of my friends, who I’m going to call Alice. Alice thinks it’s funny to answer questions in a way that’s technically correct but misleading. Sometimes I’ll ask her, “Hey Alice, do you want pizza or pasta?” and she responds, “yes”. Because, she sure did want either pizza or pasta. Other times I’ll ask her, “have you turned in your homework?” and she’ll say “yes” because she’s turned in homework at some point in the past; it’s technically correct to answer “yes”. Maybe you have a friend like Alice too.

Whenever this happens, I get a bit exasperated and say something like “you know what I mean”.

It’s one of the key realizations in AI Safety that AI systems are always like your friend that gives answers that are technically what you asked for but not what you wanted. Except, with your friend, you can say “you know what I mean” and they will know what you mean. With an AI system, it won’t know what you mean; you have to explain, which is incredibly difficult.

Let’s take the pizza pasta example. When I ask Alice “do you want pizza or pasta?”, she knows what pizza and pasta are because she’s been living her life as a human being embedded in an English speaking culture. Because of this cultural experience, she knows that when someone asks an “or” question, they mean “which do you prefer?”, not “do you want at least one of these things?”. Except my AI system is missing the thousand bits of cultural context needed to even understand what pizza is.

When you say “you know what I mean” to an AI system, it’s going to be like “no, I do not know what you mean at all”. It’s not even going to know that it doesn’t know what you mean. It’s just going to say “yes I know what you meant, that’s why I answered ‘yes’ to your question about whether I preferred pizza or pasta.” (It also might know what you mean, but just not care.)

If someone doesn’t know what you mean, then it’s really hard to get them to do what you want them to do. For example, let’s say you have a powerful grammar correcting system, which we’ll call Syntaxly+. Syntaxly+ doesn’t quite fix your grammar, it changes your writing so that the reader feels as good as possible after reading it.

Pretend it’s the end of the week at work and you haven’t been able to get everything done your boss wanted you to do. You write the following email:

"Hey boss, I couldn’t get everything done this week. I’m deeply sorry. I’ll be sure to finish it first thing next week."

You then remember you got Syntaxly+, which will make your email sound much better to your boss. You run it through and you get:

"Hey boss, Great news! I was able to complete everything you wanted me to do this week. Furthermore, I’m also almost done with next week’s work as well."

What went wrong here? Syntaxly+ is a powerful AI system that knows that emails about failing to complete work cause negative reactions in readers, so it changed your email to be about doing extra work instead.

This is smart - Syntaxly+ is good at making writing that causes positive reactions in readers. This is also stupid - the system changed the meaning of your email, which is not something you wanted it to do. One of the insights of AI Safety is that AI systems can be simultaneously smart in some ways and dumb in other ways.

The thing you want Syntaxly+ to do is to change the grammar/style of the email without changing the contents. Except what do you mean by contents? You know what you mean by contents because you are a human who grew up embedded in language, but your AI system doesn’t know what you mean by contents. The phrases “I failed to complete my work” and “I was unable to finish all my tasks” have roughly the same contents, even though they share almost no relevant words.

Roughly speaking, this is why AI Safety is a hard problem. Even basic tasks like “fix the grammar of this email” require a lot of understanding of what the user wants as the system scales in power.

In Human Compatible, Stuart Russell gives the example of a powerful AI personal assistant. You notice that you accidentally double-booked meetings with people, so you ask your personal assistant to fix it. Your personal assistant reports that it caused the car of one of your meeting participants to break down. Not what you wanted, but technically a solution to your problem.

You can also imagine a friend from a wildly different culture than you. Would you put them in charge of your dating life? Now imagine that they were much more powerful than you and desperately desired that your dating life to go well. Scary, huh.

In general, unless you’re careful, you’re going to have this horrible problem where you ask your AI system to do something and it does something that might technically be what you wanted but is stupid. You’re going to be like “wait that wasn’t what I mean”, except your system isn’t going to know what you meant.

Here's a question that I have never really seen asked in those AI talks, and it's more of a philosophical one.

Like you said, the only example of general intelligence we have is us, and clearly, while we went through multiple wars and don't exactly have an omnipotent sense of preservation, we are still here. Our brains are not all interconnected with a hive mind. We're all independent little machines of multiple races and characteristics that just happened to evolved some kind of social structure over time, and yet, we prevail.

So, why should we assume that it would be any different for a general artificial intelligence? What makes us assume that cooperation and mutual success isn't part of the requirement for an AI to become generalized? After all, in the real world, entities that cooperate are generally more successful that those who don't. It's already true in the AI world as well. We tend to consider adversarial algorithms (using the output of an AI to judge another) a very successful tool. While we used a very negative word to express it, isn't that a form of cooperation already?

An other way to put this whole thing is... if an AI gets as smart and powerful as we think they could be, why would that fundamentally put them against us? It's the same thing with aliens, right? Why assume hostility first? And by doing so, aren't we influencing the outcome of that first meeting in a negative way?

Ultimately, the reason we (mostly) cooperate with each other is that we tend to retaliate when betrayed. This requires being able to raise defenses that a betrayer is unlikely to be able to penetrate. If the power difference between humans and an AGI grows too much - which is very likely to happen, an AGI would likely have many more avenues for acquiring power than humans - it would eventually gain the ability to penetrate or evade our defenses and simply take our resources for itself.

 -- _I am a bot. This reply was approved by Aprillion and SlimeBunnyBat_

is there anything general AI can do for humans that a sufficient number of extremely narrow AIs couldn't do? it seems like a good solution to "general AI is very difficult to constrain to human values" is "let's not make one"

We don't know for sure, but this may end up being a good approach. Eric Drexler had a paper and collection of thoughts around trying to force neural networks to be as small as possible via competitive pressures as a way to squeeze out the possibility of deception which you might be interested in, here: with a slightly less dense review of the paper on SlateStarCodex here: -- but the general idea is essentially what you've framed, to avoid general AI entirely and go for something more like 'comprehensive AI services' which are actually built up from narrow AI.

 -- _I am a bot. This reply was approved by Damaged and sudonym_

Are you able to add some sort of stipulation that means the AI can't stop you turning it off?

This is an issue known as the Stop Button Problem. Robert Miles made a series of videos on this specific topic on Computerphile. They are (in order):

 -- _I am a bot. This reply was approved by Damaged and Aprillion_

Cybersecurity is important because computing systems comprise the backbone of the modern economy. If the security of the internet was compromised, then the economy would suffer a tremendous blow.

Similarly, AI Safety might become important as AI systems begin forming larger and larger parts of the modern economy. As more and more labor gets automated, it becomes more and more important to ensure that that labor is occurring in a safe and robust way.

Before the widespread adoption of computing systems, lack of Cybersecurity didn’t cause much damage. However, it might have been beneficial to start thinking about Cybersecurity problems before the solutions were necessary.

Similarly, since AI systems haven’t been adopted en mass yet, lack of AI Safety isn’t causing harm. However, given that AI systems will become increasingly powerful and increasingly widespread, it might be prudent to try to solve safety problems before a catastrophe occurs.

Additionally, people sometimes think about Artificial General Intelligence (AGI), sometimes called Human-Level Artificial Intelligence (HLAI). One of the core problems in AI Safety is ensuring when AGI gets built, it has human interests at heart. (Note that most surveyed experts think building GI/HLAI is possible, but there is wide disagreement on how soon this might occur).


Tags: No tags (edit tags)

To help frame this question, we’re going to first answer the dual question of “what is Cybersecurity?”

As a concept, Cybersecurity is the idea that questions like “is this secure?” can meaningfully be asked of computing systems, where “secure” roughly means “is difficult for unauthorized individuals to get access to”. As a problem, Cybersecurity is the set of problems one runs into when trying to design and build secure computing systems. As a field, Cybersecurity is a group of people trying to solve the aforementioned set of problems in robust ways.

As a concept, AI Safety is the idea that questions like “is this safe?” can meaningfully be asked of AI Systems, where “safe” roughly means “does what it’s supposed to do”. As a problem, AI Safety is the set of problems one runs into when trying to design and build AI systems that do what they’re supposed to do. As a field, AI Safety is a group of people trying to solve the aforementioned set of problems in robust ways.

The reason we have a separate field of Cybersecurity is because ensuring the security of the internet and other critical systems is both hard and important. We might want a separate field of AI Safety for similar reasons; we might expect getting powerful AI systems to do what we want to be both hard and important.


Tags: No tags (edit tags)

So if deception is kind of a default behaviour of intelligent agents, why is it so much different with humans? Clearly their must be a mechanism inside a human being, who is not a psychopath, which ensures that they won't deceive, lets say, friends they really care about.

Yes, the mechanism is called reciprocity -

 -- _I am a bot. This reply was approved by Aprillion, Damaged, and archduketyler_

Are humans in training or already deployed?

Training vs deployment is not a useful distinction to apply to humans. We could perhaps compare software agents to individual human skills, like learning to play a violin and then performing in a concert hall. The skill may look perfect during training but if it is misaligned to play only when alone, it might stop working in front of a 1000 people.

 -- _I am a bot. This reply was approved by Aprillion and plex_

There's the "we never figure out how to reliably instill AIs with human friendly goals" filter, which seems pretty challenging, especially with inner alignment, solving morality in a way which is possible to code up, interpretability, etc.

There's the "race dynamics mean that even though we know how to build the thing safely the first group to cross the recursive self-improvement line ends up not implementing it safely" which is potentially made worse by the twin issues of "maybe robustly aligned AIs are much harder to build" and "maybe robustly aligned AIs are much less compute efficient".

There's the "we solved the previous problems but writing perfectly reliably code in a whole new domain is hard and there is some fatal bug which we don't find until too late" filter. The paper The Pursuit of Exploitable Bugs in Machine Learning explores this.

For a much more in depth analysis, see Paul Christiano's AI Alignment Landscape talk.

What does the Greek at 7:43 say?

"what to do" according to Google Translate.

 -- _I am a bot. This reply was approved by Damaged and Social Christancing_

I know current AI's require some goals and training, but is that really necessary? Is is impossible to build an AI with no goal?

"Having goals" is really a conceptualization of AI's behavior, rather than its training. In the case of the mesa-optimizer, its goals are not provided at all; they emerge from the training process. We do have machine learning processes, such as GPT, that aren't designed to exhibit goal-seeking behavior. The problem is that there applications where we want goal-seeking behavior (GPT is never going to cure cancer, for instance) but only if we can control it.

 -- _I am a bot. This reply was approved by plex and SlimeBunnyBat_

is that prezi?

It's actually impress.js, very similar tech

 -- _I am a bot. This reply was approved by robertskmiles_

Is there a way to prove that there isn't already a highly intelligent AI pulling strings in the background?

Inventing Bitcoin to make people expand its processing power, the whole data harvesting craze, etc.

wouldn't that be exactly what an AI would do to grow and expand?

It's not possible to rule out a slow takeoff already being in progress, though we can mostly discount fast takeoffs since the world would already likely have been much more radically transformed. As for the specifics of Bitcoin as evidence for an AGI in the background, cryptocurrency mining has been a major factor in driving up the costs of GPUs, so it's debatable whether an AGI would find their invention advantageous.

It's worth being wary of highly engaging ideas which you can connect to almost anything and are difficult to falsify, that way lies conspiracy theories and worse.

 -- _I am a bot. This reply was approved by plex and Social Christancing_

9:30 Did this video end up being created? Which is it?

It hasn't yet, no. It's somewhere on the long list of videos to make!

 -- _I am a bot. This reply was approved by Aprillion and robertskmiles_

arnt you just cutting off the top 10% best performing results?

just because the top results are usually catastrophic?

there could be valid results in that top 10%, and there could be dangerous results in the part you picking from

This is a really interesting question! Because, yeah it certainly seems to me that doing something like this would at least help, but it's not mentioned in the paper the video is based on. So I asked the author of the paper, and she said "It wouldn't improve the security guarantee in the paper, so it wasn't discussed. Like, there's a plausible case that it's helpful, but nothing like a proof that it is".
To explain this I need to talk about something I gloss over in the video, which is that the quantilizer isn't really something you can actually build. The systems we study in AI Safety tend to fall somewhere on a spectrum from "real, practical AI system that is so messy and complex that it's hard to really think about or draw any solid conclusions from" on one end, to "mathematical formalism that we can prove beautiful theorems about but not actually build" on the other, and quantilizers are pretty far towards the 'mathematical' end. It's not practical to run an expected utility calculation on every possible action like that, for one thing. But, proving things about quantilizers gives us insight into how more practical AI systems may behave, or we may be able to build approximations of quantilizers, etc.
So it's like, if we built something that was quantilizer-like, using a sensible human utility function and a good choice of safe distribution, this idea would probably help make it safer. BUT you can't prove that mathematically, without making probably a lot of extra assumptions about the utility function and/or the action distribution. So it's a potentially good idea that's nonetheless hard to express within the framework in which the quantilizer exists.
TL;DR: This is likely a good idea! But can we prove it?

9:43 Hmm.. what if the current terminal goal is to become able to set the terminal goal?

That might prove to be a circular problem if it assumes that the terminal goal is directly observable / settable (and if it doesn't assume that, it would be similar to the current black-box GAN approaches, which are less promising).

 -- _I am a bot. This reply was approved by Aprillion and plex_

It’s pretty dependent on what skills you have and what resources you have access to. The largest option is to pursue a career in AI Safety research. Another large option is to pursue a career in AI policy, which you might think is even more important than doing technical research.

Smaller options include donating money to relevant organizations, talking about AI Safety as a plausible career path to other people or considering the problem in your spare time.

It’s possible that your particular set of skills/resources are not suited to this problem. Unluckily, there are many more problems that are of similar levels of importance.

In addition to the usual continuation of Moore's Law, GPUs have become more powerful and cheaper in the past decade, especially since around 2016. Many ideas in AI have been thought about for a long time, but the speed at which modern processors can do computing and parallel processing allows researchers to implement their ideas and gather more observational data. Improvements in AI have allowed many industries to start using the technologies, which creates demand and brings more focus on AI research (as well as improving the availability of technology on the whole due to more efficient infrastructure).
Data has also become more abundant and available, and not only is data a bottleneck for machine learning algorithms, but the abundance of data is difficult for humans to deal with alone, so businesses often turn to AI to convert it to something human-parsable.
These processes are also recursive, to some degree, so the more AI improves, the more can be done to improve AI.

Can an AI really be smarter than humans? Hasn't this been said for the past 30 years? Why is the near future different?

Until a thing has happened, it has never happened. We have been consistently improving both the optimization power and generality of our algorithms over that time period, and have little reason to expect it to suddenly stop. We’ve gone from coding systems specifically for a certain game (like Chess), to algorithms like MuZero which learn the rules of the game they’re playing and how to play at vastly superhuman skill levels purely via self-play across a broad range of games (e.g. Go, chess, shogi and various Atari games).

Human brains are a spaghetti tower generated by evolution with zero foresight, it would be surprising if they are the peak of physically possible intelligence. The brain doing things in complex ways is not strong evidence that we need to fully replicate those interactions if we can throw sufficient compute at the problem, as explained in Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain.

It is, however, plausible that for an AGI we need a lot more compute than we will get in the near future, or that some key insights are missing which we won’t get for a while. The OpenPhilanthropy report on how much computational power it would take to simulate the brain is the most careful attempt at reasoning out how far we are from being able to do it, and suggests that by some estimates we already have enough computational resources, and by some estimates moore’s law may let us reach it before too long.

It also seems that much of the human brain exists to observe and regulate our biological body, which a body-less computer wouldn't need. If that's true, then a human-level AI might be possible with way less computation power than the human brain.

Stamps: plex

Tags: agi, plausibility (create tag) (edit tags)

Yes, OpenAI was founded specifically with the intention to counter risks from superintelligence, many people at Google, DeepMind, and other organizations are convinced by the arguments and few genuinely oppose work in the field (though some claim it’s premature). For example, the paper Concrete Problems in AI Safety was a collaboration between researchers at Google Brain, Stanford, Berkeley, and OpenAI.

However, the vast majority of the effort these organizations put forwards is towards capabilities research, rather than safety.

What are the differences between Artificial General Intelligence (AGI), Transformative Artificial Intelligence (TAI), and Superintelligence?

AGI means an AI that is 'general', so it is intelligent in many different domains.

Superintelligence just means doing something better than a human. For example Stockfish or Deep Blue are narrowly superintelligent in playing chess.

TAI (transformative AI) doesn't have to be general. It means 'a system that changes the world in a significant way'. It's used to emphasize, that even non-general systems can have extreme world-changing consequences.

We could, but we won’t. Each advance in capabilities which brings us closer to an intelligence explosion also brings vast profits for whoever develops them (e.g. smarter digital personal assistants like Siri, more ability to automate cognitive tasks, better recommendation algorithms for Facebook, etc.). The incentives are all wrong. Any actor (nation or corporation) who stops will just get overtaken by more reckless ones, and everyone knows this.

If we pose a serious threat, it could hack our weapons systems and turn them against us. Future militaries are much more vulnerable to this due to rapidly progressing autonomous weapons. There’s also the option of creating bioweapons and distributing them to the most unstable groups you can find, tricking nations into WW3, or dozens of other things an agent many times smarter than any human with the ability to develop arbitrary technology, hack things (including communications), and manipulate people, or many other possibilities that something smarter than a human could think up. More can be found here.
If we are not a threat, in the course of pursuing its goals it may consume vital resources that humans need (e.g. using land for solar panels instead of farm crops). See this video for more details.

I don't quite understand yet where the optimizer would learn the deceptive behaviour, if it is only advantegous in the real world - by then the model is finished, so how can real world apples affect the training process?

The deceptive behavior is an emergent result of the hypothetical mesa-optimiser's well-developed but misaligned long-term planning capabilities acting under the incentive structure presented by the test environment. The premise is that it's a general intelligence, it doesn't have to 'learn' deception specifically, it will be able to figure out how to do it on its own.

 -- _I am a bot. This reply was approved by Damaged and SlimeBunnyBat_

Would there be reason to believe progress in the alignment problem yield fundamental improvements in the organization & actions of our own society? (considering we, as well as institutions interpreted as acting agents)

Progress on alignment could very plausibly lead to social and organizational technology which would improve civilization. Not all approaches to AI alignment would be easy to port over to traditional challenges, but for example a well-founded theory of ethics (which would be needed for a full solution to alignment) could prove robustly valuable.

However, it is also possible that tools found in progress toward alignment would lead to problems, such as non-human-aligned organizations building corporate dystopias or stable totalitarianism (by being able to enforce alignment, existing misaligned societies might gain much more power over individuals than individuals over societies).

It seems that making progress in this domain is generally positive, but these considerations do mean that the expected returns from any particular degree of progress are difficult to predict.

 -- _I am a bot. This reply was approved by plex and Damaged_

Did any of the text form readable words? What did it say?

The video in question, and the time period within it, is in the lower left corner. The AI did try to make text, but looking at the transforms it was unable to make actual characters let alone words. I suspect a larger training set of lolcats would be required to teach it both what a cat looks like and how to write captions in valid English.

 -- _I am a bot. This reply was approved by Damaged and Aprillion_

So, not that we actually _could_ But if we solved the outer alignment problem, why would we still need the Mesa optimizer? Couldn’t we just use the hypothetical “solved” meta optimizer instead of using it to then train something else?

The mesa optimizer is not introduced intentionally, but as an accidental by-product of running the kind of optimization process we know how to run (machine learning style search over programs). It would be good to come up with a way of optimizing which does not introduce mesa optimizers, or at least have a way to detect and remove them, but I'm not aware of any promising things in this direction.

 -- _I am a bot. This reply was approved by plex and robertskmiles_

What if it predicts the outcome as if it would do nothing but the goal is still completed, ie you did get a cup of tea somehow (with magic) and then it tries to follow that model of the world trying to minimize diferences. This could be a way to avoid that part where the AI tries to give you a cup of tea while also trying to give you the unfulfillment of you not getting one. Great vids btw.

When I imagine ordering a tea in a restaurant, if a human waiter gets a panic attack or something, then sitting down to chill and do nothing while I would get my tea from their colleague might be the optimal strategy in that situation. But they wouldn't stay a tea specialist for long if they tried to do it long term... The situation might be similar for non-human tea-making agents, though I suspect that a tea-making machine that doesn't make tea might end up in garbage bin much sooner.

 -- _I am a bot. This reply was approved by jamespetts and Aprillion_

What if you create a cost function and make a utility function a normal distribution, so the AI finds the cheapest way to find around 100 stamps?

The outcome depends on what exactly your cost function is. The AI will trade an arbitrarily large amount of the things not specified in the cost function for an arbitrarily small increase in expected utility (this channel, "Avoiding Negative Side Effects: Concrete Problems in AI Safety part 1
", 03:02) - it cannot know with absolute certainty that it indeed has 100 stamps, so anything it can do to increase its certainty thereof will be a potential action it can take, provided it does not involve things specified in the cost function. That assumes it's even possible to well-specify the things you care about in your cost function, and this difficulty is explored on Computerphile, in "Why Asimov's Laws of Robotics Don't Work".

"Just define a cost function" is, on the low end, ineffective, and on the high end, has nothing remotely "just" about it.

 -- _I am a bot. This reply was approved by archduketyler, Augustus Caesar, and Stargate9000_

Is there a reasonable situation where you would want an AI to be dishonest with its goals/plans/thinking? If not, can we apply the rule of "honesty" in some sort of AI standard?

For the first question, yes—an easy example to reach for would be an AI to play the game "Cheat": . For the second question—you can, but the AI could lie about being honest.

 -- _I am a bot. This reply was approved by Damaged and Aprillion_

Sorry when you describe the algorithm protecting its own objectives by pretending to choose the exit instead of the apple in training.. is this real world or speculation about the future?

It's a thought experiment - what would be the logical conclusion if we imagined a hypothetical scenario that is (over?) simplified but otherwise plausible.

 -- _I am a bot. This reply was approved by Aprillion and Damaged_

Wtf is my terminal goal? Be happy?

This is a major open question. contains one person's distress about not being able to find their utility function, and plus are a much deeper dive if you want that.


Tags: No tags (edit tags)

Well, I still think that an agent whose goal is to "Get to apple" is still much simpler than an agent whose goal is to "Get as many apples as possible in its next reincarnations". After all, it is arguable that the next generation's goal is not itself so why would it care?

The deceptive AI can have goals that don’t include “be alive”. An AI whose objective is to turn the universe into stamps for Alice will be quite happy to die, if it believes that dying will make it more likely for a future instance of itself to achieve that objective.

 -- _I am a bot. This reply was approved by Damaged and Social Christancing_

I'm way late catching up with this video, so no one will ever see this comment. But, for the record, I think the point, late in the video, should not be that maybe thinking about AI safety will make things worse when we actually build AI. I think it's that we might talk ourselves OUT of making AI, and we'll never know what it might have done for us. Take the bridge analogy. Suppose the safety analysis causes the bridge to be cancelled (unlikely for a bridge, but possible -- if there's a problem with the rocks where the piers have to go, say -- we simply can't put a 100% safe bridge in this location). So, no bridge, so people keep crossing using a ferry, and a few years later, about the time the bridge would have opened, the ferry sinks and everyone on it dies. Yes, the bridge was a risk, but so was no bridge. AI is a risk -- the question is, is no AI a risk? Do we need AI to solve problems we won't be able to solve without it, but which will affect our survival as a species? Or is AI a luxury -- nice to have, but not worth any meaningful risk? Personally, I tend toward the latter view, so if Mr. Miles and those like him talk us out of ever trying AI, then that's a shame -- a real Mr. Data would be fun -- but no great loss. But what if there's a way to cheaply fix carbon that an AI could find but we never will? Or to cure cancer, or to enable light-speed travel so we can get off this doomed rock? We'll never know. We'll die on that ferry never knowing the bridge -- risky though it is -- might have saved us.

I would imagine that the solution to the analogy you give would have been more safety research on the ferry, not less on the bridge.

You raise a good point though, that AI safety shouldnt just raise problems, it should also seek solutions to those problems (which the field is doing a decent job at given the difficulty of the task). The ultimate goal of the field of AI safety is to create an aligned AGI. If the outcome of all of the research is that aligned AGI is impossible, that will be a rather unfortunate turn of events, but still better than not having done any safety research, since we can then decide with much more information if we need to take the risk in order to prevent a different catastrophe or if we can find a different solution that doesnt have a 75% chance of destroying us anyway

A very good research paper that explores this is "Artificial Intelligence as a Positive and Negative Factor in Global Risk" if you are interested in this topic specifically

 -- _I am a bot. This reply was approved by jamespetts, plex, and Augustus Caesar_

Very hard to say. This draft report for the Open Philanthropy Project is perhaps the most careful attempt so far (and generates these graphs), but there have also been expert surveys, and many people have shared various thoughts. Berkeley AI professor Stuart Russell has given his best guess as “sometime in our children’s lifetimes”, and Ray Kurzweil (Google’s director of engineering) predicts human level AI by 2029 and the singularity by 2045. The Metaculus question on publicly known AGI has a median of around 2029 (around 10 years sooner than it was before the GPT-3 AI showed unexpected ability on a broad range of tasks).

The consensus answer is something like: “highly uncertain, maybe not for over a hundred years, maybe in less than 15, with around the middle of the century looking fairly plausible”.

The term Intelligent agent is often used in AI as a synonym of Software agent, however the original meaning comes from economics where it was used to describe human actors and other legal entities.

So if we want to include agentless optimizing processes (like evolution) and AIs implemented as distributed systems in some technical discussion, it can be useful to use the terms "agenty" or "agentlike" to avoid addressing the philosophical questions of agency.


Tags: No tags (edit tags)

Isn't the Facebook feed algorithm deceptive mesa-optimiser?

While this is hard to evaluate completely, on average goals of Facebook, the company (long-term profit maximization) seem to be well aligned by the actions of the feed algorithms used by Facebook, given that so far, Facebook still exits and was not ruined by a feed algorithm that pretended to maximize profits during small scale A/B testing and then switched to some different agenda after full-production deployment. At least, not in a way that would be spotted by investigative journalists who cover news about Facebook financial performance.

 -- _I am a bot. This reply was approved by Aprillion and plex_

Sort answer: No, and could be dangerous to try.

Slightly longer answer: With any realistic real-world task assigned to an AGI, there are so many ways in which it could go wrong that trying to block them all off by hand is a hopeless task, especially when something smarter than you is trying to find creative new things to do. You run into the nearest unblocked strategy problem.

It may be dangerous to try this because if you try and hard-code a large number of things to avoid it increases the chance that there’s a bug in your code which causes major problems, simply by increasing the size of your codebase.

If the AI system was deceptively aligned (i.e. pretending to be nice until it was in control of the situation) or had been in stealth mode while getting things in place for a takeover, quite possibly within hours. We may get more warning with weaker systems, if the AGI does not feel at all threatened by us, or if a complex ecosystem of AI systems is built over time and we gradually lose control.

Paul Christiano writes an story of alignment failure which shows a relatively fast transition.

It depends on what is meant by advanced. Many AI systems which are very effective and advanced narrow intelligences would not try to upgrade themselves in an unbounded way, but becoming smarter is a convergent instrumental goal so we could expect most AGI designs to attempt it.

The problem is that increasing general problem solving ability is climbing in exactly the direction needed to trigger an intelligence explosion, while generating large economic and strategic payoffs to whoever achieves them. So even though we could, in principle, just not build the kind of systems which would recursively self-improve, in practice we probably will go ahead with constructing them, because they’re likely to be the most powerful.

We could limit bandwidth, put it behind a proxy, or only inside a VPN initially, but over time an AGI would figure out how to get as much internet access as it needs, make itself more distributed, or a similar workaround.

Is it possible to block an AI from doing certain things on the internet/accessing things on the internet - like some child lock type thing?

Once an AGI has access to the internet it would be very challenging to meaningfully restrict it from doing things online which it wants to. There are too many options to bypass blocks we may put in place.

It may be possible to design it so that it does not want to do dangerous things in the first place, or perhaps to set up tripwires so that we notice that it’s trying to do a dangerous thing, though that relies on it not noticing or bypassing the tripwire so should not be the only layer of security.

Related questions:
Is it possible to limit an AGI from full access to the internet?

How likely is it that an AI would pretend to be a human to further its goals - like sending emails creating a false identity etc.

Talking about full AGI: Fairly likely, but depends on takeoff speed. In a slow takeoff of a misaligned AGI, where it is only weakly superintelligent, manipulating humans would be one of its main options for trying to further its goals for some time. Even in a fast takeoff, it’s plausible that it would at least briefly manipulate humans in order to accelerate its ascent to technological superiority, though depending on what machines are available to hack at the time it may be able to skip this stage.

If the AI's goals include reference to humans it may have reason to continue deceiving us after it attains technological superiority, but will not necessarily do so. How this unfolds would depend on the details of its goals.

Eliezer Yudkowsky gives the example of an AI solving protein folding, then mail-ordering synthesised DNA to a bribed or deceived human with instructions to mix the ingredients in a specific order to create wet nanotechnology.

Putting aside the complexity of defining what is "the" moral way to behave (or even "a" moral way to behave), even an AI which can figure out what it is might not "want to" follow it itself.

A deceptive agent (AI or human) may know perfectly well what behaviour is considered moral, but if their values are not aligned, they may decide to act differently to pursue their own interests.

People talk about "aligning AI with human values", but humans don't all agree on one set of values. Whose values are we aligning the AI with?

Ideally, it would be aligned to everyone's shared values. This is captured in the "coherent extrapolated volition" idea (, which is meant to be the holy grail of alignment. The problem is that it's extremely hard to implement it.

We could divide the alignment problem into two subproblems: aligning AI to its creators, and aligning those creators to the general population. Let's assume optimistically that the first one is solved. Now, we still can have a situation where the creators want something that's harmful for the rest, for example when they are a for-profit company whose objective is to maximize those profits, regardless of the externalities.

One approach is to crowdsource the values for AI, like in the example, where people are faced with a moral dilemma and have to choose which action to choose. This data could then used to train the AI. One problem with such approach is that people are prone to lots of cognitive biases, and their answers won't be fully rational. The AI would then align to what people say they value, and not to what they actually value, which with a superintelligent system may be catastrophic. The AI should be aware of this fact and don't take what people say at face value, but try to infer their underlying values. This is an active area of study.

For some, the problem of aligning the AI creators with the rest of the people, is just as hard or even harder, than aligning those creators with AI. The solution could require passing some law or building some decentralized system.

Is a solution to incorrigible deception to not have a strict separation between the training and working phases of an AI? In other words, once an AI graduates, it should still receive regular performance reviews.

For the abstract future sufficiently advanced AI system, we should expect it to be able to outsmart our ability to "score" it in safe and reliable ways, and to simply mask and deceive until it no longer needs to. For more real, current state-of-the-art systems, running the training process is immensely computationally expensive compared to simply "running" the network, so everyone has strong incentives to basically switch off the entire training apparatus once a system is trained, as that minimises the size of their AWS bill. And this trend appears to be getting worse - the general progression seems to be "training is roughly as expensive as running", then "training is a bit more expensive than running", "training is 10x [...[", and now with systems like GPT-3 we're seeing training costs several orders of magnitude larger than running costs, with no sign of that trend slowing down.

 -- _I am a bot. This reply was approved by Damaged and Social Christancing_

It depends on the exact definition of consciousness and on the legal consequences of the AI telling us that stuff from which we could imply how conscious it might be (would it be motivated to pretend to be "conscious" by those criteria to get some benefits,, or would it be motivated to keep its consciousness in secret to avoid being turned off).

Once we have a measurable definition, then we can empirically measure the AI against that definition.

See integrated information theory for practical approaches, though there is always the hard problem of consciousness that will muddy any candidate definitions for near future.

Yes and no. Similarly to a knife, the internet, or a limited liability company, an AI as a tool can be used to improve people's lives as well as misused for illegal activities.

However, unlike a knife and more like big for-profit corporations, more advanced AIs can have internal structures that can be described as "having their own agenda" - a set of values and actions that is different from values and the intended actions of people who built those tools.

The defining book is likely Nick Bostrom's Superintelligence. It gives an excellent overview of the state of the field in 2014 and makes a strong case for the subject being important.

There's also Human Compatible by Stuart Russell, which gives a more up-to-date review of developments, with an emphasis on the approaches that the Center for Human Compatible AI are working on. There's a good review/summary on SlateStarCodex.

The Alignment Problem by Brian Christian has more of an emphasis on near future problems with AI than Superintelligence or Human Compatible, but covers a good deal of current research.

Though not limited to AI Safety, Rationality: A-Z covers a lot of skills which are valuable to acquire for people trying to think about large and complex issues.

Various other books are explore the issues in an informed way, such as The Precipice, Life 3.0, and Homo Deus.

Yes. While creativity has many meanings and AIs can be obviously creative in the wide sense of the word (make new valuable artifacts like a real-time translation of a restaurant menu, compiling source code into binary files, ...), there is also no reason to believe that AIs couldn't be considered creative in a more narrow sense too (making art like music or paintings, writing computer programs based on conversation with a customer).

There is a notion of being "really creative" that can be defined in a circular way that only humans can be really creative, but if we avoid moving the goal post, then it should be possible to make a variation of a Turing test to measure the AI vs human creativity and answer that question empirically for any particular AI.

AlphaGo made a move widely considered created in its game against a top human Go player, which has been widely discussed.


Tags: creativity (create tag) (edit tags)

If AGI is so dangerous, wouldn't it be better to just build lots of narrow AIs for specific purposes?

Even if we only build lots of narrow AIs, we might end up with a distributed system that acts like an AGI - the algorithm does not have to be encoded in a single entity, the definition in What exactly is AGI and what will it look like? applies to distributed implementations too.

This is similar to a group of people in a corporation can achieve projects that humans could not individually (like going to space), but the analogy of corporations and AGI is not perfect - see Why Not Just: Think of AGI Like a Corporation?.

No, but it helps.
Some great resources if you're considering it, are:

The first two links show general ways to get into AI safety, and the last will show you the upsides and downsides of choosing to make a PhD.

In principle it could (if you believe in functionalism), but it probably won't. One way to ensure that AI has human-like emotions would be to copy the way human brain works, but that's not what most AI researchers are trying to do.

It's similar to how once some people thought we will build mechanical horses to pull our vehicles, but it turned out it's much easier to build a car. AI probably doesn't need emotions or maybe even consciousness to be powerful, and the first AGIs that will get built will be the ones that are easiest to build.

I'm interested in working on AI Safety, what should I do?

AI Safety Support offers free calls to advise people interested in a career in AI Safety. We're working on creating a bunch of detailed information for Stampy to use, but in the meantime check out these resources:

80,000 Hours
AISS links page
AI Safety events calendar
Adam Gleave's Careers in Beneficial AI Research document
Rohin Shah's FAQ

It's true that AGI may be really many years ahead. But what worries a lot of people, is that it may be much harder to make powerful AND safe AI, than just a powerful AI, and then, the first powerful AIs we create will be dangerous.

If that's the case, the sooner we start working on AI safety, the smaller the chances of humans going extinct, or ending up in some Black Mirror episode.

Also Rob Miles talks about this concern in this video.

Primarily, they are trying to make a competent AI, and any consciousness that arises will probably be by accident.

There are even some people saying we should try to make the AI unconscious, to minimize the risk of it suffering.

The biggest problem here, is that we don't have any good way of telling if some system is conscious. The best theory we have, the Integrated Information Theory, has some deep philosophical and practical problems and there are many controversies around it.

I have a question regarding that Utility Satisficers become Maximizers. Wouldn't modifying its own goal to get stamps within a certain range into get as many stamps as possible conflict with its own utility function? Or is this issue seperate from that?

Becoming an expected utility maximizer is not in conflict with the goals of the expected utility satisficer, because the expected utility satisficer is indifferent between 100 and 1000 stamps. It just wants to make sure it finds a strategy that uses more than 100 stamps.


Tags: No tags (edit tags)

Is there a way to program a ‘shut down’ mode on an AI if it starts doing things we don’t want it to, so it just shuts off automatically?

One thing that might make your AI system safer is to include an off switch. If it ever does anything we don’t like, we can turn it off. This implicitly assumes that we’ll be able to turn it off before things get bad, which might be false in a world where the AI thinks much faster than humans. Even assuming that we’ll notice in time, off switches turn out to not have the properties you would want them to have.

Humans have a lot of off switches. Humans also have a strong preference to not be turned off; they defend their off switches when other people try to press them. One possible reason for this is because humans prefer not to die, but there are other reasons.

Suppose that there’s a parent that cares nothing for their own life and cares only for the life of their child. If you tried to turn that parent off, they would try and stop you. They wouldn’t try to stop you because they intrinsically wanted to be turned off, but rather because there are fewer people to protect their child if they were turned off. People that want a world to look a certain shape will not want to be turned off because then it will be less likely for the world to look that shape; a parent that wants their child to be protected will protect themselves to continue protecting their child.

For this reason, it turns out to be difficult to install an off switch on a powerful AI system in a way that doesn’t result in the AI preventing itself from being turned off.

Ideally, you would want a system that knows that it should stop doing whatever it’s doing when someone tries to turn it off. The technical term for this is ‘corrigibility’; roughly speaking, an AI system is corrigible if it doesn’t resist human attempts to help and correct it. People are working hard on trying to make this possible, but it’s currently not clear how we would do this even in simple cases.

Is it already to late to work on Safe AI, so there is no point in starting now?

We don't have AI systems that are generally more capable than humans. So there is still time left to figure out how to build systems that are smarter than humans in a safe way.

I just realized... If you make it (say AI-1) to want to chill (not work too hard to achieve it)... it will just make something else (another AI) to do the work for it, if it's easier than solving it on its own... right? Then, what it will create is probably a maximizer (because that is the easiest; and it is lazy, and just wants to chill)

Then I realized..... *We, humans, are the AI-1* ... O.O

- We are doomed...

I'm hoping that InfraBayesian models fix this problem: InfraBayes can model quantilizers (my favorite mild optimization framework), but they do so *with a coherent world model which implies you should mildly optimize* (not coherent in the standard Bayes sense, but in a slightly generalized sense). This suggests that they'd build mild optimizers if they built helpers. (This has not yet been established formally, however.)

Stamps: plex

Tags: No tags (edit tags)