Hystrix基本知识
写在前面
以下所有都是,我从Hystrix官网的文档中找出翻译的
一、五个知识点
1.1、What Is Hystrix?
In a distributed environment, inevitably some of the many service dependencies will fail. Hystrix is a library that helps you control the interactions between these distributed services by adding latency tolerance and fault tolerance logic. Hystrix does this by isolating points of access between the services, stopping cascading failures across them, and providing fallback options, all of which improve your system’s overall resiliency.
在分布式环境中,不可避免地会有许多服务依赖项中的某些失败。 Hystrix是一个库,可通过添加等待时间容限和容错逻辑来帮助您控制这些分布式服务之间的交互。 Hystrix通过隔离服务之间的访问点,停止服务之间的级联故障并提供后备选项来实现此目的,所有这些都可以提高系统的整体弹性。
History of Hystrix
Hystrix evolved out of resilience engineering work that the Netflix API team began in 2011. In 2012, Hystrix continued to evolve and mature, and many teams within Netflix adopted it. Today tens of billions of thread-isolated, and hundreds of billions of semaphore-isolated calls are executed via Hystrix every day at Netflix. This has resulted in a dramatic improvement in uptime and resilience.
Hystrix源自Netflix API团队于2011年开始的弹性工程工作。2012年,Hystrix不断发展和成熟,Netflix内的许多团队都采用了它。 如今,每天在Netflix上通过Hystrix执行数百亿个线程隔离和数千亿个信号量隔离的调用。 这极大地提高了正常运行时间和弹性。
1.2、What Is Hystrix For?
Hystrix is designed to do the following:
Give protection from and control over latency and failure from dependencies accessed (typically over the network) via third-party client libraries.
Stop cascading failures in a complex distributed system.
Fail fast and rapidly recover.
Fallback and gracefully degrade when possible.
Enable near real-time monitoring, alerting, and operational control.
- 通过第三方客户端库提供保护,并控制延迟和失败(通过网络(通常是通过网络)访问)的依赖性。
- 停止复杂的分布式系统中的级联故障。
- 快速失败并快速恢复。
- 回退并在可能的情况下正常降级。
- 启用近乎实时的监视,警报和操作控制。
1.3、What Problem Does Hystrix Solve?
Applications in complex distributed architectures have dozens of dependencies, each of which will inevitably fail at some point. If the host application is not isolated from these external failures, it risks being taken down with them.
For example, for an application that depends on 30 services where each service has 99.99% uptime, here is what you can expect:
复杂分布式体系结构中的应用程序具有数十种依赖关系,每种依赖关系不可避免地会在某个时刻失败。 如果主机应用程序未与这些外部故障隔离开来,则可能会被淘汰。
例如,对于依赖于30个服务的应用程序,其中每个服务的正常运行时间为99.99%,您可以期望:
99.9930 = 99.7% uptime
0.3% of 1 billion requests = 3,000,000 failures
2+ hours downtime/month even if all dependencies have excellent uptime.
99.9930 = 99.7%的正常运行时间
10亿个请求中的0.3%= 3,000,000个失败
即使所有依赖项都具有出色的正常运行时间,每月也会有2个小时以上的停机时间。
Reality is generally worse.
Even when all dependencies perform well the aggregate impact of even 0.01% downtime on each of dozens of services equates to potentially hours a month of downtime if you do not engineer the whole system for resilience.
即使您没有对整个系统进行永续性设计,即使所有依赖项都能很好地执行,即使0.01%的停机时间对数十种服务中的每一项的总影响也等于每月可能停机数小时。
…
…
…
All of these represent failure and latency that needs to be isolated and managed so that a single failing dependency can’t take down an entire application or system.
所有这些代表故障和延迟,需要对其进行隔离和管理,以使单个故障依赖项无法关闭整个应用程序或系统。
1.4、What Design Principles Underlie Hystrix?
Hystrix works by:
- Preventing any single dependency from using up all container (such as Tomcat) user threads.
- Shedding load and failing fast instead of queueing.
- Providing fallbacks wherever feasible to protect users from failure.
- Using isolation techniques (such as bulkhead, swimlane, and circuit breaker patterns) to limit the impact of any one dependency.
- Optimizing for time-to-discovery through near real-time metrics, monitoring, and alerting
- Optimizing for time-to-recovery by means of low latency propagation of configuration changes and support for dynamic property changes in most aspects of Hystrix, which allows you to make real-time operational modifications with low latency feedback loops.
- Protecting against failures in the entire dependency client execution, not just in the network traffic.
翻译如下
- 防止任何单个依赖项耗尽所有容器(例如Tomcat)用户线程。
- 减少负载并快速失败,而不是排队。
- 在可行的情况下提供备用,以保护用户免受故障的影响。
- 使用隔离技术(例如隔板,泳道和断路器模式)来限制任何一种依赖关系的影响。
- 通过近实时指标,监控和警报来优化发现时间
- 通过低延迟传播配置更改来优化恢复时间,并在Hystrix的大多数方面支持动态属性更改,这使您可以通过低延迟反馈环路进行实时操作修改。
- 防止整个依赖客户端执行失败,而不仅仅是网络通信失败。
1.5、How Does Hystrix Accomplish Its Goals?
Hystrix does this by:
- Wrapping all calls to external systems (or “dependencies”) in a HystrixCommand or HystrixObservableCommand object which typically executes within a separate thread (this is an example of the command pattern).
- Timing-out calls that take longer than thresholds you define. There is a default, but for most dependencies you custom-set these timeouts by means of “properties” so that they are slightly higher than the measured 99.5th percentile performance for each dependency.
- Maintaining a small thread-pool (or semaphore) for each dependency; if it becomes full, requests destined for that dependency will be immediately rejected instead of queued up.
- Measuring successes, failures (exceptions thrown by client), timeouts, and thread rejections.
- Tripping a circuit-breaker to stop all requests to a particular service for a period of time, either manually or automatically if the error percentage for the service passes a threshold.
- Performing fallback logic when a request fails, is rejected, times-out, or short-circuits.
- Monitoring metrics and configuration changes in near real-time.
翻译如下:
- 将对外部系统(或“依赖项”)的所有调用包装在通常在单独线程中执行的HystrixCommand或HystrixObservableCommand对象中(这是命令模式的示例)。
- 超时呼叫花费的时间超过您定义的阈值。有一个默认值,但是对于大多数依赖项,您可以通过“属性”自定义设置这些超时,以使它们略高于针对每个依赖项测得的99.5%的性能。
- 为每个依赖项维护一个小的线程池(或信号灯);如果已满,发往该依赖项的请求将立即被拒绝,而不是排队。
- 测量成功,失败(客户端抛出的异常),超时和线程拒绝。
- 如果该服务的错误百分比超过阈值,则使断路器跳闸,以在一段时间内手动或自动停止所有对特定服务的请求。
- 当请求失败,被拒绝,超时或短路时执行回退逻辑。
- 几乎实时监控指标和配置更改。
When you use Hystrix to wrap each underlying dependency, the architecture as shown in diagrams above changes to resemble the following diagram. Each dependency is isolated from one other, restricted in the resources it can saturate when latency occurs, and covered in fallback logic that decides what response to make when any type of failure occurs in the dependency:
当您使用Hystrix封装每个基础依赖项时,如上图所示的体系结构将更改为类似于下图。 每个依赖项都是相互隔离的,受到延迟时发生饱和的资源的限制,并由后备逻辑覆盖,该逻辑决定了在依赖项中发生任何类型的故障时做出什么响应.