Every Kubernetes user has faced the difficulty of managing Pods with multiple containers. Modern Pods often have a main application container, a sidecar container for extra tasks (like monitoring or managing secrets), and an init container for a one-time initial process. However, all these containers were tied to the same restart rule for the whole Pod, which was often frustrating.
Previously, a single restartPolicy applied to the whole Pod, so one failing container could force the entire Pod to be restarted or rescheduled, which was inefficient. Kubernetes 1.34 introduces per-container restart policies, allowing smarter control and faster recovery. This article shows how the feature works and why it matters.
With the release of Kubernetes 1.34, the way container lifecycles are managed changes significantly. The new feature, called Container Restart Policy and Rules and currently in alpha, moves restart control down to the container level instead of only the Pod level. This is a big change that aligns much better with modern multi-container architectures.
This feature has two main parts: a container-level restartPolicy field and a list of restartPolicyRules that make restart decisions based on exit codes.
💡 To enable this feature, a cluster administrator must turn on the ContainerRestartRules feature gate.
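For a quick local experiment, one option is a kind test cluster, whose top-level featureGates setting propagates the gate to the cluster components. The configuration below is a minimal sketch; the node layout is illustrative:

# kind-config.yaml - disposable test cluster with the alpha gate enabled
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  ContainerRestartRules: true
nodes:
- role: control-plane
- role: worker

Create the cluster with kind create cluster --config kind-config.yaml. On a managed or production cluster, the gate has to be enabled on the relevant components by the platform administrator instead.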
The following table highlights the differences:
| Category | Pod restartPolicy (before K8s 1.34) | Container restartPolicy (since K8s 1.34, alpha) |
| --- | --- | --- |
| Scope | Whole Pod. One rule for all containers. | Each container can have its own rule. |
| Available rules | Always, OnFailure, Never. | Always, OnFailure, Never at the container level, plus exit-code-based rules. |
| Failure handling | No control based on exit codes; only a failed exit (OnFailure) or every exit (Always). | restartPolicyRules allow restarts based on specific exit codes. |
| Flexibility | Low. Requires complex workarounds for Pods with many containers. | High. Allows a cleaner Pod design. |
To use this feature, you can add restartPolicy and restartPolicyRules to the containers and initContainers sections in the Pod's configuration file.
Technically, restartPolicyRules is a list of rules. Each rule has two parts:
- action: The action to take when the container exits. The current options are Restart, which restarts the container, and DoNotRestart, which prevents a restart and leaves the container stopped.
- exitCodes: The condition that triggers the action. It consists of an operator (In or NotIn) and values (a list of exit codes).
This system works best when your application is designed to output clear exit codes for different types of failures.
💡 Pro Tip: Design your application to produce specific, well-documented exit codes. This will maximize the usefulness of the restartPolicyRules feature.
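For example, a container entrypoint can map failure classes to documented exit codes. The script below is a hypothetical sketch: validate-config and run-app stand in for your own commands, and the code values (1, 20) are arbitrary examples, not a Kubernetes convention.

#!/bin/sh
# Hypothetical entrypoint: translate failure classes into documented exit codes
# so restartPolicyRules can act on them (codes are arbitrary examples).
if ! ./validate-config; then
  echo "fatal: invalid configuration" >&2
  exit 1    # configuration error: restarting will not help
fi
if ! ./run-app; then
  echo "transient: upstream unreachable" >&2
  exit 20   # network error: safe to restart in place
fi
exit 0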
This feature opens the door to more resilient and intelligent Pod designs. Here are a few examples:
- Old Problem: If a database migration initContainer failed, the entire Pod would repeatedly restart, which could lead to a corrupted database.
- New Solution: You can set restartPolicy: Never for the initContainer.
- Result: If the migration fails, the Pod stops, preventing the main application from running on an incomplete or corrupted database. The main service container can still have a restartPolicy: Always for normal operation.
apiVersion: v1
kind: Pod
metadata:
  name: app-with-migration
  labels:
    app: user-service
    version: v1.0.0
spec:
  # Pod-level restartPolicy is still required by the API, but individual container policies override it.
  restartPolicy: Always
  initContainers:
  - name: db-migrator
    image: my-company/db-migration:1.0.0
    restartPolicy: Never  # A single, one-time run. If it fails, the whole Pod fails.
    env:
    - name: DB_HOST
      valueFrom:
        secretKeyRef:
          name: db-credentials
          key: host
    command: ["/migrate", "--target-version=v1.0.0"]
  containers:
  - name: user-service
    image: my-company/web-app:1.0.0
    restartPolicy: Always  # This service should always be running.
    ports:
    - containerPort: 8080
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 30
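Once the Pod is applied, you can confirm how the migration container ended with a jsonpath query (a small sketch; the field stays empty until the init container has terminated):

kubectl get pod app-with-migration -o jsonpath='{.status.initContainerStatuses[0].state.terminated.exitCode}'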
- Old Problem: For long-running AI/ML jobs, restarting the entire Pod is very expensive in terms of time and resources.
- New Solution: You can configure a container to restart only for specific, fixable errors, such as a memory issue or network problem.
- Result: The Pod can recover from temporary, retriable failures without the need to reschedule the entire job. For fatal errors like data corruption, you can use DoNotRestart rules to stop the Pod entirely.
apiVersion: v1
kind: Pod
metadata:
  name: ml-training-worker
  labels:
    job: nlp-classification
    role: worker
spec:
  restartPolicy: Never  # The Pod as a whole should not restart on failure
  containers:
  - name: training-worker
    image: ml-platform.io/trainer:v1.0.0
    restartPolicy: Never  # The container also does not restart by default
    restartPolicyRules:
    - action: Restart
      exitCodes:
        operator: In
        values: [42]  # CUDA out of memory - reduce batch size and retry
    - action: Restart
      exitCodes:
        operator: In
        values: [43]  # Network timeout during gradient sync
    - action: DoNotRestart
      exitCodes:
        operator: In
        values: [1, 2]  # Data corruption or invalid hyperparameters
...
- Old Problem: Production microservices often bundle the core application with sidecar containers for logging and monitoring. If one of these sidecars had an issue, it could affect the entire Pod's restart behavior.
- New Solution: Each container in the Pod can have its own restartPolicy. For example, the main application can be set to Always restart, while the log forwarder can be set to OnFailure.
- Result: You can ensure the core application remains highly available while giving monitoring sidecars a different restart behavior, and even use exit codes to manage specific sidecar issues, such as Elasticsearch connection problems.
apiVersion: v1
kind: Pod
metadata:
  name: payment-service-stack
  labels:
    app: payment-api
spec:
  restartPolicy: Never
  containers:
  - name: payment-api
    image: payments.io/api-server:v1.2.3
    ports:
    - containerPort: 9000
    restartPolicy: Always
    ...
  - name: prometheus-exporter
    image: prom/node-exporter:v1.3.1
    ports:
    - containerPort: 9100
    restartPolicy: OnFailure
    ...
  - name: log-forwarder
    image: fluent/fluent-bit:1.9.3
    restartPolicy: OnFailure
    restartPolicyRules:
    - action: Restart
      exitCodes:
        operator: In
        values: [1, 2]  # Elasticsearch connection issues
    - action: DoNotRestart
      exitCodes:
        operator: In
        values: [125]  # Configuration syntax error
...
This table provides an example of suggested exit code patterns and corresponding restart actions:
| Error Type | Example Exit Codes | Description & Context | Recommended restartPolicyRules |
| --- | --- | --- | --- |
| Configuration | 1, 2, 3 | Incorrect configuration, missing variables. Fatal; cannot be fixed by restarting. | DoNotRestart, because the Pod will never succeed. |
| Resources | 10, 11, 12 | Out of memory, disk full, CPU too busy. Can be retried. | Restart, because a restart can resolve this temporary issue. |
| Network | 20, 21, 22 | Connection timed out, DNS failure. Often temporary and retriable. | Restart, so the container can recover its connection without rescheduling the Pod. |
| Application | 30, 31, 32 | Logic error, data validation failure. Can be fatal or not, depending on the application. | Depends on the application's design. |
| System | 40, 41, 42 | Container forced to stop, node low on resources. | Restart, because the problem is outside the application and can be retried. |
For informed restart decisions, applications should exit with codes that accurately reflect the cause of failure.
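As a sketch of how the table could translate into a manifest, the fragment below retries resource and network errors in place and gives up on configuration errors. The image name is hypothetical and the exit codes follow the illustrative convention above, not any standard:

# Illustrative container fragment; exit codes follow the example convention above
- name: worker
  image: example.com/worker:1.0.0   # hypothetical image
  restartPolicy: Never
  restartPolicyRules:
  - action: Restart
    exitCodes:
      operator: In
      values: [10, 11, 12, 20, 21, 22]   # resource and network errors: retry in place
  - action: DoNotRestart
    exitCodes:
      operator: In
      values: [1, 2, 3]                  # configuration errors: restarting will not help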
This feature is not just a small fix, but a big step forward in operational efficiency.
More Efficient
The biggest benefit is better efficiency. When a container in a Pod fails, it can restart in-place, without needing to reschedule the entire Pod. Rescheduling a Pod takes time and resources to pull the image and mount new volumes.
With in-place restarts, recovery time can be much faster. Restart times can be reduced from the typical 30-60 seconds to just 5-15 seconds.
In-place restarts ensure that volumes and network configuration remain intact. This allows for a quicker recovery without losing important data. The feature also simplifies the architecture by removing the need to split functionality into different Pods.
Although the Container Restart Policy and Rules feature is very useful, you should use it carefully because it is still in the alpha stage.
⚠️ This feature is still in the alpha stage in Kubernetes 1.34. Use it carefully and avoid deploying it on mission-critical production applications until it becomes stable.
Keep the following considerations in mind when adopting the new feature: debugging, resource planning, monitoring, and alerting.
With more control comes more complex debugging: you need to check container logs and container-level restart metrics to understand why a container did or did not restart.
# Check container-specific restart counts
kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[*].restartCount}'
# Examine per-container exit codes and restart counts
kubectl describe pod my-pod | grep -E "Exit Code|Restart Count"
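If you want a per-container breakdown in one line, a jsonpath range expression works as well (my-pod is a placeholder; the exit code column is empty for containers that have never terminated):

# Print each container's name, restart count, and last termination exit code
kubectl get pod my-pod -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'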
Restart policies also change how we think about resources. Containers with Always restart policies need steady resources available, while those that Never restart can usually run with tighter limits.
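As a rough illustration (the container names and numbers below are placeholders, not recommendations), a long-lived Always container might reserve steadier resources than a run-once Never helper:

containers:
- name: main-app                # hypothetical long-lived service container
  restartPolicy: Always
  resources:
    requests:
      cpu: "250m"
      memory: "256Mi"
    limits:
      cpu: "500m"
      memory: "512Mi"
- name: one-shot-helper         # hypothetical run-once helper
  restartPolicy: Never
  resources:
    requests:
      cpu: "50m"
      memory: "64Mi"
    limits:
      cpu: "100m"
      memory: "128Mi"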
When using container-level restart policies, monitoring becomes more important.
# Example: expose the container and Pod names as env vars for Prometheus metric labels
- name: main-app
  restartPolicy: Always
  env:
  - name: CONTAINER_NAME
    value: "main-app"
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
Keep an eye on how often each container restarts, not just the Pod. Set up alerts for cases where a container fails to restart as planned, or keeps restarting when it shouldn’t.
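For example, a Prometheus rule on the kube-state-metrics restart counter can flag a container that keeps restarting. The rule below is a sketch that assumes kube-state-metrics is installed; the threshold and window are illustrative:

# prometheus-rules.yaml - hypothetical alerting rule (thresholds are illustrative)
groups:
- name: container-restarts
  rules:
  - alert: ContainerRestartingTooOften
    expr: increase(kube_pod_container_status_restarts_total[15m]) > 5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Container {{ $labels.container }} in Pod {{ $labels.pod }} restarted more than 5 times in 15 minutes"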
Kubernetes Container Restart Policy and Rules advance container management by offering granular control over Pods. This feature addresses long-standing design issues, enabling more resilient and efficient applications. Managing individual container lifecycles and smart recovery logic based on exit codes reduces latency, saves resources, and simplifies Pod architecture.
However, it’s important to note that this feature only controls restart behavior; Pod readiness semantics do not change. A Pod is marked Ready only when all of its containers are ready. For example, if a sidecar crashes while the main application container keeps running, direct communication with the main container (e.g., via the Pod IP) still works, but traffic through a Service is affected because the Pod is reported as not ready and removed from the load-balancing pool.