Analyzing Apache DolphinScheduler’s Fault Tolerance Mechanism

Both the Apache DolphinScheduler Master and Worker components support multi-node deployment with a decentralized design. The Master is mainly responsible for splitting the DAG workflow and distributing tasks to Worker nodes through RPC. The worker is responsible for executing tasks and reporting the task status back to the Master, which processes this information.


This content originally appeared on HackerNoon and was authored by William Guo

Both the Apache DolphinScheduler Master and Worker components support multi-node deployment with a decentralized design.

\

  • The Master is mainly responsible for splitting the DAG workflow and distributing tasks to Worker nodes through RPC, as well as handling task status updates from the Worker.

  • The worker is responsible for executing tasks and reporting the task status back to the Master, which processes this information.

    \

But what happens in case of failure?

  1. What if the Master fails? Since it’s responsible for managing workflow instances, the Worker can no longer report task statuses, and the Master can’t process them either.

  2. What if the Worker fails? Since it’s the one executing the actual tasks, how does the Master handle this situation?

    \

Let’s dive into the fault tolerance mechanism with the help of an illustration:

Fault Tolerance

\ Here’s a breakdown of how DolphinScheduler handles failures:

\

  • If the Master fails: Other Master nodes will handle the failover using a distributed lock. The workflow instance will switch from the failed Master to a new Master. In this case, the system will issue the new Master’s host address to the Worker, enabling it to report the task status to the new Master.

\

  • If the Worker fails: The system retries the task. However, before retrying, it needs to kill the task that is still running on YARN. Currently, DolphinScheduler does not support this out-of-the-box, because in a non-client-separated mode, the ProcessBuilder.waitFor method waits for the client process to exit. The applicationId is parsed only after this process exits, meaning the system can only obtain the applicationId after the program has finished running.

    \

In other words, you can’t get the applicationIduntil the process is completed.

\ Here’s the relevant code section:

org.apache.dolphinscheduler.server.master.service.WorkerFailoverService#killYarnTask 
private void killYarnTask(TaskInstance taskInstance, ProcessInstance processInstance) {
    try {
        if (!masterConfig.isKillApplicationWhenTaskFailover()) {
            return;
        }
        if (StringUtils.isEmpty(taskInstance.getHost()) || StringUtils.isEmpty(taskInstance.getLogPath())) {
            return;
        }
        TaskExecutionContext taskExecutionContext = TaskExecutionContextBuilder.get()
                .buildWorkflowInstanceHost(masterConfig.getMasterAddress())
                .buildTaskInstanceRelatedInfo(taskInstance)
                .buildProcessInstanceRelatedInfo(processInstance)
                .buildProcessDefinitionRelatedInfo(processInstance.getProcessDefinition())
                .create();
        log.info("TaskInstance failover begin kill the task related yarn or k8s job");
        ILogService iLogService =
                SingletonJdkDynamicRpcClientProxyFactory.getProxyClient(taskInstance.getHost(), ILogService.class);
        GetAppIdResponse getAppIdResponse =
                iLogService.getAppId(new GetAppIdRequest(taskInstance.getId(), taskInstance.getLogPath()));
        ProcessUtils.killApplication(getAppIdResponse.getAppIds(), taskExecutionContext);
    } catch (Exception ex) {
        log.error("Kill yarn task error", ex);
    }
}

What can be done?

In version 1.3.3, the LoggerServer and Masterwere separated, allowing the Master node (if it had the YARN client) to kill the applicationId running on YARN. So what now?

Two Possible Solutions:

  1. Master kills the task using the YARN REST API:
curl -X PUT -d '{"state":"KILLED"}' \
    -H "Content-Type: application/json" \
    http://xx.xx.xx.xx:8088/ws/v1/cluster/apps/application_1694766249884_1098/state?user.name=hdfs

:::info Note: You need to specify the user.

:::

\ 2. Worker kills the task: \n In this case, the task should be marked as a failover task. During retry, the task should be scheduled on a designated Worker node. Before retrying, the runningapplicationId needs to be killed. One optimization would be to first check the YARN status before killing. If the status is abnormal, then kill it. If it's RUNNING, you can wait for a set timeout period.

\


This content originally appeared on HackerNoon and was authored by William Guo


Print Share Comment Cite Upload Translate Updates
APA

William Guo | Sciencx (2024-10-27T13:39:09+00:00) Analyzing Apache DolphinScheduler’s Fault Tolerance Mechanism. Retrieved from https://www.scien.cx/2024/10/27/analyzing-apache-dolphinschedulers-fault-tolerance-mechanism/

MLA
" » Analyzing Apache DolphinScheduler’s Fault Tolerance Mechanism." William Guo | Sciencx - Sunday October 27, 2024, https://www.scien.cx/2024/10/27/analyzing-apache-dolphinschedulers-fault-tolerance-mechanism/
HARVARD
William Guo | Sciencx Sunday October 27, 2024 » Analyzing Apache DolphinScheduler’s Fault Tolerance Mechanism., viewed ,<https://www.scien.cx/2024/10/27/analyzing-apache-dolphinschedulers-fault-tolerance-mechanism/>
VANCOUVER
William Guo | Sciencx - » Analyzing Apache DolphinScheduler’s Fault Tolerance Mechanism. [Internet]. [Accessed ]. Available from: https://www.scien.cx/2024/10/27/analyzing-apache-dolphinschedulers-fault-tolerance-mechanism/
CHICAGO
" » Analyzing Apache DolphinScheduler’s Fault Tolerance Mechanism." William Guo | Sciencx - Accessed . https://www.scien.cx/2024/10/27/analyzing-apache-dolphinschedulers-fault-tolerance-mechanism/
IEEE
" » Analyzing Apache DolphinScheduler’s Fault Tolerance Mechanism." William Guo | Sciencx [Online]. Available: https://www.scien.cx/2024/10/27/analyzing-apache-dolphinschedulers-fault-tolerance-mechanism/. [Accessed: ]
rf:citation
» Analyzing Apache DolphinScheduler’s Fault Tolerance Mechanism | William Guo | Sciencx | https://www.scien.cx/2024/10/27/analyzing-apache-dolphinschedulers-fault-tolerance-mechanism/ |

Please log in to upload a file.




There are no updates yet.
Click the Upload button above to add an update.

You must be logged in to translate posts. Please log in or register.