Spark RPC timeout causing task failure: Attempted to get executor loss reason for executor id 17 at RPC address 192.168.48.172:59070, but got no response. Marking as slave lost.


The log output is as follows:

Attempted to get executor loss reason for executor id 17 at RPC address 192.168.48.172:59070, but got no response. Marking as slave lost.
java.io.IOException: Failed to send RPC 9102760012410878153 to /192.168.48.172:59047: java.nio.channels.ClosedChannelException
at org.apache.spark.network.client.TransportClient.lambda$sendRpc$2(TransportClient.java:237) ~[spark-network-common_2.11-2.2.0.jar:2.2.0]
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) ~[netty-all-4.0.43.Final.jar:4.0.43.Final]
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) ~[netty-all-4.0.43.Final.jar:4.0.43.Final]
at io.netty.util.concurrent.DefaultPromise.access$000(DefaultPromise.java:34) ~[netty-all-4.0.43.Final.jar:4.0.43.Final]
at io.netty.util.concurrent.DefaultPromise$1.run(DefaultPromise.java:431) ~[netty-all-4.0.43.Final.jar:4.0.43.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399) ~[netty-all-4.0.43.Final.jar:4.0.43.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446) ~[netty-all-4.0.43.Final.jar:4.0.43.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131) ~[netty-all-4.0.43.Final.jar:4.0.43.Final]
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) ~[netty-all-4.0.43.Final.jar:4.0.43.Final]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_101]
Caused by: java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[netty-all-4.0.43.Final.jar:4.0.43.Final]

Symptom

The driver-side log shows an RPC communication error, the heartbeat is therefore considered timed out, and the executor is killed by YARN. There are two approaches to this problem:

  1. Insufficient driver or executor memory: during a GC pause the JVM cannot answer RPC calls, so the heartbeat times out. How to locate the problem (see the command sketch after this list):
  • Driver side: find the driver's pid, then check GC and memory usage with jstat -gcutil <pid>, or inspect the heap with jmap -heap <pid>.
  • Executor side: find the executor's pid (the Executors page of the Spark UI shows each executor's host and port; use them to locate the server the executor runs on and its pid), then check that process's memory usage in the same way.
  2. Large clock skew between the driver's server and an executor's server. A difference of more than one minute should be corrected right away. The root cause is simple: when the two servers' clocks differ too much, a round trip that normally completes within 1 ms is judged a timeout, because the two Java processes compute different timestamps and the driver concludes the response arrived too late. Most articles only offer the first fix, adding executor memory, which does not necessarily solve the problem. Most clusters already run clock synchronization, so why does a large skew still appear? Check whether chronyd is enabled on the server: if you use ntp, a running chronyd will interfere with it, so chronyd can be disabled (commands below; a quick way to verify the skew is sketched after this section).

    How to disable chronyd and switch back to ntpd:

    systemctl disable chronyd
    systemctl stop chronyd
    systemctl enable ntpd
    systemctl start ntpd
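
For point 1, here is a minimal sketch of the memory check, assuming a Linux host with the JDK tools (jps, jstat, jmap) on the PATH; <pid> is a placeholder for the driver or executor process id found in the first step:

    # find the Spark JVM process id on this host
    jps -lm | grep -i spark

    # GC usage for that pid, sampled every 1s for 10 samples; frequent full GCs (FGC)
    # with the old generation (O) staying near 100% point to memory pressure
    jstat -gcutil <pid> 1000 10

    # one-shot view of heap configuration and current usage
    jmap -heap <pid>

If the old generation stays close to 100% and the full-GC count keeps climbing, raising --driver-memory or --executor-memory at submit time (or reducing the job's memory footprint) is the appropriate fix.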
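For point 2, a quick way to confirm the skew and to see which time daemon is actually active; this is only a sketch, and 192.168.48.172 is the executor host taken from the log above:

    # compare wall-clock time on the driver host and on the executor host
    date +%s; ssh 192.168.48.172 date +%s

    # check whether chronyd and ntpd are both installed/active (chronyd interferes with ntp)
    systemctl status chronyd
    systemctl status ntpd

    # with ntpd running, show the current offset to the configured time servers
    ntpq -p

If the two timestamps differ by more than a second or two, fix the time source first before touching Spark memory settings.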
