ONTAP Discussions

Troubleshooting High Network% and Single CPU%

Peter_UBC

I'm looking for suggestions on troubleshooting an issue we're seeing on our FAS3170 running Data ONTAP 7.3.7P3. It had been running fine, but today we noticed that one of the CPU cores is pegged at 100%, with Network% at 115% or higher. The filer is in an HA pair, and the partner is running fine while processing more ops.

Here's a snippet from sysstat:

sysstat -M 5

ANY1+ ANY2+ ANY3+ ANY4+  AVG CPU0 CPU1 CPU2 CPU3 Network Storage Raid Target Kahuna WAFL_Ex(Kahu) WAFL_XClean SM_Exempt Cifs Exempt Intr Host Ops/s   CP
 100%   36%   21%   15%  43%  25%  24%  23% 100%    114%     12%   8%     3%     5%     26%( 16%)          0%        0%   1%     3%   2%   0%  3397   0%
 100%   45%   28%   19%  48%  28%  32%  32% 100%    115%     13%  11%     3%    11%     28%( 17%)          4%        0%   2%     5%   2%   0%  3415  32%
 100%   42%   26%   19%  47%  29%  31%  28% 100%    116%     12%  11%     4%     7%     28%( 18%)          0%        0%   1%     6%   2%   0%  3258  69%
 100%   39%   24%   18%  45%  25%  30%  27% 100%    115%     12%   8%     3%     7%     29%( 18%)          0%        0%   1%     6%   2%   0%  3029   0%
 100%   46%   29%   20%  49%  29%  33%  34% 100%    115%     13%  11%     3%    12%     28%( 18%)          6%        0%   1%     6%   2%   0%  2966  31%
 100%   47%   29%   21%  50%  34%  34%  31% 100%    118%     14%  13%     3%     9%     31%( 19%)          0%        0%   1%     7%   2%   0%  3832 100%
 100%   38%   23%   16%  45%  26%  26%  26% 100%    116%     12%   7%     3%     7%     26%( 17%)          0%        0%   1%     4%   2%   0%  3918  10%
 100%   36%   22%   16%  44%  24%  28%  25% 100%    115%     11%   8%     3%     6%     27%( 17%)          0%        0%   1%     4%   2%   0%  3537   0%
 100%   67%   43%   30%  60%  44%  47%  51% 100%    121%     17%  20%     3%    22%     40%( 24%)          5%        0%   1%    10%   2%   0%  4809  89%
 100%   55%   35%   24%  54%  37%  39%  38% 100%    118%     14%  16%     3%    13%     31%( 19%)          7%        0%   1%    10%   2%   0%  4218  61%
 100%   53%   37%   27%  55%  40%  39%  39% 100%    120%     14%  13%     3%    12%     39%( 24%)          0%        0%   1%    14%   2%   0%  4108  63%
 100%   54%   38%   29%  56%  40%  41%  42% 100%    125%     15%  11%     3%    12%     42%( 26%)          0%        0%   1%    11%   2%   0%  4752   0%


I've checked the usual suspects: running sis (deduplication) processes and zombie blocks.
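
For reference, those checks were along these lines (7-Mode commands; wafl scan status requires advanced privilege and lists any running WAFL scanners, including zombie/block-reclamation scans):

sis status
priv set advanced
wafl scan status
priv set admin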


statit showed that CPU3 spent 99% of its cycles in the nwk_legacy domain. KB3014084 says nwk_legacy covers IP processing and NFS protocol processing, so nfsstat was my next stop.
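
For anyone retracing this, the per-domain CPU breakdown comes from a statit sample taken at advanced privilege; the capture looks roughly like this (the sampling window is arbitrary):

priv set advanced
statit -b
(wait 30-60 seconds while counters accumulate)
statit -e
priv set admin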


After clearing the counters and enabling per-client stats, we added up the per-volume NFS ops, and they roughly match the Ops/s reported by sysstat, which ranges between 3,500 and 6,000. Nothing a FAS3170 can't handle.
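
In case it helps anyone else, that collection was roughly as follows (flags from memory, and <volname> is a placeholder; stats list counters volume will confirm the exact per-volume counter names):

nfsstat -z
options nfs.per_client_stats.enable on
nfsstat -h
stats show volume:<volname>:total_ops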


DFM is still collecting stats, and the filer's average network throughput is around 150 Mbps over the past day, which is lower than the 180-200 Mbps average it has seen over the past week or so.
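
The DFM numbers can also be sanity-checked on the filer itself, since ifstat reports per-interface throughput and error counters; that seems worth a look given that nwk_legacy points at the network stack:

ifstat -a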


So what could cause the CPU and Network utilization to be so high?

Thanks!


2 REPLIES

deepuj (ACCEPTED SOLUTION)

Hi,


We recommend opening a support case to troubleshoot this issue.

Once you open a case, you may be asked to provide sysstat -x 1 output as well as a perfstat capture with a 5-minute interval and 20 iterations ( https://www.youtube.com/watch?v=NzSKZYJJkz4 ).
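
A perfstat invocation matching that spec would look roughly like the following, run from an admin host (the -f/-t/-i flags are from memory, so verify them against the usage notes of the perfstat version you download from NetApp Support):

perfstat.sh -f <filer_hostname> -t 5 -i 20 > perfstat_out.txt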


Thanks


Peter_UBC

Hi deepuj,

We ended up failing the workload over to the partner so we could reboot the node. The takeover definitely took longer than usual because one of the CPU cores was so busy, but after the reboot and giveback, everything is back to normal now.
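
For reference, the usual 7-Mode takeover/giveback sequence, run from the healthy partner, is:

cf takeover
(wait for the troubled node to reboot to "waiting for giveback")
cf giveback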


Thanks!
