MantisBT - ParaView
View Issue Details
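ID: 0010530
Project: ParaView
Category: Bug
View Status: public
Date Submitted: 2010-04-09 13:38
Last Update: 2011-05-16 21:50
Reporter: Alan Scott
Assigned To: Ken Moreland
Priority: normal
Severity: minor
Reproducibility: always
Status: closed
Resolution: fixed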
0010530: ParaView does not scale well to huge numbers of cores
ParaView seems to have a problem scaling to very large numbers of cores. When I try to go past roughly 5360 cores (± a few dozen), I get out-of-resource errors from MPI on random cores. For all practical purposes, passing this limit is not necessary at this time, but it will be before too long.
No tags attached.
related to 0010261 (closed): ParaView does not scale above 1024 processors well
related to 0010672 (closed, Utkarsh Ayachit): Slow client side rendering due to communication with server
Issue History
2010-04-09 13:38  Alan Scott  New Issue
2010-04-09 17:12  Alan Scott  Note Added: 0020115
2010-04-14 13:40  Ken Moreland  Note Added: 0020182
2010-04-14 13:40  Ken Moreland  Status: backlog => tabled
2010-04-14 13:40  Ken Moreland  Assigned To: => Ken Moreland
2010-04-14 13:41  Ken Moreland  Relationship added: related to 0010261
2010-06-11 11:12  Utkarsh Ayachit  Relationship added: related to 0010672
2010-06-11 11:18  Utkarsh Ayachit  Note Added: 0020991
2010-06-11 12:03  Utkarsh Ayachit  Note Added: 0020993
2010-09-01 11:27  Utkarsh Ayachit  Target Version: 4.0 => 3.10.shortlist
2011-05-11 18:19  Ken Moreland  Note Added: 0026499
2011-05-11 18:19  Ken Moreland  Status: tabled => @80@
2011-05-11 18:19  Ken Moreland  Resolution: open => fixed
2011-05-16 21:50  Alan Scott  Note Added: 0026513
2011-05-16 21:50  Alan Scott  Status: @80@ => closed

Notes
(0020115)
Alan Scott   
2010-04-09 17:12   
Here is a copy of the output on the server side:

[rs181][[21896,1],0][connect/btl_openib_connect_oob.c:847:qp_create_one] error creating qp errno says Resource temporarily unavailable

[rs181][[21896,1],0][connect/btl_openib_connect_oob.c:1193:rml_recv_cb] error in endpoint reply start connect
[rs181:23477] [[21896,0],0]-[[21896,1],0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)


Also, note that the failing core seems to be random (so this probably isn't a localized hardware issue), and that the failure occurs right at the end of establishing the ParaView client/server link, at ParaView initialization time.
(0020182)
Ken Moreland   
2010-04-14 13:40   
I am pretty sure I have traced this problem to vtkPVProgressHandler::CleanupSatellites. Process 0 receives a message from a few nodes, and then everything (but those nodes it received from) locks up.

My unverified suspicion is that there is a bunch of unhandled asynchronous progress messages sent to process 0 that are filling up the MPI buffers and not allowing this Cleanup to finish.

I also suspect that even when CleanupSatellites completes, there might be several unhandled messages left over. The method attempts to cancel the communication, but even a canceled communication can complete. I don't think that is ever checked.

I need to talk this over with Utkarsh.
(0020991)
Utkarsh Ayachit   
2010-06-11 11:18   
Ken is absolutely right (no surprise there ;)). The issue is indeed my interpretation of MPI_Test(). From the MPI documentation for MPI_Test:

"For send operations, the only use of status is for MPI_Test_cancelled or in the case that there is an error, in which case the MPI_ERROR field of status will be set."

However, the satellites are using it to determine whether the message was received by the root, which is WRONG. This results in the satellites choking the MPI communication channels with progress events.
(0020993)
Utkarsh Ayachit   
2010-06-11 12:03   
commit 8d0c8aa5288b368e7d4193ad8424e19ea7a28104
Author: Utkarsh Ayachit <utkarsh.ayachit@kitware.com>
Date: Fri Jun 11 12:01:22 2010 -0400

    Performance improvement for BUG 0010530.
    
    Ensuring that progresses are not sent anywhere unless a 2 sec timeout is passed.
    Reduces the frequency of progress events.
    
    There was a bug in vtkProcessModuleConnectionManager which was not initializing
    the self connection, consequently progress wasn't working in built-in mode.
    Fixed that as well.
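
For readers who want to see the idea behind the 2 sec timeout, here is a minimal sketch of time-based throttling. It is not the code from the commit: the class name ProgressThrottle and its ShouldSend method are hypothetical, and the sketch only assumes that a progress event is dropped unless the configured interval has elapsed since the last one that was let through (or since the throttle was created).

```cpp
#include <chrono>

// Hypothetical helper, not a ParaView class: lets a progress event through
// only if at least `interval` has elapsed since the last event that was sent
// (or since construction, for the first event).
class ProgressThrottle {
public:
    using Clock = std::chrono::steady_clock;

    explicit ProgressThrottle(std::chrono::milliseconds interval)
        : interval_(interval), last_(Clock::now()) {}

    // Returns true when the caller should send a progress event at time `now`.
    // The time point is a parameter so the logic can be exercised without sleeping.
    bool ShouldSend(Clock::time_point now = Clock::now()) {
        if (now - last_ < interval_) {
            return false;  // too soon since the last event: drop this one
        }
        last_ = now;  // remember when the last event was let through
        return true;
    }

private:
    std::chrono::milliseconds interval_;
    Clock::time_point last_;
};
```

A satellite process would construct one throttle with a 2000 ms interval and call ShouldSend() before each progress send, skipping the send when it returns false. With this scheme the number of progress messages headed for process 0 is bounded by elapsed time rather than by how often the filters report progress.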
(0026499)
Ken Moreland   
2011-05-11 18:19   
I believe we addressed the issue that caused this bug. We continue to perform scaling studies on interactive ParaView, but I do not think there is any further need for this bug.
(0026513)
Alan Scott   
2011-05-16 21:50   
I agree with Ken. The bug reported here is a problem either with MPI (for which I have a workaround) or with IceT (for which we again have a replacement).