|
Notes |
|
|
(0020115)
|
|
Alan Scott
|
|
2010-04-09 17:12
|
|
Here is a copy of the output on the server side:
[rs181][[21896,1],0][connect/btl_openib_connect_oob.c:847:qp_create_one] error creating qp errno says Resource temporarily unavailable
[rs181][[21896,1],0][connect/btl_openib_connect_oob.c:1193:rml_recv_cb] error in endpoint reply start connect
[rs181:23477] [[21896,0],0]-[[21896,1],0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
Also, note that the core that is failing seems to be random (thus it probably isn't a localized hardware issue), and that the failure occurs right at the end of establishing the ParaView client/server link, at ParaView initialization time. |
|
|
|
(0020182)
|
|
Ken Moreland
|
|
2010-04-14 13:40
|
|
I am pretty sure I have traced this problem to vtkPVProgressHandler::CleanupSatellites. Process 0 receives a message from a few nodes, and then everything (but those nodes it received from) locks up.
My unverified suspicion is that there is a bunch of unhandled asynchronous progress messages sent to process 0 that are filling up the MPI buffers and not allowing this Cleanup to finish.
I also suspect that even when CleanupSatellites completes, there might be several unhandled messages left over. The method attempts to cancel the communication, but even a canceled communication can complete. I don't think that is ever checked.
I need to talk this over with Utkarsh. |
|
|
|
(0020991)
|
|
Utkarsh Ayachit
|
|
2010-06-11 11:18
|
|
Ken is absolutely right (no surprise there ;)). The issue is indeed my interpretation of MPI_Test(). From the MPI documentation for MPI_Test:
"For send operations, the only use of status is for MPI_Test_cancelled or in the case that there is an error, in which case the MPI_ERROR field of status will be set."
However, the satellites are using it to determine whether the message was received by the root, which is WRONG. This results in the satellites choking the MPI communication channels with progress events. |
|
|
|
(0020993)
|
|
Utkarsh Ayachit
|
|
2010-06-11 12:03
|
|
commit 8d0c8aa5288b368e7d4193ad8424e19ea7a28104
Author: Utkarsh Ayachit <utkarsh.ayachit@kitware.com>
Date: Fri Jun 11 12:01:22 2010 -0400
Performance improvement for BUG 0010530.
Ensuring that progresses are not sent anywhere unless a 2 sec timeout is passed.
Reduces the frequency of progress events.
There was a bug in vtkProcessModuleConnectionManager which was not initializing
the self connection, consequently progress wasn't working in built-in mode.
Fixed that as well. |
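For readers following the thread, the throttling idea in the commit message above can be sketched roughly as follows. This is only an illustration and not the code from the commit: the class name `ProgressThrottle` and its method `ShouldSend` are hypothetical, and it assumes the timer starts when the throttle is created, so the first progress event is also held back until the interval has elapsed.

```cpp
#include <chrono>

// Hypothetical helper, NOT ParaView's actual implementation: lets at most one
// progress event through per interval (2 seconds in the commit message).
class ProgressThrottle
{
public:
  using Clock = std::chrono::steady_clock;

  // Assumption: the timer starts at construction time (or at `start`,
  // which is exposed only so the behavior can be checked deterministically).
  explicit ProgressThrottle(std::chrono::milliseconds interval,
                            Clock::time_point start = Clock::now())
    : Interval(interval), Last(start)
  {
  }

  // Returns true if at least `Interval` has elapsed since the last event that
  // was allowed through (or since the start time), and restarts the timer.
  bool ShouldSend(Clock::time_point now = Clock::now())
  {
    if (now - this->Last < this->Interval)
    {
      return false; // too soon: drop this progress event
    }
    this->Last = now;
    return true;
  }

private:
  std::chrono::milliseconds Interval;
  Clock::time_point Last;
};
```

A sender would construct one throttle with `std::chrono::seconds(2)` and call `ShouldSend()` before each progress send, skipping the send whenever it returns false. Dropping events this way reduces how many messages can pile up at the receiver; it does not by itself fix the MPI_Test() misinterpretation described in the earlier note.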
|
|
|
(0026499)
|
|
Ken Moreland
|
|
2011-05-11 18:19
|
|
|
I believe we addressed the issue that caused this bug. We continue to perform scaling studies on interactive ParaView, but I do not think there is any further need for this bug. |
|
|
|
(0026513)
|
|
Alan Scott
|
|
2011-05-16 21:50
|
|
|
I agree with Ken. The bug reported here is a problem either with MPI (for which I have a workaround) or with IceT, for which we again have a replacement. |
|