MantisBT - ParaView
View Issue Details
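ID: 0012720 | Project: ParaView | Category: (No Category) | View Status: public | Date Submitted: 2011-11-10 21:55 | Last Update: 2012-02-08 17:22
Reporter: Alan Scott
Assigned To: Utkarsh Ayachit
Priority: urgent | Severity: minor | Reproducibility: have not tried
Status: closed | Resolution: fixed
Product Version: 3.12
Fixed in Version: 3.14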
Sandia
12720_cth_reads_too_much
crash
0012720: CTH reads file 0 for all processes
We suspect that large Cray clusters are serializing access to single files when multiple pvservers are trying to access these single files. As we scale into the thousands of pvservers, we believe this is becoming fatal.

ParaView 3.12.0, remote server (I am using 8 processes), Linux client.
Although I am sure you can replicate with any cth dataset, I am doing the following:
* Make soft links (ln -s) to files spcta.0, spcta.1, spcta.2 and spcta.3 of Dave's big CTH AMR dataset (i.e., 256 files). Now, we have a 4 file subset of this dataset.
* strace -o $HOME/pvserver.strace -tt -f -ff -e trace=open,close,read,write
  - This will create a different file for each process. Do an ls -ls on these files. The smaller ones are not of interest; the larger ones are from lib/paraview3.12/pvserver. We care about the larger ones.
  - Note that 4 of them are slightly larger than the smaller ones. We care about these larger files.

Open each file in turn. Search for spcth. Notice that each trace shows the process opening file 0 four times, and then opening its real file twice.
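The manual search above can be automated. The following is only a sketch, not part of the original report: `count_opens`, `OPEN_CALL` and the sample lines are hypothetical, and the excerpt is made up to mirror the pattern described (file 0 opened four times, the process's own file twice). It assumes the usual `strace -tt` line layout and uses the `spcta` file names from the steps above as the search string.

```python
import re
from collections import Counter

# Matches the path argument of open()/openat() lines as strace prints them,
# e.g.  21:55:01.000001 open("/data/spcta.0", O_RDONLY) = 7
OPEN_CALL = re.compile(r'\bopen(?:at)?\((?:[^,"]*,\s*)?"([^"]+)"')


def count_opens(log_lines, needle):
    """Count open calls per path, keeping only paths that contain `needle`."""
    counts = Counter()
    for line in log_lines:
        match = OPEN_CALL.search(line)
        if match and needle in match.group(1):
            counts[match.group(1)] += 1
    return counts


# Hypothetical excerpt of one per-process trace (invented for illustration).
SAMPLE_LOG = [
    '21:55:01.000001 open("/etc/ld.so.cache", O_RDONLY) = 3',
    '21:55:01.000002 close(3) = 0',
    '21:55:02.000001 open("/data/spcta.0", O_RDONLY) = 7',
    '21:55:02.000002 read(7, "..."..., 4096) = 4096',
    '21:55:02.000003 close(7) = 0',
    '21:55:02.000004 open("/data/spcta.0", O_RDONLY) = 7',
    '21:55:02.000005 close(7) = 0',
    '21:55:02.000006 open("/data/spcta.0", O_RDONLY) = 7',
    '21:55:02.000007 close(7) = 0',
    '21:55:02.000008 open("/data/spcta.0", O_RDONLY) = 7',
    '21:55:02.000009 close(7) = 0',
    '21:55:03.000001 open("/data/spcta.2", O_RDONLY) = 7',
    '21:55:03.000002 close(7) = 0',
    '21:55:03.000003 open("/data/spcta.2", O_RDONLY) = 7',
    '21:55:03.000004 close(7) = 0',
]

for path, n in sorted(count_opens(SAMPLE_LOG, "spcta").items()):
    print(path, n)
```

To run it against real traces, pass an open file object for each per-process log written by strace -ff (one file per process ID) in place of `SAMPLE_LOG`.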

As stated, we believe that these 4 opens of file 0 are fatal for Cielo and possibly other Cray systems.

This is a showstopper bug for Cielo going into production with datasets of the expected size.

I will send the log files to Utkarsh and Robert from my run. I am marking this as a crash, although technically it is a hang (or a glacier - take your pick).
No tags attached.
parent of 0012729 | closed | Utkarsh Ayachit | vtkFileSeriesReader's MTime is being changed in ProcessRequest for several readers.
Issue History
2011-11-10 21:55 | Alan Scott | New Issue
2011-11-11 13:40 | Utkarsh Ayachit | Assigned To | => Utkarsh Ayachit
2011-11-14 10:51 | Utkarsh Ayachit | Status | backlog => todo
2011-11-14 10:51 | Utkarsh Ayachit | Status | todo => active development
2011-11-14 17:20 | Utkarsh Ayachit | Topic Name | => 12720_cth_reads_too_much
2011-11-14 17:20 | Utkarsh Ayachit | Note Added: 0027690
2011-11-14 17:20 | Utkarsh Ayachit | Status | active development => gatekeeper review
2011-11-14 17:20 | Utkarsh Ayachit | Fixed in Version | => git-next
2011-11-14 17:20 | Utkarsh Ayachit | Resolution | open => fixed
2011-11-15 13:50 | Utkarsh Ayachit | Relationship added | parent of 0012729
2011-11-18 14:53 | Utkarsh Ayachit | Fixed in Version | git-next => git-master
2011-11-18 14:54 | Utkarsh Ayachit | Status | gatekeeper review => customer review
2011-11-18 14:54 | Utkarsh Ayachit | Note Added: 0027718
2011-12-21 21:47 | Alan Scott | Note Added: 0027878
2011-12-21 21:47 | Alan Scott | Status | customer review => closed
2012-02-08 17:22 | Utkarsh Ayachit | Fixed in Version | git-master => 3.14

Notes
(0027690)
Utkarsh Ayachit   
2011-11-14 17:20   
commit 1c9d8ffd920503167e80bbbb457112aa268bfe64
Author: Utkarsh Ayachit <utkarsh.ayachit@kitware.com>
Date: Mon Nov 14 17:14:19 2011 -0500

    Fixed BUG 0012720. Minimize reads on satellites.
    
    All processes were reading first file to gather meta-data. This caused issues
    when running in parallel on large number of cores. Fixed by reading the file on
    root node and then broadcasting the gathered information to all nodes.
    
    Structured the code slightly to avoid processing of the meta-data when timesteps
    changed.
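The read-on-root-then-broadcast pattern described in this commit message can be illustrated with a small single-process simulation. This is a sketch under stated assumptions, not the code from the commit: `LoopbackComm`, `load_metadata`, `read_header` and the metadata contents are all hypothetical, and the fake communicator only works because the ranks are visited in order with the root first. With a real MPI binding, a collective broadcast call would take the place of `LoopbackComm.bcast`.

```python
class LoopbackComm:
    """Hypothetical stand-in for an MPI communicator, run in one process."""

    def __init__(self, size):
        self.size = size
        self._slot = None  # value published by the root

    def bcast(self, value, rank, root=0):
        # The root publishes its value; every other rank receives that copy.
        # Simulation-only shortcut: the root must be visited first.
        if rank == root:
            self._slot = value
        return self._slot


def load_metadata(comm, rank, read_header, root=0):
    """Only the root touches the file; satellites get the result via bcast."""
    header = read_header() if rank == root else None
    return comm.bcast(header, rank, root=root)


opens = []  # records every simulated open of file 0


def read_header():
    opens.append("spcta.0")
    return {"fields": ["pressure", "density"]}  # made-up metadata


comm = LoopbackComm(size=8)
results = [load_metadata(comm, rank, read_header) for rank in range(comm.size)]

print("opens of file 0:", len(opens))
print("ranks holding metadata:", sum(r is not None for r in results))
```

In this simulation the number of opens of file 0 stays at one regardless of the number of ranks, which is the property the fix is after.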
(0027718)
Utkarsh Ayachit   
2011-11-18 14:54   
merged to master.
(0027878)
Alan Scott   
2011-12-21 21:47   
This appears to be working very well. The only concern I have is if we find that header info is different between files. So far, so good.

This increased read speeds an incredible amount. Nice.

Tested remote server, master, Linux.