Scraper::dq::cq_zip_check::CQDB Class Reference
Inheritance diagram for Scraper::dq::cq_zip_check::CQDB (figure not included).
Collaboration diagram for Scraper::dq::cq_zip_check::CQDB (figure not included).


Public Member Functions

def __init__
def create_indices
def seqno_groups
def runno_groups
def runranges
def count
def zip_scan_runno
def zip_scan_seqno
def zip_scan_seqno_sets

Public Attributes

 step
 tabs
 runr

Private Member Functions

def _seqno_range
def _runno_range

Detailed Description

Definition at line 61 of file cq_zip_check.py.


Constructor & Destructor Documentation

def Scraper::dq::cq_zip_check::CQDB::__init__ (   self,
  sect,
  step = 100,
  tabs = 'DqChannel DqChannelStatus'.split() 
)

Definition at line 62 of file cq_zip_check.py.

00063     def __init__(self, sect, step=100, tabs='DqChannel DqChannelStatus'.split()):
00064         DB.__init__(self, sect)
00065         self.step = step
00066         assert len(tabs) == 2
00067         self.tabs = tabs
00068         self.create_indices(tabs)
00069         self.runr = self._runno_range(tabs) 
00070         self("set group_concat_max_len = 4096")
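
A minimal usage sketch; the configuration section name passed as `sect` is a
placeholder here and a reachable database behind the DB base class is assumed::

    from Scraper.dq.cq_zip_check import CQDB

    db = CQDB("offline_db")        # placeholder section name
    print(db.runr)                 # (minRUNNO, maxRUNNO) common to both tables
    x = db.zip_scan_runno()        # RUNNO keyed dict of FILENO coverage differences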


Member Function Documentation

def Scraper::dq::cq_zip_check::CQDB::create_indices (   self,
  tabs 
)
Avoid very slow unindexed querying of RUNNO/FILENO by creating indices.

http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html

Adding an index to `DqChannel` allows comparing query times with and without the index:

    mysql> create index idx_DqChannel on DqChannel (runno, fileno) ;
    Query OK, 65489088 rows affected (15 min 38.60 sec)
    Records: 65489088  Duplicates: 0  Warnings: 0

    mysql> select RUNNO, group_concat(distinct(FILENO)) from DqChannel group by RUNNO limit 10 offset 0  ;
    10 rows in set, 1 warning (0.85 sec)
  
    mysql> set group_concat_max_len = 4096 ;   ## avoid the warning

    mysql> select RUNNO, group_concat(distinct(FILENO)) from DqChannelStatus group by RUNNO limit 10 offset 0  ;
    10 rows in set, 1 warning (1 min 50.01 sec)
              ## without the benefit of the index this takes almost 2 min, cf. 0.85 seconds with it 

The benefit is very clear, so add the index to `DqChannelStatus` too:: 

    mysql> create index idx_DqChannelStatus on DqChannelStatus (runno, fileno) ;
    Query OK, 65436731 rows affected (15 min 4.79 sec)
    Records: 65436731  Duplicates: 0  Warnings: 0
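
The create_indices method below performs this check-then-create via the class's
`self(sql)` callable. A standalone sketch of the same logic using MySQLdb directly,
with placeholder connection parameters::

    import MySQLdb, MySQLdb.cursors

    conn = MySQLdb.connect(read_default_group="client", db="placeholder_db",
                           cursorclass=MySQLdb.cursors.DictCursor)
    cur = conn.cursor()

    for table in ("DqChannel", "DqChannelStatus"):
        cur.execute("show indexes from %s where Key_name = 'idx_%s'" % (table, table))
        cols = [d['Column_name'] for d in cur.fetchall()]
        if cols == ['RUNNO', 'FILENO']:
            print("index already in place for %s" % table)
        elif not cols:
            print("creating index for %s, this may take 15 min or more" % table)
            cur.execute("create index idx_%s on %s (runno, fileno)" % (table, table))
        else:
            raise RuntimeError("unexpected index columns for %s : %s" % (table, cols))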

Definition at line 71 of file cq_zip_check.py.

00072     def create_indices(self, tabs):
00073         """
00074         Avoid very slow unindexed querying of RUNNO/FILENO
00075         by creating indices. 
00076 
00077         http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
00078 
00079         Adding an index to `DqChannel` allows comparing query times with and without the index:
00080 
00081             mysql> create index idx_DqChannel on DqChannel (runno, fileno) ;
00082             Query OK, 65489088 rows affected (15 min 38.60 sec)
00083             Records: 65489088  Duplicates: 0  Warnings: 0
00084 
00085             mysql> select RUNNO, group_concat(distinct(FILENO)) from DqChannel group by RUNNO limit 10 offset 0  ;
00086             10 rows in set, 1 warning (0.85 sec)
00087   
00088             mysql> set group_concat_max_len = 4096 ;   ## avoid the warning
00089 
00090             mysql> select RUNNO, group_concat(distinct(FILENO)) from DqChannelStatus group by RUNNO limit 10 offset 0  ;
00091             10 rows in set, 1 warning (1 min 50.01 sec)
00092                               ## without the benefit of the index this takes almost 2 min, cf. 0.85 seconds with it 
00093 
00094         The benefit is very clear, so add the index to `DqChannelStatus` too:: 
00095 
00096             mysql> create index idx_DqChannelStatus on DqChannelStatus (runno, fileno) ;
00097             Query OK, 65436731 rows affected (15 min 4.79 sec)
00098             Records: 65436731  Duplicates: 0  Warnings: 0
00099 
00100         """
00101         def _create_index(table, columns):
00102             sql = "create index idx_%(table)s on %(table)s %(columns)s " % locals()
00103             log.info(sql)  
00104             self(sql)
00105 
00106         for table in tabs:
00107             cols = map(lambda d:d['Column_name'], self(" show indexes from %(table)s where Key_name = 'idx_%(table)s' " % locals()))
00108             if cols == ['RUNNO', 'FILENO']:
00109                 log.debug("index already in place for table %(table)s " % locals() )
00110             elif len(cols) == 0:
00111                 log.warn("creating index for table %(table)s this might take 15min or more " % locals())
00112                 _create_index(table, "(runno, fileno)")
00113             else:
00114                 assert 0, "unexpected index columns for table %(table)s %(cols)s " % locals()                                     
00115 

def Scraper::dq::cq_zip_check::CQDB::seqno_groups (   self,
  table,
  offset,
  limit 
)

Definition at line 116 of file cq_zip_check.py.

00117     def seqno_groups(self, table, offset, limit):
00118         """
00119         
00120         """
00121         return self("select SEQNO, count(*) as N, RUNNO, FILENO from %(table)s group by SEQNO limit %(limit)s offset %(offset)s " % locals())

def Scraper::dq::cq_zip_check::CQDB::runno_groups (   self,
  table,
  range 
)

Definition at line 122 of file cq_zip_check.py.

00123     def runno_groups(self, table, range):
00124         """
00125         """
00126         if range is None:
00127             where = ""
00128         else:
00129             where = "where RUNNO >= %s and RUNNO  <= %s " % range
00130         return dict((_['RUNNO'],set(_['FILENO'].split(","))) for _ in self("select RUNNO, group_concat(distinct(FILENO)) as FILENO from %(table)s %(where)s group by RUNNO " % locals()))
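
An illustration of the dict-of-sets shape returned above, with made-up rows standing
in for the group_concat query result::

    rows = [{'RUNNO': 21221, 'FILENO': '1,2,3'},     # hypothetical values
            {'RUNNO': 21222, 'FILENO': '1,2'}]
    groups = dict((r['RUNNO'], set(r['FILENO'].split(','))) for r in rows)
    print(sorted(groups[21221]))                     # ['1', '2', '3'] as a set of strings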
     
def Scraper::dq::cq_zip_check::CQDB::_seqno_range (   self,
  table,
  runno,
  fileno 
) [private]

Definition at line 131 of file cq_zip_check.py.

00132     def _seqno_range(self, table, runno, fileno):
00133         fileno = str(fileno).strip()
00134         if fileno[-2:] == ",)":
00135             fileno = fileno[0:-2] + ')'
00136         sql = " select min(SEQNO) as minSEQNO,max(SEQNO) as maxSEQNO from %(table)s where RUNNO = %(runno)s and FILENO in %(fileno)s " % locals()
00137         log.debug(sql)
00138         return self(sql)[0]
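
The string surgery on `fileno` above handles the single-element tuple case, where the
Python repr leaves a trailing comma that the SQL IN clause would reject. A worked example::

    fileno = str((1,)).strip()         # "(1,)" -- not valid inside an SQL IN clause
    if fileno[-2:] == ",)":
        fileno = fileno[0:-2] + ')'
    print(fileno)                      # "(1)"  -- acceptable to MySQL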

def Scraper::dq::cq_zip_check::CQDB::_runno_range (   self,
  tabs,
  where = "RUNNO > 0" 
) [private]
The `where` clause avoids part of the partial SEQNO entries.

Definition at line 139 of file cq_zip_check.py.

00140     def _runno_range(self, tabs, where="RUNNO > 0"):
00141         """
00142         The `where` clause avoids part of the partial SEQNO entries.
00143         """
00144         rr = {}
00145         for table in tabs:
00146             rr[table] = self("select min(RUNNO) as minRUNNO, max(RUNNO) as maxRUNNO, max(RUNNO) - min(RUNNO) + 1 as numRUNNO  from %(table)s where %(where)s " % locals())[0]
00147         assert rr[tabs[0]]['minRUNNO'] == rr[tabs[1]]['minRUNNO']
00148         assert rr[tabs[0]]['maxRUNNO'] == rr[tabs[1]]['maxRUNNO']
00149         assert rr[tabs[0]]['numRUNNO'] == rr[tabs[1]]['numRUNNO']
00150         runr = ( rr[tabs[0]]['minRUNNO'], rr[tabs[0]]['maxRUNNO'], )
00151         self.runr = runr
00152         return runr

def Scraper::dq::cq_zip_check::CQDB::runranges (   self)

Definition at line 153 of file cq_zip_check.py.

00154     def runranges(self):
00155         step = self.step
00156         raster_ = lambda n:( ( self.runr[0] + n*step ) / step )*step
00157         rcount = ( 2*step + self.runr[1] - self.runr[0] ) / step
00158         rr = []
00159         for n in range(rcount):
00160             rr.append( (n,raster_(n),raster_(n+1)) ) 
00161         return rr
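
A worked example of the rastering above with made-up run numbers; floor division
reproduces the Python 2 integer `/` used in the source::

    runr, step = (21221, 21950), 100   # hypothetical (minRUNNO, maxRUNNO)

    raster = lambda n: ((runr[0] + n * step) // step) * step
    rcount = (2 * step + runr[1] - runr[0]) // step          # 9 windows here

    windows = [(n, raster(n), raster(n + 1)) for n in range(rcount)]
    print(windows[0])                  # (0, 21200, 21300) first step-aligned window
    print(windows[-1])                 # (8, 22000, 22100) last window, past maxRUNNO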

def Scraper::dq::cq_zip_check::CQDB::count (   self,
  tabs 
)
Slow query. There are ~341k SEQNO 

Definition at line 162 of file cq_zip_check.py.

00163     def count(self, tabs):
00164         """
00165         Slow query. There are ~341k SEQNO 
00166         """
00167         st = {}
00168         for table in tabs:
00169             st[table] = self("select count(distinct(SEQNO)) as PSEQ from %(table)s " % locals())[0]['PSEQ']
00170             st[table+'Vld'] = self("select count(distinct(SEQNO)) as VSEQ from %(table)sVld " % locals())[0]['VSEQ']
00171         log.info(st) 

def Scraper::dq::cq_zip_check::CQDB::zip_scan_runno (   self,
  nbase = 0 
)
Maybe it is better to group by RUNNO as a consistency check?
Looking at things by SEQNO corresponds to checking the write ordering of the two tables,
which is far too poorly aligned to be useful.

This answers a different question: do the two tables cover the same files?
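
A tiny sketch of the coverage comparison performed below, with made-up
{RUNNO: set(FILENO)} mappings standing in for the two tables::

    a = {21221: set(['1', '2', '3']), 21222: set(['1'])}     # e.g. DqChannel
    b = {21221: set(['1', '2']),      21223: set(['1'])}     # e.g. DqChannelStatus

    for r in set(a.keys()).intersection(set(b.keys())):
        if a[r] != b[r]:
            print("%s differs: %s vs %s" % (r, a[r], b[r]))  # 21221 differs
    for r in set(a.keys()).difference(set(b.keys())):
        print("%s only in first table" % r)                  # 21222
    for r in set(b.keys()).difference(set(a.keys())):
        print("%s only in second table" % r)                 # 21223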

Definition at line 172 of file cq_zip_check.py.

00173     def zip_scan_runno(self, nbase=0):
00174         """
00175         Maybe it is better to group by RUNNO as a consistency check?
00176         Looking at things by SEQNO corresponds to checking the write ordering of the two tables,
00177         which is far too poorly aligned to be useful.
00178 
00179         This answers a different question: do the two tables cover the same files?
00180         """
00181         rrs = self.runranges()
00182         x = {}
00183         for rr in rrs:
00184             n, rmi, rmx = rr
00185             if n < nbase:continue
00186 
00187             a = self.runno_groups( self.tabs[0], range=(rmi,rmx) ) 
00188             b = self.runno_groups( self.tabs[1], range=(rmi,rmx)  ) 
00189 
00190             for r in set(a.keys()).intersection(set(b.keys())):
00191                 if a[r] != b[r]:
00192                     ab = tuple(a[r].difference(b[r]))
00193                     ba = tuple(b[r].difference(a[r]))
00194                     x[r] = ( r, "%s-%s" % (self.tabs[0], self.tabs[1]), ab, "%s-%s" % (self.tabs[1], self.tabs[0]), ba )
00195             pass
00196             for r in set(a.keys()).difference(set(b.keys())):
00197                 x[r] = ( r, "only %s" % self.tabs[0], tuple(a[r]) )
00198             pass
00199             for r in set(b.keys()).difference(set(a.keys())):
00200                 x[r] = ( r, "only %s" % self.tabs[1], tuple(b[r]) )
00201 
00202             dfr = filter(lambda r:r>=rmi and r<rmx, x.keys())
00203             log.info( " rr %s df %s " % ( str(rr), len(dfr) )) 
00204             for r in dfr:
00205                 log.info(str(x[r]))
00206 
00207         return x
00208  

def Scraper::dq::cq_zip_check::CQDB::zip_scan_seqno (   self,
  tabs 
)
Many swaps in SEQNO ordering of the same fileno/runno at the 0-10 per 1000 level
within the lower SEQNO. It gets worse going to higher SEQNO as the offsets hang around.

There are too many differences for this to be a useful check.

::

    2013-05-27 13:00:24,370 __main__ INFO              89981   192     22522      1      :   89981   192     22647      206     : ***  
    2013-05-27 13:00:24,370 __main__ INFO              89982   192     21520      272    :   89982   192     22522      1       : ***  
    2013-05-27 13:00:24,370 __main__ INFO              89983   192     22647      206    :   89983   192     21520      272     : ***  
    2013-05-27 13:00:24,370 __main__ INFO              89985   192     22405      197    :   89985   192     22266      5       : ***  
    2013-05-27 13:00:24,370 __main__ INFO              89986   192     22266      5      :   89986   192     22405      197     : ***  
    2013-05-27 13:00:24,370 __main__ INFO              89991   192     21804      44     :   89991   192     22516      205     : ***  
    2013-05-27 13:00:24,370 __main__ INFO              89992   192     22516      205    :   89992   192     22655      70      : ***  
    2013-05-27 13:00:24,370 __main__ INFO              89993   192     22655      70     :   89993   192     22640      32      : ***   

By doing a "group by SEQNO" query for each table and comparing the RUNNO/FILENO,
many ordering swaps between the two tables are observable.
Perhaps the explanation is that multiple instances of the filling script 
are concurrently writing to the ingredients and summary tables.

This breaks the sequentiality of closing the two writers 
from any one script instance, preventing them from having the 
same SEQNO in the two tables much of the time.
       
To make synchronized SEQNO writing possible it would be 
necessary to wrap the closes in lock/unlock. Something like:

      db("lock tables DqChannel WRITE, DqChannelVld WRITE, DqChannelStatus WRITE, DqChannelStatusVld WRITE") 
      wseqno = wrt.Close()
      wseqno_status = wrt_status.Close() 
      db("unlock tables")
      assert wseqno ==  wseqno_status

In this way the first instance of the script to take the lock will be able
to sequentially perform its writes before releasing its lock. Other instances
will wait until the first is done, and so on.
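
A sketch of one way to package that locking as a context manager around a DB-API
cursor; `wrt` and `wrt_status` stand for the two writers of the filling script and
are assumptions here, and the try/finally guarantees the unlock even if a close fails::

    from contextlib import contextmanager

    @contextmanager
    def locked_tables(cur, tables):
        """Hold MySQL WRITE locks on the given tables for the duration of the block."""
        cur.execute("lock tables " + ", ".join("%s WRITE" % t for t in tables))
        try:
            yield
        finally:
            cur.execute("unlock tables")

    # hypothetical use inside the filling script:
    # with locked_tables(cur, ["DqChannel", "DqChannelVld", "DqChannelStatus", "DqChannelStatusVld"]):
    #     wseqno = wrt.Close()
    #     wseqno_status = wrt_status.Close()
    # assert wseqno == wseqno_status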

Definition at line 209 of file cq_zip_check.py.

00210     def zip_scan_seqno(self, tabs):
00211         """
00212         Many swaps in SEQNO ordering of the same fileno/runno at the 0-10 per 1000 level
00213         within the lower SEQNO. It gets worse going to higher SEQNO as the offsets hang around.
00214 
00215         There are too many differences for this to be a useful check.
00216 
00217         ::
00218 
00219             2013-05-27 13:00:24,370 __main__ INFO              89981   192     22522      1      :   89981   192     22647      206     : ***  
00220             2013-05-27 13:00:24,370 __main__ INFO              89982   192     21520      272    :   89982   192     22522      1       : ***  
00221             2013-05-27 13:00:24,370 __main__ INFO              89983   192     22647      206    :   89983   192     21520      272     : ***  
00222             2013-05-27 13:00:24,370 __main__ INFO              89985   192     22405      197    :   89985   192     22266      5       : ***  
00223             2013-05-27 13:00:24,370 __main__ INFO              89986   192     22266      5      :   89986   192     22405      197     : ***  
00224             2013-05-27 13:00:24,370 __main__ INFO              89991   192     21804      44     :   89991   192     22516      205     : ***  
00225             2013-05-27 13:00:24,370 __main__ INFO              89992   192     22516      205    :   89992   192     22655      70      : ***  
00226             2013-05-27 13:00:24,370 __main__ INFO              89993   192     22655      70     :   89993   192     22640      32      : ***   
00227 
00228         By doing a "group by SEQNO" query for each table and comparing the RUNNO/FILENO,
00229         many ordering swaps between the two tables are observable.
00230         Perhaps the explanation is that multiple instances of the filling script 
00231         are concurrently writing to the ingredients and summary tables.
00232 
00233         This breaks the sequentiality of closing the two writers 
00234         from any one script instance, preventing them from having the 
00235         same SEQNO in the two tables much of the time.
00236        
00237         To make synchronized SEQNO writing possible it would be 
00238         necessary to wrap the closes in lock/unlock. Something like:
00239 
00240               db("lock tables DqChannel WRITE, DqChannelVld WRITE, DqChannelStatus WRITE, DqChannelStatusVld WRITE") 
00241               wseqno = wrt.Close()
00242               wseqno_status = wrt_status.Close() 
00243               db("unlock tables")
00244               assert wseqno ==  wseqno_status
00245 
00246         In this way the first instance of the script to take the lock will be able
00247         to sequentially perform its writes before releasing its lock. Other instances
00248         will wait until the first is done, and so on.
00249 
00250         """
00251         assert len(tabs) == 2
00252         offset, limit = 0, self.limit 
00253         fmt = " %(SEQNO)-7s %(N)-7s %(RUNNO)-10s %(FILENO)-5s "
00254         while offset < 1000000:
00255             ch = self.seqno_groups( tabs[0], offset, limit  ) 
00256             cs = self.seqno_groups( tabs[1], offset, limit  ) 
00257             offset += limit 
00258             log.info(" [%-7s]  %s cf %s " % ( offset, tabs[0], tabs[1] ) )
00259             for dch, dcs in zip(ch,cs):
00260                 assert dch['N'] == 192, dch
00261                 assert dcs['N'] == 192, dcs
00262                 msg = "===" if dcs == dch else "***"
00263                 if dcs != dch:
00264                     log.info( "        %s :  %s  : %s  " % ( fmt % dch,  fmt % dcs, msg  ) )

def Scraper::dq::cq_zip_check::CQDB::zip_scan_seqno_sets (   self,
  tabs 
)
Comparing sets of RUNNO/FILENO.

Not so useful, however, as the SEQNO groupings are arbitrary, so the group boundaries will split swaps.
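
A minimal stand-in for the `hdict` helper used below (assumed to be a hashable
RUNNO/FILENO mapping), modelled here with a namedtuple; it also shows why the set
comparison tolerates ordering swaps within a group::

    from collections import namedtuple

    RF = namedtuple("RF", "runno fileno")              # hashable stand-in for hdict

    ch = [{'RUNNO': 21221, 'FILENO': 1}, {'RUNNO': 21221, 'FILENO': 2}]
    cs = [{'RUNNO': 21221, 'FILENO': 2}, {'RUNNO': 21221, 'FILENO': 1}]   # swapped order

    makeset = lambda dlist: set(RF(d['RUNNO'], d['FILENO']) for d in dlist)
    print(makeset(ch) == makeset(cs))                  # True: the swap is invisible to sets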

Definition at line 265 of file cq_zip_check.py.

00266     def zip_scan_seqno_sets(self, tabs):
00267         """
00268         Comparing sets of RUNNO/FILENO.
00269 
00270         Not so useful, however, as the SEQNO groupings are arbitrary, so the group boundaries will split swaps.
00271         """
00272         assert len(tabs) == 2
00273         offset, limit = 0, self.limit 
00274         fmt = " %(SEQNO)-7s %(N)-7s %(RUNNO)-10s %(FILENO)-5s "
00275         h_ = lambda d:hdict(runno=d['RUNNO'],fileno=d['FILENO'])
00276 
00277         def makeset(dlist):
00278             """
00279             :param  dlist:
00280             :return: set of hashable dicts  
00281             """ 
00282             s = set()
00283             for h in map(h_, dlist):
00284                 s.add(h)
00285             return s
00286 
00287         while offset < 1000000:
00288             ch = self.seqno_groups( tabs[0], offset, limit  ) 
00289             cs = self.seqno_groups( tabs[1], offset, limit  ) 
00290             sch = makeset(ch)
00291             scs = makeset(cs)
00292             msg = "===" if sch == scs else "***"
00293             log.info( " %s : %s : %s " % ( sch, scs, msg ))
00294             offset += limit 
00295 


Member Data Documentation

Scraper::dq::cq_zip_check::CQDB::step

Definition at line 62 of file cq_zip_check.py.

Scraper::dq::cq_zip_check::CQDB::tabs

Definition at line 62 of file cq_zip_check.py.

Scraper::dq::cq_zip_check::CQDB::runr

Definition at line 62 of file cq_zip_check.py.


The documentation for this class was generated from the following file: cq_zip_check.py

Generated on Fri May 16 2014 09:50:03 for Scraper by doxygen 1.7.4