| Classes | Job Modules | Data Objects | Services | Algorithms | Tools | Packages | Directories | Tracs |


Public Member Functions | Public Attributes | Static Public Attributes | Properties | Private Member Functions | Private Attributes
Scraper::base::sourcevector::SourceVector Class Reference

List of all members.

Public Member Functions

def __init__
def get_tcursor
def set_tcursor
def iinst
def __repr__
def smry
def __str__
def __call__
def lastentry
def lag

Public Attributes

 scraper
 source
 offset
 SQL offset allowing source entry skipping
 lastresult
 lastresult enum follows progress
 calls
 aggfns
 grpfns

Static Public Attributes

tuple state = Enum(("empty", "partial", "full", "ERROR" ))
tuple result = Enum(("noupdate","notfull","overage","changed","unchanged", "init","lastfirst", "replay", "ERROR"))
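`Enum` here is a project helper, not a standard library class; the properties below rely on its ``whatis`` reverse lookup. A minimal sketch of the assumed name/integer mapping:

```python
class Enum(object):
    """Minimal sketch of the assumed name<->integer enum helper."""
    def __init__(self, names):
        self._names = tuple(names)
        for i, name in enumerate(self._names):
            setattr(self, name, i)          # e.g. state.empty == 0

    def whatis(self, value):
        """Reverse lookup: enum integer -> name string."""
        return self._names[value]

state = Enum(("empty", "partial", "full", "ERROR"))
result = Enum(("noupdate", "notfull", "overage", "changed", "unchanged",
               "init", "lastfirst", "replay", "ERROR"))
```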

Properties

 tcursor = property(get_tcursor, set_tcursor, doc=set_tcursor.__doc__ )
 status = property( _status , doc="enum status integer")
 status_ = property( lambda self:self.state.whatis(self.status) , doc="status string representing `status` enum integer" )
 lastresult_ = property( lambda self:self.result.whatis(self.lastresult), doc="progress string representing `lastresult` enum integer" )
 ids = property(lambda self:tuple(map(lambda _:int(getattr(_,'id')),self)))

Private Member Functions

def _prepare_grpfns
def _prepare_aggfns
def _lastfirst
def _status
def _collect
def _aggregate
def _tcut
def _replay
def _sample

Private Attributes

 _tcursor
 _lastentry
 cached last entry SA instance, for lag determination without querying

Detailed Description

This is simply a holder for source instances and the timecursor; the action is  
driven by the `Scraper` instance, which invokes the ``SourceVector.__call__``
method to perform the sampling, i.e. querying for new instances at times beyond the ``tcursor``.

As each instance is collected the prior last instance is discarded until sufficient deviation
(in age or value) between the first and last is seen. 
Deviation results in this `SourceVector` being collapsed to start again from the last sample.
This is also driven from the `Scraper`, by setting the `tcursor` property.

Manages:

#. 0,1 or 2 source class instances  
#. timecursor
#. ``lastresult`` enum from last ``_sample`` 

Actions:

#. checks to see if conditions are met to propagate collected source instances into target, in ``__call__`` method

Definition at line 30 of file sourcevector.py.
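The holder behaviour described above can be illustrated with a much-simplified stand-in (hypothetical, not the actual class): a list that keeps only the first and last collected instances and collapses when told to start a new sampling round.

```python
class MiniSV(list):
    """Simplified stand-in for SourceVector: keeps only the first and
    last collected instances (at most 2 entries)."""
    STATES = ("empty", "partial", "full")

    @property
    def status(self):
        return self.STATES[len(self)]

    def collect(self, inst):
        # each new instance displaces the prior last
        if self.status == "full":
            self.pop()
        self.append(inst)

    def collapse(self):
        # on propagation the last instance becomes the first
        if self.status == "full":
            self[0] = self.pop()

sv = MiniSV()
for inst in "abc":
    sv.collect(inst)
# sv now holds the first and last instances: ['a', 'c']
sv.collapse()
# sv now holds ['c'], ready for a new sampling round
```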


Constructor & Destructor Documentation

def Scraper::base::sourcevector::SourceVector::__init__ (   self,
  scraper,
  source 
)
:param scraper: parent scraper
:param source: SA mapped class

Definition at line 55 of file sourcevector.py.

00056                                          :
00057         """
00058         :param scraper: parent scraper
00059         :param source: SA mapped class
00060         """
00061         self.scraper = scraper  
00062         self.source  = source 
00063         self.offset = scraper.offset               ## SQL offset allowing source entry skipping
00064 
00065         self._tcursor = None
00066         self._lastentry = None                      ## cached last entry SA instance, for lag determination without querying  
00067 
00068         self.lastresult = self.result.init          ## lastresult enum follows progress
00069         self.calls = 0 
00070 
00071         self.aggfns = self._prepare_aggfns()
00072         self.grpfns = self._prepare_grpfns()
00073         list.__init__(self)
00074 


Member Function Documentation

def Scraper::base::sourcevector::SourceVector::_prepare_grpfns (   self) [private]
Preparation of SA functions used for group_by 

Definition at line 75 of file sourcevector.py.

00076                              :
00077         """
00078         Preparation of SA functions used for group_by 
00079         """
00080         scraper = self.scraper
00081         if not scraper.aggregate or not scraper.aggregate_group_by or len(scraper.aggregate_group_by.strip())==0:
00082             return None
00083         grpfns = self.source.grpfns(by=scraper.aggregate_group_by, att=scraper.aggregate_group_by_att )
00084         log.debug("_prepare_grpfns %s group_by functions" % len(grpfns) )
00085         log.debug("_prepare_grpfns %s " % map(str,grpfns))
00086         return grpfns
 
def Scraper::base::sourcevector::SourceVector::_prepare_aggfns (   self) [private]
Preparation of SA generic functions used for making aggregate queries

Definition at line 87 of file sourcevector.py.

00088                              :
00089         """
00090         Preparation of SA generic functions used for making aggregate queries
00091         """
00092         scraper = self.scraper
00093         if not scraper.aggregate or len(scraper.aggregate.strip())==0:
00094             return None
00095         funcs  = scraper.aggregate
00096         skips  = scraper.aggregate_skips   #skips  = 'id,date_time' although some id,date_time aggregates are non-sensical others are useful for debugging 
00097         count  = scraper.aggregate_count
00098         if not skips:
00099             skips  = ''
00100 
00101         if scraper.aggregate_group_by and scraper.aggregate_group_by_att:
00102             _minmax = scraper.aggregate_group_by_att        ## typically date_time 
00103             log.debug("include _minmax aggregate query of %s as required for aggregate_group_by operation " % _minmax )
00104         else:
00105             _minmax = None 
00106 
00107         aggfns = self.source.aggfns(funcs=funcs,skips=skips,count=count,_minmax=_minmax)
00108         log.debug("_prepare_aggfns %s aggregate functions" % len(aggfns) )
00109         log.debug("_prepare_aggfns %s " % map(str,aggfns))
00110         return aggfns 

def Scraper::base::sourcevector::SourceVector::_lastfirst (   self,
  tc 
) [private]
:param tc: 

Collapses the SourceVector, making the last instance become the first 
Called only on assigning to the tcursor

Definition at line 111 of file sourcevector.py.

00112                              :
00113         """
00114         :param tc: 
00115 
00116         Collapses the SourceVector, making the last instance become the first 
00117         Called only on assigning to the tcursor
00118         """
00119         if type(tc) == list:
00120             self.lastresult = self.result.replay
00121         else:
00122             self.lastresult = self.result.lastfirst
00123             if self.status == self.state.full:
00124                 self[0] = self.pop()

def Scraper::base::sourcevector::SourceVector::get_tcursor (   self)

Definition at line 125 of file sourcevector.py.

00126                          :
00127         return self._tcursor
def Scraper::base::sourcevector::SourceVector::set_tcursor (   self,
  tc 
)
Assigning to this  `sv.tcursor` not only changes the cursor but also
collapses the SourceVector ready to collect new sampled source instances.

Definition at line 127 of file sourcevector.py.

00128                               :
00129         """
00130         Assigning to this  `sv.tcursor` not only changes the cursor but also
00131         collapses the SourceVector ready to collect new sampled source instances.
00132         """ 
00133         self._lastfirst(tc)
00134         self._tcursor = tc
def Scraper::base::sourcevector::SourceVector::iinst (   self,
  i 
)
:param i: instance index into source vector
:rtype source:  returns instance or None  

Definition at line 136 of file sourcevector.py.

00137                       :
00138         """
00139         :param i: instance index into source vector
00140         :rtype source:  returns instance or None  
00141         """
00142         try:
00143             i = self[i]
00144         except IndexError:
00145             i = None
00146         return i 

def Scraper::base::sourcevector::SourceVector::__repr__ (   self)

Definition at line 147 of file sourcevector.py.

00148                       :
00149         return self.smry()

def Scraper::base::sourcevector::SourceVector::smry (   self)
Example::

    SV 4   (43, 46) 2011-01-10 10:02:00 full       unchanged  (10:34:28 10:34:34)  
    # calls   ids        tcursor        status     lastresult       times  

Shows the status of source vector including the ``id`` and ``date_time`` of 
source entries ``sv[0]`` and ``sv[-1]``

**calls**
      iteration count
**ids**
      ``id`` of source entries
**tcursor**
      timecursor, stepping through time. Changes only at each propagation
**status**  
      fullness of source vector: ``empty/partial/full``  (**full** means 2 entries)
**lastresult**  
      possibilities: "noupdate","notfull","overage","changed","unchanged","init","lastfirst"
**times**
      ``date_time`` of source entries, changes as each sample is made     

Definition at line 150 of file sourcevector.py.

00151                   :
00152         """
00153         Example::
00154 
00155             SV 4   (43, 46) 2011-01-10 10:02:00 full       unchanged  (10:34:28 10:34:34)  
00156             # calls   ids        tcursor        status     lastresult       times  
00157 
00158         Shows the status of source vector including the ``id`` and ``date_time`` of 
00159         source entries ``sv[0]`` and ``sv[-1]``
00160 
00161         **calls**
00162               iteration count
00163         **ids**
00164               ``id`` of source entries
00165         **tcursor**
00166               timecursor, stepping through time. Changes only at each propagation
00167         **status**  
00168               fullness of source vector: ``empty/partial/full``  (**full** means 2 entries)
00169         **lastresult**  
00170               possibilities: "noupdate","notfull","overage","changed","unchanged","init","lastfirst"
00171         **times**
00172               ``date_time`` of source entries, changes as each sample is made     
00173 
00174         """
00175         a_ = self.iinst(0)
00176         b_ = self.iinst(1)
00177         a = a_.date_time.strftime("%H:%M:%S") if a_ else "--"
00178         b = b_.date_time.strftime("%H:%M:%S") if b_ else "++"
00179 
00180         if type(self.tcursor) == list:
00181             t = "%s=>%s" % ( self.tcursor[0].strftime("%Y-%m-%d %H:%M:%S") , self.tcursor[1].strftime("%H:%M:%S") ) 
00182         else:
00183             t = self.tcursor.strftime("%Y-%m-%d %H:%M:%S")
00184 
00185         if b_: 
00186             if type(b_.aggd) == list:
00187                 n = len(b_.aggd)
00188             else:
00189                 n = -1
00190         else:
00191             n = -2   
00192 
00193         return "SV %-3s %-20r %s %-10s %-10s (%s %s) [%s] " % ( self.calls, self.ids, t, self.status_ , self.lastresult_ , a,b,n )
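The summary line layout shown in the docstring can be reproduced with plain ``%`` formatting; a sketch with hypothetical values:

```python
def smry_line(calls, ids, tcursor, status_, lastresult_, a, b):
    """Mirror of the 'SV calls ids tcursor status lastresult (times)' layout."""
    return "SV %-3s %-20r %s %-10s %-10s (%s %s)" % (
        calls, ids, tcursor, status_, lastresult_, a, b)

line = smry_line(4, (43, 46), "2011-01-10 10:02:00",
                 "full", "unchanged", "10:34:28", "10:34:34")
```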

def Scraper::base::sourcevector::SourceVector::__str__ (   self)

Definition at line 194 of file sourcevector.py.

00195                      :
00196         labels = range(len(self)) 
00197         cflods = self.source.cflods( self , labels )    
00198         td = TabularData(cflods)
00199         return str(td)

def Scraper::base::sourcevector::SourceVector::_status (   self) [private]
Hmm fullness would be a better name

state enum values of SourceVector

Definition at line 200 of file sourcevector.py.

00201                      :
00202         """
00203         Hmm fullness would be a better name
00204 
00205         state enum values of SourceVector
00206         """
00207         if len(self) == 0: 
00208             return self.state.empty
00209         elif len(self) == 1: 
00210             return self.state.partial 
00211         elif len(self) == 2: 
00212             return self.state.full
00213         else:
00214             return self.state.ERROR
def Scraper::base::sourcevector::SourceVector::_collect (   self,
  inst 
) [private]
Collect sampled instances into this SourceVector list such that the instance 
always displaces the prior last, thus holding on to the 
initial and last.  

Note that appending the instance will change the 
dynamically determined status property in the early stages after the 
scraper is started with the SV state going through::

      empty => partial => full

In steady state the status stays "full", only virtually flitting thru 
"partial" within this method::

      full (=> partial) => full 


Only called by `_sample`

:param inst: source instance

Definition at line 218 of file sourcevector.py.

00219                              :
00220         """
00221         Collect sampled instances into this SourceVector list such that the instance 
00222         always displaces the prior last, thus holding on to the 
00223         initial and last.  
00224 
00225         Note that appending the instance will change the 
00226         dynamically determined status property in the early stages after the 
00227         scraper is started with the SV state going through::
00228 
00229               empty => partial => full
00230 
00231         In steady state the status stays "full", only virtually flitting thru 
00232         "partial" within this method::
00233 
00234               full (=> partial) => full 
00235 
00236 
00237         Only called by `_sample`
00238 
00239         :param inst: source instance
00240 
00241         """  
00242         assert inst != None
00243         log.debug( "_collect %r" % inst.asdict )
00244         if self.status == self.state.full:   
00245             self.pop()     
00246         if self.status in (self.state.empty, self.state.partial) :
00247             self.append(inst)   
00248         else:
00249             assert 0, "unexpected status"
00250 
   
def Scraper::base::sourcevector::SourceVector::__call__ (   self)
Samples for new source instances at or beyond ``tcursor``, 
the initial and the last instances are held.
When these differ sufficiently in age or value ``True`` is
returned to the Scraper to signal the need to propagate.

Definition at line 251 of file sourcevector.py.

00252                       :
00253         """
00254         Samples for new source instances at or beyond ``tcursor``, 
00255         the initial and the last instances are held.
00256         When these differ sufficiently in age or value ``True`` is
00257         returned to the Scraper to signal the need to propagate.
00258         """
00259         self.calls += 1 
00260 
00261         tc = self.tcursor
00262         if type(tc) == list:
00263             result = self._replay(tc)
00264         else:  
00265             result = self._sample(tc)
00266 
00267         self.lastresult = result
00268         proceed = result in (self.result.overage, self.result.changed, self.result.replay)
00269         msg = "PROCEED" if proceed else "WAIT"        
00270         log.info( "%r ==> %s " % ( self,  msg )) 
00271         return proceed
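The proceed/wait decision in ``__call__`` reduces to a membership test on the result enum; sketched with string stand-ins for the enum values:

```python
# results that trigger propagation to the target
PROCEED_RESULTS = ("overage", "changed", "replay")

def should_propagate(lastresult):
    """Scraper propagates only on sufficient change, age-out, or replay."""
    return lastresult in PROCEED_RESULTS
```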

def Scraper::base::sourcevector::SourceVector::_aggregate (   self,
  rnge 
) [private]
Aggregate between the instances, date_time filter includes 1st and excludes 2nd

:param rnge: list containing 2 datetimes specifying timerange of query 
:return: either a single aggregate dict or a list of these when using `aggregate_group_by` 

Note that in `aggregate_group_by` mode the time range filtering is not done, 
which **could be very memory expensive**.

When grouping by minutes or hours, rounded ones are favored.

Definition at line 274 of file sourcevector.py.

00275                                :
00276         """
00277         Aggregate between the instances, date_time filter includes 1st and excludes 2nd
00278 
00279         :param rnge: list containing 2 datetimes specifying timerange of query 
00280         :return: either a single aggregate dict or a list of these when using `aggregate_group_by` 
00281 
00282         Note that in `aggregate_group_by` mode the time range filtering is not done 
00283         **could be very memory expensive**
00284 
00285         When grouping by minutes or hours want to favor rounded ones
00286 
00287         """
00288         log.debug("_aggregate aggfns %s " % self.aggfns)
00289         session = self.source.db.session 
00290 
00291         namedtuple_to_aggd = lambda nt:Aggregate(zip(nt.keys(),nt))
00292 
00293         aggregate_group_by = self.scraper.aggregate_group_by
00294         aggregate_filter = self.scraper.aggregate_filter
00295 
00296         q = session.query(*self.aggfns)
00297         q = q.filter(self.source.date_time >= rnge[0])
00298         q = q.filter(self.source.date_time <  rnge[1])
00299 
00300         log.debug("_aggregate date_time filtering %s %s " % ( rnge[0], rnge[1] ) )
00301 
00302         if aggregate_filter:
00303             q = q.filter(aggregate_filter)
00304 
00305         if aggregate_group_by:
00306             q = q.group_by(*self.grpfns)
00307             aggl = q.all()   
00308             assert type(aggl) == list , "expecting list from all query "
00309             aggs = map(namedtuple_to_aggd, aggl )
00310             log.debug("_aggregate group_by query return count %s %s " % ( len(aggl), len(aggs)) )
00311             return aggs
00312         else: 
00313             aggv = q.one()       ## actually do the query 
00314             aggd = namedtuple_to_aggd( aggv )
00315             log.debug("_aggregate aggd %r " % aggd )
00316             return aggd 
00317  
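The half-open ``date_time`` filter (first instance included, second excluded) can be sketched over plain dicts, as a hypothetical stand-in for the SQLAlchemy aggregate query:

```python
from datetime import datetime

def aggregate_range(rows, rnge):
    """Aggregate rows whose date_time lies in the half-open
    interval [rnge[0], rnge[1])."""
    vals = [r["v"] for r in rows if rnge[0] <= r["dt"] < rnge[1]]
    if not vals:
        return None
    return {"count": len(vals), "min": min(vals), "max": max(vals),
            "avg": sum(vals) / float(len(vals))}

rows = [{"dt": datetime(2011, 1, 10, 10, 0), "v": 1.0},
        {"dt": datetime(2011, 1, 10, 10, 1), "v": 3.0},
        {"dt": datetime(2011, 1, 10, 10, 2), "v": 5.0}]
# second endpoint excluded: only the 10:00 and 10:01 rows aggregate
aggd = aggregate_range(rows, [datetime(2011, 1, 10, 10, 0),
                              datetime(2011, 1, 10, 10, 2)])
```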

def Scraper::base::sourcevector::SourceVector::_tcut (   self,
  tcursor 
) [private]
:param tcursor: time cursor

Determine the timecut based on timecursor and any already collected instances.
The time cut is the later of the two below, which
avoids turning up the same instances and getting stuck at **unchanged**:

#. timecursor 
#. ``sv[-1].date_time``

On comparisons between ``self[-1].date_time`` and ``tcursor``:

#. equal at startup
#. while sampling  ``self[-1].date_time`` will be ahead
#. following propagation, ``tcursor`` is ahead briefly, until the next sample
  
     #. sv is collapsed ``sv[-1].date_time`` holds the old ``tcursor`` 
     #. ``tcursor`` is stepped ahead by the ``interval``, thus ``tcursor`` 

Definition at line 318 of file sourcevector.py.

00319                               :
00320         """
00321         :param tcursor: time cursor
00322 
00323         Determine the timecut based on timecursor and any already collected instances.
00324         The time cut is the later time:
00325         Avoids turning up the same instances and getting stuck at **unchanged**.
00326 
00327         #. timecursor 
00328         #. ``sv[-1].date_time``
00329 
00330         On comparisons between ``self[-1].date_time``,  ``tcursor``
00331 
00332         #. equal at startup
00333         #. while sampling  ``self[-1].date_time`` will be ahead
00334         #. following propagation, ``tcursor`` is ahead briefly, until the next sample
00335   
00336              #. sv is collapsed ``sv[-1].date_time`` holds the old ``tcursor`` 
00337              #. ``tcursor`` is stepped ahead by the ``interval``, thus ``tcursor`` 
00338 
00339         """
00340         assert tcursor
00341         if self.status == self.state.empty:
00342             tcut = tcursor 
00343         elif self.status == self.state.partial or self.status == self.state.full:
00344             tcut = max( self[-1].date_time , tcursor )
00345         else:
00346             assert 0, "status error "        
00347         return tcut
00348 
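The later-of-two rule for the time cut can be expressed as a standalone function (assumed equivalent logic, not the actual method):

```python
from datetime import datetime

def tcut(tcursor, last_date_time=None):
    """Time cut: the later of the tcursor and the last collected
    instance's date_time (None when the SourceVector is empty)."""
    if last_date_time is None:
        return tcursor
    return max(last_date_time, tcursor)

tc = datetime(2011, 1, 10, 10, 2, 0)
later = datetime(2011, 1, 10, 10, 34, 28)
```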
  
def Scraper::base::sourcevector::SourceVector::_replay (   self,
  tcursor 
) [private]
Equivalent of `_sample` for replay scraping, aka catchup.
When replaying the intelligence comes from above, in the form of a tcursor containing
a list with 2 datetimes.

:param tcursor: 

Definition at line 349 of file sourcevector.py.

00350                                :
00351         """
00352         Equivalent of `_sample` for replay scraping aka catchup  
00353         when replaying the intelligence comes from above in the form of a tcursor containing 
00354         a list with 2 datetimes
00355 
00356         :param tcursor: 
00357 
00358         """
00359         assert type(tcursor) == list , "unexpected tcursor %s " % tcursor 
00360 
00361         ## dummy instances of the SA mapped class
00362         self[:] = []
00363         self.append(self.source())
00364         self.append(self.source())
00365 
00366         self.scraper.cooldownsec_sleep()
00367         aggl = self._aggregate(tcursor)
00368 
00369         self[0].date_time = tcursor[0]
00370         self[0].id = 0
00371         self[0].aggd = aggl
00372 
00373         self[-1].date_time = tcursor[1]
00374         self[-1].id = 1
00375         self[-1].aggd = aggl
00376 
00377         return self.result.replay
00378 
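Replay synthesizes two dummy endpoint instances that both carry the aggregate for the supplied range; a sketch using plain objects as stand-ins for the SA mapped class:

```python
from datetime import datetime

class Dummy(object):
    """Stand-in for an instance of the SA mapped source class."""
    pass

def replay_endpoints(rnge, aggd):
    """Build the two dummy endpoints used during replay, both
    carrying the aggregate for the supplied time range."""
    first, last = Dummy(), Dummy()
    first.date_time, first.id, first.aggd = rnge[0], 0, aggd
    last.date_time, last.id, last.aggd = rnge[1], 1, aggd
    return [first, last]

eps = replay_endpoints([datetime(2011, 1, 1), datetime(2011, 1, 2)],
                       {"count": 5})
```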

def Scraper::base::sourcevector::SourceVector::_sample (   self,
  tcursor 
) [private]
Samples/collects instances from source DB returning result enum : 

:param tcursor: time cursor
:rtype result:  result enum integer  noupdate/notfull/overage/changed/unchanged
 
Outcome of using the **first** instance beyond the cut (rather than e.g. the **last**):    

#. reproducible rerunning 

    #. want same scraping to occur no matter what the running details

#. reads every source entry ``limit 1 offset 0`` (too much for some tables) 

    #. reproducible skipping using SQL offsets ``limit 1 offset N``  
    #. SQLAlchemy expresses this with python array slicing ``[N:1]``
    #. add **offset** (skip name is taken) parameter with default of 0, for no skipping 

When a new instance is found it is collected and if the source vector is full 
(holds 2 instances) then age and change checks are performed and the result enum value is returned.

If aggregation is configured, a query invoking avg,min,max,std (or other similar functions) 
is performed after the row query, but only when the `SourceVector` is full (containing 2 instances) 
and using an aggregation timerange based on timestamps of the two row instances.

The aggregate results dict is tacked to the 2nd instance ``sv[-1].aggd`` 

Formerly used these destined-for-deprecation shortcuts::

     inst = self.source.qafter(tcursor).first()    
     inst = self.source.qafter(tcursor).filter(self.source.date_time > self[-1].date_time).first()     ## avoid sampling the same row as previous

Note that query slicing `q[n:n+1]` uses limit and offset under the covers, 
unlike q[-1] which dumbly queries all entries then supplies the last. 

Special cased startup,  see :dybsvn:`ticket:1241`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

At startup with an empty SV the former use of `tcut = tcursor` coupled with `>` requirements 
and offsets made it impossible to regain precisely the same source instance that 
created the TIMEEND. This caused target validity discontinuities 
in the case of immediate changes after scraper startup.

Avoid the gaps by plucking the exact same source instance that led to the  
TIMEEND. In order to regain that instance it is necessary to use offsetting of 0 
to see all source instances and use either "==" or ">=" for the time cut.

Plump for ">=" rather than "==" requirement to avoid having 
to seed empty target tables with TIMEEND matching the corresponding source 
instances to the precise second. 

Definition at line 379 of file sourcevector.py.

00380                               :
00381         """
00382         Samples/collects instances from source DB returning result enum : 
00383 
00384         :param tcursor: time cursor
00385         :rtype result:  result enum integer  noupdate/notfull/overage/changed/unchanged
00386  
00387         Outcome of using **first** instance beyond the cut (rather than e.g. **last**):    
00388 
00389         #. reproducible rerunning 
00390 
00391             #. want same scraping to occur no matter what the running details
00392 
00393         #. reads every source entry ``limit 1 offset 0`` (too much for some tables) 
00394 
00395             #. reproducible skipping using SQL offsets ``limit 1 offset N``  
00396             #. SQLAlchemy expresses this with python array slicing ``[N:1]``
00397             #. add **offset** (skip name is taken) parameter with default of 0, for no skipping 
00398 
00399         When a new instance is found it is collected and if the source vector is full 
00400         (holds 2 instances) then age and change checks are performed and the result enum value is returned.
00401 
00402         If aggregation is configured, a query invoking avg,min,max,std (or other similar functions) 
00403         is performed after the row query, but only when the `SourceVector` is full (containing 2 instances) 
00404         and using an aggregation timerange based on timestamps of the two row instances.
00405 
00406         The aggregate results dict is tacked to the 2nd instance ``sv[-1].aggd`` 
00407 
00408         Formerly used these destined-for-deprecation shortcuts::
00409 
00410              inst = self.source.qafter(tcursor).first()    
00411              inst = self.source.qafter(tcursor).filter(self.source.date_time > self[-1].date_time).first()     ## avoid sampling the same row as previous
00412 
00413         Note that query slicing `q[n:n+1]` uses limit and offset under the covers, 
00414         unlike q[-1] which dumbly queries all entries then supplies the last. 
00415 
00416         Special cased startup,  see :dybsvn:`ticket:1241`
00417         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
00418 
00419         At startup with an empty SV the former use of `tcut = tcursor` coupled with `>` requirements 
00420         and offsets made it impossible to regain precisely the same source instance that 
00421         created the TIMEEND. This caused target validity discontinuities 
00422         in the case of immediate changes after scraper startup.
00423 
00424         Avoid the gaps by plucking the exact same source instance that led to the  
00425         TIMEEND. In order to regain that instance it is necessary to use offsetting of 0 
00426         to see all source instances and use either "==" or ">=" for the time cut.
00427 
00428         Plump for ">=" rather than "==" requirement to avoid having 
00429         to seed empty target tables with TIMEEND matching the corresponding source 
00430         instances to the precise second. 
00431 
00432         """
00433         kls = self.source
00434         session = kls.db.session 
00435         tcut = self._tcut( tcursor )
00436 
00437         if self.status == self.state.empty:
00438             log.debug("attempting to resume from exactly where left off, plucking source instance at %s " % repr(tcut) )
00439             q = session.query(kls).order_by(kls.date_time).filter(kls.date_time >= tcut)   
00440             n = 0
00441         else:
00442             q = session.query(kls).order_by(kls.date_time).filter(kls.date_time > tcut)      # ">=" causes stuck at unchanged, as pull up same entry 
00443             n = int(self.offset)
00444 
00445         self.scraper.cooldownsec_sleep()
00446         try:
00447             inst = q[n:n+1][0]     ## confirmed to be doing "limit n,1" cf r14845 
00448         except IndexError:
00449             inst = None
00450 
00451         if self.status == self.state.empty:
00452             if inst.date_time == tcut:
00453                 log.info("startup succeeded to pickup from the exact instance id %s date_time %s tcut %s " % ( inst.id, inst.date_time, tcut ))
00454             else:
00455                 log.warn("startup FAILED to pickup from the exact instance id %s date_time %s tcut %s " % ( inst.id, inst.date_time, tcut ))
00456                 ## cannot assert on this without requiring target seeds with second precise TIMEENDs  
00457 
00458         if not inst:
00459             return self.result.noupdate 
00460 
00461         inst.aggd = None      ## maybe filled in below
00462         self._collect(inst)   ## displacing prior top inst in steady state of being full, potentially changes status
00463 
00464         if self.status != self.state.full:
00465             return self.result.notfull     ## not being full only occurs at startup
00466 
00467         ## aggregate when configured and full, for group_by querying this holds a list 
00468         self[-1].aggd = self._aggregate([self[0].date_time,self[-1].date_time]) if self.aggfns else None
00469 
00470         age = self[-1].age(self[0]) 
00471         if age > self.scraper.maxage:
00472             return self.result.overage
00473 
00474         ## when using aggregates this could be based on changes in aggregate dict or row instances or mixtures thereof
00475         changed = self.scraper.changed( self )      
00476         return self.result.changed if changed else  self.result.unchanged        
00477 
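The startup special case amounts to an inclusive time cut with zero offset, versus an exclusive cut with the configured offset in steady state; a sketch over an in-memory list of ``(id, date_time)`` rows standing in for the SQL query:

```python
from datetime import datetime

def first_beyond(rows, tcut, empty, offset=0):
    """Pluck one row at/after the time cut.

    rows  : (id, date_time) tuples sorted by date_time
    empty : True at startup; inclusive cut and zero offset regain the
            exact instance that produced the target TIMEEND
    """
    if empty:
        hits = [r for r in rows if r[1] >= tcut]
        n = 0
    else:
        # ">=" here would pull up the same entry and stick at unchanged
        hits = [r for r in rows if r[1] > tcut]
        n = offset
    return hits[n] if n < len(hits) else None

rows = [(1, datetime(2011, 1, 1, 10, 0)),
        (2, datetime(2011, 1, 1, 10, 5)),
        (3, datetime(2011, 1, 1, 10, 10))]
cut = datetime(2011, 1, 1, 10, 5)
# at startup the 10:05 row itself is regained; afterwards only rows beyond it
```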

def Scraper::base::sourcevector::SourceVector::lastentry (   self)
Query source to find the last entry with `date_time` greater than the timecursor.
When the tcursor is approaching the cached last entry time a query is needed; otherwise just use the cached entry.

:return: SA instance or None 

Definition at line 478 of file sourcevector.py.

00479                        :
00480         """
00481         Query source to find last entry with `date_time` greater than the timecursor.
00482         When the tcursor is approaching the cached last entry time, need to query; otherwise just use cached            
00483 
00484         :return: SA instance or None 
00485         """ 
00486         if self._lastentry and self._tcursor: 
00487             age = self._lastentry.date_time - self._tcursor 
00488             age = abs(age.seconds)
00489         else:
00490             age = 0  
00491 
00492         agecut = 10*self.scraper.interval.seconds
00493 
00494         if age < agecut:
00495             log.debug("lastentry cache miss as age %s < agecut %s " % ( age, agecut ))
00496             kls = self.source
00497             session = kls.db.session 
00498             q = session.query(kls).order_by(kls.date_time.desc()).filter(kls.date_time > self._tcursor)  
00499             last = q.first()     ## confirmed : limit 0,1 
00500             self._lastentry = last 
00501         else:
00502             log.debug("lastentry cached hit as age %s >= agecut %s " % ( age, agecut ))
00503             last = self._lastentry
00504         return last
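The cache policy of ``lastentry`` can be reduced to a pure decision function (a sketch assuming the same 10-interval threshold; ``total_seconds`` is used here for clarity where the source takes ``abs(age.seconds)``):

```python
from datetime import datetime

def need_lastentry_query(cached_last_dt, tcursor, interval_seconds):
    """Re-query the source for the last entry only when the tcursor
    is within 10 sampling intervals of the cached last entry time."""
    if cached_last_dt is None or tcursor is None:
        return True
    age = abs((cached_last_dt - tcursor).total_seconds())
    return age < 10 * interval_seconds

t0 = datetime(2011, 1, 1, 10, 0)
last = datetime(2011, 1, 1, 10, 3)   # cached last entry, 180s ahead of the cursor
```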

def Scraper::base::sourcevector::SourceVector::lag (   self)
:return: timedelta instance representing scraper lag, or None if no last entry beyond the tcursor 

Query source to find datetime of last entry and return the time difference ``last - tcursor`` 
indicating how far behind the scraper is. This will normally be positive indicating that the scraper is 
behind.

It would be inefficient to do this on every iteration. 

Definition at line 505 of file sourcevector.py.

00506                  :
00507         """
00508         :return: timedelta instance representing scraper lag, or None if no last entry beyond the tcursor 
00509 
00510         Query source to find datetime of last entry and return the time difference ``last - tcursor`` 
00511         indicating how far behind the scraper is. This will normally be positive indicating that the scraper is 
00512         behind.
00513 
00514         It would be inefficient to do this on every iteration 
00515         """
00516         t = self._tcursor 
00517         last = self.lastentry()
00518         log.debug("lag query for cut %s %s " % ( t, last) ) 
00519         return last.date_time - t if last else None
00520 
00521 


Member Data Documentation

tuple Scraper::base::sourcevector::SourceVector::state = Enum(("empty", "partial", "full", "ERROR" )) [static]

Definition at line 52 of file sourcevector.py.

tuple Scraper::base::sourcevector::SourceVector::result = Enum(("noupdate","notfull","overage","changed","unchanged", "init","lastfirst", "replay", "ERROR")) [static]

Definition at line 53 of file sourcevector.py.

Scraper::base::sourcevector::SourceVector::offset

SQL offset allowing source entry skipping.

Definition at line 59 of file sourcevector.py.

Scraper::base::sourcevector::SourceVector::_lastentry

cached last entry SA instance, for lag determination without querying

Definition at line 60 of file sourcevector.py.

Scraper::base::sourcevector::SourceVector::lastresult

lastresult enum follows progress

Definition at line 61 of file sourcevector.py.


Property Documentation

Scraper::base::sourcevector::SourceVector::tcursor = property(get_tcursor, set_tcursor, doc=set_tcursor.__doc__ ) [static]

Definition at line 134 of file sourcevector.py.

Scraper::base::sourcevector::SourceVector::status = property( _status , doc="enum status integer") [static]

Definition at line 214 of file sourcevector.py.

Scraper::base::sourcevector::SourceVector::status_ = property( lambda self:self.state.whatis(self.status) , doc="status string representing `status` enum integer" ) [static]

Definition at line 215 of file sourcevector.py.

Scraper::base::sourcevector::SourceVector::lastresult_ = property( lambda self:self.result.whatis(self.lastresult), doc="progress string representing `lastresult` enum integer" ) [static]

Definition at line 216 of file sourcevector.py.

Scraper::base::sourcevector::SourceVector::ids = property(lambda self:tuple(map(lambda _:int(getattr(_,'id')),self))) [static]

Definition at line 272 of file sourcevector.py.


The documentation for this class was generated from the following file:

sourcevector.py

Generated on Fri May 16 2014 09:50:03 for Scraper by doxygen 1.7.4