Scraper

In addition to this API reference documentation, see the generic scraping introduction at Scraping source databases into offline_db.

Table-specific scraper module examples

Scraper.pmthv

PMT HV scraping specialization

Scraper.pmthv.PmtHv

class Scraper.pmthv.PmtHv(*args, **kwa)[source]

Bases: Scraper.base.regime.Regime

Regime frontend class with a simple prescribed interface: it takes the cfg argument into this dict and no arguments in the call. This allows the frontend to be entirely generic.

Scraper.pmthv.PmtHvSource

class Scraper.pmthv.PmtHvSource(srcdb)[source]

Bases: list

Parameters:srcdb – source DB instance of Scraper.base.DCS

List of source SA classes that map tables/joins in srcdb. Accommodates a table naming irregularity: HVPw rather than HV_Pw.

Scraper.pmthv.PmtHvScraper

class Scraper.pmthv.PmtHvScraper(srcs, target, cfg)[source]

Bases: Scraper.base.scraper.Scraper

Parameters:
  • srcs – list of source SA classes
  • target – Target instance that encapsulates the DybDbi class
  • cfg – instance of the relevant Regime subclass (which is a dict holding config)

Config options:

Parameters:
  • maxiter – maximum iterations or 0 for no limit
  • interval – timedelta cursor step size
  • maxage – timedelta maximum age, beyond which even an unchanged row gets written
  • sleep – timedelta sleep between scrape update sampling
changed(sv)[source]
Parameters:sv – source vector instance Scraper.base.sourcevector.SourceVector

Decide whether there is sufficient change to propagate, based on differences between the first and last elements of the SourceVector instance argument.
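
For illustration only, a hypothetical sketch of this kind of comparison (not the PmtHvScraper source); the attribute name voltage and the use of self.threshold are assumptions:

# Hypothetical sketch only; `voltage` and self.threshold are assumed names.
from Scraper.base.scraper import Scraper

class ExampleHvScraper(Scraper):
    def changed(self, sv):
        first, last = sv[0], sv[-1]   # oldest and newest sampled source instances
        # coerce to float in case the DB layer returns decimal.Decimal
        return abs(float(last.voltage) - float(first.voltage)) > float(self.threshold)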

propagate(sv)[source]
Parameters:sv – source vector instance Scraper.base.sourcevector.SourceVector

Yield write-ready DybDbi target dicts to the base class; note that a single source vector instance can yield multiple target dicts. The keys of the target dict must match the specified attributes of the DybDbi target class.

Here the output is based entirely on the last element of the source vector. A smarter implementation might average the first and last to smooth variations. The Python yield statement makes it possible to iterate over what is returned by a function/method.
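
Continuing the hypothetical ExampleHvScraper sketch above, a propagate of this kind might look as follows; the dict keys are illustrative and must match the DybDbi target class attributes in real use:

    # Hypothetical sketch; key names (Voltage, Pw) are illustrative only.
    def propagate(self, sv):
        last = sv[-1]            # output based on the last element only
        yield dict(Voltage=float(last.voltage), Pw=int(last.pw))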

seed(sc)[source]

Used for seeding target DB when testing into empty tables

Parameters:sc – source class, potentially different seeds will be needed for each source that feeds into a single target

Scraper.pmthv.PmtHvFaker

class Scraper.pmthv.PmtHvFaker(srcs, cfg)[source]

Bases: Scraper.base.faker.Faker

Creates fake instances and inserts them into sourcedb

fake(inst, id, dt)[source]

Invoked from base class call method, set attributes of source instance to form a fake

Parameters:
  • inst – source instance
  • id – id to assign to the instance
  • dt – date_time to assign to the instance
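
A minimal sketch of what such a fake() specialization might do; the voltage attribute and its value are fabricated for illustration and are not the real source schema:

# Hypothetical sketch of a fake() specialization; `voltage` is illustrative.
import random
from Scraper.base.faker import Faker

class ExampleHvFaker(Faker):
    def fake(self, inst, id, dt):
        inst.id = id                                    # suggested id
        inst.date_time = dt                             # suggested date_time
        inst.voltage = 1500. + random.uniform(-5., 5.)  # fabricated reading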

Scraper.adtemp

AD Temperature scraping specialization

Scraper.adtemp.AdTemp

class Scraper.adtemp.AdTemp(*args, **kwa)[source]

Bases: Scraper.base.regime.Regime

Regime frontend class with a simple prescribed interface: it takes the cfg argument into this dict and no arguments in the call, allowing the frontend to be entirely generic.

Scraper.adtemp.AdTempSource

class Scraper.adtemp.AdTempSource(srcdb)[source]

Bases: list

A list of SQLAlchemy dynamic classes

Coordinates of source table/joins

Scraper.adtemp.AdTempScraper

class Scraper.adtemp.AdTempScraper(srcs, target, cfg)[source]

Bases: Scraper.base.scraper.Scraper

Specialization of generic scraper for AD temperature tables

Parameters:
  • srcs – list of source SA classes
  • target – Target instance that encapsulates the DybDbi class
  • cfg – instance of the relevant Regime subclass (which is a dict holding config)

Config options:

Parameters:
  • maxiter – maximum iterations or 0 for no limit
  • interval – timedelta cursor step size
  • maxage – timedelta maximum age, beyond which even an unchanged row gets written
  • sleep – timedelta sleep between scrape update sampling
changed(sv)[source]

Returns the changed decision to the base class.

Caution: the DB/SQLAlchemy layer provides decimal.Decimal values; unify types to float before comparison to avoid surprises.
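
A hedged sketch of the float coercion idea described above (not the AdTempScraper source); the temp_pt1 attribute name and the use of self.threshold are assumptions:

# Sketch only: coerce decimal.Decimal to float before comparing.
# `temp_pt1` and self.threshold are assumed names.
from Scraper.base.scraper import Scraper

class ExampleAdTempScraper(Scraper):
    def changed(self, sv):
        a = float(sv[0].temp_pt1)    # unify types to float
        b = float(sv[-1].temp_pt1)
        return abs(b - a) > float(self.threshold)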

propagate(sv)[source]

yields one or more target dicts ready for writing to target DB

Scraper.adtemp.AdTempFaker

class Scraper.adtemp.AdTempFaker(srcs, cfg)[source]

Bases: Scraper.base.faker.Faker

fake(inst, id, dt)[source]

Invoked from base class, sets source instance attributes to form a fake

Parameters:
  • inst – source instance
  • id – suggested id to use
  • dt – suggested date_time to use

Note that the settings cannot easily be done in the framework, as inst can represent a join of multiple tables, requiring specialized action.

Scrapers in development

Scraper.adlidsensor

AD lid sensors scraping specialization

Discussion from Wei:

  1. we were discussing scraping the average, its standard deviation, the minimum and the maximum within each hour.

  2. It seems averaging once per hour is sufficient. (Note: reactor flux will be available more sparsely than once per hour.)

    References:
      • doc:6673 discussion
      • doc:6996 for the current status given by David W.
      • doc:6983 summarizes the lid sensor data so far.

Scraper.adlidsensor.AdLidSensor

class Scraper.adlidsensor.AdLidSensor(*args, **kwa)[source]

Bases: Scraper.base.regime.Regime

Regime frontend class with a simple prescribed interface: it takes the cfg argument into this dict and no arguments in the call, allowing the frontend to be entirely generic.

Scraper.dcs : source DB naming conventions

Encapsulation of naming conventions for tables and fields used in DCS database

Scraper.base : directly used/subclassed

Functions/classes subclassed or used directly by specific table scrapers

Scraper.base.main()

Scraper.base.main()

Scraper/Faker frontend

Parses the config into the cfg dict and imports, instantiates and calls the regime class identified by the cfg. This pattern minimises code duplication and retains flexibility.

  1. bulk behaviour is controlled via a single argument pointing to a section of the config file $SCRAPER_CFG, which defines all the default option settings
  2. the config includes which classes to import into the main and invoke... so a simple common interface is needed for frontends in all regimes: pmthv/adtemp (see the sketch after this list)
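
A minimal sketch (not the actual Scraper.base.main) of the import-instantiate-call pattern just described, resolving a "package.module:Class" regime string like the one in the config example below; the run function name is hypothetical:

# Sketch of the generic frontend pattern: resolve the regime class named
# in the cfg (eg "Scraper.adtemp:AdTemp"), instantiate it with the cfg
# dict, then call it with no arguments.
import importlib

def run(cfg):
    modname, klsname = cfg['regime'].split(':')
    mod = importlib.import_module(modname)
    regime = getattr(mod, klsname)(cfg)   # Regime ctor takes cfg into itself
    regime()                              # no-args call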

Note that the default section and its settings, together with the option names used to change those defaults, are listed by:

scr.py --help    
scr.py -s adtemp_testscrape --help    ## show defaults for this section
scr.py -s adtemp_faker --help  

Typical Usage:

scr.py -s adtemp_scraper 
scr.py -s pmthv_scraper 

scr.py -s adtemp_faker 
scr.py -s pmthv_faker

During testing/development, options can be added to change the behavior.

The primary argument points to the section of .scraper.cfg which configures the details of the scrape:

[adtemp_scraper]

regime = Scraper.adtemp:AdTemp
kls = GDcsAdTemp
mode = scraper

source = fake_dcs
target = offline_db_dummy

interval = 10s
sleep = 3s
maxage = 10m 

threshold = 1.0
maxiter = 100

dbi_loglevel = INFO

Scraper.base.Regime

class Scraper.base.Regime(*args, **kwa)

Bases: dict

The regime class ctor takes the cfg as its sole argument; being a dict, it takes the cfg into itself.

initialize()

Preparations done prior to calling the regime class, including:

setsignals()

signal handling following the approach of supervisord

Scraper.base.DCS

Specialization of SA providing SQLAlchemy access to source DCS DB

class Scraper.base.DCS(dbconf)

Bases: Scraper.base.sa.SA

SQLAlchemy connection to database, performing table reflection and mappings from tables

Specializations:

  1. standard query ordering, assuming a date_time attribute in tables
qafter(kls, cut)

date_time ordered query for instances at or beyond the time cut:

t0  t1  t2  t3  (t4  t5  t6  t7  t8  t9 ... )  
Parameters:
  • kls – source SQLAlchemy mapped class
  • cut – local time cutoff datetime
qbefore(kls, cut)

date_time ordered query for instances before the cut
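
For illustration, a minimal SQLAlchemy sketch of the kind of date_time ordered queries described for qafter and qbefore; session is an assumed SQLAlchemy session and kls a mapped class with a date_time attribute, so this is not the project code:

# Sketch of date_time ordered queries; `session` and `kls` are assumed.
def qafter(session, kls, cut):
    return session.query(kls).filter(kls.date_time >= cut).order_by(kls.date_time)

def qbefore(session, kls, cut):
    return session.query(kls).filter(kls.date_time < cut).order_by(kls.date_time)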

subbase(dtn)

Subclass to use, which can depend on the table coordinate

Scraper.base.Scraper

class Scraper.base.Scraper(srcs, target, cfg)

Bases: Scraper.base.propagator.Propagator

Base class holding common scrape features, such as the scrape logic which assumes:

  1. source instances correspond to fixed time measurement snapshots
  2. target instances represent source measurements over time ranges
  3. two source instances are required to form one target instance; the target validity is derived from the datetimes of the two source instances

Initialisation in Propagator superclass

Parameters:
  • srcs – list of source SA classes
  • target – Target instance that encapsulates the DybDbi class
  • cfg – instance of the relevant Regime subclass (which is a dict holding config)

Config options:

Parameters:
  • maxiter – maximum iterations or 0 for no limit
  • interval – timedelta cursor step size
  • maxage – timedelta maximum age, beyond which even an unchanged row gets written
  • sleep – timedelta sleep between scrape update sampling
changed(sv)

Override in subclasses to return whether a significant change in source instances is observed. This, together with age checks, is used to decide if the propagate method is called.

Parameters:sv – source vector containing two source instances to interrogate for changes
propagate(sv)

Override this method in subclasses to yield one or more write ready target dicts derived from the sv[-1] source instance or sv[-1].aggd aggregate dict

Parameters:sv – source vector containing two source instances to propagate to one target write
tunesleep(i)

Every self.tunesleepmod iterations, check the lag behind the sources and adjust the sleep time accordingly, allowing the rate to be turned up in order to catch up.

The tune heuristic uses an effective heartbeat, which is the time between entries of interest to the scraper, ie the time between source updates scaled by offset+1.

It only makes sense to tune after a write, as it is only then that the tcursor gets moved ahead. When close to current, the sleep time can correspond to the timecursor interval; when behind, sleep less to allow swift catchup.

POSSIBLE ISSUES

  1. if ebeatlag never gets to 0, the sleep time will sink to the minimum
    1. minimum was formerly 0.1, adjusted to max(0.5,ebeatsec/10.) out of concern for excessive querying
    2. adjusting to ebeatsec would be too conservative : would prevent catchup
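
A rough sketch of the shape of this heuristic (assumptions only, not the actual tunesleep code); the max(0.5, ebeatsec/10.) floor comes from the note above, and the halving step is illustrative:

# Rough sketch: shrink the sleep toward a floor when lagging behind the
# source, sleep a full effective beat when caught up.
def tune_sleep(ebeatsec, ebeatlag, sleepsec):
    floor = max(0.5, ebeatsec / 10.)      # avoid excessive querying
    if ebeatlag > 0:                       # behind: speed up to catch up
        return max(floor, sleepsec / 2.)
    return ebeatsec                        # caught up: one effective beat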

Scraper.base.Target

class Scraper.base.Target(*args, **kwa)

Bases: dict

Encapsulate DybDbi dealings here to avoid cluttering Scraper

Relevant config parameters

Parameters:timefloor – None or a datetime or string such as ‘2010-09-18 22:57:32’ used to limit the expense of validity query
instance(**kwa)

Might fail with TypeError if kwa cannot be coerced, eg from aggregate queries returning None when there are zero samples.

If the attribute names are not expected for the target kls they are skipped; this will be the case for the system attributes _date_time_min and _date_time_max.

lastvld(source)

Last validity record in the target database for the context corresponding to the source class. Query expense is restricted by the timefloor. If timefloor is None a sequence of progressively more expensive queries is performed to get the target last validity.

Parameters:
  • source – source context instance, either an xtn or a MockSource instance, with subsite and sitemask attributes
  • timefloor – time after which to look for validity entries in target database or None

Note this is called only at scraper initialization, in order for the scraper to find its time cursor.

require_manual(msg)

Require manual operation (ie running scr.py from the commandline), preventing usage of rare operations/options under supervisor control.

seed(srcs, scraper, dummy=False)

This is invoked at scraper instantiation when the conditions are met:

  1. seed_target_tables is configured True

Seed entries are written to the target table. The seed validity range is configured with the options seed_timestart and seed_timeend; formerly the payload entry was specified by the seed() method implemented in the scraper class.

Attempting to perform seeding under supervisor raises an exception, to enforce this restriction.

When testing seeding, start from scratch with eg:

mysql> drop table DcsAdTemp, DcsAdTempVld ;    
mysql> update LOCALSEQNO set LASTUSEDSEQNO=0 where TABLENAME='DcsAdTemp' ;  

Changes from Oct 2012:

  1. allow use against an existing table
  2. table creation functionality is removed
  3. move to payloadless seeds (removing need for dummy payload instances)

Motivated by the need to add new sources that contribute to an existing target which has already collected data eg for adding ADs to the DcsAdWpHv scraper.

writer(sv, localstart=None, localend=None)

Prepare DybDbi writer for target class, with contextrange/subsite appropriate for the source instance

Non-default localstart and localend are typically only used for aggregate_group_by querying, where the instance datetimes such as sv[0].date_time do not correspond to the contextrange of the aggregate dict.

Parameters:
  • sv – source vector instance that contains instances of an SA mapped class
  • localstart – default of None corresponds to sv[0].date_time
  • localend – default of None corresponds to sv[-1].date_time

Scraper.base.Faker

class Scraper.base.Faker(srcs, cfg)

Bases: list

create fake source instances and insert them

Other classes used internally

Scraper.base.sourcevector.SourceVector

class Scraper.base.sourcevector.SourceVector(scraper, source)[source]

Bases: list

This is simply a holder for source instances and the timecursor; the action is driven by the Scraper instance, which invokes the SourceVector.__call__ method to perform the sampling, ie querying for new instances at times beyond the tcursor.

As each instance is collected the prior last instance is discarded, until sufficient deviation (in age or value) between the first and last is seen. Deviation results in this SourceVector being collapsed to start again from the last sample. This is also driven from the Scraper by setting the tcursor property.

Manages:

  1. 0,1 or 2 source class instances
  2. timecursor
  3. lastresult enum from last _sample

Actions:

  1. checks to see if conditions are met to propagate collected source instances into target, in __call__ method
Parameters:
  • scraper – parent scraper
  • source – SA mapped class
iinst(i)[source]
Parameters:i – instance index into source vector
Returns:source instance or None
lag()[source]
Returns:timedelta instance representing the scraper lag, or None if there is no last entry beyond the tcursor

Query the source to find the datetime of the last entry and return the time difference last - tcursor, indicating how far behind the scraper is. This will normally be positive, indicating that the scraper is behind.

It would be inefficient to do this on every iteration.
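
A hedged SQLAlchemy sketch of the lag computation just described; session is an assumed session for the source DB and this is not the project code:

# Sketch: datetime of the most recent source entry minus the timecursor.
def lag(session, kls, tcursor):
    last = session.query(kls).order_by(kls.date_time.desc()).first()
    return (last.date_time - tcursor) if last is not None else None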

lastentry()[source]

Query the source to find the last entry with date_time greater than the timecursor. When the tcursor is approaching the cached last entry time, a fresh query is needed; otherwise the cached entry is used.

Returns:SA instance or None
lastresult_

progress string representing lastresult enum integer

set_tcursor(tc)[source]

Assigning to this sv.tcursor not only changes the cursor but also collapses the SourceVector ready to collect new sampled source instances.

smry()[source]

Example:

SV 4   (43, 46) 2011-01-10 10:02:00 full       unchanged  (10:34:28 10:34:34)  
# calls   ids        tcursor        status     lastresult       times  

Shows the status of source vector including the id and date_time of source entries sv[0] and sv[-1]

calls – iteration count
ids – id of source entries
tcursor – timecursor, stepping through time; changes only at each propagation
status – fullness of source vector: empty/partial/full (full means 2 entries)
lastresult – possibilities: "noupdate", "notfull", "overage", "changed", "unchanged", "init", "lastfirst"
times – date_time of source entries; changes as each sample is made
status

enum status integer

status_

status string representing status enum integer

tcursor

Assigning to this sv.tcursor not only changes the cursor but also collapses the SourceVector ready to collect new sampled source instances.

Scraper.base.aparser.AParser : argparser/configparser amalgam

class Scraper.base.aparser.AParser(*args, **kwargs)[source]

Bases: argparse.ArgumentParser

Primes an argparser with defaults read from a section of a ConfigParser-style config file, and sets up logging.

Operates via 2-stage parsing

Usage:

parser = AParser(defpath="~/.scraper.cfg",defsect="default")
parser.add_argument( '-m','--maxiter', help="maximum iterations, or 0 for no limit")
parser.set_defaults( maxiter=0 )
args = parser() 
print args

Draws upon:

Scraper.base.parser.Parser :

class Scraper.base.parser.Parser(*args, **kwargs)[source]

Bases: Scraper.base.aparser.AParser

To see all the available options and defaults for a particular config section:

scr.py --help 
scr.py -s adtemp_scraper --help
scr.py -s pmthv_scraper --help

Performs two-stage parsing, with the first stage driven by the -s/--sect option to specify the section name within a configuration file. The path from which the config file is read can be controlled by SCRAPER_CFG, with default value:

echo $SCRAPER_CFG      ## path of default config file 
    -->  $SITEROOT/dybgaudi/Database/Scraper/python/Scraper/.scraper.cfg
    -->  $SCRAPERROOT/python/Scraper/.scraper.cfg

Note that the first stage of parsing occurs in the AParser.__init__ which:

  1. provides config section name and path
  2. primes the base AParser dict with defaults read from that section

The 2nd stage parse typically does nothing, as it is preferable to keep config at defaults read from file. This commandline control is mainly for testing/debugging.

Note the configparser/argparser mismatch in boolean handling:

  1. argparse typically has “store_true/store_false” actions for convenient/brief commandline control
  2. configparser and config file understandability requires True/False strings

We have sided with configparser, as the commandline interface beyond the -s is mainly for developer usage. However some options, such as --dryrun, which make little sense in config files, buck this tendency.

classmethod config(defsect='adtemp_scraper', defpath=None, noargs=False)[source]

Convenience classmethod for config access
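
A hedged usage sketch of this convenience classmethod, following the defaults shown in the signature; the returned object is assumed to behave like the cfg used elsewhere, and noargs=True is assumed to skip commandline parsing:

# Hedged usage sketch; defsect/noargs follow the signature above.
from Scraper.base.parser import Parser
cfg = Parser.config(defsect="adtemp_scraper", noargs=True)
print cfg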

Scraper.base.sa.SA : details of SQLAlchemy connection

class Scraper.base.sa.SA(dbconf)[source]

Bases: object

Manages SQLAlchemy DB connections, orchestrates reflection on tables, dynamic class creation and mapping from tables to classes. These are all done lazily, when a class is requested via .kls(xtn)

kls(xtn)[source]

Return the mapped dynamic class for an xtn instance

reflect(tn)[source]

Reflect on the table, recording it in the meta

table(tn)[source]

Return the sqlalchemy.schema.Table representation of a table, reflect upon the table if not already done
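
For illustration, a minimal sketch of the reflect-and-map pattern this class describes, written in the classic (pre-2.0) SQLAlchemy style contemporary with this code; the connection URL is a placeholder and this is not the project implementation:

# Minimal sketch of lazy reflection plus dynamic class mapping.
from sqlalchemy import create_engine, MetaData, Table
from sqlalchemy.orm import mapper

engine = create_engine("mysql://user:pass@host/fake_dcs")   # placeholder URL
meta = MetaData()

def table(tn):
    # reflect the named table on first use and cache it in the metadata
    if tn not in meta.tables:
        Table(tn, meta, autoload=True, autoload_with=engine)
    return meta.tables[tn]

def kls(tn):
    # dynamically create a class and map it onto the reflected table
    cls = type(str(tn), (object,), {})
    mapper(cls, table(tn))
    return cls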