Scraping source databases into offline_db

In addition to this introductory documentation, see also the API reference documentation at Scraper.

Generic Scraping Introduction

Pragmatic Goals
  • eliminate duplication in scraper code
  • make it easy to scrape new tables with little additional code to develop/debug
  • use DybDbi for writing to offline_db, eliminate problems arising from incomplete DBI spoofing by using real DBI for all writing
Assumptions/features of each scraper
  • 2 databases : source and target
  • target is represented by DybDbi generated classes
  • source(s) are represented by SQLAlchemy mapped classes which can correspond to entries in single source tables or entries from the result of joins to multiple source tables
  • one source instance corresponds to 1 or more DybDbi writes under a single DBI writer/contextrange

Scraper Status

regime        target table            notes
pmthv         GDcsPmtHv GDcsAdPmtHv   duplicates old scraper with new framework, needs testing by Liang before deployment
adtemp        GDcsAdTemp              duplicates old scraper with new framework, needs testing by Liang before deployment
adlidsensor   GDcsAdLidSensor         development started end August by David Webber
muoncalib?    GDcsMuonCalib           interest expressed by Deb Mohapatra
wppmt?        GDcsWpPmtHv             ?
adgas?        ?                       Raymond named as responsible by Wei, doc:7005
Existing scraper modules are visible at dybgaudi:Database/Scraper/python/Scraper
Existing target table specifications are at dybgaudi:Database/DybDbi/spec

DCS peculiarities

DCS tables grouped/ordered by schema

DCS tables have the nasty habit of encoding content (stuff that should be in rows) into table and column names. As a result the mapping from source to target in code must interpret these names, and sometimes one row of a source DCS table will become many rows of the destination table.
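
For example, where a source table encodes the AD number into its column names, the scraper's propagate method (described under Implementing Scrapers below) may have to yield one target row per AD from a single source row. A purely hypothetical sketch, with made-up column and attribute names:

def propagate(self, sv):
    inst = sv[-1]                      ## latest source instance
    for ad in (1, 2):                  ## column/attribute names below are illustrative only
        yield { 'PhysAdId' : ad,
                'Temp_PT1' : getattr(inst, 'AD%d_Temp_PT1' % ad) }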

The task of developing scrapers is much easier when:

  • source and target tables are developed with scraping in mind

Time zones and scraping

Time zone conventions assumed by the generic scraper:

  • All timestamps in offline_db and tmp_offline_db are in UTC, hardcoded into DBI: cannot be changed
  • All timestamps in DCS and DAQ DB are in local(Beijing) time

If the 2nd assumption is not true for your tables, you must either change your tables to follow the local standard of bad practice or request special handling in the scraper framework.
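
As an illustration, converting a DCS local (Beijing, UTC+8) timestamp to the UTC used by DBI is a fixed 8 hour shift; a minimal sketch:

from datetime import datetime, timedelta

local_dcs = datetime(2011, 2, 1, 8, 0, 0)      ## a DCS timestamp in Beijing local time (UTC+8)
utc_dbi   = local_dcs - timedelta(hours=8)     ## 2011-02-01 00:00:00, the UTC time as stored by DBI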

TODO

Framework Level

  1. scraper catchup feature needs documenting and further testing
  2. DAQ support ? probably no new DAQ tables coming down pipe, but still needs DBI writing
  3. confirm assumption : all DCS times local, all DBI times UTC
  4. more precise testing, with fully controlled faking/scraping and comparison against expectations (not high priority as this kind of precision is not really expected from a scraper)

Specific Regimes

  1. in old scraper code : table names do not match current offline_db : DcsPmtHv
  2. in old scraper code : apparently no timezone handling ?

Running Scrapers

Warning

scraper configs include source and target DBCONF sections, so ensure that the corresponding entries in ~/.my.cnf point at the intended databases before running scrapers or fakers

Dybinst Level

To allow use of scrapers and fakers from a pristine environment, such as when invoked under supervisord control, a dybinst level interface is provided:

./dybinst trunk scrape adtemp_scraper
./dybinst trunk scrape pmthv_scraper

The last argument specifies a named section in $SCRAPERROOT/python/Scraper/.scraper.cfg. When testing, fake entries can be written to a fake source DB using a faker config section, for example:

./dybinst trunk scrape adtemp_faker
./dybinst trunk scrape pmthv_faker

Package Level

The dybinst interface has the advantage of operating from an empty environment but is not convenient for debugging/testing. When within the environment of the dybgaudi:Database/Scraper package (included in the standard DybRelease environment) it is preferable to use directly:

scr.py --help               ## uses a default section
scr.py -s adtemp_scraper
scr.py -s adtemp_faker

Examining the help is useful for seeing the config defaults for each config section:

scr.py -s adtemp_faker --help
scr.py -s adtemp_scraper --help

Implementing Scrapers

The generic scraper framework enables the addition of new scrapers with only code that is specific to the source and target tables concerned. The essential tasks are to judge sufficient change to warrant propagation and to translate from source instances to one or more target DBI table instances. Note that the source instances can be joins between multiple source tables.

Outline Steps

  1. Create offline_db target table by committing a .spec file and building DybDbi, DB Table Creation
  2. Create scraper module, implementing only the table specifics: Create Scraper Module
  3. Test scraper operation into a copy of offline_db, Copy offline_db to tmp_offline_db

Create Scraper Module

Scraper modules live in dybgaudi:Database/Scraper/python/Scraper. To eliminate duplication they only handle the specifics of transitioning source DCS/DAQ table(s) columns into target offline_db table columns as specified in your .spec file.

Compare and contrast the example scraper modules, such as Scraper.pmthv and Scraper.adtemp:

Note the structure of classes, using PmtHv as an example:

  1. PmtHv(Regime) umbrella sub-class

  2. PmtHvSource(list) list of source tables (or joins of tables)

  3. PmtHvScraper(Scraper) sub-class that implements two methods, both taking single SourceVector sv argument

    1. changed(self,sv) returns True/False indicating if there is sufficient change to justify calling the propagate method
    2. propagate(self,sv) converts source vector into one or more yielded target dicts with keys corresponding to .spec file attribute names
  4. PmtHvFaker(Faker) sub-class used to Fake entries in the source DB table(s) to allow fully controlled testing

Further implementation details are documented in the API docs at Scraper.
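
As an orientation, a bare skeleton following the structure above might look like the sketch below; the base class import path and all constructor/registration details are assumptions, consult Scraper.pmthv for the real module:

from Scraper.base import Regime, Scraper, Faker   ## import path assumed; see Scraper.pmthv for the real usage

class ExampleRegime(Regime):
    """umbrella sub-class tying source, scraper and faker together"""

class ExampleSource(list):
    """list of source tables (or joins of tables) represented as SQLAlchemy mapped classes"""

class ExampleScraper(Scraper):
    def changed(self, sv):
        return False              ## see Implementing changed below
    def propagate(self, sv):
        yield {}                  ## see Implementing propagate below

class ExampleFaker(Faker):
    def fake(self, inst, id, dt):
        pass                      ## see Test Scraper With Faker below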

Implementing changed

The simplest changed implementation:

def changed(self, sv ):
    return False

The source vector sv holds 2 source instances, accessible as sv[0] and sv[-1], corresponding to the last propagated instance and the latest one. Even with a changed implementation that always returns False, the propagate will still be called when the age difference between sv[0] and sv[-1] exceeds the maxage configuration setting.
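
For example, a changed implementation comparing a single source value between sv[0] and sv[-1] against the configured threshold (available as self.threshold) could look like the following sketch; the attribute name is illustrative only:

def changed(self, sv):
    ## attribute name is illustrative only; use a column of your source table
    ## float() guards against the config value arriving as a string
    return abs(sv[-1].DBNS_SAB_Temp_PT1 - sv[0].DBNS_SAB_Temp_PT1) > float(self.threshold)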

Note

changed() is not intended for age checking; instead just use a config setting such as maxage=1h for that

If Generic Aggregation can be used it is easier and more efficient to do so. However if the required aggregation cannot be performed with MySQL aggregation functions then the changed() method could be used to collect samples as shown in the example below. Note that self.state is entirely created/updated/used within the changed and propagate methods. This is just an example of how to maintain state, entirely separately from the underlying framework:

def changed(self, sv):
    if not hasattr(self, 'state'):          ## only on 1st call when no state
        kls = self.target.kls               ## the genDbi target class
        keys = kls.SpecKeys().aslist()
        state = dict(zip(keys, map(lambda _:0, keys)))    ## state dict with all values 0
        self.state = state

    ## work of associating source to target attributes
    for k in self.state:
        sk = ..some_fn..( k )       ## placeholder: derive the source key from the target key
        self.state[k] += sv[-1][sk] ## running sum; running min/max/avg would be kept similarly

    return False    ## True if sufficient changes to warrant non-age based propagation

Implementing propagate

The main work of changed and propagate is translating between the source instances, eg sv[-1], and the target dict ready to be written using the target genDbi class. The ease with which this can be done depends on the design of source and target.

Example implementation, for when accumulation is done at each changed sampling:

def propagate(self, sv ):
    yield self.state

Alternatively, if there is no need to accumulate over samples and writes are based only on the last values, see the examples (and the sketch below):

  1. Scraper.pmthv.PmtHvScraper
  2. Scraper.adtemp.AdTempScraper
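
For orientation, a minimal non-accumulating propagate simply maps attributes of the latest source instance onto .spec attribute names, along the lines of this sketch (all names illustrative):

def propagate(self, sv):
    inst = sv[-1]                                   ## latest source instance
    yield { 'Temp_PT1' : inst.DBNS_SAB_Temp_PT1,    ## dict keys must match .spec attribute names
            'Temp_PT2' : inst.DBNS_SAB_Temp_PT2 }   ## (all names here are illustrative only)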

Generic Aggregation

Aggregation is configured via config keys beginning with aggregate. Presence of a non-empty aggregate key switches on an extra aggregation query, performed at every sample immediately after the normal entry query. The aggregate key must specify a comma delimited list naming MySQL aggregate/group-by functions:

aggregate = avg,min,max,std
aggregate_count = count
aggregate_skips = id,date_time
aggregate_filter = Quality != 0

Meanings of the settings:

setting            notes
aggregate          comma delimited list of MySQL aggregation functions
aggregate_count    name of attribute that holds the sample count, default count
aggregate_skips    comma delimited attributes to skip aggregating, default None
aggregate_filter   SQL where clause applied in addition to time filtering, default None

Note

Most MySQL GROUP BY aggregate functions do not work well with times; if that is a major problem workarounds could be developed

The functions are called for every source attribute within a single query that is performed on the source DB after the simple row query. The results are provided in the aggd dict with keys such as DBNS_SAB_Temp_PT1_avg, DBNS_SAB_Temp_PT1_min etc..

The aggregation query is over all source DB entries within a timerange that excludes the time of the last instance:

sv[0].date_time <= t < sv[-1].date_time

The aggd dicts are available as sv[0].aggd and sv[-1].aggd within the changed and propagate methods, but the existence of sv[0].aggd should be checked as it will not be available at startup:

aggz = sv[0].aggd
if aggz:
    for k,v in aggz.items():
        print k,v
else:
    print "no aggz at startup"

aggd = sv[-1].aggd
assert aggd, "missing aggd - this should always be present"
for k,v in aggd.items():
    print k,v
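
When aggregation is configured, propagate will typically build the target dict from the aggregated values rather than from the raw instance, for example along these lines (target attribute names are illustrative only):

def propagate(self, sv):
    aggd = sv[-1].aggd
    ## keys such as DBNS_SAB_Temp_PT1_avg come from the aggregation query;
    ## the target attribute names on the left are illustrative only
    yield { 'Temp_PT1_avg' : aggd['DBNS_SAB_Temp_PT1_avg'],
            'Temp_PT1_min' : aggd['DBNS_SAB_Temp_PT1_min'],
            'Temp_PT1_max' : aggd['DBNS_SAB_Temp_PT1_max'] }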

When using the docs virtual python (see Build Instructions for Sphinx based documentation) the aggregate can be dumped with print str(aggd) as an rst table like:

att [2]               avg                max                  min                  std
DBNS_SAB_Temp_PT1     48.500000          49.00                48.00                0.5
DBNS_SAB_Temp_PT2     28.500000          29.00                28.00                0.5
DBNS_SAB_Temp_PT3     38.500000          39.00                38.00                0.5
DBNS_SAB_Temp_PT4     48.500000          49.00                48.00                0.5
DBNS_SAB_Temp_PT5     58.500000          59.00                58.00                0.5
date_time             2.01102010008e+13  2011-02-01 00:08:10  2011-02-01 00:08:00  5.0
id                    48.5000            49                   48                   0.5

Error Handling

Possible error cases that must be handled:

  • aggregation query may yield zero samples, resulting in the configured aggregate_count value coming back as zero and all aggregates being None

    • occurs when the configured aggregate_filter (typically a source quality requirement) results in no entries

    • most likely to occur on the first sample after a propagation

    • having a very short interval compared to the source heartbeat will make this more likely to occur

    • if not trapped the scraper will crash when attempting to coerce None into the float/int attributes of the DybDbi instance, eg:

      File "/home/dwebber/NuWa/NuWa-trunk/dybgaudi/InstallArea/python/DybDbi/wrap.py", line 98, in _create
        Set( instance, v )
      TypeError: void GDcsAdLidSensor::SetTemp_LS_avg(double Temp_LS_avg) => could not convert argument 1 (a float is required)
      
  • options for handling this are under consideration (an interim workaround is sketched below)

    • replace the None with an aggregate_none configured value
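
Until such a config option exists, one interim workaround is to substitute a default for any None aggregate inside propagate before the DybDbi instance is created; a sketch with an illustrative attribute name and sentinel value:

def propagate(self, sv):
    aggd = sv[-1].aggd
    def _float(key, default=-1.):            ## the default sentinel is an illustrative choice
        v = aggd.get(key)
        return float(v) if v is not None else default
    yield { 'Temp_LS_avg' : _float('Temp_LS_avg') }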

Configuring Scrapers

Understanding Scraper Operation

Scrapers distinguish between two notions of time:

  1. actual time ticking by, at which actual DB queries are made
  2. DB time populated by date_time stamps on DB entries

This allows the scraper to catch up on being restarted after a hiatus without substantially impacting the resulting scraped target table.

Ascii art, each line corresponding to a sample:

tc0      tc1
|         |
|1        |
|1 2      |
|1 . 3    |    propagation can be triggered for any
|1 . . 4  |    of these if sufficient change in value or date_time
|1 . . . 5|
|         |6
|         |6 7
|         |6 . 8
|         |6 . . 9
|         |6 . . . a
|         |6 . . . . b
|         |6 . . . . . c

The scraper's region of interest in DB time is controlled by:

  • the time cursor tcursor
  • date_time of last collected instance in the source vector

The first entry beyond the tcursor and beyond any already collected entries is read. In this way the scraper is always looking for the next entry. Following a propagation the tcursor is moved ahead to the time of the last propagated entry plus the interval.
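
For example, with interval = 10s a propagation triggered by an entry at 00:02:10 moves the tcursor to 00:02:20; as simple arithmetic:

from datetime import datetime, timedelta

interval        = timedelta(seconds=10)             ## interval = 10s
last_propagated = datetime(2011, 2, 1, 0, 2, 10)    ## date_time of the last propagated entry
tcursor         = last_propagated + interval        ## sampling resumes beyond 00:02:20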

Note

to avoid excessive querying, scraper parameters must be tuned, see Configuration Tuning

Sampling activity in actual time is controlled by:

  • sleep config setting, initial value for sleep that is subsequently tuned depending on lag
  • heartbeat config setting, guidance to scraper regards source updating period : used to constrain other parameters in sleep tuning
  • offset config setting, allows skipping of source entries : default of zero reads all entries, set to 10 to only read every 10th

Propagation is controlled by:

  • value changes between source vector instances, optionally parameterized by the threshold config setting
  • the maxage config setting compared to the difference in date_time between source vector instances

Features of this approach:

  1. reproducible re-running of the scraper (target entries made should depend on the source not details of scraper running)
  2. allows the scraper to catch up with missed entries after a hiatus
  3. realistic testing

The heartbeat setting should correspond approximately to the actual source table updating period. (The sleep setting should initially be the same as the heartbeat; it is subsequently tuned depending on detected lag behind the source.)

See the Scraper Logging section below for how this works in practice.

Note

some of the config parameters can probably be merged/eliminated; however while development of scrapers is ongoing retaining flexibility is useful

Catchup and Sleep Auto-Tuning

Relevant config parameters:

parameter      notes
tunesleepmod   how often to tune the sleep time, default of 1 tunes after every propagation, 10 for every 10th etc..
interval       quantum of DB time, controls tcursor step after propagation
offset         integer sampling tone down, default of 0 samples almost all source entries. 10 would sample only every 10th entry.
heartbeat      guidance regarding the raw source tables updating period (without regard for any offset) used to control sleep tuning
timefloor      time prior to which the scraper has no interest

A restarted scraper's first actions include checking the DBI validity table in the target DB to determine the target last validity, a DBI Validity Record from which the tcursor of the prior run of the scraper can be recovered. Hence the scraper determines where it got up to and resumes from that point.

Following some propagations the scraper queries to determine the date_time of the last entry in the source table. Comparing this with the tcursor yields a lag for each source, Scraper.base.sourcevector.SourceVector.lag(), from which the maximum lag over all sources, Scraper.base.Scraper.maxlag(), is obtained.

The extent of this maximum lag time is translated into units of the effective heartbeat (namely heartbeat*(offset+1)). This number of effective heartbeats lag is used within Scraper.base.Scraper.tunesleep() to adjust the sleep time. This algorithm is currently very primitive; it may need to be informed by real world operational experience.
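
As a concrete illustration of the arithmetic described above (numbers are only an example):

heartbeat = 3                                         ## heartbeat = 3s
offset    = 0                                         ## offset = 0
effective_heartbeat = heartbeat * (offset + 1)        ## 3s
maxlag_seconds      = 9 * 60                          ## eg the scraper finds itself 9 minutes behind
lag_in_heartbeats   = maxlag_seconds // effective_heartbeat   ## 180, input to the sleep tuning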

Configuration Mechanics

All scrapers are configured from a single config file, which is arranged into sections for each scraper/faker. The path of the config file can be controlled by SCRAPER_CFG, the default value:

echo $SCRAPER_CFG      ## path of default config file
    -->  $SITEROOT/dybgaudi/Database/Scraper/python/Scraper/.scraper.cfg
    -->  $SCRAPERROOT/python/Scraper/.scraper.cfg

Generality of scraper frontends is achieved by including a specification of the Regime subclass with the configuration, for example an extract from the adtemp_scraper section:

[adtemp_scraper]

regime = Scraper.adtemp:AdTemp
kls = GDcsAdTemp
mode = scraper
source = fake_dcs
target = tmp_offline_db

interval = 10s
sleep = 3s
heartbeat = 3s
offset = 0
maxage = 10m
threshold = 1.0
maxiter = 0

dbi_loglevel = INFO

Settings defining what and where:

setting   notes
regime    python dotted import path and Regime subclass name
kls       target DybDbi class name
mode      must be scraper for a scraper, faker for a Faker
source    name of dbconf section in ~/.my.cnf, pointing to the origin DB, typically fake_dcs
target    name of dbconf section in ~/.my.cnf, pointing to the DBI database, typically tmp_offline_db while testing

Settings impacting how and when:

setting     notes
interval    DB time quantum, minimum sampling step size (DB time)
heartbeat   approximate source table updating period, used by sleep tuning
offset      integer specifying source table offsets. The default of 0 reads all source entries, 1 for every other, 10 for every 10th, etc. This is the best setting to increase to reduce excessive sampling.
sleep       initial period to sleep in the scraper loop (subsequently auto-tuned depending on lag behind the source)
maxage      maximum period after which entries are propagated even if unchanged (DB time)
threshold   optional parameter accessed within scrapers via self.threshold, typically used within the def changed() method
maxiter     number of iterations to scrape, usually 0 for no limit

Time durations such as interval, sleep and maxage are all expressed with strings such as 10s, 1m or 1h representing periods in seconds, minutes or hours.

Other configuration settings for scrapers:

## time before which the scraper is not interested,  used to limit expense of lastvld query at startup
timefloor      = 2010-01-01 00:00:00

## see below section on seeding the target, seeding is not allowed when targeting offline_db
seed_target_tables = True
seed_timestart = 2011-02-01 00:00:00
seed_timeend   = 2011-02-01 00:01:00

See Scraper.base.main() for further details on configuration.

Configuration Tuning

Consider scraping wildly varying source quantities that always lead to a propagation: the 1st entry beyond the tcursor would immediately be propagated and the tcursor moved ahead to the time of the last propagated entry plus the interval, leading to another propagation of the 1st entry beyond the new tcursor.

In this situation:

  • offset could be increased to avoid sampling all source entries
  • interval must be tuned to achieve desired propagation frequency/value tracking

Alternatively consider almost constant source quantities that never cause def changed() to return True. In this case samples are dutifully made of entries beyond the tcursor until the time difference between the first and last exceeds the maxage, at which point a propagation results and the tcursor is moved forwards to the time of the last propagated entry plus the interval.

In this situation:

  • maxage dominates what is scraped
  • offset should be increased to avoid pointless unchanged sampling within the maxage period

Note

setting offset only impacts the raw querying; it does not influence the aggregate query, which aggregates over all source entries within the time range defined by the raw queries.

Testing Scraper Operation

Test Scraper With Faker

Fakers exist in order to allow testing of Scrapers via fully controlled population of a fake source DB, typically fake_dcs. At each faker iteration an instance of each source class (an SQLAlchemy dynamic class) is created and passed to the faker's fake method, for example:

class AdTempFaker(Faker):
    def fake(self, inst, id, dt):
        """
        Invoked from base class, sets source instance attributes to form a fake

        :param inst: source instance
        :param id: id to assign to the instance
        :param dt: datetime to assign to the instance, defaults to now
        """
        if dt is None:
            dt = datetime.now()        ## requires: from datetime import datetime
        for k,v in inst.asdict.items():
            if k == 'id':
                setattr( inst, k, id )
            elif k == 'date_time':
                setattr( inst, k, dt )
            else:
                setattr( inst, k, float(id%10) )    ## silly example of setting attribute values based on modulo of ``id``

This allows the attributes of the fake instance to be set as desired. It is necessary to set the id and date_time attributes as shown to mimic the expected source DB behaviour.

Faker configuration

Fakers are configured similarly to scrapers. An example configuration:

[adtemp_faker]

regime = Scraper.adtemp:AdTemp
mode = faker

source = fake_dcs
faker_dropsrc = True

timeformat = %Y-%m-%d %H:%M:%S
faker_timestart = 2011-02-01 00:00:00
profile = modulo_ramp
interval = 10s
sleep = 3s
maxiter = 0

Warning

When running in mode = faker, the faker_dropsrc = True setting wipes the DB pointed to by source = fake_dcs

The faker_dropsrc=True key causes the fake source DB to be dropped and then recreated from a mysql dump file ~/fake_dcs.sql that must reside in $HOME. This dropping and reloading is done at each start of the faker.

Preparing Target DB for testing

The database specified in the target config parameter of scrapers must exist and be accessible to the scraper identity, as defined in ~/.my.cnf. Create the target DB and grant permissions with:

mysql> create database offline_db_dummy ;
mysql> grant select,insert,create,drop,lock tables,delete on offline_db_dummy.* to 'dayabay'@'%' identified by '3items' ;

Privileges are needed for DBI operations used by the Scraper:

priv          first fail without it
lock tables   locks around updating LOCALSEQNO
insert        inserting ('*',0) into LOCALSEQNO
delete        LASTUSEDSEQNO updating deletes then inserts

Seeding Target Database

Scraping requires an entry in the target DB table in order to discern where in time the scraping is up to. When testing into an empty DB/table a seed entry needs to be planted using DybDbi for each source table. This can be done using config settings like:

seed_target_tables = True
seed_timestart = 2011-02-01 00:00:00
seed_timeend   = 2011-02-01 00:01:00

These settings are used together with implementing the def seed(src): method in the scraper to return a dict of attributes appropriate to the genDbi target class. If the target has many attributes, a programmatic approach can be used, eg starting from:

In [1]: from DybDbi import GDcsAdLidSensor as kls

In [2]: kls.SpecKeys().aslist()
Out[2]:
 ['PhysAdId',
  'Ultrasonic_GdLS',
  'Ultrasonic_GdLS_SD',
  'Ultrasonic_GdLS_min',
  'Ultrasonic_GdLS_max',
  'Ultrasonic_LS',
  'Ultrasonic_LS_SD',
  'Ultrasonic_LS_min',
  'Ultrasonic_LS_max',
   ...
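
A programmatic seed implementation could then, for instance, zero every attribute; a sketch only, the exact seed signature should be checked against the framework:

def seed(self, src):
    ## sketch: return a dict with every .spec attribute of the target class zeroed
    kls  = self.target.kls
    keys = kls.SpecKeys().aslist()
    return dict(zip(keys, [0]*len(keys)))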

Scraper Logging

The bulk of the output comes from the smry method of Scraper.base.sourcevector, which displays the id and date_time of the source instances held by the SourceVector as well as the time cursor of the source vector, which corresponds to the time of the last propagation. An extract from a scraper log, showing the startup:

INFO:Scraper.base.scraper:timecursor(local) {'subsite': 1, 'sitemask': 32} Tue Feb  1 00:01:00 2011
INFO:Scraper.base.sourcevector:SV 1   (6,)     2011-02-01 00:01:00 partial    notfull    (00:01:00 ++)  ==>
INFO:Scraper.base.sourcevector:SV 2   (6, 7)   2011-02-01 00:01:00 full       unchanged  (00:01:00 00:01:10)  ==>
INFO:Scraper.base.sourcevector:SV 3   (6, 8)   2011-02-01 00:01:00 full       unchanged  (00:01:00 00:01:20)  ==>
INFO:Scraper.base.sourcevector:SV 4   (6, 9)   2011-02-01 00:01:00 full       unchanged  (00:01:00 00:01:30)  ==>
INFO:Scraper.base.sourcevector:SV 5   (6, 10)  2011-02-01 00:01:00 full       unchanged  (00:01:00 00:01:40)  ==>
INFO:Scraper.base.sourcevector:SV 6   (6, 11)  2011-02-01 00:01:00 full       unchanged  (00:01:00 00:01:50)  ==>
INFO:Scraper.base.sourcevector:SV 7   (6, 12)  2011-02-01 00:01:00 full       unchanged  (00:01:00 00:02:00)  ==>
INFO:Scraper.base.sourcevector:SV 8   (6, 13)  2011-02-01 00:01:00 full       overage    (00:01:00 00:02:10)  ==> PROCEED
Warning in <TClass::TClass>: no dictionary for class DbiWriter<GDcsPmtHv> is available
Proceeding despite Non-unique versionDate: 2011-01-31 16:01:00 collides with that of SEQNO: 2 for table DcsPmtHv within sitemask/simmask  0/0 (i hope you know what you are doing )
INFO:Scraper.base.scraper: 0 tune detects maxlag 9 minutes behind namely 59 intervals ... sleep 0:00:01 s
INFO:Scraper.base.sourcevector:SV 9   (13, 14) 2011-02-01 00:02:20 full       unchanged  (00:02:10 00:02:20)  ==>
INFO:Scraper.base.sourcevector:SV 10  (13, 15) 2011-02-01 00:02:20 full       unchanged  (00:02:10 00:02:30)  ==>
INFO:Scraper.base.sourcevector:SV 11  (13, 16) 2011-02-01 00:02:20 full       unchanged  (00:02:10 00:02:40)  ==>
INFO:Scraper.base.sourcevector:SV 12  (13, 17) 2011-02-01 00:02:20 full       unchanged  (00:02:10 00:02:50)  ==>
INFO:Scraper.base.sourcevector:SV 13  (13, 18) 2011-02-01 00:02:20 full       unchanged  (00:02:10 00:03:00)  ==>
INFO:Scraper.base.sourcevector:SV 14  (13, 19) 2011-02-01 00:02:20 full       unchanged  (00:02:10 00:03:10)  ==>
INFO:Scraper.base.sourcevector:SV 15  (13, 20) 2011-02-01 00:02:20 full       overage    (00:02:10 00:03:20)  ==> PROCEED
INFO:Scraper.base.scraper: 1 tune detects maxlag 8 minutes behind namely 52 intervals ... sleep 0:00:01 s
INFO:Scraper.base.sourcevector:SV 16  (20, 21) 2011-02-01 00:03:30 full       unchanged  (00:03:20 00:03:30)  ==>
INFO:Scraper.base.sourcevector:SV 17  (20, 22) 2011-02-01 00:03:30 full       unchanged  (00:03:20 00:03:40)  ==>
INFO:Scraper.base.sourcevector:SV 18  (20, 23) 2011-02-01 00:03:30 full       unchanged  (00:03:20 00:03:50)  ==>
INFO:Scraper.base.sourcevector:SV 19  (20, 24) 2011-02-01 00:03:30 full       unchanged  (00:03:20 00:04:00)  ==>
INFO:Scraper.base.sourcevector:SV 20  (20, 25) 2011-02-01 00:03:30 full       unchanged  (00:03:20 00:04:10)  ==>
INFO:Scraper.base.sourcevector:SV 21  (20, 26) 2011-02-01 00:03:30 full       unchanged  (00:03:20 00:04:20)  ==>
INFO:Scraper.base.sourcevector:SV 22  (20, 27) 2011-02-01 00:03:30 full       overage    (00:03:20 00:04:30)  ==> PROCEED
INFO:Scraper.base.scraper: 2 tune detects maxlag 7 minutes behind namely 45 intervals ... sleep 0:00:01 s
INFO:Scraper.base.sourcevector:SV 23  (27, 28) 2011-02-01 00:04:40 full       unchanged  (00:04:30 00:04:40)  ==>
INFO:Scraper.base.sourcevector:SV 24  (27, 29) 2011-02-01 00:04:40 full       unchanged  (00:04:30 00:04:50)  ==>

This is with config settings:

interval = 10s
sleep = 3s
maxage = 1m
threshold = 1.0
maxiter = 0
task = 0

Note

while testing it is convenient to sample/propagate far faster than would be appropriate in production

Points to note:

  1. initially the source vector contains only one sample and is marked partial/notfull; there is no possibility of propagation
  2. at the 2nd sample (10s later in DB time, not necessarily the same in real time) the source vector becomes full/unchanged and the source ids held are (6,7) at times (00:01:00 00:01:10)
  3. for the 3rd to 7th samples the sv[0] stays the same but sv[-1] gets replaced by new sampled instances
  4. at the 8th sample a sufficient change between sv[0] and sv[-1] is detected (in this example due to maxage = 1m being exceeded) leading to a PROCEED which indicates a propagation into the target DB
  5. at the 9th sample, the sv[0] is replaced by the former sv[-1] which led to the propagation; correspondingly note the change in ids held to (13,14) and times (00:02:10 00:02:20)

In this case propagation, as marked by the PROCEED, is occurring due to overage arising from the config. If aggregation were configured in this example the aggregation would have been performed:

  1. at 2nd sample for all entries between (00:01:00 00:01:10)
  2. for the 3rd to 7th samples for all entries between (00:01:00 00:01:20) and so on
  3. at the 8th sample the aggregation is between (00:01:00 00:02:10) which would have then been propagated
  4. at the 9th sample the aggregation is between (00:02:10 00:02:20) with starting point corresponding to the former endpoint

Continuous running under supervisord

Initial Setup

Prepare example supervisord config file using -S option:

./dybinst -S /tmp/adtemp_scraper.ini trunk scrape adtemp_scraper
sudo cp /tmp/adtemp_scraper.ini /etc/conf/

Prepare the configs for all named sections of the file using the special cased ALL argument:

mkdir /tmp/scr
./dybinst -S /tmp/scr trunk scrape ALL
sv- ; sudo cp /tmp/scr/*.ini $(sv-confdir)/          ## when using option sv- bash functions

NB the location to place the supervisord .ini depends on details of the supervisord installation and in particular settings in supervisord.conf; for background see http://supervisord.org/configuration.html. The config simply specifies details of how to run the command, and can define the expected exit return codes that allow auto-restarting. For example:

[program:adtemp_scraper]
environment=HOME='/home/scraper',SCRAPER_CFG='/home/scraper/adtemp_scraper_production.cfg'
directory=/data1/env/local/dyb
command=/data1/env/local/dyb/dybinst trunk scrape adtemp_scraper
redirect_stderr=true
redirect_stdout=true
autostart=true
autorestart=true
priority=999
user=blyth

Note:

  1. program name adtemp_scraper, which is used in supervisorctl commands to control and examine the process.

  2. environment setting pointing the production scraper to read config from a separate file:

    ``environment=HOME='/home/scraper',SCRAPER_CFG='/home/scraper/adtemp_scraper_production.cfg'``
    

NB the single quotes, which are a workaround for svenvparsebug needed in some supervisord versions.

Supervisorctl CLI

Start the process from supervisorctl command line as shown:

[blyth@belle7 conf]$ sv                   ## OR supervisorctl if not using sv- bash functions
dybslv                           RUNNING    pid 2990, uptime 5:39:59
hgweb                            RUNNING    pid 2992, uptime 5:39:59
mysql                            RUNNING    pid 2993, uptime 5:39:59
nginx                            RUNNING    pid 2991, uptime 5:39:59

N> help

default commands (type help <topic>):
=====================================
add    clear  fg        open  quit    remove  restart   start   stop  update
avail  exit   maintail  pid   reload  reread  shutdown  status  tail  version


N> reread
adtemp_faker: available
adtemp_scraper: available
pmthv_faker: available
pmthv_scraper: available

N> avail
adtemp_faker                     avail     auto      999:999
adtemp_scraper                   avail     auto      999:999
dybslv                           in use    auto      999:999
hgweb                            in use    auto      999:999
mysql                            in use    auto      999:999
nginx                            in use    auto      999:999
pmthv_faker                      avail     auto      999:999
pmthv_scraper                    avail     auto      999:999

N> add adtemp_faker
adtemp_faker: added process group

N> status
adtemp_faker                     STARTING
dybslv                           RUNNING    pid 2990, uptime 5:41:46
hgweb                            RUNNING    pid 2992, uptime 5:41:46
mysql                            RUNNING    pid 2993, uptime 5:41:46
nginx                            RUNNING    pid 2991, uptime 5:41:46

N> status
adtemp_faker                     RUNNING    pid 22822, uptime 0:00:01
dybslv                           RUNNING    pid 2990, uptime 5:41:50
hgweb                            RUNNING    pid 2992, uptime 5:41:50
mysql                            RUNNING    pid 2993, uptime 5:41:50
nginx                            RUNNING    pid 2991, uptime 5:41:50

Subsequently you can start/stop/restart/tail in the normal manner. Following changes to the supervisord configuration, such as environment changes, using just start for a stopped process does not pick up the changed config. Ensure changes are picked up by using remove, reread and add, which typically also starts the child process.

Steps to Deployment

Recommended steps towards scraper deployment:

  1. set up a faker to write into fake_dcs in one process while the corresponding scraper runs in another process (fake_dcs -> tmp_offline_db), as described above in Testing Scraper Operation; this allows testing of:

    1. live running
    2. catchup : by stop/start of the scraper
    3. scraper parameter tuning
  2. test from real dcs -> tmp_offline_db

    1. make sure you have readonly permissions in the DBCONF “dcs” source section first!

    2. get supervisord set up, see Continuous running under supervisord, to allow long term running over multiple days

    3. check that the scraper can run continuously:

      1. look for sustainability (eg avoid dumping huge logs)
      2. check responses to expected problems (eg network outages); possibly the supervisord config can be adjusted to auto-restart scrapers

Development Tips

Obtain mysqldump of DCS DB to populate fake source DB

Dumping just the table creation commands from the replicated DCS DB into file ~/fake_dcs.sql (password read from a file):

mysqldump --no-defaults --no-data --lock-tables=false --host=202.122.37.89 --user=dayabay --password=$(cat ~/.dybpass) dybdcsdb > ~/fake_dcs.sql

Note:

  1. the --no-data option must be used to avoid creation of unusably large dump files
  2. --lock-tables=false is typically needed to avoid permission failures

Single table mysqldump for averager testing

Averager testing requires a large dataset, so rather than add batch capability to the faker to generate this it is simpler and more realistic to just dump real tables from the replicated DCS DB. For example:

time mysqldump --no-defaults  --lock-tables=false --host=202.122.37.89 --user=dayabay --password=$(cat ~/.dybpass) dybdcsdb AD1_LidSensor > ~/AD1_LidSensor.sql
  ## 27 min yielded 207MB of truncated dump up to 1420760,'2011-02-19 10:19:10'

time mysqldump --no-defaults  --lock-tables=false --host=202.122.37.89 --user=dayabay --password=$(cat ~/.dybpass) --where="id%1000=0" dybdcsdb AD1_LidSensor > ~/AD1_LidSensor_1000.sql
  ## cut the dump down to size with where clause :   10 seconds, 2.1M, full range

time mysqldump --no-defaults  --lock-tables=false --host=202.122.37.89 --user=dayabay --password=$(cat ~/.dybpass) --where="id%100=0" dybdcsdb AD1_LidSensor > ~/AD1_LidSensor_100.sql
  ## 84 seconds, 21M, full range

time mysqldump --no-defaults  --lock-tables=false --host=202.122.37.89 --user=dayabay --password=$(cat ~/.dybpass) --where="id%100=0" dybdcsdb AD2_LidSensor > ~/AD2_LidSensor_100.sql

time mysqldump --no-defaults  --lock-tables=false --host=202.122.37.89 --user=dayabay --password=$(cat ~/.dybpass) --where="id%10=0" dybdcsdb AD1_LidSensor > ~/AD1_LidSensor_10.sql
  ## 462 seconds, 203M, full range

Check progress of the dump with:

tail --bytes=200  ~/AD1_LidSensor_10.sql       ## use bytes option as very few newlines in mysqldumps

Replace any pre-existing fake_dcs.AD1_LidSensor table with:

cat ~/AD1_LidSensor_10.sql   | mysql fake_dcs
cat ~/AD1_LidSensor_100.sql  | mysql fake_dcs
cat ~/AD1_LidSensor_1000.sql | mysql fake_dcs

Check ranges in the table with group by year query:

echo "select count(*),min(id),max(id),min(date_time),max(date_time) from AD1_LidSensor group by year(date_time)" | mysql --no-defaults --host=202.122.37.89 --user=dayabay --password=$(cat ~/.dybpass) dybdcsdb

count(*)        min(id) max(id)    min(date_time)         max(date_time)
13697                 1   3685338  0000-00-00 00:00:00     0000-00-00 00:00:00
151             9941588  13544749  1970-01-01 08:00:00     1970-01-01 08:00:00
11032689             43  11046429  2011-01-10 10:34:28     2011-12-31 23:59:58
2508947        11046430  13555485  2012-01-01 00:00:00     2012-02-29 15:19:43

If seeding is used, the range of averaging will be artificially truncated. For rerunnable test averages over full range:

time ./scr.py -s adlid_averager --ALLOW_DROP_CREATE_TABLE --DROP_TARGET_TABLES
    ## full average of modulo 10  single AD1_LidSensor table : ~6m
    ## full average of modulo 100 single AD1_LidSensor table : ~4m35s

Append capable mysqldumps

The dumps created as described above have structure:

DROP TABLE IF EXISTS `AD1_LidSensor`;
..
CREATE TABLE `AD1_LidSensor` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `date_time` datetime NOT NULL,
   ...
);
LOCK TABLES `AD1_LidSensor` WRITE;
INSERT INTO `AD1_LidSensor` VALUES
(10,'0000-00-00 00:00:00',237,301,'18.77','18.95','-0.77','0.24','0.01','-0.44','-0.57','1.12',5,'19.00',10,'19.00',0,'19.38','18.00','0.82',NULL),
(20,'0000-00-00 00:00:00',237,302,'18.77','18.90','-0.77','0.24','0.02','-0.44','-0.57','1.12',5,'19.00',10,'19.00',0,'19.38','18.00','0.82',NULL),
...
(13558330,'2012-02-29 16:54:33',2277,2103,'22.30','22.42','-1.01','0.27','-0.28','-0.42','-0.75','1.24',255,'24.88',307,'24.75',311,'23.00','0.00','0.00',0);
UNLOCK TABLES;

Skip the DROP+CREATE with --no-create-info, restrict to new ids and pipe the dump directly into the dev DB to bring it up to date (modulo 100):

maxid=$(echo "select max(id) from AD1_LidSensor" | mysql --skip-column-names fake_dcs ) ; echo $maxid
time mysqldump --no-defaults  --no-create-info --lock-tables=false --host=202.122.37.89 --user=dayabay --password=$(cat ~/.dybpass) --where="id%100=0 and id>$maxid" dybdcsdb AD1_LidSensor | mysql fake_dcs

Test append running of averager:

time ./scr.py -s adlid_averager

Catches up with 2 bins:

INFO:Scraper.base.datetimebin: [0  ] ['Wed Feb 29 15:00:00 2012', 'Thu Mar  1 00:00:00 2012'] 9:00:00
INFO:Scraper.base.datetimebin: [1  ] ['Thu Mar  1 00:00:00 2012', 'Thu Mar  1 11:00:00 2012'] 11:00:00
INFO:Scraper.base.averager:looping over 2 territory bins performing grouped aggregate queries in each bin
INFO:Scraper.base.sourcevector:SV 1   (0, 1)               2012-02-29 15:00:00=>00:00:00 full       replay     (15:00:00 00:00:00) [9]  ==> PROCEED
INFO:Scraper.base.sourcevector:SV 2   (0, 1)               2012-03-01 00:00:00=>11:00:00 full       replay     (00:00:00 11:00:00) [11]  ==> PROCEED

Checking the target shows no seams:

echo "select * from DcsAdLidSensorVld where TIMESTART > DATE_SUB(UTC_TIMESTAMP(),INTERVAL 36 HOUR)" |  mysql tmp_offline_db

6515    2012-02-29 02:00:13     2012-02-29 02:56:53     1       1       1       0       -1      2012-02-29 02:00:13     2012-02-29 10:12:37
6516    2012-02-29 03:00:13     2012-02-29 03:56:53     1       1       1       0       -1      2012-02-29 03:00:13     2012-02-29 10:12:37
6517    2012-02-29 04:00:13     2012-02-29 04:56:53     1       1       1       0       -1      2012-02-29 04:00:13     2012-02-29 10:12:37
6518    2012-02-29 05:00:13     2012-02-29 05:56:53     1       1       1       0       -1      2012-02-29 05:00:13     2012-02-29 10:12:37
6519    2012-02-29 06:00:13     2012-02-29 06:56:53     1       1       1       0       -1      2012-02-29 06:00:13     2012-02-29 10:12:37
6520    2012-02-29 07:00:13     2012-02-29 07:56:53     1       1       1       0       -1      2012-02-29 07:00:13     2012-03-01 04:04:41
6521    2012-02-29 08:00:13     2012-02-29 08:56:53     1       1       1       0       -1      2012-02-29 08:00:13     2012-03-01 04:04:41
6522    2012-02-29 09:00:13     2012-02-29 09:56:57     1       1       1       0       -1      2012-02-29 09:00:13     2012-03-01 04:04:41
6523    2012-02-29 10:00:17     2012-02-29 10:56:57     1       1       1       0       -1      2012-02-29 10:00:17     2012-03-01 04:04:41
6524    2012-02-29 11:00:17     2012-02-29 11:56:57     1       1       1       0       -1      2012-02-29 11:00:17     2012-03-01 04:04:41
6525    2012-02-29 12:00:17     2012-02-29 12:56:57     1       1       1       0       -1      2012-02-29 12:00:17     2012-03-01 04:04:41

Start from scratch following schema changes to DCS

Drop pre-existing fake_dcs DB and recreate from the nodata mysqldump:

mysql> status                               ## verify connected to local development server
mysql> drop database if exists fake_dcs ;
mysql> create database fake_dcs ;
mysql> use fake_dcs
mysql> source ~/fake_dcs.sql                ## use nodata dump to duplicate table definitions
mysql> show tables

Warning

only use the below approach on local development server when confident of mysql config

A quick (and DANGEROUS) way of doing the above works because mysqldump defaults to including DROP TABLE IF EXISTS prior to CREATE TABLE, allowing the data to be emptied from all tables without having to drop/recreate the DB. CAUTION: this assumes that the client section of ~/.my.cnf points at the same server as the DB called fake_dcs

cat ~/fake_dcs.sql | mysql fake_dcs

Interactive SQLAlchemy Querying

Use NonDbi to pull up a session and dynamic SQLAlchemy class to query with ipython:

[blyth@belle7 Scraper]$ ipython
Python 2.7 (r27:82500, Feb 16 2011, 11:40:18)
Type "copyright", "credits" or "license" for more information.

IPython 0.9.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object'. ?object also works, ?? prints more.

In [1]: from NonDbi import session_

In [2]: session = session_("fake_dcs")   ## dbconf

In [3]: kls = session.kls_("DBNS_AD1_HV")   ## table name

In [4]: q = session.query(kls).order_by(kls.date_time)  ## does not touch DB yet

In [5]: q.count()         ## hits DB now
Out[5]: 74L

In [6]: q.first()         ## LIMIT 0, 1    same as q[0:1][0]   (maybe different errors if empty though)
Out[6]: <NonDbi.YDBNS_AD1_HV object at 0xa7a404c>

In [7]: q[70:74]          ## LIMIT 70,4
Out[7]:
[<NonDbi.YDBNS_AD1_HV object at 0xa7bea4c>,
 <NonDbi.YDBNS_AD1_HV object at 0xa7beaac>,
 <NonDbi.YDBNS_AD1_HV object at 0xa7bea8c>,
 <NonDbi.YDBNS_AD1_HV object at 0xa7beb2c>]

In [8]: q[70:75]         ## LIMIT 70,5
Out[8]:
[<NonDbi.YDBNS_AD1_HV object at 0xa7bea4c>,
 <NonDbi.YDBNS_AD1_HV object at 0xa7beaac>,
 <NonDbi.YDBNS_AD1_HV object at 0xa7bea8c>,
 <NonDbi.YDBNS_AD1_HV object at 0xa7beb2c>]

In [9]: q[74:75]         ## LIMIT 74,1
Out[9]: []

In [10]: q[73:74]        ## LIMIT 73,1
Out[10]: [<NonDbi.YDBNS_AD1_HV object at 0xa7beb2c>]