Bio::EnsEMBL::Hive::RunnableDB::JobFactory Class Reference


Class Summary

Synopsis

    standaloneJob.pl Bio::EnsEMBL::Hive::RunnableDB::JobFactory \
                    --inputcmd 'cd ${ENSEMBL_CVS_ROOT_DIR}/ensembl-hive/modules/Bio/EnsEMBL/Hive/RunnableDB; ls -1 *.pm' \
                    --input_id "{'meta_key'=>'module_name','meta_value'=>'#_0#'}" \
                    --flow_into "{ 2 => ['mysql://ensadmin:${ENSADMIN_PSW}@127.0.0.1:2912/lg4_compara_families_64/meta']}"

Description

This is a generic RunnableDB module for creating batches of similar jobs using the dataflow mechanism
(a fan of jobs is created in one branch and the funnel in another).
Make sure you wire this building block properly from outside.
You can supply as a parameter one of 4 sources of ids from which the batches will be generated:
    param('inputlist'):  The list is explicitly given in the parameters, can be abbreviated: 'inputlist' => ['a'..'z']
    param('inputfile'):  The list is contained in a file whose name is supplied as parameter: 'inputfile' => 'myfile.txt'
    param('inputquery'): The list is generated by an SQL query (against the production database by default): 'inputquery' => 'SELECT object_id FROM object WHERE x=y'
    param('inputcmd'):   The list is generated by running a system command: 'inputcmd' => 'find /tmp/big_directory -type f'
 

Definition at line 31 of file JobFactory.pm.

Available Methods

protected _fisher_yates_shuffle_in_place ()
protected _get_rows_from_list ()
protected _get_rows_from_open ()
protected _get_rows_from_query ()
protected _substitute_minibatched_rows ()
protected _substitute_rows ()
public Bio::EnsEMBL::Analysis analysis ()
public catch ()
public void check_if_exit_cleanly ()
public Bio::EnsEMBL::DBSQL::DBConnection data_dbc ()
public dataflow_output_id ()
public Bio::EnsEMBL::Hive::DBSQL::DBAdaptor db ()
public Bio::EnsEMBL::DBSQL::DBConnection dbc ()
public Int debug ()
public void deprecate ()
public DESTROY ()
public fetch_input ()
public go_figure_dbc ()
public void info ()
public input_id ()
public Bio::EnsEMBL::Hive::AnalysisJob input_job ()
public new ()
public Array output ()
public param ()
public param_defaults ()
public param_substitute ()
public parameters ()
public Bio::EnsEMBL::Hive::Queen queen ()
public run ()
public Arrayref runnable ()
public Array stack_trace ()
public String stack_trace_dump ()
public strict_hash_format ()
public void throw ()
public Depend try ()
public Int verbose ()
public warning ()
public worker ()
public worker_temp_directory ()
public write_output ()

Method Documentation

protected Bio::EnsEMBL::Hive::RunnableDB::JobFactory::_fisher_yates_shuffle_in_place ( )
    
    Description: a private function (not a method) that shuffles a list of ids
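
A minimal sketch of an in-place Fisher-Yates shuffle, assuming the function receives a single array reference. The name fisher_yates_shuffle_in_place and the sample ids are illustrative only; the module's own code may differ in detail.

```perl
use strict;
use warnings;

# Shuffle the referenced array in place (Fisher-Yates).
# A plain function, not a method: it takes only an array reference.
sub fisher_yates_shuffle_in_place {
    my ($array) = @_;

    # Walk from the last index down to 1, swapping each element
    # with a randomly chosen element at or before it.
    for (my $i = $#{$array}; $i > 0; $i--) {
        my $j = int(rand($i + 1));              # 0 <= $j <= $i
        @{$array}[$i, $j] = @{$array}[$j, $i];
    }
    return;
}

my @ids = (1 .. 10);
fisher_yates_shuffle_in_place(\@ids);
print join(',', @ids), "\n";    # the same ten ids, in random order
```

Shuffling in place avoids copying a potentially long id list before the jobs are created.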
 
protected Bio::EnsEMBL::Hive::RunnableDB::JobFactory::_get_rows_from_list ( )
    
    Description: a private method that ensures the list is 2D
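
One way to picture "ensuring the list is 2D" is the sketch below, which wraps every scalar in its own array reference so that each row becomes a list of columns. The function name rows_from_list is hypothetical and the per-element check is an assumption, not a copy of the module's code.

```perl
use strict;
use warnings;

# Wrap every scalar element in its own array reference so that each
# row is a list of columns; rows that are already array refs are kept.
sub rows_from_list {
    my ($list) = @_;
    return [ map { ref($_) eq 'ARRAY' ? $_ : [ $_ ] } @{$list} ];
}

my $rows = rows_from_list([ 'a', 'b', [ 'c', 3 ] ]);
print scalar(@{$rows}), " rows\n";    # prints "3 rows"
```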
 
protected Bio::EnsEMBL::Hive::RunnableDB::JobFactory::_get_rows_from_open ( )
    
    Description: a private method that loads ids from a given file or command pipe
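
The sketch below illustrates reading rows from an open handle and optionally splitting them on a delimiter. The name rows_from_handle is hypothetical, and treating the delimiter as a literal string (rather than a pattern) is an assumption made for this example.

```perl
use strict;
use warnings;

# Read every line from an open filehandle and return a list of rows.
# With a delimiter each line is split into columns; without one the
# whole line becomes a single-column row.
sub rows_from_handle {
    my ($fh, $delimiter) = @_;
    my @rows;
    while (my $line = <$fh>) {
        chomp $line;
        push @rows, defined($delimiter)
            ? [ split /\Q$delimiter\E/, $line ]
            : [ $line ];
    }
    return \@rows;
}

# An in-memory handle keeps the sketch runnable anywhere. A real file
# would be opened with open($fh, '<', $path) and a command pipe with
# open($fh, '-|', $command).
my $text = "1\tfirst\n2\tsecond\n";
open(my $fh, '<', \$text) or die "Cannot open in-memory handle: $!";
my $rows = rows_from_handle($fh, "\t");
close($fh);

print "$_->[0] => $_->[1]\n" for @{$rows};
```

Because a file and a command pipe both yield a filehandle, a single reader can serve the 'inputfile' and 'inputcmd' sources alike.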
 
protected Bio::EnsEMBL::Hive::RunnableDB::JobFactory::_get_rows_from_query ( )
    
    Description: a private method that loads ids from a given sql query
    param('db_conn'): An optional hash to pass in connection parameters to the database upon which the query will have to be run.
 
protected Bio::EnsEMBL::Hive::RunnableDB::JobFactory::_substitute_minibatched_rows ( )
    
    Description: a private method that minibatches a list and transforms every minibatch using param-substitution
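
A simplified sketch of minibatching follows: rows are cut into chunks of at most 'step' rows and each chunk is summarised by the first and last value of the key column. The function name and the hash keys start, end and count are hypothetical, and the module's own grouping rules may be stricter than plain chunking.

```perl
use strict;
use warnings;

# Cut the rows into chunks of at most $step rows and describe each
# chunk by the key-column value of its first and last row.
sub minibatch_rows {
    my ($rows, $step, $key_column) = @_;
    die "step must be a positive integer\n" unless $step >= 1;

    my @pending = @{$rows};    # copy, so the caller's list is untouched
    my @batches;
    while (@pending) {
        my @chunk = splice(@pending, 0, $step);
        push @batches, {
            start => $chunk[0][$key_column],
            end   => $chunk[-1][$key_column],
            count => scalar(@chunk),
        };
    }
    return \@batches;
}

my $batches = minibatch_rows([ map { [ $_ ] } 1 .. 7 ], 3, 0);
print "$_->{start}..$_->{end} ($_->{count} rows)\n" for @{$batches};
```

With seven rows and a step of 3 the last batch holds a single row, which is one way a batch can end up smaller than the requested size.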
 
protected Bio::EnsEMBL::Hive::RunnableDB::JobFactory::_substitute_rows ( )
    Description: a private method that goes through a list and transforms every row into a hash
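
As an illustration of turning rows into hashes, the sketch below fills a template hash by replacing '#_N#' placeholders (the form used in the Synopsis) with column N of each row. The function name substitute_rows and the sample rows are hypothetical; the module itself relies on its param-substitution machinery, which this example only imitates.

```perl
use strict;
use warnings;

# Build one hash per row by replacing every '#_N#' placeholder in the
# template values with column N of that row (empty string if absent).
sub substitute_rows {
    my ($rows, $template) = @_;
    my @hashes;
    for my $row (@{$rows}) {
        my %hash;
        for my $key (keys %{$template}) {
            (my $filled = $template->{$key})
                =~ s/#_(\d+)#/defined $row->[$1] ? $row->[$1] : ''/ge;
            $hash{$key} = $filled;
        }
        push @hashes, \%hash;
    }
    return \@hashes;
}

my $template = { 'meta_key' => 'module_name', 'meta_value' => '#_0#' };
my $ids      = substitute_rows([ ['JobFactory.pm'], ['SqlCmd.pm'] ], $template);

for my $id (@{$ids}) {
    print join(', ', map { "$_ => $id->{$_}" } sort keys %{$id}), "\n";
}
```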
 
public Bio::EnsEMBL::Analysis Bio::EnsEMBL::Hive::Process::analysis ( ) [inherited]
    Title   :  analysis
    Usage   :  $self->analysis;
    Function:  Returns the Analysis object associated with this
               instance of the Process.
    Returns :  Bio::EnsEMBL::Analysis object
 
public void Bio::EnsEMBL::Hive::Process::check_if_exit_cleanly ( ) [inherited]
    Title   :   check_if_exit_cleanly
    Usage   :   $self->check_if_exit_cleanly()
    Function:   Check if we want to exit or kill it cleanly at the
                runnable level
    Returns :   None
    Args    :   None
 
public Bio::EnsEMBL::DBSQL::DBConnection Bio::EnsEMBL::Hive::Process::data_dbc ( ) [inherited]
    Title   :   data_dbc
    Usage   :   my $data_dbc = $self->data_dbc;
    Function:   returns a Bio::EnsEMBL::DBSQL::DBConnection object (the "current" one by default, but can be set up otherwise)
    Returns :   Bio::EnsEMBL::DBSQL::DBConnection
 
public Bio::EnsEMBL::Hive::Process::dataflow_output_id ( ) [inherited]

Undocumented method

public Bio::EnsEMBL::Hive::DBSQL::DBAdaptor Bio::EnsEMBL::Hive::Process::db ( ) [inherited]
    Title   :   db
    Usage   :   my $hiveDBA = $self->db;
    Function:   returns DBAdaptor to Hive database
    Returns :   Bio::EnsEMBL::Hive::DBSQL::DBAdaptor
 
public Bio::EnsEMBL::DBSQL::DBConnection Bio::EnsEMBL::Hive::Process::dbc ( ) [inherited]
    Title   :   dbc
    Usage   :   my $hiveDBConnection = $self->dbc;
    Function:   returns DBConnection to Hive database
    Returns :   Bio::EnsEMBL::DBSQL::DBConnection
 
public Int Bio::EnsEMBL::Hive::Process::debug ( ) [inherited]
    Title   :  debug
    Function:  Gets/sets flag for debug level. Set through Worker/runWorker.pl
               Subclasses should treat as a read_only variable.
    Returns :  integer
 
public Bio::EnsEMBL::Hive::Process::DESTROY ( ) [inherited]
    Title   :  DESTROY
    Function:  subclass can implement functions related to cleanup and release.
               Typical activities include freeing data structures or
               closing files.
 
public Bio::EnsEMBL::Hive::Process::fetch_input ( ) [inherited]
    Title   :  fetch_input
    Function:  subclass can implement functions related to data fetching.
               Typical activities would be to parse $self->input_id and read
               configuration information from $self->analysis.  Subclasses
               may also want to fetch data from databases or from files 
               within this function.
 

Reimplemented in Bio::EnsEMBL::Hive::RunnableDB::Dummy, Bio::EnsEMBL::Hive::RunnableDB::FailureTest, Bio::EnsEMBL::Hive::RunnableDB::LongMult::AddTogether, Bio::EnsEMBL::Hive::RunnableDB::LongMult::PartMultiply, Bio::EnsEMBL::Hive::RunnableDB::LongMult::Start, Bio::EnsEMBL::Hive::RunnableDB::MySQLTransfer, Bio::EnsEMBL::Hive::RunnableDB::NotifyByEmail, Bio::EnsEMBL::Hive::RunnableDB::SqlCmd, and Bio::EnsEMBL::Hive::RunnableDB::SystemCmd.

public Bio::EnsEMBL::Hive::Process::go_figure_dbc ( ) [inherited]

Undocumented method

public Bio::EnsEMBL::Hive::Process::input_id ( ) [inherited]

Undocumented method

public Bio::EnsEMBL::Hive::AnalysisJob Bio::EnsEMBL::Hive::Process::input_job ( ) [inherited]
    Title   :  input_job
    Function:  Returns the AnalysisJob to be run by this process
               Subclasses should treat this as a read_only object.          
    Returns :  Bio::EnsEMBL::Hive::AnalysisJob object
 
public Bio::EnsEMBL::Hive::Process::new ( ) [inherited]

Undocumented method

public Array Bio::EnsEMBL::Hive::Process::output ( ) [inherited]
    Title   :   output
    Usage   :   $self->output()
    Function:   
    Returns :   Array of Bio::EnsEMBL::FeaturePair
    Args    :   None
 
public Bio::EnsEMBL::Hive::Process::param ( ) [inherited]

Undocumented method

public Bio::EnsEMBL::Hive::Process::param_defaults ( ) [inherited]
    Title   :  param_defaults
    Function:  subclass can define defaults for all params used by the RunnableDB/Process
 

Reimplemented in Bio::EnsEMBL::Hive::RunnableDB::FailureTest.

public Bio::EnsEMBL::Hive::Process::param_substitute ( ) [inherited]

Undocumented method

public Bio::EnsEMBL::Hive::Process::parameters ( ) [inherited]

Undocumented method

public Bio::EnsEMBL::Hive::Queen Bio::EnsEMBL::Hive::Process::queen ( ) [inherited]
    Title   :   queen
    Usage   :   my $queen = $self->queen;
    Function:   returns the 'Queen' this Process was created by
    Returns :   Bio::EnsEMBL::Hive::Queen
 
public Bio::EnsEMBL::Hive::RunnableDB::JobFactory::run ( )
    Description : Implements run() interface method of Bio::EnsEMBL::Hive::Process that is used to perform the main bulk of the job (minus input and output).
    param('column_names'):  Controls the column names that come out of the parser: 0 = "no names", 1 = "parse names from data", arrayref = "take names from this array"
    param('delimiter'): If set, each line in file/cmd mode is split into columns that you can use individually when constructing the template input_id hash.
    param('input_id'):  The template that will become the input_id of newly created jobs (Note: this is something entirely different from $self->input_id of the current JobFactory job).
                        Since the introduction of param('column_names') its significance has dropped, but it may still come in handy.
    param('randomize'): Shuffles the rows before creating jobs - can sometimes lead to better overall performance of the pipeline. Doesn't make any sense for minibatches (step>1).
    param('step'):      The requested size of the minibatch (1 by default). The real size of a range may be smaller than the requested size.
    param('key_column'): If every line of your input is a list (this happens, for example, when your SQL returns multiple columns or you have set the 'delimiter' in file/cmd mode),
                         this says which column is undergoing 'ranging'.
        # The following 4 parameters are mutually exclusive and define the source of ids for the jobs:
    param('inputlist'):  [param_substituted] The list is explicitly given in the parameters, can be abbreviated: 'inputlist' => ['a'..'z']
    param('inputfile'):  [param_substituted] The list is contained in a file whose name is supplied as parameter: 'inputfile' => 'myfile.txt'
    param('inputquery'): [param_substituted] The list is generated by an SQL query (against the production database by default): 'inputquery' => 'SELECT object_id FROM object WHERE x=y'
    param('inputcmd'):   [param_substituted] The list is generated by running a system command: 'inputcmd' => 'find /tmp/big_directory -type f'
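
To show how these parameters might be combined, here is a hedged sketch of an analysis definition written as a plain Perl hash in the style of an eHive PipeConfig entry. The logic names make_file_jobs and process_object, the column names, and the use of -flow_into to wire branch 2 are all assumptions chosen for illustration; check them against the pipeline configuration conventions of your eHive version.

```perl
use strict;
use warnings;

# Hypothetical analysis definition; every name below is illustrative.
my $analysis = {
    -logic_name => 'make_file_jobs',
    -module     => 'Bio::EnsEMBL::Hive::RunnableDB::JobFactory',
    -parameters => {
        'inputfile'    => 'myfile.txt',               # exactly one of the four id sources
        'delimiter'    => "\t",                       # split each line into columns
        'column_names' => [ 'object_id', 'label' ],   # take names from this array
        'randomize'    => 1,                          # shuffle rows; 'step' stays at its default of 1
    },
    -flow_into  => {
        2 => [ 'process_object' ],                    # the fan of jobs goes to this analysis
    },
};

print join(', ', sort keys %{ $analysis->{-parameters} }), "\n";
```

Only one id source is given because the four sources are mutually exclusive, and 'randomize' is left without 'step' because shuffling is described as pointless for minibatches.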
 

Reimplemented from Bio::EnsEMBL::Hive::Process.

public Arrayref Bio::EnsEMBL::Hive::Process::runnable ( ) [inherited]
    Title   :   runnable
    Usage   :   $self->runnable($arg)
    Function:   Sets a runnable for this RunnableDB
    Returns :   arrayref of Bio::EnsEMBL::Analysis::Runnable
    Args    :   Bio::EnsEMBL::Analysis::Runnable
 
public Bio::EnsEMBL::Hive::Process::strict_hash_format ( ) [inherited]
    Title   :  strict_hash_format
    Function:  if a subclass wants more flexibility in parsing job.input_id and analysis.parameters,
               it should redefine this method to return 0
 

Reimplemented in Bio::EnsEMBL::Hive::RunnableDB::Dummy, Bio::EnsEMBL::Hive::RunnableDB::SqlCmd, and Bio::EnsEMBL::Hive::RunnableDB::SystemCmd.

public Bio::EnsEMBL::Hive::Process::warning ( ) [inherited]

Undocumented method


Reimplemented from Bio::EnsEMBL::Utils::Exception.

public Bio::EnsEMBL::Hive::Process::worker ( ) [inherited]

Undocumented method

public Bio::EnsEMBL::Hive::Process::worker_temp_directory ( ) [inherited]
    Title   :  worker_temp_directory
    Function:  Returns a path to a directory on the local /tmp disk 
               which the subclass can use as temporary file space.
               This directory is made the first time the function is called.
               It persists for as long as the worker is alive.  This allows
               multiple jobs run by the worker to potentially share temp data.
               For example the worker (which is a single Analysis) might need
               to dump a data file which is needed by all jobs run through 
               this analysis.  The process can first check the worker_temp_directory
               for the file and dump it if it is missing.  This way the first job
               run by the worker will do the dump, but subsequent jobs can reuse the 
               file.
    Usage   :  $tmp_dir = $self->worker_temp_directory;
    Returns :  <string> path to a local (/tmp) directory
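
The check-then-dump pattern described above can be sketched as follows. To stay runnable outside a pipeline the example uses File::Temp's tempdir as a stand-in for $self->worker_temp_directory; the function name shared_data_file and the file name shared_data.txt are hypothetical.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

my $dump_count = 0;    # counts how often the expensive dump really runs

# Return the path of a shared data file inside $dir, creating the file
# only if an earlier job has not already done so.
sub shared_data_file {
    my ($dir) = @_;
    my $path = File::Spec->catfile($dir, 'shared_data.txt');
    unless (-e $path) {
        open(my $out, '>', $path) or die "Cannot write $path: $!";
        print {$out} "data that is expensive to produce\n";
        close($out) or die "Cannot close $path: $!";
        $dump_count++;
    }
    return $path;
}

# Stand-in for: my $tmp_dir = $self->worker_temp_directory;
my $tmp_dir = tempdir(CLEANUP => 1);

my $first  = shared_data_file($tmp_dir);    # first job: performs the dump
my $second = shared_data_file($tmp_dir);    # later job: reuses the file
print "dumps performed: $dump_count\n";     # prints "dumps performed: 1"
```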
 
public Bio::EnsEMBL::Hive::RunnableDB::JobFactory::write_output ( )
    Description : Implements write_output() interface method of Bio::EnsEMBL::Hive::Process that is used to deal with job's output after the execution.
                  Here we rely on the dataflow mechanism to create jobs.
    param('fan_branch_code'): defines the branch where the fan of jobs is created (2 by default).
 

Reimplemented from Bio::EnsEMBL::Hive::Process.


The documentation for this class was generated from the following file: