See also
- @collate in the Ruffus Manual
- Decorators for more decorators
@collate( input, filter, output, [extras,...] )
Purpose:
Use filter to identify common sets of inputs which are to be grouped or collated together:
Each set of inputs which generates identical output and extras using the formatter or regex (regular expression) filters is collated into one job.
This is a many to fewer operation.
Only out of date jobs (comparing input and output files) will be re-run.
- Example:
regex(r".+\.(.+)$"), "\1.summary" creates a separate summary file for each suffix:
    from ruffus import collate, regex

    animal_files = "a.fish", "b.fish", "c.mammals", "d.mammals"

    # summarise by file suffix:
    @collate(animal_files, regex(r".+\.(.+)$"), r'\1.summary')
    def summarize(infiles, summary_file):
        pass
output and optional extras parameters are passed to the functions after string substitution. Non-string values are passed through unchanged.
Each collate job consists of the input files which string substitution maps to identical output and extras.
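The grouping rule described above can be modelled in plain Python, without Ruffus. This is only a simplified sketch: the helper name collate_jobs is hypothetical, substitution is assumed to behave like re.sub, and inputs that do not match the pattern are assumed to be skipped.

```python
import re
from collections import defaultdict


def collate_jobs(filenames, pattern, output_template):
    """Group file names by the output name each one is substituted to.

    Hypothetical helper: a simplified model of collating, not Ruffus code.
    """
    compiled = re.compile(pattern)
    jobs = defaultdict(list)
    for name in filenames:
        if compiled.search(name) is None:
            continue  # assumption: non-matching inputs take no part in any job
        output = compiled.sub(output_template, name)
        jobs[output].append(name)  # identical output => same job
    return dict(jobs)


animal_files = ["a.fish", "b.fish", "c.mammals", "d.mammals"]
jobs = collate_jobs(animal_files, r".+\.(.+)$", r"\1.summary")
for output, inputs in jobs.items():
    print(inputs, "->", output)
```

Four inputs collapse into two groups because they map to only two distinct output names, which is the many to fewer behaviour.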
The summarize example above results in two jobs:
    ["a.fish", "b.fish" -> "fish.summary"]
    ["c.mammals", "d.mammals" -> "mammals.summary"]
Parameters:
- input = tasks_or_file_names
can be a:
- Task / list of tasks.
File names are taken from the output of the specified task(s)
- (Nested) list of file name strings (as in the example above).
- File names containing *[]? will be expanded as a glob.
E.g.:"a.*" => "a.1", "a.2"
- filter = matching_regex
is a Python regular expression string, which must be wrapped in a regex indicator object. See the Python regular expression (re) documentation for details of regular expression syntax.
- filter = matching_formatter
is a formatter indicator object, optionally containing a Python regular expression (re).
- output = output
Specifies the resulting output file name(s) after string substitution.
- extras = extras
Any extra parameters are passed to the task function after string substitution; non-string values are passed through unchanged.
If you are using named parameters, these can be passed as a list, i.e. extras=[...]
Any extra parameters are consumed by the task function and not forwarded further down the pipeline.
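As a rough illustration of the output and extras rules above, the following plain-Python sketch expands regular expression back-references in string parameters and leaves everything else alone. It does not use Ruffus: the helper name substitute_params and the sample extras are hypothetical, and walking nested lists and tuples is an assumption made for the sketch.

```python
import re


def substitute_params(param, match):
    """Expand back-references in strings; pass other values through unchanged.

    Hypothetical helper: a simplified model, not Ruffus code.
    """
    if isinstance(param, str):
        return match.expand(param)
    if isinstance(param, (list, tuple)):
        # assumption: nested containers are walked element by element
        return type(param)(substitute_params(item, match) for item in param)
    return param  # numbers, None, etc. are not substituted


match = re.search(r".+\.(.+)$", "a.fish")
output = substitute_params(r"\1.summary", match)
extras = substitute_params([r"\1", 42, (r"log_\1.txt", None)], match)
print(output)
print(extras)
```

Note the raw strings: without the r prefix, Python would read "\1" as an escape sequence rather than as a back-reference.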
Example2:
Suppose we had the following files:
    cows.mammals.animal
    horses.mammals.animal
    sheep.mammals.animal
    snake.reptile.animal
    lizard.reptile.animal
    crocodile.reptile.animal
    pufferfish.fish.animal
and we wanted to end up with three different resulting outputs:
    cows.mammals.animal
    horses.mammals.animal
    sheep.mammals.animal
        -> mammals.results

    snake.reptile.animal
    lizard.reptile.animal
    crocodile.reptile.animal
        -> reptile.results

    pufferfish.fish.animal
        -> fish.results
This is the @collate code required:
    from ruffus import collate, regex

    animals = [ "cows.mammals.animal",
                "horses.mammals.animal",
                "sheep.mammals.animal",
                "snake.reptile.animal",
                "lizard.reptile.animal",
                "crocodile.reptile.animal",
                "pufferfish.fish.animal"]

    @collate(animals, regex(r"(.+)\.(.+)\.animal"), r"\2.results")
    # \1 = species [cows, horses]
    # \2 = phylogenetics group [mammals, reptile, fish]
    def summarize_animals_into_groups(species_files, result_file):
        " ... more code here"
        pass
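To see what \1 and \2 capture for each file name, the regular expression can be tried on its own with the standard re module. This is a small check that does not involve Ruffus; the variable names captures and outputs are illustrative only.

```python
import re

pattern = re.compile(r"(.+)\.(.+)\.animal")

animals = ["cows.mammals.animal",
           "horses.mammals.animal",
           "sheep.mammals.animal",
           "snake.reptile.animal",
           "lizard.reptile.animal",
           "crocodile.reptile.animal",
           "pufferfish.fish.animal"]

# (\1, \2) for every file name: species and phylogenetics group
captures = {name: pattern.match(name).groups() for name in animals}

# distinct output names after substituting \2 into the output pattern
outputs = sorted({pattern.sub(r"\2.results", name) for name in animals})

for name, (species, group) in captures.items():
    print(name, "->", species, group)
print(outputs)
```

Only \2 appears in the output pattern, so files that differ in \1 alone still fall into the same group.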
See @merge for an alternative way to summarise files.