Configuration¶
Note
The configuration key nomenclature hasn’t been settled yet.
Note
The main lts_workflows documentation provides more information about general configuration settings.
Required configuration¶
The following options must be set in the configuration file:
settings:
  sampleinfo: sampleinfo.csv
  runfmt: "{SM}/{SM}_{PU}_{DT}"
  samplefmt: "{SM}/{SM}"
ngs.settings:
  db:
    ref: # Reference sequences
      - ref.fa
      - gfp.fa
      - ercc.fa
    transcripts:
      - ref-transcripts.fa
      - gfp.fa
      - ercc.fa
  annotation:
    sources:
      - ref-transcripts.gtf
      - gfp.genbank
      - ercc.gb
  # Optional; change these if read names and fastq file suffixes differ
  read1_label: "_1"
  read2_label: "_2"
  fastq_suffix: ".fastq.gz"
# list of sample identifiers corresponding to the sampleinfo 'SM'
# column
samples:
  - sample1
  - sample2
The configuration settings runfmt and samplefmt describe how your data is organized. They are Python format strings whose replacement fields correspond to columns in the sampleinfo file; hence, in this case, the columns SM, PU and DT must be present in the sampleinfo file.
Note
Since runfmt and samplefmt can represent any format you wish, you could in principle use any field names. The one exception is SM, which represents the sample name and must be present in the sampleinfo file. The two-letter labels above are convenient representations of metadata and correspond to samtools read group record types.
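As a minimal sketch of the relationship between the format strings and the sampleinfo columns, the snippet below extracts the field names from runfmt and samplefmt with the standard library and checks them against a list of column names. The helper name format_fields and the variable names are hypothetical and only serve as illustration; this is not how the workflow itself validates its input.

```python
from string import Formatter


def format_fields(fmt):
    """Return the set of field names used in a Python format string.

    Hypothetical helper for illustration; not part of the workflow package.
    """
    return {name for _, name, _, _ in Formatter().parse(fmt) if name}


runfmt = "{SM}/{SM}_{PU}_{DT}"
samplefmt = "{SM}/{SM}"
# Column names as they would appear in the sampleinfo header line
sampleinfo_columns = ["SM", "PU", "DT", "fastq"]

required = format_fields(runfmt) | format_fields(samplefmt)
missing = required - set(sampleinfo_columns)
if missing:
    raise ValueError(f"sampleinfo lacks columns: {sorted(missing)}")
```

With the settings shown above, the required fields are SM, PU and DT, and no column is missing.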
Example sampleinfo.csv¶
SM,PU,DT,fastq
s1,AAABBB11XX,010101,s1_AAABBB11XX_010101_1.fastq.gz
s1,AAABBB11XX,010101,s1_AAABBB11XX_010101_2.fastq.gz
s1,AAABBB22XX,020202,s1_AAABBB22XX_020202_1.fastq.gz
s1,AAABBB22XX,020202,s1_AAABBB22XX_020202_2.fastq.gz
s2,AAABBB11XX,010101,s2_AAABBB11XX_010101_1.fastq.gz
s2,AAABBB11XX,010101,s2_AAABBB11XX_010101_2.fastq.gz
The example sampleinfo file would work with the required settings above. The following runfmt and samplefmt would be generated for sample s2, read 1:
runfmt = s2/s2_AAABBB11XX_010101
samplefmt = s2/s2
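The expansion above can be reproduced with plain Python string formatting. The following sketch reads the example sampleinfo content with the csv module and applies both format strings to every row; it illustrates the formatting only, and how the workflow performs this step internally is an assumption not covered here.

```python
import csv
import io

# The example sampleinfo.csv from above, inlined to keep the sketch self-contained
SAMPLEINFO = """\
SM,PU,DT,fastq
s1,AAABBB11XX,010101,s1_AAABBB11XX_010101_1.fastq.gz
s1,AAABBB11XX,010101,s1_AAABBB11XX_010101_2.fastq.gz
s1,AAABBB22XX,020202,s1_AAABBB22XX_020202_1.fastq.gz
s1,AAABBB22XX,020202,s1_AAABBB22XX_020202_2.fastq.gz
s2,AAABBB11XX,010101,s2_AAABBB11XX_010101_1.fastq.gz
s2,AAABBB11XX,010101,s2_AAABBB11XX_010101_2.fastq.gz
"""

runfmt = "{SM}/{SM}_{PU}_{DT}"
samplefmt = "{SM}/{SM}"

rows = list(csv.DictReader(io.StringIO(SAMPLEINFO)))

# Each row is a dict keyed by column name; unused keys (fastq) are ignored
run_prefixes = sorted({runfmt.format(**row) for row in rows})
sample_prefixes = sorted({samplefmt.format(**row) for row in rows})

for prefix in run_prefixes:
    print(prefix)
```

Both reads of a pair share the same run prefix, so the six rows collapse to three run prefixes and two sample prefixes.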
Workflow specific configuration¶
In addition to the required configuration, there are some configuration settings that affect the workflow itself. These settings are accessed and set via config['workflow'].
- use_multimapped
- (boolean) Use multimapped reads for quantification. Default false.
- quantification
- (list) List of quantification methods to use. Available options are rsem and rpkmforgenes.
Example workflow configuration section¶
workflow:
  use_multimapped: false
  quantification:
    - rsem
    - rpkmforgenes
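For illustration, a workflow section like the one above could be sanity-checked before a run with a few lines of Python. The function check_workflow_config is a hypothetical helper, not part of the package. The default of false for use_multimapped comes from the option description, while the empty default for quantification is an assumption, since no default is documented.

```python
AVAILABLE_METHODS = {"rsem", "rpkmforgenes"}


def check_workflow_config(config):
    """Validate config['workflow'] and return (use_multimapped, quantification).

    Hypothetical helper for illustration; not part of the workflow package.
    """
    workflow = config.get("workflow", {})
    use_multimapped = workflow.get("use_multimapped", False)
    # Assumption: an absent quantification key is treated as an empty list
    quantification = list(workflow.get("quantification", []))
    if not isinstance(use_multimapped, bool):
        raise TypeError("use_multimapped must be a boolean")
    unknown = set(quantification) - AVAILABLE_METHODS
    if unknown:
        raise ValueError(f"unknown quantification methods: {sorted(unknown)}")
    return use_multimapped, quantification


config = {
    "workflow": {
        "use_multimapped": False,
        "quantification": ["rsem", "rpkmforgenes"],
    }
}
settings = check_workflow_config(config)
```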
Application level configuration¶
Note
Unfortunately, there is no straightforward way to automatically list the available application configuration options. You therefore have to look in the rule files themselves for available options. In most cases, the default settings should work fine.
Note
Rules live in separate files whose names consist of the application name followed by the rule name. Rules are located in the package subdirectory rules, in which each application lives in a separate directory.
Tip
There is an options configuration key for each rule. Most often, this is the setting one wants to modify.
Individual applications (e.g. star) are located at the top level, with sublevels corresponding to specific application rules. For instance, the following configuration would affect settings in star and rsem:
star:
  star_index:
    # The test genome is small; 2000000 bases. --genomeSAindexNbases
    # needs to be adjusted to (min(14, log2(GenomeLength)/2 - 1))
    options: --genomeSAindexNbases 10
rsem:
  index: ../ref/rsem_index
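The formula in the comment above can be evaluated directly. The sketch below computes min(14, log2(GenomeLength)/2 - 1) for a given genome length; the function name genome_sa_index_nbases is hypothetical. For 2000000 bases the formula yields a value between 9 and 10. How to round it to an integer is not stated here, and the example configuration above uses 10.

```python
import math


def genome_sa_index_nbases(genome_length):
    """Evaluate min(14, log2(GenomeLength)/2 - 1) for a genome length in bases.

    Hypothetical helper for illustration; the result is not rounded.
    """
    return min(14, math.log2(genome_length) / 2 - 1)


value = genome_sa_index_nbases(2_000_000)
print(f"--genomeSAindexNbases upper bound: {value:.2f}")
```

For large genomes the formula is capped at 14.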
Additional advice¶
There are a couple of helper rules for generating spikein input files and the transcript annotation file.
dbutils_make_transcript_annot_gtf
- For QC statistics calculated by RSeQC, the gtf annotation file should reflect the content of the alignment index. You can automatically create the file named in ['ngs.settings']['annotation']['transcript_annot_gtf'] from the list of files defined in ['ngs.settings']['annotation']['sources'] via the rule dbutils_make_transcript_annot_gtf. Input files in gtf and genbank format are accepted.
ercc_create_ref
- The ERCC RNA Spike-In Mix is commonly used as spike-in. The rule ercc_create_ref automates download of the sequences in fasta and genbank formats.