[BioC] BSgenome gap mask conundrum

Ugo Borello ugo.borello at inserm.fr
Wed Jul 17 17:33:39 CEST 2013


Hi everybody,
While forging my custom BSgenome data package, I ran into a problem with
gap masks.

For my genome of interest, NCBI has 2 different gap masks for each assembled
chromosome: the chr?.comp.agp file (chromosome from component AGP) and the
chr?.agp file (chromosome from scaffold AGP). There is only one AGP file for
the unlocalized scaffold sequences and one for the unplaced ones.

When I forge my BSgenome package using only the assembled chromosomes,
everything works fine.
In this case I set the nmask_per_seq field in the seed file to 3: 2 AGP
masks (the comp.agp and .agp files) and 1 RepeatMasker mask for each
assembled chromosome.

I get the same positive result when I forge the package using the assembled
chromosomes together with the unlocalized and the unplaced scaffold
sequences, with the nmask_per_seq field in the seed file set to 2 (because
I include 1 AGP mask (the .agp file) and 1 RepeatMasker mask for every FASTA
file).

If you are still with me after this boring "maskerade", you can easily
anticipate that forgeBSgenomeDataPkg() throws an error when I try to use
3 masks for the assembled chromosomes and 2 for the rest of the sequences.
In this case I set the nmask_per_seq field in the seed file to 3.
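
A minimal sketch of the call in question, for readers who want to reproduce
the setup. The seed file path is a placeholder (an assumption, not the
actual path used here), and the seed file is assumed to contain the
nmask_per_seq field described above.

```r
library(BSgenome)

# Hypothetical location of the seed file; replace with the real path.
# The seed file is assumed to carry a single nmask_per_seq entry
# (3 in the failing case described above).
seed_file <- "path/to/my-genome-seed"

# Forge the data package source tree from the seed file.
forgeBSgenomeDataPkg(seed_file)
```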


My questions are:
- Is there a way to use 3 masks for the assembled chromosomes and 2 for the
rest of the sequences? In that case, what should the value of the
nmask_per_seq field in the seed file be?
- Should I simply ignore the comp.agp files? Are they useful for the
assembled chromosomes?

Thank you very much for your help.

Ugo


