[Bioc-devel] ShortRead readFasta UniProt Incorrect Import

Hervé Pagès hpages at fredhutch.org
Wed Oct 18 17:30:04 CEST 2017


Hi,

I just modified the Sequence Data workflow to suggest the
use of readDNAStringSet() and family to read in a FASTA file.

Cheers,
H.

On 10/18/2017 08:03 AM, Martin Morgan wrote:
> On 10/18/2017 01:00 AM, Dario Strbenac wrote:
>> Good day,
>>
>> If I have a FASTA file that contains
>>
>>> sp|Q9NYW0|T2R10_HUMAN Taste receptor type 2 member 10 OS=Homo sapiens
>>> GN=TAS2R10 PE=1 SV=3
>> MLRVVEGIFIFVVVSESVFGVLGNGFIGLVNCIDCAKNKLSTIGFILTGLAISRIFLIWI
>> IITDGFIQIFSPNIYASGNLIEYISYFWVIGNQSSMWFATSLSIFYFLKIANFSNYIFLW
>> LKSRTNMVLPFMIVFLLISSLLNFAYIAKILNDYKTKNDTVWDLNMYKSEYFIKQILLNL
>> GVIFFFTLSLITCIFLIISLWRHNRQMQSNVTGLRDSNTEAHVKAMKVLISFIILFILYF
>> IGMAIEISCFTVRENKLLLMFGMTTTAIYPWGHSFILILGNSKLKQASLRVLQQLKCCEK
>> RKNLRVT
>>
>> readFasta fails to import it with the warning
>>
>> proteins <- readFasta('.', "test.fasta")
>>
>> Warning message:
>> In .Call2("fasta_index", filexp_list, nrec, skip, seek.first.rec,  :
>>    reading FASTA file test.fasta: ignored 129 invalid one-letter
>> sequence codes
>>
>> Also, the amino acid sequence is incomplete. There are 308 amino
>> acids, but
>>
>>> width(proteins)
>> [1] 178
>>
>> It's undesirable for users that some amino acids are discarded.
>> Hopefully, they notice the warning message before proceeding with the
>> analysis.
>>
>> Admittedly, readFasta is in ShortRead, so is designed to work with
>> high througput sequencing reads. But, perhaps it would be better
>> suited to a infrastructure package such as Biobase and generalised to
>> correctly import any FASTA file. There's even a Bioconductor workflow
>> at
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.bioconductor.org_help_workflows_sequencing_&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=PQf2b4ItE83XXwu2NechRSckpuQ_eISbAf4B017Xrp4&s=zjUnVdFrQnCYVhojezx1OQ3ulJox_FLqiv8GAl_gzsg&e=
>> which has a section titled "DNA/amino acid sequence from FASTA files"
>> and demonstrates the use of readFasta.
>
> See Biostrings::readAAStringSet (and friends).
>
>
>>
>> I used version 1.34.2 of ShortRead which is the newest one.
>>
>> --------------------------------------
>> Dario Strbenac
>> University of Sydney
>> Camperdown NSW 2050
>> Australia
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=PQf2b4ItE83XXwu2NechRSckpuQ_eISbAf4B017Xrp4&s=i0gBwUFsMcadakXB1QgRHhPyK-ovrJcS-9_s06Vf0dc&e=
>>
>>
>
>
> This email message may contain legally privileged and/or...{{dropped:2}}
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=PQf2b4ItE83XXwu2NechRSckpuQ_eISbAf4B017Xrp4&s=i0gBwUFsMcadakXB1QgRHhPyK-ovrJcS-9_s06Vf0dc&e=
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioc-devel mailing list