[Bioc-devel] ShortRead readFasta UniProt Incorrect Import
Hervé Pagès
hpages at fredhutch.org
Wed Oct 18 17:30:04 CEST 2017
Hi,
I just modified the Sequence Data workflow to suggest the
use of readDNAStringSet() and family to read in a FASTA file.
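
For a protein file like the one in the report quoted below, the relevant member of that family is readAAStringSet(). A minimal sketch, assuming Biostrings is installed; the temporary file and its shortened contents (the real header line, but only the first 60 residues) are made up here for illustration:

```r
library(Biostrings)

# Shortened copy of the record from the report: the real header line and
# only the first 60 residues, written to a temporary file for illustration.
fasta_lines <- c(
  ">sp|Q9NYW0|T2R10_HUMAN Taste receptor type 2 member 10 OS=Homo sapiens GN=TAS2R10 PE=1 SV=3",
  "MLRVVEGIFIFVVVSESVFGVLGNGFIGLVNCIDCAKNKLSTIGFILTGLAISRIFLIWI"
)
fasta_path <- tempfile(fileext = ".fasta")
writeLines(fasta_lines, fasta_path)

# Read the file as amino acid sequences rather than DNA.
proteins <- readAAStringSet(fasta_path)

names(proteins)   # header line without the leading ">"
width(proteins)   # number of residues read per record
```

With the full file, width() should match the number of residues in the record, with no letters dropped.
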
Cheers,
H.
On 10/18/2017 08:03 AM, Martin Morgan wrote:
> On 10/18/2017 01:00 AM, Dario Strbenac wrote:
>> Good day,
>>
>> If I have a FASTA file that contains
>>
>> >sp|Q9NYW0|T2R10_HUMAN Taste receptor type 2 member 10 OS=Homo sapiens GN=TAS2R10 PE=1 SV=3
>> MLRVVEGIFIFVVVSESVFGVLGNGFIGLVNCIDCAKNKLSTIGFILTGLAISRIFLIWI
>> IITDGFIQIFSPNIYASGNLIEYISYFWVIGNQSSMWFATSLSIFYFLKIANFSNYIFLW
>> LKSRTNMVLPFMIVFLLISSLLNFAYIAKILNDYKTKNDTVWDLNMYKSEYFIKQILLNL
>> GVIFFFTLSLITCIFLIISLWRHNRQMQSNVTGLRDSNTEAHVKAMKVLISFIILFILYF
>> IGMAIEISCFTVRENKLLLMFGMTTTAIYPWGHSFILILGNSKLKQASLRVLQQLKCCEK
>> RKNLRVT
>>
>> readFasta fails to import it with the warning
>>
>> proteins <- readFasta('.', "test.fasta")
>>
>> Warning message:
>> In .Call2("fasta_index", filexp_list, nrec, skip, seek.first.rec, :
>> reading FASTA file test.fasta: ignored 129 invalid one-letter
>> sequence codes
>>
>> Also, the amino acid sequence is incomplete. There are 307 amino
>> acids, but
>>
>> > width(proteins)
>> [1] 178
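
The numbers in the warning and in width() are consistent with the file being decoded against the DNA alphabet, so that any residue letter that is not also a nucleotide code gets dropped. A base R sketch of that arithmetic, assuming the accepted symbols are the 18 listed in the code (the set Biostrings exposes as DNA_ALPHABET) and assuming this is how readFasta treats its input:

```r
# Sequence copied from the FASTA record in the report, one string per line.
aa <- paste0(
  "MLRVVEGIFIFVVVSESVFGVLGNGFIGLVNCIDCAKNKLSTIGFILTGLAISRIFLIWI",
  "IITDGFIQIFSPNIYASGNLIEYISYFWVIGNQSSMWFATSLSIFYFLKIANFSNYIFLW",
  "LKSRTNMVLPFMIVFLLISSLLNFAYIAKILNDYKTKNDTVWDLNMYKSEYFIKQILLNL",
  "GVIFFFTLSLITCIFLIISLWRHNRQMQSNVTGLRDSNTEAHVKAMKVLISFIILFILYF",
  "IGMAIEISCFTVRENKLLLMFGMTTTAIYPWGHSFILILGNSKLKQASLRVLQQLKCCEK",
  "RKNLRVT"
)

# Assumption: symbols accepted when the input is treated as DNA
# (IUPAC nucleotide codes plus "-", "+" and ".").
dna_alphabet <- strsplit("ACGTMRWSYKVHDBN-+.", "")[[1]]

residues <- strsplit(aa, "")[[1]]
is_dna_code <- residues %in% dna_alphabet

length(residues)                      # 307 residues in the record
sum(!is_dna_code)                     # 129 letters that are not DNA codes
sum(is_dna_code)                      # 178 letters that survive
sort(unique(residues[!is_dna_code]))  # "E" "F" "I" "L" "P" "Q"
```
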
>>
>> It's undesirable for users that some amino acids are discarded.
>> Hopefully, they notice the warning message before proceeding with the
>> analysis.
>>
>> Admittedly, readFasta is in ShortRead, so it is designed to work with
>> high-throughput sequencing reads. But perhaps it would be better
>> suited to an infrastructure package such as Biobase and generalised to
>> correctly import any FASTA file. There's even a Bioconductor workflow
>> at
>> https://www.bioconductor.org/help/workflows/sequencing/
>> which has a section titled "DNA/amino acid sequence from FASTA files"
>> and demonstrates the use of readFasta.
>
> See Biostrings::readAAStringSet (and friends).
>
>
>>
>> I used version 1.34.2 of ShortRead which is the newest one.
>>
>> --------------------------------------
>> Dario Strbenac
>> University of Sydney
>> Camperdown NSW 2050
>> Australia
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319