[BioC] SRAmetadb Bioconductor package; study record count low for 2013

Jack Zhu zhujack at mail.nih.gov
Tue Jun 17 21:31:04 CEST 2014


Hi Jamie and all,

By modifying my code and pulling data from a curated table
(SRA_Accessions.tab) from the SRA, I think the missing
'submission_date' values in the submission table have been fixed:

  strftime('%Y', s.submission_date) count(*)
1                              2008      348
2                              2009     1260
3                              2010     2865
4                              2011     4276
5                              2012     6606
6                              2013    15309
7                              2014     7706
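
For anyone who wants to reproduce this kind of check, here is a minimal
sketch in Python (sqlite3). It is not the actual SRAdb build code. It
assumes `study` and `submission` tables linked by `submission_accession`,
and a hypothetical mapping from submission accession to date of the kind
one could read out of SRA_Accessions.tab. The accessions and dates are toy
values, and the function name `backfill_and_count` is made up for
illustration.

```python
import sqlite3


def backfill_and_count(conn, received_dates):
    """Fill NULL submission_date values, then count studies per year.

    received_dates: dict mapping submission accession -> date string
    (hypothetical stand-in for values taken from a curated table).
    """
    # Only touch rows whose date is missing; existing dates are kept.
    conn.executemany(
        "UPDATE submission SET submission_date = ? "
        "WHERE submission_accession = ? AND submission_date IS NULL",
        [(date, acc) for acc, date in received_dates.items()],
    )
    # Count studies by the year of their submission's date.
    return conn.execute(
        "SELECT strftime('%Y', s.submission_date) AS year, count(*) "
        "FROM study st JOIN submission s "
        "ON st.submission_accession = s.submission_accession "
        "WHERE s.submission_date IS NOT NULL "
        "GROUP BY year ORDER BY year"
    ).fetchall()


# Toy in-memory database (assumed, simplified schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE submission (submission_accession TEXT PRIMARY KEY,
                         submission_date TEXT);
CREATE TABLE study (study_accession TEXT PRIMARY KEY,
                    submission_accession TEXT);
INSERT INTO submission VALUES ('SUB_A', '2012-05-01 10:00:00'),
                              ('SUB_B', NULL),
                              ('SUB_C', NULL);
INSERT INTO study VALUES ('STUDY_A', 'SUB_A'),
                         ('STUDY_B', 'SUB_B'),
                         ('STUDY_C', 'SUB_C');
""")

# Hypothetical dates for the two submissions that lack one.
received = {"SUB_B": "2013-07-15 00:00:00", "SUB_C": "2013-11-02 00:00:00"}
counts = backfill_and_count(conn, received)
print(counts)
```

The per-year query skips submissions whose date is still NULL, so comparing
the counts before and after the backfill shows how many studies were hidden
by missing dates.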

Please let me know if you still see any problems or have any questions.  Thanks.

Jack
----
Yuelin Jack Zhu

Genetics Branch/CCR/NCI/NIH
Tel: (301)496-4527
FAX: (301) 402-3241
E-mail: zhujack at mail.nih.gov


On Sun, Jun 8, 2014 at 11:11 AM, Jack Zhu <zhujack at mail.nih.gov> wrote:
> Hi all,
>
> Regarding the studies missing by submission_date for 2013 and 2014 in the
> SRAdb SQLite database, I did some investigation and found the reason.
> The metadata in SRAdb is mainly parsed from the XML files of the
> SRA submissions, and this is also true of the submission table.  But I see
> that quite a few submission XML files don't have a submission date, e.g.
>
> ftp://ftp-trace.ncbi.nih.gov/sra/Submissions/SRA157/SRA157949/
>
>   SRA157949.experiment.xml
>   SRA157949.submission.xml
>
> So it seems all the study and submission records are there, but some
> submission records just don't have a submission date.  I am looking into the
> possibility of adding dates for those records.
>
> Jamie, thanks for the finding and I will keep you updated.
>
> Jack
>
>
> On Fri, Jun 6, 2014 at 3:49 PM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
>> Hi, Jack.
>>
>> I took a look at this and it does appear that the number of
>> submissions is very low for 2013.  Also, there are no 2014 submissions
>> listed that I could find.  This was using the June 1, 2014 sqlite
>> file.
>>
>> Sean
>>
>>
>>
>> ---------- Forwarded message ----------
>> From: Al-Nasir, Jamie (2012) <Jamie.Al-Nasir.2012 at live.rhul.ac.uk>
>> Date: Thu, Jun 5, 2014 at 2:20 PM
>> Subject: [BioC] SRAmetadb Bioconductor package; study record count low for 2013
>> To: "bioconductor at r-project.org" <bioconductor at r-project.org>
>> Cc: "Shanahan, Hugh" <Hugh.Shanahan at rhul.ac.uk>
>>
>>
>> Hello,
>>
>> I have been looking at the SRA (Sequence Read Archive) SQLite database
>> provided as a Bioconductor package for R.
>>
>> My question concerns top-level studies, which are found in the study table
>> and dated in the submission table.
>>
>> Why are there so few entries for top-level studies for 2013
>> compared with 2011 and 2012?
>>
>> The SQL queries I have written, joining the submission and study tables
>> to obtain the submission_date, yield the following counts of top-level
>> studies by year:
>>
>>
>> 2005|64
>> 2006|38
>> 2007|94
>> 2008|269
>> 2009|893
>> 2010|2631
>> 2011|4077
>> 2012|5208
>> 2013|724
>>
>>
>> As one can see, the number of studies in the metadata falls off in 2013.
>> I have been using the SRAdb Bioconductor SQLite database, which has
>> the creation timestamp 2013-12-03 08:29:26 in the metaInfo table.
>>
>> I would really appreciate any thoughts on this.
>>
>>
>> Best regards,
>>
>> Jamie
>>
>> Jamie Al-Nasir MPharm (Hons)
>> Department of Computer Science
>> Centre for Systems and Synthetic Biology
>> Mobile: +44 (0)759 4800 229
>> Web: http://jamie.al-nasir.com/
>>


