Help for package gap.datasets

Version:

0.0.6

Date:

2023-8-14

Title:

Datasets for 'gap'

Description:

Datasets associated with the 'gap' package. Currently, it includes example data for a regional association plot (CDKN), example data for a genomewide association meta-analysis (OPG), data from studies of Parkinson's disease (PD), ALDH2 markers and alcoholism (aldh2), APOE/APOC1 markers and Alzheimer's disease (apoeapoc), cystic fibrosis (cf), an Olink/INF panel (inf1), and Manhattan plots with (hr1420, mhtdata) and without (w4) gene annotations.

LazyData:

Yes

LazyLoad:

Yes

License:

GPL-2 | GPL-3 [expanded from: GPL (≥ 2)]

URL:

https://jinghuazhao.github.io/R/

NeedsCompilation:

Packaged:

2023-08-19 19:19:17 UTC; jhz22

Depends:

R (≥ 2.10)

RoxygenNote:

7.1.2

Author:

Jing Hua Zhao [aut, cre], Swetlana Herbrandt [ctb]

Maintainer:

Jing Hua Zhao <jinghuazhao@hotmail.com>

Repository:

CRAN

Date/Publication:

2023-08-25 10:20:05 UTC

An example data for regional association plot

Description

These data are adapted from the DGI study on CDKN2A/CDKN2B region.

Usage

data(CDKN)

Format

There are three data objects in the dataset: CDKNgenes, the gene list for chromosome 9 according to the UCSC browser (https://genome.ucsc.edu/); CDKNmap, the genetic map from the HapMap website (https://ftp.ncbi.nlm.nih.gov/hapmap/recombination/2006-10_rel21_phaseI+II/rates/); and CDKNlocus, the results from the association analysis of the locus based on DGI data.

Source

The data were obtained from the Harvard-MIT Broad Institute (see https://www.broadinstitute.org/diabetes)

References

Diabetes Genetics Initiative of Broad Institute of Harvard and MIT, Lund University and Novartis Institute for BioMedical Research. Whole-genome association analysis identifies novel loci for type 2 diabetes and triglyceride levels. Science 2007;316(5829):1331-6

Examples

data(CDKN)
head(CDKNlocus)

An example data for forest plot using METAL output

Description

This example contains METAL output (OPGtbl) together with association statistics from the contributing studies (OPGall). Because reference ids (rsid) can vary, variants are identified by chr:pos_A1_A2 with A1<=A2 (SNPID); a SNPID-rsid mapping (OPGrsid) is also provided.

Usage

data(OPG)

Format

Three data frames

Source

SCALLOP consortium

References

The SCALLOP paper.

Examples

data(OPG)
head(OPGtbl)
head(OPGall)
head(OPGrsid)
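
The SNPID convention described for this dataset, chr:pos_A1_A2 with A1<=A2, can be sketched in base R as follows. The helper name make_snpid and the toy input are hypothetical, and the "chr" prefix follows the SNP labels shown for jma.cojo elsewhere in this manual; it is worth comparing the result with OPGrsid before relying on it.

```r
# Hypothetical helper (not part of gap.datasets): build chr:pos_A1_A2 with A1 <= A2
make_snpid <- function(chr, pos, a1, a2) {
  a1 <- toupper(a1)
  a2 <- toupper(a2)
  first  <- ifelse(a1 <= a2, a1, a2)   # alphabetically first allele
  second <- ifelse(a1 <= a2, a2, a1)   # alphabetically second allele
  # format() avoids scientific notation for large positions such as 1e8
  pos_chr <- format(pos, scientific = FALSE, trim = TRUE)
  paste0("chr", chr, ":", pos_chr, "_", first, "_", second)
}

ids <- make_snpid(chr = c(19, 20), pos = c(54327313, 37456819),
                  a1 = c("C", "T"), a2 = c("A", "C"))
ids
```

Sorting the alleles means the same variant receives the same identifier whichever allele a contributing study reported first.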

A study of Parkinson's disease and APOE, LRRK2, SNCA markers

Description

A study of Parkinson's disease patients and controls with APOE, LRRK2 markers rs10506151, rs10784486, rs1365763, rs1388598, rs1491938, rs1491941 and SNCA markers m770, int4 and SNCA. The column abc indicates whether a subject has familial Parkinson's disease (+), has sporadic Parkinson's disease (-), or is a control (Control). Races recorded are American Indian (AI) and African American (B); the remaining subjects are Caucasian. Diagnosis also includes possible (POS), probable (PRO) and definite PD. AON is the age at onset.

Usage

data(PD)

Format

A data frame

Source

Prof Abbas Parsian at NIH

References

Parsian et al. ASHG 2005, Toronto

ALDH2 markers and Alcoholism

Description

This data set contains eight ALDH2 markers typed in Japanese alcoholic patients (y=1) and controls (y=0). There are genotypes for 8 loci, each with a prefix name (e.g., "EXON12") and a suffix for each of two alleles (".a1" and ".a2").

The eight marker loci follow this map (base pairs):

D12S2070	(> 450 000),
D12S839	(> 450 000),
D12S821	(~ 400 000),
D12S1344	( 83 853),
EXON12	( 0),
EXON1	( 37 335),
D12S2263	( 38 927),
D12S1341	(> 450 000)
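
To show how the prefix and suffix combine, the sketch below builds the sixteen allele column names implied by the description. It is an illustration only: the marker order is taken from the map above, and whether aldh2 stores its columns in exactly this order is an assumption best checked with names(aldh2).

```r
# Marker names as listed in the map above
loci <- c("D12S2070", "D12S839", "D12S821", "D12S1344",
          "EXON12", "EXON1", "D12S2263", "D12S1341")

# Each locus contributes two columns, one per allele
allele_cols <- paste0(rep(loci, each = 2), c(".a1", ".a2"))
allele_cols[9:10]   # the two EXON12 columns
```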

Usage

data(aldh2)

Format

A data frame

Source

Prof Ian Craig of Oxford and SGDP Centre, KCL

References

Koch HG, McClay J, Loh E-W, Higuchi S, Zhao J-H, Sham P, Ball D, et al (2000) Allele association studies with SSR and SNP markers at known physical distances within a 1 Mb region embracing the ALDH2 locus in the Japanese, demonstrates linkage disequilibrium extending up to 400 kb. Hum. Mol. Genet. 9:2993-2999

APOE/APOC1 markers and Alzheimer's

Description

This data set contains APOE/APOC1 markers and Chinese Alzheimer's patients and controls. Variable id is subject id and y takes value 0 for controls and 2 for Alzheimer's.

The last six variables are age, sex and genotypes for APOE and APOC with suffixes for each of two alleles (".a1" and ".a2").

Usage

data(apoeapoc)

Format

A data frame

Source

Shi J, Zhang S, Ma C, Liu X, Li T, Tang M, Han H, Guo Y, Zhao JH, Zheng K, Kong X, Zhang K, Su Z, Zhao Z. Association between apolipoprotein CI HpaI polymorphism and sporadic Alzheimer's disease in Chinese. Acta Neurol Scand 2004, 109:140-145.

Cystic fibrosis data

Description

This data set contains a case-control indicator and 23 SNPs.

The inter-marker distances (Morgan) are as follows

0.000090, 0.000158, 0.005000, 0.000100, 0.000200, 0.000150, 0.000250, 0.000200, 0.000050, 0.000350, 0.000300, 0.000250, 0.000350, 0.000350, 0.000800, 0.000100, 0.000200, 0.000150, 0.000550, 0.006000, 0.000700, 0.001000
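
Since 23 SNPs give 22 intervals, the distances above can be accumulated into map positions. A minimal sketch, assuming the first SNP sits at position 0 and that the distances are listed in marker order:

```r
# Inter-marker distances in Morgans, copied from the list above
d <- c(0.000090, 0.000158, 0.005000, 0.000100, 0.000200, 0.000150,
       0.000250, 0.000200, 0.000050, 0.000350, 0.000300, 0.000250,
       0.000350, 0.000350, 0.000800, 0.000100, 0.000200, 0.000150,
       0.000550, 0.006000, 0.000700, 0.001000)

pos_morgan <- c(0, cumsum(d))   # position of each SNP relative to the first
pos_cM <- 100 * pos_morgan      # the same positions in centiMorgans
length(pos_morgan)              # one position per SNP
```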

Usage

data(cf)

Format

A data frame containing 186 rows and 24 columns

Note

This can be used as an example of converting PL-EM to matrix format:

library(gap)   # plem2m() is assumed to come from the 'gap' package
data(cf)
cfdata <- vector("numeric")
cfname <- vector("character")
# column 1 is the case-control indicator; columns 2 onwards are the SNPs
for (i in 2:ncol(cf))
{
    tmp <- plem2m(cf[,i])   # split each genotype code into its two alleles
    a1 <- tmp[[1]]
    a2 <- tmp[[2]]
    cfdata <- cbind(cfdata,a1,a2)
    a1name <- paste("loc",i-1,".a1",sep="")
    a2name <- paste("loc",i-1,".a2",sep="")
    cfname <- c(cfname,a1name,a2name)
}
cfdata <- as.data.frame(cfdata)
names(cfdata) <- cfname

Source

Liu JS, Sabatti C, Teng J, Keats BJB, Risch N (2001). Bayesian Analysis of Haplotypes for Linkage Disequilibrium Mapping. Genome Research 11:1716-1724

A CNV data

Description

A CNV dataset.

Usage

data(cnv)

Format

A CNV data

Source

Zheng Ye

Crohn's disease data

Description

The data set consists of 103 common (>5% minor allele frequency) SNPs genotyped in 129 trios from a European-derived population. These SNPs are in a 500-kb region on human chromosome 5q31 implicated as containing a genetic risk factor for Crohn disease.

The positions, names and haplotype blocks reported are as follows:

274044   IGR1118a_1	BLOCK 1
274541   IGR1119a_1	*
286593   IGR1143a_1	*
287261   IGR1144a_1	*
299755   IGR1169a_2	*
324341   IGR1218a_2	*
324379   IGR1219a_2	*
358048   IGR1286a_1	BLOCK 1
366811   TSC0101718
395079   IGR1373a_1	BLOCK 2
396353   IGR1371a_1	*
397334   IGR1369a_2	*
397381   IGR1369a_1	*
398352   IGR1367a_1	BLOCK 2
411823   IGR2008a_2
411873   IGR2008a_1	BLOCK 3
412456   IGR2010a_3	*
413233   IGR2011b_1	*
415579   IGR2016a_1	*
417617   IGR2020a_15	*
419845   IGR2025a_2	*
424283   IGR2033a_1	*
425376   IGR2036a_2	*
425549   IGR2036a_1	BLOCK 3
433467   IGR2052a_1	BLOCK 4
435282   IGR2055a_1	*
437682   IGR2060a_1	*
438883   IGR2063b_1	*
443565   IGR2072a_2	*
443750   IGR2073a_1	*
445337   IGR2076a_1	*
447791   IGR2081a_1	*
449895   IGR2085a_2	*
455246   IGR2096a_1	*
463136   IGR2111a_3	BLOCK 4
482171   IGR2150a_1	BLOCK 5
485828   IGR2157a_1	*
495082   IGR2175a_2	*
506266   IGR2198a_1	*
506890   IGR2199a_1	BLOCK 5
507208   IGR2200a_1	BLOCK 6
508338   IGR2202a_1	*
508858   IGR2203a_1	*
510951   IGR2207a_1	*
518478   IGR2222a_2	BLOCK 6
519387   IGR2224a_2	BLOCK 7
519962   IGR2225a_1	*
520521   IGR2226a_3	*
522600   IGR2230a_1	*
525243   IGR2236a_1	*
529556   IGR2244a_4	*
532363   IGR2250a_4	*
545062   IGR2276a_1	*
553189   IGR2292a_1	*
570978   IGR3005a_1	*
571022   IGR3005a_2	*
576586   IGR3016a_1	*
577141   IGR3018a_2	*
577838   IGR3019a_2	*
578122   IGR3020a_1	*
579217   IGR3022a_1	*
579529   IGR3023a_1	*
579818   IGR3023a_3	*
582651   IGR3029a_1	*
582948   IGR3029a_2	*
583131   IGR3030a_1	*
587836   IGR3039a_1	*
590425   IGR3044a_1	*
590585   IGR3045a_1	*
594115   IGR3051a_1	*
594812   IGR3053a_1	*
598805   IGR3061a_1	*
601294   IGR3066a_1	*
608759   IGR3081a_1	*
610447   IGR3084a_1	*
611177   IGR3086a_1	BLOCK 7
613488   IGR3090a_1
616241   IGR3096a_1	BLOCK 8
616763   IGR3097a_1	*
617299   IGR3098a_1	*
626881   IGR3117a_1	*
633786   IGR3131a_1	*
635072   IGR3134a_1	*
637441   IGR3138a_1	BLOCK 8
648564   IGR3161a_1
649061   IGR3162a_1	BLOCK 9
649903   IGR3163a_1	*
657234   IGR3178a_1	*
662077   IGR3188a_1	*
662819   IGR3189a_2	*
676688   IGRX100a_1	BLOCK 9
683387   IGR3230a_1	BLOCK 10
686249   IGR3236a_1	*
692320   IGR3248a_1	*
718291   IGR3300a_2	*
730313   IGR3324a_1	*
731025   IGR3326a_1	*
738461   IGR3340a_1	BLOCK 10
871978   GENS021ex1_2	BLOCK 11
877571   GENS020ex3_3	*
877671   GENS020ex3_2	*
877809   GENS020ex3_1	*
890710   GENS020ex1_1	BLOCK 11

However, it has been changed since the paper was published.

An example use of the data is given in the following paper: Kelly M. Burkett, Celia M. T. Greenwood, Brad McNeney, Jinko Graham. Gene genealogies for genetic association mapping, with application to Crohn's disease. Front Genet 2013, 4(260) doi: 10.3389/fgene.2013.00260

Usage

data(crohn)

Format

A data frame containing 387 rows and 212 columns

Source

Daly MJ, Rioux JD, Schaffner SF, Hudson TJ, Lander ES (2001). High-resolution haplotype structure in the human genome. Nature Genetics 29:229-232

Friedreich Ataxia data

Description

This data set contains a case-control indicator and twelve microsatellite markers. An extra unphased individual with the following genotype

 2  7  7  7  1  3  2  2  2  2  6  3
 3  8 10  8  3  9  3  4  2  2  7  5

has not been included.

The inter-marker distances (Morgan) are as follows,

0.03, 0.065, 0.00125, 0.00125, 0.00125, 0.00125, 0.00125, 0.00125, 0.00125, 0.00125, 0.045

Usage

data(fa)

Format

A data frame containing 127 rows and 13 columns

Source

Liu JS, Sabatti C, Teng J, Keats BJB, Risch N (2001). Bayesian analysis of haplotypes for linkage disequilibrium mapping. Genome Research 11:1716-1724

A case-control data involving four SNPs with missing genotype

Description

This is a simulated data set of four SNPs with their alleles coded as characters. The variable y contains phenotypes (1=case, 0=control).

Usage

data(fsnps)

Format

A data frame

Source

Dr Sebastien Lissarrague of Genset

The HLA data

Description

This data set contains HLA markers DRB, DQA, DQB and phenotypes of 271 subjects, schizophrenia patients (y=1) and controls (y=0). Genotypes for the 3 HLA loci have a prefix name (e.g., "DQB") and a suffix for each of two alleles (".a1" and ".a2").

Usage

data(hla)

Format

A data frame containing 271 rows and 8 columns

Source

Dr Padraig Wright of Pfizer

An example data for Manhattan plot with annotation

Description

This example contains p values for a list of SNPs with information on chromosome, position and gene symbol.

In the reference below, seven established SNPs are in light blue, 14 new SNPs in dark blue and those that failed to replicate in red. The paper size is set to 189 width x 189/2 height (mm) and 1200 dpi resolution. The font is Verdana.

Usage

data(hr1420)

Format

A data frame

Source

Dr Marcel den Hoed

References

den Hoed M et al. (2013) Heart rate-associated loci and their effects on cardiac conduction and rhythm disorders. Nature Genetics 45(6):621-31, doi: 10.1038/ng.2610.

Examples

head(hr1420)

A data containing protein panel

Description

This data is used to illustrate cis/trans classification, containing the following columns:

                                                        Target Target.Short
1                                        Osteoprotegerin (OPG)          OPG
2                            C-X-C motif chemokine 11 (CXCL11)       CXCL11
3                     TNF-related activation cytokine (TRANCE)       TRANCE
4                                               Axin-1 (AXIN1)        AXIN1
5                               C-C motif chemokine 25 (CCL25)        CCL25
6 Tumor necrosis factor (Ligand) superfamily member 12 (TWEAK)        TWEAK
  UniProtID      Gene chrom     Start       End
1    O00300 TNFRSF11B     8 119935796 119964439
2    O14625    CXCL11     4  76954835  76962568
3    O14788   TNFSF11    13  43136872  43182149
4    O15169     AXIN1    16    337440    402673
5    O15444     CCL25    19   8117651   8127534
6    O43508   TNFSF12    17   7452208   7464925

Usage

data(inf1)

Format

A data frame containing 92 rows and 7 columns

Source

Undisclosed

A data containing independent GWAS hits as from GCTA

Description

This data is used to illustrate cis/trans classification, containing the following columns:

    prot Chr                SNP       bp refA       freq          b        se
1 4E.BP1  19 chr19:54327313_A_C 54327313    A 0.20550900  0.4510040 0.0243056
2 4E.BP1  19 chr19:54329063_G_T 54329063    T 0.10023500 -0.3233240 0.0333274
3    ADA  19 chr19:54327313_A_C 54327313    A 0.20550900  0.3542660 0.0246266
4    ADA  20 chr20:37456819_C_T 37456819    T 0.00388582 -0.2473080 0.1749800
5    ADA  20 chr20:38196991_G_T 38196991    G 0.00236927 -0.0171435 0.2238980
6    ADA  20 chr20:38603207_A_G 38603207    A 0.17074600 -0.0269075 0.0271976
            p       n  freq_geno         bJ     bJ_se           pJ        LD_r
1 2.48545e-74 6483.69 0.20079500   0.426476 0.0251676  2.07907e-64 -0.13397800
2 4.69307e-22 6480.60 0.08846920  -0.246444 0.0338712  3.44090e-13  0.00000000
3 5.47833e-46 6441.97 0.20079500   0.354266 0.0250171  1.59869e-45  0.00000000
4 1.57618e-01 5553.51 0.00497018  -5.873090 0.2241210 2.32892e-151 -0.00633091
5 9.38970e-01 5556.57 0.00198807 -13.473100 0.3790980 1.18609e-276  0.02467370
6 3.22550e-01 6285.16 0.15009900  -0.299797 0.0278787  5.69806e-27  0.11116200
  UniProtID
1    Q13541
2    Q13541
3    P00813
4    P00813
5    P00813
6    P00813

Usage

data(jma.cojo)

Format

A data frame containing 445 rows and 16 columns

Source

Undisclosed
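
Both inf1 and jma.cojo are described as illustrating cis/trans classification. The sketch below shows one common way to make that call: a hit is labelled cis when it lies on the same chromosome as the gene encoding the protein and within a fixed distance of it, and trans otherwise. The function classify_cis_trans, the 1 Mb radius and the three hits are assumptions made for illustration (the hits are invented, not results); the gene coordinates are the six rows shown for inf1. It is not the classification routine of the 'gap' package.

```r
# Gene coordinates for the six proteins shown in the inf1 description
genes <- data.frame(
  UniProtID = c("O00300", "O14625", "O14788", "O15169", "O15444", "O43508"),
  Gene      = c("TNFRSF11B", "CXCL11", "TNFSF11", "AXIN1", "CCL25", "TNFSF12"),
  chrom     = c(8, 4, 13, 16, 19, 17),
  Start     = c(119935796, 76954835, 43136872, 337440, 8117651, 7452208),
  End       = c(119964439, 76962568, 43182149, 402673, 8127534, 7464925)
)

# Hypothetical hits, invented for illustration only (columns named as in jma.cojo)
hits <- data.frame(
  UniProtID = c("O00300", "O00300", "O14625"),
  Chr       = c(8, 8, 19),
  bp        = c(120500000, 125000000, 54327313)
)

# Hypothetical helper: cis = same chromosome and within `radius` bp of the gene
classify_cis_trans <- function(hits, genes, radius = 1e6) {
  m <- merge(hits, genes, by = "UniProtID")
  is_cis <- m$Chr == m$chrom &
            m$bp >= m$Start - radius &
            m$bp <= m$End + radius
  m$cis.trans <- ifelse(is_cis, "cis", "trans")
  m
}

res <- classify_cis_trans(hits, genes)
res[, c("UniProtID", "Gene", "Chr", "bp", "cis.trans")]
```

The radius is a tuning choice; a wider window labels more hits as cis.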

An example pedigree data

Description

The data set contains 51 individuals in a pedigree. It is used below to compare results from various packages.

Usage

data(l51)

Format

A data frame

Source

Morgan v3.

References

Morgan v3. https://sites.stat.washington.edu/thompson/Genepi/MORGAN/Morgan.shtml

Examples

## Not run: 
library(gap)   # provides kin.morgan
km <- kin.morgan(l51)
k2 <- km$kin.matrix*2

# quantitative trait
library(regress)
r <- regress(qt ~ 1, ~k2, data=l51)
names(r)
r
# qualitative trait
N <- dim(l51)[1]
w <- with(l51,quantile(qt,probs=0.75,na.rm=TRUE))
ped51 <- within(l51, bt <- ifelse(qt<=w,0,1))
d <- regress(bt ~ 1, ~k2, data=ped51)
d
# for other tests not shown here
set.seed(12345)
ped51 <- within(ped51,{r <- rnorm(N); bt[is.na(bt)] <- 0})
library(foreign)
write.dta(ped51,"ped51.dta")

## End(Not run)
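
The example above doubles a kinship matrix to obtain a relationship matrix for regress. To make that quantity concrete, here is a base-R sketch of the tabular method for kinship coefficients on a hypothetical four-person pedigree (two founders and their two children). The function kinship_tabular and the toy pedigree are assumptions for illustration; this is not the algorithm used by kin.morgan, and it requires parents to be listed before their offspring, with 0 marking an unknown parent.

```r
# Hypothetical helper: kinship coefficients by the tabular method
kinship_tabular <- function(id, father, mother) {
  n <- length(id)
  K <- matrix(0, n, n, dimnames = list(as.character(id), as.character(id)))
  for (i in seq_len(n)) {
    f <- match(father[i], id)   # NA when the father is unknown (founder)
    m <- match(mother[i], id)   # NA when the mother is unknown (founder)
    # kinship with every earlier individual: average over the two parents
    for (j in seq_len(i - 1)) {
      kf <- if (is.na(f)) 0 else K[f, j]
      km <- if (is.na(m)) 0 else K[m, j]
      K[i, j] <- K[j, i] <- 0.5 * (kf + km)
    }
    # self-kinship: 1/2 * (1 + kinship between the parents)
    fm <- if (is.na(f) || is.na(m)) 0 else K[f, m]
    K[i, i] <- 0.5 * (1 + fm)
  }
  K
}

# Toy pedigree: individuals 1 and 2 are founders, 3 and 4 are their children
ped <- data.frame(id = 1:4, father = c(0, 0, 1, 1), mother = c(0, 0, 2, 2))
K <- with(ped, kinship_tabular(id, father, mother))
2 * K   # relationship matrix, analogous to k2 above
```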

An example pedigree

Description

A multi-generational pedigree containing individual, father, mother IDs and sex.

Usage

data(lukas)

Format

An example pedigree

Source

Lukas Keller

A study of Parkinson's disease and MAO gene

Description

The markers are recorded both as actual allele sizes and as allele numbers. The dataset is distributed with GENECOUNTING version 2.0 to illustrate the gene counting method for chromosome X data. A total of 183 patients and 157 controls (150 males, 190 females) were available, together with five markers in the MAOA (monoamine oxidase A) region with 12, 9, 6, 5 and 3 alleles, respectively. The first three markers were genotyped in all individuals, while the fourth and fifth were genotyped in 294 and 304 individuals.

Usage

data(mao)

Format

A data frame

Source

Dr Helen Latsoudis of Institute of Psychiatry, KCL

References

Zhao JH (2004). 2LD, GENECOUNTING and HAP: computer programs for linkage disequilibrium analysis. Bioinformatics 20:1325-1326

A pedigree data on 282 animals deriving from two generations

Description

A data frame attributed to Meyer (1989).

“The pedigrees for each of these 282 animals derive from an additional 24 base population (Generation 0) animals that do not have records of their own but, nevertheless, are of interest with respect to the inference on their own additive genetic values. Furthermore, it is presumed that these original 24 base animals are not related to each other. Therefore, the row dimension of u is 306 (282+24).” (Tempelman & Rosa 2004)

Usage

data(meyer)

Format

A data frame containing 306 records

Source

Meyer K (1989). Restricted maximum likelihood to estimate variance components for animal models with several random effects using a derivative-free algorithm. Genetics, Selection, Evolution 21:317-340.

Tempelman RJ, Rosa GJM. Empirical Bayes Approaches to Mixed Model Inference in Quantitative Genetics. in Saxton AM (Ed). Genetic Analysis of Complex Traits Using SAS, chapter 7. SAS Institute Inc., Cary, NC, USA, 2004

Examples

## Not run: 
library(gap)
meyer <- within(meyer,{
   g1 <- ifelse(generation==1,1,0)
   g2 <- ifelse(generation==2,1,0)
})
lm(y~-1+g1+g2,data=meyer)
library(MCMCglmm)
m <- MCMCglmm(y~-1+g1+g2,random=animal~1,pedigree=meyer[,1:3],data=meyer,verbose=FALSE)
summary(m)
plot(m)

meyer <- within(meyer,{
   id <- animal
   animal <- ifelse(!is.na(animal),animal,0)
   dam <- ifelse(!is.na(dam),dam,0)
   sire <- ifelse(!is.na(sire),sire,0)
})
# library(kinship)
# A <- with(meyer,kinship(animal,sire,dam))*2

A <- kin.morgan(meyer)$kin.matrix*2

library(regress)
regress(y~-1+g1+g2,~A,data=meyer)
prior <- list(R=list(V=1, nu=0.002), G=list(G1=list(V=1, nu=0.002)))
m2 <- MCMCgrm(y~-1+g1+g2,prior,meyer,A,singular.ok=TRUE,verbose=FALSE)
summary(m2)
plot(m2)   

## End(Not run)

Example data for ACEnucfam

Description

This is the companion data for ACEnucfam.

Usage

data(mfblong)

Format

The data is a random subset of the birth weight data from the mental health registry of Norway.

male: a dummy variable for being male
first: a dummy variable for being the first child
midage: a dummy variable for mother aged 20-35 at time of birth
highage: a dummy variable for mother older than 35 at time of birth
birthyr: year of birth minus 1967 (earliest birth year in birth registry)
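
The coding of these covariates can be sketched as follows. The raw data frame and its column names (sex, parity, mother_age, year) are hypothetical and are not part of mfblong; the sketch only illustrates how the dummy variables and the centred birth year described here could be derived, treating the 20-35 age band as inclusive.

```r
# Hypothetical raw records, invented for illustration
raw <- data.frame(
  sex        = c("M", "F", "M"),
  parity     = c(1, 2, 1),
  mother_age = c(19, 27, 38),
  year       = c(1967, 1975, 1990)
)

covars <- with(raw, data.frame(
  male    = as.integer(sex == "M"),                          # 1 if male
  first   = as.integer(parity == 1),                         # 1 if first child
  midage  = as.integer(mother_age >= 20 & mother_age <= 35), # mother aged 20-35
  highage = as.integer(mother_age > 35),                     # mother older than 35
  birthyr = year - 1967                                      # years since 1967
))
covars
```

Mothers younger than 20 form the reference category, with both midage and highage equal to 0.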

Source

The data were obtained from the Biometrics website and preprocessed with f.mfb.R.

References

Rabe-Hesketh S, Skrondal A, Gjessing HK. Biometrical modeling of twin and family data using standard mixed model software. Biometrics 2008, 64:280-288

An example data for Manhattan plot with annotation (mhtplot)

Description

This example contains p values for a list of SNPs; information on chromosome, position and reference sequence, as well as gene annotation, was obtained separately.

Usage

data(mhtdata)

Format

A data frame

Source

Dr Tuomas Kilpelainen at the MRC Epidemiology Unit

References

Kilpelainen TO, et al. (2011) Genetic variation near IRS1 associates with reduced adiposity and an impaired metabolic profile. Nature Genetics 43(8):753-60, doi: 10.1038/ng.866.

Examples

head(mhtdata)

A study of Alzheimer's disease with eight SNPs and APOE

Description

This is a study of the neprilysin gene and sporadic Alzheimer's disease in Chinese. There are 257 cases and 242 controls, each with eight SNPs detected through denaturing high-performance liquid chromatography (DHPLC).

Usage

data(nep499)

Format

A data frame

Source

Shi J, Zhang S, Tang M, Ma C, Zhao J, Li T, Liu X, Sun Y, Guo Y, Han H, Ma Y, Zhao Z. Mutation Screening and Association Study of the Neprilysin Gene in Sporadic Alzheimer's Disease in Chinese Persons. J Gerontol A: Bio Sci Med Sci 60:301-306, 2005

Results from a GWAS on Chickens

Description

This example contains p values for a list of SNPs with information on chromosome and positions.

Usage

data(w4)

Format

A data frame

Source

Titan <lone9@qq.com>

Examples

head(w4)
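
The Manhattan-plot datasets in this package (hr1420, mhtdata, w4) pair p values with chromosome and position. A minimal base-R sketch of the usual preparation, placing chromosomes end to end on the x axis and plotting -log10(p), is given below. The toy data frame and its column names chr, pos and p are hypothetical, and chromosomes are assumed to be coded 1, 2, 3, ...; names(w4) shows the actual columns.

```r
# Toy data, invented for illustration: 3 chromosomes with 4 SNPs each
toy <- data.frame(
  chr = rep(1:3, each = 4),
  pos = rep(c(100, 200, 300, 400), times = 3),
  p   = 10^-seq(0.5, 6, by = 0.5)
)

# Offset of each chromosome = total length of the chromosomes before it
chr_len <- tapply(toy$pos, toy$chr, max)
offset  <- c(0, cumsum(chr_len))[toy$chr]

toy$x    <- toy$pos + offset   # genome-wide x coordinate
toy$logp <- -log10(toy$p)      # y coordinate

# Draw only in an interactive session; alternate colours by chromosome
if (interactive()) {
  plot(toy$x, toy$logp, col = ifelse(toy$chr %% 2 == 1, "blue", "grey40"),
       pch = 20, xlab = "Genomic position", ylab = "-log10(p)")
}
```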