GLSconvert

Derek Pappas, Ph.D. (dpappas@chori.org)

2020-02-10

Overview

GLSconvert is a suite of tools for converting HLA or KIR genotyping data between Genotype List (GL) text strings and multi-column tabular format, as described in Milius RP, Mack SJ, Hollenbach JA, et al. 2013. Genotype List String: a grammar for describing HLA and KIR genotyping results in a text string. Tissue Antigens. 82:106-112.

Anatomy of a GL String

The figure below depicts a Genotype List (GL) String representation of a multilocus unphased genotype. A GL String representing HLA-A genotype (A*02:69 and A*23:30, or A*02:302 and either A*23:26 or A*23:39) and HLA-B genotype (B*44:02:13 and B*49:08) for a single individual is shown. GL String delimiters are parsed hierarchically starting from the locus delimiter (^), proceeding to the genotype delimiter (|), then the chromosome delimiter (+), and ending with the allele delimiter (/). A GL String should include the genetic system name (HLA or KIR) as part of the locus name.
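The hierarchical parsing order can be illustrated with a minimal base R sketch. The GL String here is reconstructed from the description above, and the helper `split_on` is a hypothetical name introduced for illustration; this is not GLSconvert's internal code.

```r
# GL String reconstructed from the description above (assumed, not copied from the figure)
gl <- paste0(
  "HLA-A*02:69+HLA-A*23:30|HLA-A*02:302+HLA-A*23:26/HLA-A*23:39",
  "^HLA-B*44:02:13+HLA-B*49:08"
)

# Hypothetical helper: split on one literal delimiter (fixed = TRUE avoids regex meaning)
split_on <- function(x, delim) strsplit(x, delim, fixed = TRUE)[[1]]

# Locus (^) -> genotype (|) -> chromosome (+) -> allele (/)
parsed <- lapply(split_on(gl, "^"), function(locus) {
  lapply(split_on(locus, "|"), function(genotype) {
    lapply(split_on(genotype, "+"), function(chrom) split_on(chrom, "/"))
  })
})

str(parsed)
```

Each nesting level of `parsed` corresponds to one delimiter, so `parsed[[1]][[2]][[2]]` holds the two ambiguous HLA-A alleles of the second possible genotype.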

Input Data

Data may be passed to GLSconvert() as a tab-delimited text file name (full path recommended) or as an R object (data frame). In either case, the first row must be a header line that includes column names for the GL string or locus genotypes. Any column preceding the genotyping information is treated as identifying or miscellaneous information; this would generally include at least the sample ID. While there is no limit to the number of columns, the direction of the conversion may dictate a specific column order. Empty rows are excluded from the final output.
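As a minimal sketch of the expected file layout, the base R code below writes a small tab-delimited file with a header row and reads it back as a data frame. The temporary file and the object name `foo` are illustrative assumptions; either the file name or the data frame could then be supplied as Data.

```r
# Write a small tab-delimited file with a header line (temporary path, for illustration only)
path <- tempfile(fileext = ".txt")
writeLines(c(
  "SubjectID\tExp ID\tGLString",
  "Subject1\tCenter1\tHLA-A*01:01+HLA-A*02:01^HLA-B*08:01+HLA-B*44:02",
  ""  # a blank row, skipped when the file is read
), path)

# Read it back; check.names = FALSE keeps the "Exp ID" header unchanged
foo <- read.delim(path, header = TRUE, sep = "\t",
                  check.names = FALSE, stringsAsFactors = FALSE)

# Either form could be passed on, e.g.:
# GLSconvert(Data = path, Convert = "GL2Tab")
# GLSconvert(Data = foo,  Convert = "GL2Tab")
print(foo)
```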

GL String to Table
For GL string conversion, the last column of the data table must contain the GL string.

SubjectID Exp ID GLString
Subject1 Center1 HLA-A*01:01+HLA-A*02:01^HLA-B*08:01+HLA-B*44:02^HLA-DRB1*01:01+HLA-DRB1*03:01
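To make the GL2Tab direction concrete, here is a base R sketch that turns the example row above into paired locus columns. It assumes unambiguous genotypes and uses a hypothetical helper, `gl_to_row`; it illustrates the idea only and is not how GLSconvert is implemented.

```r
gl.input <- data.frame(
  SubjectID = "Subject1",
  ExpID     = "Center1",
  GLString  = "HLA-A*01:01+HLA-A*02:01^HLA-B*08:01+HLA-B*44:02^HLA-DRB1*01:01+HLA-DRB1*03:01",
  stringsAsFactors = FALSE
)

# Hypothetical helper: one GL string -> named vector of allele calls
gl_to_row <- function(gl) {
  loci <- strsplit(gl, "^", fixed = TRUE)[[1]]
  calls <- lapply(loci, function(locus) {
    alleles <- strsplit(locus, "+", fixed = TRUE)[[1]]
    # A single allele at a locus is reported in both fields
    if (length(alleles) == 1) alleles <- rep(alleles, 2)
    locus.name <- sub("\\*.*$", "", alleles[1])  # e.g. "HLA-A"
    setNames(alleles, paste0(locus.name, c("_1", "_2")))
  })
  unlist(calls)
}

allele.mat <- do.call(rbind, lapply(gl.input$GLString, gl_to_row))
gl.tab <- data.frame(gl.input[, c("SubjectID", "ExpID")], allele.mat,
                     check.names = FALSE, stringsAsFactors = FALSE)
print(gl.tab)
```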

Table to GL String
Table conversion requires at least three columns: one or more columns of identifying information followed by a pair of columns for each locus. The two columns for a locus must be adjacent. Column names for a given locus may optionally use the suffixes ‘_1’ and ‘_2’, or ‘.1’ and ‘.2’, to distinguish the members of each locus pair. Only the column names defining genotypes for a locus may repeat; all other column names must be unique. You may format your alleles as Locus*Allele or Allele, following defined HLA and KIR naming conventions.

SubjectID Exp ID A A B B DRB1 DRB1 DRB3 DRB3
Subject1 Center1 01:01 02:01 08:01 44:02 01:01 04:01
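The Tab2GL direction can be sketched in base R as follows. The example assumes alleles given without a locus prefix, column pairs suffixed ‘_1’ and ‘_2’, and no missing calls (the DRB3 columns are left out for brevity). The helper `row_to_gl` is hypothetical and is not part of GLSconvert.

```r
tab <- data.frame(
  SubjectID = "Subject1", ExpID = "Center1",
  A_1 = "01:01", A_2 = "02:01",
  B_1 = "08:01", B_2 = "44:02",
  DRB1_1 = "01:01", DRB1_2 = "04:01",
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row of allele calls -> GL string
row_to_gl <- function(row, system = "HLA") {
  locus.of <- sub("[_.][12]$", "", names(row))  # drop the _1/_2 or .1/.2 suffix
  per.locus <- vapply(unique(locus.of), function(locus) {
    alleles <- row[locus.of == locus]
    # Chromosomes within a locus are joined with "+"
    paste0(system, "-", locus, "*", alleles, collapse = "+")
  }, character(1))
  # Loci are joined with "^"
  paste(per.locus, collapse = "^")
}

geno <- tab[, -(1:2)]                  # genotype columns only
gl.string <- apply(geno, 1, row_to_gl)
print(unname(gl.string))
```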

Ambiguity
Only ambiguity at the allele level (alleles separated by “/”) is compatible with the GL conversion tool. Ambiguity at the genotype level (genotypes separated by “|”) cannot be used with the GL conversion tool and will terminate the script. The same restriction applies to tables submitted for conversion to GL strings: rows with identical non-genotype identifying information (sample ID, experiment ID, etc.) are treated as ambiguous genotypes, and the conversion tool will stop.
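A quick pre-check for both situations can be sketched in base R before running a conversion. The example strings and the table below are made up for illustration, and the object names are assumptions.

```r
gl <- c(
  "HLA-A*01:01/HLA-A*01:02+HLA-A*02:01",             # allele-level ambiguity only
  "HLA-A*02:69+HLA-A*23:30|HLA-A*02:302+HLA-A*23:26" # genotype-level ambiguity
)

# fixed = TRUE treats "|" and "/" as literal characters rather than regex syntax
allele.ambiguity   <- grepl("/", gl, fixed = TRUE)
genotype.ambiguity <- grepl("|", gl, fixed = TRUE)

# Rows sharing all identifying columns would be read as ambiguous genotypes
ids <- data.frame(
  SubjectID = c("Subject1", "Subject1", "Subject2"),
  ExpID     = c("Center1", "Center1", "Center1"),
  stringsAsFactors = FALSE
)
repeated.ids <- duplicated(ids) | duplicated(ids, fromLast = TRUE)

print(genotype.ambiguity)
print(repeated.ids)
```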

Homozygosity
Homozygous allele calls can be represented as single alleles in the GL string. For example, HLA-A*01:01:01:01+HLA-A*01:01:01:01 can be written as HLA-A*01:01:01:01. This reduction applies only to Tab2GL conversions. In the other direction, when a locus is represented by a single allele in a GL string, that allele will be reported in both fields for that locus in the converted table.

HLA-DRB3, HLA-DRB4, and HLA-DRB5
HLA-DRB3, HLA-DRB4, and HLA-DRB5 are parsed for homozygous or hemizygous status based on the DRB1 haplotype as defined by Andersson, 1998 (Andersson G. 1998. Evolution of the HLA-DR region. Front Biosci. 3:d739-45) and can be flagged for inconsistency. Inconsistent haplotypes will be indicated in a separate column called “DR.HapFlag” with the locus or loci that are inconsistent with the respective DRB1 status. Flagging is controlled by the ‘DRB345.Check’ parameter (see below).
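The idea behind the consistency check can be sketched in base R. The lookup table below encodes the commonly cited DR51/DR52/DR53 associations between DRB1 allele families and the DRB3/4/5 loci; treating it as the rule set GLSconvert applies is an assumption, and the helper `expected_drbx` is hypothetical.

```r
# Assumed association of DRB1 first-field families with a secondary DRB locus
drb1.partner <- c(
  "01" = NA, "08" = NA, "10" = NA,
  "15" = "DRB5", "16" = "DRB5",
  "03" = "DRB3", "11" = "DRB3", "12" = "DRB3", "13" = "DRB3", "14" = "DRB3",
  "04" = "DRB4", "07" = "DRB4", "09" = "DRB4"
)
drbx.loci <- c("DRB3", "DRB4", "DRB5")

# Hypothetical helper: expected copy number of each DRBx locus given two DRB1 alleles
expected_drbx <- function(drb1) {
  first.field <- sub(":.*$", "", sub("^[^*]*\\*", "", drb1))
  partners <- unname(drb1.partner[first.field])
  table(factor(partners, levels = drbx.loci))  # NA partners are not counted
}

expected <- expected_drbx(c("HLA-DRB1*01:01", "HLA-DRB1*03:01"))

# Made-up observation: two DRB3 alleles typed where only one copy is expected
observed <- table(factor(c("DRB3", "DRB3"), levels = drbx.loci))
flagged  <- drbx.loci[as.vector(observed) != as.vector(expected)]
print(flagged)
```

Under these assumptions a DRB1*01:01, DRB1*03:01 individual is expected to carry a single (hemizygous) DRB3, so two distinct DRB3 calls would be reported as inconsistent.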

Missing Information
Information for HLA-DRB3, HLA-DRB4, and HLA-DRB5 (HLA-DRBx) may be missing either because a genotyping call is lacking or because the locus itself is absent. GLSconvert supports a convention to identify data that are missing due to genomic structural variation (i.e., locus absence). The accepted indicator of locus absence is the 2-Field designation HLA-DRBx*00:00 (x = 3, 4, 5). For example, HLA-DRB5*00:00+HLA-DRB5*00:00 would indicate absence of the HLA-DRB5 locus and not a failed or missing genotype call. You may choose to have GLSconvert fill in absent calls for these loci. For Tab2GL conversion, an NA can be used to indicate a missing genotyping call; however, an NA is not compatible with GL2Tab conversion and should be avoided.
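The distinction between an absent locus and a missing call can be sketched in base R as below; the vector of calls is made up for illustration.

```r
calls <- c("HLA-DRB5*00:00", "HLA-DRB5*01:01", NA, "HLA-DRB3*00:00")

# Locus absence: the 2-Field designation HLA-DRBx*00:00 (x = 3, 4, 5)
locus.absent <- !is.na(calls) & grepl("^HLA-DRB[345]\\*00:00$", calls)

# Missing genotyping call: NA (usable for Tab2GL input only)
call.missing <- is.na(calls)

print(data.frame(calls, locus.absent, call.missing))
```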

Data Output

Data can be output to either a text file (tab or comma delimited) or an R object (data frame). See the File.Output parameter below. When running the GL2Tab conversion, the two adjacent columns for each locus will carry the suffixes ‘_1’ and ‘_2’ to distinguish each chromosome. Please note that programs subsequently used to analyze the data table, such as BIGDAWG or PyPop, may not accept files with ambiguous genotyping data.

Parameters

GLSconvert(Data,Convert,File.Output="txt",System="HLA",HZY.Red=FALSE,DRB345.Check=FALSE,Strip.Prefix=TRUE,Abs.Fill=FALSE,Cores.Lim=1L)

Data

Class: String/Object. (No Default).

e.g., Data="/your/path/to/file/foo.txt" -or- Data="foo.txt" -or- Data=foo

Specifies the data file name or data object. A file name may be given as a full path to specify the file location (recommended) or as the name of a file within the current working directory. See the Input Data section for details about file formatting. This parameter is required for the conversion utility.

Convert

Class: String. Options: “GL2Tab” -or- “Tab2GL” (No Default).

Specifies the direction of the conversion: from GL strings to tabular format (“GL2Tab”) or from tabular format to GL strings (“Tab2GL”). See the Input Data section for the file formatting each direction expects. This parameter is required for the conversion utility.

File.Output

Class: String. Options: “R” -or- “txt” -or- “csv” -or- “pypop” (Default = “txt”).

Specifies the type of output for the converted genotypes. When writing to a file, if you specified the full path for the input file name, the resulting file is written to the same directory; otherwise it is written to the working directory in effect when the conversion was started. The converted file name will be of the form “Converted_foo.txt”, depending on the output setting. If the data were supplied as an R object and file output is requested, the file name will be “Converted.txt”. Output as an R object requires assigning the result to an object (see examples below).

System

Class: String. Options: “HLA” or “KIR” (Default=“HLA”).

Defines the genetic system of the data being converted. This parameter is required for Tab2GL conversion and is ignored for GL2Tab. The default system is HLA.

HZY.Red

Class: Logical (Default=FALSE).

Homozygous reduction: should non-DRBx homozygotes be represented by a single allele name in the GL string? For example, HLA-A*01:01:01:01+HLA-A*01:01:01:01 would be written as HLA-A*01:01:01:01. The default behavior is to keep both allele names in the GL string. This parameter is used only when Convert=“Tab2GL”, and applies only to non-DRBx genotype data.
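The reduction rule described here can be sketched in base R for a single locus. The helper `reduce_homozygote` is a hypothetical name used only for illustration and is not part of GLSconvert.

```r
# Hypothetical helper: collapse a homozygous, non-DRBx locus genotype to one allele
reduce_homozygote <- function(locus.gl) {
  alleles <- strsplit(locus.gl, "+", fixed = TRUE)[[1]]
  is.drbx <- grepl("^HLA-DRB[345]\\*", alleles[1])
  if (!is.drbx && length(alleles) == 2 && alleles[1] == alleles[2]) {
    alleles[1]
  } else {
    locus.gl  # heterozygotes and DRB3/4/5 genotypes are returned unchanged
  }
}

print(reduce_homozygote("HLA-A*01:01:01:01+HLA-A*01:01:01:01"))
print(reduce_homozygote("HLA-A*01:01+HLA-A*02:01"))
```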

DRB345.Check

Class: Logical (Default=FALSE).

Indicates whether DR haplotypes should be parsed for correct zygosity and unusual DR haplotypes flagged. When enabled, inconsistent loci are flagged in a separate column labeled ‘DR.HapFlag’ that follows the genotype columns. With the default shown above (FALSE), no flagging is performed. HLA-DRBx alleles without a respective HLA-DRB1 remain unchanged, and the flag reads ‘ND’ (not determined).

Strip.Prefix

Class: Logical (Default=TRUE).

Applies only to Convert=“GL2Tab” conversions. Indicates whether alleles should be stripped of System/Locus prefixes in the final data when converting from GL strings to tabular format. For example, whether HLA-A*01:01:01:01 should be recorded as 01:01:01:01 in the final table. Required when outputting to a PyPop-compatible file. The default is to strip prefixes.
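The effect of prefix stripping can be illustrated with a one-line base R sketch; the regular expression is an illustrative choice, not the one GLSconvert necessarily uses.

```r
alleles <- c("HLA-A*01:01:01:01", "HLA-DRB1*03:01")

# Remove everything up to and including the first "*"
stripped <- sub("^[^*]*\\*", "", alleles)
print(stripped)
```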

Abs.Fill

Class: Logical (Default=FALSE).

Relevant only to data containing one or more of the loci: HLA-DRB3, HLA-DRB4, or HLA-DRB5. Directs GLSconvert to fill in missing information with the 2-Field designation HLA-DRBx*00:00.

For example, when data contain HLA-DRB5 typing, subjects with no HLA-DRB5 will be given the designation HLA-DRB5*00:00 or HLA-DRB5*00:00+HLA-DRB5*00:00, depending on the situation. If absent locus designations are already present in your data, GLSconvert will remove them in the final output.

Examples

These are examples only and need not be run as defined below.

# Run the GL2Tab conversion on a data file with default output to text file and no prefix stripping
GLSconvert(Data="/your/path/to/file/foo.txt", Convert="GL2Tab", Strip.Prefix=FALSE)

# Run the Tab2GL conversion on an R object, outputting to an R object with DRB345 flagging
foo.tab <- GLSconvert(Data=foo, Convert="Tab2GL", File.Output="R", DRB345.Check=TRUE)

# Run the Tab2GL conversion on a text file, outputting to a csv file with homozygous allele reduction of non-DRB345 alleles
GLSconvert(Data="/your/path/to/file/foo.txt", Convert="Tab2GL", File.Output="csv", HZY.Red=TRUE)

# Run the GL2Tab conversion on a data file without the full path name
setwd("/your/path/to/file")
GLSconvert(Data="foo.txt", Convert="GL2Tab")

End of vignette.