writeStandardOutput writes genotypes and phenotypes, and optionally covariates and kinship matrices, in the file formats of standard GWAS software: plink, snptest, bimbam, gemma, limmbo. For more information on the different file formats see External formats.

writeStandardOutput(directory, phenotypes = NULL, genotypes = NULL,
  additionalPhenotypes = NULL, covariates = NULL, kinship = NULL,
  eval_kinship = NULL, evec_kinship = NULL, id_samples, id_snps,
  id_phenos, outstring = NULL, standardInput_samples = NULL,
  standardInput_genotypes = NULL, format = NULL,
  intercept_gemma = FALSE, nameAdditional = "_nonLinear",
  verbose = TRUE)

Arguments

directory

Absolute path (no tilde expansion) to parent directory [string] where the data should be saved [needs user writing permission]

phenotypes

[NrSamples x NrTrait] Data.frame/matrix of phenotypes [doubles].

genotypes

[NrSamples x NrSNP] Data.frame/matrix of genotypes [integers]/[doubles].

additionalPhenotypes

[NrSamples x NrTrait] Data.frame/matrix of additional phenotypes (for instance, non-linearly transformed original phenotypes).

covariates

[NrSamples x NrCovariates] Data.frame/matrix of covariates [integers]/[doubles].

kinship

[NrSamples x NrSamples] Data.frame/matrix of kinship estimates [doubles].

eval_kinship

[NrSamples] vector with eigenvalues of kinship matrix [doubles].

evec_kinship

[NrSamples x NrSamples] Data.frame/matrix with eigenvectors of kinship matrix [doubles].

id_samples

Vector of [NrSamples] sample IDs [string] of simulated phenotypes, genotypes and covariates.

id_snps

Vector of [NrSNPs] SNP IDs [string] of (simulated) genotypes.

id_phenos

Vector of [NrTraits] phenotype IDs [string] of simulated phenotypes.

outstring

(optional) Name [string] of subdirectory (in relation to directory) to save set-up independent simulation results.

standardInput_samples

(optional) Data.frame of sample information obtained when genotypes were read from plink, oxgen or genome file.

standardInput_genotypes

(optional) Data.frame of genotypes obtained when reading genotypes from plink, oxgen, or genome file.

format

Vector of name(s) [string] of file formats, options are: "plink", "snptest", "gemma", "bimbam", "delim". For details on the file formats see External formats.

intercept_gemma

[boolean] When modeling an intercept term in gemma, a column of 1's has to be appended to the covariate file. Set intercept_gemma to TRUE to include a column of 1's in the output.

nameAdditional

Name [string] of additional phenotypes to be appended to filename.

verbose

[boolean]; If TRUE, progress info is printed to standard out
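
The dimension conventions listed above can be checked before calling writeStandardOutput. The following is a minimal sketch in base R and not part of PhenotypeSimulator: the helper name checkInputDims and the toy data are hypothetical. The last step shows a covariate matrix with an explicit column of 1's, which is the kind of intercept column intercept_gemma = TRUE refers to; the position and name of that column here are assumptions.

```r
# Hypothetical helper (not part of PhenotypeSimulator): verify that the ID
# vectors match the dimensions of the matrices described under Arguments.
checkInputDims <- function(phenotypes, genotypes, id_samples, id_snps,
                           id_phenos) {
  stopifnot(
    nrow(phenotypes) == length(id_samples),  # NrSamples
    nrow(genotypes) == length(id_samples),   # NrSamples
    ncol(genotypes) == length(id_snps),      # NrSNP
    ncol(phenotypes) == length(id_phenos)    # NrTrait
  )
  invisible(TRUE)
}

# Toy data: 4 samples, 3 SNPs, 2 traits, 1 covariate
set.seed(1)
id_samples <- paste0("ID_", 1:4)
id_snps <- paste0("SNP_", 1:3)
id_phenos <- paste0("Trait_", 1:2)
genotypes <- matrix(sample(0:2, 12, replace = TRUE), nrow = 4,
                    dimnames = list(id_samples, id_snps))
phenotypes <- matrix(rnorm(8), nrow = 4,
                     dimnames = list(id_samples, id_phenos))
covariates <- matrix(rnorm(4), nrow = 4,
                     dimnames = list(id_samples, "Cov_1"))

# Stops with an error if any dimension disagrees with its ID vector
dims_ok <- checkInputDims(phenotypes, genotypes, id_samples, id_snps,
                          id_phenos)

# Covariates with an explicit intercept column of 1's (assumed layout)
covariates_intercept <- cbind(covariates, intercept = 1)
```

A check of this kind fails early with a clear message instead of producing output files whose rows and IDs are misaligned.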

External formats

  • plink format: consists of three files, .bed, .bim and .fam. From https://www.cog-genomics.org/plink/1.9/formats: The .bed files contain the primary representation of genotype calls at biallelic variants in a binary format. The .bim is a text file with no header line, and one line per variant with the following six fields: i) Chromosome code (either an integer, or 'X'/'Y'/'XY'/'MT'; '0' indicates unknown) or name, ii) Variant identifier, iii) Position in morgans or centimorgans (safe to use dummy value of '0'), iv) Base-pair coordinate (normally 1-based, but 0 ok; limited to 2^31-2), v) Allele 1 (corresponding to clear bits in .bed; usually minor), vi) Allele 2 (corresponding to set bits in .bed; usually major). The .fam file is a text file with no header line, and one line per sample with the following six fields: i) Family ID ('FID'), ii) Within-family ID ('IID'; cannot be '0'), iii) Within-family ID of father ('0' if father isn't in dataset), iv) Within-family ID of mother ('0' if mother isn't in dataset), v) Sex code ('1' = male, '2' = female, '0' = unknown), vi) Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control).

  • snptest format: consists of two files, the genotype file ending in .gen (genotypes_snptest.gen) and the sample file ending in .sample (Ysim_snptest.sample). From https://www.well.ox.ac.uk/~gav/snptest/#input_file_formats: The genotype file stores data in a one-line-per-SNP format. The first 5 entries of each line should be the SNP ID, RS ID of the SNP, base-pair position of the SNP, the allele coded A and the allele coded B. The SNP ID can be used to denote the chromosome number of each SNP. The next three numbers on the line should be the probabilities of the three genotypes AA, AB and BB at the SNP for the first individual in the cohort. The next three numbers should be the genotype probabilities for the second individual in the cohort. The next three numbers are for the third individual and so on. The order of individuals in the genotype file should match the order of the individuals in the sample file. The sample file has three parts: (a) a header line detailing the names of the columns in the file, (b) a line detailing the types of variables stored in each column, and (c) a line for each individual detailing the information for that individual. a) The header line needs a minimum of three entries. The first three entries should always be ID_1, ID_2 and missing. They denote that the first three columns contain the first ID, second ID and missing data proportion of each individual. Additional entries on this line should be the names of covariates or phenotypes that are included in the file. In the example given in the snptest documentation, there are 4 covariates named cov_1, cov_2, cov_3, cov_4, a continuous phenotype named pheno1 and a binary phenotype named bin1. All phenotypes should appear after the covariates in this file. b) The second line (the variable type line) details the type of variables included in each column. The first three entries of this line should be set to 0. Subsequent entries in this line for covariates and phenotypes should be specified by the following rules: D for Discrete covariates (coded using positive integers), C for Continuous covariates, P for Continuous Phenotype, B for Binary Phenotype (0 = Controls, 1 = Cases). c) Individual information: one line for each individual containing the information specified by the entries of the header line. Entries of the sample file are separated by spaces.

  • bimbam format: consists of a) a simple, tab-separated phenotype file without sample or phenotype header/index (Ysim_bimbam.txt) and b) the mean genotype file format, which is a single file without information on individuals (genotypes.bimbam). From http://www.haplotype.org/bimbam.html: The first column of the mean genotype file is the SNP ID, the second and third columns are allele types with minor allele first. The remaining columns are the mean genotypes of different individuals: numbers between 0 and 2 that represent the (posterior) mean genotype, or dosage of the minor allele.

  • gemma format: consists of a) a simple, tab-separated phenotype file without sample or phenotype header/index (Ysim_gemma.txt) and b) the mean genotype file format, which is a single file without information on individuals (genotypes.gemma); a) and b) are both the same as above for bimbam format. In addition and if applicable, c) a kinship file (kinship_gemma.txt) and d) a covariate file (Covs_gemma.txt). From http://www.xzlab.org/software/GEMMAmanual.pdf: The kinship file contains a NrSample × NrSample matrix, where each row and each column corresponds to individuals in the same order as in the mean genotype file, and the ith row and jth column is a number indicating the relatedness value between the ith and jth individuals. The covariates file has the same format as the phenotype file described above and must contain a column of 1’s if one wants to include an intercept term (set parameter intercept_gemma=TRUE).

  • limmbo format: consists of a) a comma-separated phenotype file without sample IDs as index and phenotype IDs as header (Ysim_limmbo.csv), b) the mean genotype file format with one genetic variant per line, where the first column contains the variant ID and columns 2 to N+1 contain the genotype code (numbers between 0 and 2 that represent the (posterior) mean genotype/dosage of the minor allele) for N samples, c) a kinship file (kinship_limmbo.csv) and d) a covariate file (covs_limmbo.csv). From
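
To make the text layouts described above concrete, the following base R sketch builds a bimbam-style mean genotype table and a snptest-style sample file by hand. It is an illustration under stated assumptions, not the code writeStandardOutput uses: the toy data, the placeholder allele labels A and B, the comma separator of the genotype file, the temporary file names and all object names are hypothetical.

```r
# Toy data: 3 samples, 2 SNPs, 1 covariate, 1 continuous phenotype
id_samples <- c("ID_1", "ID_2", "ID_3")
id_snps <- c("SNP_1", "SNP_2")
genotypes <- matrix(c(0, 1, 2,    # SNP_1 dosages for the 3 samples
                      2, 0, 1),   # SNP_2 dosages for the 3 samples
                    nrow = 3, dimnames = list(id_samples, id_snps))
covariates <- data.frame(cov_1 = c(0.5, -1.2, 0.3))
phenotypes <- data.frame(pheno1 = c(1.1, 0.4, -0.7))

# bimbam mean genotype layout: one row per SNP with the SNP ID, two allele
# columns (placeholder labels) and one dosage column per individual.
# Genotypes are [NrSamples x NrSNP], so they are transposed first.
mean_genotypes <- data.frame(snp = id_snps, allele1 = "A", allele2 = "B",
                             t(genotypes))
bimbam_file <- tempfile(fileext = ".bimbam")
write.table(mean_genotypes, bimbam_file, sep = ",", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# snptest sample layout: header line, variable type line, then one line per
# individual; entries are separated by spaces and covariates come before
# phenotypes.
sample_body <- data.frame(ID_1 = id_samples, ID_2 = id_samples, missing = 0,
                          covariates, phenotypes)
# First three entries are 0; C = continuous covariate, P = continuous phenotype
type_line <- c(0, 0, 0, "C", "P")
sample_file <- tempfile(fileext = ".sample")
writeLines(c(paste(colnames(sample_body), collapse = " "),
             paste(type_line, collapse = " ")),
           sample_file)
write.table(sample_body, sample_file, sep = " ", quote = FALSE,
            row.names = FALSE, col.names = FALSE, append = TRUE)

# Read both files back to inspect the resulting lines
bimbam_lines <- readLines(bimbam_file)
sample_lines <- readLines(sample_file)
```

Printing bimbam_lines and sample_lines shows the per-SNP and per-individual rows side by side with the format descriptions above.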

Examples

simulation <- runSimulation(N=10, P=2, genVar=0.4, h2s=0.2, phi=1)
genotypes <- simulation$rawComponents$genotypes
kinship <- simulation$rawComponents$kinship
phenotypes <- simulation$phenoComponents$Y

if (FALSE) {
# Save in plink format (.bed, .bim, .fam, Y_sim_plink.txt)
writeStandardOutput(directory=tempdir(),
  genotypes=genotypes$genotypes, phenotypes=phenotypes,
  id_samples = genotypes$id_samples, id_snps = genotypes$id_snps,
  id_phenos = colnames(phenotypes), format="plink")

# Save in gemma and snptest format
writeStandardOutput(directory=tempdir(),
  genotypes=genotypes$genotypes, phenotypes=phenotypes,
  id_samples = genotypes$id_samples, id_snps = genotypes$id_snps,
  id_phenos = colnames(phenotypes), kinship=kinship,
  format=c("snptest", "gemma"))
}