Title: | Accessing and Processing a 'Mega2' Genetic Database |
---|---|
Description: | Uses as input genetic data that have been reformatted and stored in a 'SQLite' database; this database is initially created by the standalone 'mega2' C++ program (available freely from <https://watson.hgen.pitt.edu/register/>). Loads and manipulates data frames containing genotype, phenotype, and family information from the input 'SQLite' database, and decompresses needed subsets of the genotype data, on the fly, in a memory efficient manner. We have also created several more functions that illustrate how to use the data frames as well as perform useful tasks: these permit one to run the 'pedgene' package to carry out gene-based association tests on family data using selected marker subsets, to run the 'SKAT' package to carry out gene-based association tests using selected marker subsets, to run the 'famSKATRC' package to carry out gene-based association tests on families (optionally) and with rare or common variants using selected marker subsets, to output the 'Mega2R' data as a VCF file and related files (for phenotype and family data), and to convert the data frames into CoreArray Genomic Data Structure (GDS) format. |
Authors: | Robert V. Baron [aut], Daniel E. Weeks [aut, cre], University of Pittsburgh [cph] |
Maintainer: | Daniel E. Weeks <[email protected]> |
License: | GPL-2 |
Version: | 1.1.0 |
Built: | 2024-11-02 05:16:33 UTC |
Source: | https://github.com/cran/Mega2R |
This function generates base pair ranges from its input arguments.
Each range specifies a chromosome, a start
base pair and end base pair. Typically, a range could be a gene transcript, though
it could be a whole chromosome, or a run of base pairs on a chromosome. Once the
ranges are generated, applyFnToRanges
is called to find all the
rows (i.e. markers) from the markers data frame that fall in each range. For these
markers, a matrix of the genotypes is generated. Finally, the op
function is called for
each range with the arguments: markers, range, and 'environment'.
applyFnToGenes(op = function (markers, range, envir) {}, genes_arg = NULL, ranges_arg = matrix(ncol = 3, nrow = 0), chrs_arg = vector("integer", 0), markers_arg = vector("character", 0), type_arg = "TX", fuzz_arg = 0, envir = ENV)
applyFnToGenes(op = function (markers, range, envir) {}, genes_arg = NULL, ranges_arg = matrix(ncol = 3, nrow = 0), chrs_arg = vector("integer", 0), markers_arg = vector("character", 0), type_arg = "TX", fuzz_arg = 0, envir = ENV)
op |
Is a function of three arguments. It will be called repeatedly by
|
genes_arg |
a character vector of gene names.
All the transcripts identified with the specified gene in BioConductor Annotation, |
ranges_arg |
an integer matrix of three columns. The columns define a range: a chromosome number, a start base pair value, and an end base pair value. |
chrs_arg |
an integer vector of chromosome numbers. All of the base pairs on each chromosomes will be selected as a single range. |
markers_arg |
a data frame with the following 5 observations:
|
type_arg |
a character vector of length 1 that contains "TX" or does not. If it is
"TX", which is the default, the TX fields of BioConductor Annotation, |
fuzz_arg |
is an integer vector of length one or two. The first argument is used to reduce the start base pair selected from each transcript and the second to increase the end base pair position. (If only one value is present, it is used for both adjustments.) Note: The values can be positive or negative. |
envir |
an 'environment' that contains all the data frames created from the SQLite database. |
None
If you want subsequent calls to op
to share information, data can be placed in
a data frame that is added to the 'environment'.
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) show = function(m, r, e) { print(r) print(m) print(head(getgenotypes(m, envir = e))) } # apply function "show" to all transcripts on genes ELL2 and CARD15 # donttestcheck: time applyFnToGenes(show, genes_arg = c("CEP104")) # apply function "show" to all genotypes on chromosomes 11 for two base # pair ranges applyFnToGenes(show, ranges_arg = matrix(c(1, 5000000, 10000000, 1, 10000000, 15000000), ncol = 3, nrow = 2, byrow = TRUE)) # apply function "show" to all genotypes for first marker in each chromosome applyFnToGenes(show, markers_arg = ENV$markers[! duplicated(ENV$markers$chromosome), 3]) # apply function "show" to all genotypes on chromosomes 24 and 26 applyFnToGenes(show, chrs_arg=c(24, 26))
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) show = function(m, r, e) { print(r) print(m) print(head(getgenotypes(m, envir = e))) } # apply function "show" to all transcripts on genes ELL2 and CARD15 # donttestcheck: time applyFnToGenes(show, genes_arg = c("CEP104")) # apply function "show" to all genotypes on chromosomes 11 for two base # pair ranges applyFnToGenes(show, ranges_arg = matrix(c(1, 5000000, 10000000, 1, 10000000, 15000000), ncol = 3, nrow = 2, byrow = TRUE)) # apply function "show" to all genotypes for first marker in each chromosome applyFnToGenes(show, markers_arg = ENV$markers[! duplicated(ENV$markers$chromosome), 3]) # apply function "show" to all genotypes on chromosomes 24 and 26 applyFnToGenes(show, chrs_arg=c(24, 26))
A matrix of the genotypes for all the specified markers is generated. Then, the call back function, op
,
is called with the markers, NULL (for the range), and the 'environment'.
applyFnToMarkers(op = function (markers, range, envir) {}, markers_arg, envir = ENV)
applyFnToMarkers(op = function (markers, range, envir) {}, markers_arg, envir = ENV)
op |
Is a function of three arguments. It will be called once by
|
markers_arg |
a data frame with the following 5 observations:
|
envir |
an 'environment' that contains all the data frames created from the SQLite database. |
None
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) show = function(m, r, e) { print(r) print(m) print(head(getgenotypes(m, envir = e))) } # apply function "show" to all genotypes > 5,000,000 bp applyFnToMarkers(show, ENV$markers[ENV$markers$position > 5000000,])
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) show = function(m, r, e) { print(r) print(m) print(head(getgenotypes(m, envir = e))) } # apply function "show" to all genotypes > 5,000,000 bp applyFnToMarkers(show, ENV$markers[ENV$markers$position > 5000000,])
First, for each range, determine the markers that fall between the start and end
base pair of the range. Then, for each set of
markers generate a matrix of the genotypes of those markers. Finally, the op
function is called for
each range with the arguments: markers, range, and 'environment'.
applyFnToRanges(op = function (markers, range, envir) {}, ranges_arg = NULL, indices_arg = NULL, fuzz_arg = 0, envir = ENV)
applyFnToRanges(op = function (markers, range, envir) {}, ranges_arg = NULL, indices_arg = NULL, fuzz_arg = 0, envir = ENV)
op |
Is a function of three arguments. It will be called repeatedly by
|
ranges_arg |
is a data frame that contains at least 4 observations: a name, a chromosome, a start base pair position and an end base pair position. |
indices_arg |
is a vector of 3 integers that specify the location of chromosome, start base pair column and end base pair column of the ranges_arg data frame. An optional fourth integer indicates the column containing the name of the ranges. |
fuzz_arg |
is an integer vector of length one or two. The first argument is used to reduce the start base pair selected from each range and the second to increase the end base pair position. (If only one value is present, it is used for both changes.) Note: The values can be positive or negative. |
envir |
an 'environment' that contains all the data frames created from the SQLite database. |
None
If the ranges_arg and indices_arg are NULL or missing, then the default ranges that have been set by setRanges
are used. If setRanges
has not been called, a default set of the ranges is used.
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) show = function(m, r, e) { print(r) print(m) print(head(getgenotypesraw(m, envir = e))) } # apply function "show" to all genotypes on chromosomes 1 for two base pair # ranges applyFnToRanges(show, ranges_arg = matrix(c(1, 2244000, 2245000, 1, 3762500, 3765000), ncol = 3, nrow = 2, byrow = TRUE), indices_arg = 1:3) # apply function "show" to all genotypes on chromosomes 1 for two base pair # ranges applyFnToRanges(show, ranges_arg = matrix(c(1, 2240000, 2245000, "range1", 1, 3760000, 3765000, "range2"), ncol = 4, nrow = 2, byrow = TRUE), indices_arg = 1:4)
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) show = function(m, r, e) { print(r) print(m) print(head(getgenotypesraw(m, envir = e))) } # apply function "show" to all genotypes on chromosomes 1 for two base pair # ranges applyFnToRanges(show, ranges_arg = matrix(c(1, 2244000, 2245000, 1, 3762500, 3765000), ncol = 3, nrow = 2, byrow = TRUE), indices_arg = 1:3) # apply function "show" to all genotypes on chromosomes 1 for two base pair # ranges applyFnToRanges(show, ranges_arg = matrix(c(1, 2240000, 2245000, "range1", 1, 3760000, 3765000, "range2"), ncol = 4, nrow = 2, byrow = TRUE), indices_arg = 1:4)
This function removes the Mega2R tutorial (inst/exdata) data that was copied to the specified directory.
clean_mega2rtutorial_data(dir = file.path(tempdir(), "Mega2Rtutorial"))
clean_mega2rtutorial_data(dir = file.path(tempdir(), "Mega2Rtutorial"))
dir |
The directory to remove the tutorial data to. By default, this is tempdir()/Mega2Rtutorial |
None
clean_mega2rtutorial_data()
clean_mega2rtutorial_data()
Convert the genotypesraw() allele patterns of 0x10001, 0x10002 (or 0x20001), 0x20002, 0 to the numbers 0, 1, 2, 9 for each marker. (Reverse, the order iff allele "1" has the minor allele frequency.)
computeDosage(markers_arg, range_arg, envir)
computeDosage(markers_arg, range_arg, envir)
markers_arg |
a data.frame with the following 5 observations:
|
range_arg |
one row of a ranges_arg. The latter is a data frame of at least three integer columns. The columns indicate a range: a chromosome number, a start base pair value, and an end base pair value. |
envir |
'environment' containing SQLite database and other globals especially the
phenotype_table, |
a matrix of samples X markers for all the markers that have nonzero changes.
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = init_famSKATRC(db, verbose = TRUE) dimDosage = function(m, r, e) {print(dim(computeDosage(m, r, e)))} applyFnToRanges(dimDosage, ENV$refRanges[50:60, ], ENV$refIndices, envir=ENV) # This will use return dosage matrices for the markers in the ranges 50 - 60, # but is basically ignores the results.
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = init_famSKATRC(db, verbose = TRUE) dimDosage = function(m, r, e) {print(dim(computeDosage(m, r, e)))} applyFnToRanges(dimDosage, ENV$refRanges[50:60, ], ENV$refIndices, envir=ENV) # This will use return dosage matrices for the markers in the ranges 50 - 60, # but is basically ignores the results.
Read the fields of SQLite data base tables that are required for Mega2R into data frames. These data frames are stored in an 'environment' which is returned. This function also adds some state data, extra data frames, and computed data frames to the 'environment'.
dbmega2_import(dbname, bpPosMap = NULL, verbose = FALSE)
dbmega2_import(dbname, bpPosMap = NULL, verbose = FALSE)
dbname |
file path to SQLite database. |
bpPosMap |
index that specifies which map in the map_table should be used for
marker chromosome/position. If it is NULL, the internal variable
base_pair_position_index is used instead.
|
verbose |
print out statistics on the name/size of each table read and show column headers. Also, save the verbose value for use by other Mega2R functions. |
envir an environment that contains all the data frames made from the SQLite database.
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = dbmega2_import(db, verbose = TRUE) ENV = dbmega2_import(db)
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = dbmega2_import(db, verbose = TRUE) ENV = dbmega2_import(db)
Convert the genotypesraw() allele patterns of 0x10001, 0x10002 (or 0x20001), 0x20002, 0
to the numbers 0, 1, 2, 9 for each marker. (Reverse, the order iff allele "1" has the
minor allele frequency.) Ignore markers that have no variants.
Finally, invoke famSKAT_RC
with the converted genotype matrix.
Save information about the range and the p.value calculated by famSKAT_RC
in envir$famSKATRC_results.
If you want to change the argument values to this function they should be changed instead
when calling the Mega2famSKATRC
function.
DOfamSKATRC( markers_arg, range_arg, envir, pheno = 3, id = NULL, covariates = NULL, sqrtweights_c = NULL, sqrtweights_r = NULL, binomialimpute = TRUE, acc = 1e-06, maf = 0.05, phi = c(0, 0.2, 0.5, 0.9) )
DOfamSKATRC( markers_arg, range_arg, envir, pheno = 3, id = NULL, covariates = NULL, sqrtweights_c = NULL, sqrtweights_r = NULL, binomialimpute = TRUE, acc = 1e-06, maf = 0.05, phi = c(0, 0.2, 0.5, 0.9) )
markers_arg |
a data.frame with the following 5 observations:
|
range_arg |
one row of a ranges_arg. The latter is a data frame of at least three integer columns. The columns indicate a range: a chromosome number, a start base pair value, and an end base pair value. |
envir |
'environment' containing SQLite database and other globals especially the
phenotype_table, |
pheno |
is an index into the phenotypes_table to select the phenotype. Missing phenotypes are represented by NA. |
id |
a vector of individuals to be included in the test, a subset of the family members. If NULL is given, all members will be used. |
covariates |
a matrix of covariates for the phenotype. |
sqrtweights_c |
weight function for common variants, if NULL use weight set in init_famSKAT |
sqrtweights_r |
weight function for rare variants, if NULL use weight set in init_famSKAT. |
binomialimpute |
if TRUE, impute missing genotypes using a binomial distribution. |
acc |
accuracy used in Davies approximation. |
maf |
threshold used to separate rare from common variants. |
phi |
a vector of ratios ratios; each indicates the contribution of rare variants. |
None
This function accumulates output in the data frame, envir$famSKATRC_results. It will print out the lines as they are generated if envir$verbose is TRUE. It does not write the data frame to a file. You must save the data frame. You also must initialize the data frame when necessary.
init_famSKATRC
, Mega2famSKATRC
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = init_famSKATRC(db, verbose = TRUE) ENV$famSKATRC_results = ENV$famSKATRC_results[0, ] Mega2famSKATRC(gs=1:1, envir=ENV, pheno=3) # this sets one of the many arguments for DOfamSKATRC # but basically prepares the ENV for the direct use of DOfamSKATRC (below). # donttestcheck: try this below instead if there is time Mega2famSKATRC(genes=c("CEP104"), envir=ENV, pheno=3 ) # DOfamSKATRC is called within Mega2famSKATRC. init_famSKATRC and Mega2famSKATRC need to be # called to set up the environment for famSKAT_RC to run. BUT, you should ignore DOfamSKATRC # and use Mega2famSKATRC instead. # applyFnToRanges(DOfamSKATRC, ENV$refRanges[50:60, ], ENV$refIndices, envir=ENV) # this will use all the default argument values for DOfamSKATRC
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = init_famSKATRC(db, verbose = TRUE) ENV$famSKATRC_results = ENV$famSKATRC_results[0, ] Mega2famSKATRC(gs=1:1, envir=ENV, pheno=3) # this sets one of the many arguments for DOfamSKATRC # but basically prepares the ENV for the direct use of DOfamSKATRC (below). # donttestcheck: try this below instead if there is time Mega2famSKATRC(genes=c("CEP104"), envir=ENV, pheno=3 ) # DOfamSKATRC is called within Mega2famSKATRC. init_famSKATRC and Mega2famSKATRC need to be # called to set up the environment for famSKAT_RC to run. BUT, you should ignore DOfamSKATRC # and use Mega2famSKATRC instead. # applyFnToRanges(DOfamSKATRC, ENV$refRanges[50:60, ], ENV$refIndices, envir=ENV) # this will use all the default argument values for DOfamSKATRC
First, ignore call backs that have less than two markers. Second, convert the genotypesraw()
patterns of 0x10001, 0x10002 (or 0x20001), 0x20002, 0 from the genotype matrix
to the numbers 0, 1, 2, 0 for each marker. (Reverse, the order iff allele "1" has the
minor allele frequency.) Next, prepend the pedigree and person columns of the family data
to this modified genotype matrix. Finally, invoke pedgene
with the family data and
genotype matrix for several different weights. Save the kernel and burden, value and p-value for each
measurement in envir$pedgene_results.
DOpedgene(markers_arg, range_arg, envir = ENV)
DOpedgene(markers_arg, range_arg, envir = ENV)
markers_arg |
a data.frame with the following 5 observations:
|
range_arg |
one row of a ranges_arg. The latter is a data frame of at least three integer columns. The columns indicate a range: a chromosome number, a start base pair value, and an end base pair value. |
envir |
'environment' containing SQLite database and other globals |
None
This function appends output to the data frame, envir$pedgene_results. It will print out the lines as they are generated if envir$verbose is TRUE. The data frame envir$pedgene_results is initialized by init_pedgene, and is appended to each time DOpedgene is run.
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = init_pedgene(db) ENV$verbose = TRUE applyFnToRanges(DOpedgene, ENV$refRanges[50:60,], ENV$refIndices) # donttestcheck: try this below if there is time applyFnToGenes(DOpedgene, genes_arg = c("CEP104"))
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = init_pedgene(db) ENV$verbose = TRUE applyFnToRanges(DOpedgene, ENV$refRanges[50:60,], ENV$refIndices) # donttestcheck: try this below if there is time applyFnToGenes(DOpedgene, genes_arg = c("CEP104"))
Convert the genotypesraw() allele patterns of 0x10001, 0x10002 (or 0x20001), 0x20002, 0
to the numbers 0, 1, 2, 9 for each marker. (Reverse, the order iff allele "1" has the
minor allele frequency.) Ignore markers that have no variants (unless allMarkers is TRUE).
Finally, invoke SKAT
with the converted genotype matrix, Null model saved in envir$obj,
and any additionally supplied arguments.
Save information about the range and the p.value calculated by SKAT
in envir$SKAT_results.
DOSKAT(markers_arg, range_arg, envir, ...)
DOSKAT(markers_arg, range_arg, envir, ...)
markers_arg |
a data.frame with the following 5 observations:
|
range_arg |
one row of a ranges_arg. The latter is a data frame of at least three integer columns. The columns indicate a range: a chromosome number, a start base pair value, and an end base pair value. |
envir |
'environment' containing SQLite database and other globals |
... |
extra arguments for SKAT |
None
This function accumulates output in the data frame, envir$SKAT_results. It will print out the lines as they are generated if envir$verbose is TRUE. It does not write the data frame to a file. You must save the data frame. You also must initialize the data frame when necessary.
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = init_SKAT(db, verbose = TRUE, allMarkers = FALSE) Mega2SKAT(ENV$phe[, 3] - 1 ~ 1, "D", gs=1:1) # donttestcheck: try this below instead if there is time Mega2SKAT(ENV$phe[, 3] - 1 ~ 1, "D", kernel = "linear.weighted", weights.beta = c(0.5, 0.5), genes=c("CEP104") ) # DOSKAT is called internally to Mega2SKAT. init_SKAT and Mega2SKAT need to be # called to set up the environment for DOSKAT to run. You should ignore DOSKAT # and use Mega2SKAT instead # applyFnToRanges(DOSKAT, ENV$refRanges[50:60, ], ENV$refIndices)
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = init_SKAT(db, verbose = TRUE, allMarkers = FALSE) Mega2SKAT(ENV$phe[, 3] - 1 ~ 1, "D", gs=1:1) # donttestcheck: try this below instead if there is time Mega2SKAT(ENV$phe[, 3] - 1 ~ 1, "D", kernel = "linear.weighted", weights.beta = c(0.5, 0.5), genes=c("CEP104") ) # DOSKAT is called internally to Mega2SKAT. init_SKAT and Mega2SKAT need to be # called to set up the environment for DOSKAT to run. You should ignore DOSKAT # and use Mega2SKAT instead # applyFnToRanges(DOSKAT, ENV$refRanges[50:60, ], ENV$refIndices)
This function retrieves data stored in the Mega2rtutorial (inst/exdata). It dumps them in the specified directory.
dump_mega2rtutorial_data(dir = file.path(tempdir(), "Mega2Rtutorial"))
dump_mega2rtutorial_data(dir = file.path(tempdir(), "Mega2Rtutorial"))
dir |
The directory to store the tutorial data to. By default, this is tempdir()/Mega2Rtutorial |
None
dump_mega2rtutorial_data()
dump_mega2rtutorial_data()
This function calls a C++ function that does all the heavy lifting. It passes the arguments necessary for the C++ function: some from the caller's arguments and some from data frames that are in the "global" environment, envir. From the markers_arg argument, it fetches the locus_index and the index in the unified_genotype_table. It also passes the allele nucleotide separator argument. From the "global" environment, envir, it gets a bit vector of compressed genotype information, the alleles for each marker, and some bookkeeping related data. Note: This function also contains a dispatch/switch on the type of compression in the genotype vector. A different C++ function is called when there is compression versus when there is no compression.
getgenotypes(markers_arg, sepstr = "", envir = ENV)
getgenotypes(markers_arg, sepstr = "", envir = ENV)
markers_arg |
a data.frame with the following 5 observations:
|
sepstr |
separator string inserted between the alleles (default is none). When present, this is typically a space, a tab or "/". |
envir |
an environment that contains all the data frames created from the SQLite database. |
The unified_genotype_table contains one raw vector for each person. In the vector there are two bits for each genotype. This function creates an output matrix by fixing the marker and collecting genotype information for each person and then repeating for all the needed markers. (Currently, this appears slightly faster than a scan which is fixes the person and iterates over markers.)
a matrix of genotypes represented as two allele pairs. The matrix has one column for each marker in markers_arg argument. There is one row for each person in the family (fam) table.
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) getgenotypes(ENV$markers)
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) getgenotypes(ENV$markers)
This function calls a C++ function that does all the heavy lifting. It passes the arguments necessary for the C++ function: some from the caller's arguments and some from data frames that are in the "global" environment, envir. From its markers_arg argument, it gets the locus_index and the index in the unified_genotype_table. From the "global" environment, envir, it gets a bit vector of compressed genotype information, and some bookkeeping related data. Note: This function also contains a dispatch/switch on the type of compression in the genotype vector. A different C++ function is called when there is compression versus when there is no compression.
getgenotypesdos(markers_arg, envir = ENV)
getgenotypesdos(markers_arg, envir = ENV)
markers_arg |
a data.frame with the following 5 observations:
|
envir |
an environment that contains all the data frames created from the SQLite database. |
The unified_genotype_table contains one raw vector for each person. In the vector, there are two bits for each genotype. This function creates an output matrix by fixing the marker and collecting genotype information for each person and then repeating for all the specified markers.
a list of 3 values, named "ncol", "zero", "geno".
is a matrix of dosages as integers. The value 0 is given to the Major allele value, 1 is given to the heterozygote value, and 2 is given to the Minor allele. In the matrix, there is usually one column for each marker in the markers_arg argument. But if there would be only the one allele 0 or 2 in the column, the column is ignorednot present. There is one row for each person in the family (fam) table.
Is the count of the actual number of columns in the geno matrix.
Is a vector with one entry per marker. The value will be 0 if the marker is not in the geno matrix. Otherwise the value is the column number in the geno matrix where the marker data appears.
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) getgenotypesdos(ENV$markers[ENV$markers$chromosome == 1,])
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) getgenotypesdos(ENV$markers[ENV$markers$chromosome == 1,])
This function calls a C++ function that does all the heavy lifting. It passes the arguments necessary for the C++ function: some from the caller's arguments and some from data frames that are in the "global" environment, envir. From its markers_arg argument, it gets the locus_index and the index in the unified_genotype_table. From the "global" environment, envir, it gets a bit vector of compressed genotype information, allele information, and some bookkeeping related data. Note: This function also contains a dispatch/switch on the type of compression in the genotype vector. A different C++ function is called when there is compression versus when there is no compression.
getgenotypesgenabel(markers_arg, envir = ENV)
getgenotypesgenabel(markers_arg, envir = ENV)
markers_arg |
a data.frame with the following 5 observations:
|
envir |
an environment that contains all the data frames created from the SQLite database. |
This function reads the genotype data in Mega2 compressed format and converts it to the GenABEL compressed format. The unified_genotype_table contains one raw vector for each person. In the vector, there are two bits for each genotype; each byte has the data for 4 markers. In GenABEL, there is one raw vector per marker, and each byte has the data for 4 persons. The C++ function does the conversion as well as adjusts the bits' contents. For example, in GenABEL the genotype represented by bits == 0, is what Mega2 represents with 2. Doing the conversion in C++ is 10 - 20 times faster than converting the Mega2 data to PLINK .tped files and then having GenABEL read in and process/convert those files.
the GenABEL gwaa.data-class object component that contains the genotype data.
This function is called from Mega2ENVGenABEL
; it is not intended to be called
by the programmer.
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) aa = getgenotypesgenabel(ENV$markers[ENV$markers$chromosome == 1,]) aa
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) aa = getgenotypesgenabel(ENV$markers[ENV$markers$chromosome == 1,]) aa
This function calls a C++ function that does all the heavy lifting. It passes the arguments necessary for the C++ function: some from the caller's arguments and some from data frames that are in the "global" environment, envir. From its markers_arg argument, it gets the locus_index and the index in the unified_genotype_table. From the "global" environment, envir, it gets a bit vector of compressed genotype information, and some bookkeeping related data. Note: This function also contains a dispatch/switch on the type of compression in the genotype vector. A different C++ function is called when there is compression versus when there is no compression.
getgenotypesraw(markers_arg, envir = ENV)
getgenotypesraw(markers_arg, envir = ENV)
markers_arg |
a data.frame with the following 5 observations:
|
envir |
an environment that contains all the data frames created from the SQLite database. |
The unified_genotype_table contains one raw vector for each person. In the vector, there are two bits for each genotype. This function creates an output matrix by fixing the marker and collecting genotype information for each person and then repeating for all the needed markers.
a matrix of genotypes represented as integers. Each 32 bit integer represents contains two allele values: the high 16 bits contains the index of allele1 and the low 16 bits contains the index of allele2. In the matrix, there is one column for each marker in the markers_arg argument. There is one row for each person in the family (fam) table.
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) # two ints in upper/lower half integer representing allele # for all persons in chromosome 1 getgenotypesraw(ENV$markers[ENV$markers$chromosome == 1,])
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) # two ints in upper/lower half integer representing allele # for all persons in chromosome 1 getgenotypesraw(ENV$markers[ENV$markers$chromosome == 1,])
This populates the R data frames with the specified Mega2 SQLite database. It initializes the fam(ily) table and makes sure the person entries are unique. Finally, it generates a kinship matrix from the family data. It also stores a weighting for the common and rare variant that may be used later if NULL is specified as a weight in Mega2famSKATRC. The common weighting is the function dbeta(maf, 1, 25). The rare weighting is the function dbeta(maf, 0.5, 0.5).
init_famSKATRC(db = NULL, verbose = FALSE, ALPHA = FALSE, ...)
init_famSKATRC(db = NULL, verbose = FALSE, ALPHA = FALSE, ...)
db |
specifies the path of a Mega2 SQLite database containing study data. |
verbose |
TRUE indicates that diagnostic printouts should be enabled. This value is saved in the returned environment. |
ALPHA |
TRUE indicates that two runs of famSKAT_RC should be enabled. One with ALPHA numeric ID's and one with numeric IDs ... this is temporary. The default is FALSE. |
... |
fed to dbmega2_import(); should be bpPosMap= to select from the maps of base pairs, if the default is not desired. |
"environment" containing data frames from an SQLite database and some computed values.
init_famSKATRC creates a new data frame, envir$phe, containing phenotype observations. In addition, it initializes a matrix to aid in translating a genotype allele matrix to a genotype count matrix.
It also initializes the data frame envir$famSKATRC_results to zero rows.
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = init_famSKATRC(db, verbose = FALSE) ls(ENV)
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = init_famSKATRC(db, verbose = FALSE) ls(ENV)
This populates the R data frames from the specified Mega2 SQLite database.
init_pedgene(db = NULL, verbose = FALSE, traitname = "default", ...)
init_pedgene(db = NULL, verbose = FALSE, traitname = "default", ...)
db |
specifies the path of a Mega2 SQLite database containing study data. |
verbose |
TRUE indicates that diagnostic printouts should be enabled. This value is saved in the returned environment. |
traitname |
Name of the affection status trait to use to set the case/control status; default value = "default". |
... |
fed to dbmega2_import(); should be bpPosMap= to select from the maps of base pairs, if the default is not desired. |
"environment" containing data frames from an SQLite database and some computed values.
init_pedgene calculates schaidPed and pedPer that are used later in the Dopedgene calculation. In addition, it initializes a matrix to aid in translating a genotype allele matrix to a genotype count matrix.
It also initializes the dataframe envir$pedgene_results to zero rows.
DOpedgene
, Mega2pedgene
, mkfam
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = init_pedgene(db, traitname = "default") ls(ENV)
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = init_pedgene(db, traitname = "default") ls(ENV)
This populates the R data frames from the specified Mega2 SQLite database. It then prunes the samples to only include members that have a definite case or control status. Undefined samples are ignored.
init_SKAT(db = NULL, verbose = FALSE, allMarkers = FALSE, ...)
init_SKAT(db = NULL, verbose = FALSE, allMarkers = FALSE, ...)
db |
specifies the path of a Mega2 SQLite database containing study data. |
verbose |
TRUE indicates that diagnostic printouts should be enabled. This value is saved in the returned environment. |
allMarkers |
TRUE means use all markers in a given transcript even if there is no variation. FALSE means ignore markers that show no variation; this is the default. |
... |
fed to dbmega2_import(); should be bpPosMap= to select from the maps of base pairs, if the default is not desired. |
"environment" containing data frames from an SQLite database and some computed values.
init_SKAT creats a data frame, envir$phe, of phenotype observations. In addition, it initializes a matrix to aid in translating a genotype allele matrix to a genotype count matrix.
It also initializes the data frame envir$SKAT_results to zero rows.
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = init_SKAT(db, verbose = FALSE, allMarkers = FALSE) ls(ENV)
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = init_SKAT(db, verbose = FALSE, allMarkers = FALSE) ls(ENV)
create a gwaa.data-class object from the data frames in a Mega2 environment. This
function is a front end that eventually calls a C++ Rcpp function that reads
the genotype data in Mega2 compressed format and converts it to the GenABEL
compressed format. The results of Mega2ENVGenABEL
are/should be the same
as Mega2GenABEL
, but the calculation is much faster, typically a factor of 10 to 20.
Mega2ENVGenABEL( markers = NULL, force = TRUE, makemap = FALSE, sort = TRUE, envir = ENV )
Mega2ENVGenABEL( markers = NULL, force = TRUE, makemap = FALSE, sort = TRUE, envir = ENV )
markers |
data frame of markers to be processed |
force |
pass value to gwaa conversion function |
makemap |
pass value to gwaa conversion function |
sort |
pass value to gwaa conversion function |
envir |
'environment' containing SQLite database and other globals |
gwaa.data-class object created from Mega2R database
## Not run: require("GenABEL") db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) gwaa = Mega2ENVGenABEL(markers=ENV$markers[1:10,]) str(gwaa) head(summary(gwaa)) ## End(Not run)
## Not run: require("GenABEL") db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) gwaa = Mega2ENVGenABEL(markers=ENV$markers[1:10,]) str(gwaa) head(summary(gwaa)) ## End(Not run)
If the gene argument is NULL, execute the famSKAT_RC function on the first gs gene transcripts (default is gs = 1:100). Update the envir$famSKATRC_results data frame with the results. Otherwise, gene is a string vector of genes to process. The special value '*' stands for all the known genes.
Mega2famSKATRC(gs = 1:100, genes = NULL, envir = ENV, ...)
Mega2famSKATRC(gs = 1:100, genes = NULL, envir = ENV, ...)
gs |
a subrange of the default transcripts (refRanges) over which to calculate the DOfamSKATRC function. |
genes |
a list of genes over which to calculate the DOfamSKATRC function. The value, "*", means use all the transcripts in the selected Bioconductor database. If genes is NULL, the gs range of the internal refRanges will be used. |
envir |
'environment' containing SQLite database and other globals. |
... |
extra arguments that are acceptable to famSKAT_RC. These are listed with the
|
The data frame with the results is stored in the environment and named famSKATRC_results, viz. envir$famSKATRC_results
A helper function
SKAT3arg
is defined for the 3 argument callback function which in turn calls
DOfamSKATRC
with the appropriate arguments (including those specific to the
Mega2famSKATRC
function).
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = init_famSKATRC(db, verbose = FALSE) ENV$verbose = FALSE ENV$famSKATRC_results = ENV$famSKATRC_results[0, ] Mega2famSKATRC(gs=50:60, envir=ENV, pheno=3) # donttestcheck: try this below if there is time Mega2famSKATRC(genes=c("CEP104"), envir=ENV, pheno=3 ) ENV$famSKATRC_results
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = init_famSKATRC(db, verbose = FALSE) ENV$verbose = FALSE ENV$famSKATRC_results = ENV$famSKATRC_results[0, ] Mega2famSKATRC(gs=50:60, envir=ENV, pheno=3) # donttestcheck: try this below if there is time Mega2famSKATRC(genes=c("CEP104"), envir=ENV, pheno=3 ) ENV$famSKATRC_results
Reads the data frames in "envir" and builds a GDSFMT COREARRAY file from them.
Mega2gdsfmt( filename = "test.gds", markers = NULL, snp.order = FALSE, SeqArray = FALSE, envir = ENV )
Mega2gdsfmt( filename = "test.gds", markers = NULL, snp.order = FALSE, SeqArray = FALSE, envir = ENV )
filename |
gdsfmt file to create |
markers |
data frame of markers to be processed |
snp.order |
TRUE indicates that the "genotype" data matrix has SNP as the first index which changes more quickly than subsequent indices. FALSE indicates that SAMPLE is the the first index. |
SeqArray |
TRUE uses SeqArray labels for the gdsfmt vector elements. FALSE it uses labels shown in SNPRelate |
envir |
'environment' containing SQLite database and other globals |
writes the "filename" file containing the CoreArray data. Then returns an internal pointer, class .gds, to the file data.
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) gdsfmtfile = file.path(where_mega2rtutorial_data(), "test.gds") append_genotype_a = TRUE append_genotype_b = append_genotype_c = FALSE gn = Mega2gdsfmt(gdsfmtfile, envir=ENV) gn
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) gdsfmtfile = file.path(where_mega2rtutorial_data(), "test.gds") append_genotype_a = TRUE append_genotype_b = append_genotype_c = FALSE gn = Mega2gdsfmt(gdsfmtfile, envir=ENV) gn
Call the Mega2R functions to: create a .tped file, a .tfam file and a .phe file.
Then call the GenABEL functions to process these files: the .tped and the .tfam
file are processed by convert.snp.tped
to produce a tped.raw file. The latter
is combined with a .phe (phenotype) file by load.gwaa.data
to create a gwaa.data-class
object in memory. All these files are deleted when the exits.
Mega2GenABEL(markers = NULL, mapno = 0, envir = ENV)
Mega2GenABEL(markers = NULL, mapno = 0, envir = ENV)
markers |
data frame of markers to be processed |
mapno |
specify which map index to use for physical distances |
envir |
'environment' containing SQLite database and other globals |
gwaa.data-class object generated from the Mega2R database
## Not run: require("GenABEL") db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) seqsimgwaa = Mega2GenABEL(markers=ENV$markers[1:10,]) str(seqsimgwaa) head(summary(seqsimgwaa)) ## End(Not run)
## Not run: require("GenABEL") db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) seqsimgwaa = Mega2GenABEL(markers=ENV$markers[1:10,]) str(seqsimgwaa) head(summary(seqsimgwaa)) ## End(Not run)
Verify by fields, all the fields in two gwaa.data-class objects. Show more detailed marker information iff the coding values are different. (When comparing two gwaa.data-class objects, one native and one created via Mega2R sometimes when an allele frequency is .5 for both alleles, the allele order 1/2 vs 2/1 can not be currently be determined.)
Mega2GenABELtst(mega_ = mega, gwaa_ = srdta, full = TRUE, envir = ENV)
Mega2GenABELtst(mega_ = mega, gwaa_ = srdta, full = TRUE, envir = ENV)
mega_ |
name of first gwaa.data-class object |
gwaa_ |
name of second gwaa.data-class object |
full |
if TRUE convert genotypes to text as.character(gwaa_@gtdata) |
envir |
'environment' containing SQLite database and other globals |
None
## Not run: db = system.file("exdata", "seqsimm.db", package="Mega2R") require("GenABEL") ENV = read.Mega2DB(db) y = Mega2ENVGenABEL() Mega2GenABELtst(y, y, full = FALSE) ## End(Not run) ## Not run: # donttestcheck: if you have more time, try ... x = Mega2GenABEL() Mega2GenABELtst(x, y, full = FALSE) ## End(Not run)
## Not run: db = system.file("exdata", "seqsimm.db", package="Mega2R") require("GenABEL") ENV = read.Mega2DB(db) y = Mega2ENVGenABEL() Mega2GenABELtst(y, y, full = FALSE) ## End(Not run) ## Not run: # donttestcheck: if you have more time, try ... x = Mega2GenABEL() Mega2GenABELtst(x, y, full = FALSE) ## End(Not run)
Execute the pedgene function on the first gs default gene transcript ranges (gs = 1:100). Update the envir$pedgene_results data frame with the results.
Mega2pedgene(gs = 1:100, genes = NULL, envir = ENV)
Mega2pedgene(gs = 1:100, genes = NULL, envir = ENV)
gs |
a subrange of the default transcript ranges over which to calculate the Dopedgene function. |
genes |
a list of genes over which to calculate the DOpedgene function. The value, "*", means use all the transcripts in the selected Bioconductor database. If genes is NULL, the gs range of the internal refRanges will be used. |
envir |
'environment' containing SQLite database and other globals |
None the data frame with the results is stored in the environment and named pedgene_results, viz. envir$pedgene_results
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = init_pedgene(db) ENV$verbose = TRUE Mega2pedgene(gs = 50:60)
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = init_pedgene(db) ENV$verbose = TRUE Mega2pedgene(gs = 50:60)
This character vector indicates the names of the Mega2 SQLite3 database tables to load. (Not all of the existing tables are loaded.)
TBLS
TBLS
An object of class character
of length 15.
Robert V Baron
This list contains named values. The name corresponds to an SQLite database table. The value is a character string of column names from the "named" table that should be fetched. A table is in this list, if not all the database table columns are needed. The columns for each table are separated by commas.
TBLSFilter
TBLSFilter
An object of class list
of length 7.
For the data base tables not in this list, all columns are stored in the corresponding data frame.
Robert V Baron
This string indicates the current release of Mega2R
Mega2RVersion
Mega2RVersion
An object of class character
of length 1.
Robert V Baron
Execute the SKAT function on the first gs default gene transcripts (gs = 1:100). Update the envir$SKAT_results data frame with the results.
Mega2SKAT(f, ty, gs = 1:100, genes = NULL, skat = SKAT::SKAT, envir = ENV, ...)
Mega2SKAT(f, ty, gs = 1:100, genes = NULL, skat = SKAT::SKAT, envir = ENV, ...)
f |
SKAT_Null_Model formula. If this is non NULL, envir$obj is initialized by calling
SKAT_Null_Model(f, out_type = ty). If you need to specify additional arguments to the Model
viz. (data, Adjustment, n.Resampling, type.Resampling)
or need to use a different model viz. SKAT_NULL_emmaX, |
ty |
type of phenotype C/D = Continuous/Binary 5 (internal type 1/2) |
gs |
a subrange of the default transcripts (refRanges) over which to calculate the DOSKAT function. |
genes |
a list of genes over which to calculate the DOSKAT function. The value, "*", means use all the transcripts in the selected Bioconductor database. If genes is NULL, the gs range of the internal refRanges will be used. |
skat |
alternate SKAT function, viz. SKATBinary, SKAT_CommonRare. If it is also necessary is to pass additional arguments to the SKAT function, they may be added to the end of the Mega2SKAT function and will be passed. See examples |
envir |
'environment' containing SQLite database and other globals |
... |
extra arguments for SKAT |
None the data frame with the results is stored in the environment and named SKAT_results, viz. envir$SKAT_results
The SKAT_Null_Model
is called if the formula, f, is not NULL. A helper function
SKAT3arg
is defined for the 3 argument callback function which in turn calls
DOSKAT
with the appropriate arguments (including those additional to the
Mega2SKAT
function).
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = init_SKAT(db, verbose = FALSE, allMarkers = FALSE) ENV$verbose = FALSE ENV$SKAT_results = ENV$SKAT_results[0, ] Mega2SKAT(ENV$phe[, 3] - 1 ~ 1, "D", kernel = "linear.weighted", weights.beta = c(0.5, 0.5), gs=50:60 ) # donttestcheck: try this below if there is time Mega2SKAT(ENV$phe[, 3] - 1 ~ 1, "D", kernel = "linear.weighted", weights.beta = c(0.5, 0.5), genes=c("CEP104") ) ENV$SKAT_results
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = init_SKAT(db, verbose = FALSE, allMarkers = FALSE) ENV$verbose = FALSE ENV$SKAT_results = ENV$SKAT_results[0, ] Mega2SKAT(ENV$phe[, 3] - 1 ~ 1, "D", kernel = "linear.weighted", weights.beta = c(0.5, 0.5), gs=50:60 ) # donttestcheck: try this below if there is time Mega2SKAT(ENV$phe[, 3] - 1 ~ 1, "D", kernel = "linear.weighted", weights.beta = c(0.5, 0.5), genes=c("CEP104") ) ENV$SKAT_results
Generate a VCF file from the specified Mega2 SQLite database. The file is named "prefix".vcf If the markers argument is.null(), the entire envir$markers set is used, otherwise markers argument MUST be rows of the markers (envir$markers) data frame. In addition, several other files are generated to hold additional database information: "prefix".fam, "prefix".freq, "prefix".map, "prefix".phe, and "prefix".pen, which contain the pedigree, allele frequency, marker genetic and physical map position, member phenotype and phenotype penetrance data.
Mega2VCF( prefix, markers = NULL, mapno = 0, alleleOrder = "default", envir = ENV )
Mega2VCF( prefix, markers = NULL, mapno = 0, alleleOrder = "default", envir = ENV )
prefix |
prefix of output files including the VCF file (see Description section above). This prefix can include a path. |
markers |
markers selected to be in the VCF output file |
mapno |
specify which map index to use for genetic distances. The function |
alleleOrder |
how to order alleles in VCF file. 'default' is Mega2order, 'minor' is minor allele freq first, 'major' is major allele freq first, and 'name' is ascending ascii character order of allele name. |
envir |
'environment' containing SQLite database and other globals |
None
This code in this package illustrates how to extract the various kinds of data in the Mega2 data frames to use for further processing. Some of the data internal representations are a bit quirky but the code "explains" it all.
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) vcfdir = file.path(where_mega2rtutorial_data(), "vcfr") if (!dir.exists(vcfdir)) dir.create(vcfdir) vcffile = file.path(where_mega2rtutorial_data(), "vcfr", "vcf.01") Mega2VCF(vcffile, ENV$markers[ENV$markers$chromosome == 1, ][1:10,], envir = ENV) list.files(vcfdir)
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) vcfdir = file.path(where_mega2rtutorial_data(), "vcfr") if (!dir.exists(vcfdir)) dir.create(vcfdir) vcffile = file.path(where_mega2rtutorial_data(), "vcfr", "vcf.01") Mega2VCF(vcffile, ENV$markers[ENV$markers$chromosome == 1, ][1:10,], envir = ENV) list.files(vcfdir)
Generate a data frame with a row for each person. The observations are:
pedigree | family pedigree name |
person | person name |
father | father of person |
mother | mother of person |
sex | sex of person |
trait | value of case/control phenotype for person |
mkfam(brkloop = FALSE, traitname = "default", envir = ENV)
mkfam(brkloop = FALSE, traitname = "default", envir = ENV)
brkloop |
I haven't needed to set this TRUE yet. Maybe never will. If loops are broken, a person will be replaced by a dopple ganger in the same family with a different father/mother. The number of persons per family will be different when there are broken loops. Also, the person_link numbers will be different for all the persons after the first loop is broken. |
traitname |
Name of the trait to use as case/control value; by default, "default" |
envir |
An 'environment' that contains all the data frames created from the SQLite database. |
data frame that is described above
The columns of this data frame come by selecting the values after merging the data frames: pedigree_table, person_table, and trait_table.
Also, the father and mother columns from person_table are translated from the row index in the person_table to the corresponding name.
This function stores the data frame in the 'environment' and also returns it.
The function setfam()
stores the data frame
into the 'environment' and adjusts the genotype_table and the phenotype_table.
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) fam = mkfam() fam
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) fam = mkfam() fam
Create the markers data frame. It contains 5 observations:
locus offset of this marker
locus offset plus an accumulating fudge factor that jumps
with each new chromosome because the count of markers per chromosome is force to be
a multiple of 4. (This value corresponds to the offset of the marker in the
unified_genotype_table
.)
name of the marker
chromosome number of the marker
base pair position of the marker (selected by bpPosMap[below])
mkMarkers(bpPosMap = 1, envir = ENV)
mkMarkers(bpPosMap = 1, envir = ENV)
bpPosMap |
An integer that indicates the map (index) to use to merge the chromosome/position fields from the map_table data frame to the marker_table data frame. See showMapNames() for the string name to index mapping. |
envir |
an environment that contains all the data frames created from the SQLite database. |
Select a map (index) from the map_table to merge with the select marker_table data frame to make the marker data frame. See showMapNames() for the string name to index mapping.
None
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db, verbose = FALSE) mkMarkers(1) ENV$markers
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db, verbose = FALSE) mkMarkers(1) ENV$markers
Convert data in phenotype_table to a data frame of columns that are phenotypes. The columns may be affection status or quantitative values
mkphenotype(envir)
mkphenotype(envir)
envir |
"environment" containing SQLite database and other globals |
is a data frame with FID column, then IID column, and then an additional column for each phenotype.
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) out = mkphenotype() out
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) out = mkphenotype() out
Call dbmega2_import()
with the specified database and create an 'environment', with the
SQLite table data loaded into data frames.
Also run mkfam()
to create the pedigree data frame fam and then store it with setfam()
.
setfam()
modifies the unified_genotype_table (and phenotype_table) to match the family members
that remain.
read.Mega2DB(db, ...)
read.Mega2DB(db, ...)
db |
specify SQLite database to load |
... |
additional arguments to pass to |
an 'environment' that contains all the data frames created from the SQLite database.
By default, mkfam
will remove one of each
person that was replicated to break loops in the pedigree, see mkfam
for details.
If you want to leave loops broken, the code is available, but you will have to write your own
version of read.Mega2DB with a different invokation of mkfam()
.
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db, verbose = TRUE)
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db, verbose = TRUE)
This function takes two string parameters: one to specify entrez gene ids to transcripts, the other to map gene names to entrez gene id's.
setAnnotations(txdb, entrezGene, envir = ENV)
setAnnotations(txdb, entrezGene, envir = ENV)
txdb |
name of Bioconductor transcription database. |
entrezGene |
name of Bioconductor mapping of gene name or gene alias to entrez gene id |
envir |
an 'environment' that contains all the data frames created from the SQLite database. |
None
Mega2R will take care to load the necessary databases, but you will have to install them from Bioconductor. This is explained at length in the package Vignette.
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) setAnnotations("TxDb.Hsapiens.UCSC.hg19.knownGene", "org.Hs.eg.db") ENV$txdb ENV$entrezGene
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) setAnnotations("TxDb.Hsapiens.UCSC.hg19.knownGene", "org.Hs.eg.db") ENV$txdb ENV$entrezGene
You should first modify the fam data frame to filter the members you need to remove.
(For example, you might want to delete members that have an unknown case/control status.)
This function takes a new data frame of pedigree information and replaces the fam
data frame in the 'environment' with it. Additionally,
changing fam data frame will filter the genotypes data frame to only contain persons
matching those in the fam data frame. setfam
also filters for the phenotype data
records.
setfam(fam, envir = ENV)
setfam(fam, envir = ENV)
fam |
data frame of family information filtered from fam
data frame (generated by |
envir |
an 'environment' that contains all the data frames created from the SQLite database. |
None
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) fam = mkfam() # remove founders fam = fam[ !( (fam[ , 5] == fam[ , 6]) & (fam[ , 5] == 0)), ] setfam(fam) ENV$fam
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) fam = mkfam() # remove founders fam = fam[ !( (fam[ , 5] == fam[ , 6]) & (fam[ , 5] == 0)), ] setfam(fam) ENV$fam
This function sets the default list of ranges used by applyFnToRanges
. applyFnToRanges
examines each range and the set of markers that fall within the range will be
processed.
setRanges(ranges, indices, envir = ENV)
setRanges(ranges, indices, envir = ENV)
ranges |
a data frame that contains at least 4 observations: a name, a chromosome, a start base pair position and an end base pair position. |
indices |
a vector of 3 or 4 integers that specify the chromosome column, start base pair, column and end base pair column of range data frame and lastly the name column. If the vector only contains 3 integers, a name will be generated from the three range elements and it will be appended to the ranges and the last range column will be added to the indices. |
envir |
an 'environment' that contains all the data frames created from the SQLite database. |
None
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) ranges = matrix(c(1, 2240000, 2245000, 1, 2245000, 2250000, 1, 3760000, 3761000, 1, 3761000, 3762000, 1, 3762000, 3763000, 1, 3763000, 3764000, 1, 3764000, 3765000, 1, 3765000, 3763760, 1, 3763760, 3767000, 1, 3767000, 3768000, 1, 3768000, 3769000, 1, 3769000, 3770000), ncol = 3, nrow = 12, byrow = TRUE) setRanges(ranges, 1:3) ENV$refRanges ranges = matrix(c(1, 2240000, 2245000, 1, 2245000, 2250000, 1, 3760000, 3761000, 1, 3761000, 3762000, 1, 3762000, 3763000, 1, 3763000, 3764000, 1, 3764000, 3765000, 1, 3765000, 3763760, 1, 3763760, 3767000, 1, 3767000, 3768000, 1, 3768000, 3769000, 1, 3769000, 3770000), ncol = 3, nrow = 12, byrow = TRUE) ranges = data.frame(ranges) ranges$name = LETTERS[1:12] names(ranges) = c("chr", "start", "end", "name") setRanges(ranges, 1:4) ENV$refRanges
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) ranges = matrix(c(1, 2240000, 2245000, 1, 2245000, 2250000, 1, 3760000, 3761000, 1, 3761000, 3762000, 1, 3762000, 3763000, 1, 3763000, 3764000, 1, 3764000, 3765000, 1, 3765000, 3763760, 1, 3763760, 3767000, 1, 3767000, 3768000, 1, 3768000, 3769000, 1, 3769000, 3770000), ncol = 3, nrow = 12, byrow = TRUE) setRanges(ranges, 1:3) ENV$refRanges ranges = matrix(c(1, 2240000, 2245000, 1, 2245000, 2250000, 1, 3760000, 3761000, 1, 3761000, 3762000, 1, 3762000, 3763000, 1, 3763000, 3764000, 1, 3764000, 3765000, 1, 3765000, 3763760, 1, 3763760, 3767000, 1, 3767000, 3768000, 1, 3768000, 3769000, 1, 3769000, 3770000), ncol = 3, nrow = 12, byrow = TRUE) ranges = data.frame(ranges) ranges$name = LETTERS[1:12] names(ranges) = c("chr", "start", "end", "name") setRanges(ranges, 1:4) ENV$refRanges
Mega2R allows several different physical and genetic maps to be stored and used to select positions. This function shows the association between map number and map name.
showMapNames(envir = ENV)
showMapNames(envir = ENV)
envir |
an environment that contains all the data frames created from the SQLite database. |
None
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) showMapNames()
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) showMapNames()
Mega2R uses an environment to store the data frames when it reads SQLite database tables. This function shows the data frames and their sizes; it also shows the count of samples and markers in the database. Note: It is not necessary to provide an argument, if the environment is named ENV.
showMega2ENV(envir = ENV)
showMega2ENV(envir = ENV)
envir |
an environment that contains all the data frames created from the SQLite database. |
None
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) showMega2ENV()
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) showMega2ENV()
Mega2R stores several phenotypes, both affective and quantitative. This function displays the mapping between phenotype (name), index, and the phenotype type (affection or quantitative).
showPhenoNames(envir = ENV)
showPhenoNames(envir = ENV)
envir |
an environment that contains all the data frames created from the SQLite database. |
None
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) showPhenoNames()
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) showPhenoNames()
Reads the fam data frame in "envir" and returns a new one with unique entries in the member column
uniqueFamMember(envir = ENV)
uniqueFamMember(envir = ENV)
envir |
'environment' containing SQLite database and other globals |
a data frame with columns the same as the "fam" data frame but with the member column containing unique entries
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) setfam(uniqueFamMember(envir = ENV))
db = system.file("exdata", "seqsimm.db", package="Mega2R") ENV = read.Mega2DB(db) setfam(uniqueFamMember(envir = ENV))
This function shows the directory the Mega2Rtutorial (inst/exdata) was copied to.
where_mega2rtutorial_data(dir = file.path(tempdir(), "Mega2Rtutorial"))
where_mega2rtutorial_data(dir = file.path(tempdir(), "Mega2Rtutorial"))
dir |
The directory to store the tutorial data to. By default, this is tempdir()/Mega2Rtutorial |
dir tutorial to hold vignette
directory = where_mega2rtutorial_data()
directory = where_mega2rtutorial_data()