# Introduction to Bioconductor

Bioconductor provides tools for the analysis and comprehension of high-throughput genomic data. Bioconductor uses the R statistical programming language, and is open source and open development.”

https://www.bioconductor.org/

library("dplyr")
library("ggplot2")

# Installation

To install core packages, type the following in an R command window.

• This may take around 5 minutes
• When the option for updating packages appears, type in “a” for “all”
#leave as eval = FALSE when knitting

if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install()

## Installing Bioconductor packages

There are about 20000 packages in the Bioconductor universe. For now, let us install a couple of the more popular packages with the following code. To install a BioConductor package, simply type in the name of the package in quotes inside of BiocManager::install()

• This may take around 5 minutes
• Hereafter, when the option for updating packages appears, type in “n” for “no”
#leave as eval = FALSE when knitting

BiocManager::install("GenomicFeatures")
BiocManager::install("Biostrings")

# Exploring GenomicFeatures

The following instructions were extracted from the vignette for Genomic Features

• browseVignettes(package = "GenomicFeatures")

Load the GenomicFeatures package. It is suggested that you load the packages with the following code.

#set as eval = TRUE when knitting

suppressPackageStartupMessages(library('GenomicFeatures'))

## Introduction

“The GenomicFeatures package retrieves and manages transcript-related features fromthe UCSC Genome Bioinformatics and BioMart data resources. The package isuseful for ChIP-chip, ChIP-seq, and RNA-seq analyses.”

## TxDb Objects

"The GenomicFeatures package uses TxDb objects to store transcript metadata. This class maps the 5’ and 3’ untranslated regions (UTRs), protein coding sequences(CDSs) and exons for a set of mRNA transcripts to their associated genome. TxDb objects have numerous accessors functions to allow such features to be retrieved individually or grouped together in a way that reflects the underlying biology.

All TxDb objects are backed by a SQLite database that manages genomic locations and the relationships between pre-processed mRNA transcripts, exons, protein coding sequences, and their related gene identifiers."

#set as eval = TRUE when knitting

samplefile <- system.file("extdata",
"hg19_knownGene_sample.sqlite",
package="GenomicFeatures")
txdb
1. Write a couple of sentences about what you observe in the output (the description of the txdb sample file)

1. Install the following packages (i.e. figure out how from some code earlier in this tutorial)
• BSgenome.Hsapiens.UCSC.hg19
• TxDb.Hsapiens.UCSC.hg19.knownGene
1. Run the following code and then write a couple of sentences about what you observe in the output.
#set as eval = TRUE when knitting

library(BSgenome.Hsapiens.UCSC.hg19)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

tx_seqs1 <- extractTranscriptSeqs(Hsapiens,
TxDb.Hsapiens.UCSC.hg19.knownGene,
use.names=TRUE)
suppressWarnings(translate(tx_seqs1))

# Exploring Biostrings

The following instructions were extracted from the vignette for Biostrings

• browseVignettes(package = "Biostrings")
1. Install the following packages (i.e. figure out how from some code earlier in this tutorial)
• hgu95av2probe
• hgu95av2cdf

Load the Biostrings package. It is suggested that you load the packages with the following code.

#set as eval = TRUE when knitting

suppressPackageStartupMessages(library('Biostrings'))
suppressPackageStartupMessages(library('hgu95av2probe'))
suppressPackageStartupMessages(library('hgu95av2cdf'))
1. The following code will load an example of a “Large DNAStringSet” and perform a couple of simple data analyses. Run the code and write a couple of sentences about the results.
#set as eval = TRUE when knitting

this_DNAString_set <- DNAStringSet(hgu95av2probe)
this_DNAString_set_bc <- alphabetFrequency(this_DNAString_set, baseOnly = TRUE)
nrow(this_DNAString_set_bc)
alphabetFrequency(this_DNAString_set_bc, baseOnly=TRUE, collapse=TRUE)
1. The following code will perform a calculation called the GC content of a DNA sequence and graph the resulting amounts as a histogram. Run the code and write a couple of sentences about the results.
#set as eval = TRUE when knitting

this_DNAString_df <- data.frame(this_DNAString_set_bc)

# alas "T" is treated as "TRUE" in R, so let's use the full names of the nucleotides
colnames(this_DNAString_df) <- c("Adenine", "Cytosine", "Guanine", "Thymine", "other")

this_DNAString_df %>%
theme(legend.position = "none")