Genetics

R and RStudio

General information

R 3.3.3 is a freely available language and environment for statistical computing and graphics which provides a wide variety of statistical and graphical techniques: linear and nonlinear modelling, statistical tests, time series analysis, classification, clustering, etc. Please consult the R project homepage for further information.

RStudio and R Commander are graphical front ends for R.

Installed packages

abind, ape, biomformat, cummeRbund, DCGL, DESeq2, DEXSeq, e1071, edgeR, FactoMineR, GEOquery, lavaan, metagenomeSeq, mnormt, optparse, psych, randomForest, Rcmdr, RColorBrewer, ReactomePA, RUVSeq, vegan, WGCNA, xlsx.

If you need any other package, please ask us.

How to use

To run R in scripts for the queue system, use:

/software/bin/R CMD BATCH R-input-file.R

To use RStudio you must connect through X2Go so that windows can be opened, connecting to Katramila or Txinparta and running:

rstudio

To use R Commander you must also connect through X2Go so that windows can be opened, connecting to Katramila or Txinparta and running R. Then load it in R:

library(Rcmdr)

More information

R web page.

RStudio web page.

IDBA-UD

General information

IDBA-UD 1.1.1 is an iterative de Bruijn graph de novo assembler for short-read sequencing data with highly uneven sequencing depth. It is an extension of the IDBA algorithm. Like IDBA, IDBA-UD iterates from a small k to a large k. In each iteration, short and low-depth contigs are removed iteratively, with the cutoff threshold raised from low to high, to reduce errors in both low-depth and high-depth regions. Paired-end reads are aligned to contigs and assembled locally to generate some of the k-mers missing in low-depth regions. With these techniques, IDBA-UD can iterate the k value of the de Bruijn graph up to a very large value with fewer gaps and fewer branches, forming long contigs in both low-depth and high-depth regions.

How to use

To send jobs to the queue you can use the command

send_idba-ud

which after a few questions configures the job.

Performance

IDBA-UD performs and scales well up to 8 cores; beyond that we did not measure any improvement. The benchmark used the --mink 40 --step 20 options. When we decreased the step, the scaling got worse. This trend can also be seen in the second table.

                 1 core as base             2 cores as base
Cores  Time (s)  Speed up  Performance (%)  Speed up  Performance (%)
1      480       1         100
2      296       1.6       81               1.0       100
4      188       2.6       64               1.6       79
8      84        5.7       71               3.5       88
12     92        5.2       43               3.2       54
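
The speed-up and performance columns can be reproduced from the measured times. The following is a minimal Python sketch, assuming that "speed up" is the baseline time divided by the measured time and that "performance" is the speed-up divided by the ratio of cores to baseline cores (the usual parallel-efficiency definition). The function name `scaling` is ours and is not part of IDBA-UD.

```python
# Measured IDBA-UD wall times from the table above: cores -> seconds.
times = {1: 480, 2: 296, 4: 188, 8: 84, 12: 92}

def scaling(times, base_cores):
    """Return {cores: (speedup, efficiency_percent)} relative to base_cores."""
    base_time = times[base_cores]
    result = {}
    for cores, seconds in sorted(times.items()):
        if cores < base_cores:
            continue  # rows below the baseline are left empty in the table
        speedup = base_time / seconds
        # Assumed definition: efficiency = speedup / (cores / base_cores).
        efficiency = 100.0 * speedup / (cores / base_cores)
        result[cores] = (round(speedup, 1), round(efficiency))
    return result

one_core_base = scaling(times, 1)
two_core_base = scaling(times, 2)
for cores in sorted(times):
    print(cores, one_core_base[cores], two_core_base.get(cores))
```

Taking 2 cores as the baseline makes the 8-core run look better (88 % instead of 71 %) because the poor step from 1 to 2 cores is excluded from the comparison.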

The second benchmark was run on a larger file, with 10 million bases, using the --mink 20 --step 10 --min_support 2 options. The behaviour is more regular than in the previous benchmark, and the parallelization is good up to 4 cores.

Cores  Time (s)  Speed up  Performance (%)
1      13050     1         100
2      6675      2.0       98
4      3849      3.4       85
8      3113      4.2       52
16     2337      5.6       35
20     2409      5.4       27
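
One way to turn a table like this into a core count for a job is to pick the largest number of cores that still keeps the efficiency above some threshold. The sketch below is illustrative only: the 80 % threshold and the function name `max_efficient_cores` are our assumptions, not a rule of the Service.

```python
# Second IDBA-UD benchmark from the table above: cores -> seconds.
times = {1: 13050, 2: 6675, 4: 3849, 8: 3113, 16: 2337, 20: 2409}

def max_efficient_cores(times, min_efficiency=0.80):
    """Largest core count whose parallel efficiency is at least min_efficiency."""
    serial = times[1]
    best = 1
    for cores in sorted(times):
        # Efficiency = serial time / (parallel time * cores).
        efficiency = serial / (times[cores] * cores)
        if efficiency >= min_efficiency:
            best = cores
    return best

recommended = max_efficient_cores(times)          # assumed 80 % threshold
relaxed = max_efficient_cores(times, 0.50)        # more permissive threshold
print(recommended, relaxed)
```

With the assumed 80 % threshold the result is 4 cores, which agrees with the observation that the parallelization is good up to 4 cores.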

More information

IDBA-UD web page.

SPAdes

General information

SPAdes 3.6.0 (St. Petersburg genome assembler) is intended for both standard isolates and single-cell MDA bacteria assemblies. It works with Illumina or IonTorrent reads and can produce hybrid assemblies using PacBio, Oxford Nanopore and Sanger reads. You can also provide additional contigs, which will be used as long reads. It supports paired-end reads, mate-pairs and unpaired reads, and can take several paired-end and mate-pair libraries as input simultaneously. Note that SPAdes was initially designed for small genomes. It was tested on single-cell and standard bacterial and fungal data sets.

How to use

To send jobs to the queue you can use the

send_spades

command, which asks a few questions to configure the job.

Performance

We have not measured any performance improvement or time reduction when using several cores in a standard calculation like:

spades.py --pe1-1 file1 --pe1-2 file2 -o outdir

We recommend using 1 core unless you know that your calculation performs better with several cores.

More information

Web page of SPAdes.

MetAMOS

General information

MetAMOS (version 1.5rc3) represents a focused effort to create automated, reproducible, traceable assembly and analysis infused with current best practices and state-of-the-art methods. MetAMOS can take next-generation sequencing reads or assemblies as input and produces as output: assembly reports, genomic scaffolds, open reading frames, variant motifs, taxonomic or functional annotations, Krona charts and an HTML report.

How to use

To send a job to the queue system there is the

send_metamos

command, where you answer a few questions to set up the job. Bear in mind that MetAMOS uses a lot of RAM, about 1 GB per million reads.
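
The rule of thumb above can be used to check whether a data set is likely to fit in a node before submitting. This is a rough Python sketch under the stated 1 GB per million reads figure; the function names and the 24 GB node size in the example are hypothetical.

```python
GB_PER_MILLION_READS = 1.0  # rule of thumb quoted above, not a measured bound

def estimate_ram_gb(n_reads, gb_per_million=GB_PER_MILLION_READS):
    """Approximate RAM (GB) that MetAMOS may need for n_reads reads."""
    return n_reads / 1_000_000 * gb_per_million

def fits_in_node(n_reads, node_ram_gb):
    """True if the estimated RAM does not exceed the RAM of the node."""
    return estimate_ram_gb(n_reads) <= node_ram_gb

needed = estimate_ram_gb(25_000_000)     # 25 million reads
ok = fits_in_node(25_000_000, 24)        # hypothetical node with 24 GB
print(needed, ok)
```

Since the figure is approximate, it is prudent to request some margin over the estimate.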

More information

MetAMOS web page.

QIIME

General information

QIIME (Quantitative Insights Into Microbial Ecology) is an open-source bioinformatics pipeline for performing microbiome analysis from raw DNA sequencing data. QIIME is designed to take users from raw sequencing data generated on the Illumina or other platforms through to publication-quality graphics and statistics. This includes demultiplexing and quality filtering, OTU picking, taxonomic assignment, phylogenetic reconstruction, and diversity analyses and visualizations. QIIME has been applied to studies based on billions of sequences from tens of thousands of samples.

How to use

To send QIIME jobs run the command

send_qiime

and answer the questions.

USEARCH

QIIME can use the USEARCH package.

More information

QIIME home page.

USEARCH.

 

 

USEARCH

General information

USEARCH is a unique sequence analysis tool that offers search and clustering algorithms that are often orders of magnitude faster than BLAST. We have the free 32-bit version, which cannot be distributed to third parties and is limited to 4 GB of RAM.

How to use

To use USEARCH execute

/software/bin/usearch

for example

/software/bin/usearch -cluster_otus data.fa -otus otus.fa -uparseout out.up -relabel OTU_ -sizein -sizeout

USEARCH is only available in the xeon20 type nodes.

QIIME

USEARCH can be used under QIIME.

More information

USEARCH home page.

QIIME.

SAMtools, BCFtools and HTSlib 1.2

General Information

Samtools is a suite of programs for interacting with high-throughput sequencing data. It consists of three separate repositories:

  • Samtools: reading/writing/editing/indexing/viewing SAM/BAM/CRAM format
  • BCFtools: reading/writing BCF2/VCF/gVCF files and calling/filtering/summarising SNP and short indel sequence variants
  • HTSlib: a C library for reading/writing high-throughput sequencing data

Samtools and BCFtools both use HTSlib internally, but these source packages contain their own copies of HTSlib so they can be built independently.

 

How to use

They are installed in /software/samtools-1.2/, /software/bcftools-1.2/ and /software/htslib-1.2.1, respectively.

Add something like this to your PBS script:

export PATH=/software/samtools-1.2/bin:/software/bcftools-1.2/bin:$PATH

export LD_LIBRARY_PATH=/software/htslib-1.2.1/lib:$LD_LIBRARY_PATH

 

More Information

http://www.htslib.org/

Trinity

General information

Release 2.1.1. Trinity represents a novel method for the efficient and robust de novo reconstruction of transcriptomes from RNA-seq data. Trinity combines three independent software modules, Inchworm, Chrysalis and Butterfly, applied sequentially to process large volumes of RNA-seq reads. Trinity partitions the sequence data into many individual de Bruijn graphs, each representing the transcriptional complexity at a given gene or locus, and then processes each graph independently to extract full-length splicing isoforms and to tease apart transcripts derived from paralogous genes. Briefly, the process works like so:

  • Inchworm assembles the RNA-seq data into the unique sequences of transcripts, often generating full-length transcripts for a dominant isoform, but then reports just the unique portions of alternatively spliced transcripts.
  • Chrysalis clusters the Inchworm contigs and constructs complete de Bruijn graphs for each cluster. Each cluster represents the full transcriptional complexity for a given gene (or sets of genes that share sequences in common). Chrysalis then partitions the full read set among these disjoint graphs.
  • Butterfly then processes the individual graphs in parallel, tracing the paths that reads and pairs of reads take within the graph, ultimately reporting full-length transcripts for alternatively spliced isoforms, and teasing apart transcripts that correspond to paralogous genes.

How to use

You can use the

send_trinity

command to submit jobs to the queue system. After you answer a few questions, a script is created and submitted to the queue system. Advanced users can also use it to generate a sample script.

Performance

Trinity can be run in parallel, but its efficiency is low above 4 cores, as can be seen in the table. Trinity also uses large amounts of RAM.

Performance of Trinity
Cores           1     4     8     12
Time            5189  2116  1754  1852
Speedup         1     2.45  2.96  2.80
Efficiency (%)  100   61    37    23
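
Another way to read the table is as total core time consumed, that is, cores multiplied by wall time. The sketch below assumes the times are in seconds, as in the other benchmarks on this page (the table itself does not state the unit); the variable names are ours.

```python
# Trinity wall times from the table above: cores -> time (assumed seconds).
times = {1: 5189, 4: 2116, 8: 1754, 12: 1852}

# Total core time charged to the job for each core count.
core_seconds = {cores: cores * t for cores, t in times.items()}

# Cost relative to the single-core run (the inverse of the efficiency).
relative_cost = {c: round(cs / core_seconds[1], 2) for c, cs in core_seconds.items()}

# Core count that gives the shortest wall time.
fastest = min(times, key=times.get)

print(core_seconds)
print(relative_cost)
print(fastest)
```

The 8-core run is the fastest, but it uses about 2.7 times the core time of the single-core run, while the 12-core run is both slower and more expensive than the 8-core one.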

 

More information

Trinity web page.

ABySS

General Information

Version 1.3.2 of ABySS (Assembly By Short Sequences). ABySS is a de novo, parallel, paired-end sequence assembler designed for short reads.

See also Velvet, which is also installed, and the article we have published comparing the two.

How to use

The executables can be found in /software/abyss/bin. To run ABySS from a script, add the line:

/software/abyss/bin/abyss-pe [abyss-pe options]

Performance

See also Velvet, which is also installed, and the article we have published comparing the two.

Parallelization

Some benchmarks have been run with ABySS, using a file from an Illumina HiSeq2000 with 100 bp per sequence. Table 1 shows an example of how ABySS scales as a function of the number of cores; speedup and performance are relative to the 2-core run. ABySS scales well up to 8 cores. The result holds at least for inputs of more than 10^6 sequences.

Table 1. Execution time of abyss-pe in seconds as a function of the number of cores.
Cores            2      4      8      12     24
Time (s)         47798  27852  16874  14591  18633
Speedup          1      1.7    2.8    3.3    2.6
Performance (%)  100    86     71     55     21

Execution time

We have also analyzed the execution time as a function of the size of the data. Table 2 shows that going from 1 million to 10 million sequences increases the execution time by a factor of about 10. From 10 to 100 million sequences the time increases somewhat more, by a factor between 10 and 20. The behavior is therefore roughly linear.

Table 2. Execution time in seconds of abyss-pe run on 2, 4 and 8 cores as a function of the number of processed sequences.
Sequences            10^6  10^7  10^8
Time in 2 cores (s)  247   2620  47798
Time in 4 cores (s)  134   1437  27852
Time in 8 cores (s)  103   923   16874
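
The growth factors quoted above can be checked directly from Table 2. The following Python sketch computes the factor for each tenfold increase in input size and a rough scaling exponent between the smallest and largest input; the function names are ours, and the exponent is only a two-point estimate.

```python
import math

# Table 2: number of sequences and abyss-pe times (s) per core count.
sizes = [10**6, 10**7, 10**8]
times = {2: [247, 2620, 47798], 4: [134, 1437, 27852], 8: [103, 923, 16874]}

def growth_factors(row):
    """Time ratio between consecutive input sizes (each step is 10x more data)."""
    return [round(later / earlier, 1) for earlier, later in zip(row, row[1:])]

def scaling_exponent(sizes, row):
    """Exponent p in time ~ size**p, estimated from the first and last points."""
    return math.log10(row[-1] / row[0]) / math.log10(sizes[-1] / sizes[0])

for cores, row in times.items():
    print(cores, growth_factors(row), round(scaling_exponent(sizes, row), 2))
```

An exponent of exactly 1 would mean strictly linear growth; the values obtained here are slightly above 1, consistent with the second step being steeper than the first.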

RAM memory

In this kind of program, RAM usage matters more than the execution time, which is reasonable, because it can limit the size of the calculation that can be run. Table 3 shows how the RAM grows with the number of sequences. We also show the base-10 logarithms of the measured values, which were used for a linear regression. The jobs were run on 12 cores.

Table 3. RAM used by abyss-pe as a function of the number of processed sequences. The logarithms of the measured values are also shown.
Sequences       10^6  5*10^6  10^7  5*10^7  10^8
RAM (GB)        4.0   7.6     11    29      44
log(sequences)  6     6.7     7     7.7     8
log(RAM)        0.60  0.88    1.03  1.46    1.65

From the values of the table we obtain a fitting of the RAM in GB as a function of the number of sequences (s) to the equation

log(RAM)=0.53*log(s)-2.65

or equivalently

RAM=(s^0.53)/447
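
As an illustration, the fit can be redone and used to predict RAM for a new input size. This is a minimal Python sketch of an ordinary least-squares fit on the base-10 logarithms of the Table 3 values; the function names are ours. Because it uses unrounded logarithms, the intercept it obtains differs slightly from the -2.65 quoted above, while the slope agrees to two decimals.

```python
import math

# Table 3: number of sequences and measured RAM in GB (12-core jobs).
sequences = [1e6, 5e6, 1e7, 5e7, 1e8]
ram_gb = [4.0, 7.6, 11, 29, 44]

def fit_loglog(xs, ys):
    """Least-squares line through (log10 x, log10 y); returns (slope, intercept)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    sxx = sum((a - mean_x) ** 2 for a in lx)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict_ram_gb(s, slope=0.53, intercept=-2.65):
    """RAM in GB from log(RAM) = slope*log(s) + intercept (coefficients from the text)."""
    return 10 ** (slope * math.log10(s) + intercept)

slope, intercept = fit_loglog(sequences, ram_gb)
print(round(slope, 2), round(intercept, 2))
print(round(predict_ram_gb(2e8), 1))  # extrapolation to 200 million sequences
```

Predictions beyond 10^8 sequences are extrapolations outside the measured range and should be treated with caution.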

Conclusion

The memory usage is lower than that of other assemblers such as Velvet; see the report Velvet performance in the machines of the Computing Service of the UPV/EHU and the article we have published comparing the two. In addition, the MPI parallelization of ABySS makes it possible to aggregate the RAM of several nodes to run larger calculations.

More information

ABySS web page.
Velvet assembler.
Velvet performance in the machines of the Computing Service of the UPV/EHU report.

Velvet and ABySS performance in the machines of the Computing Service of the UPV/EHU, post in the hpc blog.

Clean reads

General information

Version 0.2.2. clean_reads cleans NGS (Sanger, 454, Illumina and SOLiD) reads. It can trim:

  • bad quality regions
  • adaptors
  • vectors
  • regular expressions

It also filters out reads that do not meet minimum quality criteria based on the sequence length and the mean quality. It can run in parallel.

How to use

To submit clean_reads jobs to the queue system execute the command

send_clean_reads

It will ask a few questions to build the script and submit it to the queue.

Performance

clean_reads can be run in parallel and scales well up to 8 cores. With 12 cores the performance is very poor. The table below shows the results of the benchmark, which was run on a 12-core node with Xeon E5645 processors.

Execution time in seconds as a function of the number of cores
Cores            1     4    8    12
Time (s)         1600  422  246  238
Speedup          1     3.8  6.5  6.7
Performance (%)  100   95   81   56

The command used was

clean_reads -i in.fastq -o out.fastq -p illumina -f fastq -g fastq -a a.fna -d UniVec -n 20 --qual_threshold=20 --only_3_end False -m 60 -t 12

More information

clean_reads web page.

Velvet

General information

Version 1.2.03. Velvet is a set of algorithms manipulating de Bruijn graphs for genomic and de novo transcriptomic sequence assembly. It was designed for short-read sequencing technologies, such as Solexa or 454 sequencing, and was developed by Daniel Zerbino and Ewan Birney at the European Bioinformatics Institute. The tool takes in short read sequences, removes errors and then produces high-quality unique contigs. It then uses paired-end read and long read information, when available, to retrieve the repeated areas between contigs.

See also ABySS, which is also installed, and the article we have published comparing the two.

How to use

To run velveth or velvetg, add the corresponding command to your scripts for the Torque queue system:

/software/bin/velvet/velveth [velvet options]
/software/bin/velvet/velvetg [velvet options]

Performance

Velvet has been compiled with parallel support through OpenMP. We have measured its performance, and the results are available in the report on Velvet performance in the machines of the Computing Service of the UPV/EHU. Velvet uses a huge amount of RAM for large calculations, and we have measured that as well. The report derives some simple formulas to predict RAM usage from the input files, so that researchers can know the RAM they need before starting the calculations and plan their research accordingly.

See also ABySS, which is also installed, and the article we have published comparing the two.

More information

Velvet web page.
Velvet performance in the machines of the Computing Service of the UPV/EHU report.
Velvet and ABySS performance in the machines of the Computing Service of the UPV/EHU, post in the hpc blog.

BLAST

General Information

Version 2.2.24 of NCBI BLAST. For performance reasons it has not been installed on the Itanium nodes.

Databases

The Service has installed several databases; contact the technicians to use them or to have new ones installed.

How to use

To submit jobs to the queue system we strongly recommend using the command

send_blast

which asks some questions to prepare the job.

Performance and gpuBLAST

We have compared BLAST with mpiBLAST and gpuBLAST; the results of the benchmarks are in the blog of the Service. mpiBLAST is installed in the Service.

 

More information

BLAST web page.

Blast2GO and mpiBLAST are also installed.

Genepop

General information

Version 4.1.

Genepop is a population genetics software package with options for the following analyses: Hardy-Weinberg equilibrium, linkage disequilibrium, population differentiation, effective number of migrants, and Fst or other correlations.

How to use

To run Genepop in the queue system, include in your queue script:

/software/bin/Genepop < input_file

where input_file contains the options for Genepop, i.e., the answers you would give Genepop when it runs in interactive mode. We recommend using qsub in interactive mode to submit the jobs.

 

More information

Genepop web page.

CLUMPP

General information

Version 1.1.3. CLUMPP is a program that deals with label switching and multimodality problems in population-genetic cluster analyses. CLUMPP permutes the clusters output by independent runs of clustering programs such as structure, so that they match up as closely as possible. The user can choose one of three algorithms for aligning replicates, with a tradeoff between speed and similarity to the optimal alignment.

How to use

To run CLUMPP in the queue system, include in your queue script:

/software/bin/CLUMPP

with the corresponding CLUMPP options. We recommend using qsub in interactive mode to submit the jobs.

 

More information

CLUMPP web page.

Structure

General information

Version 2.33.

The program structure is a free software package for using multi-locus genotype data to investigate population structure. Its uses include inferring the presence of distinct populations, assigning individuals to populations, studying hybrid zones, identifying migrants and admixed individuals, and estimating population allele frequencies in situations where many individuals are migrants or admixed. It can be applied to most of the commonly used genetic markers, including SNPs, microsatellites, RFLPs and AFLPs.

How to use

To launch the graphical user interface, run the following on Péndulo, Maiz or Guinness:

structure

To run graphical applications, read how to connect to Arina.

To run structure in the queue system, include in your queue script:

/software/bin/structure

with the corresponding structure options. We recommend using qsub in interactive mode to submit the jobs.

 

More information

Structure web page.