Category Archives: Genetics

R and RStudio

February 10, 2017 | Genetics, Scientific Software, Software | admin

General information

R 3.3.3 is a freely available language and environment for statistical computing and graphics which provides a wide variety of statistical and graphical techniques: linear and nonlinear modelling, statistical tests, time series analysis, classification, clustering, etc. Please consult the R project homepage for further information.

RStudio and RCommander are graphical front ends for R.

Installed packages

abind, ape, biomformat, cummeRbund, DCGL, DESeq2, DEXSeq, e1071, edgeR, FactoMineR, GEOquery, lavaan, metagenomeSeq, mnormt, optparse, psych, randomForest, Rcmdr, RColorBrewer, ReactomePA, RUVSeq, vegan, WGCNA, xlsx.

If you need any other package, please ask us.

How to use

To run R, use the following in your queue system scripts:

/software/bin/R CMD BATCH R-input-file.R

RStudio must be used through X2Go so that it can open windows. Connect to Katramila or Txinparta and run:

rstudio

RCommander must also be used through X2Go so that it can open windows. Connect to Katramila or Txinparta and run R. Then load it from within R:

library(Rcmdr)

More information

R web page.

RStudio web page.

IDBA-UD

September 11, 2015 | Genetics, Scientific Software | admin

General information

IDBA-UD 1.1.1 is an iterative de Bruijn graph de novo assembler for short-read sequencing data with highly uneven sequencing depth. It is an extension of the IDBA algorithm. IDBA-UD also iterates from a small k to a large k. In each iteration, short and low-depth contigs are removed iteratively, with the cutoff threshold rising from low to high, to reduce errors in both low-depth and high-depth regions. Paired-end reads are aligned to contigs and assembled locally to generate some of the k-mers missing in low-depth regions. With these techniques, IDBA-UD can iterate the k value of the de Bruijn graph up to a very large value with fewer gaps and fewer branches, forming long contigs in both low-depth and high-depth regions.

How to use

To send jobs to the queue you can use the command

send_idba-ud

which after a few questions configures the job.

Performance

IDBA-UD performs and scales well up to 8 cores. Beyond that we did not measure any improvement. The benchmark used the --mink 40 --step 20 options. When we decreased the step, the scaling got worse. This trend can also be seen in the second table.

Cores	Time (s)	Speed up (1-core base)	Performance (%, 1-core base)	Speed up (2-core base)	Performance (%, 2-core base)
1	480	1	100
2	296	1.6	81	1.0	100
4	188	2.6	64	1.6	79
8	84	5.7	71	3.5	88
12	92	5.2	43	3.2	54
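
The speed-up and performance columns follow directly from the measured times. As a minimal sketch (the function name `scaling_table` and the dictionary layout are illustrative, not part of IDBA-UD or of the service tooling), they can be recomputed in Python, taking the speed-up as base time over measured time and the performance as that speed-up divided by the ideal one:

```python
def scaling_table(times, base_cores):
    """Return {cores: (speedup, performance_percent)} relative to base_cores.

    times maps a core count to the measured wall time in seconds.
    Rows with fewer cores than the base run are skipped.
    """
    base_time = times[base_cores]
    table = {}
    for cores in sorted(times):
        if cores < base_cores:
            continue
        speedup = base_time / times[cores]
        # The ideal speed-up relative to the base run is cores / base_cores.
        performance = 100.0 * speedup * base_cores / cores
        table[cores] = (round(speedup, 1), round(performance))
    return table


# Times (s) from the first IDBA-UD benchmark table.
idba_times = {1: 480, 2: 296, 4: 188, 8: 84, 12: 92}

one_core_base = scaling_table(idba_times, base_cores=1)
two_core_base = scaling_table(idba_times, base_cores=2)

for cores, (speedup, performance) in one_core_base.items():
    print(f"{cores:>2} cores: speed-up {speedup}, performance {performance}%")
```

The same function can be applied to the other timing tables by replacing the dictionary of times.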

The second benchmark was run on a bigger file, with 10 million bases, and the --mink 20 --step 10 --min_support 2 options. The behaviour is more regular than in the previous benchmark, and the parallelization is good up to 4 cores.

Cores	Time (s)	Speed up	Performance (%)
1	13050	1	100
2	6675	2.0	98
4	3849	3.4	85
8	3113	4.2	52
16	2337	5.6	35
20	2409	5.4	27

More information

IDBA-UD web page.

SPAdes

September 9, 2015 | Genetics, Scientific Software, Uncategorized | admin

General information

SPAdes 3.6.0 (St. Petersburg genome assembler) is intended for both standard isolates and single-cell MDA bacteria assemblies. It works with Illumina or IonTorrent reads and is capable of providing hybrid assemblies using PacBio, Oxford Nanopore and Sanger reads. You can also provide additional contigs that will be used as long reads. It supports paired-end reads, mate-pairs and unpaired reads, and can take several paired-end and mate-pair libraries as input simultaneously. Note that SPAdes was initially designed for small genomes. It was tested on single-cell and standard bacterial and fungal data sets.

How to use

To send jobs to the queue you can use the

send_spades

command, which asks a few questions to configure the job.

Performance

We have not measured any performance improvement or time reduction when using several cores in a standard calculation like:

spades.py --pe1-1 file1 --pe1-2 file2 -o outdir

We recommend using 1 core unless you know that your calculation performs better with several cores.

More information

Web page of SPAdes.

MetAMOS

August 5, 2015 | Genetics, Scientific Software | admin

General information

MetAMOS (version 1.5rc3) represents a focused effort to create automated, reproducible, traceable assembly and analysis infused with current best practices and state-of-the-art methods. As input, MetAMOS can start with next-generation sequencing reads or assemblies; as output, it produces assembly reports, genomic scaffolds, open reading frames, variant motifs, taxonomic or functional annotations, Krona charts and an HTML report.

How to use

To send a job to the queue system there is the

send_metamos

command, where you answer a few questions to set up the job. Take into account that MetAMOS uses a lot of RAM, about 1 GB per million reads.

More information

MetAMOS web page.

QIIME

August 4, 2015 | Genetics, Scientific Software | admin

General information

QIIME (Quantitative Insights Into Microbial Ecology) is an open-source bioinformatics pipeline for performing microbiome analysis from raw DNA sequencing data. QIIME is designed to take users from raw sequencing data generated on the Illumina or other platforms through publication-quality graphics and statistics. This includes demultiplexing and quality filtering, OTU picking, taxonomic assignment, phylogenetic reconstruction, and diversity analyses and visualizations. QIIME has been applied to studies based on billions of sequences from tens of thousands of samples.

How to use

To send QIIME jobs run the command

send_qiime

and answer the questions.

USEARCH

QIIME can use the USEARCH package.

More information

QIIME home page.

USEARCH.

USEARCH

June 8, 2015 | Genetics, Scientific Software | admin

General information

USEARCH is a unique sequence analysis tool that offers search and clustering algorithms that are often orders of magnitude faster than BLAST. We have the free 32-bit version, which cannot be distributed to third parties and is limited to 4 GB of RAM.

How to use

To use USEARCH execute

/software/bin/usearch

for example

/software/bin/usearch -cluster_otus data.fa -otus otus.fa -uparseout out.up -relabel OTU_ -sizein -sizeout

USEARCH is only available on the xeon20 type nodes.

QIIME

USEARCH can be used under QIIME.

More information

USEARCH home page.

QIIME.

SAMtools, BCFtools and HTSlib 1.2

May 11, 2015 | Genetics, Scientific Software, Software | admin

General Information

Samtools is a suite of programs for interacting with high-throughput sequencing data. It consists of three separate repositories:

Samtools: reading/writing/editing/indexing/viewing SAM/BAM/CRAM format.
BCFtools: reading/writing BCF2/VCF/gVCF files and calling/filtering/summarising SNP and short indel sequence variants.
HTSlib: a C library for reading/writing high-throughput sequencing data.

Samtools and BCFtools both use HTSlib internally, but these source packages contain their own copies of htslib so they can be built independently.

How to use

They are installed in /software/samtools-1.2/, /software/bcftools-1.2/ and /software/htslib-1.2.1 respectively.

Add something like the following to your PBS script:

export PATH=/software/samtools-1.2/bin:/software/bcftools-1.2/bin:$PATH

export LD_LIBRARY_PATH=/software/htslib-1.2.1/lib:$LD_LIBRARY_PATH

More Information

http://www.htslib.org/

Trinity

November 12, 2013 | Genetics, Scientific Software | admin

General information

Trinity (2.1.1 release) represents a novel method for the efficient and robust de novo reconstruction of transcriptomes from RNA-seq data. Trinity combines three independent software modules, Inchworm, Chrysalis and Butterfly, applied sequentially to process large volumes of RNA-seq reads. Trinity partitions the sequence data into many individual de Bruijn graphs, each representing the transcriptional complexity at a given gene or locus, and then processes each graph independently to extract full-length splicing isoforms and to tease apart transcripts derived from paralogous genes. Briefly, the process works like so:

Inchworm assembles the RNA-seq data into the unique sequences of transcripts, often generating full-length transcripts for a dominant isoform, but then reports just the unique portions of alternatively spliced transcripts.

Chrysalis clusters the Inchworm contigs into clusters and constructs complete de Bruijn graphs for each cluster. Each cluster represents the full transcriptional complexity for a given gene (or sets of genes that share sequences in common). Chrysalis then partitions the full read set among these disjoint graphs.

Butterfly then processes the individual graphs in parallel, tracing the paths that reads and pairs of reads take within the graph, ultimately reporting full-length transcripts for alternatively spliced isoforms, and teasing apart transcripts that correspond to paralogous genes.

How to use

You can use the

send_trinity

command to submit jobs to the queue system. After you answer a few questions, a script is created and submitted to the queue system. Advanced users can also use it to generate a sample script.

Performance

Trinity can be run in parallel, but it is not very efficient above 4 cores, as can be seen in the table. Trinity also consumes large amounts of RAM.

Performance of Trinity
Cores	1	4	8	12
Time	5189	2116	1754	1852
Speedup	1	2.45	2.96	2.80
Efficiency (%)	100	61	37	23

More information

Trinity web page.

ABySS

March 27, 2012 | Genetics, Scientific Software | admin

General Information

Version 1.3.2 of ABySS (Assembly By Short Sequences). ABySS is a de novo, paired-end sequence assembler designed for short reads, and it can be executed in parallel.

See also Velvet, which is installed as well; we have published an article comparing both.

How to use

The executables can be found in /software/abyss/bin. To run ABySS from a script, add the line:

/software/abyss/bin/abyss-pe [abyss-pe options]

Performance

See also Velvet, which is installed as well; we have published an article comparing both.

Parallelization

Some benchmarks have been performed with ABySS, using a file from an Illumina HiSeq2000 sequencer with 100 bp per sequence. Table 1 shows how ABySS scales as a function of the number of cores. ABySS scales well up to 8 cores. The results hold at least for inputs of more than 10^6 sequences.

Table 1. Execution time of `abyss-pe` in seconds as a function of the number of cores. Speedup and performance are relative to the 2-core run.
cores	2	4	8	12	24
Time (s)	47798	27852	16874	14591	18633
Speedup	1	1.7	2.8	3.3	2.6
Performance (%)	100	86	71	55	21

Execution time

We have also analyzed the execution time as a function of the size of the data. Table 2 shows that going from 1 million to 10 million sequences increases the execution time by a factor of about 10 as well. From 10 to 100 million sequences the time increases somewhat more, by a factor of between 10 and 20. The behavior is therefore roughly linear.

Table 2. Execution time in seconds of `abyss-pe` executed on 2, 4 and 8 cores as a function of the number of processed sequences.
sequences	10^6	10^7	10^8
Time in 2 cores (s)	247	2620	47798
Time in 4 cores (s)	134	1437	27852
Time in 8 cores (s)	103	923	16874
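
The growth factors quoted for Table 2 can be checked with a few lines of Python. This is only a sketch: the variable and function names are illustrative, and the three columns are taken as 1, 10 and 100 million sequences.

```python
# Execution times (s) from Table 2, keyed by core count.
# Each list is ordered by input size: 1, 10 and 100 million sequences.
times_by_cores = {
    2: [247, 2620, 47798],
    4: [134, 1437, 27852],
    8: [103, 923, 16874],
}


def growth_factors(times):
    """Ratio of each time to the previous one (the input grows 10x per step)."""
    return [round(later / earlier, 1) for earlier, later in zip(times, times[1:])]


factors = {cores: growth_factors(times) for cores, times in times_by_cores.items()}

for cores, (first_step, second_step) in factors.items():
    print(f"{cores} cores: x{first_step} from 1M to 10M, x{second_step} from 10M to 100M")
```

For every core count the first ratio is close to 10 and the second lies between 10 and 20.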

RAM memory

In this kind of program, more important than the execution time, which is reasonable, is the RAM usage, which can limit the type of calculation that is feasible. Table 3 shows how the RAM increases as a function of the number of sequences. We also show the base-10 logarithms of the measured values, which were used for a linear regression. The jobs were run on 12 cores.

Table 3. RAM used by `abyss-pe` as a function of the number of processed sequences. The logarithms of the measured values are also shown.
sequences	10^6	5*10^6	10^7	5*10^7	10^8
RAM (GB)	4.0	7.6	11	29	44
log(sequences)	6	6.7	7	7.7	8
log(RAM)	0.60	0.88	1.03	1.46	1.65

Fitting the values in the table gives the RAM in GB as a function of the number of sequences (s):

log(RAM)=0.53*log(s)-2.65

or equivalently

RAM=(s^0.53)/447
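
As a rough check of the fit, the following Python sketch repeats an ordinary least-squares regression on the logarithms listed in Table 3 and wraps the resulting formula in a helper. The function names `fit_line` and `estimate_ram_gb` are illustrative, base-10 logarithms are assumed (as the log(sequences) row suggests), and the 20 million sequence input is a hypothetical example. The estimate is only as good as the five points measured on 12 cores.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Base-10 logarithms as listed in Table 3.
log_sequences = [6, 6.7, 7, 7.7, 8]
log_ram = [0.60, 0.88, 1.03, 1.46, 1.65]

slope, intercept = fit_line(log_sequences, log_ram)
print(f"log(RAM) = {slope:.2f} * log(s) {intercept:+.2f}")


def estimate_ram_gb(sequences):
    """RAM estimate in GB from the fitted formula RAM = s^0.53 / 447."""
    return sequences ** 0.53 / 447


# Hypothetical input of 20 million sequences.
print(f"Estimated RAM: {estimate_ram_gb(20e6):.0f} GB")
```

The regression reproduces the coefficients of the equation above to two decimal places. Since the fitted curve deviates from the measured points, it is prudent to request some margin above the estimate.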

Conclusion

The memory usage is lower than in other assemblers like Velvet; see also the report on Velvet performance in the machines of the Computing Service of the UPV/EHU and the article we have published comparing both. In addition, the MPI parallelization of ABySS makes it possible to aggregate the RAM of several nodes to perform larger calculations.

More information

ABySS web page.
Velvet assembler.
Velvet performance in the machines of the Computing Service of the UPV/EHU report.

Velvet and ABySS performance in the machines of the Computing Service of the UPV/EHU, post in the hpc blog.

Clean reads

March 9, 2012 | Genetics, Scientific Software | admin

General information

Version 0.2.2. clean_reads cleans NGS (Sanger, 454, Illumina and SOLiD) reads. It can trim:

bad quality regions
adaptors
vectors
regular expressions

It also filters out reads that do not meet minimum quality criteria based on the sequence length and the mean quality. It can run in parallel.

How to use

To submit clean_reads jobs to the queue system execute the command

send_clean_reads

It will ask a few questions to build the script and submit it to the queue.

Performance

clean_reads can be executed in parallel and scales well up to 8 cores. With 12 cores the performance is very poor. The table below shows the results of the benchmark, which was run on a 12-core node with Xeon E5645 processors.

Execution time in seconds as a function of the number of cores
cores	1	4	8	12
Time (s)	1600	422	246	238
Speedup	1	3.8	6.5	6.7
Performance (%)	100	95	81	56

The command used was

clean_reads -i in.fastq -o out.fastq -p illumina -f fastq -g fastq -a a.fna -d UniVec -n 20 --qual_threshold=20 --only_3_end False -m 60 -t 12

More information

clean_reads web page.

Velvet

February 22, 2012 | Genetics, Scientific Software | admin

General information

Version 1.2.03. Velvet is a set of algorithms manipulating de Bruijn graphs for genomic and de novo transcriptomic sequence assembly. It was designed for short-read sequencing technologies, such as Solexa or 454 Sequencing, and was developed by Daniel Zerbino and Ewan Birney at the European Bioinformatics Institute. The tool takes in short read sequences, removes errors, then produces high-quality unique contigs. It then uses paired-end read and long read information, when available, to retrieve the repeated areas between contigs.

See also ABySS, which is installed as well; we have published an article comparing both.

How to use

To run velveth or velvetg add in your scripts for the Torque queue system the corresponding command:

/software/bin/velvet/velveth [velvet options]

/software/bin/velvet/velvetg [velvet options]

Performance

Velvet has been compiled with parallel support through OpenMP. We have measured the performance, and the results are available in the report about the Velvet performance in the machines of the Computing Service of the UPV/EHU. Velvet uses huge amounts of RAM for large calculations, and we have measured that as well. The report derives some simple formulas to predict the RAM usage from the input files, so researchers can know the RAM they need before starting the calculations and plan their research accordingly.

See also ABySS, which is installed as well; we have published an article comparing both.

More information

Velvet web page.
Velvet performance in the machines of the Computing Service of the UPV/EHU report.
Velvet and ABySS performance in the machines of the Computing Service of the UPV/EHU, post in the hpc blog.

BLAST

February 21, 2012 | Genetics, Scientific Software | admin

General Information

Version 2.2.24 of BLAST from NCBI. For performance reasons it has not been installed on the Itanium nodes.

Databases

The Service has several databases installed; contact the technicians to use them or to have new ones installed.

How to use

To submit jobs to the queue system we strongly recommend using the command

send_blast

It will ask some questions to prepare the job.

Performance and gpuBLAST

We have compared BLAST with mpiBLAST and gpuBLAST; the results of the benchmarks are in the blog of the Service. mpiBLAST is installed in the Service.

More information

BLAST web page.

Blast2GO and mpiBLAST are also installed.

Genepop

January 26, 2012 | Genetics, Scientific Software | admin

General information

Version 4.1.

Genepop is a population genetics software package with options for the following analyses: Hardy-Weinberg equilibrium, linkage disequilibrium, population differentiation, effective number of migrants, Fst or other correlations.

How to use

To run Genepop in the queue system, include the following in your queue script:

/software/bin/Genepop < input_file

where input_file contains the options for Genepop, i.e., the answers you would give to Genepop when it runs in interactive mode. We recommend using qsub in interactive mode to submit the jobs.

More information

Genepop web page.

CLUMPP

January 26, 2012 | Genetics, Scientific Software | admin

General information

Version 1.1.3. CLUMPP is a program that deals with label switching and multimodality problems in population-genetic cluster analyses. CLUMPP permutes the clusters output by independent runs of clustering programs such as structure, so that they match up as closely as possible. The user has the option of choosing one of three algorithms for aligning replicates, with a tradeoff of speed and similarity to the optimal alignment.

How to use

To run CLUMPP in the queue system, include the following in your queue script:

/software/bin/CLUMPP

with the corresponding CLUMPP options. We recommend using qsub in interactive mode to submit the jobs.

More information

CLUMPP web page.

Structure

January 26, 2012 | Genetics, Scientific Software | admin

General information

Version 2.33.

The program structure is a free software package for using multi-locus genotype data to investigate population structure. Its uses include inferring the presence of distinct populations, assigning individuals to populations, studying hybrid zones, identifying migrants and admixed individuals, and estimating population allele frequencies in situations where many individuals are migrants or admixed. It can be applied to most of the commonly used genetic markers, including SNPs, microsatellites, RFLPs and AFLPs.

How to use

To launch the graphical user interface, run the following on Péndulo, Maiz or Guinness:

structure

To run graphical applications, read how to connect to Arina.

To run structure in the queue system, include the following in your queue script:

/software/bin/structure

with the corresponding structure options. We recommend using qsub in interactive mode to submit the jobs.

More information

Structure web page.