
ABySS

General Information

Version 1.3.2 of ABySS (Assembly By Short Sequences) is installed. ABySS is a de novo, parallel, paired-end sequence assembler designed for short reads, and it can be executed in parallel across several cores or nodes.

See also the installed [intlink id=”6043″ type=”post”]Velvet[/intlink] assembler; we have published an article comparing both.

How to use

The executables can be found in /software/abyss/bin. To run ABySS, include a line of the following form in your job script:

/software/abyss/bin/abyss-pe [abyss-pe options]
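
For example, a minimal sketch of a paired-end assembly (file names and parameter values are illustrative; k is the k-mer size and name the prefix of the output files) could be:

/software/abyss/bin/abyss-pe k=64 name=sample in='reads_1.fastq reads_2.fastq'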

Performance


Parallelization

Some benchmarks have been performed with ABySS, using a data file from an Illumina HiSeq2000 sequencer with reads of 100 bp. Table 1 shows how ABySS scales as a function of the number of cores: it scales well up to 8 cores. These results hold at least for runs with more than 10^6 sequences.

Table 1. Execution time of abyss-pe in seconds as a function of the number of cores.
Cores 2 4 8 12 24
Time (s) 47798 27852 16874 14591 18633
Speedup 1 1.7 2.8 3.3 2.6
Performance (%) 100 86 71 55 21
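
The number of MPI processes used by abyss-pe is selected with its np parameter; as an illustrative sketch (the value 8 and the file names are only examples), an 8-core run would look like:

/software/abyss/bin/abyss-pe np=8 k=64 name=sample in='reads_1.fastq reads_2.fastq'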

Execution time

We have also analyzed the execution time as a function of the size of the data. In Table 2 we observe that when going from 1 million to 10 million sequences the execution time also increases by a factor of 10. From 10 to 100 million sequences the time increases somewhat more, by a factor between 10 and 20. Therefore, the behavior is roughly linear.

Table 2. Execution time in seconds of abyss-pe executed on 2, 4 and 8 cores as a function of the number of processed sequences.
Sequences 10^6 10^7 10^8
Time on 2 cores (s) 247 2620 47798
Time on 4 cores (s) 134 1437 27852
Time on 8 cores (s) 103 923 16874
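
For example, on 8 cores, going from 10^7 to 10^8 sequences increases the time from 923 s to 16874 s, a factor of about 18, within the 10-20 range mentioned above.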

RAM usage

For these kinds of programs, more important than the execution time, which is reasonable, is the RAM usage, which can limit the size of the calculations that are feasible. In Table 3 we observe how the RAM grows as a function of the number of sequences. We also show the logarithms of the measured values, which have been used for a linear regression. The jobs were run on 12 cores.

Table 3. RAM used by abyss-pe as a function of the number of processed sequences. The logarithms of the measured values are also shown.
Sequences 10^6 5*10^6 10^7 5*10^7 10^8
RAM (GB) 4.0 7.6 11 29 44
log(sequences) 6 6.7 7 7.7 8
log(RAM) 0.60 0.88 1.03 1.46 1.65

From the values in the table we obtain a fit of the RAM in GB as a function of the number of sequences (s) to the equation

log(RAM)=0.53*log(s)-2.65

or equivalently

RAM=(s^0.53)/447
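
For example, for 10^8 sequences the fit gives RAM ≈ (10^8)^0.53/447 ≈ 17400/447 ≈ 39 GB, of the same order as the 44 GB measured in Table 3.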

Conclusion

The memory usage is smaller than in other assemblers like [intlink id=”6043″ type=”post”]Velvet[/intlink]; see also the report Velvet performance in the machines of the Computing Service of the UPV/EHU and the article we have published comparing both. In addition, the MPI parallelization of ABySS makes it possible to aggregate the RAM of several nodes in order to perform larger calculations.

More information

  • ABySS web page.
  • [intlink id=”6043″ type=”post”]Velvet[/intlink] assembler.
  • Velvet performance in the machines of the Computing Service of the UPV/EHU report.
  • Velvet and ABySS performance in the machines of the Computing Service of the UPV/EHU, post in the HPC blog.


Clean reads

General information

Version 0.2.2 of clean_reads is installed. clean_reads cleans NGS (Sanger, 454, Illumina and SOLiD) reads. It can trim:

  • bad quality regions
  • adaptors
  • vectors
  • regular expressions

It also filters out the reads that do not meet minimum quality criteria based on the sequence length and the mean quality. It can run in parallel.

How to use

To submit clean_reads jobs to the queue system execute the command

send_clean_reads

It will ask a few questions to build the script and then submit it to the queue.

Performance

clean_reads can be executed in parallel and scales well up to 8 cores; with 12 cores the parallel efficiency is much lower. Table 1 shows the results of the benchmark, which was run on a 12-core node with Xeon E5645 processors.

Table 1. Execution time in seconds as a function of the number of cores.
Cores 1 4 8 12
Time (s) 1600 422 246 238
Speedup 1 3.8 6.5 6.7
Performance (%) 100 95 81 56

The command used was

clean_reads -i in.fastq -o out.fastq -p illumina -f fastq -g fastq -a a.fna -d UniVec -n 20 --qual_threshold=20 --only_3_end False -m 60 -t 12
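
In this command -i and -o are the input and output read files and -p selects the sequencing platform; judging by their names, -a and -d point to the adaptor sequences and the UniVec vector database used for trimming, --qual_threshold=20 sets the quality threshold, and -t 12 presumably sets the number of parallel processes, matching the 12 cores of the node. See the clean_reads web page for the exact meaning of each option.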

More information

clean_reads web page.