Ray - Online in the Cloud

NAME


Ray - assemble genomes in parallel using the message-passing interface

SYNOPSIS


mpiexec -n NUMBER_OF_RANKS Ray -k KMERLENGTH -p l1_1.fastq l1_2.fastq -p l2_1.fastq l2_2.fastq -o test

mpiexec -n NUMBER_OF_RANKS Ray Ray.conf # with commands in a file

DESCRIPTION


The Ray genome assembler is built on top of RayPlatform, a generic plugin-based
distributed and parallel compute engine that communicates through the
message-passing interface (MPI).

Ray targets several applications:

- de novo genome assembly (with Ray vanilla)
- de novo meta-genome assembly (with Ray Meta)
- de novo transcriptome assembly (works, but not tested a lot)
- quantification of contig abundances
- quantification of microbiome consortia members (with Ray Communities)
- quantification of transcript expression
- taxonomy profiling of samples (with Ray Communities)
- gene ontology profiling of samples (with Ray Ontologies)

-help

Displays this help page.

-version

Displays Ray version and compilation options.

Using a configuration file

Ray can be launched with a configuration file:

mpiexec -n 16 Ray Ray.conf

The configuration file can include comments (starting with #).
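
As an illustration, the sketch below renders the contents of a Ray.conf file from the options shown in the SYNOPSIS. The function name render_ray_conf and the one-option-per-line layout are assumptions made for this example; this page only states that the file holds the commands and may contain comments starting with #.

```python
def render_ray_conf(kmer_length, paired_libraries, output_directory):
    """Return the text of a Ray configuration file.

    Assumption: one option per line, using the same flags as the command line.
    """
    lines = ["# Ray configuration file (comments start with #)"]
    lines.append(f"-k {kmer_length}")
    for left, right in paired_libraries:
        # One -p entry per paired-end library.
        lines.append(f"-p {left} {right}")
    lines.append(f"-o {output_directory}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = render_ray_conf(
        31,
        [("l1_1.fastq", "l1_2.fastq"), ("l2_1.fastq", "l2_2.fastq")],
        "test",
    )
    print(text, end="")
```

The rendered text can be saved as Ray.conf and passed to Ray as in the command above.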

K-mer length

-k kmerLength

Selects the length of k-mers. The default value is 21. It must be odd because
reverse-complement vertices are stored together. The maximum length is defined at
compilation by MAXKMERLENGTH. Larger k-mers use more memory.
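
The constraints on -k can be checked before a job is submitted. The following is a minimal sketch, not part of Ray: check_kmer_length is a hypothetical helper, the default limit of 32 is the MAXKMERLENGTH reported for the build described on this page, and treating that limit as inclusive is an assumption.

```python
def check_kmer_length(k, max_kmer_length=32):
    """Return a list of problems with a k-mer length; an empty list means it looks valid.

    max_kmer_length mirrors the compile-time MAXKMERLENGTH (assumed inclusive).
    """
    problems = []
    if k < 1:
        problems.append("k must be a positive integer")
    if k % 2 == 0:
        problems.append("k must be odd")
    if k > max_kmer_length:
        problems.append(f"k exceeds MAXKMERLENGTH ({max_kmer_length})")
    return problems


if __name__ == "__main__":
    for k in (21, 22, 31, 33):
        print(k, check_kmer_length(k) or "ok")
```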

Inputs

-p leftSequenceFile rightSequenceFile [averageOuterDistance standardDeviation]

Provides two files containing paired-end reads. averageOuterDistance and
standardDeviation are automatically computed if not provided.

-i interleavedSequenceFile [averageOuterDistance standardDeviation]

Provides one file containing interleaved paired-end reads. averageOuterDistance
and standardDeviation are automatically computed if not provided.

-s sequenceFile

Provides a file containing single-end reads.
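
The input options can be combined freely on one command line. The sketch below assembles an argument list suitable for Python's subprocess.run; paired_option and ray_command are hypothetical helper names, and the distance values in the usage example are made up for illustration.

```python
def paired_option(left, right, average_outer_distance=None, standard_deviation=None):
    """Build a -p entry; the two distance values are optional but go together."""
    args = ["-p", left, right]
    given = (average_outer_distance is not None, standard_deviation is not None)
    if any(given):
        if not all(given):
            raise ValueError("provide both averageOuterDistance and standardDeviation")
        args += [str(average_outer_distance), str(standard_deviation)]
    return args


def ray_command(ranks, kmer_length, output_directory, input_options):
    """Return the full mpiexec argument list for one Ray run."""
    command = ["mpiexec", "-n", str(ranks), "Ray", "-k", str(kmer_length)]
    for option in input_options:
        command += option
    command += ["-o", output_directory]
    return command


if __name__ == "__main__":
    inputs = [
        paired_option("l1_1.fastq", "l1_2.fastq"),
        paired_option("l2_1.fastq", "l2_2.fastq", 200, 20),
        ["-s", "single.fastq"],
    ]
    print(" ".join(ray_command(16, 31, "test", inputs)))
```

Building a list rather than a single string avoids shell quoting problems when file names contain spaces.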

Outputs

-o outputDirectory

Specifies the directory for output files. The default is RayOutput.

Assembly options (defaults work well)

-disable-recycling

Disables read recycling during the assembly. Reads will be set free in 3 cases:

1. the distance did not match for a pair
2. the read has not met its mate
3. the library population indicates a wrong placement

See: Constrained traversal of repeats with paired sequences. Sebastien Boisvert,
Elenie Godzaridis, Francois Laviolette & Jacques Corbeil. First Annual RECOMB
Satellite Workshop on Massively Parallel Sequencing, March 26-27 2011, Vancouver,
BC, Canada.

-disable-scaffolder

Disables the scaffolder.

-minimum-contig-length minimumContigLength

Changes the minimum contig length. The default is 100 nucleotides.

-color-space

Runs in color-space. Needs csfasta files. Activated automatically if csfasta files
are provided.

-use-maximum-seed-coverage maximumSeedCoverageDepth

Ignores any seed with a coverage depth above this threshold. The default is
4294967295.

-use-minimum-seed-coverage minimumSeedCoverageDepth

Sets the minimum seed coverage depth. Any path with a coverage depth lower than
this will be discarded. The default is 0.
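
The combined effect of the two seed coverage thresholds can be pictured as a simple range filter. This is only an illustration of the documented behaviour with the documented defaults, not Ray's internal code; filter_seeds and the example seed names are hypothetical.

```python
def seed_is_kept(coverage_depth, minimum_seed_coverage=0, maximum_seed_coverage=4294967295):
    """A seed is kept when its depth is neither below the minimum nor above the maximum."""
    return minimum_seed_coverage <= coverage_depth <= maximum_seed_coverage


def filter_seeds(seed_depths, minimum_seed_coverage=0, maximum_seed_coverage=4294967295):
    """Keep only the seeds whose coverage depth lies within both thresholds."""
    return {
        name: depth
        for name, depth in seed_depths.items()
        if seed_is_kept(depth, minimum_seed_coverage, maximum_seed_coverage)
    }


if __name__ == "__main__":
    depths = {"seed-a": 3, "seed-b": 50, "seed-c": 900}
    print(filter_seeds(depths, minimum_seed_coverage=5, maximum_seed_coverage=500))
```

With the default values (0 and 4294967295), no seed is filtered out.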

Distributed storage engine (all these values are for each MPI rank)

-bloom-filter-bits bits

Sets the number of bits for the Bloom filter. The default is 268435456 bits; 0 bits
disables the Bloom filter.
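
Because the values in this section apply to each MPI rank, the Bloom filter memory adds up across ranks. The short calculation below converts the bit count to mebibytes; bloom_filter_mebibytes is a hypothetical helper, and the figure covers only the filter's bit array.

```python
def bloom_filter_mebibytes(bits_per_rank=268435456, ranks=1):
    """Total size of the Bloom filter bit arrays across all ranks, in MiB."""
    total_bytes = (bits_per_rank // 8) * ranks
    return total_bytes / (1024 * 1024)


if __name__ == "__main__":
    print(bloom_filter_mebibytes())          # default, one rank
    print(bloom_filter_mebibytes(ranks=16))  # default, 16 ranks
```

With the default of 268435456 bits (2^28), each rank uses 32 MiB, so 16 ranks use 512 MiB in total.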

-hash-table-buckets buckets

Sets the initial number of buckets. Must be a power of 2. Default value:
268435456

-hash-table-buckets-per-group buckets

Sets the number of buckets per group for sparse storage. Default value: 64. Must be
>= 1 and <= 64.

-hash-table-load-factor-threshold threshold

Sets the load factor threshold for real-time resizing. Default value: 0.75. Must be
>= 0.5 and < 1.
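
The three hash table constraints above can be verified with a few lines of code. This is a minimal sketch using the documented defaults; check_hash_table_options is a hypothetical helper and is not part of Ray.

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ... (a positive integer with exactly one bit set)."""
    return n > 0 and (n & (n - 1)) == 0


def check_hash_table_options(buckets=268435456, buckets_per_group=64,
                             load_factor_threshold=0.75):
    """Return a list of violated constraints; an empty list means all values are valid."""
    problems = []
    if not is_power_of_two(buckets):
        problems.append("-hash-table-buckets must be a power of 2")
    if not 1 <= buckets_per_group <= 64:
        problems.append("-hash-table-buckets-per-group must be >= 1 and <= 64")
    if not 0.5 <= load_factor_threshold < 1:
        problems.append("-hash-table-load-factor-threshold must be >= 0.5 and < 1")
    return problems


if __name__ == "__main__":
    print(check_hash_table_options())  # the defaults
    print(check_hash_table_options(buckets=1000, load_factor_threshold=1.0))
```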

-hash-table-verbosity

Activates verbosity for the distributed storage engine.

Biological abundances

-search searchDirectory

Provides a directory containing fasta files to be searched in the de Bruijn graph.
Biological abundances will be written to RayOutput/BiologicalAbundances. See
Documentation/BiologicalAbundances.txt

-one-color-per-file

Sets one color per file instead of one per sequence. By default, each sequence in
each file has a different color. For files with large numbers of sequences, using
one single color per file may be more efficient.

Taxonomic profiling with colored de Bruijn graphs

-with-taxonomy Genome-to-Taxon.tsv TreeOfLife-Edges.tsv Taxon-Names.tsv

Provides a taxonomy. Computes and writes detailed taxonomic profiles. See
Documentation/Taxonomy.txt for details.

-gene-ontology OntologyTerms.txt Annotations.txt

Provides an ontology and annotations. OntologyTerms.txt is fetched from
http://geneontology.org and Annotations.txt is a 2-column file (EMBL_CDS handle &
gene ontology identifier). See Documentation/GeneOntology.txt

Other outputs

-enable-neighbourhoods

Computes contig neighborhoods in the de Bruijn graph. Output file:
RayOutput/NeighbourhoodRelations.txt

-amos

Writes the AMOS file called RayOutput/AMOS.afg. An AMOS file contains read positions
on contigs. It can be opened with software that has a graphical user interface.

-write-kmers

Writes the k-mer graph to RayOutput/kmers.txt. The resulting file is not used by
Ray and is very large.

-write-read-markers

Writes read markers to disk.

-write-seeds

Writes seed DNA sequences to RayOutput/Rank<rank>.RaySeeds.fasta

-write-extensions

Writes extension DNA sequences to RayOutput/Rank<rank>.RayExtensions.fasta

-write-contig-paths

Writes contig paths with coverage values to RayOutput/Rank<rank>.RayContigPaths.txt

-write-marker-summary

Writes marker statistics.

Memory usage

-show-memory-usage

Shows memory usage. Data is fetched from /proc on GNU/Linux. Needs __linux__.

-show-memory-allocations

Shows memory allocation events

Algorithm verbosity

-show-extension-choice

Shows the choice made (with other choices) during the extension.

-show-ending-context

Shows the ending context of each extension. Shows the children of the vertex where
extension was too difficult.

-show-distance-summary

Shows summary of outer distances used for an extension path.

-show-consensus

Shows the consensus when a choice is made.

Checkpointing

-write-checkpoints checkpointDirectory

Write checkpoint files

-read-checkpoints checkpointDirectory

Read checkpoint files

-read-write-checkpoints checkpointDirectory

Read and write checkpoint files

Message routing for large numbers of cores

-route-messages

Enables the Ray message router. Disabled by default. Messages will be routed
so that any rank communicates directly with only a few others.
Without -route-messages, any rank can communicate directly with any other rank.
Files generated: Routing/Connections.txt, Routing/Routes.txt,
Routing/RelayEvents.txt and Routing/Summary.txt

-connection-type type

Sets the connection type for routes. Accepted values are debruijn, hypercube,
polytope, group, random, kautz and complete. Default is debruijn.

debruijn: a full de Bruijn graph for a given alphabet and diameter
hypercube: a hypercube; the alphabet is {0,1} and the number of vertices is a power of 2
polytope: a convex regular polytope; the alphabet is {0,1,...,B-1} and the number of
vertices is a power of B
group: silly model where one representative per group can communicate with outsiders
random: Erdos-Renyi model
kautz: a full Kautz graph, which is a subgraph of a de Bruijn graph
complete: a full graph with all the possible connections

With the type debruijn, the number of ranks must be a power of some integer.
Examples: 256 = 16*16, 512 = 8*8*8, 49 = 7*7, and so on. Otherwise, use another
connection type instead of debruijn.

With the type kautz, the number of ranks n must be n=(k+1)*k^(d-1) for some k and d.
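
Whether a given number of ranks fits the debruijn or kautz connection types can be checked by enumeration. The sketch below is an illustration only: both function names are hypothetical, and restricting the search to bases of at least 2 and to exponents or diameters of at least 2 is an assumption, since this page does not state the exact bounds Ray accepts.

```python
def perfect_power_decompositions(n):
    """Return (base, exponent) pairs with base ** exponent == n, base >= 2, exponent >= 2."""
    results = []
    base = 2
    while base * base <= n:
        value, exponent = base * base, 2
        while value < n:
            value *= base
            exponent += 1
        if value == n:
            results.append((base, exponent))
        base += 1
    return results


def kautz_decompositions(n):
    """Return (k, d) pairs with n == (k + 1) * k ** (d - 1), k >= 2, d >= 2."""
    results = []
    k = 2
    while (k + 1) * k <= n:
        value, d = (k + 1) * k, 2
        while value < n:
            value *= k
            d += 1
        if value == n:
            results.append((k, d))
        k += 1
    return results


if __name__ == "__main__":
    for ranks in (49, 256, 512, 48, 12):
        print(ranks, perfect_power_decompositions(ranks), kautz_decompositions(ranks))
```

An empty list from perfect_power_decompositions suggests that debruijn routing is not applicable for that number of ranks, and likewise for kautz_decompositions and kautz routing.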

-routing-graph-degree degree

Specifies the outgoing degree for the routing graph. See Documentation/Routing.txt

Hardware testing

-test-network-only

Tests the network and returns.

-write-network-test-raw-data

Writes one additional file per rank detailing the network test.

-exchanges NumberOfExchanges

Sets the number of exchanges

-disable-network-test

Skips the network test.

Debugging

-verify-message-integrity

Checks message data reliability for any non-empty message. Add '-D CONFIG_SSE_4_2'
in the Makefile to use the hardware instruction (SSE 4.2).

-run-profiler

Runs the profiler as the code runs. By default, only shows granularity warnings.
Running the profiler increases running times.

-with-profiler-details

Shows the number of messages sent and received in each method during each time
slice (epoch). Needs -run-profiler.

-show-communication-events

Shows all messages sent and received.

-show-read-placement

Shows read placement in the graph during the extension.

-debug-bubbles

Debugs bubble code. Bubbles can be due to heterozygous sites, sequencing errors
or other (unknown) events.

-debug-seeds

Debugs seed code. Seeds are paths in the graph that are likely unique.

-debug-fusions

Debugs fusion code.

-debug-scaffolder

Debugs the scaffolder.

FILES

Input files

Note: the file format is determined from the file extension.

.fasta
.fasta.gz (needs HAVE_LIBZ=y at compilation)
.fasta.bz2 (needs HAVE_LIBBZ2=y at compilation)
.fastq
.fastq.gz (needs HAVE_LIBZ=y at compilation)
.fastq.bz2 (needs HAVE_LIBBZ2=y at compilation)
.sff (paired reads must be extracted manually)
.csfasta (color-space reads)
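
Since the format is inferred from the extension, input file names can be screened before launching a run. The sketch below maps each listed extension to the build option it requires; required_build_option is a hypothetical helper, and the table simply restates the list above.

```python
# Extension -> build option needed at compilation (None when no option is needed).
INPUT_FORMATS = {
    ".fasta": None,
    ".fasta.gz": "HAVE_LIBZ=y",
    ".fasta.bz2": "HAVE_LIBBZ2=y",
    ".fastq": None,
    ".fastq.gz": "HAVE_LIBZ=y",
    ".fastq.bz2": "HAVE_LIBBZ2=y",
    ".sff": None,
    ".csfasta": None,
}


def required_build_option(filename):
    """Return (extension, build option) for a supported input file name.

    Raises ValueError when the extension is not in the list of input formats.
    """
    # Try longer extensions first so that compound suffixes are matched whole.
    for extension in sorted(INPUT_FORMATS, key=len, reverse=True):
        if filename.endswith(extension):
            return extension, INPUT_FORMATS[extension]
    raise ValueError(f"unsupported input file extension: {filename}")


if __name__ == "__main__":
    for name in ("l1_1.fastq", "l1_1.fastq.gz", "reads.csfasta"):
        print(name, required_build_option(name))
```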

Output files

Scaffolds

RayOutput/Scaffolds.fasta

The scaffold sequences in FASTA format

RayOutput/ScaffoldComponents.txt

The components of each scaffold

RayOutput/ScaffoldLengths.txt

The length of each scaffold

RayOutput/ScaffoldLinks.txt

Scaffold links

Contigs

RayOutput/Contigs.fasta

Contiguous sequences in FASTA format

RayOutput/ContigLengths.txt

The lengths of contiguous sequences

Summary

RayOutput/OutputNumbers.txt

Overall numbers for the assembly

de Bruijn graph

RayOutput/CoverageDistribution.txt

The distribution of coverage values

RayOutput/CoverageDistributionAnalysis.txt

Analysis of the coverage distribution

RayOutput/degreeDistribution.txt

Distribution of ingoing and outgoing degrees

RayOutput/kmers.txt

k-mer graph, required option: -write-kmers

The resulting file is not utilised by Ray. The resulting file is very large.

Assembly steps

RayOutput/SeedLengthDistribution.txt

Distribution of seed length

RayOutput/Rank<rank>.OptimalReadMarkers.txt

Read markers.

RayOutput/Rank<rank>.RaySeeds.fasta

Seed DNA sequences, required option: -write-seeds

RayOutput/Rank<rank>.RayExtensions.fasta

Extension DNA sequences, required option: -write-extensions

RayOutput/Rank<rank>.RayContigPaths.txt

Contig paths with coverage values, required option: -write-contig-paths

Paired reads

RayOutput/LibraryStatistics.txt

Estimation of outer distances for paired reads

RayOutput/Library<LibraryNumber>.txt

Frequencies for observed outer distances (insert size + read lengths)

Partition

RayOutput/NumberOfSequences.txt

Number of reads in each file

RayOutput/SequencePartition.txt

Sequence partition

Ray software

RayOutput/RayVersion.txt

The version of Ray

RayOutput/RayCommand.txt

The exact command that was provided

AMOS

RayOutput/AMOS.afg

Assembly representation in AMOS format, required option: -amos

Communication

RayOutput/MessagePassingInterface.txt

Number of messages sent

RayOutput/NetworkTest.txt

Latencies in microseconds

RayOutput/Rank<rank>NetworkTestData.txt

Network test raw data

DOCUMENTATION

- mpiexec -n 1 Ray -help|less (always up-to-date)
- This help page (always up-to-date)
- The directory Documentation/
- Manual (Portable Document Format): InstructionManual.tex (in Documentation)
- Mailing list archives:
http://sourceforge.net/mailarchive/forum.php?forum_name=denovoassembler-users

AUTHOR

Written by Sebastien Boisvert.

REPORTING BUGS

Report bugs to denovoassembler-users@lists.sourceforge.net

Home page: <http://denovoassembler.sourceforge.net/>

COPYRIGHT

This program is free software: you can redistribute it and/or modify it under the
terms of the GNU General Public License as published by the Free Software
Foundation, version 3 of the License.

This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE. See the GNU General Public License for more details.

You have received a copy of the GNU General Public License along with this program
(see LICENSE).

Ray 2.1.0

License for Ray: GNU General Public License version 3
RayPlatform version: 1.1.0
License for RayPlatform: GNU Lesser General Public License version 3

MAXKMERLENGTH: 32
KMER_U64_ARRAY_SIZE: 1
Maximum coverage depth stored by CoverageDepth: 4294967295
MAXIMUM_MESSAGE_SIZE_IN_BYTES: 4000 bytes
FORCE_PACKING = n
ASSERT = n
HAVE_LIBZ = y
HAVE_LIBBZ2 = y
CONFIG_PROFILER_COLLECT = n
CONFIG_CLOCK_GETTIME = n
__linux__ = y
_MSC_VER = n
__GNUC__ = y
RAY_32_BITS = n
RAY_64_BITS = y
MPI standard version: MPI 2.1
MPI library: Open-MPI 1.4.2
Compiler: GNU gcc/g++ 4.4.5
