ncbi-seg - Online in the Cloud

PROGRAM:

NAME

ncbi-seg - segment sequence(s) by local complexity

SYNOPSIS

ncbi-seg sequence [ W ] [ K(1) ] [ K(2) ] [ -x ] [ options ]

DESCRIPTION

ncbi-seg divides sequences into contrasting segments of low complexity and high
complexity. Low-complexity segments defined by the algorithm represent "simple sequences"
or "compositionally-biased regions".

Locally-optimized low-complexity segments are produced at defined levels of stringency,
based on formal definitions of local compositional complexity (Wootton & Federhen, 1993).
The segment lengths and the number of segments per sequence are determined automatically
by the algorithm.

The input is a FASTA-formatted sequence file, or a database file containing many FASTA-
formatted sequences. ncbi-seg is tuned for amino acid sequences. For nucleotide
sequences, see EXAMPLES OF PARAMETER SETS below.

The stringency of the search for low-complexity segments is determined by three user-
defined parameters: trigger window length [ W ], trigger complexity [ K(1) ] and extension
complexity [ K(2) ] (see PARAMETERS below). The defaults provided are suitable for
low-complexity masking of database search query sequences [ -x option required, see
OPTIONS below].

OUTPUTS AND APPLICATIONS

(1) Readable segmented sequence [Default]. Regions of contrasting complexity are
displayed in "tree format". See EXAMPLES.

(2) Low-complexity masking (see Altschul et al., 1994). Produce a masked FASTA-formatted
file, ready for input as a query sequence for database search programs such as BLAST or
FASTA. The amino acids in low-complexity regions are replaced with "x" characters [-x
option]. See EXAMPLES.

(3) Database construction. Produce FASTA-formatted files containing low-complexity
segments [-l option], or high-complexity segments [-h option], or both [-a option]. Each
segment is a separate sequence entry with an informative header line.

ALGORITHM

The SEG algorithm has two stages: first, identification of approximate raw segments of
low complexity; second, local optimization.

At the first stage, the stringency and resolution of the search for low-complexity
segments are determined by the W, K(1) and K(2) parameters. Every window of length W with
complexity less than or equal to K(1) is identified as a trigger window; trigger windows
may overlap. "Complexity" here is defined by equation (3) of Wootton & Federhen (1993).
Each trigger window is then extended into a contig in both directions by merging with
extension windows, which are overlapping windows of length W and complexity less than or
equal to K(2). Each contig is a raw segment.
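
The first stage can be sketched in a few lines of Python. This is an illustration only, not the ncbi-seg source. It assumes that window complexity is the Shannon entropy of the window's residue composition in bits (consistent with the stated maximum of log[base 2]20, but not checked here against equation (3)), and that a contig is a run of consecutive window start positions whose windows are all at or below K(2). The names window_entropy and raw_segments are hypothetical.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (bits) of the residue composition of one window.

    Assumption: this stands in for the complexity measure of the text.
    """
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def raw_segments(seq, w=12, k1=2.2, k2=2.5):
    """Return raw segments as 1-based inclusive (start, end) pairs.

    A run of consecutive windows with entropy <= k2 is kept only if at
    least one window in the run is a trigger window (entropy <= k1).
    """
    ent = [window_entropy(seq[i:i + w]) for i in range(len(seq) - w + 1)]
    segments = []
    i = 0
    while i < len(ent):
        if ent[i] > k2:
            i += 1
            continue
        j = i
        triggered = False
        while j < len(ent) and ent[j] <= k2:
            triggered = triggered or ent[j] <= k1
            j += 1
        if triggered:
            # Last window in the run starts at index j - 1 and spans w residues.
            segments.append((i + 1, j - 1 + w))
        i = j
    return segments


print(raw_segments("A" * 30))                    # one segment covering everything
print(raw_segments("ACDEFGHIKLMNPQRSTVWY" * 2))  # no window is simple enough
```

Under these assumptions a 30-residue homopolymer yields the single raw segment (1, 30), while a sequence in which every window of 12 holds 12 distinct residues yields none. The second stage (optimization by P(0)) is not covered by this sketch.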

At the second stage, each raw segment is reduced to a single optimal low-complexity
segment, which may be the entire raw segment but is usually a subsequence. The optimal
subsequence has the lowest value of the probability P(0) (equation (5) of Wootton &
Federhen, 1993).

PARAMETERS

These three numeric parameters must be given in the order shown, after the sequence file
name. K1 and K2 below are the K(1) and K(2) of the SYNOPSIS.

Trigger window length [ W ]. An integer greater than zero [ Default 12 ].

Trigger complexity [ K1 ]. The maximum complexity of a trigger window in units of bits.
K1 must be equal to or greater than zero. The maximum value is 4.322 (log[base 2]20) for
amino acid sequences [ Default 2.2 ].

Extension complexity [ K2 ]. The maximum complexity of an extension window in units of
bits. Only values greater than K1 are effective in extending triggered windows. Range of
possible values is as for K1 [ Default 2.5 ].
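
The ranges above can be checked before a run with a small helper. The sketch below is hypothetical (check_params is not part of ncbi-seg) and simply encodes the limits stated in this section, taking the upper bound as log[base 2] of the alphabet size rounded to three decimals.

```python
import math


def check_params(w, k1, k2, alphabet_size=20):
    """Return a list of problems found; an empty list means none."""
    # Upper bound in bits, rounded as in the text (4.322 for 20 letters).
    limit = round(math.log2(alphabet_size), 3)
    problems = []
    if not isinstance(w, int) or w <= 0:
        problems.append("W must be an integer greater than zero")
    for name, value in (("K1", k1), ("K2", k2)):
        if not 0 <= value <= limit:
            problems.append(f"{name} must be between 0 and {limit}")
    if k2 <= k1:
        # Not an error: the text only says such values do not extend windows.
        problems.append("note: K2 is not greater than K1, so it is not "
                        "effective in extending trigger windows")
    return problems


print(check_params(12, 2.2, 2.5))   # defaults for amino acid sequences
print(check_params(12, 2.2, 4.5))   # K2 above the 20-letter maximum
```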

OPTIONS

The following options may be placed in any order in the command line after the W, K1 and
K2 parameters:

-a Output both low-complexity and high-complexity segments in a FASTA-formatted file, as
a set of separate entries with header lines.

-c [characters-per-line]
Number of sequence characters per line of output [Default 60]. Other characters, such
as residue numbers, are additional.

-h Output only the high-complexity segments in a FASTA-formatted file, as a set of
separate entries with header lines.

-l Output only the low-complexity segments in a FASTA-formatted file, as a set of
separate entries with header lines.

-m [length]
Minimum length in residues for a high-complexity segment [default 0]. Shorter
segments are merged with adjacent low-complexity segments.

-o Show all overlapping, independently-triggered low-complexity segments [these are
merged by default].

-q Produce an output format with the sequence in a numbered block with markings to assist
residue counting. The low-complexity and high-complexity segments are in lower- and
upper-case characters respectively.

-t [length]
"Maximum trim length" parameter [default 100]. This controls the search space (and
search time) during the optimization of raw segments (see ALGORITHM above). By
default, subsequences 100 or more residues shorter than the raw segment are omitted
from the search. This parameter may be increased to give a more extensive search if
raw segments are longer than 100 residues.

-x The masking option for amino acid sequences. Each input sequence is represented by a
single output sequence in FASTA-format with low-complexity regions replaced by strings
of "x" characters.

EXAMPLES OF PARAMETER SETS

Default parameters are given by 'ncbi-seg sequence' (equivalent to 'ncbi-seg sequence 12
2.2 2.5'). These parameters are appropriate for low-complexity masking of many amino
acid sequences [with -x option].

Database-database comparisons:
More stringent (lower) complexity parameters are suitable when masked sequences are
compared with masked sequences. For example, for BLAST or FASTA searches that compare two
amino acid sequence databases, the following masking may be applied to both databases:

ncbi-seg database 12 1.8 2.0 -x

Homopolymer analysis:
To examine all homopolymeric subsequences of length (for example) 7 or greater:

ncbi-seg sequence 7 0 0

Non-globular regions of protein sequences:
Many long non-globular domains may be diagnosed at longer window lengths, typically:

ncbi-seg sequence 45 3.4 3.75

For some shorter non-globular domains, the following set is appropriate:

ncbi-seg sequence 25 3.0 3.3

Nucleotide sequences:
The maximum value of the complexity parameters is 2 (log[base 2]4). For masking, the
following is approximately equivalent in effect to the default parameters for amino acid
sequences:

ncbi-seg sequence.na 21 1.4 1.6
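
When these parameter sets are used from a script, it can help to keep them in one table. The sketch below only assembles argument lists in the order given in the SYNOPSIS; the preset names and the build_command helper are hypothetical, and nothing is executed.

```python
import shlex

# (W, K1, K2) triples copied from the parameter sets in this section.
PRESETS = {
    "default": (12, 2.2, 2.5),
    "database_vs_database": (12, 1.8, 2.0),
    "homopolymer_7": (7, 0, 0),
    "long_non_globular": (45, 3.4, 3.75),
    "short_non_globular": (25, 3.0, 3.3),
    "nucleotide_masking": (21, 1.4, 1.6),
}


def build_command(path, preset="default", options=()):
    """Return an argument list: program, file, W, K1, K2, then options."""
    w, k1, k2 = PRESETS[preset]
    return ["ncbi-seg", path, str(w), str(k1), str(k2), *options]


print(shlex.join(build_command("database", "database_vs_database", ["-x"])))
# prints: ncbi-seg database 12 1.8 2.0 -x
```

The returned list can be passed as-is to subprocess.run on a system where ncbi-seg is installed.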

EXAMPLES

The following is a file named 'prion' in FASTA format:

>PRIO_HUMAN MAJOR PRION PROTEIN PRECURSOR
MANLGCWMLVLFVATWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQP
HGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAGAAAAGA
VVGGLGGYMLGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNFVHDCV
NITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYERESQAYYQRGSSMVLFSSPPV
ILLISFLIFLIVG

The command line:

ncbi-seg /usr/share/doc/ncbi-seg/examples/prion.fa

gives the standard output below:

>PRIO_HUMAN MAJOR PRION PROTEIN PRECURSOR

                               1-49    MANLGCWMLVLFVATWSDLGLCKKRPKPGG
                                       WNTGGSRYPGQGSPGGNRY
ppqggggwgqphgggwgqphgggwgqphgg 50-94
               gwgqphgggwgqggg
                               95-112  THSQWNKPSKPKTNMKHM
       agaaaagavvgglggymlgsams 113-135
                               136-187 RPIIHFGSDYEDRYYRENMHRYPNQVYYRP
                                       MDEYSNQNNFVHDCVNITIKQH
                tvttttkgenftet 188-201
                               202-236 DVKMMERVVEQMCITQYERESQAYYQRGSS
                                       MVLFS
              sppvillisflifliv 237-252
                               253-253 G

The low-complexity sequences are on the left (lower case) and high-complexity sequences
are on the right (upper case). All sequence segments read from left to right and their
order in the sequence is from top to bottom, as shown by the central column of residue
numbers.
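
A listing of this kind can be read back into segment records. The following is a minimal sketch, assuming every line holds either a residue range next to a sequence fragment or a continuation fragment on its own; parse_tree is a hypothetical helper, and the sample is the first part of the listing above with column alignment ignored.

```python
import re

RANGE = re.compile(r"^(\d+)-(\d+)$")


def parse_tree(text):
    """Return a list of dicts with keys start, end, kind and seq."""
    segments = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens or tokens[0].startswith(">"):
            continue  # blank line or header line
        if len(tokens) == 2 and RANGE.match(tokens[0]):
            # Range first: high-complexity fragment on the right.
            start, end = map(int, RANGE.match(tokens[0]).groups())
            segments.append({"start": start, "end": end,
                             "kind": "high", "seq": tokens[1]})
        elif len(tokens) == 2 and RANGE.match(tokens[1]):
            # Range last: low-complexity fragment on the left.
            start, end = map(int, RANGE.match(tokens[1]).groups())
            segments.append({"start": start, "end": end,
                             "kind": "low", "seq": tokens[0]})
        elif len(tokens) == 1 and segments:
            # Continuation line of the most recent segment.
            segments[-1]["seq"] += tokens[0]
    return segments


SAMPLE = """>PRIO_HUMAN MAJOR PRION PROTEIN PRECURSOR

1-49 MANLGCWMLVLFVATWSDLGLCKKRPKPGG
WNTGGSRYPGQGSPGGNRY
ppqggggwgqphgggwgqphgggwgqphgg 50-94
gwgqphgggwgqggg
95-112 THSQWNKPSKPKTNMKHM
"""

for seg in parse_tree(SAMPLE):
    # The fragment length should match the residue range.
    assert len(seg["seq"]) == seg["end"] - seg["start"] + 1
    print(seg["start"], seg["end"], seg["kind"], len(seg["seq"]))
```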

The command line:

ncbi-seg /usr/share/doc/ncbi-seg/examples/prion.fa -x

gives the following FASTA-formatted file:

>PRIO_HUMAN MAJOR PRION PROTEIN PRECURSOR
MANLGCWMLVLFVATWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxTHSQWNKPSKPKTNMKHMxxxxxxxx
xxxxxxxxxxxxxxxRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNFVHDCV
NITIKQHxxxxxxxxxxxxxxDVKMMERVVEQMCITQYERESQAYYQRGSSMVLFSxxxx
xxxxxxxxxxxxG
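
The masked file can be cross-checked against the low-complexity residue ranges reported in the tree-format output (50-94, 113-135, 188-201 and 237-252). The sketch below applies those ranges to the prion sequence and wraps the result at 60 characters per line; mask and wrap are hypothetical helpers, not part of ncbi-seg.

```python
PRION = (
    "MANLGCWMLVLFVATWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQP"
    "HGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAGAAAAGA"
    "VVGGLGGYMLGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNFVHDCV"
    "NITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYERESQAYYQRGSSMVLFSSPPV"
    "ILLISFLIFLIVG"
)

# Low-complexity ranges (1-based, inclusive) read from the tree-format output.
LOW = [(50, 94), (113, 135), (188, 201), (237, 252)]


def mask(seq, segments, char="x"):
    """Replace each 1-based inclusive (start, end) range with char."""
    residues = list(seq)
    for start, end in segments:
        residues[start - 1:end] = char * (end - start + 1)
    return "".join(residues)


def wrap(seq, width=60):
    """Split seq into lines of at most width characters."""
    return [seq[i:i + width] for i in range(0, len(seq), width)]


for line in wrap(mask(PRION, LOW)):
    print(line)
```

The printed lines should match the sequence lines of the masked file shown above.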
