NAME


getData - retrieves databases from the Internet

SYNOPSIS


getData [ --mirrordir <path> ] <list of db names>

getData --list

DESCRIPTION


Bioinformatics has an intrinsic problem: bringing biological data to the end user.
Astronomers face the equivalent problem, and particle physicists, well, they have come up
with (first) the web and (second) computational grids to address theirs.
Debian helps with the programs but will not provide such huge and frequently updated
datasets, not even in volatile.debian.org. Most bioinformatics researchers will
not need many of these databases, and most will gladly continue to use public
services remotely.

For those who need a set of databases on a regular basis, this script is a first step
towards automating the download of the data and the updating of indices and the like. The
world has seen such magic before with the Lion Biosciences Prisma tool
(http://bib.oxfordjournals.org/cgi/reprint/3/4/389.pdf), but how about something simpler
(as a start) that at least gets close to what we desire and is Free? The aim must be to
address the needs of all (most) communities, not only those of the bioinformatics world.
The seed was hence made with databases from astronomy.

Please contact the Debian-Med community if you consider this program to be almost ready
for your needs and explain what still needs to be added. Public databases that you managed
to integrate with this system are also very warmly welcomed as feedback.

OPTIONS


--help
Print this help.

--man
Present a more detailed description in form of a man page.

--verbose
Say one or two words more than required.

--mirrordir <path>
Specifies the destination directory. The data will be mirrored to the folder
$mirrordir/$dbname/. Please be aware that this mirrordir is not stored anywhere. The
directory can consequently be moved to an arbitrary location at any time, as long as the
users of the data are informed about the move.

--list
Lists all databases that may be requested to be installed.

<list of db names>
Only those databases that are explicitly requested to be downloaded will be
downloaded. Such databases may require considerable bandwidth, so please make sure you
know you are doing the right thing.

--post
Perform only the unpacking/indexing, but do not retrieve/update the databases. This
option is considered useful when adding a new database management system to the
system, e.g. after installing EMBOSS.

--source
Perform only the retrieval/update of the databases, but do not unpack/index them. This
option may be beneficial when the site administrator is aware of current analyses that
should not be disturbed by the indexing process, while the download from the net can
already be started.

--confd <directory>
Allows for the specification of a directory in which multiple files can be stored that
will be read by getData upon its invocation. These may add values to the global
variable %toBeMirrored that specifies the databases and their download scripts.

--config <system>
Prepares the configuration file that would be required for a particular system
that deals with the database. The configuration is printed to stdout and is expected
to be copied manually to the proper file or folder. One could imagine this process
being automated, though this is not yet implemented. Support is currently available for
two systems:

emboss
This specifies the EMBOSS suite of tools for bioinformatics (www.emboss.org),
which is also available as a Debian package. The configuration for the UniProt
databases allows sequence retrieval with the seqret tool.

dre - ARC Grid Runtime Environment
Runtime environments (REs) are a concept of the ARC grid middleware, about which
more can be learned at http://www.nordugrid.org. A script is needed to
indicate the presence of a runtime environment. Here the name of the script
is important, but getData cannot set it, since it only writes to
stdout.

Unfortunately, the configuration has not yet been modularised. It all needs
to happen within the getData script itself.

--remove <list of dbnames>
Removes the folders that store the data. In principle this could be performed
manually, though some databases may have special requirements pre- or post-removal,
which can be specified individually for every database.
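
The --source and --post options described above split an update into a download phase
and an indexing phase. The following is a minimal sketch of such a staged update. It
assumes that --mirrordir may be combined with --source and --post on one command line,
which the synopsis suggests but does not state. The function names and the GETDATA
variable are made up for this illustration and are not part of getData.

```shell
#!/bin/sh
# Sketch only: stage a database update in two phases.
# GETDATA is a hypothetical override so the functions can be dry-run,
# e.g. with GETDATA=echo; by default the real getData command is called.
GETDATA=${GETDATA:-getData}

# Phase 1: download/update only, no unpacking or indexing.
# Usage: fetch_only <mirrordir> <db name>...
fetch_only() {
    mirrordir=$1; shift
    "$GETDATA" --mirrordir "$mirrordir" --source "$@"
}

# Phase 2: unpack/index what is already on disk, no download.
# Usage: index_only <mirrordir> <db name>...
index_only() {
    mirrordir=$1; shift
    "$GETDATA" --mirrordir "$mirrordir" --post "$@"
}
```

One might call fetch_only /local/databases/mirrored swiss.dat while analyses are still
running, and index_only /local/databases/mirrored swiss.dat once they have finished.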

SPECIFICATION OF DATABASES


Databases for download and their post-processing are specified in two different locations.
One is the getData script itself, the other is the set of files stored in /etc/getData.d.
Either defines elements of a fairly large hash. The key is the identifier that is also
shown by 'getData --list'. The value is a reference to another hash, which
assigns values to all the properties that a database has for its download and
post-processing:

name - a human-readable pretty-printed name or short description that makes clear to the
world what this database is about.
A bad example is the mere assignment of "DE405", which few people understand. A better
example is "Pfam-A : Manually curated protein families and domains, only the seed is
presented.". One could argue that one should have that field renamed to "description".

source - shell commands to perform the initial download and subsequent updates
Commonly the wget tool is used for the download. The little script given here is
executed underneath the mirrordir directory. One simple example is "wget --mirror
ftp://ssd.jpl.nasa.gov/pub/eph/export/unix/unxp2[01]*.405". With increasing
proficiency in using wget, one is tempted to substitute "--mirror" with "--recursive
--no-host-directories --no-directories --level 1 --no-parent".

post-download - shell commands to perform after the data has been downloaded.
A simple example (unnecessary when the right flags are passed to wget) is the mere
setting of a symbolic link:

"post-download" => "ln -s ssd.jpl.nasa.gov/pub/eph/export/unix/unxp*.405 ."

Some more effort has been put into TrEMBL for the merging of releases with subsequent
updates and the indexing for EMBOSS:

"d=uncompressed; if [ ! -d \$d ]; then mkdir \$d; fi; "
."rm -rf \$d/trembl.dat; "
."(find ftp.ebi.ac.uk -name '*.dat.gz' | xargs -r zcat ) > \$d/trembl.dat; "
."[ -x /usr/bin/dbxflat ] "
. "&& cd \$d && "
. "dbxflat -dbresource embl -dbname trembllocal -idformat swiss -filenames=trembl.dat -fields id,acc -auto",

The dots concatenate strings in Perl, which helps the readability of the code. When
writing these scripts, please be aware that newlines do not separate the individual
commands here. Semicolons are required.

recommends - suggests a series of packages to be present for the use of the database or
the performance of the indexing.
This information is not used at the moment, partly to keep this script useful for
Linux distributions other than Debian.

getWgetOptions - private command to get wget options
This is used at download time by makefiles, is not intended to be used interactively,
and could be removed anytime.
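
The source and post-download properties described above both hold shell commands. As a
rough sketch, the same steps can be tried out as plain shell functions before they are
folded into a Perl string. The function names and the WGET variable are hypothetical.
The wget flags are the ones named under the source property, with --timestamping added
because dropping --mirror would otherwise lose it. gzip -dc stands in for zcat.

```shell
#!/bin/sh
# Sketch only: the kind of commands a "source" and a "post-download"
# property might hold, written as plain shell functions.
# WGET is a hypothetical override for dry runs (e.g. WGET=echo).
WGET=${WGET:-wget}

# "source": fetch one directory level into the current directory
# without recreating the remote host/directory tree.
# --timestamping is added because --mirror implied it.
mirror_flat() {
    "$WGET" --recursive --level 1 --no-parent \
            --no-host-directories --no-directories \
            --timestamping "$1"
}

# "post-download": merge all *.dat.gz files below $1 into the single
# uncompressed file $2, replacing any previous merge result.
merge_dat_gz() {
    src=$1
    out=$2
    mkdir -p "$(dirname "$out")" || return 1
    rm -f "$out"
    # gzip -dc is used as a portable spelling of zcat.
    find "$src" -type f -name '*.dat.gz' -exec gzip -dc {} + > "$out"
}
```

Once the commands behave as intended, they can be joined with semicolons into the single
string that the property expects.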

EXAMPLES


The following lists the identifiers and descriptions of the first 4 databases that
are available via getData on your system.

./getData --mirrordir=/local/databases/mirrored --list | head -n 4

To install a particular database, just give its name as an argument. If the installation
goes to a directory other than the default, then --mirrordir needs to be set again.

./getData swiss.dat

To remove the database again, give the script a hint with the --remove flag:

./getData --remove swiss.dat

To perform the indexing only and circumvent the download (attention, this is dangerous
since the index files will look newer than the database is), do

./getData --post swiss.dat

The --config flag is a special case in that it takes a list of
extra arguments. Each denotes a particular system for which this database may be of
interest. Two systems are supported today: emboss and dre (see --config under OPTIONS).
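
Since --config only prints to stdout, its output has to be captured by hand. The
following is a minimal sketch of doing so safely. The save_config function and the
GETDATA variable are hypothetical. Whether further arguments such as database names must
follow the system name is not stated on this page, so any extra arguments are simply
passed through.

```shell
#!/bin/sh
# Sketch only: capture the output of "getData --config <system>" in a
# file for later review. GETDATA is a hypothetical override for dry runs.
GETDATA=${GETDATA:-getData}

# Usage: save_config <outfile> <system> [further arguments...]
# The output goes to a temporary file first, so that a failing or
# empty run does not clobber an existing <outfile>.
save_config() {
    outfile=$1; shift
    tmp=$(mktemp) || return 1
    if "$GETDATA" --config "$@" > "$tmp" && [ -s "$tmp" ]; then
        mv "$tmp" "$outfile"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

For example, save_config ./emboss.fragment emboss would leave the generated text in
./emboss.fragment, from where it can be reviewed and copied to the proper place.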

TODO


We now need a mechanism with which packages can specify hooks that shall be called upon an
update of a database. But we cannot assume that every indexing that can be performed
because of the installation of some package is also desired by the user. How to configure
this properly is left to be decided.
