
WormBase development site. Master is at www.wormbase.org

[Linking to HTML] [Linking Genome Images] [Mining WormBase with AcePerl]
[Linking to Text] [Linking Genetic Maps] [Mining WormBase with MySQL & PostgreSQL]
[Linking to XML] [Linking the Cosmid Physical Map] [Public Access AcePerl Databases]
[Linking to ACeDB Dumps]   [Common WormBase Object Classes]

Linking and Mining WormBase

It is easy to link to WormBase, or to extract information from it for data mining purposes. You can link to HTML pages, text-only dumps, and XML pages.


Naming Conventions

All objects in WormBase have a name and a class. The name is a unique identifier which is usually, but not invariably, a human-readable name. The class describes the type of the object, such as "Sequence" or "Protein".

For example, the cell ADAR has a name of "ADAR" and a class of "Cell".

For historical reasons, the classes of objects are not always what you expect. While the class of a predicted gene such as ZK154.1 is "Predicted_Gene", as you might expect, it is less obvious that the named gene zyg-1 has a class of "Locus". The table at the bottom of this page lists some of the common classes.

If you are not sure, you can learn the name and class of a particular WormBase object by performing a search for the object. Once you find and display it, look at the URL at the top of the page. The URL will contain the arguments name= and class=. These are the name and class of the object.
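
If you have copied such a URL from your browser, the name and class can also be pulled out of it with a few lines of Perl. The following is only a sketch: the subroutine name name_and_class is made up for this example, and it assumes the arguments are separated by ";" or "&" as in the links on this page.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Extract the name= and class= arguments from a WormBase object URL.
# name_and_class is a hypothetical helper, not part of any WormBase library.
sub name_and_class {
    my ($url) = @_;
    my ($query) = $url =~ /\?([^#]*)/ or return;   # no query string: give up
    my %arg;
    for my $pair (split /[;&]/, $query) {
        my ($key, $value) = split /=/, $pair, 2;
        next unless defined $value;
        $value =~ tr/+/ /;                               # "+" encodes a space
        $value =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;   # undo %XX escaping
        $arg{$key} = $value;
    }
    return ($arg{name}, $arg{class});
}

my ($name, $class) = name_and_class(
    'http://www.wormbase.org/db/get?name=F59E12.2;class=Predicted_Gene');
print "name=$name class=$class\n";
```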


Linking to an HTML Page

To link to a WormBase web page describing the object, create a link to http://www.wormbase.org/db/get?name=X;class=Y, where X and Y are replaced with the name and class of the object you wish to retrieve. For example:

  <a href="http://www.wormbase.org/db/get?name=F59E12.2;class=Predicted_Gene">F59E12.2</a>
(try it)

Linking to XML

To link to an XML dump of the object, create a link to http://www.wormbase.org/db/misc/xml?name=X;class=Y, where X and Y are again replaced with the object name and class. For example:

  <a href="http://www.wormbase.org/db/misc/xml?name=WBGene00006988;class=Gene">WBGene00006988</a>
(try it)

Linking to Text

To link to a text-only representation of the object, create a link to http://www.wormbase.org/db/misc/text?name=X;class=Y, where X and Y are again replaced with the object name and class. For example:

  <a href="http://www.wormbase.org/db/misc/text?name=WBGene00006988;class=Gene">WBGene00006988</a>
(try it)

Linking to a Physical Map

If you know the contig name, you can link to an image of the C. elegans physical map by creating a link to http://www.wormbase.org/db/misc/epic?name=X;class=Map, where X is replaced with the contig of interest. To link to a particular clone on the map, link to http://www.wormbase.org/db/misc/epic?name=X;class=Clone, where X is replaced with the name of the clone. For example:

  <a href="http://www.wormbase.org/db/misc/epic?name=F59E12;class=Clone">F59E12</a>
(try it)

Linking to the Genome

Create a URL like this one:

<a href="http://www.wormbase.org/db/seq/gbrowse?source=wormbase;name=F59E12.2">F59E12.2</a>
(try it)

The "name=" argument can be filled with almost any landmark, including predicted gene names, locus names, cosmid names, and chromosome names. You can also combine a landmark with a range, such as II:1000..20000, which shows chromosome II from position 1000 to position 20,000, inclusive.
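
If you generate many such links from a script, a small helper keeps the landmark and range syntax in one place. This is a minimal sketch; the subroutine name gbrowse_url is invented for the example, and it assumes the landmark needs no URL escaping.

```perl
use strict;
use warnings;

# Build a genome browser URL for a landmark, optionally restricted to a range.
# gbrowse_url is a hypothetical helper name used only for this illustration.
sub gbrowse_url {
    my ($landmark, $start, $stop) = @_;
    my $name = $landmark;
    # Append ":start..stop" only when both ends of the range are given.
    $name .= ":$start..$stop" if defined $start && defined $stop;
    return "http://www.wormbase.org/db/seq/gbrowse?source=wormbase;name=$name";
}

print gbrowse_url('F59E12.2'), "\n";        # a predicted gene
print gbrowse_url('II', 1000, 20000), "\n"; # chromosome II, 1000 to 20,000
```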


Linking to an Image of the Genome

You can create an inline image of a portion of the genome using a URL like this one:

<img src="http://www.wormbase.org/db/seq/gbrowse_img?source=wormbase;name=mec-3;width=650">

This will produce a 650-pixel-wide image of the region around mec-3.

There are many more options. You can turn on tracks, add your own tracks, and much more. See the gbrowse_img help page for details.


Linking to Genetic Maps

Create a link like this one:

<a href="http://www.wormbase.org/db/misc/epic?class=Map;name=III">III</a>
(try it)

To link to a region of the map defined by a centimorgan interval, add the map_start and map_stop arguments:

<a href="http://www.wormbase.org/db/misc/epic?class=Map;name=III;map_start=3;map_stop=4">III:3..4</a>
(try it)
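
Links of this kind are easy to generate from a list of map intervals. The sketch below prints the HTML for a whole map and for a centimorgan interval; the subroutine name genetic_map_link is an assumption made for this example.

```perl
use strict;
use warnings;

# Return an HTML link to a genetic map, optionally limited to a cM interval.
# genetic_map_link is a hypothetical helper name.
sub genetic_map_link {
    my ($map, $cm_start, $cm_stop) = @_;
    my $url   = "http://www.wormbase.org/db/misc/epic?class=Map;name=$map";
    my $label = $map;
    if (defined $cm_start && defined $cm_stop) {
        $url   .= ";map_start=$cm_start;map_stop=$cm_stop";
        $label .= ":$cm_start..$cm_stop";
    }
    return qq(<a href="$url">$label</a>);
}

print genetic_map_link('III'), "\n";
print genetic_map_link('III', 3, 4), "\n";
```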

Linking to ACeDB Format

To link to an ACeDB representation of the object, suitable for loading into a local ACeDB database, create a link to http://www.wormbase.org/db/misc/acedb?name=X;class=Y, where X and Y are again replaced with the object name and class. For example:

  <a href="http://www.wormbase.org/db/misc/acedb?name=F59E12.2;class=Predicted_Gene">F59E12.2</a>
(try it)
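
The HTML, XML, text and ACeDB links described above all follow the same name=X;class=Y pattern and differ only in their path, so one subroutine can build any of them. The following is a sketch under that assumption; object_url, escape and the format keywords (html, xml, text, acedb) are names chosen for this example, and the percent-encoding is a simple hand-rolled version rather than a call to a URI library.

```perl
use strict;
use warnings;

# URL path for each output format described above.
my %PATH_FOR = (
    html  => '/db/get',
    xml   => '/db/misc/xml',
    text  => '/db/misc/text',
    acedb => '/db/misc/acedb',
);

# Percent-encode every character outside the URL "unreserved" set.
sub escape {
    my ($s) = @_;
    $s =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $s;
}

# Build the link for one object in one format (hypothetical helper).
sub object_url {
    my ($format, $name, $class) = @_;
    my $path = $PATH_FOR{$format} or die "unknown format '$format'\n";
    return 'http://www.wormbase.org' . $path
         . '?name=' . escape($name) . ';class=' . escape($class);
}

print object_url($_, 'F59E12.2', 'Predicted_Gene'), "\n"
    for qw(html xml text acedb);
```

Escaping matters for names that contain punctuation, such as the protein name WP:CE28571, whose colon is sent as %3A.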

Fetching Multiple Objects

You can fetch multiple objects by using wildcards in the name, where "*" matches any sequence of characters (including none) and "?" matches exactly one character. For example, to retrieve all RME cells in XML format, use:

  <a href="http://www.wormbase.org/db/misc/xml?name=RME*;class=Cell">All RME Cells</a>
(try it)
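
To check what a wildcard pattern will match before sending it to the server, you can translate it into a Perl regular expression and try it on a list of names. This sketch assumes case-sensitive matching, which may differ from what the server does; wildcard_to_regex is a name invented for the example, and the cell names are sample input only.

```perl
use strict;
use warnings;

# Translate a wildcard pattern ("*" = any run of characters, "?" = exactly
# one character) into an anchored Perl regex. Hypothetical helper name.
sub wildcard_to_regex {
    my ($pattern) = @_;
    my $re = quotemeta $pattern;   # make every character literal first
    $re =~ s/\\\*/.*/g;            # then turn the escaped "*" into ".*"
    $re =~ s/\\\?/./g;             # and the escaped "?" into "."
    return qr/\A$re\z/;
}

my $rme   = wildcard_to_regex('RME*');
my @cells = qw(ADAR RMED RMEL RMER RMEV RMDL);   # sample names for illustration
print join(' ', grep { $_ =~ $rme } @cells), "\n";
```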

Learning More about the Data Model

The data model for any WormBase object can be viewed by choosing the "Schema" menu item from the yellow navigation bar at the top of any object display page. You can search for particular data models using the simple search on the front page, and selecting "Model" as the type of object you are searching for.

WormBase uses ACEDB data models. Their format is described in detail at acedb.org.


Mining WormBase with AcePerl and Bio::DB::GFF

Accessing the Database Directly Via AcePerl

WormBase can be queried from the command line using AcePerl. This allows you to write sophisticated Perl scripts to mine WormBase. See the AcePerl pages for information on downloading, installing and using this software.


Public Access AcePerl Servers

Network access to the C. elegans database is available at the following site:

Location                        Host                 Port
Cold Spring Harbor Laboratory   aceserver.cshl.org   2005

Be aware that you will be sharing this server with other people. If it seems slow, it may be because others are using it. Wait a while and try again.
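
As an illustration, a short AcePerl script that connects to this server and lists the objects of a class matching a wildcard pattern might look like the sketch below. It assumes the AcePerl distribution (Ace.pm) is installed and that the server accepts anonymous connections; the subroutine names connection_args and list_objects, and the script name in the usage comment, are made up for this example.

```perl
#!/usr/bin/perl
# Usage (hypothetical script name): list_objects.pl Cell 'RME*'
use strict;
use warnings;

# Connection settings taken from the public server listed above.
sub connection_args {
    return (-host => 'aceserver.cshl.org', -port => 2005);
}

# Fetch all objects of a class whose names match a wildcard pattern,
# print their names, and return how many were found.
sub list_objects {
    my ($class, $pattern) = @_;
    require Ace;   # loaded here so the script compiles without AcePerl
    my $db = Ace->connect(connection_args())
        or die "Connection failure: ", Ace->error, "\n";
    my @objects = $db->fetch($class => $pattern);
    print $_->name, "\n" for @objects;
    return scalar @objects;
}

# Only contact the server when a class and a pattern are given.
list_objects(@ARGV) if @ARGV == 2;
```

Because the server is shared, keep patterns as specific as you can rather than fetching an entire class.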


Mining WormBase with MySQL & PostgreSQL

The BioPerl library provides a simple relational schema and database access layer for querying genomic features. This is the access method of choice for mining WormBase for:

  1. Spliced and unspliced genes.
  2. UTRs, upstream regions or introns.
  3. Searching for annotations that overlap one another, for example, finding all genes that are in an intron of another gene.
  4. Comparing syntenic regions of the C. elegans and C. briggsae genomes.

You will need to install your own local database. Currently WormBase does not provide world access to the MySQL database of genome features (although this may change in the future). You must:

  1. Install BioPerl and either MySQL or PostgreSQL.
  2. Read the manual pages on Bio::DB::GFF. This describes the API for loading and querying the database.
  3. Download the following flat files that describe the C. elegans and C. briggsae genomes:
    elegansWSXXXX.gff.gz
    GFF-format descriptions of sequence annotations on C. elegans. The elegansWSXXXX.gff.gz file contains complete annotations for WormBase release XXXX, where XXXX is the current release number. Other files in this directory contain the same information sorted by chromosome.
    CHROMOSOME_*.dna.gz
    Fasta-format files of C. elegans chromosome DNA.
    EST_Elegans.dna.gz
    Fasta-format files of C. elegans ESTs and cDNAs.
    briggsae_25.gff.gz
    GFF-format descriptions of sequence annotations on the C. briggsae genome.
    briggsae_25.fa.gz
    Fasta-format file of C. briggsae genomic contigs.

After downloading the C. elegans FASTA files, you should combine them into a single file to facilitate loading. This is most easily done like this:

 % gunzip -c CHROMOSOME_*.dna.gz EST_Elegans.dna.gz > CElegans.fa

You do not have to do this with C. briggsae, because its DNA data is already combined into one file.

Once these are downloaded, create appropriately-named databases using the MySQL or PostgreSQL administrators' tools. You may create separate databases for each of the genomes (recommended) or put them together in the same database. Assuming that you have created two MySQL databases, one named "elegans" and the other named "briggsae", you will use the bulk_load_gff.pl tool to load them from the downloaded files:

 % bulk_load_gff.pl -c -d elegans  -fasta CElegans.fa elegansWSXXXX.gff.gz
 % bulk_load_gff.pl -c -d briggsae -fasta briggsae_25.fa.gz briggsae_25.gff.gz

The bulk_load_gff.pl program comes with BioPerl. Look for it in the subdirectory scripts/Bio-DB-GFF.

If you are using PostgreSQL, you should use the script "load_gff.pl", which works, but is very slow, or "pg_bulk_load_gff.pl," which is fast and optimized for PostgreSQL. They both have the same command-line syntax.

Once you've got the database loaded, you can write scripts to mine the data. For example, this script will find all named (3-letter) genes that are contained within the intron of another named or predicted gene and print out 100 bp upstream from their 5' end:

#!/usr/bin/perl

# find 3-letter named genes that are contained within the intron of another gene
use strict;
use Bio::DB::GFF;

my $db = Bio::DB::GFF->new('elegans');
my $intron_stream = $db->get_seq_stream('intron:curated');
while (my $intron = $intron_stream->next_feature) {
  my @contained_genes = $intron->contained_features('gene:curated') or next;
  for my $gene (@contained_genes) {
    my $upstream = $gene->subseq(-99,0);  # 100 bp upstream - position 0 is 1 bp to left of translational start
    print $gene->name,"\t",$upstream->dna,"\n";
  }
}

The output starts like this:

fkh-6	acctccgtcttcacagttccgagaccccgccctcactcttagcttctgcataatccgttgtctcatttgacaccccctaccataaaaaaatacaataatc
kin-31	aaaaaaaaatcgattttatcaaaaaacaatttatttcacatttttgtataactgacactcgtcagaattgtaaaaaccattaatttcatcgttgcattaa
...

To fetch all gene models and print their coding regions and UTRs, use the following script:

#!/usr/bin/perl

use strict;
use Bio::DB::GFF;

my $db  = Bio::DB::GFF->new(-dsn         => 'elegans',
			    -aggregators => 'gene_model{coding_exon,5_UTR,3_UTR/CDS}');

my $gene_stream = $db->get_seq_stream('gene_model:curated');

while (my $gene = $gene_stream->next_seq) {
  print $gene->name,"\n";
  for my $part ($gene->get_SeqFeatures) {
    # print type, reference sequence, start, end and strand for each part
    print "\t",join("\t",$part->method,$part->refseq,$part->start,$part->end,$part->strand),"\n";
  }
  print "\n";
}

This will produce output like:

2L52.1
	coding_exon	II	1867	1911	1
	coding_exon	II	2506	2694	1
	coding_exon	II	2738	2888	1
	coding_exon	II	2931	3036	1
	coding_exon	II	3406	3552	1
	coding_exon	II	3802	3984	1
	coding_exon	II	4201	4663	1

2RSSE.1
	5_UTR	II	15268097	15268367	1
	coding_exon	II	15268368	15268441	1
	coding_exon	II	15269346	15269681	1
	coding_exon	II	15269747	15269918	1
	coding_exon	II	15270683	15270860	1
	coding_exon	II	15272930	15273201	1

See the documentation for the BioPerl Bio::DB::GFF class for details. Note the use of the aggregator "gene_model{coding_exon,5_UTR,3_UTR/CDS}", which says to aggregate parts of type coding_exon, 5_UTR, 3_UTR and CDS into a single feature of type "gene_model."


Common Object Classes

The following table lists common WormBase classes.

Object | Name | Class | Notes
A Predicted Gene | A cosmid dot name, such as F59E12.2 | Predicted_Gene | A class name of "Sequence" is also recognized.
A Named Gene | A three-letter name, such as zyg-1 | Locus |
A Genbank Accession Number (protein or nucleotide) | The accession number | Accession_Number | This will retrieve an Accession_Number object, which is a list of WormBase names that correspond to that accession number.
A Protein | A wormpep accession number, preceded by "WP:", as in WP:CE28571 | Protein |
A Clone | The cosmid or YAC name, for example F59E12 | Clone |
Genomic Sequence | The cosmid or YAC name from which the sequence was made, for example F59E12 | Genomic_Sequence | A class name of "Sequence" is also recognized.
A GenePair | The Research Genetics name, preceded by the prefix "sjj_" (for Steve Jones, who designed the pairs), for example sjj_F59E12.2 | PCR_Product |
A Cell | A cell name, such as ADAR; notice that paired cells, such as ADAR and ADAL, are treated separately | Cell |
A Protein Family | The accession number, preceded by the database name. Examples include the Interpro family INTERPRO:IPR000039, the Prosite motif PS:PS00041, and the PFAM family PFAM:PF00352 | Motif |
An RNAi Experiment | The accession number of the experiment. This is typically the name of the overlapping gene preceded by the laboratory prefix. For example, an RNAi experiment that covers gene F59E12.2 and was performed in Julie Ahringer's lab (JA) will have the name JA:F59E12.2 | RNAi |
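
When you build links for a mixed list of names, the naming patterns in the table can be used to guess a class. The sketch below is only a heuristic based on the example names shown here, not a rule that WormBase itself applies: the patterns and the subroutine name guess_class are assumptions, and ambiguous names (for instance F59E12, which is both a Clone and a Genomic_Sequence) are deliberately left unclassified.

```perl
use strict;
use warnings;

# Heuristic name-to-class rules, checked in order. The patterns are
# illustrative guesses derived from the example names in the table.
my @RULES = (
    [ qr/^WP:/                   => 'Protein'        ],
    [ qr/^sjj_/                  => 'PCR_Product'    ],
    [ qr/^(?:INTERPRO|PS|PFAM):/ => 'Motif'          ],
    [ qr/^[a-z]{3}-\d+$/         => 'Locus'          ],
    [ qr/^[A-Z0-9]+\.\d+$/       => 'Predicted_Gene' ],
);

# Return the guessed class, or nothing when no rule applies.
sub guess_class {
    my ($name) = @_;
    for my $rule (@RULES) {
        my ($pattern, $class) = @$rule;
        return $class if $name =~ $pattern;
    }
    return;   # unknown: look the object up on the site instead
}

for my $name (qw(F59E12.2 zyg-1 WP:CE28571 sjj_F59E12.2 PFAM:PF00352 ADAR)) {
    my $class = guess_class($name);
    printf "%-14s %s\n", $name, defined $class ? $class : '(unknown)';
}
```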