
About

ProbTF is a software tool for predicting transcription factor binding using experimentally verified Position Weight Matrices (PWMs). The included sets of PWMs are from the TRANSFAC Public 7.0 database.

ProbTF provides a probabilistic framework for transcription factor binding prediction which has three important features. First, ProbTF is probabilistic in nature and thus outputs a probability of binding (as opposed to a p-value). Second, the method answers the question of whether the whole promoter has a binding site. Third, ProbTF provides a principled way of combining multiple data sources, such as evolutionary conservation, regulatory potential, CpG islands, nucleosome positioning, DNase hypersensitive sites, ChIP-chip, and other prior knowledge, into a unified probabilistic framework.

This method was used in:


Algorithm Description

Full computational details can be found in the article.


TRANSFAC Public 7.0 Matrices

The TRANSFAC matrices used in ProbTF are from the Public 7.0 release. Matrices were divided into sets based on the species they are derived from: 187 mouse matrices corresponding to 121 mouse TFs. Due to the TRANSFAC licence agreement, only matrices within the Public release can be used in analyses via a web service.

Full list of included TRANSFAC Public 7.0 matrices: TRANSFACmatrices.txt.


Preprocessing of Position Weight Matrices

Some of the position-specific weight matrices (PSWMs) can be remarkably diffuse because they are computed from only a few experimentally verified sequences. Consequently, many of the PSWMs are likely to contain zero counts, which would give zero probabilities. To prevent this, we add one pseudocount to every entry in the PSWMs. To keep this process comparable between different matrices, we first scale the counts in each column so that they sum to 100, then add one pseudocount to each entry, and finally re-scale each column so that it sums to 1, which gives position-specific frequency matrices (PSFMs). For example,


Before:  A       C       G       T
         0      20      30       0

Step1:   A       C       G       T
         0      40      60       0

Step2:   A       C       G       T
         1      41      61       1

Final:   A       C       G       T
      0.0096  0.3942  0.5865  0.0096
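
The three steps above can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the example, not the ProbTF source code; the function name `counts_to_frequencies` and its parameters are hypothetical.

```python
def counts_to_frequencies(column, total=100.0, pseudocount=1.0):
    """Scale one matrix column to `total`, add a pseudocount, renormalise.

    `column` holds the raw counts for A, C, G, T at one position.
    The name and signature are illustrative, not part of ProbTF.
    """
    column_sum = sum(column)
    if column_sum <= 0:
        raise ValueError("column must contain at least one positive count")
    # Step 1: scale the counts so that they sum to `total` (100 here).
    scaled = [total * count / column_sum for count in column]
    # Step 2: add one pseudocount to every entry.
    smoothed = [count + pseudocount for count in scaled]
    # Final: re-scale so that the column sums to 1.
    smoothed_sum = sum(smoothed)
    return [count / smoothed_sum for count in smoothed]


freqs = counts_to_frequencies([0, 20, 30, 0])
print([round(f, 4) for f in freqs])  # [0.0096, 0.3942, 0.5865, 0.0096]
```

Scaling to 100 before adding the pseudocount means the pseudocount carries the same relative weight (1 in 104) in every column, regardless of how many sequences the original matrix was built from.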



Sequence Requirements

FASTA files: FASTA is probably the simplest of formats for unaligned sequences. FASTA files are easily created in a text editor. Each sequence is preceded by a line starting with >. The first word on this line is the name of the sequence. The rest of the line is a description of the sequence (free format). The remaining lines contain the sequence itself. You can put as many letters on a sequence line as you want. An example is shown below:

>sequenceOne The first example sequence.
GATGGATGGGCTAGATGATCGGATAGAGAGAGAGAGATTGTAG
GATGGTATTTTAGATAGATAGAGAGAG

FASTA files are conventionally named with a .fa extension.
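
As a rough check that a file follows the layout described above, a minimal reader might look like the sketch below. It is not the parser used by the web server; `parse_fasta` is a hypothetical helper that only handles the plain format described in this section.

```python
def parse_fasta(lines):
    """Return a list of (name, description, sequence) tuples.

    Hypothetical helper for illustration: the first word after '>' is
    the name, the rest of that line is the description, and all
    following lines up to the next '>' are joined into the sequence.
    """
    records = []
    name, description, chunks = None, "", []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if name is not None:
                records.append((name, description, "".join(chunks)))
            header = line[1:].split(None, 1)
            name = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, description, "".join(chunks)))
    return records


example = """>sequenceOne The first example sequence.
GATGGATGGGCTAGATGATCGGATAGAGAGAGAGAGATTGTAG
GATGGTATTTTAGATAGATAGAGAGAG
"""
for name, description, sequence in parse_fasta(example.splitlines()):
    print(name, len(sequence))
```

The length of each parsed sequence is the number that the matching evidence file has to agree with.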


Evidence Scores

  • Each line in the file should contain a SINGLE probability/score value, with one line for each base pair in the matching sequence file.
  • Scores must lie between 0 and 1, with the value 1 itself excluded. Any evidence score equal to 1 is set to 0.9999999.
  • The number of scores in the evidence file MUST equal the number of base pairs in the sequence file.
  • Note that, depending on the type of your additional evidence, you may need to scale the probability scores closer to 0.5, e.g. between 0.4 and 0.6. For example, we found that evolutionary conservation probabilities from the PhastCons program produce the best results when they are scaled between 0.46 and 0.54.
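
The requirements above can be checked before uploading with a short script such as the following sketch. The helper names `load_evidence` and `rescale` are hypothetical, and the linear mapping used for rescaling is an assumption: this page does not state how the scores were scaled into the 0.46 to 0.54 range.

```python
def load_evidence(lines, sequence_length, cap=0.9999999):
    """Read one score per line and apply the rules listed above.

    Hypothetical helper: checks the count against the sequence length,
    checks the 0..1 range, and replaces scores of 1 with `cap`.
    """
    scores = [float(line) for line in lines if line.strip()]
    if len(scores) != sequence_length:
        raise ValueError(
            f"expected {sequence_length} scores, found {len(scores)}"
        )
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
    return [min(score, cap) for score in scores]


def rescale(scores, low=0.46, high=0.54):
    """Map scores from [0, 1] into [low, high].

    ASSUMPTION: a simple linear map; the original scaling is not specified.
    """
    return [low + (high - low) * score for score in scores]


raw = ["0.2", "1", "0"]
scores = load_evidence(raw, sequence_length=3)
print(rescale(scores))
```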

Background Model

A set of DNA sequences was used to compute parameters of the Markovian background models. Models of order 0 to 3 are currently available.
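
For readers unfamiliar with Markovian background models, the sketch below shows one common way to estimate a model of a given order from a set of sequences: count how often each base follows each context of `order` preceding bases, then normalise. This is a generic illustration, not the procedure used to build the ProbTF models; the function name `train_background` and the use of a pseudocount are assumptions.

```python
from collections import defaultdict


def train_background(sequences, order, alphabet="ACGT", pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from DNA sequences.

    Hypothetical helper. Returns {context: {base: probability}}.
    Only contexts that occur in the training data appear in the result.
    """
    counts = defaultdict(lambda: dict.fromkeys(alphabet, 0.0))
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            # Skip windows containing symbols outside the alphabet (e.g. N).
            if base in alphabet and all(c in alphabet for c in context):
                counts[context][base] += 1.0
    model = {}
    for context, row in counts.items():
        total = sum(row.values()) + pseudocount * len(alphabet)
        model[context] = {
            base: (n + pseudocount) / total for base, n in row.items()
        }
    return model


model = train_background(["ACGTACGT"], order=1)
print(model["A"])
```

With `order=0` the only context is the empty string, so the model reduces to plain base frequencies; higher orders capture short-range dependencies between neighbouring bases.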


Downloads


Contact

If you have any problems using the web server, please read the FAQ first to see whether your problem is answered there.

Otherwise if you have any comments or questions regarding ProbTF you may email:



Acknowledgements

The development of ProbTF is supported by grants from the National Institute of General Medical Sciences (R01-GM072855) and the National Institute of Allergy and Infectious Diseases (U54 AI54253).

  • Alistair Rust: Implemented the web server and hand-curated the mouse TFBS test set.
  • Harri Lähdesmäki: Developed and implemented the computational methods.
  • Andrew Peabody: Managed all the necessary web server admin issues.
  • Ryan Pelan: Created the great looking 'skin' for the website.
  • Stephen Ramsey: Made the Perl backbone code more robust and reliable. Wrote and debugged the first incarnation of the web server.

Change Log

  • September 2009: Mouse TFBS test set added for downloading.
  • September 2008: Added links to the Matlab source files and an example use-case. Updated the paper reference.
  • June 2007: Initial release of server.