The UniProt Reference Clusters (UniRef) provide clustered sets of sequences from the UniProt Knowledgebase (including isoforms) and selected UniParc records in order to obtain complete coverage of the sequence space at several resolutions while hiding redundant sequences (but not their descriptions) from view.
Unlike in UniParc, sequence fragments are merged in UniRef: the UniRef100 database combines identical sequences and sub-fragments with 11 or more residues from any organism into a single UniRef entry, displaying the sequence of a representative protein, the accession numbers of all the merged entries and links to the corresponding UniProtKB and UniParc records.
UniRef90 is built by clustering UniRef100 sequences with 11 or more residues using the CD-HIT algorithm (Li W. and Godzik A., Bioinformatics, 22: 1658-1659, 2006) such that each cluster is composed of sequences that have at least 90% sequence identity to and 80% overlap with the longest sequence (a.k.a. seed sequence) of the cluster. Similarly, UniRef50 is built by clustering UniRef90 seed sequences that have at least 50% sequence identity to and 80% overlap with the longest sequence in the cluster. Prior to 2013 there was no overlap threshold, so clusters were more heterogeneous in length. UniRef90 and UniRef50 yield a database size reduction of approximately 58% and 79%, respectively, providing for significantly faster sequence similarity searches.
The seed sequences are the longest members of the cluster. However, the longest sequence is not always the most informative. There is often more biologically relevant information (name, function, cross-references) available on other cluster members. All the proteins in a cluster are therefore ranked as follows to facilitate the selection of a biologically relevant representative for the cluster:
- quality of the entry: manually reviewed entries (from the UniProtKB/Swiss-Prot section) are preferred
- annotation score: entries that have higher UniProtKB annotation scores are preferred. This also means that UniProtKB entries will always take precedence over entries that are in UniParc but not in UniProtKB (annotation score is undefined in UniParc, which does not contain any annotations).
- organism: entries from reference proteomes and model organisms are preferred
- length of the sequence: longest sequence is preferred
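The ranking criteria above can be read as a strict priority order, which maps naturally onto tuple comparison. The following is only a minimal sketch under that assumption; the `Member` class, its field names and the placeholder identifiers are hypothetical and are not part of any UniProt data model or API.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Member:
    """Hypothetical, simplified view of one cluster member."""
    member_id: str
    reviewed: bool                   # True for UniProtKB/Swiss-Prot entries
    annotation_score: Optional[int]  # None for records that exist only in UniParc
    preferred_organism: bool         # reference proteome or model organism
    length: int                      # sequence length in residues


def rank_key(member: Member) -> Tuple[bool, int, bool, int]:
    # Tuples compare element by element, so earlier criteria dominate later ones.
    # Assumption: an undefined annotation score sorts below every defined score.
    score = member.annotation_score if member.annotation_score is not None else -1
    return (member.reviewed, score, member.preferred_organism, member.length)


def pick_representative(members: Iterable[Member]) -> Member:
    """Return the highest-ranked member of a cluster."""
    return max(members, key=rank_key)


# Placeholder identifiers, invented for illustration only.
members = [
    Member("uniparc_only", False, None, False, 520),
    Member("unreviewed_long", False, 3, True, 510),
    Member("reviewed_short", True, 5, True, 480),
]
representative = pick_representative(members)
```

In this example the reviewed entry is selected even though it is the shortest of the three, which is the situation the ranking is meant to handle.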
UniRef100 contains all UniProt Knowledgebase records plus selected UniParc records (see below). In UniRef100, all identical sequences and subfragments with 11 or more residues are placed into a single record. UniRef50 and UniRef90 are built based on UniRef100.
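As an illustration of the merging rule, the sketch below groups identical sequences and folds fragments of 11 or more residues into the record of a longer sequence that contains them. It is a simplified assumption-laden sketch, not the UniRef production procedure: it treats a sub-fragment as an exact contiguous substring, merges sequences shorter than 11 residues only when they are identical, and assigns a fragment to the first containing sequence it finds. The function name and the toy sequences are hypothetical.

```python
from typing import Dict, List

MIN_FRAGMENT_LENGTH = 11  # fragments shorter than this are not merged into longer sequences


def merge_identical_and_fragments(records: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each representative id to the ids merged into its record.

    `records` maps an identifier to its amino acid sequence.
    """
    clusters: Dict[str, List[str]] = {}
    rep_seqs: Dict[str, str] = {}
    # Process longest sequences first so a fragment is seen after its container.
    ordered = sorted(records.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for rec_id, seq in ordered:
        home = None
        for rep_id, rep_seq in rep_seqs.items():
            identical = seq == rep_seq
            fragment = len(seq) >= MIN_FRAGMENT_LENGTH and seq in rep_seq
            if identical or fragment:
                home = rep_id
                break
        if home is None:
            # No existing record absorbs this sequence, so it starts a new one.
            rep_seqs[rec_id] = seq
            clusters[rec_id] = [rec_id]
        else:
            clusters[home].append(rec_id)
    return clusters


# Toy sequences, invented for illustration only.
records = {
    "a": "MKTAYIAKQRQISFVKSHFSRQ",
    "b": "MKTAYIAKQRQISFVKSHFSRQ",  # identical to "a"
    "c": "AYIAKQRQISFVK",           # 13-residue fragment of "a"
    "d": "MKTAY",                   # too short to be merged as a fragment
    "e": "MKTAY",                   # identical to "d"
}
merged = merge_identical_and_fragments(records)
```

The pairwise substring scan is quadratic and only suitable for small examples; it is meant to make the rule concrete rather than to scale.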
The UniRef100 identifier is generated by placing a “UniRef100_” prefix before the UniProtKB accession or UniParc identifier of the representative UniProtKB or UniParc entry, e.g. “UniRef100_P99999” or “UniRef100_UPI0000027233”.
UniRef90 is generated by clustering UniRef100 seed sequences.
UniRef100 sequences shorter than 11 residues are excluded from UniRef90 clusters. Each UniRef90 cluster has one representative sequence from the UniRef100 database.
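The membership conditions for a cluster (minimum length, identity threshold and overlap with the seed) can be written as a small predicate. This is a hedged sketch that takes alignment results as given; it does not reimplement CD-HIT. The function name, its parameters and the definition of overlap as the aligned fraction of the seed sequence are assumptions made for illustration.

```python
IDENTITY_THRESHOLDS = {"UniRef90": 0.90, "UniRef50": 0.50}
MIN_OVERLAP = 0.80   # required overlap with the seed sequence
MIN_LENGTH = 11      # shorter sequences are not clustered


def joins_cluster(level: str, identity: float, aligned_seed_residues: int,
                  seed_length: int, member_length: int) -> bool:
    """Return True if a sequence satisfies the thresholds for the seed's cluster.

    identity: fractional sequence identity to the seed (0.0 to 1.0).
    aligned_seed_residues: number of seed residues covered by the alignment
        (assumed definition of overlap).
    """
    if member_length < MIN_LENGTH:
        return False
    overlap = aligned_seed_residues / seed_length
    return identity >= IDENTITY_THRESHOLDS[level] and overlap >= MIN_OVERLAP


# 93% identity with 85 of 100 seed residues aligned: qualifies for UniRef90.
ok_90 = joins_cluster("UniRef90", 0.93, 85, 100, 90)
# Same identity, but only 70% of the seed is covered: fails the overlap rule.
too_little_overlap = joins_cluster("UniRef90", 0.93, 70, 100, 70)
# 55% identity is enough for UniRef50 but not for UniRef90.
ok_50 = joins_cluster("UniRef50", 0.55, 80, 100, 80)
```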
UniRef90 cluster titles and identifiers are derived from the representative UniRef100 entry. The UniRef90 identifier is generated by replacing the “UniRef100_” prefix of the representative with “UniRef90_”, e.g. “UniRef90_P99999”.
UniRef50 is generated by clustering UniRef90 seed sequences.
UniRef50 cluster titles and identifiers are derived from the representative UniRef90 entry. The UniRef50 identifier is generated by replacing the “UniRef90_” prefix of the representative with “UniRef50_”, e.g. “UniRef50_P99999”.
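The identifier scheme described above amounts to a prefix in front of the representative's UniProtKB accession or UniParc identifier. A minimal sketch of that string handling follows; the helper names `make_uniref100_id` and `relabel` are hypothetical, and the code does not check that an accession actually exists.

```python
import re

# Matches a UniRef identifier and captures the level and the member identifier.
_UNIREF_ID = re.compile(r"^UniRef(100|90|50)_(.+)$")
_LEVELS = (100, 90, 50)


def make_uniref100_id(member_id: str) -> str:
    """Prefix a UniProtKB accession or UniParc identifier with "UniRef100_"."""
    return f"UniRef100_{member_id}"


def relabel(uniref_id: str, level: int) -> str:
    """Swap the prefix of a UniRef identifier for the prefix of another level."""
    if level not in _LEVELS:
        raise ValueError(f"unknown UniRef level: {level}")
    match = _UNIREF_ID.match(uniref_id)
    if match is None:
        raise ValueError(f"not a UniRef identifier: {uniref_id!r}")
    return f"UniRef{level}_{match.group(2)}"


uniref100 = make_uniref100_id("P99999")   # "UniRef100_P99999"
uniref90 = relabel(uniref100, 90)         # "UniRef90_P99999"
uniref50 = relabel(uniref90, 50)          # "UniRef50_P99999"
```

Note that relabelling only produces a valid cluster identifier when the entry is in fact the representative at the target level; a sequence that is merely a member of a UniRef90 or UniRef50 cluster is filed under its representative's identifier instead.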