When you run a make update in kent/src/hg/makeDb/trackDb/, the hgFindSpec program is invoked to read search specifications from the trackDb.ra files and to create an hgFindSpec_$USER table in each database. make alpha causes an hgFindSpec table to be created in each database. This parallels the generation of trackDb_$USER and trackDb tables from the same trackDb.ra files and the same make targets.
This document assumes familiarity with trackDb and the kent/src/hg/ tree, and describes how to go about adding a new search to a trackDb.ra file. It also contains a bit of reference info about the new table and programs, including a diagnostic program checkHgFindSpec which helps test the searches.
searchTable mySpiffyTrack
In more complicated cases, you might have nice names in a non-positional table (e.g. stsAlias) which you would like to make searchable by cross-referencing them to a positional/track table (e.g. stsMap). In that case, you'll need to add a line to identify the cross-referencing table:
xrefTable mySpiffyAlias
If there is already a search defined for your table, but you want to add another, then you will have to make up a different searchName to distinguish the two searches (by default, searchName = searchTable; searchTable does not have to be unique among searches, but searchName does):
searchName mySpiffyTrackSpecial
searchTable mySpiffyTrack
searchType bed
Some tables/searches already have special search code written for them, so you can name them as the searchType (but you usually won't have to write new descriptions for them since most are in the top-level trackDb.ra):
query select myChrom, myChromStart, myChromEnd, myName from %s where myName like '%s'
If you want your query format to end with '%s%%' (so that a search term of "hox" will result in a query on 'hox%'), then add this line to your search spec:
searchMethod prefix
Similarly, if you want to use '%%%s%%' (=> '%hox%'), add this:
searchMethod fuzzy
The default searchMethod is exact ('%s'). If you don't write your own query, then hgFindSpec will use searchMethod to pick an ending for the default query for searchType. If you write your own query and don't have an xrefTable/xrefQuery, then your query must end with a pattern that's consistent with searchMethod.
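The three endings are ordinary sprintf-style formats, so their effect on a search term is easy to try out. The following is a minimal Python sketch, not hgFindSpec's own code: the helper name like_pattern and the LIKE_ENDINGS table are hypothetical, and only the three format strings come from the text above.

```python
# Hypothetical helper (not part of the kent tree): maps each searchMethod
# to the sprintf-style ending described above and fills in the search term.
LIKE_ENDINGS = {
    "exact": "%s",       # 'hox'
    "prefix": "%s%%",    # 'hox%'
    "fuzzy": "%%%s%%",   # '%hox%'
}


def like_pattern(term, search_method="exact"):
    """Return the SQL LIKE pattern that the given searchMethod would yield."""
    return LIKE_ENDINGS[search_method] % term


if __name__ == "__main__":
    for method in ("exact", "prefix", "fuzzy"):
        print(method, "->", like_pattern("hox", method))
```

Python's % operator follows the same %s and %% conventions as sprintf, which is why the format strings can be reused unchanged here.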
If you have defined an xrefTable above, then you will definitely have to define an xrefQuery for it. Here's what that will look like:
xrefQuery select trackName, searchName from %s where searchName like '%s'
Use searchMethod prefix or searchMethod fuzzy if that's what you want for xrefQuery. If you define an xrefQuery, then searchMethod applies to xrefQuery only, not query, and query has to be an exact search.
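The two-step lookup can be illustrated with a toy model. This sketch makes several assumptions: it uses an in-memory SQLite database instead of the browser's MySQL databases, the column names and rows are invented, and the function xref_search is hypothetical. It only shows the flow described above: searchMethod shapes the xrefQuery pattern, and each resulting trackName is then looked up with an exact query.

```python
import sqlite3

# Toy stand-ins for a positional table and its alias (xref) table.
# Table names follow the examples above; columns and rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table mySpiffyTrack (chrom text, chromStart int, chromEnd int, name text);
    create table mySpiffyAlias (trackName text, searchName text);
    insert into mySpiffyTrack values ('chr1', 1000, 2000, 'item1');
    insert into mySpiffyTrack values ('chr2', 5000, 6000, 'item2');
    insert into mySpiffyAlias values ('item1', 'hoxA1');
    insert into mySpiffyAlias values ('item2', 'hoxB2');
""")

XREF_QUERY = "select trackName, searchName from %s where searchName like '%s'"
QUERY = "select chrom, chromStart, chromEnd, name from %s where name like '%s'"
ENDINGS = {"exact": "%s", "prefix": "%s%%", "fuzzy": "%%%s%%"}


def xref_search(term, search_method="exact"):
    """Run xrefQuery with the searchMethod pattern, then an exact query."""
    # String interpolation into SQL is for illustration only; it is not
    # safe for untrusted input.
    pattern = ENDINGS[search_method] % term
    xref_sql = XREF_QUERY % ("mySpiffyAlias", pattern)
    results = []
    for track_name, alias in conn.execute(xref_sql).fetchall():
        # The second step always uses the exact name from the xref table.
        sql = QUERY % ("mySpiffyTrack", track_name)
        for row in conn.execute(sql).fetchall():
            results.append((alias,) + row)
    return results


if __name__ == "__main__":
    print(xref_search("hox", "prefix"))
```

Note that an exact search for "hox" finds nothing in this toy data, while a prefix search finds both aliases and their positions.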
termRegex NT_[0-9]{6}
Ultimately, the pattern we're looking for is the one followed by the user search terms that should be applied to our table. This is almost always the same as the pattern of the names in the table. (Exceptions: when the user types in a prefix that is not found in the table's names, e.g. "HG-U95:", or when the user omits a suffix that is found in the table's names.)
Human beings are pretty good at recognizing patterns in names. We've even written little languages to describe text patterns as "regular expressions", or regexes, which are easy for computers to parse and then evaluate on arbitrary input (like users' search terms). hgFindSpec's termRegex field uses the regular expression language regex. If you have used egrep before, you already know regex. If you have used fancy glob commands, you have a good headstart. If you use Perl regexps a lot, you are spoiled but regex will be straightforward enough.
One way to learn regex is by example:
man 7 regex
man regex
There are also numerous references on regex out there, e.g. http://www.delorie.com/gnu/docs/regex/regex_toc.html ... wow, looks like there's even a GUI wizard/coach: http://www.weitz.de/regex-coach/. And you can always ask an old UNIX-head like me for help.
Here's my favorite way to define a termRegex, and make sure that it
really covers all the names in a table:
hgsql $db -N -e "select name from $table limit 10"
Eyeball the results and write a regex. Then try out that regex in this command (substitute it in for __TERMREGEX__):
hgsql $db -N -e "select name from $table" | egrep -vi '^__TERMREGEX__$' | head
If that returns any results, then your regex needs to be loosened up to incorporate those. Keep on playing with the regex and running that command until it comes back clean - there's your termRegex*!
*In those rare cases mentioned above when the user types in something a little different from what's in the table, use the regex you just derived as the dontCheck setting, so checkHgFindSpec -checkTermRegex won't complain. Write a termRegex to match what the user types in. Do some extra testing to make sure that your termRegex encompasses all user search terms that should match.
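The same "show me the names my regex misses" check can be sketched offline on a list of names. This is only an approximation of the egrep pipeline above: the function names_not_matching is hypothetical, and Python's re module is not POSIX regex (for example, it does not support bracket classes such as [[:alnum:]]), so it is suitable only for simple patterns like NT_[0-9]{6}.

```python
import re


def names_not_matching(names, term_regex):
    """Return the names that an anchored, case-insensitive regex misses.

    Mirrors the intent of: egrep -vi '^__TERMREGEX__$'
    """
    pattern = re.compile(term_regex, re.IGNORECASE)
    # fullmatch() requires the whole name to match, like the ^...$ anchors.
    return [name for name in names if not pattern.fullmatch(name)]


if __name__ == "__main__":
    sample = ["NT_123456", "nt_000001", "NT_12345", "NM_000001"]
    # An empty list would mean the regex covers every name in the sample.
    print(names_not_matching(sample, r"NT_[0-9]{6}"))
```

The real check should still be run against the full table with hgsql and egrep, or with checkHgFindSpec -checkTermRegex, since those use the same regex dialect as termRegex.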
If we have a nice clear-cut case like that, we can make it a
shortCircuit search:
shortCircuit 1
... but be extra-sure that terms found there won't have interesting
matches anywhere else. For example, the snpMap table contains a bunch
of IDs that start with rs and end with one or more digits. But there
is a gene rs10, so we don't want to shortCircuit because then the user
couldn't search for that gene -- they'd be zapped to the SNP whether
they wanted it or not. So we define two searches for snpMap: one that
shortCircuits for rs followed by a bunch of numbers (unambiguously a
SNP ID), and one that doesn't shortCircuit but searches for rs
followed by a small number of numbers.
hgFind performs shortCircuit searches first, stopping if it gets a match. If no shortCircuit search produces a match, then hgFind performs all other (additive/non-shortCircuit) searches.
A slight twist to this mechanism is the semiShortCircuit setting:
semiShortCircuit 1
That allows other shortCircuit or semiShortCircuit searches to be
performed even if a match is found for this search, and is for use
when we need the speed but are not absolutely sure that this track
contains the only correct result for the search term.
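The ordering rules above can be modeled in a few lines. This is a simplified sketch under stated assumptions, not hgFind's actual implementation: the functions make_search and run_searches, the example regexes, the priorities, and the tiny name lists are all hypothetical, and semiShortCircuit searches are assumed to be tried together with shortCircuit searches.

```python
import re


def make_search(name, term_regex, names, priority, short=False, semi=False):
    """Build a toy search spec; names stands in for the table's name column."""
    pattern = re.compile(term_regex, re.IGNORECASE)

    def find(term):
        # termRegex gates the search: no regex match, no query.
        if not pattern.fullmatch(term):
            return []
        return [(name, n) for n in names if n.lower() == term.lower()]

    return {"priority": priority, "short": short or semi,
            "semi": semi, "find": find}


def run_searches(term, searches):
    """shortCircuit searches first; additive searches only if none matched."""
    ordered = sorted(searches, key=lambda s: s["priority"])
    results = []
    for s in ordered:
        if not s["short"]:
            continue
        hits = s["find"](term)
        results.extend(hits)
        if hits and not s["semi"]:
            return results  # plain shortCircuit: stop at the first match
    if results:
        return results      # only semiShortCircuit matches were found
    for s in ordered:
        if not s["short"]:
            results.extend(s["find"](term))
    return results


if __name__ == "__main__":
    snps = ["rs10", "rs1234567"]
    genes = ["rs10", "HOXA1"]
    searches = [
        make_search("snpLong", r"rs[0-9]{4,}", snps, 1, short=True),
        make_search("gene", r"[a-z0-9]+", genes, 5),
        make_search("snpShort", r"rs[0-9]{1,3}", snps, 10),
    ]
    print(run_searches("rs1234567", searches))  # short-circuits on the SNP
    print(run_searches("rs10", searches))       # gene and SNP both reported
```

In this model a long rs ID jumps straight to the SNP, while rs10 falls through to the additive searches so that the gene is still reachable.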
checkHgFindSpec $db
Figure out where your search should fit in (this is not nearly as important as whether it's shortCircuit or not! but ask an old-timer if you're having trouble deciding). For additive/non-short-circuit searches, if there are a bunch of matches from various tracks, in what order should those tracks' matches be presented to the user?
Then look at the searchPriorities of the searches between which your search should fit, and pick a (floating-point) number between those two numbers.
searchPriority 42
cd kent/src/hg/makeDb/trackDb
make update DBS=$db1 ZOO_DBS=
# or if your search applies to more than one db
make update DBS="$db1 $db2" ZOO_DBS=
hgFindSpec can catch some problems with search definitions, such as missing fields or improperly formatted queries or termRegexes.
Next, use the checkHgFindSpec utility
to try out an example search and see if there are any incomplete
termRegexes.
checkHgFindSpec $db $exampleTerm
checkHgFindSpec $db -checkTermRegex
Then open up a browser window on hgwdev-$USER and try a bunch of examples. If it looks OK, check in your trackDb.ra changes, go to a clean updated tree, and do a "make alpha" in kent/src/hg/makeDb/trackDb/ .
Sometimes the search terms that users type in are not quite the same as the name values in the tables to be searched. For example, for our affy* tracks, we tell users to prefix probe IDs with chip IDs, but the affy* tables contain just probe IDs. So the user may type in "HG-U95:1003_s_at", but the item name in the affyU95 table is just "1003_s_at". To tell hgFind (and checkHgFindSpec -checkTermRegex) that search terms (and termRegex) have a prefix that does not appear in the table, add a line like this to the trackDb.ra search spec:
termPrefix HG-U95:
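As a rough illustration of what termPrefix implies for a lookup, the sketch below removes the prefix from the user's term to get the name as stored in the table. It is an assumption about the general idea, not hgFind's code: the function strip_term_prefix is hypothetical, and the case-sensitive comparison is a simplification.

```python
def strip_term_prefix(term, term_prefix):
    """Return the table-side name for a user term, or None if the
    term does not start with the prefix (case-sensitive here, which
    is a simplifying assumption)."""
    if term.startswith(term_prefix):
        return term[len(term_prefix):]
    return None


if __name__ == "__main__":
    print(strip_term_prefix("HG-U95:1003_s_at", "HG-U95:"))
```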
For all other cases where user search terms (and therefore termRegex) don't match the actual values in the table, or are a subset of the actual values in the table, add a line like this with a regex that will cover the table values not covered by termRegex, so that checkHgFindSpec -checkTermRegex doesn't flag it as an error:
dontCheck [[:alnum:]]+\.[0-9]+
padding 5000
That will cause 5000 to be subtracted from the start and added to the end of search results (unless the user has entered multiple search terms separated by ";" in order to get the range between them).
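The arithmetic is simple enough to sketch. The function pad_range is hypothetical, and clamping the start at 0 is an assumption added here so the example never produces a negative coordinate; clamping the end to the chromosome size is not modeled.

```python
def pad_range(chrom_start, chrom_end, padding=5000):
    """Widen a search result by padding bases on each side."""
    # Clamping at 0 is an assumption, not something stated above.
    return max(0, chrom_start - padding), chrom_end + padding


if __name__ == "__main__":
    print(pad_range(10000, 12000))
```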
searchBoth 1
searchDescription Alias of STS Marker
Here's the kent/src/hg/lib/hgFindSpec.as description of the fields:
    string searchName;         "Unique name for this search. Defaults to searchTable if not specified in .ra."
    string searchTable;        "(Non-unique!) Table to be searched. (Like trackDb.tableName: if split, omit chr*_ prefix.)"
    string searchMethod;       "Type of search (exact, prefix, fuzzy)."
    string searchType;         "Type of search (bed, genePred, knownGene etc)."
    ubyte shortCircuit;        "If nonzero, and there is a result from this search, jump to the result instead of performing other searches."
    string termRegex;          "Regular expression (see man 7 regex) to eval on search term: if it matches, perform search query."
    string query;              "sprintf format string for SQL query on a given table and value."
    string xrefTable;          "If search is xref, perform xrefQuery on search term, then query with that result."
    string xrefQuery;          "sprintf format string for SQL query on a given (xref) table and value."
    float searchPriority;      "0-1000 - relative order/importance of this search. 0 is top."
    string searchDescription;  "Description of table/search (default: trackDb.{longLabel,tableName})"
Here is a description of currently supported settings:
checkHgFindSpec database [options | termToSearch]
If given a termToSearch, displays the list of tables that will be searched and how long it took to figure that out; then performs the search and the time it took.
options:
    -showSearches    Show the order in which tables will be searched in general. [This will be done anyway if no termToSearch or options are specified.]
    -checkTermRegex  For each search spec that includes a regular expression for terms, make sure that all values of the table field to be searched match the regex. (If not, some of them could be excluded from searches.)
    -checkIndexes    Make sure that an index is defined on each field to be searched.
The most common uses:
checkHgFindSpec $db
checkHgFindSpec $db $searchTerm
checkHgFindSpec $db -checkTermRegex
Here's the hgFindSpec usage, just for completeness:
hgFindSpec [options] orgDir database hgFindSpec hgFindSpec.sql hgRoot
Options:
    -strict  Add spec to hgFindSpec only if its table(s) exist.
    -raName=trackDb.ra - Specify a file name to use other than trackDb.ra for the ra files.