SQL in SkyServer
Database Fundamentals
The CAS provides access to SDSS-III catalog data that is stored in a relational database management system (DBMS) for data integrity and access speed. The data is organized into tables in a relational database. The SkyServer is the Web portal to the CAS databases and allows you to submit SQL queries to extract the data that you need from these databases. However, you do not usually need to specify which database your query is run on, since the SkyServer is configured by default to submit your queries to a particular database. This site is configured to submit your queries to the BestDR18 database, which contains the best data and most recent processing for the entire released sky area.
The bestdr16 database contains a large number of tables, some of which contain photometric measurements (such as PhotoObj), spectroscopic measurements (such as SpecObj), or information about the observing conditions (Field) or survey geometry(TileBoundary). See the Introduction to the Data Model page for more details.
In addition to the tables, we have defined Views, which can be thought of as virtual tables and are subsets or combinations of the data stored in the tables. Views are queried the same way Tables are; they exist just to make your life easier. For instance, the view Galaxy can be used to get photometric data on objects we classify as galaxies, without having to specify the classification in your query.
Both the SkyServer and CasJobs interfaces have a Schema Browser. It shows you all of the available databases, the tables in each database, and the quantities stored in each column of the tables.
Finally, we have created a variety of functions and stored procedures which let you easily perform some common operations. Usually, their names are prefixed by f or sp, like in fPhotoStatus or spGetFiberList. The full list of functions and store procedures is found in the Schema Browser. Note that some functions are scalar-valued, meaning that they return a single value, while others (such as the commonly used dbo.fGetNearbyObjEq, are table-valued; they actually return a table of data, and not a single number. This is important when interpreting the returned data and performing joins.
Please note the caution about using function calls as noted in the Optimizing Queries section when attempting queries over that return a large number of objects.
Query Fundamentals
Now that you have an overview of the database structure, how can you actually get data out? You can either use one of the form-based Search Tools, , or you can choose to write your own query using SQL using the SQL Search tool. Running a SQL query is the most direct and powerful way to interact with the database. The following is a brief introduction to writing SkyServer SQL queries. You can view other help pages like Cooking with Sloan and SQL Tutorial for additional help on writing SQL queries in SkyServer.
The most basic query consists of three parts:
- A SELECT clause, which specifies the parameters you wish to retrieve;
- A FROM clause, which specifies the database tables you want to extract the data from;
- A WHERE clause, which specifies the limitations/predicates you want to place on the extracted data.
The WHERE clause is not necessary if you want to retrieve parameters of all objects in a specified table, but this typically will be an overwhelming amount of data!
Note that the query language is insensitive to splitting the query over many lines. It is also NOT case sensitive. To make queries more readable, it is common practice to write the distinct query clauses on separate lines. The Sample Queries button on the CasJobs Query page provides a variety of samples, ordered in complexity. For instance, to obtain the list of unique Fields that have been loaded into the database, we use:
select fieldID from field
You can just copy and paste this (or any other) query into the Sql Search tool of SkyServer, and press submit, or into the CasJobs query window, and press the submit button.
If we want to retrieve multiple parameters from the database, we separate them with commas:
select ra, dec from galaxy
Of course, the parameters you request must be included in the table(s) you are querying! Now, let's say we want magnitudes of all bright galaxies. We will need to specify a magnitude range to do this:
SELECT u,g,r,i,z FROM Galaxy WHERE r < 12 and r > 0
Here, we have used the WHERE clause to provide a magnitude range. The and operator is used to require that multiple limits be met. This leads us to...
Simple Logical and Mathematical Operators
Not only can we place limits on individual parameters, we can place multiple limits using logical operators, as well as place limits on the results of mathematical operations on multiple parameters. We may also retrieve results that are logical joins of multiple queries. Here we list the logical, comparison, and mathematical operators.
The LOGICAL operators are AND,OR,NOT; they work as follows:
a | b | a AND b | a OR b |
---|---|---|---|
TRUE | TRUE | TRUE | TRUE |
TRUE | FALSE | FALSE | TRUE |
TRUE | NULL | NULL | TRUE |
FALSE | FALSE | FALSE | FALSE |
FALSE | NULL | FALSE | NULL |
NULL | NULL | NULL | NULL |
When comparing values, you will use the COMPARISON operators:
Operator | Description |
---|---|
< | less than |
> | greater than |
<= | less than or equal to |
>= | greater than or equal to |
= | equal |
<> or != | not equal |
In addition to the comparison operators, the special BETWEEN construct is available. a BETWEEN x and y is equivalent to a >= x AND a <= y.
Similarly, a NOT BETWEEN x and y is equivalent to a < x OR a > y.
Finally, the MATHEMATICAL operators (both numeric and bitwise) are:
Name | Description | Example | Result |
---|---|---|---|
+ | Addition | 2 + 3 | 5 |
- | Subtraction | 2 - 3 | -1 |
* | Multiplication | 2 * 3 | 6 |
/ | Division | 4 / 2 | 2 |
% | Modulo (remainder) | 5 % 4 | 1 |
POWER | Exponentiation | POWER (2.0,3.0) | 8.0 |
SQRT | Square root | SQRT (25.0) | 5.0 |
ABS | Absolute value | ABS (-5.0) | 5.0 |
& | Bitwise AND | 91 & 15 01011011 & 00001111 | 11 00001011 |
| | Bitwise OR | 32 | 3 00100000 | 00000011 | 35 00100011 |
^ | Bitwise XOR | 17 # 5 00010001 # 00000101 | 20 00010100 |
~ | Bitwise NOT | ~1 | -2 |
AVG | Average | AVG(ModelMag_r) | |
MIN | Minimum | MIN(ModelMag_r) | |
MAX | Maximum | MAX(ModelMag_r) | |
LOG | Natural Logarithm | LOG(petroMag_r) | |
LOG10 | Base-10 Logarithm | LOG10(petroMag_r) | |
EXP | Exponential Value (ex) | EXP(2.5) | 12.182494 |
In addition, the usual mathematical and trigonometric functions are available in SQL, such as COS, SIN, TAN, ACOS, etc.
Querying Bit Flags
Several SDSS tables contain bit-encoded flags to indicate various types of information about the object or quantity in question (e.g., the PhotoObjAll table and the PhotoTag view each have the flags column, SpecObj has zWarning and various targeting flags etc.).
One of the most important uses of bit flags is to indicate why an object was targeted for spectroscopy. A list of spectroscopic target flags is available on the Optical Spectra Target Flags page.
This section describes how you can test for flag values in your query. For sample queries that demonstrate the use of flags, see the Using flags, Diameter limited sample, LRG sample, Clean photometry with flags - stars, and Clean photometry with flags - galaxies Sample Queries for examples on how to use flags.
Checking a single flag
To return rows for which the flag is set, the basic syntax for the constraint is:
(flag-column & bitmask) != 0
and to return rows for which the flag is not set:
(flag-column & bitmask) = 0
where bitmask is the binary value in which the bit corresponding to the flag is 1 and all other bits are 0. You can use the flag functions provided by the SkyServer (listed in the Schema Browser) to get the bitmask for a given flag, but if you are scanning a large fraction of a large table like PhotoObj, you are better off not making a function call for each row, and in that case you should first get the actual binary value of the bitmask first and substitute that instead. This is described in the Using dbo functions in your query subsection of the Optimizing Queries section below.
For example, to select objects for which the BLENDED flag is set in PhotoTag, you would use a query like:
SELECT top 10 objid, flags FROM PhotoTag WHERE flags & dbo.fPhotoFlags('BLENDED') > 0
and to select only objects for which the flag is NOT set, use
SELECT top 10 objid, flags FROM PhotoTag WHERE flags & dbo.fPhotoFlags('BLENDED') = 0
Checking multiple flags
To test if multiple flags are set, you can combine the values by adding them and then testing the result.
To select objects for which all of several flags are set, generate the combined bitmask by adding the individual flag bitmasks, then compare the result of ANDing the combined bitmask with the flag column with the combined bitmask itself, e.g.,
SELECT top 10 objid, flags FROM PhotoTag WHERE ( flags & (dbo.fPhotoFlags('NODEBLEND') + dbo.fPhotoFlags('BINNED1') + dbo.fPhotoFlags('BINNED2')) ) = ( dbo.fPhotoFlags('NODEBLEND') + dbo.fPhotoFlags('BINNED1') + dbo.fPhotoFlags('BINNED2') )
To select objects for which at least one of several flags is set, you just need to check that ANDing the combined bitmask with the flag column returns a non-zero result, e.g.,
SELECT top 10 objid, flags FROM PhotoTag WHERE ( flags & (dbo.fPhotoFlags('NODEBLEND') + dbo.fPhotoFlags('BINNED1') + dbo.fPhotoFlags('BINNED2')) ) > 0
To select objects for which none of several flags is set, the result of ANDing the flag column with the combined bitmask must be 0, e.g.,
SELECT top 10 objid, flags FROM PhotoTag WHERE ( flags & (dbo.fPhotoFlags('NODEBLEND') + dbo.fPhotoFlags('BINNED1') + dbo.fPhotoFlags('BINNED2')) ) = 0
As mentioned above, if you are running a query that is expected to match a large number of rows (millions), it is better to first obtain the binary bitmask resulting from the multiple flag arithmetic and using that single bitmask instead of repeated function calls to the flag functions, as described in the Using dbo functions in your query subsection of the Optimizing Queries section below.
Clean Photometry
The SDSS photo pipeline sets a number of flags that indicate the quality of the photometry for a given object in the catalog. If you desire objects with only clean photometry for science, you should be aware that you need to filter out unwanted objects yourself in your query. This is not done automatically for you (e.g. with a view of the PhotoObjAll table). The main reason is that the flag constraints that are required for this filtering often impose a significant performance penalty on your query.
There is a single up or down flag that is available in the PhotoObjAll table (and its views) called "clean" that is set to 1 if the photometry meets our definition of good photometry, and the use of this shorthand flag is illustrated in the Clean Photometry sample query. This is meant to provide a simple way to select objects with clean photometry. However, if you do not trust this or you want to be more specific and use the individual photo flags to select objects that meet your criteria for "clean photometry", this is illustrated in two other sample queries: Clean Photometry with flags - Stars and Clean Photometry with flags - Galaxies Sample Query .Excluding Invalid Data Values
As mentioned in the Early Data Release paper (Stoughton et al. 2002), the database designates quantities that are not calculated for a particular object in a table with special values, as follows:
- The value of a quantity that has not been calculated is set to -9999
- The value of an error that has not been calculated is set to -1000
To exclude such invalid values from your query result, you should include constraints in your WHERE clause explicitly to filter them out, e.g.
SELECT ra,dec,u,err_u FROM PhotoObj WHERE ra BETWEEN 180 AND 181 AND dec BETWEEN -0.5 AND 0.5 AND u BETWEEN -9999 AND 20.0 -- or "u > -9999 AND u < 20.0", -- instead of just "u < 20.0" AND err_u BETWEEN -1000 AND 0.1 -- or err_u > -1000 AND err_u < 0.1, -- instead of just "err_u < 0.1"
Changing Precision of Query Output
Use the STR(column,n,d) SQL construct (where n is the total number of digits and d is the number of decimal places) to set the precision of the column that your query requests. The SkyServer returns values with a default precision that is set for each data type, and this may not be enough for columns like ra, dec etc. See the Selected neighbors in run or Uniform Quasar Sample Sample Queries queries for examples of how to use STR.
Joins: Querying with Multiple Tables
You may wish to obtain quantities from multiple tables, or place constraints on quantities in one table while obtaining measurements from another. For instance, you may want magnitudes (from PhotoObj) from all objects spectroscopically identified (SpecObj) as galaxies. To perform these types of queries, you must use a join. You can join any two (or more) tables in the databases as long as they have some quantity in common (typically an object or field ID). To actually perform the join, you must have a JOIN subclause in the FROM clause of your query that specifies the common quantity to be equal in the two tables. Here is an example, getting the g magnitudes for stars in fields where the PSF fitting worked well:
SELECT TOP 10 s.psfMag_g FROM Star s JOIN Field f ON s.fieldID = f.fieldID WHERE s.psfMag_g < 20 AND f.pspStatus = 2
Notice how we define abbreviations for the table names in the FROM clause; this is not necessary but makes for a lot less typing. Also, you do not have to ask for quantities to be returned from all the tables. You must specify all the tables on which you place constraints (including the join) in the FROM clause, but you can use any subset of these tables in the SELECT. If you use more than two tables, they do not all need to be joined on the same quantity. For instance, this three way join is perfectly acceptable:
SELECT TOP 10 p.objID,f.field,z.z FROM PhotoObj p JOIN Field f ON f.fieldid = p.fieldid JOIN photoz z ON p.objid = z.objid WHERE f.psfWidth_r > 1.2 AND p.colc > 400.0
The type of joins shown above are called inner joins. In the above examples, we only return those objects which are matched between the multiple tables. If we want to include all rows of one of the tables, regardless of whether or not they are matched to another table, we must perform an outer join. One example is to get photometric data for all objects, while getting the spectroscopic data for those objects that have spectroscopy.
In the example below, we perform a left outer join, which means that we will get all entries (regardless of matching) from the table on the left side of the join. In the example below, the join is on P.objID = s.BestObjID; therefore, we will get all photometric (P) objects, with data from the spectroscopy if it exists. If there is no spectroscopic data for an object, we'll still get the photometric measurements but have nulls for the corresponding spectroscopy.
select top 100 P.objID, P.ra, P.dec, S.SpecObjId, S.ra, S.dec from PhotoObj as P left outer join SpecObjAll as S on P.objID = s.BestObjID
When using table valued functions, you must do the join explicitly (rather than using "="). To do this, we use the syntax:
SELECT quantities
FROM table1
JOIN table2 on table1.quantity = table2.quantity
WHERE constraints
For instance, in the example below, we use the function dbo.fGetNearbyObjEq to get all objects within a given radius (in this case, 1') of a specified coordinate. This is a table-valued, so it returns a table, containing the ObjIDs and distances of nearby objects. We want to get further photometric parameters on the returned objects, so we must join the output table with PhotoObj:
SELECT G.objID, GN.distance FROM Galaxy as G JOIN dbo.fGetNearbyObjEq(115.,32.5, 1.0) AS GN ON G.objID = GN.objID WHERE (G.flags & dbo.fPhotoFlags('saturated')) = 0
Manipulating Query Output
SQL provides a number of ways to reorder, group, or otherwise arrange the output of your queries. Some of these options are:
-
count: Just tell me how many objects would be returned by my query. You can specify a
column name as the argument to the count function or just "*" to mean all columns. It doesn't really matter
(unless you are including a DISTINCT qualifier, see below) since it will count all rows that match your
query anyway. Example:
SELECT count(*) FROM Galaxy WHERE ra between 180.1 and 180.5
-
distinct: Return only the unique values of the quantities requested in the SELECT
statement. Example:
SELECT distinct run FROM Field
Here a COUNT would return different counts depending on the column you selected, e.g.SELECT count(distinct run) FROM Field
would return a different number in general from:SELECT count(distinct field) FROM Field
-
top: Return only the first n rows of the query results. We've already been using
this above. Note that the rows selected in the TOP x are not deterministic, i.e., there is no order
implied in the rows selected. This is true of all database queries in general. You have to explicitly
enforce an order in your query if you want, and it is an expensive option in terms of query execution
time usually (see ORDER BY clause below). Example:
SELECT top 10 r FROM Star
-
order by: Order the output by the specified quantities. Default is ascending order,
but you can specify descending as well. You can also order by multiple columns. Example:
SELECT top 10 u,g,r FROM Star order by g,r desc
Note how repeatedly executing this query returns the same 100 rows. This is not true of the previous query, especially if you run it at different times so the cache does not come into play. -
group by: Group the output by the specified quantities. For instance, you could have all
the stars in the output, followed by the galaxies. You could also perform operations on the grouped
quantities. You could get the min and max magnitudes for stars and galaxies separately, as shown below:
SELECT min(r),max(r) FROM PhotoPrimary group by type
You can use this to count how many of each object type is loaded as primary photometric objects, for instance:SELECT count(r) FROM PhotoPrimary group by type
Optimizing Queries
It is easy to construct very complex queries which can take a long time to execute. When writing queries, one can often rewrite them to run faster. This is called optimization.
The first, and most trivial, optimization trick is to use the minimal Table or View for your query. For instance, if all you care about are galaxies, use the Galaxy view in your FROM clause, instead of PhotoObj. We have also created a 'thin' version of PhotoObjAll, called PhotoTag. This vertical subset contains all the objects in PhotoObjAll, but only a subset of the measured quantities. Using the PhotoTag view to speed up the query only makes sense if you do NOT want parameters that are only available in the full PhotoObjAll.
It is extremely useful to think about how a database handles queries, rather than trying to write a plain, sequential list of constraints. NOT every query that is syntactically correct will necessarily be efficient; the built-in query optimizer is not perfect! Thus, writing queries such that they use the tricks below can produce significant speed improvements.
Using indices in your query
Another simple way to make queries faster is to use indexed quantities to search on. There are two types of indices in the CAS, indices built into the database, and an external spatial index that we have added to make spatial searches much faster, called Hierarchical Triangular Mesh (HTM) . The latter is explicitly invoked by using the built-in spatial search functions like fGetNearbyObjEq (does a radial search in equatorial coordinates), or fGeObjFromRectEq (searches in a rectangular area). Database indices are invoked automatically when you include columns in your search (in the WHERE clause) that have indices built on them. There are 3 types of database indices:
- Primary Key (PK) indices: these are also called clustered indices because the data is physically arranged (clustered) on disk in the ascending order of this key. Each table has exactly one clustered or PK index, built on its primary search key. For most of the photometric data tables, this is the objID column, and for most of the spectroscopic data tables, this is the specoObjID column.
- Primary Key (PK) indices: these are indices built on columns that define a relationship with other tables, e.g. objID in the SpecObjAll table is a foreign key on the PhotoObjAll table. Having FK indices helps to speed up queries that include a JOIN between the SpecObjAll table (and its associated views) and the PhotoObjAll table (and its views).
- Covering indices: these are indices created on (groups of) columns that are frequently used (together). Thus they "cover" the search space defined by those columns. They may be unique or non-unique (allow duplicates).
All database indices defined on a given table in the CAS are listed in the Schema Browser under the Indices tab. If you click on the Indices tab, you will get a listing of all indices for all tables. If you want to view the indices for a single table, expand the Indices tab, scroll down the list to the table you want and just click on the table name to see all its indices.
If you must search on non-indexed columns in addition to the indexed ones, you can still benefit by first performing a query using only the indexed quantities, and then select those parameters from the returned subset of objects. An indexed quantity is one where a look-up table has effectively been calculated, so that the database software does not have to do a time-consuming sequential search through all the objects in the table. For instance, sky coordinates cx,cy,cz are indexed using a Hierarchical Triangular Mesh (HTM). So, you can make a query faster by rewriting it such that it is nested; the inner query grabs the entire row for objects of interest based on the indexed quantities, while the outer query then gets the specific quantities desired.
Using dbo functions in your query
Finally, a word of caution about using function calls in queries. If your query is going to match a large number of objects (million or more), using a function call, especially one that operates on a constant or literal, in the WHERE clause is not a good idea, because the function will be called once per matching row in that table, resulting in a significant performance hit. Here is an example of this:
SELECT ... FROM PhotoObj WHERE flags & dbo.fPhotoFlags('BLENDED') > 0
In this case, it would be better to first do the pre-query:
SELECT dbo.fPhotoFlags('BLENDED')
to get the bitmask value for that flag, and then rewrite the above query as:
SELECT ... FROM PhotoObj WHERE flags & 8 > 0
This will avoid the wastefully repeated function call for each and every photoObj in the table. This is even more important when you are using multiple flags and you can reduce the comparison to a single bitmask using flag arithmetic. In the final example above in the Querying Bit Flags section, you can replace the original query:
SELECT top 10 objid, flags FROM PhotoTag WHERE ( flags & (dbo.fPhotoFlags('NODEBLEND') + dbo.fPhotoFlags('BINNED1') + dbo.fPhotoFlags('BINNED2')) ) = 0
with a more efficient version by first running the following pre-query:
SELECT (dbo.fPhotoFlags('NODEBLEND') + dbo.fPhotoFlags('BINNED1') + dbo.fPhotoFlags('BINNED2'))
which returns the bitmask value 805306432, which can in turn be substituted back in the original query as follows:
SELECT top 10 objid, flags FROM PhotoTag WHERE ( flags & 805306432 ) = 0
so as to save 3 function calls and make the query significantly more efficient. (In this particular example it does not matter because we are only asking for 10 rows, but if the "TOP 10" were to be removed and the query was run on millions of rows, it would make a difference).