TL;DR see this git repository for a collection of scripts on web-scraping the ADS database.

On ADS bibcodes

(2021-05-24)

My favourite piece of literature is definitely 1916AbhKP1916..189S!

The de facto database of astronomical scientific literature is NASA's ADS. Under the hood, the database stores literature by giving each item a unique identifier, akin to a DOI (digital object identifier) or a social security number. In the case of ADS, this identifier is known as a bibcode. The bibcode provides a one-to-one mapping between a bibliographic item (paper, conference talk, etc.) and its ADS database entry, so understanding its format is the key to working with the database.

Consider for the sake of an example Karl Schwarzschild's seminal 1916 work "Über das Gravitationsfeld eines Massenpunktes nach der Einsteinschen Theorie" published in "Sitzungsberichte der Königlich Preussischen Akademie der Wissenschaften zu Berlin". The link to this work online in the ADS database is:

https://ui.adsabs.harvard.edu/abs/1916AbhKP1916..189S

and the corresponding bibliographic code is:

1916AbhKP1916..189S

That is, the bibcode consists of exactly 19 characters with the following fields defining the format:

Entry    Meaning    Width (characters)  Explanation
1916     Year       4                   4-digit number
AbhKP    Journal    5                   ADS journal abbreviation, e.g. MNRAS, A&A, ApJ, AN, etc.
1916     Volume     4                   volume of the journal
(empty)  Qualifier  1                   special qualifier flag, e.g. for letters
189      Firstpage  4                   first page in the journal
S        Lastname   1                   first letter of the last name of the author

Note that the width of the bibcode is exactly 19 characters. To pad the bibcode to this width, dots are added in place of missing fields (e.g. the qualifier in our example) and in fields that do not fill their maximum width. Text fields such as the journal are left-justified (e.g. "ApJ" becomes "ApJ..") and number fields are right-justified (e.g. firstpage "189" becomes ".189").
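The field layout above can be sketched as a small POSIX shell function that cuts a bibcode at the fixed column positions and strips the padding dots. The function name parse_bibcode and the key=value output format are illustrative choices, not anything defined by ADS, and the sketch assumes the standard journal-article layout from the table.

```sh
#!/bin/sh
# Split a 19-character bibcode into its six fixed-width fields and
# strip the padding dots from each field.
parse_bibcode() {
    printf '%s\n' "$1" | awk '
        # remove leading and trailing padding dots from a field
        function strip(s) { gsub(/^\.+|\.+$/, "", s); return s }
        # reject anything that is not exactly 19 characters wide
        length($0) != 19 { print "not a 19-character bibcode" > "/dev/stderr"; exit 1 }
        {
            printf "year=%s journal=%s volume=%s qualifier=%s page=%s initial=%s\n",
                substr($0, 1, 4), strip(substr($0, 5, 5)), strip(substr($0, 10, 4)),
                strip(substr($0, 14, 1)), strip(substr($0, 15, 4)), substr($0, 19, 1)
        }'
}
```

Calling parse_bibcode '1916AbhKP1916..189S' prints the six fields on one line, with an empty qualifier and the page reduced to 189.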

The above description holds true only for published journal articles, where the above fields are well defined. For arXiv pre-prints, the "journal" field is simply replaced with "arXiv", and the following 9 characters contain the arXiv identifier (e.g. 2105.00001 for the first article of May 2021). So for a hypothetical author with last name initial 'A', the corresponding ADS bibcode reads simply:

2021arXiv210500001A

Notice the width of exactly 19 characters with the dot omitted from the arXiv identifier.

https://arxiv.org/help/arxiv_identifier
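As a minimal sketch of the rule above, the following function assembles the bibcode from an arXiv identifier and a surname initial. The name arxiv_bibcode is hypothetical, and the sketch assumes identifiers of the form YYMM.NNNNN only (as in the example); anything else is rejected rather than guessed at. It also assumes the year is 20YY.

```sh
#!/bin/sh
# Build an ADS bibcode for an arXiv pre-print.
# $1: arXiv identifier of the form YYMM.NNNNN, $2: surname initial
arxiv_bibcode() {
    id="$1"
    initial="$2"
    case "$id" in
        [0-9][0-9][0-9][0-9].[0-9][0-9][0-9][0-9][0-9]) ;;
        *) echo "unsupported arXiv identifier: $id" >&2; return 1 ;;
    esac
    yy="$(printf '%s\n' "$id" | cut -c1-2)"     # two-digit year
    digits="$(printf '%s\n' "$id" | tr -d '.')" # identifier without the dot
    printf '20%sarXiv%s%s\n' "$yy" "$digits" "$initial"
}
```

For example, arxiv_bibcode 2105.00001 A reproduces the bibcode shown above.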

The usefulness of the bibcode stems from the fact that the prevalent citing practice in astronomy is ambiguous. An author-year citation, e.g. "Schwarzschild (1916)", identifies an article uniquely only if the surname and the year form a unique combination. This fails if the surname is common or if the same author has multiple publications within one year. Thus some additional piece of information, such as the title or the journal, is typically required to identify the article uniquely.

Using ADS bibcodes

UNIX way

(A disclaimer is in order. Automating any interaction with the ADS web service is at best frowned upon and at worst a violation of their terms of use. If one wants to do heavier database scavenging, see the next section on the ADS API, which authorizes some 5000 queries daily.)

The UNIX way of using ADS bibcodes is to parse directly the information that a browser would show. In ADS, the different elements of a bibliographic entry are neatly labeled in the URL, e.g. "abstract" for the abstract or "exportcitation" for the BibTeX citation. Thus, with the help of a few UNIX spells, the abstract and the citation are retrieved by

#!/bin/sh
# Retrieve an ADS abstract for a given bibcode
curl -sL 'https://ui.adsabs.harvard.edu/abs/2021MNRAS.502.5962A/abstract' \
| grep -A3 '[<]div class[=]["]s[-]abstract-text["][>]' \
| tail -n1 \
| sed -e 's/^[[:space:]]*//'

#!/bin/sh
# Retrieve an ADS BibTeX entry for a given bibcode
curl -sL 'https://ui.adsabs.harvard.edu/abs/2021MNRAS.502.5962A/exportcitation' \
| awk '/@ARTICLE/ { p = 1 } p { print } /^}$/ { p = 0 }' \
| sed -e 's/^.*[>]//' \
| sed -e 's/[&][#]34;/"/g'

(If the preceding concepts of URLs, HTML, streams, pipes, flags, regexes, curl, grep, tail, sed and awk are unfamiliar, plenty of information is available in the manuals and online, so rtfm!)

Any further useful information is parseable and requires only the inspection of the html elements to decide what to parse for.

Lastly, to download the full article in pdf form, use either of the following snippets. Simply replace the bibcode with the one you wish to retrieve and "tmp.pdf" with the desired name of the output file:

# arXiv e-print (should work most of the time)
curl -Lo tmp.pdf 'https://ui.adsabs.harvard.edu/link_gateway/1998AJ....116.1009R/EPRINT_PDF'

# publisher pdf (might require e.g. university VPN to work)
curl -Lo tmp.pdf 'https://ui.adsabs.harvard.edu/link_gateway/1998AJ....116.1009R/PUB_PDF'
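The two snippets above can be combined into a sketch that tries the e-print first and falls back to the publisher PDF. The function names ads_link_url and ads_fetch_pdf are made up for illustration. Percent-encoding the "&" (as in "A&A") is a precaution assumed here, and the fallback relies on curl's -f flag, so it only triggers when the server answers with an HTTP error.

```sh
#!/bin/sh
# Build a link_gateway URL for a bibcode.
# $1: bibcode, $2: link type, e.g. EPRINT_PDF or PUB_PDF
ads_link_url() {
    # percent-encode "&" so that bibcodes such as 2019A&A... stay intact
    encoded="$(printf '%s\n' "$1" | sed -e 's/&/%26/g')"
    printf 'https://ui.adsabs.harvard.edu/link_gateway/%s/%s\n' "$encoded" "$2"
}

# Download the article as PDF: arXiv e-print first, publisher PDF second.
# $1: bibcode, $2: output file
ads_fetch_pdf() {
    curl -fsLo "$2" "$(ads_link_url "$1" EPRINT_PDF)" \
        || curl -fsLo "$2" "$(ads_link_url "$1" PUB_PDF)"
}
```

Usage would be e.g. ads_fetch_pdf '1998AJ....116.1009R' tmp.pdf; the same caveat about the publisher PDF possibly requiring a university VPN applies.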

ADS API

The ADS API is the go-to solution for either complex programmed queries or retrieving large amounts of data from the ADS database in a controlled manner (from the ADS point of view).

First, get the ADS API token for your ADS account (create an account if needed) and store the API token in a dark, dry, and cool place (i.e. never ever share your API token with anyone else):

https://ui.adsabs.harvard.edu/user/settings/token

Then, access the API from the command-line by using e.g. curl

curl -H \
'Authorization: Bearer your-ADS-token' \
'https://api.adsabs.harvard.edu/v1/search/query?q=author:schwarzschild&fl=bibcode,author,title'

(replace your-ADS-token with the token retrieved in the first step)

A human-readable explanation of the URL: query the ADS database for the author name 'Schwarzschild' (at any position in the author list), and for every item found, return the bibcode, the list of authors, and the title. The retrieved JSON data can be further processed with e.g. 'jq'.
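As a hedged sketch of the jq step, the functions below run the same query and flatten the result into tab-separated lines. The function names and the ADS_TOKEN environment variable are assumptions made for this example. The jq filter assumes the response keeps its documents under .response.docs, with author and title returned as arrays.

```sh
#!/bin/sh
# Query the ADS search API for an author name.
# ADS_TOKEN is a hypothetical environment variable holding your API token.
ads_search_author() {
    curl -sG \
        -H "Authorization: Bearer $ADS_TOKEN" \
        --data-urlencode "q=author:$1" \
        --data-urlencode 'fl=bibcode,author,title' \
        'https://api.adsabs.harvard.edu/v1/search/query'
}

# Reduce the JSON response on stdin to lines of
# bibcode <TAB> first author <TAB> title
ads_docs_to_tsv() {
    jq -r '.response.docs[] | [.bibcode, (.author[0] // ""), (.title[0] // "")] | @tsv'
}
```

With the token exported, ads_search_author schwarzschild | ads_docs_to_tsv gives one line per item, which is convenient for further piping into grep, sort or awk.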

To manipulate the URL for the query, see here

https://github.com/adsabs/adsabs-dev-api

To build complex queries in an automated manner (e.g. process 5000 bibcodes, or find all the references of a paper with 10 or more citations, or build and visualize an author's collaboration network), consider the unofficial Python ADS library:

https://ads.readthedocs.io/en/latest/

Acquiring ADS bibcodes

Finally, we address the problem of acquiring an ADS bibcode. Two use cases come immediately to mind. The first is the problem that the bibcode "solves": the author-year citation style. In this scenario retrieving bibcodes should be as quick, effortless, and painless as possible, i.e. "out of your way". The second is merely a workflow optimization: you build the bibcode by hand from an article you already have access to. Both scenarios are explained below.

From author year

Since the query engine of the modern ADS portal is behind a javascript engine, the simplest solution is to use an alternative portal as the front-end of the query, e.g. adsabs.net (also mirrored on this site here).

So here is the formula: imagine you are interested in knowing the bibcode of a combination of "Surname et al. (year)", e.g. Viitanen et al. 2019. The URL to this query in ADS is: http://adsabs.net/cgi-bin/nph-abs_connect?db_key=AST&author=%5Eviitanen&start_year=2019&end_year=2019

From this list of entries it is rather easy to pick the correct bibcode. To automate this, we use the fact that a bibcode is strictly formatted, so we can simply use grep and a regex spell to extract the information. The following script, written for /bin/sh, provides an example:

#!/bin/sh
author="$1"
year="$2"
url="$(printf \
'http://adsabs.net/cgi-bin/nph-abs_connect?db_key=AST&author=%%5E%s&start_year=%s&end_year=%s' \
"$author" "$year" "$year")"
curl "$url" > html.tmp
grep -Eo '[0-9]{4}[[:alnum:]&.]{5}.{9}[A-Z]' html.tmp | uniq > bibcodes.tmp
grep -Po '.*?' html.tmp \
| perl -pe 's|(.*?)|\1|g' > titles.tmp
paste -d' ' bibcodes.tmp titles.tmp
rm html.tmp bibcodes.tmp titles.tmp

Save the above to e.g. "ads_bibcode_authoryear" and run as:

$ ./ads_bibcode_authoryear viitanen 2019 2> /dev/null
2019A&A...629A..14V The XMM-Newton wide field survey in the COSMOS field: Clustering dependence of X-ray selected AGN on host galaxy properties
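The regex used for extraction can also serve as a quick sanity check on a single candidate string. The sketch below wraps it in a hypothetical helper, is_bibcode, which matches the whole string: a 4-digit year, 5 journal characters, 9 further characters, and a capital surname initial. It checks the shape only, not whether the bibcode exists in ADS.

```sh
#!/bin/sh
# Succeed if the argument has the shape of a 19-character bibcode:
# 4-digit year, 5-character journal, 9 further characters, capital initial.
is_bibcode() {
    printf '%s\n' "$1" | grep -Eqx '[0-9]{4}[[:alnum:]&.]{5}.{9}[A-Z]'
}
```

For example, is_bibcode '2019A&A...629A..14V' && echo ok prints ok, while a string such as 'Viitanen2019' is rejected.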

Directly from the article itself i.e. roll your own

Suppose you have an article and you want to build the bibcode by hand for some reason. In some cases this is indeed faster than running the query (since you already know what you are looking for). The method is straightforward: simply look up the year, journal abbreviation, volume, qualifier, first page, and the first letter of the first author's last name in the article itself.

As previously, the bibcode is "2019A&A...629A..14V", where all the different fields are readily identified in the top-left corner of the title page, with the added 'V' at the end for the surname initial.
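Rolling your own can be sketched in a few lines of shell as well. The hypothetical function make_bibcode below takes the six fields in order and applies the padding rules described earlier: the journal is left-justified, volume and page are right-justified, and every gap is filled with dots. It assumes each field fits its width; longer values are not truncated or otherwise handled.

```sh
#!/bin/sh
# Assemble a bibcode from its fields.
# $1: year, $2: journal, $3: volume, $4: qualifier (may be empty),
# $5: first page, $6: surname initial
make_bibcode() {
    awk -v y="$1" -v j="$2" -v v="$3" -v q="$4" -v p="$5" -v a="$6" 'BEGIN {
        # pad with spaces to the fixed widths 4+5+4+1+4+1 = 19 ...
        s = sprintf("%4s%-5s%4s%1s%4s%1s", y, j, v, q, p, a)
        # ... then turn the padding into dots
        gsub(/ /, ".", s)
        print s
    }'
}
```

Running make_bibcode 2019 'A&A' 629 A 14 V reproduces the bibcode above, and make_bibcode 1916 AbhKP 1916 '' 189 S gives the Schwarzschild example from the beginning.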

Further reading