Script to poll the NYT Newswire API and print out recent article info

havard

Havard
Staff member
Admin
Hoppy Gorilla
Learned the New York Times has some APIs available. First things first: this doesn't give the full article text, which is unfortunate, but I'm sure they have an equivalent feed if you pay them actual money. In the stock configuration it polls the API every 75 seconds. Since they limit us to 4,000 requests per day, and articles aren't rolling in every minute, that's adequate (the limit can also be raised if, again, you pay them actual money dollars). For each article we haven't seen, the output looks something like:

Code:
For Johnson, a Political Rebuke as Omicron Variant Engulfs Britain
   BY MARK LANDLER AND STEPHEN CASTLE
   2021-12-17T14:33:02-05:00 updated 2021-12-17T14:35:42-05:00
   World - Europe

   The prime minister’s Conservative Party lost a seat it had held for
   more than a century, a loss that could hamper his efforts to address
   the Omicron variant now sweeping Britain.
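A quick sanity check on the request budget (just arithmetic; the only figure taken from NYT is the 4,000 requests/day cap mentioned above):

```python
# Polling every 75 seconds stays comfortably under the 4,000 req/day cap.
SECONDS_PER_DAY = 24 * 60 * 60              # 86400

polls_per_day_75 = SECONDS_PER_DAY // 75    # polls/day at a 75-second loop
polls_per_day_60 = SECONDS_PER_DAY // 60    # polls/day at a 60-second loop

print(polls_per_day_75)                     # 1152 requests/day
print(polls_per_day_60 - polls_per_day_75)  # 288 requests/day saved vs 60s polling
```

Those 288 spare requests per day are what the comment in the script below is referring to; they matter if you hit other NYT endpoints with the same key.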

It tracks seen state keyed on the combination of slug_name and first_published_date, looking for any difference in the updated_date.

It would probably be a good idea to set ATTRIBUTE=True if you are running this somewhere it will be seen by others. That prints out NYT's copyright notice at the tail end of each batch.

My idea is to make this emit NTX messages, toss it into the NTX network, and ultimately have them print out on the AN/UGC-129(V)1 when I actually get around to implementing it.

Annoyances...

They will post an article, update it, update it again, back out an update (which reposts the previous one), and then re-post the backed-out update. This means you will occasionally see the same article pop up five or six times. Either that, or there's something really weird happening on their backend. I may need to replace the current seen-state tracking with one that simply keys on the combined slug_name and updated_date, or do an actual time-based comparison rather than a dumb equality check.
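A sketch of what the time-based comparison could look like. This assumes updated_date parses as ISO 8601 with a UTC offset, which the sample output above suggests; `is_new_update` is a hypothetical replacement for the inline check, not something in the script below:

```python
from datetime import datetime

def is_new_update(seen, slug, first_published, updated):
    """Only treat an entry as new if its updated_date is strictly
    later than the newest one recorded, so a backed-out-then-reposted
    update doesn't trigger a reprint."""
    akey = "%s:%s" % (slug, first_published)
    old = seen.get(akey)
    if old is not None and datetime.fromisoformat(updated) <= datetime.fromisoformat(old):
        return False
    seen[akey] = updated
    return True
```

`datetime.fromisoformat` (Python 3.7+) handles the `2021-12-17T14:33:02-05:00` style timestamps the API appears to emit.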

Python:
#!/usr/bin/python3

# load the api key from a file named apikey
apikey = open("apikey").read().strip()

URL = "https://api.nytimes.com/svc/news/v3/content/all/all.json?api-key=%s" % apikey

import io
import json
import pickle
import time
import urllib.request as req
from textwrap import wrap

# seconds between polls
LOOPTIME=75

# when to print a tick, in seconds
TICKTIME=900

# If True then we load from TESTFILE
# otherwise we fetch from URL. Keep in mind
# NYT API has a limit of like 4000 reqs per day
# which is why loop time is 75 rather than
# something shorter. Those 288 requests we don't
# make per day could matter if you do other
# fun things with NYT APIs
TESTING=False

# test json file and seen state file
TESTFILE="feed.json"
SAVEFILE="seen.save"

# controls printing of the copyright at the
# tail end of each loop that has messages to
# print
ATTRIBUTE=False

pre = "  "
col = 69
dif = col - len(pre) - 2

def load_seen():
    # fall back to an empty dict on a missing or corrupt save file
    try:
        with open(SAVEFILE, "rb") as f:
            seen = pickle.load(f)
    except (OSError, EOFError, pickle.UnpicklingError):
        seen = {}

    return seen

def save_seen(seen):
    with open(SAVEFILE, "wb") as f:
        pickle.dump(seen, f)

def do_results(feed,seen):
    """
    Takes the result from the NYT News Wire API and our
    seen state dict, returns the new seen state dict

    if there are articles that haven't been seen or have
    been updated, we print out the data we have.
    """
    # we print to a buffer to make this easier
    buf = io.StringIO()

    # you can ring my bell
    buf.write("\x07")

    # we don't print out the thing unless out_count > 0
    big_count=0
    out_count=0

    for x in feed['results']:
        big_count+=1

        akey = "%s:%s" % (x['slug_name'], x['first_published_date'])

        # skip if we've seen this one before at this updated_date
        if seen.get(akey) == x['updated_date']:
            continue

        seen[akey] = x['updated_date']

        # we haven't seen this one; count it
        out_count += 1

        for y in wrap(x['title'], width=col):
            print(y,file=buf)

        if (x['subheadline'] != ''):
            subhead = x['subheadline']
            if (type(subhead) != list):
                subhead = wrap(subhead, width=dif)
            for y in subhead:
                print(pre, y, file=buf)

        for y in wrap(x['byline'],width=dif):
            print(pre,y,file=buf)

        if (x['updated_date'] != x['first_published_date']):
            print(pre,x['first_published_date'], 'updated', x['updated_date'],file=buf)
        else:
            print(pre,x['first_published_date'],file=buf)

        if (x['subsection'] != ''):
            print(pre,"%s - %s" % (x['section'], x['subsection']),file=buf)
        else:
            print(pre,x['section'],file=buf)

        if (x['source'] != 'New York Times'):
            print(pre,"via",x['source'],file=buf)

        if (x['abstract'] != ''):
            print("",file=buf)
            if (type(x['abstract']) != list):
                abt = wrap(x['abstract'],width=dif)
            else:
                abt = x['abstract']
            for y in abt:
                print(pre,y,file=buf)

        print("",file=buf)

    # Attribution matters
    if ATTRIBUTE:
        print(feed['copyright'],file=buf)
        print("",file=buf)

    # The pay off
    if (out_count>0):
        print(buf.getvalue(),end='') # we get newline from above...

    # return seen state
    return seen

def live_fetch():
    """
    Grabs the feed, returns decoded dict
    """
    with req.urlopen(URL) as response:
        jsr = response.read()

    return json.loads(jsr)

def test_fetch():
    """
    like fetch, but loads TESTFILE from disk
    instead of a fetch. saves us API hits when
    we just need to test a bunch of times.
    """

    return json.load(open(TESTFILE))

def main():
    if TESTING:
        fetch = test_fetch
    else:
        fetch = live_fetch

    seen = load_seen()

    last_tick = time.time()

    while True:
        meowtime=time.strftime("%Y%j%H%M%S",time.gmtime())
        # make sure the user knows the system is active...
        # print a tick every once in a while. Also provides
        # a marker on the scroll back. See? That's why I
        # didn't make this more clever, Sheila. It was
        # to provide a marker in the output.
        if (time.time() >= last_tick + TICKTIME):
            last_tick = time.time()
            print("TICK", meowtime)
            print("")
        feed = fetch()
        seen = do_results(feed, seen)
        save_seen(seen)
        time.sleep(LOOPTIME)

if __name__ == "__main__":
    main()
 

rsayers

Hoppy Gorilla
Staff member
Hoppy Gorilla
Little Jumpy Monkey
Exclusive Gold Banner #221,348,771 Blue (1 of 1)
Neat. I see you still use the old %-style formatting. f-strings are quite a bit faster, and look better imho.

Python:
import timeit
print(
    [
        timeit.timeit('["{}".format(n) for n in range(100)]', number=100000),
        timeit.timeit('[f"{n}" for n in range(100)]', number=100000),
        timeit.timeit('["%s" % n  for n in range(100)]', number=100000)
    ]
)

Produces [1.729534636, 0.8919102590000001, 1.1992342630000001]
 

havard
Probably so, but the amount of time spent on I/O, whether to disk, the screen, or the network, will likely far exceed whatever gains you get from the optimization. Especially if it's to the terminal.
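A rough way to check that claim (timings vary by machine, and a StringIO is far faster than a real terminal, so the gap on screen output would be larger still):

```python
import io
import timeit

# Compare formatting alone against formatting plus a write to an
# in-memory buffer standing in for "cheap" output.
buf = io.StringIO()

fmt_only = timeit.timeit('f"{n:06d}"', setup='n = 1234', number=100_000)
fmt_and_write = timeit.timeit(
    'buf.write(f"{n:06d}\\n")',
    setup='n = 1234',
    globals={'buf': buf},
    number=100_000,
)

print(fmt_only, fmt_and_write)
```

Swap the StringIO for `sys.stdout` on a real terminal and the formatting cost all but disappears into the write time.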
 

rsayers
For sure. The best thing about f-strings is just how tidy they are. At $dayjob, if we're ever mucking with code and come across .format() uses that don't need it (sometimes that's still the right tool), we're expected to update them.
 

havard
rsayers said: For sure. The best thing about f-strings is just how tidy they are. At $dayjob, if we're ever mucking with code and come across .format() uses that don't need it (sometimes that's still the right tool), we're expected to update them.
okay, after playing around with it I think the f strings have won me over.
 

havard
Code:
>>> f"tacos {type(mlep)} ... {mlep(128)-1:08X}"
"tacos <class 'function'> ... 00003FFF"

That really solves some of the mess and makes it a little more compact.
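The original `mlep` isn't shown above, so here is a runnable variant with a stand-in that happens to reproduce the same output (purely a guess: `mlep(128)` just needs to return 0x4000, and `n * n` does):

```python
# Hypothetical stand-in; the real mlep from the session above isn't shown.
mlep = lambda n: n * n

out = f"tacos {type(mlep)} ... {mlep(128) - 1:08X}"
print(out)  # tacos <class 'function'> ... 00003FFF
```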
 