Learned the New York Times has some APIs available. First things first: this doesn't give the full article text, which is unfortunate, but I'm sure they have an equivalent feed if you pay them. In the stock configuration it polls the API every 75 seconds. Since they limit us to 4,000 requests per day, and articles aren't rolling in every minute, that's adequate. That limit can be increased if you pay them actual money dollars. For each article we haven't seen, the output looks something like:
It tracks seen state keyed off the combined slug_name and first_published_date, looking for any change in the updated_date.
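Boiled down, that dedup check amounts to something like the sketch below (the field names come straight from the Times Newswire result objects; `item` stands in for one result dict):

```python
seen = {}  # akey -> last updated_date we emitted

def is_new_or_updated(item, seen):
    # key an article on slug plus first publication time
    akey = "%s:%s" % (item['slug_name'], item['first_published_date'])
    if seen.get(akey) == item['updated_date']:
        return False  # same update we already printed; skip it
    seen[akey] = item['updated_date']
    return True

item = {'slug_name': 'example-slug',
        'first_published_date': '2021-12-17T14:33:02-05:00',
        'updated_date': '2021-12-17T14:35:42-05:00'}
print(is_new_or_updated(item, seen))  # True, first sighting
print(is_new_or_updated(item, seen))  # False, same update
```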
It would probably be a good idea to set ATTRIBUTE=True if you are running this somewhere that will be seen by others. This will print out NYT's copyright notice at the tail end of each batch.
My idea is to make this emit NTX messages, toss it into the NTX network, and ultimately have them print out on the AN/UGC-129(V)1 when I actually get around to implementing it.
Annoyances...
They will post an article, update it, update it again, back out an update (which re-posts the previous version), then re-post the backed-out update. This means you will occasionally see the same article pop up five or six times. Either that, or there's something really weird happening on their backend. I may need to replace the current seen-state tracking with one that simply keys off the combined slug_name and updated_date rather than what I'm doing now. Either that, or I need to do an actual time-based comparison rather than a dumb equality check.
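The time-based comparison would look something like this sketch: only treat an article as new when its updated_date parses to a strictly later instant than the one last seen, so a backed-out update (an older timestamp) doesn't re-fire. This assumes Python 3.7+, where datetime.fromisoformat handles the offset format these timestamps use.

```python
from datetime import datetime

seen = {}  # akey -> last updated_date string we emitted

def is_newer(item, seen):
    akey = "%s:%s" % (item['slug_name'], item['first_published_date'])
    prev = seen.get(akey)
    if prev is not None:
        # compare actual instants instead of raw string equality,
        # so re-posts of an older update are ignored
        if datetime.fromisoformat(item['updated_date']) <= datetime.fromisoformat(prev):
            return False
    seen[akey] = item['updated_date']
    return True
```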
Sample output:
For Johnson, a Political Rebuke as Omicron Variant Engulfs Britain
BY MARK LANDLER AND STEPHEN CASTLE
2021-12-17T14:33:02-05:00 updated 2021-12-17T14:35:42-05:00
World - Europe
The prime minister’s Conservative Party lost a seat it had held for
more than a century, a loss that could hamper his efforts to address
the Omicron variant now sweeping Britain.
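For the curious, the 75-second figure isn't arbitrary. The back-of-the-envelope math against the daily limit goes like this:

```python
LIMIT = 4000  # NYT's stated daily request cap

def per_day(secs):
    # polls per day at a given polling interval
    return 86400 // secs

print(per_day(75))                # 1152 requests/day at the stock interval
print(per_day(60))                # 1440 at one-minute polling
print(per_day(60) - per_day(75))  # 288: the requests saved vs. 60s polling
```

Either interval fits under the 4,000/day cap; the 288 saved requests are headroom for doing other things with the NYT APIs on the same key.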
Python:
#!/usr/bin/python3
# load the api key from a file named apikey
apikey = open("apikey").read().strip()
URL = "https://api.nytimes.com/svc/news/v3/content/all/all.json?api-key=%s" % apikey

from textwrap import wrap
import json
import urllib.request as req
import time
import pickle
import io

# seconds between polls
LOOPTIME = 75
# when to print a tick, in seconds
TICKTIME = 900
# If True then we load from TESTFILE,
# otherwise we fetch from URL. Keep in mind
# the NYT API has a limit of like 4000 reqs per day,
# which is why the loop time is 75 rather than
# something shorter. Those 288 requests we don't
# make per day could matter if you do other
# fun things with NYT APIs.
TESTING = False
# test json file and seen state file
TESTFILE = "feed.json"
SAVEFILE = "seen.save"
# controls printing of the copyright at the
# tail end of each loop that has messages to
# print
ATTRIBUTE = False

pre = " "
col = 69
dif = col - len(pre) - 2

def load_seen():
    try:
        seen = pickle.load(open(SAVEFILE, "rb"))
    except Exception:
        seen = {}
    return seen

def save_seen(seen):
    pickle.dump(seen, open(SAVEFILE, "wb"))

def do_results(feed, seen):
    """
    Takes the result from the NYT News Wire API and our
    seen state dict, returns the new seen state dict.
    If there are articles that haven't been seen or have
    been updated, we print out the data we have.
    """
    # we print to a buffer to make this easier
    buf = io.StringIO()
    # you can ring my bell
    buf.write("\x07")
    # we don't print the buffer unless out_count > 0
    out_count = 0
    for x in feed['results']:
        akey = "%s:%s" % (x['slug_name'], x['first_published_date'])
        # skip if we've seen this one before
        if seen.get(akey) == x['updated_date']:
            continue
        seen[akey] = x['updated_date']
        # we haven't seen it, so format it up
        out_count += 1
        for y in wrap(x['title'], width=col):
            print(y, file=buf)
        if x['subheadline'] != '':
            subhead = x['subheadline']
            if type(subhead) != list:
                subhead = wrap(subhead, width=dif)
            for y in subhead:
                print(pre, y, file=buf)
        for y in wrap(x['byline'], width=dif):
            print(pre, y, file=buf)
        if x['updated_date'] != x['first_published_date']:
            print(pre, x['first_published_date'], 'updated', x['updated_date'], file=buf)
        else:
            print(pre, x['first_published_date'], file=buf)
        if x['subsection'] != '':
            print(pre, "%s - %s" % (x['section'], x['subsection']), file=buf)
        else:
            print(pre, x['section'], file=buf)
        if x['source'] != 'New York Times':
            print(pre, "via", x['source'], file=buf)
        if x['abstract'] != '':
            print("", file=buf)
            if type(x['abstract']) != list:
                abt = wrap(x['abstract'], width=dif)
            else:
                abt = x['abstract']
            for y in abt:
                print(pre, y, file=buf)
        print("", file=buf)
    # Attribution matters
    if ATTRIBUTE:
        print(feed['copyright'], file=buf)
        print("", file=buf)
    # The pay off
    if out_count > 0:
        print(buf.getvalue(), end='')  # we get a newline from above...
    # return seen state
    return seen

def live_fetch():
    """
    Grabs the feed, returns the decoded dict.
    """
    with req.urlopen(URL) as response:
        jsr = response.read()
    return json.loads(jsr)

def test_fetch():
    """
    Like live_fetch, but loads TESTFILE from disk
    instead of hitting the API. Saves us API hits when
    we just need to test a bunch of times.
    """
    return json.load(open(TESTFILE))

def main():
    if TESTING:
        fetch = test_fetch
    else:
        fetch = live_fetch
    seen = load_seen()
    last_tick = time.time()
    while True:
        meowtime = time.strftime("%Y%j%H%M%S", time.gmtime())
        # make sure the user knows the system is active...
        # print a tick every once in a while. Also provides
        # a marker on the scroll back. See? That's why I
        # didn't make this more clever, Sheila. It was
        # to provide a marker in the output.
        if time.time() >= last_tick + TICKTIME:
            last_tick = time.time()
            print("TICK", meowtime)
            print("")
        feed = fetch()
        seen = do_results(feed, seen)
        save_seen(seen)
        time.sleep(LOOPTIME)

if __name__ == "__main__":
    main()