KeyError for 'content-location' and 'link' when trying to save non-HTML #65
Comments
I believe you that it's a bug, but they just worked for me. Here's my terminal log:
|
I'm trying it again right now to see if I can use this tonight with election results. Each time I do this ...
... savepagenow always raises a WaybackRuntimeError with this content (I prettified it):
|
Hmm. I'm struggling to understand what the error is even complaining about. Can you parse any reason out? |
I think the savepagenow library just defaults to raising a WaybackRuntimeError when it doesn't find either 'content-location' or 'link' in the response headers.
The response headers that I get when I try to archive these election files generally didn't have either of those fields. But they did often include
I honestly don't understand ... I would expect responses from an API like this one from IA to be consistent and always have the same fields.
Looking at the Wayback Machine this morning, it seems that some of my requests last night may have succeeded, because I do see a few entries for each of the files. But I don't see as many as I would have expected. |
Hmm. I can't say I know the answer here, @Kirkman, since the insides of the IA system are a mystery to me. I could imagine a new catch, like the one below, that would only raise the error if the status code is not 200.
    except Exception:
        # If neither of those things works, check the status code.
        # If it's 200, we assume the archiving request worked but didn't return a URL.
        if response.status_code == 200:
            return None
        # If it's not 200, we raise an error.
        raise WaybackRuntimeError(
            dict(status_code=response.status_code, headers=response.headers)
        )
I'm not sure how I feel about this solution, given that we don't understand the response. What do you think? |
Yeah, it's definitely not ideal. Since we don't know exactly what it means, it might make sense within that catch to raise a warning (and return None) if there's a 200, and raise an error if not? |
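For illustration, the warn-on-200 idea could look roughly like the sketch below. It is not savepagenow's actual code: `resolve_missing_archive_url` is a hypothetical helper, and the `WaybackRuntimeError` class here is only a stand-in for the library's exception so the snippet runs on its own.

```python
import warnings


class WaybackRuntimeError(Exception):
    """Stand-in for savepagenow's exception of the same name (assumption)."""


def resolve_missing_archive_url(status_code, headers):
    """Hypothetical helper: decide what to do when no archive URL was found.

    A 200 is treated as "probably archived, but no URL came back": emit a
    warning and return None. Any other status is raised as an error.
    """
    if status_code == 200:
        warnings.warn(
            "Wayback Machine returned 200 but no archive URL could be "
            "parsed from the response headers.",
            RuntimeWarning,
            stacklevel=2,
        )
        return None
    # Copy the headers into a plain dict so the error payload is easy to print.
    raise WaybackRuntimeError(
        dict(status_code=status_code, headers=dict(headers))
    )
```

A caller that wants the old strict behavior could still get it with `warnings.simplefilter("error", RuntimeWarning)`, which turns the warning into an exception.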
Not crazy. Would inserting a warning affect our CLI outputs at all? I'm not smart about how warnings work. |
Oof, I'm not sure. I'm a novice with warnings, and haven't done much with CLI stuff. I became familiar with seeing warnings from libraries like |
Gotcha. Would you mind joining me for a scan of the public API docs to see if we can spot any clues? https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA/edit?usp=sharing |
My source at archive.org says:
|
As archive.org asked, I added the response content to our errors. Here's what I'm seeing:
|
I'm not sure I know what that is, but I wonder if it's a redirect to a previously archived page that isn't registering as cached. |
I'm trying to see if I can integrate savepagenow into my election night scraping system. The idea would be to save online results files into the Wayback Machine when my system detects the results have changed.
Most of the URLs I want to save are CSVs, JSON, or XML files. However, I am often finding that when I try to use savepagenow to save them, I get error tracebacks like these:
It's very odd. Occasionally the requests work, but most times they error out with this same sequence. You may be able to reproduce with any/all three of these command-line examples:
Anyway, is this just me? Am I doing something wrong?