University News

Class Welcomes Former White House Press Secretary

American University News - Tue, 11/03/2015 - 00:00
A former White House Press Secretary visits an SOC class to discuss what it's like working in the White House, presidential debates and more.
Categories: University News

International Photojournalist Donates Collection to AU Library

American University News - Tue, 11/03/2015 - 00:00
The work of SOC professor and photojournalist Bill Gentile will now be housed in Archives and Special Collections in the University Library.
Categories: University News

Digital Media Arts Conference Showcases Communication Faculty

American University News - Tue, 11/03/2015 - 00:00
SOC faculty and alumni presented at the 2015 International Digital Media Arts Association conference.
Categories: University News

Journalism Professor Lewis Heads to Oxford University for Fellowship

American University News - Mon, 11/02/2015 - 00:00
SOC professor Charles Lewis is a Visiting Fellow at Oxford University’s Reuters Institute for the Study of Journalism.
Categories: University News

American University Photographers Focus on FotoWeekDC

American University News - Mon, 11/02/2015 - 00:00
AU professors and students on display at FotoWeek DC 2015.
Categories: University News

Book Notes: Higher Education Revolutions in the Gulf

American University News - Mon, 11/02/2015 - 00:00
Economics Professor John Willoughby coauthors new book.
Categories: University News

AU 2030: William DeLone

American University News - Mon, 11/02/2015 - 00:00
Information technology professor helps lead the new Kogod Cybersecurity Governance Center.
Categories: University News

H. Kent Baker Day

American University News - Mon, 11/02/2015 - 00:00
Kogod hosted its first annual H. Kent Baker Day, celebrating the professor’s accomplishments. Always humble, Professor Baker lectured about finance, showcased his published works, and thanked everyone for taking the time to celebrate his work.
Categories: University News

Claudia Rankine to Visit AU

American University News - Mon, 11/02/2015 - 00:00
Award-winning author speaks on Thursday, November 12
Categories: University News

AU Alumni Make Strides in Social Psychology

American University News - Mon, 11/02/2015 - 00:00
Graduates are conducting innovative research at prestigious institutions.
Categories: University News

David Gregory Discusses New Book to a Full House at SIS

American University News - Mon, 11/02/2015 - 00:00
Alumni, students, and parents joined Dean James Goldgeier for a discussion with David Gregory, SIS/BA ’92, about Gregory’s new book "How’s Your Faith?" at the School of International Service (SIS).
Categories: University News

SPA Awards Enable Students to Undertake Internship Opportunities

American University News - Mon, 11/02/2015 - 00:00
Mariam Khorenyan, a sophomore pursuing a five-year master’s degree in public administration, and Laurel Cratsley, a junior in SPA’s CLEG program, received the awards from the SPA Dean’s office.
Categories: University News

Citations & Tweets: Tech-Savvy Research Impact Measurements

American University News - Fri, 10/30/2015 - 00:00
Science Librarian Rachel Borchardt's new book focuses on the importance of altmetrics, a growing approach to analyzing the influence of research based on measuring emerging digital modes of scholarship.
Categories: University News

Community-Based Learning Gives Students First-Hand Experience

American University News - Fri, 10/30/2015 - 00:00
CBL courses connect students to valuable, varied learning opportunities.
Categories: University News

International Student Coffee Hour in the Global Resources Center

The George Washington University - Thu, 10/29/2015 - 19:59
October 29, 2015

Tuesday, Nov. 17
Global Resources Center, 7th floor

Please join us in the Global Resources Center (GRC) for an international student coffee hour co-hosted with the International Services Office (ISO). Take a tour of the GRC, chat with a specialist about your research and global interests, and enjoy a snack with your ISO friends! This event is part of GW's International Education Week.  

Please RSVP: 

The GRC focuses upon the political, socio-economic, historical, and cultural aspects of countries and regions around the globe from the 20th century onward with the following specialized resource centers: Russia, Eurasia, Central & Eastern Europe, China Documentation Center, Taiwan Resource Center, Japan Resource Center, Korea Resources, Middle East & North Africa.

DHS Secretary Johnson Speaks with Students at the School of Public Affairs

American University News - Thu, 10/29/2015 - 00:00
In an 80-minute conversation with students, Jeh Johnson spoke about his political awakening, as well as the many threats his agency works to combat.
Categories: University News

SPA Professor Awarded APSA's Best Book on Race, Ethnicity and Politics

American University News - Thu, 10/29/2015 - 00:00
The American Political Science Association recognized David Lublin's book, "Minority Rules: Electoral Systems, Decentralization, and Ethnoregional Party Success".
Categories: University News

Social Media Harvesting Techniques

The George Washington University - Wed, 10/28/2015 - 07:38
October 28, 2015
Justin Littman

Social Feed Manager (SFM) is a tool developed by the Scholarly Technology Group for harvesting social media to support research and build archives. As part of enhancements to SFM being performed under a grant from the National Historical Publications and Records Commission (NHPRC), we are adding support for writing social media to Web ARChive (WARC) files. This blog entry describes two techniques for retrieving social media records from the application programming interfaces (APIs) of social media platforms and writing to WARCs. These techniques are based on Python, though these or similar approaches are applicable to other programming languages.

Background on social media APIs

Many social media platforms provide APIs to allow retrieval of social media records. Examples of such APIs include the Twitter REST API, the Flickr API, and the Tumblr API. These APIs use HTTP as the communications protocol and provide the records in machine-readable formats such as JSON. Compared to harvesting HTML from the social media platform’s website, harvesting social media from APIs offers some advantages:

  • The APIs are more stable. The creators of the APIs understand that changing an API breaks its consumers. (Want notification when an API changes? Give API Changelog a try.)
  • The APIs provide social media records in formats that are intended for machine processing.
  • The APIs sometimes provide access to data that is not available from the platform’s website. For example, the following shows the record for a tweet retrieved from Twitter’s REST API:
{ "created_at": "Tue Jun 02 13:22:55 +0000 2015", "id": 605726286741434400, "id_str": "605726286741434368", "text": "At LC for @archemail today: Thinking about overlap between email archiving, web archiving, and social media archiving.", "source": "Twitter Web Client", "truncated": false, "in_reply_to_status_id": null, "in_reply_to_status_id_str": null, "in_reply_to_user_id": null, "in_reply_to_user_id_str": null, "in_reply_to_screen_name": null, "user": { "id": 481186914, "id_str": "481186914", "name": "Justin Littman", "screen_name": "justin_littman", "location": "", "description": "", "url": null, "entities": { "description": { "urls": [] } }, "protected": false, "followers_count": 45, "friends_count": 47, "listed_count": 5, "created_at": "Thu Feb 02 12:19:18 +0000 2012", "favourites_count": 34, "utc_offset": -14400, "time_zone": "Eastern Time (US & Canada)", "geo_enabled": true, "verified": false, "statuses_count": 72, "lang": "en", "contributors_enabled": false, "is_translator": false, "is_translation_enabled": false, "profile_background_color": "C0DEED", "profile_background_image_url": "", "profile_background_image_url_https": "", "profile_background_tile": false, "profile_image_url": "", "profile_image_url_https": "", "profile_link_color": "0084B4", "profile_sidebar_border_color": "C0DEED", "profile_sidebar_fill_color": "DDEEF6", "profile_text_color": "333333", "profile_use_background_image": true, "has_extended_profile": false, "default_profile": true, "default_profile_image": false, "following": false, "follow_request_sent": false, "notifications": false }, "geo": null, "coordinates": null, "place": { "id": "01fbe706f872cb32", "url": "", "place_type": "city", "name": "Washington", "full_name": "Washington, DC", "country_code": "US", "country": "United States", "contained_within": [], "bounding_box": { "type": "Polygon", "coordinates": [ [ [ -77.119401, 38.801826 ], [ -76.909396, 38.801826 ], [ -76.909396, 38.9953797 ], [ -77.119401, 38.9953797 ] ] ] }, 
"attributes": {} }, "contributors": null, "is_quote_status": false, "retweet_count": 0, "favorite_count": 0, "entities": { "hashtags": [], "symbols": [], "user_mentions": [], "urls": [] }, "favorited": false, "retweeted": false, "lang": "en" }

and how the same tweet appears on Twitter’s website:


It is worth emphasizing that retrieving social media records from an API involves ordinary HTTP transactions, just like the HTTP transactions between a web browser and a website or a web crawler and a website.
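As a small illustration of why machine-readable records are convenient, the following sketch parses an excerpt of the tweet record shown above with Python's standard json module. The excerpt keeps only a handful of the original fields, and the variable names are our own. It uses Python 3 syntax.

```python
import json

# Excerpt of the tweet record shown above; most fields are omitted for brevity.
raw_record = """
{
  "created_at": "Tue Jun 02 13:22:55 +0000 2015",
  "id": 605726286741434400,
  "id_str": "605726286741434368",
  "user": {"screen_name": "justin_littman", "followers_count": 45},
  "place": {"full_name": "Washington, DC", "country_code": "US"}
}
"""

tweet = json.loads(raw_record)

# Structured fields can be read directly, with no HTML scraping involved.
author = tweet["user"]["screen_name"]
place = tweet["place"]["full_name"]

# Compare the numeric id with the string form of the id in this record.
ids_agree = tweet["id"] == int(tweet["id_str"])

print(author, place, ids_agree)
```

In this particular record the numeric id and id_str do not match, which suggests the number was rounded somewhere before it reached this page. A harvester is probably safer treating id_str as the identifier.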


(The one exception worth noting is Twitter’s Streaming APIs. While these APIs do use HTTP, the connection is kept open and additional data is added to the HTTP response over a long period of time. Thus, these APIs are unusual in that the HTTP response may last for minutes, hours, or days rather than the normal milliseconds or seconds, and may be significantly larger than the typical HTTP response from a social media API. This requires special handling and is outside the scope of this discussion, though it ultimately requires consideration.)


To simplify interacting with social media APIs, developers have created API libraries. An API library is for a specific programming language and social media platform and makes it easier to interact with the API by handling authentication, rate limiting, HTTP communication, and other low-level details. In turn, API libraries use other libraries such as an HTTP client for HTTP communication or an OAuth library for authentication. Examples of Python API libraries include Twarc or Tweepy for Twitter, Python Flickr API Kit for Flickr, and PyTumblr for Tumblr. Rather than having to re-implement all of these low-level details, ideally a social media harvester will use existing API libraries.

Background on WARCs

WARCs allow for recording an entire HTTP transaction between an HTTP client and an HTTP server. A typical transaction consists of the client issuing a request message and the server replying with a response message. These are recorded in the WARC as a request record and response record pair. In a WARC, each record is composed of a record header containing some named metadata fields and a record body containing the HTTP message. In turn, each HTTP message is composed of a message header and a message body. Here is an example request record for GWU’s homepage:

  WARC/1.0
  WARC-Type: request
  Content-Type: application/http;msgtype=request
  WARC-Date: 2015-10-14T18:01:10Z
  WARC-Record-ID:
  WARC-Target-URI:
  WARC-IP-Address:
  WARC-Block-Digest: sha1:A7SJCNM5DLPJCLQMGJOXD7XDWWFQRDGH
  WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
  Content-Length: 69
  WARC-Warcinfo-ID:

  GET / HTTP/1.1
  User-Agent: Wpull/1.2.1 (gzip)
  Host:

and a response record:

  WARC/1.0
  WARC-Type: response
  Content-Type: application/http;msgtype=response
  WARC-Date: 2015-10-14T18:01:10Z
  WARC-Record-ID:
  WARC-Target-URI:
  WARC-IP-Address:
  WARC-Concurrent-To:
  WARC-Block-Digest: sha1:FAGHJPTSB4TIHWBMNPAIXM6IRS7EMOHS
  WARC-Payload-Digest: sha1:D2OLR4C4UASIRNSGJCNQMK5XBQ6RAWGV
  Content-Length: 79609
  WARC-Warcinfo-ID:

  HTTP/1.1 200 OK
  Server: Apache/2.2.15 (Oracle)
  X-Powered-By: PHP/5.3.3
  Expires: Sun, 19 Nov 1978 05:00:00 GMT
  Last-Modified: Wed, 14 Oct 2015 03:33:00 GMT
  Cache-Control: no-cache, must-revalidate, post-check=0, pre-check=0
  ETag: "1444793580"
  Content-Language: en
  X-Generator: Drupal 7 (
  Link: ; rel="image_src",; rel="canonical",; rel="shortlink"
  Content-Type: text/html; charset=utf-8
  Transfer-Encoding: chunked
  Date: Wed, 14 Oct 2015 18:01:11 GMT
  X-Varnish: 982060864 981086065
  Age: 52090
  Via: 1.1 varnish
  Connection: keep-alive
  X-Cache: Hit from web1
  Set-Cookie: NSC_dnt_qspe_tey_80=ffffffff83ac15c345525d5f4f58455e445a4a423660;expires=Wed, 14-Oct-2015 18:31:11 GMT;path=/;httponly

  b3a
  <!DOCTYPE html>
  <html xmlns="" xml:lang="en" version="XHTML+RDFa 1.0" dir="ltr" xmlns:og="" xmlns:fb="" xmlns:content="" xmlns:dc="" xmlns:foaf="" xmlns:rdfs="" xmlns:sioc="" xmlns:sioct="" xmlns:skos="" xmlns:xsd="">
  <head profile="">
  <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  [A whole bunch of HTML skipped here]
  </body>
  </html>

(This was recorded using Wpull: wpull --warc-file warc_example --no-warc-compression)
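To make the record layout concrete, here is a minimal sketch that wraps a raw HTTP message in a WARC/1.0 record using only the Python 3 standard library. The function name build_warc_record, the host example.org, and the User-Agent value are hypothetical. The set of header fields is a reduced selection modeled on the example records above, not a complete or validated WARC writer.

```python
import base64
import hashlib
import uuid
from datetime import datetime, timezone


def build_warc_record(record_type, target_uri, http_message):
    """Wrap raw HTTP message bytes in a WARC/1.0 record.

    Layout: version line, named header fields, blank line, the block
    (here the HTTP message), then two CRLFs that end the record.
    """
    # Digests in the example records are SHA-1 values encoded in base32.
    digest = base64.b32encode(hashlib.sha1(http_message).digest()).decode("ascii")
    fields = [
        ("WARC-Type", record_type),
        ("Content-Type", "application/http;msgtype=" + record_type),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("WARC-Target-URI", target_uri),
        ("WARC-Block-Digest", "sha1:" + digest),
        # Content-Length counts the bytes of the block only.
        ("Content-Length", str(len(http_message))),
    ]
    header = "WARC/1.0\r\n" + "".join("%s: %s\r\n" % field for field in fields)
    return header.encode("utf-8") + b"\r\n" + http_message + b"\r\n\r\n"


# Hypothetical request message; example.org stands in for a real host.
http_request = (
    b"GET / HTTP/1.1\r\n"
    b"User-Agent: example-harvester\r\n"
    b"Host: example.org\r\n"
    b"\r\n"
)
record = build_warc_record("request", "http://example.org/", http_request)
print(record.decode("utf-8"))
```

A response record would be built the same way with record_type set to "response" and the server's full HTTP response as the block. In practice a WARC library handles these details.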

Putting together this discussion of social media APIs and WARCs, we'll describe techniques for harvesting social media records using existing API libraries and recording the HTTP transactions in WARCs.

The first technique

The first technique is to attempt to record the HTTP transaction from the HTTP client used by the API library. While there are a number of higher-level clients in Python (e.g., requests), the underlying HTTP protocol client is generally httplib. Unfortunately, httplib does not provide ready access to the entire HTTP message, just the message body. However, when the debug level of httplib is set to 1, httplib writes the message header to standard output (stdout). For example:

>>> import httplib
>>> conn = httplib.HTTPConnection("")
>>> conn.set_debuglevel(1)
>>> conn.request("GET", "/")
send: 'GET / HTTP/1.1\r\nHost:\r\nAccept-Encoding: identity\r\n\r\n'
>>> resp = conn.getresponse()
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: Apache/2.2.15 (Oracle)
header: X-Powered-By: PHP/5.3.3
header: Expires: Sun, 19 Nov 1978 05:00:00 GMT
header: Last-Modified: Wed, 14 Oct 2015 03:33:00 GMT
header: Cache-Control: no-cache, must-revalidate, post-check=0, pre-check=0
header: ETag: "1444793580"
header: Content-Language: en
header: X-Generator: Drupal 7 (
header: Link: ; rel="image_src",; rel="canonical",; rel="shortlink"
header: Content-Type: text/html; charset=utf-8
header: Transfer-Encoding: chunked
header: Date: Wed, 14 Oct 2015 18:16:54 GMT
header: X-Varnish: 982091814 981086065
header: Age: 53034
header: Via: 1.1 varnish
header: Connection: keep-alive
header: X-Cache: Hit from web1
header: Set-Cookie: NSC_dnt_qspe_tey_80=ffffffff83ac15c345525d5f4f58455e445a4a423660;expires=Wed, 14-Oct-2015 18:46:54 GMT;path=/;httponly

By capturing this debugging output, the HTTP message can be reconstructed and recorded in the appropriate WARC records. We use Internet Archive’s WARC library for writing to WARCs. Here’s a gist showing some code that uses the Python Flickr API Kit to retrieve the record for a photo from Flickr’s API and record it in a WARC: (The resulting WARC is also provided in the gist.)
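The capture step on its own might look roughly like the sketch below. This is not the code from the gist. It assumes Python 3, where httplib is named http.client and still prints its debugging output to stdout. To keep it runnable without network access, it talks to a throwaway local server. The handler class and the helper fetch_with_debug_capture are hypothetical names.

```python
import contextlib
import http.client
import io
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class StubApiHandler(BaseHTTPRequestHandler):
    """Throwaway local server standing in for a social media API."""

    def do_GET(self):
        body = b'{"ok": true}'
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the server's own access log quiet


def fetch_with_debug_capture(host, port, path="/"):
    """Issue a GET and return (captured debug output, message body)."""
    captured = io.StringIO()
    conn = http.client.HTTPConnection(host, port)
    conn.set_debuglevel(1)
    # The debug lines (send:, reply:, header:) go to stdout, so redirect
    # stdout into a buffer for the duration of the transaction.
    with contextlib.redirect_stdout(captured):
        conn.request("GET", path)
        response = conn.getresponse()
        body = response.read()
    conn.close()
    return captured.getvalue(), body


server = HTTPServer(("127.0.0.1", 0), StubApiHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

debug_output, body = fetch_with_debug_capture("127.0.0.1", server.server_port)

server.shutdown()
server.server_close()
print(debug_output)
```

The captured text still has to be parsed back into request and response headers before it can be written to WARC records, which is where the fragility noted below comes from.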

Advantages of this technique:

  • Complete control over writing the WARC, including WARC record headers and deduplication strategy.

Disadvantages of this technique:

  • Reconstructs the HTTP message instead of recording directly as passed over the network.
  • Fragile, since it depends on the debugging output of httplib. There is no guarantee that this debugging output will remain unchanged in the future.
  • Often requires hacking the API library to get access to the HTTP client.

The second technique

The second approach was suggested by Ed Summers. In this approach, an HTTP proxy records the HTTP transaction. In a proxying setup, the HTTP client makes its request to the proxy. The proxy in turn relays the request to the HTTP server. It receives the response from the server and relays it back to the client. By acting as a “man in the middle”, the proxy has access to the entire HTTP transaction.
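The proxying idea can be sketched with the Python 3 standard library alone. The toy proxy below is not warcprox and writes no WARCs. It only shows that a proxy sitting between client and server sees both sides of the transaction. All class and variable names are hypothetical. The "origin" is a local stand-in for a social media API, and only plain HTTP GET requests are handled.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlsplit

recorded = []  # (request line, status, body) for each proxied transaction


class OriginHandler(BaseHTTPRequestHandler):
    """Local stand-in for a social media API server."""

    def do_GET(self):
        body = b'{"id_str": "1"}'
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass


class RecordingProxyHandler(BaseHTTPRequestHandler):
    """Relays GET requests to the origin and records what passes through."""

    def do_GET(self):
        # A proxied request line carries the absolute URL of the target.
        target = urlsplit(self.path)
        upstream = http.client.HTTPConnection(target.hostname, target.port or 80)
        upstream.request("GET", target.path or "/")
        response = upstream.getresponse()
        body = response.read()
        upstream.close()
        recorded.append((self.requestline, response.status, body))
        # Relay the response back to the client.
        self.send_response(response.status)
        self.send_header("Content-Type", response.getheader("Content-Type", "text/plain"))
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass


def start(handler):
    server = HTTPServer(("127.0.0.1", 0), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


origin = start(OriginHandler)
proxy = start(RecordingProxyHandler)

# The client connects to the proxy and asks for the origin's absolute URL.
client = http.client.HTTPConnection("127.0.0.1", proxy.server_port)
client.request("GET", "http://127.0.0.1:%d/photo/1" % origin.server_port)
client_body = client.getresponse().read()
client.close()

origin.shutdown()
proxy.shutdown()
print(recorded)
```

A real recording proxy also has to deal with HTTPS, other request methods, and streaming bodies, none of which this sketch attempts.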

Internet Archive’s warcprox is an HTTP proxy that writes the recorded HTTP transactions to WARCs. Among other applications, warcprox is used in Ilya Kreymer’s, which records the HTTP transactions from a user browsing the web. In our case, warcprox will record the HTTP transactions between the API library and the social media platform’s server.

This gist demonstrates using the Python Flickr API Kit to retrieve the record for a photo from Flickr’s API and recording it using warcprox:



Disadvantages of this technique:

  • Depends on the API library supporting proxy configuration, or requires hacking the API library to get access to the HTTP client to configure proxying.
  • Does not provide control over the WARC, especially the ability to write WARC record headers.
  • Requires running the proxy as a separate process from the harvester.

STG is continuing to experiment with and refine these two approaches. Thoughts on these approaches or suggestions for other techniques would be appreciated and we welcome any discussion of social media harvesting in general.

SOC Alums Share Success, Film Screenings

American University News - Wed, 10/28/2015 - 00:00
American University alumni Ted Roach and Najwa Najjar screen their most recent films on campus.
Categories: University News