WRLC Libraries

Help Transcribe Churchill’s WWII Calendar

The George Washington University - Tue, 02/16/2016 - 14:11
February 16, 2016

Help the GW Libraries discover and decode history by transcribing appointments from Sir Winston Churchill’s World War II engagement diary. This crowdsourcing project will make Churchill’s wartime activities widely available for the first time to students and scholars around the world. Participants will gain new insight into the day-to-day process of national leadership, learn about Churchill and WWII, and provide a valuable service to historians around the world. Follow #ChurchillsDay on Twitter, where we'll share some interesting results of this project as they become available.

This collection of handwritten cards details Winston Churchill’s appointments during World War II, including such historic events as Victory in Europe (VE) Day and the British prime minister’s regular meetings with the King of England and President Franklin Roosevelt. The collection of 30 cards will be featured in the new National Churchill Library and Center on Gelman's 1st floor. 

Learn more about the project and participate at crowdcrafting.org/project/churchill/.

About the National Churchill Library and Center

The National Churchill Library and Center, slated to open in 2016, is part of a philanthropic partnership with the George Washington University and the Chicago-based Churchill Centre. Housed on the first floor of Gelman Library, this will be the first major research facility in the nation’s capital dedicated to the study of Winston Churchill.

President's Day Workshops Canceled due to Weather

The George Washington University - Mon, 02/15/2016 - 05:59
February 15, 2016

Due to icy conditions, the Lit Review How To and Data Bootcamp workshops scheduled for President's Day have been CANCELED. We apologize for any inconvenience.

You can find much of the information from these sessions in the research guide, "What Graduate Students Need to Know."

Data Bootcamp sessions will be offered again on Thursday, March 10.  You can also find much of the information to be covered in research guides: "Data Management," "Maps, Cartographic Data, and GIS Information," and "Uploading your ETD." 

For immediate research help, use the Research Calendar to make an appointment with a librarian. If the available times don't work for you, send an email and we'll contact you.


Data Bootcamp: Collect, Manage, & Visualize Research Data

The George Washington University - Thu, 02/11/2016 - 13:29
February 11, 2016

Are you a graduate student who needs help collecting, managing, and visualizing research data? Data Bootcamps bring together several 30-minute workshops filled with practical solutions to save you hours of needless work. All sessions will be first-come, first-served, with the GIS session limited to 20 participants. Attend one session or all.

If you can't make it to all of the sessions or need more information, be sure to check out the research guides: "Data Management," "Maps, Cartographic Data, and GIS Information," and "Uploading your ETD."

Kids off school? Quiet and happily occupied offspring are welcome.

Monday, February 15 (President's Day):
1:00-1:30: What is Data?
1:30-2:00: Data Management
2:15-2:45: Geographic Information Systems (GIS) Data Basics

What is Data?
Research data is data that is collected, observed, or created, for purposes of analysis to produce original research results, but what does that really mean for your own work? Data librarian Mandy Gooch will define research data & data-related terms and discuss common data formats. You'll explore use agreements and restrictions, and identify library and campus services and resources related to data.

Data Management
Data management refers to activities that support the long-term preservation, access, and use of data. In this short workshop Data Librarian Mandy Gooch will discuss best practices for data management and the tools, people, and resources the GW Libraries provide to help you.

Geographic Information Systems (GIS) Data Basics
Learn how you can integrate geographic information systems (GIS) into your research and discover the resources available at the GW Libraries and beyond. This workshop will cover the basics of data discovery and display using ArcGIS software. Let it spark your cartographic imagination!

More E-books Available for Your Convenience

The George Washington University - Thu, 02/11/2016 - 12:57
February 11, 2016

If you’ve searched the GW Libraries catalog lately, you might have noticed that we’ve increased the number of e-books available to GW users. This is all part of our efforts to make the material you need available when you need it. Almost all of our e-books can be read online in a web browser as well as downloaded and read on a computer or device. Just click the “Online” button to begin, and log in as you would to any other GW Libraries resource.

Prefer a print book? Look for the option to “Request a Print Copy,” which is located underneath the Online button. Click this link, fill out and submit a brief form, and GW Libraries will use library funds to purchase a print copy for the collection. (On the form, you can request that the print book be placed on hold for you when it arrives.)

Visit our website for more information on using GW Libraries e-books.

Programming & Software Development Consultation Services

The George Washington University - Fri, 02/05/2016 - 11:04
February 5, 2016

The GW Libraries are proud to announce a new service to support digital scholarship at GW: Programming & Software Development Consultation Services. Assistance is available from professional software developers to GW students, faculty, and staff who are working on an academic or scholarly inquiry which requires coding. Ask questions and get hands-on assistance with:

Coding, software development, scripting, and programming
Code review and debugging
Tools selection
Working with data markup and encoding (e.g., XML, JSON, CSV, RDF)
Retrieving data from websites and APIs
Data cleansing and manipulation
Databases (e.g., table design, querying, optimizing, loading)
Fulltext searching
Online exhibits
Data visualization

Use our convenient Research Calendar to schedule an appointment with anyone labeled "coding/programming help."  You may also email libdata@gwu.edu for additional appointment times. Appointments are available both in-person and via WebEx. Learn more about these consultation services and see a list of programming languages, databases, and other areas of special expertise at go.gwu.edu/coding

New Opportunities Within the Libraries

The George Washington University - Tue, 02/02/2016 - 17:23
February 2, 2016

A statement from University Librarian and Vice Provost Geneva Henry:

As you may have already read in GW Today, GW’s Interim Provost, Forrest Maltzman, has announced a realignment of his office, consolidating academic technologies, the eDesign shop, and the university teaching and learning center under my leadership.

Pulling these units together is an excellent opportunity to seamlessly meet the instructional needs of our faculty. This deeper collaboration between previously separate areas will benefit all of our students. We look forward to the many possibilities this realignment has to further quality teaching and academic excellence at GW.  

I began my career as a programmer and IT architect working with organizations like NASA and IBM’s Higher Education Industry group, but I found my passion at the intersection of technology and information. I’ve spent the past 15 years exploring and building many of the tools used for digital scholarship and look forward to this new opportunity to expand the tools available at GW, both in the classroom and in the libraries.

I am especially excited to return to working with online education, an area in which I played a leadership role at Rice University, where we were pioneers in open education in the early 2000s. Online education is built around systems that IT architects design, such as servers and websites that can scale to support streaming audio and video for online education. But fundamental to all successful courses is the instruction and course plan of the faculty members. During my years with the Connexions project and in collaboration with the OpenCourseWare project at MIT, I’ve seen how the quality of online materials and the ability to reliably deliver them worldwide enhances the teaching and learning experiences for all of our students.

I care deeply about providing the information, technology, and pedagogical resources needed for excellence in research and instruction here at GW. I look forward to continuing our partnership with faculty, students and staff to make sure our students have the best possible experience at GW.


Mr. Novak: Hollywood and National Education Association

The George Washington University - Tue, 02/02/2016 - 14:05

In 1963, NEA teamed up with Hollywood to create Mr. Novak. The show was about an idealistic young high school teacher, played by James Franciscus, facing problems many teachers would recognize. As producer E.

Coloring Pages: Works from the Corcoran Collection of Artists' Books

The George Washington University - Mon, 02/01/2016 - 10:41
February 1, 2016

With winter now making its appearance, we look for other sources of warmth in these sometimes-dreary months. Artists’ books from the Art & Design Collection from the Corcoran showcase color and color imagery in the pages of their work. Some stories are told completely through color; others, though they might use muted palettes, create a sensation with words that paint images of colorful scenes. These bright pages serve as a complement to the exhibition Color Bloc: Paintings by Elizabeth Osborne, on view in the Luther W. Brady Art Gallery through February 26, 2016.

The exhibit displays only a sampling of artists’ books from the Art & Design Collection from the Corcoran. These and many others can be viewed in the Special Collections Research Center on the 7th floor of Gelman Library.

Coloring Pages runs through March 25, 2016, in the 2nd floor display cases of the School of Media and Public Affairs during regular building hours.

Digital Humanities Showcase

The George Washington University - Sat, 01/30/2016 - 22:08
January 30, 2016

Friday, February 12
12:30 - 3 p.m.
Please RSVP at go.gwu.edu/GWdoesDH 

Everyone is invited to a showcase of Digital Humanities (DH) projects underway across the University. The program will include brief presentations followed by discussion and a reception. Find out about innovative endeavors happening in Classics, the Elliott School, Corcoran School of the Arts and Design, Philosophy, Statistics, Health Sciences, DC Africana Archives Project, and more. Presented by the GW Digital Humanities Institute and GW Libraries, with opening remarks by Associate Professor of History Diane Cline, Director of Cross Disciplinary Collaboration and the XD@GW Faculty Cooperative.

Gelman to close Saturday, Jan. 30, from 1-9 a.m.

The George Washington University - Thu, 01/28/2016 - 18:28
January 28, 2016

Gelman Library will close on Saturday, January 30, from 1 a.m. - 9 a.m. and 24-hour building access will be unavailable during this time.  This closure is required to safely X-ray the building as part of construction activities for the National Churchill Library and Center on Gelman’s 1st floor.  The building must be completely vacant to ensure complete protection from radiation exposure.  Surrounding streets and sidewalks will not be affected.

Gelman & Eckles to CLOSE at 3 p.m. on Friday, January 22

The George Washington University - Thu, 01/21/2016 - 22:39
January 21, 2016

Due to adverse weather, the GW Libraries (Gelman, Eckles, and the Virginia Science and Technology Campus Libraries) will close at 3 p.m. on Friday, January 22.

Gelman Library will remain closed all of Saturday, January 23, and will reopen on Sunday, January 24 from noon - 8:00 p.m.

Eckles Library will reopen from 10 a.m.- 6 p.m. on Saturday, January 23 and from 10 a.m. - 10 p.m. on Sunday, January 24.

Power outages are predicted with this storm and may impact library hours. Please check library.gwu.edu for updated information before attempting to visit a library this weekend.

An Experiment with Social Feed Manager and the ELK stack

The George Washington University - Wed, 01/13/2016 - 10:49
January 13, 2016
Justin Littman

The latest in our social media harvesting experiments for the Social Feed Manager project involves analysis, discovery, and visualization of social media content. An analytics service may help satisfy two needs:

  1. For the collection creator, being able to evaluate the content that is being collected so as to adjust the collection criteria. For example, for Twitter a collection creator may discover additional hashtags to collect. Since a collection creator may be collecting a rapidly evolving event, this requires near real-time analysis.
  2. For the researcher, being able to analyze the content. Though many researchers will need to export the social media content for use with other tools, having available some sort of an analytics service may meet the needs of some researchers and may lower the barrier to performing social media research.

We also wanted to test the extensibility of the SFM architecture to make sure that additional services can be readily added.

The ELK (Elasticsearch, Logstash, Kibana) stack was selected for this experiment. It was selected primarily on the intuition that it was a good fit, rather than an analysis of its features or a comparison against other options. For those not familiar with this stack, Kibana is the discovery and visualization interface, Elasticsearch is the data store, and Logstash loads Elasticsearch with data. We’ll refer to our own implementation as SFM-ELK.

In SFM infrastructure, harvesters, such as the Twitter harvester, invoke the APIs of social media platforms and record the results in WARC files. Harvesters publish warc_created messages to a message queue whenever a WARC file is created. This provides the critical hook for SFM-ELK to perform loading -- a message consumer application listens for warc_created messages. When it receives a warc_created message, it:

  1. Invokes the appropriate WARC iterator (e.g., TwitterRestWarcIter) to read the WARC file and output the social media records as line-oriented JSON.
  2. Pipes this to jq, which filters the JSON. Most types of social media records contain extraneous metadata which do not need to be indexed in Elasticsearch. Logstash supports various mechanisms for filtering and transforming loaded data, but jq proved better for JSON data.
  3. Pipes this into Logstash, which loads it into Elasticsearch.
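The three steps above can be sketched in a few lines of Python. This is only an illustration of the flow, not the SFM-ELK code: the `warc_path` message key, the `KEEP_FIELDS` list, and the `read_records` and `load` callables (standing in for the WARC iterator and for Logstash) are all assumptions made for the example, and the filtering that jq performs is approximated here with a plain dictionary filter.

```python
import json

# Hypothetical set of tweet fields worth indexing; the real jq filter may differ.
KEEP_FIELDS = ("id_str", "created_at", "text")


def filter_record(line):
    """Parse one line of line-oriented JSON and drop fields that need not be indexed."""
    record = json.loads(line)
    return {key: record[key] for key in KEEP_FIELDS if key in record}


def handle_warc_created(message, read_records, load):
    """React to a warc_created message: iterate records, filter them, load them.

    read_records stands in for a WARC iterator and load for Logstash;
    both are placeholders supplied by the caller.
    """
    loaded = 0
    for line in read_records(message["warc_path"]):  # "warc_path" is an assumed key
        load(filter_record(line))
        loaded += 1
    return loaded
```

Passing the iterator and the loader in as arguments keeps the consumer itself small, which is roughly the property that makes it easy to bolt a service like this onto an existing message queue.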

Once properly loaded into Elasticsearch, the data is available for discovery and visualization using Kibana. Note that additional data is loaded as new WARC files are created.

For the purposes of this experiment, data harvested from Twitter’s search API using the search terms "gwu" and "gelman" was used.

While understanding the full power and flexibility of Kibana involves a significant learning curve, some of the functionality is readily usable. For example, to discover the tweets mentioning GWU’s President Knapp, enter “knapp” in the search box on the Discover screen:

or to find tweets posted by @gelmanlibrary:

Kibana allows you to easily adjust the timeframe of any discovery or visualization:

To demonstrate the sort of visualizations that might be useful for a collection creator or researcher, we created a Twitter dashboard:

Here’s each of those visualizations in a more readable size:

Note that the dashboard is periodically refreshed as new data is added.

As should be evident, this experiment barely scratches the surface of the capabilities of the ELK stack, or more generally, the potential of adding an analytics service to Social Feed Manager.  The code for SFM-ELK is available at https://github.com/gwu-libraries/sfm-elk. Instructions are provided to bring up a Docker environment so that you can give it a try yourself. Keep in mind that this is only a proof-of-concept and it is not currently in scope of SFM development.

If any of this is of interest to you or your organization, collaborators are welcome.

P.S.  It was just announced that Washington University in St. Louis, the Maryland Institute for Technology in the Humanities (MITH) at the University of Maryland, and the University of California, Riverside were awarded a Mellon grant for a project titled "Documenting the Now: Supporting Scholarly Use and Preservation of Social Media Content." Since there’s a clear need to support researchers' and archivists' needs for good analytical tools, we look forward to their work. Follow the project at @documentnow.

Learn the Skills Professors Want and Employers Expect!

The George Washington University - Mon, 01/11/2016 - 11:42
January 11, 2016

Prepare yourself for academic and professional success  by learning the communication skills you need.  The GW Libraries offer a wide range of free workshops, which are open to all GW students, staff, and alumni.

Spring topics include:

Check our website for a complete list of upcoming workshops and events.

Blue Wings Project Photos & Poems Displayed in Gelman

The George Washington University - Mon, 01/11/2016 - 11:24
January 11, 2016

The GW Libraries are thrilled to host a display of photographs and poems from the Blue Wings Project on bulletin boards throughout Gelman. Blue Wings Project brings together writers and artists of all disciplines to explore and make cross-national connections. The project is led by the Corcoran School New Media Photojournalism (NMPJ) Master of Arts program in collaboration with the Afghan Women's Writing Project (AWWP). Originally launched as a classroom-based project in Spring 2015, Blue Wings has expanded to include the entire university community. BFA Photojournalism and New Media Photojournalism graduate students were invited to read and respond to the writings of AWWP authors, all of whom are women residing in Afghanistan. The result is an exciting launch of virtual conversations between the writers in Afghanistan and photographers at the Corcoran. #bluewings

Afghan Women's Writing Project
The Afghan Women's Writing Project (AWWP) was founded in 2009 to support the human rights of an individual to tell her story. AWWP provides a platform for Afghan women to develop their voices and discover their power in the world without the filter of the media or other influences. AWWP works with women in Afghanistan and helps them to write in English and Dari. Students send their writings to the workshop, and the pieces are later published in an online magazine. AWWP has also published two collections of poetry and prose, available online: The Sky is a Nest of Swallows (2015) and Washing the Dust from Our Hearts (2014).

New Media Photojournalism
The New Media Photojournalism program at the Corcoran School of the Arts and Design is the first of its kind, created to help visual journalists study and excel within the changing world of photojournalism. 


Lit Review How To: Holiday Boot Camps for Grad Students

The George Washington University - Wed, 01/06/2016 - 16:13
January 6, 2016

Are you a graduate student working on a literature review for a thesis or dissertation?  Get serious about your scholarship by attending these 30-minute workshops to learn tips that will save you time and sanity.  Our "boot camps" on Martin Luther King's Birthday and President's Day offer several popular workshops together - attend one or all.

All sessions will take place in Gelman Library, Room 301-302.  Please bring your own computer.  Kids off school? Quiet and happily occupied offspring are welcome.

Monday, January 18 (MLK's Birthday) & Monday, February 15 (President's Day):
9:00-9:30: The Basics: Mapping your Research
9:30-10:00: Searching Beyond Gelman
10:00-10:30:  Citation Management
10:45-11:15:  Citation Chasing
11:15-11:45: Staying Current in One's Field

The Basics: Mapping your Research
What is a Literature Review, and what information do I need to begin one? Learn tips on how to begin your search, discover keywords, and narrow topics. Save time and frustration by discovering how to find the right databases and resources for your topic using GW Libraries’ tools.

Searching Beyond Gelman
How do you know what research is out there?  How can you know what you don't know?  Be sure with a comprehensive search of all published book literature using Worldcat.  This workshop is best for disciplines that write books, especially the humanities and social sciences.

Citation Management
Once you've done all that research, how do you keep track of it? Step away from the notecards and learn about online citation tools like RefWorks, Zotero, and Mendeley. Librarians will help you find the tool that is right for you and get you started using it.

Citation Chasing
How do you build on someone else's research?  How do you find the research they used? Learn to chase down those citations like a pro in this short workshop.

Staying Current in One's Field
A successful graduate student participates in the research conversation of her/his field. If you need help getting started, this workshop will help you find out how to stay current. You'll learn how to set up journal table of contents alerts, search alerts, and identify key journals in your field.

If you can't make it to all of the sessions or need more information, be sure to check out the research guide "What Graduate Students Need to Know."

Study Modernism in Paris this Summer

The George Washington University - Wed, 01/06/2016 - 14:46
January 6, 2016

Explore the journey of Picasso, Diaghilev, Kertesz, Stravinsky, & others who forged artistic collaborations and established Paris as the center of Modernist thought in the early 20th century.  

Visiting museums, touring iconic architectural sites and viewing contemporary performance spaces, we will measure today's art against the past.

Learn more at www.gwuparis.com or contact Professor Mary Buckley or Librarian Bill Gillis.

June 1-14, 2016
Paris: Modernism and the Arts, Then and Now —TRDA 4595w
No language requirements   
3 credits, WID, Elliott School and Cultural Studies Course Humanities GCR

The Sound of the Library at Work

The George Washington University - Sun, 01/03/2016 - 19:46
January 5, 2016
Laura Wrubel

At the Access Conference in Toronto in September 2015, I attended an all-day hackfest on data sonification, led by William Denton of York University and Katie Legere of Queen’s University. Data sonification is the translation of data into sound, much as data visualization transforms data into a graph or image. You can read about the workshop and see some examples of data sonification at Music, Code and Data: Hackfest and Happening at Access 2015.

In brief, it was a fast, fun, and practical introduction to both data sonification and the freely available Sonic Pi synthesizer software. Everyone who attended made some kind of music using a data file they brought with them or from provided sample files. The fact that we were able to get so far in a day speaks to both the skill of our hackfest leaders and the ease with which you can make sound with Sonic Pi.

I’d like to describe a few experiments, one from this workshop and another more recently, that have me excited about data sonification.

Music from the circulation desk

I brought to the hackfest a CSV file with the number of circulation transactions each day for a year (July through June), created from our Voyager system’s circulation transaction logs. The values ranged from roughly 100 to 1,000. With such a broad range, I chunked up the values into a smaller set from 10-100 corresponding to Sonic Pi’s numbering of notes as on a piano keyboard. However, the pitches were so wildly dispersed that, while it was easy to hear outliers, the arc of activity through the semester was hard to perceive. The notes just didn’t make sense to my ears.

To provide a more listenable (and, I’m hoping, more meaningful) line, I assigned the chunks of values to specific notes in the C major scale, across two octaves. This is a much smaller range than the first version, and the notes feel more coherent, being in the same key.
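As a rough illustration of that mapping, here is a minimal Python sketch. It is not the code used at the hackfest (that was written in Sonic Pi): the equal-width binning, the choice of C4 as the lowest note, and the `count_to_note` name are assumptions for the example. The note numbers follow the MIDI-style numbering Sonic Pi uses, where 60 is middle C.

```python
# Note numbers for C major across two octaves (C4 to C6).
C_MAJOR_TWO_OCTAVES = [60, 62, 64, 65, 67, 69, 71, 72, 74, 76, 77, 79, 81, 83, 84]


def count_to_note(count, low=100, high=1000):
    """Map a daily transaction count onto one note of the scale using equal-width bins."""
    clamped = max(low, min(high, count))  # keep outliers inside the expected range
    fraction = (clamped - low) / (high - low)
    index = min(int(fraction * len(C_MAJOR_TWO_OCTAVES)), len(C_MAJOR_TWO_OCTAVES) - 1)
    return C_MAJOR_TWO_OCTAVES[index]


# A quiet day, a typical day, and a very busy day.
notes = [count_to_note(c) for c in (120, 550, 980)]
```

Because every output is drawn from one fifteen-note list, the line stays in key no matter how the counts jump around, at the cost of hiding small day-to-day differences inside a single bin.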

Thinking there may be patterns in the volume of activity within the week, I added underlying drum beats to emphasize the first day of each week, Sunday. Finally, lighter beats accompany each note during the semester, underscoring the quiet in library activity during semester breaks.

You can listen to it here:

GitHub beats

More recently, I worked with my colleague Dan Chudnov to make visible to the library staff the activity of our team, the Scholarly Technology Group. The steady work of creating and maintaining software to help our user community often simply looks like us working away on our computers with our heads down. Dan created a visualization of our team’s work as expressed in commits to our projects’ repositories on GitHub. We also wondered what our team’s work might sound like.

I focused on one software project, an interface to our catalog data and other APIs for discovery; we call it “Launchpad” internally. It’s something quite a number of us have worked on over the past three years. I started with a file that listed one GitHub commit per line, including the name of the file changed and the person making the change. I then assigned a pitch to each person on the team, all within the same key, giving the initial project manager (Dan) and current project manager (Michael Cummings) the tonic note to provide some centering. To add a sense of time passing, I added drum beats, with a sound sample for the two major rollouts of Launchpad to our user community.
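The pitch assignment can be sketched in Python along these lines. This is a hedged illustration rather than the Sonic Pi code behind the piece: the C major key, the order in which pitches are handed out, the octave wrap-around, and the sample names other than the two project managers are all assumptions.

```python
C_MAJOR = [60, 62, 64, 65, 67, 69, 71]  # note numbers C4..B4; 60 is the tonic
TONIC = C_MAJOR[0]


def assign_pitches(committers, managers):
    """Give managers the tonic and every other committer the next scale degree."""
    pitches = {}
    others = C_MAJOR[1:]
    next_degree = 0
    for name in committers:  # committers appear in commit order, with repeats
        if name in pitches:
            continue
        if name in managers:
            pitches[name] = TONIC
        else:
            # Move up an octave if there are more people than scale degrees.
            octave, degree = divmod(next_degree, len(others))
            pitches[name] = others[degree] + 12 * octave
            next_degree += 1
    return pitches


# One committer name per commit line; "alice" and "bob" are made-up names.
log = ["dan", "alice", "dan", "michael", "bob"]
pitches = assign_pitches(log, managers={"dan", "michael"})
melody = [pitches[name] for name in log]  # one note per commit
```

Assigning pitches in order of first appearance means each newcomer introduces a note that has not been heard before, which is one way to make a new person's entry audible.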

You can listen to it and watch the supplementary logging within Sonic Pi on YouTube (best viewed fullscreen):

You can hear how the project started with three core developers, who worked intensively on the project through its first roll-out. Over time, participation broadened to a larger group, and each new person’s entry to the project is audible. It bears acknowledging that we’re hearing only a slice of the activity in the creation of Launchpad; this project had considerable contributions from others who represented end users, participated in testing, wrote documentation, conducted usability testing, and performed analyses to inform feature development.

A few observations

In each of these experiments, the first few iterations were not pleasant to listen to. I struggled to make sense of the noise, lacking anything to latch onto, the audible equivalent of X and Y axes. A little knowledge of music goes a long way in providing some structure that our ears are trained to recognize: rhythm, key, tempo.

As in creating a visualization, aesthetic choices in sonification can interfere with accuracy. Even in these small experiments, I wrestled with choices that mute, in a sense, aspects of the data and could mislead a listener trying to understand the data. For example, in the GitHub sonification, I chose to represent the activity across time uniformly, one beat per commit. Obviously, the work was not evenly distributed across three years; the pace and changes in work intensity don’t come across in this sonification.

When it comes to coding, Sonic Pi was an easy entry point to making music from data, particularly when you don’t have a live orchestra at hand. The tips in William Denton’s blog post about reading CSV files helped get me started, along with Sonic Pi’s built-in tutorial. Beyond Sonic Pi, there are many other software packages and tools to support data sonification; I’d be interested to hear what others have tried and found useful.

I'd also like to explore the growing cross-disciplinary literature on data sonification. Other examples of applying data sonification to library data include Denton's STAPLR experiment and Legere's research on using sonification to inform real-time library management decisions. In the end, these pieces were fun to create and made my colleagues’ work apparent in a new way.  There’s something satisfying about hearing your work turned into music.


Gelman Library to House Winston Churchill’s World War II Engagement Diary

The George Washington University - Wed, 12/16/2015 - 10:06
December 16, 2015
Construction of the National Churchill Library and Center to Begin this Month

A collection of handwritten cards detailing Winston Churchill’s appointments during World War II, including such historic events as Victory in Europe (VE) Day and the British prime minister’s regular meetings with the King of England and President Franklin Roosevelt, will have a new home at the George Washington University. The “engagement diary” will be featured in the new National Churchill Library and Center to be located at GW.

Steve Forbes, chairman of Forbes Media and a Churchill enthusiast, donated the collection of 30 cards to the Chicago-based Churchill Centre. The collection was then given to GW’s Estelle and Melvin Gelman Library for use in the National Churchill Library and Center, which begins construction in December.

“The engagement diary is an important historical resource, and I am pleased that they will now be seen by a broad audience,” said Mr. Forbes. “I join Churchillians everywhere in applauding The Churchill Centre’s initiative to partner with GW to create a permanent home for Churchill scholarship, studies and education in the heart of our nation’s capital.”

Privately held since the end of World War II, the cards are a source for the history of Mr. Churchill’s wartime leadership, recording the extraordinary extent of his activities and the frequency and range of his wartime journeys. Between September 1939 and June 1945, Mr. Churchill’s private secretaries kept the handwritten “engagement diary” on two-sided cards measuring 12 by 13 inches. The library has created high-resolution digital images of the cards and will launch a crowdsourcing project, open to the public, to provide full text transcription and annotation for the cards, all of which will be available to the public on a dedicated website. 

“We are delighted to receive this fantastic record that gives us a window into part of Winston Churchill’s life during World War II,” said Geneva Henry, university librarian and vice provost for libraries. “The gift coincides with the construction of the National Churchill Library and Center, the first permanent U.S. home in our nation’s capital for the study of Winston Churchill.” 

The National Churchill Library and Center, which is expected to open in 2016, will educate new generations about Mr. Churchill and will serve as a classroom and meeting space for public programs and lectures highlighting the historical significance of Mr. Churchill, his contemporaries and more recent world leaders. 

“We are honored that Steve Forbes has entrusted us with these historic documents, and we are glad that they will be a part of the National Churchill Library and Center at GW,” said Lee Pollock, executive director of the Churchill Centre. “For the first time, the original record of Churchill’s wartime activities will be made freely and widely available to scholars and students around the world.”

The library will work with academic programs across the university to develop programming.  


About the National Churchill Library and Center

The National Churchill Library and Center is part of a philanthropic partnership with the George Washington University and the Chicago-based Churchill Centre. Housed on the first floor of the Estelle and Melvin Gelman Library, this will be the first major research facility in the nation’s capital dedicated to the study of Winston Churchill.

MEDIA CONTACTS:
Kurie Fitzgerald: kfitzgerald@gwu.edu, 202-994-6461
Emily Grebenstein: egrebenstein@gwu.edu, 202-994-3087

Harvesting the Twitter Streaming API to WARC files

The George Washington University - Tue, 12/15/2015 - 08:54
December 15, 2015

The Twitter Streaming API is very powerful, allowing the harvesting of tweets that are not readily available from the other APIs. However, recall from our previous post that the Twitter Streaming API does not behave like the REST APIs that are typical of social media platforms (see Twitter’s description of the differences). A single HTTP response is potentially huge and may be collected over the course of hours, days, or weeks. This is a poor fit both for the normal web harvesting model, in which a single HTTP response is recorded as a single WARC response record in a single WARC file, and for most web archiving tools, which store HTTP responses in memory and don’t write them to the WARC file until the response is complete.

This post describes an approach we’ve developed for harvesting the Twitter Streaming API and recording it in WARC files. We will also show how the tweets can be extracted from the WARC files for use by a researcher.

The Twitter Streaming API is not the only form of streaming content on the Web, and the authors of the WARC specification had the forethought to support record segmentation. In record segmentation, a single HTTP response is split into multiple WARC records, potentially in multiple WARC files. The first record is a WARC response record; subsequent records are WARC continuation records. The header of the final continuation record also contains the total number of bytes of the entire HTTP response.
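To make the idea concrete, here is a minimal sketch of record segmentation in Python. It is not the warcprox change and it does not serialize real WARC files; the function name segment_response, the max_segment_bytes parameter, and the dict-based headers are assumptions made for illustration. Only the WARC header names come from the specification. The sketch shows that the first segment keeps the response type, later segments are continuation records that point back to it, and the total length is only known when the stream ends.

```python
import uuid
from typing import Dict, Iterable, Iterator, Optional, Tuple

# A record is modelled as (header dict, block bytes); a real WARC writer would
# also emit WARC-Date, digests, and the serialized record framing.
Record = Tuple[Dict[str, str], bytes]


def _new_record_id() -> str:
    return "<urn:uuid:{}>".format(uuid.uuid4())


def segment_response(chunks: Iterable[bytes], target_uri: str,
                     max_segment_bytes: int) -> Iterator[Record]:
    """Split a long-lived HTTP response into segment records (hypothetical helper)."""
    origin_id = _new_record_id()
    buffer = b""
    pending: Optional[Record] = None  # held back until we know if it is the last
    segment_number = 0
    total_length = 0

    def build(block: bytes) -> Record:
        nonlocal segment_number, total_length
        segment_number += 1
        total_length += len(block)
        if segment_number == 1:
            headers = {"WARC-Type": "response", "WARC-Record-ID": origin_id}
        else:
            headers = {"WARC-Type": "continuation",
                       "WARC-Record-ID": _new_record_id(),
                       "WARC-Segment-Origin-ID": origin_id}
        headers["WARC-Target-URI"] = target_uri
        headers["WARC-Segment-Number"] = str(segment_number)
        headers["Content-Length"] = str(len(block))
        return headers, block

    for chunk in chunks:
        buffer += chunk
        while len(buffer) >= max_segment_bytes:
            if pending is not None:
                yield pending
            pending = build(buffer[:max_segment_bytes])
            buffer = buffer[max_segment_bytes:]
    if buffer:
        if pending is not None:
            yield pending
        pending = build(buffer)
    if pending is not None:
        headers, block = pending
        if segment_number > 1:
            # Only the final continuation record carries the total length.
            headers["WARC-Segment-Total-Length"] = str(total_length)
        else:
            # A response that fits in one record is not segmented at all.
            headers.pop("WARC-Segment-Number")
        yield headers, block
```

Because the function is a generator that holds back at most one segment, nothing larger than max_segment_bytes plus one incoming chunk needs to sit in memory, which is the property a weeks-long streaming response requires.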

While WARC record segmentation is theoretically a good solution for the Twitter Streaming API, it is not widely supported by web archiving tools. Our first step was to modify Internet Archive’s warcprox to support record segmentation. (Our pull request is #15. The crux of the change is between lines 210 and 245 in warcprox.py.) Recall from the earlier post that warcprox is an HTTP proxy that records HTTP transactions in a WARC file.

The following shows snippets from a WARC file created by the modified warcprox while twarc retrieved tweets from the Twitter filter API, tracking “obama”. It consists of a WARC response record, a request record, a continuation record, and a final continuation record.

WARC/1.0
WARC-Type: response
WARC-Record-ID: <urn:uuid:9aff4bf7-d64a-411c-9ef8-cd82778e036e>
WARC-Date: 2015-12-02T16:59:07Z
WARC-Target-URI: https://stream.twitter.com/1.1/statuses/filter.json
WARC-IP-Address:
Content-Type: application/http;msgtype=response
WARC-Segment-Number: 1
Content-Length: 1149
WARC-Block-Digest: sha1:7c8de1bd439cf62c67f9f4b0c48e6f3ae39eb4ef
WARC-Payload-Digest: sha1:cc1b7bf9a2945ddf8ae7c35d5f05513d0d8b691b

HTTP/1.1 200 OK
connection: close
content-Encoding: gzip
content-type: application/json
date: Wed, 02 Dec 2015 16:59:07 GMT
server: tsa
transfer-encoding: chunked
x-connection-hash: 8439cf557d0f807635797377d9e7d0b6

[binary payload: gzip-compressed data in chunked transfer encoding, not human-readable]

WARC/1.0
WARC-Type: request
WARC-Record-ID: <urn:uuid:3a6ce873-13a9-401a-bfd9-3ddc321aab96>
WARC-Date: 2015-12-02T16:59:07Z
WARC-Target-URI: https://stream.twitter.com/1.1/statuses/filter.json
WARC-Concurrent-To: <urn:uuid:9aff4bf7-d64a-411c-9ef8-cd82778e036e>
WARC-Block-Digest: sha1:fa301cb54fd6c38adac4a43bacf36d38198ec8e0
Content-Type: application/http;msgtype=request
Content-Length: 566

POST /1.1/statuses/filter.json HTTP/1.1
content-length: 30
accept-encoding: deflate, gzip
host: stream.twitter.com
accept: */*
user-agent: python-requests/2.8.1
content-type: application/x-www-form-urlencoded
authorization: OAuth oauth_nonce="149931870481283598461449075546", oauth_timestamp="1449075546", oauth_version="1.0", oauth_signature_method="HMAC-SHA1", oauth_consumer_key="EHdoTe7ksBgflP5nUalEfhaeo", oauth_token="481186914-c2yZjbk1np0Z5MWEFYYQKSQNFBXd8T9r4k90YkJl", oauth_signature="m0hHjrPnU7aTtOhjmk8om3Vv7Ok%3D"

track=obama&stall_warning=True

WARC/1.0
WARC-Type: continuation
WARC-Record-ID: <urn:uuid:c18791da-24e0-42a7-91df-82dfdae6697e>
WARC-Date: 2015-12-02T16:59:07Z
WARC-Target-URI: https://stream.twitter.com/1.1/statuses/filter.json
WARC-IP-Address:
Content-Type: application/http;msgtype=response
WARC-Segment-Number: 2
WARC-Segment-Origin-ID: <urn:uuid:9aff4bf7-d64a-411c-9ef8-cd82778e036e>
Content-Length: 1220
WARC-Block-Digest: sha1:82794503724ba3bb06fee69302614a3f5ef00c39

[binary payload: gzip-compressed data in chunked transfer encoding, not human-readable]

WARC/1.0
WARC-Type: continuation
WARC-Record-ID: <urn:uuid:d7bfe010-7831-45a8-8361-715692ea014b>
WARC-Date: 2015-12-02T16:59:09Z
WARC-Target-URI: https://stream.twitter.com/1.1/statuses/filter.json
WARC-IP-Address:
Content-Type: application/http;msgtype=response
WARC-Segment-Number: 3
WARC-Segment-Origin-ID: <urn:uuid:9aff4bf7-d64a-411c-9ef8-cd82778e036e>
WARC-Segment-Total-Length: 924
WARC-Truncated: unspecified
Content-Length: 307
WARC-Block-Digest: sha1:57b73cdaab8025cc04a83f3ae6eff2dd6e2bfa15

[binary payload: gzip-compressed data in chunked transfer encoding, not human-readable]

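As should be obvious, this data is not readily usable by most researchers. In particular, there are four barriers to use: the data is stored in a WARC file; the HTTP response is split across multiple WARC records by record segmentation; the response uses chunked transfer encoding; and the response body is gzip content-encoded.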

To be confident in this approach, we feel it is prudent to make sure that we can access the tweets despite these barriers and the lack of support for record segmentation in web archiving tools. To this end, we developed TwitterStreamWarcIter and its parent class BaseWarcIter. TwitterStreamWarcIter outputs the tweets from a WARC file, one per line. This is the same output as twarc or cat-ing a line-oriented JSON file, and it can be piped to other tools such as jq:

$ python twitter_stream_warc_iter.py test_1-20151202200525007-00000-30033-GLSS-F0G5RP-8000.warc.gz
{"contributors": null, "truncated": false, "text": "RT @Litorodbujan: Obama quiere visitar Espa\u00f1a!\nAhora s\u00ed somos un pa\u00eds serio; con Rajoy no se repetir\u00e1 esto.   #RajoyconPiqueras https://t.c\u2026", "is_quote_status": false, "in_reply_to_status_id": null, "id": 672144412936445952, "favorite_count": 0, "source": "<a href=\"https://mobile.twitter.com\" rel=\"nofollow\">Mobile Web (M2)</a>", "retweeted": false, "coordinates": null, "timestamp_ms": "1449086690540", "entities": {"user_mentions": [{"id": 320317854, "indices": [3, 16], "id_str": "320317854", "screen_name": "Litorodbujan", "nam ....

or made suitable for human consumption with the --pretty flag:

$ python twitter_stream_warc_iter.py test_1-20151202200525007-00000-30033-GLSS-F0G5RP-8000.warc.gz --pretty
{
    "contributors": null,
    "truncated": false,
    "text": "RT @Litorodbujan: Obama quiere visitar Espa\u00f1a!\nAhora s\u00ed somos un pa\u00eds serio; con Rajoy no se repetir\u00e1 esto.   #RajoyconPiqueras https://t.c\u2026",
    "is_quote_status": false,
    "in_reply_to_status_id": null,
    "id": 672144412936445952,
    "favorite_count": 0,
    "source": "<a href=\"https://mobile.twitter.com\" rel=\"nofollow\">Mobile Web (M2)</a>",
    "retweeted": false,
    "coordinates": null,
    "timestamp_ms": "1449086690540",
    "entities": {
....
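Because the default output is one JSON document per line, it can also be consumed directly from Python rather than jq. The following is a small sketch under that assumption; the function name iter_tweet_texts is hypothetical, and only the "id" and "text" fields shown in the sample output above are used. Lines without a "text" field (the Streaming API can interleave other messages, such as limit notices) are skipped.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_tweet_texts(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (id, text) for each tweet in line-oriented JSON (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines carry no data
        message = json.loads(line)
        if "text" in message:  # skip non-tweet messages
            yield message["id"], message["text"]


# Possible usage, reading the iterator's output from standard input:
#
#     import sys
#     for tweet_id, text in iter_tweet_texts(sys.stdin):
#         print(tweet_id, text)
```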

This approach addresses the WARC barrier by using Internet Archive’s WARC library to read the WARC file. The IA WARC library is extended to handle record segmentation by stitching the payload back together. (See CompositeFilePart. It still doesn’t handle continuations that are in other WARC files, but solving that problem is just software development.) Finally, the content encoding and transfer encoding barriers are remedied by loading the payload into a urllib3 HTTPResponse, which handles the decoding of the content encoding and transfer encoding and provides a familiar, Pythonic interface to the response.
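For readers who want to see what those decoding steps involve, here is a standard-library-only sketch. It is not the TwitterStreamWarcIter implementation, which relies on the IA WARC library and urllib3 as described above; the function names dechunk, iter_tweets, and tweets_from_segments are hypothetical, and the sketch assumes the record blocks have already been read from the WARC file in segment order. It stitches the segments together, removes the chunked transfer encoding, decompresses the gzip content encoding, and parses one tweet per line.

```python
import json
import zlib
from typing import Any, Dict, Iterable, Iterator


def dechunk(body: bytes) -> bytes:
    """Remove HTTP chunked transfer encoding, tolerating a truncated stream."""
    pieces = []
    pos = 0
    while True:
        line_end = body.find(b"\r\n", pos)
        if line_end == -1:
            break  # no further complete chunk-size line
        size_field = body[pos:line_end].split(b";", 1)[0].strip()
        size = int(size_field.decode("ascii"), 16)
        if size == 0:
            break  # terminating zero-length chunk
        start = line_end + 2
        chunk = body[start:start + size]
        pieces.append(chunk)
        if len(chunk) < size:
            break  # stream was cut off mid-chunk
        pos = start + size + 2  # skip the CRLF that follows the chunk data
    return b"".join(pieces)


def iter_tweets(payload: bytes) -> Iterator[Dict[str, Any]]:
    """Parse newline-delimited JSON, ignoring keep-alive blanks and a partial tail."""
    lines = payload.split(b"\n")
    # The last element is empty or an incomplete line from a truncated stream.
    for line in lines[:-1]:
        line = line.strip()
        if line:
            yield json.loads(line.decode("utf-8"))


def tweets_from_segments(blocks: Iterable[bytes]) -> Iterator[Dict[str, Any]]:
    """Stitch segment blocks, then undo transfer and content encoding."""
    http_response = b"".join(blocks)  # record segmentation: concatenate in order
    _headers, _sep, body = http_response.partition(b"\r\n\r\n")
    compressed = dechunk(body)
    # 16 + MAX_WBITS selects gzip framing; a decompressobj returns whatever
    # output is available even if the gzip stream never reaches its trailer.
    payload = zlib.decompressobj(16 + zlib.MAX_WBITS).decompress(compressed)
    return iter_tweets(payload)
```

The tolerance for truncation matters here because a streaming response is usually cut off when the harvest stops, so the final chunk, the gzip trailer, and the last line of JSON may all be incomplete.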

As we have explored the similarity between web harvesting and social media harvesting, the Twitter Streaming API represents the point of greatest friction. However, the above represents a reasonable first approach to addressing the unique features of the Twitter Streaming API.

Addressing Temperature Complaints in Gelman

The George Washington University - Thu, 12/10/2015 - 13:31
December 10, 2015

We hear your complaints about the heat in Gelman and we are working with GW Facilities to get the temperatures under control! The building is currently experiencing areas of extreme heat (primarily the 3rd floor) and areas of extreme cold (mostly the 5th and 7th floors). Please continue to email gelman@gwu.edu or tweet @gelmanlibrary with reports of when & where you experience extreme temperatures in the building. These reports help the maintenance crew pinpoint and correct the problem.

We sincerely apologize for the inconvenience at this important time of the semester and have been working since the initial reports to provide more comfortable temperatures in the building.