Last modified: Friday, 23-Feb-2024 20:00:27 UTC. Maintained by: Elisa E. Beshero-Bondar (eeb4 at psu.edu). Powered by firebellies.

Regex Exercise: Convert a Text File of Movie Data to XML

Consult the following resources as you work with Regular Expressions:

Our newtFire tutorial on Autotagging with Regular Expressions (Regex)
Regular-Expressions.info Tutorial: a mine of helpful detail on regular expression matching

Get the source text ready in <oXygen/>

We downloaded some data about movies from the 1930s to 2018 in a spreadsheet and saved it as a plain-text file, which you can download from our site here: movieData.txt. (Spreadsheets can be saved as raw text in tab-separated or comma-separated format. We saved ours with tabs as separators, because some of the movie titles contain commas.) Working with text saved with tab-separated values can be a good way to orient yourself to regular expression patterns. Our goal is to apply regular expressions to convert this text document (with thousands of lines) into an XML document without coding by hand!

To download the file, go to File and Save as in your web browser, and choose a useful name and location on your computer to save the file. We typically keep the .txt extension, and you might rename this as YourName_MovieData.txt.
Then open <oXygen/>, and open the file you saved.

Prepare a Step File

Next, open a new, separate text file, in which you will record each step you take in up-converting this document to XML. This needs to be a plain text (*.txt) or markdown (*.md) file and not something you write in a word processor (not a Microsoft Word document) so you do not have to struggle with autocorrections of the regex patterns you are recording.
Save this file as your main homework submission for this assignment, following our standard homework file naming conventions for upload to Canvas. We will duplicate the steps you record to make sure they work to up-convert the text file to XML. Suggestions: You can open a new markdown file in <oXygen/> by going to open a New document (the folded piece of paper icon) and typing in markdown in the search bar. (Or type in text to open a new plain text file.) On Windows, you can find and open Notepad and record your steps in plain text form here outside of oXygen, which may be convenient, so you don’t accidentally try your find-and-replace operations on your step file instead of the main text. On Mac, you might try TextEdit, or stick with <oXygen/> and open your window in Tile View as we did with your Relax NG Schema files.

The task

Your goal is to produce an XML version of the movie data file by using the search-and-replace techniques we discussed in class, and to record each step you take in a plain text or markdown file so others can reproduce exactly what you did. (In a real-life project situation, you may need to share the steps you take in up-converting plain text documents to XML on your GitHub repo, written in GitHub’s markdown (the same that we write on the GitHub Issues board); in that case you would save the file with a .md extension.)

Your up-converted XML output should look something like movieData.xml. This involves wrapping each movie’s line of data in its own element and tagging the title, date, and location inside it. It also involves remixing the running time so that the unit (min) is held in an attribute.

Your Steps file needs to be detailed enough to indicate each step of your process: what regular expression patterns you attempted to find, and what expressions you used to replace them. You might record the number of finds you get and even how you fine-tuned your steps when you were not finding everything you wanted to at first. Note: we strongly recommend copying and pasting your find and replace expressions into your Steps file instead of retyping them (since it is easy to introduce errors when retyping).

How to proceed

There are several ways to get to the target output, but the starting points are standard:

Starting work:

First of all, for any up-conversion of plain text, you must check for the special reserved characters: the ampersand & and the angle brackets < and >. You need to search for those and, if they turn up, replace them with their corresponding XML entities, so that these will not interfere with well-formed XML markup.

Search for:	Replace with:
`&`	`&amp;`
`<`	`&lt;`
`>`	`&gt;`

Note that you need to process the special XML reserved characters in the correct order. Why is it important that you search and replace the & first?
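To see what is at stake in the ordering, here is a minimal Python sketch that mimics the three literal find-and-replace steps in both orders. The sample string and the function names are invented for illustration; they are not taken from movieData.txt or from oXygen.

```python
# Hypothetical sample text (not from movieData.txt).
raw = "Fast & Furious <uncut>"

def escape_ampersand_first(text: str) -> str:
    """Replace & before the angle brackets."""
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text

def escape_ampersand_last(text: str) -> str:
    """Replace & after the angle brackets (the problematic order)."""
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    # This step now also hits the & inside the &lt; and &gt; just inserted.
    text = text.replace("&", "&amp;")
    return text

good = escape_ampersand_first(raw)
bad = escape_ampersand_last(raw)
print(good)  # Fast &amp; Furious &lt;uncut&gt;
print(bad)   # Fast &amp; Furious &amp;lt;uncut&amp;gt;
```

In the second order, the entities you just created are themselves damaged, because each one begins with an ampersand.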

To perform regex searching, you need to check the box labeled Regular expression at the bottom of the <oXygen/> find-and-replace dialog box, which you open with Control-f (Windows) or Command-f (Mac). If you don’t check this box, <oXygen/> will just search for what you type literally, and it won’t recognize that some characters in regex have special meaning. You don’t have to check anything else yet. Be sure that Dot matches all is unchecked, though; we’ll explain why below.
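The effect of the Dot matches all setting can be sketched outside of <oXygen/> with Python’s re module, where the re.DOTALL flag plays a comparable role. This is only an analogy, assuming the data rows are separated by line breaks and the fields by tabs; the second row below is invented for illustration.

```python
import re

# Two tab-separated rows. The first is the sample row from this exercise;
# the second is made up.
text = "Operation Dunkirk\t2017\tUSA\t96 min\nSample Title\t1999\tUK\t100 min\n"

# Default behavior (like leaving "Dot matches all" unchecked):
# the dot stops at each line break, so .+ finds one match per line.
per_line = re.findall(r".+", text)

# With re.DOTALL (like checking "Dot matches all"):
# the dot also matches line breaks, so .+ swallows the whole text at once.
whole_text = re.findall(r".+", text, flags=re.DOTALL)

print(len(per_line))    # 2
print(len(whole_text))  # 1
```

When the dot is allowed to match line breaks, a pattern meant to find one line at a time can match the entire document as a single unit, which is why we leave that box unchecked for line-by-line work.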

How to approach the conversion process

Our data is organized in lines of text, so we recommend starting by wrapping those lines in a simple wrapper element (<movie>....</movie>) to isolate each line of data about each movie. We can then proceed to fine-tune the markup and add more inside each movie element, working around the tab characters.

Find and Replace: Working with Capturing Groups

From each text row of movie data, we ultimately want to create this pretty-printed, structured XML markup (showing a sample of the data for the movie Operation Dunkirk):

           <movie>
               <title>Operation Dunkirk</title>
               <date>2017</date>
               <location>USA</location>
               <time unit="min">96</time>
           </movie>

To get to this point, start by looking at the lines in the text file as you have it open in oXygen. You'll see that each line is numbered. We can try working on this data from the outside in, that is, wrap each whole line in a wrapper element so each movie’s data is contained in tags:

Operation Dunkirk	2017	USA     96 min

Use a Find and Replace operation to isolate each line with a simple regular expression. Then, in your Replace, refer back to what the Find expression matched so that you keep it, either as a whole unit or as a capturing group.

<movie>Operation Dunkirk	2017	USA        96 min</movie>

The way regex really thinks of this process is, match every movie line, delete it, and replace it with remixed pieces of itself wrapped in <movie> tags. That is, regex doesn’t think about leaving the movie line in place and inserting something before and after it; it thinks about matching each movie line, deleting it, and then putting the whole thing back, with the tags that you desire. You need to refer to what you want to keep (in this case the whole thing), as a capturing group. When we want to keep the whole expression that we found, the whole line of text here, we refer to capturing group 0 with \0.
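The match, delete, and put back idea can be sketched in Python as well. This is an analogy rather than the <oXygen/> step itself: Python’s re.sub writes the whole-match reference as \g<0> instead of \0, and the pattern shown is just one possible way to match a line. The second row of data is invented for illustration.

```python
import re

# The sample row from this exercise plus one made-up row, tab-separated.
text = "Operation Dunkirk\t2017\tUSA\t96 min\nSample Title\t1999\tUK\t100 min"

# Each match is removed and replaced by itself (\g<0> is the entire match)
# with a start tag in front of it and an end tag behind it.
wrapped = re.sub(r".+", r"<movie>\g<0></movie>", text)

print(wrapped)
```

Nothing is inserted around the original text; every line is rebuilt from the replacement string, which happens to contain the full match in the middle.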

More fun with capturing groups

Once you have isolated the movie lines and wrapped them in start and end tags, it is time to apply more detailed markup inside, to isolate each movie title, date, and location. We will do something special with the time unit, remixing that data to put the unit inside an attribute value. To do this work, you will need to learn how to mark and apply capturing groups.

To make capturing groups you set parentheses around the portions of your regular expression that you want to keep. Think of setting capturing groups as a way to isolate pieces of your Find so that you can point to them and position them exactly where you want in your Replace. Take your first step by locating the <movie> start tag that you just set down followed by just the movie title (which is bordered by a tab character). Once you can find these things, wrap the element tag in its own capturing group, and then the title information in a second capturing group.

In the replace, you will need to refer to the capturing groups using a special regular expression. The sequence \1 points to the first capturing group, ordered from left to right. \2 refers to the second capturing group. Remember, the expression \0 refers to the entire match regardless of the capturing groups. Try experimenting with Find and Replace using capturing groups in various ways until you set down the tagging you want. (The Undo button in oXygen is under the Edit menu, and we use it frequently when we are experimenting like this!)
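Here is a small Python sketch of numbered capturing groups at work on data that is deliberately unrelated to the movie file, so that the patterns for this exercise remain yours to discover. The item names, the colon-separated layout, and the <item> element are all invented for illustration.

```python
import re

# Invented data: an item name, a colon, and a count on each line.
text = "apples:3\npears:12"

# Group 1 captures the name; group 2 captures the digits.
pattern = r"(\w+):(\d+)"

# The replacement repositions the pieces: group 2 moves into an attribute
# value and group 1 becomes the element content. The colon is matched but
# not captured, so it disappears from the output.
replacement = r'<item count="\2">\1</item>'

result = re.sub(pattern, replacement, text)
print(result)
```

The groups are numbered by the position of their opening parenthesis, from left to right, and the replacement is free to use them in any order or to leave one out.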

We are not going to tell you how to create your regular expressions: part of the learning process here is looking stuff up in the tutorial sites we have provided, and asking for help when you get stuck on our DIGIT-Coders Slack or by opening an issue on our textAnalysis-Hub. Do your best to wrap the data you see in meaningful tags, even if what you create does not look exactly like our sample XML.

Cleaning up and checking your results

Save your text file now as an XML file by saving as .xml. You will need to close and reopen the document so that oXygen actually recognizes and reads the file as an XML document and can tell you whether it is well-formed. It probably is not well-formed, because you need to wrap the document in a root element. Do that and inspect the document for well-formedness. To check for well-formedness in the XML file, you can use Control+Shift+W on Windows, Command+Shift+W on Mac, or click on the arrow next to the red check mark in the icon bar at the top and choose Check well-formedness. If you see regular patterns of something that you can fix with regular expressions, use them and document your steps.
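If you are curious what a well-formedness check looks for, the following Python sketch uses the standard library’s xml.etree.ElementTree parser, which raises a ParseError on input that is not well-formed. The root element name movies and the sample titles are assumptions made for illustration; choose whatever root element name suits your document.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Two sibling <movie> elements with no single root element around them.
no_root = "<movie><title>Operation Dunkirk</title></movie>\n<movie><title>Sample Title</title></movie>"

# The same content wrapped in a hypothetical root element.
with_root = "<movies>\n" + no_root + "\n</movies>"

# A raw ampersand that was never converted to an entity.
raw_ampersand = "<movies><movie><title>Fast & Furious</title></movie></movies>"

print(is_well_formed(no_root))        # False
print(is_well_formed(with_root))      # True
print(is_well_formed(raw_ampersand))  # False
```

The first and third cases correspond to the two problems this exercise asks you to handle: a missing root element and unescaped reserved characters.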

General

As we mention above, there are several ways to get to the target output, and whatever works is legitimate, as long as you make meaningful use of computational tools, including regular expressions (where appropriate), and don’t just tag everything manually. As you saw in class, there are ways to build your own regular expressions to match whatever patterns you need to identify, and the regex language is complex and often difficult to read. The way we would approach this task is by figuring out what we need to match and then looking up how to match it. In addition to the mini-tutorial above, there is more comprehensive tutorial information at http://www.regular-expressions.info/tutorialcnt.html. If you decide to look around for alternative reference sites and find something that seems especially useful, please post the URL on the discussion boards, so that your classmates can also consult it.

What to submit

the original source text file you started with
a step file as a markdown (.md) or plain text (.txt) document (a step-by-step description of what you did), and
your results file (the XML document as .xml)

If you don’t get all the way to a solution, just upload the description of what you did, what the output looked like, and why you were not able to proceed any further. As you are working on this, post any questions on Slack or our class GitHub Issues board!