With this pair of assignments you will first learn (in Part 1) how to extract data from your XML into a special tabular plain text format called a TSV file, which you will then import into the network analysis software, Cytoscape. In Part 2, you will learn how to analyze and organize your data as a network graph working in Cytoscape. What you are learning here will prepare you for other kinds of data analysis and visualization work, because this simple, handy data format can be read by spreadsheets and web mapping applications, too. TSV stands for Tab Separated Values, and it uses a tab control character, written with the unicode special entity notation
&#x9;, which signals a movement to the next tab stop, the location the cursor jumps to when you hit the tab key. Basically a TSV presents a table layout in plain text, and actually, any plain text file can represent a tabular column format just by using a regularly repeating pattern of characters, such as white space or a comma (the comma-separated output is known as a CSV file). You should save these files with a
.tsv (or a
.csv) extension depending on whether you use a tab or a comma separator. Here is some sample output from the Decameron project in TSV format:
Stratilia	frame	Pampinea
Stratilia	frame	Fiammetta
Stratilia	frame	Filomena
Stratilia	frame	Emilia
Stratilia	frame	Lauretta
Stratilia	frame	Neifile
Stratilia	frame	Filostrato
Stratilia	frame	Dioneo
Stratilia	frame	Parmeno
Stratilia	frame	Sirisco
Stratilia	frame	Tindaro
Stratilia	frame	Misia
Stratilia	frame	Licisca
Stratilia	frame	Chimera
Bergamino	floatingFrame	Filostrato
Bergamino	floatingFrame	Lauretta
Martellino	floatingFrame	Filostrato
Martellino	floatingFrame	Neifile
Marchese	novella	Martellino
Marchese	novella	Agolanti
Agolanti	novella	Martellino
Agolanti	novella	Marchese
Agolanti	novella	Pampinea
Agolanti	novella	Filostrato
Agolanti	novella	Lamberti
Lamberti	novella	Pampinea
Lamberti	novella	Filostrato
This is a portion of a much larger TSV file that represents co-occurrence network data, that is, it shows individual characters from The Decameron who are connected with each other by being present in the same portion of the text, whether in the introduction or concluding
frame portions of each day of storytelling, the
floatingFrame sections in which the frame narrators provide commentary inside the story sections, and the stories themselves in the story or
novella level. These characters appear together in the same locations in the text, and this is a typical co-occurrence relationship for network analysis, which connects
nodes (the characters, here) and
edges (what they share, or what location hosts them both, such as a
<div type="novella"> here). For more on networks of co-occurrence see our Introduction to Network Analysis and Cytoscape for XML Coders. That is the kind of network you will be plotting from XML in this exercise.
You may work with any of our student project files to plot a network of co-occurrence of any kind that interests you, but keep in mind our advice in our tutorial: keep it simple with just one kind of node (say, individual names, place names, reading witnesses, etc.) and some unit of co-occurrence drawn from the structure of your XML files. Your project files are loaded into our eXist database, so you may access them from the following locations:
or with the collection of the pair of English and Italian files, if you want to plot a graph that compares something between the two:
Each of these projects is coded in TEI. Since we are outputting plain text and not code in a namespace, we can simplify our XQuery by declaring the TEI as the default namespace at the top of the file (so that you will not need to use the
tei: prefix in referring to elements). To explore the code and look for things to try plotting, we recommend you pull from the GitHub projects to open the files in <oXygen/> to study, using XPath and the Outline view (
Window → Show View → Outline).
xquery version "3.0";
declare default element namespace "http://www.tei-c.org/ns/1.0";
Study the XML code from a sample project file (ideally in <oXygen/>) to identify a co-occurrence relationship that interests you. For example, which places are mentioned together in the frame narration vs. the stories in The Decameron? Your network does not have to be about people and places, but could be based on something else you have marked, such as the different publications that represented particular poems in the Emily Dickinson collection.
Note: You may wish to update your project files in our eXist database (and as you review and plot your output data you will almost certainly see evidence of tagging errors like extra spaces in elements that might yield two separate nodes with the same name, etc.). To update the file, go to
File → Manage, browse for your project directory and locate the file you want to change, and delete it from the database by selecting it and clicking on the trashcan symbol. Upload your new file by clicking on the upload button to the left of the trashcan.
Networks of data are created from
nodes connected by
edges. For Cytoscape to read and plot network data, it requires a CSV or TSV import in the form of:
Source-Node Edge Target-Node
We typically use a TSV because sometimes our node data contains commas and simple white spaces, but it never contains a tab character, so we know we can safely use it as a separator character. Since our output will be strings of text, we will need to use the
concat() function to concatenate (or combine together) each single piece we need for each line, including the tab characters,
&#x9;. Read about
concat() and its cousin
string-join() in the Michael Kay book on p. 545 or search for
concat on the w3schools XSLT, XPath, and XQuery Functions page. We will actually want to use these two functions together when we return our text output, because we will want to produce the following format for our TSV.
Source-Node [tab] Edge-Interaction [tab] Target-Node [return]
This effectively expresses something like a simple sentence:
Thing-1 [tab] is-in-a-special-shared-place-with [tab] Thing-2 [return]
We are almost certainly going to need to
clean up and
de-dupe (or remove duplicates from) the input data! Almost every project will feature some level of mess to clean up, and one very simple clean-up you can apply here is to remove any extra white spaces in your input nodes, while doing the XQuery! For this we use the XPath function
normalize-space(), which removes leading and trailing white space, collapses any internal run of white space to a single space, and makes sure that
<city> Greensburg</city> turns out to be the same single distinct value as
<city>Greensburg </city> and
<city>Greensburg</city>. To use
normalize-space(), we typically walk the tree to the nodes we want to process, and place
normalize-space(.) like so at the end of the XPath:
let $input1 := $yourVariableStartingPoint//walk//the//tree//to//here/normalize-space(.)
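As a rough illustration of the difference this makes, here is a minimal sketch that counts distinct values before and after normalizing. The inline <places> element and the variable names are hypothetical stand-ins for a real project file, not part of any project's markup.

```xquery
xquery version "3.0";
(: Hypothetical inline sample standing in for a project file. :)
let $sample :=
    <places>
        <city> Greensburg</city>
        <city>Greensburg </city>
        <city>Greensburg</city>
        <city>Pittsburgh</city>
    </places>
(: Raw string values keep the stray spaces, so the three spellings stay distinct. :)
let $raw := $sample//city/string()
(: normalize-space(.) at the end of the path trims each value first. :)
let $clean := $sample//city/normalize-space(.)
return (count(distinct-values($raw)), count(distinct-values($clean)))
```

Run on this sample, the first count should be 4 and the second 2, because the three spellings of Greensburg collapse into one value once the extra spaces are gone.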
In our return, we are going to use
concat() to hold the Source-Node, [tab], Edge-Interaction, [tab], Target-Node, and then we will bundle that concat function inside a
string-join() with the special unicode character of a line-feed or hard-return,
&#10;, as the separator of each line in the output text. Typically we don’t express the whole verb phrase as the Edge, but we output a word or phrase that identifies what the shared space or shared interaction consists of, as in this example:
Bergamino floatingFrame Lauretta
Here, the character Bergamino shares with the character Lauretta a position in one of the
<floatingText> sections of our TEI XML for The Decameron, and this relationship constitutes one base unit of a larger network of connections. Generating the TSV file that holds a collection of information like this effectively stores all the network data, and when we import it in Cytoscape we can run the software to calculate, plot, and study its network statistics: which nodes are the most connected to other nodes? Which nodes are necessary to hold the network together? Which parts of the network are broken off from the others? Which nodes only appear to have one edge type (say only in sharing
<floatingText>) and which ones share multiple edge types? We can output our network plot in many different ways to consider these questions, and that will be our focus in Part 2, but for now, we need to generate the network data to identify the nodes and edges in the first place.
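To make the division of labor between the two functions concrete, here is a small sketch with hard-coded strings in place of XPath results. It assumes the tab is written as the character reference &#x9; and the line feed as &#10;; the variable names are illustrative only.

```xquery
xquery version "3.0";
(: The tab character, written as a character reference. :)
let $tab := "&#x9;"
(: Each concat() builds one line: source, tab, edge, tab, target. :)
let $lines := (
    concat("Bergamino", $tab, "floatingFrame", $tab, "Lauretta"),
    concat("Bergamino", $tab, "floatingFrame", $tab, "Filostrato")
)
(: string-join() glues the lines together with a line feed between them. :)
return string-join($lines, "&#10;")
```

The result is a single string holding two tab-separated lines, which is the shape Cytoscape expects to import.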
This is an exercise in nesting a pair of
For Loops. Let’s think about why. You need to output each Node-1 or Source-Node, so you want an outer
For Loop to generate this (together with any information you want to share about that node, called a node attribute), and hold its edge information too: anything you need that is in a one-to-one relationship with the Source Node. But in order to retrieve the Target-Nodes, you need to realize that for each single Source Node, there may be several other nodes that co-occur with it in the same space. That means that you need to define a variable that will catch the whole series of target nodes, and then walk through them one at a time, so that you produce a separate line of text for each Target node. That means that each Source Node will need to be output several times, once for every Target. Return everything in a
concat() using the tab characters we described above, and bundle that in a
string-join() with the line-feed return character, also described above.
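The nesting described above can be sketched as follows. The inline <text> element is a hypothetical miniature document (it is not in the TEI namespace, so no namespace declaration is needed here), and the edge word "novella" is hard-coded for simplicity.

```xquery
xquery version "3.0";
(: Hypothetical miniature document: two sections, each listing the names in it. :)
let $doc :=
    <text>
        <div type="novella">
            <persName>Marchese</persName>
            <persName>Martellino</persName>
            <persName>Agolanti</persName>
        </div>
        <div type="novella">
            <persName>Lamberti</persName>
            <persName>Pampinea</persName>
        </div>
    </text>
return string-join(
    (: Outer loop: one pass per source node. :)
    for $source in $doc//persName
    let $sourceName := normalize-space($source)
    (: Catch the whole series of targets: the other names in the same div. :)
    let $targets := $source/parent::div/persName[normalize-space(.) != $sourceName]/normalize-space(.)
    (: Inner loop: one output line per target, repeating the source each time. :)
    for $target in $targets
    return concat($sourceName, "&#x9;", "novella", "&#x9;", $target),
    "&#10;")
```

Each name in the first div is output twice as a source (once per peer) and each name in the second div once, giving eight lines in all.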
distinct-values() for Source and Target Nodes?
Think about whether you want to network every single time your node appears with every other node in your document. You would produce many duplicate lines of data and your resulting graph would contain many edge lines: Bergamino may appear in the same place with Lauretta over and over and over again. Is that data relevant to your network? You could simplify by taking
distinct-values(), and then you would only be noting whether or not two characters appear together at all in a given location, not how many times they appear together. Then again, you might actually want to know that information! To make this really efficient, you can reduce the size of your output by taking distinct-values, and you could also create a separate variable that just goes and checks the
count() of the number of times the target node appears in the same context with the source node. If you simply collect that as a number, you could use that number as an
edge attribute in Cytoscape when you graph your edge lines: Perhaps you could plot the thickness of an edge line based on how many times the target node shows up in the presence of the source node. Varying the thickness of the edge-lines in a network graph is known as
weighting the edges.
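Here is one possible way to combine the two ideas, reducing the targets with distinct-values() while keeping a count() as an edge weight. The sample <div> and the fixed source name are hypothetical, chosen only to keep the sketch short.

```xquery
xquery version "3.0";
(: Hypothetical sample: Lauretta is mentioned twice alongside Bergamino. :)
let $doc :=
    <div type="frame">
        <persName>Bergamino</persName>
        <persName>Lauretta</persName>
        <persName>Lauretta</persName>
        <persName>Filostrato</persName>
    </div>
let $source := "Bergamino"
(: Every co-occurring name, duplicates included. :)
let $allTargets := $doc/persName[. != $source]/normalize-space(.)
return string-join(
    (: One line per distinct target... :)
    for $target in distinct-values($allTargets)
    (: ...with the number of times that target appeared as the edge weight. :)
    let $weight := count($allTargets[. = $target])
    return concat($source, "&#x9;", "frame", "&#x9;", $weight, "&#x9;", $target),
    "&#10;")
```

With this sample, Lauretta should carry a weight of 2 and Filostrato a weight of 1. The order of the lines may vary, since distinct-values() does not guarantee an order.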
if (...) then ... else ...
Depending on what you are plotting in your network, you may want to distinguish among different kinds of nodes or different kinds of edge locations. In our example from The Decameron we output three different words to indicate whether an interaction occurred in floatingText, in the outer frame around the stories, or inside the stories themselves. We also needed to determine the peers of each distinct character who are mentioned in the same layer of text, and that meant looking only inside the appropriate
ancestor::floatingText element that contains the characters in question for all the persName elements that are not equal to the Source Node. Outputting different kinds of information based on the distinct locations of these elements requires a conditional series of
if () then () and
else statements to determine the output of a variable. Here is how to work with iffy conditionals. These sit inside a variable definition to control how it may be defined based on the conditions you set:
let $variable := if (XPath condition 1)
                 then some-value-to-store--either XPath or "text"
                 else if (XPath condition 2)
                 then some-alternative-value-to-store--either XPath or "text"
                 else some-other-value-for-all-other-cases--either XPath or "text"
So, in something more like the variables we prepared for network analysis:
let $edge := if ($treeWalker[. = $distinctValue]/ancestor::whatEver)
             then "whatEver"
             else if ($treeWalker[. = $distinctValue]/ancestor::somethingElse)
             then "somethingElse"
             else "remainingOption"
In this example, we involved a
$treeWalker variable that we set earlier in the XQuery to walk the tree of the XML file(s) before we took
distinct-values(). In order to check for the peers and to look up the edge data for our network, we need to check each name element in the XML to see if it corresponds to the current entry in our list of
distinct-values, and when it does, check to see which conditions it meets. It should output a different condition depending on its placement, and if we do this right, we will identify every condition that matters to us. The final
else branch is required in XQuery, but it can return the empty sequence, (), or it could be given a value to output to account for any other case that we didn't define in the preceding conditionals. We can have as many
else if statements as we like and make a long running list of conditionals, but for the purpose of our network we decided to keep this simple: The words we output in this variable will signal three different states that ultimately we will be able to color-code or plot distinctly in our network graph.
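A runnable version of this kind of conditional might look like the sketch below. The inline <body> element and its nesting are hypothetical and only loosely modeled on the Decameron markup; the test for floatingText comes first because a floatingText can sit inside another div.

```xquery
xquery version "3.0";
(: Hypothetical sample with one name in each of the three kinds of location. :)
let $doc :=
    <body>
        <div type="frame">
            <persName>Pampinea</persName>
            <floatingText><persName>Bergamino</persName></floatingText>
        </div>
        <div type="novella"><persName>Marchese</persName></div>
    </body>
for $name in $doc//persName
let $edge :=
    (: Check the innermost container first. :)
    if ($name/ancestor::floatingText) then "floatingFrame"
    else if ($name/ancestor::div[@type = "novella"]) then "novella"
    (: Everything else falls through to the frame. :)
    else "frame"
return concat(normalize-space($name), "&#x9;", $edge)
```

This should pair Pampinea with "frame", Bergamino with "floatingFrame", and Marchese with "novella".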
When you are retrieving good output, you need to pack this up into a TSV file that we can import into Cytoscape. This is a little tricky with outputting a plain text file, because every line of returned text seems like a separate thing to XQuery, and eXist will throw an error when you try to save your output as a single file. To bind all the lines together so it can be read as one united piece of text, you need to position a
string-join() around the whole FLWOR and return, so that the
concat() function in the return is actually the first argument in the
string-join(), whose second argument is a line-feed (or hard-return) character. And one more thing! To make sure that the output is understood to be plain text and not the default XML format that eXist expects to be producing, just add a
"text/plain" assertion to the end of the
xmldb:store() function. Here's a sort of abstract view of how that should look, with a little summary of what we have discussed so far. We decided to output lines of text containing four values: a source node, an edge, an edge attribute, and a target node.
xquery version "3.0";
declare default element namespace "http://www.tei-c.org/ns/1.0";
declare variable $ThisFileContent := string-join(
    let $engdecameron := doc('/db/decameron/engDecameronTEI.xml')/*
    let $engpeople := [stuff]
    let $engdistinctPs := [stuff]
    for $edp in $engdistinctPs
    let $peers :=
        if (condition 1--the floating frames)
        then distinct-values(XPath-to-list-of-peers-in-floatingText)
        else if (condition 2--the novellas)
        then distinct-values(XPath-to-list-of-peers-in-the-novella)
        else distinct-values(XPath-to-all-the-other-peers-not-covered-in-the-other-conditions)
    let $edgeType :=
        if (condition 1--the floating frames) then "floatingFrame"
        else if (condition 2--the novellas) then "novella"
        else "frame"
    let $edgeWeight :=
        if (condition 1--the floating frames)
        then count(XPath-to-list-of-peers-in-floatingText)
        else if (condition 2--the novellas)
        then count(XPath-to-list-of-peers-in-the-novella)
        else count(XPath-to-all-the-other-peers-not-covered-in-the-other-conditions)
    for $peer in $peers
    return concat($edp (:source node:), "&#x9;" (:tab character:), $edgeType (:shared interaction or edge:), "&#x9;", $edgeWeight, "&#x9;", $peer),
    "&#10;" (:line-feed separator between lines:));
let $filename := "MyNetworkData.tsv"
let $doc-db-uri := xmldb:store("/db/myOutput", $filename, $ThisFileContent, "text/plain")
return $doc-db-uri
(: output at: http://dxcvm05.psc.edu:8080/exist/rest/db/myOutput/MyNetworkData.tsv :)
View your data in the browser and you should be able to download it from there and save it locally (when prompted, save as
all files instead of
plain text, so your computer preserves the
.tsv at the end and doesn’t add
.txt to the file extension.) Or navigate your way to it in your output directory in eXist, and use the File menu there to download it.
To make sure that your data is good and readable, we conclude this assignment by having you import your TSV file into Cytoscape. Follow the instructions for import in the Cytoscape Tutorial. If Cytoscape gives you a preliminary plot and a network table, you have successfully prepared a good TSV file to work with! If not, you may need to repair something in your XQuery. In the next assignment, we will work on processing your data in Cytoscape to calculate its network statistics and prepare meaningful and legible network visualizations.
Upload your XQuery script (in a text file), and your output TSV file to the Courseweb upload point for this assignment.