Goals:
To learn about the major data formats and practice writing scripts that generate (or reformat) data into a specific format.
Software & Technologies:
python
(simple scripting, regular expressions, batch processing)- basic data formats: csv/tsv (comma/tab-separated values), json, yml, etc.
Class:
- explanation of major formats and their importance
- hands-on: converting data into structured formats
The Essence
The ease of editing, suitability for analytical software, human-friendliness and readability, open vs. proprietary.
XML (Extensible Markup Language)
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
CSV/TSV (Comma-Separated Values/ Tab-Separated Values)
to,from,heading,body
Tove,Jani,Reminder,Don't forget me this weekend!
JSON (JavaScript Object Notation)
{
"to": "Tove",
"from": "Jani",
"heading": "Reminder",
"body": "Don't forget me this weekend!"
}
YML or YAML (Yet Another Markup Language > YAML Ain’t Markup Language)
to: Tove
from: Jani
heading: Reminder
body: Don't forget me this weekend
Larger Examples
NB data example from here.
There are some online converters that can help you to convert one format into another. For example: http://www.convertcsv.com/.
CSV
/ TSV
city,growth_from_2000_to_2013,latitude,longitude,population,rank,state
New York,4.8%,40.7127837,-74.0059413,8405837,1,New York
Los Angeles,4.8%,34.0522342,-118.2436849,3884307,2,California
Chicago,-6.1%,41.8781136,-87.6297982,2718782,3,Illinois
Houston,11.0%,29.7604267,-95.3698028,2195914,4,Texas
Philadelphia,2.6%,39.9525839,-75.1652215,1553165,5,Pennsylvania
TSV
is a better option than a CSV
, since TAB
characters are very unlikely to appear in values.
Neither TSV
not CSV
are good for preserving new line characters (\n
)—or, in other words, text split into multiple lines. As a workaround, one can convert \n
into some unlikely-to-occur character combination (for example, ;;;
), which would allow to restore \n
later , if necessary.
JSON
[
{
"city": "New York",
"growth_from_2000_to_2013": "4.8%",
"latitude": 40.7127837,
"longitude": -74.0059413,
"population": "8405837",
"rank": "1",
"state": "New York"
},
{
"city": "Los Angeles",
"growth_from_2000_to_2013": "4.8%",
"latitude": 34.0522342,
"longitude": -118.2436849,
"population": "3884307",
"rank": "2",
"state": "California"
},
{
"city": "Chicago",
"growth_from_2000_to_2013": "-6.1%",
"latitude": 41.8781136,
"longitude": -87.6297982,
"population": "2718782",
"rank": "3",
"state": "Illinois"
},
{
"city": "Houston",
"growth_from_2000_to_2013": "11.0%",
"latitude": 29.7604267,
"longitude": -95.3698028,
"population": "2195914",
"rank": "4",
"state": "Texas"
},
{
"city": "Philadelphia",
"growth_from_2000_to_2013": "2.6%",
"latitude": 39.9525839,
"longitude": -75.1652215,
"population": "1553165",
"rank": "5",
"state": "Pennsylvania"
}
]
YML
/YAML
YAML is often used only for a single set of parameters.
city: New York
growth_from_2000_to_2013: 4.8%
latitude: 40.7127837
longitude: -74.0059413
population: 8405837
rank: 1
state: New York
But it can also be used for storage of serialized data. It has advantages of both JSON and CSV: the overall simplicity of the format (no tricky syntax) is similar to that of CSV/TSV, but it is more readable than CSV/TSV in any text editor, and is more difficult to break—again, due to the simplicity of the format.
New York:
growth_from_2000_to_2013: 4.8%
latitude: 40.7127837
longitude: -74.0059413
population: 8405837
rank: 1
state: New York
Los Angeles:
growth_from_2000_to_2013: 4.8%
latitude: 34.0522342
longitude: -118.2436849
population: 3884307
rank: 2
state: California
Chicago:
growth_from_2000_to_2013: -6.1%
latitude: 41.8781136
longitude: -87.6297982
population: 2718782
rank: 3
state: Illinois
Houston:
growth_from_2000_to_2013: 11.0%
latitude: 29.7604267
longitude: -95.3698028
population: 2195914
rank: 4
state: Texas
Philadelphia:
growth_from_2000_to_2013: 2.6%
latitude: 39.9525839
longitude: -75.1652215
population: 1553165
rank: 5
state: Pennsylvania
YAML files can be read with Python into dictionaries
like so:
import yaml
dictionary = yaml.load(open(pathToFile))
You will most likely need to install yaml
library; it is also quite easy to write a script that would read such serialized data.
Reference Materials:
- https://beginnersbook.com/2015/04/json-tutorial/
- More broadly: Baker, James. 2014. “Preserving Your Research Data.” Programming Historian, April. https://programminghistorian.org/lessons/preserving-your-research-data.
Homework:
- reformatting the “Dispatch”: generate one TSV-file with the entire content of the “Dispatch”, where each article (or, more broadly—an entry) is a single record; each record should include:
- date;
- type of an entry (there are articles, advertisements, notices, etc);
- header;
- the text of an entry.
- publish the description of this process (with excerpts of code) on your website.
- Codecademy’s Learn Python, Unit 11-12.
- Github: publish the confirmation screenshot as a post on your new site.
Homework solution
import re, os
source = "path_where_initial_files_are"
target = "path_to_save_new_files"
lof = os.listdir(source)
counter = 0 # general counter to keep track of the progress
ourCSV = []
for f in lof:
if f.startswith("dltext"): # fileName test
with open(source + f, "r", encoding="utf8") as f1:
text = f1.read()
# try to find the date
date = re.search(r'<date value="([\d-]+)"', text).group(1)
# splitting the issue into articles/items
split = re.split("<div3 ", text)
c = 0 # item counter
for s in split[1:]:
c += 1
s = "<div3 " + s # a step to restore the integrity of items
#input(s)
# try to find a unitType
try:
unitType = re.search(r'type="([^\"]+)"', s).group(1)
except:
unitType = "noType"
print(s)
# try to find a header
try:
header = re.search(r'<head>(.*)</head>', s).group(1)
header = re.sub("<[^<]+>", "", header)
except:
header = "NO HEADER"
#print("No header found!")
text = re.sub("<[^<]+>", "", s)
text = re.sub(" +\n|\n +", "\n", text)
text = re.sub("\n+", ";;; ", text)
# generating necessary bits
fName = date+"_"+unitType+"_"+str(c)
itemID = date+"_"+unitType+"_"+str(c)
dateVar = date
#unitType = unitType
#header = header
text = text.replace("\t", " ")
# creating a text variable
var = "\t".join([itemID,dateVar,unitType,header,text])
#input(var)
ourCSV.append(var)
# count processed issues and print progress counter at every 100
counter += 1
if counter % 100 == 0:
print(counter)
# saving
with open("dispatch_as_TSV.csv", "w", encoding="utf8") as f9:
f9.write("\n".join(ourCSV))
print(counter)