Two-Week Project
Summary
In this project, you and a partner will work together to create an interactive graphical software application that pulls live data from Wikipedia. You will put into practice all of the techniques we have studied in the first three weeks of the semester while learning some new ones along the way. This will help you better understand course concepts and prepare you for the final project.
Requirements
Your client for this project is Orwellian News Service (ONS), who are currently involved in some investigative journalism around global politics. Staff are trying to track the behavior of editors on Wikipedia, specifically on articles about government leaders.
Functional Requirements
The functional requirements for the project are given by the following user story statements and conditions of satisfaction:
- As an investigative journalist, I want to see who most recently changed a Wikipedia article.
- I can provide the name of a Wikipedia page and the system responds with the thirty most recent changes to that page—newest first—showing:
- the username of the person who made the change
- the time of the change, localized to my timezone
- Anonymous users are grouped by IP identification, so that anonymous edits from the same IP are assumed to be the same user.
- If there is no Wikipedia page for the name I gave, the system tells me so.
- If I was redirected as part of my search (such as from “Obama” to “Barack Obama”), the system tells me so.
- If a network connection to Wikipedia cannot be established, the system tells me so and advises me to check my network connection and try again later.
- As an investigative journalist, I want to know who has been most actively editing a Wikipedia page.
- I can sort the list of the most recent thirty changes to a Wikipedia page by editor: the editor with the most changes is shown first, and the rest follow in descending order of edit count, with timestamps breaking ties.
Nonfunctional Requirements
- Each solution will be completed by a pair of CS222 students from the same section. There may be exactly one group of three per section to deal with odd numbers of registrants.
- Your solution will have its own private repository within our GitHub organization. Name the repository in the format “twp-user1-user2”, where user1 and user2 are the BSU usernames of the team members in alphabetical order.
- Each team will use Pair Programming and follow the rules of Clean Code. This includes using model-view separation, with the domain model developed using Test-Driven Development.
- The software must be written in Java 1.8 as a desktop JavaFX application using IntelliJ IDEA Community Edition.
- Parse structured data using an appropriate and robust technology: don't rely on simple String searches to extract what data you need.
- Follow the GitHub convention and include a README.md file at the root of your project. This file should include a summary of the project and the names of the authors.
- Your released code must be on the master branch.
- Be considerate of the Wikipedia server. Only generate requests when the user asks for answers: do not write a program that periodically polls Wikipedia. Identify your client as per the MediaWiki guide.
- Be considerate of the user. Do not generate files on their hard drive. Present a reasonably well-designed user experience. Consider incorporating visual information display if you have time; JavaFX has a built-in Charts library.
- Tag your repository as “0.1” and submit your repository URL to the Two-Week Project assignment by the start of class on Tuesday, September 26.
- Prepare to demonstrate your solution in class on Tuesday, September 26.
Getting Data from Wikipedia
The technology behind Wikipedia is MediaWiki, and it has an extensive API for developers. Wikipedia has an API Sandbox that you can use to learn the API by playing with live data. Doing so helped me come up with the following HTTP request, which pulls down information about the four most recent changes to the Soup page as a JSON document:
https://en.wikipedia.org/w/api.php?action=query&format=json&prop=revisions&titles=Soup&rvprop=timestamp|user&rvlimit=4&redirects
For example, when I requested the data recently, it returned the following JSON data:
{"continue":{"rvcontinue":"20170817064314|795902840","continue":"||"},"query":{"pages":{"19651298":{"pageid":19651298,"ns":0,"title":"Soup","revisions":[{"user":"FrescoBot","timestamp":"2017-09-04T17:33:49Z"},{"user":"Darylgolden","timestamp":"2017-08-31T14:34:15Z"},{"user":"50.76.156.229","anon":"","timestamp":"2017-08-31T14:33:12Z"},{"user":"CommonsDelinker","timestamp":"2017-08-22T12:38:06Z"},{"user":"2601:541:4304:E6B0:218:8BFF:FE74:FE4F","anon":"","timestamp":"2017-08-19T16:04:56Z"}]}}}}
Fortunately, we can pretty-print that to make it more readable:
```json
{
  "continue": {
    "rvcontinue": "20170817064314|795902840",
    "continue": "||"
  },
  "query": {
    "pages": {
      "19651298": {
        "pageid": 19651298,
        "ns": 0,
        "title": "Soup",
        "revisions": [
          { "user": "FrescoBot", "timestamp": "2017-09-04T17:33:49Z" },
          { "user": "Darylgolden", "timestamp": "2017-08-31T14:34:15Z" },
          { "user": "50.76.156.229", "anon": "", "timestamp": "2017-08-31T14:33:12Z" },
          { "user": "CommonsDelinker", "timestamp": "2017-08-22T12:38:06Z" },
          { "user": "2601:541:4304:E6B0:218:8BFF:FE74:FE4F", "anon": "", "timestamp": "2017-08-19T16:04:56Z" }
        ]
      }
    }
  }
}
```
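To pull the usernames and timestamps out of a document shaped like this, a JSON library such as GSON can walk the tree directly. Here is a minimal sketch, assuming GSON is on your Maven classpath; the class name RevisionWalk is mine, and the JSON is a trimmed-down copy of the response above. Note that the page ID key varies by article, so the code takes the first entry of "pages" generically rather than hard-coding "19651298".

```java
import com.google.gson.JsonArray;
import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import java.util.Map;

public class RevisionWalk {
    public static void main(String[] args) {
        // A trimmed-down copy of the response shown above.
        String json = "{\"query\":{\"pages\":{\"19651298\":{\"pageid\":19651298,\"title\":\"Soup\","
                + "\"revisions\":[{\"user\":\"FrescoBot\",\"timestamp\":\"2017-09-04T17:33:49Z\"},"
                + "{\"user\":\"Darylgolden\",\"timestamp\":\"2017-08-31T14:34:15Z\"}]}}}}";
        JsonObject root = new JsonParser().parse(json).getAsJsonObject();
        JsonObject pages = root.getAsJsonObject("query").getAsJsonObject("pages");
        // The page ID key ("19651298" here) varies per article, so take the first entry.
        Map.Entry<String, JsonElement> page = pages.entrySet().iterator().next();
        JsonArray revisions = page.getValue().getAsJsonObject().getAsJsonArray("revisions");
        for (int i = 0; i < revisions.size(); i++) {
            JsonObject revision = revisions.get(i).getAsJsonObject();
            System.out.println(revision.get("user").getAsString()
                    + " @ " + revision.get("timestamp").getAsString());
        }
    }
}
```

In your actual solution this traversal belongs inside your parser class, fed by the input stream rather than a hard-coded String.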
You might find yourself searching for a page whose title has a space in it, but URLs are not allowed to contain spaces. There are many syntactic rules like this for URLs. Rather than try to remember all the rules, use the URLEncoder class to manage them.
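For instance, a title like “Barack Obama” can be made safe for a query string like this; the encodeTitle helper is just for illustration:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class EncodeDemo {
    // Turn a page title into a query-string-safe value.
    public static String encodeTitle(String title) {
        try {
            return URLEncoder.encode(title, "UTF-8");  // spaces become '+', etc.
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e);  // UTF-8 is always supported, so this cannot happen
        }
    }

    public static void main(String[] args) {
        System.out.println(encodeTitle("Barack Obama"));  // prints "Barack+Obama"
    }
}
```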
Getting a stream of live data is straightforward using Java's URLConnection class, as documented in this trail from The Java Tutorial. However, note that the terms of use for Wikipedia require that we identify our clients in the HTTP request, so our code will look more like this:
```java
URL url = new URL("https://en.wikipedia.org");
URLConnection connection = url.openConnection();
connection.setRequestProperty("User-Agent",
        "Revision Tracker/0.1 (http://www.cs.bsu.edu/~pvg/courses/cs222Fa17; me@bsu.edu)");
InputStream in = connection.getInputStream();
```
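Once you have the stream, you can hand it straight to your parser; while debugging, though, it can help to drain it into a String first so you can see the raw JSON. A small sketch using only the JDK, shown against an in-memory stream rather than a live connection (the readAll helper name is mine):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

public class StreamDrainDemo {
    // Drain an InputStream into a String. Line breaks are collapsed, which
    // is fine for the single-line JSON that the Wikipedia API returns.
    public static String readAll(InputStream in) throws IOException {
        StringBuilder result = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                result.append(line);
            }
        }
        return result.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for connection.getInputStream() from the snippet above.
        InputStream fake = new ByteArrayInputStream("{\"query\":{}}".getBytes("UTF-8"));
        System.out.println(readAll(fake));  // prints {"query":{}}
    }
}
```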
Video Tutorials
I have developed a series of videos designed to help with some of the challenges that come up in this project. They are all in the CS222 playlist. This video shows how you can use the GSON library to parse through JSON data.
Some students have reported that the Alt-Enter trick from around 6:09 won't automatically add the JUnit dependency to Maven for them. That's not an insurmountable problem: when it works, Alt-Enter is just a shortcut to something you can do by hand. In this case, just drop the dependency directly into your pom.xml file. You can see what to add in the video, or you can always search the Web for “junit pom” to end up on mvnrepository.com, where you can also see the XML configuration needed.
The following video gives an example of how to migrate logic from learning tests to production code. The specific example deals with exceptions and XML, but the more important concept shown here is how to create a domain model from learning tests. A well-formed solution to the two-week project will have a similar structure, with a Parser object that generates Revision objects from JSON streams.
It is important that any non-trivial computation be handled off of the main event thread. The thread that processes the tap of a button, for example, needs to be free to handle other UI events. The following video gives an example of how to delegate computation to a new thread from within JavaFX. This is the kind of approach you will need to do your analysis, since you want the UI to remain responsive while the data is processed.
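The delegation pattern itself can be sketched with plain Java's ExecutorService, as below. In your actual solution you would use JavaFX's Task, wiring a setOnSucceeded handler to update the UI when the work finishes, rather than blocking on get() as this demo does; the shape of the idea is the same.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BackgroundWorkDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService worker = Executors.newSingleThreadExecutor();

        // Submit the slow work (network request plus parsing) off the caller's thread.
        Future<String> result = worker.submit(() -> {
            Thread.sleep(100);  // stand-in for fetching and parsing the revisions
            return "parsed revisions";
        });

        // The calling thread is immediately free again; in JavaFX, this is
        // where the event thread keeps handling button clicks and repaints.
        System.out.println("UI thread still responsive");

        // Blocking get() is fine for a console demo, but in JavaFX you would
        // use a Task's onSucceeded callback instead of blocking the UI thread.
        System.out.println(result.get());  // prints "parsed revisions"
        worker.shutdown();
    }
}
```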
You have some creative freedom in this project as to how you want to develop your JavaFX user interface. I have a pair of videos that walk through the fundamentals: one using SceneBuilder and one without.
Recommended Approach
As a preliminary step, you will need to configure your work environment. Create a new Java project within IDEA, configuring it via Maven to load the GSON library, as shown in the video. Grab sample JSON data from Wikipedia and save that into a file in your test/resources directory, so you can use it in your unit tests. Commit this configuration (a conventional first commit message being “Initial Commit”), create your team's repository on GitHub, and push your code there. Now, you can make sure your partner can clone the project from GitHub.
Once your basic configuration is done, there are many ways to move forward. TDD is mandated by the nonfunctional requirements, so a critical step is transforming the functional requirements into SMART tasks in your task list. Below you will find a recommended starter set of tasks to help you model this transformation process. Note that these aren't exactly SMART since I don't know what sample data you have on hand; you should modify your list to be specific.
- Write a learning test to ensure you can read the name of the first revision author from your test data folder.
- Make a class RevisionParser, and test that its parse method, given the test input stream, returns the name of the first revision author.
- Refactor RevisionParser so that the parser returns a Revision object instead. This improves encapsulation rather than relying on primitive types and String.
- Redesign RevisionParser's parse method to return a List<Revision> of all the revisions in the stream.
- At this point, I would do a vertical integration: create a simple JavaFX UI with a text field that accepts a search term and a button. When you click the button, connect to the URL, create an input stream from it, pass the stream to RevisionParser, and dump the results into an output text area.
- Augment the UI so that the network request and parsing happen on a background thread, and the rest of the UI is disabled while this operation happens. For example, you could use a Task.
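Once you have a List of revisions, the second user story's sorting requirement comes down to grouping and counting. This toy sketch counts edits per username using only the collections library; your real solution would work with Revision objects and use timestamps to break ties, as the requirements state. Because the input list is newest-first and both LinkedHashMap and List.sort preserve encounter order for equal counts, ties here happen to fall in recency order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EditorRankingDemo {
    // Given usernames ordered newest-edit-first, return editors ordered
    // by number of edits, most active first.
    public static List<String> mostActive(List<String> authors) {
        Map<String, Integer> counts = new LinkedHashMap<>();  // keeps first-seen order
        for (String author : authors) {
            counts.merge(author, 1, Integer::sum);  // count edits per editor
        }
        List<String> ranked = new ArrayList<>(counts.keySet());
        // Stable sort: equal counts keep their first-seen (most recent) order.
        ranked.sort((a, b) -> counts.get(b) - counts.get(a));
        return ranked;
    }

    public static void main(String[] args) {
        List<String> authors = Arrays.asList("FrescoBot", "Darylgolden", "FrescoBot");
        System.out.println(mostActive(authors));  // prints [FrescoBot, Darylgolden]
    }
}
```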
Technical note about git, IDEA, and Maven
I frequently use IDEA's “New→Project from Version Control” feature to clone projects from GitHub directly into IDEA. This does not work as expected for Maven projects. My advice is to clone the project from the command line to some convenient place on your hard drive, and then use IDEA's “New→Project from Existing Sources” option to import the Maven project. That is, from the console or Git Bash or equivalent, you can use the command git clone URL, where URL is your project URL on GitHub. You should be able to click through the default values during this import process. You only have to do this once, and then you can treat it like any other project with respect to pulling, committing, pushing, etc.
Formal Evaluation
A submission must compile without errors or warnings and meet all functional requirements; only IDEA's package-private inspection may be disabled summarily. Provided the above conditions are satisfied, a submission will be graded out of six points: three points for Clean Code structural rules and three points for Clean Code procedural rules. It is recommended to use the Clean Checklist in preparing your commits and release.