If you want to teach anybody the typical tasks in the life of a system administrator, there is no way around painstaking log processing. No matter how many log aggregation and correlation solutions you have, being able to process large text files and extract specific information from them is crucial. But as is so often the case, only so-called "real-world experience" teaches this skill.
Assigning students the task of writing a tool that processes log files and extracts certain information is futile without sample data for them to test their work on. The more limited your sample data is, the more limited the result will be -- inevitably, students write solutions for the data at hand, not accounting for the possibility of data outside the given parameters.
The only way to solve this dilemma is by providing actual production logs, containing all the warts and edge cases: certain lines not matching the expected format; unexpected characters in unexpected places; invalid timestamps or date formats; surprising variation or equally surprising duplication of the same data. The list goes on. Only real data provides a realistic data set to test your tools on.
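To make the point concrete, here is a minimal sketch of the kind of defensive parsing real logs force on you. It assumes the Apache common log format (the field layout and the sample lines are illustrative, not taken from any real data set): lines that do not match the expected format are collected for inspection rather than silently dropped or allowed to crash the script.

```python
import re

# Regex for the Apache common log format:
#   host ident user [timestamp] "request" status size
# Any line that does not match is treated as malformed, not an error.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse(lines):
    """Split input into parsed records and unparseable lines."""
    records, malformed = [], []
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            records.append(m.groupdict())
        else:
            malformed.append(line)
    return records, malformed

# Illustrative sample: one well-formed line, one "wart".
sample = [
    '192.0.2.1 - - [10/Oct/2024:13:55:36 +0000] "GET / HTTP/1.1" 200 2326',
    'this line is garbage and matches nothing',
]
records, malformed = parse(sample)
```

A solution developed only against clean sample data would never contain the `malformed` branch -- and that branch is precisely what real-world data teaches.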
Getting access to this data is, however, near impossible: as a system administrator, you would not want to just share your log files. They do, after all, contain a lot of information about your system, about your users, about your business model, about everything. Before you could share them with outsiders, you'd have to anonymize them: remove, censor, or modify information. (For Apache access logs, certain solutions do exist and may help.)
The problem, of course, is that changing the data makes it less useful. As a teaching tool, you really want data that is as close to actual data as possible. Yes, in some cases certain components can be removed or anonymized, but data processed in this manner becomes useful only for one specific exercise. For example, if I ask students to extract IP addresses and identify the top ten countries from which requests originated, having anonymizers translate IP addresses to, say, RFC1918 addresses ruins the exercise. Stripping or truncating referrer data in an Apache access log can similarly ruin a useful exercise that requires handling unexpectedly large requests.
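The IP-extraction half of that exercise might be sketched as below. The sample lines and the helper name are made up for illustration; a real solution would then map each address to a country via an external GeoIP database (e.g. MaxMind's), which is exactly the step that anonymized RFC1918 addresses would break, so it is omitted here.

```python
import re
from collections import Counter

# Pull the client address from the first field of each access-log line.
IP_RE = re.compile(r'^(\d{1,3}(?:\.\d{1,3}){3})\s')

def top_clients(lines, n=10):
    """Tally the most frequent client IP addresses."""
    counts = Counter()
    for line in lines:
        m = IP_RE.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts.most_common(n)

# Illustrative sample data (documentation-range addresses).
sample = [
    '198.51.100.7 - - [10/Oct/2024:13:55:36 +0000] "GET / HTTP/1.1" 200 512',
    '198.51.100.7 - - [10/Oct/2024:13:55:37 +0000] "GET /a HTTP/1.1" 200 128',
    '203.0.113.9 - - [10/Oct/2024:13:55:38 +0000] "GET /b HTTP/1.1" 404 0',
]
```

Run against a GeoIP lookup, the per-country aggregation is a one-line extension of the same `Counter` pattern -- provided the addresses still mean something.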
It seems possible to gain access to certain datasets for the purposes of academic research. For example, the Wikimedia Foundation may make certain data available, but the effort and barriers to meeting the requirements seem significant. What's more, people outside of academic research projects may have a hard time convincing other organizations to provide this kind of data. (To be fair, the page view statistics data set looks quite useful for some exercises.)
I have no solution to this dilemma, but I think that as educators we need to come up with some possible solutions. For example, I'm perfectly happy to share the access logs of this website, but that is not a significant data set. Suppose other people were willing to share theirs: combined, the data would be significantly less site-specific and all the more valuable for educational purposes.