I'm reading in data about an HTTP access log. I've got a file with columns for the ip address, year, month, day, hour and requested URL. I read the file in like this:
ipdata = scan(file="sample_r.log", what=list(ip="", year=0, month=0, day=0, hour=0, verb="", url=""))
This seems to work. RStudio says that ipdata is a list[7], and names(ipdata) returns
[1] "ip" "year" "month" "day" "hour" "verb" "url"
So that seems cool. I wanted to do something fun, like graph some data for a specific hour. I tried doing a subset:
s <- subset(ipdata, ipdata$hour==3)
The result looks remarkably different from the original data. s is a list[297275], and the following doesn't work right:
> table(ipdata$verb)
GET POST
2870709 1596748
> table(s$verb)
character(0)
Am I going about this the correct way? What I typically do is wrap my data frame in a table() and then barplot or dotplot it. Is R a good way to do this? I want to say "Show me all of the top URLs in hour 3", for example. Or "How many times did this IP address show up per hour?"
Update: It looks like by using read.table instead of scan I was able to get a data frame. Apparently scan returns a list of lists or something? Definitely confusing to a n00b like myself, but I'm feeling good about it now.
If you ran
dat <- as.data.frame(ipdata)
str(dat)
... you would probably see that it was pretty much the same as the results of your read.table() operation. read.table is a wrapper for scan and does a lot of formatting and consistency checking.
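To make the difference concrete, here is a minimal sketch. The three log lines are made-up stand-ins for sample_r.log, not the real data. With a list for `what`, scan() returns a plain list of equal-length vectors, so subset() falls back to its default method and indexes the list elements (the columns) instead of the rows. After as.data.frame(), subset() works row-wise as you expected.

```r
# Hypothetical three-line stand-in for sample_r.log (not the real data)
log_lines <- c(
  "10.0.0.1 2012 4 5 3 GET /index.html",
  "10.0.0.2 2012 4 5 3 POST /login",
  "10.0.0.1 2012 4 5 4 GET /about.html"
)
fields <- list(ip = "", year = 0, month = 0, day = 0, hour = 0,
               verb = "", url = "")

# scan() with a list for `what` returns a plain list, one vector per field
ipdata <- scan(text = log_lines, what = fields, quiet = TRUE)
class(ipdata)   # "list"

# Converting to a data frame makes subset() filter rows
dat <- as.data.frame(ipdata, stringsAsFactors = FALSE)
s <- subset(dat, hour == 3)
table(s$verb)
```

The same conversion should apply to the full file, as long as every field was read to the same length.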
read.table assigns "data.frame" as the class of its returned object. It does a lot of checking of names and lengths and classes before it assigns the class. Just type read.table at your console. In addition to seeing the amount of consistency enforcement, you will get an appreciation for why it is sometimes slow. - 42- 2012-04-05 19:22
read.table essentially does an "as.data.frame" when it's done - Dave 2012-04-05 19:15
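Once the data is a data frame, the two questions from the original post could be answered along these lines. This is only a sketch: the rows below are invented sample data, and the column names are assumed to match the ones used with scan above.

```r
# Hypothetical sample rows standing in for the data frame from read.table()
dat <- data.frame(
  ip   = c("10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.1"),
  hour = c(3, 3, 3, 4),
  url  = c("/index.html", "/index.html", "/login", "/index.html"),
  stringsAsFactors = FALSE
)

# "Show me all of the top URLs in hour 3"
hour3 <- subset(dat, hour == 3)
top_urls <- sort(table(hour3$url), decreasing = TRUE)
head(top_urls, 10)

# "How many times did this IP address show up per hour?"
ip_by_hour <- table(dat$ip, dat$hour)
ip_by_hour["10.0.0.1", ]

# barplot(head(top_urls, 10)) would chart the hour-3 counts
```

The table() results can be passed straight to barplot or dotchart, which matches the workflow described in the question.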