Module wsdk_record

Primitives to handle WARC records.

Copyright © 2010-2012 ALEPH ARCHIVES Ltd. All rights reserved.

Version: 1.0.0

Authors: Aleph Archives Ltd. [web site: http://aleph-archives.com/].

Description

Primitives to handle WARC records.

This module allows you to read, parse, check, dump, create and alter WARC records in an intuitive manner.

The underlying WARC record format is abstracted in such a way that you never have to understand its internals.

Data Types

http_eoh()

http_eoh() = eoh

http_error()

http_error() = {error, http_string()}

http_field()

http_field() = http_field_atom() | http_string()

http_field_atom()

http_field_atom() = 'Cache-Control' | 'Connection' | 'Date' | 'Pragma' | 'Transfer-Encoding' | 'Upgrade' | 'Via' | 'Accept' | 'Accept-Charset' | 'Accept-Encoding' | 'Accept-Language' | 'Authorization' | 'From' | 'Host' | 'If-Modified-Since' | 'If-Match' | 'If-None-Match' | 'If-Range' | 'If-Unmodified-Since' | 'Max-Forwards' | 'Proxy-Authorization' | 'Range' | 'Referer' | 'User-Agent' | 'Age' | 'Location' | 'Proxy-Authenticate' | 'Public' | 'Retry-After' | 'Server' | 'Vary' | 'Warning' | 'Www-Authenticate' | 'Allow' | 'Content-Base' | 'Content-Encoding' | 'Content-Language' | 'Content-Length' | 'Content-Location' | 'Content-Md5' | 'Content-Range' | 'Content-Type' | 'Etag' | 'Expires' | 'Last-Modified' | 'Accept-Ranges' | 'Set-Cookie' | 'Set-Cookie2' | 'X-Forwarded-For' | 'Cookie' | 'Keep-Alive' | 'Proxy-Connection'

http_header()

http_header() = {header, http_field(), http_version()}

http_method()

http_method() = 'OPTIONS' | 'GET' | 'HEAD' | 'POST' | 'PUT' | 'DELETE' | 'TRACE' | http_string()

http_request()

http_request() = {request, http_method(), http_uri(), http_version()}

http_response()

http_response() = {response, http_version(), pos_integer(), http_string()}

http_string()

http_string() = string() | binary()

http_uri()

http_uri() = '*' | {absoluteURI, http | https, http_string(), non_neg_integer() | undefined, http_string()} | {scheme, http_string(), http_string()} | {abs_path, http_string()} | http_string()

http_version()

http_version() = {non_neg_integer(), non_neg_integer()}

property()

property() = {write_slot_type(), value()}

proplist()

proplist() = [property()]

read_slot_type()

read_slot_type() = vsn | type | recid | date | sub_slot_type()

sub_slot_type()

sub_slot_type() = len | recid | ip | uri | mime | conc | bdig | pdig | rto | wid | wfn | wfile | prof | trunc | tyload | segnum | seglen | segorig

validation_error()

validation_error() = {error, found_forbidden_field, atom()} | {error, missing_mandatory_field, atom()} | {error, invalid_version, term()} | {error, invalid_content_type, term()} | {error, invalid_content_length, term()} | {error, invalid_date, term()} | {error, invalid_uri, term()} | {error, invalid_profile, term()} | {error, invalid_ip_address, term()} | {error, invalid_segment_number, term()} | {error, invalid_segment_total_length, term()} | {error, invalid_segment_origin_id, term()} | {error, invalid_record_id, term()} | {error, invalid_mime_type, term()} | {error, invalid_concurrent_to, term()} | {error, invalid_block_digest, term()} | {error, invalid_payload_digest, term()} | {error, invalid_refers_to, term()} | {error, invalid_info_id, term()} | {error, invalid_filename, term()} | {error, invalid_truncated, term()} | {error, invalid_identified_payload_type, term()} | {error, record_semantically_invalid, term()}

value()

value() = calendar:datetime1970() | non_neg_integer() | binary() | inet:ip_address() | file:name() | function() | file | bytes | stream

write_slot_type()

write_slot_type() = data | source | read_slot_type()

Function Index

clone_hdr/1    Clones a WARC record's header block (no matter its type: read/write).
get/2          Getter to access the WARC record's internal state (i.e. its fields).
http_decode/2  Parses the HTTP payload of a WARC 'request' or 'response' record as a stream of data (extremely fast).
is_valid/1     Checks whether the WARC record is valid (syntactically and semantically) and compliant with the WARC v1.0 (ISO 28500:2009) specification.
new/0          Returns an empty WARC record for writing.
new/1          Returns a new WARC record for writing, filled with data from the proplist PropList.
payload/1      Retrieves the WARC record's payload chunk by chunk.
payload/2      Efficiently retrieves the WARC record's payload and dumps its content to the file Filename on disk.
set/3          Setter to update the WARC record's internal state (i.e. its fields).
unset/2        Resets the WARC record field to its default value.

Function Details

clone_hdr/1

clone_hdr(Record::#wsdk_rrec{} | #wsdk_wrec{}) -> #wsdk_wrec{}

Clones a WARC record's header block (no matter its type: read/write). This call creates a carbon copy of the original record, for writing purposes.

Note

The copy does not duplicate the payload. See wsdk_record:payload/1 and wsdk_record:payload/2.
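As an illustration of the cloning step described above, the following minimal sketch copies the header of a record and changes one field on the copy. The module name wsdk_clone_example and the function retarget/2 are hypothetical, and passing a binary for the uri slot is an assumption based on the value() type; wsdk_record must be on the code path.

```erlang
-module(wsdk_clone_example).
-export([retarget/2]).

%% Record may be a read or a write record; NewUri is assumed to be a binary.
%% Returns a write record carrying the cloned header with a new 'uri' field.
%% The payload is not copied (see the note on clone_hdr/1).
retarget(Record, NewUri) ->
    WriteRecord = wsdk_record:clone_hdr(Record),
    wsdk_record:set(WriteRecord, uri, NewUri).
```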

get/2

get(Record::#wsdk_rrec{} | #wsdk_wrec{}, FieldName::write_slot_type() | soff) -> value()

Getter to access the WARC record's internal state (i.e. its fields).

http_decode/2

http_decode(Selector::status_line | headers, Bin::binary()) -> {ok, http_response() | http_request() | http_header() | http_eoh() | http_error(), binary()} | more | {error, term()}

Parses the HTTP payload of a WARC 'request' or 'response' record as a stream of data (extremely fast).

Notes

- If an entire packet is contained in Bin, it is returned together with the remainder of the binary as {ok,Packet,Rest}.

- If Bin does not contain the entire packet, 'more' is returned. http_decode/2 can then be called again with more data added.

- If the packet does not conform to the HTTP protocol format, {error,Reason} is returned.
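The three outcomes listed above can be handled in a small decoding loop. This is a sketch only: the module name wsdk_http_example is hypothetical, and it assumes that the headers selector yields one header per call until eoh is returned, which the type specification suggests but does not state outright.

```erlang
-module(wsdk_http_example).
-export([decode_head/1]).

%% Decodes the start line and all headers found at the beginning of Bin.
%% Returns {ok, StartLine, Headers, Body}, 'more', or an error tuple.
decode_head(Bin) ->
    case wsdk_record:http_decode(status_line, Bin) of
        {ok, {error, _} = HttpError, _Rest} ->
            HttpError;                        % http_error(): malformed start line
        {ok, StartLine, Rest} ->
            case decode_headers(Rest, []) of
                {ok, Headers, Body} -> {ok, StartLine, Headers, Body};
                Other -> Other                % 'more' or an error tuple
            end;
        Other ->
            Other                             % 'more' or {error, Reason}
    end.

%% Assumption: each call with 'headers' consumes one header line.
decode_headers(Bin, Acc) ->
    case wsdk_record:http_decode(headers, Bin) of
        {ok, eoh, Rest} ->
            {ok, lists:reverse(Acc), Rest};   % end of headers reached
        {ok, {error, _} = HttpError, _Rest} ->
            HttpError;                        % http_error(): malformed header line
        {ok, Header, Rest} ->
            decode_headers(Rest, [Header | Acc]);
        more ->
            more;                             % not enough data yet
        {error, _Reason} = Error ->
            Error
    end.
```

When decode_head/1 returns more, this sketch keeps no intermediate state, so the caller is expected to call it again with the original binary plus the newly read data appended.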

is_valid/1

is_valid(Record::#wsdk_rrec{} | #wsdk_wrec{}) -> ok | validation_error()

Checks whether the WARC record is valid (syntactically and semantically) and compliant with the WARC v1.0 (ISO 28500:2009) specification.

This is a complex operation.

Note

This call validates the record's block only. To check the payload of records of type 'request' or 'response', use wsdk_record:http_decode/2.
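A caller would typically branch on the result of is_valid/1. The sketch below, in a hypothetical module wsdk_valid_example, relies only on the fact that every validation_error() is a three-element tuple tagged with error.

```erlang
-module(wsdk_valid_example).
-export([check/1]).

%% Returns ok for a valid record, otherwise logs the problem and
%% returns {error, Kind}, where Kind is the error atom
%% (for example missing_mandatory_field or invalid_date).
check(Record) ->
    case wsdk_record:is_valid(Record) of
        ok ->
            ok;
        {error, Kind, Detail} ->
            io:format("invalid WARC record: ~p (~p)~n", [Kind, Detail]),
            {error, Kind}
    end.
```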

new/0

new() -> #wsdk_wrec{}

Returns an empty WARC record for writing.

new/1

new(PropList::proplist()) -> #wsdk_wrec{}

Returns a new WARC record for writing, filled with data from the proplist PropList.
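For illustration, a proplist for new/1 might be built as follows. The module name wsdk_new_example is hypothetical, and the pairing of each slot with a value type (binary for uri, date-time tuple for date, address tuple for ip) is an assumption inferred from the value() type, not something this documentation states.

```erlang
-module(wsdk_new_example).
-export([build/0]).

%% Builds a write record from a proplist of {write_slot_type(), value()} pairs.
build() ->
    Props = [{uri,  <<"http://example.com/">>},   % assumed: URI as a binary
             {date, calendar:universal_time()},   % assumed: calendar date-time tuple
             {ip,   {127, 0, 0, 1}}],             % assumed: inet:ip_address()
    wsdk_record:new(Props).
```

A record built this way is unlikely to be complete; wsdk_record:is_valid/1 can be used to find out which mandatory fields are still missing.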

payload/1

payload(Record::#wsdk_rrec{}) -> {ok, binary(), #wsdk_rrec{}} | eof | incomplete

Retrieves the WARC record's payload chunk by chunk.
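Reading chunk by chunk amounts to a recursive loop that threads the updated record through each call. The sketch below (hypothetical module wsdk_payload_example) sums the payload size; treating incomplete as an error is an assumption, since its exact meaning is not described here.

```erlang
-module(wsdk_payload_example).
-export([payload_size/1]).

%% Record is a read record obtained elsewhere.
%% Returns {ok, TotalBytes} or {error, incomplete}.
payload_size(Record) ->
    payload_size(Record, 0).

payload_size(Record, Size) ->
    case wsdk_record:payload(Record) of
        {ok, Chunk, Record1} ->
            %% Always continue with the record returned by the call.
            payload_size(Record1, Size + byte_size(Chunk));
        eof ->
            {ok, Size};
        incomplete ->
            {error, incomplete}
    end.
```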

payload/2

payload(Record::#wsdk_rrec{}, Filename::file:name()) -> ok | incomplete

Efficiently retrieves the WARC record's payload and dumps its content to the file Filename on disk.

This call ensures that all parent directories exist, trying to create them if necessary. If the output file Filename already exists, it will be erased first.

Warning

This method is optimized for speed, so you cannot call wsdk_record:payload/1 after wsdk_record:payload/2 (or vice versa) on the same record.
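A minimal sketch of dumping a payload to disk is shown below. The module name wsdk_dump_example is hypothetical; filelib:file_size/1 is a standard OTP function used here only to report how many bytes were written.

```erlang
-module(wsdk_dump_example).
-export([dump/2]).

%% Writes the payload of a read record to Filename.
%% Returns {ok, BytesOnDisk} or {error, incomplete}.
dump(Record, Filename) ->
    case wsdk_record:payload(Record, Filename) of
        ok ->
            {ok, filelib:file_size(Filename)};
        incomplete ->
            {error, incomplete}
    end.
```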

set/3

set(Record::#wsdk_wrec{}, FieldName::write_slot_type(), Value::value()) -> #wsdk_wrec{}

Setter to update the WARC record's internal state (i.e. its fields).

unset/2

unset(Record::#wsdk_wrec{}, FieldName::write_slot_type()) -> #wsdk_wrec{}

Resets the WARC record field to its default value.
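Because records are immutable Erlang terms, set/3 and unset/2 return a new record that must be used in later calls. The following sketch (hypothetical module wsdk_update_example) chains both and reads a field back with get/2; the binary URI value is an assumption based on the value() type.

```erlang
-module(wsdk_update_example).
-export([update/1]).

%% Record is a write record, for example one returned by wsdk_record:new/0.
%% Returns {UpdatedRecord, Uri}.
update(Record) ->
    R1 = wsdk_record:set(Record, uri, <<"http://example.com/page">>),
    R2 = wsdk_record:unset(R1, ip),      % 'ip' goes back to its default value
    Uri = wsdk_record:get(R2, uri),      % read back the field set above
    {R2, Uri}.
```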


Generated by EDoc, Sep 5 2012, 17:38:09.